Barcodes Genomes

8/3/2019 Barcodes Genomes

1/29

This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formattedPDF and full text (HTML) versions will be made available soon.

Barcodes for genomes and applications

BMC Bioinformatics 2008, 9:546 doi:10.1186/1471-2105-9-546

Fengfeng Zhou ([email protected])Victor Olman ([email protected])

Ying Xu ([email protected])

ISSN 1471-2105

Article type Research article

Submission date 16 June 2008

Acceptance date 17 December 2008

Publication date 17 December 2008

Article URL http://www.biomedcentral.com/1471-2105/9/546

Like all articles in BMC journals, this peer-reviewed article was published immediately uponacceptance. It can be downloaded, printed and distributed freely for any purposes (see copyright

notice below).

Articles in BMC journals are listed in PubMed and archived at PubMed Central.

For information about publishing your research in BMC journals or any BioMed Central journal, go to

http://www.biomedcentral.com/info/authors/

BMC Bioinformatics

2008 Zhou et al. , licensee BioMed Central Ltd.This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
mailto:[email protected]:[email protected]:[email protected]://www.biomedcentral.com/1471-2105/9/546http://www.biomedcentral.com/info/authors/http://creativecommons.org/licenses/by/2.0http://creativecommons.org/licenses/by/2.0http://www.biomedcentral.com/info/authors/http://www.biomedcentral.com/1471-2105/9/546mailto:[email protected]:[email protected]:[email protected]


2/29

-1-

Barcodesforgenomesandapplications

FengfengZhou,VictorOlmanandYingXu*

DepartmentofBiochemistryandMolecularBiologyandInstituteofBioinformatics,

andBioEnergyScienceCenter(BESC),UniversityofGeorgia,Athens,GA30602,

USA.

Theseauthorscontributedequallytothiswork*Correspondingauthor

Emailaddresses:

FZ:[email protected]

VO:[email protected]

YX:[email protected]


3/29

-2-

Abstract

Background

Eachgenomehasastabledistributionofthecombinedfrequencyforeachk-merand

itsreversecomplementmeasuredinsequencefragmentsasshortas1000bpsacross

thewholegenome,for1


4/29

-3-

Background

The challenges being faced in sorting out short genomic fragments generated by

metagenome sequencing projects [1] pose a fundamental question: does each

genome have a unique signature imprinted on its short sequence fragments so that

fragmentsfromthesamegenomesinametagenomecanbeidentifiedaccurately?A

positiveanswertothisquestioncouldhavesignificantimplicationstomanyimportant

genomeandmetagenomeanalysisproblemssuchasidentificationofgeneticmaterial

transferredfromotherorganisms[2]orthroughvirusinvasions[3,4],separationof

short sequence fragments generated by metagenome sequencing into individual

genomes[5]andphylogeneticanalysesofgenomes[6].

Understandingthe intrinsic properties of genomesequences,either generalto allor

specifictosomeclassesofgenomes,hasbeenthefocusofmanystudiesinthepast

twodecades.EarlierworkincludesthediscoveryoftheperiodicitypropertyofDNA

sequencesacrossbothprokaryoticandeukaryoticgenomes[7]andtherealizationthat

codingsequencesfollowMarkovchainproperties[8-10].Karlinandcolleagueshave

studiedvariousgenomepropertiesbasedonanalysesofk-merfrequencydistributions,

and have observed that the di-nucleotide relative abundance, a normalized di-mer

frequency with respect to the mono-mer frequencies, is generally stable across a

genomemeasuredon50Kbase-pair(bp)fragments[11-13].Theyevensuggestedthat

such normalized di-mer frequency distributions can possibly serve as signatures of

genomes.

Inthispaper,wepresentabarcodingschemeforallsequencedgenomes,andillustrate

a number of interesting and useful properties of the barcodes, which we can take

advantagetosolvechallenginggenomeanalysisproblems.Wehighlightthepowerof

this barcoding scheme through addressing two application problems: metagenome

binningproblemandidentificationofhorizontallytransferredgenes.


5/29

-4-

Results

Barcodesandtheirproperties

We have calculatedthe barcodefor each sequenced prokaryotic genome, using the

followingprocedure.Foreachgenome,wepartitionitssequenceintoaseriesofnon-

overlappingandequal-sizedfragmentsofMbps;thenforeachk-mer(1


6/29

-5-

givingrisetotheverticalbandswithconsistentgreylevelsacrosseachbarcode;(b)

thesmallfractionofthefragmentswithclearlydifferent,abnormal,barcodes

(horizontalstripesinthebarcodes)thantherestofthegenometypicallyrepresent2-3

specialclassesofgenes(seediscussionlater);(c)multiplechromosomesofthesame

organismsgenerallyhavehighlysimilarbarcodes(Figure2(a))buttheyeachhave

theiruniquepatternsofabnormalfragments;and(d)thebarcodessimilaritiestendto

begenerallyproportionaltothegenomesphylogeneticcloseness(Figure2(b)).

Tounderstandwhyagenomicsequencehasthebarcodeproperty,wehaveexamined

randomnucleotidesequencesgeneratedusingdifferentmodels,includingMarkov

chainmodelsoforderfrom0to6.Weobservedthatbarcodesforrandomnucleotide

sequencesgeneratedusingathird-orderMarkovchainmodelaretheclosesttothe

barcodesofgenomicsequencesintermsoftheirappearances(Additionalfile1),and

higherorderMarkovchainmodelsdonotseemtoaddmuchtothisproperty.Hence

webelievethatthebarcodepropertyofprokaryoticgenomesismainlyduetothe

third-orderMarkovchainpropertyofthecodingsequencesinthegenomes,which

countfor80-90%ofatypicalprokaryoticgenome.Itisworthnotingthatbarcodesfor

codingandnon-codingsequencesofthesamegenomearegenerallydifferentthough

theyshareaweaklysimilarbackbonestructurewhileeachofthesetwoclassesof

(composite)regionsgenerallyhashighlysimilarbarcodes(Figure3).

Extensiontoothergenomes

Inadditiontoprokaryotes,wehavealsocalculatedthebarcodesfortheotherclasses

of sequenced genomes, namely eukaryotic, mitochondrial, plastid and plasmid

genomes. For eukaryotes, we studied the barcodes for four key components in

eukaryotic genomes, namely the (composite) regions of repetitive sequences,

promotersequences(the1000-bpupstreamregionfromeachtranslationstart),coding


7/29

-6-

regions and introns, respectively (Figure 3(b)-(e)). We observed that (i) different

regions in a high-level eukaryotic genome (e.g., human) have similar backbone

structuresintheirbarcodes,and(ii)thebarcodesforthefourtypesofregionshave

increasinglyhighercomplexity,goingfromrepetitivesequencestocodingregionsto

introns and promoter sequences. This is consistent with the belief that introns and

promotersequencesareprobablythemostinformationrichamongthefourbecauseof

thepossiblylargenumbersofregulatoryelementstheyencode.

Thebarcodesofthemitochondrialgenomesaregenerallyuniquecomparedtothe

barcodesoftheothergenomesastheyhaveadistinctoverallappearance(e.g.,Figure

3(h)-(i)).Theirsimilarappearancemaybetheresultofallmitochondriaoriginating

fromProteobacteria.Thebarcodesofallplasmidgenomesalsotendtohavesimilar

characteristicsamongthemselves,possiblyduetobeingundersimilarselection

pressurecausedbytheirfrequenttransferringamongcellcultures.Thebarcodesofall

theplastidgenomesarealsogenerallyuniquecomparedtothebarcodesoftheothers

(e.g.,Figure3(j)-(k)).Forexample,amajorityofthemeachconsistoftwodark

horizontalbendstowardoneendintheirbarcodesalongthegenomeaxis,whose

correspondinggenomicregionsconsistofRNAgenessuchasribosomalRNAsand

tRNAs,plusribosomalproteins.Thefuzzierappearanceoftheplastidbarcodes

indicatesthattheirk-merfrequenciesalongthegenomeaxisarenotasstableasinthe

othergenomes.Theoverallsimilarappearancesoftheplastidbarcodesmaybedueto

alloriginatingfromtheCyanobacteria.

Oneinterestingquestionisdodifferentclassesofgenomeshavetheirunique

characteristicsintheirbarcodes?Ouranswerisyes,basedontheirhighlyseparable

distributions in the feature space defined by two particular features, as shown in

Figure4,oneofwhichmeasurestheoverallfrequencyvariationforall4-mersacross


8/29

-7-

thegenomesbarcode,andtheothermeasurestheoverallsimilaritylevelamongall

theM-bpfragmentsofthegenome,eachconsideredasavectorof4-merfrequencies.

While Figure 2(b) indicates that barcodes generally preserve sequence-level

similarities, Figure 4 suggests that barcodes also capture a higher-level similarity

beyondindividualgenomesequencesimilaritiesthroughthetexturesoftheirimages,

which are the common and unique characteristics of different classes of genomes.

Thispropertyindicatesthatbarcodesarenotjustasimplevisualizationtool,instead

they havecapturedsome fairly basic informationabout genomes!Fromapplication

point of view, we believe that this feature will prove to be useful to metagenome

analyses as fragments from different classes of genomes such as eukaryotes,

prokaryotes or different organelle genomes, have different characteristics in their

barcodeimages.

Identificationofabnormalsequencefragments

Our procedure for identifying sequence fragments with abnormal barcodes in a

genome employs a clustering strategy to divide all the sequence fragments in a

genomeintotwogroups:(a)alargegroupoffragmentswiththeirbarcodesallsimilar

toeachotherand(b)therest(seeMETHODSsection).

Usingthisprocedure,wehaveidentified30,582abnormalfragments,covering

30,889genesacrossall thecompleteprokaryoticgenomes.Specifically28,460such

fragments are identified in the 542 bacterial genomes, covering 28,562 genes, and

2,122suchfragmentsareidentifiedinthe46archaealgenomes,covering2,327genes.

We found that the percentage of fragments with abnormal barcodes ranges from

9.40% to 32.32% across all thebacterialgenomes,withtheaveragebeing 12.85%.

Among the 46 sequenced archaeal genomes, the percentage of fragments with

abnormal barcodes ranges from 9.86% to 23.14%, with the average being 13.58%.

Further information can be found from Additional file 1. The detailed frequency


9/29

-8-

informationforabnormalfragmentsacrossdifferentgenomesisinAdditionalfile2

[15].

While we found that it is generally more challenging to study the abnormal

fragments in eukaryotes, we did apply the same procedure to different human

chromosomes, and found that the percentage of abnormal fragments ranges from

10.08%to31.32%,withtheaveragebeing12.10%.

Wehaveanalyzedtheabnormalfragmentsacrosstheprokaryoticgenomes,and

foundthefollowing:~30%oftheabnormalfragmentscanbeexplainedintermsof(a)

horizontalgenetransfers,(b)phageinvasionsand(c)highlyexpressedgenes,based

on PHX-PA [16, 17] and Prophinder [18],respectively.Amongthe genes that fall

intothis30%,6.99%arehorizontallytransferredgenes,4.97%bacteriophagegenes

and18.90%highlyexpressedgenes,basedontheabovetwopredictionprograms

notethatthesenumbersdonotadduptoexactly30%sincethereareoverlapsamong

them.The genesfalling into different categories are given inAdditional file 3[15].

We have carried out an enrichment analysis of such elements in regions with

abnormal versus normal barcodes. We found that the highly expressed genes are

enriched in the abnormal fragments, with the enrichment ratio >1 across all the

genomes and the average enrichment ratio being 1.90. Similar results hold for the

horizontallytransferredgenesandbacteriophagegenes.Allthedetaileddatacanbe

foundinAdditionalfile2

We noted that our estimate of the percentages of foreign fragments in

bacterial genomes (after deducting the highly expressed genes) is in general

agreement with the previous estimates though different information and techniques

areusedtoderivetheestimates[19].


10/29

-9-

We do not yet have an explanation for the remaining ~70% of abnormal

fragmentsinprokaryoticgenomes ,althoughwesuspectthattheymostlyfallintothe

samethree categoriesone reason that wecould not explain themnow ispossibly

duetothelimitedcoverageofthecurrentdatabasesforhorizontallytransferredgenes,

bacteriophagegenomesandhighlyexpressedgenes.Webelievethatbyusingmore

sophisticated computational procedures, one may be able to derive the level of

abnormalityofafragmentsbarcodeinagenome,andpossiblylinksuchinformation

towhensuchfragmentswerehorizontallytransferred[20].

Binningmetagenomesequence

Theabilitytosequenceamicrobialcommunityhasledtothesequencingofatleast

7.04 Giga bps of metagenome sequences, already 2.22 times the total complete

genome sequences accumulated in the past two decades [21]. These metagenome

sequences have opened many doors to new research possibilities, and have posed

some challenging problems.Onesuch problem is determining which fragments are

from the same organisms in a large pool ofmetagenomicfragments [22], typically

~1000 bps in lengths after the initial assembly using the Sanger sequencing

techniques.

We have applied a clustering algorithm (see METHODS section) forbinning

sequence fragments together based on their barcode similarities, and tested the

clustering strategy on three sets of simulated metagenome data created by cutting

actualbacterialgenomesintofragmentsandmixingthemtogether.Thethreetestsets

consistofall sequencefragmentsfromthreesetsofgenomes,respectively,extracted

fromtheGenBank.Thefirstsetconsistsof11genomesrandomlyselectedfromthe

samegenusbutfrom11differentspecies(thegenushasonly11sequencedspecies)

whilethelasttwosetseachconsistof30and100genomesrandomlyselectedfrom30

and 100 different bacterial genera, respectively. The genome names are given in

Additionalfile4[15].


11/29

-10-

Toassessthebinningabilityofouralgorithmasafunctionofthefragmentsize,

wehaveconsideredfragmentsizeM=1000,2000,5000and10000.Totestthelimit

ofourbinningalgorithm,wehavealsoconsideredM=500.Foreachsetofgenomes,

wepartitionedeachgenomeintofragmentsofsizeM,andthenmixedthefragments

ofthesamelengthintoonepool.Wethencalculatedthebarcodeforeachfragment,

anddida clusteringanalysis,assumingthatthenumberofgenomesineachpoolis

known (this information is derivable from the 16S rRNAs). We have carried out

binningpredictions,onedirectlyonthegeneratedfragmentsandoneonareducedset

ofgeneratedfragments,inwhichweremove10%ofthefragmentsfromeachgenome

whose barcodes are most different from the average barcode of the genome. The

consideration is that each bacterial genome has ~13% of fragments with abnormal

barcodesonaverage,whicharenotexpectedtobebinnedcorrectlywiththerestof

theirhostgenome.Thiswaywecanmoreaccuratelyassessthebinningabilityofour

algorithm.Table1givesthebinningresultsonthethreesetsofsyntheticmetagenome

data,boththeoriginalsetandthereducedset.

Fromthetable,wecanseethatthebinningaccuracy(intothecorrectgenomes)

ishighforfragmentsizeM=1000andabove,atboththespeciesandthegenuslevel.

Fromthetable,weseethatthereisadropinthebinningaccuracywhenthenumberof

the underlying genomes is increased from 30 to 100. This indicates the increased

complexityoftheproblemasafunctionofthenumberofunderlyinggenomes.

Wehavecomparedourbinningperformancewiththepublishedresultsbythe

bestavailablealgorithmPhyloPythia[5].Ourcomparisonindicatesthatouralgorithm

gives consistently more accurate and more specific binning results across different

fragmentsizes.Forexample,atthespecieslevel,ouralgorithmhasbetterthan50%

accuracyonourtestsetwhenthefragmentsizeisatleast2000bpswhilenobinning

results at the species level is given in McHardy et al. [5]. At the genus level, the

accuracy (the average of binning specificity and sensitivity, extracted from Figure


12/29

-11-

1(a)and(b)inMcHardyetal.[5])byPhyloPythiais45.5%forfragmentsize1000,

56%for2000,74%for5000and82.5%for10000(nodataisprovidedfor500),all

measuredin termsof putting fragments into the correctgenerawhileours isto the

correctgenomesandwithmoreaccuratebinningresults.Itshouldbenotedthatthe

testsetusedbyPhyloPythiaisdifferentthanours,whichmayaffecttheperformance

statisticssomewhatthoughwesuspectthatwillbeinsignificant,consideringthesizes

of the test sets. Another key difference between the two algorithms is that while

PhyloPythia is a supervised learning algorithm, which requires a training set, our

algorithmdoesnotrequireatrainingset,andhenceitismoregeneral.

Onethingworthnotingisthataprokaryoticgenome,onaverage,has~13-14%

ofabnormalsequencefragments, when the fragmentsize isM = 1,000, suggesting

that the theoretical limit for binning accuracy should be no better than 86-87%.

Similarlywe expect that thetheoretical limitsof binning accuracy for2,000, 5,000

and 10,000 fragments-based binning should, in general, be no better than 87.36%,

87.58%and88.4%,respectively.

DiscussionandConclusion

A natural question is do all nucleotide sequences have the barcode property like

genomesequenceshave?Theanswerisno,basedonthelargenumberofrandomly

generatedsequencesthatwehaveexamined.Figure1(e)showsatypicalbarcodeofa

randomsequencegeneratedusingazeroth

orderMarkovchainmodel.Wefoundthat

noneofthesogeneratednucleotidesequenceshastheverticalbandstructuresasin

genomes barcodes. More generally, barcodes for genomes and the randomly

generatednucleotide sequences have different characteristics as shown in Figure 5.

Thebarcodeanalysesinthispaperaremainlybasedondatafromprokaryotes.

Thoughwehaveappliedthesamebarcodemodeltoeukaryotesandmadeinteresting

observations,wesuspectthatthecurrentbarcodingschemeisrichenoughtocapture


13/29

-12-

all the complexity of eukaryotes. Further studies along this direction are clearly

needed.

We believe that for many genome analysis problems, particularly for

prokaryoticgenomes,thebarcodesprovideanatural,intuitive,information-richand

unified framework for studying them. Further applications of this capability to

numerousgenomeanalysisproblemscanbe envisioned,such asphylogenystudies,

particularly for genomes without obvious marker genes such as viruses, more

thorough examination of different types of genomic regions in eukaryotes, their

structures andorganization, further studies of horizontalgene transfers,assistingin

genome assembly of higher-order organisms (e.g.,populus which we are currently

workingon)andpossiblymanymore.Webelievethatwehaveonlybegunexploring

thetruepowerofthisnewcapabilityforgenomestudies.

MethodsMappingfrequenciestogreylevels

Thefrequencyofeach k-merismappedtoagreylevelasfollows.Wefirstcountthe

frequencyofeachk-meracrossallprokaryoticgenomes,andsortthefrequencylist

S[1:N(k)]intheincreasingorderofthefrequencieswithN(k)beingthenumberofk-

mers.WethenfindanintegerL,L>0,andpartitionS[1:N(k)]intoLsub-listssothat

the following function is minimized:=

=

Li

i

i SS1

)( , where iS is the sum of all

frequencies in the ith

sub-list, S is the average of S, andL is a parameter to be

determined bytheminimizationresult. ForM=1000and k=4,wefoundL=14

gives the best valuefor the above objective function. The computedpartition of S

givesamappingoffrequenciestothegreylevels.Notethatthismappingisgenome-

independent so each grey level in the barcodes has the same meaning in different

genomes.


14/29

-13-

Barcodesimilaritycalculation

We define the distance (or dissimilarity) between two barcodes based on their

simplified representations, each of which is a matrix having the same number of

columnsofthebarcodeandthenumberofgreylevels,L,usedinbarcodeimagesas

the number of rows; each element in the matrix represents the frequency of the

correspondinggreylevelacrosseachcolumninthebarcode.Fortwosuchmatrices

M1andM2withKcolumnsandLrows,wedefinetheirbarcodedistanceas

2

2

1 1

1 )),(),(( jiMjiML

i

K

j

= =

.

ClearlythisisageneralizationoftheEuclideandistancebetweentwovectorsofthe

averaged k-mer frequencies across each genome, widely used for genome

comparisonsasintheworkofKarlinandcolleagues[13,16,23,24]andmanyothers.

ThisisequivalenttothespecialcaseofourbarcodedistancewhenL=1.FigureS4in

Additionalfile1providesacomparisonbetweenthetwodistances.

Identificationofabnormalfragmentsinagenome

We have used the following procedure to identify fragments in a genome with

substantially different barcodes than the average barcode of the genome. The

procedureconsistsoftwokeysteps.First,foreachk-mer,weselectthefragmentsin

thegenomethathavethehighestorthelowestX%ofthisk-mersfrequencyamong

all fragments, with X being a parameter. Then we sort all the fragments in the

increasing order of the number of times they are selected in the first step, termed

function F(p), withp being the index of a fragment. Let 0p be the fragment index

havingthehighestsecond-orderderivativeof F(p).Weconsiderallfragmentspwith

F(p)>F( 0p )tobethenon-nativefragmentsofthegenomeastheyhaveusedthemost

numberofk-merswithfrequenciesthataresubstantiallydifferentthanthetypicalk-

mer frequencies throughout the genome. We found that the abnormal fragment


15/29

-14-

predictionisnotverysensitivetothedetailedvalueofXwithintherangefrom5to

20.SowehavechosenX=10asthedefaultvalueofourprogram.

The rationale for this procedure is that fragments with higher F(p) values

represent fragments that have more abnormal k-mer frequencies compared to the

average k-mer frequenciesin the genome, and hence are more probable tobenon-

native fragments. By examining the curveof theF(p) function, wefoundthat it is

convex with one sharp transition point 0p , indicating a transition point from the

typicalfragmentstotheabnormalfragmentsinthegenome(seeAdditionalfile1).

Hencewehaveusedthispointastheseparationpointbetweenthenormal(ornative)

fragmentsandtheabnormalfragments.

Metagenomebinningalgorithm

Our binningprocedure startswithanapplicationof the CLUMPprogram [25] toa

givenpooloffragments(notnecessarilyofthesamelengths)tobeclusteredbasedon

theirbarcodesimilarities.AuniquefeatureofCLUMPisthatitisquiteaccuratein

identifyingthecoreelementsofeachclusteraswehavepreviouslydemonstrated[25],

though a weakness of the algorithm could be that it does not always handle the

boundary elements accurately. Hence we have combined CLUMP with a K-means

basedclusteringapproachthatweimplemented.Afteridentifyingtheinitialclusters

formedbyCLUMPbasedonbarcodesimilarities,assumingthatweknowthenumber

of clusters to be identified, we pick a seed from each predicted cluster randomly

according to the density distribution of the cluster. Then we run the K-means

algorithm,usingtheselectedseeds.Foreachpooloffragments,werunthistwo-step

clustering algorithm multiple times, using a different set of seeds for each run. In

decidingthenumberofruns,ourruleofthumbbasedonourexperienceworkingon

themetagenomedataistouse500*(thenumberofclusters),.Foreachgivensetof

seeds, we run the K-means algorithm 10000 iterations. Among all the computed


16/29

-15-

clustering results for each pool, we choose the clustering result 1 2, , ..., KC C C that

minimizesthefollowingfunctionasthefinalbinningresult:

=

K

i CX i

XiX1

2)( ,

where 1 2, , ..., KC C C isapartitionofagivenpoolofmetagenomicfragmentswitheach

Cibeingasubsetofthepooland iX beingtheaverageofthebarcodesofallfragments

in , 1,..., .iC i K=

Authors'contributionsY.X.conceivedtheproject.F.Z.andV.O.analyzedthedataandperformedthe

experiments.Y.X.supervisedthisprojectasP.I.andwrotethemanuscript.

AcknowledgementsThisworkwas supportedby National Science Foundation (DBI-0354771, ITR-IIS-

0407204, DBI-0542119, CCF0621700), a U.S. Department of Energy BioEnergy

ResearchCentergrantfromtheOfficeofBiologicalandEnvironmentalResearchin

the DOE Office of Science, and a Distinguished Scholar grant from the Georgia

Cancer Coalition. We would like to thankthe two anonymous reviewers for their

helpful comments on our work. We would also like to thank Ms Joan Yantko for

preparingthemanuscript.


17/29

-16-

References

1. BackhedF,LeyRE,SonnenburgJL,PetersonDA,GordonJI:Host-

bacterialmutualisminthehumanintestine.Science2005,

307(5717):1915-1920.

2. JainR,RiveraMC,LakeJA:Horizontalgenetransferamonggenomes:

thecomplexityhypothesis.ProceedingsoftheNationalAcademyof

SciencesoftheUnitedStatesofAmerica1999,96(7):3801-3806.

3. FreyTK:Neurologicalaspectsofrubellavirusinfection.Intervirology

1997,40(2-3):167-175.

4. RybchinVN,SvarchevskyAN:TheplasmidprophageN15:alinearDNA

withcovalentlyclosedends.MolMicrobiol1999,33(5):895-903.

5. McHardyAC,MartinHG,TsirigosA,HugenholtzP,RigoutsosI:

Accuratephylogeneticclassificationofvariable-lengthDNAfragments.

NatMethods2007,4(1):63-72.

6. YangE,BinW,PengJ,ZhangX,WangJ,YangJ,DongJ,ChuY,Zhang

J,JinQ:ComparativegenomicsandphylogeneticanalysisofS.

dysenteriaesubgroup.SciChinaCLifeSci2005,48(4):406-413.

7. TrifonovEN,SussmanJL:ThepitchofchromatinDNAisreflectedinits

nucleotidesequence.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica1980,77(7):3816-3820.

8. BorodovskyM,SprizhitskiiY,GolovanovE,AleksandrovA:Statistical

patternsinprimarystructuresoffunctionalregionsintheE.coligenome.

I.Oligonucleotidefrequenciesanalysis.MolecularBiology1986,20:826-

833.



II.Non-homogeneousMarkovmodels.MolecularBiology1986,20:833-

840.



III.Computerrecognitionofcodingregions.MolecularBiology1986,

20:1145-1150.

11. KarlinS,BurgeC:Dinucleotiderelativeabundanceextremes:agenomic

signature.TrendsGenet1995,11(7):283-290.

12. KarlinS,ZhuZY,KarlinKD:Theextendedenvironmentof

mononuclearmetalcentersinproteinstructures.Proceedingsofthe


18/29

-17-

NationalAcademyofSciencesoftheUnitedStatesofAmerica1997,

94(26):14225-14230.

13. KarlinS,BrocchieriL,MrazekJ,CampbellAM,SpormannAM:A

chimericprokaryoticancestryofmitochondriaandprimitiveeukaryotes.

ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica1999,96(16):9190-9195.

14. Computed_barcodes:[http://csbl.bmb.uga.edu/~ffzhou/BoDB/].

15. Supplementary_material:[http://csbl.bmb.uga.edu/~ffzhou/BoDB/supp/].

16. MrazekJ,BhayaD,GrossmanAR,KarlinS:Highlyexpressedandalien

genesoftheSynechocystisgenome.NucleicAcidsRes2001,29(7):1590-

1601.

17. KarlinS,MrazekJ:Predictedhighlyexpressedgenesofdiverse

prokaryoticgenomes.JBacteriol2000,182(18):5238-5250.

18. Lima-MendezG,HeldenJV,ToussaintA,LeplaeR:Prophinder:a

computationaltoolforprophagepredictioninpro-karyoticgenomes.

Bioinformatics2008.

19. OchmanH,LawrenceJG,GroismanEA:Lateralgenetransferandthe

natureofbacterialinnovation.Nature2000,405(6784):299-304.

20. LawrenceJG,OchmanH:Ameliorationofbacterialgenomes:ratesof

changeandexchange.JMolEvol1997,44(4):383-397.

21. LioliosK,MavromatisK,TavernarakisN,KyrpidesNC:TheGenomes

OnLineDatabase(GOLD)in2007:statusofgenomicandmetagenomic

projectsandtheirassociatedmetadata.NucleicAcidsRes2008,

36(Databaseissue):D475-479.

22. McHardyAC,RigoutsosI:What'sinthemix:phylogeneticclassification

ofmetagenomesequencesamples.Currentopinioninmicrobiology2007,

10(5):499-503.

23. KarlinS,MrazekJ,MaJ,BrocchieriL:Predictedhighlyexpressedgenes

inarchaealgenomes.ProceedingsoftheNationalAcademyofSciencesoftheUnitedStatesofAmerica2005,102(20):7303-7308.

24. MrazekJ,KarlinS:Detectingaliengenesinbacterialgenomes.AnnNY

AcadSci1999,870:314-329.

25. OlmanV,MaoF,WuH,XuY:ParallelClusteringAlgorithmforLarge

DataSetswithapplicationsinBioinformatics.IEEE/ACMTransactionson

ComputationalBiologyandBioinformatics2007,Accepted.

26. DeSantisTZ,Jr.,HugenholtzP,KellerK,BrodieEL,LarsenN,Piceno

YM,PhanR,AndersenGL:NAST:amultiplesequencealignmentserver


19/29

-18-

forcomparativeanalysisof16SrRNAgenes.NucleicAcidsRes2006,

34(WebServerissue):W394-399.

27. CormenTH,LeisersonCE,RivestRL,SteinC:Introductionto

Algorithms,SecondEdition.Cambridge,MATheMITPress;2001.


20/29

-19-

Figures

Figure1Barcodesforfiveprokaryoticgenomes.

(a)E.coliK-12;(b)E.coliO157;(c)chromosome1ofB.pseudomalleiK96243;(d)

archaeanP.furiosusDSM3638;and(e)arandomnucleotidesequencegenerated

usingazeroth

orderMarkovchainmodel.Thex-axisforeachbarcodeisthelistofall

4-mersarrangedinthealphabeticalorder,andthey-axisisthegenomeaxiswitheach

pixelrepresentingafragmentofMbplong.

Figure2Basicfeaturesofbarcodes.

(a)Barcodedistancedistributionamongchromosomesfromthesameorganisms,

acrossallprokaryoticandeukaryoticchromosomalgenomes.Thex-axisisthe

barcodedistanceandthey-axisisthefrequencyofchromosomepairsofthesame

organismhavingaparticularbarcodedistance.(b)Genomebarcodedistancesversus

sequencesimilaritiesamongthecorresponding16SrRNAs(basedonthemultiple

sequencealignmentgiveninDeSantisTZetal.[26]).They-axisrepresentsthe

barcodedistance,andthex-axisisthesequenceidentityaxisbetweentwo16SrRNAs

groupedintoninebins,wherethesequenceidentityiscalculatedastheaverage

sequenceidentityoverall16Spairsineachbin.

Figure3Barcodesofsomeorganisms.

Barcodesof(a)Humanchromosome1(226.21Mbps);majorcomponentsofhuman

chromosomeinacompositeform:(b)repetitivesequence,(c)promotersequence,(d)

codingregionsand(e)introns;and(f)codingand(g)non-codingregionsofE.coli

K-12.Onlya639-Kbpregionofeachsequencein(b)(g)isdisplayedsoeachpixel

representsthesamesequencelength.639Kbpsisusedsincethisisthelengthofthe

shortestregionamongthemall,i.e.thetotalnon-codingregionofE.coliK-12.

Mitochondrialgenomebarcodesof(h)C.elegans(13794bps)and(i)Drosophila

melanogaster(19517bps).Plastidgenomebarcodesof(j)aquaticplant


21/29

-20-

Ceratophyllumdemersum(156252bps)and(k)landplantPopulustrichocarpa

(157033bps).

Figure4Barcodesinfeaturespace.Thex-axisistheaverageofvariationsofthe4-merfrequenciesacrossawhole

genomeacrossall4-mers,andthey-axismeasuresthesimilaritylevelamongall

1000-bppartitionedfragmentsofthegenome,eachrepresentedasa136-dimensional

vectorof4-merfrequencies;Specifically,foreachgenome,webuildaminimum

spanningtree[27]basedonthe4-merfrequencyvectorsforitssequencefragments

andtheirdistances.They-axisistheaveragedweight(distance)ofalledgesinthe

minimumspanningtree.Thegreendotsrepresentprokaryotes(586genomes),the

blueonesforeukaryotes(83chromosomes),theredonesforplastids(101genomes

withlengths>20,000bps),thebrownonesforplasmidsofprokaryoticgenomes(237

plasmids>20,000bps)andtheblackformitochondria(120genomeswithlengths>

20,000bps).

Figure5Distributionofratiosbetweenbarcodevariationsofallprokaryoticgenomesandtheircorrespondingrandomlygeneratednucleotidesequences.

Foreachgenome,acorrespondingrandomnucleotidesequenceisdefinedasa

randomsequenceofthesamelengthandwiththesamemono-nucleotidefrequencies

asthoseofthegenome,generatedusingazeroth

orderMarkovchainmodel.The

variationofabarcodeisdefinedasthestandarddeviationofthelistoftheaveraged

frequenciesofallthek-mersalongthegenome.Thex-axisistheratioofthebarcode

variationsbetweenagenomeandacorrespondingrandomsequence,andthey-axis

representsthefrequencyofcaseswithaparticularvariationratio.


22/29

-21-

Tables

Table1Binningaccuraciesofourbarcode-basedclusteringalgorithm.

11genomes 30genomes 100genomes

Originalgenomes

Filteredgenomes

Originalgenomes

Filteredgenomes

Originalgenomes

Filteredgenomes

FS=500bps 71.10% 77.30% 51.6% 55.70% 40.50% 41.10%

FS=1000bps 79.90% 85.90% 65.30% 70.30% 51.10% 52.60%

FS=2000bps 86.30% 91.70% 74.80% 80.60% 61.00% 68.53%

FS=5000bps 91.10% 98.10% 86.60% 93.20% 79.40% 81.90%

FS=10000bps 95.80% 99.30% 91.90% 97.50% 86.60% 89.18%

Thebinningaccuracyisdefinedas(predictionspecificity+predictionsensitivity)/2,

andFSisforfragmentsize,whereboththe specificityandsensitivityaremeasuredin

terms of putting the fragments into the correct bin corresponding to each genome,

definedbythemajorityofthefragmentsinthebin.ThecolumnOriginalgenomes

liststhebinning accuracy of ouralgorithmonall thenon-overlapping fragmentsin

eachgroupof genomes,andthecolumnFilteredgenomesgivestheaccuracyafter

removingthe10%fragmentswiththemostabnormalbarcodesfromeachgenome.


23/29

-22-

AdditionalfilesAdditionalfile1

Fileformat:DOC

Title:Supplementarymaterial.

Description:Supplementarymaterial1-3.

Additionalfile2

Fileformat:XLS

Title:SupplementaryTable1.

Description:HXishighlyexpressedgene,HTishorizontallytransferredgene,and

PHisthephagegene.UnknownGeneconsistsofgeneswithinfragmentsofabnormal

barcodesbutdonotbelongtotheabovethreecategories.

Additionalfile3

Fileformat:ZIPcompressedTXT

Title:SupplementaryTable2.

Description:Geneclassificationsofalltheprokaryoticgenomes.

Additionalfile4

Fileformat:XLS

Title:SupplementaryTable3

Description:Genomesusedinthebinningsection.


24/29


25/29


26/29


27/29


28/29


29/29

Additional files provided with this submission:

Additional file 1: barcode-suppl.doc, 737Khttp://www.biomedcentral.com/imedia/7848742892070350/supp1.docAdditional file 2: stable1.xls, 185K

http://www.biomedcentral.com/imedia/5579318302405972/supp2.xlsAdditional file 3: stable2.zip, 4115Khttp://www.biomedcentral.com/imedia/2665914822070350/supp3.zipAdditional file 4: stable3.xls, 27Khttp://www.biomedcentral.com/imedia/7405639442070350/supp4.xls
http://www.biomedcentral.com/imedia/7848742892070350/supp1.dochttp://www.biomedcentral.com/imedia/5579318302405972/supp2.xlshttp://www.biomedcentral.com/imedia/2665914822070350/supp3.ziphttp://www.biomedcentral.com/imedia/7405639442070350/supp4.xlshttp://www.biomedcentral.com/imedia/7405639442070350/supp4.xlshttp://www.biomedcentral.com/imedia/2665914822070350/supp3.ziphttp://www.biomedcentral.com/imedia/5579318302405972/supp2.xlshttp://www.biomedcentral.com/imedia/7848742892070350/supp1.doc

Documents

Barcodes Genomes