12
Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski, and Donovan H. Parks Australian Centre for Ecogenomics, School of Chemistryand Molecular Biosciences, The University of Queensland, St Lucia QLD 4072, Australia Correspondence: [email protected] Reconstructing the complete evolutionary history of extant life on our planet will be one of the most fundamental accomplishments of scientific endeavor, akin to the completion of the periodic table, which revolutionized chemistry. The road to this goal is via comparative genomics because genomes are our most comprehensive and objective evolutionary docu- ments. The genomes of plant and animal species have been systematically targeted over the past decade to provide coverage of the tree of life. However, multicellularorganisms only emerged in the last 550 million years of more than three billion years of biological evolution and thus comprise a small fraction of total biological diversity. The bulk of biodiversity, both past and present, is microbial. We have only scratched the surface in our understanding of the microbial world, as most microorganisms cannot be readily grown in the laboratory and remain unknown to science. Ground-breaking, culture-independent molecular techniques developed over the past 30 years have opened the door to this so-called microbial dark matter with an accelerating momentum driven byexponential increases in sequencing capacity. We are on the verge of obtaining representative genomes across all life for the first time. However, historical use of morphology, biochemical properties, behavioral traits, and single-marker genes to infer organismal relationships mean that the existing highly incomplete tree is riddled with taxonomic errors. Concerted efforts are now needed to synthesize and integrate the burgeoning genomic data resources into a coherent universal tree of life and genome- based taxonomy. SETTING THE STAGE FOR A TAXONOMIC CLASSIFICATION BASED ON EVOLUTIONARY RELATIONSHIPS C losely following on from Darwin’s thesis that all life forms on our planet arose from a common ancestor (Darwin 1859), have been biologists’ attempts to classify life natural- istically according to evolution. Initially and understandably, phenotype (morphology, de- velopment, etc.) was the primary basis for in- ferring relationships between organisms result- ing in a schema that lumped all microbial life (which had been discovered 200 years earlier through the advent of the microscope) into a single “primitive” kingdom at the base of the tree (Fig 1A) (Haeckel 1866). The discovery of the structure of DNA in the latter half of the 20th century and its role as the heritable blue- print of life led to the proposal that genes are a more objective basis than phenotype for inferring evolutionary (phylogenetic) relation- Editor: Howard Ochman Additional Perspectives on Microbial Evolution available at www.cshperspectives.org Copyright # 2016 Cold Spring Harbor Laboratory Press; all rights reserved; doi: 10.1101/cshperspect.a018085 Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085 1 on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/ Downloaded from

Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

  • Upload
    others

  • View
    20

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

Genome-Based Microbial TaxonomyComing of Age

Philip Hugenholtz, Adam Skarshewski, and Donovan H. Parks

Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The Universityof Queensland, St Lucia QLD 4072, Australia

Correspondence: [email protected]

Reconstructing the complete evolutionary history of extant life on our planet will be one ofthe most fundamental accomplishments of scientific endeavor, akin to the completion of theperiodic table, which revolutionized chemistry. The road to this goal is via comparativegenomics because genomes are our most comprehensive and objective evolutionary docu-ments. The genomes of plant and animal species have been systematically targeted over thepast decade to provide coverage of the tree of life. However, multicellular organisms onlyemerged in the last 550 million years of more than three billion years of biological evolutionand thus comprise a small fraction of total biological diversity. The bulk of biodiversity, bothpast and present, is microbial. We have only scratched the surface in our understanding of themicrobial world, as most microorganisms cannot be readily grown in the laboratory andremain unknown to science. Ground-breaking, culture-independent molecular techniquesdeveloped over the past 30 years have opened the door to this so-called microbial dark matterwith an accelerating momentum driven byexponential increases in sequencing capacity. Weare on the verge of obtaining representative genomes across all life for the first time. However,historical use of morphology, biochemical properties, behavioral traits, and single-markergenes to infer organismal relationships mean that the existing highly incomplete tree isriddled with taxonomic errors. Concerted efforts are now needed to synthesize and integratethe burgeoning genomic data resources into a coherent universal tree of life and genome-based taxonomy.

SETTING THE STAGE FOR A TAXONOMICCLASSIFICATION BASED ONEVOLUTIONARY RELATIONSHIPS

Closely following on from Darwin’s thesisthat all life forms on our planet arose

from a common ancestor (Darwin 1859), havebeen biologists’ attempts to classify life natural-istically according to evolution. Initially andunderstandably, phenotype (morphology, de-velopment, etc.) was the primary basis for in-

ferring relationships between organisms result-ing in a schema that lumped all microbial life(which had been discovered 200 years earlierthrough the advent of the microscope) into asingle “primitive” kingdom at the base of thetree (Fig 1A) (Haeckel 1866). The discovery ofthe structure of DNA in the latter half of the20th century and its role as the heritable blue-print of life led to the proposal that genesare a more objective basis than phenotype forinferring evolutionary (phylogenetic) relation-

Editor: Howard Ochman

Additional Perspectives on Microbial Evolution available at www.cshperspectives.org

Copyright # 2016 Cold Spring Harbor Laboratory Press; all rights reserved; doi: 10.1101/cshperspect.a018085

Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085

1

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 2: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

ships (Zuckerkandl and Pauling 1965). CarlWoese ran with this idea focusing on compari-son of universally conserved components of theprotein manufacturing machinery in the cell,the ribosomal RNA (rRNA) genes, which hecorrectly reasoned would produce an objectivetree of life. His first trees and all subsequentefforts turned the phenotype-based tree on itshead; instead of microorganisms occupyinga lowly corner of the tree, all multicellular lifeclustered together in a corner of one of threenewly described primary lines of descent (Fig.1B) (Woese and Fox 1977). Small subunit ribo-somal RNA- (16S rRNA)-based classification ofbacteria and archaea was enthusiastically em-braced by microbiologists following Woese’sdiscoveries, in large part because natural rela-tionships between microbes are virtually un-detectable using phenotypic properties (Stanierand Van Niel 1962). Thirty years on, 16S rRNAsequences form the basis of microbial classifi-cation; however, vast numbers of discrepanciesexist between taxonomy and phylogeny withmany currently defined taxa not forming evo-lutionarily coherent (monophyletic) groups. Aconspicuous case in point is the genus Clostrid-ium, which is superficially united by a commonmorphology and ability to produce endospores,but represents dozens of phylogenetically dis-tinct groups within the phylum Firmicutes (Yu-tin and Galperin 2013). This greatly impedesour ability to understand the ecology andevolution of ecosystems, such as mammalianguts, where clostridia are important functionalpopulations. The large number of unresolvedtaxonomic errors is due to a combination ofhistorical artifacts (phenotypic classification)and limitations with rRNA gene trees such aspoor phylogenetic resolution, inadequate refer-ence sequences, and sequencing artifacts (no-tably chimeras, as discussed below). Genometrees inferred that using multiple marker genesoffer greater phylogenetic resolution than16S rRNA and other single-marker gene trees(Ciccarelli et al. 2006; Lang et al. 2013) and are,therefore, a more reliable basis for taxono-mic classification. However, publicly availablegenome sequences are still far from represen-tative of microbial diversity as a whole, as

revealed by the development of culture-inde-pendent methods.

MOST MICROBIAL DIVERSITY ISUNCULTURED BUT HAS RECENTLY BECOMEREADILY ACCESSIBLE GENOMICALLY

Culture-independent molecular techniquesfounded on 16S rRNA in the mid-1980s (Olsenet al. 1986) highlighted our ignorance of most ofthe tree of life by crudely outlining its borders.This was achieved by sequencing 16S rRNAgenes from bulk DNAs extracted directly fromenvironmental sources (Pace 1997). This typeof microbial community profiling has improvedwith increased sequencing and computing ca-pacity. The startling conclusion from morethan two decades of such culture-independentenvironmental sequence surveys is that .80%of microbial evolutionary diversity is represent-ed by uncultured microorganisms distributedacross upward of 100 major lines of descentwithin the Bacteria and Archaea (Harris et al.2013) and that the amount of recognized micro-bial dark matter is still increasing (Fig. 2).

The task to obtain representative genomiccoverage of all recognized microbial diversity,estimated conservatively to represent hundredsof thousands of species (Curtis and Sloan2005), is daunting. However, two promisingculture-independent approaches have emergedto achieve this goal. The first is metagenomics,the application of high-throughput sequencingof DNA extracted directly from environmentalsamples, and the second is single-cell genomics,the physical separation and amplification ofcells before sequencing. Initially, it was unfeasi-ble to extract genomes of individual popula-tions from metagenomic data (a bioinformaticprocess called binning) because of insufficientsequencing depth and inadequate binningtools. Only in some instances could completeor near-complete genomes be reconstructedfrom environmental sequences, and these weretypically of dominant populations with mini-mal genomic heterogeneity (Tyson et al. 2004;Garcia Martin et al. 2006; Elkins et al. 2008).In contrast, most populations in early meta-genomic studies remained unidentified, being

P. Hugenholtz et al.

2 Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 3: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

Flexibac

terFla

vobact

erium

Planctomyces

Agrobacterium

Mitochondrion

Rhodocycl

us

Escher

ichia

Desulfovib

rio

Chloroplas

t

Synechococcus

Gloeobacter

Chlam

ydia

Chlorob

ium

Clostrid

ium

Bac

illus

Helioba

cterium

Arth

roba

cter

pOP

S19

pOPS66

Roo

t

pJP 78pSL 50

Thermo

filum

Thermoproteus

Pyrod

ictium

Sulfolobu

s

Gp.

3 lo

w te

mp

Gp.

2 lo

w te

mp

Gp.

1 lo

w te

mp

Mar

ine

Gp.

1 lo

w te

mp

Metha

nosp

irillum

Methano

bacteiu

m

Methanothermus

Methanopyrus

Haloferax

Archaeoglobus

Therm

oplas

ma

Marine low temp

Methanococcus

Thermococc

uspS

L 22pS

L 12

pJP 27

Coprinu

sHom

o Zea

Cry

ptom

onas

Ach

lya

Costaria

Porphyra

Param

ecium Babesia

Dictyostelium

Entamoeba

Naegleria

Euglena

Encephali

tozoon

Trichom

onas

Vairim

orpha

Giardia

Trypanosom

aPhysaru

m

Chloroflexus

Thermus

Aquifex

EM17

Thermotoga

Lepton

ema

Bac

teria

Archae

a

Euca

rya

0.1

chan

ges/

site

AB

Figu

re1.

Two

rep

rese

nta

tio

ns

of

the

tree

of

life

.T

he

firs

t(A

)b

ased

on

ph

eno

typ

icco

mp

aris

on

sre

sult

ing

inlu

mp

ing

of

mic

roo

rgan

ism

sin

toa

sin

gle

un

dif

fere

nti

ated

mas

sat

the

bas

eo

fth

etr

ee(c

ircl

edin

red

),an

dth

ese

con

d(B

)b

ased

on

gen

oty

pic

(rR

NA

gen

es)

com

par

iso

ns

reve

alin

gth

atm

ost

div

ersi

ty(i

ncl

ud

ing

the

Eu

cary

a)is

actu

ally

mic

rob

ial

wit

hm

ult

icel

lula

rli

fefo

rms

on

lyem

ergi

ng

rela

tive

lyre

cen

tly

inev

olu

tio

nar

yh

isto

ry(c

ircl

edin

red

).(P

anel

Ais

rep

rod

uce

dfr

om

Hae

kcel

1866

,an

dis

free

lyav

aila

ble

inth

ep

ub

lic

do

mai

nan

dfr

eeo

fkn

own

rest

rict

ion

su

nd

erco

pyr

igh

tla

w.P

anel

Bis

bas

edo

nF

igu

re2

inB

arn

set

al.

1996

.)

Genome-Based Microbial Taxonomy Coming of Age

Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085 3

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 4: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

represented by a smattering of anonymous se-quence data. Single-cell genomics overcomesthis problem by separating environmental sam-ples into their component populations up front,most efficiently via cell sorting, thereby avoid-ing the need for binning and allowing se-quencing resources to be focused on individualgenomes (Lasken and McLean 2014). The po-tential of this approach for articulating micro-bial dark matter was shown in a recent studydescribing 200 single-cell genomes representing29 major uncultured bacterial and archaeal lin-eages that challenge established boundaries be-tween the three domains of life. These includean archaeal-type purine synthesis in bacteriaand complete sigma factors in Archaea similarto those in Bacteria (Rinke et al. 2013). Despitethis promising start, we need thousands ofmicrobial dark matter genomes, rather thanhundreds, to gain a comprehensive picture ofmicrobial evolution and ecology. Technicalchallenges associated with single-cell genomics,including the need for whole-genome amplifi-

cation and the high potential for contamina-tion, currently preclude this level of scale-up.Moreover, the average completeness of a sin-gle-cell genome is currently only 35% (Parkset al. 2015) because of large biases in genomecoverage introduced during the whole-genomeamplification step (Lasken and McLean 2014).Recently, a new and effective binning strategybased on differing relative abundance patternsof populations between related microbial com-munities has been developed by multiple groups(Albertsen et al. 2013; Alneberg et al. 2013; Sha-ron and Banfield 2013; Imelfort et al. 2014; Kanget al. 2014; Nielsen et al. 2014). This approachuses differential sequencing coverage of a givenpopulation between metagenomic data sets asa proxy of its relative abundance, and leveragesthe much higher genome coverage afforded bymodern sequencers, which produce tens ofbillions of sequence base pairs per run. Unlikesingle-cell genomics, differential coverage bin-ning will scale to produce tens of thousands ofpopulation genomes in a short time frame. For

600016S sequences from clones

16S sequences from named isolates5000

4000

3000

2000

Cum

ulat

ive

phyl

ogen

etic

div

ersi

ty

1000

2000 2001 2002 2003 2004 2005 2006 2007

Year

2008 2009 2010 2011 2012 2013

19%19%

19%20%

22%24%

30%34%

40%46%

51%55%

60%68%

0

Phylogenetic diversityPhylogenetic diversitymissed by isolate sequencesmissed by isolate sequences

Phylogenetic diversitymissed by isolate sequences

Figure 2. Increase in recognized phylogenetic diversity as measured by rRNA sequence novelty (y-axis) over time(x-axis). Recognized diversity has increased by an order of magnitude in the past 10 years, most of which(�80%) is a result of newly identified uncultured lineages (microbial dark matter). Note that the apparentleveling off of novel phylogenetic diversity discovery in 2013 is likely a consequence of the recent move from full-length 16S rRNA sequencing to partial gene sequencing on next-generation sequencing platforms, which are notincluded in this estimate (adapted from Rinke et al. 2013).

P. Hugenholtz et al.

4 Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 5: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

example, �50 Gbp of sequence data (a singlelane of HiSeq Illumina data) from more thanthree related environmental samples will typi-cally produce 50 to 100 high-quality populationbins (.80% complete, ,10% contaminated)(Imelfort et al. 2014; Parks et al. 2015). Fur-thermore, the method does not require spe-cialized equipment beyond access to high-throughput sequencing and high-performancecomputing, and therefore is being rapidlyadopted by research groups worldwide. It isentirely feasible that .500,000 population ge-nomes will be generated in the next few yearsproviding the necessary volume of data to ad-equately cover all of the major lines of descentin the bacterial and archaeal domains. This willform the basis of a comprehensive genome-based classification framework of microbialdiversity.

RECONCILING TAXONOMY WITHPHYLOGENY

Although phylogenetic inference is an objectiveapproximation of evolutionary history, micro-bial systematics—the classification of micro-organisms into hierarchical taxonomic groups(nominally domain, phylum, class, order, fam-ily, genus, species) based on phylogenetic infer-ence (or other criteria), is a largely subjectivehuman construct to help organize information.Few would argue the natural evolutionary dis-tinction between Bacteria and Archaea so therank of Domain (Kingdom) is not especiallycontroversial, with the exception of the place-ment of the Eucarya (Spang et al. 2015). How-ever, the circumscription of all ranks sub-ordinate to Domain is often fiercely debatedbetween taxonomic “lumpers” and “splitters”(Endersby 2009), which unfortunately dimin-ishes the enterprise in the eyes of other scientistsas they consider it subjective and arbitrary.Microbial systematics has been divorced fromevolutionary theory for much of its history(Doolittle 2015), so the process of reconcilingtaxonomy with phylogeny will serve to unite thetwo. This involves two main tasks.

The first is to identify polyphyletic taxa at allranks (phylum to species) in a phylogenetic tree

and to reclassify them according to a standardset of rules. Figure 3 is a genome-based recon-struction of the Gammaproteobacteria high-lighting some of the polyphyletic orders inthis class. In this tree, the order Alteromonadalescomprises four distinct lines of descent. To cor-rect this conflict between phylogeny and taxon-omy, the group containing the type genus afterwhich the order was named, Alteromonas, is firstidentified (starred in tree) and retains the ordername. The other groups can then be renamedafter the oldest validly described genus in eachgroup (shown in parentheses in Fig. 3), in thiscase Shewanellales after Shewanella, Psychromo-nadales after Psychromonas, and Cellvibrionalesafter Cellvibrio. The three other polyphyleticorders in Figure 3, Aeromonadales, Oceanospir-illales, and Pseudomonadales, can be similarlycorrected. This same approach can be used forall taxonomic ranks above genus, and if no val-idly named isolate exists for a given group,which is indeed the case for the majority ofbranches because of microbial dark matter,groups can be named after environmental 16SrRNA sequences or population genomes, withthe caveat that the name should be unique inthe taxonomy (McDonald et al. 2012). Usingthis approach, we found that 85% of 635 namedisolates belonging to the class Clostridia re-quired reassignment at one or more ranks toreconcile existing taxonomic classificationswith a genome-based phylogeny (Table 1).

The second task concerns the unevenness ofrank assignments according to sequence-basedmetrics. Konstantinidis and Tiedje proposedthe average amino acid identity (AAI) of sharedgenes between two genomes as a measure ofrelatedness (Konstantinidis and Tiedje 2005).They plotted taxonomic ranks as a function ofAAI for 175 fully sequenced strains revealing ahigh degree of overlap between different ranks,even nonadjacent ranks. We repeated this anal-ysis with nearly 6000 genomes and the currentNational Center for Biotechnology Informa-tion (NCBI) taxonomy with much the sameresult—up to five different ranks overlapped ata given AAI and the range of AAIs for eachrankoften exceeded 20% (Fig. 4). Recently, Yarzaet al. (2014) proposed using 16S rRNA gene

Genome-Based Microbial Taxonomy Coming of Age

Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085 5

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 6: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

sequence identity thresholds to rationally cir-cumscribe different taxonomic ranks. However,like AAIs, this does not take into account differ-ent evolutionary tempos across the tree of life,which can result in an uneven application of tax-onomic ranks as fast-evolving lineages will havelower sequence identities thantheir slowerevolv-ing counterparts for the same divergence times.Ideally, ancestors belonging to the same rank

should have been contemporaries in the past.Therefore, defining and normalizing rank distri-butions using estimated evolutionarydivergenc-es is a more naturalistic approach for assigningranks. Although accurately estimating diver-gence times has proven challenging (Arbogastet al. 2002), branch lengths in a genome-basedphylogeny may provide a suitable approxima-tion if these lengths are appropriately normal-

1639

Pasteurellales

Enterobacteriales

Vibrionales205

105

21 Aeromonadales *

Aeromonadales (Succinivibrionales)

Pseudomonadales *

Alteromonadales (Cellvibrionales)

Pseudomonadales (Moraxellales)

Oceanospirillales (Alcanivorales)

Oceanospirillales *

Oceanospirillales (Kangiellales)

9

164

46

42

314

2

6

60

12

40 Alteromonadales (Shewanellales)

Alteromonadales (Psychromonadales)

Alteromonadales *

Figure 3. A tree of isolate genomes belonging to the class Gammaproteobacteria inferred from a concatenatedalignment of 83 single-copy marker genes. Lineages are collapsed at the order level showing that some aremonophyletic (in black), but many others are polyphyletic (colored), the most extreme case in this instancebeing the Alteromonadales. Renaming of orders to remove polyphyly is shown in parentheses. Numbers insidecollapsed groups indicate the number of genomes comprising a given order.

P. Hugenholtz et al.

6 Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 7: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

ized to take into account varying evolutionaryrates.

TRANSITIONING FROM A 16S rRNA-TO GENOME-BASED TAXONOMY

To date, 16S rRNA has been the most widelyused gene for inferring evolutionary relation-ships between microorganisms and serves asthe primary basis for microbial taxonomy. Itshigh degree of sequence conservation (Woese1987) has the dual benefits of providing a do-

main-level overview of microbial diversity, andenables culture-independent environmentalsurveys using near universal polymerase chainreaction (PCR) priming sites. The public repos-itory of 16S rRNA gene sequences number inthe millions and it is, therefore, also the mostcomprehensively sequenced marker gene avail-able to us. However, this same sequence conser-vation also leads to chimera formation in envi-ronmental surveys. Chimeras are PCR-inducedartifacts whereby an incompletely synthesized16S rRNA amplicon acts as a primer on a dif-ferent template producing a hybrid molecule.Such mispriming only requires short stretchesof sequence identity between the 30-end ofthe incomplete template and homologous re-gion of the foreign template, which aboundsin 16S rRNA because of its high degree of con-servation. Moreover, very similar chimeras canbe formed in independent studies of similarhabitats in which parent sequences are presentin approximately similar ratios (e.g., Bacter-oides–Clostridium chimeras produced in hu-man gut surveys) (Haas et al. 2011). Chimeras

Table 1 Extent of taxonomic reassignments requiredto reconcile existing taxonomy with genome-basedphylogeny for members of the class Clostridia

No. ranks requiring

reassignment

No. changes/

635

% of

total

0 92 14.51 123 19.42 309 48.73 111 17.5

Only changes in three ranks are included: order, family,

and genus.

20

40

60

80

100

0

38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98

Average amino acid identity (AAI)

Per

cent

dis

trib

utio

n

Diff. domains

Domain

Phylum

Class

Order

Family

Genus

Species

Figure 4. A histogram of pairwise average amino acid identities (AAI) between 290 archaeal and 5582 bacterialreference genomes colored by rank affiliation showing the uneveness of current taxonomic classifications.Extreme outliers include isolates classified in the same species (red) with only 68% AAI (Bdellovibrio bacter-iovorus) and others classified in different genera (green) sharing as high as 94% AAI (e.g., Tannerella andCoprobacter) (adapted from Konstantinidis and Tiedje 2005).

Genome-Based Microbial Taxonomy Coming of Age

Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085 7

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 8: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

have been recognized as a problem associatedwith PCR-based surveys since the 1990s andalarms have been raised about the growingnumber of undetected chimeras in the publicdatabases (Hugenholtz and Huber 2003; Ashel-ford et al. 2005). Environmental sequences nowdominate the public databases in both numberand phylogenetic diversity coverage, unlike theearly 2000s when they still represented a minor-ity of available data (Fig. 1). It is very likely thatthe problem of undetected chimeras has in-creased accordingly as suggested by the pooroverlap between different chimera-detectiontools (Haas et al. 2011). All such tools detectchimeric sequences by comparison to a refer-ence set of 16S rRNA sequences. If the referenceset is compromised by undetected chimeras,detection of new chimeric sequences is alsocompromised. In short, chimeras are a recog-nized but underestimated problem in 16S rRNAdata sets that artificially inflate diversity esti-mates and introduce noise into phylogenetictrees, ultimately compromising taxonomic clas-sifications based on these trees.

Concatenated protein marker gene trees de-rived from isolate and population genomesare much less susceptible to chimeric artifactsand provided completeness and contaminationchecks used for the latter (Parks et al. 2015).Combined with the higher phylogenetic resolu-tion afforded by concatenated markers, suchtrees are the logical choice for a phylogeneticallyreconciled taxonomy. However, the 16S rRNAdatabase is currently two orders of magnitudelarger than the genome database and significantefforts have been invested into curating 16SrRNA trees for taxonomic classification (Kimet al. 2012; McDonald et al. 2012; Quast et al.2012; Cole et al. 2013). Therefore, and despiteambiguities arising from chimeric sequences,efforts should be made to incorporate 16SrRNA-based classifications into genome-basedtaxonomy to provide taxonomic continuity inthe literature. For microbial isolates, the processof connecting 16S rRNA to genome sequencesto allow transfer of taxonomic informationshould be straightforward. However, for manyisolates, 16S rRNA genes were sequenced beforetheir genomes, introducing the risk of connect-

ing different taxa between trees. Sequencing oftype material is very important in this context asit provides unambiguous signposts in phyloge-netic trees, not only for transfer of taxonomicinformation, but also to enable correction of thenumerous polyphyly errors that presently existin microbial taxonomy (Fig. 3; Table 1) (Kyrpi-des et al. 2014). For most microbial diversity,however, we are reliant on single-cell or popu-lation genomes to connect 16S rRNA-basedclassifications to their corresponding locationsin genome trees. Unfortunately, rRNA genes areoften difficult to recover in genomes derivedfrom metagenomic data. This is because of thehigh conservation of these genes, which oftenresults in chimeric assembly, and the fact thatthe rRNA operon is typically the largest repeatin a microbial genome if present in greater thanone copy. This confounds differential coveragebinning because the rRNA genes do not binwith the rest of the genome as a result of havinga higher coverage if present in the assembly as acollapsed repeat. Development of new methodsto obtain full-length non-chimeric 16S rRNAgenes for population genomes should be a pri-ority as this will enable a smooth transition to agenome-based microbial taxonomy. Moreover,16S rRNA is likely to remain a valuable tool inthe microbiologist’s toolbox for the foreseeablefuture, for example, 16S rRNA-targeted fluores-cence in situ hybridization (Amann et al. 2001),so it cannot be easily ignored as we transition toa genome-based taxonomy.

APPLICATIONS FOR A REPRESENTATIVEGENOME TREE AND GENOME-BASEDTAXONOMY

One important and largely open question is theextent to which lateral gene transfer (LGT) playsa role in microbial evolution and ecology. Thistopic was first broached in a holistic mannerwith the availability of the first microbial ge-nome sequences. Phylogenetic analysis of nu-merous gene families revealed a high level ofinferred discordance with the 16S rRNA-basedtree over evolutionary timescales, raising thequestion as to whether the universal tree of lifewould be better represented as a network be-

P. Hugenholtz et al.

8 Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 9: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

cause of LGT blurring organismal boundaries(Doolittle 1999). Despite the inference that LGTis a relatively common phenomenon, a compar-ison of 56 genomes selected to maximize 16SrRNA-defined phylogenetic diversity with ran-domly selected genome data sets of the samesize showed that tree-based selection resultedin higher rates of discovery of novel gene fami-lies, gene fusions, and operon arrangements(Wu et al. 2009). These findings argue againstLGT overriding vertical inheritance. However,few would argue against LGT being a potentevolutionary force, the most public face ofwhich is transfer of antibiotic resistance be-tween bacteria (Arber 2014). An analysis ofmicrobial isolate genomes to identify recentlylaterally transferred genes (.99% nucleotideidentity) in humans and the environmentpoints to ecology being a more important driverof LGT than phylogeny, particularly in humans(Smillie et al. 2011). This analysis only consid-ered isolate genomes, however, which do notadequately represent the ecosystems from whichthey were obtained. Population genomicsprovides the means to obtain a representativesampling of the component members of a givenecosystem, and a genome-based taxonomy thenbecomes the essential framework for determin-ing the frequency and mode of LGT betweenorganisms of varying evolutionary relatedness(Beiko et al. 2005).

Current concepts on many topics have like-ly been oversimplified by limited and highlyskewed sampling of microbial diversity, andstand to be overhauled by microbial dark mattergenome sequencing. Recent examples includethe discovery of nonphotosynthetic basal Cya-nobacteria (Di Rienzi et al. 2013; Soo et al.2014) and a large radiation of candidate bacte-rial phyla with unusual ribosome structures andbiogenesis mechanisms that may actually bemore representative of the “average” bacteriumthan our current preconceptions (Rinke et al.2013; Brown et al. 2015). Similarly, categoriza-tion of the bacterial cell envelope as either single(monoderm) or double (diderm) layer struc-tures, based on studies of mostly Firmicutesand Proteobacteria, will become more sophisti-cated with greater representation of candidate

phyla (Sutcliffe 2010). In summary, our knowl-edge of the microbial world is set to expanddramatically in the coming years as microbialdark matter is rapidly illuminated through ge-nome sequencing. Coupling these advances to asystematized genome-based classification willserve to organize this valuable resource into acoherent framework that will greatly facilitateanalysis and communication.

REFERENCES

Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL,Tyson GW, Nielsen PH. 2013. Genome sequences ofrare, uncultured bacteria obtained by differential cover-age binning of multiple metagenomes. Nat Biotechnol31: 533–538.

Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J,Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C.2013. CONCOCT: Clustering cONtigs on COverage andComposiTion. arXiv:1312.4038.

Amann R, Fuchs BM, Behrens S. 2001. The identification ofmicroorganisms by fluorescence in situ hybridisation.Curr Opin Biotechnol 12: 231–236.

Arber W. 2014. Horizontal gene transfer among bacteriaand its role in biological evolution. Life (Basel) 4: 217–224.

Arbogast BS, Edwards SV, Wakeley J, Beerli P, Slowinski JB.2002. Estimating divergence times from molecular dataon phylogenetic and population genetic timescales. AnnuRev Ecol Syst 33: 707–740.

Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weight-man AJ. 2005. At least 1 in 20 16S rRNA sequence recordscurrently held in public repositories is estimated to con-tain substantial anomalies. Appl Environ Microbiol 71:7724–7736.

Beiko RG, Harlow TJ, Ragan MA. 2005. Highways of genesharing in prokaryotes. Proc Natl Acad Sci 102: 14332–14337.

Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, SinghA, Wilkins MJ, Wrighton KC, Williams KH, BanfieldJF. 2015. Unusual biology across a group comprisingmore than 15% of domain bacteria. Nature 523: 208–211.

Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B,Bork P. 2006. Toward automatic reconstruction of a high-ly resolved tree of life. Science 311: 1283–1287.

Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y,Brown CT, Porras-Alfaro A, Kuske CR, Tiedje JM. 2013.Ribosomal Database Project: Data and tools for highthroughput rRNA analysis. Nucleic Acids Res 42: D633–D642.

Curtis TP, Sloan WT. 2005. Microbiology. Exploring micro-bial diversity—A vast below. Science 309: 1331–1333.

Darwin CR. 1859. On the origin of species by means of naturalselection, or the preservation of favoured races in the strug-gle for life, 1st ed. John Murray, London.

Genome-Based Microbial Taxonomy Coming of Age

Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085 9

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 10: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

Di Rienzi SC, Sharon I, Wrighton KC, Koren O, Hug LA,Thomas BC, Goodrich JK, Bell JT, Spector TD, BanfieldJF, et al. 2013. The human gut and groundwater harbornon-photosynthetic bacteria belonging to a new candi-date phylum sibling to cyanobacteria. eLife 2: e01102.

Doolittle WF. 1999. Phylogenetic classification and the uni-versal tree. Science 284: 2124–2129.

Doolittle WF. 2015. Rethinking the tree of life. Microbe Mag,Oct, www.microbemagazine.org.

Elkins JG, Podar M, Graham DE, Makarova KS, Wolf Y,Randau L, Hedlund BP, Brochier-Armanet C, Kunin V,Anderson I, et al. 2008. A korarchaeal genome revealsinsights into the evolution of the Archaea. Proc NatlAcad Sci 105: 8102–8107.

Endersby J. 2009. Lumpers and splitters: Darwin, Hooker,and the search for order. Science 326: 1496–1499.

Garcıa Martın H, Ivanova N, Kunin V, Warnecke F, BarryKW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E,et al. 2006. Metagenomic analysis of two enhanced bio-logical phosphorus removal (EBPR) sludge communi-ties. Nat Biotechnol 24: 1263–1269.

Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Gian-noukos G, Ciulla D, Tabbaa D, Highlander SK, SodergrenE, et al. 2011. Chimeric 16S rRNA sequence formationand detection in Sanger and 454-pyrosequenced PCRamplicons. Genome Res 21: 494–504.

Haeckel H. 1866. Generelle morphologie der organismen: All-gemeine grundzuge der organischen Formen-Wissenschaft,mechanisch begrundet durch die von Charles Darwin re-formirte descendenztheorie, von Ernst Haeckel. Reimer,Berlin.

Harris JK, Caporaso JG, Walker JJ, Spear JR, Gold NJ,Robertson CE, Hugenholtz P, Goodrich J, McDonald D,Knights D, et al. 2013. Phylogenetic stratigraphy in theGuerrero Negro hypersaline microbial mat. ISME J 7:50–60.

Hugenholtz P, Huber T. 2003. Chimeric 16S rDNA sequenc-es of diverse origin are accumulating in the public data-bases. Int J Syst Evol Microbiol 53: 289–293.

Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P,Tyson GW. 2014. GroopM: An automated tool for therecovery of population genomes from related metage-nomes. PeerJ 2: e603.

Kang DD, Froula J, Egan R, Wang Z. 2014. MetaBAT: Meta-genome binning based on abundance and tetranucleo-tide frequency. Presented at the Ninth Annual DOE JointGenome Institute User Meeting. Walnut Creek, CA.

Kim OS, Cho YJ, Lee K, Yoon SH, Kim M, Na H, Park SC,Jeon YS, Lee JH, Yi H, et al. 2012. Introducing EzTaxon-e:A prokaryotic 16S rRNA gene sequence database withphylotypes that represent uncultured species. Int J SystEvol Microbiol 62: 716–721.

Konstantinidis KT, Tiedje JM. 2005. Towards a genome-based taxonomy for prokaryotes. J Bacteriol 187: 6258–6264.

Kyrpides NC, Hugenholtz P, Eisen JA, Woyke T, Goker M,Parker CT, Amann R, Beck BJ, Chain PS, Chun J, et al.2014. Genomic encyclopedia of bacteria and archaea:Sequencing a myriad of type strains. PLoS Biol 12:e1001920.

Lang JM, Darling AE, Eisen JA. 2013. Phylogeny of bacterialand archaeal genomes using conserved genes: Supertreesand supermatrices. PLoS ONE 8: e62510.

Lasken RS, McLean JS. 2014. Recent advances in genomicDNA sequencing of microbial species from single cells.Nat Rev Genet 15: 577–584.

McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantisTZ, Probst A, Andersen GL, Knight R, Hugenholtz P.2012. An improved Greengenes taxonomy with explicitranks for ecological and evolutionary analyses of bacteriaand archaea. ISME J 6: 610–618.

Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J,Sunagawa S, Plichta DR, Gautier L, Pedersen AG, LeChatelier E, et al. 2014. Identification and assembly ofgenomes and genetic elements in complex metagenomicsamples without using reference genomes. Nat Biotechnol32: 822–828.

Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA. 1986.Microbial ecology and evolution: A ribosomal RNA ap-proach. Annu Rev Microbiol 40: 337–365.

Pace NR. 1997. A molecular view of microbial diversity andthe biosphere. Science 276: 734–740.

Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, TysonGW. 2015. CheckM: Assessing the quality of microbialgenomes recovered from isolates, single cells, and meta-genomes. Genome Res 25: 1043–1055.

Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P,Peplies J, Glockner FO. 2012. The SILVA ribosomal RNAgene database project: Improved data processing andweb-based tools. Nucleic Acids Res 41: D590–D596.

Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ,Cheng JF, Darling A, Malfatti S, Swan BK, et al. 2013.Insights into the phylogeny and coding potential of mi-crobial dark matter. Nature 499: 431–437.

Sharon I, Banfield JF. 2013. Microbiology. Genomes frommetagenomics. Science 342: 1057–1058.

Smillie CS, Smith MB, Friedman J, Cordero OX, David LA,Alm EJ. 2011. Ecology drives a global network of geneexchange connecting the human microbiome. Nature480: 241–244.

Soo RM, Skennerton CT, Sekiguchi Y, Imelfort M, Paech SJ,Dennis PG, Sheen JA, Parks DH, Tyson GW, HugenholtzP. 2014. An expanded genomic representation of the phy-lum Cyanobacteria. Genome Biol Evol 6: 1031–1045.

Spang A, Saw JH, Jørgensen SL, Zaremba-Niedzwiedzka K,Martijn J, Lind AE, van Eijk R, Schleper C, Guy L, EttemaTJ. 2015. Complex archaea that bridge the gap betweenprokaryotes and eukaryotes. Nature 521: 173–179.

Stanier RY, Van Niel CB. 1962. The concept of a bacterium.Arch Mikrobiol 42: 17–35.

Sutcliffe IC. 2010. A phylum level perspective on bacterialcell envelope architecture. Trends Microbiol 18: 464–470.

Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ,Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS,Banfield JF. 2004. Community structure and metabolismthrough reconstruction of microbial genomes from theenvironment. Nature 428: 37–43.

Woese CR. 1987. Bacterial evolution. Microbiol Rev 51: 221–271.

P. Hugenholtz et al.

10 Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 11: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

Woese CR, Fox GE. 1977. Phylogenetic structure of the pro-karyotic domain: The primary kingdoms. Proc Natl AcadSci 71: 5088–5090.

Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E,Ivanova NN, Kunin V, Goodwin L, Wu M, Tindall BJ,et al. 2009. A phylogeny-driven genomic encyclopaediaof Bacteria and Archaea. Nature 462: 1056–1060.

Yarza P, Yilmaz P, Pruesse E, Glockner FO, Ludwig W, Schlei-fer KH, Whitman WB, Euzeby J, Amann R, Rossello-

Mora R. 2014. Uniting the classification of cultured anduncultured bacteria and archaea using 16S rRNA genesequences. Nat Rev Microbiol 12: 635–645.

Yutin N, Galperin MY. 2013. A genomic update on clos-tridial phylogeny: Gram-negative spore formers andother misplaced clostridia. Environ Microbiol 15: 2631–2641.

Zuckerkandl E, Pauling L. 1965. Molecules as documents ofevolutionary history. J Theor Biol 8: 357–366.

Genome-Based Microbial Taxonomy Coming of Age

Cite this article as Cold Spring Harb Perspect Biol 2016;8:a018085 11

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from

Page 12: Genome-Based Microbial Taxonomy Coming of Agecshperspectives.cshlp.org/content/8/6/a018085.full.pdf · Genome-Based Microbial Taxonomy Coming of Age Philip Hugenholtz, Adam Skarshewski,

March 17, 20162016; doi: 10.1101/cshperspect.a018085 originally published onlineCold Spring Harb Perspect Biol 

 Philip Hugenholtz, Adam Skarshewski and Donovan H. Parks Genome-Based Microbial Taxonomy Coming of Age

Subject Collection Microbial Evolution

Genetics, and RecombinationNot So Simple After All: Bacteria, Their Population

William P. HanageAgeGenome-Based Microbial Taxonomy Coming of

Donovan H. ParksPhilip Hugenholtz, Adam Skarshewski and

Realizing Microbial EvolutionHoward Ochman

Horizontal Gene Transfer and the History of LifeVincent Daubin and Gergely J. Szöllosi

EvolutionThe Importance of Microbial Experimental Thoughts Toward a Theory of Natural Selection:

Daniel Dykhuizen

Early Microbial Evolution: The Age of AnaerobesWilliam F. Martin and Filipa L. Sousa

Prokaryotic GenomesCoevolution of the Organization and Structure of

Marie Touchon and Eduardo P.C. Rocha

Microbial SpeciationB. Jesse Shapiro and Martin F. Polz

Mutation and Its Role in the Evolution of BacteriaThe Engine of Evolution: Studying−−Mutation

Ruth HershbergCampylobacter coli

and Campylobacter jejuniThe Evolution of

Samuel K. Sheppard and Martin C.J. Maiden

Mutation)Natural Selection Mimics Mutagenesis (Adaptive The Origin of Mutants Under Selection: How

Sophie Maisnier-Patin and John R. Roth

EvolutionPaleobiological Perspectives on Early Microbial

Andrew H. Knoll

Preexisting GenesEvolution of New Functions De Novo and from

Joakim NäsvallDan I. Andersson, Jon Jerlström-Hultqvist and

http://cshperspectives.cshlp.org/cgi/collection/ For additional articles in this collection, see

Copyright © 2016 Cold Spring Harbor Laboratory Press; all rights reserved

on November 26, 2020 - Published by Cold Spring Harbor Laboratory Press http://cshperspectives.cshlp.org/Downloaded from