Searching for Novel Forms of Life
Jonathan A. Eisen UC Davis
@phylogenomics
Diversity of Life Workshop Pacifica, CA
December 16, 2015
Once You Find Something Alive …
You find a CLE
Separate Origin from Known Life?
Common Origin with Known Life?
Homologies w/ Known Life?
No
Yes
How Novel Is It?
• Novel form
• Novel function
• Novel phylogeny
Phylogeny
Archaea
Woese Classification of Cultured Taxa by rRNA
rRNA rRNA rRNA
ACUGC ACCUAU CGUUCG
ACUCC AGCUAU CGAUCG
ACCCC AGCUCU CGCUCG
Taxa  Characters
S     ACUGCACCUAUCGUUCG
R     ACUCCACCUAUCGUUCG
E     ACUCCAGCUAUCGAUCG
F     ACUCCAGGUAUCGAUCG
C     ACCCCAGCUCUCGCUCG
W     ACCCCAGCUCUGGCUCG
Eukaryotes Bacteria
Carl Woese
Woese 3 Domain Tree
rRNA Phylotyping: One Taxon
DNA
ACTGC ACCTAT CGTTCG
ACTGC ACCTAT CGTTCG
ACTGC ACCTAT CGTTCG
Taxa  Characters
B1    ACTGCACCTATCGTTCG
B2    ACTCCACCTATCGTTCG
E1    ACTCCAGCTATCGATCG
E2    ACTCCAGGTATCGATCG
A1    ACCCCAGCTCTCGCTCG
A2    ACCCCAGCTCTGGCTCG
New1  ACTGCACCTATCGTTCG
Eukaryotes Bacteria Archaea
Many sequences from one sample all point to the same branch on the tree
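The phylotyping idea on this slide can be illustrated with a minimal sketch that assigns a new sequence to the reference taxon with the fewest mismatches. The reference characters are the toy ones from the slide; the function names `hamming` and `closest_reference` are hypothetical, and counting mismatches is a simplification assumed here for illustration (real phylotyping places sequences on a tree built from full-length rRNA genes).

```python
# Toy reference alignment copied from the slide (B = Bacteria, E = Eukaryotes,
# A = Archaea). Counting mismatches is an illustrative stand-in for
# tree-based placement, not the method used in the talk.
REFERENCES = {
    "B1": "ACTGCACCTATCGTTCG",
    "B2": "ACTCCACCTATCGTTCG",
    "E1": "ACTCCAGCTATCGATCG",
    "E2": "ACTCCAGGTATCGATCG",
    "A1": "ACCCCAGCTCTCGCTCG",
    "A2": "ACCCCAGCTCTGGCTCG",
}


def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_reference(query, references=REFERENCES):
    """Return (name, mismatches) for the reference closest to the query."""
    distances = {name: hamming(query, seq) for name, seq in references.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


if __name__ == "__main__":
    # New1 from the slide is identical to B1, so it lands on the bacterial branch.
    print(closest_reference("ACTGCACCTATCGTTCG"))
```

If every sequence recovered from a sample returns the same closest reference, the sample points to a single branch, which is the situation the slide describes.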
Norm Pace
Expanded Tree (Pace 1997)
Archaea
Eukaryotes
Bacteria
Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.
Based on tree from Pace 1997 Science 276:734-740
Is There Anything Like This?
Archaea
Eukaryotes
Bacteria
??????
Metagenomics
ACUGC ACCUAU CGUUCG
ACUCC AGCUAU CGAUCG
ACCCC AGCUCU CGCUCG
Taxa  Characters
S     ACUGCACCUAUCGUUCG
R     ACUCCACCUAUCGUUCG
E     ACUCCAGCUAUCGAUCG
F     ACUCCAGGUAUCGAUCG
C     ACCCCAGCUCUCGCUCG
W     ACCCCAGCUCUGGCUCG
Taxa  Characters
S     ACUGCACCUAUCGUUCG
E     ACUCCAGCUAUCGAUCG
C     ACCCCAGCUCUCGCUCG
Eukaryotes Bacteria Archaea
rRNA Tree of Life
Eukaryotes
??????
Archaea
Bacteria
Scanned through GOS data for rRNAs that fit this pattern
??????????
RecA vs. rRNA
Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.
Venter et al., Science 304: 66. 2004
RecA Phylotyping - Sargasso Metagenome
RecA Tree of Life?
Archaea
Eukaryotes
Bacteria
???????????
GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Novel RecA Sequences in GOS Data
Wu et al PLoS One 2011
Novel RpoBs too
Wu et al PLoS One 2011
GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Wu et al PLoS One 2011
I am happy to welcome you as a new member of the 4th domain club. If by chance you are passing through Europe I will be delighted to invite you to give a seminar in Marseille and show you our strange bugs.
Kind regards, Didier
Phylogenetic ID of Novel Lineages
Virus Origins
2007-2014: GEBA
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
Synapomorphies Exist
Missing Microbes?
Challenge: Poor Sampling
From Wu et al. 2009 Nature 462, 1056-1060
JGI Dark Matter Project
environmental samples (n=9)
isolation of single cells (n=9,600)
whole genome amplification (n=3,300)
SSU rRNA gene based identification (n=2,000)
genome sequencing, assembly and QC (n=201)
draft genomes (n=201)
Sample sites: SAK, HSM, ETL, TG, HOT, GOM, GBS, EPR, TA
Sample types: seawater, brackish/freshwater, hydrothermal, sediment, bioreactor
[Figure: tree of bacterial and archaeal lineages; legible tip labels follow, some labels unrecoverable]
BACTERIA: GN04, WS3 (Latescibacteria), GN01, LD1, WS1, Poribacteria, BRC1, Lentisphaerae, Verrucomicrobia, OP3 (Omnitrophica), Chlamydiae, Planctomycetes, NKB19 (Hydrogenedentes), WYO, Armatimonadetes, WS4, Actinobacteria, Gemmatimonadetes, NC10, SC4, WS2, Cyanobacteria, Deltaproteobacteria, EM19 (Calescamantes), GAL35, Aquificae, EM3, Thermotogae, Dictyoglomi, SPAM, GAL15, CD12 (Aerophobetes), OP8 (Aminicenantes), AC1, SBR1093, Thermodesulfobacteria, Deferribacteres, Synergistetes, OP9 (Atribacteria), Caldiserica, AD3, Chloroflexi, Acidobacteria, Elusimicrobia, Nitrospirae, 49S1 2B, Caldithrix, GOUTA4, Chlorobi, Tenericutes, Chrysiogenetes, Proteobacteria, TG3, Spirochaetes, WWE1 (Cloacamonetes), ZB3, OP1 (Acetothermia), Bacteroidetes, TM7, GN02 (Gracilibacteria), SR1, BH1, OD1 (Parcubacteria), OP11 (Microgenomates)
ARCHAEA: Euryarchaeota, Micrarchaea, DSEG (Aenigmarchaea), Nanohaloarchaea, Nanoarchaea, Cren MCG, Thaumarchaeota, Cren C2, Aigarchaeota, Cren pISA7, Cren Thermoprotei, Korarchaeota, pMC2A384 (Diapherotrites)
[Figure: multi-panel diagram; legible panel titles and labels follow, chemical structures and other labels unrecoverable]
• archaeal toxins (Nanoarchaea): lytic murein transglycosylase
• stringent response (Diapherotrites, Nanoarchaea): ppGpp; SpoT, RelA, DksA; limiting amino acids; limiting phosphate, fatty acids, carbon, iron; GTP or GDP + ATP; (GTP or GDP) + PPi; expression of components for stress response
• sigma factor (Diapherotrites, Nanoarchaea): -35, -10
• oxidoreductase: e- donor, e- acceptor; reduction, oxidation
• HGT from Eukaryotes (Nanoarchaea): Eukaryota
• murein (peptido-glycan): tetra-peptide
• archaeal type purine synthesis (Microgenomates): PurF, PurD, PurL/Q, PurM, PurK, PurE, PurB, PurP; PRPP; IMP; adenine, guanine; Archaea
• UGA recoded for Gly (Gracilibacteria): ribosome; growing AA chain
Woyke et al. Nature 2013.
Tanja Woyke
Microbial Dark Matter Part 2
• Ramunas Stepanauskas
• Tanja Woyke
• Jonathan Eisen
• Duane Moser
• Tullis Onstott
• More accurate phylogeny
• Rooting
• Incorporating New and Fragmented Data
• Lateral gene transfer
• More biology about the “novel” lineages
Challenge: Reference Information
Three Domains of Life
Bacteria Archaea Eukaryotes
Bacteria Archaea Eukaryotes
Archaea and Bacteria as Sister
Bacteria Archaea Eukaryotes
Eukaryotes and Bacteria as Sister
Bacteria Archaea Eukaryotes
Archaea and Eukaryotes as Sister
Bacteria Archaea Eukaryotes
Other Patterns
Archaea
Bacteria Archaea Eukaryotes
Outgroup for the Tree of Life?
Bacteria Archaea Eukaryotes
Lateral Gene Transfer
Archaea
Bacteria Archaea Eukaryotes
Automated Genome Tree
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna Lang
Better Reference Data (e.g., PhyEco Markers)
Phylogenetic group     Genome Number  Gene Number  Marker Candidates
Archaea                62             145415       106
Actinobacteria         63             267783       136
Alphaproteobacteria    94             347287       121
Betaproteobacteria     56             266362       311
Gammaproteobacteria    126            483632       118
Deltaproteobacteria    25             102115       206
Epsilonproteobacteria  18             33416        455
Bacteroides            25             71531        286
Chlamydiae             13             13823        560
Chloroflexi            10             33577        323
Cyanobacteria          36             124080       590
Firmicutes             106            312309       87
Spirochaetes           18             38832        176
Thermi                 5              14160        974
Thermotogae            9              17037        684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE 8(10): e77033. doi:10.1371/journal.pone.0077033
Better Binning (e.g., Hi-C)
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2:e415 http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the synthetic microbial community are shown before and after filtering, along with the percent of total constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon, species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence  Alignment   % of Total  Filtered    % of aligned  Length     GC     #R.S.
Lac0      10,603,204  26.17%      10,269,562  96.85%        2,291,220  0.462  629
Lac1      145,718     0.36%       145,478     99.84%        13,413     0.386  3
Lac2      691,723     1.71%       665,825     96.26%        35,595     0.385  16
Lac       11,440,645  28.23%      11,080,865  96.86%        2,340,228  0.46   648
Ped       2,084,595   5.14%       2,022,870   97.04%        1,832,387  0.373  863
BL21      12,882,177  31.79%      2,676,458   20.78%        4,558,953  0.508  508
K12       9,693,726   23.92%      1,218,281   12.57%        4,686,137  0.507  568
E. coli   22,575,903  55.71%      3,894,739   17.25%        9,245,090  0.51   1076
Bur1      1,886,054   4.65%       1,797,745   95.32%        2,914,771  0.68   144
Bur2      2,536,569   6.26%       2,464,534   97.16%        3,809,201  0.672  225
Bur       4,422,623   10.91%      4,262,279   96.37%        6,723,972  0.68   369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is shown for read pairs mapping to each chromosome. For each read pair the minimum path length on the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded. The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and plotted.
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1; Lieberman-Aiden et al., 2009). We observed a minor depletion of alignments spanning the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137) due to edge effects induced by BWA treating the sequence as a linear chromosome rather than circular.
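The insert-distance calculation described in the Figure 1 caption (minimum path length on a circular chromosome, a 1000 bp cutoff, 100 equal bins over 2.5 Mb, normalization to sum to 1) could look roughly like the sketch below. The function names and the choice to drop pairs at or beyond 2.5 Mb are assumptions made for illustration, not details taken from the paper's code.

```python
def circular_distance(pos_a, pos_b, chrom_length):
    """Shortest distance between two coordinates on a circular chromosome."""
    direct = abs(pos_a - pos_b)
    return min(direct, chrom_length - direct)


def insert_histogram(read_pairs, chrom_length, min_dist=1000,
                     max_dist=2_500_000, n_bins=100):
    """Bin circular insert distances and normalize the bins to sum to 1.

    read_pairs is an iterable of (pos_a, pos_b) mapping coordinates.
    Pairs closer than min_dist are discarded, as in the caption; pairs at
    or beyond max_dist are also dropped (an assumption of this sketch).
    """
    bin_width = max_dist / n_bins
    counts = [0] * n_bins
    for pos_a, pos_b in read_pairs:
        dist = circular_distance(pos_a, pos_b, chrom_length)
        if dist < min_dist or dist >= max_dist:
            continue
        counts[int(dist // bin_width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


if __name__ == "__main__":
    k12_length = 4_686_137  # E. coli K12 length from Table 1
    # A pair straddling the linearization point is close on the circle.
    print(circular_distance(10, 4_686_130, k12_length))
```

Using the circular distance matters for pairs that straddle coordinate 0: a linear distance would place them at the far end of the distribution even though they are neighbors on the chromosome.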
Beitel et al. (2014), PeerJ, DOI 10.7717/peerj.415 9/19
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs associating each genomic replicon in the synthetic community is shown as a heat map (see color scale, blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same alignment parameters as were used in the top ranked clustering (described above). We first counted the number of Hi-C reads associating each reference assembly replicon (Fig. 2; Table S3), observing that Hi-C data associated replicons within the same species (cell) orders of magnitude more frequently than it associated replicons from different species. The rate of within-species association was 98.8% when ignoring read pairs mapping less than 1,000 bp apart. Including read pairs <1,000 bp inflated this figure to 99.97%. Fig. 3 illustrates this by visualizing the graph of contigs and their associations. Similarly, for the two E. coli strains (K12, BL21) we observed the rate of within-strain association to be 96.36%. When evaluated on genes unique to each strain (where read mapping to each strain would be unambiguous), the self-association rate was observed to be >99%.
We observed that the rate of association of L. brevis plasmids 1 and 2 with each other and with the L. brevis chromosome was at least 100-fold higher than with the other constituents of the synthetic community (Fig. 2). Chromosome and plasmid Hi-C contact maps show that the plasmids associate with sequences throughout the L. brevis chromosome (Fig. 4; Figs. S3–S5) and exhibit the expected enrichment near restriction sites. This demonstrates that metagenomic Hi-C can be used to associate plasmids to specific strains in microbial communities as well as to determine cell co-localization of plasmids with one another.
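A within-species association rate of the kind reported above can be computed from a table of read-pair counts between replicons. The sketch below is only an illustration: the replicon-to-species mapping follows the Figure 2 caption, but the counts in the usage example are hypothetical and the function name `within_species_rate` is an assumption.

```python
# Replicon-to-species mapping taken from the Figure 2 caption.
SPECIES_OF = {
    "Lac0": "L. brevis", "Lac1": "L. brevis", "Lac2": "L. brevis",
    "Ped": "P. pentosaceus",
    "K12": "E. coli", "BL21": "E. coli",
    "Bur1": "B. thailandensis", "Bur2": "B. thailandensis",
}


def within_species_rate(pair_counts, species_of=SPECIES_OF):
    """Fraction of Hi-C read pairs whose two ends map to the same species.

    pair_counts maps (replicon_a, replicon_b) to a read-pair count.
    """
    within = 0
    total = 0
    for (rep_a, rep_b), count in pair_counts.items():
        total += count
        if species_of[rep_a] == species_of[rep_b]:
            within += count
    if total == 0:
        raise ValueError("no read pairs supplied")
    return within / total


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    counts = {("Lac0", "Lac1"): 90, ("K12", "BL21"): 8, ("Lac0", "Ped"): 2}
    print(within_species_rate(counts))
```

Swapping `SPECIES_OF` for a replicon-to-strain mapping would give the within-strain rate instead, which is the quantity quoted for K12 and BL21.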
Variant graph connectedness
Algorithms that reconstruct single-molecule genotypes from samples containing two or more closely-related strains or chromosomal haplotypes depend on reads or read pairs that indicate whether pairs of variants coexist in the same DNA molecule. Such algorithms typically represent the reads and variant sites as a variant graph wherein variant sites are represented as nodes, and sequence reads define edges between variant sites observed in the same read (or read pair). We reasoned that variant graphs constructed from Hi-C data would have much greater connectivity (where connectivity is defined as the mean path length between randomly sampled variant positions) than graphs constructed from mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such connectivity should, in theory, enable more accurate reconstruction of single-molecule genotypes from smaller amounts of data. Furthermore, by linking distant sites with fewer intermediate nodes in the graph, estimates of linkage disequilibrium at distant sites (from a mixed population) are likely to have greater precision.
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend) with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded. Contig associations were normalized for variation in contig size.
To evaluate whether Hi-C produces more connected variant graphs we compared the connectivity of variant graphs constructed from Hi-C data to those constructed from simulated mate-pair data (with average inserts of 5 kb, 10 kb, 20 kb, and 40 kb). To exclude paired-end products from the analysis, Hi-C reads with inserts under 1 kb were excluded from the analysis. For each variant graph constructed from these inputs, 10,000 variant position pairs were sampled at random, with 94.75% and 100% of these pairs belonging to the same connected graph component of the Hi-C and 40 kb variant graphs, respectively.
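The sampling step in this paragraph (drawing random pairs of variant positions and asking whether they fall in the same connected component) can be sketched with a small union-find structure. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the integer node numbering, and the fixed random seed are all hypothetical, and the sketch measures shared-component membership rather than mean path length.

```python
import random


def find(parent, node):
    """Return the representative of node's component (with path halving)."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def component_labels(n_variants, edges):
    """Label each variant site with its connected-component representative.

    Variant sites are nodes 0..n_variants-1; each edge (a, b) means some
    read or read pair covers both variant sites.
    """
    parent = list(range(n_variants))
    for a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    return [find(parent, v) for v in range(n_variants)]


def fraction_connected(n_variants, edges, n_samples=10_000, seed=0):
    """Fraction of randomly sampled variant pairs sharing a component."""
    labels = component_labels(n_variants, edges)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        a, b = rng.sample(range(n_variants), 2)  # two distinct sites
        hits += labels[a] == labels[b]
    return hits / n_samples


if __name__ == "__main__":
    # A chain of four variant sites is fully connected.
    print(fraction_connected(4, [(0, 1), (1, 2), (2, 3)], n_samples=100))
```

Long-range edges, such as those contributed by Hi-C read pairs, merge components that short-insert data leave separate, which is why this fraction rises with insert size.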
Chris Beitel @datscimed
Aaron Darling @koadman
PhyloSift - Automated Bayesian Phylogenomics
[Figure: PhyloSift workflow; legible labels follow]
Input Sequences (each input sequence scanned against both workflows)
rRNA workflow; protein workflow; parallel option
LAST fast candidate search (search input against references)
hmmalign multiple alignment; Infernal multiple alignment
profile HMMs used to align candidates to reference alignment
<600 bp; >600 bp
pplacer phylogenetic placement
Taxonomic Summaries; Sample Analysis & Comparison
Krona plots, Number of reads placed for each marker gene
Edge PCA, Tree visualization, Bayes factor tests
Aaron Darling @koadman
Erik Matsen @ematsen
Holly Bik @hollybik
Guillaume Jospin @guillaumejospin
Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243
Erik Lowe
Normalizing Across Genes: TreeOTU
Wu, D., Doroud, L, Eisen, JA 2013. arXiv. TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic
Dongying Wu
Challenge: Engaging Public
The Rise of Citizen Microbiology
Darlene Cavalier
Eisen Lab Citizen Microbiology
Kitty Microbiome
Georgia Barguil
Jack Gilbert
Project MERCCURI
Phone and Shoes
Kitty Microbiome Project
tinyurl/kittybiome
Holly Ganz
David Coil
Acknowledgements
DOE JGI Sloan GBMF NSF
DHS DARPA
Aaron Darling Lizzy Wilbanks
Jenna Lang Russell Neches
Rob Knight Jack Gilbert Tanja Woyke Rob Dunn
Katie Pollard Jessica Green
Darlene Cavalier
Eddy Rubin Wendy Brown
Dongying Wu
Phil Hugenholtz
DSMZ
Sundar
Srijak Bhatnagar David Coil
Alex Alexiev
Hannah Holland-Moritz Holly Bik
John Zhang
Holly Menninger
Guillaume Jospin
David Lang
Cassie Ettinger
Tim Harkins Jennifer Gardy
Holly Ganz