DESCRIPTION
Metagenomic projects provide a unique window into the genetic composition of microbial communities. To date, metagenomic analyses have focused primarily on studying the composition of microbial populations and inferring shared metabolic pathways. In this work we analyze how high-quality metagenomic data can be leveraged to infer the composition of transcriptional regulatory networks through a combination of in silico and in vitro methods. Using the SOS response as a case example, we analyze human gut microbiome data to determine the composition of the SOS meta-regulon in a natural context. Our analysis provides proof of concept that the existing knowledgebase on regulatory networks and reference genomes can be effectively leveraged to mine metagenomic data and reconstruct multi-species regulatory networks. This approach allows us to identify de novo the core elements of the human gut SOS meta-regulon, highlights the relevance of error-prone polymerases in this stress response, and identifies putative novel SOS protein clusters involved in cell wall biogenesis, chromosome partitioning and restriction modification. The methodology implemented in this work can be applied to other metagenomic datasets and transcriptional systems, potentially providing the means to compare regulatory networks across metagenomes. The use of metagenomic data to analyze transcriptional regulatory networks provides a realistic snapshot of these systems in their natural context and allows probing their extended composition in non-culturable organisms, yielding insights into their interconnection and into the overall structure of transcriptional systems in microbiomes.
BIOLOGICAL SCIENCES
Beyond the regulon: reconstructing the SOS response of the human gut microbiome
Ivan Erill
ACACGGATCGATCGAGGCATGGCATGGTCGTTGATTGCTGATTTTGAATGATCGATCGATCGATGGGC010101001100100001ACCATCGATTCGATGCATCGATCAGTTGCTCTCTTCTCAGAGAGAG0101010100101010001000111111010101111010CGGATGCATGCATGCATGGCCCCTTCGCTCGCTAAG10101010001010101000001011100010100010101101001110GGCTGATCCACATG010101010101010101010100101010101000010100101001010101010000100010011011
ACAACGCCTERILLGTATAGCAGTGTGTCATTGCTTTAGCTAGTACACAGACACGCBIOLOGICALATUMBC0101010101110001010100010LAB010010101001000011110001010001010001001011100SCIENCESCCAGGACATGAGCTAAAAC
The researchsome
Comparative genomics
Molecular microbiology
Computational biology
Bioinformatics
Transcription factors
Stress responses
Microbial metagenomics
Codon usage indices
Machine learning
Evolutionary simulations
Motif search & discovery
High-throughput assays
Clinical microbiology
Molecular phylogeny
On regulons
Regulons
  Sets of genes/operons (transcriptionally) regulated by a particular transcription factor (TF)
  Cellular response to specific internal or external stimuli
  Defined by specific binding of the TF to the promoter region of regulated genes
  Regulon genes can be repressed or activated
  The TF recognizes a specific binding motif
Guzmán-Vargas and Santillán, BMC Systems Biology 2:13 (2008)
[Figure: schematic bacterial promoter. Labels: ATGTCGATCAGCTAGCC..., RNA-polymerase, Transcription Factor (TF), Open reading frame]
[Figure: regulon network diagram. Labels: S, TFi, TFx, TFy, TG1, TG2, TG3, TG4, Gx, Regulon]
[Figure: TF-binding motif. Example sites: CTGTAAAG, CTGCACAG, CTGATCAG]
On metagenomes
Metagenome
  Multi-species, heterogeneous collection of high-throughput reads from a natural habitat
The good
  “Unculturable” species
  Diversity sampling
  Natural population sampling
The bad
  Low coverage
  High levels of polymorphism
  Diversity of low-complexity regions
  Contamination with eukaryotic DNA
The ugly
  Lack of proper models for pre-filtering, assembly, gene calling, analysis?
[Figure: high-throughput sequencing]
Gest, H. Microbiology Today 35: 220 (2008)
P. D. Schloss and J. Handelsman, Genome Biol. 6:229 (2005)
On metagenomes
Metagenome
  Multi-species, heterogeneous collection of high-throughput reads from a natural habitat
Properties
  Lots of data!
  Noisy!
  Increasingly cheap and abundant!
Typical format after post-processing
  Assembled contigs/scaffolds with predicted, functionally annotated genes
Problem
  How do we extract useful information from metagenome data?
  (i.e. how do we evade Brenner’s “low input, high-throughput, no output” epithet?)
[Figure: high-throughput sequencing, followed by assembly, gene calling & functional annotation]
Friedberg, E. C. Nat Rev Mol Cell Biol 9, 8-9 (2008)
Analysis of metagenomic data
The metagenome & regulatory networks
The metagenome
  Multi-species, heterogeneous collection of high-throughput reads from a natural habitat
Problem
  How do we extract useful information from metagenome data?
Conventional workflow (e.g. metabolic networks)
  Knowledge from references is used as terminal
  Data is mapped onto existing, static knowledgebase
  Inference on mapped data
  Interesting repertoire of new questions
[Figure: high-throughput sequencing; assembly, gene calling & functional annotation; pathway mapping, clustering and enrichment (phylogeny, pathway). Map to reference: low discovery potential]
Muegge, B. D. et al. Science 332 (6032), 970-974 (2011)
Analysis of metagenomic data
The metagenome & regulatory networks
The metagenome
  Multi-species, heterogeneous collection of high-throughput reads from a natural habitat
Problem
  How do we extract useful information on regulatory networks from metagenome data?
Alternative workflow (e.g. regulatory networks)
  Knowledge from reference used as seed
  Directed mining of metagenome data
  Inference on mined data
  Promising questions and challenges
    Is network composition governed by convergent evolution or by phylogeny?
    Can we effectively infer regulatory networks from metagenomics data?
[Figure: high-throughput sequencing; assembly, gene calling & functional annotation; regulon analysis, clustering and enrichment. Seed reference: high discovery potential]
Analysis of metagenomic data
Metagenomics and regulatory network analysis
Advantages
  Real bacterial populations
  Unculturable organisms and mobile elements
  Variability at species and subspecies levels
Challenges
  Noisy search process, huge dataset
  How to: data integration, enrichment and analysis
Goals
  Proof of concept
    Analyze the potential of metagenomic & regulatory sequence data to explore known regulatory systems
    Study a regulatory network in its natural setting
Analysis of metagenomic data
Metagenomics and regulatory network analysis
Requires
  A regulatory network to analyze
    The bacterial SOS response
  A metagenome on which to analyze it
    The human gut microbiome
The bacterial SOS response
Transcriptional response against DNA damage
“Canonical” stress response
  Widespread in bacteria
    Well-characterized in most bacterial phyla
    E. coli, B. subtilis, M. tuberculosis, V. parahaemolyticus, S. meliloti, B. bacteriovorus, X. campestris, G. sulfurreducens…
  Two-component system
    RecA (sensor)
    LexA (repressor)
    Response to DNA-damaging agents
  Well-characterized regulon
    Target genes
      ~40 in E. coli / ~30 in B. subtilis
    Functions
      Recombination & DNA repair
      Cell-division inhibition
      Translesion synthesis
Erill, I. et al. FEMS Microbiol. Rev. 31 (6), 637 (2007)
The bacterial SOS response
Transcriptional response against DNA damage
High clinical relevance
  Widespread in bacteria
  Two-component system
    RecA (sensor)
    LexA (repressor)
    Response to
      Broad range of antibiotics
      Bacteriophage infection
  Extended regulon
    Functions
      Integron recombination
      Bacteriophage induction
      Toxin production
      Dissemination of pathogenicity islands
      Antibiotic-induced mutagenesis
      Regulation of persistence
Erill, I. et al. FEMS Microbiol. Rev. 31 (6), 637 (2007)
Guerin, E. et al. Science 324 (5930), 1034 (2009)
The bacterial SOS response
Transcriptional response against DNA damage
Interesting evolution
  Widespread in bacteria
    Absent in some clades (Bacteroidetes/Chlorobi group)
    Supplanted by competence regulon (S. pneumoniae)
  Extreme diversity of LexA-binding motifs
    Clade-specific & monophyletic
[Figure: LexA-binding motifs by clade. Labels: Geobacteres, Gram-positive, Myxobacteriales, Xanthomonadales, Alpha Proteobacteria, Beta/Gamma Proteobacteria, Cyanobacteria, Fibrobacteres]
Erill, I. et al. FEMS Microbiol. Rev. 31 (6), 637 (2007)
The human gut microbiome
Metagenomics project
Target metagenome
  Human microbiome
    Multiple datasets (locations: gut, armpit, etc.)
    Multiple initiatives (HMP & MetaHit)
    Available data & features:
      High-throughput sequencing + 16S rRNA data
      ORF predictions & functional annotation
  MetaHit human gut microbiome
    86 healthy subjects
    Large contigs, high-quality gene calling
    7.1 Gbp total sequence, 4.5 M contigs (N50: 2.2 kbp)
    9.3 M predicted ORFs (3.7 M complete), λ=660 bp
    1 M COG annotations
[Figure: taxonomic composition. Labels: Gammaproteobacteria, Actinobacteria, Other, Bacteroides, Firmicutes]
Qin, J. et al. Nature 464, 59 (2010)
Nelson, K. E. et al. Science 328, 994 (2010)
Segata, N. et al. Genome Biol. 13, R42 (2012)
Analysis workflow
Workflow
  Data compilation
    LexA-binding motif compilation
      Gram-positive bacteria
      CollecTF database (collectf.umbc.edu)
      118 sites, 8 species
    Reference genome panel
      121 genomes from MetaHit and the Human Microbiome Jumpstart Reference Strains Consortium
    Reference SOS response
      18 described SOS responses
        Acidobacteria, Alphaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Bacilli, Clostridia, Actinobacteria, Fibrobacteria
      272 regulated genes
[Figure: Gram-positive reference motif]
Kiliç, S. et al. Nucleic Acids Res. 42, D156-D160 (2013)
Nelson, K. E. et al. Science 328, 994 (2010)
Cornish, J. P. et al. Evol. Bioinform. 8, 449-461 (2012)
Erill, I. et al. FEMS Microbiol. Rev. 31 (6), 637 (2007)
Analysis workflow
Workflow
  Data compilation
    LexA-binding motif compilation
    Reference genome panel
    Reference SOS response
  Metagenome mining
    PSSM-based search
      Reference motif, 2 strands
    Operon prediction
    Site-operon association
      Distance-based
    Taxonomic annotation
      Through reference panel mapping, for phylogenetic filtering of results
    Functional clustering
      Through COG mapping, for functional analysis
[Figure: PSSM-based search. Example sites: GAACTACTGTTC, GTACAACTGTTC, GATCTATTGTTC, GAACTCATGTTT, GTTCAAAAGATC, GAACTCCTGTCC. Inset: LexA-binding motif score histogram (x-axis: score; y-axis: frequency)]
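The PSSM-based search on both strands can be illustrated with a short sketch. This is not the pipeline used in the talk: the six example sites from the figure stand in for the 118-site reference collection, and the uniform background, the pseudocount of 1, the 10-bit threshold, the toy contig and the function names (`build_pssm`, `scan_both_strands`) are all assumptions made for the example.

```python
import math

# Example sites taken from the slide figure (stand-ins for the full reference motif).
SITES = [
    "GAACTACTGTTC", "GTACAACTGTTC", "GATCTATTGTTC",
    "GAACTCATGTTT", "GTTCAAAAGATC", "GAACTCCTGTCC",
]
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per motif column."""
    pssm = []
    for col in zip(*sites):
        total = len(col) + 4 * pseudocount
        pssm.append({
            b: math.log2(((col.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def score(pssm, window):
    """Sum the per-position log-odds scores (in bits) of one window."""
    return sum(column[base] for column, base in zip(pssm, window))


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(pssm, seq, threshold):
    """Yield (position, strand, score) for every window scoring >= threshold."""
    width = len(pssm)
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if set(window) - set(BASES):
            continue  # skip windows with ambiguous bases such as N
        for strand, w in (("+", window), ("-", reverse_complement(window))):
            s = score(pssm, w)
            if s >= threshold:
                yield i, strand, s


pssm = build_pssm(SITES)
contig = "TTTTGAACTACTGTTCAAAA"  # hypothetical contig containing one example site
hits = list(scan_both_strands(pssm, contig, threshold=10.0))
```

Scoring the reverse complement of each window, rather than reverse-complementing the whole contig, keeps reported positions in forward-strand coordinates, which simplifies the later site-operon association.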
The human gut microbiome
Workflow
  Data compilation
    Motif compilation
    Reference genome panel
    Reference SOS response
  Metagenome mining
    PSSM-search
    Operon prediction
    Site-operon association
    Phylogeny annotation
    Functional clustering
  Analysis
    Positional enrichment analysis
    Data filtering
    COG enrichment analysis
    Gene-based functional analysis
[Figure: data for analysis. Example sites: GAACTCATGTTT, GAACTACTGTTC]
The human gut SOS response
Initial search results
  Over 500,000 putative LexA-binding sites identified
Positional enrichment analysis
  Promoter regions
    Site scores are significantly enriched in promoter regions
    High-scoring sites co-localize in promoter regions
[Figure: permutation analysis of site scores]
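A permutation analysis of site scores could look roughly like the sketch below. The talk does not state the test statistic or the number of permutations, so the difference of mean scores, the 10,000 label shuffles, the function name `permutation_p_value` and the toy scores are assumptions used only to illustrate the idea.

```python
import random


def permutation_p_value(promoter_scores, other_scores, n_permutations=10_000, seed=0):
    """One-sided p-value for mean(promoter) - mean(other) under label shuffling."""
    rng = random.Random(seed)
    pooled = list(promoter_scores) + list(other_scores)
    k = len(promoter_scores)

    def mean_diff(values):
        return sum(values[:k]) / k - sum(values[k:]) / (len(values) - k)

    observed = mean_diff(pooled)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign "promoter" / "elsewhere" labels at random
        if mean_diff(pooled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy scores in bits (hypothetical, not data from the talk).
promoter = [16.2, 17.5, 15.8, 18.1, 16.9, 17.3]
elsewhere = [11.0, 12.4, 10.7, 13.1, 11.9, 12.2, 10.5, 12.8]
observed, p = permutation_p_value(promoter, elsewhere)
```

A small p-value here would indicate that scores in promoter regions are higher than expected if site location were unrelated to score.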
The human gut SOS response
Data filtering
  Two-pronged approach
    Distance-based
      Only sites located between -350 and +50 of the predicted TLS
    Taxonomy-based
      Only sites associated with predicted protein-coding genes mapping to Gram-positive reference genomes
Filtering results
  Dramatic reduction in the number of putative sites
    Over 43,000 sites meeting both criteria
  Taxonomy-based filtering provides enhanced resolution
    Law of large numbers: high-scoring sites can be identified in the promoter region of many Bacteroides genes
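The two filtering criteria can be sketched as a simple predicate. The record layout (`Site` and its fields), the reading of TLS as the predicted translation start of the associated gene, the strand handling and the set of taxon labels treated as Gram-positive are assumptions for illustration; only the -350 to +50 window comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Site:
    position: int      # site midpoint on the contig (hypothetical field)
    score: float       # PSSM score in bits
    gene_start: int    # predicted translation start of the associated gene
    gene_strand: str   # "+" or "-"
    taxon: str         # label inherited from reference-panel mapping


# Hypothetical labels; the talk only states "Gram-positive reference genomes".
GRAM_POSITIVE = {"Firmicutes", "Actinobacteria"}


def relative_position(site):
    """Offset of the site from the gene start, negative meaning upstream."""
    offset = site.position - site.gene_start
    return offset if site.gene_strand == "+" else -offset


def passes_filters(site, upstream=-350, downstream=50):
    """Keep a site only if it meets both the distance and the taxonomy criterion."""
    in_window = upstream <= relative_position(site) <= downstream
    return in_window and site.taxon in GRAM_POSITIVE


sites = [
    Site(900, 17.1, 1000, "+", "Firmicutes"),     # 100 bp upstream: kept
    Site(1500, 18.0, 1000, "+", "Firmicutes"),    # inside the gene body: dropped
    Site(1100, 16.4, 1000, "-", "Firmicutes"),    # upstream on the minus strand: kept
    Site(950, 19.2, 1000, "+", "Bacteroidetes"),  # not Gram-positive: dropped
]
kept = [s for s in sites if passes_filters(s)]
```

The last toy record mirrors the point made on the slide: a high-scoring site upstream of a Bacteroides gene passes the distance filter and is only removed by the taxonomy filter.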
The human gut SOS response
COG category analysis
  Inferred regulon maps experimentally characterized SOS responses
    Gradual enrichment of canonical SOS categories with score cutoff: repair/replication (L), signal transduction (T) and transcription (K) genes
    Cell cycle control (D) category not enriched
      COGs are getting old!
[Figure: relative frequency by COG category (J, K, L, D, V, T, M, C, G, F, R, S). Series: MetaHit COG reference; COGs with SOS site; COGs with site >12 bits; COGs with site >14 bits; COGs with site >16 bits; SOS ensemble reference]
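The enrichment of COG categories with increasing score cutoff can be tabulated as in the sketch below. The input format (one COG category letter and best site score per hit), the function name `category_frequencies` and the toy values are assumptions; the cutoffs of 12, 14 and 16 bits are the ones named in the figure legend.

```python
from collections import Counter


def category_frequencies(hits, min_score=None):
    """Relative frequency of COG categories among hits scoring above min_score."""
    selected = [cat for cat, s in hits if min_score is None or s > min_score]
    counts = Counter(selected)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


# Hypothetical (COG category, site score in bits) pairs, not data from the talk.
hits = [("L", 17.2), ("L", 15.1), ("K", 15.4), ("T", 13.0),
        ("J", 11.5), ("M", 12.2), ("L", 14.6), ("R", 10.9)]

# Track how the share of category L (replication/repair) changes with the cutoff.
l_share = {}
for cutoff in (None, 12, 14, 16):
    freqs = category_frequencies(hits, cutoff)
    l_share[cutoff] = freqs.get("L", 0.0)
```

In this toy input the share of category L grows as the cutoff rises, which is the pattern the slide describes for the canonical SOS categories.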
The human gut SOS response
COG analysis
Question
  How to identify “SOS COGs”?
Score enrichment measure
  Goal
    Identify bona fide members of the regulon
    Capture maximum number of known SOS genes
  Analysis of canonical SOS genes in 308 Gram-positive genomes
    LexA-binding site scores normally distributed (lexA: μ=16.2 bits, σ=2.3; recA: μ=16.3 bits, σ=2.5)
    Cumulative distribution approximately linear in central scoring range 12-20 bits
  Prototypical SOS COG
    High linear coefficient of determination (R²>0.85, empirically set)
    At least:
      one site above average score (16 bits)
      10 sites in 12-20 bit range
[Figure: canonical SOS genes. Cumulative distribution vs. site score (bits) and quantile-quantile plot (theoretical vs. empirical) for lexA (Firmicutes) and recA (Firmicutes)]
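The three criteria for a prototypical SOS COG can be checked with a short sketch. The talk does not say how the cumulative distribution was built or fitted, so this is one plausible reading: an empirical cumulative distribution over the sites in the 12-20 bit range, with a least-squares line. The function names and toy scores are hypothetical; the thresholds are the ones on the slide.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy ** 2 / (sxx * syy)


def looks_like_sos_cog(scores, low=12.0, high=20.0, mean_score=16.0,
                       min_sites=10, min_r2=0.85):
    """Apply the three slide criteria to the site scores of one COG."""
    in_range = sorted(s for s in scores if low <= s <= high)
    if len(in_range) < min_sites or max(scores) <= mean_score:
        return False, 0.0
    # Empirical cumulative distribution restricted to the 12-20 bit range.
    cumulative = [(i + 1) / len(in_range) for i in range(len(in_range))]
    r2 = r_squared(in_range, cumulative)
    return r2 > min_r2, r2


# Toy COGs (hypothetical scores in bits).
spread = [12.5, 13.2, 14.0, 14.7, 15.3, 16.1, 16.8, 17.4, 18.2, 19.0, 19.6]
clumped = [12.0] * 9 + [12.1, 19.9]

spread_ok, spread_r2 = looks_like_sos_cog(spread)
clumped_ok, clumped_r2 = looks_like_sos_cog(clumped)
```

Scores spread evenly across the range give a near-linear cumulative curve and pass, whereas a COG whose sites pile up near the lower bound with a single high-scoring outlier fails the linearity criterion.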
The human gut SOS responseCOG analysis
ResultsDetection of canonical SOS regulon
lexA, recA, excision repair, recombinationSOS meta-regulon composition
Four major functions Transcriptional repression (lexA) Translesion synthesis (dinB, uvrX, imuB, umuD) Sensing of DNA-damage & stabilization (recA) Excision repair (uvrA, uvrB, uvrD, pcrA)
Translesion synthesis as primary SOS component Interesting new putative SOS regulon COGs
COG0732 HsdS – restriction endonuclease
COG2001 MraZ – cell wall biogenesis
COG4974 CodV – chromosome partitioning
r²      Associated genes      COG
0.98    dinB, imuB, uvrX      COG0389
0.98    recA, imuA            COG0468
0.97    uvrB                  COG0556
0.96    lexA, umuD            COG1974
0.92    uvrD, pcrA            COG0210
0.91    mraZ                  COG2001
0.91    hsdS                  COG0732
0.91    uvrA                  COG0178
0.88    parE                  COG0187
0.87    codV                  COG4974
0.87    ruvA                  COG0632
0.86    recN                  COG0497
[Figure: bar chart of the normalized number of sites (0-1) per COG: COG1974 (lexA, umuD), COG0389 (dinB, uvrX, imuB), COG0468 (recA, imuA), COG0556 (uvrB), COG0210 (uvrD, pcrA), COG0178 (uvrA), COG2001 (mraZ), COG0497 (recN), COG0187 (parE), COG0732 (hsdS), COG4974 (codV), COG0632 (ruvA)]
The human gut SOS response
Targeted gene analysis
Assessment of non-canonical functions in genes with high-scoring sites
Toxin-antitoxin / virulence systems (higB / rhuM): linked to persistence phenotypes
Phage integrases (intP): in line with integron integrase regulation and phage control by SOS response
Validation of enriched COGs
Cell wall biogenesis (mraZ): possible role in cell division control
Evidence of convergent regulation: YneA (B. subtilis), DivS (C. glutamicum)
Experimental validation
EMSA with purified B. subtilis protein
[Figure: EMSA gel with paired - / + lanes for the recA, mraZ, intP and rhuM probes]
Beyond the regulon
Proof of concept: the human gut SOS meta-regulon
Methodology
Provides the means to expand our knowledge on described regulatory systems
COG enrichment as a method for functional assessment of the meta-regulon
Analysis allows visualizing a regulatory response in a wild population
Inference of novel knowledge on regulon function and components
Consistent with known SOS responses; primary focus on mutagenesis
Contains several elements linking it to other cellular processes of clinical relevance
Future directions
Analyze and compare regulatory networks in metagenomes
Is network evolution dictated by phylogeny or habitat?
How do changes in habitat affect meta-regulons?
How does the overlap between meta-regulons vary among populations?
Beyond the regulon
Automating meta-regulon inference
A transcription factor:
Exists in a subset of species
Binding sites for the TF are enriched in a subset of functional clusters
How can we automatically determine the set of species & COGs?
[Figure: LexA-binding site score distribution, two panels (x-axis: score in bits, 5 to 30 and -60 to 40; y-axis: average score count in gene upstream regions) for Firmicutes (SOS COGs), Firmicutes (random COGs) and all taxa, all COGs]
Beyond the regulon
EM algorithm for isolation of enriched COGs/taxa
Define likelihood function
Statistical test for mixture model in observed distribution
Assign weights to COGs (Ci) and taxa (Tj)
For given COG weights, compute likelihood of each taxon, update weight with likelihood
For given taxon weights, compute likelihood of each COG, update weight with likelihood
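The alternating update described above can be illustrated with a small Python sketch. It is a simplified stand-in, not the EM derivation or likelihood function used in the talk, which the slides do not specify. Everything here is an assumption for illustration: the input matrix of mean site scores per COG and taxon, the two equal-variance Gaussian components (a regulated mean of 16 bits and a hypothetical background mean of 8 bits), and the logistic weight update with a flat prior.

```python
import math
from typing import List, Tuple


def log_likelihood_ratio(score: float, fg_mean: float = 16.0,
                         bg_mean: float = 8.0, sd: float = 2.5) -> float:
    """Log ratio of two equal-variance Gaussian densities (regulated vs. background)."""
    return ((score - bg_mean) ** 2 - (score - fg_mean) ** 2) / (2.0 * sd * sd)


def sigmoid(z: float) -> float:
    z = max(-50.0, min(50.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def alternate_weights(scores: List[List[float]],
                      iterations: int = 20) -> Tuple[List[float], List[float]]:
    """scores[i][j]: hypothetical mean site score (bits) for COG i in taxon j."""
    n_cogs, n_taxa = len(scores), len(scores[0])
    llr = [[log_likelihood_ratio(s) for s in row] for row in scores]
    cog_w = [0.5] * n_cogs    # weights Ci, start uninformative
    taxon_w = [0.5] * n_taxa  # weights Tj, start uninformative
    for _ in range(iterations):
        # For given COG weights, score each taxon and update its weight.
        taxon_w = [sigmoid(sum(cog_w[i] * llr[i][j] for i in range(n_cogs)))
                   for j in range(n_taxa)]
        # For given taxon weights, score each COG and update its weight.
        cog_w = [sigmoid(sum(taxon_w[j] * llr[i][j] for j in range(n_taxa)))
                 for i in range(n_cogs)]
    return cog_w, taxon_w


# Toy data: COGs 0-1 carry high-scoring sites only in taxa 0-1.
toy_scores = [
    [16.0, 16.0, 8.0, 8.0],
    [16.0, 16.0, 8.0, 8.0],
    [8.0, 8.0, 8.0, 8.0],
    [8.0, 8.0, 8.0, 8.0],
]
cog_weights, taxon_weights = alternate_weights(toy_scores)
print([round(w, 3) for w in cog_weights])
print([round(w, 3) for w in taxon_weights])
```

On the toy matrix the weights of the planted COGs and taxa move toward 1 and the rest toward 0, which is the behavior the slide asks of the procedure: each set of weights sharpens the evidence used to update the other.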
[Figure: example weight grid shown over successive update steps, with COG weights C1 = 0.3, C2 = 0.2, C3 = 0.3, C4 = 0.7, C5 = 0.8, C6 = 0.1 and taxon weights T1 = 0.5, T2 = 0.6, T3 = 0.9, T4 = 0.2, T5 = 0.4, T6 = 0.5; in the later panels the T3 weight is 0.8]
Conclusions & Acknowledgements
Acknowledgements
Erill Lab: Joe Cornish, Neus Sanchez-Alberola, Pat O’Neill, Jameel Gheba, Ron O’Keefe, Talmo Pereira, David Nicholson
Wolf Lab: Richard Wolf, Lanyn Perez
Barbé Lab: Susana Campoy, Jordi Barbé
Funding: UMBC Office of Research – Special Research Assistantship/Initiative Support; NSF grant MCB-1158056
BIOLOGICAL SCIENCES
Beyond the regulon: reconstructing the SOS response of the human gut microbiome
Ivan Erill