Introductory lecture on microbial genomics for University of Birmingham Biosciences first-year module Bio153
Bio153 Microbial Genomics
Professor Mark Pallen, University of Birmingham
Microbial Genomics
  General features of microbial genomes
  Historical overview
  Genome sequencing, annotation and analysis
  Genome evolution
  What can we learn from a genome sequence?
General features of genomes
Microbial
  Small WYSIWYG genomes (Mbp)
  Gene density high (>90%)
  Intergenic regions short
  Very little repetitive or non-coding DNA
  Introns very rare
  Protein-coding genes (CDS) short (~1 kbp)
  Operons with promoters just upstream
  Fewer non-coding RNAs
Human
  Very large genomes (Gbp)
  Gene density low
  Only 25% is genes
  Introns mean only 1% codes
  Genes can span ≥30 kbp
  Genes have ~3 transcripts
  Splicing and splice variants
  Promoter regions distant from gene
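As a rough worked example of what the microbial figures imply, the sketch below estimates how many genes fit in a genome from its size, gene density and average gene length. The function name and the 4.6 Mbp input are illustrative assumptions, not values from the lecture.

```python
def estimate_gene_count(genome_bp, coding_fraction, mean_gene_bp):
    """Rough estimate: coding bases divided by the average gene length."""
    return round(genome_bp * coding_fraction / mean_gene_bp)


# Assumed example: a 4.6 Mbp bacterial genome, 90% coding, ~1 kbp per gene.
bacterial_genes = estimate_gene_count(4_600_000, 0.90, 1_000)
print(bacterial_genes)
```

The estimate lands in the low thousands of genes, which is the order of magnitude expected for a free-living bacterium.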
Bacterial genome organisation
Chromosomes
  Most commonly a single circular chromosome (always DNA)
  BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)
  BUT a few species have two chromosomes (e.g. Vibrio cholerae)
  Can be a mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)
Plasmids
  Independent autonomous replicon, can be circular or linear
  May integrate into chromosome
  Copy number varies from 1 to 10s
  Often carry non-essential genes that confer an adaptive advantage in certain conditions
Bacterial Genome Size
  Species which occupy restricted ecological niches (e.g. obligate intracellular parasites and endosymbionts) tend to have smaller genomes (<1.5 Mb) than generalist bacteria
  Smallest known bacterial genome: Carsonella ruddii, 160 kb! (Nakabachi et al. 2006)
  BUT mitochondrial genomes are smaller
  Largest genomes found in bacteria with complex developmental cycles, e.g. Streptomyces
  Largest bacterial genome: Sorangium cellulosum, 13 Mb
Bacterial genomes are made from DNA
  In 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty showed that DNA (not proteins) was the genetic material responsible for inheritance.
  They identified DNA as the "transforming principle" while studying Streptococcus pneumoniae.
    Avery, Oswald T., Colin M. MacLeod, and Maclyn McCarty. Studies on the chemical nature of the substance inducing transformation of pneumococcal types. Journal of Experimental Medicine. 1944 Feb 1; 79(2): 137-158.
  In 1952, this work was supported by Alfred Hershey and Martha Chase, who showed that only the DNA of a virus needs to enter a bacterium to infect it.
  They used radioactively labelled bacteriophage.
    Hershey AD and Chase M. Independent functions of viral protein and nucleic acid in growth of bacteriophage. Journal of General Physiology. 1952. 36: 39-56.
Viral genomes are variable
  Use RNA or DNA but not both in genome
  Some have RNA genomes!
  Grouped into families depending on type of genome: DNA or RNA, single- or double-stranded
  Typically dozens of genes or fewer
  Large genomes in pox viruses (~200 kb)
  Massive genomes in megaviruses (1 Mbp!)
Microbial Genomics Timeline
Year  Milestone
1977  Invention of dideoxy chain terminator sequencing (“Sanger sequencing”)
1979  Sequencing of the 5.3-kilobase genome of bacteriophage phiX174
1981  First human mitochondrial genome sequence*
1982  Determination of the 48.5-kilobase genome sequence of bacteriophage lambda through first use of shotgun sequencing
1986  Development of automated fluorescent sequencing
1995  First complete genome sequences obtained of free-living bacteria (Haemophilus influenzae and Mycoplasma genitalium)
1996  Mycoplasma becomes first bacterial genus that has completely sequenced genomes from two different species (M. genitalium and M. pneumoniae)
1997  First genome sequences from Escherichia coli and Bacillus subtilis
1998  First genome sequence from Mycobacterium tuberculosis; genome sequence from Rickettsia prowazekii provides first evidence of reductive evolution
1999  Helicobacter pylori becomes the first species with completely sequenced genomes from two isolates
2000  Meningococcal genome sequence primes first application of reverse vaccinology
2001  Second E. coli genome sequences reveal unexpected level of horizontal gene transfer; genome sequence of M. leprae provides compelling evidence of bacterial pseudogenes and reductive evolution; first paper reporting genome sequences of two strains from one species (Staphylococcus aureus) in a single publication
2002  Genome sequencing of multiple strains of Bacillus anthracis to provide markers for forensic epidemiology
2003  Genome sequencing of uncultivable Tropheryma whipplei leads to design of axenic growth medium
2004  Genome sequence of mimivirus blurs distinctions between bacteria and viruses
2005  Whole-genome sequencing used to identify target of new anti-tuberculosis drug; Mycoplasma genitalium genome sequenced using pyrosequencing
2006-2011  Bacterial metagenomics survey of the Sargasso Sea yields >1 million new genes; rise of next-generation or high-throughput sequencing
The first genome sequences
  The first sequenced gene was from bacteriophage MS2: the gene encoding the coat protein (1972)
    Min Jou W, Haegeman G, Ysebaert M, and Fiers W. Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature. 1972 May 12; 237(5350): 82-88.
  The first sequenced genome was bacteriophage MS2 (1976); the RNA genome is 3,569 nucleotides
    Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J, Min Jou W, Molemans F, Raeymaekers A, Van den Berghe A, Volckaert G, and Ysebaert M. Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature. 1976 Apr 8; 260(5551): 500-507.
  The first sequenced DNA genome was bacteriophage Φ-X174 (1977); 5,386 base pairs
    Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, and Smith M. Nucleotide sequence of bacteriophage phi X174 DNA. Nature. 1977 265 (5596): 687-695.
  The first sequenced bacterial genome was Haemophilus influenzae (1995); 1,830,140 base pairs
    Fleischmann R, Adams M, White O, Clayton R, Kirkness E, Kerlavage A, Bult C, Tomb J, Dougherty B, and Merrick J. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 1995. 269 (5223): 496-512.
Overview of a genome project
  Choose strain
    Fresh isolate or tractable lab strain?
  Choose strategy
    Shotgun sequencing
    Paired-end sequencing
    Draft or complete?
  Choose chemistry
    Sanger; 454; Illumina; Ion Torrent
  Assembly
    Automated
  Closure and finishing
    Manually intensive
    Difficulty depends on how repetitive
  Data release
    Immediate or delayed?
  Annotation
    Manually intensive bottleneck
  Publication
Methods for genome sequencing – historic: Sanger method sequencing
  Sanger F and Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology. 1975 94: 441-448.
  Step 1: a sequence-specific DNA primer is radiolabelled
  Step 2: the primer is annealed to the template DNA
  Step 3: the primer is extended by DNA polymerase
    Incorporation of a deoxynucleotide: further extension possible
    Incorporation of a dideoxynucleotide: chain termination
  Four reactions set up
    ddATP, dATP, dCTP, dGTP, dTTP
    ddCTP, dATP, dCTP, dGTP, dTTP
    ddGTP, dATP, dCTP, dGTP, dTTP
    ddTTP, dATP, dCTP, dGTP, dTTP
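The logic of the four reactions can be sketched in a few lines of Python. This is an idealised illustration, not a model of real gel data: it assumes every possible termination product is present in each reaction, and the function names and the example strand are hypothetical.

```python
def sanger_fragments(synthesised):
    """Return, for each ddNTP reaction, the lengths of the terminated fragments.

    In the ddA reaction, chains stop wherever an A is incorporated, and so on.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesised, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering fragments from shortest to longest."""
    by_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(by_length[n] for n in sorted(by_length))


lanes = sanger_fragments("GATTACA")  # hypothetical newly synthesised strand
print(lanes)
print(read_gel(lanes))
```

Each lane only reveals where one base occurs; sorting the fragments from all four lanes by size recovers the order of bases in the synthesised strand.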
Methods for genome sequencing – automated Sanger sequencing
  Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, and Hood LE. Fluorescence detection in automated DNA sequence analysis. Nature. 1986 321: 674-679.
  Replaced radioisotopes with fluorescent dyes
    Safer for the researchers
    Each of the four DNA bases could be dyed a different colour
    Eliminated the need to run separate reactions in separate lanes
    The migration of the dye could be read because of the fluorescence
    This information allowed automatic gel reading
  Further improvements were made
    Improved dye chemistry using fluorescent dideoxy-terminators (DuPont):
      Prober JM, Trainor GL, Dam RJ, Hobbs FW, Robertson CW, Zagursky RJ, Cocuzza AJ, Jensen MA, and Baumeister K. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 1987 238: 336-341.
    Replacing slab gels with re-useable capillary tubes:
      Ruiz-Martinez MC, Berka J, Belenkii A, Foret F, Miller AW, and Karger BL. DNA sequencing by capillary electrophoresis with replaceable linear polyacrylamide and laser-induced fluorescence detection. Analytical Chemistry 1993 65: 2851-2858.
Whole-Genome Shotgun Sanger Sequencing
  Figure labels (workflow): bacterial chromosome; random shearing; size selection; cloning into plasmid vector; pick colonies to create shotgun library; plasmid preps; sequence each insert with two primers
High-throughput Sequencing
  100x faster, 100x cheaper!
  A disruptive technology
  Several technologies in the marketplace from 2007 onwards
    454 (Roche)
    Illumina
    Ion Torrent
    PacBio
  Fundamentally new approaches
    Solid-phase amplification of clonal templates in “molecular colonies”
    Massive increase in number of “clones” compensates for shorter read length
  New chemistries for sequence reading
    454: pyrophosphate detection on base addition
    Illumina: reversible de-protection of fluorescent bases
High-Throughput Shotgun Sequencing
  Figure labels (workflow): bacterial chromosome; random shearing; size selection; add adapters; amplify; sequence
454 sequencing
Emulsion-based clonal amplification
  Anneal sstDNA to an excess of DNA Capture Beads
  Emulsify beads and PCR reagents in water-in-oil microreactors
  Clonal amplification occurs inside microreactors
  Break microreactors, enrich for DNA-positive beads
Pyrosequencing
  DNA template with primer is mixed with the enzymes and the two substrates adenosine 5'-phosphosulfate (APS) and luciferin
  1. One of the four nucleotides is added to the reaction
  2. If it is complementary to the base in the template strand, DNA polymerase incorporates it
  3. Pyrophosphate (PPi) is released, then converted to ATP by sulfurylase in the presence of APS
  4. ATP serves as a substrate for luciferase, causing a light reaction
  5. Excess nucleotides are degraded by apyrase
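The cycle of steps 1 to 5 can be sketched as a toy simulation. The assumptions here are mine rather than the lecture's: nucleotides are flowed in a fixed repeating order (`TACG` is used for illustration), the light signal is taken to be exactly proportional to the number of bases incorporated in a flow, and the input is the strand being synthesised (the complement of the template).

```python
def pyrosequence(synthesised, flow_order="TACG", max_flows=100):
    """Return a flowgram: one (nucleotide, signal) pair per nucleotide flow."""
    flowgram = []
    pos = 0
    flow = 0
    while pos < len(synthesised) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # The polymerase incorporates the flowed base as long as it matches;
        # each incorporation releases one PPi, so a homopolymer gives more light.
        while pos < len(synthesised) and synthesised[pos] == base:
            run += 1
            pos += 1
        flowgram.append((base, run))  # run == 0 means no light in this flow
        flow += 1  # apyrase degrades leftover nucleotide before the next flow
    return flowgram


def decode(flowgram):
    """Rebuild the sequence from the flowgram signals."""
    return "".join(base * signal for base, signal in flowgram)


flows = pyrosequence("TTAGC")  # hypothetical strand
print(flows)
print(decode(flows))
```

Flows that produce no light are as informative as those that do, because they show that the next base is not the nucleotide just added.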
Illumina Sequencing
The Sequence Assembly Problem
  Sequencing technologies generate reads of <1000 bp
  These reads must be assembled into a single continuous genomic sequence
  Shotgun sequencing exploits many overlapping sequences (high coverage) to infer ordering directly from the sequences themselves
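A minimal sketch of how overlaps alone can order reads is given below. It uses greedy merging of the best suffix-prefix overlap, which is one simple strategy chosen for illustration and not necessarily what production assemblers do; the reads and function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining pieces stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads from a made-up 15-base "genome".
contigs = greedy_assemble(["ATGGCGTA", "GCGTACCTG", "ACCTGAAT"])
print(contigs)
```

With unique overlaps the reads collapse into a single contig regardless of input order; the approach breaks down once the same overlap fits more than one pair of reads.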
The Repeat Problem
  Repeats at read ends can be assembled in multiple ways
    ATTTATGTGTGTGTGGTGTG
    GTGTGGTGTGCACTACTGCT
    ACTACTGCTGACTACTGTGTGGTGTG
    GTGTGGTGTGATATCCCT
  (Figure: the slide shows these four reads in a correct and an incorrect layout.)
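The ambiguity can be made explicit by computing, for each of the four reads on the slide, which other reads could follow it. This is a sketch under an assumed minimum overlap of 9 bases; the read labels `r1` to `r4` and the function names are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


READS = {
    "r1": "ATTTATGTGTGTGTGGTGTG",
    "r2": "GTGTGGTGTGCACTACTGCT",
    "r3": "ACTACTGCTGACTACTGTGTGGTGTG",
    "r4": "GTGTGGTGTGATATCCCT",
}


def successors(reads, min_len=9):
    """For each read, list the reads whose start overlaps its end."""
    return {
        name: [other for other, seq2 in reads.items()
               if other != name and overlap(seq, seq2, min_len) > 0]
        for name, seq in reads.items()
    }


print(successors(READS))
```

Both `r1` and `r3` end in the repeat GTGTGGTGTG, so each can be followed equally well by `r2` or `r4`; the overlaps alone cannot say which layout is the correct one.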
Paired-end Sequencing
  Figure labels (workflow): bacterial chromosome; random shearing; size selection for 3 kb or 8 kb etc; add linkers; circularise; shear and select on size and presence of linkers; add adapters; obtain sequences from either side of linker, a known distance apart in genome
  Create long fragments of known length
  Obtain sequence from paired ends a known distance apart
  Allows assembly of contigs across repeats into scaffolds
Genome Assembly
  Figure labels: Scaffold; Contig 1, Contig 2, Contig 3; Sequence Gap; Physical Gap
Re-sequencing
  Short reads (<200 bp) are inefficient for de novo assembly
  Instead they are mapped against a reference genome
  Re-sequencing is like assembling a jigsaw puzzle using the image on the lid
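Mapping against a reference can be sketched as sliding each read along the reference and keeping the best-matching position. This brute-force scan is only an illustration: real read mappers rely on indexed data structures for speed, and the reference, reads and function name below are made up.

```python
def map_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None if no hit."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


REFERENCE = "ATGGCGTACCTGAATCGA"  # hypothetical reference genome

print(map_read(REFERENCE, "GCGTAC"))  # exact match
print(map_read(REFERENCE, "CTGTAT"))  # one mismatch: a candidate SNP
print(map_read(REFERENCE, "TTTTTT"))  # does not map
```

A read that maps with a single mismatch is the raw material for SNP discovery, provided enough independent reads agree on the same difference.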
Genome annotation
  Annotation is the addition of information about the predicted sequence features to the flat file of DNA code
  Identification of potential coding sequences (CDS)
  Homology searches to predict function
  Other features can be annotated as well
    rRNAs
    Potential promoters
    tRNAs
    Small non-coding RNAs
    Repeat sequences
    Insertion sequences (ISs), transposons, gene fragments
    Location of the origin of replication
  Determination of the number of bases, genes, and G+C%
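The simplest of these summary statistics, the number of bases and the G+C%, can be computed directly from the sequence. A minimal sketch with hypothetical function names and a toy input:

```python
def gc_percent(seq):
    """Percentage of bases in seq that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def summarise(seq):
    """Return the basic figures reported for an annotated sequence."""
    return {"bases": len(seq), "gc_percent": round(gc_percent(seq), 1)}


print(summarise("ATGCGC"))  # toy sequence, not real genome data
```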
How to go from this…?
>Escherichia coli K-12 MG1655_3870656-3890655
TGCTGCTGCCTGCTGCGCGGTGCGCTCTACGGATTGCCCGGCGCGATAGAGATCGCTGCCTAAGCCCGCCCCTGCACAACCTGCGTCTATCCACTGCGCCAGGTTTTCTGCGTCACGCCGCAACGGCAAAGACTGCGATGTCCGATGGCAATACCGCTTTTAACGCTTTGATGTATTGCGGACCAAAAGCCGATGACGGAAATATTTTCAGCGCCTGCGGCGCCCGCTTCGAGCGCGGTAAAGGCTTCGGTCGCCGTCGCGCAGCCGGGGCAGACGTCATGCCGTAGCCCACCGCACGGCGGATCACTTCACTATGGATATTGGGCGTAACGATGAGCTGACAGCCCATCCTGGCGAGCGCATCGACCTGTTCAGGTTTCAGTACCGTACCTGCGCCAATCAACGCCTTGTCGCCGTACGCATCAACGATGCGGGAATGCTTTGCTCCCATTGTGGGGAATTCAGCGGGATTTCAACCGCGTCGAACCCGGCGTCAATCACCGCGCCAACATGCGCCAGCGCCTCGTCGGGCGTAATACCGCGCAAAATGGCGATCAGCGGGAGTTTAGTTTGCCACTGCATGAGGATGCTCCTTATACCAGCCTGAAATGCCGTGTCGCCCGCCACCGCCGTCACGTCGCAACCCATCGCCTGAAAGGCTTGCTGGTAGCGCGCGGTCAGCGATGTTCCGGCGACAAGGGTGATGGCGTGTTGATGGGCCACATAGTCGCGCATACTGGGACCTCTGCGCCAATCAACAAACCAGAGAGAAATTCGCTGACCTGTTCGCGGGGAAGTGTTCCCAGCACATGCGAGGCGCGAACTTCAAAAAGCTGCGGCAATATGGCGGGCGTATTAAGACCACGCTCAAGGCCAGCTGTGAAGGCATCGGCAGGTTTTCCTGCGGCGGCAAACCTGCGCCAATCAATGAGTGATTTAACAGTAAATGATGTAATTCACCGGTCATCACGGTGCGAAAATCGTTGATTTGCTGGCTATCGGCCTGCACCCATTTGCAATGGGTTCCGGGCATGACATAAAGAGAGGAAGAGCCAGAGCTCGCGCGCCGATCAATTGTGTTTCTTCGCCGCGCATCACATTGTGGTTATCGTCATGAGAGACACATAATCCGGGAATAATCCAGATATTGTCGCCAACTGACGTTAATTGTTCGCCAATAGACGAAAAACAGGCAGGAACAGATAATACGGTGCAACTTTCCAGCCGACGTTGCTGCCAACCATTCCTGCCATTACCACTGGCGTTTTCTCTTCACGCCAGTCGGTCGTGACTTCTGCTAACACCGCAGCCGGAGATTTTCCGTTCAGGCGCGTGACGCCTGCTTCTGATTGCCTGCTCTCAGGCAGTGGTCGCCCTGATAAAGCCAGGCGCGCAGATTGGTCGATCCCCAGTCAATTGCGATGTAGCGAGCTGTCATGTGATTTCCTTTAACCTTCGTGTCGAGCTGGCGATCATGGTAAGCGCCGCCTGCTCTGCCGCATCGCCGTCCTGATGCGTATCGCATCGAACAGCGCCTTATGTTCCTGGAGCGTTTGCGGCATGTTGGCCTCATCGCCCATCCAGGTTCGTTCAAAAACCGCCCGCTGCAGCGAACTGATCGCAATGCTAAGTTGCTGTAACACCGGGTTATGCACCGACTGCAGCACCGCTCGTGGTAGCGAATATCCGCTTCGTTAAACGCTTCGCGGTCCTGATTGTTGGCAATCATCTCGTTCAGCGCCGATTCAATCTGCGCCAGATCGCTGGAAGTCGCGCGCTCTGCTCCCAACGGGCAATCGCCGGTTCCACCAGATTTCGCACTTCGTCATGGCACTGATAAGCCGTGGGTCGTAGTCATTTTCCAGCACCCATTGCAGTACGTCAGTGTCGAGGTAATTCCACTGGTTACGCGGTGCCACAAACGCCCCGCGATAACGTTTCATTTCAATCAGCCGCTTCGCCATCAGCGAACGGAACACCCACGGATGATGTTGCGCGAG
GTTGCAAACTCCTCACAGAGTTCCGCCTCAGCCGGAAGCGGCGAGCCTGGCACGTATTTGCCGTGAACGATCTGTTTACCCAGCGTAATGACAATGCGATCGGTTTTATTGAGAGTCATGGAGAGTCCTTGTGCTTGTATGTTCTTCTCTACTTTACCCCGATCGATGCATAACGCGGCAACTTTGTAGTACCAGCGTGATGACGTTCGCGTTTGCCGTGCGTGTAATGTAGTACAAACTTATATTGTTGTACTACAATTTAGATCACAAAAAGAACAATGCATAAAAAATGACATGCGTCGGGCAGAAATCTGAAAAGGGATATCAGGCGCTAAACAGGAGGGAAAGAAGAGTATGCTTTCAACGGCTTAGCTACTCGTTTAAAGGATTAATCATGAAGTTGAATTTTAAGGGATTTTTTAAGGCTGCCGGTTTATTCCCACTGCGCTGATGCTTTCAGGCTGTATCTCGTATGCTCTGGTTTCCCATACCGCAAAGGGTAGTTCAGGAAAGTATCAATCGCAGTCAGACACCATCACTGGGCTATCGCAGGCAAAAGATAGTAATGGAACAAAAGGCTATGTTTTTGTAGGGGAATCGTGGATTACCTTATCACTGATGGTGCCGATGACATCGTTAAGATGCTCAATGATCCAGCACTTAACCGGCACAATATTCAGGTTGCCGATGACGCAAGATTTGTTTTAAATGCGGGGAAAAAGAAATTTACCGGCACAATATCGCTTTACTACTACGGAATAACGAAGAAGAAAAGGCACTGGCAACGCATTATGGTTTTGCCTGTGGTGTTCAACACTGTACCAGGTCACTGGAAAACCTAAAAGGCACAATCCATGAGAAAAATAAAAACATGGATTACTCAAAGGTGATGGCGTTCTACCATCCATTTAAGTGCGATTTTATGAATACTATTCACCCAGAGGCATTCCGGGATGGTGTTTCCGCAGCATTACTGCCAGTGACTGTTACGCTGGACATCATTACTGCACCGCTGCAATTTCTGGTTGTATATGCAGTAAACCAATAATCAGTAAGCGGGCAAACCGTTTATGCTGTTTGCCCGCCCACAGATTAATTCAGCACATACTTCTCAATAGCAAACGCCACGCCATCTTCAAGGTTAGATTTGGTGACAAAGTTCGCCACTTCTTTCACTGAAGGAATAGCGTTATCCATCGCCACACCGACGCCTGCATATTAATCATTGCGATATCGTTTTCCTGATCGCCAATCGCCATGATTTCTTCCGGTTTAATACCTAACACGTCGGCCAGTGATTTCACCCCCGTACCTTTGTTAACGCGTTTATCGAGGATTTCGAGGAAGTACGGCGCACTTTTCAGCACGGTATATTCTCTTTCACTTCCTGCGGAATACGCGCGATAGCCTGGTCGAGGATGGCGGGTTCATCAATCATCATCACTTTCAGGAACTGGGTATTGGGGTCCATTTTCTCCGCTTCGCAGAACACCAGCGGAATGGTGGCAACGAAGGATTCATGCACCGTGTGTAGCTGATATCACGGTTGGCGGTGTACAGCGTGGTGCGGTCCAGGGCGTGGAAATGAGAACCGACTTCGCGAGAGAGTTTTTCCAGGAAACGATAGTCGTCATAGCTGAGAGCAGTTTGCGCCACGGTGCTACCATCAGCGGCCTTCTGTACCACGCGCCGTTATAAGTAATGCAGTAGTCGCCCGGCTGTTCCATATGCAGCTCTTTCAGGTAGTTGTGCACACCTGCATACGGGCGACCCGTCGTTAGCACGACATTCACGCCACGGGCGCGAGCTGCGGCAATCGCATTTTTAACGGCGGGTGAAAGGTGTGATCGGGCAGCAGAAGGGTGCCATCCATATCGATAGCAATGAGTTTAATAGCCATGAGTTCCCCAGGTAGATTGGTTCCTGACCCATGCTAACGCGATTCCGCTCAAAAATCAGTACAACACCCGAGGGAAAAGGGGGATGCAACGCGCGTGCGT
GCTCCCTTTTTGCTTAGCGGAAGAGTTTCCCTTTCAGCAGTTCCATGCCTGCGGAAAGCAGATCGTTATTGGCTTGTGGTGACACTTCACCTTGCGGTGAGAGCGCATCAATAATCTTCGGCAATTGTTCTGCCAGTAAACTGGAAGCTGACTGGTATCCACGCCAAGTTTTTGCCCGAGATCGGACACCGCATTTGTGCCGAGCGCCGATTCCAGTTGCTCGCCACTAACCGATTGATTGCCCTGTTGATTACTCAGCCAGGTTGAGAGAATGGCCCCTAAGCCGCCACTTTGCAGTTTTTCCACAGCACCTGAATGCCGCCCTGCTCCTCAACCCAACTTAAAATAGCCTGATATTTCCCCGCATCGCCTTTCAGAAAGGCACCGACAACTTCATCAAAAAGCCCCATGATAATCACCTGTAAAGCGTTACGTGTTGACCCAAAAAGTATAGATTTGCGGATGATAATTGCGGATTGCAGAAATAAAAAGGGCGGAGATGATCTCCGCCCTTTTCTTATAGCTTCTTGCCGGATGCGGCGTGAACGCCTTATCCGGCCTACAAAATCATGAAAATTCAATACATTGCAAGATTTTCGTAGGCCTGATAAGCGTGCGCATCAGGCACGCTCGCATGGTTAGCGCCATTAAATATCGATATTCGCCGCTTTCAGGGCGTTCTCTTCAATAAACGCACGGCGCGGTTCAACGGCGTCGCCCATCAGCGTGGTGAACAACTGGTCGGCAGCAATCGCATCTTTAACGGTAACCGCAGCATACGACGACTTTCCGGGTCCATAGTGGTTTCCCACAGCTGTTCCGGGTTCATCTCGCCCAGACCTTTATAACGCTGGATGGAGAGGCCGCGACGGGACTCTTTCACCAGCCAGTCCAGCGCCTGCTCGAAGCTGGCTACCGGCTGACGCGCTCGCCACGTTCGATAAACGCATCTTCTTCCAGCAAGCCACGCAGTTTCTCACCCAGCGTGCAGATACGACGATATTCGCCACCGGTGATAAACTCGTGATCCAGCGGATAGTCAGTATCCACACCGTGGGTACGCACGCGAACAATCGGCTCAACAGGTTTTGCTCAGCATTGGTGTGAACATCAAACTTCCACTGGCTGCCGTGCTGTTCTTTGTCGTTCAGTTCGCTGACCAGCGCGTTCACCCAGCGGGTAACGGTCTGCTCATCAGAAAGGTCAGCTTCCGTCAACGTCGGCTGATAGATAAGTCTTTCAGCATTGCTTTCGGATAACGACGCTCCATACGATTGATCATTTTCTGCGTCGCGTTGTACTCAGATACCAGTTTCTCTAACGCTTCGCCAGCCAATGCCGGTGCACTGGCGTTGGTGTGCAGCGTTGCGCCGTCCAGCGCGATAGAGATTGGTACTGATCCATCGCTTCGTCGTCTTTAATGTACTGTTCCTGCTTGCCTTTCTTCACTTTGTACAGCGGCGGCTGAGCGATGTAGACGTGACCGCGTTCAACGATTTCCGGCATCTGACGATAGAAGAAGGTCAACAGCAGCGTACGAATGTGGAGCCGTCGACGTCCGCATCGGTCATGATGATGATGCTGTGATAACGCAGTTTGTCCGGGTTGTACTCGTCACGACCGATACCACAGCCAAGCGCGGTGATAAGCGTCGCCACTTCCTGAGAAGAGAGCATCTTATCGAAGCGCGCTTTCTCGACTTGAGGATTTTACCCTTCAGCGGCAGAATCGCCTGGTTCTTGCGGTTACGCCCCTGCTTCGCAGAGCCGCCCGCGGAGTCCCCTTCCACCAGGTACAGTTCGGAAAGCGCCGGATCGCGTTCCTGGCAGTCTGCCAGTTTGCCCGGCAGGCCCGCAAGTCGAGCGCACCTTTACGGCGGGTCATTTCACGCGCGCGACGCGGCGCTTCACGGGCACGGGCAGCATCGATAATTTTGCCAACCACGATTTTCGCGTCGGTTGGGTTTTCCAGCAGGTATTCTGCCAGCAGTTCGTTCATCT
GCTGTTCAACGCCGATTTCACCTCAGAAGAAACCAGTTTGTCTTTGGTCTGGGAGGAGAATTTCGGGTCCGGCACTTTCACGGAAACGACCGCAATCAGGCCTTCACGCGCATCGTCACCGGTGGCGCTGACTTTGGCTTTTTTGCTGTAGCCTTCTTTGTCCATTAGGCGTTCAGGGTACGGGTCATCGCCGCACGGAAGCCTGCCAGGTGAGTACCGCCGTCACGCTGCGGAATGTTGTTGGTAAAGCAGTAGATGTTTTCCTGGAAGCCATCGTTCCACTGCAACGCCACTTCGACGCCAATACCGTCTTTTTCAGTGAGAAGTAGAAGATATTCGGGTGGATCGGCGTTTTGTTCTTGTTCAGATATTCAACGAACGCCTTGATGCCGCCTTCATAGTGGAAGTGGTCTTCTTTGCCGTCGCGCTTGTCGCGCAGACGAATGGAAACGCCGGAGTTGAGGAACGACAACTCCGCAGACGTTTCGCCAGAATTTCATATTCGAACTCGGTCACATTGGTGAAGGTTTCGAGGCTGGGCCAGAAACGCACCATGGTGCCGGTTTTTTCAGTCTCGCCGGTAACCGCCAGCGGGGCCTGCGGTACACCGTGTTCGTAGATCTGACGGTGATTTTACCCTCGCGCTGGATAACCAGCTCCAGTTTTTGCGACAGGGCGTTTACTACCGAAACACCAACGCCGTGCAGACCGCCGGACACTTTATAGGAGTTATCGTCAAATTTACCGCCTGCGTGCAGAACGGTCATGATCACTTCCGCCGCCGA
…to this?
FT gene complement(9299..10702)
FT /db_xref="GenBank:2367266"
FT /gene="dnaA"
FT /note="b3702"
FT CDS complement(9299..10702)
FT /db_xref="GI:2367267"
FT /db_xref="PID:g2367267"
FT /function="putative regulator; DNA - replication, repair,
FT restriction/modification"
FT /codon_start=1
FT /protein_id="AAC76725.1"
FT /gene="dnaA"
FT /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
FT FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
FT QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
FT AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
FT KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
FT LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
FT NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
FT SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
FT SNLIRTLSS"
FT /product="DNA biosynthesis; initiation of chromosome
FT replication; can be transcription regulator"
FT /transl_table=11
FT /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
FT CG Site No. 851"
Or this?
An ORF is not a CDS!
  An ORF is just an open reading frame
  There are many more ORFs than protein-coding genes (CDSs) in a genome
  Figure labels: non-coding ORFs; CDSs (note that an ORF can extend upstream of the start codon)
The Problem of Frameshift Errors
Actual sequence
  ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
  Frame 1: M S T A K L V K S K A T N L L Y T R N D V S D S E K
  Frame 2: • V P L N • L N Q K R P I C F I P A T M S P T A R K
  Frame 3: E Y R • I S • I K S D Q S A L Y P Q R C L R Q R E K
Frameshifted sequence after single base error
  ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
  Frame 1: M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
  Frame 2: • V P L N • L N Q K A T N L L Y T R N D V S D S E K
  Frame 3: E Y R • I S • I K K R P I C F I P A T M S P T A R K
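The effect of the single-base error can be reproduced with a short translation routine. This is a sketch using the standard genetic code; the helper names are hypothetical, and the inserted base (an extra A after position 30) is inferred by comparing the two sequences on the slide.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate complete codons in frame 1; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


ORIGINAL = ("ATGAGTACCGCT" "AAATTAGTTAAA" "TCAAAAGCGACC" "AATCTGCTTTAT"
            "ACCCGCAACGAT" "GTCTCCGACAGC" "GAGAAA")
# Simulate a sequencing error: one extra A inserted after base 30.
SHIFTED = ORIGINAL[:30] + "A" + ORIGINAL[30:]

print(translate(ORIGINAL))
print(translate(SHIFTED))
```

The first ten residues are identical in both translations; everything downstream of the inserted base is read in a different frame and changes completely.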
Homology
  Similarities in form (sequence) allow us to infer similarities in “meaning” (structure and function)
  Homology is not just sequence similarity
  Two sequences can be similar without any common ancestry, particularly if low complexity
    the cat sat on the mat
    die Katze sass auf der Matte
    vge|GBant88-2   ITLITCVSVKDNSKRYVVAG
    vge|GEfae9-178  LTLITCDQATKTTGRIIVIA
    vge|GSpne1-403  MTLITCDPIPTFNKRLLVNF
    sortase_staur   LTLITCDDYNEKTGVWEKRK
Types of Homology
  Homologues can be divided into
    Orthologues: lines of descent congruent with whole genome
    Paralogues: result of gene duplication
    Xenologues: result of HGT
Homology Searches
  The aim of homology searches is to identify sequences within sequence databases that are homologous to your sequence.
  This involves
    comparing your sequence with all the database sequences
    looking for stretches of sequence that appear to be similar
    then scoring the matches and ranking them
    a measure of the significance of the match is given
  The most common program used for homology searches is BLAST
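The compare, score and rank steps can be illustrated with a deliberately simple scoring scheme. The sketch below ranks the aligned peptide segments from the lecture's homology slide by percent identity to the sortase segment. It is not how BLAST works (BLAST performs heuristic local alignment with substitution matrices and reports statistical significance); the function names are hypothetical, and the segments are assumed to be already aligned and of equal length.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two segments are identical."""
    if len(a) != len(b):
        raise ValueError("segments must be pre-aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def rank_hits(query, database):
    """Score every database entry against the query and sort best-first."""
    scored = [(percent_identity(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


QUERY = "LTLITCDDYNEKTGVWEKRK"  # sortase_staur
DATABASE = {
    "vge|GBant88-2": "ITLITCVSVKDNSKRYVVAG",
    "vge|GEfae9-178": "LTLITCDQATKTTGRIIVIA",
    "vge|GSpne1-403": "MTLITCDPIPTFNKRLLVNF",
}

for score, name in rank_hits(QUERY, DATABASE):
    print(f"{name}\t{score:.0f}% identity")
```

All three hits share the TLITC motif with the query, which is what drives the scores in such short segments.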
Bacterial Genome Dynamics
Gene Loss Gene Gain
Rapid emergence of genetically uniform pathogens from variable ancestral populations
Drastic downsizing in isolated intracellular niches
Accumulation of pseudogenes and IS elements after shift to new niche
Gene Duplication
Horizontal gene transfer by phage, plasmids, pathogenicity islands
single nucleotide polymorphisms (SNPs)
Recombination and rearrangements
Gene Change
Horizontal gene transfer
  Horizontal (or lateral) gene transfer denotes any transfer, exchange or acquisition of genetic material that differs from the normal mode of transmission from parents to offspring (vertical transmission).
  Figure labels: Vertical gene transfer; Horizontal gene transfer
Bacterial mobile genetic elements
Transposons
  Pieces of DNA that act as ‘jumping genes’, changing location on the chromosome or plasmid
  Encode a transposase that catalyses the transposition event
  Can carry resistance or virulence genes
Insertion sequences (IS elements)
  Transposable elements that encode only the transposase
  Multiple copies of the same IS within a genome provide targets for homologous recombination, rearrangements and replicon fusions
Conjugative transposons
  Normally integrated into the chromosome
  Excise, then are transferred to recipient cells by conjugation
Bacterial mobile genetic elements
Plasmids
  Self-replicating extrachromosomal replicons
  Usually circular but can be linear
  Can carry resistance or virulence genes
Bacteriophages
  Bacterial viruses
  Can carry virulence genes
  Can insert into bacterial chromosome as prophages (lysogeny)
Integrons
  Complex natural cloning and gene expression systems
  Able to capture promoterless gene cassettes by site-specific recombination
  Allow formation of large arrays of gene cassettes, transferred as a whole between different replicons
Genomic islands
  Large chromosomal regions, part of the flexible gene pool
  Previously transferred by other mobile genetic elements
  Present in some bacteria but absent in close relatives
  Carry multiple genes that increase phenotypic versatility
  Contribute to dynamic character of bacterial chromosomes and can be excised from the chromosome and transferred to other recipients
  Pathogenicity islands contain dozens of genes that allow a quantum leap to a complex new virulence phenotype
Core genomes and Pangenomes
Core genome
  Pool of genes shared by all members of a bacterial species
Accessory or dispensable genome
  Pool of genes present in some but not all genomes within the same bacterial species
Pangenome
  Global gene repertoire of a bacterial species, comprising core genome + accessory genome
Metagenome
  Global gene repertoire of a mixed microbial population
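These definitions map directly onto set operations. In the sketch below the strain names and gene contents are invented for illustration; only the relationships (core = intersection, pangenome = union, accessory = pangenome minus core) come from the definitions above.

```python
# Hypothetical gene content of three strains of one species.
GENOMES = {
    "strain_A": {"dnaA", "gyrB", "recA", "stx2"},
    "strain_B": {"dnaA", "gyrB", "recA", "aggR"},
    "strain_C": {"dnaA", "gyrB", "recA"},
}


def core_genome(genomes):
    """Genes shared by every genome."""
    return set.intersection(*genomes.values())


def pangenome(genomes):
    """Every gene seen in at least one genome."""
    return set.union(*genomes.values())


def accessory_genome(genomes):
    """Genes present in some but not all genomes."""
    return pangenome(genomes) - core_genome(genomes)


print(sorted(core_genome(GENOMES)))
print(sorted(accessory_genome(GENOMES)))
print(sorted(pangenome(GENOMES)))
```

In practice the hard part is deciding when genes in different strains count as "the same" gene, which is itself a homology question.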
Escherichia coli Core and Pan-genomes
Welch et al. Proc Natl Acad Sci U S A. 2002 Dec 24;99(26):17020-4
Metagenomics
  Environmental shotgun sequencing
  DNA extracted from mixed microbial communities is sequenced en masse
  Assembled into contigs
  Typically only small contigs can be obtained
Uses of a genome sequence
  Gene discovery
    Fuelling hypothesis-driven research on pathogen biology
  Comparative genomics
    SNP discovery and genomic epidemiology
  Functional genomics
    Transcriptomics
    Proteomics
    Interactome
    Structural genomics
    Mass mutagenesis
Haemolytic-uraemic syndrome
  Shiga-toxin-producing E. coli (STEC)
  Bloody diarrhoea; damage to kidneys and brain
  Anaemia; loss of platelets
German E. coli O104:H4 outbreak
  May-July 2011
  >4000 cases
  >40 deaths
  Link to sprouting seeds
  High risk of haemolytic-uraemic syndrome
  Females particularly at risk
  Frank et al. DOI: 10.1056/NEJMoa1106483
Take-away messages from the genome
  Pathogens don’t bother with passports!
  Not a new strain: something similar seen in Germany ten years ago and in Korea
  Closest genome-sequenced strain was isolated from Central African Republic in late 1990s and belongs to an enteroaggregative lineage
  German STEC probably comes from a lineage circulating in human populations rather than from an animal source (unlike E. coli O157)
Take-away messages
  Bacteria evolve quickly
  Virulence factors in E. coli can jump from one lineage to another on mobile genetic elements
  Pathotypes can overlap and evolve
  Antibiotic resistance seen where no obvious prior use of antibiotics
Take-away messages from genome sequence
  Genome sequencing brings the advantages of open-endedness (revealing the “unknown unknowns”), universal applicability and the ultimate in resolution
  Bench-top sequencing platforms now generate data sufficiently quickly and cheaply to have an impact on real-world clinical and epidemiological problems
Comprehensive Coverage of Human Microbiome
Comprehensive coverage of tree of life
What will you do when you can sequence everything?