Bio153 microbial genomics 2012

Bio153 Microbial Genomics

Professor Mark Pallen, University of Birmingham

Microbial Genomics

General features of microbial genomes

Historical overview

Genome sequencing, annotation and analysis

Genome evolution

What can we learn from a genome sequence?

General features of genomes

Microbial

Small WYSIWYG genomes (Mbp)

Gene density high (>90%)

Intergenic regions short; very little repetitive or non-coding DNA

Introns very rare

Protein-coding genes (CDS) short (~1 kbp)

Operons with promoters just upstream

Fewer non-coding RNAs

Human

Very large genomes (Gbp)

Gene density low

Only 25% is genes

Introns mean only 1% codes

Genes can span ≥30 kbp

Genes have ~3 transcripts

Splicing and splice variants

Promoter regions distant from gene

Bacterial genome organisation

Chromosomes

Most commonly a single circular chromosome (always DNA)

BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)

BUT a few species have two chromosomes (e.g. Vibrio cholerae)

Can be a mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)

Plasmids

Independent autonomous replicon, can be circular or linear

May integrate into chromosome

Copy number varies from 1 to 10s

Often carry non-essential genes that confer an adaptive advantage in certain conditions

Bacterial Genome Size

Species that occupy restricted ecological niches (e.g. obligate intracellular parasites and endosymbionts) tend to have smaller genomes (<1.5 Mb) than generalist bacteria

Smallest known bacterial genome: Carsonella ruddii, 160 kb! (Nakabachi et al. 2006)

BUT mitochondrial genomes are smaller

Largest genomes found in bacteria with complex developmental cycles, e.g. Streptomyces

Largest bacterial genome: Sorangium cellulosum, 13 Mb

Bacterial genomes are made from DNA

In 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty showed that DNA (not proteins) was the genetic material responsible for inheritance.

Identified DNA as the "transforming principle" while studying Streptococcus pneumoniae

Avery, Oswald T., Colin M. MacLeod, and Maclyn McCarty. Studies on the chemical nature of the substance inducing transformation of pneumococcal types. Journal of Experimental Medicine. 1944 Feb 1; 79(2): 137-158.

In 1952, this work was supported by Alfred Hershey and Martha Chase, who showed that only the DNA of a virus needs to enter a bacterium to infect it.

Used radioactively labelled bacteriophage

Hershey AD and Chase M. Independent functions of viral protein and nucleic acid in growth of bacteriophage. Journal of General Physiology. 1952. 36: 39-56.

Viral genomes are variable

Use RNA or DNA, but not both, in genome

Some have RNA genomes!

Grouped into families depending on type of genome: DNA or RNA, single- or double-stranded

Typically dozens of genes or fewer

Large genomes in pox viruses (~200 kb)

Massive genomes in megaviruses (1 Mbp!)

Year Milestone

1977 Invention of dideoxy chain terminator sequencing (“Sanger sequencing”)

1977 Sequencing of the 5.4-kilobase genome of bacteriophage phiX174

1981 First human mitochondrial genome sequence*

1982 Determination of the 48.5-kilobase genome sequence of bacteriophage lambda through first use of shotgun sequencing

1986 Development of automated fluorescent sequencing

1995 First complete genome sequences obtained of free-living bacteria (Haemophilus influenzae and Mycoplasma genitalium)

1996 Mycoplasma becomes first bacterial genus that has completely sequenced genomes from two different species (M. genitalium and M. pneumoniae)

1997 First genome sequences from Escherichia coli and Bacillus subtilis

1998 First genome sequence from Mycobacterium tuberculosis; genome sequence from Rickettsia prowazekii provides first evidence of reductive evolution

Microbial Genomics Timeline

Year Milestone

1999 Helicobacter pylori becomes the first species with completely sequenced genomes from two isolates

2000 Meningococcal genome sequence primes first application of reverse vaccinology

2001 Second E. coli genome sequences reveal unexpected level of horizontal gene transfer; genome sequence of M. leprae provides compelling evidence of bacterial pseudogenes and reductive evolution; first paper reporting genome sequences of two strains from one species (Staphylococcus aureus) in a single publication.

2002 Genome sequencing of multiple strains of Bacillus anthracis to provide markers for forensic epidemiology

2003 Genome sequencing of uncultivable Tropheryma whipplei leads to design of axenic growth medium

2004 Genome sequence of mimivirus blurs distinctions between bacteria and viruses

2005 Whole-genome sequencing used to identify target of new anti-tuberculosis drug; Mycoplasma genitalium genome sequenced using pyrosequencing

2006-2011 Bacterial metagenomics survey of the Sargasso Sea yields >1 million new genes; rise of next-generation or high-throughput sequencing

Microbial Genomics Timeline

The first genome sequences

The first sequenced gene was from bacteriophage MS2

The gene encoding the coat protein, 1972

Min Jou W, Haegeman G, Ysebaert M, and Fiers W. Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature. 1972 May 12; 237(5350): 82-88.

The first sequenced genome was bacteriophage MS2, 1976

RNA genome is 3,569 nucleotides

Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J, Min Jou W, Molemans F, Raeymaekers A, Van den Berghe A, Volckaert G, and Ysebaert M. Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature. 1976 Apr 8; 260(5551): 500-507.

The first genome sequences

The first sequenced DNA genome was bacteriophage Φ-X174, 1977

5,386 base pairs

Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, and Smith M. Nucleotide sequence of bacteriophage phi X174 DNA. Nature. 1977 265 (5596): 687-695.

The first sequenced bacterial genome was Haemophilus influenzae, 1995

1,830,140 base pairs

Fleischmann R, Adams M, White O, Clayton R, Kirkness E, Kerlavage A, Bult C, Tomb J, Dougherty B, and Merrick J. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 1995. 269 (5223): 496-512.

Overview of a genome project

Choose strain: fresh isolate or tractable lab strain?

Choose strategy: shotgun sequencing, paired-end sequencing; draft or complete?

Choose chemistry: Sanger; 454; Illumina; Ion Torrent

Assembly: automated

Closure and finishing: manually intensive; difficulty depends on how repetitive

Data release: immediate or delayed?

Annotation: manually intensive bottleneck

Publication

Methods for genome sequencing – historic Sanger method sequencing

Sanger F and Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology. 1975 94: 441-448.

Step 1, a sequence-specific DNA primer is radiolabeled

Step 2, the primer is annealed to the template DNA

Step 3, the primer is extended by DNA polymerase

Incorporation of a deoxynucleotide - further extension possible

Incorporation of a dideoxynucleotide – chain termination

Four reactions set up

ddATP, dATP, dCTP, dGTP, dTTP
ddCTP, dATP, dCTP, dGTP, dTTP
ddGTP, dATP, dCTP, dGTP, dTTP
ddTTP, dATP, dCTP, dGTP, dTTP
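The four-reaction scheme can be mimicked with a toy simulation in Python. This is only a sketch: the function names and the example strand are invented for illustration, and it assumes that every possible terminated fragment is present and that the gel resolves fragments differing by a single base.

```python
def chain_termination_lanes(new_strand):
    """Return, for each ddNTP reaction, the lengths of the terminated fragments.

    new_strand is the strand synthesised from the primer (hypothetical input).
    A fragment of length i + 1 appears in the ddA lane if base i is A, and so on.
    """
    lanes = {base: [] for base in "ACGT"}
    for i, base in enumerate(new_strand):
        lanes[base].append(i + 1)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering bands from shortest to longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = chain_termination_lanes("GATTACA")
print(lanes)            # fragment lengths seen in each of the four lanes
print(read_gel(lanes))  # sequence recovered by reading the gel bottom to top
```

Each lane on its own only says where one base occurs; the sequence emerges from interleaving the four lanes by fragment length.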


Methods for genome sequencing – automated Sanger sequencing

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, and Hood LE. Fluorescence detection in automated DNA sequence analysis. Nature. 1986 321: 674-679.

Replaced radioisotopes with fluorescent dyes

Safer for the researchers

Each of the four DNA bases could be dyed a different colour

Eliminated the need to run separate reactions in separate lanes

The migration of the dye could be read because of the fluorescence

This information allowed automatic gel reading

Further improvements were made

Improved dye chemistry using fluorescent dideoxy-terminators (DuPont): Prober JM, Trainor GL, Dam RJ, Hobbs FW, Robertson CW, Zagursky RJ, Cocuzza AJ, Jensen MA, and Baumeister K. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 1987 238: 336-341.

Replacing slab gels with re-useable capillary tubes: Ruiz-Martinez MC, Berka J, Belenkii A, Foret F, Miller AW, and Karger BL. DNA sequencing by capillary electrophoresis with replaceable linear polyacrylamide and laser-induced fluorescence detection. Analytical Chemistry 1993 65: 2851-2858.

Whole-Genome Shotgun Sanger Sequencing

Bacterial chromosome

Random shearing

Size selection

Cloning into plasmid vector

Pick colonies to create shotgun library

Plasmid preps

Sequence each insert with two primers

High-throughput Sequencing

100x faster, 100x cheaper!

A disruptive technology

Several technologies in the marketplace from 2007 onwards: 454 (Roche), Illumina, Ion Torrent, PacBio

Fundamentally new approaches

Solid-phase amplification of clonal templates in “molecular colonies”

Massive increase in number of “clones” compensates for shorter read length

New chemistries for sequence reading

454: pyrophosphate detection on base addition

Illumina: reversible de-protection of fluorescent bases

High-Throughput Shotgun Sequencing

Random shearing

Size selection

Add adapters

Amplify

Sequence

454 sequencing

Emulsion-based clonal amplification

Anneal sstDNA to an excess of DNA Capture Beads

Emulsify beads and PCR reagents in water-in-oil microreactors

Clonal amplification occurs inside microreactors

Break microreactors, enrich for DNA-positive beads

Pyrosequencing

DNA template with primer is mixed with the enzymes (DNA polymerase, ATP sulfurylase, luciferase and apyrase) along with the two substrates adenosine 5’-phosphosulfate (APS) and luciferin

1. One of the four nucleotides is added to the reaction

2. If it is complementary to the base in the template strand, DNA polymerase incorporates it

3. Pyrophosphate (PPi) is released, then converted to ATP by sulfurylase in the presence of APS.

4. ATP serves as a substrate for luciferase, causing a light reaction.

5. Excess nucleotides are degraded by apyrase.
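The cycle of steps 1 to 5 produces a "flowgram": one light signal per nucleotide flow, with the signal roughly proportional to the number of bases incorporated. The sketch below simulates an idealised, noise-free flowgram. The function names, the flow order TACG and the example strand are assumptions made for illustration; real instruments have to cope with signal noise, especially in long homopolymer runs.

```python
def pyrosequence(new_strand, flow_order="TACG"):
    """Simulate an ideal flowgram as a list of (nucleotide, signal) pairs.

    new_strand is the strand being synthesised, i.e. the complement of the
    template. The signal is the number of bases incorporated in that flow.
    """
    if set(new_strand) - set(flow_order):
        raise ValueError("strand contains a base that is never flowed")
    flows = []
    pos = 0
    while pos < len(new_strand):
        for nucleotide in flow_order:
            incorporated = 0
            # Polymerase keeps adding this nucleotide through a homopolymer run.
            while pos < len(new_strand) and new_strand[pos] == nucleotide:
                incorporated += 1
                pos += 1
            flows.append((nucleotide, incorporated))
            if pos >= len(new_strand):
                break
    return flows


def call_bases(flows):
    """Convert a flowgram back into a base sequence."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = pyrosequence("TTAGGGC")
print(flows)
print(call_bases(flows))
```

A flow with signal 0 means the nucleotide was not complementary and was simply degraded by apyrase before the next flow.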

Illumina Sequencing

The Sequence Assembly Problem

Sequencing technologies generate reads of <1000 bp

These reads must be assembled into a single continuous genomic sequence.

Shotgun sequencing exploits many overlapping sequences (high coverage) to infer ordering directly from the sequences themselves
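"High coverage" can be made concrete with a back-of-envelope calculation. The sketch below uses the usual definition (coverage = number of reads × read length / genome size) and the Poisson idealisation, in which reads land uniformly at random and the expected fraction of bases never sequenced is e^(-coverage). That model is a standard simplification, not something stated on these slides. The read count and read length are hypothetical; the genome size is the H. influenzae figure quoted earlier.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def expected_fraction_uncovered(cov):
    """Poisson estimate of the fraction of bases hit by no read at all."""
    return math.exp(-cov)


genome_size = 1_830_140  # H. influenzae, base pairs
c = coverage(n_reads=24_000, read_length=500, genome_size=genome_size)
missed = expected_fraction_uncovered(c) * genome_size

print(f"coverage: {c:.1f}x")
print(f"expected bases never sequenced: {missed:.0f}")
```

Even at several-fold coverage, random sampling leaves some bases unread, which is one reason closure and finishing are needed after automated assembly.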

The Repeat Problem

Repeats at read ends can be assembled in multiple ways

ATTTATGTGTGTGTGGTGTG

GTGTGGTGTGCACTACTGCT

ACTACTGCTGACTACTGTGTGGTGTG

GTGTGGTGTGATATCCCT

ATTTATGTGTGTGTGGTGTG

GTGTGGTGTGCACTACTGCT

ACTACTGCTGACTACTGTGTGGTGTG

GTGTGGTGTGATATCCCT

Correct

Incorrect
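The ambiguity in this example can be checked directly by computing suffix-prefix overlaps between the four reads. The sketch below is a naive illustration, not how production assemblers work; the function names and the minimum overlap of 8 bases are arbitrary choices.

```python
def overlap(a, b, min_len=8):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def successors(read, reads, min_len=8):
    """Reads that could follow `read` in an assembly, judged by overlap alone."""
    return [other for other in reads
            if other != read and overlap(read, other, min_len)]


reads = [
    "ATTTATGTGTGTGTGGTGTG",
    "GTGTGGTGTGCACTACTGCT",
    "ACTACTGCTGACTACTGTGTGGTGTG",
    "GTGTGGTGTGATATCCCT",
]

for read in reads:
    print(read, "->", successors(read, reads))
```

The first and third reads both end in the repeat GTGTGGTGTG, and the second and fourth both begin with it, so each of the two has two equally good successors. Overlap alone cannot tell the correct layout from the incorrect one.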

Paired-end Sequencing

Random shearing

Size selection for 3kb or 8kb etc

Add linkers

Circularise

Shear and select on size and presence of linkers

Add adapters

Obtain sequences from either side of linker, a known distance apart in genome

Create long fragments of known length

Obtain sequence from paired ends a known distance apart

Allows assembly of contigs across repeats into scaffolds

Genome Assembly

Figure labels: scaffold; Contig 1, Contig 2, Contig 3; sequence gap; physical gap

Re-sequencing

Short reads (<200bp) are inefficient for de novo assembly

Instead they are mapped against a reference genome

Re-sequencing is like assembling a jigsaw puzzle using the image on the lid
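Mapping a read against a reference can be sketched as sliding the read along the reference and keeping the placement with the fewest mismatches. This brute-force version is for illustration only: real read mappers use indexed data structures and handle insertions and deletions. The function name, the mismatch limit and the example read are hypothetical.

```python
def map_read(read, reference, max_mismatches=2):
    """Return (position, mismatches) for the best ungapped placement, or None.

    mismatches is a list of (reference_position, reference_base, read_base),
    i.e. candidate single-nucleotide differences from the reference.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = [
            (pos + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) <= max_mismatches and (
            best is None or len(mismatches) < len(best[1])
        ):
            best = (pos, mismatches)
    return best


reference = "ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACC"
read = "GCTAAATTCGTT"  # hypothetical read carrying one base change

print(map_read(read, reference))
```

The reference supplies the position (the image on the lid), so the read only has to be compared with candidate locations rather than with every other read.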

Genome annotation

Annotation is the addition of information about the predicted sequence features to the flat file of DNA code

Identification of potential coding sequences - CDS

Homology searches to predict function

Other features can be annotated as well: rRNAs, potential promoters, tRNAs, small non-coding RNAs, repeat sequences, insertion sequences (ISs), transposons, gene fragments

Location of the origin of replication

Determination of the number of bases, genes, and G+C%.
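The simplest of these summary statistics, G+C%, can be computed in a few lines. This is a minimal sketch; the function name is arbitrary, and ignoring ambiguous bases such as N is an assumption rather than a fixed convention.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


print(gc_percent("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACC"))
```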

How to go from this….?

>Escherichia coli K-12 MG1655_3870656-3890655

TGCTGCTGCCTGCTGCGCGGTGCGCTCTACGGATTGCCCGGCGCGATAGAGATCGCTGCCTAAGCCCGCCCCTGCACAACCTGCGTCTATCCACTGCGCCAGGTTTTCTGCGTCACGCCGCAACGGCAAAGACTGCGATGTCCGATGGCAATACCGCTTTTAACGCTTTGATGTATTGCGGACCAAAAGCCGATGACGGAAATATTTTCAGCGCCTGCGGCGCCCGCTTCGAGCGCGGTAAAGGCTTCGGTCGCCGTCGCGCAGCCGGGGCAGACGTCATGCCGTAGCCCACCGCACGGCGGATCACTTCACTATGGATATTGGGCGTAACGATGAGCTGACAGCCCATCCTGGCGAGCGCATCGACCTGTTCAGGTTTCAGTACCGTACCTGCGCCAATCAACGCCTTGTCGCCGTACGCATCAACGATGCGGGAATGCTTTGCTCCCATTGTGGGGAATTCAGCGGGATTTCAACCGCGTCGAACCCGGCGTCAATCACCGCGCCAACATGCGCCAGCGCCTCGTCGGGCGTAATACCGCGCAAAATGGCGATCAGCGGGAGTTTAGTTTGCCACTGCATGAGGATGCTCCTTATACCAGCCTGAAATGCCGTGTCGCCCGCCACCGCCGTCACGTCGCAACCCATCGCCTGAAAGGCTTGCTGGTAGCGCGCGGTCAGCGATGTTCCGGCGACAAGGGTGATGGCGTGTTGATGGGCCACATAGTCGCGCATACTGGGACCTCTGCGCCAATCAACAAACCAGAGAGAAATTCGCTGACCTGTTCGCGGGGAAGTGTTCCCAGCACATGCGAGGCGCGAACTTCAAAAAGCTGCGGCAATATGGCGGGCGTATTAAGACCACGCTCAAGGCCAGCTGTGAAGGCATCGGCAGGTTTTCCTGCGGCGGCAAACCTGCGCCAATCAATGAGTGATTTAACAGTAAATGATGTAATTCACCGGTCATCACGGTGCGAAAATCGTTGATTTGCTGGCTATCGGCCTGCACCCATTTGCAATGGGTTCCGGGCATGACATAAAGAGAGGAAGAGCCAGAGCTCGCGCGCCGATCAATTGTGTTTCTTCGCCGCGCATCACATTGTGGTTATCGTCATGAGAGACACATAATCCGGGAATAATCCAGATATTGTCGCCAACTGACGTTAATTGTTCGCCAATAGACGAAAAACAGGCAGGAACAGATAATACGGTGCAACTTTCCAGCCGACGTTGCTGCCAACCATTCCTGCCATTACCACTGGCGTTTTCTCTTCACGCCAGTCGGTCGTGACTTCTGCTAACACCGCAGCCGGAGATTTTCCGTTCAGGCGCGTGACGCCTGCTTCTGATTGCCTGCTCTCAGGCAGTGGTCGCCCTGATAAAGCCAGGCGCGCAGATTGGTCGATCCCCAGTCAATTGCGATGTAGCGAGCTGTCATGTGATTTCCTTTAACCTTCGTGTCGAGCTGGCGATCATGGTAAGCGCCGCCTGCTCTGCCGCATCGCCGTCCTGATGCGTATCGCATCGAACAGCGCCTTATGTTCCTGGAGCGTTTGCGGCATGTTGGCCTCATCGCCCATCCAGGTTCGTTCAAAAACCGCCCGCTGCAGCGAACTGATCGCAATGCTAAGTTGCTGTAACACCGGGTTATGCACCGACTGCAGCACCGCTCGTGGTAGCGAATATCCGCTTCGTTAAACGCTTCGCGGTCCTGATTGTTGGCAATCATCTCGTTCAGCGCCGATTCAATCTGCGCCAGATCGCTGGAAGTCGCGCGCTCTGCTCCCAACGGGCAATCGCCGGTTCCACCAGATTTCGCACTTCGTCATGGCACTGATAAGCCGTGGGTCGTAGTCATTTTCCAGCACCCATTGCAGTACGTCAGTGTCGAGGTAATTCCACTGGTTACGCGGTGCCACAAACGCCCCGCGATAACGTTTCATTTCAATCAGCCGCTTCGCCATCAGCGAACGGAACACCCACGGATGATGTTGCGCGAG
GTTGCAAACTCCTCACAGAGTTCCGCCTCAGCCGGAAGCGGCGAGCCTGGCACGTATTTGCCGTGAACGATCTGTTTACCCAGCGTAATGACAATGCGATCGGTTTTATTGAGAGTCATGGAGAGTCCTTGTGCTTGTATGTTCTTCTCTACTTTACCCCGATCGATGCATAACGCGGCAACTTTGTAGTACCAGCGTGATGACGTTCGCGTTTGCCGTGCGTGTAATGTAGTACAAACTTATATTGTTGTACTACAATTTAGATCACAAAAAGAACAATGCATAAAAAATGACATGCGTCGGGCAGAAATCTGAAAAGGGATATCAGGCGCTAAACAGGAGGGAAAGAAGAGTATGCTTTCAACGGCTTAGCTACTCGTTTAAAGGATTAATCATGAAGTTGAATTTTAAGGGATTTTTTAAGGCTGCCGGTTTATTCCCACTGCGCTGATGCTTTCAGGCTGTATCTCGTATGCTCTGGTTTCCCATACCGCAAAGGGTAGTTCAGGAAAGTATCAATCGCAGTCAGACACCATCACTGGGCTATCGCAGGCAAAAGATAGTAATGGAACAAAAGGCTATGTTTTTGTAGGGGAATCGTGGATTACCTTATCACTGATGGTGCCGATGACATCGTTAAGATGCTCAATGATCCAGCACTTAACCGGCACAATATTCAGGTTGCCGATGACGCAAGATTTGTTTTAAATGCGGGGAAAAAGAAATTTACCGGCACAATATCGCTTTACTACTACGGAATAACGAAGAAGAAAAGGCACTGGCAACGCATTATGGTTTTGCCTGTGGTGTTCAACACTGTACCAGGTCACTGGAAAACCTAAAAGGCACAATCCATGAGAAAAATAAAAACATGGATTACTCAAAGGTGATGGCGTTCTACCATCCATTTAAGTGCGATTTTATGAATACTATTCACCCAGAGGCATTCCGGGATGGTGTTTCCGCAGCATTACTGCCAGTGACTGTTACGCTGGACATCATTACTGCACCGCTGCAATTTCTGGTTGTATATGCAGTAAACCAATAATCAGTAAGCGGGCAAACCGTTTATGCTGTTTGCCCGCCCACAGATTAATTCAGCACATACTTCTCAATAGCAAACGCCACGCCATCTTCAAGGTTAGATTTGGTGACAAAGTTCGCCACTTCTTTCACTGAAGGAATAGCGTTATCCATCGCCACACCGACGCCTGCATATTAATCATTGCGATATCGTTTTCCTGATCGCCAATCGCCATGATTTCTTCCGGTTTAATACCTAACACGTCGGCCAGTGATTTCACCCCCGTACCTTTGTTAACGCGTTTATCGAGGATTTCGAGGAAGTACGGCGCACTTTTCAGCACGGTATATTCTCTTTCACTTCCTGCGGAATACGCGCGATAGCCTGGTCGAGGATGGCGGGTTCATCAATCATCATCACTTTCAGGAACTGGGTATTGGGGTCCATTTTCTCCGCTTCGCAGAACACCAGCGGAATGGTGGCAACGAAGGATTCATGCACCGTGTGTAGCTGATATCACGGTTGGCGGTGTACAGCGTGGTGCGGTCCAGGGCGTGGAAATGAGAACCGACTTCGCGAGAGAGTTTTTCCAGGAAACGATAGTCGTCATAGCTGAGAGCAGTTTGCGCCACGGTGCTACCATCAGCGGCCTTCTGTACCACGCGCCGTTATAAGTAATGCAGTAGTCGCCCGGCTGTTCCATATGCAGCTCTTTCAGGTAGTTGTGCACACCTGCATACGGGCGACCCGTCGTTAGCACGACATTCACGCCACGGGCGCGAGCTGCGGCAATCGCATTTTTAACGGCGGGTGAAAGGTGTGATCGGGCAGCAGAAGGGTGCCATCCATATCGATAGCAATGAGTTTAATAGCCATGAGTTCCCCAGGTAGATTGGTTCCTGACCCATGCTAACGCGATTCCGCTCAAAAATCAGTACAACACCCGAGGGAAAAGGGGGATGCAACGCGCGTGCGT
GCTCCCTTTTTGCTTAGCGGAAGAGTTTCCCTTTCAGCAGTTCCATGCCTGCGGAAAGCAGATCGTTATTGGCTTGTGGTGACACTTCACCTTGCGGTGAGAGCGCATCAATAATCTTCGGCAATTGTTCTGCCAGTAAACTGGAAGCTGACTGGTATCCACGCCAAGTTTTTGCCCGAGATCGGACACCGCATTTGTGCCGAGCGCCGATTCCAGTTGCTCGCCACTAACCGATTGATTGCCCTGTTGATTACTCAGCCAGGTTGAGAGAATGGCCCCTAAGCCGCCACTTTGCAGTTTTTCCACAGCACCTGAATGCCGCCCTGCTCCTCAACCCAACTTAAAATAGCCTGATATTTCCCCGCATCGCCTTTCAGAAAGGCACCGACAACTTCATCAAAAAGCCCCATGATAATCACCTGTAAAGCGTTACGTGTTGACCCAAAAAGTATAGATTTGCGGATGATAATTGCGGATTGCAGAAATAAAAAGGGCGGAGATGATCTCCGCCCTTTTCTTATAGCTTCTTGCCGGATGCGGCGTGAACGCCTTATCCGGCCTACAAAATCATGAAAATTCAATACATTGCAAGATTTTCGTAGGCCTGATAAGCGTGCGCATCAGGCACGCTCGCATGGTTAGCGCCATTAAATATCGATATTCGCCGCTTTCAGGGCGTTCTCTTCAATAAACGCACGGCGCGGTTCAACGGCGTCGCCCATCAGCGTGGTGAACAACTGGTCGGCAGCAATCGCATCTTTAACGGTAACCGCAGCATACGACGACTTTCCGGGTCCATAGTGGTTTCCCACAGCTGTTCCGGGTTCATCTCGCCCAGACCTTTATAACGCTGGATGGAGAGGCCGCGACGGGACTCTTTCACCAGCCAGTCCAGCGCCTGCTCGAAGCTGGCTACCGGCTGACGCGCTCGCCACGTTCGATAAACGCATCTTCTTCCAGCAAGCCACGCAGTTTCTCACCCAGCGTGCAGATACGACGATATTCGCCACCGGTGATAAACTCGTGATCCAGCGGATAGTCAGTATCCACACCGTGGGTACGCACGCGAACAATCGGCTCAACAGGTTTTGCTCAGCATTGGTGTGAACATCAAACTTCCACTGGCTGCCGTGCTGTTCTTTGTCGTTCAGTTCGCTGACCAGCGCGTTCACCCAGCGGGTAACGGTCTGCTCATCAGAAAGGTCAGCTTCCGTCAACGTCGGCTGATAGATAAGTCTTTCAGCATTGCTTTCGGATAACGACGCTCCATACGATTGATCATTTTCTGCGTCGCGTTGTACTCAGATACCAGTTTCTCTAACGCTTCGCCAGCCAATGCCGGTGCACTGGCGTTGGTGTGCAGCGTTGCGCCGTCCAGCGCGATAGAGATTGGTACTGATCCATCGCTTCGTCGTCTTTAATGTACTGTTCCTGCTTGCCTTTCTTCACTTTGTACAGCGGCGGCTGAGCGATGTAGACGTGACCGCGTTCAACGATTTCCGGCATCTGACGATAGAAGAAGGTCAACAGCAGCGTACGAATGTGGAGCCGTCGACGTCCGCATCGGTCATGATGATGATGCTGTGATAACGCAGTTTGTCCGGGTTGTACTCGTCACGACCGATACCACAGCCAAGCGCGGTGATAAGCGTCGCCACTTCCTGAGAAGAGAGCATCTTATCGAAGCGCGCTTTCTCGACTTGAGGATTTTACCCTTCAGCGGCAGAATCGCCTGGTTCTTGCGGTTACGCCCCTGCTTCGCAGAGCCGCCCGCGGAGTCCCCTTCCACCAGGTACAGTTCGGAAAGCGCCGGATCGCGTTCCTGGCAGTCTGCCAGTTTGCCCGGCAGGCCCGCAAGTCGAGCGCACCTTTACGGCGGGTCATTTCACGCGCGCGACGCGGCGCTTCACGGGCACGGGCAGCATCGATAATTTTGCCAACCACGATTTTCGCGTCGGTTGGGTTTTCCAGCAGGTATTCTGCCAGCAGTTCGTTCATCT
GCTGTTCAACGCCGATTTCACCTCAGAAGAAACCAGTTTGTCTTTGGTCTGGGAGGAGAATTTCGGGTCCGGCACTTTCACGGAAACGACCGCAATCAGGCCTTCACGCGCATCGTCACCGGTGGCGCTGACTTTGGCTTTTTTGCTGTAGCCTTCTTTGTCCATTAGGCGTTCAGGGTACGGGTCATCGCCGCACGGAAGCCTGCCAGGTGAGTACCGCCGTCACGCTGCGGAATGTTGTTGGTAAAGCAGTAGATGTTTTCCTGGAAGCCATCGTTCCACTGCAACGCCACTTCGACGCCAATACCGTCTTTTTCAGTGAGAAGTAGAAGATATTCGGGTGGATCGGCGTTTTGTTCTTGTTCAGATATTCAACGAACGCCTTGATGCCGCCTTCATAGTGGAAGTGGTCTTCTTTGCCGTCGCGCTTGTCGCGCAGACGAATGGAAACGCCGGAGTTGAGGAACGACAACTCCGCAGACGTTTCGCCAGAATTTCATATTCGAACTCGGTCACATTGGTGAAGGTTTCGAGGCTGGGCCAGAAACGCACCATGGTGCCGGTTTTTTCAGTCTCGCCGGTAACCGCCAGCGGGGCCTGCGGTACACCGTGTTCGTAGATCTGACGGTGATTTTACCCTCGCGCTGGATAACCAGCTCCAGTTTTTGCGACAGGGCGTTTACTACCGAAACACCAACGCCGTGCAGACCGCCGGACACTTTATAGGAGTTATCGTCAAATTTACCGCCTGCGTGCAGAACGGTCATGATCACTTCCGCCGCCGA

…to this?

FT   gene            complement(9299..10702)
FT                   /db_xref="GenBank:2367266"
FT                   /gene="dnaA"
FT                   /note="b3702"
FT   CDS             complement(9299..10702)
FT                   /db_xref="GI:2367267"
FT                   /db_xref="PID:g2367267"
FT                   /function="putative regulator; DNA - replication, repair,
FT                   restriction/modification"
FT                   /codon_start=1
FT                   /protein_id="AAC76725.1"
FT                   /gene="dnaA"
FT                   /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
FT                   FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
FT                   QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
FT                   AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
FT                   KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
FT                   LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
FT                   NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
FT                   SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
FT                   SNLIRTLSS"
FT                   /product="DNA biosynthesis; initiation of chromosome
FT                   replication; can be transcription regulator"
FT                   /transl_table=11
FT                   /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
FT                   CG Site No. 851"

Or this?

An ORF is not a CDS!

An ORF is just an open reading frame

There are many more ORFs than protein coding genes (CDSs) in a genome

Figure labels: non-coding ORFs; CDSs (note ORF can extend upstream of start codon)
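The distinction is easy to see in code. The sketch below lists ORFs in all six reading frames, taking an ORF to be any stretch of complete codons with no stop codon, whether or not it contains a start codon or encodes a protein. The function names, the coordinate convention and the minimum length of 20 codons are choices made for illustration; deciding which ORFs are real CDSs needs further evidence such as homology.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=20):
    """Return (strand, frame, start, end) for each stop-free stretch.

    Coordinates are 0-based, end-exclusive and relative to the strand scanned.
    Only stretches of at least min_codons complete codons are reported.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            last = frame + ((len(s) - frame) // 3) * 3  # end of last full codon
            for i in range(frame, last, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i))
                    start = i + 3
            # Stretch still open when the sequence runs out.
            if (last - start) // 3 >= min_codons:
                orfs.append((strand, frame, start, last))
    return orfs


seq = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACC"
       "CGCAACGATGTCTCCGACAGCGAGAAA")
for orf in find_orfs(seq):
    print(orf)
```

Lowering min_codons increases the number of ORFs reported, which is why a genome contains many more ORFs than genuine protein-coding genes.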

The Problem of Frameshift Errors

Actual sequence

ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
Frame 1: M S T A K L V K S K A T N L L Y T R N D V S D S E K
Frame 2: • V P L N • L N Q K R P I C F I P A T M S P T A R K
Frame 3: E Y R • I S • I K S D Q S A L Y P Q R C L R Q R E K

Frameshifted sequence after single base error

ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
Frame 1: M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
Frame 2: • V P L N • L N Q K A T N L L Y T R N D V S D S E K
Frame 3: E Y R • I S • I K K R P I C F I P A T M S P T A R K
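The effect of a single extra base can be reproduced by translating both versions of the sequence. The sketch below builds the standard codon table from the usual compact amino-acid string; stop codons appear as * where the slide uses •. The names are illustrative, and the insertion point (an extra A after the first 30 bases) is inferred by comparing the two sequences shown.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate complete codons in frame 1; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


original = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACC"
            "CGCAACGATGTCTCCGACAGCGAGAAA")
shifted = original[:30] + "A" + original[30:]  # one spurious base

print(translate(original))
print(translate(shifted))
```

The first ten residues agree and everything downstream of the error changes, so one sequencing error can ruin a gene prediction.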

Homology

Similarities in form (sequence) allow us to infer similarities in “meaning” (structure and function)

Homology is not just sequence similarity

Two sequences can be similar without any common ancestry, particularly if low complexity

the cat sat on the mat
die Katze sass auf der Matte

vge|GBant88-2   ITLITCVSVKDNSKRYVVAG
vge|GEfae9-178  LTLITCDQATKTTGRIIVIA
vge|GSpne1-403  MTLITCDPIPTFNKRLLVNF
sortase_staur   LTLITCDDYNEKTGVWEKRK

Types of Homology

Homologues can be divided into

Orthologues: lines of descent congruent with whole genome

Paralogues: result of gene duplication

Xenologues: result of HGT

Homology Searches

The aim of homology searches is to identify sequences within these databases that are homologous to your sequence.

This involves comparing your sequence with all the database sequences, looking for stretches of sequence that appear to be similar, then scoring the matches and ranking them

A measure of the significance of the match is given

Most common program used for homology searches is BLAST
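The "compare, score, rank" idea can be illustrated with a deliberately crude score: percent identity over an ungapped alignment of equal-length segments. This is not how BLAST works (BLAST finds local alignments, scores them with substitution matrices and reports statistical significance); it is only a stand-in for the scoring step. The sequences are the aligned segments shown on the homology slide; the function name and the choice of query are assumptions.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


query = "LTLITCDDYNEKTGVWEKRK"  # sortase_staur segment
database = {
    "vge|GBant88-2": "ITLITCVSVKDNSKRYVVAG",
    "vge|GEfae9-178": "LTLITCDQATKTTGRIIVIA",
    "vge|GSpne1-403": "MTLITCDPIPTFNKRLLVNF",
}

# Score every database entry against the query, then rank best-first.
ranked = sorted(
    database.items(),
    key=lambda item: percent_identity(query, item[1]),
    reverse=True,
)
for name, sequence in ranked:
    print(f"{name}\t{percent_identity(query, sequence):.0f}%")
```

Identity alone gives no measure of significance, which is why real search tools also report how likely a match of that score is to arise by chance.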

Bacterial Genome Dynamics

Gene Loss Gene Gain

Rapid emergence of genetically uniform pathogens from variable ancestral populations

Drastic downsizing in isolated intracellular niches

Accumulation of pseudogenes and IS elements after shift to new niche

Gene Duplication

Horizontal gene transfer by phage, plasmids, pathogenicity islands

single nucleotide polymorphisms (SNPs)

Recombination and rearrangements

Gene Change


Horizontal gene transfer

Horizontal (or lateral) gene transfer denotes any transfer, exchange or acquisition of genetic material that differs from the normal mode of transmission from parents to offspring (vertical transmission).

Figure labels: vertical gene transfer; horizontal gene transfer

Bacterial mobile genetic elements

Transposons

pieces of DNA that act as ‘jumping genes’, changing location on the chromosome or a plasmid

encode transposase that catalyses the transposition event

can carry resistance or virulence genes

Insertion sequences (IS elements)

transposable elements that encode only the transposase

multiple copies of same IS within genome provide targets for homologous recombination, rearrangements and replicon fusions

Conjugative transposons

normally integrated into the chromosome

excise, then are transferred to recipient cells by conjugation

Bacterial mobile genetic elements

Plasmids

self-replicating extrachromosomal replicons

usually circular but can be linear

can carry resistance or virulence genes

Bacteriophages

bacterial viruses

can carry virulence genes

can insert into bacterial chromosome as prophages (lysogeny)

Integrons

complex natural cloning and gene expression systems

able to capture promoterless gene cassettes by site-specific recombination

allow formation of large arrays of gene cassettes

transferred as a whole between different replicons.

Genomic islands

large chromosomal regions, part of the flexible gene pool

previously transferred by other mobile genetic elements

present in some bacteria but absent in close relatives

carry multiple genes that increase phenotypic versatility

contribute to dynamic character of bacterial chromosomes and can be excised from the chromosome and transferred to other recipients

pathogenicity islands contain dozens of genes that allow quantum leap to complex new virulence phenotype

Core genomes and Pangenomes

Core genome: pool of genes shared by all members of a bacterial species

Accessory or dispensable genome: pool of genes present in some but not all genomes within the same bacterial species

Pangenome: global gene repertoire of a bacterial species, comprising core genome + accessory genome

Metagenome: global gene repertoire of a mixed microbial population
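These definitions map directly onto set operations: the core genome is the intersection of the gene sets, the pangenome is their union, and the accessory genome is the difference between the two. The strain names and gene contents below are invented for illustration, and the sketch assumes genes have already been grouped into families shared across strains.

```python
def core_genome(genomes):
    """Genes present in every genome."""
    return set.intersection(*genomes.values())


def pangenome(genomes):
    """Genes present in at least one genome."""
    return set.union(*genomes.values())


def accessory_genome(genomes):
    """Genes present in some but not all genomes."""
    return pangenome(genomes) - core_genome(genomes)


# Hypothetical gene content for three strains of one species.
genomes = {
    "strain_A": {"dnaA", "recA", "gyrB", "stx2"},
    "strain_B": {"dnaA", "recA", "gyrB", "aggR"},
    "strain_C": {"dnaA", "recA", "gyrB", "stx2", "aggR", "blaCTX-M"},
}

print("core:", sorted(core_genome(genomes)))
print("accessory:", sorted(accessory_genome(genomes)))
print("pangenome size:", len(pangenome(genomes)))
```

As more genomes are added, the core can only stay the same or shrink, while the pangenome can only stay the same or grow.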

Escherichia coli Core and Pan-genomes

Welch et al. Proc Natl Acad Sci U S A. 2002 Dec 24;99(26):17020-4

Metagenomics

Environmental shotgun sequencing

DNA extracted from mixed microbial communities is sequenced en masse

Assembled into contigs

Typically only small contigs can be obtained

Uses of a genome sequence

Gene discovery

Fuelling hypothesis-driven research on pathogen biology

Comparative genomics

SNP discovery and genomic epidemiology

Functional genomics: transcriptomics, proteomics, interactome, structural genomics, mass mutagenesis

Haemolytic-uraemic syndrome

Shiga-toxin-producing E. coli (STEC)

bloody diarrhoea; damage to kidneys and brain

anaemia; loss of platelets

German E. coli O104:H4 outbreak

May-July 2011

>4000 cases

>40 deaths

Link to sprouting seeds

High risk of haemolytic-uraemic syndrome

Females particularly at risk

Frank et al DOI: 10.1056/NEJMoa1106483

Take-away messages from the genome

Pathogens don’t bother with passports!

Not a new strain: something similar seen in Germany ten years ago and in Korea

closest genome-sequenced strain was isolated from Central African Republic in late 1990s, belongs to an enteroaggregative lineage

German STEC probably comes from a lineage circulating in human populations rather than from an animal source (unlike E. coli O157)

Take-away messages

Bacteria evolve quickly

Virulence factors in E. coli can jump from one lineage to another on mobile genetic elements

Pathotypes can overlap and evolve

Antibiotic resistance seen where no obvious prior use of antibiotics

Take-away messages from genome sequence

Genome sequencing brings the advantages of open-endedness (revealing the “unknown unknowns”), universal applicability and the ultimate in resolution

Bench-top sequencing platforms now generate data sufficiently quickly and cheaply to have an impact on real-world clinical and epidemiological problems

Comprehensive Coverage of Human Microbiome

Comprehensive coverage of tree of life

What will you do when you can sequence everything?
