BM405 Lecture Slides 21/11/2014 University of Strathclyde

Microbial Genomics and Bioinformatics
BM405
1. Introduction

Leighton Pritchard1,2,3

1Information and Computational Sciences, 2Centre for Human and Animal Pathogens in the Environment, 3Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA

Acceptable Use Policy

Recording of this talk, taking photos, discussing the content using email, Twitter, blogs, etc. is permitted (and encouraged), provided distraction to others is minimised.

These slides will be made available on SlideShare.

These slides, and supporting material including exercises, are available at https://github.com/widdowquinn/Teaching-2014-11-21-Strathclyde


Table of Contents

Introduction
• A personal view
• Erwinia carotovora subsp. atroseptica
• Dickeya spp., Campylobacter spp., and Escherichia coli
• So what’s changed?

High Throughput Sequencing
• Four dominant technologies
• Benchmarking
• Nanopore
• How fast is sequence data increasing?

The endpoints

• 2003: Erwinia carotovora subsp. atroseptica

• 2014: Dickeya spp., Campylobacter spp., and Escherichia coli




2003: E. carotovora subsp. atroseptica

• £250k collaboration between SCRI, University of Cambridge, WT Sanger Institute

• Single isolate: E. carotovora subsp. atroseptica SCRI1043

• The first sequenced enterobacterial plant pathogen (32 authors!)1

• All repeats and gaps bridged and sequenced directly

• Result: a single, complete, high-quality 5Mbp circular chromosome at 10.2X coverage: 106,500 reads

1Bell et al. (2004) Proc. Natl. Acad. Sci. USA 101(30):11105-11110. doi:10.1073/pnas.0402424101


A genome sequence is a starting point. . .

• Manual annotation by the Sanger Pathogen Sequencing Unit

• Literature searches and comparisons

• Six people, for six months ≈ three person-years

• Genes: BLAST, GLIMMER, ORPHEUS

• Functional domains: PFAM, SIGNALP, TMHMM

• Metabolism: KEGG

• ncRNA: RFAM


Working (Eca_Sanger_annotation.gbk) and published (NC_004547.gbk) annotation files are in the data directory


Compared against all 142 available bacterial genomes2

2data/Pba directory in the accompanying GitHub repository





2013: Dickeya spp.

Sequenced and annotated 25 new isolates of Dickeya

• 25 Dickeya isolates, at least six species

• Multiple sequencing methods: 454, Illumina (SE, PE)

• Minor publications (6, 8 authors)3,4

• Results: 12-237 fragments containing 4.2-5.1Mbp, at 6-84X coverage, 170k-4m reads

• Automated annotation: RAST with manual corrections

3Pritchard et al. (2013) Genome Announc. 1(4) doi:10.1128/genomeA.00087-12

4Pritchard et al. (2013) Genome Announc. 1(6) doi:10.1128/genomeA.00978-13

2013: Dickeya spp.

Within-genus comparisons: large-scale synteny and rearrangement

Within-species comparisons: e.g. indels, HGT

2013: Dickeya spp.

Within-genus comparisons: whole genome-based species delineation5

5van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0

2013: Dickeya spp.

Within-genus comparisons: differences in metabolism

2014: E. coli

Sequenced and annotated ≈ 190 isolates of E. coli. All bacteria environmental, sampled from lysimeters

• Illumina paired-end sequencing. Total cost of sequencing 190 bacteria: ≈£11k

• Automated annotation: PROKKA

2014: E. coli

Sequencing output variable - even though same preps, “same” bacteria, similar sources.

• Results: 5-3000 contigs (median ≈ 125); 9kbp-7.1Mbp (median ≈ 5Mbp); 170k-4m reads

2014: E. coli

Genome sequencing enables within-species classification

[Figure: clustered heatmap of pairwise ANIm values between isolate assemblies (row and column labels such as Lys142_contigs, AW3_contigs, DSM10973_contigs, Brunei20070942_contigs). Colour key and histogram of counts span ANIm values 0.90-0.98. Group legend: A, B1, B2, C, D, E, F, U, X]

2014: Campylobacter spp.

Sequenced ≈ 1034 isolates of Campylobacter. Clinical, animal, food-associated isolates

• Illumina paired-end sequencing. Total cost of sequencing >1000 bacteria: ≈£60k

• Automated annotation: PRODIGAL

2014: Campylobacter spp.

• Identified 15554 gene families from gene calls.

• To calculate, took 23 days on institute cluster (4e12 pairwise protein comparisons!).




So what’s changed?

• Cost: £250k per genome, to £60 per genome. Now cheaper to sequence a genome than to analyse it!

• Location: sequencing centre, to benchtop

• Data: volume has increased massively - what you get back from machines, and what’s out there to work with. More data is better, but also more challenging.

• Speed: typical sequencing run time can be less than a day

• Software: more software to do more things (but not always better. . .)

• New kinds of experiment: genomes, exomes, variant calling, methylated sequences, . . .

• New kinds of application: diagnostics, epidemic tracking, metagenomics, . . .

So what’s changed?

Having a single genome is useful, but having thousands really helps comparative genomics: combining genomic data, evolutionary and comparative biology

• Transfer functional understanding of model systems (e.g. E. coli) to non-model organisms

• Genomic differences may underpin phenotypic (host range, virulence, physiological) differences

• Genome comparisons aid identification of functional elements on the genome

• Studying genomic changes reveals evolutionary processes and constraints




Not all sequencing is the same

It’s all about the biology, but it all starts with the data. Sequencing technology (including library prep.) affects your sequence data.

• Roche/454

• Illumina

• Ion Torrent

• Pacific Biosciences (PacBio)

The basic principle

DNA source is fragmented, and the fragments are sequenced.

PE vs SE

Reads may be single-end, or paired-end.

Putting the jigsaw back together is sequence assembly.

Four different chemistriesa

aLoman et al. (2012) Nat. Rev. Micro. 31:294-296 doi:10.1038/nbt.2522

Reads differ by technology, and may require different bioinformatic treatment. . .

• Roche/454: Pyrosequencing (long reads, but expensive, and high homopolymer errors) (700-800bp, 0.7Gbp, 23h)

• Illumina: Reversible terminator (cost-effective, massive throughput, but short read lengths) (2x150bp, 1.5Gbp, 27h)

• Ion Torrent: Proton detection (short run times, good throughput, high homopolymer errors) (200bp, 1Gbp, 3h)

• PacBio: Real-time sequencing (very long reads, high error rate, expensive) (3-15kbp, 3Gbp/day, 20min)

. . . different error profiles, varying capability to assemble/determine variation

Costs of sequencinga

aMiyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699





Benchmarked performance

Apply several sequencing technologies to the same sample(s). Benchmark comparisons inform appropriate choice of sequencing technology6,7,8,9,10,11,12

Progress in technologies is driving research very rapidly. Always look for most recent/relevant benchmarks.

Bioinformatic methods also need to be benchmarked.

6Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699

7Salipante et al. (2014) Appl. Environ. Micro. 80:7583-7591 doi:10.1128/AEM.02206-14

8Frey et al. (2014) BMC Genomics 15:96 doi:10.1186/1471-2164-15-96

9Koshimizu et al. (2013) PLoS One 8:e74167 doi:10.1371/journal.pone.0074167

10Quail et al. (2012) BMC Genomics 13:341 doi:10.1186/1471-2164-13-341

11Loman et al. (2012) Nat. Biotech. 30:434-439 doi:10.1038/nbt.2198

12Lam et al. (2011) Nat. Biotech. 1 (6) doi:10.1038/nbt.2065




Benchmarking on Vibrioa


• Sequenced Vibrio parahaemolyticus (2x chromosomes, closed reference genome) with four technologies

• Chose an assembler for each tech, and assembled reads

• Excess reads with Ion/MiSeq: used random subsets of reads to determine required coverage

• Aligned assemblies (MUMmer) to known high-quality chromosome sequence, to measure error



De novo assembly and alignment against Vibrio parahaemolyticus(2x chromosomes)




• More and longer reads do not always give the best assemblies: read depth, read distribution, and error rate also matter

• Optimal assemblies were obtained at around 60x-80x coverage, for Illumina and Ion.

• Multiple rRNA regions are fragmented in short-read assemblies

• PacBio generated single chromosome contigs

• Assembly of multiple-chromosome bacteria is currently feasible

Variability in published genomes as methods are not standard (e.g. sequencing technology, assembler, parameter settings and pre-processing). . .




What’s coming next?

Oxford Nanopore. A sequencer the size of your hand.

• Microfluidics, single-molecule sequencing; 11-70kbp reads

• Reports current across pore (tiny electron microscope) as molecule moves through

• $10/Mbp, 110Mbp per flowcell13

13Yaniv Erlich (2013) Future Continuous blog

http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion

Controversya

aMikheyev and Tin (2014) Mol. Ecol. Res. 14:1097-1102 doi:10.1111/1755-0998.12324

The first Nanopore paper concluded, for λ phage:

• About 10% of reads mapped to the reference genome

• <1% of all generated sequence faithfully matches the reference


Not everyone thinks the Mikheyev and Tin paper is very good:

“But that paper is terrible. It’s just lazy.” (Mick Watson’s blog: http://biomickwatson.wordpress.com/2014/09/07/thoughts-on-oxford-nanopores-minion-mobile-dna-sequencer/)




New dataa

aQuick et al. (2014) GigaScience 3:22 doi:10.1186/2047-217X-3-22

It’s a fast-moving area, and results are improving.


New tools

Oxford Nanopore’s open beta went out without analysis tools. Tools (Poretools, poRe, etc.) are being written/tested/validated by the user community14,15

14Loman and Quinlan (2014) Bioinformatics doi:10.1093/bioinformatics/btu555

15Watson et al. (2014) Bioinformatics doi:10.1093/bioinformatics/btu590






After that, the flood. . .

High-throughput sequencing methods have completely changed the landscape of microbiology. (Nearly) complete, (mainly) accurate sequence data is now inexpensive (and cheaper than analysis)

• GOLD (19/2/2014): 3,011 “finished”; 9,891 “permanent draft” genomes

• GOLD (18/11/2014): 6,649 “finished”; 23,552 “permanent draft” genomes

• NCBI WGS (19/2/2014): 17,023 microbial genomes

• NCBI WGS (18/11/2014): 26,026 microbial genomes

http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Complete+Genome+Projects&subset_requested=Complete+And+Published


http://www.ncbi.nlm.nih.gov/Traces/wgs/


Predicting the future is hard. . .

Su et al. attempted to answer this16:

16http://sulab.org/2013/06/sequenced-genomes-per-year/


Licence: CC-BY-SA

By: Leighton Pritchard

This presentation is licensed under the Creative Commons Attribution ShareAlike license https://creativecommons.org/licenses/by-sa/4.0/

Microbial Genomics and Bioinformatics
BM405
2. Assembly









What do you get from sequencing

Sequence reads. Usually lots of them. Size/number/errors depend on technology used.

1

1Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699


Sequence Read Data Formats

Three common sequence read data formats:

• FASTQ: Related to FASTA, a de facto standard for sequence reads

• SAM/BAM: Sequence alignment/mapping format, two flavours - uncompressed and compressed

• CRAM: Reference-based sequence compression

You might also receive assembled genomes directly from a sequencing partner

Table of Contents

Sequence Data Formats
• FASTQ
• SAM/BAM/CRAM
• Repositories

Assembly
• Overlap-Layout-Consensus
• de Bruijn graph assembly

Read Mapping
• Short-Read Sequence Alignment

The Assembly
• What you get back

FASTQa

aCock et al. (2009) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137

@HISEQ2500-09:168:HA424ADXX:2:1101:1404:2061 1:N:0:ATCTCTCTCACCAACT

CGGTCTTGGGATAGATGGGTTGCAGGTTGCGGTAAAGCTCGGACTCCAGAGCGTCCAGGGTAGACTGGCTAATCTTCTGCTCTTTATCGATCATTATTTC

+

@@CBDDFFHHDFDHEGHIICGIFHHIIIIFHGGHIEHHIIIIGHGHIIIIIGGHHFFFFC@CBCCCDDBDCDDDDDDDDCCDDDD3@ABDDDDDEEEDE@

Files typically have .fq, .fastq extension. Four lines per sequence

1. Header: sequence identifier and optional description, starts with “@”

2. Raw sequence ([ACGTN])

3. Optional header, repeats line 1, starts with “+”

4. Quality scores, numbers encoded as ASCII. $Q_{\mathrm{phred}} = -10 \log_{10} e$, where $e$ is the estimated probability that a base call is incorrect (like a pH).
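The relationship between Phred scores, error probabilities and the ASCII quality line can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the five-base record are made up for the example, and an offset of 33 (Sanger encoding) is assumed.

```python
import math


def phred_to_error(q):
    """Estimated probability that a base call is wrong, given Phred score q."""
    return 10 ** (-q / 10)


def error_to_phred(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def decode_qualities(quality_line, offset=33):
    """Convert the fourth line of a FASTQ record into integer Phred scores.

    offset=33 assumes Sanger encoding; other FASTQ variants use other offsets.
    """
    return [ord(char) - offset for char in quality_line]


# A made-up four-line record (not real data), assumed to be Sanger-encoded.
record = ["@example_read", "ACGTN", "+", "II5+!"]
scores = decode_qualities(record[3])

for base, q in zip(record[1], scores):
    print(base, q, phred_to_error(q))
```

A score of 20 corresponds to an estimated 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.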


Quality Control

The quality of basecalls (error rate) varies between and along reads.

(real data from our E.coli sequencing: good quality)

Quality Control

Some datasets are better than others.

Reads can be trimmed, or discarded. Including poor reads compromises assembly.

FASTQ encoding

There is more than one version of FASTQ; versions differ by quality encoding. Quality scores are converted to ASCII characters starting at different values.

Versions vary by sequencer and period. Most have now settled on Sanger format. Quality scores (Qphred) are offset so that the ASCII codes lie in the given range:

1. Sanger: 33-126, used in SAM/BAM, and Illumina 1.8+

2. Illumina 1.0-1.2: 59-126

3. Illumina 1.3-1.7: 64-126

Knowing where your data comes from, and the data format and version, is always important.
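As a rough sketch of why the encoding matters, the ASCII ranges above can be used to guess which offset a file uses. The function name and decision rule here are illustrative assumptions, not a standard tool: any character below ASCII 59 can only occur in Sanger-style data, but data containing only higher characters is genuinely ambiguous, so the sequencer and pipeline version should still be checked.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset of FASTQ quality strings from their lowest character.

    Returns 33 if any character falls below ASCII 59 (only possible in the
    Sanger range 33-126). Otherwise returns 64, which is a guess only:
    high-quality Sanger-encoded data can look the same.
    """
    lowest = min(ord(char) for line in quality_lines for char in line)
    if lowest < 59:
        return 33
    return 64


# Tail of the quality line from the FASTQ example record: contains "3" (ASCII 51).
print(guess_quality_offset(["3@ABDDDDDEEEDE@"]))

# Made-up quality strings using only high ASCII characters.
print(guess_quality_offset(["hhhh", "ffff"]))
```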







SAMa

ahttps://github.com/samtools/hts-specs

Intended to represent read alignments, also used for raw reads. Tab-delimited plain text. Headers (optional) start with “@”
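Because SAM is tab-delimited text, a record can be split into its eleven mandatory fields with very little code. The sketch below is illustrative only: the `SamRecord` tuple, the function names and the example lines are invented for this example, and real work would normally use an established library rather than a hand-written parser.

```python
from collections import namedtuple

# The eleven mandatory SAM columns, followed by any optional TAG:TYPE:VALUE fields.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


def read_sam(lines):
    """Yield SamRecords, skipping header lines (which start with '@')."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)


# Made-up example: two header lines and one aligned read.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t7\t30\t5M\t*\t0\t0\tACGTN\tII5+!\tNM:i:0",
]
records = list(read_sam(example))
for rec in records:
    print(rec.qname, rec.rname, rec.pos, rec.cigar)
```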

http://samtools.github.io/hts-specs/SAMv1.pdf

BAMa/CRAMb

ahttps://github.com/samtools/hts-specs

bhttp://www.ebi.ac.uk/ena/software/cram-toolkit

BAM is a compressed version of SAM.

• BGZF compression.

• Random access within compressed file, through indexing.

CRAM format may come to dominate, especially in archives, as datasets get larger:

• Reference-based compression.2

• Highly suited to compression and archiving of very large amounts of sequence data.3

2Fritz et al. (2011) Genome Res. 21:734-740 doi:10.1101/gr.114819.110

3Cochrane et al. (2012) GigaScience 1:2 doi:10.1186/2047-217X-1-2







Read repositories

Repositories are centrally-maintained locations that keep sequence read data from multiple projects. Submission to a repository is a requirement for publication.

• ENA: The European Nucleotide Archive (http://www.ebi.ac.uk/ena), maintained by EBI/EMBL

• SRA: The Sequence Read Archive, formerly the Short Read Archive (http://www.ncbi.nlm.nih.gov/sra), maintained in the US by NCBI

Sequence Assembly

Once you have reads, you can assemble a genome.

Two main approaches to read assembly:

• Overlap-Layout-Consensus: Typically used with smaller sets of longer reads (e.g. 454, PacBio, Ion, Nanopore)

• de Bruijn assembly: Typically used with many, shorter reads (e.g. Illumina), but also useful for longer reads

See e.g. Leland Taylor’s thesis (http://gcat.davidson.edu/phast/docs/Thesis_PHAST_LelandTaylor.pdf), and PHAST (http://gcat.davidson.edu/phast/index.html).






Overlap-Layout-Consensus

The oldest approach, typically used with smaller sets of longer reads.

Can be time consuming (all-vs-all comparisons), but this is offset with graph-based OLC algorithms (e.g. SGA).

• Celera Assembler4

• Newbler (the Roche/454 GS assembler)5

• String Graph Assembler6
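The "overlap" step of Overlap-Layout-Consensus, and why it involves all-vs-all comparisons, can be sketched as follows. This is a naive exact-match illustration with made-up reads and hypothetical function names; it is not how Celera Assembler, Newbler or SGA are implemented, and real assemblers must tolerate sequencing errors.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no overlap of at least min_length exists.
    """
    start = 0
    while True:
        # Look for the first min_length bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the rest of a, from that point, is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def all_overlaps(reads, min_length=3):
    """All-vs-all comparison: every ordered pair of reads is tested."""
    found = {}
    for a, b in permutations(reads, 2):
        length = overlap(a, b, min_length)
        if length > 0:
            found[(a, b)] = length
    return found


reads = ["ACGGATC", "GATCAAGT", "AAGTTCC"]  # made-up reads
overlaps = all_overlaps(reads)
for (a, b), length in overlaps.items():
    print(a, "->", b, "overlap", length)
```

The number of ordered pairs grows with the square of the number of reads, which is why this approach suits smaller read sets.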

4http://wgs-assembler.sourceforge.net/

5http://www.454.com/products/analysis-software/

6Simpson and Durbin (2012) Genome Res. 22:549-556 doi:10.1101/gr.126953.111







de Bruijn graph assembly

k-mer based graph (choice of k important):


k-mer based genome and read graphs7

“True” edges = genome; “Error” edges = wrong assembly
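A minimal sketch of building a k-mer based read graph is given below. Each k-mer contributes one edge between its (k-1)-base prefix and suffix, and the number of reads supporting an edge is counted. The reads, the deliberately introduced error and the function names are all assumptions made for illustration; production assemblers such as Velvet use far more compact data structures and additional error-correction steps.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_edges(reads, k):
    """Count edges (prefix, suffix) of the de Bruijn graph built from the reads.

    Nodes are (k-1)-mers; each k-mer seen in a read adds one to its edge count.
    """
    edges = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


# Made-up reads; the last one carries a single basecall error (A -> T).
reads = ["ACGTA", "CGTAC", "GTACG", "GTTCG"]
edges = de_bruijn_edges(reads, k=3)

for (left, right), count in sorted(edges.items()):
    label = "likely error" if count == 1 else "supported"
    print(left, "->", right, count, label)
```

Edges seen only once here come from the erroneous read, which illustrates how basecall errors create "error" edges in the read graph.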

7Chaisson et al. (2009) Genome Res. 19:336-346 doi:10.1101/gr.079053.108



All sequencing technologies have basecall errors.

• The proportion of errors is approximately constant per read

• Basecall errors lead to edge errors

• The more reads you have, the more errors there are

Increased coverage does not ensure increased accuracy8

8Conway and Bromage (2011) Bioinformatics 27:479-486 doi:10.1093/bioinformatics/btq697



Fast, as it never computes overlaps.

Sensitive to sequencing errors, resolves short repeats (graph bulges and whirls).

Notable tools:

• Velvet9

• CLC Assembly Cell10

• Cortex11

9Zerbino and Birney (2008) Genome Res. 18:821-829 doi:10.1101/gr.074492.107

10http://www.clcbio.com/products/clc-assembly-cell/

11Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028


“Coloured” de Bruijn graph assemblies

Cortex12 allows for on-the-fly identification of complex variation, and genotyping, by tracking “coloured” edges in the graph. Colours ≈ different isolates/organisms (e.g. a reference)

12Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028






Why map reads?a

aTrapnell et al. (2009) Nat. Biotech. 27:455-457 doi:10.1038/nbt0509-455

“Resequencing” an organism (sequencing a close relative, looking for SNPs/indels)

RNA-seq, ChIP-seq, etc. - coverage ≈ expression/binding

To see where reads map on an assembled genome

• Is coverage even? (can indicate repeats)

• Are there SNPs/indels? (heterogeneous population)

• Assembly problems?


Short-Read Sequence Alignmenta

aTrapnell et al. (2009) Nat. Biotech. 27:455-457 doi:10.1038/nbt0509-455

An embarrassment of tools (over 60 listed on Wikipedia). Main approaches:

• Alignment: Smith-Waterman is mathematically guaranteed to give the best alignment available (e.g. BFAST, MOSAIK); approximation to S-W (e.g. BLAST); ungapped or gapped alignment (e.g. MAQ, FAST, mrFAST, SOAP). Can be slow.

• Burrows-Wheeler Transform: Makes reusable index of the genome (e.g. Bowtie, BWA), can be extended to consider sequence probability (e.g. BWA-PSSM). Can be very fast.

Other tools may employ different algorithms, some designed to be parallelised on GPUs/FPGAs (e.g. NextGenMap, XpressAlign)
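To give a feel for the Burrows-Wheeler Transform itself, here is a naive sketch that sorts all rotations of a sequence and shows that the transform is reversible. It is purely illustrative: the function names and the "$" terminator are choices made for this example, and tools such as Bowtie and BWA build a much more memory- and time-efficient index rather than sorting rotations like this.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform via sorted rotations (naive, for illustration).

    Assumes the terminator does not occur in text and sorts before all
    other characters, which holds for "$" and letters.
    """
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```

The transform tends to group identical characters together, and the original sequence can be recovered exactly, which is what makes it suitable as the basis of a compact, reusable genome index.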


http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

Visualising Read Mapping

Several tools available, e.g. Tablet (the best. . .)13

13Milne et al. (2013) Brief. Bioinf. 14:193-202 doi:10.1093/bib/bbs012







In an ideal world

Ideally, you would have one sequence per chromosome/plasmid (and no errors): a closed/complete genome.

PacBio, Sanger, manual closing, Nanopore(?)

More realistically. . .

Typically, a number of assembled fragments (contigs or scaffolds) are returned in FASTA format: a draft, disordered genome. Around 250 contigs for a 5Mbp genome is usual with Illumina

Ordering contigs

Contigs can be ordered correctly into scaffolds if paired-end reads span gaps (typically done during assembly). Gaps are usually filled with Ns (length estimated)

Ordering contigs

Contigs and scaffolds can also be reordered by alignment to areference genome.

• Mauve/progressiveMauve14

• MUMmer15

14Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704

15Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12


Where next?a

aLefebure et al. (2010) Genome Biol. Evol. 2:646-655 doi:10.1093/gbe/evq048





Microbial Genomics and Bioinformatics
BM405
3. Whole Genome Comparisons









Table of Contents

Comparative Genomics
• Computational Comparative Genomics

Bulk Genome Properties
• Nucleotide Frequency/Genome Size

Whole Genome Alignment
• An Introduction to Pairwise Genome Alignment
• Average Nucleotide Identity
• Whole Genome Alignment in Practice
• Ordering Draft Genomes By Alignment
• Chromosome painting

The Power of Comparative Genomics

Massively enabled by high-throughput sequencing, and the availability of thousands of sequenced isolates.

Computational comparisons more powerful and precise than experimental comparative genomics: the ultimate microbial typing solution

Three broad areas/scales:

• Comparison of bulk genome properties

• Whole genome sequence comparisons

• Comparison of features/functional components





Nucleotide frequency/genome size

• Very easy to calculate from complete/draft genome

• Can calculate for individual contigs/scaffolds/regions

• Usually reported in GUI genome browsers

Trivial to determine using, e.g. Python
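For instance, genome size and GC content can be computed from a FASTA file with only the Python standard library. The sketch below works on made-up sequences held in memory; the function names are illustrative, and in practice a library such as Biopython would usually handle the file parsing.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    acgt = gc + sequence.count("A") + sequence.count("T")
    return gc / acgt if acgt else 0.0


# Made-up two-contig draft genome; Ns are ignored when computing GC content.
example = ">contig1\nACGTACGT\nGGCC\n>contig2\nATATNNAT\n".splitlines()
contigs = dict(parse_fasta(example))

genome_size = sum(len(seq) for seq in contigs.values())
print("genome size:", genome_size)
for name, seq in contigs.items():
    print(name, len(seq), round(gc_content(seq), 3))
```

The same functions work for a single contig, a scaffold or a sliding window along a chromosome.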

Nucleotide frequency/genome size

GC content and chromosome size can be characteristic. See data/bacteria size for example iPython notebook exercise

Blobologya

aKumar and Blaxter et al. (2011) Symbiosis 3:119-126 doi:10.1007/s13199-012-0154-6

Sequencing samples may be contaminated or contain microbial symbionts.

Expect more host than symbiont/contaminant DNA

GC content and read coverage can be used to separate contigs, following assembly and mapping

http://nematodes.org/bioinformatics/blobology/

k-mers

• Nucleotides: [ACGT]

• Dinucleotides: [AA|AC|AG|AT|CA|CC|. . .] (16 dimers)

• Trinucleotides: [AAA|AAC|AAG|AAT|ACA|. . .] (64 trimers)

• k-mers: $4^k$ k-mers

(see example in data/shiny)
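A k-mer frequency vector of the kind listed above can be computed as in the following sketch. The function name and the test sequence are assumptions for illustration; k-mers containing ambiguous bases such as N are simply ignored, and only one strand is counted.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Relative frequency of each of the 4**k possible k-mers in a sequence.

    Overlapping k-mers are counted on the given strand only; k-mers with
    characters other than A, C, G, T do not contribute.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


dimers = kmer_frequencies("ACGTACGT", 2)  # made-up sequence
print(len(dimers))
for kmer, freq in dimers.items():
    if freq > 0:
        print(kmer, round(freq, 3))
```

The result is a vector with one entry per possible k-mer, so it has 16 entries for k=2 and 64 for k=3.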

k-mers

GC content = point value; k-mer frequencies = vector (list)

Diagnostic differences in k-mer frequency, and variability.

The basis of several comparison tools

[Figure panels: E. coli; Mycoplasma spp.]





What to align, and why?

To be useful, aligned genomes should:

• derive from a sufficiently recent common ancestor, so homologous regions can be identified

• derive from a sufficiently distant common ancestor, so that there are “interesting” differences to be identified

• help to answer your biological question

How to align, and why?

Naive sequence aligners (Needleman-Wunsch, Smith-Waterman) are not appropriate for genome alignment

• Computationally expensive on large sequences

• Cannot handle rearrangements

Very many alternative alignment algorithms proposed

• megaBLAST http://www.ncbi.nlm.nih.gov/blast/html/megablast.html

• MUMmer http://mummer.sourceforge.net/

• BLAT http://genome.ucsc.edu/goldenPath/help/blatSpec.html

• LASTZ http://www.bx.psu.edu/~rsharris/lastz/

• LAGAN http://lagan.stanford.edu/lagan_web/index.shtml

• and many, many more. . .

Example exercises in data/whole_genome_alignment.

megaBLAST

Optimised for speed, over BLASTN1

• Genome-level searches

• Queries on large sequence sets

• Long alignments of very similar sequence

Uses the greedy algorithm by Zhang et al.2, not BLAST algorithm.

• Concatenates queries (“query packing”) to improve performance

• Two modes: megaBLAST and discontinuous (dc-megablast) for divergent sequences

BLASTN now uses the megaBLAST algorithm by default

1http://www.ncbi.nlm.nih.gov/blast/Why.shtml

2Zhang et al. (2000) J. Comp. Biol. 7:203-214 doi:10.1089/10665270050081478


BLAST vs megaBLAST

megaBLAST is faster, but does it give the same biological results?

megaBLAST (top) and BLAST (bottom) pairwise comparisons:

BLAST vs megaBLAST

Filter out weak matches - not quite identical:

MUMmer [a]

[a] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12

Uses suffix trees for pattern matching: very fast even for large sequences

• Finds maximal exact matches

• Memory use depends only on the reference sequence size

Suffix trees (http://en.wikipedia.org/wiki/Suffix_tree):

• Can be built and searched in O(n) time

• But useful algorithms are nontrivial

The MUMmer algorithm [a]

[a] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12

1. Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs)

2. Cluster into alignment anchors

3. Extend between anchors to produce the final alignment

This is the basis of a very flexible suite of programs that align different kinds of sequence: mummer, nucmer, promer

• nucleotide and (more sensitive) “conceptual protein” alignments

• used for genome comparisons, assembly scaffolding, repeat detection, ...

• the basis of other aligners/assemblers (e.g. Mugsy, AMOS)
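Step 1 above can be illustrated with a brute-force sketch. This is not how MUMmer works internally (it uses suffix trees, and this version is far slower); it only shows what a maximal unique match is. The function names and the `min_len` parameter are hypothetical. Clustering and extension (steps 2 and 3) are not shown.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(ref, query, min_len=3):
    """Return (ref_start, query_start, length) for each maximal unique match."""
    mums = []
    for i in range(len(ref)):
        for j in range(len(query)):
            if ref[i] != query[j]:
                continue
            # Skip matches that could still be extended to the left
            if i > 0 and j > 0 and ref[i - 1] == query[j - 1]:
                continue
            # Extend to the right for as long as the two sequences agree
            length = 0
            while (i + length < len(ref) and j + length < len(query)
                   and ref[i + length] == query[j + length]):
                length += 1
            if length < min_len:
                continue
            match = ref[i:i + length]
            # Keep only matches that occur exactly once in each sequence
            if count_overlapping(ref, match) == 1 and count_overlapping(query, match) == 1:
                mums.append((i, j, length))
    return mums


if __name__ == "__main__":
    print(find_mums("TTACGTGG", "CCACGTAA"))
```

In the toy input the shared substring ACGT cannot be extended in either direction and occurs once in each sequence, so it is reported as a single anchor.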


MUMmer vs megaBLAST

MUMmer identifies fewer weak matches

megaBLAST (top) and MUMmer (bottom) pairwise comparisons: [figure not reproduced]

After filtering out weak BLAST matches, the two sets of results are similar but not quite identical: [figure not reproduced]

DNA-DNA hybridisation [a]

[a] Rossello-Mora and Amann (2001) FEMS Micro. Rev. 25:39-67 doi:10.1016/S0168-6445(00)00040-1

• “Gold Standard” for prokaryotic taxonomy, since 1960s. “70% identity ≈ same species.”

• Denature DNA from two organisms.

• Allow to anneal. Reassociation ≈ similarity, measured as ∆T of denaturation curves.

Proxy for sequence similarity: replace with genome analysis [3]?

[3] Chan et al. (2012) BMC Microbiol. 12:302 doi:10.1186/1471-2180-12-302

Average Nucleotide Identity (ANIb) [a]

[a] Goris et al. (2007) Int. J. Syst. Evol. Micr. 57:81-91 doi:10.1099/ijs.0.64483-0

1. Break genomes into 1020 nt fragments

2. ANIb: mean % identity of all BLASTN matches with > 30% identity and > 70% fragment coverage

• DDH:ANIb linear

• DDH:%ID linear

• 70% DDH ≈ 95% ANIb
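The filtering-and-averaging step can be sketched as below. This assumes BLASTN has already been run and each fragment's best hit has been reduced to an (identity, coverage) pair expressed as fractions; that input format and the function name are hypothetical, and the published method has further details (for example, how identity is recalculated over the whole fragment) that are not reproduced here.

```python
def anib(hits, min_identity=0.30, min_coverage=0.70):
    """Mean identity of fragment hits passing the ANIb thresholds.

    hits: iterable of (identity, coverage) pairs, both as fractions (0-1).
    Returns None when no hit passes the filters.
    """
    kept = [identity for identity, coverage in hits
            if identity > min_identity and coverage > min_coverage]
    if not kept:
        return None
    return sum(kept) / len(kept)


if __name__ == "__main__":
    example_hits = [
        (0.98, 0.95),  # kept
        (0.96, 0.80),  # kept
        (0.99, 0.50),  # dropped: covers too little of the fragment
        (0.25, 0.90),  # dropped: identity too low
    ]
    print(anib(example_hits))
```

Fragments that fail either threshold contribute nothing to the mean, so poorly conserved regions are simply discarded.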


Average Nucleotide Identity (ANIm) [a]

[a] Richter and Rossello-Mora (2009) Proc. Natl. Acad. Sci. USA 106:19126-19131 doi:10.1073/pnas.0906412106

1. Align genomes (MUMmer)

2. ANIm: mean % identity of all matches

• DDH:ANIm linear

• 70% DDH ≈ 95% ANIb

TETRA: tetranucleotide frequency-based classifier introduced in the same paper.
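The idea behind a tetranucleotide-based comparison can be sketched as follows. This is a simplification: it correlates raw tetranucleotide frequencies on one strand, whereas the published TETRA statistic correlates z-scores of observed against expected counts. All function names are hypothetical.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq):
    """Relative frequency of each tetranucleotide in seq (forward strand only)."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    f1 = tetra_frequencies("ACGTACGTACGTTTGACA")
    f2 = tetra_frequencies("ACGTACGAACGTTTGACC")
    print(pearson(f1, f2))
```

No alignment is involved: two genomes with similar overall composition score highly even if their gene order differs.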



ANI/TETRA comparison

All three methods applied to Anaplasma spp.

[Figure: ANIb, ANIm and TETRA heatmaps for nine Anaplasma genomes (A. phagocytophilum, A. centrale, A. marginale); colour key 0.9 to 0.98.]

ANIb discards information, relative to ANIm: less sensitive

ANIb/ANIm ≈ evolutionary history; TETRA ≈ bulk composition

ANI in practice

Practical applications [4] (note: no gene content used)

29 Dickeya isolates: species structure

[Figure: ANIm heatmap of Dickeya draft and complete genomes; colour key 0.9 to 0.98.]

180 E. coli isolates: subtyping

[Figure: ANIm heatmap of isolate contig sets; colour key 0.9 to 0.98; legend groups A, B1, B2, C, D, E, F, U, X.]

[4] van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0


Collinearity and Synteny

Genome rearrangements occur, but there can still be conservation of sequence similarity and ordering.

• Two elements are collinear if they lie in the same linear sequence

• Two elements are syntenous (or syntenic) if:
  • (orig.) they lie on the same chromosome
  • (mod.) there is conservation of blocks of order within the same chromosome

Signs of evolutionary constraints, like sequence conservation or synteny, may indicate functional genome regions.

Pyrococcus spp. [a]

[a] Zivanovic et al. (2002) Nuc. Acids Res. 30:1902-1910 doi:10.1093/nar/30.9.1902

Comparison of Pyrococcus genomes (P. horikoshii, P. abyssi, P. furiosus) shows chromosome-shuffling.

Transposition is a major cause of genomic disruption

Vibrio mimicus [a]

[a] Hasan et al. (2010) Proc. Natl. Acad. Sci. USA 107:21134-21139 doi:10.1073/pnas.1013825107

Chromosome C-II carries genes associated with environmental adaptation; C-I carries virulence genes.

C-II has undergone extensive rearrangement; C-I has not.

Suggests modularity of genome organisation, as a mechanism for adaptation (HGT, two-speed genome).


Serratia symbiotica [a]

[a] Burke and Moran (2011) Genome Biol. Evol. 3:195-208 doi:10.1093/gbe/evr002

S. symbiotica is a recently evolved symbiont of aphids.

Massive genomic decay is an adaptation to the new environment.

Multiple genome alignment is hard

Can we not just align all our genomes, together?

No. Because it’s really, really hard.

Analogous to problems with multiple sequence alignment (three or more sequences).

• Computationally extremely expensive (O(L^n), L = length of sequence, n = number of sequences)

• NP-complete problem: no known efficient way to find a solution

Heuristic (approximate) methods are used, most commonly:

• Progressive alignment

• Iterative alignment

Mauve [a]

[a] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704

Progressive alignment tool, with a GUI. Application to nine enterobacteria: rearrangement of homologous backbone.

Alternatives include MLAGAN [5] and MUMmer [6]

[5] Brudno et al. (2003) Genome Res. 13:721-731 doi:10.1101/gr.926603

[6] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12

Mauve algorithm [a]

[a] http://dx.doi.org/10.1101/gr.2289704

1. Find local alignments (multi-MUMs)

2. Build guide tree from multi-MUMs

3. Select subset of multi-MUMs as anchors, and partition into Local Collinear Blocks (LCBs): consistently ordered subsets

4. Progressive alignment against guide tree

Reordering contigs [a]

[a] http://dx.doi.org/10.1101/gr.2289704

Mauve also enables draft genome reordering. Once LCBs are identified, the Mauve Contig Mover can be applied to reorder contigs.

Example exercise in data/whole_genome_alignment

Chromosome painting [a]

[a] Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055

“Chromosome painting” infers recombination-derived ‘chunks’. A genome’s haplotype is constructed in terms of recombination events from a ‘donor’ to a ‘recipient’ genome.

Recombination events are summarised in a coancestry matrix. In H. pylori, most recombination occurs within geographical bounds, but there is asymmetrical donation from Amerind/East Asian to European isolates.

Licence: CC-BY-SA




Microbial Genomics and Bioinformatics BM405 4. Genome Features









Table of Contents

Genome Features
• What are genome features?
• Prokaryotic CDS Prediction
• Assessing Prediction Methods
• Prokaryotic Annotation Pipelines

Genome-Scale Functional Annotation
• Functional Annotation
• A visit to the doctor
• Statistics of genome-scale prediction

Building to Metabolism
• Reconstructing metabolism

Genome Features

• Genome features are annotated regions of the genome.

• Typically represent functional elements.

• May be simple (single region), or complex (subfeatures)

Why annotate genome features?

• Almost all use of genomics depends on annotation: annotation quality is critical to downstream use of genomics in biology

• Annotation is curation (a live, active process), not cataloguing

• Automated annotation from curated data (public databases) is the only game in town, given the data quantities we generate

• But you can’t propagate something that doesn’t exist: up to 30% of metabolic activity has no known gene associated with it [1]

• Biocurators can spend as much time “de-annotating” literature-based annotations as entering new data [2]

[1] Chen and Vitkup (2007) Trends Biotech. doi:10.1016/j.tibtech.2007.06.001

[2] Bairoch (2009) Nat. Preced. doi:10.1038/npre.2009.3092.1

Gene Features

Gene features have significant substructure, especially in eukaryotes.

• 5′ UTR

• translation start

• intron start/stop

• exon start/stop

• translation stop

• translation terminator

• 3′ UTR

ncRNA Features

• tRNA - transfer RNA

• rRNA - ribosomal RNA

• CRISPRs - bacterial/archaeal defence (used for genome editing)

• many other classes

Regulatory/Repeat Features

Regulatory sites

• transcription start sites

• RNA polymerase binding sites

• Transcription Factor Binding Sites (TFBS)

Repetitive regions and mobile elements

• tandem repeats

• (retro-)transposable elements

• phage inclusions

Principles of feature prediction

Two main approaches to feature prediction:

• ab initio prediction - start from first principles, using only the genome sequence:
  • Unsupervised methods - not trained on a dataset
  • Supervised methods - trained on a dataset

• homology matches
  • alignment to features from related organisms (comparative genomics, annotation transfer)
  • from known gene products (e.g. proteins, ncRNA)
  • from transcripts/other intermediates (e.g. ESTs, cDNA, RNAseq)

Dedicated tools available for many different classes of feature.

Prokaryotic CDS Prediction Methods

Using CDS prediction as an illustrative example for all feature prediction.

Sequence conservation (evolutionary constraint; an unsupervised, a priori method) can be useful

• Prokaryotes “easier” than eukaryotes for gene/CDS prediction
  • Less uncertainty in predictions (isoforms, gene structure)
  • Very gene-dense (over 90% of chromosome is coding sequence)
  • No intron-exon structure

Prokaryotic CDS Prediction Methods

ORFs are plentiful:

• Problem is: “which possible ORF contains the true gene, and which start site is correct?”

• Still not a solved problem

Finding Open Reading Frames

The simplest approach: find ORFs (sequence between two consecutive in-frame stop codons)

• ORF finding is naive, does not consider:
  • Start codon
  • Promoter/RBS motifs
  • Wider context (e.g. overlapping genes)

Dedicated tools, e.g. Glimmer, Prodigal, RAST, GeneMarkS, are usually better.
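A minimal sketch of the naive stop-to-stop ORF definition given above. It scans only the three forward-strand frames and assumes the standard stop codons; the function name and the `min_codons` parameter are hypothetical. As the slide notes, it ignores start codons, RBS motifs and context, so it is not a gene caller.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for stop-to-stop ORFs on the forward strand.

    Coordinates are 0-based and half-open, and exclude both stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        last_stop_end = None  # position just after the previous in-frame stop
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if (last_stop_end is not None
                        and (pos - last_stop_end) // 3 >= min_codons):
                    orfs.append((last_stop_end, pos, frame))
                last_stop_end = pos + 3
    return orfs


if __name__ == "__main__":
    sequence = "TAAATGAAATAG"
    for start, end, frame in find_orfs(sequence):
        print(frame, start, end, sequence[start:end])
```

A real bacterial chromosome yields many such ORFs in all six frames, most of which are not genes; this is the selection problem that Glimmer and Prodigal address.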

Two ab initio CDS Prediction Tools

• Glimmer [3]
  • Interpolated Markov models
  • Can be trained on “gold standard” datasets

• Prodigal [4]
  • Log-likelihood model based on GC frame plots, followed by dynamic programming
  • Can be trained on “gold standard” datasets

Applying these to an example bacterial chromosome...

[3] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009

[4] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119

Comparing predictions in Artemis [a]

[a] Carver et al. (2012) Bioinformatics 28:464-469 doi:10.1093/bioinformatics/btr703

Not every ORF (green) is predicted to encode a coding sequence (CDS; blue/orange).

Self-contradictory CDS calls (orange); even automated annotation needs manual curation.

Comparing predictions in Artemis

Glimmer (green)/Prodigal (blue) CDS prediction methods do not always agree (presence/absence, start position).

How do we know which (if either) is best?

Using a “Gold Standard”: validation [a]

[a] Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4

A general approach for all predictive methods

• Define a known, “correct” set of true/false, positive/negative etc. examples - the “gold standard”

• Evaluate your predictive method against that set for
  • sensitivity, specificity, accuracy, precision, etc.

This ought to be done by the method developers, but it is often wise to evaluate in your own system.

Many methods available, coverage beyond the scope of this introduction

Contingency Tables

                            Condition (Gold standard)
                            True              False
Test outcome   Positive     True Positive     False Positive
               Negative     False Negative    True Negative

Performance Metrics

Sensitivity = TPR = TP/(TP + FN)
Specificity = TNR = TN/(FP + TN)
FPR = 1 − Specificity = FP/(FP + TN)

If you don’t have this information, you can’t interpret predictive results properly.
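The metrics above follow directly from the four cells of the contingency table. The sketch below also computes PPV and FDR, which are not defined on this slide but appear in the results tables later in the lecture, using their standard definitions (PPV = TP/(TP + FP), FDR = 1 − PPV). The function name and example counts are hypothetical, and there is no guard against empty rows or columns.

```python
def performance_metrics(tp, fp, fn, tn):
    """Standard metrics from the four cells of a 2x2 contingency table."""
    return {
        "sensitivity": tp / (tp + fn),   # TPR
        "specificity": tn / (fp + tn),   # TNR
        "fpr": fp / (fp + tn),           # 1 - specificity
        "ppv": tp / (tp + fp),           # precision
        "fdr": fp / (tp + fp),           # 1 - PPV
    }


if __name__ == "__main__":
    # Illustrative counts only, not taken from the lecture
    for name, value in performance_metrics(tp=90, fp=10, fn=10, tn=890).items():
        print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are properties of the test alone; PPV and FDR also depend on how common the positive class is in the data being tested.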

“Gold Standard” results

• Tested Glimmer [5] and Prodigal [6] on two enterobacterial close relatives as “gold standards” (still not perfect...)
  1. Manually annotated (>3 expert person years)
  2. Community-annotated (many research groups, interested in their own subset of genes)

• Both methods trained directly on the annotated genes in each organism!

[5] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009

[6] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119


Manually annotated: 4550 CDS

genecaller                       glimmer      prodigal
predicted                        4752         4287
missed                           284 (6%)     407 (9%)
Exact Prediction   sensitivity   62%          71%
                   FDR           41%          25%
                   PPV           59%          75%
Correct ORF        sensitivity   94%          91%
                   FDR           10%          3%
                   PPV           90%          97%


Community annotated: 4475 CDS

genecaller                       glimmer      prodigal
predicted                        4679         4467
missed                           112 (3%)     156 (3%)
Exact Prediction   sensitivity   62%          86%
                   FDR           31%          14%
                   PPV           69%          86%
Correct ORF        sensitivity   97%          97%
                   FDR           7%           3%
                   PPV           93%          97%

Gene/CDS Prediction

• Alternative CDS (and all other) prediction methods are unlikely to give identical results, or perform equally well

• There is No Free Lunch (this is a theorem: http://en.wikipedia.org/wiki/No_free_lunch_theorem)

• To assess/choose between methods, performance metrics are required

• Even on prokaryotes (a relatively simple case), current best methods for CDS prediction are imperfect

• Manual correction is often required (usually the most demanding and time-consuming part of the process).

Prokaryotic Annotation Pipelines [a]

[a] Richardson and Watson (2012) Brief. Bioinf. 14:1-12 doi:10.1093/bib/bbs007

Many choices, including RAST [7], PROKKA [8], BASys [9], etc. Often perform both CDS/feature calling and functional prediction. Two broad approaches:

1. Heavyweight: maintain database and resource, often annotating by homology, e.g. RAST

2. Lightweight: chain together multiple third-party packages, e.g. PROKKA

Pipelines take a lot of tedium (and control) out of annotating bacterial genomes, but have the same issues as every other prediction tool.

[7] Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75

[8] Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153

[9] Van Domselaar et al. (2005) Nuc. Acids Res. 33:W455-W459 doi:10.1093/nar/gki593

PROKKA [a]

[a] Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153

• Lightweight, and fast.

• Runs locally. (5 Mbp genome takes ≈10 min on my desktop; more detailed ncRNA prediction takes ≈20 min)

• Flexible: built-in databases can be replaced by user databases.

• Uses freely-accessible third-party tools for prediction

Simple to run (at the command-line, or in Galaxy [10]).

[10] Goecks et al. (2010) Genome Biol. 11:R86 doi:10.1186/gb-2010-11-8-r86



RAST [a]

[a] Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75

• Server-based (http://rast.nmpdr.org/). Queues likely.

• Relies on SEED and FIGFam databases, held at NMPDR

• FIGFam: isofunctional homologue families

• Produces metabolic reconstruction

Principles of function prediction

At genome scale, we realistically have to automate function prediction. Function prediction is just like any other prediction method. Two main approaches to function prediction:

• ab initio prediction (on basis of feature sequence/context only)
  • Unsupervised methods - not trained on an exemplar dataset
  • Supervised methods - trained on an exemplar dataset

• homology matches (sequence similarity)
  • alignment to features with known/predicted functions

Homology-based function prediction

Two proteins with similar sequence may have similar function. But...

• How similar do they have to be (and where) to share the same function?

• What do we mean by ‘same function’, anyway: interaction/substrate specificity? participation in a pathway? contribution to a structure? biochemical interconversion? ...

• How confident can we be in the comparator (annotated) sequence: was that function determined experimentally?

Gene Ontology (GO) [a]

[a] Ashburner et al. (2000) Nat. Genet. 25:25-29 doi:10.1038/75556

The Gene Ontology provides a common vocabulary for describing biological function, and unifying functional descriptions.

Ontologies (controlled vocabularies) are central to information-sharing.

Gene Ontology Consortium: http://geneontology.org/

Many annotation tools and databases produce GO output, or compatible controlled vocabulary terms, e.g.

• Blast2GO [11]: BLAST-based annotation

• PHI-Base [12]: microbial pathogen-host interaction specific functions

• GOPred [13]: combines several protein function classifiers

[11] Conesa et al. (2005) Bioinformatics 21:3674-3676 doi:10.1093/bioinformatics/bti610

[12] Winnenburg et al. (2006) Nuc. Acids Res. 34:D459-D464 doi:10.1093/nar/gkj047

[13] Sarac et al. (2010) PLoS One 5:e12382 doi:10.1371/journal.pone.0012382

Are database annotations reliable? [a]

[a] Schnoes et al. (2013) PLoS Comp. Biol. 9:e1003063 doi:10.1371/journal.pcbi.1003063

Are protein function annotations in databases determined experimentally, or by annotation transfer?

High throughput experiments and genome annotations are conducted without validation of function, and placed in databases.

• GO databases record annotation origin by publication

• GO databases record evidence codes, e.g.: EXP = Inferred from Experiment; ISS = Inferred from Sequence Similarity

• 0.14% of contributing publications provide 25% of all experimentally validated annotations in the Uniprot-GOA compilation.

• There are biases in functional annotation.

No clear solution to this kind of bias, but we have to recognise and account for it.

Are database annotations reliable? [a]

[a] Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340

The Critical Assessment of Function Annotation (CAFA) project.

Do biased database annotations matter?

Experimental annotations of proteins are incomplete. But is that important?

Tested by simulation, and by following databases for three years. [14]

1. Yes. It matters.

2. Current large scale annotations are meaningful and almost surprisingly reliable.

3. The nature and level of data incompleteness, and type of classification model have an effect.

4. “Low precision, high recall” (i.e. less discriminating) tools are most significantly affected.

Molecular function prediction is usually more reliable than biological process prediction [15]

[14] Jiang et al. (2014) Bioinformatics 30:i609-i616 doi:10.1093/bioinformatics/btu472

[15] Cozzetto et al. (2013) BMC Bioinf. 14(Suppl 3):S1 doi:10.1186/1471-2105-14-S3-S1

CAFA results [a]

[a] Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340

The Critical Assessment of Function Annotation (CAFA) 2013 results. (F-measure combines precision and recall)

• You can do better than BLAST.

• Best-performing methods do comparably well.

• Best methods used evolutionary relationships, structure, and expression data.

• Machine Learning works best.

A wee trip to the doctor

• You go for a checkup, and are tested for disease X

• The test has sensitivity = 0.95 (predicts disease where there is disease)

• The test has FPR = 0.01 (predicts disease where there is no disease)

• Your test is positive

• What is the probability that you have disease X?
  • 0.01, 0.05, 0.50, 0.95, 0.99?

• (Audience Participation!)

• What is the probability that you have disease X?

• Unless you know the baseline occurrence of disease X, you cannot determine this.

• Baseline occurrence: fX
  • fX = 0.01 =⇒ P(disease|+ve) = 0.490 ≈ 0.5
  • fX = 0.8 =⇒ P(disease|+ve) = 0.997 ≈ 1.0

Why Performance Metrics Matter [a]

[a] http://dx.doi.org/10.1007/978-1-62703-986-4_4

• Imagine a paper describing a predictor for protein functional class (e.g. Type III effector)

• The paper reports sensitivity = 0.95, FPR = 0.01

• You run the predictor on 20,000 proteins in an organism

• It predicts 130 members of the class. How many of them are likely to be true positives?

• We need a baseline level of that class (fX) in the genome to determine this.

• We estimate ≈ 200 members in protein complement, so fX = 0.01

• fX = 0.01 =⇒ P(class|+ve) = 0.490 ≈ 0.5

Bayes’ Theorem

• May seem counter-intuitive: 95% sensitivity, 99% specificity ⇒ 50% chance of any prediction being incorrect

• Probability given by Bayes’ Theorem

• $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$

• This step commonly overlooked in the literature
  • confirmation bias
  • people want to see positive examples/tell a story
  • people want to think their predictor works
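
As a worked instance of the formula, substituting the values used earlier in the deck (sensitivity 0.95, FPR 0.01 and an assumed baseline P(X) = fX = 0.01) gives:

```latex
\begin{align*}
P(X \mid +) &= \frac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})} \\
            &= \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.01 \times 0.99} \\
            &= \frac{0.0095}{0.0194} \approx 0.49
\end{align*}
```

The numerator and the false-positive term are almost equal, which is why roughly half of the positive calls are wrong despite the apparently strong headline metrics.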

A cautionary tale [a]

[a] Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376

• Paper describes EffectiveT3, a type III effector prediction tool

• Reported sensitivity ≈ 0.71, FPR ≈ 0.15

• Applied tool to 739 complete bacterial and archaeal genomes

• Organisms with an identifiable T3SS: 2-7% of genome predicted to be secreted

• Organisms without an identifiable T3SS (or known not to have one): 1-10% of genome predicted to be secreted

• “The surprisingly high number of (false) positives in genomes without T3SS exceeds the expected false positive rate”

• This is not a surprise, statistically.

http://dx.doi.org/10.1371/journal.ppat.1000376



A cautionary tale [a]


Probability that an EffectiveT3 positive prediction corresponds to a secreted protein is given by Bayes’ Theorem

• $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$

• P(+|X) = sensitivity = 0.71
• P(+|X̄) = FPR = 0.15
• P(X) = base rate ≈ 0.03 [16]

• ⇒ P(X|+) ≈ 0.13

Only 13% of positive predictions are likely to be true positives!

How many predicted type III secreted proteins were there...

[16] Boch and Bonas (2010) Annu. Rev. Phytopathol. 48:419-436 doi:10.1146/annurev-phyto-080508-081936


http://dx.doi.org/10.1146/annurev-phyto-080508-081936
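
The same calculation can be expressed as expected counts, which makes the scale of the problem easier to see. The sketch below is illustrative only: the genome size of 5000 proteins and the function name `expected_prediction_counts` are assumptions, while the sensitivity, FPR and base rate are the values quoted above. A base rate of zero stands in for an organism with no T3SS.

```python
def expected_prediction_counts(n_proteins, base_rate, sensitivity, fpr):
    """Expected true/false positive counts when a classifier scans a genome."""
    n_class = n_proteins * base_rate              # true members of the class
    true_pos = sensitivity * n_class              # members correctly flagged
    false_pos = fpr * (n_proteins - n_class)      # non-members wrongly flagged
    total = true_pos + false_pos
    ppv = true_pos / total if total else 0.0      # P(X | +)
    return true_pos, false_pos, ppv


# Hypothetical genome size; rates are those quoted for EffectiveT3.
N_PROTEINS = 5000
for label, base_rate in (("with T3SS", 0.03), ("without T3SS", 0.0)):
    tp, fp, ppv = expected_prediction_counts(N_PROTEINS, base_rate, 0.71, 0.15)
    flagged = (tp + fp) / N_PROTEINS
    print(f"{label}: TP={tp:.1f} FP={fp:.1f} PPV={ppv:.2f} flagged={flagged:.1%}")
```

Under these assumptions, false positives outnumber true positives several-fold even when a T3SS is present, and a comparable fraction of the genome is flagged when it is absent.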


Interpreting genome-scale predictions [a]


• Statistics at genome-scale can be counterintuitive.

• Use Bayes’ Theorem!

• Predictions identify groups, not individual members of the group, e.g.
  • Test for airport smugglers has P(smuggler|+) = 0.9
  • Test gives 100 positives

• Which specific individuals are truly smugglers?

• The test does not allow you to determine this - you need more evidence for each individual

• Same principle applies to other classifiers (including protein functional class prediction) - watch for ‘cherry-picking’ in publications

[a] http://dx.doi.org/10.1007/978-1-62703-986-4_4


Reconstructing metabolism [a]

[a] Thiele and Palsson (2010) Nat. Protoc. 5:93-121 doi:10.1038/nprot.2009.203

Once metabolic functional annotation has been assigned to features, we can do comparative analysis of metabolism.

http://dx.doi.org/10.1038/nprot.2009.203

Dynamic models of metabolism [a]

[a] Orth et al. (2010) Nat. Biotech. 28:245-248 doi:10.1038/nbt.1614

By using constraint-based models (e.g. Flux Balance Analysis), we can make these into dynamic representations of bacterial metabolism.

• Upper, lower bounds to reaction rates
• Define objective phenotype
• Calculate conditions resulting in flux
• in silico knockouts


E. coli metabolism [a]

[a] Monk et al. (2013) Proc. Natl. Acad. Sci. USA 110:20338-20343 doi:10.1073/pnas.1307797110

E. coli has a very long history of metabolic reconstruction [17]

Recent modelling work predicts which nutrients support growth

[17] Reed and Palsson (2003) J. Bact. 185:2692-2699 doi:10.1128/JB.185.9.2692-2699.2003


http://dx.doi.org/10.1128/JB.185.9.2692-2699.2003

E. coli metabolism [a]

[a] Baumler et al. (2011) BMC Syst. Biol. 5:182 doi:10.1186/1752-0509-5-182

Models are complex, and experimental validation is essential.
There’s more we don’t know...

http://dx.doi.org/10.1186/1752-0509-5-182

Licence: CC-BY-SA




Microbial Genomics and Bioinformatics
BM405
5. Finding Equivalent Features









Table of Contents

Equivalent Genome Features
• What makes genome features equivalent?

Homology, Orthology, Paralogy
• Who let the -logues out?
• What’s so important about orthologues?
• Evaluating orthologue prediction
• Using orthologue predictions
• Core and Pan-genomes

Conclusions
• Things I Didn’t Get To
• Conclusions

What makes genome features equivalent?

When we compare two features (e.g. genes) between two or more genomes, there must be some basis for making the comparison. That is, they have to be equivalent in some way, such as:

• common evolutionary origin

• functional similarity

• a family-based relationship

It’s common to define equivalence of genome features in terms of evolutionary relationship.

Why look at equivalent features?

The real power of genomics is comparative genomics!

• Makes catalogues of genome components comparable between organisms

• Differences, e.g. presence/absence of equivalents, may support hypotheses for functional or phenotypic difference

• Can identify characteristic signals for diagnosis/epidemiology

• Can build parts lists and wiring diagrams for systems and synthetic biology

Evolutionary relationships [a]

[a] Fitch (1970) Syst. Zool. 19:99-113 doi:10.2307/2412448

Equivalencies and relationships can be quite complex.
We need precise terms to describe relationships between genome features.

• analogy: functional similarity

• homology: evolutionary common ancestor

http://dx.doi.org/10.2307/2412448


Who let the -logues out? [a]

[a] Fitch (2000) Trends Genet. 16:227-231 doi:10.1016/S0168-9525(00)02005-9

• homologues: elements that are similar because they share a common ancestor. There are NOT degrees of homology

• analogues: elements that are (functionally?) similar, and this may be through common ancestry or some other means, e.g. convergent evolution

• orthologues: homologues that diverged through speciation

• paralogues: homologues that diverged through duplication within the same genome

http://dx.doi.org/10.1016/S0168-9525(00)02005-9


ITYFIALMCTT [a]

[a] Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030

But it’s a little more complicated than that.
Biology is not well-behaved.

• Gene loss

• Homologues may diverge so widely that they can be hard to recognise

• Reconstructed evolutionary trees may not be robust inferences of speciation (or relevant to it, in prokaryotes)

• There is no record of history - we can only make inferences

All classifications of orthology/paralogy are inferences!

http://dx.doi.org/10.1093/bib/bbr030


Ensembl Compara [a]

[a] Vilella et al. (2009) Genome Res. 19:327-335 doi:10.1101/gr.073585.107

Some tools/databases, e.g. Ensembl Compara, use slightly different definitions (almost everything’s an “orthologue”)

http://dx.doi.org/10.1101/gr.073585.107

http://www.ensembl.org/info/genome/compara/index.html


Why focus on orthologues?

Formalise the idea of corresponding genes in different organisms.
Orthologues serve two purposes:

• Evolutionary equivalence

• Functional equivalence (“The Ortholog Conjecture” [1])

Applications in comparative genomics, functional genomics and phylogenetics. [2]

Over 30 databases attempt to describe orthologous relationships (http://questfororthologs.org/orthology_databases) [3]

[1] Chen and Zhang (2012) PLoS Comp. Biol. 8:e1002784 doi:10.1371/journal.pcbi.1002784

[2] Dessimoz (2011) Brief. Bioinf. 12:375-376 doi:10.1093/bib/bbr057

[3] Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262

http://questfororthologs.org/orthology_databases




Finding orthologues

Multiple methods and databases [4,5,6]

• Pairwise genome
  • RBBH (aka BBH, RBH), RSD, InParanoid, RoundUp

• Multi-genome
  • Graph-based: COG, eggNOG, OrthoDB, OrthoMCL, OMA, MultiParanoid
  • Tree-based: TreeFam, Ensembl Compara, PhylomeDB, LOFT

[4] Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030

[5] Trachana et al. (2011) Bioessays 33:769-780 doi:10.1002/bies.201100062

[6] Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006

http://armchairbiology.blogspot.co.uk/2012/07/on-reciprocal-best-blast-hits.html

http://link.springer.com/protocol/10.1007/978-1-59745-515-2_7

http://inparanoid.sbc.su.se/cgi-bin/index.cgi

http://roundup.hms.harvard.edu/

http://www.ncbi.nlm.nih.gov/COG/

http://eggnog.embl.de/

http://cegg.unige.ch/orthodb7

http://orthomcl.org/orthomcl/

http://omabrowser.org/cgi-bin/gateway.pl

http://multiparanoid.sbc.su.se/

http://www.treefam.org/

http://www.ensembl.org/info/genome/compara/index.html

http://phylomedb.org/

https://trac.nbic.nl/loft/


http://dx.doi.org/10.1002/bies.201100062

http://dx.doi.org/10.1371/journal.pone.0018755.g006


Which prediction methods work best?

Taking advantage of prokaryotic operon structure: if the outer pair of a syntenic triplet of genes are orthologous, the middle gene is also likely to be orthologous. [7]

Specifically testing reciprocal best hits (RBH).

[7] Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100

http://dx.doi.org/10.1093/gbe/evs100


• Tested on 573 prokaryotic genomes

• 88-99% of RBH found in syntenic triplets

• Overwhelming majority of middle genes are RBH

RBH reliably finds orthologues. [8]

[8] Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100

http://dx.doi.org/10.1093/gbe/evs100


Four methods tested against 2,723 curated orthologues from six Saccharomycetes

• RBBH (and cRBH); RSD (and cRSD); MultiParanoid; OrthoMCL

• Rated by statistical performance metrics: sensitivity, specificity, accuracy, FDR

cRBH most accurate and specific, with lowest FDR. [9]

[9] Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006

http://dx.doi.org/10.1371/journal.pone.0018755.g006
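
For reference, the four metrics named above can be computed from a confusion matrix as in the sketch below. The counts are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the cited benchmark, and the function name `classifier_metrics` is an assumption.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix summaries for a set of orthologue calls."""
    return {
        "sensitivity": tp / (tp + fn),               # true orthologues recovered
        "specificity": tn / (tn + fp),               # non-orthologues rejected
        "accuracy": (tp + tn) / (tp + fp + tn + fn), # all calls that are correct
        "fdr": fp / (tp + fp),                       # positive calls that are wrong
    }


# Hypothetical counts, for illustration only.
for name, value in classifier_metrics(tp=90, fp=10, tn=880, fn=20).items():
    print(f"{name}: {value:.3f}")
```

Note that FDR is the complement of P(X|+) from the Bayes' Theorem slides, so it depends on how common true orthologues are in the benchmark set.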


Testing on literature-based benchmarks for grouping by function and correct branching of phylogeny. [10]

[10] Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262



• Performance varies by choice of method, and interpretation of “orthology”

• Biggest influence is genome annotation quality

• Relative performance varies with choice of benchmark

• (clustering) RBH outperforms more complex algorithms under many circumstances

What is this magic RBH method?
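
In outline, two genes are reciprocal best hits if each is the other's top-scoring match when the two genomes are searched against each other. The sketch below is a minimal illustration, not the course's implementation: it assumes the pairwise scores (e.g. BLAST bit scores) have already been parsed into nested dictionaries, all gene names and scores are hypothetical, ties are broken arbitrarily, and no identity or coverage filters are applied.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is {query: {subject: score}}, e.g. parsed from BLAST output.
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical genes and scores for two genomes, A and B.
a_vs_b = {"a1": {"b1": 250.0, "b2": 90.0}, "a2": {"b2": 180.0}, "a3": {"b1": 120.0}}
b_vs_a = {"b1": {"a1": 245.0, "a3": 118.0}, "b2": {"a2": 175.0, "a1": 88.0}}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so only the `a1`/`b1` and `a2`/`b2` pairs are reported.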


Functional adaptation in Pba [a]

[a] Toth et al. (2006) Ann. Rev. Phytopath. 44:305-336 doi:10.1146/annurev.phyto.44.070505.143444

http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444


Core genome

Once equivalent genes have been identified, those present in all related isolates can be identified: the core genome.
The core genome is expected to underpin common function.

[Figure: a core RBH cluster (clique) for 29 genomes]

Accessory genome

The remaining genes are the accessory genome, and are expected to mediate function that distinguishes between isolates.

[Figure: an accessory RBH cluster for 29 genomes]

Accessory clusters

Accessory RBH clusters can be pruned, to identify the accessory genome specific to subgroups of isolates.

These genes may be responsible for subgroup-specific phenotypes
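
The core/accessory split described above amounts to simple set operations once clusters of equivalent genes are available. The sketch below assumes each cluster has been reduced to the set of isolates that contribute a member; the isolate names, cluster IDs and function names are hypothetical.

```python
def partition_clusters(clusters, genomes):
    """Split clusters into core (present in every genome) and accessory.

    `clusters` maps a cluster ID to the set of genomes contributing a member.
    """
    all_genomes = set(genomes)
    core = {cid for cid, present in clusters.items()
            if all_genomes.issubset(present)}
    accessory = set(clusters) - core
    return core, accessory


def subgroup_specific(clusters, subgroup):
    """Clusters found in every member of `subgroup` and in no other genome."""
    subgroup = set(subgroup)
    return {cid for cid, present in clusters.items() if set(present) == subgroup}


# Hypothetical isolates and clusters, for illustration only.
genomes = ["iso1", "iso2", "iso3", "iso4"]
clusters = {
    "c1": {"iso1", "iso2", "iso3", "iso4"},  # present in all isolates
    "c2": {"iso1", "iso2"},                  # restricted to one subgroup
    "c3": {"iso2", "iso3", "iso4"},
    "c4": {"iso4"},
}

core, accessory = partition_clusters(clusters, genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("specific to iso1+iso2:", sorted(subgroup_specific(clusters, ["iso1", "iso2"])))
```

A real analysis would also need to decide how to treat clusters with several members from one genome (paralogues) and genes missed by annotation, neither of which this sketch addresses.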

Accessory genome

Accessory genomes act as a cradle for adaptive evolution [11]

This is particularly so for pathogens, such as Pseudomonas spp. [12]

[11] Croll and McDonald (2012) PLoS Path. 8:e1002608 doi:10.1371/journal.ppat.1002608

[12] Baltrus et al. (2011) PLoS Path. 7:e1002132 doi:10.1371/journal.ppat.1002132.t002


http://dx.doi.org/10.1371/journal.ppat.1002132.t002

Core genome synteny

Using tools like i-ADHoRe [13] that identify synteny and collinearity, the structural organisation of the core genome can be determined.

For Dickeya, the core genome appears to be structurally well-conserved across all isolates.

[13] Proost et al. (2012) Nuc. Acids Res. 40:e11 doi:10.1093/nar/gkr955

http://dx.doi.org/10.1093/nar/gkr955

Panseq [a]

[a] Laing et al. (2010) BMC Bioinf. 11:461 doi:10.1186/1471-2105-11-461

Panseq is an online tool for identification of core and accessory genomes, available at https://lfz.corefacility.ca/panseq/, and at https://github.com/chadlaing/Panseq for standalone use

http://dx.doi.org/10.1186/1471-2105-11-461

https://lfz.corefacility.ca/panseq/

https://github.com/chadlaing/Panseq

Harvest [a]

[a] Treangen et al. (2014) Genome Biol. 15:524 doi:10.1186/s13059-014-0524-x

Visualising and organising comparison/pangenome data across thousands of bacteria is difficult.
Very recently (this week), the Harvest suite of tools was published, for alignment and visualisation of thousands of genomes.

http://dx.doi.org/10.1186/s13059-014-0524-x


Things I didn’t get to


Conclusions


Licence: CC-BY-SA