Microbial Genomics and Bioinformatics
BM405
1. Introduction
Leighton Pritchard [1,2,3]
[1] Information and Computational Sciences, [2] Centre for Human and Animal Pathogens in the Environment, [3] Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
Acceptable Use Policy
Recording of this talk, taking photos, discussing the content using email, Twitter, blogs, etc. is permitted (and encouraged), providing distraction to others is minimised.
These slides will be made available on SlideShare.
These slides, and supporting material including exercises, are available at https://github.com/widdowquinn/Teaching-2014-11-21-Strathclyde
Table of Contents
Introduction
  A personal view
  Erwinia carotovora subsp. atroseptica
  Dickeya spp., Campylobacter spp., and Escherichia coli
  So what’s changed?
High Throughput Sequencing
  Four dominant technologies
  Benchmarking
  Nanopore
  How fast is sequence data increasing?
The endpoints
• 2003: Erwinia carotovora subsp. atroseptica
• 2014: Dickeya spp., Campylobacter spp., and Escherichia coli
2003: E. carotovora subsp. atroseptica
• £250k collaboration between SCRI, University of Cambridge, WT Sanger Institute
• Single isolate: E. carotovora subsp. atroseptica SCRI1043
• The first sequenced enterobacterial plant pathogen (32 authors!) [1]
• All repeats and gaps bridged and sequenced directly
• Result: a single, complete, high-quality 5Mbp circular chromosome at 10.2X coverage: 106,500 reads
[1] Bell et al. (2004) Proc. Natl. Acad. Sci. USA 101(30):11105-11110. doi:10.1073/pnas.0402424101
2003: E. carotovora subsp. atroseptica
A genome sequence is a starting point. . .
• Manual annotation by the Sanger Pathogen Sequencing Unit
• Literature searches and comparisons
• Six people, for six months ≈ three person-years
• Genes: BLAST, GLIMMER, ORPHEUS
• Functional domains: PFAM, SIGNALP, TMHMM
• Metabolism: KEGG
• ncRNA: RFAM
2003: E. carotovora subsp. atroseptica
Working (Eca_Sanger_annotation.gbk) and published (NC_004547.gbk) annotation files are in the data directory
2003: E. carotovora subsp. atroseptica
Compared against all 142 available bacterial genomes [2]
[2] data/Pba directory in the accompanying GitHub repository
2013: Dickeya spp.
Sequenced and annotated 25 new isolates of Dickeya
• 25 Dickeya isolates, at least six species
• Multiple sequencing methods: 454, Illumina (SE, PE)
• Minor publications (6, 8 authors) [3,4]
• Results: 12-237 fragments containing 4.2-5.1Mbp, at 6-84X coverage, 170k-4m reads
• Automated annotation: RAST with manual corrections
[3] Pritchard et al. (2013) Genome Ann. 1 (4) doi:10.1128/genomeA.00087-12
[4] Pritchard et al. (2013) Genome Ann. 1 (6) doi:10.1128/genomeA.00978-13
2013: Dickeya spp.
Within-genus comparisons: large-scale synteny and rearrangement
Within-species comparisons: e.g. indels, HGT
2013: Dickeya spp.
Within-genus comparisons: whole genome-based species delineation [5]
[5] van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0
2013: Dickeya spp.
Within-genus comparisons: differences in metabolism
2014: E. coli
Sequenced and annotated ≈ 190 isolates of E. coli
All bacteria environmental, sampled from lysimeters
• Illumina paired-end sequencing. Total cost of sequencing 190 bacteria: ≈£11k
• Automated annotation: PROKKA
2014: E. coli
Sequencing output is variable, even with the same preps, the “same” bacteria, and similar sources.
• Results: 5-3000 contigs (median ≈ 125); 9kbp-7.1Mbp (median ≈ 5Mbp); 170k-4m reads
2014: E. coli
Genome sequencing enables within-species classification
[Figure: ANIm heatmap of pairwise comparisons between the isolate assemblies. Row and column labels are contig sets (Lys*, AW*, X50*, DSM*, plus Brunei20070942, Muenster20063091 and Senftenberg20070885). Colour key and histogram: ANIm value (0.90-0.98) against count. Group legend: A, B1, B2, C, D, E, F, U, X]
2014: Campylobacter spp.
Sequenced ≈ 1034 isolates of Campylobacter
Clinical, animal, food-associated isolates
• Illumina paired-end sequencing. Total cost of sequencing >1000 bacteria: ≈£60k
• Automated annotation: PRODIGAL
2014: Campylobacter spp.
• Identified 15554 gene families from genecalls.
• To calculate, took 23 days on institute cluster (4e12 pairwise protein comparisons!).
So what’s changed?
• Cost: £250k per genome, to £60 per genome. Now cheaper to sequence a genome than to analyse it!
• Location: sequencing centre, to benchtop
• Data: volume has increased massively - what you get back from machines, and what’s out there to work with. More data is better, but also more challenging.
• Speed: typical sequencing run time can be less than a day
• Software: more software to do more things (but not always better. . .)
• New kinds of experiment: genomes, exomes, variant calling, methylated sequences, . . .
• New kinds of application: diagnostics, epidemic tracking, metagenomics, . . .
So what’s changed?
Having a single genome is useful, but having thousands really helps comparative genomics: combining genomic data, evolutionary and comparative biology
• Transfer functional understanding of model systems (e.g. E. coli) to non-model organisms
• Genomic differences may underpin phenotypic (host range, virulence, physiological) differences
• Genome comparisons aid identification of functional elements on the genome
• Studying genomic changes reveals evolutionary processes and constraints
Not all sequencing is the same
It’s all about the biology, but it all starts with the data.
Sequencing technology (including library prep.) affects your sequence data.
• Roche/454
• Illumina
• Ion Torrent
• Pacific Biosciences (PacBio)
The basic principle
DNA source is fragmented, and the fragments are sequenced.
PE vs SE
Reads may be single-end, or paired-end.
Putting the jigsaw back together is sequence assembly.
Four different chemistries [a]
[a] Loman et al. (2012) Nat. Rev. Micro. 31:294-296 doi:10.1038/nbt.2522
Reads differ by technology, and may require different bioinformatic treatment. . .
• Roche/454: Pyrosequencing (long reads, but expensive, and high homopolymer errors) (700-800bp, 0.7Gbp, 23h)
• Illumina: Reversible terminator (cost-effective, massive throughput, but short read lengths) (2x150bp, 1.5Gbp, 27h)
• Ion Torrent: Proton detection (short run times, good throughput, high homopolymer errors) (200bp, 1Gbp, 3h)
• PacBio: Real-time sequencing (very long reads, high error rate, expensive) (3-15kbp, 3Gbp/day, 20min)
. . . different error profiles, varying capability to assemble/determine variation
Costs of sequencing [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
Benchmarked performance
Apply several sequencing technologies to the same sample(s).
Benchmark comparisons inform appropriate choice of sequencing technology [6-12]
Progress in technologies is driving research very rapidly.
Always look for most recent/relevant benchmarks.
Bioinformatic methods also need to be benchmarked.
[6] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
[7] Salipante et al. (2014) Appl. Environ. Micro. 80:7583-7591 doi:10.1128/AEM.02206-14
[8] Frey et al. (2014) BMC Genomics 15:96 doi:10.1186/1471-2164-15-96
[9] Koshimizu et al. (2013) PLoS One 8:e74167 doi:10.1371/journal.pone.0074167
[10] Quail et al. (2012) BMC Genomics 13:341 doi:10.1186/1471-2164-13-341
[11] Loman et al. (2012) Nat. Biotech. 30:434-439 doi:10.1038/nbt.2198
[12] Lam et al. (2011) Nat. Biotech. 1 (6) doi:10.1038/nbt.2065
Benchmarking on Vibrio [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
• Sequenced Vibrio parahaemolyticus (2x chromosomes, closed reference genome) with four technologies
• Chose an assembler for each tech, and assembled reads
• Excess reads with Ion/MiSeq: used random subsets of reads to determine required coverage
• Aligned assemblies (MUMmer) to known high-quality chromosome sequence, to measure error
Benchmarking on Vibrio [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
Benchmarking on Vibrio [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
De novo assembly and alignment against Vibrio parahaemolyticus (2x chromosomes)
Benchmarking on Vibrio [a]
[a] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
• More and longer reads do not always give the best assemblies: read depth, read distribution, and error rate also matter
• Optimal assemblies were obtained at around 60x-80x coverage, for Illumina and Ion.
• Multiple rRNA regions are fragmented in short-read assemblies
• PacBio generated single chromosome contigs
• Assembly of multiple-chromosome bacteria is currently feasible
Published genomes are variable because methods are not standard (e.g. sequencing technology, assembler, parameter settings and pre-processing). . .
What’s coming next?
Oxford Nanopore. A sequencer the size of your hand.
• Microfluidics, single-molecule sequencing; 11-70kbp reads
• Reports current across pore (tiny electron microscope) as molecule moves through
• $10/Mbp, 110Mbp per flowcell [13]
[13] Yaniv Erlich (2013) Future Continuous blog
Controversy [a]
[a] Mikheyev and Tin (2014) Mol. Ecol. Res. 14:1097-1102 doi:10.1111/1755-0998.12324
The first Nanopore paper concluded, for λ phage:
• About 10% of reads mapped to the reference genome
• <1% of all generated sequence faithfully matches the reference
Controversy [a]
[a] Mikheyev and Tin (2014) Mol. Ecol. Res. 14:1097-1102 doi:10.1111/1755-0998.12324
Not everyone thinks the Mikheyev and Tin paper is very good:
“But that paper is terrible. It’s just lazy.” (Mick Watson’s blog: http://biomickwatson.wordpress.com/2014/09/07/thoughts-on-oxford-nanopores-minion-mobile-dna-sequencer/)
New data [a]
[a] Quick et al. (2014) GigaScience 3:22 doi:10.1186/2047-217X-3-22
It’s a fast-moving area, and results are improving.
New tools
Oxford Nanopore’s open beta went out without analysis tools.
Tools (Poretools, poRe, etc.) are being written/tested/validated by the user community [14,15]
[14] Loman and Quinlan (2014) Bioinformatics doi:10.1093/bioinformatics/btu555
[15] Watson et al. (2014) Bioinformatics doi:10.1093/bioinformatics/btu590
After that, the flood. . .
High-throughput sequencing methods have completely changed the landscape of microbiology
(Nearly) complete, (mainly) accurate sequence data is now inexpensive (and cheaper than analysis)
• GOLD (19/2/2014): 3,011 “finished”; 9,891 “permanent draft” genomes
• GOLD (18/11/2014): 6,649 “finished”; 23,552 “permanent draft” genomes
• NCBI WGS (19/2/2014): 17,023 microbial genomes
• NCBI WGS (18/11/2014): 26,026 microbial genomes
Predicting the future is hard. . .
Su et al. attempted to answer this [16]:
[16] http://sulab.org/2013/06/sequenced-genomes-per-year/
Licence: CC-BY-SA
By: Leighton Pritchard
This presentation is licensed under the Creative Commons Attribution ShareAlike license
https://creativecommons.org/licenses/by-sa/4.0/
Microbial Genomics and Bioinformatics
BM405
2. Assembly
What do you get from sequencing
Sequence reads. Usually lots of them.
Size/number/errors depend on technology used. [1]
[1] Miyamoto et al. (2014) BMC Genomics 15:699 doi:10.1186/1471-2164-15-699
Sequence Read Data Formats
Three common sequence read data formats:
• FASTQ: Related to FASTA, a de facto standard for sequence reads
• SAM/BAM: Sequence alignment/mapping format, two flavours - uncompressed and compressed
• CRAM: Reference-based sequence compression
You might also receive assembled genomes directly from a sequencing partner
Table of Contents
Sequence Data Formats
  FASTQ
  SAM/BAM/CRAM
  Repositories
Assembly
  Overlap-Layout-Consensus
  de Bruijn graph assembly
Read Mapping
  Short-Read Sequence Alignment
The Assembly
  What you get back
FASTQ [a]
[a] Cock et al. (2010) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137
@HISEQ2500-09:168:HA424ADXX:2:1101:1404:2061 1:N:0:ATCTCTCTCACCAACT
CGGTCTTGGGATAGATGGGTTGCAGGTTGCGGTAAAGCTCGGACTCCAGAGCGTCCAGGGTAGACTGGCTAATCTTCTGCTCTTTATCGATCATTATTTC
+
@@CBDDFFHHDFDHEGHIICGIFHHIIIIFHGGHIEHHIIIIGHGHIIIIIGGHHFFFFC@CBCCCDDBDCDDDDDDDDCCDDDD3@ABDDDDDEEEDE@
Files typically have .fq, .fastq extension.
Four lines per sequence:
1. Header: sequence identifier and optional description, starts with “@”
2. Raw sequence ([ACGTN])
3. Optional header, repeats line 1, starts with “+”
4. Quality scores, numbers encoded as ASCII
$Q_{\mathrm{phred}} = -10 \log_{10} e$, where $e$ is the estimated probability that a base call is incorrect (like a pH).
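As a minimal sketch of the Phred relationship above, the following Python converts between quality scores and error probabilities, and decodes the start of the example quality string. It assumes the Sanger offset of 33 (reasonable for a recent Illumina read like the one shown, but an assumption here); the function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def decode_quality_string(quality, offset=33):
    """Turn an ASCII-encoded quality string into a list of Phred scores.

    offset=33 is the Sanger encoding; this is an assumption and must
    match the encoding of the file being read.
    """
    return [ord(ch) - offset for ch in quality]


# First ten quality characters of the example read
scores = decode_quality_string("@@CBDDFFHH")
print(scores)                          # [31, 31, 34, 33, 35, 35, 37, 37, 39, 39]
print(phred_to_error_probability(30))  # 0.001, i.e. 1 error in 1000 calls
```

A score of Q20 corresponds to a 1 in 100 chance of an incorrect call, Q30 to 1 in 1000, which is why the scale reads "like a pH".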
Quality Control
The quality of basecalls (error rate) varies between and along reads.
(real data from our E. coli sequencing: good quality)
Quality Control
Some datasets are better than others.
Reads can be trimmed, or discarded.
Including poor reads compromises assembly.
FASTQ encoding [a]
[a] Cock et al. (2010) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137
There is more than one version of FASTQ; they differ by quality encoding.
Numbers converted to ASCII start at different values
FASTQ encoding [a]
[a] Cock et al. (2010) Nucleic Acids Res. 38:1767-1771 doi:10.1093/nar/gkp1137
Versions vary by sequencer and period. Most now settled on Sanger format.
Quality scores (Qphred) offset to lie in the given range:
1. Sanger: 33-126, used in SAM/BAM, and Illumina 1.8+
2. Illumina 1.0-1.2: 59-126
3. Illumina 1.3-1.7: 64-126
Knowing where your data comes from, and the data format and version, is always important.
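The ASCII ranges above suggest a simple heuristic, sketched here in Python: any quality character below ASCII 59 can only come from the Sanger encoding. The converse is not safe (a high-quality Sanger file may contain no low characters), so the sketch returns None rather than guessing. Function and constant names are hypothetical.

```python
SANGER_OFFSET = 33
ILLUMINA_13_OFFSET = 64


def guess_offset(quality_lines):
    """Guess the quality offset from the lowest ASCII value observed.

    Returns 33 if a character below ASCII 59 is seen (only possible in
    Sanger encoding), otherwise None, because the data are ambiguous.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return SANGER_OFFSET
    return None


def convert_to_sanger(quality, offset=ILLUMINA_13_OFFSET):
    """Re-encode a quality string from the given offset to Sanger (33)."""
    return "".join(chr(ord(ch) - offset + SANGER_OFFSET) for ch in quality)


print(guess_offset(["@@CBDDFF3@AB"]))  # 33: '3' is ASCII 51, below 59
print(convert_to_sanger("hhhh"))       # IIII: Q40 in both encodings
```

In practice the sequencer model, software version and date are better evidence than any guess from the data, which is the point of the final line above.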
SAM [a]
[a] https://github.com/samtools/hts-specs
Intended to represent read alignments, also used for raw reads.
Tab-delimited plain text. Headers (optional) start with “@”
BAM [a] / CRAM [b]
[a] https://github.com/samtools/hts-specs
[b] http://www.ebi.ac.uk/ena/software/cram-toolkit
BAM is a compressed version of SAM.
• BGZF compression.
• Random access within compressed file, through indexing.
CRAM format may come to dominate, especially in archives, as datasets get larger:
• Reference-based compression. [2]
• Highly suited to compression and archiving of very large amounts of sequence data. [3]
[2] Fritz et al. (2011) Genome Res. 21:734-740 doi:10.1101/gr.114819.110
[3] Cochrane et al. (2012) GigaScience 1:2 doi:10.1186/2047-217X-1-2
Read repositories
Repositories are centrally-maintained locations that keep sequence read data from multiple projects.
Submission to a repository is a requirement for publication.
• ENA: The European Nucleotide Archive (http://www.ebi.ac.uk/ena), maintained by EBI/EMBL
• SRA: The Sequence Read Archive, formerly the Short Read Archive (http://www.ncbi.nlm.nih.gov/sra), maintained in the US by NCBI
Sequence Assembly
Once you have reads, you can assemble a genome.
Two main approaches to read assembly:
• Overlap-Layout-Consensus: Typically used with smaller sets of longer reads (e.g. 454, PacBio, Ion, Nanopore)
• de Bruijn assembly: Typically used with many, shorter reads (e.g. Illumina), but also useful for longer reads
See e.g. Leland Taylor’s thesis (http://gcat.davidson.edu/phast/docs/Thesis PHAST LelandTaylor.pdf), and PHAST (http://gcat.davidson.edu/phast/index.html).
Overlap-Layout-Consensus
The oldest approach, typically used with smaller sets of longer reads.
Can be time consuming (all-vs-all comparisons), but offset with graph-based OLC algorithms (e.g. SGA).
• Celera Assembler [4]
• Newbler (the Roche/454 GS assembler) [5]
• String Graph Assembler [6]
[4] http://wgs-assembler.sourceforge.net/
[5] http://www.454.com/products/analysis-software/
[6] Simpson and Durbin (2012) Genome Res. 22:549-556 doi:10.1101/gr.126953.111
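To illustrate why the overlap step involves all-vs-all comparisons, here is a toy Python sketch that finds exact suffix-prefix overlaps between every ordered pair of reads. It is not how any of the assemblers listed above are implemented: real tools tolerate sequencing errors and use indexing to avoid the quadratic cost. The reads and the min_length value are made up for illustration.

```python
from itertools import permutations


def overlap_length(left, right, min_length=3):
    """Length of the longest suffix of `left` that is a prefix of `right`.

    Only exact overlaps of at least `min_length` bases are reported;
    returns 0 if there is none.
    """
    start = 0
    while True:
        # Find the next place where the first bases of `right` occur in `left`
        start = left.find(right[:min_length], start)
        if start == -1:
            return 0
        # Check whether the rest of `left` from here matches the start of `right`
        if right.startswith(left[start:]):
            return len(left) - start
        start += 1


def all_overlaps(reads, min_length=3):
    """All-vs-all comparison: n * (n - 1) ordered pairs for n reads."""
    overlaps = {}
    for left, right in permutations(reads, 2):
        length = overlap_length(left, right, min_length)
        if length > 0:
            overlaps[(left, right)] = length
    return overlaps


reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
print(all_overlaps(reads))
# {('ATGGCGT', 'GCGTACC'): 4, ('GCGTACC', 'TACCTTA'): 4}
```

The resulting pairs are the edges of the overlap graph; the layout and consensus steps then walk that graph to build contigs.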
de Bruijn graph assembly
k-mer based graph (choice of k important):
de Bruijn graph assembly
k-mer based genome and read graphs [7]
“True” edges = genome; “Error” edges = wrong assembly
[7] Chaisson et al. (2009) Genome Res. 19:336-346 doi:10.1101/gr.079053.108
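A k-mer based graph can be sketched in a few lines of Python. In this illustrative version (one common formulation, not taken from any particular assembler) each k-mer in a read contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and edge counts record how often each k-mer was seen. The reads and k are toy values.

```python
from collections import Counter


def de_bruijn_edges(reads, k):
    """Count edges of a de Bruijn graph built from the k-mers in `reads`.

    Nodes are (k-1)-mers; each k-mer adds one edge prefix -> suffix.
    """
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = de_bruijn_edges(["ATGGCG", "GGCGTA"], k=4)
for (prefix, suffix), count in sorted(edges.items()):
    print(prefix, "->", suffix, count)
# The shared k-mer GGCG gives the edge GGC -> GCG a count of 2
```

No read is ever compared with another read: overlaps are implied by shared nodes. A basecall error would introduce k-mers, and so edges, that the genome does not contain; these usually have low counts.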
de Bruijn graph assembly
All sequencing technologies have basecall errors.
• The proportion of errors is approximately constant per read
• Basecall errors lead to edge errors
• The more reads you have, the more errors there are
Increased coverage does not ensure increased accuracy [8]
[8] Conway and Bromage (2011) Bioinformatics 27:479-486 doi:10.1093/bioinformatics/btq697
de Bruijn graph assembly
Fast, as it never computes overlaps.
Sensitive to sequencing errors, resolves short repeats (graph bulges and whirls).
Notable tools:
• Velvet [9]
• CLC Assembly Cell [10]
• Cortex [11]
[9] Zerbino and Birney (2008) Genome Res. 18:821-829 doi:10.1101/gr.074492.107
[10] http://www.clcbio.com/products/clc-assembly-cell/
[11] Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028
“Coloured” de Bruijn graph assemblies
Cortex [12] allows for on-the-fly identification of complex variation, and genotyping, by tracking “coloured” edges in the graph.
Colours ≈ different isolates/organisms (e.g. a reference)
[12] Iqbal et al. (2012) Nat. Genet. 44:226-232 doi:10.1038/ng.1028
Why map reads? [a]
[a] Trapnell et al. (2009) Nat. Biotech. 27:455-457 doi:10.1038/nbt0509-455
“Resequencing” an organism (sequencing a close relative, looking for SNPs/indels)
RNA-seq, ChIP-seq, etc. - coverage ≈ expression/binding
To see where reads map on an assembled genome
• Is coverage even? (can indicate repeats)
• Are there SNPs/indels? (heterogeneous population)
• Assembly problems?
Short-Read Sequence Alignment [a]
[a] Trapnell et al. (2009) Nat. Biotech. 27:455-457 doi:10.1038/nbt0509-455
An embarrassment of tools (over 60 listed on Wikipedia)
Main approaches:
• Alignment: Smith-Waterman is mathematically guaranteed to find the best alignment available (e.g. BFAST, MOSAIK); approximation to S-W (e.g. BLAST); ungapped or gapped alignment (e.g. MAQ, FAST, mrFAST, SOAP). Can be slow.
• Burrows-Wheeler Transform: Makes reusable index of the genome (e.g. Bowtie, BWA), can be extended to consider sequence probability (e.g. BWA-PSSM). Can be very fast.
Other tools may employ different algorithms, some designed to be parallelised on GPUs/FPGAs (e.g. NextGenMap, XpressAlign)
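The idea of a reusable Burrows-Wheeler index can be shown with a deliberately naive Python sketch: build the suffix array and BWT once, then find exact matches for any number of reads by backward search. This is a teaching toy under simplifying assumptions (exact matches only, quadratic-time construction, linear-time rank queries), not how Bowtie or BWA are implemented; the names are illustrative.

```python
def build_index(text):
    """Build a toy BWT index: suffix array, BWT string and first-column offsets."""
    text += "$"  # end marker, sorts before A, C, G and T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first_row[c] = number of characters in the text that sort before c
    first_row = {}
    total = 0
    for ch in sorted(set(text)):
        first_row[ch] = total
        total += text.count(ch)
    return suffix_array, bwt, first_row


def find_exact(pattern, index):
    """Return sorted start positions of exact matches, by backward search."""
    suffix_array, bwt, first_row = index
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in first_row:
            return []
        # Narrow the range of sorted suffixes that start with the pattern suffix
        lo = first_row[ch] + bwt[:lo].count(ch)
        hi = first_row[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_index("GATTACATTAG")  # build once...
print(find_exact("TTA", index))     # [2, 7]  ...then query many reads
print(find_exact("ATT", index))     # [1, 6]
print(find_exact("CCC", index))     # []
```

The work per read depends on the read length rather than the genome length, which is why index-based mappers can be very fast once the index exists.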
Visualising Read Mapping
Several tools available, e.g. Tablet (the best. . .) [13]
[13] Milne et al. (2013) Brief. Bioinf. 14:193-202 doi:10.1093/bib/bbs012
In an ideal world
Ideally, you would have one sequence per chromosome/plasmid (and no errors): a closed/complete genome.
PacBio, Sanger, manual closing, Nanopore(?)
More realistically. . .
Typically, a number of assembled fragments (contigs or scaffolds) are returned in FASTA format: a draft, disordered genome.
Around 250 contigs for a 5Mbp genome is usual with Illumina
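A first look at a draft assembly usually means counting contigs and adding up their lengths. The Python sketch below does this for FASTA-formatted lines; the three-contig example is invented, and in practice the lines would come from an open file handle.

```python
from statistics import median


def read_fasta_lengths(lines):
    """Return {sequence_name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier is the first word of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def summarise_assembly(lengths):
    """Contig count, total size, longest and median contig length."""
    sizes = sorted(lengths.values(), reverse=True)
    return {
        "contigs": len(sizes),
        "total_bp": sum(sizes),
        "longest_bp": sizes[0],
        "median_bp": median(sizes),
    }


example = """>contig_1 toy example
ACGTACGTAC
GGCC
>contig_2
TTTT
>contig_3
ACGTAC
""".splitlines()

print(summarise_assembly(read_fasta_lengths(example)))
# {'contigs': 3, 'total_bp': 24, 'longest_bp': 14, 'median_bp': 6}
```

For a real file, `read_fasta_lengths(open("assembly.fasta"))` would work the same way, where the filename is a placeholder.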
Ordering contigs
Contigs can be ordered correctly into scaffolds if paired-end reads span gaps (typically done during assembly).
Gaps are usually filled with Ns (length estimated)
Ordering contigs
Contigs and scaffolds can also be reordered by alignment to a reference genome.
• Mauve/progressiveMauve [14]
• MUMmer [15]
[14] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
[15] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
Where next? [a]
[a] Lefebure et al. (2010) Genome Biol. Evol. 2:646-655 doi:10.1093/gbe/evq048
Microbial Genomics and Bioinformatics
BM405
3. Whole Genome Comparisons
Table of Contents
Comparative Genomics
  Computational Comparative Genomics
Bulk Genome Properties
  Nucleotide Frequency/Genome Size
Whole Genome Alignment
  An Introduction to Pairwise Genome Alignment
  Average Nucleotide Identity
  Whole Genome Alignment in Practice
  Ordering Draft Genomes By Alignment
  Chromosome painting
The Power of Comparative Genomics
Massively enabled by high-throughput sequencing, and the availability of thousands of sequenced isolates.
Computational comparisons more powerful and precise than experimental comparative genomics: the ultimate microbial typing solution
Three broad areas/scales:
• Comparison of bulk genome properties
• Whole genome sequence comparisons
• Comparison of features/functional components
Nucleotide frequency/genome size
• Very easy to calculate from complete/draft genome
• Can calculate for individual contigs/scaffolds/regions
• Usually reported in GUI genome browsers
Trivial to determine using, e.g. Python
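As a hedged illustration of how little code this takes, the sketch below computes total size and GC content for a list of sequences (one per chromosome, contig or region). It is plain Python with made-up input; ambiguous bases such as N are excluded from the GC denominator, which is one possible convention rather than a standard.

```python
def genome_size_and_gc(sequences):
    """Return (total length, GC fraction) for an iterable of sequence strings.

    GC fraction is calculated over unambiguous A/C/G/T bases only.
    """
    total = 0
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for seq in sequences:
        seq = seq.upper()
        total += len(seq)
        for base in counts:
            counts[base] += seq.count(base)
    called = sum(counts.values())
    gc = (counts["G"] + counts["C"]) / called if called else 0.0
    return total, gc


size, gc = genome_size_and_gc(["ATGCGC", "GGAT"])
print(size, gc)  # 10 0.6
```

Passing a single contig or a slice of a sequence gives the per-contig or per-region values mentioned above.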
Nucleotide frequency/genome size
GC content and chromosome size can be characteristic
See data/bacteria size for example iPython notebook exercise
Blobology [a]
[a] Kumar and Blaxter et al. (2011) Symbiosis 3:119-126 doi:10.1007/s13199-012-0154-6
Sequencing samples may be contaminated or contain microbial symbionts.
Expect more host than symbiont/contaminant DNA
GC content and read coverage can be used to separate contigs, following assembly and mapping
http://nematodes.org/bioinformatics/blobology/
k-mers
• Nucleotides: [ACGT]
• Dinucleotides: [AA|AC|AG|AT|CA|CC|. . .] (16 dimers)
• Trinucleotides: [AAA|AAC|AAG|AAT|ACA|. . .] (64 trimers)
• k-mers: $4^k$ k-mers
(see example in data/shiny)
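The counts above follow from there being four choices at each of k positions. A short Python sketch (independent of the data/shiny example, with an invented input sequence) enumerates all possible k-mers and returns their relative frequencies in a sequence:

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Relative frequency of every possible k-mer over ACGT in `sequence`.

    Returns a dict with 4**k entries; k-mers containing other symbols
    (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(observed[kmer] for kmer in kmers)
    return {kmer: (observed[kmer] / total if total else 0.0) for kmer in kmers}


freqs = kmer_frequencies("ACGTACGT", 2)
print(len(freqs))   # 16 dimers
print(freqs["AC"])  # 2 of the 7 dimers in the sequence are AC
```

The returned values, taken in a fixed k-mer order, form the frequency vector that can be compared between genomes.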
k-mers
GC content = point value; k-mer frequencies = vector (list)
Diagnostic differences in k-mer frequency, and variability.
The basis of several comparison tools
[Figure panels: E. coli; Mycoplasma spp.]
What to align, and why?
To be useful, aligned genomes should:
• derive from a sufficiently recent common ancestor, so homologous regions can be identified
• derive from a sufficiently distant common ancestor, so that there are “interesting” differences to be identified
• help to answer your biological question
How to align, and why?
Naive sequence aligners (Needleman-Wunsch, Smith-Waterman)are not appropriate for genome alignment
• Computationally expensive on large sequences
• Cannot handle rearrangements
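A back-of-envelope calculation shows the scale of the first problem. Full dynamic-programming alignment fills a matrix with one cell per pair of positions; the Python sketch below counts cells for two 5 Mbp genomes. The 4 bytes per cell is an assumption for illustration only.

```python
def dp_matrix_cells(length_a, length_b):
    """Cells in a full Needleman-Wunsch/Smith-Waterman scoring matrix."""
    return (length_a + 1) * (length_b + 1)


cells = dp_matrix_cells(5_000_000, 5_000_000)
print(cells)                 # 25000010000001, about 2.5e13 cells
bytes_per_cell = 4           # assumed size of one score
print(cells * bytes_per_cell / 1e12)  # roughly 100 terabytes for the full matrix
```

Memory can be reduced with linear-space variants, but the number of cell updates stays quadratic, which is why genome aligners rely on seeds or exact matches instead.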
Very many alternative alignment algorithms proposed
• megaBLAST http://www.ncbi.nlm.nih.gov/blast/html/megablast.html
• MUMmer http://mummer.sourceforge.net/
• BLAT http://genome.ucsc.edu/goldenPath/help/blatSpec.html
• LASTZ http://www.bx.psu.edu/~rsharris/lastz/
• LAGAN http://lagan.stanford.edu/lagan_web/index.shtml
• and many, many more. . .
Example exercises in data/whole_genome_alignment.
megaBLAST
Optimised for speed, over BLASTN [1]
• Genome-level searches
• Queries on large sequence sets
• Long alignments of very similar sequence
Uses the greedy algorithm of Zhang et al. [2], not the BLAST algorithm.
• Concatenates queries (“query packing”) to improve performance
• Two modes: megaBLAST, and discontiguous megaBLAST (dc-megablast) for divergent sequences
BLASTN now uses the megaBLAST algorithm by default
[1] http://www.ncbi.nlm.nih.gov/blast/Why.shtml
[2] Zhang et al. (2000) J. Comp. Biol. 7:203-214 doi:10.1089/10665270050081478
BLAST vs megaBLAST
megaBLAST is faster, but does it give the same biological results?
megaBLAST (top) and BLAST (bottom) pairwise comparisons:
BLAST vs megaBLAST
Filter out weak matches - not quite identical:
MUMmer [a]
[a] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
Uses suffix trees for pattern matching: very fast even for large sequences
• Finds maximal exact matches
• Memory use depends only on the reference sequence size
Suffix trees (http://en.wikipedia.org/wiki/Suffix_tree):
• Can be built and searched in O(n) time
• But useful algorithms are nontrivial
The MUMmer algorithm [a]
[a] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
1. Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs)
2. Cluster into alignment anchors
3. Extend between anchors to produce the final alignment
This is the basis of a very flexible suite of programs that align different kinds of sequence: mummer, nucmer, promer
• nucleotide and (more sensitive) “conceptual protein” alignments
• used for genome comparisons, assembly scaffolding, repeat detection, ...
• the basis of other aligners/assemblers (e.g. Mugsy, AMOS)
MUMmer vs megaBLAST
MUMmer identifies fewer weak matches
megaBLAST (top) and MUMmer (bottom) pairwise comparisons:
MUMmer vs megaBLAST
Filter out weak BLAST matches - not quite identical:
DNA-DNA hybridisation [a]
[a] Rosselló-Móra and Amann (2001) FEMS Micro. Rev. 25:39-67 doi:10.1016/S0168-6445(00)00040-1
• “Gold Standard” for prokaryotic taxonomy, since 1960s. “70% identity ≈ same species.”
• Denature DNA from two organisms.
• Allow to anneal. Reassociation ≈ similarity, measured as ∆T of denaturation curves.
Proxy for sequence similarity - replace with genome analysis [3]?
[3] Chan et al. (2012) BMC Microbiol. 12:302 doi:10.1186/1471-2180-12-302
Average Nucleotide Identity (ANIb) [a]
[a] Goris et al. (2007) Int. J. Syst. Evol. Microbiol. 57:81-91 doi:10.1099/ijs.0.64483-0
1. Break genomes into 1020 nt fragments
2. ANIb: mean % identity of all BLASTN matches with > 30% identity and > 70% fragment coverage
• DDH:ANIb linear
• DDH:%ID linear
• 70% ID ≈ 95% ANIb
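The filtering and averaging step can be sketched as below. This is a simplified illustration, not the published implementation: the input layout (one best BLASTN hit per 1020 nt fragment, given as identity, aligned length and fragment length) and all values are hypothetical.

```python
def anib(hits, min_identity=30.0, min_coverage=0.70):
    """Mean % identity of fragment hits passing identity and coverage filters.

    hits: list of (percent_identity, aligned_length, fragment_length) tuples,
    one best BLASTN hit per query fragment (assumed input layout).
    Returns None if no hit passes the filters.
    """
    kept = [
        identity
        for identity, aligned, fragment in hits
        if identity > min_identity and aligned / fragment > min_coverage
    ]
    return sum(kept) / len(kept) if kept else None


example_hits = [
    (98.5, 1020, 1020),
    (96.0, 900, 1020),
    (99.0, 400, 1020),   # fails the 70% coverage filter
    (25.0, 1000, 1020),  # fails the 30% identity filter
]
print(anib(example_hits))  # mean of 98.5 and 96.0
```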
Average Nucleotide Identity (ANIm) [a]
[a] Richter and Rossello-Mora (2009) Proc. Natl. Acad. Sci. USA 106:19126-19131 doi:10.1073/pnas.0906412106
1. Align genomes (MUMmer)
2. ANIm: mean % identity of all matches
• DDH:ANIm linear
• 70% ID ≈ 95% ANIb
TETRA: tetranucleotide frequency-based classifier introduced in the same paper.
ANI/TETRA comparison
All three methods applied to Anaplasma spp.
ANIb: [Figure: heatmap of pairwise ANIb values, with colour key and histogram (scale 0.9 to 0.98), for A. phagocytophilum (NC_021881, NC_021880, NC_021879, NC_007797), A. centrale (NC_013532) and A. marginale (NC_004842, NC_012026, NC_022760, NC_022784)]
ANIm: [Figure: heatmap of pairwise ANIm values for the same nine genomes, same scale]
TETRA: [Figure: heatmap of pairwise TETRA values for the same nine genomes, same scale]
ANIb discards information, relative to ANIm: less sensitive.
ANIb/ANIm ≈ evolutionary history; TETRA ≈ bulk composition
ANI in practice
Practical applications [4] (note: no gene content used)
29 Dickeya isolates: species structure
[Figure: ANIm heatmap of Dickeya genome assemblies, with colour key and histogram (scale 0.9 to 0.98)]
180 E. coli isolates: subtyping
[Figure: ANIm heatmap of isolate contig assemblies, with colour key and histogram (scale 0.9 to 0.98); group legend: A, B1, B2, C, D, E, F, U, X]
[4] van der Wolf et al. (2014) Int. J. Syst. Evol. Micr. 64:768-774 doi:10.1099/ijs.0.052944-0
Collinearity and Synteny
Genome rearrangements occur, but there can still be conservation of sequence similarity and ordering.
• Two elements are collinear if they lie in the same linear sequence
• Two elements are syntenous (or syntenic) if:
  • (orig.) they lie on the same chromosome
  • (mod.) there is conservation of blocks of order within the same chromosome
Signs of evolutionary constraints, like sequence conservation or synteny, may indicate functional genome regions.
Pyrococcus spp. [a]
[a] Zivanovic et al. (2002) Nuc. Acids Res. 30:1902-1910 doi:10.1093/nar/30.9.1902
Comparison of Pyrococcus genomes (P. horikoshii, P. abyssi, P. furiosus) shows chromosome shuffling.
Transposition is a major cause of genomic disruption.
Vibrio mimicus [a]
[a] Hasan et al. (2010) Proc. Natl. Acad. Sci. USA 107:21134-21139 doi:10.1073/pnas.1013825107
Chromosome C-II carries genes associated with environmental adaptation; C-I carries virulence genes.
C-II has undergone extensive rearrangement; C-I has not.
Suggests modularity of genome organisation, as a mechanism for adaptation (HGT, two-speed genome).
Serratia symbiotica [a]
[a] Burke and Moran (2011) Genome Biol. Evol. 3:195-208 doi:10.1093/gbe/evr002
S. symbiotica is a recently evolved symbiont of aphids.
Massive genomic decay is an adaptation to the new environment.
Multiple genome alignment is hard
Can we not just align all our genomes, together?
No. Because it’s really, really hard.
Analogous to problems with multiple sequence alignment (three or more sequences).
• Computationally extremely expensive (O(L^n), L = length of sequence, n = number of sequences)
• NP-complete problem: no known efficient way to find a solution
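A small arithmetic sketch of the O(L^n) growth, assuming exact alignment by a full n-dimensional dynamic programming table with all sequences of equal length (the lengths used are illustrative):

```python
def dp_cells(length, n_sequences):
    """Number of cells in a full n-dimensional dynamic programming table."""
    return (length + 1) ** n_sequences


# Even for short 1 kbp sequences, each added sequence multiplies the work
# by roughly the sequence length.
for n in (2, 3, 4):
    print(n, f"{dp_cells(1000, n):.2e}")
```

For genome-length sequences the same calculation rules out exact methods, which is why heuristics are used.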
Heuristic (approximate) methods are used, most commonly:
• Progressive alignment
• Iterative alignment
Mauve [a]
[a] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
Progressive alignment tool, with a GUI. Application to nine enterobacteria: rearrangement of homologous backbone.
Alternatives include MLAGAN [5] and MUMmer [6]
[5] Brudno et al. (2003) Genome Res. 13:721-731 doi:10.1101/gr.926603
[6] Kurtz et al. (2004) Genome Biol. 5:R12 doi:10.1186/gb-2004-5-2-r12
Mauve algorithm [a]
[a] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
1. Find local alignments (multi-MUMs)
2. Build guide tree from multi-MUMs
3. Select subset of multi-MUMs as anchors, and partition into Local Collinear Blocks (LCBs): consistently ordered subsets
4. Progressive alignment against guide tree
Reordering contigs [a]
[a] Darling et al. (2004) Genome Res. 14:1394-1403 doi:10.1101/gr.2289704
Mauve also enables draft genome reordering.
Once LCBs are identified, Mauve Contig Mover can be applied to reorder contigs.
Example exercise in data/whole_genome_alignment
Chromosome painting [a]
[a] Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055
“Chromosome painting” infers recombination-derived ‘chunks’.
Genome’s haplotype constructed in terms of recombination events from a ‘donor’ to a ‘recipient’ genome.
Chromosome painting [a]
[a] Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055
Recombination events summarised in a coancestry matrix.
H. pylori: most recombination is within geographical bounds, but there is asymmetrical donation from Amerind/East Asian to European isolates.
Licence: CC-BY-SA
By: Leighton Pritchard
This presentation is licensed under the Creative Commons Attribution ShareAlike license https://creativecommons.org/licenses/by-sa/4.0/
Microbial Genomics and Bioinformatics BM405
4. Genome Features
Leighton Pritchard [1,2,3]
[1] Information and Computational Sciences, [2] Centre for Human and Animal Pathogens in the Environment, [3] Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
Acceptable Use Policy
Recording of this talk, taking photos, discussing the content using email, Twitter, blogs, etc. is permitted (and encouraged), providing distraction to others is minimised.
These slides will be made available on SlideShare.
These slides, and supporting material including exercises, are available at https://github.com/widdowquinn/Teaching-2014-11-21-Strathclyde
Table of Contents
Genome Features
• What are genome features?
• Prokaryotic CDS Prediction
• Assessing Prediction Methods
• Prokaryotic Annotation Pipelines
Genome-Scale Functional Annotation
• Functional Annotation
• A visit to the doctor
• Statistics of genome-scale prediction
Building to Metabolism
• Reconstructing metabolism
Genome Features
• Genome features are annotated regions of the genome.
• Typically represent functional elements.
• May be simple (single region), or complex (subfeatures)
Why annotate genome features?
• Almost all use of genomics depends on annotation: annotation quality is critical to downstream use of genomics in biology
• Annotation is curation (a live, active process), not cataloguing
• Automated annotation from curated data (public databases) is the only game in town, given the data quantities we generate
• But you can’t propagate something that doesn’t exist: up to 30% of metabolic activity has no known gene associated with it [1]
• Biocurators can spend as much time “de-annotating” literature-based annotations as entering new data [2]
[1] Chen and Vitkup (2007) Trends Biotech. doi:10.1016/j.tibtech.2007.06.001
[2] Bairoch (2009) Nat. Preced. doi:10.1038/npre.2009.3092.1
Gene Features
Gene features have significant substructure, especially in eukaryotes.
• 5′ UTR
• translation start
• intron start/stop
• exon start/stop
• translation stop
• translation terminator
• 3′ UTR
ncRNA Features
• tRNA - transfer RNA
• rRNA - ribosomal RNA
• CRISPRs - bacterial/archaeal defence (used for genome editing)
• many other classes
Regulatory/Repeat Features
Regulatory sites
• transcription start sites
• RNA polymerase binding sites
• Transcription Factor Binding Sites (TFBS)
Repetitive regions and mobile elements
• tandem repeats
• (retro-)transposable elements
• phage inclusions
Principles of feature prediction
Two main approaches to feature prediction:
• ab initio prediction - start from first principles, using only the genome sequence:
  • Unsupervised methods - not trained on a dataset
  • Supervised methods - trained on a dataset
• homology matches
  • alignment to features from related organisms (comparative genomics, annotation transfer)
  • from known gene products (e.g. proteins, ncRNA)
  • from transcripts/other intermediates (e.g. ESTs, cDNA, RNAseq)
Dedicated tools available for many different classes of feature.
Prokaryotic CDS Prediction Methods
Using CDS prediction as an illustrative example for all feature prediction.
Sequence conservation (evolutionary constraint; an unsupervised, a priori method) can be useful
• Prokaryotes “easier” than eukaryotes for gene/CDS prediction
  • Less uncertainty in predictions (isoforms, gene structure)
  • Very gene-dense (over 90% of chromosome is coding sequence)
  • No intron-exon structure
Prokaryotic CDS Prediction Methods
ORFs are plentiful:
• Problem is: “which possible ORF contains the true gene, and which start site is correct?”
• Still not a solved problem
Finding Open Reading Frames
The simplest approach: find ORFs (sequence between two consecutive in-frame stop codons)
• ORF finding is naive, does not consider:
  • Start codon
  • Promoter/RBS motifs
  • Wider context (e.g. overlapping genes)
Dedicated tools, e.g. Glimmer, Prodigal, RAST, GeneMarkS, are usually better.
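The naive stop-to-stop definition above can be sketched as follows. This is an illustrative example only: it scans the three forward frames, uses the standard stop codons TAA/TAG/TGA, and the minimum length threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_length=90):
    """Yield (frame, start, end) for stop-to-stop ORFs on the forward strand.

    start and end are 0-based, end-exclusive, and delimit the codons lying
    between two consecutive in-frame stop codons (the stops are excluded).
    """
    sequence = sequence.upper()
    for frame in range(3):
        previous_stop_end = None  # no upstream stop seen yet in this frame
        for pos in range(frame, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                if previous_stop_end is not None and pos - previous_stop_end >= min_length:
                    yield frame, previous_stop_end, pos
                previous_stop_end = pos + 3


toy = "TAA" + "ATGAAACCC" + "TAG"
print(list(find_orfs(toy, min_length=9)))
```

The reverse strand can be scanned by passing the reverse complement of the sequence. Note that nothing here checks for a start codon or ribosome binding site, which is exactly the limitation listed above.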
Two ab initio CDS Prediction Tools
• Glimmer [3]
  • Interpolated Markov models
  • Can be trained on “gold standard” datasets
• Prodigal [4]
  • Log-likelihood model based on GC frame plots, followed by dynamic programming
  • Can be trained on “gold standard” datasets
Applying these to an example bacterial chromosome...
[3] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009
[4] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119
Comparing predictions in Artemis [a]
[a] Carver et al. (2012) Bioinformatics 28:464-469 doi:10.1093/bioinformatics/btr703
Not every ORF (green) is predicted to contain a coding sequence (CDS; blue/orange).
Self-contradictory CDS calls (orange); even automated annotation needs manual curation.
Comparing predictions in Artemis
Glimmer (green) and Prodigal (blue) CDS prediction methods do not always agree (presence/absence, start position).
How do we know which (if either) is best?
Using a “Gold Standard”: validation [a]
[a] Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4
A general approach for all predictive methods
• Define a known, “correct” set of true/false, positive/negative etc. examples - the “gold standard”
• Evaluate your predictive method against that set for
  • sensitivity, specificity, accuracy, precision, etc.
This ought to be done by the method developers, but it is often wise to evaluate in your own system.
Many methods are available; coverage is beyond the scope of this introduction.
Contingency Tables
                          Condition (Gold standard)
                          True              False
Test outcome   Positive   True Positive     False Positive
               Negative   False Negative    True Negative

Performance Metrics
Sensitivity = TPR = TP/(TP + FN)
Specificity = TNR = TN/(FP + TN)
FPR = 1 − Specificity = FP/(FP + TN)

If you don’t have this information, you can’t interpret predictive results properly.
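The metrics above follow directly from the four cells of the contingency table. A minimal sketch (the counts are hypothetical; PPV and FDR are included because they appear in the results tables later in this section):

```python
def performance_metrics(tp, fp, fn, tn):
    """Return common metrics from the four cells of a contingency table."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (TPR)
        "specificity": tn / (fp + tn),  # true negative rate (TNR)
        "fpr": fp / (fp + tn),          # false positive rate = 1 - specificity
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "fdr": fp / (tp + fp),          # false discovery rate = 1 - PPV
    }


# Hypothetical gold standard: 100 true members, 1000 non-members
metrics = performance_metrics(tp=95, fp=10, fn=5, tn=990)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```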
“Gold Standard” results
• Tested glimmer [5] and prodigal [6] on two enterobacterial close relatives as “gold standards” (still not perfect...)
  1. Manually annotated (>3 expert person years)
  2. Community-annotated (many research groups, interested in their own subset of genes)
• Both methods trained directly on the annotated genes in each organism!
[5] Delcher et al. (2007) Bioinformatics 23:673-679 doi:10.1093/bioinformatics/btm009
[6] Hyatt et al. (2010) BMC Bioinf. 11:119 doi:10.1186/1471-2105-11-119
“Gold Standard” results
Manually annotated: 4550 CDS
genecaller          glimmer     prodigal
predicted           4752        4287
missed              284 (6%)    407 (9%)
Exact Prediction
  sensitivity       62%         71%
  FDR               41%         25%
  PPV               59%         75%
Correct ORF
  sensitivity       94%         91%
  FDR               10%         3%
  PPV               90%         97%
“Gold Standard” results
Community annotated: 4475 CDS
genecaller          glimmer     prodigal
predicted           4679        4467
missed              112 (3%)    156 (3%)
Exact Prediction
  sensitivity       62%         86%
  FDR               31%         14%
  PPV               69%         86%
Correct ORF
  sensitivity       97%         97%
  FDR               7%          3%
  PPV               93%         97%
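As a worked check, the “Correct ORF” rows can be recomputed from the predicted and missed counts. This sketch assumes that “missed” counts gold-standard CDS with no prediction in the correct ORF, so that true positives = gold-standard total − missed; that reading is an assumption, not stated on the slides. Under it, the rounded values agree with the tables.

```python
def correct_orf_metrics(gold_total, predicted, missed):
    """Sensitivity, PPV and FDR, assuming true positives = gold_total - missed."""
    true_positives = gold_total - missed
    sensitivity = true_positives / gold_total
    ppv = true_positives / predicted
    return sensitivity, ppv, 1 - ppv


# (gold-standard CDS, predicted, missed), taken from the two tables above
results = {
    ("manual", "glimmer"): (4550, 4752, 284),
    ("manual", "prodigal"): (4550, 4287, 407),
    ("community", "glimmer"): (4475, 4679, 112),
    ("community", "prodigal"): (4475, 4467, 156),
}
for key, counts in results.items():
    sens, ppv, fdr = correct_orf_metrics(*counts)
    print(key, f"sensitivity={sens:.0%} PPV={ppv:.0%} FDR={fdr:.0%}")
```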
Gene/CDS Prediction
• Alternative CDS (and all other) prediction methods are unlikely to give identical results, or perform equally well
• There is No Free Lunch (this is a theorem: http://en.wikipedia.org/wiki/No_free_lunch_theorem)
• To assess/choose between methods, performance metrics are required
• Even on prokaryotes (a relatively simple case), current best methods for CDS prediction are imperfect
• Manual correction is often required (usually the most demanding and time-consuming part of the process).
Prokaryotic Annotation Pipelines [a]
[a] Richardson and Watson (2012) Brief. Bioinf. 14:1-12 doi:10.1093/bib/bbs007
Many choices, including RAST [7], PROKKA [8], BaSYS [9], etc.
Often perform both CDS/feature calling and functional prediction.
Two broad approaches:
1. Heavyweight: maintain database and resource, often annotating by homology, e.g. RAST
2. Lightweight: chain together multiple third-party packages, e.g. PROKKA
Pipelines take a lot of tedium (and control) out of annotating bacterial genomes, but have the same issues as every other prediction tool.
[7] Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75
[8] Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153
[9] Van Domselaar et al. (2005) Nuc. Acids Res. 33:W455-W459 doi:10.1093/nar/gki593
PROKKA [a]
[a] Seemann (2014) Bioinformatics 30:2068-2069 doi:10.1093/bioinformatics/btu153
• Lightweight, and fast.
• Runs locally. (5 Mbp genome takes ≈10 min on my desktop; more detailed ncRNA prediction takes ≈20 min)
• Flexible: built-in databases can be replaced by user databases.
• Uses freely-accessible third-party tools for prediction
Simple to run (at the command-line, or in Galaxy [10]).
[10] Goecks et al. (2010) Genome Biol. 11:R86 doi:10.1186/gb-2010-11-8-r86
RAST [a]
[a] Aziz et al. (2008) BMC Genomics 9:75 doi:10.1186/1471-2164-9-75
• Server-based (http://rast.nmpdr.org/). Queues likely.
• Relies on SEED and FIGFam databases, held at NMPDR
• FIGFam: isofunctional homologue families
• Produces metabolic reconstruction
Principles of function prediction
At genome scale, we realistically have to automate function prediction.
Function prediction is just like any other prediction method.
Two main approaches to function prediction:
• ab initio prediction (on basis of feature sequence/context only)
  • Unsupervised methods - not trained on an exemplar dataset
  • Supervised methods - trained on an exemplar dataset
• homology matches (sequence similarity)
  • alignment to features with known/predicted functions
Homology-based function prediction
Two proteins with similar sequence may have similar function.But. . .
• How similar do they have to be (and where) to share the same function?
• What do we mean by ‘same function’, anyway: interaction/substrate specificity? participation in a pathway? contribution to a structure? biochemical interconversion? ...
• How confident can we be in the comparator (annotated) sequence: was that function determined experimentally?
Gene Ontology (GO) [a]
[a] Ashburner et al. (2000) Nat. Genet. 25:25-29 doi:10.1038/75556
The Gene Ontology provides a common vocabulary for describing biological function, and unifying functional descriptions.
Ontologies (controlled vocabularies) are central to information-sharing.
Gene Ontology Consortium: http://geneontology.org/
Many annotation tools and databases produce GO output, or compatible controlled vocabulary terms, e.g.
• Blast2GO [11]: BLAST-based annotation
• PHI-Base [12]: microbial pathogen-host interaction specific functions
• GOPred [13]: combines several protein function classifiers
[11] Conesa et al. (2005) Bioinformatics 21:3674-3676 doi:10.1093/bioinformatics/bti610
[12] Winnenburg et al. (2006) Nuc. Acids Res. 34:D459-D464 doi:10.1093/nar/gkj047
[13] Sarac et al. (2010) PLoS One 5:e12382 doi:10.1371/journal.pone.0012382
Are database annotations reliable? [a]
[a] Schnoes et al. (2013) PLoS Comp. Biol. 9:e1003063 doi:10.1371/journal.pcbi.1003063
Are protein function annotations in databases determined experimentally, or by annotation transfer?
High throughput experiments and genome annotations are conducted without validation of function, and placed in databases.
• GO databases record annotation origin by publication
• GO databases record evidence codes, e.g.: EXP=Inferred from Experiment; ISS=Inferred from Sequence Similarity
• 0.14% of contributing publications provide 25% of all experimentally validated annotations in the Uniprot-GOA compilation.
• There are biases in functional annotation.
No clear solution to this kind of bias - but we have to recognise and account for it.
Are database annotations reliable? [a]
[a] Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340
The Critical Assessment of Function Annotation (CAFA) project.
Do biased database annotations matter?
Experimental annotations of proteins are incomplete. But is that important?
Tested by simulation, and following databases for three years. [14]
1. Yes. It matters.
2. Current large scale annotations are meaningful and almost surprisingly reliable.
3. The nature and level of data incompleteness, and type of classification model have an effect.
4. “Low precision, high recall” (i.e. less discriminating) tools most significantly affected.
Molecular function prediction is usually more reliable than biological process prediction [15]
[14] Jiang et al. (2014) Bioinformatics 30:i609-i616 doi:10.1093/bioinformatics/btu472
[15] Cozzetto et al. (2013) BMC Bioinf. 14:S3-S1 doi:10.1186/1471-2105-14-S3-S1
CAFA results [a]
[a] Radivojac et al. (2013) Nat. Meth. 10:221-227 doi:10.1038/nmeth.2340
The Critical Assessment of Function Annotation (CAFA) 2013 results. (F-measure combines precision and recall)
• You can do better than BLAST.
• Best-performing methods do comparably well.
• Best methods used evolutionary relationships, structure, and expression data.
• Machine Learning works best.
A wee trip to the doctor
• You go for a checkup, and are tested for disease X
• The test has sensitivity = 0.95 (predicts disease where there is disease)
• The test has FPR = 0.01 (predicts disease where there is no disease)
• Your test is positive
• What is the probability that you have disease X?
  • 0.01, 0.05, 0.50, 0.95, 0.99?
• (Audience Participation!)
A wee trip to the doctor
• What is the probability that you have disease X?
• Unless you know the baseline occurrence of disease X, you cannot determine this.
• Baseline occurrence: fX
  • fX = 0.01 ⇒ P(disease|+ve) = 0.490 ≈ 0.5
  • fX = 0.8 ⇒ P(disease|+ve) = 0.997 ≈ 1.0
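The two values above follow from the test's sensitivity and FPR once a baseline is chosen. A minimal sketch of the calculation, where the function name is ours and not from any library:

```python
def positive_predictive_value(sensitivity, fpr, base_rate):
    """Return P(disease | positive test) from Bayes' Theorem."""
    true_positives = sensitivity * base_rate        # P(+|X) P(X)
    false_positives = fpr * (1.0 - base_rate)       # P(+|not X) P(not X)
    return true_positives / (true_positives + false_positives)


# Sensitivity and FPR from the slide, at the two baseline occurrences
for f_x in (0.01, 0.8):
    ppv = positive_predictive_value(0.95, 0.01, f_x)
    print(f"fX = {f_x}: P(disease|+ve) = {ppv:.3f}")
```

The same test gives very different answers purely because the baseline changes.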
Why Performance Metrics Matterᵃ
ᵃ Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4
• Imagine a paper describing a predictor for protein functional class (e.g. Type III effector)
• The paper reports sensitivity = 0.95, FPR = 0.01
• You run the predictor on 20,000 proteins in an organism
• It predicts 130 members of the class. How many of them are likely to be true positives?
• We need a baseline level of that class (fX) in the genome to determine this.
• We estimate ≈ 200 members in the protein complement, so fX = 0.01
• fX = 0.01 ⇒ P(class|+ve) = 0.490 ≈ 0.5
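To turn that probability into an answer to the question on the slide, multiply it by the number of predictions. A short sketch using only the numbers given above; the variable names are illustrative:

```python
# Values taken from the slide
sensitivity, fpr = 0.95, 0.01
n_proteins, n_class_members = 20_000, 200
n_predicted = 130

# Baseline level of the class in the protein complement
base_rate = n_class_members / n_proteins

# P(class | +ve) from Bayes' Theorem
ppv = (sensitivity * base_rate) / (
    sensitivity * base_rate + fpr * (1.0 - base_rate)
)

# Expected number of correct predictions among those reported
expected_true_positives = n_predicted * ppv
print(f"fX = {base_rate:.2f}, P(class|+ve) = {ppv:.3f}")
print(f"Expected true positives: about {expected_true_positives:.0f} of {n_predicted}")
```

On these assumptions roughly half of the 130 predictions would be expected to be wrong, and the predictor gives no indication of which half.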
Bayes’ Theorem
• May seem counter-intuitive: 95% sensitivity, 99% specificity ⇒ 50% chance of any prediction being incorrect (at a base rate of fX = 0.01)
• Probability given by Bayes’ Theorem
• $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$
• This step is commonly overlooked in the literature
  • confirmation bias
  • people want to see positive examples/tell a story
  • people want to think their predictor works
A cautionary taleᵃ
ᵃ Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376
• Paper describes EffectiveT3, a type III effector prediction tool
• Reported sensitivity ≈ 0.71, FPR ≈ 0.15
• Applied tool to 739 complete bacterial and archaeal genomes
• Organisms with an identifiable T3SS: 2-7% of genome predicted to be secreted
• Organisms without an identifiable T3SS (or known not to have one): 1-10% of genome predicted to be secreted
• “The surprisingly high number of (false) positives in genomes without T3SS exceeds the expected false positive rate”
• This is not a surprise, statistically.
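One way to see why this is unsurprising is to ask what fraction of a genome the tool would be expected to call positive, given its reported error rates. This is a rough sketch that assumes the reported sensitivity and FPR hold across whole genomes; the 0.03 base rate is an assumed value for organisms with a T3SS.

```python
def expected_positive_fraction(sensitivity, fpr, base_rate):
    """Expected fraction of all proteins that the tool calls positive."""
    return sensitivity * base_rate + fpr * (1.0 - base_rate)


# Reported EffectiveT3 performance (from the slide)
sens, fpr = 0.71, 0.15

# Genome with no type III effectors at all: every positive is false
without_t3ss = expected_positive_fraction(sens, fpr, 0.0)

# Genome where an assumed 3% of proteins are true effectors
with_t3ss = expected_positive_fraction(sens, fpr, 0.03)

print(f"No effectors present: {without_t3ss:.1%} of genome predicted")
print(f"3% true effectors:    {with_t3ss:.1%} of genome predicted")
```

With an FPR of 0.15, positives amounting to several percent of a genome are to be expected whether or not a T3SS is present.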
A cautionary taleᵃ
ᵃ Arnold et al. (2009) PLoS Pathog. 5:e1000376 doi:10.1371/journal.ppat.1000376
Probability that an EffectiveT3 positive prediction corresponds to a secreted protein is given by Bayes’ Theorem
• $P(X \mid +) = \dfrac{P(+ \mid X)\,P(X)}{P(+ \mid X)\,P(X) + P(+ \mid \bar{X})\,P(\bar{X})}$
• P(+|X) = sensitivity = 0.71
• P(+|X̄) = FPR = 0.15
• P(X) = base rate ≈ 0.03¹⁶
• ⇒ P(X|+) ≈ 0.13
Only 13% of positive predictions are likely to be true positives!
How many predicted type III secreted proteins were there...
¹⁶ Boch and Bonas (2010) Annu. Rev. Phytopathol. 48:419-436 doi:10.1146/annurev-phyto-080508-081936
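For reference, substituting the three values above into Bayes' Theorem gives the quoted figure, with P(X̄) = 1 − P(X) = 0.97:

```latex
P(X \mid +)
  = \frac{0.71 \times 0.03}{0.71 \times 0.03 + 0.15 \times 0.97}
  = \frac{0.0213}{0.0213 + 0.1455}
  = \frac{0.0213}{0.1668}
  \approx 0.13
```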
Interpreting genome-scale predictionsᵃ
ᵃ Pritchard and Broadhurst (2014) Methods Mol. Biol. 1127:53-64 doi:10.1007/978-1-62703-986-4_4
• Statistics at genome-scale can be counterintuitive.
• Use Bayes’ Theorem!
• Predictions identify groups, not individual members of the group, e.g.
  • Test for airport smugglers has P(smuggler|+) = 0.9
  • Test gives 100 positives
• Which specific individuals are truly smugglers?
• The test does not allow you to determine this - you need more evidence for each individual
• Same principle applies to other classifiers (including protein functional class prediction) - watch for ‘cherry-picking’ in publications
Reconstructing metabolismᵃ
ᵃ Thiele and Palsson (2010) Nat. Protoc. 5:93-121 doi:10.1038/nprot.2009.203
Once metabolic functional annotation has been assigned to features, we can do comparative analysis of metabolism.
Dynamic models of metabolismᵃ
ᵃ Orth et al. (2010) Nat. Biotech. 28:245-248 doi:10.1038/nbt.1614
By using constraint-based models (e.g. Flux Balance Analysis), we can make these into dynamic representations of bacterial metabolism.
• Upper, lower bounds to reaction rates
• Define objective phenotype
• Calculate conditions resulting in flux
• in silico knockouts
E. coli metabolismᵃ
ᵃ Monk et al. (2013) Proc. Natl. Acad. Sci. USA 110:20338-20343 doi:10.1073/pnas.1307797110
E. coli has a very long history of metabolic reconstruction.¹⁷
Recent modelling work predicts which nutrients support growth.
¹⁷ Reed and Palsson (2003) J. Bact. 185:2692-2699 doi:10.1128/JB.185.9.2692-2699.2003
E. coli metabolismᵃ
ᵃ Baumler et al. (2011) BMC Syst. Biol. 5:182 doi:10.1186/1752-0509-5-182
Models are complex, and experimental validation is essential.
There’s more we don’t know...
Licence: CC-BY-SA
By: Leighton Pritchard
This presentation is licensed under the Creative Commons Attribution ShareAlike license
https://creativecommons.org/licenses/by-sa/4.0/
Microbial Genomics and Bioinformatics
BM405
5. Finding Equivalent Features
Leighton Pritchard¹,²,³
¹ Information and Computational Sciences, ² Centre for Human and Animal Pathogens in the Environment, ³ Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
Table of Contents
Equivalent Genome Features
• What makes genome features equivalent?
Homology, Orthology, Paralogy
• Who let the -logues out?
• What’s so important about orthologues?
• Evaluating orthologue prediction
• Using orthologue predictions
• Core and Pan-genomes
Conclusions
• Things I Didn’t Get To
• Conclusions
What makes genome features equivalent?
When we compare two features (e.g. genes) between two or more genomes, there must be some basis for making the comparison.
That is, they have to be equivalent in some way, such as:
• common evolutionary origin
• functional similarity
• a family-based relationship
It’s common to define equivalence of genome features in terms of evolutionary relationship.
Why look at equivalent features?
The real power of genomics is comparative genomics!
• Makes catalogues of genome components comparable between organisms
• Differences, e.g. presence/absence of equivalents, may support hypotheses for functional or phenotypic difference
• Can identify characteristic signals for diagnosis/epidemiology
• Can build parts lists and wiring diagrams for systems and synthetic biology
Evolutionary relationshipsᵃ
ᵃ Fitch (1970) Syst. Zool. 19:99-113 doi:10.2307/2412448
Equivalencies and relationships can be quite complex.
We need precise terms to describe relationships between genome features.
• analogy: functional similarity
• homology: common evolutionary ancestry
Who let the -logues out?ᵃ
ᵃ Fitch (2000) Trends Genet. 16:227-231 doi:10.1016/S0168-9525(00)02005-9
• homologues: elements that are similar because they share a common ancestor. There are NOT degrees of homology
• analogues: elements that are (functionally?) similar, and this may be through common ancestry or some other means, e.g. convergent evolution
• orthologues: homologues that diverged through speciation
• paralogues: homologues that diverged through duplication within the same genome
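The orthologue/paralogue distinction above depends only on the kind of event at which two genes diverged. As a minimal sketch, assume a gene tree whose internal nodes have already been labelled as speciation or duplication events; the tree and gene names here are hypothetical, and in practice those labels are themselves inferences.

```python
from itertools import product


def leaves(node):
    """Return the gene names found below a node of the tree."""
    if isinstance(node, str):
        return [node]
    _event, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, relations=None):
    """Label each gene pair by the event at its last common ancestor."""
    if relations is None:
        relations = {}
    if isinstance(node, str):          # a leaf has no pairs to label
        return relations
    event, left, right = node
    label = "orthologues" if event == "speciation" else "paralogues"
    # Genes on opposite sides of this node diverged at this event
    for gene_a, gene_b in product(leaves(left), leaves(right)):
        relations[frozenset((gene_a, gene_b))] = label
    classify_pairs(left, relations)
    classify_pairs(right, relations)
    return relations


# Hypothetical tree: an ancestral duplication, then a speciation in each copy
tree = ("duplication",
        ("speciation", "geneA_sp1", "geneA_sp2"),
        ("speciation", "geneB_sp1", "geneB_sp2"))

relations = classify_pairs(tree)
for pair, label in sorted(relations.items(), key=lambda item: sorted(item[0])):
    print(" / ".join(sorted(pair)), "->", label)
```

Every pair is either an orthologue or a paralogue pair, and all of them are homologues.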
ITYFIALMCTTᵃ
ᵃ Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030
But it’s a little more complicated than that.
Biology is not well-behaved.
• Gene loss
• Homologues may diverge so widely that they can be hard to recognise
• Reconstructed evolutionary trees may not be robust inferences of speciation (or relevant to it, in prokaryotes)
• There is no record of history - we can only make inferences
All classifications of orthology/paralogy are inferences!
Ensembl Comparaᵃ
ᵃ Vilella et al. (2009) Genome Res. 19:327-335 doi:10.1101/gr.073585.107
Some tools/databases, e.g. Ensembl Compara, use slightly different definitions (almost everything’s an “orthologue”)
Why focus on orthologues?
Formalise the idea of corresponding genes in different organisms.
Orthologues serve two purposes:
• Evolutionary equivalence
• Functional equivalence (“The Ortholog Conjecture”¹)
Applications in comparative genomics, functional genomics and phylogenetics.²
Over 30 databases attempt to describe orthologous relationships (http://questfororthologs.org/orthology_databases)³
¹ Chen and Zhang (2012) PLoS Comp. Biol. 8:e1002784 doi:10.1371/journal.pcbi.1002784
² Dessimoz (2011) Brief. Bioinf. 12:375-376 doi:10.1093/bib/bbr057
³ Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
Finding orthologues
Multiple methods and databases⁴,⁵,⁶
• Pairwise genome
  • RBBH (aka BBH, RBH), RSD, InParanoid, RoundUp
• Multi-genome
  • Graph-based: COG, eggNOG, OrthoDB, OrthoMCL, OMA, MultiParanoid
  • Tree-based: TreeFam, Ensembl Compara, PhylomeDB, LOFT
⁴ Kristensen et al. (2011) Brief. Bioinf. 12:379-391 doi:10.1093/bib/bbr030
⁵ Trachana et al. (2011) Bioessays 33:769-780 doi:10.1002/bies.201100062
⁶ Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
Which prediction methods work best?
Taking advantage of prokaryotic operon structure: if the outer pair of a syntenic triplet of genes are orthologous, the middle gene is also likely to be orthologous.⁷
Specifically testing reciprocal best hits (RBH).
⁷ Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100
Which prediction methods work best?
• Tested on 573 prokaryotic genomes
• 88-99% of RBH found in syntenic triplets
• Overwhelming majority of middle genes are RBH
RBH reliably finds orthologues.⁸
⁸ Wolf and Koonin (2012) Genome Biol. Evol. 4:1286-1294 doi:10.1093/gbe/evs100
Which prediction methods work best?
Four methods tested against 2,723 curated orthologues from six Saccharomycetes
• RBBH (and cRBH); RSD (and cRSD); MultiParanoid; OrthoMCL
• Rated by statistical performance metrics: sensitivity, specificity, accuracy, FDR
cRBH most accurate and specific, with lowest FDR.⁹
⁹ Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
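The four performance metrics named above are all derived from the counts of true/false positives and negatives against a curated benchmark. A minimal sketch, using hypothetical counts that are not taken from the cited study:

```python
def performance_metrics(tp, fp, tn, fn):
    """Compute standard classifier metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fdr": fp / (tp + fp),                       # false discovery rate
    }


# Hypothetical counts for one orthologue prediction method
metrics = performance_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that FDR is the complement of the P(X|+) discussed in the previous lecture: FDR = 1 − P(X|+).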
Which prediction methods work best?
Testing on literature-based benchmarks for grouping by function and correct branching of phylogeny.¹⁰
¹⁰ Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
Which prediction methods work best?
• Performance varies by choice of method, and interpretation of “orthology”
• Biggest influence is genome annotation quality
• Relative performance varies with choice of benchmark
• (clustering) RBH outperforms more complex algorithms under many circumstances
What is this magic RBH method?
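In outline, two genes form a reciprocal best hit when each is the other's top-scoring match in an all-against-all search between two genomes. The sketch below assumes the search results have already been reduced to a score per (query, subject) pair; the gene names and scores are hypothetical, and ties and score thresholds are ignored.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score,
    e.g. a BLAST bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores for genome A searched against genome B, and vice versa
a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 120,
          ("a2", "b2"): 300, ("a3", "b2"): 310}
b_vs_a = {("b1", "a1"): 495, ("b2", "a3"): 305, ("b2", "a2"): 290}

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

Here a2's best hit is b2, but b2's best hit is a3, so a2 is left without a partner.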
Functional adaptation in Pbaᵃ
ᵃ Toth et al. (2006) Ann. Rev. Phytopath. 44:305-336 doi:10.1146/annurev.phyto.44.070505.143444
Core genome
Once equivalent genes have been identified, those present in all related isolates can be identified: the core genome.
The core genome is expected to underpin common function.
A core RBH cluster (clique) for 29 genomes:
Accessory genome
The remaining genes are the accessory genome, and are expected to mediate function that distinguishes between isolates.
An accessory RBH cluster for 29 genomes:
Accessory clusters
Accessory RBH clusters can be pruned, to identify the accessory genome specific to subgroups of isolates:
These genes may be responsible for subgroup-specific phenotypes
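The core/accessory split and the subgroup pruning described above can be sketched as simple set operations. This assumes each RBH cluster is represented as a set of (isolate, gene) pairs; the isolate names, gene names and function names are all hypothetical.

```python
def genomes_in(cluster):
    """Return the set of isolates represented in a cluster."""
    return {genome for genome, _gene in cluster}


def partition_clusters(clusters, genomes):
    """Split clusters into core (present in every isolate) and accessory."""
    all_genomes = set(genomes)
    core = [c for c in clusters if genomes_in(c) == all_genomes]
    accessory = [c for c in clusters if genomes_in(c) != all_genomes]
    return core, accessory


def subgroup_specific(clusters, subgroup):
    """Keep clusters found in every member of `subgroup` and nowhere else."""
    return [c for c in clusters if genomes_in(c) == set(subgroup)]


# Hypothetical isolates and RBH clusters
genomes = ["iso1", "iso2", "iso3"]
clusters = [
    {("iso1", "g1"), ("iso2", "g7"), ("iso3", "g4")},  # in all isolates
    {("iso1", "g2"), ("iso2", "g9")},                  # iso1 and iso2 only
    {("iso3", "g5")},                                  # iso3 only
]

core, accessory = partition_clusters(clusters, genomes)
shared_by_1_and_2 = subgroup_specific(accessory, ["iso1", "iso2"])
print(len(core), "core cluster(s);", len(accessory), "accessory cluster(s)")
```

A stricter definition might also require exactly one gene per isolate in a core cluster; that is left out here for brevity.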
Accessory genome
Accessory genomes act as a cradle for adaptive evolution.¹¹
This is particularly so for pathogens, such as Pseudomonas spp.¹²
¹¹ Croll and McDonald (2012) PLoS Path. 8:e1002608 doi:10.1371/journal.ppat.1002608
¹² Baltrus et al. (2011) PLoS Path. 7:e1002132 doi:10.1371/journal.ppat.1002132.t002
Core genome synteny
Using tools like i-ADHoRe¹³ that identify synteny and collinearity, the structural organisation of the core genome can be determined:
For Dickeya, the core genome appears to be structurally well-conserved across all isolates.
¹³ Proost et al. (2012) Nuc. Acids Res. 40:e11 doi:10.1093/nar/gkr955
Panseqᵃ
ᵃ Laing et al. (2010) BMC Bioinf. 11:461 doi:10.1186/1471-2105-11-461
Panseq is an online tool for identification of core and accessory genomes, available at https://lfz.corefacility.ca/panseq/, and at https://github.com/chadlaing/Panseq for standalone use
Harvestᵃ
ᵃ Treangen et al. (2014) Genome Biol. 15:524 doi:10.1186/s13059-014-0524-x
Visualising and organising comparison/pangenome data across thousands of bacteria is difficult.
Very recently (this week), the Harvest suite of tools was published, for alignment and visualisation of thousands of genomes:
Things I didn’t get to
Conclusions
Licence: CC-BY-SA
By: Leighton Pritchard
This presentation is licensed under the Creative Commons Attribution ShareAlike license
https://creativecommons.org/licenses/by-sa/4.0/