The Science of Whole Genome Sequencing: An Overview for the Food Industry
March 17, 2020
Brought To You By:
Webinar Logistics
• Everyone is muted
• Questions will be addressed during the Q&A session at the end of the presentation
• The presentation is being recorded
• The recording/slides will be distributed following the presentation
• Adjourn (60 minutes)
• There will be 3 important survey questions at the conclusion of this webinar. Your response is appreciated.
Speaker
Dr. Jasna Kovac
Jasna Kovac, PhD
Assistant Professor
Lester Earl and Veronica Casida Career Development Professor of Food Safety
Department of Food Science
The Pennsylvania State University
Email: [email protected]
Phone: 814-865-2883
Outline
Sequencing platforms
• Important differences
A pre-sequencing workflow
• DNA extraction and library preparation
A post-sequencing workflow
• Data analyses and interpretation
Cost-benefit analysis
• Advantages and disadvantages of implementing an in-house WGS
Sequencing technologies
• Second generation: Illumina, Ion Torrent
• Third generation: PacBio, Nanopore
Second generation sequencing technologies
Short read sequencing technologies
o Illumina
o Ion Torrent
Advantages:
o Produce millions of reads
o Relatively inexpensive
o Low error rate
Disadvantages:
o Produce short reads (several hundred bp)
o These data allow for a draft genome assembly, but not a closed genome assembly (some information will be missing)
Draft vs. closed genome
[Figure: a draft genome compared with a closed (complete) genome]
Contig – a contiguous sequence
Single end and paired end sequencing
Single end sequencing
o A DNA fragment is sequenced just from one end
Paired end sequencing
o A DNA fragment is sequenced from both ends
o Adds information that helps in the assembly
Illumina
• Sequencing by synthesis
• DNA fragments are bound to a flow cell
• A fluorescently labeled dNTP is added
• The fluorophore is cleaved at the end of the cycle to allow the next dNTP to bind
• After each cycle, the fluorescent signal from incorporated labeled dNTPs is measured
• Fluorescent reversible terminator technology
• Once the signal is captured, the fluorophore is cleaved enzymatically
A patterned flow cell
Illumina
o Low error rate (~0.1%)
Pfeiffer et al., 2018: https://www.nature.com/articles/s41598-018-29325-6
Ion Torrent
• Sequencing by synthesis
• An H+ ion is released each time a dNTP is incorporated, which causes a change in the pH
• The change in pH is measured with complementary metal-oxide-semiconductor (CMOS) technology
• Generally higher error rate compared to Illumina
Third generation sequencing technologies
Long read sequencing technologies
• PacBio
• Nanopore
Advantages:
o Produce very long reads (> 10,000 bp)
o Allow for assembly of closed genomes
Disadvantages:
o More costly
o Have a higher error rate (but the errors are more random)
PacBio
• PacBio single-molecule real-time (SMRT) sequencing
• DNA synthesis occurs in arrays of microfabricated nanostructures called zero-mode waveguides (ZMWs) (holes in a metallic film)
• Light passes through a very small opening, illuminating only the very bottom of the wells
• Visualization of a single molecule
• Synthesis by a single polymerase per well
• Once the signal is read, the fluorophore is cleaved and the well is washed with the next labeled dNTP
PacBio
• Average read length is ~10,000 bp
• Some reads are much longer (~60K bp)
• Single long DNA molecule is read multiple times
• The sequencing error is higher compared to Illumina, but it is random and can be substantially reduced with greater coverage depth
Nanopore
• Single long DNA molecule is read
• DNA passes through a pore inserted in an electrically resistant membrane
• Potential applied on the membrane results in current flowing through the membrane opening
• DNA passing through the membrane causes disruption in the current, which allows for sequencing
• Average read length is over 10,000 bp
• Some reads are over 100K bp
Nanopore MinION
Application in food industry:
o Rapid species ID
o Rapid serotyping etc.
A pre-sequencing workflow
1. DNA extraction
2. Library preparation
3. Library pooling
4. Flow cell loading
A pre-sequencing workflow: 1. DNA extraction
A pre-sequencing workflow: 2. Library preparation
DNA fragmentation
• In the most commonly used Illumina library preparation workflows, DNA is fragmented enzymatically with transposomes.
• Transposomes add a known stretch of DNA, which is used in the next step (PCR) of the library preparation.
A pre-sequencing workflow: 2. Library preparation
Indexing (after DNA fragmentation)
• P5, P7: flow cell oligos
• Addition of indices: all DNA fragments of a given isolate get the same index
A pre-sequencing workflow: 3. Library pooling
• Sequencing coverage depth
o The number of reads that cover a given position in a sequence
[Figure: reads aligned to a reference genome, with depth of coverage on the vertical axis; one region has 6x coverage and another has 2x coverage]
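The definition of coverage depth above can be illustrated with a minimal sketch that counts how many reads cover each position of a reference. The read coordinates are made-up values and the function name `per_base_depth` is hypothetical; real pipelines compute depth from read alignments produced by a mapper.

```python
def per_base_depth(reference_length, read_intervals):
    """Count how many reads cover each position of a reference.

    read_intervals holds (start, end) pairs: 0-based, end exclusive.
    """
    depth = [0] * reference_length
    for start, end in read_intervals:
        # Clip each read to the reference boundaries before counting.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return depth


# Hypothetical reads mapped to a 10 bp reference (illustrative values only).
reads = [(0, 5), (0, 5), (2, 8), (4, 10)]
depth = per_base_depth(10, reads)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

The per-position list shows that depth is rarely uniform, which is why both the mean and the minimum depth are worth checking.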
A pre-sequencing workflow: 3. Library pooling
• Indexed (barcoded) libraries need to be pooled in equimolar concentrations
• Sequencing coverage depth is highly sensitive to accurate quantification
• If one library is loaded at a higher concentration than the others, it will be sequenced at a greater depth than the others
[Figure: 20 indexed libraries (Library 1 to Library 20) combined into one pool]
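Equimolar pooling comes down to simple arithmetic: each library contributes the same molar amount, so a more concentrated library needs a smaller volume. The sketch below assumes library concentrations have already been measured in nM; the concentrations, the target amount, and the function name `equimolar_volumes` are illustrative assumptions, not values from the presentation.

```python
def equimolar_volumes(concentrations_nm, target_fmol_per_library):
    """Return the volume (in microliters) of each library needed so that
    every library contributes the same molar amount to the pool.

    1 nM equals 1 fmol per microliter, so volume = amount / concentration.
    """
    return {
        name: target_fmol_per_library / conc
        for name, conc in concentrations_nm.items()
    }


# Hypothetical library concentrations in nM (illustrative values only).
libraries = {"Library 1": 4.0, "Library 2": 2.0, "Library 3": 8.0}
volumes = equimolar_volumes(libraries, target_fmol_per_library=20.0)
for name, volume in volumes.items():
    print(f"{name}: {volume:.1f} uL")
```

An error in a measured concentration translates directly into a proportional error in that library's share of the reads, which is why accurate quantification matters.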
A pre-sequencing workflow: 4. Flow cell loading
o Illumina example
Bridge amplification
Illumina sequencing
Pre-sequencing and sequencing cost
• Pre-sequencing costs
o Approximately the same cost of materials per library regardless of the number of libraries
o The labor cost stays roughly the same for 1 or 96 libraries, making WGS of a single isolate costly
• Sequencing costs depend on the number of libraries (e.g., isolates)
o Constant cost per sequencing run: the more isolates are sequenced, the lower the cost per isolate
o The limiting factor is the desired coverage depth
• The number of samples you expect to sequence per year is an important factor to consider when doing cost-benefit analysis and deciding whether to implement an in-house WGS
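The cost structure described above can be sketched as a fixed part (one sequencing run plus labor) shared by all isolates, and a per-library materials cost. All dollar figures below are hypothetical placeholders chosen only to show the shape of the curve; substitute your own quotes.

```python
def cost_per_isolate(n_isolates, run_cost, labor_cost, materials_per_library):
    """Per-isolate cost when the run and labor costs are shared by all
    isolates on the run and materials are paid once per library."""
    if n_isolates < 1:
        raise ValueError("n_isolates must be at least 1")
    return (run_cost + labor_cost) / n_isolates + materials_per_library


# Hypothetical prices (assumptions, not figures from the presentation).
RUN_COST = 1200.0
LABOR_COST = 240.0
MATERIALS_PER_LIBRARY = 40.0

for n in (1, 24, 96):
    cost = cost_per_isolate(n, RUN_COST, LABOR_COST, MATERIALS_PER_LIBRARY)
    print(f"{n:>2} isolates: ${cost:.2f} per isolate")
```

In practice the number of isolates per run is capped by the coverage depth each isolate needs, so the per-isolate cost cannot be driven down indefinitely.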
A post-sequencing workflow
1. Demultiplexing
2. Read quality control
3. Read trimming
4. Genome assembly
5a. Detection of genes of interest
5b. Genotyping/serotyping
5c. wg/cgMLST
6. SNP detection
A post-sequencing workflow: 1. Demultiplexing
o Using index information to assign sequences to their isolates after the sequencing is completed
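Conceptually, demultiplexing is a lookup from each read's index sequence to the isolate that index was assigned to. The sketch below uses made-up index sequences and isolate names, and requires exact index matches; production demultiplexers typically also tolerate a small number of index mismatches.

```python
from collections import defaultdict


def demultiplex(reads, index_to_isolate):
    """Group read sequences by isolate using each read's index (barcode).

    reads is an iterable of (index_sequence, read_sequence) pairs.
    Reads whose index is not in the lookup go to "undetermined".
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        isolate = index_to_isolate.get(index_seq, "undetermined")
        bins[isolate].append(read_seq)
    return dict(bins)


# Hypothetical indices and reads (illustrative values only).
sample_sheet = {"ACGT": "isolate_A", "TTAG": "isolate_B"}
reads = [
    ("ACGT", "GATTACA"),
    ("TTAG", "CCGGTTA"),
    ("ACGT", "TTGACCA"),
    ("GGGG", "AAAAAAA"),  # index not in the sample sheet
]
binned = demultiplex(reads, sample_sheet)
print(binned)
```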
A post-sequencing workflow: 2. Read quality control
Data from Penn State Genomics Core Facility, directed by Craig Praul: [email protected]
o The first quality check includes examining the confidence with which individual base calls are made
o The image on the left shows an average quality of all reads for a given library
o Phred (quality) score:
  o 10: 1 in 10 bases is called incorrectly
  o 20: 1 in 100 bases is called incorrectly
  o 30: 1 in 1,000 bases is called incorrectly
[Figure: Phred (quality) score, averaged over all reads (vertical axis), by base position in a sequenced read (horizontal axis)]
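The Phred values listed above follow from the relationship Q = -10 * log10(P), where P is the probability that a base call is wrong. The sketch below converts scores to error probabilities and averages the scores encoded in a FASTQ quality string; it assumes the common Phred+33 ASCII encoding, and the example quality string is made up.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def mean_read_quality(quality_string, offset=33):
    """Average Phred score of one read from its FASTQ quality string.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    if not quality_string:
        raise ValueError("quality_string must not be empty")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))

# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
example_mean = mean_read_quality("II55")
print(example_mean)
```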
A post-sequencing workflow: 3. Read trimming
o Removing sequences that do not carry biological information
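One common kind of sequence without biological information is adapter read-through at the 3' end of a read. The following is a minimal sketch of exact-match adapter trimming; the adapter string is a placeholder, not a specific kit's adapter, and dedicated trimming tools additionally handle mismatches and low-quality bases.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an adapter (and anything after it) from the 3' end of a read.

    Handles a full adapter anywhere in the read, or a partial adapter of at
    least min_overlap bases at the very end. Exact matches only.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Look for the longest adapter prefix that the read ends with.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGG"  # placeholder adapter sequence (assumption)
print(trim_adapter("GATTACAAGATCGGTT", ADAPTER))  # full adapter inside the read
print(trim_adapter("GATTACAAGAT", ADAPTER))       # partial adapter at the end
print(trim_adapter("GATTACA", ADAPTER))           # no adapter present
```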
A post-sequencing workflow: 4. Genome assembly
• Reference-based
o More accurate
o Some information is lost
• De novo (without any prior information)
o All data are retained
o Allows for the discovery of genes or other genetic features that are not present in the reference genome
• Paired end reads can improve assembly
A post-sequencing workflow: 5a. Detection of genes of interest
• Antimicrobial resistance genes
• Virulence genes
• Genes associated with specific phenotypes of interest (e.g., heat tolerance)
• Things to consider:
o How good a match does a sequence need to be with a reference to be identified as that reference?
o % sequence length coverage (minimum usually 50%-60%)
o % sequence similarity (minimum usually 75%-90%)
o Depends also on whether translated (amino acid) sequences are used or not
o Detecting a fraction of a virulence or AMR gene does not mean the gene is functional!
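The two thresholds above act as a simple filter on search hits. The sketch below applies a minimum coverage and a minimum similarity to a list of hits; the gene names and hit values are hypothetical, and the default cutoffs are just one choice taken from within the ranges quoted above.

```python
def passes_thresholds(hit, min_coverage=60.0, min_identity=90.0):
    """True if a hit meets both the % length coverage and % similarity cutoffs."""
    return hit["coverage"] >= min_coverage and hit["identity"] >= min_identity


# Hypothetical search hits (illustrative values only).
hits = [
    {"gene": "geneA", "coverage": 100.0, "identity": 99.5},
    {"gene": "geneB", "coverage": 45.0, "identity": 98.0},  # only a fragment found
    {"gene": "geneC", "coverage": 95.0, "identity": 70.0},  # too divergent
]
reported = [hit["gene"] for hit in hits if passes_thresholds(hit)]
print(reported)
```

Passing the filter only indicates presence of a matching sequence; it says nothing about whether the gene is expressed or functional.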
A post-sequencing workflow: 5a. Detection of genes of interest
• Antimicrobial resistance genes
• Available databases:
o NCBI
o CARD
o ARG-ANNOT
o ResFinder
o MEGARes
o ARDB
o PointFinder
o PlasmidFinder
• Available programs:
o ARIBA, ResFinder, RGI, SRST2, AMRFinderPlus, ABRicate, staramr, BTyper, …
A post-sequencing workflow: 5a. Detection of genes of interest
• Virulence genes
• Programs:
o VirulenceFinder (for Listeria, S. aureus, E. coli, Enterococcus)
o BTyper (using a custom Bacillus cereus group virulence factors database)
o ABRicate (using Ecoli_VF and VFDB databases)
A post-sequencing workflow: 5a. Detection of genes of interest
• Genes associated with specific phenotypes of interest (e.g., heat tolerance)
o Requires a subject matter expert to identify genes that are associated with phenotypes of interest
o Once genes of interest are known, they can be added to any of the existing databases if used off-line
A post-sequencing workflow: 5b. Genotyping/serotyping
SEROTYPING
Different serotyping programs are available:
o Salmonella: SISTR, SeqSero2, MOST
  o SISTR performs best (94% accuracy) according to the latest benchmarking (Uelze et al., 2020, https://aem.asm.org/content/86/5/e02265-19#sec-9)
o E. coli: ectyper (https://github.com/phac-nml/ecoli_serotyping)
o Listeria: LisSero (https://github.com/MDU-PHL/LisSero/blob/master/README.md)
A post-sequencing workflow: 5b. Genotyping/serotyping
GENOTYPING
Single locus genotyping
• rpoB (for identification of sporeformers)
• sigB (can help identify Listeria species, even L. monocytogenes lineage)
• …
Liao et al., 2017 (AEM), https://aem.asm.org/content/83/12/e00306-17/figures-only
A post-sequencing workflow: 5b. Genotyping/serotyping
GENOTYPING
• Multi-locus sequence typing (MLST): how does it work?
o Typically, we compare 7 genes (loci)
o If the examined locus is different from everything that is in the database, it gets a new allele type (AT) assigned
o Each unique combination of 7 locus ATs gets a unique sequence type (ST) number
o Closely related STs are grouped in clonal complexes (CCs)
o ST numbers are easy to compare
o However, the ST number alone does not tell one how many differences there are between two STs; it only tells that two isolates with different STs are different
[Figure: Locus A to Locus G; an AT is assigned to each locus; an ST is assigned to each combination of ATs; closely related STs are grouped into a CC]
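The AT and ST assignment logic above can be sketched with two lookup tables. This is a toy model under stated assumptions: the locus sequences are made up, and the numbering is local to the script, whereas real AT and ST numbers are issued by curated MLST databases. The helper `allele_differences` is hypothetical and is included to show the information that the ST number alone does not carry.

```python
def assign_number(key, table):
    """Return the number stored for key, issuing the next free number if new."""
    if key not in table:
        table[key] = len(table) + 1
    return table[key]


def type_isolate(locus_sequences, allele_tables, st_table):
    """Assign an AT to each locus, then an ST to the combination of ATs."""
    profile = tuple(
        assign_number(seq, allele_tables[i])
        for i, seq in enumerate(locus_sequences)
    )
    return profile, assign_number(profile, st_table)


def allele_differences(profile_a, profile_b):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Toy 7-locus data (made-up sequences, loci A to G).
allele_tables = [dict() for _ in range(7)]
st_table = {}
isolate_1 = ["AAA", "CCC", "GGG", "TTT", "ACG", "TGC", "GAT"]
isolate_2 = ["AAA", "CCC", "GGG", "TTT", "ACG", "TGC", "GAA"]  # new allele at locus G
isolate_3 = list(isolate_1)                                   # identical to isolate 1

profile_1, st_1 = type_isolate(isolate_1, allele_tables, st_table)
profile_2, st_2 = type_isolate(isolate_2, allele_tables, st_table)
profile_3, st_3 = type_isolate(isolate_3, allele_tables, st_table)
print(st_1, st_2, st_3, allele_differences(profile_1, profile_2))
```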
A post-sequencing workflow: 5b. Genotyping/serotyping
MLST: example of three different STs
o Two isolates with different ST numbers may be very distantly related (e.g., ST 100 and ST 2), OR
o They may be closely related (e.g., ST 100 and ST 453)
[Figure: ST 100 (reference), ST 2 vs. ST 100, and ST 453 vs. ST 100]
A post-sequencing workflow: 5b. Genotyping/serotyping
Multi-locus sequence typing databases:
• For most species: https://pubmlst.org/databases/
• Listeria: https://bigsdb.pasteur.fr/listeria/
A post-sequencing workflow: 5c. wg/cgMLST
• Comparing a thousand or more genes gives more detailed information than comparing just 7 genes
o Whole genome MLST
o Core genome MLST
o Accessory genome MLST
Maiden et al., 2013 (Nat Rev Microbiol), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3980634/
A group of organisms that are descended from a single cell but have started to diversify by recombination
A post-sequencing workflow: 5c. wg/cgMLST
o Whole genome MLST
o Core genome MLST
o Accessory genome MLST
[Figure: three overlapping genomes (Genome 1, Genome 2, Genome 3), labeled with the whole genome of genome 2, the core genome of genomes 1, 2, and 3, and the pangenome of genomes 1, 2, and 3]
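If each genome is treated as a set of genes, the terms in the figure map directly onto set operations: the core genome is the intersection, the pangenome is the union, and the accessory genome is what remains of the pangenome after removing the core. The gene names below are placeholders.

```python
# Hypothetical gene content of three genomes (placeholder gene names).
genomes = {
    "Genome 1": {"geneA", "geneB", "geneC", "geneD"},
    "Genome 2": {"geneA", "geneB", "geneC", "geneE"},
    "Genome 3": {"geneA", "geneB", "geneF"},
}

core_genome = set.intersection(*genomes.values())  # genes present in every genome
pangenome = set.union(*genomes.values())           # genes present in any genome
accessory_genome = pangenome - core_genome         # genes missing from at least one

print(sorted(core_genome))
print(sorted(pangenome))
print(sorted(accessory_genome))
```

Adding more genomes can only shrink the core genome or leave it unchanged, which is one reason a cgMLST scheme is defined for a specific species or group.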
A post-sequencing workflow: 6. SNP detection
• Single nucleotide polymorphism (SNP)
• Insertions and deletions (indels)
A post-sequencing workflow: 6. SNP detection
• SNP detection: things to consider
o Reference-free (less accurate) or reference-based (more accurate)
o If reference-based, what reference to choose?
o What pipeline to choose?
o How conservative to be (e.g., what SNPs should be excluded)?
A post-sequencing workflow: 6. SNP detection
• SNP detection: things to consider
o Reference-free (less accurate) or reference-based (more accurate)
Reference-free:
o Can be done using assembled genomes (raw reads are not needed)
o Genomes are fragmented into short sequences (called k-mers), and these fragments are compared among isolates
Reference-based:
o Reads are aligned (mapped) to a reference genome
[Figure: SNP illustration; image source: https://www.baseclear.com]
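The k-mer idea above can be shown in a few lines: break each sequence into all overlapping substrings of length k, then compare the resulting sets. The sketch scores the comparison with a Jaccard index, which is a choice made here for illustration and not the method of any particular reference-free SNP tool; the sequences are made up.

```python
def kmers(sequence, k):
    """Set of all overlapping substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(seq_a, seq_b, k):
    """Jaccard index of two k-mer sets: shared k-mers / all distinct k-mers."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


# Two made-up sequences that differ at a single position.
seq_1 = "ACGTACGT"
seq_2 = "ACGTTCGT"
print(sorted(kmers(seq_1, 4)))
print(shared_kmer_fraction(seq_1, seq_2, 4))
```

A single base difference changes every k-mer that spans it, so the choice of k affects how sensitive the comparison is.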
A post-sequencing workflow: 6. SNP detection
• SNP detection: things to consider
o If reference-based, what reference to choose?
o Rule #1: the reference needs to be closely related, since only core SNPs are ultimately used
o How to identify a closely related reference?
[Figure: two comparisons of Genome 1 and Genome 2 with a reference genome, showing the core genome they share]
o Larger core genome: sequences are compared over a greater length, which improves the chances of detecting core SNPs
o Shorter core genome: fewer detected core SNPs, but that does not mean that the genomes are closely related!
A post-sequencing workflow: 6. SNP detection
• SNP detection: things to consider
o What pipeline to choose?
o There are many options; the following are used by government agencies:
  o FDA CFSAN pipeline: Davis et al., 2015, https://peerj.com/articles/cs-20/
  o CDC LyveSET pipeline: Katz et al., 2017, https://www.frontiersin.org/articles/10.3389/fmicb.2017.00375/full
o Both of these pipelines are conservative (e.g., they remove SNPs that are likely due to errors, horizontal gene transfer, or recombination)
SNP vs. cgMLST analysis
• Results in a comparable outcome
• What is the maximum SNP or allele difference among epidemiologically linked isolates?
• It depends on:
o How clonal the species (or serotype) is
o Whether a strain multiplies (and evolves) in a host or a food product throughout the duration of an outbreak
• Epidemiological data are needed to support information derived from comparative genomics
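Pairwise SNP differences, the quantity discussed above, can be counted directly once core sequences are aligned. The sketch below assumes equal-length aligned sequences and ignores positions with gaps or ambiguous bases; the isolate names and sequences are made up, and no relatedness cutoff is implied.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases.

    Positions with gaps or ambiguous characters in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"
    )


# Made-up core genome alignment of three isolates.
alignment = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGTACGTAT",
    "isolate_3": "ACCTACGAAT",
}
distances = {
    (a, b): snp_distance(alignment[a], alignment[b])
    for a, b in combinations(alignment, 2)
}
for pair, distance in distances.items():
    print(pair, distance)
```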
Cost-benefit analysis
Things to consider:
1. Sequencing instrument cost
2. Reagents and supplies costs
3. Labor cost
4. Personnel wet lab and bioinformatics expertise
Cost-benefit analysis
To establish an in-house WGS or to outsource library preparation, sequencing, and/or data analyses?
• Outsourcing the library preparation and sequencing costs ~$100-$1,500/sample (depending on how many isolates you would like to sequence and the service provider)
• In-house WGS can reduce the reagents and supplies costs down to ~$50/sample (excluding the instrument and labor costs)
• Advantage of in-house WGS: building internal expertise, having full control over the process
• Disadvantage of in-house WGS: absorbing the costs of failed experiments
• The number of samples you expect to sequence per year is an important factor to consider when doing cost-benefit analysis and deciding whether to implement an in-house WGS
Cost-benefit analysis
To establish an in-house WGS or to outsource library preparation, sequencing, and/or data analyses?
• Outsourcing the library preparation and sequencing costs ~$150-$1,500/sample (depending on how many isolates you have, what coverage you need, and the service provider)
• In-house WGS can reduce the reagents and supplies costs down to ~$50/sample (excluding the instrument and labor costs)
• Advantage of in-house WGS: building internal expertise, having full control over the process
• Disadvantage of in-house WGS: absorbing the costs of failed experiments
• The number of samples you expect to sequence per year is an important factor to consider when doing cost-benefit analysis and deciding whether to implement an in-house WGS
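The dependence on yearly sample numbers can be made concrete with a break-even estimate. The sketch uses the ~$150/sample outsourcing price and ~$50/sample in-house reagent cost quoted above; the fixed annual cost (instrument and labor) is a hypothetical placeholder, and the costs of failed experiments are not modeled.

```python
import math


def annual_cost_outsourced(n_samples, price_per_sample):
    """Yearly cost when every sample is sent to a service provider."""
    return n_samples * price_per_sample


def annual_cost_in_house(n_samples, reagents_per_sample, fixed_annual_cost):
    """Yearly cost of in-house WGS: fixed costs plus per-sample reagents."""
    return fixed_annual_cost + n_samples * reagents_per_sample


def break_even_samples(price_per_sample, reagents_per_sample, fixed_annual_cost):
    """Smallest yearly sample count at which in-house is no more expensive.

    Returns None if in-house reagents cost as much as outsourcing or more.
    """
    saving = price_per_sample - reagents_per_sample
    if saving <= 0:
        return None
    return math.ceil(fixed_annual_cost / saving)


OUTSOURCED_PRICE = 150.0     # low end of the quoted outsourcing range
IN_HOUSE_REAGENTS = 50.0     # quoted in-house reagents and supplies cost
FIXED_ANNUAL_COST = 40000.0  # hypothetical instrument + labor cost per year

n_break_even = break_even_samples(OUTSOURCED_PRICE, IN_HOUSE_REAGENTS, FIXED_ANNUAL_COST)
print(n_break_even)
```

With a higher outsourcing price the break-even point moves to fewer samples, so the estimate should be rerun with actual quotes.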
Cost-benefit analysis
One WGS test or multiple phenotypic tests?
o Species ID
o Serotyping
o Antibiotic resistance testing
o Genotyping …
o Each of these phenotypic tests can cost at least $50
o In silico (via WGS) species ID, serotyping, AMR detection, and genotyping can be more cost-effective, IF a confident relationship between sequences and phenotypes is established.
Images used in the presentation were sourced from the following websites: Biocompare, Illumina, PacBio, Nanopore, Ion Torrent, Mlst.net, Baseclear.com, cdn3.vectorstock, and publications cited in slides.
Q & A
If your question wasn’t answered…
Please contact Scott Nichols at [email protected] or one of the trade organization representatives and we would be
happy to respond.
Thank you.