The Science of Whole Genome Sequencing: An Overview for the Food Industry

March 17, 2020

Brought To You By:

Webinar Logistics

• Everyone is muted
• Questions will be addressed during the Q&A session at the end of the presentation
• The presentation is being recorded
• The recording/slides will be distributed following the presentation
• Adjourn (60 minutes)
• There will be 3 important survey questions at the conclusion of this webinar. Your response is appreciated.

Speaker

Dr. Jasna Kovac

Jasna Kovac, PhD
Assistant Professor

Lester Earl and Veronica Casida Career Development Professor of Food Safety

Department of Food Science

The Pennsylvania State University

Email: jzk303@psu.edu
Phone: 814-865-2883

Outline

Sequencing platforms
• Important differences

A pre-sequencing workflow
• DNA extraction and library preparation

A post-sequencing workflow
• Data analyses and interpretation

Cost-benefit analysis
• Advantages and disadvantages of implementing an in-house WGS

Sequencing technologies

Second generation
• Illumina
• Ion Torrent

Third generation
• PacBio
• Nanopore

Second generation sequencing technologies

Short read sequencing technologies
o Illumina
o Ion Torrent

Advantages:
o Produce millions of reads
o Relatively inexpensive
o Low error rate

Disadvantages:
o Produce short reads (several hundred bp)
o These data allow for a draft genome assembly, but not a closed genome assembly (some information will be missing)

Draft vs. closed genome

[Figure: draft genome vs. closed (complete) genome]

Contig – a contiguous sequence

Single end and paired end sequencing

Single end sequencing
o A DNA fragment is sequenced from one end only

Paired end sequencing
o A DNA fragment is sequenced from both ends
o Adds information that helps in the assembly

Illumina
• Sequencing by synthesis
• DNA fragments are bound to a flow cell
• A fluorescently labeled dNTP is added
• A fluorophore is cleaved at the end of the cycle to allow for the next dNTP to bind
• After each cycle, the fluorescent signal from incorporated labeled dNTPs is measured
• Fluorescent reversible terminator technology
• Upon signal capturing, the fluorophore is cleaved enzymatically

A patterned flow cell

Illumina
o Low error rate (~0.1%)

Pfeiffer et al., 2018: https://www.nature.com/articles/s41598-018-29325-6

Ion Torrent
• Sequencing by synthesis
• An H+ ion is released when a dNTP is incorporated, which causes a change in the pH
• The change in pH is measured by complementary metal-oxide-semiconductor (CMOS) technology
• Generally higher error rate compared to Illumina

Third generation sequencing technologies

Long read sequencing technologies
• PacBio
• Nanopore

Advantages:
o Produce very long reads (> 10,000 bp)
o Allow for assembly of closed genomes

Disadvantages:
o More costly
o Have a higher error rate (but the errors are more random)

PacBio
• PacBio single-molecule real-time sequencing (SMRT)
• DNA synthesis occurs in arrays of microfabricated nanostructures called zero-mode waveguides (ZMWs) (holes in a metallic film)
• Light passes through a very small opening, illuminating only the very bottom of wells
• Visualization of a single molecule
• Synthesis by single polymerase/well
• Once the signal is read, the fluorophore is cleaved and the well washed with the next labeled dNTP

PacBio
• Average read length is ~10,000 bp

• Some reads are much longer (~60K bp)

• Single long DNA molecule is read multiple times

• The sequencing error is higher compared to Illumina, but it is random and can be substantially reduced with greater coverage depth
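Why greater coverage depth reduces random errors can be illustrated with a toy majority vote. This is a minimal sketch, assuming the reads are already aligned and of equal length; the function name and the example reads are made up for illustration and do not represent how PacBio software builds its consensus.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the most common base at each position of pre-aligned, equal-length reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    # zip(*reads) yields one column (one position across all reads) at a time
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Hypothetical reads of the same molecule, each with at most one random error
reads = ["ACGTTAGC", "ACGTAAGC", "ACCTTAGC", "ACGTTAGC", "ACGTTTGC"]
print(majority_consensus(reads))
```

Because each random error lands at a different position, it is outvoted by the other reads; a systematic error that repeats at the same position would not be corrected this way.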

Nanopore
• Single long DNA molecule is read
• DNA passes through a pore inserted in an electrically resistant membrane
• Potential applied on the membrane results in current flowing through the membrane opening
• DNA passing through the membrane causes disruption in the current, which allows for sequencing
• Average read length is over 10,000 bp
• Some reads are over 100K bp

Nanopore MinION
Application in food industry:
o Rapid species ID
o Rapid serotyping etc.

A pre-sequencing workflow

1. DNA extraction
2. Library preparation
3. Library pooling
4. Flow cell loading

A pre-sequencing workflow
1. DNA extraction

A pre-sequencing workflow
2. Library preparation: DNA fragmentation

In the most commonly used Illumina library preparation workflows, DNA is fragmented enzymatically with transposomes.

Transposomes add a known stretch of DNA, which is utilized in the next step (PCR) of the library preparation.

A pre-sequencing workflow
2. Library preparation: DNA fragmentation and indexing

• P5, P7 – flow cell oligos
• Addition of indices: all DNA fragments of a given isolate get the same index

A pre-sequencing workflow
3. Library pooling

• Sequencing coverage depth
  o The number of reads that cover a given position in a sequence

[Figure: reads aligned to a reference genome; y-axis: depth of coverage; example regions with 6x and 2x coverage]
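Average coverage depth can be estimated before a run as total sequenced bases divided by genome size. The sketch below assumes this simple approximation; the function names and all numbers (read count, read length, genome size, run output) are hypothetical and only illustrate the arithmetic.

```python
def expected_depth(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average coverage depth = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


def max_isolates_per_run(run_output_bp: int, genome_size_bp: int, target_depth: int) -> int:
    """How many genomes of a given size fit in one run at the desired average depth."""
    return run_output_bp // (genome_size_bp * target_depth)


# Hypothetical example: 1 million reads of 250 bp for a 5 Mbp bacterial genome
print(expected_depth(1_000_000, 250, 5_000_000))          # 50.0

# Hypothetical example: a run yielding 7.5 Gbp, 5 Mbp genomes, 50x target depth
print(max_isolates_per_run(7_500_000_000, 5_000_000, 50))  # 30
```

Actual depth varies along the genome, as the 6x and 2x regions in the figure show, so the average is a planning number rather than a guarantee for every position.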

A pre-sequencing workflow
3. Library pooling

• Indexed (barcoded) libraries need to be pooled in equimolar concentrations
• Sequencing coverage depth is highly sensitive to accurate quantification
• If one library is loaded at a higher concentration than the others, it will be sequenced at a greater depth than the others

[Figure: Libraries 1–20 pooled together]
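Equimolar pooling can be sketched as a small calculation. This is an illustrative example, not a protocol: it assumes double-stranded DNA at roughly 660 g/mol per base pair, and the library names, concentrations, fragment lengths, and the 20 fmol target are hypothetical.

```python
def to_nanomolar(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration from ng/uL to nM (assumes ~660 g/mol per bp)."""
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def pooling_volumes(libraries: dict, fmol_per_library: float) -> dict:
    """Volume (uL) of each library that contributes the same number of molecules.

    `libraries` maps a library name to (concentration in ng/uL, mean fragment length in bp).
    1 nM equals 1 fmol/uL, so volume = fmol wanted / concentration in nM.
    """
    return {
        name: fmol_per_library / to_nanomolar(conc, length)
        for name, (conc, length) in libraries.items()
    }


# Hypothetical quantification results for two libraries
libraries = {"Library 1": (3.3, 500), "Library 2": (6.6, 500)}
for name, volume in pooling_volumes(libraries, fmol_per_library=20).items():
    print(f"{name}: {volume:.2f} uL")
```

Library 2 is twice as concentrated, so half the volume is pooled; an error in either concentration or fragment length shifts that library's share of the reads.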

A pre-sequencing workflow
4. Flow cell loading

o Illumina example

[Figure: bridge amplification; Illumina sequencing]

Pre-sequencing and sequencing cost
• Pre-sequencing costs
  o Approximately the same cost of materials per library regardless of the number of libraries
  o The labor cost stays roughly the same for 1 or 96 libraries, making WGS of a single isolate costly
• Sequencing costs depend on the number of libraries (e.g., isolates)
  o Constant cost per sequencing run – the more isolates are sequenced, the lower the cost per isolate
  o Limiting factor is the desired coverage depth
• The number of samples you expect to sequence per year is an important factor to consider when doing cost-benefit analysis and deciding whether to implement an in-house WGS
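The effect of a constant run cost on the cost per isolate can be written as a one-line formula. The sketch below is only an illustration; the run cost and per-library cost are placeholder values, not quotes from any vendor, and labor and instrument costs are left out.

```python
def cost_per_isolate(run_cost: float, library_cost: float, n_isolates: int) -> float:
    """Per-isolate cost = per-library materials + the fixed run cost shared by all isolates."""
    if n_isolates < 1:
        raise ValueError("at least one isolate is required")
    return library_cost + run_cost / n_isolates


# Placeholder costs: $1,000 per sequencing run, $50 of materials per library
for n in (1, 5, 20):
    print(n, cost_per_isolate(run_cost=1000.0, library_cost=50.0, n_isolates=n))
```

With these placeholder values the per-isolate cost falls from $1,050 for a single isolate to $100 for 20 isolates, until the desired coverage depth caps how many isolates fit in one run.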

A post-sequencing workflow

1. Demultiplexing
2. Read quality control
3. Read trimming
4. Genome assembly
   4a. Detection of genes of interest
   4b. Genotyping/serotyping
   4c. wg/cgMLST
5. SNP detection

A post-sequencing workflow
1. Demultiplexing

o Using index information to assign different sequences to different isolates after the sequencing is completed
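Demultiplexing can be pictured as sorting reads into bins by their index. This is a minimal sketch with exact index matching only; the index sequences, isolate names, and reads are hypothetical, and production demultiplexers typically also tolerate a small number of index mismatches.

```python
from collections import defaultdict


def demultiplex(reads, index_to_isolate):
    """Group reads by isolate using the index sequence attached to each read.

    `reads` is an iterable of (index_sequence, read_sequence) pairs.
    Reads whose index is not in the sample sheet go to the "undetermined" bin.
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        isolate = index_to_isolate.get(index_seq, "undetermined")
        bins[isolate].append(read_seq)
    return dict(bins)


# Hypothetical sample sheet and reads
index_to_isolate = {"ATCACG": "isolate_A", "CGATGT": "isolate_B"}
reads = [
    ("ATCACG", "ACGTACGT"),
    ("CGATGT", "TTGACCAA"),
    ("ATCACG", "GGCATTCA"),
    ("NNNNNN", "ACGTTTTT"),
]
for isolate, seqs in demultiplex(reads, index_to_isolate).items():
    print(isolate, seqs)
```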

A post-sequencing workflow
2. Read quality control

Data from Penn State Genomics Core Facility, directed by Craig Praul: cap142@psu.edu

o The first quality check includes examining the confidence with which individual base calls are made
o The image on the left shows the average quality of all reads for a given library
o Phred (quality) score:
  o 10 – 1 in 10 bases is called incorrectly
  o 20 – 1 in 100 bases is called incorrectly
  o 30 – 1 in 1000 bases is called incorrectly

[Figure: y-axis: Phred (quality) score, average for all reads; x-axis: base position in a sequenced read]
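The Phred values listed above follow the standard relationship Q = -10 · log10(P), where P is the probability that a base call is wrong. The sketch below shows that conversion; the helper names are made up, and the decoding function assumes the common Phred+33 ASCII encoding of FASTQ quality strings.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call is wrong for a given Phred score."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for a given probability of an incorrect base call."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33 encoding)."""
    return [ord(char) - offset for char in quality]


for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))

print(decode_quality_string("I?5+"))  # [40, 30, 20, 10]
```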

A post-sequencing workflow
3. Read trimming

o Removing sequences that do not carry biological information
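Read trimming can be sketched in two steps: cut the read where an adapter begins, then drop a low-quality tail. This is a simplified illustration with exact adapter matching; the adapter string, the quality cutoff of 20, and the example read are placeholders, and dedicated trimming tools handle partial and mismatched adapters as well.

```python
def trim_read(seq: str, quals: list, adapter: str, min_quality: int = 20):
    """Remove the adapter (and everything after it), then trim low-quality bases from the 3' end."""
    # Step 1: cut at the first exact occurrence of the adapter, if present
    pos = seq.find(adapter)
    if pos != -1:
        seq, quals = seq[:pos], quals[:pos]
    # Step 2: walk back from the 3' end while the base quality is below the cutoff
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


# Hypothetical read: 8 genomic bases followed by a placeholder adapter sequence
seq = "ACGTACGT" + "AGATCGG"
quals = [30, 30, 30, 30, 30, 30, 12, 8] + [30] * 7
print(trim_read(seq, quals, adapter="AGATCGG"))
```

In this example the adapter is removed first, and the last two genomic bases are then dropped because their quality scores (12 and 8) fall below the cutoff.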

A post-sequencing workflow
4. Genome assembly

• Reference-based
  o More accurate
  o Some information is lost
• De novo (without any prior information)
  o All data is retained
  o Allows for the discovery of genes or other genetic features that are not present in the reference genome
  o Paired end reads can improve assembly

A post-sequencing workflow
5a. Detection of genes of interest

• Antimicrobial resistance genes
• Virulence genes
• Genes associated with specific phenotypes of interest (e.g., heat tolerance)
• Things to consider:
  o How good a match does a sequence need to be with a reference to be identified as that reference gene?
  o % sequence length coverage (minimum usually 50% - 60%)
  o % sequence similarity (minimum usually 75% - 90%)
  o Depends also on whether translated (amino acid) sequences are used or not
  o Detecting a fraction of a virulence or AMR gene does not mean the gene is functional!
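The two thresholds above can be expressed as a simple filter on an alignment hit. This sketch assumes the alignment has already been computed by another tool; the function name is made up, and the default cutoffs (60% coverage, 90% identity) are taken from the upper ends of the ranges on the slide rather than from any particular program.

```python
def passes_thresholds(
    aligned_length: int,
    identical_positions: int,
    reference_length: int,
    min_coverage: float = 60.0,
    min_identity: float = 90.0,
) -> bool:
    """Decide whether a hit is reported, based on % length coverage and % identity."""
    if aligned_length <= 0 or reference_length <= 0:
        return False
    coverage = 100.0 * aligned_length / reference_length
    identity = 100.0 * identical_positions / aligned_length
    return coverage >= min_coverage and identity >= min_identity


# Hypothetical hits against a 1,000 bp reference gene
print(passes_thresholds(900, 855, 1000))  # 90% coverage, 95% identity -> True
print(passes_thresholds(400, 400, 1000))  # 40% coverage, 100% identity -> False
```

A hit that passes both cutoffs still says nothing about whether the gene is intact or expressed, which is the caveat in the last bullet.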

A post-sequencing workflow
5a. Detection of genes of interest

• Antimicrobial resistance genes
• Available databases:
  o NCBI
  o CARD
  o ARG-ANNOT
  o ResFinder
  o MEGARes
  o ARDB
  o PointFinder
  o PlasmidFinder
• Available programs:
  o Ariba, ResFinder, RGI, SRST2, AMRFinderPlus, ABRicate, staramr, BTyper, …

A post-sequencing workflow
5a. Detection of genes of interest

• Virulence genes
• Programs:
  o VirulenceFinder (for Listeria, S. aureus, E. coli, Enterococcus)
  o BTyper (using a custom Bacillus cereus group virulence factors database)
  o Abricate (using Ecoli_VF and VFDB databases)

A post-sequencing workflow
5a. Detection of genes of interest

• Genes associated with specific phenotypes of interest (e.g., heat tolerance)
  o Requires a subject matter expert to identify genes that are associated with phenotypes of interest
  o Once genes of interest are known, they can be added to any of the existing databases if used off-line

A post-sequencing workflow
5b. Genotyping/serotyping

SEROTYPING
Different serotyping programs are available:
o Salmonella: SISTR, SeqSero2, MOST
  o SISTR performs best (94% accuracy) according to the latest benchmarking (Uelze et al., 2020, https://aem.asm.org/content/86/5/e02265-19#sec-9)
o E. coli: ectyper (https://github.com/phac-nml/ecoli_serotyping)
o Listeria: LisSero (https://github.com/MDU-PHL/LisSero/blob/master/README.md)

A post-sequencing workflow
5b. Genotyping/serotyping

GENOTYPING
Single locus genotyping
• rpoB (for identification of sporeformers)
• sigB (can help identify Listeria species, even L. monocytogenes lineage)
• …

Liao et al., 2017 (AEM), https://aem.asm.org/content/83/12/e00306-17/figures-only

A post-sequencing workflow
5b. Genotyping/serotyping

GENOTYPING
• Multi-locus sequence typing – How does it work?
  o Typically, we compare 7 genes (loci)
  o If the examined locus is different from everything that is in the database, it gets a new allele type (AT) assigned
  o Each unique combination of 7 locus ATs gets a unique sequence type (ST) number
  o Closely related STs are grouped in clonal complexes (CCs)
  o ST numbers are easy to compare
• However, the ST number alone does not tell how many differences there are between two STs; it only tells that two isolates with different STs are different

[Figure: Locus A to Locus G; assign AT to each locus; assign ST to each combination of ATs; closely related STs are grouped into CC]
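The AT and ST assignment logic can be sketched with two small lookup tables. This is a toy illustration: the locus names follow the figure, the sequences are made up, and new numbers are simply assigned in order of appearance, whereas real MLST databases are curated and assign official AT and ST numbers.

```python
LOCI = ("A", "B", "C", "D", "E", "F", "G")


def assign_allele_types(isolate_loci: dict, allele_db: dict) -> tuple:
    """Return the 7-locus allele profile; unseen sequences get the next free AT number."""
    profile = []
    for locus in LOCI:
        known = allele_db.setdefault(locus, {})
        sequence = isolate_loci[locus]
        if sequence not in known:
            known[sequence] = len(known) + 1
        profile.append(known[sequence])
    return tuple(profile)


def assign_sequence_type(profile: tuple, st_db: dict) -> int:
    """Each unique combination of ATs gets its own ST number."""
    if profile not in st_db:
        st_db[profile] = len(st_db) + 1
    return st_db[profile]


def allele_differences(profile_a: tuple, profile_b: tuple) -> int:
    """Number of loci at which two profiles differ (not visible from the ST number alone)."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical isolates: the second differs at one locus, the third at all seven
isolate_1 = {locus: "ACGT" for locus in LOCI}
isolate_2 = dict(isolate_1, G="ACGA")
isolate_3 = {locus: "TTTT" for locus in LOCI}

allele_db, st_db = {}, {}
profiles = [assign_allele_types(i, allele_db) for i in (isolate_1, isolate_2, isolate_3)]
for profile in profiles:
    print(assign_sequence_type(profile, st_db), profile)
print(allele_differences(profiles[0], profiles[1]), allele_differences(profiles[0], profiles[2]))
```

All three isolates receive different ST numbers, yet the first two differ at a single locus while the first and third differ at all seven, which is why ST numbers alone do not indicate relatedness.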

A post-sequencing workflow
5b. Genotyping/serotyping

MLST – example of three different STs
o Two isolates with a different ST number may be very distantly related (e.g., ST 100 and ST 2) OR
o They may be closely related (e.g., ST 100 and ST 453)

[Figure: ST 100 (ref); ST 2 vs. ST 100; ST 453 vs. ST 100]

A post-sequencing workflow
5b. Genotyping/serotyping

Multi-locus sequence typing databases:
• For most species: https://pubmlst.org/databases/
• Listeria: https://bigsdb.pasteur.fr/listeria/

A post-sequencing workflow
5c. wg/cgMLST

• Comparing a thousand or more genes gives more detailed information than comparing just 7 genes
  o Whole genome MLST
  o Core genome MLST
  o Accessory genome MLST

Maiden et al., 2013 (Nat Rev Microbiol), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3980634/

A group of organisms that are descended from a single cell but have started to diversify by recombination

A post-sequencing workflow
5c. wg/cgMLST

o Whole genome MLST
o Core genome MLST
o Accessory genome MLST

[Figure: Genomes 1, 2, and 3 shown as overlapping sets; whole genome of genome 2; pangenome of genomes 1, 2, and 3; core genome of genomes 1, 2, and 3]
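The core genome, pangenome, and accessory genome in the figure map directly onto set operations. The sketch below treats each genome as a set of gene names; the gene names are hypothetical, and real analyses first need to decide which genes in different genomes count as the same gene.

```python
def core_genome(genomes: dict) -> set:
    """Genes present in every genome."""
    return set.intersection(*genomes.values())


def pangenome(genomes: dict) -> set:
    """Genes present in at least one genome."""
    return set.union(*genomes.values())


def accessory_genome(genomes: dict) -> set:
    """Genes present in some, but not all, genomes."""
    return pangenome(genomes) - core_genome(genomes)


# Hypothetical gene content of three genomes
genomes = {
    "Genome 1": {"gene_a", "gene_b", "gene_c", "gene_d"},
    "Genome 2": {"gene_a", "gene_b", "gene_c", "gene_e"},
    "Genome 3": {"gene_a", "gene_b", "gene_f"},
}
print(sorted(core_genome(genomes)))
print(sorted(pangenome(genomes)))
print(sorted(accessory_genome(genomes)))
```

Adding a more distant genome to the comparison can only shrink the core set, which is why the choice of isolates affects how many loci a cgMLST comparison covers.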

A post-sequencing workflow
6. SNP detection

• Single nucleotide polymorphism (SNP)
• Insertions and deletions (indels)

A post-sequencing workflow
6. SNP detection

• SNP detection – things to consider:
  o Reference-free (less accurate) or reference-based (more accurate)
  o If reference-based, what reference to choose?
  o What pipeline to choose?
  o How conservative to be (e.g., what SNPs should be excluded)?

A post-sequencing workflow
6. SNP detection

• SNP detection – things to consider:
  o Reference-free (less accurate) or reference-based (more accurate)

Reference-free:
o Can be done using assembled genomes (do not need to have raw reads)
o Genomes are fragmented into short sequences (called kmers) and these fragments are compared among isolates

Reference-based:
o Reads are aligned (mapped) to a reference genome

[Figure: reads mapped to a reference genome with a SNP highlighted; source: https://www.baseclear.com]
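The kmer idea behind reference-free comparison can be shown on two short sequences. This is only a toy sketch of splitting sequences into kmers and comparing the resulting sets; it is not a SNP caller, and the sequences and the kmer length of 3 are made up (real tools use much longer kmers on whole genomes).

```python
def kmers(sequence: str, k: int) -> set:
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def compare_kmers(seq_a: str, seq_b: str, k: int):
    """Return kmers shared by both sequences and kmers unique to each."""
    kmers_a, kmers_b = kmers(seq_a, k), kmers(seq_b, k)
    return kmers_a & kmers_b, kmers_a - kmers_b, kmers_b - kmers_a


# Hypothetical sequences that differ by a single nucleotide (A -> T at position 5)
isolate_1 = "ACGTACGTGA"
isolate_2 = "ACGTTCGTGA"
shared, only_1, only_2 = compare_kmers(isolate_1, isolate_2, k=3)
print(sorted(shared))
print(sorted(only_1))
print(sorted(only_2))
```

A single nucleotide difference changes every kmer that overlaps it, so the kmers unique to each isolate cluster around the variant position.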

A post-sequencing workflow
6. SNP detection

• SNP detection – things to consider:
  o If reference-based, what reference to choose?
  o Rule #1 – the reference needs to be closely related, since only core SNPs are finally used
  o How to identify a closely related reference?

[Figure: core genome shared by Genome 1, Genome 2, and the reference genome, compared with the core genome shared by Genome 1 and Genome 2]

Larger core genome – sequences are compared over a greater length to improve the chances of detecting core SNPs

Shorter core genome – fewer detected core SNPs, but that does not mean that genomes are closely related!

A post-sequencing workflow
6. SNP detection

• SNP detection – things to consider:
  o What pipeline to choose?
  o There are many options; the following are used by government agencies:
    o FDA CFSAN pipeline: Davis et al., 2015, https://peerj.com/articles/cs-20/
    o CDC LyveSET pipeline: Katz et al., 2017, https://www.frontiersin.org/articles/10.3389/fmicb.2017.00375/full
  o Both of these pipelines are conservative (e.g., remove SNPs that are likely due to errors, horizontal gene transfer, or recombination)

SNP vs. cgMLST analysis
• Results in a comparable outcome
• What is a maximum SNP or allele difference among epidemiologically linked isolates?
• It depends on:
  o How clonal the species (or serotype) is
  o Whether a strain multiplies (and evolves) in a host or a food product throughout the duration of an outbreak
• Epidemiological data are needed to support information derived from comparative genomics

Cost-benefit analysis
Things to consider:

1. Sequencing instrument cost
2. Reagents and supplies costs
3. Labor cost
4. Personnel wet lab and bioinformatics expertise

Cost-benefit analysis
To establish an in-house WGS or to outsource library preparation, sequencing, and/or data analyses?

• Outsourcing the library preparation and sequencing costs ~$100-$1,500/sample (depending on how many isolates you would like to sequence and the service provider)
• In-house WGS can reduce the reagents and supplies costs down to ~$50/sample (excluding the instrument and labor costs)
• Advantage of in-house WGS – building internal expertise, having full control over the process
• Disadvantage of in-house WGS – absorbing the costs of failed experiments
• The number of samples you expect to sequence per year is an important factor to consider when doing cost-benefit analysis and deciding whether to implement an in-house WGS

Cost-benefit analysis
To establish an in-house WGS or to outsource library preparation, sequencing, and/or data analyses?

• Outsourcing the library preparation and sequencing costs ~$150-$1,500/sample (depending on how many isolates you have, what coverage you need, and the service provider)
• In-house WGS can reduce the reagents and supplies costs down to ~$50/sample (excluding the instrument and labor costs)
• Advantage of in-house WGS – building internal expertise, having full control over the process
• Disadvantage of in-house WGS – absorbing the costs of failed experiments
• The number of samples you expect to sequence per year is an important factor to consider when doing cost-benefit analysis and deciding whether to implement an in-house WGS

Cost-benefit analysis
One WGS test or multiple phenotypic tests?
o Species ID
o Serotyping
o Antibiotic resistance testing
o Genotyping …

o Each of these phenotypic tests can cost at least $50
o In silico (via WGS) species ID, serotyping, AMR detection, and genotyping can be more cost-effective, IF a confident relationship between sequences and phenotypes is established.

Images used in the presentation were sourced from the following websites: Biocompare, Illumina, PacBio, Nanopore, Ion Torrent, Mlst.net, Baseclear.com, cdn3.vectorstock, and publications cited in slides.

Q & A

If your question wasn’t answered…

Please contact Scott Nichols at snichols@wga.com or one of the trade organization representatives and we would be happy to respond.

Thank you.
