GenomeTrakr: Whole-Genome Sequencing for Food Safety and A New Way Forward in the Microbiological Testing & Traceability for Foodborne Pathogens

Eric W. Brown, Ph.D.
Director, Division of Microbiology
Center for Food Safety & Applied Nutrition
U.S. Food & Drug Administration
College Park, Maryland 20740


“Whole Genome Sequencing Is The Biggest Thing To Happen To Food Microbiology Since Pasteur Showed Us How To Culture Pathogens…”

~Dr. Jorgen Schlundt, Exec Director and Founder, The Global Microbial Identifier

[Figure: Representative* Timeline for Foodborne Illness Investigation Using Whole Genome Sequencing. x-axis: Days (0 to 68); y-axis: Number of Cases (0 to 40). Annotations: contaminated food enters commerce; FDA, CDC, FSIS, and States use WGS in real time and in parallel on clinical, food, and environmental samples; source of contamination identified early through WGS combined database queries; averted illnesses.]

*Data are for illustrative purposes and do not represent an actual outbreak

Comparison of Nut Butter Outbreaks*

• Salmonella Tennessee, ConAgra, Peter Pan Peanut Butter, 2006/2007: 715 cases, 129 hospitalizations, no deaths

• Salmonella Typhimurium, multiple peanut products, Peanut Corporation of America, 2008/2009: 714 cases, 166 hospitalizations, may have contributed to 9 deaths

Post GenomeTrakr Network (Whole Genome Sequencing)

• Salmonella Braenderup, nSpired Natural Foods, multiple almond and peanut butters, 2014: 6 cases, 1 hospitalization, no deaths

* Source: CDC’s Foodborne Outbreak Online Database (FOOD Tool)

GenomeTrakr Network

• GenomeTrakr was established to accelerate the source tracking and tracing of foodborne outbreaks through the use of next-generation whole genome sequencing (WGS)

• It is a network of State and Federal laboratories with whole genome sequencing capability, established by FDA in 2012

• The network provides high-resolution genomic sequences of food pathogens, e.g., Salmonella, Listeria, STECs, and others

• Partnership with NCBI for all storage and sharing of sequence and metadata in the public domain

• Partnered with CDC in 2013 to study all clinical and environmental isolates of Listeria monocytogenes

• Today the network consists of labs at FDA, CDC, FSIS, 14 state labs, and 9 international labs.

http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/default.htm

WGS is BIG data! The GenomeTrakr database is currently about 17 terabytes. 1 terabyte = 17,000 hours of normal human speech. Thus, the GenomeTrakr database is equivalent in size to about 289 thousand hours of spoken words, or about 33 years of continuous speech.

Or:

One terabyte = 2,000 file cabinets' worth of paper.

You would need 34,000 standard four-drawer file cabinets for the GenomeTrakr data if it were printed.

*The Hubble telescope has collected 45 terabytes of data since its launch in the early 90s. GenomeTrakr has been live since 2012 and is more than a third of the way there in data storage.
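The size comparisons above are simple unit conversions, and they can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide; the variable names and the 365-day year are assumptions made for illustration.

```python
# Figures quoted on the slide (taken as given, not measured here).
DB_SIZE_TB = 17               # approximate GenomeTrakr database size
SPEECH_HOURS_PER_TB = 17_000  # hours of normal human speech per terabyte
CABINETS_PER_TB = 2_000       # file cabinets of paper per terabyte
HUBBLE_TB = 45                # data collected by the Hubble telescope

# Assumption for illustration: a 365-day year of continuous speech.
HOURS_PER_YEAR = 24 * 365

speech_hours = DB_SIZE_TB * SPEECH_HOURS_PER_TB
speech_years = speech_hours / HOURS_PER_YEAR
cabinets = DB_SIZE_TB * CABINETS_PER_TB
hubble_fraction = DB_SIZE_TB / HUBBLE_TB

print(f"Speech equivalent: {speech_hours:,} hours ({speech_years:.1f} years)")
print(f"File cabinets:     {cabinets:,}")
print(f"Share of Hubble:   {hubble_fraction:.0%}")
```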

GenomeTrakr Strategy

• Develop a distributed sequencing-based network, rather than a centralized model

• Public access to data

• Focus on collaborative efforts

• Provide sequence and minimal metadata in a publicly accessible database

– Partner with NCBI for storage and serving data

– Cost prohibitive for FDA to establish its own high capacity data site

– Industry (food, pharma, and methods development), academia, hospitals, clinical public health laboratories, and other government agencies have access to data for their individual needs

FDA GenomeTrakr website: http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm

[Figure: Total Number of Sequences in the GenomeTrakr Database. x-axis: quarters, 2013 to 2016; y-axis: Number of Sequences (as of the last day of the quarter).]

• First sequences uploaded in Feb 2013

• Average number of sequences added per month in 2013 = 169

• Average number of sequences added per month in 2014 = 1,076

• Average number of sequences added per month in 2015 = 2,362

• Public Health England uploads more than 8,000 Salmonella sequences

How do we use the GenomeTrakr information?

Environmental sampling

Post inspection

Interpretation

SNP Distance

How close are the isolates? No single threshold for all species/types: rough, conservative guides

1. Inclusion: <= 20 SNPs (match, virtually identical)

2. Inconclusive: 20-100 SNPs

3. Exclusion: > 100 SNPs (exclude)

Bootstrapping

Do the isolates form a unique cluster w/ >= 95% support?

Is the cluster distinct from other isolates in the tree?
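The rough guides above can be written down as a small decision rule. The following is a minimal sketch, not a validated interpretation tool: the function names are hypothetical, the thresholds are the slide's conservative guides rather than species-specific cutoffs, and a distance of exactly 20 SNPs is assigned to "inclusion" because the slide states that bound as <= 20.

```python
def classify_snp_distance(snp_distance: int) -> str:
    """Map a pairwise SNP distance to the slide's rough interpretation bins."""
    if snp_distance < 0:
        raise ValueError("SNP distance cannot be negative")
    if snp_distance <= 20:
        return "inclusion"      # match, virtually identical
    if snp_distance <= 100:
        return "inconclusive"   # needs more evidence
    return "exclusion"          # isolates are excluded


def has_bootstrap_support(support_percent: float, threshold: float = 95.0) -> bool:
    """Check whether a cluster meets the >= 95% bootstrap support guide."""
    return support_percent >= threshold


# Hypothetical pairwise distances, for illustration only.
for distance in (3, 20, 57, 240):
    print(distance, classify_snp_distance(distance))
print(has_bootstrap_support(97.5))
```

Neither check is sufficient on its own; the slide pairs SNP distance with tree topology, so a close distance without a well-supported, distinct cluster would still warrant caution.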

Data Analysis: SNPs vs. wgMLST

Unit of Measure

• SNPs: Single nucleotide substitutions (other types of mutations are excluded)

• wgMLST: Allele, a variant of a gene. Variation could arise from a number of sources, including SNPs, insertions, deletions, etc.

Requirements

• SNPs: Complete or high-quality reference genome for mapping

• wgMLST: Database of named alleles, must be actively maintained

Pros

• SNPs: Extremely high resolution; methods have been published and validated

• wgMLST: Relatively fast, not directly dependent upon reference genome

Cons

• SNPs: Requires reference genome, computationally intense, requires local bioinformatics expertise

• wgMLST: Allele database must be centralized, cannot compute novel wgMLST types locally. wgMLST schemas not publicly available at this time.
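The difference in unit of measure can be illustrated with a toy comparison. This sketch is not the CFSAN SNP Pipeline or any wgMLST scheme: the function names, locus names, and sequences are invented, the sequences are assumed to be pre-aligned, and raw sequences stand in for the named alleles a real wgMLST database would assign.

```python
from typing import Mapping


def snp_distance(aligned_a: str, aligned_b: str) -> int:
    """Count single nucleotide substitutions; gap positions are ignored."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(aligned_a, aligned_b)
        if x != y and x != "-" and y != "-"
    )


def allele_distance(profile_a: Mapping[str, str], profile_b: Mapping[str, str]) -> int:
    """Count shared loci whose sequences differ in any way."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] != profile_b[locus])


# Hypothetical aligned loci for two isolates.
isolate_1 = {"geneA": "ACGTACGT", "geneB": "TTGACCA", "geneC": "GGGCCC"}
isolate_2 = {"geneA": "TCGAACCT", "geneB": "TTG-CCA", "geneC": "GGGCCC"}

total_snps = sum(snp_distance(isolate_1[g], isolate_2[g]) for g in isolate_1)
total_alleles = allele_distance(isolate_1, isolate_2)
print(total_snps, total_alleles)
```

In this toy case geneA carries three substitutions but counts as one allele difference, while the deletion in geneB adds an allele difference and no SNPs, which mirrors the "other types of mutations are excluded" note in the SNP column.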

CFSAN SNP Pipeline

• Documentation: http://snp-pipeline.rtfd.org

• Source Code: https://github.com/CFSAN-Biostatistics/snp-pipeline

• PyPI Distribution: https://pypi.python.org/pypi/snp-pipeline

Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E. (2014) An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2:e620 http://dx.doi.org/10.7717/peerj.620

Intended for use by bioinformaticists (Linux)


Data Submission

Do not need to be associated with GenomeTrakr

Public Health England

Establish Bioproject

Upload data and metadata

Link to surveillance pipeline – kmer tree
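The slide does not say how the k-mer tree is built. As a general illustration of k-mer based comparison only, the sketch below computes a Jaccard distance between the k-mer sets of two sequences; the function names, the choice of k, and the example sequences are assumptions, and production pipelines use far larger k and more efficient methods.

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(seq_a: str, seq_b: str, k: int = 4) -> float:
    """1 minus the fraction of k-mers shared between the two sequences."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        return 0.0  # both sequences are shorter than k
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical sequences, for illustration only.
print(jaccard_distance("ACGTACGTAC", "ACGTACGTAC"))
print(jaccard_distance("ACGTACGTAC", "ACGTTCGTAC"))
```

A matrix of such pairwise distances is the kind of input a distance-based tree method could cluster, which is why k-mer comparison suits rapid surveillance of newly submitted isolates.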

Lessons Learned

• WGS has provided accurate and informative results in every case in which we have applied it, and the distributed model for a WGS network has proved the most effective means of acquiring sequence data.

• WGS can be used to aid tracebacks and delimit the scope of food contamination events as never before. It is not just a regulatory tool: numerous offshoot applications exist (e.g., supply chain management, quality assurance, process evaluation).

• The development of international open source databases is critical due to the global nature of the food supply.

• Genome sequences are agnostic, portable, and instantly cross-compatible: one technology approach, irrespective of organism.

• WGS, unlike PFGE, is more than simply a “Molecular Epi-Machine”. It provides information on AMR, virulence, serotype, and other critical factors in one assay, including historical reference to pathogen emergence. Significant lab cost savings with one approach!

• The need for an increased number of well-characterized environmental (food, water, facility, etc.) sequences may outweigh the need for extensive clinical isolates.

Next-generation sequencing faces several large challenges as it is deployed as a global public health tool:

How much metadata?

Will all share data?

Administration, coordination, and oversight?

Who pays?

Who owns the IP?

Quality concerns and curation?

[Figure: WGS timeline with markers at 2014, 2015, 2016, 2017, and 2020]

Looking ahead

Capacity building – hardware, software and people (bioinformatics)

Slow transition from PFGE to WGS

Different authorities and distinct mandates with some overlap

Bioinformatics - training IFSH workshops/CDC/IFSTL-JIFSAN-UMD

“Hands on”

Sample submission by industry

Understanding the supply chain

Facility and transportation sanitation – resident pathogens

Prevention

Spoilage organisms

One microbiology workflow for bacterial pathogens – FDA FOODS PROGRAM

Multiple Tests for Strain Characterization

• Species

• Resistance

• Virulence

• Subtype

• Serotype

• Adaptations

ONE MICROBIOLOGICAL WORKFLOW: ONE MICROBIOLOGICAL TOOL BOX, ALL AT YOUR FINGERTIPS

IN THE NOT SO DISTANT FUTURE... APPs ON YOUR SMARTPHONE

Acknowledgements

• FDA

– Center for Food Safety and Applied Nutrition

– Center for Veterinary Medicine

– Office of Regulatory Affairs

– Office of Food & Veterinary Medicine

• National Institutes of Health

– National Center for Biotechnology Information

• State Health and University Labs

– Alaska, Arizona, California, Florida, Hawaii, Maryland, Minnesota, New Mexico, New York, South Dakota, Texas, Virginia, Washington

• USDA/FSIS

– HQ and The Eastern Laboratory

• CDC

– Enteric Diseases Laboratory

• INEI-ANLIS “Carlos Malbran Institute,” Argentina

• Centre for Food Safety, University College Dublin, Ireland

• Food and Environment Research Agency, UK

• Public Health England, UK

• WHO

• Illumina

• Pac Bio

• CLC Bio

• MANY other independent collaborators

Education
