Advanced Molecular Detection
Duncan MacCannell, PhD
Georgia Tech / CDC Collaborates
March 12th, 2014
National Center for Emerging and Zoonotic Infectious DiseasesOffice of the Director
Roche 454 PTP plate, Ion Torrent 314, Pacific BioSciences SMRTcells (x 3)
Devices and brand names provided for illustrative purposes only. Their use does not imply endorsement by CDC or HHS.
VO
LU
ME O
F R
AW
DATA
>70,000 samples/yearx
2 to 3 GB raw sequence + 5-10 GB intermediate
~0.9 petabytes of raw data/year
Data Acquisition/Analysis Challenges
Transmission and storage?Is better data compression the answer?Distributed processing and extraction?Is full WGS the right approach for large-scale surveillance?
Any solution must balance the advantages of WGS, with the costs of implementation.
For PulseNet USA alone:
Input: DNA/RNA
Source:GenomicAmpliconWhole sample
Host/vector/pathogen/environment
…
Library
Output: InformationFrom Sequence Data
Comparative GenomicsHR Straintyping/SubtypingCluster identificationMolecular evolutionGenotypic characterizationVirulence, AR, signaturesFunctional annotationDiagnostic dev/validation
MetagenomicsPathogen identification/discoveryCulture-independent diagnosticsMicrobial ecology/diversity….
Data Info.
ACAATTTGTGCATAACATGTGGACAGTTTTAATCACATGTGGGTAAATAGTTGTCCACATTTGCTTTTTT TGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACAT TTTATATTTATTAGGTTGTACATTTGTTGCGCAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACAC CTTTGGAAAATATCTCTGATTTATGGAATAGTGCCTTAAAAGAATTAGAAAAAAAGGTAAGCAAGCCTAG TTATGAAACATGGTTAAAATCAACAACGGCTCATAACTTGAAGAAAGACGTATTAACGATTACAGCTCCA AATGAATTTGCTCGTGACTGGCTAGAATCTCATTACTCAGAACTTATTTCGGAAACACTATACGATTTAA CAGGGGCAAAATTAGCAATTCGCTTTATTATTCCCCAAAGTCAATCGGAAGAGGACATTGATCTTCCTCC AGTTAAGCGGAATCCAGCACAAGATGATTCAGCTCATTTACCACAGAGCATGTTAAATCCAAAATATACA TTTGATACATTTGTTATCGGCTCTGGTAACCGTTTTGCCCATGCAGCTTCATTAGCTGTAGCCGAGGCGC CAGCTAAAGCGTATAATCCACTCTTTATTTATGGGGGAGTTGGGCTTGGAAAGACGCATTTAATGCACGC AATTGGTCATTATGTAATTGAACATAATCCAAATGCAAAAGTTGTATATTTATCATCAGAAAAATTCACG AATGAATTTATTAACTCTATTCGTGATAATAAAGCTGTTGATTTTCGTAATAAATATCGCAACGTAGATG
NGS
Workflow:PlatformsChemistryPerf. char.Labor/TaTCost
Bioinformatics
Workflow:Hardware/softwareSpecialized skillsetsAlgorithms/pipelinesPathogen databasesData analysis/interpret/Integration/visualization
Increasingly Universal WorkflowsEstablished sequencing workflows for a wide range of pathogens.
Objective, “Future-Proof” DataIntrinsic quality metrics. Ability to back-test retrospective sequence data in silico for genes/markers identified at a future date.
MANY RESULTS POSSIBLE FROM A SINGLE DATASET!
A Moving TargetRapidly evolving technology space. Changing hardware and COTS/OSS capabilities. Lots of choice, but lack of consistent standards. BIG DATA. New workforce and skillset is required.
Pathogen- and application-specific, CLIA-compliant assays
File hashes/versioningValidated
methods/databasesProcess logging/audit
QA/QCSkills/proficiency
Standards
Reporting
Security
Sample intakePrep/stagingExtraction
ConversionLibrary prepSequencing
WGS and Pathogen Genomics: Advantages
It’s universal… DNA/RNA sequencing workflows and approaches can be
applied to a wide range of pathogenic organisms. It’s fundamental…
Genomics is a cornerstone for other “omic” approaches Sequence databases starting point for assay
devel./validation. It’s objective…
Sequence-based methods avoid subjectivity of phenotypic or fragment-based approaches. Volume of data internal controls.
It’s (relatively) future proof… Comprehensive sequencing captures the features you
know about, and those you don’t. Quality may change, but the sequence will not.
This makes it possible to back-test future approaches/targets on the data you collect today.
WGS and Genomic Epidemiology: Limitations
It lacks standardization… WGS is a rapidly-evolving technology space, both in terms
of sequencing and analytics. Standards and mechanisms for data/metadata analysis,
storage and exchange remain under active debate and development.
Comprehensive databases are still being built… Without a useful baseline understanding of pathogen
features/diversity, interpretation may be limited. Need curated,and comprehensive epi-linked reference
databases. Many analyses require specialized
bioinformatics infrastructure and staff. Bioinformaticists, DBAs, programmers, system
administrators, etc. Technical and computational complexity of tasks can vary
widely. Data management, retention and release.
Storage. LIMS.
Advanced Molecular DetectionProposed $30M FY2014 budget request to:
1. Improve pathogen identification and detectionOutcome: Rapid progress toward modernizing PulseNet and other critical lab-based surveillance systems
2. Adapt new diagnostics to meet evolving public health needs
Outcome: Enhance CDC’s ability to detect outbreaks early, develop new test during outbreaks, and better characterize infectious disease threats
3. Help states meet future reference testing needs in a coordinated manner
Outcome: More effective and better integrated outbreak response activities
4. Implement enhanced, sustainable, and integrated laboratory information systems
Outcome: Labs inside and outside CDC can share information quickly and seamlessly, including with other CDC databases, such as MicrobeNet and PulseNet
5. Develop prediction, modeling, and early recognition tools
Outcome: Better equipped to prevent, detect & respond to infectious diseases.
EPI NGS
BIOINFO
AMD
AMD Initiative: Strategic Investments (1)
Scientific Infrastructure: Critical laboratory and bioinformatics infrastructure at
CDC, state/local PHL, and key overseas laboratories. • Sequencers, mass-spec, other instrumentation, reagents.• High performance computing, workstations.• Data storage, networking; data integration, knowledge
management.• Service contracts, software licensing, etc.
AMD Initiative: Strategic Investments (2)
Workforce development: Training for CDC and PHL staff (bioinformatics, genomics,
-omics) New or re-tooled fellowship programs (bioinformatics,
genomics) Recruitment of new staff and skillsets (bioinformaticians,
data scientists, lab specialists, …)
AMD Initiative: Strategic Investments (3)
Consortia, partnerships and alignment of efforts Academic institutions State, Federal (NIH, FDA, DHS, DoD, DoE/National
Laboratories) Non-Profit/NGO International community Commercial/For-Profit Clinical laboratories
Pilot projects with state/local and other partners. Outbreak detection, investigation and response Leverage existing laboratory-based surveillance systems
Challenges and Opportunities for CDC/GT
Training and workforce development. Development of wet bench and bioinformatics
curriculum for public health audiences. Scientific exchanges. Fellowship programs. MOOC-style coursework and training modules for PHL.
Bioinformatic challenges. Analysis and visualization of complex structured and
unstructured data. Epi/lab integration. Dashboards/decision support.
Development and standardization of deployable, CLIA compatible bioinformatics workflows. Fieldable/portable systems.
Machine learning and other approaches for genotypic prediction of complex microbial phenotypes (eg: antimicrobial resistance)
Approaches to address CIDT: eg: accelerated metagenomic classification, lab/bifx approaches for complex sample matrices.
Tools for rapid assay design and validation from HTS data
Hardware-accelerated algorithms, scalable HPC (+NoSQL/Hadoop)
…
Questions and Discussion
For more information please contact Centers for Disease Control and Prevention
1600 Clifton Road NE, Atlanta, GA 30333Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348Visit: www.cdc.gov | Contact CDC at: 1-800-CDC-INFO or www.cdc.gov/info
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
National Center for Emerging and Zoonotic Infectious DiseasesOffice of the Director