Office of Advanced Molecular Detection
National Center for Emerging and Zoonotic Infectious Diseases
Joel Sevinsky PhD & Duncan MacCannell PhD
Introduction to Bioinformatics
APHL 2017 Bioinformatics Workshop, 2017/06/11
AMD – Innovate * Transform * Protect
Introductions
https://xkcd.com/1605/
Input: DNA/RNA
Source: genomic, amplicon, whole sample
Host/vector/pathogen/environment
…
Library
Output: Information From Sequence Data
Comparative Genomics: identification; high-res strain typing/subtyping; cluster identification; molecular evolution; genotypic characterization; virulence, AR, signatures; functional annotation; diagnostic dev/validation; minor populations, quasispecies; host/pathogen expression
Metagenomics: pathogen identification/discovery; culture-independent diagnostics; microbial ecology/diversity
Data Info.
ACAATTTGTGCATAACATGTGGACAGTTTTAATCACATGTGGGTAAATAGTTGTCCACATTTGCTTTTTT TGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACAT TTTATATTTATTAGGTTGTACATTTGTTGCGCAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACAC CTTTGGAAAATATCTCTGATTTATGGAATAGTGCCTTAAAAGAATTAGAAAAAAAGGTAAGCAAGCCTAG TTATGAAACATGGTTAAAATCAACAACGGCTCATAACTTGAAGAAAGACGTATTAACGATTACAGCTCCA AATGAATTTGCTCGTGACTGGCTAGAATCTCATTACTCAGAACTTATTTCGGAAACACTATACGATTTAA CAGGGGCAAAATTAGCAATTCGCTTTATTATTCCCCAAAGTCAATCGGAAGAGGACATTGATCTTCCTCC AGTTAAGCGGAATCCAGCACAAGATGATTCAGCTCATTTACCACAGAGCATGTTAAATCCAAAATATACA TTTGATACATTTGTTATCGGCTCTGGTAACCGTTTTGCCCATGCAGCTTCATTAGCTGTAGCCGAGGCGC CAGCTAAAGCGTATAATCCACTCTTTATTTATGGGGGAGTTGGGCTTGGAAAGACGCATTTAATGCACGC AATTGGTCATTATGTAATTGAACATAATCCAAATGCAAAAGTTGTATATTTATCATCAGAAAAATTCACG AATGAATTTATTAACTCTATTCGTGATAATAAAGCTGTTGATTTTCGTAATAAATATCGCAACGTAGATG
NGS Workflow: platforms, chemistry, perf. char., labor/TaT, expertise, cost
Bioinformatics Workflow: hardware/software; specialized skillsets; algorithms/pipelines; pathogen databases; data analysis/interpret/integration/visualization
Increasingly Universal Workflows: Working to establish standardized sequencing workflows for a wide range of pathogens. Many results from a single dataset. Faster and cheaper than serial tests.
A Moving Target: Rapidly evolving technology space. Changing hardware and COTS/OSS capabilities. Lots of choice, but lack of consistent standards. BIG DATA. New workforce and skillset is required.
Pathogen- and application-specific, standard and/or compliant assays
File hashes/versioning; validated methods/databases; process logging/audit; QA/QC; skills/proficiency; standards; reporting; security
Sample intake; prep/staging; extraction; conversion; library prep; sequencing
NGS Applications in Microbiology
NGS Sequencing Technologies
SYNTHETIC LONG READ
• 10X Genomics GemCode
• Illumina TruSeq SLR (Moleculo)
SHORT READ SEQUENCING
• Platforms: Ion Torrent PGM/Proton/S5; Illumina MiSeq/NextSeq/HiSeq
• 75 to 400 bp read lengths*
• Millions to billions of reads
• Various error models
• Relatively inexpensive (~$60/isolate)
• Issues: resolving complex structures, phasing
LONG READ SEQUENCING
• Platforms: Oxford Nanopore MinION; Pacific Biosciences RS II; Pacific Biosciences Sequel*
• >3-20 kbp read lengths*
• Hundreds of thousands of reads*
• High error rates have presented challenges*
• Roughly $600/isolate (RS II; Nanopore TBD)
• Other adv: SMRT; methyl; phasing; closing
Run QC & Metrics
Demultiplexing
Read QC: FastQC, CLC, FastX Toolkit, etc.
Trimming/Filtering: Trimmomatic, CLC, kraken, bowtie2, ...
Reference Mapping
• Alignment: bwa
• Variant Calling: samtools, gatk, varscan, freebayes, …
• Variant Filter/Annot
• Tree Building: RAxML, PhyML, FastTree, kSNP
• MLST/AMR: srst2
De Novo
• WG Alignment: mauve, parsnp …
• Annotation/FP: prokka, pgaap, etc…
• Tree Building: Harvest, kSNP, …
• MLST/AMR: abricate, mlst
Comparative Genomics: PAN-GENOME, wgMLST, AMR PREDICTION, FUNC ANNOTATION
H/T: Nick Loman
What does the data actually look like?
3 SAMPLES; Illumina paired-end; ~47GB compressed; ID, flowcell, barcode, lane, pair
Further reading: https://en.wikipedia.org/wiki/FASTQ_format
1. Identifier for each read. (Syntax varies; Illumina shown.) INSTRUMENT: HISEQ; RUN ID: 165; FLOWCELL ID: C1CKRACXX; FLOWCELL LANE: 2; TILE: 2201; X,Y: 1257,1980; PAIR: 1; FILTERING: N; BARCODE: GAGTGG
2. SEQUENCE DATA
3. Per-base quality score: ASCII, 64 or 94 levels. LOWEST → ! (HEX 21); HIGHEST → ~ (HEX 7E)
265M 100bp reads in CDD5
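A minimal sketch of reading the two annotated parts of a FASTQ record: the identifier line and the quality string. It assumes the Casava 1.8+ identifier layout and Phred+33 quality encoding (where `!` is the lowest score); the header is assembled from the field values on the slide, and the control field `0` is an assumption because the slide does not show it.

```python
def parse_illumina_header(header: str) -> dict:
    """Split a Casava 1.8+ style read identifier into named fields."""
    left, right = header.lstrip("@").split(" ")
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    pair, filtered, control, barcode = right.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair": int(pair),
        "filtered": filtered,   # "N" means the read passed the filter
        "control": control,     # assumed value; not shown on the slide
        "barcode": barcode,
    }


def phred33(quality: str) -> list:
    """Convert a Phred+33 quality string to integer scores."""
    return [ord(char) - 33 for char in quality]


# Header built from the slide's field values (control field assumed to be 0).
header = "@HISEQ:165:C1CKRACXX:2:2201:1257:1980 1:N:0:GAGTGG"
fields = parse_illumina_header(header)
scores = phred33("!5I~")  # hypothetical four-base quality string
print(fields["flowcell"], fields["lane"], scores)
```

Older Illumina pipelines used a Phred+64 offset, so the offset should be checked against the instrument software before trusting decoded scores.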
NGS Quality Assessment: FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Genome Assembly
INCLUDES: Sequence(s) of interest, contaminants
MAY NOT INCLUDE: Poorly sequenceable regions
Mapped assembly
For demonstration purposes only. All copyrights belong to their respective owners.
Reference-Guided (Mapped) Assembly
Reference Sequence/Genome
Low sequence coverage
UNMAPPED READS
1. Sequences not present in the reference.
2. Plasmids or other extrachromosomal.
3. DNA structural variation/rearrangement
ADVANTAGES: Relatively fast, well-suited to highly-conserved genomes.
DISADVANTAGES: Issues with high diversity, mobile elements, linear reference
[Figure: reads mapped along the reference, forming Contig 1 and Contig 2; coverage axis from 1X to 18X]
Example software: BWA (https://github.com/lh3/bwa), breseq (https://github.com/barricklab/breseq)
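To make "low sequence coverage" concrete, here is a small sketch that computes per-base depth from mapped read intervals and reports regions below a depth threshold. The intervals and reference length are made up for illustration; a real workflow would take them from an aligner's BAM output rather than a hand-written list.

```python
def depth_per_base(ref_length, alignments):
    """Count reads covering each reference position.

    alignments: list of (start, end) pairs, 0-based, end exclusive.
    """
    depth = [0] * ref_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth


def low_coverage_regions(depth, min_depth):
    """Return (start, end) intervals where depth falls below min_depth."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d < min_depth and start is None:
            start = pos
        elif d >= min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical read placements on a 20 bp reference.
alignments = [(0, 8), (2, 10), (4, 12), (16, 20)]
depth = depth_per_base(20, alignments)
gaps = low_coverage_regions(depth, min_depth=1)
print(depth)
print(gaps)
```

The uncovered interval is what splits the mapped assembly into separate contigs in the figure.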
De Novo Assembly
For demonstration purposes only. All copyrights belong to their respective owners.
De-Novo Assembly
[Figure: de novo contigs (Contig 1 to Contig 7) and a PLASMID]
ADVANTAGES: Reference agnostic: assembles all the reads it can. Various algorithms.
DISADVANTAGES: Doesn’t always get things right, particularly with repeat seqs.
Example software: SPAdes (http://bioinf.spbau.ru/spades)
List: https://en.wikipedia.org/wiki/Sequence_assembly#Available_assemblers
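One way to see why repeats cause trouble is a toy de Bruijn graph, the structure used by assemblers such as SPAdes. This is only a sketch of the idea, not how any particular assembler is implemented; the reads and the value of `k` are invented for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_points(graph):
    """Nodes with more than one successor: the assembler cannot tell which way to go."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical reads from a sequence in which "GCG" occurs twice.
reads = ["ATGCGT", "GCGTAG", "TAGCGA"]
graph = de_bruijn(reads, k=4)
print(branch_points(graph))
```

The repeated 3-mer has two possible continuations, so a contig has to stop there unless longer reads or paired-end information resolve the path.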
Functional Annotation
EXERCISE
hqSNP and wgMLST
High Quality SNP Typing (hqSNP)
Reference Sequence/Core Genome
SNP positions called in three query genomes:
1  ACTAGA
2  ACTAGT
3  TCTACT
Advantages: adaptable, highly discriminatory, good for cluster investigations where a suitable reference is available and timeframe known
Disadvantages: not well suited for surveillance or studies where reference or allele set may shift over time. Provides limited additional data about genomic features. Issues with highly plastic genomes.
Example software: SNIPPY (https://github.com/tseemann/snippy), LYVE-SET (https://github.com/lskatz/lyve-SET), SNP Pipeline (https://github.com/CFSAN-Biostatistics/snp-pipeline), others.
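As a rough illustration of what an hqSNP comparison produces, the sketch below counts pairwise differences between the three six-base SNP strings from the figure (ACTAGA, ACTAGT, TCTACT). The pipelines listed above do far more (mapping, filtering, masking); this only shows the final distance step.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Number of positions at which two equal-length SNP strings differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("SNP strings must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# SNP strings for isolates 1, 2 and 3, taken from the slide's figure.
snp_strings = {"1": "ACTAGA", "2": "ACTAGT", "3": "TCTACT"}

distances = {
    (p, q): snp_distance(snp_strings[p], snp_strings[q])
    for p, q in combinations(sorted(snp_strings), 2)
}
for pair, dist in distances.items():
    print(pair, dist)
```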
What factors influence SNP calls?
❑ Not all SNP pipelines are equal: where you call SNPs will affect the total SNP count and population distribution
❑ SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontal genetic elements like phages can be masked
[Figure: raw reads mapped across genes and mobile elements, with a low coverage/poor quality region. Options shown: mask mobile elements (do not consider SNPs in these locations), or only call SNPs in genes.]
Heather Carleton (CDC/DFWED)
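The masking and coverage choices above can be sketched as a simple filter. All coordinates, depths, and the `min_depth` threshold below are hypothetical; real pipelines apply additional quality and allele-frequency criteria.

```python
def filter_snps(snps, masked_regions, min_depth):
    """Keep SNP positions outside masked regions and with enough read depth.

    snps: list of (position, depth) pairs.
    masked_regions: list of (start, end) pairs, end exclusive.
    """
    kept = []
    for position, depth in snps:
        in_mask = any(start <= position < end for start, end in masked_regions)
        if not in_mask and depth >= min_depth:
            kept.append(position)
    return kept


# Hypothetical candidate SNPs and one hypothetical phage region.
candidate_snps = [(120, 35), (450, 4), (1020, 40), (1500, 38), (2210, 29)]
mobile_elements = [(1000, 1600)]
kept = filter_snps(candidate_snps, mobile_elements, min_depth=10)
print(kept)
```

Changing the mask or the depth threshold changes the SNP count, which is why results from different pipelines are not directly comparable.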
Selection of an Appropriate Reference
❑ Choice of reference genome affects analysis: a more closely related reference is more likely to identify true SNP differences
❑ For some organisms, e.g. MTB, the choice is obvious. For others, the reference varies or must be selected based on a preliminary genomic analysis.
Heather Carleton (CDC/DFWED)
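One possible form of "preliminary genomic analysis" is to rank candidate references by shared k-mer content. The sketch below uses Jaccard similarity on k-mer sets; this is a stand-in chosen for illustration, not the method described in the slides, and the sequences and names are hypothetical.

```python
def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(set_a, set_b):
    """Shared fraction of two k-mer sets (0.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pick_reference(query, candidates, k=4):
    """Return the candidate name whose k-mer set is most similar to the query."""
    query_kmers = kmers(query, k)
    return max(candidates, key=lambda name: jaccard(query_kmers, kmers(candidates[name], k)))


# Hypothetical query and candidate reference sequences.
query = "ATGTCGAATTCTTATGAC"
candidates = {
    "ref_close": "ATGTCGTATTCTTATGAC",
    "ref_far": "GGGCCCAAATTTGGGCCC",
}
best = pick_reference(query, candidates)
print(best)
```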
Caveats of hqSNP analyses
Advantages:
• Phylogenetically informative (build a tree consistent with evolution of the strains)
• SNP position can be identified on genome (gene affected can be identified)
Disadvantages:
• Requires a closely related reference genome. hqSNP analysis does not work if reference genome is not closely related → longitudinal surv. difficult
• Takes a while and requires a lot of computer power
• Interpretation of data depends on genomes added; it is not stable and does not lead to strain nomenclature.
• Mobile genetic elements can interfere with phylogenetic estimation unless masked. These also may be critically important.
When to Use:
• Outbreaks
• Need highest amount of resolution for strain comparison
Heather Carleton (CDC/DFWED)
Whole Genome MLST (wgMLST)
❑ Includes relevant open reading frames from across the pangenome.
❑ Some key definitions:
• Locus: open reading frame.
• Allele: specific gene sequence at each locus.
• Scheme: the set of loci and alleles included in straintyping definition.
• Nomenclature: standardized way of referring to alleles, loci and sequence types
• Curation: Updating/maintaining the allele database for a given genus/species based on established criteria.
Advantages: discriminatory; reproducible; results in consistent, hierarchical nomenclature; corollary information; data are reasonably portable.
Disadvantages: may have decreased discriminatory power, particularly with clustered genomic variations; requires initial dev/ongoing curation of allele db; computationally intensive.
Example software: SRST2 (https://github.com/katholt/srst2)
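[Figure: allele calls at four loci across three isolates. Sequences per locus: atgcct / atgcct / atgcgt; gcagga / gcacga / gcagct; gttatt / attatt / cttaat; aacttt / aactat / aacttt. Each distinct sequence at a locus receives an allele number, and each isolate's allele numbers form its profile (profiles shown: 1111, 1221, 2331).]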
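A minimal sketch of allele calling, using the four locus sequences from the figure. Alleles are numbered here in order of first appearance, which is an assumption; a curated scheme looks numbers up in an allele database, so the numbers produced below need not match those on the slide.

```python
def call_alleles(locus_sequences):
    """Number the distinct sequences at one locus in order of first appearance."""
    seen = {}
    calls = []
    for seq in locus_sequences:
        if seq not in seen:
            seen[seq] = len(seen) + 1
        calls.append(seen[seq])
    return calls


def allele_distance(profile_a, profile_b):
    """Number of loci at which two allele profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


# One list per locus; each list holds the sequence from isolates 1, 2 and 3.
loci = [
    ["atgcct", "atgcct", "atgcgt"],
    ["gcagga", "gcacga", "gcagct"],
    ["gttatt", "attatt", "cttaat"],
    ["aacttt", "aactat", "aacttt"],
]

calls_by_locus = [call_alleles(sequences) for sequences in loci]
profiles = [tuple(column) for column in zip(*calls_by_locus)]  # one profile per isolate
print(profiles)
print(allele_distance(profiles[0], profiles[1]))
```

Note that a locus counts as one difference whether the sequences differ by one base or several, which is the categorical behavior described in the caveats.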
Flavors of multilocus sequence type analysis
• Subsets of genes can be used to identify genus/species and lineage (rMLST/ MLST)
• Core genome MLST (cgMLST) uses the genes that are common to the vast majority of genomes belonging to a genus/species (for Listeria, 1748 genes belong to the core and are present in ~98% of isolates tested)
• wgMLST and hqSNP provide the most information per isolate genome
hq-SNP
cgMLST
Heather Carleton (CDC/DFWED)
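The core-genome idea can be sketched as a presence threshold over loci. The isolate and locus names below are hypothetical, and the small example uses a 75% cutoff only because four isolates cannot illustrate a ~98% cutoff meaningfully.

```python
def core_loci(presence, min_fraction=0.98):
    """Return loci found in at least min_fraction of isolates.

    presence: dict mapping isolate name to the set of loci detected in it.
    """
    n_isolates = len(presence)
    counts = {}
    for loci in presence.values():
        for locus in loci:
            counts[locus] = counts.get(locus, 0) + 1
    return sorted(locus for locus, count in counts.items() if count / n_isolates >= min_fraction)


# Hypothetical locus presence for four isolates.
presence = {
    "iso1": {"dnaA", "gyrB", "phage_int"},
    "iso2": {"dnaA", "gyrB"},
    "iso3": {"dnaA", "gyrB"},
    "iso4": {"dnaA"},
}
core = core_loci(presence, min_fraction=0.75)
print(core)
```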
Caveats of wgMLST analysis
Advantages:
• Phylogenetically informative
• All virulence, serotyping, and antibiotic resistance genes are pulled out as part of analysis
• Neutralizes the effects of horizontal gene transfer (event is only counted once rather than many times for hqSNPs)
• Allele calling is stable; can lead to nomenclature based on allele calls
Disadvantages:
• Initial assignment of alleles is computationally costly (doing assemblies before calling alleles)
• Comparing character data (allele numbers) rather than genetic data
• SNPs and indels treated equally; allele assignments categorical
• Requires ongoing and active curation for allele calls
When to Use:
• Surveillance
• Need high resolution
• Need to know serotype, virulence, AR determinants
• Need to communicate with partners using stable nomenclature
Heather Carleton (CDC/DFWED)
hqSNP vs wgMLST: Generalized Bioinformatic Process
hqSNP
1. Selection of appropriate reference genome
2. Review and QC of query genomes
3. Pairwise mapping of query genomes to reference
4. SNP calling
5. Filtering based on coverage, content, complexity, masking.
6. Compile SNP calls from all query sequences
7. Filtering based on allelic freq, informativeness, etc.
8. Tree building and visualization
wgMLST
1. Review and QC of query genomes
2. Individual de novo assembly of each query genome
3. ORF calling/extraction
4. Locus assignment/annotation
5. Selection of subset of ORFs based on loci in defined scheme
6. Sequence alignment of each ORF for allele determination
7. Assignment of sequence type based on allele profile and established nomenclature.
8. Tree building and visualization
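Steps 6 and 7 of the hqSNP process (compiling calls and filtering for informativeness) can be sketched as keeping only alignment columns that vary and contain no ambiguous bases. The query names and sequences are hypothetical, and treating any column with `N` or `-` as unusable is one possible filtering rule among several.

```python
def variable_sites(alignment):
    """Indices of columns that vary across samples and have no N or gap."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = []
    for i in range(length):
        column = {alignment[name][i] for name in names}
        if "N" in column or "-" in column:
            continue  # ambiguous or missing call in at least one sample
        if len(column) > 1:
            keep.append(i)
    return keep


def snp_matrix(alignment):
    """Reduce each sequence to its bases at the variable sites."""
    sites = variable_sites(alignment)
    return {name: "".join(seq[i] for i in sites) for name, seq in alignment.items()}


# Hypothetical query sequences already placed in reference coordinates.
alignment = {"q1": "ACGTNA", "q2": "ACCTAA", "q3": "ATCTAA"}
sites = variable_sites(alignment)
matrix = snp_matrix(alignment)
print(sites, matrix)
```

The reduced matrix is the input that tree-building tools work from in step 8.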
REFERENCE ATGTCGAATTCTTATGACTCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y D S S S I K V L K D I *
Insertion ATGTCGAATTCTTATGACAAATCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y D K S S S I K V L K D I *
Deletion ATGTCGAATTCTTATATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y I K V L K D I *
SNP ATGTCGTATTCTTATGACTCCTCTAGTATCAAAGTCCTGAAA-//-GATATTTAA M S Y S Y D S S S I K V L K D I *
Inversion ATGTCGAATTATTCTGACTCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N Y S D S S S I K V L K D I *
Duplication ATGTCGAATTCTTATAATTCTTATGACTCCTCCAGTATCAAAGTCCTGAAA-//-GATATTTAA M S N S Y N S Y D S S S I K V L K D I *
Don’t forget: Gene duplication. Frameshift. Extrachromosomal seq (eg: plasmids). Differential selection.
hqSNP vs. wgMLST: Impact of Genetic Change
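The two methods count the same change differently. This sketch uses the REFERENCE, SNP, and Insertion sequences from the slide (the portion before the `-//-` break): an hqSNP-style comparison counts each differing base, while a wgMLST-style comparison records only whether the allele differs. The function names are illustrative, not taken from any pipeline.

```python
def snp_count(reference, query):
    """Count differing bases between two equal-length sequences."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return sum(r != q for r, q in zip(reference, query))


def allele_difference(reference, query):
    """Categorical comparison: 1 if the allele differs at all, else 0."""
    return int(reference != query)


reference = "ATGTCGAATTCTTATGACTCCTCCAGTATCAAAGTCCTGAAA"
snp_variant = "ATGTCGTATTCTTATGACTCCTCTAGTATCAAAGTCCTGAAA"
insertion_variant = "ATGTCGAATTCTTATGACAAATCCTCCAGTATCAAAGTCCTGAAA"

print(snp_count(reference, snp_variant))            # per-base differences
print(allele_difference(reference, snp_variant))    # one allele change
print(allele_difference(reference, insertion_variant))
```

The insertion cannot be counted with `snp_count` without first aligning the sequences, yet it still registers as a single allele change.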
Comparison of hqSNP and wgMLST analyses
[Figure, wgMLST panel: Whole-genome Multilocus Sequence Typing (wgMLST) dendrogram; similarity scale 94 to 100; leaves labeled State 1 to State 5, Environmental Isolate, and Sprout; allele differences shown as median [min-max]: 87 [87-87], 5 [2-6], 5 [2-54], 55 [2-69], 64 [2-191]]
[Figure, hqSNP panel: hqSNP Analysis tree; scale bar 0.02; same leaf labels; annotations: 68 hqSNPs; 1 ± 1 hqSNPs [0-3]; 58 hqSNPs [0-72]; 65.5 hqSNPs [54-72]; 275 hqSNPs]
Isolate from MI also highly-related (not shown)
WGS analysis by Enteric Diseases Laboratory Branch, CDC; Heather Carleton (NCEZID/DFWED)
EXERCISE
Links and Further Reading
❑ Further reading
• Loman NJ, Pallen M. Twenty years of bacterial genome sequencing. Nat Rev Microbiol. 2015 Dec;13(12):787-94. doi: 10.1038/nrmicro3565
• Sintchenko V, Holmes EC. The role of pathogen genomics in assessing disease transmission. BMJ. 2015 May 11;350:h1314. doi: 10.1136/bmj.h1314.
• Kwong JC, McCallum N, Sintchenko V, Howden BP. Whole genome sequencing in clinical and public health microbiology. Pathology. 2015 Apr;47(3):199-210. doi: 10.1097/PAT.0000000000000235
❑ Free and open source software to get you started:
• NCBI Genome Workbench (http://www.ncbi.nlm.nih.gov/tools/gbench/)
• SRST2 (https://github.com/katholt/srst2)
• Prokka/snippy/abricate (https://github.com/tseemann)
• BWA (https://github.com/lh3)
• GATK (https://www.broadinstitute.org/gatk/)
• SPAdes (http://bioinf.spbau.ru/spades)
• HARVEST (https://harvest.readthedocs.org/en/latest/)
• kSNP (http://sourceforge.net/projects/ksnp/)
• PhyloViz (http://www.phyloviz.net/)
• Samtools (http://www.htslib.org)
• FreeBayes (https://github.com/ekg/freebayes)
• BIGSdb (http://pubmlst.org/software/database/bigsdb/)
• RAxML (https://github.com/stamatak/standard-RAxML)
• Mauve (http://darlinglab.org/mauve/mauve.html)
• GEPHI (https://gephi.org/)
• BioPerl (http://bioperl.org)
• BioPython (http://biopython.org)
• R (http://www.r-project.org)
Questions?
For more information please contact Centers for Disease Control and Prevention
1600 Clifton Road NE, Atlanta, GA 30333
Telephone: 1-800-CDC-INFO (232-4636)/TTY: 1-888-232-6348
E-mail: [email protected]
Web: http://www.cdc.gov/amd
Twitter: @dmaccannell @CDC_AMD
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Multiple Gene Copies
Gene: rrs (16S) 1471846 – 1473382
Gene: rrl (23S) 1473658 – 1476795
Gene: rrf (5S) 1476899 – 1477013
May impact analyses.
Need to be addressed in the wgMLST schema or masked in the hqSNP analysis workflow.
Example: Multi-species CRE Outbreak
❑ In this outbreak report, an IncFII plasmid bearing the NDM-1 carbapenemase was identified in three different genera of Enterobacteriaceae: Klebsiella, Escherichia and Enterobacter.
❑ Outbreak plasmid: 101kb conjugative plasmid, carrying blaNDM-1. Other beta-lactamases (blaCTX-M-15) found on additional plasmids.
❑ Standard molecular epidemiologic tools, looking primarily at the core genome, would be of limited value in understanding transmission dynamics.
❑ TAKEHOME MESSAGE: MGE plays an important role in both short term and long term bacterial evolution, and it may complicate and confound an investigation.
Torres-Gonzales AAC 2015 doi:10.1128/AAC.00055-15
[Figure: two network diagrams of the same cases, one showing epidemiologic connections and one showing WGS connections; nodes labeled 1 to 5 and A to C]
Epson et al. (ICHE 2014)
Building the tree
▪ Use the differences you identified by hqSNP or wgMLST to infer the relatedness or phylogeny
[Figure: tree of Isolates A, B, C and D with branch lengths in units of genetic change (11, 1, 6, 3, 5); node and leaf sequences shown: actgaatta, actgccggt, ggagaatta, ggagagtta, ggattatta, ggatccccc, ggataatta]
Isolate Sequence
A ggagagtta
B ggatccccc
C ggattatta
D actgccggt
ancestor actgaatta
Reading the trees
[Figure: the tree of Isolates A to D (branch lengths 11, 1, 6, 3, 5 genetic changes), annotated with the terms below]
• Leaf/Taxa
• Node: most recent common ancestor (for isolates B and C)
• Ancestral node
• Terminal node
• Outgroup/Root: related isolate (same PFGE pattern or 7-gene MLST) but not part of outbreak
• Clade
Rooted versus unrooted trees
[Figure: rooted tree of Isolates A, B, C and D (branch lengths 11, 1, 6, 3, 5) with the Root marked, next to an unrooted tree of Isolates A, B and C]
- Rooted trees have a unique node (created using D) that can be used to infer the most recent common ancestor of all the isolates in the tree
- Unrooted trees show relatedness without inferring the ancestry of isolates
Trees, branches, and leaves – more than one way to draw a tree
▪ Many different ways to display trees
▪ Branches that connect to the terminal node are the important branch lengths to indicate relatedness
Trees, branches and leaves – reading the trees
▪ Difference between similarity and relatedness on the tree
▪ Isolate A and C are more similar to each other than C and B are
▪ Isolate C and B are more related to each other than C and A are
[Figure: tree of Isolates A to D with branch lengths (11, 1, 6, 3, 5 genetic changes) and node sequences actgaatta, actgccggt, ggagaatta, ggagagtta, ggattatta, ggatccccc, ggataatta]
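The similarity half of this distinction can be checked directly from the isolate sequences given in the tree-building table. The sketch counts raw differences only; relatedness comes from the tree topology (shared ancestry), which a plain difference count does not capture.

```python
def differences(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Sequences for isolates A, B and C from the "Building the tree" table.
isolates = {"A": "ggagagtta", "B": "ggatccccc", "C": "ggattatta"}

a_vs_c = differences(isolates["A"], isolates["C"])
b_vs_c = differences(isolates["B"], isolates["C"])
a_vs_b = differences(isolates["A"], isolates["B"])
print(a_vs_c, b_vs_c, a_vs_b)
```

A and C differ at fewer positions than B and C, so they are more similar, even though the tree places B and C under a more recent common ancestor.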
Trees, branches and leaves – what does it mean for my outbreak investigation
▪ Epidemiologic data provides context to the tree – cannot rely on phylogenetic tree to identify outbreak source
[Figure: the same tree with leaves relabeled by isolate source: kale, spinach, stool, stool; branch lengths 11, 1, 6, 3, 5 genetic changes]
What do cowboys have to do with my WGS tree: bootstrap values
▪ Bootstrap values: confidence values for how the phylogenetic relationships are drawn (subsample and redraw the tree)
[Figure: alignment columns 1 to 9 of Isolates A, B and C (ggagagtta, ggatccccc, ggattatta) are resampled with replacement to make bootstrap replicates (100x-1000x), for example column orders 153456782, 213456879 and 331245678; a tree is drawn for each replicate, and the consensus tree reports the share of replicates supporting each grouping (67% in the example)]
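A toy sketch of the bootstrap procedure: resample alignment columns with replacement, regroup the isolates each time, and report how often a grouping recurs. The grouping rule used here (the pair sharing the most non-ancestral bases, with ties counted as no support) is an assumption standing in for a real tree-building method, so the support value it prints is illustrative and need not match the 67% in the figure.

```python
import random

ANCESTOR = "actgaatta"  # ancestor sequence from the tree-building table
ISOLATES = {"A": "ggagagtta", "B": "ggatccccc", "C": "ggattatta"}
PAIRS = [("A", "B"), ("A", "C"), ("B", "C")]


def shared_derived(seq_a, seq_b, ancestor):
    """Count positions where both isolates carry the same non-ancestral base."""
    return sum(a == b != anc for a, b, anc in zip(seq_a, seq_b, ancestor))


def best_pair(isolates, ancestor):
    """Return the single best-supported pair, or None if there is a tie."""
    scores = {
        pair: shared_derived(isolates[pair[0]], isolates[pair[1]], ancestor)
        for pair in PAIRS
    }
    top = max(scores.values())
    winners = [pair for pair, score in scores.items() if score == top]
    return winners[0] if len(winners) == 1 else None


def resample_columns(isolates, ancestor, rng):
    """Draw alignment columns with replacement, keeping rows in register."""
    n = len(ancestor)
    columns = [rng.randrange(n) for _ in range(n)]

    def pick(seq):
        return "".join(seq[i] for i in columns)

    return {name: pick(seq) for name, seq in isolates.items()}, pick(ancestor)


def bootstrap_support(pair, replicates=1000, seed=0):
    """Fraction of replicates in which `pair` is the uniquely best grouping."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        resampled, resampled_ancestor = resample_columns(ISOLATES, ANCESTOR, rng)
        if best_pair(resampled, resampled_ancestor) == pair:
            hits += 1
    return hits / replicates


original_grouping = best_pair(ISOLATES, ANCESTOR)
support = bootstrap_support(("B", "C"))
print(original_grouping, support)
```

Low support means the grouping depends on only a few columns, which is a reason to treat that branch of the tree with caution.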