Bioinformatics: Data Analysis
Erin H. Graf, PhD, D(ABMM)
Infectious Disease Diagnostics Laboratory, Children’s Hospital of Philadelphia
Department of Pathology and Laboratory Medicine, University of Pennsylvania
Outline
• Goal: From raw data to virus detection and/or typing/epidemiology
• Making sense of large amounts of data can be intimidating
• Focus on tools you can go home and use immediately
• Pre-processing
  • Quality analysis
  • Filtering steps and tools
• Processing
  • Bioinformatic tools when virus is known
  • Bioinformatic tools when virus is unknown (Agnostic)
• Interpretation
  • Important variables
  • Sources of error
• Standardization and Validation
Pre-processing steps
• Goal: Refine sequence data to contain only the best quality reads to reduce downstream errors
  • Selected examples appear in the Interpretation section
• Quality summary provided by instrument
• Barcode/adapter trimming, instrument-specific filtering
• Secondary custom filtering
Pre-processing: Instrument report
• Example of Illumina: Shotgun DNAseq
• Example of Ion: Targeted FFPE sequencing
Pre-processing: Filter and Trim
• Reads are filtered and excluded based on instrument quality cutoffs
• Fastq files can be downloaded or analyzed directly through Apps (Illumina) or Plugins (Ion)
• Secondary analysis through FastQC
• Bioinformatic suites capable of custom trimming
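As an illustration of what secondary custom filtering can look like, here is a minimal Python sketch that trims low-quality bases from the 3' end of each read and then drops reads that are too short or whose mean quality is too low. It assumes Phred+33 encoded, four-line FASTQ records; the function names and the cutoffs (`min_q=20`, `min_len=30`) are hypothetical choices for the example, not settings taken from any instrument or suite named above.

```python
import io


def parse_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def trim_3prime(seq, qual, min_q=20):
    """Remove bases from the 3' end until a base with quality >= min_q is reached."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(handle, min_q=20, min_len=30):
    """Trim each read, then keep it only if it is long enough and its mean quality >= min_q."""
    kept = []
    for header, seq, qual in parse_fastq(handle):
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) < max(min_len, 1):  # also guards against empty reads
            continue
        scores = phred_scores(qual)
        if sum(scores) / len(scores) >= min_q:
            kept.append((header, seq, qual))
    return kept


# Toy input: r1 has two low-quality trailing bases, r2 is low quality throughout.
example = "@r1\nACGTACGT\n+\nIIIIII##\n@r2\nACGT\n+\n####\n"
kept = filter_reads(io.StringIO(example), min_q=20, min_len=4)
```

In this toy input, read r1 loses its two trailing bases and is kept, while r2 is trimmed away entirely. Over-aggressive cutoffs shorten reads, which is one of the error sources listed under Interpretation.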
Processing
• Trimmed, quality filtered reads now ready for alignment
• Bioinformatic tools for known virus
• Bioinformatic tools for unknown virus (Agnostic)
Bioinformatics tools: known virus
• Goal: Generate full-length virus sequence for downstream analysis
  • Typing, epidemiology, resistance marker analysis
• Align to single (or list of) reference genome(s)
  • Various alignment algorithms
• Custom trimming
• Creates Sequence Alignment Map (SAM) file
[Figure: alignment inputs, a virus reference genome and Fastq files]
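To make the SAM output more concrete, here is a minimal Python sketch that counts primary mapped reads per reference sequence from the text lines of a SAM file. It relies only on the standard SAM layout (header lines starting with "@", tab-separated alignment lines with FLAG in column 2 and the reference name in column 3). The function name and the reference names in the toy input are hypothetical.

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Count primary, mapped alignments per reference name in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM alignment lines have 11 mandatory fields
            continue
        flag = int(fields[1])
        rname = fields[2]
        if flag & 0x4 or rname == "*":  # 0x4 = read unmapped
            continue
        if flag & 0x900:  # 0x100 = secondary, 0x800 = supplementary
            continue
        counts[rname] += 1
    return counts


# Toy SAM content with hypothetical reference names "refA" and "refB".
toy_sam = [
    "@SQ\tSN:refA\tLN:1000",
    "r1\t0\trefA\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t16\trefA\t50\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r4\t256\trefB\t5\t0\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
mapped = count_mapped_reads(toy_sam)
```

Skipping secondary and supplementary alignments keeps each read from being counted more than once. Whether that is the right choice depends on the aligner and the question being asked.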
Bioinformatics tools: known virus
• Bioinformatic suites
  • Geneious
  • CLC Workbench
  • BioNumerics
  • Others
• Graphical User Interface (GUI)
  • Very intuitive and user friendly
• Alignment plugins
• Pull reference genomes from NCBI
Bioinformatics tools: known virus
• Downstream analysis tools
  • Annotation
  • Phylogenetic analysis
Figure credits: Giberson et al (2011) NAR; Stephanie Mitchell, PhD
Bioinformatic tools: unknown virus (agnostic)
• Goal: Detect any virus sequence present in a clinical sample
• Bioinformatic suites previously mentioned
  • Manually curate list of reference sequences to align reads against
• Web-based metagenomic pipelines
  • OneCodex
  • Taxonomer
  • CosmosID
Pipeline feature                     OneCodex      Taxonomer     CosmosID
Analysis time*                       ~8 minutes    ~5 minutes    ~5 minutes
Number of virus genomes              5,137         >90,000       5,025
Comparison between samples           Yes           No            Yes
Upload many samples at once          Yes           No            Yes
Searchable reference genome list     Yes           No            No
Independent view of virus hits       No            Yes           Yes
Visual manipulation                  No            Yes           Yes
Connection to sequencer's cloud      No            No            Yes
*For 1 sample with 2 million reads
Pipeline comparison: Adenovirus from a conjunctival swab
Bioinformatic method                   Reads of Adenovirus    Type of Adenovirus
OneCodex                               39,175                 B
Taxonomer                              5,833                  B
CosmosID                               39,930                 B
Manual alignment and BLAST analysis    38,573                 B, serotype 3
Pipeline comparison: Enterovirus from a nasopharyngeal aspirate
Bioinformatic method                   Reads of Enterovirus    Type of Enterovirus
OneCodex                               2                       Not typed
Taxonomer                              609                     EV-D68
CosmosID                               1,498                   EV-D68
Manual alignment and BLAST analysis    1,174                   EV-D68
Interpretation
• Goal: Make accurate prediction with sequence data
  • Is the patient infected with virus “X”?
  • Is this SNP real?
• Important variables
  • Number of reads
  • Location of reads
  • Depth of coverage
• Sources of error
  • PCR errors during library prep or cluster generation
  • Read length (over-trimming)
  • Sequencing errors
  • Mapping errors
  • Contamination
Interpretation
• You have results from a web-based pipeline, now what?
  • How do you decide what is real/meaningful?
• Manual confirmation is recommended
  • Can differentiate real hits from false-positives
• Genotyping/resistance mutation detection
  • SNPs can be the result of quality issues
Interpretation: Important variables
• Number of reads
• Location of reads
  • All in one region?
• Depth of Coverage
• Average coverage across the viral genome
• Coverage at each individual base
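The difference between average coverage and per-base coverage can be sketched in a few lines of Python. This is a minimal illustration, assuming each aligned read is reduced to a hypothetical `(start, end)` interval on the reference (0-based, end-exclusive). The function names and the toy genome length are made up for the example.

```python
def per_base_depth(genome_length, alignments):
    """Return the depth at every reference position from (start, end) read intervals."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, genome_length)
        if start >= end:
            continue
        diff[start] += 1  # a read begins covering here
        diff[end] -= 1    # and stops covering here
    depth = []
    running = 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth


def coverage_summary(depth, min_depth=1):
    """Return (mean depth, fraction of positions with depth >= min_depth)."""
    n = len(depth)
    mean_depth = sum(depth) / n
    breadth = sum(1 for d in depth if d >= min_depth) / n
    return mean_depth, breadth


# Two toy read sets on a 10-base "genome": spread out vs. piled up in one region.
spread = per_base_depth(10, [(0, 5), (3, 8)])
piled = per_base_depth(10, [(0, 2)] * 5)
spread_summary = coverage_summary(spread)
piled_summary = coverage_summary(piled)
```

Both toy read sets have the same average depth, but the piled-up set covers only a fifth of the positions. That is why the location of reads and the coverage at each base are listed separately from the average.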
Interpretation: Sources of Error
• PCR errors during library prep or cluster generation
• Read length (over-trimming)
• Sequencing errors
• Mapping errors
• Contamination
• Positive and negative controls can help with some of these issues
Mapping error example
[Figure: reads mapped to the measles virus reference genome. A = in silico Dolphin morbillivirus; B = in silico low-level measles virus]
Schlaberg et al (2017) Arch Path & Lab Med
Contamination example
[Figure: percent false-positive Rhinovirus reads in sample with contamination (y-axis) versus Rhinovirus reads in contaminating sample (x-axis)]
A sample with 652,676 Rhinovirus reads contaminates a neighbor with 100 Rhinovirus reads; the neighbor was sequenced at a depth of 1 million reads
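The arithmetic behind this example can be written out as a short Python sketch. It assumes the 100 reads in the neighbor all came from the high-titer sample. The function name is hypothetical, and the two ratios are one reasonable way to express the numbers above, not necessarily the quantity plotted in the figure.

```python
def contamination_rates(source_reads, contaminating_reads, neighbor_depth):
    """Return (carryover fraction of the source, fraction of the neighbor's reads)."""
    carryover = contaminating_reads / source_reads
    fraction_of_neighbor = contaminating_reads / neighbor_depth
    return carryover, fraction_of_neighbor


# Numbers taken from the example above.
carryover, fraction_of_neighbor = contamination_rates(652_676, 100, 1_000_000)
print(f"Carryover from source sample: {carryover:.4%}")
print(f"Share of neighbor's reads:    {fraction_of_neighbor:.4%}")
```

Roughly 0.015% of the source sample's Rhinovirus reads ending up in the neighbor is enough to produce 100 reads there, which is 0.01% of the neighbor's 1 million reads. A low-level hit next to a very strong positive therefore deserves extra scrutiny.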
Standardization and Validation
• NY State validation guidelines
  • Minimum of Q20 per base and Q20 per mapped read
• FDA Draft Guidance
  • Need for regulatory-grade sequence database
  • Cutoff values for positivity
• ARUP, UCSF, ASM PPC & CAP MRC clinical validation guidance
  • 5 million total reads per sample: cutoff in CSF
  • Minimum of 3 non-overlapping virus gene reads: cutoff for positive result
• ASM-AAM NGS Report
  • Interpretive guidelines
  • Quality standards
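The cutoffs above can be made concrete with a small Python sketch. The Phred relationship (error probability = 10^(-Q/10), so Q20 corresponds to a 1% error rate) is standard. Everything else is an assumption made for illustration: reads are represented as hypothetical `(start, end)` intervals on the viral genome (end-exclusive), and the 5 million read and 3 non-overlapping read cutoffs are applied as simple thresholds. The cited guidance documents should be consulted for how these criteria are actually defined and applied.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def count_non_overlapping(intervals):
    """Greedy count of the maximum number of pairwise non-overlapping intervals."""
    count = 0
    last_end = None
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start >= last_end:
            count += 1
            last_end = end
    return count


def passes_cutoffs(total_reads, virus_read_intervals,
                   min_total_reads=5_000_000, min_non_overlapping=3):
    """Apply the two illustrative thresholds: sequencing depth and non-overlapping virus reads."""
    enough_depth = total_reads >= min_total_reads
    enough_reads = count_non_overlapping(virus_read_intervals) >= min_non_overlapping
    return enough_depth and enough_reads


# Toy read placements on a viral genome (hypothetical coordinates).
toy_reads = [(0, 100), (50, 150), (100, 200), (300, 400)]
q20_error = phred_to_error_probability(20)
n_independent = count_non_overlapping(toy_reads)
positive = passes_cutoffs(6_000_000, toy_reads)
```

Sorting by end coordinate and greedily taking each interval that starts at or after the previous kept interval's end is the classic interval-scheduling approach. It gives the largest possible set of mutually non-overlapping reads.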
Conclusions
• Lack of standardized NGS analysis protocols
  • Laboratories should look to published guidance
• All analysis pipelines have limitations
  • Number and diversity of curated sequences
  • False-positive hits
• Data should be scrutinized
  • Manual confirmation of pipeline hits
  • Quality filtering
Resources
References:
• Wadsworth validation guidelines: http://www.wadsworth.org/sites/default/files/WebDoc/ID WGS NGS Molecular Guidelines for Isolates_0.pdf.
• ARUP, UCSF, ASM PPC, CAP MRC Validation guidance: Schlaberg RS, Chiu CY, et al (2017) Archives of Path and Lab Medicine. doi: http://dx.doi.org/10.5858/arpa.2016-0539-RA
• FDA Draft guidance: https://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/UCM500441.pdf
• Broad Institute Best Practices: https://software.broadinstitute.org/gatk/best-practices/
• ASM-AAM Report: https://www.asm.org/images/Colloquia-report/NGS_Report.pdf
Talk to your genomics colleagues; they have probably encountered many of the same issues already