
Bioinformatics: Data Analysis

Erin H. Graf, PhD, D(ABMM)

Infectious Disease Diagnostics Laboratory, Children’s Hospital of Philadelphia

Department of Pathology and Laboratory Medicine, University of Pennsylvania

Outline

• Goal: Go from raw data to virus detection and/or typing/epidemiology

  • Making sense of large amounts of data can be intimidating

  • Focus on tools you can go home and use immediately

• Pre-processing

  • Quality analysis

  • Filtering steps and tools

• Processing

  • Bioinformatic tools when virus is known

  • Bioinformatic tools when virus is unknown (agnostic)

• Interpretation

  • Important variables

  • Sources of error

• Standardization and Validation

Pre-processing: Raw data

Pre-processing steps

• Goal: Refine sequence data to contain only the best-quality reads to reduce downstream errors

  • Select examples in Interpretation section

• Quality summary provided by instrument

• Bar code/adapter trim, instrument-specific filtering

• Secondary custom filtering

Pre-processing: Instrument report

• Example of Illumina: shotgun DNAseq

• Example of Ion: targeted FFPE sequencing

Pre-processing: Filter and Trim

• Reads are filtered and excluded based on instrument quality cutoffs

• Fastq files can be downloaded or analyzed directly through Apps (Illumina) or Plugins (Ion)

• Secondary analysis through FastQC

• Bioinformatic suites capable of custom trimming

[Figure: Q score plot, targeted DNAseq from FFPE]
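As a rough illustration of what secondary custom filtering does, the sketch below trims low-quality bases from the 3' end of each read and then keeps only reads that are long enough and have an acceptable mean quality. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1% chance that a base call is wrong. The sketch assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the default cutoffs (Q20, 30 bases) are illustrative assumptions, not settings taken from any instrument or suite.

```python
from typing import Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: List[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        yield lines[i].strip(), lines[i + 1].strip(), lines[i + 3].strip()


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> Tuple[str, str]:
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def filter_reads(reads, min_mean_q: float = 20.0, min_len: int = 30) -> List[Read]:
    """Trim each read, then keep it if it is long enough and of good mean quality."""
    kept = []
    for header, seq, qual in reads:
        seq, qual = trim_3prime(seq, qual)
        if len(seq) < max(min_len, 1):  # too short (or empty) after trimming
            continue
        scores = phred_scores(qual)
        if sum(scores) / len(scores) >= min_mean_q:
            kept.append((header, seq, qual))
    return kept


# Hypothetical two-read example: one high-quality read, one low-quality read.
example = ["@r1", "ACGT" * 10, "+", "I" * 40,
           "@r2", "ACGT" * 10, "+", "#" * 40]
print([header for header, _, _ in filter_reads(parse_fastq(example))])
```

Note that aggressive trimming shortens reads, which is itself listed later as a source of error (over-trimming), so cutoffs like these are a trade-off rather than a fixed rule.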

Processing

• Trimmed, quality filtered reads now ready for alignment

• Bioinformatic tools for known virus

• Bioinformatic tools for unknown virus (Agnostic)

Bioinformatics tools: known virus

• Goal: Generate full-length virus sequence for downstream analysis

  • Typing, epidemiology, resistance marker analysis

• Align to single (or list of) reference genome(s)

  • Various alignment algorithms

• Custom trimming

• Creates Sequence Alignment Map (SAM) file

Virus reference genome:

Fastq files:
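To make the SAM output more concrete, the sketch below counts primary mapped reads per reference sequence directly from SAM text. It relies only on the standard SAM column layout (FLAG in column 2, reference name in column 3) and the standard FLAG bits for unmapped, secondary, and supplementary alignments. The function name, reference name, and example records are made up for illustration.

```python
from collections import Counter
from typing import Iterable

# FLAG bits defined by the SAM format
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_mapped_reads(sam_lines: Iterable[str]) -> Counter:
    """Count primary mapped alignments per reference name."""
    counts: Counter = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rname = fields[2]
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY) or rname == "*":
            continue  # count each read once, and only if it mapped
        counts[rname] += 1
    return counts


# Hypothetical records: two mapped reads and one unmapped read.
example_sam = [
    "@SQ\tSN:virus_ref\tLN:35000",
    "r1\t0\tvirus_ref\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tvirus_ref\t250\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
]
print(count_mapped_reads(example_sam))
```

In practice the GUI suites report these counts for you; the point here is only that a SAM file is plain tab-delimited text that can be inspected when a result needs manual confirmation.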


Bioinformatic suites

• Geneious

• CLC Workbench

• BioNumerics

• Others

• Graphical User Interface (GUI)

  • Very intuitive and user friendly

• Alignment plugins

  • Pull reference genomes from NCBI

• Downstream analysis tools

  • Annotation

  • Phylogenetic analysis

Giberson et al (2011) NAR

Stephanie Mitchell, PhD

Bioinformatic tools: unknown virus (agnostic)

• Goal: Detect any virus sequence present in a clinical sample

• Bioinformatic suites previously mentioned

  • Manually curate list of reference sequences to align reads against

• Web-based metagenomic pipelines

  • OneCodex

  • Taxonomer

  • CosmosID

k-mer classification

Wood & Salzberg (2014), Genome Biology 15(3)
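The idea behind k-mer classification can be sketched with a toy example: index every k-mer of each reference, then assign a read to the reference whose k-mers it shares most. This is a deliberately simplified illustration and not how any of the pipelines named here is implemented. In particular, Kraken (Wood & Salzberg) maps each k-mer to the lowest common ancestor of the taxa containing it, whereas this sketch has no taxonomy and simply ignores k-mers shared between references. It also ignores reverse-complement reads. The reference names, sequences, and the `min_hits` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, Optional, Set


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer in seq (forward strand only)."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of reference names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def classify_read(read: str, index: Dict[str, Set[str]], k: int,
                  min_hits: int = 2) -> Optional[str]:
    """Return the reference with the most unique k-mer hits, or None."""
    votes: Counter = Counter()
    for kmer in kmers(read, k):
        hits = index.get(kmer, ())
        if len(hits) == 1:  # only k-mers unique to one reference vote
            votes[next(iter(hits))] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    best, best_n = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == best_n
    if best_n < min_hits or tied:
        return None  # too little evidence, or ambiguous
    return best


# Made-up references and reads, for illustration only.
refs = {"virusA": "ACGTACGGTTCAGGCTAAGC", "virusB": "TTGACCATGGCATCGATCGA"}
idx = build_index(refs, k=5)
print(classify_read("ACGGTTCAGG", idx, k=5))
print(classify_read("GGGGGGGGGG", idx, k=5))
```

The sketch also hints at why the number and diversity of reference genomes matters: a read can only be assigned to something that is in the index.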

OneCodex

Taxonomer

CosmosID

| Pipeline feature | OneCodex | Taxonomer | CosmosID |
|---|---|---|---|
| Analysis time* | ~8 minutes | ~5 minutes | ~5 minutes |
| Number of virus genomes | 5,137 | >90,000 | 5,025 |
| Comparison between samples | Yes | No | Yes |
| Upload many samples at once | Yes | No | Yes |
| Searchable reference genome list | Yes | No | No |
| Independent view of virus hits | No | Yes | Yes |
| Visual manipulation | No | Yes | Yes |
| Connection to sequencer’s cloud | No | No | Yes |

*For 1 sample with 2 million reads

Pipeline comparison: Adenovirus from a conjunctival swab


| Bioinformatic method | Reads of Adenovirus | Type of Adenovirus |
|---|---|---|
| OneCodex | 39,175 | B |
| Taxonomer | 5,833 | B |
| CosmosID | 39,930 | B |
| Manual alignment and BLAST analysis | 38,573 | B, serotype 3 |

Pipeline comparison: Enterovirus from a nasopharyngeal aspirate


| Bioinformatic method | Reads of Enterovirus | Type of Enterovirus |
|---|---|---|
| OneCodex | 2 | Not typed |
| Taxonomer | 609 | EV-D68 |
| CosmosID | 1,498 | EV-D68 |
| Manual alignment and BLAST analysis | 1,174 | EV-D68 |

Interpretation

• Goal: Make accurate predictions from sequence data

  • Is the patient infected with virus “X”?

  • Is this SNP real?

• Important variables

  • Number of reads

  • Location of reads

  • Depth of coverage

• Sources of error

  • PCR errors during library prep or cluster generation

  • Read length (over-trimming)

  • Sequencing errors

  • Mapping errors

  • Contamination

Interpretation

• You have results from a web-based pipeline; now what?

  • How do you decide what is real/meaningful?

• Manual confirmation is recommended

  • Can differentiate real hits from false positives

• Genotyping/resistance mutation detection

  • SNPs can be the result of quality issues
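One way to check whether an apparent SNP is just a quality artifact is to recount the bases at that position using only high-quality base calls. The sketch below does this for a simple list of (base, Phred quality) pairs at one position. The function names and all thresholds (Q20, minimum depth of 10, minimum variant fraction of 20%) are hypothetical values chosen for illustration, not validated cutoffs.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pileup = List[Tuple[str, int]]  # (base call, Phred quality) at one position


def allele_fractions(pileup: Pileup, min_q: int = 20) -> Dict[str, float]:
    """Fraction of each base among calls with quality >= min_q."""
    counts = Counter(base.upper() for base, q in pileup if q >= min_q)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


def snp_supported(pileup: Pileup, ref_base: str, min_q: int = 20,
                  min_depth: int = 10, min_fraction: float = 0.2) -> bool:
    """True if a non-reference base is well supported by high-quality calls."""
    good_depth = sum(1 for _, q in pileup if q >= min_q)
    if good_depth < min_depth:
        return False  # not enough high-quality coverage to judge
    fractions = allele_fractions(pileup, min_q)
    return any(base != ref_base.upper() and frac >= min_fraction
               for base, frac in fractions.items())


# Made-up pileups: G supported by good calls vs. G seen only in poor calls.
real_variant = [("A", 35)] * 8 + [("G", 36)] * 4 + [("G", 5)] * 6
artifact = [("A", 35)] * 12 + [("G", 5)] * 6
print(snp_supported(real_variant, "A"), snp_supported(artifact, "A"))
```

In the second pileup the variant base appears six times but only in low-quality calls, so it disappears once the quality filter is applied.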

Interpretation: Important variables

• Number of reads

• Location of reads

  • All in one region?

• Depth of Coverage

• Average coverage across the viral genome

• Coverage at each individual base
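These variables can be computed from the alignment positions alone. The sketch below takes read alignments as 0-based, half-open (start, end) intervals on the reference and returns depth at each base, average depth, and breadth (the fraction of the genome covered at least once). The interval convention, function names, and toy numbers are assumptions made for the example; real alignments would also need CIGAR handling for gaps and clipping.

```python
from typing import Dict, List, Tuple


def per_base_depth(genome_length: int,
                   alignments: List[Tuple[int, int]]) -> List[int]:
    """Depth at each reference position from [start, end) alignment intervals."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1  # coverage begins here
            diff[end] -= 1    # and stops just before 'end'
    depth, running = [], 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth


def coverage_summary(depth: List[int], min_depth: int = 1) -> Dict[str, float]:
    """Average depth and breadth of coverage across the genome."""
    n = len(depth)
    return {
        "mean_depth": sum(depth) / n,
        "breadth": sum(1 for d in depth if d >= min_depth) / n,
        "min_depth": min(depth),
        "max_depth": max(depth),
    }


# Toy 10-base genome with three reads stacked at one end.
depth = per_base_depth(10, [(0, 4), (2, 6), (2, 6)])
print(depth)
print(coverage_summary(depth))
```

In the toy example all reads pile up in the first six bases, so the average depth looks reasonable while breadth is only 60%. That is the "all in one region?" pattern that should prompt a closer look.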

Depth of coverage example

Low read count example

Manual alignment to Torque teno midi virus 2 reference genome = 0 reads

Interpretation: Sources of Error

• PCR errors during library prep or cluster generation

• Read length (over-trimming)

• Sequencing errors

• Mapping errors

• Contamination

• Positive and negative controls can help with some of these issues

Mapping error example

[Figure: reads mapped to the measles virus reference genome. A = in silico Dolphin morbillivirus; B = in silico low-level measles virus]

Schlaberg et al (2017) Arch Path & Lab Med

Contamination example

[Figure: percent false-positive Rhinovirus reads in sample with contamination (y-axis) vs. Rhinovirus reads in contaminating sample (x-axis)]

A sample with 652,676 reads of Rhinovirus contaminates a neighbor with 100 reads of Rhinovirus

The neighbor was sequenced at a depth of 1 million reads
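The arithmetic behind this example can be worked through in a few lines, using only the three numbers given above. The variable and function names are illustrative.

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


source_rhinovirus_reads = 652_676   # reads in the contaminating sample
carried_over_reads = 100            # false-positive reads in the neighbor
neighbor_total_reads = 1_000_000    # sequencing depth of the neighbor

# Share of the source sample's Rhinovirus reads that ended up in the neighbor
carryover_pct = percent(carried_over_reads, source_rhinovirus_reads)

# Share of the neighbor's own reads that are false-positive Rhinovirus
neighbor_pct = percent(carried_over_reads, neighbor_total_reads)

print(f"Carry-over from source: {carryover_pct:.4f}%")
print(f"False-positive share of neighbor reads: {neighbor_pct:.4f}%")
```

Roughly 0.015% of the source sample's Rhinovirus reads crossed over, yet that is enough to give the neighbor 100 reads (0.01% of its data), which is more than some true low-level positives produce.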

Standardization and Validation

• NY State validation guidelines

  • Minimum of Q20 per base and Q20 per mapped read

• FDA Draft Guidance

  • Need for regulatory-grade sequence database

  • Cutoff values for positivity

• ARUP, UCSF, ASM PPC & CAP MRC clinical validation guidance

  • Cutoff of 5 million total reads per sample in CSF

  • Cutoff for a positive result: minimum of 3 non-overlapping virus gene reads

• ASM-AAM NGS Report

  • Interpretive guidelines

  • Quality standards
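Cutoffs like the two quoted from the clinical validation guidance can be expressed as a simple check. The sketch below is one possible reading, not the guidance's own algorithm: it assumes "non-overlapping" means reads whose alignment intervals share no reference positions, and it finds the largest such set with a standard greedy pass over intervals sorted by end position. The function names, the (start, end) half-open interval convention, and the returned dictionary keys are assumptions.

```python
from typing import Dict, List, Tuple


def max_non_overlapping(intervals: List[Tuple[int, int]]) -> int:
    """Largest number of mutually non-overlapping [start, end) intervals."""
    count = 0
    last_end = None
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start >= last_end:
            count += 1
            last_end = end
    return count


def apply_cutoffs(total_reads: int,
                  virus_alignments: List[Tuple[int, int]],
                  min_total_reads: int = 5_000_000,
                  min_regions: int = 3) -> Dict[str, bool]:
    """Check sample adequacy and virus positivity against the quoted cutoffs."""
    return {
        "enough_total_reads": total_reads >= min_total_reads,
        "virus_positive": max_non_overlapping(virus_alignments) >= min_regions,
    }


# Made-up alignments: four reads, of which three do not overlap each other.
reads = [(0, 100), (50, 150), (100, 200), (300, 400)]
print(apply_cutoffs(6_000_000, reads))
```

Many reads stacked on one short region would fail this check even with a high read count, which ties the cutoff back to the location-of-reads variable discussed under Interpretation.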

Conclusions

• Lack of standardized NGS analysis protocols

  • Laboratories should look to published guidance

• All analysis pipelines have limitations

  • Number and diversity of curated sequences

  • False-positive hits

• Data should be scrutinized

  • Manual confirmation of pipeline hits

  • Quality filtering

Resources

References:

• Wadsworth validation guidelines: http://www.wadsworth.org/sites/default/files/WebDoc/ID WGS NGS Molecular Guidelines for Isolates_0.pdf

• ARUP, UCSF, ASM PPC, CAP MRC Validation guidance: Schlaberg RS, Chiu CY, et al (2017) Archives of Path and Lab Medicine. doi: http://dx.doi.org/10.5858/arpa.2016-0539-RA

• FDA Draft guidance: https://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/UCM500441.pdf

• Broad Institute Best Practices: https://software.broadinstitute.org/gatk/best-practices/

• ASM-AAM Report: https://www.asm.org/images/Colloquia-report/NGS_Report.pdf

Talk to your genomics colleagues; they have probably encountered many of the same issues already