Error Correction in High-Throughput Datasets

Dale Beach, Longwood University

Lisa Scheifele, Loyola University Maryland

Next-generation sequencing has revolutionized both biological research and clinical medicine: entire human genomes are now sequenced to predict drug responsiveness and to diagnose disease (for example, Choi 2009).

The advent of next-generation sequencing requires students and researchers to work with large datasets.

Students must be able to address errors in large datasets.

http://www.pnas.org/content/106/45/19096/F3.expansion.html

http://www.pnas.org/content/106/45/19096.full.pdf+html

In contrast to traditional Sanger sequencing, next-generation sequencing datasets have shorter read lengths and higher error rates. This creates challenges for downstream analysis: because a run produces so many reads, even a small error rate leaves a large number of reads containing at least one error. For example, Illumina MiSeq reads have an error rate of about 0.1% per base (Glenn 2011), yet at that rate only ~86% of 150 bp reads (0.999^150 ≈ 0.86) are expected to be error-free.
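The arithmetic behind that estimate can be checked with a few lines of Python. This is a minimal sketch that assumes errors are independent and equally likely at every position, which real sequencing data do not strictly obey. The function name and the one-million-read run size are illustrative assumptions, not values from the module.

```python
def error_free_fraction(per_base_error: float, read_length: int) -> float:
    """Probability that a read contains no errors.

    Assumes each base is wrong independently with the same probability.
    """
    return (1.0 - per_base_error) ** read_length


# Values quoted in the text: 0.1% per-base error rate, 150 bp reads.
fraction_clean = error_free_fraction(0.001, 150)

# Hypothetical run size, chosen only to make the scale concrete.
total_reads = 1_000_000
reads_with_errors = round((1.0 - fraction_clean) * total_reads)

print(f"Fraction of error-free reads: {fraction_clean:.3f}")
print(f"Expected reads with at least one error: {reads_with_errors:,}")
```

Changing `read_length` shows why longer reads are more likely to carry at least one error at the same per-base rate.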

Sequencing error in read

Background

This module is designed for a genetics or molecular biology class. It will require 3 lecture/seminar class periods, with optional additional Linux-based lab activities.

Prior to beginning this module, students should be familiar with:

Sample preparation techniques for DNA sequencing

DNA replication and the enzymes that synthesize DNA

Nucleic acid and nucleotide structure

Research Goals

Initial evaluation of the quality of eukaryotic genome sequencing data

Implementation of error correction techniques

Comparison of the quality of sequencing data before and after error correction

Sequencing Requirements

Completed sequencing data for a small eukaryotic genome, generated on an Illumina platform

If students will not be performing command-line programming themselves, these data should be analyzed in advance with:

Jellyfish, to produce k-mer frequency data that students can use to generate a histogram in Excel

Quake, to perform error correction so that students can be provided with pre- and post-error-correction datasets

Student Learning Goals

At the completion of this module, students will be able to:

Describe the important differences between high-throughput and traditional (low-throughput) experiments

Explain the reasons for variations in the quality of high-throughput datasets

Utilize computational tools to quantify errors in sequencing data

Interpret the quality of a sequencing experiment and be able to implement effective quality control measures

Computer Requirements

Excel or another analytical package to create a k-mer frequency distribution

Galaxy to create a boxplot of PHRED33 scores

Optional: Quake and Jellyfish on a Linux system to generate k-mer data and perform error correction

Vision and Change Competencies

This module will develop students’ abilities to:

Apply the process of science

▪ Design an experiment, from methodological design through data analysis

▪ Analyze and interpret data

Use modeling and simulation

▪ Design experimental strategies and predict outcomes

Use quantitative reasoning

▪ Depict data using histograms and boxplots

▪ Interpret graphs and use the results of the analysis to modify error correction strategies

Timeline: Class 1

Introductory lecture and data upload

Introduction to sequencing history and platforms

Discuss typical sources of error in sequencing reads

Discuss sequence output formats and PHRED33 scores

Upload raw data to Galaxy

Optional: Quake in Linux to manipulate parameters and improve quality
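As a concrete illustration of the FASTQ format and PHRED33 scores discussed in this class, the sketch below decodes the quality line of a single record. The record is made up and the function names are hypothetical. The only facts relied on are that each quality character encodes a score Q as its ASCII code minus 33, and that Q = -10 log10(P), where P is the estimated probability that the base call is wrong.

```python
def phred33_to_scores(quality_line: str) -> list[int]:
    """Convert a PHRED33-encoded quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q: int) -> float:
    """Estimated probability that a base with quality score q is wrong."""
    return 10 ** (-q / 10)


# A made-up four-line FASTQ record: header, bases, separator, qualities.
record = "@example_read\nGATTACAGATTACA\n+\nIIIIIHHH5552##"
header, bases, separator, qualities = record.split("\n")

scores = phred33_to_scores(qualities)
for base, q in zip(bases, scores):
    print(f"{base}  Q={q:2d}  P(error)={error_probability(q):.4f}")
```

In this toy record the scores fall toward the end of the read, which is the pattern students are asked to look for in real data.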

http://www.nimr.mrc.ac.uk/mill-hill-essays/bringing-it-all-back-home-next-generation-sequencing-technology-and-you




Timeline: Class 2

Setting up analysis and adjusting parameters

Introduce software packages that can be used to assess data quality

Demonstrate breaking sequencing reads into k-mers

Use Excel or Jellyfish to create a k-mer graph

Use Excel or Jellyfish to create a k-mer graph after adjusting error correction parameters (variations in k-mer size)

K-mer frequency distribution
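Breaking reads into k-mers and building the k-mer frequency distribution can be demonstrated on a toy example before moving to real data. The sketch below is plain Python, not Jellyfish: the three reads are invented, the function names are hypothetical, and reverse-complement k-mers are not merged as dedicated counters can do.

```python
from collections import Counter


def kmers(read: str, k: int) -> list[str]:
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def frequency_histogram(counts: Counter) -> dict[int, int]:
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return dict(sorted(Counter(counts.values()).items()))


# Toy reads; the last base of the third read is an invented sequencing error.
reads = ["ACGTACGT", "CGTACGTA", "ACGTACGA"]
counts = count_kmers(reads, k=4)
histogram = frequency_histogram(counts)

for multiplicity, n_distinct in histogram.items():
    print(f"{n_distinct} distinct k-mer(s) occur {multiplicity} time(s)")
```

The k-mer created by the error (`ACGA`) is seen only once, while genuine k-mers are seen several times. On real data this shows up as a spike of low-multiplicity k-mers at the left edge of the histogram, and rerunning with a different `k` shows how the distribution shifts.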

Timeline: Class 3

Assessing quality

Discussion of using PHRED33 scores to assess data quality

Create boxplots of PHRED33 scores in Galaxy for raw data

Create boxplots of PHRED33 scores in Galaxy for data after Quake correction

Students can also compare outcomes following Quake correction with different parameters

Raw Data

Data post Quake correction
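A per-position quality boxplot summarizes, for each position in the read, the spread of quality scores across all reads. The sketch below computes those five numbers (minimum, first quartile, median, third quartile, maximum) for a handful of invented quality strings. It is an illustration of what such a plot represents, not a reproduction of the Galaxy tool. The function name and data are hypothetical, and reads are assumed to be of equal length.

```python
import statistics


def per_position_summary(quality_lines: list[str]) -> list[tuple]:
    """Return (min, Q1, median, Q3, max) of quality scores at each position.

    Assumes PHRED33 encoding and equal-length reads; zip() silently
    truncates to the shortest read otherwise.
    """
    decoded = [[ord(ch) - 33 for ch in line] for line in quality_lines]
    summary = []
    for column in zip(*decoded):
        q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        summary.append((min(column), q1, median, q3, max(column)))
    return summary


# Invented quality strings for five reads of length 5.
raw_quals = ["IIIH5", "IIH52", "IHH5#", "III52", "IIH2#"]

summary = per_position_summary(raw_quals)
for position, (lo, q1, med, q3, hi) in enumerate(summary, start=1):
    print(f"pos {position}: min={lo} Q1={q1} median={med} Q3={q3} max={hi}")
```

Running the same function on quality strings taken before and after error correction gives a numerical counterpart to the two boxplots students compare.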

Discussion Topics

Why has next-generation sequencing technology led to a revolution in biology/medicine?

Discuss and predict how chemical and physical mechanisms lead to errors

Comparison of sequence improvement based on different parameters

How do software packages determine which base is in error and which is correct if sequencing reads conflict?

Why is it important to have a numerical measure of error in addition to the nucleotide sequence?
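To seed the discussion of how software decides which base is in error, the sketch below shows one simplified idea: treat k-mers seen at least `cutoff` times as trusted, and accept a single-base change only if it makes every k-mer in the read trusted. This is a teaching toy and not the algorithm implemented in Quake, which also takes base quality values into account. The reads, the cutoff, and all function names are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def untrusted_kmers(read: str, counts: Counter, k: int, cutoff: int) -> int:
    """Number of k-mers in the read that occur fewer than cutoff times."""
    return sum(
        1 for i in range(len(read) - k + 1) if counts[read[i:i + k]] < cutoff
    )


def correct_read(read: str, counts: Counter, k: int, cutoff: int) -> str:
    """Try all single-base substitutions; return the unique fully trusted one.

    If the read is already trusted, or if zero or several substitutions
    would fix it, the read is returned unchanged.
    """
    if untrusted_kmers(read, counts, k, cutoff) == 0:
        return read
    candidates = []
    for pos, original in enumerate(read):
        for base in "ACGT":
            if base == original:
                continue
            variant = read[:pos] + base + read[pos + 1:]
            if untrusted_kmers(variant, counts, k, cutoff) == 0:
                candidates.append(variant)
    return candidates[0] if len(candidates) == 1 else read


# Five identical toy reads plus one read with an invented C-to-G error.
good_read = "ACGTTGCAAGCT"
bad_read = "ACGTTGGAAGCT"
counts = count_kmers([good_read] * 5 + [bad_read], k=4)

corrected = correct_read(bad_read, counts, k=4, cutoff=2)
print(bad_read, "->", corrected)
```

The sketch leaves a read untouched when the evidence is ambiguous, which connects to the question of why a numerical measure of error is useful: quality scores give a corrector a way to rank candidate positions when k-mer counts alone cannot.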

Assessment

This module will be performed as a team-based project, with students preparing and handing in a report at the end. Students will be able to:

Predict predominant types or sources of error based on experimental design and sequencing platform

Prepare a boxplot using Galaxy for an exemplary dataset and use the boxplot to evaluate the quality of the sequence data

Effectively improve the quality of any set of NGS reads prior to assembly

References

https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish

http://www.en.wikipedia.org/wiki/FASTQ_format

Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116.

Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764-770. [Jellyfish program]

http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/pdf/ukmss-2586.pdf





