NGS data processing
Bioinformatics tips, tools of the trade and pipeline writing
Na Cai, 4th year DPhil in Clinical Medicine
Supervisor: Jonathan Flint

Example projects
CONVERGE
- 1.7x whole genome sequencing in 12,000 Han Chinese women
- 6,000 cases of MD, 6,000 controls
- Detailed questionnaire
- 45 TB of sequencing data
Commercial Outbred Mice
- 0.1x whole genome sequencing in 2,000 mice
- Known breeding history
- Extensive phenotyping
- 2 TB of sequencing data

NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

Large-scale sequencing projects
Lots of data: terabytes!
- Storage problems, I/O problems, RAM problems
- Time consuming to process
Errors! Lots of them!
- Contamination
- Duplication
- Missing data
- Difficult regions/features of the genome

Approach to NGS data
- Explore the data before processing at large scale
- Pilot your experiments with small subsets
- Try the default parameters of a piece of software before altering them
- Check the output:
  - Right number of lines?
  - Did anything fail silently?
  - Different handling of different classes of input?
  - How are missing values coded?
  - % failure?
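The output checks listed above can be partly automated. The following is a minimal sketch only, assuming a tab-delimited table with a header line; the function name `summarise_table` and the set of missing-value tokens are hypothetical and should be adapted to how your own tools code missing data.

```python
import csv
import io

# Assumption: these strings stand for "missing" in your files; adjust as needed.
MISSING_TOKENS = {"", "NA", "NaN", "nan", "."}


def summarise_table(handle, expected_rows, delimiter="\t"):
    """Summarise a delimited text table that starts with a header line."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    n_rows = 0
    ragged_rows = 0  # rows whose field count differs from the header
    missing = {name: 0 for name in header}
    for row in reader:
        n_rows += 1
        if len(row) != len(header):
            ragged_rows += 1
            continue
        for name, value in zip(header, row):
            if value.strip() in MISSING_TOKENS:
                missing[name] += 1
    pct_missing = {
        name: (100.0 * count / n_rows if n_rows else 0.0)
        for name, count in missing.items()
    }
    return {
        "rows": n_rows,
        "row_count_ok": n_rows == expected_rows,
        "ragged_rows": ragged_rows,
        "missing": missing,
        "pct_missing": pct_missing,
    }


# Toy input: three samples, with missing values coded in different ways.
demo = io.StringIO("sample\tdepth\tcall\nS1\t1.7\tA\nS2\tNA\tG\nS3\tnan\t.\n")
print(summarise_table(demo, expected_rows=3))
```

Running this on a small pilot subset first shows quickly whether a tool dropped lines or coded missing values in an unexpected way.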

Exploratory work in R
- read.table(file, as.is=TRUE, na.strings=c("NA", "nan"))
- dim(), str(), mode(), complete.cases()
- head(), tail()
- table(), summary()
- order(), rank()
- plot(), library(ggplot2)
- library(plyr)

Pipeline writing
- Arguments/options for different input
- Arguments/options for parameters/auxiliary files
- Reusable functions
- Reasonably flexible input format recognition
- Set up for parallelizing
- stderr for debugging and checking progress, but beware of its size and I/O!
- Create new directories as you go along
- Create flag files to indicate successful completion of each step
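As an illustration of the points above, here is a minimal sketch of a pipeline step with command-line options, progress messages on stderr, directories created on the fly and a flag file written only on success. The option names, the `.done` suffix and the helper `run_step` are assumptions made for this example, not part of any particular pipeline.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv=None):
    """Command-line options for inputs, parameters and auxiliary files."""
    parser = argparse.ArgumentParser(description="Toy pipeline step")
    parser.add_argument("--input", required=True, nargs="+", help="input files")
    parser.add_argument("--outdir", default="results", help="output directory")
    parser.add_argument("--reference", help="auxiliary file, e.g. a reference genome")
    parser.add_argument("--threads", type=int, default=1, help="cores to use")
    return parser.parse_args(argv)


def run_step(name, outdir, action):
    """Run action(step_dir) unless this step's flag file already exists.

    The flag file is written only after action() returns without raising,
    so a crashed step is rerun the next time the pipeline is started.
    """
    step_dir = Path(outdir) / name
    step_dir.mkdir(parents=True, exist_ok=True)  # create directories as you go
    flag = step_dir / (name + ".done")
    if flag.exists():
        print(f"[{name}] already complete, skipping", file=sys.stderr)
        return False
    action(step_dir)
    flag.touch()
    print(f"[{name}] finished", file=sys.stderr)
    return True
```

A step would then be called as, for example, `run_step("sort", args.outdir, my_sort_function)`, where `my_sort_function` is whatever reusable function does the work.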

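Make
- Specify the input file and the output file
- Specify the command that turns the input into the output
- Make checks for the presence of the output file (and whether it is older than its inputs) before running the command
- Make can delete the output of commands that did not finish running (on interruption, or on error when .DELETE_ON_ERROR is set)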
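The behaviour described above can be imitated in a few lines of Python, which may help to see what Make is doing. This is a rough sketch of the idea, not a replacement for Make; `make_target` is a hypothetical helper and handles a single output file only.

```python
import subprocess
import sys
from pathlib import Path


def make_target(output, inputs, command):
    """Run command only if output is missing or older than one of its inputs.

    command is a list of arguments passed to subprocess.run. If the command
    fails, any partially written output is deleted so that a later run does
    not mistake it for a finished result.
    """
    out = Path(output)
    if out.exists():
        out_mtime = out.stat().st_mtime
        if all(Path(i).stat().st_mtime <= out_mtime for i in inputs):
            print(f"{out} is up to date", file=sys.stderr)
            return False
    try:
        subprocess.run(command, check=True)
    except (subprocess.CalledProcessError, OSError):
        if out.exists():
            out.unlink()  # remove the half-finished output
        raise
    return True
```

Calling `make_target` twice with the same arguments runs the command once; the second call sees the existing, newer output and returns without doing anything.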

Ruffus
http://www.ruffus.org.uk
- Flexible: one-to-many and many-to-one processes
- Fully integrated with Python programming
- Need to specify only the maximum number of cores allowed for parallelisation
- Useful printout options to check the pipeline

Setting up Ruffus

Once Ruffus is set up - Help

Once Ruffus is set up just print

NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices

Processing a raw BAM file
Practical concerns
- Number of samples
- Size of files
- Run time
- Server/cluster usage: how the jobs can be parallelized
Scientific concerns
- Ploidy of genome
- Source of DNA
- Features of genome
- Variation between samples
- Genome coverage
- Error rates

Manipulating a BAM file
- Converting between BAMs and FASTQs
- Indexing
- Coordinate sorting
- Splitting or merging
- Filter out reads using bitwise flags/other criteria
- Mask entire regions
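To make the bitwise-flag filtering concrete, the sketch below works on SAM text lines (the uncompressed form of a BAM record), where the second column is the FLAG field. The bit values are those defined in the SAM specification; the function names and the default exclusion mask are choices made for this example. The logic mirrors the `-f` (require) and `-F` (exclude) options of `samtools view`.

```python
# FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

# Assumption for this example: drop unmapped, secondary, QC-failed and duplicate reads.
DEFAULT_EXCLUDE = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_QC_FAIL | FLAG_DUPLICATE


def keep_read(flag, exclude_mask=DEFAULT_EXCLUDE, require_mask=0):
    """True if no excluded bit is set and every required bit is set."""
    return (flag & exclude_mask) == 0 and (flag & require_mask) == require_mask


def filter_sam(lines, exclude_mask=DEFAULT_EXCLUDE, require_mask=0):
    """Yield header lines unchanged and alignment lines that pass the flag filter."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t")[1])
        if keep_read(flag, exclude_mask, require_mask):
            yield line


sam_lines = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "r1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\n",    # proper pair
    "r2\t1123\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\n",  # same, marked duplicate
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n",                # unmapped
]
for kept in filter_sam(sam_lines):
    print(kept, end="")
```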

Example: Contaminants

Useful Resource: Harvard Sysbio
- Remove duplicate sequences in FASTA
- Remove short sequences in FASTA
- Format FASTA
http://archive.sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/Protocols/Sequences.html
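The three FASTA operations above are simple enough to sketch directly. The following is an illustrative Python version, not the Scriptome one-liners themselves; the function names are hypothetical, and duplicates are defined here as identical sequences ignoring case, with the first occurrence kept.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_fasta(records, min_length=0):
    """Drop sequences shorter than min_length and repeated sequences."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if len(seq) < min_length or key in seen:
            continue
        seen.add(key)
        yield header, seq


def format_fasta(records, width=60):
    """Return FASTA text with sequences wrapped to a fixed line width."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


fasta_text = ">a\nACGT\nACGT\n>b\nacgtacgt\n>c\nAC\n>d\nGGGGCCCC\n"
records = clean_fasta(read_fasta(io.StringIO(fasta_text)), min_length=4)
print(format_fasta(records, width=4), end="")
```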

Useful Resource: NGSUtils
- Tools (in Python) for FASTA, BAM, BED, GTF file processing
- E.g. bamutils filter can filter out reads with more than x mismatches
http://ngsutils.org

Useful Resource: PicardTools
- Tools (in Java) for BAM and FASTA processing
- Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
- Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY
http://broadinstitute.github.io/picard
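As a sketch of how the tools and options above fit together, the snippet below builds (but does not run) a MergeSamFiles command line in Picard's classic KEY=VALUE style. The jar path `picard.jar`, the file names and the helper `picard_merge_command` are assumptions for illustration; check the Picard documentation for the release you have installed, since the command-line syntax has changed between versions.

```python
def picard_merge_command(inputs, output, jar="picard.jar",
                         sort_order="coordinate", create_index=True,
                         create_md5=False, stringency="LENIENT"):
    """Build an argument list for Picard MergeSamFiles (KEY=VALUE syntax)."""
    cmd = ["java", "-jar", jar, "MergeSamFiles"]
    cmd += ["INPUT=" + path for path in inputs]  # one INPUT= per BAM to merge
    cmd += [
        "OUTPUT=" + output,
        "SORT_ORDER=" + sort_order,
        "CREATE_INDEX=" + str(create_index).lower(),
        "CREATE_MD5_FILE=" + str(create_md5).lower(),
        "VALIDATION_STRINGENCY=" + stringency,
    ]
    return cmd


# Hypothetical file names, for illustration only.
command = picard_merge_command(["lane1.bam", "lane2.bam"], "sample1.bam")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the paths point at real files.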

Useful Resource: GATK
- Tools (in Java) for NGS processing and analysis
- Cool things about it: Best Practices page, Forum, Tutorials, Presentations
https://www.broadinstitute.org/gatk/

Useful Resource: GATK http://www.broadinstitute.org/gatk/guide/best-practices

Indel Realignment http://www.broadinstitute.org/gatk/guide/best-practices

Why Realign Around Indels? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf

How does it work?
Identified intervals:
- Known indels
- Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
- Reads where there is evidence of possible misalignment
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf

The Indel Realigner Workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf

Implementing RealignerTargetCreator
Figure: reads from sample1 to sample7 laid out across Site1 to Site8
- The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there
- So the data for all samples need to be passed in at the same time

Implementing IndelRealigner
Figure: reads from sample1 to sample7 laid out across Site1 to Site8
- Once the intervals are identified, reads from any single sample can be realigned individually, based on the sample's own insertion/deletion lengths
- So only one sample's data need to be passed in at a time
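The two realignment steps therefore parallelize along different axes: target creation by genomic region with every sample as input, and realignment by sample. The sketch below only builds the job lists and does not run anything. It assumes GATK 3-style command lines (`-T`, `-R`, `-I`, `-L`, `-o`, `-targetIntervals`); the jar name, file names and the idea of one merged intervals file are assumptions for illustration.

```python
import os


def target_creator_jobs(bams, regions, reference, gatk_jar="GenomeAnalysisTK.jar"):
    """One job per region; each job reads every sample's BAM."""
    jobs = []
    for region in regions:
        cmd = ["java", "-jar", gatk_jar, "-T", "RealignerTargetCreator",
               "-R", reference, "-L", region,
               "-o", region.replace(":", "_") + ".intervals"]
        for bam in bams:
            cmd += ["-I", bam]
        jobs.append(cmd)
    return jobs


def indel_realigner_jobs(bams, intervals, reference, gatk_jar="GenomeAnalysisTK.jar"):
    """One job per sample; each job reads one BAM and the shared intervals."""
    jobs = []
    for bam in bams:
        out = os.path.splitext(bam)[0] + ".realigned.bam"
        jobs.append(["java", "-jar", gatk_jar, "-T", "IndelRealigner",
                     "-R", reference, "-I", bam,
                     "-targetIntervals", intervals, "-o", out])
    return jobs


# Hypothetical inputs, for illustration only.
bams = ["sample1.bam", "sample2.bam", "sample3.bam"]
for job in target_creator_jobs(bams, ["chr1", "chr2"], "ref.fa"):
    print(" ".join(job))
for job in indel_realigner_jobs(bams, "merged.intervals", "ref.fa"):
    print(" ".join(job))
```

With many samples and few regions the first step yields a handful of large jobs, while the second yields one small job per sample, which is worth knowing when requesting cluster resources.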

Base Quality Score Recalibration (BQSR) http://www.broadinstitute.org/gatk/guide/best-practices

Why BQSR? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf

The BQSR workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf

Implementing BaseRecalibrator
Figure: reads from sample1 to sample7 laid out across Site1 to Site8
- The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset
- So all of the data for each sample need to be passed in
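To give a feel for what a recalibration table contains, here is a deliberately simplified toy version: it bins bases by reported quality only and converts the observed mismatch rate at unmasked sites back to a Phred score, Q = -10 log10(p). This is not GATK's model, which also conditions on covariates such as read group, machine cycle and sequence context; the add-one smoothing and the function names are assumptions made for this example.

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_probability)


def recalibration_table(observations):
    """Map each reported quality to an empirical quality.

    observations: iterable of (reported_quality, is_mismatch) pairs, taken
    only from sites that are not masked as known variants.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [bases seen, mismatches]
    for quality, is_mismatch in observations:
        counts[quality][0] += 1
        counts[quality][1] += int(is_mismatch)
    table = {}
    for quality, (n_bases, n_mismatches) in counts.items():
        # Add-one smoothing (an assumption here) avoids log10(0) for small bins.
        table[quality] = phred((n_mismatches + 1) / (n_bases + 2))
    return table


# Toy data: 998 bases reported as Q30, of which 9 mismatch the reference.
toy = [(30, True)] * 9 + [(30, False)] * 989
print(recalibration_table(toy))
```

The more bases that feed each bin, the more stable the estimate, which is why the whole of a sample's data is passed in rather than a small region.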

Variant Calling http://www.broadinstitute.org/gatk/guide/best-practices

Variant Calling http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5-Variant_calling.pdf

Implementing Variant Calling
Figure: reads from sample1 to sample7 laid out across Site1 to Site8
- The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine if there is a variant at the site
- So the data for all samples at a particular site need to be passed in at the same time
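Because the caller needs every sample at each site, the natural way to parallelize is to cut the genome into windows and run one job per window across all BAMs. The following sketch generates such windows and the matching job list without running anything. It assumes GATK 3-style options (`-T UnifiedGenotyper`, `-R`, `-I`, `-L`, `-o`); the jar name, the window size and the chromosome lengths are placeholders for illustration.

```python
def genome_windows(chrom_lengths, window):
    """Yield 1-based inclusive regions such as 'chr1:1-1000000'."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)
            yield f"{chrom}:{start}-{end}"


def genotyper_jobs(bams, regions, reference, gatk_jar="GenomeAnalysisTK.jar"):
    """One job per region; every job takes all samples as input."""
    jobs = []
    for region in regions:
        cmd = ["java", "-jar", gatk_jar, "-T", "UnifiedGenotyper",
               "-R", reference, "-L", region,
               "-o", region.replace(":", "_") + ".vcf"]
        for bam in bams:
            cmd += ["-I", bam]
        jobs.append(cmd)
    return jobs


# Placeholder chromosome lengths and sample names, for illustration only.
regions = list(genome_windows({"chr1": 2500, "chr2": 1000}, window=1000))
for job in genotyper_jobs(["sample1.bam", "sample2.bam"], regions, "ref.fa"):
    print(" ".join(job))
```

The per-window VCFs then need to be concatenated in coordinate order; smaller windows give more, shorter jobs at the cost of more files to merge.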

Useful Resource: Variant Callers

Acknowledgements
- Jonathan Flint, Richard Mott
- Robbie Davies, Winni Kretzschmar
- Kiran Garimella (GATK)
- Leo Goodstadt (Ruffus)
- Gerton Lunter (Stampy)
- Andy Rimmer (Platypus)
- Zam Iqbal (Cortex)
- John Broxholme (all software help and maintenance)
- Jon Diprose, Robert Esnouf (Clusters)
- Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)
