NGS data processing: Bioinformatics tips, tools of the trade and pipeline writing
Na Cai, 4th year DPhil in Clinical Medicine
Supervisor: Jonathan Flint
Slide 2
Example projects
CONVERGE
- 1.7x whole genome sequencing in 12,000 Han Chinese women
- 6,000 cases of MD, 6,000 controls
- Detailed questionnaire
- 45 TB of sequencing data
Commercial Outbred Mice
- 0.1x whole genome sequencing in 2,000 mice
- Known breeding history
- Extensive phenotyping
- 2 TB of sequencing data
Slide 3
NGS data processing Taken from:
http://www.broadinstitute.org/gatk/guide/best-practices
Slide 4
Large-scale sequencing projects
- Lots of data: terabytes!
  - Storage problems, I/O problems, RAM problems
  - Time consuming to process
- Errors! Lots of them!
  - Contamination
  - Duplication
  - Missing data
  - Difficult regions/features of the genome
Slide 5
Approach to NGS data
- Explore the data before processing at large scale
- Pilot your experiments with small subsets
- Try the default parameters of software before altering them
- Check output
  - Right number of lines?
  - Did anything fail silently?
  - Different handling of different classes of input?
  - How are missing values coded?
  - % failure?
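The output checks listed above can be automated. The following is a minimal sketch, assuming the output is a tab-delimited table with one row per input record; the function name `check_output`, the list of missing-value codes and the example data are illustrative assumptions, not part of any tool mentioned in these slides.

```python
import csv
import io

def check_output(handle, expected_rows, missing_codes=("NA", "nan", "", ".")):
    """Count rows and missing values in a tab-delimited table."""
    rows = 0
    cells = 0
    missing = 0
    for record in csv.reader(handle, delimiter="\t"):
        rows += 1
        for cell in record:
            cells += 1
            if cell in missing_codes:
                missing += 1
    return {
        "rows": rows,
        "rows_ok": rows == expected_rows,   # right number of lines?
        "missing": missing,                 # how many values are coded as missing?
        "pct_missing": 100.0 * missing / cells if cells else 0.0,
    }

# Hypothetical three-row output in which two values are coded as missing
example = io.StringIO("s1\t0/1\t35\ns2\tNA\t12\ns3\t.\t40\n")
report = check_output(example, expected_rows=3)
print(report)
```

Running the same check on a small pilot subset first makes it easier to spot a step that failed silently before scaling up.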
Slide 6
Exploratory work in R
- read.table(file, as.is=T, na.strings=c("NA", "nan"))
- dim(), str(), mode(), complete.cases()
- head(), tail()
- table(), summary()
- order(), rank()
- plot(), library(ggplot2)
- library(plyr)
Slide 7
Pipeline writing
- Arguments/options for different input
- Arguments/options for parameters/auxiliary files
- Reusable functions
- Reasonably flexible input format recognition
- Set up for parallelizing
- stderr for debugging and checking progress, but beware of its size and I/O!
- Create new directories as you go along
- Create flag files to indicate successful completion of each step
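Three of the points above (reusable functions, creating directories as you go, and flag files) can be combined in one small wrapper. This is a minimal sketch under assumed conventions: the `run_step` helper, the `.done` flag suffix and the toy step are all hypothetical names chosen for illustration.

```python
import os
import sys
import tempfile

def run_step(name, func, workdir):
    """Run func once; a flag file marks successful completion of the step."""
    step_dir = os.path.join(workdir, name)
    os.makedirs(step_dir, exist_ok=True)           # create directories as you go
    flag = os.path.join(step_dir, name + ".done")
    if os.path.exists(flag):
        print("[skip] " + name, file=sys.stderr)   # progress messages go to stderr
        return False
    func(step_dir)                                 # an exception here leaves no flag behind
    with open(flag, "w") as fh:
        fh.write("ok\n")
    print("[done] " + name, file=sys.stderr)
    return True

def toy_step(step_dir):
    # Stand-in for a real command such as sorting or indexing a BAM file
    with open(os.path.join(step_dir, "out.txt"), "w") as fh:
        fh.write("result\n")

workdir = tempfile.mkdtemp()
first = run_step("sort", toy_step, workdir)    # runs the step and writes the flag
second = run_step("sort", toy_step, workdir)   # flag exists, so the step is skipped
```

Because the flag is written only after the step returns, a crashed or killed step is rerun the next time the pipeline is started.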
Slide 8
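Make
- Specify input file and output file
- Specify command for input -> output
- Make checks for the presence of the output file (and whether it is older than the input) before running the command
- Make deletes the output of commands that did not finish running (on interruption by default; on a failed command if .DELETE_ON_ERROR is set in GNU Make)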
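The behaviour described above can be imitated in a few lines of Python. This is a sketch of the idea only, not of how Make is implemented; `needs_update`, `build` and the two toy commands are hypothetical names.

```python
import os
import tempfile

def needs_update(source, target):
    """True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)

def build(source, target, command):
    """Run command(source, target) if needed; remove partial output on failure."""
    if not needs_update(source, target):
        return "up to date"
    try:
        command(source, target)
    except Exception:
        if os.path.exists(target):
            os.remove(target)      # do not leave a half-written output behind
        raise
    return "built"

def upper_case(source, target):
    with open(source) as fin, open(target, "w") as fout:
        fout.write(fin.read().upper())

def broken(source, target):
    with open(target, "w") as fout:
        fout.write("partial")
    raise RuntimeError("command failed")

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "in.txt")
out = os.path.join(tmp, "out.txt")
with open(src, "w") as fh:
    fh.write("acgt\n")

first = build(src, out, upper_case)    # output missing, so the command runs
second = build(src, out, upper_case)   # output present and not older, so nothing runs
```

Calling `build(src, some_target, broken)` raises the error and removes the partial target, so a later run does not mistake it for a finished result.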
Slide 9
Ruffus
http://www.ruffus.org.uk
- Flexible: one -> many and many -> one processes
- Fully integrated with Python programming
- Need to specify only the maximum number of cores allowed for parallelisation
- Useful printout options to check the pipeline
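To illustrate what one -> many and many -> one steps with a cap on parallel workers look like, here is a sketch that uses only the Python standard library. It is deliberately not Ruffus code and does not show the Ruffus API; the function names and the toy reads are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def split(reads, n_chunks):
    """One -> many: divide one list of reads into n_chunks lists."""
    return [reads[i::n_chunks] for i in range(n_chunks)]

def count_gc(chunk):
    """Per-chunk work that can run in parallel."""
    return sum(seq.count("G") + seq.count("C") for seq in chunk)

def merge(counts):
    """Many -> one: combine the per-chunk results."""
    return sum(counts)

reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]   # toy data
chunks = split(reads, 3)

# max_workers plays the role of "maximum number of cores allowed"
with ThreadPoolExecutor(max_workers=2) as pool:
    per_chunk = list(pool.map(count_gc, chunks))

total_gc = merge(per_chunk)
print(total_gc)
```

A pipeline framework adds dependency tracking and up-to-date checks on top of this basic split, process, merge pattern.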
Slide 10
Setting up Ruffus
Slide 11
Once Ruffus is set up - Help
Slide 12
Once Ruffus is set up just print
Slide 13
NGS data processing Taken from:
http://www.broadinstitute.org/gatk/guide/best-practices
Slide 14
Processing a raw BAM file
Practical concerns
- Number of samples
- Size of files
- Run time
- Server/cluster usage: how the jobs can be parallelized
Scientific concerns
- Ploidy of genome
- Source of DNA
- Features of genome
- Variation between samples
- Genome coverage
- Error rates
Slide 15
Manipulating a BAM file
- Converting between BAM and FASTQ files
- Indexing
- Coordinate sorting
- Splitting or merging
- Filtering out reads using bitwise flags or other criteria
- Masking entire regions
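Filtering on bitwise flags works by testing individual bits of the FLAG field (column 2) of each alignment record. The sketch below operates on SAM text lines rather than on a binary BAM file; the bit values come from the SAM specification, while the function name, the default exclusion set and the example reads are assumptions made for illustration.

```python
# Bit values defined in the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
DEFAULT_EXCLUDE = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE

def keep_read(sam_line, exclude=DEFAULT_EXCLUDE):
    """Keep an alignment line only if none of the excluded flag bits are set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return (flag & exclude) == 0

# Hypothetical alignment records: r2 is unmapped, r3 is marked as a duplicate
sam_lines = [
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t1123\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
    "r4\t163\tchr1\t500\t60\t4M\t=\t700\t204\tACGT\tIIII",
]
kept = [line.split("\t")[0] for line in sam_lines if keep_read(line)]
print(kept)
```

This is the same test that `samtools view -F` applies when given a combined flag value to exclude.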
Slide 16
Example: Contaminants
Slide 17
Slide 18
Useful Resource: Harvard Sysbio
- Remove duplicate sequences in FASTA
- Remove short sequences in FASTA
- Format FASTA
http://archive.sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/Protocols/Sequences.html
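For comparison, the first two operations listed for this resource (removing duplicate and short sequences from FASTA) can also be written in plain Python. This is an independent sketch, not the code provided at the link; the function names, the length threshold and the example records are assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def clean_fasta(lines, min_length):
    """Drop sequences shorter than min_length and exact duplicate sequences."""
    seen = set()
    for header, seq in read_fasta(lines):
        if len(seq) < min_length or seq in seen:
            continue
        seen.add(seq)
        yield header, seq

# Toy input: b is too short, c duplicates a once its two lines are joined
fasta = [">a", "ACGTACGT", ">b", "ACG", ">c", "ACGT", "ACGT", ">d", "GGGGCCCC"]
cleaned = list(clean_fasta(fasta, min_length=5))
print(cleaned)
```

Only the first record with a given sequence is kept, so the order of the input decides which header survives.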
Slide 19
Useful Resource: NGSUtils
- Tools (in Python) for FASTA, BAM, BED, GTF file processing
- E.g. bamutils filter can filter out reads with more than x mismatches
http://ngsutils.org
Slide 20
Useful Resource: PicardTools
- Tools (in Java) for BAM and FASTA processing
- Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
- Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY
http://broadinstitute.github.io/picard
Slide 21
Useful Resource: GATK
- Tools (in Java) for NGS processing and analysis
- Cool things about it: Best Practices page, Forum, Tutorials, Presentations
https://www.broadinstitute.org/gatk/
Why Realign Around Indels?
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
Slide 25
Why Realign Around Indels?
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
Slide 26
How does it work?
Identified intervals:
- Known indels
- Indels discovered in original alignments (in CIGAR strings of reads in BAM files)
- Reads where there is evidence of possible misalignment
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
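To make "indels discovered in CIGAR strings" concrete, the sketch below walks a CIGAR string and reports where insertions and deletions fall on the reference. It only illustrates how such information can be read from an alignment record; it is not GATK's interval-finding algorithm, and the function name and example read are assumptions.

```python
import re

# One CIGAR element: a length followed by an operation code (SAM specification)
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indels_from_cigar(pos, cigar):
    """Return (type, reference_position, length) for each I or D in a read.

    pos is the 1-based leftmost mapping position (the SAM POS field).
    """
    found = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            found.append(("INS", ref, length))   # inserted before reference base `ref`
        elif op == "D":
            found.append(("DEL", ref, length))   # deletes reference bases ref..ref+length-1
        if op in "MDN=X":                        # operations that consume the reference
            ref += length
    return found

# Hypothetical read aligned at position 100
events = indels_from_cigar(100, "10M2I5M3D20M")
print(events)
```

Collecting these events across all reads gives candidate positions at which neighbouring reads may be misaligned.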
Slide 27
The Indel Realigner Workflow
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
Slide 28
Implementing RealignerTargetCreator
[Diagram: reads from samples 1-7 laid out across sites 1-8]
The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there
-> need to pass in data for all samples at the same time
Slide 29
Slide 30
Implementing IndelRealigner
[Diagram: reads from samples 1-7 laid out across sites 1-8]
Once the intervals are identified, reads from any single sample can be realigned individually, based on the sample's own insertion/deletion lengths
-> only need to pass in one sample's data at a time
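The difference between the two realignment steps shows up directly in how the jobs are set up: one target-creation job that sees every sample, then one realignment job per sample. The sketch below only builds the command lines and does not run them. It assumes GATK 3-style arguments (`-T`, `-R`, `-I`, `-targetIntervals`, `-o`); the jar path, file names and helper functions are hypothetical.

```python
def target_creator_cmd(bams, reference, out_intervals, jar="GenomeAnalysisTK.jar"):
    """One job: every sample's BAM is passed in together."""
    cmd = ["java", "-jar", jar, "-T", "RealignerTargetCreator", "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]
    cmd += ["-o", out_intervals]
    return cmd

def indel_realigner_cmd(bam, reference, intervals, jar="GenomeAnalysisTK.jar"):
    """One job per sample: only that sample's BAM is passed in."""
    out_bam = bam[:-len(".bam")] + ".realigned.bam"
    return ["java", "-jar", jar, "-T", "IndelRealigner", "-R", reference,
            "-I", bam, "-targetIntervals", intervals, "-o", out_bam]

# Hypothetical inputs
bams = ["sample1.bam", "sample2.bam", "sample3.bam"]
reference = "ref.fa"
intervals = "all_samples.intervals"

target_cmd = target_creator_cmd(bams, reference, intervals)
realign_cmds = [indel_realigner_cmd(b, reference, intervals) for b in bams]

print(" ".join(target_cmd))
for cmd in realign_cmds:
    print(" ".join(cmd))
```

The per-sample commands are independent of each other, so they can be submitted to a cluster in parallel once the shared intervals file exists.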
Slide 31
Slide 32
Base Quality Score Recalibration (BQSR)
http://www.broadinstitute.org/gatk/guide/best-practices
The BQSR workflow
http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf
Slide 35
Implementing BaseRecalibrator
[Diagram: reads from samples 1-7 laid out across sites 1-8]
The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset
-> need to pass in all of the data of each sample
Implementing Variant Calling
[Diagram: reads from samples 1-7 laid out across sites 1-8]
The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at the site
-> need to pass in data for all samples at a particular site at the same time
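Because the caller needs all samples at a site but not all sites at once, one way to parallelize this step is to split the genome into regions and give every region job the full list of BAM files. The sketch below shows only that bookkeeping; the chromosome names and lengths, the chunk size and the function name are hypothetical.

```python
def make_regions(chrom_lengths, chunk_size):
    """Split each chromosome into 1-based, inclusive regions of at most chunk_size bases."""
    regions = []
    for chrom, length in chrom_lengths:
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            regions.append("%s:%d-%d" % (chrom, start, end))
            start = end + 1
    return regions

# Hypothetical toy genome and sample list
chrom_lengths = [("chrA", 2500), ("chrB", 1000)]
bams = ["sample1.bam", "sample2.bam", "sample3.bam"]

regions = make_regions(chrom_lengths, chunk_size=1000)

# Each job covers one region but sees every sample
jobs = [(region, list(bams)) for region in regions]
for region, job_bams in jobs:
    print(region, len(job_bams))
```

The per-region outputs then need to be merged in genome order, which is a many -> one step in the pipeline.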
Slide 40
Slide 41
Useful Resource: Variant Callers
Slide 42
Acknowledgements
- Jonathan Flint, Richard Mott
- Robbie Davies, Winni Kretzschmar
- Kiran Garimella (GATK)
- Leo Goodstadt (Ruffus)
- Gerton Lunter (Stampy)
- Andy Rimmer (Platypus)
- Zam Iqbal (Cortex)
- John Broxholme (all software help and maintenance)
- Jon Diprose, Robert Esnouf (Clusters)
- Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)