CBI NGS Workshop Lesson 4 The Genome Analysis Toolkit (GATK)


Liu Huan (刘欢)

Center for Bioinformatics,

Peking University

2011-05-30

Outline

- Basic Concepts
- Overview for Variant Discovery
- GATK Architecture
- Data Processing Pipeline of GATK for Variant Detection

Basic Concepts

Single-nucleotide polymorphism (SNP)

- a DNA sequence variation occurring when a single nucleotide (A, T, C or G) in the genome differs between members of a biological species or between paired chromosomes in an individual
- e.g. two DNA fragments from different individuals, AAGCCTA and AAGCTTA, differ at a single nucleotide (two alleles, C and T)
- almost all common SNPs have only two alleles
- within a population, SNPs can be assigned a minor allele frequency (the frequency of the less common allele)
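As an illustration of the minor allele frequency mentioned above, the following Python sketch counts alleles at one biallelic site. The function name and the sample genotypes are hypothetical and chosen only for this example.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Return (minor_allele, frequency) for a biallelic site.

    `alleles` is a flat list of observed alleles, two per diploid individual.
    """
    counts = Counter(alleles)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a biallelic SNP")
    # most_common() sorts by descending count, so the last entry is the minor allele.
    minor, minor_count = counts.most_common()[-1]
    return minor, minor_count / len(alleles)


# Hypothetical data: 5 diploid individuals genotyped at the C/T site of AAGC[C/T]TA.
observed = ["C", "C", "C", "T", "C", "C", "T", "C", "C", "C"]
print(minor_allele_frequency(observed))
```

Here the T allele is seen on 2 of the 10 chromosomes, so it is the minor allele.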

Basic Concepts

Indel

- an insertion or a deletion
- e.g.

reference: AT GG
indel 1:   AT _G   (deletion of one G)
indel 2:   ATCGG   (insertion of a C)

Basic Concepts

Copy number variation (CNV)

- a form of structural variation
- alterations of the DNA of a genome that result in the cell having an abnormal number of copies of one or more sections of the DNA
- CNVs correspond to relatively large regions of the genome that have been deleted (fewer than the normal number of copies) or duplicated (more than the normal number) on certain chromosomes
- e.g. normal chromosome structure: A-B-C-D
  CNV 1: A-B-C-C-D (a duplication of "C")
  CNV 2: A-B-D (a deletion of "C")
- this variation accounts for roughly 12% of human genomic DNA
- each variation may range from about one kilobase (1,000 nucleotide bases) to several megabases in size


Framework for variation discovery and genotyping from next-generation DNA sequencing

Phase 1:
- raw read data with platform-dependent biases were transformed into a single, generic representation with well-calibrated base error estimates, mapped to their correct genomic origin and aligned consistently with respect to one another
- mapping algorithms placed reads with an initial alignment on the reference genome, either generated in, or converted to, the technology-independent SAM reference file format
- molecular duplicates were eliminated
- initial alignments were refined by local realignment, and then an empirically accurate per-base error model was determined

Phase 2:
- the analysis-ready SAM/BAM files were analyzed to discover all sites with statistical evidence for an alternate allele present among the samples, including SNPs, short indels and copy number variations (CNVs)

Phase 3:
- technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium (LD), and family and population structure were integrated with the raw variant calls from phase 2 to separate true polymorphic sites from machine artifacts, and at these sites high-quality genotypes were determined for all samples


GATK architecture: MapReduce

MapReduce:
- parallel computation
- two steps: subdivide large problems into many discrete independent pieces, which are fed to the map function, followed by the reduce function, which joins the map results back into a final product
- subdividing
- load balance

Example:
- SNP discovery: map function
- ChIP-seq (peak calling): reduce function
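The two MapReduce steps can be sketched in plain Python. This is only an illustration of the pattern, not GATK code: the shard size, the read intervals and all function names are assumptions made for the example.

```python
from functools import reduce


def split_into_shards(positions, shard_size):
    """Subdivide the problem into discrete, independent pieces."""
    return [positions[i:i + shard_size] for i in range(0, len(positions), shard_size)]


def map_shard(shard, reads):
    """Map step: count the reads covering each position of one shard."""
    return {pos: sum(1 for start, end in reads if start <= pos < end) for pos in shard}


def reduce_results(acc, partial):
    """Reduce step: join one map result into the accumulated product."""
    acc.update(partial)
    return acc


# Hypothetical reads as half-open (start, end) intervals on a 10-base reference.
reads = [(0, 5), (2, 8), (4, 10)]
positions = list(range(10))

shards = split_into_shards(positions, 4)
partials = [map_shard(shard, reads) for shard in shards]  # each call is independent
coverage = reduce(reduce_results, partials, {})
print(coverage)
```

Because each `map_shard` call touches only its own shard, the calls could be distributed over several workers, which is where load balancing comes in.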

GATK architecture

traversals
- provide the division and preparation of data

walkers
- analysis modules
- provide the map and reduce methods that consume the data

GATK provides a nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools.

“By each sequencer read” (read-based) and “by every read covering each single base position in a genome” (locus-based)
- standard methods for accessing data for several analyses
- e.g. counting reads, building base quality histograms, reporting average coverage of sequencer reads over the genome, calling SNPs

Traversal Types in GATK

Read-based Traversals

Read-based Traversal
- presents the analysis walker with each read individually, passing each read once and only once to the walker’s map function
- along with the sequencer read, the walker is presented with the reference bases that the read overlaps
- is useful for analyzing read quality scores and alignment scores, and for merging reads from multiple BAM files
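A read-based traversal can be pictured with the small Python sketch below. It is not the GATK Java API: the `Read` record, the traversal function and the walker functions are hypothetical stand-ins, and reads are assumed to align without gaps.

```python
from collections import Counter, namedtuple

# Hypothetical read record: 0-based start, base string, per-base quality scores.
Read = namedtuple("Read", "name start bases quals")


def traverse_reads(reads, reference, walker_map, walker_reduce, initial):
    """Pass each read exactly once to the walker's map function."""
    acc = initial
    for read in reads:
        # The walker also receives the reference bases that the read overlaps.
        ref_bases = reference[read.start:read.start + len(read.bases)]
        acc = walker_reduce(acc, walker_map(read, ref_bases))
    return acc


def quality_histogram_map(read, ref_bases):
    """Map: histogram of the base qualities of a single read."""
    return Counter(read.quals)


def histogram_reduce(acc, value):
    """Reduce: merge one per-read histogram into the running total."""
    acc.update(value)
    return acc


reference = "AAGCCTA"
reads = [
    Read("r1", 0, "AAG", [30, 30, 20]),
    Read("r2", 3, "CC", [20, 10]),
]
histogram = traverse_reads(reads, reference, quality_histogram_map,
                           histogram_reduce, Counter())
print(histogram)
```

The same traversal could drive a read counter or an alignment-score summary by swapping in a different map function.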

Locus-based Traversals

Locus-based Traversal
- presents the analysis walker with all the associated genomic data, including all the reads that span the genomic location, all reference-ordered data, and the reference base at the specific locus in the genome
- each of these single-base loci is passed to the walker’s map function
- e.g. depth of coverage calculation, variant analysis
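The locus-based view can be sketched the same way. The Python below is an assumption-laden illustration, not GATK code: reads are gap-free, reference-ordered data is omitted, and every name is invented for the example.

```python
from collections import namedtuple

# Hypothetical gap-free read: 0-based start position plus its base string.
Read = namedtuple("Read", "start bases")


def traverse_loci(reads, reference, walker_map):
    """Call the walker's map function once per single-base locus."""
    results = []
    for pos, ref_base in enumerate(reference):
        # The pileup holds the base of every read that spans this locus.
        pileup = [r.bases[pos - r.start] for r in reads
                  if r.start <= pos < r.start + len(r.bases)]
        results.append(walker_map(pos, ref_base, pileup))
    return results


def mismatch_map(pos, ref_base, pileup):
    """Map: report depth and number of non-reference bases at one locus."""
    mismatches = sum(1 for base in pileup if base != ref_base)
    return (pos, len(pileup), mismatches)


reference = "AAGCCTA"
reads = [Read(0, "AAGCT"), Read(2, "GCTTA"), Read(3, "CCTA")]
results = traverse_loci(reads, reference, mismatch_map)
print(results[4])
```

At position 4 two of the three covering reads carry T instead of the reference C, which is the kind of evidence a variant-analysis walker would examine.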

Depth of Coverage Walker in GATK

Depth of Coverage:
- important in CNV discovery, SNP calling, and other downstream analysis

Depth of Coverage Walker in GATK:
- at each site the walker receives a list of the reads covering the reference base and emits the size of the pileup
- the end user can optionally exclude reads of low mapping quality and apply other read filtering criteria
- can also be provided with a list of regions over which to calculate coverage, reporting the average coverage over each region
- can also be used to quantify sequencing results over complex or highly variable regions, e.g. the major histocompatibility complex (MHC)
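A minimal Python sketch of the behaviour described above: per-site pileup size, an optional mapping quality filter, and an average over a region. The `Alignment` record, the function names and the example values are hypothetical and do not mirror the walker's real options.

```python
from collections import namedtuple

# Hypothetical alignment: half-open [start, end) interval plus mapping quality.
Alignment = namedtuple("Alignment", "start end mapq")


def depth_at(pos, alignments, min_mapq=0):
    """Size of the pileup at `pos`, ignoring reads below `min_mapq`."""
    return sum(1 for a in alignments
               if a.start <= pos < a.end and a.mapq >= min_mapq)


def mean_depth(interval, alignments, min_mapq=0):
    """Average depth over a half-open (start, end) region."""
    start, end = interval
    depths = [depth_at(p, alignments, min_mapq) for p in range(start, end)]
    return sum(depths) / len(depths)


alignments = [Alignment(0, 10, 60), Alignment(5, 15, 0), Alignment(8, 20, 30)]
print(depth_at(9, alignments))                     # all reads
print(depth_at(9, alignments, min_mapq=20))        # low-quality mappings excluded
print(mean_depth((0, 10), alignments, min_mapq=20))
```

Raising `min_mapq` drops the read with mapping quality 0, so the filtered depth is lower wherever that read contributed.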


Data Processing Pipeline of GATK

- initial mapping
- refinement of the initial reads
- multi-sample indel and SNP calling
- filtering of the raw SNP calls
- finally, variant quality score recalibration

Reference Genome of GATK

- hg19 is not supported
- b37 is used, to keep up to date with dbSNP and the 1000 Genomes Project data files

Resources

Download: GSA FTP server
- location: ftp.broadinstitute.org
- username: gsapubftp-anonymous
- password: <blank>

Raw Data Processing

- raw fastq files are mapped with an NGS reads aligner

For Illumina data: BWA is recommended

- accurate, fast, well-supported, open-source, and emits SAM files natively (readily converted to BAM)

Raw BAM to realigned, recalibrated BAM

Purpose of realignment
- locally realign reads such that the number of mismatching bases is minimized across all the reads
- in general, a large percentage of the regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual’s genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken for SNPs.

Two steps of realignment
- Step 1: determining (small) suspicious intervals which are likely in need of realignment
- Step 2: running the realigner over those intervals
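Step 1 can be illustrated with a deliberately simplified Python sketch that flags loci with a high fraction of mismatching bases and merges nearby ones into intervals. The threshold, the merge distance and the function name are assumptions for this example; GATK's own criteria are not reproduced here.

```python
def suspicious_intervals(mismatch_fraction, threshold=0.15, max_gap=10):
    """Group loci with many mismatching bases into candidate intervals.

    `mismatch_fraction` maps a position to the fraction of covering reads
    that disagree with the reference there. `threshold` and `max_gap` are
    hypothetical values picked for illustration only.
    """
    flagged = sorted(p for p, f in mismatch_fraction.items() if f >= threshold)
    intervals = []
    for pos in flagged:
        if intervals and pos - intervals[-1][1] <= max_gap:
            intervals[-1][1] = pos          # extend the current interval
        else:
            intervals.append([pos, pos])    # start a new interval
    return [tuple(interval) for interval in intervals]


# Hypothetical per-locus mismatch fractions.
fractions = {100: 0.5, 104: 0.3, 105: 0.01, 300: 0.2}
print(suspicious_intervals(fractions))
```

Step 2 would then run the (much more expensive) realigner only over the returned intervals instead of the whole genome.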


Two types of realignment
- Realignment only at known sites
  - very efficient
  - can operate with little coverage
  - can only realign reads at known indels
- Fully local realignment
  - uses mismatching bases to determine if a site should be realigned, and relies on sufficient coverage to discover the correct indel allele in the reads for alignment
  - much slower (involves a Smith-Waterman (SW) step)
  - can discover new indel sites in the reads


Purpose of base quality recalibration
- after recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome
- the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context
- the result: more accurate quality scores

Base Quality Recalibration: analyzing the covariation among several features of a base, e.g.
- reported quality score
- the position within the read
- the preceding and current nucleotide observed by the sequencing machine
- probability of mismatching the reference genome

This covariation is then used to recalibrate the quality scores of all reads in a BAM file.

Recommendation: lane-level recalibration, sample-level realignment
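The idea behind recalibration can be sketched as a lookup table of empirical qualities. The Python below is an illustration under stated assumptions: bases are binned by reported quality, machine cycle and dinucleotide, the add-one smoothing is a choice made for this example, and the handling of known variant sites is left out.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Build a table of empirical Phred qualities per covariate bin.

    Each observation is (reported_q, cycle, dinucleotide, is_mismatch).
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for reported_q, cycle, dinuc, is_mismatch in observations:
        key = (reported_q, cycle, dinuc)
        counts[key][0] += 1
        counts[key][1] += int(is_mismatch)

    table = {}
    for key, (n_obs, n_mismatch) in counts.items():
        # Add-one smoothing (an assumption) keeps the rate away from 0 and 1.
        error_rate = (n_mismatch + 1) / (n_obs + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


# Hypothetical bin: 998 bases reported as Q30 at cycle 1 after "AC", 9 mismatching.
observations = [(30, 1, "AC", i < 9) for i in range(998)]
table = empirical_qualities(observations)
print(round(table[(30, 1, "AC")], 2))
```

In this made-up bin the bases were reported as Q30 but mismatch about 1% of the time, so their empirical quality is close to Q20, and a recalibrator would lower their scores accordingly.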

Initial variant discovery and genotyping

Input BAMs for variant discovery and genotyping
- you already have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort

Multi-sample SNP and indel calling
- apply the Unified Genotyper to identify variant sites among the cohort samples. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort.
- Note: by default the Unified Genotyper calls SNPs only. To enable the indel calling capabilities, use the -glm DINDEL argument instead.


Selecting an appropriate quality score threshold
- a common question is which confidence score threshold to use for variant detection
- Recommended:
  - Deep (> 10x coverage per sample) data: a minimum confidence score threshold of Q30 with an emission threshold of Q10. Calls between Q10 and Q30 will be emitted but marked as filtered out with LowQual.
  - Shallow (< 10x coverage per sample) data: a minimum confidence score of Q4 and an emission threshold of Q3, since variants by necessity have lower quality with shallower coverage.
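How the two thresholds interact can be shown with a short Python sketch. The function and parameter names are hypothetical, and the exact handling of a score equal to a threshold is an assumption.

```python
def classify_call(qual, call_conf=30.0, emit_conf=10.0):
    """Decide what happens to a candidate variant with Phred-scaled confidence `qual`.

    Returns None when the call is not emitted at all, "LowQual" when it is
    emitted but filtered, and "PASS" when it meets the calling threshold.
    """
    if qual < emit_conf:
        return None
    if qual < call_conf:
        return "LowQual"
    return "PASS"


def phred_to_error_probability(q):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


# Deep-coverage thresholds (Q30 / Q10).
print(classify_call(50), classify_call(15), classify_call(5))
# Shallow-coverage thresholds (Q4 / Q3).
print(classify_call(3.5, call_conf=4, emit_conf=3))
print(phred_to_error_probability(30))
```

A Q30 threshold corresponds to roughly a 1 in 1,000 chance that the site is not variant, while Q4 tolerates an error chance of about 40%, which reflects how little evidence shallow data provides.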


Protocol

VCF (variant call format)
- standardised format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations
- usually stored in a compressed manner, and can be indexed for fast data retrieval of variants from a range of positions on the reference genome
- VCFtools: a software suite that implements various utilities for processing VCF files, including validation, merging and comparing… (http://vcftools.sourceforge.net)
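To make the format concrete, here is a minimal Python sketch that splits one VCF data line into its eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The record is made up for illustration, and a real project would normally rely on VCFtools or an established parser instead.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    for item in info.split(";"):
        if item == ".":
            continue  # "." marks a missing INFO field
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True  # flags carry no value

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Hypothetical record, invented for this example.
line = "20\t14370\trs0000001\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB\n"
record = parse_vcf_line(line)
print(record["pos"], record["ref"], record["alt"], record["info"]["DP"])
```

Per-sample genotype columns, which follow a FORMAT column after INFO, are ignored by this sketch.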




Integrating analyses: getting the best call set possible

Problems of the raw VCF file
- the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference
- the false-positive (FP) machine artifacts should be separated from the true-positive (TP) genetic variants!

Tools:
- VariantFiltrationWalker: apply hard filters
- Variant quality score recalibration: build an adaptive error model using known variant sites and then apply this model to estimate the probability that each variant is a true genetic variant or a machine artifact

Recommendation: regardless of whether you'll ultimately apply hard filtering or adaptive error modeling to select your final calls, first apply some common SNP filters to avoid obvious misalignment and indel artifacts.


Analysis-ready VCF protocol:


Basic indel filtering:
- purpose: remove alignment artifacts from the data
- methods: flagging variants with high strand bias and those in poorly mapped regions (HARD_TO_VALIDATE set) with more than 10% of the reads having mapping quality 0
- arguments for VariantFiltrationWalker:


Basic SNP filtering:
- purpose: remove alignment artifacts from the data
- methods: flagging SNPs within clusters (3 SNPs within 10 bp of each other) and those in poorly mapped regions (HARD_TO_VALIDATE set) with more than 10% of the reads having mapping quality 0
- arguments for VariantFiltrationWalker:
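Separately from the walker's actual arguments, which this sketch does not attempt to reproduce, the two SNP criteria above can be illustrated in plain Python. The function names are hypothetical, and the exact boundary conventions (whether a span of exactly 10 bp counts as a cluster) are assumptions.

```python
def flag_snp_clusters(positions, cluster_size=3, window=10):
    """Return the SNP positions that belong to a cluster.

    A cluster is taken to be `cluster_size` SNPs whose first and last
    positions are at most `window` bp apart (an assumed convention).
    """
    ordered = sorted(positions)
    flagged = set()
    for i in range(len(ordered) - cluster_size + 1):
        group = ordered[i:i + cluster_size]
        if group[-1] - group[0] <= window:
            flagged.update(group)
    return flagged


def hard_to_validate(mq0_reads, depth, max_fraction=0.1):
    """True when more than 10% of the covering reads have mapping quality 0."""
    return depth > 0 and mq0_reads / depth > max_fraction


snp_positions = [100, 104, 109, 200, 500, 505]
print(sorted(flag_snp_clusters(snp_positions)))
print(hard_to_validate(2, 10), hard_to_validate(1, 10))
```

Only the three SNPs at 100, 104 and 109 fall within 10 bp of each other; the pair at 500 and 505 is too small to form a cluster of three.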


Filtering around indels
- purpose: it's possible that, even after local realignment, misalignments around true and artifactual indels will result in some false SNP calls. These errors are quite common if you didn't do local realignment, if you didn't provide a set of known indels during local realignment, and around very large indels that can't be modeled properly by local realignment.
- methods: perform indel calling; then you can filter your SNP calls around the raw indel calls from your data set
- arguments for VariantFiltrationWalker:
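The masking idea itself, independent of how the walker is configured, can be sketched in Python as follows. The window of 10 bp and all names are assumptions made for the example, and indels are reduced to single positions for simplicity.

```python
import bisect


def snps_near_indels(snp_positions, indel_positions, mask_window=10):
    """Return the SNP positions lying within `mask_window` bp of any indel call."""
    indels = sorted(indel_positions)
    flagged = []
    for snp in snp_positions:
        # First indel at or after the left edge of the window around this SNP.
        i = bisect.bisect_left(indels, snp - mask_window)
        if i < len(indels) and indels[i] <= snp + mask_window:
            flagged.append(snp)
    return flagged


# Hypothetical raw calls on one chromosome.
snps = [95, 120, 205, 400]
raw_indels = [100, 210]
print(snps_near_indels(snps, raw_indels))
```

The SNPs at 95 and 205 sit within 10 bp of a raw indel call and would be flagged as possible misalignment artifacts.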


Making analysis-ready SNP calls with hard filtering

- GATK recommended hard filtering:

arguments for VariantFiltrationWalker:


Making analysis-ready calls with variant quality score recalibration

- newly developed

- An alternative approach to hard filtering: Variant quality score recalibration

- methods: assign a well-calibrated probability to each variant call in a call set. One can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

Expected SNP call quality

Using the GATK walker VariantEval
- gives sensitivity, specificity, and Ti/Tv (transition/transversion) ratios for known and novel calls
- expected Ti/Tv ratios:

evaluating the quality of SNP calls across the whole genome, in the targeted whole exome (Agilent), or in regions of interest
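For reference, the Ti/Tv ratio itself is simple to compute. The Python sketch below is not VariantEval; it assumes a list of biallelic SNPs given as (ref, alt) pairs, and the example calls are invented.

```python
# Transitions exchange purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) SNP pairs."""
    ti = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ti  # every other single-base change is a transversion
    return ti / tv if tv else float("inf")


# Hypothetical call set: four transitions and two transversions.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(calls))
```

Computing the ratio separately for known and novel calls, as the walker reports it, only requires splitting the input list before calling the function.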

Reference

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43:491-498, 2011.

McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-1303, 2010.

Wiki: GATK http://www.broadinstitute.org/gsa/wiki/index.php/Main_Page

Best Practice Variant Detection with the GATK v2 http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2

1000 Genomes: A Deep Catalog of Human Genetic Variation http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40

VCF poster: The Variant Call Format and VCFtools, by Petr Danecek et al. http://vcftools.sourceforge.net/VCF-poster.pdf

VCFtools http://vcftools.sourceforge.net

Wikipedia http://en.wikipedia.org/


Thanks for your attention!
