MGL Users Group Capture / Resequencing Data Handling and Analysis


Designing and ordering a targeted exome probe set

We use Agilent SureSelect hybridization probes: 120-base biotinylated RNA probes.

Process of design: choose your genes of interest, then submit them to the SureDesign website.

Some considerations:

price breaks at 0.5, 3, 6, 12, 24 Mb (see next slide)

for Karel, size of target set is <500 kb; this is ~1% of the whole exome and ~0.015% of the whole genome
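The percentages above are simple ratios of target size to reference size. A minimal sketch of that arithmetic follows; the exome and genome sizes are round-number assumptions chosen for illustration, not values taken from SureDesign, and the function name is hypothetical.

```python
# Approximate reference sizes in bases. Both are assumptions for illustration.
ASSUMED_EXOME_BASES = 50_000_000
ASSUMED_GENOME_BASES = 3_200_000_000


def target_fraction(target_bases: int, reference_bases: int) -> float:
    """Return the share of a reference that a target set covers (0..1)."""
    if reference_bases <= 0:
        raise ValueError("reference_bases must be positive")
    return target_bases / reference_bases


target = 500_000  # upper bound on the target set quoted above
print(f"share of exome:  {target_fraction(target, ASSUMED_EXOME_BASES):.2%}")
print(f"share of genome: {target_fraction(target, ASSUMED_GENOME_BASES):.4%}")
```

The exact genome percentage depends on which genome size is assumed, so small differences from the quoted figure are expected.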

Example of scaling of costs for SureSelect probes

These are costs per sample.

For example, for 96 samples for ~130 genes: 96 x $260 = $24,960.


Example of SureDesign report

Targeted vs whole exome sequencing (TES vs WES)

Cost of WES is ~$120 for pulldown probes

Can run many more samples per lane for TES

WES uses off-the-shelf probe kit, so shorter ordering time

Less “extraneous” data with TES = more “free” data with WES

Process of hybridization and library preparation

We use the Agilent SureSelectXT Target Enrichment kit. Need 5 µg of high-quality genomic DNA to start. Probes are RNA, so be sure the DNA is RNase-free.

Shear the DNA, size select, ligate adaptors, amplify library

Hybridize to custom probes and pull down

Add barcodes, pool samples for sequencing

Sequencing

ABI SOLiD 5500xl

Optimum density is 160 million beads per lane (one DNA fragment per bead).

Nominally 110 bases read per fragment = 17.6 billion bases per lane.

Significant losses due to filtering and off-target reads.
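To see how those losses eat into usable depth, here is a rough back-of-the-envelope sketch. It assumes samples are pooled evenly in one lane; the function name, the pass-filter fraction, and the on-target fraction used in the example are hypothetical placeholders, not measured values.

```python
def mean_target_coverage(
    beads_per_lane: float,
    bases_per_fragment: int,
    pass_filter_fraction: float,
    on_target_fraction: float,
    target_bases: int,
    samples_per_lane: int,
) -> float:
    """Estimate mean per-sample coverage of the target for one lane."""
    for name, value in (
        ("pass_filter_fraction", pass_filter_fraction),
        ("on_target_fraction", on_target_fraction),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    if target_bases <= 0 or samples_per_lane <= 0:
        raise ValueError("target_bases and samples_per_lane must be positive")
    raw_bases = beads_per_lane * bases_per_fragment
    # Only bases that survive filtering AND land on target count as usable.
    usable_bases = raw_bases * pass_filter_fraction * on_target_fraction
    return usable_bases / (target_bases * samples_per_lane)


estimate = mean_target_coverage(
    beads_per_lane=160e6,       # optimum density from the slide
    bases_per_fragment=110,     # nominal read length from the slide
    pass_filter_fraction=0.6,   # hypothetical
    on_target_fraction=0.5,     # hypothetical
    target_bases=500_000,       # upper bound on the target set
    samples_per_lane=96,        # hypothetical pooling level
)
print(f"estimated mean coverage per sample: {estimate:.0f}x")
```

This is a mean only; it says nothing about how unevenly that depth is spread across the target.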

Understanding Data from the Sequencer

Each fragment can produce one or two reads, from the forward and/or reverse ends.

For re-sequencing projects we commonly want to maximize both coverage and call reliability, so paired-end reads of the longest length the sequencer can produce are desirable.

Data is in the form of individual calls, with a quality present for each.

To reduce possible artifacts, multiple filtering steps are desirable.

Colorspace Compared to FASTQ

Colorspace is similar to FASTQ, but there is a layer of encoding making it not immediately interpretable.

Both have calls and qualities

Because the encoding samples each base twice (every color spans two adjacent bases), call error actually goes down in colorspace data, making it a bit more reliable for re-sequencing.

A tradeoff is that reads are a bit shorter, meaning more independent fragments must be read to achieve similar coverage.

[Figure: colorspace encoding table, 1st base by 2nd base]

Data comes as a csfasta file with the color calls and a csqual file with associated call qualities.

XSQ is a compressed binary format combining both.
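The layer of encoding can be illustrated with the standard two-base color code, in which each color is the XOR of the two-bit codes of adjacent bases. This is a minimal sketch, not the vendor's software; the function names are hypothetical and the leading primer base of "T" is an assumption.

```python
# Two-bit codes for each base; a color is the XOR of two adjacent base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colorspace(sequence: str, primer_base: str = "T") -> str:
    """Encode bases as the primer base followed by one color per base."""
    colors = []
    previous = primer_base
    for base in sequence:
        colors.append(str(BASE_CODE[previous] ^ BASE_CODE[base]))
        previous = base
    return primer_base + "".join(colors)


def decode_colorspace(read: str) -> str:
    """Decode a colorspace read (leading primer base, then color digits)."""
    previous = read[0]
    bases = []
    for color in read[1:]:
        previous = CODE_BASE[BASE_CODE[previous] ^ int(color)]
        bases.append(previous)
    return "".join(bases)


reference = encode_colorspace("ATGGC")
variant = encode_colorspace("ATCGC")  # single base change, G to C
changed = [i for i, (a, b) in enumerate(zip(reference, variant)) if a != b]
print(reference, variant, changed)
print(decode_colorspace(reference))
```

A real single-base change alters two adjacent colors, while an isolated color miscall alters only one, which is why colorspace can separate sequencing errors from true variants.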

You WILL have variants

The human reference genome (hg19) is assembled from 13 people; various portions represent only a fraction of those individuals.

The human genome prior to the most recent build (not yet generally adopted by the vast majority of tools) contains many rare alleles.

dbSNP (build 141) reports 62 million common variants (from 260 million submissions), 29.9 million of which occur within genes. Includes mainly synonymous and ‘non-impactful’ mutations.

The goal of many re-sequencing projects is to try to distill meaningful mutations from all of this common genetic variation.

Considerations with Capture data

Exome or targeted capture is an excellent tool for reducing the amount of ‘irrelevant’ data for a study, but does introduce some caveats.

Capture is never 100% enrichment. Both in our hands and in data evaluated from NISC, exome capture tends to yield ~50% or so on-target bases, with targets explicitly defined by the capture design (exons +/- 10 bp). Product literature usually extends the capture regions a further 100 bp, which pads that figure.

Because capture relies on complex hybridization, there is a LOT of variability in how well some sequences are captured vs others. Some regions may have low/no coverage while others may be heavily covered.
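The on-target figure depends heavily on how far the target regions are padded. The sketch below counts aligned bases that overlap padded targets; intervals are assumed to be 0-based and half-open, the function names are hypothetical, and the linear scan is for illustration rather than speed.

```python
from collections import defaultdict


def pad_and_merge(targets, pad):
    """Pad each (chrom, start, end) target and merge overlaps per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((max(0, start - pad), end + pad))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(interval) for interval in out]
    return merged


def on_target_fraction(alignments, targets, pad=10):
    """Fraction of aligned bases that fall inside the padded targets."""
    merged = pad_and_merge(targets, pad)
    total = 0
    on_target = 0
    for chrom, start, end in alignments:
        total += end - start
        for t_start, t_end in merged.get(chrom, []):
            overlap = min(end, t_end) - max(start, t_start)
            if overlap > 0:
                on_target += overlap
    return on_target / total if total else 0.0


# Hypothetical exon and read placements.
exons = [("chr1", 100, 200)]
reads = [("chr1", 80, 130), ("chr1", 300, 350), ("chr2", 0, 50)]
print(on_target_fraction(reads, exons, pad=10))
print(on_target_fraction(reads, exons, pad=110))
```

Running the same reads against a wider pad gives a higher on-target fraction, so always check which definition a reported number uses.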

Distribution of Coverage in Capture

“Average” coverage is 228x overall for capture bases, but note the range and the presence of a terribly captured fraction!

Falloff of coverage in targeted regions

We can track what fraction of bases are covered at a certain level. This can be adjusted by how much sequencing is done.

[Figure: falloff of coverage in targeted regions, with curves for 80%, 50%, and 20% of bases]
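The fraction of target bases covered at a given level can be computed directly from per-base depths. This is a minimal sketch with made-up depths; the function names are hypothetical, and in practice the depths would come from a per-base depth report for the capture regions.

```python
import math


def fraction_covered(depths, min_depth):
    """Fraction of bases whose depth is at least min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)


def depth_at_fraction(depths, fraction):
    """Depth that at least `fraction` of the bases meet or exceed."""
    if not depths or not 0.0 < fraction <= 1.0:
        raise ValueError("need depths and a fraction in (0, 1]")
    ordered = sorted(depths, reverse=True)
    # Small epsilon guards against float noise in fraction * len(ordered).
    index = max(0, math.ceil(fraction * len(ordered) - 1e-9) - 1)
    return ordered[index]


# Hypothetical per-base depths for ten target bases.
depths = [0, 5, 40, 120, 250, 300, 310, 400, 600, 900]
print("mean depth:", sum(depths) / len(depths))
for level in (0.8, 0.5, 0.2):
    print(f"{level:.0%} of bases have depth >= {depth_at_fraction(depths, level)}")
print("fraction at >= 20x:", fraction_covered(depths, 20))
```

The mean looks healthy here while a fifth of the bases sit below 20x, which is the pattern the coverage distribution warns about.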

Capture coverage scales fairly linearly with input, but low coverage bases do not scale well!

High coverage bases and low coverage bases scale differently, a factor of how well they can be hybridized.

Pre-filtering of data

Reads are evaluated and trimmed based on contents BEFORE any form of mapping.

Important as “bad” reads may map and result in variant calls! Generally important for any form of project, not just re-sequencing, but especially critical here.

A variety of tools exist to perform this. I prefer Trimmomatic for this task.

Two main tasks for Trimmomatic: (1) remove adapter or problematic sequences (poly-A, etc.); (2) clip or trim read sequences at low quality positions. Reads left below a minimum threshold length are discarded.
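The trim-then-discard logic can be sketched in a few lines. This is a simplified illustration of sliding-window quality trimming, not Trimmomatic's actual implementation, and it skips adapter removal entirely; the function names, the Phred+33 quality encoding, and the default thresholds are assumptions.

```python
def phred33_to_scores(quality_string):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(sequence, qualities, window=4, min_mean_quality=15):
    """Cut the read at the start of the first window whose mean is too low."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must be the same length")
    last_start = max(1, len(qualities) - window + 1)
    for start in range(last_start):
        chunk = qualities[start:start + window]
        if chunk and sum(chunk) / len(chunk) < min_mean_quality:
            return sequence[:start], qualities[:start]
    return sequence, qualities


def prefilter_read(sequence, quality_string, min_length=36,
                   window=4, min_mean_quality=15):
    """Trim a read, then return None if it is too short to keep."""
    scores = phred33_to_scores(quality_string)
    trimmed_seq, trimmed_scores = sliding_window_trim(
        sequence, scores, window, min_mean_quality
    )
    if len(trimmed_seq) < min_length:
        return None  # discarded before mapping
    return trimmed_seq, trimmed_scores


# Hypothetical read whose quality collapses near the 3' end.
print(prefilter_read("ACGTACGTAC", "IIIIII####", min_length=4))
print(prefilter_read("ACGTACGTAC", "IIIIII####", min_length=36))
```

Because the cut is made where the failing window starts, a good base just ahead of the low-quality stretch can be lost; that is the conservative side to err on before variant calling.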

Alignment of Data

This is actually a critical choice. Which aligner you use will determine the reliability of your downstream results!

Alignment algorithms may change depending on task/project.

Generally three types of aligners: seed & extend; reference indexing; prefix/suffix matching (Burrows-Wheeler transforms).

Computational time and accuracy vary.
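To give a feel for the prefix/suffix matching family, here is a toy Burrows-Wheeler index with exact-match backward search. It is a teaching sketch under simplifying assumptions (naive suffix sorting, full occurrence table, no mismatches or gaps) and does not reflect how any particular aligner is implemented; all function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: fine for toy inputs, far too slow for genomes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(reference):
    """Build the suffix array, first-column offsets, and occurrence table."""
    text = reference + "$"  # "$" sorts before A, C, G, T
    sa = suffix_array(text)
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c
    first = {}
    total = 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (1 if ch == c else 0)
    return sa, first, occ


def find_exact(read, index):
    """Return sorted 0-based positions where read matches exactly."""
    sa, first, occ = index
    low, high = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(read):
        if ch not in first:
            return []
        low = first[ch] + occ[ch][low]
        high = first[ch] + occ[ch][high]
        if low >= high:
            return []
    return sorted(sa[low:high])


index = build_fm_index("GATTACAGATTA")
print(find_exact("ATTA", index))
print(find_exact("CCC", index))
```

The search cost depends on the read length rather than the reference length, which is what makes this family fast; handling mismatches and indels is where real aligners differ.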

Benchmarking of Common Aligners

For Illumina and some colorspace mapping I prefer to use Novocraft. It’s less commonly used as it’s not free.

Oliver GR. F1000Research 2012 (simulated data on actual aligners)

Benchmarking Indel Detection

Indels are a bit trickier to detect, particularly for some alignment strategies

Oliver GR. F1000Research 2012

Post alignment Workflow

GATK best practices (Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics. 43:11.10.1-11.10.33.)

Continually updated tools and recommendations for handling of sequencing data from Broad Institute.

Final portable data format

VCF (Variant Call Format) – tab-delimited text

Each line represents a position of a variant, then describes the genotype and underlying data & reliability for each sample. Extendable with annotations and additional information.

Common and readable by many current third party tools.

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3
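Because VCF is plain tab-delimited text, a single record is easy to pull apart by hand. The sketch below parses one data line into its fixed columns, INFO entries, and per-sample fields; the function names are hypothetical, values are left as strings, and for real projects an established VCF library is the safer choice.

```python
import re


def parse_info(info_field):
    """Turn 'NS=3;DP=14;DB' into a dict; flags without '=' map to True."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf_record(line, sample_names):
    """Parse one tab-delimited VCF data line into a nested dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt = fields[:9]
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "qual": qual,
        "filter": filt,
        "info": parse_info(info),
        "samples": samples,
    }


def genotype_alleles(gt, ref, alts):
    """Translate a GT value such as '1|0' into allele strings."""
    alleles = [ref] + alts
    return [None if i == "." else alleles[int(i)] for i in re.split(r"[/|]", gt)]


# First data line of the example above, rebuilt with explicit tabs.
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_vcf_record(line, ["NA00001", "NA00002", "NA00003"])
for name, call in record["samples"].items():
    print(name, genotype_alleles(call["GT"], record["ref"], record["alt"]))
```

In a GT value, "|" marks a phased genotype and "/" an unphased one; the allele indices refer to REF (0) and then each ALT in order.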

Additional handling

Varies significantly by project & goals: association testing with disease phenotypes; modifiers; identification of mutations segregating with disease among families; causative mutation(s); copy number variation (CNV).

The amount of data needed to perform these sorts of tests and analysis will vary depending on characterization and type of study.

Filtering, visualization, and manipulation can be done with many third party tools: VarSifter, Golden Helix, IGV, Galaxy, and MANY more. http://nihlibrary.nih.gov/Services/Bioinformatics/Pages/bioanalysis.aspx
