MGL Users Group Capture / Resequencing Data Handling and Analysis
Designing and ordering a targeted exome probe set
We use Agilent SureSelect hybridization probes: 120-base biotinylated RNA probes.
Process of design: choose your genes of interest, then submit them to the SureDesign website.
Some considerations:
Price breaks at 0.5, 3, 6, 12, and 24 Mb (see next slide).
For Karel, the size of the target set is <500 kb. This is ~1% of the whole exome and ~0.015% of the whole genome.
Example of scaling of costs for SureSelect probes
These are costs per sample.
For example, for 96 samples and ~130 genes: 96 x $260 = $24,960.
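The arithmetic above can be reproduced for other project sizes with a small helper. This is only a sketch: the function name is made up for illustration, and the per-sample price has to be read off the pricing table for the relevant target size.

```python
def total_probe_cost(n_samples: int, cost_per_sample: float) -> float:
    """Total capture-probe cost: per-sample price times number of samples."""
    return n_samples * cost_per_sample


# Figures from the example above: 96 samples at $260 per sample.
print(total_probe_cost(96, 260))
```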
Example of SureDesign report
Targeted vs whole exome sequencing (TES vs WES)
Cost of WES is ~$120 for pulldown probes
Can run many more samples per lane for TES
WES uses off-the-shelf probe kit, so shorter ordering time
Less “extraneous” data with TES; conversely, more “free” data with WES
Process of hybridization and library preparation
We use the Agilent SureSelectXT Target Enrichment kit.
Need 5 µg of high-quality genomic DNA to start.
The probes are RNA, so be sure the DNA is RNase-free.
Shear the DNA, size select, ligate adaptors, amplify library
Hybridize to custom probes and pull down
Add barcodes, pool samples for sequencing
Sequencing
ABI SOLiD 5500xl
Optimum density is 160 million beads per lane (one DNA fragment per bead).
Nominally 110 bases read per fragment = 16.2 billion bases per lane.
Significant losses due to filtering and off-target reads.
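To get a feel for how these losses eat into the nominal yield, a back-of-the-envelope coverage estimate can be sketched as below. The function and parameter names are illustrative. The lane yield is the figure quoted above, the on-target fraction, target size, and sample count are the approximate values used elsewhere in this deck, and the pass-filter fraction is purely an assumed placeholder.

```python
def mean_target_coverage(raw_bases, pass_filter_frac, on_target_frac,
                         target_bp, n_samples):
    """Rough mean per-sample coverage over the target after losses."""
    usable_bases = raw_bases * pass_filter_frac * on_target_frac
    return usable_bases / (target_bp * n_samples)


estimate = mean_target_coverage(
    raw_bases=16.2e9,      # nominal bases per lane, as quoted above
    pass_filter_frac=0.7,  # ASSUMPTION: fraction of bases surviving filtering
    on_target_frac=0.5,    # roughly half of bases on target (approximate)
    target_bp=500_000,     # target set of about 500 kb
    n_samples=96,          # 96 barcoded samples sharing one lane
)
print(f"Estimated mean coverage: {estimate:.0f}x per sample")
```

This is a mean only. It says nothing about how unevenly that coverage is spread across the target.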
Understanding Data from the Sequencer
Each fragment can produce one or two reads, from the forward and/or reverse ends.
For re-sequencing projects we usually want to maximize both coverage and call reliability, so paired-end reads of the longest length the sequencer can produce are desirable.
Data come as individual base calls, each with an associated quality.
To reduce possible artifacts, multiple filtering steps are desirable.
Colorspace Compared to FASTQ
Colorspace is similar to FASTQ, but a layer of encoding makes it not immediately interpretable.
Both have calls and qualities.
Because each color call samples two adjacent bases, call error actually goes down in colorspace data, making it a bit more reliable for re-sequencing.
A tradeoff is that reads are a bit shorter, so more independent fragments must be read to achieve similar coverage.
Figure: colorspace encoding table (1st base vs. 2nd base)
csqual file with associated call qualities.
XSQ is a compressed binary format combining both.
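The two-base encoding can be sketched in a few lines. This assumes the standard SOLiD color code, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names are illustrative and not part of any SOLiD tool.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colorspace(bases: str) -> str:
    """Encode bases as an anchor base followed by one color per adjacent pair."""
    colors = [str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b])
              for a, b in zip(bases, bases[1:])]
    return bases[0] + "".join(colors)


def decode_colorspace(read: str) -> str:
    """Decode an anchor base plus colors back into bases (anchor included)."""
    current = read[0]
    decoded = [current]
    for color in read[1:]:
        # XOR-ing the previous base with the color recovers the next base.
        current = BITS_TO_BASE[BASE_TO_BITS[current] ^ int(color)]
        decoded.append(current)
    return "".join(decoded)


print(encode_colorspace("ATGGC"))  # A3103
print(decode_colorspace("A3103"))  # ATGGC
```

In real reads the leading base comes from the adaptor rather than the fragment. Note also that with this naive decoding a single miscalled color changes every base after it (`A3203` decodes to `ATCCG`), which is one reason colorspace reads are normally aligned in colorspace rather than translated first.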
You WILL have variants
The human reference genome (hg19) is assembled from 13 people, and any given portion represents only a fraction of those individuals.
The human reference prior to the most recent build (which the vast majority of tools have not yet adopted) contains many rare alleles.
dbSNP (build 141) reports 62 million common variants (from 260 million submissions), 29.9 million of which occur within genes. These are mainly synonymous and ‘non-impactful’ mutations.
The goal of many re-sequencing projects is to try to distill meaningful mutations from all of this common genetic variation.
Considerations with Capture data
Exome or targeted capture is an excellent tool for reducing the amount of ‘irrelevant’ data for a study, but does introduce some caveats.
Capture never gives 100% enrichment. Both in our hands and in data evaluated from NISC, exome capture yields roughly 50% on-target vs. off-target bases, with the target explicitly defined by the capture design (exons +/- 10 bp). Product literature usually extends the capture regions by a further 100 bp, which pads that figure.
Because capture relies on complex hybridization, there is a LOT of variability in how well some sequences are captured vs. others. Some regions may have low or no coverage while others may be heavily covered.
Distribution of Coverage in Capture
“Average” coverage is 228x overall for capture bases, but note the range and the presence of a very poorly captured fraction!
Falloff of coverage in targeted regions
We can track what fraction of bases are covered at a certain level. This can be adjusted by how much sequencing is done.
Figure: coverage curves for 80%, 50%, and 20% of bases
Capture coverage scales fairly linearly with input, but low coverage bases do not scale well!
High-coverage and low-coverage bases scale differently, a factor of how well they can be hybridized.
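Tracking the fraction of bases covered at a given level can be sketched as follows, assuming per-base depths over the target are already available as a list of integers (for instance, parsed from a depth report). The depth values below are invented toy data, and the function names are illustrative.

```python
import math


def fraction_covered(depths, min_depth):
    """Fraction of target bases with at least min_depth coverage."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d >= min_depth) / len(depths)


def depth_at_fraction(depths, fraction):
    """Depth that at least `fraction` of the bases meet or exceed."""
    ordered = sorted(depths, reverse=True)
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]


# Toy per-base depths (hypothetical): a wide range with a poorly captured tail.
toy_depths = [0, 2, 15, 40, 120, 230, 310, 450, 600, 900]

print(fraction_covered(toy_depths, 20))
for frac in (0.8, 0.5, 0.2):
    print(frac, depth_at_fraction(toy_depths, frac))
```

Reporting depth at several fractions, rather than a single mean, makes the poorly captured tail visible.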
Pre-filtering of data
Reads are evaluated and trimmed based on their contents BEFORE any form of mapping.
This matters because “bad” reads may still map and result in variant calls! It is important for any kind of project, not just re-sequencing, but especially critical here.
A variety of tools exist to perform this. I prefer Trimmomatic for this task.
Two main tasks for Trimmomatic:
Remove adapter or problematic sequences (poly-A, etc.)
Clip or trim read sequences at low-quality positions
Discard reads below a minimum threshold length
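The quality-trimming and length-filtering steps can be illustrated with a simplified sliding-window trim. This is a sketch of the general idea only, not Trimmomatic's actual implementation. The function names and the default window, quality, and length thresholds are assumptions chosen for illustration.

```python
def sliding_window_trim(seq, quals, window=4, min_avg_q=15):
    """Cut the read at the first window whose mean quality drops below min_avg_q."""
    n_windows = max(1, len(seq) - window + 1)
    for start in range(n_windows):
        win = quals[start:start + window]
        if win and sum(win) / len(win) < min_avg_q:
            return seq[:start], quals[:start]
    return seq, quals


def prefilter(seq, quals, min_len=25, window=4, min_avg_q=15):
    """Trim a read, then discard it (return None) if it ends up too short."""
    seq, quals = sliding_window_trim(seq, quals, window, min_avg_q)
    if len(seq) < min_len:
        return None
    return seq, quals


read = "ACGTACGTAC"
qualities = [30, 30, 30, 30, 30, 30, 10, 5, 5, 5]
print(sliding_window_trim(read, qualities))
print(prefilter(read, qualities, min_len=6))
```

With these toy qualities the read is cut to its first five bases, and `prefilter` then drops it because it falls below the six-base minimum used in the example.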
Alignment of Data
This is actually a critical choice. Which aligner you use will determine the reliability of your downstream results!
Alignment algorithms may change depending on task/project.
Generally three types of aligners:
Seed & Extend
Reference Indexing
Prefix/Suffix matching (Burrows-Wheeler Transforms)
Computational time and accuracy vary.
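To make the prefix/suffix matching idea concrete, here is a toy Burrows-Wheeler transform with exact-match counting by backward search. It is a teaching sketch under simplifying assumptions: it builds the transform from sorted rotations, which is far too slow and memory-hungry for a real genome, and it handles exact matches only. Production aligners use compressed indexes and allow mismatches and gaps.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_text, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_pos[c]: number of characters in the text that sort before c
    counts = Counter(bwt_text)
    first_pos, total = {}, 0
    for ch in sorted(counts):
        first_pos[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt_text)  # half-open interval of matching rotations
    for ch in reversed(pattern):
        if ch not in first_pos:
            return 0
        lo = first_pos[ch] + bwt_text[:lo].count(ch)
        hi = first_pos[ch] + bwt_text[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGA")
print(count_matches(index, "ACG"))
print(count_matches(index, "TT"))
```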
Benchmarking of Common Aligners
For Illumina and some colorspace mapping I prefer Novoalign from Novocraft. It is less commonly used because it is not free.
Oliver GR. F1000Research 2012 (simulated data on actual aligners)
Benchmarking Indel Detection
Indels are a bit trickier to detect, particularly for some alignment strategies
Oliver GR. F1000Research 2012
Post alignment Workflow
GATK best practices (Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics. 43:11.10.1-11.10.33.)
Continually updated tools and recommendations for handling of sequencing data from Broad Institute.
Final portable data format
VCF (Variant Call Format): tab-delimited text
Each line represents a position of a variant, then describes the genotype and underlying data & reliability for each sample. Extendable with annotations and additional information.
Common and readable by many current third party tools.
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
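Because each record is one tab-delimited line, a single variant can be pulled apart with plain string handling. The sketch below is a minimal illustration, not a full VCF parser. The function name is made up, it ignores the header and type declarations, and real projects would normally use an established VCF library instead.

```python
def parse_vcf_record(line, sample_names):
    """Split one tab-delimited VCF data line into its fixed, INFO, and sample fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    # FORMAT names the ':'-separated sub-fields of every sample column.
    keys = fmt.split(":")
    samples = {name: dict(zip(keys, column.split(":")))
               for name, column in zip(sample_names, fields[9:])}

    return {"chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref,
            "alt": alt.split(","), "qual": qual, "filter": filt,
            "info": info_dict, "samples": samples}


# First data line of the example above, with tabs restored.
record_line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
               "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
record = parse_vcf_record(record_line, ["NA00001", "NA00002", "NA00003"])
print(record["pos"], record["alt"], record["samples"]["NA00003"]["GT"])
```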
Additional handling
Varies significantly by project & goals.
Association testing with disease phenotypes
Modifiers
Identification of mutations segregating with disease among families
Causative mutation(s)
Copy Number Variation (CNV)
The amount of data needed to perform these sorts of tests and analysis will vary depending on characterization and type of study.
Filtering, visualization, and manipulation can be done with many third-party tools: VarSifter, Golden Helix, IGV, Galaxy, and MANY more. http://nihlibrary.nih.gov/Services/Bioinformatics/Pages/bioanalysis.aspx