RNA-Seq: From A to T
Nick Beckloff Director, Genomics Core
Research Technology Support Facilities
Tracy Teal BEACON, MMG
Michigan State University 7/30/2014
Outline
• RNA-Seq basics • Sample input • Choosing a method • Sequencing • Library Preparation • Validation • RNA-Seq vs Arrays • Future
Upcoming Events
• WaferGen seminar September* • Methylation Boot Camp September* • BiCEP Launch September • iCER 10th Anniversary October
* Tentative due to availability
RNA-Seq Applications
RNA-Seq
• Transcriptome Profiling • Biomarkers • sRNA • Variants • Isoforms • Novel transcripts
Contents in this presentation may change at any time**
RNA-Seq Basics
RNA Isolation -> RNA QC -> RNA-Seq Library prep -> Library QC -> Sequencing -> Analysis
Library prep steps: Fragmentation, cDNA synthesis, Adapter/Barcode, Amplification
RNA-seq Project Checklist
Analysis
• Budget • Repetitions
Library
• Quality, quantity • Species
Sequencing
• Read Length • Depth
RNA-seq Guidelines
Reps
Length Coverage
Length: Differential Expression = 50 vs 100 bp
Coverage: Euk = 20-30M reads/sample, Prok = 10M reads/sample
Repetitions = 3+
Reps > Depth
“when moving from 10 – 30 MM reads with 2 or 3 replicates, one will pull in approximately 25% more differentially expressed (DE) genes.” Bioinformatics Volume 30, Issue 3. 301-304.
RNA-seq Guidelines
Depth vs Reps: A real world scenario
12 total samples (1 control and 5 conditions in duplicate)
1×50 bp sequencing for differential expression
10 MM reads per sample (12 samples in one lane of ILMN HiSeq) OR 30 MM reads per sample (4 samples in one lane of ILMN HiSeq).
What is the best scenario?
Increasing reads from 10M to 30M gives 25% more DE genes at 1.5x cost
Increasing reps provides 35% more DE genes at 1.6x cost
Bioinformatics Volume 30, Issue 3. 301-304.
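The depth-versus-replicates trade-off can be checked with a few lines of arithmetic. This is a rough sketch that uses only the figures quoted on the slide (about 25% more DE genes at 1.5x cost for deeper sequencing, about 35% more at 1.6x cost for more replicates). The "gain per unit of extra cost" ratio and all names are illustrative assumptions, not part of the cited paper.

```python
# Figures as quoted on the slide (Bioinformatics 30(3):301-304).
# The ratio computed below is an illustrative metric, not from the source.
options = {
    "deeper sequencing (10M -> 30M reads)": {"de_gain": 0.25, "relative_cost": 1.5},
    "more replicates": {"de_gain": 0.35, "relative_cost": 1.6},
}


def gain_per_extra_cost(de_gain, relative_cost):
    """Fractional gain in DE genes per unit of additional cost."""
    return de_gain / (relative_cost - 1.0)


# Compare the two strategies on the same scale.
ratios = {name: gain_per_extra_cost(**opt) for name, opt in options.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} extra DE fraction per unit of extra cost")

best = max(ratios, key=ratios.get)
print("Better value under these assumptions:", best)
```

Under these numbers the replicate option comes out ahead, which matches the "Reps > Depth" guideline. Real pricing depends on library prep and lane costs at a given facility.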
Project Planning
[Figure panels: High Quality, Partially Degraded, Degraded]
RNA input is one of the most important drivers in selecting a library prep method
• What species?
• RNA quality?
• Quantity?
  - 1-200 ng, 1-5 ug, etc.
• What RNA species?
  - mRNA, small RNA, etc.
Choosing a Method
Library Preps (PolyA selection)
• Up to 1 ug input • RIN >7.0 • Clean output • Only PolyA RNA • Stranded
Features/Limitations
Library Preps (rRNA removal)
• Not 100% efficient • Check species compatibility • 100 ng-5 ug input* • Excludes RNAs < 200bp • Can use with degraded RNA
Features/Limitations
Epicentre: Ribo-Zero™
Library Preps (rRNA removal)
http://www.epibio.com/rnamatchmaker
Check non-model organisms
BLAST rRNA seqs New Ribozero Plant
Leaf/Seed/Root!!
Library Preps (rRNA removal)
• Not 100% efficient • Some junk carry over • Find sweet spot • May need extra reads
Limitations/Tips
Sweet Spot
Library Preps (Quantitative RNA)
• 96 distinct, 8 nt Molecular Index
• Large number of combinations (96x96=9216)
• 96 Barcodes • 10-100 ng input • PE Sequencing
Features/Limitations Bioo Scientific qRNA-Seq kit
Library Preps (Low Input RNA)
Nugen Ovation System V2
• Oligo dT and random priming amplifies both polyA AND non-polyA
• 500 pg input • Stranded • Prokaryotic version • RNA < 200 bp lost
Features/Limitations
Library Preps (Low Input RNA)
SMARTer Universal Low Input
• Recommended for single cell
  - Fluidigm C1
• Input 100 pg • Only for polyA samples • Yields best data for high quality RNA*
*allegedly
Features/Limitations
Library Preps (sRNA)
NEXTflex Small RNA Kit
• Sequencing of small RNAs and miRNAs
• Gel free adapter depletion • No PAGE gel cuts • Total RNA input >1 ug • Adapters are ligated to samples
Features/Limitations
Library Preps (Targeted RNA)
SureSelect RNA capture Kit
• Targeted capture of RNA • Total RNA input • KB to 10 MB size • Post-library capture methods • Nugen uses single primer method • No cost to design
Features/Limitations
Library Preps (Targeted Depletion)
Nugen InDA-C (Insert Dependent Adapter Cleavage)
• Customized probes for target exclusion
• Eliminates targets by cleaving sequencing adapter
• Post-library selection • Used in Ovation Prokaryotic system • Stranded
Features/Limitations
[Figure: InDA-C workflow diagram with the labels Bead purification, InDA-C, Fragmentation, Enrichment PCR, Strand selection, Adaptor ligation, 2nd strand synthesis, 1st strand synthesis]
Library Preps (Targeted Depletion)
Targeted depletion of rRNA in Vitis vinifera (cultivar Pinot Noir)
[Figure: stacked bar chart of percentage of mapped transcripts (0-100%) for Library 1 and Library 2, split into Cyto rRNA, Chloroplast RNA, Mito rRNA, and Informative. Bar labels: 15.4 / 28.2 / 0.5 / 55.9 and 18.9 / 24.3 / 0.6 / 56.2]
% informative reads increased from 22% to 56% with InDA-C
Mouse 18S Coverage with and without InDA-C
[Figure: coverage tracks labeled No InDA-C and With InDA-C, with probe locations marked]
Library Preps (Targeted Depletion)
Library Preps (Single Cell RNA)
Ovation Single Cell System
• 1-5 pg input • Converts to cDNA, amplifies library
Features/Limitations
• Bioanalyzer traces of Ovation Single Cell RNA-Seq libraries from CD8+ resting sorted T-cells
• Functional libraries created from a single cell
• Estimated total RNA content per cell is 0.5 pg
Input       Total Reads   % Aligned   % rRNA   % Non-rRNA Single site RefSeq   Strand Retention
1 cell      3,175,287     35          1.3      67                              95.2
1 cell      2,928,789     39          1.4      69                              87.9
10 cells    3,089,680     54          4.1      69                              97.9
10 cells    2,904,477     52          3.6      69                              98.1
100 cells   3,687,320     81          2.9      66                              98.0
100 cells   3,179,519     83          2.9      65                              98.4
• Good alignment to genome: less wasted reads
• Ribosomal reads less than 5%: more informative sequence
• Excellent strand retention: improved transcriptional value
Library Preps (Single Cell RNA)
RNA-Seq Validation
RNA-Seq Validation (qPCR)
• qPCR validation of RNA-seq • WaferGen SmartChip System • >5,000 rxns/chip • 100 nl volume • Supports TaqMan and SYBR Green • Custom targets for NGS
Features/Limitations
RNA-seq vs Arrays
• RNA-seq vs Microarrays
  - Cost is comparable
  - Microarrays only detect what is spotted
  - RNA-seq > arrays for isoforms, novel transcripts
  - Complementary to one another
• Intangible
• Reviewers prefer RNA-seq in grants
Take Home Message
Genomics researchers astonished to learn microarrays still exist!! – The Science Web
Future of RNA-seq
• Transcriptomes of Everything • Lower Inputs • Single cell vs multiple profiles • Longer reads • More novel isoforms • More RNA subspecies
Take Home Message
General considerations for RNA-seq quantification for differential expression
Tracy K. Teal
Assistant Professor Microbiology & Molecular Genetics
July 30, 2014
Adapted from NGS RNASeq slides Author: Ian Dworkin
What are the goals of your research? Why did you generate all of the RNAseq data in the first place?
§ Transcriptome assembly (& SNP discovery)
§ Transcript discovery (variants for transcription start site, alternative splicing, etc.)
§ Quantification of (alternative) transcripts
§ Differential expression analysis across treatments
RNA-seq is generated for a number of reasons
What were once thought to be separate goals are now clearly recognized as intertwined.
§ Early work for RNA-seq tried to “mirror” the type of gene level analysis used in microarrays.
§ However, RNA-seq has demonstrated how important it is to take into account alternative transcripts, even when attempting to get “gene level” measures.
How do we put together a useful pipeline for RNAseq?
What are the steps we need to consider?
§ Quality filtering
§ Genome/transcriptome assembly
§ Mapping reads to genome/transcriptome
§ Deal with alternative transcripts (new transcriptome)?
§ Remap & count reads
§ Differential expression
Quality filtering Your analysis is only as good as your data
§ Quality control and removal of poor-quality reads (FASTQC, RNASeQC, fastx, …)
§ Remove adapters and linkers (FASTQC, Trimmomatic, …)
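To make the quality-filtering step concrete, here is a minimal sketch of 3' quality trimming on a single FASTQ record. It assumes Phred+33 quality encoding. The function names and the thresholds (minimum quality 20, minimum length 30) are illustrative assumptions, and the tools named above apply more elaborate rules such as sliding windows and adapter matching.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Drop bases from the 3' end until one meets min_quality."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def passes_filter(sequence, min_length=30):
    """Keep a read only if enough bases survive trimming."""
    return len(sequence) >= min_length


# Toy record: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIIII##")
print(trimmed_seq, trimmed_qual, passes_filter(trimmed_seq, min_length=5))
```

Trimming before mapping matters because low-quality tails add mismatches, which reduces the number of reads an aligner can place.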
Mapping reads Ultimately all analyses require read mapping
Image credit: Nir Friedman lab
Challenge: alternative splicing
Overview of RNA-Seq analysis pipeline for detecting differential expression
Oshlack et al., From RNA-seq reads to differential expression results, Genome Biology 2010.
Quality filter
RNA-seq Workflows and Tools. Stephen Turner. Figshare. http://dx.doi.org/10.6084/m9.figshare.662782
Pipelines for RNA-seq (geared towards splicing)
Alamancos et al. Methods to Study Splicing from RNA-Seq http://arxiv.org/abs/1304.5952 Figshare. http://dx.doi.org/10.6084/m9.figshare.679993
The “tuxedo” protocol for RNA-seq
Trapnell C et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562–578 (2012)
Nookaew et al 2012 NAR
How should we map reads
§ Do we want to map to a reference genome (with a “splice aware” aligner)?
§ Or do we want to map to a transcriptome directly?
Mapping to the genome
How do we deal with alternative transcripts or paralogs during mapping?
§ "Splicing aware" aligners:
§ Exon first (TopHat, MapSplice, SpliceMap), Fig. 1A Garber
  § Step 1: map reads to genome
  § Step 2: unmapped reads are split and aligned
§ Seed & extend (GSNAP, QPALMA), Fig. 1B Garber
  § k-mers from reads are mapped (the seeds), and then extended
Garber et al. 2011
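The seed-and-extend idea can be sketched in a few lines: index every k-mer of the reference, look up k-mers from the read (the seeds), and extend each hit to a full-length comparison. This toy version is ungapped and not splice-aware. It is an illustration under those assumptions, not how GSNAP or QPALMA are implemented, and all names and parameters are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for the best ungapped placement, or None."""
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            # Extension step: compare the full read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "TTGACCAGTACGGATCCATGCAAGT"
index = build_kmer_index(reference, k=4)
read = reference[5:11] + "T" + reference[12:17]  # one substituted base
print(seed_and_extend(read, reference, index, k=4))
```

Because only one seed needs to match exactly, the read is still placed despite the substitution. This tolerance is why seed methods may cope better with divergent references.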
Mapping to a transcriptome
§ What might be the downside to mapping to the transcriptome? Incomplete transcriptomes can lead to errors in inferred expression levels. Potentially less well annotated.
§ For this, Burrows-Wheeler aligners are faster than seed-based approaches (SHRiMP & Stampy), but the latter may be preferred if mapping to "distant" transcriptomes.
Which to use
§ If a (close to?) perfect match transcriptome assembly is available for mapping, Burrows-Wheeler based aligners can be much faster than seed-based methods (up to 15x faster).
§ BW based aligners have reduced performance once mismatches are considered.
  § Exponential decrease in performance with each additional mismatch (iteratively performs perfect searches).
§ Seed methods may be more sensitive when mapping to transcriptomes of distantly related species (or with high polymorphism rates).
From Garber et al. 2011
Counting
§ One of the most difficult issues has been how to count reads.
§ What are some of the issues that we need to account for during counting of reads?
Counting
§ We are interested in transcript abundance.
§ But we need to take into account a number of things:
  § How many reads in the sample
  § Length of transcripts
  § GC content and sequencing bias
Counting
§ RPKM (Reads Per Kilobase of transcript per Million mapped reads) – Mortazavi et al 2008
§ FPKM (Fragments Per Kilobase of transcript per Million mapped reads). Avoids double counting in paired-end sequencing.
Normalizing a transcript's read count by both its length and the total number of mapped reads in the sample
Garber et al. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods, Jun 2011
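As a worked example of the normalization described above, RPKM divides a transcript's read count by its length in kilobases and by the number of mapped reads in millions. The sketch below uses hypothetical function names and made-up numbers. FPKM uses the same arithmetic with fragments (read pairs) in place of reads.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Same formula, counting each paired-end fragment once."""
    return rpkm(fragment_count, transcript_length_bp, total_mapped_fragments)


# Made-up example: 500 reads on a 2,000 bp transcript, 20 million mapped reads.
# 500 / 2 kb / 20 M = 12.5
print(rpkm(500, 2000, 20_000_000))
```

A transcript twice as long with twice the reads gets the same value, which is the point of the length term.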
Accounting for multiple isoforms
§ Only count reads that map uniquely to an isoform. This can be very problematic when isoforms do not have unique exons.
§ So-called "isoform-expression" methods (Cufflinks, MISO) model the uncertainty parametrically (often using MLE). The mix of isoforms that best explains the data (highest joint probability) is taken as the best estimate. How this is handled differs a great deal between models.
Garber et al. 2011
Trapnell C et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013 Jan;31(1):46-53
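One way to see what "the best mix of isoforms" means is a toy expectation-maximization (EM) estimate. Each read lists the isoforms it is compatible with, and the algorithm looks for the isoform fractions that maximize the likelihood of the reads. This sketch is a deliberately simplified assumption: it ignores transcript length, fragment bias, and sequencing error, and it is not the actual model used by Cufflinks or MISO. All names are hypothetical.

```python
def em_isoform_fractions(read_compat, n_isoforms, n_iter=200):
    """Estimate isoform fractions from read compatibility sets.

    read_compat: list of non-empty tuples, each holding the isoform
    indices a read could have come from.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # start from a uniform mix
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms by current theta.
        for compatible in read_compat:
            total = sum(theta[i] for i in compatible)
            for i in compatible:
                expected[i] += theta[i] / total
        # M-step: new fractions are the normalized expected read counts.
        theta = [e / len(read_compat) for e in expected]
    return theta


# Toy data: 6 reads fall in shared exons, 3 are unique to isoform 0, 1 to isoform 1.
reads = [(0, 1)] * 6 + [(0,)] * 3 + [(1,)]
print(em_isoform_fractions(reads, n_isoforms=2))
```

In this toy case the shared reads end up split in the same 3:1 proportion as the unique reads, so nothing is discarded. Counting only uniquely mapped reads would throw away six of the ten reads.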
Differential expression
§ DESeq (http://www.ncbi.nlm.nih.gov/pubmed/20979621)
§ edgeR
§ EBSeq (RSEM/EBSeq)
§ RSEM (http://deweylab.biostat.wisc.edu/rsem/)
§ eXpress (http://bio.math.berkeley.edu/eXpress/overview.html)
§ BEERS simulation pipeline (http://www.cbil.upenn.edu/BEERS/)
§ DEXSeq (http://bioconductor.org/packages/release/bioc/html/DEXSeq.html)
§ limma (voom)
§ HTSeq (Python library), works with DESeq
Nookaew et al 2012 NAR
Differentially expressed genes based on software for quantification
Differentially expressed genes based on software for mapping
Problems with Cufflinks and Cuffdiff? Reproducibility…
§ http://seqanswers.com/forums/showthread.php?t=20702
§ http://seqanswers.com/forums/showthread.php?t=17662
§ http://seqanswers.com/forums/showthread.php?t=23962
§ http://seqanswers.com/forums/showthread.php?t=21020
§ http://seqanswers.com/forums/showthread.php?t=21708
§ http://www.biostars.org/p/6317/
So, what to do?
Example workflows and tutorials
§ Ian Dworkin’s NGS course protocols http://ged.msu.edu/angus/tutorials-2013/index.html
§ Bacterial RNA-Seq workflow from Ben Johnson & Rob Abramovitch http://www.abramovitchlab.com/#/rna-seq-computational-methods/
§ Canadian Bioinformatics Workshops http://bioinformatics.ca/workshops/2013/informatics-rna-sequence-analysis-2013
§ Trinity and Tuxedo tutorials http://trinityrnaseq.sourceforge.net/rnaseq_workshop.html
§ Samtools for variant calling
The “tuxedo” protocol for RNA-seq
Trapnell et al 2012
Overviews of RNA-Seq
§ Garber et al, Computational methods for transcriptome annotation and quantification using RNA-seq, Nat Methods, Jun 2011
§ http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf
Aligning to a transcriptome or a genome
§ Aligning to a genome, you have to account for the different splice variations
§ Aligning to a transcriptome, you have the different isoforms, so the mapping is more straightforward
§ However, you might have to assemble your own transcriptome
How to assemble multiple alternative spliced transcripts?
[Figure: transcripts 1, 2, 3 with overlapping reads; candidate assemblies labeled correct or truncated]
In the presence of AS, conventional assembly may be erroneous, ambiguous, or truncated.
Need to use splice-aware assemblers
• Cufflinks (most commonly used) • Scripture • Trinity • Trans-ABySS • GRIT