
RNA-Seq: From A to T

Nick Beckloff Director, Genomics Core

Research Technology Support Facilities

Tracy Teal BEACON, MMG

Michigan State University 7/30/2014

Outline

• RNA-Seq basics
• Sample input
• Choosing a method
• Sequencing
• Library Preparation
• Validation
• RNA-Seq vs Arrays
• Future

Upcoming Events

• Wafegen seminar September*
• Methylation Boot Camp September*
• BiCEP Launch September
• iCER 10th Anniversary October

* Tentative due to availability

RNA-Seq Applications

RNA-Seq

Transcriptome Profiling

Biomarkers

sRNA Variants

Isoforms

Novel transcripts

Contents in this presentation may change at any time**

RNA-Seq Basics

RNA isolation → RNA QC → RNA-Seq library prep (fragmentation, cDNA synthesis, adapter/barcode, amplification) → library QC → sequencing → analysis

RNA-seq Project Checklist

• Budget • Repetitions

Analysis

• Quality, quantity • Species

Library

• Read Length • Depth

Sequencing

RNA-seq Guidelines

Reps

Length Coverage

Length: differential expression = 50 vs 100 bp
Coverage: Euk = 20-30M reads/sample; Prok = 10M reads/sample
Repetitions = 3+
Reps > Depth

“when moving from 10 – 30 MM reads with 2 or 3 replicates, one will pull in approximately 25% more differentially expressed (DE) genes.” Bioinformatics Volume 30, Issue 3. 301-304.

http://bioinformatics.oxfordjournals.org/

http://bioinformatics.oxfordjournals.org/content/30/3.toc





RNA-seq Guidelines

Depth vs Reps: A real world scenario

• 12 total samples (1 control and 5 conditions in duplicate)
• 1×50 bp sequencing for differential expression
• 10 MM reads per sample (12 samples in one lane of ILMN HiSeq) OR 30 MM reads per sample (4 samples in one lane of ILMN HiSeq)

What is the best scenario?

Increasing reads from 10M to 30M gives 25% more DE genes at 1.5x cost

Increasing reps provides 35% more DE genes at 1.6x cost. Bioinformatics Volume 30, Issue 3. 301-304.

http://bioinformatics.oxfordjournals.org/
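The depth-versus-replicates trade-off can be put into numbers with a short calculation. This is only a sketch using the figures quoted on the slides; the "gain per extra cost" ratio and the function names are illustrative assumptions, not a metric from the cited paper.

```python
# Figures quoted on the slides; the ratio computed below is an illustrative assumption.
options = {
    "deeper sequencing (10M -> 30M reads)": {"extra_de_genes": 0.25, "relative_cost": 1.5},
    "more replicates": {"extra_de_genes": 0.35, "relative_cost": 1.6},
}


def gain_per_extra_cost(extra_de_genes, relative_cost):
    """Fractional gain in DE genes per unit of additional cost."""
    return extra_de_genes / (relative_cost - 1.0)


def lanes_needed(n_samples, samples_per_lane):
    """Whole lanes required when one lane holds samples_per_lane samples."""
    return -(-n_samples // samples_per_lane)  # ceiling division


efficiency = {name: gain_per_extra_cost(**opt) for name, opt in options.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} extra DE genes (fraction) per unit extra cost")

# 12 samples at 10M reads fit in one lane; at 30M reads only 4 samples fit per lane.
print(lanes_needed(12, 12), "lane vs", lanes_needed(12, 4), "lanes")
```

Under these assumptions, adding replicates returns slightly more DE genes per unit of extra spending than sequencing deeper, which is the "Reps > Depth" message.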




Project Planning High Quality

Partially Degraded

Degraded

RNA input is one of the most important drivers in selecting a library prep method

• What species?
• RNA quality?
• Quantity?
  - 1-200 ng, 1-5 ug, etc.
• What RNA species?
  - mRNA, small RNA, etc.

Choosing a Method

Library Preps (PolyA selection)

• Up to 1 ug input
• RIN >7.0
• Clean output
• Only PolyA RNA
• Stranded

Features/Limitations

Library Preps (rRNA removal)

• Not 100% efficient
• Check species compatibility
• 100 ng-5 ug input*
• Excludes RNAs < 200 bp
• Can use with degraded RNA

Features/Limitations Epicentre: Ribo-Zero TM

Presenter notes: Essentially a subtractive hybridization. The rRNA removal reagent contains oligo probes complementary to rRNA sequences. Magnetic beads bind the rRNA-probe complexes and remove them from solution. The process takes ~1-1.5 hours and requires 1 ug total RNA.


http://www.epibio.com/rnamatchmaker

Check non-model organisms

BLAST rRNA seqs New Ribozero Plant

Leaf/Seed/Root!!




• Not 100% efficient
• Some junk carry over
• Find sweet spot
• May need extra reads

Limitations/Tips

Sweet Spot

Library Preps (Quantitative RNA)

• 96 distinct, 8 nt Molecular Index

• Large number of combinations (96x96=9216)

• 96 Barcodes
• 10-100 ng input
• PE Sequencing

Features/Limitations Bioo Scientific qRNA-Seq kit
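A minimal sketch of how molecular indexes can separate PCR duplicates from distinct molecules. It assumes one 8 nt index is attached to each end of a fragment (which is how the 96x96 figure arises); the read tuple layout and the function name are hypothetical and do not reflect the kit's own software.

```python
from collections import defaultdict

N_INDEXES = 96  # distinct 8 nt molecular indexes, per the slide
print(N_INDEXES * N_INDEXES)  # number of possible index pairs


def count_unique_molecules(reads):
    """Count distinct (position, index pair) combinations per gene.

    reads: iterable of (gene, start, index_5p, index_3p) tuples (hypothetical layout).
    """
    seen = defaultdict(set)
    for gene, start, index_5p, index_3p in reads:
        seen[gene].add((start, index_5p, index_3p))
    return {gene: len(keys) for gene, keys in seen.items()}


reads = [
    ("geneA", 100, "ACGTACGT", "TTGCAAGC"),
    ("geneA", 100, "ACGTACGT", "TTGCAAGC"),  # same position and index pair: PCR duplicate
    ("geneA", 100, "GGATCCAA", "TTGCAAGC"),  # same position, different index: distinct molecule
    ("geneB", 250, "ACGTACGT", "CATGCATG"),
]
molecule_counts = count_unique_molecules(reads)
print(molecule_counts)
```

Without the indexes, the three geneA reads at position 100 would be indistinguishable; with them, two original molecules are counted.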

Library Preps (Low Input RNA)

Nugen Ovation System V2

• Oligo dT and random priming amplifies both polyA AND non-polyA
• 500 pg input
• Stranded
• Prokaryotic version
• RNA < 200 bp lost


Library Preps (Low Input RNA)

SMARTer Universal Low Input

• Recommended for single cell

  - Fluidigm C1
• Input 100 pg
• Only for polyA samples
• Yields best data for high quality RNA*

*allegedly


Library Preps (sRNA)

NEXTflex small RNA Kit

• Sequencing of small RNAs and miRNAs

• Gel free adapter depletion
• No PAGE gel cuts
• Total RNA input >1 ug
• Adapters are ligated to samples


Library Preps (Targeted RNA)

SureSelect RNA capture Kit

• Targeted capture of RNA
• Total RNA input
• KB to 10 MB size
• Post-library capture methods
• Nugen uses single primer method
• No cost to design


Library Preps (Targeted Depletion)

Nugen InDA-C (Insert Dependent Adapter Cleavage)

• Customized probes for target exclusion

• Eliminates targets by cleaving sequencing adapter

• Post-library selection
• Used in Ovation Prokaryotic system
• Stranded


[Figure: InDA-C library workflow. Step labels shown: 1st strand synthesis; 2nd strand synthesis; fragmentation; adaptor ligation; strand selection; InDA-C; enrichment PCR; bead purification]


Targeted Depletion of rRNA in Vitis vinifera (cultivar Pinot Noir)

Percentage of Mapped Transcripts

Library      Cyto rRNA   Chloroplast RNA   Mito rRNA   Informative
Library 1    15.4        28.2              0.5         55.9
Library 2    18.9        24.3              0.6         56.2

% informative reads increased from 22% to 56% with InDA-C

[Figure: Mouse 18S coverage with and without InDA-C. Tracks: No InDA-C; With InDA-C; probe location]


Library Preps (Single Cell RNA)

Ovation Single Cell System

• 1-5 pg input
• Converts to cDNA, amplifies library



• Bioanalyzer traces of Ovation Single Cell RNA-Seq libraries from CD8+ resting sorted T-cells

• Functional libraries created from a single cell

• Estimated total RNA content per cell is 0.5 pg

Input       Total Reads   % Aligned   % rRNA   % Non-rRNA Single site RefSeq   Strand Retention
1 cell      3,175,287     35          1.3      67                              95.2
1 cell      2,928,789     39          1.4      69                              87.9
10 cells    3,089,680     54          4.1      69                              97.9
10 cells    2,904,477     52          3.6      69                              98.1
100 cells   3,687,320     81          2.9      66                              98.0
100 cells   3,179,519     83          2.9      65                              98.4

• Good alignment to genome: less wasted reads
• Ribosomal reads less than 5%: more informative sequence
• Excellent strand retention: improved transcriptional value

Library Preps (Single Cell RNA)

RNA-Seq Validation

RNA-Seq Validation (qPCR)

• qPCR validation of RNA-seq
• Wafegen SmartChip System
• >5,000 rxns/chip
• 100 nl volume
• Supports TaqMan and SYBR Green
• Custom targets for NGS


RNA-seq vs Arrays

• RNA-seq vs Microarrays

  - Cost is comparable
  - Microarrays only detect what is spotted
  - RNA-seq > arrays for isoforms, novel transcripts
  - Complementary to one another

• Intangible

• Reviewers prefer RNA-seq in grants

Take Home Message

Genomics researchers astonished to learn microarrays still exist!! – The Science Web

Future of RNA-seq

• Transcriptomes of Everything
• Lower Inputs
• Single cell vs multiple profiles
• Longer reads
• More novel isoforms
• More RNA subspecies

Take Home Message

General considerations for RNA-seq quantification for differential expression

Tracy K. Teal

Assistant Professor Microbiology & Molecular Genetics

July 30, 2014

Adapted from NGS RNASeq slides Author: Ian Dworkin

What are the goals of your research? Why did you generate all of the RNAseq data in the first place?

§  Transcriptome assembly (& SNP discovery)
§  Transcript discovery (variants for transcription start site, alternative splicing, etc.)
§  Quantification of (alternative) transcripts
§  Differential expression analysis across treatments

RNA-seq is generated for a number of reasons

What were once thought to be separate goals are now clearly recognized as intertwined.

§  Early work for RNA-seq tried to “mirror” the type of gene level analysis used in microarrays.

§  However, RNA-seq has demonstrated how important it is to take into account alternative transcripts, even when attempting to get “gene level” measures.


How do we put together a useful pipeline for RNAseq?

What are the steps we need to consider?

§  Quality filtering
§  Genome/transcriptome assembly
§  Mapping reads to genome/transcriptome
§  Deal with alternative transcripts (new transcriptome)?
§  Remap & count reads
§  Differential expression

Quality filtering Your analysis is only as good as your data

§  Quality control and removal of poor-quality reads (FastQC, RNA-SeQC, fastx, …)

§  Remove adapters and linkers (detect with FastQC, trim with Trimmomatic, …)
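To make the idea of quality filtering concrete, here is a minimal sketch that drops reads whose mean base quality falls below a threshold. It assumes 4-line FASTQ records with Phred+33 encoding; the threshold of 20 and the function names are illustrative assumptions, and this is not how the tools listed above are implemented.

```python
import io


def mean_phred(quality_line, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)


def filter_fastq(handle, min_mean_quality=20):
    """Yield (header, sequence, quality) for reads that pass the threshold."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip()
        if quality and mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


# Two toy reads: 'I' encodes Phred 40, '#' encodes Phred 2.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\n########\n"
)
kept = list(filter_fastq(example))
print([header for header, _, _ in kept])
```

In practice a dedicated trimmer would also clip low-quality ends and adapters rather than only discarding whole reads.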

Mapping reads Ultimately all analyses require read mapping

Image credit: Nir Friedman lab

Challenge: alternative splicing

Overview of RNA-Seq analysis pipeline for detecting differential expression

Oshlack et al., From RNA-seq reads to differential expression results, Genome Biology 2010.

Quality filter

RNA-seq Workflows and Tools. Stephen Turner. Figshare. http://dx.doi.org/10.6084/m9.figshare.662782

Pipelines for RNA-seq (geared towards splicing)

Alamancos et al. Methods to Study Splicing from RNA-Seq. http://arxiv.org/abs/1304.5952 Figshare. http://dx.doi.org/10.6084/m9.figshare.679993

The “tuxedo” protocol for RNA-seq

Trapnell C et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 7, 562–578 (2012)

Nookaew et al 2012 NAR

How should we map reads

§  Do we want to map to a reference genome (with a “splice aware” aligner)?

§  Or do we want to map to a transcriptome directly?

Mapping to the genome

How do we deal with alternative transcripts or paralogs during mapping?

§  "Splicing aware" aligners:
§  Exon first (TopHat, MapSplice, SpliceMap) (Fig 1A, Garber)
  §  Step 1 - map reads to genome
  §  Step 2 - unmapped reads are split and aligned
§  Seed & extend (GSNAP, QPALMA) (Fig 1B, Garber)
  §  k-mers from reads are mapped (the seeds) and then extended

Garber et al. 2011
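The seed & extend idea can be illustrated with a toy aligner: index every k-mer in the reference, look up k-mers from the read as seeds, then extend each seed hit to the full read and count mismatches. This is a simplified, ungapped sketch with hypothetical function names; it does not handle introns and is not how GSNAP or QPALMA are implemented.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=1):
    """Return sorted reference positions where the whole read aligns without gaps."""
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # implied alignment start of the full read
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "TTGACCGTAGGCTAACGTTAGC"
index = build_kmer_index(reference, k=4)
exact = seed_and_extend("GTAGGCTA", reference, index, k=4)
one_mismatch = seed_and_extend("GTAGGATA", reference, index, k=4)
print(exact, one_mismatch)
```

The read with one substitution is still placed because at least one of its seeds matches exactly, which is why seed methods tolerate divergence between read and reference.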

Mapping to a transcriptome

§  What might be the downside to mapping to the transcriptome? Incomplete transcriptomes can lead to errors in inferred expression levels. Potentially less well annotated.
§  For this, Burrows-Wheeler is faster than seed-based approaches (SHRiMP & Stampy), but the latter may be preferred if mapping to "distant" transcriptomes.

Which to use

§  If a (close to?) perfect match transcriptome assembly is available for mapping, Burrows-Wheeler based aligners can be much faster than seed-based methods (up to 15x faster).

§  BW based aligners have reduced performance once mismatches are considered.
  §  Exponential decrease in performance with each additional mismatch (iteratively performs perfect searches).
§  Seed methods may be more sensitive when mapping to transcriptomes of distantly related species (or high polymorphism rates).

From Garber et al. 2011
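As an illustration of the "perfect searches" that Burrows-Wheeler aligners perform, the sketch below builds a tiny FM-index and counts exact occurrences of a pattern by backward search. It is a teaching toy (naive rotation sort, full occurrence table) with hypothetical function names, not the data structure used by any production aligner.

```python
def bwt_index(text):
    """Build a tiny FM-index: first-column offsets, occurrence table, BWT length."""
    text += "$"  # sentinel, sorts before A/C/G/T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ, len(bwt)


def count_exact(pattern, first, occ, n):
    """Backward search: number of exact occurrences of pattern in the text."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


first, occ, n = bwt_index("ACGTACGA")
print(count_exact("ACG", first, occ, n), count_exact("CGT", first, occ, n))
```

Each search costs one step per pattern character regardless of reference size, which is where the speed comes from. Allowing a mismatch means repeating such exact searches for the alternative bases at each position, so the work grows quickly with every additional mismatch.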

Counting

§  One of the most difficult issues has been how to count reads.

§  What are some of the issues that we need to account for during counting of reads?

Counting

§  We are interested in transcript abundance.
§  But we need to take into account a number of things.
  §  How many reads in the sample
  §  Length of transcripts
  §  GC content and sequencing bias

Counting

§  RPKM (Reads Per Kilobase of transcript per Million mapped reads) – Mortazavi et al 2008

§  FPKM (Fragments Per Kilobase of transcript per Million mapped reads). Avoids double counting in paired-end sequencing.

Normalizing a transcript's read count by both its length and the total number of mapped reads in the sample

Garber et al. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods, Jun 2011
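The RPKM/FPKM normalization described above amounts to one line of arithmetic. The sketch below uses made-up counts, and the fragment-counting helper assumes that both mates of a pair share a read name; both are illustrative assumptions.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count / (transcript_length_bp / 1e3) / (total_mapped_reads / 1e6)


def count_fragments(read_names):
    """Count fragments: both mates of a pair share a name and are counted once."""
    return len(set(read_names))


# Same read count, different transcript lengths (made-up numbers).
long_transcript = rpkm(500, 2000, 10_000_000)
short_transcript = rpkm(500, 1000, 10_000_000)
print(long_transcript, short_transcript)

# FPKM uses the same formula with fragments instead of reads.
fragments = count_fragments(["frag1", "frag1", "frag2", "frag2", "frag3"])
fpkm_value = rpkm(fragments, 1500, 1_000_000)
print(fragments, fpkm_value)
```

The shorter transcript gets twice the value for the same raw count, which is the length correction at work.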

Accounting for multiple isoforms

§  Only count reads that map uniquely to an isoform. This can be very problematic when isoforms do not have unique exons.

§  So-called "isoform-expression" methods (Cufflinks, MISO) model the uncertainty parametrically (often using MLE). The mix of isoforms that best explains the data (highest joint probability) is taken as the best estimate. How this is handled differs a great deal between models.

Garber et al. 2011

Trapnell C et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013 Jan;31(1):46-53
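The maximum-likelihood idea behind isoform-expression methods can be sketched with a small expectation-maximization loop that shares ambiguous reads among compatible isoforms. This is a deliberately simplified illustration: it uses full transcript length as the effective length, ignores fragment-size and bias models, and the names and read counts are hypothetical. It is not the Cufflinks or MISO model.

```python
def estimate_read_fractions(compatibility, lengths, n_iter=200):
    """EM estimate of the fraction of reads originating from each isoform.

    compatibility: list of tuples, each holding the isoforms one read is compatible with.
    lengths: dict of isoform name -> length in bp (used as effective length, an assumption).
    """
    isoforms = sorted(lengths)
    theta = {t: 1.0 / len(isoforms) for t in isoforms}  # start from a uniform mix
    for _ in range(n_iter):
        expected = {t: 0.0 for t in isoforms}
        # E-step: split each read among its compatible isoforms
        for read_isoforms in compatibility:
            weights = {t: theta[t] / lengths[t] for t in read_isoforms}
            total = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: new fractions are the expected share of reads
        n_reads = len(compatibility)
        theta = {t: expected[t] / n_reads for t in isoforms}
    return theta


# Isoform A has exons 1-2-3; isoform B skips exon 2 (made-up lengths and counts).
lengths = {"A": 3000, "B": 2000}
reads = [("A", "B")] * 60 + [("A",)] * 30 + [("B",)] * 10
fractions = estimate_read_fractions(reads, lengths)
print(fractions)
```

Counting only the 40 uniquely assignable reads would discard most of the data; the EM loop instead uses the unique reads to decide how the 60 shared reads are divided.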

Differential expression

§  DESeq (http://www.ncbi.nlm.nih.gov/pubmed/20979621)
§  edgeR
§  EBSeq (RSEM/EBSeq)
§  RSEM (http://deweylab.biostat.wisc.edu/rsem/)
§  eXpress (http://bio.math.berkeley.edu/eXpress/overview.html)
§  BEERS simulation pipeline (http://www.cbil.upenn.edu/BEERS/)
§  DEXSeq (http://bioconductor.org/packages/release/bioc/html/DEXSeq.html)
§  limma (voom)
§  HTSeq (Python library), works with DESeq

Nookaew et al 2012 NAR

Differentially expressed genes based on software for quantification

Differentially expressed genes based on software for mapping

Problems with Cufflinks and Cuffdiff? Reproducibility…

§  http://seqanswers.com/forums/showthread.php?t=20702
§  http://seqanswers.com/forums/showthread.php?t=17662
§  http://seqanswers.com/forums/showthread.php?t=23962
§  http://seqanswers.com/forums/showthread.php?t=21020
§  http://seqanswers.com/forums/showthread.php?t=21708
§  http://www.biostars.org/p/6317/

So, what to do?

Example workflows and tutorials

§  Ian Dworkin’s NGS course protocols http://ged.msu.edu/angus/tutorials-2013/index.html
§  Bacterial RNA-Seq workflow from Ben Johnson & Rob Abramovitch http://www.abramovitchlab.com/#/rna-seq-computational-methods/
§  Canadian Bioinformatics workshops http://bioinformatics.ca/workshops/2013/informatics-rna-sequence-analysis-2013
§  Trinity and Tuxedo tutorials http://trinityrnaseq.sourceforge.net/rnaseq_workshop.html
§  Samtools for variant calling

The “tuxedo” protocol for RNA-seq

Trapnell et al 2012

Overviews of RNA-Seq

§  Garber et al, Computational methods for transcriptome annotation and quantification using RNA-seq, Nat Methods, Jun 2011

§  http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf

Aligning to a transcriptome or a genome

§  Aligning to a genome, you have to account for the different splice variations

§  Aligning to a transcriptome, you have the different isoforms, so the mapping is more straightforward

§  However, you might have to assemble your own transcriptome

How to assemble multiple alternative spliced transcripts?

In the presence of AS, conventional assembly may be erroneous, ambiguous, or truncated.

[Figure: exons 1, 2, 3 with overlapping reads; example assemblies labeled correct or truncated]

Need to use splice-aware assemblers

•  Cufflinks (most commonly used)
•  Scripture
•  Trinity
•  Trans-ABySS
•  GRIT
