MGL USERS GROUP 07-16-14 RNASEQ LIBRARY PREPARATION AND ANALYSIS BASICS

WHAT DOES RNASEQ PROVIDE

• RNAseq is a quantitative experiment
  • It measures the relative amounts of the different transcripts in a sample; the RNA is fragmented (if large) and sequenced

• The number of reads you obtain for a given transcript is proportional to the amount of it present in the original sample

• The number of reads recovered per transcript scales directly with
  • Number of copies of that transcript
  • Size of the transcript (longer => more fragments)

• Higher abundance transcripts may provide differential splicing information
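To make the scaling concrete, here is a minimal sketch that assumes reads are drawn uniformly from the fragment pool, so a transcript's expected share of reads is proportional to copies × length. The transcript names and numbers are hypothetical.

```python
def expected_read_fractions(transcripts):
    """Expected share of reads per transcript.

    transcripts: dict mapping name -> (copies, length_nt).
    Assumes each transcript contributes fragments in proportion to
    copies * length, and that reads sample those fragments uniformly.
    """
    weights = {name: copies * length
               for name, (copies, length) in transcripts.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical sample: same copy number, but one transcript is 3x longer.
sample = {"short_tx": (100, 1000), "long_tx": (100, 3000)}
fractions = expected_read_fractions(sample)
# short_tx gets 0.25 of the reads, long_tx gets 0.75
```

Equal copy numbers still give unequal read counts, which is why abundances are later normalized by transcript length.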

WHAT DOES RNASEQ NOT PROVIDE?

• A measure of translational or protein activity.
  • Just because a gene is highly transcribed does not mean that its message is fully engaged with ribosomes
  • Ribosome profiling is an excellent supplement to this end

• A good view of mutations
  • Since read recovery scales with the abundance of a given transcript, some messages are highly (or overly) recovered while others are only sparsely sampled.
  • Mutations may affect the ability of an RNA to accumulate in order to be sequenced.
  • Ability to identify mutations is dependent on sampling a mutation many times.
  • DNA sequencing is better suited to finding mutations in genes because its coverage is not tied to expression.

RNASEQ LIBRARY PREPARATION CONSIDERATIONS

• Ribosomal Depletion
  • Ribo-Zero Gold kit
  • 98-99% of isolated RNA is rRNA and will be the most abundant species in a sequencing experiment otherwise

• Amount
  • Ideal input for library preparation is 200 ng of ribosome-depleted RNA
  • 1 to 5 µg of total RNA prior to depletion will work as well for transcriptome sequencing experiments

• Small RNA requires special handling separate from transcriptome experiments

LIBRARY PREPARATION

• Ribosome-depleted RNA is fragmented if large (mRNA, etc.)
  • Tightly controlled endonucleolytic (RNase III) digestion
  • Cleanup

• Small RNA differences
  • Ensure 5’ phosphate, 3’ hydroxyl

• End ligation of adapter sequences
• Cleanup (small RNA involves additional size selection)
• Reverse transcription
• Cleanup
• PCR amplification with barcoded full adapters
• Cleanup & quantitation for multiplexing/bead preparation

SAMPLE STATE AFTER LIBRARY PREPARATION

• From this point, sample is almost identical to a DNA fragment library

• Adapter sequences vary slightly from DNA-based libraries, but have identical barcoding schemes.

• Samples may be read from both ends; for small RNA, reading from one end may be all that is needed.

• If multiplexing, differently barcoded samples are pooled for bead preparation.

BEAD PREPARATION AND DEPOSITION

• Beads are enriched with single fragment sequences by emulsion PCR and deposited into lanes on a flowcell.

• Target deposition:
  • ~230k beads per imaging panel
  • ~150 million beads per lane

HOW MANY READS ARE NEEDED??

• A common and debated point:
  • Read count affects how much sampling of the RNA population is occurring
  • Highly abundant RNAs are fairly easy to see and therefore quantitate with low read counts
  • Lower abundance species are going to be harder to catch by random sampling of the population, therefore more difficult to quantitate accurately
  • At very low sampling (read) counts, you may not see some trace RNAs at all, simply by not getting them by chance!

• Target read counts are going to be informed by what you are after in the experiment
  • Complexity of transcriptome (Bacteria vs Human)?
  • Big changes in expression? Low counts needed
  • Precise measurement of low abundance species? Maybe more
  • Appearance of a minor splice isoform species in the presence of WT isoforms? You may need a lot more

• Basic rule of thumb: 40 million reads for a good picture of Human transcriptome-sized complexity
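The "not getting them by chance" point can be put into rough numbers. The sketch below assumes read sampling is approximately Poisson: if a transcript accounts for a fraction f of all reads and N reads are sequenced, the expected count is N × f. The function name and the example fraction are illustrative assumptions, not values from any particular experiment.

```python
import math


def prob_at_least(k, total_reads, read_fraction):
    """Probability of observing at least k reads from one transcript.

    Assumes the read count for the transcript is Poisson distributed
    with mean total_reads * read_fraction.
    """
    lam = total_reads * read_fraction
    below_k = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                  for i in range(k))
    return 1.0 - below_k


# Hypothetical trace RNA contributing 1 in 10 million reads.
trace_fraction = 1e-7
p_seen_at_1m = prob_at_least(1, 1_000_000, trace_fraction)    # roughly 0.095
p_seen_at_40m = prob_at_least(1, 40_000_000, trace_fraction)  # roughly 0.98
```

Seeing a species at least once is a much lower bar than quantitating it accurately; raising k (for example to 10 or more reads) gives a feel for how much deeper one would need to go.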

AFTER SEQUENCING PROCEDURE

• BEFORE MAPPING!
  • Assess quality of the run.
    • Was the balance between signals fairly even?
    • Was there poor imaging or other aberrations in a cycle?
    • Was the number of reads obtained what you expected?

  • Scrub resulting sequences before trying to map to your transcriptome.
    • Remove contaminant high abundance sequences: rRNA, tRNA, adapters
    • Particularly for small RNA experiments, adapter sequences may be found joined to the ends of your targets and need to be trimmed off
    • Trim reads based on quality (clip off trailing low quality calls that may lead to poor or incorrect mapping)

• It’s critical to have the best sequences to map without possible mistakes in order to get the highest quality quantitation in the end!
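As a rough illustration of the two trimming steps, the sketch below assumes base-space reads with Phred+33 encoded quality strings (as in FASTQ) and an exact-match adapter search. Real trimmers tolerate mismatches and partial adapters, and colorspace data needs dedicated tools. The adapter string and read are made-up placeholders.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Clip trailing bases whose Phred quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Placeholder read: 10 nt of insert followed by an 8 nt adapter.
adapter = "CTGTAGGC"
read_seq = "ACGTACGTAC" + adapter
read_qual = "IIIIIIII##" + "IIIIIIII"  # 'I' = Q40, '#' = Q2

seq_t, qual_t = trim_adapter(read_seq, read_qual, adapter)
seq_t, qual_t = trim_low_quality_tail(seq_t, qual_t)
# seq_t is now "ACGTACGT": adapter removed, then the two Q2 calls clipped
```

The adapter is removed first so that high-quality adapter bases do not shield low-quality calls at the true end of the insert.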

MAPPING PROCEDURE

• Depends slightly on your data source and state
  • Illumina sequences are often mapped and quantitated through the TopHat/Bowtie -> Cufflinks pipeline.
  • SOLiD colorspace data may be mapped through LifeScope or Novocraft, then quantitated by any tool at that point.

• RNAseq mapping typically involves multiple passes:
  • First pass: compare your sequence to known spliced RNA sequences for your genome
  • Second pass: if not identified there, map to the genome, first without allowing for large gaps, then allowing for large gaps.

POST MAPPING

• At this point, generally data is in SAM or BAM format.

• Information there includes the read ID, where on the genome it mapped, where the pair is if it’s a paired end experiment, sequence, call qualities, and other information.
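For orientation, here is a minimal sketch of pulling those pieces out of one SAM alignment line. The eleven mandatory tab-separated fields and the flag bits used are those defined in the SAM specification; the example read itself is made up, and real work would normally go through an existing library rather than hand parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read ID
    flag: int    # bitwise flags (paired, unmapped, strand, ...)
    rname: str   # reference sequence the read mapped to
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description (matches, gaps, ...)
    rnext: str   # reference of the mate ("=" means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template (fragment) length
    seq: str     # read sequence
    qual: str    # per-base call qualities


def parse_sam_line(line):
    """Parse the 11 mandatory fields; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


# Made-up paired-end alignment record.
line = "read1\t99\tchr1\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII"
rec = parse_sam_line(line)
is_paired = bool(rec.flag & 0x1)    # read is part of a pair
is_unmapped = bool(rec.flag & 0x4)  # read did not map
```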

VISUALIZATION OF DATA

• IGV

• UCSC Genome Browser

• Various other tools

COMPARING ABUNDANCES

• Abundance of a given species is typically expressed as RPKM or FPKM
  • Reads Per Kilobase per Million; Fragments Per Kilobase per Million
  • Fragments is preferred for paired-end data as it only counts those where both ends of your read map to the same transcript, and doesn’t “double count” an individual biological fragment
    • One molecule “spot” on the flow cell = 1 fragment comprised of one forward and one reverse read
    • Added accuracy control

• Normalizes basic counts per species by size of the transcript (how likely you are to sample it from a pool of fragments), and total size of the experiment
  • Allows for cross-experiment/sample comparisons
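The normalization itself is simple arithmetic: divide the fragment count by transcript length in kilobases and by the total mapped fragments in millions. A minimal sketch with hypothetical numbers follows; tools such as Cufflinks estimate these values with more elaborate models, so treat this as the textbook definition only.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript
# in an experiment with 20 million mapped fragments.
value = fpkm(500, 2_000, 20_000_000)  # 12.5
```

RPKM uses the same formula with read counts in place of fragment counts.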

SCATTER PLOT FPKM(1) VS FPKM(2)

• Compare abundances of all RNAs directly

• Usually log scale as FPKM values range greatly

• Should see mainly a 1:1 correspondence of abundance in most RNAs

[Figure: scatter plot of FPKM values, Condition 1 vs Condition 2]

SCATTER PLOT FPKM(1) VS FPKM(2)

• Most analysis will be interested in what is significantly altered between conditions

• Need to identify globally what has changed and how much confidence we have in those determinations!

[Figure: scatter plot of FPKM values, Condition 1 vs Condition 2]

COMPARING ABUNDANCES

• FPKMs for the same species of RNA can be compared for expression level differences
  • Typically, a ratio of the FPKM values is taken and log transformed (usually log base 2 scale)
  • Log transformation eases statistical analysis of the population of ratios across all species observed in an experiment
  • What’s unusual/real vs what’s within experimental noise

• Assign p-values to the modeled significance of the changes being observed
  • Basically: how likely is this to be a sampling difference explainable by chance?

• Apply RNAseq-specific corrections to derive a “q-value” (corrected p-value)
  • Size of transcripts can affect sampling, replicate measurements if in the experiment, etc.

• Apply some cutoff value to the q-value and look at RNAs exceeding that threshold
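A small sketch of two of these steps. The log2 ratio uses a pseudocount (an assumption added here so that zero FPKM values do not break the ratio). The correction shown is the generic Benjamini-Hochberg false discovery rate procedure, which is one common way to turn p-values into q-values; it stands in for, and is not necessarily identical to, the corrections a given RNAseq tool applies. The p-values are made up.

```python
import math


def log2_ratio(fpkm_1, fpkm_2, pseudocount=1.0):
    """log2 of condition 2 over condition 1, with a pseudocount."""
    return math.log2((fpkm_2 + pseudocount) / (fpkm_1 + pseudocount))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


change = log2_ratio(3.0, 15.0)           # (15 + 1) / (3 + 1) = 4 -> 2.0
p_vals = [0.01, 0.04, 0.03, 0.20]        # made-up p-values
q_vals = benjamini_hochberg(p_vals)
significant = [q <= 0.05 for q in q_vals]  # cutoff applied to q, not p
```

With these numbers, three of the four p-values are below 0.05, but only the first survives the correction.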

VOLCANO PLOT

• Log ratio between FPKMs plotted against q-value

• Even though apparent ratios between FPKM values may be large, they may simply reflect sampling explainable by chance.
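The coordinates of a volcano plot can be sketched as below: the log2 ratio on one axis and, by common convention, -log10 of the q-value on the other, so that more significant changes sit higher. The thresholds (2-fold change, q <= 0.05) and the example values are illustrative assumptions.

```python
import math


def volcano_point(log2_fc, q_value, min_abs_log2_fc=1.0, max_q=0.05):
    """Return (x, y, label) for one RNA on a volcano plot."""
    # Floor the q-value so that q == 0 does not break the logarithm.
    y = -math.log10(max(q_value, 1e-300))
    if q_value <= max_q and log2_fc >= min_abs_log2_fc:
        label = "up"
    elif q_value <= max_q and log2_fc <= -min_abs_log2_fc:
        label = "down"
    else:
        label = "not significant"
    return log2_fc, y, label


# A 16-fold apparent change that is explainable by chance ...
big_but_noisy = volcano_point(4.0, 0.40)
# ... versus a smaller change with strong statistical support.
modest_but_solid = volcano_point(1.5, 0.001)
```

The first point lands far to the side but low on the plot; the second is closer to the center but high, and is the one labeled "up".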

WHAT TO DO WITH THE DATA??

• An open ended question dependent upon the experimental goals

• Common procedures:
  • Look at altered RNAs exceeding statistical significance
    • Upregulated and downregulated.
    • What pathways or targets are represented here?
    • Enrichment analysis

  • Look at similarly expressed gene patterns globally, without the statistical cutoff, or at other RNAs in the same pathway(s)

  • Look at splice isoform differences suggested by the data
    • Mainly only applicable to paired end data as pairs are more likely to span one or more exons
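One common form of the enrichment analysis mentioned above is a hypergeometric (one-sided Fisher) test: given how many genes changed overall, how surprising is it that this many fall in one pathway? The sketch below is a generic version of that test, not the method of any specific enrichment tool, and the gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, changed_genes, overlap):
    """P(at least `overlap` changed genes fall in the pathway by chance).

    Hypergeometric upper tail: drawing `changed_genes` from `total_genes`
    without replacement, where `pathway_genes` of them are in the pathway.
    """
    upper = min(pathway_genes, changed_genes)
    denominator = comb(total_genes, changed_genes)
    tail = sum(comb(pathway_genes, k)
               * comb(total_genes - pathway_genes, changed_genes - k)
               for k in range(overlap, upper + 1))
    return tail / denominator


# Hypothetical: 10 genes measured, 4 in the pathway, 3 changed,
# and 2 of the changed genes are pathway members.
p_small_example = enrichment_p_value(10, 4, 3, 2)  # 40/120 = 1/3
```

When many pathways are tested at once, the resulting p-values need the same kind of multiple-testing correction as the per-gene comparisons.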

THANKS!

• MGL e-mail: ([email protected])

• James Iben ([email protected])
• Location: Building 10, 9D41
• MGL Web page: https://science.nichd.nih.gov/confluence/display/mgl/Home
• MGL listserv: MGL-USERS-L
• MGL phone: (301) 402-4563

• Next users group meeting: Aug 20th, 12pm
  • Topic TBA
