MGL USERS GROUP 07-16-14
RNASEQ LIBRARY PREPARATION AND ANALYSIS BASICS
WHAT DOES RNASEQ PROVIDE?
• RNAseq is a quantitative experiment
  • It measures the relative amounts of the different transcripts in a sample; the RNA is fragmented (for large RNA) and sequenced
• The number of reads you obtain for a given transcript is proportional to the amount of it present in the original sample
  • Number of reads recovered per transcript scales directly with
    • Number of copies of that transcript
    • Size of the transcript (Longer => More fragments)
• Higher abundance transcripts may provide differential splicing information
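The scaling described above can be sketched numerically. This is a toy model, not a description of any real library: it assumes fragments are generated uniformly along each transcript, so the expected share of reads is proportional to copies times length. The transcript names and numbers are made up for illustration.

```python
def expected_read_fractions(transcripts):
    """Expected share of reads per transcript under a simple toy model.

    transcripts: dict mapping name -> (copies, length_nt).
    Assumption: reads are drawn in proportion to copies * length.
    """
    weights = {name: copies * length
               for name, (copies, length) in transcripts.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical transcripts: (number of copies, length in nucleotides)
toy_sample = {
    "short_abundant": (100, 500),
    "long_rare": (10, 5000),
    "long_abundant": (100, 5000),
}

fractions = expected_read_fractions(toy_sample)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```

In this toy sample the short, abundant transcript and the long, rare transcript receive the same expected share of reads, which is why raw read counts need length normalization before abundances are compared.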
WHAT DOES RNASEQ NOT PROVIDE?
• A measure of translational or protein activity
  • Just because a gene is highly transcribed does not mean that its message is fully engaged with ribosomes
  • Ribosome profiling is an excellent supplement to this end
• A good view of mutations
  • Since read recovery scales with abundance of the given transcript, some messages are highly (or overly) recovered while others are only sparsely sampled
  • Mutations may affect the ability of an RNA to accumulate in order to be sequenced
  • Ability to identify mutations is dependent on sampling a mutation many times
  • DNA sequencing is better suited to finding mutations in genes as it is not tied to expression
RNASEQ LIBRARY PREPARATION CONSIDERATIONS
• Ribosomal Depletion
  • Ribo-Zero Gold kit
  • 98-99% of isolated RNA is rRNA and will be the most abundant species in a sequencing experiment otherwise
• Amount
  • Ideal input for library preparation is 200ng of ribosome depleted RNA
  • 1 to 5 ug of total RNA prior to depletion will work as well for transcriptome sequencing experiments
• Small RNA requires special handling separate from transcriptome experiments
LIBRARY PREPARATION
• Ribosome depleted RNA is fragmented if large (mRNA, etc)
  • Tightly controlled endonucleolytic (RNase III) digestion
  • Cleanup
• Small RNA differences
  • Ensure 5’ phosphate, 3’ hydroxyl
• End ligation of adapter sequences
• Cleanup (small RNA involves additional size selection)
• Reverse transcription
• Cleanup
• PCR amplification with barcoded full adapters
• Cleanup & Quantitation for Multiplexing/Bead preparation
SAMPLE STATE AFTER LIBRARY PREPARATION
• From this point, sample is almost identical to a DNA fragment library
• Adapter sequences vary slightly from DNA-based libraries, but have identical barcoding schemes.
• Samples may be read from both ends; for small RNA, a read from one end may be all that is needed.
• If multiplexing, differently barcoded samples are pooled for bead preparation.
BEAD PREPARATION AND DEPOSITION
• Beads are enriched with single fragment sequences by emulsion PCR and deposited in lanes on a flowcell.
• Target deposition:
  • ~230k beads per imaging panel
  • ~150 million beads per lane
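As a rough piece of arithmetic on the deposition targets above, the sketch below estimates how many imaging panels a lane holds and how many barcoded samples could share a lane. The per-sample read target and the `usable_fraction` parameter (the share of beads that yield a usable read) are assumptions for illustration, not figures from the slide.

```python
BEADS_PER_PANEL = 230_000       # target from the slide
BEADS_PER_LANE = 150_000_000    # target from the slide


def panels_per_lane():
    """Approximate number of imaging panels implied by the two targets."""
    return BEADS_PER_LANE / BEADS_PER_PANEL


def samples_per_lane(reads_per_sample, usable_fraction=1.0):
    """Whole number of samples that fit in one lane.

    reads_per_sample and usable_fraction are hypothetical planning inputs.
    """
    usable_beads = BEADS_PER_LANE * usable_fraction
    return int(usable_beads // reads_per_sample)


print(f"~{panels_per_lane():.0f} panels per lane")
print(samples_per_lane(40_000_000))                       # every bead usable
print(samples_per_lane(40_000_000, usable_fraction=0.7))  # 70% usable
```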
HOW MANY READS ARE NEEDED??
• A common and debated point:
  • Read count affects how much sampling of the RNA population is occurring
  • Highly abundant RNAs are fairly easy to see and therefore quantitate with low read counts
  • Lower abundance species are going to be harder to catch by random sampling of the population, therefore more difficult to quantitate accurately
  • At very low sampling (read) counts, you may not see some trace RNAs at all, simply by chance!
• Target read counts are going to be informed by what you are after in the experiment
  • Complexity of transcriptome (Bacteria vs Human)?
  • Big changes in expression? Low counts needed
  • Precise measurement of low abundance species? Maybe more
  • Appearance of a minor splice isoform species in the presence of WT isoforms? You may need a lot more
• Basic rule of thumb: 40 million reads for a good picture of Human transcriptome-sized complexity
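The chance of missing a trace RNA can be estimated with a simple sampling model. The sketch below assumes each read is an independent draw and that the RNA of interest contributes a fixed fraction `read_fraction` of all sequenceable fragments; both the model and the example fraction are assumptions for illustration.

```python
import math


def prob_unseen(read_fraction, total_reads):
    """Probability of getting zero reads from an RNA that makes up
    read_fraction of the fragment pool, given total_reads independent reads.
    Computes (1 - read_fraction) ** total_reads in a numerically stable way.
    """
    return math.exp(total_reads * math.log1p(-read_fraction))


def reads_for_detection(read_fraction, detection_prob=0.95):
    """Smallest read count giving at least detection_prob chance of
    observing one or more reads from the RNA."""
    return math.ceil(math.log(1.0 - detection_prob) / math.log1p(-read_fraction))


# Hypothetical trace RNA contributing 1 in 10 million fragments
fraction = 1e-7
print(prob_unseen(fraction, 1_000_000))    # low depth: usually missed
print(prob_unseen(fraction, 40_000_000))   # rule-of-thumb depth
print(reads_for_detection(fraction))
```

Seeing an RNA at least once is a much weaker requirement than quantitating it precisely, so depth targets for accurate measurement of low abundance species will be higher than this estimate.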
AFTER SEQUENCING PROCEDURE
• BEFORE MAPPING!
• Assess quality of the run.
  • Was the balance between signals fairly even?
  • Was there poor imaging or other aberrations in a cycle?
  • Was the number of reads obtained what you expected?
• Scrub resulting sequences before trying to map to your transcriptome.
  • Remove contaminant high abundance sequences: rRNA, tRNA, adapters
    • Particularly for small RNA experiments adapter sequences may be found joined to the ends of your targets and need to be trimmed off
  • Trim reads based on quality (clip off trailing low quality calls that may lead to poor or incorrect mapping)
• It’s critical to have the best sequences to map without possible mistakes in order to get the highest quality quantitation in the end!
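The two trimming steps above can be sketched as follows. This is a minimal illustration that assumes base-space reads with one numeric quality score per base (colorspace data needs colorspace-aware tools), exact adapter matching only, and a made-up adapter sequence; dedicated trimming tools handle mismatches and many more cases.

```python
def trim_trailing_low_quality(seq, quals, min_quality=20):
    """Clip bases from the 3' end until the last base meets min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def clip_3prime_adapter(seq, adapter, min_overlap=5):
    """Remove a 3' adapter: a full copy anywhere in the read, or a partial
    copy (at least min_overlap bases of the adapter start) at the read end."""
    index = seq.find(adapter)
    if index != -1:
        return seq[:index]
    longest = min(len(adapter), len(seq)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return seq[:-overlap]
    return seq


HYPOTHETICAL_ADAPTER = "ACGTTGCAACGT"  # placeholder, not a real kit adapter

print(trim_trailing_low_quality("ACGTACGTAA", [30, 30, 30, 30, 30, 30, 30, 12, 8, 2]))
print(clip_3prime_adapter("TTTTGGGGACGTTGCAACGTCC", HYPOTHETICAL_ADAPTER))
print(clip_3prime_adapter("TTTTGGGGACGTTGC", HYPOTHETICAL_ADAPTER))
```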
MAPPING PROCEDURE
• Depends slightly on your data source and state
  • Illumina sequences are often mapped and quantitated through the Tophat/Bowtie -> Cufflinks pipeline.
  • SOLiD colorspace data may be mapped through Lifescope or Novocraft, then quantitated by any tool at that point.
• RNAseq mapping typically involves multiple passes:
  • First pass: compare your sequence to known spliced RNA sequences for your genome
  • Second pass: if not identified there, map to the genome while not allowing for large gaps, then allowing for large gaps.
POST MAPPING
• At this point, generally data is in SAM or BAM format.
• Information there includes the read ID, where on the genome it mapped, where the pair is if it’s a paired end experiment, sequence, call qualities, and other information.
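To make the fields listed above concrete, here is a minimal sketch that splits one SAM alignment line into its eleven mandatory columns plus optional tags. The example record is invented for illustration; real pipelines would normally use a dedicated SAM/BAM library, and BAM is a compressed binary form that cannot be read this way.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read ID
    flag: int    # bitwise flags (paired, unmapped, strand, ...)
    rname: str   # reference sequence the read mapped to
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference of the mate ("=" means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base call qualities (ASCII encoded)
    tags: dict   # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Parse a single alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:
        tag, tag_type, value = item.split(":", 2)
        tags[tag] = int(value) if tag_type == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


# Invented paired-end record for illustration
example_line = "\t".join(["read_001", "99", "chr1", "1000", "60", "50M", "=",
                          "1200", "250", "A" * 50, "I" * 50, "NH:i:1"])
record = parse_sam_line(example_line)
is_paired = bool(record.flag & 0x1)     # flag bit 0x1: read is paired
is_unmapped = bool(record.flag & 0x4)   # flag bit 0x4: read is unmapped
is_reverse = bool(record.flag & 0x10)   # flag bit 0x10: reverse strand
print(record.qname, record.rname, record.pos, is_paired, is_unmapped, is_reverse)
```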
VISUALIZATION OF DATA
• IGV
• UCSC Genome Browser
• Various other tools
COMPARING ABUNDANCES
• Abundance of a given species is typically expressed as RPKM or FPKM
  • Reads Per Kilobase of transcript per Million mapped reads; Fragments Per Kilobase of transcript per Million mapped fragments
  • Fragments is preferred for paired-end data as it only counts those where both ends of your read map to the same transcript, and doesn’t “double count” an individual biological fragment
    • One molecule “spot” on the flow cell = 1 Fragment comprised of one forward and one reverse read
    • Added accuracy control
• Normalizes basic counts per species by size of the transcript (how likely you are to sample it from a pool of fragments), and total size of the experiment
  • Allows for cross-experiment/sample comparisons
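The normalization described above corresponds to the basic definition sketched below: fragments divided by transcript length in kilobases and by total mapped fragments in millions. The numbers are made up, and quantitation tools typically refine this with their own statistical models, so treat it as the textbook formula rather than what any particular tool computes.

```python
def fpkm(fragments, transcript_length_nt, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    Equivalent to fragments / (length_nt / 1e3) / (total_mapped_fragments / 1e6).
    """
    return fragments * 1e9 / (transcript_length_nt * total_mapped_fragments)


# Hypothetical transcript: 500 fragments, 2,000 nt long, 40 million mapped fragments
print(fpkm(500, 2000, 40_000_000))

# A transcript twice as long with twice the fragments has the same FPKM
print(fpkm(1000, 4000, 40_000_000))
```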
SCATTER PLOT FPKM(1) VS FPKM(2)
• Compare abundances of all RNAs directly
• Usually log scale as FPKM values range greatly
• Should see mainly a 1:1 correspondence of abundance in most RNAs
[Figure: scatter plot of FPKM values, axes labeled Condition 1 and Condition 2]
SCATTER PLOT FPKM(1) VS FPKM(2)
• Most analyses are interested in what is significantly altered between conditions
• Need to identify globally what has changed and how much confidence we have in those determinations!
[Figure: scatter plot of FPKM values, axes labeled Condition 1 and Condition 2]
COMPARING ABUNDANCES
• FPKMs for the same species of RNA can be compared for expression level differences
  • Typically, a ratio of the FPKM values is taken, and log transformed (usually log base 2 scale)
  • Log transformation eases statistical analysis of the population of ratios across all species observed in an experiment
    • What’s unusual/real vs what’s within experimental noise
• Assign p-values to the modeled significance of the changes being observed
  • Basically, how likely is this to be a sampling difference explainable by chance?
• Apply RNAseq-specific corrections to derive a “q-value” (corrected p-value)
  • Size of transcripts can affect sampling, replicate measurements if in the experiment, etc
• Apply some cutoff value to the q-value and look at RNAs whose q-value falls below that threshold
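The slide does not name a specific correction, so the sketch below uses the Benjamini-Hochberg false discovery rate adjustment as one common way to turn p-values into corrected values. This is an assumption for illustration: it only handles multiple testing and does not model transcript size or replicates. The p-values and the 0.05 cutoff are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values for five RNAs
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)

CUTOFF = 0.05  # hypothetical threshold
significant = [i for i, q in enumerate(q_values) if q <= CUTOFF]
print(q_values)
print(significant)
```

Note that two of the RNAs have raw p-values below 0.05 but adjusted values above it, which is the reason for applying the cutoff to the corrected value.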
VOLCANO PLOT
• Log ratio between FPKMs plotted against q-value
• Even though apparent ratios between FPKM values may be large, they may simply reflect sampling explainable by chance.
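The coordinates of a volcano plot can be sketched as below. Plotting the q-value as -log10(q) is a common convention rather than something stated on the slide, and the pseudocount (added so that an FPKM of zero does not break the ratio), the cutoffs, and the example values are all assumptions for illustration.

```python
import math


def volcano_point(fpkm_1, fpkm_2, q_value, pseudocount=1.0):
    """Return (log2 ratio of condition 2 over condition 1, -log10 q-value).

    q_value must be greater than zero.
    """
    log2_ratio = math.log2((fpkm_2 + pseudocount) / (fpkm_1 + pseudocount))
    return log2_ratio, -math.log10(q_value)


def classify(log2_ratio, q_value, q_cutoff=0.05, min_abs_log2_ratio=1.0):
    """Label a point as 'up', 'down', or 'ns' (not significant)."""
    if q_value > q_cutoff or abs(log2_ratio) < min_abs_log2_ratio:
        return "ns"
    return "up" if log2_ratio > 0 else "down"


# Two hypothetical RNAs with large ratios but different q-values
x1, y1 = volcano_point(7.0, 31.0, 0.001)
x2, y2 = volcano_point(10.0, 40.0, 0.6)
print(x1, y1, classify(x1, 0.001))
print(x2, y2, classify(x2, 0.6))
```

The second RNA shows the point made on the slide: its ratio is large, but with a q-value of 0.6 the difference is explainable by chance.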
WHAT TO DO WITH THE DATA??
• An open ended question dependent upon the experimental goals
• Common procedures:
  • Look at altered RNAs exceeding statistical significance
    • Upregulated and Downregulated.
    • What pathways or targets are represented here?
    • Enrichment analysis
  • Look at similarly expressed gene patterns globally without the statistical cutoff, or at other RNAs in the same pathway(s)
  • Look at splice isoform differences suggested by the data
    • Mainly only applicable to paired end data as pairs are more likely to span one or more exons
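One simple form of enrichment analysis is an over-representation test. The sketch below uses a one-sided hypergeometric test, which is an assumption about method since the slide does not specify one: it asks how likely it is to see at least the observed number of pathway genes among the significantly altered genes if those were drawn at random. All counts are made up, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def enrichment_p_value(n_background, n_in_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_in_pathway belong to the pathway."""
    upper = min(n_in_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical experiment: 12,000 expressed genes, 150 in the pathway,
# 300 significantly altered, 15 of those in the pathway
print(enrichment_p_value(12_000, 150, 300, 15))
```

When many pathways are tested, the resulting p-values need their own multiple-testing correction.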
THANKS!
• MGL e-mail: ([email protected])
• James Iben ([email protected])
• Location: Building 10, 9D41
• MGL Web page: https://science.nichd.nih.gov/confluence/display/mgl/Home
• MGL listserv: MGL-USERS-L
• MGL phone: (301) 402-4563
• Next users group meeting: Aug 20th, 12pm
  • Topic TBA