RNAseq analyses -- methods
June 7, 2015
Module 3 – Expression and Differential Expression
Lecture Objectives
• Terminology for RNA-seq• Steps in RNAseq analysis• Downstream interpretation of expression and
differential estimates– multiple testing, clustering, heatmaps, classification,
pathway analysis, etc.
Module 3 – Expression and Differential Expression
Steps in RNAseq workflow
• Obtain raw data from sequencer• Align/assemble reads• Process alignment to quantify reads/gene feature• Conduct differential analysis• Summarize and visualize
– Create gene lists, prioritize candidates for validation, etc.
• Conduct gene enrichment or pathway analysis
Module 3 – Expression and Differential Expression
Some nomenclature
Fasta(sequence only)
Fastq(contains quality scores)
>SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Sequencing file types:
Module 3 – Expression and Differential Expression
Alignment files
• SAM – Sequence Alignment/MAP coordinate file– Tab-delimited ASCII file with all of the information needed for
the alignment of the reads to reference
HWI-ST155_0544:7:2:7658:34048#0 16 Chr1 2082 1 42M *0 0 GTAGAGGTAGGACCAACAAGGACCAAGTTTCCCTGTTCCAAC
ghghhhhgcghh_hhhhhhhhhhhhhhhhghhhhhghhhhhh NM:i:0 NH:i:4 CC:Z:Chr10CP:i:1057787HWI-ST155_0544:7:61:11040:141129#0 0 Chr1 2956 1 42M *
0 0 CTTTTCTGGCGTAACTTGGTTCCCTTTAGTTTGGAACAGATAhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhchfhhhh NM:i:0 NH:i:3 CC:Z:Chr10CP:i:1056913HWI-ST155_0544:7:8:11214:130793#0 0 Chr1 3645 1 42M *
0 0 CAAATAGTGGTTTGAAACCTATCAATCAAGTCACTTTCAAGTgggggggggggggggggeggdffffggggggggggggdggfc NM:i:0 NH:i:3 CC:Z:Chr10CP:i:1056224
• BAM – Binary version of SAM– Machine readable, smaller more compact version– More commonly used by viewers and analysis programs
Module 3 – Expression and Differential Expression
Reference guided RNAseq alignment
• You have a sequenced genome and a file with the gene model annotations
• Genome reference file is usually a fasta file with each chromosome as a separate entry
• GTF = gene transcript file• GFF = gene feature file (GFF3)• The GTF/GFF chromosome name must match the
chromosome name in the genome reference file
• In this case we are using the human hg19 genome build and the Refseq transcript file from 05-07-2015
Module 3 – Expression and Differential Expression
RNAseq alignment
Module 3 – Expression and Differential Expression
RPKM (FPKM)
• Reads (fragments) per Kilobase Per Million
RPKM =raw number of reads
exon lengthX
1,000,000
Number of reads mapped in the sample
• In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. However: • The number of fragments is biased towards larger genes• Total number of fragments is related to total library depth
• Use of FPKM/RPKM normalizes for gene size and library depth
Module 3 – Expression and Differential Expression
Alternatives to FPKM
• Raw read counts – Instead of calculating FPKM, simply assign
reads/fragments to a defined set of genes/transcripts and determine “raw counts”
• HTSeq (htseq-count)– Python code that converts aligned reads to counts– You give it an alignment file and the associated transcript
file and it will output a list of counts by feature
• FeatureCounts (part of Subread package)– BAM file + GTF file => # reads/feature
• PGS also reports reads/feature– We can try differential expression analysis with the RPKM
and the reads file to explore the different outcomes
Module 3 – Expression and Differential Expression
How to quantify reads per feature?
Module 3 – Expression and Differential Expression
Counts vs FPKM(RPKM)?
• Count files can be analyzed by a number of different R programs– EdgeR– DESeq2– NOISeq
• More robust statistical methods for differential expression
• Accommodates more sophisticated experimental designs with appropriate statistical tests
Module 3 – Expression and Differential Expression
Gene expression differences
• Once you have your data in RPKM or counts format, you can use a variety of tools to compare the different conditions and determine what genes are differentially expressed
• Authors used Cufflinks/CuffDiff
Module 3 – Expression and Differential Expression
How does cuffdiff work?• The variability in fragment count for each
gene across replicates is modeled.• Fragment count for each isoform is estimated
in each replicate along with a measure of uncertainty in this estimate arising from ambiguously mapped reads– transcripts with more shared exons and few
uniquely assigned fragments will have greater uncertainty
• The algorithm combines estimates of uncertainty and cross-replicate variability under a negative binomial model of fragment count variability to estimate count variances for each transcript in each library
• These variance estimates are used during statistical testing to report significantly differentially expressed genes and transcripts.
More replicates = more confidence
Module 3 – Expression and Differential Expression
Multiple approaches advisable
Module 3 – Expression and Differential Expression
Multiple testing correction
• As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone.
• Well known from array studies– 10,000s genes/transcripts– 100,000s exons
• With RNA-seq, more of a problem than ever– All the complexity of the transcriptome– Almost infinite number of potential features
• Genes, transcripts, exons, juntions, retained introns, microRNAs, lncRNAs, etc, etc
• Bioconductor multtest– http://www.bioconductor.org/packages/release/bioc/html/multtest.html
Module 3 – Expression and Differential Expression
Lessons learned from microarray days
• Hansen et al. “Sequencing Technology Does Not Eliminate Biological Variability.” Nature Biotechnology 29, no. 7 (2011): 572–573.
• Power analysis for RNA-seq experiments– http://euler.bc.edu/marthlab/scotty/scotty.php
• RNA-seq need for biological replicates– http://www.biostars.org/p/1161/
• RNA-seq study design– http://www.biostars.org/p/68885/
Module 3 – Expression and Differential Expression
Today in computer lab
• Analyze the BAM files downloaded from Google drive• Do QA/Qc• Do mRNA quantification• Do comparisons between groups
• The analysis will take several hours…plan accordingly