Applied Bioinformatics Journal Club
Wednesday, March 5
Background
• Comparison of commonly used DE software packages
– Cuffdiff
– edgeR
– DESeq
– PoissonSeq
– baySeq
– limma
• Two benchmark datasets
– Sequencing Quality Control (SEQC) dataset
  • Includes qRT-PCR for 1,000 genes
– Biological replicates from 3 cell lines as part of ENCODE project
Focus of paper: Comparison of relevant measures for DE detection
• Normalization of count data
• Sensitivity and specificity of DE detection
• Genes expressed in one condition but not in the other
• Sequencing depth and number of replicates
Theoretical background
• Count matrix—number of reads assigned to gene i in sequencing experiment j
• Length bias when measuring gene expression by RNA-seq
– Reduces the ability to detect differential expression among shorter genes
• Differential gene expression analysis consists of 3 components:
– Normalization of counts
– Parameter estimation of the statistical model
– Tests for differential expression
Normalization
• Commonly used
– RPKM
– FPKM
– Biases: proportional representation of each gene is dependent on expression levels of other genes
• DESeq: scaling-factor-based normalization
– Scaling factor for a sample is the median, over genes, of the ratio of each gene's read count to its geometric mean across all samples
• Cuffdiff: extension of DESeq normalization
– Intra-condition library scaling
– Second scaling between conditions
– Also accounts for changes in isoform levels
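The DESeq-style scaling factor described above can be sketched in a few lines. This is a minimal illustration, not DESeq's own code: the function name `size_factors` and the toy count matrix are assumptions, and genes with a zero count in any sample are skipped so that the geometric mean is defined.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios scaling factors (illustrative sketch).

    counts: list of rows (one per gene), each a list of per-sample read counts.
    Returns one scaling factor per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes with a positive count in every sample (log must be defined).
    log_rows = [[math.log(c) for c in row]
                for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each gene across samples.
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count to its geometric mean, for sample j.
        log_ratios = [row[j] - g for row, g in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy matrix (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],   # skipped: zero count in sample 1
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
```

Dividing each count by its sample's factor puts the two libraries on a common scale, so the first three genes end up with equal normalized counts in both samples.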
Normalization
• edgeR
– Trimmed mean of M values (TMM)
– Weighted average of a subset of genes (excluding genes with high average read counts and genes with large differences in expression)
• baySeq
– Sum gene counts to upper 25% quantile to normalize library size
• PoissonSeq
– Goodness-of-fit estimate to define a gene set that is least differentiated between 2 conditions; this set is then used to compute library normalization factors
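As a rough illustration of quantile-based library scaling (the idea behind the baySeq bullet above), the sketch below rescales each library by the 75th percentile of its nonzero counts. This is a simplified upper-quartile scheme chosen for illustration, not baySeq's actual implementation; the function names and toy data are assumptions.

```python
from statistics import mean, quantiles

def upper_quartile(sample_counts):
    """75th percentile of the nonzero counts in one library."""
    nonzero = sorted(c for c in sample_counts if c > 0)
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return quantiles(nonzero, n=4, method="inclusive")[2]

def upper_quartile_normalize(samples):
    """Rescale each library so that all share the same upper quartile."""
    uqs = [upper_quartile(s) for s in samples]
    target = mean(uqs)  # common scale: mean of the per-library upper quartiles
    return [[c * target / uq for c in s] for s, uq in zip(samples, uqs)]

# Toy data (hypothetical): the second library is twice as deep as the first.
samples = [[0, 10, 20, 30, 40], [0, 20, 40, 60, 80]]
uqs = [upper_quartile(s) for s in samples]
normalized = upper_quartile_normalize(samples)
```

Using an upper quantile rather than the total count makes the scaling less sensitive to a handful of very highly expressed genes.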
Normalization
• limma (2 normalization procedures)
– Quantile normalization: sorts counts from each sample and sets the values equal to the quantile mean across all samples
– Voom: LOWESS regression to estimate the mean-variance relation; transforms read counts to log form for linear modeling
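Quantile normalization as described above can be sketched as follows. This is a minimal illustration rather than limma's implementation: the function name and toy data are assumptions, and tied values are not averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of per-sample count lists of equal length."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean of the k-th smallest value across all samples.
    rank_means = [sum(s[k] for s in sorted_samples) / len(samples)
                  for k in range(n_genes)]
    normalized = []
    for s in samples:
        # Gene indices ordered by increasing count within this sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            # Replace each value with the cross-sample mean for its rank.
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized

# Toy data (hypothetical): two samples, three genes.
samples = [[5, 2, 3], [4, 1, 6]]
normalized = quantile_normalize(samples)
```

After normalization every sample has the same set of values (here 1.5, 3.5, 5.5); only their assignment to genes differs.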
Statistical modeling of gene expression
• edgeR and DESeq
– Negative binomial distribution (estimation of dispersion factor)
• edgeR
– Estimation of dispersion factor as weighted combination of 2 components
  • Gene-specific dispersion effect and common dispersion effect calculated for all genes
Statistical modeling of gene expression
• DESeq
– Variance estimate split into a combination of a Poisson estimate and a second term that models biological variability
• Cuffdiff
– Separate variance models for single-isoform and multiple-isoform genes
  • Single isoform: similar to DESeq
  • Multiple isoform: mixed model of negative binomial and beta distributions
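The two-part variance in the DESeq bullet can be illustrated with the usual negative binomial mean-variance relation, variance = mean + dispersion × mean². The sketch below assumes that parametrization; the function name and example numbers are hypothetical and not taken from the paper.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count under the mean/dispersion form.

    The first term is the Poisson (sampling) part; the second term grows with
    the square of the mean and stands in for biological variability.
    """
    poisson_part = mean
    biological_part = dispersion * mean ** 2
    return poisson_part + biological_part

# Hypothetical gene with mean count 100 and dispersion 0.1.
variance = nb_variance(100, 0.1)
# With zero dispersion the model reduces to a Poisson distribution.
poisson_only = nb_variance(100, 0.0)
```

For highly expressed genes the second term dominates, which is why a pure Poisson model tends to underestimate the variance between biological replicates.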
Statistical modeling of gene expression
• baySeq
– Full Bayesian model of negative binomial distributions
– Prior probability parameters are estimated by numerical sampling of the data
• PoissonSeq
– Models gene counts as a Poisson variable
– Mean of distribution represented by log-linear relationship of library size, expression of gene, and correlation of gene with condition
Test for differential expression
• edgeR and DESeq
– Variation of Fisher exact test modified for negative binomial distribution
– Returns exact P value from derived probabilities
• Cuffdiff
– Ratio of normalized counts between 2 conditions (follows normal distribution)
– t-test to calculate P value
Test for differential expression
• limma
– Moderated t-statistic with modified standard error and degrees of freedom
• baySeq
– Estimates 2 models for every gene
  • No differential expression
  • Differential expression
– Posterior likelihood of DE given the data is used to identify differentially expressed genes (no P value)
Test for differential expression
• PoissonSeq
– Test for significance of correlation term
– Evaluated by score statistics which follow a chi-squared distribution (used to derive P values)
• Multiple hypothesis corrections
– Benjamini-Hochberg
– PoissonSeq: permutation-based FDR
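For reference, the Benjamini-Hochberg correction named above can be sketched as a short function that converts raw P values into adjusted values. The function name and example P values are assumptions for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P values for four genes.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
# Genes called DE at a 10% false discovery rate.
significant = [i for i, q in enumerate(adjusted) if q <= 0.10]
```

Each adjusted value is the smallest false discovery rate at which that gene would still be called differentially expressed.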
Results
• Normalization and log expression correlation
• Differential expression analysis
• Evaluation of type I errors
• Evaluation of genes expressed in one condition
• Impact of sequencing depth and replication on DE detection