Transcript expression and RNA splicing

Kaur Alasoo, 8 March 2018

Gene expression can vary in many ways

DGE - differential gene expression
DTE - differential transcript expression
DTU - differential transcript usage

http://mikelove.github.io/bioc2016/#/slide-9

https://liorpachter.wordpress.com/2018/02/15/gde%C2%B2-dge%C2%B2-dtu%C2%B2-dte%E2%82%81%C2%B2-dte%E2%82%82%C2%B2/
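The three terms can be told apart on a toy gene. The sketch below is illustrative only: the function name `summarize_gene` and the abundance values are hypothetical. It reads DGE as a change in the summed abundance of a gene's transcripts, DTE as a change in the abundance of an individual transcript, and DTU as a change in each transcript's share of the gene total.

```python
import math


def summarize_gene(cond_a, cond_b):
    """Compare one gene between two conditions.

    cond_a and cond_b map transcript IDs to abundances (for example TPM).
    All abundances are assumed to be non-zero so that log2 is defined.
    """
    gene_a = sum(cond_a.values())
    gene_b = sum(cond_b.values())
    # DGE: log2 fold-change of the gene total.
    gene_lfc = math.log2(gene_b / gene_a)
    # DTE: log2 fold-change of each transcript on its own.
    transcript_lfc = {t: math.log2(cond_b[t] / cond_a[t]) for t in cond_a}
    # DTU: change in each transcript's proportion of the gene total.
    usage_change = {t: cond_b[t] / gene_b - cond_a[t] / gene_a for t in cond_a}
    return gene_lfc, transcript_lfc, usage_change


# Hypothetical isoform switch: the gene total is 40 in both conditions.
gene_lfc, transcript_lfc, usage_change = summarize_gene(
    {"t1": 30.0, "t2": 10.0},
    {"t1": 10.0, "t2": 30.0},
)
```

In this made-up case the gene-level fold-change is zero while both transcripts change in abundance and in usage, i.e. DTE and DTU without DGE.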

NATURE BIOTECHNOLOGY VOLUME 31 NUMBER 1 JANUARY 2013 47

displayed altered inclusion of key features, such as DNA binding regions, in their protein products.

RESULTS

Raw fragment counts inaccurately estimate changes in expression

Early methods for quantifying gene expression from RNA-seq data work by counting the sequencing library fragments that map to the exons of each gene and dividing the count for each gene by a scaling factor based on the length of the exons. Expression levels estimated using such approaches are less accurate than later methods [27], which calculate a gene’s expression level by adding the expression values of its alternative isoforms [3,16]. We refer to the former as ‘raw count’ methods and the latter as ‘isoform deconvolution’ methods. Current tools for differential gene expression analysis use the raw count method, equating the change in a gene’s expression levels with the change in the number of fragments originating from it between conditions [17,20,21,28].

Because the raw count method is not always accurate when calculating gene expression in a single library, we hypothesized that it would be inaccurate when comparing libraries. Simple examples of hypothetical, alternatively spliced genes showed that the change in expression could be drastically different from the change in raw read count (Fig. 1). We compared expression levels from two popular raw count schemes to changes in gene expression in simulation experiments. When all of a gene’s isoforms are up- or downregulated between two conditions, raw count methods recover true change in gene expression. However, when some isoforms are upregulated and others downregulated, raw count methods are inaccurate (Supplementary Fig. 1). In contrast, gene expression levels calculated by isoform deconvolution correlated well with true gene expression even when relative abundance of the isoforms changed between conditions. Thus, identifying accurate, statistically significant expression changes at the resolution level of genes requires transcript-level calculations.

Cuffdiff 2

Cuffdiff 2 assumes that the expression of a transcript in each condition can be measured by counting the number of fragments generated by it. Thus, a change in the expression level of a transcript is measured by comparing its fragment count in each condition. If the chance of seeing a change in this count is small enough under an appropriate statistical model of the inherent variability in this count (say with odds of 1 in 100), the transcript is deemed significantly differentially expressed. Choosing a model that adequately controls for variability in sequencing depth, biological noise and splicing structure has been the subject of debate [19]. Under one of the simplest models, the Poisson model, the variability is estimated by calculating the mean count across replicates, which allows one to calculate a P-value for any observed changes in a transcript’s fragment count.

The Poisson model is computationally simple, but it fails to account for two key issues that arise in differential analysis: count uncertainty and count overdispersion. Count uncertainty refers to the observation that in RNA-seq experiments it is common for up to 50% of reads to map ambiguously to different transcripts [29]. This happens because in higher eukaryotes alternative isoforms of most genes share large amounts of sequence, and many genes have paralogs with high sequence similarity. As a result, the fragment counts for individual transcripts cannot be calculated exactly and must be estimated. Count overdispersion refers to the fact that experiments that produce count data are often more variable across replicates than what is expected according to a Poisson distribution [17,20].
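Overdispersion can be made concrete with a few replicate counts. The sketch below is an illustration, not Cuffdiff 2's model: the replicate counts are invented, the function names are hypothetical, and the negative binomial variance uses one common parametrization (variance = mean + dispersion × mean²) chosen here as an assumption.

```python
import statistics


def dispersion_index(counts):
    """Sample variance divided by sample mean.

    Under a Poisson model this ratio is expected to be close to 1;
    values well above 1 indicate overdispersion.
    """
    return statistics.variance(counts) / statistics.mean(counts)


def negative_binomial_variance(mean, dispersion):
    """Variance under a negative binomial with the given mean and dispersion.

    With dispersion = 0 this reduces to the Poisson variance (equal to the mean).
    """
    return mean + dispersion * mean ** 2


# Hypothetical fragment counts for one transcript across four replicates.
replicate_counts = [90, 110, 150, 50]
index = dispersion_index(replicate_counts)
```

For these invented counts the mean is 100 but the sample variance is far larger, so a Poisson model would understate the replicate-to-replicate variability.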

Our method (Fig. 2) addresses both of these issues by modeling how variability in measurements of a transcript’s fragment count depends on both its expression and its splicing structure. Previous studies observed that overdispersion in RNA-seq experiments increases with expression and proposed the negative binomial distribution as a means of controlling for it [17,22]. In contrast, ambiguity in mapping fragments to transcripts manifests itself in measurement

[Figure 1 image. (a) Isoform A and isoform B under the exon-union model and the exon-intersection model. (b) Log fold-change between condition A and condition B computed from the union count, the intersect count and the true expression.]

Figure 1 Changes in fragment count for a gene do not necessarily equal a change in expression. (a) Simple read-counting schemes sum the fragments incident on a gene’s exons. The exon-union model counts reads falling on any of a gene’s exons, whereas the exon-intersection model counts only reads on constitutive exons. (b) Both the exon-union and exon-intersection counting schemes may incorrectly estimate a change in expression in genes with multiple isoforms. The true expression is estimated by the sum of the length-normalized isoform read counts. The discrepancy between a change in the union or intersection count and a change in gene expression is driven by a change in the abundance of the isoforms with respect to one another. In the top row, the gene generates the same number of reads in conditions A and B, but in condition B, all of the reads come from the shorter of the two isoforms, and thus the true expression for the gene is higher in condition B. The intersection count scheme underestimates the true change in gene expression, and the union scheme fails to detect the change entirely. In the middle row, the intersection count fails to detect a change driven by a shift in the dominant isoform for the gene. The union scheme detects a shift in the wrong direction. In the bottom row, the gene’s expression is constant, but the isoforms undergo a complete switch between conditions A and B. Both simplified counting schemes register a change in count that does not reflect a change in gene expression.
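The top-row scenario of the caption can be sketched numerically. The numbers below are hypothetical and are not the values printed in the figure: a long isoform of length 2L and a short isoform of length L (with L set to 1), 10 reads per condition, all from the long isoform in condition A (an assumption) and all from the short isoform in condition B. Only the union count is compared with the length-normalized estimate; the intersection count would need exon-level reads and is left out.

```python
import math


def union_count(reads):
    """Exon-union count: every read on any exon of the gene is counted."""
    return sum(reads.values())


def true_expression(reads, lengths):
    """Sum of length-normalized isoform read counts, as in the caption."""
    return sum(reads[iso] / lengths[iso] for iso in reads)


def log2_fold_change(value_a, value_b):
    """log2 of condition B over condition A."""
    return math.log2(value_b / value_a)


# Hypothetical isoform lengths in units of L.
lengths = {"long": 2.0, "short": 1.0}
reads_a = {"long": 10, "short": 0}
reads_b = {"long": 0, "short": 10}

union_lfc = log2_fold_change(union_count(reads_a), union_count(reads_b))
true_lfc = log2_fold_change(
    true_expression(reads_a, lengths), true_expression(reads_b, lengths)
)
```

Under these assumptions the union count reports no change, while the length-normalized expression doubles between the two conditions.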

Changes in fragment count for a gene do not necessarily equal a change in expression

Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature Biotechnology 31.1 (2013): 46.

Alternative transcript usage is prevalent between tissues

testes and liver (as well as in other tissues studied), consistent with the predominant heart and muscle symptoms of exon 3A mutation [15].

The genome-wide extent of alternative splicing was assessed by searching against known and putative splicing junctions using stringent criteria that required each alternative isoform to be supported by multiple independent splice junction reads with different alignment start positions. Binning the multi-exon genes in the RefSeq database (94% of all RefSeq genes) by read coverage and fitting to a sigmoid curve enabled estimation of the asymptotic fraction of alternatively spliced genes in this set as ~98% when excluding cell line data (Supplementary Fig. 2) and ~100% when using all samples (Fig. 1b). This analysis indicated that alternative splicing is essentially universal in human multi-exon genes, which comprise 94% of genes overall, with the important qualification that a portion of detected alternative splicing events may represent allele-specific splicing [16,17].

Some of these events may involve exclusively low frequency alternatively spliced isoforms. However, 92% of multi-exon genes were estimated to undergo alternative splicing when considering only events for which the relative frequency of the minor (less abundant) isoform exceeded 15% in one or more samples (Fig. 1c). Thus, 0.92 × 0.94 or ~86% of human genes were estimated to produce appreciable levels of two or more distinct populations of mRNA isoforms. Conversely, no evidence of alternative splicing was detected in the 6% of RefSeq genes annotated as consisting of a single exon, even when searching against junctions between predicted exons in these genes.

New exons and splice junctions not previously seen in transcript databases were identified by mapping the reads against predicted exons and junctions. This approach yielded a set of 1,413 high-confidence new exons (Supplementary Table 3), with an estimated false discovery rate (FDR) of ~1.5% (Supplementary Information), and thousands of putative new splice junctions (not shown). Thus, mRNA-Seq has strong potential for discovery of new exons, although very substantial read depth is required to efficiently detect low-abundance isoforms (Supplementary Fig. 3).

Tissue-specific isoform expression

To explore the extent of tissue regulation of alternative transcripts, we examined eight common types of ‘alternative transcript events’ [1,2], each capable of producing multiple mRNA isoforms from human genes through alternative splicing, alternative cleavage and polyadenylation (APA) and/or alternative promoter usage (Fig. 2). Event types considered included skipped exons and retained introns, in which a single exon or intron is alternatively included or spliced out of the mature message, and MXEs, described previously. Also included were alternative 5′ splice site (A5SS) and alternative 3′ splice site (A3SS) events, which are particularly difficult to interrogate by microarray analysis because the variably included region is often quite small. Tandem 3′ untranslated regions (UTRs) and alternative last exons (ALEs), in which alternative use of a pair of polyadenylation sites results in shorter or longer 3′ UTR isoforms or in distinct terminal exons, respectively, were also considered. Finally, we considered alternative first exons (AFEs), in which alternative promoter use results in mRNA isoforms with distinct 5′ UTRs.

For each of these event types, reads deriving from specific regions can support the expression of one alternative isoform or the other (Fig. 2). The ‘inclusion ratio’, defined as the ratio of the number of ‘inclusion’ (blue) reads to inclusion plus ‘exclusion’ (red) reads, can be used to detect changes in the proportions of the corresponding mRNA isoforms. The fraction of mRNAs that contain an exon, the ‘per cent spliced in’ (PSI or Ψ) value, can be estimated as the ratio of the density of inclusion reads (that is, reads per position in regions supporting the inclusion isoform) to the sum of the densities of inclusion and exclusion reads.
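The two quantities defined above can be written out directly. This is a minimal sketch with hypothetical function names and invented read counts; "positions" stands for the number of positions in the regions supporting each isoform, which the text uses to turn read counts into densities.

```python
def inclusion_ratio(inclusion_reads, exclusion_reads):
    """Inclusion reads divided by inclusion plus exclusion reads."""
    return inclusion_reads / (inclusion_reads + exclusion_reads)


def percent_spliced_in(inclusion_reads, inclusion_positions,
                       exclusion_reads, exclusion_positions):
    """PSI estimated from read densities (reads per position)."""
    inclusion_density = inclusion_reads / inclusion_positions
    exclusion_density = exclusion_reads / exclusion_positions
    return inclusion_density / (inclusion_density + exclusion_density)


# Hypothetical event: 60 inclusion reads over 150 positions,
# 20 exclusion reads over 50 positions.
ratio = inclusion_ratio(60, 20)
psi = percent_spliced_in(60, 150, 20, 50)
```

With these invented numbers the inclusion ratio is 0.75 but the PSI estimate is 0.5, because the inclusion isoform is supported by a region three times as long. The raw ratio is therefore suited to detecting changes between samples, while the density-based estimate approximates the isoform fraction itself.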

To assess tissue-regulated alternative splicing, a comprehensive set of ~105,000 events of these eight types was derived on the basis of available human cDNA and expressed sequence tag data. Reads supporting both alternative isoforms were observed for more than one-third of these events (Fig. 2), and the extent of tissue-specific regulation of these events was assessed by comparison of the inclusion ratio in each tissue relative to the other tissues, requiring a minimum of a 10% absolute change in inclusion ratio (Supplementary Fig. 4). Naturally, transcripts or isoforms identified as being differentially expressed between tissues will reflect the combined effects of cell-type-specific differences in transcript levels, variation in the relative abundances of cell types between tissues, and variations between the individuals from whom the tissues derived.

Notably, a high frequency of tissue-specific regulation was observed for each of the eight event types, including over 60% of the analysed skipped exon, A5SS, A3SS and tandem 3′ UTR events

[Figure 1 image. (a) Read coverage (log10 reads) at chr12: 97,511,900–97,516,650 bp in testes, liver, skeletal muscle and heart, with transcripts AK074759 and AK092689 and exons 3A and 3B. (b) Fraction of genes against number of reads (log10). (c) Fraction of genes against minimum minor isoform fraction (%).]

Figure 1 | Frequency and relative abundance of alternative splicing isoforms in human genes. a, mRNA-Seq reads mapping to a portion of the SLC25A3 gene locus. The number of mapped reads starting at each nucleotide position is displayed (log10) for the tissues listed at the right. Arcs represent junctions detected by splice junction reads. Bottom: exon/intron structures of representative transcripts containing mutually exclusive exons 3A and 3B (GenBank accession numbers shown at the right). b, Mean fraction of multi-exon genes with detected alternative splicing in bins of 500 genes, grouped by total read count per gene. A gene was considered as alternatively spliced if splice junction reads joining the same 5′ splice site (5′SS) to different 3′ splice sites (3′SS) (with at least two independently mapping reads supporting each junction), or joining the same 3′SS to different 5′SS, were observed. The true extent of alternative splicing was estimated from the upper asymptote of the best-fit sigmoid curve (red curve). Circles show the fraction of alternatively spliced genes. c, Frequency of alternative splicing in the top bin (black bars) and after estimation (as in b, red bars), considering only events with relative expression of less abundant (minor) splice variant exceeding a given threshold. Error bars, s.e.m.

NATURE | Vol 456 | 27 November 2008 ARTICLES

Wang, Eric T., et al. "Alternative isoform regulation in human tissue transcriptomes." Nature 456.7221 (2008): 470.

[Figure image, panels A-D. Regional association plots (−log10 p-value against chromosome 5 position, 75,250,000–75,450,000 bp) for LDL and for HMGCR ENST00000343975 (panel labelled N), with PP4 = 0.996. Read coverage (FPM) by rs3846662 genotype (GG, GA, AA) across HMGCR ENST00000287936, HMGCR ENST00000343975, txrevise ENST00000343975 and txrevise ENST00000511206; x-axis: distance from region start (bp). Usage of txrevise ENST00000343975 and of txrevise ENST00000511206 by rs3846662 genotype (GG, GA, AA).]

Burkhardt, Ralph, et al. "Common SNPs in HMGCR in micronesians and whites associated with LDL-cholesterol levels affect alternative splicing of exon13." Arteriosclerosis, thrombosis, and vascular biology 28.11 (2008): 2078-2084.

The New England Journal of Medicine, N Engl J Med 375;22, nejm.org, December 1, 2016, page 2148

HMGCR scores (P = 2.9×10^−15) and a 6.6% lower risk of myocardial infarction or death from coronary heart disease (odds ratio, 0.93; 95% CI, 0.90 to 0.97) (Fig. 1B). As with the PCSK9 score, the HMGCR score had a very consistent effect on each of the secondary outcomes and a similar effect in all subgroups studied.

After adjustment for a standard decrement of 10 mg per deciliter in the LDL cholesterol level, PCSK9 variants were associated with an 18.9% decrease in the risk of myocardial infarction or death from coronary heart disease (odds ratio, 0.81; 95% CI, 0.74 to 0.89) and HMGCR variants were associated with a nearly identical 19.1%

Figure 1. Effect of PCSK9 and HMGCR Genetic Scores on the Risk of Myocardial Infarction or Death from Coronary Heart Disease.

For each study participant, we calculated a weighted PCSK9 genetic score and a weighted HMGCR score by adding the number of low-density lipoprotein (LDL) cholesterol–lowering alleles that the person had inherited at each variant that was included in either score, weighted by the effect of each variant on LDL cholesterol levels measured in milligrams per deciliter. A total of 10,401 primary cardiovascular events (myocardial infarction or death from coronary heart disease [CHD]) were included in the analysis. Of these events, 5508 (1528 prevalent myocardial infarctions and 3980 incident myocardial infarctions or deaths from CHD) occurred in the prospective cohort studies and 4893 in the case–control studies. Across the included studies, the median weighted PCSK9 score was 12.7 (range, 0 to 24.5), and the median weighted HMGCR score was 16.8 (range, 4.1 to 26.7). Boxes represent point estimates of effect. Lines represent 95% confidence intervals (CIs).

A PCSK9 Score

Group; Difference in LDL Cholesterol vs. Score below Median or Reference (mg/dl); Odds Ratio for Myocardial Infarction or Death from CHD (95% CI)
PCSK9 score above median; −4.2; 0.92 (0.88–0.95)
Quartile 4 of PCSK9 scores; −5.8; 0.89 (0.84–0.94)
Quartile 3 of PCSK9 scores; −3.9; 0.93 (0.88–0.98)
Quartile 2 of PCSK9 scores; −1.8; 0.97 (0.91–1.03)
Quartile 1 of PCSK9 scores; Reference; Reference

B HMGCR Score

Group; Difference in LDL Cholesterol vs. Score below Median or Reference (mg/dl); Odds Ratio for Myocardial Infarction or Death from CHD (95% CI)
HMGCR score above median; −3.2; 0.93 (0.90–0.97)
Quartile 4 of HMGCR scores; −4.6; 0.90 (0.85–0.95)
Quartile 3 of HMGCR scores; −3.1; 0.93 (0.88–0.98)
Quartile 2 of HMGCR scores; −1.2; 0.98 (0.92–1.04)
Quartile 1 of HMGCR scores; Reference; Reference

C Effect of PCSK9 and HMGCR Scores on Risk of Myocardial Infarction or Death from CHD per Unit Change in LDL Cholesterol

Score; Standardized Difference in LDL Cholesterol (mg/dl); Odds Ratio for Myocardial Infarction or Death from CHD (95% CI) per Decrease in LDL Cholesterol of 10 mg/dl
PCSK9 genetic score; −10.0; 0.81 (0.74–0.89)
HMGCR genetic score; −10.0; 0.81 (0.72–0.90)

Forest plot x-axis in all panels: Natural Logarithm of Odds Ratio.

Ference, Brian A., et al. "Variation in PCSK9 and HMGCR and risk of cardiovascular disease and diabetes." New England Journal of Medicine 375.22 (2016): 2144-2153.








Genetic variant -> splicing change in HMGCR -> less LDL -> lower risk of coronary heart disease (CHD)








Statins -> block HMGCR -> less LDL -> lower risk of coronary heart disease (CHD) (approved by FDA in 1987)

Estimating transcript expression with the Expectation-Maximization (EM) algorithm

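[Figure 2 image. (a) Transcript t1 (exons e1, e2, e3) and transcript t2 (exons e1, e3), with read start regions d1, d2, d3 on t1 and d1, d3, d4 on t2; each read is labelled with the set of transcripts it maps to. (b) Haplo-isoforms t1A (allele C) and t1B (allele G) with read start regions d1, d2, d3.]

Panel (a):

$$M = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad k = \begin{pmatrix} 6 \\ 4 \\ 1 \end{pmatrix}$$

$$s = \begin{pmatrix} d_1 + d_3 \\ d_2 \\ d_4 \end{pmatrix} = \begin{pmatrix} e_1 + e_3 - 2(\epsilon - 1) \\ e_2 + \epsilon - 1 \\ \epsilon - 1 \end{pmatrix}$$

$$l_1 = s_1 + s_2 = e_1 + e_2 + e_3 - (\epsilon - 1), \qquad l_2 = s_1 + s_3 = e_1 + e_3 - (\epsilon - 1)$$

Panel (b), with the same M (columns t1A and t1B):

$$k = \begin{pmatrix} 4 \\ 2 \\ 2 \end{pmatrix}$$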

Figure 2 MMSEQ data structures to represent read mappings to alternative isoforms and alternative haplotypes. (a) Schematic of a gene with an alternatively spliced cassette exon. Each read is labeled according to the transcripts it maps to and placed along its alignment position. Reads that map to both transcripts, t1 and t2, are shown in red, reads that map only to t1 are shown in blue and the read that maps only to t2 is shown in green. Reads that align with their start positions in the regions labeled by d1 and d3 (in red) may have come from either transcript, reads with their start positions in d2 (in blue) can only have come from transcript 1, and reads with their start positions in d4 (in green) must be from transcript 2. Each row i of the indicator matrix M characterizes a unique set of transcripts that is mapped to by k_i reads. There are three transcript sets: {t1, t2} (red), {t1} (blue) and {t2} (green). Exon lengths are e1, e2, e3. Hence s1 = d1 + d3, s2 = d2 and s3 = d4. The effective length of transcript t is equal to the sum over the elements of s that have a corresponding 1 in column t of M, that is ∑_i s_i M_it. It can be seen from the figure that these lengths are the sums of the exons minus read length (ϵ) plus one, as expected. (b) Schematic of a single-exon gene with a heterozygote near the center. Reads with starting positions in region d2 contain either the ‘C’ allele or the ‘G’ allele and thus map to either the haplo-isoform t1A, which has a ‘C’ or t1B, which has a ‘G’. It is evident that the heterozygote acts like an alternative middle exon, and that the same model and data structures as in the alternative isoform schematic apply.
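The effective-length rule in the caption, ∑_i s_i M_it, can be checked with numbers. In this sketch the exon lengths (100, 50, 120) and the read length (30) are hypothetical, and the function names are illustrative. The region lengths follow the panel (a) geometry: s1 = e1 + e3 − 2(ϵ − 1), s2 = e2 + ϵ − 1, s3 = ϵ − 1.

```python
def class_lengths(e1, e2, e3, read_length):
    """Lengths s of the three read start regions in panel (a)."""
    s1 = e1 + e3 - 2 * (read_length - 1)  # d1 + d3, shared by t1 and t2
    s2 = e2 + read_length - 1             # d2, specific to t1
    s3 = read_length - 1                  # d4, specific to t2
    return [s1, s2, s3]


def effective_lengths(M, s):
    """Effective length of each transcript t: sum over i of s[i] * M[i][t]."""
    n_transcripts = len(M[0])
    return [
        sum(s[i] * M[i][t] for i in range(len(M)))
        for t in range(n_transcripts)
    ]


# Rows: transcript sets {t1, t2}, {t1}, {t2}; columns: t1, t2.
M = [[1, 1], [1, 0], [0, 1]]
s = class_lengths(e1=100, e2=50, e3=120, read_length=30)
l1, l2 = effective_lengths(M, s)
```

Both results equal the summed exon lengths minus the read length plus one, which is the number of positions at which a read of that length can start within the transcript.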

t1 - Transcript 1
t2 - Transcript 2
ϵ - Read length

Turro, Ernest, et al. "Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads." Genome biology 12.2 (2011): R13.

Transcript compatibility matrix (M) and read counts in each class (k)

Goal: Divide reads into equivalence classes according to the transcripts that they are compatible with.


Lengths of the classes
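As a small illustration of how the class lengths combine into effective transcript lengths, the sketch below evaluates l_t = ∑i s_i M_it for the cassette-exon example. The exon lengths and the read length are made-up toy values (an assumption, they do not come from the figure), and the helper name `effective_lengths` is hypothetical.

```python
# Toy numbers (assumptions, not taken from the slides).
e1, e2, e3 = 300, 120, 250  # exon lengths
eps = 50                    # read length

# Rows are the classes {t1, t2}, {t1}, {t2}; columns are t1, t2.
M = [
    [1, 1],
    [1, 0],
    [0, 1],
]

# Class lengths s = (d1 + d3, d2, d4), written in terms of the exon lengths.
s = [
    e1 + e3 - 2 * (eps - 1),
    e2 + (eps - 1),
    eps - 1,
]


def effective_lengths(M, s):
    """Return l_t = sum_i s_i * M[i][t] for every transcript t."""
    n_transcripts = len(M[0])
    return [
        sum(s_i * row[t] for s_i, row in zip(s, M))
        for t in range(n_transcripts)
    ]


l1, l2 = effective_lengths(M, s)
```

With these toy values, l1 and l2 equal the summed exon lengths of each transcript minus (ε − 1), in agreement with the formulas for l1 and l2 in the figure.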


Ntranos et al. Genome Biology (2016) 17:112 Page 2 of 14

Fig. 1 Equivalence class and transcript-compatibility counts. This figure gives an example of how reads are collapsed into equivalence classes. Each read is mapped to one or more transcripts in the reference transcriptome; these are transcripts that the read is compatible with, i.e., the transcripts that the read could possibly have come from. For example, read 1 is compatible with transcripts t1 and t3, read 2 is compatible with transcripts t1 and t2, and so on. An equivalence class is a group of reads that is compatible with the same set of transcripts. For example, reads 4, 5, 6, 7, 8 are all compatible with t1, t2, and t3, and they form an equivalence class. Since the reads in an equivalence class are all compatible with the same set of transcripts, we simply represent an equivalence class by that set of transcripts. For example, the equivalence class consisting of reads 4, 5, 6, 7, 8 is represented by {t1, t2, t3}. Aggregating the number of reads in each equivalence class yields the corresponding transcript-compatibility counts. Note that in order to estimate the transcript abundances from the transcript-compatibility counts, a read-generation model is needed to resolve the multi-mapped reads

Ntranos, Vasilis, et al. "Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts." Genome biology 17.1 (2016): 112.
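The collapsing of reads into equivalence classes can be sketched in a few lines. The compatibilities for reads 1, 2 and 4 to 8 follow the caption above; read 3 is not specified there, so its entry is an invented placeholder, and the function name `transcript_compatibility_counts` is hypothetical.

```python
from collections import Counter

# Read -> set of compatible transcripts.
compat = {
    "read1": {"t1", "t3"},
    "read2": {"t1", "t2"},
    "read3": {"t2"},  # assumption: the caption does not describe read 3
    "read4": {"t1", "t2", "t3"},
    "read5": {"t1", "t2", "t3"},
    "read6": {"t1", "t2", "t3"},
    "read7": {"t1", "t2", "t3"},
    "read8": {"t1", "t2", "t3"},
}


def transcript_compatibility_counts(compat):
    """Count reads per equivalence class (one class per distinct transcript set)."""
    return Counter(frozenset(transcripts) for transcripts in compat.values())


tcc = transcript_compatibility_counts(compat)
for eq_class, n_reads in tcc.items():
    print(sorted(eq_class), n_reads)
```

Using `frozenset` as the key makes the class independent of the order in which transcripts are listed, which is all that an equivalence class needs.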

The EM algorithm

• Initialisation: Assign reads to transcripts by dividing them equally among all compatible transcripts.

• M-step: Estimate transcript expression (mu) by dividing the total number of reads assigned to each transcript by its length.

• E-step: Re-estimate the assignment of reads to transcripts based on the current transcript expression estimates.

• Repeat the M-step and E-step until convergence.
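The steps above can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool: it uses the compatibility matrix M and the read counts k = (6, 4, 1) from the cassette-exon example, while the two effective lengths (621 and 501) and the function name `em_quantify` are assumptions made for the sketch.

```python
def em_quantify(M, k, lengths, n_iter=1000, tol=1e-9):
    """EM over equivalence classes.

    M[i][t] is 1 if class i is compatible with transcript t,
    k[i] is the number of reads in class i,
    lengths[t] is the (effective) length of transcript t.
    Returns (mu, counts): expression per base and expected reads per transcript.
    """
    n_classes, n_tx = len(M), len(lengths)

    # Initialisation: split each class equally among its compatible transcripts.
    assigned = [
        [k[i] * M[i][t] / sum(M[i]) for t in range(n_tx)]
        for i in range(n_classes)
    ]
    mu = [0.0] * n_tx

    for _ in range(n_iter):
        # M-step: reads assigned to each transcript divided by its length.
        new_mu = [
            sum(assigned[i][t] for i in range(n_classes)) / lengths[t]
            for t in range(n_tx)
        ]
        # E-step: redistribute each class in proportion to current expression.
        for i in range(n_classes):
            denom = sum(M[i][t] * new_mu[t] for t in range(n_tx))
            for t in range(n_tx):
                assigned[i][t] = (
                    k[i] * M[i][t] * new_mu[t] / denom if denom > 0 else 0.0
                )
        converged = max(abs(a - b) for a, b in zip(new_mu, mu)) < tol
        mu = new_mu
        if converged:
            break

    counts = [sum(assigned[i][t] for i in range(n_classes)) for t in range(n_tx)]
    return mu, counts


M = [[1, 1], [1, 0], [0, 1]]  # classes {t1, t2}, {t1}, {t2}
k = [6, 4, 1]                 # read counts per class
lengths = [621, 501]          # toy effective lengths (assumption)

mu, counts = em_quantify(M, k, lengths)
print(mu, counts)
```

The reads that map uniquely (4 to t1, 1 to t2) pull the six ambiguous reads mostly towards t1, so t1 ends up with the larger share of the expected counts.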


K-means clustering is a type of EM algorithm

• http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html


RESEARCH Open Access

Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data

Alexander Kanitz†, Foivos Gypas†, Andreas J. Gruber, Andreas R. Gruber, Georges Martin and Mihaela Zavolan*

Abstract

Background: Understanding the regulation of gene expression, including transcription start site usage, alternative splicing, and polyadenylation, requires accurate quantification of expression levels down to the level of individual transcript isoforms. To comparatively evaluate the accuracy of the many methods that have been proposed for estimating transcript isoform abundance from RNA sequencing data, we have used both synthetic data as well as an independent experimental method for quantifying the abundance of transcript ends at the genome-wide level.

Results: We found that many tools have good accuracy and yield better estimates of gene-level expression compared to commonly used count-based approaches, but they vary widely in memory and runtime requirements. Nucleotide composition and intron/exon structure have comparatively little influence on the accuracy of expression estimates, which correlates most strongly with transcript/gene expression levels. To facilitate the reproduction and further extension of our study, we provide datasets, source code, and an online analysis tool on a companion website, where developers can upload expression estimates obtained with their own tool to compare them to those inferred by the methods assessed here.

Conclusions: As many methods for quantifying isoform abundance with comparable accuracy are available, a user’s choice will likely be determined by factors such as the memory and runtime requirements, as well as the availability of methods for downstream analyses. Sequencing-based methods to quantify the abundance of specific transcript regions could complement validation schemes based on synthetic data and quantitative PCR in future or ongoing assessments of RNA-seq analysis methods.

Background

The general availability of high-throughput sequencing technologies greatly facilitated the detection and quantification of RNA species, including protein-coding RNAs, long non-coding RNAs, and microRNAs, in many different systems. In higher eukaryotes, the vast majority of protein-coding genes express multiple transcript isoforms [1–3]. Although a substantial proportion of transcript isoforms may result from stochasticity in the splicing process [4, 5], striking examples of isoform switching with large impact on cellular phenotypes are also known (for example, [6, 7]). Tissue-specific splicing patterns have been linked to the expression of specific RNA-binding proteins [8], some of which appear to act as ‘master’ regulators of alternative splicing in individual tissues [9]. For example, muscleblind-like proteins 1 and 2 (MBNL1/MBNL2) are expressed in mesenchymal cells and their downregulation facilitates somatic cell reprogramming [10], while the epithelial splicing regulatory proteins 1 and 2 (ESRP1/ESRP2) establish epithelia-specific patterns of isoform expression [11]. Nevertheless, despite the long history of the field, the functional relevance of most isoforms that can be detected with sequencing approaches remains unclear [12], particularly in light of the rapid change of isoform usage pattern in evolution that indicates relatively weak selection pressure [13].

Analysis of expression pattern is often one of the first steps towards understanding a gene’s function. However, transcript isoform abundance is almost always quantified

* Correspondence: [email protected]
† Equal contributors
Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland

© 2015 Kanitz et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Kanitz et al. Genome Biology (2015) 16:150 DOI 10.1186/s13059-015-0702-5

is used (Fig. 1a), and two orders of magnitude when the multi-threading option (16 cores; Fig. 1b) is used. In particular, the times required to process the alignments of 100 million in silico-generated reads range between approximately 7 min (IsoEM) and more than 1 week (TIGAR2) when a single processor is used, and between about 5 min (IsoEM) and 8 h (RSEM) when 16 cores are available for the tools that support multi-threading (TIGAR2 does not). With the exception of Sailfish, runtimes strictly increased with the number of processed read alignments. Assuming that a method-specific, but largely sample size-independent time span is required to index the supplied transcriptome, time complexities for most of the quantification algorithms appear to be approximately linear. Sailfish’s runtimes seem to be the highest for the smallest dataset, presumably because the convergence of estimation is slow for small datasets, when the vast majority of transcripts are sparsely covered. Notably, Sailfish computes abundances based on raw read sequences rather than alignments. Thus, whenever alignments are dispensable, a considerable amount of time (typically 1 h or more) can be saved on sample pre-processing compared to all other methods (refer to [19, 27, 28] for an overview of ‘mapping’ times for some short-read aligners and

Table 1 Overview of surveyed methods

| Name | Reference sequence (a) | Principle | Released |
|---|---|---|---|
| BitSeq | Transcripts | Bayesian estimation of parameters of a model that explains the read-to-transcript alignment data. Reads are assumed to be sampled independently, without positional bias from transcripts, such that the probability of an alignment starting at a given position of a transcript is inversely proportional to the transcript length. Sub-optimal alignments are used to estimate the ‘background’ of spurious alignments. | 2012 [67, 68] |
| CEM | Genome | Component elimination expectation-maximization approach to estimating the parameters of isoform abundance. For each gene it aims to find a ‘sparse’ solution, with few expressed isoforms. Read sampling from isoforms is assumed to obey a quasi-multinomial distribution, in which positional and other biases are modeled as an effective distribution which could be, for example, uniform (no positional bias) or exponential (modeling the process of RNA degradation). | 2012 [69] |
| Cufflinks | Genome | Bayesian approach to estimating transcript abundances by explicitly modeling the length of the fragments expected from RNA-seq. It assumes that for a given gene, reads are sampled independently with uniform probability along transcripts and in proportion to the transcript abundance between transcripts. Thus, if a read can be assigned to two transcripts of different lengths, the transcript with a shorter effective length will have a higher probability of giving rise to the read. | 2010 [70] |
| eXpress | Transcripts | Similar to Cufflinks, but it includes modeling of errors and indels and it has a different model for fragment length selection. Unlike Cufflinks and most other methods, eXpress processes read alignments ‘on-line’ so that it can be integrated into real-time analysis pipelines. | 2012 [32] |
| IsoEM | Genome | Expectation-maximization approach to inferring isoform abundances that are consistent with the coverage of isoforms by reads. The coverage is assumed to be uniform along an isoform. Base quality scores are taken into account in computing the probabilities of alignments. In the E-step, the expected number of reads derived from a given isoform is computed and in the M-step, the relative frequencies of isoforms are estimated. | 2011 [71] |
| MMSeq | Transcripts | Models the read data as Poisson-distributed variables with rates that depend on the abundance of the regions of the transcripts with which the reads are compatible and on the sequence-dependent bias in capturing the sequences. Priors on transcript abundances are Gamma-distributed. Sequencing errors are not modeled, there is only a filter on the minimal quality of considered alignments. | 2011 [73] |
| RSEM | Transcripts | Models the probability of observing a read as the sum of the relative abundance of the transcript to which the read maps times the probability of the read mapping to the transcript, and infers transcript abundances by expectation maximization. | 2009 [34, 35] |
| rSeq | Transcripts | Models read data as Poisson-distributed variables with rates that depend on the abundance of the regions of the transcripts with which the reads are compatible. | 2009 [75] |
| Sailfish (b) | Transcripts | Expectation-maximization method for explaining the abundance of k-mers inferred from the reads in terms of the abundance of the transcripts with the associated k-mer abundances. | 2014 [76] |
| Scripture | Genome | Transcript abundance is calculated as reads per kilobase of exonic sequence per million aligned reads, given the alignments of the reads to the genome and the annotated/reconstructed transcript. | 2010 [77] |
| TIGAR2 | Transcripts | Models the read data in terms of a large number of parameters which include, beyond the relative abundance of the transcripts, the read length distribution, the nucleotides, and alignment state and quality at the first and second position of the read. | 2013 [78, 85] |

The columns are: method name, sequences to which reads are compared (transcripts or genome), principle of the method, year of release, and associated reference(s).
(a) For methods operating on the genome sequence, genome annotation files (GTF/BED-formatted) were also provided.
(b) In contrast to other methods operating on transcripts, Sailfish uses k-mer statistics rather than aligning reads to transcripts.


kallisto (and salmon) use pseudo-alignment to rapidly identify the equivalence class of each read

© 2016 Nature America, Inc. All rights reserved.

526 VOLUME 34 NUMBER 5 MAY 2016 NATURE BIOTECHNOLOGY

BRIEF COMMUNICATIONS

To validate and benchmark kallisto, we tested it on a set of 20 RNA-seq simulations generated with the program RSEM (RNA-Seq by Expectation Maximization) [9], as well as on RNA-seq data from the Sequencing Quality Control Consortium (SEQC) [10] for which quantitative PCR (qPCR) can be used as an independent validation of quantification. The transcript abundances and error profiles for the simulated data were based on the quantification of sample NA12716_7 from the Genetic European Variation in Health and Disease (GEUVADIS) data set [11]. To accord with GEUVADIS samples, the simulations consisted of 30 million reads. We examine the quality of the kallisto pseudoalignments as compared to pseudoalignments extracted from Bowtie2 alignments. The two methods agreed exactly on the set of reported transcripts for 70.7% of the reads, but when they differed on the (pseudo)alignment of a read, Bowtie2 reported 8.02 transcripts on average compared to 4.96 for kallisto. Despite being much more specific than Bowtie2, kallisto had almost 100% sensitivity. The transcript of origin was contained in the set of reported transcripts for 99.89% of the reads, only 0.1% less than with Bowtie2 (99.99%). On the real data used as the basis for the simulations (NA12716_7), the programs displayed similar characteristics. The two methods agreed exactly for 66.22% of reads where both (pseudo)aligned, and for differing reads Bowtie2 aligned to 8.94 transcripts on average, versus 4.86 for kallisto. As expected, the number of (pseudo)aligned reads was lower for the real data, with 86.5% of the reads aligned by Bowtie2 versus 90.8% pseudoaligned by kallisto.

The accuracy of kallisto is similar to those of existing RNA-seq quantification tools (Fig. 2a and Supplementary Fig. 2) and enables a substantial improvement over Cufflinks [2] and Sailfish [5]. The inferior performance of Cufflinks can be attributed to its limited application of the EM algorithm in cases where reads multi-map across genomic locations [12]. Unlike Sailfish [5], which shreds reads into k-mers for fast hashing, resulting in a loss of information, kallisto’s pseudoalignments explicitly preserve the information provided by k-mers across reads (Supplementary Fig. 1).

All programs have reduced performance on paralogs owing to the similarity among genes within a family, but kallisto remains highly competitive, again almost matching RSEM’s performance (Supplementary Figs. 3 and 4). To test kallisto’s suitability for allele-specific expression quantification, we simulate reads from a transcriptome with two distinct haplotypes. The Spearman’s correlation for kallisto was 0.833 vs. 0.848 for RSEM, 0.830 for eXpress and 0.706 for Sailfish, showing that kallisto is suitable for allele-specific expression. Notably, the simulation was based on RSEM, for generating both the parameters and then the data using them.

We also tested kallisto on SEQC data that has independently been quantified with qPCR. Kallisto performed similarly to other programs


Figure 1 Overview of kallisto. The input consists of a reference transcriptome and reads from an RNA-seq experiment. (a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown. (b) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG) where nodes (v1, v2, v3, ... ) are k-mers, each transcript corresponds to a colored path as shown and the path cover of the transcriptome induces a k-compatibility class for each k-mer. (c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read. (d) Skipping (black dashed lines) uses the information stored in the T-DBG to skip k-mers that are redundant because they have the same k-compatibility class. (e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers.
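The core idea in panels (c) and (e), looking up the compatibility class of each k-mer and intersecting them, can be sketched with a plain dictionary. This is a simplified illustration and not kallisto's implementation: it does not build a de Bruijn graph and does not skip redundant k-mers, the three toy transcripts are invented, and ignoring k-mers that are absent from the index is an assumption of the sketch.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def pseudoalign(read, index, k):
    """Intersect the compatibility classes of the read's k-mers."""
    compatible = None
    for km in kmers(read, k):
        hits = index.get(km)
        if hits is None:
            continue  # assumption: k-mers missing from the index are ignored
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Invented toy transcriptome: three "exons" combined into three transcripts.
exon_a, exon_b, exon_c = "ATGCCGTA", "GGATTCAC", "TTGACCGA"
transcripts = {
    "t1": exon_a + exon_b + exon_c,
    "t2": exon_a + exon_c,
    "t3": exon_b + exon_c,
}
K = 5
index = build_index(transcripts, K)

junction_ab = pseudoalign("CCGTAGGAT", index, K)    # spans the A-B junction
shared_exon = pseudoalign("TTGACCGA", index, K)     # lies entirely in exon C
junction_ac = pseudoalign("GCCGTATTGAC", index, K)  # spans the A-C junction
print(junction_ab, shared_exon, junction_ac)
```

Reads that cross a junction unique to one transcript narrow the class to that transcript, while a read inside the shared exon stays compatible with all three.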

[Figure 2 plots. (a) Median relative difference (y axis, 0 to 1.00) by method, with bar labels in the order shown: TopHat2+Cufflinks 0.52, HISAT+Cufflinks 0.51, Sailfish 0.21, Bowtie2+eXpress 0.06, Kallisto 0.05, EMSAR 0.05, Bowtie2+RSEM 0.03. (b) Time (min) (y axis, 0 to 2,500) by method (TopHat2+Cufflinks, Bowtie2+RSEM, Bowtie2+eXpress, Bowtie2+EMSAR, HISAT+Cufflinks, Sailfish, Kallisto), split by stage (alignment, quantification).]

Figure 2 Performance of kallisto and other methods. (a) Accuracy of kallisto, Cufflinks, Sailfish, EMSAR, eXpress and RSEM on 20 RSEM simulations of 30 million 75-bp paired-end reads based on the abundances and error profile of GEUVADIS sample NA12716_7 (selected for its depth of sequencing). For each simulation, we report the accuracy as the median relative difference in the estimated read count of each transcript. Estimated counts were used rather than transcripts per million (TPM) because the latter is based on both the assignment of ambiguous reads and the estimation of effective lengths of transcripts, so a program might be penalized for having a differing notion of effective length despite accurately assigning reads. The values reported are means across the 20 simulations (the variance was too small to be visible in this plot). Relative difference is defined as the absolute difference between the estimated abundance and the ground truth divided by the average of the two. (b) Total running time in minutes for processing the 20 simulated data sets of 30 million paired-end reads described in a. All processing was done using 20 cores, with programs being run with 20 threads when possible (Bowtie2, TopHat2, RSEM, Cufflinks) and 20 parallel processes otherwise (eXpress, kallisto). Each box represents one dataset. Since eXpress and kallisto process all datasets in parallel, the only quantification time shown is the maximum of all the quantifications.

Bray, Nicolas L., et al. "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34.5 (2016): 525.
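The accuracy measure defined in the Figure 2 caption (absolute difference divided by the average of the estimate and the ground truth, summarised by the median over transcripts) can be written down directly. The function names are hypothetical, and treating the case where both values are zero as a relative difference of 0 is an assumption of this sketch, since the caption does not say how that case is handled.

```python
from statistics import median


def relative_difference(estimate, truth):
    """|estimate - truth| divided by the average of the two values."""
    mean = (estimate + truth) / 2
    if mean == 0:
        return 0.0  # assumption: two zeros count as perfect agreement
    return abs(estimate - truth) / mean


def median_relative_difference(estimates, truths):
    """Median of the per-transcript relative differences."""
    return median(relative_difference(e, t) for e, t in zip(estimates, truths))


# Invented toy counts for three transcripts.
estimated_counts = [110, 90, 0]
true_counts = [100, 100, 10]
mrd = median_relative_difference(estimated_counts, true_counts)
print(mrd)
```

Because the denominator is the average of the two values, the measure is bounded by 2, which is reached when one of the two values is zero and the other is not.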


Limitations of transcript expression estimation

Play with it yourself!

• https://github.com/kauralasoo/MTAT.03.239_Bioinformatics/blob/master/transcript_expression/EM-algorithm.md




58% of the transcripts are truncated!

[Supplemental Figure 1 (Sterne-Weiler et al. 2017), panels a–d. Recoverable labels: scale bar (500 bases, hg38); graph with nodes 1–14 (5′ to 3′) and edge types SL, LS, LR, RR, RR, LR, LR, LR, RR, LR, LR, RS, RS; 5′ splice site k-mer; canonical splicing; back splicing; isoform-level (dependence assumption) versus event-level (independence between events) quantification; Isoform 1; Isoform 2; alternative splice site.]

Transcript-level quantification assumes that splicing events are dependent on each other

Event-level quantification assumes that different events are regulated independently

Sterne-Weiler, Timothy, et al. "Whippet: an efficient method for the detection and quantification of alternative splicing reveals extensive transcriptomic complexity." bioRxiv (2017): 158519

Quantifying splicing at the event level

Wang, Eric T., et al. "Alternative isoform regulation in human tissue transcriptomes." Nature 456.7221 (2008): 470.

(Fig. 2 and Supplementary Table 4). In all, a set of over 22,000 tissue-specific alternative transcript events was identified, far exceeding previous sets of tissue-specific alternative splicing events that have typically numbered in the hundreds to low thousands [6–9, 18, 19]. Tissue-regulated skipped exon and MXE events are listed in Supplementary Tables 5 and 6, respectively. Binning events by expression level commonly yielded sigmoid curves for the fraction of tissue-regulated events of each type, enabling estimation of the true frequency of tissue regulation for each event type (Supplementary Figs 5 and 6). These estimates, ranging from 52% to 80% (Fig. 2), indicated that most alternative splicing events are regulated between tissues, providing an important element of support for the hypothesis that alternative splicing is a principal contributor to the evolution of phenotypic complexity in mammals.

Individual-specific isoform expression

To assess the extent of alternative splicing isoform variation between individuals in comparison to tissue-regulated alternative splicing, the correlations among the vectors of inclusion ratios for all expressed skipped exons between pairs of samples were determined (Fig. 3); this was performed similarly for other event types (not shown). In this analysis, strong clustering of the six cerebellar cortex samples was observed, with generally higher correlations among these samples than between pairs representing distinct tissues. Strong clustering of the five cell lines was also observed. This probably results from a combination of factors, including the common mammary epithelial origin of the cell lines studied, similar adaptations to culture conditions, and the high diversity of the tissues chosen.

The extent of variation in alternative isoform expression between individuals was also addressed by determining the number of differentially expressed exons among the six cerebellar cortex samples. Using the same approach as in Fig. 2, between ~10% and 30% of alternative transcript events showed individual-specific variation, depending on the event type (Supplementary Fig. 7), providing updated estimates of the scope of mRNA isoform variation between individuals [16]. These numbers are higher than estimates based on microarray analyses [20], but are in general agreement with an integrated analysis of multiple data types that estimated that ~21% of alternatively spliced genes are affected by polymorphisms that alter the relative abundances of alternative isoforms [17]. However, these frequencies are still below the 47–74% of events that showed variation among the ten tissues (Fig. 2), and approximately twofold to threefold less than the frequencies observed in comparisons among subsets of six tissues (Supplementary Fig. 7), indicating that, although inter-individual variation is fairly common, it is still substantially less frequent than variation between tissues. Thus, most of the differences observed between tissue samples are likely to represent tissue-specific rather than individual-specific variation.

Switch-like alternatively spliced exons

The quantitative nature of the mRNA-Seq approach allowed assessment of both subtle and switch-like alternative splicing events. By

[Figure 2 diagram and table. Rows: the alternative transcript event types (skipped exon, retained intron, alternative 5′ splice site (A5SS), alternative 3′ splice site (A3SS), mutually exclusive exon (MXE), alternative first exon (AFE), alternative last exon (ALE), tandem 3′ UTRs) plus a Total row. Columns: total events (×10³); number detected (×10³); both isoforms detected; number tissue-regulated; % tissue-regulated (observed); % tissue-regulated (estimated). Total row: 105; 100; 37,782; 22,657; 60; 68. Legend: constitutive exon or region; alternative exon or extension; body read; junction read; inclusive/extended isoform; exclusive isoform; both isoforms; pA, polyadenylation site.]

Figure 2 | Pervasive tissue-specific regulation of alternative mRNA isoforms. Rows represent the eight different alternative transcript event types diagrammed. Mapped reads supporting expression of upper isoform, lower isoform or both isoforms are shown in blue, red and grey, respectively. Columns 1–4 show the numbers of events of each type: (1) supported by cDNA and/or EST data; (2) with ≥1 isoform supported by mRNA-Seq reads; (3) with both isoforms supported by reads; and (4) events detected as tissue-regulated (Fisher’s exact test) at an FDR of 5% (assuming negligible technical variation [10]). Columns 5 and 6 show: (5) the observed percentage of events with both isoforms detected that were observed to be tissue-regulated; and (6) the estimated true percentage of tissue-regulated isoforms after correction for power to detect tissue bias (Supplementary Fig. 6) and for the FDR. For some event types, ‘common reads’ (grey bars) were used in lieu of (for tandem 3′ UTR events) or in addition to ‘exclusion’ reads for detection of changes in isoform levels between tissues.

ARTICLES NATURE | Vol 456 | 27 November 2008

472 © 2008 Macmillan Publishers Limited. All rights reserved

1010 | VOL.7 NO.12 | DECEMBER 2010 | NATURE METHODS

ARTICLES

information about the splicing of the alternative exon, as higher expression of the exclusion isoform will generally increase the density of reads in the flanking exons relative to the alternative exon, and lower expression of the exclusion isoform will decrease this ratio of densities. MISO captures this, as well as the information in the lengths of library inserts in paired-end data, by recasting the analysis of isoforms as a Bayesian inference problem. Our approach is related to the alternative-splicing quantification method12, which does not use paired-end information.

MISO samples reads uniformly from the chosen isoform, then recovers the underlying abundances of isoforms (Ψ and 1 − Ψ in the case of a single alternative exon) using the short read data (Fig. 1a and Supplementary Fig. 3). As a result of mRNA fragmentation in library preparation, mRNA abundance and length contribute roughly linearly to read sampling in RNA-seq. This effect is treated by rescaling the abundances Ψ and 1 − Ψ of the two isoforms by the number of possible reads that could be generated from each isoform, respectively. In the model, reads from a gene locus are produced by a generative process in which an isoform is first chosen according to its rescaled abundance, and a sequence read is then sampled uniformly from possible read positions along the mRNA (Online Methods). For the exon-centric analyses involving a single alternative exon we derived an analytic solution to the inference problem, whereas for isoform-centric analyses and estimation using CIs we developed an efficient inference technique based on Monte Carlo sampling (Online Methods). Our new estimator, Ψ̂_MISO, uses all of the read positions used in Ψ̂_SJ, plus reads aligning to the adjacent exons (Fig. 1b,c) and information about the library insert length distribution in paired-end RNA-seq. Both Ψ̂_SJ and Ψ̂_MISO are unbiased estimators of Ψ.
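The rescaling step described above can be illustrated with a short Python sketch. It assumes two isoforms and, unrealistically, that every read can be assigned to exactly one isoform; the function names and the numbers of possible reads are hypothetical. It is not MISO's Bayesian inference, only the arithmetic of the generative model and its inversion.

```python
def read_sampling_probabilities(psi, n_reads_inclusion, n_reads_exclusion):
    """Probability that a read is drawn from each isoform.

    psi is the abundance of the inclusion isoform; n_reads_* is the number
    of possible reads each isoform can generate (assumed known).
    """
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must lie in [0, 1]")
    w_inc = psi * n_reads_inclusion            # rescaled abundance, inclusion
    w_exc = (1.0 - psi) * n_reads_exclusion    # rescaled abundance, exclusion
    total = w_inc + w_exc
    return w_inc / total, w_exc / total


def psi_from_read_fractions(frac_inclusion, n_reads_inclusion, n_reads_exclusion):
    """Invert the rescaling: recover psi from the fraction of inclusion reads."""
    d_inc = frac_inclusion / n_reads_inclusion
    d_exc = (1.0 - frac_inclusion) / n_reads_exclusion
    return d_inc / (d_inc + d_exc)


# The longer inclusion isoform yields 75% of the reads although psi = 0.5
p_inc, p_exc = read_sampling_probabilities(0.5, 300, 100)
psi_back = psi_from_read_fractions(p_inc, 300, 100)
```

The example shows why raw read fractions are biased towards the longer isoform and why the length correction is needed before interpreting them as Ψ.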

An improved measure of exon expression

Simulating read generation from an alternatively spliced gene, we observed that the Ψ̂_MISO estimate had consistently much lower variance and error than Ψ̂_SJ (Fig. 1d). For reference, the distribution of read-coverage values at depths typically obtained from one lane of sequencing on an Illumina Genome Analyzer 2 (GA2) and on a HiSeq 2000 are shown, in units of reads per kilobase of exon model (RPK). For a gene with median coverage in the GA2 data set (~220 RPK), the s.d. of the estimated Ψ value was reduced more than twofold, from 0.21 for Ψ̂_SJ to 0.09 for Ψ̂_MISO.

Validation of MISO estimates

To assess the uncertainty in the splicing estimates for each exon, we calculated CIs for Ψ (Online Methods) from moderate-depth breast cancer RNA-seq data (Supplementary Table 1; examples are shown in Fig. 2a,b). Comparing Ψ̂_MISO estimates for 52 alternative exons to corresponding quantitative reverse-transcription PCR (qRT-PCR) values11,13 yielded a Pearson correlation r = 0.87 (Fig. 2c and Supplementary Table 2; a bias in the RT-PCR data was analyzed in Supplementary Figs. 4–6). Restricting the analysis to exons with 95% CI width <0.25 increased the correlation with qRT-PCR data considerably, to r = 0.96 (Fig. 2d). Thus, MISO CIs identify exons whose RNA-seq–based Ψ-value estimates are more reliable.

Detection of differentially expressed isoforms

Differential splicing of alternative exons entails a difference in Ψ values, ΔΨ, and can be evaluated statistically using the Bayes factor (BF), which quantifies the odds of differential regulation

[Figure 1, panels a–d. (a) Generative process: fragmentation and amplification (mRNA fragments sampled in proportion to isoform length and expression level), sequencing (short reads generated from fragments), alignment (reads aligned to genome and splice junctions) and Bayesian inference of Ψ and 1 − Ψ; the gene diagram marks the skipped exon, intron and constitutive exons. (b) Single-end estimate, Ψ̂_SJ, from inclusion and exclusion reads. (c) Paired-end estimate, Ψ̂_MISO, which also uses constitutive reads and assigns reads to isoforms using the insert length distribution (x axis: insert length, nt). (d) Estimated Ψ (y axis, 0 to 1.0) against coverage (x axis, RPK, 0 to 2,000) for Ψ̂_SJ and Ψ̂_MISO, with percentile-of-genes markers for the 25M PE (GA2) and 80M PE (HiSeq) data sets.]

Figure 1 | More accurate inference of splicing levels using MISO. (a) Generative process for MISO model. White, alternatively spliced exon; gray and black, flanking constitutive exons. RNA-seq reads aligning to the alternative exon body (white) or to splice junctions involving this exon support the inclusive isoform, whereas reads joining the two constitutive exons (black-gray exon junction) support the exclusive isoform. Reads aligning to the constitutive exons are common to both isoforms. (b) The Ψ̂_SJ estimate uses splice-junction and alternative exon–body reads only. (c) The MISO estimate, Ψ̂_MISO (derived here analytically), also uses constitutive reads and paired-end read information; orange lines connect reads in a pair; the insert length distribution is shown at right. (d) Comparison of Ψ̂_SJ and Ψ̂_MISO estimates from simulated data. Reads were sampled at varying coverage, measured in RPK, from the gene structure shown at top right, with underlying true Ψ = 0.5. Mean values from 3,000 simulations are shown (±s.d.) for each coverage value. Percentiles of gene expression values are shown for a data set assuming 25 million mapped paired-end (PE) read pairs (25M PE; blue, extrapolating from an Illumina GA2 run that yielded 15 million mapped read pairs) and for a data set of 78 million mapped read pairs from an Illumina HiSeq 2000 instrument (78M PE; red), both obtained from human heart tissue.

Katz, Yarden, et al. "Analysis and design of RNA sequencing experiments for identifying isoform regulation." Nature methods 7.12 (2010): 1009

MISO: Mixture of isoforms model

DEXSeq: differential exon usage

derived from a single colorectal tumor resistant to a drug with a cell line derived from a single tumor sensitive to the drug. This method, too, cannot be applied to replicated samples. Trapnell et al. (2010), when presenting the Cufflinks/Cuffdiff tool chain, compared consecutive time points, using data from one sample for each time point. The Cuffdiff software tool, in the version described in the paper, can only process pairs of samples without replicates. Brooks et al. (2010) used replicates but did not use them to assess biological variability because they used a modified version of the method of Wang et al. (2008). A notable instance in which biological variation was accounted for in the statistical analysis is the work of Blekhman et al. (2010). However, their method relies on the availability of a moderate-to-large number of samples, and no software implementation was provided.

The importance of accounting for biological variation has been pointed out by Baggerly et al. (2003) and recently by Hansen et al. (2011). Methods to do so when inferring differential expression were suggested by Baggerly et al. (2003) and Lu et al. (2005). Subsequently, Robinson and coworkers presented the edgeR method (Robinson and Smyth 2007, 2008; Robinson et al. 2010b), which introduced the use of the negative binomial distribution to RNA-seq analysis. Robinson et al. (2010a) extended edgeR with generalized linear models (GLMs) and the Cox-Reid dispersion estimator, discussed below. The basic approach of using exon–condition interactions in linear or generalized linear models to detect differential exon usage has been explored before by Cline et al. (2005) and Purdom et al. (2008) for exon microarrays and by Blekhman et al. (2010) for RNA-seq data. Our method can be seen as a further development of these approaches that also incorporated ideas from DESeq (Anders and Huber 2010).

In this article, we first explain the proposed statistical inference procedure and then use it to reanalyze published data sets by Brooks et al. (2010), by Brawand et al. (2011), and by The ENCODE Project Consortium (2011). In the Discussion, we elaborate on the observation that most published methods are unable to account for biological variation, focusing on the analysis provided by Brooks et al. (2010) for their data (which is based on the method of Wang et al. 2008), and illustrate how this leads to unreliable results. Finally, we compare DEXSeq with the one competing tool that claims to account for biological variation, namely, the new versions of Cuffdiff.

Method

Preparation: Flattening gene models and counting reads

The initial step of an analysis is the alignment of the sequencing reads to the genome. Here, it is important to use a tool capable of properly handling reads that straddle introns. Then, transcriptome annotation with coordinates of exon boundaries is required. For model organisms, reference gene model databases as provided, e.g., by Ensembl (Flicek et al. 2011), may be used. In addition, such a reference may be augmented by information retrieved from the RNA-seq data set that is being studied. Garber et al. (2011) review tools for the above tasks.

The central data structure for our method is a table that, in the simplest case, contains for each exon of each gene the number of reads in each sample that overlap with the exon. Special attention is needed, however, if an exon’s boundary is not the same in all transcripts. In such cases, we cut the exon in two or more parts (Fig. 1). We use the term ‘‘counting bin’’ to refer to exons or parts of exons derived in this manner. Note that a read that overlaps with several counting bins of the same gene is counted for each of these.
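The flattening and counting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DEXSeq implementation: exons and reads are half-open (start, end) intervals on one gene, strand and spliced reads are ignored, and the function names and coordinates are hypothetical.

```python
def counting_bins(exons):
    """Cut the exons of all transcripts of a gene into non-overlapping bins.

    exons: iterable of (start, end) half-open intervals, pooled over transcripts.
    """
    exons = list(exons)
    boundaries = sorted({pos for start, end in exons for pos in (start, end)})
    bins = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some exon fully covers it (skips introns)
        if any(start <= left and right <= end for start, end in exons):
            bins.append((left, right))
    return bins


def count_reads(bins, reads):
    """Count, for each bin, the reads overlapping it.

    A read overlapping several bins is counted once for each of them.
    """
    return [
        sum(1 for r_start, r_end in reads if r_start < b_end and b_start < r_end)
        for b_start, b_end in bins
    ]


# Three exons, one of which has two alternative right-hand boundaries
exons = [(100, 200), (300, 400), (300, 450), (600, 700)]
bins = counting_bins(exons)
counts = count_reads(bins, [(150, 190), (380, 420), (610, 650)])
```

In this toy gene the exon of variable length is split into the bins (300, 400) and (400, 450), and the read (380, 420) contributes to both.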

Model and inference

We denote by $k_{ijl}$ the number of reads overlapping counting bin $l$ of gene $i$ in sample $j$. We interpret $k_{ijl}$ as a realization of a random variable $K_{ijl}$. The number of samples is denoted by $m$, i.e., $j = 1, \ldots, m$.

We write $\mu_{ijl}$ for the expected value of the concentration of cDNA fragments contributing to counting bin $l$ of gene $i$, and relate the expected read count $E(K_{ijl})$ to $\mu_{ijl}$ via the size factor $s_j$, which accounts for the depth that sample $j$ was sequenced: $E(K_{ijl}) = s_j \mu_{ijl}$. Note that $s_j$ depends only on $j$, i.e., the differences in sequencing depth are assumed to cause a linear scaling of the read counts. We estimate the size factors with the same method as in DESeq (Anders and Huber 2010; for details, please see Supplemental Note S.1).
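The excerpt defers the size-factor details to a supplement. As a hedged sketch, the median-of-ratios estimator described for DESeq can be written as below; the table layout (rows are counting bins, columns are samples), the function name and the choice to skip rows containing a zero are assumptions made for the illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a table of read counts.

    counts: list of rows, one per feature, each a list of per-sample counts.
    """
    n_samples = len(counts[0])
    rows, log_geo_means = [], []
    for row in counts:
        if all(c > 0 for c in row):  # the geometric mean needs positive counts
            rows.append(row)
            log_geo_means.append(sum(math.log(c) for c in row) / n_samples)
    factors = []
    for j in range(n_samples):
        # Median over features of the log ratio to the per-feature geometric mean
        log_ratios = [math.log(row[j]) - g for row, g in zip(rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# The second sample was sequenced twice as deeply as the first
s = size_factors([[10, 20], [100, 200], [5, 10], [0, 3]])
```

Dividing each count by its sample's size factor puts the samples on a common scale, which is the role of $s_j$ in $E(K_{ijl}) = s_j \mu_{ijl}$.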

A generalized linear model

We use generalized linear models (GLMs) (McCullagh and Nelder 1989) to model read counts. Specifically, we assume $K_{ijl}$ to follow a negative binomial (NB) distribution:

$$K_{ijl} \sim \mathrm{NB}\left(\text{mean} = s_j \mu_{ijl},\ \text{dispersion} = \alpha_{il}\right), \qquad (1)$$

where $\alpha_{il}$ is the dispersion parameter (a measure of the distribution’s spread; see below) for counting bin $(i, l)$, and the mean is predicted via a log-linear model as

$$\log \mu_{ijl} = \beta^{G}_{i} + \beta^{E}_{il} + \beta^{C}_{i\rho_j} + \beta^{EC}_{i\rho_j l}. \qquad (2)$$

The negative binomial distribution in Equation 1 has been useful in many applications of count data regression (Cameron and Trivedi 1998). It can be seen as a generalization of the Poisson distribution: For a Poisson distribution, the variance $v$ is equal to the mean $\mu$, while for the negative binomial, the variance is $v = \mu + \alpha\mu^2$, with the dispersion $\alpha$ describing the squared coefficient of variation in excess of the Poisson case. Lu et al. (2005) and Robinson and Smyth (2007) motivated the use of the NB distribution for SAGE and RNA-seq data; we briefly summarize their argument in Supplemental Note S.2.
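A short numerical illustration of the mean-variance relation may help. The sketch below only evaluates the formula stated above; the function names and the example values (mean 100, dispersion 0.1) are chosen for illustration.

```python
def nb_variance(mu, alpha):
    """Variance of a negative binomial with mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2


def squared_cv(mu, alpha):
    """Squared coefficient of variation: Poisson part 1/mu plus dispersion alpha."""
    return nb_variance(mu, alpha) / mu ** 2


poisson_var = nb_variance(100.0, 0.0)   # alpha = 0 recovers the Poisson case
nb_var = nb_variance(100.0, 0.1)        # about 11 times the Poisson variance
excess_cv2 = squared_cv(100.0, 0.1) - 1.0 / 100.0  # equals alpha up to rounding
```

At high counts the Poisson term 1/μ vanishes and the squared coefficient of variation approaches α, which is why the dispersion dominates the noise for well-covered bins.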

We fit one model for each gene $i$, i.e., the index $i$ in Equation 2 is fixed. The linear predictor $\mu_{ijl}$ is decomposed into four factors as follows: $\beta^{G}_{i}$ represents the baseline expression strength of gene $i$. $\beta^{E}_{il}$ is (up to an additive constant) the logarithm of the expected fraction of the reads mapped to gene $i$ that overlap with counting bin $l$. $\beta^{C}_{i\rho_j}$ is the logarithm of the fold change in overall expression

Figure 1. Flattening of gene models: This (fictional) gene has three annotated transcripts involving three exons (light shading), one of which has alternative boundaries. We form counting bins (dark shaded boxes) from the exons as depicted; the exon of variable length gets split into two bins.

Genome Research 2009 www.genome.org

Differential usage of exons in RNA-seq

sufficiently many bins with a sample estimate $\hat{\alpha}_{il}$ so much larger than the fitted value $\alpha(\bar{\mu}_{il})$ that it would not be justified to only rely on the fitted values. Hence, for the ANODEV (see below), we use as dispersion value $\alpha_{il}$ the maximum of the per-bin estimate $\hat{\alpha}_{il}$ and the fitted value $\alpha(\bar{\mu}_{il})$. On average, this overestimates the true dispersion and costs power, but we consider this preferable to using either only the fitted values or the sample estimates, both of which carry the risk of producing many undesirable false positives. More sophisticated alternatives for this step, which usefully interpolate between the two extremes, and perhaps incorporate further covariates besides $\mu$, might become available in the future.

Analysis of deviance

We test for each counting bin whether it is differentially used between conditions. More precisely, we test against the null hypothesis that the fraction of reads overlapping with a counting bin $l$, of all the reads overlapping with the gene, does not change between conditions. To this end, we fit for each gene $i$ a reduced model with no counting-bin–condition interaction:

$$\log \mu_{ijl} = \beta^{G}_{i} + \beta^{E}_{il} + \beta^{S}_{ij}, \qquad (5)$$

and, separately for each bin $l'$ of gene $i$, a model with an interaction coefficient for only this bin, but as in Equation 5, main effects for all bins $l$,

$$\log \mu_{ijl} = \beta^{G}_{i} + \beta^{E}_{il} + \beta^{S}_{ij} + \beta^{EC}_{i\rho_j l}\,\delta_{ll'}. \qquad (6)$$

Here, $\delta_{ll'}$ is the Kronecker delta symbol, which is 1 if $l = l'$ and 0 otherwise. We compute the likelihood of these models using the dispersion values $\alpha_{il}$ as estimated from model (3), with the information-sharing scheme presented earlier. Comparing the fit (6) for counting bin $l'$ of gene $i$ with the fit (5) for gene $i$, we get an analysis-of-deviance P-value $p_{il'}$ for each counting bin by means of a $\chi^2$ likelihood-ratio test. Note that we test against the null hypothesis that none of the conditions influences exon usage, and hence, if there are more than two different conditions $\rho$, we aim to reject the null hypothesis already if any one of the conditions causes differential exon usage.
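The final step of this test, turning two fitted log-likelihoods into a P-value, can be sketched as follows. The sketch assumes exactly two conditions, so that the full model has one extra coefficient and the statistic is compared with a $\chi^2$ distribution with 1 degree of freedom; fitting the NB GLMs themselves is not shown, and the function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_1df(loglik_reduced, loglik_full):
    """Likelihood-ratio test P-value for one extra parameter.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)), so only the
    standard library is needed.
    """
    # Twice the log-likelihood gain of the full model over the reduced one
    stat = 2.0 * (loglik_full - loglik_reduced)
    stat = max(stat, 0.0)  # guard against tiny negative values from fitting
    return math.erfc(math.sqrt(stat / 2.0))


# A log-likelihood gain of about 1.92 corresponds to P = 0.05
p = lrt_pvalue_1df(-105.0, -103.0792705896529)
```

With more than two conditions the degrees of freedom grow with the number of extra interaction coefficients, and a general $\chi^2$ survival function would be needed instead of the 1-df shortcut.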

Differential exon usage, as treated here, cannot be distinguished from overall differential expression of a gene if the gene only consists of a single counting bin or if all but one of its counting bins have zero counts. Hence, we mark all counting bins with zero counts in all samples, and all bins in genes with less than two nonzero bins, as not testable. Furthermore, we skip counting bins with a count sum across all samples below a threshold chosen low enough that a significant result would be unlikely, to speed up computation. Such filtering can also improve power (see Bourgon et al. 2010).

Note that we perform one test for each counting bin, always fitting an interaction coefficient only for the single bin $l'$ under test. Therefore, it is valid that a read that overlaps with several exons is counted for each of these exons: In each test, for the purpose of estimating and testing the interaction coefficient, any given read is only considered at most once.

Additional covariates

The flexibility of GLMs makes it easy to account for further covariates. For example, if in addition to the experimental condition $\rho_j$ we wish to account for a further covariate $\tau_j$, we extend model (3) as follows:

$$\log \mu_{ijl} = \beta^{G}_{i} + \beta^{E}_{il} + \beta^{S}_{ij} + \beta^{EB}_{i\tau_j l} + \beta^{EC}_{i\rho_j l}. \qquad (7)$$

When testing for differential exon usage, the extra term $\beta^{EB}_{i\tau_j l}$ is added to both the reduced model (5) and the full model (6).

An example is provided in the next section with Equation 9.

Visualization

The DEXSeq package offers facilities to visualize data and fits. An example is shown in Figure 3, using the data discussed in the next section. Data and results for a gene are presented in three panels. The top panel depicts the fitted values from the GLM fit. For this plot, the data are fitted according to model (2), with the y coordinates showing the exponentiated sums:

$$\mu_{ijl} = \exp\left(\tilde{\beta}^{G}_{i} + \tilde{\beta}^{E}_{il} + \tilde{\beta}^{C}_{i\rho_j} + \tilde{\beta}^{EC}_{i\rho_j l}\right). \qquad (8)$$

Figure 3. The treatment of knocking down the splicing factor pasilla affects the fourth exon (counting bin E004) of the gene Ten-m (CG5723). (Top panel) Fitted values according to the linear model; (middle panel) normalized counts for each sample; (bottom panel) flattened gene model. (Red) Data for knockdown samples; (blue) control.

Anders, Simon, Alejandro Reyes, and Wolfgang Huber. "Detecting differential usage of exons from RNA-seq data." Genome research 22.10 (2012): 2008-2017.

TECHNICAL REPORT NATURE GENETICS

methods such as Cufflinks25, LeafCutter does not measure alternative transcription start sites and alternative polyadenylation directly, as they are not generally captured by intron excision events.) The major advantage of this representation is that LeafCutter does not require read assembly or inference of isoforms supported by ambiguous reads, both of which are computationally and statistically difficult. As a result, we were able to improve speed and memory requirements by an order of magnitude or more compared with those of similar methods such as MAJIQ12.

To identify alternatively excised introns, LeafCutter pools all mapped reads from a study and finds overlapping introns demarcated by split reads. LeafCutter then constructs a graph that connects all overlapping introns that share a donor or acceptor splice site. The connected components of the graph form clusters, which represent alternative intron excision events. Finally, LeafCutter iteratively applies a filtering step to remove rarely used introns, which are defined on the basis of the proportion of reads supporting a given intron compared with other introns in the same cluster, and re-clusters leftover introns (Online Methods and Supplementary Note 1). In practice, we have found that it is important to apply this filtering step to avoid arbitrarily large clusters at read depths where noisy splicing events are supported by multiple reads.
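The clustering idea, connected components of introns that share a splice site, can be sketched with a small union-find structure. This is an illustration only and not LeafCutter's code: introns are assumed to be (chromosome, start, end) tuples, strand is ignored (so "donor" and "acceptor" become simply start and end), the filtering and re-clustering steps are omitted, and the function name is hypothetical.

```python
def cluster_introns(introns):
    """Group introns into clusters of introns that share a start or an end."""
    parent = list(range(len(introns)))

    def find(i):
        # Find the cluster representative, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # splice site -> index of the first intron that uses it
    for idx, (chrom, start, end) in enumerate(introns):
        for site in ((chrom, "start", start), (chrom, "end", end)):
            if site in first_seen:
                parent[find(idx)] = find(first_seen[site])  # merge clusters
            else:
                first_seen[site] = idx

    clusters = {}
    for idx, intron in enumerate(introns):
        clusters.setdefault(find(idx), []).append(intron)
    return sorted(sorted(members) for members in clusters.values())


introns = [
    ("chr1", 100, 200),
    ("chr1", 100, 300),  # shares its start with the first intron
    ("chr1", 250, 300),  # shares its end with the second intron
    ("chr1", 500, 600),  # shares nothing: forms its own cluster
]
clusters = cluster_introns(introns)
```

The third intron never touches the first one directly, yet both land in the same cluster through the second intron, which is the connected-component behaviour described in the text.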

De novo identification of RNA splicing in mammalian organs. We tested LeafCutter’s novel intron-detection method by analyzing mapped RNA-seq19 data from 2,192 samples (Supplementary Note 1) across 14 tissues from the Genotype-Tissue Expression (GTEx) Consortium20. We then searched for introns that were predicted to be alternatively excised by LeafCutter but were missing in three commonly used annotation databases (GENCODE v19, Ensembl, and UCSC). For this analysis, we ensured that the identified introns were indeed alternatively excised by considering only introns that were excised at least 20% of the time compared with other overlapping introns, in at least one-fourth of the samples, analyzing each tissue separately. We found that between 10.8% and 19.3% (pancreas and spleen, respectively) of alternatively spliced introns were unannotated, excluding testis, the major outlier, in which 48.5% of alternatively spliced introns were previously unidentified (Fig. 2a).

[Figure 1, panels a–c. (a) RNA-seq reads mapped to the genome across constitutive and alternative exons (two unknown alternative isoforms) define intron clusters (connected components): Cluster 1 with counts (C1,1, C1,2) and Cluster 2 with counts (C2,1, C2,2, C2,3). Splicing estimates (quantitative traits) are within-cluster ratios, e.g. C1,1/(C1,1 + C1,2) and C2,1/(C2,1 + C2,2 + C2,3). (b) LeafCutter workflow: RNA-seq samples; RNA mapping (e.g., OLego, STAR); mapped RNA-seq reads (.bam); intron clustering and identification of study-specific introns; quantification of intron usage as counts or excision proportions; identification of differentially spliced introns, or of sQTLs when genotypes are supplied. (c) LeafViz: annotation and visualization of LeafCutter differential splicing output. The table of differential splicing events (clusters) has columns gene, genomic location, N, q and annotation, and lists HMGN3 (Chr6:79911443–79912034), RBFOX1 (Chr16:7703949–7726776), LMO7 (Chr13:76410593–76414221), REEP1 (Chr2:86444233–86459748), CADM1 (Chr11:115049495–115085328), FXR1 (Chr3:180688146–180693910) and IMMT (Chr2:86393767–86398331). The detail view shows RBFOX1 cluster clu_22023 on Chr16 with per-intron start, end, verdict and ΔPSI for brain (n = 10) and heart (n = 10).]

Fig. 1 | Overview of LeafCutter. a, LeafCutter uses split reads to uncover alternative intron-excision options by finding introns that share splice sites. In this example, LeafCutter identified two clusters of variably excised introns. b, The LeafCutter workflow. First, short reads are mapped to the genome. When SNP data are available, WASP33 should be used to filter allele-specific reads that map with a bias. Next, LeafCutter extracts junction reads from .bam files, identifies alternatively excised intron clusters, and summarizes intron usage as counts or proportions. Finally, LeafCutter identifies intron clusters with differentially excised introns between two user-defined groups by using a Dirichlet-multinomial model, or maps genetic variants associated with intron excision levels by using a linear model. c, Visualization of differential splicing among ten GTEx heart and brain samples by LeafViz. LeafViz is an interactive browser-based application that allows users to visualize results from LeafCutter differential splicing analyses. In this example, we observed that RBFOX1 showed differential usage of a mutually exclusive exon in heart compared with the usage in brain. Differential splicing is measured in terms of the change in the percent spliced in (ΔPSI). For all examples, see “URLs.”
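Summarizing intron usage as proportions amounts to dividing each intron's junction-read count by the total for its cluster. The following is a minimal sketch with hypothetical counts and function names; it does not implement the Dirichlet-multinomial test, and the ΔPSI helper simply subtracts proportions computed from two pooled count vectors, which is an assumption made for illustration.

```python
def excision_proportions(cluster_counts):
    """Within-cluster intron excision proportions from junction-read counts."""
    total = sum(cluster_counts)
    if total == 0:
        return [0.0] * len(cluster_counts)  # no reads: proportions undefined, report zeros
    return [count / total for count in cluster_counts]


def delta_psi(counts_group_a, counts_group_b):
    """Change in excision proportion per intron, group B minus group A."""
    prop_a = excision_proportions(counts_group_a)
    prop_b = excision_proportions(counts_group_b)
    return [b - a for a, b in zip(prop_a, prop_b)]


cluster1 = excision_proportions([30, 10])          # two introns
cluster2 = excision_proportions([2, 3, 5])         # three introns
change = delta_psi([30, 10], [10, 30])             # intron usage swaps between groups
```

Because the proportions within a cluster sum to one, a gain for one intron is always matched by losses elsewhere in the cluster, so ΔPSI values within a cluster sum to zero.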

NATURE GENETICS | VOL 50 | JANUARY 2018 | 151–158 | www.nature.com/naturegenetics
152 © 2017 Nature America Inc., part of Springer Nature. All rights reserved.

Li, Yang I., et al. "Annotation-free quantification of RNA splicing using LeafCutter." Nature genetics 50.1 (2018): 151.

Leafcutter: differential exon-exon junction usage

txrevise - a compromise between event-level and transcript-level quantification



[Diagram: one gene drawn against distance from region start (bp), with shared and unique exons marked, shown three ways.]

Ensembl transcripts: Transcript 1, Transcript 2

txrevise:
Alternative transcript starts: Promoter 1, Promoter 2
Alternative middle sections: Cassette exon, No cassette exon
Alternative transcript ends: Long 3’ end, Short 3’ end

Leafcutter:
Promoter 1, Promoter 2
Cassette exon, No cassette exon

Transcript assembly

© 2015 Nature America, Inc. All rights reserved.

NATURE BIOTECHNOLOGY ADVANCE ONLINE PUBLICATION 3

LETTERS

assembled super-reads as described above. Note that Scripture had very low precision on both data sets because it tends to predict a far larger number of splice variants for each gene than the other methods. On Sim-I, StringTie+SR found 20% more true transcripts than the next-best programs, with 34% fewer false positives. Not surprisingly, StringTie’s improvement is much higher on Sim-I than on the cleaner Sim-II data set, where the fragment sizes fol-lowed a distribution that matched the built-in assumptions of Cufflinks. Cufflinks in particular performed far better on Sim-II compared with Sim-I, with sensitivity and precision just slightly below StringTie. All other programs, however, were substantially lower than StringTie for both precision and sensitivity on both data sets.

In principle, the other programs can also be provided with the aligned super-reads as input, as done for StringTie+SR. We tried this strategy with Cufflinks (the best assembler other than StringTie), and both its sensitivity and precision declined substantially on Sim-I, whereas on Sim-II it made only marginal improvements (results not shown). By contrast, StringTie+SR performed better than StringTie alone on both data sets, though only by a small amount. The limited improvement is a consequence of the fact that the assembled super-reads used here simply filled in the gap between a pair of reads.

The accuracies shown in Figure 2a represent all transcripts, includ-ing those that were only partially covered by reads. We looked at how well the assemblers did for those transcripts that were fully covered by reads, that is, transcripts present at relatively higher levels in a given sample (Fig. 2b). Figure 2b and Supplementary Table 1 present the accuracy of all six programs on these fully covered transcripts. Assembly accuracy was defined as above for transcripts, and we intro-duce an analogous definition of gene-level accuracy; we considered a gene to be correctly identified if at least one of its transcripts was correctly assembled. Thus gene-level accuracy was always higher than transcript-level accuracy.

In most cases, the assemblers’ accuracies in Figure 2b followed the same ranking as in Figure 2a, which included partially covered tran-scripts. StringTie+SR and StringTie performed the best on both sensitivity and precision, followed by Cufflinks. For Sim-II, StringTie+SR showed an increase of more than 5% over Cufflinks in both sensitivity and precision. On Sim-I this increase was more than twice as great on both measures. On both data sets, StringTie and StringTie+SR predicted at least one tran-script perfectly matching the annotation for over 80% of the genes.

It is worth noting that Cufflinks is designed to eliminate isoforms expressed at very low levels, on the assumption that those isoforms may be incompletely spliced precursors or other artifacts. By default, the Cufflinks threshold for filtering out low-abundance transcripts is set to 10% of the most abundant isoform (computed separately for each gene). We tried lowering this threshold for Sim-I and Sim-II, which slightly increased Cufflinks’ sensitivity while reducing its precision by a comparable amount. Like Cufflinks, StringTie was also designed to eliminate assembled transcripts with very low levels of expression. Figure 2a,b shows StringTie’s accuracy when this filtering threshold was set to 10%, the same level as used by Cufflinks. Interestingly, lowering the threshold to 5% for StringTie still yielded better sensitivity and precision than Cufflinks yielded at the 10% threshold (Supplementary Figs. 3 and 4). All other results presented here use the 10% filter for both Cufflinks and StringTie (Supplementary Discussion and Supplementary Fig. 5).
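The per-gene abundance filter described above can be sketched roughly as follows. This illustrates the idea only and is not the code used by Cufflinks or StringTie. The function name `filter_low_abundance`, the inclusive comparison (`>=`) and the toy abundances are assumptions.

```python
from collections import defaultdict


def filter_low_abundance(abundance, gene_of, min_fraction=0.10):
    """Keep transcripts whose abundance is at least `min_fraction` of the
    most abundant isoform of the same gene (threshold computed per gene)."""
    top = defaultdict(float)                     # gene -> highest isoform abundance
    for tx, value in abundance.items():
        gene = gene_of[tx]
        top[gene] = max(top[gene], value)
    return {
        tx: value
        for tx, value in abundance.items()
        if value >= min_fraction * top[gene_of[tx]]
    }


# Hypothetical abundances for two genes.
abundance = {"g1.t1": 50.0, "g1.t2": 4.0, "g1.t3": 6.0, "g2.t1": 2.0}
gene_of = {"g1.t1": "g1", "g1.t2": "g1", "g1.t3": "g1", "g2.t1": "g2"}

kept_10 = filter_low_abundance(abundance, gene_of, min_fraction=0.10)
kept_05 = filter_low_abundance(abundance, gene_of, min_fraction=0.05)
print(sorted(kept_10))   # g1.t2 (4.0 < 5.0) is dropped at the 10% threshold
print(sorted(kept_05))   # all four transcripts survive at the 5% threshold
```

Because the threshold is relative to each gene's top isoform, the lowly expressed single-isoform gene `g2` is never filtered. Lowering the fraction keeps more isoforms, and the paper reports that this trades precision for sensitivity.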

Assembly of reads reproduces the exon-intron structure of genes, but we also need to estimate how much of each transcript was present in the original cells. To evaluate the transcript quantification performance of each program, we compared the estimated expression with the known amounts of each transcript in the simulated data. Quantification is measured by the number of pairs of reads (“fragments” where one or both ends of a fragment are sequenced), which are normalized based on the total number of fragments sequenced (measured in millions) and by the length of the transcript (measured in kilobases), giving an estimate measured as fragments per kilobase of transcript per million fragments (FPKM). With the exception of Scripture, all programs tested here use FPKM values to estimate transcript abundances. StringTie also reports a read per base coverage for each exon of a predicted transcript. Scripture produces RPKM values, which count reads instead of fragments. As has been pointed out previously (ref. 13), FPKM is preferable to RPKM in the case of paired-end RNA-seq experiments, where in some cases one of the two reads belonging to a fragment might be unmapped, possibly leading to underestimates of expression. We obtained very similar results whether using FPKM or RPKM values (Supplementary Table 2).
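As a worked example of the normalization just described, the sketch below computes FPKM for a single transcript. The function name `fpkm` and the numbers are illustrative assumptions. The formula is the standard one: fragments divided by transcript length in kilobases and by total sequenced fragments in millions.

```python
def fpkm(fragments, transcript_length_bp, total_fragments):
    """Fragments per kilobase of transcript per million sequenced fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript
# in a library of 20 million fragments.
value = fpkm(fragments=500, transcript_length_bp=2_000, total_fragments=20_000_000)
print(value)  # 12.5
```

The same 500 fragments on a transcript twice as long would give half the FPKM, which is the length correction that makes transcripts of different sizes comparable within a sample.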

We computed the Spearman correlation coefficient between the true and estimated expression levels for each set of transcripts. Specifically, we compared the expression level of each predicted transcript with the true transcript that it matched. The Spearman correlation first

[Figure 1a flowchart. Input: RNA-seq reads. StringTie+SR: assemble reads into super-reads; map super-reads and unassembled reads to genome. Other pipelines: map reads to genome. StringTie: build splice graph; build flow network for path of heaviest coverage; compute maximum flow to estimate abundance; update and repeat. Cufflinks: build overlap graph; compute minimum path cover to generate transcripts; maximum likelihood abundance estimation. Traph: build flow network on top of splice graph; compute minimum cost flow and decompose it into transcripts and their abundances. Output: transcripts and their abundances.]

Figure 1 Transcript assembly pipelines for StringTie, Cufflinks and Traph. (a) Overview of the flow of the StringTie algorithm, compared to Cufflinks and Traph. All methods begin with a set of RNA-seq reads that have been mapped to the genome. An optional secondary input to StringTie is a set of pre-assembled super-reads, designated as StringTie+SR. StringTie iteratively extracts the heaviest path from a splice graph, constructs a flow network, computes maximum flow to estimate abundance, and then updates the splice graph by removing reads that were assigned by the flow algorithm. This process repeats until all reads have been assigned. (b) Annotated transcript T for which read data covers only the fragments F1 and F2. An assembler is given credit for a correct reconstruction of T if it correctly assembles F1 and F2.
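To make the "heaviest path" step in the caption more concrete, here is a deliberately simplified sketch. It treats the splice graph as a directed acyclic graph whose nodes are exons with a coverage value, and returns the path with the largest summed coverage by dynamic programming. This is an assumption made for illustration. StringTie's actual path-extension heuristic, the flow network and the removal of assigned reads are not reproduced here, and the names `heaviest_path`, `coverage` and `edges` are hypothetical.

```python
def heaviest_path(coverage, edges):
    """Return the path with the largest summed node coverage in a DAG.

    coverage: dict node -> non-negative coverage, listed in topological
              order (an assumption of this sketch; dicts keep insertion order).
    edges:    dict node -> list of successor nodes.
    """
    best = {}  # node -> (score of best path ending here, predecessor)
    for node in coverage:
        # A node not reached from any predecessor starts its own path.
        best.setdefault(node, (coverage[node], None))
        for nxt in edges.get(node, []):
            candidate = best[node][0] + coverage[nxt]
            if nxt not in best or candidate > best[nxt][0]:
                best[nxt] = (candidate, node)

    # Backtrack from the node where the best-scoring path ends.
    node = max(best, key=lambda n: best[n][0])
    path = []
    while node is not None:
        path.append(node)
        node = best[node][1]
    return path[::-1]


# Toy splice graph (hypothetical): e2 and e3 are mutually exclusive exons.
coverage = {"e1": 10, "e2": 30, "e3": 5, "e4": 10}
edges = {"e1": ["e2", "e3"], "e2": ["e4"], "e3": ["e4"]}
path = heaviest_path(coverage, edges)
print(path)  # ['e1', 'e2', 'e4']
```

In the iterative scheme described in the caption, the reads explained by this path would then be removed and the search repeated on the updated graph, which in this toy case would leave the weaker isoform through `e3` for a later round.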

Testing differences

• “Normally” distributed data: t-test

• Count data: negative binomial distribution (e.g. DESeq2)

• Ratios: beta-binomial distribution (e.g. DRIMSeq)

• Multiple options: Dirichlet-multinomial distribution (e.g. DRIMSeq)

General steps for significance testing

• Estimate the effect size between two conditions.

• Estimate variability within groups (between replicates).

• Determine significance: is the observed effect size larger than would be expected from within-group variability alone under the null hypothesis?
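The steps above can be sketched with a small, dependency-free example. Instead of the t-test named on the previous slide, it uses an exact permutation test, swapped in here only to avoid external libraries. The effect size is the difference in group means, and the null distribution comes from every possible relabelling of the samples. The function name `permutation_test` and the expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference in means.

    Returns (observed effect size, p-value).
    """
    observed = mean(group_a) - mean(group_b)     # step 1: effect size
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))

    extreme = total = 0
    # Steps 2-3: the spread of differences under random relabelling shows
    # how large an effect arises from sample-to-sample variability alone.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in indices if i in chosen_set]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(mean(a) - mean(b)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / total


# Hypothetical log-expression values, four replicates per condition.
control = [5.1, 4.9, 5.3, 5.0]
treated = [6.2, 6.0, 6.4, 6.1]

effect, p_value = permutation_test(treated, control)
print(round(effect, 3), round(p_value, 4))
```

With four replicates per group there are only 70 possible relabellings, so the smallest attainable two-sided p-value is 2/70. This illustrates why small RNA-seq experiments usually rely on parametric models such as the negative binomial rather than on permutations.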

Summary

• Changes in transcript usage can have important consequences for diseases.

• Many different approaches to quantify transcript usage.

• Each has its own strengths and weaknesses; there is no ‘gold standard’ approach.
