Transcript detection in RNAseq

The Queensland Brain Institute

RNAseq analysis: Transcript detection (1/2)

April 11, 2023

[by Joseph Robertson]


Quick recap: Production informatics


• Sequencing->Images->Conversion (Demultiplexing)

• Resulting file type: FASTQ

• “Having raw sequence reads and quality scores”
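To make "raw sequence reads and quality scores" concrete, here is a minimal sketch of reading FASTQ records. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; the function name `read_fastq` and the example record are illustrative only.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores

# Made-up record for illustration.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```

Each base gets one quality character, so the sequence and the score list always have the same length.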

[Figure labels: Quality Control; Projects]


Objective & Challenges

• Objective: study the active transcriptome of the cell

• Problems:

– The RNA content of a cell is dominated by tRNA, rRNA and housekeeping genes

– The flowcell has only finite real estate, most of which would be occupied by these mainly invariable transcripts

• How to focus the sequencing on the “interesting” part of the transcriptome: mRNA and ncRNA?


What RNAseq protocols are there?

• RNA seq

– Total RNA -> tRNA/rRNA removed + PolyA-tail filtered

– Good for studying protein coding genes, e.g. gene expression, isoforms, expression of variant alleles

– Also good for RNA editing events: RNA-DNA differences in the human transcriptome provide a yet-unexplored aspect of genome variation.

• Small RNAseq

– Total RNA -> size selection for small RNA molecules

– Good for small ncRNA, e.g. miRNAs, snoRNA

• Duplex-specific thermostable nuclease (DSN) guided RNA seq normalization

– Total RNA -> highly abundant transcripts are digested

– Good for studying all transcripts

Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011 PMID: 21596952.

Christodoulou DC, Gorham JM, Herman DS, Seidman JG. Construction of normalized RNA-seq libraries for next-generation sequencing using the crab duplex-specific nuclease. Curr Protoc Mol Biol. 2011 PMID: 21472699.

Today


RNA-seq workflow

1. Select PolyA-tail + remove tRNA/rRNA

2. Fragment RNA

3. Make cDNA (caution: you may lose strand info)

4. Sequence

5. Map reads

6. Identify transcripts

7. Quantify transcripts

8. Identify differences between conditions

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 PMID: 18516045.

Today


Production Informatics and Bioinformatics

• Basic Production Informatics: produce raw sequence reads. Product: fastq. Time: 5 days

• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs). Product: bam, vcf,… Time: 3 weeks

• Bioinformatics Research: analyze the data; uncover the biological meaning. Product: paper. Time: >6 months

Times are per one-flowcell project.


Challenges for RNAseq read mapping

• Losing reads because they do not match the ref. genome

– Reads spanning exon junctions

– RNA editing events

• Approaches

– Align to ref. transcriptome library

– Exon-first, e.g. Tophat

– Seed-extend methods, e.g. GSNAP

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.

Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011 PMID: 21596952.

[Figure labels: DNA; gRNA; mRNA editing event; Sequencing reads]


Exon-first approach

• Align reads to ref. genome

• Chop up unaligned reads and try to identify matching regions

• Find splice junctions around the matches
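The three steps above can be sketched on a toy example. This is a simplified illustration under stated assumptions, not how Tophat is implemented: it uses exact string matching only, takes the first genomic hit for each anchor, and the names `align_exon_first` and `anchor` as well as the sequences are made up.

```python
def align_exon_first(read, genome, anchor=5):
    """Toy exon-first alignment: whole-read match first, then a split search."""
    # Step 1: try to place the whole read on the genome (unspliced alignment).
    start = genome.find(read)
    if start != -1:
        return {"type": "unspliced", "start": start}

    # Step 2: chop the unaligned read and place its two ends independently.
    n = len(read)
    left_pos = genome.find(read[:anchor])
    if left_pos == -1:
        return None
    right_pos = genome.find(read[-anchor:], left_pos + anchor)
    if right_pos == -1:
        return None
    right_end = right_pos + anchor

    # Step 3: slide the split point between the anchors to locate the junction.
    for split in range(anchor, n - anchor + 1):
        donor = left_pos + split              # first intron base
        acceptor = right_end - (n - split)    # first base after the intron
        if acceptor <= donor:
            continue
        if (genome[left_pos:donor] == read[:split]
                and genome[acceptor:right_end] == read[split:]):
            return {"type": "spliced", "start": left_pos, "intron": (donor, acceptor)}
    return None

# Hypothetical two-exon gene.
exon1, intron, exon2 = "ACGTTGCAAGCTTAGC", "GTAAGTCTCTCTTTCAG", "CATGGATCCGATTACA"
genome = exon1 + intron + exon2
junction_read = exon1[-8:] + exon2[:8]   # read spanning the exon junction
spliced = align_exon_first(junction_read, genome)
unspliced = align_exon_first(exon1[2:12], genome)
```

The read that lies within one exon is placed in step 1; only the junction-spanning read needs the more expensive split search, which is why the exon-first strategy is comparatively cheap.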

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.


Seed-extend approach

• Break reads into smaller k-mers and find matches

• Iteratively extend k-mers to identify exact spliced alignment
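A minimal sketch of the seed-extend idea, assuming exact matching and a toy genome. It is not GSNAP's algorithm; the function names, the choice of k = 4 and the sequences are illustrative assumptions.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def seed_extend(read, genome, k=4):
    """Return aligned blocks as (read_start, genome_start, length), or None."""
    index = build_kmer_index(genome, k)
    blocks = []
    read_pos, min_genome_pos = 0, 0
    while read_pos < len(read):
        # Seed: look up the next k-mer of the read downstream of the last block.
        seed = read[read_pos:read_pos + k]
        hits = [p for p in index.get(seed, []) if p >= min_genome_pos]
        if len(seed) < k or not hits:
            return None
        genome_pos = hits[0]  # toy choice: take the first downstream hit
        # Extend: grow the match base by base until the first mismatch.
        length = k
        while (read_pos + length < len(read)
               and genome_pos + length < len(genome)
               and read[read_pos + length] == genome[genome_pos + length]):
            length += 1
        blocks.append((read_pos, genome_pos, length))
        read_pos += length
        min_genome_pos = genome_pos + length
    return blocks

# Hypothetical two-exon gene and a read spanning its junction.
exon1, intron, exon2 = "ACGTTGCAAGCTTAGC", "GTAAGTCTCTCTTTCAG", "CATGGATCCGATTACA"
genome = exon1 + intron + exon2
blocks = seed_extend(exon1[-8:] + exon2[:8], genome)
```

Two blocks separated by a gap on the genome indicate a spliced alignment; the gap between them is the candidate intron.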



Which method?

• Exon-first: less computationally intensive

• The additional exon junctions found by seed-extend have not (yet) been demonstrated to be real.



Challenges for transcript detection

• Identifying isoforms is difficult

– Transcript abundance is volatile

– Most reads are not helpful (reads from exons) or even misleading (incompletely spliced precursor RNA)

– Genes can have many isoforms

• Approaches

– Ignore isoforms

– Genome-guided reconstruction, e.g. Cufflinks

– Genome-independent reconstruction, e.g. Trinity

QBI data

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.


Genome-guided reconstruction

• Use reads spanning splice junctions to assemble the transcript paths

• Work out the minimal set of paths such that all reads are covered (graph theory)

• If more than one minimal set exists, use read counts to pick the most probable one
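The steps above can be sketched by brute force on a small splice graph. This is an illustration only: it assumes an acyclic graph with exons as nodes and junction read counts as edge weights, it tries every combination of paths (feasible only for toy graphs), and the tie-break score is a simple heuristic, not the statistical model of Cufflinks. All names and counts are made up.

```python
from itertools import combinations

def all_paths(edges, source, sink):
    """Enumerate every source-to-sink path in an acyclic splice graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    paths, stack = [], [(source,)]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            paths.append(path)
            continue
        for nxt in graph.get(path[-1], []):
            stack.append(path + (nxt,))
    return paths

def path_edges(path):
    return set(zip(path, path[1:]))

def minimal_path_cover(junction_counts, source, sink):
    """Smallest set of paths using every observed junction; ties broken by read support."""
    paths = all_paths(junction_counts, source, sink)
    needed = set(junction_counts)

    def support(cover):
        # Heuristic: each path is only as well supported as its weakest junction.
        return sum(min(junction_counts[e] for e in path_edges(p)) for p in cover)

    for size in range(1, len(paths) + 1):
        covers = [c for c in combinations(paths, size)
                  if set().union(*(path_edges(p) for p in c)) == needed]
        if covers:
            return sorted(max(covers, key=support))
    return []

# Hypothetical junction read counts for a gene with two alternative regions.
junction_counts = {
    ("E1", "E2"): 40, ("E2", "E4"): 40, ("E1", "E3"): 10, ("E3", "E4"): 10,
    ("E4", "E5"): 38, ("E5", "E7"): 38, ("E4", "E6"): 12, ("E6", "E7"): 12,
}
isoforms = minimal_path_cover(junction_counts, "E1", "E7")
```

Here two different pairs of paths cover all junctions; the read counts favour pairing the two highly supported alternatives (E2 with E5) and the two weakly supported ones (E3 with E6).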


[Figure labels: Reads aligned to the genome; Isoforms]


Genome-independent reconstruction

• Break reads into k-mers and find their mutual overlaps to build a de Bruijn graph

• Find probable paths through the graph by using read counts

• Map consensus assembly to genome
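A minimal sketch of the first two steps, assuming error-free overlaps apart from one deliberately erroneous read. It is a toy greedy walk, not Trinity's algorithm, and it leaves out the final mapping of the assembly to the genome. The function names, k = 5 and the reads are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_de_bruijn(reads, k):
    """Count edges (k-1)-mer -> (k-1)-mer, one edge per k-mer occurrence in the reads."""
    edge_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edge_counts[(kmer[:-1], kmer[1:])] += 1
    return edge_counts

def greedy_contig(edge_counts, start):
    """Walk from the start node, always following the edge with the highest read count."""
    outgoing = defaultdict(list)
    for (a, b), count in edge_counts.items():
        outgoing[a].append((count, b))
    contig, node, seen = start, start, {start}
    while outgoing[node]:
        count, nxt = max(outgoing[node])
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Hypothetical transcript, overlapping reads, and one read with a sequencing error.
transcript = "ATGGCGTACCTTAGA"
reads = ["ATGGCGTAC", "GCGTACCTT", "CGTACCTTA", "TACCTTAGA", "GCGTACGTT"]
edge_counts = build_de_bruijn(reads, k=5)
contig = greedy_contig(edge_counts, start="ATGG")
```

The erroneous read creates a side branch in the graph, but that branch is supported by a single k-mer occurrence, so the count-guided walk follows the better supported path.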

Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011 PMID: 21747384.


Which method?

• De novo methods are very computationally intensive

• However, they are able to find alternative isoforms and promoters and structural variation

– deletions (yellow)

– chimeras (green)

Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011 PMID: 21747384.


What are real transcripts?

• Even the most sophisticated computational method can’t tell you what is a real transcript.

[Figure panels: QBI data; Roberts et al.]

Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011 PMID: 21697122.


Solution: biological replicates

• Significant findings (here: new isoforms) in small sample sets can be due to

– Technical errors

– Biological variability

– Population outliers

• Sequencing experiments are subject to the same issues (even though they are more expensive than arrays)

• Replicates are necessary to build confidence in your results!
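One simple way to use replicates when calling new isoforms is to keep only candidate junctions that are seen repeatedly. The sketch below is an assumption-laden illustration, not a published method: the thresholds `min_reads` and `min_replicates`, the function name and the coordinates are all made up.

```python
def reproducible_junctions(replicates, min_reads=2, min_replicates=2):
    """Keep junctions with at least min_reads reads in at least min_replicates replicates."""
    support = {}
    for counts in replicates:
        for junction, n_reads in counts.items():
            if n_reads >= min_reads:
                support[junction] = support.get(junction, 0) + 1
    return sorted(j for j, n in support.items() if n >= min_replicates)

# Hypothetical junction read counts, (chromosome, donor, acceptor) -> reads, per replicate.
replicates = [
    {("chr1", 1000, 2000): 12, ("chr1", 3000, 4500): 3},
    {("chr1", 1000, 2000): 9, ("chr1", 5000, 5600): 1},
    {("chr1", 1000, 2000): 15, ("chr1", 3000, 4500): 1},
]
confident = reproducible_junctions(replicates)
```

A junction that is well supported in only one replicate is dropped here, since it could reflect a technical error or an outlier rather than a real isoform.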

Hansen KD, Wu Z, Irizarry RA, Leek JT. Sequencing technology does not eliminate biological variability. Nat Biotechnol. 2011 PMID: 21747377.


Three things to remember

• Methods for analyzing RNAseq data are not yet as mature as expression array analysis tools.

• Identifying transcript isoforms is especially difficult.

• Replicates are crucial to account for the biological variability.


Next Week:

Abstract: This session will follow up with transcript quantification of RNAseq data, discuss statistical means of identifying differentially regulated transcripts and isoforms, and contrast these with microarray analysis approaches.
