The Queensland Brain Institute | RNAseq analysis: Transcript detection (1/2) What is a jar ? 6/18/22 [by Joseph Robertson]

Transcript detection in RNAseq

Abstract: The focus in this session will be put on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. The issue of transcript quantification and genomic variants that can be identified from RNAseq data will be discussed.

Page 1: Transcript detection in RNAseq

The Queensland Brain Institute |

RNAseq analysis: Transcript detection (1/2) What is a jar ?

April 11, 2023

[by Joseph Robertson]

Page 2: Transcript detection in RNAseq

Quick recap: Production informatics

• Sequencing -> Images -> Conversion (demultiplexing)

• Resulting file type: FASTQ

• “Having raw sequence reads and quality scores”

[Figure labels: Sequencing, Image, Fastq, Quality Control, Projects]
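
To make the FASTQ file type concrete, here is a minimal Python sketch that reads four-line FASTQ records and decodes the quality string into per-base Phred scores. It assumes the common Phred+33 encoding and unwrapped (single-line) sequences. The names `FastqRecord` and `read_fastq` and the example record are illustrative and not taken from the slides.

```python
from io import StringIO
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumption: qualities are ASCII-encoded with the given offset (Phred+33).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Invented example record: four bases with decreasing quality.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```

Each record keeps the raw sequence next to its quality scores, which is what the later quality control and mapping steps consume.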

Page 3: Transcript detection in RNAseq

Objective & Challenges

• Objective: study the active transcriptome of the cell

• Problems:
– The RNA content of a cell is dominated by tRNA, rRNA and housekeeping genes
– The flowcell has only finite real estate, most of which would be occupied by these mainly invariable transcripts

• How can the sequencing be focused on the “interesting” part of the transcriptome: mRNA and ncRNA?

Page 4: Transcript detection in RNAseq

What RNAseq protocols are there?

• RNA seq
– Total RNA: tRNA/rRNA removed + polyA-tail filtered
– Good for studying protein-coding genes, e.g.
  • gene expression, isoforms, expression of variant alleles
  • RNA editing events
    – RNA-DNA differences in the human transcriptome provide a yet-unexplored aspect of genome variation.

• Small RNAseq
– Total RNA: size selection for small RNA molecules
– Good for small ncRNA, e.g. miRNAs, snoRNA

• Duplex-specific thermostable nuclease (DSN) guided RNA seq normalization
– Total RNA: highly abundant transcripts are digested
– Good for studying all transcripts

Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011. PMID: 21596952.

Christodoulou DC, Gorham JM, Herman DS, Seidman JG. Construction of normalized RNA-seq libraries for next-generation sequencing using the crab duplex-specific nuclease. Curr Protoc Mol Biol. 2011. PMID: 21472699.

Page 5: Transcript detection in RNAseq

RNA-seq workflow

1. Select PolyA-tail + remove tRNA/rRNA
2. Fragment RNA
3. Make cDNA (caution: you may lose strand info)
4. Sequence
5. Map reads
6. Identify transcripts
7. Quantify transcripts
8. Identify differences between conditions

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 PMID: 18516045.

Page 6: Transcript detection in RNAseq

Production Informatics and Bioinformatics

Product       Time
fastq         5 days
bam, vcf,…    3 weeks
paper         >6 months

(Per one-flowcell project)

• Basic Production Informatics: produce raw sequence reads

• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)

• Bioinformatics Research: analyze the data; uncover the biological meaning

Page 7: Transcript detection in RNAseq

Challenges for RNAseq read mapping

• Losing reads because they do not match the ref. genome
– Reads spanning exon junctions
– RNA editing events

• Approaches
– Align to ref. transcriptome library
– Exon-first, e.g. Tophat
– Seed-extend methods, e.g. GSNAP

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353.

Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011. PMID: 21596952.

[Figure labels: DNA, gRNA, mRNA, editing event, sequencing reads]

Page 8: Transcript detection in RNAseq

Exon-first approach

• Align reads to ref. genome

• Chop up unaligned reads and try to identify matching regions

• Find splice junctions around the matches

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
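
The three steps above can be sketched on a toy example. This is a simplified Python illustration using exact string matching, not how Tophat is implemented. The sequences, the segment length and the function names (`align_exact`, `find_junction`) are all assumptions made up for this sketch.

```python
from typing import Optional, Tuple


def align_exact(read: str, genome: str) -> Optional[int]:
    """Step 1: try to place the whole read on the genome without gaps."""
    pos = genome.find(read)
    return pos if pos >= 0 else None


def find_junction(read: str, genome: str, seg_len: int) -> Optional[Tuple[int, int, int]]:
    """Steps 2-3: anchor both read ends, then locate the junction between them.

    Returns (read_start, intron_start, intron_end) in 0-based, end-exclusive
    genome coordinates, or None if no consistent split is found.
    """
    # Chop the unaligned read: use its first and last seg_len bases as anchors.
    left_pos = genome.find(read[:seg_len])
    if left_pos < 0:
        return None
    right_pos = genome.find(read[-seg_len:], left_pos + seg_len)
    if right_pos < 0:
        return None
    right_end = right_pos + seg_len
    # Try every split point; keep the first one where both parts match exactly.
    for split in range(seg_len, len(read) - seg_len + 1):
        suffix_start = right_end - (len(read) - split)
        if (genome.startswith(read[:split], left_pos)
                and genome.startswith(read[split:], suffix_start)
                and suffix_start > left_pos + split):
            return left_pos, left_pos + split, suffix_start
    return None


# Invented toy gene: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCCATTC", "GTAAGTCTCTCTCTTTCAG", "TACGGATCCA"
genome = exon1 + intron + exon2
read = exon1[-6:] + exon2[:6]  # a read spanning the exon junction

unspliced = align_exact(read, genome)
junction = find_junction(read, genome, seg_len=4)
```

The sketch returns the first consistent split and tolerates no mismatches. A real aligner also has to handle sequencing errors and ambiguous junction positions.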

Page 9: Transcript detection in RNAseq

Seed-extend approach

• Break reads into smaller k-mers and find matches

• Iteratively extend k-mers to identify exact spliced alignment

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
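
As a rough illustration of the seed-extend idea, the following Python sketch indexes all k-mers of a toy genome, looks up each k-mer of a read, and extends every hit base by base in both directions. It is a simplification under stated assumptions (exact matches only, invented sequences, hypothetical function names) and does not reproduce GSNAP or any other specific tool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the genome to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def extend_seed(read: str, genome: str, r: int, g: int, k: int) -> Tuple[int, int, int]:
    """Grow the exact match read[r:r+k] == genome[g:g+k] in both directions."""
    r_start, g_start = r, g
    while r_start > 0 and g_start > 0 and read[r_start - 1] == genome[g_start - 1]:
        r_start -= 1
        g_start -= 1
    r_end, g_end = r + k, g + k
    while r_end < len(read) and g_end < len(genome) and read[r_end] == genome[g_end]:
        r_end += 1
        g_end += 1
    return r_start, r_end, g_start


def seed_extend(read: str, genome: str, k: int) -> List[Tuple[int, int, int]]:
    """Return sorted blocks (read_start, read_end, genome_start) of exact matches."""
    index = build_kmer_index(genome, k)
    blocks = set()
    for r in range(len(read) - k + 1):
        for g in index.get(read[r:r + k], []):
            blocks.add(extend_seed(read, genome, r, g, k))
    return sorted(blocks)


# Invented toy gene: exon, intron, exon; the read spans the junction.
genome = "ATGGCCATTC" + "GTAAGTCTCTCTCTTTCAG" + "TACGGATCCA"
read = "CCATTCTACGGA"
blocks = seed_extend(read, genome, k=4)
```

For this read the seeds collapse into two blocks that together cover the whole read. The gap between their genome positions (10 to 29) is the implied intron.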

Page 10: Transcript detection in RNAseq

Which method?

• Exon-first: less computationally intensive

• The additional exon junctions found by seed-extend have not (yet) been demonstrated to be real.

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.

Page 11: Transcript detection in RNAseq

Challenges for transcript detection

• Identifying isoforms is difficult
– Transcript abundance is volatile
– Most reads are not helpful (reads from exons) or even misleading (incompletely spliced precursor RNA)
– Genes can have many isoforms

• Approaches
– Ignore isoforms
– Genome-guided reconstruction, e.g. Cufflinks
– Genome-independent reconstruction, e.g. Trinity

[Figure: QBI data]

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011. PMID: 21623353.

Page 12: Transcript detection in RNAseq

Genome-guided reconstruction

• Use reads spanning splice junctions to assemble the transcript path

• Work out the minimal possible set of paths so that all reads are visited (graph theory)

• If there is more than one such set, use read counts to pick the most probable

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.

[Figure labels: Reads aligned to the genome; Isoforms]
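
The "minimal set of paths" step can be illustrated as a minimum path cover over mutually compatible reads, solved here with bipartite matching. This Python sketch is a strong simplification and not the Cufflinks implementation. It assumes each spliced read is given as the ordered exon indices it covers, it uses a hypothetical compatibility rule (`compatible`), and it ignores the read-count tie-breaking in the last bullet.

```python
from typing import Dict, List, Set, Tuple

Read = Tuple[int, ...]  # ordered exon indices covered by one spliced read


def compatible(a: Read, b: Read) -> bool:
    """Reads agree if they use the same exons wherever their spans overlap."""
    lo, hi = max(a[0], b[0]), min(a[-1], b[-1])
    return {e for e in a if lo <= e <= hi} == {e for e in b if lo <= e <= hi}


def minimum_path_cover(reads: List[Read]) -> List[List[Read]]:
    """Cover all reads with as few chains of compatible reads as possible."""
    ordered = sorted(set(reads), key=lambda r: (r[0], r[-1]))
    n = len(ordered)
    # Edge i -> j: read j comes later in the ordering and is compatible with read i.
    succ = {i: [j for j in range(i + 1, n) if compatible(ordered[i], ordered[j])]
            for i in range(n)}
    match_next: Dict[int, int] = {}
    match_prev: Dict[int, int] = {}

    def try_augment(i: int, seen: Set[int]) -> bool:
        # Standard augmenting-path search for maximum bipartite matching.
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_prev or try_augment(match_prev[j], seen):
                match_prev[j] = i
                match_next[i] = j
                return True
        return False

    for i in range(n):
        try_augment(i, set())

    paths = []
    for start in range(n):
        if start in match_prev:  # not the first read of a path
            continue
        node, path = start, [ordered[start]]
        while node in match_next:
            node = match_next[node]
            path.append(ordered[node])
        paths.append(path)
    return paths


# Invented example: exons 0..3, with exons 1 and 2 used alternatively.
junction_reads = [(0, 1), (1, 3), (0, 2), (2, 3)]
paths = minimum_path_cover(junction_reads)
isoforms = [sorted({exon for read in path for exon in read}) for path in paths]
```

Each path groups reads that could come from the same transcript, so in this toy case the four junction reads collapse into two candidate isoforms.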

Page 13: Transcript detection in RNAseq

Genome-independent reconstruction

• Break reads into k-mers and find their mutual overlaps to build a de Bruijn graph

• Find probable paths through the graph by using read counts

• Map consensus assembly to genome

Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011 PMID: 21747384.
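
The first two bullets can be sketched in a few lines of Python: build a de Bruijn graph whose edges are the k-mers of the reads, count how often each edge is seen, and follow the best-supported edges to obtain a contig. This is a toy illustration with invented reads and hypothetical function names, not a description of how Trinity works internally. The final step, mapping the assembly back to the genome, is left out.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Counter]:
    """Nodes are (k-1)-mers; each k-mer adds one count to the edge prefix -> suffix."""
    graph: Dict[str, Counter] = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def greedy_path(graph: Dict[str, Counter], start: str) -> str:
    """Follow the highest-count edge from `start` until a dead end or a repeat."""
    contig, node, seen = start, start, {start}
    while graph.get(node):
        best_next, _count = graph[node].most_common(1)[0]
        if best_next in seen:  # avoid looping forever on cycles
            break
        contig += best_next[-1]
        seen.add(best_next)
        node = best_next
    return contig


# Invented reads from one transcript; the last read carries a single-base error.
reads = ["ATGGCG", "GGCGTA", "GGCGTA", "CGTACC", "GGCGTT"]
graph = build_de_bruijn(reads, k=4)
contig = greedy_path(graph, start="ATG")
```

At the node `CGT` the graph branches, and the path supported by three k-mer observations wins over the branch seen once. That is the role read counts play in picking probable paths.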

Page 14: Transcript detection in RNAseq

Which method?

• De novo methods are very computationally intensive

• However, they are able to find alternative isoforms and promoters and structural variation
– deletions (yellow)
– chimeras (green)

Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011 PMID: 21747384.

Page 15: Transcript detection in RNAseq

What are real transcripts?

• Even the most sophisticated computational method can’t tell you what is a real transcript.

[Figure panels: QBI data; Roberts et al.]

Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011. PMID: 21697122.

Page 16: Transcript detection in RNAseq

Solution: biological replicates

• Significant findings (here: new isoforms) in small sample sets can be due to
– Technical errors
– Biological variability
– Population outliers

• Sequencing experiments are subject to the same issues (even though they are more expensive than arrays)

• Replicates are necessary to build confidence in your results!

Hansen KD, Wu Z, Irizarry RA, Leek JT. Sequencing technology does not eliminate biological variability. Nat Biotechnol. 2011 PMID: 21747377

Page 17: Transcript detection in RNAseq

Three things to remember

• Methods for analyzing RNAseq data are not yet as mature as expression array analysis tools.

• Identifying transcript isoforms is especially difficult.

• Replicates are crucial to account for biological variability.

Page 18: Transcript detection in RNAseq

Next Week:

Abstract: This session will follow on from transcript quantification of RNAseq data, discuss statistical means of identifying differentially regulated transcripts and isoforms, and contrast these with microarray analysis approaches.