denis-bauer
Abstract: The focus in this session will be put on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. The issue of transcript quantification and genomic variants that can be identified from RNAseq data will be discussed.
The Queensland Brain Institute |
RNAseq analysis: Transcript detection (1/2)
April 11, 2023
[by Joseph Robertson]
The Queensland Brain Institute | April 11, 2023
Quick recap: Production informatics
[Figure: Sequencing, Image, Fastq; followed by Quality Control and Projects]
• Sequencing -> Images -> Conversion (Demultiplexing)
• Resulting file type: FASTQ
  – “Having raw sequence reads and quality scores”
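To make the FASTQ point concrete, here is a minimal Python sketch that reads four-line FASTQ records and turns the quality string into numeric scores. It assumes the common Phred+33 encoding; the function name `parse_fastq` and the example record are made up for illustration.

```python
from io import StringIO

def parse_fastq(handle, offset=33):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream.

    Assumes Phred+33 quality encoding (an assumption; older files may use +64).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Each quality character encodes one score: ASCII code minus the offset
        yield header[1:], seq, [ord(c) - offset for c in qual]

# Made-up example record, for illustration only
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
```

Each record pairs a base call with one quality score per base, which is what the later quality-control and mapping steps consume.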
Objective & Challenges
• Objective: study the active transcriptome of the cell
• Problems:
  – The RNA content of a cell is dominated by tRNA, rRNA and housekeeping genes
  – The flowcell has only finite real estate, most of which would be occupied by these mainly invariable transcripts
• How to focus the sequencing on the “interesting” part of the transcriptome: mRNA and ncRNA ?
What RNAseq protocols are there?
• RNA seq
  – Total RNA, tRNA/rRNA removed + PolyA-tail filtered
  – Good for studying protein coding genes, e.g.
    • gene expression, isoforms, expression of variant alleles
    • RNA editing events
      – RNA-DNA differences in the human transcriptome provide a yet-unexplored aspect of genome variation.
• Small RNAseq:
  – Total RNA, size selection for small RNA molecules
  – Good for small ncRNA, e.g. miRNAs, snoRNA
• Duplex-specific thermostable nuclease (DSN) guided RNA seq normalization
  – Total RNA, highly abundant transcripts are digested
  – Good for studying all transcripts
Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011. PMID: 21596952.
Christodoulou DC, Gorham JM, Herman DS, Seidman JG. Construction of normalized RNA-seq libraries for next-generation sequencing using the crab duplex-specific nuclease. Curr Protoc Mol Biol. 2011. PMID: 21472699.
RNA-seq workflow
1. Select PolyA-tail + remove tRNA/rRNA
2. Fragment RNA
3. Make cDNA (caution: you may lose strand info)
4. Sequence
5. Map reads
6. Identify transcripts
7. Quantify transcripts
8. Identify differences between conditions
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008 PMID: 18516045.
Production Informatics and Bioinformatics
Product and time, per one-flowcell project:
• Basic Production Informatics: produce raw sequence reads
  – Product: fastq
  – Time: 5 days
• Advanced Production Inform.: map to genome and generate raw genomic features (e.g. SNPs)
  – Product: bam, vcf,…
  – Time: 3 weeks
• Bioinformatics Research: analyze the data; uncover the biological meaning
  – Product: paper
  – Time: >6 months
Challenges for RNAseq read mapping
• Losing reads because they do not match the ref. genome
  – Reads spanning exon junctions
  – RNA editing events
• Approaches
  – Align to ref. transcriptome library
  – Exon-first, e.g. Tophat
  – Seed-extend methods, e.g. GSNAP
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011. PMID: 21596952.
[Figure: DNA, gRNA and mRNA with an editing event, and the resulting sequencing reads]
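The junction problem, and why aligning to a reference transcriptome helps, can be shown with a toy example. The sequences below are invented, and exact substring search stands in for a real aligner, so this is only a sketch of the idea.

```python
# Toy reference (made up): two exons separated by an intron
exon1 = "ATGGCCAAGT"
intron = "GTAAGTTTTTCAG"
exon2 = "CTGGATCCAA"

genome = exon1 + intron + exon2   # what a DNA mapper searches
transcript = exon1 + exon2        # spliced mRNA, as in a transcriptome library

# A read that covers the last 5 bases of exon1 and the first 5 of exon2
read = "CAAGTCTGGA"

in_genome = read in genome          # False: the intron interrupts the match
in_transcript = read in transcript  # True: the junction is contiguous here
```

The read is perfectly valid, yet a genome-only search discards it. The catch of the transcriptome approach is that it can only recover junctions that are already present in the annotation.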
Exon-first approach
• Align reads to ref. genome
• Chop up unaligned reads and try to identify matching regions
• Find splice junctions around the matches
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
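The three steps above can be sketched in a few lines of Python. This is a simplified illustration with exact matching only, not how Tophat is implemented; the function name `exon_first_align`, the segment length and the toy sequences are all assumptions.

```python
def exon_first_align(read, genome, seg_len=5):
    """Toy exon-first alignment using exact string matching only."""
    # Step 1: try to place the whole read on the genome (unspliced)
    pos = genome.find(read)
    if pos != -1:
        return {"type": "unspliced", "start": pos}

    # Step 2: chop the unaligned read into end segments and anchor each one
    left, right = read[:seg_len], read[-seg_len:]
    left_pos = genome.find(left)
    right_pos = genome.find(right, left_pos + seg_len) if left_pos != -1 else -1
    if left_pos == -1 or right_pos == -1:
        return {"type": "unaligned"}

    # Step 3: search between the anchors for a split point that explains the read
    right_end = right_pos + seg_len
    for split in range(seg_len, len(read) - seg_len + 1):
        donor = left_pos + split                      # first intron base
        acceptor = right_end - (len(read) - split)    # first base of the next exon
        if (acceptor > donor
                and genome[left_pos:donor] == read[:split]
                and genome[acceptor:right_end] == read[split:]):
            return {"type": "spliced", "start": left_pos, "intron": (donor, acceptor)}
    return {"type": "unaligned"}

# Made-up genome: exon (0-9), intron (10-22), exon (23-32)
genome = "ATGGCCAAGT" + "GTAAGTTTTTCAG" + "CTGGATCCAA"
unspliced = exon_first_align("ATGGCC", genome)
spliced = exon_first_align("GCCAAGTCTGGATC", genome)
```

Most reads fall entirely within an exon and are resolved by the cheap first step; only the leftover reads go through the split search, which is where the method's efficiency comes from.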
Seed-extend approach
• Break reads in smaller k-mers and find matches
• Iteratively extend k-mers to identify exact spliced alignment
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
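A rough Python sketch of the seed-and-extend idea follows. It uses exact matches, a greedy left-to-right extension and invented sequences, so it illustrates the principle rather than GSNAP's actual algorithm; all function names are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def extend(read, genome, r_start, g_start):
    """Length of the exact match growing rightwards from the two start positions."""
    n = 0
    while (r_start + n < len(read) and g_start + n < len(genome)
           and read[r_start + n] == genome[g_start + n]):
        n += 1
    return n

def seed_extend_align(read, genome, k=4):
    """Return (read_start, genome_start, length) blocks covering the read, or None."""
    index = build_kmer_index(genome, k)
    blocks, r, min_g = [], 0, 0
    while r < len(read):
        best = None
        # Seed: look up the next k-mer of the read, only downstream of the last block
        for g in index.get(read[r:r + k], []):
            if g < min_g:
                continue
            length = extend(read, genome, r, g)  # extend the seed as far as it matches
            if best is None or length > best[2]:
                best = (r, g, length)
        if best is None:
            return None  # no seed found (also happens if fewer than k bases remain)
        blocks.append(best)
        r += best[2]
        min_g = best[1] + best[2]
    return blocks

# Made-up genome: exon (0-9), intron (10-22), exon (23-32)
genome = "ATGGCCAAGT" + "GTAAGTTTTTCAG" + "CTGGATCCAA"
blocks = seed_extend_align("GCCAAGTCTGGATC", genome)
```

A gap between the genome coordinates of consecutive blocks marks a candidate intron. Unlike the exon-first strategy, every read goes through the k-mer lookup, which is part of why seed-extend methods cost more compute.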
Which method ?
• Exon-first: less computationally intensive
• The additional exon-junctions found by seed-extend have not (yet) been demonstrated to be real.
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
Challenges for transcript detection
• Identifying isoforms is difficult
  – Transcript abundance is volatile
  – Most reads are not helpful (reads from exons) or even misleading (incompletely spliced precursor RNA)
  – Genes can have many isoforms
• Approaches
  – Ignore isoforms
  – Genome-guided reconstruction, e.g. Cufflinks
  – Genome-independent reconstruction, e.g. Trinity
[Figure: QBI data]
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
Genome-guided reconstruction
• Use reads spanning splice junctions to assemble the transcript paths
• Work out the minimal possible set of paths so that all reads are visited (graph theory)
• If more than one such set exists, use read counts to pick the most probable
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 PMID: 21623353.
[Figure: reads aligned to the genome and the resulting isoforms]
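As a simplified illustration of the graph view, the Python sketch below builds a splice graph from junction read counts, lists every exon path and ranks the paths by their weakest junction. This is deliberately cruder than the minimal-path-set computation described above and is not Cufflinks' algorithm; the exon names, the counts and the scoring rule are hypothetical.

```python
from collections import defaultdict

def enumerate_isoforms(junction_counts, first_exon, last_exon):
    """List every path from first_exon to last_exon in the splice graph.

    Assumes the graph is acyclic, i.e. junctions only point downstream.
    """
    graph = defaultdict(list)
    for src, dst in junction_counts:
        graph[src].append(dst)

    paths = []

    def walk(node, path):
        if node == last_exon:
            paths.append(path)
            return
        for nxt in sorted(graph[node]):
            walk(nxt, path + [nxt])

    walk(first_exon, [first_exon])
    return paths

def path_support(path, junction_counts):
    """Weakest junction along the path: a crude stand-in for isoform support."""
    return min(junction_counts[(a, b)] for a, b in zip(path, path[1:]))

# Hypothetical counts of reads spanning each exon-exon junction
junction_counts = {("E1", "E2"): 40, ("E2", "E3"): 38, ("E1", "E3"): 5}

isoforms = enumerate_isoforms(junction_counts, "E1", "E3")
ranked = sorted(isoforms, key=lambda p: path_support(p, junction_counts), reverse=True)
```

Here the full-length path E1-E2-E3 is well supported, while the exon-skipping path E1-E3 rests on only five reads. Whether it is a real isoform is exactly the kind of question that read counts alone cannot settle.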
Genome-independent reconstruction
• Break reads into k-mers and find their mutual overlaps to build a de Bruijn graph
• Find probable paths through the graph by using read counts
• Map consensus assembly to genome
Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011 PMID: 21747384.
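The first two steps can be sketched as follows. This is a toy Python version with a greedy walk that always follows the best-supported edge; it is not how Trinity works internally, and the reads (including one with a simulated sequencing error) are invented.

```python
from collections import Counter, defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds one count to the edge prefix -> suffix."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

def greedy_path(edges, start):
    """Follow the highest-count unused edge from each node until none remains."""
    out = defaultdict(list)
    for (src, dst), count in edges.items():
        out[src].append((count, dst))
    contig, node, used = start, start, set()
    while True:
        choices = [(c, d) for c, d in out[node] if (node, d) not in used]
        if not choices:
            return contig
        _, nxt = max(choices)       # edge with the most read support
        used.add((node, nxt))
        contig += nxt[-1]           # each step adds one base to the assembly
        node = nxt

# Made-up overlapping reads; the last one carries a simulated error (A -> T)
reads = ["ATGGCGTA", "GCGTACTT", "TACTTGA", "GCGTTCTT"]
edges = build_de_bruijn(reads, k=4)
contig = greedy_path(edges, "ATG")
```

The erroneous read creates a low-count side branch that the walk ignores. A real assembler has to keep well-supported branches, since those are the alternative isoforms, which is one reason de novo reconstruction is so expensive.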
Which method?
• De novo methods are very computationally intensive
• However, they are able to find alternative isoforms and promoters and structural variation
  – deletions (yellow)
  – chimeras (green)
Iyer MK, Chinnaiyan AM. RNA-Seq unleashed. Nat Biotechnol. 2011 PMID: 21747384.
What are real transcripts?
• Even the most sophisticated computational method can’t tell you what is a real transcript.
[Figure panels: QBI data; Roberts et al.]
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011 PMID: 21697122.
Solution: biological replicates
• Significant findings (here: new isoforms) in small sample sets can be due to
  – Technical errors
  – Biological variability
  – Population outliers
• Sequencing experiments are subject to the same issues (even though they are more expensive than arrays)
• Replicates are necessary to build confidence in your results!
Hansen KD, Wu Z, Irizarry RA, Leek JT. Sequencing technology does not eliminate biological variability. Nat Biotechnol. 2011 PMID: 21747377
Three things to remember
• Methods for analyzing RNAseq data are not as mature as expression array analysis tools yet.
• Identifying transcript isoforms is especially difficult.
• Replicates are crucial to account for biological variability.
Next Week:
Abstract: This session will follow up on transcript quantification of RNAseq data, discuss statistical means of identifying differentially regulated transcripts and isoforms, and contrast these with microarray analysis approaches.