RNA SEQUENCING AND DATA ANALYSIS

Length of mRNA transcripts in the human genome

[Figure: distribution of mRNA transcript lengths in the human genome; axis ticks run from 0 to 10,000 and from 0 to 5,000, with a second panel covering 0 to 800]

Insert size ~ 200bp

Overview of RNA sequencing protocol

Fwd read Reverse read

Insert

SEQUENCING

Read length: 48-76bp

Sequencing parameters

Read Depth

Minimum mapped reads: 10 million for quantitative analysis of mammalian transcriptome

More reads needed for splicing variant discovery and differential comparison among samples

Current output: 120-180 million raw reads / lane

Multiplex level: 4-12 libraries / lane recommended
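
The lane output and multiplexing figures above imply a per-library read budget. A minimal sketch of that arithmetic, assuming reads are split evenly across the libraries on a lane (an idealization; the function name is illustrative and not part of any tool):

```python
def reads_per_library(lane_reads_millions: float, libraries_per_lane: int) -> float:
    """Raw reads (in millions) per library if the lane is shared evenly."""
    if libraries_per_lane <= 0:
        raise ValueError("libraries_per_lane must be positive")
    return lane_reads_millions / libraries_per_lane


# Ranges quoted above: 120-180 million raw reads per lane, 4-12 libraries per lane.
lowest = reads_per_library(120, 12)   # lowest lane output, highest multiplexing
highest = reads_per_library(180, 4)   # highest lane output, lowest multiplexing
print(f"{lowest:.0f}-{highest:.0f} million raw reads per library")
```

These are raw reads, while the 10 million minimum is stated for mapped reads, so the most heavily multiplexed case leaves no margin for reads that fail to map.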

All RNA is not the same

Types of RNA:

Messenger RNA

Micro RNA

Long non-coding RNA

Ribosomal RNA

Methods for RNA enrichment prior to library construction

Poly(A)-RNA selection: by hybridization to oligo-dT beads; mature mRNA is highly enriched; efficient for quantification of gene expression levels; limitation: 3’ bias correlating with RNA degradation

rRNA depletion: by hybridization to bead-bound rRNA probes; rRNA sequence-dependent and species-specific; all non-rRNA is retained, including premature mRNA and long non-coding RNA

Small RNA extraction: specific kits are required to retain small RNA; optional fine size-selection by gel or column

Different methods capture different types of RNA

Columns: Poly(A)-RNA selection, rRNA depletion, Small RNA extraction

Messenger RNA X X

Micro RNA X X

Long non-coding RNA X

Ribosomal RNA X

Paraffin embedded vs fresh frozen

Fresh Frozen

READ QUALITY

First step: alignment

Or: assembly, then alignment

Alignment versus assembly

Assembly

Trinity, Cufflinks, ABySS

Particularly useful when no reference genome is available, like in bacterial transcriptomes

Alignment

Bowtie, BWA, MOSAIK

Maximum sensitivity, fewer false positives

RNA sequencing applications

Quantification of transcript expression levels

Detection of splice variation/different isoforms of the same gene

Allele specific expression levels

Strand specific expression levels

Detection of fusion transcripts (such as BCR-ABL in CML)

Detection of sequence variation (limited application)

Validation of DNA sequence variants

RNA-seq expression levels are linear where microarrays get saturated or are insensitive

Expression is measured as ‘reads per kilobase per million’ (RPKM) or ‘fragments per kilobase of exon per million fragments mapped’ (FPKM) to normalize for gene length and library size
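
The normalization can be written out as a short calculation. This is a sketch of the standard RPKM definition (reads per kilobase of transcript per million mapped reads); the function name and the example numbers are illustrative only:

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    Dividing by gene length (in kb) corrects for longer genes collecting
    more reads; dividing by library size (in millions) corrects for depth.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

FPKM uses the same formula with fragments (read pairs) counted in place of individual reads.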

In GBM, the gene EGFR is frequently targeted by intragenic deletions

vIII deletion occurs in same domain as point mutations

Detecting EGFR transcript variants using RNA-seq data

SpliceSeq can detect splice variants http://bioinformatics.mdanderson.org/main/SpliceSeq:Overview

Allele-/Strand-specific RNA-seq

Haplotype specific gene expression by computationally integrating RNAseq with DNA SNP data

Strand-specific RNA-seq requires specific library preparation protocol

Costs more

Output more accurate, useful for analysis in absence of a reference genome

Identification of fusion transcripts

Popular methods search for

Read pairs that map to two different genes

Need to correct for gene homology

Reads that span fusion junction

Split reads in half and align the halves separately

Make a database of all possible fusion junctions and align full reads

PRADA, MapSplice, TopHat

http://sourceforge.net/projects/prada/

FGFR3-TACC3 fusion in GBM is the result of a local inversion

FGFR3-TACC3

Fusion transcripts are often associated with copy number difference and genomic breakpoints

Copy number profile of two FGFR3-TACC3 cases in TCGA

FGFR3-TACC3

6.4% of GBM harbors transcript fusions involving EGFR

All fusions fall within the area of the EGFR amplification

PRADA pipeline overview (diagram)

INPUTS: .fastq files [END1 & END2]; Config.txt [location of scripts and reference files]

Processing Module: Read Alignment; Remap alignments; Combine two ends; Quality Scores Recalibrated

Preprocessing: .bam file [PAIRED END]

Expression & QC Module [YES || NO || ONLY]: RNA-SeQC

Fusion Module [YES || NO || ONLY]

GUESS-ft [YES || NO || ONLY]: -geneA -geneB

OUTPUTS: RPKM & QC metrics; Fusion Candidates; Supervised search evidence

Fusion Module

Discordant read pair: each end of the read pair maps uniquely to a distinct protein-coding gene (Gene A and Gene B).

Fusion spanning read: a chimeric read that maps to a putative junction, while the mate read maps to either Gene A or Gene B.
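
The discordant read pair definition can be expressed as a simple predicate. This is a minimal sketch, not PRADA's code: the `ReadEnd` record and its fields are hypothetical stand-ins for whatever an aligner reports about each end of a pair.

```python
from typing import NamedTuple, Optional


class ReadEnd(NamedTuple):
    """Hypothetical summary of one end of a read pair."""
    gene: Optional[str]  # protein-coding gene the end maps to, or None
    unique: bool         # True if the end has a single placement


def is_discordant_pair(end1: ReadEnd, end2: ReadEnd) -> bool:
    """True when both ends map uniquely, each to a different gene."""
    both_mapped = end1.gene is not None and end2.gene is not None
    both_unique = end1.unique and end2.unique
    return both_mapped and both_unique and end1.gene != end2.gene


print(is_discordant_pair(ReadEnd("FGFR3", True), ReadEnd("TACC3", True)))  # True
print(is_discordant_pair(ReadEnd("FGFR3", True), ReadEnd("FGFR3", True)))  # False
```

Requiring unique placement on both ends is what keeps multi-mapping reads from homologous genes out of the candidate list.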

Structural transcript variants in low grade glioma

RNA-seq data from 272 TCGA low grade glioma

Fusion detection accuracy affected by:

PRADA detected 1,843 fusion transcripts

[Figure: number of mapped reads per sample versus number of fusion transcripts detected per sample]

Filtering out artifacts

Homology E value larger than 0.01 (column Evalue)

No mismatches in junction spanning reads

Count the number of partner genes for each individual gene

Identify genes with fusions mapping to more than 10 different chromosome arms

970/1,843 fusions filtered
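
The filtering criteria above can be sketched as follows. This is an illustration under stated assumptions, not PRADA's implementation: the dictionary keys are hypothetical, the criteria are read as conditions a candidate must meet to be kept, and the chromosome-arm rule is interpreted as dropping every fusion that involves a gene whose partners lie on more than 10 different arms. The slide gives no cutoff for the partner-gene count, so that step is not applied here.

```python
from collections import defaultdict


def filter_candidates(candidates, min_evalue=0.01, max_arms=10):
    """Return the fusion candidates that pass the artifact filters.

    Each candidate is a dict with hypothetical keys: gene_a, gene_b,
    arm_a, arm_b, evalue, jsr_mismatches.
    """
    # Collect, for every gene, the chromosome arms of its fusion partners.
    partner_arms = defaultdict(set)
    for c in candidates:
        partner_arms[c["gene_a"]].add(c["arm_b"])
        partner_arms[c["gene_b"]].add(c["arm_a"])
    promiscuous = {g for g, arms in partner_arms.items() if len(arms) > max_arms}

    kept = []
    for c in candidates:
        if c["evalue"] <= min_evalue:      # partners too homologous
            continue
        if c["jsr_mismatches"] > 0:        # junction spanning reads must match perfectly
            continue
        if c["gene_a"] in promiscuous or c["gene_b"] in promiscuous:
            continue
        kept.append(c)
    return kept
```

Computing the promiscuous-gene set before filtering means a gene is judged on all of its raw candidates, not only on those that survive the other two checks.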

Validation of predicted transcript fusions

509/970 fusions filtered

Define four tiers of fusion transcripts based on evidence

Tier 1: At least 3 discordant read pairs (DSP), two perfect match junction spanning reads (JSR), and both partner genes only fused to one other partner gene in the same sample

Tier 2: At least 2 DSP and 1 JSR, with a DNA breakpoint within 100kb window

Use matching DNA copy number profile

Tier 3: At least 2 DSP and 1 JSR, unique partner genes, with predicted junction consistent for all

Tier 4: The rest
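
The tier definitions translate directly into a cascade of checks. A minimal sketch, assuming the tiers are evaluated in order and the first match wins; the parameter names are hypothetical and not taken from PRADA output:

```python
def assign_tier(*, dsp: int, jsr: int, perfect_jsr: int,
                unique_partners: bool, dna_breakpoint_within_100kb: bool,
                junction_consistent: bool) -> int:
    """Assign an evidence tier (1 = strongest) to a fusion candidate.

    dsp: discordant read pairs; jsr: junction spanning reads;
    perfect_jsr: junction spanning reads with no mismatches.
    """
    if dsp >= 3 and perfect_jsr >= 2 and unique_partners:
        return 1
    if dsp >= 2 and jsr >= 1 and dna_breakpoint_within_100kb:
        return 2
    if dsp >= 2 and jsr >= 1 and unique_partners and junction_consistent:
        return 3
    return 4


print(assign_tier(dsp=5, jsr=3, perfect_jsr=2, unique_partners=True,
                  dna_breakpoint_within_100kb=False, junction_consistent=True))  # 1
```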

Validation of RNA fusions using output of BreakDancer

BreakDancer detects DNA rearrangements in low pass sequencing data

Variant detection

Approximately 30% of mutations are covered sufficiently to be detected at a validation rate of ~80% (from the TCGA clear cell renal cell carcinoma project).

The reverse transcriptase step that converts RNA to cDNA complicates detection of RNA edits and mutations.

RNA sequencing read alignment in PRADA

Transcripts from same gene

Reads are aligned to all possible transcripts

Reads are also aligned to genome

The final, single placement for each read is determined by re-mapping

PRADA alignments – advantages versus disadvantages

Advantage:

Alignment to DNA means mapping of unannotated transcripts

Alignment to transcriptome means mapping across exon-exon junctions

Disadvantage:

More conservative alignment than split-read

PRADA focuses on the analysis of paired-end RNA-sequencing data.

Four modules:

1. Processing

2. Expression and Quality Control

3. Gene fusion

4. GUESS-ft: General User dEfined Supervised Search for fusion transcripts

http://sourceforge.net/projects/prada/

Expression & QC Module

RNA-SeQC provides three types of quality control metrics: Read Counts, Coverage, Correlation

RPKM values at transcript level, for the longest transcript

RNA-SeQC process (Java)

Implementation Results

Samples processed: >400 KIRC, >170 GBM

Works well in MDACC HPC* system

PRADA-fusion module validation rate ~85% (53 out of 62)

RNA sequencing in The Cancer Genome Atlas

mRNA: poly-A mRNA purified from total RNA using poly-T oligo-attached magnetic beads

miRNA: Total RNA is mixed with oligo(dT) MicroBeads and loaded into MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.

Detecting fusion transcripts in GBM

KIRC fusion results

We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.

We identified 80 bona fide fusion transcripts (57 intrachromosomal, 33 interchromosomal) in 62 individual samples

“Recurrent” fusions:

SFPQ-TFE3 (n=5, chr1-chrX)

DHX33-NLRP1 (n=2, chr17)

TRIP12-SLC16A14 (n=2, chr2)

TFG-GPR128 (n=4, chr3)

KIRC fusion validation

Sample ID                     5’ Gene   3’ Gene   Discordant Read Pairs  Fusion Span Reads  Fusion Junction(s)  5’ Gene Chr  3’ Gene Chr  Validated?
TCGA-AK-3456-01A-02R-1325-07  TFE3      SFPQ      175                    129                1                   chrX         chr1         Yes
TCGA-AK-3456-01A-02R-1325-07  SFPQ      TFE3      116                    81                 1                   chr1         chrX         Yes
TCGA-A3-3313-01A-02R-1325-07  C6orf106  LRRC1     90                     40                 2                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  CYP39A1   LEMD2     37                     9                  1                   chr6         chr6         Yes
TCGA-B2-4101-01A-02R-1277-07  FAM172A   FHIT      17                     4                  1                   chr5         chr3         Yes
TCGA-AK-3445-01A-02R-1277-07  KIAA0802  LRRC41    14                     6                  1                   chr18        chr1         Yes
TCGA-B0-5095-01A-01R-1420-07  GORASP2   WIPF1     14                     2                  1                   chr2         chr2         Yes
TCGA-A3-3313-01A-02R-1325-07  ZNF193    MRPS18A   11                     3                  1                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  FTSJD2    GPX6      9                      8                  1                   chr6         chr6         Yes
TCGA-B0-4945-01A-01R-1420-07  KIAA0427  GRM4      8                      5                  1                   chr18        chr6         No
TCGA-B8-4143-01A-01R-1188-07  SLC36A1   TTC37     5                      5                  1                   chr5         chr5         No

PRADA-fusion module validation rate: ~85% (11 out of 13), by RT-PCR and FISH assays

TFE3-SFPQ was validated in three individual samples

KIRC fusion validation: RT-PCR

FAM172A-FHIT

Figure 2. RT-PCR results for TFE3 fusion validations for sample TCGA-AK-3456, shown for SFPQ-TFE3 and for TFE3-SFPQ, each with panels (a) and (b).

TFG-GPR128 has been reported in other cancers

TCGA has 1,000s of RNA-seq samples. How can we quickly scan many samples for the presence of this fusion?

Supervised Search Module

GUESS-ft: General User dEfined Supervised Search for fusion transcripts

Flowchart boxes: BAM, GUESS-ft, Mapped to A or B, A-B discordant reads, Unmapped reads, Junction DB, Junction spanning reads, Summary report

Use high quality mapping reads only; check that read orientation fulfills the fusion schema; allow up to one mismatch

A-B discordant reads: the two read ends map to A and B respectively

Unmapped reads: parse unmapped reads whose other end maps to A or B

Junction DB: map the parsed reads to a DB of all possible exon junctions

Junction spanning reads: list reads with one end mapping to a junction and the other mapping to A or B

Time consuming step
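
The "DB of all possible exon junctions" idea can be illustrated with a toy example. This is a sketch only, assuming exon sequences are already available as strings and using a naive scan in place of a real aligner; the function names, the `flank` and `min_overhang` parameters, and the toy sequences are all hypothetical.

```python
from itertools import product


def build_junction_db(exons_a, exons_b, flank):
    """Join the 3' end of every gene A exon to the 5' start of every gene B exon.

    Returns {(i, j): (junction_sequence, junction_position)}.
    """
    db = {}
    for (i, ea), (j, eb) in product(enumerate(exons_a), enumerate(exons_b)):
        left = ea[-flank:]
        db[(i, j)] = (left + eb[:flank], len(left))
    return db


def find_junction_hits(read, db, min_overhang=3, max_mismatches=1):
    """List junctions the read spans with at most `max_mismatches` mismatches."""
    hits = []
    for key, (seq, pos) in db.items():
        for start in range(len(seq) - len(read) + 1):
            end = start + len(read)
            # The read must cover at least `min_overhang` bases on each side.
            if start > pos - min_overhang or end < pos + min_overhang:
                continue
            mismatches = sum(x != y for x, y in zip(seq[start:end], read))
            if mismatches <= max_mismatches:
                hits.append(key)
                break
    return hits


db = build_junction_db(["AAAACCCC", "GGGGTTTT"], ["ACGTACGT", "TTTTGGGG"], flank=6)
print(find_junction_hits("CCCCACGT", db))  # [(0, 0)]
```

The number of junctions grows with the product of the exon counts of the two genes, which is one reason to restrict the search to a user-specified gene pair and to reads whose mate already maps to A or B.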

Identification of TFG-GPR128 fusion

All available normal samples in CGHub

Subset of tumor samples selected based on RPKM expression pattern

Table. Samples across cancer types

Cancer Type                                     # of normal samples   # of tumor samples
Bladder Urothelial Carcinoma [BLCA]             0 (0%)                2 (3.6%)
Breast invasive carcinoma [BRCA]                1 (0.94%)             13 (1.6%)
Head and Neck squamous cell carcinoma [HNSC]    0 (0%)                6 (2.3%)
Kidney renal clear cell carcinoma [KIRC]        1 (1.5%)              5 (1.2%)
Kidney renal papillary cell carcinoma [KIRP]    0 (0%)                1 (5.9%)
Liver hepatocellular carcinoma [LIHC]           0 (0%)                1 (5.9%)
Lung adenocarcinoma [LUAD]                      0 (0%)                1 (0.79%)
Lung squamous cell carcinoma [LUSC]             0 (0%)                9 (4%)
Prostate adenocarcinoma [PRAD]                  1 (14.3%)             2 (1.9%)
Thyroid carcinoma [THCA]                        0 (0%)                2 (0.89%)

* All performed by PRADA fusion module.

Tumors with the fusion have higher GPR128 expression levels

RPKM expression pattern seen in KIRC tumors

Fusion sample(s)

Higher expression of GPR128 (activation)

TCGA-B0-5703: 1 discordant read pair in the tumor sample, 33 discordant read pairs in the matched normal

Thanks.

http://sourceforge.net/projects/prada/
