1
For Research Use Only. Not for use in diagnostics procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners. Single Cell Isoform Sequencing (scIso-Seq) Identifies Novel Full-length mRNAs and Cell Type-specific Expression Elizabeth Tseng 1 , Jason G. Underwood 1 , Aparna Bhuduri 2 , Calum Marrs 3 , Filip Konopacki 3 , Daniel Wong 3 , Katherine M. Munson 5 , Alexandra Lewis 5 , Alex A. Pollen 2 and Evan E. Eichler 4,5 1 PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025 2 Department of Neurology, University of California-San Francisco, San Francisco, CA 3 Dolomite/Blacktrace Holdings Ltd; 27 Jarman Way, Royston, SG8 5HW, UK 4 Howard Hughes Medical Institute, University of Washington, Seattle, WA 5 Department of Genome Sciences, University of Washington, Seattle, WA Single cell RNA-seq (scRNA-seq) is an emerging field for characterizing cell heterogeneity in complex tissues. However, most scRNA-seq methodologies are limited to gene count information due to short read lengths. Here, we combine the microfluidics scRNA-seq technique, Drop-Seq, with PacBio Single Molecule, Real-Time (SMRT) Sequencing to generate full-length transcript isoforms that can be confidently assigned to individual cells. We generated single cell Iso-Seq (scIso-Seq) libraries for chimp and human cerebral organoid samples on the Dolomite Nadia platform and sequenced each library with two SMRT Cells 8M on the PacBio Sequel II System. We developed a bioinformatics pipeline to identify, classify, and filter full-length isoforms at the single-cell level. We show that scIso-Seq reveals full-length isoform information not accessible using short reads that can reveal differences between cell types and amongst different species. Abstract scIso-Seq on Cerebral Organoids Using SQANTI2 for QC Single Cell Iso-Seq (scIso-Seq) Full Splice Match = perfect match Incomplete Splice Match = partial match Novel In Catalog = novel isoform with known junctions Novel Not in Catalog = at least one novel junction Within intron Overlap with intron and exons Figure 1. Full-length single cell libraries were generated using the Dolomite Nadia system and made into PacBio Iso- Seq libraries. Sequencing was performed on the Sequel II System. SQANTI2 [2] matching Iso-Seq transcripts to existing annotations and provides a wide range of descriptors of transcript quality. (a) Comparison against Gencode v29 transcript annotation reveals ~46% scIso-Seq transcripts are novel. (b) Novel scIso-Seq isoforms retain coding potential GENCODE v29 87.3% 3.8% 6.1% 2.8% Human Developing Cortex Single Cell Intropolis v1 Novel (c) Most junctions are validated by existing annotations and RNA-seq data. Class # Isoforms % with CAGE Peak 50 bp Full Splice Match 18,344 78% Incomplete Splice Match 13,802 37% Novel In Catalog 19,033 44% Novel Not In Catalog 9,197 67% Intergenic 245 29% (d) Matching FANTOM5 CAGE peak data with scIso-Seq transcripts shows full splice matches (perfect junction matches to annotations) are enriched for known TSS. [1] cDNA_Cupcake https://github.com/Magdoll/cDNA_Cupcake [2] SQANTI2 https://github.com/Magdoll/SQANTI2/ REFERENCES Post PCR PacBio. Shear Iso-Seq 3’ RNA-Seq 5’ primer UMI 3’ primer (AAA)n Barcode Full-Length cDNA SAMPLE POL READS (bp) POL BASE (GB) POL LENGTH (kb) Chimp Organoid 3,157,575 199 62.9 Chimp Organoid 3,637,178 206 56.7 Human Organoid 2,856,715 177 62.1 Human Organoid 5,277,118 265 50.2 Table 1. Sequencing yield for the chimp and human organoid single cell Iso-Seq (scIso-Seq) libraries on each of two SMRT Cells 8M run on the Sequel II System. SAMPLE FLNC (filtered) UNIQUE READS UNIQUE GENES UNIQUE ISOFORMS Chimp Organoid 2,303,267 418,542 14,049 58,892 Human Organoid 2,291,947 382,734 14,737 60,815 1. HiFi (CCS) reads 2. Remove cDNA primers 3. Clip UMIs and BCs 4. Remove polyA tail & artificial concatemers 5. Align to genome 6. Collapsed redundancy 7. Compare w/ annotation 8. Filter library artifacts 9. Process into CSV report for UMI/BC error correction Figure 2. Bioinformatics analysis for scIso-Seq data. The pipeline is described in Cupcake [1]. Table 2. Number of unique (de-duplicated) full-length, non- concatemer (FLNC) reads and corresponding number of unique genes and transcript isoforms. 334 56 57 PacBio Illumina Barcodes Figure 3. Matching cell barcodes between PacBio scIso-Seq data and Illumina short-read data shows high concordance. The scIso- Seq cumulative read plot indicates ~400 STAMPS (single cells). 20 kb hg38 74,025,000 74,030,000 74,035,000 74,040,000 74,045,000 74,050,000 74,055,000 74,060,000 74,065,000 74,070,000 74,075,000 Human Organoid Astrocytes Human Organoid RG/glycolysis/choroid Human Organoid G2M Human Organoid Mitochondia/S phase cells Human Organoid Neuron Human Organoid Outlier Human Organoid Progenitor GENCODE v29 Comprehensive Transcript Set (only Basic displayed by default) efSeq genes, curated subset (NM_*, NR_*, and YP_*) - Annotation Release NCBI Homo sapiens Annotation Release 109 (2018-03 Figure 4. The tropoelastin gene is found only to be expressed in astrocytes but not other human organoid cell types and shows exon skipping events and usage of alternative start/end sites. Scale chr19: 20 kb hg38 53,520,000 53,525,000 53,530,000 53,535,000 53,540,000 53,545,000 53,550,000 53,555,000 53,560,000 53,565,000 53,570,000 53,575,000 53,580,000 Chimp Organoid Astrocytes Chimp Organoid Choroid Chimp Organoid Cycling Chimp Organoid Fibroblast Chimp Organoid Neuron Chimp Organoid Unknown Human Organoid Astrocytes Human Organoid RG/glycolysis/choroid Human Organoid G2M Human Organoid Mitochondia/S phase cells Human Organoid Neuron Human Organoid Outlier Human Organoid Progenitor GENCODE v29 Comprehensive Transcript Set (only Basic displayed by default) merged_final.bam PB.11696.5 PB.11696.11 PB.11696.4 PB.11696.12 PB.16078.4 PB.16078.9 PB.16078.1 PB.16078.3 PB.16078.2 PB.16078.8 PB.16078.12 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 ZNF331 Figure 5. Alternative transcription start site (TSS) usage in chimp and human in ZNF331. Figure 6. SQANTI2 comparison against existing annotations confirms scIso-Seq data consists of full-length (5’-3’) transcript isoforms.

Single Cell Isoform Sequencing (scIso-Seq) Identifies ... · Title: Single Cell Isoform Sequencing (scIso-Seq) Identifies Novel Full-length mRNAs and Cell Type-specific Expression

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Single Cell Isoform Sequencing (scIso-Seq) Identifies ... · Title: Single Cell Isoform Sequencing (scIso-Seq) Identifies Novel Full-length mRNAs and Cell Type-specific Expression

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.

Single Cell Isoform Sequencing (scIso-Seq) Identifies Novel

Full-length mRNAs and Cell Type-specific Expression Elizabeth Tseng1, Jason G. Underwood1, Aparna Bhuduri2, Calum Marrs3, Filip Konopacki3, Daniel Wong3, Katherine M. Munson5, Alexandra Lewis5, Alex A. Pollen2 and Evan E. Eichler4,5

1 PacBio, 1305 O’Brien Drive, Menlo Park, CA 94025 2 Department of Neurology, University of California-San Francisco, San Francisco, CA 3 Dolomite/Blacktrace Holdings Ltd; 27 Jarman Way, Royston, SG8 5HW, UK 4 Howard Hughes Medical Institute, University of Washington, Seattle, WA5 Department of Genome Sciences, University of Washington, Seattle, WA

Single cell RNA-seq (scRNA-seq) is an emerging

field for characterizing cell heterogeneity in

complex tissues. However, most scRNA-seq

methodologies are limited to gene count

information due to short read lengths.

Here, we combine the microfluidics scRNA-seq

technique, Drop-Seq, with PacBio Single

Molecule, Real-Time (SMRT) Sequencing to

generate full-length transcript isoforms that can be

confidently assigned to individual cells.

We generated single cell Iso-Seq (scIso-Seq)

libraries for chimp and human cerebral organoid

samples on the Dolomite Nadia platform and

sequenced each library with two SMRT Cells 8M

on the PacBio Sequel II System. We developed a

bioinformatics pipeline to identify, classify, and

filter full-length isoforms at the single-cell level. We

show that scIso-Seq reveals full-length isoform

information not accessible using short reads that

can reveal differences between cell types and

amongst different species.

Abstract scIso-Seq on Cerebral Organoids Using SQANTI2 for QC

Single Cell Iso-Seq (scIso-Seq)

Full Splice Match = perfect match

Incomplete Splice Match = partial match

Novel In Catalog = novel isoform with known junctions

Novel Not in Catalog = at least one novel junction

Within intron

Overlap with intron and exons

Figure 1. Full-length single cell libraries were generated

using the Dolomite Nadia system and made into PacBio Iso-

Seq libraries. Sequencing was performed on the Sequel II

System.

SQANTI2 [2] matching Iso-Seq transcripts to

existing annotations and provides a wide range of

descriptors of transcript quality.

(a) Comparison against Gencode v29 transcript annotation reveals ~46%

scIso-Seq transcripts are novel.

(b) Novel scIso-Seq isoforms retain coding potential

GENCODE v29

87.3%

3.8%

6.1%

2.8%

Human Developing Cortex

Single Cell

Intropolis v1

Novel

(c) Most junctions are validated by existing annotations and RNA-seq data.

Class # Isoforms% with CAGE Peak

≤50 bp

Full Splice Match 18,344 78%

Incomplete Splice Match 13,802 37%

Novel In Catalog 19,033 44%

Novel Not In Catalog 9,197 67%

Intergenic 245 29%

(d) Matching FANTOM5 CAGE peak data with scIso-Seq transcripts

shows full splice matches (perfect junction matches to annotations)

are enriched for known TSS.

[1] cDNA_Cupcake https://github.com/Magdoll/cDNA_Cupcake

[2] SQANTI2 https://github.com/Magdoll/SQANTI2/

REFERENCES

Post PCR

PacBio. Shear

Iso-Seq

3’ RNA-Seq

5’ primer UMI 3’ primer(AAA)n BarcodeFull-Length cDNA

SAMPLE

POL

READS

(bp)

POL

BASE

(GB)

POL

LENGTH

(kb)

Chimp Organoid 3,157,575 199 62.9

Chimp Organoid 3,637,178 206 56.7

Human Organoid 2,856,715 177 62.1

Human Organoid 5,277,118 265 50.2

Table 1. Sequencing yield for the chimp and human organoid

single cell Iso-Seq (scIso-Seq) libraries on each of two SMRT

Cells 8M run on the Sequel II System.

SAMPLEFLNC

(filtered)

UNIQUE

READS

UNIQUE

GENES

UNIQUE

ISOFORMS

Chimp

Organoid2,303,267 418,542 14,049 58,892

Human

Organoid2,291,947 382,734 14,737 60,815

1. HiFi (CCS) reads

2. Remove cDNA primers

3. Clip UMIs and BCs

4. Remove polyA tail &

artificial concatemers

5. Align to genome

6. Collapsed redundancy

7. Compare w/ annotation

8. Filter library artifacts

9. Process into CSV

report for UMI/BC error

correction

Figure 2. Bioinformatics analysis for scIso-Seq data. The pipeline

is described in Cupcake [1].

Table 2. Number of unique (de-duplicated) full-length, non-

concatemer (FLNC) reads and corresponding number of unique

genes and transcript isoforms.

33456 57

PacBio Illumina

Barcodes

Figure 3. Matching cell barcodes between PacBio scIso-Seq data

and Illumina short-read data shows high concordance. The scIso-

Seq cumulative read plot indicates ~400 STAMPS (single cells).

20 kb hg38

74,025,000 74,030,00074,035,000 74,040,000 74,045,00074,050,000 74,055,000 74,060,00074,065,000 74,070,00074,075,000

Human Organoid Astrocytes

Human Organoid RG/glycolysis/choroid

Human Organoid G2M

Human Organoid Mitochondia/S phase cells

Human Organoid Neuron

Human Organoid Outlier

Human Organoid Progenitor

GENCODE v29 Comprehensive Transcript Set (only Basic displayed by default)

NCBI RefSeq genes, curated subset (NM_*, NR_*, and YP_*) - Annotation Release NCBI Homo sapiens Annotation Release 109 (2018-03-29) ELN

ELN

ELN

ELN

ELN

ELN

ELN

ELN

ELN

ELN

ELN

ELN

ELN

1 _

1 _

Figure 4. The tropoelastin gene is found only to be expressed in

astrocytes but not other human organoid cell types and shows

exon skipping events and usage of alternative start/end sites.

Scale

chr19:

20 kb hg38

53,520,000 53,525,000 53,530,000 53,535,000 53,540,000 53,545,000 53,550,000 53,555,000 53,560,000 53,565,000 53,570,000 53,575,000 53,580,000Chimp Organoid Astrocytes

Chimp Organoid Choroid

Chimp Organoid CyclingChimp Organoid FibroblastChimp Organoid Neuron

Chimp Organoid Unknown

Human Organoid Astrocytes

Human Organoid RG/glycolysis/choroid

Human Organoid G2M

Human Organoid Mitochondia/S phase cellsHuman Organoid Neuron

Human Organoid OutlierHuman Organoid Progenitor

GENCODE v29 Comprehensive Transcript Set (only Basic displayed by default)

merged_final.bam

PB.11696.5

PB.11696.11

PB.11696.4

PB.11696.12

PB.16078.4

PB.16078.9

PB.16078.1

PB.16078.3

PB.16078.2

PB.16078.8

PB.16078.12

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

ZNF331

Figure 5. Alternative transcription start site (TSS) usage in chimp

and human in ZNF331.

Figure 6. SQANTI2 comparison against existing annotations

confirms scIso-Seq data consists of full-length (5’-3’) transcript

isoforms.