Single-cell RNA-seq analysis

Winter School on Mathematical and Computational Biology 2019, UQ, 2 July 2019

Dr Joshua W. K. Ho, Associate Professor, School of Biomedical Sciences, The University of Hong Kong

Dr Kitty Lo, Dr Pengyi Yang, Prof Jean Yang

School of Mathematics and Statistics, University of Sydney, Sydney, Australia

•  Group minions based on their similarity of physical appearance: clustering
•  Identifying distinguishing features between different groups of minions: differential expression analysis

Example – diverse cell types in the mouse nervous system

Zeisel (2018), Cell

Exponential growth in single cell RNA seq technologies

Svensson et al., Nature Protocols (2018)

Droplet based technologies are now dominating

Macosko et al. (2015), Cell

10X Genomics is a commercial provider of a droplet based scRNAseq platform

scRNAseq experiments approaching 1 million cells

Saunders et al. (2018), Cell

690,000 individual cells from 9 regions of adult mouse brain

Number of scRNAseq tools also increasing rapidly

Downloaded from www.scrna-tools.org

Steps in scRNA-seq analysis

Zappia et al. (2018)

Software
•  CellRanger for 10X Genomics data
   •  https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome
•  Seurat (all-purpose single cell R package)
   •  https://satijalab.org/seurat/
•  Scanpy (a Python package)
   •  https://scanpy.readthedocs.io/
•  Follow their online tutorials (easy to use)

Batch effect removal

•  Seurat (all-purpose single cell R package) for very basic normalization
•  Batch effect correction
   •  mnnCorrect
   •  ZINB-Wave
   •  scMerge
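To make the idea of a batch effect concrete, here is a minimal toy sketch in Python. It is not what mnnCorrect, ZINB-Wave or scMerge do: it only subtracts each batch's mean from one gene, which is a naive correction that assumes every batch contains the same mix of cell types. The function name and the toy values are illustrative assumptions.

```python
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean from the values measured in that batch."""
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]


# Toy log-expression of one gene in six cells from two batches (made-up numbers).
# Batch "B" carries a constant technical offset of +2 relative to batch "A".
expression = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
batch = ["A", "A", "A", "B", "B", "B"]

corrected = center_by_batch(expression, batch)
print(corrected)
```

When batches differ in cell type composition, as in the liver time course that follows, mean-centering would also remove real biology, which is one reason dedicated methods exist.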

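Time points: E9.5, E10.5, E11.5, E12.5, E13.5, E14.5, E15.5, E16.5, E17.5

Datasets: GSE87795 (Su et al.), GSE90047 (Yang et al.), GSE87038 (Dong et al.), GSE96981 (Camp et al.)

Cell counts shown in the figure: N = 320 cells, N = 389 cells, N = 79 cells, N = 448 cells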

Liver fetal development time course datasets

tSNE of liver fetal development time course datasets

[Figure panels: highlighted by cell types; highlighted by batches]

Challenge: Strong “batch effect”

scMerge

scMerge R package and website: https://sydneybiox.github.io/scMerge/

PNAS: https://doi.org/10.1073/pnas.1820006116

Coming back to our motivational data: Liver fetal development time course datasets

[Figure: tSNE plots of the liver datasets before scMerge (logcounts) and after scMerge (scMerge_scSEG), coloured by cell type (cholangiocyte, Endothelial Cell, Epithelial Cell, Hematopoietic, hepatoblast/hepatocyte, Immune cell, Mesenchymal Cell, Stellate Cell) and by batch (GSE87038, GSE87795, GSE90047, GSE96981); horizontal axis tSNE1.]

E10.5 hepatoblasts

E17.5 cholangiocytes

E17.5 hepatocytes

Cell assignment

Science questions

•  What cell types are present in the dataset?

•  Can we identify the cell types?

•  What is the cell type composition?

•  Are the cells transitioning from one state to another?

Analysis techniques
•  Clustering (unsupervised learning)
•  Classification (supervised learning)
•  Dimension reduction

Dimension reduced plot of our data (tSNE plot)

[Figure: t-SNE plot; horizontal axis tsne1]

How many cell types are there? What are the cell types?

k-means clustering

[Figure: t-SNE plot; horizontal axis tsne1]

How many cell types are there? What are the cell types?
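As a rough illustration of k-means on a two-dimensional embedding, the sketch below implements Lloyd's algorithm in plain Python. The toy coordinates, the choice of k = 2 and the function name are assumptions for illustration; real analyses would normally rely on a package such as Seurat or Scanpy, and the number of clusters has to be chosen by the analyst.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm on 2D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, labels


# Toy embedding: two well-separated groups of 20 "cells" each (made-up data).
rng = random.Random(1)
cells = [(rng.gauss(-10, 1), rng.gauss(-10, 1)) for _ in range(20)] + \
        [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)]

centroids, labels = kmeans(cells, k=2)
print(labels)
```

The cluster labels are arbitrary integers; deciding what cell type each cluster represents is a separate step.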

Clustering algorithms

k-means

Hierarchical

RaceID

countClust

Luke Zappia, et al. PLoS Comp. Bio. 2018

Clustering algorithms in single cell research

Phase 4: Gene identification

Science questions

•  Which genes are differentially expressed between cell types?

•  What are the marker genes for each cell type?

Differences between single cell and bulk RNAseq

•  Single cell gene expressions show a bimodal expression pattern: abundant genes are either highly expressed or undetected.
•  This can be technical (drop-outs).
   •  Drop-outs lead to technical zeroes in the data.
   •  Technical zeroes are due to low capture efficiency in scRNAseq experiments.
•  Many methods have been proposed to deal with drop-outs
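The link between capture efficiency and technical zeroes can be sketched with a small simulation. The model below (each molecule is captured independently with a fixed probability) is a simplifying assumption, and the efficiency of 0.1 and the count of 5 molecules per cell are illustrative values, not figures from the slides.

```python
import random


def capture(true_counts, efficiency, seed=0):
    """Keep each mRNA molecule independently with probability `efficiency`."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < efficiency for _ in range(n))
        for n in true_counts
    ]


# Every one of 1000 cells truly contains 5 molecules of the gene.
true_counts = [5] * 1000
observed = capture(true_counts, efficiency=0.1)

# Fraction of cells in which the gene is observed as zero despite being expressed.
zero_fraction = observed.count(0) / len(observed)
print(zero_fraction)
```

Under this model the expected zero fraction is (1 - efficiency) raised to the number of molecules, so a gene that is present in every cell can still appear undetected in most of them.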

Differential expression analysis

•  Simple statistical test
   •  Wilcoxon rank test, t-test
•  Methods developed for bulk RNAseq DE
   •  DESeq2
   •  edgeR
   •  Voom-Limma
•  scRNA specific
   •  Seurat
   •  MAST
   •  DECENT
   •  D3E
   •  … many more!
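For the simple statistical test option, the following is a minimal sketch of a Wilcoxon rank-sum (Mann-Whitney U) test for one gene between two groups of cells. It uses a normal approximation without a tie correction, so the p-value is approximate; the function name and the toy counts are assumptions, and an established statistics library would be the safer choice in practice.

```python
import math


def rank_sum_test(x, y):
    """Wilcoxon rank-sum test with a normal approximation.

    Returns (U statistic for x, two-sided p-value). Tied values receive
    average ranks; the variance is not tie-corrected.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        n_from_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += avg_rank * n_from_x
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, p


# Toy counts of one gene in two clusters of eight cells each (made-up data).
cluster_a = [0, 0, 1, 0, 2, 0, 1, 0]
cluster_b = [5, 7, 0, 6, 9, 4, 8, 6]

u, p = rank_sum_test(cluster_a, cluster_b)
print(u, p)
```

A rank-based test like this makes no distributional assumption about the counts, which is part of its appeal for zero-heavy single cell data, but it would need multiple-testing correction when applied to every gene.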

DE methods comparisons for scRNAseq

Soneson and Robinson (2018), Nature Methods


Making scRNA-seq analysis more scalable

Cloud computing to enable scalability

•  Cloud computing + Big Data Framework
•  Cloud computing
   •  A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources
   •  Key characteristics: elasticity + pay-as-you-go model
   •  Advantages: low entry cost + scalability
•  Big Data framework
   •  Hadoop: a software framework for distributed processing of big data in large scale clusters (YARN for resource management, HDFS for big data storage, and MapReduce for analytics engine)
   •  Spark: a general purpose data-analytics engine for analysis of big data using in-memory computation (allows a speed up of up to 100x compared to MapReduce)

Falco framework

[Figure: Falco framework; panels labelled MapReduce and Spark]

Andrian Yang, Michael Troup. Yang et al. (2017), Bioinformatics

Falco framework features

•  Ease of use
   •  Falco provides a helper script to launch an EMR cluster and submit jobs to the cluster
   •  Users can easily configure the cluster and jobs by modifying the configuration file passed to the helper script
•  Customisation
   •  Falco allows users to add custom alignment and/or quantification tools
   •  Users will need to implement a custom function to call the aligner/quantification tool
   •  The custom tool must be compatible with the divide-and-conquer approach
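What "compatible with the divide-and-conquer approach" means can be sketched as follows: the input is split into chunks, each chunk is processed independently, and the partial results are merged. This is only a conceptual illustration in plain Python, not Falco's actual interface; all function names are hypothetical, and the reads are pretended to be already assigned to genes so that the alignment step can be replaced by a simple count.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Divide: cut the input into independent chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_chunk(chunk):
    """Stand-in for aligning and quantifying one chunk: count reads per gene."""
    return Counter(gene for _read_id, gene in chunk)


def merge_counts(a, b):
    """Conquer: per-gene counts from different chunks simply add up."""
    return a + b


# Toy reads, each already labelled with the gene it came from (made-up data).
reads = [("r1", "Actb"), ("r2", "Gapdh"), ("r3", "Actb"),
         ("r4", "Sox2"), ("r5", "Actb")]

chunks = split_into_chunks(reads, chunk_size=2)
partial = [count_chunk(c) for c in chunks]  # each call could run on a different worker
total = reduce(merge_counts, partial, Counter())
print(total)
```

The approach works because the merged result does not depend on how the reads were split; a tool whose output for one read depends on all other reads would not fit this pattern.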

[job_config]
name = mESC analysis job
action_on_failure = CONTINUE
analysis_script = run_pipeline_multiple_files.py
analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts
analysis_script_local_location = source/spark_runner
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
annotation_file = vM9_ERCC.gtf
strand_specificity = NONE
run_picard = True
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
picard_extra_args =
region = us-west-2

Sample configuration for running analysis job
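Because the sample configuration uses an INI-style layout, it can be inspected with Python's standard configparser module. The sketch below is only an illustration of reading such a file, using a shortened copy of the sample; it does not show how Falco itself loads its configuration, and the helper name load_job_config is hypothetical.

```python
import configparser

# Shortened copy of the sample configuration shown on the slide.
SAMPLE = """
[job_config]
name = mESC analysis job
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
"""


def load_job_config(text):
    """Parse INI-style text; interpolation is disabled so values are read literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


config = load_job_config(SAMPLE)
aligner = config["script_arguments"]["aligner_tool"]
driver_memory = config["spark_config"]["driver_memory"]
upload = config["job_config"].getboolean("upload_analysis_script")
print(aligner, driver_memory, upload)
```

Options left blank in the file, such as aligner_extra_args, are read as empty strings.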

Benchmarking
•  Single-cell RNA-seq data sets
   •  Mouse embryonic stem cell (mESC) data (869 samples)
      •  200bp paired-end reads, 1.28×10^12 bases, 1.02 Tb FASTQ.gz files
   •  Human brain data (466 samples)
      •  100bp paired-end reads, 2.95×10^11 bases, 213.66 Gb FASTQ.gz files
•  Performance comparison of Falco against a single node
   •  STAR+featureCount (S+F)
      •  Mouse: speedup of 2.6x to 33.4x
      •  Brain: speedup of 5.1x to 145.4x
   •  HISAT2+HTSeq (H+H)
      •  Mouse: speedup of 2.5x to 58.4x
      •  Brain: speedup of 4.0x to 132.5x
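A speedup is the ratio of a baseline runtime to the new runtime. The sketch below recomputes the mouse STAR+featureCount range from the runtimes listed in Table 1, under two assumptions that the slides do not state explicitly: that the lower bound compares the fastest standalone run (16 processes) with the smallest cluster (10 nodes), and that the upper bound compares the slowest standalone run (1 process) with the largest cluster (40 nodes). Because the table values are rounded, the last digit may differ from the quoted figures.

```python
def speedup(baseline_hours, new_hours):
    """How many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


# Mouse ESC, STAR+featureCount runtimes in hours, as listed in Table 1.
standalone_1_process = 93.7
standalone_16_processes = 18.5
falco_10_nodes = 7.0
falco_40_nodes = 2.8

low = speedup(standalone_16_processes, falco_10_nodes)
high = speedup(standalone_1_process, falco_40_nodes)
print(round(low, 1), round(high, 1))
```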

System Nodes Mouse - embryonic stem cell (hours)

Human - brain (hours)

S+F H+H S+F H+H

Standalone

1 (1 process) 93.7 154.7 85.67 65.34

1 (5 processes) 29.3 33.8 99.09 67.08

1 (12 processes) 21.1 16.4 115.71 55.15

1 (16 processes) 18.5 13.6 114.11 67.98

10 7.0 2.7 32.13 65.34

20 4.1 1.6 39.64 67.08

30 3.3 1.4 57.68 67.68

40 2.8 1.1 76.08 67.98

Table 1. Runtime analysis of single cell datasets

Cost effectiveness using AWS spot instances

•  Utilising spot instances
   •  AWS allows utilisation of unused Amazon computing capacity, known as Spot instances
   •  Typically cheaper compared to ‘on-demand’ cost
   •  To use a spot instance, the user needs to bid for the resource
•  Use of spot instances for analysis provides a saving of ~65% compared to using ‘on-demand’ instances
   •  Alternative use: decrease runtime by utilising more instances for a given ‘on-demand’ price

Figure 3. Spot instance price history for September to October


Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount

Dataset        Nodes  Time (hours)  On-demand cost (USD)  Spot cost (USD)  % Savings
Mouse - ESC    10     8             247.20                85.67            65.34
Mouse - ESC    20     5             301.00                99.09            67.08
Mouse - ESC    30     4             258.00                115.71           55.15
Mouse - ESC    40     3             356.40                114.11           67.98
Human - brain  10     3             92.70                 32.13            65.34
Human - brain  20     2             120.40                39.64            67.08
Human - brain  30     2             179.00                57.68            67.68
Human - brain  40     2             237.60                76.08            67.98
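The % Savings column in Table 2 appears to be the relative difference between the on-demand and spot costs. A minimal sketch of that arithmetic, using the 10-node mouse ESC row; the function name is an assumption for illustration:

```python
def percent_savings(on_demand_usd, spot_usd):
    """Savings of the spot price relative to the on-demand price, in percent."""
    return 100.0 * (on_demand_usd - spot_usd) / on_demand_usd


# Mouse - ESC, 10 nodes: on-demand 247.20 USD, spot 85.67 USD (from Table 2).
mouse_10_nodes = percent_savings(247.20, 85.67)
print(round(mouse_10_nodes, 2))
```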

Scaling up to a larger data set
•  Data set (for Standalone + Falco)
   •  Single-cell mouse oligodendrocytes from the central nervous system (SRP066613)
   •  6,283 samples of 50bp single-ended reads, totalling 231.02 Gbp, stored in 200 Gb of fastq.gz files
•  Standalone + Falco
   •  Preprocessing with Trimmomatic
   •  Alignment with STAR
   •  Quantification with featureCount
   •  Clustering with CIDR
•  Cell Ranger: custom pipeline designed by Chromium
   •  Alignment with STAR
   •  Timing is approximated from the runtime of a different mouse scRNA-seq dataset

[Figure: comparison across Standalone (1 process, 12 processes, 16 processes), Cell Ranger, and 10 nodes and 40 nodes]

Falco software

•  Source code
   •  Falco is available to download from GitHub
   •  Our work on Falco has been featured in a Nature Toolbox article

Check out Falco at github.com/VCCRI/Falco

starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality

•  Enabling widespread use of VR visualisation using low-cost ($10) VR headsets, and a person’s own smartphone (with a web browser)
•  Support interaction using head movement, keyboard, remote gamepad, and voice control

Jianfu Li, Yu Yao

Using starmap to visualise a data set of 68,000 cells from scRNA-seq data

starmap
•  starmap demo: https://vccri.github.io/starmap/
•  starmap source code: https://github.com/VCCRI/starmap
•  bioRxiv preprint: https://www.biorxiv.org/content/early/2018/05/17/324855

https://www.abacbs.org/giw/

Full-paper submission (for oral presentation and journal publication): This week!
Abstract submission (for oral or poster presentation): 1 September 2019?

THANK YOU

We are recruiting:
-  PhD students ($57K pa scholarship)
-  Research assistants
-  Postdoctoral fellows
-  Bioinformaticians (staff)
-  Faculty

jwkho@hku.hk
https://holab-hku.github.io/
@joshuawkho

HKU-USydney Strategic Partnership Fund: ‘Single Cell Plus’
