Bioinformatics analysis of single-cell RNA-seq databioinformatics.org.au/winterschool/wp-content/uploads/...• Mouse embryonic stem cell (mESC) data (869 samples) • 200bp paired-end

DrVictorChangAC1936-1991,PioneeringCardiothoracicSurgeonandHumanitarian

Bioinformatics analysis of single-cell RNA-seq data

Joshua W. K. Ho, PhD Head, Bioinformatics and Systems Medicine Laboratory

Victor Chang Cardiac Research Institute Senior Lecturer (Conjoint), UNSW Sydney

@joshuawkho

2018 Winter School in Mathematical & Computational Biology, University of Queensland, 3 July 2017

Bioinforma;cschallenges:-  Scalability(>1millioncells)-  Technicalnoise(dropouts)

RNA-seqalignmentandtranscriptreconstruc;on

Cloud computing to enable scalability

Cloud computing + Big Data Framework •  Cloud computing

•  A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources

•  Key characteristics – elasticity + pay-as-you-go model •  Advantages – low entry cost + scalability

•  Big Data framework •  Hadoop – a software framework for distributed processing of big data in large scale cluster (YARN for resource

management, HDFS for big data storage, and MapReduce for analytics engine) •  Spark – a general purpose data-analytics engine for analysis of big data using in-memory computation (allows a

speed up of up to 100x compared to MapReduce)

Existing tools

•  Halvade (https://github.com/biointec/halvade) •  Written in Hadoop MapReduce •  Designed to perform variant calling of genomic data from FASTQ files •  Provides support for transcriptomic analysis

•  SparkBWA (https://github.com/citiususc/SparkBWA) •  Written in Spark •  Designed to perform alignment of FASTQ files only

•  SparkSeq (https://bitbucket.org/mwiewiorka/sparkseq/wiki/Home) •  Written in Spark •  Designed to perform interactive analysis of BAM files

•  Limitations: •  Halvade and SparkBWA does not offer multi-sample analysis •  SparkSeq does not perform alignment – which is the main bottleneck in analysis

Falco framework

MapReduce Spark

AndrianYang Yangetal(2017)Bioinforma)csMichaelTroup

Falco framework features

Ease of use •  Falco provides helper script to launch EMR cluster and submit

jobs to the cluster •  User can easily configure the cluster and jobs by modifying

the configuration file passed to the helper script

Customisation •  Falco allows user to add custom alignment and/or quantification

tools •  User will need to implement custom function to call the

aligner/quantification tool •  Custom tool must be compatible with divide-and-conquer

approach

[job_config] !name = mESC analysis job !action_on_failure = CONTINUE !analysis_script = run_pipeline_multiple_files.py !analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts !analysis_script_local_location = source/spark_runner !upload_analysis_script = True !![spark_config] !driver_memory = 30g !executor_memory = 30g !![script_arguments] !input_location = s3://[YOUR-BUCKET]/mESC_clean !output_location = s3://[YOUR-BUCKET]/mESC_gene_counts !annotation_file = vM9_ERCC.gtf !strand_specificity = NONE !run_picard = True !aligner_tool = STAR !aligner_extra_args = !counter_tool = featureCount!counter_extra_args = -t exon -g gene_name!picard_extra_args = !region = us-west-2 !

Sample configuration for running analysis job

Benchmarking

•  Single-cell RNA-seq data sets •  Mouse embryonic stem cell (mESC) data (869

samples) •  200bp paired-end reads,1.28×1012 bases, 1.02Tb

FASTQ.gz files) •  Human brain data (466 samples)

•  100bp paired-end reads, 2.95×1011 bases, 213.66 Gb FASTQ.gz files

•  Performance comparison of Falco against single-node

•  STAR+featureCount (S+F) •  Mouse: speedup of 2.6x – 33.4x •  Brain: speedup of 5.1x – 145.4x

•  HISAT2+HTSeq (H+H) •  Mouse: speedup of 2.5x – 58.4x •  Brain: speedup of 4.0x – 132.5x

System Nodes Mouse - embryonic stem cell (hours)

Human - brain (hours)

S+F H+H S+F H+H

Standalone

1 (1 process) 93.7 154.7 85.67 65.34

1 (5 processes) 29.3 33.8 99.09 67.08

1 (12 processes) 21.1 16.4 115.71 55.15

1 (16 processes) 18.5 13.6 114.11 67.98

Falco

10 7.0 2.7 32.13 65.34

20 4.1 1.6 39.64 67.08

30 3.3 1.4 57.68 67.68

40 2.8 1.1 76.08 67.98

Table 1. Runtime analysis of single cell datasets

Cost effectiveness by using AWS spot instances

Utilising spot instances •  AWS allows utilisation of unused Amazon computing capacity – known as

Spot instances •  Typically cheaper compared to ‘on-demand’ cost

•  To use spot instance, user needs bid for the resource •  Use of spot instance for analysis provides a savings of ~65% compared

to using ‘on-demand’ instances •  Alternative use - decrease runtime by utilising more instances for a

given ‘on-demand’ price Figure 3. Spot instance price history for September to October

Table2.Falcocostanalysis-on-demandvsspotinstances

Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount

Dataset Number of nodes

Time (hours)

On-demand cost (USD)

Spot cost (USD)

% Savings

Mouse - ESC

10 8 247.20 85.67 65.34 20 5 301.00 99.09 67.08 30 4 258.00 115.71 55.15 40 3 356.40 114.11 67.98

Human - brain

10 3 92.70 32.13 65.34 20 2 120.40 39.64 67.08 30 2 179.00 57.68 67.68 40 2 237.60 76.08 67.98

Table 3. Falco cost analysis - on-demand vs spot instances for HISAT2+HTSeq

Dataset Number of nodes

Time (hours)

On-demand cost (USD)

Spot cost (USD)

% Savings

Mouse - ESC

10 12 370.80 128.40 65.37

20 7 421.40 138.60 67.11

30 5 447.50 144.50 67.71

40 4 475.20 152.00 68.01

Human - brain

10 5 154.50 53.50 65.37

20 3 180.60 59.40 67.11

30 2 179.00 57.80 67.71

40 2 237.60 76.00 68.01

Scaling up to a larger data set

Data set (for Standalone + Falco) •  Single-cell Mouse oligodendrocyte from central nervous

system (SRP066613) •  6,283 samples of 50bp single-ended reads, totalling to

231.02 Gbp stored in 200 Gb of fastq.gz file. •  Standalone + Falco

•  Preprocessing with Trimmomatic •  Alignment with STAR •  Quantification with featureCount •  Clustering with CIDR

•  Cell Ranger – custom pipeline designed by chromium •  Alignment with STAR •  Timing is approximated from runtime of a different

mouse scRNA-seq dataset 0.0

0.5

1.0

1.5

1 Process 12 Processes16 Processes Cell Ranger

Standalone

10 Nodes 40 Nodes

Num

ber o

f cel

ls p

roce

ssed

per

sec

onds

Falco

Next step – using Falco for transcript reconstruction

AndrianYang AbhinavKishore

Discovery of novel transcript isoforms in published data

•  Identification of novel transcript and isoform

Availability

Source code •  Falco is available to download from Github •  Our work on Falco has been featured in a Nature Toolbox article Checkout Falco at

github.com/VCCRI/Falco

Technical noise in scRNA-seq: Dropouts

Figure 1 (a) Types of cell-to-cell variability observed in single-cell RNA-seq measurements. A smoothed scatter plot compares gene expression estimates from two cells of the same type (MEF cells), illustrating prevalence of dropout events, over-dispersion, and high-magnitude outliers.

Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742

The dropout problem in scRNA-seq data analysis

Fig. 1b. Heat maps showing the relationship between dropout rate and mean non-zero expression level for three published single-cell data sets including an approximate double exponential model fit.

Pierson, E. and Yau, Christopher. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241

Dropouts

1. What is the cause of zero read counts? •  Biological reason: o  True non-expression: Stochastic variability due to cell-to-cell variations (transcriptional burst)

•  Technical reason (dropout): o  Low starting mRNA that cause a transcript to be ‘missed’ during the initial reverse transcription step, and hence not

being detected during sequencing – cannot be recovered by deeper sequencing! o  Amplification biases o  Low sequencing depth o  Impact clustering by inflating

cell-to-cell dissimilarity

Dropouts

How do we deal with dropouts?

•  Ignore dropouts •  Keep the zeros, and proceed as usual •  Remove rows that have ‘too many’ zeros, then proceed as usual •  Focus on only key ‘marker genes’ that are not excessively affected by zeros

•  Account for the dropouts explicitly through a statistical mixture model •  When performing differential expression analysis (for example), take into account of variance that can be attributed

to excessive zeros (e.g, SCDE; Kharchenko et al. 2014) •  ZIFA: Modified probabilistic principal component analysis (PCA) that incorporate global zero-inflation parameter to

account for dropouts (Pierson et al. 2015) •  Imputation (using a variety of methods)

Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742

Pierson, E. and Yau, Christopher. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241

h/ps://github.com/VCCRI/CIDR

DrPaulLin

S - B f E

S-

Bf

ENo dropout

Gene k

Gen

e - ck

c-c/ cB

cw

cf

cP

cE

S - B f E

S-

Bf

E

With dropout

Gene k

ck

c-c/ cB

cw

cf

cP

cEcB

c/

ck c-

cw cP cf cE

k-

/B

w

No dropout

Hei

ght

ck c/

c- cB cw cP cf cE

S-

Bf

EkS

With dropout

cB

c/ ck c-

cw cP cf cE

S-

Bf

EkS

With dropoutq CIDR dissimilarity

Adjusted Rand Index[ SOkf Adjusted Rand Index[ kOS

Gen

e -

a b

c

CIDR

13.0 54.5 40.6 0.25

3.8 18.5 2.1 0.89

3.4 2.9 19.5

Meansquared distance

Nodropout

Withdropout

:DO.

Shrinkage rate:DO-CIDR.

)DO

Betweenclusters :BC.

Withinclusters :WC.

Ratio :BC)WC.

S - B f E kS k-

SOS

SO-

SOB

SOf

SOE

kOS

Dropout rate function

x

Dro

pout

rate

P:x.W:x.

EuclideanCIDR

S - B f E kS k-

Sw

kSkw

-S

x

:x--xk.=S

Exp

ecte

d di

stan

ce

S - B f E kS k-

SkS

-S/S

BSwS

fS

x

:x--xk.=B

S - B f E kS k-

SOS

SO-

SOB

SOf

SOE

kOS

x

:x--xk.=S:x--xk.=-:x--xk.=B:x--xk.=f:x--xk.=E:x--xk.=kS:x--xk.=k-

[E:D

ata.

-E:C

IDR

.] ) E

:Dat

a.

Expected shrinkage rate

EuclideanCIDR

d

CIDRisfastandaccurate

−200 −100 0 100−150

−100

−50

0

50

100

PC1

PC2

1

2

aprcomp

−40 −20 0 20 40 60

−20

0

20

40

60

80

PC1

PC2

1

23

bt−SNE

−2 −1 0 1 2 3 4 5

−1

0

1

2

3

4

PC1

PC2

1

2

cZIFA

−40 −20 0 20 40 60

−20

0

20

40

60

80

PC1

PC2

1

2

3

dRaceID

−50 0 50

−60

−40

−20

0

20

40

60

PC1

PC2

1 23

4

5

6

eCIDR

prcomp t−SNE ZIFA RaceID CIDRAd

just

ed R

and

Inde

x0.0

0.2

0.4

0.6

0.8

1.0f

astrocytesendothelial

fetal quiescent neuronsfetal replicating neurons

microglianeurons

oligodendrocytesoligodendrocyte precursor cells

Clusters output by algorithms:

Neurons

Astrocytes

Oligodendrocytes

Endothelial

Nuer

on 1

Neur

on 2

Neur

on 3

Astro

cyte

1As

trocy

te 2

Astro

cyte

3O

ligod

endr

ocyt

e 1

Olig

oden

droc

yte

2O

ligod

endr

ocyt

e 3

Endo

thel

ia 1

Endo

thel

ial 2

Endo

thel

ial 3

CIDR

1CI

DR 2

CIDR

3pr

com

p 1/

ZIFA

1/C

IDR

4CI

DR 5

prco

mp

2/ZI

FA 2

/CID

R 6

tSN

E 1

tSN

E 2

tSN

E 3

Race

ID 1

Race

ID 2

Race

ID 3

log(TPM)151050

starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality

•  EnablingwidespreaduseofVRvisualisa;onusinglow-cost($10)VRheadsets,andaperson’sownsmartphone(withawebbrowser)

•  Supportinterac;onusingheadmovement,keyboard,remotegamepad,andvoicecontrol

JianfuLiYuYao

Usingstarmaptovisualiseadatasetof68,000cellsfromascRNA-seqdata

h_ps://www.youtube.com/watch?v=_LLidDFQH8A

Starmapinterac;on

starmapstarmapdemo:h/ps://vccri.github.io/starmap/

starmapsourcecode:h/ps://github.com/VCCRI/starmapbioRxivpreprint:h/ps://www.biorxiv.org/content/early/2018/05/17/324855

DrVictorChangAC1936-1991,PioneeringCardiothoracicSurgeonandHumanitarian

ScalingupclusteringofscRNA-seqdatabyborrowingideasfromflowcytometryanalysis

Clusteringmethods

Xiaoxin(Sean)Ye

Ultrafastgriddensity-basedclusteringforsinglecelldata-FlowGrid

Speedingupclusteringofsinglecelldatafromhourstoseconds

Xiaoxin(Sean)Ye h_ps://github.com/VCCRI/FlowGrid

FlowCap I DataSets

Lymph• DiffuseLargeB-cellLymphoma

• Numberofevents:10197• Numberofdimension:3

StemCell• Hematopoie;cStemCellTransplant

• Numberofevents:9780• Numberofdimension:4

GvHD• GralversusHostDisease• Numberofevents:23377• Numberofdimension:4

Datasource:h_p://flowcap.flowsite.org/codeanddata/

Performance

DataSet Events Dimension FlowGrid FlowSOM FlowPeaks Flock Time(s) ARI Time(s) ARI Time(s) ARI Time(s) ARI

Lymph 10197 3 0.05 0.84 1.27 0.94 0.18 0.90 0.16 0.89 GvHD 23377 4 0.02 0.98 1.73 0.97 0.43 0.97 0.78 0.69

StemCell 9780 4 0.02 0.85 1.39 0.98 0.10 0.96 0.29 0.95

Events(million) Dimension FlowGridTime(s) FlowSOMTime(s) FlowPeaksTime(s) 0.2 4 0.04 5.04 2.98 1.5 4 0.24 33.32 15.11 11.9 4 2.42 303.46 103.99

[email protected]_p://bioinforma;cs.victorchang.edu.au@joshuawkho

Wearecurrentlyrecrui;ng:•  LabHead(Facultyposi;on),

Bioinforma;cs•  PostdoctoralFellow•  ResearchAssistant•  PhDstudents(scholarshipavailable)

Documents

Bioinformatics analysis of single-cell RNA-seq databioinformatics.org.au/winterschool/wp-content/uploads/...• Mouse embryonic stem cell (mESC) data (869 samples) • 200bp paired-end