Dr Victor Chang AC 1936-1991, Pioneering Cardiothoracic Surgeon and Humanitarian
Bioinformatics analysis of single-cell RNA-seq data
Joshua W. K. Ho, PhD Head, Bioinformatics and Systems Medicine Laboratory
Victor Chang Cardiac Research Institute Senior Lecturer (Conjoint), UNSW Sydney
@joshuawkho
2018 Winter School in Mathematical & Computational Biology, University of Queensland, 3 July 2018
Bioinformatics challenges:
• Scalability (>1 million cells)
• Technical noise (dropouts)
RNA-seq alignment and transcript reconstruction
Cloud computing to enable scalability
Cloud computing + Big Data framework
• Cloud computing
  • A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources
  • Key characteristics: elasticity + pay-as-you-go model
  • Advantages: low entry cost + scalability
• Big Data framework
  • Hadoop: a software framework for distributed processing of big data on large-scale clusters (YARN for resource management, HDFS for big data storage, and MapReduce as the analytics engine)
  • Spark: a general-purpose data analytics engine for analysis of big data using in-memory computation (allows a speed-up of up to 100x compared to MapReduce)
Existing tools
• Halvade (https://github.com/biointec/halvade)
  • Written in Hadoop MapReduce
  • Designed to perform variant calling of genomic data from FASTQ files
  • Provides support for transcriptomic analysis
• SparkBWA (https://github.com/citiususc/SparkBWA)
  • Written in Spark
  • Designed to perform alignment of FASTQ files only
• SparkSeq (https://bitbucket.org/mwiewiorka/sparkseq/wiki/Home)
  • Written in Spark
  • Designed to perform interactive analysis of BAM files
• Limitations:
  • Halvade and SparkBWA do not offer multi-sample analysis
  • SparkSeq does not perform alignment, which is the main bottleneck in analysis
Falco framework
Figure: Falco framework (MapReduce, Spark)
Andrian Yang, Michael Troup
Yang et al. (2017) Bioinformatics
Falco framework features
Ease of use
• Falco provides a helper script to launch an EMR cluster and submit jobs to the cluster
• Users can easily configure the cluster and jobs by modifying the configuration file passed to the helper script
Customisation
• Falco allows users to add custom alignment and/or quantification tools
• Users will need to implement a custom function to call the aligner/quantification tool
• Custom tools must be compatible with the divide-and-conquer approach
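The divide-and-conquer requirement can be illustrated with a minimal sketch: the input is split into chunks, each chunk is processed independently, and the partial results are merged. This is only an illustration of the pattern, not Falco's code. The function names (`split_into_chunks`, `count_chunk`, `merge_counts`) and the toy list of gene labels are hypothetical stand-ins for real read chunks and a real aligner/quantifier.

```python
from collections import Counter

def split_into_chunks(records, chunk_size):
    """Divide: cut the input into independent chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def count_chunk(chunk):
    """Conquer: hypothetical per-chunk step standing in for align + quantify.
    Here each record is already a gene label, so we simply count labels."""
    return Counter(chunk)

def merge_counts(partials):
    """Combine: sum the per-chunk counts into one table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Toy input: one gene label per "read" (illustrative data only).
reads = ["Actb", "Gapdh", "Actb", "Nanog", "Actb", "Gapdh", "Pou5f1"]
chunks = split_into_chunks(reads, chunk_size=3)
partials = [count_chunk(chunk) for chunk in chunks]
gene_counts = merge_counts(partials)
print(dict(gene_counts))
```

A tool fits this pattern only if running it on chunks and merging gives the same answer as running it on the whole input, which holds for per-read counting but not for every analysis.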
[job_config]
name = mESC analysis job
action_on_failure = CONTINUE
analysis_script = run_pipeline_multiple_files.py
analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts
analysis_script_local_location = source/spark_runner
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
annotation_file = vM9_ERCC.gtf
strand_specificity = NONE
run_picard = True
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
picard_extra_args =
region = us-west-2
Sample configuration for running analysis job
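As a rough illustration of how an INI-style file like this can be read, the sketch below parses a shortened copy of the sample with Python's standard `configparser`. It assumes the `!` marks in the slide text are line breaks and that the file follows ordinary INI syntax. It is not Falco's own loader, and the variable names are illustrative.

```python
import configparser

# Shortened copy of the sample configuration (subset of keys only).
CONFIG_TEXT = """
[job_config]
name = mESC analysis job
action_on_failure = CONTINUE
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
run_picard = True
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

args = parser["script_arguments"]
aligner = args["aligner_tool"]                   # plain string value
run_picard = args.getboolean("run_picard")       # "True" parsed as a bool
counter_args = args["counter_extra_args"].split()  # extra flags as a list

print(aligner, run_picard, counter_args)
```

Empty values such as `aligner_extra_args =` come back as empty strings, so a caller can pass them through without special handling.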
Benchmarking
• Single-cell RNA-seq data sets
  • Mouse embryonic stem cell (mESC) data (869 samples)
    • 200bp paired-end reads, 1.28×10^12 bases, 1.02 Tb FASTQ.gz files
  • Human brain data (466 samples)
    • 100bp paired-end reads, 2.95×10^11 bases, 213.66 Gb FASTQ.gz files
• Performance comparison of Falco against single-node analysis
  • STAR+featureCounts (S+F)
    • Mouse: speedup of 2.6x to 33.4x
    • Brain: speedup of 5.1x to 145.4x
  • HISAT2+HTSeq (H+H)
    • Mouse: speedup of 2.5x to 58.4x
    • Brain: speedup of 4.0x to 132.5x
System      Nodes              Mouse ESC S+F (hours)   Mouse ESC H+H (hours)
Standalone  1 (1 process)      93.7                    154.7
Standalone  1 (5 processes)    29.3                    33.8
Standalone  1 (12 processes)   21.1                    16.4
Standalone  1 (16 processes)   18.5                    13.6
Falco       10                 7.0                     2.7
Falco       20                 4.1                     1.6
Falco       30                 3.3                     1.4
Falco       40                 2.8                     1.1

Human brain (hours): the S+F and H+H values are not recoverable from the source.
Table 1. Runtime analysis of single-cell data sets
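The quoted speedups are ratios of standalone runtime to Falco runtime. A minimal sketch of that arithmetic for the mouse ESC data with STAR+featureCounts, using the rounded hours from Table 1 (the helper name `speedup` is illustrative):

```python
def speedup(baseline_hours, improved_hours):
    """Speedup factor: how many times faster the improved run is."""
    return baseline_hours / improved_hours

# Mouse ESC, STAR+featureCounts runtimes in hours (from Table 1).
standalone = {"1 process": 93.7, "16 processes": 18.5}
falco = {10: 7.0, 40: 2.8}

# Smallest gain: best standalone setup vs smallest Falco cluster.
smallest = speedup(standalone["16 processes"], falco[10])
# Largest gain: single-process standalone vs largest Falco cluster.
largest = speedup(standalone["1 process"], falco[40])

print(f"speedup range: {smallest:.1f}x to {largest:.1f}x")
```

Because the table values are rounded to one decimal place, the computed range agrees with the reported 2.6x to 33.4x only approximately.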
Cost effectiveness by using AWS spot instances
Utilising spot instances
• AWS allows utilisation of unused Amazon computing capacity, known as Spot instances
  • Typically cheaper than the ‘on-demand’ cost
  • To use a spot instance, the user needs to bid for the resource
• Use of spot instances for analysis provides savings of ~65% compared to using ‘on-demand’ instances
• Alternative use: decrease runtime by utilising more instances for a given ‘on-demand’ price
Figure 3. Spot instance price history for September to October
Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount

Dataset        Nodes   Time (hours)   On-demand cost (USD)   Spot cost (USD)   % Savings
Mouse - ESC    10      8              247.20                 85.67             65.34
Mouse - ESC    20      5              301.00                 99.09             67.08
Mouse - ESC    30      4              258.00                 115.71            55.15
Mouse - ESC    40      3              356.40                 114.11            67.98
Human - brain  10      3              92.70                  32.13             65.34
Human - brain  20      2              120.40                 39.64             67.08
Human - brain  30      2              179.00                 57.68             67.68
Human - brain  40      2              237.60                 76.08             67.98
Table 3. Falco cost analysis - on-demand vs spot instances for HISAT2+HTSeq

Dataset        Nodes   Time (hours)   On-demand cost (USD)   Spot cost (USD)   % Savings
Mouse - ESC    10      12             370.80                 128.40            65.37
Mouse - ESC    20      7              421.40                 138.60            67.11
Mouse - ESC    30      5              447.50                 144.50            67.71
Mouse - ESC    40      4              475.20                 152.00            68.01
Human - brain  10      5              154.50                 53.50             65.37
Human - brain  20      3              180.60                 59.40             67.11
Human - brain  30      2              179.00                 57.80             67.71
Human - brain  40      2              237.60                 76.00             68.01
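The % Savings column is the fraction of the on-demand cost that is avoided by paying the spot price. A small sketch of that calculation, using two mouse ESC rows from the HISAT2+HTSeq cost table (the function name is illustrative):

```python
def spot_savings_percent(on_demand_usd, spot_usd):
    """Percentage of the on-demand cost saved by using spot pricing."""
    return (on_demand_usd - spot_usd) / on_demand_usd * 100

# (nodes, on-demand cost in USD, spot cost in USD) for mouse ESC, HISAT2+HTSeq.
rows = [(10, 370.80, 128.40), (40, 475.20, 152.00)]

savings = {nodes: spot_savings_percent(on_demand, spot)
           for nodes, on_demand, spot in rows}

for nodes, percent in savings.items():
    print(f"{nodes} nodes: {percent:.2f}% saved")
```

Both rows come out close to the tabulated 65.37% and 68.01%, in line with the ~65% saving quoted for spot instances.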
Scaling up to a larger data set
Data set (for Standalone + Falco)
• Single-cell mouse oligodendrocytes from the central nervous system (SRP066613)
• 6,283 samples of 50bp single-end reads, totalling 231.02 Gbp, stored in 200 Gb of fastq.gz files
Standalone + Falco
• Preprocessing with Trimmomatic
• Alignment with STAR
• Quantification with featureCounts
• Clustering with CIDR
Cell Ranger: pipeline designed by 10x Genomics for the Chromium platform
• Alignment with STAR
• Timing is approximated from the runtime of a different mouse scRNA-seq data set
Figure: Number of cells processed per second for Standalone (1 process, 12 processes, 16 processes), Cell Ranger, and Falco (10 nodes, 40 nodes).
Next step – using Falco for transcript reconstruction
Andrian Yang, Abhinav Kishore
Discovery of novel transcript isoforms in published data
• Identification of novel transcripts and isoforms
Availability
Source code
• Falco is available to download from GitHub
• Our work on Falco has been featured in a Nature Toolbox article
Check out Falco at github.com/VCCRI/Falco
Technical noise in scRNA-seq: Dropouts
Figure 1 (a) Types of cell-to-cell variability observed in single-cell RNA-seq measurements. A smoothed scatter plot compares gene expression estimates from two cells of the same type (MEF cells), illustrating prevalence of dropout events, over-dispersion, and high-magnitude outliers.
Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742
The dropout problem in scRNA-seq data analysis
Fig. 1b. Heat maps showing the relationship between dropout rate and mean non-zero expression level for three published single-cell data sets including an approximate double exponential model fit.
Pierson, E. and Yau, C. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241
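The relationship in the heat maps, where genes with higher mean expression drop out less often, can be sketched with the double exponential form used by ZIFA, in which the dropout probability is exp(-λx²) for a log-scale expression level x. The decay value 0.1 below is an arbitrary assumption chosen for illustration, not a fitted parameter from any data set.

```python
import math

def dropout_probability(log_expression, decay=0.1):
    """Double exponential dropout model: p = exp(-decay * x^2).
    `decay` (lambda) is an assumed illustrative value, not a fitted one."""
    return math.exp(-decay * log_expression ** 2)

levels = [0.0, 2.0, 4.0, 6.0]
probabilities = [dropout_probability(x) for x in levels]

for x, p in zip(levels, probabilities):
    print(f"log expression {x:.1f}: dropout probability {p:.3f}")
```

Under this model a gene with zero mean expression is always observed as zero, and the dropout probability falls rapidly as expression increases.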
Dropouts
1. What is the cause of zero read counts?
• Biological reason:
  o True non-expression: stochastic variability due to cell-to-cell variation (transcriptional bursts)
• Technical reason (dropout):
  o Low starting mRNA amount that causes a transcript to be ‘missed’ during the initial reverse transcription step, and hence not detected during sequencing (cannot be recovered by deeper sequencing!)
  o Amplification biases
  o Low sequencing depth
• Dropouts impact clustering by inflating cell-to-cell dissimilarity
Dropouts
How do we deal with dropouts?
• Ignore dropouts
  • Keep the zeros, and proceed as usual
  • Remove rows that have ‘too many’ zeros, then proceed as usual
  • Focus on only key ‘marker genes’ that are not excessively affected by zeros
• Account for the dropouts explicitly through a statistical mixture model
  • When performing differential expression analysis (for example), take into account the variance that can be attributed to excessive zeros (e.g. SCDE; Kharchenko et al. 2014)
  • ZIFA: modified probabilistic principal component analysis (PCA) that incorporates a global zero-inflation parameter to account for dropouts (Pierson et al. 2015)
• Imputation (using a variety of methods)
Kharchenko, P.V., Silberstein, L. and Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nature Methods. 2014; 11(7):740-742
Pierson, E. and Yau, C. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biology. 2015; 16:241
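The simplest of these options, removing rows with ‘too many’ zeros, might look like the sketch below. The 50% cut-off, the function name and the toy count matrix are all assumptions for illustration. The slides do not specify a threshold.

```python
def filter_sparse_genes(counts, max_zero_fraction=0.5):
    """Keep genes (rows) whose fraction of zero counts across cells does not
    exceed max_zero_fraction. The default threshold is an arbitrary choice."""
    kept = {}
    for gene, values in counts.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[gene] = values
    return kept

# Toy gene-by-cell count matrix (illustrative values only).
counts = {
    "GeneA": [5, 0, 7, 3],   # 25% zeros
    "GeneB": [0, 0, 0, 2],   # 75% zeros
    "GeneC": [1, 4, 0, 0],   # 50% zeros
}

filtered = filter_sparse_genes(counts, max_zero_fraction=0.5)
print(list(filtered))
```

Filtering of this kind reduces the number of zeros but cannot tell a dropout from true non-expression, which is the motivation for the model-based and imputation approaches listed above.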
https://github.com/VCCRI/CIDR
Dr Paul Lin
Figure: CIDR toy example.
(a) Expression of Gene 1 vs Gene 2 for cells c1 to c8, with no dropout and with dropout.
(b) Hierarchical clustering (height): no dropout; with dropout (Adjusted Rand Index: 0.16); with dropout, CIDR dissimilarity (Adjusted Rand Index: 1.0).
(c) Mean squared distance:

                        No dropout   With dropout (DO)   CIDR   Shrinkage rate (DO-CIDR)/DO
Between clusters (BC)   13.0         54.5                40.6   0.25
Within clusters (WC)    3.8          18.5                2.1    0.89
Ratio (BC/WC)           3.4          2.9                 19.5

(d) Dropout rate function: dropout rate P(x) and W(x) against x. Expected distance against x for (x2-x1) = 0 and (x2-x1) = 4, Euclidean vs CIDR. Expected shrinkage rate [E(Data)-E(CIDR)]/E(Data) against x for (x2-x1) = 0, 2, 4, 6, 8, 10, 12.
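The toy-example summary statistics can be recomputed from the mean squared distances. The shrinkage rate is (DO - CIDR)/DO and the separation ratio is BC/WC, where DO is the distance with dropout, BC is between clusters and WC is within clusters. The sketch below uses the rounded values from the toy example. The dictionary layout and function name are illustrative.

```python
# Mean squared distances from the toy example (rounded values).
between = {"no_dropout": 13.0, "dropout": 54.5, "cidr": 40.6}
within = {"no_dropout": 3.8, "dropout": 18.5, "cidr": 2.1}

def shrinkage_rate(with_dropout, with_cidr):
    """Fraction by which CIDR shrinks the dropout-inflated distance."""
    return (with_dropout - with_cidr) / with_dropout

shrink_between = shrinkage_rate(between["dropout"], between["cidr"])
shrink_within = shrinkage_rate(within["dropout"], within["cidr"])

# Ratio of between-cluster to within-cluster distance for each condition.
ratios = {key: between[key] / within[key] for key in between}

print(f"shrinkage: between {shrink_between:.2f}, within {shrink_within:.2f}")
print({key: round(value, 1) for key, value in ratios.items()})
```

CIDR shrinks within-cluster distances much more than between-cluster distances, so the BC/WC ratio rises well above its value with dropout. The results match the tabulated figures only up to rounding of the inputs.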
CIDR is fast and accurate
Figure: PC1 vs PC2 scatter plots for (a) prcomp, (b) t-SNE, (c) ZIFA, (d) RaceID and (e) CIDR, with numbered clusters; (f) Adjusted Rand Index of prcomp, t-SNE, ZIFA, RaceID and CIDR. Cell types: astrocytes, endothelial, fetal quiescent neurons, fetal replicating neurons, microglia, neurons, oligodendrocytes, oligodendrocyte precursor cells.
Figure: Heat map of log(TPM) for the clusters output by the algorithms. Rows: Neuron 1-3, Astrocyte 1-3, Oligodendrocyte 1-3, Endothelial 1-3. Columns: CIDR 1, CIDR 2, CIDR 3, prcomp 1/ZIFA 1/CIDR 4, CIDR 5, prcomp 2/ZIFA 2/CIDR 6, tSNE 1-3, RaceID 1-3.
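The Adjusted Rand Index (ARI) used to score the clusterings compares a predicted partition with reference labels, correcting for chance agreement. The function below is a self-contained sketch of the standard pair-counting formula. It is not taken from the CIDR code, and the example labels are made up.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0

    # Pairs of items that share a cell of the contingency table, a row, or a column.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs   # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2            # best achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (sum_cells - expected) / (maximum - expected)

truth = [0, 0, 1, 1]
same_up_to_renaming = adjusted_rand_index(truth, [1, 1, 0, 0])
crossed = adjusted_rand_index(truth, [0, 1, 0, 1])
print(same_up_to_renaming, crossed)
```

Cluster names do not matter, only which items are grouped together, so a relabelled but identical partition scores 1, while a partition that cuts across the reference scores below 0.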
starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality
• Enabling widespread use of VR visualisation using low-cost ($10) VR headsets and a person’s own smartphone (with a web browser)
• Support interaction using head movement, keyboard, remote gamepad, and voice control
Jianfu Li, Yu Yao
Using starmap to visualise a data set of 68,000 cells from scRNA-seq data
https://www.youtube.com/watch?v=_LLidDFQH8A
starmap interaction
starmap
starmap demo: https://vccri.github.io/starmap/
starmap source code: https://github.com/VCCRI/starmap
bioRxiv preprint: https://www.biorxiv.org/content/early/2018/05/17/324855
Scaling up clustering of scRNA-seq data by borrowing ideas from flow cytometry analysis
Clustering methods
Xiaoxin (Sean) Ye
Ultrafast grid density-based clustering for single cell data - FlowGrid
Speeding up clustering of single cell data from hours to seconds
Xiaoxin (Sean) Ye https://github.com/VCCRI/FlowGrid
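The general idea behind grid density-based clustering can be sketched as follows: assign points to equal-sized grid cells, keep the cells that contain enough points, and join adjacent dense cells into clusters. This toy version is written only to illustrate the concept and is not FlowGrid's algorithm or API. The function name, parameters (`bin_width`, `min_count`) and data points are all assumptions.

```python
from collections import Counter, deque
from itertools import product

def grid_density_cluster(points, bin_width=1.0, min_count=2):
    """Toy grid density clustering. Returns one label per point;
    -1 marks points that fall in sparse (non-dense) grid cells."""
    # Map every point to the integer coordinates of its grid cell.
    cells = [tuple(int(x // bin_width) for x in p) for p in points]
    counts = Counter(cells)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    dims = len(points[0]) if points else 0
    # Offsets to all neighbouring cells (including diagonals), excluding self.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]

    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        # Breadth-first search over adjacent dense cells.
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for offset in offsets:
                neighbour = tuple(c + o for c, o in zip(current, offset))
                if neighbour in dense and neighbour not in cell_label:
                    cell_label[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1
    return [cell_label.get(cell, -1) for cell in cells]

points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.4, 0.1),
          (8.1, 8.3), (8.6, 8.2), (4.5, 4.5)]
labels = grid_density_cluster(points, bin_width=1.0, min_count=2)
print(labels)
```

The cost of the search depends on the number of occupied cells and not on the number of points, which is one reason grid-based methods can scale to millions of events in low dimensions.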
FlowCAP I data sets
Lymph
• Diffuse Large B-cell Lymphoma
• Number of events: 10197
• Number of dimensions: 3
StemCell
• Hematopoietic Stem Cell Transplant
• Number of events: 9780
• Number of dimensions: 4
GvHD
• Graft versus Host Disease
• Number of events: 23377
• Number of dimensions: 4
Data source: http://flowcap.flowsite.org/codeanddata/
Performance

                                FlowGrid         FlowSOM          FlowPeaks        Flock
DataSet    Events   Dimension   Time(s)   ARI    Time(s)   ARI    Time(s)   ARI    Time(s)   ARI
Lymph      10197    3           0.05      0.84   1.27      0.94   0.18      0.90   0.16      0.89
GvHD       23377    4           0.02      0.98   1.73      0.97   0.43      0.97   0.78      0.69
StemCell   9780     4           0.02      0.85   1.39      0.98   0.10      0.96   0.29      0.95

Events (million)   Dimension   FlowGrid Time(s)   FlowSOM Time(s)   FlowPeaks Time(s)
0.2                4           0.04               5.04              2.98
1.5                4           0.24               33.32             15.11
11.9               4           2.42               303.46            103.99
[email protected]
http://bioinformatics.victorchang.edu.au
@joshuawkho
We are currently recruiting:
• Lab Head (Faculty position), Bioinformatics
• Postdoctoral Fellow
• Research Assistant
• PhD students (scholarship available)