Single-cell RNA-seq analysis
Winter School on Mathematical and Computational Biology 2019, UQ, 2 July 2019
Dr Joshua W. K. Ho, Associate Professor, School of Biomedical Sciences, The University of Hong Kong
Dr Kitty Lo, Dr Pengyi Yang, Prof Jean Yang
School of Mathematics and Statistics, University of Sydney, Sydney, Australia
• Group minions based on their similarity of physical appearance: clustering
• Identifying distinguishing features between different groups of minions: differential expression analysis
Example – diverse cell types in the mouse nervous system
Zeisel (2018), Cell
Exponential growth in single-cell RNA-seq technologies
Svensson et al., Nature Protocols (2018)
Droplet-based technologies are now dominating
Macosko et al. (2015), Cell
10X Genomics is a commercial provider of a droplet-based scRNA-seq platform
scRNA-seq experiments approaching 1 million cells
Saunders et al. (2018), Cell
690,000 individual cells from 9 regions of adult mouse brain
Number of scRNA-seq tools also increasing rapidly
Downloaded from www.scrna-tools.org
Steps in scRNA-seq analysis
Zappia et al. (2018)
Software
• Cell Ranger for 10X Genomics data
  • https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome
• Seurat (all-purpose single cell R package)
  • https://satijalab.org/seurat/
• Scanpy (a Python package)
  • https://scanpy.readthedocs.io/
• Follow their online tutorials: easy to use
Batch effect removal
Batch effect removal
• Seurat (all-purpose single cell R package) for very basic normalization
• Batch effect correction
  • mnnCorrect
  • ZINB-WaVE
  • scMerge
E9.5 E10.5 E11.5 E12.5 E13.5 E14.5 E15.5 E16.5 E17.5
GSE87795, Su et al.
GSE90047, Yang et al.
GSE87038, Dong et al.
GSE96981, Camp et al.
N = 320 cells
N = 389 cells
N = 79 cells
N = 448 cells
Liver fetal development time course datasets
tSNE of liver fetal development time course datasets
Highlighted by cell types / Highlighted by batches
Challenge: strong “batch effect”
scMerge
scMerge R package and website: https://sydneybiox.github.io/scMerge/
PNAS: https://doi.org/10.1073/pnas.1820006116
Coming back to our motivational data: liver fetal development time course datasets
[Figure: tSNE plots (tSNE1 vs tSNE2) of the liver datasets before scMerge (logcounts) and after scMerge (scMerge_scSEG), coloured by cell type (cholangiocyte, endothelial cell, epithelial cell, hematopoietic, hepatoblast/hepatocyte, immune cell, mesenchymal cell, stellate cell) and by batch (GSE87038, GSE87795, GSE90047, GSE96981)]
E10.5 hepatoblasts
E17.5 cholangiocytes
E17.5 hepatocytes
Cell assignment
Science questions
• What cell types are present in the dataset?
• Can we identify the cell types?
• What is the cell type composition?
• Are the cells transitioning from one state to another?
Analysis techniques
• Clustering (unsupervised learning)
• Classification (supervised learning)
• Dimension reduction
Dimension reduced plot of our data (tSNE plot)
[Figure: t-SNE plot (tsne1 vs tsne2)]
How many cell types are there? What are the cell types?
k-means clustering
[Figure: t-SNE plot (tsne1 vs tsne2)]
How many cell types are there? What are the cell types?
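A minimal sketch of the k-means idea, assuming two-dimensional coordinates such as those on a t-SNE plot. The data are synthetic and the function name `kmeans` is illustrative; it is not the implementation behind the slide, and in practice packages such as Seurat or Scanpy would be used.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Synthetic example: two well-separated groups of 50 "cells" each.
data_rng = random.Random(1)
blob_a = [(data_rng.gauss(-10, 1), data_rng.gauss(-10, 1)) for _ in range(50)]
blob_b = [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(50)]
labels, centroids = kmeans(blob_a + blob_b, k=2)
print(len(set(labels)), "clusters found")
```

Note that k must be chosen in advance, which is exactly the "how many cell types are there?" question on the slide.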
Clustering algorithms
k-means
Hierarchical
RaceID
SC3
CIDR
countClust
RCA
SIMLR
Luke Zappia, et al. PLoS Comp. Bio. 2018
25%+
Clustering algorithms in single cell research
Phase 4: Gene identification
Science questions
• Which genes are differentially expressed between
cell types?
• What are the marker genes for each cell type?
Differences between single cell and bulk RNAseq
• Single-cell gene expression shows a bimodal expression pattern: abundant genes are either highly expressed or undetected.
• This can be technical (drop-outs).
  • Drop-outs lead to technical zeroes in the data.
  • Technical zeroes are due to low capture efficiency in scRNA-seq experiments.
• Many methods have been proposed to deal with drop-outs.
Differential expression analysis
• Simple statistical test
  • Wilcoxon rank test, t-test
• Methods developed for bulk RNA-seq DE
  • DESeq2
  • edgeR
  • limma-voom
• scRNA-seq specific
  • Seurat
  • MAST
  • DECENT
  • D3E
  • ...many more!
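As an illustration of the simplest option in the list, here is a sketch of a two-sided Wilcoxon rank-sum test for one gene between two cell types. It uses the normal approximation with a tie correction, which matters because drop-outs produce many tied zeroes. The function name and the expression values are made up for illustration; real analyses would normally rely on an established statistics package.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Normal approximation with midranks for ties and a tie-corrected
    variance; no continuity correction. Returns (U statistic of x, p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        midrank = (i + 1 + j) / 2      # average of ranks i+1 .. j
        for idx in range(i, j):
            ranks[idx] = midrank
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                     # all values identical
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


# Hypothetical log-expression of one gene; zeroes mimic drop-outs.
type_a = [0, 0, 2.1, 3.4, 2.8, 0, 3.9, 2.5]
type_b = [0, 0, 0, 0.4, 0, 0, 0.9, 0]
u_stat, p_value = rank_sum_test(type_a, type_b)
print(f"U = {u_stat}, p = {p_value:.3f}")
```

In a real analysis this test is repeated for every gene, so the resulting p-values need multiple-testing correction.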
DE methods comparisons for scRNAseq
Soneson and Robinson (2018), Nature Methods
LKS Faculty of Medicine
Making scRNA-seq analysis more scalable
Cloud computing to enable scalability
• Cloud computing + Big Data framework
• Cloud computing
  • A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources
  • Key characteristics: elasticity + pay-as-you-go model
  • Advantages: low entry cost + scalability
• Big Data framework
  • Hadoop: a software framework for distributed processing of big data in large-scale clusters (YARN for resource management, HDFS for big data storage, and MapReduce as the analytics engine)
  • Spark: a general-purpose data-analytics engine for analysis of big data using in-memory computation (allows a speed-up of up to 100x compared to MapReduce)
Falco framework
MapReduce Spark
Andrian Yang, Michael Troup
Yang et al. (2017), Bioinformatics
Falco framework features
• Ease of use
  • Falco provides a helper script to launch an EMR cluster and submit jobs to the cluster
  • The user can easily configure the cluster and jobs by modifying the configuration file passed to the helper script
• Customisation
  • Falco allows the user to add custom alignment and/or quantification tools
  • The user will need to implement a custom function to call the aligner/quantification tool
  • The custom tool must be compatible with the divide-and-conquer approach
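To make the divide-and-conquer requirement concrete, the sketch below splits the input into chunks, processes each chunk independently, and merges the partial results. All names (`split_into_chunks`, `count_chunk`, `merge_counts`) and the toy reads are hypothetical; this is not Falco's API, only the general pattern a compatible tool has to fit.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Divide: cut the input into independent pieces."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_chunk(chunk):
    """Stand-in for a per-chunk quantification tool: count reads per gene."""
    return Counter(gene for _read_id, gene in chunk)


def merge_counts(left, right):
    """Conquer: partial gene counts combine by simple addition."""
    return left + right


# Toy input: (read id, gene the read was assigned to).
reads = [("r1", "Actb"), ("r2", "Gapdh"), ("r3", "Actb"),
         ("r4", "Sox2"), ("r5", "Actb")]

chunks = split_into_chunks(reads, chunk_size=2)
partial_counts = [count_chunk(c) for c in chunks]   # each could run on a different node
total_counts = reduce(merge_counts, partial_counts, Counter())
print(dict(total_counts))
```

The merged result is the same as counting all reads at once, which is the property that lets the per-chunk work be distributed.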
[job_config]
name = mESC analysis job
action_on_failure = CONTINUE
analysis_script = run_pipeline_multiple_files.py
analysis_script_s3_location = s3://[YOUR-BUCKET]/scripts
analysis_script_local_location = source/spark_runner
upload_analysis_script = True

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
output_location = s3://[YOUR-BUCKET]/mESC_gene_counts
annotation_file = vM9_ERCC.gtf
strand_specificity = NONE
run_picard = True
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
picard_extra_args =
region = us-west-2
Sample configuration for running analysis job
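The sample configuration uses an INI-style layout of sections and `key = value` pairs. As a hedged illustration, a file of this shape can be read with Python's standard `configparser` module. Whether Falco itself parses its configuration this way is an assumption, and the excerpt below reproduces only some of the keys.

```python
import configparser

# Excerpt of the sample job configuration (subset of keys).
SAMPLE_CONFIG = """\
[job_config]
name = mESC analysis job
action_on_failure = CONTINUE

[spark_config]
driver_memory = 30g
executor_memory = 30g

[script_arguments]
input_location = s3://[YOUR-BUCKET]/mESC_clean
aligner_tool = STAR
aligner_extra_args =
counter_tool = featureCount
counter_extra_args = -t exon -g gene_name
run_picard = True
region = us-west-2
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

args = config["script_arguments"]
aligner = args["aligner_tool"]
run_picard = args.getboolean("run_picard")          # "True" -> True
aligner_extra = args["aligner_extra_args"].split()  # empty value -> []
counter_extra = args["counter_extra_args"].split()
driver_memory = config["spark_config"]["driver_memory"]

print(aligner, run_picard, aligner_extra, counter_extra, driver_memory)
```

Keys left empty, such as `aligner_extra_args`, come back as empty strings, so the tool runs with its default arguments.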
Benchmarking
• Single-cell RNA-seq data sets
  • Mouse embryonic stem cell (mESC) data (869 samples)
    • 200 bp paired-end reads, 1.28×10^12 bases, 1.02 Tb FASTQ.gz files
  • Human brain data (466 samples)
    • 100 bp paired-end reads, 2.95×10^11 bases, 213.66 Gb FASTQ.gz files
• Performance comparison of Falco against a single-node (standalone) setup
  • STAR+featureCount (S+F)
    • Mouse: speedup of 2.6x to 33.4x
    • Brain: speedup of 5.1x to 145.4x
  • HISAT2+HTSeq (H+H)
    • Mouse: speedup of 2.5x to 58.4x
    • Brain: speedup of 4.0x to 132.5x
System      Nodes             Mouse - embryonic stem cell (hours)   Human - brain (hours)
                              S+F       H+H                         S+F       H+H
Standalone  1 (1 process)     93.7      154.7                       85.67     65.34
Standalone  1 (5 processes)   29.3      33.8                        99.09     67.08
Standalone  1 (12 processes)  21.1      16.4                        115.71    55.15
Standalone  1 (16 processes)  18.5      13.6                        114.11    67.98
Falco       10                7.0       2.7                         32.13     65.34
Falco       20                4.1       1.6                         39.64     67.08
Falco       30                3.3       1.4                         57.68     67.68
Falco       40                2.8       1.1                         76.08     67.98
Table 1. Runtime analysis of single cell datasets
Cost effectiveness using AWS spot instances
• Utilising spot instances
  • AWS allows utilisation of unused Amazon computing capacity, known as Spot instances
  • Typically cheaper than the ‘on-demand’ cost
  • To use a spot instance, the user needs to bid for the resource
• Use of spot instances for analysis provides a saving of ~65% compared to using ‘on-demand’ instances
  • Alternative use: decrease runtime by utilising more instances for a given ‘on-demand’ price
Figure 3. Spot instance price history for September to October
Table 2. Falco cost analysis - on-demand vs spot instances for STAR+featureCount
Dataset         Number of nodes   Time (hours)   On-demand cost (USD)   Spot cost (USD)   % Savings
Mouse - ESC     10                8              247.20                 85.67             65.34
Mouse - ESC     20                5              301.00                 99.09             67.08
Mouse - ESC     30                4              258.00                 115.71            55.15
Mouse - ESC     40                3              356.40                 114.11            67.98
Human - brain   10                3              92.70                  32.13             65.34
Human - brain   20                2              120.40                 39.64             67.08
Human - brain   30                2              179.00                 57.68             67.68
Human - brain   40                2              237.60                 76.08             67.98
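The "% Savings" column follows from the two cost columns as (on-demand cost minus spot cost) divided by on-demand cost. A small worked sketch using the 10-node rows of Table 2; the function name is illustrative only.

```python
def percent_savings(on_demand_usd, spot_usd):
    """Percentage saved by paying the spot price instead of the on-demand price."""
    return 100.0 * (on_demand_usd - spot_usd) / on_demand_usd


# 10-node rows of Table 2 (STAR+featureCount).
mouse_10_nodes = percent_savings(247.20, 85.67)
human_10_nodes = percent_savings(92.70, 32.13)

print(f"Mouse ESC, 10 nodes:   {mouse_10_nodes:.2f}%")
print(f"Human brain, 10 nodes: {human_10_nodes:.2f}%")
```

Both rows come out at about 65%, in line with the ~65% saving quoted for spot instances.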
Scaling up to a larger data set
• Data set (for Standalone + Falco)
  • Single-cell mouse oligodendrocytes from the central nervous system (SRP066613)
  • 6,283 samples of 50 bp single-end reads, totalling 231.02 Gbp, stored in 200 Gb of fastq.gz files
• Standalone + Falco
  • Preprocessing with Trimmomatic
  • Alignment with STAR
  • Quantification with featureCount
  • Clustering with CIDR
• Cell Ranger: custom pipeline designed by 10x Genomics for the Chromium platform
  • Alignment with STAR
  • Timing is approximated from the runtime of a different mouse scRNA-seq dataset
[Figure: number of cells processed per second for Standalone (1, 12 and 16 processes), Cell Ranger, and Falco (10 and 40 nodes)]
Falco software
• Source code
  • Falco is available to download from GitHub
  • Our work on Falco has been featured in a Nature Toolbox article
Check out Falco at github.com/VCCRI/Falco
starmap: Immersive 3D visualisation of single cell data using smartphone-enabled virtual reality
• Enabling widespread use of VR visualisation using low-cost ($10) VR headsets and a person’s own smartphone (with a web browser)
• Support interaction using head movement, keyboard, remote gamepad, and voice control
Jianfu Li, Yu Yao
Using starmap to visualise a data set of 68,000 cells from scRNA-seq data
starmap
• starmap demo: https://vccri.github.io/starmap/
• starmap source code: https://github.com/VCCRI/starmap
• bioRxiv preprint: https://www.biorxiv.org/content/early/2018/05/17/324855
https://www.abacbs.org/giw/
Full-paper submission (for oral presentation and journal publication): This week!
Abstract submission (for oral or poster presentation): 1 September 2019?
THANK YOU
We are recruiting:
- PhD students ($57K pa scholarship)
- Research assistants
- Postdoctoral fellows
- Bioinformaticians (staff)
- Faculty
jwkho@hku.hk
https://holab-hku.github.io/
@joshuawkho
HKU-USydney Strategic Partnership Fund: ‘Single Cell Plus’