75
12 Grand Challenges in Single-Cell Data Science David Lähnemann *,1,2,3 , Johannes Köster *,+,1,4 , Ewa Szczurek *,5 , Davis J. McCarthy *,6,7 , Stephanie C. Hicks *,8 , Mark D. Robinson *,9 , Catalina A. Vallejos *,10,11 , Niko Beerenwinkel *,12,13 , Kieran R. Campbell *,15,16,17 , Ahmed Mahfouz *,18,19 , Luca Pinello *,20,21,22 , Pavel Skums *,23 , Alexandros Stamatakis *,24,25 , Camille Stephan-Otto Attolini *,26 , Samuel Aparicio 16,27 , Jasmijn Baaijens 29 , Marleen Balvert 29,31 , Buys de Barbanson 32,33,34 , Antonio Cappuccio 35 , Giacomo Corleone 36 , Bas E. Dutilh 31,38 , Maria Florescu 32,33,34 , Victor Guryev 41 , Rens Holmer 42 , Katharina Jahn 12,13 , Thamar Jessurun Lobo 41 , Emma M. Keizer 45 , Indu Khatri 46 , Szymon M. Kiełbasa 47 , Jan O. Korbel 48 , Alexey M. Kozlov 24 , Tzu-Hao Kuo 3 , Boudewijn P.F. Lelieveldt 49,50 , Ion I. Mandoiu 51 , John C. Marioni 52,53,54 , Tobias Marschall 55,56 , Felix Mölder 1,59 , Amir Niknejad 60,61 , Łukasz Rączkowski 5 , Marcel Reinders 18,19 , Jeroen de Ridder 32,33 , Antoine-Emmanuel Saliba 62 , Antonios Somarakis 50 , Oliver Stegle 48,54,63 , Fabian J. Theis 67 , Huan Yang 68 , Alex Zelikovsky 69,70 , Alice C. McHardy +,3 , Benjamin J. Raphael +,71 , Sohrab P. Shah +,72 , and Alexander Schönhuth @,+,*,29,31 * Joint first authors, major contributions to manuscript. 1 Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Germany 2 Department of Paediatric Oncology, Haematology and Immunology, Medical Faculty, Heinrich Heine University, University Hospital, Düsseldorf, Germany 3 Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany 4 Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, USA 5 Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics University of Warsaw, Poland 6 Bioinformatics and Cellular Genomics, St Vincent’s Institute of Medical Research, Fitzroy, Australia 7 Melbourne Integrative Genomics, School of BioSciences — School of Mathematics & Statistics, Faculty of Science, University of Melbourne, Australia 8 Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA 9 Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland 10 MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, UK 11 The Alan Turing Institute, British Library, London, UK 12 Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland 13 SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland 15 Department of Statistics, University of British Columbia, Vancouver, Canada 16 Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada 17 Data Science Institute, University of British Columbia, Vancouver, Canada 18 Leiden Computational Biology Center, Leiden University Medical Center, The Netherlands 19 Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands 20 Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital Research Institute, Charlestown, USA 21 Department of Pathology, Harvard Medical School, Boston, USA 22 Broad Institute of Harvard and MIT, Cambridge, MA, USA 23 Department of Computer Science, Georgia State University, Atlanta, USA 24 Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Germany 25 Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Germany 26 Institute for Research in Biomedicine, The Barcelona Institute of Science and Technology, Spain 27 Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada 29 Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands 31 Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, The Netherlands 32 Center for Molecular Medicine, University Medical Center Utrecht, The Netherlands 33 Oncode Institute, Utrecht, The Netherlands 34 Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands 35 Institute for Advanced Study, University of Amsterdam, The Netherlands 36 Department of Surgery and Cancer, The Imperial Centre for Translational and Experimental Medicine, Imperial College London, UK 38 Centre for Molecular and Biomolecular Informatics, Radboud University Medical Center, Nijmegen, The Netherlands 41 European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, The Netherlands 42 Bioinformatics Group, Wageningen University, The Netherlands 45 Biometris, Wageningen University & Research, The Netherlands 46 Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, The Netherlands 47 Department of Biomedical Data Sciences, Leiden University Medical Center, The Netherlands 48 Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany 49 PRB lab, Delft University of Technology, The Netherlands 50 Division of Image Processing, Department of Radiology, Leiden University Medical Center, The Netherlands 51 Computer Science & Engineering Department, University of Connecticut, Storrs, USA 52 Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, UK 53 Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, UK 54 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK 55 Center for Bioinformatics, Saarland University, Saarbrücken, Germany 56 Max Planck Institute for Informatics, Saarbrücken, Germany 59 Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Germany. 60 Computation molecular design, Zuse Institute Berlin, Germany 61 Mathematics department, Mount Saint Vincent, New York, USA 62 Helmholtz Institute for RNA-based Infection Research, Helmholtz-Center for Infection Research, Würzburg, Germany 63 Division of Computational Genomics and Systems Genetics, German Cancer Research Center – DKFZ, Heidelberg, Germany 67 Institute of Computational Biology, Helmholtz Zentrum München – German Research Center for Environmental Health, Neuherberg, Germany 68 Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research – LACDR — Leiden University, The Netherlands 69 Department of Computer Science, Georgia State University, Atlanta, USA 70 The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia 71 Department of Computer Science, Princeton University, USA 72 Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA + Joint last authors, workshop organizers. @ Corresponding author: Alexander Schönhuth, [email protected] PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

12 Grand Challenges in Single-Cell Data

Science

David Lähnemann*,1,2,3, Johannes Köster*,+,1,4, Ewa Szczurek*,5, Davis J. McCarthy*,6,7, Stephanie C. Hicks*,8, Mark D.

Robinson*,9, Catalina A. Vallejos*,10,11, Niko Beerenwinkel*,12,13, Kieran R. Campbell*,15,16,17, Ahmed Mahfouz*,18,19, Luca

Pinello*,20,21,22, Pavel Skums*,23, Alexandros Stamatakis*,24,25, Camille Stephan-Otto Attolini*,26, Samuel Aparicio16,27,

Jasmijn Baaijens29, Marleen Balvert29,31, Buys de Barbanson32,33,34, Antonio Cappuccio35, Giacomo Corleone36, Bas E.

Dutilh31,38, Maria Florescu32,33,34, Victor Guryev41, Rens Holmer42, Katharina Jahn12,13, Thamar Jessurun Lobo41, Emma M.

Keizer45, Indu Khatri46, Szymon M. Kiełbasa47, Jan O. Korbel48, Alexey M. Kozlov24, Tzu-Hao Kuo3, Boudewijn P.F.

Lelieveldt49,50, Ion I. Mandoiu51, John C. Marioni52,53,54, Tobias Marschall55,56, Felix Mölder1,59, Amir Niknejad60,61, Łukasz

Rączkowski5, Marcel Reinders18,19, Jeroen de Ridder32,33, Antoine-Emmanuel Saliba62, Antonios Somarakis50, Oliver

Stegle48,54,63, Fabian J. Theis67, Huan Yang68, Alex Zelikovsky69,70, Alice C. McHardy+,3, Benjamin J. Raphael+,71, Sohrab P.

Shah+,72, and Alexander Schönhuth@,+,*,29,31

*Joint first authors, major contributions to manuscript.1Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of

Duisburg-Essen, Germany2Department of Paediatric Oncology, Haematology and Immunology, Medical Faculty, Heinrich Heine University, University Hospital, Düsseldorf,

Germany3Computational Biology of Infection Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany

4Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, USA5Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics University of Warsaw, Poland

6Bioinformatics and Cellular Genomics, St Vincent’s Institute of Medical Research, Fitzroy, Australia7Melbourne Integrative Genomics, School of BioSciences — School of Mathematics & Statistics, Faculty of Science, University of Melbourne,

Australia8Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA

9Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland10MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, UK

11The Alan Turing Institute, British Library, London, UK12Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

13SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland15Department of Statistics, University of British Columbia, Vancouver, Canada16Department of Molecular Oncology, BC Cancer Agency, Vancouver, Canada17Data Science Institute, University of British Columbia, Vancouver, Canada

18Leiden Computational Biology Center, Leiden University Medical Center, The Netherlands19Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands

20Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital Research Institute, Charlestown, USA21Department of Pathology, Harvard Medical School, Boston, USA

22Broad Institute of Harvard and MIT, Cambridge, MA, USA23Department of Computer Science, Georgia State University, Atlanta, USA

24Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Germany25Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Germany

26Institute for Research in Biomedicine, The Barcelona Institute of Science and Technology, Spain27Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada

29Life Sciences and Health, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands31Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, The Netherlands

32Center for Molecular Medicine, University Medical Center Utrecht, The Netherlands33Oncode Institute, Utrecht, The Netherlands

34Quantitative biology, Hubrecht Institute, Utrecht, The Netherlands35Institute for Advanced Study, University of Amsterdam, The Netherlands

36Department of Surgery and Cancer, The Imperial Centre for Translational and Experimental Medicine, Imperial College London, UK38Centre for Molecular and Biomolecular Informatics, Radboud University Medical Center, Nijmegen, The Netherlands

41European Research Institute for the Biology of Ageing, University Medical Center Groningen, University of Groningen, The Netherlands42Bioinformatics Group, Wageningen University, The Netherlands45Biometris, Wageningen University & Research, The Netherlands

46Department of Immunohematology and Blood Transfusion, Leiden University Medical Center, The Netherlands47Department of Biomedical Data Sciences, Leiden University Medical Center, The Netherlands

48Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany49PRB lab, Delft University of Technology, The Netherlands

50Division of Image Processing, Department of Radiology, Leiden University Medical Center, The Netherlands51Computer Science & Engineering Department, University of Connecticut, Storrs, USA

52Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, UK53Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, UK

54European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK55Center for Bioinformatics, Saarland University, Saarbrücken, Germany

56Max Planck Institute for Informatics, Saarbrücken, Germany59Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Germany.

60Computation molecular design, Zuse Institute Berlin, Germany61Mathematics department, Mount Saint Vincent, New York, USA

62Helmholtz Institute for RNA-based Infection Research, Helmholtz-Center for Infection Research, Würzburg, Germany63Division of Computational Genomics and Systems Genetics, German Cancer Research Center – DKFZ, Heidelberg, Germany

67Institute of Computational Biology, Helmholtz Zentrum München – German Research Center for Environmental Health, Neuherberg, Germany68Division of Drug Discovery and Safety, Leiden Academic Center for Drug Research – LACDR — Leiden University, The Netherlands

69Department of Computer Science, Georgia State University, Atlanta, USA70The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, Russia

71Department of Computer Science, Princeton University, USA72Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA

+Joint last authors, workshop organizers.

@Corresponding author: Alexander Schönhuth, [email protected]

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 2: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

The recent upswing of microfluidics and1

combinatorial indexing strategies, further en-2

hanced by very low sequencing costs, have3

turned single cell sequencing into an em-4

powering technology; analyzing thousands—5

or even millions—of cells per experimental6

run is becoming a routine assignment in lab-7

oratories worldwide. As a consequence, we8

are witnessing a data revolution in single cell9

biology. Although some issues are similar in10

spirit to those experienced in bulk sequencing,11

many of the emerging data science problems12

are unique to single cell analysis; together,13

they give rise to the new realm of ’Single-Cell14

Data Science’.15

Here, we outline twelve challenges that will16

be central in bringing this new field forward.17

For each challenge, the current state of the art18

in terms of prior work is reviewed, and open19

problems are formulated, with an emphasis20

on the research goals that motivate them.21

This compendium is meant to serve as a22

guideline for established researchers, newcom-23

ers and students alike, highlighting interesting24

and rewarding problems in ’Single-Cell Data25

Science’ for the coming years.26

Contents27

1 Introduction 328

2 Single-Cell Data Science:29

Themes and Categories 430

2.1 Varying levels of resolution . . . 531

2.2 Quantifying uncertainty of32

measurements and analysis33

results . . . . . . . . . . . . . . 534

2.3 Scaling to higher dimensionali-35

ties: more cells, more features,36

broader coverage . . . . . . . . 637

2.4 Challenge categories . . . . . . 738

3 Challenges in single-cell tran- 39

scriptomics 7 40

3.1 Challenge I: Handling sparsity 41

in single-cell RNA sequencing . 7 42

3.2 Challenge II: Defining flexible 43

statistical frameworks for dis- 44

covering complex differential 45

patterns in gene expression . . . 13 46

3.3 Challenge III: Mapping single 47

cells to a reference atlas . . . . 15 48

3.4 Challenge IV: Generalizing 49

trajectory inference . . . . . . . 16 50

3.5 Challenge V: Finding patterns 51

in spatially resolved measure- 52

ments . . . . . . . . . . . . . . 18 53

4 Challenges in single-cell genomics 19 54

4.1 Challenge VI: Improving 55

single-cell DNA sequencing 56

data quality and scaling to 57

more cells . . . . . . . . . . . . 20 58

4.2 Challenge VII: Errors and 59

missing data in the identifi- 60

cation of features / variation 61

from single-cell DNA sequenc- 62

ing data. . . . . . . . . . . . . . 23 63

5 Challenges in single-cell phy- 64

logenomics 25 65

5.1 Challenge VIII: Scaling phylo- 66

genetic models to many cells 67

and many sites . . . . . . . . . 26 68

5.2 Challenge IX: Integrating mul- 69

tiple types of features / varia- 70

tion into phylogenetic models . 27 71

5.3 Challenge X: Inferring popula- 72

tion genetic parameters of tu- 73

mor heterogeneity by model in- 74

tegration . . . . . . . . . . . . . 29 75

2

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 3: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

6 Overarching challenges 351

6.1 Challenge XI: Integration of2

single-cell data: across sam-3

ples, experiments and types of4

measurement . . . . . . . . . . 355

6.2 Challenge XII: Validating and6

benchmarking analysis tools7

for single-cell measurements . . 398

7 Acknowledgements 419

1 Introduction10

Since being elevated to “Method of the Year”11

in 2013 [Nature Methods, 2013], sequencing12

of the genetic material of individual cells has13

become routine when investigating cell-to-cell14

heterogeneity. Single-cell measurements of15

both RNA and DNA, and more recently also16

of epigenetic marks and protein levels, can17

stratify cells at the finest resolution possible.18

Single-cell RNA sequencing (scRNA-seq)19

facilitates to distinguish cell states within20

coarser cell type clusters [for an early exam-21

ple, see Anchang et al., 2016], thereby ar-22

ranging populations of cells according to novel23

types of hierarchies. It is also possible to24

identify cells in transition between states, so25

we get a much clearer view on the dynamics26

of tissue and organism development, and on27

structures within cell populations that had so28

far been perceived as homogeneous. Along29

a similar vein, analyses based on single-cell30

DNA sequencing (scDNA-seq) can highlight31

somatic clonal structures [e.g. in cancer, see32

Francis et al., 2014, ?] and are thus helpful33

for tracking the formation of certain cell lin-34

eages and to provide insight into evolutionary35

processes acting on somatic mutations.36

The opportunities arising from single-cell37

sequencing (sc-seq) are enormous: only now38

is it possible to re-evaluate hypotheses about39

differences between pre-defined sample groups40

at the single-cell level—no matter if such 41

sample groups are disease subtypes, treat- 42

ment groups or simply morphologically dif- 43

ferent cell types. It is therefore no surprise 44

that the enthusiasm about the possibility to 45

screen the genetic material of the basic units 46

of life has been continuing to grow: a promi- 47

nent example is the Human Cell Atlas [Regev 48

et al., 2017], an initiative aiming to map the 49

different types and states of cells that a hu- 50

man being is composed of, or Zhang and Liu 51

[2019], as a most recent example of a list of 52

single-cell analysis based opportunities in par- 53

ticular domains such as the blood, the brain 54

and the lung. 55

Encouraged by the great potential of in- 56

vestigating DNA and RNA at the single- 57

cell level, the development of the corre- 58

sponding experimental technologies has expe- 59

rienced massive boosts. This upswing of high- 60

throughput sc-seq technologies—most impor- 61

tantly in microfluidics techniques and com- 62

binatorial indexing strategies [Zilionis et al., 63

2017, Vitak et al., 2017, Svensson et al., 64

2018b, Luo et al., 2019, Gao et al., 2019]— 65

means that tens or hundreds of thousands 66

of cells, instead of just tens or hundreds, 67

are routinely sequenced in one experiment; a 68

development—further fueled by in the mean- 69

time low sequencing costs—that has recently 70

even led to a publication on millions of cells in 71

one experiment [Cao et al., 2019a]. As a con- 72

sequence, primary and secondary sc-seq re- 73

sults of very large numbers of single cells are 74

becoming available worldwide, constituting a 75

data revolution for the field of single-cell anal- 76

ysis. 77

These vast amounts of data and the re- 78

search hypotheses that motivate them, need 79

to be handled in a computationally efficient 80

and statistically sound manner. As these 81

aspects clearly match a recent definition of 82

“Data Science” [Hicks and Peng, 2019], we 83

posit that we have entered the era of Single- 84

3

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 4: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Cell Data Science (SCDS).1

While SCDS faces many of the data sci-2

ence issues arising in bulk sequencing, it also3

substantially adds to them and further com-4

pounds existing scientific challenges. Namely,5

limited amounts of material available per cell6

lead to exceptionally high levels of uncer-7

tainty about (possibly missed) observations,8

and where amplification is used to generate9

more material, technical noise is added to the10

resulting data. Further, a new level of resolu-11

tion also means another—rapidly growing—12

dimension in data matrices, thus requiring13

scalable models and methods for data anal-14

ysis. While the particular challenges can vary15

greatly by research goal, tissue analyzed, ex-16

perimental setup or—last but not least—just17

by whether DNA or RNA is sequenced, fur-18

ther factoring into various protocols, assaying19

for example also the epigenome (bisulfite pro-20

tocols), chromatin accessibility (e.g. ATAC-21

seq) or protein levels (CITE-seq), the com-22

mon denominator is that the challenges are23

all rooted in data science, hence are compu-24

tational or statistical in nature. Here, we pro-25

pose the dozen data science challenges that we26

believe to be most relevant for bringing SCDS27

forward. We summarize and categorize them,28

providing a thorough review of the status of29

each challenge relative to existing approaches.30

From this foundation, we point to possible di-31

rections of research to tackle them. This cat-32

alog of SCDS challenges aims at focusing the33

development of data analysis methods and the34

directions of research in this rapidly evolving35

field—as a guideline for researchers looking36

for rewarding problems that match their per-37

sonal expertise and interests.38

2 Single-Cell Data Science: 39

Themes and Categories 40

A number of challenging themes are common 41

to all single-cell analyses, regardless of the 42

particular assay or data modality generated. 43

We will start our review by broadly categoriz- 44

ing these aspects. Later, when discussing the 45

specific 12 challenges, we will refer to these 46

broader categories wherever appropriate and, 47

if this is sensible, lay out what these broader 48

theme issues mean in the particular context. 49

If challenges covered in later sections are par- 50

ticularly entangled with the broader themes 51

listed here, we will also refer to them from 52

within this section. 53

These elementary themes may reflect issues 54

one also experiences when analyzing bulk se- 55

quencing data. However, even if not unique 56

to single-cell experiments, these issues may 57

become particularly dominant in the analysis 58

of sc-seq data and therefore require particu- 59

lar attention. The most driving of such el- 60

ementary themes, not necessarily unique to 61

sc-seq, are: (i) The need to quantify mea- 62

surement uncertainty (see challenges in sec- 63

tion 2.2) (ii) The need to benchmark methods 64

systematically, in a way that highlights the 65

metrics that are particularly critical in sc-seq 66

(section 6.2). The most driving themes spe- 67

cific to sc-seq, exacerbated by the rapid ad- 68

vances in terms of experimental technologies 69

supporting single-cell analyses, are: (i) The 70

need to scale to higher dimensional data, be 71

it more cells measured or more data mea- 72

sured per cell (section 2.3); this often arises 73

in combination with: (ii) The need to inte- 74

grate data across different types of single– 75

cell measurements (e.g. RNA, DNA, proteins, 76

methylation and so on) and across samples, 77

be they from different time points, treatment 78

groups or even organisms (section 6.1). Fi- 79

nally, the possibility to operate on the finest 80

4

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 5: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

levels of resolution casts an important, over-1

arching question: (iii) Which exact level of2

resolution is appropriate relative to the par-3

ticular research question one has in mind (sec-4

tion 2.1)? We will start by qualifying this last5

one.6

2.1 Varying levels of resolution7

Sc-seq allows for a fine-grained definition of8

cell types and states. Hence it allows for9

characterizations of cell populations that are10

significantly more detailed than characteriza-11

tions supported by bulk sequencing experi-12

ments. However, even though sc-seq operates13

at the most basic level, mapping cell types14

and states at a particular level of resolution15

of interest may be challenging: Depending on16

whether the research question allows for a cer-17

tain freedom in terms of resolution, and de-18

pending on the limits imposed by the particu-19

lar experimental setup, achieving the targeted20

level of resolution or granularity for the in-21

tended map of cells may require substantial22

methodological efforts.23

When drawing maps of cell types and24

states, it is important that they: (i) have a25

structure that recapitulates both tissue devel-26

opment and tissue organization; (ii) account27

for continuous cell states in addition to dis-28

crete cell types (i.e. reflecting cell state tra-29

jectories within cell types and smooth tran-30

sitions between cell types, as observed in tis-31

sue generation); (iii) allow for choosing the32

level of resolution flexibly (i.e. the map should33

possibly support zoom type operations, to34

let the researcher choose the desired level35

of granularity with respect to cell types and36

states conveniently, ranging from whole or-37

ganisms via tissues to cell populations and38

cellular subtypes); (iv) include biological and39

functional annotation wherever available and40

helpful in the intended functional context.41

An exemplary illustration of how maps of42

cell types and states can support different lev- 43

els of resolution are the structure-rich topolo- 44

gies generated by PAGA based on scRNA- 45

seq [Wolf et al., 2019], see Figure 1 for an 46

illustration1. At the highest levels of resolu- 47

tion, these topologies also reflect intermedi- 48

ate cell states and the developmental trajec- 49

tories passing through them. A similar ap- 50

proach that also allows for consistently zoom- 51

ing into more detailed levels of resolution is 52

provided by hierarchical stochastic neighbor 53

embedding (HSNE, Pezzotti et al. [2016]), a 54

method pioneered on mass cytometry data 55

sets [Unen et al., 2017, Höllt et al., 2018]. 56

In addition, manifold learning [Welch et al., 57

2017, Moon et al., 2018] and metric learning 58

[Hoffer and Ailon, 2015, Bromley et al., 1993] 59

may provide further theoretical support for 60

even more accurate maps, because they pro- 61

vide sound theories about reasonable, contin- 62

uous distance metrics, instead of just distinct, 63

discrete clusters. 64

2.2 Quantifying uncertainty of 65

measurements and analysis 66

results 67

The amount of material sampled from single 68

cells is considerably less in comparison with 69

the amounts of material raised in bulk exper- 70

iments, because the latter are based on ex- 71

amining the DNA or RNA of larger pools of 72

cells together. Signals become more stable 73

when individual signals are summarized (such 74

as in a bulk experiment), thus the increase in 75

resolution due to sc-seq also means a reduc- 76

tion of the stability of the supporting signals. 77

The reduction in signal stability, in turn, im- 78

plies that data becomes substantially more 79

1Figure 1 was adapted from Wolf et al. [2019],Fig. 3, provided under Creative CommonsAttribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

5

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 6: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

�ssues cell types single cells

intermediatecell states

trajectories

Figure 1: Different levels of resolution are of interest, depending on the research question andthe data available. Thus, analysis tools and reference systems (such as cell atlases) will haveto accommodate for multiple levels of resolution from whole organs and tissues over discretecell types to continuously mappable intermediate cell states, indistinguishable even at themicroscopic level. A graph abstraction that enables such multiple levels of focus is providedby PAGA [Wolf et al., 2019], a structure that allows for discretely grouping cells, as well asinferring trajectories as paths through a graph.

uncertain and tasks hitherto considered rou-1

tine, such as single nucleotide variation (SNV)2

calling in bulk sequencing, require consider-3

able methodological care to be resolved also4

for sc-seq.5

These issues with data quality and in par-6

ticular missing data pose challenges that are7

novel and unique to sc-seq, and are thus8

at the core of several challenges: regarding9

scDNA-seq data quality (see challenges in10

section 4.1) and especially regarding missing11

data in scDNA-seq (section 4.2) and scRNA-12

seq (section 3.1). In contrast, the non-13

negligible batch effects that scRNA-seq can14

suffer from reflect a common problem in high-15

throughput data analysis [Leek et al., 2010],16

and thus are not discussed here (although in17

certain protocols such effects can be allevi-18

ated by careful use of negative control data19

in the form of spike-in RNA of known con-20

tent and concentration [Severson et al., 2018,21

BEARscc]).22

Optimally, sc-seq analysis tools would accu-23

rately quantify all uncertainties arising from 24

experimental errors and biases. Thereby, 25

these tools would prevent the uncertainties 26

from propagating to the intended downstream 27

analyses in an uncontrolled manner, and 28

rather translate them into statistically sound 29

and accurately quantified qualifiers of final re- 30

sults. 31

2.3 Scaling to higher 32

dimensionalities: more cells, 33

more features, broader 34

coverage 35

The current blossoming of experimental 36

methods poses considerable statistical chal- 37

lenges, and would do even if measurements 38

were not affected by errors and biases. 39

The increase in the number of single cells 40

analyzed per experiment translates into more 41

data points being generated, requiring meth- 42

ods to scale rapidly. With scRNA-seq already 43

6

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 7: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

scaling to millions of cells, some of the respec-1

tive methodology has picked up the thread2

[Sengupta et al., 2016, Sinha et al., 2018, Wolf3

et al., 2018, Iacono et al., 2018]. Of course,4

the respective issues have not yet been fully5

resolved; further improvements are conceiv-6

able. For scDNA-seq, experimental method-7

ology has just been scaling up to more cells re-8

cently (see section 4.1 and section 5.1), mak-9

ing this a pressing challenge in the develop-10

ment of data analysis methods.11

Beyond basic scRNA-seq and scDNA-seq12

experiments, various assays have been pro-13

posed to measure chromatin accessibility14

[Buenrostro et al., 2015, Cusanovich et al.,15

2015], DNA methylation [Karemaker and Ver-16

meulen, 2018], protein levels [Virant-Klun17

et al., 2016], protein binding, and also for per-18

forming multiple simultaneous measurements19

[Clark et al., 2018, Cao et al., 2018] in sin-20

gle cells.The corresponding increase in exper-21

imental choices means another possible infla-22

tion of feature spaces.23

In parallel to the increase in the number24

of cells queried and the number of different25

assays possible, the increase of the resolu-26

tion per cell of specific measurement types27

causes a steady increase of the dimension-28

ality of corresponding data spaces. For the29

field of SCDS this amounts to a severe and30

recurring case of the “curse of dimensional-31

ity” for all types of measurements. Here32

again, scRNA-seq based methods are in the33

lead when trying to deal with feature dimen-34

sionality, while scDNA-seq based methodol-35

ogy (which includes epigenome assays) has yet36

to catch up.37

Finally, there are efforts to measure multi-38

ple feature types in parallel, e.g. from scDNA-39

seq (see section 5.2). Also, with spatial and40

temporal sampling becoming available (see41

section 3.5 and section 5.3), data integration42

methods need to scale to more and new types43

of context information for individual cells (see44

section 6.1 for a comprehensive discussion of 45

data integration approaches). 46

2.4 Challenge categories 47

All challenges we identified fall into at least 48

one of three greater categories: transcrip- 49

tomics (section 3), genomics (section 4) and 50

phylogenomics (section 5). Here, the separa- 51

tion of phylogenomics from genomics is due to 52

the distinct research goals the respective chal- 53

lenges address. Last but not least, two chal- 54

lenges are relevant to all of these categories, 55

and are thus discussed as recapitulatory chal- 56

lenges at the end: the data integration chal- 57

lenge (section 6.1) draws on the types of mea- 58

surements and experiments described in the 59

category-specific challenges. The benchmark- 60

ing challenge (presented in section 6.2), al- 61

though being essential in many areas of data 62

science, is worth highlighting here in partic- 63

ular, because benchmarking for SCDS is still 64

in its infancy. 65

3 Challenges in single-cell 66

transcriptomics 67

3.1 Challenge I: Handling 68

sparsity in single-cell RNA 69

sequencing 70

A comprehensive characterization of the tran- 71

scriptional status of individual cells enables us 72

to gain full insight into the interplay of tran- 73

scripts within single cells. However, scRNA- 74

seq measurements typically suffer from large 75

fractions of observed zeros, where a given gene 76

in a given cell has no unique molecule identi- 77

fiers or reads mapping to it. These observed 78

zero values can represent either missing data 79

(i.e. a gene is expressed but not detected by 80

the sequencing technology) or true absence of 81

7

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 8: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

expression. The proportion of zeros, or degree1

of sparsity, is thought to be due to imper-2

fect reverse transcription and amplification,3

and other technical limitations (Hicks et al.4

[2018], Bacher and Kendziorski [2016]), and5

depends on the scRNA-seq platform used, the6

sequencing depth and the underlying expres-7

sion level of the gene. The term “dropout” is8

often used to denote observed zero values in9

scRNA-seq data, but this term conflates zero10

values attributable to methodological noise11

and biologically-true zero expression, so we12

recommend against its use as a catch-all term13

for observed zeros.14

Sparsity in scRNA-seq data can hinder15

downstream analyses, but it is challenging to16

model or handle it appropriately, and thus,17

there remains an ongoing need for improved18

methods. Sparsity pervades all aspects of19

scRNA-seq data analysis, but here we fo-20

cus on the linked problems of learning la-21

tent spaces and “imputing” expression values22

from scRNA-seq data (Figure 2). Imputation,23

“data smoothing” and “data reconstruction”24

approaches are closely linked to the challenges25

of normalization. But whereas normalization26

generally aims to make expression values be-27

tween cells more comparable to each other,28

imputation and data smoothing approaches29

aim to achieve adjusted data values that—it30

is hoped—better represent the true expression31

values. Imputation methods could therefore32

be used for normalization, but do not entail33

all possible or useful approaches to normal-34

ization.35

3.1.1 Status36

The imputation of missing values has been37

very successful for genotype data. Crucially,38

when imputing genotypes we often know39

which data are missing (e.g. when no geno-40

type call is possible due to no coverage of41

a locus, although see section section 4.2 for42

the challenges with scDNA-seq data) and rich 43

sources of external information are available 44

(e.g. haplotype reference panels). Thus, geno- 45

type imputation is now highly accurate and 46

a commonly-used step in data processing for 47

genetic association studies [Das et al., 2018]. 48

The situation is somewhat different for 49

scRNA-seq data, as we do not routinely have 50

external reference information to apply (see 51

section 3.3). In addition, we can never be sure 52

which observed zeros represent “missing data” 53

and which accurately represent a true gene ex- 54

pression level in the cell [Hicks et al., 2018]. 55

Observed zeros can either represent “biologi- 56

cal” zeros, i.e. those present because the true 57

expression level of a gene in a cell was zero. 58

Or they they are the result of methodological 59

noise, which can arise when a gene has true 60

non-zero expression in a cell, but no counts 61

are observed due to failures at any point in 62

the complicated process of processing mRNA 63

transcripts in cells into mapped reads. Such 64

noise can lead to artefactual zero that are ei- 65

ther more systematic (e.g. sequence-specific 66

mRNA degradation during cell lysis) or that 67

occur by chance (e.g. barely expressed tran- 68

scripts that at the same expression level will 69

sometimes be detected and sometimes not, 70

due to sampling variation, e.g in the sequenc- 71

ing). The high degree of sparsity in scRNA- 72

seq data therefore arises from technical zeros 73

and true biological zeros, which are difficult 74

to distinguish from one another. 75

In general, two broad approaches can be ap- 76

plied to tackle this problem of sparsity: (i) use 77

statistical models that inherently model the 78

sparsity, sampling variation and noise modes 79

of scRNA-seq data with an appropriate data 80

generative model; or (ii) attempt to “impute” 81

values for observed zeros (ideally the tech- 82

nical zeros; sometimes also non-zero values) 83

that better approximate the true gene expres- 84

sion levels. We prefer to use the first option 85

where possible, and for many single-cell data 86

8

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 9: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

analysis problems, statistical models appro-1

priate for sparse count data exist and should2

be used (e.g. for differential expression anal-3

ysis). However, there are many cases where4

the appropriate models are not available and5

accurate imputation of technical zeros would6

allow better results from downstream meth-7

ods and algorithms that cannot handle sparse8

count data. For example, imputation could9

be particularly useful for many dimension re-10

duction, visualization and clustering applica-11

tions. It is therefore desirable to improve both12

statistical methods that work on sparse count13

data directly and approaches for data impu-14

tation for scRNA-seq data, whether by re-15

fining existing techniques or developing new16

ones (see also section 2.2).17

We define three broad (and sometimes over-18

lapping) categories of methods that can be19

used to “impute” scRNA-seq data in the ab-20

sence of an external reference: (i) Model-based21

imputation methods of technical zeros use22

probabilistic models to identify which ob-23

served zeros represent technical rather than24

biological zeros and aim to impute expression25

levels just for these technical zeros, leaving26

other observed expression levels untouched;27

or (ii) Data-smoothing methods define sets28

of “similar” cells (e.g. cells that are neigh-29

bors in a graph or occupy a small region30

in a latent space) and adjust expression val-31

ues for each cell based on expression values32

in similar cells. These methods adjust all33

expression values, including technical zeros,34

biological zeros and observed non-zero val-35

ues. (iii) Data-reconstruction methods typ-36

ically aim to define a latent space repre-37

sentation of the cells. This is often done38

through matrix factorization (e.g. principal39

component analysis) or, increasingly, through40

machine learning approaches (e.g. variational41

autoencoders that exploit deep neural net-42

works to capture non-linear relationships).43

Although a broad class of methods, both ma-44

trix factorization methods and autoencoders 45

(among others) are able to “reconstruct” the 46

observed data matrix from low-rank or sim- 47

plified representations. The reconstructed 48

data matrix will typically no longer be sparse 49

(with many zeros) and the implicitly “im- 50

puted” data can be used for downstream ap- 51

plications that cannot handle sparse count 52

data. 53

The first category of methods generally 54

seeks to infer a probabilistic model that cap- 55

tures the data generation mechanism. Such 56

generative models can be used to identify, 57

probabilistically, which observed zeros cor- 58

respond to technical zeros (to be imputed) 59

and which correspond to biological zeros (to 60

be left alone). There are many model-based 61

imputation methods already available that 62

use ideas from clustering (e.g. k-means), di- 63

mension reduction, regression and other tech- 64

niques to impute technical zeros, oftentimes 65

combining ideas from several of these ap- 66

proaches. These include SAVER [Huang 67

et al., 2018], ScImpute [Li and Li, 2018], 68

bayNorm [Tang et al., 2018], scRecover [Miao 69

et al., 2019], and VIPER [Chen and Zhou, 70

2018]. Clustering methods that implicitly im- 71

pute values, such as CIDR [Lin et al., 2017b] 72

and BISCUIT [Azizi et al., 2017], are closely 73

related to this class of imputation methods. 74

Data-smoothing methods, which adjust all 75

gene expression levels based on expression 76

levels in “similar” cells, have also been pro- 77

posed to handle imputation problems. We 78

might regard these approaches as “denois- 79

ing” methods. To take a simplified exam- 80

ple (Figure 2), we might imagine that sin- 81

gle cells originally refer to points in two- 82

dimensional space, but are likely to describe a 83

one-dimensional curve; projecting data points 84

onto that curve eventually allows imputation 85

of the “missing” values (but all points are 86

adjusted, or smoothed, not just true tech- 87

nical zeros). Prominent data-smoothing ap- 88

9

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 10: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

true popula�on manifold

denoising / imputa�on manifold

measurement error

denoising process

"true" data point

observed data point

denoised data point

imputed data point

Figure 2: Measurement error requires denoising methods or approaches that quantify uncer-tainty and propagate it down analysis pipelines. Also, whenever methods cannot deal withthe abundant missing values, imputation approaches are necessary. Whereas the true popu-lation manifold that generated data is never known, one can usually obtain some estimationof it that can be used for both denoising and imputation.

proaches to handling sparse counts include:1

• diffusion-based MAGIC [Dijk et al.,2

2018]3

• k-nearest neighbor-based knn-smooth4

[Wagner et al., 2018b]5

• network diffusion-based netSmooth6

[Jonathan Ronen, 2018]7

• clustering-based DrImpute [Gong et al.,8

2018]9

• locality sensitive imputation in LSIm-10

pute [Moussa and Măndoiu, 2019]11

A major task in the analysis of high-12

dimensional single-cell data is to find low-13

dimensional representations of the data that14

capture the salient biological signals and ren-15

der the data more interpretable and amenable16

to further analyses. As it happens, the ma-17

trix factorization and latent-space learning18

methods used for that task also provide an-19

other route for imputation through their abil-20

ity to reconstruct the observed data matrix21

from simplified representations of it. Prin-22

cipal component analysis (PCA) is one such23

standard matrix factorization method that24

can be applied to scRNA-seq data (preferably25

after suitable data normalization) as are other 26

widely-used general statistical methods like 27

independent component analysis (ICA) and 28

non-negative matrix factorization (NMF). As 29

(linear) matrix factorization methods, PCA, 30

ICA and NMF decompose the observed data 31

matrix into a “small” number of factors in two 32

low-rank matrices, one representing cell-by- 33

factor weights and one gene-by-factor load- 34

ings. Many matrix factorization methods 35

with tweaks for single-cell data have been pro- 36

posed in recent years, including: 37

• ZIFA, a zero-inflated factor analysis 38

[Pierson and Yau, 2015] 39

• f-scLVM, a sparse Bayesian latent vari- 40

able model [Buettner et al., 2017] 41

• GPLVM, a Gaussian process latent vari- 42

able model [Verma and Engelhardt, 2018] 43

• ZINB-WaVE, a zero-inflated negative bi- 44

nomial factor model [Risso et al., 2018] 45

• scCoGAPS, an extension of NMF [Stein- 46

O’Brien et al., 2019] 47

• consensus NMF, a meta-analysis ap- 48

proach to NMF [Kotliar et al., 2019] 49

10

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 11: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

• pCMF, probabilistic count matrix factor-1

ization with a Poisson model [Durif et al.,2

2019]3

• SDA, sparse decomposition of arrays;4

another sparse Bayesian method [Jung5

et al., 2019].6

Some data reconstruction approaches have7

been specifically proposed for imputation, in-8

cluding:9

• ENHANCE, denoising PCA with an ag-10

gregation step [Wagner et al., 2019]11

• ALRA, SVD with adaptive thresholding12

[Linderman et al., 2018]13

• scRMD, robust matrix decomposition14

[Chen et al., 2018]15

Recently, machine learning methods have16

emerged that apply autoencoders [AutoIm-17

pute, Talwar et al., 2018] and deep neu-18

ral networks [DeepImpute, Arisdakessian19

et al., 2018]) or ensemble learning [EnImpute,20

Zhang et al., 2019c]) to impute expression val-21

ues.22

Additionally, many deep learning methods23

have been proposed for single-cell data anal-24

ysis that can, but need not, use probabilis-25

tic data generative processes to capture low-26

dimensional or latent space representations of27

a dataset. Even if imputation is not a main28

focus, such methods can generate “imputed”29

expression values as an upshot of a model pri-30

marily focused on other tasks like learning la-31

tent spaces, clustering, batch correction, or32

visualization (and often several of these tasks33

simultaneously). The latter set includes tools34

such as:35

• DCA, an autoencoder with a zero-36

inflated negative binomial distribution37

[Eraslan et al., 2019]38

• scVI, a variational autoencoder with a 39

zero-inflated negative binomial model 40

[Lopez et al., 2018] 41

• LATE [Badsha et al., 2018] 42

• VASC [Wang and Gu, 2018] 43

• compscVAE [Grønbech et al., 2018] 44

• scScope [Deng et al., 2019] 45

• Tybalt [Way and Greene, 2018] 46

• SAUCIE [Amodio et al., 2019] 47

• scvis [Ding et al., 2018] 48

• net-SNE [Cho et al., 2018] 49

• BERMUDA, focused on batch correction 50

[Wang et al., 2019] 51

• DUSC [Srinivasan et al., 2019] 52

• Expression Saliency [Kinalis et al., 2019] 53

• others [Lin et al., 2017a, Zhang, 2019] 54

Besides the three categories described 55

above, a small number of scRNA-seq impu- 56

tation methods have been developed to in- 57

corporate information external to the cur- 58

rent dataset for imputation. These include: 59

ADImpute [Leote et al., 2019], which uses 60

gene regulatory network information from ex- 61

ternal sources; SAVER-X [Wang et al., 2018], 62

a transfer learning method for denoising and 63

imputation that can use information from 64

atlas-type resources; and methods that bor- 65

row information from matched bulk RNA- 66

seq data like URSM [Zhu et al., 2018] and 67

SCRABBLE [Peng et al., 2019]. 68

11

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 12: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

3.1.2 Open problems1

A major challenge in this context is the circu-2

larity that arises when imputation solely relies3

on information that is internal to the imputed4

dataset. This circularity can artificially am-5

plify the signal contained in the data, leading6

to inflated correlations between genes and/or7

cells. In turn, this can introduce false pos-8

itives in downstream analyses such as differ-9

ential expression testing and gene network in-10

ference [Andrews and Hemberg, 2019]. Han-11

dling batch effects and potential confounders12

requires further work to ensure that imputa-13

tion methods do not mistake unwanted varia-14

tion from technical sources for biological sig-15

nal. In a similar vein, single-cell experiments16

are affected by various uncertainties (see sec-17

tion 2.2). Approaches that allow quantifica-18

tion and propagation of the uncertainties as-19

sociated with expression measurements (sec-20

tion 2.2), may help to avoid problems associ-21

ated with ‘overimputation’ and the introduc-22

tion of spurious signals noted by Andrews and23

Hemberg [2019].24

To avoid this circularity, it is important25

to identify reliable external sources of infor-26

mation that can inform the imputation pro-27

cess. One possibility is to exploit external28

reference panels (like in the context of ge-29

netic association studies). Such panels are30

not generally available for scRNA-seq data,31

but ongoing efforts to develop large scale cell32

atlases [e.g. Regev et al., 2017, see also sec-33

tion 3.3] could provide a valuable resource34

for this purpose. Systematic integration of35

known biological network structures is de-36

sirable and may also help to avoid circular-37

ity. A possible approach is to encode net-38

work structure knowledge as prior informa-39

tion, as attempted in netSmooth and ADIm-40

pute. Another alternative solution is to ex-41

plore complementary types of data that can42

inform scRNA-seq imputation. This idea was43

adopted in SCRABBLE and URSM, where an 44

external reference is defined by bulk expres- 45

sion measurements from the same population 46

of cells for which imputation is performed. 47

Yet another possibility could be to incorpo- 48

rate orthogonal information provided by dif- 49

ferent types of molecular measurements (see 50

section 6.1). Methods designed to integrate 51

multi-omics data could then be extended to 52

enable scRNA-seq imputation, e.g. through 53

generative models that explicitly link scRNA- 54

seq with other data types [e.g. clonealign, 55

Campbell et al., 2019] or by inferring a shared 56

low-dimensional latent structure [e.g. MOFA, 57

Argelaguet et al., 2018] that could be used 58

within a data-reconstruction framework. 59

With the proliferation of alternative meth- 60

ods, comprehensive benchmarking is urgently 61

required as for all areas of single-cell data 62

analysis section 6.2. Early attempts by Zhang 63

and Zhang [2018] and Andrews and Hemberg 64

[2019] provide valuable insights into the per- 65

formance of methods available at the time. 66

But many more methods have since been pro- 67

posed and even more comprehensive bench- 68

marking platforms are needed. Many meth- 69

ods, especially those using deep learning, de- 70

pend strongly on choice of hyperparameters 71

[Hu and Greene, 2019]. There, more de- 72

tailed comparisons that explore parameter 73

spaces would be helpful, extending work like 74

that from Sun et al. [2019] comparing di- 75

mensionality reduction methods. Learning 76

from exemplary benchmarking studies [Sone- 77

son and Robinson, 2018, Saelens et al., 2019], 78

it would be immensely beneficial to develop 79

a community-supported benchmarking plat- 80

form with a wide-range of synthetic and ex- 81

periment ground-truth datasets (or as close 82

as possible, in the case of experimental data) 83

and a variety of thoughtful metrics for eval- 84

uating performance. Ideally, such a bench- 85

marking platform would remain dynamic be- 86

yond an initial publication to allow ongoing 87

12

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 13: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

comparison of methods as new approaches are1

proposed. Detailed benchmarking would also2

help to establish when normalization methods3

derived from explicit count models [e.g. Hafe-4

meister and Satija, 2019, Townes et al., 2019]5

may be preferable to imputation.6

Finally, scalability for large numbers of7

cells remains an ongoing concern for imputa-8

tion, data smoothing and data reconstruction9

methods, as for all high-throughput single-cell10

methods and software (see section 2.3).11

3.2 Challenge II: Defining12

flexible statistical13

frameworks for discovering14

complex differential patterns15

in gene expression16

Beyond simple changes in average gene ex-17

pression between cell types (or across bulk-18

collected libraries), scRNA-seq enables a19

high granularity of changes in expression to20

be unraveled. Interesting and informative21

changes in expression patterns can be re-22

vealed, as well as cell-type-specific changes23

in cell state across samples (Figure 6, Ap-24

proach 1). Further understanding of gene25

expression changes will enable deeper knowl-26

edge across a myriad of applications, such as27

immune responses [Kang et al., 2018b, Stub-28

bington et al., 2017], development [Karaiskos29

et al., 2017a] and drug response [Kim et al.,30

2015].31

3.2.1 Status32

Currently, the vast majority of differential ex-33

pression detection methods assume that the34

groups of cells to be compared are known35

in advance (e.g., experimental conditions or36

cell types). However, most current analy-37

sis pipelines rely on clustering or cell type38

assignment to identify such groups, before39

downstream differential analysis is performed, 40

without propagating the uncertainty in these 41

assignments or accounting for the double use 42

of data (clustering, differential testing be- 43

tween clusters). 44

In this context, most methods have fo- 45

cused on comparing average expression be- 46

tween groups [Kharchenko et al., 2014, Fi- 47

nak et al., 2015], but it appears that single- 48

cell-specific methods do not uniformly out- 49

perform the state-of-the-art bulk methods 50

[Soneson and Robinson, 2018]. Instead, lit- 51

tle attention has been given to more gen- 52

eral patterns of differential expression (Fig- 53

ure 3), such as changes in variability that ac- 54

count for mean expression confounding [Eling 55

et al., 2018], changes in trajectory along pseu- 56

dotime [Campbell and Yau, 2018, van den 57

Berge et al., 2019], or more generally, changes 58

in distributions [Korthauer et al., 2016b]. 59

Furthermore, methods for cross-sample com- 60

parisons of gene expression (e.g., cell-type- 61

specific changes in cell state across samples, 62

compare section 6.1, Figure 6 and Table 2) 63

are now emerging, such as pseudo-bulk com- 64

parisons [Kang et al., 2018a], where expres- 65

sion is aggregated over multiple cells within 66

each sample. With the expanding capacity 67

of experimental techniques to generate multi- 68

sample scRNA-seq datasets, further general 69

and flexible statistical frameworks will be re- 70

quired to identify complex differential pat- 71

terns across samples. This will be particularly 72

critical in clinical applications, where cells are 73

collected from multiple patients. 74

3.2.2 Open problems 75

Accounting for uncertainty in cell type as- 76

signment and for double use of data will 77

require, first of all, a systematic study of 78

their impact. Integrative approaches in which 79

clustering and differential testing are simul- 80

taneously performed [Vavoulis et al., 2015] 81

13

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 14: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

popula�on differences in

mean variability pseudo�me

...

Q Q

...

...

R R

...

...

S S

...

Figure 3: Differential expression of a gene or transcript between cell populations. The top rowlabels the specific gene or transcript, as is also done in Figure 6. A difference in mean geneexpression manifests in a consistent difference of gene expression across all cells of a popu-lation (e.g. high vs. low). A difference in variability of gene expression means that in onepopulation, all cells have a very similar expression level, whereas in another population somecells have a much higher expression and some a much lower expression. The resulting averageexpression level may be the same and in such cases, only single-cell measurements can findthe difference between populations. A difference across pseudotime is a change of expres-sion within a population, e.g. along a developmental trajectory (compare Figure 1). Thisalso constitutes a difference between cell populations that is not apparent from populationaverages, but requires a pseudo-temporal ordering of measurements on single cells.

can address both issues. However, integra-1

tive methods typically require bespoke imple-2

mentations, precluding a direct combination3

between arbitrary clustering and differential4

testing tools. In such cases, the adaptation of5

selective inference methods [Reid et al., 2018,6

Zhang et al., 2019b] could provide an alterna-7

tive solution.8

While some methods exist to identify more9

general patterns of gene expression changes10

(e.g. variability, distributions), these meth-11

ods could be further improved by integrat-12

ing with existing approaches that account for13

confounding effects such as cell cycle [Ste-14

gle et al., 2015] and complex batch effects15

[Butler et al., 2018a, Haghverdi et al., 2018].16

Moreover, our capability to discover interest-17

ing gene expression patterns will be vastly18

expanded by connecting with other aspects19

of single-cell expression dynamics, such as20

cell type composition, RNA velocity [Manno 21

et al., 2018], splicing and allele-specificity. 22

This will allow us to fully exploit the granu- 23

larity contained in single-cell level expression 24

measurements. 25

In the multi-donor setting, several promis- 26

ing methods have been applied to discover 27

state transitions in high-dimensional cytome- 28

try datasets [Lun et al., 2017, Bruggner et al., 29

2014, Weber et al., 2018, Nowicka et al., 2017]. 30

These approaches could be expanded to the 31

higher dimensions and characteristic aspects 32

of scRNA-seq data. Alternatively, there is a 33

large space to explore other general and flex- 34

ible approaches, such as hierarchical models 35

where information is borrowed across sam- 36

ples, while allowing for sample-specific pat- 37

terns. 38

14

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 15: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

3.3 Challenge III: Mapping1

single cells to a reference2

atlas3

Classifying cells into cell types or states is4

essential for many secondary analyses. As5

an example, consider studying and classify-6

ing how expression varies across different cells7

and different biological conditions (for differ-8

ential expression analyses, see section 3.2 and9

data integration Approach 1 in section 6.1,10

Figure 6 and Table 2). To put the results of11

such studies on a map, reliable reference sys-12

tems are required.13

The lack of appropriate, available refer-14

ences has so far implied that only reference-15

free approaches were conceivable, where unsu-16

pervised clustering approaches were the pre-17

dominant option (see data integration Ap-18

proach 0 in section 6.1, Figure 6 and Table 2).19

Method development for such unsupervised20

clustering of cells has already reached a cer-21

tain level of maturity; see Duò et al. [2018],22

Freytag et al. [2018], Kiselev et al. [2019] for23

a systematic identification of available tech-24

niques.25

However, unsupervised approaches involve26

manual cluster annotation. There are two27

major caveats: (i) manual annotation is a28

time-consuming process, which also (ii) puts29

certain limits to the reproducibility of the re-30

sults. Cell atlases, as reference systems that31

systematically capture cell types and states,32

either tissue-specific or across different tis-33

sues, remedy this issue (see data integration34

Approach 2 in section 6.1, Figure 6 and Ta-35

ble 2; see also Figure 1 for an idea of what cell36

atlas type reference systems preferably could37

look like).38

3.3.1 Status39

See Table 1 for a list of cell atlas type refer-40

ences that have recently been published. For41

human, similar endeavors as for the mouse are 42

under way, with the intention to raise a Hu- 43

man Cell Atlas [Regev et al., 2017]. Towards 44

this end, initial consortia focus on specific or- 45

gans, for example the lung [Schiller et al., 46

2019]. 47

The availability of these reference atlases 48

has led to the active development of methods 49

that make use of them in the context of su- 50

pervised classification of cell types and states 51

[Lieberman et al., 2018, Srivastava et al., 52

2018, Cao et al., 2019b, DePasquale et al., 53

2019, Kanter et al., 2019, Sato et al., 2019, 54

Zhang et al., 2019a]. A field that serves as 55

a source of inspiration is flow/mass cytom- 56

etry, where several methods have addressed 57

the classification of high-dimensional cell type 58

data [Chester and Maecker, 2015, Weber and 59

Robinson, 2016, Saeys et al., 2016, Guilliams 60

et al., 2016]. Finally, as for benchmarking 61

methods that map cells of unknown type or 62

state onto reference atlases (see Section sec- 63

tion 6.2 for benchmarking in general), atlases 64

of model organisms where full lineages of cells 65

have been integrated can form the basis for 66

further developments [Spanjaard et al., 2018, 67

Plass et al., 2018, Fincher et al., 2018, Farrell 68

et al., 2018, Briggs et al., 2018].Importantly, 69

additional information available from lineage 70

tracing can provide a cross-check with respect 71

to the transcriptome-profile-based classifica- 72

tion [Briggs et al., 2018, Kester and van Oude- 73

naarden, 2018]. 74

3.3.2 Open problems 75

Cell atlases can still be considered under 76

active development, with several computa- 77

tional challenges still open, in particular re- 78

ferring to the fundamental themes from above 79

[Regev et al., 2017, Schiller et al., 2019, Hon 80

et al., 2018]. Here, we focus on the map- 81

ping of cells or rather their molecular profiles 82

onto stable existing reference atlases to fur- 83

15

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 16: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

organism scale of cell atlas citationnematodeCaenorhabditis ele-gans

whole organism at larval stageL2

[Cao et al., 2017]

planariaSchmidtea mediter-ranea

whole organism of the adult an-imal

[Fincher et al., 2018, Plass et al.,2018]

fruit flyDrosophilamelanogaster

whole organism at embryonicstage

[Karaiskos et al., 2017b]

Zebrafish whole organism at embryonicstage

[Farrell et al., 2018, Wagneret al., 2018a]

frogXenopus tropicalis

whole organism at embryonicstage

[Briggs et al., 2018]

Mouse whole adult brain [Rosenberg et al., 2018, Saunderset al., 2018, Zeisel et al., 2018]

Mouse whole adult organism [Tabula Muris Consortium, 2018,Han et al., 2018]

Table 1: Published cell atlases of whole tissues or whole organisms.

ther highlight the importance of these fun-1

damental themes. A computationally and2

statistically sound method for mapping cells3

onto atlases for a range of conceivable re-4

search questions will need to: (i) enable op-5

eration at various levels of resolution of inter-6

est, and also cover continuous, transient cell7

states (see section 2.1); (ii) quantify the un-8

certainty of a particular mapping of cells of9

unknown type/state (see section 2.2); (iii) to10

scale to ever more cells and broader cover-11

age of types and states (see section 2.3), and12

(iv) to eventually integrate information gen-13

erated not only through scRNA-seq experi-14

ments, but also through other types of mea-15

surements, for example scDNA-seq or protein16

expression data (see below in section 6.1 for a17

discussion of data integration, especially data18

integration Approaches 4 and 5 in section 6.1,19

Figure 6 and Table 2).20

3.4 Challenge IV: Generalizing 21

trajectory inference 22

Several biological processes, such as differen- 23

tiation, immune response or cancer expansion 24

can be described and represented as continu- 25

ous dynamic changes in cell type/state space 26

using tree, graphical or probabilistic mod- 27

els. A potential path that a cell can undergo 28

in this continuous space is often referred to 29

as a trajectory (Trapnell et al. [2014] and 30

Figure 1), and the ordering induced by this 31

path is referred to as pseudotime. Several 32

models have been proposed to describe cell 33

state dynamics, starting from transcriptomic 34

data [Saelens et al., 2019]. Trajectory infer- 35

ence is in principle not limited to transcrip- 36

tomics. Nevertheless, modeling of other mea- 37

surements, such as proteomic, metabolomic, 38

and epigenomic, or even integrating multiple 39

types of data (see section 6.1), is still at its 40

infancy. We believe the study of complex tra- 41

16

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 17: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

jectories integrating different data-types es-1

pecially epigenetics and proteomics informa-2

tion in addition to transcriptomics data will3

lead to a more systematic understanding of4

the processes determining cell fate.5

3.4.1 Status6

More than sixty trajectory methods have7

been proposed for trajectory inference from8

transcriptomic data using snapshot data at9

single or multiple time points [Saelens et al.,10

2019]. Briefly, those methods start from a11

count matrix where genes are rows and cells12

are columns. First, a feature selection or di-13

mensionality reduction step is used to explore14

a subspace where distances between cells are15

more reliable. Next, clustering and minimum16

spanning trees [Trapnell et al., 2014, Ji and17

Ji, 2016], principal curve or graph fitting [Qiu18

et al., 2017, Chen et al., 2019, Rizvi et al.,19

2017], or random walks and diffusion opera-20

tions on graphs (Haghverdi et al. [2016], Setty21

et al. [2016] among others) are used to in-22

fer pseudotime and/or branching trajectories.23

Alternative probabilistic descriptions can be24

obtained using optimal transport analysis25

[Schiebinger et al., 2017] or approximation of26

the Fokker-Planck equations [Weinreb et al.,27

2018] or by estimating pseudotime through di-28

mensionality reduction with a Gaussian pro-29

cess latent variable model [Campbell and Yau,30

2016, Reid and Wernisch, 2016, Ahmed et al.,31

2019].32

3.4.2 Open problems33

Potentially, many of the above-mentioned34

methods for trajectory inference can be35

extended to data obtained with non-36

transcriptomic assays. Thereby, the follow-37

ing aspects are crucial. First, it is necessary38

to define the features to use; while for tran-39

scriptomic data the features are well anno-40

tated and correspond to expression levels of 41

genes, clear-cut features are harder to deter- 42

mine for data such as methylation profiles and 43

chromatin accessibility where signals can re- 44

fer to individual genomic sites, but also be 45

pooled over sequence features or sequence re- 46

gions. Second, many of those recent technolo- 47

gies only allow measurement of a quite lim- 48

ited number of cells compared to transcrip- 49

tomic assays where millions of cells can be 50

profiled using droplet-based platforms [Ma- 51

cosko et al., 2015, Klein et al., 2015, Zheng 52

et al., 2017]. Third, some of those measure- 53

ments are technically challenging since the in- 54

put material for each cell is limited (for exam- 55

ple two copies of each chromosome for methy- 56

lation or chromatin accessibility), giving rise 57

to more sparsity than scRNA-seq. In the 58

latter case it is necessary to define distance 59

or similarity metrics that take this problem 60

into account. An alternative approach con- 61

sists of pooling/combining information from 62

several cells or data imputation. For ex- 63

ample, imputation has been used for single- 64

cell DNA methylation [Angermueller et al., 65

2017], aggregation over chromatin accessibil- 66

ity peaks from bulk or pseudo-bulk sample 67

[Cusanovich et al., 2018], and k-mer-based 68

approaches have been proposed [Buenrostro 69

et al., 2018, de Boer and Regev, 2018, Chen 70

et al., 2019]. However, so far, no systematic 71

evaluation (see section 6.2) of those choices 72

has been performed and it is not clear how 73

many cells are necessary to reliably define 74

those features. 75

A pressing challenge is to assess how the 76

different trajectory inference methods per- 77

form on different data types and importantly 78

to define metrics that are suitable. Also, it 79

is necessary to reason on the ground truth or 80

propose reasonable surrogates (e.g. previous 81

knowledge about developmental processes). 82

Some recent papers explore this idea using 83

scATAC-seq data, an assay to measure chro- 84

17

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 18: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

matin accessibility [Buenrostro et al., 2018,1

Chen et al., 2019, Pliner et al., 2018].2

Having defined robust methods to recon-3

struct trajectories from each data type, an-4

other future challenge is related to their com-5

parison or alignment. Here, some ideas from6

recent methods used to align transcriptomic7

datasets may be extended [Butler et al.,8

2018b, Haghverdi et al., 2018, Welch et al.,9

2018]. A related unsolved problem is that10

of comparing different trajectories obtained11

from the same data type but across individu-12

als or conditions to highlight unique and com-13

mon aspects.14

3.5 Challenge V: Finding15

patterns in spatially resolved16

measurements17

Single-cell spatial transcriptomics or pro-18

teomics [Crosetto et al., 2015, Strell et al.,19

2018, Moffitt et al., 2018] technologies can20

obtain transcript abundance measurements21

while retaining spatial coordinates of cells or22

even transcripts within a tissue (this can be23

seen as an additional feature space to inte-24

grate, see Approach 3 in section 6.1, Figure 625

and Table 2). With such data, the question26

arises of how spatial information can best be27

leveraged to find patterns, infer cell types or28

functions and classify cells in a given tissue29

[Tanay and Regev, 2017].30

3.5.1 Status31

Experimental approaches have been tailored32

to either systematically extract foci of cells33

and analyze them with scRNA-seq, or to mea-34

sure RNA and proteins in-situ. Histological35

sections can be projected in two dimensions36

while preserving spatial information using se-37

quencing arrays [Ståhl et al., 2016]. Whole38

tissues can be decomposed using the Niche-39

seq approach [Medaglia et al., 2017]: here40

a group of cells are specifically labeled with 41

a fluorescent signal, sorted and subjected to 42

scRNA-seq. The Slide-seq approach uses an 43

array of Drop-seq drops with known barcodes 44

to dissolve corresponding slide sites and se- 45

quence them with the respective barcodes 46

[Rodriques et al., 2019]. Ultimately, one 47

would like to sequence inside a tissue with- 48

out dissociating the cells and without compro- 49

mising on the unbiased nature of scRNA-seq. 50

A preliminary approach has been proposed 51

by Lee et al. [2015] coined FISSEQ (Fluo- 52

rescent in-situ sequencing). Lubeck et al. 53

[2014] have shown a first approach to itera- 54

tively apply fluorescence in-situ hybridization 55

to measure hundreds of RNA species simulta- 56

neously, called seqFISH. SeqFISH+ scales the 57

FISH barcoding strategy to 10,000 genes by 58

splitting each of four barcode locations to be 59

scanned into 20 separate readings to avoid sig- 60

nal crowding [Eng et al., 2019]. Based on a 61

related principle, MERFISH was proposed by 62

Chen et al. [2015], which enables to measure 63

hundreds to thousands of transcripts in sin- 64

gle cells simultaneously while retaining spa- 65

tial coordinates [Moffitt et al., 2016]. Here, 66

even the subcellular coordinates of each in- 67

dividual transcript are retained. In addition 68

to the methods that provide in-situ measure- 69

ments of RNA, Giesen et al. [2014] and An- 70

gelo et al. [2014] use mass cytometry tech- 71

nology to quantify the abundance of proteins 72

while preserving subcellular resolution. Fi- 73

nally, the recently described Digital Spatial 74

Profiling [DSP, Merritt et al., 2019, Van and 75

Blank, 2019] promises to provide both RNA 76

and protein measurements with spatial reso- 77

lution. 78

For determining cell types, or clustering 79

cells into groups that conduct a common func- 80

tion, several methods are available [Zhang 81

et al., 2019a, Kiselev et al., 2018, Butler et al., 82

2018b]. None of these currently directly use 83

spatial information. In contrast, spatial cor- 84

18

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 19: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

relation methods have been used to detect1

aggregation of proteins [Shivanandan et al.,2

2016]. Shah et al. [2016] use seqFISH to mea-3

sure transcript abundance of a set of marker4

genes while retaining the spatial coordinates5

of the cells. Cells are clustered by gene ex-6

pression profiles and then assigned to regions7

in the brain based on their coordinates in the8

sample. Recently, Edsgärd et al. [2018] pre-9

sented a method to detect spatial differential10

expression patterns per gene based on marked11

point processes [Jacobsen, 2005]. Svensson12

et al. [2018a] provided a method to perform a13

spatially resolved differential expression anal-14

ysis. Here, spatial dependence for each gene15

is learned by non-parametric regression, en-16

abling the testing of the statistical signifi-17

cance for a gene to be differentially expressed18

in space.19

3.5.2 Open problems20

The central problem is to consider gene or21

transcript expression and spatial coordinates22

of cells, and derive an assignment of cells23

to classes, functional groups or cell types.24

While methods for both assigning cell types25

or functional groups and spatially resolved26

gene expression analysis are present, there is27

currently no method available that combines28

the two by leveraging information from spa-29

tial localization to determine the cell type or30

find groups of cells that conduct a common31

function. Depending on the studied biolog-32

ical question, it can be useful to constrain33

assignments with expectations on the homo-34

geneity of the tissue. For example, a set of35

cells grouped together might be required to36

appear in one or multiple clusters where lit-37

tle to no other cells are present. Such con-38

straints might depend on the investigated cell39

types or tissues. For example, in cancer, spa-40

tial patterns can occur on multiple scales,41

ranging from single infiltrating immune cells42

[Fridman et al., 2011] and minor subclones 43

[Swanton, 2012] to larger subclonal structures 44

or the embedding in surrounding normal tis- 45

sue and the tumor microenvironment [Cretu 46

and Brooks, 2007]. Currently, to the best of 47

our knowledge, there is no method available 48

that would allow the encoding of such prior 49

knowledge while inferring cell types by inte- 50

grating spatial information with transcript or 51

gene expression. Another important aspect 52

when modeling the relation between space 53

and expression is whether uncertainty in the 54

measurements can be propagated to down- 55

stream analyses. For example, it is desir- 56

able to rely on transcript quantification meth- 57

ods that provide the posterior distribution 58

of transcript expression [Kharchenko et al., 59

2014, Köster et al., 2017] and propagate this 60

information to the spatial analysis. Finally, 61

in light of issues with sparsity in single-cell 62

measurements (section 3.1), it appears desir- 63

able to integrate spatial information into the 64

quantification itself, and e.g. use neighboring 65

cells within the same tissue for imputation 66

or the inference of a posterior distribution of 67

transcript expression. 68

4 Challenges in single-cell 69

genomics 70

With every cell division in an organism, the 71

genome can be altered through mutational 72

events ranging from point mutations, over 73

short insertions and deletions, to large scale 74

copy number variation and complex struc- 75

tural variants. In cancer, the entire reper- 76

toire of these genetic events can occur during 77

disease progression (Figure 4). The resulting 78

tumor cell populations are highly heteroge- 79

neous. As tumor heterogeneity can predict 80

patient survival and response to therapy, in- 81

cluding immunotherapy, quantifying this het- 82

19

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 20: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

erogeneity and understanding its dynamics1

are crucial for improving diagnosis and ther-2

apeutic choices (Figure 4).3

Classic bulk sequencing data of tumor sam-4

ples taken during surgery are always a mix-5

ture of tumor and normal cells (including6

e.g. invading immune cells). This means7

that disentangling mutational profiles of tu-8

mor subclones will always be challenging,9

which especially holds for rare subclones that10

could nevertheless be the ones e.g. bearing11

resistance mutation combinations prior to a12

treatment (Figure 4). Here, the sequencing13

of (sufficient) single cells holds the exciting14

promise of directly identifying and character-15

izing those subclone profiles (Figure 4).16

4.1 Challenge VI: Improving17

single-cell DNA sequencing18

data quality and scaling to19

more cells20

Despite accumulating technological advances21

in the field, the task of characterizing tumor22

heterogeneity and inferring the evolutionary23

mechanisms that give rise to this heterogene-24

ity is still hampered by multiple types of er-25

rors that occur during the process of scDNA-26

seq [Wang and Song, 2017, Hou et al., 2015,27

Gawad et al., 2016, Estévez-Gómez et al.,28

2018]. DNA sequencing technologies differ in29

their protocols of single-cell isolation and ly-30

sis, whole genome amplification (WGA), and31

library preparation [Zhang et al., 2016]. Fail-32

ure of cell isolation leads to the presence—33

albeit usually in a small proportion—of dou-34

blets instead of single cells and the cell lysis35

step can introduce artificial sequence modifi-36

cation. The main source of error, however,37

is the WGA step. Single cells only carry38

two (in case of normal cells) up to tens (in39

amplified regions of disease cells) of copies40

of DNA molecules, which need to be sub-41

stantially amplified from pico to nanogram 42

scale to read their sequence. Amplification- 43

related artifacts include i) amplification er- 44

rors, i.e. sequence alterations such as single 45

nucleotide or indel errors introduced by the 46

polymerase in the copy process, ii) allelic bias, 47

i.e. the differential amplification of the alle- 48

les at a genomic locus (if one allele fails to 49

amplify at all, this is an allele dropout, if 50

both fail, a locus dropout), iii) chimeric se- 51

quences. The majority of WGA approaches 52

can be broadly classified into methods based 53

on polymerase chain reaction (PCR) and 54

multiple displacement amplification (MDA). 55

The PCR-based technologies include degener- 56

ate oligonucleotide-primed PCR (DOP-PCR) 57

[Telenius et al., 1992], linker-adapter PCR 58

[Klein et al., 1999], primer extension pre- 59

amplification PCR (PEP-PCR-/I-PEP-PCR) 60

[Zhang et al., 1992, Arneson et al., 2008] 61

and others. They require thermostable poly- 62

merases that withstand all temperatures dur- 63

ing the cycling. More recent MDA-based 64

technologies use the strand-displacing, high- 65

fidelity Φ29 DNA polymerase [Blanco et al., 66

1989, Dean et al., 2002, Spits et al., 2006b, 67

Picher et al., 2016, Paez et al., 2004, Spits 68

et al., 2006a] for an isothermal reaction, as 69

it is not stable at common PCR temperature 70

maxima. Another approach, called multiple 71

annealing and looping-based amplification cy- 72

cles (MALBAC) combines MDA and PCR, 73

and relies on the Bacillus stearothermophilus 74

polymerase for the MDA process [Zong et al., 75

2012]. 76

4.1.1 Status 77

Ideally, scDNA-seq should provide informa- 78

tion about the entire repertoire of distinct 79

events that occurred in the genome of a single 80

cell, such as copy number alterations, genomic 81

rearrangements, together with SNVs and 82

smaller insertion and deletion variants. How- 83

20

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 21: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

subclonecompe��on

immunedetec�on

tumourresec�on

therapy

metastasisresec�on

selec�onpressures

biopsy

�mespace

small muta�on(SNV / indel)

amplifica�on

large dele�on

sampled cell theore�calsubclone phylogeny

Figure 4: From initiation of a tumor to its detection, resection and possible metastasis, itwill evolve somatically. New genomic mutations can confer a selective advantage to theresulting new subclone, that can allow it to outcompete other tumor subclones (subclonecompetition). At the same time, the acting selection pressures can change over time, e.g. dueto new subclones arising, the immune system detecting certain subclones, or as a result oftherapy. Understanding such selective regimes—and how specific mutations alter a subclone’ssusceptibility to changes in selection pressures—will help construct an evolutionary model oftumorigenesis. And it is only within this evolutionary model, that more efficient and morepatient-specific treatments can be developed. For such a model, unambiguously identifyingmutation profiles of subclones via scDNA-seq of resected or biopsied single cells is crucial.

21

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 22: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

ever, amplification biases and errors present a1

serious challenge to variant calling [de Bourcy2

et al., 2014, Hou et al., 2015, Huang et al.,3

2015, Estévez-Gómez et al., 2018]: It is4

broadly accepted that different WGA tech-5

nologies should be used depending on whether6

SNVs or whether copy number variation7

(CNV)s are to be detected, as the distinct8

technologies differ in the magnitude of ampli-9

fication bias, and the rates of amplification er-10

rors and chimera formation. Generally, PCR-11

based approaches with more uniform coverage12

should be used for CNV calling, while MDA-13

based methods that result in less single nu-14

cleotide errors should be applied for SNV call-15

ing. The goal must thus be to (i) improve the16

coverage uniformity of MDA-based methods,17

(ii) reduce the error rate of the PCR-based18

methods, or (iii) create new methods that ex-19

hibit both a low error rate and a more uni-20

form amplification of alleles. Recent years21

witnessed intensive research in these direc-22

tions, e.g.: (i) Improved coverage uniformity23

for MDA has been achieved using droplet mi-24

crofluidics-based methods, resulting in emul-25

sion WGA (eWGA, [Fu et al., 2015]), sin-26

gle droplet MDA (sd-MDA, [Hosokawa et al.,27

2017]) and digital droplet multiple displace-28

ment amplification (ddMDA, [Sidore et al.,29

2016]). A second approach has been to cou-30

ple the Φ29 DNA polymerase to a primase31

to reduce priming bias [Picher et al., 2016].32

Both these approaches improve the calling33

of CNVs from the resulting data. (ii) One34

way to reduce the amplification error rate35

of the PCR-based methods (including MAL-36

BAC) would be to employ a thermostable37

polymerase (necessary for use in PCR) with38

proof-reading activity similar to Φ29 DNA39

polymerase. While SD polymerase combines40

thermostability with strand displacement and41

has been tested for WGA [Blagodatskikh42

et al., 2017], we are not aware of any PCR43

DNA polymerases with a fidelity in the range44

of Φ29 DNA polymerase [Potapov and Ong, 45

2017] having been used in PCR-based WGA. 46

(iii) Three newer methods use an entirely dif- 47

ferent approach: They randomly insert trans- 48

posons into the whole genome and then lever- 49

age these as priming sites for library prepara- 50

tion and amplification. Direct library prepa- 51

ration (DLP, [Zahn et al., 2017a]), as the 52

name suggests, directly sequences the result- 53

ing shallow library without any amplification, 54

allowing only for CNV calling. It has re- 55

cently been further improved to account for 56

doublets and dead cells and scaled to 80,000 57

single cells [Laks et al., 2018]. Transposon 58

Barcoded (TnBC) follows the transposon in- 59

tegration with PCR amplification, making it 60

useful for CNV calling, but suffering from am- 61

plification errors [Xi et al., 2017]. Finally, 62

Linear Amplification via Transposon Inser- 63

tion (LIANTI, [Chen et al., 2017]) introduces 64

a new approach to dealing with amplifica- 65

tion errors. Instead of exponential amplifi- 66

cation, their amplification process is linear: 67

From promoters included in the transposon 68

insertion, they transcribe the original tagged 69

sequence multiple times and then use reverse 70

transcription and second-strand synthesis to 71

obtain double-stranded DNA for sequencing. 72

As errors introduced by the individual pro- 73

cesses are not propagated, they should be 74

unique to individual copies and accordingly 75

the authors report a false positive rate that is 76

even lower than for MDA [Chen et al., 2017]. 77

4.1.2 Open problems 78

These recent developments promise scalable 79

methodology for scDNA-seq comparable to 80

that already available for scRNA-seq, while 81

at the same time reducing previously limit- 82

ing errors and biases. In addition to fur- 83

ther improvements over the described exist- 84

ing methods, the major challenge will be to 85

continuously and systematically evaluate the 86

22

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 23: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

whole range of promising WGA methods for1

the identification of all types of genetic varia-2

tion from SNVs over smaller insertions and3

deletions up to copy number variation and4

structural variants.5

4.2 Challenge VII: Errors and6

missing data in the7

identification of features /8

variation from single-cell9

DNA sequencing data.10

The aim of scDNA sequencing usually is to11

track somatic evolution at the cellular level,12

that is, at the finest resolution possible rela-13

tive to the laws of reproduction (cell division,14

Figure 5). Examples refer to identifying het-15

erogeneity and tracking evolution in cancer,16

as the likely most predominant use case (also17

see below in section 5), but also to monitor-18

ing the interaction of somatic mutation with19

developmental and differentiation processes.20

To track genetic drifts, selective pressures, or21

other phenomena inherent to the development22

of cell clones or types (Figure 4)—but also to23

stratify cancer patients for the presence of re-24

sistant subclones—it is instrumental to geno-25

type and also phase genetic variants in single26

cells with sufficiently high confidence.27

The major disturbing factor in scDNA-seq28

data is the WGA process (see section 4.1).29

All methodologies introduce amplification er-30

rors (false positive alternative alleles), but31

more drastic is the effect of amplification bias:32

the insufficient or complete failure of am-33

plification, which leads to imbalanced pro-34

portions or complete lack of variant alleles.35

Overall, one can distinguish between three36

cases: (i) an imbalanced proportion of al-37

leles, i.e. loci harboring heterozygous muta-38

tions where preferential amplification of one39

of the two alleles leads to read counts that40

are distorted, sometimes heavily; (ii) allele41

10000

11000

11100

110101100? 100?01?101

Figure 5: Mutations (colored stars) accumu-late in cells during somatic cell divisionsand can be used to reconstruct the develop-mental lineages of individual cells within anorganism (leaf nodes of the tree with muta-tional presence / absence profiles attached).However, insufficient or unbalanced WGAcan lead to the dropout of one or both alle-les at a genomic site. This can be mitigatedby better amplification methods, but alsoby computational and statistical methodsthat can account for or impute the missingvalues.

drop-out, i.e. loci harboring heterozygous mu- 42

tations where only one of the alleles was am- 43

plified and sequenced, and (iii) site drop-out, 44

which is the complete failure of amplification 45

of both alleles at a site and the resulting lack 46

of any observation of a certain position of the 47

genome. Note that (ii) can be considered an 48

extreme case of (i). 49

A sound imputation of missing alleles and 50

a sufficiently accurate quantification of un- 51

certainties will yield massive improvements in 52

geno- and haplotyping (phasing) somatic vari- 53

ants. This, in turn, is necessary to substan- 54

tially improve the identification of subclonal 55

genotypes and the tracking of evolutionary 56

23

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 24: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

developments. Potential improvements in this1

area include (i) more explicit accounting for2

possible scDNA-seq error types, (ii) integrat-3

ing with different data types with error pro-4

files different from scDNA-seq (e.g. bulk se-5

quencing or RNA sequencing), or (iii) inte-6

grating further knowledge of the process of7

somatic evolution, such as the constraints of8

phylogenetic relationships among cells, into9

variant calling models. In this latter context,10

it is important to realize that somatic evolu-11

tion is asexual. Thus, no recombination oc-12

curs during mitosis, eliminating a major dis-13

turbing factor usually encountered when aim-14

ing to reconstruct species or population trees15

from germline mutation profiles.16

4.2.1 Status17

Current single-cell specific SNV callers in-18

clude Monovar [Zafar et al., 2016] and SC-19

caller [Dong et al., 2017]. SCcaller de-20

tects somatic variants independently for each21

cell, but accounts for local allelic amplifica-22

tion biases by integrating across neighboring23

germline single-nucleotide polymorphisms. It24

exploits the fact that allele drop-out af-25

fects contiguous regions of the genome large26

enough to harbor several, and not only one,27

heterozygous mutation loci. Monovar uses28

an orthogonal approach to variant calling. It29

does not assume any dependency across sites,30

but instead handles low and uneven cover-31

age and false positive alternative alleles by32

integrating the sequencing information across33

multiple cells. While Monovar merely creates34

a consensus across cells, integrating across35

cells is particularly powerful if further knowl-36

edge about the dependency structure among37

cells is incorporated. As pointed out above,38

due to the lack of recombination, any sam-39

ple of cells derived from an organism shares40

an evolutionary history that can be described41

by a cell lineage tree (see section 5). This42

tree, however, is in general unknown and can 43

in turn only be reconstructed from single-cell 44

mutation profiles. A possible solution is to 45

infer both mutation calls and a cell lineage 46

tree at the same time, an approach taken by 47

a number of existing tools: single-cell Geno- 48

typer [Roth et al., 2016], SciCloneFit [Zafar 49

et al., 2018] and SciΦ [Singer et al., 2018]. 50

Finally, SSrGE, identifies SNVs correlated 51

with gene expression from scRNA-seq data 52

[Poirion et al., 2018]. 53

54

Some basic approaches to CNV calling from 55

scDNA-seq data are available. These are usu- 56

ally based on hidden markov models (HMMs) 57

where the hidden variables correspond to copy 58

number states, as e.g. in Aneufinder [Bakker 59

et al., 2016]. Another tool, Ginkgo, pro- 60

vides interactive CNV detection using circu- 61

lar binary segmentation, but is only avail- 62

able as a web-based tool [Garvin et al., 2015]. 63

ScRNA-seq data, which does not suffer from 64

the errors and biases of WGA, can also be 65

used to call CNVs or loss of heterozygosity 66

events: an approach called HoneyBADGER 67

[Fan et al., 2018] utilizes a probabilistic hid- 68

den Markov model, whereas the R package 69

inferCNV simply averages the expression over 70

adjacent genes [Patel et al., 2014]. 71

4.2.2 Open problems 72

SNV callers for scDNA-seq data have al- 73

ready incorporated amplification error rates 74

and allele dropout in their models. But 75

beyond these rates, the challenge remains 76

to further extend this into a full statistical 77

modeling of the amplification process, that 78

would inherently account for both errors and 79

biases, and more accurately quantify the 80

resulting uncertainties (see section 2.2). This 81

could be achieved by expanding models that 82

accurately quantify uncertainties in related 83

settings [Köster et al., 2019] and would 84

24

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 25: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

ultimately even allow reliable control of false1

discovery rates in the variant discovery and2

genotyping process. Such expanded models3

can build on a number of recent studies4

in this context, e.g. on a formalization in5

a recent preprint [Koptagel et al., 2018].6

Furthermore, such models could integrate the7

structure of cell lineage trees with the struc-8

ture implicit in haplotypes that link alleles.9

For haplotype phasing, Satas and Raphael10

[2018] recently proposed an approach based11

on contiguous stretches of amplification bias12

(similar to SCcaller, see above), whereas13

others propose read-backed phasing in two14

recent studies [Bohrson et al., 2019, Hård15

et al., 2019]. In addition, the integration16

with deep bulk sequencing data, as well as17

with (sc)RNA-seq data remains unexplored,18

although it promises to improve the precision19

of callers without compromising sensitivity.20

21

Identification of short insertions and22

deletions (indels) is another major challenge23

to be addressed: we are not aware of any24

scDNA-seq variant callers with those respec-25

tive capabilities.26

27

For copy number variation calling, soft-28

ware has previously been published mostly in29

conjunction with data-driven studies. Here,30

a systematic analysis of biases in the most31

common WGA methods for copy number32

variation calling (including newer methods33

to come) could further inform method devel-34

opment. The already mentioned approach35

of leveraging amplification bias for phasing36

could also be informative [Satas and Raphael,37

2018].38

39

The final challenge is a systematic compar-40

ison of tools beyond the respective software41

publications, which is still lacking for both42

SNV and CNV callers. This requires system-43

atic benchmarks, which in turn require simu-44

lation tools to generate synthetic datasets, as 45

well as sample-based benchmarking datasets 46

with a reasonably reliable ground truth (see 47

section 6.2). 48

5 Challenges in single-cell 49

phylogenomics 50

Single-cell variant profiles from scDNA-seq, 51

as described above (section 4.2), can be used 52

in computational models of somatic evolution, 53

including cancer evolution as an important 54

special case (Figure 4). For cancer, there is an 55

on-going, lively discussion about the very na- 56

ture of evolutionary processes at play, with 57

competing theories such as linear, branch- 58

ing, neutral, and punctuated evolution [Davis 59

et al., 2017]. 60

Models of cancer evolution may range from 61

a simple binary representation of the pres- 62

ence versus the absence of a particular mu- 63

tational event (Figure 5), to elaborate models 64

of the mechanisms and rates of distinct muta- 65

tional events. There are two main modeling 66

approaches that lend themselves to the anal- 67

ysis of tumor evolution [Altrock et al., 2015]: 68

phylogenetics and population genetics. 69

Phylogenetics comes with a rich reper- 70

toire of computational methods for likelihood- 71

based inference of phylogenetic trees [Felsen- 72

stein, 1981]. Traditionally, these methods are 73

used to reconstruct the evolutionary history 74

of a set of distinct species. However, they can 75

also be applied to cancer cells or subclones 76

(Figure 4). In this setting, tips of the phy- 77

logeny (also called leaves or taxa) represent 78

sampled and sequenced cells or subclones, 79

whereas inner nodes (also called ancestral) 80

represent their hypothetical common ances- 81

tors. The input for a phylogenetic inference 82

commonly consists of a multiple sequence 83

alignment (MSA) of molecular sequences for 84

25

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 26: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

the species of interest. For cancer phyloge-1

nies, one would concatenate the SNVs (and2

possibly other variant types) to assemble the3

input MSA. The key challenge for phyloge-4

netic method development comprises design-5

ing sequence evolution models that are (i) bio-6

logically realistic and yet (ii) computationally7

tractable for the increasingly large number of8

sequenced cells per patient and study.9

In population genetics, the tumor is under-10

stood as a population of evolving cells (Fig-11

ure 4). To date, population genetic theory12

has been used to model the initiation, pro-13

gression and spread of tumors from bulk se-14

quencing data [Foo et al., 2011, Beerenwinkel15

et al., 2007, Haeno et al., 2012]. The general16

mathematical framework behind these mod-17

els are branching processes [Kimmel and Ax-18

elrod, 2015], e.g. in models of the accumula-19

tion of driver and passenger mutations [Bozic20

et al., 2016, 2010]. Here, the driver mutations21

carry a fitness advantage, as might epistatic22

interactions among them [Bauer et al., 2014].23

On the other hand, passenger mutations are24

assumed to be neutral regarding fitness; they25

merely hitchhike along the fitness advantage26

of driver mutations they are linked to via27

their haplotype. The parameters of popu-28

lation genetic models describe inherent fea-29

tures of individual cells that are relevant for30

the evolution of their populations, e.g. fit-31

ness and the rates of birth, death, and mu-32

tations. Such cell-specific parameters should33

more naturally apply to and be derived from34

information gathered by sequencing of indi-35

vidual cells, as opposed to sequencing of bulk36

tissue samples. Models using these param-37

eters and the information about the evolu-38

tionary dynamics of cancer they contain, will39

e.g. be essential in the design of adaptive can-40

cer treatment strategies that aim at manag-41

ing subclonal tumor composition [Acar et al.,42

2019, Zhang et al., 2017].43

5.1 Challenge VIII: Scaling 44

phylogenetic models to many 45

cells and many sites 46

Even if given perfect data, phylogenetic mod- 47

els of tumor evolution would still face the 48

challenge of computational tractability, which 49

is mainly induced by: (i) the increasing num- 50

bers of cells that are sequenced in cancer 51

studies (see section 2.3), and (ii) the increas- 52

ing numbers of sites that can be queried per 53

genome (also see section 2.3). 54

5.1.1 Open problems 55

(i) While adding data from more single cells 56

will help improve the resolution of tumor phy- 57

logenies [Graybeal, 1998, Pollock et al., 2002], 58

this exacerbates one of the main challenges 59

of phylogenetic inference in general: the im- 60

mense space of possible tree topologies that 61

grows super-exponentially with the number of 62

taxa—in our case the number of single cells. 63

Therefore, phylogenetic inference is NP-hard 64

[Roch, 2006] under most scoring criteria (a 65

scoring criterion takes a given tree and MSA 66

to calculate how well the tree explains the 67

observed data). Calculating the given score 68

on all possible trees to find the tree that 69

best explains the data is computationally not 70

feasible for MSAs containing more than ap- 71

proximately 20 single cells, and thus requires 72

heuristic approaches to explore only promis- 73

ing parts of the tree search space. 74

(ii) In addition to the growing number of 75

cells (taxa), the breadth of genomic sites and 76

genomic alterations that can be queried per 77

genome also increases. Classical approaches 78

thus need not only scale with the number of 79

single cells queried (see above), but also with 80

the length of the input MSA. Here, previ- 81

ous efforts for parallelization [Aberer et al., 82

2014, Ayres, 2017] and other optimisation ef- 83

forts [Ogilvie et al., 2017] exist and can be 84

26

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 27: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

built upon. The breadth of sequencing data1

also allows determination of large numbers of2

invariant sites, which further raises the ques-3

tion of whether including them will change4

results of phylogenetic inferences in the con-5

text of cancer. Excluding invariant sites from6

the inference has been coined ascertainment7

bias, and for phylogenetic analyses of closely8

related individuals from a few populations it9

has been shown that accounting for ascertain-10

ment bias alters branch lengths, but not the11

resulting tree topologies per se [Leaché et al.,12

2015].13

5.2 Challenge IX: Integrating14

multiple types of features /15

variation into phylogenetic16

models17

Naturally, downstream analyses—like charac-18

terizing intratumor heterogeneity and infer-19

ring its evolutionary history—suffer from the20

unreliable variant detection in single cells.21

The better the quality of the variant calls22

gets, however, the more important it be-23

comes to model all types of available signal24

in mathematical models of tumor evolution,25

with the goal of increasing the resolution and26

reliability of the resulting trees; from SNVs,27

over smaller insertions and deletions, to large28

structural variation and CNVs (Figure 4). Fi-29

nally, to model somatic phylogenies compre-30

hensively, all available types of variants will31

have to be integrated into a comprehensive32

model. In the context of cancer, with ge-33

nomic destabilization occurring, this will be34

especially challenging.35

5.2.1 Status36

For phylogenetic tree inference from SNVs of37

single cells, a considerable number of tools38

exist. The early tools OncoNEM [Ross and39

Markowetz, 2016] and SCITE [Jahn et al., 40

2016] use a binary representation of presence 41

or absence of a particular SNV. They account 42

for false negatives, false positives and missing 43

information in SNV calls, where false neg- 44

atives are orders of magnitude more likely 45

to occur than false positives. The more re- 46

cent tool SiFit [Zafar et al., 2017] also uses 47

a binary SNV representation, but infers tu- 48

mor phylogenies allowing for both noise in 49

the calls and for violations of the infinite 50

sites assumption. Another approach allow- 51

ing for violations of the infinite sites assump- 52

tion is the extension of the Dollo parsimony 53

model to allow for k losses of a mutation 54

(Dollo-k) [El-Kebir, 2018, Ciccolella et al., 55

2018]. Single cell genotyper [Roth et al., 56

2016], SciCloneFit [Zafar et al., 2018], or SciΦ 57

[Singer et al., 2018] jointly call mutations in 58

individual cells and estimate the tumor phy- 59

logeny of these cells, directly from single-cell 60

raw sequencing data. In a recent work [Ko- 61

zlov, 2018], a standard phylogenetic infer- 62

ence tool RAxML-NG [Kozlov et al., 2019] 63

has been extended to handle single-cell SNV 64

data. In particular, this implements (i) a 10-s- 65

tate substitution model to represent all pos- 66

sible unphased diploid genotypes and (ii) an 67

explicit error model for allelic dropout and 68

genotyping/amplification errors. Initial ex- 69

periments showed that—although a 10-state 70

model incorporates more information—it out- 71

performed the ternary model (as used by 72

SiFit) only slightly and only in simulations 73

with very high error rates (10%-50%). How- 74

ever, further analysis suggests that benefits of 75

the genotype model become much more pro- 76

nounced with an increasing number of cells 77

and, in particular, an increasing number of 78

SNVs (Kozlov, personal communication). 79

While there are no tools yet available to 80

identify insertions and deletions from scDNA- 81

seq (see challenge above), it is only a matter 82

of time until such callers will become avail- 83

27

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 28: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

able. As they can already be identified from1

bulk sequencing data, some precious efforts2

to incorporate indels in addition to substitu-3

tions into classical phylogenetic models exist:4

A decade ago, a simple probabilistic model5

of indel evolution was proposed [Rivas and6

Eddy, 2008]. But although some progress7

has been made since then, such models are8

less tractable than the respective substitution9

models [Holmes, 2017].10

Incorporating CNVs in the reconstruction11

of tumor phylogeny can be helpful for un-12

derstanding tumor progressions, as they rep-13

resent one of the most common mutation14

types associated to tumor hypermutability15

[Kim et al., 2013]. CNVs in single cells were16

extensively studied in the context of tumor17

evolution and clonal dynamics [Navin et al.,18

2011, Eirew et al., 2015]. Reconstructing a19

phylogeny with CNVs is not straightforward.20

The challenges are not only related to ex-21

perimental limits, such as the complexity of22

bulk sequencing data [Zaccaria et al., 2017]23

and amplification biases [Gawad et al., 2016],24

but also involve computational constraints.25

First of all, the causal mechanisms, such as26

breakage-fusion-bridge cycles [Bignell et al.,27

2007] and chromosome missegregation [San-28

taguida et al., 2017], can lead to overlapping29

copy number events [Schwarz et al., 2014].30

Secondly, inferring a phylogeny with CNV31

data requires quantifying transition proba-32

bilities for changes in copy numbers based33

on the causal mechanisms. Towards that34

goal, approaches to calculate the distance be-35

tween whole copy number profiles [Zeira and36

Shamir, 2018] are a first step. But for them,37

a number of challenges remain, with several38

of the underlying problems known to be NP-39

hard [Zeira and Shamir, 2018].40

Co-occurrence of all of the above variation41

types further complicates mathematical mod-42

eling, as these events are not independent.43

For example, multiple SNVs that occurred in44

the process of tumor evolution may disappear 45

at once via a deletion of a large genomic 46

region. In addition, recent analyses revealed 47

recurrence and loss of particular mutational 48

hits at specific sites in the life histories of 49

tumors [Kuipers et al., 2017], undermining 50

the validity of the so called infinite sites 51

assumption, commonly made by phylogenetic 52

models: it assumes an infinite number of 53

genomic sites, thus rendering a repeated 54

mutational hit of the same genomic site along 55

a phylogeny impossible. 56

57

5.2.2 Open problems 58

For phylogenetic reconstruction from SNVs, 59

we anticipate a shift towards leveraging im- 60

provements in input data quality as they are 61

achieved through better amplification meth- 62

ods and SNV callers (see challenges above). 63

For indels, variant callers for scDNA-seq data 64

remain to be developed (see challenge above), 65

but are anticipated. Thus, indel modeling 66

efforts for phylogenetic reconstruction from 67

bulk sequencing data should be adapted. For 68

phylogenetic inference from CNVs, the ma- 69

jor challenges are (i) determining correct mu- 70

tational profiles and (ii) computing realis- 71

tic transition probabilities between those pro- 72

files. 73

The final challenge will be to incorporate all 74

of the above phenomena into a holistic model 75

of cancer evolution. However, this will sub- 76

stantially increase the computational cost of 77

reconstructing the evolutionary history of tu- 78

mor cells. Thus, one needs to carefully de- 79

termine which phenomena actually do mat- 80

ter (e.g. which parameters even affect the fi- 81

nal tree topology) and which features can be 82

measured (section 4.1) and called (section 4.2) 83

with sufficient accuracy to actually improve 84

modeling results. As a consequence one might 85

be able to devise more lightweight models for 86

28

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 29: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

answering specific questions and invest con-1

siderable effort into optimizing novel tools at2

the algorithmic and technical level (see chal-3

lenge below).4

5.3 Challenge X: Inferring5

population genetic6

parameters of tumor7

heterogeneity by model8

integration9

Tumor heterogeneity is the result of an evo-10

lutionary journey of tumor cell populations11

through both time and space [Swanton, 2012,12

McGranahan and Swanton, 2017]. Microen-13

vironmental factors like access to the vascular14

system and infiltration with immune cells dif-15

fer greatly—for regions within the original tu-16

mor as well as between the main tumor and17

metastases, and across different time points18

[Yang and Lin, 2017]. This imposes different19

selective pressures on different tumor cells,20

driving the formation of tumor subclones and21

thus determining disease progression (includ-22

ing metastatic potential), patient outcome23

and susceptibility to treatment (Junttila and24

de Sauvage [2013], Corredor et al. [2018] and25

Figure 4). However, even the answers to very26

basic questions about the resulting dynam-27

ics remain unanswered [Turajlic and Swanton,28

2016]: for example, whether metastatic seed-29

ing from the primary tumor occurs early and30

multiple times in parallel, with metastases di-31

verging genetically from the primary tumor,32

or whether seeding of metastases occurs late,33

from a far-developed subclone in the primary34

tumor, with that subclone seeding multiple35

locations with a genotype closer to the late-36

stage primary tumor; and whether a single37

cell can seed a metastasis, or whether the joint38

migration of a set of cells is required. Here, sc-39

seq can provide invaluable resolution [Navin40

et al., 2011].41

Although many mathematical models of 42

tumor evolution have been proposed [Bozic 43

et al., 2010, 2016, Altrock et al., 2015, Foo 44

et al., 2011, Michor et al., 2004], fundamen- 45

tal parameters characterizing the evolution- 46

ary processes remain elusive. To quantita- 47

tively describe the tumor evolution process 48

and evaluate different possible modes against 49

each other (e.g. modes of metastatic seeding), 50

we would like to estimate fitness values of 51

individual mutations and mutation combina- 52

tions, as well as rates of mutation, cell birth 53

and cell death—if possible, on the level of sub- 54

clones. These parameters determine the un- 55

derlying fitness landscape of individual cells 56

within their microenvironment, which in turn 57

determines the evolutionary dynamics of can- 58

cer progression. 59

5.3.1 Status 60

Recent technological advances already allow 61

for measuring the arrangement and relation- 62

ships of tumor cells in space, with cell loca- 63

tion basically amounting to a second measure- 64

ment type requiring data integration within 65

a cell (Approach 3 in section 6.1, Figure 6 66

and Table 2). While in vivo imaging tech- 67

niques might also become interesting for ob- 68

taining time series data in the future [Larue 69

et al., 2017], the automated analysis of whole 70

slide immunohistochemistry images [Ghaz- 71

navi et al., 2013, Saco et al., 2016] seems 72

the most promising in the context of cancer 73

and mutational profiles from scDNA-seq. It 74

is already amenable to single-cell extraction 75

of characterized cells with known spatial con- 76

text and subsequent scDNA-seq. Using laser 77

capture microdissection [Datta et al., 2015] 78

hundreds of single cells have recently been 79

isolated from tissue sections and analyzed 80

for copy number variation [Casasent et al., 81

2018]. For cell and tissue characterization in 82

immunohistochemical images, machine learn- 83

29

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 30: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

ing models are trained to segment the im-1

ages and recognize structures within tissues2

and cells [Gurcan et al., 2009, Irshad et al.,3

2014, Komura and Ishikawa, 2018]: They can4

e.g. determine the densities and quantities5

of mitotic nuclei, vascular invasion, immune6

cell infiltration on the tissue level, as well as7

stained biomarkers on the level of the individ-8

ual cell. These are key parameters of the tu-9

mor microenvironment, characterizing the in-10

teraction tumor cells with their environment11

in space [Yuan, 2016, Heindl et al., 2015].12

Mathematical models of tumor popula-13

tion genetics have classically assumed well14

mixed populations, ignoring any spatial struc-15

ture, let alone evolutionary microenviron-16

ments. Recently, methods have been ex-17

tended to account for some spatial structure18

and have already led to refined predictions of19

the waiting time to cancer [Martens et al.,20

2011] and intratumor heterogeneity [Waclaw21

et al., 2015]. In particular, spatial statistics22

has been proposed for the quantitative sta-23

tistical analysis of cancer digital pathology24

imaging [Heindl et al., 2015], but the idea25

is applicable to other spatially resolved read-26

outs. A number of methods were proposed27

to model cell-cell interactions [Schapiro et al.,28

2017, Arnol et al., 2018] or to predict single-29

cell expression from microenvironmental fea-30

tures [Goltsev et al., 2018, Battich et al.,31

2015]. With the advent of spatially resolved32

DNA sequencing, models can be adapted to33

the new data.34

Regarding temporal resolution, it is already35

common to sequence tumor material from dif-36

ferent timepoints: biopsies used for diagnosis,37

resected tumors, lymph nodes and metastases38

upon surgery and tumors after relapse. These39

time-points already lend themselves to tem-40

poral analyses of clonal dynamics using bulk41

DNA sequencing data [Johnson et al., 2014].42

But scDNA-seq will help to increase the res-43

olution of subclonal genotypes. And inte-44

grating this clonal stratification across time- 45

points and with other readouts, such as cell 46

state markers, will allow to determine central 47

model parameters for the detection of positive 48

and negative selection, e.g. rates of prolifera- 49

tion, mutation and death. 50

To also leverage the kinship relationships 51

between cells, population genetic methods 52

and models could be integrated with ap- 53

proaches from phylogenetics. One prominent 54

example of this recent trend is the use of 55

the multi-species coalescent model for ana- 56

lyzing MSAs that contain several individuals 57

for several populations [Rannala and Yang, 58

2017, Liu et al., 2015]. This naturally trans- 59

lates into analyzing tumor subclones as pop- 60

ulations of single cells, capturing some of the 61

population structure seen in cancers. This 62

phylogenetic context also lends itself to mod- 63

eling differences in mutational rates and sig- 64

natures between different cell populations, 65

e.g. between normal somatic evolution before 66

tumor initiation and cancer evolution after 67

tumor initiation, or between different tumor 68

subclones. 69

In this setting, we will have to account 70

for heterotachy (see e.g. Kolaczkowski and 71

Thornton [2008]), that is, we cannot assume 72

a single model of substitution for the entire 73

tree, but have to allow different models to act 74

on distinct branches or subtrees/subclones. 75

Here, anything from a simple model of rate 76

heterogeneity (e.g. Yang [1994]) to an empir- 77

ical mixture model as used for protein evolu- 78

tion [Le et al., 2012] could be considered. 79

A recent example integrating population 80

genetics approaches with phylogenetics, is a 81

computational model for inference of fitness 82

landscapes of cancer clone populations using 83

scDNA-seq data, SCIFIL [Skums et al., 84

2019]. It estimates the maximum likelihood 85

fitness of clone variants by fitting a replicator 86

equation model onto a character-based tumor 87

phylogeny. 88

30

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 31: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

1

For the detection of positive selection, a2

number of phylogenetic and population ge-3

netic approaches have been proposed. Phy-4

logenetic trees may be used for detecting5

branches on which positive [Zhang et al.,6

2005] or diversifying episodic selection [Smith7

et al., 2015] is acting. The tests from the8

area of “classic” phylogenetics might serve as9

a starting point for exploring and adapting10

appropriate methods that will allow to asso-11

ciate positive selection events to branches of12

the tumor tree or specific evolutionary events.13

Evolutionary pressures are often quantified by14

the dN/dS ratio of non-synonymous and syn-15

onymous substitutions. In application to tu-16

mor cell populations, however, this ratio may17

not be applicable, as it has been shown to be18

relatively insensitive when applied to popula-19

tions within the same species [Kryazhimskiy20

and Plotkin, 2008]. Other measures have been21

proposed as better suited for detecting selec-22

tion within populations based on time-series23

data and could potentially be transferred to24

tumor cell populations [Neher et al., 2014,25

Gray et al., 2011, Steinbrück and McHardy,26

2011]. An open question is to which extent27

the above tests will be sensitive to errors in28

cancer data as they are known to produce29

high false positive rates in the classic phyloge-30

netic setting if the error rate in the input data31

is too high [Fletcher and Yang, 2010]. Com-32

putationally intense solutions for decreasing33

the high false positive rate have been pro-34

posed [Redelings, 2014], but they might not35

scale to cancer datasets. Importantly, devel-36

opment of tests for positive selection could37

contribute to the discussion of whether the38

evolution of tumors is driven by selection or39

neutral.40

For the detection of negative selection, time41

resolved measurements and resulting prolif-42

eration and death rates could prove equally43

promising. Further, approaches were de-44

veloped to discover epistatic interactions— 45

particularly synthetic lethality—from ge- 46

nomic and transcriptomic data in tumor 47

genomes and cancer cell lines [Szczurek et al., 48

2013, Jerby-Arnon et al., 2014], and patient 49

survival [Matlak and Szczurek, 2017]. Some of 50

these epistatic interactions, however, can be 51

hard to spot in bulk sequencing data, as they 52

may simply disappear because of a low fre- 53

quency. ScDNA-seq, ideally in a time resolved 54

fashion and across individuals, provides much 55

more insight into epistatic interactions than 56

bulk sequencing. The key feature is that 57

it is possible to identify pairs of mutations 58

that often occur simultaneously in the same 59

genome, and pairs that rarely or never do. 60

That is, cells affected by negatively selected 61

or synthetic lethal mutations will go extinct in 62

the tumor population and thus their genotype 63

with the synthetic lethal mutations occurring 64

together will not be observed. Cell death, 65

however, can be the result of mere chance, so 66

to detect significant negative pressures, large 67

cohorts of repeated time resolved experiments 68

would have to be performed. 69

5.3.2 Open problems 70

With an increased resolution of scDNA-seq 71

(section 4.1) and more work on the scDNA- 72

seq challenges described in other sections, it 73

will be possible to determine subclone geno- 74

types in more detail. 75

The first challenge will be to integrate 76

this with the spatial location of single cells 77

obtained from other measurements. This 78

will enable determining whether cells from 79

the same subclones are co-located, whether 80

metastases are founded recurrently by the 81

same subclone(s) and whether individual 82

metastases are founded by individual or mul- 83

tiple subclones. A number of studies utilizing 84

multiple region samples from the same tumor 85

and from distant metastases already paved 86

31

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 32: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

the way in investigating these questions [Tu-1

rajlic and Swanton, 2016]. Still, only single-2

cell spatial resolution will allow identification3

of specific individual genotypes in specific lo-4

cations and the drawing precise conclusions.5

The second challenge will be to determine6

rates of proliferation and death per subclone.7

This could be achieved by measuring num-8

bers of mitotic and apoptotic cells per sub-9

clone or by integrating subclone abundance10

profiles across time points. Good estimates11

of these basic parameters will greatly benefit12

models, e.g. for the detection of positive and13

negative selection in cancer.14

A third challenge will be to determine15

subclone-specific rates of mutation. Here, in-16

tegration of models from population genetics17

and phylogenetics holds promise.18

A fourth challenge will be to devise ways19

to determine further relevant model parame-20

ters. For example, comparing expanded sub-21

clones in drug screens to determine subclone22

fitness under different treatment regimes can23

both help to predict subclone resistance (and24

thus expected treatment success) and further25

inform cancer evolution models.26

A final step will then be to put all these27

parameters into context with further infor-28

mation about local microenvironments (such29

as vascular invasion and immune cell infiltra-30

tion), to estimate the selection potential of31

such local factors for or against different sub-32

clones.33

32

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 33: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

...

...

...

...

...

...

...

A B C DE F G ...

Approach 0

...

...

...

...

...

...

...

A B C D E F G ...

?

?

?

?

?

?

...

...

...

...

...

...

...

A B C D E F G ...

Approach 1

...

...

...

...

...

...

...

A B C D E F G ...3 5

3 1

Approach 2

...

...

...

...

...

...

...

A B C D E F G ...

Approach 3

x y zp1 ...p2p3p4p5

...

...

...

...

...

A B C D E F G ...

...

...

...

...

...

muta�onal profile

gene expression

...

...

...

...

...

...

...

...

...

...

...

...

A B C D E F G ...

gene methyla�on

chroma�n accessibility

loca�on

...

...

...

...

...

...

A B C D E F G ...Approach 4scDNA-seq

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

scRNA-seqclonalprofiles

Approach 5

...

...

...

...

...

...

A B C D E F G ...

...

p1 ...p2p3p4p5

...

...

...

...

...

...

...

Figure 6: Approaches for integrating single-cell measurement datasets across measurementtypes, samples and experiments, as also described in Table 2.Approach 0 Clustering of cells from one sample from one experiment, no data integration isneeded. Approach 1 Cell populations / clusters from multiple samples but the same mea-surement type need to be linked. Approach 2 For cell populations / clusters across multipleexperiments, stable reference systems like cell atlases are needed (compare Figure 1). Ap-

proach 3 Whenever multiple measurement types can be obtained from the same cell, theyare automatically linked. However, this setup highlights the problem of data sparsity of allavailable measurement types and the dependency of measurement types that needs to beaccounted for. Approach 4 When multiple measurement types cannot be obtained fromthe same cell, a solution is to obtain them from cells of the same cell population. However,this combines the problems of Approach 1 with those of Approach 3. Approach 5 Onepossibility for easing data integration across measurement types from separate cells wouldbe to have a stable reference (cell atlas) across multiple measurement types. Effectively, thiscombines the problems of Approaches 2, 3 and 4.

33

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 34: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Integration example MTs example AMs Promises Challenges0 none scDNA-seq,

scRNA-seq,merFISH

clustering /unsupervised

identify new celltypes and states

technical noise

1 within 1 MT,within 1 exp,across > 1 smps

scDNA-seq,scRNA-seq,merFISH

differentialanalyses,time series,spatial sampling

identify effectsacross samplegroups, time andspace

technical noise; batch effects;validate cell type assignments

2 within 1 MT,across > 1 exp,across > 1 smps,

scRNA-seq,merFISH

map cells tostable reference(cell atlas)

accelerate analyses;increase sample size& generalize obser-vations

technical noise; batch effects;validate cell type assignments;standards across experimentalcenters

3 across > 1 MTs,within 1 exp,within 1 cell

scG&T-seq,scM&T-seq,seqFISH

MOFA,DIABLO,MINT

holistic view of biol.processes withincell;quantification ofdependency of MTs

scaling cell throughput;MT combinations limited;dependency of MTs;data sparsity

4 across > 1 MTs,within 1 exp,across > 1 cells,within 1 cell pop

scDNA-seq +

scRNA-seq,DNA-seq +

scRNA-seq

Cardelino,Clonealign,MATCHER

use existing datasets(faster than 3);flexible experimen-tal design

technical noise;validate cell / data grouping;test assumptions for integratingdata

5 across > 1 MTs,across > 1 exps,across > 1 smps,within cells

hypothetical:any combina-tion

hypothetical:multi-omicHCA,single-cellTCGA

comprehensive char-acterizations of bio-logical systems

all from approaches 2, 3 & 4;standards across experimentalcenters

Table 2: Approaches for data integration and their potential.Abbreviations: AM – analysis method; exp(s) – experiment(s); HCA – human cell atlas;MT – measurement type; smps – samples; TCGA – The Cancer Genome Atlas

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 35: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

6 Overarching challenges1

6.1 Challenge XI: Integration of2

single-cell data: across3

samples, experiments and4

types of measurement5

Biological processes are complex and dy-6

namic, varying across cells and organisms. To7

comprehensively analyze such processes, dif-8

ferent types of measurements from multiple9

experiments need to be obtained and inte-10

grated. Depending on the actual research11

question, such experiments will refer to dif-12

ferent time points, tissues or organisms. For13

different measurement types, we put particu-14

lar emphasis on the combination of scRNA-15

seq and scDNA-seq data, although augment-16

ing sequencing data with records on protein17

or metabolite levels is also possible.18

Since the exploration of complex, dynamic19

and variable processes requires the integration20

of data from multiple experiments, we need21

flexible but rigorous statistical and compu-22

tational frameworks to support that integra-23

tion. See Table 2 and Figure 6 for an overview24

of how the issues in creating such frameworks25

can vary relative to the particular problem2.26

When aiming at the identification of pat-27

terns of differential expression, so as to char-28

acterize variability across organisms, individ-29

uals, or location, data refers to the same30

(unique) measurement type (for example,31

only scRNA-seq), but stems from different32

time points, different locations (such as dif-33

ferent tissues or sites in a tumor), or different34

organisms. See Approach 1 in Figure 6 and35

Table 2 for methodological challenges arising36

2Graph representation in Figure 6 Approaches 2 and5 taken from Wolf et al. [2019], Fig. 3, providedunder Creative Commons Attribution 4.0 Interna-tional License (http://creativecommons.org/licenses/by/4.0/)

from this scenario. 37

Another scenario arises when aiming at a 38

general increase in sample sizes, so as to gen- 39

eralize (and statistically corroborate) obser- 40

vations. The increase in generality may fur- 41

ther support the construction of a reference 42

system, such as a cell atlas, the existence of 43

which can support decisive speed-ups when 44

classifying cells or cell states, investigated in 45

subsequent experiments (see section 3.3). In- 46

creasing sample sizes often means that data is 47

raised across multiple experiments of identi- 48

cal setup, for example experimental replicates 49

possibly raised in different laboratories, such 50

that statistically accounting for batch effects 51

is a decisive factor. See Approach 2 in Fig- 52

ure 6 and Table 2 for respective methodolog- 53

ical challenges. 54

Yet another scenario manifests when try- 55

ing to unravel complexity and coordination 56

of intracellular biological processes, as well 57

as their mutual dependencies, so as to draw 58

a comprehensive picture of a single cell. In 59

this, an optimal setup is to raise data from 60

just one single cell across multiple experi- 61

ments referring to different types of mea- 62

surements, such as scDNA-seq, scRNA-seq, 63

possibly further augmented by measurements 64

of chromatin accessibility, gene methylation, 65

proteins or metabolites. See Approach 3 in 66

Figure 6 and Table 2 for this scenario. 67

Co-measuring different and possibly con- 68

curring types of quantities, for example 69

scRNA-seq and scDNA-seq [Kong et al., 70

2019], in just one single cell can be experi- 71

mentally challenging or even just impossible 72

at this point in time. An exit strategy to this 73

problem is to raise a population of cells that 74

is coherent in terms of cell type and state. 75

One then spreads the different measurements 76

across several single cells, all of which are 77

drawn from this population. Upon having ap- 78

plied the different measurements on different 79

single cells, one needs to combine the data 80

35

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 36: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

raised in a way that is biologically meaning-1

ful, respecting that each measurement stems2

from a different cell. Note that this approach3

encompasses the possibility to raise data both4

from single cells, and from bulks of cells. An5

example for the latter are bulk sequencing de-6

rived genotypes which one uses for imputa-7

tion of missing values or the quantification of8

data that have remained uncertain in single9

cells that stem from the same population as10

the bulk. The integration of different types of11

data raised across multiple single cells, pos-12

sibly including bulk data, casts issues that13

deserve attention in their own right (see Ap-14

proach 4 in Figure 6 and Table 2), because15

these issues can substantially differ from the16

methods referring to Approach 3.17

The most comprehensive goal, finally, may18

be to gain deeper insight into the complexity19

of (intra-)cellular circuits, and to chart20

their variability across time, tissues, and21

populations. Mapping cellular circuits in this22

comprehensive manner requires to take com-23

plementary and concurring measurements24

in single cells and across multiple single25

cells, possibly also across time, tissues and26

populations. Approach 5 in Figure 6 and27

Table 2 deals with this holistic approach to28

examining single cells. The ultimate goal is29

to comprehensively characterize biological30

systems, which requires to operate at the31

single-cell level, because one would not gain32

sufficient insight otherwise.33

34

The challenges just outlined in terms of Ap-35

proaches 1-5 in Figure 6 and Table 2 all are36

affected by the issues that influence single-37

cell data analysis in general, namely: (i) the38

varying resolution levels that are of interest39

depending on the research question at hand40

(section 2.1); (ii) the uncertainty of any mea-41

surements and how to quantify it for and dur-42

ing the analyses (section 2.2) and (iii) the43

scaling of single-cell methodology to more44

cells and more features measured at once (sec- 45

tion 2.3). All of these further compound the 46

most important challenge in the integration 47

of single-cell data: to link data from the dif- 48

ferent sources in a way that is biologically 49

meaningful and supports the intended anal- 50

ysis. It is an immediate insight that the 51

maps that describe how data from the differ- 52

ent sources is linked, increase in complexity 53

on increasing amounts of samples, time points 54

and types of measurements (Figure 6, Ta- 55

ble 2): Linking multiple samples referring to 56

the same quantity measured within one exper- 57

iment (Approach 1 in Figure 6 and Table 2) 58

or across several experiments (Approach 2) 59

needs to account for batch effects. Of course, 60

whenever possible, batch effects should be 61

minimized by establishing (global) standards 62

affecting experimental centers worldwide to 63

streamline common initiatives. Nevertheless, 64

even if standards have been successfully es- 65

tablished, additional validation of, for exam- 66

ple, assignments of cells to types and states 67

may be required. 68

The integration of measurements on mul- 69

tiple quantities (such as scRNA-seq and 70

scDNA-seq) raised in one single cell (Ap- 71

proach 3) needs to account for dependencies 72

if phenomena are concurrent. An illustrative 73

example is to measure copy number variation 74

(through scDNA-seq) or methylation so as to 75

investigate their effects on RNA levels (mea- 76

sured through scRNA-seq). 77

Linking multiple types of measurement 78

across different cells from the same cell pop- 79

ulation (Approach 4) may require the group- 80

ing of cells after experiments have been per- 81

formed, because only then does disturbing 82

variability among the (prior to the experiment 83

assumed coherent) different cells become ev- 84

ident. An example is to group cells based 85

on commonalities or differences in their geno- 86

type profile, having become evident only after 87

the application of a scDNA-seq experiment. 88

36

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 37: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Any assumptions that underlie these possible1

groupings need to resist thorough statistical2

testing and functional validation.3

6.1.1 Status4

For unsupervised clustering (Approach 0 in5

Figure 6 and Table 2), method development is6

a well-established field. Remaining challenges7

have already been identified systematically,8

see Duò et al. [2018], Freytag et al. [2018],9

Kiselev et al. [2019].10

For integrating multiple datasets of the11

same measurement type across different sam-12

ples in one experiment (Approach 1), a few13

approaches are available. See for exam-14

ple MNN [Haghverdi et al., 2018], and the15

methodologies included in the Seurat pack-16

age [Satija et al., 2015, Butler et al., 2018b,17

Stuart et al., 2018]. For the challenges and18

promises referring to the integration of sc-seq19

data that vary in terms of spatial and tempo-20

ral origin, see the discussions in the section 3.521

and section 5.3 below.22

For integrating multiple datasets of the23

same measurement type across experiments24

(Approach 2), mapping cells to reference25

datasets such as the Human Cell Atlas [Regev26

et al., 2017] are currently emerging as the27

most promising strategy. We refer the reader28

to more particular and detailed discussions in29

section 3.3. If applicable reference systems30

are not available (note that the human cell31

atlas is not yet fully operable), assembling32

cell type clusters from different experiments33

is a reasonable strategy, as implemented by34

several recently published tools [Zhang et al.,35

2018, Barkas et al., 2018, Gao et al., 2018,36

Kiselev et al., 2018, Park et al., 2018, Wag-37

ner and Yanai, 2018, Boufea et al., 2019, Jo-38

hansen and Quon, 2019, Johnson et al., 2019].39

The integration of data raised from one cell,40

referring to multiple types of measurements41

(Approach 3) is described in some particular42

experimental protocols that address the issue 43

[Macaulay et al., 2017]. These focus on com- 44

bining scDNA-seq and scRNA-seq (Dey et al. 45

[2015], Macaulay et al. [2016, 2017]), methyla- 46

tion data and scRNA-seq [Angermueller et al., 47

2016], or even all of scRNA-seq, scDNA-seq, 48

methylation and chromatin accessibility data 49

[Clark et al., 2018], or targeted queries on a 50

cell’s methylation, transcription (scRNA-seq) 51

and genotype status (sc-GEM, Cheow et al. 52

[2016]). Beyond these single-cell specific ap- 53

proaches, bulk approaches that address the 54

integration of data from different types of ex- 55

periments have the potential to be leveraged 56

to account for single-cell specific noise char- 57

acteristics or adapted to also qualify for cor- 58

responding single-cell analyses (MOFA, Arge- 59

laguet et al. [2018]), DIABLO [Rohart et al., 60

2017b, Singh et al., 2018] and MINT [Rohart 61

et al., 2017a]). 62

For the integration of different measure- 63

ments performed on several cells all of which 64

stem from a population of cells that is co- 65

herent with respect to the intended analysis 66

(Approach 4), technologies such as 10X ge- 67

nomics [Zheng et al., 2017] for scRNA-seq 68

and direct library preparation (DLP, Zahn 69

et al. [2017b]) for scDNA-seq establish an ex- 70

perimental basis. As above-mentioned, the 71

greater analytical challenge is to, upon hav- 72

ing performed experiments, identify subpop- 73

ulations that had hitherto remained invis- 74

ible, and whose identification is crucial so 75

as to not combine different types of data 76

in mistaken ways. An example for this are 77

the identification of cancer clones although 78

single cells had been sampled from identi- 79

cal tumor tissue—only performing scDNA- 80

seq experiments can definitively reveal the 81

clonal structure of a tumor. If one wishes 82

to correctly link mutation with transcription 83

profiles—the latter of which are examined via 84

scRNA-seq experiments—ignoring the clonal 85

structure of a tumor would be misleading. 86

37

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 38: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Several analytical methods that address this1

problem have recently emerged: (i) clonealign2

[Campbell et al., 2019] assumes a copy-num-3

ber dosage effect on transcription to assign4

gene expression states to clones. (ii) cardelino5

[McCarthy et al., 2018] aligns clone-specific6

SNVs in scRNA-seq to those inferred from7

bulk exome data to infer clone-specific ex-8

pression patterns. (iii) MATCHER [Welch9

et al., 2017] uses manifold alignment to com-10

bine scM&T-seq [Angermueller et al., 2016]11

with sc-GEM [Cheow et al., 2016], leverag-12

ing the common set of loci. All of these13

methods are based on biologically coherent14

assumptions on how to summarize measure-15

ments across different types and samples in a16

reasonable way, despite their different physi-17

cal origin.18

6.1.2 Open problems19

Experimental technologies that deal with tak-20

ing measurements of different kinds on one21

single cell (Approach 3 in Figure 6 and Ta-22

ble 2) are on the rise and will allow to as-23

say more cells at higher fidelity and reduced24

cost. Yet, however, many methods for evalu-25

ating combinations of different types of mea-26

surements performed on one single cell have27

not been in the focus. It is to be expected28

that the corresponding open problems will be-29

come more urgent. As an example, consider30

combined measurements of scDNA-seq and31

scRNA-seq, where one uses the transcripts de-32

rived from the latter to impute missing values33

in the genotype profile derived from the first.34

While this may make Approach 4 look as35

if becoming gradually obsolete, the advances36

with respect to Approach 3 and the corre-37

sponding advances in terms of the resolution38

of how intracellular measurements of different39

types are linked with one another will benefit40

from ground work on Approach 4. Further,41

work using Approach 4 will mean a boost for42

reference systems, such as cell atlases (see also 43

Approach 2), because our understanding of 44

the link between the different substrates mea- 45

sured will improve. As an example consider 46

how gene expression increases on increasing 47

genomic copy number, known as measure- 48

ment linkage [Loper et al., 2019], are impor- 49

tant to account for in such a reference system. 50

This, in turn, will yield techniques that map 51

different cellular quantities with greater ac- 52

curacy, eventually allowing analyses at higher 53

resolution and finer granularity. As a con- 54

sequence, approaches that address taking dif- 55

ferent measurement across different cells from 56

the same population (Approach 4) will deliver 57

more fine-grained results, hence also thanks 58

to these approaches being easier to perform 59

and being more cost efficient, likely will not 60

experience a loss in popularity. 61

As just mentioned, advances with respect 62

to Approach 3 and 4 will be partially based on 63

advances in terms of mappings that connect 64

cells across their types and states, see Ap- 65

proach 2. With combinations of measurement 66

types gradually being shifted in the focus of 67

attention, extensions of Approach 2 (which 68

predominantly addresses how to connect dif- 69

ferent cells based on a single measurement) 70

are necessary. These extensions will have to 71

address how to connect different cells also in 72

terms of multiple types of measurements, or 73

even combinations thereof, such as integrative 74

genotype-expression-profiles (raised by evalu- 75

ating combined experiments on both scRNA- 76

seq and scDNA-seq, for example), which 77

points out the need for improvements address- 78

ing Approach 5. 79

Amounts of material that underlie most 80

measurements will remain tiny, oftentimes 81

limited by the amounts within a single cell 82

and by a limited number of cells available 83

from a particular cell population. This means 84

that one overarching theme will persist: that 85

the analyses we have just discussed will suf- 86

38

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 39: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

fer from missing entire views—samples, time1

points, or measurement types missing entirely2

at the time of training models or mapping3

quantities on one another. This will add to4

the difficulties in terms of missing data one5

experiences in non-integrative approaches.6

6.2 Challenge XII: Validating7

and benchmarking analysis8

tools for single-cell9

measurements10

With the advances in sc-seq and other single-11

cell technologies, more and more analysis12

tools become available for researchers, and13

even more are being developed and will be14

published in the near future. Thus, the need15

for datasets and methods that support sys-16

tematic benchmarking and evaluation of these17

tools is becoming more pressing. To be useful18

and reliable, algorithms and pipelines should19

be able to pass the following quality control20

tests: (i) They should produce the expected21

results (e.g. reconstruct phylogenies, estimate22

differential expressions or cluster the data) of23

high quality and outperform existing meth-24

ods, if such methods exist. (ii) They should25

be robust to high levels of sequencing noise26

and technological biases, including PCR bias,27

allele dropout and chimeric signals. In any28

case, benchmarking should be conducted in a29

systematic way, following established recom-30

mendations [Mangul et al., 2019, Weber et al.,31

2019].32

Evaluation of tool performance requires33

benchmarking datasets with known ground34

truth. Such data should include cell popula-35

tions with known genomic compositions and36

population structures, i.e. where frequencies37

of clones and alleles are known. Currently,38

such datasets are scarce—with some notable39

exceptions [Grün et al., 2014, Tian et al.,40

2019]—because generating them in genuine41

laboratory settings is time-, labor- and cost- 42

intensive. Experimental benchmark datasets 43

for evolutionary analysis of single-cell pop- 44

ulations are even harder to obtain, as they 45

require follow-up samples with known infor- 46

mation about evolutionary trajectories and 47

developmental times. With lack of time- 48

resolved measurements, only anecdotal evi- 49

dence exists on, for instance, how the ac- 50

curacy of phylogenetic inferences is affected 51

by data quality. Availability of such gold- 52

standard datasets would benefit single-cell ge- 53

nomics research enormously. 54

Due to aforementioned difficulties, the most 55

affordable sources of benchmarking and vali- 56

dation data are in silico simulations. Simula- 57

tions provide ground truth test examples that 58

can be rapidly and cost-effectively generated 59

under different assumptions. However, devel- 60

opment of reliable simulation tools require de- 61

sign and implementation of models which cap- 62

ture the essence of underlying biological pro- 63

cesses and technological details of single-cell 64

technologies and high-throughput sequencing 65

platforms, establishing single-cell data sim- 66

ulation as a methodologically involved chal- 67

lenge. 68

6.2.1 Status 69

Recent studies [Soneson and Robinson, 2018, 70

Saelens et al., 2019] show that systematic 71

benchmarking of different single-cell analysis 72

methodologies has begun. However, to the 73

best of our knowledge, there is still a short- 74

age of single-cell data simulation tools. Many 75

single-cell data analysis packages include their 76

own ad hoc data simulators [Vallejos et al., 77

2015, Korthauer et al., 2016a, Lun et al., 2016, 78

Lun and Marioni, 2017, Jahn et al., 2016, Sa- 79

tas and Raphael, 2018, Rizzetto et al., 2017, 80

Köster et al., 2017]. However, these simula- 81

tors are usually not available as separate tools 82

or even as a source code, tailored to specific 83

39

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 40: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

problems studied in corresponding papers and1

sometimes not comprehensively documented,2

thus limiting their utility for the broad re-3

search community. Furthermore, since such4

simulators are used only as auxiliary subrou-5

tines inside particular projects and are not6

published as stand-alone tools, they them-7

selves are usually not evaluated, and there-8

fore the accuracy of their reflection of real9

biological and technological processes remain10

unclear. There are few exceptions known11

to us, including the tools Splatter [Zappia12

et al., 2017], powsimR [Vieth et al., 2017],13

and SymSim [Zhang et al., 2019d], which pro-14

vide frameworks for simulation of scRNA-seq15

data and whose accuracy has been validated16

by comparison of its results with real data.17

For single-cell phylogenomics, cancer genome18

evolution simulators are being designed [Se-19

meraro et al., 2018, Xia et al., 2018, Meng20

and Chen, 2018].21

6.2.2 Open problems22

Simulation tools mostly concentrate on differ-23

ential expression analysis, while comprehen-24

sive simulation methods for other important25

aspects of sc-seq analysis are still to be devel-26

oped. In particular, to the best of our knowl-27

edge, no such tool is available for scDNA-seq28

data.29

With single-cell phylogenomics, one would30

like to assess the accuracy of methods for31

phylogenetic inference and subclone identifi-32

cation, or the power of population genetics33

methods for estimating parameters of interest34

(e.g. tests for selection and epistatic interac-35

tions in cancer, see section 5.3). To this end,36

realistic and comprehensive (w.r.t. the evolu-37

tionary phenomena) simulation tools are re-38

quired.39

Another interesting computational problem40

is development of tools for validation of simu-41

lated sc-seq datasets themselves by their com-42

parison with real data using a comprehen- 43

sive set of biological parameters. The first 44

such tool for scRNA-seq data is countsimQC 45

[Soneson and Robinson, 2017], but similar 46

tools for scDNA-seq data are needed. Finally, 47

most of the simulators concentrate on model- 48

ing of biologically meaningful data, while ig- 49

noring or simplifying models for sc-seq errors 50

and artifacts. 51

Another important challenge in single-cell 52

analysis tool validation is the selection of com- 53

prehensive evaluation metrics, which should 54

be used for comparison of different analysis 55

results with each other and with the ground 56

truth. For single-cell data it is particularly 57

complicated, since many analysis tools deal 58

with heterogeneous clone populations, which 59

possesses multiple biological characteristics to 60

be inferred and analyzed. Development of 61

a single measure which captures several of 62

these characteristics is complicated, and in 63

many cases impossible. For example, valida- 64

tion of tools for imputation of cellular and 65

transcriptional heterogeneity should simulta- 66

neously evaluate two measures: (i) how close 67

are the reconstructed and true cellular ge- 68

nomic profiles and (ii) how close are recon- 69

structed and true SNV/haplotype frequency 70

distributions. Development of synthetic mea- 71

sures which capture several such characteris- 72

tics (e.g. based on utilization of earth mover’s 73

distance [Knyazev et al., 2018]) is highly im- 74

portant. 75

When simulating datasets in general, the 76

circularity of simulating and inferring pa- 77

rameters under the same—possibly simplis- 78

tic model—should be critically assessed, as 79

should potential biases. Thus, further eval- 80

uation on empirical datasets for which some 81

ground truth is known will be invaluable. Ide- 82

ally, all single-cell analysis fields should define 83

a standard set of benchmark datasets that will 84

allow for assessing and comparing methods or 85

come up with a regular data analysis chal- 86

40

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 41: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

lenge. This approach has been very success-1

ful, e.g. in protein structure prediction3 and2

metagenomic analyses4. A first step in this3

direction was the recent single-cell transcrip-4

tomics DREAM challenge5.5

7 Acknowledgements6

We are deeply grateful to the Lorentz Cen-7

ter for hosting the workshop “Single Cell Data8

Science: Making Sense of Data from Billions9

of Single Cells” (4–8 June 2018). In par-10

ticular, we would like to thank the Lorentz11

Center staff, who turned organizing and at-12

tending the workshop into a great pleasure.13

For a week, the authors of this review came14

together—researchers from the fields of statis-15

tics and medicine, computer science and biol-16

ogy, and any combinations thereof. In inter-17

active workshop sessions, we brought together18

our knowledge of single-cell analyses, ranging19

from the wet-lab to the server cluster, from20

statistical models to algorithms, from can-21

cer biology to evolutionary genetics. During22

these sessions, we formulated an initial set of23

challenges that was further systematized and24

refined in the following months, and substan-25

tiated with extensive literature research of the26

respective state-of-the-art for this review.27

Acronyms28

CNV copy number variation. 22, 24, 25, 27,29

2830

ICA independent component analysis. 1031

3http://predictioncenter.org/4https://data.cami-challenge.org5https://www.synapse.org/#!Synapse:

syn15665609/wiki/582909

MALBAC multiple annealing and looping- 32

based amplification cycles. 20, 22 33

MDA multiple displacement amplification. 34

20, 22 35

MSA multiple sequence alignment. 25, 26, 36

30 37

NMF non-negative matrix factorization. 10 38

PCA principal component analysis. 10, 11 39

PCR polymerase chain reaction. 20, 22, 39 40

sc-seq single-cell sequencing. 3–6, 29, 37, 39, 41

40 42

scDNA-seq single-cell DNA sequencing. 3, 43

6–8, 16, 20–25, 27–31, 34–38, 40 44

SCDS Single-Cell Data Science. 3, 4, 7 45

scRNA-seq single-cell RNA sequencing. 3, 46

5–14, 16–18, 22, 24, 34–38, 40 47

SNV single nucleotide variation. 6, 20, 22– 48

28, 38, 40 49

WGA whole genome amplification. 20, 22–25 50

References 51

Andre J Aberer, Kassian Kobert, and Alexan- 52

dros Stamatakis. ExaBayes: massively par- 53

allel bayesian tree inference for the whole- 54

genome era. Mol. Biol. Evol., 31(10):2553– 55

2556, October 2014. 56

Ahmet Acar, Daniel Nichol, Javier 57

Fernandez-Mateos, George D. Cress- 58

well, Iros Barozzi, Sung Pil Hong, 59

Inmaculada Spiteri, Mark Stubbs, Rose- 60

mary Burke, Adam Stewart, Georgios 61

Vlachogiannis, Carlo C. Maley, Luca 62

Magnani, Nicola Valeri, Udai Banerji, and 63

41

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 42: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Andrea Sottoriva. Exploiting evolution-1

ary herding to control drug resistance2

in cancer. bioRxiv, page 566950, March3

2019. doi: 10.1101/566950. URL4

https://www.biorxiv.org/content/10.5

1101/566950v1.6

Sumon Ahmed, Magnus Rattray, and7

Alexis Boukouvalas. GrandPrix: scaling8

up the Bayesian GPLVM for single-9

cell data. Bioinformatics, 35(1):47–10

54, January 2019. ISSN 1367-4803.11

doi: 10.1093/bioinformatics/bty533.12

URL https://academic.oup.com/13

bioinformatics/article/35/1/47/14

5047752.15

Philipp M Altrock, Lin L Liu, and Franziska16

Michor. The mathematics of cancer: in-17

tegrating quantitative models. Nat. Rev.18

Cancer, 15(12):730–745, December 2015.19

Matthew Amodio, David van Dijk, Krish-20

nan Srinivasan, William S Chen, Hussein21

Mohsen, Kevin R Moon, Allison Campbell,22

Yujiao Zhao, Xiaomei Wang, Manjunatha23

Venkataswamy, Anita Desai, V Ravi, Priti24

Kumar, Ruth Montgomery, Guy Wolf, and25

Smita Krishnaswamy. Exploring Single-26

Cell data with deep multitasking neural27

networks. January 2019.28

Benedict Anchang, Tom D. P. Hart, Sean C.29

Bendall, Peng Qiu, Zach Bjornson, Michael30

Linderman, Garry P. Nolan, and Sylvia K.31

Plevritis. Visualization and cellular32

hierarchy inference of single-cell data33

using SPADE. Nature Protocols, 11(7):34

1264–1279, July 2016. ISSN 1754-2189.35

doi: 10.1038/nprot.2016.066. URL http:36

//www.nature.com/nprot/journal/v11/37

n7/full/nprot.2016.066.html.38

Tallulah S. Andrews and Martin Hem-39

berg. False signals induced by single-cell40

imputation. F1000Research, 7:1740, 41

March 2019. ISSN 2046-1402. doi: 42

10.12688/f1000research.16613.2. URL 43

https://f1000research.com/articles/ 44

7-1740/v2. 45

Michael Angelo, Sean C. Bendall, Rachel 46

Finck, Matthew B. Hale, Chuck Hitzman, 47

Alexander D. Borowsky, Richard M. Lev- 48

enson, John B. Lowe, Scot D. Liu, Shuchun 49

Zhao, Yasodha Natkunam, and Garry P. 50

Nolan. Multiplexed ion beam imaging of 51

human breast tumors. Nature Medicine, 20 52

(4):436–442, April 2014. ISSN 1546-170X. 53

doi: 10.1038/nm.3488. 54

Christof Angermueller, Stephen J. Clark, 55

Heather J. Lee, Iain C. Macaulay, Mabel J. 56

Teng, Tim Xiaoming Hu, Felix Krueger, 57

Sebastien Smallwood, Chris P. Ponting, 58

Thierry Voet, Gavin Kelsey, Oliver Ste- 59

gle, and Wolf Reik. Parallel single-cell se- 60

quencing links transcriptional and epige- 61

netic heterogeneity. Nature Methods, 13(3): 62

229–232, March 2016. ISSN 1548-7105. doi: 63

10.1038/nmeth.3728. 64

Christof Angermueller, Heather J. Lee, 65

Wolf Reik, and Oliver Stegle. Deep- 66

CpG: accurate prediction of single-cell 67

DNA methylation states using deep learn- 68

ing. Genome Biology, 18(1):67, April 69

2017. ISSN 1474-760X. doi: 10.1186/ 70

s13059-017-1189-z. URL https://doi. 71

org/10.1186/s13059-017-1189-z. 72

Ricard Argelaguet, Britta Velten, Damien 73

Arnol, Sascha Dietrich, Thorsten Zenz, 74

John C. Marioni, Florian Buettner, Wolf- 75

gang Huber, and Oliver Stegle. Multi- 76

Omics Factor Analysis—a framework for 77

unsupervised integration of multi-omics 78

data sets. Molecular Systems Biology, 79

14(6):e8124, June 2018. ISSN 1744- 80

4292, 1744-4292. doi: 10.15252/msb. 81

42

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 43: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

20178124. URL http://msb.embopress.1

org/content/14/6/e8124.2

C Arisdakessian, O Poirion, B Yunits,3

X Zhu, and L Garmire. DeepIm-4

pute: an accurate, fast and scalable5

deep neural network method to im-6

pute single-cell RNA-Seq data. bioRxiv,7

2018. URL https://www.biorxiv.org/8

content/10.1101/353607v1.abstract.9

Nona Arneson, Simon Hughes, Richard Houl-10

ston, and Susan Done. Whole-Genome am-11

plification by improved primer extension12

preamplification PCR (I-PEP-PCR). CSH13

Protoc., 2008:db.prot4921, January 2008.14

Damien Arnol, Denis Schapiro, Bernd Bo-15

denmiller, Julio Saez-Rodriguez, and Oliver16

Stegle. Modelling cell-cell interactions from17

spatial molecular data with spatial variance18

component analysis, 2018.19

Daniel L. Ayres. Research And Applica-20

tion Of Parallel Computing Algorithms For21

Statistical Phylogenetic Inference. PhD22

thesis, University of Maryland, 2017.23

URL http://drum.lib.umd.edu/handle/24

1903/19951.25

Elham Azizi, Sandhya Prabhakaran, Ambrose26

Carr, and Dana Pe’er. Bayesian Infer-27

ence for Single-cell Clustering and Imput-28

ing. Genomics and Computational Biol-29

ogy, 3(1):46, January 2017. ISSN 2365-30

7154. doi: 10.18547/gcb.2017.vol3.iss1.e46.31

URL https://genomicscomputbiol.org/32

ojs/index.php/GCB/article/view/46.33

Rhonda Bacher and Christina Kendziorski.34

Design and computational analysis35

of single-cell RNA-sequencing ex-36

periments. Genome Biology, 17(1):37

63, April 2016. ISSN 1474-760X.38

doi: 10.1186/s13059-016-0927-y.39

URL https://doi.org/10.1186/ 40

s13059-016-0927-y. 41

Md Bahadur Badsha, Rui Li, Boxiang Liu, 42

Yang I Li, Min Xian, Nicholas E Banovich, 43

and Audrey Qiuyan Fu. Imputation of 44

single-cell gene expression with an autoen- 45

coder neural network. December 2018. 46

Bjorn Bakker, Aaron Taudt, Mirjam E 47

Belderbos, David Porubsky, Diana C J 48

Spierings, Tristan V de Jong, Nancy 49

Halsema, Hinke G Kazemier, Karina 50

Hoekstra-Wakker, Allan Bradley, Eveline S 51

J M de Bont, Anke van den Berg, Victor 52

Guryev, Peter M Lansdorp, Maria Colomé- 53

Tatché, and Floris Foijer. Single-cell se- 54

quencing reveals karyotype heterogeneity in 55

murine and human malignancies. Genome 56

Biol., 17(1):115, May 2016. 57

Nikolas Barkas, Viktor Petukhov, Daria Niko- 58

laeva, Yaroslav Lozinsky, Samuel Demhar- 59

ter, Konstantin Khodosevich, and Peter V. 60

Kharchenko. Wiring together large single- 61

cell RNA-seq sample collections. bioRxiv, 62

page 460246, November 2018. doi: 10.1101/ 63

460246. URL https://www.biorxiv.org/ 64

content/10.1101/460246v1. 65

Nico Battich, Thomas Stoeger, and Lucas 66

Pelkmans. Control of transcript variabil- 67

ity in single mammalian cells. Cell, 163(7): 68

1596–1610, December 2015. 69

Benedikt Bauer, Reiner Siebert, and Arne 70

Traulsen. Cancer initiation with epistatic 71

interactions between driver and passenger 72

mutations. J. Theor. Biol., 358:52–60, Oc- 73

tober 2014. 74

Niko Beerenwinkel, Tibor Antal, David 75

Dingli, Arne Traulsen, Kenneth W Kinzler, 76

Victor E Velculescu, Bert Vogelstein, and 77

Martin A Nowak. Genetic progression and 78

43

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 44: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

the waiting time to cancer. PLoS Comput.1

Biol., 3(11):e225, November 2007.2

Graham R. Bignell, Thomas Santarius, Jes-3

sica C.M. Pole, Adam P. Butler, Janet4

Perry, Erin Pleasance, Chris Greenman,5

Andrew Menzies, Sheila Taylor, Sarah6

Edkins, Peter Campbell, Michael Quail,7

Bob Plumb, Lucy Matthews, Kirsten8

McLay, Paul A.W. Edwards, Jane Rogers,9

Richard Wooster, P. Andrew Futreal,10

and Michael R. Stratton. Architec-11

tures of somatic genomic rearrangement12

in human cancer amplicons at sequence-13

level resolution. Genome Research, 1714

(9):1296–1303, 2007. doi: 10.1101/gr.15

6522707. URL http://genome.cshlp.16

org/content/17/9/1296.abstract.17

Konstantin A Blagodatskikh, Vladimir M18

Kramarov, Ekaterina V Barsova, Alexey V19

Garkovenko, Dmitriy S Shcherbo, An-20

drew A Shelenkov, Vera V Ustinova,21

Maria R Tokarenko, Simon C Baker, Ta-22

tiana V Kramarova, and Konstantin B Ig-23

natov. Improved DOP-PCR (iDOP-PCR):24

A robust and simple WGA method for effi-25

cient amplification of low copy number ge-26

nomic DNA. PLoS One, 12(9):e0184507,27

September 2017.28

L Blanco, A Bernad, J M Lázaro, G Martín,29

C Garmendia, and M Salas. Highly ef-30

ficient DNA synthesis by the phage phi31

29 DNA polymerase. symmetrical mode of32

DNA replication. J. Biol. Chem., 264(15):33

8935–8940, May 1989.34

Craig L. Bohrson, Alison R. Barton,35

Michael A. Lodato, Rachel E. Rodin,36

Lovelace J. Luquette, Vinay V. Viswanad-37

ham, Doga C. Gulhan, Isidro Cortés-38

Ciriano, Maxwell A. Sherman, Min-39

seok Kwon, Michael E. Coulter, Alon40

Galor, Christopher A. Walsh, and41

Peter J. Park. Linked-read analysis 42

identifies mutations in single-cell DNA- 43

sequencing data. Nature Genetics, 44

page 1, March 2019. ISSN 1546-1718. 45

doi: 10.1038/s41588-019-0366-2. URL 46

https://www.nature.com/articles/ 47

s41588-019-0366-2. 48

Katerina Boufea, Sohan Seth, and Nizar N. 49

Batada. scID: Identification of equivalent 50

transcriptional cell populations across 51

single cell RNA-seq data using discrim- 52

inant analysis. bioRxiv, page 470203, 53

January 2019. doi: 10.1101/470203. URL 54

https://www.biorxiv.org/content/10. 55

1101/470203v2. 56

Ivana Bozic, Tibor Antal, Hisashi Ohtsuki, 57

Hannah Carter, Dewey Kim, Sining Chen, 58

Rachel Karchin, Kenneth W Kinzler, Bert 59

Vogelstein, and Martin A Nowak. Accu- 60

mulation of driver and passenger muta- 61

tions during tumor progression. Proc. Natl. 62

Acad. Sci. U. S. A., 107(43):18545–18550, 63

October 2010. 64

Ivana Bozic, Jeffrey M Gerold, and Mar- 65

tin A Nowak. Quantifying clonal and sub- 66

clonal passenger mutations in cancer evolu- 67

tion. PLoS Comput. Biol., 12(2):e1004731, 68

February 2016. 69

James A. Briggs, Caleb Weinreb, Daniel E. 70

Wagner, Sean Megason, Leonid Peshkin, 71

Marc W. Kirschner, and Allon M. Klein. 72

The dynamics of gene expression in ver- 73

tebrate embryogenesis at single-cell res- 74

olution. Science, 360(6392):eaar5780, 75

June 2018. ISSN 0036-8075, 1095- 76

9203. doi: 10.1126/science.aar5780. 77

URL http://science.sciencemag.org/ 78

content/360/6392/eaar5780. 79

Jane Bromley, James W. Bentz, Léon Bot- 80

tou, Isabelle Guyon, Yann Lecun, Cliff 81

44

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 45: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Moore, Eduard Säckinger, and Roopak1

Shah. Signature verification using a2

“siamese” time delay neural network.3

International Journal of Pattern Recog-4

nition and Artificial Intelligence, 07(04):5

669–688, August 1993. ISSN 0218-0014.6

doi: 10.1142/S0218001493000339. URL7

https://www.worldscientific.com/8

doi/10.1142/S0218001493000339.9

Robert V Bruggner, Bernd Bodenmiller,10

David L Dill, Robert J Tibshirani, and11

Garry P Nolan. Automated identification12

of stratifying signatures in cellular subpop-13

ulations. Proc. Natl. Acad. Sci. U. S. A.,14

111(26):E2770–7, July 2014.15

Jason D. Buenrostro, Beijing Wu, Ulrike M.16

Litzenburger, Dave Ruff, Michael L.17

Gonzales, Michael P. Snyder, Howard Y.18

Chang, and William J. Greenleaf. Single-19

cell chromatin accessibility reveals princi-20

ples of regulatory variation. Nature, 52321

(7561):486–490, July 2015. ISSN 1476-22

4687. doi: 10.1038/nature14590. URL23

https://www.nature.com/articles/24

nature14590.25

Jason D. Buenrostro, M. Ryan Corces,26

Caleb A. Lareau, Beijing Wu, Alicia N.27

Schep, Martin J. Aryee, Ravindra Ma-28

jeti, Howard Y. Chang, and William J.29

Greenleaf. Integrated Single-Cell Analy-30

sis Maps the Continuous Regulatory Land-31

scape of Human Hematopoietic Differen-32

tiation. Cell, 173(6):1535–1548.e16, 2018.33

ISSN 1097-4172. doi: 10.1016/j.cell.2018.34

03.074.35

Florian Buettner, Naruemon Pratanwanich,36

Davis J McCarthy, John C Marioni, and37

Oliver Stegle. f-scLVM: scalable and ver-38

satile factor analysis for single-cell RNA-39

seq. Genome biology, 18(1):212, Novem-40

ber 2017. ISSN 1465-6906. doi: 10.1186/41

s13059-017-1334-8. URL http://dx.doi. 42

org/10.1186/s13059-017-1334-8. 43

Andrew Butler, Paul Hoffman, Peter Smib- 44

ert, Efthymia Papalexi, and Rahul Satija. 45

Integrating single-cell transcriptomic data 46

across different conditions, technologies, 47

and species. Nat. Biotechnol., 36(5):411– 48

420, June 2018a. 49

Andrew Butler, Paul Hoffman, Peter Smib- 50

ert, Efthymia Papalexi, and Rahul Satija. 51

Integrating single-cell transcriptomic data 52

across different conditions, technologies, 53

and species. Nature Biotechnology, 36(5): 54

411–420, May 2018b. ISSN 1546-1696. 55

doi: 10.1038/nbt.4096. URL https:// 56

www.nature.com/articles/nbt.4096. 57

Kieran R. Campbell and Christopher 58

Yau. Order Under Uncertainty: Robust 59

Differential Expression Analysis Using 60

Probabilistic Models for Pseudotime In- 61

ference. PLOS Computational Biology, 12 62

(11):e1005212, November 2016. ISSN 1553- 63

7358. doi: 10.1371/journal.pcbi.1005212. 64

URL https://journals.plos.org/ 65

ploscompbiol/article?id=10.1371/ 66

journal.pcbi.1005212. 67

Kieran R. Campbell and Christopher Yau. 68

Uncovering pseudotemporal trajectories 69

with covariates from single cell and bulk 70

expression data. Nature Communications, 71

9(1):2442, June 2018. ISSN 2041-1723. 72

doi: 10.1038/s41467-018-04696-6. URL 73

https://www.nature.com/articles/ 74

s41467-018-04696-6. 75

Kieran R. Campbell, Adi Steif, Emma Laks, 76

Hans Zahn, Daniel Lai, Andrew McPher- 77

son, Hossein Farahani, Farhia Kabeer, 78

Ciara O’Flanagan, Justina Biele, Jazmine 79

Brimhall, Beixi Wang, Pascale Walters, 80

IMAXT Consortium, Alexandre Bouchard- 81

Côté, Samuel Aparicio, and Sohrab P. 82

45

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 46: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Shah. clonealign: statistical integra-1

tion of independent single-cell RNA and2

DNA sequencing data from human can-3

cers. Genome Biology, 20(1):54, March4

2019. ISSN 1474-760X. doi: 10.1186/5

s13059-019-1645-z. URL https://doi.6

org/10.1186/s13059-019-1645-z.7

Junyue Cao, Jonathan S. Packer, Vijay Ra-8

mani, Darren A. Cusanovich, Chau Huynh,9

Riza Daza, Xiaojie Qiu, Choli Lee, Scott N.10

Furlan, Frank J. Steemers, Andrew Adey,11

Robert H. Waterston, Cole Trapnell, and12

Jay Shendure. Comprehensive single-13

cell transcriptional profiling of a multi-14

cellular organism. Science, 357(6352):15

661–667, August 2017. ISSN 0036-8075,16

1095-9203. doi: 10.1126/science.aam8940.17

URL http://science.sciencemag.org/18

content/357/6352/661.19

Junyue Cao, Darren A. Cusanovich, Vijay20

Ramani, Delasa Aghamirzaie, Hannah A.21

Pliner, Andrew J. Hill, Riza M. Daza,22

Jose L. McFaline-Figueroa, Jonathan S.23

Packer, Lena Christiansen, Frank J.24

Steemers, Andrew C. Adey, Cole Trap-25

nell, and Jay Shendure. Joint profil-26

ing of chromatin accessibility and gene27

expression in thousands of single cells.28

Science, 361(6409):1380–1385, Septem-29

ber 2018. ISSN 0036-8075, 1095-30

9203. doi: 10.1126/science.aau0730.31

URL https://science.sciencemag.org/32

content/361/6409/1380.33

Junyue Cao, Malte Spielmann, Xiaojie Qiu,34

Xingfan Huang, Daniel M. Ibrahim, An-35

drew J. Hill, Fan Zhang, Stefan Mundlos,36

Lena Christiansen, Frank J. Steemers, Cole37

Trapnell, and Jay Shendure. The single-38

cell transcriptional landscape of mam-39

malian organogenesis. Nature, 566(7745):40

496, February 2019a. ISSN 1476-4687.41

doi: 10.1038/s41586-019-0969-x. URL42

https://www.nature.com/articles/ 43

s41586-019-0969-x. 44

Zhi-Jie Cao, Lin Wei, Shen Lu, De-Chang 45

Yang, and Ge Gao. Cell BLAST: Search- 46

ing large-scale scRNA-seq database via 47

unbiased cell embedding. bioRxiv, page 48

587360, March 2019b. doi: 10.1101/ 49

587360. URL https://www.biorxiv.org/ 50

content/10.1101/587360v1. 51

Anna K Casasent, Aislyn Schalck, Ruli 52

Gao, Emi Sei, Annalyssa Long, William 53

Pangburn, Tod Casasent, Funda Meric- 54

Bernstam, Mary E Edgerton, and 55

Nicholas E Navin. Multiclonal invasion in 56

breast tumors identified by topographic 57

single cell sequencing. Cell, 172(1-2): 58

205–217.e12, January 2018. 59

Chong Chen, Changjing Wu, Linjie Wu, 60

Yishu Wang, Minghua Deng, and Ruibin 61

Xi. scRMD: Imputation for single cell 62

RNA-seq data via robust matrix de- 63

composition. November 2018. URL 64

https://www.biorxiv.org/content/10. 65

1101/459404v2. 66

Chongyi Chen, Dong Xing, Longzhi Tan, 67

Heng Li, Guangyu Zhou, Lei Huang, and 68

X Sunney Xie. Single-cell whole-genome 69

analyses by linear amplification via trans- 70

poson insertion (LIANTI). Science, 356 71

(6334):189–194, April 2017. 72

Huidong Chen, Luca Albergante, 73

Jonathan Y. Hsu, Caleb A. Lareau, 74

Giosuè Lo Bosco, Jihong Guan, Shuigeng 75

Zhou, Alexander N. Gorban, Daniel E. 76

Bauer, Martin J. Aryee, David M. Lan- 77

genau, Andrei Zinovyev, Jason D. Buen- 78

rostro, Guo-Cheng Yuan, and Luca Pinello. 79

Single-cell trajectories reconstruction, ex- 80

ploration and mapping of omics data with 81

STREAM. Nature Communications, 10 82

46

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 47: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

(1):1903, April 2019. ISSN 2041-1723.1

doi: 10.1038/s41467-019-09670-4. URL2

https://www.nature.com/articles/3

s41467-019-09670-4.4

Kok Hao Chen, Alistair N Boettiger, Jef-5

frey R Moffitt, Siyuan Wang, and Xiaowei6

Zhuang. RNA imaging. spatially resolved,7

highly multiplexed RNA profiling in sin-8

gle cells. Science, 348(6233):aaa6090, April9

2015.10

Mengjie Chen and Xiang Zhou. VIPER:11

variability-preserving imputation for accu-12

rate gene expression recovery in single-cell13

RNA sequencing studies. Genome Biol., 1914

(1):196, November 2018.15

Lih Feng Cheow, Elise T. Courtois, Yuliana16

Tan, Ramya Viswanathan, Qiaorui Xing,17

Rui Zhen Tan, Daniel S. W. Tan, Paul Rob-18

son, Yuin-Han Loh, Stephen R. Quake, and19

William F. Burkholder. Single-cell multi-20

modal profiling reveals cellular epigenetic21

heterogeneity. Nature Methods, 13(10):833–22

836, October 2016. ISSN 1548-7105. doi:23

10.1038/nmeth.3961. URL https://www.24

nature.com/articles/nmeth.3961.25

Cariad Chester and Holden T Maecker. Algo-26

rithmic tools for mining High-Dimensional27

cytometry data. J. Immunol., 195(3):773–28

779, August 2015.29

Hyunghoon Cho, Bonnie Berger, and Jian30

Peng. Generalizable and scalable visualiza-31

tion of Single-Cell data using neural net-32

works. Cell Syst, 7(2):185–191.e4, August33

2018.34

Simone Ciccolella, Mauricio Soto Gomez,35

Murray Patterson, Gianluca Della Ve-36

dova, Iman Hajirasouliha, and Paola37

Bonizzoni. Inferring Cancer Progres-38

sion from Single-cell Sequencing while Al-39

lowing Mutation Losses. bioRxiv, page40

268243, April 2018. doi: 10.1101/ 41

268243. URL https://www.biorxiv.org/ 42

content/10.1101/268243v2. 43

Stephen J. Clark, Ricard Argelaguet, 44

Chantriolnt-Andreas Kapourani, 45

Thomas M. Stubbs, Heather J. Lee, Celia 46

Alda-Catalinas, Felix Krueger, Guido San- 47

guinetti, Gavin Kelsey, John C. Marioni, 48

Oliver Stegle, and Wolf Reik. scNMT-seq 49

enables joint profiling of chromatin accessi- 50

bility DNA methylation and transcription 51

in single cells. Nature Communications, 9 52

(1):781, February 2018. ISSN 2041-1723. 53

doi: 10.1038/s41467-018-03149-4. URL 54

https://www.nature.com/articles/ 55

s41467-018-03149-4. 56

Germán Corredor, Xiangxue Wang, Yu Zhou, 57

Cheng Lu, Pingfu Fu, Konstantinos Syri- 58

gos, David L Rimm, Michael Yang, Ed- 59

uardo Romero, Kurt A Schalper, Vam- 60

sidhar Velcheti, and Anant Madabhushi. 61

Spatial architecture and arrangement of 62

Tumor-Infiltrating lymphocytes for pre- 63

dicting likelihood of recurrence in Early- 64

Stage Non-Small cell lung cancer. Clin. 65

Cancer Res., September 2018. 66

Alexandra Cretu and Peter C Brooks. Im- 67

pact of the non-cellular tumor microenvi- 68

ronment on metastasis: potential thera- 69

peutic and imaging opportunities. J. Cell. 70

Physiol., 213(2):391–402, November 2007. 71

Nicola Crosetto, Magda Bienko, and Alexan- 72

der van Oudenaarden. Spatially resolved 73

transcriptomics and beyond. Nat. Rev. 74

Genet., 16(1):57–66, January 2015. 75

Darren A. Cusanovich, Riza Daza, Andrew 76

Adey, Hannah A. Pliner, Lena Chris- 77

tiansen, Kevin L. Gunderson, Frank J. 78

Steemers, Cole Trapnell, and Jay Shen- 79

dure. Multiplex single cell profiling of chro- 80

47

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 48: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

matin accessibility by combinatorial cellu-1

lar indexing. Science (New York, N.Y.),2

348(6237):910–914, May 2015. ISSN 1095-3

9203. doi: 10.1126/science.aab1601.4

Darren A. Cusanovich, James P. Redding-5

ton, David A. Garfield, Riza M. Daza, De-6

lasa Aghamirzaie, Raquel Marco-Ferreres,7

Hannah A. Pliner, Lena Christiansen, Xi-8

aojie Qiu, Frank J. Steemers, Cole Trap-9

nell, Jay Shendure, and Eileen E. M. Fur-10

long. The cis-regulatory dynamics of em-11

bryonic development at single-cell resolu-12

tion. Nature, 555(7697):538–542, March13

2018. ISSN 1476-4687. doi: 10.1038/14

nature25981. URL https://www.nature.15

com/articles/nature25981.16

Sayantan Das, Gonçalo R Abecasis, and17

Brian L Browning. Genotype Impu-18

tation from Large Reference Panels.19

Annual review of genomics and hu-20

man genetics, 19:73–96, August 2018.21

ISSN 1527-8204, 1545-293X. doi:22

10.1146/annurev-genom-083117-021602.23

URL http://dx.doi.org/10.1146/24

annurev-genom-083117-021602.25

Soma Datta, Lavina Malhotra, Ryan Dicker-26

son, Scott Chaffee, Chandan K Sen, and27

Sashwati Roy. Laser capture microdissec-28

tion: Big data from small samples. Histol.29

Histopathol., 30(11):1255–1269, November30

2015.31

Alexander Davis, Ruli Gao, and Nicholas32

Navin. Tumor evolution: Linear, branch-33

ing, neutral or punctuated? Biochim. Bio-34

phys. Acta, 1867(2):151–161, April 2017.35

Carl G. de Boer and Aviv Regev. BROCK-36

MAN: deciphering variance in epige-37

nomic regulators by k-mer factoriza-38

tion. BMC Bioinformatics, 19(1):253, July39

2018. ISSN 1471-2105. doi: 10.1186/40

s12859-018-2255-6. URL https://doi. 41

org/10.1186/s12859-018-2255-6. 42

Charles F A de Bourcy, Iwijn De Vlam- 43

inck, Jad N Kanbar, Jianbin Wang, Charles 44

Gawad, and Stephen R Quake. A quantita- 45

tive comparison of single-cell whole genome 46

amplification methods. PLoS One, 9(8): 47

e105585, August 2014. 48

Frank B Dean, Seiyu Hosono, Linhua Fang, 49

Xiaohong Wu, A Fawad Faruqi, Patricia 50

Bray-Ward, Zhenyu Sun, Qiuling Zong, 51

Yuefen Du, Jing Du, Mark Driscoll, Wan- 52

min Song, Stephen F Kingsmore, Michael 53

Egholm, and Roger S Lasken. Compre- 54

hensive human genome amplification using 55

multiple displacement amplification. Proc. 56

Natl. Acad. Sci. U. S. A., 99(8):5261–5266, 57

April 2002. 58

Yue Deng, Feng Bao, Qionghai Dai, Lani F 59

Wu, and Steven J Altschuler. Scalable anal- 60

ysis of cell-type composition from single- 61

cell transcriptomics using deep recurrent 62

learning. Nature methods, March 2019. 63

ISSN 1548-7091, 1548-7105. doi: 10.1038/ 64

s41592-019-0353-7. URL https://doi. 65

org/10.1038/s41592-019-0353-7. 66

Erica AK DePasquale, Kyle Ferchen, Stuart 67

Hay, H. Leighton Grimes, and Nathan 68

Salomonis. cellHarmony: Cell-level 69

matching and comparison of single-cell 70

transcriptomes. bioRxiv, page 412080, 71

January 2019. doi: 10.1101/412080. URL 72

https://www.biorxiv.org/content/10. 73

1101/412080v4. 74

Siddharth S. Dey, Lennart Kester, Bastiaan 75

Spanjaard, Magda Bienko, and Alexan- 76

der van Oudenaarden. Integrated genome 77

and transcriptome sequencing of the same 78

cell. Nature Biotechnology, 33(3):285–289, 79

March 2015. ISSN 1546-1696. doi: 10.1038/ 80

48

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 49: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

nbt.3129. URL https://www.nature.1

com/articles/nbt.3129.2

David van Dijk, Roshan Sharma, Juozas3

Nainys, Kristina Yim, Pooja Kathail,4

Ambrose J. Carr, Cassandra Burdziak,5

Kevin R. Moon, Christine L. Chaf-6

fer, Diwakar Pattabiraman, Brian Bierie,7

Linas Mazutis, Guy Wolf, Smita Krish-8

naswamy, and Dana Pe’er. Recovering9

Gene Interactions from Single-Cell Data10

Using Data Diffusion. Cell, 174(3):716–11

729.e27, July 2018. ISSN 0092-8674,12

1097-4172. doi: 10.1016/j.cell.2018.05.13

061. URL https://www.cell.com/cell/14

abstract/S0092-8674(18)30724-4.15

Jiarui Ding, Anne Condon, and Sohrab P16

Shah. Interpretable dimensionality reduc-17

tion of single cell transcriptome data with18

deep generative models. Nat. Commun., 919

(1):2002, May 2018.20

Xiao Dong, Lei Zhang, Brandon Milholland,21

Moonsook Lee, Alexander Y. Maslov, Tao22

Wang, and Jan Vijg. Accurate iden-23

tification of single-nucleotide variants in24

whole-genome-amplified single cells. Na-25

ture Methods, 14(5):491–493, May 2017.26

ISSN 1548-7105. doi: 10.1038/nmeth.27

4227. URL https://www.nature.com/28

articles/nmeth.4227.29

Angelo Duò, Mark D Robinson, and Char-30

lotte Soneson. A systematic performance31

evaluation of clustering methods for single-32

cell RNA-seq data. F1000Res., 7, July33

2018.34

G Durif, L Modolo, J E Mold, S Lambert-35

Lacroix, and F Picard. Probabilistic36

Count Matrix Factorization for Single37

Cell Expression Data Analysis. Bioin-38

formatics, March 2019. ISSN 1367-4803,39

1367-4811. doi: 10.1093/bioinformatics/40

btz177. URL http://dx.doi.org/10. 41

1093/bioinformatics/btz177. 42

Daniel Edsgärd, Per Johnsson, and Rickard 43

Sandberg. Identification of spatial expres- 44

sion trends in single-cell gene expression 45

data. Nat. Methods, 15(5):339–342, May 46

2018. 47

Peter Eirew, Adi Steif, Jaswinder Khattra, 48

Gavin Ha, Damian Yap, Hossein Fara- 49

hani, Karen Gelmon, Stephen Chia, Colin 50

Mar, Adrian Wan, Emma Laks, Justina 51

Biele, Karey Shumansky, Jamie Rosner, 52

Andrew McPherson, Cydney Nielsen, 53

Andrew J. L. Roth, Calvin Lefebvre, Ali 54

Bashashati, Camila de Souza, Celia Siu, 55

Radhouane Aniba, Jazmine Brimhall, 56

Arusha Oloumi, Tomo Osako, Alejandra 57

Bruna, Jose L. Sandoval, Teresa Algara, 58

Wendy Greenwood, Kaston Leung, Hong- 59

wei Cheng, Hui Xue, Yuzhuo Wang, 60

Dong Lin, Andrew J. Mungall, Richard 61

Moore, Yongjun Zhao, Julie Lorette, Long 62

Nguyen, David Huntsman, Connie J. 63

Eaves, Carl Hansen, Marco A. Marra, Car- 64

los Caldas, Sohrab P. Shah, and Samuel 65

Aparicio. Dynamics of genomic clones 66

in breast cancer patient xenografts at 67

single-cell resolution. Nature, 518(7539): 68

422–426, February 2015. ISSN 0028-0836. 69

doi: 10.1038/nature13952. URL http: 70

//www.nature.com/nature/journal/ 71

v518/n7539/full/nature13952.html. 72

Mohammed El-Kebir. SPhyR: tumor 73

phylogeny estimation from single-cell 74

sequencing data under loss and er- 75

ror. Bioinformatics, 34(17):i671–i679, 76

September 2018. ISSN 1367-4803. 77

doi: 10.1093/bioinformatics/bty589. 78

URL https://academic.oup.com/ 79

bioinformatics/article/34/17/i671/ 80

5093218. 81

49

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 50: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Nils Eling, Arianne C. Richard, Sylvia1

Richardson, John C. Marioni, and2

Catalina A. Vallejos. Correcting the3

Mean-Variance Dependency for Differen-4

tial Variability Testing Using Single-Cell5

RNA Sequencing Data. Cell Systems, 7(3):6

284–294.e12, September 2018. ISSN 2405-7

4712. doi: 10.1016/j.cels.2018.06.011. URL8

https://www.cell.com/cell-systems/9

abstract/S2405-4712(18)30278-3.10

Chee-Huat Linus Eng, Michael Lawson,11

Qian Zhu, Ruben Dries, Noushin Koulena,12

Yodai Takei, Jina Yun, Christopher13

Cronin, Christoph Karp, Guo-Cheng14

Yuan, and Long Cai. Transcriptome-15

scale super-resolved imaging in tissues by16

RNA seqFISH+. Nature, 568(7751):17

235, April 2019. ISSN 1476-4687.18

doi: 10.1038/s41586-019-1049-y. URL19

https://www.nature.com/articles/20

s41586-019-1049-y.21

Gökcen Eraslan, Lukas M. Simon, Maria22

Mircea, Nikola S. Mueller, and Fabian J.23

Theis. Single-cell RNA-seq denois-24

ing using a deep count autoencoder.25

Nature Communications, 10(1):390,26

January 2019. ISSN 2041-1723. doi:27

10.1038/s41467-018-07931-2. URL28

https://www.nature.com/articles/29

s41467-018-07931-2.30

Nuria Estévez-Gómez, Tamara Prieto, Amy31

Guillaumet-Adkins, Holger Heyn, Sonia32

Prado-López, and David Posada. Com-33

parison of single-cell whole-genome am-34

plification strategies. bioRxiv, page35

443754, October 2018. doi: 10.1101/36

443754. URL https://www.biorxiv.org/37

content/10.1101/443754v1.38

Jean Fan, Hae-Ock Lee, Soohyun Lee, Da-39

Eun Ryu, Semin Lee, Catherine Xue,40

Seok Jin Kim, Kihyun Kim, Nikolaos41

Barkas, Peter J Park, Woong-Yang Park, 42

and Peter V Kharchenko. Linking tran- 43

scriptional and genetic tumor heterogeneity 44

through allele analysis of single-cell RNA- 45

seq data. Genome Res., 28(8):1217–1227, 46

August 2018. 47

Jeffrey A. Farrell, Yiqun Wang, Saman- 48

tha J. Riesenfeld, Karthik Shekhar, 49

Aviv Regev, and Alexander F. Schier. 50

Single-cell reconstruction of develop- 51

mental trajectories during zebrafish 52

embryogenesis. Science, 360(6392): 53

eaar3131, June 2018. ISSN 0036-8075, 54

1095-9203. doi: 10.1126/science.aar3131. 55

URL http://science.sciencemag.org/ 56

content/360/6392/eaar3131. 57

Joseph Felsenstein. Evolutionary trees from 58

DNA sequences: A maximum likelihood 59

approach. J. Mol. Evol., 17(6):368–376, 60

1981. 61

Greg Finak, Andrew McDavid, Masanao 62

Yajima, Jingyuan Deng, Vivian Gersuk, 63

Alex K. Shalek, Chloe K. Slichter, Han- 64

nah W. Miller, M. Juliana McElrath, 65

Martin Prlic, Peter S. Linsley, and Raphael 66

Gottardo. MAST: a flexible statistical 67

framework for assessing transcriptional 68

changes and characterizing heterogene- 69

ity in single-cell RNA sequencing data. 70

Genome Biology, 16, 2015. ISSN 1474- 71

7596. doi: 10.1186/s13059-015-0844-5. 72

URL https://www.ncbi.nlm.nih.gov/ 73

pmc/articles/PMC4676162/. 74

Christopher T. Fincher, Omri Wurtzel, 75

Thom de Hoog, Kellie M. Kravarik, and 76

Peter W. Reddien. Cell type transcriptome 77

atlas for the planarian Schmidtea mediter- 78

ranea. Science, 360(6391):eaaq1736, 79

May 2018. ISSN 0036-8075, 1095- 80

9203. doi: 10.1126/science.aaq1736. 81

50

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 51: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

URL http://science.sciencemag.org/1

content/360/6391/eaaq1736.2

William Fletcher and Ziheng Yang. The ef-3

fect of insertions, deletions, and alignment4

errors on the branch-site test of positive se-5

lection. Mol. Biol. Evol., 27(10):2257–2267,6

October 2010.7

Jasmine Foo, Kevin Leder, and Franziska Mi-8

chor. Stochastic dynamics of cancer initi-9

ation. Phys. Biol., 8(1):015002, February10

2011.11

Joshua M. Francis, Cheng-Zhong Zhang,12

Cecile L. Maire, Joonil Jung, Veronica E.13

Manzo, Viktor A. Adalsteinsson, Heather14

Homer, Sam Haidar, Brendan Blumenstiel,15

Chandra Sekhar Pedamallu, Azra H.16

Ligon, J. Christopher Love, Matthew17

Meyerson, and Keith L. Ligon. EGFR18

Variant Heterogeneity in Glioblastoma19

Resolved through Single-Nucleus Sequenc-20

ing. Cancer Discovery, 4(8):956–971,21

August 2014. ISSN 2159-8274, 2159-8290.22

doi: 10.1158/2159-8290.CD-13-0879.23

URL http://cancerdiscovery.24

aacrjournals.org/content/4/8/956.25

Saskia Freytag, Luyi Tian, Ingrid Lönnst-26

edt, Milica Ng, and Melanie Bahlo.27

Comparison of clustering tools in R for28

medium-sized 10x Genomics single-cell29

RNA-sequencing data. F1000Research,30

7, December 2018. ISSN 2046-1402. doi:31

10.12688/f1000research.15809.2. URL32

https://www.ncbi.nlm.nih.gov/pmc/33

articles/PMC6124389/.34

Wolf H Fridman, Jérome Galon, Marie-35

Caroline Dieu-Nosjean, Isabelle Cremer,36

Sylvain Fisson, Diane Damotte, Franck37

Pagès, Eric Tartour, and Catherine Sautès-38

Fridman. Immune infiltration in human39

cancer: Prognostic significance and disease40

control. In Glenn Dranoff, editor, Cancer 41

Immunology and Immunotherapy, pages 1– 42

24. Springer Berlin Heidelberg, Berlin, Hei- 43

delberg, 2011. 44

Yusi Fu, Chunmei Li, Sijia Lu, Wenxiong 45

Zhou, Fuchou Tang, X Sunney Xie, and 46

Yanyi Huang. Uniform and accurate 47

single-cell sequencing based on emulsion 48

whole-genome amplification. Proc. Natl. 49

Acad. Sci. U. S. A., 112(38):11923–11928, 50

September 2015. 51

Dan Gao, Feng Jin, Min Zhou, and Yuyang 52

Jiang. Recent advances in single cell ma- 53

nipulation and biochemical analysis on mi- 54

crofluidics. Analyst, 144(3):766–781, Jan- 55

uary 2019. 56

Xin Gao, Deqing Hu, Madelaine Gogol, 57

and Hua Li. ClusterMap: Com- 58

paring analyses across multiple Single 59

Cell RNA-Seq profiles. bioRxiv, page 60

331330, June 2018. doi: 10.1101/ 61

331330. URL https://www.biorxiv.org/ 62

content/10.1101/331330v2. 63

Tyler Garvin, Robert Aboukhalil, Jude 64

Kendall, Timour Baslan, Gurinder S At- 65

wal, James Hicks, Michael Wigler, and 66

Michael C Schatz. Interactive analysis and 67

assessment of single-cell copy-number vari- 68

ations. Nat. Methods, 12(11):1058–1060, 69

November 2015. 70

Charles Gawad, Winston Koh, and Stephen R 71

Quake. Single-cell genome sequencing: cur- 72

rent state of the science. Nat. Rev. Genet., 73

17(3):175–188, March 2016. 74

Farzad Ghaznavi, Andrew Evans, Anant 75

Madabhushi, and Michael Feldman. Digital 76

imaging in pathology: whole-slide imaging 77

and beyond. Annu. Rev. Pathol., 8:331– 78

359, January 2013. 79

51

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 52: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Charlotte Giesen, Hao A. O. Wang, Denis1

Schapiro, Nevena Zivanovic, Andrea Ja-2

cobs, Bodo Hattendorf, Peter J. Schüffler,3

Daniel Grolimund, Joachim M. Buhmann,4

Simone Brandt, Zsuzsanna Varga, Peter J.5

Wild, Detlef Günther, and Bernd Boden-6

miller. Highly multiplexed imaging of tu-7

mor tissues with subcellular resolution by8

mass cytometry. Nature Methods, 11(4):9

417–422, April 2014. ISSN 1548-7105. doi:10

10.1038/nmeth.2869. URL https://www.11

nature.com/articles/nmeth.2869.12

Yury Goltsev, Nikolay Samusik, Julia13

Kennedy-Darling, Salil Bhate, Matthew14

Hale, Gustavo Vazquez, Sarah Black, and15

Garry P. Nolan. Deep Profiling of Mouse16

Splenic Architecture with CODEX Multi-17

plexed Imaging. Cell, 174(4):968–981.e15,18

August 2018. ISSN 0092-8674. doi:19

10.1016/j.cell.2018.07.010. URL http:20

//www.sciencedirect.com/science/21

article/pii/S0092867418309048.22

Wuming Gong, Il-Youp Kwak, Pruthvi Pota,23

Naoko Koyano-Nakagawa, and Daniel J.24

Garry. DrImpute: imputing dropout25

events in single cell RNA sequencing26

data. BMC Bioinformatics, 19(1):220, June27

2018. ISSN 1471-2105. doi: 10.1186/28

s12859-018-2226-y. URL https://doi.29

org/10.1186/s12859-018-2226-y.30

R R Gray, O G Pybus, and M Salemi. Mea-31

suring the temporal structure in Serially-32

Sampled phylogenies. Methods Ecol. Evol.,33

2(5):437–445, October 2011.34

Anna Graybeal. Is it better to add taxa or35

characters to a difficult phylogenetic prob-36

lem? Syst. Biol., 47(1):9–17, 1998.37

Christopher Heje Grønbech, Maximil-38

lian Fornitz Vording, Pascal N Timshel,39

Casper Kaae Sønderby, Tune Hannes40

Pers, and Ole Winther. scVAE: Vari- 41

ational auto-encoders for single-cell 42

gene expression data. May 2018. URL 43

https://www.biorxiv.org/content/ 44

early/2018/05/16/318295. 45

Dominic Grün, Lennart Kester, and Alexan- 46

der van Oudenaarden. Validation of noise 47

models for single-cell transcriptomics. Na- 48

ture Methods, 11(6):637–640, June 2014. 49

ISSN 1548-7105. doi: 10.1038/nmeth. 50

2930. URL https://www.nature.com/ 51

articles/nmeth.2930. 52

Martin Guilliams, Charles-Antoine Dutertre, 53

Charlotte L. Scott, Naomi McGovern, 54

Dorine Sichien, Svetoslav Chakarov, Sofie 55

Van Gassen, Jinmiao Chen, Michael 56

Poidinger, Sofie De Prijck, Simon J. 57

Tavernier, Ivy Low, Sergio Erdal Irac, 58

Citra Nurfarah Mattar, Hermi Rizal Suma- 59

toh, Gillian Hui Ling Low, Tam John Kit 60

Chung, Dedrick Kok Hong Chan, Ker Kan 61

Tan, Tony Lim Kiat Hon, Even Fos- 62

sum, Bjarne Bogen, Mahesh Choolani, 63

Jerry Kok Yen Chan, Anis Larbi, Hervé 64

Luche, Sandrine Henri, Yvan Saeys, 65

Evan William Newell, Bart N. Lambrecht, 66

Bernard Malissen, and Florent Ginhoux. 67

Unsupervised High-Dimensional Analysis 68

Aligns Dendritic Cells across Tissues 69

and Species. Immunity, 45(3):669–684, 70

September 2016. ISSN 1074-7613. doi: 71

10.1016/j.immuni.2016.08.015. URL http: 72

//www.sciencedirect.com/science/ 73

article/pii/S1074761316303399. 74

Metin N Gurcan, Laura Boucheron, Ali Can, 75

Anant Madabhushi, Nasir Rajpoot, and 76

Bulent Yener. Histopathological image 77

analysis: A review. IEEE Rev. Biomed. 78

Eng., 2:147, 2009. 79

Hiroshi Haeno, Mithat Gonen, Meghan B 80

Davis, Joseph M Herman, Christine A 81

52

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 53: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Iacobuzio-Donahue, and Franziska Michor.1

Computational modeling of pancreatic can-2

cer reveals kinetics of metastasis suggesting3

optimum treatment strategies. Cell, 148(1-4

2):362–375, January 2012.5

Christoph Hafemeister and Rahul Satija. Nor-6

malization and variance stabilization of7

single-cell RNA-seq data using regular-8

ized negative binomial regression. March9

2019. URL https://www.biorxiv.org/10

content/10.1101/576827v2.11

Laleh Haghverdi, Maren Büttner, F. Alexan-12

der Wolf, Florian Buettner, and Fabian J.13

Theis. Diffusion pseudotime robustly14

reconstructs lineage branching. Nature15

Methods, 13(10):845–848, October 2016.16

ISSN 1548-7105. doi: 10.1038/nmeth.17

3971. URL https://www.nature.com/18

articles/nmeth.3971.19

Laleh Haghverdi, Aaron T. L. Lun,20

Michael D. Morgan, and John C. Marioni.21

Batch effects in single-cell RNA-sequencing22

data are corrected by matching mutual23

nearest neighbors. Nature Biotechnology,24

36(5):421–427, 2018. ISSN 1546-1696. doi:25

10.1038/nbt.4091.26

Xiaoping Han, Renying Wang, Yincong Zhou,27

Lijiang Fei, Huiyu Sun, Shujing Lai, Assieh28

Saadatpour, Ziming Zhou, Haide Chen,29

Fang Ye, Daosheng Huang, Yang Xu, Wen-30

tao Huang, Mengmeng Jiang, Xinyi Jiang,31

Jie Mao, Yao Chen, Chenyu Lu, Jin Xie,32

Qun Fang, Yibin Wang, Rui Yue, Tiefeng33

Li, He Huang, Stuart H Orkin, Guo-Cheng34

Yuan, Ming Chen, and Guoji Guo. Map-35

ping the mouse cell atlas by Microwell-Seq.36

Cell, 172(5):1091–1107.e17, February 2018.37

Andreas Heindl, Sidra Nawaz, and Yinyin38

Yuan. Mapping spatial heterogeneity in39

the tumor microenvironment: a new era for40

digital pathology. Lab. Invest., 95(4):377– 41

384, April 2015. 42

Stephanie C. Hicks and Roger D. Peng. 43

Elements and Principles of Data Anal- 44

ysis. arXiv:1903.07639 [stat], March 45

2019. URL http://arxiv.org/abs/1903. 46

07639. arXiv: 1903.07639. 47

Stephanie C. Hicks, F. William Townes, 48

Mingxiang Teng, and Rafael A. Irizarry. 49

Missing data and technical variability 50

in single-cell RNA-sequencing exper- 51

iments. Biostatistics, 19(4):562–578, 52

October 2018. ISSN 1465-4644. doi: 53

10.1093/biostatistics/kxx053. URL https: 54

//academic.oup.com/biostatistics/ 55

article/19/4/562/4599254. 56

Elad Hoffer and Nir Ailon. Deep Metric 57

Learning Using Triplet Network. In Aasa 58

Feragen, Marcello Pelillo, and Marco Loog, 59

editors, Similarity-Based Pattern Recogni- 60

tion, Lecture Notes in Computer Science, 61

pages 84–92. Springer International Pub- 62

lishing, 2015. ISBN 978-3-319-24261-3. 63

Ian H Holmes. Solving the master equation 64

for indels. BMC Bioinformatics, 18(1):255, 65

May 2017. 66

Chung-Chau Hon, Jay W. Shin, Piero Carn- 67

inci, and Michael J. T. Stubbington. 68

The Human Cell Atlas: Technical ap- 69

proaches and challenges. Briefings in 70

Functional Genomics, 17(4):283–294, July 71

2018. ISSN 2041-2649. doi: 10.1093/bfgp/ 72

elx029. URL https://academic.oup. 73

com/bfg/article/17/4/283/4571849. 74

Masahito Hosokawa, Yohei Nishikawa, 75

Masato Kogawa, and Haruko Takeyama. 76

Massively parallel whole genome amplifica- 77

tion for single-cell sequencing using droplet 78

microfluidics. Sci. Rep., 7(1):5199, July 79

2017. 80

53

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 54: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Yong Hou, Kui Wu, Xulian Shi, Fuqiang Li,1

Luting Song, Hanjie Wu, Michael Dean,2

Guibo Li, Shirley Tsang, Runze Jiang,3

Xiaolong Zhang, Bo Li, Geng Liu, Ni-4

harika Bedekar, Na Lu, Guoyun Xie, Han5

Liang, Liao Chang, Ting Wang, Jianghao6

Chen, Yingrui Li, Xiuqing Zhang, Huan-7

ming Yang, Xun Xu, Ling Wang, and Jun8

Wang. Comparison of variations detection9

between whole-genome amplification meth-10

ods used in single-cell resequencing. Giga-11

science, 4:37, August 2015.12

Qiwen Hu and Casey S Greene. Param-13

eter tuning is a key part of dimension-14

ality reduction via deep variational au-15

toencoders for single cell RNA transcrip-16

tomics. Pacific Symposium on Biocomput-17

ing. Pacific Symposium on Biocomputing,18

24:362–373, 2019. ISSN 2335-6936, 2335-19

6928. URL https://www.ncbi.nlm.nih.20

gov/pubmed/30963075.21

Lei Huang, Fei Ma, Alec Chapman, Sijia22

Lu, and Xiaoliang Sunney Xie. Single-Cell23

Whole-Genome amplification and sequenc-24

ing: Methodology and applications. Annu.25

Rev. Genomics Hum. Genet., 16:79–102,26

June 2015.27

Mo Huang, Jingshu Wang, Eduardo Torre,28

Hannah Dueck, Sydney Shaffer, Roberto29

Bonasio, John I. Murray, Arjun Raj,30

Mingyao Li, and Nancy R. Zhang.31

SAVER: gene expression recovery for32

single-cell RNA sequencing. Nature Meth-33

ods, 15(7):539, July 2018. ISSN 1548-7105.34

doi: 10.1038/s41592-018-0033-z. URL35

https://www.nature.com/articles/36

s41592-018-0033-z.37

Joanna Hård, Ezeddin Al Hakim, Marie38

Kindblom, Åsa K. Björklund, Bengt39

Sennblad, Ilke Demirci, Marta Paterlini,40

Pedro Reu, Erik Borgström, Patrik L.41

Ståhl, Jakob Michaelsson, Jeff E. Mold, 42

and Jonas Frisén. Conbase: a software 43

for unsupervised discovery of clonal so- 44

matic mutations in single cells through read 45

phasing. Genome Biology, 20(1):68, April 46

2019. ISSN 1474-760X. doi: 10.1186/ 47

s13059-019-1673-8. URL https://doi. 48

org/10.1186/s13059-019-1673-8. 49

T. Höllt, N. Pezzotti, V. van Unen, F. Kon- 50

ing, B. P. F. Lelieveldt, and A. Vilanova. 51

CyteGuide: Visual Guidance for Hierar- 52

chical Single-Cell Analysis. IEEE Trans- 53

actions on Visualization and Computer 54

Graphics, 24(1):739–748, January 2018. 55

ISSN 1077-2626. doi: 10.1109/TVCG.2017. 56

2744318. 57

Giovanni Iacono, Elisabetta Mereu, Amy 58

Guillaumet-Adkins, Roser Corominas, Ivon 59

Cuscó, Gustavo Rodríguez-Esteban, Marta 60

Gut, Luis Alberto Pérez-Jurado, Ivo Gut, 61

and Holger Heyn. bigSCale: an analyti- 62

cal framework for big-scale single-cell data. 63

Genome Res., 28(6):878–890, June 2018. 64

Humayun Irshad, Antoine Veillard, Ludovic 65

Roux, and Daniel Racoceanu. Methods 66

for nuclei detection, segmentation, and 67

classification in digital histopathology: A 68

Review—Current status and future poten- 69

tial, 2014. 70

Martin Jacobsen. Point Process Theory and 71

Applications: Marked Point and Piecewise 72

Deterministic Processes. Springer Science 73

& Business Media, December 2005. 74

Katharina Jahn, Jack Kuipers, and Niko 75

Beerenwinkel. Tree inference for single-cell 76

data. Genome Biol., 17:86, May 2016. 77

Livnat Jerby-Arnon, Nadja Pfetzer, Yedael Y 78

Waldman, Lynn McGarry, Daniel James, 79

Emma Shanks, Brinton Seashore-Ludlow, 80

Adam Weinstock, Tamar Geiger, Paul A 81

54

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 55: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Clemons, Eyal Gottlieb, and Eytan Rup-1

pin. Predicting cancer-specific vulnerabil-2

ity via data-driven detection of synthetic3

lethality. Cell, 158(5):1199–1209, August4

2014.5

Zhicheng Ji and Hongkai Ji. TSCAN:6

Pseudo-time reconstruction and evaluation7

in single-cell RNA-seq analysis. Nucleic8

Acids Research, 44(13):e117, 2016. ISSN9

1362-4962. doi: 10.1093/nar/gkw430.10

Nelson Johansen and Gerald Quon. scAlign:11

a tool for alignment, integration and12

rare cell identification from scRNA-seq13

data. bioRxiv, page 504944, March14

2019. doi: 10.1101/504944. URL15

https://www.biorxiv.org/content/10.16

1101/504944v4.17

Brett E Johnson, Tali Mazor, Chibo Hong,18

Michael Barnes, Koki Aihara, Cory Y19

McLean, Shaun D Fouse, Shogo Ya-20

mamoto, Hiroki Ueda, Kenji Tatsuno,21

Saurabh Asthana, Llewellyn E Jalbert,22

Sarah J Nelson, Andrew W Bollen, W Clay23

Gustafson, Elise Charron, William A24

Weiss, Ivan V Smirnov, Jun S Song,25

Adam B Olshen, Soonmee Cha, Yongjun26

Zhao, Richard A Moore, Andrew J27

Mungall, Steven J M Jones, Martin Hirst,28

Marco A Marra, Nobuhito Saito, Hiroyuki29

Aburatani, Akitake Mukasa, Mitchel S30

Berger, Susan M Chang, Barry S Taylor,31

and Joseph F Costello. Mutational anal-32

ysis reveals the origin and therapy-driven33

evolution of recurrent glioma. Science, 34334

(6167):189–193, January 2014.35

Travis S. Johnson, Tongxin Wang, Zhi36

Huang, Christina Y. Yu, Yi Wu, Yatong37

Han, Yan Zhang, Kun Huang, and Jie38

Zhang. LAmbDA: Label Ambiguous39

Domain Adaptation Dataset Integration40

Reduces Batch Effects and Improves41

Subtype Detection. Bioinformatics, April 42

2019. doi: 10.1093/bioinformatics/btz295. 43

URL https://academic.oup.com/ 44

bioinformatics/advance-article/ 45

doi/10.1093/bioinformatics/btz295/ 46

5481958. 47

Altuna Akalin Jonathan Ronen. netsmooth: 48

Network-smoothing based imputation for 49

single cell RNA-seq. F1000Res., 7, 2018. 50

Min Jung, Daniel Wells, Jannette Rusch, 51

Suhaira Ahmad, Jonathan Marchini, Si- 52

mon R Myers, and Donald F Conrad. Uni- 53

fied single-cell analysis of testis gene regu- 54

lation and pathology in five mouse strains. 55

eLife, 8, June 2019. ISSN 2050-084X. 56

doi: 10.7554/eLife.43966. URL http:// 57

dx.doi.org/10.7554/eLife.43966. 58

Melissa R Junttila and Frederic J de Sauvage. 59

Influence of tumour micro-environment 60

heterogeneity on therapeutic response. Na- 61

ture, 501(7467):346–354, September 2013. 62

Hyun Min Kang, Meena Subramaniam, Sasha 63

Targ, Michelle Nguyen, Lenka Maliskova, 64

Elizabeth McCarthy, Eunice Wan, Simon 65

Wong, Lauren Byrnes, Cristina Lanata, 66

Rachel Gate, Sara Mostafavi, Alexan- 67

der Marson, Noah Zaitlen, Lindsey A 68

Criswell, and Chun Jimmie Ye. Multi- 69

plexed droplet single-cell RNA-sequencing 70

using natural genetic variation. Na- 71

ture biotechnology, 36(1):89–94, January 72

2018a. ISSN 1087-0156. doi: 10.1038/nbt. 73

4042. URL https://www.ncbi.nlm.nih. 74

gov/pmc/articles/PMC5784859/. 75

Hyun Min Kang, Meena Subramaniam, Sasha 76

Targ, Michelle Nguyen, Lenka Maliskova, 77

Elizabeth McCarthy, Eunice Wan, Simon 78

Wong, Lauren Byrnes, Cristina M Lanata, 79

Rachel E Gate, Sara Mostafavi, Alexander 80

Marson, Noah Zaitlen, Lindsey A Criswell, 81

55

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 56: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

and Chun Jimmie Ye. Multiplexed droplet1

single-cell RNA-sequencing using natural2

genetic variation. Nat. Biotechnol., 36(1):3

89–94, January 2018b.4

Jurrian Kornelis de Kanter, Philip Lijnzaad,5

Tito Candelli, Thanasis Margaritis, and6

Frank Holstege. CHETAH: a selective, hi-7

erarchical cell type identification method8

for single-cell RNA sequencing. bioRxiv,9

page 558908, February 2019. doi: 10.1101/10

558908. URL https://www.biorxiv.org/11

content/10.1101/558908v1.12

Nikos Karaiskos, Philipp Wahle, Jonathan13

Alles, Anastasiya Boltengagen, Salah Ay-14

oub, Claudia Kipar, Christine Kocks, Niko-15

laus Rajewsky, and Robert P Zinzen. The16

drosophila embryo at single-cell transcrip-17

tome resolution. Science, 358(6360):194–18

199, October 2017a.19

Nikos Karaiskos, Philipp Wahle, Jonathan20

Alles, Anastasiya Boltengagen, Salah Ay-21

oub, Claudia Kipar, Christine Kocks, Niko-22

laus Rajewsky, and Robert P. Zinzen. The23

Drosophila embryo at single-cell transcrip-24

tome resolution. Science, 358(6360):194–25

199, October 2017b. ISSN 0036-8075,26

1095-9203. doi: 10.1126/science.aan3235.27

URL http://science.sciencemag.org/28

content/358/6360/194.29

Ino D. Karemaker and Michiel Vermeulen.30

Single-Cell DNA Methylation Profiling:31

Technologies and Biological Applications.32

Trends in Biotechnology, 36(9):952–965,33

September 2018. ISSN 0167-7799, 1879-34

3096. doi: 10.1016/j.tibtech.2018.04.002.35

URL https://www.cell.com/36

trends/biotechnology/abstract/37

S0167-7799(18)30115-X.38

Lennart Kester and Alexander van Oudenaar-39

den. Single-Cell transcriptomics meets lin-40

eage tracing. Cell Stem Cell, May 2018.41

Peter V Kharchenko, Lev Silberstein, and 42

David T Scadden. Bayesian approach 43

to single-cell differential expression analy- 44

sis. Nature Methods, 11(7):740–742, July 45

2014. ISSN 1548-7091. doi: 10.1038/ 46

nmeth.2967. URL http://www.nature. 47

com/doifinder/10.1038/nmeth.2967. 48

Kyu-Tae Kim, Hye Won Lee, Hae-Ock Lee, 49

Sang Cheol Kim, Yun Jee Seo, Woosung 50

Chung, Hye Hyeon Eum, Do-Hyun Nam, 51

Junhyong Kim, Kyeung Min Joo, and 52

Woong-Yang Park. Single-cell mRNA se- 53

quencing identifies subclonal heterogeneity 54

in anti-cancer drug responses of lung ade- 55

nocarcinoma cells. Genome Biol., 16:127, 56

June 2015. 57

Tae-Min Kim, Ruibin Xi, Lovelace J. Lu- 58

quette, Richard W. Park, Mark D. John- 59

son, and Peter J. Park. Functional 60

genomic analysis of chromosomal aber- 61

rations in a compendium of 8000 can- 62

cer genomes. Genome Research, 23(2): 63

217–227, 2013. doi: 10.1101/gr.140301. 64

112. URL http://genome.cshlp.org/ 65

content/23/2/217.abstract. 66

Marek Kimmel and David Axelrod. Branch- 67

ing Processes in Biology. Interdisci- 68

plinary Applied Mathematics. Springer- 69

Verlag, New York, 2 edition, 2015. ISBN 70

978-1-4939-1558-3. URL https://www. 71

springer.com/gp/book/9781493915583. 72

Savvas Kinalis, Finn Cilius Nielsen, Ole 73

Winther, and Frederik Otzen Bagger. 74

Deconvolution of autoencoders to learn bi- 75

ological regulatory modules from single cell 76

mRNA sequencing data. BMC bioinfor- 77

matics, 20(1):379, July 2019. ISSN 1471- 78

2105. doi: 10.1186/s12859-019-2952-9. 79

URL http://dx.doi.org/10.1186/ 80

s12859-019-2952-9. 81

56

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 57: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Vladimir Yu Kiselev, Andrew Yiu, and Mar-1

tin Hemberg. scmap: projection of single-2

cell RNA-seq data across data sets. Na-3

ture Methods, 15(5):359–362, May 2018.4

ISSN 1548-7105. doi: 10.1038/nmeth.5

4644. URL https://www.nature.com/6

articles/nmeth.4644.7

Vladimir Yu Kiselev, Tallulah S. Andrews,8

and Martin Hemberg. Challenges in9

unsupervised clustering of single-cell10

RNA-seq data. Nature Reviews Genetics,11

page 1, January 2019. ISSN 1471-0064.12

doi: 10.1038/s41576-018-0088-9. URL13

https://www.nature.com/articles/14

s41576-018-0088-9.15

Allon M. Klein, Linas Mazutis, Ilke Akar-16

tuna, Naren Tallapragada, Adrian Veres,17

Victor Li, Leonid Peshkin, David A. Weitz,18

and Marc W. Kirschner. Droplet barcod-19

ing for single-cell transcriptomics applied20

to embryonic stem cells. Cell, 161(5):1187–21

1201, May 2015. ISSN 1097-4172. doi:22

10.1016/j.cell.2015.04.044.23

C A Klein, O Schmidt-Kittler, J A Schardt,24

K Pantel, M R Speicher, and G Rieth-25

müller. Comparative genomic hybridiza-26

tion, loss of heterozygosity, and DNA se-27

quence analysis of single cells. Proc. Natl.28

Acad. Sci. U. S. A., 96(8):4494–4499, April29

1999.30

Sergey Knyazev, Viachaslau Tsyvina, An-31

drew Melnyk, Alexander Artyomenko, Ta-32

tiana Malygina, Yuri B Porozov, Ellsworth33

Campbell, William M Switzer, Pavel34

Skums, and Alex Zelikovsky. CliqueSNV:35

Scalable reconstruction of Intra-Host viral36

populations from NGS reads, 2018.37

Bryan Kolaczkowski and Joseph W Thornton.38

A mixed branch length model of hetero-39

tachy improves phylogenetic accuracy. Mol.40

Biol. Evol., 25(6):1054–1066, June 2008.41

Daisuke Komura and Shumpei Ishikawa. Ma- 42

chine learning methods for histopatho- 43

logical image analysis. Comput. Struct. 44

Biotechnol. J., 16:34–42, February 2018. 45

Say Li Kong, Huipeng Li, Joyce A Tai, Elise T 46

Courtois, Huay Mei Poh, Dawn Pingxi Lau, 47

Yu Xuan Haw, Narayanan Gopalakrishna 48

Iyer, Daniel Shao Weng Tan, Shyam Prab- 49

hakar, Dave Ruff, and Axel M Hillmer. 50

Concurrent Single-Cell RNA and targeted 51

DNA sequencing on an automated platform 52

for comeasurement of genomic and tran- 53

scriptomic signatures. Clin. Chem., 65(2): 54

272–281, February 2019. 55

Hazal Koptagel, Seong-Hwan Jun, and 56

Jens Lagergren. SCuPhr: A Prob- 57

abilistic Framework for Cell Lineage 58

Tree Reconstruction. bioRxiv, page 59

357442, June 2018. doi: 10.1101/ 60

357442. URL https://www.biorxiv.org/ 61

content/early/2018/06/29/357442. 62

Keegan D Korthauer, Li-Fang Chu, Michael A 63

Newton, Yuan Li, James Thomson, Ron 64

Stewart, and Christina Kendziorski. A sta- 65

tistical approach for identifying differential 66

distributions in single-cell RNA-seq exper- 67

iments. Genome Biol., 17(1):222, October 68

2016a. 69

Keegan D. Korthauer, Li-Fang Chu, 70

Michael A. Newton, Yuan Li, James 71

Thomson, Ron Stewart, and Christina 72

Kendziorski. A statistical approach for 73

identifying differential distributions in 74

single-cell RNA-seq experiments. Genome 75

Biology, 17(1):222, 2016b. ISSN 1474- 76

760X. doi: 10.1186/s13059-016-1077-y. 77

Johannes Köster, Myles Brown, and Xi- 78

aole Shirley Liu. A bayesian model for 79

single cell transcript expression analysis on 80

MERFISH data, September 2017. 81

57

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 58: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Dylan Kotliar, Adrian Veres, M Aurel Nagy,1

Shervin Tabrizi, Eran Hodis, Douglas A2

Melton, and Pardis C Sabeti. Identify-3

ing gene expression programs of cell-type4

identity and cellular activity with single-5

cell RNA-Seq. Elife, 8:e43803, July 2019.6

Alexey M. Kozlov, Diego Darriba, Tomáš7

Flouri, Benoit Morel, and Alexan-8

dros Stamatakis. RAxML-NG: a fast,9

scalable and user-friendly tool for10

maximum likelihood phylogenetic in-11

ference. Bioinformatics, May 2019.12

doi: 10.1093/bioinformatics/btz305.13

URL https://academic.oup.com/14

bioinformatics/advance-article/15

doi/10.1093/bioinformatics/btz305/16

5487384.17

O Kozlov. Models, Optimizations, and18

Tools for Large-Scale Phylogenetic Infer-19

ence, Handling Sequence Uncertainty, and20

Taxonomic Validation. PhD thesis, Karl-21

sruhe Institute of Technology (KIT), Octo-22

ber 2018.23

Sergey Kryazhimskiy and Joshua B Plotkin.24

The population genetics of dN/dS. PLoS25

Genet., 4(12):e1000304, December 2008.26

Jack Kuipers, Katharina Jahn, Benjamin J27

Raphael, and Niko Beerenwinkel. Single-28

cell sequencing data reveal widespread re-29

currence and loss of mutational hits in the30

life histories of tumors. Genome Res., 2731

(11):1885–1894, November 2017.32

Johannes Köster, Louis Dijkstra, Tobias33

Marschall, and Alexander Schönhuth.34

Enhancing sensitivity and controlling35

false discovery rate in somatic indel36

discovery. bioRxiv, page 741256, Au-37

gust 2019. doi: 10.1101/741256. URL38

https://www.biorxiv.org/content/10.39

1101/741256v1.40

Emma Laks, Hans Zahn, Daniel Lai, Andrew 41

McPherson, Adi Steif, Jazmine Brimhall, 42

Justina Biele, Beixi Wang, Tehmina 43

Masud, Diljot Grewal, Cydney Nielsen, 44

Samantha Leung, Viktoria Bojilova, Maia 45

Smith, Oleg Golovko, Steven Poon, Peter 46

Eirew, Farhia Kabeer, Teresa Ruiz de Al- 47

gara, So Ra Lee, M. Jafar Taghiyar, Curtis 48

Huebner, Jessica Ngo, Tim Chan, Spencer 49

Vatrt-Watts, Pascale Walters, Nafis Abrar, 50

Sophia Chan, Matt Wiens, Lauren Mar- 51

tin, R. Wilder Scott, Michael T. Under- 52

hill, Elizabeth Chavez, Christian Steidl, 53

Daniel Da Costa, Yusanne Ma, Robin J. N. 54

Coope, Richard Corbett, Stephen Plea- 55

sance, Richard Moore, Andy J. Mungall, 56

Cruk Imaxt Consortium, Marco A. Marra, 57

Carl Hansen, Sohrab Shah, and Samuel 58

Aparicio. Resource: Scalable whole genome 59

sequencing of 40,000 single cells identifies 60

stochastic aneuploidies, genome replication 61

states and clonal repertoires. bioRxiv, page 62

411058, September 2018. doi: 10.1101/ 63

411058. URL https://www.biorxiv.org/ 64

content/early/2018/09/13/411058. 65

Ruben T H M Larue, Gilles Defraene, 66

Dirk De Ruysscher, Philippe Lambin, and 67

Wouter van Elmpt. Quantitative radiomics 68

studies for tissue characterization: a re- 69

view of technology and methodological pro- 70

cedures. Br. J. Radiol., 90(1070):20160665, 71

February 2017. 72

Si Quang Le, Cuong Cao Dang, and Olivier 73

Gascuel. Modeling protein evolution with 74

several amino acid replacement matrices 75

depending on site rates. Mol. Biol. Evol., 76

29(10):2921–2936, October 2012. 77

Adam D Leaché, Barbara L Banbury, Joseph 78

Felsenstein, Adrián Nieto-Montes de Oca, 79

and Alexandros Stamatakis. Short tree, 80

long tree, right tree, wrong tree: New ac- 81

quisition bias corrections for inferring SNP 82

58

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 59: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

phylogenies. Syst. Biol., 64(6):1032–1047,1

2015.2

Je Hyuk Lee, Evan R Daugharthy, Jonathan3

Scheiman, Reza Kalhor, Thomas C Fer-4

rante, Richard Terry, Brian M Turczyk,5

Joyce L Yang, Ho Suk Lee, John Aach, Kun6

Zhang, and George M Church. Fluorescent7

in situ sequencing (FISSEQ) of RNA for8

gene expression profiling in intact cells and9

tissues. Nat. Protoc., 10(3):442–458, March10

2015.11

Jeffrey T. Leek, Robert B. Scharpf, Héc-12

tor Corrada Bravo, David Simcha, Ben-13

jamin Langmead, W. Evan Johnson, Don-14

ald Geman, Keith Baggerly, and Rafael A.15

Irizarry. Tackling the widespread and16

critical impact of batch effects in high-17

throughput data. Nature Reviews Genetics,18

11(10):733–739, October 2010. ISSN 1471-19

0064. doi: 10.1038/nrg2825. URL https:20

//www.nature.com/articles/nrg2825.21

A C Leote, X Wu, and A Beyer. Network-22

based imputation of dropouts in single-cell23

RNA sequencing data. bioRxiv, 2019.24

Wei Vivian Li and Jingyi Jessica Li. An accu-25

rate and robust imputation method scim-26

pute for single-cell RNA-seq data. Nat.27

Commun., 9(1):997, March 2018.28

Yuval Lieberman, Lior Rokach, and Tal29

Shay. CaSTLe – Classification of single30

cells by transfer learning: Harnessing31

the power of publicly available single cell32

RNA sequencing experiments to annotate33

new experiments. PLOS ONE, 13(10):34

e0205499, October 2018. ISSN 1932-6203.35

doi: 10.1371/journal.pone.0205499. URL36

https://journals.plos.org/plosone/37

article?id=10.1371/journal.pone.38

0205499.39

Chieh Lin, Siddhartha Jain, Hannah Kim, 40

and Ziv Bar-Joseph. Using neural networks 41

for reducing the dimensions of single-cell 42

RNA-Seq data. Nucleic Acids Res., 45(17): 43

e156, September 2017a. 44

Peijie Lin, Michael Troup, and Joshua W. K. 45

Ho. CIDR: Ultrafast and accurate clus- 46

tering through imputation for single-cell 47

RNA-seq data. Genome Biology, 18(1):59, 48

March 2017b. ISSN 1474-760X. doi: 10. 49

1186/s13059-017-1188-0. URL https:// 50

doi.org/10.1186/s13059-017-1188-0. 51

G C Linderman, J Zhao, and Y Kluger. Zero- 52

preserving imputation of scRNA-seq data 53

using low-rank approximation. bioRxiv, 54

2018. URL https://www.biorxiv.org/ 55

content/10.1101/397588v1.abstract. 56

Liang Liu, Zhenxiang Xi, Shaoyuan Wu, 57

Charles C Davis, and Scott V Edwards. Es- 58

timating phylogenetic trees from genome- 59

scale data. Ann. N. Y. Acad. Sci., 1360: 60

36–53, December 2015. 61

Jackson Loper, Trygve Bakken, Uygar 62

Sumbul, Gabe Murphy, Hongkui Zeng, 63

David Blei, and Liam Paninski. The 64

Markov link method: a nonparametric 65

approach to combine observations from 66

multiple experiments. bioRxiv, page 67

457283, January 2019. doi: 10.1101/ 68

457283. URL https://www.biorxiv.org/ 69

content/10.1101/457283v3. 70

Romain Lopez, Jeffrey Regier, Michael B 71

Cole, Michael I Jordan, and Nir Yosef. 72

Deep generative modeling for single-cell 73

transcriptomics. Nature methods, 15 74

(12):1053–1058, December 2018. ISSN 75

1548-7091, 1548-7105. doi: 10.1038/ 76

s41592-018-0229-2. URL http://dx.doi. 77

org/10.1038/s41592-018-0229-2. 78

59

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 60: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Eric Lubeck, Ahmet F Coskun, Timur1

Zhiyentayev, Mubhij Ahmad, and Long2

Cai. Single-cell in situ RNA profiling by3

sequential hybridization. Nat. Methods, 114

(4):360–361, April 2014.5

Aaron T L Lun and John C Marioni. Over-6

coming confounding plate effects in dif-7

ferential expression analyses of single-cell8

RNA-seq data. Biostatistics, 18(3):451–9

464, July 2017.10

Aaron T L Lun, Karsten Bach, and John C11

Marioni. Pooling across cells to normalize12

single-cell RNA sequencing data with many13

zero counts. Genome Biol., 17:75, April14

2016.15

Aaron T L Lun, Arianne C Richard, and16

John C Marioni. Testing for differential17

abundance in mass cytometry data. Nat.18

Methods, 14(7):707–709, July 2017.19

Tao Luo, Lei Fan, Rong Zhu, and Dong Sun.20

Microfluidic Single-Cell manipulation and21

analysis: Methods and applications. Micro-22

machines (Basel), 10(2), February 2019.23

Iain C. Macaulay, Mabel J. Teng, Wilfried24

Haerty, Parveen Kumar, Chris P. Ponting,25

and Thierry Voet. Separation and paral-26

lel sequencing of the genomes and tran-27

scriptomes of single cells using G&T-28

seq. Nature Protocols, 11(11):2081–2103,29

November 2016. ISSN 1750-2799. doi: 10.30

1038/nprot.2016.138. URL https://www.31

nature.com/articles/nprot.2016.138.32

Iain C. Macaulay, Chris P. Ponting, and33

Thierry Voet. Single-Cell Multiomics:34

Multiple Measurements from Single35

Cells. Trends in Genetics, 33(2):155–36

168, February 2017. ISSN 0168-9525.37

doi: 10.1016/j.tig.2016.12.003. URL38

https://www.ncbi.nlm.nih.gov/pmc/39

articles/PMC5303816/.40

Evan Z. Macosko, Anindita Basu, Rahul 41

Satija, James Nemesh, Karthik Shekhar, 42

Melissa Goldman, Itay Tirosh, Allison R. 43

Bialas, Nolan Kamitaki, Emily M. Marter- 44

steck, John J. Trombetta, David A. Weitz, 45

Joshua R. Sanes, Alex K. Shalek, Aviv 46

Regev, and Steven A. McCarroll. Highly 47

Parallel Genome-wide Expression Profil- 48

ing of Individual Cells Using Nanoliter 49

Droplets. Cell, 161(5):1202–1214, May 50

2015. ISSN 1097-4172. doi: 10.1016/j.cell. 51

2015.05.002. 52

Serghei Mangul, Lana S. Martin, Brian L. 53

Hill, Angela Ka-Mei Lam, Margaret G. 54

Distler, Alex Zelikovsky, Eleazar Es- 55

kin, and Jonathan Flint. Systematic 56

benchmarking of omics computational 57

tools. Nature Communications, 10(1): 58

1393, March 2019. ISSN 2041-1723. 59

doi: 10.1038/s41467-019-09406-4. URL 60

https://www.nature.com/articles/ 61

s41467-019-09406-4. 62

Gioele La Manno, Ruslan Soldatov, Amit 63

Zeisel, Emelie Braun, Hannah Hochgerner, 64

Viktor Petukhov, Katja Lidschreiber, 65

Maria E. Kastriti, Peter Lönnerberg, 66

Alessandro Furlan, Jean Fan, Lars E. 67

Borm, Zehua Liu, David van Bruggen, 68

Jimin Guo, Xiaoling He, Roger Barker, 69

Erik Sundström, Gonçalo Castelo-Branco, 70

Patrick Cramer, Igor Adameyko, Sten 71

Linnarsson, and Peter V. Kharchenko. 72

RNA velocity of single cells. Nature, 560 73

(7719):494, August 2018. ISSN 1476-4687. 74

doi: 10.1038/s41586-018-0414-6. URL 75

https://www.nature.com/articles/ 76

s41586-018-0414-6. 77

Erik A Martens, Rumen Kostadinov, Carlo C 78

Maley, and Oskar Hallatschek. Spatial 79

structure increases the waiting time for can- 80

cer. New J. Phys., 13, November 2011. 81

60

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 61: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Dariusz Matlak and Ewa Szczurek. Epista-1

sis in genomic and survival data of can-2

cer patients. PLoS Comput. Biol., 13(7):3

e1005626, July 2017.4

Davis James McCarthy, Raghd Rostom,5

Yuanhua Huang, Daniel J. Kunz, Petr6

Danecek, Marc Jan Bonder, Tzachi Hagai,7

HipSci Consortium, Wenyi Wang, Daniel J.8

Gaffney, Benjamin D. Simons, Oliver Ste-9

gle, and Sarah A. Teichmann. Cardelino:10

Integrating whole exomes and single-cell11

transcriptomes to reveal phenotypic im-12

pact of somatic variants. bioRxiv, page13

413047, November 2018. doi: 10.1101/14

413047. URL https://www.biorxiv.org/15

content/10.1101/413047v2.16

Nicholas McGranahan and Charles Swanton.17

Clonal heterogeneity and tumor evolution:18

Past, present, and the future. Cell, 168(4):19

613–628, February 2017.20

Chiara Medaglia, Amir Giladi, Liat Stoler-21

Barak, Marco De Giovanni, Tomer Meir22

Salame, Adi Biram, Eyal David, Han-23

jie Li, Matteo Iannacone, Ziv Shul-24

man, and Ido Amit. Spatial recon-25

struction of immune niches by combin-26

ing photoactivatable reporters and scRNA-27

seq. Science, 358(6370):1622–1626, De-28

cember 2017. ISSN 0036-8075, 1095-29

9203. doi: 10.1126/science.aao4277.30

URL http://science.sciencemag.org/31

content/358/6370/1622.32

Jing Meng and Yi-Ping Phoebe Chen. A33

database of simulated tumor genomes to-34

wards accurate detection of somatic small35

variants in cancer. PLoS One, 13(8):36

e0202982, August 2018.37

Christopher R. Merritt, Giang T. Ong,38

Sarah Church, Kristi Barker, Gary Geiss,39

Margaret Hoang, Jaemyeong Jung, Yan40

Liang, Jill McKay-Fleisch, Karen Nguyen, 41

Kristina Sorg, Isaac Sprague, Charles War- 42

ren, Sarah Warren, Zoey Zhou, Daniel R. 43

Zollinger, Dwayne L. Dunaway, Gordon B. 44

Mills, and Joseph M. Beechem. High 45

multiplex, digital spatial profiling of pro- 46

teins and RNA in fixed tissue using ge- 47

nomic detection methods. bioRxiv, page 48

559021, February 2019. doi: 10.1101/ 49

559021. URL https://www.biorxiv.org/ 50

content/10.1101/559021v2. 51

Zhun Miao, Jiaqi Li, and Xuegong Zhang. 52

screcover: Discriminating true and false ze- 53

ros in single-cell RNA-seq data for imputa- 54

tion. June 2019. 55

Franziska Michor, Yoh Iwasa, and Martin A 56

Nowak. Dynamics of cancer progression. 57

Nat. Rev. Cancer, 4(3):197–205, March 58

2004. 59

Jeffrey R Moffitt, Junjie Hao, Dhanan- 60

jay Bambah-Mukku, Tian Lu, Cather- 61

ine Dulac, and Xiaowei Zhuang. High- 62

performance multiplexed fluorescence in 63

situ hybridization in culture and tissue with 64

matrix imprinting and clearing. Proc. Natl. 65

Acad. Sci. U. S. A., 113(50):14456–14461, 66

December 2016. 67

Jeffrey R. Moffitt, Dhananjay Bambah- 68

Mukku, Stephen W. Eichhorn, Eric 69

Vaughn, Karthik Shekhar, Julio D. 70

Perez, Nimrod D. Rubinstein, Junjie 71

Hao, Aviv Regev, Catherine Dulac, 72

and Xiaowei Zhuang. Molecular, spa- 73

tial, and functional single-cell profiling 74

of the hypothalamic preoptic region. 75

Science, 362(6416):eaau5324, Novem- 76

ber 2018. ISSN 0036-8075, 1095-9203. 77

doi: 10.1126/science.aau5324. URL 78

http://science.sciencemag.org/ 79

content/362/6416/eaau5324. 80

61

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 62: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Kevin R Moon, Jay S Stanley, Daniel1

Burkhardt, David van Dijk, Guy Wolf, and2

Smita Krishnaswamy. Manifold learning-3

based methods for analyzing single-cell4

RNA-sequencing data. Current Opinion in5

Systems Biology, 7:36–46, 2018.6

Marmar Moussa and Ion I. Măndoiu.7

Locality Sensitive Imputation for Sin-8

gle Cell RNA-Seq Data. Journal of9

Computational Biology, February 2019.10

doi: 10.1089/cmb.2018.0236. URL11

https://www.liebertpub.com/doi/10.12

1089/cmb.2018.0236.13

Nature Methods, 2013. Method of the year14

2013. Nature Methods, 11:1 EP –, 1215

2013. URL https://doi.org/10.1038/16

nmeth.2801.17

Nicholas Navin, Jude Kendall, Jennifer Troge,18

Peter Andrews, Linda Rodgers, Jeanne19

McIndoo, Kerry Cook, Asya Stepan-20

sky, Dan Levy, Diane Esposito, Lakshmi21

Muthuswamy, Alex Krasnitz, W Richard22

McCombie, James Hicks, and Michael23

Wigler. Tumour evolution inferred by24

single-cell sequencing. Nature, 472(7341):25

90–94, April 2011.26

Richard A Neher, Colin A Russell, and Boris I27

Shraiman. Predicting evolution from the28

shape of genealogical trees. Elife, 3, Novem-29

ber 2014.30

Malgorzata Nowicka, Carsten Krieg, Lukas M31

Weber, Felix J Hartmann, Silvia Guglietta,32

Burkhard Becher, Mitchell P Levesque,33

and Mark D Robinson. CyTOF work-34

flow: differential discovery in high-35

throughput high-dimensional cytometry36

datasets. F1000Res., 6:748, May 2017.37

Huw A Ogilvie, Remco R Bouckaert, and38

Alexei J Drummond. StarBEAST2 brings39

faster species tree inference and accurate40

estimates of substitution rates. Mol. Biol. 41

Evol., 34(8):2101–2114, August 2017. 42

J Guillermo Paez, Ming Lin, Rameen 43

Beroukhim, Jeffrey C Lee, Xiaojun Zhao, 44

Daniel J Richter, Stacey Gabriel, Paula 45

Herman, Hidefumi Sasaki, David Alt- 46

shuler, Cheng Li, Matthew Meyerson, and 47

William R Sellers. Genome coverage and 48

sequence fidelity of phi29 polymerase-based 49

multiple strand displacement whole genome 50

amplification. Nucleic Acids Res., 32(9): 51

e71, May 2004. 52

Jong-Eun Park, Krzysztof Polanski, Kerstin 53

Meyer, and Sarah A. Teichmann. Fast 54

Batch Alignment of Single Cell Transcrip- 55

tomes Unifies Multiple Mouse Cell Atlases 56

into an Integrated Landscape. bioRxiv, 57

page 397042, August 2018. doi: 10.1101/ 58

397042. URL https://www.biorxiv.org/ 59

content/10.1101/397042v2. 60

Anoop P Patel, Itay Tirosh, John J Trom- 61

betta, Alex K Shalek, Shawn M Gille- 62

spie, Hiroaki Wakimoto, Daniel P Cahill, 63

Brian V Nahed, William T Curry, Robert L 64

Martuza, David N Louis, Orit Rozenblatt- 65

Rosen, Mario L Suvà, Aviv Regev, and 66

Bradley E Bernstein. Single-cell RNA- 67

seq highlights intratumoral heterogeneity 68

in primary glioblastoma. Science, 344 69

(6190):1396–1401, June 2014. 70

Tao Peng, Qin Zhu, Penghang Yin, and 71

Kai Tan. SCRABBLE: single-cell RNA- 72

seq imputation constrained by bulk RNA- 73

seq data. Genome biology, 20(1):88, May 74

2019. ISSN 1465-6906. doi: 10.1186/ 75

s13059-019-1681-8. URL http://dx.doi. 76

org/10.1186/s13059-019-1681-8. 77

N. Pezzotti, T. Höllt, B. Lelieveldt, E. Eise- 78

mann, and A. Vilanova. Hierarchical 79

Stochastic Neighbor Embedding. Com- 80

puter Graphics Forum, 35(3):21–30, 2016. 81

62

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 63: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

ISSN 1467-8659. doi: 10.1111/cgf.1

12878. URL https://onlinelibrary.2

wiley.com/doi/abs/10.1111/cgf.12878.3

Ángel J Picher, Bettina Budeus, Oliver4

Wafzig, Carola Krüger, Sara García-5

Gómez, María I Martínez-Jiménez, Al-6

berto Díaz-Talavera, Daniela Weber, Luis7

Blanco, and Armin Schneider. TruePrime8

is a novel method for whole-genome am-9

plification from single cells based on Tth-10

PrimPol. Nat. Commun., 7:13296, Novem-11

ber 2016.12

Emma Pierson and Christopher Yau.13

ZIFA: Dimensionality reduction for14

zero-inflated single-cell gene expres-15

sion analysis. Genome Biology, 1616

(1):241, November 2015. ISSN 1474-17

760X. doi: 10.1186/s13059-015-0805-z.18

URL https://doi.org/10.1186/19

s13059-015-0805-z.20

Mireya Plass, Jordi Solana, F. Alexan-21

der Wolf, Salah Ayoub, Aristotelis Mi-22

sios, Petar Glažar, Benedikt Obermayer,23

Fabian J. Theis, Christine Kocks, and Niko-24

laus Rajewsky. Cell type atlas and lineage25

tree of a whole complex animal by single-26

cell transcriptomics. Science, 360(6391):27

eaaq1723, May 2018. ISSN 0036-8075,28

1095-9203. doi: 10.1126/science.aaq1723.29

URL http://science.sciencemag.org/30

content/360/6391/eaaq1723.31

Hannah A. Pliner, Jonathan S. Packer,32

José L. McFaline-Figueroa, Darren A.33

Cusanovich, Riza M. Daza, Delasa34

Aghamirzaie, Sanjay Srivatsan, Xiaojie35

Qiu, Dana Jackson, Anna Minkina, An-36

drew C. Adey, Frank J. Steemers, Jay37

Shendure, and Cole Trapnell. Cicero38

Predicts cis-Regulatory DNA Interactions39

from Single-Cell Chromatin Accessi-40

bility Data. Molecular Cell, 71(5):41

858–871.e8, 2018. ISSN 1097-4164. doi: 42

10.1016/j.molcel.2018.06.044. 43

Olivier Poirion, Xun Zhu, Travers Ching, and 44

Lana X. Garmire. Using single nucleotide 45

variations in single-cell RNA-seq to identify 46

subpopulations and genotype-phenotype 47

linkage. Nature Communications, 9(1): 48

4892, November 2018. ISSN 2041-1723. 49

doi: 10.1038/s41467-018-07170-5. URL 50

https://www.nature.com/articles/ 51

s41467-018-07170-5. 52

David D Pollock, Derrick J Zwickl, Jimmy A 53

McGuire, and David M Hillis. Increased 54

taxon sampling is advantageous for phylo- 55

genetic inference. Syst. Biol., 51(4):664– 56

671, August 2002. 57

Vladimir Potapov and Jennifer L Ong. Ex- 58

amining sources of error in PCR by Single- 59

Molecule sequencing. PLoS One, 12(1): 60

e0169774, January 2017. 61

Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, 62

Raghav Chawla, Hannah A. Pliner, and 63

Cole Trapnell. Reversed graph embed- 64

ding resolves complex single-cell trajecto- 65

ries. Nature Methods, 14(10):979–982, Oc- 66

tober 2017. ISSN 1548-7105. doi: 10.1038/ 67

nmeth.4402. URL https://www.nature. 68

com/articles/nmeth.4402. 69

Bruce Rannala and Ziheng Yang. Efficient 70

bayesian species tree inference under the 71

multispecies coalescent. Syst. Biol., 66(5): 72

823–842, September 2017. 73

Benjamin Redelings. Erasing errors due to 74

alignment ambiguity when estimating posi- 75

tive selection. Mol. Biol. Evol., 31(8):1979– 76

1993, August 2014. 77

Aviv Regev, Sarah A. Teichmann, Eric S. 78

Lander, Ido Amit, Christophe Benoist, 79

Ewan Birney, Bernd Bodenmiller, Peter 80

63

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 64: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Campbell, Piero Carninci, Menna Clat-1

worthy, Hans Clevers, Bart Deplancke,2

Ian Dunham, James Eberwine, Roland3

Eils, Wolfgang Enard, Andrew Farmer,4

Lars Fugger, Berthold Göttgens, Nir Ha-5

cohen, Muzlifah Haniffa, Martin Hem-6

berg, Seung Kim, Paul Klenerman, Arnold7

Kriegstein, Ed Lein, Sten Linnarsson,8

Joakim Lundeberg, Partha Majumder,9

John C. Marioni, Miriam Merad, Musa10

Mhlanga, Martijn Nawijn, Mihai Netea,11

Garry Nolan, Dana Pe’er, Anthony Philli-12

pakis, Chris P. Ponting, Steve Quake,13

Wolf Reik, Orit Rozenblatt-Rosen, Joshua14

Sanes, Rahul Satija, Ton N. Schumacher,15

Alex Shalek, Ehud Shapiro, Padmanee16

Sharma, Jay W. Shin, Oliver Stegle,17

Michael Stratton, Michael J. T. Stubbing-18

ton, Alexander van Oudenaarden, Allon19

Wagner, Fiona Watt, Jonathan Weissman,20

Barbara Wold, Ramnik Xavier, Nir Yosef,21

and the Human Cell Atlas Meeting Partic-22

ipants. The Human Cell Atlas. bioRxiv,23

page 121202, May 2017. doi: 10.1101/24

121202. URL https://www.biorxiv.org/25

content/10.1101/121202v1.26

John E. Reid and Lorenz Wernisch. Pseu-27

dotime estimation: deconfounding single28

cell time series. Bioinformatics, 32(19):29

2973–2980, October 2016. ISSN 1367-30

4803. doi: 10.1093/bioinformatics/btw372.31

URL https://academic.oup.com/32

bioinformatics/article/32/19/2973/33

2196633.34

Stephen Reid, Jonathan Taylor, and Robert35

Tibshirani. A general framework for esti-36

mation and inference from clusters of fea-37

tures. J. Am. Stat. Assoc., 113(521):280–38

293, January 2018.39

Davide Risso, Fanny Perraudeau, Svet-40

lana Gribkova, Sandrine Dudoit, and41

Jean-Philippe Vert. A general and42

flexible method for signal extraction 43

from single-cell RNA-seq data. Na- 44

ture Communications, 9(1):284, Jan- 45

uary 2018. ISSN 2041-1723. doi: 46

10.1038/s41467-017-02554-5. URL 47

https://www.nature.com/articles/ 48

s41467-017-02554-5. 49

Elena Rivas and Sean R Eddy. Probabilis- 50

tic phylogenetic inference with insertions 51

and deletions. PLoS Comput. Biol., 4(9): 52

e1000172, September 2008. 53

Abbas H. Rizvi, Pablo G. Camara, Elena K. 54

Kandror, Thomas J. Roberts, Ira Schieren, 55

Tom Maniatis, and Raul Rabadan. Single- 56

cell topological RNA-seq analysis reveals 57

insights into cellular differentiation and de- 58

velopment. Nature Biotechnology, 35(6): 59

551–560, 2017. ISSN 1546-1696. doi: 10. 60

1038/nbt.3854. 61

Simone Rizzetto, Auda A Eltahla, Peijie Lin, 62

Rowena Bull, Andrew R Lloyd, Joshua 63

W K Ho, Vanessa Venturi, and Fabio Lu- 64

ciani. Impact of sequencing depth and read 65

length on single cell RNA sequencing data 66

of T cells. Sci. Rep., 7(1):12781, October 67

2017. 68

S Roch. A short proof that phylogenetic tree 69

reconstruction by maximum likelihood is 70

hard. IEEE/ACM Trans. Comput. Biol. 71

Bioinform., 3(1):92–94, 2006. 72

Samuel G. Rodriques, Robert R. Stick- 73

els, Aleksandrina Goeva, Carly A. Mar- 74

tin, Evan Murray, Charles R. Vander- 75

burg, Joshua Welch, Linlin M. Chen, Fei 76

Chen, and Evan Z. Macosko. Slide- 77

seq: A scalable technology for measur- 78

ing genome-wide expression at high spa- 79

tial resolution. Science, 363(6434):1463– 80

1467, March 2019. ISSN 0036-8075, 81

1095-9203. doi: 10.1126/science.aaw1219. 82

64

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 65: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

URL https://science.sciencemag.org/1

content/363/6434/1463.2

Florian Rohart, Aida Eslami, Nicholas Mati-3

gian, Stéphanie Bougeard, and Kim-4

Anh Lê Cao. MINT: a multivari-5

ate integrative method to identify re-6

producible molecular signatures across7

independent experiments and platforms.8

BMC Bioinformatics, 18(1):128, February9

2017a. ISSN 1471-2105. doi: 10.1186/10

s12859-017-1553-8. URL https://doi.11

org/10.1186/s12859-017-1553-8.12

Florian Rohart, Benoît Gautier, Amrit13

Singh, and Kim-Anh Lê Cao. mixOmics:14

An R package for ‘omics feature se-15

lection and multiple data integration.16

PLOS Computational Biology, 13(11):17

e1005752, November 2017b. ISSN 1553-18

7358. doi: 10.1371/journal.pcbi.1005752.19

URL https://journals.plos.org/20

ploscompbiol/article?id=10.1371/21

journal.pcbi.1005752.22

Alexander B. Rosenberg, Charles M. Roco,23

Richard A. Muscat, Anna Kuchina, Paul24

Sample, Zizhen Yao, Lucas T. Graybuck,25

David J. Peeler, Sumit Mukherjee, Wei26

Chen, Suzie H. Pun, Drew L. Sellers,27

Bosiljka Tasic, and Georg Seelig. Single-28

cell profiling of the developing mouse brain29

and spinal cord with split-pool barcoding.30

Science, 360(6385):176–182, April 2018.31

ISSN 0036-8075, 1095-9203. doi: 10.1126/32

science.aam8999. URL http://science.33

sciencemag.org/content/360/6385/176.34

Edith M Ross and Florian Markowetz. On-35

coNEM: inferring tumor evolution from36

single-cell sequencing data. Genome Biol.,37

17:69, April 2016.38

Andrew Roth, Andrew McPherson, Emma39

Laks, Justina Biele, Damian Yap, Adrian40

Wan, Maia A Smith, Cydney B Nielsen, 41

Jessica N McAlpine, Samuel Aparicio, 42

Alexandre Bouchard-Côté, and Sohrab P 43

Shah. Clonal genotype and population 44

structure inference from single-cell tumor 45

sequencing. Nat. Methods, 13(7):573–576, 46

July 2016. 47

Adela Saco, Jose Ramírez, Natalia Rakislova, 48

Aurea Mira, and Jaume Ordi. Validation of 49

Whole-Slide imaging for histolopathogical 50

diagnosis: Current state. Pathobiology, 83 51

(2-3):89–98, April 2016. 52

Wouter Saelens, Robrecht Cannoodt, 53

Helena Todorov, and Yvan Saeys. A 54

comparison of single-cell trajectory in- 55

ference methods. Nature Biotechnology, 56

page 1, April 2019. ISSN 1546-1696. 57

doi: 10.1038/s41587-019-0071-9. URL 58

https://www.nature.com/articles/ 59

s41587-019-0071-9. 60

Yvan Saeys, Sofie Van Gassen, and Bart N. 61

Lambrecht. Computational flow cytom- 62

etry: helping to make sense of high- 63

dimensional immunology data. Nature 64

Reviews Immunology, 16(7):449–462, July 65

2016. ISSN 1474-1741. doi: 10.1038/nri. 66

2016.56. URL https://www.nature.com/ 67

articles/nri.2016.56. 68

Stefano Santaguida, Amelia Richardson, 69

Divya Ramalingam Iyer, Ons M’Saad, 70

Lauren Zasadil, Kristin A. Knouse, 71

Yao Liang Wong, Nicholas Rhind, Arshad 72

Desai, and Angelika Amon. Chromosome 73

Mis-segregation Generates Cell-Cycle- 74

Arrested Cells with Complex Karyotypes 75

that Are Eliminated by the Immune 76

System. Developmental Cell, 41(6): 77

638–651.e5, June 2017. ISSN 15345807. 78

doi: 10.1016/j.devcel.2017.05.022. URL 79

https://linkinghub.elsevier.com/ 80

retrieve/pii/S1534580717304306. 81

65

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 66: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Gryte Satas and Benjamin J Raphael. Haplo-1

type phasing in single-cell DNA-sequencing2

data. Bioinformatics, 34(13):i211–i217,3

July 2018.4

Rahul Satija, Jeffrey A Farrell, David Gen-5

nert, Alexander F Schier, and Aviv Regev.6

Spatial reconstruction of single-cell gene ex-7

pression data. Nat. Biotechnol., 33(5):495–8

502, May 2015.9

Kenta Sato, Koki Tsuyuzaki, Kentaro10

Shimizu, and Itoshi Nikaido. CellFish-11

ing.jl: an ultrafast and scalable cell12

search method for single-cell RNA sequenc-13

ing. Genome Biology, 20(1):31, February14

2019. ISSN 1474-760X. doi: 10.1186/15

s13059-019-1639-x. URL https://doi.16

org/10.1186/s13059-019-1639-x.17

Arpiar Saunders, Evan Z. Macosko, Alec18

Wysoker, Melissa Goldman, Fenna M.19

Krienen, Heather de Rivera, Elizabeth20

Bien, Matthew Baum, Laura Bortolin,21

Shuyu Wang, Aleksandrina Goeva, James22

Nemesh, Nolan Kamitaki, Sara Brum-23

baugh, David Kulp, and Steven A.24

McCarroll. Molecular Diversity and Spe-25

cializations among the Cells of the Adult26

Mouse Brain. Cell, 174(4):1015–1030.e16,27

August 2018. ISSN 0092-8674. doi:28

10.1016/j.cell.2018.07.028. URL http:29

//www.sciencedirect.com/science/30

article/pii/S0092867418309553.31

Denis Schapiro, Hartland W Jackson, Swetha32

Raghuraman, Jana R Fischer, Vito R T33

Zanotelli, Daniel Schulz, Charlotte Giesen,34

Raúl Catena, Zsuzsanna Varga, and Bernd35

Bodenmiller. histoCAT: analysis of cell36

phenotypes and interactions in multiplex37

image cytometry data. Nat. Methods, 1438

(9):873–876, September 2017.39

Geoffrey Schiebinger, Jian Shu, Marcin40

Tabaka, Brian Cleary, Vidya Subrama-41

nian, Aryeh Solomon, Siyan Liu, Sta- 42

cie Lin, Peter Berube, Lia Lee, Jenny 43

Chen, Justin Brumbaugh, Philippe Rigol- 44

let, Konrad Hochedlinger, Rudolf Jaenisch, 45

Aviv Regev, and Eric S. Lander. Re- 46

construction of developmental landscapes 47

by optimal-transport analysis of single- 48

cell gene expression sheds light on cel- 49

lular reprogramming. bioRxiv, page 50

191056, September 2017. doi: 10.1101/ 51

191056. URL https://www.biorxiv.org/ 52

content/10.1101/191056v1. 53

Herbert B Schiller, Daniel T Montoro, 54

Lukas M Simon, Emma L Rawlins, 55

Kerstin B Meyer, Maximilian Strunz, 56

Felipe Vieira Braga, Wim Timens, Ger- 57

ard H Koppelman, G.R. Scott Budinger, 58

Janette K Burgess, Avinash Waghray, 59

Maarten van den Berge, Fabian J Theis, 60

Aviv Regev, Naftali Kaminski, Jayaraj 61

Rajagopal, Sarah A Teichmann, Alexan- 62

der V Misharin, and Martijn C Nawijn. 63

The Human Lung Cell Atlas - A high- 64

resolution reference map of the human 65

lung in health and disease. American 66

Journal of Respiratory Cell and Molecular 67

Biology, April 2019. ISSN 1044-1549. 68

doi: 10.1165/rcmb.2018-0416TR. URL 69

https://www.atsjournals.org/doi/ 70

abs/10.1165/rcmb.2018-0416TR. 71

Roland F. Schwarz, Anne Trinh, Botond 72

Sipos, James D. Brenton, Nick Gold- 73

man, and Florian Markowetz. Phyloge- 74

netic Quantification of Intra-tumour Het- 75

erogeneity. PLoS Computational Biol- 76

ogy, 10(4):e1003535, April 2014. ISSN 77

1553-7358. doi: 10.1371/journal.pcbi. 78

1003535. URL https://dx.plos.org/10. 79

1371/journal.pcbi.1003535. 80

Roberto Semeraro, Valerio Orlandini, and Al- 81

berto Magi. Xome-Blender: A novel can- 82

66

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 67: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

cer genome simulator. PLoS One, 13(4):1

e0194472, April 2018.2

Debarka Sengupta, Nirmala Arul Rayan,3

Michelle Lim, Bing Lim, and Shyam4

Prabhakar. Fast, scalable and accu-5

rate differential expression analysis for6

single cells. bioRxiv, page 049734,7

April 2016. doi: 10.1101/049734. URL8

https://www.biorxiv.org/content/10.9

1101/049734v1.10

Manu Setty, Michelle D. Tadmor, Shlomit11

Reich-Zeliger, Omer Angel, Tomer Meir12

Salame, Pooja Kathail, Kristy Choi, Sean13

Bendall, Nir Friedman, and Dana Pe’er.14

Wishbone identifies bifurcating develop-15

mental trajectories from single-cell data.16

Nature Biotechnology, 34(6):637–645, June17

2016. ISSN 1546-1696. doi: 10.1038/nbt.18

3569. URL https://www.nature.com/19

articles/nbt.3569.20

D T Severson, R P Owen, M J White, X Lu,21

and B Schuster-Böckler. BEARscc deter-22

mines robustness of single-cell clusters us-23

ing simulated technical replicates. Nat.24

Commun., 9(1):1187, March 2018.25

Sheel Shah, Eric Lubeck, Wen Zhou, and26

Long Cai. In situ transcription profiling of27

single cells reveals spatial organization of28

cells in the mouse hippocampus. Neuron,29

92(2):342–357, October 2016.30

Arun Shivanandan, Jayakrishnan Unnikrish-31

nan, and Aleksandra Radenovic. On char-32

acterizing protein spatial clusters with cor-33

relation approaches. Sci. Rep., 6:31164,34

August 2016.35

Angus M Sidore, Freeman Lan, Shaun W36

Lim, and Adam R Abate. Enhanced se-37

quencing coverage with digital droplet mul-38

tiple displacement amplification. Nucleic39

Acids Res., 44(7):e66, April 2016.40

Jochen Singer, Jack Kuipers, Katharina 41

Jahn, and Niko Beerenwinkel. Single-cell 42

mutation identification via phylogenetic 43

inference. Nature Communications, 9(1): 44

5144, December 2018. ISSN 2041-1723. 45

doi: 10.1038/s41467-018-07627-7. URL 46

https://www.nature.com/articles/ 47

s41467-018-07627-7. 48

Amrit Singh, Benoit Gautier, Casey P. 49

Shannon, Florian Rohart, Michael Vacher, 50

Scott J. Tebutt, and Kim-Anh Le Cao. 51

DIABLO: from multi-omics assays to 52

biomarker discovery, an integrative ap- 53

proach. bioRxiv, page 067611, March 54

2018. doi: 10.1101/067611. URL 55

https://www.biorxiv.org/content/10. 56

1101/067611v2. 57

Debajyoti Sinha, Akhilesh Kumar, Himan- 58

shu Kumar, Sanghamitra Bandyopadhyay, 59

and Debarka Sengupta. dropclust: efficient 60

clustering of ultra-large scRNA-seq data. 61

Nucleic Acids Res., 46(6):e36, April 2018. 62

Pavel Skums, Viachaslau Tsyvina, and Alex 63

Zelikovsky. Inference of clonal selec- 64

tion in cancer populations using single- 65

cell sequencing data. bioRxiv, page 66

465211, January 2019. doi: 10.1101/ 67

465211. URL https://www.biorxiv.org/ 68

content/10.1101/465211v2. 69

Martin D Smith, Joel O Wertheim, Steven 70

Weaver, Ben Murrell, Konrad Scheffler, and 71

Sergei L Kosakovsky Pond. Less is more: an 72

adaptive branch-site random effects model 73

for efficient detection of episodic diversify- 74

ing selection. Mol. Biol. Evol., 32(5):1342– 75

1353, May 2015. 76

Charlotte Soneson and Mark D Robinson. To- 77

wards unified quality verification of syn- 78

thetic count data with countsimQC. Bioin- 79

formatics, 34(4):691–692, 2017. 80

67

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 68: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Charlotte Soneson and Mark D Robinson.1

Bias, robustness and scalability in single-2

cell differential expression analysis. Nat.3

Methods, February 2018.4

Bastiaan Spanjaard, Bo Hu, Nina Mitic,5

Pedro Olivares-Chauvet, Sharan Janjuha,6

Nikolay Ninov, and Jan Philipp Junker.7

Simultaneous lineage tracing and cell-type8

identification using CRISPR-Cas9-induced9

genetic scars. Nat. Biotechnol., 36(5):469–10

473, June 2018.11

C Spits, C Le Caignec, M De Rycke,12

L Van Haute, A Van Steirteghem,13

I Liebaers, and K Sermon. Optimization14

and evaluation of single-cell whole-genome15

multiple displacement amplification. Hum.16

Mutat., 27(5):496–503, 2006a.17

Claudia Spits, Cédric Le Caignec, Martine18

De Rycke, Lindsey Van Haute, André19

Van Steirteghem, Inge Liebaers, and Karen20

Sermon. Whole-genome multiple displace-21

ment amplification from single cells. Nat.22

Protoc., 1(4):1965–1970, November 2006b.23

S Srinivasan, N T Johnson, and D Ko-24

rkin. A Hybrid Deep Clustering Ap-25

proach for Robust Cell Type Profiling Us-26

ing Single-cell RNA-seq Data. bioRxiv,27

2019. URL https://www.biorxiv.org/28

content/10.1101/511626v1.abstract.29

Divyanshu Srivastava, Arvind Iyer, Vibhor30

Kumar, and Debarka Sengupta. Cel-31

lAtlasSearch: a scalable search engine32

for single cells. Nucleic Acids Re-33

search, 46(W1):W141–W147, July 2018.34

ISSN 0305-1048. doi: 10.1093/nar/35

gky421. URL https://academic.oup.36

com/nar/article/46/W1/W141/5000022.37

Oliver Stegle, Sarah A Teichmann, and38

John C Marioni. Computational and an-39

alytical challenges in single-cell transcrip-40

tomics. Nat. Rev. Genet., 16(3):133, Jan- 41

uary 2015. 42

Genevieve L Stein-O’Brien, Brian S Clark, 43

Thomas Sherman, Cristina Zibetti, Qi- 44

wen Hu, Rachel Sealfon, Sheng Liu, Jiang 45

Qian, Carlo Colantuoni, Seth Blackshaw, 46

Loyal A Goff, and Elana J Fertig. De- 47

composing Cell Identity for Transfer Learn- 48

ing across Cellular Measurements, Plat- 49

forms, Tissues, and Species. Cell sys- 50

tems, 8(5):395–411.e8, May 2019. ISSN 51

2405-4720, 2405-4712. doi: 10.1016/j.cels. 52

2019.04.004. URL http://dx.doi.org/ 53

10.1016/j.cels.2019.04.004. 54

Lars Steinbrück and Alice Carolyn McHardy. 55

Allele dynamics plots for the study of evolu- 56

tionary dynamics in viral populations. Nu- 57

cleic Acids Res., 39(1):e4, January 2011. 58

Carina Strell, Markus M Hilscher, Navya 59

Laxman, Jessica Svedlund, Chenglin Wu, 60

Chika Yokota, and Mats Nilsson. Placing 61

RNA in context and space - methods for 62

spatially resolved transcriptomics. FEBS 63

J., March 2018. 64

Tim Stuart, Andrew Butler, Paul Hoffman, 65

Christoph Hafemeister, Efthymia Papalexi, 66

William M. Mauck, Marlon Stoeckius, Pe- 67

ter Smibert, and Rahul Satija. Comprehen- 68

sive integration of single cell data. bioRxiv, 69

page 460147, November 2018. doi: 10.1101/ 70

460147. URL https://www.biorxiv.org/ 71

content/10.1101/460147v1. 72

Michael J T Stubbington, Orit Rozenblatt- 73

Rosen, Aviv Regev, and Sarah A Teich- 74

mann. Single-cell transcriptomics to ex- 75

plore the immune system in health and dis- 76

ease. Science, 358(6359):58–63, October 77

2017. 78

Patrik L. Ståhl, Fredrik Salmén, Sanja Vick- 79

ovic, Anna Lundmark, José Fernández 80

68

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 69: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Navarro, Jens Magnusson, Stefania Gia-1

comello, Michaela Asp, Jakub O. West-2

holm, Mikael Huss, Annelie Mollbrink,3

Sten Linnarsson, Simone Codeluppi, Åke4

Borg, Fredrik Pontén, Paul Igor Costea,5

Pelin Sahlén, Jan Mulder, Olaf Bergmann,6

Joakim Lundeberg, and Jonas Frisén. Vi-7

sualization and analysis of gene expres-8

sion in tissue sections by spatial transcrip-9

tomics. Science (New York, N.Y.), 35310

(6294):78–82, July 2016. ISSN 1095-9203.11

doi: 10.1126/science.aaf2403.12

S Sun, J Zhu, Y Ma, and X Zhou. Ac-13

curacy, Robustness and Scalability of Di-14

mensionality Reduction Methods for Sin-15

gle Cell RNAseq Analysis. bioRxiv,16

2019. URL https://www.biorxiv.org/17

content/10.1101/641142v1.abstract.18

Valentine Svensson, Sarah A Teichmann, and19

Oliver Stegle. SpatialDE: identification of20

spatially variable genes. Nat. Methods, 1521

(5):343–346, May 2018a.22

Valentine Svensson, Roser Vento-Tormo, and23

Sarah A. Teichmann. Exponential scal-24

ing of single-cell RNA-seq in the past25

decade. Nature Protocols, 13(4):599–604,26

April 2018b. ISSN 1750-2799. doi: 10.27

1038/nprot.2017.149. URL https://www.28

nature.com/articles/nprot.2017.149.29

Charles Swanton. Intratumor heterogeneity:30

evolution through space and time. Cancer31

Res., 72(19):4875–4882, October 2012.32

Ewa Szczurek, Navodit Misra, and Martin33

Vingron. Synthetic sickness or lethality34

points at candidate combination therapy35

targets in glioblastoma. Int. J. Cancer, 13336

(9):2123–2132, November 2013.37

The Tabula Muris Consortium. Single-cell38

transcriptomics of 20 mouse organs cre-39

ates a Tabula Muris. Nature, 562(7727):40

367, October 2018. ISSN 1476-4687. 41

doi: 10.1038/s41586-018-0590-4. URL 42

https://www.nature.com/articles/ 43

s41586-018-0590-4. 44

Divyanshu Talwar, Aanchal Mongia, De- 45

barka Sengupta, and Angshul Majum- 46

dar. AutoImpute: Autoencoder based 47

imputation of single-cell RNA-seq data. 48

Scientific reports, 8(1):16329, November 49

2018. ISSN 2045-2322. doi: 10.1038/ 50

s41598-018-34688-x. URL http://dx. 51

doi.org/10.1038/s41598-018-34688-x. 52

Amos Tanay and Aviv Regev. Scaling 53

single-cell genomics from phenomenology 54

to mechanism. Nature, 541(7637):331–338, 55

January 2017. 56

W Tang, F Bertaux, P Thomas, C Ste- 57

fanelli, M Saint, and others. bayNorm: 58

Bayesian gene expression recovery, im- 59

putation and normalisation for single 60

cell RNA-sequencing data. bioRxiv, 61

2018. URL https://www.biorxiv.org/ 62

content/10.1101/384586v2.abstract. 63

H Telenius, N P Carter, C E Bebb, M Norden- 64

skjöld, B A Ponder, and A Tunnacliffe. De- 65

generate oligonucleotide-primed PCR: gen- 66

eral amplification of target DNA by a single 67

degenerate primer. Genomics, 13(3):718– 68

725, July 1992. 69

Luyi Tian, Xueyi Dong, Saskia Freytag, 70

Kim-Anh Lê Cao, Shian Su, Abolfazl 71

JalalAbadi, Daniela Amann-Zalcenstein, 72

Tom S. Weber, Azadeh Seidi, Jafar S. 73

Jabbari, Shalin H. Naik, and Matthew E. 74

Ritchie. Benchmarking single cell RNA- 75

sequencing analysis pipelines using mixture 76

control experiments. Nature Methods, 16 77

(6):479, June 2019. ISSN 1548-7105. 78

doi: 10.1038/s41592-019-0425-8. URL 79

https://www.nature.com/articles/ 80

s41592-019-0425-8. 81

69

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 70: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

F William Townes, Stephanie C Hicks,1

Martin J Aryee, and Rafael A Irizarry.2

Feature Selection and Dimension Reduc-3

tion for Single Cell RNA-Seq based on a4

Multinomial Model. March 2019. URL5

https://www.biorxiv.org/content/10.6

1101/574574v1.7

Cole Trapnell, Davide Cacchiarelli, Jonna8

Grimsby, Prapti Pokharel, Shuqiang Li,9

Michael Morse, Niall J. Lennon, Kenneth J.10

Livak, Tarjei S. Mikkelsen, and John L.11

Rinn. The dynamics and regulators of cell12

fate decisions are revealed by pseudotempo-13

ral ordering of single cells. Nature Biotech-14

nology, 32(4):381–386, April 2014. ISSN15

1546-1696. doi: 10.1038/nbt.2859.16

Samra Turajlic and Charles Swanton. Metas-17

tasis as an evolutionary process. Science,18

352(6282):169–175, April 2016.19

Vincent van Unen, Thomas Höllt, Nicola20

Pezzotti, Na Li, Marcel J. T. Rein-21

ders, Elmar Eisemann, Frits Koning,22

Anna Vilanova, and Boudewijn P. F.23

Lelieveldt. Visual analysis of mass cy-24

tometry data by hierarchical stochastic25

neighbour embedding reveals rare cell26

types. Nature Communications, 8(1):27

1740, November 2017. ISSN 2041-1723.28

doi: 10.1038/s41467-017-01689-9. URL29

https://www.nature.com/articles/30

s41467-017-01689-9.31

Catalina A Vallejos, John C Marioni, and32

Sylvia Richardson. BASiCS: Bayesian anal-33

ysis of Single-Cell sequencing data. PLoS34

Comput. Biol., 11(6):e1004333, June 2015.35

Trieu My Van and Christian U. Blank.36

A user’s perspective on GeoMxTM37

digital spatial profiling. Immuno-38

Oncology Technology, 1:11–18, July39

2019. ISSN 2590-0188, 2590-0188.40

doi: 10.1016/j.iotech.2019.05.001. URL 41

https://www.esmoiotech.org/article/ 42

S2590-0188(19)30002-4/abstract. 43

Koen van den Berge, Hector Roux de 44

Bezieux, Kelly Street, Wouter Saelens, Ro- 45

brecht Cannoodt, Yvan Saeys, Sandrine 46

Dudoit, and Lieven Clement. Trajectory- 47

based differential expression analysis for 48

single-cell sequencing data. bioRxiv, 49

page 623397, May 2019. doi: 10.1101/ 50

623397. URL https://www.biorxiv.org/ 51

content/10.1101/623397v1. 52

Dimitrios V Vavoulis, Margherita 53

Francescatto, Peter Heutink, and Ju- 54

lian Gough. DGEclust: differential 55

expression analysis of clustered count data. 56

Genome Biol., 16:39, February 2015. 57

A Verma and B Engelhardt. A robust nonlin- 58

ear low-dimensional manifold for single cell 59

RNA-seq data. bioRxiv, 2018. 60

Beate Vieth, Christoph Ziegenhain, Swati 61

Parekh, Wolfgang Enard, and Ines Hell- 62

mann. powsimr: power analysis for 63

bulk and single cell RNA-seq experiments. 64

Bioinformatics, 33(21):3486–3488, Novem- 65

ber 2017. 66

Irma Virant-Klun, Stefan Leicht, Christopher 67

Hughes, and Jeroen Krijgsveld. Identifi- 68

cation of Maturation-Specific Proteins by 69

Single-Cell Proteomics of Human Oocytes. 70

Molecular & cellular proteomics: MCP, 15 71

(8):2616–2627, 2016. ISSN 1535-9484. doi: 72

10.1074/mcp.M115.056887. 73

Sarah A. Vitak, Kristof A. Torkenczy, Jimi L. 74

Rosenkrantz, Andrew J. Fields, Lena 75

Christiansen, Melissa H. Wong, Lucia Car- 76

bone, Frank J. Steemers, and Andrew 77

Adey. Sequencing thousands of single-cell 78

genomes with combinatorial indexing. Na- 79

ture Methods, 14(3):302–308, March 2017. 80

70

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 71: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

ISSN 1548-7105. doi: 10.1038/nmeth.1

4154. URL https://www.nature.com/2

articles/nmeth.4154.3

Bartlomiej Waclaw, Ivana Bozic, Meredith E4

Pittman, Ralph H Hruban, Bert Vogelstein,5

and Martin A Nowak. A spatial model6

predicts that dispersal and cell turnover7

limit intratumour heterogeneity. Nature,8

525(7568):261–264, September 2015.9

Daniel E. Wagner, Caleb Weinreb, Zach M.10

Collins, James A. Briggs, Sean G. Megason,11

and Allon M. Klein. Single-cell mapping of12

gene expression landscapes and lineage in13

the zebrafish embryo. Science, 360(6392):14

981–987, June 2018a. ISSN 0036-8075,15

1095-9203. doi: 10.1126/science.aar4362.16

URL http://science.sciencemag.org/17

content/360/6392/981.18

F Wagner, D Barkley, and I Yanai. Accurate19

denoising of single-cell RNA-Seq data us-20

ing unbiased principal component analysis.21

bioRxiv, 2019.22

Florian Wagner and Itai Yanai. Moana: A23

robust and scalable cell type classifica-24

tion framework for single-cell RNA-Seq25

data. bioRxiv, page 456129, Octo-26

ber 2018. doi: 10.1101/456129. URL27

https://www.biorxiv.org/content/10.28

1101/456129v1.29

Florian Wagner, Yun Yan, and Itai30

Yanai. K-nearest neighbor smooth-31

ing for high-throughput single-cell32

RNA-Seq data. January 2018b. URL33

https://www.biorxiv.org/content/34

early/2018/01/24/217737?rss=1.35

Dongfang Wang and Jin Gu. VASC: Dimen-36

sion reduction and visualization of single-37

cell RNA-seq data by deep variational au-38

toencoder. Genomics Proteomics Bioinfor-39

matics, 16(5):320–331, October 2018.40

Jian Wang and Yuanlin Song. Single cell se- 41

quencing: a distinct new field. Clin. Transl. 42

Med., 6(1):10, December 2017. 43

Jingshu Wang, Divyansh Agarwal, 44

Mo Huang, Gang Hu, Zilu Zhou, Vin- 45

cent B Conley, Hugh MacMullan, and 46

Nancy R Zhang. Transfer learning in 47

single-cell transcriptomics improves data 48

denoising and pattern discovery. November 49

2018. 50

T Wang, T S Johnson, W Shao, J Zhang, and 51

K Huang. BURMUDA: A novel deep trans- 52

fer learning method for single-cell RNA se- 53

quencing batch correction reveals hidden 54

high-resolution cellular subtypes. bioRxiv, 55

2019. 56

Gregory P Way and Casey S Greene. Extract- 57

ing a biologically relevant latent space from 58

cancer transcriptomes with variational au- 59

toencoders. Pacific Symposium on Bio- 60

computing. Pacific Symposium on Biocom- 61

puting, 23:80–91, 2018. ISSN 2335-6936, 62

2335-6928. doi: 10.1142/9789813235533\ 63

_0008. URL https://www.ncbi.nlm. 64

nih.gov/pubmed/29218871. 65

Lukas M. Weber and Mark D. Robinson. 66

Comparison of clustering methods for 67

high-dimensional single-cell flow and mass 68

cytometry data. Cytometry Part A, 89 69

(12):1084–1096, December 2016. ISSN 70

1552-4922. doi: 10.1002/cyto.a.23030. 71

URL https://onlinelibrary.wiley. 72

com/doi/full/10.1002/cyto.a.23030. 73

Lukas M. Weber, Malgorzata Nowicka, Char- 74

lotte Soneson, and Mark D. Robin- 75

son. diffcyt: Differential discovery 76

in high-dimensional cytometry via high- 77

resolution clustering. bioRxiv, page 78

349738, November 2018. doi: 10.1101/ 79

349738. URL https://www.biorxiv.org/ 80

content/10.1101/349738v2. 81

71

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 72: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Lukas M. Weber, Wouter Saelens, Robrecht1

Cannoodt, Charlotte Soneson, Alexander2

Hapfelmeier, Paul P. Gardner, Anne-Laure3

Boulesteix, Yvan Saeys, and Mark D.4

Robinson. Essential guidelines for compu-5

tational method benchmarking. Genome6

Biology, 20(1):125, June 2019. ISSN 1474-7

760X. doi: 10.1186/s13059-019-1738-8.8

URL https://doi.org/10.1186/9

s13059-019-1738-8.10

Caleb Weinreb, Samuel Wolock, Betsabeh K.11

Tusi, Merav Socolovsky, and Allon M.12

Klein. Fundamental limits on dynamic in-13

ference from single-cell snapshots. Proceed-14

ings of the National Academy of Sciences,15

115(10):E2467–E2476, March 2018. ISSN16

0027-8424, 1091-6490. doi: 10.1073/pnas.17

1714723115. URL https://www.pnas.18

org/content/115/10/E2467.19

Joshua Welch, Velina Kozareva, Ashley Fer-20

reira, Charles Vanderburg, Carly Martin,21

and Evan Macosko. Integrative inference22

of brain cell similarities and differences23

from single-cell genomics. bioRxiv, page24

459891, November 2018. doi: 10.1101/25

459891. URL https://www.biorxiv.org/26

content/10.1101/459891v1.27

Joshua D Welch, Alexander J Hartemink, and28

Jan F Prins. MATCHER: manifold align-29

ment reveals correspondence between single30

cell transcriptome and epigenome dynam-31

ics. Genome Biol., 18(1):138, July 2017.32

F Alexander Wolf, Philipp Angerer, and33

Fabian J Theis. SCANPY: large-scale34

single-cell gene expression data analysis.35

Genome Biol., 19(1):15, February 2018.36

F. Alexander Wolf, Fiona K. Hamey,37

Mireya Plass, Jordi Solana, Joakim S.38

Dahlin, Berthold Göttgens, Nikolaus Ra-39

jewsky, Lukas Simon, and Fabian J.40

Theis. PAGA: graph abstraction recon- 41

ciles clustering with trajectory inference 42

through a topology preserving map of sin- 43

gle cells. Genome Biology, 20(1):59, March 44

2019. ISSN 1474-760X. doi: 10.1186/ 45

s13059-019-1663-x. URL https://doi. 46

org/10.1186/s13059-019-1663-x. 47

Larry Xi, Alexander Belyaev, Sandra Spur- 48

geon, Xiaohui Wang, Haibiao Gong, Robert 49

Aboukhalil, and Richard Fekete. New li- 50

brary construction method for single-cell 51

genomes. PLoS One, 12(7):e0181163, July 52

2017. 53

Li Charlie Xia, Dongmei Ai, Hojoon Lee, 54

Noemi Andor, Chao Li, Nancy R Zhang, 55

and Hanlee P Ji. SVEngine: an efficient 56

and versatile simulator of genome struc- 57

tural variations with features of cancer 58

clonal evolution. Gigascience, 7(7), July 59

2018. 60

Li Yang and P Charles Lin. Mechanisms that 61

drive inflammatory tumor microenviron- 62

ment, tumor heterogeneity, and metastatic 63

progression. Semin. Cancer Biol., 47:185– 64

195, December 2017. 65

Z Yang. Maximum likelihood phylogenetic es- 66

timation from DNA sequences with variable 67

rates over sites: approximate methods. J. 68

Mol. Evol., 39(3):306–314, September 1994. 69

Yinyin Yuan. Spatial heterogeneity in the tu- 70

mor microenvironment. Cold Spring Harb. 71

Perspect. Med., 6(8), August 2016. 72

Simone Zaccaria, Mohammed El-Kebir, Gun- 73

nar W. Klau, and Benjamin J. Raphael. 74

The Copy-Number Tree Mixture Deconvo- 75

lution Problem and Applications to Multi- 76

sample Bulk Sequencing Tumor Data. In 77

S. Cenk Sahinalp, editor, Research in 78

Computational Molecular Biology, Lecture 79

Notes in Computer Science, pages 318–335. 80

72

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 73: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

Springer International Publishing, 2017.1

ISBN 978-3-319-56970-3.2

H Zafar, N Navin, K Chen, and L Nakhleh.3

SiCloneFit: Bayesian inference of popula-4

tion structure, genotype, and phylogeny of5

tumor clones from single-cell genome se-6

quencing data. bioRxiv, 2018.7

Hamim Zafar, Yong Wang, Luay Nakhleh,8

Nicholas Navin, and Ken Chen. Mono-9

var: single-nucleotide variant detection in10

single cells. Nature Methods, 13(6):505–11

507, June 2016. ISSN 1548-7105. doi:12

10.1038/nmeth.3835. URL https://www.13

nature.com/articles/nmeth.3835.14

Hamim Zafar, Anthony Tzen, Nicholas Navin,15

Ken Chen, and Luay Nakhleh. SiFit: infer-16

ring tumor trees from single-cell sequenc-17

ing data under finite-sites models. Genome18

Biol., 18(1):178, September 2017.19

Hans Zahn, Adi Steif, Emma Laks, Peter20

Eirew, Michael VanInsberghe, Sohrab P21

Shah, Samuel Aparicio, and Carl L Hansen.22

Scalable whole-genome single-cell library23

preparation without preamplification. Nat.24

Methods, 14(2):167–173, February 2017a.25

Hans Zahn, Adi Steif, Emma Laks, Peter26

Eirew, Michael VanInsberghe, Sohrab P.27

Shah, Samuel Aparicio, and Carl L.28

Hansen. Scalable whole-genome single-29

cell library preparation without preampli-30

fication. Nature Methods, 14(2):167–173,31

February 2017b. ISSN 1548-7105. doi:32

10.1038/nmeth.4140. URL https://www.33

nature.com/articles/nmeth.4140.34

Luke Zappia, Belinda Phipson, and Alicia35

Oshlack. Splatter: simulation of single-cell36

RNA sequencing data. Genome Biol., 1837

(1):174, September 2017.38

Ron Zeira and Ron Shamir. Genome 39

Rearrangement Problems with Sin- 40

gle and Multiple Gene Copies : A 41

Review. Not clear where this was 42

initially published and whether it is 43

peer-reviewed., 2018. URL https: 44

//pdfs.semanticscholar.org/85e6/ 45

7eb03d1b3d004c60a12df08c1f937fbaa974. 46

pdf. 47

Amit Zeisel, Hannah Hochgerner, Peter 48

Lönnerberg, Anna Johnsson, Fatima 49

Memic, Job van der Zwan, Martin Häring, 50

Emelie Braun, Lars E. Borm, Gioele 51

La Manno, Simone Codeluppi, Alessan- 52

dro Furlan, Kawai Lee, Nathan Skene, 53

Kenneth D. Harris, Jens Hjerling-Leffler, 54

Ernest Arenas, Patrik Ernfors, Ulrika 55

Marklund, and Sten Linnarsson. Molec- 56

ular Architecture of the Mouse Nervous 57

System. Cell, 174(4):999–1014.e22, 58

August 2018. ISSN 0092-8674. doi: 59

10.1016/j.cell.2018.06.021. URL http: 60

//www.sciencedirect.com/science/ 61

article/pii/S009286741830789X. 62

Allen W. Zhang, Ciara O’Flanagan, Eliz- 63

abeth Chavez, Jamie LP Lim, Andrew 64

McPherson, Matt Wiens, Pascale Wal- 65

ters, Tim Chan, Brittany Hewitson, Daniel 66

Lai, Anja Mottok, Clementine Sarkozy, 67

Lauren Chong, Tomohiro Aoki, Xue- 68

hai Wang, Andrew P. Weng, Jessica N. 69

McAlpine, Samuel Aparicio, Christian 70

Steidl, Kieran R. Campbell, and Sohrab P. 71

Shah. Probabilistic cell type assignment 72

of single-cell transcriptomic data reveals 73

spatiotemporal microenvironment dynam- 74

ics in human cancers. bioRxiv, page 75

521914, January 2019a. doi: 10.1101/ 76

521914. URL https://www.biorxiv.org/ 77

content/10.1101/521914v1. 78

Chao Zhang. Single-Cell Data Analysis Us- 79

ing MMD Variational Autoencoder. April 80

73

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 74: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

2019. URL https://www.biorxiv.org/1

content/10.1101/613414v1.abstract.2

Huanan Zhang, Catherine A A Lee, Zhuliu Li,3

John R Garbe, Cindy R Eide, Raphael Pe-4

tegrosso, Rui Kuang, and Jakub Tolar. A5

multitask clustering approach for single-cell6

RNA-seq analysis in recessive dystrophic7

epidermolysis bullosa. PLoS Comput. Biol.,8

14(4):e1006053, April 2018.9

Jesse M. Zhang, Govinda M. Kamath,10

and David N. Tse. Valid post-clustering11

differential analysis for single-cell RNA-12

Seq. bioRxiv, page 463265, June13

2019b. doi: 10.1101/463265. URL14

https://www.biorxiv.org/content/10.15

1101/463265v3.16

Jianzhi Zhang, Rasmus Nielsen, and Ziheng17

Yang. Evaluation of an improved branch-18

site likelihood method for detecting posi-19

tive selection at the molecular level. Mol.20

Biol. Evol., 22(12):2472–2479, December21

2005.22

Jingsong Zhang, Jessica J. Cunningham,23

Joel S. Brown, and Robert A. Gatenby.24

Integrating evolutionary dynamics into25

treatment of metastatic castrate-resistant26

prostate cancer. Nature Communications, 827

(1):1816, November 2017. ISSN 2041-1723.28

doi: 10.1038/s41467-017-01968-5. URL29

https://www.nature.com/articles/30

s41467-017-01968-5.31

L. Zhang and S. Zhang. Comparison of com-32

putational methods for imputing single-33

cell RNA-sequencing data. IEEE/ACM34

Transactions on Computational Biology35

and Bioinformatics, pages 1–1, 2018. ISSN36

1545-5963. doi: 10.1109/TCBB.2018.37

2848633.38

L Zhang, X Cui, K Schmitt, R Hubert, W Na-39

vidi, and N Arnheim. Whole genome am-40

plification from a single cell: implications 41

for genetic analysis. Proc. Natl. Acad. Sci. 42

U. S. A., 89(13):5847–5851, July 1992. 43

Xiao-Fei Zhang, Le Ou-Yang, Shuo Yang, 44

Xing-Ming Zhao, Xiaohua Hu, and Hong 45

Yan. EnImpute: imputing dropout 46

events in single cell RNA sequencing 47

data via ensemble learning. Bioinfor- 48

matics, May 2019c. ISSN 1367-4803, 49

1367-4811. doi: 10.1093/bioinformatics/ 50

btz435. URL http://dx.doi.org/10. 51

1093/bioinformatics/btz435. 52

Xiaoyan Zhang, Sadie L Marjani, Zhaoyang 53

Hu, Sherman M Weissman, Xinghua Pan, 54

and Shixiu Wu. Single-Cell sequencing 55

for precise cancer research: Progress and 56

prospects. Cancer Res., 76(6):1305–1312, 57

March 2016. 58

Xiuwei Zhang, Chenling Xu, and Nir Yosef. 59

SymSim: simulating multi-faceted variabil- 60

ity in single cell RNA sequencing. bioRxiv, 61

page 378646, April 2019d. doi: 10.1101/ 62

378646. URL https://www.biorxiv.org/ 63

content/10.1101/378646v3. 64

Yifan Zhang and Feng Liu. Multidimensional 65

Single-Cell analyses in organ development 66

and maintenance. Trends Cell Biol., March 67

2019. 68

Grace X. Y. Zheng, Jessica M. Terry, Phillip 69

Belgrader, Paul Ryvkin, Zachary W. 70

Bent, Ryan Wilson, Solongo B. Ziraldo, 71

Tobias D. Wheeler, Geoff P. McDer- 72

mott, Junjie Zhu, Mark T. Gregory, Joe 73

Shuga, Luz Montesclaros, Jason G. Under- 74

wood, Donald A. Masquelier, Stefanie Y. 75

Nishimura, Michael Schnall-Levin, Paul W. 76

Wyatt, Christopher M. Hindson, Rajiv 77

Bharadwaj, Alexander Wong, Kevin D. 78

Ness, Lan W. Beppu, H. Joachim Deeg, 79

Christopher McFarland, Keith R. Loeb, 80

74

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019

Page 75: 12 Grand Challenges in Single-Cell Data Science · 1 6 Overarching challenges 35 2 6.1 Challenge XI: Integration of 3 single-cell data: across sam- 4 ples, experiments and types of

William J. Valente, Nolan G. Ericson,1

Emily A. Stevens, Jerald P. Radich, Tar-2

jei S. Mikkelsen, Benjamin J. Hindson,3

and Jason H. Bielas. Massively paral-4

lel digital transcriptional profiling of sin-5

gle cells. Nature Communications, 8:14049,6

January 2017. ISSN 2041-1723. doi: 10.7

1038/ncomms14049. URL https://www.8

nature.com/articles/ncomms14049.9

Lingxue Zhu, Jing Lei, Bernie Devlin, and10

Kathryn Roeder. A UNIFIED STATISTI-11

CAL FRAMEWORK FOR SINGLE CELL12

AND BULK RNA SEQUENCING DATA.13

The annals of applied statistics, 12(1):609–14

632, March 2018. ISSN 1932-6157. doi:15

10.1214/17-AOAS1110. URL http://dx.16

doi.org/10.1214/17-AOAS1110.17

Rapolas Zilionis, Juozas Nainys, Adrian18

Veres, Virginia Savova, David Zemmour,19

Allon M Klein, and Linas Mazutis. Single-20

cell barcoding and sequencing using droplet21

microfluidics. Nat. Protoc., 12(1):44–73,22

January 2017.23

Chenghang Zong, Sijia Lu, Alec R Chapman,24

and X Sunney Xie. Genome-wide detection25

of single-nucleotide and copy-number vari-26

ations of a single human cell. Science, 33827

(6114):1622–1626, December 2012.28

75

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.27885v3 | CC BY 4.0 Open Access | rec: 23 Aug 2019, publ: 23 Aug 2019