
MFGM-VAE for unpaired multi-view data integration

Roger Aylagas Torres

14th of December 2020

EPFL Supervisor: Gioele La Manno

ETSETB Supervisor: Veronica Vilaplana


Contents

1 Introduction
2 Previous Work
   2.1 State of the art
3 The Model
   3.1 The concept
   3.2 Previous Work
   3.3 Architecture definition
   3.4 Objective function
   3.5 Novelty
4 Visualization and working pipeline
5 Results
6 Conclusions


List of Figures

1 Single cell omics and its molecular levels
2 Multiple omics pipeline
3 Seurat integration workflow. Source: [3]
4 Multi-modal VAEs described in [14]
5 Gaussian Mixture Variational Autoencoder block with its neural network blocks (orange) and relevant latent variables (purple)
6 Architecture of the MGM-VAE. Dashed lines indicate shared weights.
7 Plots for visualizing learning curves.
8 From left to right: bar plot showing the percentage of cells of each dataset in every cluster, ground-truth mapping, total number of cells per cluster, and integration plot showing each dataset and the means of the probability distribution.
9 The same architecture with embedding sizes 5, 10, 20, 30, 50 and 100, represented in orange, dark blue, brown, light blue, magenta and green respectively
10 Convolutional architecture (grey) vs. linear architecture (green)
11 Mixing of the datasets is achieved, but the mixing is completely independent of the cell types


1 Introduction

Figure 1: Single cell omics and its molecular levels

Over the past decade, an ambition of the single-cell field has been to use high-throughput techniques to interrogate complex biological systems and understand how cells change and respond to different stimuli. Single-cell profiling has been used to construct comprehensive tissue atlases defining diverse cell profiles and marker genes, including for aging tissues and organoids. Rare cell types have been identified that play a role in genetic diseases such as cystic fibrosis or in the response to pathogens in the intestine. Similar analyses have uncovered greater nuance in multi-factorial disorders, particularly nervous system conditions such as schizophrenia and multiple sclerosis. Furthermore, single-cell technologies have revealed the complexity of developmental processes, gene regulatory networks, and cell fate commitment. Although these techniques can profile several thousand cells per sample, converting this wealth of data into biological insight is challenging and requires the application of dedicated statistical methods and modeling approaches.

Single-omics data describe a biological process at only one specific molecular level (Figure 1). For example, whole genome (or exome) sequencing detects single-nucleotide and structural variations (genetic level); ChIP-seq identifies transcription factor binding sites (protein–DNA interactomic level) and histone modifications (epigenomic level) across the entire human genome; DNase-seq detects open chromatin regions at transcription factor binding loci (epigenomic level); whole-genome bisulfite-seq makes it possible to build methylomes, which are key to understanding gene regulation, X chromosome inactivation and carcinogenesis (epigenetic level); RNA-seq can be used to capture gene expression levels and to discover alternative splicing, gene fusions and novel isoforms (transcriptomic level); microRNA (miRNA)-seq snapshots the expression of micro-RNAs that regulate mRNA translation (translation level); and protein arrays and mass spectrometers are useful to detect concentrations of proteins and metabolites (proteomic and metabolomic levels).

We use the term multi-view data to denote any kind of heterogeneous (or homogeneous) data that provide complementary information to characterize a biological object, phenomenon or system from various aspects (Figure 2). Such data may be of different types and come from different sources, follow different statistical distributions, possess different semantics, suffer from different levels of imprecision and contain different kinds of uncertainty.

We are interested in four types of multi-view data: (1) multi-view data with different groups of samples measured by the same feature set (also called multi-class data), (2) multi-view data with the same set of objects (samples) but several distinct feature sets, (3) multi-view data measuring the same set of objects with the same set of features in different conditions (representable as a three-way sample × feature × condition tensor), and (4) multi-view data with different features and different sample sets in the same phenomenon or system, which can be further transformed to multi-relational data. Type-2 and type-4 multi-view data are often referred to as multi-omics data.

The single-omics data enumerated above have the following characteristics: (1) high dimensionality, (2) redundancy, (3) highly correlated features and (4) non-negativity. In addition, multi-omics data have the following characteristics: (1) mutual complementarity, (2) causality and (3) heterogeneity.

In bioinformatics, there are five types of data-driven analyses where integrative machine learning methods are required. The first is the multi-class feature selection and classification problem: given multiple groups of objects measured using the same set of features, one is often interested in selecting the key features responsible for the separation of these groups. Second, integrating multi-omics data for the same set of labeled objects is expected to increase prediction (classification or regression) power, for example in the early detection of cancers based on multi-platform data. Third, in the same setting but without class labels, the task becomes an unsupervised learning problem of discovering novel groups of samples; tumor subtyping is such a commonly conducted analysis. Fourth, given multiple heterogeneous feature sets observed for the same sample or group of samples, the interactions among inter-view features can be crucial to understand the pathways of a phenotype. Last, given homogeneous and heterogeneous relations within and between multiple sets of biological entities from different molecular levels and clinical descriptions, inferring the relations between inter-set entities is known as an association study of a complex system.

Figure 2: Multiple omics pipeline


On the one hand, multi-view data provide us with an unprecedented opportunity to understand a complex biological system from different angles and levels (e.g. genotype–phenotype interactions and cancer studies) and to make precise data-driven predictions (e.g. drug response prediction). On the other hand, they pose a tough challenge for machine learning experts and data scientists: to wisely optimize the use of these data for specific needs.

Depending on when multi-view data are incorporated into the learning process, data fusion techniques can be classified as early, intermediate or late integration methods. In early integration methods, features from different views are concatenated into a single feature vector before fitting an unsupervised or supervised model. In late integration, separate models are first learned for the individual views, and their outputs are then combined to make the final determination. An intermediate strategy incorporates the integration step into the learning process itself.
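
To make the distinction concrete, the following minimal sketch (Python, scikit-learn) contrasts early and late integration on synthetic data; the views, estimator and variable names are illustrative assumptions, not part of this work.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_rna = rng.normal(size=(200, 50))   # view 1, e.g. expression features
X_img = rng.normal(size=(200, 30))   # view 2, e.g. imaging features
y = rng.integers(0, 2, size=200)     # labels

# Early integration: normalize each view, then concatenate the features.
X_early = np.hstack([StandardScaler().fit_transform(X_rna),
                     StandardScaler().fit_transform(X_img)])
early_clf = LogisticRegression(max_iter=1000).fit(X_early, y)

# Late integration: fit one model per view, then average their outputs.
clf_rna = LogisticRegression(max_iter=1000).fit(X_rna, y)
clf_img = LogisticRegression(max_iter=1000).fit(X_img, y)
p_late = (clf_rna.predict_proba(X_rna) + clf_img.predict_proba(X_img)) / 2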

Data integration is an urgent need in current and future bioinformatics, and in this context this work develops an algorithm using machine learning techniques to address most of the problems stated above. In Section 2 we review what has been done to solve these issues and the current state-of-the-art algorithms; in Section 3 we explain our method and its novelties; Section 4 describes the pipeline used for checking the goodness of the learning; and we present our results in Section 5.


2 Previous Work

Concatenating all features into a single vector seems to be the 'simplest' principle that brings all features together. Heterogeneous data contain much richer information than homogeneous data, but combining heterogeneous features is challenging; applying feature selection techniques to the concatenated data may improve performance.

The concatenated features require additional downstream processing that may lead to loss of key information. First, because multi-view data are observed in different forms that may include continuous features, discrete features, characters and even graphic data types, converting them into acceptable types (e.g. continuous to discrete, categorical to binary coding) is necessary for certain models such as hidden Markov models. Second, features from multiple views usually have different scales; particularly for discriminative models, it is sometimes necessary to normalize or standardize the combined features to reduce bias and speed up the training process. The feature concatenation strategy followed by normalization is commonly used in linear models such as SVMs and the LASSO. Moreover, feature concatenation is often unworkable with modern data that possess high dimensionality and rich structural information. For instance, converting medical text documents into a bag of words and combining it with vectorized image pixels would ignore the importance of language semantics and of local structures in images.

Bayesian Methods

In a given field, there may exist a priori knowledge associated with each class that is useful for a predictive model to better characterize the objects under investigation. From a Bayesian perspective, one can thus incorporate this prior knowledge into a learning model while using the features of interest as regular input features.

Bayesian methods are well known for their capability of incorporating various kinds of prior knowledge into predictive or exploratory models. However, it may be difficult to find useful information to serve as prior features. Furthermore, it is often hard to assume proper class-conditional distributions, especially for complex systems. In the case of many-class problems, finding a suitable class-conditional distribution for each individual class becomes unattainable in practice.

Bayesian networks (BNs) can naturally model multi-view data with mixed distributions for classification and feature-interaction identification purposes. As a typical probabilistic graphical model, a BN can be represented by {S, h}, where S denotes its graphical structure, whose nodes represent variables and whose directed edges indicate dependencies between pairs of variables, and h denotes the set of parameters of the variables' conditional distributions.

The dependency structure and the parameters of the conditional distributions can be learned from data. Given a learned BN, the values of unobserved variables (if any) can be inferred from (partially) observed data. The values of variables in a BN can be all discrete, all continuous or a mixture of the two; thus, variables of different types can be naturally integrated using a BN.

However, three obstacles challenge the application of BNs to data integration. First, searching for the optimal BN structure is an NP-complete problem. Second, the number of parameters may be much larger than the sample size. Third, inference in a BN is intractable. Together these imply that using a BN for high-dimensional data becomes exponentially difficult. Thus, heuristic structure learning algorithms and restrictions (or assumptions) on the model structure and conditional distributions are usually adopted to alleviate the curse of dimensionality.

Trees

Decision trees can be considered integrative models because a mixture of discrete and continuous features can be integrated simultaneously without the need to normalize them. Classification and regression trees are representatives of rule-based prediction models. When recursively building a classification tree, the feature (or subset of features) that best splits the classes, in terms of a scoring function, is selected to create a node. At each node, rules are established to branch out different classes downward. In contrast to black-box models, the learned hierarchy of rules (the tree) is readily interpretable.

Because each feature is used to learn decision rules, multi-view data of various data types (discrete, categorical and continuous) can be considered together without normalization. The values of continuous variables are partitioned into intervals of different lengths, so decision rules can be created for continuous variables of a variety of distributions without standardizing the input data.

However, decision trees are sensitive to noise and thus generalize poorly. Moreover, building a decision tree for high-dimensional data can consume an unaffordable amount of time.

The overfitting issue of decision trees can be overcome by collective intelligence, that is, ensemble learning, which builds a population of decision trees as weak learners to achieve state-of-the-art performance. Bagging and boosting are popular ensemble learning models: bagging simply combines the decisions of multiple weak learners, while boosting tweaks the weak learners to focus on hard examples.

Random forests address the above two challenges by randomly selecting features in the construction of trees. Although the randomness degrades interpretability, the importance of features can still be obtained by out-of-bag (OOB) randomization or the Gini index.

Kernel learning and metric learning

Only a kernel matrix, rather than the original features, is required as input to kernel-based classification (e.g. SVM), regression (e.g. support vector regression), clustering (e.g. spectral clustering) and feature extraction (e.g. sparse representation, SR) methods. Thus, the problem of data integration can be transformed into kernel integration in the sample space rather than in the heterogeneous feature space. Multiple kernel learning (MKL) is an intermediate integration technique that first computes kernel (or similarity) matrices separately for each data view, then combines these matrices to generate a final kernel matrix to be used in a kernel model.
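
As an illustration, the following sketch (Python, scikit-learn) combines per-view RBF kernels with fixed weights and feeds the result to a precomputed-kernel SVM; a full MKL method would learn the weights, and the data, weights and names here are assumptions.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
views = [rng.normal(size=(100, 40)), rng.normal(size=(100, 15))]
y = rng.integers(0, 2, size=100)

# One kernel (similarity) matrix per view: integration happens in sample space.
kernels = [rbf_kernel(X) for X in views]
weights = [0.5, 0.5]                     # MKL would optimize these weights
K = sum(w * Km for w, Km in zip(weights, kernels))

clf = SVC(kernel="precomputed").fit(K, y)   # kernel model on the fused matrix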

A key strength of kernel methods is that their optimization is independent of the number of features, a property known as being dimension-free. However, the large-scale optimization associated with a large sample size remains a major bottleneck; for example, an optimal MKL task can essentially be a semidefinite programming problem. Moreover, because a kernel method is a 'black box' with respect to identifying informative features and the integration occurs in sample space, MKL should not be considered when the goal is to identify feature interactions.

SimpleMKL-based multiple kernel learning was applied in [1] to predict cancer prognosis using gene expression data, DNA methylation data, miRNA expression data and copy number variations, showing significantly better performance than single-view methods.

Multi-modal deep learning

Multi-modal deep neural networks are another option to integrate multi-view data with heterogeneous feature sets, capture their high-order associations for prediction, and handle missing data. The basic idea is to assign a specific sub-network to each view, and then integrate the outputs of the individual sub-networks in higher layers. The sub-networks provide the flexibility of choosing an appropriate deep learning model for each data view, such as a deep belief net (DBN) or deep Boltzmann machine (DBM) for binary or Gaussian data, a convolutional neural network for image data, a recurrent neural network for sequential signals and deep feature selection (DFS) for choosing discriminative features.
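
A minimal PyTorch sketch of this idea follows: one sub-network per view, with the outputs fused in higher integrative layers. Layer sizes, view dimensions and names are assumptions for illustration.

import torch
import torch.nn as nn

class MultiModalNet(nn.Module):
    def __init__(self, view_dims=(50, 30), hidden=32, n_classes=2):
        super().__init__()
        # One sub-network per view; each could be swapped for a CNN, RNN, etc.
        self.subnets = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in view_dims)
        # Integrative layer over the concatenated view embeddings.
        self.head = nn.Linear(hidden * len(view_dims), n_classes)

    def forward(self, views):
        fused = torch.cat([f(x) for f, x in zip(self.subnets, views)], dim=1)
        return self.head(fused)

net = MultiModalNet()
logits = net([torch.randn(8, 50), torch.randn(8, 30)])   # two toy views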


A multi-modal DBN has emerged in bioinformatics to integrate gene expression, DNA methylation and drug response for tumor subtyping [2], showing superior performance to k-means clustering in terms of clinical discrepancy (survival time).

Multi-modal deep neural networks have five attractive properties for data integration. First, when learning model parameters, the sub-networks can be pretrained on different views separately, and the parameters of the entire net (including the integrative layers and sub-networks) can then be globally fine-tuned; this component-wise learning can significantly reduce the cost of computation. Second, the heterogeneous information from different views can be jointly considered in the integrative layers for inference, classification and clustering. Third, multi-modal networks can even learn on samples with missing views, which enables the maximal use of available data instead of merely using samples with complete views. Furthermore, a well-trained generative multi-modal network, such as a DBM, can be used to infer the profiles of missing views given other observed views from an individual, which is quite interesting, for instance, for predicting the impact of genetic variations and epigenetic changes on gene expression. Last but not least, the flexible and deep structure of multi-modal learning is appropriate for modeling complex systems, and thus has great potential to make full use of genomic data observed at various molecular levels.

Overfitting is a problem for deep networks owing to their complex structures and large number of parameters. In addition, how to precisely capture and explicitly interpret inter-view feature interactions remains an open problem.

2.1 State of the art

In the field of multi-omics integration, the state of the art is the Seurat integration algorithm ([3]). It aims to integrate two datasets, which take on the roles of reference and query. Its workflow (Figure 3) consists of a joint dimensionality reduction of the two datasets using diagonalized Canonical Correlation Analysis (CCA) followed by an L2 normalization. It then searches for Mutual Nearest Neighbours (MNN) in this shared low-dimensional representation. This step is similar to the one used in [4], where MNN was used to integrate scRNA-seq datasets.

They refer to the resulting cell pairs as anchors, as these encode the cellular relationships across datasets that form the basis for all subsequent integration analyses. Anchors can successfully recover matching cell states even in the presence of significant dataset differences, as CCA can effectively identify shared biological markers and conserved gene correlation patterns. Each anchor pair is assigned a score based on the shared overlap of the mutual neighborhoods of the two cells in the pair. High-scoring correspondences therefore represent cases where many similar cells in one dataset are predicted to correspond to the same group of similar cells in the second dataset, reflecting increased robustness in the association between the anchor cells.
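
The anchor-finding step reduces to mutual nearest neighbours in the shared space. A minimal sketch (Python, scikit-learn) follows; the value of k and the random embeddings are assumptions, and Seurat additionally scores and filters these pairs.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mnn_pairs(A, B, k=5):
    """Return (i, j) pairs where A[i] and B[j] are mutual k-nearest neighbours."""
    a_to_b = NearestNeighbors(n_neighbors=k).fit(B).kneighbors(A, return_distance=False)
    b_to_a = NearestNeighbors(n_neighbors=k).fit(A).kneighbors(B, return_distance=False)
    return [(i, j) for i in range(len(A)) for j in a_to_b[i] if i in b_to_a[j]]

rng = np.random.default_rng(0)
anchors = mnn_pairs(rng.normal(size=(100, 20)), rng.normal(size=(80, 20)))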

For each cell in the query dataset, they apply a transformation (correction vector) that represents a weighted average across multiple batch vectors. These weights are determined by two components: a cell similarity score, computed individually for each cell in the dataset, and the anchor score, computed once for each anchor. The cell similarity score is defined by the distance between each query cell and its k nearest anchors in principal-component space, prioritizing anchors that represent a similar biological state. Consequently, cells in the same local neighborhood will share similar correction vectors. By subtracting these weighted correction vectors from the query gene expression matrix, they compute a corrected query expression matrix that can then be combined with the original reference dataset and used as input for all integrated downstream analyses, including dimensionality reduction and clustering.

To extend this procedure to multiple datasets, they first construct a guide tree based on the similarity between all pairs of datasets and proceed with recursive pairwise correction up the tree. The similarity score used to construct the hierarchy is computed as the total number of anchors between a pair of datasets normalized by the total number of cells in the smaller dataset of the pair.

Figure 3: Seurat integration workflow. Source: [3]

Deep learning is gaining relevance in a wide variety of fields, and biology is no exception. A proof of this is the recent publication by DeepMind, a company owned by Google, in which the protein folding problem, unsolved for 50 years, was addressed with a deep learning method that predicts the 3D structure of a protein from its sequence.

With the aim of solving this multi-view unpaired dataset integration problem, few deep learning methods have been proposed, although several exist for single-omics analysis ([5] [6] [7]). In [8], a single-cell multi-modal Variational Autoencoder model, which combines three types of joint-learning strategies with a probabilistic Gaussian Mixture Model, is used to learn joint latent features that accurately represent multi-layer paired profiles. [9] is another relevant method, which removes batch effects from multiple scRNA-seq datasets using Variational Autoencoders and achieves high-performance integration with respect to other traditional methods.

All these methods have several weaknesses, which we try to address in this work, as explained in the following section.


3 The Model

3.1 The concept

The aim of the model is to integrate several datasets, a priori related, in a low-dimensional space so that the produced embeddings can be used for downstream analysis. The integration must satisfy four constraints in order to improve on the state of the art [3]: (1) the integration must be simultaneous, to avoid the accumulation of errors produced when integrating datasets in pairs; (2) it must avoid the roles of query and reference datasets, as these lead to different results when the datasets are interchanged; (3) it must deal with populations not present in all datasets, for more versatility; and (4) it must perform the integration with no preprocessing required, as we want to remove the need for a prior dimensionality reduction, one of the stated challenges.

3.2 Previous Work

In designing such a model, it is easy to see that its main purpose is to produce embeddings under constraints that will shape the architecture. Variational Autoencoders [10] are proven to achieve outstanding performance in generating embeddings of data as different as images, audio or graphs, among others. In [11], Variational Autoencoders were used to enhance the PSNR of low-rate image compression; in [12], they were used to learn a representation able to capture high-level semantic content from speech signals; and in [13], the architecture was used to learn semantically valid molecular graphs.

Despite that, there are several ways to use VAEs for multi-modal data. Usually they are used with paired data, which makes the design of the model easier and allows the designer more freedom. In [14], several architectures of multi-modal VAEs for paired data are described. It considers three types of architectures (Figure 4): the “Separate mode”, in which each dataset is forwarded by a separate VAE; the “Connective mode”, in which both datasets are concatenated together and processed by the same encoder and decoder, producing two sets of probability parameters so that each dataset has its own probability distribution; and, last but not least, the “Mixed mode”, in which the input datasets are first processed by two separate first-stage encoders to bring them to the same dimensional space and then concatenated and forwarded jointly to a second-stage encoder, also producing two sets of probability parameters. In that paper, the “Mixed mode” is claimed to be the best way of encoding paired datasets, as the relations between paired samples are exploited by the second-stage encoder to produce better embeddings; with unpaired datasets, however, the story is very different. The “Mixed mode” is no longer useful, as the model would rarely receive a correctly matched pair at the same time, making it very inefficient to learn relations between unrelated pairs. Instead, the “Separate mode” turns out to be the most useful one. In this kind of multi-modal VAE, parameters are not shared between datasets, and we can force the low-dimensional representations to be joint by adding a regularization term to the objective function or by sharing the same probability distribution.

In addition, the “Mixed mode” is very difficult to generalize to an arbitrary number of datasets, as the architecture would have to be modified; the “Separate mode”, on the other hand, generalizes naturally to any number of datasets and also to different kinds of data.

Another key choice when designing the model is the probability distribution to use in each VAE. Normally a multivariate Gaussian distribution is used, as it is very tractable and the Kullback-Leibler divergence (KLD), the standard regularization term in VAEs, is easy to compute: the encoder outputs the mean and standard deviation of the Gaussian distribution, and the KLD forces the latent mean and standard deviation to resemble a standard Gaussian. However, Gaussian Mixture Models have been used in other approaches ([15], [16]), as they turn out to be very useful when dealing with sparse data. In addition, using a GM probability distribution opens the possibility of clustering, making the probability distribution more meaningful and interpretable.
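
For reference, for a diagonal-Gaussian posterior N(µ, diag(σ²)) against a standard Gaussian prior, this regularizer has the standard closed form

D_KL(N(µ, diag(σ²)) ‖ N(0, I)) = 1/2 ∑_j (σ_j² + µ_j² − 1 − log σ_j²),

which is what makes the plain Gaussian VAE so convenient; no such closed form exists between two Gaussian Mixtures, a point we return to in Section 3.3.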


Figure 4: Multi-modal VAEs described in [14]

3.3 Architecture definition

We propose a Multiple Gaussian Mixture Variational AutoEncoder (MGM-VAE) for unpaired dataset integration and for the extraction of biological insights by interpreting the latent variables and the probability distribution. The embeddings produced by the model are intended to be used in downstream analyses, for example to improve RNA Velocity or to increase the precision of brain atlases ([17] [18]).

The model is composed of multiple GM-VAE blocks (Figure 5), one for each dataset to be integrated. These blocks consist of three core parts: an encoder, a sampler and a decoder.

As we are trying to produce biologically meaningful latent variables, we use an unsupervised probabilistic classifier as the encoder. The encoder uses convolutional layers to extract useful features from the sequence, which are then used by a linear layer to predict a belonging score for each of the cell types. The soft-classification probabilities are then the probabilities of a sample belonging to each cell type.
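
A minimal PyTorch sketch of such an encoder follows; the number of genes, layer sizes and kernel widths are assumptions for illustration, not the exact architecture used in the experiments.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_genes=2000, n_clusters=12):
        super().__init__()
        # Convolutional feature extractor over the expression sequence.
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.Flatten())
        with torch.no_grad():   # infer the flattened feature size
            flat = self.features(torch.zeros(1, 1, n_genes)).shape[1]
        self.classifier = nn.Linear(flat, n_clusters)

    def forward(self, x):
        # Soft-classification probabilities p(π|x), as in eq. (1).
        return torch.softmax(self.classifier(self.features(x)), dim=1)

pi = Encoder()(torch.randn(4, 1, 2000))   # (batch, n_clusters)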

Figure 5: Gaussian Mixture Variational Autoencoder block with its neural network blocks (orange) and relevant latent variables (purple)

We define the probability of the classification weights as stated in (1), where the probability of each sample belonging to a cluster is approximated by the probabilistic encoder.


p(π|X) = enc(X) (1)

The probability distribution chosen to represent the data in the Variational Autoencoders is a Mixture of Gaussians, as it works well with sparse data and should therefore be a good fit for multi-omics data like gene expression (scRNA-seq), chromatin accessibility (scATAC-seq) or DNA methylation (scBS-seq). Defining this layer requires setting the number of clusters, which determines the potential number of data groups and, more precisely in this case, the number of cell types. As we are trying to model the probability distribution of the joint datasets, we define the number of clusters as the number of unique cell types contained across all datasets. This allows us to model the probability distribution of the joint dataset while retaining flexibility for cell types that do not appear in all datasets.

The sampler is a neural network layer that contains the mean vectors and covariance matrices of the Gaussian Mixture as neural network parameters. It is important to remark that the parameters defining the probability density function are therefore not latent variables inferred by the encoder but are learned by back-propagation.

z_i = w_i ∗ (ε_i ∗ σ_k + µ_k) (2)

The sampler takes the soft-classification scores and uses the highest weight to sample from one Gaussian of the Mixture to produce the embedding. The sampling is performed as stated in (2) using the reparametrization trick, where ε ∼ N(0, I), index i indicates the sample from the dataset, and k is the cluster index that satisfies

k = argmax_c p(π_i = c | x_i)

and w_i = π_i(k), the highest probability over all clusters. σ_k and µ_k are the diagonal of the covariance matrix and the mean vector of the top-1-score cluster. We restrict the covariance matrix to be diagonal, as a higher-order diagonal covariance matrix can theoretically approximate any full covariance matrix in a GMM ([19]), and a diagonal covariance matrix speeds up learning thanks to its fewer parameters and efficient matrix inversion.

Figure 6: Architecture of the MGM-VAE. Dashed lines indicate shared weights.


Multiplying the sample by the top-1 weight is key for the model to be able to backpropagate gradients to the encoder, and thus to learn a classification from the reconstruction loss, since the gradients of the argmax function cannot be computed.
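
A minimal PyTorch sketch of this sampler follows; the dimensions are assumptions, and for numerical stability the log of σ is stored as the learnable parameter.

import torch
import torch.nn as nn

class GMSampler(nn.Module):
    def __init__(self, n_clusters=12, latent_dim=10):
        super().__init__()
        # GMM parameters learned by back-propagation, not inferred by the encoder.
        self.mu = nn.Parameter(torch.randn(n_clusters, latent_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_clusters, latent_dim))

    def forward(self, pi):                    # pi: (batch, n_clusters)
        k = pi.argmax(dim=1)                  # top-1 cluster index (no gradient)
        w = pi.gather(1, k.unsqueeze(1))      # top-1 weight: keeps the path differentiable
        eps = torch.randn_like(self.mu[k])    # ε ~ N(0, I), reparametrization trick
        return w * (eps * self.log_sigma[k].exp() + self.mu[k])   # eq. (2)

z = GMSampler()(torch.softmax(torch.randn(4, 12), dim=1))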

To implement individual samplers for each dataset while describing a unique probability distribution, we share weights between the samplers (Figure 6); in other words, the mean vectors and covariance matrices of all samplers are the same. This simplifies the loss function, as we do not have to compute any divergence regularizer to overlap several Gaussian Mixtures, which, besides adding complexity to the objective function, has no closed-form solution, although some efforts have been made to propose upper and lower bounds for its maximization or minimization ([20]).
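
In code, this sharing amounts to every per-dataset block holding a reference to the same sampler module, reusing the Encoder and GMSampler sketches above (the toy decoder and sizes are assumptions):

import torch.nn as nn

shared_sampler = GMSampler(n_clusters=12, latent_dim=10)
blocks = nn.ModuleList(
    nn.ModuleDict({
        "enc": Encoder(n_genes=2000, n_clusters=12),
        "dec": nn.Linear(10, 2000),        # toy decoder
        "sampler": shared_sampler,         # the same object in every block
    }) for _ in range(2))                  # one block per dataset

# One set of means/covariances describes the joint distribution:
assert blocks[0]["sampler"] is blocks[1]["sampler"]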

As a last step, the embedding is taken by the decoder, which tries to reconstruct the input sequence.

3.4 Objective function

The objective function (eq. 3) is defined as the sum of the reconstruction losses of the datasets to integrate plus an information maximization term for clustering the data.

L = ∑_{d=1}^{D} MSE(X_d, X′_d) − ∑_{d=1}^{D} λ_d (H(Π_d) − H(Π_d|X_d)) (3)

The reconstruction loss of each dataset is defined as the Mean Square Error (MSE) between the input data and the reconstruction. The MSE is chosen over other possible loss functions (e.g. the Cross Entropy loss) because the input data to be integrated may be of different natures: on the one hand we could have scRNA-seq, which consists of counts, so the data are integers with no upper limit; on the other hand we could have imaging, whose values lie in the range [0, 1]. Using the MSE therefore lets us keep the same loss function regardless of the nature of the input data.

We aim to classify the input sequences into cell types, and to do so we use the information maximization regularizer. It is composed of two terms: the marginal entropy of the probability scores and the conditional entropy. The marginal entropy ensures all clusters are used, so that not all decisions fall on the same cluster. The conditional entropy, on the other hand, ensures high confidence in the cluster assignments, which is important if we want cell types to be compact in the latent space.
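
A minimal PyTorch sketch of this objective follows; the λ values and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def info_max(pi, eps=1e-8):
    """Marginal entropy H(Π) minus mean conditional entropy H(Π|X)."""
    p_marg = pi.mean(dim=0)
    h_marg = -(p_marg * (p_marg + eps).log()).sum()
    h_cond = -(pi * (pi + eps).log()).sum(dim=1).mean()
    return h_marg - h_cond

def total_loss(xs, x_recs, pis, lams):
    # Eq. (3): sum of per-dataset MSEs minus the weighted IM regularizers.
    rec = sum(F.mse_loss(xr, x) for x, xr in zip(xs, x_recs))
    im = sum(lam * info_max(pi) for lam, pi in zip(lams, pis))
    return rec - im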

3.5 Novelty

This method would imply a breakthrough with respect to the current state of the art ([3]), as it allows multiple unpaired omics datasets (imaging, scRNA-seq, scATAC-seq, scBS-seq, etc.) to be integrated simultaneously in a low-dimensional space while characterizing that space with a probability distribution. In addition, it does not require any preprocessing step before integration, thereby solving one of the major challenges stated by the state-of-the-art method. Moreover, the method integrates multiple datasets at once, getting rid of the accumulation of errors produced by integrating datasets pairwise.

As the model is designed in an interpretable way, biological insights can be extracted directly from the sampling weights using Explainable Artificial Intelligence (XAI) techniques such as Gradient Class Activation Maps (Grad-CAMs [21]), which could be used to determine which genes characterize a cell type. Using these same weights, correlations between datasets or cell types can be found, giving the method extra interpretability.
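
As a hedged illustration of the XAI idea, the sketch below computes a plain gradient saliency map (simpler than Grad-CAM, but following the same principle): the gradient of one cluster's soft score with respect to the input indicates which genes drive that cell-type assignment. It reuses the Encoder sketch above.

import torch

def gene_saliency(encoder, x, cluster):
    x = x.clone().requires_grad_(True)
    score = encoder(x)[:, cluster].sum()    # soft score for one cell type
    score.backward()                        # gradients w.r.t. the input genes
    return x.grad.abs().squeeze()           # per-gene importance

# importance = gene_saliency(Encoder(), torch.randn(1, 1, 2000), cluster=3)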


4 Visualization and working pipeline

Figure 7: Plots for visualizing learning curves.

The designed model is clearly complex, as is the task we are trying to solve. Nevertheless, it is designed to be as interpretable as possible, which lets us check the goodness of the training process in several ways. With this aim, we have designed a control panel using TensorBoard in which the fitting curves (Figure 7) can be inspected to see whether the model is fitting the data correctly and whether the regularization terms, which tell the model how the reconstruction loss should be decreased, are met. These curves let us see whether the model is underfitting the reconstruction loss because of the regularization terms or simply because the capacity of the earlier layers is too low.
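
A minimal sketch of the logging behind such a panel follows; the tags and values are illustrative, not the exact ones used in our dashboard.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/mgm_vae")
for step in range(100):                      # one point per training step
    writer.add_scalar("loss/reconstruction_d1", 1.0 / (step + 1), step)
    writer.add_scalar("loss/reconstruction_d2", 1.2 / (step + 1), step)
    writer.add_scalar("regularizer/info_max", 0.5, step)
writer.close()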

In addition, we have designed another tab in which the embeddings, the cluster means and the soft-classification weights can be inspected (Figure 8). One of the most important plots shows the means of the clusters, which correspond to the means of the Gaussian Mixture representing the probability distribution of the joint dataset, together with the embeddings colored by dataset. This lets us check how good the integration is by verifying whether cells from all datasets appear in the same cluster. In the same vein, a bar plot shows the percentage of each dataset composing each cluster, to better interpret the integration of the datasets. The embeddings produced by the model can also be used to check the confidence of the classification, by plotting the signal of each cluster over all the embeddings. In addition, since we are assessing whether the model works correctly, we have added a ground-truth plot (not seen by the model) in which each embedding is visualized with its ground-truth label. This lets us assess whether the classification of the sequences is good and examine the variance of each cell-type cluster.


Figure 8: From left to right: bar plot showing the percentage of cells of each dataset in every cluster, ground-truth mapping, total number of cells per cluster, and integration plot showing each dataset and the means of the probability distribution.

5 Results

To perform the experiments we used two scRNA-seq datasets from two different fetal eyes containing 12 different cell types. The first experiments were done to see which bottleneck size was optimal for this task. Figure 9 shows the results of a simple neural network (two linear layers each for the encoder and the decoder) with different Gaussian Mixture dimensionalities (5, 10, 20, 30, 50 and 100). The best embedding size is clearly 10, as it gives the lowest reconstruction error and a good integration of the datasets, although the classification is not good, as with all the other bottleneck sizes.

(a) Information Maximization (b) Dataset 1 Reconstruction (c) Dataset 2 Reconstruction

Figure 9: The same architecture with embedding sizes 5, 10, 20, 30, 50 and 100, represented in orange, dark blue, brown, light blue, magenta and green respectively

Another experiment (Figure 10) was done to compare which architecture was a better fit for the task at hand. We compared an encoder with three convolutional layers combined with a linear classifier against an encoder with four linear layers.

The convolutional architecture clearly improves in all aspects, so we decided to follow this type of architecture in further experiments.

(a) Information Maximization (b) Dataset 1 Reconstruction (c) Dataset 2 Reconstruction

Figure 10: Convolutional architecture (grey) vs. linear architecture (green)

Several experiments have been performed with different architectures and different weights in the loss function, but no good classification has been achieved. Our last experiment (Figure 11) shows a good integration of the datasets but a very poor classification. We suppose this is because the neural network is not able to learn features that characterize each cell type, or because information maximization is not a good loss function for this classification task. Nevertheless, it shows that it is possible to represent the joint datasets with a Gaussian Mixture model.

(a) Dataset Integration (b) Ground Truth

Figure 11: Mixing of the datasets is achieved, but the mixing is completely independent of the cell types

Although we were able to draw some conclusions about what the architecture should look like and about the benefits and drawbacks of each decision, we were not able to obtain the results we wanted due to lack of time.


6 Conclusions

Multi-omics data integration is a challenging problem in bioinformatics that seeks to integrate unpaired data from different single-omics measurements. This work has proposed a new model to solve it using a deep learning architecture, as several works have tried before, but with a more ambitious purpose: generalizing the integration to every type of data. Although the model is not yet able to classify the input sequences, which is the cause of its poor performance, it is highly interpretable and promising as a potential breakthrough in the field. In addition, Variational Autoencoders seem the right choice for solving the problem, and using a Gaussian Mixture to model the probability distribution of the joint datasets could be the key to a good probabilistic representation of the embedding space. Further work will be done to make this model the state of the art in the field.


References

[1] Y. Zhang et al. “Improve Glioblastoma Multiforme Prognosis Prediction by Using Feature Selection and Multiple Kernel Learning”. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 13.5 (Sept. 2016), pp. 825–835. issn: 1557-9964. doi: 10.1109/TCBB.2016.2551745.

[2] M. Liang et al. “Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep Learning Approach”. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 12.4 (July 2015), pp. 928–937. issn: 1557-9964. doi: 10.1109/TCBB.2014.2377729.

[3] Tim Stuart et al. “Comprehensive Integration of Single-Cell Data”. In: Cell 177.7 (June 2019), 1888–1902.e21. issn: 0092-8674. doi: 10.1016/j.cell.2019.05.031. url: https://linkinghub.elsevier.com/retrieve/pii/S0092867419305598 (visited on 09/07/2020).

[4] Laleh Haghverdi et al. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors”. In: Nature Biotechnology 36.5 (May 2018), pp. 421–427. issn: 1087-0156, 1546-1696. doi: 10.1038/nbt.4091. url: http://www.nature.com/articles/nbt.4091 (visited on 09/07/2020).

[5] Lei Xiong et al. “SCALE method for single-cell ATAC-seq analysis via latent feature extraction”. In: Nature Communications 10.1 (Dec. 2019), p. 4576. issn: 2041-1723. doi: 10.1038/s41467-019-12630-7. url: http://www.nature.com/articles/s41467-019-12630-7 (visited on 09/15/2020).

[6] Dongfang Wang and Jin Gu. “VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder”. In: Genomics, Proteomics & Bioinformatics 16.5 (Oct. 2018), pp. 320–331. issn: 1672-0229. doi: 10.1016/j.gpb.2018.08.003. url: https://linkinghub.elsevier.com/retrieve/pii/S167202291830439X (visited on 09/15/2020).

[7] Juexin Wang et al. “scGNN: a novel graph neural network framework for single-cell RNA-Seq analyses”. Preprint. Bioinformatics, Aug. 2020. doi: 10.1101/2020.08.02.233569. url: http://biorxiv.org/lookup/doi/10.1101/2020.08.02.233569 (visited on 09/22/2020).

[8] Chunman Zuo and Luonan Chen. “Deep-joint-learning analysis model of single cell transcriptome and open chromatin accessibility data”. In: Briefings in Bioinformatics (Nov. 2020), bbaa287. issn: 1467-5463, 1477-4054. doi: 10.1093/bib/bbaa287. url: https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa287/5985290 (visited on 01/18/2021).

[9] Romain Lopez et al. “Deep generative modeling for single-cell transcriptomics”. In: Nature Methods 15.12 (Dec. 2018), pp. 1053–1058. issn: 1548-7105. doi: 10.1038/s41592-018-0229-2. url: https://www.nature.com/articles/s41592-018-0229-2 (visited on 09/25/2020).

[10] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: arXiv:1312.6114 [cs, stat] (May 2014). url: http://arxiv.org/abs/1312.6114 (visited on 10/16/2020).

[11] Lei Zhou et al. “Variational Autoencoder for Low Bit-rate Image Compression”.

[12] J. Chorowski et al. “Unsupervised Speech Representation Learning Using WaveNet Autoencoders”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.12 (Dec. 2019), pp. 2041–2053. issn: 2329-9304. doi: 10.1109/TASLP.2019.2938863.

[13] Tengfei Ma, Jie Chen, and Cao Xiao. “Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders”.


[14] Feiran Huang et al. “Multimodal Network Embedding via Attention based Multi-view Variational Autoencoder”. In: Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (ICMR ’18). New York, NY, USA: Association for Computing Machinery, June 2018, pp. 108–116. isbn: 978-1-4503-5046-4. doi: 10.1145/3206025.3206035. url: https://doi.org/10.1145/3206025.3206035 (visited on 09/22/2020).

[15] Nat Dilokthanakul et al. “Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders”. 2017.

[16] Linxiao Yang et al. “Deep Clustering by Gaussian Mixture Variational Autoencoders With Graph Embedding”. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, Oct. 2019, pp. 6439–6448. isbn: 978-1-72814-803-8. doi: 10.1109/ICCV.2019.00654. url: https://ieeexplore.ieee.org/document/9010011/ (visited on 09/24/2020).

[17] Gioele La Manno et al. “RNA velocity of single cells”. In: Nature 560.7719 (Aug. 2018), pp. 494–498. issn: 1476-4687. doi: 10.1038/s41586-018-0414-6. url: https://www.nature.com/articles/s41586-018-0414-6 (visited on 01/22/2021).

[18] Brain Map - brain-map.org. url: https://portal.brain-map.org/ (visited on 01/27/2021).

[19] Douglas Reynolds. “Gaussian Mixture Models”. In: Encyclopedia of Biometrics. Ed. by Stan Z. Li and Anil Jain. Boston, MA: Springer US, 2009, pp. 659–663. isbn: 978-0-387-73003-5. doi: 10.1007/978-0-387-73003-5_196. url: https://doi.org/10.1007/978-0-387-73003-5_196 (visited on 01/25/2021).

[20] J.-L. Durrieu, J.-Ph. Thiran, and F. Kelly. “Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian Mixture Models”. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Mar. 2012, pp. 4833–4836. doi: 10.1109/ICASSP.2012.6289001.

[21] R. R. Selvaraju et al. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization”. In: International Journal of Computer Vision (2019). doi: 10.1007/s11263-019-01228-7.
