
Proceedings of Machine Learning Research – Under Review:i–xix, 2021 Full Paper – MIDL 2021 submission

Multi-source Domain Adaptation Techniques for Mitigating Batch Effects: A Comparative Study

Rohan Panda1 [email protected]

Sunil Vasu Kalmady2,3 [email protected]

Russ Greiner3,4 [email protected]
1 Electrical & Electronics Engineering, Birla Institute of Technology and Science, Pilani, India
2 Canadian VIGOUR Centre, University of Alberta, Edmonton, Canada
3 Department of Computing Science, University of Alberta, Edmonton, Canada
4 Alberta Machine Intelligence Institute, Edmonton, Alberta, Canada

Editors: Under Review for MIDL 2021

Abstract

The past decade has seen an increasing number of applications of deep learning (DL) techniques in biomedical fields, especially in neuroimaging-based analysis. Such DL-based methods are generally data-intensive and require a large number of training instances, which might be infeasible to acquire from a single acquisition site, especially for data such as fMRI scans, due to the time and costs that they demand. We can attempt to address this issue by combining fMRI data from various sites, thereby creating a bigger heterogeneous dataset. Unfortunately, the inherent differences in the combined data, known as batch effects, often hamper learning a model. To mitigate this issue, techniques such as multi-source domain adaptation (MSDA) aim at learning an effective classification function that uses (learned) domain-invariant latent features. This paper analyzes and compares the performance of various popular MSDA methods (MDAN, DARN, MDMN, M3SDA) at predicting different labels (illness, age and sex) of images from several public rs-fMRI datasets: ABIDE I and ADHD-200. It also evaluates the impact of various conditions such as class imbalance and the number of sites, along with a comparison of the degree of adaptation achieved by each of the methods, thereby presenting the effectiveness of MSDA models in neuroimaging-based applications.

Keywords: Resting-state fMRI, multi-source domain adaptation, batch effects, deep learning, ADHD, ASD

1. Introduction

1.1. Motivation and Background

With recent developments in brain imaging technology, data in the form of functional Magnetic Resonance Imaging (fMRI), electroencephalography (EEG), and Magnetoencephalography (MEG) have become widely available, which can be helpful in conducting various diagnostic and predictive analyses. Owing to the spatio-temporal nature of fMRI data, which allows for extensive information extraction, there has been a steady rise in the applications of various deep learning (DL) strategies applied to fMRI data to classify or predict mental illnesses (e.g., Alzheimer's, ADHD, Schizophrenia, etc.), brain states (e.g., sleep stages, task-based activity, etc.) or patient demographics (e.g., age, gender, IQ, etc.).

© 2021 R. Panda, S.V. Kalmady & R. Greiner.


DL models are data-intensive in nature and tend to work better as we increase the size of the data available for training. However, owing to the difficulties related to the acquisition of fMRI data, building a large dataset is often infeasible, expensive and time-consuming.

A general workaround is to build a large dataset by combining data from various acquisition sites for a particular research task. This, however, introduces a new problem: because the data are collected from multiple sites, they may involve different acquisition methods, equipment, patient demographics, and methodologies. A model can be trained on a dataset that simply pools all of these instances without any modification, but doing so ignores these differences and may hamper the model's generalizability. The underlying cause of such variation is the difference in the probability distributions of the data and labels across sites, which is generally termed domain shift or batch effects (Dundar et al., 2007).

Recent works in various domains have focused on developing methods to mitigate such issues, including domain adaptation (DA) techniques, which aim to build a generalized model that can learn from the multiple given source sites to produce a model that performs reasonably well on a new, yet related, target site. Existing DA techniques take varied approaches depending on factors such as the number of source sites (single-source DA, multi-source DA), label availability in the target domain (unsupervised, semi-supervised, supervised DA), and the method of domain adaptation (discrepancy-based, adversarial, reconstruction-based). Which technique performs best can depend on the objective at hand and the type of datasets used. This motivates our work here: to study the existing DA methods and their performance when dealing with multi-site biomedical (in this case, resting-state fMRI) data.

1.2. Related Works

In the past decade, various new techniques have applied DL tools to fMRI data to develop predictive models for numerous objectives.

Many systems view raw fMRI data as a sequence of 3-dimensional volumes, motivating various techniques that use 3D convolutions to build models, such as using a 3D-CNN to predict ADHD from fMRI and structural MRI (Zou et al., 2017), extracting features using 3D convolutional autoencoders for mild traumatic brain injury recognition (Zhao et al., 2017), predicting Schizophrenia using a 3D-CNN, and developing a 2-channel 3D DNN for autism spectrum disorder (ASD) classification (Li et al., 2018b). Though such methods allow for maximal information extraction, the resulting deep models are computationally very expensive and generally infeasible. To mitigate this issue, functional connectivity matrices (Lynall et al., 2010) are popularly used and have been found to be a good replacement, making training computationally feasible and also providing a way to interpret the results. Some noteworthy results using FCMs include classification of Schizophrenia patients using various DL methods (Arbabshirani et al., 2013; Shen et al., 2010; Yan et al., 2017), and prediction of other illnesses such as attention deficit hyperactivity disorder (ADHD) (Riaz et al., 2020), Alzheimer's (Ju et al., 2017), ASD (Li et al., 2018a; Saeed et al., 2019) and Mild Cognitive Impairment (Chen et al., 2016). There have also been classifications of other brain states and conditions, such as suicidal behavior (Gosnell et al., 2019), chronic pain (Santana et al., 2019), migraine (Chong et al., 2017), and demographics such as age (Pruett Jr et al., 2015) or gender (Fan et al., 2020).

While the works mentioned above have shown impressive results, none addressed the issue of batch effects. However, a few recent methods have tried to deal with batch effects in different ways. Olivetti et al. (2012) was one of the first to investigate batch effects in rs-fMRI datasets (ADHD-200), using extremely randomized trees along with a dissimilarity representation. Vega and Greiner (2018) studied the impact of classical techniques such as covariates, z-score normalization, and whitening on batch effects. Wang et al. (2019) explored ways to use low-rank domain adaptation to reduce existing biases in a multi-site fMRI dataset. More recent approaches include transport-based joint distribution alignment (Zhang et al., 2020), federated learning (Li et al., 2020) and conditional autoencoders (Fader Networks) (Pominova et al., 2020).

It is therefore useful to have a comparative survey of the performance of various existing MSDA techniques applied to mitigate batch effects in multi-site fMRI datasets, in order to understand the benefits and limitations of DA approaches.

2. Domain Adaptation Methods

We define the common objective of MSDA techniques as follows: given a collection of labelled source-domain data $D_s = \{(x_s^i, y_s^i)\}_{i=1}^{N_s}$, $\forall s \in \{1, \ldots, S\}$, and a collection of unlabelled target-domain data $D_T = \{x_T^i\}_{i=1}^{N_T}$ (where $x_s^i, x_T^i \in X$ and $y_s^i \in Y$), the goal is to build a classifier that can use information from the source domains to help produce models that perform accurate classifications in the target domain (Zhao et al., 2020). For our experiments, we take one of the $S$ domains as the target domain (by convention, the domain indexed by $S$) and use the others as source domains ($s \in \{1, \ldots, S-1\}$). Generally, MSDA techniques employ different strategies for transforming the target-domain distribution into the source-domain distributions to tackle the issue of batch effects. We allow the marginal probability distributions $P_X$ to differ across domains, but require the conditional probability distributions $P_{Y|X}$ to remain the same. Below is a short introduction to the various methods used in our experiments.

Domain Adversarial Neural Networks (DANN): Considered one of the fundamental models in DA, DANN (Ajakan et al., 2014) is a single-source DA technique – the only single-source DA method included in the comparison. DANN's architecture is similar to Figure 1(b), except that all the sources' data are combined and used as a single large source.

Multi-source Domain Adversarial Networks (MDAN): MDAN can be seen as a natural extension of DANN to MSDA problems. Its feature extractor and label classifier are essentially the same as DANN's, but MDAN uses one domain adapter $M_{d_i}$ for each of the $S-1$ source domains. Zhao et al. (2018) introduce two versions of MDAN: hard-max and soft-max variants. We use just the soft-max variant, as it is shown to provide better generalization in (Zhao et al., 2018).

Domain AggRegation Networks (DARN): One of MSDA's main challenges is that it needs to include source sites based on the target site, in a way that minimizes negative transfer while preserving as many training instances as possible.


Figure 1: The training & evaluation pipelines used in this work. (a) represents the SRC model, used as the baseline. All other MSDA models can be generalised as following (b). Finally, to compare with the accuracy achieved using target-only data, the TAR model is shown in (c).

To tackle this issue, DARN dynamically selects source sites and gives the sites varying importance based on their label classification losses. This is made possible by solving the Lagrangian dual of the objective being optimized, using a binary search strategy.

Multi-Domain Matching Networks (MDMN): MDMN tackles MSDA by first projecting features into a shared feature space. By computing, and then using, a degree of similarity between the target and source sites, MDMN merges similar sites together to construct the shared feature space, while reducing negative transfer by keeping dissimilar sites distant. The model tackles this objective using a loss function based on the Wasserstein distance and a special training paradigm, as described in (Li et al., 2018c).

Moment Matching for MSDA (M3SDA): Unlike the previously discussed models, M3SDA aligns the target domain with the source domains while simultaneously aligning the source sites amongst themselves. Furthermore, it utilizes the feature distribution moments instead of the raw input features for adaptation, which provides a certain robustness and a statistical advantage in MSDA. Peng et al. (2019) introduce an extension of M3SDA, called M3SDA-β, which they demonstrate is more robust to overfitting and provides better generalization. We therefore use M3SDA-β to understand the model's performance on neuroimaging data.

Appendix A provides more information about each of these architectures.

3. Methodology

3.1. Datasets and Tasks

This study uses two different publicly available datasets for training and evaluation, selected on the basis of the number of total scans available, the number of sites of data acquisition, and their frequent usage in the research community.


The first dataset consists of rs-fMRI scans from the ABIDE I dataset (Craddock et al., 2013a), including 530 control instances (tagged as typical controls, TC) and 505 instances collected from subjects with Autism Spectrum Disorder (ASD), acquired from 17 different sites. Appendix B describes the phenotypic information and the pre-processing steps used for this dataset.

The ADHD-200 dataset is our second multi-site fMRI dataset; it has been compiled from 8 different sites and contains 1516 rs-fMRI scans in total: 842 scans from control subjects and 674 from subjects with Attention Deficit Hyperactivity Disorder (ADHD). Again, see Appendix B for further information.

For each dataset, we run three different binary classification tasks, using three different labels: (1) the respective mental illness (illness versus control samples); and two phenotypic labels, (2) sex and (3) age (old versus young, w.r.t. the global median calculated separately for each of the two datasets).

3.2. Functional Connectivity Matrix and Feature Extraction

Functional connectivity is defined as the temporal dependency of spatially-remote neurophysiological events (Van Den Heuvel and Pol, 2010). It computes the level of co-activation between two spatially separate regions of interest (ROIs) in the brain, based on the mean time series extracted from these ROIs. Each ROI is pre-defined using some atlas or template. Here, we use the Automatic Anatomical Labelling (AAL) atlas (Tzourio-Mazoyer et al., 2002), which partitions the brain into 116 different non-overlapping ROIs.

We then calculate the functional connectivity matrix (FCM) using Pearson's correlation coefficient between each pair of time series, which results in a $116 \times 116$ matrix. Since the diagonal of this matrix is redundant and the matrix is symmetric, the diagonal is dropped and the upper triangle of the matrix is flattened to produce a vector of size $\binom{116}{2}$ for each rs-fMRI scan, which is used as the input to the various models in this study.
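
As an illustration of the steps above, the following is a minimal NumPy sketch (not the authors' code) of computing the flattened FCM feature vector, assuming `ts` holds the mean time series of the 116 AAL ROIs for one scan:

import numpy as np

def fcm_features(ts):
    """ts: (timepoints x 116) array of mean ROI time series for one rs-fMRI scan."""
    fcm = np.corrcoef(ts.T)              # 116 x 116 Pearson correlation matrix
    iu = np.triu_indices_from(fcm, k=1)  # upper triangle, diagonal dropped
    return fcm[iu]                       # vector of length 116*115/2 = 6670

# e.g., features = fcm_features(np.random.randn(200, 116))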

3.3. Training and Testing Settings

The MSDA models require labeled data flowing in from multiple source domains, as well as a batch of unlabeled data from the target domain. To accommodate this, we first take a single site as the target domain and consider the remaining sites as the source domains. The target domain is then split using a stratified 10-fold cross-validation strategy, wherein a single fold is kept aside for testing while the remaining 9 folds are used (without their labels) to provide the unlabeled target-domain data required by the unsupervised MSDA methods. All data points from the source sites, along with their labels, are fed into the model during training. We repeat the training and testing for each fold and for each site, then report the average accuracies as the results. Figure 1 shows this pipeline.
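
A minimal sketch of this evaluation loop is given below, assuming a dictionary mapping each site to its (X, y) arrays and two hypothetical helpers, `train_msda` (which never sees the target labels) and `score`:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def leave_one_site_out(sites, train_msda, score):
    accs = []
    for target, (X_t, y_t) in sites.items():
        # all remaining sites act as labeled source domains
        sources = [(X, y) for name, (X, y) in sites.items() if name != target]
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        for train_idx, test_idx in skf.split(X_t, y_t):
            # 9 folds of unlabeled target data + all labeled source data
            model = train_msda(sources, X_t[train_idx])
            accs.append(score(model, X_t[test_idx], y_t[test_idx]))
    return float(np.mean(accs))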

To compare the performance of the MSDA models (Figure 1(b)), the SRC model (see Figure 1(a)) is used as the baseline. In this setting, data from all the source sites are combined and treated as one large dataset; that is, no target-site data is used. Additionally, the TAR model uses only the (labeled) target-site data (and no source sites), in a stratified 10-fold CV setting to maintain the class distribution in all folds. This model shows the baseline performance when only the target-site information is available.


Figure 2: Site-wise performance of MSDA and baseline models in illness classification for ADHD-200 (a) and ABIDE I (b). The height of each bar is the average of the 10-fold CV accuracies achieved when the particular site was chosen as the target site. To show how much better an MSDA model performs w.r.t. using only the target site data, and also w.r.t. using all source data without any domain adaptation, we subtract the TAR model accuracy from every model's original accuracy and show the SRC model's scores as a black target line.

4. Results

Mental Illness: Figure 2 and Figure 3 show that MSDA models produce classifiers that are more accurate than SRC and TAR. In the case of ABIDE I, we find a significant increase in accuracy from the baseline (SRC) to the MSDA models, with the highest increase for M3SDA (~5%). In the case of ADHD-200, the data was first balanced (see Appendix D) and then used for experimentation. In comparison to ABIDE, MSDA is only slightly more accurate than SRC (~1-4%). MDAN (76.21%) has the highest increase over the baseline (72.72%), while MDMN (71.54%) does not outperform the baseline accuracy. We noted the positive impact of balancing the data using oversampling for this dataset (Appendix D). This observation suggests that current MSDA models might be affected by the class balance of the data used.

Figure 2 provides a deeper look at the site-wise performance of the models. For ABIDE I, Figure 2(b) shows that the models generally performed better than TAR and SRC, with only a few exceptions. For site OHSU, no model was able to perform better than SRC, while at some sites such as SDSU, CMU and Stanford, a few models performed worse than TAR. For ADHD-200, Figure 2(a) shows that sites such as KKI and Pittsburgh have MSDA scores lower than TAR; however, at both of these sites, MDAN is able to perform better than TAR. M3SDA has a higher accuracy at most (around 10 out of 17) of the sites in ABIDE I, while in ADHD-200, MDAN seems to perform better with some consistency. Furthermore, the increase in accuracy using MSDA models differs between ABIDE I (16-20%) and ADHD-200 (8-10%).

Figure 3: Average accuracies of all models across all sites. (a), (b) and (c) depict the accuracies for illness, age and sex for ABIDE I, respectively, while (d) and (e) are for illness and sex for the ADHD-200 dataset. * (p<0.05) and ** (p<0.01) denote the statistical significance of the difference between the MSDA methods and the SRC model, each calculated using a paired t-test.

Age: Figure 3(b) shows that applying DA models does not increase the classification performance over the baseline results. The baseline and MSDA models both have an accuracy of around 86-88%. To explore whether this was due to class imbalance, we applied a strategy similar to the one in Appendix D; however, that did not improve the results. It could be that demographic information such as age is domain invariant, which would explain why MSDA models did not improve on the baseline performance.

Sex: Figure 3(c) presents the accuracy scores of the experiments on the ABIDE I data. While only M3SDA showed a significant increase for ABIDE I, in ADHD-200 all the DA models except MDMN scored significantly better than the baseline. While MDAN performs best in ADHD-200, it is not as accurate on ABIDE I; moreover, we see that M3SDA is consistently accurate on both datasets.

The previous results show that the MSDA models perform better than simply combining all the source data and using it without any adaptation. To explore how well the models harmonize the source sites, we ran experiments on the MSDA models' ability to make features site-invariant. Figure 4 reports the results of a two-layered fully-connected network that was trained and tested to classify the sites from the input latent features in a 10-fold CV setting. We found that, in ABIDE I, the generalization across sites seems to be better than in the case of ADHD-200. We observe that, though the MSDA methods have lower accuracy in site classification, it is still greater than chance (1/17 for ABIDE I and 1/8 for ADHD-200). This is owing to the trade-off between harmonizing sites and retaining discriminatory information for classification that each model must tackle. Since each model tries to achieve this balance in a different way, some residual site information remains in the features produced by each of the MSDA techniques. Nevertheless, all MSDA methods produced latent features from which it was difficult for the neural net to distinguish which site an instance came from. This ability to make features site-invariant is hypothesized to be the driving force behind the improvement over baseline performance.

Figure 4: Average 10-fold CV accuracies of site classification using the latent features learnt by the various models, for ABIDE I (a) and ADHD-200 (b).
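
The site-classification probe described above can be sketched as follows (an assumed setup; the hidden-layer width is illustrative), where `latent` holds the features produced by a trained MSDA model and `site_labels` the acquisition-site labels:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def site_probe_accuracy(latent, site_labels):
    # two fully-connected layers: one hidden layer plus the softmax output layer
    probe = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
    return float(np.mean(cross_val_score(probe, latent, site_labels, cv=10)))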

5. Discussion and Conclusion

This paper analyses the performance of various existing MSDA models at different classification objectives using rs-fMRI data, drawn from the popular public datasets ABIDE I and ADHD-200. We used FCMs as the representative feature vectors upon which the models were trained and evaluated. Section 4 shows that the MSDA methods are successful in producing site-invariant latent features for the data, which in turn helps improve classification accuracy. However, note that such methods are sensitive to the class distribution present in the data. While fixed hyperparameters provided useful insights into MSDA performance, an extensive hyperparameter tuning of the models was not included in this study owing to computational constraints. Furthermore, we found that some learning objectives (e.g., age) were domain-invariant or unaffected by MSDA architectures; however, this was not exhaustively tested due to data limitations. Based on the experiments conducted, we observe that M3SDA consistently performed well across datasets and labels and was less prone to class imbalance. Models such as DARN, MDMN and MDAN performed better on the larger dataset (ABIDE I) and were sensitive to class imbalance; nevertheless, they performed significantly better when classes were balanced using simple sampling techniques. In general, we see that MDAN and M3SDA improved the performance w.r.t. the baseline accuracies by a bigger margin than the others for the majority of the classifications. Based on these results, MSDA techniques appear beneficial for improving the performance of DL techniques in neuroimaging-based applications.


References

Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, and Mario Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.

Mohammad Reza Arbabshirani, Kent Kiehl, Godfrey Pearlson, and Vince D Calhoun. Classification of schizophrenia patients based on resting-state functional network connectivity. Frontiers in Neuroscience, 7:133, 2013.

Pierre Bellec, Carlton Chu, Francois Chouinard-Decorte, Yassine Benhajali, Daniel S Margulies, and R Cameron Craddock. The Neuro Bureau ADHD-200 preprocessed repository. NeuroImage, 144:275–286, 2017.

Xiaobo Chen, Han Zhang, Yue Gao, Chong-Yaw Wee, Gang Li, Dinggang Shen, and Alzheimer's Disease Neuroimaging Initiative. High-order resting-state functional connectivity network for MCI classification. Human Brain Mapping, 37(9):3282–3296, 2016.

Catherine D Chong, Nathan Gaw, Yinlin Fu, Jing Li, Teresa Wu, and Todd J Schwedt. Migraine classification using magnetic resonance imaging resting-state functional connectivity data. Cephalalgia, 37(9):828–844, 2017.

Cameron Craddock, Yassine Benhajali, Carlton Chu, Francois Chouinard, Alan Evans, Andras Jakab, Budhachandra Singh Khundrakpam, John David Lewis, Qingyang Li, Michael Milham, et al. The Neuro Bureau preprocessing initiative: open sharing of preprocessed neuroimaging data and derivatives. Neuroinformatics, 4, 2013a.

Cameron Craddock, Sharad Sikka, Brian Cheung, Ranjeet Khanuja, Satrajit S Ghosh, Chaogan Yan, Qingyang Li, Daniel Lurie, Joshua Vogelstein, Randal Burns, et al. Towards automated analysis of connectomes: The configurable pipeline for the analysis of connectomes (C-PAC). Front Neuroinform, 42:10–3389, 2013b.

Murat Dundar, Balaji Krishnapuram, Jinbo Bi, and R Bharat Rao. Learning classifiers when the training data is not IID. In IJCAI, volume 2007, pages 756–61, 2007.

Liangwei Fan, Jianpo Su, Jian Qin, Dewen Hu, and Hui Shen. A deep network model on dynamic functional connectivity with applications to gender classification and intelligence prediction. Frontiers in Neuroscience, 14:881, 2020.

SN Gosnell, JC Fowler, and R Salas. Classifying suicidal behavior with resting-state functional connectivity and structural neuroimaging. Acta Psychiatrica Scandinavica, 140(1):20–29, 2019.

Magdiel Jimenez-Guarneros and Pilar Gomez-Gil. A study of the effects of negative transfer on deep unsupervised domain adaptation methods. Expert Systems with Applications, page 114088, 2020.

Ronghui Ju, Chenhui Hu, Quanzheng Li, et al. Early diagnosis of Alzheimer's disease based on resting-state brain networks and deep learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(1):244–257, 2017.


Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Hailong Li, Nehal A Parikh, and Lili He. A novel transfer learning approach to enhance deep neural network classification of brain functional connectomes. Frontiers in Neuroscience, 12:491, 2018a.

Xiaoxiao Li, Nicha C Dvornek, Xenophon Papademetris, Juntang Zhuang, Lawrence H Staib, Pamela Ventola, and James S Duncan. 2-channel convolutional 3D deep neural network (2CC3D) for fMRI analysis: ASD classification and feature learning. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 1252–1255. IEEE, 2018b.

Xiaoxiao Li, Yufeng Gu, Nicha Dvornek, Lawrence Staib, Pamela Ventola, and James S Duncan. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results, 2020.

Yitong Li, David E Carlson, et al. Extracting relationships by multi-domain matching. In Advances in Neural Information Processing Systems, pages 6798–6809, 2018c.

Mary-Ellen Lynall, Danielle S Bassett, Robert Kerwin, Peter J McKenna, Manfred Kitzbichler, Ulrich Muller, and Ed Bullmore. Functional connectivity and brain networks in schizophrenia. Journal of Neuroscience, 30(28):9477–9487, 2010.

Emanuele Olivetti, Susanne Greiner, and Paolo Avesani. ADHD diagnosis from multiple data sources with batch effects. Frontiers in Systems Neuroscience, 6:70, 2012.

Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1406–1415, 2019.

Marina Pominova, Ekaterina Kondrateva, Maxim Sharaev, Alexander Bernstein, and Evgeny Burnaev. Fader networks for domain adaptation on fMRI: ABIDE-II study. arXiv preprint arXiv:2010.07233, 2020.

John R Pruett Jr, Sridhar Kandala, Sarah Hoertel, Abraham Z Snyder, Jed T Elison, Tomoyuki Nishino, Eric Feczko, Nico UF Dosenbach, Binyam Nardos, Jonathan D Power, et al. Accurate age classification of 6 and 12 month-old infants based on resting-state functional connectivity magnetic resonance imaging data. Developmental Cognitive Neuroscience, 12:123–133, 2015.

Atif Riaz, Muhammad Asad, Eduardo Alonso, and Greg Slabaugh. DeepFMRI: End-to-end deep learning for functional connectivity and classification of ADHD using fMRI. Journal of Neuroscience Methods, 335:108506, 2020.

Fahad Saeed, Taban Eslami, Vahid Mirjalili, Alvis Fong, and Angela Laird. ASD-DiagNet: A hybrid learning approach for detection of autism spectrum disorder using fMRI data. Frontiers in Neuroinformatics, 13:70, 2019.


Alex Novaes Santana, Ignacio Cifre, Charles Novaes De Santana, and Pedro Montoya. Using deep learning and resting-state fMRI to classify chronic pain conditions. Frontiers in Neuroscience, 13:1313, 2019.

Hui Shen, Lubin Wang, Yadong Liu, and Dewen Hu. Discriminative analysis of resting-state functional connectivity patterns of schizophrenia using low dimensional embedding of fMRI. NeuroImage, 49(4):3110–3121, 2010.

Nathalie Tzourio-Mazoyer, Brigitte Landeau, Dimitri Papathanassiou, Fabrice Crivello, Olivier Etard, Nicolas Delcroix, Bernard Mazoyer, and Marc Joliot. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 15(1):273–289, 2002.

Martijn P Van Den Heuvel and Hilleke E Hulshoff Pol. Exploring the brain network: a review on resting-state fMRI functional connectivity. European Neuropsychopharmacology, 20(8):519–534, 2010.

Roberto Vega and Russ Greiner. Finding effective ways to (machine) learn fMRI-based classifiers from multi-site data. In Understanding and Interpreting Machine Learning in Medical Image Computing Applications, pages 32–39. Springer, 2018.

Mingliang Wang, Daoqiang Zhang, Jiashuang Huang, Pew-Thian Yap, Dinggang Shen, and Mingxia Liu. Identifying autism spectrum disorder with multi-site fMRI via low-rank domain adaptation. IEEE Transactions on Medical Imaging, 39(3):644–655, 2019.

Junfeng Wen, Russell Greiner, and Dale Schuurmans. Domain aggregation networks for multi-source domain adaptation. In International Conference on Machine Learning, pages 10214–10224. PMLR, 2020.

Weizheng Yan, Sergey Plis, Vince D Calhoun, Shengfeng Liu, Rongtao Jiang, Tian-Zi Jiang, and Jing Sui. Discriminating schizophrenia from normal controls using resting state functional network connectivity: A deep neural network and layer-wise relevance propagation method. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2017.

Junyi Zhang, Peng Wan, and Daoqiang Zhang. Transport-based joint distribution alignment for multi-site autism spectrum disorder diagnosis using resting-state fMRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 444–453. Springer, 2020.

Han Zhao, Shanghang Zhang, Guanhang Wu, Jose MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in Neural Information Processing Systems, 31:8559–8570, 2018.

Sicheng Zhao, Bo Li, Pengfei Xu, and Kurt Keutzer. Multi-source domain adaptation in the deep learning era: A systematic survey. arXiv preprint arXiv:2002.12169, 2020.

Yu Zhao, Qinglin Dong, Hanbo Chen, Armin Iraji, Yujie Li, Milad Makkie, Zhifeng Kou, and Tianming Liu. Constructing fine-granularity functional brain network atlases via deep convolutional autoencoder. Medical Image Analysis, 42:200–211, 2017.


Liang Zou, Jiannan Zheng, Chunyan Miao, Martin J Mckeown, and Z Jane Wang. 3D CNN based automatic diagnosis of attention deficit hyperactivity disorder using functional and structural MRI. IEEE Access, 5:23626–23636, 2017.

Appendix A. Domain Adaptation Methods

In the following sub-sections, a brief description of the underlying mechanisms in each of the MSDA techniques used in this study is given.

A.0.1. Domain Adversarial Neural Networks (DANN)

Unlike the rest of the methods used in this study, DANN (Ajakan et al., 2014) is a single-source domain adaptation technique that has been included to compare its performance with the other methods. DANN aims to learn features which help in accurately classifying data points based on their labels while making sure that the features are domain-agnostic. The architecture of the model used for adversarial training consists of three components. The first is the feature extractor $M_f$, which converts the input features to a latent representation (usually of lower dimension). The model then bifurcates into two branches, each of which is fed these latent features. The first branch is the class predictor, $M_c$, which predicts the class to which the sample belongs, whereas the second branch, denoted by $M_d$, is used for predicting the domain of the sample. A gradient reversal layer (GRL), $G_\lambda(\cdot)$, is attached at the start of $M_d$ to train the model in an adversarial fashion. The loss function which is minimized is given by:

$E(\theta_f, \theta_d, \theta_c) = \sum_{s \in [1, S-1]} \sum_{i=1}^{N_s} L_c(M_c(M_f(x_s^i; \theta_f); \theta_c), y_s^i) + \sum_{k \in [1, S]} \sum_{i=1}^{N_k} L_d(M_d(G_\lambda(M_f(x_k^i; \theta_f)); \theta_d), y_k^i)$    (1)

$L_c(\hat{y}^i, y^i) = -y^i \log(\hat{y}^i)$    (2)

$L_d(\hat{y}^i, y^i) = -\left( y^i \log(\hat{y}^i) + (1 - y^i)\log(1 - \hat{y}^i) \right)$    (3)

$G_\lambda(X) = X, \qquad \frac{dG_\lambda}{dX} = -\lambda I$    (4)
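
A minimal PyTorch sketch of the gradient reversal layer $G_\lambda$ in Eq. (4) is shown below (an illustration, not the authors' implementation): the forward pass is the identity and the backward pass scales the gradient by $-\lambda$.

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)          # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reversed, scaled gradient; no gradient for lam

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# e.g., domain_logits = M_d(grad_reverse(M_f(x), lam))   # M_f, M_d as in Eq. (1)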

A.0.2. Multisource Domain Adversarial Networks (MDAN)

MDAN can be seen as a logical extension of DANN wherein multiple source domains are used instead of a single source (Zhao et al., 2018). Apart from $M_f$ and $M_c$, the architecture contains one domain classifier $M_{d_i}$ for each of the $S-1$ source domains. The soft-max version of MDAN, which utilizes the log-sum-exp function to obtain a smoother approximation of the max function used in adversarial training, was used in this work, as it was shown to produce better and more computationally efficient results (Zhao et al., 2018). The model trains to minimize the following loss function:


$E(\theta_f, \theta_D, \theta_c) = \frac{1}{\gamma} \log \sum_{s \in [1, S-1]} \exp\left(\gamma \left(L_c^s + L_D^{s,S}\right)\right)$    (5)

$\alpha_s = \frac{L_c^s}{\sum_{s' \in [1, S-1]} L_c^{s'}}$    (6)

where $L_c^s$ is the cross-entropy loss on label classification for the data from source site $s$, and $L_D^{s,S}$ is the domain discrimination loss for data from source site $s$ and target site $S$, similar to the losses defined in Eq. (1). The values of $\alpha_s$ are derived dynamically during the training process from each site's losses, helping the learning proceed smoothly.
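
For illustration, the soft-max aggregation in Eq. (5) reduces to a log-sum-exp over the per-source losses; a minimal PyTorch sketch (assuming each loss is already a scalar tensor) is:

import torch

def mdan_softmax_loss(label_losses, domain_losses, gamma=1.0):
    # label_losses[s] = L_c^s, domain_losses[s] = L_D^{s,S} for each source s
    per_source = torch.stack([lc + ld for lc, ld in zip(label_losses, domain_losses)])
    return torch.logsumexp(gamma * per_source, dim=0) / gamma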

A.0.3. Domain AggRegation Networks (DARN)

One of the main challenges in domain adaptation arises when combining data from different sources: we need to select the domains which closely resemble the target domain at hand while excluding domains that are dissimilar. While the inclusion of more domains provides the model with more data to train on, utilizing domains that are very different from the target domain leads to negative transfer (Jimenez-Guarneros and Gomez-Gil, 2020). DARN (Wen et al., 2020) aims at dynamically selecting and combining sites during the training phase to find the optimal point in the trade-off between increasing sample size and decreasing negative transfer. DARN comprises a feature extractor $M_f$, a label classifier $M_c$, and a domain classifier $M_{d_s}$ for each source domain. To define the objective function of the model, the per-source model losses are first defined by:

$l_s(\theta_f, \theta_c, \theta_d) = \sum_{(x^i, y^i) \in D_s} L_c(M_c(M_f(x^i; \theta_f); \theta_c), y^i) + \sum_{(x^i, d^i) \in D_s \cup D_S} L_d(M_{d_s}(M_f(x^i; \theta_f); \theta_d), d^i)$    (7)

Consequently, the collection of these losses over all sources is $\mathbf{L} = [l_1, l_2, \cdots, l_{S-1}]^\top$. A temperature parameter $\tau$ is incorporated, with the final objective defined by:

$\min_{\alpha \in \Delta} \; -\langle \mathbf{z}, \alpha \rangle + \|\alpha\|^2$    (8)

where $\Delta$ is the $(S-1)$-dimensional probability simplex and $\mathbf{z} = \mathbf{L}/\tau$. To solve for the optimal $\alpha$ values, the Lagrangian dual of the above equation, given by $-\mathbf{z}^\top\alpha + \|\alpha\|^2 - \lambda^\top\alpha + \nu(\mathbf{1}^\top\alpha - 1)$ for $\nu \in \mathbb{R}, \lambda \geq 0$, is used. This gives the optimal values $\alpha^*$ as:

$\alpha^* = \frac{[\mathbf{z} - \nu^*\mathbf{1}]_+}{\|[\mathbf{z} - \nu^*\mathbf{1}]_+\|_1}$    (9)

where $\nu^*$ is found using a binary search over $[\min(\mathbf{z}) - 1, \max(\mathbf{z})]$. Thus, the importance of each source domain is dynamic and keeps changing throughout the training phase, finding the optimal sources and their contributions to developing the features for the target domain.
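
A minimal NumPy sketch of this weighting step, under our reading of Eqs. (8)-(9) (stationarity of the stated objective over the simplex implies the positive parts $[\mathbf{z} - \nu^*\mathbf{1}]_+$ sum to 2, so $\nu^*$ can be found by bisection), is:

import numpy as np

def darn_alpha(z, iters=50):
    """z = L / tau: per-source losses scaled by the temperature (length S-1 >= 2)."""
    lo, hi = z.min() - 1.0, z.max()
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        # the sum of positive parts decreases in nu; the target value is 2
        if np.maximum(z - nu, 0.0).sum() > 2.0:
            lo = nu
        else:
            hi = nu
    pos = np.maximum(z - 0.5 * (lo + hi), 0.0)
    return pos / pos.sum()   # Eq. (9): normalized positive parts

# e.g., darn_alpha(np.array([0.9, 0.4, 1.3]))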


A.0.4. Multi-Domain Matching Networks (MDMN)

MDMN also works on the concept of developing a shared feature space; the additional step included to improve classification performance is based on matching the feature-space distributions of the source domains among themselves, as well as matching this common feature space to the target domain. The idea is to use a domain adapter which finds the degree of similarity between all the domains, so that the strength of similarity with the target domain is shared among all the similar source domains (Li et al., 2018c). The method allows similar domains to merge together while keeping dissimilar domains apart to reduce negative transfer. This is achieved by imposing a Wasserstein distance-based loss function which encourages the features from different domains to be closer to each other. MDMN has been shown to help improve classification performance while avoiding over-fitting issues, as described in (Li et al., 2018c). The loss function used to train the model is given by:

$E(\theta_f, \theta_d) = \frac{1}{S N_0} \sum_{i=1}^{N_0} \frac{1}{N_{s_i}} \, r_{s_i}^\top M_d(M_f(x^i; \theta_f); \theta_d)$    (10)

$r_s(s') = \begin{cases} -\beta_s w_{ss'} & s' \neq s \\ \beta_s & s' = s \end{cases}, \quad \forall s' \in [1, S]$    (11)

where $N_{s_i}$ denotes the proportion of data that comes from site $s_i$, and the data samples are taken in mini-batch form, given by $\{(x^i, s_i)\}$ for all $i \in [1, N_0]$. In MDMN, for computational efficiency, a single domain adapter with weight sharing is used instead of $S$ different domain adapters; it is denoted by $M_d(\cdot; \theta_d) = [M_{d_1}(\cdot; \theta_d), M_{d_2}(\cdot; \theta_d), \cdots, M_{d_S}(\cdot; \theta_d)]$. The definitions and strategies to calculate $\beta_s$ and $w_s$ can be found in (Li et al., 2018c).
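
For illustration only, the weight vector of Eq. (11) for a sample drawn from site s can be assembled as below; the similarity weights w (an S x S array) and per-site coefficients beta are assumed to have been computed as in Li et al. (2018c):

import numpy as np

def mdmn_r(s, w, beta):
    r = -beta[s] * w[s]   # r_s(s') = -beta_s * w_{s s'} for s' != s
    r[s] = beta[s]        # r_s(s)  =  beta_s
    return r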

A.0.5. Moment Matching for MSDA (M3SDA)

The main objective of M3SDA is to align the target domain with the source domains while simultaneously aligning the source domains among themselves during the training process. Unlike the other methods discussed, M3SDA tries to align the feature distribution moments of each source instead of using adversarial training to reduce the domain batch effects (Peng et al., 2019). One of the basic assumptions of this method is that the posterior distribution of the class labels $P_{Y|X}$ will automatically align if the model is able to align the prior feature distributions $P_X$ of the domains. This assumption, however, might not hold for practical datasets containing multiple sources. To mitigate this issue, M3SDA-β was introduced in (Peng et al., 2019), and it is the variant used in this study. M3SDA-β minimizes the domain discrepancy based on the kth-order cross-moment divergence, denoted by $d_{CM}^k(\cdot, \cdot)$, where $k$ is a parameter taken as an input; furthermore, the training strategy utilized in (Peng et al., 2019) was applied for this model. The main loss function can be written as:

$E(\theta_f, \theta_c) = \sum_{s \in [1, S]} \sum_{i=1}^{N_s} L_d(M_c(M_f(x_s^i; \theta_f); \theta_c), y_s^i) + d_{CM}^k(F_s, F_S)$    (12)

where $F_s$ and $F_S$ denote the feature vectors produced by $M_f$ for the source and target data $x_s$ and $x_S$, respectively. Apart from the feature extractor $M_f$, M3SDA-β also uses a pair of classifiers for each domain, denoted by $M_{C'} = [(M_{C_1}, M_{C'_1}), (M_{C_2}, M_{C'_2}), \cdots, (M_{C_S}, M_{C'_S})]$. The training strategy involves using $M_f$ and $M_C$ in a three-step process: first, both models are trained together to classify the multi-source samples; next, $M_C$ is trained while keeping $M_f$ fixed to maximize the target-domain discrepancy between the classifiers in each classifier pair of $M_C$; finally, $M_f$ is trained while fixing $M_C$ to minimize the discrepancy of each classifier pair in $M_C$. The process repeats until convergence is achieved, and during testing a weighted average of the classifier outputs is used to make predictions, wherein the weights are defined using source-only accuracies as described in (Peng et al., 2019).
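
As a simple illustration (not the exact $d_{CM}^k$ of Peng et al. (2019)), a kth-order moment-matching penalty between two batches of latent features can be written in PyTorch as:

import torch

def moment_discrepancy(f_src, f_tgt, k=2):
    # f_src, f_tgt: (batch, features) outputs of the feature extractor M_f
    d = 0.0
    for j in range(1, k + 1):
        # match the mean of each elementwise power up to order k
        d = d + torch.norm(f_src.pow(j).mean(dim=0) - f_tgt.pow(j).mean(dim=0), p=2)
    return d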

Appendix B. Data Demographics

The ABIDE I dataset (Craddock et al., 2013a) is a combination of fMRI scans from 17 different sites. The dataset provides users with rs-fMRI, T1 structural brain images and phenotypic information for each patient. It consists of 505 ASD scans and 530 controls. As part of pre-processing the rs-fMRI data, the C-PAC processing pipeline offered by the Preprocessed Connectome Project (Craddock et al., 2013b) was used. The pipeline consists of several steps such as slice-time correction, motion correction, intensity normalization, and nuisance signal removal. Furthermore, data from all the sites were spatially registered to the MNI152 template space and passed through a band-pass filter (0.01-0.1 Hz) to remove high-frequency noise. The site-wise distribution of age and sex is described in Table 1.

Similarly, the ADHD-200 dataset (Bellec et al., 2017), which was first introduced for the ADHD-200 Competition, contains scans from 8 different sites, covering a total of 973 individuals. This dataset also provides one or more rs-fMRI scans, a T1 structural MRI and the respective phenotype for each individual. The scans undergo similar preprocessing using a pipeline made available by the Neuroimaging Analysis Kit (NIAK), which includes steps such as slice-timing correction, motion correction, linear and non-linear spatial normalization, correction of physiological noise, spatial smoothing and MNI T1 space registration. The distribution of the data according to phenotype is provided in Table 2. Since the phenotypic data had inconsistent and missing age information, that column has been omitted from the table.

Appendix C. Model Specifications

The architecture of the various components in each of the pipelines was kept constant, i.e., the feature extractor, label classifier and domain adapter had the same design across all methods. Since the features were flattened FCMs, fully connected layers (FCN) were used, along with dropout and L2 regularization. The sub-models' designs are described in Table 3. In most cases, the complete model is trained end-to-end using the Adam optimizer (Kingma and Ba, 2014) on the loss functions defined in Appendix A. A few of the models (e.g., MDMN) utilize a training strategy unlike the other methods (see Appendix A); for such methods, the training process described in the original papers is used. The hyperparameters used are listed in Table 4 and Table 5. Common hyperparameters (shown in Table 4) are kept constant throughout all experiments and datasets, while model-specific hyperparameters (see Table 5) are selected from a range of candidate values.


                      ASD                          TC
Sites       Age           Sex          Age           Sex
Pitt        19.0 (7.3)    M 25, F 4    18.9 (6.6)    M 23, F 4
Olin        16.5 (3.4)    M 16, F 3    16.7 (3.6)    M 13, F 2
OHSU        11.4 (2.2)    M 12, F 0    10.1 (1.1)    M 14, F 0
SDSU        14.7 (1.8)    M 13, F 1    14.2 (1.9)    M 16, F 6
Trinity     16.8 (3.2)    M 22, F 0    17.1 (3.8)    M 25, F 0
UM          13.2 (2.4)    M 57, F 9    14.8 (3.6)    M 56, F 18
USM         23.5 (8.3)    M 46, F 0    21.3 (8.4)    M 25, F 0
Yale        12.7 (3.0)    M 20, F 8    12.7 (2.8)    M 20, F 8
CMU         26.4 (5.8)    M 11, F 3    26.8 (5.7)    M 10, F 3
Leuven      17.8 (5.0)    M 26, F 3    18.2 (5.1)    M 29, F 5
KKI         10.0 (1.4)    M 16, F 4    10.0 (1.2)    M 20, F 8
NYU         14.7 (7.1)    M 65, F 10   15.7 (6.2)    M 74, F 26
Stanford    10.0 (1.6)    M 15, F 4    10.0 (1.6)    M 16, F 4
UCLA        13.0 (2.5)    M 48, F 6    13.0 (1.9)    M 38, F 6
Maxmun      26.1 (14.9)   M 21, F 3    24.6 (8.8)    M 27, F 1
Caltech     27.4 (10.3)   M 15, F 4    28.0 (10.9)   M 14, F 4
SBL         35.0 (10.4)   M 15, F 0    33.7 (6.6)    M 15, F 0

Table 1: ABIDE I demographics. Age is reported in the mean (standard deviation) format and the sex distribution is denoted by M: males and F: females.

             ADHD            TC
Site         Sex             Sex
KKI          M 15, F 10      M 41, F 28
NeuroImage   M 31, F 5       M 12, F 25
NYU          M 117, F 34     M 56, F 55
OHSU         M 30, F 13      M 30, F 40
Peking       M 92, F 10      M 84, F 59
Pittsburgh   M 3, F 1        M 50, F 44
UWash        M 0, F 0        M 33, F 28
Brown        M 0, F 0        M 9, F 17

Table 2: ADHD-200 demographics.


Appendix D. Class Balancing

As discussed in Section 4, to handle the data imbalance, the minority class was oversampled to match the majority class in number. If a particular site contains data from only one class, the site is dropped in that experiment (e.g., ADHD illness classification displays 6 sites instead of 8).


Sub-Model                 Architecture            Output
Feature Extractor (Mf)    input → 2000 → 1000     latent features
Label Classifier (Mc)     1000 → 100 → 2          label predictions
Domain Adapter (Md)       1000 → 100 → n          domain predictions

Table 3: FCN architectures used in each of the sub-models. Each layer is followed by a ReLU layer (except the last layer, where softmax is used) and a dropout layer with p=0.5. The output of Md has a different number of nodes for different models and datasets and is therefore represented by "n".
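
A minimal PyTorch sketch of the sub-models in Table 3 is given below; the input dimension (6670 for the flattened FCM) and the number of domain outputs n are filled in per dataset and model, and the final softmax is assumed to be folded into the loss:

import torch.nn as nn

def feature_extractor(in_dim=6670):
    return nn.Sequential(
        nn.Linear(in_dim, 2000), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(2000, 1000), nn.ReLU(), nn.Dropout(0.5),
    )

def label_classifier():
    return nn.Sequential(
        nn.Linear(1000, 100), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(100, 2),                 # softmax applied inside the loss
    )

def domain_adapter(n_domains):
    return nn.Sequential(
        nn.Linear(1000, 100), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(100, n_domains),
    )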

Hyperparameter   ABIDE I   ADHD-200
α                0.0001    0.0003
batch size       100       200
dropout          0.5       0.5
epochs           100       50

Table 4: List of hyperparameters which were kept constant throughout all the experiments and used in all MSDA methods. '*' denotes that the hyperparameter may differ between models based on the hyperparameter tuning conducted; in that case, the most commonly used value is reported.

Dataset     Hyperparameter   Task      DANN   DARN    MDAN   MDMN   M3SDA
ABIDE I     µ                Illness   0.1    0.01    0.1    0.1    0.01
                             Sex       0.1    0.1     0.1    0.01   0.1
                             Age       0.1    0.1     0.1    0.01   0.1
            γ                Illness   -      0.7     5      -      -
                             Sex       -      0.5     5      -      -
                             Age       -      0.7     10     -      -
ADHD-200    µ                Illness   1      0.001   0.01   1      1
                             Sex       0.1    1       1      0.1    1
                             Age       -      -       -      -      -
            γ                Illness   -      0.5     2      -      -
                             Sex       -      0.5     3      -      -
                             Age       -      -       -      -      -

Table 5: Model-specific hyperparameters used in our experiments, in addition to the common hyperparameters in Table 4.

To understand the impact of class balancing on the performance of each model, a comparison of the accuracies before and after data balancing is provided in Figure 5. The illness data in ABIDE I already contained balanced classes and hence was omitted.
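
A minimal sketch of this per-site oversampling (one possible implementation, not necessarily the authors') is:

import numpy as np

def oversample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) < 2:
        return None                        # single-class site: dropped
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]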

Figure 5 shows that balancing the data made no difference to MSDA performance in the case of age classification for ABIDE I, while we see a significant improvement from this strategy for sex classification, with an increase as high as 8% for the MDMN model. The accuracy changes in ADHD-200 are quite different from what is observed for the ABIDE dataset. We see that, in the case of illness classification, all models (except DANN) benefited from the balancing. An interesting observation can be made in sex classification for ADHD-200, wherein the accuracy scores of MDAN increased by almost 7%. Hence, in most cases data balancing had a positive and significant impact on model performance.


Figure 5: A comparison of the accuracies before and after balancing the classes using oversampling. The age and sex classifications for ABIDE I are shown in (a) and (b), while (c) and (d) show the illness and sex classifications for the ADHD-200 dataset.

