

IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 33, NO. 10, OCTOBER 2014 2039

Comparative Evaluation of Registration Algorithms in Different Brain Databases With Varying Difficulty: Results and Insights

Yangming Ou*, Hamed Akbari, Michel Bilello, Xiao Da, and Christos Davatzikos

Abstract—Evaluating various algorithms for the inter-subject registration of brain magnetic resonance images (MRI) is a necessary topic receiving growing attention. Existing studies evaluated image registration algorithms in specific tasks or using specific databases (e.g., only for skull-stripped images, only for single-site images, etc.). Consequently, the choice of registration algorithms seems task- and usage/parameter-dependent. Nevertheless, recent large-scale, often multi-institutional imaging-related studies create the need, and raise the question, whether some registration algorithms can 1) apply generally to various tasks/databases posing various challenges; 2) perform consistently well; and, while doing so, 3) require minimal or ideally no parameter tuning. In seeking answers to this question, we evaluated 12 general-purpose registration algorithms for their generality, accuracy and robustness. We fixed their parameters at values suggested by algorithm developers as reported in the literature. We tested them in 7 databases/tasks, which present one or more of 4 commonly-encountered challenges: 1) inter-subject anatomical variability in skull-stripped images; 2) intensity inhomogeneity, noise and large structural differences in raw images; 3) imaging protocol and field-of-view (FOV) differences in multi-site data; and 4) missing correspondences in pathology-bearing images. In total, 7,562 registrations were performed. Registration accuracies were measured by (multi-)expert-annotated landmarks or regions of interest (ROIs). To ensure reproducibility, we used public software tools and public databases (whenever possible), and we fully disclose the parameter settings. We show evaluation results, and discuss the performances in light of the algorithms’ similarity metrics, transformation models and optimization strategies. We also discuss future directions for algorithm development and evaluation.

Index Terms—Brain magnetic resonance imaging (MRI), deformable image registration, evaluation, registration accuracy.

Manuscript received March 05, 2014; revised May 29, 2014; accepted May 31, 2014. Date of publication June 13, 2014; date of current version September 29, 2014. Asterisk indicates corresponding author.

*Y. Ou is with the Center for Biomedical Image Computing and Analytics (CBICA), Department of Radiology, University of Pennsylvania, Philadelphia, PA 19104 USA, and also with the Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Charlestown, MA 02129 USA.

H. Akbari, M. Bilello, X. Da, and C. Davatzikos are with the Center for Biomedical Image Computing and Analytics (CBICA), Department of Radiology, University of Pennsylvania, Philadelphia, PA 19104 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMI.2014.2330355

I. INTRODUCTION

Image registration is the process of transforming different images into the same spatial coordinate system, so that after registration, the same spatial locations in different images represent the same anatomical structures. Image registration, especially deformable image registration, is a fundamental problem in medical image computing. It is usually an indispensable component in many analytic studies, including studies aiming to understand population trends of imaging phenotypes, to measure longitudinal changes, to fuse multi-modality information, to guide computerized interventions, to capture structure-function correlations, and many others (two recent comprehensive surveys can be found in [1], [2]; various other surveys can be found in [3]–[9]).

The past two decades have witnessed the development of many deformable registration algorithms. A comprehensive evaluation of different registration methods has thus become a research topic of interest. It is the basis for users to choose the most suitable methods for the problems at hand, and for algorithm developers to be better informed theoretically. A comprehensive evaluation is a fairly complicated undertaking, though. It oftentimes requires public databases, expert labeling of ground-truth regions/landmarks, a comprehensive evaluation protocol, careful tuning of parameters, considerable computational resources, and a proper choice of data to reflect certain registration challenges or to meet certain (pre-)clinical needs.

A. Literature on Evaluation of Algorithms in Brain MRI Registration

West et al. [10] evaluated three registration methods for multi-modal registration of the same subject undergoing neurosurgery, where the accuracy was measured on fiducial markers. Hellier et al. [11] evaluated six methods (ANIMAL, Demons, SICLE, Mutual Information, Piecewise Affine, and the authors’ own method) in a database containing brain MRI from 18 healthy subjects; they measured registration accuracy on expert-defined cortical regions. Yanovsky et al. [12] evaluated three methods (fluid, and two variations of the authors’ own methods, namely the symmetric and asymmetric unbiased methods) in mapping brains across longitudinal scans for the detection of atrophy. They used 20 subjects from the ADNI database, and all images were preprocessed to exclude skull and dura. Yassa et al. [13] evaluated three methods (DARTEL, LDDMM, and Diffeomorphic Demons) in inter-subject registration of images, especially at the medial temporal lobe, which is a crucial region for the study of memory. Christensen et al. introduced the NIREP project [14], [15], which in the first phase contains 16 publicly-available MR images, each having 32 expert-defined regions of interest (ROIs). They also defined

0278-0062 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.



a comprehensive set of evaluation criteria (including intensity-based variations, region-based overlaps, and transitivity errors). Based on these criteria they evaluated six methods (rigid, affine, AIR, Demons, SICLE, SLE) in [16]. Klein et al. in [17] evaluated 14 publicly-available registration tools for inter-subject registration of brain MR images. Four databases, each containing images of multiple subjects, were used. While it evaluated perhaps the largest number of registration tools so far, [17] only focused on skull-stripped, high-quality, single-site images from healthy subjects. To evaluate registration methods under more challenges, Klein et al. in another study [18] included both skull-stripped and raw images. Here, raw images refer to those acquired directly from the scanner, prior to any image processing steps.

B. Need for a More Comprehensive Evaluation Study

While all the aforementioned studies provided insightful and informative evaluations, each study only tested registration methods in a specific task: for healthy subjects only (e.g., [11], [13], [16], [17]), for multi-modal fusion in pathological subjects only (e.g., [10]), for skull-stripped images only (e.g., [12], [13], [17]), for raw images only (e.g., [10], [11], [19]), for skull-stripped and raw images only (e.g., [18]), for single-site data only (e.g., [10], [11], [13], [16], [17]), or for multi-site data only (e.g., [12]). In addition, different evaluation studies included different registration methods for evaluation. Moreover, they used different parameter settings for the same registration method. As a result, the choice of registration algorithms seemed to be task- and database-dependent, and was sensitive to parameter settings.

Nevertheless, many of today’s large-scale, pre-clinical

and imaging-related studies present a wide variety of challenges—they may contain normal-appearing and/or pathology-bearing images; they may contain skull-stripped and/or raw images; and they may contain images acquired at a single institution and/or at multiple institutions. Facing all these possible challenges, there is an increasing need for registration algorithms that are publicly available, that apply widely to various tasks/databases, that perform relatively accurately and robustly, that can be easily used by people with varying expertise in image registration, and that need little parameter tuning.

While automatically and effectively tuning parameters for

specific databases/tasks is an important and active area of research (e.g., [20]–[24]), having registration algorithms that are widely applicable to many tasks/databases is a very desirable property. This need stems not only from the size of studies, but also from the rapidly increasing number of studies undertaken in, for instance, translational neuroscience. Due to the lack of ground truth, tuning parameters is a difficult task even for technical experts. Moreover, when images are acquired and processed at multiple collaborating institutions, having registration algorithms that perform robustly and consistently well with a fixed set of parameters becomes almost entirely necessary.
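The usage pattern argued for here — one frozen parameter set reused verbatim across every database and task — can be sketched as a small batch driver. The command name `register` and its flags below are hypothetical placeholders, not any evaluated tool's actual interface:

```python
# Hypothetical batch driver: the point is that FIXED_PARAMS is set once and
# reused verbatim for every job, whatever the database or task. "register"
# and its flags are placeholders, not a real tool's command-line interface.
FIXED_PARAMS = ("--metric", "MI", "--levels", "4")

def command_for(source, target, output):
    """Build one registration job; the parameter tail is identical for all jobs."""
    return ["register", "-s", source, "-t", target, "-o", output, *FIXED_PARAMS]

jobs = [command_for("nirep_na01.nii", "nirep_na02.nii", "na01_to_na02.nii"),
        command_for("adni_s05.nii", "adni_s11.nii", "s05_to_s11.nii")]
print(all(job[-4:] == list(FIXED_PARAMS) for job in jobs))  # True
```

In practice each command list would be handed to a job scheduler; the design choice being illustrated is simply that no per-database branch ever touches the parameters.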

C. Overview and Contributions of Our Evaluation Study

Towards meeting this need, this paper evaluates 12 publicly-available and general-purpose registration algorithms, including an attribute-based algorithm and 11 intensity-based algorithms. Our work builds upon and significantly expands previous evaluation studies of a similar nature in many aspects.

1) Our work evaluated registration methods under various tasks and databases presenting a wide variety of challenges, rather than in a specific task containing specific challenges. We identified four typical challenges in inter-subject registration of brain MRI (as will be described in Section II). We chose seven databases, each containing images of multiple subjects, to represent some or all of those challenges. This helped reveal whether a registration method could be generally applicable and robust with regard to various challenges.

2) To reflect the robustness of registration algorithms, especially in large-scale translational studies involving various tasks/databases, we fixed the parameters for each registration method throughout this paper (i.e., task-independent parameter settings). In particular, in all tasks/databases in this paper, we used the parameters reported in [17] whenever applicable. We used these parameter settings because they had been “optimized” by the authors/developers of each specific algorithm for the registration of skull-stripped, preprocessed and normal-appearing brain MR images, which is a typical task in inter-subject registration [17]. We realize that this set of parameters might not be optimal for other databases/tasks (e.g., those containing raw images, pathology-bearing images, or multi-site images). However, a fixed set of parameters is perhaps how those registration algorithms would be used, or first tried, in daily imaging-related translational studies, where heavy parameter tuning is not only tedious, but also less practical and less reproducible. From another perspective, it would be preferable if some registration algorithms could apply widely and could perform consistently well in various tasks/databases given a fixed set of parameters.

To maintain the reproducibility of our study, we used public databases wherever possible (six out of the seven databases used in this paper are public); we included registration algorithms/tools that are publicly available; and we fully disclose the exact parameter settings in Appendix B.

The rest of the paper is organized as follows. In Section II, we identify four typical challenges in the inter-subject registration of brain MR images. In Section III, we present the protocol to evaluate the accuracies of registration algorithms. In Section IV, we show the evaluation results. Finally, we discuss and conclude this paper in Section V.

II. TYPICAL CHALLENGES IN INTER-SUBJECT BRAIN MRI REGISTRATION

Brain images from different subjects may present one or more of the following challenges to registration.

Challenge 1: Inter-Subject Anatomical Variability. Subjects may vary structurally (Fig. 1 shows some examples). Inter-subject variability is a common challenge in many registration tasks investigating neuro-development, neuro-degeneration, or neuro-oncology. It is the main challenge against which registration methods were evaluated in the literature [13], [15]–[17].



Fig. 1. Four randomly-chosen subjects in the NIREP database (the top two rows) and four randomly-chosen subjects in the LONI-LPBA40 database (the bottom two rows). For each subject, both the intensity image and the expert-annotated ROI image are shown. Different colors represent different ROIs in each database. These two databases were used to evaluate how registration methods perform facing challenges arising from inter-subject variability (Challenge 1).

Challenge 2: Intensity Inhomogeneity, Noise and Structural Differences in Raw Images. In addition to inter-subject anatomical variability, images may suffer from intensity inhomogeneity (due to the bias field), background noise, and low contrast. With skulls, ears, and neck structures present in the raw images, subjects may also present larger deformations in those nonbrain structures compared to cortical structures (see Fig. 2 for an example). Registration of raw images is necessary when 1) skull stripping is erroneous, so one has to work with the with-skull raw images; or 2) registration itself is part of the skull-stripping step (e.g., in multi-atlas-based skull-stripping approaches [19], [25]).

Challenge 3: Protocol and FOV Differences in Multi-site Databases. Many of today’s large-scale translational imaging-related studies involve brain MR images acquired at multiple institutions. Since MR scanners, imaging protocols, and FOVs may vary from institution to institution, the acquired images may vary greatly. Especially when the FOV is different, which is not uncommon in multi-site databases, some images may contain structures that do not show up in other images (see Fig. 3 for an example). One can rely on experts to interactively crop images, so that images from various institutions cover roughly the same FOV. However, manual cropping is labor-intensive, subject to intra-/inter-rater variability, and may become intractable for today’s large-scale studies. Seeking a registration

Fig. 2. Images and annotations of two randomly-chosen subjects from each of the three databases we used to represent Challenge 2 (intensity inhomogeneity, noise and structural differences in raw brain images). (a) From the BrainWeb database. (b) From the IBSR database. (c) From the OASIS database.

method that is relatively more robust to imaging protocol and FOV differences is therefore of interest.

Challenge 4: Pathology-Induced Missing Correspondences in Pathology-Bearing Images. Spatially normalizing a number of pathology-bearing images into a normal-appearing template space offers opportunities to understand the common



Fig. 3. Three-plane view of the intensity images and annotation images from three randomly-chosen subjects in the ADNI database. White color in the annotation images denotes the brain masks, and red denotes the hippocampus masks. Blue contours in panel (a) point to a region that exists in one image, but does not exist in other images, due to FOV differences across imaging institutions. The ADNI database was used to represent Challenge 3 (on top of Challenges 1 and 2). (a) A normal control (NC) subject. (b) A mild-cognitive-impairment (MCI) subject. (c) An Alzheimer’s Disease (AD) subject.

spatial patterns of diseases. Pathologies are present in the patients’ images, but not in the normal-appearing template. This poses the so-called missing-correspondence problem (see Fig. 4 for an example). An ideal registration approach should accurately align the normal regions (which do have correspondences across images), and relax the deformation in the pathology-affected regions, where no correspondences can be found [26], [27]. The literature has suggested either masking out the pathological regions from the registration process (i.e., the cost-function-masking approach [27]), or simulating a pathological region in the normal-appearing template (i.e., the pathology-seeding approach [28]–[34]). However, both approaches require a careful segmentation of the pathological regions, which in itself is not an easy or affordable task, especially in large-scale studies. It

Fig. 4. Database to evaluate how registration methods perform facing the challenge arising from pathology-induced missing correspondences (i.e., Challenge 4). Red arrows point out the regions that contain the cavity (after the resection of the original tumors) and the recurrent tumors. Their correspondences are difficult to find in the normal-appearing template image (second row).

is ideal if a general-purpose registration algorithm, itself segmentation-free, can perform well in pathological-to-normal subject registration. That is, without prior knowledge of the presence or location of the pathologies, nor any partition of the pathological versus normal regions, it is ideal if a registration algorithm can find correspondences in places where correspondences can be found, while relaxing the deformation (or reducing the pathology-induced bias) in regions where correspondences can hardly be established.
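The cost-function-masking idea [27] can be illustrated with a toy similarity computation: voxels inside a pathology mask are simply excluded from the dissimilarity sum, so the pathological region exerts no force on the optimization. A minimal sketch with made-up 1-D "images" (real implementations operate on 3-D volumes inside the optimizer):

```python
def masked_ssd(moving, fixed, exclude_mask):
    """Mean of squared differences, skipping voxels flagged in exclude_mask
    (the cost-function-masking idea). Equal-length sequences; 1-D toy case."""
    total, counted = 0.0, 0
    for m, f, excluded in zip(moving, fixed, exclude_mask):
        if excluded:
            continue  # pathology region: contributes nothing to the cost
        total += (m - f) ** 2
        counted += 1
    return total / counted if counted else 0.0  # normalize by voxels used

fixed  = [10, 10, 10, 10, 10]
moving = [10, 10, 90, 90, 10]          # "pathology" at indices 2-3
mask   = [False, False, True, True, False]

print(masked_ssd(moving, fixed, [False] * 5))  # 2560.0: pathology dominates
print(masked_ssd(moving, fixed, mask))         # 0.0: pathology masked out
```

The contrast between the two printed values is the whole point: without the mask, the pathological voxels dominate the cost and drag the deformation toward a spurious match; with the mask, only regions with true correspondences drive it.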

III. EVALUATION PROTOCOL

In the following, Section III-A presents the databases we chose to represent the four challenges identified in Section II. Section III-B briefly introduces the registration methods/tools we included in this evaluation. Section III-C elaborates on the parameter settings for all methods, with an emphasis on how we maintain fairness, transparency, and reproducibility in our evaluation. Section III-D describes the criteria used to measure registration accuracies.

A. Databases

Seven databases were used. Of them, six are publicly available. Specifically, we used two public databases to represent Challenge 1 (Section III-A1); three public databases to represent Challenge 2 (Section III-A2); one public database to represent Challenge 3 (Section III-A3); and one in-house database to represent Challenge 4 (Section III-A4). They are summarized in Table I and introduced in the following.

1) Databases Representing Challenge 1: Two publicly-available, single-site databases, NIREP and LONI-LPBA40, were used to represent Challenge 1 (inter-subject variability). Both databases contain multiple normal subjects. Images in the two databases have been skull-stripped by neuroradiologists. Both databases contain T1-weighted (T1w) MR images (sequence parameters in Table I). Neuroradiologists annotated those images into a number of ROIs (32 ROIs in the NIREP database and 56 ROIs in the LONI-LPBA40 database). The



TABLE I
DATABASES USED IN OUR STUDY. ABBREVIATIONS: UCLA—UNIVERSITY OF CALIFORNIA AT LOS ANGELES; MGH—MASSACHUSETTS GENERAL HOSPITAL; BIRN—BIOMEDICAL INFORMATICS RESEARCH NETWORK; SPGR—SPOILED GRADIENT ECHO PULSE SEQUENCE; MP-RAGE—MAGNETIZATION-PREPARED RAPID ACQUISITION WITH GRADIENT ECHO; TR—REPETITION TIME; TE—ECHO TIME; FA—FLIP ANGLE; N/A—NOT AVAILABLE

annotated ROIs are located in the frontal, parietal, temporal and occipital lobes, cingulate gyrus, insula, cerebellum, and brainstem. Note that those ROIs were not used in the registration process; instead, they only served as references to evaluate the accuracy of registration (explained later in Section III-D1).
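When expert ROIs serve only as post-hoc references, a standard accuracy criterion (used in [17] and similar studies) is the label overlap between a warped source ROI and the corresponding target ROI, e.g., the Dice coefficient. A minimal sketch over sets of voxel indices, with made-up toy ROIs:

```python
def dice(roi_a, roi_b):
    """Dice overlap between two ROIs given as sets of voxel indices:
    2|A ∩ B| / (|A| + |B|). Returns 1.0 for perfect overlap, 0.0 for disjoint."""
    if not roi_a and not roi_b:
        return 1.0  # both empty: define as perfect agreement
    return 2 * len(roi_a & roi_b) / (len(roi_a) + len(roi_b))

# Toy example: a warped source ROI vs. an expert-annotated target ROI.
warped_roi = {(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1)}
target_roi = {(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2)}
print(dice(warped_roi, target_roi))  # 0.75, i.e., 2*3 / (4 + 4)
```

In an actual evaluation the sets would come from applying each algorithm's output deformation to the source labels; the overlap is then averaged over ROIs and subject pairs.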

The detailed lists of those ROIs can be found in Appendix C. Detailed information about how the ROIs were annotated can be found in [14] and [35]. Fig. 1 shows the intensity images and the corresponding expert-annotated ROI images from four subjects in the NIREP database and four subjects in the LONI-LPBA40 database. These databases were also used to evaluate the performance of registration methods in other similar studies (e.g., [15], [17]).

Registration was carried out from every subject to every other subject in the same database. This removed any bias in the selection of source and target images in the registration. This led to 240 registrations in the NIREP database, or 210 registrations in the LONI-LPBA40 database, for each registration method. Before registration, we removed the bias field inhomogeneity by the N3 algorithm (using the default parameters) [36], and reduced the intensity difference between the two images by a histogram matching step.

2) Databases Representing Challenge 2: Three public

databases were used, containing raw brain MR images from multiple subjects: the BrainWeb, IBSR, and OASIS databases. Specifically, the BrainWeb database [37] contains raw brain images of 20 healthy subjects. In each subject, every image voxel has been annotated as one of 11 brain or nonbrain tissue types or structures: cerebrospinal fluid (CSF), gray matter (GM), white matter (WM), fat, muscle, muscle/skin, skull, vessels, around fat, dura matter, and bone marrow. Fig. 2(a) presents two randomly-chosen subjects from the BrainWeb database, including their intensity images and the corresponding annotation images. We randomly picked 11 BrainWeb subjects, leading to 110 pair-wise registrations for each registration algorithm. The IBSR database [38] consists of raw T1-weighted MRI scans of 20 healthy subjects from the Center for Morphometric Analysis at the Massachusetts General Hospital. In this database, brain masks have been manually delineated by trained investigators for each subject. We randomly picked 10 IBSR subjects, leading to 90 inter-subject registrations for each registration algorithm. Fig. 2(b) shows the raw intensity images and the corresponding brain masks of two randomly-chosen IBSR subjects. The OASIS database [39] contains cross-sectional T1-weighted MRI data in young, middle-aged, nondemented, and demented older adults, to facilitate basic and clinical discoveries in neuroscience. The brain masks were first generated by an automated method based on registration to an atlas, and then proofread and corrected by human experts before release. We randomly selected 10 OASIS subjects, leading to 90 inter-subject registrations for each registration algorithm. Fig. 2(c) shows the raw intensity images and the corresponding brain masks of two randomly-chosen OASIS subjects.
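The registration counts quoted above follow from registering every ordered pair of distinct subjects, i.e., n(n−1) registrations for n subjects (both directions counted, since the choice of source vs. target matters). A quick sketch:

```python
from itertools import permutations

def pairwise_registrations(subjects):
    """All ordered (source, target) pairs of distinct subjects:
    n subjects -> n * (n - 1) registrations, both directions counted."""
    return list(permutations(subjects, 2))

print(len(pairwise_registrations(range(11))))  # BrainWeb: 11 * 10 = 110
print(len(pairwise_registrations(range(10))))  # IBSR or OASIS: 10 * 9 = 90
```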
Similar to the databases used for Challenge 1, the expert annotations were not used in the registration process; rather, they only served as references to evaluate registration accuracy, as we will explain later in Section III-D2.

3) Database Representing Challenge 3: One example

multi-site database is the ADNI database. ADNI, the Alzheimer’s Disease Neuroimaging Initiative, is a large-scale longitudinal study for better understanding and diagnosing Alzheimer’s Disease. It contains images acquired at 57 collaborating institutions or companies, all of which are publicly available.



TABLE II
REGISTRATION ALGORITHMS TO BE EVALUATED FOR THE INTER-SUBJECT REGISTRATION OF BRAIN IMAGES. THIS TABLE IS ONLY A BRIEF SUMMARY OF THEM. MORE DETAIL CAN BE FOUND IN APPENDIX A. ABBREVIATIONS: DIFF.—DIFFEOMORPHISM; MI—MUTUAL INFORMATION; SSD—SUM OF SQUARED DIFFERENCES; MSD—MEAN SQUARED DIFFERENCE; CC—CORRELATION COEFFICIENT; NCC—NORMALIZED CC

Different imaging sites used different MRI devices, imaging protocols, and FOVs. Registration among ADNI subjects is usually needed during data preprocessing, or for the spatial normalization of subjects. This database presents most of the challenges a multi-site database typically poses. Moreover, the ADNI protocol has now become a standard for studies of aging and neurodegenerative disorders such as AD. Therefore, the performance on the ADNI database is of great importance for registration algorithms applied to data from older individuals and individuals with neurodegenerative diseases. We randomly selected the baseline images of 10 ADNI subjects, including three normal control (NC), four mild-cognitive-impairment (MCI), and three AD subjects. For many subjects, the brain mask and hippocampus mask are available at the data release website. They were used in our experiments as references to evaluate the registration accuracy. Fig. 3 displays the raw intensity images and the brain/hippocampus masks for three randomly-chosen ADNI subjects (one NC, one MCI, and one AD subject). Please note the presence of inter-subject variability in the ventricle size, sulci, gyri, etc.; the image inhomogeneity, noise and large deformations; and especially the FOV differences due to image acquisition at multiple imaging sites [e.g., the neck can be seen in (a) (highlighted by the blue contours), but is barely seen in (b)]. As with the previously-mentioned databases, we performed all pair-wise registrations to avoid subject/template bias. This led to 90 registrations for each registration method.

4) Database Representing Challenge 4 (In Combination

With Challenge 1): An in-house database containing eight patients with recurrent brain tumors was used. T1-weighted images were collected with the image size 192 × 256 × 192 and the voxel size 0.977 × 0.977 × 1.0 mm³. Images contain both the cavity, caused mainly by the blood pool after the resection of the original tumor, and the recurrent tumors. We registered those pathology-bearing images into a common T1-weighted MR image (i.e., the template), which was collected from a healthy subject (image size 256 × 256 × 181, voxel size 1.0 × 1.0 × 1.0 mm³). In this database, we have collected landmarks and ROIs annotated by two independent experts (HA and MB). Those landmarks and ROIs served as references for measuring the registration accuracy (the criteria to be presented in Section III-D4). Due to the HIPAA regulation

(Health Insurance Portability and Accountability Act), the public release of this database is still an ongoing effort.

B. Registration Algorithms Included

Twelve general-purpose, publicly-available image registration methods were included in our study. They are summarized in Table II. We note that they are only a fraction of the large number of registration algorithms developed in the community. The pool can always be expanded to include other general-purpose algorithms (e.g., LDDMM [40], elastix [41], NiftyReg [42], plastimatch [43], etc.) and brain-specific methods (which often need or incorporate tissue segmentation and/or preprocessing such as skull-stripping or surface construction, e.g., DARTEL [44], HAMMER [45], FreeSurfer [46], Spherical Demons [47], etc.). In general, we chose the 12 methods listed in Table II because they represent a wide variety of choices for similarity measures, deformation models and optimization strategies, which are the most important components of registration algorithms (see Table II). Out of those 12 registration methods, nine were included in a recent brain registration evaluation study [17]: flirt [48], fnirt

[49], AIR [50], [51], ART [52], ANTs [53], CC-FFD [54], SSD-FFD [54], MI-FFD [54], and Diffeomorphic Demons

[55]. In addition, we included DRAMMS [56], and two registration methods that were not included in study [17]. They are: (the nondiffeomorphic, or additive, version of) Demons [57] (with an ITK-based public software available), and DROP

[58] (a novel discrete optimization strategy that dramatically increases registration speed while maintaining high accuracy). For the completeness of this paper, more detail of these image registration algorithms can be found in Appendix A. How

flirt: http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/flirt
fnirt: http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/fnirt
AIR: http://bishopw.loni.ucla.edu/air5/
ART: http://www.nitrc.org/projects/art/
ANTs: http://www.picsl.upenn.edu/ANTS
CC/MI/SSD-FFD: http://www.doc.ic.ac.uk/dr/~software/
(Diff.) Demons: http://www.insight-journal.org/browse/publication/154
DRAMMS: http://www.cbica.upenn.edu/sbia/software/dramms
DROP: http://www.mrf-registration.net/


their parameters were set will be presented in the next subsection (for the general rules) and Appendix B (for the detailed parameter values). Note that, while including a large number of registration

algorithms/tools is preferable, including all available registration algorithms/tools seems less practical. We have included a number of the best performing methods (ANTs, ART, Demons, MI-FFD, etc.) as previously reported in [17] and several recent ones (DROP, DRAMMS) representative of new advancements in optimization strategies and/or similarity designs. Our focus, though, was not only on the number of algorithms/tools being included in this study, but more importantly on comprehensively evaluating registration methods in various tasks rather than in one or two specific tasks as in many previous evaluation studies. Doing so could provide valuable insight into the generality, accuracy and robustness of registration algorithms/tools and inspiration for future algorithm development. Some algorithms/tools were not included. One reason was that they were not included in [17], and hence their best parameters were not reported on the same training databases that other methods used to optimize their parameters. We wanted to avoid the potential bias introduced by us selecting parameters of various algorithms; therefore we chose to use the optimal parameter sets whenever available in [17]. Moreover, some algorithms already had their closely-related methods included in our study. For example, LDDMM is in line with ANTs but does not have the symmetric design; elastix is an implementation of many transformation/similarity criteria, the variety of which is already represented by the 12 methods in this paper; NiftyReg and plastimatch are based on GPU implementations of the MI-FFD algorithm, which has already been included in this study, and a GPU implementation should be expected to improve the speed but not necessarily the registration accuracy. On the other hand, the evaluation framework in this paper is general enough to include, in the future, many other popular registration algorithms/tools.

C. Parameter Configurations for Registration Algorithms

We had the following two rules to set the parameters for each method.
• Rule 1: We used the optimized parameters as reported in [17] whenever applicable. Those parameters were optimized by the methodology/software developers themselves on skull-stripped brain image databases that are similar to the ones we used in this paper (specifically, they used four databases—IBSR, LONI-LPBA40, CUMC, MGH—for training, which, like the databases we used in this paper, contain skull-stripped T1-weighted images from 1.5T scanners). We can treat those databases in [17] as the “training” databases for the databases we used in this paper. For registration algorithms that were not included in [17]—DRAMMS, (Additive) Demons and DROP—we followed the same logic: we optimized their parameters in, and only in, the task of registering skull-stripped images (the LONI-LPBA40 database specifically). The fact that this LONI-LPBA40 database was also used as one of the seven databases for

“testing” in this paper is of less concern, since 1) the parameters of all other registration methods included in this paper were also optimized in the LONI-LPBA40 database and other similar skull-stripped databases (the IBSR, CUMC, MGH databases) as reported in [17]; and moreover, 2) in [17], the registration methods seemed to be trained and tested in the same exact databases (IBSR, CUMC, MGH, LONI-LPBA40), while in our study, we only trained/optimized the registration methods in one skull-stripped database representing Challenge 1, and we tested all registration methods in six other unseen databases representing Challenges 1–4, respectively.

• Rule 2: For each method, we fixed its parameters in all registration tasks. Put differently, we used the same parameters for a registration method, no matter whether it was used for skull-stripped brain images, raw brain images, multi-site data, or tumor-recurrent brain images. It should be admitted that the optimized parameters for skull-stripped brain MR images are not necessarily optimal for raw images or pathology-bearing images. However, using the same set of parameters has two advantages: 1) most users or algorithm developers will start from the parameters that have already been optimized in normal-appearing, skull-stripped images (e.g., [18], [59]–[63]); 2) it helps reveal the generality of registration methods and their robustness levels facing various registration challenges. The second point is especially important, because a registration method that can successfully apply to a wide variety of registration tasks without the need for task-specific parameter tuning should be desirable for routine use in many large-scale pre-clinical research studies.

Based on these two rules, we set the parameters, which are disclosed in Appendix B of this paper.

D. Criteria to Measure Registration Accuracy

Having described the databases and registration methods in the previous sub-sections, this sub-section introduces the criteria to evaluate registration accuracy. 1) Criteria in Databases Representing Challenge 1: We

measured the accuracies of inter-subject registrations in the NIREP and LONI-LPBA40 databases by the Jaccard overlap [64] between the deformed ROI annotations and the ROI annotations in the target image space. A greater overlap often indicates a more accurate spatial alignment. This was also the accuracy criterion used in many other evaluation studies such as [14], [17], [63]. Rohlfing in [65] demonstrated that, as long as the ROIs are localized (e.g., those (sub-)cortical structures), which is the case in the two databases we used, the regional overlap of ROIs is a faithful indicator of registration accuracy in various locations in the image space. Mathematically, given two regions A and B in a 3-D space, and denoting the volume of a region R by |R|, the Jaccard overlap [64] between the two regions is defined as

J(A, B) = |A ∩ B| / |A ∪ B|.    (1)
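As an illustrative sketch (ours, not code from the paper), the Jaccard overlap of (1) and the related Dice overlap can be computed from two binary ROI masks with NumPy; the function names are our own:

```python
import numpy as np

def jaccard_overlap(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union

def dice_overlap(a, b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom
```

Because the two metrics are monotonically related (J = D / (2 − D)), either one induces the same ranking of registration results.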

Some other studies used the Dice overlap [66], defined as D(A, B) = 2|A ∩ B| / (|A| + |B|). It should show the same trend and should be directly linked with the Jaccard


Fig. 5. Measuring registration accuracies in different zones. Panel (a) is the sketch of dividing the whole image into various zones. The solid contour filled with yellow texture denotes the abnormal zone (Zone 1), which contains the post-resection cavity and the recurrent tumor. Zones 2 and 3 are normal-appearing regions immediately close to, and far away from, Zone 1. Zone 4 is the whole brain boundary. The definition of the zones can be found in the main text in Section III-D4. Panel (b) shows landmark/ROI definitions for an example pair of images. Blue contours are expert-defined ROIs in Zone 1. Red crosses are expert-defined landmarks in Zone 2. Yellow crosses are expert-defined landmarks in Zone 3. Green contours are the automatically-computed brain boundaries (through Canny edge detection of the brain masks), used to measure the registration accuracy in Zone 4. Please note that the landmark/ROI definitions from a second expert (not shown here) may differ. This figure is best viewed in color.

overlap by J = D / (2 − D). Therefore, reporting either one overlap metric should be sufficient for our purpose. 2) Criteria in Databases Representing Challenge 2: In the

BrainWeb database, the annotations of 11 relatively localized brain and nonbrain structures [see Fig. 2(a)] were available. Therefore, we measured registration accuracy by the Jaccard overlap between the warped ROI annotations and the target ROI annotations. In the IBSR and OASIS databases, only the brain masks from raw brain images [see Fig. 2(b) and (c)] were available. Since the brain mask is not a localized structure, the Jaccard overlap alone, according to [65], might not be sufficient to represent the registration accuracy. Therefore, we used the 95th-percentile Hausdorff Distance (HD) between the warped and the target brain masks as an additional accuracy surrogate. The HD between two point sets X and Y is defined as

HD(X, Y) = max { max_{x∈X} min_{y∈Y} d(x, y), max_{y∈Y} min_{x∈X} d(x, y) },    (2)
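As a brute-force sketch of (2) with the percentile modification described below (our illustrative code, not from the paper):

```python
import numpy as np

def hausdorff_95(x, y):
    """95th-percentile symmetric Hausdorff distance between point sets.

    x: (N, 3) array and y: (M, 3) array of point coordinates. The 95th
    percentile replaces the maximum over points to damp outliers.
    """
    # full pairwise Euclidean distance matrix, shape (N, M)
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    h_xy = np.percentile(d.min(axis=1), 95)  # directed distance X -> Y
    h_yx = np.percentile(d.min(axis=0), 95)  # directed distance Y -> X
    return max(h_xy, h_yx)                   # symmetric in X and Y
```

For full-resolution brain masks, the O(NM) distance matrix would be impractical; a KD-tree or a distance transform of one mask would be the usual optimization.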

where d(x, y) is the Euclidean distance between the spatial locations of two points. The HD is symmetric with respect to the two input images, with a smaller value indicating a better alignment of brain boundaries. We used the 95th percentile rather than the maximum HD, to avoid the influence of outliers, as suggested in [67] and [68]. 3) Criteria in Databases Representing Challenge 3: Since

the annotations of both the brain mask and the left and right hippocampi were available in the ADNI database, we used the Jaccard overlap between the warped and target ROIs to indicate the registration accuracy in this multi-site database—a higher overlap usually means a better spatial alignment of two images. 4) Criteria in Databases Representing Challenge 4: The

landmark and ROI annotations from two independent experts in the in-house brain tumor database enabled us to measure the

registration accuracy in various locations. Specifically, we defined four zones in the entire image space, as can be seen in Fig. 5. Those zones were defined by their distances to the abnormal regions. Therefore, they helped reflect how the existence of cavities and recurrent tumors influenced the registration accuracy in various regions of the image.
• Zone 1: Abnormal region. Experts HA and MB together contoured the abnormal region that contains 1) the post-resection cavity and 2) the recurrent tumor. Within this contour was what we defined as Zone 1, the abnormal region [see Fig. 5(a)]. The experts referred to the FLuid-Attenuated Inversion-Recovery (FLAIR) MR image for this contouring, because of its high sensitivity and specificity in delineating brain tumors [69]–[71]. Then they independently found the corresponding regions in the normal template image. We measured the accuracy of registration in Zone 1 by two metrics: 1) the average Dice overlap, and 2) the 95th-percentile Hausdorff Distance, between the algorithm-warped abnormal region and the two rater-warped abnormal regions in the template space. A higher regional overlap and a smaller distance reflect a better alignment of two images in Zone 1.

• Zone 2: Regions immediately neighboring the abnormal region. A 30 mm-wide band immediately outside the abnormal region was defined, by morphologically dilating the abnormal region mask agreed upon by the two experts in the patient’s image space [see Fig. 5(a)]. Anatomical landmarks were identified in this band, which served as the references to reflect the registration accuracy in the immediate neighborhood of abnormalities. One expert (HA) labeled 10 anatomical landmarks in Zone 2 within the patient image. The two experts then independently labeled the corresponding anatomical landmarks in the template space. The average Euclidean distance between the algorithm-calculated corresponding landmark locations and the rater-labeled corresponding landmark locations in the


Fig. 6. Depiction of inter-expert and algorithm versus expert landmark errors.

template space was used to measure the registration accuracy in Zone 2. Smaller landmark errors point to higher registration accuracy. The concept is depicted in Fig. 6. Given a set of N expert-annotated landmarks in the patient image, their corresponding landmark locations l_i^{HA} (independently by expert HA) and

l_i^{MB} (independently by expert MB) in the template image, and the algorithm-calculated corresponding landmark locations l_i^{alg}, also in the template image, we defined the inter-expert landmark error (the length of the solid line in Fig. 6) as

e_inter = (1/N) Σ_{i=1}^{N} d( l_i^{HA}, l_i^{MB} ),    (3)

and the algorithm-to-expert landmark error (the average length of the dashed lines in Fig. 6) as

e_alg = (1/(2N)) Σ_{i=1}^{N} [ d( l_i^{alg}, l_i^{HA} ) + d( l_i^{alg}, l_i^{MB} ) ],    (4)

where d(·, ·) is the Euclidean distance between two voxel locations.

• Zone 3: Regions far away from the abnormal region. Zone 3 was defined as all the normal regions outside Zone 2 [see Fig. 5(a)]. We used landmarks to evaluate the registration accuracy in Zone 3. This could show how the recurrent tumor and the cavities have influenced registration in far-away normal-appearing regions. One expert (HA) labeled 40 anatomical landmarks in Zone 3. Then the two experts independently labeled corresponding landmarks in the template space. Registration accuracy in this zone was measured by the average Euclidean distance between the algorithm-calculated corresponding landmarks and the rater-labeled corresponding landmarks in the template space. Smaller landmark errors point to higher registration accuracy in Zone 3.

• Zone 4: Brain boundaries. The existence of cavities and recurrent tumors inside the patient image may even influence the alignment of the brain boundaries between the patient and the normal-appearing template images. To capture this influence, we measured the Dice overlap and the 95th-percentile Hausdorff Distance between the warped and the template brain masks. A higher Dice overlap and a smaller distance indicate a higher level of robustness of a registration algorithm with regard to the abnormality-induced negative impact.
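The landmark-error criteria of (3) and (4) can be sketched as follows (our illustrative code, assuming landmark arrays already expressed in template-space coordinates):

```python
import numpy as np

def landmark_errors(lm_ha, lm_mb, lm_alg):
    """Inter-expert (3) and algorithm-to-expert (4) landmark errors.

    lm_ha, lm_mb: (N, 3) template-space landmark locations annotated
    independently by the two experts; lm_alg: (N, 3) algorithm-calculated
    corresponding locations. Returns errors in the input units (e.g., mm).
    """
    # (3): mean Euclidean distance between the two experts' annotations
    e_inter = np.linalg.norm(lm_ha - lm_mb, axis=1).mean()
    # (4): mean of the algorithm's distances to both experts
    e_alg = 0.5 * (np.linalg.norm(lm_alg - lm_ha, axis=1)
                   + np.linalg.norm(lm_alg - lm_mb, axis=1)).mean()
    return e_inter, e_alg
```

The inter-expert error serves as a natural lower-bound reference: an algorithm whose e_alg approaches e_inter performs about as consistently as a human rater.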

IV. RESULTS AND ANALYSIS

In this section, we use four subsections to present the evaluation results in the databases representing the four aforementioned challenges.

A. Results in Databases Representing Challenge 1

Fig. 7 shows the average Jaccard overlap over all ROIs in the NIREP and LONI-LPBA40 databases. A detailed table of Jaccard overlap per ROI can be found in Appendix C for all algorithms evaluated in this paper. Several observations can be made from this set of results. 1) DRAMMS and ANTs obtained the highest Jaccard overlaps in both databases. Between the two methods, DRAMMS had a slightly higher accuracy in the NIREP database ( for ANTs and

for DRAMMS, ); whereas ANTs had a slightly higher accuracy in the LONI-LPBA40 database ( for ANTs and

for DRAMMS, ). In both cases, the differences were tiny ( difference between the average Jaccard overlaps on the 0–1 scale). Following ANTs/DRAMMS were the DROP, Demons and ART registration methods. Among these methods, ANTs and ART were included in the evaluation study [17] and were found to be the two most accurate methods. Our findings here showed a similar trend. In addition, the three methods—DRAMMS, DROP and (the nondiffeomorphic, or additive, version of) Demons—which were not included in [17], showed highly competitive performances.

2) Methods such as SSD-FFD, fnirt, and DROP use intensity differences (SSD) as the similarity metric. On average, they had reasonable Jaccard overlaps. However, they also showed larger variations of regional overlaps, and therefore they were less stable in our experiments than those methods using CC, MI, or attribute-based similarity measures.

3) The high degree of freedom allowed by a deformation mechanism (such as the FFD model used in DRAMMS and the diffeomorphic LDDMM model used in ANTs) is perhaps another factor contributing to the higher registration accuracy, compared to deformation mechanisms with relatively few degrees of freedom (such as the fifth-order polynomials in AIR).

B. Results in Databases Representing Challenge 2

For the three databases containing raw images, we had two scenarios—one focusing on the localized structures and tissue types throughout the image space (the BrainWeb database); and the second focusing on the brain masks (the IBSR and OASIS databases), which is crucial for (multi-)atlas-based skull-stripping. Scenario 1. Registration Accuracy in Multiple Localized

ROIs/Structures in the Raw Images: Fig. 8 shows the average Jaccard overlaps of the 11 ROIs in the BrainWeb database. Several observations can be made. 1) DRAMMS, DROP and Demons obtained similarly high accuracies, followed closely by the ANTs and ART methods. These registration methods were also among


Fig. 7. Box-and-Whisker plots of registration accuracies in the NIREP and LONI-LPBA40 databases, as indicated by the Jaccard overlaps averaged across 32 (in NIREP) or 56 (in LONI-LPBA40) ROIs. This figure shows how registration methods perform facing Challenge 1 (inter-subject variability).

Fig. 8. Box-and-Whisker plots of registration accuracy in the BrainWeb database, as indicated by the Jaccard overlaps averaged across 11 available ROIs. This is Scenario 1 in the testing of registration methods facing Challenge 2 (intensity inhomogeneity, noise and structural differences in raw images).

the most accurate ones in registering skull-stripped brain images, as shown in the previous sub-section.

2) On the other hand, the average Jaccard overlap in various ROIs by the best-performing algorithm in this raw,


Fig. 9. Registration accuracy in raw brain images, in the IBSR and OASIS databases, as indicated by the Jaccard overlap (the first row) and the 95th-percentile Hausdorff Distance (the second row), between the warped and the target brain masks. This is Scenario 2 in the testing of registration methods facing Challenge 2 (intensity inhomogeneity, noise and structural differences in raw images). “prctile” in the title of the second subfigure means “percentile”.

with-skull database (i.e., DRAMMS) was only around 0.32 (Fig. 8). Compared to the average Jaccard of 0.52–0.57 in registering skull-stripped brain images (Fig. 7), this clearly underlined the increased level of difficulty in registering raw, with-skull brain images.

Scenario 2. Registration Accuracy in Brain Masks of the Raw Images: In this registration scenario, we focused on the accuracy of registration in warping the brain masks, which is the basis for the atlas-based skull-stripping framework (e.g., [19], [25]). Fig. 9 shows the accuracies of three registration methods (ANTs, Demons and DRAMMS), which were forerunners in the results of Scenario 1. Several observations can be made. 1) The Jaccard regional overlap and the 95th-percentile Hausdorff Distance showed the same trend for aligning the brain masks—a higher Jaccard overlap corresponded to a smaller distance;

2) The ranking and the difference of methods seemed to be highly dependent on the database, especially the level of

difficulty for registration in a database. In the OASIS database, which exhibits a lower level of inter-subject FOV differences and intensity inhomogeneity [as can be seen in Fig. 2(c)], ANTs scored the highest accuracy, followed closely by Demons and DRAMMS. In the IBSR database, however, which exhibits a higher level of inhomogeneity, background noise and larger deformations, DRAMMS scored the highest accuracy, followed, by relatively larger margins, by Demons and ANTs.

3) Overall, a Jaccard overlap of 0.75–0.93 could be expected for the brain masks when we registered raw, with-skull images within the same database. This should provide a promising starting point for (multi-)atlas-based skull-stripping (e.g., [19], [25]). On the other hand, one needs to be aware that the registration accuracy might decrease when images are from multi-site databases, usually with larger imaging and FOV differences (to be shown in the next subsection).


Fig. 10. Jaccard overlaps in the ADNI database, for a) the brain mask (the left three columns); b) the left hippocampus (the middle three columns); and c) the right hippocampus (the right three columns). This figure shows how registration methods perform in a typical multi-site database, where additional challenges arise from the imaging and FOV differences among different imaging institutions (Challenge 3).

C. Results in the Database Representing Challenge 3

Fig. 10 shows the overlap results averaged over all 90 pairwise inter-subject registrations in the ADNI database. Compared to the registration within a single-site database, the registration of raw images acquired from multiple imaging sites encountered an increased level of difficulty. Specifically, several observations can be made. 1) DRAMMS and ANTs performed better than Demons in aligning the brain masks. They showed higher levels of robustness with regard to the FOV differences, the background noise and the presence of the skull or other nonbrain structures.

2) The presence of the skull, the background noise, and especially the FOV differences, in the ADNI multi-site database, had a clearly visible impact on the accuracy of aligning deep brain structures. When registering skull-stripped images from single-site databases such as the LONI-LPBA40 database, ANTs, Demons and DRAMMS could align hippocampi at Jaccard overlaps around 0.6, and they differed by less than 0.05 Jaccard overlap on average (see Table V in Appendix C). However, when the raw images from the multi-site ADNI database were used, even the best-performing algorithms (DRAMMS and Demons, as shown in Fig. 10) could only align hippocampi at Jaccard overlaps around 0.5 on average, and algorithms had greater differences in terms of the Jaccard overlaps they obtained. Another factor that caused the decrease in the accuracy and the increase in the differences among methods could be that the subjects in the ADNI database have highly variable degrees of neuro-degeneration (three normal controls, four MCI, and three AD subjects), and hence they have largely different

ventricle sizes, atrophy patterns, and hippocampus sizes (see Fig. 11 for some examples of very difficult cases, which will be described in item 4 below).

3) Considering the quantitative results in all three regions (brain mask, left, and right hippocampi) in those with-skull raw images acquired from multiple institutions, DRAMMS showed the greatest promise.

4) Besides the quantitative results in the limited number of structures such as the brain masks and the hippocampi, the visual inspection of the registration results in the whole images could actually reveal much greater differences among registration methods. Fig. 11 shows some registration results from Demons, ANTs and DRAMMS in two pairs of subjects from the multi-site ADNI database. As pointed out by the blue arrows in the figure, DRAMMS showed a clear advantage in aligning largely different anatomies such as the ventricles. To capture this large difference, many algorithms may have to increase their search ranges. However, this usually requires considerable effort for task-specific, or even individual-specific, parameter adjustments. Adjusting search ranges is a nontrivial research topic. It often requires specific theoretical designs [73], or requires the introduction of anatomical landmarks [74]–[80]. The main difficulty is to effectively balance between capturing the large differences and capturing the local subtle displacements. Therefore, algorithms that can capture and balance both, and do not require additional parameter adjustments, become favorable. Actually, in our recent studies that spatially normalized all ADNI subjects into a common template (from a normal control subject), a small portion of the Alzheimer’s Disease (AD) subjects had unusually larger ventricles than many other AD subjects. We considered a


Fig. 11. Demons, ANTs and DRAMMS registration results for two pairs of images having large anatomical variations, especially in the ventricles, mainly due to their different levels of neuro-degeneration. All subjects are from the multi-site ADNI database. Blue arrows point to some typical locations where the registration results from the three methods differ greatly.

registration “failure” if there were more than 5 mm errors in the ventricle boundaries (visually pronounced errors). Accordingly, the success rate was defined by

success rate = (# registrations without such failures) / (# registrations in total).

Compared to the 80%–85% success rates of ANTs and Demons when used with the fixed sets of parameters, DRAMMS, also using a fixed set of parameters as in the other cases, achieved a success rate of 96%, which was a clear improvement of registration accuracy in this large-scale multi-institutional database.


Fig. 12. Landmark errors or the 95th-percentile Hausdorff Distance in various zones in the pathology-to-normal subject registrations. In addition to the errors, we have shown the average Jaccard overlap in Zone 1 in this figure. This figure shows how registration methods perform in the presence of pathology-induced missing correspondences (Challenge 4).

D. Results in Database Representing Challenge 4

Fig. 12 shows the quantitative registration errors in the pathology-to-normal subject registration. As a reference, this figure also includes the inter-expert errors between the two independent experts in Zones 1–3. A desirable registration should 1) accurately register the normal-appearing regions, where correspondences can be established (i.e., small landmark errors in Zones 2–3); 2) accurately align the brain boundaries (i.e., small 95th-percentile HDs in Zone 4); and 3) map the pathological regions to the right location but relax the deformation within the pathological regions, where correspondences could hardly be established (i.e., high regional overlaps in Zone 1). In Fig. 12, several observations can be made. 1) The two independent experts agreed with each other only at a 0.48 Jaccard overlap on average in the cavity and tumor recurrence regions (Zone 1). This reflected the difficulty, or the ambiguity, for the human experts in dealing with the missing correspondences. Among the four methods evaluated, ANTs agreed with the experts at the highest level (average Jaccard overlap 0.40), followed closely by DRAMMS (average Jaccard overlap 0.37). The 95th-percentile Hausdorff Distances showed the same trend. Overall, the experts showed better agreement with each other than the algorithms did with the experts.

2) In Zone 2 (the immediate neighborhood of the abnormal regions), the average landmark error was 4.1 mm between experts. Landmark errors for DRAMMS, ANTs and Diffeomorphic Demons were similar (at 3.9, 4.1, and 4.4 mm,

respectively), and also comparable to the inter-expert errors.

3) Further away from the abnormal regions, Zone 3 had larger landmark errors than Zone 2. The average errors were 5.3 mm between experts, 5.8 mm for DRAMMS, and 5.9 mm for ANTs. Diffeomorphic Demons and especially fnirt started to have larger landmark errors (6.6 and 9.6 mm on average). This difference may, in part, be attributed to the fact that DRAMMS used texture features and ANTs used the correlation coefficient as the similarity metric, which were perhaps more robust and reliable than the intensity difference that Demons and fnirt used as their similarity metrics.

4) Another interesting finding was in Zone 4 (brain boundary). Because of the sharp contrast between foreground and background in a skull-stripped image, the brain boundaries ought to be among the easiest parts to register. In the presence of pathologies, however, this was surprisingly not always the case. Fnirt, for example, had 11.1 mm as the 95th-percentile Hausdorff Distance at the brain boundaries, which meant registration failures in several cases. ANTs had, on average, 3.3 mm as the 95th-percentile Hausdorff Distance at the brain boundary, which was even larger than the average errors ANTs produced in the abnormal regions (2.1 mm). By carefully examining the output images, we found that this average boundary error by ANTs was mainly caused by misalignments in the boundaries close to the pathology sites in several cases. This showed

Page 15: IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 33, NO. 10 ...you2/publications/Ou14_TMI.pdfical Imaging, Massachusetts General Hospital, Harvard Medical School, Charlestown, MA 02129 USA

OU et al.: COMPARATIVE EVALUATION OF REGISTRATION ALGORITHMS IN DIFFERENT BRAIN DATABASES WITH VARYING DIFFICULTY 2053

Fig. 13. Registration of a brain image with tumor recurrence to a normal brain template by DRAMMS, for a series of slices in the coronal view. This figures showshow the mutual-saliency mechanism (a spatial-varying utilization of voxels) helped DRAMMS in the pathological-to-normal subject registration scenario. Withoutsegmentation, initialization, or prior knowledge, the automatically-calculated mutual-saliency map (d), defined in the target image space, effectively assigned lowweights to those regions that correspond to those outlier regions (pointed out by arrows) in the source image (a). This way, the negative impact of outlier regionscould be largely reduced; registration was mainly driven by regions that could establish good correspondences. Red arrows point to the post-surgery cavity regions.Blue arrows point to the recurrent tumors.

that the pathology regions may impact a wide area inANTs registration. On the other hand, DRAMMS andDiffeomorphic Demons had the smallest boundary errors(1.8 and 2.2 mm, respectively). Both errors were smallerthan those in the abnormal regions, indicating goodalignments of the brain boundaries. This showed thatthe negative effect of pathological regions was morelocalized in DRAMMS and Diffeomorphic Demonsregistration algorithms, which should be desirable.

It should be emphasized that general-purpose registration algorithms are usually not designed for registering pathology-bearing images. Task-specific registration algorithms are often needed to segment and specifically deal with the pathology-affected regions. However, the fact that DRAMMS, a general-purpose registration algorithm, performed stably and robustly in all Zones 1–4 highlighted the effect of using attributes to measure voxel similarities and to quantify voxel-wise matching reliabilities. To better illustrate this, Fig. 13 shows a set of representative DRAMMS registration results (registration from a tumor-recurring patient's brain image to a normal-appearing brain template image). First, DRAMMS extracted high-dimensional Gabor texture attributes to represent each voxel. These attributes should be more informative than intensities in the search for correspondences. Moreover,



at each voxel, DRAMMS automatically calculated a so-called "mutual-saliency" weight, also based on the attributes. The mutual saliency quantified the chance of each and every voxel to establish a reliable correspondence between the two images. As Fig. 13(d) shows, the mutual-saliency map effectively identified outlier regions (dark blue), where correspondences could hardly be established. The identified outlier regions coincided with the recurrent tumor regions [as the red arrows point out in panel (a)]. Note that this was obtained without any segmentation, manual masking, or any prior knowledge of the presence or location of the tumor recurrence. Being segmentation-free is a feature that differentiates DRAMMS from task-specific cost-function-masking approaches or pathology-seeding approaches. As a result of this attribute-based similarity measurement and mutual-saliency weighting, the registration by DRAMMS was mainly driven by the regions where correspondences could be well established. This led to visually plausible results, as shown in Fig. 13(c).
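The weighting idea can be reduced to a toy sketch: here the saliency map is supplied by hand, whereas DRAMMS derives it automatically from attribute matching.

```python
import numpy as np

def weighted_ssd(warped, target, saliency):
    """Saliency-weighted mean squared intensity difference: voxels
    with near-zero saliency (outliers, e.g., tumor or cavity regions)
    barely contribute, so they cannot dominate the cost."""
    return float((saliency * (warped - target) ** 2).sum() / saliency.sum())
```

With a uniform saliency map this reduces to the ordinary mean SSD; with near-zero weights over an outlier region, the optimizer is driven almost entirely by the normal-appearing anatomy.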

V. DISCUSSION AND CONCLUSION

In this last section, we first summarize the work and findings in Section V-A. Then, Section V-B discusses the theoretical differences among the registration algorithms included in this paper, which may, at least partly, explain the different performances among registration methods in our experiments. Section V-C discusses the limitations of the whole evaluation work and future directions. Finally, Section V-D concludes this paper.

A. Summary of Work and Findings

This study evaluated several registration algorithms and their publicly-available software tools. Our evaluation had the features summarized below.

First, compared to existing studies that evaluated registration algorithms in specific tasks and/or databases, our study utilized multiple databases to represent a wide range of challenges for inter-subject registration. As Table I showed, the databases included in this evaluation covered a variety of imaging scanners (GE, Siemens, Philips), field strengths (1.5T, 3T), age groups, imaging FOVs, and imaging protocols (varying pulse-sequence parameters). The purpose was to extensively evaluate the generality, robustness, and accuracy of registration algorithms.

Second, our study found that, in general, registration algorithms differed greatly in their performance when facing different databases or challenges. For the skull-stripped images included in our study, ANTs and DRAMMS led to the highest overlaps of expert-annotated (sub-)cortical structures, followed by ART, Demons, DROP, and FFD. For the more challenging tasks in databases containing raw, multi-site, and pathology-bearing images, the attribute-based DRAMMS algorithm obtained relatively more stable and higher accuracies, followed closely by the intensity-based and symmetric ANTs registration algorithm.

B. Understanding the Differences Among Registration Algorithms

Registration algorithms differ in their similarity metrics, transformation models, and optimization strategies. Table II summarized the registration methods included in this paper, and more details can be found in Appendix A. Such differences are likely the major factors behind their different performances in this paper.

In terms of similarity metrics, 11 of the 12 methods included in this paper measure image similarity based on gray-scale intensities or intensity distributions. DRAMMS, on the other hand, measures image similarity by a rich set of multi-scale, multi-orientation Gabor attributes. Intensities alone may not necessarily carry the anatomical or geometric information of voxels; that is, voxels having similar or even identical intensities may belong to different anatomical structures. Consequently, a common challenge for intensity-based similarity metrics is how to effectively deal with matching ambiguities. Methods such as ANTs measure the similarity of two voxels by the correlation coefficient of intensities in local patches centered at those two voxels. The local patches carry, to some extent, local texture or geometric information. Therefore, in our experiments they were relatively more robust to noise, partial-volume effects, and magnetic field inhomogeneities than measuring voxel-wise similarity using intensities alone. Attribute-based methods such as DRAMMS extend this to the explicit characterization of voxels by high-dimensional, often more informative, texture or geometric attributes. This could reduce matching ambiguities, but at the cost of an increased computational burden. This observation has been documented in the literature by several research groups (e.g., [45], [81]–[83]). The generality and accuracy of DRAMMS in our experiments, especially its performance on raw, multi-site, and pathology-bearing images, provided new evidence for using attributes to measure image similarities. On the other hand, there is also ongoing research on extending intensity-based similarity metrics (CC or MI) into more robust measures to reduce matching ambiguity in mono- and multi-modality registration [22], [84], [85].

Another related issue is how image voxels are used when calculating the similarity between two images. General-purpose registration methods, such as most of the ones included in our study, often use all voxels equally to define the image similarity. DRAMMS, on the other hand, introduced the notion of "mutual saliency." The central idea was to use all voxels, but at different levels of confidence as measured by the mutual-saliency metric. Specifically, voxels having higher confidence in establishing reliable correspondences were assigned higher mutual saliencies (e.g., Fig. 13), and were accordingly given higher weights when calculating the image similarity. They were the main driving force for the registration. An immediate advantage was in the registration of pathology-bearing images, as shown in Fig. 13. Without prior knowledge of tumor presence, or any prior



tumor segmentation, DRAMMS examined voxels one by one and attached to each of them a "mutual-saliency" number that reflected its ability to find correspondences. In this way, the mutual-saliency map in Fig. 13(d) automatically and effectively identified a temporal-lobe region that had difficulty establishing correspondences, and the location of this region agreed with that of the abnormal regions. By this, the deformation within the abnormal region was relaxed; the other, normal-appearing regions were matched well, which drove the registration of the whole image. The idea of a spatially-varying treatment of voxels has also been adopted in other registration approaches (e.g., [86]–[88]), showing great promise in many challenging registration problems involving, for example, topology-changing tumor changes, pathology-induced outliers, and cardiac/lung motion-induced subtle changes.

In terms of transformation models, the ones with more degrees of freedom typically led to higher registration accuracies in our experiments. For instance, the geometric cubic B-spline FFD transformation model used in CC/MI/SSD-FFD, DRAMMS, and DROP, and the velocity fields used in Demons and ANTs, could perhaps explain their relatively higher registration accuracies compared to less flexible transformation models (e.g., the fifth-order polynomial used in AIR). The symmetric formulation introduced in ANTs seemed to at least partly contribute to its accuracy and robustness. Specifically, in the pathological-to-normal subject registration, where the two images differ greatly, ANTs had high accuracies in the abnormal regions and in the immediately neighboring normal regions. This was perhaps due to the symmetric setting, which constrains both images to deform towards the "hidden middle template" between the two images. In this way, a difficult inter-subject registration problem was decoupled into two relatively simpler subproblems. Such a symmetric setting is also advocated in many other approaches, such as in linear registration [89] and deformable registration [90], [91]. Furthermore, the diffeomorphic setting in ANTs and Diffeomorphic Demons also contributed to accuracy and robustness, since the regularization of the transformation within diffeomorphisms seemed to account for real-world anatomical deformations.

When it comes to optimization strategies, the methods included in this paper used optimizers either in the discrete space (DRAMMS, DROP) or in the continuous space (all others). The discrete optimization helped reduce the computational time to 3–5 min in DROP [58], [92], compared to 10–20 min in MI/CC/SSD-FFD, which have the same similarity metric and transformation model. Their accuracies were comparable in our experiments. Another interesting comparison in our experiments was the 30–50 min computational time of DRAMMS, which used a discrete optimizer on high-dimensional attribute-based similarities, versus about 1–1.5 h for ANTs, which used a continuous optimizer on patch-based correlation-coefficient similarities. Both computational times were for a pair-wise registration of typical brain images (e.g., image size 256 × 256 × 200), on a Linux operating system with an Intel Xeon X5667 3.06-GHz CPU and 16 GB of memory. It would take another controlled study to further investigate the impact of optimization strategies on registration accuracies. One thing to note is that many registration methods can be parallelized via GPU acceleration [93]–[95].
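The cubic B-spline FFD model mentioned above (used by FFD, DROP, and DRAMMS) interpolates a dense displacement field from a sparse lattice of control points. A minimal 1-D sketch of that interpolation follows (our illustration, not any package's implementation; real tools work on 3-D lattices):

```python
import numpy as np

def bspline_basis(u):
    """The four cubic B-spline blending weights at fractional offset u."""
    return np.array([
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    ])

def ffd_displacement(x, control, spacing):
    """Displacement at position x, blended from the four nearest
    control points of a 1-D lattice with the given spacing."""
    i = int(np.floor(x / spacing))        # index of the lattice cell
    u = x / spacing - i                   # fractional offset in [0, 1)
    w = bspline_basis(u)
    idx = np.clip(np.arange(i - 1, i + 3), 0, len(control) - 1)
    return float(w @ control[idx])
```

The four weights sum to one (partition of unity), so a constant control field reproduces a constant displacement; the degrees of freedom of the model are exactly the control-point values, which is why a denser lattice yields a more flexible transformation.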

C. Limitations of Our Evaluation Study and Future Work

We also note some drawbacks of this study, and hence our future work.

First, like many other studies [11], [12], [16], [83], [96]–[98], this study was conducted by the authors of one of the algorithms being evaluated. Questions may naturally arise about the reproducibility and fairness of such comparative evaluations. We tried to address these questions when designing and conducting this study: 1) for reproducibility, we fully disclosed the parameters, used public databases whenever possible, and constrained our evaluation to publicly-available registration algorithms; 2) for fairness, we used the parameters suggested by the authors of the registration methods as reported in [17], and all of those parameters were "optimal" for the registration of skull-stripped and preprocessed brain MR images. Not all self-conducted evaluation studies in the literature have had these features, but we felt it was important to comply with these high standards in our study. Our future plans include participation in third-party-organized challenges, such as [17], [99], and [100].

The second point worth discussing in our study is that

we fixed the parameters for each registration method. Two questions may arise: whether we should fix parameters, and at which values we should fix them. For the first question, we fixed registration parameters because the aim was to test whether some methods can be used in many large-scale, often multi-institutional, translational studies. This was a high standard and quite an ambitious aim that did not appear in previous evaluation studies. Facing many registration tasks and many databases, normal or pathological, single- or multi-site, in daily translational research, it is usually impractical to tune parameters for each specific task/database. Therefore, we fixed parameter values in our study. We note that this does not necessarily reflect a registration method's stability or sensitivity with regard to parameter changes. Stability is another preferable feature of registration methods. To better reveal stability or sensitivity, a future study is needed to examine how registration algorithms perform over a wide range of parameter values; that is, to thoroughly investigate how registration accuracy changes as the values of registration parameters change. The difficulty lies in the determination of effective, often multivariate, parameter ranges, and in the objective comparison of parameters (or parameter ranges) among different registration algorithms. For the second question (at which values we should fix the parameters), the parameter values we used in this paper were those optimal for one typical registration task. They were the values suggested by the authors of the algorithms themselves, and hence most likely to be adopted by ordinary



users, or to be tried in the first pass by algorithm developers. We note that they are not necessarily optimal for the other three tasks, nor for the four tasks altogether. In the future, a more complete study may be needed to "learn" the (range of) parameter values that are best overall across the four tasks included in this paper. A highly sophisticated learning framework needs to be designed, such as [21] and its extensions. It will require a larger data size for training and testing, and familiarity with the implementation details of each registration algorithm.

In our experiments, we used fewer images and fewer

databases to represent Challenges 3–4 than Challenges 1–2. Specifically, only 10 ADNI subjects (three AD, four MCI, three NC), or 90 pair-wise registrations, were used to test registration methods against challenges arising in multi-site databases; and only eight patients with recurrent tumors were used to test registration methods against pathology-induced missing correspondences. Because of the relatively small data size, a decisive conclusion on how registration methods perform facing Challenges 3–4 may need to be deferred to future larger-scale studies. What we want to emphasize is that, although small in data size, those experiments were among the very first in the literature to evaluate general-purpose registration methods on multi-site data and on pathological data. Therefore, those experiments could serve as a proof of concept that some registration algorithms may have the potential to work reasonably well in those difficult cases. A future study with a larger data size is needed in this direction. Acquiring multi-site or pathological data, and especially acquiring expert annotations of landmarks and/or ROIs on those data, is in itself a nontrivial problem.

As in studies [10], [11], [14]–[18], [98], expert-defined annotations of ROIs or landmarks served as the references for evaluating registration accuracy in our study. One thing to note is that expert annotations may be subject to intra-/inter-expert variability, and may have a certain level of uncertainty or even errors. Therefore, a perfect and completely error-free registration algorithm may still show some minor errors under the current criteria for assessing registration accuracy, because of the uncertainty in the expert annotation of landmarks or ROIs. Quantifying such uncertainty is another topic for future studies. To reduce the influence of the variability or uncertainty of expert annotations, we either used two independent experts in some databases, or used multiple databases for a specific task, where different databases were annotated by different experts, to reduce the chance of systematic uncertainties or errors.

In our study, we only evaluated registration accuracy, which is what most previous studies did [12], [17], [96]–[98]. Some recent studies (e.g., [16], [99]) have started to look at more comprehensive criteria, including registration smoothness. The argument is that many registration algorithms might achieve fairly high structural overlaps with very aggressive underlying deformations. Therefore, studies like [16], [99] started to put registration accuracy in the context of registration smoothness, and evaluated both properties. Debates exist, though, for two reasons. First, whether the emphasis should be on accuracy or on smoothness usually depends on the specific application. For instance, atlas-based segmentation frameworks may need a more aggressive registration to obtain higher structural overlaps, whereas population studies using voxel-wise statistics may need a smoother registration to better balance the differences among individuals and the commonality within a population. Second, while the criteria for registration accuracy can be based on landmark errors or regional overlaps, the criteria for registration smoothness are relatively loosely defined. Some studies used Jacobian determinants [99], [101]: negative Jacobian determinants, indicating self-folding (i.e., non-diffeomorphism), should be penalized, as human organs deform smoothly. In the presence of cross-individual anatomical differences, whether the deformation should be strictly diffeomorphic remains a topic of debate. Moreover, the computation of Jacobian determinants requires a numerical approximation of the continuum from the discrete image space, and usually varies across software packages. Consequently, other studies (e.g., [14]–[16], [89], [102]) used additional metrics such as transitivity and inverse consistency to measure the smoothness and diffeomorphism of deformation fields. Nevertheless, the wide adoption of those metrics needs more studies, and the balance between accuracy and smoothness seems a task-specific choice. This is another reason that our future studies need to thoroughly examine a range of parameter values, to see how registration accuracy and smoothness change as parameter values change, so that users may make a more informed choice about algorithms, their implementation tools, and a proper set of parameters for the problems at hand.

Besides registration accuracy or smoothness, a perhaps more

important task for our future study is the utility of registration methods in various clinical applications. For example, in multi-atlas-based segmentation propagation, the most accurate registration of a single atlas may not necessarily yield the highest overall segmentation accuracy. Choosing a proper registration method, or a proper set of registration parameters, in this case is subject to other interleaved factors such as label fusion. Another example is the detection of atrophy or growth patterns, which is usually an important component in studies of neuro-degeneration [103]–[108] or neuro-development [109]–[112]. There, the most accurate or the smoothest registration may not necessarily be the optimal choice for deciphering subtle patterns [113]. Similar situations also occur in brain extraction (e.g., [19]), the quantification of longitudinal disease change (e.g., [83], [114]), and cognitive neuroscience (e.g., [115]). Therefore, much interest lies in putting registration into the big picture of the end (pre-)clinical goals and considering its interactions with other factors.

To further improve registration accuracy, one interesting

topic is to utilize multiple registration methods in a meta-analysis framework, where one method may underperform or fail in a task/database/individual, but others may make up the loss. Several newly published articles support this consensus process, or so-called meta-registration, such as [72], [116]–[118]. Doshi et al. recently combined ANTs and DRAMMS in a multi-atlas labeling framework, resulting in a top-performing method in a MICCAI segmentation challenge [72]. Muenzing et al. recently combined ANTs, NiftyReg, and DROP and showed a significant reduction of registration errors in pulmonary images [118]. These studies further underline the potentially synergistic value of various registration methods. Another interesting avenue for improving registration accuracy, especially when registration algorithms face challenges from large anatomical variabilities or imaging-device differences (e.g., 2-D ultrasound to 3-D MRI), is to better initialize general-purpose registration algorithms with anatomic or geometric landmarks. Several recent studies are in this direction, on how to accurately and automatically extract useful landmarks [77], [79], [119], and how to effectively combine landmarks and voxel-wise information in an elegant and practical framework [76], [92], [120]–[123].

D. Conclusion

In conclusion, this paper conducted a comparative evaluation of 12 publicly-available, general-purpose image registration methods in the context of inter-subject registration of brain MR images. In contrast to existing works that evaluated registration methods in a specific task or a specific database, the emphasis of this work was on evaluating the generality, accuracy, and robustness of registration methods in various inter-subject brain MR image registration tasks posing various challenges, involving multiple databases containing skull-stripped images, noisy and raw images, multi-site data, and pathology-bearing images. Our experiments showed that DRAMMS and ANTs achieved relatively higher accuracy on skull-stripped images, followed by ART, Demons, DROP, and FFD. DRAMMS showed relatively higher accuracy when raw, multi-site, or pathology-bearing data were involved. The main reasons might be that 1) DRAMMS measures the similarity between voxels by a rich set of texture attributes while other methods use intensities; and 2) DRAMMS has the mutual-saliency mechanism to automatically quantify the matching reliability of voxels and to use the highly reliable matchings to drive the registration, while other methods use all voxels equally. The fact that ANTs, ART, Demons, DROP, and SSD/MI-FFD performed relatively accurately among all intensity-based algorithms highlights the importance of 1) choosing image similarity metrics, such as the correlation coefficient and mutual information, that are relatively more robust to noise, intensity inhomogeneity, and partial-volume effects; and 2) choosing transformation models that have sufficient degrees of freedom but are well regularized by diffeomorphism or even a symmetric design with regard to the two input images.

Our future work will include participation in third-party-organized grand challenges, further investigation of registration performance with regard to varying parameter values, and a more extensive evaluation including more data, more registration methods, more comprehensive evaluation criteria, and especially more clinically-oriented applications. Another very interesting direction is to develop a meta-registration paradigm, in which registration methods complement rather than compete with each other. This has the potential to bring accuracy, robustness, and generality to a new and higher level.

APPENDIX A
SUMMARY OF THE 12 PUBLICLY-AVAILABLE METHODS INCLUDED IN THIS EVALUATION

Appendix A supplements Section III-B by providing more detail on the 12 publicly-available registration methods included in this evaluation study.
• flirt. An affine transformation assumes that all voxels in the image move together with 12 degrees of freedom (dof) in 3-D space: three dof for scaling, three for translation, three for rotation, and three for shearing. The publicly-available FMRIB's Linear Image Registration Tool (flirt) [48] from the FSL package is perhaps the most cited work for affine registration, as evidenced by more than 1500 citations in Google Scholar since its publication in 2001 (as of March 2013). The main contribution of flirt is its optimization strategy: a global, multi-start, multi-resolution optimization designed specifically for affine image registration. The fundamental idea is to combine a fast local optimization (Powell's conjugate-gradient descent method [124]) with an initial search phase. The flirt tool supports a variety of intensity-based similarity metrics, including Least-Squares Intensity Difference (LS), (Normalized) Correlation Coefficient ((N)CC), and (Normalized) Mutual Information ((N)MI). The flirt tool is publicly available at http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FLIRT.
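A sketch of composing the 12 dof into a single homogeneous matrix follows. The decomposition order below is an illustrative choice, not flirt's internal parameterization, which may differ.

```python
import numpy as np

def affine_12dof(scale, translate, rotate, shear):
    """Compose a 4x4 homogeneous affine from four 3-vectors of
    parameters; rotations are Euler angles (radians) about x, y, z."""
    sx, sy, sz = scale
    S = np.diag([sx, sy, sz, 1.0])
    hxy, hxz, hyz = shear
    H = np.array([[1.0, hxy, hxz, 0.0],
                  [0.0, 1.0, hyz, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    rx, ry, rz = rotate
    Rx = np.array([[1, 0, 0, 0],
                   [0, np.cos(rx), -np.sin(rx), 0],
                   [0, np.sin(rx),  np.cos(rx), 0],
                   [0, 0, 0, 1]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry), 0],
                   [0, 1, 0, 0],
                   [-np.sin(ry), 0, np.cos(ry), 0],
                   [0, 0, 0, 1]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0, 0],
                   [np.sin(rz),  np.cos(rz), 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]])
    T = np.eye(4)
    T[:3, 3] = translate
    # Apply scaling, then shearing, then rotations, then translation.
    return T @ Rz @ Ry @ Rx @ H @ S
```

Optimizing these 12 parameters against a similarity metric, rather than a dense deformation field, is what makes affine registration a comparatively small search problem.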

• AIR. The Automatic Image Registration (AIR) tool was developed by researchers (Woods et al.) at the University of California, Los Angeles (UCLA) in the 1990s [50], [51]. AIR's contributions are in its deformation models: it models the deformation by second-, third-, fourth-, and fifth-order nonlinear polynomials (30, 60, 105, and 168 deformation parameters, respectively). To find the optimal values of the deformation parameters, AIR minimizes a cost function between the deformed image and the target image. Three cost functions are supported. The first is the ratio-image uniformity (RIU): the ratio of the intensity in the deformed image to the intensity in the target image is computed at each voxel location, and the deformed and target images are assumed to be aligned when this ratio is homogeneous (i.e., high mean value and low standard deviation) over the entire target image space. The second cost function is the Sum of Squared Differences (SSD) of image intensities. The SSD cost function assumes that the same anatomical structure shares the same intensity in different images; therefore, two images are aligned when the differences in their intensities are minimized. The third cost function is a relaxed variant of the second: instead of assuming that the same anatomical structure shares the same intensity in different images, it assumes the same anatomical structure shares the same intensity up to a global scaling factor. Therefore, it is a globally-weighted SSD cost function. To minimize the cost function, a numerical optimizer based on either Newton's method [125] or the Levenberg–Marquardt method [126] is used. The AIR registration tool is publicly available at http://bishopw.loni.ucla.edu/air5/.
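The three cost functions can be sketched as follows; these are assumed straightforward implementations of the stated definitions, not AIR's code.

```python
import numpy as np

def riu(deformed, target, eps=1e-12):
    """Ratio-image uniformity: std/mean of the voxel-wise intensity
    ratio; small when the ratio is homogeneous across the image."""
    ratio = deformed / (target + eps)
    return ratio.std() / ratio.mean()

def ssd(deformed, target):
    """Sum of squared intensity differences."""
    return float(((deformed - target) ** 2).sum())

def scaled_ssd(deformed, target):
    """SSD after a global least-squares intensity scaling of the
    deformed image (AIR's relaxed third cost function)."""
    s = (deformed * target).sum() / (deformed * deformed).sum()
    return float(((s * deformed - target) ** 2).sum())
```

Note how the three costs respond differently to a global intensity scaling of one image: plain SSD is inflated, while RIU (a constant ratio) and the scaled SSD both remain near zero.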

• ART. The Automatic Registration Toolbox (ART) was developed by researchers (Ardekani et al.) at the Nathan Kline Institute for Psychiatric Research and at New York University in 2005 [52]. Its contribution is in defining a new similarity metric. In the ART framework, each voxel is characterized by a high-dimensional feature vector constructed by stacking the intensity values of all voxels in its neighborhood. Compared with most voxel-wise methods, which establish correspondences by the intensity at each voxel, ART establishes correspondences by the high-dimensional feature vector at each voxel. The similarity of two voxels is defined on their feature vectors as the inner product of the two vectors through an idempotent, symmetric matrix that removes the mean of any vector it pre-multiplies. ART is implemented in a multi-resolution fashion. At the middle and low image resolutions, ART searches for correspondences for all voxels in the target image. At the highest image resolution, ART only searches for correspondences at voxels whose gradient norms are in a certain upper percentile of the gradient-magnitude histogram. The ART registration tool is publicly available at http://www.nitrc.org/projects/art/.
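The mean-removing inner product can be sketched as follows. With the centering matrix H = I - (1/n) 11^T, which is symmetric and idempotent, the product a^T H b equals the plain inner product of the two demeaned vectors (illustrative code, not ART's implementation):

```python
import numpy as np

def art_similarity(feat_a, feat_b):
    """Inner product of two stacked-intensity feature vectors through
    the centering matrix H = I - (1/n) * ones(n, n)."""
    n = len(feat_a)
    H = np.eye(n) - np.ones((n, n)) / n   # removes the vector mean
    return float(feat_a @ H @ feat_b)

def art_similarity_fast(feat_a, feat_b):
    """Equivalent form: since H is idempotent and symmetric,
    a^T H b = (a - mean(a)) . (b - mean(b))."""
    return float((feat_a - feat_a.mean()) @ (feat_b - feat_b.mean()))
```

The demeaning makes the similarity insensitive to a constant intensity offset between the two neighborhoods, which a raw inner product would not be.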

• ANTs. The Advanced Normalization Tools (ANTs) package was developed by researchers (Avants et al.) at the University of Pennsylvania in the 2000s [53]. It is based on the LDDMM algorithm [40], but improves LDDMM's computational efficiency and introduces symmetry into the LDDMM framework. In particular, ANTs decomposes the diffeomorphic deformation into two symmetric components. The idea is that, instead of deforming one image into the space of the other, which is usually not symmetric with respect to the input images, ANTs simultaneously deforms both input images, each towards a "midpoint" image; the formulation thus becomes symmetric with respect to the input images. ANTs supports three intensity-based similarity metrics: SSD, which LDDMM uses; CC, the default in ANTs; and MI. A gradient-descent optimization strategy [127] is used to numerically find the optimal deformation in the above formulation. ANTs iteratively updates the transformation by a velocity field, which, subject to the symmetry constraints, is computed at each voxel by searching for the most similar voxel in the other image, according to whichever similarity metric the user chooses (by default, the correlation coefficient). The velocity field is then smoothed by Gaussian filters before being incorporated into the total deformation. The ANTs registration tool is publicly available at http://stnava.github.io/ANTs/.

• Demons and Diffeomorphic Demons. The Demons algorithm [57], [128] treats deformable image registration as a nonparametric diffusion process. It introduces "demons" that push voxels towards their correspondences according to local intensity characteristics. Using intensity difference as the similarity metric, a force is computed from the optical flow equations to push voxels by a velocity that is iteratively added to the total displacement (initially zero). The total displacement is then smoothed with a Gaussian filter, which serves as a regularization of the deformation. In the original version [57], the velocity field is simply added to the current deformation in each iteration, which may cause self-foldings in the deformation field and is hence not diffeomorphic. To solve this problem, Diffeomorphic Demons was developed in [128], which composes the deformation with the exponential of the velocity field. The Diffeomorphic Demons version has been shown to produce a much smoother deformation (as measured by the Jacobian and the harmonic energy of the obtained deformation) at an accuracy comparable to that of the original Demons [128]. In both approaches, the deformation is iteratively updated by a velocity field that encodes voxel-wise displacements computed according to a specific similarity metric, and both the increment and the total deformation are smoothed by Gaussian filters to maintain smoothness. In this paper, both Demons and Diffeomorphic Demons are included in the evaluation. The Demons software (including Diffeomorphic Demons) is publicly available at http://www.insight-journal.org/browse/publication/154.
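The optical-flow-derived demons force has a simple closed form; a 1-D sketch of one velocity computation follows (our notation; the Gaussian smoothing of the field and the choice of additive versus exponential-composition update are noted in the comments):

```python
def demons_velocity(fixed, moving):
    # One demons iteration in 1-D (illustrative sketch):
    #   v = (m - f) * grad(f) / (|grad(f)|^2 + (m - f)^2),
    # the optical-flow-derived force described above.  In the original
    # (additive) Demons, the Gaussian-smoothed v is added to the total
    # displacement; Diffeomorphic Demons instead composes the current
    # deformation with the exponential of the velocity field.
    n = len(fixed)
    v = []
    for i in range(n):
        # central-difference gradient of the fixed image (edge-clamped)
        grad = 0.5 * (fixed[min(i + 1, n - 1)] - fixed[max(i - 1, 0)])
        diff = moving[i] - fixed[i]
        denom = grad * grad + diff * diff
        v.append(diff * grad / denom if denom > 1e-12 else 0.0)
    return v
```

On a linear intensity ramp shifted by one voxel, the force correctly points back towards the correspondence, with magnitude damped by the denominator.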

• DRAMMS. Deformable Registration via Attribute Matching and Mutual-Saliency Weighting (DRAMMS) [56] is a general-purpose deformable registration algorithm. Its main contribution is a new similarity metric with two features. First, it finds voxel correspondences based on high-dimensional Gabor texture attributes. Second, it does not force each and every voxel to find correspondences, as most other general-purpose registration methods do. Rather, it weights voxels differently, based on automatically estimating each voxel's ability to establish correspondences between images. This way, the registration is driven mainly by voxels/regions that can establish reliable correspondences; at the same time, the negative impact of outlier regions, if they exist, is effectively reduced. DRAMMS uses the cubic B-spline transformation model, which will be described later under the FFD algorithm. The DRAMMS software package is publicly available at http://www.cbica.upenn.edu/sbia/software/dramms/.
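The weighting idea can be illustrated with a toy weighted matching cost. In this sketch the attribute vectors and the per-voxel weights are supplied by hand; DRAMMS computes its Gabor attributes and its mutual-saliency weights automatically.

```python
def weighted_matching_cost(attr_fixed, attr_moving, weights):
    # Sketch of a weighted voxel-matching term: instead of every voxel
    # contributing equally, each voxel's attribute distance is scaled by
    # a weight in [0, 1], so voxels judged unable to find reliable
    # correspondences (e.g., inside a lesion) are down-weighted and do
    # not pull the registration towards spurious matches.
    total = 0.0
    for f, m, w in zip(attr_fixed, attr_moving, weights):
        dist = sum((fi - mi) ** 2 for fi, mi in zip(f, m))
        total += w * dist
    return total
```

With uniform weights an outlier voxel dominates the cost; setting its weight to zero removes its influence entirely.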

• DROP. DROP is the implementation of a deformable image registration pipeline developed by researchers (Glocker et al.) at the Ecole Centrale de Paris, France,


OU et al.: COMPARATIVE EVALUATION OF REGISTRATION ALGORITHMS IN DIFFERENT BRAIN DATABASES WITH VARYING DIFFICULTY 2059

the Technische Universität München, Germany, and the University of Crete, Greece [58]. The main contribution of DROP is in the numerical optimization: discrete optimization was, for the first time, introduced into the field of medical image registration, bringing a significant speedup compared with continuous optimizers such as gradient descent (8 min by discrete optimization versus 3 h 50 min by gradient descent for the same set of brain images, as reported in [58]). DROP uses FFD as its deformation model (described later in this section) and supports 12 intensity-based similarity metrics, including the Sum of Absolute Differences (SAD), which is used by default, the Sum of Absolute Differences plus Sum of Gradient Inner Products (SADG), the Sum of Squared Differences (SSD), the Normalized Correlation Coefficient (NCC), Normalized Mutual Information (NMI), the Correlation Ratio (CR), and the Sum of Gradient Inner Products (SGAD), among others. The DROP registration tool is publicly available at http://www.mrf-registration.net/.
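Registration as discrete labeling can be illustrated on a 1-D chain of control points, where each node picks a displacement label and neighboring labels are penalized for disagreeing; the chain case is exactly solvable by dynamic programming. This Viterbi sketch and its names are ours — DROP's actual 3-D MRF is solved with efficient linear-programming techniques instead.

```python
def mrf_chain_labels(data_cost, labels, lam):
    # Exact discrete optimization on a 1-D chain MRF: node i picks a
    # displacement label minimizing data_cost[i][label] plus a pairwise
    # smoothness penalty lam * |l_i - l_j| with its neighbor.  Forward
    # pass accumulates costs; backward pass recovers the labeling.
    n, k = len(data_cost), len(labels)
    cost = [data_cost[0][a] for a in range(k)]
    back = []
    for i in range(1, n):
        prev, cost, ptr = cost, [], []
        for a in range(k):
            best, arg = float("inf"), 0
            for b in range(k):
                c = prev[b] + lam * abs(labels[a] - labels[b])
                if c < best:
                    best, arg = c, b
            cost.append(best + data_cost[i][a])
            ptr.append(arg)
        back.append(ptr)
    a = min(range(k), key=lambda x: cost[x])
    out = [a]
    for ptr in reversed(back):
        a = ptr[a]
        out.append(a)
    out.reverse()
    return [labels[a] for a in out]
```

With a large smoothness weight, the optimizer sacrifices individual data terms to keep neighboring displacements consistent — the essence of the MRF formulation.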

• FFD and its variants (CC-/MI-/SSD-FFD). The free-form deformation (FFD) model is a geometric transformation model that was introduced to image registration by [54] in 1999. Since then it has been widely used, cited more than 2,400 times in the Google Scholar search engine (as of March 2013). In the FFD model, a regular grid of so-called "control points" is superimposed on top of the dense image lattice. The FFD model states that the movement of an image voxel is a smooth, cubic-B-spline-based interpolation of the displacements of the control points surrounding this voxel. The task of finding the movement at each voxel is therefore translated into finding the displacements at regularly-spaced control points. In contrast to voxel-based methods, FFD has three nice properties: 1) control points are regularly spaced in the image, providing guidance throughout the image domain; 2) the deformation is smooth, owing to the cubic-B-spline-based interpolation; 3) whereas a landmark moves by finding its own correspondence, a control point moves by finding the most probable correspondence for the image patch it controls, or represents. In the original work [54], FFD was combined with the normalized mutual information (NMI) similarity metric. It can also be combined with several other similarity metrics, including the correlation coefficient (CC) and the sum of squared differences (SSD). In this paper, we used the IRTK software package, which is the original implementation of FFD-based registration. NMI-FFD, CC-FFD, and SSD-FFD were all included in the evaluation. The software is publicly available at http://www.doc.ic.ac.uk/~dr/software/.
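A 1-D sketch of the FFD interpolation described above: the displacement at a voxel is a cubic B-spline blend of the displacements of its four surrounding control points (function names are ours; the real model is 3-D, using a tensor product of these same bases):

```python
def bspline_basis(u):
    # Cubic B-spline basis functions B0..B3 at local coordinate u in [0, 1).
    return [
        (1 - u) ** 3 / 6.0,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    ]

def ffd_displacement(x, control_disp, spacing):
    # Displacement at voxel position x as the B-spline combination of the
    # displacements of the 4 nearest control points, spaced `spacing`
    # apart (indices edge-clamped for a finite grid).
    i = int(x // spacing)
    u = x / spacing - i
    b = bspline_basis(u)
    n = len(control_disp)
    return sum(
        b[k] * control_disp[min(max(i - 1 + k, 0), n - 1)] for k in range(4)
    )
```

Because the four basis functions sum to one (partition of unity), a constant control-point displacement yields exactly that constant displacement everywhere, and the interpolated field is smooth by construction.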

• fnirt. The FMRIB Nonlinear Image Registration Tool (fnirt) was developed by researchers (Andersson, Smith, and Jenkinson) at the University of Oxford, Oxford, U.K., in 2007 [49], [129]. The fnirt method uses the Sum of Squared Differences (SSD) as its similarity metric, and is therefore only suitable for mono-modality image registration tasks. It implements the free-form deformation (FFD) model. The deformation is regularized by the magnitude of the Laplacian of the deformation (also known as the bending energy of the deformation). The main contribution is in the optimization process [129], which is based on a multi-resolution Levenberg-Marquardt strategy [126]. The registration is initialized and run to convergence on down-sampled images, generating a low-resolution deformation field under a high regularization weight. The images and the deformation field from the first step are then up-sampled, the regularization weight is modified, and the registration is again run to convergence. This is repeated until the required high resolution and the required level of regularization are achieved. The fnirt registration tool is publicly available at http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/fnirt.
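The bending-energy regularizer mentioned above penalizes the squared magnitude of the Laplacian of the deformation; a discrete 1-D sketch (our names) shows that it is large for wiggly displacement fields and vanishes for affine (linear) ones:

```python
def bending_energy_1d(disp):
    # Discrete 1-D analogue of the Laplacian-magnitude regularizer: sum
    # of squared second differences of the displacement field.  Linear
    # (affine) fields have zero second difference, hence zero penalty.
    total = 0.0
    for i in range(1, len(disp) - 1):
        lap = disp[i - 1] - 2 * disp[i] + disp[i + 1]
        total += lap * lap
    return total
```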

APPENDIX B
PARAMETER SETTINGS FOR REGISTRATION METHODS

In Section III-C, we presented two rules for setting the parameters of each registration method, in order to maintain the fairness of the evaluation. Appendix B below provides the details of the parameter settings, for transparency and reproducibility of the evaluation.

• flirt. …

• AIR. …

• ART. …

• ANTs. …

• (Additive) Demons. …

• Diffeomorphic Demons. …

• DRAMMS. …

• DROP. …, where … specifies all parameters. They are: ….

• CC-FFD. …, where … specifies all parameters. They are: …. Note that, in [17], …; we set … to increase the smoothness of the deformation.

• MI-FFD. The usage is the same as in CC-FFD above, with the only difference being one item in …: ….

• SSD-FFD. The usage is the same as in CC-FFD above, with the only difference being two items in …: ….

• fnirt. …

APPENDIX C
JACCARD OVERLAP FOR ALL ROIS IN THE NIREP AND LONI DATABASES

Table III shows Jaccard overlaps, averaged over all pair-wise registrations, in all 32 ROIs in the NIREP database. Methods that obtained the highest average Jaccard overlap in each ROI are noted in bold text. Similarly, Table V shows overlaps, averaged over all pair-wise registrations, in all 56 ROIs in the LONI-LPBA40 database. They provide more information than Fig. 7, especially on the level of accuracy one can expect for each ROI structure in inter-subject registration. For instance, the hippocampus may have over 0.6 Jaccard overlap in inter-subject registration, while the superior occipital gyrus may only have around 0.45 Jaccard overlap.

TABLE III
JACCARD OVERLAP FOR EACH OF THE 32 ROIS IN THE NIREP DATABASE, AVERAGED AMONG ALL 240 REGISTRATIONS. FOR EACH ROI, BOLD TEXT AND LIGHT-GRAY TEXT INDICATE THAT THE CORRESPONDING REGISTRATION METHODS OBTAIN THE HIGHEST AND THE SECOND HIGHEST OVERLAP COMPARED TO OTHER REGISTRATION METHODS
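The Jaccard overlap reported throughout these appendices is the intersection-over-union of two binary ROI masks; for flattened masks it can be computed as follows (our sketch):

```python
def jaccard(mask_a, mask_b):
    # Jaccard overlap of two binary ROI masks: |A ∩ B| / |A ∪ B|.
    # Returns 0.0 when both masks are empty (undefined union).
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0
```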

TABLE IV
JACCARD OVERLAP FOR EACH OF THE 11 ROIS IN THE BRAINWEB DATABASE, AVERAGED AMONG ALL 110 REGISTRATIONS. FOR EACH ROI, BOLD TEXT AND LIGHT-GRAY TEXT INDICATE THAT THE CORRESPONDING REGISTRATION METHODS OBTAIN THE HIGHEST AND THE SECOND HIGHEST OVERLAP COMPARED TO OTHER REGISTRATION METHODS

APPENDIX D
JACCARD OVERLAP FOR ALL ROIS IN THE BRAINWEB DATABASE

Table IV shows the average Jaccard overlap in all 11 ROIs in the BrainWeb database. Methods that obtained the highest average Jaccard overlap in each ROI are noted in bold text.


TABLE V
JACCARD OVERLAP FOR EACH OF THE 56 ROIS IN THE LONI-LPBA40 DATABASE, AVERAGED AMONG ALL 210 REGISTRATIONS. FOR EACH ROI, BOLD TEXT AND LIGHT-GRAY TEXT INDICATE THAT THE CORRESPONDING REGISTRATION METHODS OBTAIN THE HIGHEST AND THE SECOND HIGHEST OVERLAP COMPARED TO OTHER REGISTRATION METHODS


REFERENCES

[1] A. Sotiras, C. Davatzikos, and N. Paragios, "Deformable medical image registration: A survey," IEEE Trans. Med. Imag., vol. 32, no. 7, pp. 1153–1190, Jul. 2013.

[2] A. Sotiras, Y. Ou, N. Paragios, and C. Davatzikos, "Graph-based deformable image registration," in Handbook of Biomedical Imaging: Methodologies and Clinical Research, N. Paragios, J. Duncan, and N. Ayache, Eds. New York: Springer, 2015.

[3] J. B. A. Maintz and M. A. Viergever, "A survey of medical image registration," Med. Image Anal., vol. 2, no. 1, pp. 1–36, 1998.

[4] H. Lester and S. Arridge, "A survey of hierarchical non-linear medical image registration," Pattern Recognit., vol. 32, no. 1, pp. 129–149, 1999.

[5] D. L. Hill, P. G. Batchelor, M. Holden, and D. J. Hawkes, "Medical image registration," Phys. Med. Biol., vol. 46, no. 3, pp. 1–45, 2001.

[6] B. Zitova and J. Flusser, "Image registration methods: A survey," Image Vis. Comput., vol. 21, no. 11, pp. 977–1000, 2003.

[7] J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever, "Mutual-information-based registration of medical images: A survey," IEEE Trans. Med. Imag., vol. 22, no. 8, pp. 986–1004, Aug. 2003.

[8] W. R. Crum, T. Hartkens, and D. L. Hill, "Non-rigid image registration: Theory and practice," Br. J. Radiol., vol. 77, no. 2, pp. 140–153, 2004.

[9] M. Holden, "A review of geometric transformations for nonrigid body registration," IEEE Trans. Med. Imag., vol. 27, no. 1, pp. 111–128, Jan. 2008.

[10] J. West et al., "Comparison and evaluation of retrospective intermodality brain image registration techniques," J. Comput. Assist. Tomogr., vol. 21, no. 4, pp. 554–568, 1997.

[11] P. Hellier, C. Barillot, I. Corouge, B. Gibaud, G. Le Goualher, L. Collins, A. Evans, G. Malandain, and N. Ayache, "Retrospective evaluation of inter-subject brain registration," in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2001. New York: Springer, 2001, pp. 258–265.

[12] I. Yanovsky, A. D. Leow, S. Lee, S. J. Osher, and P. M. Thompson, "Comparing registration methods for mapping brain change using tensor-based morphometry," Med. Image Anal., vol. 13, no. 5, p. 679, 2009.

[13] M. A. Yassa and C. E. Stark, "A quantitative evaluation of cross-participant registration techniques for MRI studies of the medial temporal lobe," NeuroImage, vol. 44, no. 2, pp. 319–327, 2009.

[14] G. E. Christensen, X. Geng, J. G. Kuhl, J. Bruss, T. J. Grabowski, I. A. Pirwani, M. W. Vannier, J. S. Allen, and H. Damasio, "Introduction to the non-rigid image registration evaluation project (NIREP)," in Biomedical Image Registration. New York: Springer, 2006, pp. 128–135.

[15] J. H. Song, G. E. Christensen, J. A. Hawley, Y. Wei, and J. G. Kuhl, "Evaluating image registration using NIREP," in Biomedical Image Registration. New York: Springer, 2010, pp. 140–150.

[16] Y. Wei, "Non-rigid image registration evaluation using common evaluation databases," M.S. thesis, Univ. Iowa, Iowa City, 2009.

[17] A. Klein et al., "Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration," NeuroImage, vol. 46, no. 3, pp. 786–802, 2009.

[18] A. Klein, S. S. Ghosh, B. Avants, B. T. Yeo, B. Fischl, B. Ardekani, J. C. Gee, J. J. Mann, and R. V. Parsey, "Evaluation of volume-based and surface-based brain image registration methods," NeuroImage, vol. 51, no. 1, pp. 214–220, 2010.

[19] J. Doshi, G. Erus, Y. Ou, B. Gaonkar, and C. Davatzikos, "Multi-atlas skull stripping," Acad. Radiol., vol. 20, no. 12, pp. 1566–1576, 2013.

[20] G. Wu, F. Qi, and D. Shen, "Learning-based deformable registration of MR brain images," IEEE Trans. Med. Imag., vol. 25, no. 9, pp. 1145–1157, 2006.

[21] B. T. Yeo, M. R. Sabuncu, T. Vercauteren, D. J. Holt, K. Amunts, K. Zilles, P. Golland, and B. Fischl, "Learning task-optimal registration cost functions for localizing cytoarchitecture and function in the cerebral cortex," IEEE Trans. Med. Imag., vol. 29, no. 7, pp. 1424–1441, Jul. 2010.

[22] F. Michel, M. Bronstein, A. Bronstein, and N. Paragios, "Boosted metric learning for 3-D multi-modal deformable registration," in Proc. IEEE Int. Symp. Biomed. Imag.: From Nano to Macro, 2011, pp. 1209–1214.

[23] M. Kim, G. Wu, P.-T. Yap, and D. Shen, "A general fast registration framework by learning deformation-appearance correlation," IEEE Trans. Image Process., vol. 21, no. 4, pp. 1823–1833, Apr. 2012.

[24] A. Valsecchi, J. Dubois-Lacoste, T. Stutzle, S. Damas, J. Santamaria, and L. Marrakchi-Kacem, "Evolutionary medical image registration using automatic parameter tuning," in Proc. 2013 IEEE Congr. Evolut. Comput., 2013, pp. 1326–1333.

[25] K. K. Leung, J. Barnes, M. Modat, G. R. Ridgway, J. W. Bartlett, N. C. Fox, and S. Ourselin, "Brain maps: An automated, accurate and robust brain extraction technique using a template library," NeuroImage, vol. 55, no. 3, pp. 1091–1108, 2011.

[26] S. Parisot, W. Wells, III, S. Chemouny, H. Duffau, and N. Paragios, "Concurrent tumor segmentation and registration with uncertainty-based sparse non-uniform graphs," Med. Image Anal., vol. 18, no. 4, pp. 647–659, 2014.

[27] M. Brett, A. P. Leff, C. Rorden, and J. Ashburner, "Spatial normalization of brain images with focal lesions using cost function masking," NeuroImage, vol. 14, no. 2, pp. 486–500, 2001.

[28] B. M. Dawant, S. Hartmann, and S. Gadamsetty, "Brain atlas deformation in the presence of large space-occupying tumors," in Medical Image Computing and Computer-Assisted Intervention—MICCAI'99. New York: Springer, 1999, pp. 589–596.

[29] M. Cuadra, C. Pollo, A. Bardera, O. Cuisenaire, J. Villemure, and J. Thiran, "Atlas-based segmentation of pathological MR brain images using a model of lesion growth," IEEE Trans. Med. Imag., vol. 23, no. 10, pp. 1301–1314, Oct. 2004.

[30] A. Mohamed et al., "Deformable registration of brain tumor images via a statistical model of tumor-induced deformation," Med. Image Anal., vol. 10, no. 5, pp. 752–763, 2006.

[31] E. Zacharaki, D. Shen, S. Lee, and C. Davatzikos, "ORBIT: A multiresolution framework for deformable registration of brain tumor images," IEEE Trans. Med. Imag., vol. 27, no. 8, pp. 1003–1017, Aug. 2008.

[32] A. Gooya, G. Biros, and C. Davatzikos, "Deformable registration of glioma images using EM algorithm and diffusion reaction modeling," IEEE Trans. Med. Imag., vol. 30, no. 2, pp. 375–390, Feb. 2011.

[33] B. Wang, M. Prastawa, A. Saha, S. P. Awate, A. Irimia, M. C. Chambers, P. M. Vespa, J. D. Van Horn, V. Pascucci, and G. Gerig, "Modeling 4D changes in pathological anatomy using domain adaptation: Analysis of TBI imaging using a tumor database," in Multimodal Brain Image Analysis. New York: Springer, 2013, pp. 31–39.

[34] D. Kwon, M. Niethammer, H. Akbari, M. Bilello, C. Davatzikos, and K. Pohl, "PORTR: Pre-operative and post-recurrence brain tumor registration," IEEE Trans. Med. Imag., vol. 33, no. 3, pp. 651–667, Mar. 2014.

[35] D. Shattuck, M. Mirza, V. Adisetiyo, C. Hojatkashani, G. Salamon, K. Narr, R. Poldrack, R. Bilder, and A. W. Toga, "Construction of a 3-D probabilistic atlas of human cortical structures," NeuroImage, vol. 39, pp. 1064–1080, 2008.

[36] J. G. Sled, A. P. Zijdenbos, and A. C. Evans, "A nonparametric method for automatic correction of intensity nonuniformity in MRI data," IEEE Trans. Med. Imag., vol. 17, no. 1, pp. 87–97, Jan. 1998.

[37] B. Aubert-Broche, M. Griffin, G. Pike, A. Evans, and D. Collins, "Twenty new digital brain phantoms for creation of validation image data bases," IEEE Trans. Med. Imag., vol. 25, no. 11, pp. 1410–1416, Nov. 2006.

[38] O. Tsang, A. Gholipour, N. Kehtarnavaz, K. Gopinath, R. Briggs, and I. Panahi, "Comparison of tissue segmentation algorithms in neuroimage analysis software tools," in Proc. 30th Annu. Int. Conf. IEEE EMBS, 2008, pp. 3924–3928.

[39] D. Marcus, T. Wang, J. Parker, J. Csernansky, J. Morris, and R. Buckner, "Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults," J. Cognitive Neurosci., vol. 19, no. 9, pp. 1498–1507, 2007.

[40] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," Int. J. Comput. Vis., vol. 61, no. 2, pp. 139–157, 2005.

[41] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. W. Pluim, "elastix: A toolbox for intensity-based medical image registration," IEEE Trans. Med. Imag., vol. 29, no. 1, pp. 196–205, Jan. 2010.

[42] M. Modat, G. Ridgway, Z. Taylor, M. Lehmann, J. Barnes, D. Hawkes, N. Fox, and S. Ourselin, "Fast free-form deformation using graphics processing units," Comput. Methods Programs Biomed., vol. 98, no. 3, pp. 278–284, 2010.

[43] G. Sharp, R. Li, J. Wolfgang, G. Chen, M. Peroni, M. Spadea, S. Mori, J. Zhang, J. Shackleford, and N. Kandasamy, "Plastimatch—An open source software suite for radiotherapy image processing," presented at the 16th Int. Conf. Use Comput. Radiother., Amsterdam, The Netherlands, 2010.

[44] J. Ashburner et al., "A fast diffeomorphic image registration algorithm," NeuroImage, vol. 38, no. 1, pp. 95–113, 2007.

[45] D. Shen and C. Davatzikos, "HAMMER: Hierarchical attribute matching mechanism for elastic registration," IEEE Trans. Med. Imag., vol. 21, no. 11, pp. 1421–1439, Nov. 2002.

[46] B. Fischl, "FreeSurfer," NeuroImage, 2012.


[47] B. Yeo, M. R. Sabuncu, T. Vercauteren, N. Ayache, B. Fischl, and P. Golland, "Spherical demons: Fast diffeomorphic landmark-free surface registration," IEEE Trans. Med. Imag., vol. 29, no. 3, pp. 650–668, Mar. 2010.

[48] M. Jenkinson and S. Smith, "A global optimisation method for robust affine registration of brain images," Med. Image Anal., vol. 5, no. 2, pp. 143–156, 2001.

[49] J. Andersson, S. Smith, and M. Jenkinson, "FNIRT—FMRIB non-linear image registration tool," in 14th Annu. Meet. Org. Human Brain Map., Melbourne, Australia, 2008.

[50] R. Woods et al., "Rapid automated algorithm for aligning and reslicing PET images," J. Comput. Assist. Tomogr., vol. 16, no. 4, p. 620, 1992.

[51] R. Woods, S. Grafton, C. Holmes, S. Cherry, and J. C. Mazziotta, "Automated image registration: I. General methods and intrasubject, intramodality validation," J. Comput. Assist. Tomogr., vol. 22, pp. 139–152, 1998.

[52] B. Ardekani et al., "Quantitative comparison of algorithms for inter-subject registration of 3-D volumetric brain MRI scans," J. Neurosci. Methods, vol. 142, no. 1, pp. 67–76, 2005.

[53] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee, "Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain," Med. Image Anal., vol. 12, no. 1, pp. 26–41, 2008.

[54] D. Rueckert, L. Sonoda, C. Hayes, D. Hill, M. Leach, and D. Hawkes, "Nonrigid registration using free-form deformations: Application to breast MR images," IEEE Trans. Med. Imag., vol. 18, no. 8, pp. 712–721, Aug. 1999.

[55] T. Vercauteren, X. Pennec, A. Perchant, and N. Ayache, "Diffeomorphic demons: Efficient non-parametric image registration," NeuroImage, vol. 45, no. 1, pp. S61–S72, 2009.

[56] Y. Ou, A. Sotiras, N. Paragios, and C. Davatzikos, "DRAMMS: Deformable registration via attribute matching and mutual-saliency weighting," Med. Image Anal., vol. 15, no. 4, pp. 622–639, 2011.

[57] J. Thirion, "Image matching as a diffusion process: An analogy with Maxwell's demons," Med. Image Anal., vol. 2, no. 3, pp. 243–260, 1998.

[58] B. Glocker et al., "Dense image registration through MRFs and efficient linear programming," Med. Image Anal., vol. 12, no. 6, pp. 731–741, 2008.

[59] R. A. Heckemann, S. Keihaninejad, P. Aljabar, D. Rueckert, J. V. Hajnal, and A. Hammers, "Improving intersubject image registration using tissue-class information benefits robustness and accuracy of multi-atlas based anatomical segmentation," NeuroImage, vol. 51, no. 1, pp. 221–227, 2010.

[60] H. Jia, G. Wu, Q. Wang, and D. Shen, "ABSORB: Atlas building by self-organized registration and bundling," NeuroImage, vol. 51, no. 3, pp. 1057–1070, 2010.

[61] M. Toews, W. Wells, III, D. L. Collins, and T. Arbel, "Feature-based morphometry: Discovering group-related anatomical patterns," NeuroImage, vol. 49, no. 3, pp. 2318–2327, 2010.

[62] J. Hamm, D. H. Ye, R. Verma, and C. Davatzikos, "GRAM: A framework for geodesic registration on anatomical manifolds," Med. Image Anal., vol. 14, no. 5, pp. 633–642, 2010.

[63] D. H. Ye, J. Hamm, D. Kwon, C. Davatzikos, and K. M. Pohl, "Regional manifold learning for deformable registration of brain MR images," in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2012. New York: Springer, 2012, pp. 131–138.

[64] P. Jaccard, Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Lausanne, Switzerland: Impr. Corbaz, 1901.

[65] T. Rohlfing, "Image similarity and tissue overlaps as surrogates for image registration accuracy: Widely used but unreliable," IEEE Trans. Med. Imag., vol. 31, no. 2, pp. 153–163, Feb. 2012.

[66] L. Dice, "Measures of the amount of ecologic association between species," Ecology, vol. 26, no. 3, pp. 297–302, 1945.

[67] O. Colliot, O. Camara, and I. Bloch, "Integration of fuzzy spatial relations in deformable models—Application to brain MRI segmentation," Pattern Recognit., vol. 39, no. 8, pp. 1401–1414, 2006.

[68] H. Elhawary, S. Oguro, K. Tuncali, P. R. Morrison, S. Tatli, P. B. Shyn, S. G. Silverman, and N. Hata, "Multimodality non-rigid image registration for planning, targeting and monitoring during CT-guided percutaneous liver tumor cryoablation," Acad. Radiol., vol. 17, no. 11, pp. 1334–1344, 2010.

[69] M. Essig, H. Hawighorst, S. O. Schoenberg, R. Engenhart-Cabillic, M. Fuss, J. Debus, I. Zuna, M. V. Knopp, and G. van Kaick, "Fast fluid-attenuated inversion-recovery (FLAIR) MRI in the assessment of intraaxial brain tumors," J. Magn. Reson. Imag., vol. 8, no. 4, pp. 789–798, 2005.

[70] J. Liu, J. K. Udupa, D. Odhner, D. Hackney, and G. Moonis, "A system for brain tumor volume estimation via MR imaging and fuzzy connectedness," Comput. Med. Imag. Graph., vol. 29, no. 1, p. 21, 2005.

[71] R. Verma, E. I. Zacharaki, Y. Ou, H. Cai, S. Chawla, S.-K. Lee, E. R. Melhem, R. Wolf, and C. Davatzikos, "Multi-parametric tissue characterization of brain neoplasms and their recurrence using pattern classification of MR images," Acad. Radiol., vol. 15, no. 8, p. 966, 2008.

[72] J. Doshi, G. Erus, Y. Ou, and C. Davatzikos, "Ensemble-based medical image labeling via sampling morphological appearance manifolds," in Medical Image Computing and Computer-Assisted Intervention—MICCAI'13 Challenge Workshop on Segmentation: Algorithms, Theory and Applications. New York: Springer, 2013.

[73] L. Tang and G. Hamarneh, "Random walks with efficient search and contextually adapted image similarity for deformable registration," in Medical Image Computing and Computer-Assisted Intervention (MICCAI). New York: Springer, 2013, vol. 8150, LNCS, pp. 43–50.

[74] Z. Xue, D. Shen, and C. Davatzikos, "Determining correspondence in 3-D MR brain images using attribute vectors as morphological signatures of voxels," IEEE Trans. Med. Imag., vol. 23, no. 10, pp. 1276–1291, Oct. 2004.

[75] Y. Zhan, Y. Ou, M. Feldman, J. Tomaszeweski, C. Davatzikos, and D. Shen, "Registering histologic and MR images of prostate for image-based cancer detection," Acad. Radiol., vol. 14, no. 11, pp. 1367–1381, 2007.

[76] A. Sotiras, Y. Ou, B. Glocker, C. Davatzikos, and N. Paragios, "Simultaneous geometric-iconic registration," in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2010. New York: Springer, 2010, pp. 676–683.

[77] Y. Ou, A. Besbes, M. Bilello, M. Mansour, C. Davatzikos, and N. Paragios, "Detecting mutually-salient landmark pairs with MRF regularization," in Proc. IEEE Int. Symp. Biomed. Imag.: From Nano to Macro, 2010, pp. 400–403.

[78] J. Yang, J. P. Williams, Y. Sun, R. S. Blum, and C. Xu, "A robust hybrid method for nonrigid image registration," Pattern Recognit., vol. 44, no. 4, pp. 764–776, 2011.

[79] H. Lu, L.-P. Nolte, and M. Reyes, "Interest points localization for brain image using landmark-annotated atlas," Int. J. Imag. Syst. Technol., vol. 22, no. 2, pp. 145–152, 2012.

[80] Y. H. Le, U. Kurkure, and I. A. Kakadiaris, "PDM-ENLOR: Learning ensemble of local PDM-based regressions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1878–1885.

[81] M. P. Heinrich, M. Jenkinson, M. Bhushan, T. Matin, F. V. Gleeson, S. M. Brady, and J. A. Schnabel, "MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration," Med. Image Anal., vol. 16, no. 7, pp. 1423–1435, 2012.

[82] M. Toews and W. M. Wells, III, "Efficient and robust model-to-image alignment using 3-D scale-invariant features," Med. Image Anal., vol. 17, no. 3, pp. 271–282, 2013.

[83] Y. Ou, S. P. Weinstein, E. F. Conant, S. Englander, X. Da, B. Gaonkar, M.-K. Hsieh, M. Rosen, A. DeMichele, C. Davatzikos, and D. Kontos, "Deformable registration for quantifying tumor changes during neoadjuvant chemotherapy," Magn. Reson. Med., 2014, to be published.

[84] H. Luan, F. Qi, Z. Xue, L. Chen, and D. Shen, "Multimodality image registration by maximization of quantitative-qualitative measure of mutual information," Pattern Recognit., vol. 41, no. 1, pp. 285–298, 2008.

[85] H. Rivaz, Z. Karimaghaloo, and D. L. Collins, "Self-similarity weighted mutual information: A new nonrigid image registration metric," Med. Image Anal., vol. 18, no. 2, pp. 343–358, 2014.

[86] B. Qin, Z. Gu, X. Sun, and Y. Lv, "Registration of images with outliers using joint saliency map," IEEE Signal Process. Lett., vol. 17, no. 1, pp. 91–94, Jan. 2010.

[87] X. Zhuang, S. Arridge, D. J. Hawkes, and S. Ourselin, "A nonrigid registration framework using spatially encoded mutual information and free-form deformations," IEEE Trans. Med. Imag., vol. 30, no. 10, pp. 1819–1828, Oct. 2011.

[88] D. Pace, S. Aylward, and M. Niethammer, "A locally adaptive regularization based on anisotropic diffusion for deformable image registration of sliding organs," IEEE Trans. Med. Imag., vol. 32, no. 11, pp. 2114–2126, 2013.

[89] M. Reuter, H. D. Rosas, and B. Fischl, "Highly accurate inverse consistent registration: A robust approach," NeuroImage, vol. 53, no. 4, pp. 1181–1196, 2010.

[90] H. D. Tagare, D. Groisser, and O. Skrinjar, "Symmetric non-rigid registration: A geometric theory and some numerical techniques," J. Math. Imag. Vis., vol. 34, no. 1, pp. 61–88, 2009.


[91] I. Aganj, M. Reuter, M. R. Sabuncu, and B. Fischl, “Symmetricnon-rigid image registration via an adaptive quasi-volume-preservingconstraint,” in Proc. IEEE 10th Int. Symp. Biomed. Imag., 2013, pp.230–233.

[92] B. Glocker, A. Sotiras, N. Komodakis, and N. Paragios, “Deformablemedical image registration: Setting the state of the art with discretemethods*,” Annu. Rev. Biomed. Eng., vol. 13, pp. 219–244, 2011.

[93] X. Han, L. S. Hibbard, and V. Willcut, “GPU-accelerated, gradient-free mi deformable registration for atlas-based MR brain image seg-mentation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. PatternRecognit. Workshop, 2009, pp. 141–148.

[94] R. Shams, P. Sadeghi, R. Kennedy, and R. Hartley, “A survey ofmedical image registration on multicore and the GPU,” IEEE SignalProcess. Mag., vol. 27, no. 2, pp. 50–60, Mar. 2010.

[95] X. Gu, H. Pan, Y. Liang, R. Castillo, D. Yang, D. Choi, E. Castillo, A.Majumdar, T. Guerrero, and S. B. Jiang, “Implementation and evalua-tion of various demons deformable image registration algorithms on aGPU,” Phys. Med. Biol., vol. 55, no. 1, p. 207, 2010.

[96] J. A. Schnabel, C. Tanner, A. D. Castellano-Smith, A. Degenhard, M.O. Leach, D. R. Hose, D. L. Hill, and D. J. Hawkes, “Validation ofnonrigid image registration using finite-element methods: Applicationto breast MR images,” IEEE Trans. Med. Imag., vol. 22, no. 2, pp.238–247, Feb. 2003.

[97] X. Li et al., “Validation of an algorithm for the nonrigid registration oflongitudinal breast MR images using realistic phantoms,” Med. Phys.,vol. 37, p. 2541, 2010.

[98] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook, A. Klein, and J.C. Gee, “A reproducible evaluation of ANTs similarity metric perfor-mance in brain image registration,” NeuroImage, vol. 54, no. 3, pp.2033–2044, 2011.

[99] K. Murphy et al., “Evaluation of registration methods on thoracic CT:The EMPIRE10 challenge,” IEEE Trans. Med. Imag., vol. 30, no. 11,pp. 1901–1920, Nov. 2011.

[100] R. Castillo, E. Castillo, R. Guerra, V. E. Johnson, T. McPhail, A. K.Garg, and T. Guerrero, “A framework for evaluation of deformableimage registration spatial accuracy using large landmark point sets,”Phys. Med. Biol., vol. 54, no. 7, p. 1849, 2009.

[101] Y. Ou, D. H. Ye, K. M. Pohl, and C. Davatzikos, “Validation ofDRAMMS among 12 popular methods in cross-subject cardiacMRI registration,” in Biomedical Image Registration. New York:Springer, 2012, pp. 209–219.

[102] X. Geng, D. Kumar, and G. E. Christensen, “Transitive inverse-consistent manifold registration,” in Information Processing in Medical Imaging. New York: Springer, 2005, pp. 468–479.

[103] M. V. Zanetti, M. S. Schaufelberger, J. Doshi, Y. Ou, L. K. Ferreira, P. R. Menezes, M. Scazufca, C. Davatzikos, and G. F. Busatto, “Neuroanatomical pattern classification in a population-based sample of first-episode schizophrenia,” Progr. Neuro-Psychopharmacol. Biol. Psychiatry, vol. 43, pp. 116–125, 2012.

[104] R. Bansal, L. H. Staib, A. F. Laine, X. Hao, D. Xu, J. Liu, M. Weissman, and B. S. Peterson, “Anatomical brain images alone can accurately diagnose chronic neuropsychiatric illnesses,” PloS One, vol. 7, no. 12, p. e50698, 2012.

[105] X. Da, J. Toledo, D. Wolk, Y. Ou, A. Shacklett, P. Parmpi, L. Shaw, J. Trojanowski, and C. Davatzikos, “Prediction of conversion from MCI to AD: Integration and relative values of brain atrophy patterns, clinical scores, CSF biomarkers, and APOE genotype,” presented at the RSNA Annu. Meet., Chicago, IL, 2013.

[106] X. Da, J. B. Toledo, J. Zee, D. A. Wolk, S. X. Xie, Y. Ou, A. Shacklett, P. Parmpi, L. Shaw, J. Q. Trojanowski, and C. Davatzikos, “Integration and relative value of biomarkers for prediction of MCI to AD progression: Spatial patterns of brain atrophy, cognitive scores, APOE genotype and CSF biomarkers,” NeuroImage: Clin., vol. 4, pp. 164–173, 2014.

[107] N. Koutsouleris et al., “Accelerated brain aging in schizophrenia and beyond: A neuroanatomical marker of psychiatric disorders,” Schizophrenia Bull., p. sbt142, 2013.

[108] M. H. Serpa et al., “Neuroanatomical classification in a population-based sample of psychotic major depression and bipolar I disorder with 1 year of diagnostic stability,” BioMed Res. Int., vol. 2014, 2014.

[109] T. D. Satterthwaite et al., “Neuroimaging of the Philadelphia neurodevelopmental cohort,” NeuroImage, vol. 86, pp. 544–553, 2013.

[110] G. Erus, H. Battapady, T. D. Satterthwaite, H. Hakonarson, R. E. Gur, C. Davatzikos, and R. C. Gur, “Imaging patterns of brain development and their relationship to cognition,” Cerebral Cortex, p. bht425, 2014.

[111] M. Ingalhalikar, D. Parker, Y. Ghanbari, A. Smith, K. Hua, S. Mori, T. Abel, C. Davatzikos, and R. Verma, “Connectome and maturation profiles of the developing mouse brain using diffusion tensor imaging,” Cerebral Cortex, p. bhu068, 2014.

[112] Y. Ou, N. Reynolds, R. Gollub, R. Pienaar, Y. Wang, T. Wang, D. Sack, K. Andriole, S. Pieper, C. Herrick, S. N. Murphy, P. E. Grant, and L. Zöllei, “Developmental brain ADC atlas creation from clinical images,” in Org. Human Brain Mapp., Hamburg, Germany, 2014.

[113] S. Baloch and C. Davatzikos, “Morphological appearance manifolds in computational anatomy: Groupwise registration and morphological analysis,” NeuroImage, vol. 45, no. 1, pp. S73–S85, 2009.

[114] B. Baumann, B. Teo, K. Pohl, Y. Ou, J. Doshi, M. Alonso-Basanta, J. Christodouleas, C. Davatzikos, G. Kao, and J. Dorsey, “Multiparametric processing of serial MRI during radiation therapy of brain tumors: “Finishing with flair?”,” Int. J. Radiat. Oncol. Biol. Phys., vol. 81, p. S794, 2011.

[115] S. S. Ghosh, S. Kakunoori, J. Augustinack, A. Nieto-Castanon, I. Kovelman, N. Gaab, J. A. Christodoulou, C. Triantafyllou, J. D. Gabrieli, and B. Fischl, “Evaluating the validity of volume-based and surface-based brain image registration for developmental cognitive neuroscience studies in children 4 to 11 years of age,” NeuroImage, vol. 53, no. 1, pp. 85–93, 2010.

[116] S. Seshamani, P. Rajan, R. Kumar, H. Girgis, T. Dassopoulos, G. Mullin, and G. Hager, “A meta registration framework for lesion matching,” in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2009. New York: Springer, 2009, pp. 582–589.

[117] S. E. Muenzing, B. van Ginneken, and J. P. Pluim, “On combining algorithms for deformable image registration,” in Biomedical Image Registration. New York: Springer, 2012, pp. 256–265.

[118] S. E. Muenzing, B. van Ginneken, M. A. Viergever, and J. P. Pluim, “Dirboost—An algorithm for boosting deformable image registration: Application to lung CT intra-subject registration,” Med. Image Anal., vol. 18, no. 3, pp. 449–459, 2014.

[119] Y. Guo, G. Wu, J. Jiang, and D. Shen, “Robust anatomical correspondence detection by hierarchical sparse graph matching,” IEEE Trans. Med. Imag., vol. 32, no. 2, pp. 268–277, Feb. 2013.

[120] U. Kurkure, Y. H. Le, N. Paragios, J. P. Carson, T. Ju, and I. A. Kakadiaris, “Landmark/image-based deformable registration of gene expression data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 1089–1096.

[121] G. Sharp et al., “SU-E-J-64: Landmark and ROI-enhancement-assisted inter-patient deformable registration of 3-D bladder CT images,” Med. Phys., vol. 40, no. 6, pp. 164–164, 2013.

[122] A. Gupta, H. K. Verma, and S. Gupta, “A hybrid framework for registration of carotid ultrasound images combining iconic and geometric features,” Med. Biol. Eng. Comput., vol. 51, no. 9, pp. 1043–1050, 2013.

[123] S. Wörz and K. Rohr, “Spline-based hybrid image registration using landmark and intensity information based on matrix-valued non-radial basis functions,” Int. J. Comput. Vis., vol. 106, no. 1, pp. 76–92, 2014.

[124] M. Powell, “An efficient method for finding the minimum of a function of several variables without calculating derivatives,” Comput. J., vol. 7, no. 2, pp. 155–162, 1964.

[125] M. Avriel, Nonlinear Programming: Analysis and Methods. Mineola, NY: Dover, 2003.

[126] J. More, “The Levenberg-Marquardt algorithm: Implementation andtheory,” Numer. Anal., pp. 105–116, 1978.

[127] A. Cauchy, “Méthode générale pour la résolution des systèmes d’équations simultanées,” Comp. Rend. Sci. Paris, vol. 25, pp. 536–538, 1847.

[128] T. Vercauteren et al., “Diffeomorphic demons: Efficient non-parametric image registration,” NeuroImage, vol. 45, no. 1, pp. S61–S72, 2008.

[129] J. Andersson, M. Jenkinson, and S. Smith, “Non-linear registration, aka spatial normalisation,” FMRIB Analysis Group, Univ. Oxford, Tech. Rep. TR07JA2, 2007.