14
Approaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert human raters John A. Bogovic a, , Bruno Jedynak b , Rachel Rigg e , Annie Du f , Bennett A. Landman d , Jerry L. Prince a, c, e , Sarah H. Ying c, f, g a Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA b Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD, USA c Department of Radiology and Radiological Science, Johns Hopkins Medical Institutions, Baltimore, MD, USA d Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA e Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA f Department of Neurology, Johns Hopkins Medical Institutions, Baltimore, MD, USA g Department of Ophthalmology, Johns Hopkins Medical Institutions, Baltimore, MD, USA abstract article info Article history: Accepted 27 August 2012 Available online 4 September 2012 Keywords: Human cerebellum Manual labeling Delineation Parcellation STAPLE STAPLER Label fusion Volumetric measurements obtained from image parcellation have been instrumental in uncovering structurefunction relationships. However, anatomical study of the cerebellum is a challenging task. Because of its complex structure, expert human raters have been necessary for reliable and accurate segmentation and parcellation. Such delineations are time-consuming and prohibitively expensive for large studies. Therefore, we present a three-part cerebellar parcellation system that utilizes multiple inexpert human raters that can efciently and expediently produce results nearly on par with those of experts. This system includes a hierarchical delineation protocol, a rapid verication and evaluation process, and statistical fusion of the inexpert rater parcellations. The quality of the ratersand fused parcellations was established by examining their Dice similarity coefcient, region of interest (ROI) volumes, and the intraclass correlation coefcient of region volume. The intra-rater ICC was found to be 0.93 at the nest level of parcellation. © 2012 Elsevier Inc. All rights reserved. Introduction The cerebellum has a remarkably complex structure that coordinates numerous vital functions of the human body. It is involved in tasks such as eye-movement (McCormick and Thompson, 1984; Ritchie, 1976), speech (Silveri et al., 1994), balance, ne motor control, motor learning, and cognition (Leiner et al., 1986; Schmahmann, 1991). Like the cerebral cortex, the human cerebellum exhibits functional localization. This is reected in part by its anatomic structure. There are two macroscopic levels of organization: mediallateral and rostralcaudal. The mediallateral anatomical divisions of the cerebellum echo the differences in connectivity between the medially located spinocerebellum, and the lateral cerebrocerebellum, or neo-cerebellum.The spinocerebellum consists of the wormlike vermisand the more lateral paravermis, or in- termediate zone.As its name suggests, these regions receive afferents primarily from the spinal cord. The evolutionarily ancient occulonodular lobe or vestibulocerebellumis intimately associated with the vestibular system, and therefore highly inuences spatial orientation and balance. In the rostralcaudal direction, transverse ssures create divisions in the cerebellum called lobules. The study of cerebellar substructures has been confounded by in- consistencies in nomenclature. This has been remedied in the study of humans by the general acceptance of the standard introduced by Schmahmann et al. (1999, 2000). In their work, the cerebellar lobules are numbered from I to X. These lobules stem from white matter branches rooted in the central mass of cerebellar white matter, called the corpus medullare (CM). Under this convention, lobule I is located most rostrally, with lobule numbering increasing caudally. In this work, we adopt the numbering standard of Schmahmann, but will refer to super-groupings as follows: anterior (IV), superior posterior, or middle (VI and Crus I and II of VIIA, and VIIB), inferior posterior (VIII, IX), and caudal (VIII, IX, and X). Three of these conventions and super-groupings are illustrated in Fig. 1. Region-specic changes in the cerebellum have been correlated with a number of diseases and functional decits. For example, region- ally selective degeneration of the vermis and anterior lobe has been ob- served over the course of aging (Andersen et al., 2003; Raz et al., 1998). A decrease in size of the inferior posterior vermis has been observed in boys with attention-decit and hyperactivity disorder (ADHD) relative to normals (Berquin et al., 1998; Mostofsky et al., 1998). Several studies have shown changes in volumes of the vermis (Nopoulos et al., 1999; Okugawa et al., 2003) and vermian white matter (Levitt et al., 1999) in patients with schizophrenia. Evidence suggests that the vermis and NeuroImage 64 (2013) 616629 Corresponding author. E-mail address: [email protected] (J.A. Bogovic). 1053-8119/$ see front matter © 2012 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.neuroimage.2012.08.075 Contents lists available at SciVerse ScienceDirect NeuroImage journal homepage: www.elsevier.com/locate/ynimg

Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

NeuroImage 64 (2013) 616–629

Contents lists available at SciVerse ScienceDirect

NeuroImage

j ourna l homepage: www.e lsev ie r .com/ locate /yn img

Approaching expert results using a hierarchical cerebellum parcellation protocol formultiple inexpert human raters

John A. Bogovic a,⁎, Bruno Jedynak b, Rachel Rigg e, Annie Du f, Bennett A. Landman d,Jerry L. Prince a,c,e, Sarah H. Ying c,f,g

a Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USAb Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD, USAc Department of Radiology and Radiological Science, Johns Hopkins Medical Institutions, Baltimore, MD, USAd Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USAe Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USAf Department of Neurology, Johns Hopkins Medical Institutions, Baltimore, MD, USAg Department of Ophthalmology, Johns Hopkins Medical Institutions, Baltimore, MD, USA

⁎ Corresponding author.E-mail address: [email protected] (J.A. Bogovic).

1053-8119/$ – see front matter © 2012 Elsevier Inc. Allhttp://dx.doi.org/10.1016/j.neuroimage.2012.08.075

a b s t r a c t

a r t i c l e i n f o

Article history:Accepted 27 August 2012Available online 4 September 2012

Keywords:Human cerebellumManual labelingDelineationParcellationSTAPLESTAPLERLabel fusion

Volumetric measurements obtained from image parcellation have been instrumental in uncovering structure–function relationships. However, anatomical study of the cerebellum is a challenging task. Because of its complexstructure, expert human raters have been necessary for reliable and accurate segmentation and parcellation. Suchdelineations are time-consuming and prohibitively expensive for large studies. Therefore,we present a three-partcerebellar parcellation system that utilizes multiple inexpert human raters that can efficiently and expedientlyproduce results nearly on par with those of experts. This system includes a hierarchical delineation protocol, arapid verification and evaluation process, and statistical fusion of the inexpert rater parcellations. The quality ofthe raters’ and fusedparcellationswas established by examining their Dice similarity coefficient, region of interest(ROI) volumes, and the intraclass correlation coefficient of region volume. The intra-rater ICCwas found to be 0.93at the finest level of parcellation.

© 2012 Elsevier Inc. All rights reserved.

Introduction

The cerebellumhas a remarkably complex structure that coordinatesnumerous vital functions of the human body. It is involved in tasks suchas eye-movement (McCormick and Thompson, 1984; Ritchie, 1976),speech (Silveri et al., 1994), balance, finemotor control, motor learning,and cognition (Leiner et al., 1986; Schmahmann, 1991). Like the cerebralcortex, the human cerebellum exhibits functional localization. This isreflected in part by its anatomic structure. There are two macroscopiclevels of organization: medial–lateral and rostral–caudal. The medial–lateral anatomical divisions of the cerebellum echo the differences inconnectivity between the medially located spinocerebellum, and thelateral cerebrocerebellum, or “neo-cerebellum.” The spinocerebellumconsists of thewormlike “vermis” and themore lateral paravermis, or “in-termediate zone.” As its name suggests, these regions receive afferentsprimarily from the spinal cord. The evolutionarily ancientflocculonodularlobe or “vestibulocerebellum” is intimately associatedwith the vestibularsystem, and therefore highly influences spatial orientation and balance.In the rostral–caudal direction, transverse fissures create divisions inthe cerebellum called lobules.

rights reserved.

The study of cerebellar substructures has been confounded by in-consistencies in nomenclature. This has been remedied in the study ofhumans by the general acceptance of the standard introduced bySchmahmann et al. (1999, 2000). In their work, the cerebellar lobulesare numbered from I to X. These lobules stem from white matterbranches rooted in the central mass of cerebellar white matter, calledthe corpus medullare (CM). Under this convention, lobule I is locatedmost rostrally, with lobule numbering increasing caudally. In thiswork, we adopt the numbering standard of Schmahmann, but willrefer to super-groupings as follows: anterior (I–V), superior posterior,or middle (VI and Crus I and II of VIIA, and VIIB), inferior posterior(VIII, IX), and caudal (VIII, IX, and X). Three of these conventionsand super-groupings are illustrated in Fig. 1.

Region-specific changes in the cerebellum have been correlatedwith a number of diseases and functional deficits. For example, region-ally selective degeneration of the vermis and anterior lobe has been ob-served over the course of aging (Andersen et al., 2003; Raz et al., 1998).A decrease in size of the inferior posterior vermis has been observed inboys with attention-deficit and hyperactivity disorder (ADHD) relativeto normals (Berquin et al., 1998; Mostofsky et al., 1998). Several studieshave shown changes in volumes of the vermis (Nopoulos et al., 1999;Okugawa et al., 2003) and vermian white matter (Levitt et al., 1999)in patients with schizophrenia. Evidence suggests that the vermis and

Page 2: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

I/II

IIIIV

VVI

VIIA (Crus I)VIIA (Crus II)VIIB

VIIIAVIIIB

X

IXPons

Brainstem

Anterior

SuperiorPosterior

InferiorPosterior

F-NPons

Brainstem

Cerebellum

Pons

Brainstem

I/II III IV VVI

VIIA (Crus I)

VIIA (Crus II)

VIIBVIIIA

VIIIB

X

IX

Anterior

SuperiorPosterior

InferiorPosterior

Flocculonodular

Cerebellum

Whole Cerebellum Lobar Parcellation Lobule Parcellation

Sag

ittal

Cor

onal

Fig. 1. Illustration of cerebellar anatomy and lobule grouping.

1 The Dice similarity coefficient (DSC) (Dice, 1945) was computed with the "goldstandard" expert rater delineation. For a given label l, the DSC of a parcellation withthe expert parcellation is given by: DSC(l)=2(|SRl∩SEl|)/(|SEl|+|SRl|) , where SRl andSEl denote the sets of voxels assigned to label l by the rater and expert, respectively.

617J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

flocculus are targeted by chronic alcoholism (Baker et al., 1999;Cavanagh et al., 1997). Superior posterior lobe volumes have beenshown to decrease in patients with Alzheimer's disease relative to con-trols (Thomann et al., 2008). Several types of cerebellar ataxia have alsodemonstrated region-specific atrophy within the cerebellum (Brenneiset al., 2003; Jung et al., 2011; Ying et al., 2006).

Clearly, the success of such studies depends on accurate and precisemeasurements for the structures of interest. Segmentations producedby human experts remain the gold standard despite the progressachieved in automated segmentation algorithms. However, the trainingof experts is a very long process, typically requiring thousands of hours.The high quality measurements produced by such raters are thereforetime-consuming and expensive. Here we briefly review examples ofthree approaches that have been used to limit the amount of human ex-pertise required: 1) limit the scope of the study, 2) employ automatedor semi-automated image analysis methods, or 3) employ inexperthuman raters.

In order to make best use of the experts’ time, research hypothesesmay be tested using a small cohort of subjects, on a small number ofsubregions, or using a coarse parcellation. For example, one may limitstudy to the cerebellar vermis. Raz et al. (1998) measured thecross-sectional area of the cerebellar vermis grouping lobules I–V,VI–VII, and VIII–X. A coarse parcellation considered by Levitt et al.(1999), among others, consists of hemispheric white and gray matter,vermian white matter, and gray matter split into three subgroupsconsisting of lobules I–V, VI–VII, and VIII–X.

A fewmethods have been introduced that produce a full parcellationof the cerebellar lobules from magnetic resonance (MR) images. Asemi-automatic surface-based method presented by Makris et al.(2003, 2005) has the advantage of parcellating the cerebellum intomedial-to-lateral subdivisions. Pierson et al. (2002) developed manualand semi-automatedmethods for delineating the corpusmedullare, an-terior lobe, superior posterior lobe, and inferior posterior lobe. Thesemanual and semi-automated methods have produced good results,but rater training and delineation time are not discussed. Significant

knowledge of cerebellar anatomy is required to follow these protocols,and their use will yield adequate results only when used by experthuman raters. As a consequence, they tend also to be time-consumingand expensive.

To our knowledge, there is currently only one publicly availableautomated method for cerebellar parcellation. Diedrichsen (2006),Diedrichsen et al. (2009) describe an automatic method based onnonlinear registration of a cerebellum label template (SUIT). This meth-odology was employed in (Donchin et al., 2012) in the case of focal le-sions and cerebellar atrophy and showed lobule-specific changes usingvoxel-based morphometry. However, an explicit evaluation of segmen-tation performance was not performed. For our data, the SUIT templateproduced a mean Dice similarity1 (standard deviation) of 0.55 (0.16)across 15 subjects and 24 labels (see Appendix A for details). Since thismethod uses a template constructed from control subjects, it is not sur-prising that results are poor for patients with severe cerebellar degener-ative disease (especially those including diverse forms of lobule-specificatrophy). Our data,which includenine subjectswith significant degener-ation, demonstrate this phenomenon. In any case, testing this approach(or any alternative) would require numerous manual parcellations forvalidation. The present paper describes a cost-effective approach to pro-duce such manual parcellations on both normal and ataxia subjects.

Recently, the “multi-atlas” segmentation framework has been shownto achieve excellent results for many anatomical segmentation tasks(Heckemann et al., 2006; Isgum et al., 2009). We explored the perfor-mance of such a method (see Appendix A for details) for cerebellarparcellation and found that it produced unsatisfactory results: meanDice similarity (standard deviation) across 15 subjects and 28 labels of

Page 3: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

618 J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

0.66 (0.15). An important factor in explaining this is the high complexityand inter-subject variability of the anatomical shape of the cerebellum.

In order to address the need for simultaneously accurate and inex-pensive ratings, numerous explorations into recruiting members ofthe public to complete rating tasks (“crowd sourcing”) have beenperformed. These studies have challenges ranging from object identifi-cation and annotation in images (Russell et al., 2007; Sorokin andForsyth, 2008) to language translation (Callison-Burch, 2009). The re-sults suggest that good results can be obtained once enough data fromthe crowd has been collected. Of course, the quality of the final resultsdepends on the difficulty of the task, the extent to which the “crowd”is trained, and the method for combining their ratings. Tasks thatrequire little to no training (e.g., object identification or naming) oftenproduce excellent results, while the results for more challenging tasksare usually sub-par. In a previous work, a large number of minimallytrained raters performed an easy version of the cerebellar parcellationtask with acceptable accuracy (Landman et al., 2012a). In particular,the labeling task was two-dimensional with only four labeled substruc-tures whose boundaries are readily discernible by underlying image in-tensities, while here we seek three-dimensional labeling with as manyas 29 substructures, some of which are only separable by knowledge ofcerebellar anatomy.Minimally trained “crowd” raters, having no trainingin cerebellar anatomy, are incapable of producing acceptable, detailedthree-dimensional parcellations, even when statistically combined.

In this work, we sought to develop a delineation scheme that yieldsthe accuracy of expert raters with the efficiency (with regard to timeand cost) of crowd raters. The anatomical complexity and subtlety ofimage features in the cerebellum necessitate a large time commitmentbe devoted to training and practice in order to become an expert cerebel-lum delineator. In our experience, approximately 2000 h were devotedin order to achieve excellent reliability (>0.99 ICC) (Jung et al., 2011).Note that this figure depends on many factors such as the difficulty ofthe parcellation problem or previous experience or qualifications of therater (e.g., a seasoned radiologistwill require less training for a particulartask than a novice). We recommend that researchers take these factorsinto account and perform “on-the-fly” evaluation during the trainingprocess to determine when a rater achieves the desired performance.On the other hand, raters from the crowd (with less than 20 h of experi-ence), struggle to produce acceptable delineations due to the anatomicalknowledge required. We have found that individuals with a moderateamount of training (about 150 h), whom we call "inexpert raters", aremarkedly superior to “crowd” raters, and after review and statistical fu-sion, their results can yield delineations approaching the quality of thatproduced by an expert.

This paper describes a system, involving three stages, for themanualparcellation of the cerebellumbymultiple inexpert raters. Thefirst stagein this system is a hierarchical delineation protocol which enables anewcomer to label the cerebellum with relative ease and accuracy. Thehierarchical scheme involves first delineating the whole cerebellum,and then outliningfiner structures until all sub-structures of the cerebel-lum have been obtained. A hierarchical approach reduces rater variabil-ity since the larger, more easily located anatomical landmarks aredelineated at early stages in the hierarchy. At later stages, the ratersneed only concentrate on further delineation of the regions of the cere-bellum that were previously established. An additional advantage of thisapproach is that a coarser parcellation can be achieved in less time thanthat required for the full parcellation, and coarser parcellations may beideal for some studies. Our proposed methodology achieves a mean(standard deviation) Dice similarity of 0.84 (0.10) and outperforms allautomated methods that we have been able to test (see Appendix Afor further details).

Our protocol is specifically designed and tested on both normal andhighly atrophied cerebella. Subjects diagnosed with spinocerebellarataxia types 2 and 6, and subjects with non-genetic cerebellar ataxiawere among those used for rater training and testing. Our protocol islikely to be applicable for studies involving ADHD, schizophrenia, or

chronic alcoholism and others, because the extent and patterns of atro-phy are comparable in these conditions. The proposed methodologymay not be successful in more extreme conditions such as Dandy–Walker syndrome. Because of the potential for large inter-subject varia-tions in cerebellar size and shape, raters can make significant errors, asillustrated in Fig. 2. Therefore, the second stage of our delineation systemis a method for rapid rater review, evaluation, and correction with em-phasis on the large errors that our inexpert raters sometimes make.

The third stage of our delineation system is the application of a robuststatistical fusion method to the results of three or more inexpert raters(Landman et al., 2012b). This stage removes errors made by the raters,since the consensus of these imperfect delineations tend to be accurate.Fig. 2 shows an example of a fused label result that is similar to that of anexpert rater.While the excellent outcome of the fusion despite high var-iability in raters may not occur in all cases, this example serves to showthat combining raters can produce a more robust and often more accu-rate result than the inexpert raters alone.

Cerebellum delineation/parcellation protocol

Our protocol took a hierarchical approach to delineation: early stepsfocus on large, clear boundaries, and subsequent levels of hierarchysub-parcellate previously defined regions. This scheme is easily adaptedto a fast, abridged protocol, inwhich coarse structures are delineated butthe finest structures are not. The choice of the hierarchical level atwhichto stop involves a tradeoff between the detail of a fine parcellation, andthe reliability of a coarse parcellation (larger structures are able to be de-lineated more consistently). The NIH software, “Medical Image Process-ing, Analysis, and Visualization” (MIPAV) (McAuliffe et al., 2001) wasused throughout the delineation process. The entire protocol, completewith snapshots of each stage is found online at http://iacl.ece.jhu.edu/Cerebellum_Protocol. The description we provide here is intended toconvey a general sense of the strategies we employ.

Image acquisition, processing, and interaction

Acquisition protocolFor each subject, two magnetization prepared rapid gradient echo

(MP-RAGE) images were obtained using a 3.0 T MR scanner (Intera,Phillips Medical Systems, Netherlands). The first MP-RAGE was ac-quired with the following parameters: 132 slices, axial orientation,1.1 mm slice thickness, 8° flip angle, TE=3.9 ms, TR=8.43 ms, FOV21.2×21.2 cm,matrix 256×256. The secondwas acquired with the fol-lowing parameters: 162 slices, axial orientation, 0.9 mmslice thickness,8° flip angle, TE=3.9 ms, TR=8.35 ms, FOV 23×18.328 cm, matrix256×256. The two acquired images were co-registered and averagedto produce a volume with 0.8 mm isotropic voxels in order to improveSNR. If a different image acquisition protocol is used, the delineationprocess can still be carried out, but specific results may vary.

RegistrationThe registration was performed by using MIPAV's (McAuliffe et al.,

2001) Optimized Automatic Registration tool, which is based on(Jenkinson and Smith, 2001) using a rigid (6 degrees of freedom)transformation. The average image was spatially normalized to a ste-reotactic space by registration to the ICBM atlas (Mazziotta et al.,1995).

Histogram adjustmentIn order to increase contrast among the tissue types of interest,

raters were instructed to adjust the window and level to optimize ap-parent contrast between CSF, gray matter and white matter. Low in-tensity background, and high intensity regions such as blood vesselscan reduce contrast between the cerebellar features when the con-trast window covers the min and max intensities. Raters adjustedthe window based on the image histogram only and did not change

Page 4: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

Fig. 2. An illustration of the lobules VIIB through VIIIB being consistently (above) and inconsistently delineated (below). Fissures are indicated with red lines. The anatomical dif-ferences between the two subjects were a significant source of error. In particular, the lower subject had an “extra” fissure that caused confusion among the inexpert raters.

619J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

it during the delineation process and thereby avoided a potential influ-ence on their boundary choices (a concern described in (Steenbakkerset al., 2005)).

Locking labelsAfter delineating the whole cerebellum (see below) it is further

subdivided in subsequent steps. Therefore, users are instructed to"lock" labels that are in the background (complement) to the objectthat they are currently subdividing. This simple step greatly expeditesthe delineation process because 1) it prevents boundaries that havealready been established from changing, 2) it prevents accidentalmouse clicks from creating isolated, random labels from occurring

in remote regions, and 3) it permits larger brushes to be used whendelineating near existing boundaries.

Coarse delineation

Fig. 3 outlines the levels of the hierarchy our raters traverse as partof the delineation protocol, the details of which are described below.

Whole cerebellumFor each scan a threshold is determined and employed when delin-

eating thewhole cerebellum. This threshold is determined by examiningthe intensity of the scan and selecting a threshold that includes the gray

Page 5: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

Whole

Cerebellum

Corpus

Medullare

Lobar

parcellation

Lobule

parcellation

Fig. 3. A schematic showing the steps of the hierarchical delineation protocol.

620 J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

matter but not the cerebrospinal fluid (CSF). In our experience, thissemi-automated thresholding allows raters tomore consistently and ef-ficiently separate cerebellar gray matter from the surrounding CSF.

The tentorium cerebelli (the thin layer of dura mater separating theoccipital lobe and cerebellum) is not included in the delineation, nor isany other portion of the dura. The pattern and texture of white matterand gray matter in the cerebellum are used to delineate the boundarybetween the cerebellum and cerebrum. A layer of CSF between the cer-ebellar and cortical gray matter is often visible. The boundary betweenthe white matter of the cerebellum and the spinal cord is defined as theline in the sagittal plane connecting the most anterior of the dorsal andventral gray matter in the cerebellum.

Throughout the volume, the sagittal, coronal, and axial views areroutinely consulted in order to ensure the highest level of certaintyin the delineations and to produce smooth parcellations. In addition,all boundaries are examined and corrected in each of the cardinalplanes to remove extraneously labeled voxels.

Corpus medullareThe corpusmedullare (CM) is defined as thewhitematter on the in-

terior of the cerebellum that contains the deep cerebellar nuclei andconnects through the cerebellar peduncles to the brain stem. It is theportion of white matter that cannot reasonably be assigned to any par-ticular lobule. Delineation begins with a sagittal view near the midlineand proceeds laterally. The label for the CM does not extend into thewhite matter branches in the lobules but rather connects the bottomsof the deepest fissures. It includes all the white matter within theselimits and extends to the anterior boundary of the cerebellum andbrain stem. This volume is then adjusted in the sagittal, coronal, andaxial planes until it appears smooth in each.

Lobar delineation

The second level of the hierarchy parcellates the cerebellar graymatter into three large structurally significant components: the anteriorlobules (I–V), the middle (superior–posterior) lobules (VI–VII), andcaudal (inferior–posterior and flocculonodular) lobules (VIII–X). Theseare chosen because the fissures separating them are prominent and re-liably delineated. These divisions also correspond roughly to importantfunctional specialization in the cerebellum, as the anterior lobules havebeen identified as contributing to sensorimotor function (Ito, 1984;Nitschke et al., 1996), the middle lobules seem to be involved in cogni-tion (Middleton and Strick, 1994; Schmahmann, 1991), while the cau-dal lobules (especially IX and X) are essential for ocular motor tasks(Ito, 1984).

Primary and prebiventer fissuresAt the midline of the sagittal plane, the primary (dorsal) and

prebiventer (ventral) fissures are readily identifiable. Raters definethese fissures by referencing examples in our protocol and theSchmahmann et al. (1999, 2000) atlas. The primary fissure divideslobules V and VI while the prebiventer fissure separates lobules VIIBand VIIIA, creating a total of four regions: the anterior lobules (I–V),the middle lobules (VI–VII), the caudal lobules (VIII–X), and the cor-pus medullare. The remainder of the cerebellum is delineated by fol-lowing these anatomical boundaries through the hemispheres. Thesedivisions are also examined and refined in the axial and coronalplanes.

VermisVermal lobules VIIIA through X are delineated as a group. The vermis

is defined as a roughly symmetrical shape in the center of the cerebel-lum and is best identifiable in the coronal plane. It is clearest in themost posterior and most anterior planes. In the posterior planes of thecoronal view it appears as a circular shape between the two hemi-spheres inferior to the CM. In the anterior planes it is more triangularand appears to be “wedged” between the twohemispheres. There is typ-ically uncertainty near the midline of the volume which can be clarifiedusing the sagittal plane to connect the posterior and anterior regionsdelineated in the coronal view.

Lobule delineation

The finest level of the hierarchy is the delineation of lobules. We de-fine all lobule boundaries using the Schmahmann et al. (2000) atlas. Allunchanging paint/labels remain locked and efforts are made to ensurecontinuity of labels between hemisphereswhen possible. Raters typical-ly proceed “clockwise” (rostral-to-caudal) in the sagittal view in order tobest take advantage of the label “locking” capability and to improve theirefficiency.

Lobules I/IIThis is the smallest lobule and is visible only in several slices at the

midline of the sagittal plane. It is most readily identified in the sagittalplane and appears as a thin curve of greymatter bordering the superiorcerebellar peduncle.

Lobule IIIThis lobule is adjacent to I/II and usually consists of only one major

whitematter branch. In the axial view it terminates near the boundariesof the spinal cord. It is typically shorter and smaller than both lobules IVand V.

Lobule IVThis lobule is adjacent to Lobule III and is best recognized by locating

the next fissure (when proceeding clockwise from the fissure definingthe boundary between III and IV) that remains uninterrupted throughout

Page 6: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

621J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

the entirety of the hemisphere. This boundary is somewhat unreliable be-cause such a boundary between the anterior edge and the primaryfissureis not present in every subject. In such cases the most prominent fissurein the region to be sub-parcellated is chosen.

Lobule VThis lobule is trivially delineated once the previous four labels have

been applied as it lies between lobule IV and the primary fissure. It isusually similar in size to lobule IV and is best labeled in the sagittal view.

Lobule VIThis lobule is bounded anteriorly by the primaryfissure and posterior-

ly by Crus I through the majority of the hemisphere. Near the midline,Crus I is diminished, and lobule VImaydirectly abut Crus II. This boundaryis most readily identified roughly 15 mm from the midline in the sagittalview. Lateral to themidline, Crus I is identified as the growing branch andcan be used to identify the boundary of lobule VI. We begin labeling thehemisphere of the cerebellum and proceed laterally. Next, we return toour starting point (15 mm from the midline) and proceed toward themidline, creating a boundary following the anatomy.

VIIA-Crus IThis lobule is not present in the midline sagittal plane and is best

identified by examining concurrent sagittal slices between the midlineand a slice roughly 15 mm lateral to the midline. Crus I is a large lobulewith only one branch, which appears several slices from the midlineand continues to grow, remaining present for the remainder of thehemisphere.

VIIA-Crus IIThis lobule is adjacent toVI andCrus I.Wedefine the secondboundary

as themost prominentfissure between Crus I and the prebiventerfissure.This fissure must be present through the entirety of the hemisphere.However, occasionally there exist two candidate fissures, neither ofwhich spans the hemisphere. In this case the fissure spanning the greaterportion of the hemisphere is selected.

Lobule VIIBThis lobule is trivially delineated as the remaining superior posterior

lobe after the previous three lobules are identified.

Lobule VIIIAThis lobule is adjacent to lobule VIIB and is distinguishable from

lobule VIIIB because it does not abruptly curve in the axial view. Itsometimes consists of two smaller branches that meet the point ofthe vermis (see the axial view). It is best recognized in the axialview and delineated in the sagittal view.

Lobule VIIIBThis lobule usually begins as one white matter branch at the center

of the cerebellum and becomes two branches as it extends laterally;this behavior is best seen in the axial view. In the sagittal view it appearsto hook or curve around lobule IX, and it also borders lobule X. Thisboundary is a smaller fissure and can be distinguished by the differencein orientation between lobules VIIIB and X. It is best delineated in thesagittal and axial views.

Lobule IXLobule IX appears as a round structure at the anterior portion of the

cerebellum in the axial view. It is present only near the midline of thecerebellum. In the coronal view it is themostmedial lobule and appearsto extend horizontally towards the center of the cerebellum. It is easiestto delineate in this view as the boundary is a straight line. It bordersprimarily the vermis, lobule VIIIB, lobule IX of the other hemisphere,and the spinal cord.

Lobule XLobule X is the anterior most lobule of the cerebellum. It is present

primarily half way in between the center and the outer edge. It issmaller than most other lobes and is near the spinal cord. It appearsas a bump in the sagittal view, not prominent enough to be part oflobule VIIIB.

Vermis VIIIAThis is the small posterior most part of the vermis and seems to take

only half of the posterior most white matter trunk. It is the lobule thatis present in the most lateral slices of the vermis and appears to meetvermis VIIIA in the axial plane. It is best delineated in the sagittal plane.

Vermis VIIIBThis region consists of the other half of the posterior most white

matter branch and is present, laterally, nearly as long as the vermisof VIIIA. It is also best delineated in the sagittal plane.

Vermis IXThis lobule is a triangular shape and is bordered on one side by

lobule VIIIB and on the other by X. It is the largest lobe in the sagittalview of the midline. The border between IX and X is a fissure thatreaches the CM and can be determined by considering the differencein the branches on either side. It is best delineated in the sagittalplane.

Vermis XLobule X is the anterior most region of the vermis and contains no

fissures. It is usually slightly wider (laterally) than lobule IX. The fissureseparating vermis lobules IX and X is often wider in the more lateralsagittal slices than in themedial. These lobules are sometimes separatedby lobule IX of the hemisphere. Lobule X of the vermis is disjoint fromlobule X of the hemisphere.

Rapid rater review

The consistency of the rater delineations for each subject wasestablished by manual examination using a method of “rapid rater re-view”. An automated routine located all delineations of a single subjectcompleted by any rater and produced a series of mosaic images. Eachimage displays, side-by-side, a slice of a single subject from each rater'sdelineation (including repeat delineations by a single rater) as shown inFig. 4. Because errors tended to be large, representative slices were suf-ficient to identify regions of disagreement. Note that this type of verifi-cation depends on the fact that several human raters delineate eachsubject. Specifically, this allows “rapid-reviewers” to assess delineationsfor similarity to each other, rather than for accuracy. In this way,rapid-reviewers couldmanually identify raters whowere visibly incon-sistent with either themselves or the other raters. Though the process isunable to correct some inconsistent delineations found, it does flagscans for redelineation or exclusion from further analysis. This wasdone at the last stage of the hierarchy because most errors were foundthere. When a particular label is found to be in inconsistent, only thatlabel is excluded while any other valid regions produced by that raterfor that subject remain.

Rapid reviewers also classified error into three categories: “sloppi-ness,” “disagreement,” and “misattribution.” An inconsistency was la-beled “sloppy,” when a raters delineation failed to follow eitheranatomical or image based cues. Examples of sloppiness included: isolat-ed pixels assigned to a label, unsmooth label boundaries, or label bound-aries that did not follow either tissue or fissure boundaries (e.g., biascaused by not properly following the correct intensity value). Errorswere assigned to “disagreement,”when appropriate cueswere followed,but a rater chose a different feature to follow relative to the others (Fig. 2shows a typical example of this). Finally, a “misattribution” error oc-curred when a rater correctly delineated boundaries, but assigned an

Page 7: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

Rater 1 (1) Rater 1 (2) Rater 2 (1) Rater 2 (2) Rater 3 (1) Rater 3 (2)

Coronalslice 155

Axialslice 35

Sagittalslice 70

Rater 1 Rater 2 Rater 3

Coronalslice 152

Fig. 4. Example of an automatically generated montage of rater delineations used for rapid rater review. The first example shows examples of axial, coronal, and sagittal slices forone subject delineated twice by three raters. All raters were consistent with themselves and with each other. The second example shows a “misattribution” error, in which Rater 3delineated the boundaries correctly but switched the labels for the left and right.

622 J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

incorrect label. For example, assigning the lobule Crus1-right label toCrus1-left was considered a misattribution error. An example is shownin Fig. 4, where Rater 3 exchanged the right and left labels. Misattributionerrors were corrected by reassigning the correct labels to the raterdelineations.

Rater fusion

Statistical label fusion methods can produce more accurate delinea-tions by combining the labels produced by multiple raters. We reportresults using the STAPLER algorithm (Landman et al., 2012b), a robustextension of “Simultaneous Truth and Performance Level Estimation”(STAPLE) (Warfield et al., 2004). STAPLE models the performance ofeach rater as a confusion matrix, where element (i,j)indicates the prob-ability that the rater assigned object iwhen the true label was object j. A“consensus” labeling can then be estimated using the probability ofevery label for each voxel given the performances of all raters. The esti-mation givesmoreweight to raters thatwere deemedmore accurate forparticular labels. The rater performances are then re-estimated usingthe consensus label probabilities from all voxels, and thewhole processis iterated to convergence. This scheme differs from majority vote inthat raters are weighted unequally according to how well they agreewith the consensus, where this agreement is estimated globally. Fur-thermore, it can resolve situations inwhich all raters assign different la-bels by selecting the label assigned by the best (most accurate) rater.

STAPLER follows a similar procedure but includes prior knowledgethat regulates the performance estimation in order to avoid unlikelylabel configurations that result from poor estimates of the rater perfor-mance. It can also handle cases where there are irregularities in thenumbers of raters or their labels, but this feature was not used here.We performed experiments comparing STAPLER with the STAPLE algo-rithm and a majority vote fusion scheme, and found that the STAPLERmethod performed best among these three. Note that while it did notuniformly improve upon the other methods, its performance was supe-rior across all labels and several measures (see theMethods and results

section for the evaluationmetrics). A discussion of the types and extentof errors that can be corrected by statistical fusion can be found in theDiscussion section.

We used the publicly available implementation of the multi-compartment STAPLER algorithm (http://www.nitrc.org/projects/masi-fusion), initialized with the empirical label probability map. Convergencewas declaredwhen the normalized trace of the confusionmatrix changedless than 10−4 between iterations. The value of the rater performancebias parameter was set to 0.5I for all raters, where I is the identity matrix.

Methods and results

Three human raters completed the protocol on 48 subjects. Theseraters were undergraduate students at Johns Hopkins University, Bal-timore, MD. Each student underwent approximately 150 h of trainingin the protocol before their results were considered valid. The train-ing consisted of a study of the protocol document and delineation oftraining subjects followed by expert feedback and subsequent correc-tion by the student. The cohort consisted of 18 healthy control sub-jects, six subjects diagnosed with spinocerebellar ataxia (SCA) type6, 22 subjects with a non-genetic cerebellar ataxia, and two subjectswith other diagnoses. One expert rater (with approximately 5000 hof experience) delineated 15 subjects from the cohort, and was con-sidered the “gold standard.” This expert has been shown to have ex-ceptional reliability (inter-rater ICC with another expert of 0.991)based on delineation of 23 regions for 22 subjects (Jung et al.,2011). The amount of time required for the finest parcellation of asingle subject is roughly equivalent for multiple inexpert raters andone expert rater: three raters each taking about 15 h per subject, ver-sus the expert rater taking approximately 50 h per subject.

The inexpert raters underwent rapid review by two independenthuman judges. They found that approximately 4% of the regions pro-duced by the three raters we used in this study had an error of somesort detected (216 errors out of 5376 rater-regions) by this process.These included both “sloppiness” and “disagreement” errors. Over

Page 8: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

623J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

half (113) of these errors came from lobules VIIB through VIIIA. Ex-amples of a “disagreement” and “misattribution” errors discoveredby this review process are shown in Figs. 2 and 3, respectively. Seethe Rapid rater review section for definitions of these error types.

Label fusion using STAPLER was run on the delineations providedby three inexpert raters. Since our raters repeated the delineations ofsome subjects there were often many unique groups of three ratingsthat could be used to produce a label fusion result. We therefore fusedall groups of three raters available to us and were able to provide anindication of the variability in a label fusion result given inexpertraters.

The reliability of the volume measurements obtained from the de-lineations were measured using the intraclass correlation coefficient(ICC) (Shrout and Fleiss, 1979; McGraw and Wong, 1996). We usedthe icc function in the irr package for the R software to compute theICC. We selected the “one way” option because the lobule volumesfor each subject are considered random effects, but not the raters.All other options were left to their default values. The mathematicaldetails of this ICC can be found in Table 4 in McGraw and Wong(1996), under the designation “ICC(1)”. This type of ICC measuresthe absolute agreement between different raters' volumes, as weare interested in determining whether the volumes produced by dif-ferent raters were identical (rather than consistent). These and allmeasures of delineation quality were computed at all levels of thelabel hierarchy. Note that the reliability measures tend to be greaterat coarse parcellations and lesser for finer parcellations. Specificexplanations for this behavior are given in the Discussion section.

Table 1 shows computed intra-rater ICC values for a coarse group-ing of labels that was described in Makris et al. (2005), and for the fulllobule parcellation. The 95% confidence interval for the estimate isgiven in parentheses. For the coarse grouping, the inexpert ratersachieve good intra-rater ICCs for the corpus medullare and anteriorand posterior lobes (estimates range from 0.62 to 0.99). By thismeasure, STAPLER performed well on the flocculonodular lobe(intra-rater ICC range from 0.89 to 0.94), despite the fact that inex-pert raters were less reliable (−0.67 to 0.82). There is a similartrend for the lobule reliabilities; the STAPLER results tend to bemore reliable than individual raters. Intra-rater ICCs for the inexpertraters ranged from 0.26 to 0.96 for rater one, −0.07 to 0.87 for ratertwo, and −0.66 to 0.98 for rater three. The STAPLER fused resultshad ICCs ranging from 0.71 to 0.99, and indicate that the fused resultsare much more reliable than the individual raters.

Table 2 shows the inter-rater ICCs for the lobar and lobuleparcellation of the cerebellum. This measure was always computedrelative to the expert rater. For the corpus medullare and anteriorand posterior lobes, inter-rater ICCs ranged from 0.49 to 0.98 for theinexpert raters, and from 0.94 to 0.99 for STAPLER. For the remaininglobules, ICCs ranged from −0.04 to 0.98 for the inexpert raters andfrom −0.19 to 0.99 for STAPLER.

In contrast with the ICC, the Dice similarity coefficient (DSC) mea-sures the degree of spatial overlap of two delineations and for thisreason is a stricter measure of similarity. Fig. 5 shows the boxplotsof the DSC of the inexpert raters and the fusion results with the expertrater. Note that the DSC will give a result equal to sensitivity and pos-itive predictive value when the volumes of expert and rater are equal.The fusion results tend to have higher similarity coefficients than theinexpert raters, as well as more compact distributions.

We also computed the signed volume difference given by: SVD=(VR−VE)/VE, where VR and VE are the volumes of a region delineatedby the rater and expert, respectively. This quantity measures the frac-tional difference of the rater's volume in comparison to the expert. Adelineator that agrees with the expert would achieve an SVD of zero,while raters that consistently overestimate or underestimate the vol-umes of certain regions would have positive and negative SVDs, re-spectively. We report this measure to show that the volumes of thestructures found are similar in absolute terms in addition to being

consistent (as measured by the ICC). Fig. 6 shows the boxplots of theSVD score for all the inexpert raters and the fused results. In general,the results of label fusion have compact distributions around zero,whereas the inexpert raters' volume measurements are more spreadout. This indicates that fusion produces volume measurements similarto the expert more often than the raw inexpert raters. We furtherreport in Table 3 the nominal volumes (mean and standard deviation)for normal subjects of the structures described here obtained byexpert rater delineation and STAPLER fused inexpert delineation.

Discussion

This is the first study to demonstrate that relatively inexperiencedhuman raters, when given a hierarchical protocol, can produce delinea-tions close to those of an expert rater. As inexpert raters are easier totrain and less expensive to employ than expert raters, ourmethodologycould facilitate the accurate labeling of very large data sets in a practicalfashion. Given the approximate training and delineation times we havecompiled during this study, we can report a rough estimate of the timeand cost savings achievable using this methodology. Training time forthe inexperts was reduced by about 75% because the three raterscould be trained in parallel and because their training was faster,consisting simply of learning the protocol. This suggests that havingthis hierarchical protocol enables raters to be quickly mobilized for aparticular study. After training, the three inexperts can complete a sub-ject in about 70% less time than the expert, primarily because they areable to work in parallel. Furthermore, we observed cost savings of 95%for training and 60% per subject delineation after taking into accountthe amount of time spent on training and delineation and typical hourlywages for researchers in metropolitan areas and for undergraduate in-terns. Of course, it is important that this methodology produce resultsas close as possible to those of the expert.

Agreement with an expert rater can be achieved by employing athree-stage approach. A hierarchical delineation protocol reduces ratervariability by limiting the number of decisions a rater needs to makeat any given moment. Gross errors are isolated using rapid review andverification of consensus. Raters tend to make “sloppiness” errors inde-pendently (in a statistical sense), and so often fit the error model of sta-tistical fusion algorithms. This explains the improvement of accuracyand reliability gained from using the fused results.

This leads to an important discussion regarding the types of ratererrors that can be discovered and corrected using our methodology.As our methodology does not employ expert raters, we rely on the“consensus” of the inexperts to determine locations of possible errors.As a result, boundaries that have been delineated incorrectly (relativeto the truth) but identically (i.e., a consensus was reached) by allraters are not detectable as errors. Nevertheless, the quality of our re-sults suggests that this situation does not occur very often, and whena rater makes an error, there is usually disagreement regarding theboundary's location.

We present data defining the inherent tradeoff between granularityand accuracy. Understanding this choice could facilitate the statisticaldesign of hypothesis-driven cerebellar parcellation protocols in the fu-ture. Our results show that the overlap and volume repeatability mea-sures degrade as raters complete finer stages of the labeling hierarchy(see Figs. 5 and 6 and Tables 1 and 2). This can be caused by reducedor lack of contrast between small structures or inter-subject differencesin anatomy. This in turn results in differences in rater judgment in thoseregions, and denoted a “disagreement” error by rapid reviewer. In par-ticular, the most prominent fissures give rise to significant image con-trast, are stable from subject to subject, and therefore are reliablydelineated. On the other hand, the boundaries between some lobulesare small fissures between the corresponding folia that produce only asmall amount of image contrast. As was noted in the Cerebellumdelineation/parcellation protocol section, when fissures are not visiblein the image, their presence had to be inferred from the positions of

Page 9: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

0.0

0.2

0.4

0.6

0.8

1.0

CorpusMedullare

R L

VIIIv IXv XvIII IV VIV Crus1 Crus2 VIIB VIIIA VIIIB IX XR L R L R L R L R L R L R L R L R L R L

0.0

0.2

0.4

0.6

0.8

1.0

CerebellumL

Corpus MedullareHemispheresR

VermisAnterior Middle CaudalLR LR LR

Rater 1

Rater 2

Rater 3

STAPLER

Dic

eD

ice

Fig. 5. Box plots of Dice similarity coefficient with the expert rater for the human raters and label fusion by STAPLER. The upper and lower plots show overlaps for the lobarparcellation and lobule parcellations of Fig. 1, respectively.

624 J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

white matter branches. This judgment is largely dependent on experi-ence and anatomical knowledge.

Other situations in this labeling task arise in which image contrastgives little or no information. It is not uncommon for two lobules tosprout from a single white matter branch emanating from the corpusmedullare. Since white matter branches were included as part of thelobule volume, the shared white matter branches were split between

the two lobules approximately equally. A related challenge is the ap-pearance of marked fissures between different parts of the same lobule(lobules often consist of multiple folia). Consequently, raters must dis-tinguish between fissures or other image features corresponding to lob-ule boundaries and fissures that separate different parts of a singlelobule. To make matters worse, the size and depth of the fissure mayor may not be a reliable feature in determining which fissures separate

Page 10: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

−1.0

0.5

0.0

0.5

1.0

1.5

2.0

CorpusMedullare

R L

VIIIv IXv XvIII IV VIV Crus1 Crus2 VIIB VIIIA VIIIB IX XR L R L R L R L R L R L R L R L R L R L

−1.0

-0.5

0.0

0.5

1.0

1.5

2Rater 1

Rater 2

Rater 3

STAPLER

CerebellumL

Corpus MedullareHemispheresR

VermisAnterior Middle CaudalLR LR LR

Sig

ned

Vol

ume

Diff

eren

ceS

igne

d V

olum

e D

iffer

ence

Fig. 6. Box plots of signed volume difference with the expert rater for the human raters and label fusion by STAPLER. The upper and lower plots show comparisons for the lobarparcellation and lobule parcellations of Fig. 1, respectively.

625J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

lobules. (i.e., fissures within a single lobule can appear larger/deeperthan fissures separating two lobules). As a result, raters occasionallychoose to follow different image features in delineating the same lobuleboundary. This phenomenon explains some large discrepancies be-tween raters thatwe observed, and the surprisinglywide confidence in-tervals of ICC for some lobules. Future work in improving the rapidreview process could help alleviate the effect of these outliers.

We now will present several examples of anatomical features of thecerebellum that confoundaccurate parcellation. Fig. 2 shows an example

of this phenomenon focusing on the labeling of lobules Crus II, VIIB,VIIIA, and VIIIB. Notice that the first subject has three fissures separatingfour regions, and the second subject has four fissures separating five re-gions. All inexpert raters agree with the expert rater on the first subjectsince the image features indicate the anatomical divisions. The labelingtask for the second subject is much more difficult, since raters neededto determine which of the image features correspond to anatomicaldivisions. In this case, the expert determined that VIIIA (pictured inorange) contained two branches. All of the inexpert raters correctly

Page 11: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

Table 1Intra-rater ICCs for coarse (lobe) and fine (lobules) parcellations (see Fig. 1).

Lobule Intra-rater ICC

Rater 1 Rater 2 Rater 3 STAPLER

CM 0.62 (0.33, 0.81) 0.79 (0.60, 0.89) 0.93 (0.65, 0.99) 0.99 (0.96, 1.00)Anterior R+L 0.95 (0.90, 0.98) 0.79 (0.61, 0.90) 0.99 (0.93, 1.00) 1.00 (0.99, 1.00)

L 0.95 (0.89, 0.97) 0.84 (0.70, 0.92) 0.94 (0.69, 0.99) 0.99 (0.97, 1.00)R 0.95 (0.89, 0.98) 0.75 (0.53, 0.87) 0.89 (0.49, 0.98) 0.99 (0.96, 1.00)

Posterior R+L 0.84 (0.69, 0.92) 0.71 (0.48, 0.85) 0.86 (0.37, 0.98) 0.97 (0.92, 0.99)L 0.90 (0.80, 0.95) 0.74 (0.52, 0.87) 0.88 (0.45, 0.98) 0.99 (0.96, 1.00)R 0.77 (0.55, 0.89) 0.69 (0.44, 0.84) 0.76 (0.11, 0.96) 0.95 (0.86, 0.98)

I–III L 0.72 (0.47, 0.86) 0.67 (0.41, 0.82) 0.97 (0.85, 1.00) 0.98 (0.95, 0.99)R 0.61 (0.31, 0.80) 0.41 (0.07, 0.67) 0.98 (0.89, 1.00) 0.97 (0.92, 0.99)

IV L 0.76 (0.54, 0.88) 0.79 (0.60, 0.89) 0.98 (0.90, 1.00) 0.98 (0.93, 0.99)R 0.74 (0.51, 0.87) 0.66 (0.40, 0.82) 0.73 (0.03, 0.96) 0.97 (0.92, 0.99)

V L 0.77 (0.56, 0.89) 0.86 (0.72, 0.93) 0.81 (0.24, 0.97) 0.95 (0.85, 0.98)R 0.72 (0.48, 0.86) 0.86 (0.73, 0.93) 0.93 (0.63, 0.99) 0.98 (0.93, 0.99)

VI L 0.96 (0.92, 0.98) 0.82 (0.66, 0.91) 0.92 (0.62, 0.99) 0.99 (0.97, 1.00)R 0.95 (0.90, 0.98) 0.73 (0.50, 0.86) 0.89 (0.47, 0.98) 0.97 (0.91, 0.99)

VIIA L 0.88 (0.76, 0.94) 0.76 (0.56, 0.88) 0.91 (0.54, 0.99) 0.99 (0.98, 1.00)Crus1 R 0.66 (0.39, 0.83) 0.70 (0.46, 0.84) 0.62 (−0.17, 0.94) 0.98 (0.93, 0.99)VIIA L 0.46 (0.11, 0.71) 0.67 (0.41, 0.83) 0.30 (−0.53, 0.86) 0.80 (0.51, 0.93)Crus2 R 0.18 (−0.21, 0.52) 0.63 (0.36, 0.80) 0.89 (0.47, 0.98) 0.89 (0.72, 0.96)VIIB L 0.28 (−0.10, 0.59) 0.53 (0.22, 0.75) 0.34 (−0.49, 0.87) 0.97 (0.90, 0.99)

R 0.38 (0.01, 0.66) 0.54 (0.23, 0.75) 0.89 (0.47, 0.98) 0.86 (0.65, 0.95)VIIIA L 0.57 (0.25, 0.78) 0.44 (0.11, 0.69) 0.79 (0.18, 0.97) 0.72 (0.35, 0.89)

R 0.36 (−0.01, 0.65) 0.51 (0.19, 0.73) 0.89 (0.47, 0.98) 0.89 (0.70, 0.96)VIIIB L 0.62 (0.33, 0.81) 0.56 (0.25, 0.76) 0.58 (−0.22, 0.93) 0.85 (0.62, 0.95)

R 0.63 (0.35, 0.81) 0.73 (0.52, 0.86) 0.95 (0.76, 0.99) 0.97 (0.93, 0.99)VIII v 0.87 (0.73, 0.94) 0.64 (0.38, 0.81) 0.83 (0.30, 0.97) 0.98 (0.95, 0.99)IX L 0.78 (0.57, 0.89) 0.87 (0.74, 0.93) 0.87 (0.42, 0.98) 0.96 (0.89, 0.99)

R 0.76 (0.55, 0.88) 0.68 (0.43, 0.83) 0.49 (−0.34, 0.91) 0.97 (0.92, 0.99)v 0.56 (0.24, 0.77) 0.42 (0.08, 0.67) 0.71 (−0.01, 0.95) 0.93 (0.80, 0.97)

X R+L 0.74 (0.51, 0.87) 0.52 (0.20, 0.74) −0.33 (−0.84, 0.56) 0.91 (0.77, 0.97)L 0.63 (0.34, 0.81) 0.52 (0.21, 0.74) 0.25 (−0.57, 0.84) 0.89 (0.72, 0.96)R 0.73 (0.50, 0.87) 0.50 (0.18, 0.72) −0.67 (−0.94, 0.17) 0.94 (0.83, 0.98)v 0.28 (−0.10, 0.59) 0.48 (0.16, 0.72) 0.82 (0.26, 0.97) 0.85 (0.62, 0.95)

Table 2Inter-rater ICCs for coarse (lobe) and fine (lobules) parcellations (see Fig. 1). This measure compares a particular human rater or fusion method to the “gold standard” expert rater.

Lobule Inter-rater ICC with ground truth

Rater 1 Rater 2 Rater 3 STAPLER

CM 0.49 (−0.01, 0.80) 0.92 (0.79, 0.97) 0.76 (0.38, 0.92) 0.94 (0.81, 0.99)Anterior R+L 0.98 (0.96, 0.99) 0.78 (0.59, 0.89) 0.79 (0.62, 0.89) 0.96 (0.85, 0.99)

L 0.98 (0.96, 0.99) 0.76 (0.55, 0.88) 0.80 (0.63, 0.89) 0.96 (0.86, 0.99)R 0.95 (0.90, 0.97) 0.79 (0.61, 0.89) 0.74 (0.54, 0.86) 0.95 (0.83, 0.99)

Posterior R+L 0.93 (0.88, 0.96) 0.56 (0.26, 0.76) 0.63 (0.37, 0.80) 0.99 (0.95, 1.00)L 0.96 (0.92, 0.98) 0.57 (0.27, 0.77) 0.64 (0.39, 0.81) 0.99 (0.96, 1.00)R 0.90 (0.83, 0.94) 0.55 (0.24, 0.76) 0.60 (0.34, 0.78) 0.98 (0.92, 0.99)

I–III L 0.75 (0.40, 0.91) 0.45 (−0.04, 0.77) 0.67 (0.20, 0.89) 0.56 (−0.03, 0.87)R 0.47 (−0.04, 0.79) 0.21 (−0.31, 0.64) 0.21 (−0.37, 0.68) 0.27 (−0.37, 0.75)

IV L 0.85 (0.61, 0.95) 0.82 (0.55, 0.93) 0.88 (0.64, 0.96) 0.96 (0.85, 0.99)R 0.87 (0.66, 0.96) 0.91 (0.76, 0.97) 0.84 (0.56, 0.95) 0.96 (0.84, 0.99)

V L 0.78 (0.45, 0.92) 0.77 (0.46, 0.92) 0.67 (0.21, 0.89) 0.86 (0.57, 0.96)R 0.86 (0.63, 0.95) 0.84 (0.60, 0.94) 0.73 (0.32, 0.91) 0.83 (0.48, 0.95)

VI L 0.96 (0.88, 0.99) 0.99 (0.96, 1.00) 0.94 (0.81, 0.98) 0.97 (0.89, 0.99)R 0.96 (0.89, 0.99) 0.97 (0.92, 0.99) 0.91 (0.73, 0.97) 0.95 (0.83, 0.99)

VIIA L 0.95 (0.85, 0.98) 0.98 (0.96, 0.99) 0.95 (0.84, 0.99) 0.99 (0.97, 1.00)Crus1 R 0.92 (0.77, 0.97) 0.97 (0.93, 0.99) 0.88 (0.66, 0.96) 0.96 (0.86, 0.99)VIIA L 0.90 (0.72, 0.97) 0.68 (0.29, 0.88) 0.75 (0.36, 0.92) 0.85 (0.53, 0.96)Crus2 R 0.60 (0.14, 0.85) 0.65 (0.23, 0.86) 0.53 (−0.01, 0.84) 0.56 (−0.03, 0.87)VIIB L 0.82 (0.54, 0.94) 0.11 (−0.40, 0.57) 0.60 (0.09, 0.86) 0.73 (0.27, 0.93)

R 0.51 (0.01, 0.81) 0.40 (−0.10, 0.75) 0.04 (−0.51, 0.58) 0.54 (−0.05, 0.86)VIIIA L 0.57 (0.10, 0.84) 0.86 (0.63, 0.95) 0.84 (0.54, 0.95) 0.81 (0.44, 0.95)

R 0.58 (0.11, 0.84) 0.64 (0.22, 0.86) 0.75 (0.37, 0.92) 0.33 (−0.31, 0.78)VIIIB L 0.39 (−0.14, 0.75) 0.96 (0.90, 0.99) 0.80 (0.46, 0.94) 0.61 (0.04, 0.88)

R 0.73 (0.37, 0.91) 0.90 (0.75, 0.97) 0.57 (0.05, 0.85) 0.84 (0.50, 0.96)VIII v 0.82 (0.53, 0.94) 0.70 (0.32, 0.89) 0.87 (0.61, 0.96) 0.80 (0.40, 0.94)IX L 0.77 (0.45, 0.92) 0.91 (0.76, 0.97) 0.49 (−0.07, 0.82) 0.90 (0.67, 0.97)

R 0.83 (0.57, 0.94) 0.95 (0.87, 0.98) 0.94 (0.82, 0.98) 0.94 (0.78, 0.98)v 0.85 (0.61, 0.95) 0.73 (0.38, 0.90) 0.85 (0.57, 0.95) 0.87 (0.59, 0.97)

X R+L 0.74 (0.57, 0.85) 0.57 (0.27, 0.77) 0.62 (0.35, 0.79) −0.15 (−0.68, 0.49)L −0.04 (−0.53, 0.48) 0.52 (0.04, 0.81) 0.46 (−0.10, 0.81) −0.15 (−0.67, 0.49)R 0.02 (−0.48, 0.53) 0.36 (−0.15, 0.73) 0.25 (−0.33, 0.70) −0.19 (−0.69, 0.46)v 0.11 (−0.41, 0.59) 0.54 (0.08, 0.82) 0.46 (−0.11, 0.80) 0.41 (−0.22 0.81)

626 J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

Page 12: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

Fig. 7. Examples of lobule X delineations. The inexpert rater delineations are labeled with Dice similarity coefficients with the expert delineation. Note that small absolute errorstranslate to large relative errors due to the small size of the structure.

627J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

identified the fissures, but chose to split the five branches among thefour labels in ways that did not agree with the expert. This highlightsthe importance of anatomical knowledge for certain aspects of thisdelineation task.We found that a consensus labeling, determined by sta-tistical fusion performs well and compensates, in part, for the relativeinexperience of raters.

A different challengewas faced by our inexpert raterswhen delineat-ing lobule X (theflocculonodular lobe). These rather small regions on theanterior part of the cerebellum are in very close proximity to othernerves and connective tissue. The inexpert raters had difficulty in

Table 3Means and standard deviations (in parentheses) for the lobar and lobule parcellations of co

Lobule Expert Rater 1

CM 12.14 (0.57) 10.79 (2.16)Anterior R+L 14.54 (1.71) 15.66 (2.97)

L 6.89 (0.92) 7.52 (1.36)R 7.65 (1.27) 8.14 (1.86)

Posterior R+L 88.93 (7.46) 92.65 (13.63)L 46.18 (3.75) 48.17 (6.39)R 42.75 (3.73) 44.48 (7.48)

I–III L 0.77 (0.21) 0.80 (0.27)R 0.74 (0.13) 0.91 (0.42)

IV L 2.93 (0.58) 2.71 (0.64)R 3.52 (1.14) 3.10 (1.01)

V L 3.20 (0.61) 4.00 (0.75)R 3.39 (0.77) 4.12 (0.85)

VI L 8.26 (1.71) 9.15 (1.86)R 7.49 (1.73) 8.67 (1.93)

VIIA L 13.49 (1.77) 13.11 (2.25)Crus1 R 12.49 (1.36) 11.98 (2.42)VIIA L 6.35 (0.97) 8.42 (2.70)Crus2 R 7.71 (1.93) 8.02 (1.82)VIIB L 4.64 (0.60) 4.49 (0.94)

R 4.82 (0.68) 4.59 (1.74)VIIIA L 6.31 (0.55) 4.26 (1.57)

R 3.62 (1.20) 3.54 (0.86)VIIIB L 4.05 (0.71) 5.26 (1.71)

R 3.44 (0.70) 4.37 (1.27)VIII v 1.83 (0.41) 0.64 (0.35)IX L 3.08 (0.25) 3.60 (0.76)

R 3.18 (0.18) 3.54 (0.85)v 0.99 (0.07) 1.09 (0.20)

X R+L 0.91 (0.19) 1.29 (0.17)L 0.44 (0.10) 0.66 (0.08)R 0.47 (0.10) 0.63 (0.11)v 0.33 (0.02) 0.23 (0.09)

distinguishing these structures and consistently over-estimated thesize of lobule X relative to the expert rater. The details can be observedin Fig. 6; the signed volume difference is positive for lobule X for all inex-pert raters and fused results. A specific example of this phenomenon isgiven in Fig. 7. Despite the qualitative similarity of the inexpert andexpert delineations, the DSC (displayed in blue) is small due to thesmall size of the lobule. Even though their errors were small on an abso-lute scale, the small size of these structures caused unacceptably largerelative errors in volume. We plan to address this impediment in a sub-sequent revision of our protocol, though this structure may be too

ntrol subjects.

Rater 2 Rater 3 STAPLER

12.44 (1.45) 14.42 (1.36) 12.18 (1.84)15.47 (2.80) 14.25 (2.81) 15.78 (2.46)7.36 (1.58) 6.98 (1.51) 7.59 (1.30)8.11 (1.67) 7.27 (1.80) 8.19 (1.55)

97.80 (12.61) 94.67 (10.55) 92.55 (13.62)50.50 (5.88) 48.78 (4.72) 48.20 (6.31)47.54 (6.71) 46.01 (5.75) 44.34 (7.48)0.57 (0.23) 0.66 (0.29) 0.78 (0.26)0.71 (0.40) 0.67 (0.28) 1.13 (0.42)2.85 (0.90) 3.07 (0.83) 3.01 (0.91)3.43 (0.97) 3.35 (1.26) 3.11 (1.01)3.94 (0.84) 3.25 (0.83) 3.81 (0.67)3.97 (1.03) 3.25 (0.83) 3.94 (0.76)9.03 (2.18) 9.30 (2.05) 9.18 (2.12)8.74 (1.76) 8.66 (1.45) 8.07 (1.76)

13.89 (2.51) 13.74 (2.34) 13.15 (1.82)12.99 (2.69) 12.92 (2.43) 12.03 (3.17)8.37 (1.62) 7.15 (1.60) 8.53 (1.50)9.72 (1.74) 7.60 (2.15) 6.96 (2.48)5.46 (1.62) 5.81 (1.83) 4.56 (2.36)5.46 (1.61) 6.07 (1.60) 6.92 (1.41)5.91 (1.77) 5.64 (1.67) 5.47 (2.07)4.01 (0.95) 4.06 (1.20) 2.88 (1.04)4.32 (0.94) 3.69 (0.77) 3.58 (1.06)3.87 (0.83) 3.28 (0.83) 4.36 (1.09)0.74 (0.20) 0.55 (0.21) 0.91 (0.28)3.74 (0.94) 3.45 (0.75) 3.74 (0.81)3.27 (0.92) 3.34 (0.85) 3.12 (0.86)0.92 (0.15) 0.95 (0.15) 1.15 (0.34)1.31 (0.30) 1.20 (0.29) 1.72 (0.31)0.69 (0.18) 0.60 (0.19) 0.93 (0.27)0.62 (0.14) 0.61 (0.13) 0.79 (0.10)0.33 (0.07) 0.33 (0.06) 0.34 (0.08)

Page 13: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1 SUIT

Multi-atlas

Proposed

CorpusMedullare

VIIIv IXv Xv AverageI-IV VIV Crus1 Crus2 VIIB VIIIA VIIIB IX XR L R L R L R L R L R L R L R L R L R L

Dic

e S

imila

rity

Fig. 8. Box plots of Dice similarity coefficient for SUIT, multi-atlas segmentation and the proposed manual method using inexpert raters. The rightmost set of boxes shows the of theaverage Dice similarity across all labels.

628 J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

variable and challenging for inexpert raters to delineate reliably. Dedicat-ed training for this lobulemaybenecessary. To summarize, the situationsdescribed above rely on rater judgment and experience rather than ob-serving image features. Therefore, differences between raters in this esti-mation account for errors in the smaller lobules. Incorporating differentimage features (for example diffusion weighted image contrasts) maybe useful in future efforts to provide image cues not present in the T1 im-ages used here. Diffusion imaging in particular, could play a role in futurestudies in sub-parcellating the cerebellar white matter and identifyingthe deep nuclei.

The rapid-review and fusion aspects of this system are critical as well.The review process was instrumental in identifying gross errors made byinexpert raters. Statistical label fusion techniques (STAPLER) improvedthe reliability of thefinal result by combining individual inexpert delinea-tions. We have shown that the consensus labeling estimated using thesetechniques tends to agree with a parcellation produced by an expert; anexample of this is given in Fig. 2. The results suggest that more robustfusion techniques and/or more careful delineation may be necessarywhen regions are small and/or difficult to delineate (e.g., lobule X).

Finally, our nominal volume measures reported in Table 3 generallyagree with those reported in Makris et al. (2005). For example, themean bilateral (R+L) anterior lobe volume in Makris et al. (2005) was13.68 (2.4), whereas the STAPLER fusion result here was 12.18 (1.84).The difference observed could be due to the mean age of our cohort(51 years), which is likely larger than the unreported mean age inMakris et al. (2005). In many cases, the mean of the fused results tendsto be closer to themean expert result than any of the individual inexpertraters.

Conclusion

By using this protocol, inexpert raters can produce a parcellation ofthe human cerebellum that agrees with experts when pairedwith a sys-tem for review/verification and statistical label fusion. The hierarchicalnature of this protocol allows for researches to tailor the protocol to aspecific hypothesis by balancing reliability and parcellation coarseness.Fusion of multiple inexpert delineations can produce a labeling thatapproaches the reliability and accuracy of an expert. This is especiallyuseful since employing many inexpert raters may be more time andcost-effective than employing a small number of experts. The efficiencyoffered by this approach can enable larger scale studies of the cerebellumin the future.

Acknowledgments

This work was supported in part by NIH/NINDS grants R01NS56307and R01NS37747. The authors would like to thank Neda Jahanshad forher work on initial versions of the protocol, the undergraduate raters,Charles Ang, Lauren Drake, and Javaneh Jabbari, and the rapid-reviewersAaron Carass and Lorre Atlan.

Appendix A

The evaluation described used leave-one-out cross validation toestimate the performance of themultiple-atlas segmentation framework.Fifteen subjects were used (six controls), meaning that fourteen “atlases”were used to label each subject. Registration was performed using the

Page 14: Approaching expert results using a hierarchical cerebellum ...bruno/Bogovic2013.pdfApproaching expert results using a hierarchical cerebellum parcellation protocol for multiple inexpert

629J.A. Bogovic et al. / NeuroImage 64 (2013) 616–629

SyN algorithm (Avants et al., 2008), which was found to be a leadingmethod for the purposes of labeling the cortex (Klein et al., 2009). SyNwas run using the following parameters: cross correlation similarityusing a 32 bin square joint histogram, Gaussian regularization withsigma of 3, a three level optimization using 30, 20, and 10 iterations atthe coarse-, mid-, and full-resolutions, respectively. Label fusion wasaccomplished with the multi-category STAPLE algorithm (Warfield etal., 2004) initialized with equal confusion matrices for all raters withdiagonal entries equal to 0.9999. Convergence was declared when thedifference of the normalized trace of the confusion matrix betweentwo iterations was less than 10−5.

The SUIT algorithm was run using SUIT version 2.4 with SPM8. Acorpus medullare label was manually added to the SUIT template. Awhole cerebellummask was generated using the suit_isolate command.The SUIT template and subject cerebellum were deformably registeredusing suit_normalize. Next, the lobule labels and cerebellummask wereresampled into the subject space with suit_reslice_inv. Finally, we ap-plied the cerebellum mask from the isolation step to the lobule labelsto refine the delineation near deep fissures.

The Dice similarity coefficient (DSC) with the expert's lobule labelswas computed for each of these 15 subjects for the multi-atlas, andSUIT automated methods, and for the proposed method using inexpertraters, a box plot ofwhich is shown in Fig. 8. ThemedianDSC of the pro-posed method is higher for all lobules than either of the automaticmethods. We also computed the DSC using only the gray matter labelsafter masking the gray matter labels for all methods with the true(expert rater) gray matter. The mean (standard deviation) DSCs are:0.63 (0.15) for SUIT, 0.82 (0.13) for multi-atlas, and 0.89 (0.09) for ourinexpert rater fusion. This indicates automated methods make errorsin assigning different gray matter labels that cannot be solved by incor-porating a simple tissue segmentation.

References

Andersen, B.B., Gundersen, H.J.G., Pakkenberg, B., 2003. Aging of the human cerebellum: astereological study. J. Comp. Neurol. 466, 356–365.

Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C., 2008. Symmetric diffeomorphic imageregistration with cross-correlation: evaluating automated labeling of elderly andneurodegenerative brain. Med. Image Anal. 12, 26–41.

Baker, K.G., Harding, A.J., Halliday, G.M., Kril, J.J., Harper, C.G., 1999. Neuronal loss infunctional zones of the cerebellum of chronic alcoholics with and withoutWernicke's encephalopathy. Neuroscience 91, 429–438.

Berquin, P.C., Giedd, J.N., Jacobsen, L.K., Hamburger, S.D., Krain, A.L., Rapoport, J.L.,Castellanos, F.X., 1998. Cerebellum in attention-deficit hyperactivity disorder: amorphometric MRI study. Neurology 50, 1087–1093.

Brenneis, C., Bosch, S.M., Schocke, M., Wenning, G.K., Poewe, W., 2003. Atrophy patternin SCA2 determined by voxel-based morphometry. Neuroreport 14, 1799–1802.

Callison-Burch, C., 2009. Fast, cheap, and creative: evaluating translation quality usingAmazon's Mechanical Turk. EMNLP, pp. 286–295.

Cavanagh, J.B., Holton, J.L., Nolan, C.C., 1997. Selective damage to the cerebellar vermisin chronic alcoholism: a contribution from neurotoxicology to an old problem ofselective vulnerability. Neuropathol. Appl. Neurobiol. 23, 355–363.

Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecol-ogy 26, 297–302.

Diedrichsen, J., 2006. A spatially unbiased atlas template of the human cerebellum.Neuroimage 33, 127–138.

Diedrichsen, J., Balsters, J.H., Flavell, J., Cussans, E., Ramnani, N., 2009. A probabilisticMR atlas of the human cerebellum. Neuroimage 46, 39–46.

Donchin, O., Rabe, K., Diedrichsen, J., Lally, N., Schoch, B., Gizewski, E.R., Timmann, D.,2012. Cerebellar regions involved in adaptation to force field and visuomotor per-turbation. J. Neurophysiol. 107, 134–147.

Heckemann, R.A., Hajnal, J.V., Aljabar, P., Rueckert, D., Hammers, A., 2006. Automatic ana-tomical brain MRI segmentation combining label propagation and decision fusion.Neuroimage 33, 115–126.

Isgum, I., Staring, M., Rutten, A., Prokop, M., Viergever, M.A., van Ginneken, B., 2009.Multi-atlas-based segmentation with local decision fusion—application to cardiacand aortic segmentation in CT scans. IEEE Trans. Med. Imaging 28, 1000–1010.

Ito, M., 1984. The Cerebellum and Neural Control. Raven, New York.Jenkinson, M., Smith, S., 2001. A global optimisation method for robust affine registration

of brain images. Med. Image Anal. 5, 143–156.Jung, B.C., Choi, S.I., Du, A.X., Cuzzocreo, J.L., Ying, H.S., Landman, B.A., Perlman, S.L., Baloh,

R.W., Zee, D.S., Toga, A.W., Prince, J.L., Ying, S.H., 2011. MRI shows a region-specificpattern of atrophy in spinocerebellar ataxia type 2. Cerebellum 11 (1), 272–279.

Klein, A., Andersson, J., Ardekani, B.A., Ashburner, J., Avants, B., Chiang, M.-C., Christensen,G.E., Collins, D.L., Gee, J., Hellier, P., Song, J.H., Jenkinson, M., Lepage, C., Rueckert, D.,Thompson, P., Vercauteren, T., Woods, R.P., Mann, J.J., Parsey, R.V., 2009. Evaluationof 14 nonlinear deformation algorithms applied to human brain MRI registration.Neuroimage 46, 786–802.

Landman, B.A., Asman, A.J., Scoggins, A.G., Bogovic, J.A., Stein, J.A., Prince, J.L., 2012a.Foibles, follies, and fusion: web-based collaboration for medical image labeling.Neuroimage 59, 530–539.

Landman, B.A., Asman, A.J., Scoggins, A.G., Bogovic, J.A., Xing, F., Prince, J.L., 2012b. Robuststatistical fusion of image labels. IEEE Trans. Med. Imaging 31, 512–522.

Leiner, H.C., Leiner, A.L., Dow, R.S., 1986. Does the cerebellum contribute to mentalskills? Behav. Neurosci. 100, 443–454.

Levitt, J.J., Mccarley, R.W., Nestor, P.G., Petrescu, C., Donnino, R., Hirayasu, Y., Kikinis, R.,Jolesz, F.A., Shenton, M.E., 1999. Quantitative volumetric MRI study of the cerebellumand vermis in schizophrenia: clinical and cognitive correlates. Am. J. Psychiatry 156,1105–1107.

Makris, N., Hodge, S.M., Haselgrove, C., Kennedy, D.N., Dale, A., Fischl, B., Rosen, B.R.,Harris, G., Caviness, V.S., Schmahmann, J.D., 2003. Human cerebellum: surface-assisted cortical parcellation and volumetry with magnetic resonance imaging. J.Cogn. Neurosci. 15, 584–599.

Makris, N., Schlerf, J.E., Hodge, S.M., Haselgrove, C., Albaugh, M.D., Seidman, L.J., Rauch, S.L.,Harris, G., Biederman, J., Caviness, V.S., Kennedy, D.N., Schmahmann, J.D., 2005. MRI-based surface-assistedparcellationof humancerebellar cortex: an anatomically specifiedmethod with estimate of reliability. Neuroimage 25, 1146–1160.

Mazziotta, J.C., Toga, A.W., Evans, A., Fox, P., Lancaster, J., 1995. A probabilistic atlas ofthe human brain: theory and rationale for its development: The International Con-sortium for Brain Mapping (ICBM). Neuroimage 2, 89–101.

McAuliffe, M.J., Lalonde, F.M., McGarry, D., Gandler, W., Csaky, K., Trus, B.L., 2001.Medical image processing, analysis & visualization in clinical research. IEEE CMBS,p. 381.

McCormick, D.A., Thompson, R.F., 1984. Cerebellum: essential Involvement in the clas-sically conditioned eyelid response. Science 223, 296–299.

McGraw, K.O., Wong, S.P., 1996. Forming inferences about some intraclass correlationcoefficients. Psychol. Methods 1, 30–46.

Middleton, F.A., Strick, P.L., 1994. Anatomical evidence for cerebellar and basal gangliainvolvement in higher cognitive function. Science 266, 458–461.

Mostofsky, S.H., Reiss, A.L., Lockhart, P., Denckla, M.B., 1998. Evaluation of cerebellarsize in attention-deficit hyperactivity disorder. J. Child Neurol. 13, 434–439.

Nitschke, M.F., Kleinschmidt, A., Wessel, K., Frahm, J., 1996. Somatotopic motor repre-sentation in the human anterior cerebellum. A high-resolution functional MRIstudy. Brain 119, 1023–1029.

Nopoulos, P.C., Ceilley, J.W., Gailis, E.A., Andreasen, N.C., 1999. An MRI study of cerebellarvermismorphology in patients with schizophrenia: evidence in support of the cognitivedysmetria concept. Biol. Psychiatry 46, 703–711.

Okugawa, G., Sedvall, G.C., Agartz, I., 2003. Smaller cerebellar vermis but not hemi-sphere volumes in patients with chronic schizophrenia. Am. J. Psychiatry 160,1614–1617.

Pierson, R., Corson, P.W., Sears, L.L., Alicata, D., Magnotta, V., O'Leary, D., Andreasen,N.C., 2002. Manual and semiautomated measurement of cerebellar subregions onMR images. Neuroimage 17, 61–76.

Raz, N., Dupuis, J.H., Briggs, S.D., McGavran, C., Acker, J.D., 1998. Differential effects ofage and sex on the cerebellar hemispheres and the vermis: a prospective MRstudy. AJNR Am. J. Neuroradiol. 19, 65–71.

Ritchie, L., 1976. Effects of cerebellar lesions on saccadic eye movements. J. Neurophysiol.39, 1246–1256.

Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T., 2007. LabelMe: a database andweb-based tool for image annotation. Int. J. Comput. Vision 77, 157–173.

Schmahmann, J.D., 1991. An emerging concept. The cerebellar contribution to higherfunction. Arch. Neurol. 48, 1178–1187.

Schmahmann, J.D., Doyon, J., McDonald, D., Holmes, C., Lavoie, K., Hurwitz, A.S., Kabani,N., Toga, A., Evans, A., Petrides, M., 1999. Three-dimensional MRI atlas of thehuman cerebellum in proportional stereotaxic space. Neuroimage 10, 233–260.

Schmahmann, J.D., Doyon, J., Toga, A., Petrides, M., Evans, A., 2000. MRI Atlas of theHuman Cerebellum. Academic Press, San Diego, CA.

Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessing rater reliability.Psychol. Bull. 86, 420.

Silveri, M.C., Leggeio, M.G., Molinari, M., 1994. The cerebellum contributes to linguisticproduction. Neurology 44, 2047.

Sorokin, A., Forsyth, D., 2008. Utility data annotation with Amazon Mechanical Turk.CVPR. IEEE, Anchorage, AK, pp. 1–8.

Steenbakkers, R.J.H.M., Duppen, J.C., Fitton, I., Deurloo, K.E.I., Zijp, L., Uitterhoeve, A.L.J.,Rodrigus, P.T.R., Kramer, G.W.P., Bussink, J., De Jaeger, K., Belderbos, J.S.A., Hart, A.A.M.,Nowak, P.J.C.M., van Herk, M., Rasch, C.R.N., 2005. Observer variation in target volumedelineation of lung cancer related to radiation oncologist–computer interaction: a “BigBrother” evaluation. Radiother. Oncol. 77, 182–190.

Thomann, P.A., Schläfer, C., Seidl, U., Dos Santos, V., Essig, M., Schröder, J., 2008. Thecerebellum in mild cognitive impairment and Alzheimer's disease—a structuralMRI study. J. Psychiatr. Res. 42, 1198–1202.

Warfield, S.K., Zou, K.H., Wells, W.M., 2004. Simultaneous truth and performance levelestimation (STAPLE): an algorithm for the validation of image segmentation. IEEETrans. Med. Imaging 23, 903–921.

Ying, S.H., Choi, S.I., Perlman, S.L., Baloh, R.W., Zee, D.S., Toga, A.W., 2006. Pontine andcerebellar atrophy correlate with clinical disability in SCA2. Neurology 66,424–426.