Journal of Neuroscience Methods 199 (2011) 336–345
Contents lists available at ScienceDirect: Journal of Neuroscience Methods
Journal homepage: www.elsevier.com/locate/jneumeth

Sparse geostatistical analysis in clustering fMRI time series

Jun Ye a,*, Nicole A. Lazar b, Yehua Li b
a Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
b Department of Statistics, University of Georgia, Athens, GA, USA

Article history: Received 5 October 2010; Received in revised form 12 May 2011; Accepted 12 May 2011

Keywords: Autocovariance; Data-driven analysis; LASSO; Silhouette values

Abstract

Clustering is used in fMRI time series data analysis to find the active regions in the brain related to a stimulus. However, clustering algorithms usually do not work well for ill-balanced data, i.e., when only a small proportion of the voxels in the brain respond to the stimulus. This is the typical situation in fMRI – most voxels do not, in fact, respond to the specific task. We propose a new method of sparse geostatistical analysis in clustering, which first uses sparse principal component analysis (SPCA) to perform data reduction, followed by geostatistical clustering. The proposed method is model-free and data-driven; in particular it does not require prior knowledge of the hemodynamic response function, nor of the experimental paradigm. Our data analysis shows that the spatial and temporal structures of the task-related activation produced by our new approach are more stable compared with other methods (e.g., GLM analysis with geostatistical clustering). Sparse geostatistical analysis appears to be a promising tool for exploratory clustering of fMRI time series.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

Compared with medical imaging techniques such as Computed Tomography (CT) or other X-ray based methods, Magnetic Resonance Imaging (MRI) is a relatively new non-invasive technique to look at the structure of the brain. Functional Magnetic Resonance Imaging (fMRI) is a method to see the working of the active human brain by MRI equipment.

Traditional statistical analysis in fMRI has been based on the general linear model (GLM) (Friston et al., 1995) and other model-based methods (Goutte et al., 1999). But these methods need prior knowledge of the design and response, and strongly depend on an understanding of the anticipated hemodynamic response of the brain. In recent years, many statistical contributions to fMRI research have addressed the so-called model-free approach, i.e., information from the fMRI data is extracted without reference to the experimental protocol; effects or components of interest are found directly from the intrinsic structure of the data. This approach does not depend on the estimation of any functions and is considered a data-driven analysis (Huettel et al., 2004).

An important feature of fMRI data is that they are correlated both spatially and temporally. Although it is obviously important to account for these structures, the earliest GLM approach ignored them. The simplest version of the GLM is the two-sample t-test, which is in popular use due to its robustness even without accounting for known correlations. However, it soon became clear that a better analysis would consider the temporal correlation. In the GLM of fMRI time series, the residual errors are still autocorrelated, hence temporal smoothing of the data is usually performed; e.g., Worsley et al. assume an AR(p) covariance structure in the GLM model (Worsley et al., 2002).

* Corresponding author. Tel.: +1 6056885833. E-mail address: [email protected] (J. Ye).
Since Worsley's GLM model still introduced a bias (Carew et al., 2003), currently there are two popular directions in GLM: one is "direct smoothing in GLM", the other is "clustering after GLM". With direct smoothing, as in Carew et al. (2003), the time series is directly smoothed as a spline in the GLM analysis; the smoothing parameters are then optimized by generalized cross-validation, to filter out the high frequency noise while preserving the low frequency signal, similar to the clustering procedure shown in Fig. 4. Clustering after GLM takes a multivariate approach to search for commonalities in the temporal patterns of behavior across voxels and therefore considers both the spatial and temporal correlations in the data (Ye et al., 2009). Dealing with spatial structure is more difficult and hence was attempted later; an example is elastic net regression (Carroll et al., 2009), which uses the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) to achieve dimension reduction in space, thereby obtaining sparse, interpretable models. Here dimension reduction refers to a process of reducing the number of image voxels under consideration, and sparse modeling means the resulting models use information from only a relatively small subset of predictive image voxels (Carroll et al., 2009). However, in this approach the temporal correlation is completely ignored.

Recent advances in the application of statistical methodology have allowed researchers to more realistically handle both temporal and spatial dependence in the data.

0165-0270/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jneumeth.2011.05.016



Using the autocovariance (autocorrelation) structure of the time course in clustering after GLM (Ye et al., 2009), an idea borrowed from geostatistics, offers an important advantage over the popular raw time series and cross-correlation approaches currently used in fMRI data analysis (Goutte et al., 1999; Bandettini et al., 1993). Since autocovariance does not count the order of the raw data matrix over time, two time series having identical autocovariance (autocorrelation) structure does not imply that they are coherent or synchronous. This is another advantage of using autocovariance in clustering instead of raw time series, because the autocovariance functions automatically take care of the time delays in different time series. We know that the active voxels in the brain communicate with each other when responding to the stimulus, i.e., some voxels might have delayed reaction to the stimulus; some voxels might have early reaction due to anticipation. The autocovariance function eliminates the time shifting, refrains from declaring too many clusters and in general provides more stable clustering results (Ye et al., 2009).

Because this geostatistical clustering is based on sample (empirical) autocovariance functions of fMRI time series (Ye et al., 2009), which can be considered as a type of smoothing, there is a link to Carew et al. (2003). This model-free approach captures the important temporal characteristics of the voxels and can separate the activation and noise very well. However, the method still has a weakness, namely the lack of dimension reduction. One of the challenges in fMRI cluster analysis is that clustering algorithms usually do not work well for ill-balanced data (Goutte et al., 1999; Fadili et al., 2000). "Ill-balanced" means that the numbers of observations belonging to the different classes are widely disparate; e.g., if most observations belong to one class, then all observations might be put into a single cluster, even if there are different patterns in reality. In fMRI data the population of "activated" (stimulus-related) voxels is much smaller than the population of non-stimulus-related voxels, for many standard stimuli. Hence, a pre-screening process to remove non-stimulus-related voxels is desirable. For example, a 2-sample t-test is often used for data reduction in this setting (Goutte et al., 1999; Wang et al., 2007). Hence, the above clustering method is still based on the GLM and is not completely data-driven.

The major innovation of the current paper, compared with our previous work (Ye et al., 2009), is to propose a new method of sparse geostatistical analysis in clustering, which uses sparse principal component analysis (SPCA) (Zou et al., 2006) instead of the GLM to pre-screen the fMRI data before clustering. SPCA is model free and avoids modeling the hemodynamic function or signal delays in the brain. In this way, the whole analysis is data driven. Completely data-driven analysis is not new in fMRI; independent component analysis (ICA) and principal component analysis (PCA), for example, are both commonly used data-driven techniques in fMRI analysis (Andersen et al., 1999; McKeown et al., 1998). But both PCA and ICA have disadvantages: in ICA, any slight time shifting at different voxels will result in a biased ICA amplitude estimate (Calhoun et al., 2003); in PCA, the eigenvectors (loading vectors) from the covariance matrix have non-zero coefficients for every voxel, and the useful components may not be correctly judged by their explained variances (McKeown et al., 1998). SPCA is an extension of PCA using a LASSO (Tibshirani, 1996) penalty to force the loading coefficients for non-stimulus-related voxels to be zero. The advantage of SPCA over the GLM analysis is that the former can detect brain activation without any assumptions on the data, such as knowledge of the fMRI stimulus presentation or a model for the hemodynamic response function.

We note that there are a few other sparse data recovery frameworks, such as compressed sensing and sparse regression via LASSO. We use SPCA because it best serves the goal of the study, which is to screen for the most variable voxels in the brain while the subject is performing a visual task. In contrast, the goal of the compressed sensing method is to compress the image by projecting it onto a low dimensional space, which can best preserve the features in the image. It does not search for the most variable voxels during the time of the experiment. The sparse regression via LASSO, on the other hand, requires a linear model that uses the fMRI data as covariates. One possibility is to use the stimuli sequence as the response and the brain image as a covariate. This approach searches for the voxels most linearly related to the stimuli. However, it is well known that the signal in active voxels is subject to a delay, which can be different across different regions of the brain. Such delays in brain signals can destroy the linear relationship between the time series from an active voxel and the stimuli sequence. As a model-free approach, SPCA avoids modeling the signal delay and is well-suited to our data.

In contrast to our previous work (Ye et al., 2009), the proposed method leads to efficient clustering algorithms and yields new insights into the interaction between data reduction and feature extraction in fMRI cluster analysis. Our current approach is well-adapted to fMRI data, especially when little prior knowledge about the reference function or stimulus is available. Before clustering, we reduce the dimension of the ill-balanced data by SPCA, which considers the spatial relations of the data and is data-driven. Thereafter we use geostatistical clustering to overcome the disadvantages of SPCA (McKeown et al., 1998), i.e., the SPCA results might not be easily interpreted because they do not consider the serial correlation in the voxel time series. The geostatistical method after SPCA reorganizes the clustering results and provides better interpretation based on the temporal patterns of the data. Therefore, sparse geostatistical analysis jointly considers the advantages of SPCA and the geostatistical method in clustering, as well as accounting for the spatial and temporal structures in the data.

The rest of the paper is organized as follows. In Section 2, we introduce basic concepts and definitions of the sparse geostatistical clustering method. In Section 3, we describe the fMRI data sets used to test our method. Application of our method to these fMRI data sets is in Section 4. Section 5 summarizes the proposed method.

2. Methods

2.1. Data reduction by SPCA

We first perform SPCA to pre-screen the data. Regular PCA is a classic tool for analyzing large scale multivariate data (Jolliffe, 1986). In PCA, the goal is to find a few components that explain a large proportion of the total sample variance of the original variables. The principal components can be regarded as the extracted features that maximally separate the individual observation vectors. However, one of the shortcomings of PCA is that each principal component is still a linear combination of all the original variables. SPCA aims at producing easily interpreted components with only a few non-zero coefficients, i.e., each new variable is a linear combination of a small subset of the original variables.

Regular PCA: Regular PCA takes an (n × p) matrix X with measurements x_ij, i = 1, ..., n, j = 1, ..., p, satisfying the centering condition ∑_{i=1}^n x_ij = 0 for j = 1, ..., p. We form the covariance matrix of X by calculating X^T X and solve the eigenvalue problem X^T X = B D B^T subject to B^T B = I_k, where the elements of D, d_i, i = 1, ..., k, are the k positive eigenvalues and the columns of B are the eigenvectors (loading vectors). Let Z = XB; Z will be much smaller than X if k ≪ p, and the dimension of the problem is thereby reduced. This makes Z (n × k) and B (p × k). The columns of Z are the principal components (expansion coefficients), linear combinations of the variables in X.
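The regular PCA computation above can be sketched in a few lines of NumPy (a minimal illustration of the eigendecomposition X^T X = B D B^T and Z = XB on toy data; the function and variable names are ours, not from the paper):

```python
import numpy as np

def regular_pca(X, k):
    """Regular PCA as described above: X is (n x p) with centered
    columns; returns Z = XB (n x k), the loading vectors B (p x k),
    and the k largest eigenvalues of X^T X."""
    C = X.T @ X                           # covariance structure
    d, B = np.linalg.eigh(C)              # eigenvalues in ascending order
    order = np.argsort(d)[::-1][:k]       # keep the k largest
    B = B[:, order]
    Z = X @ B                             # principal components
    return Z, B, d[order]

# toy usage: 100 "time points" x 6 "voxels"
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
X -= X.mean(axis=0)                       # enforce sum_i x_ij = 0
Z, B, evals = regular_pca(X, k=2)
```

Because `eigh` works on the symmetric matrix X^T X, the returned B automatically satisfies B^T B = I_k.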

Regularized principal component regression: SPCA computes eigenvectors by a variable selection method in a "self-contained" principal component regression (Zou et al., 2006), in which the data set is regressed on the principal components of itself. Letting X = (x_1, ..., x_n)^T, the ordinary least squares principal component regression is

B̂ = argmin_B ∑_{i=1}^n ||x_i − B B^T x_i||²,  (1)

subject to B^T B = I_k, where k is the number of components. Here argmin stands for the argument at which a function reaches its minimum. Since Z = XB,

b̂_j = argmin_{b_j} ||z_j − X b_j||²,  (2)

and the regularized principal component regression, called SPCA by Zou et al. (2006), is

b̂_j = argmin_{b_j} ||z_j − X b_j||² + λ ||b_j||² + λ₁ ||b_j||₁,  (3)

where j = 1, ..., k, z_j is the jth regularized principal component, b_j is the jth sparse eigenvector, λ is the tuning parameter and λ₁ is the constrained parameter.
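Eq. (3) is an elastic-net-penalized regression: the λ₁ term produces exact zeros in the loading vector. As an illustration only, one numerical route is proximal gradient descent (ISTA) with soft-thresholding. This is a hedged sketch under our own assumptions, not the exact algorithm of Zou et al. (2006), and `spca_loading` / `soft` are hypothetical helper names:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def spca_loading(X, z, lam, lam1, n_iter=500):
    """Sketch of one SPCA step, Eq. (3): minimize over b
    ||z - Xb||^2 + lam * ||b||^2 + lam1 * ||b||_1 by ISTA."""
    n, p = X.shape
    b = np.zeros(p)
    # step size from a Lipschitz bound on the smooth part of Eq. (3)
    L = 2.0 * (np.linalg.norm(X, 2) ** 2 + lam)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - z) + 2.0 * lam * b
        b = soft(b - grad / L, lam1 / L)
    return b

# toy usage: a component z driven by the first "voxel" only
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
z = X[:, 0] + 0.01 * rng.standard_normal(200)
b = spca_loading(X, z, lam=0.1, lam1=5.0)
```

With the L1 penalty active, the fitted loading keeps the driving voxel and sets (most of) the irrelevant coefficients exactly to zero, which is the dimension reduction the paper relies on.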

2.1.1. Steps in SPCA

Step one – Data normalization: A necessary step before SPCA is to normalize the original data in time, because in the absence of normalization, SPCA extracts variables with large variances rather than variables with similar patterns (Baudelet and Gallez, 2003; Goutte et al., 2001). Zou et al. (2006) prove that normalization can ensure the reconstruction of the principal components in principal component regression. If normalized fitted coefficients are used, the scaling factor does not affect the eigenvectors (Rencher, 2002), i.e., the principal components from the correlation structure are scale invariant.
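Step one can be sketched as a per-voxel z-scoring pass (a minimal sketch; the text does not fix an exact scheme, so centering plus unit-variance scaling of each voxel time series is our assumption):

```python
import numpy as np

def normalize_in_time(X):
    """Center and scale each voxel time series (each column of the
    n x p data matrix) to mean 0 and unit variance, so SPCA compares
    temporal patterns rather than raw variances."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                 # guard against constant voxels
    return (X - mu) / sd

# toy usage: columns with wildly different scales become comparable
rng = np.random.default_rng(2)
Xn = normalize_in_time(rng.standard_normal((50, 4)) * 10 + 3)
```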

Step two – Choose the number of components: The scree graph is often used to choose the number of components in PCA. To apply the scree graph method, we plot the value of each successive log-eigenvalue against its order and look for an elbow (bend) between the "large" log eigenvalues and the "small" log eigenvalues (Rencher, 1998; Cattell, 1966). The recommendation is to retain those log eigenvalues in the steep curve before the first one on the straight line. The number of components is taken to be the point at which the remaining log eigenvalues are relatively small. The plot from this point on is mere "scree", which means "the rock debris at the foot of a cliff" (Tatsuoka and Lohnes, 1988). The smaller log eigenvalues tend to lie along a straight line and just represent random variation. As examples, Fig. 1 gives scree plots from three sample data sets. The turning points between the steep curves and the straight lines are 5, 6 and 2 respectively. The scree graph method is simple, intuitive and easy to use. Hence it is very popular in many areas, e.g., biology and ecology. One drawback of this approach is that sometimes the result in the scree graph is not conclusive, i.e., it might suggest two or more solutions in selecting the number of components. For example, in Fig. 1(c) the elbow could occur at 2 or 6. In such a situation, we tend to include a larger number of principal components for further consideration, because SPCA only serves as a screening process, and we can rely on the clustering to find the real task related voxels at the next stage.

Step three – Determine the constrained parameter: Since fMRI data are ill-balanced, we only want the components from SPCA to be sparse enough to get the most significant patterns in the brain, i.e., dimension reduction. In order to minimize the risk of discarding stimulus-related voxels, we are liberal in our choice of the constrained parameter. The goal of the loose pre-screening is to reduce the large number of non-stimulus-related voxels, as this can seriously affect the robustness and sensitivity of the clustering results (Fadili et al., 2000). Any remaining non-stimulus-related voxels can be easily classified in the further clustering.


Step four – Choose the tuning parameter: The effect of the tuning parameter λ is generally small (McKeown et al., 1998). Hence the data set is divided into 5 disjoint parts, i.e., 5-fold, and cross-validation is used to find the best λ (Hastie et al., 2001). The advantage of cross-validation is that all observations in the data set are used for both training and testing.
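The 5-fold scheme can be sketched generically as follows. This is an illustration only, using the ridge part of a regression like Eq. (3) as a stand-in objective; `five_fold_cv_ridge` is our hypothetical helper, not the authors' exact procedure:

```python
import numpy as np

def five_fold_cv_ridge(X, z, lambdas, seed=0):
    """5-fold cross-validation for a tuning parameter: split the data
    into 5 disjoint parts, fit on 4, score on the held-out part, and
    return the lambda with the smallest average prediction error."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)            # 5 disjoint parts
    errs = []
    for lam in lambdas:
        err = 0.0
        for f in folds:
            train = np.setdiff1d(idx, f)
            Xt, zt = X[train], z[train]
            # closed-form ridge fit on the training part
            b = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ zt)
            err += np.sum((z[f] - X[f] @ b) ** 2)
        errs.append(err / n)
    return lambdas[int(np.argmin(errs))]

# toy usage: the data prefer light regularization over heavy shrinkage
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
z = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)
best = five_fold_cv_ridge(X, z, np.array([0.01, 1.0, 100.0]))
```

Every observation appears in exactly one held-out fold, so all of the data are used for both training and testing, as the text notes.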

2.2. Further clustering by geostatistical analysis

SPCA helps screen out a large proportion of non-stimulus-related voxels, but the serial correlation in the voxel time series is not considered during this phase of the procedure. Hence it is necessary to perform an additional analysis step after SPCA. Here we use the autocovariance of the voxel time series as the feature – an idea borrowed from geostatistics (Ye et al., 2009).

2.2.1. Structural clustering

In geostatistics, variogram or autocovariance modeling is called structural analysis, and spatio-temporal information is used to represent the structure of the data. It is implicit that structure varies spatially or temporally between the classes of interest (Atkinson and Lewis, 2000), hence the name "structural clustering". For calculation of the sample (empirical) autocovariance, see Isaaks and Srivastava (1989). It is typical to use the autocovariance function as an error structure in the GLM analysis of fMRI data; for example, Worsley et al. (2002) assume an AR(p) error. Here we use the autocovariance structure to discriminate temporal behavior of the voxels. For inactive voxels where the time series contain noise only, it is reasonable to assume stationarity (Isaaks and Srivastava, 1989). Even for the active voxels containing signal, where stationarity may not hold, the autocovariance function can still be calculated and provide useful information. In our data, the autocovariance functions from the active voxels have periodic patterns similar to the stimulus sequence, and are very different from those of the inactive voxels. This provides the basis for our cluster analysis. The difference between our work and those in the literature (Worsley et al., 2002) is that we do not assume any model for the autocovariance function. We do not use the raw time series because they are too noisy. The sample autocovariance functions are moment estimators, and therefore the noise is effectively removed by averaging. Hence, the characteristic patterns are more clearly revealed by the autocovariance functions.

2.2.2. Steps in structural clustering after SPCA

Step one – Calculate the autocovariance: For the normalized (n × p) data matrix X, we form the new data after SPCA:

Y = X b_1 b_1^T + X b_2 b_2^T + · · · + X b_k b_k^T.

Although the new data Y are still an (n × p) matrix, after SPCA more than 50% of the p variables in the image have been forced to be zeros in most cases. We calculate the empirical autocovariances only of the remaining non-zero variables.
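Step one can be sketched as follows (toy data; the hand-made sparse loading matrix B and the helper names are ours, and the usual moment estimator with divisor n is an assumption, since the text defers the estimator to Isaaks and Srivastava, 1989):

```python
import numpy as np

def empirical_autocov(y, max_lag):
    """Sample autocovariance of one voxel time series y at lags
    0..max_lag, using the moment estimator with divisor n."""
    n = len(y)
    y = y - y.mean()
    return np.array([np.dot(y[:n - h], y[h:]) / n for h in range(max_lag + 1)])

def sparse_features(X, B, max_lag=20):
    """Form Y = X b_1 b_1^T + ... + X b_k b_k^T = X B B^T, then compute
    autocovariance features only for the columns SPCA left non-zero."""
    Y = X @ B @ B.T
    nonzero = np.flatnonzero(np.abs(B).sum(axis=1) > 0)
    return nonzero, np.stack([empirical_autocov(Y[:, j], max_lag) for j in nonzero])

# toy usage with a hand-made sparse loading matrix B (p = 6, k = 2)
rng = np.random.default_rng(5)
X = rng.standard_normal((80, 6))
B = np.zeros((6, 2))
B[0, 0] = 1.0
B[3, 1] = 1.0
voxels, feats = sparse_features(X, B)
```

Only the surviving (non-zero) voxels contribute rows to the feature matrix, which is exactly why the clustering stage faces a much smaller, better-balanced problem.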

Step two – Define the clustering metric: Two metrics are commonly used: one is Euclidean distance between objects, called "Euclidean"; the other is one minus the sample correlation between objects, called "correlation" (Goutte et al., 2001). The "correlation" metric is more appropriate in our case and has been previously proposed in clustering (Serban and Wasserman, 2005).

Step three – Clustering using the k-means algorithm: The k-means algorithm is common in neuroimaging applications because of its computational advantages: computations are fast, the algorithm does not require retention of all distances, and convergence occurs quickly (Bowman et al., 2004). For a given number of clusters k, the algorithm iteratively minimizes the within-class variance by assigning data to the nearest center and recalculating each center (Goutte et al., 2001).

Fig. 1. (a–c) Scree graphs (components vs. log eigenvalues) for three sample data sets: (a) first saccade data, (b) second saccade data, (c) resting data. For the two saccade data sets, the turning points between the steep curves and the straight lines are 5 and 6 respectively; for the resting data set, the number of components could be 2 or 6. See text for explanation.
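Steps two and three can be sketched together. For z-scored rows, squared Euclidean distance is proportional to one minus the sample correlation, so the standard k-means updates realize the "correlation" metric. A minimal NumPy sketch (the deterministic farthest-point initialization is our choice, not the paper's):

```python
import numpy as np

def kmeans_correlation(F, k, n_iter=100):
    """k-means under the "correlation" metric. Each row of F (an
    autocovariance feature vector) is z-scored first; for z-scored
    rows, ||u - v||^2 = 2m(1 - corr(u, v)), so Euclidean k-means
    updates minimize the correlation-metric objective."""
    F = np.asarray(F, dtype=float)
    Fz = F - F.mean(axis=1, keepdims=True)
    Fz = Fz / Fz.std(axis=1, keepdims=True)
    # deterministic farthest-point initialization
    centers = [Fz[0]]
    for _ in range(1, k):
        d = np.min([((Fz - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Fz[int(np.argmax(d))])
    centers = np.array(centers)
    labels = np.zeros(len(Fz), dtype=int)
    for _ in range(n_iter):
        d = ((Fz[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)       # assign to the nearest center
        for j in range(k):
            if np.any(labels == j):     # recalculate each center
                centers[j] = Fz[labels == j].mean(axis=0)
    return labels

# toy usage: two families of autocovariance-like curves
t = np.arange(30)
slow = np.cos(2 * np.pi * t / 30)   # "active"-like periodic pattern
fast = np.cos(2 * np.pi * t / 3)    # "inactive"-like high frequency
rng = np.random.default_rng(6)
F = np.vstack([slow + 0.1 * rng.standard_normal((10, 30)),
               fast + 0.1 * rng.standard_normal((10, 30))])
labels = kmeans_correlation(F, k=2)
```

Because the metric compares shapes rather than amplitudes, two voxels with the same periodic autocovariance pattern but different signal strength fall into the same cluster.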

Step four – Determine the final results by silhouette values: Silhouette values, first introduced by Rousseeuw (1987), are used to determine the number of clusters in the k-means algorithm. The average silhouette width for the entire data set is used to select the number of clusters k, by choosing k so that the average silhouette width is highest (Kaufman and Rousseeuw, 1990).
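The average silhouette width can be computed from any distance matrix as follows (a minimal sketch of Rousseeuw's definition; the singleton convention s(i) = 0 is relevant here since one cluster in Section 4 contains a single voxel):

```python
import numpy as np

def mean_silhouette(D, labels):
    """Average silhouette width (Rousseeuw, 1987) from a precomputed
    distance matrix D: s(i) = (b_i - a_i) / max(a_i, b_i), where a_i
    is the mean distance from i to its own cluster and b_i is the
    smallest mean distance from i to another cluster."""
    n = len(labels)
    ks = np.unique(labels)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        own[i] = False
        if not own.any():              # singleton cluster: s(i) = 0
            continue
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in ks if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# toy usage: two tight groups on a line; the correct 2-cluster
# labeling scores near 1, a shuffled labeling scores much lower
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(x[:, None] - x[None, :])
score_good = mean_silhouette(D, np.array([0, 0, 0, 1, 1, 1]))
score_bad = mean_silhouette(D, np.array([0, 1, 0, 1, 0, 1]))
```

Scanning k over a candidate range and keeping the k with the highest average width implements the selection rule described above.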

There are numerous methods for identifying the number of clusters, e.g., the gap (Tibshirani et al., 2001) and jump methods (Sugar and James, 2003). We applied both to our fMRI data sets and found that they tended to select a larger number of clusters compared to using silhouette values. As a result, some of the clusters found by using steps 1–3 are further partitioned into several daughter clusters. The voxels within a daughter cluster are spatially scattered in the brain, and there is no clear evidence that this further partition is related in any way to function. We also perform the Fourier analysis described in Section 4, and the daughter clusters from the same mother cluster behave very similarly in the frequency domain. For these reasons, we only present the results using the silhouette values to determine the number of clusters.

3. Description of fMRI data

We apply our method to three different fMRI data sets. The data we investigate were acquired on different 1.5 T magneton scanners. The first saccade data set consists of 30 axial slices in a 64 × 64 matrix, taken over 156 time points, with repetition time TR = 2.5 s. Data are on a single subject. A block design alternating 12 trials of anti-saccade, followed by 12 trials of pro-saccade tasks, was performed. Pre-processing steps performed on this data set included removal of spatial outliers, correction of head motion, outlier correction in image space, Gaussian kernel smoothing with full width at half maximum (FWHM) of 2 voxels, removal of linear voxel-wise trends, and removal of linear drifts over time for each voxel, i.e., removal of linear trends among voxels and over time for each voxel. Detailed descriptions of these processing steps can be found in McNamee and Eddy (2001).

The second saccade data set (a different subject from the first saccade data set and from a different scanner) consists of 23 axial slices in a 64 × 64 matrix, taken over 81 time points. The subject was presented with blocks of anti-saccade trials (6 trials each) alternated with blocks of fixation trials. All the data pre-processing steps are similar to those in the first data set, except this data set was masked (Carew et al., 2003) as part of the processing by the researcher.

We also analyze a resting data set, again for a different single subject, which contains three axial slices in a 64 × 64 matrix, taken over 1498 time points, with repetition time TR = 2 s. This is a masked long range resting data set, i.e., only 1096 voxels of the original 4096 are inside the masked brain in each slice, and the duration of the no-stimulus experiment is relatively long. The overall stability of the brain was good (not much head motion) even before registration. A total slow translation of the brain by approximately 1 mm and a maximal rotation of approximately 2 degrees were corrected for by registration of the data.

As an example of the motivation for the geostatistical component of our analysis, Fig. 2 gives a primary view of two typical voxels, one inactive and one active. These two voxels are arbitrarily chosen from the clearly active and inactive brain regions in the first saccade data set. The inactive voxel has very high frequencies in its time series (Fig. 2(a)) and autocovariance function (Fig. 2(b)), indicating the unknown noise; the active voxel has much lower frequencies in its time series (Fig. 2(c)) and autocovariance function (Fig. 2(d)), corresponding to the experimental design. Obviously, there are more clear peaks and troughs in the shape of the waves from the empirical autocovariance function of the active voxel (Fig. 2(d)). Hence using the geostatistical approach has the potential to yield better clustering results.
Fig. 2. A primary view of two typical voxels from the first saccade data set, where one voxel is inactive and one is active: (a) time series of the inactive voxel; (b) autocovariance of the same inactive voxel; (c) time series of the active voxel; (d) autocovariance of the same active voxel. The top panels show the time series (left) and autocovariance function (right) of the inactive voxel; the bottom panels show the same for the active voxel.

. Data analysis

.1. Data analysis for the first saccade data set

Because the fourth slice was expected by the researcher whorovided the data to have more activation than other slices, weocus on this slice for demonstration purposes (McNamee and Eddy,001). All 4096 voxels are used in SPCA because data are unmasked.y the scree graph (Fig. 1(a)), five principal components are cho-en. Since the choice of constrained value is not very critical, as anxample, here we reduce the total number of voxels to fewer than00. The tuning parameter is chosen by 5-fold cross-validation asescribed above. The numbers of non-zero loadings in the first fiverincipal components are 102, 86, 48, 59 and 46 voxels respectively,nd the fourth and fifth components have one overlapped voxel.ence there are 340 voxels in total (Fig. 3). Fig. 3(a) and (c) showspparently “true” brain activation and “head motion” patterns inhe first two components. We also do further clustering to confirmhese findings. All the silhouette values using the correlation met-ic are greater than 0.80, which means strong structures have beenound (Chalana et al., 2001). The case with k = 6 has the highest pos-ible mean of silhouette values (1.0000), and the numbers of voxelsn the six clusters are 102, 86, 48, 58, 45 and 1. Cluster 6 extractshe overlapped voxel from components 4 and 5 in the SPCA analy-is. k = 5 has the second largest mean of silhouette values (0.9952),here the numbers of voxels are 102, 86, 49, 58 and 45 respectively.ow the overlapped voxel is reassigned to the most probable clus-

er. Hence the best number of clusters could reasonably be eitherve or six.

The geostatistical approach easily extracts the different timeatterns of the clusters (Fig. 4). For k = 6, cluster 1 has 102 vox-ls and the mean autocovariance function shows very strong peaks

nd troughs in the shape of waves, showing periods of time correla-ions corresponding to the experimental design. (Fig. 4(c)). Cluster

has 86 voxels with some peaks and troughs in the shape of waves,howing weak periods of time correlations that correspond to the

xel is inactive and the other is active. The top panels show the time series (left) andr the active voxel.

experimental design (Fig. 4(d)). Clusters 3, 4 and 5 show very high frequencies, which do not correspond to the experimental design (Fig. 4(e)–(g)), but may represent machine noise. Cluster 6 is basically null and noninformative, since it contains only a single voxel (Fig. 4(h)), and thus shows no pattern. To provide further confirmation of these conclusions, we use Fourier transforms to convert the autocovariances of the stimulus trail and clusters 1, 2, 3, 4 and 5 to their frequency components (Fig. 5). The dominant frequency in the spectrum of cluster 1 (Fig. 5(b)) matches remarkably well with that of the stimulus trail (Fig. 5(a)), hence we conclude that this cluster represents voxels that are reacting to the stimulus, as surmised above. On the other hand, the dominant frequency in the spectrum of cluster 2 (Fig. 5(c)) is lower than that of the stimulus trail. Considering the locations of the voxels in that cluster, cluster 2 can be explained by "head motion". The dominant frequencies in the spectra of clusters 3, 4 and 5 (Fig. 5(d)–(f)) are much higher than that of the stimulus trail, suggesting they are due to noise.
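The Fourier check described above is straightforward to reproduce. A small numpy sketch (our own helper, not the authors' code) extracts the dominant frequency index of an autocovariance sequence, so that a cluster can be compared against the stimulus:

```python
import numpy as np

def dominant_frequency(acov):
    """Index of the strongest nonzero-frequency component in the
    spectrum of an autocovariance sequence."""
    spec = np.abs(np.fft.rfft(acov))
    spec[0] = 0.0  # ignore the DC term, which only reflects the mean level
    return int(np.argmax(spec))

# Toy check: a cluster whose mean autocovariance oscillates with the
# task period shares its dominant frequency with the stimulus, even
# though its amplitude is orders of magnitude smaller.
lags = np.arange(128)
stimulus_acov = np.cos(2 * np.pi * lags / 16)          # period-16 design
cluster_acov = 0.002 * np.cos(2 * np.pi * lags / 16)   # task-related cluster
```

Matching `dominant_frequency` values identify the task-related cluster; much higher values would indicate noise, as for clusters 3–5.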

The case k = 3 would also be a reasonable choice, bolstered by the fact that it has the third largest mean silhouette value (0.9604). The clusters of "true" brain activation and "head motion" would be the same as before, and the "noise" clusters would be combined into one. Hence, doing the clustering again provides a chance to evaluate the results from SPCA and enhance their interpretation on a scientific basis. Finally, there are 102, 86 and 152 voxels in the "true" brain activation, "head motion" and "noise" clusters, with 83, 45 and 108 voxels left after masking.
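The silhouette values used throughout this section can be computed without any clustering library. The following numpy sketch follows Rousseeuw (1987); note that for simplicity it uses Euclidean distance, whereas the analysis above uses the correlation metric, and the toy data are our own:

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette value s(i) = (b_i - a_i) / max(a_i, b_i) per point,
    where a_i is the mean distance to the other points in i's cluster
    and b_i is the smallest mean distance to any other cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # full pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same].sum() / max(same.sum() - 1, 1)  # excludes self (D[i,i]=0)
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated groups give silhouette values near 1,
# i.e., a "strong structure" in the sense used above.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
labels = np.array([0] * 5 + [1] * 5)
```

A mean silhouette value close to 1, as reported for k = 6 above, indicates that points sit far closer to their own cluster than to any other.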

4.2. Data analysis for the second saccade data set

For the second saccade data set, we also use a slice which was expected to have large amounts of stimulus-related activity for purposes of demonstration. In this slice, 600 voxels out of the total 4096

are inside the masked brain. By looking at the scree graph (Fig. 1(b)), the turning point between the steep curve and the straight line is at 6, hence six components are chosen. We also reduce the number of voxels to fewer than 400, as we did in the first saccade data set.
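The "turning point" rule applied to the scree graphs here is usually judged visually, but it can be roughly automated. One simple heuristic (our own illustration, not part of the proposed method) picks the component where the decline in eigenvalues slows most sharply, i.e., the largest second difference of the scree curve:

```python
import numpy as np

def scree_elbow(eigvals):
    """Heuristic elbow of a scree curve: the component at which the
    eigenvalue decline flattens most abruptly (largest second
    difference); +2 converts the diff index to a component count."""
    ev = np.asarray(eigvals, dtype=float)
    return int(np.argmax(np.diff(ev, 2))) + 2
```

For an ambiguous scree curve like Fig. 1(c), such a heuristic is no substitute for inspecting the plot, which is why the resting data are analyzed under two candidate component numbers below.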
Fig. 3. Maps of five SPCA components for the first saccade data set, overlaid on the original brain. The first five components retain 340 voxels. The numbers of non-zero loadings are 102, 86, 48, 59 and 46 voxels respectively, and the fourth and fifth components have one overlapping voxel.

Fig. 4. Time patterns for the stimulus trail and the first saccade data set. (a) The autocovariance of the stimulus trail. (b) The mean autocovariance of the 340 retained voxels. (c–h) The mean autocovariances of the different clusters. Cluster 1 shows strong peaks and troughs in the shape of waves corresponding to the experimental design. Cluster 2 shows weak peaks and troughs in the shape of waves corresponding to the experimental design. Clusters 3, 4 and 5 show very high frequencies which do not correspond to the experimental design, but may represent machine noise. Cluster 6 is basically "null" and noninformative since it has only one single voxel and therefore shows no pattern.

After carefully selecting the constrained and tuning parameters, the numbers of non-zero loadings in the first six principal components by SPCA are 52, 36, 41, 43, 40 and 45 voxels respectively, and none of the six components have overlapping voxels. Hence there are only 257 voxels in total. Fig. 6 gives the maps of the six SPCA components (clusters) for the second saccade data, overlaid on the original brain. Unlike the first saccade data set, here it is more difficult to judge the brain maps directly without considering the geostatistical clustering, i.e., we have to look at the temporal patterns of the different clusters.

Fig. 5. For the first saccade data set, (a) is the spectrum of the autocovariance of the stimulus trail; (b), (c), (d), (e) and (f) are the spectra of the autocovariances of clusters 1, 2, 3, 4 and 5, respectively. Since (a) and (b) have the most similar frequencies, cluster 1 is directly in reaction to the stimulus trail.

Fig. 6. Maps of six SPCA components for the second saccade data set, overlaid on the original brain. The first six components retain 257 voxels. The numbers of non-zero loadings are 52, 36, 41, 43, 40 and 45 voxels respectively, and none of the six components have overlapping voxels.

As we have already shown in the analysis of the first saccade data set, geostatistical clustering reorganizes the SPCA results. Since there are no overlapping voxels among the six components, the mean silhouette value obviously equals 1 when k = 6 in the geostatistical analysis. Hence we can consider the six components as six clusters and look at their empirical autocovariance functions directly. We convert the autocovariances of the stimulus trail and the six clusters (components) to their frequency components. Only the dominant frequency in the spectrum of cluster 4 matches well with that of the stimulus trail (not shown). Hence we conclude that this cluster (component) represents voxels that are reacting to the stimulus. By looking at Fig. 6(g) and (h), we can see that the brain activity in the fourth component is also weaker than that in the comparable cluster from the first data set, although the spatial locations of activation correspond well.

Fig. 7. (a–d) Comparison of sparse geostatistical analysis and GLM clustering for the first saccade data set; (e–h) comparison of sparse geostatistical analysis and GLM clustering for the second saccade data set. See text for explanation.

4.3. Comparison of sparse geostatistical analysis and GLM clustering

The major difference between the proposed new method and GLM clustering (e.g., GLM analysis with geostatistical clustering (Ye et al., 2009)) is in the data reduction step: sparse geostatistical analysis uses SPCA to pre-screen the data, while GLM clustering uses the general linear model. Here we compare sparse geostatistical analysis and GLM clustering (Ye et al., 2009) on the two saccade data sets.

Fig. 7 shows the maps of active regions for the two saccade data sets under the two different clustering methods. To keep the comparison between the two saccade data sets consistent, we masked the clustered results from the first saccade data set. For the first data set, Fig. 7(a) and (b) show the maps of the 83 masked activated voxels from the 102 clustered voxels, overlaid on the original brain, for sparse geostatistical analysis; Fig. 7(c) and (d) show the maps of the 88 masked activated voxels from the 111 clustered voxels, overlaid on the original brain, for GLM clustering. For the second data set, Fig. 7(e) and (f) show the maps of the 43 clustered voxels, overlaid on the original brain, for sparse geostatistical analysis; Fig. 7(g) and (h) show the maps of the 42 clustered activated voxels, overlaid on the original brain, for GLM clustering.

4.3.1. Comparison in the first saccade data set

We find that the "activation" regions detected by the sparse geostatistical analysis and the GLM clustering share 80% of their voxels (Fig. 7(a) and (c)). We further investigate the active voxels detected by the sparse geostatistical analysis but missed by the GLM clustering. Fig. 8 shows the autocovariance function and the spectrum of one such active voxel. The spectrum clearly shows that the time series in this voxel contains task-related signal, but also high-frequency noise. We conclude that the GLM clustering tends to miss voxels with both high signal and high noise. Overall, 17 such voxels are missed by the GLM clustering, and they all show the same spectral pattern.

4.3.2. Comparison between the two saccade data sets

Using similar thresholds in the sparse geostatistical analysis, we find many more active voxels in the first data set (Fig. 7(a)–(d)), which likely means that this subject is a "good" activator for the eye movement design compared to the subject of the second data set. Hence it is no surprise that for the first subject both sparse geostatistical analysis and GLM clustering separate the task-related activation from artifacts and other noise very well. But for the second saccade data set, sparse geostatistical analysis gives a much better estimate (Fig. 7(e)–(h)). To verify our assumption, we try the GLM with false discovery rate (FDR) q = 0.05 on the two data sets (Genovese et al., 2002). For the first saccade data set, the map of active brain regions is quite similar to those from sparse geostatistical analysis and GLM clustering (not shown). For the second saccade data set, GLM with multiplicity correction (FDR q = 0.05) does not reveal any activation, i.e., there are almost no activated voxels in the brain map. Finding objective and effective thresholds for voxel-wise statistics derived from neuroimaging data has been a long-standing problem. In the GLM clustering method (Ye et al., 2009), further clustering (e.g., geostatistical clustering) is used to remove errors after the initial GLM analysis. If the brain "activations" are not strong enough (e.g., in the second data set), it is difficult to keep the balance between the GLM analysis and the further clustering. If the criterion used in the GLM analysis is too strict, many weak focal activations will be discarded; if the criterion is too loose, the further clustering may not perform well on the ill-balanced data. In contrast, the data-driven method, sparse geostatistical analysis, is good for dimension reduction regardless of whether the subject's brain "activations" are strong or weak.
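The FDR correction invoked above (Genovese et al., 2002) is the Benjamini–Hochberg step-up rule applied to the voxel-wise p-values. A minimal numpy sketch (our own illustration of the standard procedure, not the authors' code):

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: return a boolean mask of
    rejected (declared active) tests. Finds the largest k with
    p_(k) <= q * k / m and rejects the k smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m      # q*k/m for k = 1..m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()          # largest passing rank (0-based)
        reject[order[: k + 1]] = True         # reject all smaller p-values
    return reject
```

With the hypothetical p-values `[0.001, 0.008, 0.039, 0.041, 0.6]` and q = 0.05, only the first two tests are rejected; when all p-values are large, as in a weakly activating subject, nothing survives, mirroring the empty map found for the second data set.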
Fig. 8. (a) The empirical autocovariance function of one active voxel detected by the sparse geostatistical analysis but missed by the GLM clustering; (b) the spectrum of the same voxel. Note that the spectrum has local maxima at both low and high frequencies, indicating high signal with high noise.

While our results indicate that the sparse geostatistical analysis has promise, even for poor performers, studying individuals per se is less interesting. We gain statistical power by combining subjects together. Hence interesting and useful follow-on work will be an extension of these methods to group analysis.

4.4. Sparse geostatistical analysis results on the whole brain

Conceptually, applying the method to the whole brain would proceed exactly as the analysis for a single slice. Computationally, the extension to the whole brain is more of a challenge, since we quickly run into memory and storage issues; we will address efficient computation for the SPCA in future work. As a compromise, we implement the analysis on multiple slices. Since activation should be somewhat localized, we choose the seven slices most adjacent to the fourth for the additional analysis. Fig. 9 shows the active voxels detected in each slice by sparse geostatistical analysis, with similar parameters chosen in the SPCA step. For the seven new slices, we perform the same Fourier analysis as for the fourth slice and confirm that these "activation" clusters are task related. In the slicewise analysis, when SPCA is used as a loose screening device instead of a crude threshold (e.g., a t-test), sparse geostatistical analysis shows advantages over the GLM clustering, based on the spectrum of the active voxels.

Fig. 9. Active regions detected by sparse geostatistical analysis slice by slice with similar parameters. See text for explanation.

4.5. Data analysis for the long range resting data set

For the long range resting data set, there should be no differences among the three slices. Hence we choose the first slice as an example. We first use the scree graph to determine the number of components in SPCA. The turning point between the steep curve and the straight line is not as clear as before, with 2 and 6 both being plausible values (Fig. 1(c)). Hence, we analyze the data using both six and three components. The reason to choose three components instead of two is: if the most probable number of clusters is two, the further clustering in the first analysis will reassign the six components to three clusters, two of which are almost the same as the first two components in the second analysis, while the other four are combined into the third. Hence the two approaches will give consistent results if the right number of clusters is two.

In the first analysis, the first six components contain 718 voxels. The numbers of non-zero loadings in the first six principal components are 138, 77, 152, 263, 130 and 102 voxels, and there are 144 overlapping loadings. After calculating the mean silhouette values for k = 2, 3, 4, 5, 6, 7 in the further clustering, the best choice is found to be k = 3. Cluster 1 has 126 voxels, and is very similar to component 3; cluster 2 has 514 voxels, and is very similar to the combination of components 1, 4, 5 and 6; cluster 3 has 78 voxels, and is very similar to component 2. Next, we choose the number of components to be k = 3 and do the same analysis as before for comparison. The first three components have 149 voxels. The numbers of non-zero loadings in the first three principal components are 62, 45 and 42 voxels respectively, and there are no overlapping voxels. We also combine the three components again and do further clustering. The results are exactly the same as in SPCA, with mean silhouette value equal to 1. Comparing the mean autocovariances of the three clusters with the results from the six original components, we find the time patterns of the three clusters are very similar. Hence, after further clustering, we conclude that the number of components for the resting data is two. We observe clear patterns in one cluster (not shown). In the raw temporal patterns of this cluster, there is a large upward movement after time point 600. This discontinuity might be due to uncorrected head motion, which is often visible as a largely vertical movement on the time course plot (Huettel et al., 2004). Hence, when the number of components is undetermined in SPCA, further clustering by geostatistical analysis can well interpret the results from SPCA and find the most probable cluster allocation.

5. Conclusion

We present sparse geostatistical analysis in clustering and demonstrate its uses for fMRI time series. The proposed method is a combination of two existing statistical methods. However, when SPCA and geostatistical clustering are jointly used, this model-free approach not only makes the whole analysis process data-driven, but also offers a well-grounded framework for clustering. Specifically, sparse geostatistical analysis shows advantages over the GLM clustering, which is based on a crude threshold in data screening. Our proposed method can be considered a two-step spatio-temporal cluster analysis. The first step is to reduce the ill-balanced data by SPCA, which screens out the voxels that are clearly not active during the experiment; the second step is to do the clustering using geostatistical ideas, which further refines the results and identifies the active voxels based on the temporal patterns of the data. The efficiency of the data analysis is greatly improved by the proposed two-step procedure.

Acknowledgements

We wish to express our appreciation to Dr. Rebecca L. McNamee of the University of Pittsburgh for kindly providing the first saccade data set and her conscientious assistance with physiological interpretations. We thank Dr. Jennifer McDowell and Michael Amlung at the University of Georgia Psychology Department for providing the second saccade data set. We also would like to thank Dr. Nathan Yanasak, now at the Medical College of Georgia, for his assistance with the resting data acquisition. Jun Ye was supported in part by SDSU 2011 Research/Scholarship Support Fund SA1100185. We thank the referees of an earlier version of this paper, who gave helpful advice on clarifying and explaining our ideas.

References

Andersen AH, Gash DM, Avison MJ. Principal component analysis of the dynamic response measured by fMRI: a generalized linear systems framework. Magnetic Resonance Imaging 1999;17:795–815.

Atkinson PM, Lewis P. Geostatistical classification for remote sensing: an introduction. Computers and Geosciences 2000;26:361–71.

Bandettini PA, Jesmanowicz A, Wong EC, Hyde JS. Processing strategies for time-course data sets in functional MRI of the human brain. Magnetic Resonance in Medicine 1993;30:161–73.

Baudelet C, Gallez B. Cluster analysis of BOLD fMRI time series in tumors to study the heterogeneity of hemodynamic response to treatment. Magnetic Resonance in Medicine 2003;49:985–90.

Bowman FD, Patel R, Lu C. Methods for detecting functional classifications in neuroimaging data. Human Brain Mapping 2004;23:109–19.

Calhoun VD, Adali T, Pekar JJ, Pearlson GD. Latency (in)sensitive ICA: group independent component analysis of fMRI data in the temporal frequency domain. NeuroImage 2003;20:1661–9.

Carew JD, Wahba G, Xie X, Nordheim EV, Meyerand ME. Optimal spline smoothing of fMRI time series by generalized cross-validation. NeuroImage 2003;18:950–61.

Carroll MK, Cecchi GA, Rish I, Garg R, Rao AR. Prediction and interpretation of distributed neural activity with sparse models. NeuroImage 2009;44:112–22.

Cattell RB. The scree test for the number of factors. Multivariate Behavioral Research 1966;1:245–76.

Chalana V, Ng L, Rystrom LR, Gee JC, Haynor DR. Validation of brain segmentation and tissue classification algorithm for T1-weighted MR images. Medical Imaging 2001: Image Processing 2001;4322:1873–82.

Fadili MJ, Ruan S, Bloyet D, Mazoyer B. A multistep unsupervised fuzzy clustering analysis of fMRI time series. Human Brain Mapping 2000;10:160–78.

Friston KJ, Holmes AP, Worsley KJ, Poline J-P, Frith CD, Frackowiak RSJ. Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping 1995;2:189–210.

Genovese CR, Lazar NA, Nichols T. Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage 2002;15:870–8.

Goutte C, Toft P, Rostrup E, Nielsen FA, Hansen LK. On clustering fMRI time series. NeuroImage 1999;9:298–310.

Goutte C, Hansen LK, Liptrot MG, Rostrup E. Feature space clustering for fMRI meta analysis. Human Brain Mapping 2001;13:165–83.

Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference and prediction. New York, NY: Springer Science; 2001.

Huettel SC, Song AW, McCarthy G. Functional magnetic resonance imaging. Sunderland, MA: Sinauer Associates Inc; 2004.

Isaaks EH, Srivastava RM. An introduction to applied geostatistics. New York, NY: Oxford University Press, Inc; 1989.

Jolliffe IT. Principal component analysis. New York, NY: Springer Verlag; 1986.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. Hoboken, NJ: John Wiley and Sons, Inc; 1990.

McKeown MJ, Makeig S, Brown GG, Jung T-P, Kindermann SS, Bell AJ, et al. Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping 1998;6:160–88.

McNamee RL, Eddy WF. Visual analysis of variance: a tool for quantitative assessment of fMRI data processing and analysis. Magnetic Resonance in Medicine 2001;46:1202–8.

Rencher AC. Multivariate statistical inference and applications. New York, NY: John Wiley and Sons, Inc; 1998.

Rencher AC. Methods of multivariate analysis. 2nd ed. New York, NY: John Wiley and Sons, Inc; 2002.

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987;20:53–65.

Serban N, Wasserman L. CATS: clustering after transformation and smoothing. Journal of the American Statistical Association 2005;100:990–9.

Sugar C, James G. Finding the number of clusters in a data set: an information theoretic approach. Journal of the American Statistical Association 2003;98:750–63.

Tatsuoka MM, Lohnes PR. Multivariate analysis. 2nd ed. New York, NY: Macmillan Publishing Company; 1988.

Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 1996;58:267–88.

Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B 2001;63:411–23.

Wang D, Shi L, Yeung DS, Tsang ECC, Heng PA. Ellipsoidal support vector clustering for functional MRI analysis. Pattern Recognition 2007;40:2685–95.

Worsley K, Liao C, Aston JAD, Petre V, Duncan G, Morales F, et al. A general statistical analysis for fMRI data. NeuroImage 2002;15:1–15.

Ye J, Lazar NA, Li Y. Geostatistical analysis in clustering fMRI time series. Statistics in Medicine 2009;28:2490–508.

Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. Journal of Computational and Graphical Statistics 2006;15:265–86.