
Identification of Local Clusters for Count Data: A

Model-Based Moran’s I Test

Tonglin Zhang ∗and Ge Lin †

Purdue University and West Virginia University

February 14, 2007

∗Department of Statistics, Purdue University, 250 North University Street, West Lafayette, IN 47907-2066. Email: [email protected]

†Department of Geology and Geography, West Virginia University, Morgantown, WV 26506-6800. Email: [email protected]


Abstract

We set out IDR as a loglinear model-based Moran’s I test for Poisson count data that resembles

the Moran’s I residual test for Gaussian data. We evaluate its type I and type II error probabilities

via simulations, and demonstrate its utility via a case study. When population sizes are heterogeneous,

IDR is effective in detecting local clusters by local association terms with an acceptable

type I error probability. When used in conjunction with local spatial association terms in loglinear

models, IDR can also indicate the existence of a first-order global cluster that can hardly be removed

by local spatial association terms. In this situation, IDR should not be directly applied for local

cluster detection. In the case study of St. Louis homicides, we bridge loglinear model methods

for parameter estimation to exploratory data analysis, so that a uniform association term can be

defined with spatially varied contributions among spatial neighbors. The method makes use of

exploratory tools such as Moran’s I scatter plots and residual plots to evaluate the magnitude of

deviance residuals, and it is effective in modeling the shape, the elevation and the magnitude of a

local cluster in the model-based test.

Keywords: Cluster and clustering; deviance residual; Moran’s I; permutation test; spatial

autocorrelation; type I error probability.

1 Introduction

Count and cross-tabulated frequency data are common in geographical analyses. Many spatial

phenomena, such as births, deaths, crimes and species richness, can be counted by a spatial unit,

either as a raw count or as a ratio over some exposure. Prior to the 1970s, count data were often

converted to rate for statistical analyses because of limited computational power in categorical

statistics. In the late 1970s, computationally expensive methods, such as loglinear models for


cross-tabulated data were introduced into social sciences and geography [15, 38], and they were

quickly included in many statistical packages. In spatial statistical analyses, however, counts

are still frequently converted to rate so that a testing method for continuous variables, such as

Moran’s I [26, 27] or Getis-Ord’s G [20], can be directly applied. However, when population sizes

are heterogeneous across spatial units, converting counts to rates often leads to variance inflation

and biased type I error probabilities. Some propose to incorporate a population weight to the test

statistics [29, 35], but the heterogeneity problem still remains [5]. Since a loglinear model can take

account of population sizes in its likelihood ratio test, it is natural to extend the spatial statistics

under the loglinear model framework.

In this article, we set out a loglinear model-based test statistic for Poisson count data that

corresponds to Moran’s I for continuous data. We chose Moran’s I because of its popularity

and its ease of implementation. There have been hundreds of applications and extensions of

the statistic since Moran’s I was first published in 1948 [26]. Currently, most researchers focus

on estimation methods [13, 22, 37], spatial distribution properties [7, 19], and adjustment of

heterogeneous population sizes for count data [5, 29, 35]. A concurrent theme is focused on local

spatial statistics or indicators [16]. It is pointed out that the extent of spatial correlation may

vary locally due to omitted, misspecified, and deficient measurements for a stationary spatial

relationship [17]. A significant Moran’s I test may be caused by either a global trend of spatial

autocorrelation, or a few local spatial associations. Attempts have been made to partition space

for spatially varied parameterization [10], and to decompose a global autocorrelation measure, such

as Moran’s I, into a local indicator of spatial association (LISA) [3]. With auxiliary information,

LISA is able to locate spatial associations, such as hot spots and cool spots [33]. Our model-based

test should complement LISA, because it is not only able to explicitly indicate high-value and

low-value clusters, but it is also able to account for heterogeneous population sizes.

As its name suggests, a model-based test depends on a particular statistical model. In a linear

regression model, a dependent variable is often associated with a set of explanatory variables. After


a final model is derived, a residual Moran’s I test for spatial autocorrelation can also be performed

to detect spatial clustering for unexplained variations ([14], p. 197). When a regression model

does not include any explanatory variables, the residual Moran’s I test is identical to the Moran’s

I test of the dependent variable. If we can bridge this test for spatial autocorrelation to a loglinear

model, it would likely narrow the apparent knowledge gap between Moran’s I for continuous data

and other autocorrelation tests for count data ([38], p. 307).

There are some recent advances in incorporating count data in spatial statistics. Griffith [21]

introduced a spatial filter specification of the auto-regressive logistic model that is able to remove

the global clustering effect. The model is likely to provide unbiased parameter estimates for auto-

regressive logistic regression, but due to its focus on model correction, the method may not be

able to detect a local association. Several test statistics, such as I_pop and I*_pop [29] and modified

I [35] by Empirical Bayes Index (EBI), CG [34] or spatial-χ² test [31], are able to account for

heterogeneous population sizes and to detect a local cluster, but none of them can account for

ecological or geographic covariates. Lin’s [23] spatial logit association model is able to include

ecological covariates and spatial associations, but the significance of a logit association term is

not a direct measure of spatial clustering. Although Apanasovich and his coauthors [6] used

the Pearson residuals to test for spatial autocorrelation in their autoregressive model, the test

was not formally specified and evaluated for wider applications. In this paper, we demonstrate

that Moran’s I based on loglinear residuals can be used not only as a global indicator of spatial

autocorrelation, but also as a tool for modeling the location, the shape, the elevation and the size

of a local spatial cluster.

In the remaining sections of the paper, we briefly review the permutation test of Moran’s I by

using regression residuals and reformulate it in the context of Poisson data by using the deviance

residuals of a loglinear model. We then use simulations to evaluate its statistical properties under

the null and alternative hypotheses of spatial independence. In section 4, we apply the deviance

residual Moran’s I test to the St. Louis crime data. Finally, we provide some concluding remarks.


2 A Model-Based Moran’s I

Consider a study area that has m regions indexed by i. Let Xi be the variable of interest in region

i. Moran’s I [26, 27] is expressed as:

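$$
I = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}(X_i-\bar{X})(X_j-\bar{X})}{\left[\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\right]\left[\sum_{i=1}^{m}(X_i-\bar{X})^2/m\right]}, \tag{1}
$$

where $\bar{X} = \sum_{i=1}^{m} X_i/m$, and $w_{ij}$ with $w_{ii} = 0$ is the (i, j)-th element of a spatial weight matrix W.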

Commonly wij is defined by the adjacency of spatial units: wij = 1 if regions i and j are adjacent

(neighbors) and wij = 0 otherwise. A significant and positive value of Moran’s I indicates the

existence of a positive autocorrelation, or the existence of high-value or low-value clustering. A

significant and negative value of Moran’s I indicates a negative autocorrelation, or a tendency

toward the juxtaposition of high values next to low values.
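As an illustration of equation (1), the statistic can be sketched in a few lines of Python. The function name `morans_i`, the list-of-lists weight matrix and the four-region example are assumptions made here for illustration only; they are not part of the original analysis.

```python
def morans_i(x, w):
    """Moran's I of equation (1); x is a list of values, w a square 0/1 weight matrix."""
    m = len(x)
    mean = sum(x) / m
    z = [v - mean for v in x]
    # Weighted cross-products of deviations over all ordered pairs i != j.
    num = sum(w[i][j] * z[i] * z[j] for i in range(m) for j in range(m) if i != j)
    # Sum of all weights (w_ii is treated as 0).
    s0 = sum(w[i][j] for i in range(m) for j in range(m) if i != j)
    var = sum(v * v for v in z) / m
    return num / (s0 * var)


# Hypothetical toy layout: four regions in a row, neighbours share a border.
w_line = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], w_line)       # gradient: positive I
alternating = morans_i([1.0, 4.0, 1.0, 4.0], w_line)  # high next to low: negative I
print(smooth, alternating)
```

In this toy layout the smooth gradient gives a positive value and the alternating pattern a negative one, matching the interpretation given above.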

The null hypothesis of Moran’s I is usually based on the assumption that the distributions

of Xi are homogeneous. The p-value of the significance of Moran’s I is computed from a z-test

based on its z-value given by Z(I) = [I − E(I)]/√V(I), where E(I) and V(I) are the theoretical

mean and variance respectively under the null hypothesis. Under the null hypothesis of no spatial

autocorrelation, Z(I) is assumed to be asymptotically distributed as N(0, 1) as m → ∞. The

theoretical values of E(I) and V (I) are usually computed under the random permutation test

scheme:

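$$
E(I) = -\frac{1}{m-1} \tag{2}
$$

and

$$
V(I) = \frac{m[(m^2-3m+3)S_1 - mS_2 + 3S_0^2] - b_2[(m^2-m)S_1 - 2mS_2 + 6S_0^2]}{(m-1)(m-2)(m-3)S_0^2} - E^2(I), \tag{3}
$$

where $S_0 = \sum_{i=1}^{m}\sum_{j=1,j\neq i}^{m}(w_{ij}+w_{ji})/2$, $S_1 = \sum_{i=1}^{m}\sum_{j=1,j\neq i}^{m}(w_{ij}+w_{ji})^2/2$, and $S_2 = \sum_{i=1}^{m}(w_{i\cdot}+w_{\cdot i})^2$ with $w_{i\cdot} = \sum_{j=1,j\neq i}^{m} w_{ij}$ and $w_{\cdot i} = \sum_{j=1,j\neq i}^{m} w_{ji}$, and $b_2 = m\sum_{i=1}^{m}(z_i-\bar{z})^4/[\sum_{i=1}^{m}(z_i-\bar{z})^2]^2$ ([14], p. 21).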
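The permutation moments in (2) and (3) can be checked numerically. The sketch below is a minimal illustration, assuming a small hypothetical layout of five regions in a row; the function names are ours. It computes E(I) and V(I) from the formulas and compares them with the mean and variance of I over all 5! relabelings of the same values.

```python
import math
from itertools import permutations


def morans_i(x, w):
    """Moran's I of equation (1)."""
    m = len(x)
    mean = sum(x) / m
    z = [v - mean for v in x]
    num = sum(w[i][j] * z[i] * z[j] for i in range(m) for j in range(m) if i != j)
    s0 = sum(w[i][j] for i in range(m) for j in range(m) if i != j)
    return num / (s0 * sum(v * v for v in z) / m)


def permutation_moments(x, w):
    """E(I) and V(I) from equations (2) and (3); needs m >= 4."""
    m = len(x)
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    s0 = sum((w[i][j] + w[j][i]) / 2 for i, j in pairs)
    s1 = sum((w[i][j] + w[j][i]) ** 2 for i, j in pairs) / 2
    s2 = sum(
        (sum(w[i][j] for j in range(m) if j != i)
         + sum(w[j][i] for j in range(m) if j != i)) ** 2
        for i in range(m)
    )
    mean = sum(x) / m
    z = [v - mean for v in x]
    b2 = m * sum(v ** 4 for v in z) / sum(v ** 2 for v in z) ** 2
    e_i = -1.0 / (m - 1)
    numerator = (m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 ** 2)
                 - b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 ** 2))
    v_i = numerator / ((m - 1) * (m - 2) * (m - 3) * s0 ** 2) - e_i ** 2
    return e_i, v_i


def z_value(x, w):
    """Z(I) = [I - E(I)] / sqrt(V(I))."""
    e_i, v_i = permutation_moments(x, w)
    return (morans_i(x, w) - e_i) / math.sqrt(v_i)


# Hypothetical data: five regions in a row with arbitrary values.
w_line = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
x = [3.0, 1.0, 4.0, 1.5, 9.0]
values = [morans_i(list(p), w_line) for p in permutations(x)]
brute_mean = sum(values) / len(values)
brute_var = sum((v - brute_mean) ** 2 for v in values) / len(values)
e_i, v_i = permutation_moments(x, w_line)
print((e_i, v_i), (brute_mean, brute_var))
```

Full enumeration is only feasible for very small m; for realistic numbers of regions the closed-form moments are what make the z-test practical.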

When observations are counts, such as crimes, Xi in (1) often takes the form of case rate

as Xi = ni/ξi, where ni is the number of cases and ξi is the at risk population size in region


i. However, the homogeneous assumption under this specification may not be valid [11]. Since

loglinear models can relax this assumption, we can specify a loglinear model and use its deviance

residuals to test for spatial autocorrelation.

Suppose that the random count Ni with an observed count ni, i = 1, · · · ,m, follows a Poisson

distribution and assume that the counts Ni are independent. Suppose that a set of geographical

covariates are observed together with the count ni. Then, a loglinear model can be set out by

taking the observed geographical covariates as explanatory variables and the logarithm of the

at risk population size (e.g. log(ξi)) as an offset term. When the parameters are estimated by

maximum likelihood estimation, the estimated value n̂i of the expected count E(Ni) can be derived

and the conventional deviance residual ([1], p. 588) for region i is

$$
r_{i,d} = 2\,\mathrm{sign}(n_i - \hat{n}_i)\,[n_i\log(n_i/\hat{n}_i) - n_i + \hat{n}_i]^{1/2}, \tag{4}
$$

where sign(·) is the sign function defined as sign(a) is 1 if a > 0, is 0 if a = 0, and is −1 if a < 0.
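Equation (4) can be sketched directly in Python. The function name and the example values below are hypothetical. The code follows (4) as printed; because Moran’s I is unchanged when all residuals are multiplied by the same positive constant, a convention that places the factor 2 inside the square root would lead to the same value of the test statistic.

```python
import math


def deviance_residual(n, n_hat):
    """Deviance residual of equation (4) for one region."""
    # n * log(n / n_hat) is taken as 0 when n == 0, its limiting value.
    log_term = n * math.log(n / n_hat) if n > 0 else 0.0
    inner = log_term - n + n_hat
    sign = (n > n_hat) - (n < n_hat)  # 1, 0 or -1
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    return 2 * sign * math.sqrt(max(inner, 0.0))


# Hypothetical observed counts against an expected count of 100.
above = deviance_residual(120, 100.0)   # more cases than expected: positive
exact = deviance_residual(100, 100.0)   # observed equals expected: zero
below = deviance_residual(0, 5.0)       # no cases observed: negative
print(above, exact, below)
```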

The concepts and statistical properties of deviance residuals in loglinear models are well established. We can readily extend these concepts to spatial statistics. Note that the numerator of Moran’s I is a martingale if the Xi are independent with mean 0. When Xi = ri,d with ri,d given in (4) and n̂i is replaced by the expected count E(Ni), X1, · · · , Xm are independent and E(Xi) is almost 0 if E(Ni) is large (e.g. E(Ni) > 5). When E(Ni) is estimated by n̂i from a loglinear model, then under the model assumption, n̂i is a consistent estimator of E(Ni), and the joint distribution of (r1,d, · · · , rm,d) is approximately normal with mean 0 and a variance-covariance matrix that is an orthogonal projection matrix [1, 30], denoted by Pm. For a fixed number of covariates when m is large, the orthogonal projection matrix Pm is almost equivalent to the identity matrix since the dimension of the kernel space of the projection matrix is equal to the number of covariates. When m goes to infinity, r1,d, · · · , rm,d are approximately independent and the asymptotic normality of Z(I) can be proven by a martingale approximation of the numerator of Moran’s I with an application of the Martingale Central Limit Theorem ([9], p. 475). In addition, one must also consider the convergence of the permutation mean and variance of Moran’s I in this scenario [32].

This particular asymptotic formulation of the deviance residuals is analogous to that of regression residuals ([14], p. 198). It is noted that deviance residuals are very flexible in loglinear models, and they reflect categorical structure (in this case spatial structure) while controlling for potentially heterogeneous population sizes ([1], p. 495). We can similarly test deviance residuals for spatial autocorrelation by specifying a loglinear model. Since a loglinear model, such as a log-rate model, can incorporate geographic (or ecological) covariates, we can test its residuals for spatial autocorrelation in the presence or absence of ecological covariates. A natural approach is to apply the random permutation test so that Moran’s I based on the deviance residuals of a loglinear model is analogous to Moran’s I based on residuals of a regression model [6].

Given that deviance residuals are approximately multivariate normal, we can test spatial autocorrelation

of the residuals by replacing Xi in (1) with ri,d in equation (4), and we label it IDR.

The mean and variance of IDR can be identically derived from the random permutation scheme

of the conventional Moran’s I as given by equations (2) and (3) respectively. To implement IDR,

we can simply estimate the expected counts under the null model with the intercept only, which

indicates that n̂i = ξi(n/ξ) with n = ∑_{i=1}^{m} ni and ξ = ∑_{i=1}^{m} ξi. In this case, the i-th deviance

residual ri,d can be derived by inserting n̂i in (4). If IDR is positive and significant, it suggests

spatial clustering, which can either be contributed by a first-order clustering trend or a few local

clusters. We can detect clustering contributions by applying spatial association models [23, 24].

First, a number of spatial association terms are added to the null model. Then, the parameter

estimates together with residuals are derived in the model fitting process. The existence of spatial

autocorrelation is tested again via IDR for the model residuals. If IDR is significant in the null

model but not significant in the model with local association terms, the significance found in the

null model is likely to be accounted for by the local association terms. If a few spatial association

terms cannot reduce the significance of IDR from the null model, it suggests the existence of a

first-order global clustering tendency.


Under the assumption that there is a local cluster in the study area, a loglinear model with a

spatial association term is:

log(λi) = log(ξi) + β0 + β1di(j) (5)

where λi = E(Ni), β0 is the grand mean and β1 is the unknown parameter for the spatial association

term defined by di(j) in which di(j) is 1 if location i is believed to be in a cluster centered at unit

j, and otherwise di(j) is 0. We test if β1 significantly differs from 0. The significance of the local

association term is determined by its p-value via the likelihood ratio test over the null model

without the spatial association term. Likewise, the contribution of the spatial association term to

IDR is determined by comparing the p-value of IDR with and without the term. If the coefficient

of β1 is positively significant, then the local cluster is a hot spot. If the coefficient is negatively

significant, then the local cluster is a cool spot.
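When di(j) is a 0/1 indicator, the maximum likelihood fit of model (5) has a closed form: the fitted rate is the pooled rate inside the cluster and the pooled rate outside it. The sketch below relies on that fact; the function name, the toy counts and the populations are hypothetical, and the p-value uses the chi-square approximation with one degree of freedom.

```python
import math


def fit_cluster_model(counts, pops, d):
    """Fit log(lambda_i) = log(xi_i) + beta0 + beta1 * d_i for a 0/1 indicator d."""
    n_in = sum(n for n, di in zip(counts, d) if di == 1)
    n_out = sum(n for n, di in zip(counts, d) if di == 0)
    xi_in = sum(p for p, di in zip(pops, d) if di == 1)
    xi_out = sum(p for p, di in zip(pops, d) if di == 0)
    if n_in == 0 or n_out == 0:
        raise ValueError("closed form needs positive counts inside and outside")
    rate_in, rate_out = n_in / xi_in, n_out / xi_out
    rate_null = (n_in + n_out) / (xi_in + xi_out)
    beta0 = math.log(rate_out)
    beta1 = math.log(rate_in / rate_out)
    # Likelihood ratio statistic against the intercept-only (null) model.
    g2 = 2.0 * (n_in * math.log(rate_in / rate_null)
                + n_out * math.log(rate_out / rate_null))
    g2 = max(g2, 0.0)
    p_value = math.erfc(math.sqrt(g2 / 2.0))  # upper tail of chi-square with 1 df
    fitted = [p * (rate_in if di == 1 else rate_out) for p, di in zip(pops, d)]
    return beta0, beta1, g2, p_value, fitted


# Hypothetical data: six regions, the first two assumed to form the cluster.
counts = [30, 28, 10, 12, 9, 11]
pops = [100.0] * 6
d = [1, 1, 0, 0, 0, 0]
beta0, beta1, g2, p_value, fitted = fit_cluster_model(counts, pops, d)
print(beta1, g2, p_value)
```

A positive `beta1` with a small p-value corresponds to a hot spot in the sense used above. The fitted counts can then be passed to the deviance residual formula (4) to recompute IDR with the association term included.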

Besides the above likelihood ratio test, we can gauge the contribution of β1 for the local cluster

by comparing IDR results with and without the β1 term in model (5). If IDR is not significant when

the spatial association term is included, then the clustered effect in the null model is sufficiently

removed by the association term. If the inclusion of spatial association term in model (5) does not

change the significance level of IDR in the null model, then the clustering effect remains. To further

improve model fit and to identify the explained clustering effect, one can either refine the spatial

association term already in the model or include another spatial association term. Finally, if the

existence of a local cluster is accompanied by a first-order global clustering trend, the likelihood

ratio test may still be significant by the inclusion of the local association term, but it is unlikely

to reduce the p-value of IDR from a significant level to a non-significant level.


3 Simulation Assessment of IDR

We designed Monte Carlo simulation experiments to assess the effectiveness of the model-based test

IDR under population heterogeneity. Type I error rates were evaluated under the null hypothesis

of homogeneous rates with heterogeneous population sizes, while spatial cluster modeling was

evaluated in the presence and absence of first-order global clustering trend. All the simulation

experiments were based on a 10 × 10 lattice with wij being defined by the rook rule of spatial

adjacency. We set the significance level of α = 0.05 to assess the rejection rates of IDR in each set

of simulations. In the presence of a local cluster, a residual plot was also furnished to facilitate

the evaluation process.

In addition to IDR, we included the original Moran’s I by converting counts to rates, and

denoted it by Ir which is defined by letting xi = ri = ni/ξi in (1). Previous studies have demonstrated

that Ir is sensitive to heterogeneous populations, and the inclusion of Ir was to serve as

a baseline for comparison. We also included the Empirical Bayesian Index (EBI) denoted by

IEBI , a population-adjusted Moran’s I proposed by Assuncao and Reis [5]. IEBI is found to be

effective for adjusting population sizes in the presence of population heterogeneity, and it has

been included in GeoDa, a popular spatial exploratory data analysis freeware [4]. However, IEBI

is not a model-based test and it cannot include ecological covariates. This can be seen from the

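definition, in which zi = ri,EBI = (pi − b)/√νi, where pi = ni/ξi, b = n/ξ, νi = a + b/ξi, a = s² − b/(ξ/m) and s² = ∑_{i=1}^{m} ξi(pi − b)²/ξ. Hence,

$$
I_{EBI} = \frac{\sum_{i=1}^{m}\sum_{j\neq i} w_{ij}\left(\frac{p_i-b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)\left(\frac{p_j-b}{\sqrt{\nu_j}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)}{\left[\sum_{i=1}^{m}\left(\frac{p_i-b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)^2/m\right]\left[\sum_{i=1}^{m}\sum_{j=1,j\neq i}^{m} w_{ij}\right]}. \tag{6}
$$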

Assess IDR for population heterogeneity. Following the simulation studies of Walter [36]

and Assuncao and Reis [5], we compared the type I error probabilities of IDR, IEBI and Ir based

on Monte Carlo simulations. Walter [36] reported that densely populated areas with a pocket of

sparsely populated areas could cause an excessive type I error probability for Ir. To represent this

pattern, we generated a relatively low population of 10^6(1 − η)^2 for lattice points within a 2-unit

circle centered at (3, 3), and 10^6 for others. The η value indexes population heterogeneity from 0

to 0.8 in increments of 0.04. When η = 0, all the populations were homogeneous.

As η gets closer to 1, the populations are increasingly heterogeneous.
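The lattice set-up described above can be sketched as follows. This is a minimal illustration under stated assumptions: lattice coordinates are taken to run from 1 to 10, points exactly on the circle boundary are counted as inside, and the function names are ours.

```python
def rook_weights(size):
    """0/1 rook adjacency matrix for a size x size lattice in row-major order."""
    m = size * size
    w = [[0] * m for _ in range(m)]
    for r in range(size):
        for c in range(size):
            i = r * size + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < size and 0 <= cc < size:
                    w[i][rr * size + cc] = 1
    return w


def populations(eta, size=10, centre=(3, 3), radius=2.0, base=1e6):
    """Population base*(1-eta)^2 inside the circle and base elsewhere."""
    pops = []
    for r in range(1, size + 1):          # assumed coordinates 1..size
        for c in range(1, size + 1):
            inside = (r - centre[0]) ** 2 + (c - centre[1]) ** 2 <= radius ** 2
            pops.append(base * (1 - eta) ** 2 if inside else base)
    return pops


w = rook_weights(10)
pops = populations(eta=0.5)
n_low = sum(1 for p in pops if p < 1e6)
print(sum(map(sum, w)), n_low)
```

Under these assumptions the low-population pocket covers 13 of the 100 lattice points, so most of the study area keeps the baseline population.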

Based on the above population patterns, we generated independent Poisson random variables

with the mean value being 10^−4 times the population size for each lattice point. Since identical

rates were expected across all lattice points, there should be no spatial clustering. The rejection

rate, therefore, should reflect the type I error probability of the spatial autocorrelation test. For

each η value, we calculated type I error probabilities based on 10,000 simulations and resultant z

values. The results (Figure 1) show that both IDR and IEBI were able to account for population

heterogeneity with an almost identical type I error probability around 0.05 for all η values. The

type I error probability of Ir, however, was only acceptable when η is small with little variation

in population sizes. As η increased, the type I error rates also increased. When population sizes

varied substantially (η = 0.8), the rejection rate was as high as 25%, a result consistent with

Walter’s simulations.

Assess IDR for local cluster detection. Based on the previous simulation result, we devised

a fixed heterogeneous population pattern: the population was 10^5 if a point on the lattice was

within the circle and the population was 10^6 otherwise. We generated independent Poisson random

variables with the mean equal to 0.0001 times the population size for each lattice point. We then

inserted a 2-unit circle for a cluster effect centered at (7, 7), and set the mean equal to 0.0001(1+δ)

times the population. The δ value represented the strength and direction of the cluster effect, and

it increased from −0.8 to 0.8 with 0.04 in each step. If δ < 0, the circle represented a low-value

cluster; if δ > 0, it represented a high-value cluster.

Again, based on 10,000 simulations for each δ value under population heterogeneity, the rejection

rates of IDR with and without the spatial association term from model (5) are shown in

Figure 2. The rejection rate without the spatial association term indicates the statistical power of

IDR, while the rate with the spatial association term indicates the effectiveness of the model-based

test for a high-value or low-value cluster. If the model-based test is effective, the test statistic

should no longer be significant when the spatial association term that covers the exact circle is

included. The results show that IDR under the null model had a reasonable power (Figure 2).

When δ values were around 0, the rejection rate was around 0.05. When the absolute δ values

were greater than 0.25, the rejection rates were about 15%. When the δ value reached −0.8 or a

cool spot, the rejection rate was almost 100%. When δ value reached 0.8, the rejection rate was

about 85%. Both results suggest that IDR under the null model is likely to be significant when

there is a strong local cluster.

However, when the cluster tendency was accounted for by the spatial association term, the

rejection rates were consistently around 0.05, suggesting that IDR was unlikely to be significant

when a spatial association term absorbed the cluster effect. Since the relative risks within the

cluster were all similarly higher or lower than the rest of the area in our simulations, once its

effect was removed by the spatial association term, the study area became spatially independent,

a result consistent with previous simulations in the spatial logit association model [23].

The effect of the spatial association term can be illustrated by the residual and QQ-normal

plots from a single simulation. The upper panel of Figure 3 displays the results under the null

model. The IDR test had a p-value of 0.0001 primarily due to a number of extremely high deviance

residuals from the clustered area. Likewise, the QQ-plot shows that a number of high values are

concentrated in the upper tail, suggesting the existence of extreme values. The lower panel in

Figure 3 shows that once the spatial association term was added to the model, the effect of extremely

large residual values in the null model disappeared, and the p-value of IDR rose to

0.12 with evenly distributed residuals. This result is also corroborated by the QQ-plot with all

the values along a straight line.

Assess IDR for a local cluster in the presence of first-order global clustering. It is

known that a local cluster and a first-order clustering trend can operate simultaneously. In the

presence of global clustering, it is often necessary to first de-trend before fitting a spatial regression


model [2, 12]. We intend to evaluate the performance of IDR in this situation by generating the

global spatial structure from a log-normal distribution, and by inserting a local cluster from the

previous simulation with δ = 2.0 in the simulation. If the local test is insensitive to the first-order

clustering tendency, then it indicates the existence of global clustering.

In the simulation process, we first generated 100 independent and identically distributed (iid)

N(0, 1) random variables, denoted by ε = (ε1, ε2, · · · , ε100). Next, we calculated a vector u by

letting u = (I − ρW)^{−1}ε with ρ increasing from 0 to 0.2 in steps of 0.01 such that u

satisfied

u = ρWu + ε,

where ρ is the coefficient of the global spatial association [2, 5]. Third, we let λ = 0.0001(1 + 2d)e^u,

where λ = (λ1, · · · , λ100) was the vector of Poisson intensity for generating counts. We generated a

conditionally independent Poisson random variable Ni with parameter λi times the i-th population

size. When ρ = 0, there was only a local cluster in the simulated pattern, and when ρ ≠ 0,

there were both local and global clustering tendencies in the simulation pattern. We assess the

effectiveness of IDR by comparing the rejection rates of IDR with and without the spatial association

term.
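The generation of u and λ described above can be sketched without matrix inversion. The code below solves u = ρWu + ε by fixed-point iteration, which is a substitute for computing (I − ρW)^{−1}ε directly and is not necessarily how the original simulations were run. It converges here because every lattice point has at most four rook neighbours and ρ ≤ 0.2, so each pass shrinks the error by a factor of at most 0.8. The seed, function names and lattice coordinates (1 to 10) are assumptions.

```python
import math
import random


def rook_neighbours(size):
    """Index lists of rook neighbours on a size x size lattice (row-major order)."""
    nb = []
    for r in range(size):
        for c in range(size):
            cells = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
            nb.append([rr * size + cc for rr, cc in cells
                       if 0 <= rr < size and 0 <= cc < size])
    return nb


def solve_sar(eps, neighbours, rho, tol=1e-12, max_iter=10000):
    """Solve u = rho * W u + eps by fixed-point iteration."""
    u = list(eps)
    for _ in range(max_iter):
        new_u = [eps[i] + rho * sum(u[j] for j in neighbours[i])
                 for i in range(len(eps))]
        if max(abs(a - b) for a, b in zip(new_u, u)) < tol:
            return new_u
        u = new_u
    raise RuntimeError("fixed-point iteration did not converge")


rng = random.Random(2007)                 # arbitrary seed for reproducibility
neighbours = rook_neighbours(10)
eps = [rng.gauss(0.0, 1.0) for _ in range(100)]
rho = 0.15
u = solve_sar(eps, neighbours, rho)

# d marks the assumed 2-unit circle centred at (7, 7), coordinates 1..10.
d = [1 if (r - 7) ** 2 + (c - 7) ** 2 <= 4 else 0
     for r in range(1, 11) for c in range(1, 11)]
lam = [0.0001 * (1 + 2 * d[i]) * math.exp(u[i]) for i in range(100)]
residual = max(abs(u[i] - rho * sum(u[j] for j in neighbours[i]) - eps[i])
               for i in range(100))
print(residual)
```

Each λi would then be multiplied by the i-th population size to give the Poisson mean for region i.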

Based on 10,000 simulations for each ρ value, the results (Figure 4) showed that the spatial

association term was unable to reduce the clustering effect except when the global clustering trend

was very weak. For instance, when ρ = 0, the rejection rate for IDR in the null model was about

28%, and it suggested spatial clustering. When the spatial association term was included in this

case, the local cluster tendency was reduced similar to the previous simulation. As the global

clustering trend ρ increased, the rejection rates of IDR also increased, and the two curves with and

without the association term were likely to be significant for even a modest increase in ρ. The

inclusion of the spatial association term had little effect on removing a local clustering effect in the

presence of the global clustering tendency. It further suggests that even when an association term

might be significant in terms of the likelihood ratio test, the local effect might not be trustworthy,

because the global effect overshadowed the local effect.

Figure 5 displays the residual and QQ-plots of the deviance residuals with and without the

spatial association term from a single simulation run (ρ = 0.15). Evidently, there were only a

few large deviance residuals in absolute values, and they were not clumped together. This pattern

is in sharp contrast with the one in Figure 3. In addition, the p-values of IDR with and without

using the spatial association term were very close: 0.0003 with the spatial association term and

0.0001 without. These results suggest that the inclusion of a local association term is unlikely to

reduce the significance of IDR because of the overall global clustering effect.

In summary, IDR is effective in reducing type I error probabilities of the traditional Moran’s

I due to heterogeneous population sizes, and its performance is comparable to that of IEBI . An

advantage of IDR over IEBI is its ability to include ecological or other spatial covariates. When a

significant IDR is contributed mainly by a local cluster, we can devise a spatial association term to

remove the cluster effect, so that the spatial autocorrelation observed in the null model would not

be significant anymore. The exact form of association term can be determined either by a stepwise

regression method [23] or from an exploratory method, such as deviance residual plots. Since IDR

is sensitive to the existence of local clusters but not sensitive to the presence of the global trend,

the inclusion of a spatial association term in the IDR test can indicate whether a first-order global

clustering trend exists or not.

4 St. Louis Homicides Data analysis

In this section, we apply IDR to analyzing homicides in the St. Louis region. The data set was

originally analyzed by Messner et al. [25], and it is also included as part of exercises in GeoDa

[4], a simple spatial analysis package developed by Anselin and his associates. In the original

paper, homicide rates for 1984-1988 and 1988-1993 periods were analyzed at the county level, and

a number of local clusters including one centered at St. Louis City were identified by LISA. Here,


we can use the model based IDR to detect spatial clustering based on homicide incidents and the at

risk population. Analogous to LISA, we also included a local version of deviance residual Moran’s

I or deviance residual LISA denoted by IDR,i. The result of IDR,i was compared with the results

of the local versions of Ir and IEBI , denoted by Ir,i and IEBI,i respectively, where IDR,i, Ir,i and

IEBI,i can be defined according to the formula given by Anselin [3] as

$$
I_i = \frac{\sum_{j=1,j\neq i}^{m} w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^{m}(x_i-\bar{x})^2/m} \tag{7}
$$

by letting xi = ri,d, xi = ri and xi = ri,EBI respectively. All IDR,i, Ir,i and IEBI,i are able to provide

additional ways of exploratory spatial analysis for count data, such as cluster maps. However, only

IDR,i is able to provide an additional clustering analysis when a covariate variable is accounted

for.
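The local statistic in (7) can be sketched in the same style as the global one. The function name and the four-region example are hypothetical. With definitions (1) and (7), the local values add up to the global Moran’s I multiplied by the sum of all weights, which the example verifies.

```python
def local_morans_i(x, w):
    """Local Moran's I_i of equation (7) for every region i."""
    m = len(x)
    mean = sum(x) / m
    z = [v - mean for v in x]
    var = sum(v * v for v in z) / m
    return [z[i] * sum(w[i][j] * z[j] for j in range(m) if j != i) / var
            for i in range(m)]


# Hypothetical toy layout: four regions in a row.
w_line = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
local = local_morans_i([1.0, 2.0, 3.0, 4.0], w_line)
s0 = sum(map(sum, w_line))
global_i = sum(local) / s0   # recovers the global statistic of equation (1)
print(local, global_i)
```

To obtain IDR,i, the input `x` would be the vector of deviance residuals from the fitted loglinear model.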

In the preliminary analysis, we found that IDR for the 1984-1988 period was 0.0319 with an

insignificant p-value of 0.2034, and IDR for the 1988-1993 period was 0.1137 with a significant

p-value of 0.0066. We, therefore, focused on the latter period.

Between 1988 and 1993, there were 2,650 homicides, and the average homicide rate was about

10 per 100,000. County populations in the study area vary substantially: St. Louis County was

the largest with more than one million residents, and five other counties that include St. Louis

City, St. Clair, Boone, Sangamon and Macon had at least 50,000 residents. To detect spatial

clustering for homicides, we first fitted the null model. The results from IDR indicated a significant

clustering tendency with the z-value of 2.72 and p-value of 0.0066. When we plotted the deviance

residuals by five equal intervals (Figure 6), St. Louis City was in the first interval with 40.22

deviance residual, St. Clair county was in the third interval (18.06), and there was no county

in the second interval. This indicated that St. Louis City was the only county that indicated

a high-value cluster surrounded by St. Clair, St. Louis and Madison counties. In addition, we

further plotted deviance residuals LISA by using GeoDa, and found that the standardized values

of IDR,i, Ir,i and IEBI,i were 17.5, 25.45 and 25.64 respectively when i indicated the St. Louis

county, and the values were 12.27, 13.93 and 14.01 respectively when i indicated the St. Clair

county. The values of the remaining counties were much lower than the values of those two counties.

The LISA plot also indicated that St. Louis and Madison counties were next to high-valued

counties, presumably the two very high-valued counties, St. Louis City and St. Clair.

Based on the above information, we decided not to adopt the spatial association term that assigns

equal contribution to the clustered effect. We refined the shape of the cluster by examining each

individual residual within the adjacent counties, and devised a spatially varied association term

to capture the magnitude of residual variation within a cluster.

Based on the principle of the uniform association model [1], a large residual value should

correspond to a large di(j) value, and a relatively small residual value should correspond to a small

di(j) value. When a neighbor county has an ignorable absolute residual value, it can be dropped

from the spatial specification. From the five equal interval classification, St. Louis City was in

the first interval, St. Clair in the third, and Madison and St. Louis in the fourth interval. We

assigned, accordingly, 4 to St. Louis City, 2 to St. Clair, and 1 to Madison and St. Louis counties,

and this assignment could be achieved automatically in our search algorithm because standard

intervals were used. The results show that the model with the spatial association term was highly

significant, contributing to a reduction of about 2,610 in deviance, from the null model of 2,944

to the alternative model of 334. In the meantime, the coefficient of the spatial association term

0.6145 indicates a high-value cluster, and its inclusion changed the p-values of IDR from 0.0066 in

the null model to 0.9094 in the alternative model. It suggests that the spatial association term

can remove the effect of the local cluster, and there was no global clustering trend. In addition,

if we inspect the deviance residuals individually for the 4 counties, we could see 40, 20, and −5.

Based on this information, we further experimented with assigning 4 to St. Louis City, 2.5 to

St. Clair, and 1 to Madison and St. Louis counties (Model II), and this assignment could further

reduce the deviance to 177 with a z-value of 0.3920 for IDR. In both cases, the values of IDR,i

decreased substantially to a very low level for St. Louis and St. Clair counties, so that almost

none remained significant throughout the region at the 0.05 probability level after we adjusted

for the multiple testing problem of 78 units by Bonferroni's method (see [28], p. 153).
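The score assignment described above can be sketched as follows. This is a minimal illustration, not the search algorithm itself: it assumes that the score equals the number of intervals minus the interval rank counted from the top (so the first interval maps to 4, the third to 2, and the fourth to 1, consistent with the assignment reported here), it takes the residual range from the legend of Figure 6, and it uses hypothetical residual values chosen only to fall in the stated intervals.

```python
def equal_interval_breaks(lo, hi, k=5):
    """Return the k + 1 break points of k equal-width intervals on [lo, hi]."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def interval_rank(value, breaks):
    """Rank of the interval containing value, counted from the top (1 = highest)."""
    k = len(breaks) - 1
    for rank in range(1, k + 1):
        if value >= breaks[k - rank]:
            return rank
    return k  # anything below the lowest break is placed in the bottom interval


def association_score(value, breaks):
    """Assumed rule: score = k - rank, so rank 1 -> 4, rank 3 -> 2, rank 4 -> 1."""
    k = len(breaks) - 1
    return k - interval_rank(value, breaks)


# Range of the null-model deviance residuals, read from the Figure 6 legend.
breaks = equal_interval_breaks(-10.3, 40.22, k=5)

# Hypothetical residuals, chosen only so that each unit lands in the interval
# reported in the text (first, third, fourth, and a unit outside the cluster).
residuals = {
    "St. Louis City": 40.0,
    "St. Clair": 20.0,
    "Madison": 5.0,
    "St. Louis": 5.0,
    "elsewhere": -5.0,
}
scores = {name: association_score(r, breaks) for name, r in residuals.items()}
print(scores)
```

Units that receive a score of 0 under this rule contribute nothing to the association term, which matches the idea of dropping neighbors with negligible residuals.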

It is worth noting that odds ratios can be used to describe the shape of a cluster. For instance,

the odds ratio of e^{0.6145} = 1.849 in Model I between St. Louis county and other counties

indicates that St. Louis county was 1.849 times as likely as other counties to have a homicide.

Similarly, St. Louis City would be e^{4×0.6145} = 11.68 times and St. Clair would be

e^{2×0.6145} = 3.4178 times as likely.
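The odds ratios quoted above follow directly from the Model I coefficient and the assigned scores. A short sketch of the arithmetic, with the coefficient taken from Table 1 and the scores from the text (the dictionary layout is only an illustrative choice):

```python
import math

beta = 0.6145  # spatial association coefficient of Model I (Table 1)

# Scores assigned to the units in the cluster under Model I.
scores = {"St. Louis City": 4, "St. Clair": 2, "St. Louis": 1, "Madison": 1}

# Odds ratio relative to a county with score 0: exp(score * beta).
odds_ratios = {name: math.exp(score * beta) for name, score in scores.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: exp({scores[name]} x {beta}) = {ratio:.4f}")
```

Because the scores enter the linear predictor multiplicatively with a single coefficient, the odds ratio for a unit with score s is the score-1 odds ratio raised to the power s.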

Alternatively, we can use geographical covariates to explain the detected clustering tendency.

For instance, it is known that St. Louis City had a high concentration of Blacks. We obtained the

percentage of Blacks from the 1990 census for all the 78 counties and used it as an ecological co-

variate in place of the spatial association term. The results (Table 1 last row) show the percentage

of Blacks was positively associated with the likelihood of homicides in the study area. The ecolog-

ical model performed slightly better than the spatial association model in terms of the likelihood

ratio test, i.e., smaller deviance with the same number of degrees of freedom. In addition, when the

ecological variable was included, the p-value of 0.4902 for IDR was not significant, suggesting that

there was no spatial autocorrelation anymore. This result implies that the St. Louis City cluster

detected by the association term can be explained by the percentage of Blacks in the case study.

Either an ecological variable or a spatial association term can yield useful information

for describing and quantifying a detected cluster.

5 Concluding Remarks

In this paper, we have specified and evaluated IDR as a loglinear model-based Moran’s I test

for Poisson count data that resembles the Moran’s I residuals test for Gaussian data. Based on

previous studies, we pointed out that loglinear residuals are not only asymptotically normal, but

also applicable to the permutation test of Moran’s I for a correctly specified model. We evaluated

type I and type II error rates via simulations, and found that IDR was effective in accounting for

heterogeneous population sizes, and in detecting a local cluster in the absence of a global trend. In

the presence of a global trend, the power of detecting a local cluster was very weak, a problem

that also exists for a continuous dependent variable in a linear regression model [12].
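The general idea summarized in this paragraph, a permutation test of Moran's I applied to the deviance residuals of a Poisson loglinear model, can be sketched as below. This is a generic illustration under stated assumptions rather than the exact statistic used in the paper: the null model here is simply a constant rate applied to heterogeneous populations, the ten units sit on a line with first-order adjacency weights, and all counts and populations are hypothetical.

```python
import math
import random


def poisson_deviance_residuals(y, mu):
    """Signed square roots of the unit-level Poisson deviance contributions."""
    residuals = []
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        d = 2.0 * (log_term - (yi - mi))
        residuals.append(math.copysign(math.sqrt(max(d, 0.0)), yi - mi))
    return residuals


def morans_i(x, w):
    """Moran's I of values x under a spatial weight matrix w (list of lists)."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in z)
    return (n / s0) * num / den


def permutation_test(x, w, n_perm=999, seed=0):
    """Two-sided permutation p-value: shuffle x over the units and recompute I."""
    rng = random.Random(seed)
    observed = morans_i(x, w)
    shuffled = list(x)
    stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stats.append(morans_i(shuffled, w))
    center = sum(stats) / n_perm
    extreme = sum(abs(s - center) >= abs(observed - center) for s in stats)
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical data: ten units on a line, with elevated counts in units 4 to 6.
pop = [1000, 2000, 1000, 2000, 1000, 2000, 1000, 2000, 1000, 2000]
y = [3, 8, 2, 10, 20, 50, 22, 8, 3, 4]

# Null model: one common rate, so expected counts scale with population size.
rate = sum(y) / sum(pop)
mu = [rate * p for p in pop]

# First-order adjacency weights along the line (an assumption of this sketch).
n = len(y)
w = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

residuals = poisson_deviance_residuals(y, mu)
observed, p_value = permutation_test(residuals, w)
print(f"Moran's I of deviance residuals = {observed:.3f}, permutation p = {p_value:.3f}")
```

Working with residuals rather than raw counts or rates is what lets the expected counts absorb the heterogeneity in population sizes before spatial association is assessed.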

In the case study, we extended Lin’s [23] spatial association model that emphasizes equal

contributions among spatial neighbors to an ordered or uniform spatial association model that

captures spatially varied contributions among spatial neighbors within a cluster. This model has

several advantages. First, it makes use of exploratory tools such as Moran’s I scatter plots and

residual plots to evaluate the magnitude of deviance residuals. Second, cluster shape can be

determined in terms of its geographic coverage and its slope via odds ratios. In other words,

a 3-dimensional cluster that spatially varies in terms of its magnitude can be derived by the

spatially varied association term. Third, this analysis can be extended to probit, logit [6] and

other limited dependent variables under the loglinear framework. Finally, our model-based IDR

test is complementary to recent developments in residual-based spatial statistical approaches [8].

Future research should extend IDR to other test statistics, such as Getis-Ord’s G [20] and

Geary’s c [18], and assess their effectiveness for various spatial problems. Likewise, there are

many conventional methods for modeling categorical associations, and we should examine their

effectiveness for constructing a spatially varied association term, and for specifying various forms

of loglinear models in the context of spatial analysis. The current study does not offer any

de-trending methods for use in the presence of a global trend, and how to de-trend while locating

and explaining local clusters remains a challenging issue. Finally, as with other model-based

tests, the result of IDR can be misleading when the model is mis-specified; criteria for a

correctly specified model should therefore be established for spatial loglinear models.

Acknowledgements: The authors would like to thank a reviewer for the detailed comments

and suggestions, which have substantially improved the quality of the paper.

References

[1] Agresti, A. (2002). Categorical Data Analysis. Wiley, New York.

[2] Anselin, L. (1990). Spatial dependence and spatial structural instability in applied regression

analysis. Journal of Regional Science, 30, 185-207.

[3] Anselin, L. (1995). Local indicators of spatial association-LISA. Geographical Analysis, 27,

93-115.

[4] Anselin, L., Syabri, I. and Kho, Y. (2006). GeoDa: An introduction to spatial data analysis.

Geographical Analysis, 38, 5-22.

[5] Assuncao, R. and Reis, E. (1999). A new proposal to adjust Moran’s I for population density.

Statistics in Medicine, 18, 2147-2162.

[6] Apanasovich, T. V., Sheather, S., Lupton, J. R., Popovic, N., Turner, N. D., Chapkin, R. S.,

Braby, L. A. and Carroll, R. J. (2003). Testing for spatial correlation in nonstationary binary

data, with application to aberrant crypt foci in colon carcinogenesis. Biometrics, 59, 752-761.

[7] Bennett, R. J. and Haining, R. P. (1985). Spatial structure and spatial interaction modeling

approaches to the statistical analysis of geographic data. Journal of the Royal Statistical Society

A, 148, 1-36.

[8] Baddeley, A., Turner, R. and Hazelton, M. (2005). Residual analysis for spatial point processes.

Journal of the Royal Statistical Society B, 67, 617-666.

[9] Billingsley, P. (1995). Probability and Measure, Wiley, New York.

[10] Brunsdon, C., Aitkin, M., Fotheringham, S. and Charlton, M. (1999). A comparison of ran-

dom coefficient modeling and geographically weighted regression for spatial non-stationary

regression problems. Geographical and Environmental Modelling, 3, 47-62.

[11] Besag, J. and Newell, J. (1991). The detection of clusters in rare diseases. Journal of the Royal

Statistical Society A, 154, 143-155.

[12] Cressie, N. (1993). Statistics for spatial data, Wiley, New York.

[13] Cliff, A. D. and Ord, J. K. (1972). Test for spatial autocorrelation among regression residuals.

Geographical Analysis, 4, 267-284.

[14] Cliff, A. D. and Ord, J. K. (1981). Spatial Processes: Models And Applications, Pion, London.

[15] Fingleton, B. (1983b). Loglinear models with dependent spatial data. Environment and

Planning A, 15, 801-813.

[16] Fotheringham, S. (1997). Trends in quantitative geography: I: stressing the local. Progress in

Human Geography, 21, 88-96.

[17] Fotheringham, S. (1999). Guest editorial: local modeling. Geographical and Environmental

Modelling, 3, 5-7.

[18] Geary, R. C. (1954). The contiguity ratio and statistical mapping. The Incorporated Statisti-

cian, 5, 115-145.

[19] Getis, A. and Aldstadt, J. (2004). Constructing the spatial weights matrix using a local

statistic. Geographical Analysis, 36, 90-104.

[20] Getis, A. and Ord, J. (1992). The analysis of spatial association by use of distance statistics.

Geographical Analysis, 24, 189-206.

[21] Griffith, D. (2002). A spatial filtering specification for the auto-Poisson model. Statistics and

Probability Letters, 58, 245-251.

[22] Lee, S. I. (2004). A generalized significance testing method for global measures of spatial

association: an extension of the Mantel test. Environment And Planning A, 36, 1687-1703.

[23] Lin, G. (2003). A spatial logit association model for cluster detection. Geographical Analysis,

35, 329-340.

[24] Lin, G. and Zhang, T. (2005). Loglinear residual tests of Moran’s I autocorrelation and their

applications to Kentucky Breast Cancer Data. Geographical Analysis, to appear.

[25] Messner, S., Anselin, L., Baller, R., Hawkins, D., Deane, G. and Tolnay, S. (1999). The

spatial patterning of county homicide rates: an application of exploratory spatial data analysis.

Journal of Quantitative Criminology, 15, 423-450.

[26] Moran, P. A. P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical

Society Series B, 10, 243-251.

[27] Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37, 17-23.

[28] Neter, J., Kutner, M. H., Nachtsheim, C. and Wasserman, W. (1996). Applied Linear Statis-

tical Models, 4th Edition, McGraw Hill, New York.

[29] Oden, N. (1995). Adjusting Moran’s I for population density. Statistics in Medicine, 14, 17-26.

[30] Pierce, D. and Schafer, D. (1986). Residuals in Generalized linear models. Journal of American

Statistical Association, 81, 977-986.

[31] Rogerson, P. A. (1999). The detection of clusters using a spatial version of the chi-square

goodness-of-fit statistics. Geographical Analysis, 31, 130-147.

[32] Sen, A. (1976). Large sample-size distribution of statistics used in testing for spatial

correlation. Geographical Analysis, 9, 175-184.

[33] Sokal, P. R., Oden, N. L. and Thomson, B. A. (1998). Local spatial autocorrelation in a

biological model. Geographical Analysis, 30, 411-432.

[34] Tango, T. (1995). A class of tests for detecting general and focused clustering of rare diseases.


[35] Waldhor, T. (1996). The spatial autocorrelation coefficient Moran’s I under heteroscedasticity.


[36] Walter, S. D. (1992). The analysis of regional patterns in health data. American Journal of

Epidemiology, 136, 730-741.

[37] Whittemore, A., Friend, N., Brown, B. and Holly, E. (1987). A test to detect clusters of

disease. Biometrika, 74, 631-635.

[38] Wrigley, N. (1985). Categorical Data Analysis for Geographers and Environmental Scientists.

Longman, New York.

[Figure 1 plot: rejection rate (vertical axis, 0.00 to 0.25) against η (horizontal axis, 0.0 to 0.8), with curves for Ir, IEBI and IDR.]

Figure 1: Type I error rates of Ir, IDR and IEBI under heterogeneity (α = 0.05).

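[Figure 2 plot, titled "Local Cluster": rejection rate (vertical axis, 0.0 to 1.0) against δ (horizontal axis, −0.5 to 0.5), with curves for IDR without and IDR with the spatial association term.]

Figure 2: Rejection rate of IDR with and without the spatial association term (α = 0.05).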

[Figure 3 plots, four panels: "Residual Plot: Without" and "Residual Plot: With" (deviance residuals against index, 0 to 100); "QQ-plot: Without" and "QQ-plot: With" (sample quantiles against theoretical quantiles, −2 to 2).]

Figure 3: Residual plots and QQ-plots in the presence of a local cluster (δ = 0.5).

[Figure 4 plot, titled "Global and Local Trend": rejection rate (vertical axis, 0.0 to 1.0) against δ (horizontal axis, 0.00 to 0.20), with curves for IDR without and IDR with the spatial association term.]

Figure 4: Power functions of IDR with and without the spatial association term.

[Figure 5 plots, four panels: "Residual Plot: Without" and "Residual Plot: With" (deviance residuals, −20 to 40, against index, 0 to 100); "QQplot: Without" and "QQplot: With" (sample quantiles, −20 to 40, against theoretical quantiles, −2 to 2).]

Figure 5: Residual plots and QQ-plots in the presence of local and global clustering structures.

[Figure 6 map: labeled counties are Madison, St. Clair and St. Louis. Legend (deviance, five equal intervals): −10.3 to −0.196; −0.196 to 9.908; 9.908 to 20.012; 20.012 to 30.116; 30.116 to 40.22. Scale bar in miles.]

Figure 6: Deviance residuals of the null model for St. Louis homicides.

Table 1: Loglinear model estimates and IDR results for St. Louis homicides: 1988-1993.

Models                                β̂1      p-value   G2       d.f.   IDR       p-value
Null                                                    2944.0   77     0.1137    0.0066
Spatial association I (St. Louis)     0.6145   0         334.2    76     −0.0352   0.9094
Spatial association II (St. Louis)*   0.6240   0         177.2    76     0.0132    0.7283
Ecological covariate (% of Blacks)    0.0554   0         301.9    76     0.0306    0.4902

Note: variables captured by β̂1 are in parentheses. Model I assigns 4 to St. Louis City, 2 to St. Clair

county, and 1 to the other adjacent counties; Model II differs by assigning 2.5 to St. Clair county.
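The deviance reductions implied by Table 1 can be checked with a few lines of arithmetic. The sketch below takes the G2 values from the table and treats each reduction from the null model as a likelihood ratio statistic on 1 degree of freedom (77 − 76); the helper `chi2_sf_1df` is a name introduced here for illustration and uses the identity between the 1-d.f. chi-square tail and the complementary error function.

```python
import math

null_deviance = 2944.0  # G2 of the null model, 77 d.f.

# G2 of each one-parameter alternative in Table 1 (76 d.f. each).
models = {
    "Spatial association I": 334.2,
    "Spatial association II": 177.2,
    "Ecological covariate": 301.9,
}


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


reductions = {}
for name, g2 in models.items():
    drop = null_deviance - g2  # likelihood ratio statistic against the null
    reductions[name] = drop
    print(f"{name}: reduction = {drop:.1f} on 1 d.f., p = {chi2_sf_1df(drop):.3g}")
```

All three reductions are far beyond any conventional critical value, so the tail probabilities underflow to zero in floating point, in line with the p-values of 0 reported for β̂1.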
