Chapter 8 – Spatial contiguity (geostatistic + lattice data)


Exploratory data analysis (EDA) involves methods of describing and visualizing data and their structure in order to formulate hypotheses and check the validity of assumptions. R is a particularly useful tool for EDA. In what follows, some common techniques are introduced.

Data posting – a map on which each data location is plotted along with its corresponding data value.

gigantemap.r(soil87.dat) # draw a location map
# soil87.dat holds 87 regularly spaced sample locations
text(soil87.dat$gx,soil87.dat$gy,as.character(soil87.dat$pHFsrf),col=4,cex=0.8)

[Figure: data posting map of the surface pH values (pHFsrf) at the 87 locations; x axis from 0 to 500, y axis from 0 to 800.]

Symbol map

It is similar to the data posting map, but each data value is replaced by a symbol that represents the class to which the value belongs.

z=soil87.dat$pHFsrf
summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4.48    5.23    5.47   5.499    5.83    6.97
# recode into three classes split at the 1st and 3rd quartiles;
# recoding in place is safe here only because the codes 1-3 lie below the smallest pH (4.48)
z[z<=5.23]=1
z[z>5.23&z<=5.83]=2
z[z>5.83]=3
gigantemap.r(soil87.dat)
text(soil87.dat$gx,soil87.dat$gy,as.character(z),col=4)

[Figure: symbol map of the three pH classes (1, 2, 3) at the 87 locations; x axis from 0 to 500, y axis from 0 to 800.]
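The class coding for a symbol map can also be done without overwriting the data. The sketch below is only illustrative: it uses simulated values as a stand-in for soil87.dat$pHFsrf (the real data are not reproduced here), and the names ph, q and ph_class are hypothetical.

```r
# Hypothetical stand-in for soil87.dat$pHFsrf (simulated, not the real data)
set.seed(1)
ph <- round(rnorm(87, mean = 5.5, sd = 0.5), 2)

# Class limits at the first and third quartiles, as in the symbol map above
q <- quantile(ph, probs = c(0.25, 0.75))

# (-Inf, Q1] -> 1, (Q1, Q3] -> 2, (Q3, Inf) -> 3; the raw values stay untouched
ph_class <- cut(ph, breaks = c(-Inf, q, Inf), labels = FALSE)

table(ph_class)
```

Because cut() returns the class codes in a new vector, the result does not depend on the class codes being smaller than the smallest data value.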

Contour map, image and 3-D plot

Contour/image maps are useful to reveal overall trends in the data. However, the

smoothing out of the variability in the data sometimes can be misleading.

Contour map (all of these are R functions):

1. >pH.mat=gigante.mat.r(soil87.dat$gx,soil87.dat$gy,soil87.dat$pHFsrf) # interpolate a surface for the z values

2. >gigantemap.r(soil87.dat) # draw a location map

3. >contour(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat),add=TRUE) # add a contour map

Image:

1. >gigantemap.r(soil87.dat) # draw a location map

2. >image(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat),add=TRUE) # add an image

3-D perspective plot:

>persp(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat))
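For readers without the course functions, the same three displays can be sketched with base R alone. This example assumes a complete regular grid and simulated values, so no interpolation is needed; it does not reproduce what gigante.mat.r does internally, and the names gx, gy, grid and z_mat are hypothetical.

```r
# Simulated values on a complete regular grid (not the Gigante data)
gx <- seq(0, 500, by = 50)
gy <- seq(0, 800, by = 50)
grid <- expand.grid(x = gx, y = gy)          # x varies fastest
set.seed(2)
grid$z <- 5.5 + 0.001 * grid$x - 0.0005 * grid$y + rnorm(nrow(grid), sd = 0.1)

# Because x varies fastest, filling column-wise gives rows = x and columns = y,
# which is the orientation image(), contour() and persp() expect
z_mat <- matrix(grid$z, nrow = length(gx), ncol = length(gy))

image(gx, gy, z_mat, col = topo.colors(50), xlab = "x", ylab = "y")
contour(gx, gy, z_mat, add = TRUE)           # overlay contours on the image
persp(gx, gy, z_mat, theta = 30, phi = 30, xlab = "x", ylab = "y", zlab = "z")
```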


H-scatter plot

Locations close to each other are more likely to have similar values than locations far apart. The h-scatter plot is a useful tool for displaying this spatial continuity. An h-scatter plot shows all possible pairs of data values whose locations are separated by a certain distance in a particular direction.

Suppose a set of sample data, z(s), is collected on a systematic grid. The h-scatter plot is a scatter plot in which the x-axis is z(s) and the y-axis is z(s+h), where h is a spatial lag. If spatial continuity is strong, the cloud of points (z(s), z(s+h)) should fall close to the 1:1 diagonal line. As h increases, the data values become less similar and the cloud on the h-scatter plot becomes fatter and more diffuse.
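A minimal sketch of an h-scatter plot on a regular grid follows. The field is simulated, the lag is taken along one grid direction only, and the helper h_pairs is a hypothetical name introduced for this illustration.

```r
set.seed(3)
nx <- 20; ny <- 20
# Simulated field: a smooth signal along the first grid direction plus noise
z <- outer(1:nx, 1:ny, function(i, j) sin(i / 4) + rnorm(length(i), sd = 0.3))

# All pairs separated by h grid cells along the first direction (rows of z)
h_pairs <- function(z, h) {
  n <- nrow(z)
  data.frame(tail = as.vector(z[1:(n - h), ]),    # z(s)
             head = as.vector(z[(1 + h):n, ]))    # z(s + h)
}

p1 <- h_pairs(z, 1)
p5 <- h_pairs(z, 5)

op <- par(mfrow = c(1, 2))
plot(p1$tail, p1$head, xlab = "z(s)", ylab = "z(s+h)", main = "h = 1"); abline(0, 1)
plot(p5$tail, p5$head, xlab = "z(s)", ylab = "z(s+h)", main = "h = 5"); abline(0, 1)
par(op)

# The correlation of each cloud summarizes how tightly it follows the 1:1 line
c(h1 = cor(p1$tail, p1$head), h5 = cor(p5$tail, p5$head))
```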


Distance lag and angle tolerance for h-scatter plot

The construction of an h-scatter plot is straightforward for regularly spaced sample locations. However, if the locations are irregularly spaced, some lags h may have very few sample pairs while others have many. To overcome this problem we need to specify a tolerance both on the distance of h and on its direction. In the illustration below, any location falling within the shaded area is considered to be 5 m from a given location in an easterly direction.

[Figure: shaded tolerance region around a lag of 5 m due east, with a distance tolerance of 0.5 m on either side and an angular tolerance (tol) of 20° on either side; panels show regularly spaced and irregularly spaced sample locations.]
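The tolerance rule can be sketched as a small function. This is only one possible implementation: lag_neighbours and its arguments are hypothetical names, and angles are assumed to be in degrees, measured counter-clockwise from east.

```r
# Indices of the locations lying at lag h (+/- h_tol) from location i,
# in direction angle (+/- a_tol); angles in degrees, counter-clockwise from east
lag_neighbours <- function(x, y, i, h, h_tol, angle, a_tol) {
  dx <- x - x[i]
  dy <- y - y[i]
  d <- sqrt(dx^2 + dy^2)
  theta <- atan2(dy, dx) * 180 / pi
  # smallest absolute angular difference between theta and the target direction
  dtheta <- abs((theta - angle + 180) %% 360 - 180)
  which(d >= h - h_tol & d <= h + h_tol & dtheta <= a_tol)
}

# Seven made-up locations; location 1 is the focal point at the origin
x <- c(0, 5.2,  4.8, 5.0, 0.0, -5.0, 6.0)
y <- c(0, 0.5, -1.0, 3.0, 5.0,  0.0, 0.0)

# 5 m due east with a 0.5 m distance tolerance and a 20 degree angular tolerance
lag_neighbours(x, y, i = 1, h = 5, h_tol = 0.5, angle = 0, a_tol = 20)
# 2 3
```

Only locations 2 and 3 qualify: location 4 is too far, location 5 lies due north, location 6 lies due west, and location 7 is 6 m away.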


Stationarity and isotropy

The assumption of intrinsic stationarity of a random field is essential for geostatistical analysis. It is defined through first differences as follows:

$$E[z(s_i + h) - z(s_i)] = 0$$

$$\mathrm{var}[z(s_i + h) - z(s_i)] = 2\gamma(h)$$

where $\gamma(h)$ is called the (semi)variogram. Intrinsic stationarity basically implies a process with a constant mean and with differences whose variance depends only on the lag h (its magnitude and direction), not on the location. Furthermore, if the variance $2\gamma(h)$ is independent of direction, i.e., it depends only on the magnitude ||h||, the random field is isotropic.

Geostatistical data are typically composed of both a large-scale trend and small-scale random variation. Geostatistics (variogram/kriging) is designed to describe the small-scale variation, so it is necessary to remove the trend and anisotropy before conducting geostatistical analysis and spatial prediction.


Median polishing

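Median polishing is a resistant method appropriate for gridded data. It is based on the additive decomposition

data = grand mean + row + column + residual

$$z_{ij} = \bar z_{\cdot\cdot} + (\bar z_{i\cdot} - \bar z_{\cdot\cdot}) + (\bar z_{\cdot j} - \bar z_{\cdot\cdot}) + (z_{ij} - \bar z_{i\cdot} - \bar z_{\cdot j} + \bar z_{\cdot\cdot})$$

where a dot subscript denotes the mean (in median polishing, the median) over that subscript.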

The algorithm iterates successive sweeps of medians out of rows, then out of

columns, accumulating them in row, column, and grand effects. At each step of the

algorithm, the above relation is preserved. The residuals are what is left after the

algorithm converges. The method cannot remove the interaction effects between rows

and columns.

Note: The residuals are the data that we should use in geostatistical analysis. It has been argued that working with residuals introduces bias into variogram estimation, but the bias is generally small (particularly when the median is used instead of the mean).
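Base R provides medpolish() in the stats package for this decomposition. The sketch below applies it to a simulated grid (not the course data) and checks that the additive relation is preserved.

```r
set.seed(4)
# Simulated gridded data: additive row and column trends plus noise
z <- outer(1:6, 1:5, function(i, j) 10 + 0.5 * i + 0.2 * j) +
  matrix(rnorm(30, sd = 0.3), nrow = 6)

mp <- medpolish(z, trace.iter = FALSE)

# data = grand effect + row effect + column effect + residual
recon <- mp$overall + outer(mp$row, mp$col, "+") + mp$residuals
all.equal(recon, z)

# The residuals are what would be carried forward to the geostatistical analysis
round(mp$residuals, 2)
```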


Trend surface analysis

Trend surface analysis is appropriate for both gridded and non-gridded data. It fits the data with a multiple regression in which the x and y coordinates are the independent variables; the interaction of x and y and higher-order terms are usually needed.

Data model:

$$z = 10 + 0.1x + 0.0012y^2 + \varepsilon, \qquad \varepsilon \sim N(0,1)$$

>grid=expand.grid(x=1:50,y=1:25)
>x=grid$x
>y=grid$y
>z=10+0.1*x+0.0012*y^2+rnorm(1250,0,1) # generate the data

>x2=x^2
>y2=y^2
>xy=x*y
>z.lm=lm(z~x+y+xy+x2+y2) # fit a quadratic trend surface to z

Mapping z:
>x0=0:50
>y0=0:25
>image(x0,y0,matrix(z,nrow=50),col=topo.colors(100),main="Species distribution",xlab="x",ylab="y")


Trend surface analysis (cont’d)

> summary(z.lm)

Call:
lm(formula = z0 ~ x + y + xy + x2 + y2)

Residuals:
     Min       1Q   Median       3Q      Max
-3.06491 -0.65335  0.03657  0.66170  2.88715

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.002e+01  1.478e-01  67.841   <2e-16 ***
x            1.078e-01  8.404e-03  12.826   <2e-16 ***
y           -8.981e-03  1.706e-02  -0.527    0.599
xy           7.934e-05  2.614e-04   0.304    0.762
x2          -1.473e-04  1.461e-04  -1.008    0.314
y2           1.273e-03  5.862e-04   2.172    0.030 *
---

Mapping z.lm residuals:
> image(x0,y0,matrix(resid(z.lm),nrow=50),col=topo.colors(100),main="Residuals",xlab="x",ylab="y")

Generating trend: $z = 10 + 0.1x + 0.0012y^2$

Spatial autocorrelation

Several techniques are available to detect spatial autocorrelation (e.g., the variogram and the correlogram). Here we introduce four indices; we defer the variogram to the next chapter. These indices deal with binary, ordinal and continuous data.

1. Mantel test

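$$W_{ij} = \|s_i - s_j\|, \qquad U_{ij} = \|z(s_i) - z(s_j)\|$$

where $z(s_i)$ is the random variable at location $s_i$. Both matrices have zero diagonals; for n = 4,

$$W = \begin{pmatrix} 0 & w_{12} & w_{13} & w_{14} \\ w_{21} & 0 & w_{23} & w_{24} \\ w_{31} & w_{32} & 0 & w_{34} \\ w_{41} & w_{42} & w_{43} & 0 \end{pmatrix}, \qquad U = \begin{pmatrix} 0 & u_{12} & u_{13} & u_{14} \\ u_{21} & 0 & u_{23} & u_{24} \\ u_{31} & u_{32} & 0 & u_{34} \\ u_{41} & u_{42} & u_{43} & 0 \end{pmatrix}$$

$$M_1 = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} W_{ij}U_{ij}, \qquad M_2 = \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}U_{ij}$$

$$\beta = \frac{M_2}{\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}^2}$$

where β is the slope of the regression $U_{ij} = \beta W_{ij} + e_{ij}$.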

Mantel test (cont’d)


We do not usually use β for Mantel test. We instead directly use M2 (or M1) for

testing spatial patterning. Several methods can be used to test for significance.

The basic idea is to randomly assign z1, z2, …, zn to the study area and then

compare the observed M2 against the random M2’s to see how extreme the

observed M2obs is. If M2obs is very different from the majority of random M2’s,

we then reject the assumption of random distribution.

Mantel test – significance test

(1) Permutation test: Under the hypothesis of random distribution, the observed z(si) can be thought of as one random assignment of the values to the spatial locations s. There are n! possible assignments (maps) in total. Enumerating M2 for the n! arrangements yields its null distribution, from which the probability of exceeding M2obs can be obtained.

(2) Monte Carlo test: Enumerating all n! permutations is computationally very expensive. Instead, we can independently sample a small subset of random assignments (e.g., 99) to construct a null distribution of M2.
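The Monte Carlo version of the test can be sketched in a few lines of base R. The data are simulated, the helper mantel_m2 is a hypothetical name, and the one-sided p-value convention (counting the observed map as one of the 100 maps) is an assumption of this sketch rather than something prescribed above.

```r
set.seed(5)
n <- 30
x <- runif(n); y <- runif(n)
z <- 2 * x + rnorm(n, sd = 0.3)            # simulated values with a spatial trend

W <- as.matrix(dist(cbind(x, y)))          # W_ij = ||s_i - s_j||
# M2 = sum_i sum_j W_ij U_ij with U_ij = |z_i - z_j|
mantel_m2 <- function(z, W) sum(W * as.matrix(dist(z)))

m2_obs <- mantel_m2(z, W)

# Null distribution: reassign the z values to the locations at random, 99 times
m2_null <- replicate(99, mantel_m2(sample(z), W))

# One-sided Monte Carlo p-value; a large M2 means that distant sites differ more
p_value <- (1 + sum(m2_null >= m2_obs)) / (99 + 1)
p_value
```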


Measures on lattices

Defining spatial connectivity is critical to the analysis

of lattice data. There are several ways to define

connectivity: by contiguity and distance.

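Spatial connectivity (contiguity)

[Figure: rook, bishop and queen contiguity on a regular grid; county contiguity for North Carolina.]

$$w_{ij} = \begin{cases} 1 & \text{if sites } i \text{ and } j \text{ are connected} \\ 0 & \text{if they are not connected} \end{cases}$$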
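On a regular grid the three contiguity rules can be written down directly. The function below is a hypothetical helper for illustration, assuming that rook means a shared edge, bishop a shared corner only, and queen either; cells are numbered column by column.

```r
# Binary contiguity matrix for an nr x nc grid; cells are numbered column-wise
grid_contiguity <- function(nr, nc, type = c("rook", "bishop", "queen")) {
  type <- match.arg(type)
  cells <- expand.grid(r = 1:nr, col = 1:nc)
  dr <- abs(outer(cells$r, cells$r, "-"))        # row separation of each pair
  dc <- abs(outer(cells$col, cells$col, "-"))    # column separation of each pair
  w <- switch(type,
              rook   = (dr + dc) == 1,           # shared edge
              bishop = dr == 1 & dc == 1,        # shared corner only
              queen  = pmax(dr, dc) == 1)        # shared edge or corner
  w * 1                                          # logical -> 0/1
}

w_rook  <- grid_contiguity(3, 3, "rook")
w_queen <- grid_contiguity(3, 3, "queen")
rowSums(w_rook)    # 2 3 2 3 4 3 2 3 2
rowSums(w_queen)   # 3 5 3 5 8 5 3 5 3
```

The row sums give the number of neighbors of each cell: corner, edge and center cells have 2, 3 and 4 rook neighbors and 3, 5 and 8 queen neighbors.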


2. Black-Black and Black-White join counts


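$$z_i = \begin{cases} 1 & \text{if site } i \text{ is occupied} \\ 0 & \text{if it is not occupied} \end{cases}$$

$$BB = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} z_i z_j$$

$$BW = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} (z_i - z_j)^2$$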

Test for significance

Permutation or Monte Carlo test can be used to test the significance

of the BB or BW statistics. Typically, the significance is tested by

assuming an asymptotic normal distribution:

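$$z_{BB} = \frac{BB - E(BB)}{\sqrt{\mathrm{Var}(BB)}}, \qquad z_{BW} = \frac{BW - E(BW)}{\sqrt{\mathrm{Var}(BW)}}$$

$$E(BB) = \frac{1}{2}\pi^2 \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}, \qquad E(BW) = \pi(1-\pi) \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}$$

$$\mathrm{Var}(BB) = \frac{1}{4}\pi^2(1-\pi)\{S_1(1-\pi) + S_2\pi\}$$

$$\mathrm{Var}(BW) = S_1\pi^2(1-\pi)^2 + \frac{1}{4}S_2\,\pi(1-\pi)\{1 - 4\pi(1-\pi)\}$$

$$S_1 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (w_{ij} + w_{ji})^2, \qquad S_2 = \sum_{i=1}^{n}\Big(\sum_{j=1}^{n} w_{ij} + \sum_{j=1}^{n} w_{ji}\Big)^2$$

where $\pi$ is the probability that a site is occupied.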

Cliff, A.D. & Ord, J.K. 1981. Spatial processes: models and applications. Pion Limited, London.
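The join counts and their normal-approximation z-scores can be computed directly from the formulas above. The sketch uses a simulated occupancy map on an 8 x 8 grid with rook contiguity; estimating π by the observed proportion of occupied sites is an assumption of this example.

```r
set.seed(6)
nr <- 8; nc <- 8; n <- nr * nc
cells <- expand.grid(r = 1:nr, col = 1:nc)
# Rook contiguity: w_ij = 1 when two cells share an edge
w <- (abs(outer(cells$r, cells$r, "-")) + abs(outer(cells$col, cells$col, "-")) == 1) * 1

z <- rbinom(n, 1, 0.4)                     # simulated occupancy (1 = black, 0 = white)

BB <- 0.5 * sum(w * outer(z, z))           # joins between two occupied sites
BW <- 0.5 * sum(w * outer(z, z, "-")^2)    # joins between an occupied and an empty site

p  <- mean(z)                              # assumption: pi estimated by the observed proportion
S1 <- 0.5 * sum((w + t(w))^2)
S2 <- sum((rowSums(w) + colSums(w))^2)

E_BB <- 0.5 * p^2 * sum(w)
V_BB <- 0.25 * p^2 * (1 - p) * (S1 * (1 - p) + S2 * p)
E_BW <- p * (1 - p) * sum(w)
V_BW <- S1 * p^2 * (1 - p)^2 + 0.25 * S2 * p * (1 - p) * (1 - 4 * p * (1 - p))

c(z_BB = (BB - E_BB) / sqrt(V_BB), z_BW = (BW - E_BW) / sqrt(V_BW))
```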


3. Moran’s I

$$I(h) = \frac{n\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(z_i - \bar z)(z_j - \bar z)}{W\sum_{i=1}^{n}(z_i - \bar z)^2}$$

where $z_i$ is the data value at location i,
h is the distance between locations i and j,
$w_{ij}$ is 1 if the pair (i, j) belongs to distance class h (the one for which the coefficient is computed) and 0 otherwise, and
W is the sum of the $w_{ij}$.

Moran’s I behaves like Pearson’s correlation coefficient because its numerator consists of a covariance term. Thus, it is sensitive to extreme values.

The association of the z values at locations i and j can be measured in several ways:

Covariance based: $U_{ij} = (z_i - \mu)(z_j - \mu)$ or $U_{ij} = (z_i - \bar z)(z_j - \bar z)$

Distance based: $U_{ij} = |z_i - z_j|$ or $U_{ij} = (z_i - z_j)^2$

Testing Moran’s I

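If the $z_i$ are iid normal, it can be shown that, for large n, Moran’s I is asymptotically normally distributed with the first two moments:

$$E(I) = -\frac{1}{n-1} \approx 0 \quad (\text{if } n \text{ is large})$$

$$E(I^2) = \frac{n^2 W_1 - n W_2 + 3W^2}{W^2 (n^2 - 1)}$$

where

$$W = \sum_{i \ne j} w_{ij}, \qquad W_1 = \frac{1}{2}\sum_{i \ne j} (w_{ij} + w_{ji})^2, \qquad W_2 = \sum_{i=1}^{n} (w_{i\cdot} + w_{\cdot i})^2$$

with $w_{i\cdot} = \sum_j w_{ij}$ and $w_{\cdot i} = \sum_j w_{ji}$. The variance follows as $\mathrm{var}(I) = E(I^2) - E(I)^2$.

Two types of test:

1. Normal test, if the data are iid normal.

2. Permutation (randomization) test, for any data distribution.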
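Both tests can be sketched in base R from the formulas above. The example uses simulated data on a 6 x 6 grid with rook weights (a single distance class); moran_i is a hypothetical helper, and the one-sided permutation p-value is an assumption of this sketch.

```r
set.seed(7)
nr <- 6; nc <- 6; n <- nr * nc
cells <- expand.grid(r = 1:nr, col = 1:nc)
w <- (abs(outer(cells$r, cells$r, "-")) + abs(outer(cells$col, cells$col, "-")) == 1) * 1
z <- cells$r + cells$col + rnorm(n)        # simulated data with a spatial trend

moran_i <- function(z, w) {
  d <- z - mean(z)
  length(z) * sum(w * outer(d, d)) / (sum(w) * sum(d^2))
}
I_obs <- moran_i(z, w)

# 1. Normal test: moments under the iid normal assumption
W  <- sum(w)
W1 <- 0.5 * sum((w + t(w))^2)
W2 <- sum((rowSums(w) + colSums(w))^2)
E_I  <- -1 / (n - 1)
E_I2 <- (n^2 * W1 - n * W2 + 3 * W^2) / (W^2 * (n^2 - 1))
z_I  <- (I_obs - E_I) / sqrt(E_I2 - E_I^2)

# 2. Permutation test: 999 random reassignments of the values to the sites
I_perm <- replicate(999, moran_i(sample(z), w))
p_perm <- (1 + sum(I_perm >= I_obs)) / (999 + 1)

c(I = I_obs, z = z_I, p_perm = p_perm)
```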


4. Geary’s c

$$c(h) = \frac{(n-1)\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(z_i - z_j)^2}{2W\sum_{i=1}^{n}(z_i - \bar z)^2}$$

Unlike Moran’s I, this coefficient is a distance-type function because the numerator sums the squared differences between values at the various pairs of locations.

In fact, Geary’s c corresponds in form to the Durbin-Watson statistic used to test for the presence of autocorrelation in the residuals of a regression (particularly in time series data).

A side note on the effects of autocorrelation on regression analysis:

1. The OLS regression coefficients are still unbiased, but they are no longer minimum-variance estimates, i.e., the estimates are inefficient.

2. If the residuals are positively autocorrelated, the residual mean square error (MSE) may seriously underestimate $\sigma^2$, so that the standard errors may be too small. Thus, the CIs are shorter than they really should be (see Cressie p.13-15), and tests of hypotheses on individual regression coefficients may indicate that one or more regressors contribute significantly to the model when they really do not, which inflates the Type I error. Generally, underestimating $\sigma^2$ gives a false impression of accuracy.

3. The CIs and tests of hypotheses based on the t- and F-distributions are no longer strictly appropriate.

Testing Geary’s c

Like Moran’s I, c is asymptotically normally distributed if the $z_i$ are iid normal and n is large. The first two moments are:

$$E(c) = 1$$

$$V(c) = \frac{(2W_1 + W_2)(n-1) - 4W^2}{2(n+1)W^2}$$

Two types of test:

1. Normal test, if the data are iid normal.

2. Permutation (randomization) test, for any data distribution.
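Geary’s c and its normal-approximation z-score can be sketched the same way. The data are simulated on a 6 x 6 grid with rook weights, geary_c is a hypothetical helper, and W, W1 and W2 are computed as defined for Moran’s I.

```r
set.seed(10)
nr <- 6; nc <- 6; n <- nr * nc
cells <- expand.grid(r = 1:nr, col = 1:nc)
w <- (abs(outer(cells$r, cells$r, "-")) + abs(outer(cells$col, cells$col, "-")) == 1) * 1
z <- cells$r + cells$col + rnorm(n)        # simulated data with a spatial trend

geary_c <- function(z, w) {
  n <- length(z)
  (n - 1) * sum(w * outer(z, z, "-")^2) / (2 * sum(w) * sum((z - mean(z))^2))
}
c_obs <- geary_c(z, w)

W  <- sum(w)
W1 <- 0.5 * sum((w + t(w))^2)
W2 <- sum((rowSums(w) + colSums(w))^2)
V_c <- ((2 * W1 + W2) * (n - 1) - 4 * W^2) / (2 * (n + 1) * W^2)

# c below 1 (a negative z-score) points to positive spatial autocorrelation
z_c <- (c_obs - 1) / sqrt(V_c)
c(c = c_obs, z = z_c)
```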


spdep functions for Moran’s I and Geary’s c

> library(spdep)
> xy.dat=expand.grid(x=1:4, y=1:4) # x and y locations
> xy.z=1:16 # illustrative data
> xy.nb=dnearneigh(as.matrix(xy.dat), 0, 1) # create neighborhood structure
> str(xy.nb) # view the neighbors
> plot(xy.nb, as.matrix(xy.dat)) # plot neighborhood
> xy.W=nb2listw(xy.nb) # generate weights; by default each neighbor of a site gets equal weight
> moran(xy.z, xy.W, length(xy.nb), Szero(xy.W)) # compute Moran's I
> geary(xy.z, xy.W, length(xy.nb), length(xy.nb)-1, Szero(xy.W)) # compute Geary's c

sp.correlogram computes the correlogram (the plot of the autocorrelation coefficient against distance). Care is needed to use it properly: supply only the basic nb object that connects all cells. For a regular grid, xy.nb=dnearneigh(as.matrix(xy.dat), 0, 1) is enough to define the nb; higher-order neighborhoods are not needed because sp.correlogram builds the higher lags itself. With a wider neighborhood structure, e.g., xy.nb=dnearneigh(as.matrix(xy.dat), 0, 3), sp.correlogram still computes a valid correlogram, but each lag then has a slightly different tolerance band.

styles in nb2listw

>xy.W=nb2listw(xy.nb, style="W")

It is helpful to understand the different weighting styles in nb2listw. Five options for style are described here: "W", "B", "C", "U" and "S". "W" is the default. You can use >str(xy.W) to view the difference between the styles.

1. "W": row standardized. It weighs the neighbors of each focal point independently of the neighbors of other focal points. Each neighbor of a focal point has the same weight. For example, if a focal point has two neighbors, the weight for each neighbor is 0.5; if it has four neighbors, the weight for each neighbor is 0.25. The weights sum to the total number of focal points, n.

2. "B": binary weights, i.e., each neighbor of a focal point contributes 1 to the focal point. The weights sum to the total number of neighbor links.

3. "C": globally standardized. Every link receives the same weight, n divided by the total number of links (the sum from the "B" style). The weights sum to n.

4. "U": the "C" weights divided by n, so every link weighs 1 over the total number of links. The weights of a focal point therefore add up to its number of neighbors divided by the total number of links, and all weights sum to 1.

5. "S": the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. 1999, p. 167-168. The weights sum to n.
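The sums quoted for the styles can be checked by building the weight matrices by hand. The sketch below does not call nb2listw; it constructs the "B", "W", "C" and "U" weights for the 4 x 4 grid directly from their definitions (the "S" style is omitted here), and the matrix names are hypothetical.

```r
cells <- expand.grid(r = 1:4, col = 1:4)
n <- nrow(cells)

# "B": binary rook weights on the 4 x 4 grid (distance-1 neighbors)
B <- (abs(outer(cells$r, cells$r, "-")) + abs(outer(cells$col, cells$col, "-")) == 1) * 1

W <- B / rowSums(B)        # "W": each row is divided by its number of neighbors
C <- B * n / sum(B)        # "C": every link gets n / (total number of links)
U <- B / sum(B)            # "U": every link gets 1 / (total number of links)

c(B = sum(B), W = sum(W), C = sum(C), U = sum(U))
# B = 48, W = 16, C = 16, U = 1
```

The 4 x 4 grid has 24 rook joins, so the binary weights sum to 48 directed links, while the "W" and "C" weights both sum to n = 16 and the "U" weights sum to 1.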


Let’s compute the correlogram for the Gigante surface water pH values

> soil87.xy=data.frame(x=soil87.dat$gx,y=soil87.dat$gy) # x and y locations
> soil87.nb=dnearneigh(as.matrix(soil87.xy), 0, 60) # create the xy neighborhood structure
> plot(soil87.nb, as.matrix(soil87.xy)) # plot the nb structure
> soil87.pH.correlogram=sp.correlogram(soil87.nb, soil87.dat$pHFsrf, order=18, method="I", zero.policy=TRUE) # compute the correlogram
> soil87.pH.correlogram
> plot(soil87.pH.correlogram, main="Gigante surface water pH values")

[Figure: neighborhood structure]

R function for Moran’s I and Geary’s c

The R program autocorrelation.r computes both I and c for a given data set at a particular distance.

>autocorrelation.r(data,x,y,dist1,dist2,k=0)

# data – the data array to be tested for autocorrelation
# x – the x-coordinates for the data array
# y – the y-coordinates for the data array
# dist1 and dist2 – used to define neighbors; locations that fall within (dist1,dist2) of a focal location are defined as neighbors, not including dist1 and dist2
# k – the number of simulations for computing the permutation distribution under the null hypothesis of no spatial autocorrelation (default k=0)

Example:

autocorrelation.r(soil87.dat$pHFsrf,soil87.dat$gx,soil87.dat$gy,50,100,k=99)


LISAs (Local Indicators of Spatial Association)

We can rewrite Mantel’s statistic M2 in terms of the contribution of each individual site:

$$M_2 = \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}U_{ij} = \sum_{i=1}^{n} M_i, \qquad \text{where } M_i = \sum_{j=1}^{n} W_{ij}U_{ij}$$

Mi can be thought of as the autocorrelation measure associated with the ith site, a local measure.

Definition of LISA: a statistic is a LISA if (1) it measures the extent of spatial autocorrelation around a particular site, and (2) its sum is proportional to a global measure of spatial autocorrelation such as BB, I or c.

Applications:

(1) Detect regions where the autocorrelation is markedly different from that in other areas.

(2) Detect local spatial clusters (hotspots). There are many definitions of hot (cold) spots. One way is to use a histogram and identify as hot (cold) spots those sites where the observations are extreme. LISAs enable the hotspot definition to take into account the values surrounding a site. A hot spot can be defined as one or several contiguous sites where the LISAs are unusually large or small.

(3) Distinguish spatial outliers from distributional outliers. A spatial outlier may not be a distributional outlier.

Anselin, L. 1995. Local indicators of spatial association – LISA. Geographical Analysis 27:93-115.

Moran’s LISAs

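A local version of Moran’s I can be written as

$$I_i = \frac{n}{(n-1)S^2}\,(z_i - \bar z)\sum_{j=1}^{n} w_{ij}(z_j - \bar z)$$

where $S^2 = \sum_{i=1}^{n}(z_i - \bar z)^2/(n-1)$ is the sample variance. The sum of the local Moran’s I is proportional to the global I (scaled by W, the total weight):

$$I = \frac{1}{W}\sum_{i=1}^{n} I_i$$

Anselin (1995) derives the mean and variance of $I_i$ under the randomization assumption; the mean is

$$E(I_i) = -\frac{1}{n-1}\sum_{j=1}^{n} w_{ij}$$

In the spdep package, the corresponding function is localmoran.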
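The relation between the local and global statistics can be verified numerically. The sketch below uses simulated data on a 5 x 5 grid with binary rook weights and computes the local values directly from the formula above; it does not call localmoran, whose variance scaling may differ slightly, so the numbers need not match that function exactly.

```r
set.seed(8)
cells <- expand.grid(r = 1:5, col = 1:5)
n <- nrow(cells)
w <- (abs(outer(cells$r, cells$r, "-")) + abs(outer(cells$col, cells$col, "-")) == 1) * 1
z <- cells$r + rnorm(n)                           # simulated data with a trend

d  <- z - mean(z)
s2 <- sum(d^2) / (n - 1)                          # sample variance S^2

I_local <- n / ((n - 1) * s2) * d * as.vector(w %*% d)   # local Moran's I_i
E_local <- -rowSums(w) / (n - 1)                         # E(I_i) under randomization

# Global Moran's I and the check I = (1 / W) * sum_i I_i
I_global <- n * sum(w * outer(d, d)) / (sum(w) * sum(d^2))
all.equal(sum(I_local) / sum(w), I_global)
```

Sites with unusually large or small I_i relative to E_local are the candidates for hot spots or spatial outliers discussed above.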


A few technical comments

1. Spatial autocorrelation analysis should not be performed with fewer than ~30

data points because the number of location pairs in each distance class would

then become too small to produce reliable results.

2. There are two ways of dividing distance into classes: (1) by forming equal distance classes (as we did for the soil pH values in soil87.dat), (2) by forming classes with equal frequencies (i.e., the distance classes are not equally wide, but each class contains approximately the same number of location pairs).

3. Spatial autocorrelation analysis should not be performed on data containing many zeros, because the degree of autocorrelation would then be overestimated and would reflect the fact that the locations share the absence of that variable, which is not what is intended in most applications.

4. Euclidean distances between locations may not be the best way to express

spatial relationships in applications. Such a spatial structure should be defined

depending on the particular purposes of a study.
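The two ways of forming distance classes mentioned in comment 2 can be compared on simulated locations. This is only a sketch: the coordinates are made up, and the choice of six classes is arbitrary.

```r
set.seed(9)
xy <- cbind(runif(40, 0, 500), runif(40, 0, 800))   # 40 simulated locations
d <- as.vector(dist(xy))                            # 40 * 39 / 2 = 780 pairwise distances

k <- 6
# (1) equal distance classes: k intervals of equal width
equal_width <- cut(d, breaks = k)
# (2) equal frequency classes: class limits at the quantiles of the distances
equal_freq  <- cut(d, breaks = quantile(d, probs = seq(0, 1, length.out = k + 1)),
                   include.lowest = TRUE)

table(equal_width)   # counts of pairs vary widely between classes
table(equal_freq)    # about 780 / 6 = 130 pairs in every class
```

With equal-width classes the largest distances typically contain few pairs, which is where the estimates become unreliable.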

Legendre, P. and Fortin, M.-J. 1989. Spatial pattern and ecological analysis. Vegetatio 80:107-138.
