GWR Presentation

Geographically Weighted Regression

CSDE Statistics Workshop

Christopher S. Fowler, PhD

February 1st, 2011

Significant portions of this workshop were culled from presentations prepared by Fotheringham, Charlton and Brunsdon and presented at the 2010 Advanced Workshop on Spatial Analysis at the University of California, Santa Barbara.

Center for Studies in Demography and Ecology

University of Washington

Outline for the Session

The motivation for GWR

◦ Examples from YOUR discipline

Mapping OLS Residuals

◦ A good baseline for why we need GWR

GWR

◦ Definitions, basic concepts

Running GWR

◦ A straightforward implementation in ArcGIS

GWR and some extensions

Basics of OLS

$y = X\beta + \varepsilon$

Assumes a stationary process

Same stimulus provokes the same response anywhere in the study area

Why might relationships vary spatially?

Sampling variation

Relationships intrinsically different across space (attitudes, preferences, contextual effects)

Model misspecification

Applications: Ecology

“GWR works on trees…”

Could have been “differentiated sampling pattern creates predictable and changing levels of interaction among observations”

Applications: Public Health

The relationship between mortality and occupational segregation and between mortality and unemployment varies across Tokyo

Relationships vary systematically

Applications: Sociology/Public Policy

The link between multifamily housing and residential burglaries varies widely even when controlling for numerous socioeconomic and neighborhood factors

Missing variables (and they may very well be unknowable)

Back up… How do we know if we have nonstationarity in our model?

Map residuals and test them for spatial autocorrelation

…if our model errs systematically with a spatial pattern then we may be on to something.

Poverty in the Southern U.S.

Our example Model

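$\text{Poverty} = \beta_0 + \beta_1\,\text{FemaleHeadedHousehold} + \beta_2\,\text{Unemployed} + \beta_3\,\text{Black} + \beta_4\,\text{65andOlder} + \beta_5\,\text{Metro} + \beta_6\,\text{AtLeastHighSchoolEducation} + \varepsilon$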

Based on the work of Paul Voss and Katherine Curtis

These are all understood to be good predictors of poverty

What kinds of spatial structures influence this data set?

Lab Part 1

Run our OLS model in ArcGIS

Examine model output

Map residuals

Calculate Moran’s I and Local Moran’s I
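
The lab computes Moran's I in ArcGIS. As a rough illustration of what the statistic measures, here is a minimal Python sketch, assuming a small regular grid with binary rook-contiguity weights. The function names `rook_weights` and `morans_i` and both toy surfaces are hypothetical, not part of the lab data.

```python
def rook_weights(nrows, ncols):
    """Binary rook-contiguity weights for a grid whose cells are numbered row by row."""
    weights = {}
    for r in range(nrows):
        for c in range(ncols):
            neighbours = {}
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    neighbours[rr * ncols + cc] = 1.0
            weights[r * ncols + c] = neighbours
    return weights


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * z[i] * z[j] for i, row in weights.items() for j, w in row.items())
    return (n / s0) * cross / sum(zi * zi for zi in z)


w = rook_weights(4, 4)
# Toy "residual" surfaces (assumed for illustration only)
trend = [float(r + c) for r in range(4) for c in range(4)]        # smooth gradient
checker = [1.0 if (r + c) % 2 == 0 else -1.0
           for r in range(4) for c in range(4)]                   # alternating signs

i_trend = morans_i(trend, w)
i_checker = morans_i(checker, w)
print(round(i_trend, 3), round(i_checker, 3))
```

The smooth gradient gives a clearly positive value (similar residuals sit next to each other), while the checkerboard gives -1 (neighbors always disagree). Residuals that look like the first case are the warning sign this lab is looking for.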

Our best aspatial model

So what now?

◦ Add more missing variables and try again

Repeat the steps from the lab

◦ Accept that there is something about certain places that makes them different (spatial heterogeneity)

Try GWR

◦ Test variables meant to explore interactions taking place at short distances (spatial dependence)

Try Spatial Regression (likely a spatial lag model)

◦ Assume that the correlation is a “nuisance” and control for it in the error term

Try Spatial Regression (likely a spatial error model)

Outline for Part II

What is GWR

Weighting in GWR

Geographically Weighted Regression

Local statistical technique to analyze spatial variations in relationships

We are not content with global averages of spatial data (climate for example)

Why should we be satisfied with global averages in a statistical analysis?

Put another way… Simpson’s Paradox

If we think of these points as our data, grouped into colors by region, we can see that the global and local models differ significantly

Source: Rücker and Schumacher, BMC Medical Research Methodology 2008, 8:34, doi:10.1186/1471-2288-8-34
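
The same effect can be reproduced with a few invented numbers. The sketch below is only an illustration: the three "regions" and their values are hypothetical and are not the data from the figure. Each region has a within-region slope of -1, yet the pooled (global) slope is positive.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (with intercept)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: three regions, each with a within-region slope of -1,
# but with region means of x and y that rise together.
regions = {
    g: ([3 * g + t for t in range(3)], [4 * g - t for t in range(3)])
    for g in range(3)
}

local_slopes = {g: ols_slope(xs, ys) for g, (xs, ys) in regions.items()}
all_x = [x for xs, _ in regions.values() for x in xs]
all_y = [y for _, ys in regions.values() for y in ys]
global_slope = ols_slope(all_x, all_y)

print(local_slopes, global_slope)
```

Here the global model reports a slope of about 1.1 while every regional model reports -1, which is the kind of disagreement a single global coefficient hides.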

Basic definitions

Spatial nonstationarity exists when the same stimulus provokes a different response in different parts of the study region

Global models are statements about processes that are assumed to be stationary and, as such, are location independent

Local models are spatial disaggregations of global models, the results of which are location specific

Spatial heterogeneity refers to spatial patterns resulting from broad similarities usually over time

Spatial dependence refers to spatial patterns that result from interactions among observations

GWR in greater detail

Spatial Heterogeneity and Spatial Dependence

GWR and Spatial Processes

GWR is excellent at picking up broad scale regional differences

◦ spatial heterogeneity

Not as effective at dealing with small scale interaction processes

◦ Too much bias in each local model

◦ That doesn’t mean it won’t try (and give you misleading results)

GWR in a nutshell

Global model

$y = X\beta + \varepsilon$

becomes

$y_i = X_i\beta_i + \varepsilon_i$

where i indicates that there is a set of coefficients estimated for every observation in our data set

The Key Difference

We estimate a set of regression coefficients for each observation

To do so we weight near observations more heavily than more distant ones.

We may also estimate coefficients based on some local subset of observations
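
To make "a set of coefficients for each observation" concrete, here is a minimal Python sketch of one local fit per location. It assumes a single predictor, a Gaussian distance-decay weight, and made-up data on a west-east line; `gaussian_weight`, `local_fit`, and the data are all hypothetical and this is not the ArcGIS implementation.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Weight that decays smoothly with distance (one possible kernel)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def local_fit(i, coords, x, y, bandwidth):
    """Weighted least squares of y on x (with intercept), centred on observation i."""
    w = [gaussian_weight(math.dist(coords[i], c), bandwidth) for c in coords]
    sw = sum(w)
    xbar = sum(wk * xk for wk, xk in zip(w, x)) / sw
    ybar = sum(wk * yk for wk, yk in zip(w, y)) / sw
    sxy = sum(wk * (xk - xbar) * (yk - ybar) for wk, xk, yk in zip(w, x, y))
    sxx = sum(wk * (xk - xbar) ** 2 for wk, xk in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical data: 20 locations on a west-east line; the true slope
# declines from 1.0 in the west to 0.24 in the east.
coords = [(float(k), 0.0) for k in range(20)]
x = [float((7 * k) % 5 + 1) for k in range(20)]
true_beta = [1.0 - 0.04 * k for k in range(20)]
y = [b * xk for b, xk in zip(true_beta, x)]

# One regression per observation, each weighted towards its own neighbourhood
local_slopes = [local_fit(i, coords, x, y, bandwidth=2.0)[1] for i in range(20)]
print([round(s, 2) for s in local_slopes])
```

The local slopes fall from west to east, following the pattern built into the data, whereas a single global regression would report one compromise value.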

Some advantages of GWR

Excellent tool for testing model specification

◦ Where does model fit look good, where are you missing something?

Residuals generally lower and not spatially autocorrelated

Real values for β

.9 .8 .8 .7 .5

.8 .7 .6 .5 .4

.7 .6 .5 .4 .4

.6 .5 .4 .3 .2

.5 .4 .3 .2 .1

Estimated Values of β in global model

.5 .5 .5 .5 .5

.5 .5 .5 .5 .5

.5 .5 .5 .5 .5

.5 .5 .5 .5 .5

.5 .5 .5 .5 .5

Residuals from global model

+ + + + 0

+ + + 0 -

+ + 0 - -

+ 0 - - -

0 - - - -

Reasons to use GWR

Identify model misspecification

Identify nonstationarity in relationships

Improved model fit (R², AIC, etc.)

Reduced spatial autocorrelation

Represent “context”

◦ Address spatial heterogeneity when precise variables may not exist

You’ve convinced me, what next?

Run your aspatial model (as we did in 1st lab)

◦ We will want the results and diagnostics to compare with what comes next.

Decide how you are going to weight your nearby locations

◦ Fixed bandwidth

◦ Variable bandwidth

◦ User-defined bandwidth

It all comes down to how you weight the observations…

We can use a fixed bandwidth “h”

Number of observations will vary, but area they represent will remain constant

$w_{ij} = \exp\left[-\tfrac{1}{2}\left(d_{ij}/h\right)^2\right]$

Weighting option 2

Or we can employ an adaptive bandwidth

Number of observations will remain fixed, but area will not be the same

$w_{ij} = \left[1 - d_{ij}^2/h^2\right]^2$ if j is one of i’s N nearest neighbors, and $w_{ij} = 0$ otherwise
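
The two weighting options can be compared side by side in a short Python sketch. This is an illustration under assumptions: in the adaptive case h is taken to be the distance to the N-th nearest observation (so that observation itself receives weight 0), and the function names and distances are hypothetical.

```python
import math


def gaussian_fixed(distance, h):
    """Fixed-bandwidth Gaussian weight: exp(-((d/h)^2)/2)."""
    return math.exp(-0.5 * (distance / h) ** 2)


def bisquare_adaptive(distances, n_neighbors):
    """Adaptive bisquare weights for one regression point.

    Assumption: h is the distance to the N-th nearest observation, so it
    stretches where data are sparse and shrinks where data are dense.
    """
    h = sorted(distances)[n_neighbors - 1]
    return [(1.0 - (d / h) ** 2) ** 2 if d < h else 0.0 for d in distances]


# Hypothetical distances from one regression point to five observations
distances = [0.0, 1.0, 2.0, 3.0, 4.0]

fixed = [gaussian_fixed(d, h=2.0) for d in distances]
adaptive = bisquare_adaptive(distances, n_neighbors=4)
print([round(v, 3) for v in fixed])
print([round(v, 3) for v in adaptive])
```

With the fixed kernel every observation keeps some weight and only the distance matters. With the adaptive kernel the weights drop to exactly zero beyond the chosen neighbors, whatever the distance happens to be.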

Kernels and Weights

So how do we know what bandwidth to use?

• Bandwidth specifies shape of weights curve

• Kernel type tells us whether we will define our bandwidth based on distance (fixed) or number of neighbors (adaptive)

Judging the appropriate bandwidth

A tradeoff between

◦ Bias: we include observations that are not part of the same spatial “group”

and

◦ Variance: we don’t have enough points in our model to say anything with conviction

AICc or CV measure model fit

Optimize fit to obtain best bandwidth.

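[Figure: AIC plotted against bandwidth; the optimum bandwidth lies between the variance-dominated and bias-dominated parts of the curve]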
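
Bandwidth optimization can be sketched as a simple search. The Python example below uses leave-one-out cross-validation (the CV option, not AICc) over a handful of candidate fixed bandwidths. It assumes one predictor, a Gaussian kernel, and simulated data; the candidate list, function names, and data are hypothetical, and real software searches the bandwidth more carefully.

```python
import math
import random


def loo_prediction(i, coords, x, y, bandwidth):
    """Predict y[i] from a Gaussian-weighted local fit that leaves observation i out."""
    w = [0.0 if k == i
         else math.exp(-0.5 * (math.dist(coords[i], c) / bandwidth) ** 2)
         for k, c in enumerate(coords)]
    sw = sum(w)
    xbar = sum(wk * xk for wk, xk in zip(w, x)) / sw
    ybar = sum(wk * yk for wk, yk in zip(w, y)) / sw
    sxy = sum(wk * (xk - xbar) * (yk - ybar) for wk, xk, yk in zip(w, x, y))
    sxx = sum(wk * (xk - xbar) ** 2 for wk, xk in zip(w, x))
    return ybar + (sxy / sxx) * (x[i] - xbar)


def cv_score(coords, x, y, bandwidth):
    """Sum of squared leave-one-out prediction errors for one bandwidth."""
    return sum((y[i] - loo_prediction(i, coords, x, y, bandwidth)) ** 2
               for i in range(len(y)))


# Simulated data (assumed): the slope drifts from 1.0 to 0.24 along a line,
# plus a little noise.
rng = random.Random(0)
coords = [(float(k), 0.0) for k in range(20)]
x = [float((7 * k) % 5 + 1) for k in range(20)]
y = [(1.0 - 0.04 * k) * xk + rng.gauss(0.0, 0.05) for k, xk in enumerate(x)]

candidates = [1.0, 2.0, 4.0, 8.0, 50.0]   # 50.0 is effectively a global model
scores = {h: cv_score(coords, x, y, h) for h in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

Very large bandwidths behave like the global model and pay for it in bias; very small ones lean on too few points. The selected bandwidth is whichever candidate predicts held-out observations best, which is why robustness checks around it are still worthwhile.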

To sum

Weighting assumptions are very important to outcomes in GWR

Fixed distance kernel is more appropriate when the distribution of your observations is relatively stable across space (e.g. size, number of neighbors).

Adaptive kernel is appropriate when distribution varies across space (e.g. events are clustered or polygons are heterogeneous)

Once a kernel type is selected optimization takes some of the guesswork out of it, but robustness checks are still needed

Residuals from the OLS model from last lesson

Looks reasonably good

Moran’s I is still .22 and highly significant

Lab

Run GWR model

Check Residuals

Check variation in coefficients

Further topics/issues in GWR

Where to go for next steps

General troubleshooting

Significance testing

Outlier problems

Poisson and Logistic model implementations

Mixed form models

Other software implementations of GWR

GWR 3.x (4.0 should be out soon)

R (spgwr package)

Stata

Matlab

Perhaps others I haven’t heard of…

General Troubleshooting

Regional dummies: BAD

◦ Eliminate them from model; we are trying to show regional variation, not control for it

Binary and low probability count variables

◦ Use caution, lack of variation may cause model to crash or have trouble finding a workable bandwidth

Significance Testing

How do I know if the variation I see in my coefficients is meaningful?

Could do t-test, but you will run into problems with multiple (1,387) tests

◦ Results in lots of false positives

◦ Standard correction (Bonferroni) will make any significance finding nearly impossible

Best Method: Monte Carlo simulation

Randomly reassign all observation values (dependent and independent variables travel together) to different observation locations

◦ Each county’s data gets assigned randomly to a different county

Re-run GWR and record coefficients

Repeat lots of times (at least 100)

Define a distribution for coefficient values and compare your coefficients to this distribution
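
The steps above can be sketched in Python as follows. This is a simplified illustration, not the procedure built into any GWR package: it assumes one predictor, a fixed Gaussian kernel, simulated data, and it summarizes each run by the standard deviation of the local slopes (one common choice of summary; the slides leave the comparison open). All names are hypothetical.

```python
import math
import random
import statistics


def local_slopes(coords, x, y, bandwidth):
    """Gaussian-weighted local slope of y on x at every observation."""
    slopes = []
    for ci in coords:
        w = [math.exp(-0.5 * (math.dist(ci, c) / bandwidth) ** 2) for c in coords]
        sw = sum(w)
        xbar = sum(wk * xk for wk, xk in zip(w, x)) / sw
        ybar = sum(wk * yk for wk, yk in zip(w, y)) / sw
        sxy = sum(wk * (xk - xbar) * (yk - ybar) for wk, xk, yk in zip(w, x, y))
        sxx = sum(wk * (xk - xbar) ** 2 for wk, xk in zip(w, x))
        slopes.append(sxy / sxx)
    return slopes


def monte_carlo_p(coords, x, y, bandwidth, n_sim=100, seed=0):
    """Share of random reassignments whose slope variability matches or exceeds ours."""
    rng = random.Random(seed)
    observed = statistics.pstdev(local_slopes(coords, x, y, bandwidth))
    pairs = list(zip(x, y))            # x and y travel together
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(pairs)             # reassign values to different locations
        px, py = zip(*pairs)
        simulated = statistics.pstdev(local_slopes(coords, px, py, bandwidth))
        if simulated >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_sim + 1)


# Simulated data (assumed): slope drifts from 1.0 to 0.24 along a line
noise = random.Random(1)
coords = [(float(k), 0.0) for k in range(20)]
x = [float((7 * k) % 5 + 1) for k in range(20)]
y = [(1.0 - 0.04 * k) * xk + noise.gauss(0.0, 0.05) for k, xk in enumerate(x)]

observed_sd, p_value = monte_carlo_p(coords, x, y, bandwidth=3.0)
print(observed_sd, p_value)
```

A small p-value says the observed spatial variation in the coefficient is larger than what random placement of the same data tends to produce. With real county data the same loop is much slower, because every repetition is a full GWR run.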

Other method: Fotheringham Significance Test

$\alpha_{\text{Fotheringham}} = \dfrac{\alpha}{1 + p_e - \dfrac{p_e}{np}}$

$p_e$ is effective number of parameters

$p$ is the number of parameters

$n$ is the number of observations (1,387 here)

$\alpha_{\text{Fotheringham}} = \dfrac{.05}{1 + 37.97 - \dfrac{37.97}{1387 \times 8}} = .001283$

In Excel we can find the significant T-statistic using:

TINV(.001283,1379)

In R we use:

qt(1-(.001283/2),1379)

Either way we get a value of ~3.23
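
For readers without Excel or R at hand, the same numbers can be checked in Python with the standard library alone. This is a sketch under stated assumptions: the adjustment is written as α / (1 + pe − pe/(n·p)) using the values quoted on these slides, and the critical t value uses a large-sample approximation (a normal quantile plus a first-order correction) instead of an exact t quantile. The function names are hypothetical.

```python
from statistics import NormalDist


def fotheringham_alpha(alpha, p_e, n, p):
    """Adjusted significance level: alpha / (1 + p_e - p_e / (n * p))."""
    return alpha / (1.0 + p_e - p_e / (n * p))


def approx_t_critical(alpha_two_sided, df):
    """Approximate two-sided critical t value for large df.

    Uses the normal quantile z plus the correction (z^3 + z) / (4 * df);
    this is an approximation, not an exact t quantile.
    """
    z = NormalDist().inv_cdf(1.0 - alpha_two_sided / 2.0)
    return z + (z ** 3 + z) / (4.0 * df)


alpha_adj = fotheringham_alpha(0.05, p_e=37.97, n=1387, p=8)
t_crit = approx_t_critical(alpha_adj, df=1387 - 8)
print(round(alpha_adj, 6), round(t_crit, 2))
```

The adjusted level comes out at about .001283 and the critical value at about 3.23, matching the Excel and R results, so a local t-statistic needs to exceed roughly 3.23 in absolute value to count as significant.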

Results: Significant Nonstationarity for Percent Hispanic

Outlier problems

Outliers cause problems for everybody, but their impact is greater for local regressions, particularly when bandwidth keeps number of observations low.

In standard OLS

◦ Run model and identify observations with high or low residuals (~ +/- 4)

◦ Weight these observations less than 1

◦ Re-run until none of the observations have extreme residuals

◦ Now do your GWR with weights assigned

Poisson and Logistic model forms

Implementations exist in both R and GWR 3.x software

Both require much greater care with respect to collinearity and lack of variation

Mixed-form models

What if some of your variables are stationary and others have variation?

Mixed-form models allow you to hold some coefficients constant while allowing others to vary

Not yet implemented in any statistical package, but not that difficult from a technical standpoint

Concluding comments

What comes next?

◦ Spatial regression

◦ Multilevel models
