GISPopSci
Day 1
Spatial Regression Modeling
Paul Voss & Katherine Curtis
The Center for Spatially Integrated Social Science
Santa Barbara, CA
July 12-17, 2009
Objective
Provide a solid introduction to and overview of the concepts and techniques of spatial regression analysis, with plenty of hands-on experience in the afternoon lab sessions
What’s the point?
Data that are referenced to location bring valuable additional information to a data analysis
But they also bring some (possibly unfamiliar) pitfalls that require a new awareness as that analysis proceeds
Plan for the week (1)
• Today: Understanding Spatial Data
  – Broad overview of spatial data, spatial data analysis and core spatial concepts. Why “spatial is special”
  – Classical linear regression model
    • assumptions underlying OLS
    • consequences of violations of assumptions
    • why spatial processes violate OLS assumptions
  – Introduction to EDA & ESDA
  – Lab: Shapefiles & introduction to GeoDa & EDA using R
• Tuesday: Diagnosing spatial autocorrelation
  – Understanding & measuring global spatial autocorrelation
  – Weights matrices
  – Understanding & measuring local spatial autocorrelation
    • Moran scatterplot
    • LISA statistics
  – Lab: Global & local measures of spatial autocorrelation
Plan for the week (2)
• Wednesday: Common modeling strategies for spatial processes
  – Understanding spatial processes
    • spatial heterogeneity
    • spatial dependence
  – Spatial regression models
    • OLS in GeoDa
    • understanding GeoDa regression diagnostics
    • spatial lag model; spatial error model
  – No Lab. Free time in Santa Barbara!
• Thursday: Focus on spatial heterogeneity
  – Spatial heterogeneity in relationships
  – Introduction to GWR
    • GWR theory and approach
    • GWR software
  – Lab: GWR hands-on using R
Plan for the week (3)
• Friday: Alternative spatial processes
  – Reminder that there are some spatial data analysis approaches that have not been covered in this class
    • point pattern analysis
    • geostatistical methods; kriging
  – Exciting things on the horizon
  – Resources for pursuing these topics intensively
  – Lab:
    • Research presentations
    • Open forum
Plan for today
• A good motivational example
• Why “spatial is special”
  – characteristics of spatial data
  – problems caused by spatial data
• Spatial analysis vs. spatial data analysis
• Broad, gentle overview of spatial data and spatial data analysis
• Classes of problems in spatial data analysis
• Review OLS assumptions & violations
• Exploratory Data Analysis
• Introduction to afternoon lab
Any questions as we get started?
Some beginning facts
• Regression is the workhorse of quantitative social science
• Much social science data is spatially referenced
• Spatially referenced data bring special problems to an analysis
  – heterogeneity of observational units → heteroskedasticity
  – spatial autocorrelation → residual dependence
• A consequence of these “special problems” is that the assumption of iid errors in a standard OLS regression specification is violated, and statistical inference from such a model is not valid
Motivation
• Omer R. Galle, Walter R. Gove, & J. Miller McPherson. 1972. “Population Density and Pathology: What Are the Relations for Man?” Science 176(4030):23-30
  – data: 75 community areas in Chicago for 1960
  – 5 measures of “social pathology” as function of crowding, controlling for social class & ethnicity
  – “…the greater the density, the greater the fertility” (p. 176)
• Colin Loftin & Sally K. Ward. 1983. “A Spatial Autocorrelation Model of the Effects of Population Density on Fertility” American Sociological Review 48(1):121-128
  – “…the GGM findings with regard to fertility are an artifact of the failure to recognize the presence of disturbance variables which are spatially autocorrelated” (p. 127)
• Moral: When analyzing spatially referenced data, it’s highly useful to know something about the rudiments of spatial data analysis (i.e., some understanding of why “spatial is special”)
Okay… So why is spatial special?
• Scale dependency
  – Robinson (ASR, 1950) → “Ecological Fallacy”
  – “The relationship between ecological and individual correlations which is discussed in this paper provides a definite answer as to whether ecological correlations can validly be used as substitutes for individual correlations. They cannot.” (p. 357)
  – MAUP (Modifiable Areal Unit Problem)
  – “Habitual users of ecological correlations know that the size of the coefficient depends to a marked degree upon the number of sub-areas. …[T]he size of the ecological correlation [will increase numerically as consolidation of smaller areas into larger areas takes place].” (Robinson, pp. 357-8)
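Robinson's point about aggregation can be illustrated with a toy calculation. The sketch below uses made-up data (three hypothetical areas, two individuals each) in which x and y move in opposite directions within each area while the area averages rise together. The numbers and names are illustrative assumptions only, not data from Robinson's paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: three areas, two individuals (x, y) in each.
# Within an area x and y move in opposite directions;
# across areas the area centers rise together.
areas = {
    "A": [(-1, 1), (1, -1)],
    "B": [(2, 4), (4, 2)],
    "C": [(5, 7), (7, 5)],
}

# Individual-level correlation: pool all six individuals.
individuals = [pt for pts in areas.values() for pt in pts]
r_individual = pearson([p[0] for p in individuals],
                       [p[1] for p in individuals])

# Ecological correlation: correlate the three area means.
area_means = [
    (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
    for pts in areas.values()
]
r_ecological = pearson([m[0] for m in area_means],
                       [m[1] for m in area_means])

print(f"individual r = {r_individual:.3f}, ecological r = {r_ecological:.3f}")
```

In this constructed case the correlation among area means is larger than the correlation among individuals, because averaging removes the within-area variation. Real data need not behave this cleanly.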
Why is spatial special? (2)
• Observational areas are generally of different size
  – Spatial heterogeneity → error heteroskedasticity
Counties in U.S. South: 2000 Census
n = 1,387
Brewster Co. TX: Area: 6,193 mi²
Fairfax City, VA: Area: 6.3 mi²
Counties in U.S. South: 2000 Census
n = 1,387
Harris Co. TX: Pop: 3,400,578
Loving Co. TX: Pop: 67
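One way to see why unequal population sizes produce heteroskedasticity: a rate computed from a small population is far noisier than the same rate computed from a large one. The sketch below assumes a simple binomial model and a hypothetical rate of 0.20. Only the two population counts come from the slide.

```python
import math

def proportion_se(p, n):
    """Binomial standard error of a proportion p observed on n people."""
    return math.sqrt(p * (1.0 - p) / n)

# Populations from the slide; the rate p = 0.20 is a hypothetical value.
populations = {"Harris Co. TX": 3_400_578, "Loving Co. TX": 67}
p = 0.20

se = {name: proportion_se(p, n) for name, n in populations.items()}
ratio = se["Loving Co. TX"] / se["Harris Co. TX"]

for name, value in se.items():
    print(f"{name}: standard error of the rate = {value:.5f}")
print(f"ratio of standard errors = {ratio:.0f}")
```

Under these assumptions the rate for the smallest county is a couple of hundred times less precise than the rate for the largest. A regression that treats both observations as equally reliable is therefore unlikely to have constant error variance.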
Why is spatial special? (3)
• Observational areas are generally of different size (geographic size; population size)
  – Heteroskedasticity in regression errors
• Neighboring areas are similar
  – Tobler’s 1st law of Geography: “Everything is related to everything else, but near things are more related than distant things.” (1970:236)
  – (positive) spatial autocorrelation
Proportion of Children (under age 18) in Poverty: 2000 Census
Source: SF3 Table P87
Loudoun Co. VA: 0.028
Starr Co. TX: 0.595
Why is spatial special? (4)
• Observational areas are generally of different size
  – Heteroskedasticity in regression errors
• Neighboring areas are similar
  – Tobler’s 1st law of Geography: “Everything is related to everything else, but near things are more related than distant things.” (1970:236)
  – (positive) spatial autocorrelation
• Probable stumbling blocks when modeling the data
  – Again… the assumption of iid errors in a standard OLS regression specification is violated and statistical inference is not valid
Spatial versus Non-Spatial Data Analysis
Take these two maps, for example
Any traditional data analysis that does not utilize the location & spatial arrangement (topological information) of the data will lead to identical results for the two maps
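The two-maps point can be sketched numerically. The example below builds two hypothetical 4 x 4 grids holding exactly the same sixteen values, one clustered and one arranged as a checkerboard, and compares aspatial summaries with Moran's I, a global autocorrelation measure. Moran's I, rook contiguity and binary weights are assumptions chosen here for illustration; they are not taken from the maps on the slide.

```python
from statistics import mean, pvariance

def rook_pairs(rows, cols):
    """Ordered pairs (i, j) of cells that share an edge on a rows x cols grid."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    pairs.append((r * cols + c, rr * cols + cc))
    return pairs

def morans_i(values, pairs):
    """Moran's I with binary weights: w_ij = 1 for each listed pair."""
    n = len(values)
    m = mean(values)
    dev = [v - m for v in values]
    s0 = len(pairs)  # sum of all weights
    cross = sum(dev[i] * dev[j] for i, j in pairs)
    return (n / s0) * cross / sum(d * d for d in dev)

pairs = rook_pairs(4, 4)

# Same sixteen values, two spatial arrangements (row-major order).
clustered = [1, 1, 0, 0] * 4                                       # left half 1s
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]   # alternating

# Aspatial summaries cannot tell the two maps apart...
print(mean(clustered), mean(checkerboard))
print(pvariance(clustered), pvariance(checkerboard))

# ...but a measure that uses the spatial arrangement can.
i_clustered = morans_i(clustered, pairs)
i_checker = morans_i(checkerboard, pairs)
print(round(i_clustered, 3), round(i_checker, 3))
```

The means and variances are identical, while Moran's I is positive for the clustered arrangement and negative for the checkerboard.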
And why is this important?
“What makes the methods of modern [spatial data analysis] different from many of their predecessors is that they have been developed with the recognition that spatial data have unique properties and that these properties make the use of methods borrowed from aspatial disciplines highly questionable”
Fotheringham, Brunsdon & Charlton, Quantitative Geography: Perspectives on Spatial Data Analysis, Sage, 2000, p. xii
Because of these unique properties, if we blithely carry out an OLS regression using aggregated geographic data… some large subset of the following undesirable horrors almost certainly awaits us (the curse of Tobler’s 1st Law)
• Our estimated regression coefficients are biased and inconsistent, or…
• Our estimated regression coefficients are inefficient
• Our R² statistic is exaggerated
• We’ve made incorrect inferences
• We’ll never get it published
  – or shouldn’t!
Given these problems, why would anyone bother to analyze spatial data?
• There’s lots of it!
• Occasional need for non-disclosure of individual-level data
• Space is important
  – Space as a means of organizing human activities
  – Location as a means of integrating interesting data
• Some interesting questions can (only) be examined with spatial data
Some interesting questions that might be addressed using modern spatial analysis:
• Local tax rates (“spillover” in y?)
• Expenditures for police (“spillover” in x?)
• Demographic analysis: Are Quitman & Tallahatchie counties (two contiguous counties in the Mississippi Delta) really two separate observations? (“spillover” in ε?)
Now, Let’s Define Some Terms
What exactly are “spatial data”?
…data where, in addition to attribute values relating to the primary phenomena of interest, the relative spatial locations of observations are also recorded
And what is spatial regression analysis?
Regression using spatial data where explicit attention is given to location and arrangement of geographic units
Even if we don’t really care about spatial processes
Spatial Analysis versus Spatial Data Analysis
GIS
Spatial Analysis
• P-median problems
• Maximal covering problem
• Location set covering problem
• Traveling salesman problem (TSP)
[Diagram labels: GIS; Spatial Data Analysis; Spatial Analysis; “Spatial Statistics”]
Spatial Data Analysis
[Diagram labels: GIS; Spatial Data Analysis; Spatial Analysis; “Spatial Statistics”; Event Data; Geostatistical Data; Lattice Data; Spatial Econometrics; Spatial Regression Analysis]
Types of Spatial Data
• Event data (point data)
• Spatially continuous data (geostatistical data)
• Lattice data (regionalized data)
• Spatial interaction data (flow data)
Our focus of attention in this workshop will primarily be on Lattice Data
On Friday we’ll briefly touch on the other kinds of spatial data and other spatial data analytic approaches
Lattice Data
• Have one or more variables whose values are measured over a set of areas
• Interest focuses on the attribute values, not on the locations, which are known and unchanging
• Objectives:
  – Detecting patterns in the spatial arrangement of attribute values
  – Examining the relationships among the set of variables, taking into account any spatial effects present
• Approach:
  – Exploratory spatial data analysis
  – Confirmatory spatial data analysis (spatial regression)
Some early cautions:
• Our goal is to correctly model and draw proper inferences about an unobserved, random data-generating process (DGP; a random field)
  – spatial process?
    • spatial heterogeneity?
    • spatial dependence?
    • time?
  – sampling perspective?
• Spatial autocorrelation
• Scale issues; scale dependency; aggregation bias; boundary issues
• The tools are pretty good, but along the way many subjective decisions are made
  – defining “neighborhood”
  – choosing a weights matrix
Before taking a closer look at the meanings of Spatial Autocorrelation, Spatial Heterogeneity and Spatial Dependence, let’s take a closer look at the traditional OLS regression model
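Standard OLS Regression
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$$
In matrix notation:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
with dimensions y: (n x 1), X: (n x k+1), β: (k+1 x 1), ε: (n x 1)
and where:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
$$\hat{\sigma}^2 = \left[\frac{1}{n-k-1}\right]\mathbf{e}^T\mathbf{e}$$
For those of you who glaze over at the sight of that, the next 3 slides are for you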
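How do we get from this
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$$
to this
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
???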
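Assume n = 6 and we have two independent variables x1 and x2
$$
\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \varepsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \varepsilon_2 \\
y_3 &= \beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \varepsilon_3 \\
y_4 &= \beta_0 + \beta_1 x_{41} + \beta_2 x_{42} + \varepsilon_4 \\
y_5 &= \beta_0 + \beta_1 x_{51} + \beta_2 x_{52} + \varepsilon_5 \\
y_6 &= \beta_0 + \beta_1 x_{61} + \beta_2 x_{62} + \varepsilon_6
\end{aligned}
$$
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ is shorthand for the above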
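We get to the shorthand matrix algebra expression by realizing the following:
$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix}
1 & x_{11} & x_{12} \\
1 & x_{21} & x_{22} \\
1 & x_{31} & x_{32} \\
1 & x_{41} & x_{42} \\
1 & x_{51} & x_{52} \\
1 & x_{61} & x_{62}
\end{bmatrix}, \quad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{bmatrix}
$$
and
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$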
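For readers who prefer code to matrix algebra, the n = 6, two-variable case can be sketched in plain Python. The data values below are made up for illustration. The script builds X with a leading column of 1s, computes the estimate of β by solving the normal equations (X'X)b = X'y, and then computes the error-variance estimate e'e / (n − k − 1), where e is the vector of residuals. The small helper functions are assumptions written for this sketch, not part of GeoDa or R.

```python
def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Hypothetical data: n = 6 observations, k = 2 independent variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.0, 2.5, 6.1, 5.2, 9.4, 8.3]

# Design matrix: a column of 1s for the intercept, then x1 and x2.
X = [[1.0, a, b] for a, b in zip(x1, x2)]

Xt = transpose(X)
XtX = matmul(Xt, X)                                            # (k+1) x (k+1)
Xty = [sum(x * yi for x, yi in zip(col, y)) for col in Xt]     # (k+1) x 1

# Solve the normal equations instead of forming the inverse explicitly.
beta_hat = solve(XtX, Xty)

fitted = [sum(b * x for b, x in zip(beta_hat, row)) for row in X]
resid = [yi - fi for yi, fi in zip(y, fitted)]

n, k = len(y), len(X[0]) - 1
sigma2_hat = sum(e * e for e in resid) / (n - k - 1)

print("beta_hat:", [round(b, 4) for b in beta_hat])
print("sigma2_hat:", round(sigma2_hat, 4))
```

Solving the normal equations gives the same estimate as the inverse formula and is the more common numerical choice. A useful check is that the residuals are orthogonal to every column of X.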
Okay… but what about the assumptions underlying the OLS regression model?
We must place some conditions both on the population and on the data to establish unbiasedness, consistency and efficiency
These conditions are embodied in the Gauss-Markov Theorem
The Gauss-Markov Theorem asserts that $\hat{\boldsymbol{\beta}}$ is a “Best Linear Unbiased Estimator” (BLUE) of $\boldsymbol{\beta}$, provided the following assumptions are met:
• Linearity
• Mean independence: E[εi | xi] = 0 (implies E[ε] = 0)
• Homoskedasticity and uncorrelated disturbances: Cov[ε] = E[ε ε’] = σ² I
• X is of rank k+1 (k = no. of “independent” vars.)
• X is non-stochastic (or stochastic with finite second moments, and E[X’ε] = 0 for unbiasedness)
• Normal disturbances (needed for exact inference, not for the BLUE property itself)
To discuss these assumptions we need a few words to describe the performance of estimates
• Bias. An estimation method is unbiased if it produces estimates that have a statistical expectation equal to the true (population) value
• Consistency. Estimates converge toward the quantity being estimated as the sample size increases
• Efficiency. Efficient estimates are those that have smaller standard errors than estimates produced by some competing estimator
Regarding the OLS assumptions
• Linearity and mean independence support unbiased estimates
• Homoskedasticity and uncorrelated disturbances support efficiency
• Normality of disturbances means we can do statistical inference using t tables
• Normality also means we can estimate the model by MLE
It is partly with these OLS assumptions in mind that we stress the need to know our data
EDA / ESDA
We engage in EDA (ESDA) to determine how our data might lead to violations of one or more of the assumptions underlying our model regarding the unknown DGP
For example…
• Are we starting out by maximizing the probability of obtaining a normal error structure?
  – Why do we care about this?
  – How can we check for this?
  – What can we do about it?
• Do we have good linear relationships between our dependent variable and independent variables?
  – Why do we care about this?
  – How can we check for this?
  – What can we do about it?
• Should any of our variables be transformed?
  – Do you know how to proceed?
• Do we have any outliers?
  – What kind of outliers?
  – What options are available to us?
• Fortunately almost everything we’ll want to do can be done within GeoDa™. And what we do in GeoDa, we’ll try to replicate with R
http://geodacenter.asu.edu/
GeoDa™ is a trademark of Luc Anselin & the University of Illinois
Exploring Spatial Data
• Goal is to seek good understanding and description of the data, thus suggesting hypotheses to explore
• Not much emphasis here on p-values, which are so ubiquitous in most of our training & our statistical instincts
• Look especially for clues to “spatial heterogeneity” or “spatial dependence”
• Few a priori assumptions about the data
• Robust (“resistant”) methods
• Analysis may sometimes end here
• EDA/ESDA: Despite good tools, at heart EDA is a philosophy; an attitude; “best practice” way of thinking about your data; a way of staying out of trouble
EDA / ESDA Tools
• Maps
• Descriptive statistics
• Plots and graphics
• Classification and clustering methods
• Software with dynamically linked objects
• Global and local spatial autocorrelation
• Look for evidence of “Tobler’s First Law”
Maps? Sure, but what kind?
[Map panels: PPOV Quantile Map (9); PPOV Percentile Map; PPOV Box Map (1.5); PPOV Std. Dev. Map]
So… some answers are:
What are you trying to discover with the map?
What are you wishing to show with the map?
Descriptive Statistics
Plots and Graphs
Plots and Graphs… (cont.)
Checks for normality and symmetry
Plots and Graphs… (cont.)
Checks for linearity & outliers
Tomorrow we’ll take a close look at the concept of spatial autocorrelation
But, in the context of assumptions underlying the OLS model, this raises some serious problems
Correlated Disturbances
$$\mathrm{Cov}[\boldsymbol{\varepsilon}] = E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'] = \sigma^2 \boldsymbol{\Sigma} \neq \sigma^2 \mathbf{I}$$
• This can arise from a number of violations of the OLS assumptions
• Often occurs when there is an important independent variable missing from the specification
• It causes OLS parameter estimates or their associated standard errors to be unreliable
• It inflates the value of the R² statistic
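If non-zero covariance between X and ε
$$E[\mathbf{X}'\boldsymbol{\varepsilon}] \neq 0$$
OLS parameter estimates are biased
$$E[\mathbf{b}] = E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] = E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})] = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}E[\mathbf{X}'\boldsymbol{\varepsilon}] \neq \boldsymbol{\beta}$$
OLS parameter estimates are inconsistent
$$\operatorname{plim}[\mathbf{b}] = \boldsymbol{\beta} + \operatorname{plim}[(\mathbf{X}'\mathbf{X}/n)^{-1}] \times \operatorname{plim}[\mathbf{X}'\boldsymbol{\varepsilon}/n]$$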
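A small simulation can make the bias concrete. The sketch below assumes a hypothetical true model y = 1 + 2x + 3z + u and then fits a regression of y on x alone, so that the omitted z ends up in the disturbance. When z is correlated with x, the disturbance is correlated with x and the slope is pulled away from 2. When z is unrelated to x, the omission only adds noise. All coefficients, sample sizes and seeds are illustrative assumptions.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n, link, seed=0):
    """Draw (x, y) from y = 1 + 2x + 3z + u, where z = link * x + noise.

    z is generated but NOT returned: the analyst omits it,
    so 3z + u becomes the disturbance of the fitted model.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [link * xi + rng.gauss(0, 1) for xi in x]
    y = [1 + 2 * xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, y

n = 20_000

# Omitted variable correlated with x: disturbance correlated with x.
x_c, y_c = simulate(n, link=0.5)
slope_correlated = ols_slope(x_c, y_c)

# Omitted variable unrelated to x: disturbance uncorrelated with x.
x_u, y_u = simulate(n, link=0.0)
slope_uncorrelated = ols_slope(x_u, y_u)

print(f"true slope on x: 2.0")
print(f"slope with correlated omitted variable:   {slope_correlated:.2f}")
print(f"slope with uncorrelated omitted variable: {slope_uncorrelated:.2f}")
```

With the correlated omitted variable the slope settles near 2 + 3(0.5) = 3.5 rather than 2, and a larger sample does not remove the gap, which is what inconsistency means here.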
Readings for today
• Anselin, Luc. 1989. “What is Special About Spatial Data? Alternative Perspectives on Spatial Data Analysis.” NCGIA Technical Paper 89-4.
• Ward, Michael D., and Kristian Skrede Gleditsch. 2008. Spatial Regression Models. Quantitative Applications in the Social Sciences, No. 155. Thousand Oaks, CA: Sage. Chapter 1. [Book is available online at: http://faculty.washington.edu/mdw/pdfs/SRMbook.pdf ]
• Goodchild, Michael F., Luc Anselin, Richard P. Applebaum, and Barbara Herr Harthorn. 2000. “Toward Spatially Integrated Social Science.” International Regional Science Review 23:139-159.
• Loftin, Colin and Sally K. Ward. 1983. “A Spatial Autocorrelation Model of the Effects of Population Density on Fertility.” American Sociological Review 48(1):121-128.
• Galle, Omer R., Walter R. Gove, and J. Miller McPherson. 1972. “Population Density and Pathology: What Are the Relations for Man?” Science (new series) 176:23-30.
Afternoon Lab
Shapefiles and introduction to GeoDa™ and EDA using GeoDa & R
Questions?