AN OVERVIEW

Spatial Analysis

Spatial Analysis involves ...

Data Exploration – the uncovering of patterns, identification of the unusual, discovery of groups

Visualisation – mapping and charting

Summary – data reduction, noise removal, synthesis of information

Modelling and Estimation

Explanation (Causal analysis)

Reliability and Quality Measurement

Data Types in Spatial Analysis

Spatial analysis draws on three types of information:

THEMATIC (Attribute) data – WHAT are the key characteristics we are interested in?

SPATIAL data – WHERE are things in space?

TEMPORAL data – WHEN things exist or existed, or when particular events took place

We can explore thematic, spatial or temporal relationships separately, or in combination

BUT the more we explore simultaneously, the HARDER it gets and the LONGER it takes

Properties of Spatial Data

SPATIAL PATTERN of locations

SPATIAL DEPENDENCE between attribute values observed at different locations

SPATIAL HETEROGENEITY - systematic variation across space

Spatial Patterns

• Systematic behaviour / occurrence in space:

RANDOM, CLUSTERED or DISPERSED

Spatial Dependence

• Tobler’s First Law of Geography: “Everything is related to everything else, but near things are more related than distant things”

• Consequently, the way in which we aggregate and partition data may have implications… (MAUP – see later)

What is Spatial Heterogeneity?

Systematic Variation Across Space. This can be caused by:

intrinsically different relationships across space (spatial variations in attitudes or preferences due to administrative, political or social contexts);

model misspecification (omitted variables, inappropriate functional form);

spatial variation in relationships due to sampling variations.

Exploratory Data Analysis

Exploratory techniques are useful in:

Pre-modelling: exploring data accuracy, formulating hypotheses, detecting clusters, outliers and trends (“brushing” in GeoDa);

Post-modelling: examining model accuracy and robustness (e.g. mapping of residuals from a model).

May be applied to individual variables, or to relationships between variables.

Exploratory Spatial Data Analysis in GeoDa

http://www.ph.ucla.edu/epi/snow/broadstreetpump.html

Exploratory Spatial Data Analysis: John Snow’s Map of Cholera Deaths, 1855

Exploratory Post-Modelling Analysis in ONS: Checking the validity of small area estimates

Accounting for Spatial Effects

We can exploit spatial properties to:

Produce smooth estimates, formulate hypotheses and eliminate noise

We can potentially build models to explain and measure:

Spatial interactions (understanding flows)

Similarity and influences across space

Variations across space

Processes that evolve across space and time

Similarity and influences between neighbours

Rook’s Case Bishop’s Case Queen’s Case

‘W’ Adjacency Matrix
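The three contiguity cases can be sketched in code. The following is a minimal illustration, assuming areas laid out as a regular grid of cells numbered row by row; the function name `contiguity_matrix` and the 3 x 3 grid are hypothetical and not part of the original slides.

```python
def contiguity_matrix(n_rows, n_cols, case="rook"):
    """Binary adjacency matrix W for a regular grid of cells.

    Cells are numbered row by row. `case` is "rook" (shared edges),
    "bishop" (shared corners only) or "queen" (edges or corners).
    """
    rook_steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    bishop_steps = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    steps = {
        "rook": rook_steps,
        "bishop": bishop_steps,
        "queen": rook_steps + bishop_steps,
    }[case]

    n = n_rows * n_cols
    w = [[0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                # Only link to cells that actually lie inside the grid.
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    w[i][rr * n_cols + cc] = 1
    return w


w_rook = contiguity_matrix(3, 3, "rook")
w_bishop = contiguity_matrix(3, 3, "bishop")
w_queen = contiguity_matrix(3, 3, "queen")

centre = 4  # middle cell of the 3 x 3 grid
print(sum(w_rook[centre]), sum(w_bishop[centre]), sum(w_queen[centre]))  # 4 4 8
```

Real administrative areas are irregular polygons, so in practice W is usually derived from shared boundaries in GIS software rather than from grid positions.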

Spatial dependence with area data

Measuring local relationships:

Moran scatterplot (visualising)

Moran’s I and Geary’s C (hypothesis testing)

[Figure: Moran scatterplot of the proportion of economically active heads of household in social classes 1 & 2 (East Anglia wards). Horizontal axis: PHHSO; vertical axis: WPHHSO.]
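As a rough sketch of the two statistics named above, the code below computes global Moran's I and Geary's C from a binary adjacency matrix using the standard textbook formulas. The four-area chain and the attribute values are hypothetical, chosen only to show how a smooth trend and an alternating pattern are scored.

```python
def morans_i(values, w):
    """Global Moran's I for attribute values and a weights matrix w."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def gearys_c(values, w):
    """Global Geary's C for attribute values and a weights matrix w."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(sum(row) for row in w)
    sq_diff = sum(
        w[i][j] * (values[i] - values[j]) ** 2 for i in range(n) for j in range(n)
    )
    return (n - 1) * sq_diff / (2 * s0 * sum((v - mean) ** 2 for v in values))


# Hypothetical example: four areas in a row, each adjacent to the next.
w_chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

trend = [1, 2, 3, 4]        # similar values sit next to each other
alternating = [1, 4, 1, 4]  # neighbours are as different as possible

print(morans_i(trend, w_chain), gearys_c(trend, w_chain))              # about 0.33 and 0.3
print(morans_i(alternating, w_chain), gearys_c(alternating, w_chain))  # about -1.0 and 1.5
```

Positive Moran's I (and Geary's C below 1) suggests that similar values cluster together; negative I (and C above 1) suggests that neighbours tend to differ. Hypothesis testing additionally needs a reference distribution, for example from random permutations of the values, which is not shown here.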

Modelling Spatial Dependence

• The spatial interaction matrix W normally shows direct neighbours (R, B or Q case), but may accommodate second- or higher-order neighbours by means of spatial lags, thus:
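One way to picture higher-order neighbours is as the areas whose shortest path through W takes exactly k steps. The sketch below assumes that definition; the helper name `higher_order_neighbours` and the five-area chain are hypothetical illustrations, not taken from the slides.

```python
def higher_order_neighbours(w, order):
    """Binary matrix linking areas whose shortest path through w
    has exactly `order` steps (order 1 reproduces w itself)."""
    n = len(w)
    result = [[0] * n for _ in range(n)]
    for start in range(n):
        visited = {start}
        frontier = [start]
        # Expand one ring of neighbours per step (breadth-first search).
        for _ in range(order):
            next_frontier = []
            for i in frontier:
                for j in range(n):
                    if w[i][j] and j not in visited:
                        visited.add(j)
                        next_frontier.append(j)
            frontier = next_frontier
        for j in frontier:
            result[start][j] = 1
    return result


# Hypothetical example: five areas in a row, each adjacent to the next.
w_chain = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

w_second = higher_order_neighbours(w_chain, 2)
print(w_second[0])  # [0, 0, 1, 0, 0]
print(w_second[2])  # [1, 0, 0, 0, 1]
```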

What is a neighbourhood?

What relationships to a given location might make neighbouring spatial data influential?

Only immediate neighbours (R, B or Q case)?

First, second, third etc. ‘order’ neighbours?

All neighbours within a given radius (h)?

A fixed number of neighbours?

Capturing Neighbourhoods – the Spatial Kernel

A ‘tent’ that can be placed over each data point

Fixed Kernel - the radius is fixed, so the number of neighbours captured varies across the study area

Adaptive Kernel - the radius varies over the study area, but the number of neighbours captured for every point is fixed.

Neighbours enclosed ‘within the tent’ are then weighted according to a ‘distance decay curve’ implemented by the kernel.

The coverage (floor area of the tent) might be chosen:

arbitrarily (resulting in too much or too little smoothing);

by some rule of thumb, e.g.

optimally (using Kriging interpolation).
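The difference between fixed and adaptive kernels can be sketched as two neighbour-selection rules. This is a minimal illustration only: the function names and the point coordinates (a dense cluster plus two isolated points) are hypothetical, and the distance-decay weighting inside the kernel is left out.

```python
import math


def fixed_kernel(points, target, radius):
    """Indices of the points that fall within a fixed radius of the target."""
    return [i for i, p in enumerate(points) if math.dist(p, target) <= radius]


def adaptive_kernel(points, target, k):
    """Indices of the k nearest points, and the radius needed to reach them."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], target))
    nearest = order[:k]
    radius = math.dist(points[nearest[-1]], target)
    return nearest, radius


# Hypothetical data: four points close together and two isolated points.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (20, 0)]
dense_target = (0.5, 0.5)
sparse_target = (15, 0)

# Fixed kernel: same radius everywhere, neighbour count varies.
print(len(fixed_kernel(points, dense_target, 5)))   # 4
print(len(fixed_kernel(points, sparse_target, 5)))  # 2

# Adaptive kernel: same neighbour count everywhere, radius varies.
print(adaptive_kernel(points, dense_target, 3)[1])   # about 0.71
print(adaptive_kernel(points, sparse_target, 3)[1])  # 14.0
```

In sparse parts of the study area a fixed kernel may capture too few points to give a stable estimate, while an adaptive kernel may reach a long way to find its quota of neighbours.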

Spatial Kernels - Potential for errors

IDW Smoothing

A simple smoothing method used widely in GIS is IDW (inverse distance-weighted) interpolation

IDW interpolation estimates the value of the target variable y at unsampled points using a local weighted average of known values

The influence of each local point is weighted by proximity to the point of interest

IDW Interpolation

[Figure: a point of interest surrounded by eight known points at distances d1 to d8]

In general terms, the smoothed values of a target variable y are weighted averages of the values at n known points, where the weights $w_{ij}$ link observation i and any other observation j.

The weighting function, w, is used to model the influence of distance on the contribution that the other points make.

It is usually some function of distance d, like $w = 1/d^k$ or $w = e^{-kd}$ with k > 0.
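A minimal sketch of IDW estimation follows, assuming the inverse-power weighting function (one over distance to the power k) and weights normalised to sum to one. The function name `idw_estimate`, the default power of 2 and the two known points are hypothetical choices for illustration.

```python
import math


def idw_estimate(known, target, power=2.0):
    """Inverse distance-weighted estimate at `target`.

    `known` is a list of ((x, y), value) pairs. Each known value is
    weighted by 1 / d**power, where d is its distance to the target.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for point, value in known:
        d = math.dist(point, target)
        if d == 0:
            # The target coincides with a known point: return its value.
            return value
        weight = 1.0 / d ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical example: two known points two units apart.
known = [((0, 0), 10.0), ((2, 0), 20.0)]

print(idw_estimate(known, (1, 0)))    # midway, so a plain average: 15.0
print(idw_estimate(known, (0.5, 0)))  # closer to the first point: about 11.0
```

A larger power makes the nearest points dominate, giving a more local and less smooth surface; a smaller power moves the estimate towards the plain average of the known values.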

Exploiting Spatial Dependency - Example

There were about 215,000 house sales in London in 2002

[Figure panels: Adaptive Kernel – 25 nearest neighbours; Fixed Kernel – 5 kilometre search radius]

Two contrasted smoothing regimes

Considerations and Refinements

How many points should we include?

What is the maximum distance we should search within?

What should the value of w be? It need not lie between 0 and 1.

Should these values be fixed or variable across space?

We can best-guess the first three, and assume the fourth is fixed – or we can refer to local spatial properties, using Kriging or GWR

Some Limitations ...

As we try to analyse more things simultaneously (space, time, theme), models become more complicated, more difficult to specify and (usually) take longer to compute.

Software to implement is LIMITED / Bespoke

Data handling / volumes can be a problem (e.g. the LSOA W matrix, with 34,378 × 34,378 cells)!

More complexity (usually) means:

More difficult to specify the model properly

Harder to interpret the results

More difficult to judge quality

Combining techniques and models is tricky

… and gains

Spatial analysis builds spatial information explicitly into the analysis and modelling framework.

Methods allow us to explore our data for patterns and trends, to build models that include spatial and space / time relationships, and to visualise the results of our analysis.

If methods are based on statistical principles, we can obtain reliability measures alongside the outputs.

Modifiable Units of Spatial Coverage

There is no “standard” unit of spatial coverage like the HH:MM:SS of time.

When we move from unit-level spatial data to groupings of individual point events, these geographical groups are inherently MODIFIABLE.

A space can be subdivided into n zones in many thousands of different ways.

In practice, this means that the areas used to report aggregated data are often arbitrary (in terms of the data being studied) or designed without data analysis in mind.

[Figure panels: Cross-cut by 2003 CAS Wards; Cross-cut by MSOAs]

One dataset – two boundaries

The two faces of MAUP

The “Scale Effect” or “Aggregation Effect”: Different results and inferences may be obtained when the same set of data is grouped into increasingly larger areal units

The “Zoning Effect” or “Partitioning Effect”: Results and inferences may vary depending on the partition of space that is applied at a given geographical scale.
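Both effects can be reproduced on a toy dataset. The sketch below uses eight hypothetical unit-level observations (invented for illustration, not census data) and sums them into four zones in two different ways; the Pearson correlation between the two variables changes with both the level of aggregation and the choice of partition.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def aggregate(values, zones):
    """Sum unit-level values within each zone (a zone is a tuple of indices)."""
    return [sum(values[i] for i in zone) for zone in zones]


# Hypothetical unit-level counts for two variables.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

# Two different partitions of the same eight units into four zones.
zoning_a = [(0, 1), (2, 3), (4, 5), (6, 7)]
zoning_b = [(0, 2), (1, 3), (4, 6), (5, 7)]

r_units = pearson(x, y)
r_zoning_a = pearson(aggregate(x, zoning_a), aggregate(y, zoning_a))
r_zoning_b = pearson(aggregate(x, zoning_b), aggregate(y, zoning_b))

print(round(r_units, 3))     # 0.905 at unit level
print(round(r_zoning_a, 3))  # 1.0 after aggregation (scale effect)
print(round(r_zoning_b, 3))  # 0.882 with a different partition (zoning effect)
```

Here the units are the same and only the grouping differs, so the spread in the three correlations is entirely an artefact of how the data are aggregated.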

These effects interact. Exactly how?

Scale Zoning

ONS Preliminary MAUP Project - Data

2001 Census data for England and Wales

Standard dataset of 122 variables, aggregated to seven different geographies:

Output Areas (175,434 areas)

Lower Layer Super Output Areas (34,378 areas)

Middle Layer Super Output Areas (7,194 areas)

2003 Statistical Wards (8,868 areas)

Local Authorities (376 areas)

Counties (34 areas)

Government Office Regions (10 areas)

ONS Preliminary MAUP Project - Procedures

Pearson PM correlations between all variables calculated at all seven levels and ‘mapped’ in MatLab.

Both the x and y axes show the range of 112 census variables, starting at the top left-hand corner

Pearson PM correlations were calculated for every variable against all the others.

Naturally, some correlations were positive, others negative.

MatLab Pearson Correlation Matrices - Counts

However, for Count data, as the areas grew in size from Output Area to GOR, there was a distinct shift towards positive correlations.

This is expected – larger areas, more count data.

The sudden ‘redshift’ from MSOA to Ward is accounted for by the heterogeneous (size and socio-economic) nature of wards

MatLab Pearson Correlation Matrices – Counts

[Figure panels: OA, LSOA, MSOA, Ward, LAD, County, GOR]

MatLab Pearson Correlation Matrices – Rates

Rate data also shows a drift towards positive correlation with larger spatial units, but proceeds more smoothly than with count data.

This is because, as the spatial units grow in size, the differences between the areas are reduced.

MatLab Pearson Correlation Matrices – Rates

[Figure panels: OA, LSOA, MSOA, Ward, LAD, County, GOR]

ONS Preliminary MAUP Project – Contd.

We also built simple linear regression models on a smaller, common subset of variables, and fitted these at each geography.

We then applied the models to all seven geographies using raw count and rate-based input data.

These confirmed that Pearson PM correlations change considerably when calculated for different geographies

Summary of (tentative) findings so far

As geographical level alters, Pearson (and Spearman) correlations, relationships between independent and dependent variables, regression parameters and model predictive power also alter, but not necessarily consistently. The direction of change can reverse, and this can include sign changes.

Changes most severe for large area configurations

Count models less resilient than rate models

What next?

Extension of work to consider “typical” ONS analysis scenarios and geographical effects

Seven geographies are not enough to really explore these effects – especially the partitioning problem, as opposed to aggregation effects

Plan to construct many “pseudo” geographies using Dave Martin’s AZ Tool utility.

Use of 2001 individual-level census base to construct a series of analysis scenarios for a proper simulation study across hundreds or thousands of artificial geographies
