Introducing the EquiPop software

an application for the calculation of

k-nearest neighbour contexts/neighbourhoods

John Östh

Uppsala University, Department of Social and Economic Geography, Box 513,

SE-751 20 Uppsala (Sweden); email: [email protected]

Keywords: KNN, k-nearest neighbour, EquiPop, context, contextual analysis,

distance decay, individualised neighbourhood, Egocentric neighbourhood,

bespoke neighbourhood

Abstract

In health science and social science, much attention is paid to understanding the links between the contexts in which individuals reside and the effects these contexts have on people's choices, health and similar outcomes. Contextual statistics are often derived from area-based aggregates or floating catchment area aggregates, and in almost all cases these aggregates are based on varying population counts. Depending on the research question, the population distribution and the shapes of areas or sizes of radii, contextual variables with varying population counts may be more or less reliable in analyses. Using a k-nearest neighbour approach, fixed population counts may be used to construct contextual variables. Traditionally, non-heuristic k-nearest neighbour analyses have been computationally very demanding, a fact that has contributed to their limited use in research. In this article, EquiPop, a software application for the creation of k-nearest neighbour contexts, is presented. EquiPop approaches the k-nearest neighbour computation using new calculation techniques, which enable the creation of contextual variables for spatially very large and detailed datasets in relatively short time spans. This paper describes how to install the software, prepare files, load and run analyses, adjust optional settings and handle the output.

1. The nature of neighbourhoods and the measurement of contexts

Within health science and social science, the study of contexts and contextual effects usually centres on individuals, with the word “neighbourhood” used to capture the individual contexts and “neighbourhood effects” used to capture how individuals affect and are affected by the social networks, topography and extent of a certain area. However, the interchangeable use of the word neighbourhood to mean context is not entirely unproblematic, since neighbourhoods are often single concepts, representing a piece of territory having little or no correspondence with human behaviour (Lee, 1968; Galster, 2001). The ‘real’, human-centred neighbourhoods are more complex and require more than a designated piece of land to be studied. They are so complex, in fact, that bounding a neighbourhood spatially becomes impossible, since different neighbourhood attributes have different and varying scales.

The complexity of neighbourhoods makes them difficult to measure. A pragmatic solution to the measurement problems outlined above is to measure the different neighbourhood attributes, rather than neighbourhoods, using the scales and shapes that suit each attribute best. This means either using statistics aggregated to some kind of pre-existing areal unit, using statistics collected within radii, or using a k-nearest neighbour approach. Evidently, the choice of neighbourhood measurement method becomes important.

There is a large body of work discussing neighbourhoods and especially neighbourhood

effects in both social and health sciences (for review articles see for instance: Pickett & Pearl,

2001; Sampson et al., 2002; Mair et al. 2008; Sellström & Bremberg, 2006). Central in most

of the reviewed articles is that how neighbourhoods are perceived spatially and analytically is

critical for both outcome and inference. There are examples of studies where individual-

centred k-nearest neighbour approaches or bespoke neighbourhoods have been used in social

science and health science (see for instance Johnston et al., 2004; Johnston et al., 2005; Chaix et al., 2005; Davies and Hazelton, 2010). However, as the above-listed review articles also note, contextual or neighbourhood variables are occasionally depicted by fixed bandwidths (radii) and almost always represented by fixed area entities such as wards, tracts, blocks or counties, but almost never represented by k-nearest neighbour contexts.

Whether this is a problem or not is entirely dependent on the studied neighbourhood

attribute(s) and the research question at hand. In cases where the area itself contains attributes,

structures or values that define neighbourhood, neighbourhoods are best defined by fixed

border areas. However, for the study of social processes, fixed border areas are problematic

(Sampson et al., 2002). Fixed border areas also disregard the scale and distribution of what is measured, leading to biases related to the MAUP (Modifiable Areal Unit Problem) (see for instance Openshaw, 1994; Wong, 2004; Andersson and Musterd, 2010).

In cases where a predefined radius best depicts the spatial perimeter of the neighbourhood,

radii-based neighbourhoods should be used. This would potentially be more common where space is an important determinant of the definition of neighbourhoods, such as locations with proximity to services, scenic views, metro stations or similar. Radii-based neighbourhoods have long been used extensively in planning, where planning ideals have centred on reaching local amenities within a short walking distance (see for instance Perry, 1929/1998). Though a majority of the radii-based neighbourhood studies have focused on the planned landscape rather than on people, social processes such as segregation have also been studied successfully using radii-based approaches (Reardon et al., 2008).

If the spatial relationship between individuals and their ilk (or opposite) defines neighbourhoods, contexts are best defined by the composition of neighbours, which is preferably measured using a k-nearest neighbour approach. This holds as long as the physical separation between neighbours is not too large for the neighbourhoods to be real.

There is no single good scale or method for the measurement of all attributes taking place

within a neighbourhood since neighbourhood attributes are produced and consumed at

different scales (Galster, 2001). For the measurement of neighbourhood and neighbourhood

effects this means that the palette of methods is best if varied and sensitive to the spatial

structures and scales at play. This also means that EquiPop and its method for the calculation of k-nearest neighbours should be viewed as a new tool in the toolbox, specifically designed for easy calculation of k-nearest neighbour statistics, even for very large datasets.

The remainder of the article focuses on the functionality and use of EquiPop. First, the computational idea behind EquiPop is presented, followed by an installation guide and a manual for running EquiPop and handling its output. Finally, two examples of how EquiPop can be used are presented, followed by a conclusion. Additional EquiPop-related material is available at http://equipop.kultgeog.uu.se.

2. Idea behind EquiPop

Regular non-heuristic k-nearest neighbour computations are computationally very demanding. This is because finding the k nearest neighbours requires all populated locations j to be sorted according to their distance from an origin i. By accumulating population counts from the sorted vector of j until the value k has been reached, neighbour and neighbourhood statistics can be constructed for location i. However, the sorting and accumulative counting process needs to be repeated for every other location i. When the dataset contains thousands or even millions of unique locations, iterative sorting is no longer a viable computation procedure. The pragmatic solution is usually to replace the non-heuristic k-NN algorithms with context approximations, where fixed-border areas such as municipalities, blocks or wards, or floating catchment areas of a predefined radius, are used instead.
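To make the cost of the brute-force approach concrete, the following Python sketch implements the sort-and-accumulate procedure described above. It is an illustration only and not EquiPop code: the function name `naive_knn_context`, the tuple layout `(x, y, count_all, count_group)` and the decision to count the population at the origin itself are assumptions made for this example.

```python
import math


def naive_knn_context(locations, k):
    """Brute-force k-NN context for every origin.

    locations: list of (x, y, count_all, count_group) tuples (assumed layout).
    Returns one (count_all, count_group, ratio, distance) tuple per origin,
    where distance is None if k was never reached.
    """
    results = []
    for xi, yi, _, _ in locations:
        # Every origin i requires a full sort of all locations j by distance.
        by_distance = sorted(
            locations,
            key=lambda loc: math.hypot(loc[0] - xi, loc[1] - yi),
        )
        total = group = 0
        reached_at = None
        for xj, yj, n_all, n_group in by_distance:
            # Accumulate population counts until the requested k is reached.
            total += n_all
            group += n_group
            if total >= k:
                reached_at = math.hypot(xj - xi, yj - yi)
                break
        ratio = group / total if total else 0.0
        results.append((total, group, ratio, reached_at))
    return results
```

Because the sort is repeated for each of the n origins, the work grows roughly as n² log n, which is what makes this approach impractical for millions of locations.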

In cases where contexts are best studied and understood in terms of neighbouring members of the studied population, a k-nearest neighbour approach is preferable. In order to make k-NN computations possible in very large datasets as well, EquiPop calculates k-NN without sorting the data according to the distances between all i and j. Instead, EquiPop arranges all input data in a runtime geocoded matrix according to the x and y coordinates, so that the data are organized by their spatial position. The matrix is organized like the pixels of a digital image, where space is rectified into gridded units. Because the data are gridded, calculated distances between any units i and j will be less accurate than distances based on the original coordinates (see footnote 1). The average error will be more influential for shorter distances and smaller k-values, since the average error (being fixed) makes up a greater proportion of the distance.

Using gridded data has a fundamental advantage in the computation of k-nearest neighbours. From any unit i, the distances to the surrounding units j will be the same regardless of where i is located (see Östh et al. 2014c). This means that rather than calculating the unique distances between i and all j for each location, a pre-set rule for all distances can be applied. This is also the single most important reason why EquiPop can be used to calculate k-NN in larger datasets.

1 On average, the error introduced by gridding will equal roughly 70.7% of the size of the grid-unit

Currently, EquiPop holds the distance-order instructions for the 4 million nearest units. In

Figure 1 the principle behind gridded distance is shown. From any unit i, the distance to any

unit j, with the suffix “a”, is the same. The same distance relationship is true for all units j

with suffix “b”.

Figure 1 illustrates that the distance between any i and any ja or any i and any jb always will be the same in a gridded dataset.
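The idea of a pre-set distance rule can be sketched as a list of grid offsets that is sorted by distance once and then reused for every origin. The Python below is a minimal illustration under stated assumptions, not EquiPop's implementation: the names `build_offset_order` and `grid_knn`, the dictionary layout of `cells` and the small search radius are all hypothetical, and cells at exactly the same distance are simply visited one after another.

```python
import math


def build_offset_order(max_radius):
    """Return all grid offsets (dx, dy) within max_radius cells,
    sorted once by Euclidean distance from the origin cell."""
    offsets = [
        (dx, dy)
        for dx in range(-max_radius, max_radius + 1)
        for dy in range(-max_radius, max_radius + 1)
        if dx * dx + dy * dy <= max_radius * max_radius
    ]
    # Squared distance gives the same ordering as distance; the offset
    # itself breaks ties so that the order is deterministic.
    offsets.sort(key=lambda o: (o[0] ** 2 + o[1] ** 2, o))
    return offsets


def grid_knn(cells, origin, k, offsets, cell_size):
    """Walk the pre-sorted offsets from one origin cell until k is reached.

    cells: dict mapping (col, row) to (count_all, count_group).
    Returns (count_all, count_group, distance); distance is None when the
    offset list is exhausted before k is reached.
    """
    total = group = 0
    for dx, dy in offsets:
        cell = cells.get((origin[0] + dx, origin[1] + dy))
        if cell is None:
            continue  # unpopulated grid unit
        total += cell[0]
        group += cell[1]
        if total >= k:
            return total, group, math.hypot(dx, dy) * cell_size
    return total, group, None
```

The offset list is built a single time; no per-origin sorting is needed, which is the property the gridded approach relies on.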

3. Installation of EquiPop

In order to obtain the software (download from http://equipop.kultgeog.uu.se) the user needs

to enter usage-related information in two steps. First, the user will need to create a user

account online and agree to user-license terms. Secondly, the user needs to enter information

about the computer(s) on which the software is installed. Thereafter the user will be able to

download the software and an activation code.

Two files are available for installation. The EquiPop-file contains the GUI (Graphical User

Interface), from which all EquiPop operations are controlled, while the EquiPop-service-file

contains the computational parts. The separation of the GUI and the computational parts enables users to install the computationally demanding parts of the software on a fast computer or server, whilst the GUI part of the program may be installed on any computer running Windows (see footnote 2). It is possible to install the EquiPop-file on several computers, all sharing the same EquiPop-service. EquiPop has been developed in C# using .NET on Windows, which means that a .NET framework needs to be installed on the computer (see footnote 3). During installation of the EquiPop-file, the user will be asked to configure the EquiPop-service endpoint by confirming or adding a URL for the installed EquiPop-service. By default the address is http://localhost:19999/equipop, where “localhost” indicates that the computational parts of the program are located on the same computer as the GUI. If the user decides to separate the GUI and the computational parts, the URL needs to be reformulated so that “localhost” is replaced with an IP or DNS address; the “:19999/equipop” part should remain unaltered (see footnote 4).

The EquiPop and EquiPop-service files can be installed in any order, as long as the service endpoint URL points to the correct computer. After a trial period the user will be prompted to enter an activation code. The activation code is generated on the EquiPop website and grants the user access to EquiPop during the 365 days following registration. The activation code can be updated online after installation.

4. Preparing files for EquiPop

Central to the preparation of EquiPop input files is determining a ‘good-enough’ grid unit: fine enough not to compromise local characteristics and spatial patterns, but not so fine that computing becomes time-consuming. A first step is to determine the minimum distance between any two objects in the studied population. The ‘Near’ function in ArcGIS can be used to retrieve minimum distances as well as near distances for different percentiles of the population. Choosing a grid unit smaller than the observed minimum distance will ensure that data will not need to be aggregated. However, if the dataset is detailed and spans larger areas, some aggregation may need to be accepted. In two recent analyses of segregation in California (Clark et al. 2014; Östh et al. 2014d), a grid unit of 250 ft was chosen for the analysis of the racial composition at block level. In a few cases, the block-midpoint to block-midpoint distances were shorter than 250 ft. In those cases the block populations were aggregated and treated as one.

In the second step the data is gridded and aggregated. Figure 2 shows an example of how gridding and aggregation of data is conducted for future use in EquiPop: coordinates are truncated in the first step and the EquiPop dataset is aggregated in the second. SPSS syntax is used in the example; however, most statistics or spreadsheet software can be used. If other software is used, the SPSS syntax can be read as pseudo-code.

2 Tested on Windows Vista/7/8 and Windows Server 2003/2008/2012 R2

3 .NET can be found on Microsoft.com

4 The number refers to the port used for transferring data between EquiPop and EquiPop-service.

The truncation procedure in the syntax removes all coordinate detail finer than 100 metric units and aligns the coordinates to the nearest 50 metric units. The subsequent aggregation procedure makes sure that no more than one instance of each pair of coordinates exists. It is important not to have duplicate coordinates in the input file, since the last encountered value will overwrite any former values (leading to biased output). After running EquiPop, the output may easily be “brought back” to the original file by merging/joining the output with the original file, using the truncated coordinates as index variables.
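Bringing output back to the original records amounts to a key-based join on the gridded coordinates. The Python sketch below shows one way this might look; the field names `x` and `y` in the original records, the helper names, and the gridding rule (truncate to a 100-unit cell, then move to the cell centre) are assumptions for illustration, while `EastWest` and `NorthSouth` follow the output variable names listed in Table 1.

```python
import math


def grid_coordinate(value, cell=100):
    """Assumed gridding rule: truncate to the cell, then shift to its centre."""
    return math.floor(value / cell) * cell + cell // 2


def join_output_to_original(original_rows, output_rows, cell=100):
    """Attach EquiPop output values to each original record.

    original_rows: dicts with raw "x" and "y" coordinates (hypothetical names).
    output_rows: dicts keyed by output variable names, one row per grid unit.
    """
    # Index the output by its coordinate pair so that each lookup is O(1).
    by_cell = {(row["EastWest"], row["NorthSouth"]): row for row in output_rows}
    joined = []
    for record in original_rows:
        key = (grid_coordinate(record["x"], cell),
               grid_coordinate(record["y"], cell))
        merged = dict(record)
        # Records whose grid unit is missing from the output stay unchanged.
        merged.update(by_cell.get(key, {}))
        joined.append(merged)
    return joined
```

All original records that fall into the same grid unit receive the same contextual values, which is the expected consequence of aggregation.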

Note that the aggregate scripting in Figure 2 constructs two “PlaceCode” variables with two

alternative aggregation-methods. This is conducted only to show that aggregation of values

may make use of different techniques – however, only one of the techniques is to be used in

the event of aggregation.

Figure 2 illustrates how gridding of coordinates and aggregation of input data can be scripted using SPSS.
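For readers who do not use SPSS, the two steps (truncating coordinates, then aggregating so that each coordinate pair occurs once) can be sketched in Python. This is not a translation of the syntax in Figure 2: the function names are hypothetical, and the reading of "truncate to 100 units and align to the nearest 50" as "snap to the centre of a 100-unit cell" is an assumption.

```python
import math
from collections import defaultdict


def grid_coordinate(value, cell=100):
    """Drop detail finer than one cell and move to the cell centre
    (e.g. 1234.7 becomes 1250 for a 100-unit cell)."""
    return math.floor(value / cell) * cell + cell // 2


def grid_and_aggregate(records, cell=100):
    """Sum counts for all records that fall into the same grid unit.

    records: iterable of (x, y, count_all, count_group) tuples.
    Returns {(x_grid, y_grid): (count_all, count_group)} with unique keys,
    so no coordinate pair can overwrite another when loaded.
    """
    sums = defaultdict(lambda: [0, 0])
    for x, y, n_all, n_group in records:
        key = (grid_coordinate(x, cell), grid_coordinate(y, cell))
        sums[key][0] += n_all
        sums[key][1] += n_group
    return {key: tuple(counts) for key, counts in sums.items()}
```

Summing is only one possible aggregation method; which method is appropriate depends on the variable being aggregated.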

As a third step, the aggregated file (in this case an SPSS file) must be saved as a tab-separated text file. The first row must consist of variable names, while all other rows must contain the data to be included in the analysis (this is the default saving setting in most software). Five variables must be exported: an “ID” variable used to nominally keep track of the included units, two coordinate variables, one variable holding the sum of all individuals at any location i, and finally a variable holding the sum of individuals belonging to the studied subgroup. The “ID” variable accepts string and numerical formats, while the other variables only accept numerical formats (including float/double). The tab-delimited file may not contain missing values; zeros should replace the missing values. Loading files that contain missing values will always lead to biased output. Exported files may contain more than the required five variables as long as these variables are declared as having no function during analysis.
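The file requirements above (a header row, tab separation, five variables, zeros instead of missing values) can be met with a few lines of Python. The sketch below uses only the standard `csv` module; the column names in `FIELDS` are hypothetical, since EquiPop maps columns to functions interactively rather than by name.

```python
import csv
import io

# Hypothetical column names; any names work because the mapping between
# columns and functions is made in the EquiPop interface.
FIELDS = ["ID", "X", "Y", "CountAll", "CountGroup"]


def write_equipop_input(rows, handle):
    """Write rows (dicts keyed by FIELDS) as tab-separated text.

    The first line holds the variable names; missing numeric values
    (None or absent keys) are replaced with zero.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    for row in rows:
        values = [row["ID"]]
        values += [row.get(name) or 0 for name in FIELDS[1:]]
        writer.writerow(values)


# Usage: write to an in-memory buffer; a real file opened with
# open(path, "w", newline="") works the same way.
buffer = io.StringIO()
write_equipop_input(
    [{"ID": "a1", "X": 1250, "Y": 5650, "CountAll": 5, "CountGroup": None}],
    buffer,
)
```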

5. Running EquiPop

To import a file, click “File/Open…” to open the “Open File” window. Click the “Browse” button, select the folder and file to import, and click “Open”. As indicated in the right section of Figure 3, EquiPop by default accepts tab-separated .txt and .dat (ASCII) files, but any file suffix works as long as the content is saved as tab-separated ASCII text. The EquiPop template format .json opens a predefined file with predefined settings. EquiPop template files can be saved from the “File” menu.

When a file has been selected, the EquiPop interface looks as in the left “Open File” section

of Figure 3. Variable names that are included in the first row of the imported file are listed in

the “Column” list. In the “Field” list, functions needed to run EquiPop are listed. Functions

can be dragged from the “Field” list and dropped onto variables in the “Column” list. The

association between function and variables is confirmed in the “Mapped field” list. In case the

imported file contains more than the five variables needed for running EquiPop, the function

“None” needs to be dropped on remaining variables.

At the bottom of the “Open File” interface, users may enter a rectification-unit value. By

default this textbox is empty but in cases where the rectification unit is known to the user, the

value may be entered. If left empty, EquiPop will search the imported data and set the

rectification value automatically. It should be noted that setting the rectification value incorrectly will bias the computation output (see footnote 5). Clicking “OK” accepts the file importation settings and opens the main EquiPop interface.

Figure 3 illustrates windows used for importation of files for analysis.

After importation of an analysis file, k-levels need to be set, output-variables to be selected

and decay mode and decay parameter determined before the computations can start. First, the

k-levels need to be set. In Figure 4 requested settings are shown in detail where sections 1a

and 1b illustrate how k-levels are added to and deleted from the running-order list. Requested

value is typed in the top left textbox and added to the running-order list by clicking “Add”

(1a.). By selecting value in the running-order list and by clicking “Delete” the value can be

removed (1b). The running-order list can contain multiple values. Theoretically, there is

neither a maximum count of k-values nor a maximum k-value that can be entered. However,

for each k-level additional sets of output-variables will be created during runtime making

larger datasets with multiple k-levels challenging for some computers. Similarly, very large k-

values mean that very large neighbourhoods need to be searched which in turn will increase

computation time.

If accepting default settings, EquiPop is ready to run at this point. Click the “Run analysis”

button and the analysis will start. During runtime, an approximation of remaining time for

computation will be illustrated by the progress-bar (2 in Figure 4). The progress-bar indicates

time by increasing the green part of the progress-bar until computation is ready. During

runtime it is possible to load and start the next rounds of analysis. Analyses not yet started

will show up in a queue.

5 Setting the value manually reduces computation times marginally. Practical uses include setting finer rectification units than those generated automatically, which is desirable in certain comparative frameworks.

Under the running-order list, four checkboxes (checked by default) enable the user to determine which output-variables to save. For each k-level defined by the user, every checked output-variable will be reported for the corresponding k. “Include distance” reports the Euclidean distance from each location i to the location j where the user-defined k-value was reached. “Include count all” reports the factual k at every user-defined k. This seemingly odd variable is useful in aggregated datasets. For instance, if the original dataset consists of individuals whilst the file used is aggregated to block mid-points, the factual k-value is often a bit greater than the requested user-defined value. To exemplify, if the running count is < k before adding an additional block but becomes > k after adding the next nearest block, the reported value will be based on the count after adding the next nearest block. Similarly, the two remaining variables, “(Include) count group” and “(Include) ratio”, are based on the factual k-values. “Count group” reports the count of members belonging to the treatment population for each k and “ratio” reports the quotient of “count group” over “count all”. When larger datasets with multiple k-levels are studied, either “count all”, “count group” or “ratio” can be excluded to reduce file size. The missing variable can easily be calculated in retrospect.
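Since ratio = count group / count all, any one of the three variables can be recovered from the other two. A small Python sketch of that arithmetic follows; the function name is hypothetical, and rounding back to whole individuals assumes non-decayed integer counts.

```python
def complete_counts(count_all=None, count_group=None, ratio=None):
    """Return (count_all, count_group, ratio) given any two of the three."""
    if ratio is None:
        ratio = count_group / count_all if count_all else 0.0
    elif count_group is None:
        # Counts are whole individuals, so round away floating-point noise.
        count_group = round(ratio * count_all)
    elif count_all is None:
        if ratio == 0:
            # With no group members the total cannot be inferred.
            raise ValueError("count_all cannot be recovered when ratio is 0")
        count_all = round(count_group / ratio)
    return count_all, count_group, ratio
```

Note the one case that cannot be reconstructed: if “count all” is the excluded variable and the ratio is zero, the total is not recoverable from the remaining two.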

By default EquiPop runs without distance decay. This means that all objects/individuals are assumed to contribute with equal weight to the reported output-variables regardless of their distance from i. If more distant objects/individuals are assumed to be of less importance, one of five distance decay models may be employed: exponential, exponential normal, exponential square-root, log-normal or power functions (see 3a. and 3b. in Figure 4). The properties of various decay models are discussed at length in the works of Wilson (1981), Fotheringham and O’Kelly (1989) and Reggiani et al. (2011). For specifications of decay parameters see for instance Östh et al. (2014a and 2014b).

Figure 4 illustrates how k-levels are defined (1a. and 1b), runtime progression (2) and the specification of

distance decay models.
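As an illustration of how decay weighting changes a count, the Python sketch below uses the functional forms commonly given for these five models in the spatial-interaction literature. The exact parameterisation used by EquiPop is not stated here, so the formulas, the function names and the treatment of zero distance (weight 1) should all be read as assumptions.

```python
import math


def decay_weight(model, distance, beta):
    """Weight given to an individual at a given distance (assumed forms)."""
    if distance == 0:
        return 1.0  # assumption: individuals at the origin count fully
    if model == "exponential":
        return math.exp(-beta * distance)
    if model == "exponential_normal":
        return math.exp(-beta * distance ** 2)
    if model == "exponential_square_root":
        return math.exp(-beta * math.sqrt(distance))
    if model == "log_normal":
        return math.exp(-beta * math.log(distance) ** 2)
    if model == "power":
        return distance ** -beta
    raise ValueError(f"unknown decay model: {model}")


def decayed_count(neighbours, model, beta):
    """Sum counts weighted by distance decay.

    neighbours: iterable of (distance, count) pairs already limited to the
    units needed to reach the requested k.
    """
    return sum(count * decay_weight(model, d, beta) for d, count in neighbours)
```

With these forms the power and log-normal weights can exceed or fall far below 1 for distances under one unit, so the choice of distance unit matters when selecting the decay parameter.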

In the “processing status” section (2 in Figure 4) a completed analysis is identified not only by the progression of the green progress-bar but also by the file name, which turns into an HTML link through which the user can download a zip-file containing the output. Clicking the link will cause the default web browser to open and transfer the output from EquiPop to the computer’s download folder. It is important to note that the web browser is used as a service provider and that no data is transferred externally. That said, during installation the user may choose to install the computational parts of EquiPop and the GUI separately. If the computational parts are installed on another computer, data will be transferred over the intranet or Internet.

In the left section of Figure 5 the arrow points at a “downloaded” zip-file ready for decompression. In the right section of Figure 5, the two files contained in the zip-file are shown; 1a shows a meta-information file and 1b the contents of the meta-information file. It should be noted that the contents of the meta-information file, describing the settings and file use, are identical to the information shown in the mouse-over message under section 2 in Figure 4. Label 2 in Figure 5 points to the file containing the analysis output. The output file always has the same name as the input file.

Figure 5 illustrates how a zipped folder containing metadata and EquiPop output is saved/downloaded (left) and what the zipped folder contains (right).

6. Handling EquiPop output

EquiPop output is arranged as tab-separated ASCII. This format ensures that output can be opened in many software packages. The output variables can be categorized into three main groups: variables always created, variables created for each k if checked during setup, and variables created if checked and distance decay is specified. Table 1 lists the variables from an EquiPop run with k-levels 25 and 50. In the Always category, the first five variables are identical to the five input variables, but renamed after their function. The four remaining variables in the Always category form a special case. Results are saved to these four variables in two situations: when the highest k-value has been reached and EquiPop moves on to the next location i to search for k-nearest neighbours, or when the requested k-level cannot be reached within the four million next nearest gridded units. The latter occurs when the k-levels are too large (for example greater than the sum of individuals in the population) and/or when the spatial distribution of the studied population means that some individuals are located far from others. In order not to be caught in an (almost) eternal search loop, EquiPop terminates the search for the k-nearest neighbours from any location i if the requested k has not been reached when four million units have been searched. Before moving to the next unit, the (maximum) count of individuals in the population is saved to the variable “SumCountAll”, the sum of subgroup members is saved to “SumCountGroup”, and the ratio of “SumCountGroup” over “SumCountAll” is saved to the “Ratio” variable. Finally, the “MaxDistance” variable describes the distance from unit i at which the last neighbour was counted before terminating.

If checked, four variables are added for each k-level entered by the user. The four variables are “IntervalSumCountAll_x”, “IntervalSumCountGroup_x”, “IntervalRatio_x” and “IntervalDistance_x”, where x represents the user-entered k-value. Variable “IntervalSumCountAll_x” holds the factual count of individuals needed to reach k-level “x” and “IntervalSumCountGroup_x” holds the equivalent count of treatment group members. In “IntervalRatio_x” the quotient of “IntervalSumCountGroup_x” over “IntervalSumCountAll_x” is calculated and saved. Finally, the “IntervalDistance_x” variable holds the Euclidean distance between origin location i and the unit where the k=x nearest neighbour was encountered.

The corresponding decayed variables make use of the same k-values as the non-decayed variables (see footnote 6). What is different about these variables is that encountered individuals are given less weight (according to the decay specification) as distance increases. This means that decayed count variables by necessity have smaller values than non-decayed ones.

Table 1 Variables exported in EquiPop output files. Column “Always” lists variables always being exported,

column “if checked” lists variables being exported if variables are checked during setup. Variables “If checked

plus decay is activated” are exported if variables are checked during setup and distance decay settings are used.

Variables                        Always   If checked   If checked plus decay is activated
Id                               X        -            -
EastWest                         X        -            -
NorthSouth                       X        -            -
CountAllLocal                    X        -            -
CountGroupLocal                  X        -            -
SumCountAll                      X        -            -
SumCountGroup                    X        -            -
Ratio                            X        -            -
MaxDistance                      X        -            -
IntervalSumCountAll_25           -        X            -
IntervalSumCountGroup_25         -        X            -
IntervalRatio_25                 -        X            -
IntervalSumCountAllDecay_25      -        -            X
IntervalSumCountGroupDecay_25    -        -            X
IntervalRatioDecay_25            -        -            X
IntervalDistance_25              -        X            -
IntervalSumCountAll_50           -        X            -
IntervalSumCountGroup_50         -        X            -
IntervalRatio_50                 -        X            -
IntervalSumCountAllDecay_50      -        -            X
IntervalSumCountGroupDecay_50    -        -            X
IntervalRatioDecay_50            -        -            X
IntervalDistance_50              -        X            -

6 This is also the reason why no specific distance decay distance variable is available.

Due to the size of the output files, third-party software will be needed for the analysis of the output. EquiPop output can be imported into spreadsheet software such as Excel as long as the file-import format is changed from spreadsheet to text. The importation wizard will present the user with different alternatives. By choosing “delimited” rather than “fixed width” as the importation method, the tab-separated order of the EquiPop output file will be used to parse values to cells in the spreadsheet. Importation into statistical software such as SPSS is conducted in a similar fashion (see footnote 7). Many GIS packages can transform the EquiPop output files to shape-files directly in the software. For ArcGIS, the file suffix must be changed to .txt to be recognized.
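Because the output is plain tab-separated text with a header row, it can also be read directly in a scripting language. The Python sketch below uses the standard `csv` module; the function names are hypothetical, the variable names follow Table 1, and every column except “Id” is assumed to be numeric with a dot as decimal separator.

```python
import csv


def read_equipop_output(handle):
    """Parse tab-separated EquiPop output into a list of dicts.

    "Id" is kept as text (it may be a string); all other columns are
    converted to float.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for record in reader:
        row = {}
        for name, value in record.items():
            row[name] = value if name == "Id" else float(value)
        rows.append(row)
    return rows


def mean_ratio(rows, k):
    """Average of IntervalRatio_k over all locations, as a quick check."""
    name = f"IntervalRatio_{k}"
    return sum(row[name] for row in rows) / len(rows)
```

A file on disk would be passed as `open(path, newline="")`; reading row by row in this way also avoids the row limits of spreadsheet software.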

7. Two short examples: using EquiPop from a slightly different angle

EquiPop is designed to calculate ratios, i.e. to find out how many members of any subpopulation x can be found among the k-nearest neighbours of any origin i. This means that all input data is arranged as counts, where individuals are listed as either belonging or not belonging to the studied subgroup. Below, two slightly different analyses are conducted. First, historical epidemiological data are used to find spatial concentrations of incidences in terms of the distance needed to encounter 100 cases. Second, by tweaking the input data, EquiPop can be used to calculate mean values. In the second example, the average age of the 6 400 nearest neighbours in Sweden in 2010 is analysed. The purpose of these two examples is to inspire users to set up analyses also where some data is missing (as in the first example) or where results other than ratios are preferred (as in the second example).

Example 1: John Snow was a pioneer in mapping epidemiological incidences. In 1854, Snow used a map to show that cholera deaths in Soho, London, were clustered around a certain water pump (Johnson, 2006). In this example we use EquiPop to calculate the distance needed to reach a k-level of 100 deaths from any location where a cholera-related death was observed. The results are shown in Figure 6. Map B shows that for almost 20% of the addresses hit by the epidemic, at least 100 fatalities were encountered within 20 meters, and for almost 50% within 30 meters (deciles were used to categorize the distances). The spatial relationship between pumps and deaths is illustrated in map A, where yellow markers indicate pumps, dark dots indicate addresses where cholera deaths were encountered, and larger green dots indicate the locations of the EquiPop-gridded units used in the analysis (made larger to be visible).
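The distance measure used in Example 1 can be sketched in a few lines of Python. This is a brute-force illustration with invented toy data and a small k, not the EquiPop implementation; the function name `distance_to_k` and the use of `statistics.quantiles` for the decile classes are choices made for the example.

```python
import bisect
import math
import statistics


def distance_to_k(origin, cells, k):
    """Distance from `origin` at which the cumulative count first reaches k.

    `cells` holds (x, y, count) tuples; returns None if k is never reached.
    """
    ox, oy = origin
    ordered = sorted((math.hypot(x - ox, y - oy), n) for x, y, n in cells)
    reached = 0
    for dist, n in ordered:
        reached += n
        if reached >= k:
            return dist
    return None


# Invented toy data: (x, y, deaths at that address), coordinates in meters
cells = [(0, 0, 2), (3, 4, 1), (6, 8, 3), (30, 40, 1)]

# One distance per address where a death was observed
distances = [distance_to_k((x, y), cells, k=3) for x, y, _ in cells]
print(distances)  # [5.0, 5.0, 0.0, 40.0]

# Nine cut points split the distances into ten classes (deciles)
cuts = statistics.quantiles(distances, n=10)
classes = [bisect.bisect_left(cuts, d) for d in distances]
```

Short distances indicate dense clusters of cases, which is why the measure works even though no ratio is of interest here.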

The data needed for this analysis was downloaded from Robin’s blog (2014). Robin has

kindly digitized incidences and water pumps from Snow’s map and projected the material to

OSGB 1936 / British National Grid, and made it available for public use.

7 The importation wizard in SPSS sometimes lists the ratio-variables as string-variables rather than numerical variables; changing the importation format to “dot” solves this problem.

Though the mapped material is made available as a shape-file, it is not automatically ready for use in EquiPop. Using ArcMap as the preparation tool, the preparation is conducted as follows. First, two fields holding doubles are added (open the attribute table and choose add field, choose double as type and name the two fields X and Y). Secondly, right-click on the new field headers and choose “calculate geometry”. This way, the X and Y coordinates are made available for easy importation and further preparation in software such as Excel or SPSS (note that the data is kept in the .dbf file). Further preparation means that the user must create five variables in order to make the data work in EquiPop (see above). First, the “ID” variable has no particular use in this example and can therefore be replaced with a value or phrase. Second, a suitable grid unit for the coordinates is the meter: round or truncate the X and Y variables to the nearest meter to create the X-grid and Y-grid variables. Third, use the incidence counts as both “CountSubGroup” and “CountAll”. This will cause all ratios to take the value one. However, since distance rather than ratio is of interest, all other output variables are unimportant.
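For users who prefer scripting over Excel or SPSS, the variable preparation described for Example 1 could be sketched as follows. The column order, the header names as written, the tab-separated layout of the input file and the summing of counts that fall in the same 1 m cell are all assumptions made for illustration; the input format specified earlier in this paper takes precedence. The coordinates are invented.

```python
import csv
import io
from collections import defaultdict

# Assumed column names and order (hypothetical; check the EquiPop input spec)
FIELDS = ["ID", "X-grid", "Y-grid", "CountSubGroup", "CountAll"]


def prepare_rows(records, placeholder_id="snow"):
    """Turn (x, y, count) records into rows holding the five variables.

    Coordinates are rounded to the nearest meter. Counts that end up in the
    same cell are summed, which is an assumption of this sketch.
    """
    per_cell = defaultdict(int)
    for x, y, count in records:
        per_cell[(round(x), round(y))] += count
    rows = []
    for (x_grid, y_grid), count in sorted(per_cell.items()):
        rows.append({
            "ID": placeholder_id,      # no particular use in this example
            "X-grid": x_grid,
            "Y-grid": y_grid,
            "CountSubGroup": count,    # same value in both count fields,
            "CountAll": count,         # so every ratio equals one
        })
    return rows


def write_tab_separated(rows, handle):
    """Write the rows as tab-separated text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Invented coordinates (meters) and death counts
records = [(529308.74, 181031.35, 3), (529308.90, 181031.20, 1),
           (529312.20, 181025.60, 2)]
rows = prepare_rows(records)
buffer = io.StringIO()
write_tab_separated(rows, buffer)
print(buffer.getvalue())
```

Note that Python's `round` rounds exact halves to the nearest even integer; truncation with `int` is the alternative mentioned in the text.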

Figure 6. Cholera deaths in London, 1854. The left map shows cholera deaths (black dots); the EquiPop-prepared, 1 m gridded cholera deaths are shown in green (dots enlarged to be identifiable under the dark dots). Larger yellow symbols show the locations of water pumps. The right map shows the distance (in meters) to the nearest 100 cholera deaths from any location of a cholera-related death. Distances are arranged by deciles; cold colours represent short distances while warm colours represent longer distances.

Example 2: The preparation of the average-age dataset is similar to the preparation of regular datasets, with one exception. In the average-age dataset, the subgroup (CountSubGroup) value communicates the sum of the years lived by the local population, expressed as a share of the maximum life-span, rather than the count of subgroup individuals. In Sweden in 2010, the maximum life-span (maximum age) was 111 years. This means that the subgroup value can be defined as the sum of all years lived by the local residents divided by the maximum age in Sweden. Since EquiPop accepts decimal values in the variables, the variable should not be rounded to the nearest integer.
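The arithmetic behind this trick can be checked with a small Python sketch. The ages and the two locations are invented, and the function name `to_equipop_counts` is hypothetical; only the maximum age of 111 years comes from the text. Dividing by the maximum age keeps each person's contribution to the subgroup value at or below one, so the resulting ratio stays between zero and one and can be scaled back to years by multiplying by the maximum age.

```python
MAX_AGE = 111  # maximum age in Sweden in 2010, as stated in the text


def to_equipop_counts(ages):
    """Return (CountSubGroup, CountAll) for one location.

    CountSubGroup is the sum of ages divided by the maximum age and is
    deliberately left unrounded.
    """
    return sum(ages) / MAX_AGE, len(ages)


# Two invented locations that together form the k = 5 nearest neighbours
sub_a, all_a = to_equipop_counts([30, 40, 50])
sub_b, all_b = to_equipop_counts([70, 80])

# The ratio as it would be computed over the pooled neighbours
ratio = (sub_a + sub_b) / (all_a + all_b)

# Multiplying by the maximum age recovers the mean age of the neighbours
mean_age = ratio * MAX_AGE
print(round(mean_age, 6))  # 54.0
```

The same scaling idea should work for any non-negative variable with a known maximum, although only the age case is demonstrated in this paper.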

After running the dataset in EquiPop, the ratio-variables should be multiplied by the maximum age to produce the average age among the k-nearest neighbours of any location i. In Figure 7, the left-side maps illustrate the average ages in Sweden in 2010, using 2% quantiles. The right-side maps illustrate the average ages using fixed age intervals (0.5 years per colour). The 2x3 top maps magnify average-age patterns in the three major metropolitan areas. The average-age maps are interesting from two perspectives. First, from a computational perspective, the analysis makes use of almost 800 000 unique, populated locations and millions of unpopulated spatial units on a grid of 100 m x 100 m. The computation time on a workstation with 28 GB RAM is less than 10 minutes. Secondly, from an age-distribution perspective, the results show that average age varies considerably between parts of the country. It is noteworthy that younger individuals are clustered in ‘islands’ around the major city areas, while the populations of rural areas are considerably older.

Figure 7. Average age among the k = 6 400 nearest neighbours in Sweden, 2010. The left-side maps illustrate the average ages using 2% quantiles. The right-side maps illustrate the average ages using fixed age intervals (0.5 years per colour).

8. Conclusion

Using a k-nearest neighbour approach to denote individual-centred neighbourhoods can in some analyses be more accurate than using administrative areas or radius-based areas. With the EquiPop software application, k-nearest neighbour computations can be conducted with greater ease, also in datasets containing millions of populated locations. This article has demonstrated how EquiPop is installed, how data is prepared, how analyses are conducted and how output is used. The demonstration shows that EquiPop is capable of computing the share of the studied population belonging to any studied subgroup at any specific k-value and for any location i. In addition, the software includes settings or methods for calculating mean values and distances, for using several k-values at the same time, and for analysing different distance decay functions.

ACKNOWLEDGMENT: The author gratefully acknowledges financial support from VR project 2012-5509 “Stadens segregationsmönster: En internationell jämförande studie av boendesegregationens mönster, drivkrafter och effekter”.

References

Andersson, R. & Musterd, S., (2010), What scale matters? Exploring the relationships between

individuals’ social position, neighbourhood context and the scale of neighbourhood,

Geografiska Annaler: Series B, Human Geography 92 (1): 23–43.

Chaix, B., Merlo, J., Subramanian, S. V., Lynch, J. and Chauvin, P. (2005): 'Comparison of a Spatial

Perspective with the Multilevel Analytical Approach in Neighborhood Studies: The Case of

Mental and Behavioral Disorders due to Psychoactive Substance Use in Malmö, Sweden, 2001'.

American Journal of Epidemiology, vol. 162, no. 2, pp. 171-182.

Clark A. William, Malmberg Bo & Östh John, (PAA 2014), Segregation and De-segregation in

Metropolitan Contexts: Los Angeles as a paradigm for our changing ethnic world.

Davies, T. M. and Hazelton, M. L. (2010): 'Adaptive kernel estimation of spatial relative risk'. Statistics

in Medicine, vol. 29, no. 23, pp. 2423-2437.

Fotheringham, A.S. and M.E. O’Kelly (1989), Spatial Interaction Models: Formulations and

Applications, Dordrecht: Kluwer Academic.

Galster George, (2001), On the Nature of Neighbourhood, Urban Studies, Vol. 38, No. 12, 2111–2124

Johnson, Steven (2006), The Ghost Map: The Story of London's Most Terrifying Epidemic – and How it

Changed Science, Cities and the Modern World. Riverhead Books. ISBN 1-59448-925-4

Johnston, R. J., Jones, K., Burgess, S., Propper, C., Sarker, R., & Bolster, A. (2004) Scale, factor

analyses, and neighborhood effects, Geographical Analysis 36(4): 350–369.

Johnston, R. J., Propper, C., Burgess, S., Sarker, R., Bolster, A. & Jones, K., (2005), Spatial scale and the

neighbourhood effect: multinomial models of voting at two recent British general elections,

British Journal of Political Science 35 (3): 487–514

Lee, T., (1968), Urban Neighbourhood as a Socio-Spatial Schema, Human Relations 1968 21: 241,

DOI: 10.1177/001872676802100303

Mair C., Diez Roux, A V. & Galea S., (2008), Are neighbourhood characteristics associated with

depressive symptoms? A review of evidence Journal of Epidemiology and Community Health;

62:940–946. doi:10.1136/jech.2007.066605

Openshaw, S. (1984). The modifiable areal unit problem, CATMOG (Concepts and Techniques in

Modern Geography). Geo Abstracts:40.

Östh John, Clark A. William. & Malmberg Bo, (forthcoming in Geographical Analysis), Measuring the

scale of segregation using k-nearest neighbor aggregates

Östh John, Lyhagen, Johan and Reggiani Aura, (2014b), Half-life and Spatial Interaction Models: Job

Accessibility Analysis in Sweden, Forthcoming in European Journal of Transport and

Infrastructure Research

(online estimator: http://files.kultgeog.uu.se/files/spatialanalysis/halflife.html)

Östh, John, Malmberg, Bo and Andersson, Eva, (2014c) Analysing segregation with individualized

neighbourhoods defined by population size, in C. D. Lloyd, I. Shuttleworth and D. Wong

(Eds.) Social-Spatial Segregation: Concepts, Processes and Outcomes, Policy Press.

Östh, John, Reggiani, Aura and Galiazzo, Giacomo (2014a) Conventional and New Approaches for the

Estimation of Distance Decay in Potential Accessibility Models: Comparative analyses, in

Condeço Ana, Reggiani Aura & Gutiérrez Javier (Eds.) Accessibility and spatial interaction,

Edward Elgar (EE).

Perry, C., (1929/1998), The Neighbourhood Unit (1929) Reprinted Routledge/Thoemmes, London,

1998

Pickett, K. E., & Pearl, M. (2001), Multilevel analyses of neighbourhood socioeconomic context and

health outcomes: a critical review, J Epidemiol Community Health 2001;55:111–122

Reardon, S. F., S. A. Matthews, D. O'Sullivan, B. A. Lee, G. Firebaugh, C. R. Farrell, and K. Bischoff.

(2008). The geographic scale of metropolitan racial segregation. Demography 45 (3):489-514.

Reggiani, A., P. Bucci, and G. Russo, (2011), ‘Accessibility and impedance forms: empirical

applications to the German commuting networks’, International Regional Science Review 34

(2), pp. 230-252.

Robin’s blog (2014), file: SnowGIS_SHP.zip,

URL: http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/

Sampson R. J., Morenoff J. D. & Gannon-Rowley T., (2002), Assessing “Neighborhood Effects”: Social

Processes and New Directions in Research, Annual Review of Sociology, Vol. 28, pp. 443-478

Sellström E. & Bremberg S., (2006), The significance of neighbourhood context to child and

adolescent health and well-being: A systematic review of multilevel studies, Scandinavian

Journal of Public Health, 34: 544–554

Wilson, A. G. (1981), Geography and the Environment: Systems Analytical Methods, Chichester: John

Wiley & Sons.

Wong, D. (2004) Comparing traditional and spatial segregation measures: a spatial scale perspective.

Urban Geography 25, 66-82