GEOGRAPHIC INFORMATION SYSTEMS FOR SPATIAL DISEASE CLUSTER

DETECTION, SPATIO-TEMPORAL DISEASE MAPPING, AND HEALTH SERVICE

PLANNING

by

PING YIN

(Under the Direction of Lan Mu and Marguerite Madden)

ABSTRACT

Geographic information systems (GIS) are increasingly recognized as an effective and

efficient tool to deal with geographic questions in health studies. The overarching research

question of this dissertation asks how GIS and spatial analysis can be used to facilitate public

health studies. Three aspects of health studies are included: spatial disease cluster detection,

spatio-temporal disease mapping, and health service planning. New methods or models are

proposed and implemented with GIS in this dissertation to address an important problem in each

of the three aspects.

First, a redesigned spatial scan statistic (RSScan) is proposed to quickly detect disease

clusters in arbitrary shapes. The experimental results indicate that the improved RSScan method

generally has higher power and accuracy than three existing methods for detecting the clusters in

irregular shapes. Second, to explore the spatio-temporal patterns of lung cancer incidence risks in

Georgia between 2000 and 2007, a total of seven hierarchical Bayesian models are developed

and compared at the census tract level using a two-year time period as the temporal unit. The

study shows the northwest region of Georgia has stably elevated lung cancer incidence risks for

all the population groups by race and sex. It also shows that there are strong inverse relationships

between socioeconomic status and lung cancer incidence risk in males and weak inverse

relationships in females in Georgia. Finally, two transportation models that address the modular

capacitated maximal covering location problem (MCMCLP) are proposed and used to optimally

site ambulances for Emergency Medical Services (EMS) Region 10 in Georgia. As a component

of the allocation-location problems for health service planning, spatial demand representation is

discussed and three representation approaches are empirically compared in both problem

complexity and representation error.

Results of this dissertation contribute to the advancement of geospatial analysis in disease

surveillance and health service decision making. Future research could include using GIS and

spatial analysis to improve the accuracy of detected clusters, explore the environmental factors

related to the spatio-temporal patterns of lung cancer incidence risks in Georgia, and integrate

population movement in health service planning.

INDEX WORDS: GIS, Public health, Cluster detection, Disease mapping, Health planning

B.E., Tsinghua University, China, 2002

M.E., Tsinghua University, China, 2005

A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial

Fulfillment of the Requirements for the Degree

DOCTOR OF PHILOSOPHY

ATHENS, GEORGIA

2012

© 2012

Ping Yin

All Rights Reserved

Major Professor: Lan Mu, Marguerite Madden

Committee: Xiaobai Yao, Thomas Jordan, John Vena

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2012

ACKNOWLEDGEMENTS

My five years of Ph.D. study in the Department of Geography at the University of Georgia (UGA) have been a great experience. I am grateful to everyone who supported and helped me to finish my dissertation research. First and foremost, my deepest gratitude goes to my major professors, Dr. Lan Mu and Dr. Marguerite Madden, for their excellent guidance and full support. Without their endless input, timely feedback, and great inspiration, I could not have finished my research. I greatly appreciate their dedication and generous help with my research and other academic activities.

I would like to thank Dr. John Vena in the Department of Epidemiology and Biostatistics at UGA for providing the health data for my research. His invaluable advice from an epidemiological perspective greatly improved my research.

I would also like to acknowledge Dr. Xiaobai Yao and Dr. Thomas Jordan for their insightful advice and suggestions on this research and other academic areas.

I want to thank Dr. Andrew Herod, who made me realize how important correct citations are in academic writing.

The institutions that sponsored my research deserve special notice: the UGA research foundation and the UGA graduate school, through the dean’s award in social sciences and the dissertation completion award.

Finally, I deeply thank my parents and my wife, Jing. It is their unconditional love and endless patience that encouraged me to finish my dissertation.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1 INTRODUCTION AND LITERATURE REVIEW
    1.1 Background
    1.2 Research Objectives
    1.3 Literature Review
    1.4 Dissertation Structure
    References

2 DETECTING DISEASE CLUSTERS IN ARBITRARY SHAPES WITH A REDESIGNED SPATIAL SCAN STATISTIC
    Abstract
    2.1 Introduction
    2.2 Existing Methods for Detection of Disease Clusters
    2.3 Redesigned Spatial Scan Method (RSScan)
    2.4 Performance Evaluation
    2.5 Application: Georgia Lung Cancer, 1998-2005
    2.6 Discussion and Conclusions
    References

3 HIERARCHICAL BAYESIAN MODELING OF THE SPATIO-TEMPORAL PATTERNS OF LUNG CANCER INCIDENCE RISKS IN GEORGIA, 2000-2007
    Abstract
    3.1 Introduction
    3.2 Study Area and Data
    3.3 Methods
    3.4 Results
    3.5 Discussions
    3.6 Conclusions
    References

4 MODULAR CAPACITATED MAXIMAL COVERING LOCATION PROBLEM FOR THE OPTIMAL SITING OF EMERGENCY VEHICLES
    Abstract
    4.1 Introduction
    4.2 Modular Capacitated Maximal Covering Location Problem (MCMCLP)
    4.3 Spatial Demand Representation
    4.4 Applications: Optimal Siting of Ambulances
    4.5 Discussion
    4.6 Conclusion
    References

5 AN EMPIRICAL COMPARISON OF SPATIAL DEMAND REPRESENTATIONS IN MAXIMAL COVERAGE MODELING
    Abstract
    5.1 Introduction
    5.2 Representation Error in Covering Location Modeling
    5.3 The MCLP Model and Problem Complexity
    5.4 Service Area Spatial Demand Representation
    5.5 Experimental Design
    5.6 Results and Discussions
    5.7 Conclusions
    References

6 CONCLUSIONS
    6.1 Summary and Conclusions
    6.2 Future Research
    References

APPENDICES

I LIST OF ACRONYMS

LIST OF TABLES

Table 2.1: Test statistics and search strategies of four spatial scan methods
Table 2.2: Information of simulated cluster models
Table 2.3: Estimated power of four spatial scan methods (significance level = 0.05)
Table 2.4: Contingency table for detected cluster estimates and true clusters
Table 2.5: KIAs between the most likely clusters and true clusters for four spatial scan methods
Table 2.6: Average Type I error of four spatial scan methods
Table 3.1: Total number of cases of individuals over 20 years old and the percentage of included cases in the analyses by sex and race
Table 3.2: Variables incorporated in the modified Darden-Kamel Composite Index
Table 3.3: Components of logarithms of RRs in the seven Bayesian spatio-temporal models
Table 3.4: DICs of the seven models
Table 3.5: Posterior median (95% CI) of the shared temporal components and differential temporal components
Table 3.6: Posterior median (95% CI) of the RRs for SES quintile
Table 3.7: Correlations between the posterior median RRs using model 2 with two different types of hyperpriors
Table 4.1: Information for roads
Table 4.2: Count of the facilities with varied numbers of ambulances
Table 5.1: Numbers of demand objects in 45 SASDRs
Table 5.2: Numbers of demand objects in all demand representations for comparison
Table 5.3: Minimum numbers of facilities reported by models for covering 100% demand
Table 5.4: Cost and optimality errors between grid-point-based demand representations and SASDRs
Table 5.5: Cost and optimality errors between grid-rectangle-based demand representations and SASDRs

LIST OF FIGURES

Figure 1.1: GIS functions and GIS applications in public health
Figure 1.2: Logical structure of the dissertation research
Figure 2.1: Graph-based representation of a region map
Figure 2.2: Population 2000 by counties in GA in the United States
Figure 2.3: Locations of simulated clusters: (a) circular shape (b) linear shape (c) trifurcate shape
Figure 2.4: Estimated average power of four spatial scan methods
Figure 2.5: Average KIAs of four spatial scan methods
Figure 2.6: SIRs and the detected cluster of lung cancer incidence in GA, 1998-2005
Figure 3.1: Population density by census tract and the 10 most populous cities in Georgia 2000
Figure 3.2: Quintile map of SES in Georgia 2000
Figure 3.3: Maps of crude standardized incidence ratios (SIRs) by race and sex during 2000-2001
Figure 3.4: Maps of the posterior median RRs for white males in each time period
Figure 3.5: Maps of the posterior median RRs for white females in each time period
Figure 3.6: Maps of the posterior median RRs for black males in each time period
Figure 3.7: Maps of the posterior median RRs for black females in each time period
Figure 3.8: Maps of elevated RR frequency by race and sex during 2000-2007
Figure 3.9: Maps of the posterior median of the shared spatial component and differential spatial components
Figure 4.1: Illustration of three demand types: unallocated demand (da and db), covered allocated demand (dc), and uncovered allocated demand (dd)
Figure 4.2: Example of the SASDR with circular facility service area (a) demand space U (the square) and two potential service areas S1 and S2 (the circles) (b) four demand objects in the SASDR result of demand space U partitioned by service areas S1 and S2
Figure 4.3: Population density of Georgia EMS Region 10 (study area) by census block group and existing ambulance facility locations
Figure 4.4: Road network in EMS Region 10 in GA
Figure 4.5: Eight-minute service areas (non-white polygons) of all potential ambulance facility sites (red points) based on the road network
Figure 4.6: SASDR result for the study area with demand (population) distribution
Figure 4.7: Results of the MCMCLP models siting 58 ambulances in 82 potential facility locations with w = 6 × 10^-8 (the facility location is rendered in the same color as its allocation area) (a) the MCMCLP-NFC model (b) the MCMCLP-FC model with 20 facilities
Figure 5.1: Examples of spatial demand representations with (a) census blocks or their centroids, and (b) rectangle grid or its centroids
Figure 5.2: Illustration of overlay operation A▲B: (a) set A and set B (b) the result from A▲B
Figure 5.3: The SASDR with circular facility service area: (a) demand space U and two potential service areas S1 and S2, (b) the partition of demand space U with service area S1, and (c) the partition of demand space U with both service areas S1 and S2
Figure 5.4: Three modes of potential facility sites: (a) regular grid points with spacing R, (b) centroids of census blocks, and (c) intersections of major roads
Figure 5.5: Examples of grid-point-based and grid-rectangle-based demand representations for comparison with SASDR
Figure 5.6: Relationship between Site-Service Index and demand object density in SASDR with circular service coverage
Figure 5.7: Percentages of covered demand reported by the MCLP models with 3 types of demand representations when the configuration of potential facility sites include: (a) 66 grid points, (b) 272 grid points (c) 66 block centroids, and (d) 272 block centroids

CHAPTER 1

INTRODUCTION AND LITERATURE REVIEW

1.1 Background

Because all fields evolve continually, the debate over the definitions and scopes of subfields such as “medical geography”, “health geography” and “spatial epidemiology” continues (Brown et al. 2010). However, it cannot be denied that researchers in health, geography, and other fields are paying more and more attention to the geographic component of health, i.e., the question of “where”. Where are populations at risk? Where are hotspot areas with elevated disease risks? Where can we intervene to eliminate or reduce disease risks? Where can we locate healthcare facilities to improve health services delivery? Geographic information systems (GIS), which were originally used within the formal discipline of geography, are increasingly recognized as an effective and efficient tool for dealing with these geographic questions in research and practice in epidemiology and public health (Rushton 2003, Najafabadi 2009, Nykiforuk and Flaman 2011, Cromley and McLafferty 2012).

In fact, more than 150 years ago, early public health professionals learned that maps could be used to explore patterns of diseases and relationships between diseases and risk factors. In 1840, Robert Cowan used a map to show the relationship between fever and overcrowding in Glasgow (Melnick 2002). The famous story of John Snow, one of the fathers of modern epidemiology, is often used in current textbooks on epidemiology, disease mapping, and GIS to illustrate one of the first uses of a map to identify a disease source (Melnick 2002, Koch 2005, Longley et al. 2005). In 1854, John Snow plotted a map showing the cholera deaths in the Soho district of London, by which he demonstrated the association between these deaths and contaminated water supplies from a public water pump in the center of the outbreak.

Since the development of the first real GIS, the Canada Geographic Information System, in the mid-1960s, the functions of GIS have expanded rapidly and improved greatly, building on advances in computer science, cartography, computational geometry, and spatial statistics. Cromley and McLafferty (2012) define GIS as computer-based systems for the integration and analysis of geographic data. They classify GIS functions into three broad categories based on what people want to do with spatial data: 1) spatial database management; 2) visualization and mapping; and 3) spatial analysis. In the past, GIS was regarded mainly as a technology in this sense. Nowadays, the term carries multiple labels, such as GIS software, GIS data, GIS community, and doing GIS (Longley et al. 2005). Goodchild (1992) coined the term “GIScience” to refer to the research field concerned with the fundamental principles and questions underlying the use of GIS as a technology.

Nykiforuk and Flaman (2011) reviewed GIS applications in public health and identified four content categories, listed in descending order of prevalence in the literature: disease surveillance, risk analysis, health access and planning, and community health profiling. Disease surveillance is the compilation and tracking of data on the incidence, prevalence, and spread of disease (Wall and Devine 2000). Cluster detection, disease mapping, and disease modeling are interrelated components of disease surveillance. Cluster detection is an analysis process that aims to identify hotspot areas with elevated disease risks. Disease mapping is used to understand the distribution of disease or disease risk in the past or present. Disease modeling extends disease mapping to identify factors associated with disease risks in order to predict the future spread of disease. These components of disease surveillance, which are important for disease prevention and control, can be conducted in spatial or spatio-temporal dimensions. Risk analysis includes some aspect(s) of risk – assessment, management, communication, or monitoring – relative to impacts on health (Nykiforuk and Flaman 2011). Health access and planning aims to evaluate and improve health services delivery. Community health profiling is the compilation and mapping of information regarding the health of a population in a community. These four categories overlap. For example, risk analyses could also be conducted in a disease mapping application.

Figure 1.1 shows GIS functions and GIS applications in public health based on the classifications of Cromley and McLafferty (2012) and Nykiforuk and Flaman (2011) discussed above. It is impossible to describe completely all GIS functions and how they can be used in public health studies, because the use of GIS functions is usually application-dependent and both GIS and health studies are continually evolving. Here, we briefly list only several aspects to show how GIS can greatly facilitate health studies: population estimation, data integration, exposure assessment, healthcare access evaluation, and communication.

(1) Population estimation

It is important for health studies to understand the distribution of a population at risk. Because of the economic and social processes that structure residential development, the age, sex, and race-ethnicity of the population are usually not uniform throughout the region of settlement (Cromley and McLafferty 2012). GIS makes it possible to view residential distributions in great detail. In addition to residence, GIS can help to model people’s activity in space and their migration processes to understand the exposures people have experienced, which is important for studies of diseases with a long latency period, such as cancers. When population data are not available for some regions or time periods, GIS can be used to interpolate or model the population from data available for other regions or time periods.
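As a minimal sketch of the temporal case, the snippet below linearly interpolates a region's population between two years with known counts. The function name `interpolate_population` and the counts in the usage note are illustrative assumptions, not data or methods from this dissertation; a GIS workflow would typically apply such an estimate separately to each areal unit.

```python
def interpolate_population(year, year0, pop0, year1, pop1):
    """Linearly interpolate a population count for `year`.

    Hypothetical helper for illustration only. Assumes counts are known
    for year0 and year1, with year0 < year1 and year0 <= year <= year1.
    """
    if not year0 < year1:
        raise ValueError("year0 must be earlier than year1")
    if not year0 <= year <= year1:
        raise ValueError("year must lie between the two known years")
    # Fraction of the interval between the known years that has elapsed.
    fraction = (year - year0) / (year1 - year0)
    return pop0 + fraction * (pop1 - pop0)
```

For example, with assumed counts of 1,000 residents in 2000 and 1,200 in 2010, `interpolate_population(2005, 2000, 1000, 2010, 1200)` returns 1100.0. Linear change is only one possible assumption; growth that is not steady would call for a different model.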

Figure 1.1. GIS functions and GIS applications in public health

GIS functions

Spatial database
• Store • Join • Query • Edit • Delete

Visualization and mapping
• Tables • Graphs • Maps • Statistics

Spatial analysis
• Measurement • Topological analysis • Network analysis • Surface analysis • Spatial statistics

Public health studies

Disease surveillance
• Cluster detection • Disease mapping • Disease modeling

Risk analysis
• Assessment • Management • Communication • Monitoring

Health access and planning
• Market segmentation • Client catchment areas • Market utilization • Location-allocation modeling

Community health profiling
• Mapping health and setting variables in a community
• Multilevel, ecological links between people and settings

(2) Data integration

The strong spatial data management capability of GIS makes it easy to integrate multiple geographic data sets on health outcomes and on environmental, socioeconomic, and behavioral factors based on geographic information (location). These spatial data may be collected by different local, state, or federal agencies, public and private, using different devices or technologies. Linking all of these data can give a more comprehensive picture of the context or setting of the disease of interest, which is essential for identifying relationships between diseases and all kinds of factors and for developing etiological hypotheses.

(3) Exposure assessment

Accurate estimation and mapping of exposures is clearly vital if valid inferences are to be

drawn either about the spatial distribution of risk factors, or about their geographic relationship

with health outcome (Elliott et al. 2000). Suitable measures, such as biomarkers, tend to be

costly and invasive. Therefore, especially for population-based research, it is common to

estimate exposure based on environmental monitoring data, such as air pollutant concentrations,

or using proxy measures of exposure, such as distance from source. These indirect methods can

be easily conducted in GIS using interpolation methods and measuring functions.
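One common interpolation method of this kind is inverse distance weighting (IDW). The sketch below estimates a pollutant concentration at an unmonitored location from nearby monitor readings. IDW is chosen here only as an illustration, since the text does not name a specific interpolation method; the function name `idw_estimate`, the planar coordinates, and the power parameter of 2 are all assumptions.

```python
from math import hypot


def idw_estimate(monitors, x, y, power=2.0):
    """Inverse-distance-weighted estimate at point (x, y).

    monitors: list of (mx, my, value) tuples, e.g. monitor coordinates
              and measured concentrations (hypothetical input layout).
    power:    distance-decay exponent; 2.0 is a conventional default.
    """
    if not monitors:
        raise ValueError("at least one monitor is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for mx, my, value in monitors:
        distance = hypot(x - mx, y - my)
        if distance == 0.0:
            # The point coincides with a monitor: use its reading directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

A point midway between two monitors reading 10.0 and 20.0 receives equal weights and therefore an estimate of 15.0. The same distance calculation can also serve directly as a proxy measure, such as distance from a pollution source.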

(4) Healthcare access evaluation

Evaluating the current status of health service delivery is important for health policy making and resource utilization. The network analysis functions in GIS provide convenient ways to calculate client catchment areas of healthcare facilities and the shortest distance from a population to healthcare facilities. Some measures of healthcare accessibility, such as the two-step floating catchment area method (2SFCA) for assessing the local availability of services in relation to population need (Luo and Wang 2003), can easily be implemented in GIS using join and sum functions.
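The two steps of 2SFCA can be sketched in a few lines. This is a simplified illustration rather than the implementation of Luo and Wang (2003): it assumes planar coordinates and a straight-line distance threshold, whereas a GIS implementation would normally use network travel time, and the function name and input layout are hypothetical.

```python
from math import hypot


def two_step_fca(providers, populations, threshold):
    """Return one accessibility score per population location.

    providers:   list of (x, y, supply) tuples, e.g. number of physicians
    populations: list of (x, y, demand) tuples, e.g. number of residents
    threshold:   catchment radius in the same units as the coordinates
                 (straight-line distance is an assumption of this sketch)
    """
    def within(ax, ay, bx, by):
        return hypot(ax - bx, ay - by) <= threshold

    # Step 1: for each provider, sum the demand inside its catchment
    # and compute the supply-to-demand ratio.
    ratios = []
    for px, py, supply in providers:
        demand = sum(d for x, y, d in populations if within(px, py, x, y))
        ratios.append(supply / demand if demand > 0 else 0.0)

    # Step 2: for each population location, sum the ratios of all
    # providers that fall inside its catchment.
    scores = []
    for x, y, _ in populations:
        score = sum(r for (px, py, _s), r in zip(providers, ratios)
                    if within(x, y, px, py))
        scores.append(score)
    return scores
```

With one provider of supply 2 at the origin, two population points of 100 people each within the threshold, and a third point far away, the two nearby points each score 2 / 200 = 0.01 and the distant point scores 0. Step 1 corresponds to a spatial join followed by a sum of demand, and step 2 to a second join followed by a sum of ratios.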

(5) Communication

Preparing and displaying maps of health information are among the most important functions of public health GIS (Cromley and McLafferty 2012). By portraying the results of analysis on a map, GIS technology gives communities an easily understandable visual picture of community health (Melnick 2002). Maps are recognized as one of the most important communication tools among researchers, decision makers, and the public. With the development of Internet GIS, health information can be quickly published through interactive web mapping to anyone with access to the Internet (Theseira 2002, Boulos 2003, Boulos 2005).

Based on the above examples of GIS applications in health, we can see that GIS can be

used as a natural and effective means to approach a variety of program, policy, and planning

issues in health promotion and public health (Nykiforuk and Flaman 2011).

1.2 Research Objectives

The overarching research question of this dissertation asks how GIS and spatial analysis can be used to facilitate public health studies. Understanding health status and then effectively and efficiently providing health care services are necessary to promote public health. Therefore, this research involves three aspects of health studies related to health surveillance and health service planning: spatial disease cluster detection, spatio-temporal disease mapping, and optimal siting of health facilities. The first two are both techniques used to describe the distribution of a disease. Spatial disease cluster detection aims to quickly identify hotspot areas with elevated risks. Usually, it requires only health outcome data and basic population data. It is very useful for health departments in maintaining surveillance of disease outbreaks. However, it cannot provide detailed information on the spatial patterns of disease risks within hotspot areas and other areas of interest. Spatio-temporal disease mapping can complement cluster detection analysis. It can provide the spatio-temporal patterns of disease risks across the whole study area and time period. These health patterns can be linked to all kinds of factors to develop etiological hypotheses. Knowing the patterns of disease risks is not the end. The goal of health studies is to prevent and control the spread of disease and promote public health. Given the patterns of disease risks obtained from disease mapping analyses, we can easily identify areas with high health service needs. Then, based on the spatial distribution of these needs, health services can be planned more effectively and efficiently.

This dissertation research includes three main objectives, each of which addresses an

important problem in the three aspects of health studies by developing new methods or models

that are implemented with GIS and spatial analysis. More specifically, these three objectives are:

(1) To develop a new method to detect disease clusters in arbitrary shapes with higher

statistical power and more accurate geographic boundaries;

(2) To develop hierarchical Bayesian models to explore the spatio-temporal patterns of

lung cancer incidence risks by race and sex in Georgia (2000-2007) at a fine spatio-temporal

scale;

(3) To develop a new location-allocation model to optimally site ambulances so that the

emergency medical services (EMS) can be delivered more effectively and efficiently.

In the study of the location-allocation model for health service planning, a sub-problem –

spatial demand representation – is worth discussing since it is highly related to modeling errors

and problem complexity. Therefore, this dissertation research also empirically compares

three existing spatial demand representation approaches to offer guidance on how to

choose an appropriate one for a specific application.

In general, Figure 1.2 shows the logical structure of the dissertation research.

1.3 Literature Review

1.3.1 Detection of Irregular Disease Clusters

Detection of disease clusters in time, space or space-time has generated considerable

interest within the disciplines of geography and public health for many decades (Besag and Newell

1991, Maheswaran and Craglia 2004, Lawson 2006). The shape of the geographic area of a true

disease cluster may be arbitrary. For example, air pollution diffusing from an incinerator may

cause an arbitrary disease cluster due to the wind strength and direction. To detect clusters in

irregular shapes, several methods have been proposed (Duczmal and Assunção 2004, Tango

and Takahashi 2005, Aldstadt and Getis 2006, Duczmal et al. 2006, Kulldorff et al. 2006,

Yiannakoulias et al. 2007, Duczmal et al. 2008, Duczmal et al. 2009, Cançado et al. 2010).

The search for methods that detect irregularly shaped clusters with higher statistical power and

more accurate geographic boundaries remains an active topic in current health research.

1.3.2 Spatio-temporal Mapping of Disease Risks

Lung cancer is not only the second most commonly diagnosed cancer in men and women,

but also the leading cause of cancer-related death in Georgia (Georgia Department of Public

Health 2008). However, as far as we know, lung cancer studies in Georgia are very few, and

most of them mainly focus on descriptive analyses using crude rates at a coarse spatio-temporal

scale, such as the 5-year incidence rates at the health district or county level. Such analyses are

not useful for assessing the health of diverse communities, and could introduce inferential biases

on etiological hypotheses. In addition, they can only provide limited help for healthcare

Figure 1.2. Logical structure of the dissertation research. [Diagram: GIS for public health studies comprises two components, health surveillance and health service planning. Health surveillance comprises spatial disease cluster detection and spatio-temporal disease mapping; health service planning comprises optimal siting of health facilities, with spatial demand representation as a sub-problem. The corresponding research topics are: a new method for detection of clusters with irregular shapes; spatio-temporal Bayesian models for Georgia lung cancer mapping at fine scales; a new location-allocation model for ambulance siting; and a comparison of three spatial demand representations.]

performance assessment and health policy making to improve the efficiency of interventions and

the distribution of resources. The low reliability of the disease rates for small population areas is

one of the challenges for mapping disease risk at a fine spatio-temporal scale. Recently,

hierarchical Bayesian models have been widely used to map disease risk spatially or spatio-

temporally to overcome or mitigate the small number problem (Bernardinelli et al. 1995, Waller

et al. 1997, Xia and Carlin 1998, Knorr-Held 2000, Mollié 2001, Wakefield et al. 2001, Best et

al. 2005, Richardson et al. 2006, Abellan et al. 2008, Lawson 2009, Fortunato et al. 2011).

When mapping one disease for multiple population groups or multiple diseases that have

common risk factors, a joint modeling framework can be used (Knorr-Held and Best 2001, Held

et al. 2005, Richardson et al. 2006, Downing et al. 2008). In this modeling framework, a set of

shared random components exists in each model.

1.3.3 Capacitated Maximal Covering Location Problems

Given a covering standard for a service, such as a distance or travel-time maximum, the

objective of the maximal covering location problem (MCLP) is to locate a fixed number of

facilities to provide the service to cover as many demands as possible. MCLP modeling, after

being put forward by Church and ReVelle (1974), has been a powerful and widely used tool in

many planning processes to optimally distribute limited resources to maximize social and

economic benefits. Chung et al. (1983) and Current and Storbeck (1988) published two early

papers dealing with the capacitated versions of the MCLP where the demands allocated to a

facility will not exceed the capacity of that facility. In all capacitated MCLP models, only one

fixed capacity level of the facility is considered for each potential facility site. However, many

situations arise where each potential facility site could have several possible maximum capacity

levels for a facility to choose from. For example, the capacity limit of an emergency facility (e.g.,


ambulance base or fire station) can be assumed to be determined by its stationed emergency

vehicles (e.g., ambulances or fire trucks). Therefore, varied numbers of emergency vehicles will

provide a series of possible maximum capacity levels for the emergency facility to choose from.
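
To make the basic (uncapacitated) MCLP described at the start of this subsection concrete, the following minimal sketch enumerates every choice of a fixed number of sites and keeps the one that covers the most demand. The function name, the toy demand weights, and the coverage sets are hypothetical and serve only as an illustration; exhaustive enumeration is feasible only for very small instances and is not the solution approach used in this dissertation.

```python
from itertools import combinations


def solve_mclp_brute_force(demand, covers, p):
    """Enumerate every choice of p sites and keep the one covering the most demand.

    demand: dict mapping each demand node to its demand weight
    covers: dict mapping each candidate site to the set of demand nodes
            that lie within the covering standard of that site
    p:      number of facilities to locate
    """
    best_sites, best_value = (), float("-inf")
    for sites in combinations(sorted(covers), p):
        # A demand node counts once, no matter how many chosen sites cover it.
        covered = set().union(*(covers[s] for s in sites))
        value = sum(demand[i] for i in covered)
        if value > best_value:
            best_sites, best_value = sites, value
    return best_sites, best_value


# Hypothetical toy instance: four demand nodes, three candidate sites.
demand = {"a": 10, "b": 20, "c": 30, "d": 40}
covers = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"c", "d"}}
best_sites, best_value = solve_mclp_brute_force(demand, covers, p=2)
print(best_sites, best_value)
```

In this toy instance, sites s1 and s3 together cover all four demand nodes. A capacitated variant would additionally have to track how much demand is allocated to each facility, which this sketch does not attempt.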

1.3.4 Spatial Demand Representations

For covering location modeling, it is common to assume that aggregated or continuous

spatial demand is concentrated on a set of points or uniformly distributed within areal units.

Different from the traditional area-based representations using census units or regular polygons,

such as triangles or rectangles, as demand objects, Cromley et al. (2012) proposed a new area-

based demand representation that partitions a continuous demand space into a set of the least

common demand coverage units (LCDCUs) by overlaying demand coverage areas at potential

facility sites. This representation approach, without complicated model formulations, could

reduce or eliminate some errors associated with the traditional point-based and area-based

representations.

Many covering location models, such as the maximal covering location problem (MCLP),

have been proven to be nondeterministic polynomial time (NP)-hard (Megiddo et al. 1981),

which means that no algorithm has yet been discovered that solves them in polynomial time in the

worst case. Actually, the size of a covering location problem is highly related to the demand

representation it adopts. Therefore, even if a demand representation approach may theoretically

reduce or eliminate some representation errors in a problem, it may well make the problem

difficult, if not impossible, to solve using exact methods in current optimization software.

Relying on some heuristic algorithms to solve such a complicated problem may introduce other

errors in modeling results. It is worth noting that the complexity of problems associated with

demand representations is rarely discussed in current literature.


1.4 Dissertation Structure

The dissertation is organized into six chapters. Chapter 1 is a brief introduction

to the background and objectives of the dissertation research, and a literature review of the topics

covered in this dissertation, including the detection of irregular disease clusters, spatio-temporal

mapping of disease risks, capacitated maximal covering location problems, and spatial demand

representations. The following four chapters are separate papers published in or to be submitted

to journals. In Chapter 2, a redesigned spatial scan statistic is proposed to detect disease clusters

with irregular shapes. Chapter 3 develops seven hierarchical Bayesian models under separate and

joint modeling frameworks to explore the spatio-temporal patterns of lung cancer incidence risks

in Georgia (2000-2007) at the census tract level with a two-year temporal unit. Chapter 4

develops modular capacitated maximal covering location problem (MCMCLP) models to

optimally site emergency vehicles (e.g., ambulances). In Chapter 5, three spatial demand

representation approaches are compared in terms of both representation error and problem complexity,

using the MCLP as an example. Chapter 6 presents the conclusions of this dissertation and outlines

future work.


References

Abellan, J.J., Richardson, S. & Best, N., 2008. Use of space–time models to investigate the stability of patterns of disease. Environmental health perspectives, 116 (8), 1111.

Aldstadt, J. & Getis, A., 2006. Using amoeba to create a spatial weights matrix and identify spatial clusters. Geographical analysis, 38 (4), 327-343.

Bernardinelli, L., Clayton, D., Pascutto, C., Montomoli, C., Ghislandi, M. & Songini, M., 1995. Bayesian analysis of space-time variation in disease risk. Statistics in Medicine, 14 (21-22), 2433-2443.

Besag, J. & Newell, J., 1991. The detection of clusters in rare diseases. Journal of the Royal Statistical Society. Series A (Statistics in Society), 154 (1), 143-155.

Best, N., Richardson, S. & Thomson, A., 2005. A comparison of bayesian spatial models for disease mapping. Statistical Methods in Medical Research, 14 (1), 35.

Boulos, M.N.K., 2003. The use of interactive graphical maps for browsing medical/health internet information resources. International Journal Of Health Geographics, 2 (1), 1.

Boulos, M.N.K., 2005. Web gis in practice iii: Creating a simple interactive map of england's strategic health authorities using google maps api, google earth kml, and msn virtual earth map control. International Journal Of Health Geographics, 4 (1), 22.

Brown, T., Mclafferty, S. & Moon, G. eds. 2010. A companion to health and medical geography, Chichester, UK: Wiley-Blackwell.

Cançado, A.L.F., Duarte, A.R., Duczmal, L.H., Ferreira, S.J., Fonseca, C.M. & Gontijo, E.C.D.M., 2010. Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters. International Journal of Health Geographics, 9 (1), 55.

Chung, C., Schilling, D. & Carbone, R., 1983. The capacitated maximal covering problem: A heuristic. Proceedings of the Fourteenth Annual Pittsburgh Conference on Modeling and Simulation, 1423-1428.

Church, R. & Revelle, C., 1974. The maximal covering location problem. Papers in regional science, 32 (1), 101-118.


Cromley, E.K. & Mclafferty, S.L., 2012. Gis and public health, 2nd ed. New York: The Guilford Press.

Cromley, R.G., Lin, J. & Merwin, D.A., 2012. Evaluating representation and scale error in the maximal covering location problem using gis and intelligent areal interpolation. International Journal of Geographical Information Science, 26 (3), 495-517.

Current, J. & Storbeck, J., 1988. Capacitated covering models. Environment and Planning B, 15, 153-164.

Downing, A., Forman, D., Gilthorpe, M., Edwards, K. & Manda, S., 2008. Joint disease mapping using six cancers in the yorkshire region of england. International Journal of Health Geographics, 7 (1), 41.

Duczmal, L. & Assunção, R., 2004. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis, 45 (2), 269-286.

Duczmal, L., Cançado, A.L.F. & Takahashi, R.H.C., 2008. Geographic delineation of disease clusters through multi-objective optimization. Journal of Computational & Graphical Statistics, 17, 243-262.

Duczmal, L., Duarte, A.R. & Tavares, R., 2009. Extensions of the scan statistic for the detection and inference of spatial clusters. Scan Statistics, 153-177.

Duczmal, L., Kulldorff, M. & Huang, L., 2006. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics, 15 (2), 428-442.

Elliott, P., Wakefield, J.C., Best, N.G. & Briggs, D.J., 2000. Spatial epidemiology: Methods and applications. In Elliott, P., Wakefield, J.C., Best, N.G. & Briggs, D.J. eds. Spatial epidemiology: Methods and applications. New York: Oxford University Press, 3-14.

Fortunato, L., Abellan, J.J., Beale, L., Lefevre, S. & Richardson, S., 2011. Spatio-temporal patterns of bladder cancer incidence in utah (1973-2004) and their association with the presence of toxic release inventory sites. International Journal of Health Geographics, 10 (1), 16.

Georgia Department of Public Health, 2008. Cancer program and data summary. Atlanta, GA.


Goodchild, M.F., 1992. Geographical information science. International Journal of Geographical Information Systems, 6 (1), 31-45.

Held, L., Natário, I., Fenton, S.E., Rue, H. & Becker, N., 2005. Towards joint disease mapping. Statistical Methods in Medical Research, 14 (1), 61-82.

Knorr-Held, L., 2000. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine, 19 (17-18), 2555-2567.

Knorr-Held, L. & Best, N.G., 2001. A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164 (1), 73-85.

Koch, T., 2005. Cartographies of disease: Maps, mapping, and medicine. Redlands, California: ESRI Press.

Kulldorff, M., Huang, L., Pickle, L. & Duczmal, L., 2006. An elliptic spatial scan statistic. Statistics in Medicine, 25 (22), 3929-3943.

Lawson, A., 2006. Statistical methods in spatial epidemiology, 2nd ed. Chichester, England; Hoboken, NJ: Wiley.

Lawson, A.B., 2009. Bayesian disease mapping: Hierarchical modeling in spatial epidemiology: Chapman & Hall/CRC.

Longley, P.A., Goodchild, M.F., Maguire, D.J. & Rhind, D.W., 2005. Geographic information systems and science, 2nd ed.: John Wiley & Sons, Ltd.

Luo, W. & Wang, F., 2003. Measures of spatial accessibility to health care in a gis environment: Synthesis and a case study in the chicago region. Environment and Planning B, 30 (6), 865-884.

Maheswaran, R. & Craglia, M., 2004. Gis in public health practice. Boca Raton: CRC Press.

Megiddo, N., Zemel, E. & Hakimi, S.L., 1981. The maximum coverage location problem: Northwestern University.


Melnick, A.L., 2002. Introduction to geographic information systems in public health. Gaithersburg, Maryland: Aspen Publishers.

Mollié, A., 2001. Bayesian mapping of hodgkins disease in france. Spatial Epidemiology, 1 (9), 267-286.

Najafabadi, A.T., 2009. Applications of gis in health sciences. Shiraz E Medical Journal, 10 (4), 221-230.

Nykiforuk, C.I.J. & Flaman, L.M., 2011. Geographic information systems (gis) for health promotion and public health: A review. Health Promotion Practice, 12 (1), 63-73.

Richardson, S., Abellan, J. & Best, N., 2006. Bayesian spatio-temporal analysis of joint patterns of male and female lung cancer risks in yorkshire (uk). Statistical Methods in Medical Research, 15 (4), 385.

Rushton, G., 2003. Public health, gis and spatial analytic tools. Annual Review of Public Health, 24, 43-56.

Tango, T. & Takahashi, K., 2005. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics, 4, 11-15.

Theseira, M., 2002. Using internet gis technology for sharing health and health related data for the west midlands region. Health & Place, 8 (1), 37-46.

Wakefield, J., Best, N. & Waller, L., 2001. Bayesian approaches to disease mapping. Spatial Epidemiology, 1 (9), 104-128.

Wall, P.A. & Devine, O.J., 2000. Interactive analysis of the spatial distribution of disease using a geographic information systems. Journal of geographical systems, 2 (3), 243.

Waller, L., Carlin, B., Xia, H. & Gelfand, A., 1997. Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association, 607-617.

Xia, H. & Carlin, B., 1998. Spatio-temporal models with errors in covariates: Mapping ohio lung cancer mortality. Statistics in Medicine, 17 (18), 2025-2043.


Yiannakoulias, N., Rosychuk, R.J. & Hodgson, J., 2007. Adaptations for finding irregularly shaped disease clusters. International Journal of Health Geographics, 6 (1), 28.


CHAPTER 2

DETECTING DISEASE CLUSTERS IN ARBITRARY SHAPES WITH A REDESIGNED

SPATIAL SCAN STATISTIC¹

¹ Yin, P. and Mu, L. To be submitted to Geographical Analysis.


Abstract

Detection and surveillance of spatial disease clusters in arbitrary shapes have generated

considerable interest within the disciplines of geography and public health. However, most

existing methods have drawbacks such as enormous computing workloads, peculiarly shaped

detected clusters, and the multiple testing problem, among others. In this study, the commonly used

Kulldorff’s circular spatial scan statistic (CSScan) was redesigned to quickly detect spatial

disease clusters in arbitrary shapes by using Tango’s restricted likelihood ratio as the test statistic

combined with Assunção et al.’s dynamic Minimum Spanning Tree (dMST) search strategy. Six

cluster models and two non-cluster scenarios were designed and five hundred replications for

each model were simulated to test and compare the performances of the redesigned spatial scan

statistic method (RSScan) with Tango’s method, Assunção et al.’s method, and Kulldorff’s

CSScan method to detect the statistically significant clusters and identify the boundaries of

clusters. Besides the metric of power, the Kappa Index of Agreement (KIA) was used to indicate

the degree of match between a cluster estimate and the true cluster. The results from the

performance experiment indicate that the RSScan method with appropriate parameters, which

were explored in this study, generally matches or exceeds the other three methods in its capability to

rapidly detect spatial disease clusters in arbitrary shapes. The RSScan method was then applied to

detecting the cluster of lung cancer in the State of Georgia in the United States for the period of 1998

to 2005. Limitations of the RSScan method are also discussed.

Keywords: Spatial scan statistic, Restricted likelihood ratio, Disease cluster, Arbitrary shape,

Dynamic Minimum Spanning Tree


2.1 Introduction

Detection of disease clusters in time, space or space-time has generated considerable

interest within disciplines of geography and public health for many decades (Besag and Newell

1991, Maheswaran and Craglia 2004, Lawson 2006). Lawson (2006) described a disease cluster

as “any area within the study region of significant elevated risk” of a particular disease. It is also

referred to as a hot-spot cluster. The causes of disease clusters may include the communicability of

some diseases, adverse effects from physical, socioeconomic, or psychosocial environment,

certain kinds of lifestyles which are commonly considered harmful to health, such as smoking,

and poor accessibility to healthcare (Maheswaran and Craglia 2004). Detecting disease clusters

not only aids the analysis of disease etiology, but also enables public health departments to improve

their surveillance, distribute funding and other resources, and control possible disease

outbreaks.

It is well accepted that the spatial variation of disease incidence is highly related to the

background population at risk. For example, the occurrence of a disease in an urban area

may be higher than that in a rural area only because of the larger population in the urban area. If

two cities have the same size of population, but the proportion of population over age 60 in the

first city is much higher than that in the second city, it is not surprising that the incidence of

cardiovascular disease in the first city is higher. In addition, the shape of the geographic area of a true

disease cluster may be arbitrary. For example, air pollution diffusing from an incinerator may

cause an arbitrary disease cluster due to the wind strength and direction. Therefore, detection of

spatial disease clusters should not only take into account the spatial variation of the population at

risk, but also be able to capture the arbitrary shapes of disease clusters.


In the following sections, Section 2.2 is a brief review of several well-known methods for

detecting spatial disease clusters. Section 2.3 proposes a redesigned spatial scan method (RSScan)

using Tango’s (2008) restricted likelihood ratio as the test statistic combined with Assunção et

al.’s (2006) dynamic Minimum Spanning Tree (dMST) search strategy to quickly detect spatial

disease clusters in arbitrary shapes. Section 2.4 tests the performance of RSScan with simulated

data, which is followed by an application in Section 2.5 using RSScan to detect the cluster of lung

cancer in Georgia from 1998 to 2005. Section 2.6 concludes the paper.

2.2 Existing Methods for Detection of Disease Clusters

Local Moran’s I is an index which has been widely used to identify clusters (Anselin

1995, Jacquez and Greiling 2003, Rogerson and Yamada 2009, Goovaerts 2010). However, there

are several issues with using Local Moran’s I to detect disease clusters. Because Local Moran’s I

is designed to test the similarity of attribute values between the region of interest

and its neighbors, the clusters detected with it may not be areas with

significantly elevated disease risk. Local Moran’s I is also incapable of detecting clusters that

involve only a single region. Conducting a separate statistical test with Local Moran’s I for each

region in the study area results in a multiple testing problem, in which some clusters may be detected

just by chance even if the real pattern of disease incidence is random (Rogerson and Yamada

2009). In addition, crude rates, such as the Standardized Incidence Ratio (SIR), are usually directly

used as the attribute in Local Moran’s I to detect the disease clusters (Jacquez and Greiling 2003,

Rogerson and Yamada 2009), which may cause the test to be unstable owing to the low reliability of

disease rates for small populations at risk.
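
The instability of rates for small populations can be illustrated with simple arithmetic. The sketch below, with a hypothetical helper name and made-up counts, compares how a single additional case changes the SIR (observed cases divided by expected cases) in a region with a small expected count and in one with a large expected count.

```python
def standardized_incidence_ratio(observed, expected):
    """SIR = observed cases / expected cases under the reference rates."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical counts: both regions start with an SIR of 2.0.
small_before = standardized_incidence_ratio(2, 1.0)
small_after = standardized_incidence_ratio(3, 1.0)      # one extra case
large_before = standardized_incidence_ratio(200, 100.0)
large_after = standardized_incidence_ratio(201, 100.0)  # one extra case

print(small_after - small_before)  # change in the small region
print(large_after - large_before)  # change in the large region
```

One extra case shifts the SIR of the small region by a full unit, whereas the same extra case barely moves the SIR of the large region.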

Different from Local Moran’s I, Openshaw et al.’s (1987) Geographical Analysis

Machine (GAM) is an exploratory and graphical method that allows the detection of clusters with

significantly elevated disease risk. A fine regular lattice is laid on the study region, and many

circles of various radii are constructed on each lattice point. The number of disease cases in each

circle is then counted and compared with the number of disease cases which would be expected

under the null hypothesis that all disease incidences are spatially distributed randomly within the

underlying structure of population at risk. With Monte Carlo testing (Dwass 1957) where the

probability distribution of the expected number of cases in each circle is generated based on

simulations, if the null hypothesis is rejected, the corresponding circle will be drawn on the map.

Finally, the plotted circles give an idea of where the disease clusters may be and how large

they are. Each plotted circle is regarded as having a significantly elevated risk.

Since there are usually thousands of circles with various radii tested simultaneously, the multiple

testing problem and enormous computational workload need to be addressed. Turnbull et al.

(1990) proposed the Cluster Evaluation Permutation Procedure (CEPP), which tests only

the circle with the maximum count of disease cases among all moving circles covering the same

predefined population. This method solves the multiple testing problem, but the input threshold,

a predefined population, may be hard to determine.
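
The Monte Carlo testing used by these methods can be sketched as follows. The sketch assumes a deliberately simple stand-in statistic (the largest observed-to-expected ratio in any single region) and hypothetical function names; it is not the GAM or CEPP procedure itself. Cases are redistributed over regions in proportion to population under the null hypothesis, and the p-value is the rank of the observed statistic among the simulated ones.

```python
import random


def simulate_null_counts(total_cases, population, rng):
    """Scatter the cases over regions with probability proportional to population."""
    counts = [0] * len(population)
    for region in rng.choices(range(len(population)), weights=population, k=total_cases):
        counts[region] += 1
    return counts


def max_rate_ratio(counts, population):
    """Toy statistic (an assumption): largest observed/expected ratio in any region."""
    total_cases, total_pop = sum(counts), sum(population)
    return max(c / (total_cases * p / total_pop) for c, p in zip(counts, population))


def monte_carlo_p_value(observed_counts, population, n_sim=999, seed=0):
    """Rank the observed statistic among statistics simulated under the null."""
    rng = random.Random(seed)
    observed = max_rate_ratio(observed_counts, population)
    total_cases = sum(observed_counts)
    at_least_as_extreme = sum(
        max_rate_ratio(simulate_null_counts(total_cases, population, rng), population) >= observed
        for _ in range(n_sim)
    )
    # The observed data set counts as one of the n_sim + 1 replications.
    return (at_least_as_extreme + 1) / (n_sim + 1)


# Hypothetical data: four equally populated regions.
p_value = monte_carlo_p_value([9, 4, 3, 4], [100, 100, 100, 100], n_sim=199)
print(p_value)
```

Because only the single most extreme statistic is ranked against its simulated counterparts, one test is performed per data set, which is how the scan-type methods discussed below avoid the multiple testing problem.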

Based on Openshaw et al.’s (1987) and Turnbull et al.’s (1990) methods, Kulldorff and

Nagarwalla (1995) developed a circular spatial scan statistic, hereafter denoted as the CSScan

method. A circular scan window with various radii is constructed and

moved over the study area. The null hypothesis is that the probability of being a

case in the circle, p, is the same as that in the rest of the study region, q. The alternative

hypothesis is p > q. Given the number of cases and population inside and outside the circle,

the maximum likelihood ratio between these two hypotheses is selected as the test statistic, which

can be derived under two stochastic models, Bernoulli and Poisson (Kulldorff 1997). The circular


window with the maximum test statistic is regarded as the most likely cluster. Its significance is

then tested using the Monte Carlo testing method (Dwass 1957). The spatial scan statistic λ based on

the Poisson model is shown below (Equation 2.1, Kulldorff 1997):

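$$
\lambda = \sup_{z \in Z}
\begin{cases}
\left(\dfrac{n(z)}{e(z)}\right)^{n(z)} \left(\dfrac{n - n(z)}{n - e(z)}\right)^{n - n(z)} & \text{if } \dfrac{n(z)}{e(z)} > \dfrac{n - n(z)}{n - e(z)} \\[2ex]
1 & \text{otherwise}
\end{cases}
\tag{2.1}
$$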

where sup denotes the supremum (least upper bound), z denotes the zone within the circular scan

window, which is included in the zone set Z, and n(z) and e(z) denote the actual number of disease

cases and the null expected number of cases within the specified zone z, respectively; n is the total

number of disease cases in the study area. The CSScan method remains one of the widely used methods for cluster

detection, possibly because it addresses the problems existing in such methods as Local

Moran’s I, GAM, and CEPP. In addition, the latest version of the tool for this method,

SaTScan™, can be easily accessed over the Internet (Kulldorff and Information Management

Services Inc. 2010).
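
As an illustration of Equation 2.1, the following sketch evaluates the logarithm of the bracketed term for each candidate zone and returns the zone with the largest value. The function names and the toy data are hypothetical, and the candidate zones are simply listed by hand rather than generated by a circular window; the sketch also assumes that the expected counts sum to the total number of cases n.

```python
import math


def poisson_log_lr(n_z, e_z, n_total):
    """Log of the bracketed term in Equation 2.1 for a single zone."""
    if not 0 < e_z < n_total:
        raise ValueError("expected count must lie strictly between 0 and n_total")
    if n_z <= e_z:
        # Equivalent to n(z)/e(z) <= (n - n(z))/(n - e(z)): the ratio is 1, so its log is 0.
        return 0.0
    inside = n_z * math.log(n_z / e_z)
    outside_cases = n_total - n_z
    # By convention 0 * log(0) = 0 when every case falls inside the zone.
    outside = outside_cases * math.log(outside_cases / (n_total - e_z)) if outside_cases > 0 else 0.0
    return inside + outside


def most_likely_zone(zones, cases, expected):
    """Return the zone with the largest log likelihood ratio and that value."""
    n_total = sum(cases)
    best_zone, best_llr = None, 0.0
    for zone in zones:
        llr = poisson_log_lr(sum(cases[i] for i in zone), sum(expected[i] for i in zone), n_total)
        if llr > best_llr:
            best_zone, best_llr = zone, llr
    return best_zone, best_llr


# Hypothetical data: four regions with equal expected counts.
cases = [10, 2, 3, 5]
expected = [5.0, 5.0, 5.0, 5.0]
zones = [(0,), (0, 1), (2, 3)]
best_zone, best_llr = most_likely_zone(zones, cases, expected)
print(best_zone, best_llr)
```

Working on the log scale avoids numerical overflow for large case counts and does not change which zone attains the supremum.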

Since Kulldorff’s CSScan uses a circular window to scan the study region, it is difficult

to detect clusters of irregular shapes. In order to solve this problem, many methods have been

developed which mainly modify the search strategy of the scan window or the construction of a

test statistic. Duczmal and Assunção (2004) proposed a simulated annealing search strategy for

detection of arbitrarily shaped spatial clusters. In this method, however, the choice among

the four strategies, which have different levels of randomness, for selecting the successor of

the current subgraph at each step tends to be arbitrary. Tango and Takahashi (2005) proposed a flexibly shaped spatial

scan statistic which exhaustively searches all cluster candidates within a given radius of any area.

However, the running time of their algorithm increases exponentially with the

search radius. Several penalty parameters were incorporated into the maximum likelihood ratio

function in different methods either to enable the method to find irregularly shaped clusters, such

as the “eccentricity penalty” in Kulldorff et al. (2006) for elliptical clusters, or to penalize

detected clusters that are very irregular in shape, such as the “non-compactness” penalty in Duczmal

et al. (2006) and the “non-connectivity penalty” in Yiannakoulias et al. (2007). Despite these

efforts, these methods still involve considerable subjectivity in the choice of penalty

parameters.

2.3 Redesigned Spatial Scan Method (RSScan)

From the review of existing methods in the previous section, it can be summarized that

spatial scan methods mainly consist of two components: a search strategy and a test statistic such

as the spatial scan statistic λ. The objective of a spatial scan is to find the zone z that maximizes the

test statistic over all zones in the set Z, and thereby to identify the zone that constitutes the most likely

cluster (Duczmal and Assunção 2004). A search strategy mainly defines the zone set Z and in

turn determines the possible shape of a cluster estimate and the running time of an algorithm. A

test statistic, combined with the search strategy, determines the performance of the method. In

order to rapidly detect arbitrarily shaped spatial disease clusters for count data, and at the same

time to address the issues identified in the above-mentioned methods, we redesigned Kulldorff’s

CSScan method by using Assunção et al.’s (2006) dMST method as the search strategy and

Tango’s (2008) restricted likelihood ratio as the test statistic in our RSScan method, which will


be described in the following subsections (2.3.1 and 2.3.2), respectively. Table 2.1 shows the test

statistics and search strategies used in four spatial scan methods including our RSScan method,

Tango’s method, Assunção et al.’s method, and Kulldorff’s CSScan method.

Table 2.1. Test statistics and search strategies of four spatial scan methods

| Search strategy \ Test statistic | Tango’s restricted likelihood ratio | Kulldorff’s maximum likelihood ratio |
|---|---|---|
| Assunção et al.’s dMST | RSScan | Assunção et al.’s method |
| Circular scan window | Tango’s method | CSScan |

Although Tango (2008) mentioned that the restricted likelihood ratio could be used with a

non-circular scan window, and the latest version of his software, FleXScan v3.1 (Takahashi et al.

2010), released just after this study was finished, allows the restricted likelihood ratio to be

combined with his flexible scan method, the current literature lacks work testing and discussing

this kind of combination. Tango (2008) designed four cluster models to test the statistical power

of the restricted likelihood ratio with circular scan windows. However, from these tests it is

difficult to infer the performance of the restricted likelihood ratio as a test statistic in other

situations, such as different levels of disease cases in the study area or various cluster shapes.

The choice of the screening level α1 in the restricted likelihood ratio also needs to be explored

when it is combined with a non-circular scan window such as the dMST search strategy in our

RSScan method.


2.3.1 Test Statistic

It is reasonable to think that not only should the disease clusters be areas of significantly

elevated risk as a whole, but also the risks of individual regions within the clusters should not be

very low. Therefore, we adopt the restricted likelihood ratio proposed by Tango (2008) as the test

statistic λT in our RSScan method (Equation 2.2, Tango 2008).

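$$
\lambda_T = \sup_{z \in Z} \left(\frac{n(z)}{e(z)}\right)^{n(z)} \left(\frac{n - n(z)}{n - e(z)}\right)^{n - n(z)} I\!\left(\frac{n(z)}{e(z)} > \frac{n - n(z)}{n - e(z)}\right) \prod_{i \in z} I\left(p_i < \alpha_1\right)
\tag{2.2}
$$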

where I(·) is an indicator function. The only difference between Tango’s restricted likelihood

ratio function (Equation 2.2) and Kulldorff’s maximum likelihood ratio function (Equation 2.1)

is the product of indicator functions $\prod_{i \in z} I(p_i < \alpha_1)$, in which α1 is a screening level specified by

users for the risk of any individual region, and pi is the one-tailed mid-p value of region i under

the test of the null hypothesis H0: E(Ni) = ei, which is defined below (Equation 2.3, Tango 2008).

$$
p_i = \Pr\{N_i \ge n_i + 1 \mid N_i \sim \mathrm{Pois}(e_i)\} + \frac{1}{2} \Pr\{N_i = n_i \mid N_i \sim \mathrm{Pois}(e_i)\}
\tag{2.3}
$$

where Ni is a random variable which denotes the number of disease cases in region i, ni and ei

denote the actual number of cases and null expected number of cases in region i, respectively. In

Tango’s restricted likelihood ratio function, if the one-tailed mid-p value of a region is less than

the prespecified screening level α1, this region will be regarded as being of elevated risk.

Otherwise, this region will not be considered in the disease cluster estimate. It should be noted


that Kulldorff’s maximum likelihood ratio is the special case of the restricted likelihood ratio

when the screening level α1=1.
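
The screening step can be sketched as follows. The code computes the one-tailed mid-p value of Equation 2.3 for each region and checks whether every region in a zone falls below the screening level, which corresponds to the product of indicator functions in Equation 2.2. The function names and the toy data are hypothetical, and the Poisson probabilities are evaluated directly with the standard library rather than with a statistics package.

```python
import math


def poisson_pmf(k, mean):
    """Pr{N = k} for N ~ Pois(mean), with mean > 0, computed on the log scale."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def mid_p_value(n_i, e_i):
    """One-tailed mid-p value: Pr{N >= n_i + 1} + 0.5 * Pr{N = n_i}."""
    cdf = sum(poisson_pmf(k, e_i) for k in range(n_i + 1))
    return (1.0 - cdf) + 0.5 * poisson_pmf(n_i, e_i)


def passes_screening(zone, cases, expected, alpha1=0.2):
    """True when every region in the zone has a mid-p value below alpha1."""
    return all(mid_p_value(cases[i], expected[i]) < alpha1 for i in zone)


# Hypothetical data: region 0 has a clear excess of cases, region 1 does not.
cases = [10, 2, 3, 5]
expected = [5.0, 5.0, 5.0, 5.0]
print(passes_screening((0,), cases, expected))              # zone of the high-risk region only
print(passes_screening((0, 1), cases, expected))            # zone that also includes a low-risk region
print(passes_screening((0, 1), cases, expected, alpha1=1))  # no screening
```

A zone that fails the screening contributes zero to the restricted likelihood ratio. With alpha1 = 1 every region passes, because the mid-p value is always strictly less than 1, so the statistic reduces to Kulldorff’s maximum likelihood ratio, as noted above.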

Although the problem of noninterpretability in the parameters is addressed and the cluster

size is effectively controlled with the restricted likelihood ratio function, the choice of screening

level α1 is left entirely to users. Tango (2008) provides a guideline regarding the choice of α1 for a

test of the nominal α level of 0.05, and recommends α1=0.2 as a default value. However, this

guideline is derived only from the testing results with four simulated cluster models using a

circular scan window. The recommended α1 value in our RSScan method for detecting

clusters of arbitrary shape is explored in Section 2.4.

2.3.2 Search Strategy

In order to detect arbitrarily shaped clusters and guarantee spatial contiguity, we use a
graph G(V, E) to represent a region map, where V is a set of n vertices (each representing a
region such as a census tract or county), and E is a set of edges (each connecting a unique pair of
adjacent regions) (Figure 2.1).

Figure 2.1. Graph-based representation of a region map

The exclusion of the regions of low risks in the restricted likelihood ratio function is

realized by removing all edges of those regions in the graph. This screening step also reduces the

amount of calculation in the algorithm. Therefore, the final cluster estimate will only include the

regions which are connected in the graph. Similar to Kulldorff's CSScan method, the RSScan

method will find the most likely cluster with the largest value of the test statistic to address the

multiple testing problem.
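The screening step can be illustrated with a small sketch, assuming the region map is stored as a dictionary of adjacency sets and that the mid-p value of each region has already been computed. The function name, the region labels, and the p-values are hypothetical.

```python
def build_screened_graph(adjacency, p_values, alpha1):
    """Remove all edges of regions whose mid-p value fails the screen.

    adjacency: dict mapping each region to the set of its adjacent regions.
    p_values:  dict mapping each region to its one-tailed mid-p value.
    Regions with p >= alpha1 stay in the graph but become isolated vertices.
    """
    keep = {v for v, p in p_values.items() if p < alpha1}
    return {
        v: ({u for u in neighbours if u in keep} if v in keep else set())
        for v, neighbours in adjacency.items()
    }


# Hypothetical four-region map: A-B, A-C, B-C, C-D.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
p_values = {"A": 0.01, "B": 0.50, "C": 0.10, "D": 0.15}
screened = build_screened_graph(adjacency, p_values, alpha1=0.2)
```

Here region B fails the screen, so it loses all of its edges and can no longer join any zone, while A, C, and D remain connected through C.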

Assunção et al.’s (2006) dMST method is used as the search strategy in our RSScan

method. Given a graph G and an empty collection T, for any vertex u, the steps can be described

as follows:

1) Put vertex u into T.

2) Among all the vertices not in T but adjacent to any vertex in T, identify the vertex v
whose addition gives T the largest value of the test statistic at the current step, and then put
vertex v into T. All vertices currently in T constitute one zone (i.e., a potential cluster) to be
scanned.

3) Repeat step 2 until all vertices connected to vertex u in graph G are added into T.

The above steps are executed for each non-isolated vertex in graph G, which yields
the zone set Z; the zone with the maximum test statistic is regarded as the most likely
cluster. In order to reduce the computational load, a search radius K is set so that at most K-1
nearest neighboring vertices are included in the zones when scanning each vertex.
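The greedy growth described in steps 1) to 3) can be sketched as follows. This is an illustrative reading of the steps rather than the original implementation: the graph is assumed to be a dictionary of adjacency sets, the test statistic is passed in as a function of a zone, zones are recorded only after a vertex has been added in step 2, and the search radius is interpreted as a cap of K vertices per zone. All names and the toy statistic are hypothetical.

```python
def grow_zones(graph, start, statistic, k_max):
    """Greedily grow nested zones from `start`, following steps 1) to 3)."""
    zone = [start]                       # step 1: T = {u}
    zones = []
    while len(zone) < k_max:             # at most K - 1 added neighbours
        members = set(zone)
        frontier = {u for v in zone for u in graph[v]} - members
        if not frontier:                 # step 3: nothing left to add
            break
        # Step 2: add the adjacent vertex that maximizes the statistic.
        # Sorting the frontier makes tie-breaking deterministic.
        best = max(sorted(frontier), key=lambda u: statistic(zone + [u]))
        zone.append(best)
        zones.append(tuple(zone))
    return zones


def most_likely_cluster(graph, statistic, k_max):
    """Scan every non-isolated vertex and keep the zone with the largest statistic."""
    best_zone, best_value = None, float("-inf")
    for start, neighbours in graph.items():
        if not neighbours:               # isolated vertices are not scanned
            continue
        for zone in grow_zones(graph, start, statistic, k_max):
            value = statistic(list(zone))
            if value > best_value:
                best_zone, best_value = zone, value
    return best_zone, best_value


# Hypothetical toy graph; the "statistic" is just a sum of vertex weights.
graph = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B"}, "D": {"A"}, "E": set()}
weights = {"A": 3, "B": -1, "C": 2, "D": 1, "E": 5}


def toy_statistic(zone):
    return sum(weights[v] for v in zone)


zone, value = most_likely_cluster(graph, toy_statistic, k_max=4)
```

In practice the statistic argument would be the restricted likelihood ratio of Equation 2.2, and the graph would be the screened graph in which low-risk regions have lost their edges.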

2.4 Performance Evaluation

2.4.1 Experimental design

An experiment was designed with six single-cluster models based on simulated data in
order to evaluate the performance of the RSScan method. For each cluster model, the location of
the disease cluster was first fixed in the study area, and then a relative risk r>1 was assigned to
the regions within the disease cluster and r=1 to the remaining regions. Given the total number of
disease cases in the study area, the numbers of disease cases in the regions follow a multinomial
distribution in which region i has probability $r_i p_i / \sum_{j=1}^{m} r_j p_j$, where ri and pi are the relative risk and
population at risk in region i, respectively, and m is the total number of regions in the study area.
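One replication of this simulation could be generated along the following lines, using only the Python standard library. The function names, the seed handling, and the three-region example are assumptions made for illustration.

```python
import random
from collections import Counter


def cell_probabilities(relative_risk, population):
    """Multinomial probabilities r_i * p_i / sum_j (r_j * p_j)."""
    weights = [r * p for r, p in zip(relative_risk, population)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_cases(relative_risk, population, total_cases, seed=None):
    """Draw one replication: allocate `total_cases` cases to the regions."""
    rng = random.Random(seed)
    probs = cell_probabilities(relative_risk, population)
    # Assigning each case independently to a region is one multinomial draw.
    draws = rng.choices(range(len(probs)), weights=probs, k=total_cases)
    counts = Counter(draws)
    return [counts.get(i, 0) for i in range(len(probs))]


# Hypothetical example: region 0 is the cluster with relative risk 2.
relative_risk = [2.0, 1.0, 1.0]
population = [100, 100, 200]
probs = cell_probabilities(relative_risk, population)
replication = simulate_cases(relative_risk, population, total_cases=500, seed=1)
```

Repeating the draw with different seeds gives replications that share the cluster location and the total number of cases but differ in the regional counts.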

Based on the criterion used by Kulldorff et al. (2003), the relative risk for all regions that

constitute a cluster is determined using a one-sided binomial test with significance level of 0.05

such that the null hypothesis is rejected with probability of 0.999 when the alternative is a cluster

with unknown risk but with known location. This choice of relative risks provides an upper limit

of 0.999 for the power attainable by any test.

Three types of shapes are designed for simulated cluster models: round, line and trifurcate

shape. The study area (Figure 2.2) is the State of Georgia (GA) in the United States including

159 counties with a total population of 9,210,790 (year 2000). Three locations in this area

(Figure 2.3) are chosen for simulated clusters. Two levels of disease case numbers are designed:

Low (500 cases) and High (5000 cases). Combining the two levels of disease cases with the three

cluster shapes gives a total of six cluster models. A code of the form 'X_Shape' is used to label these

cluster models, where 'X' indicates the level of disease case numbers (L for low and H for high).

Table 2.2 lists all detailed information of each cluster model. We also simulated a scenario where

there is no cluster for each level of disease case numbers (all regions have a relative risk r=1) so

that the capability of the method to control Type I error could be tested.

Figure 2.2. Population 2000 by counties in GA in the United States

Figure 2.3. Locations of simulated clusters: (a) circular shape (b) linear shape (c) trifurcate shape

Table 2.2. Information of simulated cluster models

Cluster ID  Cluster Code  Count of Cases  Population in Cluster  Cluster Size (count of counties)  Shape Type  Relative Risk
1           L_Round       500             1,802,970              7                                 Round       1.63
2           H_Round       5000            1,802,970              7                                 Round       1.18
3           L_Line        500             1,721,370              5                                 Line        1.64
4           H_Line        5000            1,721,370              5                                 Line        1.18
5           L_Tri         500             427,594                7                                 Trifurcate  2.30
6           H_Tri         5000            427,594                7                                 Trifurcate  1.33

For each type of cluster and non-cluster scenario, 500 replications were simulated, each

of which has the same cluster location and total number of disease cases over the whole study

area but different disease cases in every region. The nominal significance level was selected as

0.05, which means that clusters with p-values larger than 0.05 are considered not significant.

The Monte Carlo testing method (Dwass 1957) with 999 repetitions was used to test the significance

of the observed test statistic, so the p-value is calculated from the rank of the observed test

statistic among the total of 1000 values. In order to explore the effect of the screening level α1 in

the restricted likelihood ratio function, five different values were used: 0.05, 0.1, 0.2, 0.3 and 0.4.
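The rank-based p-value can be written in a few lines. The sketch below assumes that the test statistics of the simulated null data sets are already available in a list; the function name and the numbers are illustrative.

```python
def monte_carlo_p_value(observed, simulated):
    """Rank-based Monte Carlo p-value.

    The observed statistic is ranked among itself and the simulated null
    statistics, so with 999 simulations the denominator is 1000.
    """
    rank = 1 + sum(1 for s in simulated if s >= observed)
    return rank / (len(simulated) + 1)


# Hypothetical example: 4 of 999 null statistics reach the observed value.
simulated = [12.0, 11.5, 10.2, 10.0] + [1.0] * 995
p_value = monte_carlo_p_value(10.0, simulated)
```

With this definition the smallest attainable p-value for 999 repetitions is 0.001, which occurs when no simulated statistic reaches the observed one.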

Since the RSScan method is a hybrid between Tango’s (2008) method and Assunção et

al.’s (2006) method, these two methods were chosen for comparison in an experiment.

Considering that Kulldorff's CSScan method is probably the most widely used method for detecting

spatial clusters, it was also added to the comparison. The circular scan window in the CSScan method

was limited to at most 20% of the population in the study region, and the search radius

K in the other three methods was correspondingly set to 30 counties.

2.4.2 Experimental Results

Power is the most important evaluation criterion for cluster detection tests, which

indicates how effective methods are in identifying the presence of statistically noteworthy

clusters (Kulldorff et al. 2003, Tango and Takahashi 2005, Assunção et al. 2006, Tango 2008). In

order to understand how well these methods identify the correct boundaries of a cluster, Kappa

Index of Agreement (KIA, De Smith et al. 2007) is chosen as a complementary metric to the

power in this study since it not only shows the degree of match between the detected cluster

estimates and the true clusters, but also excludes the probability that the cluster regions are

detected by chance. In this case, the KIA decreases the impacts on the evaluation caused by

different cluster model properties, such as study region size and cluster size. In order to easily

compare the performances of different methods or different screening level values in RSScan and

Tango’s method, the results of six cluster models were averaged in terms of the levels of disease

cases and shapes of clusters.

2.4.2.1 Estimated Power of Methods

The power in this study is defined as the ratio of statistically significant clusters detected

(significance level=0.05) to the count of replications for each cluster model (500). The results of

the power analysis for four spatial scan methods are shown in Table 2.3. The highest value for

each scenario (column in the table) is bold. The test statistics in Assunção et al.’s method and

CSScan method can be regarded as the restricted likelihood ratio with α1=1.

We can see that all four methods have higher power to detect significant clusters with the
lower level of disease cases (L_Cas) than with the higher level of disease cases (H_Cas). As α1
increases from 0.05 to 0.4, the cluster shape that the RSScan method detects most readily shifts
from linear (Line) to round (Round) and then to trifurcate (Tri), while the shape that Tango's
method detects most readily shifts from linear (Line) to round (Round), with trifurcate clusters
(Tri) remaining the most difficult whatever the value of α1 is. Assunção et al.'s method and the
CSScan method both have their highest power for trifurcate clusters (Tri). However, Assunção et
al.'s method has more difficulty detecting significant round clusters (Round), while the CSScan
method has its lowest power for linear clusters (Line).

Table 2.3. Estimated power of four spatial scan methods (significance level=0.05)

                       Number of Cases    Cluster Shape
α1     Method          H_Cas    L_Cas     Line     Round    Tri      Average
0.05   RSScan          0.74     0.795     0.788    0.757    0.758    0.768
0.05   Tango's         0.661    0.725     0.741    0.693    0.645    0.693
0.1    RSScan          0.773    0.802     0.824    0.796    0.743    0.788
0.1    Tango's         0.669    0.733     0.752    0.71     0.64     0.701
0.2    RSScan          0.788    0.835     0.831    0.831    0.773    0.812
0.2    Tango's         0.683    0.743     0.754    0.718    0.668    0.713
0.3    RSScan          0.79     0.831     0.807    0.817    0.807    0.81
0.3    Tango's         0.693    0.765     0.754    0.741    0.693    0.729
0.4    RSScan          0.823    0.847     0.811    0.825    0.869    0.835
0.4    Tango's         0.719    0.775     0.748    0.78     0.712    0.747
1      Assunção's      0.866    0.887     0.873    0.855    0.901    0.876
1      CSScan          0.779    0.798     0.716    0.756    0.894    0.789

Figure 2.4 shows the estimated average power for each method considering all scenarios.

The figure shows that Assunção et al.’s method has the highest average power (0.876) among

these four methods for the clusters with any level of disease cases and any type of shape. The RSScan

method has good power, especially when α1 is as large as 0.4 (0.835). The CSScan method has a

relatively low power (0.789), and Tango's method has the lowest power whatever the value of α1

is.

2.4.2.2 Kappa Index of Agreement

In order to evaluate the agreement between the most likely cluster detected and true

clusters to understand how well these methods identify the correct boundaries of a cluster, KIA

was used as another metric to evaluate the performance of these four methods. One advantage of

KIA is that it excludes the probability of detected cluster regions caused merely by chance. There

are two categories of regions: inside cluster and outside cluster. Given the study area size (S), the

true cluster size (T), the detected cluster estimate size (D), and the size of the intersection

between the cluster estimate and the true cluster (I), Table 2.4 shows the contingency table for

detected cluster estimates and true clusters.

Figure 2.4. Estimated average power of the four spatial scan methods

Table 2.4. Contingency table for detected cluster estimates and true clusters

                    Cluster Estimate
True Cluster        Inside Cluster    Outside Cluster    Total
Inside Cluster      I                 T-I                T
Outside Cluster     D-I               S-T-D+I            S-T
Total               D                 S-D                S

Based on the above contingency table, the KIA equation can be derived for this study

(Equation 2.4):

[Figure 2.4 plot: power (vertical axis, 0.6 to 1) against screening level α1 (horizontal axis: 0.05, 0.1, 0.2, 0.3, 0.4, 1) for RSScan, Tango's, Assunção's, and CSScan]

$$\kappa = \frac{O - E}{1 - E}$$ Equation 2.4

$$O = \frac{I + (S - T - D + I)}{S}, \qquad E = \frac{D \times T + (S - D) \times (S - T)}{S^2}$$

where O is the observed proportion of matching values (the contingency table diagonal) and E is
the expected proportion of matches in this diagonal assuming the two categories in the true cluster
are independent of the two categories in the cluster estimate. A KIA of 1 means perfect
agreement and a KIA of 0 means agreement no better than chance; negative values can occur
when agreement is worse than chance.
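Equation 2.4 translates directly into code. The sketch below is a hypothetical helper that takes the four sizes of the contingency table, here counted in regions; the example sizes are invented for illustration.

```python
def kappa_index(S, T, D, I):
    """Kappa Index of Agreement from the sizes in the contingency table.

    S: study area size, T: true cluster size, D: detected cluster size,
    I: size of the intersection of the detected and true clusters.
    Undefined (division by zero) in the degenerate case E == 1.
    """
    observed = (I + (S - T - D + I)) / S              # O: diagonal proportion
    expected = (D * T + (S - D) * (S - T)) / S ** 2   # E: chance agreement
    return (observed - expected) / (1 - expected)


# Hypothetical example with 159 regions and a true cluster of 7 regions.
perfect = kappa_index(S=159, T=7, D=7, I=7)    # detected cluster equals truth
partial = kappa_index(S=159, T=7, D=10, I=5)   # 5 of 7 found, 5 false regions
disjoint = kappa_index(S=159, T=7, D=7, I=0)   # no overlap at all
```

The three cases show how the index rewards overlap with the true cluster and penalizes both missed regions and falsely included regions.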

With the highest KIA value for each scenario (column in the table) in bold, Table 2.5
indicates that all methods identify the correct boundaries of a cluster equally well or better
when there is a relatively low level of disease cases in the study region (L_Cas). As α1
increases from 0.05 to 0.4, the cluster shape whose boundaries the RSScan and Tango's methods
identify best shifts from line (Line) to round (Round). The boundaries of trifurcate clusters
(Tri) are difficult for both methods to identify correctly. Assunção et al.'s method performs
relatively better for trifurcate clusters (Tri) than for the other shapes, and the CSScan method
performs well for round clusters (Round).

Figure 2.5 shows the average KIA value for each method considering all scenarios. The
figure indicates that the RSScan method detects the boundaries of clusters of various shapes
better than the other three methods, with its performance peaking when α1 is 0.2 (0.614). The
performance of Tango's method peaks when α1 is 0.4, where its KIA value is similar to that of the
CSScan method (about 0.47). Assunção et al.'s method has a relatively low KIA value (0.435),
possibly because many low-risk regions are included in the cluster estimates.

Table 2.5. KIAs between the most likely clusters and true clusters for four spatial scan methods

                       Number of Cases    Cluster Shape
α1     Method          H_Cas    L_Cas     Line     Round    Tri      Average
0.05   RSScan          0.506    0.526     0.598    0.511    0.438    0.516
0.05   Tango's         0.365    0.373     0.47     0.354    0.283    0.369
0.1    RSScan          0.571    0.581     0.661    0.603    0.464    0.576
0.1    Tango's         0.391    0.397     0.498    0.386    0.298    0.394
0.2    RSScan          0.601    0.628     0.683    0.667    0.492    0.614
0.2    Tango's         0.416    0.426     0.499    0.425    0.338    0.421
0.3    RSScan          0.56     0.599     0.612    0.638    0.489    0.58
0.3    Tango's         0.441    0.457     0.493    0.48     0.374    0.449
0.4    RSScan          0.506    0.546     0.527    0.571    0.481    0.526
0.4    Tango's         0.47     0.475     0.493    0.548    0.377    0.473
1      Assunção's      0.424    0.445     0.383    0.444    0.477    0.435
1      CSScan          0.468    0.481     0.457    0.577    0.391    0.475

Figure 2.5. Average KIAs of four spatial scan methods

2.4.2.3 Non-cluster Scenario Results

For the non-cluster scenario, Table 2.6 shows that, on average, all methods detected
significant clusters in about 5% or less of the 500 non-cluster replications. Given the
significance level of 0.05 used for these tests, the results indicate that all methods control the
Type I error well.

[Figure 2.5 plot: KIA (vertical axis, 0.35 to 0.65) against screening level α1 (horizontal axis: 0.05, 0.1, 0.2, 0.3, 0.4, 1) for RSScan, Tango's, Assunção's, and CSScan]

Table 2.6. Average Type I error of four spatial scan methods

α1      RSScan   Tango's   Assunção's   CSScan
0.05    0.04     0.05      -            -
0.1     0.044    0.046     -            -
0.2     0.035    0.044     -            -
0.3     0.043    0.045     -            -
0.4     0.048    0.041     -            -
1       -        -         0.046        0.042

2.5 Application: Georgia Lung Cancer, 1998-2005

Based on the above experimental results, the RSScan method with an appropriate screening
level α1 was usually found to be more capable than the other three methods of detecting
significant clusters and identifying the boundaries of clusters of arbitrary shape. A value of 0.2
could be recommended as the default α1.

We use the RSScan method to detect the cluster of lung cancer diagnosed in GA in the
period of 1998-2005. The health data from the Georgia Comprehensive Cancer Registry show that
lung cancer cases in GA from 1998 to 2005 total 42,521, of which 25,615 are male cases
and 16,906 are female cases. The expected number of cases for county i is calculated based on
the GA population in 2000 (Figure 2.2) and adjusted for age and sex.

Figure 2.6 shows the standardized incidence ratio (SIR) for each county in GA and the
cluster detected using the RSScan method with screening level α1 = 0.2. The detected cluster is
located in north-western GA and comprises eight counties: Bartow, Gordon, Haralson,
Murray, Polk, Walker, Whitfield, and Paulding. The p-value of the cluster is 0.002, and a total of
3,177 cases occurred within the cluster area during that time. The SIR of the cluster is 1.31.

Figure 2.6. SIRs and the detected cluster of lung cancer incidence in GA, 1998-2005

2.6 Discussion and Conclusions

It should be noted that the performances of both the RSScan method and the other three

methods vary under different situations such as counts of disease incidence cases and cluster

shapes. This finding corresponds well with the power analysis given by Waller and Gotway

(2004) that most tests to detect clusters have spatially heterogeneous power. The high estimated

power in the experiment indicates that these methods could be well suited to exploratory

studies that flag suspicious areas for further investigation. However, the relatively low KIA

values indicate that these methods may be inappropriate for the applications which require

accurate boundaries of clusters, such as the analysis of the change of spatial clusters over time. In

order to gain deeper insight into spatio-temporal disease risk patterns, disease risk modeling,

such as spatio-temporal multilevel models, may be a better approach.

Tango's restricted likelihood ratio has good interpretability and strong power in detecting
disease clusters with a circular scan window (Tango 2008). To our knowledge, however, no
previous work has discussed its performance in detecting clusters of arbitrary shape with other
search strategies. For the first time, this study implements and tests the restricted likelihood ratio
combined with Assunção et al.'s dMST search strategy to quickly detect disease clusters of
arbitrary shape. In order to understand the performance of this redesigned hybrid method in
various situations, more cluster models than in Tango (2008) and Assunção et al. (2006) were
designed for this performance test: six cluster models and two non-cluster
scenarios. These cluster models consider different numbers of disease cases in a study area and
various shapes of clusters. The choice of the screening level α1 in the restricted likelihood ratio is
also explored when combined with Assunção et al.'s dMST search strategy in the RSScan
method. Besides the metric of power, this study proposes using KIA to evaluate and compare
how well cluster detection methods identify the boundaries of clusters, in order to avoid
the effects of different cluster model properties. Finally, the RSScan
method was applied to detect the cluster of lung cancer in Georgia between 1998
and 2005.

The experimental results indicate that the RSScan method with an appropriate screening
level α1 is generally as capable as, or more capable than, Tango's method, Assunção et al.'s
method, and Kulldorff's CSScan method at quickly detecting statistically significant
disease clusters and identifying their boundaries under the same situation, especially for clusters
of irregular shape. Based on the results of this study, 0.2 is recommended as the default screening
level α1 in the RSScan method.

References

Anselin, L., 1995. Local indicators of spatial association-LISA. Geographical Analysis, 27 (2), 93-115.

Assunção, R., Costa, M., Tavares, A. & Ferreira, S., 2006. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine, 25 (5), 723-742.

Besag, J. & Newell, J., 1991. The detection of clusters in rare diseases. Journal of the Royal Statistical Society. Series A (Statistics in Society), 154 (1), 143-155.

De Smith, M., Goodchild, M. & Longley, P., 2007. Geospatial analysis: A comprehensive guide to principles, techniques and software tools: Troubador Publishing.

Duczmal, L. & Assunção, R., 2004. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis, 45 (2), 269-286.

Duczmal, L., Kulldorff, M. & Huang, L., 2006. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics, 15 (2), 428-442.

Dwass, M., 1957. Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics, 28 (1), 181-187.

Goovaerts, P., 2010. Geostatistical analysis of county level lung cancer mortality rates in the southeastern United States. Geographical Analysis, 42 (1), 32-52.

Jacquez, G. & Greiling, D., 2003. Local clustering in breast, lung and colorectal cancer in Long Island, New York. International Journal of Health Geographics, 2 (1), 3.

Kulldorff, M., 1997. A spatial scan statistic. Communications in Statistics-Theory and Methods, 26 (6), 1481-1496.

Kulldorff, M., Huang, L., Pickle, L. & Duczmal, L., 2006. An elliptic spatial scan statistic. Statistics in Medicine, 25 (22), 3929-3943.

Kulldorff, M. & Information Management Services Inc., 2010. SaTScan v9.1: Software for the spatial and space-time scan statistics. http://www.satscan.org/

Kulldorff, M. & Nagarwalla, N., 1995. Spatial disease clusters - detection and inference. Statistics in Medicine, 14 (8), 799-810.

Kulldorff, M., Tango, T. & Park, P.J., 2003. Power comparisons for disease clustering tests. Computational Statistics & Data Analysis, 42 (4), 665-684.

Lawson, A., 2006. Statistical methods in spatial epidemiology, 2nd ed. Chichester, England ; Hoboken, NJ: Wiley.

Maheswaran, R. & Craglia, M., 2004. GIS in public health practice. Boca Raton: CRC Press.

Openshaw, S., Charlton, M., Wymer, C. & Craft, A., 1987. A mark 1 geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information Systems, 1 (4), 335-358.

Rogerson, P. & Yamada, I., 2009. Statistical detection and surveillance of geographic clusters. Boca Raton: CRC Press.

Takahashi, K., Yokoyama, T. & Tango, T., 2010. FleXScan v3.1: Software for the flexible scan statistic. http://www.niph.go.jp/soshiki/gijutsu/download/flexscan/index.html

Tango, T., 2008. A spatial scan statistic with a restricted likelihood ratio. Japanese Journal of Biometrics, 29 (2), 75-95.

Tango, T. & Takahashi, K., 2005. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics, 4, 11-15.

Turnbull, B.W., Iwano, E.J., Burnett, W.S., Howe, H.L. & Clark, L.C., 1990. Monitoring for clusters of disease - application to leukemia incidence in upstate New York. American Journal of Epidemiology, 132 (1), S136-S143.

Waller, L. & Gotway, C., 2004. Applied spatial statistics for public health data: Wiley-Interscience.

Yiannakoulias, N., Rosychuk, R.J. & Hodgson, J., 2007. Adaptations for finding irregularly shaped disease clusters. International Journal of Health Geographics, 6 (1), 28.

CHAPTER 3

HIERARCHICAL BAYESIAN MODELING OF THE SPATIO-TEMPORAL PATTERNS OF

LUNG CANCER INCIDENCE RISKS IN GEORGIA, 2000-2007²

² Yin, P., Mu, L., Madden, M. and Vena, J. To be submitted to International Journal of Health

Geographics.

Abstract

Lung cancer is the second most commonly diagnosed cancer in men and women in

Georgia. However, the related studies about the patterns of lung cancer in Georgia at a fine

spatio-temporal scale are very limited. In this study, hierarchical Bayesian models are used to

explore the spatio-temporal patterns of lung cancer incidence risks by race and sex in Georgia for

the period of 2000 to 2007. With the census tract level as the spatial scale and the two-year

period aggregation as the temporal scale, we propose and compare a total of seven Bayesian

spatio-temporal models including two under the separate modeling framework and five models

under the joint modeling framework. One of these models is finally chosen and its results clearly

show that the northwest region of Georgia has stably elevated lung cancer incidence risks for all

population groups during the study period. Showing more detailed and reliable variations of the

lung cancer incidence risks in space and time, our study aims to better support assessing

healthcare performance, establishing etiological hypotheses, and making effective and efficient

health policies. In addition, our study shows that there are strong inverse relationships between

the socioeconomic status (SES) and the lung cancer incidence risk in Georgia males, especially

white males, and weak inverse relationships in both white and black Georgia females. The study

results are expected to lead to further studies; in particular, the spatial and temporal random effects

in the models may provide some implications about the potential disease risk factors for further

ecological studies. The limitations of this study, including the lack of smoking data and

population estimation error, are also discussed at the end.

Keywords: Bayesian model, Spatio-temporal pattern, Lung cancer, Socioeconomic status,

Georgia

3.1 Introduction

Lung cancer is not only the second most commonly diagnosed cancer in men and women,

but also the leading cause of cancer-related death in Georgia in the United States (Georgia

Department of Public Health 2008). However, as far as we know, lung cancer studies in

Georgia are very few, and most of them focus on descriptive analyses using crude rates at

a coarse spatio-temporal scale, such as the 5-year incidence rates at the health district or county

level. Such analytical results usually obscure the detailed variations of lung cancer risks in space

and time, and could introduce inferential biases on etiological hypotheses. In addition, they can

only provide limited help for healthcare performance assessment and health policy making to

improve the efficiency of interventions and the distribution of resources.

The small number problem is one of the challenges for mapping lung cancer risk at a fine

spatio-temporal scale. For rare diseases such as cancers, the total counts of cases could become

very sparse at some fine spatio-temporal scales, especially when more demographic dimensions

are also considered, such as sex, age, and race. With such sparse counts,

some traditional estimates of disease risk or relative risk, such as the Standardized Incidence

Ratio (SIR), could become unreliable and may lead to a large misunderstanding of the true

disease risk due to high sampling variability. Recently, hierarchical Bayesian models have been

widely used to map disease risk spatially or spatio-temporally (Bernardinelli et al. 1995, Waller

et al. 1997, Xia and Carlin 1998, Knorr-Held 2000, Mollié 2001, Wakefield et al. 2001, Best et

al. 2005, Richardson et al. 2006, Abellan et al. 2008, Lawson 2009, Fortunato et al. 2011). For

sparse count data, integrating both data fit and subjective prior information allows Bayesian

models to mitigate the inferential biases of frequentist methods that depend entirely on

data fit. In addition, it is easy to develop model-based spatial and spatio-temporal smoothing

methods under the Bayesian framework that not only consider the effects of disease risk factors,

but also borrow strength from neighboring areas and/or time periods.

In this study, we use hierarchical Bayesian models to explore the spatio-temporal
patterns of lung cancer incidence risks in Georgia. The analyses are conducted for four
population groups stratified by sex and race at the census tract level over four two-year periods
from 2000 to 2007. A total of seven spatio-temporal models under two modeling frameworks are
proposed and compared. One framework models the relative risks (RRs) of each population
group separately, and the other jointly models the RRs of all population groups
under the assumption that some common disease risk factors exist in all population groups. One
model is finally chosen based on a model selection criterion and its results are interpreted. The
aim of the study is to obtain reliable spatio-temporal patterns of lung cancer incidence risks by
sex and race in Georgia at a fine scale, which are expected to identify the spatio-temporal
hot-spots of the disease risks of a specific population group for further study and to facilitate
related health policy making in Georgia. In addition, the effects of area-based socioeconomic
status (SES) on the lung cancer incidence risks in each population group are evaluated in the
modeling. Understanding the socioeconomic gradients in lung cancer incidence risks by
race and sex could suggest how to reduce lung cancer disparities in
Georgia. This paper is organized as follows. In the next section, the study area and data
sources are described. Then, the method for population estimation, the area-based SES measure,
and the seven Bayesian spatio-temporal models under the two modeling frameworks are
explained. Next, the modeling results and discussions are given, followed by some conclusions.

3.2 Study Area and Data

Our study area is the state of Georgia with 1,618 census tracts in 2000. Figure 3.1 shows
the distribution of population density by census tract in Georgia in 2000. The 10 most populous
cities in 2000 are also shown in this map. The population is mainly concentrated
in the northern region of Georgia, especially in the metropolitan Atlanta area that includes the
cities of Atlanta, Sandy Springs, Roswell, and Marietta. All of the population data and
socioeconomic data come from the U.S. Census.

Figure 3.1. Population density by census tract and the 10 most populous cities in Georgia 2000

The lung cancer data (primary site codes from C340-C349 in ICD-O-3) are extracted

from the Georgia Comprehensive Cancer Registry (Georgia Department of Public Health 2011).

A total of 44,671 lung cancer cases were diagnosed in Georgia from 2000 to 2007. In this study, we

only consider the cases among white and black individuals over 20 years old, which total

43,504. A total of 3,219 cases are excluded from the analyses because their spatial accuracy is

coarser than the census tract level. Therefore, 40,285 cases are finally included and aggregated

to the 1,618 census tracts in the geography of the Census 2000. Table 3.1 shows the total number

of cases of individuals over 20 years old and the percentage of included cases in the analyses by

sex and race. We can see that the lowest percentage of included cases is 89.81% for black males.

Table 3.1. Total number of cases of individuals over 20 years old and the percentage of included cases in the analyses by sex and race

          White                                Black
          Total cases    Included cases (%)    Total cases    Included cases (%)
Male      20,547         90.59                 5,557          89.81
Female    14,882         91.36                 3,362          91.73

To avoid a high level of sparseness while keeping the temporal dimension, cases are
aggregated to four two-year periods (2000-2001, 2002-2003, 2004-2005, and 2006-2007) for the
analyses. The average number of cases per census tract per two-year period is 2.9 for white
males, 2.1 for white females, 0.77 for black males, and 0.48 for black females. The expected
numbers of cases by census tract, two-year period, sex, and race are calculated from
reference rates, which are the average age-specific incidence rates by sex and race across the whole
of Georgia over the period 2000-2007. In the calculation of the reference rates, a total of 10
age groups are considered: 20-39, 40-49, seven five-year age groups
from 50 to 84, and 85 and over.
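The calculation of expected cases from age-specific reference rates can be sketched as follows for one population group. The function names are hypothetical, the person-years inputs are an assumption about how the population at risk is accumulated over the years, and the example uses two age groups instead of ten to keep the numbers easy to follow.

```python
def reference_rates(state_cases_by_age, state_person_years_by_age):
    """Age-specific reference rates for one sex-race group (cases per person-year)."""
    return [c / py for c, py in zip(state_cases_by_age, state_person_years_by_age)]


def expected_cases(rates, tract_person_years_by_age):
    """Expected cases in one tract and period: sum over age groups of rate x person-years."""
    return sum(r * py for r, py in zip(rates, tract_person_years_by_age))


def standardized_incidence_ratio(observed, expected):
    """SIR: observed cases divided by expected cases."""
    return observed / expected


# Hypothetical two-age-group example.
rates = reference_rates([10, 40], [10000, 20000])    # 0.001 and 0.002
tract_expected = expected_cases(rates, [500, 250])   # 0.5 + 0.5
tract_sir = standardized_incidence_ratio(2, tract_expected)
```

With expected counts this small, a single additional case changes the SIR substantially, which illustrates the small number problem that motivates the Bayesian models.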

3.3 Methods

3.3.1 Population Estimation for Intercensal Years

The population at risk is important to accurately calculate expected cases and estimate

disease risk. However, the census population data at the tract level are only available for

census years (e.g., 2000 and 2010). In addition, the geographic boundaries of census tracts

change with each census. For example, there are a total of 1,618 tracts in Census 2000, while a

total of 1,969 tracts exist in Census 2010. At the county level, the Census Bureau (Population

Estimates Program 2011) provides the estimates of population by race, sex and age for each

intercensal year. In this study, the boundaries of census tracts in 2000 are used as the standard

geography for the whole study period. With the census population data currently available, one

of the interpolation methods proposed by Best and Wakefield (1999) is used to estimate the

population by race, sex and age at the census tract level for each intercensal year.

The steps of the population estimation are as follows. First, we use the overlay function in the geographic information system (GIS) ArcGIS™ (ESRI, Inc.) and the areal weighting interpolation method (Goodchild and Lam 1980) to estimate the 2010 population on the geography of the 2000 census tracts. To improve accuracy, we use the 2010 population data at the block level instead of the tract level, since blocks are at a finer spatial scale. Then, for each population group by race, sex and age in a county, we assume that the population $N$ is multinomially distributed among the census tracts in that county with a vector of apportionment probabilities $\mathbf{p} = (p_1, \ldots, p_I)^T$, where $I$ denotes the number of census tracts in that county and $p_i$ is the proportion of the county population $N$ that lives in census tract $i$. The probabilities $\mathbf{p}$ for each intercensal year are estimated via simple linear interpolation between the two censuses (i.e., 2000 and 2010).
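The interpolation step can be sketched as below. This is an illustrative reading of the apportionment method, assuming that the tract population for an intercensal year is taken as the multinomial expectation $N p_i$; the function names and the three-tract county are hypothetical.

```python
def interpolate_shares(shares_2000, shares_2010, year):
    """Linearly interpolate tract apportionment probabilities between censuses."""
    w = (year - 2000) / (2010 - 2000)
    shares = [(1 - w) * p0 + w * p1 for p0, p1 in zip(shares_2000, shares_2010)]
    total = sum(shares)
    return [p / total for p in shares]  # renormalize to guard against rounding drift


def apportion(county_population, shares):
    """Expected tract populations under the multinomial assumption: N * p_i."""
    return [county_population * p for p in shares]


# Hypothetical county with three tracts (2000 geography) and one population group.
shares_2000 = [0.50, 0.30, 0.20]
shares_2010 = [0.40, 0.30, 0.30]   # 2010 blocks already re-aggregated to 2000 tracts

shares_2004 = interpolate_shares(shares_2000, shares_2010, 2004)
tract_pop_2004 = apportion(12000, shares_2004)  # 12000: hypothetical county estimate
print(shares_2004, tract_pop_2004)
```

The county total for each intercensal year would come from the county-level estimates, so only the within-county shares are interpolated.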


Based on the population estimates, the reference rates for all population groups are then calculated. Using the U.S. 2000 standard population for direct standardization, the age-adjusted annual lung cancer incidence rates (per 100,000 population aged 20 and over) in Georgia for 2000-2007 are 132.7 for white males, 75.3 for white females, 135.2 for black males, and 54.5 for black females.

3.3.2 Area-based SES Measure

Because census tracts are relatively homogeneous, an area-based SES measure at the census tract level can be a good surrogate for individual SES in a health study when individual SES is unavailable (Krieger 1992). Detailed discussions of area-based SES measures can be found in the literature

(Krieger et al. 1997, Carstairs 2001, Krieger et al. 2002, Darden et al. 2009). Various single

variable or composite measures can capture different aspects of socioeconomic characteristics. In

this study, we use the modified Darden-Kamel Composite Index (Darden et al. 2009) to measure

the SES at the census tract level, and evaluate its relationships with the lung cancer incidence

risks by race and sex. The modified Darden-Kamel Composite Index is the average Z score of nine socioeconomic variables from U.S. census data (Table 3.2).

Table 3.2. Variables incorporated in the modified Darden-Kamel Composite Index

Modified Darden-Kamel Composite Index
1. Percentage of residents with university degrees
2. Median household income
3. Percentage of managerial and professional positions
4. Median value of dwelling
5. Median gross rent of dwelling
6. Percentage of homeownership
7. Percentage below poverty
8. Unemployment rate
9. Percentage of households with vehicle


Based on Census 2000 data, the modified Darden-Kamel Composite Index is calculated for each census tract in Georgia; the values range from -31.05 to 24.77, and a larger value means a higher SES. Based on the index, the census tracts are divided into five SES groups (quintiles) with equal numbers of census tracts. Group 1 has the highest SES and group 5 has the lowest. Figure 3.2 shows the distribution of SES by census tract. The higher-SES tracts are concentrated mainly in the large cities in Georgia.
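A composite Z-score index and its quintile grouping can be sketched as below. This is only an illustration: the exact scaling of the modified Darden-Kamel index is not given here, and the sign reversal for "negative" indicators (poverty, unemployment) is an assumption, as are the function names and the toy data.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one variable across tracts."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite_index(variables, reverse=()):
    """Average z-score per tract; variables named in `reverse` are sign-flipped
    (assumption: higher poverty or unemployment should lower the index)."""
    columns = []
    for name, values in variables.items():
        sign = -1.0 if name in reverse else 1.0
        columns.append([sign * z for z in z_scores(values)])
    return [mean(row) for row in zip(*columns)]


def quintile_groups(index):
    """Group 1 = highest index values, group 5 = lowest, with equal counts."""
    order = sorted(range(len(index)), key=lambda i: index[i], reverse=True)
    groups = [0] * len(index)
    for rank, i in enumerate(order):
        groups[i] = rank * 5 // len(index) + 1
    return groups


# Toy data for ten hypothetical tracts and two of the nine variables.
variables = {
    "median_income": [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    "pct_poverty":   [30, 28, 26, 24, 22, 20, 18, 16, 14, 12],
}
index = composite_index(variables, reverse=("pct_poverty",))
groups = quintile_groups(index)
print(groups)
```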

Figure 3.2. Quintile map of SES in Georgia 2000


3.3.3 Bayesian Spatio-temporal Models

Bayesian models have naturally hierarchical structures. At the first level, the number of observed cases $y_{itk}$ for census tract $i = 1, \ldots, 1618$, time period $t = 1, \ldots, 4$, and population group by race and sex $k = 1, \ldots, 4$ is assumed to follow a Poisson distribution with mean $E_{itk} R_{itk}$, where $E_{itk}$ and $R_{itk}$ are, respectively, the known expected number of cases and the unknown RR relative to the corresponding reference risk (measured by the reference rate of the specific population group) in census tract $i$, time period $t$ and population group $k$. At the second level, the logarithms of the RRs are decomposed into fixed effects for measured risk factors such as SES, and random effects for unmeasured or unobserved risk factors. In Bayesian spatio-temporal models, three random effects are usually considered: a spatial random main effect, a temporal random main effect and a spatio-temporal interaction random effect. Both the spatial and the temporal random main effects can be further divided into a structured component and an unstructured component, which reflect the dependent and the heterogeneous variations of risks in space and time, respectively. In the Bayesian paradigm, prior distributions must be assigned to the model parameters and the random effects. Inferences are then made from the posterior distributions of the parameters and random effects, which are derived from simulations.
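The two levels described above can be written compactly as follows. This is a generic sketch: the placeholder symbols $S_{ik}$, $T_{tk}$ and $ST_{itk}$ for the spatial, temporal and interaction random effects are introduced here only for illustration, and the model-specific terms are those listed in Table 3.3.

```latex
\begin{align*}
  y_{itk} \mid R_{itk} &\sim \mathrm{Poisson}\left(E_{itk} R_{itk}\right), \\
  \log R_{itk} &= \underbrace{\alpha_k + \boldsymbol{\beta}_k^{T}\mathbf{x}_i}_{\text{fixed effects}}
               + \underbrace{S_{ik} + T_{tk} + ST_{itk}}_{\text{random effects}} .
\end{align*}
```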

In this study, we model the RR of each population group individually under two

modeling frameworks. The first framework is separate modeling where each population group

has an independent set of random effects. The second framework is joint modeling where there

are shared random effects representing some common unmeasured or unknown risk factors

among all the population groups. This joint modeling framework has been used to map one

disease for multiple population groups or multiple diseases that have common risk factors

(Knorr-Held and Best 2001, Held et al. 2005, Richardson et al. 2006, Downing et al. 2008). We


compare a total of seven models including two separate models and five joint models. Table 3.3

shows the components of the logarithms of RRs in each model.

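Table 3.3. Components of logarithms of RRs in the seven Bayesian spatio-temporal models

Model type   Model     Logarithm of RR
Separate     Model 1   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^T \mathbf{x}_i + \lambda_{ik} + \xi_{tk}$
Separate     Model 2   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^T \mathbf{x}_i + \lambda_{ik} + \xi_{tk} + \upsilon_{itk}$
Joint        Model 3   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^T \mathbf{x}_i + \delta_{1,k}\phi_i + \delta_{2,k}\varsigma_t + \omega_{itk}$
Joint        Model 4   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^T \mathbf{x}_i + \delta_{1,k}\phi_i + \delta_{2,k}\varsigma_t + \lambda_{ik} + \xi_{tk}$
Joint        Model 5   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^T \mathbf{x}_i + \delta_{1,k}\phi_i + \delta_{2,k}\varsigma_t + \theta_{it} + \lambda_{ik} + \xi_{tk}$
Joint        Model 6   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^T \mathbf{x}_i + \delta_{1,k}\phi_i + \delta_{2,k}\varsigma_t + \lambda_{ik} + \xi_{tk} + \omega_{itk}$
Joint        Model 7   $\log(R_{itk}) = \alpha_k + \boldsymbol{\beta}_k^T \mathbf{x}_i + \delta_{1,k}\phi_i + \delta_{2,k}\varsigma_t + \theta_{it} + \lambda_{ik} + \xi_{tk} + \omega_{itk}$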

In each model, $\alpha_k$ is the overall log-RR for population group $k$ across the whole study area over the whole study period, and $\boldsymbol{\beta}_k$ is the vector of coefficients associated with the SES group vector $\mathbf{x}_i$ for population group $k$. The seven models differ in their random-effect components. Separate models 1 and 2 both have a spatial random main effect $\lambda_{ik}$ for population group $k$ in census tract $i$ and a temporal random main effect $\xi_{tk}$ for population group $k$ in time period $t$. Model 2 also considers the spatio-temporal interaction $\upsilon_{itk}$ in census tract $i$ and time period $t$ for population group $k$. In addition to population-group-specific random effects like those in separate models 1 and 2, joint models 3-7 also consider random effects shared across the four population groups by race and sex. In these shared components, $\phi_i$ represents the shared spatial component in census tract $i$, and $\varsigma_t$ represents the shared temporal component in time period $t$. The coefficients $\delta_{1,k}$ and $\delta_{2,k}$ are scaling parameters that allow the shared spatial and temporal components to have different gradients in different population groups. In models 5 and 7, a shared spatio-temporal interaction $\theta_{it}$ is also considered. With respect to the population-group-specific random effects, model 3 considers only a spatio-temporal interaction random effect $\omega_{itk}$ for population group $k$, and models 4 and 5 consider only the specific spatial and temporal random main effects $\lambda_{ik}$ and $\xi_{tk}$. In models 4-7, we set $\lambda_{ik}$ and $\xi_{tk}$ equal to 0 for white males ($k = 1$), so that these two components for the other population groups ($k = 2$, 3 and 4) are the differentials of the spatial and temporal random main effects between that population group and white males.

Some early experiments show that considering only structured components in the spatial and temporal random main effects gives better modeling results than considering both structured and unstructured components. Therefore, the widely used intrinsic Gaussian conditional autoregressive (CAR normal) prior proposed by Besag et al. (1991) is assigned to the spatial random main effects $\lambda_{ik}$ and $\phi_i$ and to the temporal random main effects $\xi_{tk}$ and $\varsigma_t$ to represent the dependent variations of RRs over space and time. For a spatial random effect in an area, the CAR normal prior specifies that its conditional distribution, given all other spatial effects, is a normal distribution with mean equal to the average of the spatial effects of its neighboring areas and variance inversely proportional to the number of these neighbors. In this study, two census tracts are defined as spatial neighbors if they share a border or a vertex. For a temporal random effect in a time period, the CAR normal prior smooths it towards the temporal effects of its temporal neighbors (i.e., the previous and the next time periods).
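For reference, the conditional distributions implied by this description can be written as below, shown for the shared components. The sketch assumes binary (0/1) neighborhood weights; the symbols $\partial_i$ (the set of neighbors of tract $i$) and $n_i$ (its size) are notation introduced here, and $\tau_\phi$, $\tau_\varsigma$ denote the corresponding precision parameters.

```latex
\begin{align*}
  \phi_i \mid \phi_{-i} &\sim \mathrm{N}\!\left(\frac{1}{n_i}\sum_{j \in \partial_i}\phi_j,\;
                           \frac{1}{n_i\,\tau_\phi}\right), \\
  \varsigma_t \mid \varsigma_{-t} &\sim
  \begin{cases}
    \mathrm{N}\!\left(\varsigma_{2},\; 1/\tau_\varsigma\right), & t = 1, \\[4pt]
    \mathrm{N}\!\left(\tfrac{1}{2}\left(\varsigma_{t-1} + \varsigma_{t+1}\right),\;
                      1/(2\tau_\varsigma)\right), & t = 2, 3, \\[4pt]
    \mathrm{N}\!\left(\varsigma_{3},\; 1/\tau_\varsigma\right), & t = 4 .
  \end{cases}
\end{align*}
```

The temporal case is the same construction with each period's neighbors being the adjacent periods, so the first and last periods have a single neighbor.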

Due to the lack of strong prior knowledge, vague prior distributions based on the current literature are used for the other parameters in the models. We assign a flat prior to the overall log-RR terms $\alpha_k$, and independent Normal(0, $10^5$) prior distributions to the fixed effects $\boldsymbol{\beta}_k$. The logarithms of the scaling parameters $\delta_{1,k}$ and $\delta_{2,k}$ are assigned independent Normal(0, 5) prior distributions (Downing et al. 2008). With respect to the spatio-temporal interaction random effects, independent normal prior distributions with mean 0 and precisions $\tau_{\upsilon k}$, $k = 1, \ldots, 4$, are assigned to $\upsilon_{itk}$ in model 2 for each population group; independent normal prior distributions with mean 0 and precision $\tau_\theta$ are assigned to $\theta_{it}$ in models 5 and 7; and a multivariate normal prior distribution with covariance matrix $\Sigma$ is assigned to $\omega_{itk}$ in models 3, 6 and 7 to allow correlations among the population groups (Richardson et al. 2006, Downing et al. 2008). Following previous studies (Kelsall and Wakefield 1999, Best et al. 2005, Downing et al. 2008), independent conjugate Gamma(0.5, 0.0005) hyperpriors are assigned to all of the precision parameters in the normal priors, both for the shared components ($\tau_\phi$, $\tau_\varsigma$, $\tau_\theta$) and for the population-group-specific components ($\tau_{\lambda k}$, $\tau_{\xi k}$, $\tau_{\upsilon k}$, $k = 1, \ldots, 4$). The covariance matrix $\Sigma$ in the multivariate normal prior is assigned a Wishart($Q$, 4) distribution, where $Q$ is a diagonal matrix with diagonal entries of 0.01 (Richardson et al. 2006).

All of the models are constructed and run using WinBUGS software (Lunn et al. 2000).

For each model, two independent chains are run. The first 50,000 iterations are discarded as

burn-in to make sure inferences can be made based on converged simulations of the models.

Then, 10,000 further iterations are run for each chain and every 10th is kept for inference. Therefore, the modeling results are based on a thinned sample of 2,000 (1,000 per chain). Brooks-Gelman-Rubin diagnostics (Brooks and Gelman 1998) and visual checks are used to assess convergence.
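The sample bookkeeping and the idea behind the convergence check can be sketched as follows. This is an illustration only: it uses the classic variance-based potential scale reduction factor rather than the interval-based variant that WinBUGS plots, and the function names and toy chains are hypothetical.

```python
from statistics import mean, variance


def kept_samples(n_chains, n_iterations, thin):
    """Number of retained draws after thinning, pooled over chains."""
    return n_chains * (n_iterations // thin)


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one parameter,
    given equal-length post-burn-in chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)   # average within-chain variance
    between = n * variance(chain_means)          # between-chain variance
    pooled = (n - 1) / n * within + between / n  # pooled variance estimate
    return (pooled / within) ** 0.5


print(kept_samples(n_chains=2, n_iterations=10_000, thin=10))

# Toy chains: the second pair starts far apart and has clearly not mixed.
mixed = gelman_rubin([[1, 2, 3, 4], [1, 2, 3, 4]])
unmixed = gelman_rubin([[1, 2, 3, 4], [11, 12, 13, 14]])
print(mixed, unmixed)
```

Values close to 1 suggest that the chains are sampling the same distribution; values well above 1 indicate that more burn-in is needed.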

As in the joint mapping of male and female lung cancer risks by Richardson et al. (2006), the scaling parameters $\delta_{2,k}$ are difficult to bring to convergence when the models are fitted. This may be because four time periods do not provide enough information to differentiate the shared and specific temporal patterns. We therefore fix $\delta_{2,k} = 1$ for all joint models.

We use the deviance information criterion (DIC) to compare the seven models and choose the best one to interpret. The DIC was proposed by Spiegelhalter et al. (2002) as the sum of $\bar{D}$ and $p_D$, where $\bar{D}$ is the posterior mean of the deviance, which measures the goodness of fit of a model, and $p_D$ is the effective number of model parameters, which measures model complexity. The model with the smaller DIC is preferred.

3.4 Results

Table 3.4 shows that joint model 6 has the smallest DIC value (64155.6) among the seven models. Model 7 fits the data best (smallest $\bar{D}$) and model 4 is the simplest (smallest $p_D$). All of the joint models except model 3 are better than the separate models based on their DICs. In the following, we interpret the results of model 6. In model 6, both the shared and the specific components include the spatial and temporal random main effects, and the specific spatio-temporal interaction random effect is also considered.

Table 3.4. DICs of the seven models

Model type   Model     $\bar{D}$   $p_D$     DIC
Separate     Model 1   63349.2     962.636   64311.8
Separate     Model 2   63029.5     1264.91   64294.4
Joint        Model 3   62996.6     1383.51   64380.1
Joint        Model 4   63328.4     869.157   64197.6
Joint        Model 5   63099.8     1064.9    64164.7
Joint        Model 6   62908.1     1247.48   64155.6
Joint        Model 7   62904.5     1347.36   64251.9
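As a quick check of the model comparison, the DIC column can be recomputed from the other two columns of Table 3.4. The snippet below is only a sketch of that arithmetic; the dictionary layout and variable names are illustrative.

```python
# (posterior mean deviance, effective number of parameters) from Table 3.4
table_3_4 = {
    "Model1": (63349.2, 962.636),
    "Model2": (63029.5, 1264.91),
    "Model3": (62996.6, 1383.51),
    "Model4": (63328.4, 869.157),
    "Model5": (63099.8, 1064.9),
    "Model6": (62908.1, 1247.48),
    "Model7": (62904.5, 1347.36),
}

# DIC = posterior mean deviance + effective number of parameters
dic = {model: d_bar + p_d for model, (d_bar, p_d) in table_3_4.items()}

best_dic = min(dic, key=dic.get)                                 # preferred model
best_fit = min(table_3_4, key=lambda m: table_3_4[m][0])         # smallest deviance
simplest = min(table_3_4, key=lambda m: table_3_4[m][1])         # smallest p_D

for model in sorted(dic, key=dic.get):
    print(model, round(dic[model], 1))
```

Ranking by DIC trades fit against complexity, which is why the best-fitting model and the simplest model need not be the preferred one.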

The crude standardized incidence ratio (SIR), the ratio of the number of observed cases to the number of expected cases, is the maximum likelihood estimate of the RR in frequentist methods. For comparison, Figure 3.3 shows the spatial patterns of crude SIRs by race and sex in the first period, 2000-2001. Because of the uneven population distribution and possibly missing data, many census tracts in these SIR maps, especially for black males and black females, have zero observed cases in that time period and therefore zero SIRs. In reality, however, it is implausible that there is no disease risk in these census tracts. In addition, the SIR surfaces are clearly not smooth across the study area, since most of the SIRs fall into either the very high or the very low category.
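The instability of the crude SIR with sparse counts is easy to see numerically. In the sketch below the observed and expected counts are hypothetical, chosen to be of the same order as the average counts per tract reported earlier; the function name is illustrative.

```python
def crude_sir(observed, expected):
    """Crude standardized incidence ratio: observed / expected cases."""
    return observed / expected


# Hypothetical tracts: (observed cases, expected cases) in one two-year period.
tracts = [(0, 0.48), (1, 0.48), (2, 0.48), (3, 0.77)]
sirs = [crude_sir(obs, exp) for obs, exp in tracts]

for (obs, exp), sir in zip(tracts, sirs):
    print(f"observed={obs} expected={exp:.2f} SIR={sir:.2f}")
```

With fewer than one expected case, a single additional observed case moves the SIR from 0 to above 2, which produces the patchy very-low and very-high pattern described above.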

Figure 3.3. Maps of crude standardized incidence ratios (SIRs) by race and sex during 2000-2001


Figures 3.4-3.7 show the maps of posterior median RRs by race and sex in the four time

periods. Compared to the crude SIRs in Figure 3.3, the posterior median RRs show much

smoother spatial patterns without RRs equal to 0. For white males and white females, the high

RRs are mainly concentrated in the northwest, southeast, and middle regions of Georgia. For

black males, the high RRs are mainly concentrated in the northwest, east, and south in Georgia.

The high RRs for black females are mainly concentrated in the northwest of Georgia. Comparing

the maps of different time periods, we can see that, for white males and black males, more census tracts with moderate and low RRs emerge and the number of census tracts with high RRs decreases over time, whereas the opposite holds for white females and black females.

Following Richardson et al.'s (2004) study evaluating the sensitivity and specificity of Bayesian hierarchical disease mapping models, we use a cut-off of 0.8 on the posterior probability that an area has an RR greater than 1 to pick out the areas with truly elevated RRs. Figure 3.8 maps how many times each census tract has a truly elevated RR during the four time periods, based on the rule Prob(RR > 1) > 0.8. The frequency associated with each census tract reflects the stability of the elevated RR in that area over the whole study period.

From these maps, we can see that the northwest of Georgia and the area near Augusta have stably

elevated RRs for all population groups. White males have the largest number of census tracts

with stably elevated RRs, and black females have the smallest number. These results could be

helpful to establish some etiological hypotheses.
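The decision rule can be sketched from posterior samples as follows. This is a minimal illustration assuming that the posterior probability is estimated by the fraction of retained MCMC draws with RR above 1; the function names and the tiny sample lists are hypothetical.

```python
def exceedance_probability(rr_samples, threshold=1.0):
    """Fraction of posterior draws with RR above the threshold."""
    return sum(1 for r in rr_samples if r > threshold) / len(rr_samples)


def elevated_frequency(samples_by_period, cutoff=0.8):
    """Number of time periods in which Prob(RR > 1) exceeds the cutoff."""
    return sum(1 for s in samples_by_period
               if exceedance_probability(s) > cutoff)


# Hypothetical posterior draws of RR for one tract in the four periods.
samples_by_period = [
    [1.20, 1.30, 0.90, 0.95, 1.40],   # Prob(RR > 1) = 0.6
    [1.20, 1.30, 1.05, 1.10, 1.40],   # Prob(RR > 1) = 1.0
    [0.80, 0.90, 1.10, 0.95, 1.00],   # Prob(RR > 1) = 0.2
    [1.50, 1.20, 1.10, 1.30, 1.01],   # Prob(RR > 1) = 1.0
]
frequency = elevated_frequency(samples_by_period)
print(frequency)
```

A tract with a frequency of 4 would be flagged as stably elevated over the whole study period.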

Figure 3.4. Maps of the posterior median RRs for white males in each time period

Figure 3.5. Maps of the posterior median RRs for white females in each time period

Figure 3.6. Maps of the posterior median RRs for black males in each time period

Figure 3.7. Maps of the posterior median RRs for black females in each time period

Figure 3.8. Maps of elevated RR frequency by race and sex during 2000-2007

Figure 3.9 shows clearer spatial patterns of RRs through maps of the posterior medians of the shared spatial component and the differential spatial components. Taking white males as the reference, with a scaling parameter equal to 1 for the shared spatial component, the posterior medians of the scaling parameters for white females, black males, and black females are 0.743, 0.538, and 0.571, respectively. The white female-white male differential and the black male-white male differential are relatively flat (less contrast) across the whole area, which indicates that the pattern of the shared spatial component captures the variations of the spatial effects on RRs well for white females and black males as well as for white males. The strong contrast in the black female-white male differential reflects an obvious difference in the patterns of spatial effects on RR between white males and black females.

Figure 3.9. Maps of the posterior median of the shared spatial component and differential spatial components


Table 3.5 shows the posterior medians and 95% credible intervals (CIs) of the shared

temporal component and the differential temporal components. The shared temporal trend is flat over the first two periods and decreases slightly from 2004 onward. This trend captures the temporal trend in the RRs of black males well, but differs from those of white females and black females.

Table 3.5. Posterior median (95% CI) of the shared temporal components and differential temporal components

Time period   Shared temporal      White female-white    Black male-white     Black female-white
              component            male differential     male differential    male differential
2000-2001     1.04 (1.02, 1.07)    0.93 (0.90, 0.97)     1.01 (0.98, 1.06)    0.92 (0.86, 0.98)
2002-2003     1.04 (1.01, 1.06)    0.97 (0.94, 1.00)     1.00 (0.97, 1.04)    0.97 (0.92, 1.02)
2004-2005     0.98 (0.96, 1.00)    1.02 (0.99, 1.05)     1.00 (0.97, 1.04)    1.03 (0.98, 1.08)
2006-2007     0.95 (0.92, 0.97)    1.09 (1.05, 1.13)     0.98 (0.94, 1.02)    1.09 (1.03, 1.16)

To understand the relationships between SES and RR by race and sex, Table 3.6 shows the posterior medians of the RRs by SES quintile. The highest SES group is taken as the reference. The general trend for all population groups is that lower SES is associated with a higher RR. However, the gradients of the SES effects on the RRs in males, especially white males, are larger than those in females. That means the socioeconomic disparities in lung cancer RR are more pronounced in males in Georgia. We also note that the RRs of SES groups 2 and 3 in black females are not significantly different from that of SES group 1, since their 95% CIs include 1.

Bayesian modeling is sensitive to the choice of priors and hyperpriors. Following Downing et al.'s (2008) work, we perform a sensitivity analysis using an alternative hyperprior distribution, Gamma(1, 1), to replace Gamma(0.5, 0.0005) for the precision parameters in model 2. The Gamma(0.5, 0.0005) distribution gives the variances (the inverses of the precisions) a 99% probability of lying between 0.000151 and 6.25, with a mode at 0.00033. For the Gamma(1, 1) distribution, the 99% probability range of the variances is from 0.217 to 100 and the mode is at 0.5. Table 3.7 shows the correlations between the posterior median RRs from model 2 under the two hyperpriors. The two sets of results show good concordance in general, but the correlations for black individuals are slightly lower than those for white individuals. These differences may be due to the different degrees of sparseness of the counts between races.
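The quoted modes can be reproduced with a few lines of arithmetic. The sketch below assumes the shape-rate parameterization of the Gamma distribution (as in WinBUGS), under which the variance $1/\tau$ follows an inverse-gamma distribution with mode rate / (shape + 1). For Gamma(1, 1), which is an exponential distribution, the quoted endpoints appear to match the 1st and 99th percentiles of the precision; that reading is an assumption, not something stated in the text. Function names are illustrative.

```python
import math


def variance_mode(shape, rate):
    """Mode of 1/tau when tau ~ Gamma(shape, rate): rate / (shape + 1)."""
    return rate / (shape + 1)


def exponential_quantile(p, rate=1.0):
    """Quantile function of the exponential distribution, i.e. Gamma(1, rate)."""
    return -math.log(1 - p) / rate


mode_vague = variance_mode(0.5, 0.0005)     # Gamma(0.5, 0.0005) hyperprior
mode_alt = variance_mode(1.0, 1.0)          # Gamma(1, 1) hyperprior

# Variance endpoints implied by the 99th and 1st percentiles of the precision.
var_low = 1 / exponential_quantile(0.99)
var_high = 1 / exponential_quantile(0.01)

print(mode_vague, mode_alt, var_low, var_high)
```

The two hyperpriors therefore put their mass on very different variance scales, which is what makes the high correlations in Table 3.7 informative.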

Table 3.6. Posterior median (95% CI) of the RRs for SES quintile

SES group     White males          White females        Black males          Black females
1 (highest)   1                    1                    1                    1
2             1.28 (1.20, 1.36)    1.11 (1.04, 1.18)    1.19 (1.04, 1.36)    1.01 (0.87, 1.19)
3             1.51 (1.41, 1.62)    1.20 (1.12, 1.30)    1.42 (1.24, 1.63)    1.13 (0.96, 1.33)
4             1.58 (1.46, 1.70)    1.16 (1.07, 1.26)    1.51 (1.32, 1.72)    1.23 (1.06, 1.44)
5 (lowest)    1.76 (1.61, 1.92)    1.32 (1.20, 1.44)    1.73 (1.52, 1.98)    1.41 (1.22, 1.65)

Table 3.7. Correlations between the posterior median RRs using model 2 with two different types of hyperpriors

Time period   White males   White females   Black males   Black females
2000-2001     0.998         0.992           0.988         0.990
2002-2003     0.998         0.991           0.988         0.989
2004-2005     0.998         0.991           0.987         0.988
2006-2007     0.998         0.991           0.987         0.988

3.5 Discussions

One of the limitations in this study is the lack of suitable smoking data at the fine spatial

scale. It is well known that an individual’s smoking behavior is an important risk factor for lung

cancer. To some extent, the random effects in our hierarchical Bayesian spatio-temporal models

can approximate the total effects of unmeasured or unknown risk factors including smoking.

However, we believe that integrating suitable smoking data into the models can greatly improve

the accuracy of the models.


For the diseases with a long latency period such as cancers, lifetime exposures could be

important. In this study, we measure the area-based SES with Census 2000 data and assume they

could reflect the individual SES during the long latency period. This assumption could introduce

biases into the model inferences. In addition, the analysis of the relationship between disease RR and SES is subject to the modifiable areal unit problem (Openshaw and Taylor 1981): inferences based on analyses at the current scale and/or unit definition may not generalize to other scales and/or unit definitions.

Estimation of population in small areas has recently been an active research topic in geography and statistics. In our study, we use an apportionment method to estimate the population by race, sex and age in each census tract in each intercensal year. Improved population estimation models could greatly benefit disease mapping models.

3.6 Conclusions

Facing the fact that there are a limited number of lung cancer studies in Georgia,

especially at a fine spatio-temporal scale, we use hierarchical Bayesian models to explore the

spatio-temporal patterns of lung cancer incidence risks in Georgia for the period 2000-2007. The

study is conducted at the census tract level using a two-year time period as the temporal unit. The

fine spatial and temporal scales enable the study to show more detailed variations of lung cancer

incidence risks in space and time, which can better support healthcare performance assessment,

thereby establishing potential etiological hypotheses and making effective and efficient health

policies. Compared to the crude SIR, the Bayesian spatio-temporal model provides a more reliable estimate of disease risk at a fine spatio-temporal scale. The study also shows that there are strong inverse relationships between SES and lung cancer incidence risk in males and weaker inverse relationships in females in Georgia. This could lead to further studies on the underlying reasons, such as occupational risk factors.

A total of seven Bayesian spatio-temporal models under the separate and joint modeling

frameworks are proposed and compared. In this study, the joint models generally have better

performance than the separate models using DIC as the criterion. Currently, our study focuses primarily on mapping the patterns of disease risks. However, the spatial and temporal random effects in these disease mapping models may also suggest potential disease risk factors for further ecological studies.


References

Abellan, J.J., Richardson, S. & Best, N., 2008. Use of space-time models to investigate the stability of patterns of disease. Environmental Health Perspectives, 116 (8), 1111.

Bernardinelli, L., Clayton, D., Pascutto, C., Montomoli, C., Ghislandi, M. & Songini, M., 1995. Bayesian analysis of space-time variation in disease risk. Statistics in Medicine, 14 (21-22), 2433-2443.

Besag, J., York, J. & Mollié, A., 1991. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43 (1), 1-20.

Best, N. & Wakefield, J., 1999. Accounting for inaccuracies in population counts and case registration in cancer mapping studies. Journal of the Royal Statistical Society. Series A (Statistics in Society), 162 (3), 363-382.

Best, N., Richardson, S. & Thomson, A., 2005. A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research, 14 (1), 35.

Brooks, S.P. & Gelman, A., 1998. Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434-455.

Carstairs, V., 2001. Socio-economic factors at areal level and their relationship with health. Spatial Epidemiology, 1 (9), 51-68.

Darden, J., Rahbar, M., Jezierski, L., Li, M. & Velie, E., 2009. The measurement of neighborhood socioeconomic characteristics and black and white residential segregation in metropolitan Detroit: Implications for the study of social disparities in health. Annals of the Association of American Geographers, 100 (1), 137-158.

Downing, A., Forman, D., Gilthorpe, M., Edwards, K. & Manda, S., 2008. Joint disease mapping using six cancers in the Yorkshire region of England. International Journal of Health Geographics, 7 (1), 41.

Fortunato, L., Abellan, J.J., Beale, L., Lefevre, S. & Richardson, S., 2011. Spatio-temporal patterns of bladder cancer incidence in Utah (1973-2004) and their association with the presence of toxic release inventory sites. International Journal of Health Geographics, 10 (1), 16.

Georgia Department of Public Health, 2008. Cancer program and data summary. Atlanta, GA.

Georgia Department of Public Health, 2011. Georgia comprehensive cancer registry [online]. http://www.health.state.ga.us/programs/gccr/ [Accessed 2011].

Goodchild, M.F. & Lam, N.S., 1980. Areal interpolation: A variant of the traditional spatial problem. Geo-Processing, 1, 297-312.

Held, L., Natário, I., Fenton, S.E., Rue, H. & Becker, N., 2005. Towards joint disease mapping. Statistical Methods in Medical Research, 14 (1), 61-82.

Kelsall, J. & Wakefield, J., 1999. Discussion of 'Bayesian models for spatially correlated disease and exposure data', by Best et al. In Bernardo, J., Berger, J., Dawid, A. & Smith, A. eds. Bayesian statistics 6. Oxford, UK: Oxford University Press, 151.

Knorr-Held, L., 2000. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine, 19 (17-18), 2555-2567.

Knorr-Held, L. & Best, N.G., 2001. A shared component model for detecting joint and selective clustering of two diseases. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164 (1), 73-85.

Krieger, N., 1992. Overcoming the absence of socioeconomic data in medical records: Validation and application of a census-based methodology. American Journal of Public Health, 82 (5), 703.

Krieger, N., Chen, J.T., Waterman, P.D., Soobader, M.J., Subramanian, S. & Carson, R., 2002. Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: Does the choice of area-based measure and geographic level matter? American Journal of Epidemiology, 156 (5), 471.

Krieger, N., Williams, D.R. & Moss, N.E., 1997. Measuring social class in US public health research: Concepts, methodologies, and guidelines. Annual Review of Public Health, 18 (1), 341-378.

Lawson, A.B., 2009. Bayesian disease mapping: Hierarchical modeling in spatial epidemiology: Chapman & Hall/CRC.

Lunn, D.J., Thomas, A., Best, N. & Spiegelhalter, D., 2000. WinBUGS - a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10 (4), 325-337.

Mollié, A., 2001. Bayesian mapping of Hodgkin's disease in France. Spatial Epidemiology, 1 (9), 267-286.

Openshaw, S. & Taylor, P.J., 1981. The modifiable areal unit problem. In Wrigley, N. & Bennett, R. eds. Quantitative geography: A British view. London and Boston: Routledge and Kegan Paul, 60-69.

Population Estimates Program, 2011. County intercensal estimates (2000-2010) [online]. http://www.census.gov/popest/data/intercensal/county/county2010.html [Accessed 2012].

Richardson, S., Abellan, J. & Best, N., 2006. Bayesian spatio-temporal analysis of joint patterns of male and female lung cancer risks in Yorkshire (UK). Statistical Methods in Medical Research, 15 (4), 385.

Richardson, S., Thomson, A., Best, N. & Elliott, P., 2004. Interpreting posterior relative risk estimates in disease-mapping studies. Environmental Health Perspectives, 112 (9), 1016.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. & Van Der Linde, A., 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64 (4), 583-639.

Wakefield, J., Best, N. & Waller, L., 2001. Bayesian approaches to disease mapping. Spatial Epidemiology, 1 (9), 104-128.

Waller, L., Carlin, B., Xia, H. & Gelfand, A., 1997. Hierarchical spatio-temporal mapping of disease rates. Journal of the American Statistical Association, 607-617.

Xia, H. & Carlin, B., 1998. Spatio-temporal models with errors in covariates: Mapping Ohio lung cancer mortality. Statistics in Medicine, 17 (18), 2025-2043.

CHAPTER 4

MODULAR CAPACITATED MAXIMAL COVERING LOCATION PROBLEM FOR THE OPTIMAL SITING OF EMERGENCY VEHICLES³

³ Yin, P. and Mu, L. 2012. Applied Geography 34: 247-254. Reprinted here with permission of the publisher.


Abstract

To improve the application of the maximal covering location problem (MCLP), several

capacitated MCLP models were proposed to consider the capacity limits of facilities. However,

most of these models assume only one fixed capacity level for the facility at each potential site.

This assumption may limit the application of the capacitated MCLP. In this article, a modular

capacitated maximal covering location problem (MCMCLP) is proposed and formulated to allow

several possible capacity levels for the facility at each potential site. To optimally site emergency

vehicles, this new model also considers allocations of the demands beyond the service covering

standard. Two situations of the model are discussed: the MCMCLP-facility-constraint (FC),

which fixes the total number of facilities to be located, and the MCMCLP-non-facility-constraint

(NFC), which does not. In addition to the model formulations, one important aspect of location

modeling—spatial demand representation—is included in the analysis and discussion. As an

example, the MCMCLP is applied with Geographic Information System (GIS) and optimization

software packages to optimally site ambulances for the Emergency Medical Services (EMS)

Region 10 in the State of Georgia. The limitations of the model are also discussed.

Keywords: Modular capacitated MCLP, Spatial demand representation, GIS, Emergency vehicle


4.1 Introduction

Given a covering standard for a service, such as a distance or travel-time maximum, the

objective of the maximal covering location problem (MCLP) is to locate a fixed number of

facilities to provide the service to cover as many demands as possible. MCLP modeling, after

being put forward by Church and ReVelle (1974), has been a powerful and widely used tool in

many planning processes to optimally distribute limited resources to maximize social and

economic benefits, such as the placement of emergency warning sirens (Current and O'Kelly

1992), fire stations (Indriasari et al. 2010), distribution centers for humanitarian relief (Balcik

and Beamon 2008), health centers (Bennett et al. 1982, Verter and Lapierre 2002, Griffin et al.

2008, Ratick et al. 2009), and ecological reserves (Church et al. 1996). Among many different

versions of MCLP models that have been proposed, a basic underlying assumption is that the

facilities to be sited are uncapacitated. Under this assumption, the demand will be served as long

as it is within the service covering standard of any facility. However, this assumption of

uncapacitated facilities severely limits the application of covering models (Current and Storbeck

1988). Many service facilities have finite capacities to ensure an acceptable level of service and

spatial equity (Murray and Gerrard 1997, Liao and Guo 2008). For example, an ambulance base

can only respond to a limited number of demands within its service covering standard (e.g., 8-

min driving distance) at one time because of the availability status of the ambulances stationed at

the base. Therefore, the capacity limit—the main constraint addressed in this article—is an

important consideration in location problems, especially for the siting of emergency facilities.

Chung et al. (1983) and Current and Storbeck (1988) published two early papers dealing

with the capacitated versions of the MCLP. Both groups of authors added maximum capacity

constraints into the mathematical formulations of the MCLP to ensure that the demands allocated


to a facility will not exceed the capacity of that facility. However, these two capacitated MCLP

models only consider the allocation of the demands within the service covering standard of

facilities. Many systems, particularly public services, are typically available to all demands

within their jurisdiction. For example, even if a demand is located in an area where no

ambulances can reach the demand within a time standard, the demand must still be responded to

and be counted as part of some facility’s workload. Therefore, Pirkul and Schilling (1991)

proposed an extension of the capacitated MCLP where all demands are assigned to facilities,

regardless of whether that demand lies within the service covering standard. Such an idea of

allocating all demands to facilities is also shown in some uncapacitated MCLP models, such as

the generalized maximal covering location problem of Berman and Krass (2002). Following the

work of Pirkul and Schilling (1991), Haghani (1996) proposed a multi-objective capacitated

MCLP model where the objective function maximizes the weighted covered demand while

simultaneously minimizing the average distance from the uncovered demands to the located

facilities. He showed how to ensure the maximization of the weighted covered demand to be the

primary objective in the model by adjusting its weight in the objective function.

In all of the above capacitated MCLP models, only one fixed capacity level of the facility

is considered for each potential facility site. However, in many situations each potential

facility site could have several possible maximum capacity levels from which a facility can choose. For

example, the capacity limit of an emergency facility (e.g., ambulance base or fire station) can be

assumed to be determined by its stationed emergency vehicles (e.g., ambulances or fire trucks).

Therefore, different numbers of emergency vehicles provide a series of possible maximum

capacity levels from which the emergency facility can choose. Correia and Captivo (2003) called the

location problems with such capacity constraints modular capacitated location problems.


However, their model is an extension of the capacitated plant location problem, the objective of

which is to minimize total costs, including fixed costs and operating costs, associated with plant

and transportation costs, among others. For emergency services, the objective is often stated as

the minimization of losses to the public, which is equivalent to the maximization of benefits

(Indriasari et al. 2010). Cost is usually not the first consideration in these services. Therefore, the

capacitated MCLP is more suitable than the capacitated plant location problem for emergency

services. Although Griffin et al. (2008) considered three capability levels for each type of health

care facility in their capacitated MCLP model, the capacity levels of their facilities are not composed of

smaller units in the way that the capacity of an emergency facility is composed of the capacities of its emergency vehicles. In

addition, their model did not consider the allocation of demands outside the service covering

standard.

To apply the capacitated MCLP model to the emergency facility siting problem in which

an emergency facility could have different possible capacity levels with varied numbers of

stationed emergency vehicles, we propose an extension of the MCLP called the modular

capacitated maximal covering location problem (MCMCLP). Similar to the multi-objective

function in the model of Haghani (1996), the MCMCLP aims to maximize the weighted covered

demand while simultaneously minimizing the average distance from the uncovered demands to

the located facilities.

The remainder of this article is organized as follows: In the next section, the concepts,

formulations, and related issues of the MCMCLP are introduced and discussed in terms of two

situations. The first situation involves a fixed total number of facilities to be located; in the

second situation, the total number of facilities is not fixed. Subsequently, we briefly review the

approaches for spatial demand representation that could influence the accuracy of the problem


solutions. The method called service area spatial demand representation (SASDR) is briefly

described. Next, the MCMCLP and the SASDR are applied to the optimal siting of ambulances

for the Emergency Medical Services (EMS) Region 10 in the State of Georgia (GA). Finally, a

discussion and conclusions are provided.

4.2 Modular Capacitated Maximal Covering Location Problem (MCMCLP)

Because of the capacity limit of a facility, the allocation problem (i.e., how to allocate

demands to facilities) sometimes must be solved in conjunction with the location problem (i.e.,

where to site facilities) (Haghani 1996). Under the assumption that one demand can only be

allocated to, at most, one facility, we define three demand types and use them in the following

part of this article: 1) unallocated demand, which is not allocated to any facility (e.g., the

demands da and db in Figure 4.1); 2) covered allocated demand, which is located within the

service covering standard of a facility and is allocated to that facility (e.g., the demand dc in

Figure 4.1); 3) uncovered allocated demand, which is located beyond the service covering

standard of a facility but is allocated to that facility (e.g., the demand dd in Figure 4.1).

Figure 4.1. Illustration of three demand types: unallocated demand (da and db), covered allocated demand (dc), and uncovered allocated demand (dd)

[Figure legend: Facility (f); Demand; Allocated to; Service Covering Standard]


Following the work of Pirkul and Schilling (1991) and Haghani (1996), and in light of a

different perspective of the capacitated plant location problem of Correia and Captivo (2003), we

present an extension to the capacitated MCLP called MCMCLP and utilize it for siting

emergency services. In addition to the basic concept of the MCLP that the covered allocated

demands should be maximized by optimally siting a fixed number of facilities, the MCMCLP

also includes the following considerations: 1) the facility at each potential site has a maximum

capacity, which will be chosen from a finite and discrete set of available capacity levels; 2) all

demands need to be allocated to facilities (i.e., no unallocated demands exist), and the uncovered

allocated demands could be assigned on the basis of their proximity to facilities; 3) the demands

within a demand object, which is a spatial point or areal unit derived by abstracting or

partitioning continuous demand space, may be divided and allocated to multiple facilities.

An area with a larger population usually has a higher frequency of calls for emergency

service than an area with a smaller population. In addition, one emergency vehicle can only

respond to one call at a time and will be available only after that task is finished. Therefore, the

larger the population an ambulance serves, the more likely it is to be busy, the

longer the average response time for a call, and the poorer the service it provides. To

ensure an acceptable average response time for a call, each emergency vehicle can be thought to

have a maximum population that it can serve. In this article, we take population as demands, and

the upper limit of the population served by an emergency vehicle is defined as the capacity of

that vehicle. In fact, the calculation of an emergency vehicle’s capacity needs to consider

multiple factors, including the requirement for the average response time, the average frequency

of calls in the population that it will serve, and the average treatment time for a task, among

others. For simplicity, in this article, all emergency vehicles are assumed to have the same

capacity, and the capacity of a facility is taken to be the total capacity of all vehicles

stationed at that facility. For example, if at most p vehicles can be stationed at a facility,

there are p possible levels of capacity from which to choose. A facility will not be established in

a location unless at least one emergency vehicle needs to be stationed there.

There are two situations for the MCMCLP. If there is no constraint on the total number of

emergency facilities that will be established to station vehicles, then we call such a non-facility-

constraint problem MCMCLP-NFC. This situation mainly focuses on how to allocate a given

number of vehicles to a set of predefined potential facility sites. If the total number of facilities is

fixed, we term such a facility-constraint problem MCMCLP-FC. This situation requires selecting the

sites for a given number of facilities and then allocating a given number of vehicles to these

facilities. Consider the following notation:

$I$ = the set of demand objects $\{1, \ldots, i, \ldots, m\}$;

$J$ = the set of potential facility sites $\{1, \ldots, j, \ldots, n\}$;

$S$ = the service covering standard of facility (i.e., maximum distance or time);

$d_{ij}$ = the travel distance or time from potential facility site $j$ to demand object $i$;

$J_i$ = the set of potential facility sites $j$ within the service covering standard of which

demand object $i$ lies, i.e., $J_i = \{ j \mid d_{ij} \le S \}$;

$a_i$ = the amount of service demands at demand object $i$;

$p$ = the total number of emergency vehicles to be located;

$c$ = the capacity of one emergency vehicle (assuming all vehicles have the same capacity);

$w$ = the weight associated with all the uncovered allocated demands;

$x_j$ = the number of emergency vehicles stationed at potential facility site $j$; a facility is

located on site $j$ when $x_j > 0$;

$y_{ij}$ = the fraction (between 0 and 1) of demands at demand object $i$ that is allocated to the facility on site $j$.

The formulation of the MCMCLP-NFC is

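$$\text{Maximize} \quad \sum_{i \in I} \sum_{j \in J_i} a_i y_{ij} \; - \; w \sum_{i \in I} \sum_{j \notin J_i} d_{ij} a_i y_{ij} \qquad \text{(Equation 4.1)}$$

Subject to:

$$\sum_{i \in I} a_i y_{ij} \le c x_j \quad \forall j \in J \qquad \text{(Equation 4.2)}$$

$$\sum_{j \in J} x_j = p \qquad \text{(Equation 4.3)}$$

$$\sum_{j \in J} y_{ij} = 1 \quad \forall i \in I \qquad \text{(Equation 4.4)}$$

$$x_j = 0, 1, 2, \ldots, p \quad \forall j \in J \qquad \text{(Equation 4.5)}$$

$$0 \le y_{ij} \le 1 \quad \forall i \in I \qquad \text{(Equation 4.6)}$$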

Among Equations 4.1 to 4.6, Equation 4.1 is a multi-objective function that seeks to maximize the

amount of the covered allocated demands ($\sum_{i \in I} \sum_{j \in J_i} a_i y_{ij}$) while simultaneously minimizing the

total demand-weighted distance between the uncovered allocated demands and the sites to which they are assigned

($\sum_{i \in I} \sum_{j \notin J_i} d_{ij} a_i y_{ij}$). In this function, the weight $w \ge 0$ can be varied to adjust the preference on each

objective. Constraints 4.2 ensure that the total demand allocated to any facility cannot exceed the

maximum capacity of that facility (i.e., the total capacity of the emergency vehicles stationed

there). If no facility (i.e., no vehicle) is located on a site, no demand will be allocated to that site.

Constraint 4.3 specifies the total number of emergency vehicles to be located. Constraints 4.4

ensure that all demands at each demand object will be allocated to facilities. Constraints 4.5

indicate that the decision variable $x_j$ is a non-negative integer no greater than $p$. Constraints 4.6 restrict the

continuous decision variable $y_{ij}$ to the range from 0 to 1.
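
To make the roles of the objective and constraints concrete, the following is a minimal Python sketch that checks a candidate solution against constraints 4.2 to 4.6 and evaluates Equation 4.1. It is not the CPLEX model used in this chapter; the function name `evaluate_nfc`, the tolerance `tol`, and the three-object, two-site toy instance are assumptions made only for illustration.

```python
def evaluate_nfc(a, d, S, c, p, w, x, y, tol=1e-9):
    """Return (feasible, objective) for a candidate MCMCLP-NFC solution.

    a[i]    : demand at demand object i
    d[i][j] : distance or time from site j to demand object i
    x[j]    : number of vehicles stationed at site j
    y[i][j] : share of the demand at object i allocated to site j
    """
    I, J = range(len(a)), range(len(x))
    feasible = (
        sum(x) == p                                                     # 4.3
        and all(isinstance(x[j], int) and 0 <= x[j] <= p for j in J)    # 4.5
        and all(-tol <= y[i][j] <= 1 + tol for i in I for j in J)       # 4.6
        and all(abs(sum(y[i]) - 1) <= tol for i in I)                   # 4.4
        and all(sum(a[i] * y[i][j] for i in I) <= c * x[j] + tol
                for j in J)                                             # 4.2
    )
    # Site j belongs to J_i exactly when d[i][j] <= S.
    covered = sum(a[i] * y[i][j] for i in I for j in J if d[i][j] <= S)
    penalty = sum(d[i][j] * a[i] * y[i][j]
                  for i in I for j in J if d[i][j] > S)
    return feasible, covered - w * penalty


# Hypothetical toy instance: 3 demand objects, 2 sites, 2 vehicles.
a = [50, 20, 40]
d = [[2, 9], [3, 4], [12, 6]]
y = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ok, value = evaluate_nfc(a, d, S=5, c=60, p=2, w=0.001, x=[1, 1], y=y)
bad, _ = evaluate_nfc(a, d, S=5, c=60, p=2, w=0.001, x=[2, 0], y=y)
```

In the toy instance the third demand object lies beyond the covering standard of both sites, so it contributes only to the distance term; moving both vehicles to the first site violates constraint 4.2 at the second site.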

We use min{p, n} to denote the smaller value between the total number of emergency

vehicles, p, and the total number of potential facility sites, n. In the MCMCLP-NFC, emergency

vehicles can be stationed at facilities on as many as min{p, n} sites, whereas the

MCMCLP-FC fixes the total number of facilities to be sited. To present the

formulation of the MCMCLP-FC, we need to introduce additional notation:

$q$ = the total number of facilities to be sited;

$K$ = the set of possible facility sizes (i.e., the number of vehicles) on each potential

facility site $\{1, \ldots, k, \ldots, p\}$;

$x_{jk}$ = 1 if a facility with $k$ vehicles is located on potential facility site $j$, and 0 otherwise.

The MCMCLP-FC has the same objective function Equation 4.1 and constraints 4.4 and 4.6 as

the MCMCLP-NFC formulation. The other constraints include:

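$$\sum_{k \in K} x_{jk} \le 1 \quad \forall j \in J \qquad \text{(Equation 4.7)}$$

$$\sum_{i \in I} a_i y_{ij} \le \sum_{k \in K} k c x_{jk} \quad \forall j \in J \qquad \text{(Equation 4.8)}$$

$$\sum_{j \in J} \sum_{k \in K} k x_{jk} = p \qquad \text{(Equation 4.9)}$$

$$\sum_{j \in J} \sum_{k \in K} x_{jk} = q \qquad \text{(Equation 4.10)}$$

$$x_{jk} \in \{0, 1\} \quad \forall j \in J, \; k \in K \qquad \text{(Equation 4.11)}$$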

Constraints 4.7 ensure that no more than one facility can be located on each potential facility site.

Constraints 4.8 ensure that all the demands allocated to a facility cannot exceed the maximum

capacity of that facility. Constraint 4.9 specifies the total number of emergency vehicles to be

stationed. Constraint 4.10 specifies the total number of facilities to be sited. Constraints 4.11

impose the integrality restriction on the decision variable $x_{jk}$.
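
The binary size variables of the MCMCLP-FC can be illustrated with a short Python sketch. It encodes a hypothetical vehicle count per site as the 0/1 matrix used in constraints 4.7 to 4.11 and checks the vehicle and facility totals; the helper names and the four-site example are assumptions, not part of the original implementation.

```python
def encode_sizes(vehicles_per_site, p):
    """x[j][k-1] = 1 if site j hosts a facility with k vehicles (k = 1..p)."""
    return [[1 if v == k else 0 for k in range(1, p + 1)]
            for v in vehicles_per_site]


def check_fc(x_jk, p, q):
    """Check constraints 4.7, 4.9, 4.10 and 4.11 for a 0/1 matrix x_jk."""
    binary = all(v in (0, 1) for row in x_jk for v in row)          # 4.11
    one_size = all(sum(row) <= 1 for row in x_jk)                   # 4.7
    vehicles = sum(k * row[k - 1]
                   for row in x_jk for k in range(1, p + 1))        # 4.9
    facilities = sum(sum(row) for row in x_jk)                      # 4.10
    return binary and one_size and vehicles == p and facilities == q


def capacities(x_jk, c):
    """Right-hand side of constraint 4.8 for every site."""
    return [sum(k * c * row[k - 1] for k in range(1, len(row) + 1))
            for row in x_jk]


# Hypothetical example: 4 sites, 4 vehicles, 2 facilities.
x_jk = encode_sizes([3, 0, 1, 0], p=4)
valid = check_fc(x_jk, p=4, q=2)
caps = capacities(x_jk, c=8000)
```

A site with no facility has an all-zero row, so its capacity in constraint 4.8 is zero and no demand can be allocated to it.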

In objective function Equation 4.1 for both MCMCLP models, the weight w associated

with uncovered allocated demands can be varied to trade off the two objectives: the

maximization of covered allocated demands and the minimization of the total distance of

uncovered allocated demands to facilities. When w = 0, the model considers only the former

objective, and the service level for the uncovered allocated demands is not assured because

they may be allocated to a farther facility instead of a nearer one. As w increases, the service

level for the uncovered allocated demands improves because more preference is given to the

latter objective, although the covered allocated demands may then be smaller

than when w = 0. In general, maximization of the covered allocated demands would be

the primary objective in emergency service planning, which means that, for a model with an

appropriate weight w, the optimal solution will provide as good or better coverage of the covered

allocated demand than any other feasible solution (Haghani 1996). With a proof similar to that given

by Haghani (1996), we can show that, to ensure that maximization of the covered allocated demands

is the primary objective, the weight w must meet the following condition when assuming integer

demands:

$$0 \le w \le \frac{1}{A \, (d_{\max} - d_{\min})} \qquad \text{(Equation 4.12)}$$

where $A$ is the total demand $\sum_{i \in I} a_i$, and $d_{\max}$ and $d_{\min}$ are the maximum and minimum distances,

respectively, between any pair of demand object $i$ and potential facility site $j$.

4.3 Spatial Demand Representation

When residents are taken as demands, aggregated census data are often the most readily available spatial information

on demand. When individual activity or tracking data are

not available, a practical approach is to assume that the demands are distributed continuously

within the census units. For such continuous areal demands, some spatial demand representation

has to be adopted so that the MCLP model can be applied. The widely used point-based

abstractions may be prone to measurement and coverage errors (Murray and O'Kelly 2002, Tong

and Murray 2009). The areal representations with census units or grids of regular polygons often

complicate the model because of the explicit processing of partial coverage caused by the

mismatch between the boundaries of service covering areas and the demand areal units. To

maintain both the simplicity and the high degree of accuracy of the maximal coverage model, the

SASDR, which was proposed by Yin and Mu (2011), is used in this article to represent demand

space.

The SASDR is a polygon-overlay-based representation for continuous spatial demands.

In this representation, the demand objects are created by using the service areas of all potential

facility sites to partition the whole demand space. Figure 4.2(a) shows an example where a

square demand space U is partitioned into the SASDR by two potential facilities f1 and f2

with circular service areas S1 and S2. Figure 4.2(b) shows the four resulting demand objects in the

final SASDR: $U - (S_1 \cup S_2)$, $U \cap (S_1 - S_2)$, $U \cap (S_2 - S_1)$, and $U \cap S_1 \cap S_2$. The

biggest advantage of the SASDR is that every demand object lies either entirely within or entirely beyond the

service covering standard of any potential facility site, which avoids partial coverage in the

model. With the basic functions in GIS software packages, such as buffer, overlay and network

analysis, the SASDR can be easily realized.

Figure 4.2. Example of the SASDR with circular facility service areas: (a) demand space U (the square) and two potential service areas S1 and S2 (the circles); (b) the four demand objects that result from partitioning demand space U by service areas S1 and S2
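
The idea behind the partition in Figure 4.2 can be sketched without GIS software by grouping locations according to the set of service areas that contain them. The Python sketch below is only a raster approximation under assumed coordinates (a 10 by 10 square, two sites, radius 2); the actual SASDR is built with exact polygon overlay, in which spatially separate pieces with the same covering set would remain separate demand objects.

```python
from collections import defaultdict


def coverage_signature(x, y, sites, radius):
    """Set of site ids whose circular service area contains the point (x, y)."""
    return frozenset(
        sid for sid, (sx, sy) in sites.items()
        if (x - sx) ** 2 + (y - sy) ** 2 <= radius ** 2
    )


def partition_by_service_areas(sites, radius, size=10.0, step=0.25):
    """Group grid-cell centres of the square [0, size]^2 by coverage signature."""
    groups = defaultdict(list)
    n = int(size / step)
    for ix in range(n):
        for iy in range(n):
            x, y = (ix + 0.5) * step, (iy + 0.5) * step
            groups[coverage_signature(x, y, sites, radius)].append((x, y))
    return dict(groups)


# Hypothetical layout loosely following Figure 4.2: two overlapping circles.
sites = {"f1": (4.0, 5.0), "f2": (6.0, 5.0)}
demand_objects = partition_by_service_areas(sites, radius=2.0)
```

Each key of `demand_objects` is a set of covering sites, so every cell in a group is covered by exactly the same facilities, which is the property that lets the model avoid partial coverage.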

4.4 Applications: Optimal Siting of Ambulances

Because of its important social and economic objectives, the ambulance location problem

has been widely studied over the past 40 years (Eaton et al. 1985, Adenso-Díaz and Rodríguez

1997, Brotcorne et al. 2003, Daskin and Dean 2005, Henderson and Mason 2005). Because

ambulances are usually stationed in fire departments or parking lots with little additional


construction or administrative costs, it is unnecessary to limit the total number of facilities to be

sited. Given this practical consideration, the MCMCLP-NFC model may be more appropriate

than the MCMCLP-FC model. However, to better compare the performances of these two

models, we here apply both MCMCLP-NFC and MCMCLP-FC to the optimal siting of

ambulances for EMS Region 10 in GA.

4.4.1 Study Area and Data

EMS Region 10 is one of the 10 EMS regions in GA, which is in the northeastern section

of GA and is composed of 10 counties (Figure 4.3). The region serves 405,231 people (2000

census data) in a 3,006 total square mile area with 13 licensed ambulance services and 58

vehicles (OEMS 2006). The population in 2010 was 460,189, and the quartile map of the

population density (persons/km²) by census block group is shown in Figure 4.3. The population data,

boundary maps of census units, and street map are all taken from US 2010 census data because

capturing the variation in demand across the study area requires population data at a

relatively low spatial aggregation level, such as the block group or block level, which is only

available in census years. The Georgia EMS stations data from 2005 to 2007 are the only EMS

data that we can obtain thus far; these data come from the Homeland Security Infrastructure

Program (HSIP) and were downloaded from the website of the Georgia Department of

Community Affairs (DCA 2011). These data consist of the information of the locations where

the EMS personnel are stationed or based, or where the equipment that such personnel use in

performing their jobs is stored for ready use. According to these data, a total of 82 EMS stations

provide ambulance service in our study area (Figure 4.3). Among these stations, only two

(Madison County Emergency Medical Services Station 4 and Greene County Emergency

Medical Service) are not located in fire departments. The count of EMS stations (82) is

larger than the count of ambulances (58). This discrepancy may be due to the inconsistency in the time

periods for which the data were collected. In addition, it is common for ambulances to be

periodically relocated among facilities to ensure good coverage at all times, which is an

important difference between the operations of emergency medical services and other emergency

services, such as those of fire departments or police departments (Brotcorne et al. 2003).

Therefore, some EMS stations may not house vehicles at all times. Although the population

data and EMS data for different time periods are used, the time interval between these data is

short; the time inconsistency is therefore ignored in this application until better-quality data

become available. This data input is not the critical part of the models and should not

significantly influence the illustration and validation of our models and their applications.

Figure 4.3. Population density of Georgia EMS Region 10 (study area) by census block

group and existing ambulance facility locations


4.4.2 Tasks

To test the application of the MCMCLP for emergency services, a total of 58 ambulances

will be allocated to maximize the covered allocated demands within 8-min driving distance from

the facilities. The locations of 82 existing EMS stations are regarded as the potential facility sites.

The demands are represented by the census population in 2010 by census block group. To ensure

the existence of a feasible solution to the problem, we define the capacity of each ambulance as

8,000 persons so that the 58 ambulances have a total capacity of 464,000 persons, which exceeds the total

demand of 460,189. We assume that the capacity of 8,000 persons per ambulance can meet the

requirement of the average response time to the calls for service in this region. In the MCMCLP-

NFC model, the 58 vehicles could be allocated to, at most, 58 facility sites. In the MCMCLP-FC

model, only 20 potential facility sites will be chosen, and the 58 vehicles will be allocated to

these 20 sites. ArcGIS™ v9.3.1 is used to realize the SASDR. Programming with Visual Basic

for Applications (VBA) for ArcObjects in ArcGIS™ v9.3.1 is used to structure the optimization

model files. The optimization problems are then solved using the commercial mixed integer

programming (MIP) software package CPLEX v12.2. All analyses are performed on a personal

computer equipped with an Intel Core Quad 2.4 GHz CPU and 3 GB of RAM.

4.4.3 Results

4.4.3.1 Realization of SASDR

In the realization of SASDR, three types of roads are used to create the road network and

then to create the 8-min service area for each potential facility site. The information for roads is

listed in Table 4.1 and includes the MAF/TIGER Feature Class Codes (MTFCC) defined in the

census data, road descriptions and hypothetical speed limits. Figure 4.4 shows the road network

in the study area.


Table 4.1. Information for roads

| MTFCC | Description | Speed limit (miles/hour) |
|-------|-------------|--------------------------|
| S1100 | Primary Road | 70 |
| S1200 | Secondary Road | 55 |
| S1400 | Local Neighborhood Road, Rural Road, City Street | 40 |

Figure 4.4. Road network in EMS Region 10 in GA

After the road network is created, a service layer that includes the 8-min service polygons

for the 82 potential facility sites is created from the road network using the network-analysis

functions in ArcGIS (Figure 4.5). The white areas are locations that no vehicle can reach

within 8 minutes from any potential facility location. Each service polygon is

identified by the ID of its corresponding facility site.


Figure 4.5. Eight-minute service areas (non-white polygons) of all potential

ambulance facility sites (red points) based on the road network
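
The 8-min service areas rest on shortest travel times over the road network. As a rough illustration of that idea, and not of the ArcGIS network-analysis tools actually used, the following Python sketch converts segment lengths to minutes with the hypothetical speed limits of Table 4.1 and runs Dijkstra's algorithm from one station. The edge list, node names, and two-way assumption are invented for the example, and the sketch returns reachable nodes rather than service polygons.

```python
import heapq

# Hypothetical speed limits (miles/hour) by MTFCC, as listed in Table 4.1.
SPEED_MPH = {"S1100": 70, "S1200": 55, "S1400": 40}


def travel_minutes(length_miles, mtfcc):
    """Travel time along one road segment at its assumed speed limit."""
    return 60.0 * length_miles / SPEED_MPH[mtfcc]


def service_nodes(edges, source, limit_min=8.0):
    """Dijkstra search: {node: minutes} for nodes reachable within limit_min."""
    graph = {}
    for u, v, miles, mtfcc in edges:
        t = travel_minutes(miles, mtfcc)
        graph.setdefault(u, []).append((v, t))
        graph.setdefault(v, []).append((u, t))  # roads treated as two-way
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, dt in graph.get(u, []):
            nt = t + dt
            if nt <= limit_min and nt < best.get(v, float("inf")):
                best[v] = nt
                heapq.heappush(heap, (nt, v))
    return best


# Invented toy network: (from, to, length in miles, MTFCC).
edges = [
    ("station", "A", 2.0, "S1400"),
    ("A", "B", 3.5, "S1100"),
    ("B", "C", 2.0, "S1400"),
    ("station", "D", 6.0, "S1200"),
]
reach = service_nodes(edges, "station")
```

In this toy network node C is 9 minutes from the station and therefore falls outside the 8-min service area, while A, B, and D are inside it.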

With the polygon overlay tool “Identity” in ArcGIS, the service layer is used to partition

the study area to derive the partition layer that includes all intersecting units among the service

polygons and the study area. Because of possible overlap among the service polygons, the

partition layer may include duplicate intersecting units that have the same location and shape but

different facility site IDs. A new field, “DO_ID”, is created in the partition layer, and the “Field

Calculator” function in ArcGIS with VBScript is used to compare the centroid coordinates and

the area of each unit to identify the duplicate units. All units that represent the same demand

object will be assigned the same demand object ID in the field “DO_ID”. In the attribute table of

the partition layer, both facility site ID and demand object ID now exist in each record. The

facility site j in the record of the demand object i indicates that the demand object i can be


completely covered by the service from the potential facility site j. This information will later be

used to construct the model input file for CPLEX to solve the problem. A total of 2,721 demand

objects are obtained for the study area. We export them from the partition layer to create the

demand object boundary layer.
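
The duplicate-detection step can be sketched in Python as follows. This is not the VBScript used in the Field Calculator; it assumes each intersecting unit is available as a record with a facility site ID, centroid coordinates, and an area (the field names `site_id`, `cx`, `cy`, and `area` are hypothetical), and it matches units after rounding, which could fail for values that straddle a rounding boundary.

```python
def assign_demand_object_ids(units, ndigits=3):
    """Give units with the same rounded centroid and area the same DO_ID.

    units: list of dicts with keys 'site_id', 'cx', 'cy', 'area';
           'site_id' is None for a unit outside every service polygon.
    Returns (units with 'DO_ID' added, {DO_ID: sorted covering site ids}).
    """
    ids, covering = {}, {}
    for u in units:
        key = (round(u["cx"], ndigits),
               round(u["cy"], ndigits),
               round(u["area"], ndigits))
        do_id = ids.setdefault(key, len(ids) + 1)  # reuse the ID of a duplicate
        u["DO_ID"] = do_id
        sites = covering.setdefault(do_id, set())
        if u["site_id"] is not None:
            sites.add(u["site_id"])
    return units, {k: sorted(v) for k, v in covering.items()}


# Invented records: the first two share a location and shape.
units = [
    {"site_id": 1, "cx": 10.0004, "cy": 20.0, "area": 2.5},
    {"site_id": 2, "cx": 10.0004, "cy": 20.0, "area": 2.5},
    {"site_id": 1, "cx": 12.0, "cy": 21.0, "area": 1.0},
    {"site_id": None, "cx": 30.0, "cy": 5.0, "area": 7.0},
]
units, covering_sites = assign_demand_object_ids(units)
```

The resulting mapping from each demand object to its covering sites corresponds to the sets $J_i$ needed when the model input file is written.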

The next step for the realization of SASDR is to calculate the amount of demands in each

demand object, which will be interpolated from the census block group population data and

assumed to be distributed uniformly within the demand object. When the polygon overlay tool

“Intersect” in ArcGIS is used to overlay the layer of population density by block group on the

demand object boundary layer, many intersecting units will emerge. The population in each unit

is calculated by multiplying its population density by the area of that unit. Finally, the population of

the intersecting units is aggregated to the demand objects. Figure 4.6 shows the final SASDR result

for the study area with demand (i.e., population) distribution. Because of round-off error, a

total aggregated population of 460,219 (compared with the census total of 460,189) is obtained for the study area, which is then used as the

total amount of demands in the subsequent model. There are 623 demand objects with no people

because of their small sizes and low population densities. These zero-population demand objects

are first excluded from the optimization problem to reduce the computing complexity. After the

optimization problem is solved by CPLEX, these demand objects will be brought back and

allocated to their nearest facilities.


Figure 4.6. SASDR result for the study area with demand (population) distribution
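
The interpolation of block-group population to demand objects amounts to area weighting under the uniform-density assumption. A minimal Python sketch, with an invented list of intersecting units given as (demand object ID, density in persons/km², area in km²), is shown below; it is an illustration of the arithmetic rather than the ArcGIS workflow itself.

```python
def interpolate_population(pieces):
    """Sum density * area over the intersecting units of each demand object.

    pieces: iterable of (demand_object_id, persons_per_km2, area_km2).
    """
    totals = {}
    for do_id, density, area in pieces:
        totals[do_id] = totals.get(do_id, 0.0) + density * area
    return totals


# Invented intersecting units: demand object 1 straddles two block groups.
pieces = [(1, 100.0, 0.5), (1, 40.0, 0.25), (2, 10.0, 2.0)]
population = interpolate_population(pieces)
```

If the per-object totals are rounded to whole persons before they are summed, the grand total can drift slightly from the census total, which is the kind of round-off difference noted above.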

4.4.3.2 Model Construction and Solution

The distance between demand object and facility location is measured from the centroid

of the demand object to the facility location point in kilometers. The maximum distance in this

study area is 33.377 km and the minimum distance is $2.683 \times 10^{-2}$ km. According to Equation

4.12, the value of weight w should be within the range $[0, 6.515 \times 10^{-8}]$ to ensure that the

maximization of the covered allocated demands is the primary objective. In fact, as long as the

value of weight w falls in this range and does not equal zero, the solutions of each model will be

the same, irrespective of the weight w. Therefore, we set $w = 6 \times 10^{-8}$ for both the MCMCLP-

NFC and MCMCLP-FC models.
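
The upper bound on the weight can be reproduced with a few lines of Python. The sketch assumes the bound of Equation 4.12 has the form 1 / (A (d_max - d_min)) and uses the values reported for the study area: A = 460,219 persons, d_max = 33.377 km, and d_min = 2.683 × 10⁻² km; the function name is hypothetical.

```python
def max_weight(total_demand, d_max, d_min):
    """Upper bound on w from Equation 4.12: 1 / (A * (d_max - d_min))."""
    return 1.0 / (total_demand * (d_max - d_min))


bound = max_weight(460_219, 33.377, 2.683e-2)
chosen_w = 6e-8
within_range = 0 < chosen_w <= bound
```

The computed bound is approximately 6.515 × 10⁻⁸, so the chosen weight of 6 × 10⁻⁸ lies inside the admissible range.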


The model input files were constructed with the VBA program of ArcObjects in ArcGIS.

These models were then solved in CPLEX, which uses a branch-and-cut technique to find the

optimal solution (CPLEX Help 2011). The run time is 3,361 seconds for the MCMCLP-NFC

model and 706 seconds for the MCMCLP-FC model. The solutions obtained from CPLEX were

finally visualized as maps in ArcGIS.

Figure 4.7 shows the results of two MCMCLP models using the choropleth maps overlaid

with selected facility sites. In these maps, the facility and the demands allocated to it are

represented in the same colors, and larger facility symbols indicate more ambulances. With such

maps, the location-allocation patterns of the problem solution can be easily understood. For those

demand objects whose demands will be divided and allocated to more than one facility, the

strategy here is to split the demand object into multiple parts. For each facility that partially

serves the demand object, one part of the demand object is drawn as close as possible to that facility,

with a size proportional to the percentage of demands served by that facility. In Figure 4.7(a),

in which the MCMCLP-NFC is applied, a total of 51 out of 82 potential sites are chosen to set up

the facilities, and 402,365 demands (87.4% of total demands) are covered within the 8-min

service covering standard. In Figure 4.7(b), in which the MCMCLP-FC is applied, 20 facilities

are required by the problem specification, and 358,477 demands (77.9% of total demands) are

covered within the service covering standard. As expected, the amount of the covered allocated

demands obtained by the MCMCLP-NFC is greater than that obtained by the MCMCLP-FC

because more facilities in the MCMCLP-NFC provide greater flexibility for siting the

ambulances. Because the proximity of the uncovered allocated demands to the facilities is

considered in both models (i.e., $w = 6 \times 10^{-8}$), the demands allocated to a facility are generally

distributed more compactly and more continuously than those in the models with w = 0 (results not

shown). However, the allocations of many facilities are still dispersed into several parts that may

be far away from one another. For example, there are two major demand patches with varied

sizes (filled with diagonals) allocated to the facility at site 13 in Figure 4.7(a). One reason for

this allocation is that the primary objective of the models is to maximize the covered allocated

demands instead of the proximity of the uncovered allocated demands to the facilities. The

splitting operation of the demand objects to represent the partial coverage could also cause the

noncontinuous demand allocations in the maps. Because of the smaller number of facilities

established, the MCMCLP-FC shows a more compact and continuous distribution of the

demands than the MCMCLP-NFC shows.

Table 4.2 shows the counts of the facilities with varied numbers of ambulances in these

two models. The maximum number of ambulances in a facility is 3 (site 45 in Figure 4.7(a)) in

the MCMCLP-NFC model and 12 (site 35 in Figure 4.7(b)) in the MCMCLP-FC model.


Figure 4.7. Results of the MCMCLP models siting 58 ambulances in 82 potential facility locations with $w = 6 \times 10^{-8}$ (the facility location is rendered in the same color as its allocation area): (a) the MCMCLP-NFC model; (b) the MCMCLP-FC model with 20 facilities


Table 4.2. Count of the facilities with varied numbers of ambulances

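| Number of ambulances in a facility | Count of facilities (MCMCLP-NFC) | Count of facilities (MCMCLP-FC) |
|------------------------------------|----------------------------------|---------------------------------|
| 1 | 45 | 2 |
| 2 | 5 | 10 |
| 3 | 1 | 5 |
| 4 | 0 | 1 |
| 5 | 0 | 1 |
| 12 | 0 | 1 |
| Total | 51 | 20 |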
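
As a quick consistency check, the counts in Table 4.2 can be tallied against the problem specification: both solutions should place 58 ambulances, and the MCMCLP-FC solution should use exactly 20 facilities. The short Python sketch below only re-adds the table entries; the variable and function names are arbitrary.

```python
# Ambulances per facility -> (count of facilities in NFC, count in FC),
# transcribed from Table 4.2.
table_4_2 = {
    1: (45, 2), 2: (5, 10), 3: (1, 5), 4: (0, 1), 5: (0, 1), 12: (0, 1),
}


def totals(column):
    """Return (number of facilities, number of ambulances) for one model."""
    facilities = sum(counts[column] for counts in table_4_2.values())
    vehicles = sum(k * counts[column] for k, counts in table_4_2.items())
    return facilities, vehicles


nfc_facilities, nfc_vehicles = totals(0)
fc_facilities, fc_vehicles = totals(1)
```

The tallies give 51 facilities and 58 ambulances for the MCMCLP-NFC and 20 facilities and 58 ambulances for the MCMCLP-FC, in agreement with the totals of vehicles and facilities fixed in the task definition.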

4.5 Discussion

Several assumptions are made in this article to apply the MCMCLP models to optimally

site emergency vehicles such as ambulances. One assumption is that a facility has a capacity that

is related to the vehicles stationed there. This assumption is simple but reasonable. If the

population in the jurisdiction of a facility is too large, one of the important indicators for the

emergency service quality, the average response time to the calls for emergency service, will be

too long. When the population exceeds a limit, the quality of the emergency service provided by

that facility will be unacceptable. Given a requirement on the average response time to the calls,

a facility with more vehicles may serve a greater population. In our application, for simplicity,

we assume that each vehicle has the same capacity and that the capacity of a facility is equal to

the total capacity of the vehicles located there. Admittedly, this is a very restrictive assumption

because the capacity of an emergency vehicle actually depends on multiple factors, including the

requirement on the average response time, the average frequency of calls in the population it will

serve, and the average treatment time for a task, among others. A discussion of this problem

exceeds the scope of this article. However, if the possible capacity levels of the facility at each

potential site can be estimated and taken as a group of constants, the MCMCLP model can be

easily modified to accommodate the situation. The location problems of emergency vehicles are,


in reality, complex. The MCMCLP is a static model that does not consider the dynamic factors

such as the daily population movement. Accounting for such factors will be the focus of our

future work.

The MCLP has been proven to be nondeterministic polynomial time (NP)-hard (Megiddo

et al. 1981), which means that no algorithm has yet been discovered to solve it in polynomial

time in the worst case. As an extension to the MCLP, the MCMCLP is also NP-hard. Therefore,

the use of exact methods (e.g., enumeration or linear programming with branch-and-bound) to

solve a large-scale MCMCLP will be difficult. Seeking heuristic methods (e.g., genetic algorithm

or Lagrangian relaxation) is important for promoting the applications of the MCMCLP. A

potential heuristic method for solving the MCMCLP is a two-phase procedure, in which the

locations of the facilities and the demand allocation are first determined under the assumption

that the facilities are uncapacitated; the emergency vehicles are then allocated to each facility

depending on the allocated demands. We note, however, that the second phase may change the

demand allocation determined in the first phase, so the configuration of facility locations

found in the first phase is not necessarily optimal for the whole problem.

Although model formulation and the optimization of algorithms are always the focus in

location modeling, many other aspects of the location problem, such as the representation for

spatial demands, also influence the accuracy of the modeling solutions and require attention. An

effective visualization of the problem solutions will be helpful in understanding the location-

allocation patterns and in making decisions by comparing different modeling results. One

problem that we need to address for our MCMCLP models in the future is how to better

represent in the map the demand objects served by multiple facilities.

In the MCMCLP model, GIS plays an important role. It is used to manage and organize

the spatial data, to realize the spatial demand representation, to help construct the model input

file for optimization software packages, and to visualize the problem solution with maps. In

addition to these important functions, GIS also facilitates theoretical advances in current location

science (Church 2002, Murray 2010).

4.6 Conclusion

The MCMCLP that we proposed in this article is an extension of the capacitated MCLP

to accommodate situations where the facilities to be sited have several possible capacity levels.

For the optimal siting of emergency vehicles, the MCMCLP considers the modular capacity

levels of a facility, the allocation of all demands, and the proximity of the uncovered allocated

demands to facilities. Two situations—the MCMCLP-NFC and the MCMCLP-FC—can be used

depending on the circumstances of the facility. In cases where the cost of a facility is low and

maximization of the covered allocated demands is the main purpose, such as establishing bases

for ambulances that are not always based in a building but are often at a very rudimentary

location such as a parking lot (Brotcorne et al. 2003), the MCMCLP-NFC may be more useful

because more covered allocated demands are generally obtained than with the MCMCLP-FC. If

the cost of facilities is also an important consideration, such as with fire stations for fire trucks,

the MCMCLP-FC may be better because we can incorporate information about how many

facilities we can build in the location modeling.

References

Adenso-Díaz, B. & Rodríguez, F., 1997. A simple search heuristic for the MCLP: Application to the location of ambulance bases in a rural region. Omega, 25 (2), 181-187.

Balcik, B. & Beamon, B.M., 2008. Facility location in humanitarian relief. International Journal of Logistics: Research & Applications, 11 (2), 101-121.

Bennett, V.L., Eaton, D.J. & Church, R.L., 1982. Selecting sites for rural health workers. Social Science & Medicine, 16 (1), 63-72.

Berman, O. & Krass, D., 2002. The generalized maximal covering location problem. Computers & Operations Research, 29 (6), 563-581.

Brotcorne, L., Laporte, G. & Semet, F., 2003. Ambulance location and relocation models. European Journal of Operational Research, 147 (3), 451-463.

Chung, C., Schilling, D. & Carbone, R., Year. The capacitated maximal covering problem: A heuristic. Proceedings of the Fourteenth Annual Pittsburgh Conference on Modeling and Simulation, 1423-1428.

Church, R. & ReVelle, C., 1974. The maximal covering location problem. Papers in Regional Science, 32 (1), 101-118.

Church, R.L., 2002. Geographical information systems and location science. Computers & Operations Research, 29 (6), 541-562.

Church, R.L., Stoms, D.M. & Davis, F.W., 1996. Reserve selection as a maximal covering location problem. Biological conservation, 76 (2), 105-112.

Correia, I. & Captivo, M.E., 2003. A lagrangean heuristic for a modular capacitated location problem. Annals of Operations Research, 122 (1), 141-161.

CPLEX Help, 2011. Branch and cut [online]. http://www.iro.umontreal.ca/~gendron/IFT6551/CPLEX/HTML/usrcplex/solveMIP9.html#638133 [Accessed 2011].

Current, J. & O'Kelly, M., 1992. Locating emergency warning sirens. Decision Sciences, 23 (1), 221-234.

Current, J. & Storbeck, J., 1988. Capacitated covering models. Environment and Planning B, 15, 153-164.

Daskin, M. & Dean, L., 2005. Location of health care facilities. Operations Research and Health Care, 43-76.

DCA, 2011. Data and maps for planning [online]. http://www.georgiaplanning.com/dataforplanning.asp [Accessed 2011].

Eaton, D.J., Daskin, M.S., Simmons, D., Bulloch, B. & Jansma, G., 1985. Determining emergency medical service vehicle deployment in Austin, Texas. Interfaces, 96-108.

Griffin, P.M., Scherrer, C.R. & Swann, J.L., 2008. Optimization of community health center locations and service offerings with statistical need estimation. IIE Transactions, 40 (9), 880-892.

Haghani, A., 1996. Capacitated maximum covering location models: Formulations and solution procedures. Journal of advanced transportation, 30 (3), 101-136.

Henderson, S. & Mason, A., 2005. Ambulance service planning: Simulation and data visualisation. Operations Research and Health Care, 77-102.

Indriasari, V., Mahmud, A.R., Ahmad, N. & Shariff, A.R.M., 2010. Maximal service area problem for optimal siting of emergency facilities. International Journal of Geographical Information Science, 24 (2), 213-230.

Liao, K. & Guo, D., 2008. A clustering based approach to the capacitated facility location problem. Transactions in GIS, 12 (3), 323-339.

Megiddo, N., Zemel, E. & Hakimi, S.L., 1981. The maximum coverage location problem: Northwestern University.

Murray, A.T., 2010. Advances in location modeling: GIS linkages and contributions. Journal of Geographical Systems, 12 (3), 335-354.

Murray, A.T. & Gerrard, R.A., 1997. Capacitated service and regional constraints in location-allocation modeling. Location Science, 5 (2), 103-118.

Murray, A.T. & O'Kelly, M.E., 2002. Assessing representation error in point-based coverage modeling. Journal of Geographical Systems, 4 (2), 171-191.

OEMS, 2006. Office of emergency medical services/trauma operating report.

Pirkul, H. & Schilling, D.A., 1991. The maximal covering location problem with capacities on total workload. Management Science, 37 (2), 233-248.

Ratick, S.J., Osleeb, J.P. & Hozumi, D., 2009. Application and extension of the Moore and ReVelle hierarchical maximal covering model. Socio-Economic Planning Sciences, 43 (2), 92-101.

Tong, D. & Murray, A.T., 2009. Maximising coverage of spatial demand for service. Papers in regional science, 88 (1), 85-97.

Verter, V. & Lapierre, S.D., 2002. Location of preventive health care facilities. Annals of Operations Research, 110 (1), 123-132.

Yin, P. & Mu, L., 2011. Service area spatial demand representation in maximal coverage modeling. Manuscript submitted for publication.

CHAPTER 5

AN EMPIRICAL COMPARISON OF SPATIAL DEMAND REPRESENTATIONS IN

MAXIMAL COVERAGE MODELING⁴

⁴ Yin, P. and Mu, L. To be submitted to Environment and Planning B.

Abstract

Operationally representing spatial demand is necessary to apply location models to

planning processes and is closely related to the efficiency of modeling solutions. A spatial demand

representation should not only minimize representation error but also keep the

complexity of the model as low as possible. Most current research, however, focuses primarily

on assessing and reducing or eliminating representation error while ignoring the

modeling complexity associated with demand representation. In this study, we use set-theoretic

expressions to formalize a polygon-overlay-based demand representation called service area

spatial demand representation (SASDR). Using the maximal covering location problem (MCLP)

as an example, we empirically compare SASDR to widely-used point-based and regular-area-

based demand representations in terms of both problem complexity and representation error. Our

study shows that, although use of SASDR can eliminate some errors associated with other

demand representations, problem complexity with SASDR can become extremely high as the

number of potential facility sites increases, making the problem computationally intractable for exact

methods in current optimization software. Point-based demand representation with fine

granularity is sometimes a good alternative to SASDR because it can provide similarly effective

modeling solutions while avoiding extensive computation in GIS for the realization of SASDR.

Regular-area-based demand representation is not strongly recommended based on its poor

performance compared to the point-based demand representation with a similar problem

complexity.

Keywords: MCLP, Spatial demand representation, Representation error, Problem complexity,

GIS

5.1 Introduction

The fact that different scale- and/or unit-definitions in geographic analyses produce

different results is known as the modifiable areal unit problem (MAUP) (Openshaw and Taylor

1981). The MAUP is important not only in general areas of geographic analysis, but also in

location modeling where the MAUP is manifested in aggregation and representation errors

(Cromley et al. 2012). There has been a long history of study on aggregation error in location

modeling including p-median problems and covering location problems (Hillsman and Rhoda

1978, Goodchild 1979, Current and Schilling 1987, Daskin et al. 1989, Current and Schilling

1990, Hodgson and Neuman 1993, Bowerman et al. 1999, Francis et al. 2009, Cromley et al.

2012). More recently, representation error in location modeling, especially covering location

models, has started to receive more attention (Murray and O'Kelly 2002, Murray et al. 2008,

Tong and Murray 2009, Cromley et al. 2012).

For covering location modeling, it is common to assume that aggregated or continuous

spatial demand is concentrated on a set of points or uniformly distributed within areal units. With

respect to these point-based and area-based demand representations, there are several studies

focusing on assessing the associated representation errors (Murray and O'Kelly 2002, Murray et

al. 2008). Several other studies tried to reduce or eliminate the representation errors by new

covering model formulations (Murray 2005, Tong and Murray 2009). Unlike the

traditional area-based representations using census units or regular polygons, such as triangles or

rectangles, as demand objects, Cromley et al. (2012) proposed a new area-based demand

representation that partitions a continuous demand space using polygon overlay methods into a

set of areal units called the least common demand coverage units (LCDCUs). This representation

approach, without complicated model formulations, could reduce or eliminate some errors

associated with the traditional point-based and area-based representations.

Current studies with respect to spatial demand representations primarily focus on the

evaluation of representation errors and how to reduce or eliminate these errors. However, the

complexity of problems associated with demand representations is rarely discussed. Many

covering location models, such as the maximal covering location problem (MCLP), have been

proven to be nondeterministic polynomial time (NP)-hard (Megiddo et al. 1981), which implies

that no known algorithm solves every instance of them in polynomial time.

Moreover, the size of a covering location problem depends strongly on the demand representation it

adopts. Therefore, even if a demand representation approach theoretically reduces or

eliminates some representation errors in a problem, it could make the problem difficult,

if not impossible, to solve using exact methods in current optimization software. Relying on

some heuristic algorithms to solve such a complicated problem may introduce other errors in

modeling results.

As Cromley et al.’s (2012) spatial demand representation with LCDCUs is based on the

service area of a facility at each potential facility site, we define this representation as service

area spatial demand representation (SASDR). In this paper, we use the MCLP as an example to

empirically compare SASDR to the traditional point-based and regular-area-based

representations where both representation error and problem complexity are simultaneously

considered. Specifically, we evaluate problem complexity associated with these three types of

demand representations and compare their representation errors given similar degrees of problem

complexity. This comparison is expected to provide some insight on how to choose appropriate

demand representations in practical applications. Although the question of how to realize

SASDR with GIS was briefly described in prose by Cromley et al. (2012), it is worth formalizing

the process of its realization for greater precision and clarity. In the following two sections,

more details about representation error and problem complexity in the MCLP are reviewed. Next,

the formalization of SASDR is given and explained. Experimental designs for understanding the

problem complexity and modeling errors associated with the three types of demand

representations are then described, followed by the experimental results and discussions. Finally,

some conclusions are offered.

5.2 Representation Error in Covering Location Modeling

In covering location modeling, aggregation and representation errors are related but

fundamentally different. Murray and O’Kelly (2002) have noted that the aggregation of spatial

information assumes there is one true lowest level of data. For example, the population at any

higher level in the census hierarchy is an instance of the aggregation of the population at any

lower level such as the census block level. Aggregation error occurs in any analysis conducted

above the level of the individual or whenever a scale change occurs (Cromley et al. 2012).

Compared with demand aggregation, demand representation usually has no hierarchy such as that

in census data. Individual demand is usually represented by the location point of that demand.

Any aggregated or continuous demand is often assumed to be concentrated on a set of points or

uniformly distributed within areal units. With different point or areal tessellations for

representing the same aggregated or continuous demand in a region, some modeling errors could

occur. Such representation error is usually measured by comparing modeling results with one

spatial demand representation to those with another at the same aggregation levels.

It is a long-held tradition that continuous demand is represented by a set of discrete

weighted points where the weight represents the amount of demand for service on that point.

Many location models including the MCLP were proposed based on this kind of demand

representation. Along with the development of GIS in location science, areal units have been

used to represent continuous demand due to the 2-dimensional nature of demand space and the

strong capability of GIS to manipulate 2-dimensional spatial objects (Miller 1996, Kim and

Murray 2008, Murray et al. 2008, Tong et al. 2009, Tong and Murray 2009, Alexandris and

Giannikos 2010). Figure 5.1 shows four examples of the traditional point-based and area-based

representations for the demand in a region with three polygons. In Figure 5.1(a), the demand in

each polygon is assumed to be concentrated on the centroid of that polygon or uniformly

distributed within that polygon. Figure 5.1(b) shows a rectangular grid, or its centroids, used to

represent the demand space, where the demand in each rectangle is assumed to be uniformly

distributed or concentrated on its centroid. When the demand within each demand object cannot

be obtained directly, which is very common, it may need to be estimated using areal

interpolation techniques applied to other available demand data whose unit boundaries do not

match those of the demand representation. In particular, intelligent areal interpolation methods, which

are based on the principles of dasymetric mapping, can usually provide better estimates of the

spatial heterogeneity of demand within areal units than simple areal interpolation methods do

(Cromley et al. 2012).
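
As a rough illustration of the simple areal interpolation mentioned above, the following sketch apportions the demand of source units to a target demand object in proportion to the overlapping area. The axis-aligned rectangles, the function names, and all numbers are assumptions made only for this example; a real application would rely on polygon overlay in GIS, and an intelligent (dasymetric) method would additionally restrict demand to ancillary classes such as developed land.

```python
# Minimal sketch of simple (area-weighted) areal interpolation.
# Assumption: every unit is an axis-aligned rectangle (xmin, ymin, xmax, ymax)
# and demand is uniformly distributed inside each source unit.

def rect_area(rect):
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def overlap_area(r1, r2):
    # Intersection of two rectangles; the area is zero when they do not overlap.
    xmin, ymin = max(r1[0], r2[0]), max(r1[1], r2[1])
    xmax, ymax = min(r1[2], r2[2]), min(r1[3], r2[3])
    return rect_area((xmin, ymin, xmax, ymax))

def interpolate_demand(target, sources):
    """Estimate the demand of `target` from (rectangle, demand) source pairs."""
    return sum(
        demand * overlap_area(target, rect) / rect_area(rect)
        for rect, demand in sources
    )

# Hypothetical source units (e.g., census blocks) and one target demand object.
sources = [((0, 0, 2, 2), 100.0), ((2, 0, 4, 2), 40.0)]
target = (1, 0, 3, 2)
estimate = interpolate_demand(target, sources)
```

In this toy case the target overlaps half of each source unit, so it receives half of each unit's demand.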

Figure 5.1. Examples of spatial demand representations with (a) census blocks or their centroids, and (b) a rectangular grid or its centroids

In many covering location models, the demand of a demand object has only a binary status:

it is either completely covered by a facility or not covered at all. In Figure 5.1, we assume a facility

(the star) with circular service coverage is located in the region. According to the point-based

demand representation in Figure 5.1(a), the demand within polygon C is considered covered by

the facility since its centroid is within the service coverage. No demand in polygons A and B is

considered covered since both of their centroids are outside the service coverage. However, the

reality is that a portion of demand within polygon C is not covered while a portion of demand

within polygons A and B is covered. Based on the area-based representation in Figure 5.1(a), no

demand in the whole region is considered covered since none of these three polygons is

completely within the service coverage. However, it is true that a portion of demand in these

three polygons is covered. A similar situation occurs when using the point-based or area-based

demand representations in Figure 5.1(b). Assuming the demand estimate within each areal unit is

“real”, we can see that point-based demand representation could either underestimate or

overestimate the amount of “real” demand covered, whereas traditional area-based demand

representation could underestimate the amount of “real” demand covered. Such underestimation

or overestimation will lead to modeling errors in both the total amount of covered demand

estimated by the objective functions of models and the configuration of facilities given by the

decision variables in model results.
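
The over- and underestimation described above can be reproduced with a small numerical sketch. It assumes a single square demand unit with uniformly distributed demand and a circular service area; the function names, the sampling resolution, and the coordinates are hypothetical and chosen only for illustration.

```python
import math

def in_circle(x, y, circle):
    cx, cy, radius = circle
    return math.hypot(x - cx, y - cy) <= radius

def point_based(unit, demand, circle):
    """All demand counts as covered if the unit's centroid is covered."""
    xmin, ymin, xmax, ymax = unit
    return demand if in_circle((xmin + xmax) / 2, (ymin + ymax) / 2, circle) else 0.0

def area_based(unit, demand, circle):
    """All demand counts as covered only if the whole unit is covered.
    A rectangle lies inside a circle exactly when its four corners do."""
    xmin, ymin, xmax, ymax = unit
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    return demand if all(in_circle(x, y, circle) for x, y in corners) else 0.0

def sampled(unit, demand, circle, n=200):
    """Approximate the covered share of uniformly distributed demand by
    testing the midpoints of an n-by-n lattice of sub-cells."""
    xmin, ymin, xmax, ymax = unit
    dx, dy = (xmax - xmin) / n, (ymax - ymin) / n
    hits = sum(
        in_circle(xmin + (i + 0.5) * dx, ymin + (j + 0.5) * dy, circle)
        for i in range(n)
        for j in range(n)
    )
    return demand * hits / (n * n)

unit, demand = (0.0, 0.0, 2.0, 2.0), 100.0
large = (0.0, 0.0, 1.5)  # circle (cx, cy, radius) that contains the centroid
small = (0.0, 0.0, 1.2)  # circle that misses the centroid
```

With the larger circle, the point-based rule credits all 100 units of demand and the area-based rule credits none, whereas the sampled share is a little under half. With the smaller circle, the point-based rule also credits nothing although more than a quarter of the unit is covered.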

Based on the discussions by Casillas (1983) and Cromley et al. (2012), representation

error is defined as the difference between the objective function values optimized for the same

study area with two different demand representations. We use Cromley et al.’s (2012)

terminology and consider the following notation:

$f_a$ is an objective function using representation a

$f_b$ is an objective function using representation b

$x_a$ is the optimal solution to the problem using representation a

$x_b$ is the optimal solution to the problem using representation b

Taking representation b as the reference, representation error is defined as follows:

$$\text{Representation error} = \frac{f_a(x_a) - f_b(x_b)}{f_b(x_b)}$$ Equation 5.1

Representation error can be decomposed into cost error and optimality error. Cost error is the

difference between the objective function values of the same solution measured with two

different demand representations, which is shown as follows:

$$\text{Cost error} = \frac{f_a(x_a) - f_b(x_a)}{f_b(x_b)}$$ Equation 5.2

Optimality error is the difference between the objective function values of two optimal solutions

measured with the same demand representation. It is defined as follows:

$$\text{Optimality error} = \frac{f_b(x_a) - f_b(x_b)}{f_b(x_b)}$$ Equation 5.3
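
The three error measures can be written as small functions, which also makes the decomposition easy to check: because all three share the denominator $f_b(x_b)$, cost error plus optimality error equals representation error. The objective values below are hypothetical numbers, not results from this study.

```python
# Sketch of Equations 5.1 to 5.3, with representation b as the reference.
# fa_xa = f_a(x_a), fb_xa = f_b(x_a), fb_xb = f_b(x_b).

def representation_error(fa_xa, fb_xb):
    return (fa_xa - fb_xb) / fb_xb

def cost_error(fa_xa, fb_xa, fb_xb):
    return (fa_xa - fb_xa) / fb_xb

def optimality_error(fb_xa, fb_xb):
    return (fb_xa - fb_xb) / fb_xb

# Hypothetical covered-demand values for illustration only.
fa_xa = 9500.0   # solution x_a evaluated with its own representation a
fb_xa = 9000.0   # the same solution re-evaluated with reference representation b
fb_xb = 10000.0  # optimal solution of the reference problem

rep = representation_error(fa_xa, fb_xb)
cost = cost_error(fa_xa, fb_xa, fb_xb)
opt = optimality_error(fb_xa, fb_xb)
```

For a maximization problem such as the MCLP, the optimality error computed this way is never positive, since $x_b$ is optimal under representation b.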

5.3 The MCLP Model and Problem Complexity

Given a covering standard for a service, such as maximum distance or travel time, the

objective of the MCLP is to locate a fixed number of facilities to provide service coverage for as

much spatial demand as possible. Consider the following notation:

I = the set of demand objects (i as demand object index)

J = the set of potential facility sites (j as facility site index)

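$d_{ij}$ = the travel distance or time from potential facility site j to demand object i

S = the distance or time beyond which a demand object is considered ‘uncovered’

$w_i$ = the demand for service at i

p = the total number of facilities to be located

$x_j = \begin{cases} 1 & \text{if facility site } j \text{ is selected} \\ 0 & \text{otherwise} \end{cases}$

$y_i = \begin{cases} 1 & \text{if demand } i \text{ is covered (or served)} \\ 0 & \text{otherwise} \end{cases}$

$a_{ij} = \begin{cases} 1 & \text{if facility site } j \text{ is capable of serving demand } i \text{, i.e., } d_{ij} \le S \\ 0 & \text{otherwise} \end{cases}$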

The formulation of the MCLP (Church and ReVelle 1974) is

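$$\text{Maximize} \quad \sum_{i \in I} w_i y_i$$ Equation 5.4

Subject to

$$\sum_{j \in J} a_{ij} x_j \ge y_i \quad \forall i$$ Equation 5.5

$$\sum_{j \in J} x_j = p$$ Equation 5.6

$$x_j \in \{0, 1\} \quad \forall j$$ Equation 5.7

$$y_i \in \{0, 1\} \quad \forall i$$ Equation 5.8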

The objective (Equation 5.4) seeks to maximize the amount of weighted demand covered.

Constraints 5.5 require that demand i can be covered only if at least one facility is located at the

sites where the service can cover demand i. Constraint 5.6 specifies the total number of facilities

to be located. Constraints 5.7 and 5.8 impose integrality conditions on decision variables.
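
For very small instances, the formulation above can be solved by simply enumerating every choice of p sites. The sketch below is not the solution method used in this study (which relies on optimization software); it is a brute-force illustration in which `coverage[j]` is assumed to hold the set of demand objects i with $a_{ij} = 1$, and all data are hypothetical.

```python
from itertools import combinations

def covered_demand(chosen, weights, coverage):
    """Objective value (Equation 5.4) for a given set of selected sites."""
    covered = set().union(*(coverage[j] for j in chosen))
    return sum(weights[i] for i in covered)

def solve_mclp_exhaustive(weights, coverage, p):
    """Enumerate all p-site subsets (Equation 5.6) and keep the best one.
    Only practical for tiny problems: the number of subsets grows combinatorially."""
    best_sites, best_value = None, -1.0
    for chosen in combinations(sorted(coverage), p):
        value = covered_demand(chosen, weights, coverage)
        if value > best_value:
            best_sites, best_value = chosen, value
    return best_sites, best_value

# Hypothetical instance: four demand objects, three potential sites.
weights = {1: 10, 2: 20, 3: 30, 4: 40}
coverage = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
best_sites, best_value = solve_mclp_exhaustive(weights, coverage, p=2)
```

Here sites A and C together cover all four demand objects, so they form the optimal pair for p = 2.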

The complexity of the MCLP problem mainly depends on the number of demand

constraints (Equation 5.5) and the number of integrality constraints on decision variables

(Equation 5.7) and (Equation 5.8). For each demand object (e.g., point or areal unit), if its

demand weight is larger than 0 and it can be covered by a facility at a potential location, there

will be a demand constraint and an integrality constraint associated with this demand object in

the MCLP model. Each potential facility site also contributes an integrality constraint to the

model. Therefore, the complexity of the MCLP problem is closely related to the spatial demand

representation and the number of potential facility sites in an application. When using census

units or their centroids to represent demand, the number of demand objects is equal to the

number of census units in the study area. However, when using a point grid or a regular area grid to

represent demand, the number of demand objects depends on the grid design, which is often

arbitrary.
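
The counting rule described in this section can be expressed directly. The following sketch assumes the same hypothetical data layout as a dictionary of demand weights and a dictionary mapping each potential site to the demand objects it can cover; the function name and the returned keys are illustrative only.

```python
def mclp_model_size(weights, coverage):
    """Count the constraints that a given demand representation contributes.

    weights:  dict mapping demand object i to its weight w_i
    coverage: dict mapping potential site j to the set of demand objects it can cover
    """
    coverable = set().union(*coverage.values())
    # Only demand objects with positive weight that some site can cover enter the model.
    active = [i for i, w in weights.items() if w > 0 and i in coverable]
    return {
        "demand_constraints": len(active),                       # Equation 5.5
        "integrality_constraints": len(active) + len(coverage),  # Equations 5.7 and 5.8
    }

# Hypothetical instance: object 2 has no demand and object 4 cannot be covered.
weights = {1: 10, 2: 0, 3: 5, 4: 7}
coverage = {"A": {1, 2}, "B": {3}}
size = mclp_model_size(weights, coverage)
```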

In applications of the MCLP model, the size of census unit or regular areal unit for

demand representation is usually smaller than the service coverage of a facility for better

accuracy of modeling results. Analysis based on a demand representation with finer granularity

(i.e., smaller size of demand object) also is expected to lead to smaller representation errors since

more complete demand objects can be covered within service coverage of a facility. With respect

to predefined potential facility sites, we need to consider multiple factors including cost, site

availability, proximity to demand, access to other services, etc., which may have large variability

in a region. More potential facility sites provide more facility configurations to choose from,

which in turn can increase the amount of demand covered by a given number

of facilities. However, as more demand objects and potential

facility sites are used to improve modeling results, the model can become dramatically more

complex and pose a computational challenge for exact methods in current commercial

optimization software. Heuristic methods, such as genetic algorithms, provide alternative

approaches to solving such complex location problems. However, they cannot guarantee optimal

solutions, which could introduce other errors in modeling results, and they also require

sophisticated algorithmic strategies and strong programming skills.
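
The point that heuristics cannot guarantee optimal solutions can be seen even on a toy instance. The sketch below uses a simple greedy adding heuristic, chosen here only because it is short (it is not the genetic algorithm mentioned above), and compares it with exhaustive enumeration. The instance and the function names are hypothetical.

```python
from itertools import combinations

def greedy_mclp(weights, coverage, p):
    """Greedy adding: repeatedly pick the site with the largest newly covered demand."""
    chosen, covered = [], set()
    for _ in range(min(p, len(coverage))):
        best_site, best_gain = None, -1
        for j in sorted(coverage):
            if j in chosen:
                continue
            gain = sum(weights[i] for i in coverage[j] - covered)
            if gain > best_gain:
                best_site, best_gain = j, gain
        chosen.append(best_site)
        covered |= coverage[best_site]
    return chosen, sum(weights[i] for i in covered)

def exact_mclp_value(weights, coverage, p):
    """Best objective value found by enumerating every p-site subset."""
    return max(
        sum(weights[i] for i in set().union(*(coverage[j] for j in combo)))
        for combo in combinations(sorted(coverage), p)
    )

# Hypothetical instance in which the locally best first choice (X) is a trap.
weights = {name: 1 for name in "abcdef"}
coverage = {"X": {"a", "b", "c", "d"}, "Y": {"a", "b", "e"}, "Z": {"c", "d", "f"}}

greedy_sites, greedy_value = greedy_mclp(weights, coverage, p=2)
optimal_value = exact_mclp_value(weights, coverage, p=2)
```

The greedy procedure commits to site X first and ends with five covered demand objects, while the pair Y and Z covers all six.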

5.4 Service Area Spatial Demand Representation

SASDR was originally described by Cromley et al. (2012) as an area-based demand

representation, with or without intelligent areal interpolation, and was compared with

census-centroid-based demand representation in terms of representation and scale error. In this section,

we use set-theoretic expressions to formalize the realization of SASDR, which makes it easier to

understand and to implement in different GIS software packages. In addition, we discuss

both the representation error and the problem complexity of SASDR based on its concept.

The map overlay process has been used for approximately 50 years, and its multiple

forms are important spatial analysis methods in GIS (McHarg 1969, Longley et al. 2005).

SASDR is based on one of the map overlay operations.

Considering two sets A (rectangle) and B (circle) in Figure 5.2(a), the overlay operation A▲B is

defined as below:

$$A \blacktriangle B = \{\, X \mid X \in I = \{A - B,\ A \cap B\} \text{ and } X \neq \emptyset \,\}$$ Equation 5.9

where I is a two-member set in which, as shown in Figure 5.2(b), member $A - B$ is the set of all

elements that are members of A but not members of B, and member $A \cap B$ is the set of all

elements that are members of both A and B. A▲B is the set whose members are the non-empty

members of I. Therefore, A▲B is a two-member set $\{A - B,\ A \cap B\}$ when $A \not\subseteq B$ and

$A \cap B \neq \emptyset$; a one-member set $\{A - B\}$ when $A \neq \emptyset$ and $A \cap B = \emptyset$; a one-member

set $\{A \cap B\}$ when $A \neq \emptyset$ and $A \subseteq B$; or the empty set $\emptyset$ when $A = \emptyset$.
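
The overlay operation of Equation 5.9 can be mimicked with finite sets, which makes the possible outcomes easy to inspect. This is only a sketch: the function name `overlay` is hypothetical, and plain Python sets of integers stand in for the polygons that a GIS would handle.

```python
def overlay(a, b):
    """Return A ▲ B: the non-empty members of {A - B, A ∩ B}, as a set of frozensets."""
    a, b = frozenset(a), frozenset(b)
    return {part for part in (a - b, a & b) if part}

two_parts = overlay({1, 2, 3}, {3, 4})  # A partly inside B
only_difference = overlay({1, 2}, {3})  # A and B do not intersect
only_intersection = overlay({1}, {1, 2})  # A lies entirely inside B
nothing = overlay(set(), {1})           # A is empty
```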

Figure 5.2. Illustration of overlay operation A▲B: (a) set A and set B, and (b) the result from A▲B

For a set of sets $C = \{C_i,\ i = 1, 2, 3, \ldots, n\}$ and a set D, overlay operation C▲D is defined

as below:

$$C \blacktriangle D = \bigcup_{i=1}^{n} \left( C_i \blacktriangle D \right)$$ Equation 5.10

Therefore, C▲D is actually a set of sets consisting of all members of the sets obtained by

conducting the overlay operation on each member Ci of set C with set D.

Because the set of potential facility sites and the service standard are given in our case,

the service area at each potential facility site can then be determined. Consider the following

notation:

U = the whole demand space

$S_j$ = the service area at potential facility site j (j = 1, 2, 3, …, m)

SASDR is defined as the partition of demand space U into a finite demand object set SA_DOS:

$$SA\_DOS = U \blacktriangle S_1 \blacktriangle S_2 \blacktriangle S_3 \blacktriangle \cdots \blacktriangle S_m$$ Equation 5.11

Each element $D \in SA\_DOS$ is defined as a demand object, also called an LCDCU following

Cromley et al.’s (2012) terminology. The demand objects are mutually disjoint, and

$\bigcup_{D \in SA\_DOS} D = U$.

Figure 5.3(a) shows an example in which a rectangular demand space U is partitioned

into a SASDR by two potential facility sites f1 and f2 with circular service areas S1 and S2. First,

demand space U is partitioned by service area S1, creating two demand objects

$U \blacktriangle S_1 = \{U - S_1,\ U \cap S_1\}$ (Figure 5.3(b)). Then, service area S2 is used to continue to partition

the demand space U. A total of four demand objects

$U \blacktriangle S_1 \blacktriangle S_2 = \{(U - S_1) - S_2,\ (U - S_1) \cap S_2,\ (U \cap S_1) - S_2,\ (U \cap S_1) \cap S_2\}$ are created in the final

SASDR (Figure 5.3(c)). Demand objects $(U \cap S_1) - S_2$ and $(U \cap S_1) \cap S_2$ can be completely

covered if a facility is located at site f1, and demand objects $(U - S_1) \cap S_2$ and $(U \cap S_1) \cap S_2$ can

be completely covered if a facility is located at site f2. Neither of the services can completely or

partially cover demand object $(U - S_1) - S_2$. Despite the simple circular shape demonstrated, the

facility service area could be any shape.
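
The two-site example can be imitated on a coarse raster, where each region is a set of grid cells rather than a polygon. The sketch below applies the overlay operations from left to right, which is an assumed reading of Equation 5.11; the grid size, site coordinates, radius, and function names are all hypothetical, and a GIS implementation would use polygon overlay instead of cells.

```python
import math

def overlay(a, b):
    """A ▲ B for two sets (Equation 5.9): non-empty members of {A - B, A ∩ B}."""
    a, b = frozenset(a), frozenset(b)
    return {part for part in (a - b, a & b) if part}

def overlay_family(family, d):
    """C ▲ D for a set of sets C (Equation 5.10)."""
    result = set()
    for member in family:
        result |= overlay(member, d)
    return result

def sasdr(demand_space, service_areas):
    """Partition the demand space by every service area in turn (Equation 5.11)."""
    objects = {frozenset(demand_space)} if demand_space else set()
    for area in service_areas:
        objects = overlay_family(objects, area)
    return objects

def disc(cells, centre, radius):
    """Rasterized circular service area: cells whose midpoint lies within the radius."""
    cx, cy = centre
    return {(c, r) for (c, r) in cells
            if math.hypot(c + 0.5 - cx, r + 0.5 - cy) <= radius}

# Hypothetical 10-by-6 demand space with two overlapping service areas.
cells = {(c, r) for c in range(10) for r in range(6)}
s1 = disc(cells, (3.5, 3.0), 2.5)
s2 = disc(cells, (6.5, 3.0), 2.5)
objects = sasdr(cells, [s1, s2])
```

Every resulting demand object lies either entirely inside or entirely outside each service area, which is the property that removes the partial coverage problem.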

We can see that SASDR is fundamentally a simple map overlay-based approach.

Compared to point-based demand representations, it uses areal demand units that can reduce the

potential measurement and coverage errors caused by aggregating continuous demand to discrete

point demands. Compared to those traditional area-based demand representations using census

units or regular area grid, it has the advantage that all demand objects will either be completely

covered or not be covered by the service from any potential facility site. Without the partial

coverage problem, the modeling is more efficient than those in which the partial coverage needs

to be handled explicitly in models to reduce modeling errors, such as those proposed by Murray

(2005) and Tong and Murray (2009).

Figure 5.3. The SASDR with circular facility service area: (a) demand space U and two potential service areas S1 and S2, (b) the partition of demand space U with service area S1, and (c) the partition of demand space U with both service areas S1 and S2

Unlike point-based and traditional area-based demand representations, where the

number of demand objects is independent of the configuration of potential facility sites, the

number and arrangement of demand objects in SASDR are completely determined by the service

standard and the configuration of potential facility sites in an application. In other words, the

complexity of an MCLP model using SASDR is a function of the combination of service standard

and configuration of potential facility sites. This could be a problem when a high density of

potential facility sites is needed.

5.5 Experimental Design

Unlike previous studies where the comparisons of spatial demand representations only

focus on representation error, we also simultaneously consider problem complexity associated

with spatial demand representations. It is known that the increase of demand objects or potential

facility sites is expected to reduce representation error and improve the optimality of modeling

solutions. In our experiments, we mainly focus on the following two questions:

(1) How does the complexity of a problem using SASDR change when varying service

standard and configuration of potential facility sites?

(2) Given similar degrees of problem complexity, is there a large representation error

between SASDR and other types of demand representations including point-based

and traditional area-based approaches?

The study area in the experiments is the City of Decatur, Georgia, which has an area of

approximately 4.2 square miles. The 2010 U.S. Census population data at the block level are

used to estimate the demand of each spatial object in all representations. To improve the

accuracy of the demand estimation, we use the 2010 land use data showing developed and

undeveloped areas as ancillary data and overlay them on the census population data so that the entire

population is constrained within the developed areas. The 2010 land use data were downloaded

from the website of the Atlanta Regional Commission (ARC 2012).

To address question 1, we design three modes for potential facility

sites including one regular pattern and two irregular patterns as shown in Figure 5.4. Figure 5.4(a)

shows regular grid points with spacing R. Figure 5.4(b) shows the centroids of all census blocks,

and Figure 5.4(c) shows all intersections of major roads in the study area. The GIS data for both

census blocks and major roads came from the 2010 Census data. For the mode of regular grid

points in Figure 5.4(a), we set spacing R to five values: 500 m, 400 m,

300 m, 250 m, and 200 m, which produce 42, 66, 116, 177, and 272 potential facility sites, respectively. Then,

the same numbers of potential facility sites are randomly chosen from the centroids of census

blocks in Figure 5.4(b) and the intersections of major roads in Figure 5.4(c). Finally, we have

a total of 15 configurations of potential facility sites with three modes (regular grid point, centroid of

census block, and intersection of roads) and five different numbers of sites (42, 66, 116, 177, and

272). With respect to the service standard of facilities, we define circular service coverage with

three different radii: 300m, 650m, and 1000m. With each combination of service standard and

configuration of potential facility sites, we create a SASDR and record the number of demand

objects.
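As a rough illustration of why the number of demand objects depends on both the service radius and the site configuration, the sketch below approximates the overlay of circular service areas by sampling a fine lattice and counting distinct sets of covering sites. This is a swapped-in raster approximation, not the vector polygon overlay used in the study: it ignores slivers smaller than the sampling cell and counts disconnected pieces with the same covering set only once. The square toy region and all names are assumptions.

```python
import math


def grid_points(width, height, spacing):
    """Regular grid of candidate sites covering a width x height rectangle."""
    xs = [i * spacing for i in range(int(width // spacing) + 1)]
    ys = [j * spacing for j in range(int(height // spacing) + 1)]
    return [(x, y) for x in xs for y in ys]


def count_coverage_signatures(sites, radius, width, height, cell=25.0):
    """Count distinct non-empty sets of sites whose service circles cover a sample point."""
    signatures = set()
    for ix in range(int(width / cell)):
        for iy in range(int(height / cell)):
            px, py = (ix + 0.5) * cell, (iy + 0.5) * cell
            covering = frozenset(
                k for k, (sx, sy) in enumerate(sites)
                if math.hypot(px - sx, py - sy) <= radius
            )
            if covering:  # points outside every service area are not demand objects
                signatures.add(covering)
    return len(signatures)


# Hypothetical 2000 m x 2000 m region with a 650 m service radius.
coarse_sites = grid_points(2000, 2000, 1000)
fine_sites = grid_points(2000, 2000, 500)
n_coarse = count_coverage_signatures(coarse_sites, 650, 2000, 2000)
n_fine = count_coverage_signatures(fine_sites, 650, 2000, 2000)
```

Halving the grid spacing in this toy region adds candidate sites and increases the number of distinct coverage sets, which mirrors the qualitative behavior recorded in the experiment.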

Figure 5.4. Three modes of potential facility sites: (a) regular grid points with spacing R, (b) centroids of census blocks, and (c) intersections of major roads

For question 2, we use circular service coverage with a radius of 1000m in the

experiment. Among the 15 configurations of potential facility sites created in the previous

experiment, we choose two configurations with 66 and 272 grid points and two configurations

with 66 and 272 centroids of census blocks. Therefore, there are four SASDRs in total, from the

combination of one type of circular service coverage with four configurations of potential

facility sites. In all four situations, the whole study area can be covered by the service if

enough facilities are located. For the traditional demand representations compared

with the SASDRs, we use four rectangle grids as the examples of traditional area-based demand

representation, and use the centroids of these rectangle grids as the examples of point-based

demand representation (Figure 5.5). By adjusting the spacing of the rectangle grid, we make the

numbers of demand objects in these four grid-rectangle-based and four grid-point-based demand

representations close to those in the four SASDRs. In total, there are four groups of

problems in this experiment for comparison, each of which includes three problems that have

different types of demand representations but similar degrees of problem complexity. The

number of facilities to be located, p in Equation 5.6, starts from 1 for all of the problems and

increases by 1 each time until the model reports that 100% of demand is covered.
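The procedure of raising p until the model reports full coverage can be sketched as follows. The study solves each MCLP with CPLEX; this sketch swaps in exhaustive enumeration, which is only practical for toy instances. The coverage sets, demand values, and function names are hypothetical.

```python
from itertools import combinations


def solve_mclp(cover_sets, demand, p):
    """Pick p sites that maximize covered demand by exhaustive search (toy sizes only)."""
    best_value, best_sites = -1, ()
    for sites in combinations(range(len(cover_sets)), p):
        covered = set().union(*(cover_sets[s] for s in sites))
        value = sum(demand[i] for i in covered)
        if value > best_value:
            best_value, best_sites = value, sites
    return best_value, best_sites


def min_facilities_for_full_coverage(cover_sets, demand):
    """Increase p from 1 until the reported covered demand equals total demand."""
    total = sum(demand.values())
    for p in range(1, len(cover_sets) + 1):
        value, sites = solve_mclp(cover_sets, demand, p)
        if value >= total:
            return p, sites
    return None  # full coverage is impossible with the given sites


# Hypothetical instance: demand per demand object and the objects each site covers.
demand = {0: 10, 1: 20, 2: 30, 3: 40}
cover_sets = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]
result = min_facilities_for_full_coverage(cover_sets, demand)
```

For this toy instance no single site covers all demand, while sites 0 and 2 together do, so the loop stops at p = 2.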

Figure 5.5. Examples of grid-point-based and grid-rectangle-based demand representations for comparison with SASDR

ArcGIS™ v10 is used to realize the SASDR and its visualization. Programming with

Visual Basic for Applications (VBA) for ArcObjects in ArcGIS™ v10 is used to structure the

optimization model file. The problems are solved using the commercial optimization package

CPLEX v12.2, which uses a branch-and-cut technique to search for the optimal solution (CPLEX Help

2011). All analyses are carried out on a personal computer with Intel Core Quad 2.4 GHz CPU

and 3 GB RAM.

5.6 Results and Discussions

5.6.1 Problem Complexity with SASDR

Table 5.1 summarizes the numbers of demand objects in 45 SASDRs with different

combinations of service radius (SR) and configuration of potential facility sites. We can see that,

regardless of whether the pattern of potential facility sites is regular (grid point) or irregular

(block centroid or road intersection), the number of demand objects in the SASDR increases

dramatically with the increase of the number of potential facility sites. Taking the group with

grid points for potential facility sites and SR=1000m as an example, an increment in the number

of potential facility sites by a factor of 6.5 (i.e. 272/42) increases the number of demand objects

by a factor of 39.4 (i.e. 37012/939). Such a sharply increasing trend is even more obvious when

SR=300m and SR=650m in this experiment.
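The growth factors quoted above can be reproduced from the grid-point rows of Table 5.1. The short sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Counts taken from Table 5.1 for the grid-point mode, as (42 sites, 272 sites).
sites = (42, 272)
demand_objects = {300: (109, 5276), 650: (533, 22467), 1000: (939, 37012)}

site_factor = sites[1] / sites[0]
object_factors = {sr: high / low for sr, (low, high) in demand_objects.items()}

for sr, factor in object_factors.items():
    print(f"SR={sr}m: sites x{site_factor:.1f}, demand objects x{factor:.1f}")
```

The factors for SR=300m and SR=650m come out larger than the one for SR=1000m, which is the sharper trend the text refers to.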

Table 5.1. Numbers of demand objects in 45 SASDRs

Mode / Number of potential     Number of demand objects
facility sites                 SR = 300m    SR = 650m    SR = 1000m
Grid_Point/42                  109          533          939
Grid_Point/66                  427          1,479        2,120
Grid_Point/116                 783          4,302        7,162
Grid_Point/177                 2,849        8,355        15,505
Grid_Point/272                 5,276        22,467       37,012
Block_Centroid/42              162          490          904
Block_Centroid/66              500          1,434        2,425
Block_Centroid/116             1,026        3,839        7,007
Block_Centroid/177             2,566        9,347        16,385
Block_Centroid/272             5,948        21,064       37,721
Road_Intersection/42           123          490          917
Road_Intersection/66           323          1,222        1,938
Road_Intersection/116          1,031        3,628        6,701
Road_Intersection/177          2,670        9,584        16,897
Road_Intersection/272          5,884        21,140       37,467

With the same number of potential facility sites and SR, we note that the number of

demand objects in SASDR with a regular pattern of potential facility sites could be either larger or

smaller than that with an irregular pattern of potential facility sites. Therefore, there is no obvious rule

for the numbers of demand objects in SASDRs between regular and irregular patterns of potential

facility sites. Since the number of demand objects in SASDR is determined by both SR and

configuration of potential facility sites, we use Site-Service Index to measure the degree of

clustering of potential facility sites at the scale defined by SR. Site-Service Index describes the

average number of potential facility sites within a circle with radius = 2SR and is defined as

follows:

\[ \text{Site-Service Index} = \frac{\sum_{i}^{N} \sum_{j}^{N} I\left(d_{ij} \le 2SR\right)}{N} \]

Equation 5.12

where i and j are the indexes of potential facility sites, dij is the distance between potential

facility sites i and j, N is the total number of potential facility sites in a study region, and I(·) is an

indicator function. We define the ratio of the total number of demand objects in SASDR to N as

demand object density. Figure 5.6 shows the scatter plot of Site-Service Index and demand

object density for the 45 SASDRs in our experiment. We can see there is a strong linear

relationship between these two measures for either regular or irregular patterns of potential

facility sites. The R² is 0.998 across all three modes of potential facility sites. This linear

relationship can be used to predict the number of demand objects in SASDR with circular service

coverage, which equals the product of demand object density and N. Given a fixed

study area and SR, when N increases beyond some point, the spatial pattern of potential facility

sites becomes increasingly clustered, and the Site-Service Index increases accordingly,

which indicates an increase of the demand object density based on the linear relationship.

Therefore, both increases of N and demand object density will make the total number of demand

objects rise quickly.
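The Site-Service Index and the prediction it supports can be sketched as below. The double sum is taken over all ordered pairs, including i = j, which is an assumption about how Equation 5.12 treats a site's own circle. The default slope and intercept are the fitted values printed with Figure 5.6 (y = 0.8335x + 1.5637) and apply only to this experiment; the toy coordinates and function names are hypothetical.

```python
import math


def site_service_index(sites, sr):
    """Average number of sites within distance 2*SR of a site (pairs with i == j included)."""
    n = len(sites)
    count = 0
    for xi, yi in sites:
        for xj, yj in sites:
            if math.hypot(xi - xj, yi - yj) <= 2 * sr:
                count += 1
    return count / n


def predicted_demand_objects(index, n_sites, slope=0.8335, intercept=1.5637):
    """Predicted count = fitted demand object density * N (line fitted for this study only)."""
    return (slope * index + intercept) * n_sites


# Hypothetical sites on a line (meters) with SR = 100 m.
toy_sites = [(0.0, 0.0), (100.0, 0.0), (500.0, 0.0)]
toy_index = site_service_index(toy_sites, 100.0)
toy_prediction = predicted_demand_objects(toy_index, len(toy_sites))
```

In the toy case five ordered pairs fall within 200 m, so the index is 5/3. The prediction step simply multiplies the fitted density by N, as described in the text.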

Figure 5.6. Relationship between Site-Service Index and demand object density

in SASDR with circular service coverage

Based on the above experimental results, it is clear that the problem complexity could

become extremely high when a large number of highly clustered potential facility sites is set.

In many practical applications, especially those working with continuous space, the number of

potential facility sites could easily rise to thousands or even millions, and they could be highly

clustered relative to the service coverage. The rapid growth of problem size with the

increase of potential facility sites could make the problem computationally intractable for exact

methods in current optimization software. In addition, the realization of SASDR with a large

number of potential facility sites could also be a challenge for current GIS software, since

polygon overlay algorithms remain one of the most difficult and complex parts of

vector-based GIS (Longley et al. 2005).

[Figure 5.6 plot: demand object density (y-axis, 0 to 160) against Site-Service Index (x-axis, 0 to 200) for the Grid_Point, Road_Intersection, and Block_Centroid modes; fitted line y = 0.8335x + 1.5637, R² = 0.998]

5.6.2 Comparison in Representation Error

Given SR=1000m, Table 5.2 shows the numbers of demand objects in the four groups of

problems with three types of demand representations for comparison. In the SASDRs in this

experiment, the configurations of 66 potential facility sites lead to about 2,000 demand objects,

while 272 potential facility sites lead to over 30,000 demand objects. The different numbers of

demand objects also reflect the degrees of granularity of the demand representations. Since the

difference in the number of demand objects within each group is less than 0.1%, and the same

configuration of potential facility sites is used for the three problems in each group, the problems

in each group for comparison have similar degrees of complexity.

Table 5.2. Numbers of demand objects in all demand representations for comparison

Mode / Number of potential     Number of demand objects
facility sites                 SASDR        Point or rectangle grid    Difference
Grid_Point/66                  2,120        2,120                      0.00%
Grid_Point/272                 37,012       36,988                     0.06%
Block_Centroid/66              2,425        2,426                      0.04%
Block_Centroid/272             37,721       37,715                     0.02%
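The differences in Table 5.2 can be checked with a few lines. The sketch assumes the difference is the absolute gap relative to the SASDR count; that choice of denominator is an assumption, although it reproduces the published two-decimal values.

```python
# (SASDR count, point or rectangle grid count) from Table 5.2.
counts = {
    "Grid_Point/66": (2120, 2120),
    "Grid_Point/272": (37012, 36988),
    "Block_Centroid/66": (2425, 2426),
    "Block_Centroid/272": (37721, 37715),
}


def percent_difference(sasdr, grid):
    """Absolute difference in demand object counts, as a percentage of the SASDR count."""
    return abs(sasdr - grid) / sasdr * 100


differences = {name: percent_difference(*pair) for name, pair in counts.items()}
```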

Table 5.3 shows the minimum numbers of facilities reported by the objective functions to

cover 100% demand in the study area. As expected, with more potential facility sites, fewer

facilities are usually needed to cover the same demand space. We also notice that one more facility is needed

for the grid-rectangle demand representation than for the other two demand representations when using

66 block centroids as the potential facility sites. This is mainly due to the underestimation of “real”

covered demand by the grid-rectangle demand representation.

Table 5.3. Minimum numbers of facilities reported by models for covering 100% demand

Mode / Number of potential     Minimum number of facilities for 100% demand coverage
facility sites                 SASDR        Point grid    Rectangle grid
Grid_Point/66                  8            8             8
Grid_Point/272                 7            7             7
Block_Centroid/66              9            9             10
Block_Centroid/272             7            7             7

Figure 5.7 shows the percentages of covered demand reported by the MCLP models with

three types of demand representations for four configurations of potential facility sites. Both

the regular and irregular configurations of potential facility sites show similar characteristics in

the percentage of covered demand. When there are only 66 potential facility sites, the grid-rectangle

demand representations lead to lower percentages of covered demand than the SASDRs

and point-based demand representations do. When the number of potential facility sites increases

to 272, all three demand representations have very similar percentages of covered demand.

Figure 5.7. Percentages of covered demand reported by the MCLP models with 3 types of demand representations when the configuration of potential facility sites includes: (a) 66 grid points, (b) 272 grid points, (c) 66 block centroids, and (d) 272 block centroids

[Panels (a)-(d): x-axis, number of facilities; y-axis, percentage of covered demand (0% to 100%)]

Using SASDR as the reference, Table 5.4 shows the percent cost and optimality errors

between the grid-point-based demand representations and the SASDRs for the 4 configurations

of potential facility sites. We can see that the cost errors are the primary part of the

representation errors in each group. The magnitudes of the cost errors become smaller when

more demand objects (i.e., finer granularity of demand representation) are used. In addition, the

non-zero cost errors are either positive or negative, which matches our expectation that a

point-based demand representation could either overestimate or underestimate covered demand.

Table 5.4 also shows that only a few non-zero optimality errors occur when 66 potential facility

sites are set with about 2000 demand objects. When 272 potential facility sites are used with over

30,000 demand objects, all optimality errors are 0. This observation shows that, as the

granularity of demand representation becomes finer, the differences between the optimal

configurations of facilities given by the MCLP models with point-based

demand representation and with SASDR generally become smaller. We also notice that, when the number of demand objects is

small, the real 100% covered demand may not be reached when the models report 100% covered

demand, such as 8 facilities for the potential facility sites of 66 grid points and 9 facilities for the

potential facility sites of 66 block centroids in this experiment.

Table 5.4. Cost and optimality errors between grid-point-based demand representations and SASDRs

p    Grid_Point/66 (2120)     Grid_Point/272 (36988)   Block_Centroid/66 (2426)   Block_Centroid/272 (37715)
     Cost      Optimality     Cost      Optimality     Cost      Optimality       Cost      Optimality
1    -0.08%    0.00%          -0.06%    0.00%          0.06%     0.00%            0.01%     0.00%
2    -0.03%    0.00%          0.05%     0.00%          0.12%     0.00%            0.00%     0.00%
3    0.24%     0.00%          -0.02%    0.00%          0.12%     -0.07%           0.02%     0.00%
4    0.39%     0.00%          0.00%     0.00%          0.26%     -0.12%           0.02%     0.00%
5    0.14%     0.00%          0.00%     0.00%          0.14%     0.00%            -0.02%    0.00%
6    -0.02%    0.00%          0.00%     0.00%          0.04%     0.00%            0.01%     0.00%
7    0.00%     0.00%          0.00%     0.00%          0.07%     0.00%            0.00%     0.00%
8    0.02%     -0.02%         --        --             0.01%     0.00%            --        --
9    --        --             --        --             0.03%     -0.03%           --        --

Note: the number in the parentheses shows the number of demand objects in each demand representation
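How a cost error and an optimality error of the kind reported in Table 5.4 can arise is sketched below. The formal definitions appear earlier in the chapter and are not reproduced here, so the decomposition used is an assumption: cost error is taken as the coverage reported by the model minus the true coverage of the same solution, and optimality error as that true coverage minus the true optimal coverage, all in percent of total demand. The input values are illustrative, chosen to be consistent with the p = 8 entry for 66 grid points.

```python
def representation_errors(reported, true_of_solution, true_optimum):
    """Split representation error into cost and optimality parts (assumed definitions).

    All inputs are covered demand expressed as a percentage of total demand.
    """
    cost_error = reported - true_of_solution          # misstatement of the chosen solution
    optimality_error = true_of_solution - true_optimum  # shortfall against the true optimum
    return cost_error, optimality_error


# Illustrative case: the point-based model reports full coverage, but the chosen
# sites truly cover slightly less than the optimum found with the reference.
cost, optimality = representation_errors(
    reported=100.00, true_of_solution=99.98, true_optimum=100.00
)
```

Under these assumed definitions the optimality error of a maximization problem cannot be positive, which agrees with the signs in the table.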

Table 5.5 shows the percent cost and optimality errors between the grid-rectangle-based

demand representations and the SASDRs for the four configurations of potential facility sites. It

is noted that the magnitudes of both cost and optimality errors are generally larger than those of

the grid-point-based demand representations shown in Table 5.4. The cost errors are still the

primary part in the representation errors for the grid-rectangle-based demand representations. In

addition, the non-zero cost errors are all negative, which reflects that grid-rectangle-based

demand representation usually underestimates covered demand. As with the grid-point-based

demand representations shown in Table 5.4, finer granularity of demand

representation decreases the difference between the optimal configurations of facilities given by the

MCLP models with grid-rectangle-based demand representation and with SASDR. Moreover, the

grid-rectangle-based demand representations can offer solutions that cover real 100% demand.
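The sign pattern of the cost errors can be illustrated with a small geometric sketch. It assumes uniform demand of one unit per grid cell, that a point-based model counts a cell when its centroid lies within the service circle, and that a rectangle-based model counts a cell only when the whole cell lies within the circle; these coverage rules and all names are assumptions for illustration.

```python
import math


def covered_cells_by_rule(center, radius, cell, n_cells):
    """Count covered cells of an n_cells x n_cells grid under two coverage rules."""
    cx, cy = center
    point_rule = 0      # cell counted if its centroid is inside the circle
    rectangle_rule = 0  # cell counted only if all four corners are inside the circle
    for ix in range(n_cells):
        for iy in range(n_cells):
            x0, y0 = ix * cell, iy * cell
            corners = [(x0, y0), (x0 + cell, y0), (x0, y0 + cell), (x0 + cell, y0 + cell)]
            if math.hypot(x0 + cell / 2 - cx, y0 + cell / 2 - cy) <= radius:
                point_rule += 1
            if all(math.hypot(x - cx, y - cy) <= radius for x, y in corners):
                rectangle_rule += 1
    return point_rule, rectangle_rule


# Hypothetical 3000 m x 3000 m grid of 100 m cells, one facility with a 1000 m radius.
point_cells, rect_cells = covered_cells_by_rule((1500.0, 1500.0), 1000.0, 100.0, 30)
true_cells = math.pi * 1000.0 ** 2 / 100.0 ** 2  # circle area in units of one cell
```

Because a circle is convex, a cell whose four corners are inside is entirely inside, so the rectangle rule can never exceed the true covered demand. The centroid rule has no such bound and can fall on either side of it.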

Table 5.5. Cost and optimality errors between grid-rectangle-based demand representations and SASDRs

p    Grid_Point/66 (2120)     Grid_Point/272 (36988)   Block_Centroid/66 (2426)   Block_Centroid/272 (37715)
     Cost      Optimality     Cost      Optimality     Cost      Optimality       Cost      Optimality
1    -7.69%    0.00%          -1.85%    0.00%          -7.61%    0.00%            -1.82%    0.00%
2    -7.70%    0.00%          -1.72%    0.00%          -5.85%    0.00%            -2.04%    0.00%
3    -4.77%    -0.37%         -1.51%    0.00%          -4.62%    0.00%            -1.09%    0.00%
4    -3.01%    -0.89%         -0.87%    0.00%          -3.38%    0.00%            -0.68%    0.00%
5    -1.83%    0.00%          -0.36%    0.00%          -1.33%    -0.61%           -0.31%    -0.07%
6    -0.51%    -0.18%         -0.02%    0.00%          -0.84%    0.00%            -0.07%    0.00%
7    -0.06%    0.00%          0.00%     0.00%          -0.45%    0.00%            0.00%     0.00%
8    0.00%     0.00%          --        --             -0.19%    -0.09%           --        --
9    --        --             --        --             -0.05%    -0.02%           --        --
10   --        --             --        --             0.00%     0.00%            --        --

Note: the number in the parentheses shows the number of demand objects in each demand representation

Based on the experimental results about representation error described above, we have

the following main findings:

(1) SASDR and the traditional area-based demand representations (e.g., use of census

units or regular polygons as demand objects) can both offer solutions providing real

100% covered demand if the whole demand space can be covered by a sufficient number

of facilities with a given configuration of potential facility sites. However, the

minimum number of facilities obtained with the traditional area-based

demand representation could be larger than that of the optimal solution. Point-based demand

representation with coarse granularity has difficulty offering solutions that provide real

100% covered demand. However, refining the granularity of point-based

demand representation could mitigate the problem.

(2) Given similar problem sizes and using SASDR as the reference, when the granularity

of demand representation is relatively coarse, the representation errors, including cost

and optimality errors, associated with both point-based and the traditional area-based

demand representations are obvious. However, when the granularity of demand

representation is fine, the representation errors could become very small, especially

the optimality errors. In that case, the model solutions about the configuration of

facilities could be equally effective no matter which type of demand representation is

used.

(3) When the degrees of granularity are close, grid-point-based demand representation

usually has better performance than grid-rectangle-based demand representation in terms

of both cost and optimality errors.

These main findings offer some guidance on how to choose an appropriate spatial

demand representation in practical applications. When a small number of potential facility sites

is needed or real 100% covered demand is required, SASDR is a good choice.

When the number of potential facility sites rises to a large number that could lead to a SASDR

with very fine granularity, using a point-based demand representation may be a good choice

based on the following considerations. If a SASDR results in a large problem size that

is nevertheless still solvable by exact methods in current optimization software, using a point-based demand

representation with similar problem complexity as an alternative can give similar modeling

solutions while avoiding extensive computation in GIS for the realization of SASDR. In point-

based demand representation, the number of demand points is independent of the configuration

of potential facility sites, which provides a flexible approach to balance problem complexity and

representation error. If a problem using SASDR is too complex to solve by exact methods in

current optimization software, it is possible to replace it with a point-based demand representation

with a smaller number of demand objects, which can be defined based on the capability of the optimization

software. The loss of covered demand due to the representation errors could be compensated by

increasing the number of potential facility sites without a large increase of problem size.

Regular-area-based demand representation is not strongly recommended because, given similar

problem sizes, its performance is usually not as good as point-based demand representation and

it also needs spatial analysis functions in GIS to examine the topological relationship between

service coverage and each regular areal demand unit, which could be very time-consuming.

5.7 Conclusions

Spatial demand representation is an important topic in location modeling because it is

necessary for applying location models to the planning process and strongly associated with the

efficiency of modeling solutions. A spatial demand representation should not only

minimize representation error but also keep the complexity of the model as low as possible.

Most current research, however, focuses primarily on assessing and trying to reduce or

eliminate representation error while ignoring the model complexity associated with demand

representation. In this paper, we use expressions of set theory to formalize SASDR, a

polygon-overlay-based demand representation originally described by Cromley et al. (2012) and

also used for siting emergency vehicles by Yin and Mu (2012). Using the MCLP as an example,

we then empirically compare SASDR to widely-used point-based and regular-area-based demand

representations in terms of both problem complexity and representation error.

SASDR has several advantages including being able to offer solutions providing real 100%

covered demand and eliminating some errors associated with point-based and other area-based

demand representations. However, our study shows that the complexity of the problem with

SASDR could become extremely high as the number and the degree of clustering

of potential facility sites increase. This problem could lead to a dilemma for many practical applications

where it is common to set a large number of potential facility sites for larger covered demand.

Many covering location problems themselves are nondeterministic polynomial time (NP)-hard

(Megiddo et al. 1981), which means that no algorithm has yet been discovered to solve them in

polynomial time in the worst case. Therefore, these problems using SASDR could become more

difficult, if not impossible, to solve by exact methods in current commercial optimization

software. In such cases, heuristic methods may be the only option; however, they could introduce

other errors into modeling solutions and require sophisticated algorithmic strategies and strong

programming skills. In addition, the realization of SASDR for a large number of potential facility

sites could also be a computational challenge for current GIS software.

The empirical comparisons of problems with similar degrees of complexity, but different

spatial demand representations, provide some insight into how to choose an appropriate spatial

demand representation in practical applications. Point-based demand representation sometimes is

a good alternative to SASDR when the problem with SASDR is too complex to solve by exact

methods in current optimization software.

As we know, point-based and regular-area-based demand representations can be very

flexible depending on the number and arrangement of demand objects as well as the shape of

areal unit. In this study, we only choose a limited number of point-based and regular-area-based

demand representations as examples to explore their characteristics in MCLP modeling. Our

findings may therefore not generalize well to all situations.

In addition, we note that the MCLP has been extended to incorporate more

considerations to meet specific application requirements, such as capacitated facilities (Chung

et al. 1983, Current and Storbeck 1988, Haghani 1996) and the allocation of demand beyond the

covering standard in emergency service planning (Pirkul and Schilling 1991, Yin and Mu 2012).

In these variations of the MCLP, allocation of demand to facilities needs to be considered. The

aggregation and representation errors on demand allocation could be one topic of our research in

the future.

References

Alexandris, G. & Giannikos, I., 2010. A new model for maximal coverage exploiting GIS capabilities. European Journal of Operational Research, 202 (2), 328-338.

ARC, 2012. GIS data and maps [online]. http://www.atlantaregional.com/info-center/gis-data-maps/gis-data [Accessed 2012].

Bowerman, R.L., Calamai, P.H. & Brent Hall, G., 1999. The demand partitioning method for reducing aggregation errors in p-median problems. Computers & Operations Research, 26 (10-11), 1097-1111.

Casillas, P., 1983. Data aggregation and the p-median problem in continuous space. In Ghosh, A. & Rushton, G. eds. Spatial analysis and location-allocation models. New York: Van Nostrand Reinhold, 327-344.

Chung, C., Schilling, D. & Carbone, R., 1983. The capacitated maximal covering problem: A heuristic. Proceedings of the Fourteenth Annual Pittsburgh Conference on Modeling and Simulation, 1423-1428.

Church, R. & Revelle, C., 1974. The maximal covering location problem. Papers in regional science, 32 (1), 101-118.

CPLEX Help, 2011. Branch and cut [online]. http://www.iro.umontreal.ca/~gendron/IFT6551/CPLEX/HTML/usrcplex/solveMIP9.html#638133 [Accessed 2011].

Cromley, R.G., Lin, J. & Merwin, D.A., 2012. Evaluating representation and scale error in the maximal covering location problem using GIS and intelligent areal interpolation. International Journal of Geographical Information Science, 26 (3), 495-517.

Current, J. & Storbeck, J., 1988. Capacitated covering models. Environment and Planning B, 15, 153-164.

Current, J.R. & Schilling, D.A., 1987. Elimination of source A and B errors in p-median location problems. Geographical Analysis, 19 (2), 95-110.

Current, J.R. & Schilling, D.A., 1990. Analysis of errors due to demand data aggregation in the set covering and maximal covering location problems. Geographical Analysis, 22 (2), 116-126.

Daskin, M.S., Haghani, A.E., Khanal, M. & Malandraki, C., 1989. Aggregation effects in maximum covering models. Annals of Operations Research, 18 (1), 113-139.

Francis, R., Lowe, T., Rayco, M. & Tamir, A., 2009. Aggregation error for location models: Survey and analysis. Annals of Operations Research, 167 (1), 171-208.

Goodchild, M.F., 1979. The aggregation problem in location-allocation. Geographical Analysis, 11 (3), 240-255.

Haghani, A., 1996. Capacitated maximum covering location models: Formulations and solution procedures. Journal of advanced transportation, 30 (3), 101-136.

Hillsman, E.L. & Rhoda, R., 1978. Errors in measuring distances from populations to service centers. The Annals of Regional Science, 12 (3), 74-88.

Hodgson, M.J. & Neuman, S., 1993. A GIS approach to eliminating source C aggregation error in p-median models. Computers & Operations Research.

Kim, K. & Murray, A.T., 2008. Enhancing spatial representation in primary and secondary coverage location modeling. Journal of Regional Science, 48 (4), 745-768.

Longley, P.A., Goodchild, M.F., Maguire, D.J. & Rhind, D.W., 2005. Geographic information systems and science, 2nd ed.: John Wiley & Sons, Ltd.

McHarg, I.L. & American Museum of Natural History, 1969. Design with nature, 1st ed. Garden City, N.Y.: Published for the American Museum of Natural History [by] the Natural History Press.

Megiddo, N., Zemel, E. & Hakimi, S.L., 1981. The maximum coverage location problem: Northwestern University.

Miller, H.J., 1996. GIS and geometric representation in facility location problems. International Journal of Geographical Information Systems, 10 (7), 791-816.

Murray, A.T., 2005. Geography in coverage modeling: Exploiting spatial structure to address complementary partial service of areas. Annals of the Association of American Geographers, 95 (4), 761-772.

Murray, A.T. & O'Kelly, M.E., 2002. Assessing representation error in point-based coverage modeling. Journal of geographical systems, 4 (2), 171-191.

Murray, A.T., O'Kelly, M.E. & Church, R.L., 2008. Regional service coverage modeling. Computers & Operations Research, 35 (2), 339-355.

Openshaw, S. & Taylor, P.J., 1981. The modifiable areal unit problem. In Wrigley, N. & Bennett, R. eds. Quantitative geography: A British view. London and Boston: Routledge and Kegan Paul, 60-69.

Pirkul, H. & Schilling, D.A., 1991. The maximal covering location problem with capacities on total workload. Management Science, 37 (2), 233-248.

Tong, D., Murray, A. & Xiao, N., 2009. Heuristics in spatial analysis: A genetic algorithm for coverage maximization. Annals of the Association of American Geographers, 99 (4), 698-711.

Tong, D. & Murray, A.T., 2009. Maximising coverage of spatial demand for service. Papers in regional science, 88 (1), 85-97.

Yin, P. & Mu, L., 2012. Modular capacitated maximal covering location problem for the optimal siting of emergency vehicles. Applied Geography, 34 (0), 247-254.

CHAPTER 6

CONCLUSIONS

6.1 Summary and Conclusions

With the increasing availability of digital health data and environmental, socioeconomic, and behavioral data,

Geographic Information Systems (GIS) are receiving increased attention in public

health studies. This dissertation research mainly focuses on three aspects of health studies using

GIS and spatial analysis: spatial disease cluster detection, spatio-temporal disease mapping, and

health service planning. New methods or models are proposed and implemented with GIS in this

research to address an important problem in each of the three aspects.

With respect to the detection of spatial disease clusters, for the first time, our study

implements and tests Tango’s (2008) restricted likelihood ratio combined with Assunção et al.’s

(2006) dynamic Minimum Spanning Tree (dMST) search strategy to quickly detect disease

clusters in arbitrary shapes. To understand the performance of this redesigned hybrid method in

various situations, we design six cluster models and two non-cluster scenarios. These cluster

models consider different numbers of disease cases in a study area and various shapes of clusters.

The choice of the screening level α1 in restricted likelihood ratio is also explored in our

redesigned spatial scan statistic method (RSScan). Besides the metric of power, we propose

using the Kappa Index of Agreement (KIA) to evaluate and compare the performances of cluster

detection methods to identify the boundaries of clusters in order to avoid the effects due to the

different cluster model properties. Finally, we provide the application of our RSScan method in a

case of detecting the cluster of lung cancer incidence in Georgia for the period 1998-2005. The

experimental results indicate that the RSScan method with an appropriate screening level α1 generally

has higher power and accuracy than Tango’s method, Assunção et al.’s method, and Kulldorff’s

circular spatial scan statistic method (CSScan) for clusters with irregular shapes. Based on

numeric experiments, our study recommends 0.2 as the default for the screening level α1 in the

RSScan method to get higher statistical power and more accurate boundaries of clusters. It

should also be noted that the performances of the RSScan method and the other three methods vary

under different situations such as counts of disease incidence cases and true cluster shapes. This

finding corresponds well with the power analysis given by Waller and Gotway (2004) that most

tests to detect clusters have spatially heterogeneous power.
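The idea of scoring detected cluster boundaries with a kappa statistic can be sketched as follows. The sketch uses the standard Cohen's kappa on a per-region binary label (inside or outside the cluster); whether the dissertation computes KIA in exactly this way is an assumption, and the labels below are hypothetical.

```python
def kappa_index(true_labels, detected_labels):
    """Cohen's kappa between true and detected cluster membership (1 = in cluster)."""
    pairs = list(zip(true_labels, detected_labels))
    n = len(pairs)
    both = sum(1 for t, d in pairs if t and d)
    only_true = sum(1 for t, d in pairs if t and not d)
    only_detected = sum(1 for t, d in pairs if not t and d)
    neither = n - both - only_true - only_detected

    observed = (both + neither) / n
    expected = (
        (both + only_true) * (both + only_detected)
        + (neither + only_detected) * (neither + only_true)
    ) / n ** 2
    if expected == 1.0:  # all labels identical and constant; agreement is trivially perfect
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical study area of eight regions: three truly in the cluster,
# of which the method detects two, plus one false positive.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
detected = [1, 1, 0, 1, 0, 0, 0, 0]
kia = kappa_index(truth, detected)
```

Here the observed agreement is 0.75 and the chance agreement is 0.53125, so the kappa value is 7/15. Unlike raw agreement, the statistic corrects for the share of regions that would match by chance.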

Facing the fact that there are only a limited number of lung cancer studies in Georgia,

especially at a fine spatio-temporal scale, our research using hierarchical Bayesian models to

explore the spatio-temporal patterns of lung cancer incidence risks in Georgia from 2000 to 2007

contributes to the geospatial health analysis literature. The study is conducted at the census tract

level using a two-year time period as the temporal unit. The fine spatial and temporal scales enable

the study to show more detailed variations of lung cancer incidence risks in space and time, which

can better support assessing healthcare performance, establishing potential etiological

hypotheses, and making effective and efficient health policies. Compared to the crude

Standardized Incidence Ratio (SIR), Bayesian spatio-temporal models can provide more reliable

estimates of disease risk at a fine spatio-temporal scale. A total of seven Bayesian spatio-temporal

models under the separate and joint modeling frameworks are developed and compared. In this

study, the joint models generally have better performance than the separate models using the

deviance information criterion (DIC) as the criterion. The study also shows that there are strong

inverse relationships between the socioeconomic status (SES) and the lung cancer incidence risk

in Georgia males, especially white males, and weak inverse relationships in both white and black

Georgia females. This could lead to further studies on the underlying reasons such as

occupational risk factors.
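For readers unfamiliar with the crude measure that the Bayesian models are compared against, a minimal sketch of a Standardized Incidence Ratio follows. It uses the usual indirect standardization, observed cases divided by the cases expected under reference rates; the strata, rates, and counts are hypothetical and are not taken from the Georgia data.

```python
def standardized_incidence_ratio(observed, populations, reference_rates):
    """Crude SIR for one area and period: observed cases / expected cases.

    populations: dict stratum -> population at risk in the area.
    reference_rates: dict stratum -> incidence rate in the reference population.
    """
    expected = sum(populations[s] * reference_rates[s] for s in populations)
    return observed / expected


# Hypothetical census tract with two age strata.
populations = {"40-64": 2000, "65+": 500}
reference_rates = {"40-64": 0.001, "65+": 0.004}
sir = standardized_incidence_ratio(6, populations, reference_rates)
```

With only four expected cases, one additional observed case moves the ratio from 1.5 to 1.75. That sensitivity at fine spatial and temporal scales is the instability that smoothing with hierarchical models is meant to reduce.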

The modular capacitated maximal covering location problem (MCMCLP) developed in

Chapter 4 is an extension of the capacitated maximal covering location problem (MCLP) to

accommodate situations where the facilities to be sited have several possible capacity levels. For

the optimal siting of emergency vehicles, the MCMCLP considers the modular capacity levels of

a facility, the allocation of all demands, and the proximity of the uncovered allocated demands to

facilities. Two situations—the MCMCLP-NFC and the MCMCLP-FC—can be used depending

on the circumstances of the facility. As an example, these two models are successfully applied to

optimally site ambulances for emergency medical services (EMS) Region 10 in Georgia. In the

MCMCLP models, GIS plays an important role. It is used to manage and organize the spatial

data, to realize the spatial demand representation, to help construct the model input file for

optimization software packages, and to visualize the problem solution with maps. In addition to

these important functions, GIS also facilitates theoretical advances in current location science

(Church 2002, Murray 2010).
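The covering idea that the MCMCLP builds on can be sketched with a toy example. The code below solves the basic, uncapacitated MCLP by brute-force enumeration; it does not include the modular capacity levels, the allocation of all demands, or the proximity term of the MCMCLP, and it is not the mixed integer programming formulation of Chapter 4. The function names, coordinates, weights, and service radius are hypothetical.

```python
from itertools import combinations
from math import hypot

def covered_demand(sites, demands, radius):
    """Total demand weight within `radius` of at least one chosen site."""
    return sum(
        w for (x, y, w) in demands
        if any(hypot(x - sx, y - sy) <= radius for (sx, sy) in sites)
    )

def solve_mclp(candidates, demands, p, radius):
    """Enumerate every choice of p candidate sites and keep the best one.

    Only practical for very small instances; real applications would
    pass the model to an optimization package instead.
    """
    best = max(combinations(candidates, p),
               key=lambda chosen: covered_demand(chosen, demands, radius))
    return best, covered_demand(best, demands, radius)

# Hypothetical data: demand points (x, y, weight) and candidate sites (x, y)
demands = [(0, 0, 10), (1, 0, 5), (5, 5, 8), (6, 5, 4), (10, 0, 3)]
candidates = [(0, 0), (5, 5), (10, 0)]
best_sites, covered = solve_mclp(candidates, demands, p=2, radius=1.5)
print(best_sites, covered)
```

Enumeration grows combinatorially with the number of candidate sites, which is one reason the models in this dissertation are passed to optimization software rather than solved this way.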

Spatial demand representation is an important topic in location-allocation modeling, such

as the MCMCLP discussed above. A spatial demand representation should not only

minimize representation error but also keep the complexity of the model as low as possible.

In Chapter 5, we use expressions of set theory to formalize the service area spatial demand

representation (SASDR). Using the MCLP as an example, we then empirically compare SASDR

to widely used point-based and regular-area-based demand representations in terms of both

problem complexity and representation error. SASDR has several advantages, including being

able to offer solutions that provide truly 100% covered demand and eliminating some errors

associated with point-based and other area-based demand representations. However, our study

shows that the complexity of the problem with SASDR could become extremely high as the

number and the degree of clustering of potential facility sites increase. This problem could

lead to a dilemma for many practical applications where it is common to set a large number of

potential facility sites to achieve larger covered demand. In addition, the realization of SASDR for a

large number of potential facility sites could also be a computational challenge for current GIS

software. The empirical comparisons of problems with similar degrees of complexity but

different spatial demand representations indicate that point-based demand representation could

be a good alternative to SASDR when the problem with SASDR is too complex to solve by exact

methods in current optimization software.
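Why clustered potential facility sites raise the complexity of a service-area-based representation can be seen in a rough sketch. The code below approximates circular service areas on a regular grid and groups cells by the set of sites that cover them, so that each distinct non-empty set stands for one demand unit. The grid approximation, the circular service areas, the function name, and all coordinates are assumptions made for illustration; the dissertation realizes SASDR with polygon overlay in GIS, and a coarse grid can miss small slivers.

```python
from math import hypot

def coverage_units(sites, radius, extent, step=0.5):
    """Group grid cells by the set of sites whose service area contains them.

    Each distinct non-empty set of covering sites corresponds to one
    demand unit. Returns {frozenset(site indices): number of cells}.
    """
    units = {}
    n = int(extent / step)
    for i in range(n):
        for j in range(n):
            x, y = (i + 0.5) * step, (j + 0.5) * step  # cell center
            signature = frozenset(
                k for k, (sx, sy) in enumerate(sites)
                if hypot(x - sx, y - sy) <= radius
            )
            if signature:  # ignore cells that no site can cover
                units[signature] = units.get(signature, 0) + 1
    return units

dispersed = [(2, 2), (8, 2), (5, 8)]   # service areas do not overlap
clustered = [(4, 5), (5, 5), (6, 5)]   # service areas overlap heavily
print(len(coverage_units(dispersed, 2.0, 10)))
print(len(coverage_units(clustered, 2.0, 10)))
```

With the same number of sites, the overlapping layout produces more distinct units than the dispersed one, and each additional overlapping service area can split existing units further.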

6.2 Future Research

Based on the results of this dissertation research, future research will continue to use

GIS and spatial analysis to advance health studies. As examples, three research directions are

outlined as follows:

(1) New method for disease cluster detection

Although our RSScan method shows good statistical power and relatively high accuracy of

the boundaries of detected clusters in detecting spatial disease clusters in arbitrary shapes, the

weaknesses of this method also need to be noted. Our experiments show that the statistical power

of our RSScan method varies in situations with different numbers of disease cases, shapes of the

true clusters, and patterns of the population at risk. The same situation exists in other existing cluster

detection methods as well. The relatively arbitrary choice of the screening level parameter in the

restricted likelihood ratio makes the RSScan method difficult to use in practice. Therefore,

improving the statistical power and the accuracy of the boundaries of detected clusters in

arbitrary shapes is one task of my future research. It could be realized by seeking more efficient

artificial intelligence methods as searching strategies and constructing better penalty parameters for

test statistics. Recently, a multi-objective algorithm (Cançado et al. 2010) was proposed to avoid

or mitigate the subjectivity in choosing the penalty or other parameters in the test statistics in

traditional cluster detection methods. This could be a direction in my future research. In addition,

extending cluster detection from the spatial dimension to the spatial and temporal dimensions is

receiving considerable interest in disease surveillance. I will take exploring new methods for

detecting spatio-temporal disease clusters as one of my future studies.
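For readers unfamiliar with the test statistics mentioned above, the following minimal sketch shows the standard Poisson-model log likelihood ratio that spatial scan statistics maximize over candidate zones. It does not implement the RSScan search strategy, the restricted likelihood ratio, or any penalty term; the function names and the zone counts are hypothetical, and the expected counts are assumed to sum to the total number of cases.

```python
from math import log

def poisson_llr(c, e, total_cases):
    """Log likelihood ratio of a candidate zone under the Poisson model.

    c: observed cases in the zone
    e: expected cases in the zone under the null hypothesis
    total_cases: cases in the whole study area
    Returns 0 unless the zone has more cases than expected.
    """
    if c <= e:
        return 0.0
    inside = c * log(c / e)
    rest = total_cases - c
    outside = rest * log(rest / (total_cases - e)) if rest > 0 else 0.0
    return inside + outside

def most_likely_zone(zones, total_cases):
    """zones: {name: (observed, expected)}; return the zone with the largest LLR."""
    return max(zones, key=lambda z: poisson_llr(*zones[z], total_cases))

# Hypothetical candidate zones built from two regions A and B
zones = {"A": (20, 10.0), "B": (12, 10.0), "A+B": (32, 20.0)}
print(most_likely_zone(zones, total_cases=100))
```

In practice the significance of the most likely zone is assessed by Monte Carlo replication under the null hypothesis, and the choice of which candidate zones to evaluate is where methods for arbitrarily shaped clusters differ.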

(2) Risk factors for lung cancer in Georgia

My dissertation research shows the spatio-temporal patterns of lung cancer incidence

risks by race and sex across the whole of Georgia from 2000 to 2007. These patterns could aid

authorities in making more effective health policies and healthcare service plans to reduce

health disparities and promote public health. However, to better prevent lung cancer, an

important question needs to be answered: what factors lead to such patterns? For example, why

does northwest Georgia have stably high lung cancer incidence risks for all population subgroups?

In the future, studying the environmental factors related to the spatio-temporal patterns of lung

cancer incidence risks in Georgia is one of my research tasks. For example, what is the

correlation between the distribution of radon in underground water and the lung cancer incidence

in Georgia?

(3) Dynamic factors in health service planning

People usually concentrate in workplaces or commercial districts in the daytime and stay

in residences at night. Considering such population movements in health service planning

could greatly improve the efficiency and efficacy of the use of resources, especially emergency

vehicles such as ambulances discussed in my dissertation. In the future, I will integrate dynamic

factors in demand into my MCMCLP models to solve more practical problems.

References

Assunção, R., Costa, M., Tavares, A. & Ferreira, S., 2006. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine, 25 (5), 723-742.

Cançado, A.L.F., Duarte, A.R., Duczmal, L.H., Ferreira, S.J., Fonseca, C.M. & Gontijo, E.C.D.M., 2010. Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters. International Journal of Health Geographics, 9 (1), 55.

Church, R.L., 2002. Geographical information systems and location science. Computers & Operations Research, 29 (6), 541-562.

Murray, A.T., 2010. Advances in location modeling: GIS linkages and contributions. Journal of Geographical Systems, 12 (3), 335-354.

Tango, T., 2008. A spatial scan statistic with a restricted likelihood ratio. Japanese Journal of Biometrics, 29 (2), 75-95.

Waller, L. & Gotway, C., 2004. Applied spatial statistics for public health data. Wiley-Interscience.

APPENDIX I

LIST OF ACRONYMS

Acronym Full description

0-9

2SFCA Two-step Floating Catchment Area

C

CAR Conditional Autoregression

CEPP Cluster Evaluation Permutation Procedure

CI Credible Interval

CSScan Circular Spatial Scan Statistic

D

DCA Department of Community Affairs

DIC Deviance Information Criterion

dMST Dynamic Minimum Spanning Tree

E

EMS Emergency Medical Services

F

FC Facility-constraint

G

GA State of Georgia

GAM Geographical Analysis Machine

GIS Geographic Information Systems

H

HSIP Homeland Security Infrastructure Program

K

KIA Kappa Index of Agreement

L

LCDCU Least Common Demand Coverage Unit

M

MAUP Modifiable Areal Unit Problem

MCMCLP Modular Capacitated Maximal Covering Location Problem

MCLP Maximal Covering Location Problem

MIP Mixed Integer Programming

MTFCC MAF/TIGER Feature Class Codes

N

NFC Non-facility-constraint

NP Nondeterministic Polynomial Time

R

RR Relative Risk

RSScan Redesigned Spatial Scan Statistic

S

SASDR Service Area Spatial Demand Representation

SES Socioeconomic Status

SIR Standardized Incidence Ratio

SR Service Radius

V

VBA Visual Basic for Applications