30
Disease Mapping with WinBUGS and MLwiN Andrew B. Lawson Department of Epidemiology and Biostatistics University of South Carolina, USA William J. Browne School of Mathematical Sciences University of Nottingham, UK Carmen L. Vidal Rodeiro Department of Epidemiology and Biostatistics University of South Carolina, USA

Disease Mapping with WinBUGS and MLwiN · 2016-08-12 · STATISTICS IN PRACTICE Advisory Editor Stephen Senn University College London, UK Founding Editor Vic Barnett Nottingham Trent

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

  • Disease Mapping withWinBUGS and MLwiN

    Andrew B. LawsonDepartment of Epidemiology and Biostatistics

    University of South Carolina, USA

    William J. BrowneSchool of Mathematical Sciences

    University of Nottingham, UK

    Carmen L. Vidal RodeiroDepartment of Epidemiology and Biostatistics

    University of South Carolina, USA

    Innodata047085605X.jpg

  • Disease Mapping with

    WinBUGS and MLwiN

  • STATISTICS IN PRACTICE

    Advisory Editor

    Stephen Senn

    University College London, UK

    Founding Editor

    Vic Barnett

    Nottingham Trent University, UK

    Statistics in Practice is an important international series of texts which provide

    detailed coverage of statistical concepts, methods and worked case studies in

    specific fields of investigation and study.

    With sound motivation and many worked practical examples, the books

    show in down-to-earth terms how to select and use an appropriate range of

    statistical techniques in a particular practical field within each title’s special

    topic area.

    The books provide statistical support for professionals and research workers

    across a range of employment fields and research environments. Subject areas

    covered include medicine and pharmaceutics; industry, finance and commerce;

    public services; the earth and environmental sciences, and so on.

    The books also provide support to students studying statistical courses applied

    to the above areas. The demand for graduates to be equipped for the work

    environment has led to such courses becoming increasingly prevalent at uni-

    versities and colleges.

    It is our aim to present judiciously chosen and well-written workbooks to

    meet everyday practical needs. Feedback of views from readers will be most

    valuable to monitor the success of this aim.

    A complete list of titles in this series appears at the end of the volume.

  • Disease Mapping withWinBUGS and MLwiN

    Andrew B. LawsonDepartment of Epidemiology and Biostatistics

    University of South Carolina, USA

    William J. BrowneSchool of Mathematical Sciences

    University of Nottingham, UK

    Carmen L. Vidal RodeiroDepartment of Epidemiology and Biostatistics

    University of South Carolina, USA

  • Copyright # 2003 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,West Sussex PO19 8SQ, England

    Telephone (þ44) 1243 779777

    Email (for orders and customer service enquiries): cs-bookswiley.co.ukVisit our Home Page on www.wileyeurope.com or www.wiley.com

    All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system ortransmitted in any form or by any means, electronic, mechanical, photocopying, recording,scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 orunder the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham CourtRoad, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to thePublisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium,Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to permreqwiley.co.uk, orfaxed to (þ44) 1243 770571.

    This publication is designed to provide accurate and authoritative information in regard to thesubject matter covered. It is sold on the understanding that the Publisher is not engaged in renderingprofessional services. If professional advice or other expert assistance is required, the services of acompetent professional should be sought.

    Other Wiley Editorial Offices

    John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA

    Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA

    Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany

    John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia

    John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809

    John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

    Library of Congress Cataloging-in-Publication Data

    Lawson, Andrew (Andrew B.)Disease mapping with WinBUGS and MLwiN / Andrew B. Lawson, William J. Browne,Carmen L. Vidal Rodeiro.

    p. cm. – (Statistics in practice)Includes bibliographical references and index.ISBN 0-470-85604-1 (hbk. : alk. paper)1. Medical geography. 2. Medical geography – Maps – Data processing.

    3. Epidemiology – Statistical methods. 4. Epidemiology – Data processing. 5. Public healthsurveillance. I. Browne, William J., Ph.D. II. Vidal Rodeiro, Carmen L. III. Title. IV.Statistics in practice (Chichester, England).

    RA792.5.L388 2003615.4’2’0727–dc21 2003053782

    British Library Cataloguing in Publication Data

    A catalogue record for this book is available from the British Library

    ISBN 0–470–85604–1

    Typeset in 10/12 pt by Kolam Information Services Pvt. Ltd, Pondicherry, IndiaPrinted and bound in Great Britain by T J International Ltd, Padstow, CornwallThis book is printed on acid-free paper responsibly manufactured from sustainable forestry in whichat least two trees are planted for each one used for paper production.

    http://www.wileyeurope.comhttp://www.wiley.com

  • Contents

    Preface ix

    Notation xi

    0.1 Standard notation for multilevel modelling xi

    0.2 Spatial multiple-membership models and the MMMC notation xii

    0.3 Standard notation for WinBUGS models xii

    1 Disease mapping basics 1

    1.1 Disease mapping and map reconstruction 2

    1.2 Disease map restoration 3

    2 Bayesian hierarchical modelling 17

    2.1 Likelihood and posterior distributions 17

    2.2 Hierarchical models 18

    2.3 Posterior inference 19

    2.4 Markov chain Monte Carlo methods 20

    2.5 Metropolis and Metropolis–Hastings algorithms 21

    2.6 Residuals and goodness of fit 26

    3 Multilevel modelling 29

    3.1 Continuous response models 30

    3.2 Estimation procedures for multilevel models 35

    3.3 Poisson response models 38

    3.4 Incorporating spatial information 42

    3.5 Discussion 43

    4 WinBUGS basics 45

    4.1 About WinBUGS 45

    4.2 Start using WinBUGS 47

    4.3 Specification of the model 50

    4.4 Model fitting 59

    4.5 Scripts 64

    4.6 Checking convergence 65

    v

  • 4.7 Spatial modelling: GeoBUGS 67

    4.8 Conclusions 72

    5 MLwiN basics 75

    5.1 About MLwiN 75

    5.2 Getting started 77

    5.3 Fitting statistical models 84

    5.4 MCMC estimation in MLwiN 94

    5.5 Spatial modelling 104

    5.6 Conclusions 113

    6 Relative risk estimation 115

    6.1 Relative risk estimation using WinBUGS 115

    6.2 Spatial prediction 137

    6.3 An analysis of the Ohio dataset using MLwiN 139

    7 Focused clustering: the analysis of

    putative health hazards 155

    7.1 Introduction 155

    7.2 Study design 156

    7.3 Problems of inference 158

    7.4 Modelling the hazard exposure risk 160

    7.5 Models for count data 164

    7.6 Bayesian models 166

    7.7 Focused clustering in WinBUGS 167

    7.8 Focused clustering in MLwiN 190

    8 Ecological analysis 197

    8.1 Introduction 197

    8.2 Statistical models 198

    8.3 WinBUGS analyses of ecological datasets 199

    8.4 MLwiN analyses of ecological datasets 219

    9 Spatially-correlated survival analysis 235

    9.1 Survival analysis in WinBUGS 235

    9.2 Survival analysis in MLwiN 237

    10 Epilogue 251

    Appendix 1: WinBUGS code for focused

    clustering models 253

    A.1 Falkirk example 253

    A.2 Ohio example 256

    vi Contents

  • Appendix 2: S-Plus function for conversion

    to GeoBUGS format 263

    Bibliography 267

    Index 275

    Contents vii

  • Statistics in Practice

    Human and Biological Sciences

    Brown and Prescott – Applied Mixed Models in Medicine

    Ellenberg, Fleming and DeMets – Data Monitoring in Clinical Trials:

    A Practical Perspective

    Lawson, Browne and Vidal Rodeiro – Disease Mapping with WinBUGS and

    MLwiN

    Marubini and Valsecchi – Analysing Survival Data from Clinical Trials and

    Observation Studies

    Parmigiani – Modeling in Medical Decision Making: A Bayesian Approach

    Senn – Cross-over Trials in Clinical Research, Second Edition

    Senn – Statistical Issues in Drug Development

    Whitehead – Design and Analysis of Sequential Clinical Trials, Revised

    Second Edition

    Whitehead – Meta-analysis of Controlled Clinical Trials

    Earth and Environmental Sciences

    Buck, Cavanagh and Litton – Bayesian Approach to Intrepreting Archaeological

    Data

    Glasbey and Horam – Image Analysis for the Biological Sciences

    Webster and Oliver – Geostatistics for Environmental Scientists

    Industry, Commerce and Finance

    Aitken – Statistics and the Evaluation of Evidence for Forensic Scientists

    Lehtonen and Pahkinen – Practical Methods for Design and Analysis of Complex

    Surveys, Second Edition

    Ohser andMücklich – Statistical Analysis of Microstructures in Materials Science

  • Preface

    The analysis of disease maps has seen considerable development over the last

    decade. This development has been reflected in a fast-increasing literature and

    has been matched by the development of software tools. The intersecting areas

    of spatial statistical methods development and geographical information

    systems (GIS) have both witnessed this growth. With increasing public health

    concerns about environmental risks and even bioterrorism, the need for good

    methods for analysing spatial health data is immediate. Two major software

    tools have now been developed, which allow the modelling of spatially-refer-

    enced small area health data. These tools, MLwiN and WinBUGS, both provide

    facilities for sophisticated modelling of realistically complex health data. Win-

    BUGS was developed to allow the application of a wide range of hierarchical

    Bayesian models, exploiting modern computational advances, in particular

    Gibbs sampling. MLwiN was developed to allow the fitting of models to multi-

    level data where a natural parameter hierarchy exists. Originally, this was

    implemented using iterative likelihood and quasi-likelihood estimation methods.

    However, the most recent versions of the package have implemented Bayesian

    computational methodology and now have many parallel capabilities. Increas-

    ingly both packages are being used by researchers and also now there is a desire

    to be able to apply such methodology in practical public health applications. In

    response to this need, the authors have attempted to provide an introduction to

    the methods and types of applications where such modelling is feasible. We do

    not claim to provide a comprehensive text on disease mapping and have

    confined our attention to the main application of these methods to counted

    data, where numbers of cases are recorded within small areas.

    This book is designed to be of interest to final-year undergraduate and

    graduate level statistics and biostatistics students but will also be of relevance

    to epidemiologists and public health workers both in higher education and

    beyond. The book provides in the introductory chapters (Chapters 1–5) general

    background to disease mapping, Bayesian hierarchical modelling and multilevel

    modelling approaches, and basic introductions to the use of WinBUGS and

    MLwiN. The latter part of the book is focused on application areas, and is

    divided between relative risk estimation (Chapter 6), focused clustering (Chapter

    7), ecological analysis (Chapter 8), and finally spatial survival analysis (Chapter

    ix

  • 9). Throughout the book we provide clear descriptions of the model program-

    ming execution and analysis of and interpretation of results. We have adopted

    the philosophy that we would attempt to demonstrate how MLwiN and Win-

    BUGS approach the same data example, but also have included examples where

    either one or the other packages have limitations. We cannot necessarily hope

    to provide definitive answers to how modelling is to be approached in every

    case. However, we would hope that we provide useful pointers to the issues and

    potential benefits of the approaches described. As both MLwiN and WinBUGS

    are evolving packages, it is to be expected that features described here may vary

    in the future. However, we have done our best to describe the current or soon-

    to-be current form of the packages which is relevant to the potential audience

    for this published work. All the material described here is available in WinBUGS

    1.4 (see Section 4.8.2 for download information and website www.mrc-bsu.

    cam.ac.uk/bugs), and in MLwiN (see section 5.6.1 and website http://multileve-

    l.ioe.ac.uk/index.html for more details). Most datasets used in this book are

    available to download (with associated WinBUGS code) from the site http://

    www.sph.sc.edu/alawson/.

    We would like to acknowledge the help and contribution of a number of

    people during the development of this work. First, we would like to acknowledge

    the help of the MLwiN project team, in particular Jon Rasbash, Harvey Gold-

    stein, Amy Burch, Lisa Brennan, Fiona Steele and Min Yang. We would like to

    thank Allan Clark, Robin Puett, Lance Waller, Tom Richards, James Hebert,

    Alastair Leyland, Sudipto Banerjee, Robert McKeown, Ken Kleinman, Peter

    Rogerson, Dan Wartenburg and Martin Kulldorff for support and encourage-

    ment in the project. In addition, we would like to acknowledge the data

    availability afforded by the sophisticated online public access GIS layer system

    developed by, amongst others, Guang Zhao of the South Carolina Department of

    Health and Environmental Control. Finally, the continuing support and encour-

    agement of Sian Jones and Rob Calver at Wiley Europe must be acknowledged

    and is much appreciated.

    Andrew Lawson (Columbia, SC, USA)

    William Browne (Nottingham, UK)

    Carmen Vidal Rodeiro (Columbia, SC, USA)

    March 2003

    x Preface

  • Notation

    In complex random effects models there is often a myriad of different ‘standard’

    notations to represent a statistical model. This is generally because the models

    were first discovered by many different authors at roughly the same time and

    each author had their own particular notation and style.

    0.1 STANDARD NOTATION FOR MULTILEVEL

    MODELLING

    In this book, in the multilevel modelling sections, as we will be using the MLwiN

    software package, we will use the notation used by this software package.

    If we consider a three-level nested Normal model, then the standard multi-

    level model will be written as

    yijk ¼ X� þ vk þ ujk þ eijk, vk � N(0, �2v ), ujk � N(0,�2u ), eijk � N(0, �2e )

    Here the fixed effects are represented by �, X is a design matrix, and the randomeffects at levels 1, 2 and 3 are represented by e, u, and v respectively. Level

    1 units are indexed i, level 2 units j and level 3 units k. There is a rather

    unfortunate notational clash as in disease mapping e is typically used to repre-

    sent the expected counts. However, the level 1 residuals disappear from the

    equation in the Poisson response multilevel model which minimizes confusion.

    A three-level Poisson response model is typically written in MLwiN as

    yijk � Poisson(�ijk),log (�ijk) ¼ log (eijk)þ X� þ vk þ ujk,

    vk �N(0, �2v ), ujk � N(0,�2u ):(1)

    In standard disease mapping � is often used rather than �, and the eijk is oftenput on the right-hand side of the equation.

    xi

  • 0.2 SPATIAL MULTIPLE-MEMBERSHIP MODELS AND

    THE MMMC NOTATION

    The disadvantage of the standard multilevel notation is that it relies on the

    nested structure of the model. Browne et al. (2001) consider more general

    random effect structures including crossed random effects and multiple-mem-

    bership structures. Rather than give an index for each classification (level in a

    nested structure) they instead use mapping functions to define the unit in the

    classification that a particular observation belongs. For example let us consider

    the three-level Poisson model and assume that the levels are counties within

    regions within nations. Then in the notation of Browne et al. (2001) we can

    write Equation (1) as follows:

    yi � Poisson(�i),log (�i) ¼ log (ei)þ X� þ u(3)nation [i] þ u

    (2)region[i],

    u(3)nation[i] �N(0, �

    2u(3)), u

    (2)region[i] � N(0, �

    2u(2)):

    So here we define all terms with respect to the lowest (observation) level which

    is labelled i. The functions nation[i] and region[i] are mapping functions that

    return the nation and region respectively that observation i belongs to. As the

    random part of the model consists of a set of classifications which need not now

    be ordered in terms of nesting (and if the model contained crossed effects could

    not) we simply define each set of random effects with the letter u but include a

    superscript that gives the classification a number. We start numbering from 2 as

    1 is reserved for the observation level.

    The spatial multiple-membership models that we will consider later can be

    easily written in this notation as follows:

    yi � Poisson(�i),log (�i) ¼ log (ei)þ X� þ

    X

    j2neigh(i)w

    (3)i,j u

    (3)j þ u

    (2)region[i],

    u(3)j �N(0, �2u(3)), u

    (2)region[i] � N(0, �

    2u(2)):

    Here we have a set of region effects indexed by 2 and a set of neighbour effects

    that are indexed by 3.

    0.3 STANDARD NOTATION FOR WinBUGS MODELS

    In hierarchical models for disease maps, the notation commonly used is slightly

    different from that used in multilevel models. The basic Poisson likelihood model

    is defined as

    xii Notation

  • yi � Poisson(ei�i),

    where ei is the expected count and �i is the relative risk in the ith small area.Note that in the notation of multilevel models, �i ¼ ei�i. It is also common to useli ¼ ei�i, and this form is used in Chapter 7.

    Modelling focuses on �i. Here we assume this notation for all the standardanalysis within WinBUGS. In addition to region specific notation we also

    introduce space–time notation with a second subscript denoting the time period:

    yik � Poisson(eik�ik):

    Here, k denotes the relevant time period and the expected count and relative risk

    are allowed to vary over time periods.

    When random effects are introduced into models it is usual to denote region-

    specific uncorrelated heterogeneity as vi, and correlated heterogeneity for the

    same unit as ui. This differs slightly from the convention in multilevel models.

    In each section the relevant notation for that section is introduced and it is

    hoped that any differences between sections will not create difficulties for the

    reader.

    Notation xiii

  • 1

    Disease Mapping Basics

    The representation and analysis of maps of disease incidence data is now

    established as a basic tool in the analysis of regional public health. One of the

    earliest examples of disease mapping is the map of the addresses of cholera

    victims related to the locations of water supplies given by Snow (1854). In that

    case, the street addresses of victims were recorded and their proximity to

    putative pollution sources (water supply pumps) was assessed.

    The subject area of disease mapping has developed considerably in recent

    years. This growth in interest has led to a greater use of geographical or spatial

    statistical tools in the analysis of data both routinely collected for public health

    purposes and in the analysis of data found within ecological studies of disease

    relating to explanatory variables. The study of the geographical distribution

    of disease can have a variety of uses. The main areas of application can

    be conveniently broken down into the following classes: (1) disease mapping,

    (2) disease clustering, and (3) ecological analysis. In the first class, usually the

    object of the analysis is to provide (estimate) the true relative risk of a disease of

    interest across a geographical study area (map): a focus similar to the processing

    of pixel images to remove noise. Applications for such methods lie in health

    services resource allocation, and in disease atlas construction (see, for example,

    Pickle et al., 1999). The second class, that of disease clustering, has particular

    importance in public health surveillance, where it may be important to be able

    to assess whether a disease map is clustered and where the clusters are located.

    This may lead to examination of potential environmental hazards. A particular

    special case arises when a known location is thought to be a potential pollution

    hazard. The analysis of disease incidence around a putative source of hazard

    is a special case of cluster detection called focused clustering. The third class,

    that of ecological analysis, is of great relevance within epidemiological research,

    as its focus is the analysis of the geographical distribution of disease in

    relation to explanatory covariates, usually at an aggregated spatial level.

    Many issues relating to disease mapping are also found in this area, in addition

    to issues relating specifically to the incorporation of covariates.

    1

    Disease Mapping with WinBUGS and MLwiN A. Lawson, W. Browne and C. Vidal Rodeiro

    # 2003 John Wiley & Sons, Ltd ISBN: 0-470-85604-1 (HB)

  • In this volume, we focus on the issues of modelling. While the focus here is on

    statistical methods and issues in disease mapping, it should be noted that the

    results of such statistical procedures are often represented visually in mapped

    form. Hence, some consideration must be given to the purely cartographic

    issues that affect the representation of geographical information. The method

    chosen to represent disease intensity on the map, be it colour scheme or

    symbolic representation, can dramatically affect the resulting interpretation of

    disease distribution. It is not the purpose of this review to detail such cognitive

    aspects of disease mapping, but the reader is directed to some recent discussions

    of these issues (MacEachren, 1995; Monmonier, 1996; Pickle and Hermann,

    1995; Walter, 1993).

    1.1 DISEASE MAPPING AND MAP RECONSTRUCTION

    To begin, we consider two situations which commonly arise in studies of the

    geographic distribution of disease. These situations are defined by the form of

    the mapped data which arises in such studies. First a study area or window

    is defined and within this area for a fixed period of time the locations of cases

    of a specified disease are recorded. These locations are usually residential

    addresses (street address or, at a higher spatial scale, zip code (USA) or post

    code unit (UK) ). When such addresses are known it is possible to proceed

    by direct analysis of the case locations. This is termed case-event analysis.

    Often this analysis requires the use of point process models and associated

    methodology. This form of analysis is reviewed in Lawson (2001, Chapters 4

    and 5) and elsewhere (see, for example, Elliott et al. (2000, Chapter 6) ).

    Due to the requirements of medical confidentiality, it is often not possible

    to obtain data at this level of resolution and so resort must be made to the

    analysis of counts of cases within small areas within the study window. These

    small areas are arbitrary regions usually defined for administrative purposes,

    such as census tracts, counties, municipalities, electoral wards or health

    district regions. Data of this type consist of counts of cases within tracts and

    the analysis of this data is termed tract count analysis. In this volume we focus

    exclusively on tract count analysis. An example of the analysis of case-event

    data with a Bernouilli model using WinBUGS is given in Congdon (2003,

    Chapter 7).

    Essentially the count is an aggregation of all the cases within the tract. By

    aggregation, the individual case spatial references (locations) are lost and

    therefore any georeference of the count is related to the tract ‘location’. Often

    this is represented by the tract centroid. In a chosen study window there is

    found to be m tracts. Denote the counts of disease within the m tracts as

    {yi}, i ¼ 1, . . . , m. Figure 1.1 displays a tract count example.This example is of the 46 counties of South Carolina in which were collected

    the congenital abnormality death counts for the year 1990.

    2 Disease mapping basics

  • 0

    7

    1

    5

    11

    5

    16

    0

    17

    4

    00

    1

    1

    71

    3

    0

    0

    8

    2

    13

    7

    0

    8

    0

    3

    2

    41

    110

    1

    2

    3

    3

    8

    6

    143

    11

    6

    0

    1

    5

    Figure 1.1 South Carolina congenital abnormality deaths 1990 by counties.

    1.2 DISEASE MAP RESTORATION

    1.2.1 Simple statistical representations

    The representation of disease-incidence data can vary from pictorial representa-

    tion of counts within tracts, to the mapping of estimates from complex models

    purporting to describe the structure of the disease events. In this section, we

    describe the range of mapping methods from simple representations to model-

    based forms. The geographical incidence of disease has as its fundamental unit

    of observation, the address location of cases of disease. The residential address

    (or possibly the employment address) of cases of disease contains important

    information relating to the type of exposure to environmental risks. Often,

    however, the exact address locations of cases are not directly available, and

    one must use instead counts of disease in arbitrary administrative regions, such

    as census tracts or postal districts.

    1.2.1.1 Crude representation of disease distribution

    The simplest possible mapping form is the depiction of disease rates at specific

    sets of locations. For counts within tracts, this is a pictorial representation of the

    number of events in the tracts plotted at a suitable set of locations (e.g., tract

    centroids). The locations of case-events within a spatially heterogeneous popu-

    lation can display a small amount of information concerning the overall pattern

    of disease events within a window. However, any interpretation of the structure

    Disease map restoration 3

  • of these events is severely limited by the lack of information concerning the

    spatial distribution of the background population which might be ‘at risk’ from

    the disease of concern and which gave rise to the cases of disease. This popula-

    tion also has a spatial distribution and failure to take account of this spatial

    variation severely limits the ability to interpret the resulting case-event map. In

    essence, areas of high density of ‘at risk’ population would tend to yield high

    incidence of case-events and so, without taking account of this distribution,

    areas of high disease intensity could be spuriously attributed to excess disease

    risk.

    In the case of counts of cases of disease within tracts, similar considerations

    apply when crude count maps are constructed. Here, variation in population

    density also affects the spatial incidence of disease. It is also important to consider

    how a count of cases could be depicted in amapped representation. Countswithin

    tracts are totals of events from the whole tract region. If tracts are irregular, then

    a decision must be made to either ‘locate’ the count at some tract location (e.g.

    tract centroid, however defined) with suitable symbolization, or to represent the

    count as a fill colour or shade over the whole tract (choropleth thematic map). In

    the former case, the choice of location will affect interpretation. In the latter case,

    symbolization choice (shade and/or colour) could distort interpretation also,

    although an attempt to represent the whole tract may be attractive.

    In general, methods that attempt to incorporate the effect of background ‘at

    risk’ population (termed: at risk background) are to be preferred. These are

    discussed in the next section.

    1.2.1.2 Standardized mortality/morbidity ratios and standardization

    To assess the status of an area with respect to disease incidence, it is convenient

    to attempt to first assess what disease incidence should be locally ‘expected’ in

    the tract area and then to compare the observed incidence with the ‘expected’

    incidence. This approach has been traditionally used for the analysis of counts

    within tracts. Traditionally, the ratio of observed to expected counts within

    tracts is called a Standardized Mortality/Morbidity Ratio (SMR) and this ratio is

    an estimate of relative risk within each tract (i.e., the ratio describes the odds of

    being in the disease group rather than the background group). The justification

    for the use of SMRs can be supported by the analysis of likelihood models with

    multiplicative expected risk (see, for example, Breslow and Day, 1987). In

    Section 1.2.3.1, we explore further the connection between likelihood models

    and tract-based estimators of risk. Figure 1.2 displays the SMR thematic map for

    congenital abnormality deaths within South Carolina, USA, for the year 1990

    based on expected rates calculated from the South Carolina 1990–1998 state-

    wide rate per 1000 births.

    Define yi as the observed count of the case disease in the ith tract, and ei as the

    expected count within the same tract. Then the SMR is defined as:

    4 Disease mapping basics

  • Congenital abnormality deathsSMR 1990 using 8 year rate

    1.51 to 4.1 (9)1.09 to 1.51 (9)0.78 to 1.09 (9)0.5 to 0.78 (9)0 to 0.5 (10)

    Figure 1.2 South Carolina congenital abnormality deaths 1990: SMRs.

    b��i ¼ yiei: (1:1)

    In this case it must be decided whether to express the b��i as fill patterns in eachregion, or to locate the result at some specified tract location, such as the

    centroid. If it is decided that these measures should be regarded as continuous

    across regions then some further interpolation of b��i must be made (see, forexample, Breslow and Day, 1987, pp. 198–9).

    SMRs are commonly used in disease map presentation, but have many

    drawbacks. First, they are based on ratio estimators and hence can yield large

    changes in estimate with relatively small changes in expected value. In the

    extreme, when a (close to) zero expectation is found the SMR will be very large

    for any positive count. Also the zero SMRs do not distinguish variation in

    expected counts, and the SMR variance is proportional to 1=ei. The SMR isessentially a saturated estimate of relative risk and hence is not parsimonious.

    1.2.2 Informal methods

    To circumvent the problems associated with SMRs a variety of methods have

    been proposed. Some of these are relatively informal or nonparametric and

    others highly parametric. In the rest of this volume we will concentrate on

    the model-based relative risk estimation methods. However, it is useful here to

    present briefly some notes on alternative methods.

    One approach to the improvement of relative risk estimation is to employ

    smoothing tools on SMRs to reduce the noise. These tools could be based on

    Disease map restoration 5

  • interpolation methods, or more commonly on nonparametric smoothers such as

    kernel regression (Nadaraya-Watson, local linear) (Bowman and Azzalini,

    1997), and partition methods (Ferreira et al., 2002). A variety of exploratory

    data analysis (EDA) methods have also been advocated (see, for example,

    Cressie, 1993). These methods usually require the estimation of a smoothing

    constant which describes the overall behaviour of the relative risk surface. Some

    local methods are also available. Generalized additive models have also been

    proposed and these have the advantage of allowing the incorporation of covari-

    ates (see, for example, Kelsall and Diggle, 1998).

    1.2.3 Basic models

    When more substantive hypotheses and/or greater amounts of prior informa-

    tion are available concerning the problem, then it may be advantageous to

    consider a model-based approach to disease map construction. Model-based

    approaches can also be used in an exploratory setting, and if sufficiently general

    models are employed then this can lead to better focusing of subsequent

    hypothesis generation. In what follows, we consider first likelihood models for

    case event data and then discuss the inclusion of extra information in the form

    of random effects.

    1.2.3.1 Likelihood models

    Usually the basic model for case-event data is derived from the following

    assumptions:

    (1) Individuals within the study population behave independently with respect

    to disease propensity, after allowance is made for observed or unobserved

    confounding variables.

    (2) The underlying at risk background intensity has a continuous spatial

    distribution, within a specified boundary.

    (3) The case-events are unique, in that they occur as single spatially separate

    events.

    Assumption (1) above allows the events to be modelled via a likelihood

    approach, which is valid conditional on the outcomes of confounder variables.

    Further, assumption (2), if valid, allows the likelihood to be constructed with

    a background continuous modulating intensity function representing the ‘at

    risk’ background. The uniqueness of case-event locations is a requirement of

    point process theory (the property called orderliness: see, for example, Daley and

    Vere-Jones, 1988), which allows the application of Poisson-process models in

    this analysis. Assumption (1) is generally valid for non-infectious diseases. It

    6 Disease mapping basics

  • may also be valid for infectious diseases if the information about current

    infectives were known at given time points. Assumption (2) will be valid at

    appropriate scales of analysis. It may not hold when large areas of a study

    window include zones of zero population (e.g. harbours/industrial zones). Often

    models can be restricted to exclude these areas however. Assumption (3) will

    usually hold for relatively rare diseases but may be violated when households

    have multiple cases and these occur at coincident locations. This may not be

    important at more aggregate scales, but could be important at a fine spatial

    scale. Remedies for such non-orderliness are the use of de-clustering algo-

    rithms (which perturb the locations by small amounts), or analysis at a higher

    aggregation level. Note that it is also possible to use a conventional case-control

    approach to this problem (Diggle et al., 2000).

    In the case of observed counts of disease within tracts, the Poisson-process

    assumptions given above mean that the counts are Poisson distributed with, for

    each tract, a different expectation. Often at this point a simplifying assumption is

    made where the ith tract count expectation is regarded as being a function of a

    parameter within a model hierarchy, without considering the spatial continuity

    of the intensity. This assumption leads to considerable simplifications and the

    distribution of the tract counts is often assumed to be

    yi � Poisson(ei�i),

    where �i is assumed to be a constant relative risk parameter. In this definitionthe expected value of the count is a multiplicative function of the expected

    count/rate (ei) and a relative risk. This is the classic model assumed in many

    disease mapping studies. The log-likelihood associated with this model is, bar a

    constant, given by:

    l ¼Xmi¼1

    yi ln (ei�i)�Xmi¼1

    ei�i:

    Note that by differentiation the saturated maximum likelihood estimator of �i isjust yi=ei, the SMR.

    This model makes a number of assumptions. First it is assumed that any

    excess risk in a tract will be expressed beyond that described by ei. For example,

    the expected rate (ei) can be estimated in a variety of ways. Often external

    standardization is used, where known supra-regional rates for different age �sex groups are applied to the local population in each tract. The use of external

    standardization alone to estimate the expected counts/rates within tracts may

    provide a different map from that provided by a combination of external stand-

    ardization and measures of tract-specific deprivation (e.g. deprivation indices

    (Carstairs, 1981) ). If any confounding variables are available and can be

    included within the estimate of the at risk background, then these should be

    Disease map restoration 7

  • considered for inclusion. Examples of confounding variables could be found

    from national census data, particularly relating to socioeconomic measures.

    These measures are often defined as ‘deprivation’ indicators, or could relate to

    lifestyle choices. For example, the local rate of car ownership or percentage

    unemployed within a census tract or other small area, could provide a surrogate

    measure for increased risk, due to correlations between these variables and poor

    housing, smoking lifestyles, and ill-health. Hence, if it is possible to include such

    variables, then any resulting map will display a close representation of the ‘true’

    underlying risk surface. When it is not possible to include such variables, it is

    sometimes possible to adapt a mapping method to include covariates of this type

    within regression setting.

    1.2.3.2 Fixed effects

    Usually the focus of attention when more sophisticated models are applied in

    disease mapping is the relative risk. Hence, all the models we will examine in

    this volume will be models for the {�i}. One simple model for the relative riskswould be to suppose that there could be a spatial trend or long-range variation

    over the study area. To do this we can construct a model which is a function of

    the spatial coordinates of the tract centroids: {x1i, x2i} representing eastings and

    northings, say. Simple forms of spatial trend can be modelled by using the

    centroid coordinates or functions of the coordinates as covariates and assuming

    a regression-type model. As the relative risks must be positive it is usual to

    model the logarithm of the relative risk as a linear function. Hence, in this case

    we could have:

    �i ¼ exp{�0 þ �1x1i þ �2x2i}: (1:2)

    This model includes a constant rate (exp {�0}) which captures the overall rateacross the whole study region, and two linear parameters: �1,�2. This modeldescribes a planar trend across the study region, and can be easily extended to

    include higher-order trend surfaces by adding power functions of the coordinates.

    Here we have used centroid locations as covariates, and indeed this model can be

    generalized simplywhenyouobserveother covariatesmeasuredwithin the tracts.

    For example it may be possible to include deprivation scores for each tract or

    census variables such as percentage unemployed or percentage car ownership. In

    general, assume that the intercept (constant rate) term is defined for a variable x0iwhich is 1 for each tract. Hence we can specify the model compactly as

    �i ¼ exp{xib},

    where x is a m� p matrix consisting of p� 1 covariates, b is a p� 1 parametervector and xi denotes the ith observation row of x.

    8 Disease mapping basics

  • This type of fixed effect model can be fitted in conventional statistical pack-

    ages which allow Poisson regression or log-linear modelling. The glm function

    in R or S-Plus with a log link and log offset of the {ei} can be used, for example.

    1.2.3.3 Random effects

    In the sections above some simple approaches to mapping counts within tracts

    have been described. These methods assume that once all known and observ-

    able confounding variables are included then the resulting map will be clean of

    all artefacts and hence depicts the true excess risk surface. However, it is often

    the case that unobserved effects could be thought to exist within the observed

    data and that these effects should also be included within the analysis. These

    effects are often termed random effects, and their analysis has provided a large

    literature both in statistical methodology and in epidemiological applications

    (see, for example, Manton et al., 1981; Tsutakawa, 1988; Breslow and Clayton,

    1993; Clayton, 1991; Best and Wakefield, 1999; Lawson, 2001; Richardson,

    2003). Within the literature on disease mapping, there has been a considerable

    growth in recent years in modelling random effects of various kinds. In the

    mapping context, a random effect could take a variety of forms. In its simplest

    form, a random effect is an extra quantity of variation (or variance component)

    which is estimable within the map and which can be ascribed a defined

    probabilistic structure. This component can affect individuals or can be associ-

    ated with tracts or covariates. For example, individuals vary in susceptibility to

    disease and hence individuals who become cases could have a random com-

    ponent relating to different susceptibility. This is sometimes known as frailty.

    Another example is the interpolation of a spatial covariable to the locations of

    case events or tract centroids. In that case, some error will be included in the

    interpolation process, and could be included within the resulting analysis of

    case or count events. Also, the locations of case-events might not be precisely

    known or subject to some random shift, which may be related to uncertain

    residential exposure. Finally, within any predefined spatial unit, such as tracts

    or regions, it may be expected that there could be components of variation

    attributable to these different spatial units. These components could have

    different forms depending on the degree of prior knowledge concerning the

    nature of this extra variation. For example, when observed counts, thought to

    be governed by a Poisson distribution, display greater variation than expected

    (i.e. variance> mean), it is sometimes described as overdispersion. This over-dispersion can occur for various reasons. Often it arises when clustering occurs

    in the counts at a particular scale. It can also occur when considerable numbers

    of cells have zero counts (sparseness), which can arise when rare diseases are

    mapped. In spatial applications, it is important furthermore to distinguish two

    basic forms of extra variation. First, as in the aspatial case, a form of independ-

    ent and spatially uncorrelated extra variation can be assumed. This is often

    Disease map restoration 9

  • called uncorrelated heterogeneity (see, for example, Besag et al., 1991). Another

    form of random effect is that which arises from a model where it is thought that

    the spatial unit (such as case-events, tracts or regions) is correlated with neigh-

    bouring spatial units. This is often termed correlated heterogeneity. Essentially, this

    form of extra variation implies that there exists spatial autocorrelation between

    spatial units (see, for example, Cliff and Ord (1981) for an accessible introduction

    to spatial autocorrelation). This autocorrelation could arise for a variety of

    reasons. First, the disease of concern could be naturally clustered in its spatial

    distribution at the scale of observation. Many infectious diseases display such

    spatial clustering, and a number of apparently non-infectious diseases also cluster

    (see, for example, Cuzick and Hills, 1991; Glick, 1979). Second, autocorrelation

    can be induced in spatial disease patterns by the existence of unobserved environ-

    mental or frailty effects. Hence the extra variation observed in any application

    could arise from confounding variables that have not been included in the

    analysis. In disease mapping examples, this could easily arise when simple

    mapping methods are used on SMRs with just basic age–sex standardization.

    In the discussion above on heterogeneity, it is assumed that a global measure

    of heterogeneity applies to a mapped pattern. That is, any extra variation in

    the pattern can be captured by including a general heterogeneity term in the

    mapping model. However, often spatially-specific heterogeneity may arise

    where it is important to consider local effects as well as, or instead of, general

    heterogeneity. To differentiate these two approaches, we use the term specific

    and nonspecific heterogeneity. Specific heterogeneity implies that spatial loca-

    tions are to be modelled locally; for example, clusters of disease are to be

    detected on the map. In contrast, ‘nonspecific’ describes a global approach to

    such modelling, which does not address the question of the location of effects. In

    this definition, it is tacitly assumed that the locations of clusters of disease can be

    regarded as random effects themselves. Hence, there are strong parallels be-

    tween image processing tasks and the tasks of disease mapping.

    Random effects can take a variety of forms and suitable methods must be

    employed to provide correctly estimated maps under models including these

    effects. In this section, we discuss simple approaches to this problem from a

    frequentist, multilevel and Bayesian viewpoint.

    A frequentist approach. In what follows, we use the term ‘frequentist’ to

    describe methods that seek to estimate parameters within a hierarchical model

    structure. The methods do assume that the random effects have mixing (or prior)

    distributions. For example a common assumption made when examining tract

    counts is that yi � Poisson(ei�i) independently, and that �i � Gamma(�,�). Thislatter distribution is often assumed for the Poisson relative risk parameter and

    provides for a measure of overdispersion relative to the Poisson distribution

    itself, depending on the �,� values used. The joint distribution is now given bythe product of a Poisson likelihood and a gamma distribution. At this stage a

    choice must be made concerning how the random intensities are to be estimated

    or otherwise handled. One approach to this problem is to average over the

    10 Disease mapping basics

  • values of �i to yield what is often called the marginal likelihood. Having averagedover this density, it is then possible to apply standard methods such as max-

    imum likelihood. This is usually known as marginal maximum likelihood (see,

    for example, Bock and Aitkin, 1981; Aitkin, 1996b). In this approach, the

    parameters of the gamma distribution are estimated from the integrated likeli-

    hood. A further development of this approach is to replace the gamma density

    with a finite mixture. This approach is essentially nonparametric and does not

    require the complete specification of the parameter distribution (see, for

    example, Aitkin, 1996a).

    Although the example specified here concerns tract counts, the method

    described above can equally be applied to case-event data, by inclusion of a

    random component in the intensity specification.

    A Bayesian approach. It is natural to consider modelling random effects

    within a Bayesian framework. First, random effects naturally have prior distri-

    butions and the joint density discussed above is proportional to the posterior

    distribution for the parameters of interest. Hence, the development of full Bayes

    and empirical Bayes (posterior approximation) methods has progressed natur-

    ally in the field of disease mapping. The prior distribution(s) for the (u, say)parameters in the intensity specification ei�i, have hyperparameters (in thePoisson–gamma example above, these were �,�). These hyperparameters canalso have hyperprior distributions. The distributions chosen for these param-

    eters depend on the application. In the full Bayesian approach, inference is

    based on the posterior distribution of u given the data. However, as in thefrequentist approach above, it is possible to adopt an intermediate approach

    where the posterior distribution is approximated in some way, and subsequent

    inference may be made via ‘frequentist-style’ estimation of parameters or by

    computing the approximated posterior distribution. In the tract-count example,

    approximation via intermediate prior-parameter estimation would involve the

    estimation of � and �, followed by inference on the estimated posterior distribu-tion (see, for example, Carlin and Louis, 1996, pp. 67–8).

    For count data, a number of examples exist where independent Poisson

    distributed counts (with constant within-tract rate) are associated with prior

    distributions of a variety of complexity. The earliest examples of such a Bayesian

    mapping approach can be found in Manton et al. (1981) and Tsutakawa

    (1988). Also, Clayton and Kaldor (1987) developed a Bayesian analysis of

    a Poisson likelihood model where yi has expectation ei�i, and found thatwith a prior distribution given by �i � Gamma(�,�), the Bayes estimate of �i isthe posterior expectation:

    yi þ �ei þ �

    : (1:3)

    Hence one could map directly these Bayes estimates. Now, the distribution of �iconditional on yi is Gamma(yi þ �, ei þ �) and a Bayesian approach would

    Disease map restoration 11

  • South Carolina empirical Bayes mean RRestimates congenital abnormality deaths 1990

    1.51 to 3.96 (8)1.09 to 1.51 (9)0.78 to 1.09 (10)0.51 to 0.78 (8)0.01 to 0.51 (11)

    Figure 1.3 Empirical Bayes mean relative risk (RR) estimates.

    require summarization of �i from this posterior distribution. In practice, this isoften obtained by generation of realizations from this posterior and then the

    summarizations are empirical (e.g.MarkovChainMonteCarlo (MCMC)methods).

    Figure 1.3 displays the empirical Bayes estimates under the Poisson–gamma

    model with � and � estimated as in Clayton and Kaldor. Note that in contrast tothe SMR map (Figure 1.2), Figure 1.3 presents a smoother relative risk surface.

    Other approaches and variants in the analysis of simple mapping models have

    been proposed by Tsutakawa (1988), Marshall (1991) and Devine and Louis

    (1994). In the next section, more sophisticated models for the prior structure of

    the parameters of the map are discussed.

    1.2.4 Advanced Bayesian models

    Many of the models discussed above can be extended to include the specification

    of prior distributions for parameters and hence can be examined via Bayesian

    methods. In general, we distinguish here between empirical Bayes methods and

    full Bayes methods, on the basis that any method which seeks to approximate

    the posterior distribution is regarded as empirical Bayes (Bernardo and Smith,

    1994). All other methods are regarded as full Bayes. This latter category

    includes maximum a posteriori estimation, estimation of posterior functionals,

    as well as posterior sampling.

    12 Disease mapping basics

  • 1.2.4.1 Empirical Bayes methods

    The methods encompassed under the definition above are wide-ranging, and

    here we will only discuss a subset of relevant methods. The first method

    considered by the earliest workers was the evaluation of simplified (constrained)

    posterior distributions. Manton et al. (1981) used a direct maximization of a

    constrained posterior distribution, Tsutakawa (1988) used integral approxima-

    tions for posterior expectations, while Marshall (1991) used a method of

    moments estimator to derive shrinkage estimates. Devine and Louis (1994)

    further extended this method by constraining the mean and variance of the

    collection of estimates to equal the posterior first and second moments.

    The second type of method which has been considered in the context of

    disease mapping is the use of likelihood approximations. Clayton and Kaldor

    (1987) first suggested employing a quadratic normal approximation to a Pois-

    son likelihood, with gamma prior distribution for the intensity parameter of the

    Poisson distribution and a spatial correlation prior. Extensions to this approach

    lead to simple generalized least squares (GLS) estimators for a range of likeli-

    hoods (Lawson, 1994; 1997).

    A third type is the Laplace asymptotic integral approximation, which has

    been applied by Breslow and Clayton (1993) to a generalized linear modelling

    framework in a disease mapping example. This integral approximation method

    allows the estimation of posterior moments and normalizing integrals (see, for

    example, Bernardo and Smith, 1994, pp. 340–4). A further, but different,

    integral approximation method is where the posterior distribution is integrated

    across the parameter space: that is, the nuisance parameters are ‘integrated out’

    of the model. In that case the method of nonparametric maximum likelihood

    (NPML) can be employed (Bock and Aitkin, 1981; Aitkin, 1996b; Clayton and

    Kaldor, 1987). Another possibility is to employ Linear Bayes methods (Marshall,

    1991).

    1.2.4.2 Full Bayes methods

    Full posterior inference for Bayesian models has now become available, largely

    because of the increased use of MCMC methods of posterior sampling. The first

    full sampler reported for a disease mapping example was a Gibbs sampler

    applied to a general model for intrinsic autoregression and uncorrelated hetero-

    geneity by Besag et al. (1991). Subsequently, Clayton and Bernardinelli (1992),

    Breslow and Clayton (1993) and Bernardinelli et al. (1995) have adapted this

    approach to mapping, ecological analysis and space–time problems.

    This has been facilitated by the availability of general Gibbs sampling pack-

    ages such as BEAM and BUGS (GeoBUGS and WinBUGS). Such Gibbs sampling

    methods can be applied to putative source problems as well as mapping/eco-

    logical studies. Alternative, and more general, posterior sampling methods, such

    Disease map restoration 13

  • as the Metropolis–Hastings algorithm, are currently not separately available in

    a packaged form, although these methods can accommodate considerable

    variation in model specification. WinBUGS does provide such estimators when

    non-convex posterior distributions are encountered. Metropolis–Hastings algo-

    rithms have been applied in comparison to approximate maximum a posteriori

    (MAP) estimation by Lawson et al. (1996) and Diggle et al. (1998); hybrid

    Gibbs–Metropolis samplers have been applied to space–time problems by Waller

    et al. (1997). In addition, diagnostic methods for Bayesian MCMC sample output

    have been discussed for disease mapping examples by Zia et al. (1997). Devel-

    opments in this area have been reviewed recently (Lawson et al., 1999; Elliott

    et al., 2000; Lawson, 2001).

    1.2.5 Multilevel modelling approaches

    An alternative to the above specification can be considered where a log-linear

    form is specified:

    �i ¼ exp{�0 þ vi},

    where the random term has a zero mean Gaussian distribution, i.e. vi � N(0,�2v )and �2v is the variance of the random effects v.

    This model may be rewritten (in terms of counts rather than rates) as

    yi � Poisson(�i), log(�i) ¼ log(ei)þ �0 þ vi:

    Here �i ¼ ei�i and the log(ei) are treated as known ‘offset’ terms.Generally, multilevel models (see, for example, Goldstein, 1995) are fitted

    to data that possess levels of clustering in their structure. In disease mapping

    and geographical applications in general such levels would be different levels of

    geographical aggregation, for example, census tracts nested within counties

    nested within countries. For each level of geography we could then fit normally

    distributed random effects so for example if we had data on census tracts nested

    within counties we could fit

    yij � Poisson(�ij), log(�ij) ¼ log(eij)þ �0 þ vj þ uij,

    where both the county and tract random effects have Gaussian distributions, i.e.

    vj � N(0, �2v ) and uij � N(0,�2u ).Poisson response multilevel models can be fitted using either frequentist or

    Bayesian approaches. Frequentist approaches generally involve some approxi-

    mations, for example the software package MLwiN (Rasbash et al., 2000) uses

    quasi-likelihood methods that involve Taylor series approximations (Goldstein,

    1991; Goldstein and Rasbash, 1996) to transform the problem so that it can be

    14 Disease mapping basics