A smoothed ANOVA model for multivariate ecological regression

ORIGINAL PAPER

A smoothed ANOVA model for multivariate ecological regression

Marc Marı-Dell’Olmo • Miguel A. Martinez-Beneito •

Merce Gotsens • Laia Palencia

Published online: 18 August 2013

� Springer-Verlag Berlin Heidelberg 2013

Abstract Smoothed analysis of variance (SANOVA) has

recently been proposed for carrying out disease mapping.

The main advantage of this approach is its conceptual

simplicity and ease of interpretation. Moreover, it allows us

to fix the combination of diseases of particular interest in

advance and to make specific inferences about them. In this

paper we propose a reformulation of SANOVA in the

context of ecological regression studies. This proposal

considers the introduction in a non-parametric way of one

(or several) covariate(s) into the model, explaining some

pre-specified combinations of the outcome variables. In

addition, random effects are also incorporated in order to

model geographical variation in the combinations of

outcome variables not explained by the covariate. Lastly,

the model permits the decomposition of the variance in the

set of outcome variables into different orthogonal compo-

nents, quantifying the contribution of every one of them.

The proposed model is applied to the geographical analysis

of mortality due to malignant stomach neoplasm among

women resident in the city of Barcelona (Spain). The

available outcome variables are deaths grouped into two

time periods, and a socioeconomic deprivation index is

included as a covariate. The model has been implemented

through INLA, a novel inference tool for Bayesian

statistics.

Keywords Disease mapping � Smoothed analysis of

variance � Multivariate analysis � Mortality �Socioeconomic factors � Ecological regression

1 Introduction

There is a growing interest among public health profes-

sionals in the study of health indicators at a small-area

level. Small area studies permit a deeper understanding to

be gained of geographical patterns and have proved to be

essential in uncovering local-level inequalities often

masked by health estimates from larger areas (Borrell et al.

2010). However, statistical analysis in this context is far

from trivial due to the small amount of information on

which every available observation is based (Barcelo et al.

2008).

The statistical models employed in these studies usually

take into account the possible spatial dependence of the

data. The assumption of dependence between neighbouring

sites improves the estimates of indicators as they are based

on more information, namely that of each geographical unit

This article has been included in the doctoral thesis of one of the

authors (Marc Marı-Dell’Olmo), being carried out at Pompeu Fabra

University.

Electronic supplementary material The online version of thisarticle (doi:10.1007/s00477-013-0782-2) contains supplementarymaterial, which is available to authorized users.

M. Marı-Dell’Olmo � M. A. Martinez-Beneito (&) �M. Gotsens � L. Palencia

CIBER Epidemiologıa y Salud Publica (CIBERESP), Madrid,

Spain

e-mail: [email protected]

M. Marı-Dell’Olmo � M. Gotsens � L. Palencia

Agencia de Salut Publica de Barcelona, Barcelona, Spain

M. Marı-Dell’Olmo � M. Gotsens � L. Palencia

Institut d’Investigacio Biomedica (IIB Sant Pau), Barcelona,

Spain

M. A. Martinez-Beneito

Centro Superior de Investigacion en Salud Publica CSISP-

FISABIO, Av. Cataluna, 21., 46020 Valencia, Spain

123

Stoch Environ Res Risk Assess (2014) 28:695–706

DOI 10.1007/s00477-013-0782-2

http://dx.doi.org/10.1007/s00477-013-0782-2

and its neighbours. One of the most widely-used models for

the analysis of small areas is that of Besag, York and

Mollie (BYM) (Besag et al. 1991). Their model includes

two random effects: one takes into account the spatial

dependence between nearby units, due to the existence of

risk factors common to several geographical units; and the

other, spatially heterogeneous, models the existence of risk

factors which only affect particular geographical units and

which make their risks palpably different to those of

neighbouring regions.

Recently, different methods capable of modelling a

multivariate outcome have appeared. These take into

account dependence among the different components of

such an outcome and the spatial dependence of the data.

These methods, for example, allow conducting joint anal-

yses of various potentially related causes of death, joint

analyses of a cause of death for different population groups

(such as different sexes or ethnic groups), etc. In what

follows, for the sake of simplicity, we will refer to the

problem of multivariate smoothing as the joint smoothing

of various diseases.

The vast majority of proposals for conducting studies of

this type are based on the multivariate generalisation of

conditional auto-regressive distributions (CAR) (Rue and

Held 2005) to more than one response considering

dependence between them. Mardia (1988) could be con-

sidered a particular relevant contribution within this

approach. Another approach to multivariate modelling is

that known as Spatial Factor Bayesian models (Tzala and

Best 2008; Marı-Dell’Olmo et al. 2011; Banerjee et al.

2003; Wang and Wall 2003; Liu et al. 2005), which

assumes a set of underlying geographical patterns to be

shared by every one of the diseases. Both these patterns and

their contribution to each of the diseases are estimated

within these models. In this way information is shared (in

the form of underlying patterns) between diseases and also

between different geographical units, since spatial depen-

dence is usually assumed (via a CAR prior) for the com-

mon patterns. Bayesian shared component models (Knorr-

Held and Best 2001) may be considered as a special case of

spatial factor Bayesian models. Recently, a novel approach

to multivariate disease mapping, known as smoothed

analysis of variance (SANOVA) (Hodges et al. 2007;

Zhang et al. 2009), has been proposed. The choice of one

or other of these proposals ultimately depends on the aims

pursued by the study, such as the estimation of one (or

several) factors common to the disease, the specific esti-

mation of the covariance structure between diseases, etc.

The main advantage of SANOVA in comparison to

other multivariate modelling approaches is its conceptual

simplicity and ease of interpretation. In general, the other

approaches lead to more complex models capable of

modelling the relationship between various health

indicators in more detail, but which are more complicated

to define, implement and interpret (Zhang et al. 2009).

Moreover, SANOVA allows the fixing in advance of which

factors (contrasts) are of particular interest in the study, and

the making of specific inferences about these particularly

relevant factors. Therefore, from an applied perspective,

SANOVA is of great interest.

Ecological regression studies try to establish associa-

tions between one or various covariates and one health

indicator, using aggregate sets of individuals as the unit of

analysis; for example, in terms of a geographical division.

However, it is not usual to find multivariate ecological

regression models in which various health indicators are

studied jointly, relating them to one or more covariates.

Such studies could increase the power of ecological

regression since the relationships between covariates and

health indicators would be based on more information.

Moreover, they would make it possible to study the rela-

tionship between the covariates and a combination of

indicators (for example, differences in mortality for two

diseases), instead of the raw relationship with each one of

the indicators separately.

This study proposes an adaptation of SANOVA to the

context of ecological regression models. This adaptation

will allow the fitting of multivariate ecological regression

models, with the specific interpretational advantages of

SANOVA models. This proposal will decompose the var-

iance of the set of outcomes into orthogonal components

related to each of the SANOVA contrasts, the covariate and

random effects.

The paper is organized as follows: Sect. 2 specifies the

adaptation of SANOVA to the ecological regression context

that we will propose. Section 3 illustrates its application to a

real case study: data on malignant stomach neoplasm

mortality among women in the city of Barcelona for the

periods 1996–2001 and 2002–2007. A brief discussion of

our proposal and its main results is given in Sect. 4.

2 Model specification

We begin this section by proposing an ecological regres-

sion model that will be very convenient for the subsequent

multivariate generalisation. Next we will describe the for-

mulation of SANOVA modeling in detail, and we will

propose a reformulation of it which will enable us to

employ, in general, any type of spatial dependence struc-

ture, in contrast to the original proposal which in practical

terms (due to computational limitations) only permits the

use of an Intrinsic CAR distribution to model spatial

dependence. Finally, we will show how the aforementioned

proposal for ecological regression and our formulation of

SANOVA can be joined into a single model.

696 Stoch Environ Res Risk Assess (2014) 28:695–706

123

Before launching into the rest of the section, let us

introduce some notation. In what follows, variables in

boldface will refer to vectors or matrices. These two pos-

sibilities will be distinguished by the number of subscripts

accompanying the variable. Also, for a matrix A we will

use A�j to denote its jth column and vec Að Þ to denote the

vector ðA0�1; . . .;A0�JÞ0. The expression diag(a) will be used

to denote the diagonal matrix whose elements are given by

the vector a, and in the case where the components of a are

matrices, diag(a) will denote the corresponding block-

diagonal matrix. Finally, we will denote the Kronecker

product of two matrices by �, so that if A is a matrix with

dimension IxJ, then:

A� B ¼a11B � � � a1JB

..

. . .. ..

.

aI1B � � � aIJB

0B@

1CA

2.1 A univariate ecological regression model

Let Oi be the number of observed cases of a disease in the

area i, Ei the number of expected cases, and Ci the value of

the covariate that we wish to study, also in this area. Our

proposed ecological regression model assumes:

Oijhi�PoissonðEihiÞ; i ¼ 1; . . .; I

logðhiÞ ¼ lþ f ðCiÞ þ Si; i ¼ 1; . . .; I

where l is the model intercept, f(C) describes the effect of

the covariate on the logarithm of risk (h), and S is a spa-

tially dependent random effect describing the geographical

variation of risk unexplained by the covariate.

Our intention is to avoid the use of parametric forms to

describe the relationship between the covariate and the

logarithm of risk, since such functions often turn out to be

too restrictive to describe certain types of relationship. Our

modelling proposal is as follows: for a continuous variable

C we categorise its values into discrete groups according to

some number of quantiles. Thus, to categorise C in K

groups, we calculate its quantiles as 100(k/K) % where

k = 1, …, K-1 (hereafter denoted by Q1, …, QK-1

respectively), and we define:

f ðCiÞ ¼

a1 if Ci\Q1

a2 if Q1\Ci\Q2

..

. ...

aK if QK�1\Ci

8>>><>>>:

Obviously, K will be an important parameter in this

process, since small values could mask important variations

in the relationship we are attempting to describe, whereas

large values could lead to poor estimates of a1, …, aK. With

the intention of limiting this latter effect, we will consider

that a = (a1, …, aK) are dependent values, given the ordinal

relation of its components. Specifically, we could propose,

for example, an Intrinsic CAR (Rue and Held 2005), as a

prior distribution of a, considering contiguous values of the

subscript to be neighbours. Thus the components of a will

share information among themselves and as a consequence

its estimates will be more stable and f(C) will define a smooth

relationship between C and h. Moreover, if C was an ordinal

variable, we could use the same prior distribution as defined

above. On the other hand, if C was a nominal variable,

perhaps it would be more appropriate to use an independent

prior distribution for a.

Obviously, other modeling alternatives could be pro-

posed for f(C). For example a CAR distribution of

higher order, taking into account more distant neigh-

bours, could be proposed as a prior distribution of a(Rue and Held 2005). This proposal would yield

smoother relationships between the covariate and the log-

relative-risks. Spline modeling, for example, could also

be an alternative for modeling f(C) or even by means of

the sum of two (independent and dependent) random

effects (Besag et al. 1991). Nevertheless, although we

acknowledge that the specific modeling of f(C) is a

question of interest for the model, it is not a key

question for the main goal of this paper, which is how to

carry out multivariate ecological regression and how to

decompose the variance of the diseases under study.

Thus although we do not put special emphasis on this

question we would like to mention that other modeling

alternatives are also be possible.

The random effect S will model the geographical vari-

ations of the risk which are not linked to the covariate, that

is, those that cannot be described through f(C). S will have

a spatial structure reflecting the geographical nature of its

values. Indeed, we may consider S to be the sum of two

effects, one with an Intrinsic CAR prior distribution and the

other with independent values, following the reasoning

proposed by (Besag et al. 1991). In what follows we will

consider this spatial structure, since it is the form of

modelling most commonly used in disease mapping stud-

ies, even though other alternatives could be considered.

If the geographical distribution of C shows a spatially

dependent structure, it may be that f(C) and S compete to

explain the same pattern (Clayton et al. 1993). This pattern

would, at the same time, be related to the covariate and

would show spatial dependence and could therefore be

attributed to either of these two factors. In order to avoid

this circumstance we will restrict the random effect S to

being orthogonal to the effect of the covariate f(C), since

this way we guarantee that S explains the variability

unexplained by C, which was our intention. Computa-

tionally this restriction is fairly complex since the values of

a that define f(C) are unknown and hence the restriction on

S is not clearly determined, rather it is stochastic. A more

Stoch Environ Res Risk Assess (2014) 28:695–706 697

123

convenient computational alternative is to restrict the val-

ues of S through the following set of conditions:

Xfi:Qk\Ci\Qkþ1g

Si ¼ 0 : k ¼ 1; . . .;K

8<:

9=;

In other words, for each one of the groups that we have

used to define f(C), we require the sum of the

corresponding random effects S to be 0. It can be shown

that this set of restrictions leads to the orthogonality of

f(C) and S, independently of the values taken by the vector

a. Although this set of conditions is more restrictive than

merely imposing orthogonality between f(C) and S (we are

imposing K restrictions instead of one), S will still enjoy

considerable flexibility for the geographical description of

the variability of risk, since the number of values taken by

S (the number of units under study) will usually be much

greater than the number of restrictions imposed (the

number of quantiles chosen). Moreover, the linear space

to which S has been restricted is orthogonal to the linear

space to which f(C) belongs, and therefore the variability

not explained by one of these components will be

explained by the other, except possibly that due to the

shrinking out of random effects. As these spaces are

orthogonal, we avoid the two factors competing to explain

the same effect.

According to this proposal, the logarithms of the risks

are projected into three mutually orthogonal spaces of

dimensions 1, K-1 and I-K, respectively. Such orthogo-

nality will have interesting properties since it will allow us

to decompose the variability of the geographical pattern of

risk as variability associated to the covariate and variability

associated to other factors not linked to the covariate.

2.2 Multivariate disease mapping via SANOVA

We begin this section by defining the problem of multi-

variate disease mapping and the notation to be used. This

time, we will call Oij, i = 1, …, I, j = 1, …, J the number

of observed cases of disease j, in the ith geographic unit.

Similarly we will use Eij for the number of expected cases

for unit i and disease j. In the multivariate case, most

disease mapping proposals assume, as we also do, that:

Oijjhij�PoissonðEijhijÞ; i ¼ 1; . . .; I; j ¼ 1; . . .; J

One of the most interesting approaches to multivariate

disease mapping is that known as Smoothed ANOVA

(Zhang et al. 2009). According to that approach a series of

orthogonal contrasts (in the sense of analysis of variance)

between diseases is considered, i.e. a set of linear

combinations of the diseases we wish to model.

SANOVA proposes the modeling of every one of these

contrasts through spatially structured random effects

instead of employing such effects for the independent

modeling of every one of the diseases. Therefore, spatial

dependence is assumed for certain specific linear

combinations of the risks which are of particular interest

for the research. This idea turns out to be very reasonable

since often the spatial pattern is not specific to each one of

the diseases, but rather is due to the distribution of some

risk factors which may influence more than one disease. In

such cases it is more natural to attribute spatial dependence

to certain combinations of the risks for some diseases, and

therefore multivariate modelling with SANOVA may be

particularly appropriate. Moreover, SANOVA allows us to

make inferences about these specific contrasts which are

expected to be of epidemiological interest. The main

problem with SANOVA is precisely the choice of these

contrasts, since the degree of benefit obtainable through

fitting these models will depend mainly on how appropriate

this choice is, and often there are no clear criteria to help

choose them.

SANOVA may be mathematically formulated as fol-

lows: let H be a matrix of dimensions J 9 J whose col-

umns contain the coefficients of a series of contrasts

between diseases. In other words, the columns of H contain

the coefficients of the linear combinations of the diseases

we are going to model through spatially structured random

effects. H is by definition orthogonal, so that H0H = IJ.

We have to note that H will usually include a column that

is proportional to 1J, i.e. following Zhang et al.’s notation

H = [1J|H(-)], where H(-) is an J 9 (J - 1) matrix whose

columns are contrasts among diseases. This column is

usually used to model the common geographical patterns of

all the diseases studied. It is not really a contrast since it

does not make any comparison between diseases, never-

theless, in order to simply the notation, we will refer to the

columns of H as contrasts in general.

In the original proposal by Zhang et al. (2009),

SANOVA was used to model the logarithm of risks for a

set of diseases as:

lnðvecðhÞÞ ¼ ðH� VÞ � vecðuÞ

where V is the matrix containing the eigenvectors of the

precision matrix common to the columns of h and

conferring a spatial structure on them. u is a matrix of

random effects with dimension I 9 J, the prior distribution

for its columns being given by:

u�j�Nð0; ðsjDÞ�1Þ; j ¼ 1; . . .; J

where D is the diagonal matrix with the corresponding

eigenvalues of V.

This formulation corresponds to the original proposal

for the SANOVA model. In summary, that proposal con-

siders h to be a double transformation, i.e., by H and V, of


123

Gaussian independent random effects u. The first of these

transformations induces dependence among diseases

through the modeling of specific contrasts defined by

matrix H. The second one was motivated by the desire to

use an Intrinsic CAR distribution to spatially smooth these

contrasts. However, the second of these transformations

could be skipped, leading to a more flexible (and compu-

tationally convenient) class of models, which we will

introduce shortly. In more detail, the expression of the log-

risks in the SANOVA model can also be written as:

lnðvecðhÞÞ ¼ ðH� VÞ � vecðuÞ¼ ðH� IIÞ � ðIJ � VÞ � vecðuÞ¼ ðH� IIÞ � vecðwÞ

so that

vecðwÞ ¼ ðH0 � IIÞ � lnðvecðhÞÞ

that is, w models the contrasts defined by H and besides

vec(w) has variance–covariance matrix given by:

ðIJ � VÞdiagðfðsjDÞ�1gJj¼1ÞðIJ � VÞ0

¼ diagðfVðsjDÞ�1V0gJj¼1Þ ¼ diagðfðsjVDV0Þ�1gJ

j¼1Þ¼ diagðfðsjQÞ�1gJ

j¼1Þ:

In other words, the variance–covariance matrix of each

one of the columns in w will be proportional to Q-1, the

covariance matrix corresponding to the spatial dependence

structure we have assumed. Summarising, the SANOVA

model may be formulated either as proposed in the original

article by (Zhang et al. 2009), or by considering a series of

spatially dependent patterns (each one of the columns in

w), and combining such patterns according to the contrasts

defined in matrix H. Furthermore, such combinations may

also be expressed as follows:

ðH� IIÞ � vecðwÞ ¼H11II � � � H1JII

..

. . .. ..

.

HJ1II � � � HJJII

0B@

1CA

w�1...

w�J

0B@

1CA

¼H11w�1 þ � � � þ H1Jw�J

..

.

HJ1w�1 þ . . .þ HJJw�J

0B@

1CA

¼ H�1 � w�1 þ . . .þH�J � w�J :

Therefore, we can reformulate the SANOVA model as the

sum of the Kronecker products between each one of the

contrasts contained in H and the corresponding vectors of

(spatially dependent) random effects. This decomposition will

be very useful for our proposed ecological regression model.

Finally, we would like to stress that the reformulation of

SANOVA as it appears in this latter expression permits the

consideration of any distribution for the columns of w, and

not necessarily an Intrinsic CAR distribution as used in the

original proposal by Zhang et al. Thus, it is possible to

assume, for example, that these vectors undergo a convo-

lution process of two random effects, as Besag et al. pro-

pose (Besag et al. 1991) or any spatial process that we

might want to fit (Ugarte et al. 2005; Botella-Rocamora

et al. 2013). Therefore, the reformulation which we have

introduced allows the generalisation of the SANOVA

model to a broader collection of spatial processes, while at

the same time allowing the use of conventional Bayesian

inference software (INLA or WinBUGS for example) for

the estimation of these models.

2.3 Multivariate ecological regression using SANOVA

So far we have proposed an ecological regression model

which we consider convenient, since it allows the decom-

position of the variance into one part explained by the

covariate and another part not explained by it. We have

also introduced SANOVA as a multivariate disease map-

ping tool, based on the decomposition of variance within

the contrasts we wish to study. Our intention now is to

propose an ecological regression model, based on

SANOVA, which will simultaneously permit the sharing of

information between diseases while allowing us to

decompose the total variance of all diseases as a function of

the different elements considered in the study.

As we have already mentioned, our reformulation of

SANOVA allows us to express these models as the sum of

Kronecker products of the contrasts which we consider

opportune and the corresponding random effects. More-

over, these effects can take the dependence structure con-

sidered most appropriate, and thus our proposal for

multivariate ecological regression consists in making these

effects (w�1,…,w�J) depend on the covariate we wish to

study. Specifically, if we want to study the effect of

covariate C on the set of log-risks h, we would do so as

follows:

lnðvecðhÞÞ ¼ H�1 � w�1 þ � � � þH�J � w�J

w�1 ¼ l1 � 1þ f1ðCÞ þ S�1

� � �w�J ¼ lJ � 1þ fJðCÞ þ S�J

Thus, w�1,…,w�J model the spatial pattern of each one of

the comparisons of risks between diseases defined by

H�1,…,H�J, respectively. In turn, these will be made to

depend, in a flexible way, on the covariate under study.

Therefore it will be possible to determine and, if it exists,

describe the association between these two factors. Thus as

an added advantage of the multivariate modelling we could

establish the relationship between the covariate and certain

specific combinations of diseases, instead of establishing

the relationship between the covariate and each one of the

diseases independently, as is the case in univariate studies.


123

The functions f1(), …, fJ() are defined as described

above for the univariate ecological regression model. Thus

these functions will be based on the same partitioning of

the variable C, and will only differ in the value of the

vector a for each one of them. On the other hand, the

vectors of random effects S�1, …, S�J will usually have a

spatially structured distribution (not necessarily an Intrinsic

CAR distribution). Furthermore, these vectors will be

restricted to the orthogonal subspace generated by the

design matrix of the functions f(C), as we proposed above

for the univariate ecological regression model.

As outlined at the start of this section, the terms making

up w�j, for any j, decompose the spatial pattern of the

corresponding contrast in terms of three orthogonal factors

(intercept, covariate effect, and other factors). Thus with

our proposal the set of risks of the J diseases under study is

decomposed into 3J components. Moreover, the compo-

nents corresponding to different contrasts are also mutually

orthogonal, since if we take two components corresponding

to different contrasts we have:

ðH�i � aÞ0 � ðH�j � bÞ ¼ H1ia0 . . . HJia

0ð ÞH1jb

..

.

HJjb

0B@

1CA

¼XJ

k¼1

HkiHkj

!� a0bð Þ

where a and b are any two of the components making up

w�i and w�j, respectively. The first of the two factors on the

right of the above expression will be 0, since the columns

of H are orthogonal. Therefore, the proposed model leads

to a decomposition of the variance of the J patterns studied

into 3J components, all mutually orthogonal. J of these

components, those corresponding to the intercept terms of

each one of the contrasts, are projected into one-dimen-

sional spaces; J of them, those corresponding to the func-

tions f1(C), …, fJ(C), are projected into (K-1)-dimensional

spaces; and the last J of them into (I-K)-dimensional

spaces. Therefore the proposed model, apart from carrying

out an ecological regression, structures and allows the

quantification of the variability attributable to each one of

these components.

3 A real case study

We are going to apply the proposal above to a real problem

in order to illustrate its potential and to show its ease of

interpretation. We are going to study the geographical

distribution of malignant neoplasm of stomach mortality

(ICD-9: 151; ICD-10: C16) in women residing in the city

of Barcelona (Spain). Deaths were grouped into two peri-

ods (1996–2001 and 2002–2007), which will be the two

outcomes we want to model. The expected cases were

calculated by indirect standardization, taking as reference

the mortality rates of the city by age (in 5 years age-

groups) for the first period (1996–2001). The geographical

units used for analysis were the 1491 census tracts of the

city according to the 2001 census. We included a socio-

economic deprivation index as a covariate to explain

mortality and it was calculated through a principal com-

ponents analysis on socioeconomic variables, based on the

methodology described by Domınguez-Berjon et al.

(2008). In our case we want to model two outcomes (the

number of study periods considered), and so our proposal

assumes:

lnðvecðhÞÞ ¼ H�1 � w�1 þH�2 � w�2

w�1 ¼ l1 � 1þ f1ðCÞ þ S�1

w�2 ¼ l2 � 1þ f2ðCÞ þ S�2

Matrix H will contain the contrasts used to model the

matrix of risks. In our case we consider it of particular

interest to determine the pattern common to both periods,

as well as the evolution of risks, for each census tract,

between them. These two patterns will tell us about the

distribution of the risk factors which remain constant over

time in the study region and those which have undergone

changes over time. These two patterns will correspond to

the first and second columns, respectively, of the matrix of

contrasts H ¼ 2�0:5 1 �1

1 1

� �.

Each one of these patterns is divided, through the

decomposition of w�1 and w�2, into three components:

intercept, covariate effect, and variability not attributable

to the covariate. The intercept, or average value, of the

vectors w�1 and w�2 is modelled using l1 and l2 respec-

tively. In order to model the effect of the deprivation index

C on the two contrasts mentioned above in a non-para-

metric way, we have categorized the covariate into 20

quantile-groups (K = 20), and decomposed w�1 and w�2 as

functions of f1(C) and f2(C), respectively. The variability in

each of the contrasts not explained by deprivation is

modelled using S�1, and S�2. S�1 explains the variability in

risk common to both periods which cannot be explained by

the deprivation index, whereas S�2 explains the geograph-

ical variability of evolution over time of the risks which

cannot be attributed to deprivation. Each one of these

random effects has been decomposed in turn as the sum of

two random effects: S�1 = U�1 ? V�1 and S�2 = U�2 ? V�2,

following the proposal by Besag et al. (1991), with U�1 and

U�2 being two Intrinsic CAR spatial random effects and V�1,

V�2 being two heterogenous Gaussian random effects. For

the Intrinsic CAR spatial random effects, we have used an

adjacency matrix defined considering that two areas are

neighbours if they share a common boundary.


123

Independent zero-mean Gaussian prior distributions,

with precision 0.0001, are assigned to both l1 and l2. We

have considered that the vectors a1 and a2 determining

f1(C) and f2(C), respectively, follow a first-order random-

walk process with unknown variances r2f1

and r2f2

. An

Intrinsic-CAR prior distribution was assigned to the spatial

effects U�1 and U�2 (Besag et al. 1991), with unknown

variances r2U1 and r2

U2, respectively. Finally, the hetero-

geneous effects V�1 and V�2 were assumed to have unknown

variances r2V1 and r2

V2, respectively. A half-normal distri-

bution with mean 0 and precision 0.0001 was assigned to

the standard deviations rf1 ; rf2 ; rU1; rU2; rV1 and rV2

(Gelman 2006).

Inference in complex Bayesian hierarchical models is

usually carried out by means of Markov Chain Monte Carlo

(MCMC) simulation, but these techniques are usually very

time-consuming in large multivariate disease mapping

problems. For this reason, we have used the Integrated

Nested Laplace Approximations (INLA) (Rue and Martino

2009; Rue et al. 2009; Schrodle and Held 2010, 2011)

approach as an alternative to MCMC simulation (see

Appendix). INLA has been implemented within the R soft-

ware (R Development Core Team 2012). Moreover, INLA

allows the above-mentioned sum-to-zero restrictions on

f1ðCÞ and f2ðCÞ to be incorporated in a fairly straightforward

way, therefore making this a very convenient tool for car-

rying out inference in our model. Another illustrative

example of specification and incorporation of linear con-

straints with INLA may be found in Schrodle and Held

(2010), Sects. 3.2 and 3.4. For all these reasons INLA has

experienced an increase in popularity and it is commonly

used in spatial data analysis (Munoz et al. 2012). The R code

to perform the model can be found in the Appendix.

4 Results

Figure 1 presents the association between deprivation and

mortality as a series of graphs. Figure 1a shows the associ-

ation between deprivation and the pattern of risk common to

both periods. Figure 1b shows the association between

deprivation and the evolution of mortality over time, in this

case positive values in the relationship indicate an increase in

risk over time for the corresponding deprivation level.

Finally, Fig. 1c, d represent the association existing between

mortality and deprivation in the first and second periods,

respectively; these associations are just specific combina-

tions of the two preceding graphs defined by the matrix of

contrasts H. The associations are represented in terms of

relative risks, the continuous lines representing the posterior

means and the dotted lines their 95 % credible intervals.

Figure 1a shows that the pattern of mortality common to

both periods appears to be associated with deprivation,

with areas of less deprivation (on the left) presenting lower

mortality than the average for the city. Figure 1b shows a

decline over time of mortality in the most deprived areas,

although this decline is not so clear in statistical terms as

the relationship in Fig. 1a. The combination of these two

factors allows us to estimate the relationship between

deprivation and mortality in the two periods studied

(Fig. 1c, d). Thus, Fig. 1c suggests a positive and

approximately linear association between deprivation and

mortality in the first period. In addition, it may be seen that

the credible intervals estimated for the areas of least

deprivation are below 1, while the areas with greater

deprivation have credible intervals above 1. However, in

Fig. 1d we observe that in the second period, the associa-

tion between deprivation and mortality has been modified,

becoming less steep. In this period the association almost

disappears since the credible intervals of relative risks

include the value 1 for all levels of deprivation. This result

indicates a reduction over time of inequalities associated

with deprivation. This is particularly interesting since

despite the fact that the proposed model considers the two

periods as dependent (in fact there is a strong common

pattern), this fact does not prevent us finding appreciable

differences between the two periods.

Figure 2 shows a choropleth map of the socio-economic

deprivation index by septiles. In this figure, the areas

shown in dark brown (respectively, dark green) represent

the census tracts with most (respectively, least) depriva-

tion. The map suggests a high degree of spatial correlation

in this covariate, with the west of the city having fewer

depressed areas, while the eastern (city centre and old

quarter of the city), northeastern and northern areas present

greater deprivation.

SANOVA allows us to perform different decomposi-

tions of the geographical patterns of mortality for different

contrasts in the time periods, which will allow us to analyse

in more detail not only the association of mortality with

deprivation, but also the effect of time. In order to illustrate

this, Fig. 3 presents septile maps of some of these

decompositions. We have used the same divergent color

scheme pattern as in Fig. 2, where greens represent areas

with less mortality, while those with greater mortality are

represented with browns. In order to be able to compare

values between the different maps, the areas have been

coloured using fixed cut-off values.

Figure 3a presents the distribution of smoothed Stand-

ardised Mortality Ratios (sSMR) for the first period. In

order to develop this map we have made use of the mor-

tality common to both periods 2�0:5ðf1ðCÞ þ S�1Þ and the

effect of time which induces geographical differences in


123

the risk between the two periods 2�0:5ðf2ðCÞ þ S�2). Spe-

cifically, these are combined according to the values of H

and consequently the sSMR values for the first period are

given by sSMR ¼ 100 � exp 2�0:5ð f1 Cð Þ þ S�1ð Þ��

f2 Cð Þ þ S�2ð ÞÞÞ. The same procedure was used for the

sSMR values of the second period (Fig. 3b). The two maps

present a similar spatial distribution, allowing us to

appreciate the considerable advantages of the joint mod-

elling we have performed. However, despite the two pat-

terns having been modelled in a dependent manner, we

may appreciate slight differences between them, for

example it is particularly notable that in the second period

there is less spatial variability than in the first. This seems

to reflect a reduction in geographical inequalities.

Figure 3c is a geographical representation of the asso-

ciation found between deprivation and mortality common

to both periods, in other words (the posterior mean of)

100 � expð2�0:5f1ðCÞÞ. This figure basically displays the

relationship shown in Fig. 1a geographically. Figure 3d

shows the effect of deprivation on the temporal evolution

of mortality, in other words (the posterior mean of)

100 � expð2�0:5f2ðCÞÞ. Once again, this figure corresponds

to Fig. 1b and the green census tracts on the map corre-

spond to areas where mortality has declined over time.

Comparing the two figures, we may observe that a large

amount of the geographical variability found is common to

the two periods. Even so, it may also be appreciated that

those census tracts with the highest global risk for the two

periods studied are the ones which present the greatest

decline over time.

Figure 3e presents the distribution of mortality common

to the two periods which cannot be associated with depri-

vation, basically because there is no correlation with the

distribution of this factor, in other words (the posterior

Fig. 1 Relative risk (and 95 %

credible intervals) death due 10

malignant neoplasm of the

stomach as a function of the

socio-economic deprivation

index. a Association between

deprevation and the pattern of

mortality common to the two

periods. b Association between

deprivation and the evolution or

mortality over time.

c Association between

deprivation and mortality in the

first period. d Association

between deprivation and

mortality in the second period


123

mean of) 100 � expð2�0:5S�1Þ. Finally, Fig. 3f presents the

evolution over time of mortality which cannot be attributed

to deprivation, since both periods have different spatial

patterns, i.e. (the posterior mean of) 100 � expð2�0:5S�2Þ. It

may be seen in Fig. 3e that once the effect of deprivation

has been eliminated, the mortality common to the two

periods still presents some spatial correlation. Specifically,

two clear spatial clusters of low mortality (green) may be

observed in the north and south of the city centre area, and

a spatial cluster of high mortality (brown) in the southwest

of the city. These clusters cannot be explained by depri-

vation. In contrast, Fig. 3f indicates that after removing the

effect of deprivation in the second period, mortality in

general declined all over the city except for the coastal area

(southeast) where it remained stable (white areas). Once

again it may be seen that the spatial variability associated

with factors other than deprivation is greater in the pattern

common to the two periods than in the evolution of mor-

tality over time, and therefore the risk factor responsible

for this pattern has mainly had a persistent presence over

time.

As described in Sect. 2, all the components of w�1 and w�2are mutually orthogonal. This fact permits the decomposi-

tion of the variance of the logarithms of risk (lnðvec hð ÞÞÞ as

the sum of the variances of all these components. Thus, we

have: Varðlnðvec hð ÞÞÞÞ ¼ VarðH�1 � ðl1 � 1ÞÞ þ VarðH�1 �f1ðCÞÞþVarðH�1 � S�1Þ þ VarðH�2 � ðl2 � 1ÞÞ þ VarðH�2�f2ðCÞÞ þ VarðH�2 � S�2Þ. The first of these terms is 0 since it

is the variance of a constant vector. The percentages of

variance explained by each of the remaining components are,

respectively: 41.59, 8.71, 43.48, 5.98 and 0.24 %. Thus, it is

a particularly significant fact that the factor explaining the

most variance (43.48 %) is the component associated with

l2 � 1, which reflects the evolution over time of the average

mortality in each period, indicating an important time trend

in mortality affecting all census tracts equally. It is also

noteworthy that 47.57 % (41.59 ? 5.98 %) of the variability

studied is associated with deprivation, whereas only 8.95 %

(8.71 ? 0.24 %) would be explained by the action of other

factors. It is also interesting to learn that 50.30 %

(41.59 ? 8.71 %) of the variability in risk is spatial vari-

ability common to the two periods, whereas changes over

time in the geographical patterns only explain 6.22 %

(5.98 ? 0.24 %) of the variance. In our opinion, results of

this kind are quite interesting from an applied point of

view.

5 Discussion

In this article we propose a reformulation of SANOVA

which allows us to study the association between one

covariate and various response variables jointly. This

reformulation of SANOVA allows us to assign any spa-

tially structured distribution to the random effects includ-

ing, as we have illustrated in our example, the popular

convolution of random effects by Besag et al. (1991).

In the context of ecological regression some authors have

described the problem of the possible existence of colin-

earity between the covariate and the random effects

(Clayton et al. 1993; Reich et al. 2006; Hodges and Reich

2010; Hughes and Haran 2013). This can represent a serious

problem since it may be masking the true association

between the outcome and the covariate, and is usually

resolved by imposing random effects to lie on the orthog-

onal space of the fixed effects design matrix. In the present

study we have decomposed the variability of the outcome

into two orthogonal components, one associated with the

covariate, and a second orthogonal with the first. These two

components, by construction, cannot compete to explain the

same effect through being orthogonal. However, the effect

of any covariate whose spatial distribution is correlated with

the distribution of the covariate under study will be attrib-

uted to the covariate under study. In this sense it is impor-

tant for the study to include all those covariates which could

induce variations in the risks being investigated.

As mentioned, orthogonality has been previously used to

avoid confounding between fixed and random effects. In

contrast, our main motivation for imposing orthogonality is

somewhat different, our aim being to allow the decomposition

of all variance of the diseases under study. Therefore

Fig. 2 Distribution of the deprivation index by septiles. Green areas

represent low socioeconomic deprivation and brown areas represent

high socioeconomic deprivation


123

orthogonality may achieve these two goals at the same time

and, as a consequence, papers published under one of these

approaches may be very useful for the other one. Nevertheless,

in our opinion, our approach has certain computational

advantages since it does not rely on any spectral decomposi-

tion of a matrix. Such decomposition is a problem when the

Fig. 3 SANOVA model:

a distribution of smoothed

standardised mortality ratios

(sSMR) for the first period,

b distribution of (sSMR) for the

second period, c accociation

between deprivation and the

pattern of mortality common to

the two periods, d association

between deprevation and the

evoluation of mortality over

time, e distribution of

mortality,which cannot be

associated with deprivation,

common to the two periods,

f evolution over time of

mortality which cannot be

explained by deprivation


123

matrix to be decomposed depends on one or more parameters,

such as the CAR-proper distribution or geostatistical models

(Hughes and Haran 2013). In this sense the present work may

help to make ecological regressions avoiding confounding

between fixed and random effects more popular.

The implementation of the model in this paper has been

undertaken in INLA. This may be considered a particular

strength of the paper since it does not require any specific

coding of the method beyond the coding of INLA itself. Any-

way, as mentioned above we provide the INLA routines used in

our real case study as supplementary material to the paper.

Nevertheless, we feel it necessary to mention that INLA is at the

time of writing still a novel inference tool not fully checked in

models as complex as ours. In this sense more extensive

checking of INLA in this kind of model will be welcomed.

In general, in studies analysing mortality inequalities in

terms of some covariate, the covariate is included in the

model either as a fixed factor continuous variable, or it is

incorporated divided into a small number of groups (such

as quintiles). In the first case we assume a linear relation-

ship between mortality and the covariate, when in fact we

have no guarantees that this is so. In the second case, by

categorising the variable, groups are formed which contain

a large number of areas, and in consequence much infor-

mation about the true association between the original

covariate and the variable under study is lost. It should be

noted that by using the proposed SANOVA model, these

situations are avoided since the covariate is divided into a

larger number of groups than is usually used and they are

modeled with flexible non-parametric tools.

Acknowledgments We wish to thank Carme Borrell and one

reviewer for their valuable suggestions on the article. We also wish to

thank Prof. Havard Rue for his support with the R-INLA imple-

mentation of the model proposed in this study.

Funding This article has been partially funded by: ‘‘Plan Nacional

de I?D?I 2008–2011’’ and the ‘‘ISCIII –Subdireccion General de

Evaluacion y Fomento de la Investigacion-’’ (Project number

PI081488); the project INEQ-CITIES, ‘‘Socioeconomic inequalities

in mortality: evidence and policies of cities of Europe’’, project

funded by the Executive Agency for Health and Consumers (Com-

mission of the European Union), Project No. 2008 12 13; and by

CIBER Epidemiologıa y Salud Publica (CIBERESP), Spain.

Appendix

In what follows we explain briefly the general ideas

involved in the Integrated nested Laplace approximations

(INLA) inference method, the approach used in this paper

to approximate the marginal posterior density of the

parameters in our multivariate ecological regression model.

Consider a three-stage Bayesian hierarchical model

based on an observation model p yjxð Þ ¼Q

i p yijxið Þ, a

parameter model p xjhð Þ, and a hyperprior p hð Þ. Here y ¼y1; . . .; ynð Þ denotes the observed data, x are unknown

parameters which typically follow a Gaussian Markov

random field (GMRF) and h are unknown hyperparameters.

The marginal posterior density of xi can be written as

p xijyð Þ ¼Z

h

p xijh; yð Þp hjyð Þdh

and can be approximated by the finite sum

~p xijyð Þ ¼X

k

~p xijhk; yð Þ~p hkjyð ÞDk

Integrated nested Laplace approximations (INLA)

approximates these marginal posterior densities using an

approximation ~p xijh; yð Þ of p xijh; yð Þ and an additional

approximation ~p hjyð Þ of the marginal posterior density of

the hyperparameters p hjyð Þ. Dk are chosen appropriately.

To approximate p hjyð Þ we consider:

p x; h; yð Þ ¼ p xjh; yð Þ � p hjyð Þ � p yð Þ;

therefore for any x

p hjyð Þ / p x; h; yð Þp xjh; yð Þ

INLA approximates p hjyð Þ using a Laplace

approximation (Tierney and Kadane 1986):

~p hjyð Þ / ~p x; h; yð Þ~pG xjh; yð Þ jx¼x� hð Þ

where the denominator ~pG xjh; yð Þ is the Gaussian approx-

imation of p xjh; yð Þ and x� hð Þ is the mode of the full

conditional p xjh; yð Þ (Rue and Held 2005; Rue et al. 2009).

Posterior marginals for the hyperparameters ~p hjyð Þ could

be obtained using two strategies. The first, more accurate

but also time consuming, consists of defining a grid of

points covering the area where most of the mass of ~p hjyð Þis located (GRID strategy). The second one, named central

composite design (CCD strategy), consists in laying out a

small amount of ‘‘points’’ in a m-dimensional space in

order to estimate the curvature of ~p hjyð Þ (Rue et al. 2009).

In this paper the CCD strategy has been used as it needs

much less computational time and the differences between

the CCD and GRID strategies are expected to be minor.

Moreover, the CCD strategy is recommended for problems

with high dimensionality of the hyperparameter vector h.

Secondly, in order to approximate ~p xijh; yð Þ, a Gaussian

approximation ~pG xijh; yð Þ ¼ N xi; li hð Þr2i hð Þ

� �could be

used. However, the approximation can be improved using

Laplace approximation or a simplified Laplace approxi-

mation based on the skew-normal distribution (Rue et al.

2009). In this paper the Laplace approximation has been

used (Tierney and Kadane 1986). This is the most time


123

consuming approximation but on the other hand is the

most accurate.

References

Banerjee S, Wall MM, Carlin BP (2003) Frailty modeling for spatially

correlated survival data, with application to infant mortality in

Minnesota. Biostatistics (Oxford, England) 4:123–142

Barcelo MA, Saez M, Cano-Serral G, Martinez-Beneito MA,

Martinez JM, Borrell C, Ocana-Riola R, Montoya I, Calvo M,

Lopez-Abente G, Rodriguez-Sanz M, Toro S, Alcala JT, Saurina

C, Sanchez-Villegas P, Figueiras A (2008) Metodos para la

suavizacion de indicadores de mortalidad: aplicacion al analisis

de desigualdades en mortalidad en ciudades del Estado espanol

(Proyecto MEDEA). Gaceta sanitaria/S.E.S.P.A.S 22:596–608

Besag J, York J, Mollie A (1991) Bayesian image restoration, with

two applications in spatial statistics. Ann Inst Stat Math 43:1–59

Borrell C, Marı-Dell’Olmo M, Serral G, Martinez-Beneito MA,

Gotsens M, Other MEDEA members. (2010) Inequalities in

mortality in small areas of eleven Spanish cities (the multicenter

MEDEA project). Health & Place 16:703–711

Botella-Rocamora P, Lopez-Quılez A, Martinez-Beneito MA (2013)

Spatial moving average risk smoothing. Stat Med 32:2595–2612

Clayton DG, Bernardinelli L, Montomoli C (1993) Spatial correlation

in ecological analysis. Int J Epidemiol 22:1193–1202

Dominguez-Berjon MF, Borrell C, Cano-Serral G, Esnaola S, Nolasco A,

Pasarin MI, Ramis R, Saurina C, Escolar-Pujolar A (2008)

Constructing a deprivation index based on census data in large

Spanish cities (the MEDEA project). Gaceta sanitaria/S.E.S.P.A.S

22:179–187

Gelman A (2006) Prior distributions for variance parameters in

hierarchical models. Bayesian Anal 1:515–533

Hodges JS, Reich BJ (2010) Adding spatially-correlated errors can

mess up the fixed effect you love. Am Stat 64:325–334

Hodges JS, Cui Y, Sargent DJ, Carlin BP (2007) Smoothing balanced

single-error-term analysis of variance. Technometrics 49:12–25

Hughes J, Haran M (2013) Dimension reduction and alleviation of

confounding for spatial generalized linear mixed models. J R

Stat Soc 75:139–159

Knorr-Held L, Best NG (2001) A shared component model for

detecting joint and selective clustering of two diseases. J R Stat

Soc 164:73–85

Liu X, Wall MM, Hodges JS (2005) Generalized spatial structural

equation models. Biostatistics (Oxford, England) 6:539–557

Mardia KV (1988) Multi-dimensional multivariate Gaussian Markov

random fields with application to image processing. J Multivar

Anal 24:265–284

Marı-Dell’Olmo M, Martinez-Beneito MA, Borrell C, Zurriaga O,

Nolasco A, Dominguez-Berjon MF (2011) Bayesian factor

analysis to calculate a deprivation index and its uncertainty.

Epidemiology (Cambridge, Mass) 22:356–364

Munoz F, Pennino MG, Conesa D, Lopez-Quı-lez A, Bellido J (2013)

Estimation and prediction of the spatial occurrence of fish

species using Bayesian latent Gaussian models. Stoch Environ

Res Risk Assess 27:1171–1180

R Development Core Team (2012). R: a language and environment

for statistical computing

Reich BJ, Hodges JS, Zadnik V (2006) Effects of residual smoothing

on the posterior of the fixed effects in disease-mapping models.

Biometrics 62:1197–1206

Rue H, Held L (2005) Gaussian Markov random fields: theory and

applications (Chapman & Hall/CRC monographs on statistics &

applied probability): Chapman and Hall/CRC, London

Rue H, Martino S (2009) INLA: functions which allow to perform a

full Bayesian analysis of structured additive models using

integrated nested laplace approximaxion. R package version 0.0

Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference

for latent Gaussian models by using integrated nested Laplace

approximations. J R Stat Soc 71:319–392

Schrodle B, Held L (2011a) A primer on disease mapping and

ecological regression using INLA. Comput Stat 26:241–258

Schrodle B, Held L (2011b) Spatio-temporal disease mapping using

INLA. Environmetrics 22(6):725–734

Tierney L, Kadane JB (1986) Accurate approximations for posterior

moments and marginal densities. J Am Stat Assoc 81:82–86

Tzala E, Best N (2008) Bayesian latent variable modelling of

multivariate spatio-temporal variation in cancer mortality. Stat

Methods Med Res 17:97–118

Ugarte M, Ibanez B, Militino A (2005) Detection of spatial variation

in risk when using CAR models for smoothing relative risks.

Stoch Environ Res Risk Assess 19:33–40

Wang F, Wall MM (2003) Generalized common spatial factor model.

Biostatistics (Oxford, England) 4:569–582

Zhang Y, Hodges JS, Banerjee S (2009) Smoothed ANOVA with

spatial effects as a competitor to MCAR in multivariate spatial

smoothing. Ann Appl Stat 3:1805


123

Documents

A smoothed ANOVA model for multivariate ecological regression