Upload
laia
View
213
Download
0
Embed Size (px)
Citation preview
ORIGINAL PAPER
A smoothed ANOVA model for multivariate ecological regression
Marc Marı-Dell’Olmo • Miguel A. Martinez-Beneito •
Merce Gotsens • Laia Palencia
Published online: 18 August 2013
� Springer-Verlag Berlin Heidelberg 2013
Abstract Smoothed analysis of variance (SANOVA) has
recently been proposed for carrying out disease mapping.
The main advantage of this approach is its conceptual
simplicity and ease of interpretation. Moreover, it allows us
to fix the combination of diseases of particular interest in
advance and to make specific inferences about them. In this
paper we propose a reformulation of SANOVA in the
context of ecological regression studies. This proposal
considers the introduction in a non-parametric way of one
(or several) covariate(s) into the model, explaining some
pre-specified combinations of the outcome variables. In
addition, random effects are also incorporated in order to
model geographical variation in the combinations of
outcome variables not explained by the covariate. Lastly,
the model permits the decomposition of the variance in the
set of outcome variables into different orthogonal compo-
nents, quantifying the contribution of every one of them.
The proposed model is applied to the geographical analysis
of mortality due to malignant stomach neoplasm among
women resident in the city of Barcelona (Spain). The
available outcome variables are deaths grouped into two
time periods, and a socioeconomic deprivation index is
included as a covariate. The model has been implemented
through INLA, a novel inference tool for Bayesian
statistics.
Keywords Disease mapping � Smoothed analysis of
variance � Multivariate analysis � Mortality �Socioeconomic factors � Ecological regression
1 Introduction
There is a growing interest among public health profes-
sionals in the study of health indicators at a small-area
level. Small area studies permit a deeper understanding to
be gained of geographical patterns and have proved to be
essential in uncovering local-level inequalities often
masked by health estimates from larger areas (Borrell et al.
2010). However, statistical analysis in this context is far
from trivial due to the small amount of information on
which every available observation is based (Barcelo et al.
2008).
The statistical models employed in these studies usually
take into account the possible spatial dependence of the
data. The assumption of dependence between neighbouring
sites improves the estimates of indicators as they are based
on more information, namely that of each geographical unit
This article has been included in the doctoral thesis of one of the
authors (Marc Marı-Dell’Olmo), being carried out at Pompeu Fabra
University.
Electronic supplementary material The online version of thisarticle (doi:10.1007/s00477-013-0782-2) contains supplementarymaterial, which is available to authorized users.
M. Marı-Dell’Olmo � M. A. Martinez-Beneito (&) �M. Gotsens � L. Palencia
CIBER Epidemiologıa y Salud Publica (CIBERESP), Madrid,
Spain
e-mail: [email protected]
M. Marı-Dell’Olmo � M. Gotsens � L. Palencia
Agencia de Salut Publica de Barcelona, Barcelona, Spain
M. Marı-Dell’Olmo � M. Gotsens � L. Palencia
Institut d’Investigacio Biomedica (IIB Sant Pau), Barcelona,
Spain
M. A. Martinez-Beneito
Centro Superior de Investigacion en Salud Publica CSISP-
FISABIO, Av. Cataluna, 21., 46020 Valencia, Spain
123
Stoch Environ Res Risk Assess (2014) 28:695–706
DOI 10.1007/s00477-013-0782-2
and its neighbours. One of the most widely-used models for
the analysis of small areas is that of Besag, York and
Mollie (BYM) (Besag et al. 1991). Their model includes
two random effects: one takes into account the spatial
dependence between nearby units, due to the existence of
risk factors common to several geographical units; and the
other, spatially heterogeneous, models the existence of risk
factors which only affect particular geographical units and
which make their risks palpably different to those of
neighbouring regions.
Recently, different methods capable of modelling a
multivariate outcome have appeared. These take into
account dependence among the different components of
such an outcome and the spatial dependence of the data.
These methods, for example, allow conducting joint anal-
yses of various potentially related causes of death, joint
analyses of a cause of death for different population groups
(such as different sexes or ethnic groups), etc. In what
follows, for the sake of simplicity, we will refer to the
problem of multivariate smoothing as the joint smoothing
of various diseases.
The vast majority of proposals for conducting studies of
this type are based on the multivariate generalisation of
conditional auto-regressive distributions (CAR) (Rue and
Held 2005) to more than one response considering
dependence between them. Mardia (1988) could be con-
sidered a particular relevant contribution within this
approach. Another approach to multivariate modelling is
that known as Spatial Factor Bayesian models (Tzala and
Best 2008; Marı-Dell’Olmo et al. 2011; Banerjee et al.
2003; Wang and Wall 2003; Liu et al. 2005), which
assumes a set of underlying geographical patterns to be
shared by every one of the diseases. Both these patterns and
their contribution to each of the diseases are estimated
within these models. In this way information is shared (in
the form of underlying patterns) between diseases and also
between different geographical units, since spatial depen-
dence is usually assumed (via a CAR prior) for the com-
mon patterns. Bayesian shared component models (Knorr-
Held and Best 2001) may be considered as a special case of
spatial factor Bayesian models. Recently, a novel approach
to multivariate disease mapping, known as smoothed
analysis of variance (SANOVA) (Hodges et al. 2007;
Zhang et al. 2009), has been proposed. The choice of one
or other of these proposals ultimately depends on the aims
pursued by the study, such as the estimation of one (or
several) factors common to the disease, the specific esti-
mation of the covariance structure between diseases, etc.
The main advantage of SANOVA in comparison to
other multivariate modelling approaches is its conceptual
simplicity and ease of interpretation. In general, the other
approaches lead to more complex models capable of
modelling the relationship between various health
indicators in more detail, but which are more complicated
to define, implement and interpret (Zhang et al. 2009).
Moreover, SANOVA allows the fixing in advance of which
factors (contrasts) are of particular interest in the study, and
the making of specific inferences about these particularly
relevant factors. Therefore, from an applied perspective,
SANOVA is of great interest.
Ecological regression studies try to establish associa-
tions between one or various covariates and one health
indicator, using aggregate sets of individuals as the unit of
analysis; for example, in terms of a geographical division.
However, it is not usual to find multivariate ecological
regression models in which various health indicators are
studied jointly, relating them to one or more covariates.
Such studies could increase the power of ecological
regression since the relationships between covariates and
health indicators would be based on more information.
Moreover, they would make it possible to study the rela-
tionship between the covariates and a combination of
indicators (for example, differences in mortality for two
diseases), instead of the raw relationship with each one of
the indicators separately.
This study proposes an adaptation of SANOVA to the
context of ecological regression models. This adaptation
will allow the fitting of multivariate ecological regression
models, with the specific interpretational advantages of
SANOVA models. This proposal will decompose the var-
iance of the set of outcomes into orthogonal components
related to each of the SANOVA contrasts, the covariate and
random effects.
The paper is organized as follows: Sect. 2 specifies the
adaptation of SANOVA to the ecological regression context
that we will propose. Section 3 illustrates its application to a
real case study: data on malignant stomach neoplasm
mortality among women in the city of Barcelona for the
periods 1996–2001 and 2002–2007. A brief discussion of
our proposal and its main results is given in Sect. 4.
2 Model specification
We begin this section by proposing an ecological regres-
sion model that will be very convenient for the subsequent
multivariate generalisation. Next we will describe the for-
mulation of SANOVA modeling in detail, and we will
propose a reformulation of it which will enable us to
employ, in general, any type of spatial dependence struc-
ture, in contrast to the original proposal which in practical
terms (due to computational limitations) only permits the
use of an Intrinsic CAR distribution to model spatial
dependence. Finally, we will show how the aforementioned
proposal for ecological regression and our formulation of
SANOVA can be joined into a single model.
696 Stoch Environ Res Risk Assess (2014) 28:695–706
123
Before launching into the rest of the section, let us
introduce some notation. In what follows, variables in
boldface will refer to vectors or matrices. These two pos-
sibilities will be distinguished by the number of subscripts
accompanying the variable. Also, for a matrix A we will
use A�j to denote its jth column and vec Að Þ to denote the
vector ðA0�1; . . .;A0�JÞ0. The expression diag(a) will be used
to denote the diagonal matrix whose elements are given by
the vector a, and in the case where the components of a are
matrices, diag(a) will denote the corresponding block-
diagonal matrix. Finally, we will denote the Kronecker
product of two matrices by �, so that if A is a matrix with
dimension IxJ, then:
A� B ¼a11B � � � a1JB
..
. . .. ..
.
aI1B � � � aIJB
0B@
1CA
2.1 A univariate ecological regression model
Let Oi be the number of observed cases of a disease in the
area i, Ei the number of expected cases, and Ci the value of
the covariate that we wish to study, also in this area. Our
proposed ecological regression model assumes:
Oijhi�PoissonðEihiÞ; i ¼ 1; . . .; I
logðhiÞ ¼ lþ f ðCiÞ þ Si; i ¼ 1; . . .; I
where l is the model intercept, f(C) describes the effect of
the covariate on the logarithm of risk (h), and S is a spa-
tially dependent random effect describing the geographical
variation of risk unexplained by the covariate.
Our intention is to avoid the use of parametric forms to
describe the relationship between the covariate and the
logarithm of risk, since such functions often turn out to be
too restrictive to describe certain types of relationship. Our
modelling proposal is as follows: for a continuous variable
C we categorise its values into discrete groups according to
some number of quantiles. Thus, to categorise C in K
groups, we calculate its quantiles as 100(k/K) % where
k = 1, …, K-1 (hereafter denoted by Q1, …, QK-1
respectively), and we define:
f ðCiÞ ¼
a1 if Ci\Q1
a2 if Q1\Ci\Q2
..
. ...
aK if QK�1\Ci
8>>><>>>:
Obviously, K will be an important parameter in this
process, since small values could mask important variations
in the relationship we are attempting to describe, whereas
large values could lead to poor estimates of a1, …, aK. With
the intention of limiting this latter effect, we will consider
that a = (a1, …, aK) are dependent values, given the ordinal
relation of its components. Specifically, we could propose,
for example, an Intrinsic CAR (Rue and Held 2005), as a
prior distribution of a, considering contiguous values of the
subscript to be neighbours. Thus the components of a will
share information among themselves and as a consequence
its estimates will be more stable and f(C) will define a smooth
relationship between C and h. Moreover, if C was an ordinal
variable, we could use the same prior distribution as defined
above. On the other hand, if C was a nominal variable,
perhaps it would be more appropriate to use an independent
prior distribution for a.
Obviously, other modeling alternatives could be pro-
posed for f(C). For example a CAR distribution of
higher order, taking into account more distant neigh-
bours, could be proposed as a prior distribution of a(Rue and Held 2005). This proposal would yield
smoother relationships between the covariate and the log-
relative-risks. Spline modeling, for example, could also
be an alternative for modeling f(C) or even by means of
the sum of two (independent and dependent) random
effects (Besag et al. 1991). Nevertheless, although we
acknowledge that the specific modeling of f(C) is a
question of interest for the model, it is not a key
question for the main goal of this paper, which is how to
carry out multivariate ecological regression and how to
decompose the variance of the diseases under study.
Thus although we do not put special emphasis on this
question we would like to mention that other modeling
alternatives are also be possible.
The random effect S will model the geographical vari-
ations of the risk which are not linked to the covariate, that
is, those that cannot be described through f(C). S will have
a spatial structure reflecting the geographical nature of its
values. Indeed, we may consider S to be the sum of two
effects, one with an Intrinsic CAR prior distribution and the
other with independent values, following the reasoning
proposed by (Besag et al. 1991). In what follows we will
consider this spatial structure, since it is the form of
modelling most commonly used in disease mapping stud-
ies, even though other alternatives could be considered.
If the geographical distribution of C shows a spatially
dependent structure, it may be that f(C) and S compete to
explain the same pattern (Clayton et al. 1993). This pattern
would, at the same time, be related to the covariate and
would show spatial dependence and could therefore be
attributed to either of these two factors. In order to avoid
this circumstance we will restrict the random effect S to
being orthogonal to the effect of the covariate f(C), since
this way we guarantee that S explains the variability
unexplained by C, which was our intention. Computa-
tionally this restriction is fairly complex since the values of
a that define f(C) are unknown and hence the restriction on
S is not clearly determined, rather it is stochastic. A more
Stoch Environ Res Risk Assess (2014) 28:695–706 697
123
convenient computational alternative is to restrict the val-
ues of S through the following set of conditions:
Xfi:Qk\Ci\Qkþ1g
Si ¼ 0 : k ¼ 1; . . .;K
8<:
9=;
In other words, for each one of the groups that we have
used to define f(C), we require the sum of the
corresponding random effects S to be 0. It can be shown
that this set of restrictions leads to the orthogonality of
f(C) and S, independently of the values taken by the vector
a. Although this set of conditions is more restrictive than
merely imposing orthogonality between f(C) and S (we are
imposing K restrictions instead of one), S will still enjoy
considerable flexibility for the geographical description of
the variability of risk, since the number of values taken by
S (the number of units under study) will usually be much
greater than the number of restrictions imposed (the
number of quantiles chosen). Moreover, the linear space
to which S has been restricted is orthogonal to the linear
space to which f(C) belongs, and therefore the variability
not explained by one of these components will be
explained by the other, except possibly that due to the
shrinking out of random effects. As these spaces are
orthogonal, we avoid the two factors competing to explain
the same effect.
According to this proposal, the logarithms of the risks
are projected into three mutually orthogonal spaces of
dimensions 1, K-1 and I-K, respectively. Such orthogo-
nality will have interesting properties since it will allow us
to decompose the variability of the geographical pattern of
risk as variability associated to the covariate and variability
associated to other factors not linked to the covariate.
2.2 Multivariate disease mapping via SANOVA
We begin this section by defining the problem of multi-
variate disease mapping and the notation to be used. This
time, we will call Oij, i = 1, …, I, j = 1, …, J the number
of observed cases of disease j, in the ith geographic unit.
Similarly we will use Eij for the number of expected cases
for unit i and disease j. In the multivariate case, most
disease mapping proposals assume, as we also do, that:
Oijjhij�PoissonðEijhijÞ; i ¼ 1; . . .; I; j ¼ 1; . . .; J
One of the most interesting approaches to multivariate
disease mapping is that known as Smoothed ANOVA
(Zhang et al. 2009). According to that approach a series of
orthogonal contrasts (in the sense of analysis of variance)
between diseases is considered, i.e. a set of linear
combinations of the diseases we wish to model.
SANOVA proposes the modeling of every one of these
contrasts through spatially structured random effects
instead of employing such effects for the independent
modeling of every one of the diseases. Therefore, spatial
dependence is assumed for certain specific linear
combinations of the risks which are of particular interest
for the research. This idea turns out to be very reasonable
since often the spatial pattern is not specific to each one of
the diseases, but rather is due to the distribution of some
risk factors which may influence more than one disease. In
such cases it is more natural to attribute spatial dependence
to certain combinations of the risks for some diseases, and
therefore multivariate modelling with SANOVA may be
particularly appropriate. Moreover, SANOVA allows us to
make inferences about these specific contrasts which are
expected to be of epidemiological interest. The main
problem with SANOVA is precisely the choice of these
contrasts, since the degree of benefit obtainable through
fitting these models will depend mainly on how appropriate
this choice is, and often there are no clear criteria to help
choose them.
SANOVA may be mathematically formulated as fol-
lows: let H be a matrix of dimensions J 9 J whose col-
umns contain the coefficients of a series of contrasts
between diseases. In other words, the columns of H contain
the coefficients of the linear combinations of the diseases
we are going to model through spatially structured random
effects. H is by definition orthogonal, so that H0H = IJ.
We have to note that H will usually include a column that
is proportional to 1J, i.e. following Zhang et al.’s notation
H = [1J|H(-)], where H(-) is an J 9 (J - 1) matrix whose
columns are contrasts among diseases. This column is
usually used to model the common geographical patterns of
all the diseases studied. It is not really a contrast since it
does not make any comparison between diseases, never-
theless, in order to simply the notation, we will refer to the
columns of H as contrasts in general.
In the original proposal by Zhang et al. (2009),
SANOVA was used to model the logarithm of risks for a
set of diseases as:
lnðvecðhÞÞ ¼ ðH� VÞ � vecðuÞ
where V is the matrix containing the eigenvectors of the
precision matrix common to the columns of h and
conferring a spatial structure on them. u is a matrix of
random effects with dimension I 9 J, the prior distribution
for its columns being given by:
u�j�Nð0; ðsjDÞ�1Þ; j ¼ 1; . . .; J
where D is the diagonal matrix with the corresponding
eigenvalues of V.
This formulation corresponds to the original proposal
for the SANOVA model. In summary, that proposal con-
siders h to be a double transformation, i.e., by H and V, of
698 Stoch Environ Res Risk Assess (2014) 28:695–706
123
Gaussian independent random effects u. The first of these
transformations induces dependence among diseases
through the modeling of specific contrasts defined by
matrix H. The second one was motivated by the desire to
use an Intrinsic CAR distribution to spatially smooth these
contrasts. However, the second of these transformations
could be skipped, leading to a more flexible (and compu-
tationally convenient) class of models, which we will
introduce shortly. In more detail, the expression of the log-
risks in the SANOVA model can also be written as:
lnðvecðhÞÞ ¼ ðH� VÞ � vecðuÞ¼ ðH� IIÞ � ðIJ � VÞ � vecðuÞ¼ ðH� IIÞ � vecðwÞ
so that
vecðwÞ ¼ ðH0 � IIÞ � lnðvecðhÞÞ
that is, w models the contrasts defined by H and besides
vec(w) has variance–covariance matrix given by:
ðIJ � VÞdiagðfðsjDÞ�1gJj¼1ÞðIJ � VÞ0
¼ diagðfVðsjDÞ�1V0gJj¼1Þ ¼ diagðfðsjVDV0Þ�1gJ
j¼1Þ¼ diagðfðsjQÞ�1gJ
j¼1Þ:
In other words, the variance–covariance matrix of each
one of the columns in w will be proportional to Q-1, the
covariance matrix corresponding to the spatial dependence
structure we have assumed. Summarising, the SANOVA
model may be formulated either as proposed in the original
article by (Zhang et al. 2009), or by considering a series of
spatially dependent patterns (each one of the columns in
w), and combining such patterns according to the contrasts
defined in matrix H. Furthermore, such combinations may
also be expressed as follows:
ðH� IIÞ � vecðwÞ ¼H11II � � � H1JII
..
. . .. ..
.
HJ1II � � � HJJII
0B@
1CA
w�1...
w�J
0B@
1CA
¼H11w�1 þ � � � þ H1Jw�J
..
.
HJ1w�1 þ . . .þ HJJw�J
0B@
1CA
¼ H�1 � w�1 þ . . .þH�J � w�J :
Therefore, we can reformulate the SANOVA model as the
sum of the Kronecker products between each one of the
contrasts contained in H and the corresponding vectors of
(spatially dependent) random effects. This decomposition will
be very useful for our proposed ecological regression model.
Finally, we would like to stress that the reformulation of
SANOVA as it appears in this latter expression permits the
consideration of any distribution for the columns of w, and
not necessarily an Intrinsic CAR distribution as used in the
original proposal by Zhang et al. Thus, it is possible to
assume, for example, that these vectors undergo a convo-
lution process of two random effects, as Besag et al. pro-
pose (Besag et al. 1991) or any spatial process that we
might want to fit (Ugarte et al. 2005; Botella-Rocamora
et al. 2013). Therefore, the reformulation which we have
introduced allows the generalisation of the SANOVA
model to a broader collection of spatial processes, while at
the same time allowing the use of conventional Bayesian
inference software (INLA or WinBUGS for example) for
the estimation of these models.
2.3 Multivariate ecological regression using SANOVA
So far we have proposed an ecological regression model
which we consider convenient, since it allows the decom-
position of the variance into one part explained by the
covariate and another part not explained by it. We have
also introduced SANOVA as a multivariate disease map-
ping tool, based on the decomposition of variance within
the contrasts we wish to study. Our intention now is to
propose an ecological regression model, based on
SANOVA, which will simultaneously permit the sharing of
information between diseases while allowing us to
decompose the total variance of all diseases as a function of
the different elements considered in the study.
As we have already mentioned, our reformulation of
SANOVA allows us to express these models as the sum of
Kronecker products of the contrasts which we consider
opportune and the corresponding random effects. More-
over, these effects can take the dependence structure con-
sidered most appropriate, and thus our proposal for
multivariate ecological regression consists in making these
effects (w�1,…,w�J) depend on the covariate we wish to
study. Specifically, if we want to study the effect of
covariate C on the set of log-risks h, we would do so as
follows:
lnðvecðhÞÞ ¼ H�1 � w�1 þ � � � þH�J � w�J
w�1 ¼ l1 � 1þ f1ðCÞ þ S�1
� � �w�J ¼ lJ � 1þ fJðCÞ þ S�J
Thus, w�1,…,w�J model the spatial pattern of each one of
the comparisons of risks between diseases defined by
H�1,…,H�J, respectively. In turn, these will be made to
depend, in a flexible way, on the covariate under study.
Therefore it will be possible to determine and, if it exists,
describe the association between these two factors. Thus as
an added advantage of the multivariate modelling we could
establish the relationship between the covariate and certain
specific combinations of diseases, instead of establishing
the relationship between the covariate and each one of the
diseases independently, as is the case in univariate studies.
Stoch Environ Res Risk Assess (2014) 28:695–706 699
123
The functions f1(), …, fJ() are defined as described
above for the univariate ecological regression model. Thus
these functions will be based on the same partitioning of
the variable C, and will only differ in the value of the
vector a for each one of them. On the other hand, the
vectors of random effects S�1, …, S�J will usually have a
spatially structured distribution (not necessarily an Intrinsic
CAR distribution). Furthermore, these vectors will be
restricted to the orthogonal subspace generated by the
design matrix of the functions f(C), as we proposed above
for the univariate ecological regression model.
As outlined at the start of this section, the terms making
up w�j, for any j, decompose the spatial pattern of the
corresponding contrast in terms of three orthogonal factors
(intercept, covariate effect, and other factors). Thus with
our proposal the set of risks of the J diseases under study is
decomposed into 3J components. Moreover, the compo-
nents corresponding to different contrasts are also mutually
orthogonal, since if we take two components corresponding
to different contrasts we have:
ðH�i � aÞ0 � ðH�j � bÞ ¼ H1ia0 . . . HJia
0ð ÞH1jb
..
.
HJjb
0B@
1CA
¼XJ
k¼1
HkiHkj
!� a0bð Þ
where a and b are any two of the components making up
w�i and w�j, respectively. The first of the two factors on the
right of the above expression will be 0, since the columns
of H are orthogonal. Therefore, the proposed model leads
to a decomposition of the variance of the J patterns studied
into 3J components, all mutually orthogonal. J of these
components, those corresponding to the intercept terms of
each one of the contrasts, are projected into one-dimen-
sional spaces; J of them, those corresponding to the func-
tions f1(C), …, fJ(C), are projected into (K-1)-dimensional
spaces; and the last J of them into (I-K)-dimensional
spaces. Therefore the proposed model, apart from carrying
out an ecological regression, structures and allows the
quantification of the variability attributable to each one of
these components.
3 A real case study
We are going to apply the proposal above to a real problem
in order to illustrate its potential and to show its ease of
interpretation. We are going to study the geographical
distribution of malignant neoplasm of stomach mortality
(ICD-9: 151; ICD-10: C16) in women residing in the city
of Barcelona (Spain). Deaths were grouped into two peri-
ods (1996–2001 and 2002–2007), which will be the two
outcomes we want to model. The expected cases were
calculated by indirect standardization, taking as reference
the mortality rates of the city by age (in 5 years age-
groups) for the first period (1996–2001). The geographical
units used for analysis were the 1491 census tracts of the
city according to the 2001 census. We included a socio-
economic deprivation index as a covariate to explain
mortality and it was calculated through a principal com-
ponents analysis on socioeconomic variables, based on the
methodology described by Domınguez-Berjon et al.
(2008). In our case we want to model two outcomes (the
number of study periods considered), and so our proposal
assumes:
lnðvecðhÞÞ ¼ H�1 � w�1 þH�2 � w�2
w�1 ¼ l1 � 1þ f1ðCÞ þ S�1
w�2 ¼ l2 � 1þ f2ðCÞ þ S�2
Matrix H will contain the contrasts used to model the
matrix of risks. In our case we consider it of particular
interest to determine the pattern common to both periods,
as well as the evolution of risks, for each census tract,
between them. These two patterns will tell us about the
distribution of the risk factors which remain constant over
time in the study region and those which have undergone
changes over time. These two patterns will correspond to
the first and second columns, respectively, of the matrix of
contrasts H ¼ 2�0:5 1 �1
1 1
� �.
Each one of these patterns is divided, through the
decomposition of w�1 and w�2, into three components:
intercept, covariate effect, and variability not attributable
to the covariate. The intercept, or average value, of the
vectors w�1 and w�2 is modelled using l1 and l2 respec-
tively. In order to model the effect of the deprivation index
C on the two contrasts mentioned above in a non-para-
metric way, we have categorized the covariate into 20
quantile-groups (K = 20), and decomposed w�1 and w�2 as
functions of f1(C) and f2(C), respectively. The variability in
each of the contrasts not explained by deprivation is
modelled using S�1, and S�2. S�1 explains the variability in
risk common to both periods which cannot be explained by
the deprivation index, whereas S�2 explains the geograph-
ical variability of evolution over time of the risks which
cannot be attributed to deprivation. Each one of these
random effects has been decomposed in turn as the sum of
two random effects: S�1 = U�1 ? V�1 and S�2 = U�2 ? V�2,
following the proposal by Besag et al. (1991), with U�1 and
U�2 being two Intrinsic CAR spatial random effects and V�1,
V�2 being two heterogenous Gaussian random effects. For
the Intrinsic CAR spatial random effects, we have used an
adjacency matrix defined considering that two areas are
neighbours if they share a common boundary.
700 Stoch Environ Res Risk Assess (2014) 28:695–706
123
Independent zero-mean Gaussian prior distributions,
with precision 0.0001, are assigned to both l1 and l2. We
have considered that the vectors a1 and a2 determining
f1(C) and f2(C), respectively, follow a first-order random-
walk process with unknown variances r2f1
and r2f2
. An
Intrinsic-CAR prior distribution was assigned to the spatial
effects U�1 and U�2 (Besag et al. 1991), with unknown
variances r2U1 and r2
U2, respectively. Finally, the hetero-
geneous effects V�1 and V�2 were assumed to have unknown
variances r2V1 and r2
V2, respectively. A half-normal distri-
bution with mean 0 and precision 0.0001 was assigned to
the standard deviations rf1 ; rf2 ; rU1; rU2; rV1 and rV2
(Gelman 2006).
Inference in complex Bayesian hierarchical models is
usually carried out by means of Markov Chain Monte Carlo
(MCMC) simulation, but these techniques are usually very
time-consuming in large multivariate disease mapping
problems. For this reason, we have used the Integrated
Nested Laplace Approximations (INLA) (Rue and Martino
2009; Rue et al. 2009; Schrodle and Held 2010, 2011)
approach as an alternative to MCMC simulation (see
Appendix). INLA has been implemented within the R soft-
ware (R Development Core Team 2012). Moreover, INLA
allows the above-mentioned sum-to-zero restrictions on
f1ðCÞ and f2ðCÞ to be incorporated in a fairly straightforward
way, therefore making this a very convenient tool for car-
rying out inference in our model. Another illustrative
example of specification and incorporation of linear con-
straints with INLA may be found in Schrodle and Held
(2010), Sects. 3.2 and 3.4. For all these reasons INLA has
experienced an increase in popularity and it is commonly
used in spatial data analysis (Munoz et al. 2012). The R code
to perform the model can be found in the Appendix.
4 Results
Figure 1 presents the association between deprivation and
mortality as a series of graphs. Figure 1a shows the associ-
ation between deprivation and the pattern of risk common to
both periods. Figure 1b shows the association between
deprivation and the evolution of mortality over time, in this
case positive values in the relationship indicate an increase in
risk over time for the corresponding deprivation level.
Finally, Fig. 1c, d represent the association existing between
mortality and deprivation in the first and second periods,
respectively; these associations are just specific combina-
tions of the two preceding graphs defined by the matrix of
contrasts H. The associations are represented in terms of
relative risks, the continuous lines representing the posterior
means and the dotted lines their 95 % credible intervals.
Figure 1a shows that the pattern of mortality common to
both periods appears to be associated with deprivation,
with areas of less deprivation (on the left) presenting lower
mortality than the average for the city. Figure 1b shows a
decline over time of mortality in the most deprived areas,
although this decline is not so clear in statistical terms as
the relationship in Fig. 1a. The combination of these two
factors allows us to estimate the relationship between
deprivation and mortality in the two periods studied
(Fig. 1c, d). Thus, Fig. 1c suggests a positive and
approximately linear association between deprivation and
mortality in the first period. In addition, it may be seen that
the credible intervals estimated for the areas of least
deprivation are below 1, while the areas with greater
deprivation have credible intervals above 1. However, in
Fig. 1d we observe that in the second period, the associa-
tion between deprivation and mortality has been modified,
becoming less steep. In this period the association almost
disappears since the credible intervals of relative risks
include the value 1 for all levels of deprivation. This result
indicates a reduction over time of inequalities associated
with deprivation. This is particularly interesting since
despite the fact that the proposed model considers the two
periods as dependent (in fact there is a strong common
pattern), this fact does not prevent us finding appreciable
differences between the two periods.
Figure 2 shows a choropleth map of the socio-economic
deprivation index by septiles. In this figure, the areas
shown in dark brown (respectively, dark green) represent
the census tracts with most (respectively, least) depriva-
tion. The map suggests a high degree of spatial correlation
in this covariate, with the west of the city having fewer
depressed areas, while the eastern (city centre and old
quarter of the city), northeastern and northern areas present
greater deprivation.
SANOVA allows us to perform different decomposi-
tions of the geographical patterns of mortality for different
contrasts in the time periods, which will allow us to analyse
in more detail not only the association of mortality with
deprivation, but also the effect of time. In order to illustrate
this, Fig. 3 presents septile maps of some of these
decompositions. We have used the same divergent color
scheme pattern as in Fig. 2, where greens represent areas
with less mortality, while those with greater mortality are
represented with browns. In order to be able to compare
values between the different maps, the areas have been
coloured using fixed cut-off values.
Figure 3a presents the distribution of smoothed Stand-
ardised Mortality Ratios (sSMR) for the first period. In
order to develop this map we have made use of the mor-
tality common to both periods 2�0:5ðf1ðCÞ þ S�1Þ and the
effect of time which induces geographical differences in
Stoch Environ Res Risk Assess (2014) 28:695–706 701
123
the risk between the two periods 2�0:5ðf2ðCÞ þ S�2). Spe-
cifically, these are combined according to the values of H
and consequently the sSMR values for the first period are
given by sSMR ¼ 100 � exp 2�0:5ð f1 Cð Þ þ S�1ð Þ��
f2 Cð Þ þ S�2ð ÞÞÞ. The same procedure was used for the
sSMR values of the second period (Fig. 3b). The two maps
present a similar spatial distribution, allowing us to
appreciate the considerable advantages of the joint mod-
elling we have performed. However, despite the two pat-
terns having been modelled in a dependent manner, we
may appreciate slight differences between them, for
example it is particularly notable that in the second period
there is less spatial variability than in the first. This seems
to reflect a reduction in geographical inequalities.
Figure 3c is a geographical representation of the asso-
ciation found between deprivation and mortality common
to both periods, in other words (the posterior mean of)
100 � expð2�0:5f1ðCÞÞ. This figure basically displays the
relationship shown in Fig. 1a geographically. Figure 3d
shows the effect of deprivation on the temporal evolution
of mortality, in other words (the posterior mean of)
100 � expð2�0:5f2ðCÞÞ. Once again, this figure corresponds
to Fig. 1b and the green census tracts on the map corre-
spond to areas where mortality has declined over time.
Comparing the two figures, we may observe that a large
amount of the geographical variability found is common to
the two periods. Even so, it may also be appreciated that
those census tracts with the highest global risk for the two
periods studied are the ones which present the greatest
decline over time.
Figure 3e presents the distribution of mortality common
to the two periods which cannot be associated with depri-
vation, basically because there is no correlation with the
distribution of this factor, in other words (the posterior
Fig. 1 Relative risk (and 95 %
credible intervals) death due 10
malignant neoplasm of the
stomach as a function of the
socio-economic deprivation
index. a Association between
deprevation and the pattern of
mortality common to the two
periods. b Association between
deprivation and the evolution or
mortality over time.
c Association between
deprivation and mortality in the
first period. d Association
between deprivation and
mortality in the second period
702 Stoch Environ Res Risk Assess (2014) 28:695–706
123
mean of) 100 � expð2�0:5S�1Þ. Finally, Fig. 3f presents the
evolution over time of mortality which cannot be attributed
to deprivation, since both periods have different spatial
patterns, i.e. (the posterior mean of) 100 � expð2�0:5S�2Þ. It
may be seen in Fig. 3e that once the effect of deprivation
has been eliminated, the mortality common to the two
periods still presents some spatial correlation. Specifically,
two clear spatial clusters of low mortality (green) may be
observed in the north and south of the city centre area, and
a spatial cluster of high mortality (brown) in the southwest
of the city. These clusters cannot be explained by depri-
vation. In contrast, Fig. 3f indicates that after removing the
effect of deprivation in the second period, mortality in
general declined all over the city except for the coastal area
(southeast) where it remained stable (white areas). Once
again it may be seen that the spatial variability associated
with factors other than deprivation is greater in the pattern
common to the two periods than in the evolution of mor-
tality over time, and therefore the risk factor responsible
for this pattern has mainly had a persistent presence over
time.
As described in Sect. 2, all the components of w�1 and w�2are mutually orthogonal. This fact permits the decomposi-
tion of the variance of the logarithms of risk (lnðvec hð ÞÞÞ as
the sum of the variances of all these components. Thus, we
have: Varðlnðvec hð ÞÞÞÞ ¼ VarðH�1 � ðl1 � 1ÞÞ þ VarðH�1 �f1ðCÞÞþVarðH�1 � S�1Þ þ VarðH�2 � ðl2 � 1ÞÞ þ VarðH�2�f2ðCÞÞ þ VarðH�2 � S�2Þ. The first of these terms is 0 since it
is the variance of a constant vector. The percentages of
variance explained by each of the remaining components are,
respectively: 41.59, 8.71, 43.48, 5.98 and 0.24 %. Thus, it is
a particularly significant fact that the factor explaining the
most variance (43.48 %) is the component associated with
l2 � 1, which reflects the evolution over time of the average
mortality in each period, indicating an important time trend
in mortality affecting all census tracts equally. It is also
noteworthy that 47.57 % (41.59 ? 5.98 %) of the variability
studied is associated with deprivation, whereas only 8.95 %
(8.71 ? 0.24 %) would be explained by the action of other
factors. It is also interesting to learn that 50.30 %
(41.59 ? 8.71 %) of the variability in risk is spatial vari-
ability common to the two periods, whereas changes over
time in the geographical patterns only explain 6.22 %
(5.98 ? 0.24 %) of the variance. In our opinion, results of
this kind are quite interesting from an applied point of
view.
5 Discussion
In this article we propose a reformulation of SANOVA
which allows us to study the association between one
covariate and various response variables jointly. This
reformulation of SANOVA allows us to assign any spa-
tially structured distribution to the random effects includ-
ing, as we have illustrated in our example, the popular
convolution of random effects by Besag et al. (1991).
In the context of ecological regression some authors have
described the problem of the possible existence of colin-
earity between the covariate and the random effects
(Clayton et al. 1993; Reich et al. 2006; Hodges and Reich
2010; Hughes and Haran 2013). This can represent a serious
problem since it may be masking the true association
between the outcome and the covariate, and is usually
resolved by imposing random effects to lie on the orthog-
onal space of the fixed effects design matrix. In the present
study we have decomposed the variability of the outcome
into two orthogonal components, one associated with the
covariate, and a second orthogonal with the first. These two
components, by construction, cannot compete to explain the
same effect through being orthogonal. However, the effect
of any covariate whose spatial distribution is correlated with
the distribution of the covariate under study will be attrib-
uted to the covariate under study. In this sense it is impor-
tant for the study to include all those covariates which could
induce variations in the risks being investigated.
As mentioned, orthogonality has been previously used to
avoid confounding between fixed and random effects. In
contrast, our main motivation for imposing orthogonality is
somewhat different, our aim being to allow the decomposition
of all variance of the diseases under study. Therefore
Fig. 2 Distribution of the deprivation index by septiles. Green areas
represent low socioeconomic deprivation and brown areas represent
high socioeconomic deprivation
Stoch Environ Res Risk Assess (2014) 28:695–706 703
123
orthogonality may achieve these two goals at the same time
and, as a consequence, papers published under one of these
approaches may be very useful for the other one. Nevertheless,
in our opinion, our approach has certain computational
advantages since it does not rely on any spectral decomposi-
tion of a matrix. Such decomposition is a problem when the
Fig. 3 SANOVA model:
a distribution of smoothed
standardised mortality ratios
(sSMR) for the first period,
b distribution of (sSMR) for the
second period, c accociation
between deprivation and the
pattern of mortality common to
the two periods, d association
between deprevation and the
evoluation of mortality over
time, e distribution of
mortality,which cannot be
associated with deprivation,
common to the two periods,
f evolution over time of
mortality which cannot be
explained by deprivation
704 Stoch Environ Res Risk Assess (2014) 28:695–706
123
matrix to be decomposed depends on one or more parameters,
such as the CAR-proper distribution or geostatistical models
(Hughes and Haran 2013). In this sense the present work may
help to make ecological regressions avoiding confounding
between fixed and random effects more popular.
The implementation of the model in this paper has been
undertaken in INLA. This may be considered a particular
strength of the paper since it does not require any specific
coding of the method beyond the coding of INLA itself. Any-
way, as mentioned above we provide the INLA routines used in
our real case study as supplementary material to the paper.
Nevertheless, we feel it necessary to mention that INLA is at the
time of writing still a novel inference tool not fully checked in
models as complex as ours. In this sense more extensive
checking of INLA in this kind of model will be welcomed.
In general, in studies analysing mortality inequalities in
terms of some covariate, the covariate is included in the
model either as a fixed factor continuous variable, or it is
incorporated divided into a small number of groups (such
as quintiles). In the first case we assume a linear relation-
ship between mortality and the covariate, when in fact we
have no guarantees that this is so. In the second case, by
categorising the variable, groups are formed which contain
a large number of areas, and in consequence much infor-
mation about the true association between the original
covariate and the variable under study is lost. It should be
noted that by using the proposed SANOVA model, these
situations are avoided since the covariate is divided into a
larger number of groups than is usually used and they are
modeled with flexible non-parametric tools.
Acknowledgments We wish to thank Carme Borrell and one
reviewer for their valuable suggestions on the article. We also wish to
thank Prof. Havard Rue for his support with the R-INLA imple-
mentation of the model proposed in this study.
Funding This article has been partially funded by: ‘‘Plan Nacional
de I?D?I 2008–2011’’ and the ‘‘ISCIII –Subdireccion General de
Evaluacion y Fomento de la Investigacion-’’ (Project number
PI081488); the project INEQ-CITIES, ‘‘Socioeconomic inequalities
in mortality: evidence and policies of cities of Europe’’, project
funded by the Executive Agency for Health and Consumers (Com-
mission of the European Union), Project No. 2008 12 13; and by
CIBER Epidemiologıa y Salud Publica (CIBERESP), Spain.
Appendix
In what follows we explain briefly the general ideas
involved in the Integrated nested Laplace approximations
(INLA) inference method, the approach used in this paper
to approximate the marginal posterior density of the
parameters in our multivariate ecological regression model.
Consider a three-stage Bayesian hierarchical model
based on an observation model p yjxð Þ ¼Q
i p yijxið Þ, a
parameter model p xjhð Þ, and a hyperprior p hð Þ. Here y ¼y1; . . .; ynð Þ denotes the observed data, x are unknown
parameters which typically follow a Gaussian Markov
random field (GMRF) and h are unknown hyperparameters.
The marginal posterior density of xi can be written as
p xijyð Þ ¼Z
h
p xijh; yð Þp hjyð Þdh
and can be approximated by the finite sum
~p xijyð Þ ¼X
k
~p xijhk; yð Þ~p hkjyð ÞDk
Integrated nested Laplace approximations (INLA)
approximates these marginal posterior densities using an
approximation ~p xijh; yð Þ of p xijh; yð Þ and an additional
approximation ~p hjyð Þ of the marginal posterior density of
the hyperparameters p hjyð Þ. Dk are chosen appropriately.
To approximate p hjyð Þ we consider:
p x; h; yð Þ ¼ p xjh; yð Þ � p hjyð Þ � p yð Þ;
therefore for any x
p hjyð Þ / p x; h; yð Þp xjh; yð Þ
INLA approximates p hjyð Þ using a Laplace
approximation (Tierney and Kadane 1986):
~p hjyð Þ / ~p x; h; yð Þ~pG xjh; yð Þ jx¼x� hð Þ
where the denominator ~pG xjh; yð Þ is the Gaussian approx-
imation of p xjh; yð Þ and x� hð Þ is the mode of the full
conditional p xjh; yð Þ (Rue and Held 2005; Rue et al. 2009).
Posterior marginals for the hyperparameters ~p hjyð Þ could
be obtained using two strategies. The first, more accurate
but also time consuming, consists of defining a grid of
points covering the area where most of the mass of ~p hjyð Þis located (GRID strategy). The second one, named central
composite design (CCD strategy), consists in laying out a
small amount of ‘‘points’’ in a m-dimensional space in
order to estimate the curvature of ~p hjyð Þ (Rue et al. 2009).
In this paper the CCD strategy has been used as it needs
much less computational time and the differences between
the CCD and GRID strategies are expected to be minor.
Moreover, the CCD strategy is recommended for problems
with high dimensionality of the hyperparameter vector h.
Secondly, in order to approximate ~p xijh; yð Þ, a Gaussian
approximation ~pG xijh; yð Þ ¼ N xi; li hð Þr2i hð Þ
� �could be
used. However, the approximation can be improved using
Laplace approximation or a simplified Laplace approxi-
mation based on the skew-normal distribution (Rue et al.
2009). In this paper the Laplace approximation has been
used (Tierney and Kadane 1986). This is the most time
Stoch Environ Res Risk Assess (2014) 28:695–706 705
123
consuming approximation but on the other hand is the
most accurate.
References
Banerjee S, Wall MM, Carlin BP (2003) Frailty modeling for spatially
correlated survival data, with application to infant mortality in
Minnesota. Biostatistics (Oxford, England) 4:123–142
Barcelo MA, Saez M, Cano-Serral G, Martinez-Beneito MA,
Martinez JM, Borrell C, Ocana-Riola R, Montoya I, Calvo M,
Lopez-Abente G, Rodriguez-Sanz M, Toro S, Alcala JT, Saurina
C, Sanchez-Villegas P, Figueiras A (2008) Metodos para la
suavizacion de indicadores de mortalidad: aplicacion al analisis
de desigualdades en mortalidad en ciudades del Estado espanol
(Proyecto MEDEA). Gaceta sanitaria/S.E.S.P.A.S 22:596–608
Besag J, York J, Mollie A (1991) Bayesian image restoration, with
two applications in spatial statistics. Ann Inst Stat Math 43:1–59
Borrell C, Marı-Dell’Olmo M, Serral G, Martinez-Beneito MA,
Gotsens M, Other MEDEA members. (2010) Inequalities in
mortality in small areas of eleven Spanish cities (the multicenter
MEDEA project). Health & Place 16:703–711
Botella-Rocamora P, Lopez-Quılez A, Martinez-Beneito MA (2013)
Spatial moving average risk smoothing. Stat Med 32:2595–2612
Clayton DG, Bernardinelli L, Montomoli C (1993) Spatial correlation
in ecological analysis. Int J Epidemiol 22:1193–1202
Dominguez-Berjon MF, Borrell C, Cano-Serral G, Esnaola S, Nolasco A,
Pasarin MI, Ramis R, Saurina C, Escolar-Pujolar A (2008)
Constructing a deprivation index based on census data in large
Spanish cities (the MEDEA project). Gaceta sanitaria/S.E.S.P.A.S
22:179–187
Gelman A (2006) Prior distributions for variance parameters in
hierarchical models. Bayesian Anal 1:515–533
Hodges JS, Reich BJ (2010) Adding spatially-correlated errors can
mess up the fixed effect you love. Am Stat 64:325–334
Hodges JS, Cui Y, Sargent DJ, Carlin BP (2007) Smoothing balanced
single-error-term analysis of variance. Technometrics 49:12–25
Hughes J, Haran M (2013) Dimension reduction and alleviation of
confounding for spatial generalized linear mixed models. J R
Stat Soc 75:139–159
Knorr-Held L, Best NG (2001) A shared component model for
detecting joint and selective clustering of two diseases. J R Stat
Soc 164:73–85
Liu X, Wall MM, Hodges JS (2005) Generalized spatial structural
equation models. Biostatistics (Oxford, England) 6:539–557
Mardia KV (1988) Multi-dimensional multivariate Gaussian Markov
random fields with application to image processing. J Multivar
Anal 24:265–284
Marı-Dell’Olmo M, Martinez-Beneito MA, Borrell C, Zurriaga O,
Nolasco A, Dominguez-Berjon MF (2011) Bayesian factor
analysis to calculate a deprivation index and its uncertainty.
Epidemiology (Cambridge, Mass) 22:356–364
Munoz F, Pennino MG, Conesa D, Lopez-Quı-lez A, Bellido J (2013)
Estimation and prediction of the spatial occurrence of fish
species using Bayesian latent Gaussian models. Stoch Environ
Res Risk Assess 27:1171–1180
R Development Core Team (2012). R: a language and environment
for statistical computing
Reich BJ, Hodges JS, Zadnik V (2006) Effects of residual smoothing
on the posterior of the fixed effects in disease-mapping models.
Biometrics 62:1197–1206
Rue H, Held L (2005) Gaussian Markov random fields: theory and
applications (Chapman & Hall/CRC monographs on statistics &
applied probability): Chapman and Hall/CRC, London
Rue H, Martino S (2009) INLA: functions which allow to perform a
full Bayesian analysis of structured additive models using
integrated nested laplace approximaxion. R package version 0.0
Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference
for latent Gaussian models by using integrated nested Laplace
approximations. J R Stat Soc 71:319–392
Schrodle B, Held L (2011a) A primer on disease mapping and
ecological regression using INLA. Comput Stat 26:241–258
Schrodle B, Held L (2011b) Spatio-temporal disease mapping using
INLA. Environmetrics 22(6):725–734
Tierney L, Kadane JB (1986) Accurate approximations for posterior
moments and marginal densities. J Am Stat Assoc 81:82–86
Tzala E, Best N (2008) Bayesian latent variable modelling of
multivariate spatio-temporal variation in cancer mortality. Stat
Methods Med Res 17:97–118
Ugarte M, Ibanez B, Militino A (2005) Detection of spatial variation
in risk when using CAR models for smoothing relative risks.
Stoch Environ Res Risk Assess 19:33–40
Wang F, Wall MM (2003) Generalized common spatial factor model.
Biostatistics (Oxford, England) 4:569–582
Zhang Y, Hodges JS, Banerjee S (2009) Smoothed ANOVA with
spatial effects as a competitor to MCAR in multivariate spatial
smoothing. Ann Appl Stat 3:1805
706 Stoch Environ Res Risk Assess (2014) 28:695–706
123