

© 2013 Royal Statistical Society 1369–7412/13/75743

J. R. Statist. Soc. B (2013) 75, Part 4, pp. 743–768

Marginal log-linear parameters for graphical Markov models

Robin J. Evans

University of Cambridge, UK

and Thomas S. Richardson

University of Washington, Seattle, USA

[Received July 2011. Final revision November 2012]

Summary. Marginal log-linear (MLL) models provide a flexible approach to multivariate discrete data. MLL parameterizations under linear constraints induce a wide variety of models, including models that are defined by conditional independences. We introduce a subclass of MLL models which correspond to acyclic directed mixed graphs under the usual global Markov property. We characterize for precisely which graphs the resulting parameterization is variation independent. The MLL approach provides the first description of acyclic directed mixed graph models in terms of a minimal list of constraints. The parameterization is also easily adapted to sparse modelling techniques, which we illustrate by using several examples of real data.

Keywords: Acyclic directed mixed graph; Discrete graphical model; Marginal log-linear parameter; Parsimonious modelling; Variation independence

1. Introduction

Models defined by conditional independence constraints are central to many methods in multivariate statistics, and in particular to graphical models (Darroch et al., 1980; Whittaker, 1990). In the case of discrete data, marginal log-linear (MLL) parameters can be used to parameterize a broad range of models, including some graphical classes and models for conditional independence (Rudas et al., 2010; Forcina et al., 2010). These parameters are defined by considering an ordered sequence M_1, M_2, ..., M_k of margins of the distribution which respects inclusion (i.e. M_i precedes M_j if M_i ⊂ M_j), with each such sequence giving rise to a smooth parameterization of the saturated model. Useful submodels can be induced by setting some of the parameters to 0, or more generally by restricting attention to a linear or affine subset of the parameter space.

The flexibility in this scheme presents a challenge in terms of both interpreting the resulting model and performing model selection, for which a tractable search space is typically required. We describe a subclass of MLL models corresponding to a class of graphs known as acyclic directed mixed graphs (ADMGs), which contain directed (‘→’) and bidirected (‘↔’) edges, subject to the constraint that there are no cycles of directed edges; an example is given in Fig. 1. The relationship between the MLL models and models based on ADMGs is analogous to that between ordinary log-linear models and (undirected) graphical log-linear models: log-linear models give a very rich class of models to choose from, since their number grows doubly exponentially as the number of variables increases; undirected graphs provide a natural and more manageable subset of models with which to work (Darroch et al., 1980).

Address for correspondence: Robin J. Evans, Statistical Laboratory, Wilberforce Road, Cambridge, CB3 0WB, UK. E-mail: [email protected]


Fig. 1. An ADMG G1

Fig. 2. Two generating processes for the flu vaccine encouragement design: both graphs imply Re ⊥⊥ Ag, Co; however, (b) additionally implies Re ⊥⊥ Y | Va, Ag; the vertices H, H1 and H2 are unobserved


The patterns of independence that are described by ADMGs arise naturally in the context of generating processes in which not all variables are observed. To illustrate this, consider the randomized encouragement design carried out by McDonald et al. (1992) to investigate the effect of computer reminders for doctors on take-up of influenza vaccinations, and consequent morbidity in patients. The study involved 2861 patients; here we focus on the following fields:

(a) Re, the patient’s doctor was sent a card asking to remind them about flu vaccine (randomized);
(b) Va, the patient is vaccinated against influenza;
(c) Y, the end point—the patient was not hospitalized with flu;
(d) Ag, the age of the patient—0, ‘65 years and under’; 1, ‘over 65 years’;
(e) Co, the patient has chronic obstructive pulmonary disease, as measured at baseline.

The graphs in Fig. 2 represent two possible data-generating processes. Under both structures, whether or not a patient’s doctor received a reminder note is independent of the baseline variables age (Ag) and chronic obstructive pulmonary disease status (Co), as would be expected under randomization. Further, the absence of an edge Re → Y encodes the assumption that whether or not a reminder (Re) was received influences the final outcome (Y) only via whether or not a patient received a flu vaccination (Va). Both structures also assume that there are unobserved confounding factors between vaccination status and chronic obstructive pulmonary disease, and between chronic obstructive pulmonary disease and the final outcome. However, the graph in Fig. 2(b) supposes that there is no additional confounding between Va and Y. As a consequence the generating process given in Fig. 2(b) implies the additional restriction that Re ⊥⊥ Y | Va, Ag. (We make no assumptions about the state spaces of the variables H, H1 and H2, since these factors are unobserved.)


Fig. 3. Two ADMGs representing the conditional independence restrictions on the observed margin implied by the corresponding graphs in Fig. 2

In Fig. 3 we show the ADMGs corresponding to the generating processes in Fig. 2. These graphs contain only observed variables, but by including bidirected edges (↔) they encode the same observable conditional independence relations; see Section 3.1 for details.

All the work herein can easily be extended to graphs which also contain an undirected component, provided that no undirected edge is adjacent to an arrowhead. This latter case is equivalent to the summary graphs of Wermuth (2011) and strictly includes all ancestral graphs (Richardson and Spirtes, 2002). Our approach may be seen as extending earlier work (Rudas et al., 2006, 2010; Forcina et al., 2010) which described the conditional independence structure of certain MLL models.

1.1. Acyclic directed mixed graph models
Richardson (2003) described local and global Markov properties for ADMGs, whereas Richardson (2009) gave a parameterization for discrete random variables via a collection of conditional probabilities of the form P(X_H = 0 | X_T = x_T). Although Richardson’s parameterization is simple, it does not naturally lead to parsimonious submodels. In addition, the parameters are subject to variation dependence constraints, in the sense that setting some parameters to particular values may restrict the valid range of others; this makes maximum likelihood fitting, for example, more challenging (Evans and Richardson, 2012). To illustrate this point, consider the graph G1 in Fig. 1 as an example; it encodes the model under which X_1 ⊥⊥ X_3 and X_4 ⊥⊥ X_1 | X_2. Richardson’s parameterization consists in this case (for binary random variables) of the probabilities

    P(X_1 = 0),    P(X_2 = 0 | X_1 = x_1),    P(X_2 = 0, X_3 = 0 | X_1 = x_1),
    P(X_3 = 0),    P(X_4 = 0 | X_2 = x_2),    P(X_3 = 0, X_4 = 0 | X_1 = x_1, X_2 = x_2),

where x_1, x_2 ∈ {0, 1}. A disadvantage of this parameterization is that, for instance, the joint probabilities P(X_2 = 0, X_3 = 0 | X_1 = x_1) are bounded above by the marginal probabilities P(X_2 = 0 | X_1 = x_1). Consequently, from the point of view of parameter interpretation, it makes little sense to consider the joint probabilities in isolation. For example, strong (conditional) correlation between X_2 and X_3 is present when the joint probability is large relative to the marginals.


However, replacing the joint probabilities P(X_2 = 0, X_3 = 0 | X_1 = x_1) with the conditional odds ratios

    P(X_2 = 0, X_3 = 0 | X_1 = x_1) P(X_2 = 1, X_3 = 1 | X_1 = x_1) / {P(X_2 = 1, X_3 = 0 | X_1 = x_1) P(X_2 = 0, X_3 = 1 | X_1 = x_1)},    x_1 ∈ {0, 1},

(and similarly for P(X_3 = 0, X_4 = 0 | X_1 = x_1, X_2 = x_2)) yields a variation-independent parameterization: the odds ratio measures dependence without reference to marginal distributions. This means that if we wish to define a prior distribution over the univariate probabilities and the odds ratios we may, if appropriate, simply use a product of univariate distributions; similarly, to fit a generalized linear model with these parameters as joint responses, we need only use simple univariate link functions. We shall see that this approach to discrete parameterizations can be generalized by using MLL parameters.
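As a small numerical illustration of the odds ratio as a margin-free measure of dependence, consider the following Python sketch. It is not code from the paper: the function name and the table values are purely hypothetical.

    import numpy as np

    # Illustrative only: a conditional odds ratio computed from a hypothetical
    # table p23[x2, x3] = P(X2 = x2, X3 = x3 | X1 = x1); the numbers are made up.
    def conditional_odds_ratio(p23):
        return (p23[0, 0] * p23[1, 1]) / (p23[0, 1] * p23[1, 0])

    p23 = np.array([[0.40, 0.15],
                    [0.15, 0.30]])
    print(conditional_odds_ratio(p23))   # about 5.3: a strong positive association

The value is unaffected by rescaling either margin of the table, which is the sense in which the odds ratio, unlike the joint probability, can be interpreted without reference to the marginal distributions.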

In Section 2 we introduce MLL parameters and some of their properties, while Section 3 gives background theory about ADMGs and the parameterization of Richardson (2009). The development of MLL parameters for ADMG models is presented in Section 4, resulting in a parameterization that we refer to as ingenuous (since it arises naturally, but ‘natural parameterization’ already has a particular meaning). We also show that this parameterization can always be embedded in a larger parameterization corresponding to a complete graph and the saturated model, where some of the parameters in the bigger model are linearly constrained. In Section 5 we classify for which models the ingenuous parameterization is variation independent, since this can facilitate interpretation of the resulting coefficients. In Section 6 we discuss approaches to sparse modelling using MLL parameters in the context of several additional data sets and a simulation. Longer proofs are in the on-line supplementary material.

2. Marginal log-linear parameters

We consider collections of random variables (X_v)_{v∈V} with finite index set V, taking values in finite discrete probability spaces (𝒳_v)_{v∈V} under a strictly positive probability measure P; without loss of generality, 𝒳_v = {0, 1, ..., |𝒳_v| − 1}. For A ⊆ V we let 𝒳_A ≡ ×_{v∈A} 𝒳_v and 𝒳 ≡ 𝒳_V, and similarly X_A ≡ (X_v)_{v∈A} and X ≡ X_V, and x_A ≡ (x_v)_{v∈A} and x ≡ x_V. In addition, 𝒳̃ is the subset of 𝒳 which does not contain the last possible element in any co-ordinate, i.e. 𝒳̃_v = {0, 1, ..., |𝒳_v| − 2} and 𝒳̃ = ×_{v∈V} 𝒳̃_v. We use p_A(x_A) ≡ P(X_A = x_A) and p_{A|B}(x_A | x_B) ≡ P(X_A = x_A | X_B = x_B) for probabilities of particular instantiations of x.

Following Bergsma and Rudas (2002), we define a general class of parameters on discrete distributions. The definition relies on abstract collections of subsets, so it may be helpful to the reader to keep in mind that the sets M_i ∈ M are margins, or subsets, of the distribution over V, and each set L_i is a collection of effects in the margin M_i. A pair (L, M_i) corresponds to a log-linear interaction over the set L, within the margin M_i.

Definition 1. For L ⊆ M ⊆ V, the pair (L, M) is an ordered pair of subsets of V. Let P be a collection of such pairs, and define

    M ≡ {M | (L, M) ∈ P for some L ⊆ M}

to be the collection of margins in P. If M = {M_1, ..., M_k}, write

    L_i ≡ {L | (L, M_i) ∈ P}

for the set of effects in the margin M_i. We say that the collection P is hierarchical if the ordering on M may be chosen so that whenever i < j we have M_j ⊄ M_i and also L ∈ L_j ⇒ L ⊄ M_i; the second condition is equivalent to saying that each L is associated only with the first margin M of which it is a subset. We say that the collection is complete if every non-empty subset of V is an element of precisely one set L_i.

The term ‘hierarchical’ is used because each log-linear interaction is defined in the first possible margin in an ascending class, and ‘complete’ because all interactions are present. Some researchers (Rudas et al., 2010; Lupparelli et al., 2009) consider only collections which are complete.

Definition 2. For each M ⊆ V and x_M ∈ 𝒳_M, define the functions λ^M_L(x_L) by the identity

    log p_M(x_M) ≡ Σ_{L⊆M} λ^M_L(x_L),

subject to the identifiability constraint that, for every ∅ ≠ L ⊆ M, x_L ∈ 𝒳_L and v ∈ L,

    Σ_{x_v ∈ 𝒳_v} λ^M_L(x_{L\{v}}, x_v) = 0,

i.e. the sum over the support of each variable is 0. We call λ^M_L(x_L) an MLL parameter.

Note that the constant λ^M_∅ is determined by the values of the other parameters and the fact that the probabilities p_M(x_M) sum to 1. In what follows we shall always assume that L is non-empty.

The term ‘MLL parameter’ is coined by analogy with ordinary log-linear parameters, which correspond to the special case M = V. The following result provides an explicit expression for λ^M_L(x_L).

Lemma 1. For L ⊆ M ⊆ V and x_L ∈ 𝒳_L we have

    λ^M_L(x_L) = (1/|𝒳_M|) Σ_{y_M ∈ 𝒳_M} log p_M(y_M) Π_{v∈L} (|𝒳_v| I{x_v = y_v} − 1),     (1)

where I_A is the indicator function of the event A.

This result is elementary, and its proof is omitted.

For a collection of ordered pairs of subsets P (see definition 1), we let

    Λ(P) = {λ^M_L(x_L) | (L, M) ∈ P, x_L ∈ 𝒳̃_L}

be the collection of MLL parameters associated with P. We avoid the redundancy created by the identifiability constraint by considering only x_L ∈ 𝒳̃_L.

The definition of an MLL parameter that we give is equivalent to the recursive definition given in Bergsma and Rudas (2002); since both expositions are somewhat abstract, we invite the reader to consult the examples below for additional intuition. In particular note that, for binary random variables, the product in equation (1) is always ±1. Bergsma and Rudas (2002), theorem 2, showed that any collection Λ(P) where P is hierarchical and complete smoothly parameterizes the saturated model, i.e. it parameterizes the set of all positive distributions on 𝒳.
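The explicit formula in equation (1) is simple to evaluate numerically. The following Python sketch is our own illustration, not code from the paper; the function name and the array layout for the marginal table are assumptions.

    import numpy as np
    from itertools import product

    # A minimal numerical sketch of equation (1).  `p` is the marginal probability
    # table p_M, with one axis per variable of M, listed in the order given by `M`.
    def mll_parameter(p, M, L, x_L):
        """lambda^M_L(x_L) computed directly from the margin p_M via equation (1)."""
        M, L = list(M), list(L)
        total = 0.0
        for y in product(*(range(k) for k in p.shape)):      # sum over y_M in X_M
            w = 1.0
            for v in L:                                       # product over v in L
                i = M.index(v)
                w *= p.shape[i] * (x_L[v] == y[i]) - 1        # |X_v| 1{x_v = y_v} - 1
            total += np.log(p[y]) * w
        return total / p.size                                 # divide by |X_M|

    # For a single binary variable this returns half the logit of P(X = 0):
    p1 = np.array([0.7, 0.3])
    print(np.isclose(mll_parameter(p1, M=[1], L=[1], x_L={1: 0}),
                     0.5 * np.log(0.7 / 0.3)))                # True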

The restriction that the parameters must sum to 0 is required for identifiability, but different constraints can be used in its place. We might instead require that λ^M_L(x_L) be 0 whenever any entry of x_L is 0 (or some other selected value); this is seen in Marchetti and Lupparelli (2011), for example, and its use would not substantially affect any of the results in this paper.

2.1. Examples of marginal log-linear models
We shall write λ^M_L to mean the collection {λ^M_L(x_L) | x_L ∈ 𝒳_L}; the expression λ^M_L = 0 denotes that we are setting all the parameters in this collection to 0.


2.1.1. Example 1
The classical log-linear parameters for a discrete distribution over a set of variables V are {λ^V_L | L ⊆ V}.

2.1.2. Example 2
Up to trivial transformations, the multivariate logistic parameters of Glonek and McCullagh (1995) are {λ^L_L | L ⊆ V}.

2.1.3. Example 3
Let V = {1, 2, 3} and assume that all random variables are binary. Write P_{001} ≡ P(X_1 = 0, X_2 = 0, X_3 = 1), and P_{1++} ≡ P(X_1 = 1), etc. Then

    λ^1_1(0) = (1/2) log(P_{0++} / P_{1++}),

which, up to a multiplicative constant, is the logit of the probability of the event {X_1 = 0}. Also,

    λ^{12}_1(0) = (1/4) log{P_{00+} P_{01+} / (P_{10+} P_{11+})}

and

    λ^{12}_{12}(0, 0) = (1/4) log{P_{00+} P_{11+} / (P_{10+} P_{01+})},

which are the log-odds product and log-odds ratio between X_1 and X_2 respectively.

If instead X_1 is ternary, we obtain

    λ^1_1(0) = (1/3) log{P_{0++}² / (P_{1++} P_{2++})},

    λ^{12}_1(0) = (1/6) log{P_{00+}² P_{01+}² / (P_{10+} P_{11+} P_{20+} P_{21+})}

and

    λ^{12}_{12}(0, 0) = (1/6) log{P_{00+}² P_{11+} P_{21+} / (P_{10+} P_{20+} P_{01+}²)}.

Here λ^1_1(0) contrasts the probability P(X_1 = 0) with the geometric mean of the probabilities P(X_1 = 1) and P(X_1 = 2). Up to constants, λ^{12}_{12}(0, 0) is an average of the two log-odds ratios

    log{P_{00+} P_{21+} / (P_{20+} P_{01+})}    and    log{P_{00+} P_{11+} / (P_{10+} P_{01+})},

and so gives a contrast between P(X_1 = X_2 = 0) and other joint probabilities in a way which generalizes the binary log-odds ratio and provides a measure of dependence; in particular note that λ^{12}_{12}(0, 0) = 0 if X_1 ⊥⊥ X_2.

Here we have written, for example, 12 instead of {1, 2}; similarly, for sets A and B we sometimes write AB for A ∪ B, and aB for {a} ∪ B.
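The binary identities in example 3 are easy to check numerically. The short sketch below is illustrative only (the random table is hypothetical): it recovers λ^{12}_1 and λ^{12}_{12} from the {1, 2} margin by the usual ANOVA-style decomposition of log p_{12} and compares them with the log-odds product and log-odds ratio formulas above.

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)     # joint over binary (X1, X2, X3)
    p12 = p.sum(axis=2)                                # margin P(X1 = i, X2 = j)
    lp = np.log(p12)

    lam_0 = lp.mean()                                  # lambda^{12}_{emptyset}
    lam_1 = lp.mean(axis=1) - lam_0                    # lambda^{12}_1(x1)
    lam_2 = lp.mean(axis=0) - lam_0                    # lambda^{12}_2(x2)
    lam_12 = lp - lam_0 - lam_1[:, None] - lam_2[None, :]

    log_odds_ratio = np.log(p12[0, 0] * p12[1, 1] / (p12[0, 1] * p12[1, 0]))
    log_odds_prod = np.log(p12[0, 0] * p12[0, 1] / (p12[1, 0] * p12[1, 1]))
    print(np.isclose(lam_12[0, 0], 0.25 * log_odds_ratio))   # True
    print(np.isclose(lam_1[0], 0.25 * log_odds_prod))        # True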

2.2. Properties of marginal log-linear models
The next result relates MLL parameters to conditional independences; it is found as lemma 1 in Rudas et al. (2010) and equation (6) of Forcina et al. (2010).

Lemma 2. For any disjoint sets A, B and C, where C may be empty, A ⊥⊥ B | C if and only if

    λ^{ABC}_{A′B′C′} = 0 for every ∅ ≠ A′ ⊆ A, ∅ ≠ B′ ⊆ B and C′ ⊆ C.

The special case of C = ∅ (giving marginal independence) was proved in the context of multivariate logistic parameters by Kauermann (1997).

2.2.1. Example 4
Take a complete and hierarchical parameterization of three variables,

    λ^1_1, λ^2_2, λ^3_3, λ^{12}_{12}, λ^{13}_{13}, λ^{123}_{23}, λ^{123}_{123}.

Then we can force X_1 ⊥⊥ X_3 by setting λ^{13}_{13} = 0. Similarly X_2 ⊥⊥ X_3 | X_1 corresponds to setting λ^{123}_{23} = λ^{123}_{123} = 0.

The following lemma shows that, under conditional independence constraints, certain MLL parameters defined within different margins are equal.

Lemma 3. Suppose that A ⊥⊥ B | C, and A is non-empty. Then, for any D ⊆ C,

    λ^{ABC}_{AD}(x_{AD}) = λ^{AC}_{AD}(x_{AD}),    for each x_{AD} ∈ 𝒳_{AD}.

The proof of this result is found in the on-line supplementary material.

3. Acyclic directed mixed graphs

We introduce basic graphical concepts that are used to describe the global Markov property and parameterization schemes.

Definition 3. A directed mixed graph G consists of a set of vertices V, and both directed (→) and bidirected (↔) edges. Edges of the same type and orientation may not be repeated, but there may be multiple edges of different types between a pair of vertices.

A path in G is a sequence of adjacent edges, without repetition of a vertex; a path may be empty, or equivalently consist of only one vertex. The first and last vertices on a path are the end points (these are not distinct if the path is empty); other vertices on the path are non-end-points. The graph G1 in Fig. 1, for example, contains the path 1 → 2 → 4 ↔ 3, with end points 1 and 3, and non-end-points 2 and 4. A directed path is a path in which all the edges are directed (→) and are oriented in the same direction, whereas a bidirected path consists entirely of bidirected edges.

A directed cycle is a non-empty sequence of edges of the form v → ⋯ → v. An ADMG is one which contains no directed cycles.


Definition 4. For a graph G and a subset of its vertices A ⊆ V, we denote by G_A the induced subgraph that is formed by A, i.e. the graph containing the vertices A, and the edges in G whose end points are both in A.

Definition 5. Let a and d be vertices in a mixed graph G. If a = d, or there is a directed path from a to d, we say that a is an ancestor of d, and that d is a descendant of a. The sets of ancestors of d and descendants of a are denoted an_G(d) and de_G(a) respectively. If there is a directed path from a to d containing precisely one edge (a → d) then a is called a parent of d; the set of vertices which are parents of d is written pa_G(d).

The district of a, which is denoted dis_G(a), is the set containing a and all vertices which are connected to a by a bidirected path. These definitions are applied disjunctively to sets of vertices, so that, for example,

    pa_G(W) ≡ ⋃_{w∈W} pa_G(w),    dis_G(W) ≡ ⋃_{w∈W} dis_G(w).

A set of vertices A is ancestral if A = an_G(A), i.e. A contains all its own ancestors.

3.1. Example 5
Consider the graph G1 in Fig. 1. We have

    an_{G1}(4) = {1, 2, 4},    an_{G1}({2, 3}) = {1, 2, 3}.

The district of 3 is the set {2, 3, 4} and, since 3 has no parents, pa_{G1}(3) = ∅.
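These graphical operations are simple to compute mechanically. The following Python fragment is an illustration under our own edge encoding (it is not code from the paper); it reproduces the sets of example 5 for G1.

    DIRECTED = [(1, 2), (2, 4)]      # u -> v  (graph G1 of Fig. 1)
    BIDIRECTED = [(2, 3), (3, 4)]    # u <-> v

    def ancestors(vs):
        """an_G(vs): vs together with every vertex having a directed path into vs."""
        an = set(vs)
        while True:
            new = {u for (u, v) in DIRECTED if v in an} - an
            if not new:
                return an
            an |= new

    def district(v):
        """dis_G(v): v plus everything reachable from v along bidirected edges."""
        dis = {v}
        while True:
            new = ({w for (u, w) in BIDIRECTED if u in dis}
                   | {u for (u, w) in BIDIRECTED if w in dis}) - dis
            if not new:
                return dis
            dis |= new

    def parents(v):
        return {u for (u, w) in DIRECTED if w == v}

    print(ancestors({4}))       # {1, 2, 4}
    print(ancestors({2, 3}))    # {1, 2, 3}
    print(district(3))          # {2, 3, 4}
    print(parents(3))           # set()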

Note that, by the definitions of some researchers, vertices are not their own ancestors (Lauritzen, 1996). The above notation may be shortened on induced subgraphs so that pa_A ≡ pa_{G_A}, and similarly for other definitions. In some cases where the meaning is clear, we shall dispense with the subscript altogether.

We use the now standard notation of Dawid (1979) and represent the statement ‘X is independent of Y given Z under a probability measure P’, for random variables X, Y and Z, by X ⊥⊥ Y | Z [P]. If P is unambiguous, this part is dropped, and if Z is empty we write simply X ⊥⊥ Y. Finally, we abuse notation in the usual way: v and X_v are used interchangeably as both a vertex and a random variable; likewise A denotes both a vertex set and X_A.

3.2. Global Markov property for acyclic directed mixed graphs
A Markov property associates a particular set of independence relations with a graph.

A non-end-point vertex c on a path is a collider on the path if the edges preceding and succeeding c on the path have an arrowhead at c, e.g. → c ← or ↔ c ←; otherwise a non-end-point c is a non-collider. A path between vertices a and b in a mixed graph is said to be blocked given a set C if either

(a) there is a non-collider on the path and in C, or
(b) there is a collider on the path which is not in an_G(C).

If all paths from a to b are blocked by C, then a and b are said to be m-separated given C. Sets A and B are said to be m-separated given C if every a ∈ A and every b ∈ B are m-separated given C. This naturally extends the d-separation criterion of Pearl (1988) to graphs with bidirected edges.


A probability measure P on 𝒳 is said to satisfy the global Markov property for G if, for every triple of disjoint sets of vertices A, B and C,

    A is m-separated from B given C in G  ⇒  X_A ⊥⊥ X_B | X_C [P].

The model that is associated with an ADMG G is simply the set of distributions that obey the global Markov property for G.

Proposition 1. If a path from x to y is not blocked given Z in G, then every vertex on the path is in an_G({x, y} ∪ Z).

Proof. The proof follows from the definition of m-separation.

3.2.1. Example 6
Consider the graph G1 in Fig. 1. There are two paths between the vertices 1 and 4:

    π_1 : 1 → 2 → 4    and    π_2 : 1 → 2 ↔ 3 ↔ 4;

both are blocked by C = {2}. The path π_1 is blocked because 2 is a non-collider on the path and is in C, whereas π_2 is blocked because 3 is a collider on the path and is not in an_{G1}(C) = {1, 2}. Hence {1} and {4} are m-separated given {2} in G1.

We can similarly see that {1} and {3} are m-separated given C = ∅, and that no other m-separations hold for this graph. Thus a joint distribution P obeys the global Markov property for G1 if and only if X_1 ⊥⊥ X_4 | X_2 [P] and X_1 ⊥⊥ X_3 [P].

By similar arguments the independences that are associated with the ADMGs in Fig. 2 may also be read off.
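For small graphs, m-separation can be checked by brute-force enumeration of simple paths. The Python sketch below is our own illustration (with our own edge encoding, not the authors' implementation); it verifies the two m-separations of example 6 in G1.

    DIRECTED = [(1, 2), (2, 4)]      # u -> v  (graph G1 of Fig. 1)
    BIDIRECTED = [(2, 3), (3, 4)]    # u <-> v

    def ancestors(vs):
        """an_G(vs): vs plus every vertex with a directed path into vs."""
        an = set(vs)
        while True:
            new = {u for (u, v) in DIRECTED if v in an} - an
            if not new:
                return an
            an |= new

    def incident(v):
        """Edges at v as (other end, mark at v, mark at other); 'h' = arrowhead, 't' = tail."""
        out = []
        for (u, w) in DIRECTED:
            if u == v: out.append((w, 't', 'h'))
            if w == v: out.append((u, 'h', 't'))
        for (u, w) in BIDIRECTED:
            if u == v: out.append((w, 'h', 'h'))
            if w == v: out.append((u, 'h', 'h'))
        return out

    def m_separated(a, b, C):
        """True iff every path from a to b is blocked given C (naive path enumeration)."""
        anC = ancestors(C)

        def open_path(v, mark_in, visited):
            for (w, mark_out, mark_at_w) in incident(v):
                if w in visited:
                    continue
                if mark_in is not None:                      # v is a non-endpoint: apply blocking rules
                    collider = (mark_in == 'h' and mark_out == 'h')
                    if collider and v not in anC:
                        continue                             # blocked: collider outside an_G(C)
                    if not collider and v in C:
                        continue                             # blocked: non-collider inside C
                if w == b or open_path(w, mark_at_w, visited | {w}):
                    return True
            return False

        return not open_path(a, None, {a})

    print(m_separated(1, 4, {2}))    # True:  X1 independent of X4 given X2
    print(m_separated(1, 3, set()))  # True:  X1 independent of X3
    print(m_separated(1, 4, set()))  # False: the path 1 -> 2 -> 4 is open given the empty set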

3.3. Existing parameterization of acyclic directed mixed graph models
This subsection defines the parameters of Richardson (2009) for multivariate discrete distributions satisfying the global Markov property for an ADMG.

Definition 6. Let G be an ADMG with vertex set V. We say that a collection of vertices W ⊆ V is barren if, for each v ∈ W, we have W ∩ de_G(v) = {v}; in other words v has no non-trivial descendants in W. For an arbitrary set of vertices U, the maximal subset with no non-trivial descendants in U is denoted barren_G(U).

A head is a collection of vertices H which is connected by bidirected paths in G_{an(H)} and is barren in G. We write H(G) for the collection of heads in G. The tail of a head H is the set

    tail_G(H) ≡ pa_G{dis_{an(H)}(H)} ∪ {dis_{an(H)}(H) \ H}.

Thus the tail of H is the set of vertices in V \ H connected to a vertex in H by a path on which every vertex is a collider and an ancestor of a vertex in H. We typically write T for a tail, provided that it is clear which head it belongs to.

Proposition 2. Let H be a head. Then

(a) H = barren_G{H ∪ tail_G(H)} and
(b) tail_G(H) ⊆ an_G(H).

Proof. The proof is immediate from the respective definitions.


Richardson (2009) observed that discrete distributions obeying the global Markov property for an ADMG G are parameterized by the conditional probabilities

    {P(X_H = x_H | X_T = x_T) | H ∈ H(G), T = tail_G(H), x_H ∈ 𝒳̃_H, x_T ∈ 𝒳_T}.

This is achieved via factorizations based on head–tail pairs; let ‘≺’ be the partial ordering on heads such that H_i ≺ H_j if H_i ⊂ an_G(H_j) and H_i ≠ H_j. This is well defined, since otherwise G would contain a directed cycle. Then let [·]_G be a function which partitions sets of vertices into heads by repeatedly removing heads which are maximal under ≺. (Note that the definition of [·]_G originally given in Richardson (2009) is wrong; see Evans and Richardson (2013).)

Then P satisfies the global Markov property for G if and only if it obeys the factorizations

    P(X_A = x_A) = Π_{H ∈ [A]_G} P(X_H = x_H | X_T = x_T)     (2)

for ancestral sets of vertices A; see Evans and Richardson (2013) for details. In the case of a directed acyclic graph (DAG), this corresponds to the probability distribution of each vertex conditional on its parents.

3.3.1. Example 7
Consider again the ADMG G1 in Fig. 1; its head–tail pairs (H, T) are (1, ∅), (2, 1), (3, ∅), (23, 1), (4, 2) and (34, 12). Multivariate binary distributions obeying the global Markov property with respect to G1 are therefore parameterized by

    p_1(0), p_{2|1}(0 | x_1), p_3(0), p_{23|1}(0, 0 | x_1), p_{4|2}(0 | x_2), p_{34|12}(0, 0 | x_1, x_2),

for x_1, x_2 ∈ {0, 1}, as mentioned in Section 1.
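The head–tail pairs themselves can be computed mechanically from definition 6. The following Python sketch (again with our own graph encoding; an illustration, not the authors' code) recovers the six pairs listed above for G1.

    from itertools import combinations

    V = [1, 2, 3, 4]
    DIRECTED = [(1, 2), (2, 4)]      # u -> v  (graph G1 of Fig. 1)
    BIDIRECTED = [(2, 3), (3, 4)]    # u <-> v

    def ancestors(vs):
        an = set(vs)
        while True:
            new = {u for (u, v) in DIRECTED if v in an} - an
            if not new:
                return an
            an |= new

    def descendants(v):
        de = {v}
        while True:
            new = {w for (u, w) in DIRECTED if u in de} - de
            if not new:
                return de
            de |= new

    def barren(U):
        """Vertices of U with no non-trivial descendants in U."""
        return {v for v in U if descendants(v) & set(U) == {v}}

    def district(v, inside):
        """Bidirected-connected component of v in the induced subgraph on `inside`."""
        dis, frontier = {v}, {v}
        while frontier:
            step = ({w for (u, w) in BIDIRECTED if u in frontier and w in inside}
                    | {u for (u, w) in BIDIRECTED if w in frontier and u in inside})
            frontier = step - dis
            dis |= frontier
        return dis

    def parents(S):
        return {u for (u, v) in DIRECTED if v in S}

    def is_head(H):
        H = set(H)
        if barren(H) != H:
            return False
        A = ancestors(H)
        return district(next(iter(H)), A) >= H   # bidirected-connected in G_{an(H)}

    def tail(H):
        A = ancestors(H)
        dis = set().union(*(district(h, A) for h in H))
        return parents(dis) | (dis - set(H))

    for r in range(1, len(V) + 1):
        for S in combinations(V, r):
            if is_head(S):
                print(set(S), "tail:", tail(set(S)))
    # Prints heads {1}, {2}, {3}, {4}, {2,3}, {3,4} with tails {}, {1}, {}, {2}, {1}, {1,2},
    # matching the six pairs of example 7.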

3.4. Graphical completions
Given a discrete model defined by a set of conditional independence constraints, it is natural to consider it as a submodel of the saturated model, which contains all positive probability distributions. In a setting where the model is graphical, it becomes equally natural to think of the graph as a subgraph of a complete graph, by which we mean a graph containing at least one edge between every pair of vertices. We can obtain a complete graph from an incomplete one by inserting edges between each pair of vertices which lack one, but this leaves a choice of edge type and orientation. These choices may affect how much of the structure and spirit of the original graph is retained; we shall require that a completion preserves the heads of the original graph, which in turn helps to preserve the structure of the parameterization.

Definition 7. Given an ADMG G and a supergraph Ḡ, we call Ḡ a head preserving completion of G if Ḡ is complete, and H(G) ⊆ H(Ḡ).

It is easy to see that a head preserving completion always exists; for example, if we add in a bidirected edge between every pair of vertices which are not joined by an edge, then it is clear that barren sets in G will remain barren in Ḡ, and bidirected-connected sets in G will remain bidirected-connected in Ḡ.

It is not always necessary for every pair of vertices to be joined by an edge for a graph to represent the saturated model; however, we shall require this for our completions.

3.4.1. Example 8
Fig. 4 shows a head preserving completion of the ADMG in Fig. 1.


Fig. 4. A head preserving completion Ḡ1 of the ADMG in Fig. 1

Proposition 3. If Ḡ is a head preserving completion of G then an_G(v) ⊆ an_Ḡ(v). In particular, if a set A is ancestral in Ḡ then A is also ancestral in G.

Proof. The proof follows because G contains a subset of the edges in Ḡ.

4. Ingenuous parameterization of an acyclic directed mixed graph model

We now use the MLL parameters that were defined in Section 2 to parameterize the ADMG models that were discussed in Section 3.

Definition 9. Consider an ADMG G with head–tail pairs (H_i, T_i) over some index i, and let M_i = H_i ∪ T_i. Further, let L_i = {A | H_i ⊆ A ⊆ H_i ∪ T_i}. This collection of margins and associated effects is the ingenuous parameterization of G, denoted P_ing(G).

4.1. Example 9
We return again to the ADMG G1 in Fig. 1; the head–tail pairs are (1, ∅), (2, 1), (3, ∅), (23, 1), (4, 2) and (34, 12), meaning that the ingenuous parameterization is given by the margins and effects in Table 1.

Table 1. Margins and effects for example 9

    M       L
    1       1
    12      2, 12
    3       3
    123     23, 123
    24      4, 24
    1234    34, 134, 234, 1234
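Given the head–tail pairs, generating the margins and effects of Table 1 is a one-line set construction. The Python fragment below (illustrative, with the pairs of example 7 hard-coded; the encoding is our own) makes definition 9 concrete.

    from itertools import combinations

    # Head-tail pairs of G1 (example 7), written as (head, tail).
    HEAD_TAIL = [({1}, set()), ({2}, {1}), ({3}, set()),
                 ({2, 3}, {1}), ({4}, {2}), ({3, 4}, {1, 2})]

    def effects(H, T):
        """All sets A with H ⊆ A ⊆ H ∪ T (the collection L_i of definition 9)."""
        extra = sorted(T)
        return [set(H) | set(S) for r in range(len(extra) + 1)
                for S in combinations(extra, r)]

    for H, T in HEAD_TAIL:
        print("margin", sorted(H | T), "effects", [sorted(A) for A in effects(H, T)])
    # Reproduces the rows of Table 1; for instance margin [1, 2, 3, 4] has effects
    # [3, 4], [1, 3, 4], [2, 3, 4] and [1, 2, 3, 4].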

The ordering of the margins given here is hierarchical; to use most of the results of Bergsma and Rudas (2002), we need to confirm that definition 9 always leads to a hierarchical parameterization, which is shown by the following result.

Lemma 4. For any ADMG G, there is an ordering on the margins M_i of the ingenuous parameterization P_ing(G) which is hierarchical.

Proof. Firstly we show that, for distinct heads H_i and H_j, the collections L_i and L_j are disjoint. To see this, assume for a contradiction that there exists A such that H_i ⊆ A ⊆ H_i ∪ T_i and H_j ⊆ A ⊆ H_j ∪ T_j. Since H_i ≠ H_j, assume without loss of generality that there exists v ∈ H_i \ H_j ⊆ A.



Then v ∈ H_j ∪ T_j implies that v ∈ T_j, and thus there is a directed path from v to some w ∈ H_j. Now, w ∉ H_i, since v, w ∈ H_i would imply that H_i is not barren. But, if w ∈ H_j \ H_i, then by the same argument as above we can find a directed path from w to some x ∈ H_i. Then v → ⋯ → w → ⋯ → x is a directed path between elements of H_i, which is a contradiction. Thus L_i and L_j are disjoint.

Now, consider the partial ordering ≺ of heads that was defined in Section 3.2: H_i ≺ H_j whenever H_i ⊂ an_G(H_j) and H_i ≠ H_j. Any total ordering which respects this partial ordering is hierarchical, because each set A ∈ L_i is a subset of the ancestors of H_i.

We proceed to show that the ingenuous parameters for an ADMG G characterize the set of distributions which obey the global Markov property with respect to G.

Lemma 5. For any sets M and L ⊆ M, the collection of MLL parameters

    {λ^M_A(x_A) | L ⊆ A ⊆ M, x_A ∈ 𝒳̃_A},

together with the (|L| − 1)-dimensional marginal distributions of X_L conditional on X_{M\L}, smoothly parameterizes the distribution of X_L conditional on X_{M\L}.

A proof is given in the on-line supplementary material. We now come to the main result of this section.

Theorem 1. The ingenuous parameterization Λ{P_ing(G)} of an ADMG G parameterizes precisely those distributions P obeying the global Markov property with respect to G.

Proof. We proceed by induction. Again we use the partial ordering ≺ on heads from Section 3.2. For the base case, we know that singleton heads {h} with empty tails are parameterized by the logits λ^h_h.

Now, suppose that we wish to find the distribution of a head H conditional on its tail T. Assume that we have the distribution of all heads H′ which precede H, conditional on their respective tails; we claim that this is sufficient to give the (|H| − 1)-dimensional marginal distributions of H conditional on T.

Let v ∈ H, and let C = H \ {v} be an (|H| − 1)-dimensional marginal of interest. The set A = an_G(H) \ {v} is ancestral, since v cannot have (non-trivial) descendants in an_G(H); in particular C ∪ T ⊆ A. Theorem 4.12 of Evans and Richardson (2013) shows that the factorization in equation (2) holds for every ancestral set, so

    p_A(x_A) = Π_{H′ ∈ [A]_G, T′ = tail(H′)} p_{H′|T′}(x_{H′} | x_{T′}).

But all the probabilities in the product are known by our induction hypothesis, and the marginal distribution of C conditional on T is given by the distribution of A.

The ingenuous parameterization, by definition, contains λ^{H∪T}_A for H ⊆ A ⊆ H ∪ T, and thus the result follows from lemma 5.

4.2. Example 10
Returning to our running example, the graph G1 in Fig. 1 corresponds to the model

    {P | X_1 ⊥⊥ X_4 | X_2 [P] and X_1 ⊥⊥ X_3 [P]}.

Theorem 1 tells us that this collection of distributions is precisely characterized by the ingenuous parameters for G1,

    λ^1_1, λ^{12}_2, λ^{12}_{12}, λ^3_3, λ^{123}_{23}, λ^{123}_{123}, λ^{24}_4, λ^{24}_{24}, λ^{1234}_{34}, λ^{1234}_{134}, λ^{1234}_{234}, λ^{1234}_{1234}.

4.3. Constraint-based model description
The results above show that the ingenuous parameters for an ADMG G, like Richardson’s parameters, provide precisely the information required to reconstruct a distribution obeying the global Markov property for G. However, it is difficult to use this parameterization in practice unless we can evaluate the likelihood, which requires us to make explicit the map which we have implicitly defined from the ingenuous parameters to the joint probability distribution under the model. For example, for the parameters in Richardson (2009) there is an explicit map from the parameters back to the joint distribution using a generalization of Möbius inversion. This was used by Evans and Richardson (2012) to fit these models via maximum likelihood. In contrast, the map from ingenuous parameters to the joint distribution cannot be written in closed form.

An alternative approach is to consider the ingenuous parameterization as part of a larger complete parameterization of the saturated model, such that the additional parameters are constrained to be 0 under the submodel defined by G. This enables us to fit the model by using Lagrange-type algorithms, as in Evans and Forcina (2013).

Theorem 2. Let G be an ADMG, and Ḡ a head preserving completion of G. The ingenuous parameterization of Ḡ corresponds to setting

    λ^M_L = 0

for (L, M) ∈ P_ing(Ḡ) whenever L does not appear as an effect in P_ing(G). In particular, these constraints define the set of distributions which satisfy the global Markov property with respect to G.

To prove this result we require the following lemma.

Lemma 6. Let Ḡ be a head preserving completion of G, and let H ∈ H(G) have tails T and T̄ in G and Ḡ respectively. Then, under the global Markov property for G,

    H ⊥⊥ (T̄ \ T) | T [P].

Proof. Let π be a path in G from some h ∈ H to t ∈ T̄ \ T, and assume without loss of generality that π does not intersect H or T̄ \ T other than at its end points. Assume for a contradiction that π is not blocked given T in G. By proposition 1, every vertex on π is in an_G({h, t} ∪ T) ⊆ an_G(H ∪ T̄). Since Ḡ is complete, if v ∈ an_Ḡ(H ∪ T̄), then v ∈ H ∪ T̄; thus H ∪ T̄ is ancestral in Ḡ. By proposition 3, H ∪ T̄ is also ancestral in G; thus every vertex on π is in H ∪ T̄.

By proposition 2, T̄ ⊆ an_Ḡ(H), so H ∪ T̄ = an_Ḡ(H). However, since H forms a head in Ḡ, H is barren in Ḡ. Thus, in Ḡ, no proper descendant of a vertex in H is on π, and by proposition 3 this also holds in G.

Now let y be the first vertex after h on π that is not in T. By hypothesis, y exists since t ∉ T. By construction, any vertices between h and y on π are in T and hence are colliders on π and ancestors of H in G (by proposition 2). Thus y ∈ dis_G(H) ∪ pa_G{dis_G(H)}. If y ∈ an_G(H) then y ∈ T, which is a contradiction; hence y ∈ dis_G(H) and y ∉ an_G(H). As shown earlier, y is not a descendant of a vertex in H, so H ∪ {y} forms a head in G. Since Ḡ is a head preserving completion, it follows that H ∪ {y} also forms a head in Ḡ, and thus y ∉ an_Ḡ(H) = H ∪ T̄; but every vertex on π lies in H ∪ T̄, which is a contradiction.


4.4. Proof of theorem 2
Let (H, T̄) be a head–tail pair in Ḡ. There are three possibilities for how this pair relates to G: if (H, T̄) is also a head–tail pair in G, then there is no work to be done; otherwise either

(a) H is not a head in G or
(b) H is a head in G but T̄ is not its tail.

If (a) holds, then we claim that, under G, λ^{HT̄}_A = 0 for all H ⊆ A ⊆ H ∪ T̄. To see this, first note that H is a barren set in Ḡ and, since H is maximally connected, this means that all elements are joined by bidirected edges in Ḡ. Since G contains a subset of the edges in Ḡ, H is also barren in G; since H is not a head in G this means that H = K ∪ L for disjoint non-empty sets K and L with no edges directly connecting them. But this implies that K and L are m-separated conditional on T̄, and thus X_K ⊥⊥ X_L | X_T̄ under the Markov property for G. Then, by lemma 2, these parameters are all identically 0 under G.

Assumption (b) implies that H is a head in both G and Ḡ, but T̄ ≡ tail_Ḡ(H) ⊃ tail_G(H) ≡ T. Then λ^{HT̄}_A = 0 for all H ⊆ A ⊆ H ∪ T̄ such that A ∩ (T̄ \ T) ≠ ∅; this follows from lemma 6 and application of lemma 2.

We have shown that all parameters corresponding to effects that are not found in P_ing(G) are identically 0 under G. The vanishing of these parameters defines the correct submodel, but note that some of the margins in P_ing(G) which we have not yet considered are not present in P_ing(Ḡ). These remaining cases are again from assumption (b), but where H ⊆ A ⊆ H ∪ T; in this case λ^{HT̄}_A = λ^{HT}_A under G, again due to lemma 6, this time combined with lemma 3.

Thus we have shown that, under G, all the ingenuous parameters for Ḡ are either 0 or equal to ingenuous parameters for G. Combined with theorem 1, this shows that these constraints define the model.

4.5. Example 11
Consider again the ADMG G1 in Fig. 1; a possible head preserving completion Ḡ1 (shown in Fig. 4) is obtained by adding the edges 1 → 3 and 1 → 4. The ingenuous parameterization for Ḡ1 is given in Table 2.

Table 2. Ingenuous parameterization for example 11

    M       L
    1       1
    12      2, 12
    13      3, 13
    123     23, 123
    124     4, 14, 24, 124
    1234    34, 134, 234, 1234

The effects found in P_ing(Ḡ1) but not in P_ing(G1) are 13, 14 and 124, and indeed the submodel defined by G1 corresponds to setting

    λ^{13}_{13} = λ^{124}_{14} = λ^{124}_{124} = 0;

under this model the following equalities hold by lemma 3:

    λ^{124}_4 = λ^{24}_4,    λ^{124}_{24} = λ^{24}_{24}.



Fig. 5. A complete ADMG of which G1 is a subgraph, but whose ingenuous parameterization does not contain the model described by G1 as a linear subspace because the associated completion is not head preserving

Table 3. Ingenuous parameterization for the graph in Fig. 5

    M       L
    3       3
    13      1, 13
    123     2, 12, 23, 123
    1234    4, 14, 24, 124, 34, 134, 234, 1234


Removing the zero parameters in P_ing(Ḡ1) and renaming two others according to the above equations returns us to the ingenuous parameterization of G1.

Theorem 2 shows that we can fit the model defined by G1 using maximum likelihood simply by maximizing the log-likelihood subject to λ^{13}_{13} = λ^{124}_{14} = λ^{124}_{124} = 0. In particular, this approach always provides a list of independent constraints which characterize the model.

An obvious question which arises is whether any completion of a graph will lead to a complete parameterization with the property of theorem 2. We can obtain a counterexample by considering the complete graph in Fig. 5, the ingenuous parameterization for which is given in Table 3.

The graph G1 in Fig. 1 is a subgraph of the graph in Fig. 5 and corresponds to the model obtained by setting λ^{13}_{13} = λ^{124}_{14} = λ^{124}_{124} = 0; however, these last two parameters do not appear in the ingenuous parameterization of that complete graph, and so there is no way to enforce the submodel as a linear constraint.

This completion is, of course, not head preserving. Such completions may still lead to parameterizations which satisfy the property of theorem 2: for example, if the edge 1 → 3 is added to the graph in Fig. 6(a), this destroys the head {1, 2, 3}, but the submodel corresponds to λ^{13}_{13} = 0, which is a parameter in the complete graph.

4.6. Relationship to prior work
Rudas et al. (2010) parameterized chain graph models of multivariate regression type, which are also known as type IV chain graph models, by using MLL parameters. Type IV chain graph models are a special case of ADMG models, in the sense that by replacing the undirected edges in a type IV chain graph with bidirected edges the global Markov property on the resulting ADMG is equivalent to the Markov property for the chain graph (see Drton (2009)). The graphs in Fig. 6 are examples of type IV models. However, there are models in the class of ADMGs which do not correspond to any chain graph, such as the model that is described by G1 in Fig. 1.


Fig. 6. (a) Graph with a variation-dependent ingenuous parameterization, (b) a Markov equivalent graph to (a) with a variation-independent ingenuous parameterization and (c) a graph with no variation-independent MLL parameterization


The parameterization of Rudas et al. (2010) uses choices of margins that are different from those of the ingenuous parameterization, though their parameters can be shown to be equal to those that are considered here under the global Markov property, using lemma 3. Thus the variation dependence properties of that parameterization are identical to those of the ingenuous parameterization (see Section 5). Forcina et al. (2010) provided an algorithm which gives a range of ‘admissible’ margins in which collections of conditional independence constraints may be defined.

Marchetti and Lupparelli (2011) also parameterized type IV chain graph models in a similar manner to Rudas et al. (2010), but using multivariate logistic contrasts.

5. Variation independence

As discussed in Section 1, the interpretation of parameters and the construction of prior distributions are simpler when parameters are variation independent.

Definition 10. Let θ_i, for i = 1, ..., k, be a collection of parameters such that θ_i takes all values in the set Θ_i. We say that the vector θ = (θ_1, ..., θ_k) is variation independent if θ can take every value in the set Θ_1 × ⋯ × Θ_k.

Bergsma and Rudas (2002) characterized precisely which hierarchical and complete parameterizations are variation independent, using a notion that they called ordered decomposability. We now do the same for ingenuous parameterizations.

Definition 11. A collection of sets M = {M_1, ..., M_k} is incomparable if M_i ⊄ M_j for every i ≠ j.

A collection M of incomparable subsets of V is decomposable if it has at most two elements, or there is an ordering M_1, ..., M_k on the elements of M wherein, for each i = 3, ..., k, there exists j_i < i such that

    (⋃_{l=1}^{i−1} M_l) ∩ M_i = M_{j_i} ∩ M_i.

This is also known as the running intersection property.

A collection M of (possibly comparable) subsets is ordered decomposable if it has at most two elements, or there is an ordering M_1, ..., M_k such that M_i ⊄ M_j for i > j and, for each i = 3, ..., k, the inclusion maximal elements of {M_1, ..., M_i} form a decomposable collection. We say that a collection P of parameters is ordered decomposable if there is an ordering on the margins M which is both hierarchical and ordered decomposable.

The following example is found in Bergsma and Rudas (2002).

5.1. Example 12
Let M = {12, 13, 23, 123}. To have a hierarchical ordering of these margins it is clear that the set 123 must come last, but there is no way to order the collection of inclusion maximal margins {12, 13, 23} such that it has the running intersection property. Thus M is not ordered decomposable.
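For small incomparable collections the running intersection property can be checked by brute force. The following sketch (function name and approach are our own, purely illustrative) confirms the conclusion of example 12.

    from itertools import permutations

    def decomposable(sets):
        """Brute-force check of the running intersection property for an incomparable collection."""
        sets = [frozenset(s) for s in sets]
        if len(sets) <= 2:
            return True
        for order in permutations(sets):
            if all(any((frozenset().union(*order[:i]) & order[i]) == (order[j] & order[i])
                       for j in range(i))
                   for i in range(2, len(order))):
                return True
        return False

    print(decomposable([{1, 2}, {1, 3}, {2, 3}]))   # False: the collection of example 12
    print(decomposable([{1, 2}, {2, 3}, {3, 4}]))   # True: a chain satisfies running intersection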

The next result links variation independence to ordered decomposability.

Theorem 3 (Bergsma and Rudas (2002), theorem 4). Let P be a parameterization which is hierarchical and complete. Then the parameters Λ(P) are variation independent if and only if P is ordered decomposable.

As previously noted, the ingenuous parameterization is not complete in general, and so we cannot apply this result directly to characterize its variation dependence. However, by constructing complete parameterizations of which the ingenuous parameterizations are linear submodels, we can obtain the following result.

Theorem 4. The ingenuous parameterization for an ADMG G is variation independent if and only if G contains no heads of size greater than or equal to 3.

The proof of this result is found in the on-line supplementary material.

5.2. Example 13
The graph G1 in Fig. 1 has maximum head size 2, and therefore the associated ingenuous parameterization is variation independent. Likewise the graphs in Figs 3(a) and 3(b) contain no heads of size greater than 2, so the resulting ingenuous parameters are variation independent. Note that this was not true of the parameters that were given by Richardson (2009).

5.3. Example 14
The bidirected 3-chain in Fig. 6(a) has head 123, and therefore its ingenuous parameterization is variation dependent. This can easily be seen directly: in the binary case, for example, if the parameters λ^{12}_{12}(0, 0) and λ^{23}_{23}(0, 0) are chosen to be very large, this induces very strong dependence between the variables X_1 and X_2, and between X_2 and X_3 respectively. If these correlations are chosen to be too large, then it is impossible for X_1 and X_3 to be marginally independent, which is a contradiction since X_1 ⊥⊥ X_3 is implied by the graph.

Observe that we could use the Markov equivalent graph in Fig. 6(b), which has no heads of size 3, and thus obtain a variation-independent parameterization of the same model. However, if we add incident arrows as shown in Fig. 6(c), we obtain a graph where such a trick is not possible. In fact this third graph has no variation-independent parameterization in the Bergsma and Rudas framework, since it requires λ^{0124}_{0124} = λ^{0134}_{0134} = λ^{0234}_{0234} = 0, and these margins cannot be ordered in a way which satisfies the running intersection property (see example 12).

In general, it would be sensible for a statistician who is concerned about variation dependence to choose a graph from the Markov equivalence class of their model which has the smallest possible maximum head size. This could be achieved by reducing the number of bidirected edges in the graph, where possible; see, for example, Ali et al. (2005) and Drton and Richardson (2008a) for algorithms for finding the graph with the minimal number of arrowheads in a given Markov equivalence class.


Fig. 7. The bidirected 4-cycle


5.4. Example 15
The bidirected 4-cycle, which is shown in Fig. 7, contains a head of size 4, and so its ingenuous parameterization is variation dependent. However, there is a marginal log-linear parameterization of this model which is ordered decomposable, and therefore variation independent. The 4-cycle is precisely the model with X_1 ⊥⊥ X_3 and X_2 ⊥⊥ X_4. Set M = {13, 24, 1234}, with

    L_1 = {1, 3, 13},    L_2 = {2, 4, 24},    L_3 = P({1, 2, 3, 4}) \ (L_1 ∪ L_2);

here P(A) denotes the power set of A. This gives a hierarchical, complete and ordered decomposable parameterization, so the parameters are variation independent. The 4-cycle corresponds exactly to setting λ^{13}_{13} = λ^{24}_{24} = 0, and it follows that the remaining parameters are still variation independent under this constraint.

This approach to parameterization, which considers disconnected sets, is discussed in detail by Lupparelli et al. (2009). It produces a variation-independent parameterization for graphs where the disconnected sets do not overlap, and it may be preferable to the ingenuous parameterization in these cases. In sparser graphs, however, it does not seem as useful; as mentioned above, some graphs have no variation-independent MLL parameterization.

6. Parsimonious modelling with marginal log-linear parameters

The number of parameters in a model associated with a sparse graph containing bidirected edges can, in certain cases, be relatively large. In a purely bidirected graph, the parameter count depends on the number of connected sets of vertices; in the case of a chain of bidirected edges such as that shown later in Section 6.5 in Fig. 11(a), this means that the number of parameters grows quadratically in the length of the chain.

The parameterization of Richardson (2009), and its special case for purely bidirected graphs (see Drton and Richardson (2008b)), does not present us with any obvious method of reducing the parameter count while preserving the conditional independence structure. In contrast, there are well-established methods for sparse modelling with other classes of graphical models. In the case of an undirected graph with binary random variables, restricting to one parameter for each vertex and each edge leads to a Boltzmann machine (Ackley et al., 1985). Rudas et al. (2006) used MLL parameters to provide a sparse parameterization of a DAG model, again restricting to one parameter for each vertex and edge.



As we shall see from the following examples, the ingenuous parameterization allows us to fit graphical models with a large number of parameters, and then to remove higher order interactions to obtain a more parsimonious model while preserving the conditional independence structure of the original graph.

6.1. Flu vaccination data revisited
We first return to the study of McDonald et al. (1992) that was considered in Section 1. All variables are binary, and (except Ag) are coded as 0, false, and 1, true; we add constraints to our model sequentially, recording the results in the analysis-of-deviance Table 4. The ADMG in Fig. 3(a) represents the constraint Ag, Co ⊥⊥ Re; it fits well, having a deviance of 2.54 on 3 degrees of freedom. The smaller model for Fig. 3(b) encodes

    Ag, Co ⊥⊥ Re,    Y ⊥⊥ Re | Va, Ag;

note that these precise independences cannot be represented by a DAG or chain graph (of any of the types that were considered by Drton (2009)). It also fits well (deviance 7.66 on 7 degrees of freedom), so we may prefer it on the grounds of simplicity.

The ingenuous parameterization in this case contains some higher order effects, including the five-way interaction between all variables. Setting λ^M_L = 0 for |L| ≥ 4 removes five parameters while increasing the deviance by only 2.22; removing the effects of size 3 adds a further 8.39 to the deviance while removing seven more parameters. The resulting model has a total deviance of 18.28 on 19 degrees of freedom, representing a good fit compared with the saturated model (likelihood ratio test p = 0.49).

6.2. Incorporating symmetry: twins data
Hakim et al. (2003) investigated genetic effects on the presence or absence of two soft tissue disorders, frozen shoulder and tennis elbow, based on a study in pairs of monozygotic and dizygotic twins; the data are reproduced in Ekholm et al. (2012). We have count data for a five-way contingency table over the variables S_i and E_i, indicators of whether twin i in the pair suffers from frozen shoulder and tennis elbow respectively, i ∈ {1, 2}, and T, an indicator of whether the pair are monozygotic or dizygotic twins. There are a total of 866 observations for monozygotic pairs, and 963 for dizygotic pairs; twin 1 corresponds to the twin who was born first.

Table 4. Analysis-of-deviance table of models considered for the influenza data†

    Constraint                    Figure    Additional deviance    Degrees of freedom    Total deviance
    Ag, Co ⊥⊥ Re                  3(a)      2.54                   3                     2.54
    Y ⊥⊥ Re | Va, Ag              3(b)      5.11                   7                     7.66
    No 4- and 5-way parameters              2.22                   12                    9.88
    No 3-way parameters                     8.39                   19                    18.28

†Constraints are added sequentially from top to bottom; the last three columns give the additional deviance for the constraint, the total degrees of freedom and the total deviance of the models.


Fig. 8. Graphs for the twins data for models corresponding to (a) a common gene and (b) separate genes affecting the prevalence of frozen shoulder and tennis elbow


We first fitted the model T ⊥⊥ (S_1, S_2, E_1, E_2) to test whether the zygosity of the twins has any effect on the other variables; we obtained a deviance of 16.4 on 15 degrees of freedom, suggesting that there is no evidence that T is related to the other variables. Note that this contradicts the conclusions of Ekholm et al. (2012), but they used additional assumptions to obtain more powerful tests.

Collapsing to a four-way table over (S_1, S_2, E_1, E_2), we consider the complete bidirected model in Fig. 8(a). A further simplifying assumption is to impose symmetry between the twins in each pair, on the basis that we do not expect any association between the prevalence of the disorders and which twin was born first. Using the ingenuous parameterization for the graph in Fig. 8(a), which is itself symmetric with respect to the individual twins, this amounts to six independent linear constraints and gives a deviance of 0.59 compared with the saturated model on four variables; there is therefore no evidence to reject symmetry.

Now, a hypothesis of interest is whether a common gene is responsible for the increased risk of the two disorders, or the genetic effects are separate and independent. In the latter case we would expect the data to be explained by the model that is encoded by the graph in Fig. 8(b), and therefore to observe the marginal independences E_1 ⊥⊥ S_2 and E_2 ⊥⊥ S_1 (see Drton and Richardson (2008b) for more details). This amounts to the constraint λ^{E_1S_2}_{E_1S_2} = λ^{E_2S_1}_{E_2S_1} = 0; the first equality already holds by symmetry, so only one additional constraint is imposed.

This model has a deviance of 8.41 on 7 degrees of freedom, which is not rejected in a likelihood ratio test with the saturated model (p = 0.30), and so there is no evidence to reject the separate genes hypothesis. We remark, however, that the model with symmetry but no marginal independences has a slightly lower Bayesian information criterion score, BIC, and so might be preferred.

The elimination of the four-way and three-way interaction parameters for the model from Fig. 8(b) with symmetry results in deviances of 11.63 on 8 degrees of freedom and 16.69 on 10 degrees of freedom respectively, both of which also represent reasonable fits; the latter of these has just five free parameters.

6.3. Netherlands kinship data
The Netherlands Kinship Panel Study is an on-going study which collects longitudinal information on several thousand Dutch individuals and their families (Dykstra et al., 2005, 2007). Both the primary respondents (anchors) and their partners are asked the question ‘How is your health in general?’, with possible responses ‘excellent’, ‘good’, ‘neither good nor poor’, ‘poor’ and ‘very poor’. We combined responses ‘neither good nor poor’, ‘poor’ and ‘very poor’ into one category to avoid small counts.


Fig. 9. Graphs for the Netherlands Kinship Panel Study data—responses of anchors A and partners P regarding their assessment of health (the subscripts indicate time): (a) a complete graph; (b) a subgraph which implies that P_2 ⊥⊥ A_1 | P_1


Two waves of data are currently available: from 2002–2004 and 2006–2007. We considered only anchors who had the same partner in both waves, and such that both the individual and the partner answered the health question in both waves. Let Ai and Pi denote the response of the anchor and partner respectively for wave i ∈ {1, 2}. In total there are n = 2318 data points, classified into a 3 × 3 × 3 × 3 table.
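A hypothetical sketch of this data preparation step is given below; the file name, the column names A1, P1, A2, P2 and the raw category labels are illustrative assumptions, not details taken from the study's documentation.

# Sketch: collapse the five response categories to three and cross-classify the
# n = 2318 couples into a 3 x 3 x 3 x 3 contingency table.
import pandas as pd

collapse = {'excellent': 0, 'good': 1,
            'neither good nor poor': 2, 'poor': 2, 'very poor': 2}

responses = pd.read_csv('nkps_health.csv')           # hypothetical input file
for col in ['A1', 'P1', 'A2', 'P2']:
    responses[col] = responses[col].map(collapse)

table = responses.groupby(['A1', 'P1', 'A2', 'P2']).size()
print(table.sum())                                    # should equal 2318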

We begin with the complete graph in Fig. 9(a). One plausible model would be that anchors and their partners are exchangeable. Since the graph is symmetrical in this respect, so is the ingenuous parameterization, and enforcing symmetry amounts merely to a set of 36 linear constraints, e.g.

$\lambda^{A_1P_1A_2P_2}_{A_2P_2}(1, 0) = \lambda^{A_1P_1A_2P_2}_{A_2P_2}(0, 1).$

This model has a deviance of 89.98, which when compared with the tail of a $\chi^2_{36}$-distribution gives $p = 1.6 \times 10^{-6}$; thus the symmetry model is a poor fit to the data and is rejected. The lack of exchangeability is probably due to selection bias in the sampling of the anchors, as well as the different ways in which the anchors and their partners were asked the question: anchors were asked about their health as part of a face-to-face interview, whereas the partners were asked only to complete a survey. See Siemiatycki (1979) for an analysis of differences in responses resulting from survey mode.

If instead we remove the edge A1 → P2 and fit the graph in Fig. 9(b), we obtain an explanation of the data which is not rejected at the 5% level (deviance 19.09 on 12 degrees of freedom; p = 0.086); this model corresponds to the conditional independence P2 ⊥⊥ A1 | P1. This graph is the only subgraph of the complete graph in Fig. 9(a) which leads to a good fit; in particular the model that is created by removing the edge P1 → A2 is strongly rejected, which is one manifestation of the asymmetry between individuals and their partners.

Note that we could also have obtained the independence P2 ⊥⊥ A1 | P1, for instance, by using a DAG with topological ordering P1, A1, P2, A2, but the resulting parameterization would have made it much more difficult to enforce the symmetry constraint that was tested above.

6.4. Example 16: trust data
Drton and Richardson (2008b) examined responses to seven questions relating to trust and social institutions, taken from the US General Social Survey between 1975 and 1994. Briefly, the seven questions were as follows.


(a) Trust: can most people be trusted?
(b) Helpful: do you think that most people are usually helpful?
(c), (d) MemUn and MemCh: are you a member of a labour union or church?
(e), (f), (g) ConLegis, ConClerg and ConBus: do you have confidence in Congress, or organized religion or business?

In Drton and Richardson (2008b) the model given by the graph in Fig. 10 was shown to explain the data adequately, having a deviance of 32.67 on 26 degrees of freedom, when compared with the saturated model. They also provided an undirected graphical model which has one more edge than the graph in Fig. 10, and yet has 62 fewer parameters. It also gives a good fit to the data, having a deviance of 87.62 on 88 degrees of freedom. Both graphs were chosen by backward stepwise selection methods; see Drton and Richardson (2008b) for details.

For practical and theoretical reasons, the bidirected model may be preferred to the undirected model, even though the latter appears to be much more parsimonious. One may consider the dependence between the responses that are given to a questionnaire to be manifestations of unmeasured characteristics of the respondent, such as their political beliefs. Such a system can be well represented by a bidirected graph, through its marginal independence structure and connection to latent variable models, but not necessarily by an undirected model, which induces conditional independences. Since models that are defined by undirected and bidirected graphs are not nested, there is no a priori reason to expect the two methods to give a similar graphical structure.

Fig. 10. Markov model for trust data given in Drton and Richardson (2008b) [figure: bidirected graph on the vertices Trust, Helpful, MemCh, MemUn, ConClerg, ConBus and ConLegis]

Fig. 11. (a) A bidirected k-chain and (b) a DAG with latent variables (h1, ..., h_{k−1}) generating the same observable conditional independence structure

The greater parsimony of the undirected model (when defined purely by conditional independences) is due to its hierarchical nature: if we remove an edge between two vertices a and b, then this corresponds to requiring that $\lambda^V_A = 0$ for every effect A containing both a and b. Removing such an edge in a bidirected model may correspond merely to setting $\lambda^{ab}_{ab} = 0$ and nothing else, depending on the other edges that are present. Using the ingenuous parameterization, it is easy to constrain additional higher order terms to be 0 and to obtain submodels of the set of distributions obeying the global Markov property.

Fig. 12. Histograms showing the increase in deviance caused by setting to 0 (a) the five- and six-way interaction parameters, (b) the four-, five- and six-way interaction parameters and (c) the three-, four-, five- and six-way interaction parameters: the plots are based on 1000 data sets, each with 10000 observations, generated from the DAG in Fig. 11(b); the densities plotted are χ2 with (a) 3, (b) 6 and (c) 10 degrees of freedom [figure: three histograms of the increase in deviance, each with the reference density overlaid]


Starting with the model in Fig. 10 and fixing the four-, five-, six- and seven-way interaction terms to be 0 increases the deviance to 84.18 on 81 degrees of freedom; none of the four-way interaction parameters was found to be significant on its own. Furthermore, removing 21 of the remaining 25 three-way interaction terms increases the deviance to 111.48 on 102 degrees of freedom; using an asymptotic χ2-approximation gives a p-value of 0.245, so this model is not contradicted by the data. The only parameters that are retained are the one-dimensional marginal probabilities, the two-way interactions corresponding to edges in Fig. 10 and the following three-way interactions:

MemUn, ConClerg, ConBus;

Helpful, MemUn, MemCh;

Trust, ConLegis, ConBus;

MemCh, ConClerg, ConBus.

This model retains the marginal independence structure of Drton and Richardson’s (2008b) model but provides a good fit with only 25 parameters, rather than the original 101.
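As a quick consistency check of these counts (our arithmetic, not a calculation reported in the paper), the saturated log-linear model on the seven binary trust variables has 2^7 − 1 parameters, and the number of free parameters in each submodel is this total minus the deviance degrees of freedom.

# Sketch: verifying the quoted parameter counts for the trust-data models.
saturated = 2**7 - 1      # 127 parameters in the saturated seven-variable binary model
print(saturated - 26)     # 101 free parameters in the original graphical model
print(saturated - 102)    # 25 free parameters in the reduced model above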

A similar analysis, for different data, was performed by Lupparelli et al. (2009), page 573; again they found an undirected graphical model to be much more parsimonious than any bidirected model but obtained comparable fits by removing statistically insignificant higher order parameters.

6.5. Simulated data
We saw in the earlier examples that we could often remove higher order interaction parameters without compromising the goodness of fit. Here we explore this phenomenon further via simulations.

Consider the DAG with latent variables shown in Fig. 11(b); over the observed variables, the conditional independences which hold are exactly those given by the bidirected chain in Fig. 11(a).

We randomly generated 1000 distributions from this DAG model with k = 6, where each latent variable was given three states, and each observed variable two. The probability of each observed variable being 0, conditional on each state of its parents, was an independent uniform random draw on (0, 1); latent states were fixed to occur with equal probability. For each distribution, a sample of size 10000 was drawn, and the bidirected chain model was fitted to it by maximum likelihood estimation. For each of the 1000 data sets, we then measured the increase in deviance associated with removing higher order parameters.
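The following sketch reconstructs this simulation set-up for a single data set; it is our reading of the description above and of the DAG in Fig. 11(b), not the authors' code, and the random-number seed is arbitrary.

# Sketch: one data set of 10000 observations from the latent-variable DAG of
# Fig. 11(b) with k = 6 binary observed variables and five three-state latents;
# observed variable i has latent parents h_{i-1} and h_i where these exist.
import numpy as np

rng = np.random.default_rng(2013)
k, n = 6, 10000

h = rng.integers(0, 3, size=(n, k - 1))         # latent states, equally probable
X = np.empty((n, k), dtype=int)
for i in range(k):
    parents = [j for j in (i - 1, i) if 0 <= j <= k - 2]
    p0 = rng.uniform(size=(3,) * len(parents))  # P(X_i = 0 | parent configuration)
    idx = tuple(h[:, j] for j in parents)
    X[:, i] = (rng.uniform(size=n) >= p0[idx]).astype(int)

# X is a 10000 x 6 binary matrix to which the bidirected chain model would be fitted.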

The histogram in Fig. 12(a) demonstrates that the deviance increase from setting the five- and six-way interaction parameters to 0 (a total of three parameters) was not distinguishable from that which would be observed under the null hypothesis that these parameters are 0. The deviance increase from setting the four-, five- and six-way interactions to 0 appeared to have only a slightly heavier tail than the associated χ2-distribution, as suggested by the outliers in Fig. 12(b). Removing the three-way interactions in addition to this caused a dramatic increase in the deviance, as may be observed from the heavy tail of the histogram in Fig. 12(c). This illustrates that the ingenuous parameterization can be used to produce more parsimonious model descriptions than would be possible by using Richardson’s parameters.
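A panel of Fig. 12 could be reproduced along the following lines; the array of deviance increases and its file name are placeholders for the simulation output, and 3 degrees of freedom corresponds to panel (a).

# Sketch: histogram of deviance increases with the reference chi-squared density.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

increases = np.loadtxt('deviance_increases.txt')   # hypothetical simulation output
x = np.linspace(0, increases.max(), 200)

plt.hist(increases, bins=30, density=True)
plt.plot(x, chi2.pdf(x, 3))                        # chi-squared density, 3 d.f.
plt.xlabel('Increase in deviance')
plt.ylabel('Density')
plt.show()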

Under the process which generated these models, each of these interaction parameters was non-zero almost surely. As the sample size increases, the power of a likelihood ratio test for a fixed distribution tends to 1, so it must be that a simulation such as the above would, for sufficiently large data sets, show a significant deviation from the associated χ2-distributions.


However, even at a fairly large sample size of 10000, a limited effect was observed in Figs 12(a) and 12(b), and the examples above with real data suggest that higher order interactions are often not particularly useful in practice for describing data.

Acknowledgements

This research was supported by US National Science Foundation grant CNS-0855230 and US National Institutes of Health grant R01 AI032475. The Netherlands Kinship Panel Study is funded by grant 480-10-009 from the Major Investments Fund of the Netherlands Organisation for Scientific Research, and by the Netherlands Interdisciplinary Demographic Institute, Utrecht University, the University of Amsterdam and Tilburg University. We thank C. J. McDonald, S. L. Hui and W. M. Tierney for giving us permission to use their flu vaccine data.

Our thanks go to Tamas Rudas for helpful discussions, and to Antonio Forcina for valuable insights and the use of his computer programs. Finally we thank two referees and the Associate Editor for their thorough reading of an earlier draft, and for very useful suggestions.

References

Ackley, D. H., Hinton, G. E. and Sejnowski, T. J. (1985) A learning algorithm for Boltzmann machines. Cogn. Sci., 9, 147–169.
Ali, R. A., Richardson, T. S., Spirtes, P. and Zhang, J. (2005) Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. In Proc. 21st Conf. Uncertainty in Artificial Intelligence (eds F. Bacchus and T. Jaakola), pp. 10–17. Arlington: Association for Uncertainty in Artificial Intelligence Press.
Bergsma, W. P. and Rudas, T. (2002) Marginal models for categorical data. Ann. Statist., 30, 140–159.
Darroch, J. N., Lauritzen, S. L. and Speed, T. P. (1980) Markov fields and log-linear models for contingency tables. Ann. Statist., 8, 522–539.
Dawid, A. P. (1979) Conditional independence in statistical theory (with discussion). J. R. Statist. Soc. B, 41, 1–31.
Drton, M. (2009) Discrete chain graph models. Bernoulli, 15, 736–753.
Drton, M. and Richardson, T. S. (2008a) Graphical methods for efficient likelihood inference in Gaussian covariance models. J. Mach. Learn. Res., 9, 893–914.
Drton, M. and Richardson, T. S. (2008b) Binary models for marginal independence. J. R. Statist. Soc. B, 70, 287–309.
Dykstra, P. A., Kalmijn, M., Knijn, T. C. M., Komter, A. E., Liefbroer, A. C. and Mulder, C. H. (2005) Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships: wave 1. Working Paper 4. Netherlands Interdisciplinary Demographic Institute, The Hague.
Dykstra, P. A., Kalmijn, M., Knijn, T. C. M., Komter, A. E., Liefbroer, A. C. and Mulder, C. H. (2007) Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships: wave 2. Working Paper 6.
Ekholm, A., Jokinen, J., McDonald, J. W. and Smith, P. W. F. (2012) A latent class model for bivariate binary responses from twins. Appl. Statist., 61, 493–514.
Evans, R. J. and Forcina, A. (2013) Two algorithms for fitting constrained marginal models. Computnl Statist. Data Anal., to be published.
Evans, R. J. and Richardson, T. S. (2012) Maximum likelihood fitting of acyclic directed mixed graphs to binary data. In Proc. 26th Conf. Uncertainty in Artificial Intelligence (eds P. Grunwald and P. Spirtes). Arlington: Association for Uncertainty in Artificial Intelligence Press.
Evans, R. J. and Richardson, T. S. (2013) Markovian acyclic directed mixed graphs for discrete data. Preprint arXiv:1301.6624.
Forcina, A., Lupparelli, M. and Marchetti, G. M. (2010) Marginal parameterizations of discrete models defined by a set of conditional independencies. J. Multiv. Anal., 101, 2519–2527.
Glonek, G. F. V. and McCullagh, P. (1995) Multivariate logistic models. J. R. Statist. Soc. B, 57, 533–546.
Hakim, A. J., Cherkas, L. F., Spector, T. D. and MacGregor, A. J. (2003) Genetic associations between frozen shoulder and tennis elbow: a female twin study. Rheumatology, 42, 739–742.
Kauermann, G. (1997) A note on multivariate logistic models for contingency tables. Aust. J. Statist., 39, 261–276.
Lauritzen, S. L. (1996) Graphical Models. Oxford: Clarendon.
Lupparelli, M., Marchetti, G. M. and Bergsma, W. P. (2009) Parameterizations and fitting of bi-directed graph models to categorical data. Scand. J. Statist., 36, 559–576.
Marchetti, G. M. and Lupparelli, M. (2011) Chain graph models of multivariate regression type for categorical data. Bernoulli, 17, 827–844.
McDonald, C. J., Hui, S. L. and Tierney, W. M. (1992) Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Comput., 9, 304–312.
Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems. San Francisco: Morgan Kaufmann.
Richardson, T. S. (2003) Markov properties for acyclic directed mixed graphs. Scand. J. Statist., 30, 145–157.
Richardson, T. S. (2009) A factorization criterion for acyclic directed mixed graphs. In Proc. 25th Conf. Uncertainty in Artificial Intelligence (eds J. Bilmes and A. Ng). Corvallis: Association for Uncertainty in Artificial Intelligence Press.
Richardson, T. S. and Spirtes, P. (2002) Ancestral graph Markov models. Ann. Statist., 30, 962–1030.
Rudas, T., Bergsma, W. P. and Nemeth, R. (2006) Parameterization and estimation of path models for categorical data. In Proc. 17th Symp. Computational Statistics, pp. 383–394. Heidelberg: Physica.
Rudas, T., Bergsma, W. P. and Nemeth, R. (2010) Marginal log-linear parameterization of conditional independence models. Biometrika, 97, 1006–1012.
Siemiatycki, J. (1979) A comparison of mail, telephone and home interview strategies for household health surveys. Am. J. Publ. Hlth, 69, 238–245.
Wermuth, N. (2011) Probability distributions with summary graph structure. Bernoulli, 17, 845–879.
Whittaker, J. (1990) Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.

Supporting information
Additional ‘supporting information’ may be found in the on-line version of this article:

‘Marginal log-linear parameters for graphical Markov models—Supplementary Material’.