

Computational Statistics & Data Analysis 50 (2006) 2044–2064
www.elsevier.com/locate/csda

Conditional log-linear structures for log-linear modelling

Sung-Ho Kim*
Division of Applied Mathematics, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, South Korea

Received 5 May 2004; received in revised form 3 January 2005; accepted 25 March 2005
Available online 29 April 2005

Abstract

Fienberg and Kim [1999. J. Amer. Statist. Assoc. 94 (445), 229–239] investigated the relationship between a log-linear model (LLM) and its conditional log-linear model (CLLM). While they considered a single conditional variable, we explore the relationship further by considering multiple conditional variables. It is shown that we can find the exact model structure of an LLM from the sets of CLLMs of the LLM conditional on some sets of variables, under a certain condition on the conditional sets. A graphical interpretation of the relation is presented when the LLM is graphical. The results are applied effectively to the analysis of real categorical data from a lecture evaluation.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Generator; Graphical model; Independence graph; Induced subgraph; Interaction effect; Zero-w phenomenon; Strong hierarchy principle

1. Introduction

A hierarchical log-linear model of categorical data is expressed as a linear sum of the terms that represent main or interaction effects of subsets of the variables, X_α, α ∈ V,

* Tel.: +82 42 869 2737; fax: +82 42 869 5710.
E-mail address: [email protected].

0167-9473/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.csda.2005.03.009


involved in the model. Each term in the model is defined on a subset of {X_α}_{α∈V}. The model structure of a hierarchical log-linear model is determined by the maximal domain subsets of the terms, which are called the generators of the model and whose collection is called the generating class of the model.

As a contingency table can be expressed as a collection of sliced tables of a subset of variables that are sliced with respect to the remaining subset of variables, a log-linear model (LLM) can be expressed as a collection of conditional log-linear models (CLLMs) of a subset of variables conditional on the rest of the variables. This collection of conditional models can be expressed in a tree form, with the nodes for conditional variables and the leaves for conditional models. When there is only one conditional variable, the tree is a one-node tree with as many leaves as the number of categories of the conditional variable.

Adapting our notation from Bishop et al. (1975), we will represent an LLM by {δ_1, . . . , δ_k}, where the δ_i's are generators of the LLM. For notational convenience, we use the indices of the X variables to represent the LLM. Fienberg and Kim (1999) found (Lemma 1 thereof) that if δ appears in at least one of the CLLMs of a given one-node tree, the original log-linear model, say M, must contain δ or the union of δ and the index, say c, of the conditional variable. In particular, it was found that if δ is in all of the CLLMs, then either δ or δ ∪ {c} is a generator of M, and that otherwise, δ ∪ {c} must be a generator of M provided that δ is maximal in the union of the CLLMs. A key point in this is that parts of the model structure of M are found in the CLLMs.

Conditional models are often used for model representation in artificial intelligence. The Bayesian network (Pearl, 1988) is one of the most popular graphical models in artificial intelligence, depicted via directed acyclic graphs. D'Ambrosio (1995), Geiger and Heckerman (1996), and Mahoney and Laskey (1999) consider the problem of representing Bayesian networks using conditional probability models and conditional variables, which are capable of explicitly capturing many of the lower-level structural details. They use the tree form of model representation with the CLLM replaced by a conditional probability distribution. Chickering et al. (1997) and Friedman and Goldszmidt (1998) employ this detailed model representation in learning Bayesian networks. Conditional models are embedded in a whole model as local descriptions of it, making a detailed model representation possible. Our purpose for using the CLLM is different from theirs in that we aim to use the CLLM for searching model structures rather than for model representation.

The main goal of this paper is to explore further the relationship between an LLM and its CLLMs and to propose an easier method of log-linear modelling which makes use of the pieces of information on model structure that are embedded in CLLMs.

This paper consists of 7 sections. In Section 2, we describe the relationship between an LLM and its CLLM in the context of a log-linear expression for the joint probability. A graphical interpretation of the relationship is presented when the LLM is graphical. The relationship is further explored in Section 3, where a single conditional variable is considered; multiple conditional variables are considered in Section 4. In this sense, Section 4 is an extension of Section 3. However, we included Section 3 for smoothness of exposition. The results of Section 4 are carried over, in Section 5, into a procedure for log-linear modelling by using CLLMs. This procedure is then applied to real data in Section 6, and we close the paper in Section 7 with some concluding remarks.


2. CLLM and graphic representation

In this section, we will summarize the results of Fienberg and Kim (1999) through a couple of examples and use a graphical representation for the model structure when the LLM is graphical, in order to help the reader get better insight into the relationship between an LLM and its conditional model. In the examples, we compare the functional forms of an LLM and its CLLM. Some terms in the log-linear expression of a CLLM are from the LLM, and the other terms are each obtained as a sum of terms in the LLM. An important point here is that we cannot tell in a CLLM expression which term is a term of the LLM and which term is a sum of terms of the LLM. We will elaborate on this below.

We will use V to denote the index set of all the variables involved in a model and |A| to denote the cardinality of A. For a set A, we let X_A = (X_α)_{α∈A}, and, for α ∈ V, we put X_{(α)} = (X_i)_{i≠α}. The set of all possible values of X_α will be denoted by I_α, and, for A ⊂ V, I_A = ∏_{α∈A} I_α. We will write X instead of X_V when confusion is not likely. We will simply write p_A(x_A) and p_{A|B}(x_A; x_B) for P(X_A = x_A) and P(X_A = x_A | X_B = x_B), respectively.

Example 1. Consider an LLM, M_1, for 5 variables, X_1, . . . , X_5, i.e., V = {1, 2, 3, 4, 5}, given by

M_1 = {{1, 2, 3}, {1, 3, 4}, {3, 4, 5}}   (1)

and denote the three generators in (1), respectively, by δ_1, δ_2, δ_3. We can also express M_1 as

log p_V(x) = u_{δ_1}(x) + u_{δ_2}(x) + u_{δ_3}(x) + R(x),

where u_δ(x) depends on x through x_δ only and R(x) includes a constant u-term plus the summation of the u_γ-terms, each of whose subscript set is a strict subset of δ_i for some i ∈ {1, 2, 3}.

Let

w_δ^{1(x_1)}(x_{(1)}) = u_δ(x) + u_{δ∪{1}}(x),

w^{1(x_1)} = u + u_1(x) − log p_1(x_1).

We can see that w_δ^{1(x_1)}(x_{(1)}) depends on x through x_{δ∪{1}} only.

The logarithm of the conditional probability of X_{(1)} = x_{(1)} given X_1 = x_1 is given by

log p_{(1)|1}(x_{(1)}; x_1) = w^{1(x_1)} + w_2^{1(x_1)}(x_{(1)}) + w_3^{1(x_1)}(x_{(1)}) + w_4^{1(x_1)}(x_{(1)}) + u_5(x) + w_{23}^{1(x_1)}(x_{(1)}) + w_{34}^{1(x_1)}(x_{(1)}) + u_{35}(x) + u_{45}(x) + u_{345}(x),   (2)

where w_{ij}^{1(x_1)}(x_{(1)}) = w_{{i,j}}^{1(x_1)}(x_{(1)}), and similarly for u_{ij}.

Note in (2) that, under the hierarchy assumption for the model, the CLLM is determined by the terms w_{23}^{1(x_1)} and u_{345} if neither of them equals zero. This is because the index sets of all the w- and u-terms in (2) are subsets of {2, 3} or {3, 4, 5}. Also note that the index sets, {2, 3} and {3, 4, 5}, of these terms are contained in

{δ_1\{1}, δ_2\{1}, δ_3\{1}} = {{2, 3}, {3, 4}, {3, 4, 5}}.

When u_δ(x) = 0 for all x_δ ∈ I_δ, we will simply write u_δ = 0, and we will write w_δ^{1(x*)} = 0 when w_δ^{1(x*)}(x_{(1)}) = 0 for all x_{{1}∪δ} ∈ I_{{1}∪δ} with x_1 = x*. The lemma below shows how the w- and u-terms are related. Although its proof is given in Fienberg and Kim (1999), it is worthwhile to have it here.

Lemma 1. Let δ ∩ {1} = ∅. Then, u_δ = u_{{1}∪δ} = 0 if and only if w_δ^{1(x_1)} = 0 for all x_1 ∈ I_1.

Proof. Suppose that w_δ^{1(x_1)}(x_{(1)}) = 0 for all x_{{1}∪δ} ∈ I_{{1}∪δ}. Then it follows that

u_{{1}∪δ}(x) = −u_δ(x) for all x ∈ I_V.   (3)

Since ∑_{x_1∈I_1} u_{{1}∪δ}(x) = 0, Eq. (3) implies that u_δ = 0. The proof for the other direction is straightforward. □

According to Lemma 1, it is possible that w_{34}^{1(x_1)} = 0 for some value x_1 of X_1 when u_{134} ≠ 0. If w_{34}^{1(x_1)} in (2) equals 0 for all values of x_1, then by the lemma u_{34} = u_{134} = 0. This means that the corresponding LLM cannot be hierarchical, since u_{345} ≠ 0; similarly, it is also possible that w_3^{1(x_1)} = 0 while w_{34}^{1(x_1)} ≠ 0. To avoid these "non-hierarchical" situations, we shall assume, throughout the remainder of the paper, that the hierarchy principle holds for both the CLLM and the LLM. We refer to this as the strong hierarchy principle (SHP).

Under the SHP, the CLLM depends on whether a particular w-term is zero or not. For instance, the CLLM of the model represented by expression (2) is determined by u_{345} and w_{23}^{1(x_1)} when w_{23}^{1(x_1)} ≠ 0, and by u_{345} and w_2^{1(x_1)} when w_{23}^{1(x_1)} = 0 and w_2^{1(x_1)} ≠ 0. We refer to such situations where w = 0 as zero-w phenomena.

Taking the zero-w phenomena into consideration in (2), we come up with the possible CLLMs for X_2, . . . , X_5 given X_1 as

{{2, 3}, {3, 4, 5}}, {{2}, {3, 4, 5}}, {{3, 4, 5}}.   (4)

By applying Lemma 1 to the list in (4), we can obtain one-node trees of CLLMs for the 5 random variables in Example 1. According to the lemma, the CLLM {{2, 3}, {3, 4, 5}} can appear at every leaf of a one-node tree of CLLMs that corresponds to the LLM M_1, but this is not true for the other CLLMs in (4). For example, if the one-node tree consists of the CLLMs {{3, 4, 5}} and {{2}, {3, 4, 5}} only, then, by Lemma 1, the set {1, 2, 3} cannot be a generator of M_1. Note that {3, 4, 5} appears in all the CLLMs in (4), but this is not guaranteed for a generator of an LLM which contains "1" when the conditional variable is X_1. This point is elaborated in the following example:

Example 2. Consider a submodel M2 of the model M1:

M2 = {{1, 2, 3}, {1, 3, 4}}. (5)


Since "1" is contained in both of the sets in expression (5), the CLLMs conditional on X_1 are each in terms of w-terms only. Thus, as for M_1, considering all the possible zero-w phenomena gives rise to possible CLLMs of the form

{γ_1, γ_2} with γ_1 ⊆ {2, 3}, γ_2 ⊆ {3, 4},

where {2, 3} and {3, 4} must show up, by Lemma 1, at least once in the CLLMs of a one-node tree.

It is worthwhile to note in both of the examples that, for each generator δ of a given LLM, δ\{1} shows up in at least one of the CLLMs unless δ\{1} is a subset of some other generator of the LLM. We denote by M_1(x_1) the CLLM of X_{(1)} given X_1 = x_1 when the LLM of X_V is M. Then, for every set δ that is maximal in ⋃_{x∈I_1} M_1(x), either δ or δ ∪ {1} must show up in M.

We can represent the relationship between an LLM and its CLLM in general terms. Consider an LLM, M, of X_V given by

M = {δ_1, δ_2, . . . , δ_k}.   (6)

For a collection A of sets, we define an operator ⟨·⟩ as

⟨A⟩ = {A ∈ A; A is maximal in A},

where "maximal" is in the sense of set-inclusion. Suppose that we condition the LLM on X_1. There may be some sets δ_j in M that are disjoint with {1}. Let them be δ_j, j = r + 1, . . . , k, and form a set Δ_3. Among the δ_j's, j = 1, . . . , r, there may be δ_j such that (δ_j\{1}) ⊆ δ′ for some δ′ ∈ Δ_3. Let them be δ_j, j = 1, . . . , d, and form a set Δ_1. We denote the set of the remaining δ's, δ_{d+1}, . . . , δ_r, by Δ_2.

For δ′ ∈ Δ_2, δ′\{1} is not a subset of any δ ∈ Δ_3. Thus, under the SHP, for every δ ∈ Δ_2, δ\{1} appears in at least one of the |I_1| CLLMs by Lemma 1. From this, it follows that

⟨⋃_{x∈I_1} M_1(x)⟩ = {δ\{1}; δ ∈ Δ_2 ∪ Δ_3},   (7)

which we will denote by Λ_{{1}}(M), where the subscript {1} means that the conditional variable is X_1. Eq. (7) is valid for all hierarchical LLMs since Lemma 1 holds for every hierarchical LLM.
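As an illustration (ours, not part of the original paper; the function names maximal and lam are our own), the operator ⟨·⟩ and Λ_C(M) can be computed directly from a generating class. The following sketch reproduces Λ_{{1}}(M_1) = {{2, 3}, {3, 4, 5}} for Example 1:

```python
def maximal(collection):
    """The operator <.>: keep only the sets that are maximal under inclusion."""
    sets = {frozenset(s) for s in collection}
    return {s for s in sets if not any(s < t for t in sets)}

def lam(M, C):
    """Lambda_C(M): the maximal sets among {delta - C : delta in M}; cf. (7)."""
    return maximal(frozenset(d) - set(C) for d in M)

# Example 1: conditioning M1 on X1 yields the largest possible CLLM in (4).
M1 = [{1, 2, 3}, {1, 3, 4}, {3, 4, 5}]
print(lam(M1, {1}))   # {frozenset({2, 3}), frozenset({3, 4, 5})}
```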

Although our interest lies in hierarchical LLMs, we will make use of an attractive feature of a graphical LLM (Darroch et al., 1980; Fienberg, 1980) to visualize the relation between an LLM and its CLLM.

An LLM is called a graphical LLM if its model structure is representable via an undirected graph of vertices and edges. The undirected graph is also called an independence graph (Whittaker, 1990) and a Markov network (Pearl, 1988).

We denote a graph by G = (V, E), with V and E the set of the vertices and the set of the edges in G, respectively. Since we consider only undirected graphs, (α, β) ∈ E implies (β, α) ∈ E. So if an edge (α, β) is in G, we will simply write either (α, β) ∈ E


Fig. 1. Examples of undirected graphs: (a) the graph of the LLM {{1, 2}, {1, 3}, {2, 4}, {3, 4}}; (b) the graph of the LLM {{1, 2, 3}, {2, 3, 4, 5}, {5, 6}, {4, 5, 7}, {4, 5, 8}}.

or (β, α) ∈ E. An induced subgraph of G is defined as

G_A = (A, E_A),

where E_A = E ∩ (A × A). A path of length n from α to β is a sequence of n + 1 distinct vertices α = α_0, α_1, . . . , α_n = β such that (α_{i−1}, α_i) ∈ E for all i = 1, 2, . . . , n.

A clique in G is a maximal complete subgraph of G, where maximal is in the sense of set-inclusion. If the LLM in (6) is graphical with graph G, each δ_i, i = 1, . . . , k, corresponds to a clique in G. For example, the graph in panel (a) of Fig. 1 is the graph of the LLM {{1, 2}, {1, 3}, {2, 4}, {3, 4}}, and that in panel (b) is the graph of the LLM {{1, 2, 3}, {2, 3, 4, 5}, {5, 6}, {4, 5, 7}, {4, 5, 8}}. Properties of the graphical LLM are described in detail in Darroch et al. (1980).

We will refer to the graphical LLM simply as the graphical model. As is well known, the conditional independence relationships among the variables involved in a graphical model are represented in the graph in such a way that X and Y are conditionally independent given Z when every path from X to Y in the corresponding graph of the model passes through Z (Whittaker, 1990, The Separation Theorem).

In panel (a) of Fig. 1, the LLM corresponding to the induced subgraph G_{{2,3,4}} is expressed as

{{2, 4}, {3, 4}};   (8)

similarly, for panel (b), the LLM corresponding to the induced subgraph G_{V\{5}} is expressed as

{{1, 2, 3}, {2, 3, 4}, {4, 7}, {4, 8}, {6}}.   (9)

If we denote the graphical models with the corresponding graphs in panels (a) and (b) of Fig. 1 by M_a and M_b, respectively, we can see from (7) that Λ_{{1}}(M_a) and Λ_{{5}}(M_b) are the same as those in (8) and (9), respectively. We will summarize this in Theorem 1. If a graphical model, M, has its graph G, we may regard M as a set-expression of G. In this context, we will write, for α ∈ V, Λ_α(G) in the same sense as Λ_α(M) and let Λ(G) = M for notational convenience.

Theorem 1. Suppose that the LLM of X is graphical with the corresponding graph G. Then, under the SHP,

Λ_{{1}}(G) = Λ(G_{V\{1}}).

Proof. Follows immediately from the definitions of Λ_{{1}}(G) and Λ(G_{V\{1}}). □


3. CLLMs with single conditional variables

Consider a graph G = (V, E) and let A ⊆ V. In an undirected graph, if (α, β) ∈ E, we say that the vertices α and β are adjacent in G and write α ∼ β.

When α ≁ β in G, it is obvious, by the definition of an induced subgraph, that

Λ(G_{V\{α}}) ∪ Λ(G_{V\{β}}) = Λ(G).

Thus, if there is at least one pair of non-adjacent nodes in G, it is always true that

⋃_{α∈V} Λ(G_{V\{α}}) = Λ(G).   (10)

We will prove this under a general setting in Theorem 2. For two collections of sets A and B, we will write A ⪯ B if for every set A ∈ A there exists a set B ∈ B such that A ⊆ B, and write A ≺ B if A ⪯ B and there exists at least one set in A which is a proper subset of a set in B.
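In code, these order relations on collections of sets read as follows (a small sketch of ours, with our own function names):

```python
def precedes(A, B):
    """A <= B in the above ordering: every set in A lies in some set in B."""
    return all(any(set(a) <= set(b) for b in B) for a in A)

def strictly_precedes(A, B):
    """A < B: A precedes B and some set in A is a proper subset of a set in B."""
    return precedes(A, B) and any(set(a) < set(b) for a in A for b in B)
```

For instance, precedes([{2, 3}], [{1, 2, 3}]) is True, while precedes([{1, 4}], [{1, 2, 3}, {2, 4}]) is False.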

Theorem 2. Consider an LLM, M, in (6). Then, for α, β ∈ V, the following holds:

(i) If {α, β} ⊆ δ′ for some δ′ in M, then ⟨Λ_{{α}}(M) ∪ Λ_{{β}}(M)⟩ ≺ M.
(ii) Otherwise, ⟨Λ_{{α}}(M) ∪ Λ_{{β}}(M)⟩ = M.

Proof. In case (i), it is obvious that δ′ does not show up in Λ_{{α}}(M) ∪ Λ_{{β}}(M), since it is not included in either of Λ_{{α}}(M) and Λ_{{β}}(M).

In case (ii), we may suppose without loss of generality that α is included only in the sets in S_α = {δ_1, . . . , δ_a} and β only in the sets in S_β = {δ_{a+1}, . . . , δ_{a+b}} for some positive integers a and b with a + b ≤ k. Note in (7) that

Δ_3 = {δ\{1}; δ ∈ Δ_3} ⪯ ⟨⋃_{x∈I_1} M_1(x)⟩.

For the same reason, we have that

M\S_α = {δ\{α}; δ ∈ M\S_α} ⪯ Λ_{{α}}(M)

and

M\S_β = {δ\{β}; δ ∈ M\S_β} ⪯ Λ_{{β}}(M).

Since the sets in S_α are all distinct from the sets in S_β,

(M\S_α) ∪ (M\S_β) = M.

Therefore,

M ⪯ Λ_{{α}}(M) ∪ Λ_{{β}}(M).

Expression (7) implies that

Λ_{{α}}(M) ⪯ M for every α ∈ V.

This completes the proof. □

Page 8: Conditional log-linear structures for log-linear modelling

S.-H. Kim / Computational Statistics & Data Analysis 50 (2006) 2044–2064 2051

Note that, by Theorem 1, Eq. (10) is immediate from result (ii) of Theorem 2. From result (i) of Theorem 2, we can see that (10) does not hold if a graph is complete. For instance, if G is a complete graph on the nodes 1, 2, and 3, then

⋃_{α∈{1,2,3}} Λ(G_{V\{α}}) = {{1, 2}, {2, 3}, {1, 3}},   (11)

while Λ(G) = {{1, 2, 3}}. The LLM on the right-hand side of (11) is a good example of a hierarchical LLM that is not graphical (Darroch et al., 1980).
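A quick computational check of this example (our own illustration, not part of the paper):

```python
# Theorem 2(i) at work: for the complete graph on {1, 2, 3}, i.e. M = {{1,2,3}},
# removing any single vertex leaves a single edge, and the union over all
# vertices is the non-graphical model in (11) rather than {{1, 2, 3}}.
def lam(M, C):
    sets = {frozenset(d) - set(C) for d in M}
    return {s for s in sets if not any(s < t for t in sets)}

M = [{1, 2, 3}]
union = set()
for v in (1, 2, 3):
    union |= lam(M, {v})
print(union)   # {frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})}
```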

4. CLLMs with multiple conditional variables

So far, we have considered CLLMs with only one conditional variable. We can now extend the argument to situations with multiple conditional variables. For C ⊂ V, suppose that X_C is the vector of the conditional variables. We define, for X_C = x_C,

w_δ^{C(x_C)}(x_{(C)}) = ∑_{γ⊆C} u_{γ∪δ}(x).

Then, we can have an extended version of Lemma 1.

Theorem 3. Let δ ∩ C = ∅ for a set C ⊂ V. Then, under the SHP,

u_{γ∪δ} = 0 for all γ ⊆ C

if and only if

w_δ^{C(x_C)} = 0 for all x_C ∈ I_C.

Proof. See Appendix A. �

For a collection F of sets, we define

F\C = {δ\C; δ ∈ F}.

Consider the LLM in (6) and let three disjoint sets Δ_1, Δ_2, Δ_3 satisfy M = Δ_1 ∪ Δ_2 ∪ Δ_3. For C ⊂ V, we assume that

δ ∩ C ≠ ∅ for δ ∈ Δ_1 ∪ Δ_2   (12)

and

δ ∩ C = ∅ for δ ∈ Δ_3.

We also assume that, for δ ∈ Δ_1,

(δ\C) ⊂ δ′ for some δ′ ∈ Δ_3,

but none of the sets in Δ_2\C is a subset of any set in M\Δ_2. As specified in the paragraph between expressions (6) and (7), we let

Δ_2 = {δ_j; j = d + 1, . . . , r ≤ k}.   (13)

Then, by extending the zero-w phenomenon to w^{C(x_C)}, we can see that the CLLM of X_{V\C} given X_C is expressed in the form of

⟨{γ_{d+1}, γ_{d+2}, . . . , γ_r} ∪ Δ_3⟩ for γ_j ⊆ δ_j\C, j = d + 1, . . . , r.   (14)

Fig. 2. Some induced subgraphs of the graph G in panel (b) of Fig. 1. The graphs in panels (a)–(e) are G_{V\{2,4}}, G_{V\{1,2}}, G_{V\{2}}, G_{V\{3,4}}, G_{V\{4}}, i.e., C = {2, 4}, {1, 2}, {2}, {3, 4}, {4}, respectively.

We denote by M_C(x) the CLLM of X_{(C)} given X_C = x when M is the LLM of X. By Theorem 3, the maximal sets in ⋃_{x∈I_C} M_C(x) are given by

⟨⋃_{x∈I_C} M_C(x)⟩ = {δ\C; δ ∈ Δ_2 ∪ Δ_3},   (15)

which we denote by Λ_C(M). This equation is an extended version of (7) for the situation of multiple conditional variables. From (15) it follows that

Λ_C(M) ⪯ M.   (16)

Theorem 4. For two subsets, C_1 and C_2, of V,

⟨Λ_{C_1}(M) ∪ Λ_{C_2}(M)⟩ ⪯ Λ_{C_1∩C_2}(M).

Proof. For C′ ⊆ C ⊂ V, expression (15) implies that

Λ_C(M) ⪯ Λ_{C′}(M).

Therefore, the desired result follows. □

Fig. 2 is an example of the inequality of Theorem 4. The sets {2, 4} and {1, 2} share variables with different generators, {1, 2, 3} and {2, 3, 4, 5}, of the model M in panel (b) of Fig. 1, respectively, and ⟨Λ_{{2,4}}(M) ∪ Λ_{{1,2}}(M)⟩ = Λ_{{2}}(M). On the other hand, the sets {2, 4} and {3, 4} share variables with the generator {2, 3, 4, 5} of the model M, and ⟨Λ_{{2,4}}(M) ∪ Λ_{{3,4}}(M)⟩ ≺ Λ_{{4}}(M). The property of Λ_C(M) stated in Theorem 4 is useful in practice, as will be described in Section 5.

Fig. 3. Some induced subgraphs of the graph G in panel (b) of Fig. 1. The graphs in panels (a)–(c) are G_{V\{6}}, G_{V\{2,4}}, G_{V\{3}}, i.e., C = {6}, {2, 4}, {3}, respectively.

The theorem below is an extension of Theorem 1 to the situation of multiple conditional variables.

Theorem 5. Suppose that the LLM, M, of X is graphical with the corresponding graph G. Then, for C ⊂ V,

Λ_C(M) = Λ(G_{V\C}).

Proof. For the disjoint sets Δ_1, Δ_2, Δ_3 as defined in the statements from (12) through (13), we have

Λ(G_{V\C}) = Δ_3 ∪ (Δ_2\C)

since δ ∩ C = ∅ for δ ∈ Δ_3, Δ_1\C ⪯ Δ_3, and none of the sets in Δ_2\C is a subset of any set in Δ_1 ∪ Δ_3. So, by (15), the desired result holds. □
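For a graphical LLM, the theorem can be verified computationally by comparing Λ_C(M) with the maximal cliques of the induced subgraph. A sketch of ours (not from the paper), using the networkx library and the model M_b of Fig. 1(b):

```python
import networkx as nx

# Graph of Fig. 1(b) and its generating class M_b.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3),                    # clique {1,2,3}
                  (2, 4), (2, 5), (3, 4), (3, 5), (4, 5),    # clique {2,3,4,5}
                  (5, 6), (4, 7), (5, 7), (4, 8), (5, 8)])   # {5,6},{4,5,7},{4,5,8}
M_b = [{1, 2, 3}, {2, 3, 4, 5}, {5, 6}, {4, 5, 7}, {4, 5, 8}]

def lam(M, C):
    sets = {frozenset(d) - set(C) for d in M}
    return {s for s in sets if not any(s < t for t in sets)}

C = {2, 4}
cliques = {frozenset(c) for c in nx.find_cliques(G.subgraph(set(G) - C))}
print(lam(M_b, C) == cliques)   # True: Lambda_C(M) = Lambda(G_{V\C})
```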

According to this theorem, we can see that, if C ∩ δ ≠ ∅ for only one generator δ of a graphical LLM, M, then Λ(G_{V\C}) = ⟨(M\{δ}) ∪ {δ\C}⟩. As more generators intersect with C, the corresponding graph G_{V\C} represents a smaller part of G_V. For example, for the graphical LLM with the corresponding graph, G, in panel (b) of Fig. 1, the graph G_{V\{6}} in panel (a) of Fig. 3 represents most of G_V, and the graph G_{V\{2,4}} in panel (b) represents only a small part of G_V. It is interesting to note that Λ(G_{V\{6}}) ∪ Λ(G_{V\{3}}) = Λ(G) while Λ(G_{V\{2,4}}) ∪ Λ(G_{V\{3}}) ≺ Λ(G). A general result concerning G and its induced subgraphs is given below.

Theorem 6. Consider an LLM, M, as in (6). Let C_1 and C_2 be non-empty subsets of V. Then, under the SHP, the following holds:

(i) Generator sharing: If C_1 ∩ δ′ ≠ ∅ and C_2 ∩ δ′ ≠ ∅ for some δ′ ∈ M, then ⟨Λ_{C_1}(M) ∪ Λ_{C_2}(M)⟩ ≺ M.
(ii) No sharing of generators: Otherwise, Λ_{C_1}(M) ∪ Λ_{C_2}(M) = M.

Proof. See Appendix A. �


For subsets C_1, C_2, . . . , C_t of V, we have, by Theorem 6, that

⋃_{i=1}^{t} Λ_{C_i}(M) = M

if there exists a pair of disjoint sets, C_i and C_j, that do not share any generator of M. However, in reality we do not know the true model M, so we cannot tell whether a pair of disjoint sets shares a generator of M. There is, however, an effective way of selecting subsets of the variables of V, as connoted in the theorem below.

Theorem 7. Let M be a hierarchical LLM of X_V. For three disjoint subsets, C_1, C_2, and C_3, of V, if {u, v} ⊆ δ for some δ ∈ M, then there always exists a set, γ, in ⟨⋃_{i=1}^{3} Λ_{C_i}(M)⟩ such that {u, v} ⊆ γ.

Proof. Suppose, on the contrary, that {{u, v}} ⪯ ⟨⋃_{i=1}^{3} Λ_{C_i}(M)⟩ does not hold. Then, by (15), {u, v} ∩ C_i ≠ ∅ for all i = 1, 2, 3. Since {u, v} contains only two elements, two of the C_i's must then share an element, which contradicts the condition that the three sets are disjoint. Therefore, it must hold that {{u, v}} ⪯ ⟨⋃_{i=1}^{3} Λ_{C_i}(M)⟩, as is desired in the theorem. □

Since ⟨⋃_{i=1}^{3} Λ_{C_i}(M)⟩ ⪯ M by Theorem 6, it follows, for three disjoint subsets, C_1, C_2, C_3, of V, that

{{u, v}} ⪯ M if and only if {{u, v}} ⪯ ⟨⋃_{i=1}^{3} Λ_{C_i}(M)⟩.   (17)

Theorem 7 and result (17) recommend the use of at least three disjoint subsets of V as conditioning sets for conditional log-linear modelling. This point will be illustrated in the following section. Result (17) becomes more apparent when M is graphical.

Corollary 1. Let the LLM M be graphical with the corresponding graph G. Then, for three disjoint subsets, C_1, C_2, C_3, of V, every edge in G is found in at least one of G_{V\C_i}, i = 1, 2, 3, and vice versa.

Proof. Follows immediately from Theorem 5 and result (17). □
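The corollary can be checked mechanically. In the sketch below (ours, not from the paper), an edge survives in G_{V\C_i} exactly when neither endpoint belongs to C_i; the edge list is that of Fig. 1(b), and the disjoint sets C_1–C_3 are an arbitrary choice of ours:

```python
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5),
         (4, 5), (5, 6), (4, 7), (5, 7), (4, 8), (5, 8)]   # Fig. 1(b)

def covered(edge, conditioning_sets):
    # An edge (u, v) is kept by G_{V\C} iff neither u nor v lies in C.
    u, v = edge
    return any(u not in C and v not in C for C in conditioning_sets)

C1, C2, C3 = {1, 2}, {3, 4}, {5, 6}
print(all(covered(e, [C1, C2, C3]) for e in edges))   # True
```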

5. Procedure for log-linear modelling by using CLLMs

In applying the main result of the preceding section to log-linear modelling for a set V of categorical random variables, we need to (i) select sets of conditional variables, (ii) find CLLMs conditional on the values of the sets of conditional variables, (iii) construct a log-linear model for V, and then (iv) use this model as an initial model in search of a model that is appropriate to the given data. This procedure will be described under a general setting in this section.
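In outline, steps (ii)–(iii) amount to pooling the conditional structures and keeping the maximal generators. The sketch below (ours; the routine initial_model and the toy inputs are hypothetical) makes this explicit:

```python
def maximal(sets):
    sets = {frozenset(s) for s in sets}
    return {s for s in sets if not any(s < t for t in sets)}

def initial_model(cllm_structures):
    """M0 = < union of the Lambda^0_{Ci}(M) >, the initial LLM of step (iii)."""
    pooled = [g for structure in cllm_structures for g in structure]
    return maximal(pooled)

# e.g. three conditional structures obtained in step (ii):
L1 = [{2, 3}, {3, 4, 5}]
L2 = [{2}, {3, 4, 5}]
L3 = [{1, 2, 3}]
print(initial_model([L1, L2, L3]))   # {frozenset({1, 2, 3}), frozenset({3, 4, 5})}
```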


Theorem 6 suggests that we select sets of conditional variables so that there is at least one pair of disjoint sets, say C_1 and C_2, which share variables with different generators of the log-linear model M of V. A conditional set may share variables with multiple generators of M. Suppose that C_1 shares variables with the generators δ_1, . . . , δ_a and that C_2 shares variables with the generators δ_{a+1}, . . . , δ_{a+b}. Then, by Theorem 6(ii), we have Λ_{C_1}(M) ∪ Λ_{C_2}(M) = M as long as (⋃_{i=1}^{a} δ_i) ∩ (⋃_{i=a+1}^{a+b} δ_i) = ∅. Furthermore, we can see, according to result (15), that

M_C(x) ⪯ Λ_C(M) for every x ∈ I_C.

It is important to note that if a generator δ of M shares variables with C, then δ is not found in Λ_C(M). Thus, it is desirable that the conditional set C shares variables with as few generators of M as possible. Let variable X_1 be contained in these generators, and let Δ_1 = {γ; γ ⊆ δ for some δ ∈ M such that 1 ∈ δ} and M̃ = {γ; γ ⊆ δ for some δ ∈ M}. Then we can express the logarithm of the joint probability of X_V as

log p_V(x) = ∑_{γ∈Δ_1} u_γ(x) + ∑_{γ∈(M̃\Δ_1)} u_γ(x).

Let Δ_2 = M̃\Δ_1, A = ⋃_{γ∈Δ_1} γ, B = ⋃_{γ∈Δ_2} γ, and J = A ∩ B. By the definition of conditional independence (Dawid, 1979, Section 3), we can see that X_A is conditionally independent of X_B given X_J = x_J for all x_J ∈ I_J. We will simplify this statement of conditional independence to

A ⊥⊥ B | J.   (18)

From this conditional independence, it follows, by Lemma 4.2 of Dawid (1979), that {1} ⊥⊥ B | J. This last statement of conditional independence means that X_B is no more informative about X_1 than X_J is. Thus, we can see, from (18), that X_{A\{1}} forms a set of predictor variables in a regression model of X_1. Since X_1 is categorical, a logistic regression (McCullagh and Nelder, 1989) or a classification tree method (Breiman et al., 1984) is appropriate for the regression analysis, the former being a parametric method and the latter a non-parametric one. The classification tree method is useful when |V| is large, since predictor variables are selected one after another in the order of the amount of information that a predictor variable has about the predicted variable, X_1, conditional on the outcomes of the variables already selected. On the other hand, we need to deal with the whole set of variables, X_V, at once for the logistic regression analysis, which is not as easy as the classification tree method when |V| is large. This is a main reason why we use the classification tree method for the selection of a set of random variables which shares as few generators as possible.

The conditional sets must be selected in a sequential manner. Suppose that we have contingency table data from a log-linear model M and that the computer we use can handle as many as t variables in a time-efficient way. We denote the initial set by C_1 and by Λ^0_{C_1}(M) the collection of the CLLMs of X_{V\C_1} which is obtained based on the data conditional on X_{C_1}. Next, we select another set, say C_2, which is disjoint with C_1, so that C_2 shares variables with as few sets in Λ^0_{C_1}(M) as possible and satisfies the condition |C_2| ≥ |V| − t. We denote the collection of the resulting CLLMs by Λ^0_{C_2}(M).


According to Theorem 6(ii), if there is no generator of M that shares variables with both C_1 and C_2, then ⟨Λ^0_{C_1}(M) ∪ Λ^0_{C_2}(M)⟩ would itself be an appropriate model for the data. But we cannot tell, based on the data, whether C_1 and C_2 share a generator of M. Theorem 7 and result (17) suggest that we select a third set, say C_3, which is disjoint with C_1 ∪ C_2 and satisfies |C_3| ≥ |V| − t, and obtain Λ^0_{C_3}(M). If M is graphical, then, for all the edges (u, v) of the graph of the model which is appropriate to the data, we have, by Corollary 1, that {{u, v}} ⪯ ⋃_{i=1}^{3} Λ^0_{C_i}(M).

Let M_0 = ⟨⋃_{i=1}^{3} Λ^0_{C_i}(M)⟩. We check whether the model M_0 is appropriate to the data. If M_0 does not fit well, then we try a model which is obtained by taking unions of some generators of M_0 in the spirit of (17). For example, if M_0 = {{1, 2, 3}, {2, 4}, {3, 4}} does not fit well, then M_1 = {{1, 2, 3}, {2, 3, 4}} will do. Note that M_2 = {{1, 2, 3, 4}} is the only supermodel of M_0 other than M_1, and that M_0 and M_2 violate condition (17), since {{1, 4}} ≺ M_2 but this does not hold for M_0.

As we get more Λ^0_{C_i}(M)'s for disjoint C_i's, we have more information about M. However, three such Λ^0_{C_i}(M)'s will do, as is demonstrated in the following section. We may have different M_{C_i}(x)'s for different x's in I_{C_i}. But this does not affect our modelling procedure, since each Λ^0_{C_i}(M) in M_0 = ⟨⋃_{i=1}^{3} Λ^0_{C_i}(M)⟩ is obtained through expression (15), and Λ^0_{C_i}(M) is obtained by joining the M_{C_i}(x)'s for x ∈ I_{C_i}.

6. Application to real data

We now apply our results in the context of a log-linear modelling process, using data on n = 28,270 students collected as part of a lecture evaluation process in 2000 at a university in South Korea. The evaluation form had 23 questions, and we have extracted 8 of them, listed in Table 1, for analysis. Question D has three options (easier, middle, and harder), and each of the other 7 questions has three options (negative, half-and-half, and positive), so the random variables are all ternary, taking on the values 1 (for negative), 2 (for half-and-half), or 3 (for positive). The frequency table of the data has 3^8 = 6561 cells. Because of the size of the table, the data set is not included in the paper but is given as a file named "lec00.sel8.dat" at the web site http://amath.kaist.ac.kr/~slki/research/data.

The 8 variables all look related to each other, so it is not easy to guess an initial model structure to begin with. Although only eight variables are involved, it takes too long to use a backward deletion method starting from the full model, which has 6560 parameters to estimate. However, if we use the information from CLLMs of a subset of the eight variables, we can easily guess a possible model structure for the whole data set.

The smaller the model, the easier the modelling. Log-linear modelling with 5 variables is feasible and takes much less time than dealing with 6 variables. So we tried to find a set of 3 conditional variables which, as Theorems 5 and 6 suggest, should be closely related to each other. More closely related variables are likely to be contained in a generator of a model M, and this allows Λ_C(M), by expression (15), to contain a larger number of generators of M.


Table 1
The labels of the 8 questions and their meanings

Variable   Key words            Question contents
A          Class atmosphere     Was the lecture given in an interactive atmosphere?
T          Thought provoking    Was the lecture thought-provoking?
O          Organization         Was the lecture well organized throughout the course?
E          Explanation          Was the lecture given with an explanation good enough for a clear understanding?
D          Difficulty           Were the test and the homework problems at appropriate levels?
F          Feedback             Did you get satisfactory feedback from the comments on your homework?
H          Homework             Were the homework assignments helpful in understanding the lecture and related subjects?
R          Recommendation       Would you recommend this course to your friends?

The random variables are listed according to their contents: the first four are about lectures, the next three concern homework, and the last may be regarded as an overall level of evaluation.

If we have sets of conditional variables that are closely related to each other within each set, and if there are at least two such sets that are disjoint, then, by Theorem 6, we have a better chance of finding an appropriate model for the full data set.

We used CART (Breiman et al., 1984) to build a regression tree for the variable R (student recommendation) using the data, and then we extracted from it the first two variables in that tree. These turned out to be {E, O}. Thus, we used {E, O, R} as the first conditional set in our model-building process.

The CLLM of the remaining 5 variables, A, T, D, F, and H, that is appropriate conditional on C_1 = {E, O, R} is [AFHT][DFHT], which is a simpler expression of the CLLM {{A, F, H, T}, {D, F, H, T}}. The goodness-of-fit levels are listed in panel (a) of Table 2. In this CLLM, C_2 = {F, H, T} is contained in both of the generators of the model. So we selected the variables in this set as conditional, and the model [AEOR][DEOR] was selected as reasonably good. The corresponding goodness-of-fit levels are listed in panel (b) of Table 2. Out of all the possible levels of the conditional variables, only those levels are used where the corresponding sample sizes (n) are near to or greater than 5 times the cell count (i.e., 3^5 = 243) of each conditional contingency table. This applies to all the panels of the table.
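To indicate how step (ii) might look in practice, the following sketch (ours, not the author's code; the column order of the data file and the use of a Poisson GLM via statsmodels are our assumptions) slices the table by C_1 = {E, O, R} and fits the candidate CLLM [AFHT][DFHT] to each sufficiently large slice:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# We assume "lec00.sel8.dat" holds one row per student with the eight ternary
# responses; the column order below is our assumption.  Each retained slice
# (n near or above 5 x 3^5 = 1215) is tabulated over the remaining five
# variables, and [AFHT][DFHT] is fitted as a Poisson log-linear model.
df = pd.read_csv("lec00.sel8.dat", sep=r"\s+", names=list("ATOEDFHR"))
rest = ["A", "T", "D", "F", "H"]
full_grid = pd.MultiIndex.from_product([[1, 2, 3]] * 5, names=rest)

for key, part in df.groupby(["E", "O", "R"]):
    if len(part) < 1215:        # skip slices with too small a sample size
        continue
    counts = (part.groupby(rest).size()
                  .reindex(full_grid, fill_value=0)   # keep the zero cells
                  .rename("n").reset_index())
    fit = smf.glm("n ~ C(A)*C(F)*C(H)*C(T) + C(D)*C(F)*C(H)*C(T)",
                  data=counts, family=sm.families.Poisson()).fit()
    print(key, round(fit.deviance, 2), fit.df_resid)
```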

Note that the two conditional sets, C_1 and C_2, are disjoint, but they could share generators as in result (i) of Theorem 6. As recommended in Theorem 7, the third conditional set needs to be disjoint with C_1 ∪ C_2, i.e., to be a subset of {A, D}. Log-linear modelling with 6 variables seemed to take more than three times the time that it took for modelling with 5 variables. Recall that the variables are all ternary. So we selected two conditional sets, C_3 = {A, D, E} and C_4 = {A, D, F}, chosen so that C_3 ∩ C_4 = {A, D}. The analysis results are listed in panels (c) and (d) of Table 2. The CLLMs [FHR][FHT][HORT] and [EHT][EOR][HORT] look appropriate for the data corresponding, respectively, to the conditional sets C_3 and C_4.


Table 2
The goodness-of-fit levels of the four CLLMs corresponding to four sets of conditional variables, C_1, . . . , C_4

(a) [AFHT][DFHT] conditional on C_1 = {E, O, R}
Cell value of C_1      n      p-value
222                 2569      0.09
223                 1696      0.72
233                 1669      0.31
323                 1837      0.40
332                 1934      0.66
333                11760      0.08

(b) [AEOR][DEOR] conditional on C_2 = {F, H, T}
Cell value of C_2      n      p-value
122                 1181      0.85
222                 2357      0.024
322                 1145      0.78
132                 1056      0.43
232                 1751      0.37
332                 2017      0.19
223                 1370      0.99
323                 1027      0.64
133                 1558      0.07
233                 3327      0.10
333                 6519      0.54

(c) [FHR][FHT][HORT] conditional on C_3 = {A, D, E}
Cell value of C_3      n      p-value
222                 2482      0.01*
223                 2583      0.09
232                 1426      0.32
322                 1635      0.95
323                 7180      0.05
332                 1291      0.99
333                 4689      0.48

(d) [EHT][EOR][HORT] conditional on C_4 = {A, D, F}
Cell value of C_4      n      p-value
221                 1309      0.95
222                 2502      0.04
223                 1669      0.86
232                 1293      0.87
233                 1206      0.99
321                 1683      0.98
322                 3158      0.93
323                 4201      0.04
331                 1060      0.98
332                 1781      0.19
333                 3434      0.39

*This is the p-value of the model [FHR][FHT][HOR][HOT][RT], which is a submodel of [FHR][FHT][HORT]. The p-values are obtained from the SAS package.


By Theorem 4, we have that

⟨Λ^0_{C_3}(M) ∪ Λ^0_{C_4}(M)⟩ = {{H, O, R, T}, {F, H, R}, {F, H, T}, {E, H, T}, {E, O, R}} ⪯ Λ^0_{{A,D}}(M).   (19)

By Theorem 6 and expression (15), we have that

⟨Λ^0_{C_1}(M) ∪ Λ^0_{C_2}(M) ∪ Λ^0_{{A,D}}(M)⟩ ⪯ M.

So, from (19), it follows that

⟨Λ^0_{C_1}(M) ∪ Λ^0_{C_2}(M) ∪ Λ^0_{C_3}(M) ∪ Λ^0_{C_4}(M)⟩
= {{A, F, H, T}, {D, F, H, T}, {A, E, O, R}, {D, E, O, R}, {H, O, R, T}, {F, H, R}, {E, H, T}} ⪯ M.   (20)

We denote the model in expression (20) by M̂. The p-value of the goodness-of-fit test of M̂ is 0.46, and its ANOVA-like summary from the SAS package is given in Appendix B. We can see in the SAS output that all the interaction effects corresponding to the generators of M̂ are significant except for the generator {D, F, H, T}. But we will keep the term for simplicity of expression of the model.

From M̂, we have

Λ_{C_1}(M̂) = {{A, F, H, T}, {D, F, H, T}},
Λ_{C_2}(M̂) = {{A, E, O, R}, {D, E, O, R}},
Λ_{C_3}(M̂) = {{F, H, R}, {F, H, T}, {H, O, R, T}},
Λ_{C_4}(M̂) = {{E, O, R}, {E, H, T}, {H, O, R, T}}.

The model M̂ is not graphical, and so it cannot be represented via an independence graph. However, if we create symbols for interaction effects and add them to an independence graph, a graphical representation is obtained as in Fig. 4. Boxes are used in the figure to symbolize the existence of interaction effects in a given model. The variables connected to the same box by dotted edges form an interaction term in the log-linear expression.

If we are interested in how R is influenced by the other 7 variables, we can find, from the expression for M̂, a logistic function of R given as a linear sum of the terms that are defined on the sets

{A, E, O}, {D, E, O}, {H, O, T}, {F, H}.

The functional form of the logistic function can easily be read from the graphical representation of the hierarchical LLM as in Fig. 4.


Fig. 4. A graphical representation of M̂. The boxes are symbols of the interaction effects among the variables that are connected by dotted edges to the same box; e1, e2, and e3 indicate that there are interaction effects among the variables in {E, H, T}, {H, O, R, T}, and {F, H, R}, respectively.

7. Concluding remarks

We will call a set A irreducible if ⟨A⟩ = A. When an LLM, M, is graphical, we learned that the induced subgraph, which is obtained by removing the nodes of the conditional variables from the graph of M, can also be obtained as the irreducible set of the generators of all the CLLMs of M. This node-removal from a graph is equivalent to variable-elimination from the structure of a hierarchical LLM. A nice feature of conditionalizing an LLM, M, is that the conditional variables disappear from M while the other variables in M remain untouched. Individual CLLMs may vary across the values of the conditional variables. However, from the collection of the CLLMs, we can obtain a model structure which is an analogue of the induced subgraph.

We have shown that we can obtain the model structure of M if there is no generator of M that shares variables with two conditional sets. Furthermore, it is shown that if we use three disjoint conditional sets, we can find an appropriate model for a given data set by joining generators in the conditional log-linear models under a certain condition (i.e., condition (17)).

When modelling with real data, if it is too time-consuming to find an appropriate model structure for the data, it is relatively easy, as we have seen in the preceding two sections, to find a model structure which is either a model appropriate to the data or one of its submodels. When the model is a submodel of an appropriate model, we can reach a better model by taking unions of generators, as described in the second-to-last paragraph of Section 5.

In attempting to handle all of the variables at once, one runs the risk of beginning with a wrong model, resulting in unreasonably long computation times before a reasonable initial model is found. As we have seen in the real data example, we may not have to look at all of the CLLMs. We may ignore the CLLMs for which the sample sizes are not large enough.


Acknowledgements

The author is grateful to the two referees and a co-editor for their careful reading of the paper and for their insightful comments and suggestions. Professor Stephen E. Fienberg gave constructive comments and suggestions concerning both exposition and substance for the revision of an early draft of this paper. The author's special thanks go to him. This work was supported by Korea Research Foundation Grant (KRF-2001-015-DP0072).

Appendix A: Proofs

Proof of Theorem 3. Necessity is obvious by the definition of w. Suppose that w_δ^{C(x_C)} = 0 for all possible values of x_C and x_δ. So we have

u_{C∪δ}(x) = −∑_{γ⊂C} u_{γ∪δ}(x) for all x ∈ I_V.

By the constraint on the u-terms, summing both sides over x_C ∈ I_C makes every term u_{γ∪δ} with γ ≠ ∅ vanish, leaving

0 = −(∏_{i∈C} |I_i|) u_δ(x) for all x ∈ I_V.

So we have u_δ = 0. From this, it follows that, for every j ∈ C, u_{{j}∪δ} = 0, since summing the same identity over x_{C\{j}} ∈ I_{C\{j}} leaves

0 = −(∏_{i∈C\{j}} |I_i|) u_{{j}∪δ}(x) for all x ∈ I_V.

Suppose that

u_{A∪δ} = 0   (A.1)

for every subset A ⊆ C with |A| ≤ k < |C|. Then

u_{C∪δ}(x) = −∑_{γ⊂C, |γ|>k} u_{γ∪δ}(x) for all x ∈ I_V.

For B ⊂ C with |B| = k + 1, we have

u_{B∪δ}(x) = 0 for all x ∈ I_V,

since summing over x_{C\B} ∈ I_{C\B} leaves

0 = −(∏_{i∈C\B} |I_i|) u_{B∪δ}(x) for all x ∈ I_V.

Therefore, by the induction hypothesis (A.1), we have, for γ ⊆ C,

u_{γ∪δ} = 0. □


Proof of Theorem 6. If the condition of statement (i) holds, δ′\C_1 is a subset of a component set of Λ_{C_1}(M) and δ′\C_2 is a subset of a component set of Λ_{C_2}(M). So the set δ′ cannot show up in Λ_{C_1}(M) ∪ Λ_{C_2}(M), which proves (i).

In statement (ii), C_1 and C_2 do not share variables with the same set in M. Suppose that C_1 shares variables with every set in F_1 = {δ_{i_1}, . . . , δ_{i_a}} in M and C_2 with every set in F_2 = {δ_{j_1}, . . . , δ_{j_b}} in M for some positive integers a and b with a + b ≤ k, where

F_1 ∩ F_2 = ∅.   (A.2)

Then, by Theorem 3, we have

Λ_{C_1}(M) ⊇ ⟨M\F_1⟩   (A.3)

since C_1 ⊆ (⋃_{k=1}^{a} δ_{i_k}), and, by the same reasoning,

Λ_{C_2}(M) ⊇ ⟨M\F_2⟩.   (A.4)

From (A.2) it follows that

(M\F_1) ∪ (M\F_2) = M.

So, we have, from (A.3) and (A.4), that

Λ_{C_1}(M) ∪ Λ_{C_2}(M) ⊇ M.

Thus, by (16), the desired result of (ii) follows. □

Appendix B: SAS output for the model M̂ in expression (20)

Maximum Likelihood Analysis of Variance

Source      DF   Chi-Square   Pr > ChiSq
A            2        94.89       <.0001
T            2       137.60       <.0001
A*T          4       753.38       <.0001
F            2        26.46       <.0001
A*F          4        37.96       <.0001
T*F          4        20.70       0.0004
A*T*F        8        26.82       0.0008
H            2        84.52       <.0001
A*H          4       111.45       <.0001
T*H          4        14.86       0.0050
A*T*H        8        90.96       <.0001
H*F          4        61.15       <.0001
A*H*F        8        22.51       0.0041
T*H*F        8         4.15       0.8430
A*T*H*F     16        58.61       <.0001
D            2       308.39       <.0001
T*D          4        65.60       <.0001
D*F          4        26.72       <.0001
T*D*F        8        12.45       0.1323
H*D          4       182.51       <.0001
T*H*D        8        36.20       <.0001
H*D*F        8        29.66       0.0002
T*H*D*F     16        19.74       0.2323
O            2       172.28       <.0001
A*O          4        59.67       <.0001
E            2        63.55       <.0001
A*E          4       112.15       <.0001
O*E          4       288.55       <.0001
A*O*E        8        42.70       <.0001
R            2        85.54       <.0001
A*R          4        28.08       <.0001
O*R          4        72.22       <.0001
A*O*R        8        10.52       0.2306
E*R          4        82.91       <.0001
A*E*R        8        20.83       0.0076
O*E*R        8        16.76       0.0327
A*O*E*R     16        50.10       <.0001
O*D          4        41.82       <.0001
E*D          4        36.53       <.0001
O*E*D        8        50.24       <.0001
D*R          4        75.25       <.0001
O*D*R        8        22.06       0.0048
E*D*R        8        21.68       0.0056
O*E*D*R     16        27.68       0.0345
T*O          4       166.07       <.0001
O*H          4       167.19       <.0001
T*O*H        8        17.32       0.0270
T*R          4       117.39       <.0001
H*R          4       183.99       <.0001
T*H*R        8         9.02       0.3403
T*O*R        8        13.61       0.0924
O*H*R        8        24.37       0.0020
T*O*H*R     16        97.55       <.0001
F*R          4        97.89       <.0001
H*F*R        8        26.32       0.0009
T*E          4       417.46       <.0001
E*H          4       172.52       <.0001
T*E*H        8        35.79       <.0001

Likelihood ratio   3E3   3044.42   0.4587

References

Bishop, Y.M.M., Fienberg, S.E., Holland, P.W., 1975. Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
Chickering, D.M., Heckerman, D., Meek, C., 1997. A Bayesian approach to learning Bayesian networks with local structure. In: Geiger, D., Shenoy, P. (Eds.), Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Providence, RI, pp. 80–89.
D'Ambrosio, B., 1995. Local expression languages for probabilistic dependence. Internat. J. Approx. Reason. 13, 61–81.
Darroch, J.N., Lauritzen, S.L., Speed, T.P., 1980. Markov fields and log-linear interaction models for contingency tables. Ann. Statist. 8 (3), 522–539.
Dawid, A.P., 1979. Conditional independence in statistical theory. J. Roy. Statist. Soc. B 41 (1), 1–31.
Fienberg, S.E., 1980. The Analysis of Cross-classified Categorical Data. MIT Press, Cambridge, MA.
Fienberg, S.E., Kim, S.-H., 1999. Combining conditional log-linear structures. J. Amer. Statist. Assoc. 94 (445), 229–239.
Friedman, N., Goldszmidt, M., 1998. Learning Bayesian networks with local structure. In: Jordan, M.I. (Ed.), Learning in Graphical Models. Kluwer, Dordrecht, Netherlands, pp. 421–460.
Geiger, D., Heckerman, D., 1996. Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell. 82, 45–74.
Mahoney, S.M., Laskey, K.B., 1999. Representing and combining partially specified CPTs. In: Laskey, K.B., Prade, H. (Eds.), Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, San Francisco, CA, pp. 391–400.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, second ed. Chapman & Hall, New York.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, CA.
Whittaker, J., 1990. Graphical Models in Applied Multivariate Statistics. Wiley, New York.