
PHD PROGRAM IN MATHEMATICS AND APPLICATIONS

LOG-LINEAR MODELS

AND TORIC IDEALS

Fabio Rapallo

Thesis advisor: Prof. GIOVANNI PISTONE (Politecnico di Torino)

PhD Program Coordinator: Prof. CLAUDIO PEDRINI (Università di Genova)


PHD PROGRAM IN MATHEMATICS AND APPLICATIONS

XV Cycle

Administrative seat: Università di Genova

Partner institutions: Università di Torino, Politecnico di Torino

Thesis submitted for the degree of Dottore di Ricerca in Mathematics and Applications, April 2003


Contents

Introduction

1 Statistical and algebraic background
  1.1 Log-linear models
    1.1.1 Definitions and basic properties
    1.1.2 Maximum likelihood estimates
    1.1.3 Goodness of fit tests
    1.1.4 An important remark
    1.1.5 Exact methods
  1.2 Algebraic theory of toric ideals
    1.2.1 Definitions and first properties
    1.2.2 Computation of toric ideals. First method
    1.2.3 Navigating inside the set of solutions of an integer linear system
    1.2.4 Computation of toric ideals. Second method

2 The Diaconis-Sturmfels algorithm
  2.1 Hypergeometric distribution
  2.2 The algorithm
  2.3 Rates of convergence
  2.4 Two examples

3 Two-way contingency tables
  3.1 Two-way contingency tables: Independence
  3.2 Some models for incomplete and square tables
    3.2.1 Incomplete tables
    3.2.2 The quasi-independence model
    3.2.3 The symmetry model
    3.2.4 The quasi-symmetry model
    3.2.5 Computational notes
  3.3 Examples
  3.4 An application. The h-sample problem

4 Rater agreement models
  4.1 A medical problem
  4.2 Multidimensional rater agreement models
  4.3 Exact inference
  4.4 Conditioning
  4.5 Computations
  4.6 Example

5 Sufficiency and estimation
  5.1 Introduction
  5.2 Sufficiency and sampling
  5.3 Geometry of toric models
  5.4 Algebraic representation of the sufficient statistics
  5.5 Minimum-variance estimator
  5.6 Algebraic method

A Programs

Bibliography


Introduction

It is useful to start this work with two quotations from a survey paper by Agresti (2001) which appeared in “Statistics in Medicine”.

“For the null hypothesis of independence and conditional independence,

the relevant conditional distribution is simple, which is why ordinary Monte

Carlo applies so easily. This is not true of more complex hypotheses, such

as quasi-independence, quasi-symmetry, and others for square and triangular

contingency tables that were until recently given little attention in the exact

literature.”

“If statisticians cannot agree how to analyze even a 2 × 2 table, with no

one approach being obviously best, what hope is there for a consensus on more

complex analyses? The probability is probably decreasing to 0 with time, . . .”

These two remarks concisely convey how much statisticians need simple and comprehensive procedures for analyzing complex problems involving contingency tables.

In this thesis we use techniques from Computational Commutative Algebra

to solve complex problems in estimation and hypothesis testing for discrete

data.

The application of Computational Commutative Algebra in statistics and

probability is a recent topic and a first book in the field is Pistone et al. (2001a).

In this thesis, starting from the work by Diaconis & Sturmfels (1998), we

study the log-linear models through polynomial algebra, with important ap-

plications to statistical problems, such as goodness-of-fit tests for contingency

tables. In particular, we apply the algebraic theory of toric ideals to study

a Markov Chain Monte Carlo method for sampling from conditional distribu-

tions, through the notion of Markov basis. Throughout the thesis we will con-

sider a wide range of statistical problems: Models for two-way and multi-way

contingency tables; rater agreement models; contingency tables with structural

zeros. In some cases we will characterize the relevant Markov basis, while in


other cases we will give results which lead to a fundamental simplification of

the symbolic computations. A number of examples will show the relevance,

the efficiency and the versatility of these algebraic procedures.

The relevance of the topics considered here is confirmed by the increasing number of workshops, conferences and lectures devoted to this subject. Moreover, the number of people interested in the field is growing rapidly, and includes Professor Bernd Sturmfels and Professor Stephen Fienberg.

In Chapter 1, we introduce some background material about the log-linear

models and the toric ideals, which have a key role in the analysis of contin-

gency tables. In Chapter 2, we present a slight generalization of the Diaconis-

Sturmfels algorithm, while in Chapter 3 we analyze the case of two-way con-

tingency tables, with some results for the characterization of the relevant toric

ideals and some other results useful to simplify their computation. In Chapter

4 we study an application of the previous theory to rater agreement models,

which are widely used in biomedical and pharmaceutical research. We also

give results for the characterization of toric ideals for multi-way contingency

tables. Finally, in Chapter 5 we study the algebraic properties of the notions of

sufficiency and maximum likelihood estimation in statistical models on a finite sample space, with results about the computation of the sufficient statistic of the model and its geometric representation. The programs, both symbolic and numerical, presented in Appendix A show the practical applicability of the algorithms, together with the examples on real data sets presented in all chapters.

The results presented in Chapters 3, 4 and 5 represent original work, as

well as the generalization of the algorithm presented in Chapter 2 and the

programs in Appendix A.

A part of the thesis has been published in preliminary versions. In particular, a part of Chapter 2 is in press in the “Scandinavian Journal of Statistics”, while a part of Chapters 4 and 5 is the subject of a preprint of the Department of Mathematics of Genova. Two papers are currently under review at international journals.

The topics presented here have been the subject of several presentations at conferences and international workshops. Among others, we cite here the “Grostat” series and the Meetings of the Italian Statistical Society.

Moreover, the tutorial for the course “Statistical Toric Models”, at the “First

International School on Algebraic Statistics”, has been based on the work


presented in this thesis.

Finally, the theory and the applications developed in Chapter 4 have been

the starting point for a current cooperation with the pharmaceutical company

“Schering” in Berlin.

Acknowledgements

During the writing of this work, over the last two years, I have benefited from the help of many people, both from the scientific and the moral point of view.

In alphabetical order, a minimal list is formed by Daniele De Martini (Uni-

versity of “Piemonte Orientale”, Novara), Franco Fagnola (University of Gen-

ova), Mauro Gasparini (Politecnico di Torino), my advisor Giovanni Pistone

(Politecnico di Torino), Eva Riccomagno (University of Warwick), Lorenzo

Robbiano (University of Genova), Maria Piera Rogantin (University of Gen-

ova), Ivano Repetto (University of Genova), Richard Vonk (Schering, Berlin).

Outside academia, I want to acknowledge the constant support of Matteo Cella, my cousin, and Andrea Giacobbe. Their support has been essential in the most critical moments.


Chapter 1

Statistical and algebraic background

In this chapter we introduce the main background theory we need, both from

the statistical and the algebraic point of view. In Section 1.1 we recall the

properties of log-linear models for the analysis of qualitative data, with a

particular attention on the maximum likelihood estimation and the goodness

of fit tests, using the classical asymptotic approach. Moreover, we emphasize

how simple algebraic objects, such as power products and binomials, naturally

appear in the classical analysis of log-linear models. Finally, in Section 1.1.5,

we review the main exact methods for hypothesis testing for contingency tables.

Exact methods date back to the work of Fisher in the 1940s, when an exact test for independence in 2 × 2 contingency tables was introduced. A wide range

of exact procedures for inference for contingency tables is now available and

in most cases the increased speed and computing power of computers has

played a fundamental role for the practical applicability of such procedures.

We describe here the main methods for exact analysis of contingency tables.

In Section 1.2, we introduce the algebraic theory of toric ideals, with spe-

cial attention to the statistical meaning of the algebraic procedures. A brief

and neat review of some relevant concepts from Computational Commutative

Algebra is given in order to keep the presentation as self-contained as possible. In particular, we deal with the computation of toric ideals, which will

represent a key step in the algorithms presented in the next chapters.

In this chapter, the proofs of the results are omitted, with references to

literature, in particular Haberman (1974), Bishop et al. (1975) and Agresti

(2002) for the theory of log-linear models, and Kreuzer & Robbiano (2000)

and Bigatti & Robbiano (2001) for the theory of toric ideals.


1.1 Log-linear models

1.1.1 Definitions and basic properties

The material presented in this section is mainly taken from classical works in

the field of the log-linear models. The aim of this section is to show that in the

classical representation of the log-linear models, i.e. using vector space theory,

polynomial equations play a fundamental role. Some simple examples of this

section will be used many times throughout the thesis. The main references

are Haberman (1974), Bishop et al. (1975), Agresti (1996), Fienberg (1980)

and Dobson (1990). In particular we follow the notation in Haberman (1974),

as it allows us to give an unique presentation of the theory of log-linear models

in a general setting. Some examples and useful remarks are taken from the

other references.

Log-linear models are statistical models for the analysis of qualitative data; thus the sample space is finite and the data are usually presented in a contingency table.

Definition 1.1 A contingency table is a collection f of frequencies (f(x) ∈ N : x ∈ X), where X is the sample space.

A contingency table contains the joint observations of d random variables

on N subjects. A first classification of contingency tables can be made looking

at the sample space X . In particular, the contingency table is said to be a d-way

contingency table if f is the cross-classification of the subjects with respect to

d random variables. Consequently, the sample space X is a finite subset of N^d,

possibly with a suitable coding of the levels of the random variables. Moreover,

the contingency table f is complete if the sample space is a Cartesian product

of one-dimensional sample spaces and is incomplete otherwise.

Log-linear models are used to describe the stochastic structure of the joint

probability distribution underlying the sample, i.e. to determine stochastic re-

lations between the observed variables, such as independence, non-interaction,

symmetry and others more complex relations.

If we denote the cardinality of the sample space X by q, the table f is an element of the q-dimensional space R^q, with the standard inner product defined by

\[ \langle f_1, f_2 \rangle = \sum_{x \in X} f_1(x)\, f_2(x) \]


and the canonical basis (e_x)_{x ∈ X}.

In a log-linear model, f is assumed to be the realization of a random vector

F with mean m = (m(x) : x ∈ X ), where m(x) > 0 for all x ∈ X . Here

and throughout we denote the observed tables with lowercase letters and the

random vectors with the corresponding uppercase letters. Thus for each x ∈ X ,

log m(x) is well defined. Here we assume that the probabilities of the sample

points are positive, and that the means m(x) > 0 are a parameterization of

the model. Therefore, if µ = (log m(x) : x ∈ X), then µ ∈ R^q. In a log-linear model it is assumed that µ ∈ M, where M is a p-dimensional linear manifold contained in R^q. When q = p, we say that the corresponding log-linear model

is saturated.

Example 1.2 Consider a two-way contingency table with two levels for each random variable, i.e. a 2 × 2 contingency table. The space R^4 is spanned by the four orthogonal vectors

\[ v_1 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \qquad v_2 = \begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix} \qquad v_3 = \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix} \qquad v_4 = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} . \]

We can consider the model M = span(v_1, v_2, v_3). In this case, Bishop et al. (1975) showed that the model can be written in the form

\[ \log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j \] (1.1)

for i, j = 1, 2, with the constraints

\[ \lambda^X_1 + \lambda^X_2 = \lambda^Y_1 + \lambda^Y_2 = 0 . \] (1.2)

If we denote by m_{i+} and m_{+j} the marginal totals, one can write

\[ m_{i+} = e^{\lambda} e^{\lambda^X_i} \big( e^{\lambda^Y_1} + e^{\lambda^Y_2} \big) , \]
\[ m_{+j} = e^{\lambda} e^{\lambda^Y_j} \big( e^{\lambda^X_1} + e^{\lambda^X_2} \big) , \]
\[ N = e^{\lambda} \big( e^{\lambda^X_1} + e^{\lambda^X_2} \big) \big( e^{\lambda^Y_1} + e^{\lambda^Y_2} \big) . \]

Hence we obtain

\[ m(i,j) = m_{i+} m_{+j} / N . \] (1.3)

Moreover, since v_4 is orthogonal to M, we obtain the relation

\[ \log \frac{m(1,1)\, m(2,2)}{m(1,2)\, m(2,1)} = 0 , \] (1.4)

from which we obtain

\[ m(1,1)\, m(2,2) - m(1,2)\, m(2,1) = 0 . \] (1.5)

Equation (1.5) represents an algebraic relation between the means of the cells

and in particular it is a binomial. We will see later the relevance of such

binomials in our framework.
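The binomial relation (1.5) can be checked numerically. The following sketch (purely illustrative; the helper name independence_means is ours, not from the thesis) builds the means m(i, j) = m_{i+} m_{+j} / N of Equation (1.3) from given margins and verifies that the binomial of Equation (1.5) vanishes.

```python
# Numerical check of Equation (1.5): for means of the product form
# m(i,j) = m_{i+} m_{+j} / N of Equation (1.3), the binomial
# m(1,1) m(2,2) - m(1,2) m(2,1) is identically zero.

def independence_means(row_totals, col_totals):
    """Build the 2x2 table of means m(i,j) = m_{i+} m_{+j} / N."""
    N = sum(row_totals)
    assert N == sum(col_totals), "margins must share the same total N"
    return [[r * c / N for c in col_totals] for r in row_totals]

m = independence_means([30.0, 70.0], [40.0, 60.0])
binomial = m[0][0] * m[1][1] - m[0][1] * m[1][0]
print(abs(binomial) < 1e-9)  # True: the means satisfy the binomial relation
```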

A detailed representation of the vector space bases of log-linear models is

presented in Collombier (1980), with special attention to the meaning of such

representation in terms of hypothesis testing.

For the statistical inference, it is now essential to specify the underlying

probability distribution of the table f . The main models in literature are the

Poisson model and the multinomial model.

The Poisson model

In the Poisson model, the elements of F are independent Poisson random

variables with E[F (x)] = m(x) for all x ∈ X and m(x) > 0 for all x ∈ X . The

log-likelihood is

\[ \log L(f, \mu) = \sum_{x \in X} \big( f(x) \log m(x) - m(x) \big) = \langle f, \mu \rangle - \sum_{x \in X} \exp(\mu(x)) . \]

If P_M is the orthogonal projection from R^q to M, then, since µ ∈ M and P_M is a symmetric operator, we have

\[ \log L(f, \mu) = \langle P_M f, \mu \rangle - \sum_{x \in X} \exp(\mu(x)) . \] (1.6)

Therefore, the family of Poisson models such that µ ∈ M is an exponential fam-

ily. Following classical results about exponential families, the projection PMf

is a complete minimal sufficient statistic, see for example Lehmann (1983).

Moreover, any nonsingular linear transformation of PMf is again a complete

minimal sufficient statistic. Traditionally, Equation (1.6) is presented as the

following theorem.

Theorem 1.3 Under the Poisson model, the sufficient statistic is expressible as a linear combination of the cell counts (f(x) : x ∈ X).


For the proof, see Agresti (2002), page 136.
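Theorem 1.3 can be illustrated on the 2 × 2 independence model of Example 1.2. In the sketch below (an illustration, not the thesis code), A is a hypothetical 0/1 model matrix whose columns correspond to the constant term, the first row and the first column; the linear statistic collects the grand total and two marginal totals, i.e. a linear combination of the cell counts.

```python
# A linear sufficient statistic for the 2x2 independence model: the
# grand total, the first row total and the first column total, each a
# linear combination of the cell counts f(x).

f = [13, 7, 5, 25]            # cells (1,1), (1,2), (2,1), (2,2)
A = [
    [1, 1, 1],                # cell (1,1) loads on: total, row 1, column 1
    [1, 1, 0],                # cell (1,2): total, row 1
    [1, 0, 1],                # cell (2,1): total, column 1
    [1, 0, 0],                # cell (2,2): total only
]

# t = A^T f, computed entry by entry
t = [sum(A[x][h] * f[x] for x in range(4)) for h in range(3)]
print(t)  # [50, 20, 18]: grand total, first row total, first column total
```

The remaining margins are recovered by differences (second row total = 50 − 20, and so on), so this statistic is equivalent to the full set of margins.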

The random vector F has mean m and variance-covariance matrix D(µ),

where D(µ) is the diagonal matrix with the variances m(x) on the main diagonal and 0 elsewhere.

We denote convergence in probability and convergence in distribution by the symbols $\xrightarrow{p}$ and $\xrightarrow{d}$, respectively. Moreover, the symbol $\mathcal{N}$ denotes the Gaussian distribution. The first asymptotic results are summarized in the following proposition.

Proposition 1.4 If (F^{(N)})_{N>0} is a sequence of random vectors from a Poisson model with means (m^{(N)})_{N>0}, and if N^{-1} m^{(N)} → m, then

\[ N^{-1} F^{(N)} \xrightarrow{p} m \] (1.7)

and

\[ N^{-1/2} \big( F^{(N)} - m^{(N)} \big) \xrightarrow{d} \mathcal{N}(0, D(\mu)) . \] (1.8)

For the proof, see Haberman (1974), page 7.

The multinomial model

In the multinomial model, the vector F consists of one or more independent

multinomial random vectors (Fi : i ∈ Xk) with mean (mi : i ∈ Xk). The sets

Xk, k = 1, . . . , r are disjoint and have union X , and m(x) > 0 for all x ∈ X .

We suppose that the model M is such that for each k = 1, . . . , r

\[ \nu(k) = \{ I_{X_k}(x) : x \in X \} \in M . \]

For the determination of the complete minimal sufficient statistic, we consider

the decomposition of M into M_1 and M_2, where M_1 is the linear manifold generated by (ν(k) : k = 1, . . . , r) and M_2 is its orthogonal complement in M.

We denote the sample size for (f(x) : x ∈ X_k) by N_k. The log-likelihood is

\[ \log L(f, \mu) = \langle P_{M_2} f, P_{M_2} \mu \rangle - \sum_{k=1}^{r} N_k \log \big( \langle m(P_{M_2} \mu), \nu(k) \rangle \big) + \sum_{k=1}^{r} \log N_k ! \, . \] (1.9)

Thus, PM2f is a complete minimal sufficient statistic. As any nonsingular

linear transformation of PM2f is again a complete minimal sufficient statistic,

we can use the statistic PMf as in the Poisson model. The random vector F

has mean m and variance-covariance matrix

\[ \Sigma(\mu) = D(\mu) \big( I - P^{*}_{M_1}(\mu) \big) \]

where P^{*}_{M_1} is the orthogonal projection from R^q onto M_1 and I is the identity

matrix. We summarize in the following proposition the basic convergence

results.

Proposition 1.5 If (F^{(N)})_{N>0} is a sequence of random vectors from a multinomial model with means (m^{(N)})_{N>0}, and if N^{-1} m^{(N)} → m, then

\[ N^{-1} F^{(N)} \xrightarrow{p} m \] (1.10)

and

\[ N^{-1/2} \big( F^{(N)} - m^{(N)} \big) \xrightarrow{d} \mathcal{N}\big( 0, \, D(\mu) ( I - P^{*}_{M_1}(\mu) ) \big) . \] (1.11)

For the proof, see Haberman (1974), page 13.

The close connection between the Poisson model and the multinomial model

can be found in Haberman (1974), where the author proves that many results

about maximum likelihood estimation and hypothesis testing are the same

under both the Poisson model and the multinomial model.

Other models can be considered, such as the product-multinomial model

and the conditional Poisson model. See Haberman (1974) and Agresti (2002)

for details. Moreover, the different meanings of the models from an applicative

point of view are widely discussed in Chapters 2 and 3 of Fienberg (1980).

1.1.2 Maximum likelihood estimates

We start by considering the problem of maximum likelihood estimation for the

Poisson model. The maximum likelihood estimate µ̂ of µ is the value such that

\[ \log L(f, \hat{\mu}) = \sup_{\mu \in M} \log L(f, \mu) . \] (1.12)

We briefly recall the basic theorems about the existence and uniqueness of the maximum likelihood estimate.

Theorem 1.6 If a maximum likelihood estimate µ̂ exists, then it is unique and satisfies the equation

\[ P_M \hat{m} = P_M f . \] (1.13)

Conversely, if Equation (1.13) is satisfied for some µ̂ ∈ M with m̂ = exp(µ̂), then µ̂ is the maximum likelihood estimate of µ.


Theorem 1.7 A necessary and sufficient condition for the maximum likelihood estimate µ̂ of µ to exist is that there exists δ ∈ M⊥ such that f(x) + δ(x) > 0 for all x ∈ X.

Corollary 1.8 If f(x) > 0 for all x ∈ X, then the maximum likelihood estimate µ̂ exists.

Theorem 1.9 A necessary and sufficient condition for the maximum likelihood estimate µ̂ to exist is that there is no µ ∈ M with µ ≠ 0, µ(x) ≤ 0 for all x ∈ X and ⟨f, µ⟩ = 0.

For the multinomial model, the relevant theorem is the following, which

states that the maximum likelihood estimate is the same under the Poisson

model and the multinomial model.

Theorem 1.10 If µ̂^{(m)} is the maximum likelihood estimate for a multinomial model with µ ∈ M, and if µ̂ is the corresponding maximum likelihood estimate for a Poisson model with µ ∈ M, then µ̂^{(m)} = µ̂, in the sense that when one side of the equation exists, the other side exists and the two sides are equal.

For the proof of all the above theorems, see Haberman (1974), pages 35-41.

In view of this theorem, the necessary and sufficient conditions stated for

the Poisson model also apply to the multinomial model.

In many cases, the maximum likelihood estimate cannot be computed directly and we resort to numerical approximations. Among the numerical methods, the most important ones in this field are the Newton-Raphson algorithm and the Iterative Proportional Fitting algorithm. Details on such methods can be found in Haberman (1974) and Agresti (2002). We will come back later to the numerical computation of the maximum likelihood estimate.
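As a minimal illustration of the Iterative Proportional Fitting algorithm (a sketch for the two-way independence model only, not the implementation used in this thesis), the fitted means are obtained by alternately rescaling rows and columns so that the fitted table reproduces the observed margins, in agreement with Equation (1.13) projected onto rows and columns.

```python
# Iterative Proportional Fitting for the two-way independence model:
# alternately scale rows and columns of a starting table until the
# fitted means match the observed margins.

def ipf(f, iterations=10):
    I, J = len(f), len(f[0])
    row = [sum(f[i]) for i in range(I)]
    col = [sum(f[i][j] for i in range(I)) for j in range(J)]
    m = [[1.0] * J for _ in range(I)]          # start from a constant table
    for _ in range(iterations):
        for i in range(I):                     # match the row totals
            s = sum(m[i])
            m[i] = [x * row[i] / s for x in m[i]]
        for j in range(J):                     # match the column totals
            s = sum(m[i][j] for i in range(I))
            for i in range(I):
                m[i][j] *= col[j] / s
    return m

m = ipf([[13, 7], [5, 25]])
print(round(m[0][0], 6))  # 7.2 = 20 * 18 / 50, the independence fit m_{i+} m_{+j} / N
```

For the independence model the algorithm converges after a single cycle; for hierarchical models without closed-form estimates the same scheme iterates over all fitted margins.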

Now, we analyze the asymptotic behavior of the maximum likelihood esti-

mate.

Suppose that (F^{(N)})_{N>0} is a sequence of random vectors with means (m^{(N)})_{N>0}. We denote by (µ^{(N)})_{N>0} the sequence of the logarithms of the means. Using the same notation as above, the main asymptotic results are summarized in the following theorem.


Theorem 1.11 Let m* = lim_{N→+∞} N^{-1} m^{(N)}, assume m* > 0, and let µ* = log m*. The following relations hold as N goes to +∞.

\[ \hat{\mu}^{(N)} - \mu^{(N)} \xrightarrow{p} 0 \] (1.14)

\[ N^{-1} \hat{m}^{(N)} \xrightarrow{p} m^{*} \] (1.15)

\[ N^{1/2} \big( \hat{\mu}^{(N)} - \mu^{(N)} \big) \xrightarrow{d} \mathcal{N}\big( 0, \, ( P^{*}_{M}(\mu^{*}) - P^{*}_{M_1}(\mu^{*}) ) D^{-1}(\mu^{*}) \big) \] (1.16)

\[ N^{1/2} \big( \hat{m}^{(N)} - m^{(N)} \big) \xrightarrow{d} \mathcal{N}\big( 0, \, D(\mu^{*}) ( P^{*}_{M}(\mu^{*}) - P^{*}_{M_1}(\mu^{*}) ) \big) \] (1.17)

For the proof, see Agresti (2002), page 584.

The previous result, and in particular Equations (1.16) and (1.17), allows us to compute asymptotic confidence intervals for any linear combination of the components of µ and m, and simple hypothesis tests for such quantities, applying the theory described in Lehmann (1983) and Lehmann (1987).

Remark 1.12 For historical reasons, the log-linear models are usually ex-

pressed in terms of the expected cell counts, instead of in terms of the cell

probabilities. However, the representation of our models in terms of the cell

probabilities is given by noting that in our notation f is the vector of the ob-

served frequencies, while N−1f is the vector of empirical probabilities. Thus,

m(N) represents the vector of the expected frequencies, and N−1m(N) repre-

sents the vector of cell probabilities, as well as m∗. Sometimes in the thesis,

we will refer to the vector of the cell probabilities as p = (p(x) : x ∈ X ).

1.1.3 Goodness of fit tests

Hypothesis tests for log-linear models are generally based on test statistics

which have an asymptotic chi-squared distribution under the null hypothesis.

We consider here tests of the form H0 : µ ∈ M0 ⊂ M1 against the alternative

hypothesis µ ∈ M1. Such tests are commonly called goodness of fit tests, in

the sense that they test if a smaller model M0 can replace a bigger model

M1 without a significant loss of information. Moreover, in many situations

these tests give some information about the stochastic relations between the

variables. The goodness of fit tests are based on the Pearson test statistic

\[ C(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \sum_{x \in X} \frac{\big( \hat{m}^{(1)}(x) - \hat{m}^{(0)}(x) \big)^2}{\hat{m}^{(0)}(x)} \] (1.18)

or on the log-likelihood ratio test statistic

\[ \Delta(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \log L(f, \hat{\mu}^{(0)}) - \log L(f, \hat{\mu}^{(1)}) . \] (1.19)

Suppose that M_0 has dimension p_0 and M_1 has dimension p_1. The relevant result about hypothesis testing is the following.

Theorem 1.13 Consider a sequence of null hypotheses H_0 : µ^{(N)} ∈ M_0 and alternatives H_1 : µ^{(N)} ∈ M_1. Suppose that µ̂^{(N,1)} and µ̂^{(N,0)} are the maximum likelihood estimates of µ^{(N)} under H_1 and H_0, respectively. Then, under H_0, the statistics −2∆ and C are asymptotically equivalent and

\[ \lim_{N \to +\infty} P\big[ -2 \Delta(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^{*}(\alpha) \big] = \lim_{N \to +\infty} P\big[ C(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^{*}(\alpha) \big] = \alpha , \] (1.20)

where c^{*}(α) is the upper α-quantile of the chi-square distribution with p_1 − p_0 degrees of freedom.

For the proof, see Bishop et al. (1975), page 514.

If M_1 is the saturated model, we can use the fact that m̂^{(1)} = f, and the test statistics in Equations (1.18) and (1.19) take the familiar forms

\[ C(f, \hat{\mu}^{(0)}) = \sum_{x \in X} \frac{\big( f(x) - \hat{m}^{(0)}(x) \big)^2}{\hat{m}^{(0)}(x)} \] (1.21)

and

\[ \Delta(f, \hat{\mu}^{(0)}) = \sum_{x \in X} \hat{m}^{(0)}(x) \log \frac{f(x)}{\hat{m}^{(0)}(x)} . \] (1.22)
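As a small numerical illustration (a sketch with made-up counts), the Pearson statistic of Equation (1.21) can be computed for a 2 × 2 table under the independence fit and compared with the usual upper 0.05 critical value of the chi-square distribution with (I − 1)(J − 1) = 1 degree of freedom.

```python
# Pearson statistic of Equation (1.21) for a 2x2 table under the
# independence fit m(i,j) = m_{i+} m_{+j} / N; 3.84 is the familiar
# upper 0.05 point of the chi-square with 1 degree of freedom.

f = [[13, 7], [5, 25]]
N = sum(sum(r) for r in f)
row = [sum(r) for r in f]
col = [sum(f[i][j] for i in range(2)) for j in range(2)]
m0 = [[row[i] * col[j] / N for j in range(2)] for i in range(2)]

C = sum((f[i][j] - m0[i][j]) ** 2 / m0[i][j]
        for i in range(2) for j in range(2))
print(round(C, 3), C > 3.84)  # large values are evidence against independence
```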

A detailed discussion of these test statistics can be found in Dobson (1990) or Haberman (1978). The small sample properties of the asymptotic tests are presented in Fienberg (1980), Appendix IV. We will consider the small sample properties in detail later in the thesis.

Remark 1.14 In the applications, the goodness of fit tests are performed

also in those cases where the maximum likelihood estimate does not exist. In

this case we can use the notion of extended maximum likelihood estimate, see

Appendix B of Haberman (1974) for details.


1.1.4 An important remark

As we have seen in Example 1.2, the vector space theory leads to a canonical representation of the log-linear models in the form

\[ \log m = A \Lambda , \] (1.23)

where A is the model matrix and Λ is the vector of the model parameters. In many cases, see for example Haberman (1978), the matrix A has non-negative integer entries and the model can be written in the form

\[ \log m(i) = \sum_{h=1}^{s} A(h, i)\, \lambda(h) . \] (1.24)

This is the classical representation of the log-linear models and it allows a

simple interpretation of the statistical relations among the variables, see Fien-

berg (1980), Fingleton (1984) and Agresti (1996). We present here a detailed

analysis of the independence model for two categorical variables. This model is simple enough to be analyzed in a few pages in full detail, and it contains all

the relevant algebraic objects we have pointed out in the previous discussion.

More complex models which involve a greater number of categorical variables

and different stochastic relations among the variables will be presented in the

next chapters.

Example 1.15 Consider the cross-classification of two categorical random

variables X and Y with supports {1, . . . , I} and {1, . . . , J}, respectively.

The saturated model is written in the form

\[ \log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^{XY}_{ij} \]

with the constraints

\[ \sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j = \sum_{i=1}^{I} \lambda^{XY}_{ij} = \sum_{j=1}^{J} \lambda^{XY}_{ij} = 0 , \]

while the independence model, where we impose no interaction between the variables, takes the form

\[ \log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j \]

with the constraints

\[ \sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j = 0 . \]


This representation follows immediately from the vector space representation presented in Example 1.2. Note that a vector space basis is easy to find in the two-way case, while in the multi-way case some difficulties arise.

From our point of view, the multiplicative form of the log-linear model, obtained by exponentiating the log-linear form, is of great interest. For the saturated model we obtain the expression

\[ m(i,j) = \zeta_0 \, \zeta^X_i \, \zeta^Y_j \, \zeta^{XY}_{ij} \] (1.25)

and for the independence model we can write

\[ m(i,j) = \zeta_0 \, \zeta^X_i \, \zeta^Y_j \] (1.26)

where the ζ parameters satisfy appropriate constraints. Note that our notation

is coherent with the notation in Chapter 6 of Pistone et al. (2001a).

Manipulating Equation (1.26), it is easy to check that under the independence model the means m(i, j) must satisfy the relations

\[ m(i_1, j_1)\, m(i_2, j_2) - m(i_1, j_2)\, m(i_2, j_1) = 0 \] (1.27)

for all i1, i2 ∈ {1, . . . , I} and j1, j2 ∈ {1, . . . , J}. Of course, the same relations

hold for the cell probabilities p(i, j). Thus, the independence model is repre-

sented by the zero set of the polynomial system formed by the binomials in

Equation (1.27). We will see in the next section the correspondence between

algebraic restrictions of this form and the geometric restrictions on the space

of the means parameters, or equivalently on the space of the cell probabilities.

The multiplicative form of the log-linear models has been widely studied

in Goodman (1979a), Goodman (1979b) and Goodman (1985). We will see in

the next sections the relevance of this representation.
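The relations in Equation (1.27) can be enumerated mechanically. The following sketch (the helper name independence_binomials is ours) lists the index quadruples of the 2 × 2 minors of an I × J table of means and checks that a rank-one table, i.e. a table of product form, satisfies all the corresponding binomials.

```python
# Enumerate the binomial relations of Equation (1.27): one binomial per
# pair of rows and pair of columns, i.e. per 2x2 minor of the table of
# means under the independence model.

from itertools import combinations

def independence_binomials(I, J):
    """Return the index quadruples (i1, i2, j1, j2) of the 2x2 minors."""
    return [(i1, i2, j1, j2)
            for i1, i2 in combinations(range(I), 2)
            for j1, j2 in combinations(range(J), 2)]

rels = independence_binomials(3, 3)
print(len(rels))  # 9 binomials: C(3,2) * C(3,2) minors

# A rank-one table of means of product form satisfies every binomial.
m = [[2.0 * (i + 1) * (j + 1) for j in range(3)] for i in range(3)]
print(all(abs(m[a][c] * m[b][d] - m[a][d] * m[b][c]) < 1e-9
          for a, b, c, d in rels))  # True
```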

1.1.5 Exact methods

Exact statistics can be useful in situations where the asymptotic assumptions are not met, so that the asymptotic p-values are not close approximations of the true p-values. Standard asymptotic methods rely on the assumption that the test statistic follows a particular distribution when the sample size is sufficiently large. When the sample size is not large, asymptotic results may not be valid, with the asymptotic p-values differing substantially from the exact p-values. Asymptotic results may also be unreliable when the distribution of


the data is sparse. Examples can be found in Agresti (1996) and Bishop et al.

(1975). We will also see some examples in our framework later in the thesis.

Exact computations are based on the statistical theory of exact conditional

inference for contingency tables, reviewed by Agresti (2001).

Following such exact conditional methods, one computes the exact distri-

bution of the test statistic given the sufficient statistic. These procedures often

arise for independence and goodness of fit tests, as well as in the construction of

uniformly most powerful tests and accurate confidence interval computations

through the Rao-Blackwell theorem, see Lehmann (1987). Thus, the analysis

is restricted to the set of tables with the same value t of the sufficient statistic T_N as the observed table:

\[ F_t = \{ f : X \longrightarrow \mathbb{N} \, : \, T_N(f) = t \} . \] (1.28)

This set is often called the reference set. Throughout the thesis, we will assume that F_t is a finite set.
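For a 2 × 2 table whose sufficient statistic is the pair of margins, the reference set (1.28) can be enumerated directly. The following sketch (illustrative only; the helper name reference_set is ours) shows the brute-force enumeration that, as discussed below, becomes infeasible on larger problems and motivates more refined algorithms.

```python
# Direct enumeration of the reference set (1.28) for a 2x2 table with
# fixed row and column margins: once the (1,1) cell is chosen, the
# remaining cells are determined by the margins.

def reference_set(row, col):
    """All 2x2 tables of natural numbers with the given margins."""
    tables = []
    for a in range(min(row[0], col[0]) + 1):
        b, c = row[0] - a, col[0] - a
        d = row[1] - c
        if b >= 0 and c >= 0 and d >= 0 and b + d == col[1]:
            tables.append([[a, b], [c, d]])
    return tables

Ft = reference_set([3, 4], [2, 5])
print(len(Ft))  # 3 tables: the (1,1) cell can be 0, 1 or 2
```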

In the literature, there exist many exact algorithms for a wide range of exact tests, such as the Pearson chi-square, likelihood-ratio chi-square, Mantel-Haenszel chi-square, Fisher's exact test and McNemar's test, but in general such algorithms are limited to two-way contingency tables. A number of exact tests for

rithms are limited to two-way contingency tables. A number of exact tests for

contingency tables are presented for example in Agresti (2001).

These algorithms are also implemented in statistical software packages; see for example SAS/STAT User's Guide (2000). These exact algorithms compute exact p-values for general I × J tables using the network algorithm developed by Mehta & Patel (1983). This algorithm provides a substantial advantage over direct enumeration of the reference set Ft, which can be very time-consuming and feasible only for small problems. Refer to Agresti (1992) for a review of algorithms for the computation of exact p-values, and to Mehta et al. (1991) for information on the performance of the network algorithm.

Limiting ourselves to the basic steps of the algorithm, the network algo-

rithm proceeds as follows. Corresponding to the reference set Ft, the algorithm

forms a directed acyclic network consisting of nodes in a number of stages. A

path through the network corresponds to a distinct table in the reference set.

The distances between nodes are defined so that the total distance of a path

through the network is the corresponding value of the test statistic. At each

node, the algorithm computes the shortest and longest path distances for all the paths that pass through that node. The exact computation of these quantities for statistics which are linear combinations of the cell counts is presented in Agresti et al. (1990). For statistics of other forms, the computation of an

upper bound for the longest path and a lower bound for the shortest path are needed. The longest and shortest path distances or bounds for a node are compared to the value of the test statistic to determine whether all paths through the node contribute to the p-value, none of the paths through the node contribute to the p-value, or neither of these situations occurs. If all paths through the node contribute, the p-value is incremented accordingly, and these paths are eliminated from further analysis. If no paths contribute, these paths are eliminated from the analysis. Otherwise, the algorithm continues, still processing this node and the associated paths. The algorithm finishes when every node has been either accounted for, incrementing the p-value accordingly, or eliminated.

For each possible table, the algorithm compares its test statistic value with

the corresponding value for the observed table. If the value for the table is

greater than or equal to the observed test statistic value, it increments the

exact p-value by the probability of that table, which is calculated under the

null hypothesis using the multinomial frequency distribution.

For several tests we meet in the analysis of categorical data, the test statistic is nonnegative, and large values of the test statistic indicate a departure from the null hypothesis. Such tests include Pearson's chi-square, the likelihood-ratio chi-square and others, such as the Mantel-Haenszel chi-square, Fisher's exact test for tables larger than 2 × 2, and McNemar's test. The exact p-value for these tests is the sum of the probabilities of those tables having a test statistic greater than or equal to the value of the observed test statistic.
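As a concrete illustration of this recipe (a minimal sketch of our own, not the network algorithm), the following code computes an exact p-value for the Pearson chi-square statistic of a 2 × 2 table by full enumeration of the reference set, with table probabilities given by the hypergeometric distribution conditional on the margins:

```python
from math import factorial, prod

def tables_with_margins(rows, cols):
    # all 2x2 tables with the given margins: the free parameter is the (1,1) cell
    out = []
    for a in range(min(rows[0], cols[0]) + 1):
        b, c = rows[0] - a, cols[0] - a
        d = rows[1] - c
        if min(b, c, d) >= 0:
            out.append([[a, b], [c, d]])
    return out

def hypergeom_prob(t, rows, cols):
    # probability of a table under the null, conditional on the margins
    n = sum(rows)
    num = prod(factorial(m) for m in rows + cols)
    den = factorial(n) * prod(factorial(x) for row in t for x in row)
    return num / den

def chi_square(t, rows, cols):
    n = sum(rows)
    return sum((t[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

def exact_p_value(obs):
    rows = [sum(r) for r in obs]
    cols = [sum(c) for c in zip(*obs)]
    s0 = chi_square(obs, rows, cols)
    return sum(hypergeom_prob(t, rows, cols)
               for t in tables_with_margins(rows, cols)
               if chi_square(t, rows, cols) >= s0 - 1e-9)

print(exact_p_value([[3, 0], [0, 3]]))   # only the two extreme tables qualify
```

For the table [[3, 0], [0, 3]] the reference set consists of four tables, and only the two extreme ones reach the observed statistic, giving an exact p-value of 0.1.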

These recently developed algorithms for two-way tables, together with improvements in computer power, now make it feasible to perform exact computations for data sets where previously only asymptotic methods could be applied. Nevertheless, there are still large problems that may require a prohibitive amount of time and memory for exact computations.

As noted by Diaconis & Sturmfels (1998), the reference set Ft is typically very large and thus difficult to enumerate even for very small data sets, where the asymptotic approximation is severely inadequate.

Recently, a software package has been developed in order to compute efficiently the cardinality of the reference set Ft even in non-trivial cases. This software, called LattE, an acronym for “Lattice Point Enumeration”, is presented in

De Loera et al. (2003) and uses non-standard algebraic results about convex polytopes. For example, using this software, the authors are able to compute the cardinality of the reference set Ft for Table 1.1 with two-dimensional marginal totals as components of the sufficient statistic, concluding that this cardinality is equal to 441, while for a table whose two-dimensional marginal totals are displayed in Table 1.2, the cardinality of the reference set is about 2.2498 × 10^40.

X3 = 1:

X1 \ X2    1    2    3
   1      96   72  161
   2      10    7    6
   3       1    1    2

X3 = 2:

X1 \ X2    1    2    3
   1     186  127   51
   2      11    7    3
   3       0    1    0

Table 1.1: A 3 × 3 × 2 example.

X1 × X2 marginal totals:

X1 \ X2       1        2        3
   1     164424   324745   127239
   2     262784   601074  9369116
   3     149654  7618489  1736281

X1 × X3 marginal totals:

X1 \ X3       1        2        3
   1     163445    49395   403568
   2    1151824   767866  8313284
   3    1609500  6331023  1563901

X2 × X3 marginal totals:

X2 \ X3       1        2        3
   1     184032   123585   269245
   2     886393  6722333   935582
   3    1854344   302366  9075926

Table 1.2: Two-dimensional marginal totals for a 3 × 3 × 3 example.

As the software LattE is currently under development, only a demo version is available to us; this is the reason why we present here examples taken from De Loera et al. (2003) instead of new examples.


However, it is worth noting that for statistical applications it is often sufficient to have an approximation of the cardinality of Ft.

When asymptotic methods are not adequate and exact computation is infeasible for such large problems, we can use Monte Carlo estimation of the exact p-values. No formula exists that can predict in advance how much time and memory are needed to compute an exact p-value for a given problem. The time and memory

required depend on several factors, including which test is being performed,

the total sample size, the number of rows and columns, and the specific ar-

rangement of the observations into table cells. Moreover, for a fixed sample

size, time and memory requirements tend to increase as the number of rows

and columns increases, since this corresponds to an increase in the number

of tables in the reference set. Also for a fixed sample size, time and memory

requirements increase as the marginal row and column totals become more

homogeneous, see Agresti et al. (1990) for details.

1.2 Algebraic theory of toric ideals

Following the description of the log-linear models in Section 1.1, we remark

that binomials and power products play a fundamental role in the formal

representation of the log-linear models.

Experts in Commutative Algebra have extensively studied the algebraic relations among power products, leading to the theory of toric ideals. In this section we briefly review the theory of toric ideals, with particular emphasis on the results most relevant to statistical applications. Moreover, we present some recent computational techniques which make the computation of toric ideals feasible.

In the following, some basic notions of Computational Commutative Algebra are needed. In particular, we will use the notions of polynomial ideal, Grobner basis, elimination ideal and saturation. The reader can refer to Kreuzer & Robbiano (2000) or Cox et al. (1992). Many of the applications of the theory of toric ideals in Probability and Statistics come from the fundamental work by Sturmfels (1996).

For the theory of toric ideals we mainly refer to the papers Bigatti &

Robbiano (2001) and Bigatti et al. (1999). The computations in the examples

are obtained with the free software CoCoA, see Capani et al. (2000).


1.2.1 Definitions and first properties

Let K be a numeric field and let K[y1, . . . , ys] and K[ξ1, . . . , ξq] be the polyno-

mial rings in the indeterminates y1, . . . , ys and ξ1, . . . , ξq, respectively. Recall

that a term in a polynomial ring is a power product and that a binomial is a

difference of terms. We denote by Ts the set of all terms in s indeterminates.

Definition 1.16 Let t1, . . . , tq be terms in K[y1, . . . , ys]. The toric ideal I =

I(t1, . . . , tq) ⊆ K[ξ1, . . . , ξq] associated to the set {t1, . . . , tq} is the ideal of all

polynomials p such that p(t1, . . . , tq) = 0.

Note that different sets of power products can produce the same toric ideal. This remark will be analyzed from the point of view of probability models later in the thesis.

In the following, we use the multi-index notation, i.e. y^d means y1^{d1} · · · ys^{ds}. Given D = {d1, . . . , dq} ⊂ Ns such that y^{di} = ti for i = 1, . . . , q, consider the

semigroup homomorphism

T : Nq −→ Ns (1.29)

ei 7−→ di .

We can associate to this map a ring homomorphism

π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys] (1.30)

ξi 7−→ ti .

With these definitions, we refer to the toric ideal I as the toric ideal asso-

ciated to T or the toric ideal associated to π. It is worth noting that the toric

ideal I is the kernel of the ring homomorphism π.

Note that we have π(ξ^u) = y^v = y^{T(u)}, with u ∈ Nq and v ∈ Ns such that v = u1 d1 + · · · + uq dq.
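Membership in the toric ideal can be tested by direct substitution, since I = ker(π). A tiny sketch with sympy, on a toy example of our own (t1 = y1, t2 = y2, t3 = y1y2; x1, x2, x3 stand for ξ1, ξ2, ξ3):

```python
import sympy as sp

y1, y2 = sp.symbols('y1 y2')
x1, x2, x3 = sp.symbols('x1 x2 x3')   # stand-ins for xi_1, xi_2, xi_3

# toy set of terms: t1 = y1, t2 = y2, t3 = y1*y2
substitution = {x1: y1, x2: y2, x3: y1 * y2}

p = x1 * x2 - x3          # candidate binomial
print(sp.expand(p.subs(substitution)))   # 0, hence p lies in the toric ideal
```

Any polynomial that does not vanish after the substitution, such as x1 − x3 here, is not in the ideal.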

The following results collect the main properties of toric ideals that are of statistical interest.

Proposition 1.17 Let T : Nq −→ Ns be a semigroup homomorphism and

let us denote by π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys] the corresponding ring

homomorphism. The toric ideal I associated to π is generated, as vector space,

by the binomials of the set

{ξu − ξv : T (u) = T (v)} . (1.31)


Proof. Choose a term-ordering τ on Ts and suppose there exists a polynomial g in I which cannot be written as a linear combination of polynomials in {ξu − ξv : T(u) = T(v)}. Choose g ≠ 0 such that the leading term LT(g) = ξa is minimal with respect to τ.

As g ∈ ker(π), we have g(t1, . . . , tq) = 0. In particular, the term yT(a) must be eliminated, and so g contains another monomial ξb ≺τ ξa such that T(b) = T(a).

Now, the polynomial g′ = g − ξa + ξb is non-zero, LT(g′) ≺τ LT(g), and it cannot be written as a linear combination of polynomials in {ξu − ξv : T(u) = T(v)}. This is a contradiction, because we supposed LT(g) minimal.

Now we define

T : Zq −→ Zs (1.32)

the extension of T to the group Zq. As any u ∈ Zq can be written in the form u = u+ − u−, where u+ and u− are the positive and negative parts of u respectively, the above result together with the standard theory of Grobner bases leads to the following proposition.

Proposition 1.18 For any term-ordering τ on Tq, there exists a finite set of

vectors Gτ ⊂ ker(T ) such that the reduced Grobner basis of I with respect to τ

is

{ξu+ − ξu− : u ∈ Gτ} . (1.33)

Proof. The result follows from Proposition 1.17 and from standard results

about Grobner bases.

Theorem 1.19 Let t1, . . . , tq be terms in the indeterminates y1, . . . , ys and let

J be the ideal in K[ξ1, . . . , ξq, y1, . . . , ys] generated by {ξ1− t1, . . . , ξq− tq}. We

have that

1. I = J ∩K[ξ1, . . . , ξq];

2. if Gτ is a Grobner basis of J with respect to an elimination ordering τ for

{y1, . . . , ys}, then I is generated by the polynomials in Gτ ∩K[ξ1, . . . , ξq].

For the proof, see Bigatti & Robbiano (2001), Proposition 1.4.

Moreover, further useful properties of toric ideals are summarized in the following theorem. We will see later in the thesis the relevance of these properties.


Theorem 1.20 In the previous settings, consider any grading on the polynomial ring K[ξ1, . . . , ξq, y1, . . . , ys] such that the degrees of the yj are arbitrary integers and deg(ξi) = deg(ti), i = 1, . . . , q. Then

1. the ideal I is prime;

2. the ideal I is generated by pure binomials, i.e.

I = Ideal(s1 − s2 : π(s1) = π(s2), gcd(s1, s2) = 1) , (1.34)

where s1 and s2 run over the terms in K[ξ1, . . . , ξq];

3. the ideal I is homogeneous.

For the proof, see Bigatti & Robbiano (2001), Theorem 1.6.

The variety Variety(I) is the geometric counterpart of the ideal I. A

point (ξ1, . . . , ξq) belongs to Variety(I) if and only if p(ξ1, . . . , ξq) = 0 for all

polynomials p ∈ I. This condition can be verified making use of a system of

generators of the ideal I. For details about the relationships between ideals

and varieties, see for example Cox et al. (1992).

1.2.2 Computation of toric ideals. First method

A simple algorithm to compute the Grobner basis of the toric ideal I = I(t1, . . . , tq) is based on the elimination algorithm, as described in Theorem 1.19.

We first consider the homomorphism π : K[ξ1, . . . , ξq] −→ K[y1, . . . , ys]

defined by ξi 7−→ ti; then, we consider the ideal J in K[ξ1, . . . , ξq, y1, . . . , ys]

generated by the set of polynomials

{ξ1 − t1, . . . , ξq − tq} . (1.35)

Using the results of Theorem 1.19, a (reduced) Grobner basis of I is obtained by computing a (reduced) Grobner basis of J with respect to an elimination term-ordering for y1, . . . , ys and keeping only the polynomials not involving the y's.

Example 1.21 Consider the polynomial ring K[y1, . . . , y5] and the following

set of six terms:

{y1y4, y1y5, y2y4, y2y5, y3y4, y3y5} (1.36)

We will consider this example later in the thesis, viewed from a statistical

point of view.


In order to compute the toric ideal I = I(y1y4, y1y5, y2y4, y2y5, y3y4, y3y5)

we introduce the polynomial ring K[ξ1, . . . , ξ6] and we define the ring homo-

morphism π as described in Equation (1.30). The toric ideal of interest I is

the kernel of π. To compute this kernel, we define the ideal

J = Ideal(ξ1 − y1y4, ξ2 − y1y5, ξ3 − y2y4, ξ4 − y2y5, ξ5 − y3y4, ξ6 − y3y5)

in the polynomial ring K[ξ1, . . . , ξ6, y1, . . . , y5]. Using CoCoA, the Grobner

basis of this ideal with respect to an elimination term ordering for the inde-

terminates y1, . . . , y5 is

G(J) = { −y3y5 + ξ6, −y3y4 + ξ5, −y2y5 + ξ4, −y2y4 + ξ3,
−y1y5 + ξ2, −y1y4 + ξ1, ξ5y5 − ξ6y4, ξ3y5 − ξ4y4, ξ1y5 − ξ2y4,
ξ4y3 − ξ6y2, ξ2y3 − ξ6y1, ξ2y2 − ξ4y1, ξ3y3 − ξ5y2, ξ1y3 − ξ5y1,
ξ1y2 − ξ3y1, ξ4ξ5 − ξ3ξ6, ξ2ξ5 − ξ1ξ6, ξ2ξ3 − ξ1ξ4 }

and the only polynomials which do not involve the y's are the last three.

Thus

I = Ideal(ξ4ξ5 − ξ3ξ6, ξ2ξ5 − ξ1ξ6, ξ2ξ3 − ξ1ξ4) (1.37)

and the Grobner basis of the relevant toric ideal consists of three binomials.
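The same elimination computation can be reproduced with any system supporting Grobner bases; here is a sketch using sympy instead of CoCoA (the names x1, . . . , x6 stand for ξ1, . . . , ξ6):

```python
import sympy as sp

ys = sp.symbols('y1:6')   # y1..y5
xs = sp.symbols('x1:7')   # x1..x6, standing for xi_1..xi_6
y1, y2, y3, y4, y5 = ys
terms = [y1*y4, y1*y5, y2*y4, y2*y5, y3*y4, y3*y5]

# J = Ideal(xi_i - t_i); a lex order listing the y's first eliminates them
G = sp.groebner([x - t for x, t in zip(xs, terms)], *ys, *xs, order='lex')
basis_I = [g for g in G.exprs if not g.free_symbols & set(ys)]
print(basis_I)
```

Filtering the lex Grobner basis for polynomials free of the y's should recover the three binomials of Equation (1.37), up to sign.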

This first method, although intuitive, is not the best from the computa-

tional point of view. Better methods for computing toric ideals can be found in

Bigatti et al. (1999). These methods are based on the theory of saturation. We will illustrate such methods in Section 1.2.4.

1.2.3 Navigating inside the set of solutions of an integer linear system

The saturation of an ideal with respect to a polynomial is given by the fol-

lowing construction. Consider a polynomial p in K[ξ1, . . . , ξr] and an ideal

B ⊂ K[ξ1, . . . , ξr]. The ideal I is the saturation of B with respect to p if

I = Elim(v,B + Ideal(pv − 1)) (1.38)

where v is a new indeterminate and the computations are made in the poly-

nomial ring K[ξ1, . . . , ξr, v].

Let t1, . . . , tq be power products in the polynomial ring K[y1, . . . , ys], i.e.

ti = y1^{ai1} · · · ys^{ais} , i = 1, . . . , q . (1.39)


We can define the matrix A ∈ Matq×s(N) associated to the power products.

The matrix A has non-negative entries. Conversely, any matrix A defines a

set of power products using Equation (1.39).

Definition 1.22 Let A ∈ Matq×s(N). We define the toric ideal associated to A as the toric ideal IA = I(t1, . . . , tq), where the terms ti are defined in Equation (1.39).

Now, we investigate the relation between the toric ideal IA and the kernel

of A viewed as Z-module. First, we need some technical definitions.

Consider the homogeneous system of Diophantine equations associated to A:

a11 z1 + a21 z2 + · · · + aq1 zq = 0
· · ·
a1s z1 + a2s z2 + · · · + aqs zq = 0    (1.40)

where aij is the (i, j)-th element of the matrix A, i.e. the system A^T z = 0. The set of integer solutions

ker(A) of such system is a Z-module. In the sequel, we will use the decompo-

sition of a vector into its positive and negative parts, that is v = v+ − v−.

Now, let Bin(K[ξ1, . . . , ξq]) be the set of all binomials in K[ξ1, . . . , ξq] and

define the maps ρ′ : Zq −→ K[ξ1, . . . , ξq] and η′ : Bin(K[ξ1, . . . , ξq]) −→ Zq as

ρ′(u) = ξu+ − ξu− (1.41)

and

η′(ξu − ξv) = u− v . (1.42)

Definition 1.23 Let B = {v1, . . . , vr} be a set of vectors. If B generates

ker(A) as Z-module, the ideal I(B) = I(ρ′(v1), . . . , ρ′(vr)) is called a lattice

ideal associated to ker(A).

Let π = ξ1 · · · ξq be the product of all the ξ indeterminates. The main theorem about the application of saturation for computing toric ideals is the following.

Theorem 1.24 Let B = {v1, . . . , vr} ⊆ ker(A). The following conditions are

equivalent.

a) I(B) : π∞ = I(A);

b) I(B)Pπ = I(A)Pπ;


c) B is a set of generators of ker(A), i.e. I(B) is a lattice ideal.

For the proof, see Bigatti & Robbiano (2001), Theorem 2.10. For further

details about Theorem 1.24, see also Sturmfels (1996).

Now, let A ∈ Matq×s(N) and let b = (b1, . . . , bs) be a vector in Ns. We want to find the non-negative solutions of the system

a11 z1 + a21 z2 + · · · + aq1 zq = b1
· · ·
a1s z1 + a2s z2 + · · · + aqs zq = bs    (1.43)

The following result leads easily to a method for navigating inside the set of solutions of a Diophantine system.

Theorem 1.25 Let S be a Diophantine system as defined in Equation (1.43). Let ti, i = 1, . . . , q, be the power products associated to the matrix A and let t = y1^{b1} · · · ys^{bs}. Moreover, let (α1, . . . , αq) ∈ Nq. Let J be the ideal in the polynomial ring K[ξ1, . . . , ξq, y1, . . . , ys] generated by the set of binomials {ξ1 − t1, . . . , ξq − tq}. The following conditions are equivalent.

a) The vector (α1, . . . , αq) is a solution of S.

b) There is an equality of power products t1^{α1} · · · tq^{αq} = y1^{b1} · · · ys^{bs}.

c) The binomial y1^{b1} · · · ys^{bs} − ξ1^{α1} · · · ξq^{αq} is in J.

For the proof, see Bigatti & Robbiano (2001), Proposition 3.2.

We will see an important example of this kind of navigation in the following chapter, in the framework of contingency tables.

1.2.4 Computation of toric ideals. Second method

In view of the results in Section 1.2.3, and in particular Theorem 1.24, we can

compute the toric ideal with the following steps:

• consider the matrix representation A of the power products;

• compute a lattice basis B of ker(A) and then the lattice ideal I(B);

• compute the saturation of I(B) with respect to the polynomial π = ξ1 · · · ξq.
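The first two steps can be sketched with sympy: compute a basis of ker(A) = {z : A^T z = 0} over the rationals and clear denominators to obtain integer lattice vectors (our own sketch; the saturation step itself is left to a dedicated system such as CoCoA):

```python
import sympy as sp

# exponent matrix of the power products from Example 1.21, cf. (1.44)
A = sp.Matrix([[1, 0, 0, 1, 0],
               [1, 0, 0, 0, 1],
               [0, 1, 0, 1, 0],
               [0, 1, 0, 0, 1],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 0, 1]])

# ker(A) = {z in Z^6 : A^T z = 0}; compute over Q and clear denominators
lattice_basis = []
for v in A.T.nullspace():
    denoms = [c.q for c in v]          # denominators of the rational entries
    lattice_basis.append(v * sp.ilcm(*denoms))
print([list(w) for w in lattice_basis])
```

The two resulting integer vectors span ker(A) over Q; the lattice ideal I(B) is then built from their positive and negative parts.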


Example 1.26 Consider again the problem of Example 1.21. Now we apply

the saturation-based algorithm.

The matrix representation of the power products in Equation (1.36) is

A =

1 0 0 1 0
1 0 0 0 1
0 1 0 1 0
0 1 0 0 1
0 0 1 1 0
0 0 1 0 1    (1.44)

A lattice basis of ker(A) is formed by the two vectors

(1,−1,−1, 1, 0, 0) and (1,−1, 0, 0,−1, 1) .

Thus, the lattice ideal I(B) is Ideal(ξ1ξ4 − ξ2ξ3, ξ1ξ6 − ξ2ξ5). The ideal J = I(B) + Ideal(πv − 1) is generated by the polynomials

G(J) = { −ξ2ξ5 + ξ1ξ6, −ξ2ξ3 + ξ1ξ4, −ξ1^2ξ3ξ4ξ6^2 v + 1,
−ξ1^2ξ3^2ξ6^3 v + ξ5, −ξ1^3ξ4^2ξ6^2 v + ξ2, ξ4ξ5 − ξ3ξ6 }    (1.45)

(1.45)

and, eliminating the v indeterminate, we obtain the same result as in Example 1.21, with a significant computational improvement.

We will consider again Examples 1.21 and 1.26, after the discussion of the

Diaconis-Sturmfels algorithm.

Note that the first step of the algorithm, i.e. the computation of a basis of ker(A), can be performed using the theory of vector spaces, with a notable saving of time. A comparison of the different timings between the algorithms

introduced here can be found in Bigatti et al. (1999). The saturation-based

algorithm is implemented in the CoCoA function Toric, which computes a

system of generators of the toric ideal, starting from the matrix representation

of the power products. We will use CoCoA for all symbolic computations

presented in this thesis.


Chapter 2

The Diaconis-Sturmfels algorithm

In the present chapter we present an algorithm for sampling from conditional distributions on finite sample spaces, first introduced by Diaconis & Sturmfels (1998). While in the original paper the attention is restricted to exponential models, in this chapter we show that the algorithm can be applied to a more general class of probability models, namely all probability models which satisfy a linearity property of the sufficient statistic.

In Section 2.1, we introduce the multivariate hypergeometric distribution,

showing how that distribution is essential in the problems of estimation and

hypothesis testing. In Section 2.2, we describe the algorithm, and we discuss

the relevance of the use of toric ideals. The algorithm is a Markov chain Monte Carlo method, based on the key notion of Markov basis. The links between the

theorems proving the validity of the algorithm and the theory of toric ideals

presented in Chapter 1 will be pointed out. In Section 2.3 we briefly present

some results about the rates of convergence of the Markov chain and in Section

2.4 we present two simple examples.

2.1 Hypergeometric distribution

This Section is a slight generalization of Diaconis & Sturmfels (1998), Section

1, where this theory is presented under the exponential model.

Let X be a finite set and let

P[x] = ϕ(T (x)) x ∈ X , (2.1)


be a probability model with sufficient statistic T : X −→ Ns and such that the distribution of a sample of independent and identically P-distributed random variables X = (X1, . . . , XN) is of the form

P∗N = ψ(TN) where TN = Σ_{k=1}^{N} T(Xk) . (2.2)

This means that the sufficient statistic of X is the sum of the sufficient statistics

of the one-dimensional random variables Xk, k = 1, . . . , N .

This model contains the usual exponential models of the form

Pθ[x] = c(θ) e^{⟨θ, T(x)⟩} ,

but other ones are included, such as the models of the form

Pθ[(x, y)] = c(θ) e^{θx + θ²y} .

We denote

Yt = { (x1, . . . , xN) : Σ_{k=1}^{N} T(xk) = t } , (2.3)

i.e., the set of all samples with fixed value t of the sufficient statistic TN. It is known that the distribution of X given {TN = t} is uniform on Yt. In fact

P∗N[X = x | TN = t] = P∗N[x] / Σ_{y : TN(y) = t} P∗N[y] = 1 / #{y : TN(y) = t} = 1 / #Yt

for all x = (x1, . . . , xN) ∈ Yt.

As presented in the previous chapter, standard statistical techniques, such as bootstrap methods, need uniformly distributed samples on Yt. Even in the simplest cases, the set Yt is very large and difficult to enumerate, and it is not easy to sample directly from Yt with classical Monte Carlo methods. Refer in particular to the discussion in Chapter 1, Section 1.1.5.

As we are in the finite case, the problem can be reduced to a more suitable

form, in terms of counting. Then, we consider the reference set Ft introduced

in Chapter 1

Ft = { f : X −→ N : Σ_{x∈X} f(x) T(x) = t } , (2.4)

i.e., the set of all frequency tables obtained from samples with value t of the sufficient statistic TN.


The relation between Yt and Ft is given by the map F : Yt −→ Ft defined

by

F(x1, . . . , xN) = Σ_{x∈X} ex Σ_{k=1}^{N} I{xk = x} (2.5)

which associates to every sample the corresponding frequency table. Here

which associates to every sample the corresponding frequency table. Here

(ex)x∈X denotes the canonical basis of RX . Note that the additivity condition

(2.2), together with the conditions imposed in the definitions of Yt and Ft, is

essential for the following composition: the map F : Yt −→ Ft followed by the map f ↦ Σ_{x∈X} f(x)T(x) from Ft to Rs coincides with TN : Yt −→ Rs, i.e., the corresponding triangular diagram commutes.

The image probability of P∗N[ · | TN = t] induced by F is

Ht(f) = P∗N[F^{−1}(f) | TN = t] = #{x ∈ Yt : F(x) = f} / #Yt , (2.6)

which is by definition the hypergeometric distribution on Ft. Simple computations show that

Ht(f) = (N! / #Yt) Π_{x∈X} (f(x)!)^{−1} . (2.7)

For further details on the hypergeometric distribution, see Bishop et al. (1975) and Agresti (2002). However, the MCMC methods presented in this thesis do not need the computation of #Yt, as will be clear in the next Section.
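The identity between the definition (2.6) and the closed form (2.7) can be checked by brute force on a toy model; the sample space, sufficient statistic and the values of N and t below are our own choices:

```python
from itertools import product
from math import factorial
from collections import Counter

X = ['a', 'b', 'c']
T = {'a': 1, 'b': 1, 'c': 2}   # toy sufficient statistic
N, t = 3, 4

# Y_t: all ordered samples of size N with total sufficient statistic t
Yt = [s for s in product(X, repeat=N) if sum(T[x] for x in s) == t]

# H_t from the definition (2.6): push the uniform law on Y_t through F
freq = lambda s: tuple(Counter(s)[x] for x in X)
H_def = Counter(freq(s) for s in Yt)
H_def = {f: c / len(Yt) for f, c in H_def.items()}

# H_t from the closed form (2.7)
def H_formula(f):
    p = factorial(N) / len(Yt)
    for fx in f:
        p /= factorial(fx)
    return p

print(len(Yt), H_def[(1, 1, 1)])
```

Here every admissible sample contains one 'c' and two letters from {a, b}, so #Yt = 12 and the frequency table (1, 1, 1) receives probability 6/12 = 1/2 under both computations.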

2.2 The algorithm

As pointed out in Chapter 1, in order to perform hypothesis testing for con-

tingency tables, the usual approach is the asymptotic one, which involves chi-

squared distributions, but in many cases, especially when the table is sparse,

the chi-squared approximation may not be adequate (for further details on

this topic see, for example, Appendix IV of Fienberg (1980)). We can obtain Monte Carlo approximations of the distributions of the test statistics by drawing an iid hypergeometric sample of contingency tables in the reference set Ft.

The problem is then reduced to sampling from the hypergeometric distribution on Ft, or equivalently from the uniform distribution on Yt. The literature suggests avoiding the enumeration problem via the use of Markov chain Monte Carlo (MCMC) methods. In particular, we are interested in the Metropolis–Hastings algorithm, which is based on a set of moves for constructing the relevant Markov chain. A review of the Metropolis–Hastings algorithm can be found in Chib & Greenberg (1995).

The following is a key definition in that theory.

Definition 2.1 A Markov basis of Ft is a set of functions m1, . . . , mL : X −→ Z, called moves, such that for any 1 ≤ i ≤ L

Σ_{x∈X} mi(x) T(x) = 0 , (2.8)

where T is the sufficient statistic, and such that for any f, f′ ∈ Ft there exist a sequence of moves (mi1 , . . . , miA) and a sequence of signs (εj), j = 1, . . . , A, with εj = ±1, such that

f′ = f + Σ_{j=1}^{A} εj mij (2.9)

and

f + Σ_{j=1}^{a} εj mij ≥ 0 (2.10)

for all 1 ≤ a ≤ A.

Condition (2.8) implies that a move is a table with integer (possibly negative) entries and such that the value of the sufficient statistic is constant for every table obtained with moves in {m1, . . . , mL}. Note once again the importance

of the linearity condition for the sufficient statistic T . In particular, if the

components of the sufficient statistic are the margins, then every move is a

table with null margins.

For example, if we consider 3 × 3 tables, the following is a move:

 0  +2   0
+1  −1   0
 0  −1  +1

From this definition it is clear that a Markov basis is the main tool to define a random-walk-like Markov chain on Ft. It is well known that a connected,

reversible and aperiodic Markov chain converges to the stationary distribution.

In particular, we use here the following theorem.


Theorem 2.2 Given a Markov basis {m1, . . . , mL}, generate a Markov chain on Ft by choosing I uniformly in {1, . . . , L}. If the chain is currently at g ∈ Ft, determine the set of j ∈ Z such that g + jmI ∈ Ft. Choose j in this set with probability proportional to

Π_{x∈CI} {(g(x) + j mI(x))!}^{−1} (2.11)

with CI = {x : mI(x) ≠ 0}. This is a connected, reversible, aperiodic Markov chain with stationary distribution Ht.

Proof. It is easy to see that the product in Equation (2.11) is proportional to the stationary distribution Ht, and then the Markov chain is reversible with respect to Ht. As {m1, . . . , mL} is a Markov basis, from property (2.9) it follows that the Markov chain is connected. To show that the chain is aperiodic, choose a move and apply it until some cell count is zero, obtaining a strictly positive holding probability.

In practice, to obtain a sample from the distribution of interest σ(f) on

Ft, the Markov chain is performed as follows:

(a) at time 0 the chain is in f ;

(b) choose a move m uniformly in the Markov basis and ε = ±1 with prob-

ability 1/2 each independently from m;

(c) if f + εm ≥ 0 then move the chain from f to f + εm with probability

min{σ(f + εm)/σ(f), 1}; in all other cases, stay at f .
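For 2 × 2 tables with fixed margins the Markov basis reduces to the single move m = [[+1, −1], [−1, +1]], and steps (a)-(c) with a uniform target σ take the following form (a minimal sketch of our own, not a general implementation):

```python
import random
from collections import Counter

def metropolis_uniform(start, steps, seed=0):
    # steps (a)-(c) for 2x2 tables, uniform target: accept whenever admissible
    rng = random.Random(seed)
    f = [row[:] for row in start]
    visits = Counter()
    for _ in range(steps):
        eps = rng.choice([+1, -1])         # the single move, with a random sign
        g = [[f[0][0] + eps, f[0][1] - eps],
             [f[1][0] - eps, f[1][1] + eps]]
        if min(g[0] + g[1]) >= 0:          # f + eps*m >= 0: admissible
            f = g
        visits[tuple(f[0] + f[1])] += 1
    return visits

visits = metropolis_uniform([[1, 1], [1, 1]], 30000)
print(sorted(visits.items()))
```

With a uniform target the acceptance probability is 1 whenever the proposed table is non-negative, so the chain is a random walk with holding at the boundary; the three tables of the reference set are visited with approximately equal frequency.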

The transition matrix of this Markov chain is given by

P(f, f + m) = (1/(2L)) min{σ(f + m)/σ(f), 1} if f + m ≥ 0
P(f, f − m) = (1/(2L)) min{σ(f − m)/σ(f), 1} if f − m ≥ 0

for all m ∈ {m1, . . . , mL} and P(f, f) chosen in order to have a stochastic matrix. For example, if σ is the uniform distribution, we have

P(f, f + m) = 1/(2L) if f + m ≥ 0
P(f, f − m) = 1/(2L) if f − m ≥ 0


and P(f, f) is simply the ratio of the number of inadmissible moves among {±m1, . . . , ±mL} over the total number of moves.

Moreover, note that in the hypergeometric case the computation of the ratio σ(f + m)/σ(f) leads to some simplifications. In fact

Ht(f + m)/Ht(f) = Π_{x∈X} f(x)! / Π_{x∈X} (f(x) + m(x))! = Π_{x : m(x)≠0} f(x)! / (f(x) + m(x))! . (2.12)
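The simplification in (2.12) is easy to check numerically: the factorials of the cells not touched by the move cancel. A tiny sketch with toy values of our own:

```python
from math import factorial, prod

f = {0: 4, 1: 1, 2: 2, 3: 5}        # a toy frequency table (cells 0..3)
m = {0: +1, 1: -1, 2: 0, 3: 0}      # a toy move

# full ratio over all cells, as in the middle of (2.12)
full = prod(factorial(f[x]) for x in f) / prod(factorial(f[x] + m[x]) for x in f)
# restricted to the support of the move, as on the right of (2.12)
short = prod(factorial(f[x]) / factorial(f[x] + m[x]) for x in f if m[x] != 0)
print(full, short)
```

Both expressions give 4!/5! · 1!/0! = 0.2, but the second one only touches the cells in the support of the move, which is what makes each MCMC step cheap.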

Let us recall here the basic convergence theorem for the Metropolis–Hastings algorithm in our framework.

Theorem 2.3 Let σ(f) be a positive function on Ft. Given a Markov basis {m1, . . . , mL}, the Markov chain generated following the described algorithm is

connected, reversible and aperiodic on Ft with stationary distribution propor-

tional to σ(f).

Proof. The proof is the same as for Theorem 2.2, considering a general

stationary distribution σ(f) instead of the hypergeometric distribution Ht.

Now, consider an indeterminate for each sample point, that is, q indeterminates ξ1, . . . , ξq, and let K[ξ1, . . . , ξq] be the polynomial ring in the indeterminates ξ1, . . . , ξq with coefficients in the numeric field K. Moreover, let M = {m1, . . . , mL} be a set of moves. We define the ideals

IT = Ideal(ξ^a − ξ^b : T(a) = T(b)) (2.13)

IM = Ideal(ξ^{mi+} − ξ^{mi−} : i = 1, . . . , L) (2.14)

Note that the ideal IT is a toric ideal as T is a semigroup homomorphism.

The application of the theory of toric ideals comes from the following result.

Theorem 2.4 The Markov chain with moves in M is connected if and only

if IM = IT .

Proof. “only if”: Notice that IM ⊆ IT is obvious. Moreover, if the Markov chain is connected, then for any a, b with T(a) = T(b) the term ξ^a can be written as

ξ^a = ξ^b + Σ_{j=1}^{A} εj ξ^{hj} (ξ^{m+_{ij}} − ξ^{m−_{ij}}) (2.15)

for a suitable sequence of moves (mi1 , . . . , miA) in M, a suitable sequence of signs (ε1, . . . , εA), and suitable monomials ξ^{hj}. This implies that ξ^a − ξ^b ∈ IM, and then IT ⊆ IM.

“if”: For all g, g′ such that Σ_{x∈X} (g(x) − g′(x)) T(x) = 0, there is a representation of the form

ξ^g − ξ^{g′} = Σ_{j=1}^{L} εj ξ^{hj} (ξ^{m+_{ij}} − ξ^{m−_{ij}}) (2.16)

with εj = ±1 for all j. If L = 1, the above equation reduces to the connectedness condition. If L > 1, the result follows by induction. In fact, from Equation (2.16) it follows that either ξ^g = ξ^{hr} ξ^{m+_{ir}} or ξ^g = ξ^{hr} ξ^{m−_{ir}} for some r = 1, . . . , L. Suppose for example ξ^g = ξ^{hr} ξ^{m+_{ir}}. Then g − m+_{ir} = hr is non-negative, and so g − mir = hr + m−_{ir} is non-negative. Subtracting ξ^{hr}(ξ^{m+_{ir}} − ξ^{m−_{ir}}) from both sides and using hr + m−_{ir} = g − mir, we obtain an expression for ξ^{g − mir} − ξ^{g′} of length L − 1. By induction, g − mir can be connected to g′ in L − 1 steps. The proof is now complete.

Theorem 2.4 also gives the relation between moves and binomials. If m = m+ − m− is a move, the corresponding binomial is ξ^{m+} − ξ^{m−}, and vice versa. For example, the move presented above for the 3 × 3 tables

 0  +2   0
+1  −1   0
 0  −1  +1

is represented by the binomial

gm = ξ12^2 ξ21 ξ33 − ξ22 ξ32

in the polynomial ring Q[ξ11, ξ12, ξ13, ξ21, ξ22, ξ23, ξ31, ξ32, ξ33].
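The move-to-binomial correspondence is mechanical: raise the indeterminate of each cell to the positive part of the move on one side and to the negative part on the other. A sketch with sympy (the indeterminate names are ours):

```python
import sympy as sp

cells = [(i, j) for i in range(1, 4) for j in range(1, 4)]
xi = {c: sp.Symbol('xi%d%d' % c) for c in cells}

def move_to_binomial(move):
    # xi^{m+} - xi^{m-}: split the move into positive and negative parts
    plus = sp.prod([xi[c] ** v for c, v in move.items() if v > 0])
    minus = sp.prod([xi[c] ** (-v) for c, v in move.items() if v < 0])
    return plus - minus

move = {(1, 2): +2, (2, 1): +1, (3, 3): +1, (2, 2): -1, (3, 2): -1}
print(move_to_binomial(move))   # the binomial xi12^2*xi21*xi33 - xi22*xi32
```

Applying the binomial as a rewriting rule on monomials is exactly the addition or subtraction of the move on the corresponding contingency table.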

Thus, in view of Theorem 2.4, in order to compute a Markov basis, one computes the toric ideal associated with the sufficient statistic and defines the moves from the binomials following the above rule.
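As an illustration of this rule, here is a minimal Python sketch (the dict-based monomial representation and the function names are our own illustrative choices, not notation from the text) splitting a move into its positive and negative parts and back:

```python
# Sketch of the move <-> binomial correspondence of Theorem 2.4.
# A move m is split as m = m+ - m-; the binomial is xi^(m+) - xi^(m-).
# Monomials are represented as dicts mapping a cell (i, j) to its exponent.

def move_to_binomial(move):
    """Return the (positive, negative) exponent dicts of a move."""
    pos = {cell: v for cell, v in move.items() if v > 0}
    neg = {cell: -v for cell, v in move.items() if v < 0}
    return pos, neg

def binomial_to_move(pos, neg):
    """Recover the move m = m+ - m- from the two exponent dicts."""
    move = dict(pos)
    for cell, v in neg.items():
        move[cell] = move.get(cell, 0) - v
    return move

# The 3x3 move shown above: +2 in cell (1,2), +1 in (2,1) and (3,3),
# -1 in (2,2) and (3,2); its binomial is xi_12^2 xi_21 xi_33 - xi_22 xi_32.
m = {(1, 2): 2, (2, 1): 1, (3, 3): 1, (2, 2): -1, (3, 2): -1}
plus, minus = move_to_binomial(m)
assert plus == {(1, 2): 2, (2, 1): 1, (3, 3): 1}
assert minus == {(2, 2): 1, (3, 2): 1}
assert binomial_to_move(plus, minus) == m
```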

Note that the theory of Markov bases does not need the notion of Gröbner basis; it is sufficient to consider the ideals and their sets of generators. Gröbner bases will be used below as a computational tool. Moreover, the above theory can be applied to contingency tables by observing that: 1) the (reduced) Gröbner basis of a toric ideal is formed only by binomials; 2) applying the division algorithm to a monomial with respect to a binomial Gröbner basis, the dividends and the normal form are all monomials. This means that at every step of the algorithm we have a monomial, or equivalently that the algebraic rules transform a contingency table into another contingency table.


2.3 Rates of convergence

In general, the Markov chain described in Section 2.2 needs some running time

to reach the stationary distribution.

There are no general results for the rates of convergence of the chain, but there are a number of specific results that apply in special frameworks. Roughly speaking, the rate of convergence depends on the diameter γ of the random walk, i.e., the diameter of the graph having the points of Ft as vertices, with two points f and f′ connected if and only if f′ can be reached from f in one step.

Here is an example of such a theorem, taken from Diaconis & Sturmfels (1998). Consider a sample space of the form {1, . . . , I} × {1, . . . , J}, the row sums and the column sums as components of the sufficient statistic, and let U be the uniform distribution on the reference set Ft.

Theorem 2.5 In the previous settings, let K(f, g) be the connected Markov chain on Ft based on the moves of the form

+1 −1
−1 +1

for any 2 × 2 minor of the table and zero otherwise. Then,

‖K^h(x, f) − U‖_TV ≤ A1 exp(−A2 c)   (2.17)

for h = cγ^2, where ‖ · ‖_TV denotes the total variation distance.

Refer to Diaconis & Sturmfels (1998), Section 2.3, for a general survey on

other results and pointers to literature. However, in the case of sample spaces

with a more complex geometrical structure, no results are available.

2.4 Two examples

We conclude this chapter with the illustration of two examples.

Example 2.6 Consider a sample space formed by 6 points, say

X = {a(1), . . . , a(6)}


and consider the 5-dimensional sufficient statistic with components

T1 = F (a(1)) + F (a(2))

T2 = F (a(3)) + F (a(4))

T3 = F (a(5)) + F (a(6))

T4 = F (a(1)) + F (a(3)) + F (a(5))

T5 = F (a(1)) + F (a(2)) + F (a(6))

The matrix representation of this sufficient statistic produces the matrix analyzed in Example 1.26 in Chapter 1. The Gröbner basis of the toric ideal is

{ξ4ξ5 − ξ3ξ6, ξ2ξ5 − ξ1ξ6, ξ2ξ3 − ξ1ξ4}

corresponding to the three moves

(0, 0, 1,−1,−1, 1)

(1,−1, 0, 0,−1, 1)

(1,−1,−1, 1, 0, 0)

We will consider again this example in the next chapter in the framework of

the independence model.

The above example belongs to a special class where the components of the sufficient statistic are the counts over subsets of the sample space. The following example is not in that class; the statistical literature calls such a model a logistic regression model.

Example 2.7 Suppose that the sample space is

X = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)} ⊂ N2

and consider the 5-dimensional sufficient statistic with components

T1 = F (1, 1) + F (1, 2)

T2 = F (2, 1) + F (2, 2)

T3 = F (3, 1) + F (3, 2)

T4 = F(1, 2) + F(2, 2) + F(3, 2)

T5 = F(1, 2) + 2F(2, 2) + 3F(3, 2)


Using the matrix representation of this sufficient statistic and computing the toric ideal associated to that matrix, we obtain the following Gröbner basis

{ξ11 ξ22^2 ξ31 − ξ12 ξ21^2 ξ32}

corresponding to a Markov basis with only one move

+1 −1
−2 +2
+1 −1


Chapter 3

Two-way contingency tables

In the present chapter, we restrict our attention to two-way contingency tables. In particular, we consider the application of the algorithm presented in

the previous chapter to a wide class of log-linear models. Throughout this

chapter we will present some new results which make explicit the computation

of Markov bases for some classical log-linear models for two-way contingency

tables, such as quasi-independence and quasi-symmetry. For a discussion on

the meaning of the different log-linear models in the two-way case, the reader

can refer to Fienberg (1980), Fingleton (1984) and Agresti (2002), where a

number of applications of log-linear models to problems from biology, psychology and econometrics are presented.

The notation is the same as in the previous chapters, except for the cell frequencies, where we will use the standard notation in the analysis of contingency tables, e.g. we will write fij instead of f(i, j). This choice also saves space in the proofs.

In Section 3.1 we adapt the notation of Chapters 1 and 2 to the framework of contingency tables, starting with the independence model for complete tables, as it represents an easy prototype from both the algebraic and the statistical point of view. In Section 3.2 we present new characterizations or computations of the Markov bases for the most widely used models for two-way contingency tables, including models for tables with structural zeros. In Section 3.3 we analyze three examples of goodness-of-fit tests for the quasi-independence and the quasi-symmetry models. The first one is a standard case where the asymptotics are correct; the second one is a case where the Monte Carlo p-value leads to a different decision with respect to the asymptotic p-value; the third one, from Agresti (1996), shows a case where the asymptotics dramatically fail.

Finally, Section 3.4 shows the application of the methodology presented here



to a class of problems which frequently occur in biological and medical research.

3.1 Two-way contingency tables: Independence

In this section we consider I × J contingency tables containing the counts generated by two categorical variables.

Let us denote the observed count for the (i, j) cell by fij and the totals

for the ith row and jth column by fi+ and f+j respectively. Finally, let N = Σ_{i=1}^{I} fi+ = Σ_{j=1}^{J} f+j be the sample size. We emphasize that this notation is

easily extendable to the multi-way contingency tables and also the topics of

this Section can be extended to the multi-way case. We will see examples of

log-linear models for multi-way contingency tables in Chapter 4.

Using the same notation as in Chapters 1 and 2, the sample space is X =

{(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J} and the vectors r = (f1+, . . . , fI+) and

c = (f+1, . . . , f+J) are the components of the sufficient statistic T . In the

independence case, the probability model can be written as

∏_{i=1}^{I} pr(i)^{fi+} ∏_{j=1}^{J} pc(j)^{f+j} ,   (3.1)

where pr(i) and pc(j), with i = 1, . . . , I and j = 1, . . . , J , are the parameters

of the marginal distributions.

Every margin, as a component of the sufficient statistic, is additive, and therefore this probability model is coherent with the above theory. If we denote the set

of all contingency tables with fixed margins by F(r,c), the following result holds

true.

Proposition 3.1 The probability distribution on F(r,c) is the hypergeometric

distribution, and its explicit formula is

H(r,c)(f) = ∏_{j=1}^{J} C(f+j ; f1j, . . . , fIj) / C(N ; f1+, . . . , fI+) ,   (3.2)

where C(n ; k1, . . . , kI) denotes the multinomial coefficient n!/(k1! · · · kI!).

Proof. The probability of (f11, . . . , f1J) in the first row of the table is

(f+1 choose f11) · · · (f+J choose f1J) / (N choose f1+) = ∏_{j=1}^{J} (f+j choose f1j) / (N choose f1+)


and by straightforward computation the result is proved.
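As a small numerical check of formula (3.2), the following Python sketch (all names, and the choice of margins, are illustrative) enumerates the three 2 × 2 tables with row sums (3, 4) and column sums (2, 5) and verifies that the hypergeometric probabilities sum to one:

```python
from math import comb, prod

def hypergeom_prob(table, row_sums, col_sums, N):
    """Formula (3.2): product over columns of the multinomial coefficients
    (f_{+j} choose f_{1j}, ..., f_{Ij}), divided by (N choose f_{1+}, ..., f_{I+})."""
    def multinomial(n, parts):
        out, rest = 1, n
        for p in parts:
            out *= comb(rest, p)
            rest -= p
        return out
    num = prod(multinomial(col_sums[j], [row[j] for row in table])
               for j in range(len(col_sums)))
    return num / multinomial(N, row_sums)

# Enumerate all 2x2 tables with row sums (3, 4) and column sums (2, 5):
# such a table is determined by its (1, 1) entry.
r, c = (3, 4), (2, 5)
N = sum(r)
tables = []
for f11 in range(min(r[0], c[0]) + 1):
    f12, f21 = r[0] - f11, c[0] - f11
    f22 = r[1] - f21
    if f12 >= 0 and f21 >= 0 and f22 >= 0:
        tables.append([[f11, f12], [f21, f22]])

total = sum(hypergeom_prob(t, r, c, N) for t in tables)
assert len(tables) == 3
assert abs(total - 1.0) < 1e-12  # the probabilities sum to one over F(r,c)
```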

As it is difficult to find the number of tables in F(r,c) and to obtain a complete list of all tables in F(r,c), we approximate the distribution of a generic test statistic C conditional on the sufficient statistic using the algorithm described in Chapter 2.

Recalling the definition of Markov basis, see Chapter 2, Definition 2.1, it is

clear that the Markov basis for the independence model has to be formed by

moves with null margins. Moreover, the set of moves must give a connected

Markov chain.

Then, using this Markov basis, one can apply the algorithm described in

the previous chapter in order to obtain a Monte Carlo approximation of the

distribution of the test statistic.

The likelihood of any I×J contingency table f can be written in monomial

form

ξ11^{f11} · · · ξIJ^{fIJ} ,   (3.3)

where the ξij’s are functions of the model parameters. The sufficient statistic

in the independence case is a map

T : N^{I×J} −→ N^{I+J}   (3.4)

f ↦ (f1+, . . . , fI+, f+1, . . . , f+J)   (3.5)

Then, for I × J contingency tables the ring homomorphism of interest is

π : K[ξ11, . . . , ξIJ ] −→ K[y1+, . . . , yI+, y+1, . . . , y+J ] (3.6)

ξij 7−→ yi+y+j (3.7)

The following result characterizes a well-known object, namely the set of 2 × 2 minors, as the Gröbner basis of the toric ideal I associated with π.

Theorem 3.2 Given any term-ordering τ , the reduced Gröbner basis of I is

Gτ = {ξil ξjk − ξik ξjl : 1 ≤ i < j ≤ I, 1 ≤ k < l ≤ J}   (3.8)

Proof. First, we prove that Gτ is a set of generators for ker(π). Every polynomial of the form ξil ξjk − ξik ξjl is in ker(π), so Ideal(Gτ ) ⊆ ker(π). For the converse, consider a monic polynomial g ∈ ker(π). Letting ξ^a = LT(g) be the


leading term of g and π(ξ^a) = y^u, there exists in g at least one other term ξ^b such that π(ξ^b) = y^u. The polynomial ξ^a − ξ^b is such that the exponents verify

Σ_{i=1}^{I} aij = Σ_{i=1}^{I} bij for all j = 1, . . . , J

and

Σ_{j=1}^{J} aij = Σ_{j=1}^{J} bij for all i = 1, . . . , I .

With these constraints, after long but elementary computations one proves that ξ^a − ξ^b ∈ Ideal(Gτ ). Now, also the polynomial g′ = g − ξ^a + ξ^b is in ker(π) and LT(g′) ≺τ LT(g). Then, the result is proved by applying the same procedure to g′, which terminates after a finite number of steps because ≺τ is a well-ordering.

The proof that Gτ is a Gröbner basis is a simple application of the notion of syzygy, which is a standard tool in commutative algebra. The reader can find the theory of syzygies in Kreuzer & Robbiano (2000), Section 2.3.

The following example is the continuation of Example 2.6 of Chapter 2.

Example 3.3 The reduced Gröbner basis for 3 × 2 contingency tables gives the three moves

+1 −1    +1 −1     0  0
−1 +1     0  0    +1 −1
 0  0    −1 +1    −1 +1

This example can be used to show that the standard analysis of contingency

tables with the use of the theory of vector spaces is not sufficient in order to

find a Markov basis, see also Pistone et al. (2001a). The matrix representation

of the homomorphism T in Equation (3.5) is

S1 =

1 1 0 0 0 0
0 0 1 1 0 0
0 0 0 0 1 1
1 0 1 0 1 0
0 1 0 1 0 1   (3.9)

Note that we use the notation S1 in order to preserve here the notation of

the statistical literature, but the matrix S1 is just the matrix A introduced in

Chapter 1, Section 1.2.3.


We extend its action to a linear map from R^6 to R^5 and compute its kernel via the system of linear equations

a11 + a12 = 0
a21 + a22 = 0
a31 + a32 = 0
a11 + a21 + a31 = 0
a12 + a22 + a32 = 0   (3.10)

As the system has rank 4, the kernel has dimension 2. Two solutions for this

system are

(a11, a12, a21, a22, a31, a32) = (1,−1, 0, 0,−1, 1) , (3.11)

(a11, a12, a21, a22, a31, a32) = (0, 0, 1,−1,−1, 1) . (3.12)
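These two solutions can be checked mechanically; the following Python sketch (helper names are our own) verifies that both vectors lie in the kernel of the matrix S1 of Equation (3.9), using plain integer arithmetic:

```python
# The matrix S1 of Equation (3.9): rows = components of the sufficient
# statistic, columns = the six cells of the 3x2 table.
S1 = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]

def matvec(A, v):
    """Integer matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

v1 = (1, -1, 0, 0, -1, 1)   # solution (3.11)
v2 = (0, 0, 1, -1, -1, 1)   # solution (3.12)
assert matvec(S1, v1) == [0, 0, 0, 0, 0]
assert matvec(S1, v2) == [0, 0, 0, 0, 0]
```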

Then, a basis of the orthogonal complement of Z1 = S1^T is given by

Z2 =

 1  0
−1  0
 0  1
 0 −1
−1 −1
 1  1   (3.13)

but the polynomials ξ11ξ32 − ξ12ξ31 and ξ21ξ32 − ξ22ξ31 do not form a Markov basis. For example, starting from the table

3 5
6 4
0 0   (3.14)

the Markov chain based on the two moves

+1 −1     0  0
 0  0    +1 −1
−1 +1    −1 +1

does not produce any other table. In fact, the polynomial ξ11ξ22 − ξ12ξ21 represents a non-trivial move, but it does not lie in the ideal Ideal(ξ11ξ32 − ξ12ξ31, ξ21ξ32 − ξ22ξ31). Although this table is statistically trivial, this example allows us to write explicitly the homomorphism in a simple case. In other models there are more interesting examples, as we will see in Section 3.3.

Using the saturation-based algorithm presented in Chapter 1, Section 1.2.4, the Markov basis can be found through the computation of the Gröbner basis of the ideal

Ideal(ξ11ξ32 − ξ12ξ31, ξ21ξ32 − ξ22ξ31, hu − 1) ∩ K[ξ11, ξ12, ξ21, ξ22, ξ31, ξ32]   (3.15)


where h is the polynomial formed by the product of all the ξ indeterminates.

This operation leads to the three moves described above. Refer to Chapter 1

for details on the computation of toric ideals as well as for a discussion on the

symbolic software we use.

Finally, note that in the independence case the Markov basis consists of all

tables of the type

+1 −1
−1 +1

for any 2× 2 minor of the table and zero otherwise. Then, the computation of

the ratio H(r,c)(f +m)/H(r,c)(f) in the rejection probability of the Metropolis–

Hastings algorithm is again simplified. In fact, referring to Equation (2.12),

we obtain

H(r,c)(f + m)/H(r,c)(f) = ∏_{i,j : mij ≠ 0} fij! / (fij + mij)!   (3.16)

= ∏_{i,j : mij = +1} fij!/(fij + 1)! · ∏_{i,j : mij = −1} fij!/(fij − 1)!   (3.17)

= ∏_{i,j : mij = +1} (fij + 1)^{−1} · ∏_{i,j : mij = −1} fij   (3.18)

and only four numbers are involved.
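A minimal Python sketch of one step of the resulting Metropolis chain for the independence model illustrates how ratio (3.18) enters the acceptance probability; the uniform proposal over 2 × 2 minors and signs, and all function names, are our own assumptions for illustration:

```python
import random

def metropolis_step(f, rng=random):
    """One step of the Markov chain on tables with fixed margins.
    Proposes a random +/-1 pattern on a random 2x2 minor and accepts
    with probability min(1, H(f + m)/H(f)), via the simplified ratio (3.18)."""
    I, J = len(f), len(f[0])
    i1, i2 = rng.sample(range(I), 2)
    j1, j2 = rng.sample(range(J), 2)
    eps = rng.choice([+1, -1])
    plus = [(i1, j1), (i2, j2)]    # cells getting +eps
    minus = [(i1, j2), (i2, j1)]   # cells getting -eps
    if eps == -1:
        plus, minus = minus, plus
    if any(f[i][j] == 0 for i, j in minus):
        return f   # proposed table has a negative entry: H = 0, reject
    # ratio (3.18): only four numbers are involved
    ratio = 1.0
    for i, j in plus:
        ratio /= f[i][j] + 1
    for i, j in minus:
        ratio *= f[i][j]
    if rng.random() < min(1.0, ratio):
        g = [row[:] for row in f]
        for i, j in plus:
            g[i][j] += 1
        for i, j in minus:
            g[i][j] -= 1
        return g
    return f

# Margins are preserved at every step.
table = [[3, 5], [6, 4]]
for _ in range(200):
    table = metropolis_step(table)
assert [sum(row) for row in table] == [8, 10]
assert [sum(col) for col in zip(*table)] == [9, 9]
assert all(x >= 0 for row in table for x in row)
```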

3.2 Some models for incomplete and square tables

We have considered in the previous sections the independence model as it

represents an easy prototype. We analyze now the Markov bases for other

models, with special attention to the most widely used models for two-way

contingency tables. A survey on these models in the framework of the log-

linear models can be found in Haberman (1978).

3.2.1 Incomplete tables

Incomplete tables are contingency tables with structural zeroes. In this case

the sample space X is a subset of {(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J}. A complete

description of a Markov basis for incomplete tables is not easy. For example,

we consider a 3×3 table with the missing entry (1, 3), again with the row sums

and the column sums as components of the sufficient statistic. Using CoCoA


for computing a Gröbner basis of the corresponding toric ideal, we obtain these five moves in the Markov basis:

+1 −1  •    +1 −1  •     0  0  •
−1 +1  0     0  0  0    +1 −1  0
 0  0  0    −1 +1  0    −1 +1  0

 0  0  •     0  0  •
+1  0 −1     0 +1 −1
−1  0 +1     0 −1 +1

In this case, the Markov basis seems to be obtained from the corresponding

Markov basis of the complete table by deleting the moves which involve the

missing entry. But this is false in general as we show in the following example.

If we consider a 3 × 3 table with all diagonal cells as missing entries, the algebraic computation shows that the Markov basis has only one move:

 • +1 −1
−1  • +1
+1 −1  •

In this example, the ring homomorphism of interest is

π : K[ξ12, ξ13, ξ21, ξ23, ξ31, ξ32] −→ K[y1+, y2+, y3+, y+1, y+2, y+3] (3.19)

ξij 7−→ yi+y+j (3.20)

with i ≠ j.

3.2.2 The quasi-independence model

For square tables, the quasi-independence model is often useful. The log-linear form of this model is

log mij = λ + λi^X + λj^Y + δi I{i=j}   (3.21)

where the λi^X's are the effects of the first variable X with values 1, . . . , I, the λj^Y's are the effects of the second variable Y with values 1, . . . , I, and the δi's are the effects of the diagonal cells. Here the indicator I{i=j} equals 1 when i = j and 0 otherwise. The usual constraints of this model are

Σ_{i=1}^{I} λi^X = 0 and Σ_{j=1}^{I} λj^Y = 0 .   (3.22)


This model assumes independence except for the diagonal cells, which are fitted exactly. In this model, the components of the sufficient statistic are the row sums and the column sums together with the diagonal entries d = (f11, . . . , fII). As the diagonal entries are themselves components of the sufficient statistic, the model fits these cells exactly, and then no move of the Markov basis can modify the diagonal. This observation leads us to the following result.

Proposition 3.4 The reduced Gröbner basis of the toric ideal I for the quasi-independence model is the Gröbner basis for the corresponding incomplete table with missing diagonal entries.

Proof. An equivalent minimal sufficient statistic has components d, r = (f1+, . . . , fI+) and c = (f+1, . . . , f+I), where fi+ = fi1 + . . . + fi,i−1 + fi,i+1 + . . . + fiI is the i-th row sum with the i-th diagonal cell excluded, and the analogous definition holds for the column sums f+j. With this sufficient statistic, the equations defining the kernel of π are

statistic, the equations defining the kernel of π are

ξij = yi+ y+j for i, j = 1, . . . , I, i ≠ j   (3.23)

ξii = fii for i = 1, . . . , I   (3.24)

A Gröbner basis G1 of the ideal J1 = Ideal(ξij − yi+y+j : i, j = 1, . . . , I, i ≠ j) is the Gröbner basis coming from the incomplete table with missing diagonal entries. A Gröbner basis G2 of the ideal J2 = Ideal(ξii − fii : i = 1, . . . , I) is {ξii − fii : i = 1, . . . , I}. Now, observe that the first group of equations involves only the indeterminates ξij, i ≠ j, yi+ and y+j, i, j = 1, . . . , I, whereas the second group of equations involves the indeterminates ξii and fii, i = 1, . . . , I. As a consequence, it is a standard fact in polynomial algebra that a Gröbner basis of the ideal J = J1 + J2 is G1 ∪ G2.

As we have previously seen, the Gröbner basis of the toric ideal I = J ∩ K[ξij : i ≠ j] is obtained from G1 ∪ G2 by deleting the polynomials involving the y and f indeterminates. So, the polynomials in G2 are all deleted and the result is proved.

As the computation of a Gröbner basis of a toric ideal is very computationally intensive, and contingency tables need a great number of indeterminates and equations, the above result is especially useful from a computational point of view, as it reduces the number of equations used in the computation. Moreover,


note that the above proposition can be extended to other models where some

entries are sufficient statistics.

3.2.3 The symmetry model

Following the notation of log-linear models, the symmetry model can be written in the form

log mij = λ + λi + λj + λij (3.25)

with λij = λji for all i, j = 1, . . . , I and with the constraint Σ_{i=1}^{I} λi = 0. This model implies marginal homogeneity. The sufficient statistic is s = (sij = fij + fji : 1 ≤ i ≤ j ≤ I).

The Markov basis for the symmetry model is characterized by the following

result.

Theorem 3.5 Given any term-ordering τ , the reduced Gröbner basis of I for the symmetry model is

Gτ = {ξij − ξji : 1 ≤ i < j ≤ I}   (3.26)

Proof. Writing explicitly the kernel of the ring homomorphism, we obtain a polynomial system which contains the pairs of equations

ξij = sij ,  ξji = sij   (3.27)

for all 1 ≤ i < j ≤ I and the equations

ξii = sii   (3.28)

for all i = 1, . . . , I. As the corresponding polynomials are a Gröbner basis, straightforward elimination of the sij indeterminates gives the result.

3.2.4 The quasi-symmetry model

The quasi-symmetry model has the log-linear form

log mij = λ + λi^X + λj^Y + λij   (3.29)

with λij = λji for all i, j = 1, . . . , I and with the constraints

Σ_{i=1}^{I} λi^X = 0 and Σ_{j=1}^{I} λj^Y = 0 .   (3.30)


This model is more general than the previous one, as it does not imply marginal homogeneity. The components of the sufficient statistic are the row sums, the

column sums and s = (sij = fij + fji : 1 ≤ i ≤ j ≤ I). In this model the

Markov basis can be computed with CoCoA. For example, for a 4 × 4 table,

the seven moves are

 0  0  0  0     0  0 −1 +1     0 −1  0 +1
 0  0 −1 +1     0  0  0  0    +1  0  0 −1
 0 +1  0 −1    +1  0  0 −1     0  0  0  0
 0 −1 +1  0    −1  0 +1  0    −1 +1  0  0

 0  0 −1 +1     0 −1  0 +1     0 +1 −1  0
 0  0 +1 −1    +1  0 −1  0    −1  0  0 +1
+1 −1  0  0     0 +1  0 −1    +1  0  0 −1
−1 +1  0  0    −1  0 +1  0     0 −1 +1  0

 0 −1 +1  0
+1  0 −1  0
−1 +1  0  0
 0  0  0  0

It should be noted that the Markov basis is easily computable also in those

models where the maximum likelihood equations are not in explicit form and

a numerical method, such as the Newton-Raphson method, must be used in

order to calculate the maximum likelihood estimate, see Chapter 1.

3.2.5 Computational notes

The application of the above theory leads to two computational problems.

First, the computation of Gröbner bases of toric ideals involves a great number of indeterminates, and the number increases quickly as the dimensions of the table increase. For example, the computation of the Gröbner basis of the toric ideal in the independence model for complete I × J tables involves IJ + I + J indeterminates.

For other models the number of indeterminates is even greater. For example, the quasi-symmetry model for an I × I table involves I^2 + 2I + I(I + 1)/2 indeterminates. The most common symbolic software packages, such as CoCoA or Maple, run into computational problems if the number of indeterminates is too high. So, the theorems of the sections above, which give the Gröbner basis without any computation, are essential for solving problems on large tables. However, we have found in our computations that CoCoA, with its function Toric for the computation of toric ideals, is the package which allows one to work with the greatest number of indeterminates.

Second, the number of moves increases quickly when I and J increase. Considering once again the independence model, the number of polynomials of the


Gröbner basis, or equivalently the number of moves of the Markov basis, is (I choose 2)(J choose 2). In Table 3.1, we report the number of moves of the Markov basis for the independence model.

I\J     2    3    4    5    6    7    8    9   10
 2      1    3    6   10   15   21   28   36   45
 3      3    9   18   30   45   63   84  108  135
 4      6   18   36   60   90  126  168  216  270
 5     10   30   60  100  150  210  280  360  450
 6     15   45   90  150  225  315  420  540  675
 7     21   63  126  210  315  441  588  756  945
 8     28   84  168  280  420  588  784 1008 1260
 9     36  108  216  360  540  756 1008 1296 1620
10     45  135  270  450  675  945 1260 1620 2025

Table 3.1: Number of moves for the independence model.
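The entries of Table 3.1 can be reproduced directly from the formula; a short Python check (the function name is ours):

```python
from math import comb

def n_moves(I, J):
    """Number of moves of the Markov basis for the independence model:
    one +1/-1 pattern for each 2 x 2 minor of an I x J table."""
    return comb(I, 2) * comb(J, 2)

# a few entries of Table 3.1
assert n_moves(2, 2) == 1
assert n_moves(3, 4) == 18
assert n_moves(10, 10) == 2025
```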

In this case, the main problem is how to store the results of the symbolic computations in matrix form in order to use them with statistical software.

3.3 Examples

In this section we present three different examples of application of the Diaconis-Sturmfels algorithm to goodness-of-fit tests for log-linear models.

Example 3.6 Consider the categorical data in Fingleton (1984), page 142, reported in Table 3.2. These data describe the place of residence of a subgroup of British migrants in 1966 and 1971. The data are simplified in order to consider only four places, labelled A, B, C and D. For the quasi-symmetry model, the Pearson statistic is C = 2.5826. If we compare this value with the chi-squared distribution with (I − 1)(I − 2)/2 = 3 degrees of freedom, it gives an asymptotic approximate p-value of 0.4607.

Running the Markov chain based on the seven moves found in Section 3.2 for the quasi-symmetry model, with 50,000 burn-in steps and sampling every 50 steps for a total of 10,000 values, the Monte Carlo approximate p-value is 0.4714. As the sample size is large, the classical asymptotic approximation works well and the two approximations are similar.


            1971
1966     A     B     C     D
A      118    12     7    23
B       14  2127    86   130
C        8    69  2548   107
D       12   110    88  7712

Table 3.2: British migrants data.

The sample size is chosen in order to have a 95%-confidence interval for the exact p-value shorter than 0.02. In fact, assuming near independence, the width of the 95%-confidence interval is 2 · 1.96 √(p(1 − p)/B), where p is the estimated p-value and B is the size of the Monte Carlo sample. In the inequality 2 · 1.96 √(p(1 − p)/B) < 0.02, the left-hand term has its maximum at p = 1/2, so the inequality reduces to 2 · 1.96 √(1/(4B)) < 0.02. From the last inequality we easily obtain B > 9,604.
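The bound can be checked with a few lines of Python (a sketch of the worst-case computation at p = 1/2):

```python
from math import ceil

# Worst case p = 1/2: the inequality 2 * 1.96 * sqrt(p(1-p)/B) < 0.02
# reduces to 1.96 / sqrt(B) < 0.02, i.e. B > (1.96 / 0.02)^2 = 98^2 = 9604.
bound = (1.96 / 0.02) ** 2
B = ceil(bound)
if B == bound:     # bound attained exactly: need strictly more samples
    B += 1
assert B == 9605   # i.e. B > 9,604 as in the text
```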

Example 3.7 As a second example, we present a quasi-independence model for complete tables. We analyze the medical data in Agresti (1996), page 242, reported in Table 3.3.

        Y
X      1    2    3    4
1     22    2    2    0
2      5    7   14    0
3      0    2   36    0
4      0    1   17   10

Table 3.3: Carcinoma data.

This table shows ratings by two pathologists, X and Y , who separately classified 118 slides regarding the presence and extent of carcinoma of the uterine cervix. The rating scale has four ordered categories, 1, 2, 3 and 4. Under the quasi-independence model, the Pearson statistic is C = 11.5, leading to an asymptotic approximate p-value of 0.0423. Applying the algebraic MCMC algorithm as in the previous example, the Monte Carlo approximate p-value is 0.0080. If we perform a 1%-level goodness-of-fit test, the conclusions are different using the two approaches.


Example 3.8 Finally, we consider again the data of the previous example, but under the quasi-symmetry model. The Pearson statistic for this table under the quasi-symmetry model is C = 0.6348. Agresti (1996) compares this value with the chi-squared distribution with 2 degrees of freedom, obtaining an asymptotic approximate p-value of 0.7280. Running the Markov chain as above, the Monte Carlo approximate p-value is exactly 1, and in this case the two approximations are not similar. Here we use the definition of the p-value as the smallest significance level at which the model is rejected with probability 1.

But there is more. Using the seven moves of the Markov basis, we can

compute for this table the exact distribution of C under the model. In fact,

starting from the original table (denoted here by f1), only the last move

 0 −1 +1  0
+1  0 −1  0
−1 +1  0  0
 0  0  0  0

with negative sign, can be applied and this move gives the table f2 below.

Again, only the last move can be used: if we choose the positive sign we come

back to the table f1, whereas if we choose the negative sign we obtain the table

f3.

f2 =
22  3  1  0
 4  7 15  0
 1  1 36  0
 0  1 17 10

f3 =
22  4  0  0
 3  7 16  0
 2  0 36  0
 0  1 17 10       (3.31)

Now, only the last move can be applied, with positive sign, coming back to

table f2.

There are only three tables with non-negative entries and with the appropriate values of the sufficient statistic, and it is not difficult to compute the corresponding hypergeometric probabilities. Applying formula (3.16), we obtain the equations H(f2)/H(f1) = 4/9 and H(f3)/H(f2) = 1/32, together with the constraint H(f1) + H(f2) + H(f3) = 1. Solving this simple system, we find H(f1) = 72/105, H(f2) = 32/105, H(f3) = 1/105. The Pearson statistic values for these three tables are C(f1) = 0.6348, C(f2) = 1.8405, C(f3) = 12.3206.
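These ratios and probabilities can be verified with exact rational arithmetic; the following Python sketch (0-based indices, function names ours) recomputes them from formula (3.18):

```python
from fractions import Fraction

def ratio(f, move):
    """H(f + move)/H(f) via formula (3.18), computed exactly."""
    r = Fraction(1)
    for (i, j), m in move.items():
        if m == +1:
            r /= f[i][j] + 1
        else:            # m == -1
            r *= f[i][j]
    return r

# Table 3.3 (f1) and the table f2 of Equation (3.31), 0-based indices.
f1 = [[22, 2, 2, 0], [5, 7, 14, 0], [0, 2, 36, 0], [0, 1, 17, 10]]
f2 = [[22, 3, 1, 0], [4, 7, 15, 0], [1, 1, 36, 0], [0, 1, 17, 10]]
# The last move of the quasi-symmetry Markov basis, taken with negative sign.
neg_move = {(0, 1): +1, (0, 2): -1, (1, 0): -1,
            (1, 2): +1, (2, 0): +1, (2, 1): -1}
r12 = ratio(f1, neg_move)     # H(f2)/H(f1)
r23 = ratio(f2, neg_move)     # H(f3)/H(f2)
assert r12 == Fraction(4, 9) and r23 == Fraction(1, 32)

# Solve H(f1) + H(f2) + H(f3) = 1 with H(f2) = r12 H(f1), H(f3) = r23 H(f2).
H1 = 1 / (1 + r12 + r12 * r23)
assert (H1, r12 * H1, r12 * r23 * H1) == (
    Fraction(72, 105), Fraction(32, 105), Fraction(1, 105))
```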

Moreover, following Example 3.3, the size of the matrix S1 is 18 × 16 and the dimension of its kernel is 3. Then, a basis of the kernel as a vector space is given by three linearly independent moves, for instance the first three ones. If


the last move is excluded, the Markov chain starting from f1 cannot reach the

other two tables.

3.4 An application. The h-sample problem

In this section we present a first application of the theory above, and in par-

ticular of the exact test for the independence model.

We want to compute exact p-values for tests of equality of h multinomial distributions, through the test of independence on h × k tables. It is worth noting that when the p-value is not exactly computed, classical approximations based on the asymptotic chi-squared distribution of the test statistic are used. As for log-linear models, asymptotic approximations are not accurate in many situations; indeed, the problem of computing exact p-values has been widely considered in the recent literature.

Mehta & Patel (1983) provide a network algorithm to inspect the orbit of h × k contingency tables with fixed margins. But the entire set of tables is often too large, and consequently approximation methods have been proposed. Baglivo et al. (1992) used an FFT algorithm to compute the exact p-value on h × k tables with fixed margins. Hirji & Johnson (1996) compared the two algorithms above and concluded that the first one is superior to the second with respect to computing speed and accuracy.

The algorithm proposed by Mehta et al. (1988) provides an approximation of the exact p-value; it is based on a complicated Monte Carlo method which generates random tables whose statistics belong to the critical region of the test.

Recently, Strawderman & Wells (1998) used saddlepoint approximations to the exact distribution of the conditional maximum likelihood estimate (MLE) for several 2 × 2 tables, providing approximate p-values, power calculations and confidence intervals. Booth & Butler (1999) proposed a Monte Carlo importance sampling scheme for conducting exact conditional tests in log-linear models. Moreover, Mehta et al. (2000) proposed a Monte Carlo method for approximating the exact p-values in conditional logistic regression and foresee the applicability of algebraic methods in this framework. The main problem in Mehta et al. (1988), and in most of the cited works, was how to generate random tables with fixed margins. Here we give a new solution to the h-sample problem applying the technique described in Diaconis & Sturmfels (1998), and


the computer program can be written in a few lines of code. Moreover, we provide a solution not only for 2 × k tables but, more generally, for h × k tables.

The h-sample problem can be stated as follows. Let X1, . . . , Xh be random variables with values in {1, . . . , k} defined on h different groups, and consider h samples drawn from these groups: (X1,1, . . . , X1,n1) from the multinomial distribution D1, . . . , (Xh,1, . . . , Xh,nh) from the multinomial distribution Dh. Let N = Σ_{i=1}^{h} ni be the total sample size. The distribution Di has parameters pi1, . . . , pik, for i = 1, . . . , h, with the constraints pij ≥ 0 and Σ_{j=1}^{k} pij = 1 for all i = 1, . . . , h. The usual test in this situation is the test for equality of the proportions, where the null hypothesis is

H0 : p1j = . . . = phj for j = 1, . . . , k   (3.32)

against the composite alternative hypothesis of different proportions.

The components of the sufficient statistic are the sums of the observations in each group and the sums of observations for each possible value of the variables, that is

S = ( Σ_{j,u} I{X1,u = j} , . . . , Σ_{j,u} I{Xh,u = j} , Σ_{i,u} I{Xi,u = 1} , . . . , Σ_{i,u} I{Xi,u = k} )   (3.33)

for i = 1, . . . , h and j = 1, . . . , k, where IA is the indicator function of A. Of course, the sample can be summarized in a contingency table with h rows and k columns, which we denote by F. In other words, Fij = Σ_{u=1}^{ni} I{Xi,u = j}.

The table F has raw parameters p*ij = pij ni/N. Moreover, we denote by fi+ and f+j the row sums and the column sums, respectively. The sufficient statistic S is represented by the margins of this table, and the maximum likelihood estimate of the parameters is

p̂*ij = fi+ f+j / N^2  for i = 1, . . . , h , j = 1, . . . , k .   (3.34)

For this hypothesis the most commonly used test statistics are the Pearson statistic $C = \sum_{i,j} (F_{ij} - N p^*_{ij})^2 / (N p^*_{ij})$ and the log-likelihood ratio $L = 2 \sum_{i,j} F_{ij} \log\left(F_{ij} / (N p^*_{ij})\right)$, see for example Agresti (1996).

There are many nonparametric algorithms, based on Monte Carlo approximations of the p-value, which do not involve limit theorems. The algorithm

presented here is still valid when the test statistic is of the form

$$T(F) = \sum_{i,j} a_{ij}(F_{ij})\,, \qquad (3.35)$$


where the $a_{ij}$'s are real-valued functions on the set $\mathbb{N}$ of the nonnegative integers.

As usual in this framework, we consider a problem of conditional infer-

ence, where only the tables with the same margins as the original table are

relevant. So, we restrict the analysis to the reference set Ft. Using the al-

gorithm described above with the Markov basis for the independence model,

the approximate Monte Carlo p-value is simply the ratio of the number of sampled tables whose value of the test statistic is greater than or equal to that of the original table, over the total number B of sampled tables.
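The procedure just described can be sketched in a few lines. The following Python sketch is illustrative only (the helper names are hypothetical, not the thesis' code): it walks the reference set of an h × k table using the basic degree-two moves of the independence model and counts, along the chain, the tables whose Pearson statistic is at least that of the observed table.

```python
import random

def pearson(F, rows, cols):
    # Pearson statistic C = sum_{i,j} (F_ij - N p*_ij)^2 / (N p*_ij),
    # with expected counts N p*_ij = f_{i+} f_{+j} / N; assumes positive margins.
    h, k = len(F), len(F[0])
    N = sum(rows)
    return sum((F[i][j] - rows[i] * cols[j] / N) ** 2 / (rows[i] * cols[j] / N)
               for i in range(h) for j in range(k))

def mc_p_value(F, B=10000, seed=0):
    # Approximate conditional Monte Carlo p-value: apply random +1/-1 moves
    # on a 2x2 minor (the Markov basis of the independence model), which
    # leave the row and column margins unchanged.
    rng = random.Random(seed)
    h, k = len(F), len(F[0])
    rows = [sum(r) for r in F]
    cols = [sum(F[i][j] for i in range(h)) for j in range(k)]
    t_obs = pearson(F, rows, cols)
    G = [row[:] for row in F]
    extreme = 0
    for _ in range(B):
        i1, i2 = rng.sample(range(h), 2)
        j1, j2 = rng.sample(range(k), 2)
        eps = rng.choice([1, -1])
        # apply the move only if all cells stay nonnegative
        if min(G[i1][j1] + eps, G[i2][j2] + eps,
               G[i1][j2] - eps, G[i2][j1] - eps) >= 0:
            G[i1][j1] += eps; G[i2][j2] += eps
            G[i1][j2] -= eps; G[i2][j1] -= eps
        if pearson(G, rows, cols) >= t_obs:
            extreme += 1
    return extreme / B
```

The sketch counts every visited table with equal weight, in line with the ratio described above; a full treatment of the chain and its stationary distribution is given in Chapter 2.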

Note that for the two-sample problem there is an algorithm proposed by Mehta et al. (1988), which is an importance sampling algorithm based on the network representation of the reference set Ft. It was also used to approximate the power of the test, but it is difficult to adapt it to the h-sample problem. Moreover, our method allows us to approximate the p-value as stated before, without also having to consider the hypergeometric probability weights of the sampled tables.

It is interesting to note that Mehta et al. (2000) also considered the possi-

bility of applying algebraic-based methods: “An investigation is needed to es-

tablish conditions under which the probability of connectedness of the Markov

chain is high”. But, the theorems presented in Diaconis & Sturmfels (1998),

and described in Chapter 2 of this thesis, state that the algebraic algorithm

described above defines a connected Markov chain.

As in this kind of application only the independence model is involved, it is worth noting that the algebraic step of the algorithm does not require symbolic computations. In fact, the relevant Markov bases are completely characterized in

Theorem 3.2.

Finally, we emphasize that classical methods, for example Mehta et al. (1988), proposed solutions that sample a few well-chosen tables at great computational cost. On the other hand, this method easily generates tables from the reference set.


Chapter 4

Rater agreement models

In this chapter we apply the algebraic technique for sampling from condi-

tional distributions presented in Chapter 2 in order to define an exact Markov

Chain Monte Carlo (MCMC) algorithm for conditional hypothesis testing in

the problems of rater agreement. Moreover, we extend the results and the

computations made up in Chapter 3 to the multi-dimensional case, i.e. to

the case of d-way tables, with d > 2. These new results allow us to compute

Markov bases also in non-trivial cases and some of the most important cases

are presented explicitly.

In Section 4.1, we present the medical framework where models for rater

agreement apply. In Section 4.2 we review the main methods for measuring

rater agreement, with particular attention to the multi-observer case. The

methods considered here are the Cohen’s κ and the goodness of fit tests for

appropriate log-linear models. In Section 4.3, we describe the conditional in-

ference for these problems, using the Diaconis-Sturmfels algorithm and the

notion of Markov basis, while in Section 4.4 a careful presentation of the con-

ditioning structures arising from complex problems is given. Section 4.5 deals

with the computation of some relevant Markov bases coming from agreement

problems, through some new theoretical results useful for simplifying the sym-

bolic computations. Finally, in Section 4.6, an example on a real data set

shows the practical applicability of the algebraic procedures.

4.1 A medical problem

The material presented in this chapter was inspired by medical research on medical imaging drugs and biological products, in particular contrast agents,

see CDER (2000). Contrast agents are drugs that improve the visualization of


tissues, organs or physiologic processes by increasing the relative difference of

imaging signal intensities in adjacent parts of the body.

From the statistical point of view, the effectiveness of a contrast agent is

assessed through quantitative statistical studies of the agent’s ability to

provide a better reading of the medical images. The assessment of the validity

of a new product over the existing ones is performed in general by comparing

radiographies, computed tomographies and magnetic resonance imaging taken

with the new product and the old product, when available.

In this medical setting, the most commonly used statistical models are the

models for agreement among raters. For the case of two raters the proposed

methods are the Cohen’s κ and similar indices, see Landis & Koch (1975) for

a survey. More recently, the application of log-linear models, in particular the

quasi-independence model and the quasi-symmetry model has been proposed,

see for example Agresti (1992).

In some particular clinical settings, Government Agencies can require mod-

els for the agreement among many raters or the analysis of stratified samples

with respect to other covariates, for example when the study includes subjects

that represent the full spectrum of the population. For example, in a study

intended to demonstrate the validity of a contrast agent to assess bronchiesta-

sis, the sample should be stratified with respect to age classes of the patients

and to the presence or absence of the main related pulmonary disorders, e.g.,

chronic bronchitis, pneumonia, asthma, cystic fibrosis, see CDER (2000).

In these situations, the use of conditional inference is common practice. In

fact, the use of conditional inference seems to be preferable with respect to the

use of an unconditional approach, see Klar et al. (2000) for a discussion and

further references. In addition, different structures of conditioning arise in the

inferential procedures, as it will be clear in the next sections. For example, it

is usual to define different indices of agreement for each pair of observers, e.g.

different κ’s for each pair of observers and for each stratum. As will be clear

from the following, such a study can be much more informative than the study

of an overall index of agreement for the sample.

4.2 Multidimensional rater agreement models

Suppose that d observers rate the same set of N subjects.

In this section, we briefly review the main methods for measuring rater


agreement, starting with the case of d = 2 observers, say A and B. We present here a wide range of models for binary, nominal and ordinal rating scales, and we denote the set of possible responses as {1, . . . , I}. We do not consider continuous responses, where parametric or nonparametric ANOVA methods can be applied, see for example Landis & Koch (1975).

The data are summarized in a two-way contingency table. We denote by $\pi_{ij} = P\{A = i, B = j\}$ the probability that observer A classifies the subject in category i and observer B in category j, with $i, j = 1, \ldots, I$. Moreover, we

denote by {πi+, i = 1, . . . , I} and {π+j, j = 1, . . . , I} the marginal distributions

for the observers A and B, respectively. Note that in the previous chapters

the cell probabilities were denoted by p, as usual in the analysis of contingency

tables. But in the rater agreement literature the notation π to denote the

probabilities is the standard one.

Probably, the most widely used index of agreement for nominal and ordinal categories is Cohen's κ, defined as

$$\kappa = \frac{\sum_{i=1}^{I} \pi_{ii} - \sum_{i=1}^{I} \pi_{i+}\pi_{+i}}{1 - \sum_{i=1}^{I} \pi_{i+}\pi_{+i}} \qquad (4.1)$$

leading to the model

$$\pi_{ij} = (1 - \kappa)\,\pi_{i+}\pi_{+j} \quad \text{for } i \neq j$$
$$\pi_{ii} = (1 - \kappa)\,\pi_{i+}\pi_{+i} + \kappa/I \quad \text{for } i = j$$

This index measures the observer agreement beyond chance in the sense

that the index equals zero if the agreement is equal to that expected by chance

and computed under the independence assumption, see Landis & Koch (1975)

for further details or the original work Cohen (1960).

Another version of the κ index implies marginal homogeneity and its definition assumes the form

$$\kappa_h = \frac{\sum_{i=1}^{I} \pi_{ii} - \sum_{i=1}^{I} (\pi_{i+} + \pi_{+i})^2/4}{1 - \sum_{i=1}^{I} (\pi_{i+} + \pi_{+i})^2/4}\,. \qquad (4.2)$$

There are also several versions of weighted κ for ordinal scales, see Landis

& Koch (1975) or Fleiss (1981) for details and discussion. As the Markov bases

do not change when considering weights, we restrict our presentation to the

case of standard weights, such as in (4.1) and (4.2). Moreover, some authors

define separate κ’s for each pair of responses, see for example Agresti (1992).
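As a concrete illustration of formulas (4.1) and (4.2), both indices can be computed directly from a table of joint proportions. The following Python functions are an illustrative sketch (function names are hypothetical), assuming the table is given as a nested list:

```python
def cohen_kappa(pi):
    # kappa = (sum_i pi_ii - sum_i pi_i+ pi_+i) / (1 - sum_i pi_i+ pi_+i), Eq. (4.1)
    I = len(pi)
    row = [sum(pi[i]) for i in range(I)]                      # pi_i+
    col = [sum(pi[i][j] for i in range(I)) for j in range(I)]  # pi_+j
    agree = sum(pi[i][i] for i in range(I))
    chance = sum(row[i] * col[i] for i in range(I))
    return (agree - chance) / (1 - chance)

def cohen_kappa_h(pi):
    # marginal-homogeneity version, Eq. (4.2):
    # chance term sum_i (pi_i+ + pi_+i)^2 / 4
    I = len(pi)
    row = [sum(pi[i]) for i in range(I)]
    col = [sum(pi[i][j] for i in range(I)) for j in range(I)]
    agree = sum(pi[i][i] for i in range(I))
    chance = sum((row[i] + col[i]) ** 2 / 4 for i in range(I))
    return (agree - chance) / (1 - chance)
```

For a table concentrated on the main diagonal both indices equal 1, while under exact independence κ equals 0, matching the interpretation of agreement beyond chance given above.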


An agreement index is a normalized index such that it is equal to 1 in the

case of perfect agreement and it is equal to 0 in the case of independence. For

d > 2 raters, one can find different versions of Cohen's κ. Here we use the following formula

$$\kappa(d) = \frac{\binom{d}{2}^{-1}\left\{\sum_{h<k=1}^{d} \sum_{i_h=i_k=1}^{I} \pi_{i_1,\ldots,i_h,\ldots,i_k,\ldots,i_d} - \sum_{h<k=1}^{d} \sum_{i=1}^{I} \pi_k(i)\,\pi_h(i)\right\}}{1 - \binom{d}{2}^{-1} \sum_{h<k=1}^{d} \sum_{i=1}^{I} \pi_k(i)\,\pi_h(i)} \qquad (4.3)$$

where $\pi_k(i)$ is the proportion of the i-th level in the k-th marginal distribution, see Landis & Koch (1975).

With the marginal homogeneity hypothesis we obtain the formula:

$$\kappa_h(d) = \frac{\binom{d}{2}^{-1}\left(\sum_{h<k=1}^{d} \sum_{i_h=i_k=1}^{I} \pi_{i_1,\ldots,i_h,\ldots,i_k,\ldots,i_d}\right) - \sum_{i=1}^{I} \pi(i)^2/d^2}{1 - \sum_{i=1}^{I} \pi(i)^2/d^2} \qquad (4.4)$$

where $\pi(i) = \sum_{h=1}^{d} \pi_h(i)$ is the proportion of ratings in the i-th category by

the d observers.

It is known that the maximum likelihood estimates of the κ parameters are obtained simply by substituting the observed proportions for the theoretical proportions in the above formulae. We denote by $\hat\kappa$ the maximum likelihood estimators.

As the interpretation of the Cohen’s κ is not always straightforward, see

Fleiss (1981), more recently the use of log-linear models for describing ob-

server agreement has been proposed. For the specific application of the log-

linear models for measuring the agreement, Agresti (1992) illustrates the use

of the quasi-independence and the quasi-symmetry models. In particular, the

p−value of the goodness of fit test is used to measure the agreement.

Let {nij | i, j = 1, . . . , I} and {mij | i, j = 1, . . . , I} denote the observed

cell counts and the expected cell counts, respectively. The quasi-independence

model assumes the form

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \delta_i I_{\{i=j\}} \qquad (4.5)$$

where µ is a global mean parameter, $\lambda_i^A, \lambda_j^B$, $i, j = 1, \ldots, I$, are the marginal effects with the constraints $\sum_{i=1}^{I} \lambda_i^A = 0$, $\sum_{j=1}^{I} \lambda_j^B = 0$, and $\delta_i$, $i = 1, \ldots, I$, are the main diagonal parameters.

The expression of the independence model, which gives the expected agreement by chance, has the form

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B\,.$$


As pointed out in Agresti (1992), the quasi-independence model decom-

poses the agreement between the two raters in two parts: a first addendum

which represents the agreement by chance and a second addendum which rep-

resents the agreement beyond chance. For further details on the use of the

quasi-independence model for measuring rater agreement, see Bishop et al.

(1975) or Tanner & Young (1985).

We can also define a simplified quasi-independence model with a unique parameter δ for the main diagonal cells instead of the sequence $\{\delta_i, i = 1, \ldots, I\}$. This makes the fit of the main diagonal unsaturated and the model has the form

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \delta I_{\{i=j\}}\,. \qquad (4.6)$$

Another log-linear model to assess rater agreement is the quasi-symmetry model, defined by the equations

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \lambda_{ij} \qquad (4.7)$$

with the constraints $\sum_{i=1}^{I} \lambda_i^A = 0$, $\sum_{j=1}^{I} \lambda_j^B = 0$, and $\lambda_{ij} = \lambda_{ji}$ for all $i, j = 1, \ldots, I$, see Agresti (1992). Other related models can also be found in Becker

(1990).

The extension of the theory of log-linear models to higher dimensional

tables is presented for example in Fienberg (1980). However, as pointed out in

the introduction, concepts such as independence and conditional independence

have been largely studied in the statistical literature, while the notions of

quasi-independence, symmetry and quasi-symmetry are still underdeveloped.

Some models of quasi-independence for many raters can be found in Tanner

& Young (1985). A generalization of the quasi-symmetry model to higher

order tables is presented in Rudas & Bergsma (2002) for marginal log-linear

models. Roughly speaking, marginal models are generalizations of classical

log-linear models and they are defined by restriction on appropriate marginal

distributions of the table, see Bergsma & Rudas (2002) for details.

We consider here 4 models. First, the quasi-independence model given by

$$\log m_{i_1 \ldots i_d} = \lambda + \lambda_{i_1}^{A_1} + \ldots + \lambda_{i_d}^{A_d} + \delta_i I_{\{i_1 = \ldots = i_d\}} \qquad (4.8)$$

see Tanner & Young (1985). From this model we can also generalize the

simplified quasi-independence model by considering δi = δ for all i = 1, . . . , I.

Tanner & Young (1985) also proposed a non-homogeneous quasi-independence model defined by measuring the agreement among the $\binom{d}{k}$ subsets of k raters. The model is specified by the equation:

$$\log m_{i_1 \ldots i_d} = \lambda + \lambda_{i_1}^{A_1} + \ldots + \lambda_{i_d}^{A_d} + \delta_{i_1 \ldots i_d} \qquad (4.9)$$

with

$$\delta_{i_1 \ldots i_d} = I_1 \delta_1 + \ldots + I_{\binom{d}{k}} \delta_{\binom{d}{k}}$$

where $I_g = 1$ for those cells corresponding to agreement among the k observers of the g-th k-tuple of observers and $I_g = 0$ otherwise. In particular,

when k = 2 this model measures the pairwise agreement.

The generalization of the quasi-symmetry model is less straightforward.

The model in Equation (4.10) below is a prototype and generalizes a model

presented in Rudas & Bergsma (2002):

$$\log m_{i_1 \ldots i_d} = \lambda + \lambda_{i_1}^{A_1} + \ldots + \lambda_{i_d}^{A_d} + \sum_{h<k=1,\ldots,d} \lambda_{i_k i_h}^{A_k A_h} \qquad (4.10)$$

with $\lambda_{i_k i_h}^{A_k A_h} = \lambda_{i_h i_k}^{A_h A_k}$ for all $h, k = 1, \ldots, d$. In the same paper other models

are introduced. Our choice is motivated by the fact that often in medical

applications the hypothesis of pairwise agreement is more useful than an

overall quasi-symmetry condition of the d observers. The definition of more

restrictive quasi-symmetry models is possible by adding to the previous model

other parameters for higher-order symmetries.

In some clinical studies the repeated image evaluation by the same reader

is of great interest. In fact, the medical studies usually require the assess-

ment of both the inter-reader variability and the intra-reader variability. In

particular, intra-reader agreement models are useful for the study of category

distinguishability for the neat definition of the rating scales, see Darroch &

McCloud (1986), where quantitative measures are welcome, see CDER (2000).

But the resulting statistical structure and modeling in the intra-reader agreement is the same as in the previous setting of inter-reader agreement.

Finally, in the case where a truth standard is available, the analysis usually

concerns non-square tables, in most cases 2 × I tables. In such situations, the

statistical analysis is usually performed using the Receiver Operating Char-

acteristic (ROC) curves, see for example Hanley (1998). We do not consider

here this application, limiting ourselves to square tables.


4.3 Exact inference

In this section we introduce the application of the algebraic algorithm for sam-

pling from conditional distributions described in Chapter 2 in order to define

exact inferential procedures for the Cohen’s κ and for the log-linear models pre-

sented in the previous section. As mentioned above, we define our procedures

in the framework of the conditional inference.

In the recent literature, some methods for estimating κ indices with non-

asymptotic techniques are referred to as the methods of generalized estimating

equations, see Klar et al. (2000) and Williamson et al. (1994) for details. In this

last paper, a simple simulation study is given. This approach is a generalization

of the Cohen’s κ and it is useful in particular when each subject is rated by

a different number of observers. Other studies approach the problem from

the perspective of latent class analyses, see Guggenmoos-Holzmann & Vonk

(1998).

Some procedures for hypothesis testing on κ have been recently developed,

see for example Donner et al. (2000), but limited to two raters or dichoto-

mous variables. Other works on the same subject can be found in Donner &

Klar (1996). Note that under the conditional approach the distribution of the estimator $\hat\kappa$ is difficult to compute. In fact, even its support depends on the observed marginal distributions.

Here, we use the algebraic algorithms presented previously in order to define

appropriate MCMC approximations of the test statistics. Note that specific

work on the quasi-independence model can also be found in Smith et al. (1996).

The method we present here applies to general multi-dimensional tables,

with both more than 2 levels of the rating scale and more than 2 raters. Let

X = {1, . . . , I}d be the sample space. Denote the cell probabilities by π and

the sample proportions by p. Moreover, let f be a generic contingency table,

that is a function f : X −→ N.

As discussed in the previous chapters, for the log-linear models the goodness of fit tests are performed conditionally on the sufficient statistic, whereas in the κ approach the distribution of the estimator $\hat\kappa$ is computed conditionally on the one-dimensional margins of the tables. These statements are well

known in the two-observer case. In the multi-observer case the situation is less

well understood and it will be widely discussed later.

We call conditioning statistic the function $T : X \longrightarrow \mathbb{N}^s$ under which we


compute the conditional distributions. In the log-linear case, this is equivalent

to the notion of sufficient statistic, but in a general approach this terminology

could lead to some misunderstandings, especially when we compute conditional

distributions of the estimators.

Now, the p-values and the conditional distributions of the estimators can

be computed applying the algorithm described in Chapter 2.

4.4 Conditioning

The problem of finding the conditioning statistic T for the models introduced

above is relatively easy for d = 2. In fact, the distribution of the estimator $\hat\kappa$ is computed with fixed margins if we allow different marginal distributions, i.e., using the coefficient defined in Equation (4.1). In this case the conditioning statistic is

$$T = \left( \sum_{j=1}^{I} f_{ij},\; i = 1, \ldots, I,\; \sum_{i=1}^{I} f_{ij},\; j = 1, \ldots, I \right).$$

Using the definition of $\kappa_h$ of Equation (4.2), where we impose marginal homogeneity, the appropriate conditioning statistic is

$$T = \left( \sum_{j=1}^{I} f_{ij} + \sum_{r=1}^{I} f_{ri},\; i = 1, \ldots, I \right).$$

For the goodness of fit tests for log-linear models, we compute the distribution of the test statistics conditionally on the sufficient statistic, see for

example Fienberg (1980) for details and formulae. For example, the condi-

tioning statistic for the quasi-independence model as defined in Equation (4.5)

is given by

$$T = \left( \sum_{j=1}^{I} f_{ij},\; i = 1, \ldots, I,\; \sum_{i=1}^{I} f_{ij},\; j = 1, \ldots, I,\; f_{ii},\; i = 1, \ldots, I \right).$$
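The three blocks of this conditioning statistic (row sums, column sums, and main diagonal cells) are easily extracted from an observed table; the small Python helper below is illustrative (the function name is hypothetical):

```python
def qi_conditioning_statistic(f):
    # Conditioning statistic for the two-observer quasi-independence model
    # of Equation (4.5): row sums, column sums, and diagonal cells.
    I = len(f)
    rows = [sum(f[i]) for i in range(I)]
    cols = [sum(f[i][j] for i in range(I)) for j in range(I)]
    diag = [f[i][i] for i in range(I)]
    return rows + cols + diag
```

Two tables lie in the same reference set for this model exactly when this function returns the same vector for both.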

The situation presented above for the two-observer case is much less clear

in the multi-observer case.

Analyzing Equations (4.3) and (4.4), it follows that the conditioning statis-

tics for the Cohen’s κ problem are those coming from the complete indepen-

dence model for the estimation of κ(d) and from the complete independence


model together with the marginal homogeneity for the estimation of κh(d). In

the first case, the conditioning statistic T is

T = (fk(i), i = 1, . . . , I, k = 1, . . . , d)

where fk(i) =∑

ih 6=ikfi1,...,ih,...,idI{ik=i} is the i-th count of the k-th marginal

distribution, while in the second case the conditioning statistic T is

$$T = \left( \sum_{k=1}^{d} f_k(i),\; i = 1, \ldots, I \right).$$

In the log-linear models perspective, the conditioning statistic is

$$T = \left( f_k(i),\; i = 1, \ldots, I,\; k = 1, \ldots, d,\; f_{i \ldots i},\; i = 1, \ldots, I \right) \qquad (4.11)$$

for the quasi-independence model described by Equation (4.8), while for the

non-homogeneous model we need to know the same conditioning statistic for all

marginal two-way tables. Note that here our attention is restricted to the case

k = 2 in Equation (4.9), i.e. to the pairwise agreement, but the conditioning

structure is the same also for the agreement among h > 2 observers.

For the quasi-symmetry model in Equation (4.10) the appropriate conditioning statistic is given by

$$T = \left( f_k(i),\; i = 1, \ldots, I,\; k = 1, \ldots, d,\; \sum_{i_l,\, l \neq j,h} \left( f_{i_1 \ldots i_j \ldots i_h \ldots i_d} + f_{i_1 \ldots i_h \ldots i_j \ldots i_d} \right),\; j < h \right). \qquad (4.12)$$

In other words, the quasi-symmetry model presented here considers

the quasi-symmetry of all two-way marginal tables. This is sufficient because

many agreement models refer to pairwise agreement.

The marginal two-way agreement in a d-dimensional table is sometimes

preferable with respect to an overall test of quasi-independence or quasi-

symmetry. In Agresti (1992) a discussion on this topic is presented for the

quasi-symmetry models. In particular the author’s attention is addressed to

the study of the degrees of freedom of the test statistics and he concludes that,

in some cases when the constraints imposed in the model reduce the degrees of

freedom, the test can be too conservative. Although we use a non-asymptotic

approach, also in the exact framework the most complex models can lead to


results with a difficult interpretation, especially when the sample size is small

and there are sampling zeros. To show this, we present a simple example of

a two-way table. Consider the data shown in Table 4.1.

         Y
X      1    2    3    4
1     22    2    2    0
2      5    7   14    0
3      0    2   36    0
4      0    1   17   10

Table 4.1: Carcinoma data.

These data have already been used in Example 3.8, Chapter 3. If we analyze

these data under the quasi-symmetry model, there are only three tables in the

appropriate reference set Ft and the exact p-value of the goodness of fit test

is 1. But, this result can be criticized because of the small cardinality of Ft.

Moreover, if we slightly modify this table, setting to 0 the (1, 3) entry of the

table and then considering the data in Table 4.2, the reference set is formed

only by the original table and no other ones. In such cases, the inference is

probably infeasible.

         Y
X      1    2    3    4
1     22    2    0    0
2      5    7   14    0
3      0    2   36    0
4      0    1   17   10

Table 4.2: Carcinoma data modified.

4.5 Computations

In this section, we compute the relevant Markov bases for the statistical mod-

els introduced above using the same technique as in Chapter 3. Here, the

computations are much more complex and new results are presented.

As pointed out in the previous chapter, in order to find Markov bases it

is not necessary to consider Gröbner bases. We can consider any finite set of

generators of the relevant ideals. Given an ideal, it is possible to determine


a minimal set of generators, that is, a set of polynomials which generates the

ideal and has minimal cardinality, see Kreuzer & Robbiano (2000). Roughly

speaking, a minimal set of generators is another basis of the ideal. For the

application of the Diaconis-Sturmfels algorithm, the relevant object is the toric

ideal IT coming from the matrix representation AT of the sufficient statistic,

and the algorithm is independent from the particular representation of such

ideal. While in simple cases there is no considerable difference between the

Gröbner basis and the corresponding minimal set of generators of an ideal,

in complex cases a minimal set of generators is preferable, because of the

significant reduction in the number of binomials, leading to a simplification of

the Monte Carlo step of the algorithm.

Nevertheless, Gröbner bases are essential in the symbolic computation of the toric ideals, as they have nice algebraic properties; namely, they guarantee the finiteness of the division algorithm needed in a great number of algebraic

algorithms, including that for the computation of toric ideals.

Here, we use the same notation as in Chapters 2 and 3, i.e., we use the

ξ indeterminates for the cell probabilities and the y indeterminates for the

components of the sufficient statistic, using these indeterminates as parame-

ters.

The minimal set of generators is computed in CoCoA through the function

MinGens, starting from the Gröbner basis. Note that in the following we leave

the resulting Markov bases in terms of binomials in order to save space.

For the κ(d) coefficient in the case of 3 observers and a response variable with

3 levels, we simply write in CoCoA the matrix representation of the conditioning statistic and a few lines of statements as follows.

Mat1 := [
  [1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0],
  [0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0],
  [0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1],
  [1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0],
  [0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0],
  [0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1],
  [1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1]];
T1 := Toric(Mat1);
MB1 := MinGens(T1);
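Each row of Mat1 is the indicator of one level of one of the three one-dimensional margins of the 3 × 3 × 3 table, with columns indexing the cells in lexicographic order. Rather than typing such matrices by hand, they can be generated programmatically; the Python sketch below (illustrative, with a hypothetical function name) reproduces Mat1 up to a permutation of the rows, and each column is exactly the exponent vector of the monomial representing that cell.

```python
from itertools import product

def margin_matrix(I, d):
    # Rows: indicator of "coordinate k equals level i" for k = 1..d, i = 1..I.
    # Columns: the cells of the I^d table in lexicographic order.
    cells = list(product(range(I), repeat=d))
    rows = []
    for k in range(d):
        for i in range(I):
            rows.append([1 if cell[k] == i else 0 for cell in cells])
    return rows
```

For I = 3 and d = 3 this gives a 9 × 27 matrix whose columns each sum to 3, since every cell contributes once to each of the three margins.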


We obtain 162 moves of the form $\xi_{121}\xi_{211} - \xi_{111}\xi_{221}$. As we can see, the algebraic complexity of the problems increases fast with the number of observers.

For example the above Markov basis consists of 162 moves, while the Markov

basis for the corresponding problem with two observers consists of 9 moves.
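In table terms, the binomial $\xi_{121}\xi_{211} - \xi_{111}\xi_{221}$ is the move that adds one count to cells (1,2,1) and (2,1,1) and removes one from cells (1,1,1) and (2,2,1), leaving every one-dimensional margin unchanged. A quick illustrative check in Python (hypothetical helper names, 0-based cell indices):

```python
def apply_move(f, plus, minus):
    # Add one count to each cell in `plus`, remove one from each cell in `minus`.
    g = dict(f)
    for cell in plus:
        g[cell] = g.get(cell, 0) + 1
    for cell in minus:
        g[cell] = g.get(cell, 0) - 1
    return g

def margins(f, d, I):
    # One-dimensional margins f_k(i) of a d-way table with I levels per axis.
    return [sum(n for cell, n in f.items() if cell[k] == i)
            for k in range(d) for i in range(I)]
```

Applying this move to any table with positive counts in the two decremented cells produces a different table with exactly the same value of the conditioning statistic, which is what makes it usable as a step of the Markov chain.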

If we consider the distribution of κh(d), the problem is implemented with

Mat2 := [
  [3,2,2,2,1,1,2,1,1,2,1,1,1,0,0,1,0,0,2,1,1,1,0,0,1,0,0],
  [0,1,0,1,2,1,0,1,0,1,2,1,2,3,2,1,2,1,0,1,0,1,2,1,0,1,0],
  [0,0,1,0,0,1,1,1,2,0,0,1,0,0,1,1,1,2,1,1,2,1,1,2,2,2,3]];
T2 := Toric(Mat2);
MB2 := MinGens(T2);

and we obtain 44 moves: 17 moves of degree one, that is, coming from binomials

of degree one, and 27 moves of degree two. From a statistical point of view, the

degree of a move represents the number of counts moved from a cell to another

cell, see the correspondence between binomials and moves given in Chapter 2.

For higher-dimensional problems, i.e. with a greater number of observers

and/or with a greater number of levels of the response variable, it is enough to

add appropriate rows and columns to the previous matrices of the conditioning

statistics.

Note that the matrix representation of the conditioning statistic allows us

to write easily the expression of the underlying statistical model. In fact, as

each column represents a raw probability and each row represents a component

of the conditioning statistic, it is easy to write the multiplicative form of the

model. For example, in the previous model we have

$$\pi_{ijk} = y_i y_j y_k \qquad (4.13)$$

with $i, j, k = 1, \ldots, I$, giving $\pi_{111} = y_1^3$ in the first column, $\pi_{112} = y_1^2 y_2$ in the second column, and so on.

Now, consider the log-linear models presented in Section 4.2. In the models

below we do not write explicitly the matrix representation of the conditioning

statistics, but we write directly the results. For the quasi-independence model

and the quasi-symmetry model for two-way contingency tables, the computa-

tion of the Markov bases has been presented in the previous chapter, so we

restrict our attention to higher-dimensional tables. In particular we compute,

as an example, the Markov bases for a 3× 3× 3 table.


For the quasi-independence model defined in Equation (4.8) with condi-

tioning statistic given in Equation (4.11), we find a Markov basis with 105

moves, all of degree two.

It is worth noting that in the quasi-independence model we have single cells

as components of the sufficient statistic. So, we can apply in the computation of

the Markov bases the following result, which is a generalization of Proposition

3.4, Chapter 3, holding for two-way tables.

Theorem 4.1 The Gröbner basis of the toric ideal $I_T$ for the quasi-independence model is the Gröbner basis for the corresponding incomplete table with missing diagonal entries under the complete independence model.

Proof. The conditioning statistic for the quasi-independence model is

$$T = \left( f_k(i),\; i = 1, \ldots, I,\; k = 1, \ldots, d,\; f_{i \ldots i},\; i = 1, \ldots, I \right).$$

Consider the equivalent conditioning statistic

$$T^* = \left( f^*_k(i),\; i = 1, \ldots, I,\; k = 1, \ldots, d,\; f_{i \ldots i},\; i = 1, \ldots, I \right)$$

where $f^*_k(i)$ denotes the sum excluding the main diagonal cells. Using the conditioning statistic $T^*$, the set of binomials coming from the ring homomorphism $\xi_u \mapsto t_u = y^{T(u)}$ is formed by two separate subsets of equations. The

first subset does not involve the indeterminates ξi...i and yi...i, and is formed by

the equations of the complete independence model for the table with missing

diagonal entries, i.e. the binomials are

$$\xi_{i_1 \ldots i_d} - \prod_{k=1}^{d} y^{k}_{i_k}$$

for all non-diagonal cells. The second set of equations is formed by the binomials $\xi_{i \ldots i} - y_{i \ldots i}$ for $i = 1, \ldots, I$.

Now, eliminating the y indeterminates, the Gröbner basis coming from the first subset of equations is the Gröbner basis for the complete independence

model for the table with missing diagonal entries, while the binomials of the

second subset are all deleted. This proves the theorem.

Usually, because of the reduction in the number of indeterminates, Theorem 4.1 speeds up the symbolic computation, with a great saving of time, especially for high-

dimensional tables.

Moreover, we can also apply the following result, if we have the Markov

basis for the complete independence model.


Theorem 4.2 The Gröbner basis of the toric ideal $I_T$ for the quasi-independence model is the Gröbner basis of the elimination ideal obtained from the toric ideal of the complete independence model by eliminating the indeterminates corresponding to the diagonal cells.

Proof. Let T be the conditioning statistic of the quasi-independence model defined in Equation (4.11) and let $T'$ be the conditioning statistic of the complete independence model for the same table. As the diagonal cells are components of the conditioning statistic T, the binomials $\xi_{i \ldots i} - y_{i \ldots i}$, $i = 1, \ldots, I$, are

generators of the ideal $J_T$. The ideal $J_T$ is then formed by the set of binomials corresponding to the complete independence model

$$\xi_{i_1 \ldots i_d} - \prod_{k=1}^{d} y^{k}_{i_k} \qquad (4.14)$$

for all cells, and by the set of binomials $\xi_{i \ldots i} - y_{i \ldots i}$, $i = 1, \ldots, I$. The binomials in Equation (4.14) are the generators of the ideal $J_{T'}$ of the complete independence model and the corresponding toric ideal is computed as $I_{T'} = \mathrm{Elim}(y, J_{T'})$.

Following the theory in Chapter 2 of Kreuzer & Robbiano (2000), by the

linearity of the last binomials, we can replace ξi...i with yi...i in the binomials

(4.14).

Eliminating the y indeterminates from $J_T$, the binomials of the form $\xi_{i \ldots i} - y_{i \ldots i}$ are all deleted and for the first set we obtain $\mathrm{Elim}(y, J_{T'})$. As we can replace $\xi_{i \ldots i}$ with $y_{i \ldots i}$, the following chain of equalities holds:

$$I_T = \mathrm{Elim}(y, J_T) = \mathrm{Elim}((y, \xi_{i \ldots i}), J_T) = \mathrm{Elim}((y, \xi_{i \ldots i}), J_{T'}) = \mathrm{Elim}(\xi_{i \ldots i}, I_{T'})\,.$$

The proof is now complete.
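The elimination step used in these proofs can be reproduced on a toy case. The sketch below uses SymPy (an assumption of this illustration, not the software used in the thesis) to compute the toric ideal of the complete independence model for a 2 × 2 table: a lex Gröbner basis with the y-type indeterminates listed first performs the elimination and recovers the familiar binomial ξ11ξ22 − ξ12ξ21.

```python
from sympy import symbols, groebner

# Cell indeterminates x_ij and row/column indeterminates a, b / c, d.
a, b, c, d = symbols('a b c d')
x11, x12, x21, x22 = symbols('x11 x12 x21 x22')

# Generators x_ij - (row parameter)(column parameter), as in Equation (4.14).
gens = [x11 - a*c, x12 - a*d, x21 - b*c, x22 - b*d]

# A lex order listing a, b, c, d first turns the Groebner basis into an
# elimination computation: the basis elements free of a, b, c, d generate
# Elim((a, b, c, d), J), i.e. the toric ideal of the model.
G = groebner(gens, a, b, c, d, x11, x12, x21, x22, order='lex')
toric = [g for g in G.exprs if not g.free_symbols & {a, b, c, d}]
print(toric)  # [x11*x22 - x12*x21]
```

The same two-step pattern (add the extra linear binomials, then eliminate) is what Theorem 4.2 exploits to avoid recomputing the full Gröbner basis.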

We also consider the same problem under the simplified quasi-independence model, as defined in Equation (4.6), and we obtain a Markov basis formed by 164 moves: the 105 moves of the quasi-independence model above, together with 38 moves of degree 3 and 21 moves of degree 4.

For the non-homogeneous model defined in Equation (4.9) a slightly more complex situation arises. In fact, the Markov basis is formed by 3436 moves: 52 moves of degree 3; 774 moves of degree 4; 1064 moves of degree 5; 1503 moves of degree 6; 18 moves of degree 7; 27 moves of degree 8.


Under the quasi-symmetry model defined in Equation (4.10) with the conditioning statistic specified in Equation (4.12), we have a Markov basis formed by 219 moves: 13 moves of degree 3; 90 moves of degree 4; 54 moves of degree 5; 54 moves of degree 6; 18 moves of degree 8.

To illustrate the computational growth when moving from two- to three-dimensional tables, we give in Table 4.3 the number of moves for the 3-observer problems and the corresponding moves for the 2-observer problems of the previous models.

model                 2 observers   3 observers
κ                     9             162
κ with marg. hom.     9             44
q.i.                  1             105
simplified q.i.       9             164
non-hom. q.i.         1             3436
q.s.                  1             219

Table 4.3: Number of moves for a 3-level response and 2 or 3 observers.

In addition, it is interesting to show the importance of considering minimal sets of generators rather than Gröbner bases. This is clearly shown in Table 4.4, where we compare the number of moves coming from the Gröbner bases with the number of moves coming from the minimal sets of generators. Moreover, as the complexity of a move is represented by its degree, we also compare the maximum degree of the moves.

model                 GB     Max Degree   MSG    Max Degree
κ                     168    3            162    2
κ with marg. hom.     44     2            44     2
q.i.                  121    4            105    2
simplified q.i.       210    7            164    4
non-hom. q.i.         5995   17           3436   8
q.s.                  421    13           219    8

Table 4.4: Number of moves and maximum degree from Gröbner bases (GB) and from minimal sets of generators (MSG).

Finally, notice that non-trivial moves are involved. For example, the move of degree 13 coming from the Gröbner basis for the quasi-symmetry model is


represented by the polynomial

\[ -\,\xi_{133}\,\xi_{212}^{2}\,\xi_{221}^{2}\,\xi_{322}^{3}\,\xi_{323}^{2}\,\xi_{331}^{3} \;+\; \xi_{122}\,\xi_{213}\,\xi_{222}^{2}\,\xi_{231}\,\xi_{313}\,\xi_{321}^{4}\,\xi_{332}^{2}\,\xi_{333} \]

and it corresponds to the move displayed in Table 4.5. This move is not needed if we look at the minimal set of generators instead of the Gröbner basis.

        k = 1              k = 2              k = 3
 i\j   1    2    3      1    2    3      1    2    3
  1    0    0    0      0   +1    0      0    0   -1
  2    0   -2   +1     -2   +2    0     +1    0    0
  3    0   +4   -3      0   -3   +2     +1   -2   +1

Table 4.5: Move of degree 13 from the Gröbner basis for the quasi-symmetry model.

4.6 Example

In this section we apply the methodology described in this chapter to a data set taken from Dillon & Mulani (1984) and shown in Table 4.6. This data set gives the ratings of 164 subjects by three observers. The rating scale has 3 levels: Positive (+), Neutral (N) and Negative (−).

          A3 = +             A3 = N             A3 = −
         A2                 A2                 A2
 A1     +   N   −        +   N   −          +   N   −
  +    56  12   1        5  14   2          0   0   1
  N     1   2   1        3  20   1          0   4   1
  −     0   1   0        0   4   7          1   2  24

Table 4.6: Data set from Dillon & Mulani (1984).

For all models, we compute the non-asymptotic p-values of the tests, running the Markov chain based on the appropriate Markov basis. Each Markov chain samples 50,000 tables with 50,000 burn-in steps and sampling every 50 tables, i.e., with the same Monte Carlo parameters as in the example of Chapter 3.
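The Monte Carlo scheme just described can be sketched in a few lines. The Python sketch below is an illustration under simplifying assumptions, not the thesis code: it runs the random walk for the plain independence model on a two-way table, using only the basic degree-2 moves and the Metropolis ratio that makes the hypergeometric distribution stationary, and returns a Monte Carlo p-value for Pearson's chi-squared statistic.

```python
import math
import random

def chi2_stat(t):
    """Pearson's chi-squared statistic for independence on a 2-way table."""
    N = sum(map(sum, t))
    s = 0.0
    for row in t:
        for j, nij in enumerate(row):
            e = sum(row) * sum(r[j] for r in t) / N   # expected count
            s += (nij - e) ** 2 / e
    return s

def mcmc_pvalue(table, samples=1000, burn=200, thin=5, seed=1):
    """Walk on the tables with the margins of `table`, using the basic
    moves (+1/-1 on a random 2x2 minor); the acceptance ratio targets
    the hypergeometric distribution 1 / prod n_ij!."""
    random.seed(seed)
    R, C = len(table), len(table[0])
    cur = [row[:] for row in table]
    obs = chi2_stat(table)
    extreme = 0
    for step in range(burn + samples * thin):
        i, k = random.sample(range(R), 2)
        j, l = random.sample(range(C), 2)
        delta = [((i, j), 1), ((k, l), 1), ((i, l), -1), ((k, j), -1)]
        if all(cur[a][b] + d >= 0 for (a, b), d in delta):
            # log of prod cur! / prod new! over the four changed cells
            logr = sum(math.lgamma(cur[a][b] + 1) -
                       math.lgamma(cur[a][b] + d + 1) for (a, b), d in delta)
            if math.log(random.random()) < logr:
                for (a, b), d in delta:
                    cur[a][b] += d
        if step >= burn and (step - burn) % thin == 0:
            extreme += chi2_stat(cur) >= obs - 1e-9
    return extreme / samples

p = mcmc_pvalue([[20, 2], [3, 15]])
print("MCMC p-value:", p)
```

For the strongly associated table in the example the reported p-value should be very close to zero, in agreement with the asymptotic chi-square approximation.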

We consider two tests: a first test without the constraint of marginal homogeneity, with null hypothesis H0 : κ(3) = 0 against the one-sided alternative hypothesis H1 : κ(3) > 0, and a second test with the constraint of marginal homogeneity, with null hypothesis H0 : κh(3) = 0 against the one-sided alternative hypothesis H1 : κh(3) > 0. In the case of log-linear models we use Pearson's chi-squared statistic C for testing the goodness of fit of the 4 models presented in Section 2.

model                 df    asymptotic   MCMC
κ                           <0.0001      0
κ with marg. hom.           <0.0001      0
q.i.                  17    0.0031       0.0105
simplified q.i.       19    0.0001       0.0001
non-hom. q.i.         17    0.0016       0.0081
q.s.                  11    0.1296       0.1624

Table 4.7: Comparison of the results.

A few lines of code are sufficient to produce the output in approximately ten seconds of CPU time on a standard PC, except for the non-homogeneous quasi-independence model, which requires some additional time for the symbolic computation.

In Table 4.7 we compare the non-asymptotic MCMC p-values with the classical ones based on asymptotically normal or chi-square distributions.

From Table 4.7, we remark that the differences between the two approaches are small in those cases where there is strong evidence against the null hypothesis, while there are notable differences when the significance of the test is less evident. We conclude that there are significant differences between the asymptotic p-values and the MCMC ones, especially when the exact p-values are near the usual α-levels of the tests and the sample size is small. These differences can also lead to different conclusions on the rejection of the null hypothesis.


Chapter 5

Sufficiency and estimation

The problems of sufficiency and estimation for log-linear models are important topics in the analysis of discrete data and, especially, in the analysis of contingency tables. In Chapter 1 we have presented an introduction to log-linear models, with special attention to their structure. In particular, the literature suggests the maximum likelihood (ML) approach, as introduced in Chapter 1.

In the past few years, the application of new algebraic non-linear techniques to Probability and Statistics has been presented. Here we follow the polynomial representation of random variables on discrete sample spaces as introduced in Pistone et al. (2001a) and we study some algebraic and geometrical properties of a class of models introduced as toric models in Pistone et al. (2001b), showing the links between toric models and log-linear models. Polynomial algebra is used to describe the geometrical structure of statistical toric models on finite sample spaces. The first works in this direction, but limited to the analysis of graphical models, are Geiger et al. (2002) and Garcia et al. (2003). Moreover, the work of Diaconis & Sturmfels (1998) presented in Chapter 2 leads to the possibility of sampling from conditional distributions, and we will use this technique in order to actually compute the minimum-variance unbiased estimator of the cell counts.

In this chapter we consider a general finite sample space and we use algebraic techniques in order to obtain a description of the notion of sufficiency and to define an unbiased minimum-variance estimator. This estimator is defined for all observed tables, even in the non-negative case, i.e. it avoids problems with structural zeros. Moreover, we show that it is well defined also for sparse tables, i.e. it avoids problems with sampling zeros. Starting from this theory, we also define an algorithm to actually compute the estimate, using the technique described in Chapter 2.


Working in the non-negative case, in Section 2 we introduce the class of toric models, and we give a polynomial representation of the notion of sufficient statistic, first for a single observation and then for an independent sample, generalizing some known results. In Section 3 we study the geometry of toric models and we show that toric models are disjoint unions of exponential models. Moreover, we show that the parameterization plays a fundamental role, and we define a special parameterization such that the model contains all the possible exponential models. In Section 4 we define an unbiased minimum-variance estimator of the cell counts, and we show its uniqueness and its relationship with the ML estimate. Finally, in Section 5 we define an MCMC algorithm for the computation of the minimum-variance estimate.

5.1 Introduction

Consider a statistical model on a finite sample space X. Although in contingency tables the sample space is usually a Cartesian product (for example, in the two-way case it is a space of the form {(i, j) | i = 1, . . . , I, j = 1, . . . , J}), we assume here, without loss of generality, a generic list

X = {a(1), . . . , a(k)} .

A probability distribution on X is characterized by the parameters pi = P[a(i)] with i = 1, . . . , k. Roughly speaking, a model M is a subset of the space of parameters. As in the previous chapters, we use the vector notation, that is, p is the vector (p1, . . . , pk). The parameter space is given by the simplex pi ≥ 0, i = 1, . . . , k, and \( \sum_{i=1}^{k} p_i = 1 \).

If (x1, . . . , xN) is a sample of size N drawn from a vector (X1, . . . , XN) of independent and identically P-distributed random variables with values in X, the ML approach considers the likelihood function L(p; x1, . . . , xN), and the maximum likelihood estimate p̂ of p is given by

\[ L(\hat{p};\, x_1, \dots, x_N) = \sup_{p \in \mathcal{M}} L(p;\, x_1, \dots, x_N) . \]

The ML estimate may not exist or it may not be unique for some observed tables. The problems of non-existence or non-uniqueness are related to the presence of sampling zeros or structural zeros.

If we restrict to the case of positive parameters, i.e. pi > 0 for all i = 1, . . . , k, and the conditions for the existence and uniqueness of


the ML estimate are satisfied, the ML estimate is found by differentiating the

likelihood function.

For log-linear models on contingency tables, the likelihood equations can be classified into two types:

(a) the equation system has an explicit solution p̂ and no approximation is needed. This is the case, for example, of the independence model;

(b) the equation system has no explicit solution and we must use a numerical method in order to approximate the solution. This is the case, for example, of the quasi-independence model.

For a discussion on this topic, see Section 3.7 of Kreuzer & Robbiano (2000), where some symbolic methods for solving systems of polynomial equations are presented.

As outlined in Chapter 1, in case (b) the most widely used methods for approximating the ML estimate are Newton–Raphson and Iterative Proportional Fitting. For details about these methods, see Haberman (1974), Chapter 3. Many optimized versions of such algorithms are implemented in various software packages. For example, a modified Newton–Raphson method is used in PROC CATMOD of the SAS software; see the SAS/STAT User's Guide (2000) for a concise discussion of the computations and for further references on numerics. Details on the applications of the Newton–Raphson algorithm and of Fisher scoring are given, for example, in McCullagh & Nelder (1989), Chapter 2. Again, the main problem in the application of numerical methods is the presence in the contingency tables of sampling or structural zeros. The algorithms then need slight modifications of the observed counts, in order to eliminate the null entries; see for example the recent algorithms in Duffy (2002).
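To make the contrast between cases (a) and (b) concrete, the sketch below (plain Python, written for this illustration) fits the two-way independence model by Iterative Proportional Fitting, alternately rescaling the fitted table to match the observed row and column margins; for this particular model IPF converges after a single cycle and agrees with the explicit solution m̂ij = ni+ n+j / N.

```python
def ipf_independence(n, max_iter=50, tol=1e-10):
    """Fit the independence model to a two-way table n by alternately
    rescaling the fitted table to the observed row and column margins."""
    I, J = len(n), len(n[0])
    N = sum(map(sum, n))
    m = [[N / (I * J)] * J for _ in range(I)]        # flat starting table
    for _ in range(max_iter):
        for i in range(I):                           # match row margins
            s = sum(m[i])
            m[i] = [v * sum(n[i]) / s for v in m[i]]
        for j in range(J):                           # match column margins
            s = sum(m[i][j] for i in range(I))
            c = sum(n[i][j] for i in range(I))
            for i in range(I):
                m[i][j] *= c / s
        if all(abs(sum(m[i]) - sum(n[i])) < tol for i in range(I)):
            break                                    # margins reproduced
    return m

n = [[10, 20], [30, 40]]
m = ipf_independence(n)
# The fitted values coincide with the closed form n_i+ n_+j / N.
N = sum(map(sum, n))
for i in range(2):
    for j in range(2):
        assert abs(m[i][j] - sum(n[i]) * sum(r[j] for r in n) / N) < 1e-8
```

For models of type (b), such as quasi-independence, the same rescaling loop no longer terminates in one cycle, which is where the iteration count of `max_iter` genuinely matters.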

When the ML estimate exists, it is in general a biased estimate of the cell counts, and in some applications one may ask for an unbiased estimate. The minimum-variance unbiased estimator can be used in such situations, and also when the ML estimate does not exist or is not unique.

5.2 Sufficiency and sampling

Let us consider a statistical model on the finite sample space X of the form

px = Pφ[x] = φ(T (x)), x ∈ X (5.1)


where T : X −→ N^s is the vector of integer-valued components of the sufficient statistic. Here φ ∈ Φ, where Φ is a subset of the functions from N^s to R. In point of fact, the range of T is a finite subset T ⊂ N^s. Thus, the set of functions from T to R has a vector space structure with a finite basis. This remark will be useful in the next sections.

The set Φ defines a subset M of the space of the probabilities with sufficient statistic T, and we refer to this subset as a statistical model. In other words, M is the subset defined through

M = {p : p = φ ◦ T, φ ∈ Φ} .

We specialize Equation (5.1) by assuming that there exists a parameterization with parameters ζ1, . . . , ζs such that the probabilities assume the form

\[ p_x = \frac{L(\zeta, x)}{\sum_{x \in \mathcal{X}} L(\zeta, x)} , \tag{5.2} \]

with L(ζ, x) = ζ^{T(x)}. Here T(x) = (T1(x), . . . , Ts(x)) and \( \sum_{x \in \mathcal{X}} L(\zeta, x) \) is the normalizing constant. As in the previous chapters, we use the vector notation, e.g. ζ^{T(x)} means \( \zeta_1^{T_1(x)} \cdots \zeta_s^{T_s(x)} \). The algebraic structure of the functions φ and T in Equation (5.1) will be discussed later in Section 5.4. Following Pistone et al. (2001a) and Pistone et al. (2001b), we can state the following definition.

Definition 5.1 If a statistical model M consists of all likelihoods of the form

in Equation (5.2), then we say that the model is a toric model.

The term “toric” comes from Commutative Algebra, because of the algebraic structure of the probabilities. As we have seen in Chapter 1, toric ideals describe the algebraic relations among power products, and in toric models the probabilities are expressed in terms of power products. See also Sturmfels (1996) and Bigatti & Robbiano (2001), where toric ideals are studied in detail.

In Definition 5.1, the ζ parameters are unrestricted, except for the non-negativity constraint. Note that in general a toric model is bigger than the exponential model, as we do not assume positive probabilities; with the constraint px > 0 for all x ∈ X it is a log-linear model. Let M>0 be the subset of the toric model M with the restriction px > 0 for all x ∈ X. Then

\[ \log p_x = \sum_{j} (\log \zeta_j)\, T_j(x) + \log \zeta_0 , \tag{5.3} \]


where

\[ \zeta_0 = \Big( \sum_{x \in \mathcal{X}} L(\zeta, x) \Big)^{-1} \tag{5.4} \]

is the normalizing constant. M>0 is then log-linear, hence it is an exponential model with sufficient statistic T and canonical parameters log ζj, j = 1, . . . , s.

Note that the representation of the toric model in Equation (5.2) recalls the notion of multiplicative form of a log-linear model; see for example Goodman (1979a).

A first example of a toric model is a model whose sufficient statistic consists of the counts over the sets A1, . . . , As ⊆ X, possibly overlapping:

T : x 7−→ (I_{A1}(x), . . . , I_{As}(x)) .

In most examples in the literature

X ⊆ {1, . . . , I1} × · · · × {1, . . . , Id}

is a d-way array, possibly incomplete.

A second example is a log-linear model of the type

\[ \log P_\psi(x) = \sum_{i=1}^{s} \psi_i T_i(x) - k(\psi) \tag{5.5} \]

with integer-valued Ti's and parameters ψ = (ψ1, . . . , ψs). In such a case each

\[ T_i(x) = \sum_{a \in \mathcal{X}} T_i(a)\, I_a(x) \tag{5.6} \]

is an integer combination of all counts, see Fienberg (1980).

Note that in Equation (5.1) the strict positivity implied by the log-linear model in Equation (5.5) is not assumed.

Consider now the problem of sampling. As the probabilities can be written as

\[ p_x = \zeta_0\, \zeta_1^{T_1(x)} \cdots \zeta_s^{T_s(x)} , \tag{5.7} \]

the probability of a sample (x1, . . . , xN) is

\[ p_{x_1} \cdots p_{x_N} = \zeta_0^{N}\, \zeta_1^{\sum_i T_1(x_i)} \cdots \zeta_s^{\sum_i T_s(x_i)} , \tag{5.8} \]

i.e., the sufficient statistic for the sample of size N is the sum of the sufficient statistics of the N components of the sample. Note that this result is formally the same as in the positive case, where the theory of exponential models applies.


The j-th component of the sufficient statistic can be represented as

\[ \sum_{i} T_j(x_i) = \sum_{a \in \mathcal{X}} T_j(a)\, F_a(x_1, \dots, x_N) , \tag{5.9} \]

where

\[ F_a(x_1, \dots, x_N) = \sum_{i=1}^{N} I_a(x_i) \tag{5.10} \]

is the count of the cell a. This result generalizes the corresponding theorem in the framework of log-linear models, which states that the sufficient statistics for all log-linear models are linear combinations of the cell counts.
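Equations (5.8)–(5.10) can be checked numerically. In the sketch below (a toy 2 × 2 independence model; the cell coding is an assumption of the example, not taken from the thesis), the sufficient statistic of a simulated sample, computed observation by observation, coincides with the linear combination of the cell counts F_a prescribed by Equation (5.9).

```python
import random

# 2x2 table: T(x) collects the row and column indicators of the cell x.
cells = [(1, 1), (1, 2), (2, 1), (2, 2)]
def T(x):
    i, j = x
    return (int(i == 1), int(i == 2), int(j == 1), int(j == 2))

random.seed(0)
sample = [random.choice(cells) for _ in range(100)]

# Sum of the per-observation statistics, as in Equation (5.8).
T_N = [sum(T(x)[j] for x in sample) for j in range(4)]

# The same value as a linear combination of the cell counts, Equation (5.9).
F = {a: sample.count(a) for a in cells}
T_from_counts = [sum(T(a)[j] * F[a] for a in cells) for j in range(4)]

assert T_N == T_from_counts
print(T_N)  # the margins (n1., n2., n.1, n.2) of the simulated table
```

Here the components of T_N are exactly the row and column margins, which is the familiar special case of the general statement above.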

5.3 Geometry of toric models

Now, define the matrix AT in the following way: AT is a matrix with k rows and s columns, and its generic element AT(i, j) is Tj(xi) for all i = 1, . . . , k and j = 1, . . . , s.

If ηj = E[Tj] are the mean parameters and p is the row vector of the probabilities, then

\[ \eta = p A_T . \tag{5.11} \]

The matrix AT can also be used in order to describe the geometric structure of the statistical model. First, we can state the following result.

Proposition 5.2 Choose a parameter ζj and take the set X′ of points x ∈ X such that Tj(x) > 0. Suppose that X′ ≠ X. If we set ζj = 0, we obtain a model on the remaining sample points, and this model is again a toric model.

Proof. Without loss of generality, suppose j = 1 and T1(xi) = 0 for i = 1, . . . , k′, while T1(xi) > 0 for i = k′ + 1, . . . , k. The matrix AT can be partitioned as

\[ A_T = \left( \begin{array}{c|c} 0 & \\ \vdots & \\ 0 & A'_T \\ * & \\ \vdots & \\ * & \end{array} \right) \tag{5.12} \]

where ∗ denotes non-zero entries. Now, the matrix A′T is non-zero and it is the representation of a toric model on the first k′ sample points.


Thus, the geometric structure of a toric model comprises an exponential model and at most s toric models with (s − 1) parameters on appropriate subsets of the sample space X. Moreover, by applying the above proposition recursively we obtain the following result.

Theorem 5.3 Let ζj1, . . . , ζjr be a set of parameters and take the set X′ of points x ∈ X such that Tjb(x) > 0 for at least one b ∈ {1, . . . , r}, while Tjb(x) = 0 for all b = 1, . . . , r and x ∈ X − X′. Suppose that X′ ≠ X. If we set ζjb = 0 for all b = 1, . . . , r, we obtain a toric model on the remaining sample points.

Proof. Apply Proposition 5.2 r times.

These results lead to a geometrical characterization of a toric model.

Theorem 5.4 A toric model is the disjoint union of exponential models.

Proof. The result follows from the application of Theorem 5.3 to all possible sets of parameters ζj1, . . . , ζjr for which X′ ≠ X.

Example 5.5 Consider the independence model for 3 × 2 tables analyzed in Example 3.3 of Chapter 3. The matrix representation of the sufficient statistic is

\[ A_T = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \end{pmatrix} \]

and the toric model is, apart from the normalizing constant,

p11 = ζ1ζ4    p12 = ζ1ζ5
p21 = ζ2ζ4    p22 = ζ2ζ5
p31 = ζ3ζ4    p32 = ζ3ζ5

If ζ > 0 we have the log-linear model. Moreover, we can choose the following sets X′ in Theorem 5.3:

• a) X′ = {(1, 1), (1, 2)} corresponding to ζ1 = 0;
  b) X′ = {(2, 1), (2, 2)} corresponding to ζ2 = 0;
  c) X′ = {(3, 1), (3, 2)} corresponding to ζ3 = 0.
  In these cases we obtain the independence model for the 2 × 2 tables X \ X′;

• d) X′ = {(1, 1), (2, 1), (3, 1)} corresponding to ζ4 = 0;
  e) X′ = {(1, 2), (2, 2), (3, 2)} corresponding to ζ5 = 0.
  In these cases we obtain the multinomial model for the second and the first column, respectively;

• f) X′ = {(1, 1), (1, 2), (2, 1), (2, 2)} corresponding to ζ1 = ζ2 = 0;
  g) X′ = {(1, 1), (1, 2), (3, 1), (3, 2)} corresponding to ζ1 = ζ3 = 0;
  h) X′ = {(2, 1), (2, 2), (3, 1), (3, 2)} corresponding to ζ2 = ζ3 = 0.
  In these cases we obtain the Bernoulli model for the third, the second and the first row, respectively;

• i) X′ = {(1, 1), (1, 2), (2, 1), (3, 1)} corresponding to ζ1 = ζ4 = 0;
  j) X′ = {(1, 1), (1, 2), (2, 2), (3, 2)} corresponding to ζ1 = ζ5 = 0;
  k) X′ = {(2, 1), (2, 2), (1, 1), (3, 1)} corresponding to ζ2 = ζ4 = 0;
  l) X′ = {(2, 1), (2, 2), (1, 2), (3, 2)} corresponding to ζ2 = ζ5 = 0;
  m) X′ = {(3, 1), (3, 2), (1, 1), (2, 1)} corresponding to ζ3 = ζ4 = 0;
  n) X′ = {(3, 1), (3, 2), (1, 2), (2, 2)} corresponding to ζ3 = ζ5 = 0.
  In these cases we obtain 6 Bernoulli models. These models are the 6 models on the columns of the 3 independence models found above;

• corresponding to the six conditions ζ1 = ζ2 = ζ4 = 0, ζ1 = ζ2 = ζ5 = 0, ζ1 = ζ3 = ζ4 = 0, ζ1 = ζ3 = ζ5 = 0, ζ2 = ζ3 = ζ4 = 0 and ζ2 = ζ3 = ζ5 = 0, we obtain 6 trivial distributions on one sample point.

The toric model is then formed by 21 models: the log-linear model on 6 points, 3 models on 4 points, 2 models on 3 points, 9 models on 2 points and 6 trivial models on 1 point. Apart from the trivial models, the remaining 15 models are represented in Table 5.1.

In Table 5.1, 0 denotes a structural zero, while ∗ denotes the cells with non-zero probability.

The procedure in Theorem 5.3, applied as in Example 5.5, can also be used to define the admissible sets of structural zeros.


∗ ∗    ∗ ∗    ∗ ∗    0 0    0 ∗
∗ ∗    ∗ ∗    0 0    ∗ ∗    0 ∗
∗ ∗    0 0    ∗ ∗    ∗ ∗    0 ∗

∗ 0    0 0    0 0    ∗ ∗    0 0
∗ 0    0 0    ∗ ∗    0 0    0 ∗
∗ 0    ∗ ∗    0 0    0 0    0 ∗

0 0    0 ∗    ∗ 0    0 ∗    ∗ 0
∗ 0    0 0    0 0    0 ∗    ∗ 0
∗ 0    0 ∗    ∗ 0    0 0    0 0

Table 5.1: Nontrivial models.
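The census of sub-models in Example 5.5 is easy to verify by brute force. The sketch below (an independent check written for this illustration, not part of the thesis) enumerates, for every subset of parameters set to zero, the support left among the rows of AT, and recovers exactly the 21 models with the sizes listed above.

```python
from itertools import combinations

# Rows of A_T for the 3x2 independence model of Example 5.5:
# three row indicators followed by two column indicators.
cells = [(i, j) for i in (1, 2, 3) for j in (1, 2)]
A_T = {(i, j): (i == 1, i == 2, i == 3, j == 1, j == 2) for (i, j) in cells}

supports = set()
for r in range(6):                       # subsets of parameters set to zero
    for zeroed in combinations(range(5), r):
        # setting zeta_j = 0 kills every cell x with T_j(x) > 0
        sup = frozenset(x for x in cells
                        if not any(A_T[x][j] for j in zeroed))
        if sup:                          # empty supports are not models
            supports.add(sup)

sizes = [len(s) for s in supports]
print(len(supports))  # 21
assert sizes.count(6) == 1 and sizes.count(4) == 3 and sizes.count(3) == 2
assert sizes.count(2) == 9 and sizes.count(1) == 6
```

The distinct supports are exactly the products (non-empty set of rows) × (non-empty set of columns), i.e. 7 × 3 = 21 models, matching the count in the example.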

Definition 5.6 A subset X′ ⊂ X is an admissible set of structural zeros if there exist parameters ζj1, . . . , ζjr such that the condition of Theorem 5.3 holds.

Now, we consider the behavior of the η-parameterization in the different exponential models. Starting from the representation of the η parameters as functions of the ζ parameters in Equation (5.11), we can easily prove that the η parameterization is coherent, as stated in the following proposition.

Proposition 5.7 The η parameters for the reduced models are the same as in the exponential case, provided that a component is fixed to zero.

Proof. It is enough to combine the linear relation between η and ζ given in Equation (5.11) with Theorem 5.3.

Remark 5.8 The definition of a toric model as a disjoint union of exponential models has a counterpart in differential geometry, namely in the definition of differential manifolds with corners. For the construction of a differential manifold with corners, consider the positive quadrant in Rk. It is an open variety and its boundary is formed by k quadrants of dimension k − 1. Add one or more of these quadrants to the variety and iterate the procedure k times. In general, the derivatives at these boundary points differ from the limit of the derivatives computed in Rk. Every real variety diffeomorphic to a variety defined as above is a differential manifold with corners. For details about differential manifolds with corners, see Margalef Roig & Outerelo Dominguez (1992).


Now, we state a differential relationship between the ζ parameters and the η parameters.

Theorem 5.9 In the positive case the following equation holds:

\[ \zeta_j \frac{\partial \zeta_0}{\partial \zeta_j} = - \eta_j \zeta_0 \tag{5.13} \]

for all j = 1, . . . , s.

Proof. By straightforward computation, we have

\[ \frac{\partial \zeta_0}{\partial \zeta_j} = - \frac{\sum_{x \in \mathcal{X}} T_j(x)\, \zeta^{T(x)} / \zeta_j}{\big( \sum_{x \in \mathcal{X}} \zeta^{T(x)} \big)^2} = - \frac{\zeta_0\, \eta_j}{\zeta_j} , \]

since, by Equations (5.7) and (5.11), \( \eta_j = \sum_{x} p_x T_j(x) = \zeta_0 \sum_{x} T_j(x)\, \zeta^{T(x)} \), and the proof is complete.

Note that the previous equation holds in all exponential models contained in the toric model.

The geometric representation of the structural zeros as presented in the previous definition needs some further discussion. Let us consider a simple example.

Example 5.10 Consider the simple example of the independence model for 2 × 2 contingency tables. Using the notation of the previous chapters, a first parameterization is

p11 = ζ0        p12 = ζ0ζ1
p21 = ζ0ζ2      p22 = ζ0ζ1ζ2    (5.14)

This parameterization is used in Pistone et al. (2001a). A second parameterization is the following one, based on the matrix representation of the sufficient statistic as explained in Chapter 3:

p11 = ζ1ζ3      p12 = ζ1ζ4
p21 = ζ2ζ3      p22 = ζ2ζ4    (5.15)

Eliminating the ζ indeterminates in the above systems of equations, in both cases we obtain only one polynomial, namely the binomial

p11 p22 − p12 p21 = 0 .    (5.16)

But the toric models in Equation (5.14) and in Equation (5.15) are different. In fact, the second parameterization contains, for example, the Bernoulli model on the two points (1, 1) and (1, 2), while the first parameterization does not.

Now, consider the binomial equations obtained from a toric model by elimination of the ζ indeterminates as described in Chapter 1. In this way we obtain a set of binomials which defines a statistical model, but in general the binomial form differs from the parametric form. The binomial form is independent of the parameterization and it can include some boundaries excluded from the parametric form, as in the previous example. In the following, we will use the terms “parametric toric model” and “binomial toric model”.

Moreover, we can also restate the definition of a set of structural zeros for a toric model in binomial form.

Definition 5.11 Let B be the set of binomials defined by elimination as described in Chapter 2. A subset X′ ⊂ X is an admissible set of structural zeros independent of the parameterization if the polynomial system B = 0, together with px = 0 for all x ∈ X′, has a non-negative normalized solution.

In general, it is difficult to find a parameterization such that the parametric form of the toric model contains all the exponential sub-models.

Definition 5.12 If a parameterization defines a parametric toric model which contains all the exponential sub-models, we say that the parameterization is a full parameterization.

In view of Example 5.10 and Definition 5.12, it follows that not all matrix representations of the sufficient statistic are equivalent. In general there is an infinite number of non-negative integer-valued bases of the sub-space spanned by T, but not all of them contain all the exponential sub-models. This happens because the columns of the matrix AT are defined in the vector space framework, but in the power product representation we need non-negative exponents, and then we need linear combinations with non-negative integer coefficients. In general it is difficult to find a full parameterization, but it is easy to characterize all the parameterizations which lead to a given binomial toric model.


Theorem 5.13 Let AT be the model matrix. If v1, . . . , vs is a non-negative integer system of generators of the rank of AT, then

\[ p_x = \zeta_1^{v_1(x)} \cdots \zeta_s^{v_s(x)} \tag{5.17} \]

for x ∈ X is a parameterization, and all parameterizations of this kind lead to the same binomial toric model.

Proof. Let Av be the matrix

Av = (v1, . . . , vs) .

The kernels of the systems defined by Av and by AT are the same. As a consequence of the construction of the toric ideals presented in Section 2 of Chapter 1, the toric models defined through AT and Av are represented by the same binomials.

Among other choices, one can consider the rank of AT as a lattice and then consider a Hilbert basis of such a lattice.

Definition 5.14 Let S ⊆ Nk be the set of integer solutions of a diophantine system AT p = 0. A set of integer vectors H = {v1, . . . , vh} is a Hilbert basis of S if for all β ∈ S

\[ \beta = \sum_{v \in H} c_v v \tag{5.18} \]

with cv ∈ N.

It is known that such a set H exists and is unique. The number of elements of H in general differs from the dimension of the rank of AT as a vector sub-space. For details about the properties of Hilbert bases and the algorithms for their computation, see Sturmfels (1993) and Kreuzer & Robbiano (2003). The application of the theory of Hilbert bases to the geometrical description of toric models is currently in progress.
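For intuition, a Hilbert basis can be computed by brute force on a tiny example. The sketch below (an illustration with an ad-hoc search bound, not one of the algorithms cited above) finds the Hilbert basis of the non-negative integer solutions of the single diophantine equation β1 + β2 = 2β3: the three minimal vectors generate every other solution with non-negative integer coefficients.

```python
from itertools import product

BOUND = 4  # search box; large enough for this small example

def is_sol(v):
    """Non-zero solution of beta1 + beta2 = 2*beta3 in N^3."""
    return v != (0, 0, 0) and v[0] + v[1] == 2 * v[2]

sols = [v for v in product(range(BOUND + 1), repeat=3) if is_sol(v)]

def decomposable(v):
    """True if v splits as a + b with a, b non-zero solutions."""
    return any(is_sol(tuple(x - y for x, y in zip(v, a)))
               for a in sols if a != v and all(y <= x for x, y in zip(v, a)))

# Hilbert basis elements are the indecomposable solutions.
hilbert = sorted(v for v in sols if not decomposable(v))
print(hilbert)  # [(0, 2, 1), (1, 1, 1), (2, 0, 1)]
```

Note that the basis has 3 elements while the solution set spans a 2-dimensional sub-space, illustrating the remark that the size of H may exceed the vector space dimension.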

5.4 Algebraic representation of the sufficient statistics

Let us now introduce the algebraic representation of the sufficient statistic for toric models. Let T = T(X) be the image of the sufficient statistic T, i.e. the set of all values of the sufficient statistic. As noted in Section 5.2, the set T is a finite subset of N^s.


Proposition 5.15 Let (T1(X), . . . , Ts(X)) = T ⊆ N^s be the range of the sufficient statistic. As T is a finite subset of Q^s, there exists a list of monomials

{ t^α : α ∈ L }    (5.19)

such that every function φ : T −→ R can be expressed as

\[ \varphi(t) = \sum_{\alpha \in L} \varphi_\alpha\, t^\alpha . \tag{5.20} \]

For the proof, see e.g. Pistone et al. (2001a). Implemented algorithms can be used to derive such a list.

The model M then has the polynomial representation

\[ p_x = \varphi(T_1(x), \dots, T_s(x)) = \sum_{\alpha \in L} \varphi_\alpha\, T^\alpha(x) . \tag{5.21} \]

The functions T^α, with α ∈ L, are a sufficient statistic equivalent to T1, . . . , Ts. If the φα are unrestricted except for the normalizing constraint, the model contains all the probabilities with sufficient statistic T. The advantage of this representation is the linearity of Equation (5.21). This remark prompts the following definition.

Definition 5.16 The set T1, . . . , Ts′ is a basis for the statistical model M if

\[ P^{*}_{N}(x_1, \dots, x_N) = \varphi_N\!\left( \sum_i T_j(x_i),\; j = 1, \dots, s' \right) \tag{5.22} \]

for all N ≥ 2.

The T^α in Equation (5.21) are a basis for all models with sufficient statistic T.

5.5 Minimum-variance estimator

In this section, we define an unbiased minimum-variance estimator of the cell counts. This estimator is defined as the mean of the hypergeometric distribution, whose parameters depend on the observed value of the sufficient statistic TN. Moreover, we show that, under some extra conditions, this estimate is equal to the ML estimate of the vector of parameters Np.

In the classical log-linear theory, where the attention is restricted to the positive case, the most commonly used theorem on existence and uniqueness


of the ML estimate is based on the vector space representation of log-linear models; see Chapter 1 for an introduction and Haberman (1974), Theorem 2.2, for a detailed proof.

From this point of view, an unbiased minimum-variance estimator has two advantages. First, this estimator is well defined also in those cases where the ML estimator does not exist or its computation is difficult. Moreover, in some applications it may be desirable to use an unbiased estimator instead of a biased one.

The difficulty of computing the minimum-variance estimate will be avoided with an algebra-based algorithm presented in the next section.

From Definition 5.16 it follows that the basis is the algebraic counterpart of the complete sufficient statistic.

Definition 5.17 Define

\[ \hat{F}_x = E_\varphi\big[ F_x(X_1, \dots, X_N) \mid T^\alpha_N,\; \alpha \in L \big] . \tag{5.23} \]

F̂x is the minimum-variance unbiased estimator of the expected cell count Npx.

In fact, Eφ[F̂x] = Eφ[Fx] = Npx, and F̂x has minimum variance as it is a function of the sufficient statistic. Moreover, if the TαN form a basis, see Definition 5.16, then the estimator F̂x is a linear combination of the TαN.

Theorem 5.18 If T1, . . . , Ts are a vector space basis of the space of the sufficient statistic, the estimator F̂x is the unique unbiased estimator of the count Npx which is a function of T1, . . . , Ts.

Proof. Following the technique used in Lehmann (1983), let F̃x be another estimator of Npx which is a function of T1, . . . , Ts. We have

\[ \hat{F}_x = \sum_{\alpha} c_\alpha\, T^\alpha \qquad \text{and} \qquad \tilde{F}_x = \sum_{\alpha} d_\alpha\, T^\alpha . \]

Thus, we can write

\[ 0 = E_\varphi[\hat{F}_x - \tilde{F}_x] = \sum_{\alpha} (c_\alpha - d_\alpha)\, E_\varphi[T^\alpha] \]

for every choice of φ. As the φα are unrestricted, the previous equation leads to

\[ 0 = E_\varphi[\hat{F}_x - \tilde{F}_x] = \sum_{\alpha} \varphi_\alpha\, E_0\big[(\hat{F}_x - \tilde{F}_x)\, T^\alpha\big] , \]

where E0 is the mean with respect to a dominating measure. As the E0[(F̂x − F̃x) T^α] must all be zero, we conclude that F̂x = F̃x.

Remark 5.19 Note that the result above is easily proved if we restrict to the model M>0: in the exponential model the sufficient statistic is complete, and the result in Theorem 5.18 is a consequence of completeness.

The estimator F̂x exists even when the ML estimate is not well defined. The link between this estimator and the ML estimator is given by the following result.

Proposition 5.20 If the ML estimate exists and the vector F̂ /N belongs to the model, then the ML estimate Np̂ is equal to F̂.

Proof. A classical result of Birch (1963), generalized by Bishop et al. (1975), states that there is a unique configuration of the cell probabilities which lies in the model and satisfies the constraints given by the sufficient statistic, and this configuration is the ML estimate of the cell probabilities.
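Birch's uniqueness result can be sketched in symbols (a sketch in our own notation, with T the matrix representation of the sufficient statistic and f the observed frequency table):

```latex
\#\left\{\, q \in \mathcal{M} \;:\; Tq = T(f/N) \,\right\} \;\le\; 1 ,
```

and, when the ML estimate exists, the unique such q is the ML estimate of the cell probabilities, which yields the proposition.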

5.6 Algebraic method

From the above theory, together with the algorithm presented in Chapter 2, we can derive a Monte Carlo method for the computation of the estimates f̂x for all x ∈ X. As in the previous chapters, let Yt be the set of all samples with fixed value t of the sufficient statistic TN, and let Ft be the set of all frequency tables obtained from samples with value t of the sufficient statistic TN. As the sample distribution P∗Nφ depends on the sample only through TN, the distribution of the sample given {TN = t} is uniform on Yt. The distribution of F = (Fx)x∈X is the image of this uniform distribution, i.e. it is the hypergeometric distribution Ht on Ft; see Chapter 2 for details.
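For reference, the hypergeometric distribution on Ft can be written as follows (a sketch in the notation of this section; constants depending only on t cancel):

```latex
H_t(f) \;=\; \frac{\displaystyle\prod_{x \in \mathcal{X}} 1/f_x!}
                  {\displaystyle\sum_{g \in \mathcal{F}_t} \prod_{x \in \mathcal{X}} 1/g_x!},
\qquad f \in \mathcal{F}_t .
```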

Given t, we define a Markov chain with sample space Ft and with stationary

distribution Ht. The Markov chain is a random walk on Ft with moves in an


appropriate Markov basis M = {m1, . . . , mL}, computed starting from the

matrix representation of the sufficient statistic T. As we are interested in the mean value of the cell counts, the Markov chain is constructed following a Metropolis-Hastings-type algorithm and is performed as follows:

(a) let S = 0;

(b) at time 0 the chain is in f ;

(c) choose a move m uniformly in the Markov basis and ε = ±1 with prob-

ability 1/2 each independently of m;

(d) if f + εm ≥ 0 then move the chain from f to f + εm with probability

min{Ht(f + εm)/Ht(f), 1} . (5.24)

In all other cases, stay at f ;

(e) let S = S + f ;

(f) repeat B times the steps (c)–(e);

(g) the (approximated) mean is S/B.
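In step (d), since Ht(f) is proportional to the product of the reciprocal factorials of the cell counts, the ratio in (5.24) simplifies to a product of factorials over the cells touched by the move (a sketch of the simplification, consistent with the Matlab implementation in Appendix A):

```latex
\frac{H_t(f + \varepsilon m)}{H_t(f)}
  \;=\; \prod_{x :\, m_x \neq 0} \frac{f_x!}{(f_x + \varepsilon m_x)!} .
```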

The convergence of this Markov chain has been proved in Chapter 2. As in many cases we are interested in the computation of the estimates for two-way or multi-way contingency tables, we can use the results on the computation of Markov bases presented in Chapters 3 and 4.
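The steps (a)–(g) above can be sketched in a few lines of code. The following Python fragment is a sketch only, not the thesis software (the actual implementation, in Matlab, is given in Appendix A); the function name mean_counts and the toy 2 × 2 example with a single move are our own illustration:

```python
import math
import random

def mean_counts(f, moves, B=20000, seed=0):
    # Metropolis random walk on the fiber F_t (steps (a)-(g) above),
    # estimating the mean cell counts under the hypergeometric target.
    rng = random.Random(seed)
    f = list(f)
    s = [0.0] * len(f)                         # running sum S, step (a)
    for _ in range(B):
        m = rng.choice(moves)                  # step (c): pick a move...
        eps = rng.choice((-1, 1))              # ...and a sign, each w.p. 1/2
        g = [fi + eps * mi for fi, mi in zip(f, m)]
        if min(g) >= 0:
            # H_t(g)/H_t(f) = prod f_x!/g_x! over cells with m_x != 0
            logr = sum(math.lgamma(fi + 1) - math.lgamma(gi + 1)
                       for fi, gi, mi in zip(f, g, m) if mi != 0)
            if rng.random() < math.exp(min(logr, 0.0)):  # step (d)
                f = g
        s = [si + fi for si, fi in zip(s, f)]  # step (e)
    return [si / B for si in s]                # step (g): S/B

# toy 2x2 table with fixed margins (3,3)/(3,3); the single move is
# (+1,-1,-1,+1), and est[0] approximates the hypergeometric mean 3*3/6 = 1.5
est = mean_counts([2, 1, 1, 2], [(1, -1, -1, 1)])
```

The log-factorial form of the acceptance ratio avoids overflow for large counts; a real implementation would plug in a full Markov basis computed as in Chapters 3 and 4.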


Appendix A

Programs

In this Appendix we show the programs, both symbolic and numerical, needed for the implementation of the MCMC algorithm for the goodness-of-fit tests for log-linear models presented in Chapters 3 and 4. We use the data and the quasi-symmetry model for 4 × 4 tables of Example 3.8, Chapter 3, as the main example.

The first step of the algorithm is the computation of the relevant Markov basis for the model. To do so, we use CoCoA: we set up an appropriate polynomial ring, write down the matrix representation of the sufficient statistic, and call the CoCoA function Toric.

We use a polynomial ring with an indeterminate for each sample point, i.e.,

in our example, with 16 indeterminates. Thus, we write:

Use A::=Q[x[1..16]];
T:=Mat[
  [1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1],
  [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0],
  [0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0],
  [0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0],
  [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1],
  [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
  [0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0],
  [0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0],
  [0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0],
  [0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0],
  [0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0]
];



I:=Toric(T);
G:=GBasis(I);
G;

The first 4 rows of the matrix represent the row sums, the next 4 rows represent the column sums, and the remaining rows represent the diagonal cells and the sums of the symmetrically opposite cells.

Two remarks are useful for the implementation of this part of the algorithm. First, we use row vectors for the sample points, i.e. we use the indeterminates ξ1, . . . , ξ16 instead of the two-way representation ξ1,1, . . . , ξ4,4. This choice does not present any problems in the multi-way case, as shown by the examples in Chapter 4. Second, if we want to work with a minimal set of generators instead of a Gröbner basis of the ideal, it is sufficient to replace the command G:=GBasis(I) with the command G:=MinGens(I).

We now have a Gröbner basis of the relevant toric ideal. In our example, the CoCoA output is

[x[2]x[8]x[13] - x[4]x[5]x[14],
 x[2]x[7]x[9] - x[3]x[5]x[10],
 -x[7]x[12]x[14] + x[8]x[10]x[15],
 x[3]x[12]x[13] - x[4]x[9]x[15],
 x[2]x[7]x[12]x[13] - x[4]x[5]x[10]x[15],
 x[3]x[8]x[10]x[13] - x[4]x[7]x[9]x[14],
 -x[3]x[5]x[12]x[14] + x[2]x[8]x[9]x[15]]

We now use these binomials to define the Markov basis. We transform each binomial into a vector, as described in Chapter 2, with the following simple CoCoA function:

Define GensToMat(G);
  -- input:  a list of binomials;
  -- output: a matrix containing the vector representation
  --         of the binomials;
  L:=Len(G);
  M:=NewMat(L,NumIndets(),0);
  For I:=1 To L Do
    ListMon:=Monomials(G[I]);
    L1:=Log(ListMon[1]);
    L2:=Log(ListMon[2]);
    M[I]:=L1-L2;
  End;
  Return M;
EndDefine;

Applying this function to the Gröbner basis computed above, we obtain the following matrix:



Mat[
 [0, 1, 0, -1, -1, 0, 0, 1, 0, 0, 0, 0, 1, -1, 0, 0],
 [0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0],
 [0, 0, 1, -1, 0, 0, 0, 0, -1, 0, 0, 1, 1, 0, -1, 0],
 [0, 1, 0, -1, -1, 0, 1, 0, 0, -1, 0, 1, 1, 0, -1, 0],
 [0, 0, 1, -1, 0, 0, -1, 1, -1, 1, 0, 0, 1, -1, 0, 0],
 [0, -1, 1, 0, 1, 0, 0, -1, -1, 0, 0, 1, 0, 1, -1, 0]]

Here each row represents a move. This matrix is a concise representation

of the 7 moves found in Chapter 3, Section 3.2.4.
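The binomial-to-move conversion performed by GensToMat can also be sketched outside CoCoA. In the following Python fragment (the helper name move_from_binomial is our own), a binomial x^a − x^b is mapped to the move a − b:

```python
def move_from_binomial(exp_plus, exp_minus):
    # The move associated with the binomial x^a - x^b is the integer
    # vector a - b (cf. the CoCoA function GensToMat above).
    return [p - q for p, q in zip(exp_plus, exp_minus)]

# First Groebner-basis element: x[2]x[8]x[13] - x[4]x[5]x[14].
# Exponent vectors of its two monomials (1-based CoCoA indices):
a = [0] * 16
b = [0] * 16
for i in (2, 8, 13):
    a[i - 1] = 1
for i in (4, 5, 14):
    b[i - 1] = 1

move = move_from_binomial(a, b)
# move coincides with the first row of the matrix of moves above
```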

We can now use this matrix in the second step of the algorithm, i.e. the simulation step, for which we use the software Matlab. After the CoCoA computation, an easy ASCII manipulation is needed in order to obtain a suitable input format for Matlab.

function simul=simul(tab,mle,m,N,bis,step);
%tab  = the observed table;
%mle  = maximum likelihood estimate;
%m    = the matrix representation of the moves;
%N    = number of MCMC replicates;
%bis  = number of burn-in steps;
%step = reduction of correlation step;
nc=length(tab);
p=0;
c=zeros(N,1);
chisqref=chisq(tab,mle);
mm=-m;
m=[m; mm];
[nmoves,ncm]=size(m);
numit=N*step+bis;
for i=1:numit
  tabp=zeros(1,nc);
  r=ceil(nmoves*rand(1));
  tabp=tab+m(r,:);
  if tabp>=zeros(1,nc)
    mhr=1;
    for j=1:nc
      if m(r,j)~=0
        mhr=mhr*prod(1:tab(j))/prod(1:tabp(j));
      end;
    end;
    alpha=rand(1);
    if mhr>=alpha
      tab=tabp;
    end;
  end;
  if (rem(i,step)==0) & (i>bis)
    c((i-bis)/step)=chisq(tab,mle);
    if c((i-bis)/step)>=chisqref
      p=p+1;
    end;
  end;
end;
p=p/N;
simul=p;

The value of the Pearson statistic is computed through the following function:

function chisq=chisq(tab,MLE);
c=0;
l=length(tab);
for i=1:l
  if MLE(i)~=0
    c=c+(tab(i)-MLE(i))*(tab(i)-MLE(i))/MLE(i);
  end;
end;
chisq=c;

Some remarks about the parameters of the simulation Matlab function are in order.

• the maximum likelihood estimate of the cell counts mle is computed with statistical software, for example SAS, see SAS/STAT User's Guide (2000), unless a closed form is available;

• the number of Monte Carlo replicates N is chosen according to the precision we want to reach. In our examples we have used N = 50,000, and the motivation for this choice is discussed in Example 3.6, Chapter 3;

• the number of burn-in steps bis and the correlation-reduction step step should be chosen for each particular problem, based on inspection of the empirical distribution of the random variables and of the traces of the autocorrelation function, respectively.
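For readers without Matlab, the logic of simul and chisq can be sketched in Python. This is a sketch under the same assumptions, not the thesis software; the function names pearson and mc_pvalue are ours, and the 2 × 2 independence example with a single move is purely illustrative:

```python
import math
import random

def pearson(tab, mle):
    # Pearson chi-square statistic, as in the Matlab function chisq
    return sum((t - e) ** 2 / e for t, e in zip(tab, mle) if e != 0)

def mc_pvalue(tab, mle, moves, N=5000, bis=500, step=2, seed=0):
    # Metropolis walk over the fiber with the hypergeometric target,
    # counting sampled tables at least as extreme as the observed one.
    rng = random.Random(seed)
    ref = pearson(tab, mle)
    f = list(tab)
    hits = 0
    for i in range(1, N * step + bis + 1):
        m = rng.choice(moves)
        eps = rng.choice((-1, 1))
        g = [fi + eps * mi for fi, mi in zip(f, m)]
        if min(g) >= 0:
            # hypergeometric ratio H(g)/H(f) = prod f_x!/g_x!
            logr = sum(math.lgamma(fi + 1) - math.lgamma(gi + 1)
                       for fi, gi, mi in zip(f, g, m) if mi != 0)
            if rng.random() < math.exp(min(logr, 0.0)):
                f = g
        if i % step == 0 and i > bis:      # thinning and burn-in
            if pearson(f, mle) >= ref:
                hits += 1
    return hits / N

# illustrative 2x2 independence example with the single move (+1,-1,-1,+1);
# the exact conditional p-value here is 34/70, about 0.486
p = mc_pvalue([3, 1, 1, 3], [2, 2, 2, 2], [(1, -1, -1, 1)])
```

As in the Matlab code, negatives of the moves are handled by the random sign, and the factorial ratio is restricted to the cells actually touched by the move.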

Finally, with a slight modification of the above program, one can obtain

the approximate distribution of the test statistic.


Bibliography

Agresti, A. (1992). Modelling patterns of agreement and disagreement.

Statistical Methods in Medical Research 1, 201–218.

Agresti, A. (1996). An Introduction to Categorical Data Analysis. New

York: Wiley.

Agresti, A. (2001). Exact inference for categorical data: Recent advances

and continuing controversies. Statist. Med. 20, 2709–2722.

Agresti, A. (2002). Categorical Data Analysis. New York: Wiley, 2nd ed.

Agresti, A., Mehta, C. R. & Patel, N. R. (1990). Exact inference for

contingency tables with ordered categories. J. Amer. Statist. Assoc. 85,

453–458.

Baglivo, J., Olivier, D. & Pagano, M. (1992). Methods for exact

goodness-of-fit tests. J. Amer. Statist. Assoc. 87, 464–469.

Becker, M. P. (1990). Quasisymmetric models for the analysis of square

contingency tables. J. R. Statist. Soc. 52, 369–378.

Bergsma, W. P. & Rudas, T. (2002). Marginal models for categorical data.

Ann. Statist. 30, 140–159.

Bigatti, A., La Scala, R. & Robbiano, L. (1999). Computing toric

ideals. J. Symb. Comput. 27, 351–365.

Bigatti, A. & Robbiano, L. (2001). Toric ideals. Mat. Contemp. 21, 1–25.

Birch, M. W. (1963). Maximum likelihood in three-way contingency tables.

J. R. Statist. Soc. 25, 220–233.

Bishop, Y. M., Fienberg, S. & Holland, P. W. (1975). Discrete multi-

variate analysis: theory and practice. Cambridge: MIT Press.



Booth, J. G. & Butler, R. W. (1999). An importance sampling algorithm

for exact conditional tests in log-linear models. Biometrika 86, 321–331.

Capani, A., Niesi, G. & Robbiano, L. (2000). CoCoA, a system for doing

Computations in Commutative Algebra. Available via anonymous ftp from

cocoa.dima.unige.it, 4th ed.

CDER (2000). Developing Medical Imaging Drugs and Biological products.

Guidance for industry. U.S. Department of Health and Human Services,

Food and Drug Administration.

Chib, S. & Greenberg, E. (1995). Understanding the Metropolis–Hastings

algorithm. Amer. Statist. 49, 327–335.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational

and Psychological Measurement 20, 37–46.

Collombier, D. (1980). Recherches sur l'Analyse des Tables de Contingence. PhD thesis, Université Paul Sabatier de Toulouse.

Cox, D., Little, J. & O’Shea, D. (1992). Ideals, Varieties, and Algorithms.

New York: Springer Verlag.

Darroch, J. N. & McCloud, P. I. (1986). Category distinguishability and

observer agreement. Australian Journal of Statistics 28, 371–388.

De Loera, J. A., Hemmecke, R., Tauzer, J. & Yoshida, R. (2003).

Effective lattice point counting in rational convex polytopes. Preprint.

Diaconis, P. & Sturmfels, B. (1998). Algebraic algorithms for sampling

from conditional distributions. Ann. Statist. 26, 363–397.

Dillon, W. R. & Mulani, N. (1984). A probabilistic latent class model

for assessing inter-judge reliability. Multivariate Behavioral Research 19,

438–458.

Dobson, A. J. (1990). An Introduction to Generalized Linear Models. Lon-

don: Chapman & Hall.

Donner, A. & Klar, N. (1996). The statistical analysis of kappa statistics

in multiple samples. Journal of Clinical Epidemiology 49, 1053–1058.


Donner, A., Shoukri, M. M., Klar, N. & Bartfay, E. (2000). Testing

the equality of two dependent kappa statistics. Statist. Med. 19, 373–387.

Duffy, D. (2002). The gllm package. Available from

http://cran.r-project.org, 0th ed.

Fienberg, S. (1980). The Analysis of Cross-Classified Categorical Data.

Cambridge: MIT Press.

Fingleton, B. (1984). Models of Category Counts. Cambridge: Cambridge

University Press.

Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New

York: Wiley, 2nd ed.

Garcia, L. D., Stillman, M. & Sturmfels, B. (2003). Algebraic geometry of Bayesian networks. Preprint.

Geiger, D., Meek, C. & Sturmfels, B. (2002). On the toric algebra of

graphical models. Preprint.

Goodman, L. A. (1979a). Multiplicative models for square contingency tables

with ordered categories. Biometrika 66, 413–418.

Goodman, L. A. (1979b). Simple models for the analysis of association in

cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74,

537–552.

Goodman, L. A. (1985). The analysis of cross-classified data having ordered

and/or unordered categories: Association models, correlation models, and

asymmetry models for contingency tables with or without missing entries.

Ann. Statist. 13, 10–59.

Guggenmoos-Holzmann, I. & Vonk, R. (1998). Kappa-like indices of

agreement viewed from a latent class perspective. Statist. Med. 17, 797–

812.

Haberman, S. J. (1974). The Analysis of Frequency Data. Chicago and

London: The University of Chicago Press.

Haberman, S. J. (1978). Analysis of Qualitative Data. New York: Academic

Press.


Hanley, J. A. (1998). Receiver Operating Characteristic (ROC) curves.

In Encyclopedia of Biostatistics, P. Armitage & T. Colton, eds. Wiley, pp.

3738–3745.

Hirji, K. F. & Johnson, T. D. (1996). A comparison of algorithms for

exact analysis of unordered 2× k contingency tables. Comput. Statist. Data

Anal. 21, 419–429.

Klar, N., Lipsitz, S. R. & Ibrahim, J. G. (2000). An estimating equation

approach for modelling Kappa. Biom. J. 42, 45–58.

Kreuzer, M. & Robbiano, L. (2000). Computational Commutative Algebra

1. New York: Springer.

Kreuzer, M. & Robbiano, L. (2003). Computational Commutative Algebra

2. New York: Springer. In preparation.

Landis, R. J. & Koch, G. G. (1975). A review of statistical methods in

the analysis of data arising from observer reliability studies, Parts I and II.

Statist. Neerlandica 29, 101–123, 151–161.

Lehmann, E. L. (1983). Theory of Point Estimation. New York: John Wiley

& Sons.

Lehmann, E. L. (1987). Testing Statistical Hypotheses. New York: John

Wiley & Sons.

Margalef Roig, J. & Outerelo Dominguez, E. (1992). Differential Topology. No. 173 in North-Holland Mathematics Studies. Amsterdam: North-Holland Publishing Co.

McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. Lon-

don: Chapman & Hall, 2nd ed.

Mehta, C. R. & Patel, N. R. (1983). A network algorithm for performing Fisher's exact test in r × c contingency tables. J. Amer. Statist. Assoc. 78, 427–434.

Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1988). Importance sampling for estimating exact probabilities in permutational inference. J. Amer. Statist. Assoc. 83, 999–1005.


Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1991). Exact stratified linear rank tests for binary data. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, E. M. Keramidas, ed. American Mathematical Society, pp. 200–207.

Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (2000). Efficient Monte Carlo methods for conditional logistic regression. J. Amer. Statist. Assoc. 95, 99–108.

Pistone, G., Riccomagno, E. & Wynn, H. P. (2001a). Algebraic Statistics: Computational Commutative Algebra in Statistics. Boca Raton: Chapman & Hall/CRC.

Pistone, G., Riccomagno, E. & Wynn, H. P. (2001b). Computational

commutative algebra in discrete statistics. In Algebraic Methods in Statistics

and Probability, M. A. G. Viana & D. S. P. Richards, eds., vol. 287 of

Contemporary Mathematics. American Mathematical Society, pp. 267–282.

Rudas, T. & Bergsma, W. P. (2002). On generalised symmetry. Tech. rep., Paul Sabatier University, Toulouse, France. Preprint.

SAS/STAT User’s Guide (2000). Version 8. Cary, NC: SAS Institute, 1st ed.

Smith, P. W. F., Forster, J. J. & McDonald, J. W. (1996). Monte

Carlo exact tests for square contingency tables. J. R. Statist. Soc. 159,

309–321.

Strawderman, R. L. & Wells, M. T. (1998). Approximately exact inference for the common odds ratio in several 2 × 2 tables. J. Amer. Statist. Assoc. 93, 1294–1307.

Sturmfels, B. (1993). Algorithms in Invariant Theory. Texts and Mono-

graphs in Symbolic Computation. New York: Springer.

Sturmfels, B. (1996). Gröbner Bases and Convex Polytopes, vol. 8 of University Lecture Series. Providence, RI: American Mathematical Society.

Tanner, M. A. & Young, M. A. (1985). Modelling agreement among raters. J. Amer. Statist. Assoc. 80, 175–180.


Williamson, J. M., Manatunga, A. K. & Lipsitz, S. R. (1994). Modeling kappa for measuring dependent categorical agreement data. Biostatistics 1, 191–202.