MONTE CARLO METHODS FOR STRUCTURED DATA
A DISSERTATION
SUBMITTED TO THE INSTITUTE FOR COMPUTATIONAL
AND MATHEMATICAL ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Adam Guetz
January 2012
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/rg833nw3954
© 2012 by Adam Nathan Guetz. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Susan Holmes, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Amin Saberi, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Peter Glynn
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Recent years have seen an increased need for modeling of rich data across many engi-
neering and scientific disciplines. Much of this data contains structure, or non-trivial
relationships between elements, that should be exploited when performing statistical
inference. Sampling from and fitting complicated models present challenging com-
putational issues, and available deterministic heuristics may be ineffective. Monte
Carlo methods present an attractive framework for finding approximate solutions to
these problems. This thesis covers two closely related techniques: adaptive impor-
tance sampling, and sequential Monte Carlo. Both of these methods make use of
sampling-importance resampling to generate approximate samples from distributions
of interest.
Sequential importance sampling is well known to have difficulties in high-dimensional
settings. I present a technique called conditional sampling-importance resampling,
an extension of sampling importance resampling to conditional distributions that
improves performance, particularly when independence structure is present. The
primary application is to multi-object tracking for a colony of harvester ants in a
laboratory setting. Previous approaches tend to make simplifying parametric as-
sumptions on the model in order to make computations more tractable, while the
approach presented finds approximate solutions to more complicated and realistic
models. To analyze structural properties of networks, I expand adaptive importance
sampling techniques to the analysis of network growth models such as preferential
attachment, using the Plackett-Luce family of distributions on permutations, and I
present an application of sequential Monte Carlo to a special form of network growth
model called vertex censored stochastic Kronecker product graphs.
Acknowledgements
I’d like to thank my wife Heidi Lubin, my son Levi, my principal advisor Susan
Holmes, my co-advisor Amin Saberi, my parents, and all of my friends and extended
family.
Contents
Abstract iv
Acknowledgements v
1 Introduction 1
1.1 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Approximate Sampling 6
2.1 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Effective Sample Size . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Sampling Importance Resampling . . . . . . . . . . . . . . . . 11
2.2 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Metropolis Hastings . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Hit-and-Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Sequential Monte Carlo 18
3.1 Sequential Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Sequential Importance Sampling . . . . . . . . . . . . . . . . . . . . . 20
3.3 Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Adaptive Importance Sampling 24
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.1 Variance Minimization . . . . . . . . . . . . . . . . . . . . . . 25
4.1.2 Cross-Entropy Method . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Avoiding Degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Annealed Importance Sampling . . . . . . . . . . . . . . . . . 30
4.3.2 Population Monte Carlo . . . . . . . . . . . . . . . . . . . . . 31
5 Conditional Sampling Importance Resampling 33
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Conditional Resampling . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.1 Estimating Marginal Importance Weights . . . . . . . . . . . . 36
5.2.2 Conditional Effective Sample Size . . . . . . . . . . . . . . . . 36
5.2.3 Importance Weight Accounting . . . . . . . . . . . . . . . . . 37
5.3 Example: Multivariate Normal . . . . . . . . . . . . . . . . . . . . . . 38
6 Multi-Object Particle Tracking 43
6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1.1 Single Object Tracking . . . . . . . . . . . . . . . . . . . . . . 43
6.1.2 Multi Object Tracking . . . . . . . . . . . . . . . . . . . . . . 45
6.1.3 Tracking Notation . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Conditional SIR Particle Tracking . . . . . . . . . . . . . . . . . . . . 47
6.2.1 Grouping Subsets for Multi-Object Tracking . . . . . . . . . . 48
6.3 Application: Tracking Harvester Ants . . . . . . . . . . . . . . . . . . 49
6.3.1 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3.2 Observation Model . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3.3 State-Space Model . . . . . . . . . . . . . . . . . . . . . . . . 53
6.3.4 Importance Distribution . . . . . . . . . . . . . . . . . . . . . 54
6.3.5 Computing Relative and Marginal Importance Weights . . . . 62
6.4 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4.1 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4.2 Short Harvester Ant Video . . . . . . . . . . . . . . . . . . . . 65
7 Network Growth Models 70
7.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.1.1 Erdos-Renyi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.1.2 Preferential Attachment . . . . . . . . . . . . . . . . . . . . . 73
7.1.3 Duplication/Divergence . . . . . . . . . . . . . . . . . . . . . 75
7.2 Computing Likelihoods with Adaptive Importance Sampling . . . . . 75
7.2.1 Marginalizing Vertex Ordering . . . . . . . . . . . . . . . . . . 78
7.2.2 Plackett-Luce Model as an Importance Distribution . . . . . . 79
7.2.3 Choice of Description Length Function . . . . . . . . . . . . . 80
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.3.1 Modified Preferential Attachment Model . . . . . . . . . . . . 81
7.3.2 Adaptive Importance sampling . . . . . . . . . . . . . . . . . 82
7.3.3 Annealed Importance sampling . . . . . . . . . . . . . . . . . 82
7.3.4 Computational Effort . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . 84
8 Kronecker Product Graphs 91
8.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2 Stochastic Kronecker Product Graph model . . . . . . . . . . . . . . 94
8.2.1 Likelihood under Stochastic Kronecker Product Graph model . 94
8.2.2 Sampling Permutations . . . . . . . . . . . . . . . . . . . . . . 96
8.2.3 Computing Gradients . . . . . . . . . . . . . . . . . . . . . . . 96
8.3 Vertex Censored Stochastic Kronecker Product Graphs . . . . . . . . 97
8.3.1 Importance Sampling for Likelihoods . . . . . . . . . . . . . . 98
8.3.2 Choosing Censored Vertices . . . . . . . . . . . . . . . . . . . 100
8.3.3 Sampling Permutations . . . . . . . . . . . . . . . . . . . . . . 100
8.3.4 Multiplicative Attribute Graphs . . . . . . . . . . . . . . . . . 101
8.4 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 102
List of Tables
6.1 Observation event types. . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1 Comparison of estimators for sparse 500 node preferential attachment
dataset from Figure 7.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Comparison of estimators for dataset: 5 networks, 30 nodes each, av-
erage degree 2, 20 samples each method . . . . . . . . . . . . . . . . . 86
7.3 Comparison of estimators for dataset: 2 networks, 100 nodes each,
average degree 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.4 Estimated log-likelihoods for Mus Musculus protein-protein interaction
networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
List of Figures
3.1 Dependence structure of hidden Markov models . . . . . . . . . . . . 19
5.1 CSIR Normal example: eigenvalues of covariance matrices . . . . . . 40
5.2 CSIR Normal example: estimated KL-divergences . . . . . 41
5.3 Same experiments as in Figure 5.2, plotted by method. . . . . . . . . 42
6.1 Example grouping subset functions . . . . . . . . . . . . . . . . . . . 49
6.2 Blob bisection via spectral partitioning . . . . . . . . . . . . . . . . . 52
6.3 Association of objects with observations. ’Events’ correspond to con-
nected components in this bipartite graph, including Normal obser-
vations, splitting, merging, false positives, false negatives, and joint
events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.4 “True” distribution of path lengths and trajectories per frame, simu-
lated example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5 Centroid observations per frame, simulated example. . . . . . . . . . 66
6.6 Distribution of path lengths and trajectories per frame using a sample
from the importance distribution, simulated example. . . . . . . . . . 67
6.7 Distribution of path lengths and trajectories per frame using CSIR,
simulated example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.8 GemVident screenshot, showing centroids. . . . . . . . . . . . . . . . 68
6.9 Centroid observations per frame from Harvester ant example. . . . . . 68
6.10 Distribution of path lengths and trajectories per frame using a sample
from the importance distribution, Harvester ant example. . . . . . . . 69
6.11 Distribution of path lengths and trajectories per frame using CSIR,
Harvester ant example. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.1 Example runs comparing annealed importance sampling and adaptive
importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2 Likelihoods and importance weights for cross-entropy method. . . . . 88
7.3 Mus. Musculus (common mouse) PPI network. . . . . . . . . . . . . 89
7.4 Convergence of adaptive importance sampling and annealed impor-
tance sampling for Mus. Musculus PPI network. . . . . . . . . . . . . 90
8.1 Comparison of crude and SIS Monte Carlo for Kronecker graph likeli-
hoods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.2 Comparison of SKPG and VCSKPG models for AS-ROUTEVIEWS
graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Chapter 1
Introduction
Contemporary data analysis often involves information with complicated and
high-dimensional relationships between elements. Traditional, deterministic analytic
techniques are often unable to directly cope with the computational challenge, and
must make simplifying assumptions or heuristic approximations. An attractive alternative is the suite of randomized methods known as Monte Carlo. The types of
problems examined in this thesis often contain both discrete and continuous compo-
nents, and can generally be expressed as or related to integral or summation type
problems. Suppose one wishes to compute some quantity µ defined as
µ = ∫_Ω X(ω) P(dω).  (1.1)
If X : Ω → R is the random variable defined on the probability space (Ω, Σ, P), then this
can be equivalently expressed as the expected value
µ = E[X]. (1.2)
In some cases, µ can be computed exactly using analytic techniques. For many
examples this is not possible and one must resort to methods of approximation. Deter-
ministic numerical integration, or quadrature, generally has good convergence prop-
erties for low and moderate dimensional integrals. However, the computational com-
plexity of quadrature increases exponentially in the dimension of the sample space
CHAPTER 1. INTRODUCTION 2
Ω, making high-dimensional inference computationally intractable. This general phe-
nomenon is known as the curse of dimensionality [11], and can be explained in terms
of the relative “sparseness” of high-dimensional space. Monte Carlo integration can
be a viable alternative to quadrature in these settings, as it converges at a rate proportional to the square root of the sample size regardless of dimension.
1.1 Monte Carlo Integration
Given N independent, identically distributed random variables X_1, . . . , X_N with E[X²] < ∞, let X̄_N = (∑_{i=1}^N X_i)/N. The strong law of large numbers gives

X̄_N →a.s. µ,  (1.3)

where →a.s. denotes almost sure convergence. This provides the motivation to use x̄_N as an approximation to µ. Using x̄_N to approximate µ is known as Monte Carlo
Integration. Notationally, in this thesis bold uppercase letters such as X indicate
random variables, while lowercase letters such as x indicate observations of those
random variables.
Under the above conditions, the central limit theorem states that

X̄_N − µ →d N(0, σ²_X/N),  (1.4)

where →d denotes convergence in distribution, and σ²_X = var[X]. Roughly speaking, this means that in the limit, convergence of X̄_N to µ occurs at a √N rate. In other words, to get another digit of accuracy (a factor of 10), one would need 10² = 100 times
as many samples. While this rate of convergence is unappealing for low-dimensional
integrals, the rate holds regardless of the sample space Ω, therefore allowing one to
circumvent the curse of dimensionality for high-dimensional problems.
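As a concrete illustration of the method, the following minimal Python sketch computes a crude Monte Carlo estimate; the integrand E[U²] and the sample size are illustrative choices, not examples taken from the text.

```python
import random

def mc_estimate(f, sampler, n, seed=0):
    """Crude Monte Carlo: average f over n iid draws from sampler."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# Example: mu = E[U^2] for U ~ Uniform(0, 1); the exact value is 1/3.
est = mc_estimate(lambda u: u * u, lambda rng: rng.random(), 100_000)
# The root-mean-square error of est shrinks like 1/sqrt(N), regardless of dimension.
```

Doubling the number of correct digits requires squaring the number of samples, which is exactly the √N behavior described above.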
This formulation, however, hides some of the additional complexity inherent in
Monte Carlo integration. One difficulty is that the variance σ²_X may grow exponentially in the dimension of the sample space Ω. In these cases, the Monte Carlo
guarantee of quadratic convergence is unhelpful since the starting point is an esti-
mator with exponentially high variance. To make Monte Carlo methods practical in
this case, it is necessary to employ variance reduction techniques. For comprehensive
reviews of variance reduction techniques, see Liu [68] or Asmussen and Glynn [6].
The primary method for variance reduction used in this text is importance sampling,
where instead of directly sampling the random variable X to estimate (1.1), one samples from some biased random variable Y and corrects for the bias. See
§2.1 for background on importance sampling.
Another potential source of complexity is in the generation of independent random
draws from the sample space ω ∈ Ω and the computation of the random variable X(ω).
For many commonly occurring problems, the best known algorithms for sampling
exactly from the sample space of interest take exponential (or worse) time. In these
cases the only feasible alternative is to use approximate sampling techniques. In
this thesis, I will examine two main techniques for approximate sampling: sampling importance resampling (SIR) and Markov chain Monte Carlo (MCMC). Background for these techniques is covered in §2. Advanced topics covered include sequential Monte Carlo (§3), including particle filtering, and adaptive importance sampling (§4).
1.2 Applications
In §6, techniques are discussed for tracking large numbers of possibly interacting ob-
jects. In multi-object tracking (also known as multi-target tracking), the sequential
process of interest can be represented as a hidden Markov model (HMM). The primary
goal is to make inferences about the hidden state. There are many techniques avail-
able for tracking multiple targets, but almost all of those currently available either
make broad simplifying assumptions or are infeasible for large problem sizes. Gener-
ally it is preferable to use models of movement and observation that are as realistic as
possible while still permitting scalable analysis. One can generally state the problem
of inference as being equivalent to sampling from a posterior distribution, but in the
tracking models studied in this thesis it is generally impossible to make exact draws
from the posterior distribution. Instead, sequential Monte Carlo techniques are used
to draw approximate samples. The standard particle filter doesn’t typically work
well with high-dimensional models due to degeneracy issues, where the bulk of the
importance weight mass is concentrated on a small number of particles. Part of the
degeneracy issue can be resolved using standard SIR, however in the high-dimensional
multi-target tracking example studied in §6, it is insufficient. To address this issue, §5 introduces a conditional sampling importance resampling step that
takes advantage of inherent independence structures in the model. When individ-
ual targets are far from one another, conditional SIR effectively admits a separate
particle filter for each target while asymptotically maintaining the correct joint dis-
tribution. An implementation of this algorithm with empirical examples of tracking
the movements of harvester ants in a laboratory setting is given.
Adaptive importance sampling is a technique similar in many respects to the par-
ticle filter. In §7 an application of adaptive importance sampling to inference for
network growth models is studied. Network growth models are models of network
creation in which a new vertex arrives and attaches edges to pre-existing vertices
according to a rule that depends on the current state of the network. In the application considered, the goal is to estimate the likelihood that a given network originated from the growth model, for the purpose of model selection. A primary technical
difficulty is that one typically doesn’t know the order in which vertices joined the
network. A priori, each ordering is equally likely, so one needs to consider all possible
permutations of orderings to make valid inferences. This quantity can be expressed
as a summation over all permutations, and can be represented as estimating the nor-
malizing constant of the distribution that has probability proportional to the model
likelihood for each permutation. Since there are a factorial number of permutations,
for moderate numbers of vertices direct summation is infeasible and one must resort
to approximation techniques. In this setting, crude Monte Carlo tends to work poorly,
since most of the likelihood is concentrated on a vanishingly small subset of permutations. To reduce the variance of the estimator, adaptive importance sampling is used, with importance distributions selected from the Plackett-Luce family of permutation
distributions. This is a novel use of this family of distributions in the importance
sampling context. An example is given using the technique on a modified version of
preferential attachment.
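Concretely, a Plackett-Luce distribution over permutations can be sampled sequentially: each position of the permutation is filled by drawing one of the remaining items with probability proportional to an item weight. A minimal Python sketch follows (an illustrative parameterization; the exact form used later in the text may differ), together with the exact log-probability needed when the model serves as an importance distribution.

```python
import math
import random

def sample_plackett_luce(weights, rng):
    """Draw a permutation of range(len(weights)): fill each position by
    choosing among the remaining items with probability proportional to weight."""
    remaining = list(range(len(weights)))
    perm = []
    while remaining:
        ws = [weights[i] for i in remaining]
        j = rng.choices(range(len(remaining)), weights=ws, k=1)[0]
        perm.append(remaining.pop(j))
    return perm

def plackett_luce_logprob(perm, weights):
    """Exact log-probability of perm under the same model, usable as the
    denominator of an importance weight."""
    rem_total = sum(weights[i] for i in perm)
    lp = 0.0
    for i in perm:
        lp += math.log(weights[i] / rem_total)
        rem_total -= weights[i]
    return lp
```

With all weights equal, every permutation receives probability 1/n!, recovering the uniform distribution over orderings.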
In §8 inference in a special type of network growth model known as stochastic
Kronecker product graph (SKPG) is discussed. These models have a simple formu-
lation that permits relatively efficient estimation of maximum likelihood parameters.
The SKPG model is a generalization of the Erdos-Renyi G(n, p) model and implicitly constructs a matrix of Bernoulli edge probabilities using Kronecker products of smaller
seed matrices. These models suffer from the same difficulty as other network growth
models in that, in order to compute the likelihood, one needs to sum over all possible vertex labeling permutations, but SKPGs have the advantage that it is relatively easy
to compute the normalizing constant. To address the permutation issue, Leskovec and
Faloutsos [64] use a Markov chain Monte Carlo algorithm over permutation space.
However, these models encounter difficulties when there is a mismatch between the
model dimensions and the number of vertices in the network data. To address these
issues, the vertex censored stochastic Kronecker product graph (VCSKPG) model is
introduced, which allows more flexibility in the allowable number of model vertices.
A sequential importance sampling scheme is proposed to perform efficient parameter
fitting and likelihood estimation for this model.
Chapter 2
Approximate Sampling
In many application settings, it is desirable to sample from a distribution of interest
π but it is impossible to do so in a reasonable amount of time, i.e. sampling from π is
computationally intractable. Approximate sampling refers to a set of methods that at-
tempt the next best thing, which is to sample from some distribution γ that is in some
sense “close” to π. There is a strong connection between approximate sampling and
estimation problems. In particular, Jerrum et al. [53] were able to give a polynomial-
time reduction between almost uniform sampling and approximate counting. Another way to view this relationship is through the well-known importance sampling identity
(2.6), which gives a zero-variance estimator when sampling from the optimal impor-
tance distribution γ∗, and low-variance estimates when approximately sampling from
γ∗.
By “closeness” of γ to π, one can either use metrics between probability distributions, such as the total variation distance,

d_TV(π, γ) = sup_{A⊆Ω} |π(A) − γ(A)|,  (2.1)
CHAPTER 2. APPROXIMATE SAMPLING 7
or pseudo-metrics such as the Kullback-Leibler divergence,

d_KL(π‖γ) = E_π[log(π(X)/γ(X))]  (2.2)
          = E_π[log π(X)] − E_π[log γ(X)],  (2.3)

where the notation E_π indicates that the random variable X is distributed according to π. Although Kullback-Leibler divergence is not a true distance function, as it is not symmetric (d_KL(π‖γ) ≠ d_KL(γ‖π) in general), it does have the property that d_KL(π‖γ) = 0 iff π(x) = γ(x) for all x with nonzero measure. The two terms on the RHS of (2.3) are the negative entropy of π, E_π[log π(X)], roughly representing sample diversity, and the negative cross-entropy of γ relative to π, E_π[log γ(X)], representing the “goodness of fit” of γ to π.
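As a concrete check of (2.2)–(2.3), the KL divergence between two univariate normal distributions has a closed form against which a simple Monte Carlo estimate of E_π[log π(X) − log γ(X)] can be compared; the particular normals below are illustrative choices.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2.0 * sigma * sigma)

def kl_exact(mu_p, s_p, mu_q, s_q):
    """Closed-form KL divergence between N(mu_p, s_p^2) and N(mu_q, s_q^2)."""
    return math.log(s_q / s_p) + (s_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * s_q ** 2) - 0.5

def kl_mc(mu_p, s_p, mu_q, s_q, n=100_000, seed=0):
    """Monte Carlo estimate of E_pi[log pi(X) - log gamma(X)] with X ~ pi."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_p, s_p)    # sample from pi
        total += normal_logpdf(x, mu_p, s_p) - normal_logpdf(x, mu_q, s_q)
    return total / n
```

Evaluating both orderings of the arguments makes the asymmetry d_KL(π‖γ) ≠ d_KL(γ‖π) visible numerically.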
2.1 Importance Sampling
Suppose that there exists a random variable X : Ω 7→ R, and one wishes to compute
the expected value of X,
µ ≡ E[X] = ∫_Ω X(ω) π(dω),  (2.4)
where π is the target distribution. In many cases, computing µ exactly is intractable
since it is necessary to integrate over the entire state space Ω. For example, the
problem of approximating the permanent of a matrix [52] can be represented as
perm(A) =1
n!E
[n∏i=1
ai,X(i)
], (2.5)
where X is a random permutation. Computing the permanent is a computationally
difficulty and is known to be #P-complete [98]. #P-complete comprises a set of
counting problems with no known polynomial-time algorithms, and can be thought
of as the counting analog of NP-complete problems. One way to estimate such prob-
lems is through crude Monte Carlo simulations, drawing N independent, identically
distributed (iid) samples x_1, . . . , x_N, then computing x̄_N. The estimator X̄_N may, however, have unacceptably large variance, exponentially high in the case of approximating the permanent.
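A crude Monte Carlo estimator of the permanent follows directly from this representation: average ∏_i a_{i,σ(i)} over uniformly random permutations σ and scale by n!. A minimal sketch on a toy matrix (an illustrative example only; for realistic matrices the variance of this estimator is exponentially large, which is what motivates importance sampling below):

```python
import math
import random

def perm_mc(A, n_samples=50_000, seed=1):
    """Crude Monte Carlo estimate of perm(A) = n! * E[prod_i a_{i, X(i)}],
    with X a uniformly random permutation."""
    n = len(A)
    rng = random.Random(seed)
    idx = list(range(n))
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(idx)            # uniform random permutation
        prod = 1.0
        for i in range(n):
            prod *= A[i][idx[i]]
        total += prod
    return math.factorial(n) * total / n_samples

# For the 2x2 matrix [[1, 2], [3, 4]], perm = 1*4 + 2*3 = 10.
```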
One practical variance reduction technique is known as importance sampling (IS).
Importance sampling builds an estimator by sampling from a biased distribution in
which the “important” or more heavily weighted states are visited more frequently.
Helpful background references for importance sampling include Evans and Swartz [35],
Asmussen and Glynn [6], Liu [68], and Robert and Casella [85]. Suppose there exist
random variables Y, Z, such that Y(ω) = X(ω) and Z(ω) = X(ω) π(dω)/γ(dω). Importance sampling is based on the following simple identity:

E[X] = ∫_Ω X(ω) (π(dω)/γ(dω)) γ(dω) = E[Z].  (2.6)

The importance sampling identity (2.6) holds as long as Z is well defined, i.e. γ(dω) = 0 =⇒ π(dω) = 0 for ω ∈ Ω. Here γ is the importance distribution, and the ratio W(ω) ≡ π(dω)/γ(dω) is the importance weight of ω. One can draw N iid samples of Z and use µ̂_IS ≡ Z̄_N as the unbiased importance estimator of µ. If var[Z] < var[X], Z̄_N will be a better estimate of µ than X̄_N. A primary challenge in importance sampling is choosing an importance distribution γ that minimizes var[Z].
Practically, it is often the case that one knows π and/or γ only up to a constant factor: π(dω) = f(dω)/C_X and γ(dω) = g(dω)/C_Y, where C_X and C_Y are the normalizing constants of f and g. The ratio of the normalizing constants is denoted C ≡ C_X/C_Y, with the random variables W̃(ω) ≡ f(dω)/g(dω) and Z̃(ω) ≡ Y(ω)W̃(ω) = C Z(ω).

Since E[W] = 1, one can build an unbiased estimator of the ratio C through draws of the unnormalized importance ratios, via their sample mean W̃_N. Using the same samples ω_1, . . . , ω_N to compute W̃_N as for Z̃_N leads to the biased importance estimator

µ̂_BIS ≡ Z̃_N(ω_1, . . . , ω_N) / W̃_N(ω_1, . . . , ω_N).  (2.7)

Although biased for finite sample sizes, µ̂_BIS is asymptotically unbiased and does not require normalizing constants.
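A minimal sketch of the ratio estimator (2.7), with illustrative choices of target and proposal: the target π = N(2, 1) is represented only by an unnormalized density f, the proposal is N(0, 2) with unnormalized density g, and only the ratio f/g ever enters the computation.

```python
import math
import random

def snis_mean(n=200_000, seed=0):
    """Self-normalized importance sampling estimate of E_pi[X] for
    pi = N(2, 1), using only unnormalized densities."""
    rng = random.Random(seed)
    f = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)   # unnormalized target density
    g = lambda x: math.exp(-x * x / 8.0)            # unnormalized N(0, 2) proposal density
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)     # sample the proposal
        w = f(x) / g(x)             # unnormalized importance weight
        num += w * x
        den += w
    return num / den                # ratio estimator, cf. (2.7)
```

The dropped constants (here the two Gaussian normalizers) cancel in the ratio, which is exactly why µ̂_BIS is useful when normalizing constants are unknown.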
The optimal sampling distribution γ* is one for which E[Z²] is smallest. As per Rubinstein and Kroese [89], E[Z²] is minimized with γ*(dω) ∝ |X(ω)|π(dω). This gives

E[Z*²] = ∫_Ω Z²(ω) γ*(dω)  (2.8)
       = (∫_Ω |X(ω)| π(dω))² = E[|X|]².  (2.9)

In particular, if X ≥ 0, γ* provides a zero-variance estimator, since then E[|X|]² = µ². Direct computation of the optimal sampling distribution is typically impossible, since it relies on ∫_Ω |X(ω)|π(dω) as a normalizing constant, which is the quantity to be estimated in the first place. However, it is often helpful to use γ* as a guide to construct “good” importance sampling distributions.
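As an illustration of using γ* as a guide, consider estimating the small tail probability P(X > 4) for X ∼ N(0, 1) (an illustrative example): crude Monte Carlo needs on the order of 1/P(X > 4) ≈ 30,000 draws to see even one success, whereas a proposal shifted into the region where the integrand is nonzero concentrates samples where they matter.

```python
import math
import random

def is_tail_prob(t=4.0, n=200_000, seed=0):
    """Importance sampling estimate of P(X > t) for X ~ N(0, 1), using the
    shifted proposal Y ~ N(t, 1) so that samples land in the tail."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = rng.gauss(t, 1.0)       # draw from the biased distribution
        if y > t:
            # importance weight phi(y) / phi(y - t) = exp(t^2/2 - t*y)
            total += math.exp(t * t / 2.0 - t * y)
    return total / n

estimate = is_tail_prob()
```

The proposal is not the exact γ* ∝ 1{x > t}φ(x), but it mimics its shape, and the resulting estimator has a relative error of well under one percent at this sample size.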
2.1.1 Effective Sample Size
An important concept when using importance sampling is degeneracy. Degeneracy
occurs when the bulk of the importance weight mass is concentrated in one or a small
number of importance samples. This means that the Monte Carlo approximation
will be dominated by a small subset of samples, so the “effective” number of samples
that we are using to compute the approximation is small. A standard measure of
degeneracy for importance sampling methods is therefore the effective sample size
[61],
ESS = N/(1 + var[W]) = N/E[W²],  (2.10)

where the second equality follows since E[W] = 1 and var[W] = E[W²] − 1; equivalently, 1 + var[W] = 1 + cv², where cv refers to the coefficient of variation of the importance weights (cv = √(var[W])/E[W]). One justification for the use of the effective
sample size comes from Liu [67], based on a note from Kong [61], using the delta
method, and states that
var[X̄_N] / var[µ̂_BIS,N] ≈ 1/(1 + var[W]).  (2.11)
This goes as follows. First note that using the standard delta method for ratio statistics [21] gives

var[µ̂_BIS,N] ≈ (1/N)(var[Z] + µ² var[W] − 2µ cov(Z, W)).  (2.12)

Writing W = π(X)/γ(X) and taking the expectations and covariances in (2.13)–(2.16) under π (so that E_π[W] = E_γ[W²] = 1 + var[W]), further note that

cov(Z, W) = E_π[X π(X)/γ(X)] − µ  (2.13)
          = cov_π(π(X)/γ(X), X) + µ E_π[π(X)/γ(X)] − µ,  (2.14)

and that

var[Z] = E_π[X² π(X)/γ(X)] − µ²  (2.15)
       ≈ E_π[X]² E_π[π(X)/γ(X)] + var(X) E_π[π(X)/γ(X)] + 2µ cov_π(π(X)/γ(X), X) − µ².  (2.16)

Applying this to (2.12) gives

var[µ̂_BIS,N] ≈ (1/N) var(X)(1 + var(W)),  (2.17)

which yields (2.11). Note that the remainder term in (2.16) is

E_π[(π(X)/γ(X) − E_π[π(X)/γ(X)])(X − µ)²],  (2.18)

which can be large depending on the distribution of X.
Typically in practice one knows neither the true variance of the importance weights
nor the normalizing constants, so it is common to use the empirical unnormalized
effective sample size,
ESS = (∑_{i=1}^N W̃(ω_i))² / ∑_{i=1}^N W̃(ω_i)²  (2.19)
as a heuristic measure of degeneracy.
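In practice (2.19) is a one-line computation from the unnormalized weights; a minimal sketch:

```python
def ess(weights):
    """Empirical effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

# Equal weights give ESS = N (no degeneracy); a single dominant
# weight drives ESS toward 1.
```

The quantity is invariant to rescaling the weights, so no normalizing constants are needed, matching the heuristic use described above.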
2.1.2 Sampling Importance Resampling
Sampling importance resampling (SIR) is a technique for approximate sampling based
on importance weights. Although not widely used in general approximate sampling
settings, where Markov chain Monte Carlo methods are often more easily implemented and more efficient, sampling importance resampling has found a niche in sequential
settings where Markov chain based approximate sampling methods carry a heavier
computational burden. Sequential importance sampling is discussed further in §3.2.
The goal of resampling in the sequential setting is generally to reduce degeneracy. The
idea for SIR came originally from Rubin [87], and was introduced in the sequential
context by Gordon et al. [42].
SIR uses samples drawn from an importance distribution γ to approximately draw
samples from the target distribution π. The procedure to draw M importance sam-
ples is as follows. Given N samples y(1), . . . , y(N) drawn according to γ, compute
the importance weights w(i) for each sample. Choose an index i with probability
proportional to w(i) and assign x = y(i). This process is repeated M times (with
replacement) to generate a collection of approximate samples, {x^(i)}_{i=1}^M. The quality of the approximate samples depends on the sample size N and how closely γ
matches π. Assuming mild regularity conditions on the importance distribution γ,
asymptotic convergence to the target distribution can be shown (see Asmussen and Glynn [6], p. 387). Since only relative importance weights are needed for resampling,
normalizing constants are not needed.
One possible use of the collection of approximate samples {x^(i)}_{i=1}^N is to construct
Algorithm 1 Sampling Importance Resampling (SIR)
Draw N samples y^(1), . . . , y^(N) ∼ P_Y.
Compute importance weights {w^(i)}_{i=1}^N.
Draw M samples x^(1), . . . , x^(M) with replacement from {y^(i)}_{i=1}^N with probabilities proportional to {w^(i)}_{i=1}^N.
an estimator of E[X]
µ̂_SIR = N^{−1} ∑_{i=1}^N x^(i).  (2.20)
However, this estimator is not very useful in practice, as µ̂_SIR will always have higher variance than µ̂_IS due to the additional variance from multinomial sampling. A
better resampling method that can reduce the multinomial noise, known as residual resampling, introduced by Liu and Chen [69], is as follows. First, normalize the importance weights to sum to one: w̄^(i) = w^(i)/∑_{j=1}^N w^(j). Then take ⌊M w̄^(j)⌋ copies of each sample j, for a total of k = ∑_{j=1}^N ⌊M w̄^(j)⌋ samples, set w^(j)′ ← M w̄^(j) − ⌊M w̄^(j)⌋, renormalize these residual weights, and take the remaining M − k samples with replacement from the resulting multinomial distribution. This procedure makes the same expected number of copies of each sample, but if k is large it can greatly reduce the multinomial noise.
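The two procedures above can be sketched as follows (an illustrative implementation; function and variable names are my own):

```python
import random

def sir(samples, weights, m, rng):
    """Basic SIR: draw m samples with replacement, probability proportional to weight."""
    idx = rng.choices(range(len(samples)), weights=weights, k=m)
    return [samples[i] for i in idx]

def residual_resample(samples, weights, m, rng):
    """Residual resampling (after Liu and Chen): take floor(m * wbar_j)
    deterministic copies of sample j, then draw the remaining m - k
    samples multinomially from the residual weights."""
    total = sum(weights)
    wbar = [w / total for w in weights]
    out, residual = [], []
    for x, w in zip(samples, wbar):
        c = int(m * w)              # floor(m * wbar_j), since m * w >= 0
        out.extend([x] * c)
        residual.append(m * w - c)  # leftover (residual) weight mass
    if len(out) < m:
        out.extend(sir(samples, residual, m - len(out), rng))
    return out
```

When the weights are nearly uniform, most copies are produced deterministically and the multinomial step touches only the small residual mass, which is the source of the variance reduction.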
Another variation of SIR introduced by Skare et al. [94] uses modified importance
weights. Instead of choosing samples with probability proportional to w̄^(i), one can use weights proportional to w̄^(i)/(1 − w̄^(i)). Using these weights with M fixed and letting N → ∞, Skare et al. [94] were able to show pointwise convergence of the distribution of the resampled draws to the target at rate O(N^{−1}) when sampling with replacement and O(N^{−2}) when sampling without replacement. The idea of sampling without replacement for SIR when M ≪ N originally comes from Gelman [38], and intuitively can be thought of as producing an
“intermediate representation” between the sampling and target distributions.
2.2 Markov Chain Monte Carlo
Markov chain Monte Carlo is among the most widely used methods for difficult ap-
proximate sampling and estimation problems. This is due to the fact that, for a wide
variety of distributions, it is often easy to design an ergodic Markov chain that holds
a target distribution π as its stationary distribution [4], which can then be used to
draw approximate samples from π and to construct Monte Carlo estimators. Tech-
niques for designing such Markov chains include the Metropolis-Hastings algorithm,
data augmentation, the Gibbs sampler, and the hit-and-run algorithm. For further
background on Markov chain Monte Carlo methods, refer to Liu [68], Asmussen and
Glynn [6], and Robert and Casella [84].
2.2.1 Markov Chains
A discrete-time stochastic process X_{1:t} is a Markov process if the distribution of X_t
given the most recent state x_{t−1} is independent of all the previous states, or

    P(X_t | x_{1:t−1}) = P(X_t | x_{t−1}).    (2.21)

This property is referred to as memorylessness. If the possible
values for each X_t form a countable space, then the process is known as a
Markov chain. Markov chains are defined by a kernel K(v, v′) specifying the relative
probability that X_{t+1} = v′ given that x_t = v. One way to think of a Markov chain
is as a random walk on a weighted directed graph G(V, E) with non-negative edge
weights. Possible states are represented by vertices, and the next state x_{t+1} given
the current x_t is chosen with probability proportional to edge weights. The expected
return time of a state v ∈ V is the expected number of steps for the Markov chain
starting in state v to return to v. If a state has a finite expected return time, it
is positive recurrent. The periodicity of a state v is the greatest common divisor
amongst all possible return times; if the periodicity of v is 1 then v is said to be
aperiodic. If all states v ∈ V are positive recurrent and aperiodic, then the Markov
chain is said to be ergodic, and admits a unique stationary distribution π, such that
K^k(v, v′) → π(v′) as k → ∞.
2.2.2 Metropolis Hastings
The Metropolis-Hastings algorithm [73, 46] was one of the first Markov chain Monte
Carlo methods proposed and used in practice, and is still one of the most commonly
used forms. Its popularity can be attributed to the simplicity of designing efficient
mechanisms to approximately sample from an arbitrary distribution π. The two
main requirements for the algorithm are that the relative probability π(v′)/π(v) of any
two states v, v′ under π can be computed, and that an ergodic proposal
Markov chain on the state space with transition kernel K can be sampled from. No
normalizing constants are required, and the algorithm works extremely well for many
applications [27]. The procedure is as follows. Starting from a state v, take a step in
the proposal chain K to state v′, and compute the acceptance ratio

    a = [π(v′) K(v′, v)] / [π(v) K(v, v′)].    (2.22)

If a ≥ 1, move to state v′; otherwise draw a uniform random variable u on [0, 1] and
move to state v′ if a > u, else stay at v.
One potential difficulty with this algorithm is the choice of proposal kernel
K, which should ideally be chosen such that K(v, v′) is approximately π(v′); this
can sometimes be difficult in practice. An inappropriate choice of K can give the
Metropolis chain an extremely low rate of convergence. Another potential
issue is that the Metropolis chain may not be ergodic even if the proposal chain
is. However, for many important problems these difficulties can be overcome, and
Metropolis-Hastings has proved extremely useful in a wide variety of contexts;
it has been named one of the most important algorithms of the 20th century [10].
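As a concrete illustration of the procedure above, the following sketch runs random-walk Metropolis on the cycle {0, . . . , n−1} for an unnormalized target π; because the proposal is symmetric, the kernel ratio K(v′, v)/K(v, v′) in the acceptance ratio cancels. All names here are illustrative:

```python
import random

def metropolis_hastings(pi, steps, seed=0):
    """Random-walk Metropolis on {0,...,n-1} (as a cycle) targeting the
    unnormalized distribution pi; returns visit counts per state."""
    rng = random.Random(seed)
    n = len(pi)
    v = 0
    visits = [0] * n
    for _ in range(steps):
        vp = (v + rng.choice([-1, 1])) % n   # symmetric proposal: K(v,v') = K(v',v)
        a = pi[vp] / pi[v]                   # acceptance ratio; kernel terms cancel
        if a >= 1 or rng.random() < a:
            v = vp                           # accept the move
        visits[v] += 1
    return visits

visits = metropolis_hastings([1.0, 2.0, 3.0, 2.0], 200_000)
```

The visit frequencies approach the normalized target, here (1/8, 2/8, 3/8, 2/8), even though the normalizing constant is never computed.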
2.2.3 Gibbs Sampler
The Gibbs sampler (introduced by Geman and Geman [39], see Liu [68, chapter 6] for
a good introduction) is another fundamental tool used for designing ergodic Markov
chains. Suppose that a random variable X takes values in state space Ω with probability
distribution π, and that each x ∈ Ω can be decomposed into coordinates as x = (x_1, . . . , x_p). The
Gibbs sampler is as follows. Starting from an initial point x, cycle through coordinate
indices j = 1, . . . , p, for each j sampling x_j according to the conditional distribution

    x_j ∼ π(x_j | x_{[−j]}),    (2.23)

where the notation x_{[−j]} indicates all coordinates of x except the jth, or

    x_{[−j]} := (x_1, . . . , x_{j−1}, x_{j+1}, . . . , x_p).    (2.24)
The method of cycling through coordinate indices is a design choice. If j is chosen
uniformly at random at each step, the procedure is known as random-scan Gibbs
sampling (summarized below in Algorithm 2); if instead one cycles through coordi-
nate indices in a predetermined order, it is called systematic-scan Gibbs sampling.
Denote the Markov transition kernel for random-scan Gibbs sampling as KGibbs, with
KGibbs(x, y) giving the probability density for the next state y conditioned on the
current state x. Under mild conditions the Gibbs sampler will be positive recurrent
and aperiodic (ergodic), and will therefore admit π as a stationary distribution. It is
easy to see that sampling from the conditional distribution (2.23) leaves π invariant,
so assuming ergodicity the Gibbs sampler has π as a stationary distribution.

Algorithm 2 Random-scan Gibbs sampler.
  Start from initial point x ← (x_1, . . . , x_p).
  while (not converged) do
    Pick an index j at random.
    Sample x_j ∼ π(x_j | x_{[−j]}).
  end while
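The random-scan variant in Algorithm 2 can be sketched for a toy joint distribution on {0, 1}^2, where the conditionals π(x_j | x_{[−j]}) are read off a probability table (the function and table are illustrative, not from the text):

```python
import random

def random_scan_gibbs(p, steps, seed=0):
    """Random-scan Gibbs on {0,1}^2 with joint probabilities p[x1][x2];
    returns visit counts for each joint state."""
    rng = random.Random(seed)
    x = [0, 0]
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for _ in range(steps):
        j = rng.randrange(2)                       # pick a coordinate at random
        if j == 0:                                 # pi(x_0 | x_1), up to a constant
            w = [p[0][x[1]], p[1][x[1]]]
        else:                                      # pi(x_1 | x_0), up to a constant
            w = [p[x[0]][0], p[x[0]][1]]
        x[j] = rng.choices([0, 1], weights=w)[0]   # sample the conditional
        counts[tuple(x)] += 1
    return counts

counts = random_scan_gibbs([[0.1, 0.2], [0.3, 0.4]], 100_000)
```

Only conditional (indeed, unnormalized) probabilities are ever used, yet the visit frequencies converge to the joint distribution.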
One is not restricted to sampling only from the single-variable conditional distri-
butions in (2.23). If one chooses to sample from the joint conditional distribution for
multiple coordinate indices j_1, . . . , j_m at once,

    (x_{j_1}, . . . , x_{j_m}) ∼ π(x_{j_1}, . . . , x_{j_m} | x_{[−j_1,...,−j_m]}),    (2.25)

it is known as grouped Gibbs sampling. In grouped Gibbs, instead of sampling from
the line defined by x_{[−j]} as in standard Gibbs, one samples from the hyperplane defined
by x_{[−j_1,...,−j_m]} for some set of coordinate indices j_1, . . . , j_m. The set of coordinate
indices may be chosen deterministically or according to a randomized scheme. Grouped
Gibbs typically results in a faster mixing rate relative to
standard Gibbs; see Liu [68] for details.
2.2.4 Data Augmentation
The data augmentation algorithm, proposed by Tanner and Wong [96], can be thought
of as a special case of the Gibbs sampler on a two-variable space X = (X_1, X_2), where
one is primarily interested in sampling from the first sub-variable X_1; X_2 is called
the auxiliary variable. As in the standard Gibbs sampler, we draw samples of x_1
according to π(x_1|x_2), then draw samples of x_2 according to π(x_2|x_1), and repeat
until satisfied. This yields joint samples x approximately distributed according to
π(x), and one may then “discard” the auxiliary variable to get samples x_1 ∼ π(x_1).
The “art” of data augmentation [99] is in the choice of auxiliary variable x_2, which
should ideally be chosen such that the conditional distributions π(x_1|x_2) and π(x_2|x_1) are
easily computed, and such that the resulting Markov chain is rapidly mixing.
2.2.5 Hit-and-Run
A generalized form of grouped Gibbs is known as the hit-and-run algorithm [3]: from
the current state x, one randomly chooses a subset L_k ⊂ Ω with probability w_x(L_k),
then samples the next state y according to the transition kernel K_k(x, y). The
intuition behind hit-and-run is that it is not necessary to restrict oneself to sampling
from hyperplanes of the state space; rather, one can choose arbitrary subsets L to
sample from. In order to ensure convergence to the proper stationary distribution π,
it is necessary to choose L, w, and K such that K_k(x, y) has stationary distribution
proportional to w_x(k)π(x). Note also that ergodicity of the hit-and-run Markov chain
depends on the choices of L, K, and w.
Hit-and-run requirements: choose L, w, and K such that for each subset index k, K_k(x, y) has stationary distribution proportional to w_x(k)π(x).

For each x ∈ Ω and coordinate index j, define subset indices k_{x,j} such that k_{x,j} =
k_{y,j} if and only if x_{[−j]} = y_{[−j]}. If one chooses subsets L and weights w such that

    L_{k_{x,j}} = {y : y_{[−j]} = x_{[−j]}},    (2.26)

with w_x(L_{k_{x,j}}) = 1/n and K_{k_{x,j}}(x, y) proportional to π(y) 1{y ∈ L_{k_{x,j}}}, then this is equiva-
lent to single-variable random-scan Gibbs. This choice satisfies the hit-and-run
requirements, since for each x ∈ L_k the probability of choosing L_k is 1/n. Hit-and-
run also generalizes several other popular Markov chain Monte Carlo algorithms,
including Swendsen-Wang, data augmentation, and slice sampling. See Andersen
and Diaconis [3] for more details.
Chapter 3
Sequential Monte Carlo
In many important cases, one would like to analyze and develop models for data
that are ordered or sequential in some natural way. This situation occurs in many
examples such as data where there is a time component, i.e. time-series data, but also
situations such as sampling from the space of self-avoiding walks [44, 86], contingency
tables [23], and graphs with a prescribed degree sequence [14, 9]. For inference in these
models, sequential Monte Carlo [31] methods have been developed. Sequential Monte
Carlo methods include the class of algorithms known as particle filters, which were
introduced in their current form by Gordon et al. [42] as the bootstrap filter. Particle
filters are iterative, consisting of two basic elements: sequential importance sampling
(SIS) and sampling importance resampling (SIR).
3.1 Sequential Models
First, some notation will be introduced. For each sample ω, there is a function
X : Ω → Ω^T, where Ω is the state space and T is a positive integer. Under this
formulation, X is a discrete-time stochastic process. Typical examples include cases
where the state space Ω is finite, N^n, or R^n, but more complicated examples can be
considered as well. A sample x can be written as an array, X(ω) = {X_t(ω)}_{t∈{1,...,T}}.
As a shorthand, the ω is dropped and X_{1:T} is referred to as a random variable with
sub-indices X_1, . . . , X_T.
Figure 3.1: Dependence structure of hidden Markov models
In a sequential model, it will be assumed that evaluating and generating samples
from the conditional probabilities P(xt|x1:t−1) for 1 ≤ t ≤ T is computationally
feasible. In a hidden Markov model, X is a Markov process, and there is another
coupled discrete-time stochastic process Y such that the conditional distribution of
Yt given xt is independent of the rest of the x and y variables, or
P(yt|x1:T , y1:t−1, yt+1:T ) = P(yt|xt). (3.1)
X is known as the hidden or latent process and Y is called the observation process.
A diagram showing the dependence structure of hidden Markov models is shown
in Figure 3.1. Practical applications of hidden Markov models typically involve
samples from the observation process Y, and it is desired to make inferences about
the latent process X. For example, one may wish to compute (and draw samples
from) P(X|y) or to compute the expected value of functions of X conditioned on y.
Another common problem is that the X and Y processes may be defined by some set
of parameters θ for which we would like to find maximum likelihood estimates.
In the case where the hidden process X is governed by an affine Gaussian process
and the observation model of Y_t conditioned on the hidden state x_t is also affine Gaus-
sian, one can use the Kalman filter [55] to efficiently make direct inferences about
(and sample from) X conditioned on y. The Kalman filter is an iterative procedure
that first computes E[X_t | x_{t−1}], then updates based on the current observation y_t to get
E[X_t | x_{t−1}, y_t]. This is repeated for t = 1, . . . , T. Due to the Gaussian nature of the
processes, knowing E[X_t | x_{t−1}, y_t] for each t is sufficient to compute and sample ac-
cording to the full conditional distribution P(x_{1:T} | y_{1:T}). Estimates for the respective
covariance matrices can also be determined iteratively.
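For a scalar model x_t = a x_{t−1} + w_t, y_t = x_t + v_t with Gaussian noise, the predict/update recursion just described can be sketched as follows (a minimal illustration with invented names; real applications use the full matrix form):

```python
def kalman_filter_1d(ys, a, q, r, m0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = a*x_{t-1} + N(0,q), y_t = x_t + N(0,r).
    Returns the filtered means E[X_t | y_{1:t}]."""
    m, p = m0, p0
    means = []
    for y in ys:
        m_pred = a * m                        # predict step: E[X_t | y_{1:t-1}]
        p_pred = a * a * p + q                # predictive variance
        k_gain = p_pred / (p_pred + r)        # Kalman gain
        m = m_pred + k_gain * (y - m_pred)    # update with observation y_t
        p = (1.0 - k_gain) * p_pred           # posterior variance
        means.append(m)
    return means

means = kalman_filter_1d([1.0, 1.0, 1.0], a=1.0, q=0.01, r=1.0)
```

Starting from a prior mean of 0, repeated observations at 1.0 pull the filtered mean monotonically toward 1, with the gain shrinking as the posterior variance contracts.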
The efficiency of the Kalman filter makes it useful in real-world applications where
Gaussian models are appropriate. In cases where the underlying hidden and obser-
vation processes are non-linear, versions of the Kalman filter such as the extended
Kalman filter [51] or the unscented Kalman filter [54, 103] may be useful. However,
these methods may not be applicable in cases where the underlying state-space and
observation processes are highly non-linear, or take values in non-Euclidean state
spaces such as graphs or other combinatorial objects. In these cases, particle filtering
(§3.3) offers an attractive alternative. See chapter 6 for an application of the Kalman
and particle filters to multi-object tracking.
3.2 Sequential Importance Sampling
Suppose X and Y are the latent and observation processes of a hidden Markov model.
If it is known how to sample Xt according to the law
P(xt|y1:T , x1:t−1) = P(xt|yt:T , xt−1), (3.2)
where the equality is due to the independence properties of X and Y, one can sample
directly from X|y sequentially. Note, however, that this distribution is conditioned
on all future observations y_{t:T} for each x_t. For non-Gaussian processes, sam-
pling from this optimal importance distribution is usually impractical, so instead
one can sample according to some other distribution γ and use importance sam-
pling. Denote the tth contribution to the sequential target distribution as π_t(x) ≡
P(x_t, y_t | x_{t−1}) / P(y_t | y_{1:t−1}), and the target distribution of the first t states given the
first t observations as π_{1:t}(x) ≡ P(x_{1:t} | y_{1:t}). π_{1:t} can be built sequentially as follows:

    π_{1:t}(x) = P(y_t | x_{1:t}, y_{1:t−1}) P(x_{1:t} | y_{1:t−1}) / P(y_t | y_{1:t−1})    (3.3)
             = [P(y_t | x_t) P(x_t | x_{t−1}) / P(y_t | y_{1:t−1})] P(x_{1:t−1} | y_{1:t−1})    (3.4)
             = [P(x_t, y_t | x_{t−1}) / P(y_t | y_{1:t−1})] π_{1:t−1}(x)    (3.5)
             = π_t(x) π_{1:t−1}(x).    (3.6)
A sequential importance distribution γ_{1:t}(x) is chosen to be defined sequentially,
such that there exist functions γ_t(x) with

    γ_{1:t}(x) = ∏_{s=1}^{t} γ_s(x).    (3.7)

The tth sequential contribution to the importance weight is W_t(x) = π_t(x)/γ_t(x),
and the importance weight after t time-steps is

    W_{1:t}(x) = ∏_{s=1}^{t} W_s(x).    (3.8)
Remarks:

• The denominator of π_t, P(y_t | y_{1:t−1}), is independent of x and is not required for
approximate sampling. Since it may be difficult to compute, the unnormalized
value is often used,

    π̃_{1:t}(x) = P(x_t, y_t | x_{t−1}) π̃_{1:t−1}(x).    (3.9)

π̃_{1:t}(x) has P(y_{1:t}) as its normalizing constant. One can similarly use an unnor-
malized importance distribution γ̃ and importance weight W̃.
• π_t(x) ∝ P(x_t | y_t, x_{t−1}) P(y_t | x_{t−1}), so one sensible possibility is to choose a se-
quential importance distribution γ_t(x) proportional to P(x_t | y_t, x_{t−1}). Note that
this is the optimal (zero variance) importance distribution for X_t given y_t and
x_{t−1}. However, x_t chosen in this manner is no longer optimal at time-step t + 1,
since π_{t+1} depends on x_t through P(y_{t+1} | x_t). For this reason P(x_t | y_t, x_{t−1}) is
called the locally optimal choice of importance distribution. The sequential con-
tribution to the relative importance weights in this case is W_t(x) ∝ P(y_t | x_{t−1}).
3.3 Particle Filter
One serious issue with sequential importance sampling is that the importance weights
can become degenerate after a small number of steps, with most of the importance
weight mass concentrated on a small subset of samples, even if using the locally opti-
mal importance distribution. To address this issue, Gordon et al. [42] suggest the use
of sampling importance resampling (SIR) in conjunction with sequential importance
sampling (SIS) to create what is now referred to as sequential Monte Carlo. When
the underlying model is hidden Markov, then this is known as the particle filter. A
standard reference for particle filtering is Doucet and De Freitas [31].
Algorithm 3 Particle filter approximate sampling.
  Initialize each x_0^{(i)} ∼ π_0 for i = 1, . . . , N.
  for t ∈ 1, . . . , T do
    Draw each x_t^{(i)} ∼ γ_t(· | x_{t−1}^{(i)}).
    Update the sequential importance weights W_{1:t}^{(i)} according to (3.8).
    Compute the effective sample size ESS according to (2.19).
    If ESS falls below a threshold, resample each x^{(i)} with probability proportional to W_t^{(i)}.
  end for
To see anecdotally why sequential importance sampling results in degenerate
sample sets, consider the following case. Suppose that the sequential importance
weight W_t(X) is identically and independently distributed for each t.
This would occur, for example, if γ is an affine Gaussian process. Then as t → ∞,
log W_{1:t}(x) / σ(log W_{1:t}(x)) → N(0, 1). Therefore, in the limit W_{1:t} is distributed
according to a lognormal distribution, which is heavy-tailed. This implies that, gen-
erally speaking, we would expect a small number of samples to have very high importance
weights relative to the other samples. Low importance weight samples are also less
likely to have high importance weights in the future.
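This weight collapse is easy to demonstrate numerically: if each particle's log-weight is a sum of t i.i.d. terms, the effective sample size (2.19) of the population decays rapidly in t (a toy illustration; the function name is my own):

```python
import math
import random

def ess_after_t_steps(t, n, seed=0):
    """ESS of n particles whose log-weights are sums of t iid N(0,1) terms."""
    rng = random.Random(seed)
    log_ws = [sum(rng.gauss(0.0, 1.0) for _ in range(t)) for _ in range(n)]
    m = max(log_ws)                          # subtract the max for stability
    ws = [math.exp(lw - m) for lw in log_ws]
    s = sum(ws)
    ws = [w / s for w in ws]                 # normalized importance weights
    return 1.0 / sum(w * w for w in ws)      # ESS = 1 / sum_i w_i^2
```

With n = 1000 particles the ESS is a few hundred after one step but collapses to a handful after fifty, even though no data is involved: the lognormal tails alone concentrate the weight mass on a few samples.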
To address this issue, at each step one can perform sampling importance re-
sampling. If SIR is applied at time-step t, this will give an approximate sample
from π1:t(x). Note that for t < T this is not an approximate sample from the
true target distribution π1:T (x1:t) which incorporates all future observations. Instead
the algorithm approximately samples from a sequence of intermediate distributions,
π1, π1:2, . . . , π1:T . In this manner sequential Monte Carlo has parallels to simulated
annealing, which also uses a sequence of approximate samples from intermediate dis-
tributions to sample from a target distribution (see §4.3.1 for more).
Applying SIR has the effect of “pruning” the sample space, correcting previous
“mistakes” and ensuring that the algorithm does not waste time on low probability
paths. Since each resampling adds multinomial noise and requires computational effort,
it is desirable to apply resampling only when there is a clear benefit. For this reason,
it is common practice to set a threshold and resample only when the empirical
effective sample size ESS is below this threshold.
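Combining the SIS step with ESS-triggered resampling gives the bootstrap filter of Gordon et al. [42], sketched here for a toy nonlinear scalar model (the model, names, and N/2 threshold are illustrative choices, not prescribed by the text):

```python
import math
import random

def bootstrap_particle_filter(ys, n, sigma_x, sigma_y, seed=0):
    """Bootstrap filter for x_t = sin(x_{t-1}) + N(0, sigma_x^2),
    y_t = x_t + N(0, sigma_y^2); returns filtered posterior means."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ws = [1.0 / n] * n
    means = []
    for y in ys:
        # SIS step: propose from the transition prior, so W_t = P(y_t | x_t)
        xs = [math.sin(x) + rng.gauss(0.0, sigma_x) for x in xs]
        ws = [w * math.exp(-0.5 * ((y - x) / sigma_y) ** 2) for w, x in zip(ws, xs)]
        s = sum(ws)
        ws = [w / s for w in ws]
        means.append(sum(w * x for w, x in zip(ws, xs)))
        # SIR step: resample only when the effective sample size is low
        ess = 1.0 / sum(w * w for w in ws)
        if ess < n / 2:
            xs = rng.choices(xs, weights=ws, k=n)
            ws = [1.0 / n] * n
    return means

means = bootstrap_particle_filter([0.8] * 5, 500, sigma_x=0.5, sigma_y=0.5)
```

Note how the weights are reset to uniform after each resampling: the resampled population itself now carries the information the weights held.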
Chapter 4
Adaptive Importance Sampling
4.1 Background
Adaptive importance sampling (AIS) takes an indirect approach to building low vari-
ance estimators of E[X]. The goal of adaptive importance sampling is to find the best
importance distribution g∗ within a restricted family of distributions G(·; v) with pa-
rameters v whose likelihood functions are known or relatively easy to calculate. The
idea is to iteratively build distributions within G based on a population of previously
generated importance samples. A generic implementation of adaptive importance
sampling is given in Algorithm 4. There are several different possible choices of how
to update the distribution parameter v based on the population of importance sam-
ples, including variance minimization [88] and the cross-entropy method [89]. As in
the sequential setting, degeneracy is an important issue, and sampling importance
resampling (see §2.1.2) may be used [20] in a similar context as in the particle filter
algorithm.
24
Algorithm 4 Adaptive importance sampling (AIS) algorithm
  Generate (x_0^{(i)})_{1≤i≤N} ∼ g_0.
  Compute W_0^{(i)} = f(x_0^{(i)}) / g_0(x_0^{(i)}).
  Regenerate (x_0^{(i)})_{1≤i≤N} by resampling (x_0^{(i)})_{1≤i≤N} based on (W_0^{(i)}).
  k ← 0
  while (not converged) do
    k ← k + 1
    Update g_k based on {x_{k−1}^{(i)}}_{i≤N}, restricted to G.
    Generate (x_k^{(i)})_{1≤i≤N} ∼ g_k.
    Compute W_k^{(i)} = f(x_k^{(i)}) / g_k(x_k^{(i)}).
    Regenerate (x_k^{(i)})_{1≤i≤N} by resampling (x_k^{(i)})_{1≤i≤N} based on (W_k^{(i)}).
  end while
One of the primary challenges in designing an effective adaptive importance sam-
pling algorithm is in choosing the parametric family of distributions G. The choice
of distribution family affects the quality of the estimator and efficiency of the algo-
rithm. See Rubinstein and Kroese [89] for examples of some commonly used proposal
distributions in a range of application settings.
Ideally, the family of proposal distributions G should have enough flexibility to
specify a distribution which frequently visits “important” states. In other
words, the best possible proposal distribution within G should be close to the optimal
importance distribution g∗. On the other hand, the family G should
be as simple as possible, i.e. have a relatively small parameter space and be easy to
fit to the data, both to avoid overfitting and so that its best member can be found with minimal
computational effort. Intuitively, the family of distributions should contain a good
“model” of the optimal distribution and easily incorporate some of the underlying
structure of the problem.
4.1.1 Variance Minimization
A primary design issue in adaptive importance sampling algorithms is the choice
of update rule for the importance distribution g_k. Given a sample x = {x_1, . . . , x_N},
one natural rule would be to take the next proposal distribution as one that mini-
mizes the sample variance of (2.6),

    g = argmin_g var_g[h(X) W(X)].    (4.1)

Since the expected value of h(X)f(X)/g(X) is constant for g with full support, this reduces
to minimizing its second moment,

    g = argmin_g E_g[h²(X) W²(X)].    (4.2)

This is the idea underlying the “variance minimization” (VM) procedure: at step
t, construct a new IS distribution g_VM^{(t)} by choosing a sampling distribution that
minimizes (4.2) over samples (x_{t−1}^{(i)})_{1≤i≤N}. Often there is no analytic solution to (4.2),
requiring numerical non-linear optimization at each step.
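For instance, with target f = N(0, 1), h(x) = 1{x > 2}, and proposal family g_v = N(v, 1), one VM step can be carried out by a crude grid search over v (a toy sketch; the names, grid, and example are mine, not from the text):

```python
import math
import random

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def vm_update(v_cur, n, rng):
    """One variance-minimization step: choose v minimizing the estimated
    second moment E[h(X)^2 W(X)^2] for h(x) = 1{x > 2}, f = N(0,1),
    g_v = N(v,1), using samples drawn from the current proposal g_{v_cur}."""
    xs = [rng.gauss(v_cur, 1.0) for _ in range(n)]
    def second_moment(v):
        # E_{g_v}[h^2 f^2 / g_v^2] rewritten as an expectation under g_{v_cur}
        return sum(
            (x > 2) * normal_pdf(x, 0.0) ** 2
            / (normal_pdf(x, v) * normal_pdf(x, v_cur))
            for x in xs) / n
    grid = [i / 10.0 for i in range(41)]     # crude search over v in [0, 4]
    return min(grid, key=second_moment)

v_new = vm_update(0.0, 20_000, random.Random(0))
```

The update shifts the proposal mean from 0 toward the rare region x > 2, where the variance-optimal member of this Gaussian family sits.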
4.1.2 Cross-Entropy Method
Instead of minimizing the variance at each step, one can choose a distribution g_CE
that minimizes the Kullback-Leibler cross-entropy between g_CE and the optimal
importance distribution g∗(x). In the cross-entropy method [89], a sequence
of distributions parameterized by v_1, v_2, . . . , v_k is iteratively built. At each step, the
Kullback-Leibler divergence (2.3) from g∗ to f(·; v), estimated from previously drawn samples, is
minimized. One principal advantage of this method is that minimizing cross-entropy
is often more computationally tractable than finding the minimum sample variance.
Recall that the Kullback-Leibler divergence from g to f can be stated as

    d_KL(g‖f) = E_g[log g(X)] − E_g[log f(X)].    (4.3)

In cross-entropy adaptive importance sampling, the optimal parameter set v∗ that
minimizes d_KL(g∗ ‖ f(·; v)) is defined as

    v∗ = argmax_v E_{g∗}[log f(X; v)],    (4.4)
since the first term in (4.3) is constant with respect to v. One can rewrite this as

    v∗ = argmax_v E_{g∗}[(W(X)/W(X)) log f(X; v)]    (4.5)
       = argmax_v E_W[(g∗(X)/W(X)) log f(X; v)]    (4.6)

for some reference distribution W(x). Since solving (4.6) directly is typically in-
tractable, Rubinstein and Kroese [89] suggest an iterative procedure, using the distri-
bution f(·; v_{t−1}) as the reference distribution when solving for v_t. One can estimate
the expectation in (4.6) via Monte Carlo simulation:

    E_{v_{t−1}}[(g∗(X)/f(X; v_{t−1})) log f(X; v)] ≈ (1/C_{g∗}) (1/N) ∑_{x_1,...,x_N ∼ f(·;v_{t−1})} W_{t−1}(x_i) log f(x_i; v),    (4.7)

where W_{t−1}(x) = |h(x)| f(x) / f(x; v_{t−1}) is the importance weight of x under v_{t−1}, and C_{g∗} is
the normalizing constant of g∗.
Rubinstein and Kroese [89] suggest approximating f(·; v∗) at itera-
tion t by maximizing over v in (4.7) with respect to the empirical distribution of f(·; v_{t−1}):

    v_t = argmax_v (1/N) ∑_{x_1,...,x_N ∼ f(·;v_{t−1})} W_{t−1}(x_i) log f(x_i; v).    (4.8)

Note that for the purpose of computing v_t, one can ignore the constant multiplier
C_{g∗}.
Maximizing the sum (4.8) is equivalent to finding the maximum likelihood estima-
tor under v with sample x_i replicated W_{t−1}(x_i) times. This leads to an interpretation
of the cross-entropy method as iterative weighted maximum likelihood estimation [40].
For many families of distributions, computing the maximum likelihood estimator is
fast, efficient, and well understood, so optimizing the sum (4.8) is often straightfor-
ward. For more on this connection between cross-entropy and maximum likelihood,
see Asmussen and Glynn [6].
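To make the weighted-MLE view concrete: for the Gaussian family G = {N(v, 1)}, the maximizer of (4.8) is simply the importance-weighted sample mean. The toy sketch below targets the rare-event function h(x) = 1{x > 2} under f = N(0, 1); all names and the specific example are mine, not from the text:

```python
import math
import random

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def ce_update(v_prev, n, rng):
    """One cross-entropy iteration: the weighted MLE of the Gaussian mean,
    with weights W_{t-1}(x) = |h(x)| f(x) / f(x; v_{t-1}) for h(x) = 1{x > 2}."""
    xs = [rng.gauss(v_prev, 1.0) for _ in range(n)]
    ws = [(x > 2) * normal_pdf(x, 0.0) / normal_pdf(x, v_prev) for x in xs]
    total = sum(ws)
    if total == 0.0:                    # no sample hit the rare region; keep v
        return v_prev
    return sum(w * x for w, x in zip(ws, xs)) / total   # weighted sample mean

rng = random.Random(0)
v = 0.0
for _ in range(5):
    v = ce_update(v, 5000, rng)
```

The iterates approach the CE-optimal mean E[X | X > 2] ≈ 2.37, i.e. the mean of g∗ ∝ |h|f, which is exactly the KL projection of g∗ onto this family.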
4.2 Avoiding Degeneracy
Directly optimizing (4.8) to find a set of parameters vt may not always produce a
distribution f(·; vt) that robustly generates “good” samples. Unless sample sizes are
large, this estimator may tend to over-fit the empirical data and lead to a degenerate
approximating distribution f(·; vt), with a fit based on a sample set in which the bulk
of the importance weights are concentrated in one or a small number of samples. This
situation can be diagnosed by a low effective sample size (see §2.1.2 for details).
Rubinstein and Kroese [89] suggest several heuristic alterations to (4.8) which
implicitly address the degeneracy issue. First, instead of optimizing (4.8) over all
samples drawn from f(·; v_{t−1}), one can use only the top ρ fraction of the sample,
for some constant 0 < ρ < 1, weighting these samples equally; this is known as the
elite sample technique. Second, after computing the minimum sample cross-entropy,
one can “smooth” it with the previous importance distribution according to a weight
α (in distribution families where such parameter smoothing makes sense). Third,
in the fully adaptive cross-entropy method one adaptively adjusts the sample sizes
drawn from f(·; vt) based on performance metrics. These adjustments work well for
some types of problems, as is demonstrated in the examples found in Rubinstein
and Kroese [89]. For the complicated distributions on orderings encountered in §7,
however, the basic cross-entropy method does not perform well, quickly leading to
highly degenerate importance distributions.
It also seems a worthwhile goal to fit the cross-entropy degeneracy problem within
a larger information-theoretic or statistical learning context, where many tools for for-
mally dealing with over-fitting have been developed. One robust criterion for the
avoidance of over-fitting is the minimum description length (MDL) principle [43].
Minimum description length is a generalization of Occam's razor, the principle
that one should use the simplest model that best fits the data. In practi-
cal terms, minimum description length gives an expression for the trade-off between
model complexity and “goodness of fit”. In addition to describing the model well
(as measured by likelihood), one should compensate for over-fitting by taking into
account the description length of the model, L(f(·; vt)). This is expressed as
L(f(·; vt), x) = L(f(·; vt))− log(P(x|f(·; vt))) + const. (4.9)
See MacKay [70] chapter 28 for a very readable introduction to minimum description
length.
One interpretation of the description length is to take L(f(·; v_t)) to be the Shannon
entropy of the distribution f(·; v_t), i.e. L(f(·; v_t)) = −E_{v_t}[log f(X; v_t)]. This is quite natural
from an information-theoretic perspective, and directly penalizes distributions for
deviation from the uniform distribution. A disadvantage of this formulation is that
it may be difficult to directly estimate the entropy for complicated models. Another
interpretation is to see L(f(·; v_t)) as the negative log-likelihood of the parameters under some
Bayesian prior. This may have the advantage of being easier to compute than the
model entropy, but it may not be as interpretable as a measure of model complexity.
We can use minimum description length principles when choosing a new distribution
v_t based on previously drawn samples. However, there is some ambiguity in how to
best apply the technique to adaptive importance sampling. For instance, it is unclear
how to properly weigh L(f(·; vt)) versus log(P(x|f(·; vt))) in the objective function
(4.9). In the standard model selection framework when fitting to samples drawn from
the “true” distribution of interest, no special weights are necessary. As the sample size
increases, the log-likelihood term comes to dominate, allowing for more complicated
models if they fit the data better. In the adaptive importance sampling setting,
however, weighted samples drawn from the previous importance distributions are
used only as a proxy for the optimal importance distribution g∗ when performing an
update. Although many samples may be drawn, the effective sample size with respect
to g∗ is often small, with the top few samples having importance weights orders of
magnitude greater than the rest. In the iterative model selection framework, this
means one should put a weight on the goodness-of-fit term based on the effective
sample size. In §7, several different heuristics are given for estimating this weight,
such as gradually increasing the allowed entropy or assigning randomized weights to
different samples when computing the effective sample size.
4.3 Related Methods
4.3.1 Annealed Importance Sampling
When one can sample directly from a reference distribution g with known normalizing
constant C_g, the natural way to use importance sampling to estimate the normalizing
constant C_f of a distribution π_f (with unnormalized density f) is what Neal [76]
refers to as the simple importance sampling estimator:

    C_f / C_g = E_g[f(X) / g(X)]    (4.10)
              ≈ (1/N) ∑_{i=1}^{N} f(x_i) / g(x_i),    (4.11)

where (x_i)_{1≤i≤N} ∼ g. The estimator (4.11) will have high variance if g
and f are not close to one another, especially if g(x) is small where f(x) is large. One
remedy for this is to use a sequence of distributions g = q_0, q_1, . . . , q_{n−1}, q_n = f and
to chain the estimators together:

    C_f / C_g = ∏_{j=0}^{n−1} C_{j+1} / C_j = ∏_{j=0}^{n−1} E_{q_j}[ q_{j+1}(X_j) / q_j(X_j) ].    (4.12)
This is well known in computational physics as umbrella sampling. It has also been
applied successfully in the approximate counting context for applications such as
counting the number of matchings in a graph [93] or estimating the volume of convex
bodies [33]. The general idea is one of recursive estimation of size [2]. Further
examples can be found in Diaconis and Holmes [28].
One challenging issue associated with this family of techniques is that it may be
very difficult or impossible to sample directly from an intermediate distribution of
interest qj. The most straightforward technique is to use an ergodic Markov chain
that holds qj as its stationary distribution to approximately sample according to qj.
One can then draw correlated samples at each step of the Markov chain to build
an estimator. It may be necessary to have a long “burn-in” at each intermediate
step to avoid correlations and ensure convergence to stationarity [76]. The degree of
correlation of the Markov chain samples will depend on the mixing rate. See §2.2 for
more details.
If subsequent distributions qi and qi+1 are close enough to each other, each term in
the product (4.12) will have low variance and this will produce a good estimate of Cf .
However, if the samples xj+1 are drawn using a Markov chain based on xj, this will
introduce bias. As a remedy to this issue, Neal [75] introduced annealed importance
sampling. The idea is to use a sequence of reversible transition kernels (Ti(x, y))1≤i≤n,
where Ti(x, y) has stationary distribution πi, known to within a normalizing constant
as qi. One can then produce a sample xn approximately distributed according to f ,
in a similar way as in simulated annealing [58], by starting at a state x0, drawn from
g and applying the transition kernel Ti to xi−1 to generate xi for each 1 ≤ i ≤ n in
sequence. One can then compute the importance weight of each sample x_n^{(j)} as

    W^{(j)} = ∏_{i=0}^{n−1} q_{i+1}(x_i) / q_i(x_i).    (4.13)
Then an unbiased estimator of C_f / C_g is

    (1/N) ∑_{j=1}^{N} W^{(j)}.    (4.14)
Note that although annealed importance sampling produces an unbiased estimator,
it may still have high variance depending on the sequence of transition kernels chosen
(similar to the choice of annealing schedule in simulated annealing). The variance
can also be reduced by taking a larger number of steps for each transition kernel.
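The annealed importance sampling recipe above can be made concrete with a small sketch. In the example below, the geometric tempering schedule q_i ∝ g^{1−β_i} f^{β_i}, the random-walk Metropolis kernel, and all function names are illustrative choices, not prescribed by the text; the mean of the weights estimates C_f/C_g for a one-dimensional Gaussian target.

```python
import numpy as np

def ais_run(rng, betas, log_f, log_g, sample_g, mh_step):
    """One annealed importance sampling run: returns (sample, log weight).
    Intermediate densities: q_i(x) proportional to g(x)^(1-beta_i) f(x)^beta_i."""
    log_q = lambda x, b: (1.0 - b) * log_g(x) + b * log_f(x)
    x = sample_g(rng)
    log_w = 0.0
    for i in range(1, len(betas)):
        # accumulate the factor q_i(x_{i-1}) / q_{i-1}(x_{i-1}) of (4.13)
        log_w += log_q(x, betas[i]) - log_q(x, betas[i - 1])
        # move x with a reversible kernel T_i that leaves q_i invariant
        x = mh_step(rng, x, lambda y, b=betas[i]: log_q(y, b))
    return x, log_w

def rw_metropolis(rng, x, log_target, scale=0.5):
    """A single random-walk Metropolis step (one possible kernel choice)."""
    y = x + scale * rng.standard_normal()
    return y if np.log(rng.uniform()) < log_target(y) - log_target(x) else x

# Toy check: f is an unnormalized N(0, 0.5^2) density, g = N(0, 1).
# E[W] = C_f / C_g = sqrt(2*pi*0.25), so the mean weight recovers it.
rng = np.random.default_rng(0)
betas = np.linspace(0.0, 1.0, 21)
log_f = lambda x: -x * x / (2 * 0.25)
log_g = lambda x: -x * x / 2 - 0.5 * np.log(2 * np.pi)
w = [np.exp(ais_run(rng, betas, log_f, log_g,
                    lambda r: r.standard_normal(), rw_metropolis)[1])
     for _ in range(2000)]
est = float(np.mean(w))
```

With 21 temperatures the consecutive q_i are close, so the per-step ratios in (4.13) have low variance and the mean weight lands near sqrt(2π·0.25) ≈ 1.25.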
4.3.2 Population Monte Carlo
A generalized framework for the techniques in this chapter is given by Cappe et al.
[20] under the name population Monte Carlo (PMC) and is summarized below in
Algorithm 5. Instead of only allowing an importance distribution g to take values in
a parametric family G based on the set of samples from the last iteration, population
Monte Carlo allows more general update rules and can use samples from any of the
previous time-steps. There are many possible methods that have been proposed to
update g based on a population of samples, such as using mixture-type models [30].
Algorithm 5 Generic population Monte Carlo (PMC) algorithm
Generate (x_0^{(i)})_{1≤i≤N} ∼ g_0.
Compute W_0^{(i)} = f(x_0^{(i)}) / g_0(x_0^{(i)}).
Generate (x_0^{(i)})_{1≤i≤N} by resampling (x_0^{(i)})_{1≤i≤N} based on W_0^{(i)}.
k ← 0
while (not converged) do
    k ← k + 1
    Update g_k based on {x_j^{(i)}}_{j<k, i≤N}.
    Generate (x_k^{(i)})_{1≤i≤N} ∼ g_k.
    Compute W_k^{(i)} = f(x_k^{(i)}) / g_k(x_k^{(i)}).
    Generate (x_k^{(i)})_{1≤i≤N} by resampling (x_j^{(i)})_{j≤k, 1≤i≤N}.
end while
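A minimal runnable sketch of Algorithm 5, under the simplifying assumption that the family G consists of single Gaussians and that g_k is re-fit (mean and scale) to the most recent resampled population — one of many possible update rules; the target and all names are illustrative.

```python
import numpy as np

def population_monte_carlo(rng, log_f, n_particles=500, n_iter=10):
    """Sketch of Algorithm 5 with a single-Gaussian family G: at each
    iteration g_k is re-fit to the resampled population."""
    mu, sigma = 0.0, 3.0                       # initial g_0 = N(0, 3^2)
    for _ in range(n_iter):
        x = mu + sigma * rng.standard_normal(n_particles)   # x ~ g_k
        log_g = (-(x - mu) ** 2 / (2 * sigma ** 2)
                 - np.log(sigma) - 0.5 * np.log(2 * np.pi))
        log_w = log_f(x) - log_g                            # W = f / g_k
        w = np.exp(log_w - log_w.max())
        x = rng.choice(x, size=n_particles, p=w / w.sum())  # resample
        mu, sigma = float(x.mean()), max(float(x.std()), 1e-3)  # update g_k
    return x

rng = np.random.default_rng(1)
log_f = lambda x: -(x - 2.0) ** 2 / (2 * 0.5 ** 2)  # unnormalized N(2, 0.5^2)
pop = population_monte_carlo(rng, log_f)
# the population concentrates near the target mean 2.0
```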
Chapter 5
Conditional Sampling Importance
Resampling
In this chapter, conditional sampling importance resampling (CSIR), an extension of
sampling importance resampling (SIR) is considered (see §2.1.2 for background on
SIR). Conditional SIR applies to cases where it is difficult to draw directly from a
target conditional distribution π but it is possible compute conditional distributions
π(xi|xj) for coordinate indices i, j up to a constant factor. This setting often naturally
occurs in many sequential importance sampling contexts.
5.1 Motivation
In many application settings, it is not feasible to generate draws from the conditional
distribution (2.23), but one would still like to apply the Gibbs sampler. For
this purpose, Koch [60] proposed Gibbs SIR, which at each iteration performs a SIR
based on conditional draws from an importance distribution γ. This algorithm is as
follows. Given an initial point x, for coordinate index j, draw N conditional samples x_j^{(1)}, …, x_j^{(N)} according to γ(x_j | x_{[−j]}), then compute importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)} | x_{[−j]}), and draw x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}. Due to the point-wise convergence of SIR (under mild conditions on γ and
π) the conditional probabilities go to (2.23) as N → ∞, so it is easy to see that as
CHAPTER 5. CONDITIONAL SAMPLING IMPORTANCE RESAMPLING 34
R,N → ∞ this process converges to π, where R is the number of Gibbs steps. The
motivating application of this in Koch [60] is to image reconstruction.
Algorithm 6 SIR Gibbs
Start with arbitrary x.
while (not converged) do
    Randomly choose coordinate index j.
    Draw N iid samples {x_j^{(i)}}_{i=1}^{N} according to γ(x_j | x_{[−j]}).
    Compute importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)} | x_{[−j]}).
    Sample x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}.
end while
A primary difficulty with the Gibbs SIR procedure is that it may be compu-
tationally expensive to draw samples from the conditional importance distribution
γ(x|x[−j]). This is particularly the case in sequential Monte Carlo contexts where
importance distributions may be complicated and built sequentially over many time-steps. Below, a procedure is detailed that builds approximate conditional samples using only samples from the joint importance distribution, referred to as conditional sampling importance resampling (CSIR), and a simple example is given for the case of the multivariate normal distribution. Chapter 6 develops the motivating application to multi-object tracking.
5.2 Conditional Resampling
The goal in conditional SIR, as in standard SIR, is to approximately draw samples
of a target distribution π using samples drawn from an importance distribution γ,
both taking values in state space Ω. It is assumed that the support of γ contains that of π, i.e. π(x) > 0 ⇒ γ(x) > 0. Suppose that elements x ∈ Ω can be partitioned as x = (x_1, …, x_n), and that one has access to the importance and target conditional distribution functions, π(x_j | x_{[−j]}) and γ(x_j | x_{[−j]}), up to a constant factor, but that these distributions are difficult to sample directly. Suppose that one draws a collection of iid samples {x^{(i)}}_{i=1,…,N} from γ. One can decompose these as a collection of samples from the marginal distributions, {x_1^{(i)}}_{i=1}^{N}, …, {x_n^{(i)}}_{i=1}^{N}.
The idea of CSIR is then to use these marginal samples as importance samples for
the purpose of approximately sampling from the conditional distributions. One can
think of this as a step of an approximate Gibbs Markov transition kernel. A step of
CSIR Gibbs is as follows. Starting from x ∈ Ω, pick a coordinate index j at random,
compute importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)}), and draw x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}. This process is summarized in Algorithm 7.
Algorithm 7 CSIR Gibbs
Draw N independent importance samples {x^{(i)}}_{i=1}^{N} ∼ γ.
Start with arbitrary x.
while (not converged) do
    Pick an index j at random.
    Compute W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)}) for i = 1, …, N.
    Draw x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}.
end while
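The CSIR Gibbs step above can be sketched as follows; the bivariate Normal target with correlation ρ = 0.9 and importance distribution γ = N(0, I), the pool size, and all function names are illustrative choices for checking that the resampled state picks up the target's correlation structure.

```python
import numpy as np

def csir_gibbs(rng, ys, log_pi_cond, log_gamma_marg, x0, n_sweeps=5):
    """CSIR Gibbs sketch (Algorithm 7). ys is an (N, n) array of iid draws
    from gamma; coordinate j of the current state is resampled from the
    pool ys[:, j] with weights pi(x_j^(i) | x_[-j]) / gamma(x_j^(i))."""
    x = x0.copy()
    N, n = ys.shape
    for _ in range(n_sweeps):
        for j in range(n):
            log_w = log_pi_cond(ys[:, j], j, x) - log_gamma_marg(ys[:, j], j)
            w = np.exp(log_w - log_w.max())
            x[j] = rng.choice(ys[:, j], p=w / w.sum())
    return x

# Toy target: bivariate Normal, mean 0, correlation rho; gamma = N(0, I).
rho = 0.9
log_pi_cond = lambda xj, j, x: -(xj - rho * x[1 - j]) ** 2 / (2 * (1 - rho ** 2))
log_gamma_marg = lambda xj, j: -xj ** 2 / 2

rng = np.random.default_rng(2)
samples = np.array([csir_gibbs(rng, rng.standard_normal((200, 2)),
                               log_pi_cond, log_gamma_marg,
                               rng.standard_normal(2))
                    for _ in range(400)])
corr = float(np.corrcoef(samples.T)[0, 1])  # should approach rho = 0.9
```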
Theorem 1. If the Gibbs sampler is ergodic for π, and for each x ∈ Ω and index
j, π(xj|x[−j])/γ(xj|x[−j]) < ∞, then as N → ∞, CSIR Gibbs is ergodic and has
stationary distribution π.
Proof. For each sample/coordinate pair x and j, the CSIR Gibbs transition kernel
KCSIR converges to the Gibbs transition kernel K due to the convergence of SIR.
One of the main advantages of CSIR is that it can take advantage of indepen-
dence structure to greatly improve the quality of the approximate sample versus SIR.
This can give lower variance estimators than the standard importance sampling esti-
mate, µIS, unlike SIR. In order to properly take advantage of independence structure,
Chapter 6 develops the notion of grouping subsets for CSIR, similar to the grouped
Gibbs sampler or the Lk sets in the hit-and-run algorithm from §2.2.5. A disadvan-
tage of CSIR is that extra computational effort may be needed for approximating
conditionals.
5.2.1 Estimating Marginal Importance Weights
CSIR requires computation of marginal importance distributions γ(xj), which for
complicated importance distributions is often impractical. There are several ways
to do this. One way is to take the conditional distribution of the sample itself, γ(x_j | x_{[−j]}). Since γ(x_j) = E[γ(x_j | X_{[−j]})], this gives an unbiased estimator. In cases where X_j is independent of X_{[−j]} under γ, it is exact. Another method is to use the Monte Carlo estimator, γ(x_j) ≈ (1/N) ∑_{i=1}^{N} γ(x_j | x_{[−j]}^{(i)}). Instead of using all N indices, one can also use a random subset. Empirically, the Monte Carlo estimate appears to perform well, since x_j has been generated by the joint importance distribution γ and is not a rare occurrence. The higher the degree of independence for coordinate index j, the better this Monte Carlo estimate will be.
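A small numerical check of the Monte Carlo marginal estimate just described (the bivariate Normal setup and the names are illustrative):

```python
import numpy as np

def marginal_estimate(x_j, j, pool, cond_density):
    """Monte Carlo estimate gamma(x_j) ~ (1/N) sum_i gamma(x_j | x_[-j]^(i)),
    averaging the conditional density over samples from the joint gamma."""
    return float(np.mean([cond_density(x_j, j, x) for x in pool]))

# Toy check: bivariate Normal with correlation rho; the true marginal of
# coordinate 0 is N(0, 1), so the estimate at 0 should be ~1/sqrt(2*pi).
rho = 0.8
rng = np.random.default_rng(3)
pool = rng.standard_normal((5000, 2)) @ np.linalg.cholesky(
    np.array([[1.0, rho], [rho, 1.0]])).T
cond = lambda xj, j, x: (np.exp(-(xj - rho * x[1 - j]) ** 2
                                / (2 * (1 - rho ** 2)))
                         / np.sqrt(2 * np.pi * (1 - rho ** 2)))
est = marginal_estimate(0.0, 0, pool, cond)
```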
5.2.2 Conditional Effective Sample Size
Similarly to the standard SIR context, degeneracy of importance weights can occur
for conditional resampling, with some sub-samples having much higher conditional
importance weights than others for a given coordinate index j. To measure degener-
acy, some heuristics are introduced here extending the notion of effective sample size.
The marginal effective sample size (MESS) will be defined as

    MESS_j = N · E[W(X_j)]^2 / E[W(X_j)^2],    (5.1)

where W indicates the unnormalized importance weight; one can similarly define an empirical version of MESS. As explained above, one cannot generally compute the marginal distributions directly, so one can attempt to approximate MESS using conditional distributions.
When using the same-sample conditional importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}^{(i)}) / γ(x_j^{(i)} | x_{[−j]}^{(i)}), this is referred to as the local conditional effective sample size (local CESS),

    lCESS_j = ( ∑_{i=1}^{N} W_j^{(i)} )^2 / ∑_{i=1}^{N} ( W_j^{(i)} )^2.    (5.2)
Alternatively, one can use the cross-sample conditional importance weights,
    W_j^{(i,k)} = π(x_j^{(i)} | x_{[−j]}^{(k)}) / γ(x_j^{(i)} | x_{[−j]}^{(k)}),

to compute what will be called the global conditional effective sample size,

    gCESS_j = ( ∑_{i=1}^{N} ∑_{k=1}^{N} W_j^{(i,k)} )^2 / ∑_{i=1}^{N} ∑_{k=1}^{N} ( W_j^{(i,k)} )^2.    (5.3)
As when estimating the marginal distributions for sampling purposes, one can use a
subset of indices instead of the full N for computing the Monte Carlo estimates. Note
that lCESS and gCESS are simply heuristics used to estimate sample quality, and
are not related to the convergence of CSIR.
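Both heuristics reduce to the familiar empirical effective-sample-size form (∑ W)² / ∑ W² applied to the appropriate collection of weights; a sketch:

```python
import numpy as np

def ess(weights):
    """Empirical effective sample size (sum W)^2 / sum W^2, the common form
    behind lCESS (5.2) and gCESS (5.3); weights may be unnormalized."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

def local_cess(W_j):
    # W_j[i] = pi(x_j^(i) | x_[-j]^(i)) / gamma(x_j^(i) | x_[-j]^(i))
    return ess(W_j)

def global_cess(W_jk):
    # W_jk[i, k] = pi(x_j^(i) | x_[-j]^(k)) / gamma(x_j^(i) | x_[-j]^(k))
    return ess(np.asarray(W_jk).ravel())

# equal weights give the full sample size; a single dominant weight gives ~1
```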
5.2.3 Importance Weight Accounting
Recall that a sample x has importance weight

    W(x) = π(x) / γ(x) = ( π(x_j | x_{[−j]}) π(x_{[−j]}) ) / ( γ(x_j | x_{[−j]}) γ(x_{[−j]}) ).    (5.4)

With CSIR, one is choosing x′_j approximately according to π(· | x_{[−j]}); therefore W(x′) ≈ π(x_{[−j]}) / γ(x_{[−j]}) = W(x) / W_j(x_j). Essentially, the algorithm is “resetting” the conditional weights for the resampled part. One can therefore perform a single step of
CSIR for a sample x, and update its importance weight accordingly. However, since
this is an approximation, it is generally not advisable to perform this for several steps
of CSIR, since it will lead to cascading errors. Instead, a better policy is to start x
as a SIR sample to initially reset all importance weights. One can then assume that
importance weights remain the same for all samples.
5.3 Example: Multivariate Normal
The following example considers a target distribution π that is multivariate Normal with mean 0 and covariance C, using a multivariate Normal γ as an importance distribution. The random variables distributed according to π and γ are X and Y respectively, with X ∼ N(0, C) and Y ∼ N(0, α²I_n). Because one can sample directly from X, as well as from the conditional and marginal distributions, this allows for easy evaluation of the performance of CSIR as compared to SIR.

The importance variable, denoted Y, is multivariate Normal with mean 0 and covariance α²I. Y admits an importance density γ such that, for x ∈ R^n,

    γ(x) = (2π)^{−n/2} α^{−n} exp(−x^T x / (2α²)).

The target random variable, denoted X, is multivariate Normal with mean 0 and covariance C. X admits a density π such that, for x ∈ R^n,

    π(x) = (2π)^{−n/2} |C|^{−1/2} exp(−x^T C^{−1} x / 2).
For these choices of importance and target distributions, one can explicitly compute the conditional distributions. Recall that if X can be decomposed as

    X = (X_a, X_b)^T,    (5.5)

where X_a and X_b are column vectors of size q and n − q respectively, then the conditional distribution of X_a given X_b is a q-dimensional multivariate normal with mean

    μ = C_{12} C_{22}^{−1} X_b

and covariance matrix

    Σ = C_{11} − C_{12} C_{22}^{−1} C_{21}.
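These formulas translate directly into a short routine (the function name is illustrative):

```python
import numpy as np

def mvn_conditional(C, xb, q):
    """Conditional distribution of X_a | X_b = xb for X ~ N(0, C), with
    X_a the first q coordinates, following the formulas above."""
    C11, C12 = C[:q, :q], C[:q, q:]
    C21, C22 = C[q:, :q], C[q:, q:]
    mu = C12 @ np.linalg.solve(C22, xb)              # C12 C22^{-1} xb
    Sigma = C11 - C12 @ np.linalg.solve(C22, C21)    # C11 - C12 C22^{-1} C21
    return mu, Sigma

# 2-d sanity check: for unit variances, mu = rho * xb and var = 1 - rho^2.
C = np.array([[1.0, 0.6], [0.6, 1.0]])
mu, Sigma = mvn_conditional(C, np.array([2.0]), q=1)
# mu = [1.2], Sigma = [[0.64]]
```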
In the examples below, N samples y^{(1)}, …, y^{(N)} are drawn from the importance distribution and their importance weights are computed. To build the SIR sample {x_{SIR}^{(i)}}_{i=1,…,N}, N samples are drawn with replacement from the importance samples with probability proportional to their importance weights. To build the CSIR samples {x_{CSIR}^{(i)}}_{i=1,…,N}, each is initialized as a SIR sample, i.e.

    x_{CSIR}^{(i)} = x_{SIR}^{(i)},

and then the CSIR Gibbs Markov step is run, updating each coordinate a total of k times.
To test the quality of the approximate samples, a Monte Carlo estimate of the
Kullback-Leibler divergence is computed. If π and γ are normally distributed with
respective means µπ, µγ and covariance matrices Σπ,Σγ, then one can explicitly write
the Kullback-Leibler divergence as
    D_KL(g, f) = (1/2) ( tr(Σ_π^{−1} Σ_γ) + (μ_π − μ_γ)^T Σ_π^{−1} (μ_π − μ_γ) + log(det Σ_π / det Σ_γ) − n ).    (5.6)
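The closed form above is easy to implement and check against the special case it reduces to when g = N(0, I_n) and f = N(0, αI_n), namely (n/2)(1/α + log α − 1); function and variable names are illustrative.

```python
import numpy as np

def gaussian_kl(mu_g, Sig_g, mu_p, Sig_p):
    """KL divergence between multivariate normals in the form of (5.6),
    with the first argument pair in the role of the approximation g."""
    n = len(mu_g)
    Sp_inv = np.linalg.inv(Sig_p)
    d = mu_p - mu_g
    return 0.5 * (np.trace(Sp_inv @ Sig_g) + d @ Sp_inv @ d
                  + np.log(np.linalg.det(Sig_p) / np.linalg.det(Sig_g)) - n)

# special case g = N(0, I_n), f = N(0, alpha * I_n)
n, alpha = 4, 0.5
kl = gaussian_kl(np.zeros(n), np.eye(n), np.zeros(n), alpha * np.eye(n))
```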
However, due to the multinomial resampling of SIR and CSIR, the distribution of
an approximate sample x cannot be expressed as a multivariate Normal. Instead the
KL-divergence between g and π will be approximated, where g denotes the distribution of the approximate sample. To estimate the cross-entropy, E[log π(X)], since samples have been drawn from g and it is feasible to evaluate log π(x), it is possible to use the crude Monte Carlo estimator (1/N) ∑_{i=1}^{N} log π(x^{(i)}). To estimate the entropy, E[log g(X)], it is not possible to use crude Monte Carlo, since in neither SIR nor CSIR does one have an explicit representation of g. Instead a density estimate ĝ is built. There are many possible ways to build a density estimate, but in this case, since both the importance and target distributions are multivariate normal, ĝ is taken to be multivariate normal and the maximum likelihood fit is used. Crude Monte Carlo is then used with ĝ to estimate the entropy as (1/N) ∑_{i=1}^{N} log ĝ(x^{(i)}). For the examples below, three cases are chosen:
1. C_id = αI.
2. C_unif, a random matrix with eigenvalues chosen uniformly at random in (0, 1].

[Figure 5.1: CSIR Normal example: eigenvalues of the covariance matrices, scaled to have max eigenvalue 1. Panels: (a) C_id, (b) C_unif, (c) C_skew.]
3. C_skew, a random matrix with eigenvalues chosen from a lognormal distribution (μ = 0, σ = 1.5) and scaled to have max eigenvalue 1, so that they lie in (0, 1].
The random matrices above were generated by taking Q^T Λ Q, where Λ is the diagonal matrix of eigenvalues and Q is a random orthogonal matrix. For the case of the uncorrelated target distribution, C = αI,
with 0 < α < 1. In this case, the Kullback-Leibler divergence from the importance
distribution γ to the target distribution π is
    D_KL(g, f) = (1/2) ( n/α + n log(α) − n )    (5.7)
               = (n/2) ( 1/α + log(α) − 1 ),    (5.8)
so D_KL(g, f) is linear in the dimension n. Using the iid properties of the importance and target distributions, one can see that for the conditional SIR approximate sample,

    D_KL(g_CSIR, f) = E[ log( g_CSIR(X_CSIR) / f(X_CSIR) ) ]    (5.9)
                    = ∑_{i=1}^{n} E[ log( g_CSIR(X_CSIR,i) / f(X_CSIR,i) ) ]    (5.10)
                    = n · E[ log( g_CSIR(X_CSIR,1) / f(X_CSIR,1) ) ],    (5.11)
so for a given sample size N , DKL(gCSIR, f) also increases linearly in the dimension
n. For the random covariance matrices, C is formed by pre- and post-multiplying the diagonal eigenvalue matrix with a random orthogonal matrix Q, obtained by performing a QR decomposition on a matrix composed of n draws from a standard multivariate normal in n dimensions.
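The covariance construction just described can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def random_covariance(rng, eigvals):
    """C = Q^T diag(eigvals) Q, with Q a random orthogonal matrix from the
    QR decomposition of an n x n standard-normal matrix."""
    n = len(eigvals)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q.T @ np.diag(eigvals) @ Q

rng = np.random.default_rng(4)
lam = np.sort(rng.uniform(size=6))   # the C_unif eigenvalue recipe
lam = lam / lam.max()                # scaled so the max eigenvalue is 1
C = random_covariance(rng, lam)
# C is symmetric with exactly the prescribed spectrum
```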
[Figure 5.2: Estimated KL-divergence for approximate samples using CSIR and SIR, N = 50 samples, k = 5 passes of the Gibbs sampler. Panels: (a) C_id, (b) C_unif, (c) C_skew. This process is repeated for 20 different randomly chosen covariance matrices for the uniform and skewed cases. Dotted lines are 95% confidence intervals based on 1000 bootstrap samples.]
Note that in these experiments, CSIR has much lower KL-divergence than SIR for
each example but particularly in the case of the random covariance matrices. Both
methods perform worst on the skewed covariance matrix. Obviously, in these exam-
ples one can easily sample directly from the target distribution, but they illustrate
how CSIR can give a large improvement over SIR for high-dimensional state spaces.
The next chapter uses this technique for a sequential importance sampling case where
the state space is high-dimensional.
[Figure 5.3: Same experiments as in Figure 5.2, plotted by method: (a) SIR, (b) CSIR.]
Chapter 6
Multi-Object Particle Tracking
This chapter applies conditional resampling in the particle filter context with an
application to multi-object tracking (MOT). The novelty in this approach is that
it allows the use of arbitrary jointly distributed forward movement and observation
models, while maintaining asymptotic convergence properties and computational ef-
ficiency. To make the particle filter computationally feasible under this joint model,
the use of conditional sampling importance resampling for sequential Monte Carlo is
introduced. This modified particle filter tracking algorithm can handle unknown or
varying numbers of objects as well as the problem of association of observations with
objects without making parametric assumptions on the nature of the forward model
or resorting to ad-hoc steps.
6.1 Background
6.1.1 Single Object Tracking
The task of inferring a single object path from a sequence of (possibly noisy) ob-
servations is known as tracking. One way to formulate tracking is as an inference
problem in a hidden Markov model (HMM). Recall that in a hidden Markov model,
it is assumed that there is some underlying “hidden” Markov process X1:T of interest
which cannot be observed directly, and a sample from an “observation” process Y1:T .
CHAPTER 6. MULTI-OBJECT PARTICLE TRACKING 44
X1:T is also called the state-space process. The simplest non-trivial hidden Markov
model for tracking is a Gaussian state-space process X1:T coupled with a Gaussian
observation process Y_{1:T}, with X_t, Y_t ∈ R^n. This can be stated as

    X_t = X_{t−1} + N(0, Σ_s),    (6.1)
    Y_t = X_t + N(0, Σ_o).    (6.2)
In this case, one can use the Kalman filter [55] to solve the inference problem numerically. The Kalman filter is a sequential procedure that first computes E[X_t | X_{t−1}], then provides updates based on the current observation y_t to get E[X_t | X_{t−1}, y_t]. This is iterated over t = 1, …, T. Due to the Gaussian nature of the processes, knowing E[X_t | X_{t−1}, Y_t] for each t is sufficient to know the full conditional distribution P[X_{1:t} | Y_{1:t}]. See Algorithm 8 for a step of the Kalman filter algorithm for this
simple example.
Algorithm 8 Kalman filter step for simple Gaussian process
P_t^− = P_{t−1} + Σ_s
K_t = P_t^− (P_t^− + Σ_o)^{−1}
x_t = x_{t−1} + K_t (y_t − x_{t−1})
P_t = (I − K_t) P_t^−
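Algorithm 8 translates directly into a few lines of code; the toy run below (all numerical values are illustrative) filters repeated noisy observations of a fixed point, pulling the estimate toward that point while the posterior covariance shrinks.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y, Sigma_s, Sigma_o):
    """One step of Algorithm 8 for X_t = X_{t-1} + N(0, Sigma_s),
    Y_t = X_t + N(0, Sigma_o)."""
    P_pred = P_prev + Sigma_s                      # P_t^-
    K = P_pred @ np.linalg.inv(P_pred + Sigma_o)   # Kalman gain K_t
    x = x_prev + K @ (y - x_prev)                  # updated mean x_t
    P = (np.eye(len(x_prev)) - K) @ P_pred         # updated covariance P_t
    return x, P

x, P = np.zeros(2), np.eye(2)
for _ in range(20):
    x, P = kalman_step(x, P, np.array([1.0, -1.0]),
                       0.01 * np.eye(2), 0.5 * np.eye(2))
# x approaches (1, -1); P settles to a small steady-state value
```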
An alternative to the Kalman filter is particle filter tracking (or particle tracking).
The modeling assumptions required to use the particle filter are quite general com-
pared to the Kalman filter. The advantage to the particle filter is that more realistic,
non-linear, and discontinuous observation and movement models may be specified;
the disadvantage is an increase in computational complexity and implementation dif-
ficulty, usually requiring Monte Carlo simulations instead of fast and relatively simple
deterministic computations.
Inference in tracking problems can often be posed as integrals or expected values.
While there exist deterministic, quadrature type methods for particle filters, in gen-
eral the integrals to be computed are high-dimensional, in which case deterministic
methods tend to be intractable. For this reason, it is common to resort to Monte
Carlo methods for particle filtering (Algorithm 9), although some attempts at deter-
ministic particle filtering have been made (see for example Doucet and De Freitas [31]
ch. 5).
Algorithm 9 Single object particle filter tracking
Draw initial particles x_0^{(1)}, …, x_0^{(N)} from γ_0.
for t in 1 … T do
    for each particle index i do
        Sample x_t^{−(i)} ∼ γ_t(· | x_{t−1}^{(i)}).
        Compute importance weight w_t^{(i)}.
    end for
    for each particle index i do
        Sample with replacement x_t^{(i)} from x_t^{−(1)}, …, x_t^{−(N)} with prob. prop. to w_t^{(i)}.
    end for
end for
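Algorithm 9 can be sketched for a one-dimensional random-walk model, using the movement model itself as the proposal γ_t so that the importance weight reduces to the observation likelihood (the common "bootstrap" choice; all names and values are illustrative):

```python
import numpy as np

def particle_filter(rng, ys, n_particles, sd_s, sd_o):
    """Bootstrap particle filter (Algorithm 9) for a 1-d random-walk model.
    The proposal gamma_t is the movement model, so the weight w_t^(i)
    reduces to the observation likelihood."""
    x = rng.standard_normal(n_particles)                  # from gamma_0
    means = []
    for y in ys:
        x = x + sd_s * rng.standard_normal(n_particles)   # propagate
        log_w = -(y - x) ** 2 / (2 * sd_o ** 2)           # weight
        w = np.exp(log_w - log_w.max())
        x = rng.choice(x, size=n_particles, p=w / w.sum())  # resample
        means.append(x.mean())
    return np.array(means)

# Simulate a hidden random walk, observe it with noise, and filter.
rng = np.random.default_rng(5)
truth = np.cumsum(0.1 * rng.standard_normal(50))
obs = truth + 0.3 * rng.standard_normal(50)
est = particle_filter(rng, obs, n_particles=1000, sd_s=0.1, sd_o=0.3)
# the filtered means track `truth` with error below the observation noise
```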
6.1.2 Multi Object Tracking
Multi-object tracking is a large and diverse field, with many applications across science and engineering. It is closely related to the single-object tracking problem. As noted in Vermaak et al. [100], most multi-object tracking algorithms fall into two categories. The first approach is to run multiple copies of individual single-object trackers, then post-process their outputs to handle occlusion and/or confusion. Interactive effects between objects and observations are assumed to be negligible or are dealt with heuristically. These algorithms tend to be fast and often work well in practice, but if interactive effects are non-negligible, significant bias may be introduced, and convergence to the optimal solution is not guaranteed as the number of samples goes to infinity. Hue et al. [48] give an early example of this approach, with an algorithm for multi-object
tracking using an ‘association vector’ approach coupled with parallel particle filters.
This model assumes independence in the observation and movement models, and uses
a simulation-wide indicator test to determine whether to add or remove particles at
a given time-step.
A second approach is to formally consider the multi-object tracking problem as
an instance of single-object tracking by enlarging the state space to represent all
object paths jointly as a single object. The process and observation models for this
enlarged object can then be arbitrarily jointly distributed across all individual objects
and observations. Computational techniques developed for single-object tracking can
then be applied to make inferences in this model, such as particle filtering. In this vein,
the authors of Khan et al. [56] give an algorithm for multi-object tracking that uses a
Markov Random Field to define a joint movement distribution for interacting objects.
The drawback of the enlarged-state approach is that SIR becomes inefficient in high dimensions.
To see why naive SIR is generally ineffective in high dimensions, consider the
example of n objects moving independently with no interaction. Parallel particle
filters tracking each object individually would preferentially draw their most likely
current states for each object. A naive joint-state particle filter implementation would
select the best joint-state space particle, but any single joint particle is unlikely to
contain all of the best individual current states. This argument suggests that the naive joint-state particle filter would require a number of samples exponential in the dimension to achieve error comparable to parallel single-object particle filters, which is impractical for even a modest number of tracked objects. To compensate for this effect, the algorithm described in this chapter applies conditional resampling to take advantage of the large degree of independence between groups of objects that are not in close proximity to one another.
Other useful frameworks for multi-object tracking include the mixture particle
filter [100] and the PHD filter [101]. For tutorials on general particle filter methods, see Chen [24] and Doucet and De Freitas [31]. An overview of particle-based approaches
to multi-object tracking is given in the thesis of Ozkan [80].
6.1.3 Tracking Notation
A time-step index t is a sequential coordinate index in Z_+. An observation φ is simply a position in R^n, occurring at a time-step t. An observation set Y_t is a set of
observations occurring at time-step t. Objects are assigned an index l, and object
states are represented as ζ_{l,t}. A trajectory ζ_{l,1:t} is a sequence of states representing
the movements of a single object, also called a path. Note that trajectories may
contain a leading and trailing sequence of null states ∅, indicating the object entered
and/or left the field of view at some point in time. An observation event ψ is an
intermediate variable, representing the process by which a set of observations was
generated by a set of objects. The object state set Xt at time t is a set containing all
of the states for all objects at time-step t, while the event state Ψt is the set of all
object events ψ at time t. The enhanced state at time t is the pair (Xt,Ψt) and is
denoted Xt. A particle x1:t is a set of trajectories representing the full joint evolution
of the state space under the particle filter algorithm up to time-step t. Each particle
has associated with it an importance weight W_{1:t}. Note that after a SIR step, x_{1:t} is an approximate sample from the target distribution at time t and W_{1:t} is reset to 1 for each particle. State-space, event, and observation process samples are denoted in lower case as x, ψ, y respectively, and the corresponding random variables are denoted in upper case as X, Ψ, Y. In this chapter it is assumed that there are N particle
samples from the state space, and use i as a sample index. The total number of time
steps observed is denoted T .
6.2 Conditional SIR Particle Tracking
In multi-object tracking, there are various discrete ‘decisions’ that a tracking algo-
rithm must make. These decisions correspond to discrete events in the forward simu-
lation model, such as the deletion/creation of new objects or occlusion, as well as the
‘association’ problem of assigning individual observations to objects. Addressing the
association problem is the primary contribution of algorithms such as JPDAF [22];
this method and others deal with association via Monte Carlo. Dealing with unknown
or changing numbers of objects is often addressed via ad-hoc methods such as defining
a ‘decision rule’ in which objects are added/removed across all particles simultane-
ously [48], or by assuming that the forward simulation follows some parametric form
such as a mixture model [100].
In the ‘enlarged state-space’ particle filter, both the deletion/creation and associ-
ation problems are handled in an asymptotically correct way, but as previously noted,
this approach yields poor computational complexity. Conditional SIR can be used
to improve performance. To facilitate this, grouping subset functions are defined to
decompose the state-space into two parts. A grouping subset function takes as input
a particle, and returns a grouping subset, which can be used to apply the grouped
Gibbs or the hit-and-run algorithms as described in §2.2.5. A key algorithmic design
issue in CSIR is in choosing appropriate grouping subset functions.
6.2.1 Grouping Subsets for Multi-Object Tracking
In the multi-object tracking context, one way to define grouping subset functions is to return sets of trajectories associated with individual observations φ ∈ y_t. G_φ(x_{1:t}) denotes the set of trajectories that are associated by events with observation φ for particle x up to time t. If there are no trajectories associated with φ in x_t, i.e.
φ was considered a false positive in xt, then the empty set ∅ is returned. As noted
above, trajectories ζ1:t may include a leading and trailing set of “null” observations
∅, corresponding to the object entering and leaving the field of view. A simplified
example of grouping based on observations is shown in Figure 6.1.
Given a grouping subset G, it is possible to sample approximately from π(X_G | X_{[−G]})
using conditional SIR. If G is chosen independently of X, then this defines a Markov
transition that leaves the stationary distribution invariant. Note that in the case of
well-separated trajectories, applying conditional SIR to enlarged state-space particles
is equivalent to running a separate particle filter for each object independently.
6.2.1.1 Choosing/Pruning Grouping Functions
To choose which groupings to apply conditional SIR to, a table of groupings is kept with the empirical CESS for each grouping, updated after each iteration. There is much potential redundancy in groupings, and updating the table is expensive, so this grouping table is typically 'pruned'. In general, it is possible to use arbitrary rules
about which groupings to use based on aggregate particle sample properties without
[Figure 6.1: Grouping subset functions based on observations, denoted by enclosed dashed lines. Filled circles: observations. x's, o's: individual objects from particles 1 and 2 respectively. Note that in the grouping on the right, the particles differ in the number of associated object trajectories.]
affecting the asymptotic convergence properties. Pruning decision rules include only
considering groupings that originated a constant number of time-steps in the past,
not considering ‘children’ for well-established groups in subsequent time-steps, and
removing ‘eliminated’ groups as determined by heuristics such as CESS. As noted, a
wide range of pruning algorithms can be used.
6.3 Application: Tracking Harvester Ants
In this section, an application of the CSIR approach is given to tracking the move-
ments of individual ants in a colony of harvester ants in a laboratory setting.
6.3.1 Object Detection
A key step in multi-object tracking is object recognition, i.e. identifying what constitutes an object and what does not. To facilitate this task, the tracking algorithm uses sophisticated object detection software known as GemIdent [47]. Each pixel is classified independently as belonging to one of the specified object types (or none). The algorithm then forms a graph representation of classified pixels and uses spectral
graph partitioning [91] to find clusters of pixels corresponding to individual objects.
6.3.1.1 Pixel Classification
Supervised learning refers to the practice of building a learning algorithm with the
aid of a set of training points. Interactive supervised learning is a technique that
recursively applies supervised learning to predict a response variable by asking the
user for input, training the algorithm based on the new input, reporting the results
of the training to the user, and repeating until the user is satisfied. This interactivity
streamlines the training process, allowing the user to initially identify a small number
of example points, then add further points to correct mistakes made in the classifi-
cation process. This has the effect of minimizing the total number of training points
that the user has to provide, and shortens the total time involved in the training process.
As an input to the supervised learning algorithm, it is necessary to provide pre-
computed features for each pixel to classify. The primary feature that is used here is
a ring score. The algorithm first normalizes pixel values by using the Mahalanobis
distance based on a pre-defined color set. To compute the ring score, take the average normalized value relative to color c of all pixels at radius r from the pixel of interest.
The algorithm computes this value for each color c and radius r < R for some pre-
defined maximum R. To incorporate movement information into pixel classification,
the ring scores of the same pixel in adjacent frames are also included, both forward
and backward in time.
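A simplified version of the ring-score feature can be sketched as follows. The real pipeline uses Mahalanobis-normalized scores per color c and all radii r < R across adjacent frames; here a single pre-normalized channel and one radius are assumed, and all names are illustrative.

```python
import numpy as np

def ring_score(img, cx, cy, r):
    """Average value of pixels at rounded distance r from (cx, cy); img is
    assumed pre-normalized (e.g. a Mahalanobis score for one color)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.round(np.hypot(xx - cx, yy - cy)) == r
    return float(img[mask].mean()) if mask.any() else 0.0

# A bright ring of radius 3 scores 1.0 at r = 3 and 0.0 at r = 1.
img = np.zeros((11, 11))
yy, xx = np.mgrid[0:11, 0:11]
img[np.round(np.hypot(xx - 5, yy - 5)) == 3] = 1.0
```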
Once features have been computed for each pixel, random forests [18] are used to classify each pixel based on the computed features. Random forests are used because of their relative simplicity, and for the current application they empirically give comparable results to other techniques such as support vector machines (SVM). Random forests also have the advantage of being easily interpretable, as it is possible to directly compute the relative importance of each feature in the classification process. This approach was compared to other machine learning techniques in the publicly available software library WEKA [37]. For a book-length treatment of statistical and machine learning techniques, see Hastie et al. [45].
6.3.1.2 Centroid Finding
After pixels are classified, it is necessary to determine the number and
locations of ants. The algorithm uses a flood-finding algorithm to form groups of
adjacent pixels of the same type into “blobs”. Small, disparate blobs also occur due
to incorrectly classified pixels. In addition, objects that are clustered together in an
image will tend to produce large connected blobs. The current application needs to
determine which blobs are noise, which indicate the presence of ants, and how many
ants each represents.
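The flood-finding step can be sketched as a breadth-first connected-components pass over the classified label image. The function name and the choice of 4-adjacency are assumptions for illustration.

```python
from collections import deque

def find_blobs(labels):
    """Group 4-adjacent pixels with the same nonzero class label into
    "blobs" via breadth-first flood fill. `labels` is a 2D list of class
    labels, with 0 meaning background/unclassified.
    Returns a list of (label, [(y, x), ...]) blobs."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if labels[y][x] == 0 or seen[y][x]:
                continue
            lab, blob, queue = labels[y][x], [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                            and labels[ny][nx] == lab:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append((lab, blob))
    return blobs
```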
To do this, the algorithm forms a graph of the blob by connecting adjacent pixels,
then uses spectral graph partitioning [91] to cut the blob. A good overview of spectral
methods for graphs can be found in Spielman [95]. One way to think about spectral
partitioning is as an approximation to the ’sparsest cut’ problem, which seeks to find
a cut that maximizes the number of vertices separated but minimizes the number of
edges cut. It is necessary to tune the parameters of this centroid finding step. Figure
(6.2) shows two blobs and the results of spectral partitioning.
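A minimal sketch of the spectral bisection idea: cut the blob graph by the sign of the Fiedler vector (the eigenvector for the second-smallest Laplacian eigenvalue). Power iteration with deflation is used here purely to keep the sketch self-contained; the actual implementation follows [91].

```python
import random

def fiedler_cut(n, edges, iters=2000):
    """Bisect a connected graph on vertices 0..n-1 by the sign of the
    Fiedler vector, approximated by power iteration on c*I - L with the
    all-ones direction (the trivial eigenvector) projected out."""
    deg = [0.0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    c = 2 * max(deg) + 1.0   # shift so all eigenvalues of c*I - L are > 0
    random.seed(0)
    x = [random.random() for _ in range(n)]
    for _ in range(iters):
        # project out the constant vector (Laplacian eigenvalue 0)
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        # multiply by (c*I - L) where L = D - A
        y = [(c - deg[i]) * x[i] for i in range(n)]
        for u, v in edges:
            y[u] += x[v]
            y[v] += x[u]
        norm = max(abs(yi) for yi in y) or 1.0
        x = [yi / norm for yi in y]
    return [1 if xi >= 0 else 0 for xi in x]
```

On a "barbell" of two triangles joined by one edge, the cut separates the two triangles, which is exactly the behavior wanted when two ants touch.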
6.3.2 Observation Model
While functioning relatively well in most cases, the GemIdent machine-learning
algorithm is prone to error due to background clutter, ants moving in close proximity
to one another, unusual orientations of ants as they traverse objects, and other
factors that can confuse even human observers. Additionally, there are tradeoffs
between the quality of the images, the training time of the interactive supervised
learning algorithm, computational efficiency, and the accuracy of the classifications.
It is desirable to have a tracking algorithm that can operate robustly in
the presence of this relatively noisy process. Of particular concern is the propensity
of the algorithm to split an individual ant into two observations or to merge two ants
into a single observation.
Under the observation model, each Yt is generated as the result of events that
occur according to a probability distribution dependent on Xt.

Figure 6.2: Blob bisection via spectral partitioning

In the current application, there are two broad classes of object events: independent events, and joint
events. Independent Normal observation events correspond to the canonical single tra-
jectory/single observation case, with an observation occurring at a location according
to a bivariate Normal distribution centered at the object location, with standard de-
viation σo. Independent false negative events indicate no observation recorded for
an individual object. This occurs for each object according to a Bernoulli random
variable with parameter λn. Independent false positive events indicate a spurious ob-
servation not directly caused by any proximate object. These occur according to a
uniform 2D Poisson process with parameter λp. Splitting is a special kind of false pos-
itive in which a pair of observations appear at the front and back of an object, due
to overzealous segmentation. This occurs for each object according to a Bernoulli
random variable with parameter λs. Finally, two adjacent objects i and j have a
probability of ‘merging’ with their neighbors as a function of their distance, with
an observation occurring near the center of the merged objects. Merging occurs in
Event type                    Description                                              Parameters
Independent observations      Single trajectory with a single observation              σo
Independent false negatives   No observation recorded for an individual object         λn
Independent false positives   Spurious observation not directly related to an object   λp
Splitting                     Single object yields a pair of observations              λs
Merging                       May occur when objects are within close proximity        λm, αm

Table 6.1: Observation event types.
the joint model between two objects at distance d with probability αm exp(−d^4/λm).
Individual objects may be involved in multiple mergings at the same time.
6.3.3 State-Space Model
In the current application it is assumed that individual objects move according to
independent affine Gaussian stochastic processes, with drift vectors depending on an
estimate of the object’s velocity. For this model, it is possible to decompose the object
state at time t as ζt = (ut, vt), where ut, vt ∈ R^2 are the position and velocity vectors
respectively, and write the model as

u_t = u_{t−1} + v_{t−1} + N(0, σm^2 I)   (6.3)

v_t = α(u_t − u_{t−1}) + (1 − α) v_{t−1},   (6.4)
where σm is the movement dispersion parameter. The possibility that individual
objects can enter or leave the field of view is modeled according to a birth-death
process with rate parameters λb, λd (both birth and death rate uniform with respect
to space and time). This is a relatively simple movement model, but for the current
application it appears to suffice, as most of the complications and non-linearities come
from the observation process.
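One step of the forward model (6.3)-(6.4) is straightforward to write down; the smoothing weight α is not pinned down in the text, so the default value here is a placeholder.

```python
import random

def forward_step(u, v, alpha=0.5, sigma_m=1.4, rng=random):
    """One step of the affine Gaussian movement model:
    u_t = u_{t-1} + v_{t-1} + N(0, sigma_m^2 I), followed by the velocity
    smoothing update v_t = alpha*(u_t - u_{t-1}) + (1 - alpha)*v_{t-1}.
    u and v are (x, y) tuples; alpha is an assumed smoothing weight."""
    u_new = (u[0] + v[0] + rng.gauss(0, sigma_m),
             u[1] + v[1] + rng.gauss(0, sigma_m))
    v_new = (alpha * (u_new[0] - u[0]) + (1 - alpha) * v[0],
             alpha * (u_new[1] - u[1]) + (1 - alpha) * v[1])
    return u_new, v_new
```

With sigma_m = 0 the object simply drifts with its current velocity, which makes the model easy to sanity-check.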
6.3.4 Importance Distribution
A critical consideration is the choice of importance distribution. As discussed in §3.2,
the optimal importance distribution γ∗ samples xt based on the previous object state
and all current and future observations, i.e.

γ*_t(x) = π(xt|xt−1, y_{t:T}).   (6.5)
Since drawing from γ∗ is infeasible, the second best method practically available is
often to use the locally optimal importance distribution,
γt(x) = π(xt|xt−1, yt). (6.6)
In the current context it is generally impossible to sample from this importance
distribution directly, so instead the algorithm uses Markov chain Monte Carlo, in
particular the data augmentation algorithm of §2.2.4.
To sample approximately from γt, the event set Ψt is used as an auxiliary variable
for the data augmentation algorithm. The goal in this case is to draw a joint sample
according to π(xt,Ψt|xt−1, yt), even though the primary interest is in xt. The algo-
rithm does this by first generating a draw Ψt conditioned on xt, then xt conditioned
on Ψt, or
Ψt ∼ π(Ψt|xt, xt−1, yt) (6.7)
xt ∼ π(xt|Ψt, xt−1, yt), (6.8)
and repeating. As a starting point, xt is initialized with one forward step of the object
state-space model, i.e. xt is drawn according to π(xt|xt−1).
Algorithm 10 Sampling π(xt, Ψt|yt, xt−1) via data augmentation.
  Start with projected trajectory positions xt.
  repeat
    Sample π(Ψt|xt, yt) via Algorithm 12.
    Sample π(xt|Ψt, yt, xt−1) as described in §6.3.4.
  until converged
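The alternation in Algorithm 10 can be sketched as follows, with the two conditional samplers passed in as functions. Their signatures, and the fixed sweep count standing in for a convergence test, are assumptions for illustration.

```python
def data_augmentation(x_prev, y_t, sample_events, sample_states, n_sweeps=50):
    """Skeleton of Algorithm 10: alternate draws of the event set Psi_t
    given x_t, and of x_t given Psi_t, starting from one forward step of
    the movement model. `sample_events(x_t, y_t)` and
    `sample_states(psi_t, x_prev, y_t)` stand in for the conditional
    samplers of §6.3.4."""
    x_t = sample_states(None, x_prev, y_t)   # Psi=None: pure forward step
    psi_t = None
    for _ in range(n_sweeps):                # fixed sweep count stands in
        psi_t = sample_events(x_t, y_t)      # for a convergence check
        x_t = sample_states(psi_t, x_prev, y_t)
    return x_t, psi_t
```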
6.3.4.1 Sampling from State-Space Given Events
Sampling from π(xt|Ψt, xt−1, yt) is decomposed by event types. Note that at this
step, resampling xt is resampling the set of object positions, as object birth/death
movements are considered when sampling from events. The algorithm cycles through
each element of Ψt and performs an action based on the event type.

Independent normal observation Both the movement and observation processes
are Gaussian, so the object location can be modeled as bivariate Normal. This is due
to the following relationship.
π(xt|xt−1, yt) = π(yt|xt) π(xt|xt−1) / π(yt|xt−1)   (6.9)
              ∝ π(yt|xt) π(xt|xt−1)   (6.10)
When the event type is independent Gaussian movement and observation, both
π(yt|xt) and π(xt|xt−1) are Normal, and the product of the distributions will also
be Normal. Since both the observation and movement models assume independence
in the covariance structure of the 2D coordinate axes, the distribution
π(ζt|ζt−1, φt) can be written as the product of the distributions for each coordinate
axis. The coordinate-wise standard deviation and mean of the object position ut will
then be
σ_ut = sqrt( σm^2 σo^2 / (σm^2 + σo^2) )   (6.11)

µ_ut = ( (u_{t−1} + v_{t−1}) σo^2 + yt σm^2 ) / (σm^2 + σo^2),   (6.12)

and ut can be expressed as a N(µ_ut, σ_ut^2 I) random variable.
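The coordinate-wise product-of-Gaussians update (6.11)-(6.12) takes only a few lines; it combines the movement prior N(u_{t−1} + v_{t−1}, σm^2) with the observation likelihood N(y, σo^2).

```python
import math

def normal_update(u_prev, v_prev, y, sigma_m, sigma_o):
    """Posterior mean and per-coordinate standard deviation for an object
    with an independent Normal observation y, per (6.11)-(6.12).
    u_prev, v_prev, y are (x, y) tuples."""
    prior_mean = (u_prev[0] + v_prev[0], u_prev[1] + v_prev[1])
    s2 = sigma_m**2 + sigma_o**2
    sigma = math.sqrt(sigma_m**2 * sigma_o**2 / s2)
    # precision-weighted average of prior mean and observation
    mean = tuple((pm * sigma_o**2 + yi * sigma_m**2) / s2
                 for pm, yi in zip(prior_mean, y))
    return mean, sigma
```

With equal variances the posterior mean is the midpoint of the predicted position and the observation, as expected.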
Independent false negative In this case there is no observation information, so
the object state is updated according to the forward movement model (6.3).
Independent false positive Independent false positives are not related to any
objects, so no object states need to be updated.
Splitting Splitting is represented as an independent normal observation, using the
average of the two points as the observation center.
Merging/Splitting Given a series of mergings and splittings Ψ, associated subsets
of observations yt, and a subset of object points xt, it is possible to compute the
probability of the event set occurring as π(Ψ, yt|xt, xt−1). In order to sample from
π(xt|Ψ, yt), the algorithm samples from a Metropolis-Hastings independence chain
with proposal density Q(xt) = π(xt|xt−1). The acceptance probability of a new state
x′t will then be

a = π(Ψ, yt|x′t, xt−1) / π(Ψ, yt|xt, xt−1)   (6.13)
6.3.4.2 Sampling from Events Given State-Space
The second half of the data augmentation algorithm for sampling from the importance
distribution (6.6) is to sample the events Ψ given the state-space positions xt.
γ(Ψt|xt) = π(Ψt|xt, yt) (6.14)
To sample from γ(Ψt|xt) a Metropolis-Hastings Markov chain is used. This entails
drawing samples from a proposal chain κ, and accepting/rejecting them based on
their likelihoods relative to the current state.
The underlying Markov chain used in the current approach is based on sampling
from object/observation associations. The global version is as follows. Every object
is (independently) associated with an observation with probability as a decreasing
function of the distance, f(d(ζ, φ)). It is then possible to directly evaluate the likeli-
hood of this association. The algorithm samples from the target distribution using a
Metropolized independence chain. This global proposal distribution is referred to as
Figure 6.3: Association of objects with observations. ’Events’ correspond to connected components in this bipartite graph, including Normal observations, splitting, merging, false positives, false negatives, and joint events.
κ. For an edge set E, one can express κ as

κ(E) = ∏_{(ζ,φ)∈E} f(d(ζ, φ)) · ∏_{(ζ,φ)∉E} (1 − f(d(ζ, φ)))   (6.15)
f should be chosen to be close to the probability of an observation with position
φ being generated in an event involving an object at location ζ: for d(ζ, φ) < σo
association is almost certain, for d(ζ, φ) < 2σo it is quite likely, and for d(ζ, φ) > 3σo
it is improbable. The algorithm assumes f takes a parametric form such as
f(d) = α exp(−dp/β), (6.16)
where p and β are chosen based on σo, and 0 < α ≤ 1 is perhaps related to the false
positive and merging probabilities. An example would be to take α = .95, p = 4, and
β = −(3σo)^p / log .01, which gives the plot below.
[Plot: f(d) against d/σo, decreasing from α = .95 at d = 0 to about .01α at d = 3σo.]
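The example choice of constants can be checked numerically; this is a direct sketch of (6.16) with β set so that f(3σo) ≈ .01α.

```python
import math

def assoc_prob(d, sigma_o, alpha=0.95, p=4):
    """Association probability f(d) = alpha * exp(-d^p / beta) from (6.16),
    with beta = -(3*sigma_o)^p / log(.01) as in the example, so that
    f(3*sigma_o) = .01 * alpha."""
    beta = -(3 * sigma_o)**p / math.log(0.01)
    return alpha * math.exp(-d**p / beta)
```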
There are several local versions. One is to pick an object at random, resample
its associations, and accept according to a Metropolis step. This is an instance of
“single-particle Metropolis”. Another variation would be to resample the associations
for a subset of objects. The subset could be random, or could be taken as the k
nearest neighboring objects to a randomly chosen object or observation, or as all
objects within distance d of a randomly chosen object or observation; k or d could
be randomly chosen or set constant. This procedure will be ergodic if choices are
centered around objects (nonzero probability for every bipartite graph), and ergodic
for choices centered around observations if k or d are large enough that every
object is close enough to an observation to be resampled at some point. Refer to the
current edge set as E and the proposal edge set as E′.
The Metropolis acceptance value a is

a = (γ(E′)/γ(E)) · (K(E′, E)/K(E, E′))   (6.17)
  = (γ(E′)/γ(E)) · (κ(E)/κ(E′))   (6.18)
The ratio can be directly computed as

κ(E)/κ(E′) = ∏_{(ζ,φ)∈E, ∉E′} f(d(ζ, φ)) / (1 − f(d(ζ, φ))) · ∏_{(ζ,φ)∈E′, ∉E} (1 − f(d(ζ, φ))) / f(d(ζ, φ)),   (6.19)

and similarly for the ratio γ(E′)/γ(E).   (6.20)
Algorithm 11 Sampling proposal event set Ψ via association distribution κ.
  Form bipartite association probability graph A(xt, yt) using association probability function f.
  Draw edges E independently with probability proportional to edge weights.
  Find connected components of E, forming event set Ψ.
  Return Ψ.
Each edge set E defines an event set Ψ by the connected components in the
association graph of the edge set. However, this relationship is not injective, as
the same event set Ψ may result from multiple different edge sets, since connected
components for joint events may be defined in multiple ways. Then
κ(Ψ) = κ(E) / κ(E|Ψ)   (6.21)
     = Σ_{E′ ↦ Ψ} κ(E′)   (6.22)
     = E[κ(E) 1_{E ↦ Ψ}]   (6.23)
This can be decomposed by event components. For each ψ ∈ Ψ, define Eψ to be the
associated set of edges generated by κ for the objects and observations associated with
ψ. Then κ(Ψ) can be decomposed as the probability that there are no edges between
objects and observations not in the same events, multiplied by the probability of
connectivity within event components.
κ(Ψ) = κ(E \ ∪_{ψ∈Ψ} Eψ = ∅) · ∏_{ψ∈Ψ} κ(Eψ ↦ ψ)   (6.24)
To compute the first term on the RHS of (6.24), the probability that there are no
cross-component edges, simply compute
κ(E \ ∪_{ψ∈Ψ} Eψ = ∅) = ∏_{(ζ,φ)∉Ψ} (1 − κ((ζ, φ))),   (6.25)
i.e. the product of one minus the probability of each non-component edge occurring.
Since the association probability graph is effectively sparse (the probability of an edge
between an object and an observation is effectively 0 if their distance exceeds 4σo),
this probability can be computed quickly. To compute the individual component
probabilities κ(Eψ ↦ ψ), for small numbers of objects plus observations it is feasible
to perform the calculations directly through enumeration. For large joint distributions,
the algorithm estimates this probability through Monte Carlo estimation of

κ(Eψ ↦ ψ) = E[κ(Eψ) 1_{Eψ ↦ ψ}]   (6.26)
To compute the probability of an event set Ψt under γ, note that
γ(Ψt|xt) = π(Ψt|xt, yt)   (6.27)
         = π(Ψt, xt, yt) / π(xt, yt)   (6.28)
         = π(yt|xt, Ψt) π(Ψt|xt) π(xt) / π(xt, yt)   (6.29)
         ∝ π(yt|xt, Ψt) π(Ψt|xt)   (6.30)
Both of these probability distributions, π(yt|xt, Ψt) and π(Ψt|xt), are defined by the
forward observation model. The first term can be decomposed as the product of the
individual event probabilities. This can be written as

π(yt|xt, Ψt) = ∏_{ψ∈Ψt} π(yt,ψ|xt,ψ, ψ),   (6.31)
where yt,ψ and xt,ψ are the subsets of yt and xt associated with event ψ.
For the second term on the RHS of (6.30), π(Ψt|xt), recall that the ’event’ random
variable corresponds to a connected component in the association graph but doesn’t
carry any positional information. So this is the probability that the set of trajectories
xt,ψ produced a (joint) event, times the probability that these events produced the
correct number of observations |yt,ψ| given the object associations. The joint event-set
probability can be written as
π(Ψt|xt) = π(Ψ_{yt}, Ψ_{xt}|xt)   (6.32)
         = π(Ψ_{yt}|Ψ_{xt}, xt) π(Ψ_{xt}|xt)   (6.33)
To compute π(Ψxt|xt), the ’object association probability’ for the joint forward model
is used to compute the probability of each component being connected times the
probability of there being no cross-component edges, or
π(Ψ_{xt}|xt) = ( ∏_{ψ∈Ψ} π(ψ|xt,ψ) ) · ( ∏_{(φ,φ′)∉Ψx} (1 − π((φ, φ′)|xt)) )   (6.34)
Combining these yields
γ(Ψ|xt) ∝ ( ∏_{ψ∈Ψ} π(yt,ψ, ψ|xt,ψ) ) · ( ∏_{(φ,φ′)∉Ψx} (1 − π((φ, φ′)|xt)) ),   (6.35)
where π(yt,ψ, ψ|xt,ψ) can be broken down as
π(yt,ψ, ψ|xt,ψ) = π(yt,ψ|xt,ψ, ψ)π(ψy|ψx, xt,ψ)π(ψx|xt,ψ) (6.36)
Algorithm 12 Metropolis sampling of π(Ψt|xt, yt).
  Start with event set Ψ.
  repeat
    Ψ′ ← Ψ.
    Pick a random subset of objects S ⊂ xt, using local or global criteria.
    Remove events ψ ∈ Ψ′ associated with S.
    Randomly sample new events for S according to κ, and add them to Ψ′.
    Compute acceptance probability a(Ψ, Ψ′) via (6.18).
    With probability min(1, a), set Ψ ← Ψ′.
  until converged
6.3.5 Computing Relative and Marginal Importance Weights
Another main element of the importance sampling algorithm is the computation of
importance weights. It is of interest to compute both the relative importance weights
for an entire particle, and the marginal importance weights for grouping subsets.
6.3.5.1 Computing Particle Relative Importance Weights
The importance weight contribution from time t will be

Wt(x) = πt(x) / γt(x).   (6.37)
Since γt(x) ∝ P(xt|xt−1, yt), the contribution will be proportional to

wt(x) = P(yt|xt−1)   (6.38)

This can be computed via Monte Carlo as

P(yt|xt−1) = E_{Xt|xt−1}[P(yt|Xt)],   (6.39)
where xt is drawn from P(xt|xt−1). However, this estimator may have high variance,
since for any given xt, it may be unlikely to generate a given yt. From the sampling
step there are approximate samples from P(xt,Ψt|xt−1, yt), which is actually the op-
timal importance distribution for estimating (6.39). However, this probability is only
known up to a normalizing constant (the normalizing constant is in fact P(yt|xt−1)).
One way of computing this normalizing constant is to observe the following:

P(yt|xt−1) = P(yt|xt, Ψt) P(xt, Ψt|xt−1) / P(xt, Ψt|yt, xt−1)   (6.40)

P(xt, Ψt|yt, xt−1) / P(yt|xt, Ψt) = P(xt, Ψt|xt−1) / P(yt|xt−1)   (6.41)

Integrating with respect to xt, Ψt gives

1 / P(yt|xt−1) = E_{Xt,Ψt|yt,xt−1}[ 1 / P(yt|Xt, Ψt) ]   (6.42)

P(yt|xt−1) = ( E_{Xt,Ψt|yt,xt−1}[ P(yt|Xt, Ψt)^{−1} ] )^{−1}.   (6.43)
Also note that computing P(yt|xt, Ψt) is straightforward: it is just the product of
the observation probabilities for each event ψ ∈ Ψt, which are being computed
anyway to generate the joint samples. This means a Monte Carlo estimate based on
E_{Xt,Ψt|yt,xt−1}[P(yt|Xt, Ψt)^{−1}] can be used to estimate P(yt|xt−1), and this
estimate can be built concurrently while building the joint sample Xt, Ψt|yt, xt−1.
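The resulting estimator (6.43) is just a harmonic mean of the per-sample observation likelihoods; a minimal sketch (the input representation, a list of already-computed likelihoods P(yt|xt, Ψt), is an assumption):

```python
def estimate_obs_likelihood(samples):
    """Estimate P(y_t | x_{t-1}) via (6.43): the harmonic mean of
    P(y_t | x_t, Psi_t) over joint samples (x_t, Psi_t) drawn
    approximately from P(x_t, Psi_t | y_t, x_{t-1}). `samples` is the
    list of those per-sample observation likelihoods."""
    inv_sum = sum(1.0 / p for p in samples)
    return len(samples) / inv_sum
```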
6.3.5.2 Computing Marginal Importance Weights
In order to perform resampling, it is necessary to estimate the marginal importance
weights,
W^{(i)}_j = π(x^{(i)}_j | x^{[−j]}) / γ(x^{(i)}_j)   (6.44)
This computation depends on the choice of “coordinate index” j, which corresponds
to a partition of the state space as discussed in §2.2.5. To choose this partition,
the extended state-space (xt, Ψt) that includes the event set Ψt will be used.
There are several different possible ways to choose partitions, as discussed in §6.2.1.
The main purpose of conditional resampling is to allow for ’revisions’ across multiple
time-steps by ’grafting’ sub-trajectories from one particle onto another. This is nec-
essary because events represent realizations of ’decision points’, and decisions that
are plausible at timestep t may be much less plausible compared to alternatives when
taking into account future observations.
As previously noted, there are many possible grouping functions that may be
chosen to which to apply conditional resampling. In the current application, tra-
jectory groupings are formed based on well-separated observations. Recall that it
is possible to choose grouping functions arbitrarily based on yt without introducing
bias. The goal in this case is to form groups of individual trajectories based on single
observations. The algorithm then keeps this grouping when/if the set of trajectories
enters a joint merging/splitting setting, and decides to discard the grouping based on
aggregate properties of the group. To resample across the grouping, positions from
one trajectory are ’grafted’ onto another using the last k time-steps, the resulting
marginal importance weights are estimated, and resampling is performed based on
these weights.
Algorithm 13 Conditional sampling importance resampling for particles.
  Determine which grouping sets to resample, forming new groupings at each observation and pruning old groupings using criteria such as age and CESS.
  Resample particles using SIR to get x_t^{(1)}, . . . , x_t^{(N)}.
  for each grouping j do
    for i in 1, . . . , N do
      Compute marginal importance weights using (6.44).
      Choose grouping k with probability proportional to the marginal importance weights.
      Graft grouping Gj(x_t^{(k)}) onto x^{(i)} according to §6.3.5.2.
    end for
  end for
6.4 Empirical Results
6.4.1 Simulated Data
For the previous example, the “true” hidden state of the objects is unknown, as
the example comes from real-world data. In this example, samples from the hidden
state-space are simulated using the Gaussian forward model (6.3), and observations
Algorithm 14 Multi-object particle tracking for Harvester ants.
  Initialize particles x_1^{(1)}, . . . , x_1^{(N)} based on the prior distribution/first observations y1.
  Compute initial importance weights w_1^{(1)}, . . . , w_1^{(N)}.
  for t in 2 . . . T do
    Advance each particle by sampling (x_t^{(i)}, Ψ_t^{(i)}) ∼ γt via Algorithm 10.
    Estimate importance weights for each particle via (6.43) using samples from the previous step.
    Apply conditional SIR to construct new particles via Algorithm 13.
  end for
are drawn according to the observation model (§6.3.2). Choosing samples from this
known, tractable model allows us to directly evaluate results from the particle tracking
algorithm.
The simulated data starts with 100 objects distributed uniformly at random in
a 500 by 500 grid. The simulation then proceeds by moving the objects according
to the forward model, with movement standard deviation parameter σm = 2. The
observation model is simulated with observation standard deviation parameter σo =
1.5, split probability parameter λs = .006 for each object at each time-step, and
with a merging probability of λm = .45 (decreasing with distance). The results for
one state-space sample with path lengths and objects per frame are summarized in
Figure (6.4), and “observed” samples with the number of observations per frame are
summarized in Figure (6.5).
We then ran this example using particle tracking, first using a sample from the
importance distribution (6.6), then using CSIR (6.7). As in the previous example,
CSIR produces much longer path lengths, closely matching the “true” number of
paths that last the entire simulation.
6.4.2 Short Harvester Ant Video
As a first example of the tracking algorithm, we used GemVident to find centroids
for a video file that has 753 frames. A screenshot of the centroid finding results
for a single frame from GemVident is shown in Figure 6.8. The centroid finding
algorithm for this video resulted in an average of 109 observations per frame, with
Figure 6.4: “True” distribution of path lengths and trajectories per frame, simulatedexample.
Figure 6.5: Centroid observations per frame, simulated example.
the distribution of observations shown in Figure 6.9.
For this example, we first drew a sample from the importance distribution and
inspected the quality of the sample. A video file of a draw from the importance distri-
bution can be found here: http://www.stanford.edu/~guetz/particleTracking/
noCSIR.mp4. The blue dots represent trackings, the red dots are the centroids from
GemVident. Note that the trackings tend to follow the observations for a short
number of time-steps before getting lost. One can see this directly by looking at a
histogram of the trajectory lengths over the sample (Figure 6.10).
We then applied the CSIR particle filter to this example using 20 particles. A video
file of the CSIR tracking can be found here: http://www.stanford.edu/~guetz/
Figure 6.6: Distribution of path lengths and trajectories per frame using a samplefrom the importance distribution, simulated example.
Figure 6.7: Distribution of path lengths and trajectories per frame using CSIR, sim-ulated example.
particleTracking/CSIR.mp4. The tracking is markedly better. Looking at the trajectory
lengths (Figure 6.11), note that a large percentage of the trajectories last the
entire simulation. Also note that the number of trajectories per frame matches the
number of observations much more closely for both the sample from the importance
distribution and CSIR.
This example was repeated for N = 5, 20, 40 particles: N = 5 took 32 seconds,
N = 20 took 126 seconds, and N = 40 took 253 seconds. Note that the increase is
essentially linear in the number of particles. This is because most grouping
subsets are well separated from each other, so it is possible to use a single marginal
importance weight computation for most particle/grouping subset pairs instead
of the more complicated O(N^2) version.
Figure 6.8: GemVident screenshot, showing centroids.
Figure 6.9: Centroid observations per frame from Harvester ant example.
Figure 6.10: Distribution of path lengths and trajectories per frame using a samplefrom the importance distribution, Harvester ant example.
Figure 6.11: Distribution of path lengths and trajectories per frame using CSIR,Harvester ant example.
Chapter 7
Network Growth Models
In recent years there has been much interest in the study of generative models that
explain observed properties of networks derived from biology, sociology, and computer
science. The next two chapters discuss Monte Carlo methods for performing inference
for network data structures. Among the most widely researched models are
network growth models such as preferential attachment [8] and duplication/divergence
(also known as vertex copying) [59, 25]. One reason for the focus on dynamic models
of network growth is that they allow researchers to understand how networks develop
over time, with relatively simple rules resulting in surprising complexity. For a survey
of the history and applications of network growth models see Newman [78], and
Durrett [32] for a more recent book-length treatment.
The primary motivations for studying generative models of network growth are to
help explain, understand, and predict observed network data structures and features.
For example, many real-world networks have degree sequences that appear to follow
heavy-tailed distributions, such as the so-called power-law distributions, where the
probability of a vertex having degree k is proportional to k^{−α} for α > 0. The classical
models of random networks, such as the Erdos-Renyi G(n, p) model in which edges
are placed between each pair of vertices according to independent Bernoulli trials
with probability p, have asymptotically Poisson degree distributions, making them
unsuitable for modeling networks with heavy-tailed degree distributions. In contrast,
many network growth models such as preferential attachment and vertex copying
produce asymptotically power-law networks.
7.1 Background
Many realistic models of network formation, whether of sociological, biological,
or computer networks, involve growth from some smaller “seed” network. A
network growth model is a stochastic process through which a network G is formed by
adding vertices to a network one by one, with edges created between the new vertex
and pre-existing vertices. Well known examples of network growth models include so
called preferential attachment and vertex duplication models.
In the current context, network growth models are considered. Unless otherwise
stated, graphs are assumed to be simple, i.e. they are undirected, have no self-
loops, no multiple edges, and no edge weights. Furthermore, the
models used are assumed to be strict growth models, i.e. removal of vertices or edges
is not allowed, edges cannot be “rewired” once they appear, and new edges always
have one endpoint in the newly added vertex. The advantage of modeling networks
through strict growth mechanisms as opposed to a more general model that might
allow edge and vertex deletions or rewirings is that they are easier to analyze and
simulate.
One issue with many commonly used network growth models is that they are
often formulated in a way that is not statistically well-defined, with most real-world
networks having zero measure. For example, the original Barabasi-Albert preferential
attachment model adds a deterministic number of edges k at each time-step; if a
network does not have kn edges, then it has zero likelihood. Similar complaints can
be made about many other common models, such as the vertex copying and forest
fire models. The primary reason these models are not well-defined statistically is
that they were formulated in order to make theoretical analysis of their aggregate
properties easier, such as showing they give power-law degree distributions or are well-
connected. In the current context, however, it is preferable to use models that are
statistically well-defined. There have been several attempts to make network growth
models more statistically robust; see for example Leskovec et al. [65] and Sheridan
et al. [90].
Given the order in which nodes arrived in an observed network, one can compute
the likelihood of the network under a growth model by multiplying the likelihood
contribution of each vertex as it enters the network. In other words, to compute the
likelihood of a graph G given an ordering µ and model parameters θ, one simply forms
the product
W(G|θ, µ) = ∏_{i=1}^{n} P(x_{µ(i)} | x_{µ(1)}, . . . , x_{µ(i−1)}),   (7.1)
where xµ(i) is the event of the µ(i)th vertex entering G along with its corresponding
edges. Denote W(G|θ, µ) the order likelihood of G.
This representation immediately yields a mechanism for computing the full likelihood
of G by summing the order likelihoods (7.1) over Πn, the space of all possible
orderings of length n:

W(G|θ) = Σ_{µ∈Πn} W(G|θ, µ) P(µ).   (7.2)
P(µ) is the prior distribution over the space of permutations Πn. In this chapter P(µ)
is assumed to be uniform over the space of permutations, but in practice one may
have prior knowledge about the order in which vertices entered the network and can
construct a different prior distribution.
Since there are n! possible permutations of length n, brute force enumeration
is infeasible except for the smallest graphs. A natural idea is thus to construct an
unbiased Monte Carlo estimator Ŵ(G|θ) of (7.2) by drawing N iid samples µ1, . . . , µN
distributed according to P(µ):

Ŵ(G|θ) = (1/N) Σ_{i=1}^{N} W(G|θ, µi).   (7.3)
In general the estimator Ŵ may have high variance, making the Monte Carlo
estimator little better than brute force. Fortunately, one can usually reduce the
variance of Ŵ by using the importance sampling techniques described in §2.1.
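Under a uniform prior, the plain estimator (7.3) takes only a few lines; the order-likelihood evaluator implementing (7.1) is assumed to be supplied by the growth model.

```python
import random

def estimate_graph_likelihood(order_likelihood, n, N=1000, rng=random):
    """Plain Monte Carlo estimate of W(G|theta) in (7.2) under a uniform
    prior on orderings: average W(G|theta, mu) over N uniformly drawn
    permutations mu of 0..n-1. `order_likelihood(mu)` evaluates (7.1)."""
    total = 0.0
    for _ in range(N):
        mu = list(range(n))
        rng.shuffle(mu)           # uniform random vertex ordering
        total += order_likelihood(mu)
    return total / N
```

As the text notes, this plain estimator can have high variance; importance sampling over orderings (§2.1) replaces the uniform draw and reweights accordingly.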
7.1.1 Erdos-Renyi
Perhaps the simplest model of network growth is uniform attachment (UA), where
edges are attached from a new vertex to pre-existing vertices independently at
random with uniform constant probability p. This process is equivalent to the well-
studied Erdos-Renyi random graph model G(n, p). The G(n, p) and closely related
G(n,m) models were introduced by Gilbert [41] in 1959 and subsequently analyzed
by Erdos and Renyi [34]. For comprehensive book-length treatments of G(n, p) and
G(n,m) see Bollobas [16], Janson et al. [50] and also Durrett [32].
The advantage of the G(n, p) model lies in the simplicity of analysis. Vertex degrees are
distributed according to a Binomial Bin(n−1, p) distribution. Since edges are added
independently according to a Bernoulli with parameter p, the number of edges m in
G is a sufficient statistic. The likelihood of G given parameter p is simply
W(G|p) = P[ Bin( (n choose 2), p ) = m ],   (7.4)

since there are (n choose 2) possible independently chosen edges.
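The likelihood (7.4) can be evaluated directly as a Binomial point mass:

```python
import math

def gnp_likelihood(n, m, p):
    """Likelihood (7.4) of a simple graph with n vertices and m edges under
    uniform attachment / G(n, p): the Binomial(C(n,2), p) probability mass
    at m, since m is a sufficient statistic."""
    trials = math.comb(n, 2)
    return math.comb(trials, m) * p**m * (1 - p)**(trials - m)
```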
Many interesting properties have been demonstrated about random graphs, such
as locally tree-like structure, low diameter and high girth. They were famously used
by Erdos as one of the first applications of the probabilistic method in combinatorics
and graph theory. They are, however, restricted in their usefulness as statistical
models for many real-world networks, as many of these networks have “heavy-tailed”
degree distributions, are highly clustered, or have hierarchical structure that is very
rare in the G(n, p) model.
7.1.2 Preferential Attachment
The central concept of preferential attachment (PA) models is that the “rich get
richer”; a new edge is more likely to be added to a vertex if it is already incident to
many edges. The preferential attachment family of network growth models has a long
history, beginning with Yule [105] in 1924 to explain evolutionary speciation events,
then later by Simon [92] in 1955 to explain the distribution of word frequencies in
text documents. Price [82] in 1976 was one of the first to use these models to describe
“power-law” degree distributions in networks.
The Yule-Simon and Price models can be shown to be equivalent to the Polya
urn model [81], in which a vertex corresponds to a “red” ball, all other vertices are
“blue” balls. At each step, choose a ball uniformly at random and replace it with
2 balls of the same color. The proportion of “red” balls can be shown to be a
martingale whose limiting value follows a Beta(r, b) distribution, where
r is the initial number of “red” balls and b is the initial number of “blue” balls. In
the corresponding models of network growth, this means that older vertices tend to
be incident to a significant proportion of the edges in the graph.
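The urn dynamics are straightforward to simulate. The sketch below, with illustrative parameters, checks that the red proportion behaves as a martingale with mean r/(r + b):

```python
import random

def polya_urn(red: int, blue: int, steps: int, rng: random.Random) -> float:
    """Polya urn: repeatedly draw a ball uniformly at random and
    replace it with two balls of the same color; return the final
    proportion of red balls."""
    for _ in range(steps):
        if rng.random() < red / (red + blue):
            red += 1
        else:
            blue += 1
    return red / (red + blue)

rng = random.Random(0)
# With r = b = 1 the limiting proportion is Beta(1, 1), i.e. uniform,
# but the martingale property keeps the mean at r/(r+b) = 1/2.
props = [polya_urn(1, 1, 2000, rng) for _ in range(500)]
mean_prop = sum(props) / len(props)
```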
This family of models was rediscovered when a closely related model was proposed
by Barabasi and Albert [8] in 1999 to explain “scale-free” degree distributions in the
world wide web graph and other networks. This sparked a greatly renewed interest in
preferential attachment models. In the model of Barabasi and Albert, a constant m
number of edges are added to the graph at each step, each with one endpoint in the
newly added vertex and the other to a vertex chosen with probability proportional
to its degree. They were able to demonstrate experimentally that this model
generates a power-law degree distribution with exponent 3, which was later proven
rigorously by Bollobas et al. [17]. Many different generalizations of the Barabasi-Albert model have been
proposed. The Barabasi-Albert model is an example of linear preferential attachment
because edges are attached with probability linear in the vertex degree. Krapivsky
and Redner [62] have analyzed a model in which a different exponent on vertex degree
can be chosen. These models are known as super- or sub-linear attachment depending
on the exponent. Additionally, Krapivsky and Redner [62] discuss a
model in which the probability of attachment to a vertex is proportional to the degree
plus some constant α, a “hybrid” between pure preferential attachment and uniform
attachment. Another class of models [29] allows for the rate at which new edges are
added at each step to increase over time.
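The Barabasi-Albert rule can be sketched in a few lines. This is a minimal illustration, with multi-edges allowed as in the original formulation; the two-vertex seed edge is an illustrative choice:

```python
import random

def barabasi_albert(n: int, m: int, rng: random.Random):
    """Linear preferential attachment: each arriving vertex sends m
    edges to existing vertices chosen with probability proportional
    to their current degree."""
    edges = [(0, 1)]                      # small seed so degrees are nonzero
    degree = {0: 1, 1: 1}
    for v in range(2, n):
        targets = rng.choices(list(degree), weights=list(degree.values()), k=m)
        for t in targets:                 # with replacement: multi-edges allowed
            edges.append((v, t))
            degree[t] += 1
        degree[v] = m
    return edges, degree

edges, degree = barabasi_albert(200, 2, random.Random(1))
```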
7.1.3 Duplication/Divergence
The duplication/divergence (DD), or vertex copying, family of models was motivated
by networks whose degree distributions followed a power-law, but in which a vertex's
degree did not correlate with the order in which it was added to the network.
The idea is that when a new vertex enters the network, its edge list is copied from
the edge list of a pre-existing vertex. This can be seen to result in a power-law
vertex degree distribution similar to the preferential attachment models, because it
also results in a “rich get richer” phenomenon in vertex degrees; a vertex with higher
degree is more likely to have a neighbor “copied” and is therefore more likely to have
new edges attached. The “divergence” part of the model indicates that a new vertex
may not be an exact copy of its ancestor and may lose some of its edges. Additionally,
an edge may be placed between the new vertex and the vertex it was copied from.
The first models of this type proposed specifically for network growth were
suggested by Kleinberg et al. [59] as a model of web growth, but duplication-
divergence has gained perhaps the most traction as a model of cellular networks, where
“gene duplication events” have long been part of the literature [12, 102, 71, 7, 5]. See
Chung et al. [25] for a discussion of mathematical properties of duplication-divergence
models and Newman [78] for a brief survey.
7.2 Computing Likelihoods with Adaptive Importance Sampling
In this section, I investigate a technique for estimating likelihood functions for
network growth models using adaptive importance sampling based on the cross-entropy
method (§4.1.2). A well-known classical model of rankings, the Plackett-Luce model,
serves as the importance distribution, and I introduce a penalty term for the
cross-entropy method, inspired by the minimum description length principle [43],
that significantly improves the efficiency and robustness of the algorithm.
In general, focusing on any one particular statistic can be problematic when at-
tempting to choose an appropriate model, unless it is a sufficient statistic. Degree
distribution is just one among many possible descriptive statistics of networks that
may be used to determine “goodness of fit”. One could use betweenness-centrality,
subgraph counts (“network motifs”), conductance, expansion, chromatic number,
number of Hamiltonian cycles, etc., as criteria to find the best network model. It
is usually not clear which descriptive statistics are most important for fitting network
models outside of the context of relatively simple network models such as G(n, p) and
exponential family models (also known as p∗ models) where the sufficient statistics
are well defined. Several recent papers have made attempts at using non-sufficient
statistics for dealing with complicated network models. Notable examples include
Middendorf et al. [74], who use subgraph counts to estimate the relative suitability
of network models using machine learning (boosting) techniques, and Ratmann et al.
[83], who formalize the use of non-sufficient statistics as approximate Bayesian com-
puting, also known as likelihood-free inference, and apply the technique to network
growth models. Alternatively, one could attempt to directly estimate the likelihood
function of a model in order to assess its appropriateness. Building an approximation
to the likelihood allows one to apply the well-established model selection frameworks
commonly used in statistics and information theory, such as Bayes factors, minimum
description length (MDL), and the Akaike information criterion (AIC).
Recently there have been several different Monte Carlo based approaches to esti-
mating the likelihood of observed network data under network growth models. These
include Leskovec and Faloutsos [64], who use a Metropolis-Hastings chain to estimate
parameters for stochastic Kronecker product graph models, which I discuss in §8,
and Wiuf et al. [104], who use permutation likelihoods with a sequential importance
sampling scheme to perform inference on a variant of duplication/divergence mod-
els. Another interesting Markov chain Monte Carlo approach was taken by Bezakova
et al. [13] to investigate the preferential attachment model.
Informally, network growth models are defined by the following process. Starting
from an empty network, new vertices are added to the network sequentially. As each
vertex enters the network, edges are added connecting the new vertex to some subset
of preexisting vertices chosen according to some randomized rule. For example, in
preferential attachment a connecting vertex is chosen with probability proportional
to its degree. It is assumed that for network growth models, edges and vertices
can only arise in this manner, i.e. that new edges are always connected to the most
recently added vertex and that neither edges nor nodes may be deleted. Note that this
process implicitly produces labeled graphs, with labels assigned according to the order
in which a vertex enters the network. It will also be assumed that the networks are
simple, i.e. that each edge connects exactly one distinct, unordered pair of vertices,
that there are no multiple edges, and no self-loops.
The primary advantage of constraining network growth models to only permit
adding vertices and edges in this sequential manner is that if one is given the order in
which vertices appear, it is straightforward to compute the likelihood that an observed
network was produced under the model. The main difficulty then becomes that the
chronology of vertex generation is often unknown, uncertain, or undefined. In order to
compute the likelihood of a particular unlabeled network with no a priori information
on labelings, one should sum the likelihoods over all possible labelings. Since there
are a factorial number of possible labelings, this approach becomes infeasible for all
but the smallest networks. This naturally leads to the idea of constructing a Monte
Carlo estimator by sampling from the space of labelings.
One can build a crude Monte Carlo estimator by simply drawing labelings (permu-
tations of vertices) uniformly at random, computing the likelihood of each permuta-
tion, then returning the mean value of the computed likelihoods. The biggest problem
with this estimator is that it will in general have very high variance. Typically there
is a small subset of permutations that have high relative likelihood under the network
growth model compared to the majority of permutations. In other words, there is a
rare event set of permutations with disproportionately high weights. To get a lower
variance estimator, one could attempt to use importance sampling.
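To make the factorial blow-up and the variance problem concrete, here is a toy comparison. The score function is an illustrative stand-in for the order likelihood P(G|H, σ), not a real growth model:

```python
import itertools
import math
import random

def order_score(sigma) -> float:
    """Toy stand-in for the order likelihood P(G | H, sigma): sharply
    peaked on orderings near the identity, mimicking the rare set of
    high-likelihood labelings."""
    displacement = sum(abs(i - s) for i, s in enumerate(sigma))
    return math.exp(-displacement)

n = 6
perms = list(itertools.permutations(range(n)))
exact = sum(order_score(s) for s in perms) / len(perms)    # exact sum: n! terms

rng = random.Random(0)
draws = [order_score(rng.sample(range(n), n)) for _ in range(2000)]
crude = sum(draws) / len(draws)                            # crude Monte Carlo
```

Even at n = 6 the exact sum already requires 720 terms, and the crude estimate, while unbiased, has its variance driven by the few near-identity permutations, which is precisely the rare-event problem importance sampling addresses.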
The approach explored in this chapter is to recursively build an importance dis-
tribution using adaptive importance sampling. A key ingredient to any importance
sampling scheme is the choice of importance distribution. Adaptive importance sam-
pling iteratively attempts to find the “best fitting” importance distribution within a
restricted family of distributions. For details on the adaptive importance sampling
algorithm see §4. One common problem with adaptive importance sampling
schemes is that they tend to quickly lead to degenerate probability distributions [6].
In order to compensate for this effect, a penalty term based on ideas from the mini-
mum description length principle is introduced in §4.2.
The use of the Plackett-Luce model as an importance distribution for network
growth models is investigated in §7.2.2. In §7.3 this algorithm is tested using a
modified preferential attachment model and applied to an example protein-protein
interaction network from a publicly available database. These results are compared
to those from an implementation of annealed importance sampling (see §4.3.1).
7.2.1 Marginalizing Vertex Ordering
Given a network G and a network growth model H, one may wish to compute the
conditional probability of G given H, p(G|H), also known as the likelihood function
of G evaluated at H. The likelihood of G under growth model H with labeling σ,
P(G|H, σ), is referred to as the order likelihood. Imposing a mixture model (or prior
distribution) over the space of permutations and applying the law of total probability
gives

$$P(G \mid H) = \sum_{\sigma \in S_n} P(G \mid H, \sigma)\, P(\sigma), \qquad (7.5)$$
where Sn is the set of all permutations on n vertices, and P(σ) is the mixture or prior
distribution over Sn. In the examples considered, it will usually be assumed that one
has no prior information about the order in which vertices appear and so P(σ) = 1/n!
for all σ.
Since the order likelihood is the function whose expectation is to be estimated,
denote P(G|H, σ) ≡ h(σ). One can then write
$$P(G \mid H) = \mathbb{E}[h(\sigma)], \qquad (7.6)$$
where σ is distributed according to P(σ), and one can use this to build a Monte Carlo
estimator.
7.2.2 Plackett-Luce Model as an Importance Distribution
For importance sampling of distributions whose state space is the space of permutations,
one can make use of the first-order Plackett-Luce model of rankings. The Plackett-Luce
model is a well-known distribution for rank data, originally developed primarily in
the psychometric literature. Commonly referred to as a vase or urn model, it is well
known to generate draws from the stationary distribution of the weighted random-
to-top Markov chain, also known as the “Tsetlin Library”. For more information on
Plackett-Luce and other models of rank see Marden [72] and Diaconis [26].
In the Plackett-Luce model on n labels, each vertex j is assigned a weight
$w_j \in \mathbb{R}_+$. The procedure to draw a sample from Plackett-Luce is essentially “draw n
labels without replacement with probability proportional to weights”. In the “urn”
interpretation, picture an urn filled with balls of different sizes, and suppose one takes
one ball out of the urn at a time until the urn is empty. If the probability that one
chooses a given ball over another is proportional to its volume, then this is precisely
the first order Plackett-Luce model. The generation procedure is stated formally in
Algorithm 15.
Algorithm 15 Random labeling from Plackett-Luce model

  Let U ← V be the set of unpicked labels.
  Set σ ← ∅.
  for i = 1 to |V| do
    Pick a label j from U with probability $w_j / \sum_{u \in U} w_u$.
    Add j to the end of σ.
    Remove j from U.
  end for
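A direct transcription of Algorithm 15 (the function name is mine):

```python
import random

def plackett_luce_sample(weights, rng: random.Random):
    """Algorithm 15: repeatedly pick an unpicked label j with
    probability w_j over the total weight of unpicked labels, append
    it to sigma, and remove it from the pool."""
    unpicked = list(range(len(weights)))
    sigma = []
    while unpicked:
        j = rng.choices(unpicked, weights=[weights[u] for u in unpicked], k=1)[0]
        sigma.append(j)
        unpicked.remove(j)
    return sigma

rng = random.Random(42)
# Label 0 has weight 5 out of a total 8, so it should come first
# roughly 5/8 of the time.
first = sum(plackett_luce_sample([5.0, 1.0, 1.0, 1.0], rng)[0] == 0
            for _ in range(2000)) / 2000
```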
One of the most attractive qualities of the Plackett-Luce model is that its likelihood
function is very simple to compute. Let θi = log wi. Then the log-likelihood of
weights θ given a sample ordering σ is given by

$$l(\theta \mid \sigma) = \sum_{i=1}^{n} \theta_{\sigma(i)} - \sum_{i=1}^{n} \log\!\left(\sum_{j=i}^{n} \exp(\theta_{\sigma(j)})\right). \qquad (7.7)$$

The log-likelihood of a collection of samples σ1, . . . , σN is then

$$l\left(\theta \mid (\sigma_i)_{1 \le i \le N}\right) = \sum_{i=1}^{N} l(\theta \mid \sigma_i). \qquad (7.8)$$
Another attractive feature of Plackett-Luce is that under mild assumptions, the
log-likelihood function (equation (7.8)) is strictly concave, and one can maximize it
numerically using an efficient iterative procedure known as a minorization-maximization
algorithm (a generalization of expectation-maximization) [49]. Since (4.9) is a linear
combination of the concave terms of (7.8) with positive weights, (4.9) is also concave
under the same regularity conditions and one can solve it using the same algorithm.
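Equation (7.7) can be evaluated stably with a log-sum-exp over the remaining items at each position. A minimal sketch, with the function name my own:

```python
import math

def pl_loglik(theta, sigma) -> float:
    """Plackett-Luce log-likelihood (7.7) of log-weights theta for one
    ordering sigma, computed with a stable log-sum-exp at each step."""
    n = len(sigma)
    ll = 0.0
    for i in range(n):
        tail = [theta[sigma[j]] for j in range(i, n)]   # items not yet placed
        mx = max(tail)
        ll += theta[sigma[i]] - (mx + math.log(sum(math.exp(t - mx) for t in tail)))
    return ll
```

With equal weights every ordering has likelihood 1/n!, so pl_loglik returns −log n!; the collection log-likelihood (7.8) is simply the sum of per-sample terms.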
7.2.3 Choice of Description Length Function
If one chooses the description length L(f(·; vt)) to be the Shannon entropy of the
distribution f(·; vt), one can estimate L(f(·; vt)) with crude Monte Carlo for a
particular set of parameters. One can then either attempt to maximize the adjusted
minimum description length via a convex optimization-type algorithm, or employ a
simpler heuristic. In practice, I find that the latter approach works well. One simple
heuristic is to perform a 1-dimensional minimization of L(f(·; vt)) by raising each
weight of v to a power α > 0, searching for the α that minimizes (4.9). If α < 1,
this raises the entropy, making the distribution less sharply concentrated, and if α > 1 it lowers the
entropy, making it more sharply concentrated.
For the second (Bayesian) interpretation of L(f(·; vt)), in practice I choose a prior
on the weights to be iid log-normal with mean 0 and variance σ2. I then optimize
(4.9) using standard numerical optimization techniques. I find that if I also allow
the variance σ2 itself to be a log-normal random variable, overall performance is
improved.
In addition to the minimum description length correction, in the adaptive impor-
tance sampler I take the elite sample to be the top k weighted samples across all vi for
i < t, rather than just for the last batch of samples as in the standard cross-entropy
method. This means that one can update the importance distribution more frequently
without wasting too many samples on poor-quality importance distributions, which
allows one to maintain more sample diversity. To prevent the elite sample size from
becoming too large in terms of memory storage, one can set a limit on the number of
previous iterations to store.
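One simple way to realize this cross-iteration elite set is a bounded min-heap of (score, sample) pairs. The helper name and toy scores below are mine:

```python
import heapq

def update_elite(elite, new_samples, k):
    """Keep the k highest-scoring (score, sample) pairs seen so far,
    across all iterations, using a bounded min-heap."""
    for item in new_samples:
        if len(elite) < k:
            heapq.heappush(elite, item)
        else:
            heapq.heappushpop(elite, item)   # evict the current minimum
    return elite

elite = []
update_elite(elite, [(0.1, "a"), (0.7, "b")], 3)
update_elite(elite, [(0.4, "c"), (0.05, "d"), (0.9, "e")], 3)
# elite now holds the three highest-scoring samples seen so far
```

Bounding the heap at k pairs also caps memory, in the same spirit as limiting the number of stored iterations.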
7.3 Examples
7.3.1 Modified Preferential Attachment Model
As an example of a complicated probability distribution over permutations, impor-
tance sampling techniques are implemented using a modified undirected preferen-
tial attachment model. As noted above, the defining characteristic of preferential
attachment models is the concept that the “rich get richer”. In the most basic
versions, such as the Yule model and Barabasi-Albert model, as a new vertex is
added to the network, a fixed number of partners are chosen with probability pro-
portional to the degree (allowing multiple edges). The version of preferential attach-
ment used here is slightly different, and is defined by the following network growth
rule: when a new vertex enters the network at step j, choose the number of edges
N_j ~ Bin(j − 1, β) as a Binomial random variable, where β is a “network density”
parameter. Then choose vertex j's N_j partners with replacement, with the probability
of choosing vertex i proportional to

$$(1 - \alpha)\,\frac{\deg(i)}{\sum_{l < j} \deg(l)} + \alpha.$$

The term α essentially smoothes the distribution toward “uniform attachment”; that is,
if α = 1, then the probability of any particular pair occurring is equally likely and
independent, and this is essentially the same model as the Erdos-Renyi G(n, p) described
in the introduction. If instead α = 0, this corresponds to the linear preferential
attachment model, in which the probability of adding an edge is strictly proportional
to the degree.
Note that since multiple edges are not allowed, if a pair occurs more than once
the extras are discarded. This means that the number of edges added per step may
actually be less than Nj (and this is why the α = 1 model is not exactly equivalent
to G(n, p)). The reason that I sample with replacement is to speed up the likelihood
computations at each step and maintain edge independence. Since I am dealing with
relatively sparse graphs, multiple edges occur rarely.
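A sketch of this growth rule follows. The parameter values are illustrative, and steps where no edges exist yet fall back to uniform attachment so that the sampling weights are well defined:

```python
import random

def modified_pa(n: int, beta: float, alpha: float, rng: random.Random):
    """Modified preferential attachment: vertex j draws a Binomial
    number of partners with replacement, each chosen with weight
    (1 - alpha) * deg(i) / total_degree + alpha; duplicate pairs are
    discarded so the graph stays simple."""
    edges = set()
    degree = [0] * n
    for j in range(1, n):                      # j existing vertices so far
        n_j = sum(rng.random() < beta for _ in range(j))   # Binomial(j, beta)
        total = sum(degree[:j])
        if total == 0:
            probs = [1.0] * j                  # no edges yet: uniform fallback
        else:
            probs = [(1 - alpha) * degree[i] / total + alpha for i in range(j)]
        for _ in range(n_j):
            i = rng.choices(range(j), weights=probs, k=1)[0]
            if (i, j) not in edges:            # discard repeated pairs
                edges.add((i, j))
                degree[i] += 1
                degree[j] += 1
    return edges, degree

edges, degree = modified_pa(100, 0.05, 0.2, random.Random(3))
```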
Simulations indicate that the modified preferential attachment model produces a
power-law degree distribution, and there is also some theoretical justification. The
authors of Sheridan et al. [90] present a “Poisson growth model” similar to the one
presented here. They were able to show that the degree distribution is asymptotically
power-law, with exponent determined by the model parameters. The main difference
between their model and the one used here is that here the expected number
of edges added at each step grows linearly with the number of vertices.
7.3.2 Adaptive Importance Sampling
For each run of the adaptive importance sampler, I take N = 20 samples per iteration,
with an initial “elite” sample size of 2. As noted in Section 7.2.3, the elite
sample is taken over all previous samples, not just the previous batch as in the standard
cross-entropy method. The elite sample size changes dynamically according to the
performance of the algorithm; if there is no improvement, I increase it by 1.
For the choice of penalty parameter, I use the heuristic univariate minimization over
the α parameter described in Section 7.2.3. Specifically, I evaluate 20 values of α
evenly spaced in the range 0 to 1. For each α, I raise each parameter in vt to the
power of α, compute a Monte Carlo estimate of the model entropy, compute the model
log-likelihood, and add these terms together to get an MDL score. I then take the α
that yields the best value.
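The tempering step of this heuristic, raising the weights to a power α, moves the entropy monotonically, which is what makes the one-dimensional search sensible. A small illustration with made-up weights:

```python
import math

def entropy(ws) -> float:
    """Shannon entropy of the categorical distribution proportional to ws."""
    z = sum(ws)
    return -sum((w / z) * math.log(w / z) for w in ws if w > 0)

weights = [8.0, 4.0, 2.0, 1.0, 1.0]          # illustrative parameter vector
grid = [k / 20 for k in range(1, 21)]        # 20 values of alpha in (0, 1]
entropies = [entropy([w ** a for w in weights]) for a in grid]
# alpha < 1 flattens the weights (higher entropy); alpha = 1 recovers
# the original distribution.  The full heuristic would add a Monte Carlo
# entropy estimate to the log-likelihood and keep the best-scoring alpha.
```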
7.3.3 Annealed Importance Sampling
For the sake of comparison I implemented a version of annealed importance sampling
(§4.3.1). For each of the annealed importance sampling simulations, I use 1000
transition kernels, and take 6 steps per kernel. Starting with the uniform distribution
π(σ) = 1/n! and knowing the target distribution up to a normalizing constant,
q(σ) = h(σ) ∝ g∗(σ), one wants a sequence of transition kernels T1, T2, . . . , T1000 such
that Ti has stationary distribution

$$q_i \propto \pi(\sigma)^{(1000-i)/1000}\, q(\sigma)^{i/1000}.$$
Since each qi is known up to a normalizing constant, define K to be
the random-to-random walk on the space of permutations, which consists of picking
a label at random and moving it to another random position. This is equivalent to a
card shuffle where one picks a card at random and places it back at a random place in
the deck. Using K, one can simply take each Ti to be K followed by a Metropolis-Hastings
accept/reject step with respect to qi.
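The whole scheme fits in a few lines. The sketch below runs one annealed importance sampling particle with a toy target on the permutation space; the target and parameters are illustrative, and since π is uniform only q enters the incremental weights:

```python
import math
import random

def ais_particle(log_q, n, levels, steps, rng: random.Random) -> float:
    """One annealed importance sampling particle on permutations of n:
    start from the uniform distribution, anneal through q_i ∝ q^{i/levels},
    and apply `steps` random-to-random moves with a Metropolis-Hastings
    correction at each level.  Averaging exp(log weight) over particles
    estimates E_pi[exp(log_q)]."""
    sigma = rng.sample(range(n), n)             # exact draw from uniform pi
    log_w = 0.0
    for i in range(1, levels + 1):
        beta = i / levels
        log_w += (1.0 / levels) * log_q(sigma)  # incremental AIS weight
        for _ in range(steps):
            prop = sigma[:]
            card = prop.pop(rng.randrange(n))   # random-to-random move
            prop.insert(rng.randrange(n), card)
            if math.log(rng.random()) < beta * (log_q(prop) - log_q(sigma)):
                sigma = prop
    return log_w

rng = random.Random(0)
log_q = lambda s: 0.1 * s[0]                    # mild toy target on S_4
est = sum(math.exp(ais_particle(log_q, 4, 30, 2, rng)) for _ in range(100)) / 100
```

For this mild target the exact value of E_pi[exp(log_q)] is the average of exp(0.1 k) over k = 0, ..., 3, about 1.169, and the particle average lands close to it.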
7.3.4 Computational Effort
In general, I set up the simulations so that the adaptive importance sampler and
the annealed importance sampler would make a similar number of objective function
calls. However, it is possible for the annealed importance sampler to use a Markov
chain such as random transpositions that only permits local moves. In this case one
can update the objective function value in linear time. The drawback to this approach
is that random transpositions mixes slowly, so overall computation time to achieve
the same degree of accuracy may not be improved. When using non-local Markov
chains, updating the objective function usually requires the same order of operations
as computing from scratch, typically quadratic in the number of vertices.
Another consideration is the additional computation time required by the iterative
minorization-maximization algorithm to find the maximum likelihood estimator for
the Plackett-Luce model. In my experience, the number of computations this step
requires is negligible compared to the rest of the algorithm. Convergence typically
occurs in fewer operations than the number of operations required to compute a single
objective function value. For example, in the 500 node example below, convergence
occurs within 25 to 100 iterations. Each iteration requires a linear number
of operations, so the total number of operations for the minorization-maximization
algorithm appears to be sub-quadratic in the number of vertices.
7.3.5 Numerical Results
For each of the simulations below, I generate a set of unlabeled networks using the
preferential attachment procedure, then attempt to compute the likelihood function under
the uniform mixture model, comparing the performance of adaptive importance sampling
with annealed importance sampling.
Figure 7.1 displays the progression of estimated log-likelihood for each technique
to estimate the likelihood of a generated 500 node network. The x-axis indicates
the number of score function calls made up to that point, while the y-axis gives
the log-likelihood. Note that annealed importance sampling starts at 0
and decreases, while adaptive importance sampling starts at the log-likelihood of the
uniform distribution and goes up, but both end up in about the same place. Once
a schedule is fixed, annealed importance sampling always uses the same number of
score function evaluations. Adaptive importance sampling, on the other hand, can
choose to stop if it hasn’t made any progress for a significant number of iterations.
Tables 7.1, 7.2, and 7.3 directly compare results of these methods on different
network data-sets chosen at random.
Table 7.1: Comparison of estimators for sparse 500 node preferential attachment
dataset from Figure 7.1

  Estimator     Log Mean Wt.   Var Log Wt.
  Crude         -3.22e3        3.8e3
  Annealed IS   -2.93e3        7.09e2
  Adaptive IS   -2.99e3        1.86e2
Figure 7.1: Example runs comparing annealed importance sampling and adaptive
importance sampling for a 500 node, sparse preferential attachment network, 20 samples
each method. The set of lines coming from the upper left corner represent runs of the
annealed importance sampling algorithm. Lines from the lower left represent runs of
the adaptive importance sampling algorithm. The horizontal line at -2.86e3 represents
the likelihood of the “true” permutation from which the data was generated. (Axes:
number of likelihood computations vs. log-likelihood.)
Table 7.2: Comparison of estimators for dataset: 5 networks, 30 nodes each, average
degree 2, 20 samples each method

  Estimator     Log Mean Wt.   Var Log Wt.
  Annealed IS   -838.7         3.68
  Adaptive IS   -851.4         7.95

Table 7.3: Comparison of estimators for dataset: 2 networks, 100 nodes each, average
degree 2

  Estimator     Log Mean Wt.   Var Log Wt.
  Annealed IS   -1.60e3        17.6
  Adaptive IS   -1.64e3        60.6
7.3.5.1 Effects of Minimum Description Length Correction
To demonstrate the effect of the minimum description length correction applied to
the adaptive importance sampler, Figure 7.2 shows the progression of two typical
simulation runs for a randomly generated dataset. In one simulation I apply the
minimum description length correction and in the other I do not. A circle denotes
the score of a sample, while an x represents the corresponding importance weight.
The solid line represents the estimated log-likelihood of the model, while the dotted
line gives the value of the “true” likelihood, i.e. the likelihood of the permutation
from which the dataset was generated. This is likely an overestimate of the likelihood
function, since it represents close to the “peak” likelihood, not the “mean” likelihood.
The two methods behave quite differently. On the bottom of Figure 7.2, the
standard cross-entropy method does quite well at producing many high scoring
samples, but is degenerate, producing importance weights that are on average 60 orders
of magnitude smaller than their respective sample scores. This method takes many
thousands of iterations to converge, and after 1000 iterations the estimated expected
log-likelihood is 40 orders of magnitude below that of the adaptive importance sampler with
the penalty function. On the top of Figure 7.2, the penalty function adaptive importance
sampler produces a diverse collection of samples, with scores almost as high as
those from the standard cross-entropy sampler, but without becoming degenerate.
This sample diversity gives the penalty function adaptive importance sampler
a much higher chance of encountering “good” solutions.
7.3.5.2 Application: Mus Musculus Protein-Protein Interaction Network
To demonstrate annealed importance sampling and adaptive importance sampling
on a real dataset, I obtained from the BioGRID [19] website a manually curated Mus
musculus (common mouse) protein-protein interaction (PPI) network. This is a large,
simple, undirected network, where an edge is present between two proteins if there is
strong experimental evidence of interaction. I took the largest connected component
of this network, which has 314 nodes and 503 interactions. A rendering of this network
can be seen below in Figure 7.3.
For annealed importance sampling, I ran 20 particles, with 1000 cooling levels and
6 Markov steps at each level. For adaptive importance sampling, I ran 20 simulations,
with N = 20 samples at each iteration, elite sample sizes adjusted dynamically.
Results are summarized in Figure 7.4 and Table 7.4 below. Both techniques converge
to almost the same value. Overall, this provides strong evidence that preferential
attachment is a better model than the Erdos-Renyi G(n, p) model (defined in the
introduction) for this dataset.
Table 7.4: Estimated log-likelihoods for Mus musculus protein-protein interaction
network

  Model            log-likelihood   sample var. log-likelihood
  Erdos-Renyi      -3.070e3         -
  PA Adaptive IS   -2.280e3         3.41e2
  PA Annealed IS   -2.276e3         6.80e2
Figure 7.2: Likelihoods and importance weights for the cross-entropy method. The top
figure shows the progression of the MDL-based Plackett-Luce model. The bottom
figure shows the progression for the cross-entropy Plackett-Luce model without the
MDL correction. Each sample permutation corresponds to one O and one X; O indicates
a sample likelihood (score function value), X points show the sample importance
weight. The solid line shows the importance sampling estimate of the log-likelihood.
The dashed line shows the log-likelihood of the “true” permutation. Dataset: 5 networks,
30 nodes each, average degree 2. (Axes: sample number vs. log-likelihood.)
Figure 7.3: Mus musculus (common mouse) PPI network.
Figure 7.4: Convergence of adaptive importance sampling and annealed importance
sampling for the Mus musculus PPI network. (Axes: number of score function
evaluations vs. log-likelihood.)
Chapter 8
Kronecker Product Graphs
An interesting family of generative models for graphs is stochastic Kronecker product
graphs [79]. Kronecker product graphs are appropriate for modeling many common
real world networks, as they can mimic important properties observed to be prevalent.
For example, they are shown to have a multinomial degree distribution, which
with properly chosen parameters can produce heavy tails. If each vertex in the
initiator graph is given a self-loop, Kronecker graphs can be made to have (small) constant
diameter. The eigenvalues of the adjacency matrix can be shown to be related to the
degree distribution and can be made heavy-tailed. As noted in Leskovec et al. [66],
Kronecker product graphs can encapsulate core-periphery type networks due to the
recursive nature of the Kronecker product graph formulation. Many social, infor-
mational, and biological networks are known to have dense cores, making stochastic
Kronecker product graph networks particularly suitable. And perhaps most importantly,
stochastic Kronecker product graphs can be fit using a linear (O(E)) time
maximum likelihood estimation algorithm called KronFit, developed
by Leskovec and Faloutsos [64].
As defined in Leskovec and Faloutsos [64], Kronecker product graphs start with
an initiator matrix $K_1 \in \mathbb{R}^{N_1 \times N_1}$. One then forms the Kronecker product graph
adjacency matrix

$$K_k = \underbrace{K_1 \otimes \cdots \otimes K_1}_{k \text{ times}} = \otimes^k K_1, \qquad (8.1)$$
CHAPTER 8. KRONECKER PRODUCT GRAPHS 92
where ⊗ is the matrix Kronecker product. See Leskovec et al. [66] for definitions and
many interesting properties. As a consequence of this construction, the Kronecker
product graph matrix $K_k$ has $N_1^k$ rows and columns. To make this into a statistical
model of graph formation, one can interpret the matrix entry $K_k(i, j)$ as the
probability of an edge between vertices i and j, then place an edge between
each pair of vertices (i, j) according to independent Bernoulli trials with parameter
$K_k(i, j)$. The resulting model is the stochastic Kronecker product graph. Note that
this formulation contains the Erdos-Renyi G(n, p) model on $N_1^k$ vertices as a special
case. A generalization of stochastic Kronecker product graphs called multiplicative
attribute graphs was recently developed [57] that considers a sequence of attribute
matrices instead of taking Kronecker powers of a single initiator matrix as in the
stochastic Kronecker product graph model.
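The construction above can be sketched with nothing but the standard library; the initiator values below are illustrative:

```python
import random

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def stochastic_kronecker(initiator, k, rng: random.Random):
    """Form the k-th Kronecker power of the initiator, interpret each
    entry as an edge probability, and flip an independent Bernoulli
    coin per unordered vertex pair."""
    p = initiator
    for _ in range(k - 1):
        p = kron(p, initiator)
    n = len(p)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p[i][j]]

k1 = [[0.9, 0.5], [0.5, 0.2]]                          # N1 = 2 initiator
edges = stochastic_kronecker(k1, 4, random.Random(0))  # 2^4 = 16 vertices
```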
In this chapter I propose an extension of these models called vertex censored
stochastic Kronecker product graphs. This extension results in better fitting models
as measured by the Bayesian information criterion. The primary novelty in my approach
is the use of sequential importance sampling to circumvent a commonly occurring
mismatch between the number of vertices in the Kronecker graph and in the real-world
network being analyzed. The importance sampling scheme allows one to consider
vertex censored network models. In addition to improving the likelihoods for pre-existing
Kronecker graph fits, vertex censoring allows one to make large improvements in
likelihood by greatly expanding the practical number of parameters and initiator matrices
that can be used.
8.1 Motivation
A primary difficulty inherent in stochastic Kronecker product graph models is the
fact that the number of vertices present in Kronecker product graphs are composite
in the sizes of the generator graphs. In the stochastic Kronecker product graph
model, one chooses a single initiator graph with N0 nodes and take k Kronecker
products, resulting in n = Nk0 vertices. Since most real-world networks don’t have a
number of vertices that are composite in small integers, this is problematic. In order
CHAPTER 8. KRONECKER PRODUCT GRAPHS 93
to compensate, the authors of Leskovec et al. [66] suggest forming a new graph by
padding the real-world graph with isolated vertices, and using likelihood estimates
for this enhanced network as a proxy for the original network. This may result in a
large number of isolated vertices. Assuming n = |V| is uniformly distributed over
N_0^{k-1} < n ≤ N_0^k, the expected fraction of padded vertices is (1 − 1/N_0)/2,
with a maximum fraction of 1 − 1/N_0 padded in the worst case. For a given k and N_0,
this worst case is attained when the real-world graph contains N_0^{k-1} + 1 vertices.
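To make the padding overhead concrete, a small hypothetical helper can compute, for a graph of n vertices and an initiator of size N_0, the smallest usable power k and the resulting fraction of padded vertices; the AS-ROUTEVIEWS network size used later in the chapter serves as the example:

```python
def padded_fraction(n, N0):
    """Smallest k with N0**k >= n, and the fraction of padding vertices added."""
    k = 1
    while N0 ** k < n:        # integer arithmetic avoids floating-point log issues
        k += 1
    total = N0 ** k
    return k, (total - n) / total

# For n = 6474 (the AS-ROUTEVIEWS network): a 2x2 initiator pads up to
# 2**13 = 8192 vertices (about 21% padding), while a 5x5 initiator pads up to
# 5**6 = 15625 (about 59% padding), illustrating why small initiators are favored.
```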
Padding with isolated vertices can create inference problems in several ways. First,
a large number of padded isolated vertices skews the degree distribution and other
properties with respect to the original graph. This may not be as problematic for
graphs with power-law or heavy-tailed degree distributions, in which one would expect
most vertices to be incident to very few edges. However, there are many important classes of
networks that are not power-law, such as regular graphs, and padding these networks
with a large number of isolated vertices may distort much of the observed structure.
Second, adding a large number of padded vertices means that one is computing the
likelihood of a potentially much larger graph. This has the effect of reducing the maximum
likelihood estimate for the graph. While many of the interesting and important
statistical features of the original graph may still be reproduced, formulating models
that have high maximum likelihood estimates is important in Bayesian statistics and
model selection. Further, since severe padding is more likely with larger initiator
matrices, smaller initiator matrices are often preferred in the stochastic Kronecker
product graph model, leaving out potentially useful larger initiators. The fact that
N_1 = 2 is often the best or close to the best choice for initiator matrix dimension [66]
makes intuitive sense, since the number of vertices in an arbitrary real-world network is
more likely to be close to a power of 2 than to a power of any larger initiator size.
The authors of Leskovec et al. [66] note that this class of models belongs to the
curved exponential family of models, and that it is therefore appropriate to use the
Bayesian information criterion (BIC) for purposes of model selection. BIC balances
the complexity of the model (measured by number of model parameters) against
goodness of fit (measured by likelihood). However, due to the vertex padding issue,
the parameter complexity of the model typically contributes an insignificant amount
to the BIC, and large complicated graphs are often found to fit best with 2-by-2 or
3-by-3 initiator matrices. For many common statistical models, adding more parameters
results in a better fit, and this is the behavior one would intuitively desire for
stochastic Kronecker product graphs.
Finally, it may be desirable to model "missing data" problems where one does not
have access to the full network. This may take the form of "link prediction" (missing
edges) or cases where one does not have access to the full vertex set. Vertex censored
models can be naturally adapted to address these types of problems. A similar
setting may arise if it is desirable to predict the future growth of a network given a
core "seed" graph.
8.2 Stochastic Kronecker Product Graph Model
In this section I briefly review the stochastic Kronecker product graph model and the
KronFit algorithm used to compute maximum likelihood estimates.
8.2.1 Likelihood under the Stochastic Kronecker Product Graph Model
In general, the likelihood of a graph G(V, E) under K_k can be computed as
$$P(G \mid K_k) = \prod_{(u,v) \in E} K_k(u,v) \prod_{(u,v) \notin E} \bigl(1 - K_k(u,v)\bigr). \tag{8.2}$$
Instead of the probability itself, it is usually easier to work in terms of the log-likelihood:
$$\begin{aligned}
l(K_k \mid G) &= \sum_{(u,v)\in E} \log K_k(u,v) + \sum_{(u,v)\notin E} \log\bigl(1-K_k(u,v)\bigr) &\text{(8.3)}\\
&= \sum_{u,v\in V} \log\bigl(1-K_k(u,v)\bigr) &\text{(8.4)}\\
&\quad+ \sum_{(u,v)\in E} \bigl[\log K_k(u,v) - \log\bigl(1-K_k(u,v)\bigr)\bigr] &\text{(8.5)}\\
&= l_e(K_k \mid G) + l_E(K_k \mid G), &\text{(8.6)}
\end{aligned}$$
where le denotes the log-likelihood of the empty graph, and lE is the edge correction
to the log-likelihood.
Due to the recursive structure of Kronecker products, with each vertex u one can
associate a vector
$$u = (u_1, u_2, \ldots, u_k), \tag{8.7}$$
where $u_i \in \{1, \ldots, N_1\}$, and the probability of an edge between two vertices u and v is
$$K_k(u,v) = \prod_{i=1}^{k} K_1(u_i, v_i). \tag{8.8}$$
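The per-pair probability (8.8) can be read directly off the base-N_1 digits of the two vertex indices, so K_k never needs to be materialized. A minimal sketch, using 0-based digits rather than the 1-based u_i above:

```python
def edge_prob(K1, u, v, k):
    """K_k(u, v) as the product of initiator entries over the k digit pairs (eq. 8.8)."""
    N1 = len(K1)
    p = 1.0
    for _ in range(k):
        p *= K1[u % N1][v % N1]   # least-significant base-N1 digits of u and v
        u //= N1
        v //= N1
    return p
```

Computing one entry this way costs O(k) = O(log n) time, which is what makes the edge-correction approach feasible.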
The log-likelihood of the empty graph is
$$l_e(K_k) = \sum_{u,v} \log\bigl(1 - K_k(u,v)\bigr). \tag{8.9}$$
Leskovec et al. [66] use the Taylor series approximation to (8.9),
$$l_e(K_k) = -\left(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} K_1(i,j)\right)^{k} - \frac{1}{2}\left(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} K_1(i,j)^2\right)^{k} - \ldots \tag{8.10}$$
which implies that l_e(K_k) can be computed to arbitrary precision in either constant
or logarithmic time, assuming reasonable bounds on the entries of K_1. To compute the
log-likelihood of the full graph, one can then apply the edge correction to the empty
graph, adding the log-likelihood contribution of each edge in G and removing the
corresponding "no-edge" contribution. This means that the total log-likelihood for G
under K_k can be approximated in O(m) time.
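A sketch of this decomposition, combining a truncated version of the series (8.10) for the empty-graph term with the per-edge correction of (8.6); the helper names and the truncation depth are illustrative assumptions, not from the source:

```python
import math

def edge_prob(K1, u, v, k):
    """K_k(u, v) from the base-N1 digit pairs of u and v (eq. 8.8)."""
    N1 = len(K1)
    p = 1.0
    for _ in range(k):
        p *= K1[u % N1][v % N1]
        u //= N1
        v //= N1
    return p

def empty_loglik(K1, k, terms=100):
    """Truncated Taylor series (8.10) for l_e(K_k) = sum over u,v of log(1 - K_k(u,v))."""
    le = 0.0
    for t in range(1, terms + 1):
        s = sum(x ** t for row in K1 for x in row)
        le -= (s ** k) / t
    return le

def loglik(K1, k, edges):
    """Full log-likelihood (8.6): empty-graph term plus an O(m) edge correction."""
    ll = empty_loglik(K1, k)
    for u, v in edges:
        p = edge_prob(K1, u, v, k)
        ll += math.log(p) - math.log(1.0 - p)   # swap a 'no-edge' term for an edge term
    return ll
```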
8.2.2 Sampling Permutations
In the previous section, it was assumed that the "true" mapping between vertices in G
and vertices in K_k is known, i.e. that G is labeled. However, G is typically unlabeled,
and it is necessary to account for this fact by summing over all possible permutations.
One can state the conditional probability of a mapping π as
$$P(\pi \mid G, K_k) = \frac{P(G \mid \pi, K_k)\, P(\pi \mid K_k)}{P(G \mid K_k)}. \tag{8.11}$$
Finding π that maximizes this quantity is similar to the linear ordering problem (LOP)
[77, 15], which is NP-hard. To sample from this distribution, Leskovec and Faloutsos
[64] suggest a Metropolis-Hastings Markov chain Monte Carlo algorithm which has
stationary distribution (8.11). As a base chain they use the “vertex switch” Markov
chain, also known as the “random transpositions” shuffle. Other Markov chains on
permutations are possible; see Aldous and Diaconis [1] for background.
The advantage of using a “local” Markov chain such as random transpositions is
that it allows one to use local updates to the likelihood.
Under the stochastic Kronecker product graph model, the empty-graph log-likelihood is
unchanged by a transposition, so the likelihood update for a proposed swap can be
computed in time O(max degree) [66].
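The Metropolis-Hastings chain over vertex orderings can be sketched as follows. For clarity this recomputes the full log-target at every step rather than performing the O(max degree) local update used in practice, and `log_target` stands in for log P(G | π, K_k):

```python
import math
import random

def mh_transposition_step(perm, log_target, rng):
    """One Metropolis-Hastings step on the 'random transpositions' base chain.

    The proposal (swap two uniformly chosen positions) is symmetric, so the
    move is accepted with probability min(1, target(new) / target(old))."""
    i, j = rng.randrange(len(perm)), rng.randrange(len(perm))
    cur = log_target(perm)
    perm[i], perm[j] = perm[j], perm[i]
    if math.log(rng.random()) >= log_target(perm) - cur:
        perm[i], perm[j] = perm[j], perm[i]   # reject: undo the transposition
    return perm
```

Running many such steps yields (approximate) samples from the posterior (8.11) over mappings π.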
8.2.3 Computing Gradients
Computation of gradients follows the same process as the log-likelihood computation. To
compute ∇_i l(K_k), I first compute the empty-graph contribution and apply an edge
correction. The empty-graph gradient is approximated by differentiating the Taylor
series approximation (8.10) of the log-likelihood with respect to parameter i. This
process is repeated for each of the N_1^2 parameters.
8.3 Vertex Censored Stochastic Kronecker Product Graphs
In this section, I specify the vertex censored stochastic Kronecker product graph
model and present an algorithm to compute maximum likelihood estimates. The primary
motivation for this extension is to compensate for the likelihood distortions introduced
by vertex padding in stochastic Kronecker product graphs. Vertex censored models
can be thought of as generating networks by some well-defined additive process, then
"dropping" a subset of vertices. Equivalently, one can assume the dropped subset is
simply hidden, or "censored". Vertex censored network models allow one to make
more valid comparisons across stochastic Kronecker product graphs with different
sized initiator matrices. Censored data is a common issue in the statistical literature;
a previous application of censoring to network data in an unrelated problem can
be found in Thomas and Blitzstein [97]. A key component of my algorithm uses
sequential importance sampling to estimate the empty-graph log-likelihood.
The basic vertex censored network model can be stated simply. Suppose that for
the graph G(V, E), n = |V| < N_1^k. Instead of "padding" G with isolated vertices, I
define an injective mapping $\phi : V \to V_K$. One can then compute the log-likelihood
under the censored model as
$$l^{\phi}(K_k) = \sum_{u,v \in V} \bigl[\, \mathbb{I}_{(u,v)\in E}\, \log K_k(\phi(u), \phi(v)) + \mathbb{I}_{(u,v)\notin E}\, \log\bigl(1 - K_k(\phi(u), \phi(v))\bigr) \bigr], \tag{8.12}$$
where the right-hand side decomposes into the empty-graph log-likelihood $l_e^{\phi}(K_k)$ and edge
correction $l_E^{\phi}(K_k)$.
8.3.1 Importance Sampling for Likelihoods
Note, however, that under the decomposition (8.12) the recursive structure that
allows one to easily compute $l_e^{\phi}$ using the Taylor series expansion (8.10) is lost. To
compensate for this, I first represent $l_e$ as an expectation,
$$l_e(K_k) = n^2\, \mathbb{E}\bigl[\log\bigl(1 - K_k(u,v)\bigr)\bigr], \tag{8.13}$$
where u and v are chosen uniformly at random. I will show how to estimate (8.13)
using importance sampling. In the current context, the first-order Taylor series
approximation to (8.13) suggests that using
$$g(u,v) = \frac{K_k(u,v)}{\sum_{u,v} K_k(u,v)} \tag{8.14}$$
will be close to optimal. One can simulate from g sequentially, drawing each element
pair $(u_i, v_i)$ independently according to
$$g_i(u_i, v_i) = \frac{K_1(u_i, v_i)}{\sum_{u_i, v_i} K_1(u_i, v_i)}, \tag{8.15}$$
then combining these as $g(u,v) = \prod_{i=1}^{k} g_i(u_i, v_i)$.
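A sketch of this sequential sampler: each digit pair is drawn in proportion to the corresponding initiator entry, and since the importance weight is f/g with f the uniform density 1/n², the n² factor in (8.13) cancels, so the average of log(1 − K_k)/g estimates l_e directly. Function and variable names are illustrative:

```python
import math
import random

def sis_empty_loglik(K1, k, samples, rng):
    """Sequential importance sampling estimate of l_e(K_k), eqs. (8.13)-(8.15)."""
    N1 = len(K1)
    pairs = [(a, b) for a in range(N1) for b in range(N1)]
    weights = [K1[a][b] for a, b in pairs]
    s1 = sum(weights)                       # normalizer of each digit distribution g_i
    total = 0.0
    for _ in range(samples):
        p = 1.0                             # builds K_k(u, v) digit pair by digit pair
        for _ in range(k):
            a, b = rng.choices(pairs, weights=weights)[0]
            p *= K1[a][b]
        g = p / s1 ** k                     # g(u, v) = K_k(u, v) / sum_{u,v} K_k(u, v)
        total += math.log(1.0 - p) / g      # n^2 * (f/g) * log(1 - K_k), n^2 cancelled
    return total / samples
```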
One can show that this choice of importance distribution has bounded relative
error in the uncensored case.

Theorem 2. Let random variables $X = \log(1 - K_k(u,v))$, where u, v are chosen
uniformly at random, and $Z = \log(1 - K_k(\tilde u, \tilde v))\, \frac{f(\tilde u, \tilde v)}{g(\tilde u, \tilde v)}$, where $\tilde u, \tilde v$ are chosen
according to (8.15) and f denotes the uniform density. Then
$$\frac{\mathbb{E} Z^2}{[\mathbb{E} X]^2} \le \mathbb{E}\left[\frac{1}{1 - K_k(\tilde u, \tilde v)}\right]. \tag{8.16}$$
If $K_k(u,v) \le 1 - \varepsilon$ for some $\varepsilon > 0$ and all u, v, this implies that Z has bounded relative
error.
Proof. From the Taylor series expansion (8.10),
$$|\mathbb{E} X| = \frac{1}{n^2}\, |l_e(K_k)| \ge \frac{1}{n^2} \left(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} K_1(i,j)\right)^{k} = \frac{m}{n^2}, \tag{8.17}$$
where $m = \sum_{u,v} K_k(u,v)$ is the expected number of edges, so $[\mathbb{E} X]^2 \ge m^2/n^4$. One can write Z as
$$Z = \frac{m}{n^2} \cdot \frac{\log(1 - K_k(\tilde u, \tilde v))}{K_k(\tilde u, \tilde v)} \tag{8.18}$$
$$\;\; = -\frac{m}{n^2} \left(1 + \frac{1}{2} K_k(\tilde u, \tilde v) + \frac{1}{3} K_k(\tilde u, \tilde v)^2 + \ldots \right). \tag{8.19}$$
Multiplying out the Taylor series expansion for $Z^2$ gives
$$\mathbb{E} Z^2 = \frac{m^2}{n^4}\, \mathbb{E}\left[1 + K_k(\tilde u, \tilde v) + \frac{11}{12} K_k(\tilde u, \tilde v)^2 + \frac{5}{6} K_k(\tilde u, \tilde v)^3 + \ldots \right] \tag{8.20}$$
$$\;\; \le \frac{m^2}{n^4}\, \mathbb{E}\left[1 + K_k(\tilde u, \tilde v) + K_k(\tilde u, \tilde v)^2 + \ldots \right] \tag{8.21}$$
$$\;\; = \frac{m^2}{n^4}\, \mathbb{E}\left[\frac{1}{1 - K_k(\tilde u, \tilde v)}\right]. \tag{8.22}$$
Combining the two bounds gives the result.
While this bound is not practically useful in the uncensored case, due to the presence
of the Taylor series expansion, it provides some intuitive justification for the
use of the importance sampler in the more complicated censored cases. Computing
the gradient of the log-likelihood via importance sampling can follow the same
sequential importance distribution, though it is not as straightforward to choose a
near-optimal importance distribution. Empirical results show that using the same
importance distribution for the gradient as for the log-likelihood gives better results
than crude Monte Carlo.
8.3.2 Choosing Censored Vertices
Using the same form as (8.13), define the censored version as
$$l_e^{\phi}(K_k) = N_1^{2k}\, \mathbb{E}\bigl[\, \mathbb{I}_{\phi^{-1}(u),\, \phi^{-1}(v) \in V}\, \log\bigl(1 - K_k(u,v)\bigr) \bigr], \tag{8.23}$$
where u and v are now drawn uniformly at random from $V_K$.
As noted in Leskovec et al. [66], uniformly dropping vertices alters the expected degree
distribution of the stochastic Kronecker product graph. However, there is no real
reason to assume that the censored data is chosen uniformly; indeed, in many contexts
it may make more sense to assume the probability of "seeing" a vertex is a function
of the number of edges incident to it. Choosing the probability of censoring a vertex
as inversely proportional to its degree would, for example, preserve a power-law degree
distribution. In general, one may create models with arbitrary rules for censoring
vertices, such as dropping the vertices of smallest degree. This last approach is
similar to the "isolated vertex padding" approach of Leskovec and Faloutsos [64],
but has the advantage of producing a higher MLE, because it censors instead of
padding. This is the approach I take in the numerical examples in §8.4.
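Dropping the lowest-degree vertices can be implemented directly, since the expected degree of a Kronecker vertex factorizes over its digits into a product of initiator row sums. The following is a sketch under the assumption of 0-based digit indexing, not the SNAP implementation:

```python
def expected_degree(K1, u, k):
    """Expected degree of vertex u in K_k: the product over digits of the
    corresponding initiator row sums, since sum_v K_k(u, v) factorizes."""
    N1 = len(K1)
    row_sums = [sum(row) for row in K1]
    d = 1.0
    for _ in range(k):
        d *= row_sums[u % N1]
        u //= N1
    return d

def censor_lowest_degree(K1, k, n):
    """Map the n observed vertices onto the n Kronecker vertices of highest
    expected degree; the remaining vertices are censored (hidden)."""
    total = len(K1) ** k
    ranked = sorted(range(total), key=lambda u: -expected_degree(K1, u, k))
    return set(ranked[:n])
```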
8.3.3 Sampling Permutations
Sampling of permutations proceeds as in the stochastic Kronecker product graph
model. One can run the random transpositions Metropolis-Hastings algorithm over φ
with the same effect. One problem with this formulation is that swapping a "visible"
vertex with a "censored" vertex requires "dense" operations to compute the likelihood
change $\delta_{\phi,\phi'}$, as this move changes the empty-graph log-likelihood $l_e^{\phi}(K_k)$. To compute
this update approximately, one can use an importance sampling scheme as above,
estimating only the empty-graph likelihood contributions of the individual vertices
switched. This can be done quickly and accurately; however, it introduces some
"stochastic drift" over time due to the accumulation of many Monte Carlo estimates
with independent errors. This necessitates occasionally re-approximating the full likelihood.
An alternative is, for most steps, to run a Markov chain that only switches mapped
vertices of V , i.e. permuting φ instead of permuting the whole space. Since the empty-
graph log-likelihoods haven’t changed, this only requires a constant-time update as
in the stochastic Kronecker product graph model. One could then more infrequently
perform a fuller step that switches a mapped vertex with an unmapped vertex. One
could also simply make the assumption that all of the unmapped vertices belong to
a known fixed subset, such as all of the lowest expected degree vertices in Kk, or by
matching the expected degree of vertices in Kk to vertex degrees in G.
8.3.4 Multiplicative Attribute Graphs
Multiplicative attribute graphs [57] are a generalization of stochastic Kronecker
product graphs that use a set of attribute matrices $\Theta_1, \ldots, \Theta_k$ to form a Bernoulli edge
sampling matrix
$$M_k = \bigotimes_{i=1}^{k} \Theta_i. \tag{8.24}$$
They are fit in the same way as stochastic Kronecker product graphs, using empty-graph
likelihoods and edge corrections for both the log-likelihood and the gradient.
The sequential importance sampling scheme for computing empty-graph log-likelihoods
(§8.3.1) operates in the same way, except that instead of choosing the components
of the importance distribution identically, each component is chosen according to its
individual attribute matrix. Vertex censoring is particularly useful in this context, as
the number of attribute matrices most appropriate for a given real-world graph seems
unlikely to coincide with the choice that minimizes the number of padded vertices.
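Under the multiplicative attribute model, the per-pair edge probability is the same digit product as (8.8), except that each position i uses its own matrix Θ_i, and the per-digit importance distribution is likewise built from Θ_i rather than a single K_1. A minimal sketch with hypothetical names:

```python
import random

def mag_edge_prob(thetas, u_attrs, v_attrs):
    """Edge probability under the multiplicative attribute graph model:
    the product over positions i of Theta_i[u_i][v_i] (eq. 8.24)."""
    p = 1.0
    for theta, a, b in zip(thetas, u_attrs, v_attrs):
        p *= theta[a][b]
    return p

def sample_digit_pair(theta, rng):
    """Draw one (u_i, v_i) pair with probability proportional to Theta_i[u_i][v_i],
    the per-matrix analogue of the importance distribution (8.15)."""
    N = len(theta)
    pairs = [(a, b) for a in range(N) for b in range(N)]
    return rng.choices(pairs, weights=[theta[a][b] for a, b in pairs])[0]
```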
8.4 Empirical Results
The primary network I examine is the AS-ROUTEVIEWS network studied in Leskovec
and Faloutsos [64]. This network has n = 6474 vertices and m = 26467 edges. To
test the importance sampling scheme, I computed l_e(K_k) for the uncensored stochastic
Kronecker product graph model using both the Taylor series approximation (8.10)
and the sequential importance sampling approximation. Figure 8.1 shows the results
for the k = 2 model using the optimal parameters. To compute the following graph,
I ran the standard uniform "crude" Monte Carlo scheme with 100 samples 100 times,
and did the same with the SIS estimator. I then plotted the histograms and compared
the results to the Taylor series approximation.
Note the extreme reduction in variance: the standard estimate from the SIS samples
is μ_SIS = −24650 ± 60, versus μ_MC = −21614 ± 17495 for the crude MC estimate,
where the "true" value from the Taylor series expansion is μ_TS = −24644. In general,
crude MC is quite poor, whereas computing the SIS estimate with something like 10n
samples gives about 5 digits of accuracy.
For gradient computations, however, using 10n samples only gives about 2 or 3
digits of accuracy. This is much better than the crude MC estimator, but at this
accuracy the optimization algorithm had a difficult time converging. More thought
needs to be put into the design of the importance sampler for gradient computations.
However, one nice feature of the vertex censored model is that one can take
the same parameters generated by the standard stochastic Kronecker product graph
model with vertex padding and simply "censor" the padded vertices. While this
does not give the maximum likelihood estimator for the vertex censored model, it does
give a lower bound on the MLE (or, equivalently, an upper bound on the negative
log-likelihood). Note the large improvements in MLE for size-4 and size-5 initiator
matrices in Figure 8.2. Again, these are upper bounds on the negative log-likelihood
for the vertex censored stochastic Kronecker product graph model. This shows that
the padded vertices have a noticeable detrimental effect, and that more model
parameters give a higher likelihood.
8.4.1 Implementation
Code is written in C++ and R, and is available upon request. The main routine to
compute maximum likelihood estimates was modified from the KronFit implementation
in the publicly available C++ library SNAP [63].
Figure 8.1: Performance of crude and SIS Monte Carlo simulations on the AS-ROUTEVIEWS graph for N_1 = 2. Histograms of the estimated log-likelihood; the top curve is the SIS MC density, the bottom the crude MC density.
Figure 8.2: Comparison of SKPG and VCSKPG models for the AS-ROUTEVIEWS graph (n = 6474, m = 26467): negative log-likelihood against initiator matrix size N_1.
Bibliography
[1] D. Aldous and P. Diaconis. Shuffling cards and stopping times. American
Mathematical Monthly, 93(5):333–348, 1986. ISSN 0002-9890.
[2] David Aldous. Approximate counting via Markov chains. Statistical Science, 8
(1):pp. 16–19, 1993. ISSN 08834237.
[3] H.C. Andersen and P. Diaconis. Hit and run as a unifying device. Journal de
la Société Française de Statistique, 148(5):5–28, 2007.
[4] C. Andrieu, N. De Freitas, A. Doucet, and M.I. Jordan. An introduction to
MCMC for machine learning. Machine learning, 50(1):5–43, 2003.
[5] Annette M. Evangelisti and Andreas Wagner. Molecular evolution in the yeast
transcriptional regulation network. Journal of Experimental Zoology Part B:
Molecular and Developmental Evolution, 302B:392–411, 2004. ISSN 1552-5015.
link.
[6] S. Asmussen and P.W. Glynn. Stochastic simulation: Algorithms and analysis.
Springer Verlag, 2007.
[7] M. Madan Babu, Nicholas M. Luscombe, L. Aravind, Mark Gerstein, and
Sarah A. Teichmann. Structure and evolution of transcriptional regulatory
networks. Current Opinion in Structural Biology, 14:283–291, Jun 2004. link.
[8] A.-L. Barabási and R. Albert. Emergence of scaling in random networks.
Science, 286(5439):509–512, 1999.
105
[9] M. Bayati, J. Kim, and A. Saberi. A sequential algorithm for generating ran-
dom graphs. Approximation, Randomization, and Combinatorial Optimization.
Algorithms and Techniques, pages 326–340, 2007.
[10] I. Beichl and F. Sullivan. The metropolis algorithm. Computing in Science &
Engineering, 2(1):65–69, 2000.
[11] R. Bellman. Adaptive control processes: a guided tour., 1961.
[12] Johannes Berg, Michael Lassig, and Andreas Wagner. Structure and evolution
of protein interaction networks: a statistical model for link dynamics and gene
duplications. BMC Evolutionary Biology, 4:51, 2004. ISSN 1471-2148. link.
[13] I. Bezakova, A. Kalai, and R. Santhanam. Graph model selection using maxi-
mum likelihood. In ICML ’06: Proceedings of the 23rd international conference
on Machine learning, pages 105–112, New York, NY, USA, 2006. ACM. ISBN
1-59593-383-2. doi: http://doi.acm.org/10.1145/1143844.1143858.
[14] J. Blitzstein and P. Diaconis. A sequential importance sampling algorithm for
generating random graphs with prescribed degrees. Internet Mathematics, 6(4):
489–522, 2011.
[15] A. Blum, G. Konjevod, R. Ravi, and S. Vempala. Semi-definite relaxations
for minimum bandwidth and other vertex-ordering problems. In Proceedings of
the thirtieth annual ACM symposium on Theory of computing, pages 100–105.
ACM, 1998. ISBN 0897919629.
[16] B. Bollobas. Random Graphs. Cambridge University Press, 2001.
[17] B. Bollobas, O. Riordan, J. Spencer, and G. Tusnady. The degree sequence of
a scale-free random graph process. Random Structures and Algorithms, 18(3):
279–290, 2001.
[18] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[19] B.J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livs-
tone, R. Oughtred, D.H. Lackner, J. Bahler, V. Wood, et al. The BioGRID
Interaction Database: 2008 update. Nucleic Acids Research, 2007.
[20] O. Cappe, A. Guillin, J.M. Marin, and C.P. Robert. Population Monte Carlo.
J. Comput. Graph. Statist, 13(4):907–929, 2004.
[21] G. Casella and R.L. Berger. Statistical Inference. Duxbury, 2011.
[22] K.C. Chang, C.Y. Chong, and T. Bar-Shalom. Joint probabilistic data associa-
tion in distributed sensor networks. IEEE Transactions on Automatic Control,
31(10):889–897, 1986.
[23] Y. Chen, P. Diaconis, S.P. Holmes, and J.S. Liu. Sequential monte carlo meth-
ods for statistical analysis of tables. Journal of the American Statistical Asso-
ciation, 100(469):109–120, 2005.
[24] Z. Chen. Bayesian filtering: From Kalman filters to particle filters, and beyond.
Unpublished manuscript, 2003.
[25] F. Chung, L. Lu, T.G. Dewey, and D.J. Galas. Duplication Models for Biological
Networks. Journal of Computational Biology, 10(5):677–687, 2003.
[26] P. Diaconis. Group Representations in Probability and Statistics. Institute of
Mathematical Statistics, 1988.
[27] P. Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American
Mathematical Society, 46(2):179–205, 2009.
[28] P. Diaconis and S. Holmes. Three examples of Monte-Carlo Markov chains: at
the interface between statistical computing, computer science, and statistical
mechanics. IMA Volumes in Mathematics and its Applications, 72:43–43, 1995.
[29] SN Dorogovtsev and JFF Mendes. Effect of the accelerating growth of commu-
nications networks on their structure. Physical Review E, 63(2):25101, 2001.
[30] R. Douc, A. Guillin, J.M. Marin, and CP Robert. Minimum variance impor-
tance sampling via population Monte Carlo. ESAIM: P&S, 11:427–447, 2007.
[31] A. Doucet and N. De Freitas. Sequential Monte Carlo methods in practice.
Springer, 2001.
[32] R. Durrett. Random Graph Dynamics. Cambridge University Press, 2006.
[33] M. Dyer and A. Frieze. A random polynomial time algorithm for approximating
the volume of convex bodies. In Proceedings of the twenty-first annual ACM
symposium on Theory of computing, pages 375–381. ACM, 1989.
[34] P. Erdős and A. Rényi. On random graphs. Publ. Math. Debrecen, 6(290), 1959.
[35] M.J. Evans and T. Swartz. Approximating integrals via Monte Carlo and de-
terministic methods. Oxford University Press, USA, 2000.
[36] T. Feder, A. Guetz, M. Mihail, and A. Saberi. A local switch markov chain on
given degree graphs with application in connectivity of peer-to-peer networks.
In 47th Annual IEEE Symposium on Foundations of Computer Science, pages
69–76. IEEE, 2006.
[37] E. Frank, M.A. Hall, G. Holmes, R. Kirkby, B. Pfahringer, and I.H. Witten.
Weka: A machine learning workbench for data mining. Data Mining and Knowl-
edge Discovery Handbook: A Complete Guide for Practitioners and Researchers,
pages 1305–1314, 2005.
[38] A. Gelman. Bayesian data analysis. CRC press, 2004.
[39] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741, 1984.
[40] C.J. Geyer and E.A. Thompson. Constrained Monte Carlo maximum likeli-
hood for dependent data. Journal of the Royal Statistical Society. Series B
(Methodological), 54(3):657–699, 1992.
[41] EN Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):
1141–1144, 1959.
[42] N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to
nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Process-
ing, IEE Proceedings F, 140(2):107–113, 1993. ISSN 0956-375X.
[43] P.D. Grunwald. The Minimum Description Length Principle. Mit Press, 2007.
[44] J.M. Hammersley and K.W. Morton. Poor man's Monte Carlo. Journal of the
Royal Statistical Society. Series B (Methodological), pages 23–38, 1954.
[45] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of sta-
tistical learning: data mining, inference and prediction. The Mathematical
Intelligencer, 27(2):83–85, 2005.
[46] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57(1):97–109, 1970. ISSN 0006-3444. URL
http://www.jstor.org/stable/2334940.
[47] S. Holmes, A. Kapelner, and P.P. Lee. An interactive Java statistical image
segmentation system: GemIdent. Journal of Statistical Software, 30:1–20, 2009.
[48] C. Hue, J.P. Le Cadre, and P. Perez. Sequential Monte Carlo methods for mul-
tiple target tracking and data fusion. IEEE Transactions on Signal Processing,
50(2):309–325, 2002.
[49] D.R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals
of Statistics, 32(1):384–406, 2004. link.
[50] S. Janson, T. Luczak, and A. Rucinski. Random graphs. John Wiley New York,
2000.
[51] A.H. Jazwinski. Stochastic processes and filtering theory, volume 63. Academic
Pr, 1970.
[52] M. Jerrum and A. Sinclair. Approximating the permanent. SIAM journal on
computing, 18:1149, 1989.
[53] M.R. Jerrum, L.G. Valiant, and V.V. Vazirani. Random generation of combi-
natorial structures from a uniform distribution. Theoretical Computer Science,
43:169–188, 1986.
[54] S.J. Julier and J.K. Uhlmann. A new extension of the kalman filter to nonlin-
ear systems. In Int. Symp. Aerospace/Defense Sensing, Simul. and Controls,
volume 3, page 26. Citeseer, 1997.
[55] R.E. Kalman. A new approach to linear filtering and prediction problems.
Journal of basic Engineering, 82(1):35–45, 1960.
[56] Z. Khan, T. Balch, and F. Dellaert. An MCMC-based particle filter for tracking
multiple interacting targets. Lecture Notes in Computer Science, pages 279–290,
2004.
[57] M. Kim and J. Leskovec. Multiplicative attribute graph model of real-world
networks. Algorithms and Models for the Web-Graph, pages 62–73, 2010.
[58] S. Kirkpatrick, C.D. Gelatt Jr, and M.P. Vecchi. Optimization by simulated
annealing. Biology and Computation: A Physicist's Choice, 1994.
[59] J.M. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
The web as a graph: measurements, models and methods. Proceedings of the
International Conference on Combinatorics and Computing, 1999.
[60] KR Koch. Gibbs sampler by sampling-importance-resampling. Journal of
Geodesy, 81(9):581–591, 2007. ISSN 0949-7714.
[61] A. Kong. A note on importance sampling using standardized weights. Technical
report, Deptartment of Statistics, University Chicago, 1992.
[62] PL Krapivsky and S. Redner. Organization of growing random networks. Phys-
ical Review E, 63(6):66123, 2001.
[63] J. Leskovec. Snap graph library, 2011. URL snap.stanford.edu.
[64] J. Leskovec and C. Faloutsos. Scalable modeling of real graphs using kronecker
multiplication. In Proceedings of the 24th international conference on Machine
learning, page 504. ACM, 2007.
[65] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution
of social networks. In Proceeding of the 14th ACM SIGKDD international con-
ference on Knowledge discovery and data mining, pages 462–470. ACM, 2008.
[66] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani.
Kronecker graphs: An approach to modeling networks. The Journal of Machine
Learning Research, 11:985–1042, 2010.
[67] J.S. Liu. Metropolized independent sampling with comparisons to rejection
sampling and importance sampling. Statistics and Computing, 6(2):113–119,
1996.
[68] J.S. Liu. Monte Carlo strategies in scientific computing. Springer Verlag, 2008.
[69] J.S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems.
Journal of the American Statistical Association, 93(443):1032–1044, 1998.
[70] D.J.C. MacKay. Information theory, inference, and learning algorithms. Cam-
bridge University Press New York, 2003.
[71] M. Madan Babu and Sarah A. Teichmann. Evolution of transcription factors
and the gene regulatory network in escherichia coli. Nucleic Acids Research, 31:
1234–1244, Feb 2003. link.
[72] J.I. Marden. Analyzing and Modeling Rank Data. Chapman & Hall/CRC, 1995.
[73] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Au-
gusta H. Teller, and Edward Teller. Equation of state calculations by fast com-
puting machines. Journal of Chemical Physics, 21(6):1087–1092, 1953. ISSN
00219606. doi: DOI:10.1063/1.1699114.
[74] M. Middendorf, E. Ziv, and C.H. Wiggins. Inferring network mechanisms:
the Drosophila Melanogaster protein interaction network. Proceedings of the
National Academy of Sciences, 102(9):3192–3197, 2005.
[75] R.M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):
125–139, 2001.
[76] R.M. Neal. Estimating Ratios of Normalizing Constants Using Linked Impor-
tance Sampling. Arxiv preprint math.ST/0511216, 2005.
[77] A. Newman. Cuts and orderings: on semidefinite relaxations for the linear
ordering problem. Approximation, Randomization, and Combinatorial Opti-
mization, pages 195–206, 2004.
[78] MEJ Newman. The Structure and Function of Complex Networks. Structure,
45(2):167–256, 2003.
[79] MEJ Newman and EA Leicht. Mixture models and exploratory analysis in
networks. Proceedings of the National Academy of Sciences, 104(23):9564, 2007.
[80] E. Ozkan. Particle methods for bayesian multi-object tracking and parameter
estimation. PhD thesis, Middle East Technical University, 2009.
[81] G. Pólya and F. Eggenberger. Über die Statistik verketteter Vorgänge. Z. Angew.
Math. Mech., pages 279–289, 1923.
[82] DJ Price. A general theory of bibliometric and other cumulative advantage
processes. Journal of the American Society for Information Science, 27(5):
292–306, 1976.
[83] O. Ratmann, O. Jørgensen, T. Hinkley, M. Stumpf, S. Richardson, and C. Wiuf.
Using likelihood-free inference to compare evolutionary dynamics of the protein
networks of H. Pylori and P. Falciparum. PLoS Comput Biol, 3(11):e230, 2007.
link.
[84] C.P. Robert and G. Casella. Monte Carlo statistical methods. Springer, 2004.
[85] C.P. Robert and G. Casella. Introducing Monte Carlo Methods with R. Springer
Verlag, 2010.
[86] M.N. Rosenbluth and A.W. Rosenbluth. Monte carlo calculation of the average
extension of molecular chains. The Journal of Chemical Physics, 23(2):356–359,
1955.
[87] D.B. Rubin. A noniterative sampling/importance resampling alternative to the
data augmentation algorithm for creating a few imputations when fractions of
missing information are modest: the SIR algorithm. Journal of the American
Statistical Association, 82(398):543–546, 1987.
[88] R.Y. Rubinstein. Optimization of computer simulation models with rare events.
European Journal of Operational Research, 99(1):89–112, 1997.
[89] R.Y. Rubinstein and D.P. Kroese. The cross-entropy method: a unified approach
to combinatorial optimization, Monte-Carlo simulation and machine learning.
Springer, 2004.
[90] P. Sheridan, Y. Yagahara, and H. Shimodaira. A preferential attachment model
with Poisson growth for scale-free networks. arxiv.org, 2008.
[91] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transac-
tions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
[92] H.A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):
425–440, 1955.
[93] A. Sinclair and M. Jerrum. Approximate counting, uniform generation and
rapidly mixing Markov chains. Information and Computation, 82(1):93–133,
1989.
[94] Ø. Skare, E. Bølviken, and L. Holden. Improved sampling-importance resam-
pling and reduced bias importance sampling. Scandinavian Journal of Statistics,
30(4):719–737, 2003. ISSN 1467-9469.
[95] DA Spielman. Spectral graph theory and its applications. In Foundations of
Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on, pages
29–38, 2007.
[96] M.A. Tanner and W.H. Wong. The calculation of posterior distributions by
data augmentation. Journal of the American statistical Association, 82(398):
528–540, 1987.
[97] A.C. Thomas and J.K. Blitzstein. The effect of censoring out-degree on network
inferences. Unpublished manuscript, 2009.
[98] L.G. Valiant. The complexity of computing the permanent. Theoretical com-
puter science, 8(2):189–201, 1979.
[99] D.A. van Dyk and X.L. Meng. The art of data augmentation. Journal of
Computational and Graphical Statistics, 10(1):1–50, 2001.
[100] J. Vermaak, A. Doucet, and P. Perez. Maintaining multi-modality through
mixture tracking. In International Conference on Computer Vision, volume 2,
pages 1110–1116. Citeseer, 2003.
[101] B.N. Vo, S. Singh, and A. Doucet. Sequential Monte Carlo methods for multi-
target filtering with random finite sets. IEEE Transactions on Aerospace and
Electronic Systems, 41(4):1224–1245, 2005. ISSN 0018-9251.
[102] Andreas Wagner. How the global structure of protein interaction networks
evolves. Proceedings of the Royal Society B: Biological Sciences, 270:457–466,
Mar 2003. link.
[103] E.A. Wan and R. Van Der Merwe. The unscented Kalman filter for nonlinear
estimation. In Adaptive Systems for Signal Processing, Communications, and
Control Symposium, pages 153–158. IEEE, 2000. ISBN 0780358007.
[104] C. Wiuf, M. Brameier, O. Hagberg, and M.P.H. Stumpf. A likelihood approach
to analysis of network data. Proceedings of the National Academy of Sciences,
103(20):7566–7570, 2006.
[105] G.U. Yule. A mathematical theory of evolution, based on the conclusions of Dr.
JC Willis. FRS Philosophical Transactions B, 213(1924):21–87, 1924.