MONTE CARLO METHODS FOR STRUCTURED DATA
A DISSERTATION
SUBMITTED TO THE INSTITUTE FOR COMPUTATIONAL
AND MATHEMATICAL ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Adam Guetz
January 2012
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/rg833nw3954
© 2012 by Adam Nathan Guetz. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Susan Holmes, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Amin Saberi, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Peter Glynn
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Recent years have seen an increased need for modeling of rich data across many engi-
neering and scientific disciplines. Much of this data contains structure, or non-trivial
relationships between elements, that should be exploited when performing statistical
inference. Sampling from and fitting complicated models present challenging com-
putational issues, and available deterministic heuristics may be ineffective. Monte
Carlo methods present an attractive framework for finding approximate solutions to
these problems. This thesis covers two closely related techniques: adaptive impor-
tance sampling, and sequential Monte Carlo. Both of these methods make use of
sampling-importance resampling to generate approximate samples from distributions
of interest.
Sequential importance sampling is well known to have difficulties in high-dimensional
settings. I present a technique called conditional sampling-importance resampling,
an extension of sampling importance resampling to conditional distributions that
improves performance, particularly when independence structure is present. The
primary application is to multi-object tracking for a colony of harvester ants in a
laboratory setting. Previous approaches tend to make simplifying parametric as-
sumptions on the model in order to make computations more tractable, while the
approach presented finds approximate solutions to more complicated and realistic
models. To analyze structural properties of networks, I expand adaptive importance
sampling techniques to the analysis of network growth models such as preferential
attachment, using the Plackett-Luce family of distributions on permutations, and I
present an application of sequential Monte Carlo to a special form of network growth
model called vertex censored stochastic Kronecker product graphs.
Acknowledgements
I’d like to thank my wife Heidi Lubin, my son Levi, my principal advisor Susan
Holmes, my co-advisor Amin Saberi, my parents, and all of my friends and extended
family.
Contents
Abstract iv
Acknowledgements v
1 Introduction 1
1.1 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Approximate Sampling 6
2.1 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Effective Sample Size . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Sampling Importance Resampling . . . . . . . . . . . . . . . . 11
2.2 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Metropolis Hastings . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.5 Hit-and-Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Sequential Monte Carlo 18
3.1 Sequential Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Sequential Importance Sampling . . . . . . . . . . . . . . . . . . . . . 20
3.3 Particle Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Adaptive Importance Sampling 24
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.1 Variance Minimization . . . . . . . . . . . . . . . . . . . . . . 25
4.1.2 Cross-Entropy Method . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Avoiding Degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Annealed Importance Sampling . . . . . . . . . . . . . . . . . 30
4.3.2 Population Monte Carlo . . . . . . . . . . . . . . . . . . . . . 31
5 Conditional Sampling Importance Resampling 33
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Conditional Resampling . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.1 Estimating Marginal Importance Weights . . . . . . . . . . . . 36
5.2.2 Conditional Effective Sample Size . . . . . . . . . . . . . . . . 36
5.2.3 Importance Weight Accounting . . . . . . . . . . . . . . . . . 37
5.3 Example: Multivariate Normal . . . . . . . . . . . . . . . . . . . . . . 38
6 Multi-Object Particle Tracking 43
6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1.1 Single Object Tracking . . . . . . . . . . . . . . . . . . . . . . 43
6.1.2 Multi Object Tracking . . . . . . . . . . . . . . . . . . . . . . 45
6.1.3 Tracking Notation . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Conditional SIR Particle Tracking . . . . . . . . . . . . . . . . . . . . 47
6.2.1 Grouping Subsets for Multi-Object Tracking . . . . . . . . . . 48
6.3 Application: Tracking Harvester Ants . . . . . . . . . . . . . . . . . . 49
6.3.1 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3.2 Observation Model . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3.3 State-Space Model . . . . . . . . . . . . . . . . . . . . . . . . 53
6.3.4 Importance Distribution . . . . . . . . . . . . . . . . . . . . . 54
6.3.5 Computing Relative and Marginal Importance Weights . . . . 62
6.4 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4.1 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4.2 Short Harvester Ant Video . . . . . . . . . . . . . . . . . . . . 65
7 Network Growth Models 70
7.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.1.1 Erdos-Renyi . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.1.2 Preferential Attachment . . . . . . . . . . . . . . . . . . . . . 73
7.1.3 Duplication/Divergence . . . . . . . . . . . . . . . . . . . . . 75
7.2 Computing Likelihoods with Adaptive Importance Sampling . . . . . 75
7.2.1 Marginalizing Vertex Ordering . . . . . . . . . . . . . . . . . . 78
7.2.2 Plackett-Luce Model as an Importance Distribution . . . . . . 79
7.2.3 Choice of Description Length Function . . . . . . . . . . . . . 80
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.3.1 Modified Preferential Attachment Model . . . . . . . . . . . . 81
7.3.2 Adaptive Importance sampling . . . . . . . . . . . . . . . . . 82
7.3.3 Annealed Importance sampling . . . . . . . . . . . . . . . . . 82
7.3.4 Computational Effort . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.5 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . 84
8 Kronecker Product Graphs 91
8.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2 Stochastic Kronecker Product Graph model . . . . . . . . . . . . . . 94
8.2.1 Likelihood under Stochastic Kronecker Product Graph model . 94
8.2.2 Sampling Permutations . . . . . . . . . . . . . . . . . . . . . . 96
8.2.3 Computing Gradients . . . . . . . . . . . . . . . . . . . . . . . 96
8.3 Vertex Censored Stochastic Kronecker Product Graphs . . . . . . . . 97
8.3.1 Importance Sampling for Likelihoods . . . . . . . . . . . . . . 98
8.3.2 Choosing Censored Vertices . . . . . . . . . . . . . . . . . . . 100
8.3.3 Sampling Permutations . . . . . . . . . . . . . . . . . . . . . . 100
8.3.4 Multiplicative Attribute Graphs . . . . . . . . . . . . . . . . . 101
8.4 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 102
List of Tables
6.1 Observation event types. . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1 Comparison of estimators for sparse 500 node preferential attachment
dataset from Figure 7.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Comparison of estimators for dataset: 5 networks, 30 nodes each, av-
erage degree 2, 20 samples each method . . . . . . . . . . . . . . . . . 86
7.3 Comparison of estimators for dataset: 2 networks, 100 nodes each,
average degree 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.4 Estimated log-likelihoods for Mus Musculus protein-protein interaction
networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
List of Figures
3.1 Dependence structure of hidden Markov models . . . . . . . . . . . . 19
5.1 CSIR Normal example: eigenvalues of covariance matrices . . . . . . 40
5.2 CSIR Normal example: estimated KL-divergences . . . . . 41
5.3 Same experiments as in Figure 5.2, plotted by method. . . . . . . . . 42
6.1 Example grouping subset functions . . . . . . . . . . . . . . . . . . . 49
6.2 Blob bisection via spectral partitioning . . . . . . . . . . . . . . . . . 52
6.3 Association of objects with observations. ’Events’ correspond to con-
nected components in this bipartite graph, including Normal obser-
vations, splitting, merging, false positives, false negatives, and joint
events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.4 “True” distribution of path lengths and trajectories per frame, simu-
lated example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5 Centroid observations per frame, simulated example. . . . . . . . . . 66
6.6 Distribution of path lengths and trajectories per frame using a sample
from the importance distribution, simulated example. . . . . . . . . . 67
6.7 Distribution of path lengths and trajectories per frame using CSIR,
simulated example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.8 GemVident screenshot, showing centroids. . . . . . . . . . . . . . . . 68
6.9 Centroid observations per frame from Harvester ant example. . . . . . 68
6.10 Distribution of path lengths and trajectories per frame using a sample
from the importance distribution, Harvester ant example. . . . . . . . 69
6.11 Distribution of path lengths and trajectories per frame using CSIR,
Harvester ant example. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.1 Example runs comparing annealed importance sampling and adaptive
importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2 Likelihoods and importance weights for cross-entropy method. . . . . 88
7.3 Mus. Musculus (common mouse) PPI network. . . . . . . . . . . . . 89
7.4 Convergence of adaptive importance sampling and annealed impor-
tance sampling for Mus. Musculus PPI network. . . . . . . . . . . . . 90
8.1 Comparison of crude and SIS Monte Carlo for Kronecker graph likeli-
hoods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.2 Comparison of SKPG and VCSKPG models for AS-ROUTEVIEWS
graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Chapter 1
Introduction
Contemporary data analysis often involves information with complicated and
high-dimensional relationships between elements. Traditional, deterministic analytic
techniques are often unable to directly cope with the computational challenge, and
must make simplifying assumptions or heuristic approximations. An attractive alternative is the suite of randomized methods known as Monte Carlo. The types of
problems examined in this thesis often contain both discrete and continuous compo-
nents, and can generally be expressed as or related to integral or summation type
problems. Suppose one wishes to compute some quantity µ defined as
µ = ∫_Ω X(ω) P(dω).  (1.1)
If X : Ω → R is the random variable defined on the probability space (Ω, Σ, P), then this
can be equivalently expressed as the expected value
µ = E[X]. (1.2)
In some cases, µ can be computed exactly using analytic techniques. For many
examples this is not possible and one must resort to methods of approximation. Deter-
ministic numerical integration, or quadrature, generally has good convergence prop-
erties for low and moderate dimensional integrals. However, the computational com-
plexity of quadrature increases exponentially in the dimension of the sample space
CHAPTER 1. INTRODUCTION 2
Ω, making high-dimensional inference computationally intractable. This general phe-
nomenon is known as the curse of dimensionality [11], and can be explained in terms
of the relative “sparseness” of high-dimensional space. Monte Carlo integration can
be a viable alternative to quadrature in these settings, as it converges at a rate proportional to the square root of the sample size regardless of dimension.
1.1 Monte Carlo Integration
Given N independent, identically distributed random variables X_1, . . . , X_N with E[X²] < ∞, let X̄_N = (∑_{i=1}^N X_i)/N. The strong law of large numbers gives

X̄_N →a.s. µ,  (1.3)

where →a.s. denotes almost sure convergence. This provides the motivation to use x̄_N as an approximation to µ. Using x̄_N to approximate µ is known as Monte Carlo
Integration. Notationally, in this thesis bold uppercase letters such as X indicate
random variables, while lowercase letters such as x indicate observations of those
random variables.
Under the above conditions, the central limit theorem states that

X̄_N − µ →d N(0, σ²_X/N),  (1.4)

where →d denotes convergence in distribution, and σ²_X = var[X]. Roughly speaking, this means that in the limit, convergence of X̄_N to µ occurs at a √N rate. In other words, to get another digit of accuracy (a factor of 10), one would need 10² = 100 times
as many samples. While this rate of convergence is unappealing for low-dimensional
integrals, the rate holds regardless of the sample space Ω, therefore allowing one to
circumvent the curse of dimensionality for high-dimensional problems.
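As a concrete illustration of the method, the following minimal Python sketch computes a crude Monte Carlo estimate; the integrand E[U²] and the sample size are illustrative choices, not examples taken from the text.

```python
import random

def mc_estimate(f, sampler, n, seed=0):
    """Crude Monte Carlo: average f over n iid draws from sampler."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# Example: mu = E[U^2] for U ~ Uniform(0, 1); the exact value is 1/3.
est = mc_estimate(lambda u: u * u, lambda rng: rng.random(), 100_000)
# The root-mean-square error of est shrinks like 1/sqrt(N), regardless of dimension.
```

Doubling the number of correct digits requires squaring the number of samples, which is exactly the √N behavior described above.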
This formulation, however, hides some of the additional complexity inherent in
Monte Carlo integration. One difficulty is that the variance σ²_X may grow exponentially in the dimension of the sample space Ω. In these cases, the Monte Carlo
guarantee of quadratic convergence is unhelpful since the starting point is an esti-
mator with exponentially high variance. To make Monte Carlo methods practical in
this case, it is necessary to employ variance reduction techniques. For comprehensive
reviews of variance reduction techniques, see Liu [68] or Asmussen and Glynn [6].
The primary method for variance reduction used in this text is importance sampling,
where instead of directly sampling the random variable X to estimate (1.1), one samples from some biased random variable Y and corrects for the bias. See
§2.1 for background on importance sampling.
Another potential source of complexity is in the generation of independent random
draws from the sample space ω ∈ Ω and the computation of the random variable X(ω).
For many commonly occurring problems, the best known algorithms for sampling
exactly from the sample space of interest take exponential (or worse) time. In these
cases the only feasible alternative is to use approximate sampling techniques. In
this thesis, I will examine two main techniques for approximate sampling: sampling importance resampling (SIR) and Markov chain Monte Carlo (MCMC). Background for these techniques is covered in §2. Advanced topics covered include sequential Monte Carlo (§3), including particle filtering, and adaptive importance sampling (§4).
1.2 Applications
In §6, techniques are discussed for tracking large numbers of possibly interacting ob-
jects. In multi-object tracking (also known as multi-target tracking), the sequential
process of interest can be represented as a hidden Markov model (HMM). The primary
goal is to make inferences about the hidden state. There are many techniques avail-
able for tracking multiple targets, but almost all of those currently available either
make broad simplifying assumptions or are infeasible for large problem sizes. Gener-
ally it is preferable to use models of movement and observation that are as realistic as
possible while still permitting scalable analysis. One can generally state the problem
of inference as being equivalent to sampling from a posterior distribution, but in the
tracking models studied in this thesis it is generally impossible to make exact draws
from the posterior distribution. Instead, sequential Monte Carlo techniques are used
to draw approximate samples. The standard particle filter doesn’t typically work
well with high-dimensional models due to degeneracy issues, where the bulk of the
importance weight mass is concentrated on a small number of particles. Part of the
degeneracy issue can be resolved using standard SIR, however in the high-dimensional
multi-target tracking example studied in §6, it is insufficient. To address this issue, §5 introduces a conditional sampling importance resampling step that
takes advantage of inherent independence structures in the model. When individ-
ual targets are far from one another, conditional SIR effectively admits a separate
particle filter for each target while asymptotically maintaining the correct joint dis-
tribution. An implementation of this algorithm with empirical examples of tracking
the movements of harvester ants in a laboratory setting is given.
Adaptive importance sampling is a technique similar in many respects to the par-
ticle filter. In §7 an application of adaptive importance sampling to inference for
network growth models is studied. Network growth models are models of network
creation in which a new vertex arrives and attaches edges to pre-existing vertices
according to a rule that depends on the current state of the network. In the application considered, the goal is to estimate the likelihood that a given network originated from the growth model, for the purpose of model selection. A primary technical
difficulty is that one typically doesn’t know the order in which vertices joined the
network. A priori, each ordering is equally likely, so one needs to consider all possible
permutations of orderings to make valid inferences. This quantity can be expressed
as a summation over all permutations, and can be represented as estimating the nor-
malizing constant of the distribution that has probability proportional to the model
likelihood for each permutation. Since there are a factorial number of permutations,
for moderate numbers of vertices direct summation is infeasible and one must resort
to approximation techniques. In this setting, crude Monte Carlo tends to work poorly,
since most of the likelihood is concentrated on a vanishingly small subset of permutations. To reduce the variance of the estimator, adaptive importance sampling is used, with importance distributions selected from the Plackett-Luce family of permutation
distributions. This is a novel use of this family of distributions in the importance
sampling context. An example is given using the technique on a modified version of
preferential attachment.
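Concretely, a Plackett-Luce distribution over permutations can be sampled sequentially: each position of the permutation is filled by drawing one of the remaining items with probability proportional to an item weight. A minimal Python sketch follows (an illustrative parameterization; the exact form used later in the text may differ), together with the exact log-probability needed when the model serves as an importance distribution.

```python
import math
import random

def sample_plackett_luce(weights, rng):
    """Draw a permutation of range(len(weights)): fill each position by
    choosing among the remaining items with probability proportional to weight."""
    remaining = list(range(len(weights)))
    perm = []
    while remaining:
        ws = [weights[i] for i in remaining]
        j = rng.choices(range(len(remaining)), weights=ws, k=1)[0]
        perm.append(remaining.pop(j))
    return perm

def plackett_luce_logprob(perm, weights):
    """Exact log-probability of perm under the same model, usable as the
    denominator of an importance weight."""
    rem_total = sum(weights[i] for i in perm)
    lp = 0.0
    for i in perm:
        lp += math.log(weights[i] / rem_total)
        rem_total -= weights[i]
    return lp
```

With all weights equal, every permutation receives probability 1/n!, recovering the uniform distribution over orderings.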
In §8 inference in a special type of network growth model known as stochastic
Kronecker product graph (SKPG) is discussed. These models have a simple formu-
lation that permits relatively efficient estimation of maximum likelihood parameters.
The SKPG model is a generalization of the Erdos-Renyi G(n, p) model and implicitly constructs a matrix of Bernoulli edge probabilities using Kronecker products of smaller
seed matrices. These models suffer from the same difficulty as other network growth
models in that, in order to compute the likelihood, one needs to sum over all possible vertex labeling permutations, but SKPGs have the advantage that it is relatively easy
to compute the normalizing constant. To address the permutation issue, Leskovec and
Faloutsos [64] use a Markov chain Monte Carlo algorithm over permutation space.
However, these models encounter difficulties when there is a mismatch between the
model dimensions and the number of vertices in the network data. To address these
issues, the vertex censored stochastic Kronecker product graph (VCSKPG) model is
introduced, which allows more flexibility in the allowable number of model vertices.
A sequential importance sampling scheme is proposed to perform efficient parameter
fitting and likelihood estimation for this model.
Chapter 2
Approximate Sampling
In many application settings, it is desirable to sample from a distribution of interest
π but it is impossible to do so in a reasonable amount of time, i.e. sampling from π is
computationally intractable. Approximate sampling refers to a set of methods that at-
tempt the next best thing, which is to sample from some distribution γ that is in some
sense “close” to π. There is a strong connection between approximate sampling and
estimation problems. In particular, Jerrum et al. [53] were able to give a polynomial-
time reduction between almost uniform sampling and approximate counting. Another way to view this relationship is through the well-known importance sampling identity
(2.6), which gives a zero-variance estimator when sampling from the optimal impor-
tance distribution γ∗, and low-variance estimates when approximately sampling from
γ∗.
By “closeness” of γ to π, one can either use metrics between probability distributions, such as the total variation distance,

d_TV(π, γ) = sup_{A⊆Ω} |π(A) − γ(A)|,  (2.1)
CHAPTER 2. APPROXIMATE SAMPLING 7
or pseudo-metrics such as the Kullback-Leibler divergence,

d_KL(π‖γ) = E_π[log(π(X)/γ(X))]  (2.2)
          = E_π[log π(X)] − E_π[log γ(X)],  (2.3)

where the notation E_π indicates that the random variable X is distributed according to π. Although Kullback-Leibler divergence is not a true distance function, as it is not symmetric (d_KL(π‖γ) ≠ d_KL(γ‖π) in general), it does have the property that d_KL(π‖γ) = 0 iff π(x) = γ(x) for all x with nonzero measure. The two terms on the RHS of (2.3) are the negative entropy of π, E_π[log π(X)], roughly representing sample diversity, and the negative cross-entropy of γ relative to π, E_π[log γ(X)], representing the “goodness of fit” of γ to π.
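As a concrete check of (2.2)–(2.3), the KL divergence between two univariate normal distributions has a closed form against which a simple Monte Carlo estimate of E_π[log π(X) − log γ(X)] can be compared; the particular normals below are illustrative choices.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2.0 * sigma * sigma)

def kl_exact(mu_p, s_p, mu_q, s_q):
    """Closed-form KL divergence between N(mu_p, s_p^2) and N(mu_q, s_q^2)."""
    return math.log(s_q / s_p) + (s_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * s_q ** 2) - 0.5

def kl_mc(mu_p, s_p, mu_q, s_q, n=100_000, seed=0):
    """Monte Carlo estimate of E_pi[log pi(X) - log gamma(X)] with X ~ pi."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_p, s_p)    # sample from pi
        total += normal_logpdf(x, mu_p, s_p) - normal_logpdf(x, mu_q, s_q)
    return total / n
```

Evaluating both orderings of the arguments makes the asymmetry d_KL(π‖γ) ≠ d_KL(γ‖π) visible numerically.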
2.1 Importance Sampling
Suppose that there exists a random variable X : Ω 7→ R, and one wishes to compute
the expected value of X,
µ ≡ E[X] = ∫_Ω X(ω) π(dω),  (2.4)
where π is the target distribution. In many cases, computing µ exactly is intractable
since it is necessary to integrate over the entire state space Ω. For example, the
problem of approximating the permanent of a matrix [52] can be represented as
perm(A) =1
n!E
[n∏i=1
ai,X(i)
], (2.5)
where X is a random permutation. Computing the permanent is a computationally
difficulty and is known to be #P-complete [98]. #P-complete comprises a set of
counting problems with no known polynomial-time algorithms, and can be thought
of as the counting analog of NP-complete problems. One way to estimate such prob-
lems is through crude Monte Carlo simulations, drawing N independent, identically
distributed (iid) samples x_1, . . . , x_N, then computing x̄_N. The estimator X̄_N may, however, have unacceptably large variance, exponentially high in the case of approximating the permanent.
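A crude Monte Carlo estimator of the permanent follows directly from this representation: average ∏_i a_{i,σ(i)} over uniformly random permutations σ and scale by n!. A minimal sketch on a toy matrix (an illustrative example only; for realistic matrices the variance of this estimator is exponentially large, which is what motivates importance sampling below):

```python
import math
import random

def perm_mc(A, n_samples=50_000, seed=1):
    """Crude Monte Carlo estimate of perm(A) = n! * E[prod_i a_{i, X(i)}],
    with X a uniformly random permutation."""
    n = len(A)
    rng = random.Random(seed)
    idx = list(range(n))
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(idx)            # uniform random permutation
        prod = 1.0
        for i in range(n):
            prod *= A[i][idx[i]]
        total += prod
    return math.factorial(n) * total / n_samples

# For the 2x2 matrix [[1, 2], [3, 4]], perm = 1*4 + 2*3 = 10.
```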
One practical variance reduction technique is known as importance sampling (IS).
Importance sampling builds an estimator by sampling from a biased distribution in
which the “important” or more heavily weighted states are visited more frequently.
Helpful background references for importance sampling include Evans and Swartz [35],
Asmussen and Glynn [6], Liu [68], and Robert and Casella [85]. Suppose there exist
random variables Y, Z, such that Y(ω) = X(ω) and Z(ω) = X(ω) π(dω)/γ(dω). Importance sampling is based on the following simple identity:

E[X] = ∫_Ω X(ω) (π(dω)/γ(dω)) γ(dω) = E[Z].  (2.6)

The importance sampling identity (2.6) holds as long as Z is well defined, i.e. γ(dω) = 0 =⇒ π(dω) = 0 for ω ∈ Ω. Here γ is the importance distribution, and the ratio W(ω) ≡ π(dω)/γ(dω) is the importance weight of ω. One can draw N iid samples of Z and use µ̂_IS ≡ Z̄_N as the unbiased importance estimator of µ. If var[Z] < var[X], Z̄_N will be a better estimate of µ than X̄_N. A primary challenge in importance sampling is choosing an importance distribution γ that minimizes var[Z].
Practically, it is often the case that one knows π and/or γ only up to a constant factor: π(dω) = f(dω)/C_X and γ(dω) = g(dω)/C_Y, where C_X and C_Y are the normalizing constants of f and g. The ratio of the normalizing constants is denoted C ≡ C_X/C_Y, with the random variables W̃(ω) ≡ f(dω)/g(dω) and Z̃(ω) ≡ Y(ω)W̃(ω) = C Z(ω).

Since E[W] = 1, one can build an unbiased estimator of the ratio C through draws of the unnormalized importance ratios, via their sample mean W̃_N. Using the same samples ω_1, . . . , ω_N to compute W̃_N as for Z̃_N leads to the biased importance estimator

µ̂_BIS ≡ Z̃_N(ω_1, . . . , ω_N) / W̃_N(ω_1, . . . , ω_N).  (2.7)

Although biased for finite sample sizes, µ̂_BIS is asymptotically unbiased and does not require normalizing constants.
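A minimal sketch of the ratio estimator (2.7), with illustrative choices of target and proposal: the target π = N(2, 1) is represented only by an unnormalized density f, the proposal is N(0, 2) with unnormalized density g, and only the ratio f/g ever enters the computation.

```python
import math
import random

def snis_mean(n=200_000, seed=0):
    """Self-normalized importance sampling estimate of E_pi[X] for
    pi = N(2, 1), using only unnormalized densities."""
    rng = random.Random(seed)
    f = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)   # unnormalized target density
    g = lambda x: math.exp(-x * x / 8.0)            # unnormalized N(0, 2) proposal density
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)     # sample the proposal
        w = f(x) / g(x)             # unnormalized importance weight
        num += w * x
        den += w
    return num / den                # ratio estimator, cf. (2.7)
```

The dropped constants (here the two Gaussian normalizers) cancel in the ratio, which is exactly why µ̂_BIS is useful when normalizing constants are unknown.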
The optimal sampling distribution γ* is one for which E[Z²] is smallest. As per Rubinstein and Kroese [89], E[Z²] is minimized with γ*(dω) ∝ |X(ω)|π(dω). This gives

E[Z*²] = ∫_Ω Z²(ω) γ*(dω)  (2.8)
       = (∫_Ω |X(ω)| π(dω))² = E[|X|]².  (2.9)

In particular, if X ≥ 0, γ* provides a zero-variance estimator, since then E[|X|]² = µ². Direct computation of the optimal sampling distribution is typically impossible, since it relies on ∫_Ω |X(ω)|π(dω) as a normalizing constant, which is the quantity to be estimated in the first place. However, it is often helpful to use γ* as a guide to construct “good” importance sampling distributions.
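As an illustration of using γ* as a guide, consider estimating the small tail probability P(X > 4) for X ∼ N(0, 1) (an illustrative example): crude Monte Carlo needs on the order of 1/P(X > 4) ≈ 30,000 draws to see even one success, whereas a proposal shifted into the region where the integrand is nonzero concentrates samples where they matter.

```python
import math
import random

def is_tail_prob(t=4.0, n=200_000, seed=0):
    """Importance sampling estimate of P(X > t) for X ~ N(0, 1), using the
    shifted proposal Y ~ N(t, 1) so that samples land in the tail."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = rng.gauss(t, 1.0)       # draw from the biased distribution
        if y > t:
            # importance weight phi(y) / phi(y - t) = exp(t^2/2 - t*y)
            total += math.exp(t * t / 2.0 - t * y)
    return total / n

estimate = is_tail_prob()
```

The proposal is not the exact γ* ∝ 1{x > t}φ(x), but it mimics its shape, and the resulting estimator has a relative error of well under one percent at this sample size.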
2.1.1 Effective Sample Size
An important concept when using importance sampling is degeneracy. Degeneracy
occurs when the bulk of the importance weight mass is concentrated in one or a small
number of importance samples. This means that the Monte Carlo approximation
will be dominated by a small subset of samples, so the “effective” number of samples
that we are using to compute the approximation is small. A standard measure of
degeneracy for importance sampling methods is therefore the effective sample size
[61],
ESS = N/(1 + var[W]) = N/E[W²],  (2.10)

where the second equality follows since E[W] = 1 and var[W] = E[W²] − 1; equivalently, 1 + var[W] = 1 + cv², where cv refers to the coefficient of variation of the importance weights (cv = √(var[W])/E[W]). One justification for the use of the effective
sample size comes from Liu [67], based on a note from Kong [61], using the delta
method, and states that
var[X̄_N] / var[µ̂_BIS,N] ≈ 1/(1 + var[W]).  (2.11)
This goes as follows. First note that using the standard delta method for ratio statistics [21] gives

var[µ̂_BIS,N] ≈ (1/N)(var[Z] + µ² var[W] − 2µ cov(Z, W)).  (2.12)

Writing W = π(X)/γ(X) and taking the expectations and covariances in (2.13)–(2.16) under π (so that E_π[W] = E_γ[W²] = 1 + var[W]), further note that

cov(Z, W) = E_π[X π(X)/γ(X)] − µ  (2.13)
          = cov_π(π(X)/γ(X), X) + µ E_π[π(X)/γ(X)] − µ,  (2.14)

and that

var[Z] = E_π[X² π(X)/γ(X)] − µ²  (2.15)
       ≈ E_π[X]² E_π[π(X)/γ(X)] + var(X) E_π[π(X)/γ(X)] + 2µ cov_π(π(X)/γ(X), X) − µ².  (2.16)

Applying this to (2.12) gives

var[µ̂_BIS,N] ≈ (1/N) var(X)(1 + var(W)),  (2.17)

which yields (2.11). Note that the remainder term in (2.16) is

E_π[(π(X)/γ(X) − E_π[π(X)/γ(X)])(X − µ)²],  (2.18)

which can be large depending on the distribution of X.
Typically in practice one knows neither the true variance of the importance weights
nor the normalizing constants, so it is common to use the empirical unnormalized
effective sample size,
ESS = (∑_{i=1}^N W̃(ω_i))² / ∑_{i=1}^N W̃(ω_i)²  (2.19)
as a heuristic measure of degeneracy.
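In practice (2.19) is a one-line computation from the unnormalized weights; a minimal sketch:

```python
def ess(weights):
    """Empirical effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

# Equal weights give ESS = N (no degeneracy); a single dominant
# weight drives ESS toward 1.
```

The quantity is invariant to rescaling the weights, so no normalizing constants are needed, matching the heuristic use described above.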
2.1.2 Sampling Importance Resampling
Sampling importance resampling (SIR) is a technique for approximate sampling based
on importance weights. Although not widely used in general approximate sampling
settings, where Markov chain Monte Carlo methods are often more easily implemented and more efficient, sampling importance resampling has found a niche in sequential
settings where Markov chain based approximate sampling methods carry a heavier
computational burden. Sequential importance sampling is discussed further in §3.2.
The goal of resampling in the sequential setting is generally to reduce degeneracy. The
idea for SIR came originally from Rubin [87], and was introduced in the sequential
context by Gordon et al. [42].
SIR uses samples drawn from an importance distribution γ to approximately draw
samples from the target distribution π. The procedure to draw M importance sam-
ples is as follows. Given N samples y(1), . . . , y(N) drawn according to γ, compute
the importance weights w(i) for each sample. Choose an index i with probability
proportional to w(i) and assign x = y(i). This process is repeated M times (with
replacement) to generate a collection of approximate samples, {x^(i)}_{i=1}^M. The quality of the approximate samples depends on the sample size N and how closely γ
matches π. Assuming mild regularity conditions on the importance distribution γ,
asymptotic convergence to the target distribution can be shown (see Asmussen and Glynn [6], p. 387). Since only relative importance weights are needed for resampling,
normalizing constants are not needed.
One possible use of the collection of approximate samples {x^(i)}_{i=1}^N is to construct
Algorithm 1 Sampling Importance Resampling (SIR)
Draw N samples y^(1), . . . , y^(N) ∼ P_Y.
Compute importance weights {w^(i)}_{i=1}^N.
Draw M samples x^(1), . . . , x^(M) with replacement from {y^(i)}_{i=1}^N with probabilities proportional to {w^(i)}_{i=1}^N.
an estimator of E[X]
µ̂_SIR = N^{−1} ∑_{i=1}^N x^(i).  (2.20)
However, this estimator is not very useful in practice, as µ̂_SIR will always have higher variance than µ̂_IS due to the additional variance from multinomial sampling. A
better resampling method that can reduce the multinomial noise, known as residual resampling, introduced by Liu and Chen [69], is as follows. First, normalize the importance weights to sum to one: w̄^(i) = w^(i)/∑_{j=1}^N w^(j). Then take ⌊M w̄^(j)⌋ copies of each sample j, for a total of k = ∑_{j=1}^N ⌊M w̄^(j)⌋ samples, set w^(j)′ ← M w̄^(j) − ⌊M w̄^(j)⌋, renormalize these residual weights, and take the remaining M − k samples with replacement from the resulting multinomial distribution. This procedure makes the same expected number of copies of each sample, but if k is large it can greatly reduce the multinomial noise.
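The two procedures above can be sketched as follows (an illustrative implementation; function and variable names are my own):

```python
import random

def sir(samples, weights, m, rng):
    """Basic SIR: draw m samples with replacement, probability proportional to weight."""
    idx = rng.choices(range(len(samples)), weights=weights, k=m)
    return [samples[i] for i in idx]

def residual_resample(samples, weights, m, rng):
    """Residual resampling (after Liu and Chen): take floor(m * wbar_j)
    deterministic copies of sample j, then draw the remaining m - k
    samples multinomially from the residual weights."""
    total = sum(weights)
    wbar = [w / total for w in weights]
    out, residual = [], []
    for x, w in zip(samples, wbar):
        c = int(m * w)              # floor(m * wbar_j), since m * w >= 0
        out.extend([x] * c)
        residual.append(m * w - c)  # leftover (residual) weight mass
    if len(out) < m:
        out.extend(sir(samples, residual, m - len(out), rng))
    return out
```

When the weights are nearly uniform, most copies are produced deterministically and the multinomial step touches only the small residual mass, which is the source of the variance reduction.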
Another variation of SIR introduced by Skare et al. [94] uses modified importance
weights. Instead of choosing samples with probability proportional to w̄^(i), one can use weights proportional to w̄^(i)/(1 − w̄^(i)). Using these weights with M fixed and letting N → ∞, Skare et al. [94] were able to show pointwise convergence of the distribution of the resampled draws to the target at rate O(N^{−1}) when sampling with replacement and O(N^{−2}) when sampling without replacement. The idea of sampling without replacement for SIR when M ≪ N originally comes from Gelman [38], and intuitively can be thought of as producing an
“intermediate representation” between the sampling and target distributions.
2.2 Markov Chain Monte Carlo
Markov chain Monte Carlo is among the most widely used methods for difficult ap-
proximate sampling and estimation problems. This is due to the fact that, for a wide
variety of distributions, it is often easy to design an ergodic Markov chain that holds
a target distribution π as its stationary distribution [4], which can then be used to
draw approximate samples from π and to construct Monte Carlo estimators. Tech-
niques for designing such Markov chains include the Metropolis-Hastings algorithm,
data augmentation, the Gibbs sampler, and the hit-and-run algorithm. For further
background on Markov chain Monte Carlo methods, refer to Liu [68], Asmussen and
Glynn [6], and Robert and Casella [84].
2.2.1 Markov Chains
A discrete-time stochastic process X_{1:t} is a Markov process if the distribution of X_t
given the most recent state x_{t−1} is independent of all the previous states, or

    P(X_t | x_{1:t−1}) = P(X_t | x_{t−1}).    (2.21)

This property is referred to as memorylessness. If the possible
values for each X_t form a countable space, then the process is known as a
Markov chain. Markov chains are defined by a kernel K(v, v′) specifying the relative
probability that X_{t+1} = v′ given that x_t = v. One way to think of a Markov chain
is as a random walk on a weighted directed graph G(V, E) with non-negative edge
weights. Possible states are represented by vertices, and the next state x_{t+1} given
the current x_t is chosen with probability proportional to edge weights. The expected
return time of a state v ∈ V is the expected number of steps for the Markov chain
starting in state v to return to v. If a state has a finite expected return time, it
is positive recurrent. The periodicity of a state v is the greatest common divisor
amongst all possible return times; if the periodicity of v is 1 then v is said to be
aperiodic. If all states v ∈ V are positive recurrent and aperiodic, then the Markov
chain is said to be ergodic, and admits a unique stationary distribution π, such that
K^k(v, v′) → π(v′) as k → ∞.
2.2.2 Metropolis Hastings
The Metropolis-Hastings algorithm [73, 46] was one of the first Markov chain Monte
Carlo methods proposed and used in practice, and is still one of the most commonly
used forms. Its popularity can be attributed to the simplicity of designing efficient
mechanisms to approximately sample from an arbitrary distribution π. The two
main requirements for the algorithm are that the relative probability π(v′)/π(v) of any
two states v, v′ under π can be computed, and that an ergodic proposal
Markov chain on the state space with transition kernel K can be sampled from. No
normalizing constants are required, and the algorithm works extremely well for many
applications [27]. The procedure is as follows. Starting from a state v, take a step in
the proposal chain K to state v′, and compute the acceptance ratio

    a = [π(v′) K(v′, v)] / [π(v) K(v, v′)].    (2.22)

If a ≥ 1, move to state v′; otherwise draw a uniform random variable u on [0, 1] and
move to state v′ if a > u, else stay at v.
One potential difficulty with this algorithm is the choice of proposal kernel
K, which should ideally be chosen such that K(v, v′) is approximately π(v′); this
can sometimes be difficult in practice. An inappropriate choice of K can give the
Metropolis chain an extremely low rate of convergence. Another potential
issue is that the Metropolis chain may not be ergodic even if the proposal chain
is. However, for many important problems these difficulties can be overcome, and
Metropolis-Hastings has proved extremely useful in a wide variety of contexts;
it has been named one of the most important algorithms of the 20th century [10].
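As a concrete illustration of the procedure above, the following sketch runs random-walk Metropolis on the cycle {0, . . . , n−1} for an unnormalized target π; because the proposal is symmetric, the kernel ratio K(v′, v)/K(v, v′) in the acceptance ratio cancels. All names here are illustrative:

```python
import random

def metropolis_hastings(pi, steps, seed=0):
    """Random-walk Metropolis on {0,...,n-1} (as a cycle) targeting the
    unnormalized distribution pi; returns visit counts per state."""
    rng = random.Random(seed)
    n = len(pi)
    v = 0
    visits = [0] * n
    for _ in range(steps):
        vp = (v + rng.choice([-1, 1])) % n   # symmetric proposal: K(v,v') = K(v',v)
        a = pi[vp] / pi[v]                   # acceptance ratio; kernel terms cancel
        if a >= 1 or rng.random() < a:
            v = vp                           # accept the move
        visits[v] += 1
    return visits

visits = metropolis_hastings([1.0, 2.0, 3.0, 2.0], 200_000)
```

The visit frequencies approach the normalized target, here (1/8, 2/8, 3/8, 2/8), even though the normalizing constant is never computed.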
2.2.3 Gibbs Sampler
The Gibbs sampler (introduced by Geman and Geman [39], see Liu [68, chapter 6] for
a good introduction) is another fundamental tool used for designing ergodic Markov
chains. Suppose that a random variable X takes values in state space Ω with probability
distribution π, and that each x ∈ Ω can be decomposed into coordinates as x = (x_1, . . . , x_p). The
Gibbs sampler is as follows. Starting from an initial point x, cycle through coordinate
indices j = 1, . . . , p, for each j sampling x_j according to the conditional distribution

    x_j ∼ π(x_j | x_{[−j]}),    (2.23)

where the notation x_{[−j]} indicates all coordinates of x except the jth, or

    x_{[−j]} := (x_1, . . . , x_{j−1}, x_{j+1}, . . . , x_p).    (2.24)
The method of cycling through coordinate indices is a design choice. If j is chosen
uniformly at random at each step, the procedure is known as random-scan Gibbs
sampling (summarized below in Algorithm 2); if instead one cycles through coordi-
nate indices in a predetermined order, it is called systematic-scan Gibbs sampling.
Denote the Markov transition kernel for random-scan Gibbs sampling as KGibbs, with
KGibbs(x, y) giving the probability density for the next state y conditioned on the
current state x. Under mild conditions the Gibbs sampler will be positive recurrent
and aperiodic (ergodic), and will therefore admit π as a stationary distribution. It is
easy to see that sampling from the conditional distribution (2.23) leaves π invariant,
so assuming ergodicity the Gibbs sampler has π as a stationary distribution.

Algorithm 2 Random-scan Gibbs sampler.
  Start from initial point x ← (x_1, . . . , x_p).
  while (not converged) do
    Pick an index j at random.
    Sample x_j ∼ π(x_j | x_{[−j]}).
  end while
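The random-scan variant in Algorithm 2 can be sketched for a toy joint distribution on {0, 1}^2, where the conditionals π(x_j | x_{[−j]}) are read off a probability table (the function and table are illustrative, not from the text):

```python
import random

def random_scan_gibbs(p, steps, seed=0):
    """Random-scan Gibbs on {0,1}^2 with joint probabilities p[x1][x2];
    returns visit counts for each joint state."""
    rng = random.Random(seed)
    x = [0, 0]
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for _ in range(steps):
        j = rng.randrange(2)                       # pick a coordinate at random
        if j == 0:                                 # pi(x_0 | x_1), up to a constant
            w = [p[0][x[1]], p[1][x[1]]]
        else:                                      # pi(x_1 | x_0), up to a constant
            w = [p[x[0]][0], p[x[0]][1]]
        x[j] = rng.choices([0, 1], weights=w)[0]   # sample the conditional
        counts[tuple(x)] += 1
    return counts

counts = random_scan_gibbs([[0.1, 0.2], [0.3, 0.4]], 100_000)
```

Only conditional (indeed, unnormalized) probabilities are ever used, yet the visit frequencies converge to the joint distribution.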
One is not restricted to sampling only from the single-variable conditional distri-
butions in (2.23). If one chooses to sample from the joint conditional distribution for
multiple coordinate indices j_1, . . . , j_m at once,

    (x_{j_1}, . . . , x_{j_m}) ∼ π(x_{j_1}, . . . , x_{j_m} | x_{[−j_1,...,−j_m]}),    (2.25)

it is known as grouped Gibbs sampling. In grouped Gibbs, instead of sampling from
the line defined by x_{[−j]} as in standard Gibbs, one samples from the hyperplane defined
by x_{[−j_1,...,−j_m]} for some set of coordinate indices j_1, . . . , j_m. The set of coordinate
indices may be chosen deterministically or according to a randomized scheme. Grouped
Gibbs typically results in a faster mixing rate relative to
standard Gibbs; see Liu [68] for details.
2.2.4 Data Augmentation
The data augmentation algorithm, proposed by Tanner and Wong [96], can be thought
of as a special case of the Gibbs sampler on a two-variable space X = (X_1, X_2), where
one is primarily interested in sampling from the first sub-variable X_1; X_2 is called
the auxiliary variable. As in the standard Gibbs sampler, we draw samples of x_1
according to π(x_1|x_2), then draw samples of x_2 according to π(x_2|x_1), and repeat
until satisfied. This yields joint samples x approximately distributed according to
π(x), and one may then “discard” the auxiliary variable to get samples x_1 ∼ π(x_1).
The “art” of data augmentation [99] is in the choice of auxiliary variable x_2, which
should ideally be chosen such that the conditional distributions π(x_1|x_2) and π(x_2|x_1) are
easily computed, and such that the resulting Markov chain is rapidly mixing.
2.2.5 Hit-and-Run
A generalized form of grouped Gibbs is known as the hit-and-run algorithm [3]: from
the current state x, one randomly chooses a subset L_k ⊂ Ω with probability w_x(L_k),
then samples the next state y according to the transition kernel K_k(x, y). The
intuition behind hit-and-run is that it is not necessary to restrict oneself to sampling
from hyperplanes of the state space; rather, one can choose arbitrary subsets L to
sample from. In order to ensure convergence to the proper stationary distribution π,
it is necessary to choose L, w, and K such that K_k(x, y) has stationary distribution
proportional to w_x(k)π(x). Note also that ergodicity of the hit-and-run Markov chain
depends on the choices of L, K, and w.
Hit-and-run requirements: choose L, w, and K such that for each subset index k, K_k(x, y) has stationary distribution proportional to w_x(k)π(x).

For each x ∈ Ω and coordinate index j, define subset indices k_{x,j} such that k_{x,j} =
k_{y,j} if and only if x_{[−j]} = y_{[−j]}. If one chooses subsets L and weights w such that

    L_{k_{x,j}} = {y : y_{[−j]} = x_{[−j]}},    (2.26)

with w_x(L_{k_{x,j}}) = 1/n and K_{k_{x,j}}(x, y) proportional to π(y) 1{y ∈ L_{k_{x,j}}}, then this is equiva-
lent to single-variable random-scan Gibbs. This choice satisfies the hit-and-run
requirements, since for each x ∈ L_k the probability of choosing L_k is 1/n. Hit-and-
run also generalizes several other popular Markov chain Monte Carlo algorithms,
including Swendsen-Wang, data augmentation, and slice sampling. See Andersen
and Diaconis [3] for more details.
Chapter 3
Sequential Monte Carlo
In many important cases, one would like to analyze and develop models for data
that are ordered or sequential in some natural way. This situation occurs in many
examples such as data where there is a time component, i.e. time-series data, but also
situations such as sampling from the space of self-avoiding walks [44, 86], contingency
tables [23], and graphs with a prescribed degree sequence [14, 9]. For inference in these
models, sequential Monte Carlo [31] methods have been developed. Sequential Monte
Carlo methods include the class of algorithms known as particle filters, which were
introduced in their current form by Gordon et al. [42] as the bootstrap filter. Particle
filters are iterative, consisting of two basic elements: sequential importance sampling
(SIS) and sampling importance resampling (SIR).
3.1 Sequential Models
First, some notation will be introduced. For each sample ω, there is a function
X : Ω → Ω^T, where Ω is the state space and T is a positive integer. Under this
formulation, X is a discrete-time stochastic process. Typical examples include cases
where the state space Ω is finite, N^n, or R^n, but more complicated examples can be
considered as well. A sample x can be written as an array, X(ω) = {X_t(ω)}_{t∈{1,...,T}}.
As a shorthand, the ω is dropped and X_{1:T} is referred to as a random variable with
sub-indices X_1, . . . , X_T.
Figure 3.1: Dependence structure of hidden Markov models
In a sequential model, it will be assumed that evaluating and generating samples
from the conditional probabilities P(xt|x1:t−1) for 1 ≤ t ≤ T is computationally
feasible. In a hidden Markov model, X is a Markov process, and there is another
coupled discrete-time stochastic process Y such that the conditional distribution of
Yt given xt is independent of the rest of the x and y variables, or
P(yt|x1:T , y1:t−1, yt+1:T ) = P(yt|xt). (3.1)
X is known as the hidden or latent process and Y is called the observation process.
A diagram showing the dependence structure of hidden Markov models is shown
in Figure 3.1. Practical applications of hidden Markov models typically involve
samples from the observation process Y, and it is desired to make inferences about
the latent process X. For example, one may wish to compute (and draw samples
from) P(X|y) or to compute the expected value of functions of X conditioned on y.
Another common problem is that the X and Y processes may be defined by some set
of parameters θ for which we would like to find maximum likelihood estimates.
In the case where the hidden process X is governed by an affine Gaussian process
and the observation model of Y_t conditioned on the hidden state x_t is also affine Gaus-
sian, one can use the Kalman filter [55] to efficiently make direct inferences about
(and sample from) X conditioned on y. The Kalman filter is an iterative procedure
that first computes E[X_t | x_{t−1}], then updates based on the current observation y_t to get
E[X_t | x_{t−1}, y_t]. This is repeated for t = 1, . . . , T. Due to the Gaussian nature of the
processes, knowing E[X_t | x_{t−1}, y_t] for each t is sufficient to compute and sample ac-
cording to the full conditional distribution P(x_{1:T} | y_{1:T}). Estimates for the respective
covariance matrices can also be determined iteratively.
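For a scalar model x_t = a x_{t−1} + w_t, y_t = x_t + v_t with Gaussian noise, the predict/update recursion just described can be sketched as follows (a minimal illustration with invented names; real applications use the full matrix form):

```python
def kalman_filter_1d(ys, a, q, r, m0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = a*x_{t-1} + N(0,q), y_t = x_t + N(0,r).
    Returns the filtered means E[X_t | y_{1:t}]."""
    m, p = m0, p0
    means = []
    for y in ys:
        m_pred = a * m                        # predict step: E[X_t | y_{1:t-1}]
        p_pred = a * a * p + q                # predictive variance
        k_gain = p_pred / (p_pred + r)        # Kalman gain
        m = m_pred + k_gain * (y - m_pred)    # update with observation y_t
        p = (1.0 - k_gain) * p_pred           # posterior variance
        means.append(m)
    return means

means = kalman_filter_1d([1.0, 1.0, 1.0], a=1.0, q=0.01, r=1.0)
```

Starting from a prior mean of 0, repeated observations at 1.0 pull the filtered mean monotonically toward 1, with the gain shrinking as the posterior variance contracts.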
The efficiency of the Kalman filter makes it useful in real-world applications where
Gaussian models are appropriate. In cases where the underlying hidden and obser-
vation processes are non-linear, versions of the Kalman filter such as the extended
Kalman filter [51] or the unscented Kalman filter [54, 103] may be useful. However,
these methods may not be applicable in cases where the underlying state-space and
observation processes are highly non-linear, or take values in non-Euclidean state
spaces such as graphs or other combinatorial objects. In these cases, particle filtering
(§3.3) offers an attractive alternative. See chapter 6 for an application of the Kalman
and particle filters to multi-object tracking.
3.2 Sequential Importance Sampling
Suppose X and Y are the latent and observation processes of a hidden Markov model.
If it is known how to sample Xt according to the law
P(xt|y1:T , x1:t−1) = P(xt|yt:T , xt−1), (3.2)
where the equality is due to the independence properties of X and Y, one can sample
directly from X|y sequentially. Note, however, that this distribution is conditioned
on all future observations y_{t:T} for each x_t. For non-Gaussian processes, sam-
pling from this optimal importance distribution is usually impractical, so instead
one can sample according to some other distribution γ and use importance sam-
pling. Denote the tth contribution to the sequential target distribution as π_t(x) ≡
P(x_t, y_t | x_{t−1}) / P(y_t | y_{1:t−1}), and the target distribution of the first t states given the
first t observations as π_{1:t}(x) ≡ P(x_{1:t} | y_{1:t}). π_{1:t} can be built sequentially as follows:

    π_{1:t}(x) = P(y_t | x_{1:t}, y_{1:t−1}) P(x_{1:t} | y_{1:t−1}) / P(y_t | y_{1:t−1})    (3.3)
             = [P(y_t | x_t) P(x_t | x_{t−1}) / P(y_t | y_{1:t−1})] P(x_{1:t−1} | y_{1:t−1})    (3.4)
             = [P(x_t, y_t | x_{t−1}) / P(y_t | y_{1:t−1})] π_{1:t−1}(x)    (3.5)
             = π_t(x) π_{1:t−1}(x).    (3.6)
A sequential importance distribution γ_{1:t}(x) is chosen to be defined sequentially,
such that there exist functions γ_t(x) with

    γ_{1:t}(x) = ∏_{s=1}^{t} γ_s(x).    (3.7)

The tth sequential contribution to the importance weight is W_t(x) = π_t(x)/γ_t(x),
and the importance weight after t time-steps is

    W_{1:t}(x) = ∏_{s=1}^{t} W_s(x).    (3.8)
Remarks:

• The denominator of π_t, P(y_t | y_{1:t−1}), is independent of x and is not required for
approximate sampling. Since it may be difficult to compute, the unnormalized
value is often used,

    π̃_{1:t}(x) = P(x_t, y_t | x_{t−1}) π̃_{1:t−1}(x).    (3.9)

π̃_{1:t}(x) has P(y_{1:t}) as its normalizing constant. One can similarly use an unnor-
malized importance distribution γ̃ and importance weight W̃.
• π_t(x) ∝ P(x_t | y_t, x_{t−1}) P(y_t | x_{t−1}), so one sensible possibility is to choose a se-
quential importance distribution γ_t(x) proportional to P(x_t | y_t, x_{t−1}). Note that
this is the optimal (zero variance) importance distribution for X_t given y_t and
x_{t−1}. However, x_t chosen in this manner is no longer optimal at time-step t + 1,
since π_{t+1} depends on x_t through P(y_{t+1} | x_t). For this reason P(x_t | y_t, x_{t−1}) is
called the locally optimal choice of importance distribution. The sequential con-
tribution to the relative importance weights in this case is W_t(x) ∝ P(y_t | x_{t−1}).
3.3 Particle Filter
One serious issue with sequential importance sampling is that the importance weights
can become degenerate after a small number of steps, with most of the importance
weight mass concentrated on a small subset of samples, even if using the locally opti-
mal importance distribution. To address this issue, Gordon et al. [42] suggest the use
of sampling importance resampling (SIR) in conjunction with sequential importance
sampling (SIS) to create what is now referred to as sequential Monte Carlo. When
the underlying model is hidden Markov, then this is known as the particle filter. A
standard reference for particle filtering is Doucet and De Freitas [31].
Algorithm 3 Particle filter approximate sampling.
  Initialize each x_0^{(i)} ∼ π_0 for i = 1, . . . , N.
  for t ∈ 1, . . . , T do
    Draw each x_t^{(i)} ∼ γ_t(· | x_{t−1}^{(i)}).
    Update the sequential importance weights W_{1:t}^{(i)} according to (3.8).
    Compute the effective sample size ESS according to (2.19).
    If ESS falls below a threshold, resample each x^{(i)} with probability proportional to W_t^{(i)}.
  end for
To see anecdotally why sequential importance sampling results in degenerate
sample sets, consider the following case. Suppose that the sequential importance
weight W_t(X) is identically and independently distributed for each t.
This would occur, for example, if γ is an affine Gaussian process. Then as t → ∞,
log W_{1:t}(x) / σ(log W_{1:t}(x)) → N(0, 1). Therefore, in the limit W_{1:t} is distributed
according to a lognormal distribution, which is heavy-tailed. This implies that, gen-
erally speaking, we would expect a small number of samples to have very high importance
weights relative to the other samples. Low importance weight samples are also less
likely to have high importance weights in the future.
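This weight collapse is easy to demonstrate numerically: if each particle's log-weight is a sum of t i.i.d. terms, the effective sample size (2.19) of the population decays rapidly in t (a toy illustration; the function name is my own):

```python
import math
import random

def ess_after_t_steps(t, n, seed=0):
    """ESS of n particles whose log-weights are sums of t iid N(0,1) terms."""
    rng = random.Random(seed)
    log_ws = [sum(rng.gauss(0.0, 1.0) for _ in range(t)) for _ in range(n)]
    m = max(log_ws)                          # subtract the max for stability
    ws = [math.exp(lw - m) for lw in log_ws]
    s = sum(ws)
    ws = [w / s for w in ws]                 # normalized importance weights
    return 1.0 / sum(w * w for w in ws)      # ESS = 1 / sum_i w_i^2
```

With n = 1000 particles the ESS is a few hundred after one step but collapses to a handful after fifty, even though no data is involved: the lognormal tails alone concentrate the weight mass on a few samples.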
To address this issue, at each step one can perform sampling importance re-
sampling. If SIR is applied at time-step t, this will give an approximate sample
from π1:t(x). Note that for t < T this is not an approximate sample from the
true target distribution π1:T (x1:t) which incorporates all future observations. Instead
the algorithm approximately samples from a sequence of intermediate distributions,
π1, π1:2, . . . , π1:T . In this manner sequential Monte Carlo has parallels to simulated
annealing, which also uses a sequence of approximate samples from intermediate dis-
tributions to sample from a target distribution (see §4.3.1 for more).
Applying SIR has the effect of “pruning” the sample space, correcting previous
“mistakes” and ensuring that the algorithm does not waste time on low probability
paths. Since each resampling adds multinomial noise and requires computational effort,
it is desirable to apply resampling only when there is a clear benefit. For this reason,
it is common practice to set a threshold and resample only when the empirical
effective sample size ESS is below this threshold.
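Combining the SIS step with ESS-triggered resampling gives the bootstrap filter of Gordon et al. [42], sketched here for a toy nonlinear scalar model (the model, names, and N/2 threshold are illustrative choices, not prescribed by the text):

```python
import math
import random

def bootstrap_particle_filter(ys, n, sigma_x, sigma_y, seed=0):
    """Bootstrap filter for x_t = sin(x_{t-1}) + N(0, sigma_x^2),
    y_t = x_t + N(0, sigma_y^2); returns filtered posterior means."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ws = [1.0 / n] * n
    means = []
    for y in ys:
        # SIS step: propose from the transition prior, so W_t = P(y_t | x_t)
        xs = [math.sin(x) + rng.gauss(0.0, sigma_x) for x in xs]
        ws = [w * math.exp(-0.5 * ((y - x) / sigma_y) ** 2) for w, x in zip(ws, xs)]
        s = sum(ws)
        ws = [w / s for w in ws]
        means.append(sum(w * x for w, x in zip(ws, xs)))
        # SIR step: resample only when the effective sample size is low
        ess = 1.0 / sum(w * w for w in ws)
        if ess < n / 2:
            xs = rng.choices(xs, weights=ws, k=n)
            ws = [1.0 / n] * n
    return means

means = bootstrap_particle_filter([0.8] * 5, 500, sigma_x=0.5, sigma_y=0.5)
```

Note how the weights are reset to uniform after each resampling: the resampled population itself now carries the information the weights held.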
Chapter 4
Adaptive Importance Sampling
4.1 Background
Adaptive importance sampling (AIS) takes an indirect approach to building low vari-
ance estimators of E[X]. The goal of adaptive importance sampling is to find the best
importance distribution g∗ within a restricted family of distributions G(·; v) with pa-
rameters v whose likelihood functions are known or relatively easy to calculate. The
idea is to iteratively build distributions within G based on a population of previously
generated importance samples. A generic implementation of adaptive importance
sampling is given in Algorithm 4. There are several different possible choices of how
to update the distribution parameter v based on the population of importance sam-
ples, including variance minimization [88] and the cross-entropy method [89]. As in
the sequential setting, degeneracy is an important issue, and sampling importance
resampling (see §2.1.2) may be used [20] in a similar context as in the particle filter
algorithm.
24
Algorithm 4 Adaptive importance sampling (AIS) algorithm
  Generate (x_0^{(i)})_{1≤i≤N} ∼ g_0.
  Compute W_0^{(i)} = f(x_0^{(i)}) / g_0(x_0^{(i)}).
  Regenerate (x_0^{(i)})_{1≤i≤N} by resampling (x_0^{(i)})_{1≤i≤N} based on (W_0^{(i)}).
  k ← 0
  while (not converged) do
    k ← k + 1
    Update g_k based on {x_{k−1}^{(i)}}_{i≤N}, restricted to G.
    Generate (x_k^{(i)})_{1≤i≤N} ∼ g_k.
    Compute W_k^{(i)} = f(x_k^{(i)}) / g_k(x_k^{(i)}).
    Regenerate (x_k^{(i)})_{1≤i≤N} by resampling (x_k^{(i)})_{1≤i≤N} based on (W_k^{(i)}).
  end while
One of the primary challenges in designing an effective adaptive importance sam-
pling algorithm is in choosing the parametric family of distributions G. The choice
of distribution family affects the quality of the estimator and efficiency of the algo-
rithm. See Rubinstein and Kroese [89] for examples of some commonly used proposal
distributions in a range of application settings.
Ideally, the family of proposal distributions G should have enough flexibility to
specify a distribution which frequently visits “important” states. In other
words, the best possible proposal distribution within G should be close to the optimal
importance distribution g∗. On the other hand, the family G should
be as simple as possible, i.e. have a relatively small parameter space and be easy to
fit to the data, both to avoid overfitting and so that its best member can be found with minimal
computational effort. Intuitively, the family of distributions should contain a good
“model” of the optimal distribution and easily incorporate some of the underlying
structure of the problem.
4.1.1 Variance Minimization
A primary design issue in adaptive importance sampling algorithms is the choice
of update rule for the importance distribution g_k. Given a sample x = {x_1, . . . , x_N},
one natural rule would be to take the next proposal distribution as one that mini-
mizes the sample variance of (2.6),

    g = argmin_g var_g[h(X) W(X)].    (4.1)

Since the expected value of h(X)f(X)/g(X) is constant for g with full support, this reduces
to minimizing its second moment,

    g = argmin_g E_g[h²(X) W²(X)].    (4.2)

This is the idea underlying the “variance minimization” (VM) procedure: at step
t, construct a new IS distribution g_VM^{(t)} by choosing a sampling distribution that
minimizes (4.2) over samples (x_{t−1}^{(i)})_{1≤i≤N}. Often there is no analytic solution to (4.2),
requiring numerical non-linear optimization at each step.
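For instance, with target f = N(0, 1), h(x) = 1{x > 2}, and proposal family g_v = N(v, 1), one VM step can be carried out by a crude grid search over v (a toy sketch; the names, grid, and example are mine, not from the text):

```python
import math
import random

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def vm_update(v_cur, n, rng):
    """One variance-minimization step: choose v minimizing the estimated
    second moment E[h(X)^2 W(X)^2] for h(x) = 1{x > 2}, f = N(0,1),
    g_v = N(v,1), using samples drawn from the current proposal g_{v_cur}."""
    xs = [rng.gauss(v_cur, 1.0) for _ in range(n)]
    def second_moment(v):
        # E_{g_v}[h^2 f^2 / g_v^2] rewritten as an expectation under g_{v_cur}
        return sum(
            (x > 2) * normal_pdf(x, 0.0) ** 2
            / (normal_pdf(x, v) * normal_pdf(x, v_cur))
            for x in xs) / n
    grid = [i / 10.0 for i in range(41)]     # crude search over v in [0, 4]
    return min(grid, key=second_moment)

v_new = vm_update(0.0, 20_000, random.Random(0))
```

The update shifts the proposal mean from 0 toward the rare region x > 2, where the variance-optimal member of this Gaussian family sits.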
4.1.2 Cross-Entropy Method
Instead of minimizing the variance at each step, one can choose a distribution g_CE
that minimizes the Kullback-Leibler cross-entropy between g_CE and the optimal
importance distribution g∗(x). In the cross-entropy method [89], a sequence
of distributions parameterized by v_1, v_2, . . . , v_k is iteratively built. At each step, the
Kullback-Leibler divergence (2.3) from g∗ to f(·; v), estimated from previously drawn samples, is
minimized. One principal advantage of this method is that minimizing cross-entropy
is often more computationally tractable than finding the minimum sample variance.
Recall that the Kullback-Leibler divergence from g to f can be stated as

    d_KL(g‖f) = E_g[log g(X)] − E_g[log f(X)].    (4.3)

In cross-entropy adaptive importance sampling, the optimal parameter set v∗ that
minimizes d_KL(g∗ ‖ f(·; v)) is defined as

    v∗ = argmax_v E_{g∗}[log f(X; v)],    (4.4)
since the first term in (4.3) is constant with respect to v. One can rewrite this as

    v∗ = argmax_v E_{g∗}[(W(X)/W(X)) log f(X; v)]    (4.5)
       = argmax_v E_W[(g∗(X)/W(X)) log f(X; v)]    (4.6)

for some reference distribution W(x). Since solving (4.6) directly is typically in-
tractable, Rubinstein and Kroese [89] suggest an iterative procedure, using the distri-
bution f(·; v_{t−1}) as the reference distribution when solving for v_t. One can estimate
the expectation in (4.6) via Monte Carlo simulation:

    E_{v_{t−1}}[(g∗(X)/f(X; v_{t−1})) log f(X; v)] ≈ (1/C_{g∗}) (1/N) ∑_{x_1,...,x_N ∼ f(·;v_{t−1})} W_{t−1}(x_i) log f(x_i; v),    (4.7)

where W_{t−1}(x) = |h(x)| f(x) / f(x; v_{t−1}) is the importance weight of x under v_{t−1}, and C_{g∗} is
the normalizing constant of g∗.
Rubinstein and Kroese [89] suggest approximating f(·; v∗) at itera-
tion t by maximizing over v in (4.7) with respect to the empirical distribution of f(·; v_{t−1}):

    v_t = argmax_v (1/N) ∑_{x_1,...,x_N ∼ f(·;v_{t−1})} W_{t−1}(x_i) log f(x_i; v).    (4.8)

Note that for the purpose of computing v_t, one can ignore the constant multiplier
C_{g∗}.
Maximizing the sum (4.8) is equivalent to finding the maximum likelihood estima-
tor under v with sample x_i replicated W_{t−1}(x_i) times. This leads to an interpretation
of the cross-entropy method as iterative weighted maximum likelihood estimation [40].
For many families of distributions, computing the maximum likelihood estimator is
fast, efficient, and well understood, so optimizing the sum (4.8) is often straightfor-
ward. For more on this connection between cross-entropy and maximum likelihood,
see Asmussen and Glynn [6].
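To make the weighted-MLE view concrete: for the Gaussian family G = {N(v, 1)}, the maximizer of (4.8) is simply the importance-weighted sample mean. The toy sketch below targets the rare-event function h(x) = 1{x > 2} under f = N(0, 1); all names and the specific example are mine, not from the text:

```python
import math
import random

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def ce_update(v_prev, n, rng):
    """One cross-entropy iteration: the weighted MLE of the Gaussian mean,
    with weights W_{t-1}(x) = |h(x)| f(x) / f(x; v_{t-1}) for h(x) = 1{x > 2}."""
    xs = [rng.gauss(v_prev, 1.0) for _ in range(n)]
    ws = [(x > 2) * normal_pdf(x, 0.0) / normal_pdf(x, v_prev) for x in xs]
    total = sum(ws)
    if total == 0.0:                    # no sample hit the rare region; keep v
        return v_prev
    return sum(w * x for w, x in zip(ws, xs)) / total   # weighted sample mean

rng = random.Random(0)
v = 0.0
for _ in range(5):
    v = ce_update(v, 5000, rng)
```

The iterates approach the CE-optimal mean E[X | X > 2] ≈ 2.37, i.e. the mean of g∗ ∝ |h|f, which is exactly the KL projection of g∗ onto this family.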
4.2 Avoiding Degeneracy
Directly optimizing (4.8) to find a set of parameters vt may not always produce a
distribution f(·; vt) that robustly generates “good” samples. Unless sample sizes are
large, this estimator may tend to over-fit the empirical data and lead to a degenerate
approximating distribution f(·; vt), with a fit based on a sample set in which the bulk
of the importance weights are concentrated in one or a small number of samples. This
situation can be diagnosed by a low effective sample size (see §2.1.2 for details).
Rubinstein and Kroese [89] suggest several heuristic alterations to (4.8) which
implicitly address the degeneracy issue. First, instead of optimizing (4.8) over all
samples drawn from f(·; v_{t−1}), one can use only the top ρ fraction of the sample,
for some constant 0 < ρ < 1, weighting these samples equally; this is known as the
elite sample technique. Second, after computing the minimum sample cross-entropy,
one can “smooth” it with the previous importance distribution according to a weight
α (in distribution families where such parameter smoothing makes sense). Third,
in the fully adaptive cross-entropy method one adaptively adjusts the sample sizes
drawn from f(·; vt) based on performance metrics. These adjustments work well for
some types of problems, as is demonstrated in the examples found in Rubinstein
and Kroese [89]. For the complicated distributions on orderings encountered in §7,
however, the basic cross-entropy method does not perform well, quickly leading to
highly degenerate importance distributions.
It also seems a worthwhile goal to fit the cross-entropy degeneracy problem within
a larger information-theoretic or statistical learning context, where many tools for for-
mally dealing with over-fitting have been developed. One robust criterion for the
avoidance of over-fitting is the minimum description length (MDL) principle [43].
Minimum description length is a generalization of Occam's razor, the principle
that one should use the simplest model that best fits the data. In practi-
cal terms, minimum description length gives an expression for the trade-off between
model complexity and “goodness of fit”. In addition to describing the model well
(as measured by likelihood), one should compensate for over-fitting by taking into
account the description length of the model, L(f(·; vt)). This is expressed as
L(f(·; vt), x) = L(f(·; vt))− log(P(x|f(·; vt))) + const. (4.9)
See MacKay [70] chapter 28 for a very readable introduction to minimum description
length.
One interpretation of the description length is to take L(f(·; v_t)) to be the Shannon
entropy of the distribution f(·; v_t), i.e. L(f(·; v_t)) = −E_{v_t}[log f(X; v_t)]. This is quite natural
from an information-theoretic perspective, and directly penalizes distributions for
deviation from the uniform distribution. A disadvantage of this formulation is that
it may be difficult to directly estimate the entropy for complicated models. Another
interpretation is to see L(f(·; v_t)) as the negative log-likelihood of the parameters under some
Bayesian prior. This may have the advantage of being easier to compute than the
model entropy, but it may not be as interpretable as a measure of model complexity.
We can use minimum description length principles when choosing a new distribution
v_t based on previously drawn samples. However, there is some ambiguity in how to
best apply the technique to adaptive importance sampling. For instance, it is unclear
how to properly weigh L(f(·; vt)) versus log(P(x|f(·; vt))) in the objective function
(4.9). In the standard model selection framework when fitting to samples drawn from
the “true” distribution of interest, no special weights are necessary. As the sample size
increases, the log-likelihood term comes to dominate, allowing for more complicated
models if they fit the data better. In the adaptive importance sampling setting,
however, weighted samples drawn from the previous importance distributions are
used only as a proxy for the optimal importance distribution g∗ when performing an
update. Although many samples may be drawn, the effective sample size with respect
to g∗ is often small, with the top few samples having importance weights orders of
magnitude greater than the rest. In the iterative model selection framework, this
means one should put a weight on the goodness-of-fit term based on the effective
sample size. In §7, several different heuristics are given for estimating this weight,
such as gradually increasing the allowed entropy or assigning randomized weights to
different samples when computing the effective sample size.
4.3 Related Methods
4.3.1 Annealed Importance Sampling
When one can sample directly from a reference distribution g with known normalizing
constant C_g, the natural way to use importance sampling to estimate the normalizing
constant C_f of a distribution π_f (with unnormalized density f) is what Neal [76]
refers to as the simple importance sampling estimator:

    C_f / C_g = E_g[f(X) / g(X)]    (4.10)
              ≈ (1/N) ∑_{i=1}^{N} f(x_i) / g(x_i),    (4.11)

where (x_i)_{1≤i≤N} ∼ g. The estimator (4.11) will have high variance if g
and f are not close to one another, especially if g(x) is small where f(x) is large. One
remedy for this is to use a sequence of distributions g = q_0, q_1, . . . , q_{n−1}, q_n = f and
to chain the estimators together:

    C_f / C_g = ∏_{j=0}^{n−1} C_{j+1} / C_j = ∏_{j=0}^{n−1} E_{q_j}[ q_{j+1}(X_j) / q_j(X_j) ].    (4.12)
This is well known in computational physics as umbrella sampling. It has also been
applied successfully in the approximate counting context for applications such as
counting the number of matchings in a graph [93] or estimating the volume of convex
bodies [33]. The general idea is one of recursive estimation of size [2]. Further
examples can be found in Diaconis and Holmes [28].
One challenging issue associated with this family of techniques is that it may be
very difficult or impossible to sample directly from an intermediate distribution of
interest qj. The most straightforward technique is to use an ergodic Markov chain
that holds qj as its stationary distribution to approximately sample according to qj.
One can then draw correlated samples at each step of the Markov chain to build
an estimator. It may be necessary to have a long “burn-in” at each intermediate
step to avoid correlations and ensure convergence to stationarity [76]. The degree of
correlation of the Markov chain samples will depend on the mixing rate. See §2.2 for
more details.
If subsequent distributions qi and qi+1 are close enough to each other, each term in
the product (4.12) will have low variance and this will produce a good estimate of Cf .
However, if the samples xj+1 are drawn using a Markov chain based on xj, this will
introduce bias. As a remedy to this issue, Neal [75] introduced annealed importance
sampling. The idea is to use a sequence of reversible transition kernels (Ti(x, y))1≤i≤n,
where Ti(x, y) has stationary distribution πi, known to within a normalizing constant
as qi. One can then produce a sample xn approximately distributed according to f ,
in a similar way as in simulated annealing [58], by starting at a state x0, drawn from
g and applying the transition kernel Ti to xi−1 to generate xi for each 1 ≤ i ≤ n in
sequence. One can then compute the importance weight of each sample x_n^{(j)} as

    W^{(j)} = ∏_{i=0}^{n−1} q_{i+1}(x_i) / q_i(x_i).    (4.13)
Then an unbiased estimator of C_f / C_g is

    (1/N) ∑_{j=1}^{N} W^{(j)}.    (4.14)
Note that although annealed importance sampling produces an unbiased estimator,
it may still have high variance depending on the sequence of transition kernels chosen
(similar to the choice of annealing schedule in simulated annealing). The variance
can also be reduced by taking a larger number of steps for each transition kernel.
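The annealed importance sampling recipe above can be made concrete with a small sketch. In the example below, the geometric tempering schedule q_i ∝ g^{1−β_i} f^{β_i}, the random-walk Metropolis kernel, and all function names are illustrative choices, not prescribed by the text; the mean of the weights estimates C_f/C_g for a one-dimensional Gaussian target.

```python
import numpy as np

def ais_run(rng, betas, log_f, log_g, sample_g, mh_step):
    """One annealed importance sampling run: returns (sample, log weight).
    Intermediate densities: q_i(x) proportional to g(x)^(1-beta_i) f(x)^beta_i."""
    log_q = lambda x, b: (1.0 - b) * log_g(x) + b * log_f(x)
    x = sample_g(rng)
    log_w = 0.0
    for i in range(1, len(betas)):
        # accumulate the factor q_i(x_{i-1}) / q_{i-1}(x_{i-1}) of (4.13)
        log_w += log_q(x, betas[i]) - log_q(x, betas[i - 1])
        # move x with a reversible kernel T_i that leaves q_i invariant
        x = mh_step(rng, x, lambda y, b=betas[i]: log_q(y, b))
    return x, log_w

def rw_metropolis(rng, x, log_target, scale=0.5):
    """A single random-walk Metropolis step (one possible kernel choice)."""
    y = x + scale * rng.standard_normal()
    return y if np.log(rng.uniform()) < log_target(y) - log_target(x) else x

# Toy check: f is an unnormalized N(0, 0.5^2) density, g = N(0, 1).
# E[W] = C_f / C_g = sqrt(2*pi*0.25), so the mean weight recovers it.
rng = np.random.default_rng(0)
betas = np.linspace(0.0, 1.0, 21)
log_f = lambda x: -x * x / (2 * 0.25)
log_g = lambda x: -x * x / 2 - 0.5 * np.log(2 * np.pi)
w = [np.exp(ais_run(rng, betas, log_f, log_g,
                    lambda r: r.standard_normal(), rw_metropolis)[1])
     for _ in range(2000)]
est = float(np.mean(w))
```

With 21 temperatures the consecutive q_i are close, so the per-step ratios in (4.13) have low variance and the mean weight lands near sqrt(2π·0.25) ≈ 1.25.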
4.3.2 Population Monte Carlo
A generalized framework for the techniques in this chapter is given by Cappe et al.
[20] under the name population Monte Carlo (PMC) and is summarized below in
Algorithm 5. Instead of only allowing an importance distribution g to take values in
a parametric family G based on the set of samples from the last iteration, population
Monte Carlo allows more general update rules and can use samples from any of the
previous time-steps. There are many possible methods that have been proposed to
update g based on a population of samples, such as using mixture-type models [30].
Algorithm 5 Generic population Monte Carlo (PMC) algorithm
Generate (x_0^{(i)})_{1≤i≤N} ∼ g_0.
Compute W_0^{(i)} = f(x_0^{(i)}) / g_0(x_0^{(i)}).
Generate (x_0^{(i)})_{1≤i≤N} by resampling (x_0^{(i)})_{1≤i≤N} based on W_0^{(i)}.
k ← 0
while (not converged) do
    k ← k + 1
    Update g_k based on {x_j^{(i)}}_{j<k, i≤N}.
    Generate (x_k^{(i)})_{1≤i≤N} ∼ g_k.
    Compute W_k^{(i)} = f(x_k^{(i)}) / g_k(x_k^{(i)}).
    Generate (x_k^{(i)})_{1≤i≤N} by resampling (x_j^{(i)})_{j≤k, 1≤i≤N}.
end while
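A minimal runnable sketch of Algorithm 5, under the simplifying assumption that the family G consists of single Gaussians and that g_k is re-fit (mean and scale) to the most recent resampled population — one of many possible update rules; the target and all names are illustrative.

```python
import numpy as np

def population_monte_carlo(rng, log_f, n_particles=500, n_iter=10):
    """Sketch of Algorithm 5 with a single-Gaussian family G: at each
    iteration g_k is re-fit to the resampled population."""
    mu, sigma = 0.0, 3.0                       # initial g_0 = N(0, 3^2)
    for _ in range(n_iter):
        x = mu + sigma * rng.standard_normal(n_particles)   # x ~ g_k
        log_g = (-(x - mu) ** 2 / (2 * sigma ** 2)
                 - np.log(sigma) - 0.5 * np.log(2 * np.pi))
        log_w = log_f(x) - log_g                            # W = f / g_k
        w = np.exp(log_w - log_w.max())
        x = rng.choice(x, size=n_particles, p=w / w.sum())  # resample
        mu, sigma = float(x.mean()), max(float(x.std()), 1e-3)  # update g_k
    return x

rng = np.random.default_rng(1)
log_f = lambda x: -(x - 2.0) ** 2 / (2 * 0.5 ** 2)  # unnormalized N(2, 0.5^2)
pop = population_monte_carlo(rng, log_f)
# the population concentrates near the target mean 2.0
```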
Chapter 5
Conditional Sampling Importance
Resampling
In this chapter, conditional sampling importance resampling (CSIR), an extension of
sampling importance resampling (SIR) is considered (see §2.1.2 for background on
SIR). Conditional SIR applies to cases where it is difficult to draw directly from a
target conditional distribution π but it is possible compute conditional distributions
π(xi|xj) for coordinate indices i, j up to a constant factor. This setting often naturally
occurs in many sequential importance sampling contexts.
5.1 Motivation
In many application settings, it is not feasible to generate draws from the conditional
distribution (2.23), but one would still like to apply the Gibbs sampler. For
this purpose, Koch [60] proposed Gibbs SIR, which at each iteration performs a SIR
based on conditional draws from an importance distribution γ. This algorithm is as
follows. Given an initial point x, for coordinate index j, draw N conditional samples x_j^{(1)}, …, x_j^{(N)} according to γ(x_j | x_{[−j]}), then compute importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)} | x_{[−j]}), and draw x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}. Due to the point-wise convergence of SIR (under mild conditions on γ and
π) the conditional probabilities go to (2.23) as N → ∞, so it is easy to see that as
CHAPTER 5. CONDITIONAL SAMPLING IMPORTANCE RESAMPLING 34
R,N → ∞ this process converges to π, where R is the number of Gibbs steps. The
motivating application of this in Koch [60] is to image reconstruction.
Algorithm 6 SIR Gibbs
Start with arbitrary x.
while (not converged) do
    Randomly choose coordinate index j.
    Draw N iid samples {x_j^{(i)}}_{i=1}^{N} according to γ(x_j | x_{[−j]}).
    Compute importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)} | x_{[−j]}).
    Sample x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}.
end while
A primary difficulty with the Gibbs SIR procedure is that it may be compu-
tationally expensive to draw samples from the conditional importance distribution
γ(x|x[−j]). This is particularly the case in sequential Monte Carlo contexts where
importance distributions may be complicated and built sequentially over many time-steps. Below, a procedure is detailed that builds approximate conditional samples using only samples from the joint importance distribution, referred to as conditional sampling importance resampling (CSIR), and a simple example is given for the case of the multivariate normal distribution. Chapter 6 develops the motivating application to multi-object tracking.
5.2 Conditional Resampling
The goal in conditional SIR, as in standard SIR, is to approximately draw samples
of a target distribution π using samples drawn from an importance distribution γ,
both taking values in state space Ω. It is assumed that the support of γ contains that of π, i.e. π(x) > 0 ⇒ γ(x) > 0. Suppose that elements x ∈ Ω can be partitioned as x = (x_1, …, x_n), and that one has access to the importance and target conditional distribution functions, π(x_j | x_{[−j]}) and γ(x_j | x_{[−j]}), up to a constant factor, but that these distributions are difficult to sample directly. Suppose that one draws a collection of iid samples {x^{(i)}}_{i=1,…,N} from γ. One can decompose these as a collection of samples from the marginal distributions, {x_1^{(i)}}_{i=1}^{N}, …, {x_n^{(i)}}_{i=1}^{N}.
The idea of CSIR is then to use these marginal samples as importance samples for
the purpose of approximately sampling from the conditional distributions. One can
think of this as a step of an approximate Gibbs Markov transition kernel. A step of
CSIR Gibbs is as follows. Starting from x ∈ Ω, pick a coordinate index j at random,
compute importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)}), and draw x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}. This process is summarized in Algorithm 7.
Algorithm 7 CSIR Gibbs
Draw N independent importance samples {x^{(i)}}_{i=1}^{N} ∼ γ.
Start with arbitrary x.
while (not converged) do
    Pick an index j at random.
    Compute W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}) / γ(x_j^{(i)}) for i = 1, …, N.
    Draw x_j from {x_j^{(i)}}_{i=1}^{N} with probability proportional to W_j^{(i)}.
end while
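The CSIR Gibbs step above can be sketched as follows; the bivariate Normal target with correlation ρ = 0.9 and importance distribution γ = N(0, I), the pool size, and all function names are illustrative choices for checking that the resampled state picks up the target's correlation structure.

```python
import numpy as np

def csir_gibbs(rng, ys, log_pi_cond, log_gamma_marg, x0, n_sweeps=5):
    """CSIR Gibbs sketch (Algorithm 7). ys is an (N, n) array of iid draws
    from gamma; coordinate j of the current state is resampled from the
    pool ys[:, j] with weights pi(x_j^(i) | x_[-j]) / gamma(x_j^(i))."""
    x = x0.copy()
    N, n = ys.shape
    for _ in range(n_sweeps):
        for j in range(n):
            log_w = log_pi_cond(ys[:, j], j, x) - log_gamma_marg(ys[:, j], j)
            w = np.exp(log_w - log_w.max())
            x[j] = rng.choice(ys[:, j], p=w / w.sum())
    return x

# Toy target: bivariate Normal, mean 0, correlation rho; gamma = N(0, I).
rho = 0.9
log_pi_cond = lambda xj, j, x: -(xj - rho * x[1 - j]) ** 2 / (2 * (1 - rho ** 2))
log_gamma_marg = lambda xj, j: -xj ** 2 / 2

rng = np.random.default_rng(2)
samples = np.array([csir_gibbs(rng, rng.standard_normal((200, 2)),
                               log_pi_cond, log_gamma_marg,
                               rng.standard_normal(2))
                    for _ in range(400)])
corr = float(np.corrcoef(samples.T)[0, 1])  # should approach rho = 0.9
```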
Theorem 1. If the Gibbs sampler is ergodic for π, and for each x ∈ Ω and index
j, π(xj|x[−j])/γ(xj|x[−j]) < ∞, then as N → ∞, CSIR Gibbs is ergodic and has
stationary distribution π.
Proof. For each sample/coordinate pair x and j, the CSIR Gibbs transition kernel
KCSIR converges to the Gibbs transition kernel K due to the convergence of SIR.
One of the main advantages of CSIR is that it can take advantage of indepen-
dence structure to greatly improve the quality of the approximate sample versus SIR.
This can give lower variance estimators than the standard importance sampling esti-
mate, µIS, unlike SIR. In order to properly take advantage of independence structure,
Chapter 6 develops the notion of grouping subsets for CSIR, similar to the grouped
Gibbs sampler or the Lk sets in the hit-and-run algorithm from §2.2.5. A disadvan-
tage of CSIR is that extra computational effort may be needed for approximating
conditionals.
5.2.1 Estimating Marginal Importance Weights
CSIR requires computation of marginal importance distributions γ(xj), which for
complicated importance distributions is often impractical. There are several ways
to do this. One way is to take the conditional distribution of the sample itself, γ(x_j | x_{[−j]}). Since γ(x_j) = E[γ(x_j | X_{[−j]})], this gives an unbiased estimator. In cases where X_j is independent of X_{[−j]} under γ, it is exact. Another method is to use the Monte Carlo estimator, γ(x_j) ≈ (1/N) ∑_{i=1}^{N} γ(x_j | x_{[−j]}^{(i)}). Instead of using all N indices, one can also use a random subset. Empirically, the Monte Carlo estimate appears to perform well, since x_j has been generated by the joint importance distribution γ and is not a rare occurrence. The higher the degree of independence for coordinate index j, the better this Monte Carlo estimate will be.
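A small numerical check of the Monte Carlo marginal estimate just described (the bivariate Normal setup and the names are illustrative):

```python
import numpy as np

def marginal_estimate(x_j, j, pool, cond_density):
    """Monte Carlo estimate gamma(x_j) ~ (1/N) sum_i gamma(x_j | x_[-j]^(i)),
    averaging the conditional density over samples from the joint gamma."""
    return float(np.mean([cond_density(x_j, j, x) for x in pool]))

# Toy check: bivariate Normal with correlation rho; the true marginal of
# coordinate 0 is N(0, 1), so the estimate at 0 should be ~1/sqrt(2*pi).
rho = 0.8
rng = np.random.default_rng(3)
pool = rng.standard_normal((5000, 2)) @ np.linalg.cholesky(
    np.array([[1.0, rho], [rho, 1.0]])).T
cond = lambda xj, j, x: (np.exp(-(xj - rho * x[1 - j]) ** 2
                                / (2 * (1 - rho ** 2)))
                         / np.sqrt(2 * np.pi * (1 - rho ** 2)))
est = marginal_estimate(0.0, 0, pool, cond)
```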
5.2.2 Conditional Effective Sample Size
Similarly to the standard SIR context, degeneracy of importance weights can occur
for conditional resampling, with some sub-samples having much higher conditional
importance weights than others for a given coordinate index j. To measure degener-
acy, some heuristics are introduced here extending the notion of effective sample size.
The marginal effective sample size (MESS) will be defined as

    MESS_j = N · E[W(X_j)]^2 / E[W(X_j)^2],    (5.1)

where W indicates the unnormalized importance weight; one can similarly define an empirical version of MESS. As explained above, one cannot generally compute the marginal distributions directly, so one can attempt to approximate MESS using conditional distributions.
When using the same-sample conditional importance weights W_j^{(i)} = π(x_j^{(i)} | x_{[−j]}^{(i)}) / γ(x_j^{(i)} | x_{[−j]}^{(i)}), this is referred to as the local conditional effective sample size (local CESS),

    lCESS_j = ( ∑_{i=1}^{N} W_j^{(i)} )^2 / ∑_{i=1}^{N} ( W_j^{(i)} )^2.    (5.2)
Alternatively, one can use the cross-sample conditional importance weights,
    W_j^{(i,k)} = π(x_j^{(i)} | x_{[−j]}^{(k)}) / γ(x_j^{(i)} | x_{[−j]}^{(k)}),

to compute what will be called the global conditional effective sample size,

    gCESS_j = ( ∑_{i=1}^{N} ∑_{k=1}^{N} W_j^{(i,k)} )^2 / ∑_{i=1}^{N} ∑_{k=1}^{N} ( W_j^{(i,k)} )^2.    (5.3)
As when estimating the marginal distributions for sampling purposes, one can use a
subset of indices instead of the full N for computing the Monte Carlo estimates. Note
that lCESS and gCESS are simply heuristics used to estimate sample quality, and
are not related to the convergence of CSIR.
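Both heuristics reduce to the familiar empirical effective-sample-size form (∑ W)² / ∑ W² applied to the appropriate collection of weights; a sketch:

```python
import numpy as np

def ess(weights):
    """Empirical effective sample size (sum W)^2 / sum W^2, the common form
    behind lCESS (5.2) and gCESS (5.3); weights may be unnormalized."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

def local_cess(W_j):
    # W_j[i] = pi(x_j^(i) | x_[-j]^(i)) / gamma(x_j^(i) | x_[-j]^(i))
    return ess(W_j)

def global_cess(W_jk):
    # W_jk[i, k] = pi(x_j^(i) | x_[-j]^(k)) / gamma(x_j^(i) | x_[-j]^(k))
    return ess(np.asarray(W_jk).ravel())

# equal weights give the full sample size; a single dominant weight gives ~1
```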
5.2.3 Importance Weight Accounting
Recall that a sample x has importance weight

    W(x) = π(x) / γ(x) = ( π(x_j | x_{[−j]}) π(x_{[−j]}) ) / ( γ(x_j | x_{[−j]}) γ(x_{[−j]}) ).    (5.4)

With CSIR, one is choosing x′_j approximately according to π(· | x_{[−j]}); therefore W(x′) ≈ π(x_{[−j]}) / γ(x_{[−j]}) = W(x) / W_j(x_j). Essentially, the algorithm is “resetting” the conditional weights for the resampled part. One can therefore perform a single step of
CSIR for a sample x, and update its importance weight accordingly. However, since
this is an approximation, it is generally not advisable to perform this for several steps
of CSIR, since it will lead to cascading errors. Instead, a better policy is to start x
as a SIR sample to initially reset all importance weights. One can then assume that
importance weights remain the same for all samples.
5.3 Example: Multivariate Normal
The following example considers a target distribution π that is multivariate Normal with mean 0 and covariance C, using a multivariate Normal γ as an importance distribution. The random variables distributed according to π and γ are X and Y respectively, with X ∼ N(0, C) and Y ∼ N(0, α²I_n). Because one can sample directly from X, as well as from the conditional and marginal distributions, this allows for easy evaluation of the performance of CSIR as compared to SIR.

The importance variable, denoted Y, is multivariate Normal with mean 0 and covariance α²I. Y admits an importance density γ such that, for x ∈ R^n,

    γ(x) = (2π)^{−n/2} α^{−n} exp(−x^T x / (2α²)).

The target random variable, denoted X, is multivariate Normal with mean 0 and covariance C. X admits a density π such that, for x ∈ R^n,

    π(x) = (2π)^{−n/2} |C|^{−1/2} exp(−x^T C^{−1} x / 2).
For these choices of importance and target distributions, one can explicitly compute the conditional distributions. Recall that if X can be decomposed as

    X = (X_a, X_b)^T,    (5.5)

where X_a and X_b are column vectors of size q and n − q respectively, then the conditional distribution of X_a given X_b is a q-dimensional multivariate normal with mean

    μ = C_{12} C_{22}^{−1} X_b

and covariance matrix

    Σ = C_{11} − C_{12} C_{22}^{−1} C_{21}.
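These formulas translate directly into a short routine (the function name is illustrative):

```python
import numpy as np

def mvn_conditional(C, xb, q):
    """Conditional distribution of X_a | X_b = xb for X ~ N(0, C), with
    X_a the first q coordinates, following the formulas above."""
    C11, C12 = C[:q, :q], C[:q, q:]
    C21, C22 = C[q:, :q], C[q:, q:]
    mu = C12 @ np.linalg.solve(C22, xb)              # C12 C22^{-1} xb
    Sigma = C11 - C12 @ np.linalg.solve(C22, C21)    # C11 - C12 C22^{-1} C21
    return mu, Sigma

# 2-d sanity check: for unit variances, mu = rho * xb and var = 1 - rho^2.
C = np.array([[1.0, 0.6], [0.6, 1.0]])
mu, Sigma = mvn_conditional(C, np.array([2.0]), q=1)
# mu = [1.2], Sigma = [[0.64]]
```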
In the examples below, N samples y^{(1)}, …, y^{(N)} are drawn from the importance distribution and their importance weights are computed. To build the SIR sample {x_{SIR}^{(i)}}_{i=1,…,N}, N samples are drawn with replacement from the importance samples with probability proportional to their importance weights. To build the CSIR samples {x_{CSIR}^{(i)}}_{i=1,…,N}, each is initialized as a SIR sample, i.e.

    x_{CSIR}^{(i)} = x_{SIR}^{(i)},

and then the CSIR Gibbs Markov step is run, updating each coordinate a total of k times.
To test the quality of the approximate samples, a Monte Carlo estimate of the
Kullback-Leibler divergence is computed. If π and γ are normally distributed with
respective means µπ, µγ and covariance matrices Σπ,Σγ, then one can explicitly write
the Kullback-Leibler divergence as
    D_KL(g, f) = (1/2) ( tr(Σ_π^{−1} Σ_γ) + (μ_π − μ_γ)^T Σ_π^{−1} (μ_π − μ_γ) + log(det Σ_π / det Σ_γ) − n ).    (5.6)
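The closed form above is easy to implement and check against the special case it reduces to when g = N(0, I_n) and f = N(0, αI_n), namely (n/2)(1/α + log α − 1); function and variable names are illustrative.

```python
import numpy as np

def gaussian_kl(mu_g, Sig_g, mu_p, Sig_p):
    """KL divergence between multivariate normals in the form of (5.6),
    with the first argument pair in the role of the approximation g."""
    n = len(mu_g)
    Sp_inv = np.linalg.inv(Sig_p)
    d = mu_p - mu_g
    return 0.5 * (np.trace(Sp_inv @ Sig_g) + d @ Sp_inv @ d
                  + np.log(np.linalg.det(Sig_p) / np.linalg.det(Sig_g)) - n)

# special case g = N(0, I_n), f = N(0, alpha * I_n)
n, alpha = 4, 0.5
kl = gaussian_kl(np.zeros(n), np.eye(n), np.zeros(n), alpha * np.eye(n))
```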
However, due to the multinomial resampling of SIR and CSIR, the distribution of
an approximate sample x cannot be expressed as a multivariate Normal. Instead the
KL-divergence between g and π will be approximated, where g denotes the distribution of the approximate sample. To estimate the cross-entropy, E[log π(X)], since samples have been drawn from g and it is feasible to evaluate log π(x), it is possible to use the crude Monte Carlo estimator (1/N) ∑_{i=1}^{N} log π(x^{(i)}). To estimate the entropy, E[log g(X)], it is not possible to use crude Monte Carlo, since in neither SIR nor CSIR does one have an explicit representation of g. Instead a density estimate ĝ is built. There are many possible ways to build a density estimate, but in this case, since both the importance and target distributions are multivariate normal, ĝ is taken to be multivariate normal and the maximum likelihood fit is used. Crude Monte Carlo is then used with ĝ to estimate the entropy as (1/N) ∑_{i=1}^{N} log ĝ(x^{(i)}). For the examples below, three cases are chosen:
1. C_id = αI.
2. C_unif, a random matrix with eigenvalues chosen uniformly at random in (0, 1].

[Figure 5.1: CSIR Normal example: eigenvalues of the covariance matrices, scaled to have max eigenvalue 1. Panels: (a) C_id, (b) C_unif, (c) C_skew.]
3. C_skew, a random matrix with eigenvalues chosen from a lognormal distribution (μ = 0, σ = 1.5) and scaled to have max eigenvalue 1, so that they lie in (0, 1].
The random matrices above were generated by taking Q^T Λ Q, where Λ is the diagonal matrix of eigenvalues and Q is a random orthogonal matrix. For the case of the uncorrelated target distribution, C = αI,
with 0 < α < 1. In this case, the Kullback-Leibler divergence from the importance
distribution γ to the target distribution π is
    D_KL(g, f) = (1/2) ( n/α + n log(α) − n )    (5.7)
               = (n/2) ( 1/α + log(α) − 1 ),    (5.8)
so D_KL(g, f) is linear in the dimension n. Using the iid properties of the importance and target distributions, one can see that for the conditional SIR approximate sample,

    D_KL(g_CSIR, f) = E[ log( g_CSIR(X_CSIR) / f(X_CSIR) ) ]    (5.9)
                    = ∑_{i=1}^{n} E[ log( g_CSIR(X_CSIR,i) / f(X_CSIR,i) ) ]    (5.10)
                    = n · E[ log( g_CSIR(X_CSIR,1) / f(X_CSIR,1) ) ],    (5.11)
so for a given sample size N , DKL(gCSIR, f) also increases linearly in the dimension
n. For the random covariance matrices, C is formed by pre- and post-multiplying the diagonal eigenvalue matrix with a random orthogonal matrix Q, obtained by performing a QR decomposition on a matrix composed of n draws from a standard multivariate normal in n dimensions.
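The covariance construction just described can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def random_covariance(rng, eigvals):
    """C = Q^T diag(eigvals) Q, with Q a random orthogonal matrix from the
    QR decomposition of an n x n standard-normal matrix."""
    n = len(eigvals)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q.T @ np.diag(eigvals) @ Q

rng = np.random.default_rng(4)
lam = np.sort(rng.uniform(size=6))   # the C_unif eigenvalue recipe
lam = lam / lam.max()                # scaled so the max eigenvalue is 1
C = random_covariance(rng, lam)
# C is symmetric with exactly the prescribed spectrum
```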
[Figure 5.2: Estimated KL-divergence for approximate samples using CSIR and SIR, N = 50 samples, k = 5 passes of the Gibbs sampler. Panels: (a) C_id, (b) C_unif, (c) C_skew. This process is repeated for 20 different randomly chosen covariance matrices for the uniform and skewed cases. Dotted lines are 95% confidence intervals based on 1000 bootstrap samples.]
Note that in these experiments, CSIR has much lower KL-divergence than SIR for
each example but particularly in the case of the random covariance matrices. Both
methods perform worst on the skewed covariance matrix. Obviously, in these exam-
ples one can easily sample directly from the target distribution, but they illustrate
how CSIR can give a large improvement over SIR for high-dimensional state spaces.
The next chapter uses this technique for a sequential importance sampling case where
the state space is high-dimensional.
[Figure 5.3: Same experiments as in Figure 5.2, plotted by method: (a) SIR, (b) CSIR.]
Chapter 6
Multi-Object Particle Tracking
This chapter applies conditional resampling in the particle filter context with an
application to multi-object tracking (MOT). The novelty in this approach is that
it allows the use of arbitrary jointly distributed forward movement and observation
models, while maintaining asymptotic convergence properties and computational ef-
ficiency. To make the particle filter computationally feasible under this joint model,
the use of conditional sampling importance resampling for sequential Monte Carlo is
introduced. This modified particle filter tracking algorithm can handle unknown or
varying numbers of objects as well as the problem of association of observations with
objects without making parametric assumptions on the nature of the forward model
or resorting to ad-hoc steps.
6.1 Background
6.1.1 Single Object Tracking
The task of inferring a single object path from a sequence of (possibly noisy) ob-
servations is known as tracking. One way to formulate tracking is as an inference
problem in a hidden Markov model (HMM). Recall that in a hidden Markov model,
it is assumed that there is some underlying “hidden” Markov process X1:T of interest
which cannot be observed directly, and a sample from an “observation” process Y1:T .
CHAPTER 6. MULTI-OBJECT PARTICLE TRACKING 44
X1:T is also called the state-space process. The simplest non-trivial hidden Markov
model for tracking is a Gaussian state-space process X1:T coupled with a Gaussian
observation process Y_{1:T}, with X_t, Y_t ∈ R^n. This can be stated as

    X_t = X_{t−1} + N(0, Σ_s),    (6.1)
    Y_t = X_t + N(0, Σ_o).    (6.2)
In this case, one can use the Kalman filter [55] to solve the inference problem numerically. The Kalman filter is a sequential procedure that first computes E[X_t | X_{t−1}], then provides updates based on the current observation y_t to get E[X_t | X_{t−1}, y_t]. This is iterated over t = 1, …, T. Due to the Gaussian nature of the processes, knowing E[X_t | X_{t−1}, Y_t] for each t is sufficient to know the full conditional distribution P[X_{1:t} | Y_{1:t}]. See Algorithm 8 for a step of the Kalman filter algorithm for this
simple example.
Algorithm 8 Kalman filter step for simple Gaussian process
P_t^− = P_{t−1} + Σ_s
K_t = P_t^− (P_t^− + Σ_o)^{−1}
x_t = x_{t−1} + K_t (y_t − x_{t−1})
P_t = (I − K_t) P_t^−
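Algorithm 8 translates directly into a few lines of code; the toy run below (all numerical values are illustrative) filters repeated noisy observations of a fixed point, pulling the estimate toward that point while the posterior covariance shrinks.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y, Sigma_s, Sigma_o):
    """One step of Algorithm 8 for X_t = X_{t-1} + N(0, Sigma_s),
    Y_t = X_t + N(0, Sigma_o)."""
    P_pred = P_prev + Sigma_s                      # P_t^-
    K = P_pred @ np.linalg.inv(P_pred + Sigma_o)   # Kalman gain K_t
    x = x_prev + K @ (y - x_prev)                  # updated mean x_t
    P = (np.eye(len(x_prev)) - K) @ P_pred         # updated covariance P_t
    return x, P

x, P = np.zeros(2), np.eye(2)
for _ in range(20):
    x, P = kalman_step(x, P, np.array([1.0, -1.0]),
                       0.01 * np.eye(2), 0.5 * np.eye(2))
# x approaches (1, -1); P settles to a small steady-state value
```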
An alternative to the Kalman filter is particle filter tracking (or particle tracking).
The modeling assumptions required to use the particle filter are quite general com-
pared to the Kalman filter. The advantage to the particle filter is that more realistic,
non-linear, and discontinuous observation and movement models may be specified;
the disadvantage is an increase in computational complexity and implementation dif-
ficulty, usually requiring Monte Carlo simulations instead of fast and relatively simple
deterministic computations.
Inference in tracking problems can often be posed as integrals or expected values.
While there exist deterministic, quadrature type methods for particle filters, in gen-
eral the integrals to be computed are high-dimensional, in which case deterministic
methods tend to be intractable. For this reason, it is common to resort to Monte
Carlo methods for particle filtering (Algorithm 9), although some attempts at deter-
ministic particle filtering have been made (see for example Doucet and De Freitas [31]
ch. 5).
Algorithm 9 Single object particle filter tracking
Draw initial particles x_0^{(1)}, …, x_0^{(N)} from γ_0.
for t in 1 … T do
    for each particle index i do
        Sample x_t^{−(i)} ∼ γ_t(· | x_{t−1}^{(i)}).
        Compute importance weight w_t^{(i)}.
    end for
    for each particle index i do
        Sample with replacement x_t^{(i)} from x_t^{−(1)}, …, x_t^{−(N)} with prob. prop. to w_t^{(i)}.
    end for
end for
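Algorithm 9 can be sketched for a one-dimensional random-walk model, using the movement model itself as the proposal γ_t so that the importance weight reduces to the observation likelihood (the common "bootstrap" choice; all names and values are illustrative):

```python
import numpy as np

def particle_filter(rng, ys, n_particles, sd_s, sd_o):
    """Bootstrap particle filter (Algorithm 9) for a 1-d random-walk model.
    The proposal gamma_t is the movement model, so the weight w_t^(i)
    reduces to the observation likelihood."""
    x = rng.standard_normal(n_particles)                  # from gamma_0
    means = []
    for y in ys:
        x = x + sd_s * rng.standard_normal(n_particles)   # propagate
        log_w = -(y - x) ** 2 / (2 * sd_o ** 2)           # weight
        w = np.exp(log_w - log_w.max())
        x = rng.choice(x, size=n_particles, p=w / w.sum())  # resample
        means.append(x.mean())
    return np.array(means)

# Simulate a hidden random walk, observe it with noise, and filter.
rng = np.random.default_rng(5)
truth = np.cumsum(0.1 * rng.standard_normal(50))
obs = truth + 0.3 * rng.standard_normal(50)
est = particle_filter(rng, obs, n_particles=1000, sd_s=0.1, sd_o=0.3)
# the filtered means track `truth` with error below the observation noise
```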
6.1.2 Multi Object Tracking
Multi-object tracking is a large and diverse field, with many applications across science and engineering. It is closely related to the single-object tracking problem. As noted in Vermaak et al. [100], most multi-object tracking algorithms fall into two categories. The first approach is to run multiple copies of individual single-object trackers, then post-process their outputs to handle occlusion and/or confusion. Interactive effects between objects and observations are assumed to be negligible or are dealt with heuristically. These algorithms tend to be fast and often work well in practice, but if interactive effects are non-negligible, significant bias may be introduced, and convergence to the optimal solution is not guaranteed as the number of samples goes to infinity. Hue et al. [48] give an early example of this approach, with an algorithm for multi-object
tracking using an ‘association vector’ approach coupled with parallel particle filters.
This model assumes independence in the observation and movement models, and uses
a simulation-wide indicator test to determine whether to add or remove particles at
a given time-step.
A second approach is to formally consider the multi-object tracking problem as
an instance of single-object tracking by enlarging the state space to represent all
object paths jointly as a single object. The process and observation models for this
enlarged object can then be arbitrarily jointly distributed across all individual objects
and observations. Computational techniques developed for single-object tracking can
then be applied to make inferences in this model, such as particle filtering. In this vein,
the authors of Khan et al. [56] give an algorithm for multi-object tracking that uses a
Markov Random Field to define a joint movement distribution for interacting objects.
The drawback of the enlarged-state approach is that SIR becomes inefficient in high dimensions.
To see why naive SIR is generally ineffective in high dimensions, consider the
example of n objects moving independently with no interaction. Parallel particle
filters tracking each object individually would preferentially draw their most likely
current states for each object. A naive joint-state particle filter implementation would
select the best joint-state space particle, but any single joint particle is unlikely to
contain all of the best individual current states. This argument suggests that the naive joint-state particle filter would require a number of samples exponential in the dimension to achieve error comparable to parallel single-object particle filters, which is impractical for even a modest number of tracked objects. To compensate for this effect, the algorithm described in this chapter applies conditional resampling to take advantage of the large degree of independence between groups of objects that are not in close proximity to one another.
Other useful frameworks for multi-object tracking include the mixture particle
filter [100] and the PHD filter [101]. For tutorials on general particle filter methods, see Chen [24] and Doucet and De Freitas [31]. An overview of particle-based approaches
to multi-object tracking is given in the thesis of Ozkan [80].
6.1.3 Tracking Notation
A time-step index t is a sequential coordinate index in Z_+. An observation φ is simply a position in R^n, occurring at a time-step t. An observation set Y_t is a set of
observations occurring at time-step t. Objects are assigned an index l, and object
states are represented as ζ_{l,t}. A trajectory ζ_{l,1:t} is a sequence of states representing
the movements of a single object, also called a path. Note that trajectories may
contain a leading and trailing sequence of null states ∅, indicating the object entered
and/or left the field of view at some point in time. An observation event ψ is an
intermediate variable, representing the process by which a set of observations was
generated by a set of objects. The object state set Xt at time t is a set containing all
of the states for all objects at time-step t, while the event state Ψt is the set of all
object events ψ at time t. The enhanced state at time t is the pair (Xt,Ψt) and is
denoted Xt. A particle x1:t is a set of trajectories representing the full joint evolution
of the state space under the particle filter algorithm up to time-step t. Each particle
has associated with it an importance weight W_{1:t}. Note that after a SIR step, x_{1:t} is an approximate sample from the target distribution at time t and W_{1:t} is reset to 1 for each particle. State-space, event, and observation process samples are denoted in lower case as x, ψ, y respectively, and the corresponding random variables are denoted in upper case as X, Ψ, Y. In this chapter it is assumed that there are N particle
samples from the state space, and use i as a sample index. The total number of time
steps observed is denoted T .
6.2 Conditional SIR Particle Tracking
In multi-object tracking, there are various discrete ‘decisions’ that a tracking algo-
rithm must make. These decisions correspond to discrete events in the forward simu-
lation model, such as the deletion/creation of new objects or occlusion, as well as the
‘association’ problem of assigning individual observations to objects. Addressing the
association problem is the primary contribution of algorithms such as JPDAF [22];
this method and others deal with association via Monte Carlo. Dealing with unknown
or changing numbers of objects is often addressed via ad-hoc methods such as defining
a ‘decision rule’ in which objects are added/removed across all particles simultane-
ously [48], or by assuming that the forward simulation follows some parametric form
such as a mixture model [100].
In the ‘enlarged state-space’ particle filter, both the deletion/creation and associ-
ation problems are handled in an asymptotically correct way, but as previously noted,
this approach yields poor computational complexity. Conditional SIR can be used
to improve performance. To facilitate this, grouping subset functions are defined to
decompose the state-space into two parts. A grouping subset function takes as input
a particle, and returns a grouping subset, which can be used to apply the grouped
Gibbs or the hit-and-run algorithms as described in §2.2.5. A key algorithmic design
issue in CSIR is in choosing appropriate grouping subset functions.
6.2.1 Grouping Subsets for Multi-Object Tracking
In the multi-object tracking context, one way to define grouping subset functions is to return sets of trajectories associated with individual observations φ ∈ y_t. G_φ(x_{1:t}) denotes the set of trajectories that are associated by events with observation φ for particle x up to time t. If there are no trajectories associated with φ in x_t, i.e.
φ was considered a false positive in xt, then the empty set ∅ is returned. As noted
above, trajectories ζ1:t may include a leading and trailing set of “null” observations
∅, corresponding to the object entering and leaving the field of view. A simplified
example of grouping based on observations is shown in Figure 6.1.
Given a grouping subset G, it is possible to sample approximately from π(X_G | X_{[−G]})
using conditional SIR. If G is chosen independently of X, then this defines a Markov
transition that leaves the stationary distribution invariant. Note that in the case of
well-separated trajectories, applying conditional SIR to enlarged state-space particles
is equivalent to running a separate particle filter for each object independently.
6.2.1.1 Choosing/Pruning Grouping Functions
To choose which groupings to apply conditional SIR to, a table of groupings is kept with the empirical CESS for each grouping, updated after each iteration. There is much potential redundancy in groupings, and updating the table is expensive, so this grouping table is typically 'pruned'. In general, it is possible to use arbitrary rules
about which groupings to use based on aggregate particle sample properties without
[Figure 6.1: Grouping subset functions based on observations, denoted by enclosed dashed lines. Filled circles: observations. x's, o's: individual objects from particles 1 and 2 respectively. Note that in the grouping on the right, the particles differ in the number of associated object trajectories.]
affecting the asymptotic convergence properties. Pruning decision rules include only
considering groupings that originated a constant number of time-steps in the past,
not considering ‘children’ for well-established groups in subsequent time-steps, and
removing ‘eliminated’ groups as determined by heuristics such as CESS. As noted, a
wide range of pruning algorithms can be used.
6.3 Application: Tracking Harvester Ants
In this section, an application of the CSIR approach is given to tracking the move-
ments of individual ants in a colony of harvester ants in a laboratory setting.
6.3.1 Object Detection
A key step in multi-object tracking is object recognition, i.e. identifying what constitutes an object and what does not. To facilitate this task, the tracking algorithm uses sophisticated object detection software known as GemIdent [47]. Each pixel is classified independently as belonging to one of the specified object types (or none). The algorithm then forms a graph representation of classified pixels and uses spectral
graph partitioning [91] to find clusters of pixels corresponding to individual objects.
6.3.1.1 Pixel Classification
Supervised learning refers to the practice of building a learning algorithm with the
aid of a set of training points. Interactive supervised learning is a technique that
recursively applies supervised learning to predict a response variable by asking the
user for input, training the algorithm based on the new input, reporting the results
of the training to the user, and repeating until the user is satisfied. This interactivity
streamlines the training process, allowing the user to initially identify a small number
of example points, then add further points to correct mistakes made in the classifi-
cation process. This has the effect of minimizing the total number of training points
that the user has to provide, and shortens the total time involved in the training process.
As an input to the supervised learning algorithm, it is necessary to provide pre-
computed features for each pixel to classify. The primary feature that is used here is
a ring score. The algorithm first normalizes pixel values by using the Mahalanobis
distance based on a pre-defined color set. To compute the ring score, take the average normalized value relative to color c of all pixels at radius r from the pixel of interest.
The algorithm computes this value for each color c and radius r < R for some pre-
defined maximum R. To incorporate movement information into pixel classification,
the ring scores of the same pixel in adjacent frames are also included, both forward
and backward in time.
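A simplified version of the ring-score feature can be sketched as follows. The real pipeline uses Mahalanobis-normalized scores per color c and all radii r < R across adjacent frames; here a single pre-normalized channel and one radius are assumed, and all names are illustrative.

```python
import numpy as np

def ring_score(img, cx, cy, r):
    """Average value of pixels at rounded distance r from (cx, cy); img is
    assumed pre-normalized (e.g. a Mahalanobis score for one color)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.round(np.hypot(xx - cx, yy - cy)) == r
    return float(img[mask].mean()) if mask.any() else 0.0

# A bright ring of radius 3 scores 1.0 at r = 3 and 0.0 at r = 1.
img = np.zeros((11, 11))
yy, xx = np.mgrid[0:11, 0:11]
img[np.round(np.hypot(xx - 5, yy - 5)) == 3] = 1.0
```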
Once features have been computed for each pixel, random forests [18] are used to classify each pixel based on the computed features. Random forests are used because of their relative simplicity, and for the current application they empirically give comparable results to other techniques such as support vector machines (SVM). Random forests also have the advantage of being easily interpretable, as it is possible to directly compute the relative importance of each feature in the classification process. This approach was compared to other machine learning techniques in the publicly available software library WEKA [37]. For a book-length treatment of statistical and machine learning techniques, see Hastie et al. [45].
6.3.1.2 Centroid Finding
After pixels are classified, it is necessary to determine the number and
locations of ants. The algorithm uses a flood-finding algorithm to form groups of
adjacent pixels of the same type into “blobs”. Small, disparate blobs also occur due
to incorrectly classified pixels. In addition, objects that are clustered together in an
image will tend to produce large connected blobs. The current application needs to
determine which blobs are noise, which indicate the presence of ants, and how many
ants each represents.
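The flood-finding step can be sketched as a breadth-first connected-components pass over the classified label image. The function name and the choice of 4-adjacency are assumptions for illustration.

```python
from collections import deque

def find_blobs(labels):
    """Group 4-adjacent pixels with the same nonzero class label into
    "blobs" via breadth-first flood fill. `labels` is a 2D list of class
    labels, with 0 meaning background/unclassified.
    Returns a list of (label, [(y, x), ...]) blobs."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if labels[y][x] == 0 or seen[y][x]:
                continue
            lab, blob, queue = labels[y][x], [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                            and labels[ny][nx] == lab:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append((lab, blob))
    return blobs
```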
To do this, the algorithm forms a graph of the blob by connecting adjacent pixels,
then uses spectral graph partitioning [91] to cut the blob. A good overview of spectral
methods for graphs can be found in Spielman [95]. One way to think about spectral
partitioning is as an approximation to the ’sparsest cut’ problem, which seeks to find
a cut that maximizes the number of vertices separated but minimizes the number of
edges cut. It is necessary to tune the parameters of this centroid finding step. Figure
(6.2) shows two blobs and the results of spectral partitioning.
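A minimal sketch of the spectral bisection idea: cut the blob graph by the sign of the Fiedler vector (the eigenvector for the second-smallest Laplacian eigenvalue). Power iteration with deflation is used here purely to keep the sketch self-contained; the actual implementation follows [91].

```python
import random

def fiedler_cut(n, edges, iters=2000):
    """Bisect a connected graph on vertices 0..n-1 by the sign of the
    Fiedler vector, approximated by power iteration on c*I - L with the
    all-ones direction (the trivial eigenvector) projected out."""
    deg = [0.0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    c = 2 * max(deg) + 1.0   # shift so all eigenvalues of c*I - L are > 0
    random.seed(0)
    x = [random.random() for _ in range(n)]
    for _ in range(iters):
        # project out the constant vector (Laplacian eigenvalue 0)
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        # multiply by (c*I - L) where L = D - A
        y = [(c - deg[i]) * x[i] for i in range(n)]
        for u, v in edges:
            y[u] += x[v]
            y[v] += x[u]
        norm = max(abs(yi) for yi in y) or 1.0
        x = [yi / norm for yi in y]
    return [1 if xi >= 0 else 0 for xi in x]
```

On a "barbell" of two triangles joined by one edge, the cut separates the two triangles, which is exactly the behavior wanted when two ants touch.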
6.3.2 Observation Model
While functioning relatively well in most cases, the GemIdent machine-learning
algorithm is prone to error due to background clutter, ants moving in close proximity
to one another, unusual orientations of ants as they traverse objects, and other
factors that can confuse even human observers. Additionally, there are tradeoffs
between the quality of the images, the training time of the interactive supervised
learning algorithm, computational efficiency, and the accuracy of the classifications.
It is desirable to have a tracking algorithm that can operate robustly in
the presence of this relatively noisy process. Of particular concern is the propensity
of the algorithm to split an individual ant into two observations or to merge two ants
into a single observation.
Under the observation model, each Yt is generated as the result of events that
occur according to a probability distribution dependent on Xt.

Figure 6.2: Blob bisection via spectral partitioning

In the current application, there are two broad classes of object events: independent events, and joint
events. Independent Normal observation events correspond to the canonical single tra-
jectory/single observation case, with an observation occurring at a location according
to a bivariate Normal distribution centered at the object location, with standard de-
viation σo. Independent false negative events indicate no observation recorded for
an individual object. This occurs for each object according to a Bernoulli random
variable with parameter λn. Independent false positive events indicate a spurious ob-
servation not directly caused by any proximate object. These occur according to a
uniform 2D Poisson process with parameter λp. Splitting is a special kind of false pos-
itive in which a pair of observations appear at the front and back of an object, due
to overzealous segmentation. This occurs for each object according to a Bernoulli
random variable with parameter λs. Finally, two adjacent objects i and j have a
probability of ‘merging’ with their neighbors as a function of their distance, with
an observation occurring near the center of the merged objects. Merging occurs in
Event type                    Description                                              Parameters
Independent observations      Single trajectory with a single observation              σo
Independent false negatives   No observation recorded for an individual object         λn
Independent false positives   Spurious observation not directly related to an object   λp
Splitting                     Single object yields a pair of observations              λs
Merging                       May occur when objects are within close proximity        λm, αm

Table 6.1: Observation event types.
the joint model between two objects at distance d with probability αm exp(−d^4/λm).
Individual objects may be involved in multiple mergings at the same time.
6.3.3 State-Space Model
In the current application it is assumed that individual objects move according to
independent affine Gaussian stochastic processes, with drift vectors depending on an
estimate of the object’s velocity. For this model, it is possible to decompose the object
state at time t as ζt = (ut, vt), where ut, vt ∈ R^2 are the position and velocity vectors
respectively, and write the model as

u_t = u_{t−1} + v_{t−1} + N(0, σm^2 I)   (6.3)

v_t = α(u_t − u_{t−1}) + (1 − α) v_{t−1},   (6.4)
where σm is the movement dispersion parameter. The possibility that individual
objects can enter or leave the field of view is modeled according to a birth-death
process with rate parameters λb, λd (both birth and death rate uniform with respect
to space and time). This is a relatively simple movement model, but for the current
application it appears to suffice, as most of the complications and non-linearities come
from the observation process.
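One step of the forward model (6.3)-(6.4) is straightforward to write down; the smoothing weight α is not pinned down in the text, so the default value here is a placeholder.

```python
import random

def forward_step(u, v, alpha=0.5, sigma_m=1.4, rng=random):
    """One step of the affine Gaussian movement model:
    u_t = u_{t-1} + v_{t-1} + N(0, sigma_m^2 I), followed by the velocity
    smoothing update v_t = alpha*(u_t - u_{t-1}) + (1 - alpha)*v_{t-1}.
    u and v are (x, y) tuples; alpha is an assumed smoothing weight."""
    u_new = (u[0] + v[0] + rng.gauss(0, sigma_m),
             u[1] + v[1] + rng.gauss(0, sigma_m))
    v_new = (alpha * (u_new[0] - u[0]) + (1 - alpha) * v[0],
             alpha * (u_new[1] - u[1]) + (1 - alpha) * v[1])
    return u_new, v_new
```

With sigma_m = 0 the object simply drifts with its current velocity, which makes the model easy to sanity-check.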
6.3.4 Importance Distribution
A critical consideration is the choice of importance distribution. As discussed in §3.2,
the optimal importance distribution γ∗ samples xt based on the previous object state
and all current and future observations, i.e.

γ*_t(x) = π(xt|xt−1, y_{t:T}).   (6.5)
Since drawing from γ∗ is infeasible, the second best method practically available is
often to use the locally optimal importance distribution,
γt(x) = π(xt|xt−1, yt). (6.6)
In the current context it is generally impossible to sample from this importance
distribution directly, so instead the algorithm uses Markov chain Monte Carlo, in
particular the data augmentation algorithm of §2.2.4.
To sample approximately from γt, the event set Ψt is used as an auxiliary variable
for the data augmentation algorithm. The goal in this case is to draw a joint sample
according to π(xt,Ψt|xt−1, yt), even though the primary interest is in xt. The algo-
rithm does this by first generating a draw Ψt conditioned on xt, then xt conditioned
on Ψt, or
Ψt ∼ π(Ψt|xt, xt−1, yt) (6.7)
xt ∼ π(xt|Ψt, xt−1, yt), (6.8)
and repeating. As a starting point, xt is initialized with one forward step of the object
state-space model, i.e. xt is drawn according to π(xt|xt−1).
Algorithm 10 Sampling π(xt, Ψt|yt, xt−1) via data augmentation.
  Start with projected trajectory positions xt.
  repeat
    Sample π(Ψt|xt, yt) via Algorithm 12.
    Sample π(xt|Ψt, yt, xt−1) as described in §6.3.4.
  until converged
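The alternation in Algorithm 10 can be sketched as follows, with the two conditional samplers passed in as functions. Their signatures, and the fixed sweep count standing in for a convergence test, are assumptions for illustration.

```python
def data_augmentation(x_prev, y_t, sample_events, sample_states, n_sweeps=50):
    """Skeleton of Algorithm 10: alternate draws of the event set Psi_t
    given x_t, and of x_t given Psi_t, starting from one forward step of
    the movement model. `sample_events(x_t, y_t)` and
    `sample_states(psi_t, x_prev, y_t)` stand in for the conditional
    samplers of §6.3.4."""
    x_t = sample_states(None, x_prev, y_t)   # Psi=None: pure forward step
    psi_t = None
    for _ in range(n_sweeps):                # fixed sweep count stands in
        psi_t = sample_events(x_t, y_t)      # for a convergence check
        x_t = sample_states(psi_t, x_prev, y_t)
    return x_t, psi_t
```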
6.3.4.1 Sampling from State-Space Given Events
Sampling from π(xt|Ψt, xt−1, yt) is decomposed by event types. Note that at this
step, resampling xt is resampling the set of object positions, as object birth/death
movements are considered when sampling from events. The algorithm cycles through
each element of Ψt and performs an action based on the event type.

Independent normal observation Both the movement and observation processes
are Gaussian, so the object location can be modeled as bivariate Normal. This is due
to the following relationship.
π(xt|xt−1, yt) = π(yt|xt) π(xt|xt−1) / π(yt|xt−1)   (6.9)
              ∝ π(yt|xt) π(xt|xt−1)   (6.10)
When the event type is independent Gaussian movement and observation, both
π(yt|xt) and π(xt|xt−1) are Normal, and the product of the distributions will also
be Normal. Since both the observation and movement models assume independence
in the covariance structure of the 2D coordinate axes, the distribution
π(ζt|ζt−1, φt) can be written as the product of the distributions for each coordinate
axis. The coordinate-wise standard deviation and mean of the object position ut will
then be
σ_ut = sqrt( σm^2 σo^2 / (σm^2 + σo^2) )   (6.11)

µ_ut = ( (u_{t−1} + v_{t−1}) σo^2 + yt σm^2 ) / (σm^2 + σo^2),   (6.12)

and ut can be expressed as a N(µ_ut, σ_ut^2 I) random variable.
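The coordinate-wise product-of-Gaussians update (6.11)-(6.12) takes only a few lines; it combines the movement prior N(u_{t−1} + v_{t−1}, σm^2) with the observation likelihood N(y, σo^2).

```python
import math

def normal_update(u_prev, v_prev, y, sigma_m, sigma_o):
    """Posterior mean and per-coordinate standard deviation for an object
    with an independent Normal observation y, per (6.11)-(6.12).
    u_prev, v_prev, y are (x, y) tuples."""
    prior_mean = (u_prev[0] + v_prev[0], u_prev[1] + v_prev[1])
    s2 = sigma_m**2 + sigma_o**2
    sigma = math.sqrt(sigma_m**2 * sigma_o**2 / s2)
    # precision-weighted average of prior mean and observation
    mean = tuple((pm * sigma_o**2 + yi * sigma_m**2) / s2
                 for pm, yi in zip(prior_mean, y))
    return mean, sigma
```

With equal variances the posterior mean is the midpoint of the predicted position and the observation, as expected.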
Independent false negative In this case there is no observation information, so
the object state is updated according to the forward movement model (6.3).
Independent false positive Independent false positives are not related to any
objects, so no object states need to be updated.
Splitting Splitting is represented as an independent normal observation, using the
average of the two points as the observation center.
Merging/Splitting Given a series of mergings and splittings Ψ, associated subsets
of observations yt, and a subset of object points xt, it is possible to compute the
probability of the event set occurring as π(Ψ, yt|xt, xt−1). In order to sample from
π(xt|Ψ, yt), the algorithm samples from a Metropolis-Hastings independence chain
with proposal density Q(xt) = π(xt|xt−1). The acceptance probability of a new state
x′t will then be

a = π(Ψ, yt|x′t, xt−1) / π(Ψ, yt|xt, xt−1)   (6.13)
6.3.4.2 Sampling from Events Given State-Space
The second half of the data augmentation algorithm for sampling from the importance
distribution (6.6) is to sample the events Ψ given the state-space positions xt.
γ(Ψt|xt) = π(Ψt|xt, yt) (6.14)
To sample from γ(Ψt|xt) a Metropolis-Hastings Markov chain is used. This entails
drawing samples from a proposal chain κ, and accepting/rejecting them based on
their likelihoods relative to the current state.
The underlying Markov chain used in the current approach is based on sampling
from object/observation associations. The global version is as follows. Every object
is (independently) associated with an observation with probability as a decreasing
function of the distance, f(d(ζ, φ)). It is then possible to directly evaluate the likeli-
hood of this association. The algorithm samples from the target distribution using a
Metropolized independence chain. This global proposal distribution is referred to as
Figure 6.3: Association of objects with observations. ’Events’ correspond to connected components in this bipartite graph, including Normal observations, splitting, merging, false positives, false negatives, and joint events.
κ. For an edge set E, one can express κ as

κ(E) = ∏_{(ζ,φ)∈E} f(d(ζ, φ)) · ∏_{(ζ,φ)∉E} (1 − f(d(ζ, φ)))   (6.15)
f should be chosen to be close to the probability of an observation with position
φ being generated in an event involving an object at location ζ: for d(ζ, φ) < σo
association is almost certain, for d(ζ, φ) < 2σo it is quite likely, and for d(ζ, φ) > 3σo
it is improbable. The algorithm assumes f takes a parametric form such as
f(d) = α exp(−dp/β), (6.16)
where p and β are chosen based on σo, and 0 < α ≤ 1 is perhaps related to the false
positive and merging probabilities. An example would be to take α = .95, p = 4, and
β = −(3σo)^p / log .01, which gives the plot below.
[Plot: f(d) against d/σo, decreasing from α = .95 at d = 0 to about .01α at d = 3σo.]
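The example choice of constants can be checked numerically; this is a direct sketch of (6.16) with β set so that f(3σo) ≈ .01α.

```python
import math

def assoc_prob(d, sigma_o, alpha=0.95, p=4):
    """Association probability f(d) = alpha * exp(-d^p / beta) from (6.16),
    with beta = -(3*sigma_o)^p / log(.01) as in the example, so that
    f(3*sigma_o) = .01 * alpha."""
    beta = -(3 * sigma_o)**p / math.log(0.01)
    return alpha * math.exp(-d**p / beta)
```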
There are several local versions. One is to pick an object at random, resample
its associations, and accept according to a Metropolis step. This is an instance of
“single-particle Metropolis”. Another variation would be to resample the associations
for a subset of objects. The subset could be random, or could be taken as the k
nearest neighboring objects to a randomly chosen object or observation, or as all
objects within distance d of a randomly chosen object or observation; k or d could
be randomly chosen or set constant. This procedure will be ergodic if choices are
centered around objects (nonzero probability for every bipartite graph), and ergodic
for choices centered around observations if k or d are large enough that every
object is close enough to an observation to be resampled at some point. Refer to the
current edge set as E and the proposal edge set as E′.
The Metropolis acceptance value a is

a = (γ(E′)/γ(E)) · (K(E′, E)/K(E, E′))   (6.17)
  = (γ(E′)/γ(E)) · (κ(E)/κ(E′))   (6.18)
The ratio can be directly computed as

κ(E)/κ(E′) = ∏_{(ζ,φ)∈E, ∉E′} f(d(ζ, φ)) / (1 − f(d(ζ, φ))) · ∏_{(ζ,φ)∈E′, ∉E} (1 − f(d(ζ, φ))) / f(d(ζ, φ)),   (6.19)

and similarly for the ratio γ(E′)/γ(E).   (6.20)
Algorithm 11 Sampling proposal event set Ψ via association distribution κ.
  Form bipartite association probability graph A(xt, yt) using association probability function f.
  Draw edges E independently with probability proportional to edge weights.
  Find connected components of E, forming event set Ψ.
  Return Ψ.
Each edge set E defines an event set Ψ by the connected components in the
association graph of the edge set. However, this relationship is not injective, as
the same event set Ψ may result from multiple different edge sets, since connected
components for joint events may be defined in multiple ways. Then
κ(Ψ) = κ(E) / κ(E|Ψ)   (6.21)
     = Σ_{E′ ↦ Ψ} κ(E′)   (6.22)
     = E[κ(E) 1_{E ↦ Ψ}]   (6.23)
This can be decomposed by event components. For each ψ ∈ Ψ, define Eψ to be the
associated set of edges generated by κ for the objects and observations associated with
ψ. Then κ(Ψ) can be decomposed as the probability that there are no edges between
objects and observations not in the same events, multiplied by the probability of
connectivity within event components.
κ(Ψ) = κ(E \ ∪_{ψ∈Ψ} Eψ = ∅) · ∏_{ψ∈Ψ} κ(Eψ ↦ ψ)   (6.24)
To compute the first term on the RHS of (6.24), the probability that there are no
cross-component edges, simply compute
κ(E \ ∪_{ψ∈Ψ} Eψ = ∅) = ∏_{(ζ,φ)∉Ψ} (1 − κ((ζ, φ))),   (6.25)
i.e. the product of one minus the probability of each non-component edge occurring.
Since the association probability graph is effectively sparse (the probability of an edge
between an object and an observation is effectively 0 if their distance exceeds 4σo),
this probability can be computed quickly. To compute the individual component
probabilities κ(Eψ ↦ ψ), for small numbers of objects plus observations it is feasible
to perform the calculations directly through enumeration. For large joint distributions,
the algorithm estimates this probability through Monte Carlo estimation of

κ(Eψ ↦ ψ) = E[κ(Eψ) 1_{Eψ ↦ ψ}]   (6.26)
To compute the probability of an event set Ψt under γ, note that
γ(Ψt|xt) = π(Ψt|xt, yt)   (6.27)
         = π(Ψt, xt, yt) / π(xt, yt)   (6.28)
         = π(yt|xt, Ψt) π(Ψt|xt) π(xt) / π(xt, yt)   (6.29)
         ∝ π(yt|xt, Ψt) π(Ψt|xt)   (6.30)
Both of these probability distributions, π(yt|xt, Ψt) and π(Ψt|xt), are defined by the
forward observation model. The first term can be decomposed as the product of the
individual event probabilities. This can be written as

π(yt|xt, Ψt) = ∏_{ψ∈Ψt} π(yt,ψ|xt,ψ, ψ),   (6.31)
where yt,ψ and xt,ψ are the subsets of yt and xt associated with event ψ.
For the second term on the RHS of (6.30), π(Ψt|xt), recall that the ’event’ random
variable corresponds to a connected component in the association graph but doesn’t
carry any positional information. So this is the probability that the set of trajectories
xt,ψ produced a (joint) event, times the probability that these events produced the
correct number of observations |yt,ψ| given the object associations. The joint event-set
probability can be written as
π(Ψt|xt) = π(Ψ_{yt}, Ψ_{xt}|xt)   (6.32)
         = π(Ψ_{yt}|Ψ_{xt}, xt) π(Ψ_{xt}|xt)   (6.33)
To compute π(Ψxt|xt), the ’object association probability’ for the joint forward model
is used to compute the probability of each component being connected times the
probability of there being no cross-component edges, or
π(Ψ_{xt}|xt) = ( ∏_{ψ∈Ψ} π(ψ|xt,ψ) ) · ( ∏_{(φ,φ′)∉Ψx} (1 − π((φ, φ′)|xt)) )   (6.34)
Combining these yields
γ(Ψ|xt) ∝ ( ∏_{ψ∈Ψ} π(yt,ψ, ψ|xt,ψ) ) · ( ∏_{(φ,φ′)∉Ψx} (1 − π((φ, φ′)|xt)) ),   (6.35)
where π(yt,ψ, ψ|xt,ψ) can be broken down as
π(yt,ψ, ψ|xt,ψ) = π(yt,ψ|xt,ψ, ψ)π(ψy|ψx, xt,ψ)π(ψx|xt,ψ) (6.36)
Algorithm 12 Metropolis sampling of π(Ψt|xt, yt).
  Start with event set Ψ.
  repeat
    Ψ′ ← Ψ.
    Pick a random subset of objects S ⊂ xt, using local or global criteria.
    Remove events ψ ∈ Ψ′ associated with S.
    Randomly sample new events for S according to κ, and add them to Ψ′.
    Compute acceptance probability a(Ψ, Ψ′) via (6.18).
    With probability min(1, a), set Ψ ← Ψ′.
  until converged
6.3.5 Computing Relative and Marginal Importance Weights
Another main element of the importance sampling algorithm is the computation of
importance weights. It is of interest to compute both the relative importance weights
for an entire particle, and the marginal importance weights for grouping subsets.
6.3.5.1 Computing Particle Relative Importance Weights
The importance weight contribution from time t will be

Wt(x) = πt(x) / γt(x).   (6.37)
Since γt(x) ∝ P(xt|xt−1, yt), the contribution will be proportional to

wt(x) = P(yt|xt−1)   (6.38)

This can be computed via Monte Carlo as

P(yt|xt−1) = E_{Xt|xt−1}[P(yt|Xt)],   (6.39)
where xt is drawn from P(xt|xt−1). However, this estimator may have high variance,
since for any given xt, it may be unlikely to generate a given yt. From the sampling
step there are approximate samples from P(xt,Ψt|xt−1, yt), which is actually the op-
timal importance distribution for estimating (6.39). However, this probability is only
known up to a normalizing constant (the normalizing constant is in fact P(yt|xt−1)).
One way of computing this normalizing constant is to observe the following:

P(yt|xt−1) = P(yt|xt, Ψt) P(xt, Ψt|xt−1) / P(xt, Ψt|yt, xt−1)   (6.40)

P(xt, Ψt|yt, xt−1) / P(yt|xt, Ψt) = P(xt, Ψt|xt−1) / P(yt|xt−1)   (6.41)

Integrating with respect to xt, Ψt gives

1 / P(yt|xt−1) = E_{Xt,Ψt|yt,xt−1}[ 1 / P(yt|Xt, Ψt) ]   (6.42)

P(yt|xt−1) = ( E_{Xt,Ψt|yt,xt−1}[ P(yt|Xt, Ψt)^{−1} ] )^{−1}.   (6.43)
Also note that computing P(yt|xt, Ψt) is straightforward: it is just the product of
the observation probabilities for each event ψ ∈ Ψt, which are being computed
anyway to generate the joint samples. This means a Monte Carlo estimate based on
E_{Xt,Ψt|yt,xt−1}[P(yt|Xt, Ψt)^{−1}] can be used to estimate P(yt|xt−1), and this
estimate can be built concurrently while building the joint sample Xt, Ψt|yt, xt−1.
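The resulting estimator (6.43) is just a harmonic mean of the per-sample observation likelihoods; a minimal sketch (the input representation, a list of already-computed likelihoods P(yt|xt, Ψt), is an assumption):

```python
def estimate_obs_likelihood(samples):
    """Estimate P(y_t | x_{t-1}) via (6.43): the harmonic mean of
    P(y_t | x_t, Psi_t) over joint samples (x_t, Psi_t) drawn
    approximately from P(x_t, Psi_t | y_t, x_{t-1}). `samples` is the
    list of those per-sample observation likelihoods."""
    inv_sum = sum(1.0 / p for p in samples)
    return len(samples) / inv_sum
```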
6.3.5.2 Computing Marginal Importance Weights
In order to perform resampling, it is necessary to estimate the marginal importance
weights,
W^{(i)}_j = π(x^{(i)}_j | x^{[−j]}) / γ(x^{(i)}_j)   (6.44)
This computation depends on the choice of “coordinate index” j, which corresponds
to a partition of the state space as discussed in §2.2.5. To choose this partition,
the extended state-space (xt, Ψt) that includes the event set Ψt will be used.
There are several different possible ways to choose partitions, as discussed in §6.2.1.
The main purpose of conditional resampling is to allow for ’revisions’ across multiple
time-steps by ’grafting’ sub-trajectories from one particle onto another. This is nec-
essary because events represent realizations of ’decision points’, and decisions that
are plausible at timestep t may be much less plausible compared to alternatives when
taking into account future observations.
As previously noted, there are many possible grouping functions that may be
chosen to which to apply conditional resampling. In the current application, tra-
jectory groupings are formed based on well-separated observations. Recall that it
is possible to choose grouping functions arbitrarily based on yt without introducing
bias. The goal in this case is to form groups of individual trajectories based on single
observations. The algorithm then keeps this grouping when/if the set of trajectories
enters a joint merging/splitting setting, and decides to discard the grouping based on
aggregate properties of the group. To resample across the grouping, positions from
one trajectory are ’grafted’ onto another using the last k time-steps, the resulting
marginal importance weights are estimated, and resampling is performed based on
these weights.
Algorithm 13 Conditional sampling importance resampling for particles.
  Determine which grouping sets to resample, forming new groupings at each observation and pruning old groupings using criteria such as age and CESS.
  Resample particles using SIR to get x_t^{(1)}, . . . , x_t^{(N)}.
  for each grouping j do
    for i in 1, . . . , N do
      Compute marginal importance weights using (6.44).
      Choose grouping k with probability proportional to the marginal importance weights.
      Graft grouping Gj(x_t^{(k)}) onto x^{(i)} according to §6.3.5.2.
    end for
  end for
6.4 Empirical Results
6.4.1 Simulated Data
For the previous example, the “true” hidden state of the objects is unknown, as
the example comes from real-world data. In this example, samples from the hidden
state-space are simulated using the Gaussian forward model (6.3), and observations
Algorithm 14 Multi-object particle tracking for Harvester ants.
  Initialize particles x_1^{(1)}, . . . , x_1^{(N)} based on the prior distribution/first observations y1.
  Compute initial importance weights w_1^{(1)}, . . . , w_1^{(N)}.
  for t in 2 . . . T do
    Advance each particle by sampling (x_t^{(i)}, Ψ_t^{(i)}) ∼ γt via Algorithm 10.
    Estimate importance weights for each particle via (6.43) using samples from the previous step.
    Apply conditional SIR to construct new particles via Algorithm 13.
  end for
are drawn according to the observation model (§6.3.2). Choosing samples from this
known, tractable model allows us to directly evaluate results from the particle tracking
algorithm.
The simulated data starts with 100 objects distributed uniformly at random in
a 500 by 500 grid. The simulation then proceeds by moving the objects according
to the forward model, with movement standard deviation parameter σm = 2. The
observation model is simulated with observation standard deviation parameter σo =
1.5, split probability parameter λs = .006 for each object at each time-step, and
with a merging probability of λm = .45 (decreasing with distance). The results for
one state-space sample with path lengths and objects per frame are summarized in
Figure (6.4), and “observed” samples with the number of observations per frame are
summarized in Figure (6.5).
We then ran this example using particle tracking, first using a sample from the
importance distribution (6.6), then using CSIR (6.7). As in the previous example,
CSIR produces much longer path lengths, closely matching the “true” number of
paths that last the entire simulation.
6.4.2 Short Harvester Ant Video
As a first example of the tracking algorithm, we used GemVident to find centroids
for a video file that has 753 frames. A screenshot of the centroid finding results
for a single frame from GemVident is shown in Figure 6.8. The centroid finding
algorithm for this video resulted in an average of 109 observations per frame, with
Figure 6.4: “True” distribution of path lengths and trajectories per frame, simulatedexample.
Figure 6.5: Centroid observations per frame, simulated example.
the distribution of observations shown in Figure 6.9.
For this example, we first drew a sample from the importance distribution and
inspected the quality of the sample. A video file of a draw from the importance distri-
bution can be found here: http://www.stanford.edu/~guetz/particleTracking/
noCSIR.mp4. The blue dots represent trackings, the red dots are the centroids from
GemVident. Note that the trackings tend to follow the observations for a short
number of time-steps before getting lost. One can see this directly by looking at a
histogram of the trajectory lengths over the sample (Figure 6.10).
We then applied the CSIR particle filter to this example using 20 particles. A video
file of the CSIR tracking can be found here: http://www.stanford.edu/~guetz/
Figure 6.6: Distribution of path lengths and trajectories per frame using a samplefrom the importance distribution, simulated example.
Figure 6.7: Distribution of path lengths and trajectories per frame using CSIR, sim-ulated example.
particleTracking/CSIR.mp4. The tracking is markedly better. Looking at the trajectory
lengths (Figure 6.11), note that a large percentage of the trajectories last the
entire simulation. Also note that the number of trajectories per frame matches the
number of observations much more closely for both the sample from the importance
distribution and CSIR.
This example was repeated for N = 5, 20, 40 particles: N = 5 took 32 seconds,
N = 20 took 126 seconds, and N = 40 took 253 seconds. Note that the increase is
essentially linear in the number of particles. This is because most grouping
subsets are well separated from each other, so it is possible to use a single marginal
importance weight computation for most particle/grouping subset pairs instead
of the more complicated O(N^2) version.
Figure 6.8: GemVident screenshot, showing centroids.
Figure 6.9: Centroid observations per frame from Harvester ant example.
Figure 6.10: Distribution of path lengths and trajectories per frame using a samplefrom the importance distribution, Harvester ant example.
Figure 6.11: Distribution of path lengths and trajectories per frame using CSIR,Harvester ant example.
Chapter 7
Network Growth Models
In recent years there has been much interest in the study of generative models that
explain observed properties of networks derived from biology, sociology, and computer
science. The next two chapters discuss Monte Carlo methods for performing inference
for network data structures. Among the most widely researched models are
network growth models such as preferential attachment [8] and duplication/divergence
(also known as vertex copying) [59, 25]. One reason for the focus on dynamic models
of network growth is that they allow researchers to understand how networks develop
over time, with relatively simple rules resulting in surprising complexity. For a survey
of the history and applications of network growth models see Newman [78], and
Durrett [32] for a more recent book-length treatment.
The primary motivations for studying generative models of network growth are to
help explain, understand, and predict observed network data structures and features.
For example, many real-world networks have degree sequences that appear to follow
heavy-tailed distributions, such as the so-called power-law distributions, where the
probability of a vertex having degree k is proportional to k^{−α} for α > 0. The classical
models of random networks, such as the Erdos-Renyi G(n, p) model in which edges
are placed between each pair of vertices according to independent Bernoulli trials
with probability p, have asymptotically Poisson degree distributions, making them
unsuitable for modeling networks with heavy-tailed degree distributions. In contrast,
many network growth models such as preferential attachment and vertex copying
produce asymptotically power-law networks.
7.1 Background
Many realistic models of network formation, whether of sociological, biological,
or computer networks, involve growth from some smaller “seed” network. A
network growth model is a stochastic process through which a network G is formed by
adding vertices to a network one by one, with edges created between the new vertex
and pre-existing vertices. Well known examples of network growth models include so
called preferential attachment and vertex duplication models.
In the current context, network growth models are considered. Unless otherwise
stated, graphs are assumed to be simple, i.e. they are undirected, have no self-
loops, no multiple edges, and no edge weights. Furthermore, the
models used are assumed to be strict growth models, i.e. removal of vertices or edges
is not allowed, edges cannot be “rewired” once they appear, and new edges always
have one endpoint in the newly added vertex. The advantage of modeling networks
through strict growth mechanisms as opposed to a more general model that might
allow edge and vertex deletions or rewirings is that they are easier to analyze and
simulate.
One issue with many commonly used network growth models is that they are
often formulated in a way that is not statistically well-defined, with most real-world
networks having zero measure. For example, the original Barabasi-Albert preferential
attachment model adds a deterministic number of edges k at each time-step; if a
network does not have kn edges, then it has zero likelihood. Similar complaints can
be made about many other common models, such as the vertex copying and forest
fire models. The primary reason these models are not well-defined statistically is
that they were formulated in order to make theoretical analysis of their aggregate
properties easier, such as showing they give power-law degree distributions or are well-
connected. In the current context, however, it is preferable to use models that are
statistically well-defined. There have been several attempts to make network growth
models more statistically robust; see for example Leskovec et al. [65] and Sheridan
et al. [90].
Given the order in which nodes arrived in an observed network, one can compute
the likelihood of the network under a growth model by multiplying the likelihood
contribution of each vertex as it enters the network. In other words, to compute the
likelihood of a graph G given an ordering µ and model parameters θ, one simply forms
the product
W(G|θ, µ) = ∏_{i=1}^{n} P(x_{µ(i)} | x_{µ(1)}, . . . , x_{µ(i−1)}),   (7.1)
where xµ(i) is the event of the µ(i)th vertex entering G along with its corresponding
edges. Denote W(G|θ, µ) the order likelihood of G.
This representation immediately yields a mechanism for computing the full likelihood
of G by summing the order likelihoods (7.1) over Πn, the space of all possible
orderings of length n:

W(G|θ) = Σ_{µ∈Πn} W(G|θ, µ) P(µ).   (7.2)
P(µ) is the prior distribution over the space of permutations Πn. In this chapter P(µ)
is assumed to be uniform over the space of permutations, but in practice one may
have prior knowledge about the order in which vertices entered the network and can
construct a different prior distribution.
Since there are n! possible permutations of length n, brute force enumeration
is infeasible except for the smallest graphs. A natural idea is thus to construct an
unbiased Monte Carlo estimator Ŵ(G|θ) of (7.2) by drawing N iid samples µ1, . . . , µN
distributed according to P(µ):

Ŵ(G|θ) = (1/N) Σ_{i=1}^{N} W(G|θ, µi).   (7.3)
In general the estimator Ŵ may have high variance, making the Monte Carlo
estimator little better than brute force. Fortunately, one can usually reduce the
variance of Ŵ by using the importance sampling techniques described in §2.1.
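Under a uniform prior, the plain estimator (7.3) takes only a few lines; the order-likelihood evaluator implementing (7.1) is assumed to be supplied by the growth model.

```python
import random

def estimate_graph_likelihood(order_likelihood, n, N=1000, rng=random):
    """Plain Monte Carlo estimate of W(G|theta) in (7.2) under a uniform
    prior on orderings: average W(G|theta, mu) over N uniformly drawn
    permutations mu of 0..n-1. `order_likelihood(mu)` evaluates (7.1)."""
    total = 0.0
    for _ in range(N):
        mu = list(range(n))
        rng.shuffle(mu)           # uniform random vertex ordering
        total += order_likelihood(mu)
    return total / N
```

As the text notes, this plain estimator can have high variance; importance sampling over orderings (§2.1) replaces the uniform draw and reweights accordingly.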
7.1.1 Erdos-Renyi
Perhaps the simplest model of network growth is uniform attachment (UA), where
edges are attached from a new vertex to pre-existing vertices independently at
random with uniform constant probability p. This process is equivalent to the well-
studied Erdos-Renyi random graph model G(n, p). The G(n, p) and closely related
G(n,m) models were introduced by Gilbert [41] in 1959 and subsequently analyzed
by Erdos and Renyi [34]. For comprehensive book-length treatments of G(n, p) and
G(n,m) see Bollobas [16], Janson et al. [50] and also Durrett [32].
The advantage of the G(n, p) model lies in the simplicity of analysis. Vertex degrees are
distributed according to a Binomial Bin(n−1, p) distribution. Since edges are added
independently according to a Bernoulli with parameter p, the number of edges m in
G is a sufficient statistic. The likelihood of G given parameter p is simply
W(G|p) = P[ Bin( (n choose 2), p ) = m ],   (7.4)

since there are (n choose 2) possible independently chosen edges.
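The likelihood (7.4) can be evaluated directly as a Binomial point mass:

```python
import math

def gnp_likelihood(n, m, p):
    """Likelihood (7.4) of a simple graph with n vertices and m edges under
    uniform attachment / G(n, p): the Binomial(C(n,2), p) probability mass
    at m, since m is a sufficient statistic."""
    trials = math.comb(n, 2)
    return math.comb(trials, m) * p**m * (1 - p)**(trials - m)
```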
Many interesting properties have been demonstrated about random graphs, such
as locally tree-like structure, low diameter and high girth. They were famously used
by Erdos as one of the first applications of the probabilistic method in combinatorics
and graph theory. They are, however, restricted in their usefulness as statistical
models for many real-world networks, as many of these networks have “heavy-tailed”
degree distributions, are highly clustered, or have hierarchical structure that is very
rare in the G(n, p) model.
7.1.2 Preferential Attachment
The central concept of preferential attachment (PA) models is that the “rich get
richer”; a new edge is more likely to be added to a vertex if it is already incident to
many edges. The preferential attachment family of network growth models has a long
history, beginning with Yule [105] in 1924 to explain evolutionary speciation events,
then later by Simon [92] in 1955 to explain the distribution of word frequencies in
text documents. Price [82] in 1976 was one of the first to use these models to describe
“power-law” degree distributions in networks.
The Yule-Simon and Price models can be shown to be equivalent to the Polya
urn model [81], in which a vertex corresponds to a “red” ball, all other vertices are
“blue” balls. At each step, choose a ball uniformly at random and replace it with
2 balls of the same color. The proportion of “red” balls can be shown to be a
martingale whose limiting value follows a Beta(r, b) distribution, where
r is the initial number of “red” balls and b is the initial number of “blue” balls. In
the corresponding models of network growth, this means that older vertices tend to
be incident to a significant proportion of the edges in the graph.
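The urn dynamics are straightforward to simulate. The sketch below, with illustrative parameters, checks that the red proportion behaves as a martingale with mean r/(r + b):

```python
import random

def polya_urn(red: int, blue: int, steps: int, rng: random.Random) -> float:
    """Polya urn: repeatedly draw a ball uniformly at random and
    replace it with two balls of the same color; return the final
    proportion of red balls."""
    for _ in range(steps):
        if rng.random() < red / (red + blue):
            red += 1
        else:
            blue += 1
    return red / (red + blue)

rng = random.Random(0)
# With r = b = 1 the limiting proportion is Beta(1, 1), i.e. uniform,
# but the martingale property keeps the mean at r/(r+b) = 1/2.
props = [polya_urn(1, 1, 2000, rng) for _ in range(500)]
mean_prop = sum(props) / len(props)
```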
This family of models was rediscovered when a closely related model was proposed
by Barabasi and Albert [8] in 1999 to explain “scale-free” degree distributions in the
world wide web graph and other networks. This sparked a greatly renewed interest in
preferential attachment models. In the model of Barabasi and Albert, a constant m
number of edges are added to the graph at each step, each with one endpoint in the
newly added vertex and the other to a vertex chosen with probability proportional
to its degree. They were able to demonstrate experimentally that this model
generates a power-law degree distribution with exponent 3, which was later proven
rigorously by Bollobas et al. [17]. Many different generalizations of the Barabasi-Albert model have been
proposed. The Barabasi-Albert model is an example of linear preferential attachment
because edges are attached with probability linear in the vertex degree. Krapivsky
and Redner [62] have analyzed a model in which a different exponent on vertex degree
can be chosen. These models are known as super- or sub-linear attachment depending
on the exponent. Additionally, Krapivsky and Redner [62] discuss a
model in which the probability of attachment to a vertex is proportional to the degree
plus some constant α, a “hybrid” between pure preferential attachment and uniform
attachment. Another class of models [29] allows for the rate at which new edges are
added at each step to increase over time.
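The Barabasi-Albert rule can be sketched in a few lines. This is a minimal illustration, with multi-edges allowed as in the original formulation; the two-vertex seed edge is an illustrative choice:

```python
import random

def barabasi_albert(n: int, m: int, rng: random.Random):
    """Linear preferential attachment: each arriving vertex sends m
    edges to existing vertices chosen with probability proportional
    to their current degree."""
    edges = [(0, 1)]                      # small seed so degrees are nonzero
    degree = {0: 1, 1: 1}
    for v in range(2, n):
        targets = rng.choices(list(degree), weights=list(degree.values()), k=m)
        for t in targets:                 # with replacement: multi-edges allowed
            edges.append((v, t))
            degree[t] += 1
        degree[v] = m
    return edges, degree

edges, degree = barabasi_albert(200, 2, random.Random(1))
```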
7.1.3 Duplication/Divergence
The duplication/divergence (DD), or vertex copying, family of models was motivated
by networks whose degree distributions followed a power-law, but in which a vertex's
degree did not correlate with the order in which it was added to the network.
The idea is that when a new vertex enters the network, its edge list is copied from
the edge list of a pre-existing vertex. This can be seen to result in a power-law
vertex degree distribution similar to the preferential attachment models, because it
also results in a “rich get richer” phenomenon in vertex degrees; a vertex with higher
degree is more likely to have a neighbor “copied” and is therefore more likely to have
new edges attached. The “divergence” part of the model indicates that a new vertex
may not be an exact copy of its ancestor and may lose some of its edges. Additionally,
an edge may be placed between the new vertex and the vertex it was copied from.
The first models of this type proposed specifically for network growth were
suggested by Kleinberg et al. [59] as a model of web growth, but duplication-
divergence has gained perhaps the most traction as a model of cellular networks, where
“gene duplication events” have long been part of the literature [12, 102, 71, 7, 5]. See
Chung et al. [25] for a discussion of mathematical properties of duplication-divergence
models and Newman [78] for a brief survey.
7.2 Computing Likelihoods with Adaptive Importance Sampling
In this section, I investigate a technique for estimating likelihood functions for
network growth models using adaptive importance sampling based on the cross-entropy
method (§4.1.2). A well-known classical model of rankings, the Plackett-Luce model,
serves as the importance distribution, and I introduce a penalty term for the
cross-entropy method, inspired by the minimum description length principle [43],
that significantly improves the efficiency and robustness of the algorithm.
In general, focusing on any one particular statistic can be problematic when at-
tempting to choose an appropriate model, unless it is a sufficient statistic. Degree
distribution is just one among many possible descriptive statistics of networks that
may be used to determine “goodness of fit”. One could use betweenness-centrality,
subgraph counts (“network motifs”), conductance, expansion, chromatic number,
number of Hamiltonian cycles, etc., as criteria to find the best network model. It
is usually not clear which descriptive statistics are most important for fitting network
models outside of the context of relatively simple network models such as G(n, p) and
exponential family models (also known as p∗ models) where the sufficient statistics
are well defined. Several recent papers have made attempts at using non-sufficient
statistics for dealing with complicated network models. Notable examples include
Middendorf et al. [74], who use subgraph counts to estimate the relative suitability
of network models using machine learning (boosting) techniques, and Ratmann et al.
[83], who formalize the use of non-sufficient statistics as approximate Bayesian com-
puting, also known as likelihood-free inference, and apply the technique to network
growth models. Alternatively, one could attempt to directly estimate the likelihood
function of a model in order to assess its appropriateness. Building an approximation
to the likelihood allows one to apply the well-established model selection frameworks
commonly used in statistics and information theory, such as Bayes factors, minimum
description length (MDL), and the Akaike information criterion (AIC).
Recently there have been several different Monte Carlo based approaches to esti-
mating the likelihood of observed network data under network growth models. These
include Leskovec and Faloutsos [64], who use a Metropolis-Hastings chain to estimate
parameters for stochastic Kronecker product graph models, which I discuss in §8,
and Wiuf et al. [104], who use permutation likelihoods with a sequential importance
sampling scheme to perform inference on a variant of duplication/divergence mod-
els. Another interesting Markov chain Monte Carlo approach was taken by Bezakova
et al. [13] to investigate the preferential attachment model.
Informally, network growth models are defined by the following process. Starting
from an empty network, new vertices are added to the network sequentially. As each
vertex enters the network, edges are added connecting the new vertex to some subset
of preexisting vertices chosen according to some randomized rule. For example, in
preferential attachment a connecting vertex is chosen with probability proportional
to its degree. It is assumed that for network growth models, edges and vertices
can only arise in this manner, i.e. that new edges are always connected to the most
recently added vertex and that neither edges nor nodes may be deleted. Note that this
process implicitly produces labeled graphs, with labels assigned according to the order
in which a vertex enters the network. It will also be assumed that the networks are
simple, i.e. that each edge connects exactly one distinct, unordered pair of vertices,
that there are no multiple edges, and no self-loops.
The primary advantage of constraining network growth models to only permit
adding vertices and edges in this sequential manner is that if one is given the order in
which vertices appear, it is straightforward to compute the likelihood that an observed
network was produced under the model. The main difficulty then becomes that the
chronology of vertex generation is often unknown, uncertain, or undefined. In order to
compute the likelihood of a particular unlabeled network with no a priori information
on labelings, one should sum the likelihoods over all possible labelings. Since there
are a factorial number of possible labelings, this approach becomes infeasible for all
but the smallest networks. This naturally leads to the idea of constructing a Monte
Carlo estimator by sampling from the space of labelings.
One can build a crude Monte Carlo estimator by simply drawing labelings (permu-
tations of vertices) uniformly at random, computing the likelihood of each permuta-
tion, then returning the mean value of the computed likelihoods. The biggest problem
with this estimator is that it will in general have very high variance. Typically there
is a small subset of permutations that have high relative likelihood under the network
growth model compared to the majority of permutations. In other words, there is a
rare event set of permutations with disproportionately high weights. To get a lower
variance estimator, one could attempt to use importance sampling.
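To make the factorial blow-up and the variance problem concrete, here is a toy comparison. The score function is an illustrative stand-in for the order likelihood P(G|H, σ), not a real growth model:

```python
import itertools
import math
import random

def order_score(sigma) -> float:
    """Toy stand-in for the order likelihood P(G | H, sigma): sharply
    peaked on orderings near the identity, mimicking the rare set of
    high-likelihood labelings."""
    displacement = sum(abs(i - s) for i, s in enumerate(sigma))
    return math.exp(-displacement)

n = 6
perms = list(itertools.permutations(range(n)))
exact = sum(order_score(s) for s in perms) / len(perms)    # exact sum: n! terms

rng = random.Random(0)
draws = [order_score(rng.sample(range(n), n)) for _ in range(2000)]
crude = sum(draws) / len(draws)                            # crude Monte Carlo
```

Even at n = 6 the exact sum already requires 720 terms, and the crude estimate, while unbiased, has its variance driven by the few near-identity permutations, which is precisely the rare-event problem importance sampling addresses.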
The approach explored in this chapter is to recursively build an importance dis-
tribution using adaptive importance sampling. A key ingredient to any importance
sampling scheme is the choice of importance distribution. Adaptive importance sam-
pling iteratively attempts to find the “best fitting” importance distribution within a
restricted family of distributions. For details on the adaptive importance sampling
algorithm see §4. One common problem with adaptive importance sampling
schemes is that they tend to quickly lead to degenerate probability distributions [6].
In order to compensate for this effect, a penalty term based on ideas from the mini-
mum description length principle is introduced in §4.2.
The use of the Plackett-Luce model as an importance distribution for network
growth models is investigated in §7.2.2. In §7.3 this algorithm is tested using a
modified preferential attachment model and applied to an example protein-protein
interaction network from a publicly available database. These results are compared
to those from an implementation of annealed importance sampling (see §4.3.1).
7.2.1 Marginalizing Vertex Ordering
Given a network G and a network growth model H, one may wish to compute the
conditional probability of G given H, p(G|H), also known as the likelihood function
of G evaluated at H. The likelihood of G under growth model H with labeling σ,
P(G|H, σ), is referred to as the order likelihood. Imposing a mixture model (or prior
distribution) over the space of permutations and applying the law of total probability
gives

$$P(G \mid H) = \sum_{\sigma \in S_n} P(G \mid H, \sigma)\, P(\sigma), \qquad (7.5)$$
where Sn is the set of all permutations on n vertices, and P(σ) is the mixture or prior
distribution over Sn. In the examples considered, it will usually be assumed that one
has no prior information about the order in which vertices appear and so P(σ) = 1/n!
for all σ.
Since the order likelihood is the function whose expectation is to be estimated,
denote P(G|H, σ) ≡ h(σ). One can then write
$$P(G \mid H) = \mathbb{E}[h(\sigma)], \qquad (7.6)$$
where σ is distributed according to P(σ), and one can use this to build a Monte Carlo
estimator.
7.2.2 Plackett-Luce Model as an Importance Distribution
For importance sampling of distributions whose state space is the space of permutations,
one can make use of the first-order Plackett-Luce model of rankings. The Plackett-Luce
model is a well-known distribution for rank data, originally developed primarily in
the psychometric literature. Commonly referred to as a vase or urn model, it is well
known to generate draws from the stationary distribution of the weighted random-
to-top Markov chain, also known as the “Tsetlin Library”. For more information on
Plackett-Luce and other models of rank see Marden [72] and Diaconis [26].
In the Plackett-Luce model on n labels, each vertex j is assigned a weight
$w_j \in \mathbb{R}_+$. The procedure to draw a sample from Plackett-Luce is essentially “draw n
labels without replacement with probability proportional to weights”. In the “urn”
interpretation, picture an urn filled with balls of different sizes, and suppose one takes
one ball out of the urn at a time until the urn is empty. If the probability that one
chooses a given ball over another is proportional to its volume, then this is precisely
the first order Plackett-Luce model. The generation procedure is stated formally in
Algorithm 15.
Algorithm 15 Random labeling from Plackett-Luce model

  Let U ← V be the set of unpicked labels.
  Set σ ← ∅.
  for i = 1 to |V| do
    Pick a label j from U with probability $w_j / \sum_{u \in U} w_u$.
    Add j to the end of σ.
    Remove j from U.
  end for
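A direct transcription of Algorithm 15 (the function name is mine):

```python
import random

def plackett_luce_sample(weights, rng: random.Random):
    """Algorithm 15: repeatedly pick an unpicked label j with
    probability w_j over the total weight of unpicked labels, append
    it to sigma, and remove it from the pool."""
    unpicked = list(range(len(weights)))
    sigma = []
    while unpicked:
        j = rng.choices(unpicked, weights=[weights[u] for u in unpicked], k=1)[0]
        sigma.append(j)
        unpicked.remove(j)
    return sigma

rng = random.Random(42)
# Label 0 has weight 5 out of a total 8, so it should come first
# roughly 5/8 of the time.
first = sum(plackett_luce_sample([5.0, 1.0, 1.0, 1.0], rng)[0] == 0
            for _ in range(2000)) / 2000
```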
One of the most attractive qualities of the Plackett-Luce model is that its likelihood
function is very simple to compute. Let θi = log wi. Then the log-likelihood of
weights θ given a sample ordering σ is given by

$$l(\theta \mid \sigma) = \sum_{i=1}^{n} \theta_{\sigma(i)} - \sum_{i=1}^{n} \log\!\left(\sum_{j=i}^{n} \exp(\theta_{\sigma(j)})\right). \qquad (7.7)$$

The log-likelihood of a collection of samples σ1, . . . , σN is then

$$l\left(\theta \mid (\sigma_i)_{1 \le i \le N}\right) = \sum_{i=1}^{N} l(\theta \mid \sigma_i). \qquad (7.8)$$
Another attractive feature of Plackett-Luce is that under mild assumptions, the
log-likelihood function (equation (7.8)) is strictly concave, and one can maximize it
numerically using an efficient iterative procedure known as a minorization-maximization
algorithm (a generalization of expectation-maximization) [49]. Since (4.9) is a linear
combination of the concave terms of (7.8) with positive weights, (4.9) is also concave
under the same regularity conditions and one can solve it using the same algorithm.
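Equation (7.7) can be evaluated stably with a log-sum-exp over the remaining items at each position. A minimal sketch, with the function name my own:

```python
import math

def pl_loglik(theta, sigma) -> float:
    """Plackett-Luce log-likelihood (7.7) of log-weights theta for one
    ordering sigma, computed with a stable log-sum-exp at each step."""
    n = len(sigma)
    ll = 0.0
    for i in range(n):
        tail = [theta[sigma[j]] for j in range(i, n)]   # items not yet placed
        mx = max(tail)
        ll += theta[sigma[i]] - (mx + math.log(sum(math.exp(t - mx) for t in tail)))
    return ll
```

With equal weights every ordering has likelihood 1/n!, so pl_loglik returns −log n!; the collection log-likelihood (7.8) is simply the sum of per-sample terms.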
7.2.3 Choice of Description Length Function
If one chooses the description length L(f(·; vt)) to be the Shannon entropy of the
distribution f(·; vt), one can estimate L(f(·; vt)) with crude Monte Carlo for a
particular set of parameters. One can then either attempt to maximize the adjusted
minimum description length via a convex optimization-type algorithm, or employ a
simpler heuristic. In practice, I find that the latter approach works well. One simple
heuristic is to perform a 1-dimensional minimization of L(f(·; vt)) by raising each
weight of v to a power α > 0, searching for the α that minimizes (4.9). If α < 1,
this raises the entropy, making the distribution less sharply concentrated, and if α > 1 it lowers the
entropy, making it more sharply concentrated.
For the second (Bayesian) interpretation of L(f(·; vt)), in practice I choose a prior
on the weights to be iid log-normal with mean 0 and variance σ2. I then optimize
(4.9) using standard numerical optimization techniques. I find that if I also allow
the variance σ2 itself to be a log-normal random variable, overall performance is
improved.
In addition to the minimum description length correction, in the adaptive impor-
tance sampler I take the elite sample to be the top k weighted samples across all vi for
i < t, rather than just for the last batch of samples as in the standard cross-entropy
method. This means that one can update the importance distribution more frequently
without wasting too many samples on poor-quality importance distributions, which
allows one to maintain more sample diversity. To prevent the elite sample size from
becoming too large in terms of memory storage, one can set a limit on the number of
previous iterations to store.
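One simple way to realize this cross-iteration elite set is a bounded min-heap of (score, sample) pairs. The helper name and toy scores below are mine:

```python
import heapq

def update_elite(elite, new_samples, k):
    """Keep the k highest-scoring (score, sample) pairs seen so far,
    across all iterations, using a bounded min-heap."""
    for item in new_samples:
        if len(elite) < k:
            heapq.heappush(elite, item)
        else:
            heapq.heappushpop(elite, item)   # evict the current minimum
    return elite

elite = []
update_elite(elite, [(0.1, "a"), (0.7, "b")], 3)
update_elite(elite, [(0.4, "c"), (0.05, "d"), (0.9, "e")], 3)
# elite now holds the three highest-scoring samples seen so far
```

Bounding the heap at k pairs also caps memory, in the same spirit as limiting the number of stored iterations.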
7.3 Examples
7.3.1 Modified Preferential Attachment Model
As an example of a complicated probability distribution over permutations, impor-
tance sampling techniques are implemented using a modified undirected preferen-
tial attachment model. As noted above, the defining characteristic of preferential
attachment models is the concept that the “rich get richer”. In the most basic
versions, such as the Yule model and Barabasi-Albert model, as a new vertex is
added to the network, a fixed number of partners are chosen with probability pro-
portional to the degree (allowing multiple edges). The version of preferential attach-
ment used here is slightly different, and is defined by the following network growth
rule: when a new vertex enters the network at step j, choose the number of edges
N_j ~ Bin(j − 1, β) as a Binomial random variable, where β is a “network density”
parameter. Then choose vertex j's N_j partners with replacement, with the probability
of choosing vertex i proportional to

$$(1 - \alpha)\,\frac{\deg(i)}{\sum_{l < j} \deg(l)} + \alpha.$$

The term α essentially smoothes the distribution toward “uniform attachment”; that is,
if α = 1, then the probability of any particular pair occurring is equally likely and
independent, and this is essentially the same model as the Erdos-Renyi G(n, p) described
in the introduction. If instead α = 0, this corresponds to the linear preferential
attachment model, in which the probability of adding an edge is strictly proportional
to the degree.
Note that since multiple edges are not allowed, if a pair occurs more than once
the extras are discarded. This means that the number of edges added per step may
actually be less than Nj (and this is why the α = 1 model is not exactly equivalent
to G(n, p)). The reason that I sample with replacement is to speed up the likelihood
computations at each step and maintain edge independence. Since I am dealing with
relatively sparse graphs, multiple edges occur rarely.
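A sketch of this growth rule follows. The parameter values are illustrative, and steps where no edges exist yet fall back to uniform attachment so that the sampling weights are well defined:

```python
import random

def modified_pa(n: int, beta: float, alpha: float, rng: random.Random):
    """Modified preferential attachment: vertex j draws a Binomial
    number of partners with replacement, each chosen with weight
    (1 - alpha) * deg(i) / total_degree + alpha; duplicate pairs are
    discarded so the graph stays simple."""
    edges = set()
    degree = [0] * n
    for j in range(1, n):                      # j existing vertices so far
        n_j = sum(rng.random() < beta for _ in range(j))   # Binomial(j, beta)
        total = sum(degree[:j])
        if total == 0:
            probs = [1.0] * j                  # no edges yet: uniform fallback
        else:
            probs = [(1 - alpha) * degree[i] / total + alpha for i in range(j)]
        for _ in range(n_j):
            i = rng.choices(range(j), weights=probs, k=1)[0]
            if (i, j) not in edges:            # discard repeated pairs
                edges.add((i, j))
                degree[i] += 1
                degree[j] += 1
    return edges, degree

edges, degree = modified_pa(100, 0.05, 0.2, random.Random(3))
```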
Simulations indicate that the modified preferential attachment model produces a
power-law degree distribution, and there is also some theoretical justification. The
authors of Sheridan et al. [90] present a “Poisson growth model” similar to the one
presented here. They were able to show that the degree distribution is asymptotically
power-law, with exponent determined by the model parameters. The main difference
between their model and the one used here is that here the expected number
of edges added at each step grows linearly with the number of vertices.
7.3.2 Adaptive Importance Sampling
For each run of the adaptive importance sampler, I take N = 20 samples per iteration,
with an initial “elite” sample size of 2. As noted in Section 7.2.3, the elite
sample is taken over all previous samples, not just the previous batch as in the standard
cross-entropy method. The elite sample size changes dynamically according to the
performance of the algorithm; if there is no improvement, I increase it by 1.
For the choice of penalty parameter, I use the heuristic univariate minimization over
the α parameter described in Section 7.2.3. Specifically, I evaluate 20 values of α
evenly spaced in the range 0 to 1. For each α, I raise each parameter in vt to the
power of α, compute a Monte Carlo estimate of the model entropy, compute the model
log-likelihood, and add these terms together to get an MDL score. I then take the α
that yields the best value.
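The tempering step of this heuristic, raising the weights to a power α, moves the entropy monotonically, which is what makes the one-dimensional search sensible. A small illustration with made-up weights:

```python
import math

def entropy(ws) -> float:
    """Shannon entropy of the categorical distribution proportional to ws."""
    z = sum(ws)
    return -sum((w / z) * math.log(w / z) for w in ws if w > 0)

weights = [8.0, 4.0, 2.0, 1.0, 1.0]          # illustrative parameter vector
grid = [k / 20 for k in range(1, 21)]        # 20 values of alpha in (0, 1]
entropies = [entropy([w ** a for w in weights]) for a in grid]
# alpha < 1 flattens the weights (higher entropy); alpha = 1 recovers
# the original distribution.  The full heuristic would add a Monte Carlo
# entropy estimate to the log-likelihood and keep the best-scoring alpha.
```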
7.3.3 Annealed Importance Sampling
For the sake of comparison I implemented a version of annealed importance sampling
(§4.3.1). For each of the annealed importance sampling simulations, I use 1000
transition kernels, and take 6 steps per kernel. Starting with the uniform distribution
π(σ) = 1/n! and knowing the target distribution up to a normalizing constant,
q(σ) = h(σ) ∝ g∗(σ), one wants a sequence of transition kernels T1, T2, . . . , T1000 such
that Ti has stationary distribution

$$q_i \propto \pi(\sigma)^{(1000-i)/1000}\, q(\sigma)^{i/1000}.$$
Since each qi is known up to a normalizing constant, define K to be
the random-to-random walk on the space of permutations, which consists of picking
a label at random and moving it to another random position. This is equivalent to a
card shuffle where one picks a card at random and places it back at a random place in
the deck. Using K, one can simply take each Ti to be K followed by a Metropolis-Hastings
accept/reject step with respect to qi.
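The whole scheme fits in a few lines. The sketch below runs one annealed importance sampling particle with a toy target on the permutation space; the target and parameters are illustrative, and since π is uniform only q enters the incremental weights:

```python
import math
import random

def ais_particle(log_q, n, levels, steps, rng: random.Random) -> float:
    """One annealed importance sampling particle on permutations of n:
    start from the uniform distribution, anneal through q_i ∝ q^{i/levels},
    and apply `steps` random-to-random moves with a Metropolis-Hastings
    correction at each level.  Averaging exp(log weight) over particles
    estimates E_pi[exp(log_q)]."""
    sigma = rng.sample(range(n), n)             # exact draw from uniform pi
    log_w = 0.0
    for i in range(1, levels + 1):
        beta = i / levels
        log_w += (1.0 / levels) * log_q(sigma)  # incremental AIS weight
        for _ in range(steps):
            prop = sigma[:]
            card = prop.pop(rng.randrange(n))   # random-to-random move
            prop.insert(rng.randrange(n), card)
            if math.log(rng.random()) < beta * (log_q(prop) - log_q(sigma)):
                sigma = prop
    return log_w

rng = random.Random(0)
log_q = lambda s: 0.1 * s[0]                    # mild toy target on S_4
est = sum(math.exp(ais_particle(log_q, 4, 30, 2, rng)) for _ in range(100)) / 100
```

For this mild target the exact value of E_pi[exp(log_q)] is the average of exp(0.1 k) over k = 0, ..., 3, about 1.169, and the particle average lands close to it.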
7.3.4 Computational Effort
In general, I set up the simulations so that the adaptive importance sampler and
the annealed importance sampler would make a similar number of objective function
calls. However, it is possible for the annealed importance sampler to use a Markov
chain such as random transpositions that only permits local moves. In this case one
can update the objective function value in linear time. The drawback to this approach
is that random transpositions mixes slowly, so overall computation time to achieve
the same degree of accuracy may not be improved. When using non-local Markov
chains, updating the objective function usually requires the same order of operations
as computing from scratch, typically quadratic in the number of vertices.
Another consideration is the additional computation time required by the iterative
minorization-maximization algorithm to find the maximum likelihood estimator for
the Plackett-Luce model. In my experience, the number of computations this step
requires is negligible compared to the rest of the algorithm. Convergence typically
occurs in fewer operations than the number of operations required to compute a single
objective function value. For example, in the 500 node example below, convergence
occurs within 25 to 100 iterations. Each iteration requires a linear number
of operations, so the total number of operations for the minorization-maximization
algorithm appears to be sub-quadratic in the number of vertices.
7.3.5 Numerical Results
For each of the simulations below, I generate a set of unlabeled networks using the
preferential attachment procedure, then attempt to compute the likelihood function under
the uniform mixture model, comparing the performance of adaptive importance sampling
with annealed importance sampling.
Figure 7.1 displays the progression of estimated log-likelihood for each technique
to estimate the likelihood of a generated 500 node network. The x-axis indicates
the number of score function calls made up to that point, while the y-axis gives
the log-likelihood. Note that annealed importance sampling starts at 0
and decreases, while adaptive importance sampling starts at the log-likelihood of the
uniform distribution and goes up, but both end up in about the same place. Once
a schedule is fixed, annealed importance sampling always uses the same number of
score function evaluations. Adaptive importance sampling, on the other hand, can
choose to stop if it hasn’t made any progress for a significant number of iterations.
Tables 7.1, 7.2, and 7.3 directly compare results of these methods on different
network data-sets chosen at random.
Table 7.1: Comparison of estimators for sparse 500 node preferential attachment
dataset from Figure 7.1

  Estimator     Log Mean Wt.   Var Log Wt.
  Crude         -3.22e3        3.8e3
  Annealed IS   -2.93e3        7.09e2
  Adaptive IS   -2.99e3        1.86e2
Figure 7.1: Example runs comparing annealed importance sampling and adaptive
importance sampling for a 500 node, sparse preferential attachment network, 20 samples
each method. The set of lines coming from the upper left corner represent runs of the
annealed importance sampling algorithm. Lines from the lower left represent runs of
the adaptive importance sampling algorithm. The horizontal line at -2.86e3 represents
the likelihood of the “true” permutation from which the data was generated. (Axes:
number of likelihood computations vs. log-likelihood.)
Table 7.2: Comparison of estimators for dataset: 5 networks, 30 nodes each, average
degree 2, 20 samples each method

  Estimator     Log Mean Wt.   Var Log Wt.
  Annealed IS   -838.7         3.68
  Adaptive IS   -851.4         7.95

Table 7.3: Comparison of estimators for dataset: 2 networks, 100 nodes each, average
degree 2

  Estimator     Log Mean Wt.   Var Log Wt.
  Annealed IS   -1.60e3        17.6
  Adaptive IS   -1.64e3        60.6
7.3.5.1 Effects of Minimum Description Length Correction
To demonstrate the effect of the minimum description length correction applied to
the adaptive importance sampler, Figure 7.2 shows the progression of two typical
simulation runs for a randomly generated dataset. In one simulation I apply the
minimum description length correction and in the other I do not. A circle denotes
the score of a sample, while an x represents the corresponding importance weight.
The solid line represents the estimated log-likelihood of the model, while the dotted
line gives the value of the “true” likelihood, i.e. the likelihood of the permutation
from which the dataset was generated. This is likely an overestimate of the likelihood
function, since it represents close to the “peak” likelihood, not the “mean” likelihood.
The two methods behave quite differently. On the bottom of Figure 7.2, the
standard cross-entropy method does quite well at producing many high scoring
samples, but is degenerate, producing importance weights that are on average 60 orders
of magnitude smaller than their respective sample scores. This method takes many
thousands of iterations to converge, and after 1000 iterations the estimated expected
log-likelihood is 40 orders of magnitude below that of the adaptive importance sampler with
the penalty function. On the top of Figure 7.2, the penalty function adaptive importance
sampler produces a diverse collection of samples, with scores almost as high as
those from the standard cross-entropy sampler, but without becoming degenerate.
This sample diversity gives the penalty function adaptive importance sampler
a much higher chance of encountering “good” solutions.
7.3.5.2 Application: Mus Musculus Protein-Protein Interaction Network
To demonstrate annealed importance sampling and adaptive importance sampling
on a real dataset, I obtained from the BioGRID [19] website a manually curated Mus
musculus (common mouse) protein-protein interaction (PPI) network. This is a large,
simple, undirected network, where an edge is present between two proteins if there is
strong experimental evidence of interaction. I took the largest connected component
of this network, which has 314 nodes and 503 interactions. A rendering of this network
can be seen below in Figure 7.3.
For annealed importance sampling, I ran 20 particles, with 1000 cooling levels and
6 Markov steps at each level. For adaptive importance sampling, I ran 20 simulations,
with N = 20 samples at each iteration, elite sample sizes adjusted dynamically.
Results are summarized in Figure 7.4 and Table 7.4 below. Both techniques converge
to almost the same value. Overall, this provides strong evidence that preferential
attachment is a better model than the Erdos-Renyi G(n, p) model (defined in the
introduction) for this dataset.
Table 7.4: Estimated log-likelihoods for Mus musculus protein-protein interaction
network

  Model            log-likelihood   sample var. log-likelihood
  Erdos-Renyi      -3.070e3         -
  PA Adaptive IS   -2.280e3         3.41e2
  PA Annealed IS   -2.276e3         6.80e2
Figure 7.2: Likelihoods and importance weights for the cross-entropy method. The top
figure shows the progression of the MDL-based Plackett-Luce model. The bottom
figure shows the progression for the cross-entropy Plackett-Luce model without the
MDL correction. Each sample permutation corresponds to one O and one X; O indicates
a sample likelihood (score function value), X points show the sample importance
weight. The solid line shows the importance sampling estimate of the log-likelihood.
The dashed line shows the log-likelihood of the “true” permutation. Dataset: 5 networks,
30 nodes each, average degree 2. (Axes: sample number vs. log-likelihood.)
Figure 7.3: Mus musculus (common mouse) PPI network.
Figure 7.4: Convergence of adaptive importance sampling and annealed importance
sampling for the Mus musculus PPI network. (Axes: number of score function
evaluations vs. log-likelihood.)
Chapter 8
Kronecker Product Graphs
An interesting family of generative models for graphs is stochastic Kronecker product
graphs [79]. Kronecker product graphs are appropriate for modeling many common
real world networks, as they can mimic important properties observed to be prevalent.
For example, they are shown to have a multinomial degree distribution, which
with properly chosen parameters can produce heavy tails. If each vertex in the
initiator graph is given a self-loop, Kronecker graphs can be made to have (small) constant
diameter. The eigenvalues of the adjacency matrix can be shown to be related to the
degree distribution and can be made heavy-tailed. As noted in Leskovec et al. [66],
Kronecker product graphs can encapsulate core-periphery type networks due to the
recursive nature of the Kronecker product graph formulation. Many social, infor-
mational, and biological networks are known to have dense cores, making stochastic
Kronecker product graph networks particularly suitable. And perhaps most importantly,
stochastic Kronecker product graphs can be fit using a linear (O(E)) time
maximum likelihood estimation algorithm called KronFit, developed
by Leskovec and Faloutsos [64].
As defined in Leskovec and Faloutsos [64], Kronecker product graphs start with
an initiator matrix $K_1 \in \mathbb{R}^{N_1 \times N_1}$. One then forms the Kronecker product graph
adjacency matrix

$$K_k = \underbrace{K_1 \otimes \cdots \otimes K_1}_{k \text{ times}} = \otimes^k K_1, \qquad (8.1)$$
CHAPTER 8. KRONECKER PRODUCT GRAPHS 92
where ⊗ is the matrix Kronecker product. See Leskovec et al. [66] for definitions and
many interesting properties. As a consequence of this construction, the Kronecker
product graph matrix $K_k$ has $N_1^k$ rows and columns. To make this into a statistical
model of graph formation, one can interpret the matrix entry $K_k(i, j)$ as the
probability of an edge between vertices i and j, then place an edge between
each pair of vertices (i, j) according to independent Bernoulli trials with parameter
$K_k(i, j)$. The resulting model is the stochastic Kronecker product graph. Note that
this formulation contains the Erdos-Renyi G(n, p) model on $N_1^k$ vertices as a special
case. A generalization of stochastic Kronecker product graphs called multiplicative
attribute graphs was recently developed [57] that considers a sequence of attribute
matrices instead of taking Kronecker powers of a single initiator matrix as in the
stochastic Kronecker product graph model.
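The construction above can be sketched with nothing but the standard library; the initiator values below are illustrative:

```python
import random

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def stochastic_kronecker(initiator, k, rng: random.Random):
    """Form the k-th Kronecker power of the initiator, interpret each
    entry as an edge probability, and flip an independent Bernoulli
    coin per unordered vertex pair."""
    p = initiator
    for _ in range(k - 1):
        p = kron(p, initiator)
    n = len(p)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p[i][j]]

k1 = [[0.9, 0.5], [0.5, 0.2]]                          # N1 = 2 initiator
edges = stochastic_kronecker(k1, 4, random.Random(0))  # 2^4 = 16 vertices
```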
In this chapter I propose an extension of these models called vertex censored
stochastic Kronecker product graphs. This extension results in better fitting models
as measured by the Bayesian information criterion. The primary novelty in my approach
is the use of sequential importance sampling to circumvent a commonly occurring
mismatch between the number of vertices in the Kronecker graph and in the real-world
network being analyzed. The importance sampling scheme allows one to consider
vertex censored network models. In addition to improving the likelihoods for pre-existing
Kronecker graph fits, vertex censoring allows one to make large improvements in
likelihood by greatly expanding the practical number of parameters and initiator matrices
that can be used.
8.1 Motivation
A primary difficulty inherent in stochastic Kronecker product graph models is the
fact that the number of vertices present in Kronecker product graphs are composite
in the sizes of the generator graphs. In the stochastic Kronecker product graph
model, one chooses a single initiator graph with N0 nodes and take k Kronecker
products, resulting in n = Nk0 vertices. Since most real-world networks don’t have a
number of vertices that are composite in small integers, this is problematic. In order
CHAPTER 8. KRONECKER PRODUCT GRAPHS 93
to compensate, the authors of Leskovec et al. [66] suggest forming a new graph by
padding the real-world graph with isolated vertices, and using likelihood estimates
for this enhanced network as a proxy for the original network. This may result in a
large number of isolated vertices. Assuming n = |V| is uniformly distributed over
N_0^{k-1} < n ≤ N_0^k, the expected fraction of padded vertices is (1 − 1/N_0)/2,
with a maximum fraction of 1 − 1/N_0 padded in the worst case. For a given k and N_0,
this worst case is attained when the real-world graph contains N_0^{k-1} + 1 vertices.
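To make the padding overhead concrete, a small hypothetical helper can compute, for a graph of n vertices and an initiator of size N_0, the smallest usable power k and the resulting fraction of padded vertices; the AS-ROUTEVIEWS network size used later in the chapter serves as the example:

```python
def padded_fraction(n, N0):
    """Smallest k with N0**k >= n, and the fraction of padding vertices added."""
    k = 1
    while N0 ** k < n:        # integer arithmetic avoids floating-point log issues
        k += 1
    total = N0 ** k
    return k, (total - n) / total

# For n = 6474 (the AS-ROUTEVIEWS network): a 2x2 initiator pads up to
# 2**13 = 8192 vertices (about 21% padding), while a 5x5 initiator pads up to
# 5**6 = 15625 (about 59% padding), illustrating why small initiators are favored.
```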
Padding with isolated vertices can create inference problems in several ways. First,
a large number of padded isolated vertices skews the degree distribution and other
properties with respect to the original graph. This may not be as problematic for
graphs with power-law or heavy-tailed degree distributions, in which one would expect
most vertices to be incident to very few edges. However, there are many important classes of
networks that are not power-law, such as regular graphs, and padding these networks
with a large number of isolated vertices may distort much of the observed structure.
Second, adding a large number of padded vertices means that one is computing the
likelihood of a potentially much larger graph. This has the effect of reducing the maximum
likelihood estimate for the graph. While many of the interesting and important
statistical features of the original graph may still be reproduced, formulating models
that have high maximum likelihood estimates is important in Bayesian statistics and
model selection. Further, since severe padding is more likely with larger initiator
matrices, smaller initiator matrices are often preferred in the stochastic Kronecker
product graph model, leaving out potentially useful larger initiators. The fact that
N_1 = 2 is often the best or close to the best choice for initiator matrix dimension [66]
makes intuitive sense, since the number of vertices in an arbitrary real-world network is
more likely to be close to a power of 2 than to a power of any larger initiator size.
The authors of Leskovec et al. [66] note that this class of models belongs to the
curved exponential family of models, and that it is therefore appropriate to use the
Bayesian information criterion (BIC) for purposes of model selection. BIC balances
the complexity of the model (measured by number of model parameters) against
goodness of fit (measured by likelihood). However, due to the vertex padding issue,
the parameter complexity of the model typically contributes an insignificant amount
to the BIC, and large complicated graphs are often found to fit best with 2-by-2 or
3-by-3 initiator matrices. For many common statistical models, adding more parameters
results in a better fit, and this is the behavior one would intuitively desire for
stochastic Kronecker product graphs.
Finally, it may be desirable to model "missing data" problems where one does not
have access to the full network. This may take the form of "link prediction" (missing
edges) or cases where one does not have access to the full vertex set. Vertex censored
models can be naturally adapted to address these types of problems. A similar
setting may arise if it is desirable to predict the future growth of a network given a
core "seed" graph.
8.2 Stochastic Kronecker Product Graph Model
In this section I briefly review the stochastic Kronecker product graph model and the
KronFit algorithm used to compute maximum likelihood estimates.
8.2.1 Likelihood under the Stochastic Kronecker Product Graph Model
In general, the likelihood of a graph G(V, E) under K_k can be computed as
$$P(G \mid K_k) = \prod_{(u,v) \in E} K_k(u,v) \prod_{(u,v) \notin E} \bigl(1 - K_k(u,v)\bigr). \tag{8.2}$$
Instead of the probability itself, it is usually easier to work in terms of the log-likelihood:
$$\begin{aligned}
l(K_k \mid G) &= \sum_{(u,v)\in E} \log K_k(u,v) + \sum_{(u,v)\notin E} \log\bigl(1-K_k(u,v)\bigr) &\text{(8.3)}\\
&= \sum_{u,v\in V} \log\bigl(1-K_k(u,v)\bigr) &\text{(8.4)}\\
&\quad+ \sum_{(u,v)\in E} \bigl[\log K_k(u,v) - \log\bigl(1-K_k(u,v)\bigr)\bigr] &\text{(8.5)}\\
&= l_e(K_k \mid G) + l_E(K_k \mid G), &\text{(8.6)}
\end{aligned}$$
where le denotes the log-likelihood of the empty graph, and lE is the edge correction
to the log-likelihood.
Due to the recursive structure of Kronecker products, with each vertex u one can
associate a vector
$$u = (u_1, u_2, \ldots, u_k), \tag{8.7}$$
where $u_i \in \{1, \ldots, N_1\}$, and the probability of an edge between two vertices u and v is
$$K_k(u,v) = \prod_{i=1}^{k} K_1(u_i, v_i). \tag{8.8}$$
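The per-pair probability (8.8) can be read directly off the base-N_1 digits of the two vertex indices, so K_k never needs to be materialized. A minimal sketch, using 0-based digits rather than the 1-based u_i above:

```python
def edge_prob(K1, u, v, k):
    """K_k(u, v) as the product of initiator entries over the k digit pairs (eq. 8.8)."""
    N1 = len(K1)
    p = 1.0
    for _ in range(k):
        p *= K1[u % N1][v % N1]   # least-significant base-N1 digits of u and v
        u //= N1
        v //= N1
    return p
```

Computing one entry this way costs O(k) = O(log n) time, which is what makes the edge-correction approach feasible.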
The log-likelihood of the empty graph is
$$l_e(K_k) = \sum_{u,v} \log\bigl(1 - K_k(u,v)\bigr). \tag{8.9}$$
Leskovec et al. [66] use the Taylor series approximation to (8.9),
$$l_e(K_k) = -\left(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} K_1(i,j)\right)^{k} - \frac{1}{2}\left(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} K_1(i,j)^2\right)^{k} - \ldots \tag{8.10}$$
which implies that l_e(K_k) can be computed to arbitrary precision in either constant
or logarithmic time, assuming reasonable bounds on the entries of K_1. To compute the
log-likelihood of the full graph, one can then apply the edge correction to the empty
graph, adding the log-likelihood contribution of each edge in G and removing the
corresponding "no-edge" contribution. This means that the total log-likelihood for G
under K_k can be approximated in O(m) time.
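A sketch of this decomposition, combining a truncated version of the series (8.10) for the empty-graph term with the per-edge correction of (8.6); the helper names and the truncation depth are illustrative assumptions, not from the source:

```python
import math

def edge_prob(K1, u, v, k):
    """K_k(u, v) from the base-N1 digit pairs of u and v (eq. 8.8)."""
    N1 = len(K1)
    p = 1.0
    for _ in range(k):
        p *= K1[u % N1][v % N1]
        u //= N1
        v //= N1
    return p

def empty_loglik(K1, k, terms=100):
    """Truncated Taylor series (8.10) for l_e(K_k) = sum over u,v of log(1 - K_k(u,v))."""
    le = 0.0
    for t in range(1, terms + 1):
        s = sum(x ** t for row in K1 for x in row)
        le -= (s ** k) / t
    return le

def loglik(K1, k, edges):
    """Full log-likelihood (8.6): empty-graph term plus an O(m) edge correction."""
    ll = empty_loglik(K1, k)
    for u, v in edges:
        p = edge_prob(K1, u, v, k)
        ll += math.log(p) - math.log(1.0 - p)   # swap a 'no-edge' term for an edge term
    return ll
```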
8.2.2 Sampling Permutations
In the previous section, it was assumed that the "true" mapping between vertices in G
and vertices in K_k is known, i.e. that G is labeled. However, G is typically unlabeled,
and it is necessary to account for this fact by summing over all possible permutations.
One can state the conditional probability of a mapping π as
$$P(\pi \mid G, K_k) = \frac{P(G \mid \pi, K_k)\, P(\pi \mid K_k)}{P(G \mid K_k)}. \tag{8.11}$$
Finding π that maximizes this quantity is similar to the linear ordering problem (LOP)
[77, 15], which is NP-hard. To sample from this distribution, Leskovec and Faloutsos
[64] suggest a Metropolis-Hastings Markov chain Monte Carlo algorithm which has
stationary distribution (8.11). As a base chain they use the “vertex switch” Markov
chain, also known as the “random transpositions” shuffle. Other Markov chains on
permutations are possible; see Aldous and Diaconis [1] for background.
The advantage of using a “local” Markov chain such as random transpositions is
that it allows one to use local updates to the likelihood.
Under the stochastic Kronecker product graph model, the empty-graph log-likelihood is
unchanged by a transposition, so the likelihood update for a proposed swap can be
computed in time O(max degree) [66].
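The Metropolis-Hastings chain over vertex orderings can be sketched as follows. For clarity this recomputes the full log-target at every step rather than performing the O(max degree) local update used in practice, and `log_target` stands in for log P(G | π, K_k):

```python
import math
import random

def mh_transposition_step(perm, log_target, rng):
    """One Metropolis-Hastings step on the 'random transpositions' base chain.

    The proposal (swap two uniformly chosen positions) is symmetric, so the
    move is accepted with probability min(1, target(new) / target(old))."""
    i, j = rng.randrange(len(perm)), rng.randrange(len(perm))
    cur = log_target(perm)
    perm[i], perm[j] = perm[j], perm[i]
    if math.log(rng.random()) >= log_target(perm) - cur:
        perm[i], perm[j] = perm[j], perm[i]   # reject: undo the transposition
    return perm
```

Running many such steps yields (approximate) samples from the posterior (8.11) over mappings π.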
8.2.3 Computing Gradients
Computation of gradients follows the same process as the log-likelihood computation. To
compute ∇_i l(K_k), I first compute the empty-graph contribution and apply an edge
correction. The empty-graph gradient is approximated by differentiating the Taylor
series approximation (8.10) of the log-likelihood with respect to parameter i. This
process is repeated for each of the N_1^2 parameters.
8.3 Vertex Censored Stochastic Kronecker Product Graphs
In this section, I specify the vertex censored stochastic Kronecker product graph
model and present an algorithm to compute maximum likelihood estimates. The primary
motivation for this extension is to compensate for the likelihood distortions introduced
by vertex padding in stochastic Kronecker product graphs. Vertex censored models
can be thought of as generating networks by some well-defined additive process, then
"dropping" a subset of vertices. Equivalently, one can assume the dropped subset is
simply hidden, or "censored". Vertex censored network models allow one to make
more valid comparisons across stochastic Kronecker product graphs with different
sized initiator matrices. Censored data is a common issue in the statistical literature;
a previous application of censoring to network data in an unrelated problem can
be found in Thomas and Blitzstein [97]. A key component of my algorithm uses
sequential importance sampling to estimate the empty-graph log-likelihood.
The basic vertex censored network model can be stated simply. Suppose that for
the graph G(V, E), n = |V| < N_1^k. Instead of "padding" G with isolated vertices, I
define an injective mapping $\phi : V \to V_K$. One can then compute the log-likelihood
under the censored model as
$$l^{\phi}(K_k) = \sum_{u,v \in V} \bigl[\, \mathbb{I}_{(u,v)\in E}\, \log K_k(\phi(u), \phi(v)) + \mathbb{I}_{(u,v)\notin E}\, \log\bigl(1 - K_k(\phi(u), \phi(v))\bigr) \bigr], \tag{8.12}$$
where the right-hand side decomposes into the empty-graph log-likelihood $l_e^{\phi}(K_k)$ and edge
correction $l_E^{\phi}(K_k)$.
8.3.1 Importance Sampling for Likelihoods
Note, however, that under the decomposition (8.12) the recursive structure that
allows one to easily compute $l_e^{\phi}$ using the Taylor series expansion (8.10) is lost. To
compensate for this, I first represent $l_e$ as an expectation,
$$l_e(K_k) = n^2\, \mathbb{E}\bigl[\log\bigl(1 - K_k(u,v)\bigr)\bigr], \tag{8.13}$$
where u and v are chosen uniformly at random. I will show how to estimate (8.13)
using importance sampling. In the current context, the first-order Taylor series
approximation to (8.13) suggests that using
$$g(u,v) = \frac{K_k(u,v)}{\sum_{u,v} K_k(u,v)} \tag{8.14}$$
will be close to optimal. One can simulate from g sequentially, drawing each element
pair $(u_i, v_i)$ independently according to
$$g_i(u_i, v_i) = \frac{K_1(u_i, v_i)}{\sum_{u_i, v_i} K_1(u_i, v_i)}, \tag{8.15}$$
then combining these as $g(u,v) = \prod_{i=1}^{k} g_i(u_i, v_i)$.
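A sketch of this sequential sampler: each digit pair is drawn in proportion to the corresponding initiator entry, and since the importance weight is f/g with f the uniform density 1/n², the n² factor in (8.13) cancels, so the average of log(1 − K_k)/g estimates l_e directly. Function and variable names are illustrative:

```python
import math
import random

def sis_empty_loglik(K1, k, samples, rng):
    """Sequential importance sampling estimate of l_e(K_k), eqs. (8.13)-(8.15)."""
    N1 = len(K1)
    pairs = [(a, b) for a in range(N1) for b in range(N1)]
    weights = [K1[a][b] for a, b in pairs]
    s1 = sum(weights)                       # normalizer of each digit distribution g_i
    total = 0.0
    for _ in range(samples):
        p = 1.0                             # builds K_k(u, v) digit pair by digit pair
        for _ in range(k):
            a, b = rng.choices(pairs, weights=weights)[0]
            p *= K1[a][b]
        g = p / s1 ** k                     # g(u, v) = K_k(u, v) / sum_{u,v} K_k(u, v)
        total += math.log(1.0 - p) / g      # n^2 * (f/g) * log(1 - K_k), n^2 cancelled
    return total / samples
```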
One can show that this choice of importance distribution has bounded relative
error in the uncensored case.

Theorem 2. Let random variables $X = \log(1 - K_k(u,v))$, where u, v are chosen
uniformly at random, and $Z = \log(1 - K_k(\tilde u, \tilde v))\, \frac{f(\tilde u, \tilde v)}{g(\tilde u, \tilde v)}$, where $\tilde u, \tilde v$ are chosen
according to (8.15) and f denotes the uniform density. Then
$$\frac{\mathbb{E} Z^2}{[\mathbb{E} X]^2} \le \mathbb{E}\left[\frac{1}{1 - K_k(\tilde u, \tilde v)}\right]. \tag{8.16}$$
If $K_k(u,v) \le 1 - \varepsilon$ for some $\varepsilon > 0$ and all u, v, this implies that Z has bounded relative
error.
Proof. From the Taylor series expansion (8.10),
$$|\mathbb{E} X| = \frac{1}{n^2}\, |l_e(K_k)| \ge \frac{1}{n^2} \left(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} K_1(i,j)\right)^{k} = \frac{m}{n^2}, \tag{8.17}$$
where $m = \sum_{u,v} K_k(u,v)$ is the expected number of edges, so $[\mathbb{E} X]^2 \ge m^2/n^4$. One can write Z as
$$Z = \frac{m}{n^2} \cdot \frac{\log(1 - K_k(\tilde u, \tilde v))}{K_k(\tilde u, \tilde v)} \tag{8.18}$$
$$\;\; = -\frac{m}{n^2} \left(1 + \frac{1}{2} K_k(\tilde u, \tilde v) + \frac{1}{3} K_k(\tilde u, \tilde v)^2 + \ldots \right). \tag{8.19}$$
Multiplying out the Taylor series expansion for $Z^2$ gives
$$\mathbb{E} Z^2 = \frac{m^2}{n^4}\, \mathbb{E}\left[1 + K_k(\tilde u, \tilde v) + \frac{11}{12} K_k(\tilde u, \tilde v)^2 + \frac{5}{6} K_k(\tilde u, \tilde v)^3 + \ldots \right] \tag{8.20}$$
$$\;\; \le \frac{m^2}{n^4}\, \mathbb{E}\left[1 + K_k(\tilde u, \tilde v) + K_k(\tilde u, \tilde v)^2 + \ldots \right] \tag{8.21}$$
$$\;\; = \frac{m^2}{n^4}\, \mathbb{E}\left[\frac{1}{1 - K_k(\tilde u, \tilde v)}\right]. \tag{8.22}$$
Combining the two bounds gives the result.
While this bound is not practically useful in the uncensored case, due to the presence
of the Taylor series expansion, it provides some intuitive justification for the
use of the importance sampler in the more complicated censored cases. Computing
the gradient of the log-likelihood via importance sampling can follow the same
sequential importance distribution, though it is not as straightforward to choose a
near-optimal importance distribution. Empirical results show that using the same
importance distribution for the gradient as for the log-likelihood gives better results
than crude Monte Carlo.
8.3.2 Choosing Censored Vertices
Using the same form as (8.13), define the censored version as
$$l_e^{\phi}(K_k) = N_1^{2k}\, \mathbb{E}\bigl[\, \mathbb{I}_{\phi^{-1}(u),\, \phi^{-1}(v) \in V}\, \log\bigl(1 - K_k(u,v)\bigr) \bigr], \tag{8.23}$$
where u and v are now drawn uniformly at random from $V_K$.
As noted in Leskovec et al. [66], uniformly dropping vertices alters the expected degree
distribution of the stochastic Kronecker product graph. However, there is no real
reason to assume that the censored data is chosen uniformly; indeed, in many contexts
it may make more sense to assume the probability of "seeing" a vertex is a function
of the number of edges incident to it. Choosing the probability of censoring a vertex
as inversely proportional to its degree would, for example, preserve a power-law degree
distribution. In general, one may create models with arbitrary rules for censoring
vertices, such as dropping the vertices of smallest degree. This last approach is
similar to the "isolated vertex padding" approach of Leskovec and Faloutsos [64],
but has the advantage of producing a higher MLE, because it censors instead of
padding. This is the approach I take in the numerical examples in §8.4.
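Dropping the lowest-degree vertices can be implemented directly, since the expected degree of a Kronecker vertex factorizes over its digits into a product of initiator row sums. The following is a sketch under the assumption of 0-based digit indexing, not the SNAP implementation:

```python
def expected_degree(K1, u, k):
    """Expected degree of vertex u in K_k: the product over digits of the
    corresponding initiator row sums, since sum_v K_k(u, v) factorizes."""
    N1 = len(K1)
    row_sums = [sum(row) for row in K1]
    d = 1.0
    for _ in range(k):
        d *= row_sums[u % N1]
        u //= N1
    return d

def censor_lowest_degree(K1, k, n):
    """Map the n observed vertices onto the n Kronecker vertices of highest
    expected degree; the remaining vertices are censored (hidden)."""
    total = len(K1) ** k
    ranked = sorted(range(total), key=lambda u: -expected_degree(K1, u, k))
    return set(ranked[:n])
```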
8.3.3 Sampling Permutations
Sampling of permutations proceeds as in the stochastic Kronecker product graph
model. One can run the random transpositions Metropolis-Hastings algorithm over φ
with the same effect. One problem with this formulation is that swapping a "visible"
vertex with a "censored" vertex requires "dense" operations to compute the likelihood
change $\delta_{\phi,\phi'}$, as this move changes the empty-graph log-likelihood $l_e^{\phi}(K_k)$. To compute
this update approximately, one can use an importance sampling scheme as above,
estimating only the empty-graph likelihood contributions of the individual vertices
switched. This can be done quickly and accurately; however, it introduces some
"stochastic drift" over time due to the accumulation of many Monte Carlo estimates
with independent errors. This necessitates occasionally re-approximating the full likelihood.
An alternative is, for most steps, to run a Markov chain that only switches mapped
vertices of V , i.e. permuting φ instead of permuting the whole space. Since the empty-
graph log-likelihoods haven’t changed, this only requires a constant-time update as
in the stochastic Kronecker product graph model. One could then more infrequently
perform a fuller step that switches a mapped vertex with an unmapped vertex. One
could also simply make the assumption that all of the unmapped vertices belong to
a known fixed subset, such as all of the lowest expected degree vertices in Kk, or by
matching the expected degree of vertices in Kk to vertex degrees in G.
8.3.4 Multiplicative Attribute Graphs
Multiplicative attribute graphs [57] are a generalization of stochastic Kronecker
product graphs that use a set of attribute matrices $\Theta_1, \ldots, \Theta_k$ to form a Bernoulli edge
sampling matrix
$$M_k = \bigotimes_{i=1}^{k} \Theta_i. \tag{8.24}$$
They are fit in the same way as stochastic Kronecker product graphs, using empty-graph
likelihoods and edge corrections for both the log-likelihood and the gradient.
The sequential importance sampling scheme for computing empty-graph log-likelihoods
(§8.3.1) operates in the same way, except that instead of choosing the components
of the importance distribution identically, each component is chosen according to its
individual attribute matrix. Vertex censoring is particularly useful in this context, as
the number of attribute matrices most appropriate for a given real-world graph seems
unlikely to coincide with the choice that minimizes the number of padded vertices.
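Under the multiplicative attribute model, the per-pair edge probability is the same digit product as (8.8), except that each position i uses its own matrix Θ_i, and the per-digit importance distribution is likewise built from Θ_i rather than a single K_1. A minimal sketch with hypothetical names:

```python
import random

def mag_edge_prob(thetas, u_attrs, v_attrs):
    """Edge probability under the multiplicative attribute graph model:
    the product over positions i of Theta_i[u_i][v_i] (eq. 8.24)."""
    p = 1.0
    for theta, a, b in zip(thetas, u_attrs, v_attrs):
        p *= theta[a][b]
    return p

def sample_digit_pair(theta, rng):
    """Draw one (u_i, v_i) pair with probability proportional to Theta_i[u_i][v_i],
    the per-matrix analogue of the importance distribution (8.15)."""
    N = len(theta)
    pairs = [(a, b) for a in range(N) for b in range(N)]
    return rng.choices(pairs, weights=[theta[a][b] for a, b in pairs])[0]
```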
8.4 Empirical Results
The primary network I examine is the AS-ROUTEVIEWS network studied in Leskovec
and Faloutsos [64]. This network has n = 6474 vertices and m = 26467 edges. To
test the importance sampling scheme, I computed l_e(K_k) for the uncensored stochastic
Kronecker product graph model using both the Taylor series approximation (8.10)
and the sequential importance sampling approximation. Figure 8.1 shows the results
for the k = 2 model using the optimal parameters. To compute the following graph,
I ran the standard uniform "crude" Monte Carlo scheme with 100 samples 100 times,
and did the same with the SIS estimator. I then plotted the histograms and compared
the results to the Taylor series approximation.
Note the extreme reduction in variance: the standard estimate from the SIS samples
is μ_SIS = −24650 ± 60, versus μ_MC = −21614 ± 17495 for the crude MC estimate,
where the "true" value from the Taylor series expansion is μ_TS = −24644. In general,
crude MC is quite poor, whereas computing the SIS estimate with something like 10n
samples gives about 5 digits of accuracy.
For gradient computations, however, using 10n samples only gives about 2 or 3
digits of accuracy. This is much better than the crude MC estimator, but at this
accuracy the optimization algorithm had a difficult time converging. More thought
needs to be put into the design of the importance sampler for gradient computations.
However, one nice feature of the vertex censored model is that one can take
the same parameters generated by the standard stochastic Kronecker product graph
model with vertex padding and simply "censor" the padded vertices. While this
does not give the maximum likelihood estimator for the vertex censored model, it does
give a lower bound on the MLE (or, equivalently, an upper bound on the negative
log-likelihood). Note the large improvements in MLE for size-4 and size-5 initiator
matrices in Figure 8.2. Again, these are upper bounds on the negative log-likelihood
for the vertex censored stochastic Kronecker product graph model. This shows that
the padded vertices have a noticeable detrimental effect, and that more model
parameters give a higher likelihood.
8.4.1 Implementation
Code is written in C++ and R, and is available upon request. The main routine to
compute maximum likelihood estimates was modified from the KronFit implementation
in the publicly available C++ library SNAP [63].
Figure 8.1: Performance of crude and SIS Monte Carlo simulations on the AS-ROUTEVIEWS graph for N_1 = 2. Histograms of the estimated log-likelihood; the top curve is the SIS MC density, the bottom the crude MC density.
Figure 8.2: Comparison of SKPG and VCSKPG models for the AS-ROUTEVIEWS graph (n = 6474, m = 26467): negative log-likelihood against initiator matrix size N_1.
Bibliography
[1] D. Aldous and P. Diaconis. Shuffling cards and stopping times. American
Mathematical Monthly, 93(5):333–348, 1986. ISSN 0002-9890.
[2] David Aldous. Approximate counting via Markov chains. Statistical Science, 8
(1):pp. 16–19, 1993. ISSN 08834237.
[3] H.C. Andersen and P. Diaconis. Hit and run as a unifying device. Journal de
la Société Française de Statistique, 148(5):5–28, 2007.
[4] C. Andrieu, N. De Freitas, A. Doucet, and M.I. Jordan. An introduction to
MCMC for machine learning. Machine learning, 50(1):5–43, 2003.
[5] Annette M. Evangelisti and Andreas Wagner. Molecular evolution in the yeast
transcriptional regulation network. Journal of Experimental Zoology Part B:
Molecular and Developmental Evolution, 302B:392–411, 2004. ISSN 1552-5015.
link.
[6] S. Asmussen and P.W. Glynn. Stochastic simulation: Algorithms and analysis.
Springer Verlag, 2007.
[7] M. Madan Babu, Nicholas M. Luscombe, L. Aravind, Mark Gerstein, and
Sarah A. Teichmann. Structure and evolution of transcriptional regulatory
networks. Current Opinion in Structural Biology, 14:283–291, Jun 2004. link.
[8] A.-L. Barabási and R. Albert. Emergence of scaling in random networks.
Science, 286(5439):509–512, 1999.
105
[9] M. Bayati, J. Kim, and A. Saberi. A sequential algorithm for generating ran-
dom graphs. Approximation, Randomization, and Combinatorial Optimization.
Algorithms and Techniques, pages 326–340, 2007.
[10] I. Beichl and F. Sullivan. The metropolis algorithm. Computing in Science &
Engineering, 2(1):65–69, 2000.
[11] R. Bellman. Adaptive control processes: a guided tour., 1961.
[12] Johannes Berg, Michael Lassig, and Andreas Wagner. Structure and evolution
of protein interaction networks: a statistical model for link dynamics and gene
duplications. BMC Evolutionary Biology, 4:51, 2004. ISSN 1471-2148. link.
[13] I. Bezakova, A. Kalai, and R. Santhanam. Graph model selection using maxi-
mum likelihood. In ICML ’06: Proceedings of the 23rd international conference
on Machine learning, pages 105–112, New York, NY, USA, 2006. ACM. ISBN
1-59593-383-2. doi: http://doi.acm.org/10.1145/1143844.1143858.
[14] J. Blitzstein and P. Diaconis. A sequential importance sampling algorithm for
generating random graphs with prescribed degrees. Internet Mathematics, 6(4):
489–522, 2011.
[15] A. Blum, G. Konjevod, R. Ravi, and S. Vempala. Semi-definite relaxations
for minimum bandwidth and other vertex-ordering problems. In Proceedings of
the thirtieth annual ACM symposium on Theory of computing, pages 100–105.
ACM, 1998. ISBN 0897919629.
[16] B. Bollobas. Random Graphs. Cambridge University Press, 2001.
[17] B. Bollobas, O. Riordan, J. Spencer, and G. Tusnady. The degree sequence of
a scale-free random graph process. Random Structures and Algorithms, 18(3):
279–290, 2001.
[18] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[19] B.J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livs-
tone, R. Oughtred, D.H. Lackner, J. Bahler, V. Wood, et al. The BioGRID
Interaction Database: 2008 update. Nucleic Acids Research, 2007.
[20] O. Cappe, A. Guillin, J.M. Marin, and C.P. Robert. Population Monte Carlo.
J. Comput. Graph. Statist, 13(4):907–929, 2004.
[21] G. Casella and R.L. Berger. Statistical Inference. Duxbury, 2011.
[22] K.C. Chang, C.Y. Chong, and T. Bar-Shalom. Joint probabilistic data associa-
tion in distributed sensor networks. IEEE Transactions on Automatic Control,
31(10):889–897, 1986.
[23] Y. Chen, P. Diaconis, S.P. Holmes, and J.S. Liu. Sequential monte carlo meth-
ods for statistical analysis of tables. Journal of the American Statistical Asso-
ciation, 100(469):109–120, 2005.
[24] Z. Chen. Bayesian filtering: From Kalman filters to particle filters, and beyond.
Unpublished manuscript, 2003.
[25] F. Chung, L. Lu, T.G. Dewey, and D.J. Galas. Duplication Models for Biological
Networks. Journal of Computational Biology, 10(5):677–687, 2003.
[26] P. Diaconis. Group Representations in Probability and Statistics. Institute of
Mathematical Statistics, 1988.
[27] P. Diaconis. The Markov chain Monte Carlo revolution. Bulletin of the American
Mathematical Society, 46(2):179–205, 2009.
[28] P. Diaconis and S. Holmes. Three examples of Monte-Carlo Markov chains: at
the interface between statistical computing, computer science, and statistical
mechanics. IMA Volumes in Mathematics and its Applications, 72:43–43, 1995.
[29] SN Dorogovtsev and JFF Mendes. Effect of the accelerating growth of commu-
nications networks on their structure. Physical Review E, 63(2):25101, 2001.
[30] R. Douc, A. Guillin, J.M. Marin, and CP Robert. Minimum variance impor-
tance sampling via population Monte Carlo. ESAIM: P&S, 11:427–447, 2007.
[31] A. Doucet and N. De Freitas. Sequential Monte Carlo methods in practice.
Springer, 2001.
[32] R. Durrett. Random Graph Dynamics. Cambridge University Press, 2006.
[33] M. Dyer and A. Frieze. A random polynomial time algorithm for approximating
the volume of convex bodies. In Proceedings of the twenty-first annual ACM
symposium on Theory of computing, pages 375–381. ACM, 1989.
[34] P. Erdős and A. Rényi. On random graphs. Publ. Math. Debrecen, 6(290), 1959.
[35] M.J. Evans and T. Swartz. Approximating integrals via Monte Carlo and de-
terministic methods. Oxford University Press, USA, 2000.
[36] T. Feder, A. Guetz, M. Mihail, and A. Saberi. A local switch markov chain on
given degree graphs with application in connectivity of peer-to-peer networks.
In 47th Annual IEEE Symposium on Foundations of Computer Science, pages
69–76. IEEE, 2006.
[37] E. Frank, M.A. Hall, G. Holmes, R. Kirkby, B. Pfahringer, and I.H. Witten.
Weka: A machine learning workbench for data mining. Data Mining and Knowl-
edge Discovery Handbook: A Complete Guide for Practitioners and Researchers,
pages 1305–1314, 2005.
[38] A. Gelman. Bayesian data analysis. CRC press, 2004.
[39] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741, 1984.
[40] C.J. Geyer and E.A. Thompson. Constrained Monte Carlo maximum likeli-
hood for dependent data. Journal of the Royal Statistical Society. Series B
(Methodological), 54(3):657–699, 1992.
[41] EN Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):
1141–1144, 1959.
[42] N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to
nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Process-
ing, IEE Proceedings F, 140(2):107–113, 1993. ISSN 0956-375X.
[43] P.D. Grunwald. The Minimum Description Length Principle. Mit Press, 2007.
[44] J.M. Hammersley and K.W. Morton. Poor man's Monte Carlo. Journal of the
Royal Statistical Society. Series B (Methodological), pages 23–38, 1954.
[45] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of sta-
tistical learning: data mining, inference and prediction. The Mathematical
Intelligencer, 27(2):83–85, 2005.
[46] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57(1):97–109, 1970. ISSN 0006-3444. URL
http://www.jstor.org/stable/2334940.
[47] S. Holmes, A. Kapelner, and P.P. Lee. An interactive Java statistical image
segmentation system: GemIdent. Journal of Statistical Software, 30:1–20, 2009.
[48] C. Hue, J.P. Le Cadre, and P. Perez. Sequential Monte Carlo methods for mul-
tiple target tracking and data fusion. IEEE Transactions on Signal Processing,
50(2):309–325, 2002.
[49] D.R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals
of Statistics, 32(1):384–406, 2004. link.
[50] S. Janson, T. Luczak, and A. Rucinski. Random graphs. John Wiley New York,
2000.
[51] A.H. Jazwinski. Stochastic processes and filtering theory, volume 63. Academic
Pr, 1970.
[52] M. Jerrum and A. Sinclair. Approximating the permanent. SIAM journal on
computing, 18:1149, 1989.
[53] M.R. Jerrum, L.G. Valiant, and V.V. Vazirani. Random generation of combi-
natorial structures from a uniform distribution. Theoretical Computer Science,
43:169–188, 1986.
[54] S.J. Julier and J.K. Uhlmann. A new extension of the kalman filter to nonlin-
ear systems. In Int. Symp. Aerospace/Defense Sensing, Simul. and Controls,
volume 3, page 26. Citeseer, 1997.
[55] R.E. Kalman. A new approach to linear filtering and prediction problems.
Journal of basic Engineering, 82(1):35–45, 1960.
[56] Z. Khan, T. Balch, and F. Dellaert. An MCMC-based particle filter for tracking
multiple interacting targets. Lecture Notes in Computer Science, pages 279–290,
2004.
[57] M. Kim and J. Leskovec. Multiplicative attribute graph model of real-world
networks. Algorithms and Models for the Web-Graph, pages 62–73, 2010.
[58] S. Kirkpatrick, C.D. Gelatt Jr, and M.P. Vecchi. Optimization by simulated
annealing. Biology and Computation: A Physicist's Choice, 1994.
[59] J.M. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
The web as a graph: measurements, models and methods. Proceedings of the
International Conference on Combinatorics and Computing, 1999.
[60] KR Koch. Gibbs sampler by sampling-importance-resampling. Journal of
Geodesy, 81(9):581–591, 2007. ISSN 0949-7714.
[61] A. Kong. A note on importance sampling using standardized weights. Technical
report, Deptartment of Statistics, University Chicago, 1992.
[62] PL Krapivsky and S. Redner. Organization of growing random networks. Phys-
ical Review E, 63(6):66123, 2001.
[63] J. Leskovec. Snap graph library, 2011. URL snap.stanford.edu.
[64] J. Leskovec and C. Faloutsos. Scalable modeling of real graphs using kronecker
multiplication. In Proceedings of the 24th international conference on Machine
learning, page 504. ACM, 2007.
[65] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution
of social networks. In Proceeding of the 14th ACM SIGKDD international con-
ference on Knowledge discovery and data mining, pages 462–470. ACM, 2008.
[66] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani.
Kronecker graphs: An approach to modeling networks. The Journal of Machine
Learning Research, 11:985–1042, 2010.
[67] J.S. Liu. Metropolized independent sampling with comparisons to rejection
sampling and importance sampling. Statistics and Computing, 6(2):113–119,
1996.
[68] J.S. Liu. Monte Carlo strategies in scientific computing. Springer Verlag, 2008.
[69] J.S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems.
Journal of the American Statistical Association, 93(443):1032–1044, 1998.
[70] D.J.C. MacKay. Information theory, inference, and learning algorithms. Cam-
bridge University Press New York, 2003.
[71] M. Madan Babu and Sarah A. Teichmann. Evolution of transcription factors
and the gene regulatory network in escherichia coli. Nucleic Acids Research, 31:
1234–1244, Feb 2003. link.
[72] J.I. Marden. Analyzing and Modeling Rank Data. Chapman & Hall/CRC, 1995.
[73] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Au-
gusta H. Teller, and Edward Teller. Equation of state calculations by fast com-
puting machines. Journal of Chemical Physics, 21(6):1087–1092, 1953. ISSN
00219606. doi: DOI:10.1063/1.1699114.
[74] M. Middendorf, E. Ziv, and C.H. Wiggins. Inferring network mechanisms:
the Drosophila Melanogaster protein interaction network. Proceedings of the
National Academy of Sciences, 102(9):3192–3197, 2005.
[75] R.M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):
125–139, 2001.
[76] R.M. Neal. Estimating Ratios of Normalizing Constants Using Linked Impor-
tance Sampling. Arxiv preprint math.ST/0511216, 2005.
[77] A. Newman. Cuts and orderings: on semidefinite relaxations for the linear
ordering problem. Approximation, Randomization, and Combinatorial Opti-
mization, pages 195–206, 2004.
[78] MEJ Newman. The Structure and Function of Complex Networks. Structure,
45(2):167–256, 2003.
[79] MEJ Newman and EA Leicht. Mixture models and exploratory analysis in
networks. Proceedings of the National Academy of Sciences, 104(23):9564, 2007.
[80] E. Ozkan. Particle methods for bayesian multi-object tracking and parameter
estimation. PhD thesis, Middle East Technical University, 2009.
[81] G. Pólya and F. Eggenberger. Über die Statistik verketteter Vorgänge. Z. Angew.
Math. Mech., pages 279–289, 1923.
[82] DJ Price. A general theory of bibliometric and other cumulative advantage
processes. Journal of the American Society for Information Science, 27(5):
292–306, 1976.
[83] O. Ratmann, O. Jørgensen, T. Hinkley, M. Stumpf, S. Richardson, and C. Wiuf.
Using likelihood-free inference to compare evolutionary dynamics of the protein
networks of H. Pylori and P. Falciparum. PLoS Comput Biol, 3(11):e230, 2007.
link.
[84] C.P. Robert and G. Casella. Monte Carlo statistical methods. Springer, 2004.
[85] C.P. Robert and G. Casella. Introducing Monte Carlo Methods with R. Springer
Verlag, 2010.
[86] M.N. Rosenbluth and A.W. Rosenbluth. Monte carlo calculation of the average
extension of molecular chains. The Journal of Chemical Physics, 23(2):356–359,
1955.
[87] D.B. Rubin. A noniterative sampling/importance resampling alternative to the
data augmentation algorithm for creating a few imputations when fractions of
missing information are modest: the SIR algorithm. Journal of the American
Statistical Association, 82(398):543–546, 1987.
[88] R.Y. Rubinstein. Optimization of computer simulation models with rare events.
European Journal of Operational Research, 99(1):89–112, 1997.
[89] R.Y. Rubinstein and D.P. Kroese. The cross-entropy method: a unified approach
to combinatorial optimization, Monte-Carlo simulation and machine learning.
Springer, 2004.
[90] P. Sheridan, Y. Yagahara, and H. Shimodaira. A preferential attachment model
with Poisson growth for scale-free networks. arxiv.org, 2008.
[91] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transac-
tions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
[92] H.A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):
425–440, 1955.
[93] A. Sinclair and M. Jerrum. Approximate counting, uniform generation and
rapidly mixing Markov chains. Information and Computation, 82(1):93–133,
1989.
[94] Ø. Skare, E. Bølviken, and L. Holden. Improved sampling-importance resam-
pling and reduced bias importance sampling. Scandinavian Journal of Statistics,
30(4):719–737, 2003. ISSN 1467-9469.
[95] DA Spielman. Spectral graph theory and its applications. In Foundations of
Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on, pages
29–38, 2007.
[96] M.A. Tanner and W.H. Wong. The calculation of posterior distributions by
data augmentation. Journal of the American statistical Association, 82(398):
528–540, 1987.
[97] A.C. Thomas and J.K. Blitzstein. The effect of censoring out-degree on network
inferences. Unpublished manuscript, 2009.
[98] L.G. Valiant. The complexity of computing the permanent. Theoretical com-
puter science, 8(2):189–201, 1979.
[99] D.A. van Dyk and X.L. Meng. The art of data augmentation. Journal of
Computational and Graphical Statistics, 10(1):1–50, 2001.
[100] J. Vermaak, A. Doucet, and P. Perez. Maintaining multi-modality through
mixture tracking. In International Conference on Computer Vision, volume 2,
pages 1110–1116. Citeseer, 2003.
[101] B.N. Vo, S. Singh, and A. Doucet. Sequential Monte Carlo methods for multi-
target filtering with random finite sets. IEEE Transactions on Aerospace and
Electronic Systems, 41(4):1224–1245, 2005. ISSN 0018-9251.
[102] Andreas Wagner. How the global structure of protein interaction networks
evolves. Proceedings of the Royal Society B: Biological Sciences, 270:457–466,
Mar 2003. link.
[103] E.A. Wan and R. Van Der Merwe. The unscented Kalman filter for nonlinear
estimation. In Adaptive Systems for Signal Processing, Communications, and
Control Symposium, pages 153–158. IEEE, 2000. ISBN 0780358007.
[104] C. Wiuf, M. Brameier, O. Hagberg, and M.P.H. Stumpf. A likelihood approach
to analysis of network data. Proceedings of the National Academy of Sciences,
103(20):7566–7570, 2006.
[105] G.U. Yule. A mathematical theory of evolution, based on the conclusions of Dr.
JC Willis. FRS Philosophical Transactions B, 213(1924):21–87, 1924.