Structure Learning in Locally Constant Gaussian Graphical Models
By
Apratim Ganguly
B.Stat., Indian Statistical Institute (2007)
M.Stat., Indian Statistical Institute (2009)
DISSERTATION
Submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
STATISTICS
in the
OFFICE OF GRADUATE STUDIES
of the
UNIVERSITY OF CALIFORNIA
DAVIS
Approved:
Wolfgang Polonik (Chair)
Debashis Paul
Ethan Anderes
Committee in Charge
2014
© Apratim Ganguly, 2014. All rights reserved.
Contents

1 INTRODUCTION
1.1 Graphical Models
1.1.1 Introduction to Graphical Models
1.1.2 Types of Graphical Models
1.1.3 Gaussian Graphical Models
1.1.4 Local Constancy in Gaussian Graphical Model
1.2 Organization of this Dissertation

2 LOCAL GEOMETRY IN GAUSSIAN GRAPHICAL MODELS
2.1 GGM as Exponential Family
2.1.1 Definition
2.1.2 Motivation
2.1.3 Gaussian MRF
2.2 Maximum Likelihood Estimation
2.2.1 Saturated Model
2.2.2 Covariance Selection Model
2.3 Geometric Interpretation of Existence of MLE
2.4 Local Geometry as Structure Constraint
2.4.1 Quantitative Definition
2.4.2 Bayesian Perspective

3 Estimation of Locally Constant Gaussian Graphical Model Using Neighborhood-Fused Lasso
3.1 Introduction
3.2 A Review of Related Works
3.3 Neighborhood Selection Using Fused Lasso
3.3.1 Assumptions
3.4 Optimization method
3.5 Asymptotics of Graphical Model Selection
3.6 Compatibility and l1 properties
3.7 Simulations
3.7.1 Simulation 1
3.7.2 Simulation 2
3.8 Theoretical Results

4 Smoothing of Diffusion Tensor MRI Images
4.1 Introduction
4.2 Principles of Diffusion MRI
4.2.1 Shortcomings of conventional MRI and Contrast Generation
4.2.2 Water Diffusion and Its Importance in MR Imaging
4.3 Tensor Smoothing
4.3.1 Two-Stage Smoothing
4.3.2 Locally Constant Smoothing with Conjugate Gradient Descent Algorithm
4.3.3 Experimental Results
4.4 Discussion

References
List of Figures

1.1.1 Directed Graphical Models
1.1.2 Toy Bayesian Network for Microarray Analysis
1.1.3 A template model associating promoter sequences to gene expressions
1.1.4 A spin configuration of 2-D lattice Ising model
1.1.5 Human Brain Graph
2.2.1 Chordal Graphs
3.7.1 Comparison of NFL with GLASSO, Meinshausen-Buhlmann estimate and CDD in section 3.7.1
3.7.2 Comparison of NFL with GLASSO and Meinshausen-Buhlmann estimate in section 3.7.2, sample sizes from top to bottom are 10, 25, 50, 100, 500, 1000, 5000, 10000
4.3.1 Comparison of smoothers: Low noise level
4.3.2 Comparison of smoothers: Medium noise level
4.3.3 Comparison of smoothers: High noise level
Apratim Ganguly
August 2014
Statistics
Structure Learning in Locally Constant Gaussian Graphical Models
Abstract
Occurrence of zero entries in the inverse covariance matrix of a multivariate Gaussian random variable has a one-to-one correspondence with conditional independence of the corresponding pairs of components. A challenging aspect of sparse structure learning is the well-known “small n, large p” scenario. So far, several algorithms have been proposed to solve the problem. Neighborhood selection using the lasso (Meinshausen-Buhlmann), the block-coordinate descent algorithm for estimating the covariance matrix (Banerjee et al.), and the graphical lasso (Tibshirani et al.) are some of the most popular ones. In the first part of this thesis, an alternative methodology is proposed for Gaussian graphical models on manifolds, where spatial information is judiciously incorporated into the estimation procedure. This line of work was initiated by Honorio et al. (2009), who proposed an extension of the coordinate descent approach, called the “coordinate direction descent approach”, which incorporates the local constancy property of spatial neighbors. However, Honorio et al. provide only an intuitive formalization and no theoretical investigation. Here I propose an algorithm to deal with local geometry in Gaussian graphical models. The algorithm extends Meinshausen-Buhlmann's idea of successive regressions by using a different penalty: neighborhood information enters the penalty term, and the resulting method is called the neighborhood-fused lasso algorithm. I will show by simulation, and prove theoretically, the asymptotic model selection consistency of the proposed method, and will establish faster convergence to the ground truth than the standard rates when the assumption of local constancy holds. This modification has numerous practical applications, e.g., in the analysis of MRI data or 2-dimensional spatial manifold data, in order to study spatial aspects of the human brain or moving objects.
In the second part of the thesis, I will discuss smoothing techniques on Riemannian manifolds using local information. Estimation of smoothed diffusion tensors from diffusion weighted magnetic resonance images (DW-MRI or DWI) of the human brain is usually a two-step procedure, the first step being a regression (linear/non-linear) and the second step being a smoothing (isotropic/anisotropic). I extend the smoothing ideas on Euclidean space to a non-Euclidean space by running a conjugate gradient algorithm on the manifold of positive definite matrices. This method shows empirical evidence of better performance than the two-step method of smoothing. This is a collaborative work with Debashis Paul, Jie Peng and Owen Carmichael.
Acknowledgments
Even though the dissertation is my individual work, its completion would not
be possible without the unsparing support and invaluable advice of my academic
mentor, Professor Wolfgang Polonik. His scientific acumen helped me shape some
of the fundamental ideas and thoughtful conversations with him have always been
immensely helpful for the development of my thesis. I am also indebted to Professor
Debashis Paul for the collaborative work I did with him and for the academic support
and counseling during the last five years. I would like to thank Prof. Francisco J.
Samaniego for giving me the opportunity to work with him and for the zestful but
informative interactions. My dissertation committee members Prof. Ethan Anderes
and Prof. Alexander Aue have helped me with their time and valuable inputs. I
am grateful to Prof. Owen Carmichael, Prof. Vladimir Filkov, Prof. Jie Peng, Prof.
Prabir Burman and Prof. Duncan Temple Lang for letting me work with them on
various projects that helped me expand my knowledge horizon. I would like to say a
big thank you to Pete Scully, our graduate program coordinator for his ever-present
help in time of need. I extend my profound gratitude to the rest of the faculty members of the Department of Statistics at UC Davis and my colleagues who made my
stay nevertheless enjoyable.
I would like to thank a few people without whom this half-a-decade long journey
far away from home would never be the same. The wonderful company of friends in
Davis and SF Bay area will be forever etched in my memory. So thank you Nagda,
Debasis, Partha, Sai, Uttam, Aveek, Gupta, Roy, Shreyashi, Pulak, Sanchita, Rajat,
Ana, Anupam, Annie, Saumen, Manjula, Sandipan, Sumit, Anirban, Bhashwar, Kinjal, Sujayam, Riddhi, Charles and all my awesome friends for being there.
I am also deeply thankful to my parents whose support and encouragement has
made this endeavor successful in every respect.
And the one person who deserves a very special mention is my fiancée, Rohosen, without whose relentless support and love it would have been extremely difficult to make it this far. You stood by me in my difficult times and held my hand. I cannot thank you enough for that. This would never have been possible without you by my side, and I promise to do the same. Love you.
Chapter 1
Introduction
This thesis aims to demonstrate the significance of local geometry in multivariate structure learning and in kernel smoothing problems in non-Euclidean spaces. The principal goal of this thesis is to provide empirical evidence and a detailed theoretical explanation for the advantage that one gets when locality information is judiciously incorporated into the relevant statistical method. In this thesis, I shall delve into the following two problems:

(i) Efficient structure learning in high dimensional Gaussian graphical models using local constancy.

(ii) Kernel smoothing on a tensor space using local alignment of tensors.

with the first problem being the main focus of this thesis. In both problems, local geometry based methods will be developed and their advantages over the traditional methods will be established.
1.1 Graphical Models
1.1.1 Introduction to Graphical Models
Uncertainty and complexity are the two most critical aspects of model based multivariate statistical analysis in various domains like computer vision and image processing, video processing, bioinformatics, medical imaging, etc. In a broader sense, any model based statistical method can be said to have a declarative representation[1], in which the model, representing the investigators' existing knowledge about the system, is used by different algorithms to answer different questions about the system. For example, a model in cognitive psychology might represent the investigators' prior knowledge about the relationships between several psychological states, their symptoms and the regions of the human brain that are, to some degree, responsible for them. An algorithm would then take this model as an input, along with the pertinent information about a certain individual, and diagnose her psychological condition. A model in genetics might entail the relationships between certain diseases, their symptoms and the genetic disorders potentially responsible. Alternatively, one might try to improve the working model with the collection of more data and advancements in domain knowledge, without significant changes to the reasoning algorithms that answer the specific questions the algorithms are designed for. This becomes possible due to the complete separation of knowledge and reasoning in a declarative representation. In contrast, a procedural method concentrates on finding a sequence of modules to reach conclusions starting from the data. For example, when relating transcription factor binding sites in the promoter regions of genes to their expression profiles, one can start by finding clusters of coexpressed genes and look for overrepresented elements in the promoters of the genes in each cluster [2, 3].
In various applied fields of modern science, some of which have been mentioned in the preceding paragraph, the systems appear with different degrees of uncertainty. The uncertainty might arise as a result of imprecise measurements or surrounding noise, or it might be due to a lack of sufficient domain knowledge, or simply because of a modelling limitation. Probability theory provides us with a comprehensive framework to formalize the inherent uncertainty of any system, enabling a thorough quantitative analysis. On the other hand, a system involving multiple components, often in the order of millions, makes it exceedingly difficult to express the relationships between its components owing to its complex modular structure. Graph theory provides the tools to deal with this complexity in a clean, formal and time efficient manner. Coupling graph theoretical tools with probability theory proves to be quite powerful in analyzing systems with a high degree of modularity and many interacting sub-components. These are commonly referred to as Graphical Models. In the following section, we discuss different types of graphical models with motivating examples.
1.1.2 Types of Graphical Models
Let us begin with a general mathematical framework for graphical models. It consists of a graph G = (V, E), where V = {1, 2, · · · , p} is a collection of vertices, each vertex corresponding to a random variable Xi taking values in some space Ωi for i = 1, 2, · · · , p, and E ⊆ V × V is a collection of edges. The multivariate random variable X = (X1, X2, · · · , Xp) is assumed to have a joint p.m.f. or p.d.f. f on Ω = Ω1 × Ω2 × · · · × Ωp. The existence (or absence) of an edge signifies conditional dependence (or independence) of the corresponding pair of random variables given the rest of them. The fundamental idea of a graphical model is that f factors according to the structure of G. The edges of the graph are either directed or undirected, leading to two different types of probabilistic graphical models - directed graphical models and undirected graphical models.
Directed Graphical Model
Given a directed graph G = (V,E) with a directed edge set, we define s to be a
parent of t if there is a directed edge from s to t. Conversely, t is called a child of
s. The set of all parents of a node s ∈ V is denoted by π(s). A directed cycle is a
sequence (s1, s2, · · · , sk) such that E contains all the directed edges (si → si+1; i =
1, 2, · · · , k − 1) and (sk → s1) ∈ E. See figure 1.1.1 for an illustration.
Figure 1.1.1: (a) A directed acyclic graph with 5 vertices. (b) A directed cyclic graph which
includes the directed cycle 1→ 2→ 3→ 4→ 1
The graph in figure 1.1.1(a) is an example of a Directed Acyclic Graph, abbreviated as DAG, meaning that every edge in the graph is directed and it contains no directed cycle. One can define a notion of ancestry on a DAG as follows: s is said to be an ancestor of u if there is a directed path (s → t1 → t2 → · · · → tk → u). For A ⊆ V, let us define XA := {Xt : t ∈ A}. Given a DAG, for each vertex s and its parent set π(s), let fs(xs | xπ(s)) denote a nonnegative function of the variables (xs, xπ(s)) such that ∫ fs(xs | xπ(s)) dxs = 1. A directed graphical model is then characterized as a joint probability density that factors as a product of these conditional densities, i.e.,

f(x1, x2, · · · , xp) = ∏_{s∈V} fs(xs | xπ(s)).     (1.1.1)
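To make the factorization (1.1.1) concrete, here is a minimal sketch (the three-node chain and the probability tables are hypothetical choices, not taken from the text) that evaluates the joint p.m.f. of the DAG X1 → X2 → X3 as the product of its conditional tables.

import itertools
import numpy as np

# Conditional probability tables for the DAG X1 -> X2 -> X3 (binary variables).
p_x1 = np.array([0.6, 0.4])                    # f_1(x1), no parents
p_x2_given_x1 = np.array([[0.7, 0.3],          # f_2(x2 | x1 = 0)
                          [0.2, 0.8]])         # f_2(x2 | x1 = 1)
p_x3_given_x2 = np.array([[0.9, 0.1],          # f_3(x3 | x2 = 0)
                          [0.5, 0.5]])         # f_3(x3 | x2 = 1)

def joint(x1, x2, x3):
    # Equation (1.1.1): product of f_s(x_s | x_pi(s)) over the vertices s.
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

# The factorized joint p.m.f. sums to one over the whole sample space.
print(sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3)))  # 1.0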
It should be noted that while speaking of densities, p.m.f.’s (absolutely continuous
w.r.t. counting measure) and p.d.f.’s (absolutely continuous w.r.t. Lebesgue measure)
are not differentiated. It can be proved using the normalization condition of fs and
the ancestry relationship of the DAG that the above definition is consistent. An
alternative formulation of directed graphical models is given by Directed Markov
Property, which states that every variable is conditionally independent of its non-descendants given its parents, i.e.,

Xs ⊥ X_{V∖de(s)} | Xπ(s)   ∀ s ∈ V     (1.1.2)

where de(s) denotes the set of all descendants of s as given by the DAG. Such a model is also known as a Bayesian Network∗, which has multiple applications. We describe below a practical application of Bayesian networks in computational biology[2].
∗Bayesian networks do not necessarily adopt a Bayesian statistics framework. The name alludes to Bayes' rule for probabilistic inference. However, Bayes nets are instrumental in hierarchical Bayesian models, which lay the foundation of a number of applied Bayesian statistics projects. See, e.g., the BUGS project at http://www.mrc-bsu.cam.ac.uk/software/bugs/
After the success of complete genome sequencing, high-throughput assays have been developed to analyze cells at a genome-wide scale. It is possible to measure components of molecular level networks with the help of these assays. However, extraction of meaningful information about the underlying mechanisms at the molecular level continues to be a challenging problem in molecular biology. Gene expression profiles are the main sources of high-throughput data on cellular networks. The expression levels of genes are measured by several microarrays. The number of genes can be up to the order of tens of thousands, whereas the number of microarrays could be in the tens to hundreds. An initial simplification is made by clustering genes with similar expression levels into gene clusters and arrays into array clusters according to certain biological contexts. Let Xg,a denote the expression level of gene g measured by array a. Let GCg and ACa denote the gene cluster and array cluster that gene g and array a belong to, respectively. We define a Bayesian network under the assumption that all measurements corresponding to a particular gene cluster-array cluster pair have the same conditional distribution. Thus, each node Xg,a has parents GCg and ACa. Figure 1.1.2 shows an example with a small number of genes and arrays.
Figure 1.1.2: A Bayesian Network with 3 genes and 2 arrays. GCg is the gene cluster for gene g
and ACa is the array cluster for array a.
It can be easily seen that the underlying graph is indeed a DAG. One can additionally assume that the conditional p.d.f. f(Xg,a | GCg, ACa) is the same for different choices of g and a. This model can achieve high likelihood if the gene and array clusters partition the expression data into homogeneous expression within each block [2, 4]. One can use the EM algorithm to find such a partition.

One can extend this simple model to more complex models capturing the hidden biological mechanisms that result in similar gene expressions, while maintaining the acyclicity and directedness of the extended graph. For example, it is assumed that coexpression is an effect of coregulation. Hence, it would make sense to incorporate different regulation mechanisms into this model to account for similar gene expressions in a gene cluster. One such regulation mechanism involves the binding of transcription factors to binding sites in the promoter region of genes. Thus, one can add some additional nodes, annotating promoters with characterized binding sites. One can define an indicator variable Rg,j denoting whether gene g has a binding site for transcription factor j. The corresponding graphical model is shown in figure 1.1.3.
Another popular example of directed graphical models is the Hidden Markov Model (abbreviated as HMM), which has numerous applications in bioinformatics (e.g., in gene-finding problems[5, 6]), speech recognition[7], video processing[8] and many other problems. Directed graphical models are used in collaborative filtering and recommender systems, intensive care monitoring, text analysis, object recognition and image segmentation. For details, see [9].
Figure 1.1.3: A template model associating promoter sequences to gene expressions. Ri's indicate whether the gene is regulated by the i-th transcription factor. Thus, the clustering of the genes would depend on the different binary sequences generated.
Undirected Graphical Model
In an undirected graphical model, also known as a Markov Random Field, the p.d.f. factors over cliques†. We associate with each clique C a compatibility function[5] or potential function[10]

ψC : ⊗_{s∈C} Ωs −→ R+.

The joint p.d.f. is obtained by taking the product of the potential functions over all cliques,

f(x1, x2, · · · , xp) = (1/Z) ∏_{C∈C} ψC(xC),     (1.1.3)

where C is the set of all maximal cliques‡ and Z is the normalization constant. The functions ψC need not be related to the conditional distributions defined over the cliques, unlike directed graphs where the factors correspond to the conditional distributions over the child-parent sets of nodes.
An important characterization of undirected graphical models is given in terms of conditional independence among subsets of nodes, also known as the Markov properties of the graphical model. We say that the graph exhibits the

(a) Pairwise Markov Property: if Xu ⊥ Xv | X_{V∖{u,v}} for all pairs of nodes (u, v) ∉ E.

(b) Local Markov Property: if Xv ⊥ X_{V∖cl(v)} | X_{ne(v)}, where ne(v) denotes the neighborhood of v, i.e., all nodes connected to v, and cl(v) = {v} ∪ ne(v).

†Cliques are complete subgraphs (in this scenario).
‡Maximal cliques are cliques that do not have any superclique containing them.
(c) Global Markov Property: if XA ⊥ XB | XS whenever every path from a node in A to a node in B passes through S, or equivalently, S is a separator of A and B.
However, all three aforementioned Markov properties are equivalent when the joint density is strictly positive. Otherwise, the global Markov property implies the local Markov property, which in turn implies the pairwise Markov property [11]. Taking into account all possible choices of A, B and S generates a sequence of conditional independence assertions. By the Hammersley-Clifford theorem§, proposed by Hammersley and Clifford in an unpublished paper in 1971 (see [13]), it can be shown that the set of probability distributions satisfying these assertions is exactly the set of distributions defined by equation (1.1.3) over all possible choices of potential functions. The global Markov property is essentially identified with the functionally equivalent notion of reachability in graph theory, establishing a fundamental connection between the algebraic concept of factorization and the graph theoretic concept of reachability[5]. This formulation can be exploited by clever algorithms (e.g., breadth-first search algorithms[12]) in order to assess conditional independence on a graph.

When the density is strictly positive, the global, local and pairwise Markov properties on the graph are equivalent. Therefore, one can use the pairwise Markov property alone to represent the conditional independence structure. Hence, the conditional dependence (independence) between two nodes can be represented by the presence (absence) of an edge between the two. We shall use this representation frequently in this thesis. Before I delve into the specifics of Gaussian graphical models, I shall present some practical applications of undirected graphical models.
§For other proofs, see [14, 15, 16, 17]
One of the basic and interesting examples of undirected graphical models is the Ising model [18], which arose from statistical physics and deals with interacting particles with similar or opposite magnetic spins. The model considers N interacting particles, each with a magnetic spin of +1 or −1, depending on the direction of their magnetic dipole moment. Let us denote by yi the state of the ith particle. The particles are arranged in a regular geometric configuration (a 1-D, 2-D or 3-D lattice, to be specific). In ferromagnetic substances, neighbors tend to spin in similar directions, whereas in anti-ferromagnetic substances neighbors tend to spin in opposite directions, and the system can be in 2^N different spin configurations. The model was proposed by the physicist Wilhelm Lenz (1920), and his student Ernst Ising solved the one-dimensional model in 1925. The two-dimensional square lattice Ising model was harder to solve; it was solved by Lars Onsager in 1944.¶

An Ising model can be represented as a Markov random field on a 1-D, 2-D or 3-D lattice. A 2-D lattice configuration of an Ising model is shown in figure 1.1.4. In this representation, the neighboring edges constitute the cliques of size 2. The pairwise clique potentials can be written as

ψ_{uv}(yu, yv) = e^{w_{uv} yu yv}     (1.1.4)
where wuv is the interaction strength between nodes u and v. A standard Ising
model makes some assumptions about the interactions wuv. Some of the common
assumptions are
(i) W = ((wuv))u,v is symmetric, so wuv = wvu.
¶Source: Wikipedia. For details, see http://en.wikipedia.org/wiki/Ising_model
Figure 1.1.4: A spin configuration of 2-D lattice Ising model
(ii) The interactions have the same strength, so wuv = J for all u, v.

If wuv > 0 (< 0), the interaction is ferromagnetic (anti-ferromagnetic). Thus, for a common value J > 0 one gets a ferromagnet, and the corresponding graphical model is called an associative Markov network. Conversely, for J < 0 one gets an anti-ferromagnet, and the corresponding graphical model is known as a frustrated system[19]. The energy of a configuration is given by
E(y) = − ∑_{(u,v)∈E} yu yv w_{uv}     (1.1.5)

In the most general version of the Ising model, one would have an external magnetic field Bu at node u, so that

E(y) = − ∑_{(u,v)∈E} yu yv w_{uv} − µ ∑_u Bu yu     (1.1.6)
where E is the edge set of the lattice and the magnetic moment is given by µ.
However, under the assumption of no external magnetic field, the negative log potential equals the energy of the system, and assuming the interactions are of the same strength,

E(y) = −J ∑_{(u,v)∈E} yu yv     (1.1.7)

The corresponding probability distribution is given by

Pβ(y) = [ ∏_{(u,v)∈E} ψ_{uv}(yu, yv)^β ] / [ ∑_y ∏_{(u,v)∈E} ψ_{uv}(yu, yv)^β ]
      = [ ∏_{(u,v)∈E} e^{β yu yv w_{uv}} ] / [ ∑_y ∏_{(u,v)∈E} e^{β yu yv w_{uv}} ]
      = e^{−βE(y)} / ∑_y e^{−βE(y)}     (1.1.8)
where E(y) is given by the equation (1.1.7) with all its relevant assumptions and β
is a physical parameter which depends on the temperature of the system.
If J is larger than a certain positive lower bound, then it can be proved that the corresponding probability distribution will have two maxima - when all the states are +1 or all are −1. These are known as the ground states [19] of the system. Conversely, if J is smaller than a certain negative upper bound, then the corresponding distribution will have many maxima. The denominator on the right hand side of equation (1.1.8) is called the partition function, and it is usually hard to calculate analytically‖. It is, therefore, a common practice to use Metropolis-Hastings sampling for the simulation of Ising models.

‖It is actually an NP-hard problem in general (see [20]). However, for associative Markov networks, it can be calculated in polynomial time.
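As a minimal sketch of such a Metropolis-Hastings simulation (the lattice size, interaction strength J and inverse temperature β below are illustrative choices, not values from the text), single-site spin flips are proposed and accepted with probability min(1, e^{−βΔE}):

import numpy as np

rng = np.random.default_rng(0)

def metropolis_ising(L=20, J=1.0, beta=0.5, n_sweeps=200):
    """Metropolis sampling of a 2-D lattice Ising model with no external field."""
    y = rng.choice([-1, 1], size=(L, L))
    for _ in range(n_sweeps * L * L):
        i, j = rng.integers(L, size=2)
        # Sum of the four lattice neighbors (periodic boundary conditions).
        nb = y[(i + 1) % L, j] + y[(i - 1) % L, j] + y[i, (j + 1) % L] + y[i, (j - 1) % L]
        delta_E = 2 * J * y[i, j] * nb        # energy change from flipping spin (i, j)
        if delta_E <= 0 or rng.random() < np.exp(-beta * delta_E):
            y[i, j] *= -1
    return y

sample = metropolis_ising()
print(sample.mean())   # magnetization per spin of the sampled configuration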
In the following section, we discuss in detail the special kind of Markov random field where the nodes jointly follow a multivariate normal distribution. In statistical parlance, this is commonly referred to as a Gaussian Graphical Model.
1.1.3 Gaussian Graphical Model
When the nodes of the graph represent the individual components of a multivariate Gaussian random variable, the pairwise Markov property, the local Markov property and the global Markov property are equivalent due to the strict positivity of the density function. Thus, the Markov properties are easily interpreted as pairwise Markov properties, i.e., conditional (in)dependence of pairs of nodes given the rest of them. For a multivariate normal model, interestingly, this property translates to an even simpler specification. Before spelling it out, let us introduce some notation we shall be following for the rest of the chapter.

Let X = (X1, X2, · · · , Xp) ∈ R^p follow a multivariate normal distribution Np(0, Σ), where Σ denotes the covariance matrix, assumed to be positive definite. Let us denote Ω = ((ωij))i,j := Σ^{−1}. The matrix Ω is known as the concentration matrix of the distribution. Each of the p variables can be represented as a node in a graph G = (V, E), where V = {1, 2, · · · , p} and E = {(a, b) : Xa ⊥̸ Xb | X_{V∖{a,b}}}. Then it can be easily proved that [11]
Proposition 1.1.1. Assume that X ∼ Np(0, Σ). Then it holds for a pair of vertices (s, t) with s ≠ t that

Xs ⊥ Xt | X_{V∖{s,t}} ⟺ ωst = 0     (1.1.9)
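Proposition 1.1.1 is easy to verify numerically; the following sketch (the sparse precision matrix is an arbitrary illustrative choice, not one from the text) checks that a zero entry ω14 = 0 yields a zero conditional covariance between X1 and X4 given the remaining variables:

import numpy as np

# A sparse precision matrix for p = 4; the zero in position (1, 4) encodes
# X1 _||_ X4 | {X2, X3} (1-based indices as in the text, 0-based in the code).
Omega = np.array([[ 2.0, -0.8,  0.0,  0.0],
                  [-0.8,  2.0, -0.6,  0.0],
                  [ 0.0, -0.6,  2.0, -0.5],
                  [ 0.0,  0.0, -0.5,  2.0]])
Sigma = np.linalg.inv(Omega)

# Conditional covariance of (X1, X4) given (X2, X3) via the Schur complement.
A, B = [0, 3], [1, 2]
cond_cov = (Sigma[np.ix_(A, A)]
            - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)])

print(np.round(cond_cov, 10))   # the off-diagonal entry is 0, matching omega_14 = 0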
This basic connection between conditional independence in the multivariate Gaussian distribution and its concentration matrix lays out an alternative representation for the underlying Markov network by restricting certain elements of the concentration matrix to zero. It can be shown using simple algebra that

ωss = 1 / Var(Xs | X_{V∖{s}}).     (1.1.10)

Let C = ((cij))i,j be the matrix Ω scaled so that its diagonal entries are equal to 1. Then,

cij = ωij / √(ωii ωjj).     (1.1.11)
Thus, the off-diagonal elements are equal to the negatives of the partial correlation coefficients. Hence, conditional independence directly corresponds to zero partial correlation in the multivariate Gaussian model. We now describe a real-life application of Gaussian graphical models.
Proteins are polymers of amino acids linked into a chain by peptide bonds. Proteins are one of the fundamental building blocks of life, and they play many important roles, including structural roles (cytoskeleton), as catalysts (enzymes), as transporters that ferry ions and molecules across membranes, and as hormones[21]. The three-dimensional structure of a protein has two equivalent expressions - either as Cartesian coordinates of the constituent atoms, or in terms of a sequence of dihedral angles. In the jargon of biotechnology, there are two types of dihedral angle - three backbone dihedral angles (denoted as φ, ψ and ω) and four side-chain dihedral angles (denoted as χ1, χ2, χ3 and χ4). The structure of a protein is not static, and during a reaction it changes continuously. Thus, one only observes a small sample from the distribution of the different values these dihedral angles go through. One can safely assume that the underlying distribution is multivariate Gaussian, and one is interested in exploring the underlying conditional independence graph by learning the structure of the Gaussian graphical model, which gives an idea about the interaction of several nodes of that protein during the reaction[22].
1.1.4 Local Constancy in Gaussian Graphical Model
Although the traditional methods of structure learning in Gaussian graphical models do not consider any additional restrictions, we often encounter data with a very specific geometric structure that calls for special treatment. For instance, a lot of data measured these days are manifold-valued, and this additional information is often ignored in classical analyses. Take, for example, the protein dihedral angle measurements mentioned in the last section. Simply assuming a Gaussian graphical model, one ignores the spatial arrangement of those angles and the order in which they appear in reality. Another example is provided by data describing features outlining a (moving) silhouette, pixels in an image, or voxels in a body organ like the brain or heart. In these problems, spatially close variables have a structural resemblance in terms of probabilistic dependence. Considering this local behavior might lead to faster and/or more efficient structure learning.
Honorio et al.[23] introduced the notion of local constancy. “Local” is used only in a heuristic fashion in their paper; no formal definition is provided. According to them, in a Gaussian graphical model, if a variable Xa is conditionally (in)dependent of a variable Xb, then a spatial neighbor Xa′ of Xa should also be conditionally (in)dependent of Xb. This would then encourage finding connections between two local or distant
clusters of variables instead of finding connections between isolated variables. For
an image, a spatial neighbor might be a neighboring pixel in a 4-neighborhood or
8-neighborhood system. For MRI data, a spatial neighbor might be a neighboring
voxel. In Bayesian networks, similar methods have been proposed, where the variables are grouped into classes and prior probabilities of having an edge between two different classes are assigned in a way that enforces regularized structure learning[24].

Imposing the restriction of local constancy is a step towards structured learning in graphical models. Such restrictions promise to have many practical applications, especially in the domains of medical imaging, image processing, genetics and biotechnology. Structural constraints have implicitly been used for a long time. The QMR-DT[25] network has two distinct classes of nodes - diseases and symptoms - and edges are only allowed from one class to the other. In the biological literature, similar divisions into
functional classes are not rare. In [26], the authors learn gene networks by finding
a relatively small set of regulator genes that control other genes and also exhibit
inter-connectivity.
One can anticipate a number of advantages of structured learning over unstructured learning. It can, for instance, be expected that when the assumption of local constancy holds, the efficiency of structured learning should be better than that of unstructured learning. It should also be possible to incorporate domain knowledge to regularize the model. This way, one can make up for the lack of samples, especially in high dimensional problems where samples are scarce and sparse estimation is sought. When compared against unstructured estimation, we will often see that structured learning produces the same quality of estimation with a smaller number of samples. Finally, structured learning can be important on its own merit. For example, one might simply want to verify that certain genes are regulatory without having to detect their inter-connectivity, influence or causal effect.
Although structure learning through local constancy (or similar ideas) is not new in the statistical community, it lacks a deep theoretical understanding, both from a geometric point of view and from an application perspective. In this thesis, I attempt to establish a fundamental underpinning of the concept of local constancy in Gaussian graphical models and simultaneously explore its advantageous aspects in practical applications. Another application of local structure discussed in this thesis is kernel smoothing of positive definite matrices to estimate the diffusion tensor matrix in voxels of the human brain. The following example shows a practical application of the local constancy
property in graphical models.
Studying the interaction between cognitive, emotional, behavioral and neurotic changes on the one hand and drug addiction on the other is an active field of research in neurobiology[27]. The objective is to understand the physiological and psychological phenomena that control the recursive nature of addiction to drugs (intoxication, withdrawal, craving, relapse). Functional MRI (fMRI) data are quite useful and popular for analyzing brain activity when the individual is asked to perform certain tasks. In one such study[27] the researchers observed the brain's sensitivity to monetary rewards of different magnitudes among cocaine abusers and its association with motivation and self-control. fMRI data were collected for both addict and non-addict subjects. Now, it has been observed that functional connectivity in the brain is often realized as brightness or dimness of certain regions of the brain. Their study concluded
that cocaine abusers show an overall reduced regional brain responsivity to the differences between monetary rewards, whereas for the healthy individuals the money-induced
stimulations were predominantly in the orbitofrontal cortex. Thus, one can choose to
model the voxel-level sensitivity as a high dimensional multivariate Gaussian random variable, but since regions of the brain work as functional units, it makes a lot of sense to impose some structural restriction on the model. One such simple restriction could be the constraint of local constancy. Then one can represent the functional connectivity among different regions of the human brain as the conditional independence graph learned from the data. Figure 1.1.5∗∗ shows a representative image of connectivity in the human brain.

Figure 1.1.5: A representative image of a human brain graph. For more images, go to http://www.humanconnectomeproject.org

∗∗This picture is taken from https://raweb.inria.fr/rapportsactivite/RA2011/parietal/uid42.html
1.2 Organization of this Dissertation
The rest of the dissertation is organized as follows. In the next chapter we discuss
the importance of geometry in maximum likelihood estimation in Gaussian graphical
models and establish local constancy as a geometric constraint, along with providing
some alternative definitions. In chapter 3, we propose the neighborhood-fused lasso for structure learning in Gaussian graphical models, followed by its empirical performance in simulations and the asymptotic consistency results (and their proofs), with a discussion of how to choose the regularization parameters in a data-dependent fashion. Chapter 4 deals with smooth estimation of diffusion tensors from diffusion weighted MR images and its performance compared to traditional methods on a simulated data set.
Chapter 2

LOCAL GEOMETRY IN GAUSSIAN GRAPHICAL MODELS (GGM)
In this chapter, structural restrictions on Gaussian graphical models in general, and local geometry in particular, will be analyzed from both a statistical and a geometric point of view. I shall proceed by establishing the Gaussian graphical model as a special member of the exponential family of distributions on a finite-dimensional Euclidean space. Conditions for the existence of maximum likelihood estimates in GGMs will be discussed both in the presence and in the absence of structural constraints. A geometric analysis facilitates an in-depth understanding of the effect of imposing structural restrictions on the variables and thereby provides us with the necessary tools to formalize the notion of local constancy. The transition from complete to partial knowledge of the graph structure, and exploiting the partial knowledge in a quantitative manner, will be the principal objective of this chapter. Development of a sound theoretical understanding is the central theme, with intermittent references to relevant practical applications.

I shall present various results from the existing literature to show the effectiveness of imposing a structural constraint on finding the MLE of the covariance matrix in a Gaussian graphical model. Motivated by that, I shall represent the local neighborhood as a structural constraint and define local constancy in a more formal manner.
2.1 GGM as Exponential Family
The exponential families of probability distributions encompass a large class of probability distributions on finite dimensional Euclidean spaces, e.g., R^p, which can be parametrized by a finite dimensional parameter. Exponential families have been widely studied in the statistics literature[28–30], since most of the popular distributions that are practically useful can be expressed as members of an exponential family. Examples include the binomial, Poisson, exponential, gamma, beta and Gaussian distributions among univariate probability distributions, and the multinomial, Dirichlet and multivariate Gaussian distributions among multivariate probability distributions. The commonality and mathematical neatness of exponential families contribute to an overall convenience in dealing with these distributions and also to the development of a deep understanding of these distributions and their properties, both finite sample and asymptotic. The framework of exponential families binds statistical inference and convex analysis[31–33] together. Apart from classical statistics, exponential families are also relevant in machine learning and graphical models, where a lot of inferential problems can be formulated and analyzed in terms of the canonical parameters and sufficient statistics of relevant exponential families.
2.1.1 Definition
Let X = (X1, X2, · · · , Xp) ∼ Pθ, where θ ∈ Θ ⊆ R^k. We say that the family of distributions {Pθ : θ ∈ Θ} belongs to the k-parameter exponential family if its density f, which is absolutely continuous with respect to a measure ν, can be represented as

f(x) = exp{〈η(θ), T(x)〉 − A(θ)},     (2.1.1)

where 〈·, ·〉 denotes the Euclidean inner product in R^k and the function A(·), known as the cumulant function, is defined to ensure proper normalization of the density. Thus,

A(θ) = log ∫_{X^p} exp{〈η(θ), T(x)〉} dν,     (2.1.2)

where X^p is the sample space. The (multivariate) statistic T(x) is the sufficient statistic for the exponential family and η(θ) is the (multivariate) canonical parameter. Finiteness of A(θ) is a necessary condition for the above definition to hold, and for this reason we are only interested in θ ∈ Θ0, where

Θ0 := {θ ∈ Θ : A(θ) < ∞}.     (2.1.3)
2.1.2 Motivation
Information theory provides one of the basic motivations for exponential family representations of graphical models[5, 34, 35]. Starting with n i.i.d. observations X1, X2, · · · , Xn, one would like to infer the (unknown) probability distribution that generated the data. For any probability distribution represented as a density f w.r.t. an appropriate measure ν, we define the theoretical moment of a statistic T(X) = (T1, T2, · · · , Tk) as

Ef(Ti(X)) := ∫_{X^p} Ti(X) f(X) dν   ∀ i = 1, 2, · · · , k     (2.1.4)
assuming they exist. One good property that our optimal f should satisfy is to equate
the theoretical moments with sample moments. Hence, a good criterion is
Ef(Ti(X)) = (1/n) ∑_{j=1}^{n} Ti(Xj)   ∀ i = 1, 2, · · · , k     (2.1.5)
However this is not enough, since there are infinitely many distributions that satisfy
the property (2.1.5). Therefore, we need to adopt a principle, as a functional of
density f , to choose the density. In information theory, one such useful principle is
based on
H(f) := Ef(− log f(X)) = − ∫_{X^p} (log f(x)) f(x) dν     (2.1.6)
This is commonly known as Shannon Entropy, and it is a measure of unpredictability
of information content. Hence, the maximum entropy solution f ∗, chosen among a
class F , is given by
f* := arg max_{f∈F} H(f)   subject to condition (2.1.5).     (2.1.7)
The heuristic idea behind choosing the Shannon entropy is to maximize unpredictability subject to the moment restrictions. It can be shown that the optimal solution f* is of the form[5]

f*_θ(x) ∝ exp{ ∑_{i=1}^{k} αi Ti(x) }     (2.1.8)
where the αi represent a (finite) parametrization of the distribution. This is the general representation of an exponential family.
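As a standard illustration of this principle (an added example, not part of the original development), take the real line as sample space and T(x) = (x, x²), so that (2.1.5) constrains the first two sample moments. The maximum entropy solution then has the form

f*(x) ∝ exp{α1 x + α2 x²},   with α2 < 0,

which, after completing the square, is a normal density with mean µ = −α1/(2α2) and variance σ² = −1/(2α2); the constraints (2.1.5) pin these down to the sample first and second moments. The multivariate Gaussian distribution of the next subsection arises in exactly the same way from constraints on first and second order moments.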
2.1.3 Gaussian MRF
Given an undirected graph G = (V, E), a Gaussian Markov random field is a multivariate Gaussian random vector X = (X1, X2, · · · , Xp) with mean µ and covariance matrix Σ that satisfies the Markov properties of G. One can show that it has an exponential family representation. The joint density of n observations from the multivariate normal distribution can be expressed as

f_{µ,Σ}(x1, x2, · · · , xn) = (2π)^{−np/2} |Σ|^{−n/2} ∏_{i=1}^{n} exp{ −(1/2)(xi − µ)^T Σ^{−1} (xi − µ) }
∝ |Σ|^{−n/2} exp{ −tr(Σ^{−1} x^T x / 2) + µ^T Σ^{−1} x^T 1 − n µ^T Σ^{−1} µ / 2 }     (2.1.9)
where x is the n × p matrix [x1, x2, · · · , xn]^T and 1 is a vector (of appropriate dimension) of all 1's. It follows that expression (2.1.9) identifies the model determined by the multivariate Gaussian distribution with unknown µ and Ω = Σ^{−1} as an exponential model. To see this, observe that tr(A^T B) is a valid inner product on the space of matrices. Also, take θ = (Ω, ξ) with ξ = Ωµ as the canonical parameter, T(x) = (−x^T x/2, x^T 1) as the sufficient statistic, and

〈(A1, y1), (A2, y2)〉 = tr(A1^T A2) + y1^T y2     (2.1.10)

as an inner product, where A1, A2 are matrices and y1, y2 are vectors. Then expression (2.1.9) can be written as

f_θ(x1, x2, · · · , xn) ∝ exp{ (n/2) log|Ω| − tr(Ω x^T x/2) + µ^T Ω x^T 1 − n µ^T Ω µ/2 }
= exp{ (n/2) log|Ω| + 〈θ, T(x)〉 − n µ^T Ω µ/2 }     (2.1.11)
Since the integral

∫_{R^p} exp{ −(1/2)(x − µ)^T Ω (x − µ) } dx

is finite iff Ω is positive definite, this is a (regular) exponential model. The closed convex support C of the sufficient statistic is the set of pairs (A, b) such that A is a symmetric p × p matrix, b ∈ R^p and A − bb^T/n is non-negative definite, i.e.,

C = { (A, b) ∈ Sp × R^p : A − bb^T/n ⪰ 0 },     (2.1.12)

where Sp denotes the set of p × p symmetric matrices and ⪰ 0 denotes non-negative definiteness[11]. It follows that the interior of C, denoted by C0, is given by

C0 = { (A, b) ∈ Sp × R^p : A − bb^T/n ≻ 0 },     (2.1.13)

where ≻ 0 denotes positive definiteness. In the following sections, we discuss the existence of maximum likelihood estimates for both saturated and unsaturated models.
2.2 Maximum Likelihood Estimation
In order to find the MLE or study conditions for its existence, one can use results from the theory of exponential models. This might seem to be overkill for the general model, but it is helpful when we study Gaussian graphical models with conditional independence restrictions. In general, properties of exponential models will also come in handy when incorporating the local geometric restrictions.
2.2.1 Saturated Model
In a saturated model, we have no additional restrictions on the conditional independence structure. In particular, we have n i.i.d. samples x1, x2, · · · , xn from N(µ, Σ), and we assume no mandatory zero entries in the precision matrix Ω = Σ^{−1}. The only restriction on Σ (and hence Ω) is its positive definiteness. Then, it can be shown that

Theorem 2.2.1. In the saturated model, the MLEs of µ and Σ exist iff

S := x^T x − x^T 1 1^T x / n

is positive definite. This happens with probability one if n > p and never when n ≤ p. When the estimates exist they are given by

µ̂ = x^T 1 / n,   Σ̂ = S/n,

and they are independently distributed as µ̂ ∼ Np(µ, Σ/n) and Σ̂ ∼ Wp(n − 1, Σ/n).
where x = [x1, x2, · · · , xn]^T is the n × p matrix with each row corresponding to an observation. The MLE of Ω is Ω̂ = Σ̂^{−1} ∼ W_p^{−1}(n − 1, Σ/n). Therefore,

E(Ω̂) = n/(n − p − 2) Ω,

V(Ω̂) = [ 2n² ( Ω ⊗ Ω + (1/(n − p − 2)) Ω ⊠ Ω ) ] / [ (n − p − 1)(n − p − 2)(n − p − 4) ],

where W_p^{−1} denotes the p-dimensional inverse Wishart distribution, ⊗ denotes the Kronecker product and ⊠ denotes the matrix outer product such that

(Ω ⊠ Ω)(A) = tr(AΩ) Ω,

so the asymptotic variance of Ω̂ is 2(Ω ⊗ Ω)/n. For proofs of these results, see [11].
In light of theorem 2.2.1, it is seen that for the MLE to exist, the sample size must be larger than the problem dimension. In fact, for a hypothetical situation where the dimension increases polynomially with the sample size, one would require a sample size of the polynomial order of the problem dimension. So, for high dimensional problems, where p ≫ n, the MLE will not exist in the saturated model. This is primarily the driving factor towards considering models with sparsity constraints. In the following section we show how a sparse model results in the advantage of being able to find the MLE with a smaller sample size.
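The sample-size requirement of theorem 2.2.1 is easy to see numerically. The sketch below (the dimensions are arbitrary illustrative choices) checks positive definiteness of S = x^T x − x^T 1 1^T x / n for simulated Gaussian data:

import numpy as np

rng = np.random.default_rng(1)

def mle_exists(n, p):
    """Check positive definiteness of S = x^T x - x^T 1 1^T x / n for Gaussian data."""
    x = rng.standard_normal((n, p))
    one = np.ones((n, 1))
    S = x.T @ x - x.T @ one @ one.T @ x / n
    return bool(np.all(np.linalg.eigvalsh(S) > 1e-10))

print(mle_exists(n=100, p=10))   # True  (n > p): the MLE exists with probability one
print(mle_exists(n=10, p=100))   # False (n <= p): S is rank-deficient, no MLE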
2.2.2 Covariance Selection Model
To start with, we revisit proposition 1.1.1, which connects the conditional independence of two nodes in a Gaussian graphical model with the occurrence of zeros in the concentration matrix Ω. Apart from conditional independence, a number of statistical measures of the underlying model can be expressed as functions of the precision matrix. The conditional variances are the inverses of the diagonal elements, i.e.,

Var(Xi | X_{V∖{i}}) = 1/ωii   for all i.

The partial correlation coefficients are given by

ρ_{ij|V∖{i,j}} = −ωij / √(ωii ωjj).

Also, by property 2.2.2 below, we have that Xi given X_{V∖{i}} follows a univariate normal distribution.
Property 2.2.2. Let Y ∼ Np(µ, Σ). Consider the partitions Y = (Y1, Y2)^T, µ = (µ1, µ2)^T and

Σ = ( Σ11  Σ12
      Σ21  Σ22 ),

where Y1, µ1 are k-dimensional vectors (1 ≤ k < p) and Σ11 is a k × k matrix. Then the conditional distribution of Y1 | Y2 is Nk(µ_{1|2}, Σ_{1|2}), where µ_{1|2} = µ1 + Σ12 Σ22^{−1} (Y2 − µ2) and Σ_{1|2} = Σ11 − Σ12 Σ22^{−1} Σ21, assuming Σ22 is invertible.
Writing the conditional mean as a sum, we get

E(Xi | X_{V∖{i}}) = µi + ∑_{j∈V∖{i}} β_{ij|V∖{i}} (Xj − µj),     (2.2.1)

where the partial regression coefficients are given by

β_{ij|V∖{i}} = −ωij / ωii.     (2.2.2)
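Equations (2.2.1)-(2.2.2) can be checked directly. The following sketch (the positive definite precision matrix is an arbitrary illustrative choice) compares the population coefficients of the regression of X1 on the remaining variables, Σ12 Σ22^{−1}, with the values −ω1j/ω11 predicted by (2.2.2):

import numpy as np

# An arbitrary positive definite precision matrix (illustrative values only).
Omega = np.array([[ 3.0, -1.0,  0.5],
                  [-1.0,  2.0, -0.4],
                  [ 0.5, -0.4,  1.5]])
Sigma = np.linalg.inv(Omega)

# Population coefficients of the regression of X1 on (X2, X3): Sigma_12 Sigma_22^{-1}.
beta_reg = Sigma[0, 1:] @ np.linalg.inv(Sigma[1:, 1:])

# Coefficients predicted by equation (2.2.2): -omega_1j / omega_11.
beta_prec = -Omega[0, 1:] / Omega[0, 0]

print(np.allclose(beta_reg, beta_prec))   # True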
The above property 2.2.2 laid the foundation of a novel procedural approach towards solving the model selection problem in Gaussian graphical models, especially when p ≫ n. This will be discussed in greater detail in the next chapter. Here we restrict ourselves to reviewing conditions that ascertain the existence of MLEs for Gaussian MRFs.
Decomposing Covariance Selection Models
Before we delve into studying geometric properties of Gaussian MRFs, let us review some results which lead to the existence of the MLE in high dimensional situations (where n < p) in a covariance selection model. These results are summarized from Lauritzen's book[11]. We start with a simple decomposition assumption, and study the existence of the MLE under that condition and its extensions.
Let the graph G = (V, E) be decomposed by (A, B, C). Hence V is the disjoint union of A, B, C; C separates A and B, meaning any path from A to B passes through C; and C is a complete subgraph of G. For a subset Q ⊆ V, let ΩQ be the submatrix corresponding to the nodes in Q, and let [ΩQ]^V and [Ω^{−1}_Q]^V = [ΣQ]^V denote the p × p matrices in which the elements of Ω and Σ, respectively, corresponding to the nodes in Q are kept intact and the rest are set to 0. Let SG and S+G denote the sets of symmetric matrices and positive definite matrices (respectively) whose off-diagonal elements corresponding to the missing edges in G are 0. We start with the following lemma.
Lemma 2.2.3. Let Ω ∈ SG, and let (A, B, C) be a disjoint partitioning of G as mentioned above. Then

Ω = [Ω_{A∪C}]^V + [Ω_{B∪C}]^V − [Ω_C]^V     (2.2.3)

and for any symmetric p × p matrix L, we have

tr(ΩL) = tr(Ω_{A∪C} L_{A∪C}) + tr(Ω_{B∪C} L_{B∪C}) − tr(Ω_C L_C).     (2.2.4)

Also, if Ω ∈ S+G, then

Ω = [(Ω^{−1}_{A∪C})^{−1}]^V + [(Ω^{−1}_{B∪C})^{−1}]^V − [(Ω^{−1}_C)^{−1}]^V     (2.2.5)

and

|Ω| = |Ω^{−1}_C| / ( |Ω^{−1}_{A∪C}| |Ω^{−1}_{B∪C}| ).     (2.2.6)
The above lemma summarizes the effect of decomposition on the precision matrix
leading to the following proposition, which is basically a sample version of the above
lemma.
Proposition 2.2.4. Consider a sample of size n from a covariance selection model given by the graph G = (V, E) decomposed by (A, B, C). The maximum likelihood estimate of the precision matrix Ω exists iff the estimates Ω̂_{[A∪C]} and Ω̂_{[B∪C]} exist, where Ω̂_{[A∪C]} and Ω̂_{[B∪C]} denote the MLEs of the concentration matrices in the two marginal models with graphs G_{A∪C} and G_{B∪C}, based on the data in the marginal samples only. If the estimates exist, then they satisfy

Ω̂ = [Ω̂_{[A∪C]}]^V + [Ω̂_{[B∪C]}]^V − n[(S_C)^{−1}]^V     (2.2.7)

|Ω̂| = |Ω̂_{[A∪C]}| · |Ω̂_{[B∪C]}| · n^{−|C|} |S_C|     (2.2.8)

Further, it holds that

Σ̂_{[A∪C]} = Σ̂_{A∪C},   Σ̂_{[B∪C]} = Σ̂_{B∪C},     (2.2.9)

ξ̂ = [ξ̂_{[A∪C]}]^V + [ξ̂_{[B∪C]}]^V − [ξ̂_{[C]}]^V.     (2.2.10)
With the aforementioned results, we study the properties of decomposable covariance selection models. Usually these covariance selection models are built up by accumulating several small saturated models through successive direct joins. This makes the model amenable to modular analysis by breaking it down into smaller saturated sub-models. Lemma 2.2.3 demonstrates the effect of the Markov properties of the graphical model on covariance matrices across decompositions of the graph. In light of the fact that (see proposition 2.17 in [11]):
Proposition 2.2.5. For an undirected graph the following two statements are equiv-
alent.
(i) The cliques of the graph can be numbered to form a perfect sequence.
(ii) The graph is decomposable.
we can come up with a sequential enumeration of the cliques of G, say C1, C2, · · · , Ck, where each combination of the subgraphs induced by Hj−1 = C1 ∪ C2 ∪ · · · ∪ Cj−1 and Cj is a decomposition. The joint density then factorizes as

f(x) = ∏_{j=1}^{k} f(x_{Cj}) / ∏_{j=2}^{k} f(x_{Sj}) = ∏_{C∈C} f(xC) / ∏_{S∈S} f(xS)^{ν(S)},

where Sj = Hj−1 ∩ Cj is the sequence of separators and ν(S) is the frequency of separator S among the Sj's. Using lemma 2.2.3 repeatedly, we get

Ω = ∑_{C∈C} [(ΣC)^{−1}]^V − ∑_{S∈S} ν(S)[(ΣS)^{−1}]^V,     (2.2.11)

|Σ| = ∏_{C∈C} |ΣC| / ∏_{S∈S} |ΣS|^{ν(S)}.     (2.2.12)
Similarly, by using proposition 2.2.4 repeatedly, we can derive an explicit formula for
the MLE of a decomposable covariance selection model. This is formalized in the
following proposition.
Proposition 2.2.6. In a decomposable covariance selection model with graph G = (V, E), the maximum likelihood estimates of the mean vector and the concentration matrix, based on a sample of size n, exist with probability one iff n > max_{C∈C} |C|, and they are given by µ̂ = x̄ and

Ω̂ = n ( ∑_{C∈C} [(SC)^{−1}]^V − ∑_{s∈S} ν(s)[(Ss)^{−1}]^V )     (2.2.13)

where C is the set of cliques of G, S is the set of separators with multiplicities ν in a sequence of cliques, and S is the same as in theorem 2.2.1.
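Formula (2.2.13) is straightforward to implement for a small decomposable graph; in the sketch below the four-node graph, its clique/separator lists and the simulated data are illustrative choices, not taken from the text:

import numpy as np

rng = np.random.default_rng(2)

def pad(M, idx, p):
    """Embed a |idx| x |idx| matrix into a p x p zero matrix, i.e. the [.]^V operation."""
    out = np.zeros((p, p))
    out[np.ix_(idx, idx)] = M
    return out

# Decomposable graph on 4 nodes with cliques {0,1,2} and {1,2,3}, separator {1,2}.
cliques = [[0, 1, 2], [1, 2, 3]]
separators = [[1, 2]]

n, p = 50, 4
x = rng.standard_normal((n, p))
xbar = x.mean(axis=0)
S = (x - xbar).T @ (x - xbar)            # same S as in theorem 2.2.1

# Equation (2.2.13): Omega_hat = n * ( sum_C [(S_C)^{-1}]^V - sum_S nu(S) [(S_S)^{-1}]^V ).
Omega_hat = n * (
    sum(pad(np.linalg.inv(S[np.ix_(C, C)]), C, p) for C in cliques)
    - sum(pad(np.linalg.inv(S[np.ix_(Q, Q)]), Q, p) for Q in separators)
)

print(np.round(Omega_hat, 3))
print(abs(Omega_hat[0, 3]) < 1e-12)      # True: the missing edge (1, 4) stays exactly zero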
Non-decomposable Covariance Selection Models
The existence of MLEs in non-decomposable covariance selection models is usually studied by looking at a decomposable “cover” of the underlying graph. This also leads to the minimum sample size needed for the existence of the MLE. This has been studied by Buhl[36] and Uhler[37]. In this section, we shall discuss the fundamental ideas, followed by the geometric principles of maximum likelihood estimation in the next section. The geometric ideas will be the key concepts in formalizing the idea of local geometry in Gaussian graphical models.
We start with some definitions. For a non-decomposable graph G = (V,E), a graph
G+ = (V,E+) is called a fill-in if G+ is decomposable and E ⊆ E+. Obviously
there are a number of choices for potential fill-ins and one aims for a minimal one.
The minimal fill-in is closely connected to the concept of a chordal graph. A graph
is a chordal graph if it contains no chordless∗ cycle of length greater than 3. For a
nonchordal graph, one can naturally define its chordal cover by constructing a fill-in
which is chordal. The notion of minimal chordal cover is useful where minimality
refers to the maximal clique size in the chordal cover. Figure 2.2.1 shows a nonchordal
graph and its chordal cover which is also a minimal chordal cover.
Figure 2.2.1: (a) A nonchordal graph and (b) its chordal cover
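The chordality notions used here are easy to experiment with; the following small sketch uses networkx (the chordless 5-cycle and the added chords are a hypothetical example, not the graph of figure 2.2.1):

import networkx as nx

# A chordless 5-cycle is nonchordal; adding two chords gives a chordal fill-in.
G = nx.cycle_graph(5)
print(nx.is_chordal(G))                   # False

G_plus = G.copy()
G_plus.add_edges_from([(0, 2), (0, 3)])   # a fill-in G+ with E a subset of E+
print(nx.is_chordal(G_plus))              # True
print(max(len(c) for c in nx.find_cliques(G_plus)))   # maximal clique size q+ = 3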
Following [37], we define the treewidth of a graph as one less than the maximal
clique size in its chordal cover. Let C and C+ denote the clique classes and q, q+ the
maximal clique sizes of G and G+ respectively, where G+ can be taken as a minimal
chordal cover of G. It is obvious that q ≤ q+. Buhl introduced the idea of matching,
∗An edge between two non-adjacent nodes of a cycle is called a chord
which is defined as follows.

Definition 2.2.7. Two indexed sets of vectors xI ⊆ R^n and x̃I ⊆ R^d are said to be matching on the graph G if x_i^T x_j = x̃_i^T x̃_j for i = j or (i, j) ∈ E.
In other words, a matching corresponds to a folding of a set of vectors into a lower or higher dimensional space such that the angles between the vectors are preserved†. Taking orthogonal transformations is one way to accomplish that. The following lemma lays out sufficient conditions for matching in a decomposable model.

Lemma 2.2.8. Let G be a decomposable model with maximal clique size q. Given xI ∈ R^n, we want to find x̃I ∈ R^d that matches xI on G. (i) If d ≥ q, this can always be done. (ii) If d > p and the indexed sets xc, c ∈ C, are linearly independent, this can be done such that x̃I is linearly independent.
For a non-decomposable graph, one needs to unfold the line bundle according to
a fill-in G+ with a clique class C+ and maximal clique size q+. This is described in
the following theorem.
Theorem 2.2.9. Let d ≥ q⁺. The MLE exists for x_I ⊆ R^n iff there exists x̃_I ⊆ R^d such that (i) x_I and x̃_I match on G and (ii) all x̃_{c⁺}, c⁺ ∈ 𝒞⁺, are linearly independent.
A consequence of lemma 2.2.8 and theorem 2.2.9 is that if n < q, then the MLE cannot exist. If n ≥ q⁺, the MLE exists with probability 1. However, no such general statement can be made when q ≤ n < q⁺. This region is henceforth denoted by U_G. Simpler nonchordal graphs like cycles and wheels[36,38,39], bipartite graphs
and small grids[37] have been studied in the literature. Buhl considered the chordless p-cycle with a sample of size n = 2 (which falls inside U_G, where G denotes the chordless p-cycle) and a p-wheel structure with a sample of size n = 3, and showed under the assumption of linear independence that the MLE exists with probability 1 − 2/(p−1)!, which is quite large for higher values of p, showing the huge advantage one gets from a regular underlying model.
†For simplicity, one can assume all vectors to be normalized
2.3 Geometric Interpretation of Existence of MLE
Given a graph G = (V,E), the space of concentration matrices respecting the
Markov property on G, denoted by K_G, is a convex cone inside the positive definite cone S⁺_p and is defined as

K_G := {K ∈ S⁺_p : K_{ij} = 0 ∀ (i, j) ∉ E}.
Taking the inverse of every matrix in K_G generates the space of covariance matrices in
the model. This happens to be an algebraic variety intersecting with the positive
definite cone S⁺_p. In a Gaussian graphical model, the matrix completion problem can
be reformulated as
Corollary 2.3.1. The MLEs Σ̂ and K̂ exist for a given sample covariance matrix S iff

fiber_G(S) := {Σ ∈ S⁺_p : Σ_G = S_G}

is nonempty, in which case fiber_G(S) intersects K_G^{−1} in exactly one point Σ̂.
Thus the MLE Σ̂ is algebraically connected to the sufficient statistic S_G, in the sense that it can be expressed as a solution to a polynomial equation in the sufficient statistic S_G. Using corollary 2.3.1, we can find the set of all sufficient statistics for which the MLE exists by projecting S⁺_p onto the edge set of the graph G. It has been
proved[40] that the cone of sufficient statistics is the convex dual to the cone of con-
centration matrices K_G. Uhler[37] investigated the existence of the MLE in the range q ≤ n < q⁺ by looking at the manifold of rank-n matrices on the boundary of S⁺_p. Its projection lies on the topological closure of the cone of sufficient statistics. Existence of the MLE is ensured if the projection lies in the interior. It was proved in [37] that the elimination ideal, obtained from the ideal of (n+1)×(n+1)-minors of a symmetric m×m matrix of unknowns by eliminating all unknowns corresponding to non-edges of the graph G, should be the zero ideal in order for the MLE to exist with probability one. This provides a sufficient condition for existence of the MLE. If the elimination
ideal is not the zero ideal, the MLE can still exist with positive probability. However,
calculation of the elimination ideal is computationally intensive and is quite hard for
a large graph. Usually small graphs are studied with the elimination criterion and
joined using clique sums to build larger graphs.
2.4 Local Geometry as Structure Constraint
Motivated by the significant implications of imposing structural constraints for
efficient estimation of MLE with under-sampling, we claim that structure learning
in Gaussian graphical models benefits from incorporating the local geometry in the
analysis in a judicious manner. In this section, we shall describe a few ways to do
that. The principal idea is to hypothetically split the neighborhood into two groups
- the “local” neighbors and the “non-local” neighbors.
In most of the practical problems, the underlying spatial geometry will be known
to the statistician, and in most situations it is going to be a regular graph like a one-, two- or three-dimensional lattice, triangles, or other combinations
of cliques of smaller size. To generalize the notion of this local structure we introduce
a graph Glocal that may or may not be a part of the global neighborhood system. We
call it a local neighborhood graph. This graph is solely determined by external factors
and it is not, by any means, affected by the data. Given a node of a graph G, all its
neighbors in Glocal are defined as the local neighbors of that node.
The local constancy property, as defined in Honorio's paper, can easily be assimilated into this more formal setting as follows: if Xa is (in)dependent of Xb, then a local neighbor of Xa (as defined above) is expected to be (in)dependent of Xb. However, the so-called "likeliness" behavior of local neighbors needs to be formalized. In the
remaining portion of this chapter, we propose two alternative ways to more formally
quantify the definition of local constancy.
Before we formally define the local constancy, we should carefully define the dif-
ference matrix. We consider an m × p matrix D, where m is the total number
of pairs of local neighbors. Assuming that we have a labelled sequence of nodes Γ(n) = {1, 2, ..., p(n)}, arrange the pairs of local neighbors in a sequence

B := {(u, v_u) : v_u ∈ lne(X_u), v_u > u, u ∈ Γ(n)}
where lne(Xu) denotes the set of local neighbors of Xu. This particular ordering
chosen here does not influence the results below. Note that B is nothing but an
alternative sequencing of edges in Glocal, and hence B is topologically equivalent with
Glocal. The inequality vu > u is included to avoid double counting. Each pair is
then represented by a row in the difference matrix D. The kth row of D is given by D_{k,·} = e_i − e_j, where (i, j) is the kth element of the local neighbor sequence and e_i, e_j denote the canonical basis vectors of R^p whose 1's occur at the ith and jth positions, respectively. We denote by D^a the m_a × p sub-matrix of D obtained by selecting all the rows whose ath entry is 0. In other words, D^a is the difference matrix corresponding to all the local neighbor pairs not involving X_a. The number of such pairs in Γ(n) \ {a} is m_a. It should be noted that throughout our discussion we shall assume that the
local neighborhood structure is known to us, meaning that D is known beforehand
and it does not depend on the data. Also the following definitions are helpful.
Definition 2.4.1 (Zero-Operator). Given a matrix A, its zero operator is defined as
Z(A) := (I(A_{ij} = 0))_{i,j}
where I is the indicator function.
Definition 2.4.2 (Diagonal-Excluded Product). Given two matrices A and B, the
diagonal-excluded product is given by

A ⊛ B := Z(A) ∘ (AB)

where ∘ denotes the Hadamard product of matrices. Although the name does not clearly show how diagonals are removed from the product, for the types of matrices we will deal with this operation eventually leads to the exclusion of the diagonal of the matrix B.
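As a quick illustration, the two definitions above translate directly into code; the Python sketch below is just a transcription of definitions 2.4.1 and 2.4.2, with the Hadamard product written as elementwise multiplication.

import numpy as np

def zero_operator(A):
    # Z(A): indicator of the zero entries of A (definition 2.4.1)
    return (A == 0).astype(float)

def diag_excluded_product(A, B):
    # Z(A) ∘ (A B), with ∘ the Hadamard product (definition 2.4.2)
    return zero_operator(A) * (A @ B)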
2.4.1 Quantitative Definition
One way to define the local constancy property is to impose a bound on the
norm of the difference of local neighbors. We say a model exhibits (ε, G_local, l_p)-local constancy if

‖ D ⊛ Ω ‖_p < ε
where ‖ · ‖_p denotes the l_p norm. For our purpose, the best candidate is the l_1 norm, for certain desirable properties like sparsity and convexity. However, a different
choice of norm would lead to a different solution which might be more appropriate
for other types of problems. With p = 1, this definition coincides with Honorio’s
local penalty criterion.
It should be noted that the ε parameter controls the degree of local constancy, so it is closely related to the tuning parameter we shall use in the next chapter. It serves as a constraint on certain matrix parameters that we are trying to estimate. A small ε would ensure a high level of local constancy, whereas a large ε would do the opposite. This is exactly analogous to any penalized regression problem.
2.4.2 Bayesian Perspective
Since the locality information is being used here as prior information, it makes sense to interpret it from a Bayesian perspective. Let π(Ω) denote a probability distribution on the space of positive definite matrices. Then one can say that an (ε, δ, G_local, l_p)-local constancy property holds for this model if

P(‖ D ⊛ Ω ‖_p < ε) > 1 − δ
Here we need to assume that δ < 1. The ε parameter controls the degree of local
constancy and the δ parameter is indicative of our prior belief about the local con-
stancy.
It should be noted that the non-zero entries of the matrix D ⊛ Ω consist of the differences of elements of Ω corresponding to local neighbors. For example, if

D = ( 1  −1   0
      0   1  −1 )

and

Ω = ( ω_11  ω_12  ω_13
      ω_21  ω_22  ω_23
      ω_31  ω_32  ω_33 ),

then

D ⊛ Ω = ( 0             0    ω_13 − ω_23
          ω_21 − ω_31   0    0           ).
In this thesis, I will pursue the first definition of local constancy. In the following chapter, I shall describe a method of model selection using penalized regression. The penalty
term will be derived from the above definition. Upper bounds of norm constraints
play an important role in determining the order of the regularization parameter in the
equivalent optimization process. I shall show by simulation and theoretical justifica-
tion that with proper order of the regularization parameters, our proposed method
works better than traditional approaches.
Chapter 3
Estimation of Locally Constant
Gaussian Graphical Model Using
Neighborhood-Fused Lasso
3.1 Introduction
In this chapter, we extend and subsequently analyze the idea propounded by Meinshausen and Buhlmann for structure learning in Gaussian graphical models, where we incorporate the additional structural constraint introduced in the last chapter as local constancy. It is well known that for a p-dimensional multivariate Gaussian random variable X with nonsingular covariance matrix Σ, conditional independence can be represented by the occurrence of zeros at the corresponding entries of the precision matrix
Ω. The neighborhood nea of a node a is defined as the collection of all nodes that
are conditionally dependent on a. The neighborhood selection algorithms aim to find
all the neighbors of a node Xa, given n i.i.d. observations of X. Meinshausen &
Buhlmann[65] showed that this problem could be interpreted as an ensemble of l1-penalized regressions, which could be solved using the lasso algorithm[66]. In a relatively recent paper, Honorio, Ortiz & Samaras[23] proposed local constancy of spatially close nodes as prior information for learning in Gaussian graphical models. In this
chapter, I will assume that the global conditional dependence neighborhood graph of
the underlying Gaussian distribution consists of the following subgraphs: (a) a local graph that is topologically close to disconnected or weakly connected clusters of regular graph objects like chains, cycles, lattices or cliques; and (b) a non-local graph whose edges connect nodes of two different spatial clusters. Moreover, it will also be assumed in this chapter that the regular graph object that governs the local structure depends only on domain knowledge and hence is known beforehand. Adaptive selection of the local neighborhood is proposed as a future research direction and has not been dealt with in our analysis. The locality information is critical when data
is observed in a certain manifold with spatial geometry. The assumption of local
constancy, in some sense, enforces spatial regularization on structure learning and
thereby stimulates the search for probabilistic dependencies between local clusters of
nodes.
We propose to add a new penalty term which, in essence, generalizes the fused
lasso penalty[67] to penalize the differences of spatially close nodes, which are determined by the geometry of the particular problem and are not chosen adaptively.
It is this additional penalty term which takes care of the local regularization and
therefore its choice is intertwined with the regular local graph structure we choose
to deal with. This leads us to propose neighborhood-fused lasso, a model selection
method for locally constant Gaussian graphical models. This helps us to estimate the
probabilistic connectivity among distant clusters of nodes using the spatial informa-
tion. We also show by simulation and by studying the theoretical properties of our
proposed method that neighborhood-fused lasso outperforms other competing model
selection algorithms where locality information is ignored. Both finite sample and
large sample properties of our estimator have been studied, leading to several inter-
esting findings. We prove theoretically that introducing a local penalty term reduces
the finite sample type-I error probability in model selection and leads to equivalent
accuracy as standard methods with smaller sample size. Also, our method is not
computationally intensive since we provide data dependent choices of the regulariza-
tion parameters instead of cross-validation based methods. We study the asymptotic
l1 properties of our estimator to find sufficient conditions on design matrix and regu-
larization parameters to ensure asymptotic consistency in parameter estimation and
prediction in terms of l1 and l2 metric, respectively.
3.2 A Review of Related Works
Going back to Dempster[68] who introduced Covariance Selection to discover the
conditional independence restrictions (the graph) from a set of i.i.d. observations,
many methods have been proposed for sparse estimation of the precision matrix in a
Gaussian graphical model. Covariance selection traditionally relies on optimization
of an objective function[11, 69]. Modern technological developments and top-notch
computing power enable us to deal with high dimensional models. Of course, there
still are computational challenges. Exhaustive search is often infeasible for high-
dimensional models. Usually, greedy forward-selection or backward-deletion search
is used. In forward(backward) search, one starts with the empty (full) set and adds
(deletes) edges iteratively until a suitable stopping criterion is fulfilled. The selec-
tion (deletion) of an edge requires an MLE fit[70] for O(p²) many models, making
it a suboptimal choice for high-dimensional models where p is large. Also, the MLE
might not exist in general for p > n [36]. In contrast, neighborhood selection with the
lasso, proposed by Meinshausen and Buhlmann, relies on optimization of a convex
function, applied consecutively to each node in the graph, thus fitting O(p) many
models. Fast lasso-type algorithms and data dependent choices for regularization
parameter reduce the computational cost. Unlike covariance selection, this algorithm
estimates the dependency graph by sequential estimation of individual neighbors and
subsequent combination by union or intersection. Theoretical analysis shows that the
choice of union or intersection does not matter asymptotically. Other authors have
proposed algorithms for the exact maximization of the l1-penalized log-likelihood.
Yuan & Lin[71] proposed an l1-penalty on the off-diagonal elements of the concentra-
tion matrix for its sparse estimation with the positive definiteness constraint. They
showed that this problem is similar to the maxdet problem in Vandenberghe et al.[72],
and thus solvable by the interior point algorithm. They also took a nonnegative gar-
rote[73] type approach. A quadratic approximation to the objective function in their
proposed method leads to a solution similar to Meinshausen & Buhlmann. Banerjee
et al.[74] viewed this as a penalized maximum likelihood estimation problem with the
same l1-penalty on the concentration matrix. Constructing the dual transforms the
problem into sparse estimation of the covariance matrix instead of the concentration
matrix. They proposed a block coordinate descent algorithm to solve this efficiently for
large values of p. They also showed that the dual of the quadratic objective func-
tion in the block-coordinate step can be interpreted as a recursive l1-penalized least
square solution. Friedman, Hastie & Tibshirani used this idea successfully to develop
a new algorithm, known as graphical lasso[75]. Using a coordinate descent approach
to solve the lasso problem speeds up the algorithm to a considerable extent, making
it quite fast and effective for a large class of high-dimensional problems.
In all the aforementioned methods, the local geometry (of the manifold valued data)
is not taken into consideration. Oftentimes we come across data that are measured on a certain manifold, e.g., data describing some feature on the outline of a (moving) silhouette, pixels in an image, or voxels in a body organ like the brain or heart. In most
of these problems, spatially close variables have a structural resemblance in terms of
probabilistic dependence. Considering this local behavior might lead to faster and/or
more efficient estimation. Honorio, Ortiz, Samaras et al.[23] introduced the notion of
local constancy. According to them, if one variable Xa is conditionally (in)dependent
of Xb, then a local neighbor of Xa in that manifold is likely to be conditionally
(in)dependent of Xb as well. They developed a coordinate direction descent algorithm
to solve a penalized MLE problem which penalizes both the l1-norm of the precision
matrix and the l1-norm of local differences of the precision matrix, expressed as its
“diagonal excluded product” with a local difference matrix. Local geometry has been addressed, although quite implicitly, by Tibshirani et al.[67], who assumed that
the underlying parameters in a linear model have a natural ordering. Apart from
being sparse, some of the local parameters could also be fused together. They penal-
ized both the l1-norm of the parameter and the l1-norm of successive differences and
developed a modified lasso algorithm, known as fused lasso. Chen et al.[76] used this
idea and proposed graph guided fused lasso for structure learning in multi-task regres-
sion, where the output space is continuous and the outputs are related by a graph.
The input space is high dimensional and outputs that are connected by an edge in
the graph are believed to share a common set of inputs. The goal then is to learn
the underlying functional map from the input space to the output space in a way
that respects the similar sparsity pattern among the covariates that are believed to
affect the output variables that are connected. Local smoothing has been discussed
in Kovac and Smith[77] in order to perform a penalized nonparametric regression
where the differences of neighboring nodes were penalized. Their algorithm aims to
split the image into active sets and subsequently merge, split or amalgamate them in
order to minimize a penalized weighted distance.
3.3 Neighborhood Selection Using Fused Lasso
Like Greenshtein & Ritov[78] and Meinshausen & Buhlmann[65] we work in a
set-up where the number of nodes in the graph p(n) and the covariance matrix
Σ(n) depend on the sample size. Consider the p(n)-dimensional multivariate ran-
dom variable X = (X1, · · · , Xp) ∼ N(µ,Σ). The conditional (in)dependence struc-
ture of this distribution can be represented by the graph G = (Γ(n), E(n)), where Γ(n) = {1, ..., p(n)} is the set of nodes corresponding to the coordinate variables and E(n) ⊆ Γ(n) × Γ(n) is the set of edges. A pair of nodes (a, b) ∈ E(n) if and only if X_a is conditionally dependent on X_b, given all the remaining variables X_{Γ(n)\{a,b}} = {X_k : k ∈ Γ(n)\{a, b}}. E^c(n) contains all pairs which are conditionally independent, given all the remaining variables.
The neighborhood ne_a of a node a is defined as the smallest subset of Γ(n) \ {a} such that, given ne_a, X_a is conditionally independent of all the remaining nodes. In other words, the neighbors of a certain node consist of the coordinates that are conditionally dependent on that particular node. We already know that in a p-dimensional multivariate Gaussian random variable X with nonsingular covariance matrix Σ, the
conditional independence can be represented by the occurrence of zeros at the corresponding cells of the precision matrix Ω, i.e., X_a ⊥ X_b | X_{Γ\{a,b}} iff Ω_{ab} = (Σ^{−1})_{ab} = 0. The
neighborhood selection / model selection algorithms aim to find all the neighbors
of a node Xa, given n i.i.d. observations of X. Meinshausen & Buhlmann showed
that this problem could be interpreted as a penalized regression problem, where each
variable is regressed on the remaining variables with an l1 penalty on the estimated
coefficients.
By the above definition of neighborhood, we have, for all a ∈ Γ(n), where Γ(n) is the set of all nodes,

X_a ⊥ {X_k : k ∈ Γ \ ({a} ∪ ne(a))} | X_{ne(a)}.

An alternative definition of the neighborhood of X_a is given by the non-zero components of θ^a, where θ^a is given by

θ^a = argmin_{θ: θ_a = 0} E ( X_a − ∑_{k∈Γ} θ_k X_k )².   (3.3.1)

In light of the above definition, the set of neighbors of a node a ∈ Γ(n) can be written as

ne(a) = {b ∈ Γ(n) : θ^a_b ≠ 0}.   (3.3.2)
Given the domain knowledge we construct a regular graph Glocal that is representative
of the underlying spatial geometry. Glocal could be a chain graph, a two or three
dimensional lattice or more generally a collection of cliques or wheels. Following this,
the local neighbors are defined as follows.
Definition 3.3.1. Xa and Xb are local neighbors if the edge connecting them belongs
to a predefined graph Glocal.
Note that the edge between Xa and Xb need not belong to E(n). Now if one
incorporates the local constancy property to this graph, the neighborhood becomes
more structured. By local constancy, if Xb is conditionally independent of Xa given
the other nodes (and hence, there is no edge between a and b), it is likely that a local
neighbor X_{b′} of X_b is also conditionally independent of X_a, making both θ^a_b and θ^a_{b′} equal to zero. Thus, the zeros of θ^a are spatially close.
Fused Lasso, proposed by Tibshirani et al., addresses problems where the parameters
could be “ordered” in some sense. The traditional fused lasso penalizes the l1-norm
of both the coefficients and their successive differences. Thus it encourages sparsity
of the coefficients and also sparsity of their differences - i.e. local constancy of the
coefficient profile. We modify the traditional fused LASSO algorithm to satisfy our
purpose of estimating the zeros and nonzeros of the precision matrix with a spatially
similar occurrence pattern. It is therefore natural to penalize the differences between
spatial neighbors of a graph. In the next couple of paragraphs we explain how to
do that. In the next section we write down a few assumptions that are similar to Mein-
shausen and Buhlmann’s but we need a few additional assumptions to incorporate
the locality information. Our theoretical analysis reveals that with these assumptions
on sparsity, model complexity and our added assumption on local neighborhood spar-
sity, the fused LASSO estimate will furnish all the desired finite-sample properties
like sign-consistency and model selection consistency.
The neighborhood-fused LASSO estimate θ^{a,λ,µ} of θ^a is given by

θ^{a,λ,µ} = argmin_{θ: θ_a = 0} ( n^{−1} ‖X_a − Xθ‖² + λ‖θ‖₁ + µ‖D^a θ‖₁ )   (3.3.3)

where ‖x‖₁ denotes the l₁-norm of x. Penalizing both the l₁-norms of θ and D^a θ implies parsimony, thus ensuring sparsity and local constancy at the same time.
This property helps us in variable selection and thereby leads to neighborhood selec-
tion. Note that we are referring to both the non-local and the local neighbors. The
neighborhood estimate of node a is defined by the nodes corresponding to non-null
coefficients when Xa is regressed on the rest of the variables with the fused lasso
penalty in (3.3.3); in other words,

ne^{λ,µ}_a = {b ∈ Γ(n) \ {a} : θ^{a,λ,µ}_b ≠ 0}   (3.3.4)

where θ^{a,λ,µ}_b denotes the b-th component of the vector θ^{a,λ,µ}. It is clear that the
selected neighborhood depends on the value of λ and µ chosen. Large values of λ
and µ will give rise to more sparse solutions. Usually the regularization parameters
are chosen by some cross-validation criterion. But in this chapter we derive a data-driven approach for selecting the regularization parameters that speeds up computation and ensures asymptotic consistency in model selection. Meinshausen and Buhlmann
derived a data driven choice for λ. We will extend their method for simultaneous
selection of λ and µ from the data.
3.3.1 Assumptions
In order to prove consistency of our method for Gaussian graphical models, we need to work with a few assumptions. Most of the assumptions are the same as Meinshausen and Buhlmann's (see section 2.3 in [65]). However, we need two additional assumptions, [A4] and [A8], to deal with the local constancy. Even though the assumptions are similar, we write them down for the sake of completeness and for quick reference to the notation.
[A1] Dimensionality: ∃ γ > 0 such that p(n) = O(n^γ) as n → ∞. Note that γ > 1 is allowed, thus permitting p ≫ n.
[A2] For all a ∈ Γ(n) and n ∈ N, Var(X_a) = 1. There exists v² > 0 so that for all n ∈ N and a ∈ Γ(n), Var(X_a | X_{Γ(n)\{a}}) ≥ v². This means that all the conditional variances are bounded away from 0.
[A3] Sparsity: There exists some 0 ≤ κ < 1 so that max_{a∈Γ(n)} |ne_a| = O(n^κ) for n → ∞.
[A4] Local Neighborhood Sparsity: The number of local neighbors can also grow at a polynomial rate in n, i.e., ∃ β₀ ≥ 0 such that the maximum number of local neighbors of a node satisfies max_{a∈Γ(n)} |ne_l(a)| = O(n^{β₀}) for n → ∞. For convenience we let K > 0 be such that

2 + max_{a∈Γ(n)} |ne_l(a)| < K n^{β₀}.
[A5] l₁-Boundedness: There exists some ϑ < ∞ so that for all neighboring nodes a, b ∈ Γ(n) and all n ∈ N, ‖θ^{a, ne_b\{a}}‖₁ ≤ ϑ.
[A6] Magnitude of partial correlations: There exists a constant δ > 0 and some ξ > κ, with κ as in [A3], so that for every (a, b) ∈ E, |π_{ab}| ≥ δ n^{−(1−ξ)/2 + β₀}, where π_{ab} denotes the partial correlation between X_a and X_b.
[A7] Neighborhood Stability: Define S_a(b) = ∑_{k∈ne_a} sgn(θ^{a,ne_a}_k) θ^{b,ne_a}_k. There exists some δ₁ < 1 so that for all a, b ∈ Γ(n) with b ∉ ne_a, |S_a(b)| < δ₁.
[A8] Local Neighborhood Stability: Let us define L_a := {k : ((D^a)′ sgn(D^a θ^a))_k ≠ 0} and T_a(b) := ∑_{k∈L_a} [(D^a)′ sgn(D^a θ^a)]_k θ^{b,L_a}_k. There exists some δ₂ < 1 so that for all a, b ∈ Γ(n) such that b ∉ L_a, |T_a(b)| < δ₂ ‖D^a_{·b}‖₁, where D^a_{·b} denotes the bth column of D^a.
Similar to Meinshausen and Buhlmann's interpretation, we can describe an intuitive condition which implies the last two assumptions. Define

θ^a(η₁, η₂) = argmin_{θ: θ_a = 0} E ( X_a − ∑_{k∈Γ(n)} θ_k X_k )² + η₁‖θ‖₁ + η₂‖D^a θ‖₁.

According to the characterization of ne_a derived from (3.3.2), ne_a = {k ∈ Γ(n) : θ^a_k(0, 0) ≠ 0}. One can think of a two-dimensional perturbation approach in which one tweaks the parameters η₁ and η₂ in a way that the perturbed neighborhood ne_a(η₁, η₂) = {k ∈ Γ(n) : θ^a_k(η₁, η₂) ≠ 0} is identical to the original neighborhood
nea(0, 0). The following proposition shows that the two assumptions of neighbor-
hood stability are fulfilled under this situation. The terms Sa(b) and Ta(b) measure
the sub-gradient of the lasso penalty and the neighborhood-fused lasso penalty re-
spectively. Having a small l1 bound on them enforces the stability of the estimated
coefficients, and hence stability of neighbors. See section 3.8 for proof of the propo-
sition.
Proposition 3.3.2. If there exist some η₁ > 0, η₂ > 0 such that ne_a(η₁, 0) = ne_a(0, η₂) = ne_a(0, 0), then |S_a(b)| ≤ 1 and |T_a(b)| ≤ ‖D^a_{·b}‖₁. Moreover, ne_a(η₁, η₂) = ne_a(0, 0).
3.4 Optimization method
Before we describe our optimization method, we quickly review the existing al-
gorithms used for optimization of fused lasso problems and some of its generalized
versions. Broadly speaking, there are two fundamental classes of algorithms used in this type of problem: (a) solution path algorithms, which find the entire solution path for all values of the regularization parameters, and (b) approximation algorithms, which attempt to solve a large-scale optimization for a fixed set of regularization parameters using first order approximations.
Friedman et al.[75] formulated the path-wise optimization method for the stan-
dard fused lasso signal approximation problem where the design matrix X = I. The
algorithm is two-step, and the final solution is obtained by soft-thresholding the total-variation-norm-penalized estimate obtained in the first step. The basic chal-
lenge in applying the coordinate descent algorithm to a fused lasso problem is the
non-separability of the total variation penalty unlike usual lasso where the l1-penalty
is completely separable. So they used a modified coordinate descent approach where
the descent step was followed by an additional fusion and smoothing step. However,
this method works only for the total variation penalty and does not extend to fused
lasso regression problems with generalized fusion penalty like our situation. Hoe-
fling[79] proposed a path-wise optimization algorithm that worked for generalized
fusion penalty in fused lasso signal approximation problem. This algorithm uses the
fact that the sets of fused coefficients do not change except for finitely many values
of the fused penalty parameter. Tibshirani and Taylor[80] proposed another path al-
gorithm for the fused lasso regression problem with a generalized fused lasso penalty
by solving the dual optimization problem. However, the path algorithm that they de-
vised can be applied only when the design matrix has full rank and is hence not applicable to high dimensional problems. To resolve the problem when the matrix is not of full rank, they proposed adding an infinitesimal perturbation ε‖β‖₂. However, this does not solve the problem completely, as a small ε leads to ill-conditioning, and an increasing number of rows in the generalized fused penalty matrix leads to an inefficient solution because of the increasing number of dual variables.
Approximation algorithms were developed to find efficient solutions to general
fused lasso problem with a fixed set of penalty parameters, regardless of its rank
and they usually adapt themselves to high dimensional problems quite easily. Usu-
ally these approximation algorithms are based on first order approximation type
methods like gradient descent. Liu et al.[81] proposed the efficient fused lasso algo-
rithm (EFLA), which solves the standard fused lasso regression problem by replacing the quadratic error term in the objective function by its first order Taylor expansion at an approximate solution, followed by an additional quadratic regularization term.
The approximate objective function has a fused lasso signal approximation form and
can be solved by applying gradient descent on its dual which is a box-constrained
quadratic program. Chen et al.[82] proposed the smoothing proximal gradient method
to solve regression problem with structured penalties which closely resemble our ob-
jective function. The basic idea is to approximate the fused penalty term ‖ Dβ ‖1 by
a smooth function and devise an iterative scheme for an approximate optimization.
The smooth function used here is given by

Ω(β, t) = max_{‖α‖_∞ ≤ 1} ( α^T D β − (t/2) ‖α‖₂² ).
Convexity and continuous differentiability of this function follow from Nesterov[83]. One may now proceed using a standard first order approximation approach like FISTA[84], which works like EFLA. However, this process is computationally intensive and definitely not a feasible approach in our case, where we need to do a nodewise regression.
The other algorithm which tries to solve a similar problem is the Split Bregman algorithm[85], which was proposed for standard fused lasso regression and later extended to generalized fused lasso regression. The SB algorithm is derived from the augmented Lagrangian[86, 87], which adds quadratic penalty terms to the penalized objective function and alternately solves the primal and the dual starting from an initial estimate. This is also computationally intensive unless one has a simple structural constraint on the parameters.
We propose a different algorithm to solve our optimization problem. We show
that the neighborhood-fused lasso problem can be reparametrized into a standard
lasso problem, thus simplifying the optimization procedure.
The neighborhood-fused LASSO estimate θ^{a,λ,µ} of θ^a can be written as

θ^{a,λ,µ} = argmin_{θ: θ_a = 0} ( (1/n) ‖X_a − Xθ‖² + λ‖θ‖₁ + µ‖D^a θ‖₁ )
          = argmin_{θ: θ_a = 0} ( (1/n) ‖X_a − Xθ‖² + λ ( ‖θ‖₁ + (µ/λ) ‖D^a θ‖₁ ) )
          = argmin_{θ: θ_a = 0} (1/n) ‖X_a − Xθ‖² + λ ‖ [ I ; (µ/λ) D^a ] θ ‖₁   (3.4.1)
Letting G_a = [ I ; (µ/λ) D^a ] (the matrix obtained by stacking the identity on top of (µ/λ) D^a) and ω = G_a θ, we define

ω^{a,λ,µ} = argmin_ω ( (1/n) ‖X_a − X G_a^+ ω‖² + λ‖ω‖₁ )   (3.4.2)

where G_a^+ is the Moore-Penrose inverse of G_a. We find θ^{a,λ,µ} from the following relation (see lemma 3.4.1 for details):

θ^{a,λ,µ} = G_a^+ ω^{a,λ,µ}.   (3.4.3)

We note that since G_a has full column rank, G_a^+ has full row rank. Thus, the following lemma can be applied here.
Lemma 3.4.1. Let

β̂ := argmin_{β∈R^k} ‖y − Xβ‖² + λ‖Gβ‖₁

and

ω̂ := argmin_ω ‖y − XG⁺ω‖² + λ‖ω‖₁.

Also assume that G is of full column rank. Then

β̂ = G⁺ω̂.
One interesting observation here is that although we started with the assumption that ω = G_a θ ∈ C(G_a), where C(G_a) denotes the column space of G_a, we did not optimize over C(G_a) but over all ω. See the proof of lemma 3.4.1 for details. The heuristic idea behind this is that we find the minimizer on C(G_a) by projecting the global minimizer onto C(G_a), and the projection operator is given by G_a G_a^+. The objective function in (3.4.2) combines the two l₁ penalties into a single l₁ penalty and thus can easily be solved by any of the standard LASSO algorithms. The parameters λ and µ are chosen according to theorem 3.5.6. We show by several simulations that the proposed method performs better than Meinshausen-Buhlmann's method or the graphical lasso in situations where local constancy holds.
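The reparametrization above is easy to implement on top of any off-the-shelf lasso solver. The following Python sketch (using numpy and scikit-learn's Lasso; it is an illustration under these assumptions, not the code used for the simulations) regresses node a on the remaining variables with the neighborhood-fused penalty and returns the estimated neighborhood. Note that scikit-learn scales the squared error by 1/(2n), hence the factor λ/2.

import numpy as np
from sklearn.linear_model import Lasso

def nfl_neighborhood(X, a, D, lam, mu, tol=1e-8):
    # X: n x p data matrix, D: m x p local difference matrix, lam, mu: penalties
    n, p = X.shape
    others = [j for j in range(p) if j != a]
    Xa, Xrest = X[:, a], X[:, others]
    Da = D[D[:, a] == 0][:, others]                    # rows of D not involving node a
    Ga = np.vstack([np.eye(p - 1), (mu / lam) * Da])   # G_a = [I ; (mu/lam) D^a]
    Ga_pinv = np.linalg.pinv(Ga)                       # Moore-Penrose inverse G_a^+
    # standard lasso in the reparametrized design X G_a^+, cf. (3.4.2)
    fit = Lasso(alpha=lam / 2.0, fit_intercept=False).fit(Xrest @ Ga_pinv, Xa)
    theta = Ga_pinv @ fit.coef_                        # theta = G_a^+ omega, cf. (3.4.3)
    return {others[j] for j in range(p - 1) if abs(theta[j]) > tol}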
3.5 Asymptotics of Graphical Model Selection
From the discussion in the preceding sections, it can be seen that using the neighborhood-fused lasso
leads to efficient model selection when applied successively to all the nodes. In this
section, we show that our proposed method leads to asymptotically consistent model
selection similar to the procedure by Meinshausen and Buhlmann. The choice of the
regularization parameters is crucial in such cases. Moreover, we show that our pro-
posed choice of regularization parameters not only ensures convergence to the “true”
model but that the convergence is faster than Meinshausen-Buhlmann's method when
the underlying model is indeed locally constant. This property is also supported by
our simulation results shown in section 3.7.
We start with a lemma on the subdifferential conditions of our objective func-
tion. For proof, see section 3.8.
Lemma 3.5.1. Given θ ∈ R^{p(n)}, let G(θ) be a p(n)-dimensional vector with elements G_b(θ) = −(2/n)⟨X_a − Xθ, X_b⟩. Define

L_a(θ) = {b : [(D^a)′ sgn(D^a θ)]_b ≠ 0}.

A vector θ is a solution to the fused LASSO problem described above iff, for every b,

G_b(θ) = λ sgn(θ_b) + µ [(D^a)′ sgn(D^a θ)]_b   when θ_b ≠ 0, b ∈ L_a(θ),
λ sgn(θ_b) − µ ‖D^a_{·b}‖₁ ≤ G_b(θ) ≤ λ sgn(θ_b) + µ ‖D^a_{·b}‖₁   when θ_b ≠ 0, b ∉ L_a(θ),
−λ + µ [(D^a)′ sgn(D^a θ)]_b ≤ G_b(θ) ≤ λ + µ [(D^a)′ sgn(D^a θ)]_b   when θ_b = 0, b ∈ L_a(θ),
|G_b(θ)| ≤ λ + µ ‖D^a_{·b}‖₁   otherwise.
This lemma builds the foundation of several of the following results and will be
used frequently to prove them. Sign consistency is one of the major properties that a
potentially good model selection method should exhibit. Before we study the model
selection consistency of our estimators, we show, in the following lemma that the
neighborhood-fused lasso estimator is sign-consistent.
Lemma 3.5.2. Let θ^{a,λ,µ} be defined for all a. Under assumptions A1-A7, there exists some c > 0 such that for all a, P(sgn(θ^{a,λ,µ}_b) = sgn(θ^a_b) ∀ b ∈ ne_a) = 1 − O(exp(−c n^ε)).
Observe that this lemma, in turn, preserves the asymptotic equality of the signs of local neighbors. If X_b and X_{b′} are local neighbors, then it is likely, most of the time,
that when one regresses Xa on the remaining variables, the coefficients of Xb and Xb′
will have the same sign. This is a direct consequence of local constancy of estimated
regression coefficients.
Our results show that, just as for Meinshausen and Buhlmann, a rate slower than n^{−1/2} is necessary for the regularization parameters for consistent model selection in the high dimensional case, where the dimension may increase as a polynomial in the sample size. Specifically, if λ decays as n^{−(1−ε)/2} and µ decays as n^{−(1−ε)/2 − β₀}, where 0 < β₀ < κ < ε < ξ are as in assumptions A1-A8, the estimated neighborhood is
almost surely contained in the true neighborhood. Hence the type-I error probability
goes to 0. This is formally stated in the following theorem (see section 3.8 for a
proof).
Theorem 3.5.3. Let assumptions A1-A8 be fulfilled. Let the penalty parameters satisfy λ_n ∼ d₁ n^{−(1−ε)/2} and µ_n ∼ d₂ n^{−(1−ε)/2 − β₀} with some β₀ ≥ 0, 0 < κ < ε < ξ and d₁, d₂ > 0. Then there exists some c > 0 such that for all a ∈ Γ(n),

P(ne^{λ,µ}_a ⊆ ne_a) = 1 − O(exp(−c n^ε)).
The assumptions of neighborhood stability and local neighborhood stability are
not redundant. The following proposition shows that one can not relax the assump-
tions A7 and A8.
Proposition 3.5.4. If there exist some a, b ∈ Γ(n) with b ∉ ne_a such that |S_a(b)| > 1 and |T_a(b)| > ‖D^a_{·b}‖₁, then for λ_n, µ_n as in theorem 3.5.3, P(ne^{λ,µ}_a ⊆ ne_a) → 0 as n → ∞.
On the other hand, with the same set of assumptions, one can show that the type II error probability, i.e., the probability of failing to identify a true edge,
exponentially goes to 0. This has been formally stated and proved in the following
theorem (see section 3.8 for a proof).
Theorem 3.5.5. With all the assumptions of theorem 3.5.3 and λ, µ as before,

P(ne_a ⊆ ne^{λ,µ}_a) = 1 − O(exp(−c n^ε)).
On Model Selection in Graphs
It follows from the discussion so far that consistent estimation of nodes is possible
using our approach. However, one of the original goals of our method is to estimate
the entire underlying graphical model, which can be accomplished by combining the
estimated neighborhoods in some way. Two different methods of combination have
been proposed:

E^{λ,µ,∨} := {(a, b) : a ∈ ne^{λ,µ}_b ∨ b ∈ ne^{λ,µ}_a}   (Union)

E^{λ,µ,∧} := {(a, b) : a ∈ ne^{λ,µ}_b ∧ b ∈ ne^{λ,µ}_a}   (Intersection)
In our simulations, we combined them following the union method. It has been
observed that the differences vanish asymptotically when a regular lasso is applied.
Although we did not study this theoretically for our case, our experiments indicate that the two combinations do not differ much asymptotically.
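For completeness, a small sketch of the two combination rules in Python (the function name and data layout are illustrative only):

def combine_neighborhoods(ne, p, rule="union"):
    # ne: dict mapping each node a to its estimated neighborhood (a set of nodes)
    edges = set()
    for a in range(p):
        for b in range(a + 1, p):
            in_a, in_b = b in ne[a], a in ne[b]
            if rule == "union" and (in_a or in_b):            # E^{lambda,mu,v}
                edges.add((a, b))
            elif rule == "intersection" and (in_a and in_b):  # E^{lambda,mu,^}
                edges.add((a, b))
    return edges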
Finite Sample Choices for Penalty Parameter
Theoretical exploration in the asymptotic domain does not cast much light on the choice of regularization parameters in a real-life, finite-sample problem. It is hard to ensure consistency or absolute containment as in theorems 3.5.3 or 3.5.5. However,
following the idea proposed by Meinshausen and Buhlmann, one can consider the
connectivity component of a node, which is defined as the set of nodes which are
connected to it through a chain of edges. Generalizing the results from Meinshausen
& Buhlmann, it has been shown in the following theorem (proof in section 3.8)
that the estimated connectivity component derived from neighborhood-fused lasso
estimate will belong to the true connectivity component with probability (1−α), for
any chosen level of α ∈ (0, 1).
Theorem 3.5.6. Under assumptions A1-A8 and with the following choices of the penalty parameters,

λ = (σ̂_a / √n) Φ̃^{−1}( α / (2 p(n)²) ),    µ = (σ̂_a / (K n^{β₀ + 1/2})) Φ̃^{−1}( α / (2 p(n)²) ),

we have

P( ∃ a ∈ Γ(n) : C^{λ,µ}_a ⊄ C_a ) ≤ α

for all n. Here C_a and C^{λ,µ}_a are the true and estimated connectivity components of a, K and β₀ are certain constants, and Φ̃ = 1 − Φ.
The choice of K and β₀ depends on the rate of growth of the local neighborhood with increasing sample size. In our simulations, we found that for models with constant dimension and increasing sample size, K = 1 and β₀ ∈ (0, 1/2) work fine.
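In practice, the choices in theorem 3.5.6 can be computed directly from the data; a sketch is given below (it uses scipy's normal quantile function, and the default values of α, K and β₀ are only illustrative).

import numpy as np
from scipy.stats import norm

def nfl_penalties(X, a, alpha=0.05, K=1.0, beta0=0.25):
    # lambda and mu as in theorem 3.5.6; K and beta0 reflect the assumed
    # growth rate of the local neighborhood and are user-chosen constants
    n, p = X.shape
    sigma_a = np.std(X[:, a], ddof=1)          # estimate of sigma_a
    q = norm.isf(alpha / (2.0 * p ** 2))       # inverse of 1 - Phi at alpha / (2 p(n)^2)
    lam = sigma_a / np.sqrt(n) * q
    mu = sigma_a / (K * n ** (beta0 + 0.5)) * q
    return lam, mu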
This is, by all means, a weaker result than asymptotic consistency. But, quite
important corollaries can be derived from it. E.g., if the underlying graph is empty,
then it will be estimated by an empty graph. Also, if the underlying graph has dis-
connected components, then with high probability, one is going to have an estimated
graph which is also disconnected, with the estimated connected components a subset
of the true connected components.
As shown in the simulations, we get a higher convergence speed over Meinshausen-
Buhlmann’s method by applying a neighborhood-fused lasso penalty for nodewise
regression when the underlying model exhibits local constancy. This can be proved
using the following lemma (proof in section 3.8).
Lemma 3.5.7. (a) With the notation having the same meaning as in the proof of theorem 3.5.3, the following is true:

P( |2n^{−1}⟨X_a, V_b⟩| ≥ (1−δ₁)λ + (1−δ₂)Bµ ) ≤ exp[ −(1/σ*²) ( (d₁/2)(1−δ₁) + (d₂/2)(1−δ₂) n^{β₀} )² n^ε ].

(b) The upper bound on the type I error probability for the neighborhood-fused lasso is smaller than that of the usual lasso by a factor of

exp[ −(1/σ*²) ( (d₁d₂/2)(1−δ₁)(1−δ₂) n^{β₀} + (d₂²/4)(1−δ₂)² n^{2β₀} ) n^ε ],

where σ*² = E(X²_{a,i} V²_{a,i}).
Looking at the proof of theorem 3.5.3, it is seen that the term

P( |2n^{−1}⟨X_a, V_b⟩| ≥ (1−δ₁)λ + (1−δ₂)Bµ )

is the principal contributor to the probability of false positives. The corresponding term using the usual lasso is

P( |2n^{−1}⟨X_a, V_b⟩| ≥ (1−δ₁)λ ).
Lemma 3.5.7 shows that the maximal probability of false positives using the neighborhood-fused lasso is a proper fraction of the same quantity using the usual lasso, and the fraction is small when the local neighborhood grows. So this method invariably performs better than Meinshausen-Buhlmann's method with respect to a minimax criterion, and the improvement is greater when the local neighborhood grows faster. It also implies that, in the worst case, it can perform as badly as the usual lasso regression.
3.6 Compatibility and l1 properties
In our attempt at Gaussian graphical model learning, we adopt the Meinshausen-Buhlmann approach, i.e., we do a componentwise penalized regression. However, one should keep in mind that, in the process of doing so, the design matrix (which is random here) changes at every iteration. In the previous section we discussed the asymptotic model selection consistency of our estimator. In this section, we shall go beyond model selection and explore conditions under which our neighborhood-fused lasso estimate exhibits nice asymptotic l₁ properties. We shall carry out our theoretical
analysis under the assumption that the linear regression model holds exactly, with
some underlying “true” θ0. Buhlmann and Van de Geer[88] derived some consistency
results and oracle properties of usual lasso. I shall apply and extend their results in
this section.
We start with a quick recapitulation of the notation we are going to use here. If X = (X_1, X_2, ..., X_p) ∼ N(0, Σ), one assumes a node-wise regression model

X_a = X^{(a)} θ^a + ε^a   for a = 1, 2, ..., p   (3.6.1)

where X_a denotes the a-th component, X^{(a)} denotes all the components except a, and ε^a ∼ N(0, σ² I_n) for some σ². Also assume that Σ := ((σ_{ij}))_{i,j=1,...,p} and Ω := Σ^{−1} = ((σ^{ij}))_{i,j=1,...,p} is the precision matrix. If E denotes the edge set of the conditional independence graph, then (i, j) ∉ E ⟺ σ^{ij} = σ^{ji} = 0. The design matrix X^{(a)} is obtained by deleting the a-th column of the data matrix.
In the high dimensional setup, where one generally has fewer samples than the model dimension (n < p), we assume an inherent sparsity in the true θ^a. This sparsity is also ensured if we assume that the underlying conditional independence graph is sparse. Let us define

S⁰_a = {j : θ^a_j ≠ 0}

and s⁰_a = card(S⁰_a). Since S⁰_a is not known, one needs a regularization penalty.
Meinshausen and Buhlmann chose the l₁ penalty, i.e., the traditional lasso, and obtained the estimate

θ^λ_a = argmin_{θ^a} [ (1/n) ‖X_a − X^{(a)} θ^a‖² + λ‖θ^a‖₁ ].

In our situation, we have added another penalty term ‖D^a θ^a‖₁ along with the lasso penalty term. Hence, our estimator is given by

θ^{λ,µ}_a = argmin_{θ^a} [ (1/n) ‖X_a − X^{(a)} θ^a‖² + λ‖θ^a‖₁ + µ‖D^a θ^a‖₁ ].
Note that this new definition of our estimator is exactly the same as the one defined in (3.3.3).
The Compatibility Condition for NFLasso
As mentioned in the earlier section, we shall develop our theory based on the
assumption of a linear truth. We try to provide an upper bound on the prediction
error. The following lemma forms the basis of our derivation.
Lemma 3.6.1 (Basic inequality).

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + λ‖θ^{λ,µ}_a‖₁ + µ‖D^a θ^{λ,µ}_a‖₁ ≤ (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a) + λ‖θ⁰_a‖₁ + µ‖D^a θ⁰_a‖₁.
It should be noted that the number of non-zero elements in a column of the matrix D^a is the number of local neighbors of the corresponding node (other than node a). Let us assume that the number of local neighbors is O(n^{β₀}). Fixing n, let us also assume that the number of local neighbors is bounded by B. Then it follows from lemma 3.6.1, using the inequality | ‖x‖₁ − ‖y‖₁ | ≤ ‖x − y‖₁, that

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a) + (λ + Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁.
The basic objective of using a penalty parameter is to overrule the empirical process term (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a). It can easily be seen that

| (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a) | ≤ ( (2/n) max_{1≤j≤p} |ε_a′ X^{(a)}_j| ) ‖θ^{λ,µ}_a − θ⁰_a‖₁.
Our goal is to choose λ and µ such that the probability that the empirical process term on the right-hand side exceeds λ + Bµ is small, so that with high probability the right-hand side can be dominated by (λ + Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁. To that effect,
we define

Λ_a := { max_{1≤j≤p, j≠a} (2/n) |ε_a′ X^{(a)}_j| ≤ λ₀ + Bµ₀ }.
Our objective is to show that for a particular choice of λ₀ and µ₀, Λ_a has high probability. One should keep in mind that both ε_a and X^{(a)}_j are random here. One can make the valid assumption of their individual Gaussian law and independence. We formalize these notions in the following proposition.
Proposition 3.6.2. Under the assumption of a linear truth for the Gaussian graphical model, i.e., X_a = X^{(a)} θ⁰_a + ε_a for a = 1, 2, ..., p, we have ε_{a,i} i.i.d. ∼ N(0, σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ab}′), where

Σ = ( σ_{aa}   Σ_{ab}
      Σ_{ab}′  Σ_{bb} ).
The proof of this proposition is easy and hence skipped. From lemma 3.6.3 and the corollary following it, we see that for a suitable choice of λ₀ and µ₀, P(Λ_a) is high.
Lemma 3.6.3. Under the assumption that σ_i = 1 ∀ i = 1, 2, ..., p,

P(Λ_a) ≥ 1 − 2 exp[ log p − n ( λ₀/2 + 1 − √(λ₀ + Bµ₀ + 1) ) ].

In particular, with λ₀ = 2(t + log p)/n > 0 and µ₀ = (2/B) √( (2/n)(t + log p) ),

P(Λ_a) ≥ 1 − 2 exp(−t).
Corollary 3.6.4. Assume that σ_i = 1 ∀ i = 1, 2, ..., p and that p = O(n^γ). Let the regularization parameters be

λ = 2(t² + log p)/n,    µ = (1/B) √( 8(t² + log p)/n ).

Then the following is true:

P( (1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ 2(λ + Bµ) ‖θ⁰_a‖₁ + µB ‖θ^{λ,µ}_a‖₁ ) ≥ 1 − exp(−t²).
Assume that δ_min is the smallest non-zero singular value of X^{(a)}. Then the following lemma provides an upper bound on the quadratic norm of the estimation error.
Lemma 3.6.5. Assume that σ_i = 1 ∀ i = 1, 2, ..., p and that p = O(n^γ). Let the regularization parameters be

λ = 2(t² + log p)/n,    µ = (1/B) √( 8(t² + log p)/n ).

Also assume that δ_min is the smallest non-zero singular value of X^{(a)}. Then we have

P( (1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ 2( λ + Bµ(1 + √p/2) ) ‖θ⁰_a‖₁ + npBµ(λ + Bµ)/δ_min ) ≥ 1 − exp(−t²).
In the following subsection we study some oracle properties of the neighborhood-fused lasso estimator and derive better bounds than the above.
Oracle Inequalities
We now try to derive some oracle inequalities which will prove the l1 consistency of
our neighborhood-fused lasso estimate. Following Buhlmann’s notation, let us write,
for an index set S ⊂ {1, 2, ..., p},

θ^a_{j,S} := θ^a_j 1{j ∈ S},
θ^a_{j,S^c} := θ^a_j 1{j ∉ S}.
We introduce a few more notations to carry out the calculations. We split the matrix
D^a as follows:

D^a = ( D^a_{S,S}   0
        D^a_{S,0}   D^a_{0,S^c}
        0           D^a_{S^c,S^c} )

where

D^a_{S,S} consists of all rows such that both non-zero terms belong to S;

D^a_{S,0} consists of all rows and columns such that exactly one of the non-zero terms in that row belongs to S;

D^a_{0,S^c} consists of all rows and columns such that exactly one of the non-zero terms in that row belongs to S^c;

D^a_{S^c,S^c} consists of all rows such that both non-zero terms belong to S^c.

Thus, in simple words, D^a_{S,S} takes care of the local difference terms within S, [D^a_{S,0}  D^a_{0,S^c}] takes care of the local difference terms between S and S^c, and D^a_{S^c,S^c} takes care of the local difference terms within S^c. With these notations, the following lemma holds.
Lemma 3.6.6. On Λ_a, if we choose λ ≥ 2λ₀ and µ ≥ 2µ₀, we have

(2/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + (λ − 3Bµ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3λ + 5Bµ) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁.
It can easily be verified that if λ ≥ (3 + 14/∆) Bµ for some ∆ > 0, we have

λ − 3Bµ ≥ (3λ + 5Bµ) / (3 + ∆).

This helps us simplify the consequences of lemma 3.6.6, which implies that

(λ − 3Bµ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3λ + 5Bµ) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁.

Combining this with the aforementioned condition, we get

( (3λ + 5Bµ) / (3 + ∆) ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (λ − 3Bµ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3λ + 5Bµ) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁,

or,

‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3 + ∆) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁.
A standard way to incorporate the l₁ penalty ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁ into an l₂ penalty is to use the Cauchy-Schwarz inequality

‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁ ≤ √s₀ ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖

and subsequently relate it to the l₂ term on the left-hand side,

(2/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² = (2/n) (θ^{λ,µ}_a − θ⁰_a)′ X^{(a)′} X^{(a)} (θ^{λ,µ}_a − θ⁰_a),

in the following manner:

‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖² ≤ (θ^{λ,µ}_a − θ⁰_a)′ X^{(a)′} X^{(a)} (θ^{λ,µ}_a − θ⁰_a) / φ²_{0,a},
where φ_{0,a} > 0 is some constant. However, θ^{λ,µ}_a being random, this condition cannot hold uniformly for all θ ∈ R^p. Buhlmann provided a compatibility condition so that this is true. In our situation, we require a similar condition to carry out further analysis. It should be noted that our compatibility condition is weaker than the compatibility condition Buhlmann provided; however, we need to work under the assumption that λ ≥ (3 + 14/∆) Bµ. The definition of this compatibility condition is provided below.
Definition 3.6.7 (Compatibility Condition). We say that the compatibility condition is satisfied for a collection of nodes S₀ if, for all θ satisfying

‖θ_{S₀^c}‖₁ ≤ (3 + ∆) ‖θ_{S₀}‖₁,

the following holds:

‖θ_{S₀}‖₁² ≤ s₀ ( θ′ X^{(a)′} X^{(a)} θ ) / (n φ²_{0,a}).
Following Bickel et al.[89], we shall refer to φ²_{0,a} as the restricted eigenvalue. Here we provide a more detailed explanation of this quantity. Observe that the compatibility
condition can be rewritten as

φ²_{0,a} ≤ inf_{θ: ‖θ_{S₀^c}‖₁ ≤ (3+∆)‖θ_{S₀}‖₁}  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ_{S₀}‖₁²,

so that φ_{0,a} can be taken as the square root of the infimum attained by the right-hand side. Now, following the chain of inequalities

inf_{θ: ‖θ_{S₀^c}‖₁ ≤ (3+∆)‖θ_{S₀}‖₁}  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ_{S₀}‖₁²
  ≥ inf_θ  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ_{S₀}‖₁²
  ≥ inf_θ  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ( s₀ ‖θ_{S₀}‖² )
  ≥ inf_θ  ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ‖²
  = λ_min( (1/n) X^{(a)′} X^{(a)} ),
Thus, φ²_{0,a} can be taken as the minimum eigenvalue over the restricted set {θ : ‖θ_{S₀^c}‖₁ ≤ (3 + ∆) ‖θ_{S₀}‖₁}. With this compatibility condition imposed, we derive the following oracle inequality.
Theorem 3.6.8. Assume that the compatibility condition (from definition 3.6.7) holds for some ∆ > 0. Then on Λ_a, for λ ≥ 2λ₀, µ ≥ 2µ₀ and λ ≥ (3 + 14/∆) Bµ, we have

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + (λ − 3Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁ ≤ s₀ (2λ + Bµ)² / φ²_{0,a}.
Combining theorem 3.6.8 and lemma 3.6.3, we get

Theorem 3.6.9. Under the assumption that σ_i = 1 ∀ i = 1, 2, ..., p, if one chooses

λ = 4(t + log p)/n,    µ = (4/B) √( (2/n)(t + log p) ),

and

t ≥ 2n (3 + 14/∆)²,

then with probability 1 − exp(−t) we have

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + (λ − 3Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁ ≤ s₀ (2λ + Bµ)² / φ²_{0,a}.
From theorem 3.6.8 we see that, with high probability,

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ s₀ (2λ + Bµ)² / φ²_{0,a},

and

‖θ^{λ,µ}_a − θ⁰_a‖₁ ≤ s₀ (2λ + Bµ)² / ( φ²_{0,a} (λ − 3Bµ) ).

Since λ and Bµ go to 0 as n → ∞, both these quantities are small.
3.7 Simulations
3.7.1 Simulation 1
In this simulation we repeat the exact scenario presented in Honorio’s first sim-
ulation setting. The Gaussian graphical model consists of 9 variables as shown in
figure 3.7.1. It deals with both local and non-local interactions. Our method is
compared to Meinshausen-Buhlmann's method, the graphical lasso and Honorio's Coordinate Direction Descent algorithm. We run our simulation for 4 different sample sizes, n = 4, 50, 100 and 400. For each sample size, we run the simulation 50 times and
estimate the neighborhood for each iteration. We construct a graph with edges that
occur most frequently in all of those 50 iterations. The results are shown in figure
3.7.1.
3.7.2 Simulation 2
In this simulation study we take a 50 dimensional normal random vector with zero
mean. The diagonals of the precision matrix are all 1 and all the nonzero off-diagonal
entries are 0.2. The conditional (in)dependence graph of this vector consists of both
spatial (local) and non-spatial neighbors where the local neighborhood structure is
linear (one dimensional lattice). There are two groups of distant neighbors. We
generate n i.i.d. samples from the corresponding normal distribution. We run the
simulation for n = 10, 25, 50, 100, 500, 1000. We try to reconstruct the conditional
(in)dependence graph from the data using Graphical LASSO (we use an oracle ver-
sion of graphical lasso where the choice of penalty parameter is contingent on the
actual number of edges in the true model), Meinshausen - Buhlmann’s coordinate-
wise LASSO and our coordinate-wise generalized Fused LASSO approach. For each
[Figure 3.7.1 panels: the ground truth graph on 9 nodes and the NFL, GLASSO, MB and CDD estimates for n = 4, 50, 100 and 400.]
Figure 3.7.1: Comparison of NFL with GLASSO, Meinshausen-Buhlmann estimate and CDD in section 3.7.1
n, we run the simulation 50 times and calculate the number of correctly identified
edges and that of falsely identified edges for each iteration. The means and standard deviations are shown in tables 3.1 and 3.2. Figure 3.7.2 shows the relative performance of the competing methods for different sample sizes; sample size increases from top to bottom. In the figure we include comparisons for two additional sample sizes, n = 5000 and n = 10000.
It is quite evident from the two simulations that our method converges to the true
model faster than Meinshausen-Buhlmann’s method. It is also shown theoretically
in lemma 3.5.7.
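A minimal sketch of the data-generating mechanism of simulation 2 is given below; the chain (local) structure follows the description above, while the particular distant edges chosen here are purely illustrative.

import numpy as np

def simulate_locally_constant_ggm(n, p=50, rho=0.2, seed=0):
    rng = np.random.default_rng(seed)
    Omega = np.eye(p)
    for i in range(p - 1):                      # chain of local neighbors
        Omega[i, i + 1] = Omega[i + 1, i] = rho
    for i, j in [(4, 29), (5, 30), (9, 39)]:    # hypothetical distant (non-local) edges
        Omega[i, j] = Omega[j, i] = rho
    Sigma = np.linalg.inv(Omega)                # covariance = inverse of the precision matrix
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X, Omega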
Method  Parameters   n = 10 fp        n = 10 tp     n = 25 fp        n = 25 tp     n = 50 fp        n = 50 tp
GL      ρ = 0.00     NA               NA            NA               NA            NA               NA
GL      ρ = 0.05     239.78 (8.94)    24.98 (4.13)  352.79 (9.89)    40.33 (3.86)  364.65 (14.77)   52.08 (3.79)
GL      ρ = 0.10     217.71 (8.73)    23.20 (3.83)  273.30 (11.22)   36.10 (3.57)  248.21 (11.54)   45.58 (3.52)
GL      ρ = 0.15     195.75 (9.85)    21.78 (3.68)  210.74 (11.01)   31.92 (3.88)  165.44 (9.21)    38.64 (3.70)
GL      ρ = 0.20     175.32 (10.02)   20.60 (3.25)  161.56 (10.49)   27.88 (3.45)  104.28 (8.24)    31.64 (3.75)
GL      ρ = 0.25     156.95 (11.18)   18.86 (3.28)  121.98 (10.66)   23.42 (3.84)  59.92 (6.64)     24.74 (4.13)
GL      ρ = 0.30     138.76 (10.79)   17.32 (3.42)  89.40 (10.55)    19.34 (3.51)  33.08 (5.67)     17.53 (3.15)
MB                   0 (0)            0 (0)         0 (0)            0 (0)         0 (0)            0 (0)
NFL                  51.62 (5.74)     7.54 (2.69)   37.32 (4.84)     10.12 (2.69)  47.20 (6.05)     20.18 (4.02)

Table 3.1: Comparison of False Positives and True Positives
[Figure 3.7.2 panels: each row shows, from left to right, the Truth, OGL (oracle graphical lasso), MB and NFL estimates for one sample size.]
Figure 3.7.2: Comparison of NFL with GLASSO and Meinshausen-Buhlmann estimate in section 3.7.2; sample sizes from top to bottom are 10, 25, 50, 100, 500, 1000, 5000, 10000
Method  Parameter   n = 100: fp      tp            n = 500: fp      tp            n = 1000: fp     tp
GL      ρ = 0.00    579.72 (22.16)   64.06 (3.03)  580.64 (20.99)   69 (0)        584.70 (24.48)   69 (0)
        ρ = 0.05    336.84 (15.68)   60.34 (2.79)  166.10 (8.39)    68.24 (0.85)  83.16 (7.06)     68.90 (0.36)
        ρ = 0.10    187.14 (11.43)   52.46 (2.97)  23.90 (4.49)     61.26 (1.66)  2.70 (1.76)      62.74 (1.64)
        ρ = 0.15    94.48 (9.24)     42.50 (3.59)  1.50 (1.04)      49.62 (1.71)  0.02 (0.14)      50.28 (1.84)
        ρ = 0.20    38.80 (6.33)     32.48 (3.25)  0.04 (0.20)      33.62 (2.92)  0 (0)            35.14 (3.09)
        ρ = 0.25    14.30 (4.45)     21.76 (3.32)  0 (0)            15.12 (3.99)  0 (0)            12.86 (2.67)
        ρ = 0.30    3.96 (2.08)      13.32 (2.87)  0 (0)            3.60 (1.96)   0 (0)            1.12 (0.87)
MB                  0 (0)            0 (0)         0 (0)            0 (0)         0 (0)            0.95 (0.92)
NFL                 99.92 (7.08)     41.10 (3.73)  2.44 (1.40)      49.86 (2.42)  0.02 (0.14)      49.70 (2.25)

Table 3.2: Comparison of false positives (fp) and true positives (tp); standard deviations over the 50 runs are given in parentheses.
3.8 Theoretical Results
Proof [Proof of proposition 3.3.2] The subdifferential of $E\big(X_a-\sum_{k\in\Gamma(n)}\theta_{ak}X_k\big)^2 + \eta_1\|\theta\|_1 + \eta_2\|D_a\theta_a\|_1$ with respect to $\theta_{ak}$, $k \in \Gamma(n)$, is given by
$$-2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}X_m\big)X_k\Big) + \eta_1 e^{(1)}_k + \eta_2 e^{(2)}_k,$$
where $e^{(1)}_k = \mathrm{sgn}(\theta_{ak})$ if $\theta_{ak} \neq 0$ and $e^{(1)}_k \in [-1, 1]$ if $\theta_{ak} = 0$; and $e^{(2)}_k = (D_a'\mathrm{sgn}(D_a\theta_a))_k$ if $(D_a'\mathrm{sgn}(D_a\theta_a))_k \neq 0$, otherwise $e^{(2)}_k \in [-1, 1]$.
Using $ne_a(\eta_1, 0) = ne_a(0, 0)$, it follows from lemma 3.5.1 that for all $k \in ne_a$,
$$2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(\eta_1,0)X_m\big)X_k\Big) = \eta_1\,\mathrm{sgn}(\theta_{ak}),$$
and for $b \notin ne_a$,
$$\Big|2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(\eta_1,0)X_m\big)X_b\Big)\Big| \le \eta_1.$$
A variable $X_b$ with $b \notin ne_a$ can be written as $X_b = \sum_{k\in ne_a}\theta^{b,ne_a}_k X_k + W_b$, where $W_b$ is independent of $\{X_k : k \in cl_a\}$. Using this yields
$$\Big|2\sum_{k\in ne_a}\theta^{b,ne_a}_k E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(\eta_1,0)X_m\big)X_k\Big)\Big| \le \eta_1.$$
Thus, it follows that $\big|\sum_{k\in ne_a}\theta^{b,ne_a}_k \mathrm{sgn}(\theta^{a,ne_a}_k)\big| \le 1$.
Using $ne_a(0, \eta_2) = ne_a(0, 0)$, it follows from lemma 3.5.1 that for all $k \in L_a$,
$$2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(0,\eta_2)X_m\big)X_k\Big) = \eta_2\,(D_a'\mathrm{sgn}(D_a\theta_a))_k,$$
and for $b \notin L_a$,
$$\Big|2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(0,\eta_2)X_m\big)X_b\Big)\Big| \le \eta_2\|D^a_{.b}\|_1.$$
A variable $X_b$ with $b \notin L_a$ can be written as $X_b = \sum_{k\in L_a}\theta^{b,L_a}_k X_k + Z_b$, where $Z_b$ is independent of $\{X_k : k \in L_a\}$. Using this yields
$$\Big|2\sum_{k\in L_a}\theta^{b,L_a}_k E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(0,\eta_2)X_m\big)X_k\Big)\Big| \le \eta_2\|D^a_{.b}\|_1.$$
Thus, it follows that
$$\Big|\sum_{k\in L_a}\theta^{b,L_a}_k (D_a'\mathrm{sgn}(D_a\theta_a))_k\Big| \le \|D^a_{.b}\|_1.$$
Proof [Proof of lemma 3.4.1] Let $\beta_0 = G^+\hat\omega$. We will show that $\beta_0 = \hat\beta$. We know from the definition of $\hat\omega$ that for all $\omega$,
$$\| y - XG^+\hat\omega \|^2 + \lambda\|\hat\omega\|_1 \le \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1.$$
Since the objective function is convex, the minimizer in $\mathcal{C}(G)$ is obtained by projecting the grand minimizer onto $\mathcal{C}(G)$. Hence,
$$\tilde\omega := \mathrm{argmin}_{\omega\in\mathcal{C}(G)}\ \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1 = GG^+\hat\omega.$$
Therefore, for any $\omega \in \mathcal{C}(G)$, we have
$$\| y - XG^+\tilde\omega \|^2 + \lambda\|\tilde\omega\|_1 \le \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1$$
$$\Rightarrow\ \| y - XG^+GG^+\hat\omega \|^2 + \lambda\|GG^+\hat\omega\|_1 \le \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1$$
$$\Rightarrow\ \| y - X\beta_0 \|^2 + \lambda\|G\beta_0\|_1 \le \| y - XG^+G\theta \|^2 + \lambda\|G\theta\|_1 \quad \forall\theta.$$
Since $G$ has full column rank, $G^+G = I$. Hence,
$$\| y - X\beta_0 \|^2 + \lambda\|G\beta_0\|_1 \le \| y - X\theta \|^2 + \lambda\|G\theta\|_1 \quad \forall\theta.$$
Therefore we get that
$$\beta_0 = \mathrm{argmin}_{\beta\in\mathcal{C}(G^+)=\mathbb{R}^k}\ \| y - X\beta \|^2 + \lambda\|G\beta\|_1.$$
Proof [Proof of lemma 3.5.1] The subdifferential of $\frac1n\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 + \mu\|D_a\theta\|_1$ is given by $\{G(\theta) + \lambda e_1 + \mu e_2 : e_1 \in S_1, e_2 \in S_2\}$, where $S_1 = \{e \in \mathbb{R}^{p(n)} : e_b = \mathrm{sgn}(\theta_b) \text{ if } \theta_b \neq 0 \text{ and } e_b \in [-1,1] \text{ if } \theta_b = 0\}$ and $S_2 = \{e \in \mathbb{R}^{p(n)} : e_b = \alpha_b \text{ if } \alpha_b \neq 0 \text{ and } e_b \in [-1,1] \text{ if } \alpha_b = 0\}$, where $\alpha = D_a'\mathrm{sgn}(D_a\theta)$. Observe that $|(D_a'\mathrm{sgn}(D_a\theta))_b| \le \|D^a_{.b}\|_1$, where $D^a_{.b}$ denotes the $b$th column of the matrix $D_a$; this is the same as the total number of local neighbors of $b$ other than $a$. The lemma follows.
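For concreteness, the penalized criterion whose subdifferential is analyzed above can also be written down and solved directly with an off-the-shelf convex solver. The sketch below uses CVXPY purely for illustration; it is not the optimization method of section 3.4, and the matrix Da passed in is assumed to encode the local-difference penalty of node a.

```python
import cvxpy as cp
import numpy as np

def neighborhood_fused_lasso(Xa, Xothers, Da, lam, mu):
    """Minimize (1/n)||Xa - Xothers @ theta||^2 + lam*||theta||_1 + mu*||Da @ theta||_1."""
    n, q = Xothers.shape
    theta = cp.Variable(q)
    objective = (cp.sum_squares(Xa - Xothers @ theta) / n
                 + lam * cp.norm1(theta)
                 + mu * cp.norm1(Da @ theta))
    cp.Problem(cp.Minimize(objective)).solve()
    return theta.value
```

Here Xa is the n-vector of observations at node a and Xothers contains the remaining columns of the data matrix; thresholding the returned coefficients gives an estimated neighborhood of node a.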
Proof [Proof of lemma 3.5.2] Using Bonferroni's inequality and $|ne_a| = o(n)$ for $n \to \infty$, it suffices to show that there exists some $c > 0$ so that for every $a, b \in \Gamma(n)$ with $b \in ne_a$, $P\big(\mathrm{sgn}(\theta^{a,ne_a,B,\lambda,\mu}_b) = \mathrm{sgn}(\theta^a_b)\big) = 1 - O(\exp(-cn^{\varepsilon}))$.
Consider the definition of
$$\theta^{a,ne_a,B,\lambda,\mu} = \mathrm{argmin}_{\theta:\,\theta_k=0\ \forall k\notin ne_a,\ \theta_l-\theta_m=0\ \forall (l,m)\in B}\Big(\tfrac1n\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 + \mu\|D_a\theta\|_1\Big).$$
Assume now that component $b$ of this estimate is fixed at a constant value $\beta$, and denote the new estimate by $\theta^{a,b,B,\lambda,\mu}(\beta)$,
$$\theta^{a,b,B,\lambda,\mu}(\beta) = \mathrm{argmin}_{\theta\in\Theta_{a,b}(\beta)}\Big(n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 + \mu\|D_a\theta\|_1\Big),$$
where
$$\Theta_{a,b}(\beta) = \{\theta \in \mathbb{R}^{p(n)} : \theta_b = \beta,\ \theta_k = 0\ \forall k\notin ne_a,\ \theta_l - \theta_m = 0\ \forall (l,m)\in B\}.$$
There will always exist a value $\beta = \theta^{a,ne_a,B,\lambda,\mu}_b$ such that $\theta^{a,b,B,\lambda,\mu}(\beta)$ is identical to $\theta^{a,ne_a,B,\lambda,\mu}$. Thus, if $\mathrm{sgn}(\theta^{a,ne_a,B,\lambda,\mu}_b) \neq \mathrm{sgn}(\theta^a_b)$, there would exist some $\beta$ with $\mathrm{sgn}(\beta)\mathrm{sgn}(\theta^a_b) \le 0$ so that $\theta^{a,b,B,\lambda,\mu}(\beta)$ would be a solution. Using $\mathrm{sgn}(\theta^a_b) \neq 0$ for all $b \in ne_a$, it is sufficient to show that for every $\beta$ with $\mathrm{sgn}(\beta)\mathrm{sgn}(\theta^a_b) < 0$, $\theta^{a,b,B,\lambda,\mu}(\beta)$ cannot be a solution with high probability.
We concentrate on the case $\theta^a_b > 0$; the case $\theta^a_b < 0$ follows analogously. If $\theta^a_b > 0$, it follows by lemma 3.5.1 that $\theta^{a,b,B,\lambda,\mu}(\beta)$ with $\theta^{a,b,B,\lambda,\mu}_b(\beta) = \beta \le 0$ can only be a solution if $G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) \ge -\lambda - B\mu$, where $B = \max_{a,b}\|D^a_{.b}\|_1$. Hence it suffices to show that for some $c > 0$ and all $b \in ne_a$ with $\theta^a_b > 0$, for $n\to\infty$,
$$P\big(\sup_{\beta\le 0} G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) < -\lambda - B\mu\big) = 1 - O(\exp(-cn^{\varepsilon})). \qquad (3.8.1)$$
Let in the following $R^{\lambda,\mu}_a(\beta)$ be the $n$-dimensional vector of residuals,
$$R^{\lambda,\mu}_a(\beta) = X_a - X\theta^{a,b,B,\lambda,\mu}(\beta).$$
We can write $X_b$ as
$$X_b = \sum_{k\in ne_a\setminus b}\theta^{b,ne_a\setminus b}_k X_k + W_b,$$
where $W_b$ is independent of $\{X_k : k \in ne_a\setminus b\}$. It follows that
$$G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) = -2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle - \sum_{k\in ne_a\setminus b}\theta^{b,ne_a\setminus b}_k\big(2n^{-1}\langle R^{\lambda,\mu}_a(\beta), X_k\rangle\big).$$
By lemma 3.5.1, for all $k\in ne_a\setminus b$, $|G_k(\theta^{a,b,B,\lambda,\mu}(\beta))| = |2n^{-1}\langle R^{\lambda,\mu}_a(\beta), X_k\rangle| \le \lambda + B\mu$. This together with the equation above yields
$$G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) \le -2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle + (\lambda+B\mu)\|\theta^{b,ne_a\setminus b}\|_1.$$
Using Assumption 5, there exists some $\vartheta < \infty$ so that $\|\theta^{b,ne_a\setminus b}\|_1 < \vartheta$. It is therefore sufficient to show that for every $g > 0$ there exists some $c > 0$ so that it holds for all $b \in ne_a$ with $\theta^a_b > 0$, for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle > g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
Let $\mathcal{W}_{\parallel} \subseteq \mathbb{R}^n$ be the space spanned by the vectors $\{X_k,\ k \in ne_a\setminus b\}$ and let $\mathcal{W}_{\perp}$ be the orthogonal complement of $\mathcal{W}_{\parallel}$ in $\mathbb{R}^n$. Split the $n$-dimensional vector $W_b$ into the two vectors $W_b = W_b^{\perp} + W_b^{\parallel}$, where $W_b^{\parallel} \in \mathcal{W}_{\parallel}$ and $W_b^{\perp} \in \mathcal{W}_{\perp}$. The inner product can be written as
$$2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle = 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle + 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\perp}\rangle.$$
By the accessory lemma (A1) below, there exists for every $g > 0$ some $c > 0$ so that, for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle/(1 + Kn^{\beta_0}|\beta|) > -g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
To show the result, it is sufficient to prove that there exists for every $g > 0$ some $c > 0$ so that, for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\perp}\rangle - g(1+Kn^{\beta_0}|\beta|)(\lambda+B\mu) > g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
It holds for some random variable $V_a$, independent of $X_{ne_a}$, that
$$X_a = \sum_{k\in ne_a}\theta^a_k X_k + V_a.$$
Note that $V_a$ and $W_b$ are independent normally distributed random variables with variances $\sigma^2_a$ and $\sigma^2_b$ respectively. By assumption 2, $0 < v^2 \le \sigma^2_b, \sigma^2_a \le 1$. Note furthermore that $W_b$ and $X_{ne_a\setminus b}$ are independent. Using $\theta^a = \theta^{a,ne_a}$ and $X_b = \sum_{k\in ne_a\setminus b}\theta^{b,ne_a\setminus b}_k X_k + W_b$,
$$X_a = \sum_{k\in ne_a\setminus b}\big(\theta^a_k + \theta^a_b\theta^{b,ne_a\setminus b}_k\big)X_k + \theta^a_b W_b + V_a.$$
Using this, the definition of the residuals and the orthogonality property of $W_b^{\perp}$,
$$2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\perp}\rangle = 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle + 2n^{-1}\langle V_a, W_b^{\perp}\rangle \ge 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - 2n^{-1}|\langle V_a, W_b^{\perp}\rangle|.$$
The second term, $2n^{-1}|\langle V_a, W_b^{\perp}\rangle|$, is stochastically smaller than $2n^{-1}|\langle V_a, W_b\rangle|$. Due to the independence of $V_a$ and $W_b$, $E(V_aW_b) = 0$. Using the Bernstein inequality and $\lambda + B\mu \sim dn^{-\frac{1-\varepsilon}{2}}$ with $\varepsilon > 0$, there exists for every $g > 0$ some $c > 0$ so that
$$P\big(2n^{-1}|\langle V_a, W_b^{\perp}\rangle| \ge g(\lambda+B\mu)\big) \le P\big(2n^{-1}|\langle V_a, W_b\rangle| \ge g(\lambda+B\mu)\big) = O(\exp(-cn^{\varepsilon})).$$
Thus, it is sufficient to show that for every $g > 0$ there exists a $c > 0$ such that for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - g(1+Kn^{\beta_0}|\beta|)(\lambda+B\mu) > 2g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
Note that $\sigma_b^{-2}\langle W_b^{\perp}, W_b^{\perp}\rangle$ follows a $\chi^2_{n-|ne_a|}$-distribution. As $|ne_a| = o(n)$ and $\sigma^2_b \ge v^2$ (by Assumption 2), it follows that there exists some $k > 0$ so that for $n > n_0$ with some $n_0(k) \in \mathbb{N}$, and any $c > 0$,
$$P\big(2n^{-1}\langle W_b^{\perp}, W_b^{\perp}\rangle > k\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
Hence, it suffices to show that for every $k, l > 0$ there exists some $n_0(k, l) \in \mathbb{N}$ so that for all $n \ge n_0$,
$$\inf_{\beta\le 0}\ (\theta^a_b - \beta)k - l(1+Kn^{\beta_0}|\beta|)(\lambda+B\mu) > 0.$$
By assumption 5, $|\pi_{ab}|$ is of order at least $n^{-\frac{1-\xi}{2}+\beta_0}$. Using
$$\pi_{ab} = \theta^a_b\big/\big(\mathrm{Var}(X_a|X_{\Gamma(n)\setminus a})\,\mathrm{Var}(X_b|X_{\Gamma(n)\setminus b})\big)^{\frac12}$$
and assumption 2, this implies that there exists some $q > 0$ so that $\theta^a_b \ge qn^{-\frac{1-\xi}{2}+\beta_0}$. As $\lambda \sim d_1 n^{-\frac{1-\varepsilon}{2}}$, $\mu \sim d_2 n^{-\frac{1-\varepsilon}{2}-\beta_0}$ and $\xi > \varepsilon$ by the assumption of theorem 1, it follows that for every $k, l > 0$ and large enough values of $n$,
$$\theta^a_b k - lKn^{\beta_0}|\beta|(\lambda+B\mu) > 0.$$
It remains to show that for any $k, l > 0$ there exists some $n_0(k, l)$ such that for all $n \ge n_0$,
$$\inf_{\beta\le 0}\ -\beta k - l(\lambda+B\mu) \ge 0.$$
This follows as $\lambda + B\mu \to 0$ for $n\to\infty$, which completes the proof.
Lemma (A1) (Accessory 1 to lemma 3.5.2). Assume the conditions of theorem 1 hold true. Let $R^{\lambda,\mu}_a(\beta)$ and $W_b^{\parallel}$ be defined as in the proof of the previous lemma. For any $g > 0$, there exists $c > 0$ so that it holds for all $a, b \in \Gamma(n)$, for $n\to\infty$,
$$P\left(\sup_{\beta\in\mathbb{R}}\ \frac{|2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle|}{1+Kn^{\beta_0}|\beta|} < g(\lambda+B\mu)\right) = 1 - O(\exp(-cn^{\varepsilon})).$$
Proof By the Cauchy-Schwarz inequality,
$$\frac{|2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle|}{1+Kn^{\beta_0}|\beta|} \le 2n^{-\frac12}\|W_b^{\parallel}\|_2\ \frac{n^{-\frac12}\|R^{\lambda,\mu}_a(\beta)\|_2}{1+Kn^{\beta_0}|\beta|}.$$
The sum of squares of the residuals is increasing in $\lambda$ and $\mu$. Thus, $\|R^{\lambda,\mu}_a(\beta)\|_2 \le \|R^{\infty,\infty}_a(\beta)\|_2$. By definition of $R^{\lambda,\mu}_a$,
$$\|R^{\infty,\infty}_a(\beta)\|_2^2 = \|X_a - \beta X_b - \beta X_{b_1} - \beta X_{b_2} - \cdots - \beta X_{b_w}\|_2^2,$$
where $w$ is the number of nodes which are in the same equivalence class as $b$. Hence,
$$\|R^{\infty,\infty}_a(\beta)\|_2^2 \le (1+(w+1)|\beta|)^2\max\{\|X_a\|_2^2, \|X_b\|_2^2, \|X_{b_1}\|_2^2, \cdots, \|X_{b_w}\|_2^2\} \le (1+Kn^{\beta_0}|\beta|)^2\max\{\|X_a\|_2^2, \|X_b\|_2^2, \|X_{b_1}\|_2^2, \cdots, \|X_{b_w}\|_2^2\}.$$
Hence, for any $q > 0$,
$$P\left(\sup_{\beta\in\mathbb{R}}\frac{n^{-\frac12}\|R^{\lambda,\mu}_a(\beta)\|_2}{1+Kn^{\beta_0}|\beta|} > q\right) \le P\Big(n^{-\frac12}\max\{\|X_a\|_2, \|X_b\|_2, \|X_{b_1}\|_2, \cdots, \|X_{b_w}\|_2\} > q\Big).$$
Note that $\|X_a\|_2^2$, $\|X_b\|_2^2$ and all of the $\|X_{b_k}\|_2^2$'s ($k = 1, 2, \cdots, w$) have a $\chi^2_n$ distribution. Thus, by the following lemma (Lemma 3.8.1), there exist $q > 1$ and $c > 0$ such that
$$P\left(\sup_{\beta\in\mathbb{R}}\frac{n^{-\frac12}\|R^{\lambda,\mu}_a(\beta)\|_2}{1+Kn^{\beta_0}|\beta|} > q\right) = O(\exp(-cn^{\varepsilon}))\quad\text{for } n\to\infty.$$
It remains to be shown that for every $g > 0$ there exists some $c > 0$ so that
$$P\Big(n^{-\frac12}\|W_b^{\parallel}\|_2 > g(\lambda+B\mu)\Big) = O(\exp(-cn^{\varepsilon}))\quad\text{for } n\to\infty.$$
The expression $\sigma_b^{-2}\langle W_b^{\parallel}, W_b^{\parallel}\rangle$ is $\chi^2_{|ne_a|-1}$ distributed. As $\sigma_b \le 1$ and $|ne_a| = O(n^{\kappa})$, it follows that $n^{-\frac12}\|W_b^{\parallel}\|_2$ is stochastically smaller than $tn^{-\frac{1-\kappa}{2}}\big(\tfrac{Z}{n^{\kappa}}\big)^{\frac12}$, for some $t > 0$ and some $Z \sim \chi^2_{n^{\kappa}}$. Thus, for every $g > 0$,
$$P\Big(n^{-\frac12}\|W_b^{\parallel}\|_2 > g(\lambda+B\mu)\Big) \le P\Big(\frac{Z}{n^{\kappa}} > \big(\tfrac{g}{t}\big)^2 n^{1-\kappa}(\lambda+B\mu)^2\Big).$$
As $\lambda \sim n^{-\frac{1-\varepsilon}{2}}$, $\mu \sim n^{-\frac{1-\varepsilon}{2}-\beta_0}$ and $B \sim n^{\beta_0}$, it follows that $n^{1-\kappa}(\lambda+B\mu)^2 \ge hn^{\varepsilon-\kappa}$ for some $h > 0$ and sufficiently large $n$. By the properties of the $\chi^2$ distribution and $\varepsilon > \kappa$ (by assumption in Theorem 4.6), the claim follows. This completes the proof.
Lemma 3.8.1 (Accessory 2 to lemma 3.5.2). If $Y \sim \chi^2_n$, then $P(n^{-\frac12}\sqrt{Y} > q) = O(\exp(-cn^{\varepsilon}))$ for some $q > 1$ and all $c > 0$.
Proof
$$P(n^{-\frac12}\sqrt{Y} > q) = P(n^{-1}Y > q^2) = P\Big(\frac1n\sum_{j=1}^n Y_j^2 > q^2\Big) \le \frac{\big(E\big(\exp(\tfrac{t}{n}Y_j^2)\big)\big)^n}{\exp(tq^2)},$$
by using the Markov inequality with the increasing function $\psi_t(x) = \exp(tx)$. The moment generating function of a $\chi^2_1$ variable is known to be $\psi(s) = (1-2s)^{-\frac12}$ for $0 \le s < \frac12$, and hence we obtain the upper bound
$$P(n^{-\frac12}\sqrt{Y} > q) \le \exp(-tq^2)\Big(1-\frac{2t}{n}\Big)^{-\frac{n}{2}} = \exp\Big(-tq^2 + \frac{n}{2}\log\Big(\frac{1}{1-\frac{2t}{n}}\Big)\Big).$$
Since the probability on the left-hand side does not depend on $t$, we can take the infimum over $t$ on the right (as long as the resulting $t$ satisfies the constraint $0 \le \frac{t}{n} < \frac12$ used above). This infimum is achieved at $t^* = \frac{n(q^2+1)}{2}$. This $t^*$ being too large (beyond the constraint region), together with the observation that $f'(t) < 0$ for smaller values of $t$, where $f(t) := -tq^2 + \frac{n}{2}\log\big(\frac{1}{1-\frac{2t}{n}}\big)$, tells us that a convenient choice for $t$ is $\frac{n}{4}$. With this choice, we get
$$P(n^{-\frac12}\sqrt{Y} > q) \le \exp\Big(-\frac{q^2n}{4} + \frac{n}{2}\log 2\Big).$$
If $q^2 = \eta^2 + 2\log 2$ for some $\eta > 0$ (so that $q > 1$), then this bound becomes $\le \exp\big(-\frac{\eta^2 n}{4}\big)$. Since $\eta^2 \ge 4cn^{\varepsilon-1}$ for large $n$ and for all constants $c$, $\exp\big(-\frac{\eta^2 n}{4}\big) \le \exp(-cn^{\varepsilon})$. Therefore
$$P(n^{-\frac12}\sqrt{Y} > q) = O(\exp(-cn^{\varepsilon}))\quad\text{for some } q > 1 \text{ and all } c > 0.$$
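As a quick sanity check of the Chernoff-type bound just derived (an illustration only, not part of the proof), one can compare it with the exact $\chi^2_n$ tail; with $q^2 = 2 + 2\log 2$ the bound reduces to $\exp(-n/2)$ and should dominate the exact probability.

```python
import numpy as np
from scipy.stats import chi2

q2 = 2.0 + 2.0 * np.log(2.0)            # q^2 = eta^2 + 2 log 2 with eta^2 = 2
for n in (50, 200, 1000):
    exact = chi2.sf(n * q2, df=n)        # P(Y > n q^2) for Y ~ chi^2_n
    bound = np.exp(-q2 * n / 4.0 + (n / 2.0) * np.log(2.0))   # equals exp(-n/2)
    print(n, exact, bound)               # the bound is looser but of the same exponential order
```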
Proof [Proof of theorem 3.5.3] The event $ne^{\lambda,\mu}_a \not\subseteq ne_a$ is equivalent to the event that there exists some node $b \in \Gamma(n)\setminus cl_a$ in the set of non-neighbors of node $a$ such that the estimated coefficient $\theta^{a,\lambda,\mu}_b$ is not zero. Thus,
$$P\big(ne^{\lambda,\mu}_a \subseteq ne_a\big) = 1 - P\big(\exists b \in \Gamma(n)\setminus cl_a : \theta^{a,\lambda,\mu}_b \neq 0\big).$$
Let $\mathcal{E}$ be the event that
$$\max_{k\in\Gamma(n)\setminus cl_a}\big|G_k\big(\theta^{a,ne_a,B,\lambda,\mu}\big)\big| < \lambda + B\mu.$$
Conditional on $\mathcal{E}$, it follows from Meinshausen-Buhlmann's discussion that $\theta^{a,ne_a,B,\lambda,\mu}$ is also a solution to the fused LASSO problem with $A = \Gamma(n)\setminus\{a\}$. As $\theta^{a,ne_a,B,\lambda,\mu}_b = 0$ for all $b \in \Gamma(n)\setminus cl_a$, it follows that $\theta^{a,\lambda,\mu}_b = 0$ for all $b \in \Gamma(n)\setminus cl_a$. By lemma 3.5.1,
$$P\big(\exists b \in \Gamma(n)\setminus cl_a : \theta^{a,\lambda,\mu}_b \neq 0\big) \le P\Big(\max_{k\in\Gamma(n)\setminus cl_a}\big|G_k\big(\theta^{a,ne_a,B,\lambda,\mu}\big)\big| \ge \lambda + B\mu\Big).$$
It suffices to show that there exists a constant $c > 0$ so that for all $b \in \Gamma(n)\setminus cl_a$,
$$P\Big(\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| \ge \lambda + B\mu\Big) = O(\exp(-cn^{\varepsilon})).$$
Write $X_b = \sum_{m\in ne_a}\theta^{b,ne_a}_m X_m + V_b$ for any $b \in \Gamma(n)\setminus cl_a$, where $V_b \sim N(0, \sigma^2_b)$ for some $\sigma^2_b \le 1$ and $V_b$ is independent of $\{X_m : m \in cl_a\}$. Hence,
$$G_b\big(\theta^{a,ne_a,\lambda,\mu}\big) = -2n^{-1}\sum_{m\in ne_a}\theta^{b,ne_a}_m\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, X_m\rangle - 2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle.$$
By lemma 3.5.2, there exists a $c > 0$ so that with probability $1 - O(\exp(-cn^{\varepsilon}))$,
$$\mathrm{sgn}\big(\theta^{a,ne_a,\lambda,\mu}_k\big) = \mathrm{sgn}\big(\theta^{a,ne_a}_k\big)\quad\forall k \in ne_a.$$
In this case, it holds by lemma 3.5.1 that
$$\Big|2n^{-1}\sum_{m\in ne_a}\theta^{b,ne_a}_m\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, X_m\rangle\Big| \le \Big|\sum_{m\in ne_a}\mathrm{sgn}\big(\theta^{a,ne_a}_m\big)\theta^{b,ne_a}_m\lambda\Big| + \Big|\sum_{m\in ne_a}\big[D_a'\mathrm{sgn}(D_a\theta)\big]_m\theta^{b,ne_a}_m\mu\Big| \le \delta_1\lambda + \delta_2 B\mu.$$
The absolute value of the coefficient $G_b$ is hence bounded by
$$\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| \le \delta_1\lambda + \delta_2 B\mu + \big|2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle\big|$$
with probability $1 - O(\exp(-cn^{\varepsilon}))$. Conditional on $X_{cl_a}$, the random variable $\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle$ is normally distributed with mean 0 and variance $\sigma^2_b\|X_a - X\theta^{a,ne_a,\lambda,\mu}\|^2$. By definition of $\theta^{a,ne_a,\lambda,\mu}$,
$$\|X_a - X\theta^{a,ne_a,\lambda,\mu}\|^2 \le \|X_a\|^2.$$
Since $\sigma^2_b \le 1$, $2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle$ is stochastically smaller than or equal to $|2n^{-1}\langle X_a, V_b\rangle|$. It remains to show that for some $c > 0$ and some $0 < \delta_1, \delta_2 < 1$,
$$P\big(|2n^{-1}\langle X_a, V_b\rangle| \ge (1-\delta_1)\lambda + (1-\delta_2)B\mu\big) = O(\exp(-cn^{\varepsilon})).$$
Since $X_a$ and $V_b$ are independent, $E(X_aV_b) = 0$. Using Gaussianity and the bounded variances of both $X_a$ and $V_b$, there exists some $g < \infty$ such that $E\big(\exp(|X_aV_b|)\big) < g$. Hence, using the Bernstein inequality and the boundedness of $\lambda$ and $\mu$, it holds for some $c > 0$ that for all $b \in \Gamma(n)\setminus cl_a$ (see lemma 3.5.7 for a detailed proof),
$$P\big(|2n^{-1}\langle X_a, V_b\rangle| \ge (1-\delta_1)\lambda + (1-\delta_2)B\mu\big) = O(\exp(-cn^{\varepsilon})),$$
which completes the proof.
• Lemma 3.5.7 delves into the precise constants of the above inequality and shows that, with a finite sample size, one can achieve a smaller type I error probability by using the neighborhood-fused lasso instead of the usual lasso.
Proof [Proof of proposition 3.5.4] We follow the proof of theorem 3.5.3 and claim that, under the above assumptions, for all $a, b$ with $b \notin ne_a$,
$$P\big(\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| > \lambda + B\mu\big) \to 1\quad\text{for } n\to\infty.$$
Following similar arguments afterwards, it can be concluded that for some $\delta_1 > 1$ and $\delta_2 > 1$,
$$P\big(\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| \ge \delta_1\lambda + \delta_2 B\mu - |2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle|\big) \to 1\quad\text{as } n\to\infty.$$
It holds for the third term that for any $g_1 > 0$, $g_2 > 0$,
$$P\big(|2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle| > g_1\lambda + g_2 B\mu\big) \to 0\quad\text{as } n\to\infty,$$
which combined with the previous result proves the proposition.
Proof [Proof of theorem 3.5.5] Observe that
$$P\big(ne_a \subseteq ne^{\lambda,\mu}_a\big) = 1 - P\big(\exists b \in ne_a : \theta^{a,\lambda,\mu}_b = 0\big).$$
Let $\mathcal{E}$ be the event
$$\max_{k\in\Gamma(n)\setminus cl_a}\big|G_k\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| < \lambda + B\mu.$$
On $\mathcal{E}$, following similar arguments as before, we can conclude that $\theta^{a,ne_a,\lambda,\mu} = \theta^{a,\lambda,\mu}$. Therefore
$$P\big(\exists b \in ne_a : \theta^{a,\lambda,\mu}_b = 0\big) \le P\big(\exists b \in ne_a : \theta^{a,ne_a,\lambda,\mu}_b = 0\big) + P(\mathcal{E}^c).$$
It follows from the proof of Theorem 4.6 that there exists some $c > 0$ so that $P(\mathcal{E}^c) = O(\exp(-cn^{\varepsilon}))$. Using Bonferroni's inequality, it hence remains to show that there exists some $c > 0$ so that for all $b \in ne_a$,
$$P\big(\theta^{a,ne_a,\lambda,\mu}_b = 0\big) = O(\exp(-cn^{\varepsilon})),$$
which follows from lemma 3.5.2.
Proof [Proof of theorem 3.5.6] $C^{\lambda,\mu}_a \not\subseteq C_a$ implies the existence of an edge in the estimated neighborhood that connects two nodes lying in two different connectivity components of the true underlying graph. Hence,
$$P\big(\exists a \in \Gamma(n),\ \exists b \in \Gamma(n)\setminus C_a : b \in ne^{\lambda,\mu}_a\big) \le p(n)\max_a P\big(\exists b \in \Gamma(n)\setminus C_a : b \in ne^{\lambda,\mu}_a\big).$$
Going by the same arguments used in proving theorem 4.6, we have
$$P\big(\exists b \in \Gamma(n)\setminus C_a : b \in ne^{\lambda,\mu}_a\big) \le P\Big(\max_{b\in\Gamma(n)\setminus C_a}\big|G_b\big(\theta^{a,C_a,\lambda,\mu}\big)\big| \ge \lambda + B\mu\Big).$$
Thus, it is sufficient to bound
$$p(n)^2\max_{a\in\Gamma(n),\ b\in\Gamma(n)\setminus C_a} P\big(\big|G_b\big(\theta^{a,C_a,\lambda,\mu}\big)\big| \ge \lambda + B\mu\big).$$
Now observe that, since $X_b$ and $\{X_k : k \in C_a\}$ lie in different connectivity components, they are, in fact, independent. Therefore, conditional on $X_{C_a}$,
$$G_b\big(\theta^{a,C_a,\lambda,\mu}\big) \sim N\Big(0,\ \frac{4\|X_a - X\theta^{a,C_a,\lambda,\mu}\|^2}{n^2}\Big),$$
making it stochastically smaller than $Z \sim N\big(0, \frac{4\|X_a\|^2}{n^2}\big)$. Hence it holds for all $a \in \Gamma(n)$ and $b \in \Gamma(n)\setminus C_a$ that
$$P\big(\big|G_b\big(\theta^{a,C_a,\lambda,\mu}\big)\big| > \lambda + B\mu\big) \le 2\tilde\Phi\Big(\frac{\sqrt{n}(\lambda+B\mu)}{2\sigma_a}\Big),$$
where $\tilde\Phi = 1 - \Phi$. Using the $\lambda$ and $\mu$ proposed, the right-hand side becomes $\frac{\alpha}{p(n)^2}$.
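Inverting the Gaussian tail bound displayed above gives a simple recipe for the combined penalty level at a prescribed significance level α: it suffices that λ + Bµ = (2σ_a/√n) Φ⁻¹(1 − α/(2p(n)²)). The snippet below evaluates this quantity; using the empirical standard deviation of X_a as a plug-in for σ_a is an assumption made here purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def combined_penalty_level(Xa, p, alpha):
    """lambda + B*mu making 2*(1 - Phi(sqrt(n)*(lambda+B*mu)/(2*sigma_a))) equal alpha/p^2."""
    n = len(Xa)
    sigma_a = np.std(Xa)                      # assumed plug-in estimate of sigma_a
    return 2.0 * sigma_a / np.sqrt(n) * norm.ppf(1.0 - alpha / (2.0 * p ** 2))
```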
Proof [Proof of lemma 3.5.7] This is a direct application of the Bernstein inequality. Since $\lambda \to 0$ and $\mu \to 0$ as $n \to \infty$, for large enough $n$ we have, by a version of Bernstein's inequality, that
$$P\big(\big|2n^{-1}\langle X_a, V_b\rangle\big| \ge (1-\delta_1)\lambda + (1-\delta_2)B\mu\big) \le \exp\left(-\frac{\big(\frac{d_1}{2}(1-\delta_1)n^{\frac{1+\varepsilon}{2}} + \frac{d_2}{2}(1-\delta_2)n^{\frac{1+\varepsilon}{2}+\beta_0}\big)^2}{nE(X^2_{a,i}V^2_{b,i})}\right) = \exp\left[-\frac{1}{\sigma^2_*}\Big(\frac{d_1}{2}(1-\delta_1) + \frac{d_2}{2}(1-\delta_2)n^{\beta_0}\Big)^2 n^{\varepsilon}\right].$$
This proves part (a). Expanding the above, we get
$$= \exp\left[-\frac{1}{\sigma^2_*}\Big(\frac{d_1^2}{4}(1-\delta_1)^2 n^{\varepsilon}\Big)\right]\cdot\exp\left[-\frac{1}{\sigma^2_*}\Big(\frac{d_1d_2}{2}(1-\delta_1)(1-\delta_2)n^{\beta_0} + \frac{d_2^2}{4}(1-\delta_2)^2 n^{2\beta_0}\Big)n^{\varepsilon}\right],$$
which proves part (b), since the first factor is the upper bound for the lasso.
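A quick numerical reading of part (b): for illustrative (hypothetical) values σ_*² = 1, d₁ = d₂ = 1, δ₁ = δ₂ = 0.5, n = 100, ε = 0.2 and β₀ = 0.1, the extra factor contributed by the fused penalty can be evaluated directly, showing how much smaller the neighborhood-fused-lasso bound is than the plain lasso bound.

```python
import numpy as np

sigma2, d1, d2, delta1, delta2 = 1.0, 1.0, 1.0, 0.5, 0.5
n, eps, beta0 = 100, 0.2, 0.1
ne = n ** eps

lasso_bound = np.exp(-(d1 ** 2 / 4) * (1 - delta1) ** 2 * ne / sigma2)
extra = np.exp(-((d1 * d2 / 2) * (1 - delta1) * (1 - delta2) * n ** beta0
                 + (d2 ** 2 / 4) * (1 - delta2) ** 2 * n ** (2 * beta0)) * ne / sigma2)
nfl_bound = lasso_bound * extra        # part (b): strictly smaller than the lasso bound
print(lasso_bound, nfl_bound)
```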
Proof [Proof of lemma 3.6.1] Since $\theta^{\lambda,\mu}_a$ is the fused lasso minimizer, we get
$$\frac1n\|X_a - X^a\theta^{\lambda,\mu}_a\|^2 + \lambda\|\theta^{\lambda,\mu}_a\|_1 + \mu\|D_a\theta^{\lambda,\mu}_a\|_1 \le \frac1n\|X_a - X^a\theta^0_a\|^2 + \lambda\|\theta^0_a\|_1 + \mu\|D_a\theta^0_a\|_1.$$
Plugging in $X_a = X^a\theta^0_a + \varepsilon_a$, we get
$$\frac1n\|X^a(\theta^{\lambda,\mu}_a - \theta^0_a)\|^2 - \frac2n\varepsilon_a'X^a(\theta^{\lambda,\mu}_a - \theta^0_a) + \frac1n\|\varepsilon_a\|^2 + \lambda\|\theta^{\lambda,\mu}_a\|_1 + \mu\|D_a\theta^{\lambda,\mu}_a\|_1 \le \frac1n\|\varepsilon_a\|^2 + \lambda\|\theta^0_a\|_1 + \mu\|D_a\theta^0_a\|_1.$$
Rewriting the above inequality yields the desired lemma.
Proof [Proof of lemma 3.6.3] We have $E(\varepsilon_{a,i}X^a_{j,i}) = 0$ for all $i = 1, 2, \cdots, n$. Also,
$$\frac1n\sum_{i=1}^n E\big[\big|\varepsilon_{a,i}X^a_{j,i}\big|^k\big] = \frac1n\sum_{i=1}^n\big[E\big|\varepsilon^k_{a,i}\big|\,E\big|X^a_{j,i}\big|^k\big] = E\big|\varepsilon^k_{a,1}\big|\,E\big(\big|X^a_{j,1}\big|^k\big).$$
Using the fact that if $Z \sim N(0, \sigma^2)$ then $E(|Z|^k) = \sigma^k\cdot\frac{2^{k/2}\Gamma(\frac{k+1}{2})}{\sqrt{\pi}}$, the above equals
$$\left[\sigma_2^k\cdot\frac{2^{k/2}\Gamma(\frac{k+1}{2})}{\sqrt{\pi}}\right]\cdot\left[(\sigma_{jj})^{k/2}\cdot\frac{2^{k/2}\Gamma(\frac{k+1}{2})}{\sqrt{\pi}}\right] = \frac{(2\sigma_2\sqrt{\sigma_{jj}})^k}{\pi}\left[\Gamma\Big(\frac{k+1}{2}\Big)\right]^2 \le \frac{(2\sigma_2\sqrt{\sigma_{jj}})^k}{\pi}\,\frac{2^{2-k}}{k+1}\,\Gamma(k+1) = \frac{4\sigma_2^2\sigma_{jj}}{\pi(k+1)}\,(\sigma_2\sqrt{\sigma_{jj}})^{k-2}\,k!.$$
Under the assumption that all the population variances are 1, the above is $\le \frac{k!}{2}$.
We assumed $\sigma_2^2 = \sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ab}'$ and used the following facts:
• $\beta\big(\frac{k+1}{2}, \frac{k+1}{2}\big) = \frac{\Gamma(\frac{k+1}{2})^2}{\Gamma(k+1)}$
• $\beta(x, x) = 2^{1-2x}\beta\big(x, \frac12\big)$
• If $(a-1)(b-1) \le 0$, then $\beta(a, b) \le \frac{1}{ab}$
Therefore, using Bernstein's inequality, we get for $t > 0$
$$P\left(\frac1n\sum_{i=1}^n\varepsilon_{a,i}X^a_{j,i} \ge t + \sqrt{2t}\right) \le \exp(-nt).$$
Hence
$$P(\Lambda_a^c) = P\left[\max_{\substack{1\le j\le p\\ j\neq a}}\frac2n\big|\varepsilon_a'X^a_j\big| > \lambda_0 + B\mu_0\right] \le \sum_{\substack{j=1\\ j\neq a}}^{p} P\left[\Big|\frac2n\sum_{i=1}^n\varepsilon_{a,i}X^a_{j,i}\Big| > \lambda_0 + B\mu_0\right] = 2\sum_{\substack{j=1\\ j\neq a}}^{p} P\left[\frac1n\sum_{i=1}^n\varepsilon_{a,i}X^a_{j,i} > \frac{\lambda_0 + B\mu_0}{2}\right].$$
If we choose $t = \frac{\lambda_0+B\mu_0}{2} + 1 - \sqrt{\lambda_0 + B\mu_0 + 1}$, then $t + \sqrt{2t} = \frac{\lambda_0+B\mu_0}{2}$. Therefore, using the above result, we get
$$P(\Lambda_a^c) \le 2(p-1)\exp\left[-n\Big(\frac{\lambda_0+B\mu_0}{2} + 1 - \sqrt{\lambda_0+B\mu_0+1}\Big)\right] \le 2\exp\left[\log p - n\Big(\frac{\lambda_0+B\mu_0}{2} + 1 - \sqrt{\lambda_0+B\mu_0+1}\Big)\right] \le 2\exp\left[\log p - n\Big(\frac{\lambda_0}{2} + 1 - \sqrt{\lambda_0+B\mu_0+1}\Big)\right].$$
Alternatively, if we take $\lambda_0 = \frac{2(t+\log p)}{n}$ and $\mu_0 = \frac2B\sqrt{\frac{2(t+\log p)}{n}}$, we get
$$P(\Lambda_a^c) \le 2\exp(-t).$$
Proof [Proof of lemma 3.6.5] The proof exploits the fact that $\theta^{\lambda,\mu}_a$ is the minimizer of the penalized least squares criterion, by setting the sub-differential of the objective function equal to 0. We start by substituting the assumed linear truth, i.e., $X_a = X^a\theta^0_a + \varepsilon_a$. Therefore, we get
$$\frac{\partial}{\partial\theta_a}\left[\frac1n\|X^a\theta^0_a - X^a\theta_a\|^2 + \frac1n\|\varepsilon_a\|^2 + \frac2n\varepsilon_a'X^a(\theta^0_a - \theta_a) + \lambda\,\mathrm{sgn}(\theta_a) + \mu\big(D_a'\mathrm{sgn}(D_a\theta_a)\big)\right]_{\theta_a = \theta^{\lambda,\mu}_a} = 0$$
$$\Rightarrow\ \frac2n\big(X^{a\prime}X^a\theta^{\lambda,\mu}_a - X^{a\prime}X^a\theta^0_a\big) - \frac2nX^{a\prime}\varepsilon_a + \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) + \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big) = 0$$
$$\Rightarrow\ X^{a\prime}X^a\theta^{\lambda,\mu}_a = X^{a\prime}X^a\theta^0_a + \frac n2\Big(\frac2nX^{a\prime}\varepsilon_a\Big) - \frac{n\lambda}{2}\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \frac{n\mu}{2}\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)$$
$$\Rightarrow\ \theta^{\lambda,\mu}_a = \big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a + \frac n2\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big).$$
With the given choices for $\lambda$ and $\mu$, if we define $\Lambda_a = \Big\{\max_{1\le j\le p,\ j\neq a}\frac2n\big|\varepsilon_a'X^a_j\big| \le \lambda + B\mu\Big\}$, then we have, with probability $1 - \exp(-t^2)$,
$$\left|\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\right| \le (\lambda+B\mu)\mathbf{1}_p + \lambda\mathbf{1}_p + B\mu\mathbf{1}_p = 2(\lambda+B\mu)\mathbf{1}_p,$$
where the inequality is to be interpreted componentwise and $\mathbf{1}_p$ is a vector of size $p$ consisting of 1's. Hence, we get
$$\|\theta^{\lambda,\mu}_a\|_1 = \left\|\big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a + \frac n2\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big)\right\|_1$$
$$\le \left\|\big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a\right\|_1 + \frac n2\left\|\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big)\right\|_1$$
$$\le \sqrt{p}\left\|\big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a\right\|_2 + \frac{n\sqrt{p}}{2}\left\|\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big)\right\|_2.$$
Using the facts that $\|Ax\|_2 \le s_{\max}\|x\|_2$, where $s_{\max}$ is the maximum singular value of $A$, and that $A^+A$ is idempotent, and hence its singular values are either 0 or 1, the above is
$$\le \sqrt{p}\|\theta^0_a\|_2 + \frac{n\sqrt{p}}{2\delta_{\min}}\left\|\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\right\|_2$$
$$\le \sqrt{p}\|\theta^0_a\|_2 + \frac{np}{2\delta_{\min}}\left\|\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\right\|_{\infty} \le \sqrt{p}\|\theta^0_a\|_2 + \frac{np(\lambda+B\mu)}{\delta_{\min}}.$$
Plugging in the result obtained from the previous corollary, we get
$$P\left(\frac1n\|X^a(\theta^{\lambda,\mu}_a - \theta^0_a)\|^2 \le 2(\lambda+B\mu)\|\theta^0_a\|_1 + \mu B\Big(\sqrt{p}\|\theta^0_a\|_2 + \frac{np(\lambda+B\mu)}{\delta_{\min}}\Big)\right) \ge 1 - \exp(-t^2)$$
$$\Rightarrow\ P\left(\frac1n\|X^a(\theta^{\lambda,\mu}_a - \theta^0_a)\|^2 \le 2\Big(\lambda + B\mu\Big(1 + \frac{\sqrt{p}}{2}\Big)\Big)\|\theta^0_a\|_1 + \frac{npB\mu(\lambda+B\mu)}{\delta_{\min}}\right) \ge 1 - \exp(-t^2).$$
Proof [Proof of lemma 3.6.6] To start with, we derive a series of inequalities and apply them to the basic inequality. We have
$$\|\theta^{\lambda,\mu}_a\|_1 = \|\theta^{\lambda,\mu}_{a,S_0}\|_1 + \|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 \ge \|\theta^0_{a,S_0}\|_1 - \|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \|\theta^{\lambda,\mu}_{a,S_0^c}\|_1,$$
$$\|\theta^{\lambda,\mu}_a - \theta^0_a\|_1 = \|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \|\theta^{\lambda,\mu}_{a,S_0^c}\|_1.$$
We write $\theta^{\lambda,\mu}_{a,S_0} = \big(\theta^{\lambda,\mu}_{a,S_0,1}, 0\big)'$ and $\theta^{\lambda,\mu}_{a,S_0^c} = \big(0, \theta^{\lambda,\mu}_{a,S_0^c,1}\big)'$, so that $\theta^{\lambda,\mu}_a = \big(\theta^{\lambda,\mu}_{a,S_0,1}, \theta^{\lambda,\mu}_{a,S_0^c,1}\big)'$. Similarly we write $\theta^0_a = \theta^0_{a,S_0} = \big(\theta^0_{a,S_0,1}, 0\big)'$. Hence, we get
$$\|D_a\theta^{\lambda,\mu}_a\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0} & 0\\ D^a_{S_0,0} & D^a_{0,S_0^c}\\ 0 & D^a_{S_0^c,S_0^c}\end{pmatrix}\begin{pmatrix}\theta^{\lambda,\mu}_{a,S_0,1}\\ \theta^{\lambda,\mu}_{a,S_0^c,1}\end{pmatrix}\right\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0}\theta^{\lambda,\mu}_{a,S_0,1}\\ D^a_{S_0,0}\theta^{\lambda,\mu}_{a,S_0,1} + D^a_{0,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\\ D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\end{pmatrix}\right\|_1$$
$$\ge \|D^a_{S_0,S_0}\theta^{\lambda,\mu}_{a,S_0,1}\|_1 + \|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 + \|D^a_{S_0,0}\theta^{\lambda,\mu}_{a,S_0,1}\|_1 - \|D^a_{0,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1$$
$$\ge \|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 - \|D^a_{S_0,S_0}\big(\theta^{\lambda,\mu}_{a,S_0,1} - \theta^0_{a,S_0,1}\big)\|_1 + \|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 + \|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1 - \|D^a_{S_0,0}\big(\theta^{\lambda,\mu}_{a,S_0,1} - \theta^0_{a,S_0,1}\big)\|_1 - \|D^a_{0,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1$$
$$\ge \|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1 - 2B\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 - B\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1.$$
Similarly, we get
$$\|D_a\theta^0_a\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0} & 0\\ D^a_{S_0,0} & D^a_{0,S_0^c}\\ 0 & D^a_{S_0^c,S_0^c}\end{pmatrix}\begin{pmatrix}\theta^0_{a,S_0,1}\\ 0\end{pmatrix}\right\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0}\theta^0_{a,S_0,1}\\ D^a_{S_0,0}\theta^0_{a,S_0,1}\\ 0\end{pmatrix}\right\|_1 = \|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1.$$
Plugging into the basic inequality, we get on $\Lambda_a$, with $\lambda \ge 2\lambda_0$ and $\mu \ge 2\mu_0$,
$$\frac1n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + \lambda\|\theta^0_{a,S_0}\|_1 - \lambda\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \lambda\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 + \mu\|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \mu\|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1 - 2B\mu\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \mu\|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 - B\mu\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1$$
$$\le \frac{\lambda+B\mu}{2}\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \frac{\lambda+B\mu}{2}\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 + \lambda\|\theta^0_{a,S_0}\|_1 + \mu\|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \mu\|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1.$$
From this, we get
$$\frac2n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 \le (3\lambda + 5B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1.$$
Proof [Proof of lemma 3.6.8] The proof is a continuation of what we have already shown. We have
$$\frac2n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_a - \theta^0_a\|_1 = \frac2n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1$$
$$\le (3\lambda + 5B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 = 2(2\lambda + B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1$$
$$\le \frac{2\sqrt{s_0}(2\lambda + B\mu)}{\sqrt{n}\,\phi_{0,a}}\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\| \le \frac{s_0(2\lambda + B\mu)^2}{\phi^2_{0,a}} + \frac1n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2.$$
Chapter 4
Smoothing of Diffusion Tensor
MRI Images∗
4.1 Introduction
Diffusion-weighted magnetic resonance imaging (MRI) has emerged as an important field of medical research in general, and of neuroscience in particular. MRI technology boosted clinical neuro-diagnostics and gave rise to new applications for studying the anatomy of the human brain in vivo and in a non-invasive manner. Modern clinical applications of diffusion tensor MRI are primarily based on contrast studies: for example, contrast in relaxation times for T1- or T2-weighted MR imaging, in time of flight for MR angiography, in blood oxygen level dependency for functional MR imaging, and in diffusion for apparent diffusion coefficient (ADC) imaging[41]. More advanced applications of diffusion-weighted MRI involve finding connectivity in human tissues where isotropic water movement is apparent. However, DW-MRI has been successfully applied to the anisotropic diffusion of water molecules owing to the internal fiber structure inside a tissue. The theoretical understanding of diffusion MRI has not developed much over the years; however, sophisticated data acquisition tools are available. Nevertheless, it is important for a medical practitioner to understand the basic principles of these tools to some degree. Diffusion tensor imaging is immensely successful in detecting neurological disorders because it can reveal abnormalities in white matter fiber structures and help build models for neural connectivity in the brain. The ability to visualize these connections has led neuroscientists to undertake the so-called Human Brain Connectome project[42].
∗This is a collaborative work with Prof. Owen Carmichael, Prof. Debashis Paul and Prof. Jie Peng.
4.2 Principles of Diffusion MRI
With its congregation of billions of neurons, the human brain is one of the most complex natural systems. Studying and identifying crucial biological phenomena in the human brain depends heavily on proper imaging technology. One common imaging technique, usually applied to animals, is histology followed by examination with light or electron microscopy. It is commonly performed by examining cells and tissues after sectioning and staining. Staining highlights the locations of proteins and genes of interest, and electron microscopy helps us analyze them at the molecular level. However, histology, being invasive and labor-intensive, is not suitable for analyzing the human brain. MRI, on the other hand, is noninvasive and quick to perform, and produces a condensed three-dimensional image of the brain. While compressed anatomical information is beneficial, a lot of biological detail is lost, which results in reduced specificity of the obtained information. Thus, going beyond conventional MRI technology to capture more detailed anatomy has been one of the primary objectives of the MRI research community. In an effort to achieve this goal, Diffusion Tensor MRI was introduced in the mid '90s[48]. Unlike conventional MRI, it can delineate the axonal organization of the brain. See [43-47] for applications of DT-MRI in animal studies.
4.2.1 Shortcomings of Conventional MRI and Contrast Generation
There are some problems with conventional MRI technology that make it a poor candidate for anatomy learning in complex tissues. Among these, the most critical ones are spatial resolution, contrast and, often, the size of the data. Theoretically, the MR image resolution should be approximately 10 µm because of the movement of water molecules during the exposure time[49]. However, this resolution is hard to achieve in practice because the obtained signal is hard to distinguish from noise, and hence a long exposure is required to detect weak signals. For in vivo studies, exposure time is typically short, which limits the workable resolution. For ex vivo studies, however, this does not pose a problem. Still, the data obtained from an MRI study are, more often than not, so voluminous that even with cutting-edge computing technology they can be hard to deal with.
A crucial drawback of conventional MR imaging is its lack of contrast. Two different anatomical regions having water molecules with similar properties will generate exactly the same image value, thereby failing to produce meaningful contrast. One way around this is to build contrast from physical properties of the water molecules: the proton density ($p_0$), the $T_1$ and $T_2$ relaxation times and the diffusion coefficient ($D$). A mathematical relationship between these parameters and the measured intensity is given by
$$S = p_0\big(1 - e^{-c/T_1}\big)e^{-d/T_2}e^{-bD}, \qquad (4.2.1)$$
where $b$, $c$ and $d$ are certain imaging constants that may depend on environmental factors like viscosity and the existence of macromolecules[49]. In this chapter, the diffusion factor will be our main concern; we will typically assume that the remaining factors stay constant during experiments.
4.2.2 Water Diffusion and Its Importance in MR Imaging
Diffusion is the random translational thermal motion of water molecules. Diffusion tensor imaging of a live (or dead, but not frozen) brain is a probe into its anatomical structure based on the movement of water molecules diffusing along the paths of the neuron bundles. Following [49], an interesting analogy is the shape of a spreading ink blot on a wet paper or cloth. The shape of the ink stain is regulated by the underlying fiber structure of the paper or cloth. If the stain is circular, the diffusion is said to be isotropic, whereas an elongated, pear-shaped stain is indicative of anisotropic diffusion. In DTI, this anisotropy is used to find the alignment of axonal bundles in the brain. When we characterize anisotropic diffusion, an entirely new image contrast is generated which is based on structural orientation[50-52].
A magnetic field is applied along a gradient and a diffusion-weighted image is obtained. However, this measurement is only along the gradient axis, so it is not possible to recover the three-dimensional axon bundle structure by applying the magnetic field in one direction. That is why multiple directions are used, and the measured intensity is related to the three-dimensional diffusion tensor matrix $D$ by the equation
$$S = S_0 e^{-bq^TDq}, \qquad (4.2.2)$$
where $S_0$ is a constant and $q$ is the applied gradient direction. See [48] for more details. In the diffusion tensor framework, a 3D ellipsoid is constructed which represents the average diffusion distance along each coordinate axis. The axis lengths of the ellipsoid denote the diffusion magnitudes and are given by the eigenvalues of the diffusion tensor matrix $D$; the eigenvectors determine their orientations.
A measure of anisotropy is given by quantifying the relative differences in the eigenvalues $\lambda_1$, $\lambda_2$ and $\lambda_3$. A popular metric is the fractional anisotropy, given by
$$FA := \sqrt{\frac{(\lambda_1-\lambda_2)^2 + (\lambda_2-\lambda_3)^2 + (\lambda_3-\lambda_1)^2}{2(\lambda_1^2 + \lambda_2^2 + \lambda_3^2)}}.$$
For isotropic tensors, $\lambda_1 = \lambda_2 = \lambda_3$ and hence $FA = 0$; for a highly anisotropic tensor, $FA$ is close to 1. There are a number of ways to represent the tensor information graphically. One can represent the FA values as a grayscale image, but that does not carry the directionality information. An alternative is to represent the FA values in a color-coded format where the colors indicate the directions of the tensors.
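A small sketch of the fractional anisotropy computation, applied to two hypothetical tensors, illustrates the two extremes just described.

```python
import numpy as np

def fractional_anisotropy(D):
    """FA of a 3x3 symmetric diffusion tensor, per the formula above."""
    lam = np.linalg.eigvalsh(D)                                  # eigenvalues
    num = ((lam[0] - lam[1]) ** 2 + (lam[1] - lam[2]) ** 2
           + (lam[2] - lam[0]) ** 2)
    den = 2.0 * np.sum(lam ** 2)
    return np.sqrt(num / den)

print(fractional_anisotropy(np.eye(3)))                          # isotropic: FA = 0
print(fractional_anisotropy(np.diag([1.0, 0.05, 0.05])))         # elongated: FA close to 1
```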
4.3 Tensor Smoothing
Removing noise from diffusion tensor images is quite challenging[53-56]. Tracing anatomical structures is extremely difficult in the presence of noise, which is of a complex Rician nature[53]. Among the different approaches for noise removal from diffusion-weighted images, smoothing with the help of Karcher means[57-59] is the most popular; this is equivalent to kernel smoothing in the tensor field. Yuan et al.[60] extended the weighted Karcher-mean approach to local polynomial smoothing in the tensor space with respect to different geometries. Carmichael et al.[61] studied the Karcher mean approach and analyzed the performance of linear and nonlinear estimators during regression, followed by a demonstration of the effect of Euclidean and geometric smoothers under various local structures of the underlying tensor field and different noise levels. They found that under certain interactions of the tensor field geometry and the noise level, Euclidean smoothers work better than geometric smoothers, contrary to popular belief. In the following, we introduce some notation and describe the two-stage smoothing discussed above.
4.3.1 Two-Stage Smoothing
The problem is to estimate the diffusion tensor $D$ based on the observed DWI intensities $\{S_q : q \in Q\}$. We denote $d = \mathrm{vec}(D) = (D_{11}, D_{22}, D_{33}, D_{12}, D_{13}, D_{23})^T$ and $x_q = (q_1^2, q_2^2, q_3^2, 2q_1q_2, 2q_1q_3, 2q_2q_3)$. Also, following [61], we assume that the choice of gradient directions ensures that $\sum_{q\in Q}x_qx_q^T$ is well-conditioned. The linear regression estimator is given by
$$D_{LS} := \mathrm{argmin}_{d\in\mathbb{R}^6}\sum_{q\in Q}\big(\log S_q - \log S_0 + x_q^Td\big)^2$$
and the nonlinear regression estimator is given by
$$D_{NL} := \mathrm{argmin}_{d\in\mathbb{R}^6}\sum_{q\in Q}\big(S_q - S_0\exp(-x_q^Td)\big)^2.$$
Even though the positive definiteness constraint is not imposed, it has been found that most of the time the resulting solution is positive definite; if not, one can set the negative eigenvalues to 0. The nonlinear regression problem can be solved using the Levenberg-Marquardt method. It has been proved in the literature that $D_{NL}$ is better than $D_{LS}$ in terms of bias[62] and asymptotic efficiency[61] as the noise variance goes to 0.
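A minimal sketch of the log-domain (linear) fit is given below; the b-value is assumed to be absorbed into the design vectors, and the symmetric tensor is rebuilt from the fitted vector d = (D11, D22, D33, D12, D13, D23).

```python
import numpy as np

def design_row(q):
    """x_q = (q1^2, q2^2, q3^2, 2q1q2, 2q1q3, 2q2q3) for a gradient direction q."""
    q1, q2, q3 = q
    return np.array([q1 * q1, q2 * q2, q3 * q3,
                     2 * q1 * q2, 2 * q1 * q3, 2 * q2 * q3])

def fit_tensor_ls(Q, S, S0):
    """Least-squares solution of sum_q (log S_q - log S_0 + x_q^T d)^2."""
    X = np.array([design_row(q) for q in Q])
    y = np.log(S0) - np.log(np.asarray(S))          # so that X d ≈ y
    d, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.array([[d[0], d[3], d[4]],
                     [d[3], d[1], d[5]],
                     [d[4], d[5], d[2]]])
```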
After the initial estimates are obtained, the smoothing is carried out on the space of $3\times3$ positive definite matrices, denoted by $\mathcal{P}_3$. In Euclidean space, suppose $f$ is a function from $\mathbb{R}^k$ to $\mathbb{R}^p$ and we have observations $\{(s_i, X_i)\}_{i=1}^n$, where $s_i$ denotes the position and $X_i$ the noise-corrupted observation of $f(s_i)$. A standard way of smoothing is to take the weighted average
$$\widehat{f}_E(s) = \frac{\sum_{i=1}^n w_i(s)X_i}{\sum_{i=1}^n w_i(s)}. \qquad (4.3.1)$$
The weights $w_i(s)$ are usually given by a nonnegative, integrable kernel $K_h(\cdot)$, where $h > 0$ denotes the bandwidth, which determines the size of the local neighborhood over which the smoothing is carried out. To be specific,
$$w_i(s) := K_h(\|s_i - s\|),\quad i = 1, \cdots, n.$$
Another way to interpret the solution is to view it as the minimizer of the weighted criterion
$$\widehat{f}(s) := \mathrm{argmin}_{c\in\mathcal{P}}\sum_{i=1}^n w_i(s)\,d^2_{\mathcal{P}}(X_i, c). \qquad (4.3.2)$$
Kernel smoothing thus takes the form of a weighted Karcher mean of the $X_i$'s[63]. From 4.3.2, it should be clear that different distance metrics on the manifold $\mathcal{P}$ may lead to different kernel smoothers. Imposing a Euclidean metric, we get the previous solution. Among other metrics, some popular choices are the log-Euclidean and the affine invariant metric.
The log-Euclidean metric is defined as
$$d_{LE}(X, Y) := \|\log X - \log Y\|,$$
where $\log$ denotes the matrix logarithm. With this distance metric, the solution to 4.3.2 is given by
$$\widehat{f}_{LE}(s) = \exp\left(\frac{\sum_{i=1}^n w_i(s)\log(X_i)}{\sum_{i=1}^n w_i(s)}\right). \qquad (4.3.3)$$
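A compact sketch of the two smoothers (4.3.1) and (4.3.3) with a Gaussian kernel is given below; the matrix logarithm and exponential are computed through an eigendecomposition, which is valid for symmetric positive definite inputs.

```python
import numpy as np

def spd_log(X):
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def spd_exp(X):
    w, V = np.linalg.eigh(X)
    return (V * np.exp(w)) @ V.T

def gaussian_weights(positions, s, h):
    d = np.linalg.norm(positions - s, axis=1)
    return np.exp(-0.5 * (d / h) ** 2)

def euclidean_smooth(positions, tensors, s, h):          # equation (4.3.1)
    w = gaussian_weights(positions, s, h)
    return np.tensordot(w, tensors, axes=1) / w.sum()

def log_euclidean_smooth(positions, tensors, s, h):      # equation (4.3.3)
    w = gaussian_weights(positions, s, h)
    logs = np.array([spd_log(X) for X in tensors])
    return spd_exp(np.tensordot(w, logs, axes=1) / w.sum())
```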
The affine invariant metric can be defined on the space of positive definite matrices, as they constitute a naturally reductive homogeneous space[64]. It is defined as
$$d_{Aff}(X, Y) := \Big[\mathrm{tr}\Big\{\big(\log\big(X^{-1/2}YX^{-1/2}\big)\big)^2\Big\}\Big]^{\frac12}.$$
The solution to 4.3.2 under $d_{Aff}$ is not available in closed form; gradient descent algorithms are usually used to compute it. In the literature, the Euclidean method is often criticized for its swelling effect, where the determinant of the average tensor is larger than the determinants of the tensors being averaged. However, in terms of estimation accuracy there is no clear winner in all circumstances: the performance depends heavily on the noise level and the local geometric structures.
4.3.2 Locally Constant Smoothing with Conjugate Gradient
Descent Algorithm
In this section, we reformulate the smoothing problem. Instead of the two-stage procedure (regression followed by smoothing), we consider an appropriately weighted objective function to minimize. In order to carry out the optimization, we apply a geometric version of the conjugate gradient descent algorithm. The principal idea is to introduce the weights into the objective function itself and to find the optimal solution under the positive definiteness constraint.
To be specific, we want to find the minimizer $D$ of
$$f_1(D) = \sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big[\log S_{j,i} - \log\big(S_0\exp(-br_i^TDr_i)\big)\big]^2 = \sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big[\log S_{j,i} - \log S_0 + br_i^TDr_i\big]^2, \qquad (4.3.4)$$
where $b > 0$ is given, $r_i \in \mathbb{R}^3$ are given vectors (gradient directions), $S_{j,i}$ is the signal intensity of the $j$th voxel in the $i$th gradient direction and $w_{j,i}$ is the weight for the $j$th voxel in the $i$th gradient direction. Without the log-transform, the objective function would be
$$f_2(D) = \sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big[S_{j,i} - S_0\exp\big(-br_i^TDr_i\big)\big]^2. \qquad (4.3.5)$$
Depending on the choice of objective function, the optimization problem can be formulated as
$$D_i = \mathrm{argmin}_{D\in\mathcal{P}}\,f_i(D)\quad\text{for } i = 1, 2.$$
The corresponding solution $D_1$ or $D_2$ is then the smoothed estimator of the underlying tensor. However, the optimization is carried out with the additional constraint of positive definiteness, which requires incorporating the geometry of this Riemannian surface in order to develop a conjugate gradient algorithm on it. For this, we need to compute the gradient and Hessian of the objective functions, which are given below.
$$\nabla f_1(D) = 2b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big(\log S_{j,i} - \log S_0 + br_i^TDr_i\big)\,Dr_ir_i^TD \qquad (4.3.6)$$
$$\mathrm{Hess}(f_1)(X, Y) = 2b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\Big[\big(\log S_{j,i} - \log S_0 + br_i^TDr_i\big)r_i^TXD^{-1}Yr_i + b\big(r_i^TXr_i\big)\big(r_i^TYr_i\big)\Big] \qquad (4.3.7)$$
$$\nabla f_2(D) = 2S_0b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\exp\big(-br_i^TDr_i\big)\big(S_{j,i} - S_0\exp(-br_i^TDr_i)\big)\,Dr_ir_i^TD \qquad (4.3.8)$$
$$\mathrm{Hess}(f_2)(X, X) = 2S_0b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\exp\big(-br_i^TDr_i\big)\Big[\big(S_{j,i} - S_0\exp(-br_i^TDr_i)\big)r_i^TXD^{-1}Xr_i - b\big(S_{j,i} - 2S_0\exp(-br_i^TDr_i)\big)\big(r_i^TXr_i\big)^2\Big] \qquad (4.3.9)$$
$$\mathrm{Hess}(f_2)(X, Y) = S_0b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\exp\big(-br_i^TDr_i\big)\Big[\big(S_{j,i} - S_0\exp(-br_i^TDr_i)\big)r_i^T\big(XD^{-1}Y + YD^{-1}X\big)r_i - 2b\big(S_{j,i} - 2S_0\exp(-br_i^TDr_i)\big)\big(r_i^TXr_i\big)\big(r_i^TYr_i\big)\Big] \qquad (4.3.10)$$
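As a sketch, the weighted log-domain objective $f_1$ in (4.3.4) and the gradient expression (4.3.6) can be evaluated as follows; here S is an (n, m) array of voxel-by-gradient intensities, R the (m, 3) array of gradient directions and W the matching weight array.

```python
import numpy as np

def f1(D, S, S0, R, W, b):
    quad = b * np.einsum('ij,jk,ik->i', R, D, R)     # b * r_i^T D r_i, one value per gradient
    resid = np.log(S) - np.log(S0) + quad            # broadcasts to shape (n, m)
    return np.sum(W * resid ** 2)

def grad_f1(D, S, S0, R, W, b):
    quad = b * np.einsum('ij,jk,ik->i', R, D, R)
    resid = np.log(S) - np.log(S0) + quad
    G = np.zeros((3, 3))
    for i, r in enumerate(R):                        # accumulate the D r_i r_i^T D terms
        Dr = D @ r
        G += np.sum(W[:, i] * resid[:, i]) * np.outer(Dr, Dr)
    return 2.0 * b * G
```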
With these notations, we devised the conjugate gradient algorithm, which is given
in Algorithm 1.
4.3.3 Experimental Results
We emulate the setup of Carmichael et al.[61]. We construct a simulated tensor field on a 128 x 128 x 4 three-dimensional grid consisting of background regions with identical isotropic tensors and banded regions with three parallel vertical bands and three horizontal bands (for each of the four slices), where within each band the tensors are identical and aligned in either the X or the Y direction. The bands are of various widths and degrees of anisotropy. The purpose is to compare empirically the relative performance of the two-stage smoothers and our smoother.
The underlying truth is smooth. We compare the Frobenius distance of our estimated tensor from the true tensor. We chose a small 10 x 10 region for smoothing purposes. We compare the following six methods:
1. Linear regression with no smoothing.
2. Nonlinear regression with no smoothing.
3. Anisotropic smoothing with log-transformed data.
4. Anisotropic smoothing with original data.
5. Two stage anisotropic smoothing w.r.t. Euclidean distance.
6. Two stage anisotropic smoothing w.r.t. log-Euclidean distance.
Data: Riemannian manifold $\mathcal{P}$: the space of positive definite matrices.
Vector transport $\mathcal{T}$: $\mathcal{T}_{\alpha\eta,D}(\eta) := D^{1/2}\exp\big(\alpha D^{-1/2}\eta D^{-1/2}\big)D^{-1/2}\eta$.
Retraction map $R$: $R_D(tX) := D^{1/2}\exp\big(tD^{-1/2}XD^{-1/2}\big)D^{1/2}$.
Objective function: $f_i(D)$, $i = 1, 2$. Gradient of $f$: $\nabla f_i(D)$. Armijo parameters $\sigma$, $\alpha$ and $\beta$.
Result: a local minimizer of $f$ found by the conjugate gradient algorithm.
Initialize: initial iterate $x_0 \in \mathcal{P}$, taken to be the identity matrix. Set $\eta_0 = -\nabla f(x_0)$.
for $k = 0, 1, 2, \cdots$ do
    for $m = 1, 2, \cdots, 20$ do
        Compute the step size $s = \alpha\beta^m$.
        Compute $x_{k+1} = R_{x_k}(s\eta_k)$.
        Compute LHS $= f(x_k) - f(x_{k+1})$ and RHS $= \sigma s\langle\nabla f(x_k), \nabla f(x_k)\rangle_{x_k}$.
        if LHS > RHS then break the inner loop.
    end
    Compute $\beta_{k+1} = \langle\nabla f(x_{k+1}), \nabla f(x_{k+1})\rangle_{x_{k+1}} / \langle\nabla f(x_k), \nabla f(x_k)\rangle_{x_k}$.
    Set $\eta_{k+1} = -\nabla f(x_{k+1}) + \beta_{k+1}\mathcal{T}_{s\eta_k,x_k}(\eta_k)$.
end
Algorithm 1: Conjugate gradient on positive definite matrices for a certain noise level
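The geometric ingredients of Algorithm 1 can be sketched in a few lines; the retraction below follows the map R_D stated in the algorithm, while the Armijo test uses the Euclidean (trace) inner product as a simplifying assumption in place of the Riemannian one.

```python
import numpy as np

def _eig_fun(D, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh((D + D.T) / 2.0)
    return (V * fun(w)) @ V.T

def retract(D, X, t):
    """R_D(tX) = D^{1/2} exp(t D^{-1/2} X D^{-1/2}) D^{1/2}; keeps the iterate positive definite."""
    Dh = _eig_fun(D, np.sqrt)
    Dih = _eig_fun(D, lambda w: 1.0 / np.sqrt(w))
    return Dh @ _eig_fun(t * Dih @ X @ Dih, np.exp) @ Dh

def armijo_step(D, eta, f, grad, sigma=1e-4, alpha=1.0, beta=0.5, max_m=20):
    """One backtracking line search along eta, mirroring the inner loop of Algorithm 1."""
    g = grad(D)
    for m in range(1, max_m + 1):
        s = alpha * beta ** m
        D_new = retract(D, eta, s)
        if f(D) - f(D_new) > sigma * s * np.sum(g * g):   # Armijo sufficient-decrease test
            return D_new, s
    return D, 0.0
```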
The Frobenius distances are then plotted against the corresponding FA values. The graphs for the low (1%), medium (5%) and high (10%) noise levels are shown in figures 4.3.1, 4.3.2 and 4.3.3.
4.4 Discussion
From the simulations, we observe that at low noise levels our method with log-transformed data, the Euclidean smoother and the geometric smoother perform significantly better than the rest. This may be attributed to the fact that at low noise levels the geometry of the space influences the smoothing process, so smoothers that respect the geometry turn out to be better. The improvement diminishes as the noise level increases, but the geometric methods still perform better than the rest, especially in more anisotropic regions and at high noise levels. If the local heterogeneity of the tensor field is the dominant component of variation of the tensors, and the noiseless tensors follow a certain geometric regularity, then the geometric smoothers lead to smaller bias. As seen in [61], when sensor noise dominates the geometry, the Euclidean smoother performs better than the geometric smoother. However, our method sides with the winner most of the time, no matter which method is better.
[Plot: Frobenius error versus FA at the low noise level for Lin. Reg., Nonlin. Reg., Aniso. Sm., Nonlin Aniso. Sm., Two Step Euclidean and Two Step Log Euclidean.]
Figure 4.3.1: Comparison of smoothers: Low noise level
[Plot: Frobenius error versus FA at the medium noise level for the same six methods.]
Figure 4.3.2: Comparison of smoothers: Medium noise level
[Plot: Frobenius error versus FA at the high noise level for the same six methods.]
Figure 4.3.3: Comparison of smoothers: High noise level
References
[1] Koller, Daphne and Friedman, Nir (2009). “Probabilistic Graphical Models,
Principles and Techniques”. Cambridge, Massachusetts: The MIT Press.
[2] Friedman, Nir (2004). “Inferring Cellular Networks Using Probabilistic Graphi-
cal Models.”, Science, 6 February 2004, Vol. 303 no. 5659 pp. 799-805.
[3] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, G. M. Church, Nature
Genet. 22, 281 (1999).
[4] E. Segal, B. Taskar, A. Gasch, N. Friedman, D. Koller, Bioinformatics 17 (suppl.
1), S243 (2001).
[5] Wainwright, Martin J., Jordan, Michael I. (2008), “Graphical Models, Exponen-
tial Families, and Variational Inference”, Foundations and Trends in Machine
Learning, Vol. 1, Nos. 1-2 pp 1-305.
[6] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison (1998), eds., Biological Sequence
Analysis. Cambridge, UK: Cambridge University Press.
[7] Gales, Mark and Young, Steve (2007), “The Application of Hidden Markov
Models in Speech Recognition”, Foundations and Trends in Signal Processing,
Vol. 1, No. 3 pp 195-304.
[8] Malinici, Iulian Pruteanu and Carin, Lawrence, “Infinite Hidden Markov Models
for Unusual-Event Detection in Video”, IEEE Transactions in Image Processing,
Vol. 17, No. 5, May 2008.
[9] Aksoy, Selim, “Probabilistic Graphical Models Part III: Example Appli-
cations”, available at http://www.cs.bilkent.edu.tr/~saksoy/courses/
cs551/slides/cs551_pgm3.pdf.
[10] Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S. and Saul,
Lawrence K. “An Introduction to Variational Methods”. In Jordan, Michael I.,
“Learning in Graphical Models”. Cambridge, Massachusetts: The MIT Press.
[11] Lauritzen, Stephen L. (1996), “Graphical Models”. Oxford Statistical Science
Series: Clarendon Press, Oxford.
[12] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, “Introduction to Algorithms”.
Cambridge, MA: MIT Press, 1990.
[13] Hammersley, J. M., Clifford, P. (1971), “Markov Fields on Finite Graphs and
Lattices”, unpublished, available at http://www.statslab.cam.ac.uk/~grg/
books/hammfest/hamm-cliff.pdf
[14] Grimmett, G. R. (1973), “A Theorem about Random Fields”, Bulletin of the
London Mathematical Society 5 (1): 81-84.
[15] Preston, C. J. (1973), “Generalized Gibbs States and Markov Random Fields”,
Advances in Applied Probability 5 (2): 242-261.
[16] Sherman, S. (1973), “Markov Random Fields and Gibbs Random Fields”, Israel
Journal of Mathematics 14 (1): 92-103.
[17] Besag, J. (1974), “Spatial Interaction and the Statistical Analysis of Lattice
Systems”, Journal of the Royal Statistical Society, Series B 36 (2): 192-236.
[18] Ising, E. (1925), “Beitrag zur Theorie des Ferromagnetismus”, Z. Phys. 31:
253-258.
[19] Murphy, Kevin Patrick (2012), “Machine Learning: A Probabilistic Perspec-
tive”. The MIT Press.
[20] Cipra, Barry A., “The Ising Model is NP-Complete”, SIAM News, Volume 33,
Number 6.
[21] The Structure of Proteins, available at http://www.vivo.colostate.edu/
hbooks/genetics/biotech/basics/prostruct.html.
[22] Razavian, Narges Sharif, Moitra, Subhodeep, Kamisetty, Hetunandan, Ra-
manathan, Arvind, and Langmead, Christopher J. (2010), “Time- Varying Gaus-
sian Graphical Models of Molecular Dynamics Data” . Computer Science De-
partment. Paper 1061.
[23] Honorio, Jean, Ortiz, Luis, Samaras, Dimitris, Paragios, Nikos, Goldstein, Rita
(2009), “Sparse and Locally Constant Gaussian Graphical Models”, Advances in
Neural Information Processing Systems 22.
[24] Mansinghka, V. K., Kemp, C., Tenenbaum, J. B., Griffiths, T. L. (2006), “Struc-
tured Priors for Structure Learning”, Uncertainty in Artificial Intelligence.
[25] Shwe, M., Middleton, B., Heckerman, D., Henrion, M., Horvitz, E., Lehmann,
H. and Cooper, G. (1991), “Probabilistic diagnosis using a reformulation of the
INTERNIST- 1/QMR knowledge base I. - The probabilistic model and inference
algorithms”, Methods of Information in Medicine, 30:241 - 255, 1991.
[26] Pe'er, D., Tanay, A. and Regev, A. (2006), “MinReg: A Scalable Algorithm for
Learning Parsimonious Regulatory Networks in Yeast and Mammals”. Journal
of Machine Learning Research, 7:167-189.
[27] Goldstein,Rita Z., Alia-Klein,Nelly, Tomasi,Dardo, Zhang,Lei, Cottone,Lisa
A., Maloney,Thomas, Telang,Frank, Caparelli,Elizabeth C., Chang,Linda,
Ernst,Thomas, Samaras,Dimitris, Squires, Nancy K., Volkow, Nora D. (2007),
“Decreased prefrontal cortical sensitivity to monetary reward is associated with
impaired motivation and self-control in cocaine addiction”, The American Jour-
nal of Psychiatry. Jan 2007; 164(1): 43-51.
[28] Barndorff-Nielsen, O.E. (1978), “Information and Exponential Families”, Chich-
ester, UK: Wiley.
[29] Brown, L.D. (1986), “Fundamentals of Statistical Exponential Families”, Hay-
ward, CA: Institute of Mathematical Statistics.
[30] Efron, B. (1978), “The Geometry of Exponential Families”, Annals of Statistics,
Vol. 6, pp. 362-376.
[31] Borwein, J. and Lewis, A. (1999), “Convex Analysis”. New York: Springer-
Verlag.
[32] Hirriart-Urruty, J. and Lemarechal, C. (1993), “Convex Analysis and Minimiza-
tion Algorithms”, Vol. 1, New York: Springer-Verlag.
[33] Rockafellar, G. (1970), “Convex Analysis”, Princeton, NJ: Princeton University
Press.
[34] Jaynes, E.T. (1957), “Information Theory and Statistical Mechanics”, Physical
Review, Vol 106, pp. 620-630.
[35] Wu, N. (1997), “The Maximum Entropy Method”, New York: Springer.
[36] Buhl, Søren L. (1993), “On the Existence of Maximum Likelihood Estimators
for Graphical Gaussian Models”, Scandinavian Journal of Statistics, Vol. 20, No.
3 , pp. 263-270
[37] Uhler, Caroline (2012), “Geometry of Maximum Likelihood Estimation in Gaus-
sian Graphical Models”, The Annals of Statistics, Vol. 40, No. 1, 238-261.
[38] Barrett, W., Johnson, C. R. and Tarazaga, P. (1993). “The real positive def-
inite completion problem for a simple cycle”. Linear Algebra Appl., Vol. 192, pp.
3-31.
[39] Barrett, W. W., Johnson, C. R. and Loewy, R. (1996). “The real positive definite
completion problem: Cycle completability”. Mem. Amer. Math. Soc., Vol.122,
viii+69 pp.
[40] Sturmfels, B. and Uhler, C. (2010). “Multivariate Gaussian, semidefinite matrix
completion, and convex algebraic geometry”. Ann. Inst. Statist. Math, Vol. 62,
pp 603-638.
[41] Hagmann, P. , Fonasson, L., Maeder, P., Thiran, F., Wedeen, V., Meuli,
R. (2006). “Understanding Diffusion MR Imaging Techniques: From Scalar
Diffusion-weighted Imaging to Diffusion Tensor Imaging and Beyond”. Radio-
Graphics, Vol. 26: S205-S223.
[42] Dillow, Clay (2010). ”The Human Connectome Project Is a First-of-its-Kind
Map of the Brain’s Circuitry.” Popular Science. Sept .
[43] Ahrens, E.T., Laidlaw, D.H., Readhead, C., Brosnan, C.F., Fraser, S.E., and
Jacobs, R.E. (1998). MR microscopy of transgenic mice that spontaneously ac-
quire experimental allergic encephalomyelitis. Magn. Reson. Med. Vol. 40, pp.
119-132.
[44] Harsan, L.A., Poulet, P., Guignard, B., Steibel, J., Parizel, N., de Sousa, P.L.,
Boehm, N., Grucker, D., and Ghandour, M.S. (2006). Brain dysmyelination and
recovery assessment by noninvasive in vivo diffusion tensor magnetic resonance
imaging. J. Neurosci. Res. Vol. 83, pp. 392-402.
[45] Kroenke, C.D., Bretthorst, G.L., Inder, T.E., and Neil, J.J. (2006). Modeling
water diffusion anisotropy within fixed newborn primate brain using Bayesian
probability theory. Magn. Reson. Med. Vol. 55, pp. 187-197.
[46] Mori, S., Itoh, R., Zhang, J., Kaufmann, W.E., van Zijl, P.C.M., Solaiyappan,
M., and Yarowsky, P. (2001). Diffusion tensor imaging of the developing mouse
brain. Magn. Reson. Med. Vol. 46, pp. 18-23.
[47] Nair, G., Tanahashi, Y., Low, H.P., Billings-Gagliardi, S., Schwartz, W.J., and
Duong, T.Q. (2005). Myelination and long diffusion times alter diffusion-tensor-
imaging contrast in myelin-deficient shiverer mice. Neuroimage Vol. 28, pp. 165-
174.
[48] Basser, P.J., Mattiello, J., and Le Bihan, D. (1994). MR diffusion tensor spec-
troscopy and imaging. Biophys. J. Vol. 66, pp. 259-267.
[49] Mori, S. & Zhang, J. (2006). “Principles of Diffusion Tensor Imaging and Its
Applications to Basic Neuroscience Research”, Neuron Vol. 51, pp. 527-539.
[50] Chenevert, T.L., Brunberg, J.A., and Pipe, J.G. (1990). “Anisotropic diffusion
in human white matter: Demonstration with MR technique in vivo”. Radiology
Vol. 177, pp. 401-405.
[51] Moseley, M.E., Cohen, Y., Kucharczyk, J., Mintorovitch, J., Asgari, H.S., Wend-
land, M.F., Tsuruda, J., and Norman, D. (1990). “Diffusion-weighted MR imag-
ing of anisotropic water diffusion in cat central nervous system”. Radiology Vol.
176, pp 439-445.
[52] Turner, R., Le Bihan, D., Maier, J., Vavrek, R., Hedges, L.K., and Pekar, J.
(1990). “Echo-planar imaging of intravoxel incoherent motion”. Radiology Vol.
177, pp. 407-414.
[53] Gudbjartsson, H. and Patz, S. (1995). “The Rician distribution of noisy MRI
data”. Magnetic Resonance in Medicine, Vol. 34, pp. 910-914.
[54] Hahn, K. R., Prigarin, S., Heim, S., and Hasan, K. (2006). “Random noise
in diffusion tensor imaging, its destructive impact and some corrections”. In
Weickert, J. and Hagen, H., editors, Visualization and Processing of Tensor
Fields, pp 107-117, Springer. MR2210513.
[55] Hahn, K. R., Prigarin, S., Rodenacker, K., and Hasan, K. (2009). “Denoising
for diffusion tensor imaging with low signal to noise ratios: method and monte
carlo validation”. International Journal for Biomathematics and Biostatistics,
1:83-81.
[56] Zhu, H. T., Li, Y., Ibrahim, I. G., Shi, X., An, H., Chen, Y., Gao, W., Lin,
W., Rowe, D. B., and Peterson, B. S. (2009). “Regression models for identifying
noise sources in magnetic resonance images”. Journal of the American Statistical
Association, Vol. 104, pp. 623-637.
[57] Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2005). “Fast and simple
computations on tensors with log-euclidean metrics”. Technical report, Institut
National de Recherche en Informatique et en Automatique.
[58] Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2006). “Log- euclidean
metrics for fast and simple calculus on diffusion tensors”. Magnetic Resonance
in Medicine, Vol. 56, pp. 411-421.
[59] Castano Moraga, C. A., Lenglet, C., Deriche, R., and Ruiz-Alzola, J. (2007). “A
Riemannian approach to anisotropic filtering of tensor fields”. Signal Processing,
Vol. 87, pp. 263-276.
[60] Yuan, Y., Zhu, H., Lin, W., and Marron, J. (2012). “Local polynomial regression
for symmetric positive definite matrices”. Journal of the Royal Statistical Society:
Series B (Statistical Methodology).
[61] Carmichael, O., Chen, J., Paul, D. and Peng, J. (2013). “Diffusion tensor
smoothing through weighted Karcher means”, Electronic Journal of Statistics,
Vol. 7, pp. 1913-1956.
[62] Polzehl, J. and Tabelow, K. (2008). “Structural adaptive smoothing in diffusion
tensor imaging: the R package dti”. Technical report, Weierstrass Institute,
Berlin.
[63] Karcher, H. (1977). “Riemannian center of mass and mollifier smoothing”. Com-
munications in Pure and Applied Mathematics, Vol. 30, pp. 509-541.
[64] Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). “Optimization Algorithms
on Matrix Manifolds”. Princeton University Press.
[65] Meinshausen, N. and Buhlmann, P. (2006). “High-Dimensional Graphs and Vari-
able Selection with the LASSO”, The Annals of Statistics, Vol. 34, No. 3, pp.
1436-1462.
[66] Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the Lasso”,
Journal of Royal Statistical Society, Series B (Methodological), Vol. 58, pp. 267-
288.
[67] Tibshirani, R., Saunders, M., Rosset, S., Zhu, Ji and Knight, K. (2005). “Sparsity
and Smoothness via the Fused Lasso”, Journal of Royal Statistical Society, Series
B (Methodological), Vol. 67, Part 1, pp. 91-108.
[68] Dempster, A. P. (1972). “Covariance Selection”, Biometrics, Vol. 28, No. 1,
Special Multivariate Issue, pp. 157-175.
[69] Edwards, David (2000). “Introduction to Graphical Modelling”, Second Edition,
Springer.
[70] Speed, T. P. and Kiiveri, H. T., “Gaussian Markov Distributions over Finite
Graphs”, The Annals of Statistics, Vol. 14, No. 1, pp. 138-150.
[71] Yuan, M. and Lin, Y. (2007). “Model Selection and Estimation in the Gaussian
Graphical Model”, Biometrika, Vol. 94, No. 1, pp. 19-35.
[72] Vandenberghe, L., Boyd, S. and Wu, S.-P. (1998). “Determinant Maximization
with Linear Matrix Inequality Constraints”, SIAM Journal on Matrix Analysis
and Applications, Vol. 19, No. 2, pp. 499-533.
[73] Breiman, Leo (1995). “Better Subset Regression Using the Nonnegative Gar-
rote”, Technometrics, Vol. 37, No. 4, pp. 373-384.
[74] Banerjee, O., Ghaoui, L. E., d’Aspremont, A. (2008). “Model Selection Through
Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary
Data”, Journal of Machine Learning Research, Vol. 9, pp. 485-516.
[75] Friedman, J., Hastie, T. and Tibshirani, R. (2007)“Sparse inverse covariance
estimation with the graphical lasso”. Biostatistics, Dec. 12, 2007, pp. 1-10.
[76] Chen, X., Lin, Q., Kim, S., Carbonell, J. G., Xing, E. P. (2010). “An Efficient
Proximal Gradient Method for General Structured Sparse Learning”, Journal of
Machine Learning Research, Vol. 11.
[77] Kovac, A. and Smith, A. D. A. C. (2012), “Nonparametric Regression on a
Graph”. Journal of Computational and Graphical Statistics. Vol. 20, No. 2, pp.
432-447.
[78] Greenshtein, E. and Ritov, Y. (2004). “Persistence in High-Dimensional Linear
Predictor Selection and the Virtue of Overparametrization”, Bernoulli, Vol. 10,
No. 6, pp. 971-988.
[79] Hoefling, H. (2010). “A Path Algorithm for the Fused Lasso Signal Approxi-
mator”. Journal of Computational and Graphical Statistics, Vol. 19, No. 4, pp.
984-1006.
[80] Tibshirani, R.J. and Taylor, J. (2011). “The Solution Path of the Generalized
Lasso”. The Annals of Statistics, Vol. 39, No. 3, pp. 1335-1371.
[81] Liu, J., Yuan, L., and Ye, J. (2010). “An Efficient Algorithm for a Class of Fused
Lasso Problems”. In The ACM SIG Knowledge Discovery and Data Mining.
ACM, Washington, DC.
[82] Chen, X., Lin, Q., Kim, S., Carbonell, J.G., and Xing, E.P. (2012). “Smoothing
Proximal Gradient Method for General Structured Sparse Regression”. Annals
of Applied Statistics, Vol. 6, No. 2, pp. 719-752.
[83] Nesterov, Y. (2005). “Smooth Minimization of Non-smooth Functions”. Mathe-
matical Programming, Vol. 103, pp. 127-152.
[84] Beck, A. and Teboulle, M. (2009). “A Fast Iterative Shrinkage-thresholding Al-
gorithm for Linear Inverse Problems”. SIAM Journal on Imaging Sciences, Vol.
2, pp. 183-202.
[85] Ye, G.B. and Xie, X. (2011). “Split Bregman Method for Large Scale Fused
Lasso”. Computational Statistics & Data Analysis, Vol. 55, No. 4, pp. 1552-1569.
[86] Hestenes, M.R. (1969). “Multiplier and gradient methods”. Journal Optimization
Theory & Applications, Vol. 4, pp. 303-320.
[87] Rockafellar, R.T. (1973), “A Dual Approach to Solving Nonlinear Programming
Problems by Unconstrained Optimization”. Mathematical Programming, Vol. 5,
pp. 354-373.
[88] Buhlmann, P. and van de Geer, S. (2011). “Statistics for High-Dimensional
Data”. Springer Series in Statistics.
[89] Bickel, P., Ritov, Y. and Tsybakov, A. (2009). “Simultaneous Analysis of Lasso
and Dantzig Selector”, Annals of Statistics, Vol. 37, No. 4, pp. 1705-1732.