Structure Learning in Locally Constant Gaussian Graphical Models
By
Apratim Ganguly
B.Stat., Indian Statistical Institute (2007)
M.Stat., Indian Statistical Institute (2009)
DISSERTATION
Submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
STATISTICS
in the
OFFICE OF GRADUATE STUDIES
of the
UNIVERSITY OF CALIFORNIA
DAVIS
Approved:
Wolfgang Polonik (Chair)
Debashis Paul
Ethan Anderes
Committee in Charge
2014
© Apratim Ganguly, 2014. All rights reserved.
Contents

1 INTRODUCTION
1.1 Graphical Models
1.1.1 Introduction to Graphical Models
1.1.2 Types of Graphical Models
1.1.3 Gaussian Graphical Models
1.1.4 Local Constancy in Gaussian Graphical Model
1.2 Organization of this Dissertation

2 LOCAL GEOMETRY IN GAUSSIAN GRAPHICAL MODELS
2.1 GGM as Exponential Family
2.1.1 Definition
2.1.2 Motivation
2.1.3 Gaussian MRF
2.2 Maximum Likelihood Estimation
2.2.1 Saturated Model
2.2.2 Covariance Selection Model
2.3 Geometric Interpretation of Existence of MLE
2.4 Local Geometry as Structure Constraint
2.4.1 Quantitative Definition
2.4.2 Bayesian Perspective

3 Estimation of Locally Constant Gaussian Graphical Model Using Neighborhood-Fused Lasso
3.1 Introduction
3.2 A Review of Related Works
3.3 Neighborhood Selection Using Fused Lasso
3.3.1 Assumptions
3.4 Optimization method
3.5 Asymptotics of Graphical Model Selection
3.6 Compatibility and l1 properties
3.7 Simulations
3.7.1 Simulation 1
3.7.2 Simulation 2
3.8 Theoretical Results

4 Smoothing of Diffusion Tensor MRI Images
4.1 Introduction
4.2 Principles of Diffusion MRI
4.2.1 Shortcomings of conventional MRI and Contrast Generation
4.2.2 Water Diffusion and Its Importance in MR Imaging
4.3 Tensor Smoothing
4.3.1 Two-Stage Smoothing
4.3.2 Locally Constant Smoothing with Conjugate Gradient Descent Algorithm
4.3.3 Experimental Results
4.4 Discussion

References
List of Figures

1.1.1 Directed Graphical Models
1.1.2 Toy Bayesian Network for Microarray Analysis
1.1.3 A template model associating promoter sequences to gene expressions
1.1.4 A spin configuration of 2-D lattice Ising model
1.1.5 Human Brain Graph
2.2.1 Chordal Graphs
3.7.1 Comparison of NFL with GLASSO, Meinshausen-Buhlmann estimate and CDD in section 3.7.1
3.7.2 Comparison of NFL with GLASSO and Meinshausen-Buhlmann estimate in section 3.7.2, sample sizes from top to bottom are 10, 25, 50, 100, 500, 1000, 5000, 10000
4.3.1 Comparison of smoothers: Low noise level
4.3.2 Comparison of smoothers: Medium noise level
4.3.3 Comparison of smoothers: High noise level
Apratim Ganguly
August 2014
Statistics
Structure Learning in Locally Constant Gaussian Graphical Models
Abstract
Occurrence of zero entries in the inverse covariance matrix of a multivariate Gaussian random variable has a one-to-one correspondence with conditional independence of the corresponding pairs of components. A challenging aspect of sparse structure learning is the well-known “small n, large p” scenario. So far, several algorithms have been proposed to solve the problem. Neighborhood selection using the lasso (Meinshausen-Buhlmann), the block-coordinate descent algorithm for estimating the covariance matrix (Banerjee et al.), and the graphical lasso (Tibshirani et al.) are some of the most popular ones. In the first part of this thesis, an alternative methodology is proposed for Gaussian graphical models on manifolds, where spatial information is judiciously incorporated into the estimation procedure. This line of work was initiated by Honorio et al. (2009), who proposed an extension of the coordinate descent approach, called the “coordinate direction descent approach”, which incorporates the local constancy property of spatial neighbors. However, Honorio et al. provide only an intuitive formalization and no theoretical investigation. Here I propose an algorithm to deal with local geometry in Gaussian graphical models. The algorithm extends Meinshausen-Buhlmann's idea of successive regressions by using a different penalty: neighborhood information enters the penalty term, and the resulting method is called the neighborhood-fused lasso algorithm. I will show by simulation, and prove theoretically, the asymptotic model selection consistency of the proposed method, and will establish faster convergence to the ground truth than the standard rates when the assumption of local constancy holds. This modification has numerous practical applications, e.g., in the analysis of MRI data or 2-dimensional spatial manifold data, in order to study spatial aspects of the human brain or moving objects.
In the second part of the thesis, I will discuss smoothing techniques on Riemannian manifolds using local information. Estimation of smoothed diffusion tensors from diffusion weighted magnetic resonance images (DW-MRI or DWI) of the human brain is usually a two-step procedure, the first step being a regression (linear/non-linear) and the second step being a smoothing (isotropic/anisotropic). I extend the smoothing ideas on Euclidean space to a non-Euclidean space by running a conjugate gradient algorithm on the manifold of positive definite matrices. This method shows empirical evidence of better performance than the two-step method of smoothing. This is a collaborative work with Debashis Paul, Jie Peng and Owen Carmichael.
Acknowledgments
Even though the dissertation is my individual work, its completion would not
be possible without the unsparing support and invaluable advice of my academic
mentor, Professor Wolfgang Polonik. His scientific acumen helped me shape some
of the fundamental ideas and thoughtful conversations with him have always been
immensely helpful for the development of my thesis. I am also indebted to Professor
Debashis Paul for the collaborative work I did with him and for the academic support
and counseling during the last five years. I would like to thank Prof. Francisco J.
Samaniego for giving me the opportunity to work with him and for the zestful but
informative interactions. My dissertation committee members Prof. Ethan Anderes
and Prof. Alexander Aue have helped me with their time and valuable inputs. I
am grateful to Prof. Owen Carmichael, Prof. Vladimir Filkov, Prof. Jie Peng, Prof.
Prabir Burman and Prof. Duncan Temple Lang for letting me work with them on
various projects that helped me expand my knowledge horizon. I would like to say a
big thank you to Pete Scully, our graduate program coordinator for his ever-present
help in time of need. I extend my profound gratitude to the rest of the faculty members of the Department of Statistics at UC Davis and my colleagues who made my
stay nevertheless enjoyable.
I would like to thank a few people without whom this half-a-decade long journey
far away from home would never be the same. The wonderful company of friends in
Davis and SF Bay area will be forever etched in my memory. So thank you Nagda,
Debasis, Partha, Sai, Uttam, Aveek, Gupta, Roy, Shreyashi, Pulak, Sanchita, Rajat,
Ana, Anupam, Annie, Saumen, Manjula, Sandipan, Sumit, Anirban, Bhashwar, Kinjal, Sujayam, Riddhi, Charles and all my awesome friends for being there.
I am also deeply thankful to my parents whose support and encouragement has
made this endeavor successful in every respect.
And the one person who deserves a very special mention is my fiancée, Rohosen, without whose relentless support and love it would have been extremely difficult to make it this far. You stood by me in my difficult times and held my hand. I cannot thank you enough for that. This would never have been possible without you by my side, and I promise to do the same. Love you.
Chapter 1
Introduction
This thesis aims to demonstrate the significance of local geometry in multivariate structure learning and in kernel smoothing problems in non-Euclidean spaces. The principal goal of this thesis is to provide empirical evidence and a detailed theoretical explanation for the advantage that one gets when locality information is judiciously incorporated into the relevant statistical method. In this thesis, I shall delve into the following two problems:

(i) Efficient structure learning in high dimensional Gaussian graphical models using local constancy.

(ii) Kernel smoothing on a tensor space using local alignment of tensors.

with the first problem being the main focus of this thesis. In both problems, local geometry based methods will be developed and their advantages over the traditional methods will be established.
1.1 Graphical Models
1.1.1 Introduction to Graphical Models
Uncertainty and complexity are the two most critical aspects of model based multivariate statistical analysis in various domains like computer vision and image processing, video processing, bioinformatics, medical imaging, etc. In a broader sense, any model based statistical method can be said to have a declarative representation[1], in which the model, representing the investigators' existing knowledge about the system, is used by different algorithms to answer different questions about the system. For example, a model in cognitive psychology might represent the investigators' prior knowledge about the relationships between several psychological states, their symptoms and the regions of the human brain that are, to some degree, responsible for them. An algorithm would then take this model as an input, along with the pertinent information about a certain individual, and diagnose her psychological condition. A model in genetics might entail the relationships between certain diseases, their symptoms and the genetic disorders potentially responsible. Alternatively, one might try to improve the working model with the collection of more data and advancements in domain knowledge, without significant changes to the reasoning algorithms that answer the specific questions the algorithms are designed for. This becomes possible due to the complete separation of knowledge and reasoning in a declarative representation. In contrast, a procedural method concentrates on finding a sequence of modules to reach conclusions starting from the data. For example, when relating transcription factor binding sites in the promoter regions of genes to their expression profiles, one can start by finding clusters of coexpressed genes and look for overrepresented elements in the promoters of the genes in each cluster [2, 3].
In various applied fields of modern science, some of which have been mentioned in the preceding paragraph, the systems appear with different degrees of uncertainty. The uncertainty might arise as a result of imprecise measurements or surrounding noise, or it might be due to a lack of sufficient domain knowledge, or simply because of a modelling limitation. Probability theory provides us with a comprehensive framework to formalize the inherent uncertainty of any system, enabling a thorough quantitative analysis. On the other hand, a system involving multiple components, often in the order of millions, makes it exceedingly difficult to express the relationships between its components owing to its complex modular structure. Graph theory provides the tools to deal with this complexity in a clean, formal and time efficient manner. Coupling graph theoretical tools with probability theory proves to be quite powerful in analyzing systems with a high degree of modularity and many interacting sub-components. These are commonly referred to as Graphical Models. In the following section, we discuss different types of graphical models with motivating examples.
1.1.2 Types of Graphical Models
Let us begin with a general mathematical framework for graphical models. It consists of a graph G = (V, E), where V = {1, 2, · · · , p} is a collection of vertices, each vertex corresponding to a random variable Xi taking values in some space Ωi for i = 1, 2, · · · , p, and E ⊆ V × V is a collection of edges. The multivariate random variable X = (X1, X2, · · · , Xp) is assumed to have a joint p.m.f. or p.d.f. f on Ω = Ω1 × Ω2 × · · · × Ωp. The existence (or absence) of an edge signifies conditional dependence (or independence) of the corresponding pair of random variables given the rest of them. The fundamental idea of a graphical model is that f factors according to the structure of G. The edges of the graph are either directed or undirected, leading to two different types of probabilistic graphical models - directed graphical models and undirected graphical models.
Directed Graphical Model
Given a directed graph G = (V,E) with a directed edge set, we define s to be a
parent of t if there is a directed edge from s to t. Conversely, t is called a child of
s. The set of all parents of a node s ∈ V is denoted by π(s). A directed cycle is a
sequence (s1, s2, · · · , sk) such that E contains all the directed edges (si → si+1; i =
1, 2, · · · , k − 1) and (sk → s1) ∈ E. See figure 1.1.1 for an illustration.
Figure 1.1.1: (a) A directed acyclic graph with 5 vertices. (b) A directed cyclic graph which
includes the directed cycle 1→ 2→ 3→ 4→ 1
The graph in figure 1.1.1(a) is an example of a Directed Acyclic Graph, abbreviated as DAG, meaning that every edge in the graph is directed and it contains no directed cycle. One can define a notion of ancestry on a DAG as follows: s is said to be an ancestor of u if there is a directed path (s → t1 → t2 → · · · → tk → u). For A ⊆ V, let us define XA := {Xt : t ∈ A}. Given a DAG, for each vertex s and its parent set π(s), let fs(xs | xπ(s)) denote a nonnegative function of the variables (xs, xπ(s)) such that ∫ fs(xs | xπ(s)) dxs = 1. A directed graphical model is then characterized as a joint probability density that factors as a product of these conditional densities, i.e.,

f(x1, x2, · · · , xp) = ∏_{s∈V} fs(xs | xπ(s)).     (1.1.1)
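To make the factorization (1.1.1) concrete, here is a minimal sketch (the three-node chain and the probability tables are hypothetical choices, not taken from the text) that evaluates the joint p.m.f. of the DAG X1 → X2 → X3 as the product of its conditional tables.

import itertools
import numpy as np

# Conditional probability tables for the DAG X1 -> X2 -> X3 (binary variables).
p_x1 = np.array([0.6, 0.4])                    # f_1(x1), no parents
p_x2_given_x1 = np.array([[0.7, 0.3],          # f_2(x2 | x1 = 0)
                          [0.2, 0.8]])         # f_2(x2 | x1 = 1)
p_x3_given_x2 = np.array([[0.9, 0.1],          # f_3(x3 | x2 = 0)
                          [0.5, 0.5]])         # f_3(x3 | x2 = 1)

def joint(x1, x2, x3):
    # Equation (1.1.1): product of f_s(x_s | x_pi(s)) over the vertices s.
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

# The factorized joint p.m.f. sums to one over the whole sample space.
print(sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3)))  # 1.0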
It should be noted that while speaking of densities, p.m.f.’s (absolutely continuous
w.r.t. counting measure) and p.d.f.’s (absolutely continuous w.r.t. Lebesgue measure)
are not differentiated. It can be proved using the normalization condition of fs and
the ancestry relationship of the DAG that the above definition is consistent. An
alternative formulation of directed graphical models is given by Directed Markov
Property, which states that every variable is conditionally independent of its non-descendants given its parents, i.e.,

Xs ⊥ X_{V∖de(s)} | Xπ(s)   ∀ s ∈ V     (1.1.2)

where de(s) denotes the set of all descendants of s as given by the DAG. Such a model is also known as a Bayesian Network∗, which has multiple applications. We describe below a practical application of Bayesian networks in computational biology[2].
∗Bayesian networks do not necessarily adopt a Bayesian statistics framework. The name alludes to Bayes' rule for probabilistic inference. However, Bayes nets are instrumental in hierarchical Bayesian models, which lay the foundation of a number of applied Bayesian statistics projects. See, e.g., the BUGS project at http://www.mrc-bsu.cam.ac.uk/software/bugs/
After the success of complete genome sequencing, high-throughput assays have been developed to analyze cells at a genome-wide scale. It is possible to measure components of molecular level networks with the help of these assays. However, extraction of meaningful information about the underlying mechanisms at the molecular level continues to be a challenging problem in molecular biology. Gene expression profiles are the main sources of high-throughput data on cellular networks. The expression levels of genes are measured by several microarrays. The number of genes can be up to the order of tens of thousands, whereas the number of microarrays could be in the tens to hundreds. An initial simplification is made by clustering genes with similar expression levels into gene clusters and arrays into array clusters according to certain biological contexts. Let Xg,a denote the expression level of gene g measured by array a. Let GCg and ACa denote the gene cluster and array cluster that gene g and array a belong to, respectively. We define a Bayesian network under the assumption that all measurements corresponding to a particular gene cluster-array cluster pair have the same conditional distribution. Thus, each node Xg,a has parents GCg and ACa. Figure 1.1.2 shows an example with a small number of genes and arrays.
Figure 1.1.2: A Bayesian Network with 3 genes and 2 arrays. GCg is the gene cluster for gene g
and ACa is the array cluster for array a.
It can be easily seen that the underlying graph is indeed a DAG. One can additionally assume that the conditional p.d.f. f(Xg,a | GCg, ACa) is the same for different choices of g and a. This model can achieve high likelihood if the gene and array clusters partition the expression data into homogeneous expression within each block [2, 4]. One can use the EM algorithm to find such a partition.

One can extend this simple model to more complex models capturing the hidden biological mechanisms that result in similar gene expressions, while maintaining the acyclicity and directedness of the extended graph. For example, it is assumed that coexpression is an effect of coregulation. Hence, it would make sense to incorporate different regulation mechanisms into this model to account for similar gene expressions in a gene cluster. One such regulation mechanism involves the binding of transcription factors to binding sites in the promoter region of genes. Thus, one can add some additional nodes, annotating promoters with characterized binding sites. One can define an indicator variable Rg,j denoting whether gene g has a binding site for transcription factor j. The corresponding graphical model is shown in figure 1.1.3.
Another popular example of directed graphical models is the Hidden Markov Model (abbreviated as HMM), which has numerous applications in bioinformatics (e.g., in gene-finding problems[5, 6]), speech recognition[7], video processing[8] and many other problems. Directed graphical models are used in collaborative filtering and recommender systems, intensive care monitoring, text analysis, object recognition and image segmentation. For details, see [9].
Figure 1.1.3: A template model associating promoter sequences to gene expressions. Ri's indicate whether the gene is regulated by the i-th transcription factor. Thus, the clustering of the genes would depend on the different binary sequences generated.
Undirected Graphical Model
In an undirected graphical model, also known as a Markov Random Field, the p.d.f. factors over cliques†. We associate with each clique C a compatibility function[5] or potential function[10]

ψC : ⊗_{s∈C} Ωs −→ R+.

The joint p.d.f. is obtained by taking the product of the potential functions over all cliques,

f(x1, x2, · · · , xp) = (1/Z) ∏_{C∈C} ψC(xC),     (1.1.3)

where C is the set of all maximal cliques‡ and Z is the normalization constant. The functions ψC need not be related to the conditional distributions defined over the cliques, unlike directed graphs where the factors correspond to the conditional distributions over the child-parent sets of nodes.
An important characterization of undirected graphical models is given in terms of conditional independence among subsets of nodes, also known as the Markov properties of the graphical model. We say that the graph exhibits the

(a) Pairwise Markov Property: if Xu ⊥ Xv | X_{V∖{u,v}} for all pairs of nodes (u, v) ∉ E.

(b) Local Markov Property: if Xv ⊥ X_{V∖cl(v)} | X_{ne(v)}, where ne(v) denotes the neighborhood of v, i.e., all nodes connected to v, and cl(v) = {v} ∪ ne(v).

†Cliques are complete subgraphs (in this scenario).
‡Maximal cliques are cliques that do not have any superclique containing them.
(c) Global Markov Property: if XA ⊥ XB | XS whenever every path from a node in A to a node in B passes through S, or equivalently, S is a separator of A and B.
However, all three aforementioned Markov properties are equivalent when the joint density is strictly positive. Otherwise, the global Markov property implies the local Markov property, which in turn implies the pairwise Markov property [11]. Taking into account all possible choices of A, B and S generates a sequence of conditional independence assertions. By the Hammersley-Clifford theorem§, proposed by Hammersley and Clifford in an unpublished paper in 1971 (see [13]), it can be shown that the set of probability distributions satisfying these assertions is exactly the set of distributions defined by equation (1.1.3) over all possible choices of potential functions. The global Markov property is essentially identified with the functionally equivalent notion of reachability in graph theory, establishing a fundamental connection between the algebraic concept of factorization and the graph theoretic concept of reachability[5]. This formulation can be exploited by clever algorithms (e.g., breadth-first search algorithms[12]) in order to assess conditional independence on a graph.

When the density is strictly positive, the global, local and pairwise Markov properties on the graph are equivalent. Therefore, one can use the pairwise Markov property alone to represent the conditional independence structure. Hence, the conditional dependence (independence) between two nodes can be represented by the presence (absence) of an edge between the two. We shall use this representation frequently in this thesis. Before I delve into the specifics of Gaussian graphical models, I shall present some practical applications of undirected graphical models.
§For other proofs, see [14, 15, 16, 17]
One of the basic and interesting examples of undirected graphical models is the Ising model [18], which arose from statistical physics and deals with interacting particles with similar or opposite magnetic spins. The model considers N interacting particles, each with a magnetic spin of +1 or −1, depending on the direction of their magnetic dipole moment. Let us denote by yi the state of the ith particle. The particles are arranged in a regular geometric configuration (a 1-D, 2-D or 3-D lattice, to be specific). In ferromagnetic substances, neighbors tend to spin in similar directions, whereas in anti-ferromagnetic substances neighbors tend to spin in opposite directions, and the system can be in 2^N different spin configurations. The model was proposed by the physicist Wilhelm Lenz (1920), and his student Ernst Ising solved the one-dimensional model in 1925. The two-dimensional square lattice Ising model was harder to solve; it was solved by Lars Onsager in 1944.¶

An Ising model can be represented as a Markov random field on a 1-D, 2-D or 3-D lattice. A 2-D lattice configuration of an Ising model is shown in figure 1.1.4. In this representation, the neighboring edges constitute the cliques of size 2. The pairwise clique potentials can be written as

ψ_{uv}(yu, yv) = e^{w_{uv} yu yv}     (1.1.4)
where wuv is the interaction strength between nodes u and v. A standard Ising
model makes some assumptions about the interactions wuv. Some of the common
assumptions are
(i) W = ((wuv))u,v is symmetric, so wuv = wvu.
¶Source: Wikipedia. For details, see http://en.wikipedia.org/wiki/Ising_model
Figure 1.1.4: A spin configuration of 2-D lattice Ising model
(ii) The interactions have the same strength, so wuv = J for all u, v.

If wuv > 0 (< 0), the interaction is ferromagnetic (anti-ferromagnetic). Thus, for a common value J > 0 one gets a ferromagnet, and the corresponding graphical model is called an associative Markov network. Conversely, for J < 0 one gets an anti-ferromagnet, and the corresponding graphical model is known as a frustrated system[19]. The energy of a configuration is given by
E(y) = − ∑_{(u,v)∈E} yu yv w_{uv}     (1.1.5)

In the most general version of the Ising model, one would have an external magnetic field Bu at node u, so that

E(y) = − ∑_{(u,v)∈E} yu yv w_{uv} − µ ∑_u Bu yu     (1.1.6)
where E is the edge set of the lattice and the magnetic moment is given by µ.
However, under the assumption of no external magnetic field, the negative log potential equals the energy of the system, and assuming the interactions are of the same strength,

E(y) = −J ∑_{(u,v)∈E} yu yv     (1.1.7)

The corresponding probability distribution is given by

Pβ(y) = [ ∏_{(u,v)∈E} ψ_{uv}(yu, yv)^β ] / [ ∑_y ∏_{(u,v)∈E} ψ_{uv}(yu, yv)^β ]
      = [ ∏_{(u,v)∈E} e^{β yu yv w_{uv}} ] / [ ∑_y ∏_{(u,v)∈E} e^{β yu yv w_{uv}} ]
      = e^{−βE(y)} / ∑_y e^{−βE(y)}     (1.1.8)
where E(y) is given by the equation (1.1.7) with all its relevant assumptions and β
is a physical parameter which depends on the temperature of the system.
If J is larger than a certain positive lower bound, then it can be proved that the corresponding probability distribution will have two maxima - when all the states are +1 or all are −1. These are known as the ground states [19] of the system. Conversely, if J is smaller than a certain negative upper bound, then the corresponding distribution will have many maxima. The denominator on the right hand side of equation (1.1.8) is called the partition function, and it is usually hard to calculate analytically‖. It is, therefore, a common practice to use Metropolis-Hastings sampling for the simulation of Ising models.

‖It is actually an NP-hard problem in general (see [20]). However, for associative Markov networks, it can be calculated in polynomial time.
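As a minimal sketch of such a Metropolis-Hastings simulation (the lattice size, interaction strength J and inverse temperature β below are illustrative choices, not values from the text), single-site spin flips are proposed and accepted with probability min(1, e^{−βΔE}):

import numpy as np

rng = np.random.default_rng(0)

def metropolis_ising(L=20, J=1.0, beta=0.5, n_sweeps=200):
    """Metropolis sampling of a 2-D lattice Ising model with no external field."""
    y = rng.choice([-1, 1], size=(L, L))
    for _ in range(n_sweeps * L * L):
        i, j = rng.integers(L, size=2)
        # Sum of the four lattice neighbors (periodic boundary conditions).
        nb = y[(i + 1) % L, j] + y[(i - 1) % L, j] + y[i, (j + 1) % L] + y[i, (j - 1) % L]
        delta_E = 2 * J * y[i, j] * nb        # energy change from flipping spin (i, j)
        if delta_E <= 0 or rng.random() < np.exp(-beta * delta_E):
            y[i, j] *= -1
    return y

sample = metropolis_ising()
print(sample.mean())   # magnetization per spin of the sampled configuration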
In the following section, we discuss in detail the special kind of Markov random field where the nodes jointly follow a multivariate normal distribution. In statistical parlance, this is commonly referred to as a Gaussian Graphical Model.
1.1.3 Gaussian Graphical Model
When the nodes of the graph represent the individual components of a multivariate Gaussian random variable, the pairwise Markov property, the local Markov property and the global Markov property are equivalent due to the strict positivity of the density function. Thus, the Markov properties are easily interpreted as pairwise Markov properties, i.e., conditional (in)dependence of pairs of nodes given the rest of them. For a multivariate normal model, interestingly, this property translates to an even simpler specification. Before spelling it out, let us introduce some notation we shall be following for the rest of the chapter.

Let X = (X1, X2, · · · , Xp) ∈ R^p follow a multivariate normal distribution Np(0, Σ), where Σ denotes the covariance matrix, assumed to be positive definite. Let us denote Ω = ((ωij))i,j := Σ^{−1}. The matrix Ω is known as the concentration matrix of the distribution. Each of the p variables can be represented as a node in a graph G = (V, E), where V = {1, 2, · · · , p} and E = {(a, b) : Xa ⊥̸ Xb | X_{V∖{a,b}}}. Then it can be easily proved that [11]
Proposition 1.1.1. Assume that X ∼ Np(0, Σ). Then it holds for a pair of vertices (s, t) with s ≠ t that

Xs ⊥ Xt | X_{V∖{s,t}} ⟺ ωst = 0     (1.1.9)
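Proposition 1.1.1 is easy to verify numerically; the following sketch (the sparse precision matrix is an arbitrary illustrative choice, not one from the text) checks that a zero entry ω14 = 0 yields a zero conditional covariance between X1 and X4 given the remaining variables:

import numpy as np

# A sparse precision matrix for p = 4; the zero in position (1, 4) encodes
# X1 _||_ X4 | {X2, X3} (1-based indices as in the text, 0-based in the code).
Omega = np.array([[ 2.0, -0.8,  0.0,  0.0],
                  [-0.8,  2.0, -0.6,  0.0],
                  [ 0.0, -0.6,  2.0, -0.5],
                  [ 0.0,  0.0, -0.5,  2.0]])
Sigma = np.linalg.inv(Omega)

# Conditional covariance of (X1, X4) given (X2, X3) via the Schur complement.
A, B = [0, 3], [1, 2]
cond_cov = (Sigma[np.ix_(A, A)]
            - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)])

print(np.round(cond_cov, 10))   # the off-diagonal entry is 0, matching omega_14 = 0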
This basic connection between conditional independence in the multivariate Gaussian distribution and its concentration matrix lays out an alternative representation for the underlying Markov network by restricting certain elements of the concentration matrix to zero. It can be shown using simple algebra that

ωss = 1 / Var(Xs | X_{V∖{s}}).     (1.1.10)

Let C = ((cij))i,j be the matrix Ω scaled so that its diagonal entries are equal to 1. Then,

cij = ωij / √(ωii ωjj).     (1.1.11)
Thus, the off-diagonal elements are equal to the negatives of the partial correlation coefficients. Hence, conditional independence directly corresponds to zero partial correlation in the multivariate Gaussian model. We now describe a real-life application of Gaussian graphical models.
Proteins are polymers of amino acids linked into a chain by peptide bonds. Proteins are one of the fundamental building blocks of life, and they play many important roles, including structural roles (cytoskeleton), as catalysts (enzymes), as transporters that ferry ions and molecules across membranes, and as hormones[21]. The three-dimensional structure of a protein has two equivalent expressions - either as Cartesian coordinates of the constituent atoms, or in terms of a sequence of dihedral angles. In the jargon of biotechnology, there are two types of dihedral angle - three backbone dihedral angles (denoted as φ, ψ and ω) and four side-chain dihedral angles (denoted as χ1, χ2, χ3 and χ4). The structure of a protein is not static, and during a reaction it changes continuously. Thus, one only observes a small sample from the distribution of the different values these dihedral angles go through. One can safely assume that the underlying distribution is multivariate Gaussian, and one is interested in exploring the underlying conditional independence graph by learning the structure of the Gaussian graphical model, which gives an idea about the interaction of several nodes of that protein during the reaction[22].
1.1.4 Local Constancy in Gaussian Graphical Model
Although the traditional methods of structure learning in Gaussian graphical models do not consider any additional restrictions, we often encounter data with a very specific geometric structure that calls for special treatment. For instance, a lot of data measured these days are manifold-valued, and this additional information is often ignored in classical analyses. Take, for example, the protein dihedral angle measurements mentioned in the last section. Simply assuming a Gaussian graphical model, one ignores the spatial arrangement of those angles and the order in which they appear in reality. Another example is provided by data describing features outlining a (moving) silhouette, pixels in an image, or voxels in a body organ like the brain or heart. In these problems, spatially close variables have a structural resemblance in terms of probabilistic dependence. Considering this local behavior might lead to faster and/or more efficient structure learning.
Honorio et al.[23] introduced the notion of local constancy. “Local” is used only in a heuristic fashion in their paper; no formal definition is provided. According to them, in a Gaussian graphical model, if a variable Xa is conditionally (in)dependent of a variable Xb, then a spatial neighbor Xa′ of Xa should also be conditionally (in)dependent of Xb. This would then encourage finding connections between two local or distant
clusters of variables instead of finding connections between isolated variables. For
an image, a spatial neighbor might be a neighboring pixel in a 4-neighborhood or
8-neighborhood system. For MRI data, a spatial neighbor might be a neighboring
voxel. In Bayesian networks, similar methods have been proposed, where the variables are grouped into classes and prior probabilities of having an edge between two different classes are assigned in a way that enforces regularized structure learning[24].

Imposing the restriction of local constancy is a step towards structured learning in graphical models. Such restrictions promise to have many practical applications, especially in the domains of medical imaging, image processing, genetics and biotechnology. Structural constraints have implicitly been used for a long time. The QMR-DT[25] network has two distinct classes of nodes - diseases and symptoms - and edges are only allowed from one class to the other. In the biological literature, similar divisions into
functional classes are not rare. In [26], the authors learn gene networks by finding
a relatively small set of regulator genes that control other genes and also exhibit
inter-connectivity.
One can anticipate a number of advantages of structured learning over unstructured learning. It can, for instance, be expected that when the assumption of local constancy holds, the efficiency of structured learning should be better than that of unstructured learning. It should also be possible to incorporate domain knowledge to regularize the model. This way, one can make up for the lack of samples, especially in high dimensional problems where samples are scarce and sparse estimation is sought. When compared against unstructured estimation, we will often see that structured learning produces the same quality of estimation with a smaller number of samples. Finally, structured learning can be important on its own merit. For example, one might simply want to verify that certain genes are regulatory without having to detect their inter-connectivity, influence or causal effect.
Although structure learning through local constancy (or similar ideas) is not new in the statistical community, it lacks a deep theoretical understanding, both from a geometric point of view and from an application perspective. In this thesis, I attempt to establish a fundamental underpinning of the concept of local constancy in Gaussian graphical models and simultaneously explore its advantageous aspects in practical applications. Another application of local structure discussed in this thesis is kernel smoothing of positive definite matrices to estimate the diffusion tensor matrix in voxels of the human brain. The following example shows a practical application of the local constancy
property in graphical models.
Studying the interaction between cognitive, emotional, behavioral and neurotic changes on the one hand and drug addiction on the other is an active field of research in neurobiology[27]. The objective is to understand the physiological and psychological phenomena that control the recursive nature of addiction to drugs (intoxication, withdrawal, craving, relapse). Functional MRI (fMRI) data are quite useful and popular for analyzing brain activity when the individual is asked to perform certain tasks. In one such study[27] the researchers observed the brain's sensitivity to monetary rewards of different magnitudes among cocaine abusers and its association with motivation and self-control. fMRI data were collected for both addict and non-addict subjects. Now, it has been observed that functional connectivity in the brain is often realized as brightness or dimness of certain regions of the brain. Their study concluded
that cocaine abusers show an overall reduced regional brain responsivity to the differences between monetary rewards, whereas for the healthy individuals the money-induced
stimulations were predominantly in the orbitofrontal cortex. Thus, one can choose to
model the voxel-level sensitivity as a high dimensional multivariate Gaussian random variable, but since regions of the brain work as functional units, it makes a lot of sense to impose some structural restriction on the model. One such simple restriction could be the constraint of local constancy. Then one can represent the functional connectivity among different regions of the human brain as the conditional independence graph learned from the data. Figure 1.1.5∗∗ shows a representative image of connectivity in the human brain.

Figure 1.1.5: A representative image of a human brain graph. For more images, go to http://www.humanconnectomeproject.org

∗∗This picture is taken from https://raweb.inria.fr/rapportsactivite/RA2011/parietal/uid42.html
1.2 Organization of this Dissertation
The rest of the dissertation is organized as follows. In the next chapter we discuss
the importance of geometry in maximum likelihood estimation in Gaussian graphical
models and establish local constancy as a geometric constraint, along with providing
some alternative definitions. In chapter 3, we propose the neighborhood-fused lasso for structure learning in Gaussian graphical models, followed by its empirical performance in simulations and the asymptotic consistency results (and their proofs), with a discussion of how to choose the regularization parameters in a data-dependent fashion. Chapter 4 deals with smooth estimation of diffusion tensors from diffusion weighted MR images and its performance compared to traditional methods on a simulated data set.
Chapter 2

LOCAL GEOMETRY IN GAUSSIAN GRAPHICAL MODELS (GGM)
In this chapter, structural restrictions on Gaussian graphical models in general, and local geometry in particular, will be analyzed from both a statistical and a geometric point of view. I shall proceed by establishing the Gaussian graphical model as a special member of the exponential family of distributions on a finite-dimensional Euclidean space. Conditions for the existence of maximum likelihood estimates in GGMs will be discussed both in the presence and in the absence of structural constraints. A geometric analysis facilitates an in-depth understanding of the effect of imposing structural restrictions on the variables and thereby provides us with the necessary tools to formalize the notion of local constancy. The transition from complete to partial knowledge of the graph structure, and exploiting the partial knowledge in a quantitative manner, will be the principal objective of this chapter. Development of a sound theoretical understanding is the central theme, with intermittent references to relevant practical applications.

I shall present various results from the existing literature to show the effectiveness of imposing a structural constraint on finding the MLE of the covariance matrix in a Gaussian graphical model. Motivated by that, I shall represent the local neighborhood as a structural constraint and define local constancy in a more formal manner.
2.1 GGM as Exponential Family
The exponential families of probability distributions encompass a large class of probability distributions on finite dimensional Euclidean spaces, e.g., R^p, which can be parametrized by a finite dimensional parameter. Exponential families have been widely studied in the statistics literature[28–30], since most of the popular distributions that are practically useful can be expressed as members of an exponential family. Examples include the binomial, Poisson, exponential, gamma, beta and Gaussian distributions among univariate probability distributions, and the multinomial, Dirichlet and multivariate Gaussian distributions among multivariate probability distributions. The commonality and mathematical neatness of exponential families contribute to an overall convenience in dealing with these distributions and also to the development of a deep understanding of these distributions and their properties, both finite sample and asymptotic. The framework of exponential families binds statistical inference and convex analysis[31–33] together. Apart from classical statistics, exponential families are also relevant in machine learning and graphical models, where a lot of inferential problems can be formulated and analyzed in terms of the canonical parameters and sufficient statistics of relevant exponential families.
2.1.1 Definition
Let X = (X1, X2, · · · , Xp) ∼ Pθ, where θ ∈ Θ ⊆ R^k. We say that the family of distributions {Pθ : θ ∈ Θ} belongs to the k-parameter exponential family if its density f, which is absolutely continuous with respect to a measure ν, can be represented as

f(x) = exp{〈η(θ), T(x)〉 − A(θ)},     (2.1.1)

where 〈·, ·〉 denotes the Euclidean inner product in R^k and the function A(·), known as the cumulant function, is defined to ensure proper normalization of the density. Thus,

A(θ) = log ∫_{X^p} exp{〈η(θ), T(x)〉} dν,     (2.1.2)

where X^p is the sample space. The (multivariate) statistic T(x) is the sufficient statistic for the exponential family and η(θ) is the (multivariate) canonical parameter. Finiteness of A(θ) is a necessary condition for the above definition to hold, and for this reason we are only interested in θ ∈ Θ0, where

Θ0 := {θ ∈ Θ : A(θ) < ∞}.     (2.1.3)
2.1.2 Motivation
Information theory provides one of the basic motivations for exponential family representations of graphical models[5, 34, 35]. Starting with n i.i.d. observations X1, X2, · · · , Xn, one would like to infer the (unknown) probability distribution that generated the data. For any probability distribution represented as a density f w.r.t. an appropriate measure ν, we define the theoretical moment of a statistic T(X) = (T1, T2, · · · , Tk) as

Ef(Ti(X)) := ∫_{X^p} Ti(X) f(X) dν   ∀ i = 1, 2, · · · , k     (2.1.4)
assuming they exist. One good property that our optimal f should satisfy is to equate
the theoretical moments with sample moments. Hence, a good criterion is
Ef(Ti(X)) = (1/n) ∑_{j=1}^{n} Ti(Xj)   ∀ i = 1, 2, · · · , k     (2.1.5)
However this is not enough, since there are infinitely many distributions that satisfy
the property (2.1.5). Therefore, we need to adopt a principle, as a functional of
density f , to choose the density. In information theory, one such useful principle is
based on
H(f) := Ef(− log f(X)) = − ∫_{X^p} (log f(x)) f(x) dν     (2.1.6)
This is commonly known as Shannon Entropy, and it is a measure of unpredictability
of information content. Hence, the maximum entropy solution f ∗, chosen among a
class F , is given by
f* := arg max_{f∈F} H(f)   subject to condition (2.1.5).     (2.1.7)
The heuristic idea behind choosing the Shannon entropy is to maximize unpredictability subject to the moment restrictions. It can be shown that the optimal solution f* is of the form[5]

f*_θ(x) ∝ exp{ ∑_{i=1}^{k} αi Ti(x) }     (2.1.8)
where the αi represent a (finite) parametrization of the distribution. This is the general representation of an exponential family.
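As a standard illustration of this principle (an added example, not part of the original development), take the real line as sample space and T(x) = (x, x²), so that (2.1.5) constrains the first two sample moments. The maximum entropy solution then has the form

f*(x) ∝ exp{α1 x + α2 x²},   with α2 < 0,

which, after completing the square, is a normal density with mean µ = −α1/(2α2) and variance σ² = −1/(2α2); the constraints (2.1.5) pin these down to the sample first and second moments. The multivariate Gaussian distribution of the next subsection arises in exactly the same way from constraints on first and second order moments.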
2.1.3 Gaussian MRF
Given an undirected graph G = (V, E), a Gaussian Markov random field is a multivariate Gaussian random vector X = (X1, X2, · · · , Xp) with mean µ and covariance matrix Σ that satisfies the Markov properties of G. One can show that it has an exponential family representation. The joint density of n observations from the multivariate normal distribution can be expressed as

f_{µ,Σ}(x1, x2, · · · , xn) = (2π)^{−np/2} |Σ|^{−n/2} ∏_{i=1}^{n} exp{ −(1/2)(xi − µ)^T Σ^{−1} (xi − µ) }
∝ |Σ|^{−n/2} exp{ −tr(Σ^{−1} x^T x / 2) + µ^T Σ^{−1} x^T 1 − n µ^T Σ^{−1} µ / 2 }     (2.1.9)
where x is the n × p matrix [x1, x2, · · · , xn]^T and 1 is a vector (of appropriate dimension) of all 1's. It follows that expression (2.1.9) identifies the model determined by the multivariate Gaussian distribution with unknown µ and Ω = Σ^{−1} as an exponential model. To see this, observe that tr(A^T B) is a valid inner product on the space of matrices. Also, take θ = (Ω, ξ) with ξ = Ωµ as the canonical parameter, T(x) = (−x^T x/2, x^T 1) as the sufficient statistic, and

〈(A1, y1), (A2, y2)〉 = tr(A1^T A2) + y1^T y2     (2.1.10)

as an inner product, where A1, A2 are matrices and y1, y2 are vectors. Then expression (2.1.9) can be written as

f_θ(x1, x2, · · · , xn) ∝ exp{ (n/2) log|Ω| − tr(Ω x^T x/2) + µ^T Ω x^T 1 − n µ^T Ω µ/2 }
= exp{ (n/2) log|Ω| + 〈θ, T(x)〉 − n µ^T Ω µ/2 }     (2.1.11)
Since the integral

∫_{R^p} exp{ −(1/2)(x − µ)^T Ω (x − µ) } dx

is finite iff Ω is positive definite, this is a (regular) exponential model. The closed convex support C of the sufficient statistic is the set of pairs (A, b) such that A is a symmetric p × p matrix, b ∈ R^p and A − bb^T/n is non-negative definite, i.e.,

C = { (A, b) ∈ Sp × R^p : A − bb^T/n ⪰ 0 },     (2.1.12)

where Sp denotes the set of p × p symmetric matrices and ⪰ 0 denotes non-negative definiteness[11]. It follows that the interior of C, denoted by C0, is given by

C0 = { (A, b) ∈ Sp × R^p : A − bb^T/n ≻ 0 },     (2.1.13)

where ≻ 0 denotes positive definiteness. In the following sections, we discuss the existence of maximum likelihood estimates for both saturated and unsaturated models.
2.2 Maximum Likelihood Estimation
In order to find the MLE or study conditions for its existence, one can use results from the theory of exponential models. This might seem to be overkill for the general model, but it is helpful when we study Gaussian graphical models with conditional independence restrictions. In general, properties of exponential models will also come in handy when incorporating the local geometric restrictions.
2.2.1 Saturated Model
In a saturated model, we have no additional restrictions on the conditional independence structure. In particular, we have n i.i.d. samples x1, x2, · · · , xn from N(µ, Σ), and we assume no mandatory zero entries in the precision matrix Ω = Σ^{−1}. The only restriction on Σ (and hence Ω) is its positive definiteness. Then, it can be shown that

Theorem 2.2.1. In the saturated model, the MLEs of µ and Σ exist iff

S := x^T x − x^T 1 1^T x / n

is positive definite. This happens with probability one if n > p and never when n ≤ p. When the estimates exist they are given by

µ̂ = x^T 1 / n,   Σ̂ = S/n,

and they are independently distributed as µ̂ ∼ Np(µ, Σ/n) and Σ̂ ∼ Wp(n − 1, Σ/n).
where x = [x1, x2, · · · , xn]^T is the n × p matrix with each row corresponding to an observation. The MLE of Ω is Ω̂ = Σ̂^{−1} ∼ W_p^{−1}(n − 1, Σ/n). Therefore,

E(Ω̂) = n/(n − p − 2) Ω,

V(Ω̂) = [ 2n² ( Ω ⊗ Ω + (1/(n − p − 2)) Ω ⊠ Ω ) ] / [ (n − p − 1)(n − p − 2)(n − p − 4) ],

where W_p^{−1} denotes the p-dimensional inverse Wishart distribution, ⊗ denotes the Kronecker product and ⊠ denotes the matrix outer product such that

(Ω ⊠ Ω)(A) = tr(AΩ) Ω,

so the asymptotic variance of Ω̂ is 2(Ω ⊗ Ω)/n. For proofs of these results, see [11].
In light of theorem 2.2.1, it is seen that for the MLE to exist, the sample size must be larger than the problem dimension. In fact, for a hypothetical situation where the dimension increases polynomially with the sample size, one would require a sample size of the polynomial order of the problem dimension. So, for high dimensional problems, where p ≫ n, the MLE will not exist in the saturated model. This is primarily the driving factor towards considering models with sparsity constraints. In the following section we show how a sparse model results in the advantage of being able to find the MLE with a smaller sample size.
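The sample-size requirement of theorem 2.2.1 is easy to see numerically. The sketch below (the dimensions are arbitrary illustrative choices) checks positive definiteness of S = x^T x − x^T 1 1^T x / n for simulated Gaussian data:

import numpy as np

rng = np.random.default_rng(1)

def mle_exists(n, p):
    """Check positive definiteness of S = x^T x - x^T 1 1^T x / n for Gaussian data."""
    x = rng.standard_normal((n, p))
    one = np.ones((n, 1))
    S = x.T @ x - x.T @ one @ one.T @ x / n
    return bool(np.all(np.linalg.eigvalsh(S) > 1e-10))

print(mle_exists(n=100, p=10))   # True  (n > p): the MLE exists with probability one
print(mle_exists(n=10, p=100))   # False (n <= p): S is rank-deficient, no MLE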
2.2.2 Covariance Selection Model
To start with, we revisit proposition 1.1.1, which connects the conditional independence of two nodes in a Gaussian graphical model with the occurrence of zeros in the concentration matrix Ω. Apart from conditional independence, a number of statistical measures of the underlying model can be expressed as functions of the precision matrix. The conditional variances are the inverses of the diagonal elements, i.e.,

Var(Xi | X_{V∖{i}}) = 1/ωii   for all i.

The partial correlation coefficients are given by

ρ_{ij|V∖{i,j}} = −ωij / √(ωii ωjj).

Also, by property 2.2.2 below, we have that Xi given X_{V∖{i}} follows a univariate normal distribution.
Property 2.2.2. Let Y ∼ Np(µ, Σ). Consider the partitions Y = (Y1, Y2)^T, µ = (µ1, µ2)^T and

Σ = ( Σ11  Σ12
      Σ21  Σ22 ),

where Y1, µ1 are k-dimensional vectors (1 ≤ k < p) and Σ11 is a k × k matrix. Then the conditional distribution of Y1 | Y2 is Nk(µ_{1|2}, Σ_{1|2}), where µ_{1|2} = µ1 + Σ12 Σ22^{−1} (Y2 − µ2) and Σ_{1|2} = Σ11 − Σ12 Σ22^{−1} Σ21, assuming Σ22 is invertible.
Writing the conditional mean as a sum, we get

E(Xi | X_{V∖{i}}) = µi + ∑_{j∈V∖{i}} β_{ij|V∖{i}} (Xj − µj),     (2.2.1)

where the partial regression coefficients are given by

β_{ij|V∖{i}} = −ωij / ωii.     (2.2.2)
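Equations (2.2.1)-(2.2.2) can be checked directly. The following sketch (the positive definite precision matrix is an arbitrary illustrative choice) compares the population coefficients of the regression of X1 on the remaining variables, Σ12 Σ22^{−1}, with the values −ω1j/ω11 predicted by (2.2.2):

import numpy as np

# An arbitrary positive definite precision matrix (illustrative values only).
Omega = np.array([[ 3.0, -1.0,  0.5],
                  [-1.0,  2.0, -0.4],
                  [ 0.5, -0.4,  1.5]])
Sigma = np.linalg.inv(Omega)

# Population coefficients of the regression of X1 on (X2, X3): Sigma_12 Sigma_22^{-1}.
beta_reg = Sigma[0, 1:] @ np.linalg.inv(Sigma[1:, 1:])

# Coefficients predicted by equation (2.2.2): -omega_1j / omega_11.
beta_prec = -Omega[0, 1:] / Omega[0, 0]

print(np.allclose(beta_reg, beta_prec))   # True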
The above property 2.2.2 laid the foundation of a novel procedural approach towards solving the model selection problem in Gaussian graphical models, especially when p ≫ n. This will be discussed in greater detail in the next chapter. Here we restrict ourselves to reviewing conditions that ascertain the existence of MLEs for Gaussian MRFs.
Decomposing Covariance Selection Models
Before we delve into studying geometric properties of Gaussian MRFs, let us review some results which lead to the existence of the MLE in high dimensional situations (where n < p) in a covariance selection model. These results are summarized from Lauritzen's book[11]. We start with a simple decomposition assumption, and study the existence of the MLE under that condition and its extensions.
Let the graph G = (V, E) be decomposed by (A, B, C). Hence V is the disjoint union of A, B, C; C separates A and B, meaning any path from A to B passes through C; and C is a complete subgraph of G. For a subset Q ⊆ V, let ΩQ be the submatrix corresponding to the nodes in Q, and let [ΩQ]^V and [Ω^{−1}_Q]^V = [ΣQ]^V denote the p × p matrices in which the elements of Ω and Σ, respectively, corresponding to the nodes in Q are kept intact and the rest are set to 0. Let SG and S+G denote the sets of symmetric matrices and positive definite matrices (respectively) whose off-diagonal elements corresponding to the missing edges in G are 0. We start with the following lemma.
Lemma 2.2.3. Let Ω ∈ SG, and let (A, B, C) be a disjoint partitioning of G as mentioned above. Then

Ω = [Ω_{A∪C}]^V + [Ω_{B∪C}]^V − [Ω_C]^V     (2.2.3)

and for any symmetric p × p matrix L, we have

tr(ΩL) = tr(Ω_{A∪C} L_{A∪C}) + tr(Ω_{B∪C} L_{B∪C}) − tr(Ω_C L_C).     (2.2.4)

Also, if Ω ∈ S+G, then

Ω = [(Ω^{−1}_{A∪C})^{−1}]^V + [(Ω^{−1}_{B∪C})^{−1}]^V − [(Ω^{−1}_C)^{−1}]^V     (2.2.5)

and

|Ω| = |Ω^{−1}_C| / ( |Ω^{−1}_{A∪C}| |Ω^{−1}_{B∪C}| ).     (2.2.6)
The above lemma summarizes the effect of decomposition on the precision matrix
leading to the following proposition, which is basically a sample version of the above
lemma.
Proposition 2.2.4. Consider a sample of size n from a covariance selection model given by the graph G = (V, E) decomposed by (A, B, C). The maximum likelihood estimate of the precision matrix Ω exists iff the estimates Ω̂_{[A∪C]} and Ω̂_{[B∪C]} exist, where Ω̂_{[A∪C]} and Ω̂_{[B∪C]} denote the MLEs of the concentration matrices in the two marginal models with graphs G_{A∪C} and G_{B∪C}, based on the data in the marginal samples only. If the estimates exist, then they satisfy

Ω̂ = [Ω̂_{[A∪C]}]^V + [Ω̂_{[B∪C]}]^V − n[(S_C)^{−1}]^V     (2.2.7)

|Ω̂| = |Ω̂_{[A∪C]}| · |Ω̂_{[B∪C]}| · n^{−|C|} |S_C|     (2.2.8)

Further, it holds that

Σ̂_{[A∪C]} = Σ̂_{A∪C},   Σ̂_{[B∪C]} = Σ̂_{B∪C},     (2.2.9)

ξ̂ = [ξ̂_{[A∪C]}]^V + [ξ̂_{[B∪C]}]^V − [ξ̂_{[C]}]^V.     (2.2.10)
With the aforementioned results, we study the properties of decomposable covariance selection models. Usually these covariance selection models are built up by accumulating several small saturated models through successive direct joins. This makes the model amenable to modular analysis by breaking it down into smaller saturated sub-models. Lemma 2.2.3 demonstrates the effect of the Markov properties of the graphical model on covariance matrices across decompositions of the graph. In light of the fact that (see proposition 2.17 in [11]):
Proposition 2.2.5. For an undirected graph the following two statements are equiv-
alent.
(i) The cliques of the graph can be numbered to form a perfect sequence.
(ii) The graph is decomposable.
we can come up with a sequential enumeration of the cliques of G, say C1, C2, · · · , Ck, where each combination of the subgraphs induced by Hj−1 = C1 ∪ C2 ∪ · · · ∪ Cj−1 and Cj is a decomposition. The joint density then factorizes as

f(x) = ∏_{j=1}^{k} f(x_{Cj}) / ∏_{j=2}^{k} f(x_{Sj}) = ∏_{C∈C} f(xC) / ∏_{S∈S} f(xS)^{ν(S)},

where Sj = Hj−1 ∩ Cj is the sequence of separators and ν(S) is the frequency of separator S among the Sj's. Using lemma 2.2.3 repeatedly, we get

Ω = ∑_{C∈C} [(ΣC)^{−1}]^V − ∑_{S∈S} ν(S)[(ΣS)^{−1}]^V,     (2.2.11)

|Σ| = ∏_{C∈C} |ΣC| / ∏_{S∈S} |ΣS|^{ν(S)}.     (2.2.12)
Similarly, by using proposition 2.2.4 repeatedly, we can derive an explicit formula for
the MLE of a decomposable covariance selection model. This is formalized in the
following proposition.
Proposition 2.2.6. In a decomposable covariance selection model with graph G = (V, E), the maximum likelihood estimates of the mean vector and the concentration matrix, based on a sample of size n, exist with probability one iff n > max_{C∈C} |C|, and they are given by µ̂ = x̄ and

Ω̂ = n ( ∑_{C∈C} [(SC)^{−1}]^V − ∑_{s∈S} ν(s)[(Ss)^{−1}]^V )     (2.2.13)

where C is the set of cliques of G, S is the set of separators with multiplicities ν in a sequence of cliques, and S is the same as in theorem 2.2.1.
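Formula (2.2.13) is straightforward to implement for a small decomposable graph; in the sketch below the four-node graph, its clique/separator lists and the simulated data are illustrative choices, not taken from the text:

import numpy as np

rng = np.random.default_rng(2)

def pad(M, idx, p):
    """Embed a |idx| x |idx| matrix into a p x p zero matrix, i.e. the [.]^V operation."""
    out = np.zeros((p, p))
    out[np.ix_(idx, idx)] = M
    return out

# Decomposable graph on 4 nodes with cliques {0,1,2} and {1,2,3}, separator {1,2}.
cliques = [[0, 1, 2], [1, 2, 3]]
separators = [[1, 2]]

n, p = 50, 4
x = rng.standard_normal((n, p))
xbar = x.mean(axis=0)
S = (x - xbar).T @ (x - xbar)            # same S as in theorem 2.2.1

# Equation (2.2.13): Omega_hat = n * ( sum_C [(S_C)^{-1}]^V - sum_S nu(S) [(S_S)^{-1}]^V ).
Omega_hat = n * (
    sum(pad(np.linalg.inv(S[np.ix_(C, C)]), C, p) for C in cliques)
    - sum(pad(np.linalg.inv(S[np.ix_(Q, Q)]), Q, p) for Q in separators)
)

print(np.round(Omega_hat, 3))
print(abs(Omega_hat[0, 3]) < 1e-12)      # True: the missing edge (1, 4) stays exactly zero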
Non-decomposable Covariance Selection Models
The existence of MLEs in non-decomposable covariance selection models is usually studied by looking at a decomposable “cover” of the underlying graph. This also leads to the minimum sample size needed for the existence of the MLE. This has been studied by Buhl[36] and Uhler[37]. In this section, we shall discuss the fundamental ideas, followed by the geometric principles of maximum likelihood estimation in the next section. The geometric ideas will be the key concepts in formalizing the idea of local geometry in Gaussian graphical models.
We start with some definitions. For a non-decomposable graph G = (V,E), a graph
G+ = (V,E+) is called a fill-in if G+ is decomposable and E ⊆ E+. Obviously
there are a number of choices for potential fill-ins and one aims for a minimal one.
The minimal fill-in is closely connected to the concept of a chordal graph. A graph
is a chordal graph if it contains no chordless∗ cycle of length greater than 3. For a
nonchordal graph, one can naturally define its chordal cover by constructing a fill-in
which is chordal. The notion of minimal chordal cover is useful where minimality
refers to the maximal clique size in the chordal cover. Figure 2.2.1 shows a nonchordal
graph and its chordal cover which is also a minimal chordal cover.
Figure 2.2.1: (a) A nonchordal graph and (b) its chordal cover
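The chordality notions used here are easy to experiment with; the following small sketch uses networkx (the chordless 5-cycle and the added chords are a hypothetical example, not the graph of figure 2.2.1):

import networkx as nx

# A chordless 5-cycle is nonchordal; adding two chords gives a chordal fill-in.
G = nx.cycle_graph(5)
print(nx.is_chordal(G))                   # False

G_plus = G.copy()
G_plus.add_edges_from([(0, 2), (0, 3)])   # a fill-in G+ with E a subset of E+
print(nx.is_chordal(G_plus))              # True
print(max(len(c) for c in nx.find_cliques(G_plus)))   # maximal clique size q+ = 3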
Following [37], we define the treewidth of a graph as one less than the maximal
clique size in its chordal cover. Let C and C+ denote the clique classes and q, q+ the
maximal clique sizes of G and G+ respectively, where G+ can be taken as a minimal
chordal cover of G. It is obvious that q ≤ q+. Buhl introduced the idea of matching,
∗An edge between two non-adjacent nodes of a cycle is called a chord
which is defined as follows.

Definition 2.2.7. Two indexed sets of vectors xI ⊆ R^n and x̃I ⊆ R^d are said to be matching on the graph G if x_i^T x_j = x̃_i^T x̃_j for i = j or (i, j) ∈ E.
In other words, a matching corresponds to a folding of a set of vectors into a lower or higher dimensional space such that the angles between the vectors are preserved†. Taking orthogonal transformations is one way to accomplish that. The following lemma lays out sufficient conditions for matching in a decomposable model.

Lemma 2.2.8. Let G be a decomposable model with maximal clique size q. Given xI ∈ R^n, we want to find x̃I ∈ R^d that matches xI on G. (i) If d ≥ q, this can always be done. (ii) If d > p and the indexed sets xc, c ∈ C, are linearly independent, this can be done such that x̃I is linearly independent.
For a non-decomposable graph, one needs to unfold the line bundle according to
a fill-in G+ with a clique class C+ and maximal clique size q+. This is described in
the following theorem.
Theorem 2.2.9. Let d ≥ q⁺. The MLE exists for x_I ⊆ R^n iff there exists x̃_I ⊆ R^d such that (i) x_I and x̃_I match on G and (ii) all x̃_{c⁺}, c⁺ ∈ 𝒞⁺, are linearly independent.
A consequence of lemma 2.2.8 and theorem 2.2.9 is that if n < q, then the MLE cannot exist. If n ≥ q⁺, the MLE exists with probability 1. However, no such general statement can be made when q ≤ n < q⁺. This region is henceforth denoted by U_G. Simpler nonchordal graphs like cycles and wheels[36,38,39], bipartite graphs
and small grids[37] have been studied in the literature. Buhl considered the chordless p-cycle with a sample of size n = 2 (which falls inside U_G, where G denotes the chordless p-cycle) and a p-wheel structure with a sample of size n = 3, and showed under the assumption of linear independence that the MLE exists with probability 1 − 2/(p−1)!, which is quite large for higher values of p, showing the huge advantage one gets from a regular underlying model.
†For simplicity, one can assume all vectors to be normalized
2.3 Geometric Interpretation of Existence of MLE
Given a graph G = (V,E), the space of concentration matrices respecting the
Markov property on G, denoted by K_G, is a convex cone inside the positive definite cone S⁺_p and is defined as

K_G := {K ∈ S⁺_p : K_{ij} = 0 ∀ (i, j) ∉ E}.
Taking the inverse of every matrix in K_G generates the space of covariance matrices in
the model. This happens to be an algebraic variety intersecting with the positive
definite cone S⁺_p. In a Gaussian graphical model, the matrix completion problem can
be reformulated as
Corollary 2.3.1. The MLEs Σ̂ and K̂ exist for a given sample covariance matrix S iff

fiber_G(S) := {Σ ∈ S⁺_p : Σ_G = S_G}

is nonempty, in which case fiber_G(S) intersects K_G^{−1} in exactly one point Σ̂.
Thus the MLE Σ̂ is algebraically connected to the sufficient statistic S_G, in the sense that it can be expressed as a solution to a polynomial equation in the sufficient statistic S_G. Using corollary 2.3.1, we can find the set of all sufficient statistics for which the MLE exists by projecting S⁺_p onto the edge set of the graph G. It has been
proved[40] that the cone of sufficient statistics is the convex dual to the cone of con-
centration matrices K_G. Uhler[37] investigated the existence of the MLE in the range q ≤ n < q⁺ by looking at the manifold of rank-n matrices on the boundary of S⁺_p. Its projection lies on the topological closure of the cone of sufficient statistics. Existence of the MLE is ensured if the projection lies in the interior. It was proved in [37] that the elimination ideal, obtained from the ideal of (n+1)×(n+1)-minors of a symmetric m×m matrix of unknowns by eliminating all unknowns corresponding to non-edges of the graph G, should be the zero ideal in order for the MLE to exist with probability one. This provides a sufficient condition for existence of the MLE. If the elimination
ideal is not the zero ideal, the MLE can still exist with positive probability. However,
calculation of the elimination ideal is computationally intensive and is quite hard for
a large graph. Usually small graphs are studied with the elimination criterion and
joined using clique sums to build larger graphs.
2.4 Local Geometry as Structure Constraint
Motivated by the significant implications of imposing structural constraints for
efficient estimation of MLE with under-sampling, we claim that structure learning
in Gaussian graphical models benefits from incorporating the local geometry in the
analysis in a judicious manner. In this section, we shall describe a few ways to do
that. The principal idea is to hypothetically split the neighborhood into two groups
- the “local” neighbors and the “non-local” neighbors.
In most of the practical problems, the underlying spatial geometry will be known
to the statistician, and in most situations it is going to be a regular graph like a one-, two- or three-dimensional lattice, triangles, or other combinations
of cliques of smaller size. To generalize the notion of this local structure we introduce
a graph Glocal that may or may not be a part of the global neighborhood system. We
call it a local neighborhood graph. This graph is solely determined by external factors
and it is not, by any means, affected by the data. Given a node of a graph G, all its
neighbors in Glocal are defined as the local neighbors of that node.
The local constancy property, as defined in Honorio's paper, can easily be assimilated into this more formal setting as follows: if Xa is (in)dependent of Xb, then a local neighbor of Xa (as defined above) is expected to be (in)dependent of Xb. However, the so-called "likeliness" behavior of local neighbors needs to be formalized. In the
remaining portion of this chapter, we propose two alternative ways to more formally
quantify the definition of local constancy.
Before we formally define the local constancy, we should carefully define the dif-
ference matrix. We consider an m × p matrix D, where m is the total number
of pairs of local neighbors. Assuming that we have a labelled sequence of nodes Γ(n) = {1, 2, ..., p(n)}, arrange the pairs of local neighbors in a sequence

B := {(u, v_u) : v_u ∈ lne(X_u), v_u > u, u ∈ Γ(n)}
where lne(Xu) denotes the set of local neighbors of Xu. This particular ordering
chosen here does not influence the results below. Note that B is nothing but an
alternative sequencing of edges in Glocal, and hence B is topologically equivalent with
Glocal. The inequality vu > u is included to avoid double counting. Each pair is
then represented by a row in the difference matrix D. The kth row of D is given by D_{k,·} = e_i − e_j, where (i, j) is the kth element of the local neighbor sequence and e_i, e_j denote the canonical basis vectors of R^p whose 1's occur at the ith and jth positions, respectively. We denote by D^a the m_a × p sub-matrix of D obtained by selecting all the rows whose ath entry is 0. In other words, D^a is the difference matrix corresponding to all the local neighbor pairs not involving X_a. The number of such pairs in Γ(n) \ {a} is m_a. It should be noted that throughout our discussion we shall assume that the
local neighborhood structure is known to us, meaning that D is known beforehand
and it does not depend on the data. Also the following definitions are helpful.
Definition 2.4.1 (Zero-Operator). Given a matrix A, its zero operator is defined as
Z(A) := (I(A_{ij} = 0))_{i,j}
where I is the indicator function.
Definition 2.4.2 (Diagonal-Excluded Product). Given two matrices A and B, the
diagonal-excluded product is given by

A ⊛ B := Z(A) ∘ (AB)

where ∘ denotes the Hadamard product of matrices. Although the name does not clearly show how diagonals are removed from the product, for the types of matrices we will deal with this operation eventually leads to the exclusion of the diagonal of the matrix B.
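As a quick illustration, the two definitions above translate directly into code; the Python sketch below is just a transcription of definitions 2.4.1 and 2.4.2, with the Hadamard product written as elementwise multiplication.

import numpy as np

def zero_operator(A):
    # Z(A): indicator of the zero entries of A (definition 2.4.1)
    return (A == 0).astype(float)

def diag_excluded_product(A, B):
    # Z(A) ∘ (A B), with ∘ the Hadamard product (definition 2.4.2)
    return zero_operator(A) * (A @ B)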
2.4.1 Quantitative Definition
One way to define the local constancy property is to impose a bound on the
norm of the difference of local neighbors. We say a model exhibits (ε, G_local, l_p)-local constancy if

‖ D ⊛ Ω ‖_p < ε
where ‖ · ‖_p denotes the l_p norm. For our purpose, the best candidate is the l_1 norm, for certain desirable properties like sparsity and convexity. However, a different
choice of norm would lead to a different solution which might be more appropriate
for other types of problems. With p = 1, this definition coincides with Honorio’s
local penalty criterion.
It should be noted that the ε parameter controls the degree of local constancy, so it is closely related to the tuning parameter we shall use in the next chapter. It serves as a constraint on certain matrix parameters that we are trying to estimate. A small ε would ensure a high level of local constancy, whereas a large ε would do the opposite. This is exactly analogous to any penalized regression problem.
2.4.2 Bayesian Perspective
Since the locality information is being used here as prior information, it makes sense to interpret it from a Bayesian perspective. Let π(Ω) denote a probability distribution on the space of positive definite matrices. Then one can say that an (ε, δ, G_local, l_p)-local constancy property holds for this model if

P(‖ D ⊛ Ω ‖_p < ε) > 1 − δ
Here we need to assume that δ < 1. The ε parameter controls the degree of local
constancy and the δ parameter is indicative of our prior belief about the local con-
stancy.
It should be noted that the non-zero entries of the matrix D ⊛ Ω consist of the differences of elements of Ω corresponding to local neighbors. For example, if

D = ( 1  −1   0
      0   1  −1 )

and

Ω = ( ω_11  ω_12  ω_13
      ω_21  ω_22  ω_23
      ω_31  ω_32  ω_33 ),

then

D ⊛ Ω = ( 0             0    ω_13 − ω_23
          ω_21 − ω_31   0    0           ).
In this thesis, I will pursue the first definition of local constancy. In the following chapter, I shall describe a method of model selection using penalized regression. The penalty
term will be derived from the above definition. Upper bounds of norm constraints
play an important role in determining the order of the regularization parameter in the
equivalent optimization process. I shall show by simulation and theoretical justifica-
tion that with proper order of the regularization parameters, our proposed method
works better than traditional approaches.
Chapter 3
Estimation of Locally Constant
Gaussian Graphical Model Using
Neighborhood-Fused Lasso
3.1 Introduction
In this chapter, we extend and subsequently analyze the idea propounded by Meinshausen and Buhlmann for structure learning in Gaussian graphical models, where we incorporate the additional structural constraint introduced in the last chapter as local constancy. It is well known that for a p-dimensional multivariate Gaussian random variable X with nonsingular covariance matrix Σ, conditional independence can be represented by the occurrence of zeros at the corresponding entries of the precision matrix
Ω. The neighborhood nea of a node a is defined as the collection of all nodes that
are conditionally dependent on a. The neighborhood selection algorithms aim to find
all the neighbors of a node Xa, given n i.i.d. observations of X. Meinshausen &
Buhlmann[65] showed that this problem could be interpreted as an ensemble of l1-penalized regressions, which could be solved using the lasso algorithm[66]. In a relatively recent paper, Honorio, Ortiz & Samaras[23] proposed local constancy of spatially close nodes as prior information for learning in Gaussian graphical models. In this
chapter, I will assume that the global conditional dependence neighborhood graph of
the underlying Gaussian distribution consists of the following subgraphs: (a) a local graph that is topologically close to disconnected or weakly connected clusters of regular graph objects like chains, cycles, lattices or cliques; and (b) a non-local graph whose edges connect nodes of two different spatial clusters. Moreover, it will also be assumed in this chapter that the regular graph object that governs the local structure depends only on domain knowledge and hence is known beforehand. Adaptive selection of the local neighborhood is proposed as a future research direction and has not been dealt with in our analysis. The locality information is critical when data
is observed in a certain manifold with spatial geometry. The assumption of local
constancy, in some sense, enforces spatial regularization on structure learning and
thereby stimulates the search for probabilistic dependencies between local clusters of
nodes.
We propose to add a new penalty term which, in essence, generalizes the fused
lasso penalty[67] to penalize the differences of spatially close nodes, which are determined by the geometry of the particular problem and are not chosen adaptively.
It is this additional penalty term which takes care of the local regularization and
therefore its choice is intertwined with the regular local graph structure we choose
to deal with. This leads us to propose neighborhood-fused lasso, a model selection
method for locally constant Gaussian graphical models. This helps us to estimate the
probabilistic connectivity among distant clusters of nodes using the spatial informa-
tion. We also show by simulation and by studying the theoretical properties of our
proposed method that neighborhood-fused lasso outperforms other competing model
selection algorithms where locality information is ignored. Both finite sample and
large sample properties of our estimator have been studied, leading to several inter-
esting findings. We prove theoretically that introducing a local penalty term reduces
the finite sample type-I error probability in model selection and leads to equivalent
accuracy as standard methods with smaller sample size. Also, our method is not
computationally intensive since we provide data dependent choices of the regulariza-
tion parameters instead of cross-validation based methods. We study the asymptotic
l1 properties of our estimator to find sufficient conditions on design matrix and regu-
larization parameters to ensure asymptotic consistency in parameter estimation and
prediction in terms of l1 and l2 metric, respectively.
3.2 A Review of Related Works
Going back to Dempster[68] who introduced Covariance Selection to discover the
conditional independence restrictions (the graph) from a set of i.i.d. observations,
many methods have been proposed for sparse estimation of the precision matrix in a
Gaussian graphical model. Covariance selection traditionally relies on optimization
of an objective function[11, 69]. Modern technological developments and top-notch
computing power enable us to deal with high dimensional models. Of course, there
still are computational challenges. Exhaustive search is often infeasible for high-
dimensional models. Usually, greedy forward-selection or backward-deletion search
is used. In forward(backward) search, one starts with the empty (full) set and adds
(deletes) edges iteratively until a suitable stopping criterion is fulfilled. The selec-
tion (deletion) of an edge requires an MLE fit[70] for O(p²) many models, making
it a suboptimal choice for high-dimensional models where p is large. Also, the MLE
might not exist in general for p > n [36]. In contrast, neighborhood selection with the
lasso, proposed by Meinshausen and Buhlmann, relies on optimization of a convex
function, applied consecutively to each node in the graph, thus fitting O(p) many
models. Fast lasso-type algorithms and data dependent choices for regularization
parameter reduce the computational cost. Unlike covariance selection, this algorithm
estimates the dependency graph by sequential estimation of individual neighbors and
subsequent combination by union or intersection. Theoretical analysis shows that the
choice of union or intersection does not matter asymptotically. Other authors have
proposed algorithms for the exact maximization of the l1-penalized log-likelihood.
Yuan & Lin[71] proposed an l1-penalty on the off-diagonal elements of the concentra-
tion matrix for its sparse estimation with the positive definiteness constraint. They
showed that this problem is similar to the maxdet problem in Vandenberghe et al.[72],
and thus solvable by the interior point algorithm. They also took a nonnegative gar-
rote[73] type approach. A quadratic approximation to the objective function in their
proposed method leads to a solution similar to Meinshausen & Buhlmann. Banerjee
et al.[74] viewed this as a penalized maximum likelihood estimation problem with the
same l1-penalty on the concentration matrix. Constructing the dual transforms the
problem into sparse estimation of the covariance matrix instead of the concentration
matrix. They proposed a block coordinate descent algorithm to solve this efficiently for
large values of p. They also showed that the dual of the quadratic objective func-
tion in the block-coordinate step can be interpreted as a recursive l1-penalized least
square solution. Friedman, Hastie & Tibshirani used this idea successfully to develop
a new algorithm, known as graphical lasso[75]. Using a coordinate descent approach
to solve the lasso problem speeds up the algorithm to a considerable extent, making
it quite fast and effective for a large class of high-dimensional problems.
In all the aforementioned methods, the local geometry (of the manifold valued data)
is not taken into consideration. Oftentimes we come across data that are measured on a certain manifold, e.g., data describing some feature on the outline of a (moving) silhouette, pixels in an image, or voxels in a body organ like the brain or heart. In most
of these problems, spatially close variables have a structural resemblance in terms of
probabilistic dependence. Considering this local behavior might lead to faster and/or
more efficient estimation. Honorio, Ortiz, Samaras et al.[23] introduced the notion of
local constancy. According to them, if one variable Xa is conditionally (in)dependent
of Xb, then a local neighbor of Xa in that manifold is likely to be conditionally
(in)dependent of Xb as well. They developed a coordinate direction descent algorithm
to solve a penalized MLE problem which penalizes both the l1-norm of the precision
matrix and the l1-norm of local differences of the precision matrix, expressed as its
“diagonal excluded product” with a local difference matrix. Local geometry has been addressed, although quite implicitly, by Tibshirani et al.[67], who assumed that
the underlying parameters in a linear model have a natural ordering. Apart from
being sparse, some of the local parameters could also be fused together. They penal-
ized both the l1-norm of the parameter and the l1-norm of successive differences and
developed a modified lasso algorithm, known as fused lasso. Chen et al.[76] used this
idea and proposed graph guided fused lasso for structure learning in multi-task regres-
sion, where the output space is continuous and the outputs are related by a graph.
The input space is high dimensional and outputs that are connected by an edge in
the graph are believed to share a common set of inputs. The goal then is to learn
the underlying functional map from the input space to the output space in a way
that respects the similar sparsity pattern among the covariates that are believed to
affect the output variables that are connected. Local smoothing has been discussed
in Kovac and Smith[77] in order to perform a penalized nonparametric regression
where the differences of neighboring nodes were penalized. Their algorithm aims to
split the image into active sets and subsequently merge, split or amalgamate them in
order to minimize a penalized weighted distance.
3.3 Neighborhood Selection Using Fused Lasso
Like Greenshtein & Ritov[78] and Meinshausen & Buhlmann[65] we work in a
set-up where the number of nodes in the graph p(n) and the covariance matrix
Σ(n) depend on the sample size. Consider the p(n)-dimensional multivariate ran-
dom variable X = (X1, · · · , Xp) ∼ N(µ,Σ). The conditional (in)dependence struc-
ture of this distribution can be represented by the graph G = (Γ(n), E(n)), where Γ(n) = {1, ..., p(n)} is the set of nodes corresponding to the coordinate variables and E(n) ⊆ Γ(n) × Γ(n) is the set of edges. A pair of nodes (a, b) ∈ E(n) if and only if X_a is conditionally dependent on X_b, given all the remaining variables X_{Γ(n)\{a,b}} = {X_k : k ∈ Γ(n)\{a, b}}. E^c(n) contains all pairs which are conditionally independent, given all the remaining variables.
The neighborhood ne_a of a node a is defined as the smallest subset of Γ(n) \ {a} such that, given ne_a, X_a is conditionally independent of all the remaining nodes. In other words, the neighbors of a certain node consist of the coordinates that are conditionally dependent on that particular node. We already know that in a p-dimensional multivariate Gaussian random variable X with nonsingular covariance matrix Σ, the
conditional independence can be represented by the occurrence of zeros at the corresponding cells of the precision matrix Ω, i.e., X_a ⊥ X_b | X_{Γ\{a,b}} iff Ω_{ab} = (Σ^{−1})_{ab} = 0. The
neighborhood selection / model selection algorithms aim to find all the neighbors
of a node Xa, given n i.i.d. observations of X. Meinshausen & Buhlmann showed
that this problem could be interpreted as a penalized regression problem, where each
variable is regressed on the remaining variables with an l1 penalty on the estimated
coefficients.
By the above definition of neighborhood, we have, for all a ∈ Γ(n), where Γ(n) is the set of all nodes,

X_a ⊥ {X_k : k ∈ Γ \ ({a} ∪ ne(a))} | X_{ne(a)}.

An alternative definition of the neighborhood of X_a is given by the non-zero components of θ^a, where θ^a is given by

θ^a = argmin_{θ: θ_a = 0} E ( X_a − ∑_{k∈Γ} θ_k X_k )².   (3.3.1)

In light of the above definition, the set of neighbors of a node a ∈ Γ(n) can be written as

ne(a) = {b ∈ Γ(n) : θ^a_b ≠ 0}.   (3.3.2)
Given the domain knowledge we construct a regular graph Glocal that is representative
of the underlying spatial geometry. Glocal could be a chain graph, a two or three
dimensional lattice or more generally a collection of cliques or wheels. Following this,
the local neighbors are defined as follows.
Definition 3.3.1. Xa and Xb are local neighbors if the edge connecting them belongs
to a predefined graph Glocal.
Note that the edge between Xa and Xb need not belong to E(n). Now if one
incorporates the local constancy property to this graph, the neighborhood becomes
more structured. By local constancy, if Xb is conditionally independent of Xa given
the other nodes (and hence, there is no edge between a and b), it is likely that a local
neighbor X_{b′} of X_b is also conditionally independent of X_a, making both θ^a_b and θ^a_{b′} equal to zero. Thus, the zeros of θ^a are spatially close.
Fused Lasso, proposed by Tibshirani et al., addresses problems where the parameters
could be “ordered” in some sense. The traditional fused lasso penalizes the l1-norm
of both the coefficients and their successive differences. Thus it encourages sparsity
of the coefficients and also sparsity of their differences - i.e. local constancy of the
coefficient profile. We modify the traditional fused LASSO algorithm to satisfy our
purpose of estimating the zeros and nonzeros of the precision matrix with a spatially
similar occurrence pattern. It is therefore natural to penalize the differences between
spatial neighbors of a graph. In the next couple of paragraphs we explain how to
do that. In the next section we write down a few assumptions that are similar to Mein-
shausen and Buhlmann’s but we need a few additional assumptions to incorporate
the locality information. Our theoretical analysis reveals that with these assumptions
on sparsity, model complexity and our added assumption on local neighborhood spar-
sity, the fused LASSO estimate will furnish all the desired finite-sample properties
like sign-consistency and model selection consistency.
The neighborhood-fused LASSO estimate θ^{a,λ,µ} of θ^a is given by

θ^{a,λ,µ} = argmin_{θ: θ_a = 0} ( n^{−1} ‖X_a − Xθ‖² + λ‖θ‖₁ + µ‖D^a θ‖₁ )   (3.3.3)

where ‖x‖₁ denotes the l₁-norm of x. Penalizing both the l₁-norms of θ and D^a θ implies parsimony, thus ensuring sparsity and local constancy at the same time.
This property helps us in variable selection and thereby leads to neighborhood selec-
tion. Note that we are referring to both the non-local and the local neighbors. The
neighborhood estimate of node a is defined by the nodes corresponding to non-null
coefficients when Xa is regressed on the rest of the variables with the fused lasso
penalty in (3.3.3); in other words,

ne^{λ,µ}_a = {b ∈ Γ(n) \ {a} : θ^{a,λ,µ}_b ≠ 0}   (3.3.4)

where θ^{a,λ,µ}_b denotes the b-th component of the vector θ^{a,λ,µ}. It is clear that the
selected neighborhood depends on the value of λ and µ chosen. Large values of λ
and µ will give rise to more sparse solutions. Usually the regularization parameters
are chosen by some cross-validation criterion. But in this chapter we derive a data-driven approach for selecting the regularization parameters that speeds up computation and ensures asymptotic consistency in model selection. Meinshausen and Buhlmann
derived a data driven choice for λ. We will extend their method for simultaneous
selection of λ and µ from the data.
3.3.1 Assumptions
In order to prove consistency of our method for Gaussian graphical models, we need to work with a few assumptions. Most of the assumptions are the same as Meinshausen and Buhlmann's (see section 2.3 in [65]). However, we need two additional assumptions, [A4] and [A8], to deal with the local constancy. Even though the assumptions are similar, we write them down for the sake of completeness and for quick reference to the notation.
[A1] Dimensionality: ∃ γ > 0 such that p(n) = O(n^γ) as n → ∞. Note that γ > 1 is allowed, thus permitting p ≫ n.
[A2] For all a ∈ Γ(n) and n ∈ N, Var(X_a) = 1. There exists v² > 0 so that for all n ∈ N and a ∈ Γ(n), Var(X_a | X_{Γ(n)\{a}}) ≥ v². This means that all the conditional variances are bounded away from 0.
[A3] Sparsity: There exists some 0 ≤ κ < 1 so that max_{a∈Γ(n)} |ne_a| = O(n^κ) for n → ∞.
[A4] Local Neighborhood Sparsity: The number of local neighbors can also grow at a polynomial rate in n, i.e., ∃ β₀ ≥ 0 such that the maximum number of local neighbors of a node satisfies max_{a∈Γ(n)} |ne_l(a)| = O(n^{β₀}) for n → ∞. For convenience we let K > 0 be such that

2 + max_{a∈Γ(n)} |ne_l(a)| < K n^{β₀}.
[A5] l₁-Boundedness: There exists some ϑ < ∞ so that for all neighboring nodes a, b ∈ Γ(n) and all n ∈ N, ‖θ^{a, ne_b\{a}}‖₁ ≤ ϑ.
[A6] Magnitude of partial correlations: There exists a constant δ > 0 and some ξ > κ, with κ as in [A3], so that for every (a, b) ∈ E, |π_{ab}| ≥ δ n^{−(1−ξ)/2 + β₀}, where π_{ab} denotes the partial correlation between X_a and X_b.
[A7] Neighborhood Stability: Define S_a(b) = ∑_{k∈ne_a} sgn(θ^{a,ne_a}_k) θ^{b,ne_a}_k. There exists some δ₁ < 1 so that for all a, b ∈ Γ(n) with b ∉ ne_a, |S_a(b)| < δ₁.
[A8] Local Neighborhood Stability: Let us define L_a := {k : ((D^a)′ sgn(D^a θ^a))_k ≠ 0} and T_a(b) := ∑_{k∈L_a} [(D^a)′ sgn(D^a θ^a)]_k θ^{b,L_a}_k. There exists some δ₂ < 1 so that for all a, b ∈ Γ(n) such that b ∉ L_a, |T_a(b)| < δ₂ ‖D^a_{·b}‖₁, where D^a_{·b} denotes the bth column of D^a.
Similar to Meinshausen and Buhlmann's interpretation, we can describe an intuitive condition which implies the last two assumptions. Define

θ^a(η₁, η₂) = argmin_{θ: θ_a = 0} E ( X_a − ∑_{k∈Γ(n)} θ_k X_k )² + η₁‖θ‖₁ + η₂‖D^a θ‖₁.

According to the characterization of ne_a derived from (3.3.2), ne_a = {k ∈ Γ(n) : θ^a_k(0, 0) ≠ 0}. One can think of a two-dimensional perturbation approach in which one tweaks the parameters η₁ and η₂ in a way that the perturbed neighborhood ne_a(η₁, η₂) = {k ∈ Γ(n) : θ^a_k(η₁, η₂) ≠ 0} is identical to the original neighborhood
nea(0, 0). The following proposition shows that the two assumptions of neighbor-
hood stability are fulfilled under this situation. The terms Sa(b) and Ta(b) measure
the sub-gradient of the lasso penalty and the neighborhood-fused lasso penalty re-
spectively. Having a small l1 bound on them enforces the stability of the estimated
coefficients, and hence stability of neighbors. See section 3.8 for proof of the propo-
sition.
Proposition 3.3.2. If there exist some η₁ > 0, η₂ > 0 such that ne_a(η₁, 0) = ne_a(0, η₂) = ne_a(0, 0), then |S_a(b)| ≤ 1 and |T_a(b)| ≤ ‖D^a_{·b}‖₁. Moreover, ne_a(η₁, η₂) = ne_a(0, 0).
3.4 Optimization method
Before we describe our optimization method, we quickly review the existing al-
gorithms used for optimization of fused lasso problems and some of its generalized
versions. Broadly speaking, there are two fundamental classes of algorithms used in this type of problem: (a) solution path algorithms, which find the entire solution path for all values of the regularization parameters, and (b) approximation algorithms, which attempt to solve a large-scale optimization for a fixed set of regularization parameters using first order approximations.
Friedman et al.[75] formulated the path-wise optimization method for the stan-
dard fused lasso signal approximation problem where the design matrix X = I. The
algorithm is two-step, and the final solution is obtained by soft-thresholding the total-variation-norm-penalized estimate obtained in the first step. The basic chal-
lenge in applying the coordinate descent algorithm to a fused lasso problem is the
non-separability of the total variation penalty unlike usual lasso where the l1-penalty
is completely separable. So they used a modified coordinate descent approach where
the descent step was followed by an additional fusion and smoothing step. However,
this method works only for the total variation penalty and does not extend to fused
lasso regression problems with generalized fusion penalty like our situation. Hoe-
fling[79] proposed a path-wise optimization algorithm that worked for generalized
fusion penalty in fused lasso signal approximation problem. This algorithm uses the
fact that the sets of fused coefficients do not change except for finitely many values
of the fused penalty parameter. Tibshirani and Taylor[80] proposed another path al-
gorithm for the fused lasso regression problem with a generalized fused lasso penalty
by solving the dual optimization problem. However, the path algorithm that they de-
vised can be applied only when the design matrix has full rank and is hence not applicable to high dimensional problems. To resolve the problem when the matrix is not of full rank, they proposed adding an infinitesimal perturbation ε‖β‖₂. However, this does not solve the problem completely, as a small ε leads to ill-conditioning, and an increasing number of rows in the generalized fused penalty matrix leads to an inefficient solution because of the increasing number of dual variables.
Approximation algorithms were developed to find efficient solutions to general
fused lasso problem with a fixed set of penalty parameters, regardless of its rank
and they usually adapt themselves to high dimensional problems quite easily. Usu-
ally these approximation algorithms are based on first order approximation type
methods like gradient descent. Liu et al.[81] proposed the efficient fused lasso algo-
rithm (EFLA), which solves the standard fused lasso regression problem by replacing the quadratic error term in the objective function by its first order Taylor expansion at an approximate solution, followed by an additional quadratic regularization term.
The approximate objective function has a fused lasso signal approximation form and
can be solved by applying gradient descent on its dual which is a box-constrained
quadratic program. Chen et al.[82] proposed the smoothing proximal gradient method
to solve regression problem with structured penalties which closely resemble our ob-
jective function. The basic idea is to approximate the fused penalty term ‖ Dβ ‖1 by
a smooth function and devise an iterative scheme for an approximate optimization.
The smooth function used here is given by

Ω(β, t) = max_{‖α‖_∞ ≤ 1} ( α^T D β − (t/2) ‖α‖₂² ).
Convexity and continuous differentiability of this function follow from Nesterov[83]. One may now proceed using a standard first order approximation approach like FISTA[84], which works like EFLA. However, this process is computationally intensive and definitely not a feasible approach in our case, where we need to do a nodewise regression.
The other algorithm which tries to solve a similar problem is the Split Bregman algorithm[85], which was proposed for standard fused lasso regression and later extended to generalized fused lasso regression. The SB algorithm is derived from the augmented Lagrangian[86, 87], which adds quadratic penalty terms to the penalized objective function and alternately solves the primal and the dual starting from an initial estimate. This is also computationally intensive unless one has a simple structural constraint on the parameters.
We propose a different algorithm to solve our optimization problem. We show
that the neighborhood-fused lasso problem can be reparametrized into a standard
lasso problem, thus simplifying the optimization procedure.
The neighborhood-fused LASSO estimate θ^{a,λ,µ} of θ^a can be written as

θ^{a,λ,µ} = argmin_{θ: θ_a = 0} ( (1/n) ‖X_a − Xθ‖² + λ‖θ‖₁ + µ‖D^a θ‖₁ )
          = argmin_{θ: θ_a = 0} ( (1/n) ‖X_a − Xθ‖² + λ ( ‖θ‖₁ + (µ/λ) ‖D^a θ‖₁ ) )
          = argmin_{θ: θ_a = 0} (1/n) ‖X_a − Xθ‖² + λ ‖ [ I ; (µ/λ) D^a ] θ ‖₁   (3.4.1)
Letting G_a = [ I ; (µ/λ) D^a ] (the matrix obtained by stacking the identity on top of (µ/λ) D^a) and ω = G_a θ, we define

ω^{a,λ,µ} = argmin_ω ( (1/n) ‖X_a − X G_a^+ ω‖² + λ‖ω‖₁ )   (3.4.2)

where G_a^+ is the Moore-Penrose inverse of G_a. We find θ^{a,λ,µ} from the following relation (see lemma 3.4.1 for details):

θ^{a,λ,µ} = G_a^+ ω^{a,λ,µ}.   (3.4.3)

We note that since G_a has full column rank, G_a^+ has full row rank. Thus, the following lemma can be applied here.
Lemma 3.4.1. Let

β̂ := argmin_{β∈R^k} ‖y − Xβ‖² + λ‖Gβ‖₁

and

ω̂ := argmin_ω ‖y − XG⁺ω‖² + λ‖ω‖₁.

Also assume that G is of full column rank. Then

β̂ = G⁺ω̂.
One interesting observation here is that although we started with the assumption that ω = G_a θ ∈ C(G_a), where C(G_a) denotes the column space of G_a, we did not optimize over C(G_a) but over all ω. See the proof of lemma 3.4.1 for details. The heuristic idea behind this is that we find the minimizer on C(G_a) by projecting the global minimizer onto C(G_a), and the projection operator is given by G_a G_a^+. The objective function in (3.4.2) combines the two l₁ penalties into a single l₁ penalty and thus can easily be solved by any of the standard LASSO algorithms. The parameters λ and µ are chosen according to theorem 3.5.6. We show by several simulations that the proposed method performs better than Meinshausen-Buhlmann's method or the graphical lasso in situations where local constancy holds.
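The reparametrization above is easy to implement on top of any off-the-shelf lasso solver. The following Python sketch (using numpy and scikit-learn's Lasso; it is an illustration under these assumptions, not the code used for the simulations) regresses node a on the remaining variables with the neighborhood-fused penalty and returns the estimated neighborhood. Note that scikit-learn scales the squared error by 1/(2n), hence the factor λ/2.

import numpy as np
from sklearn.linear_model import Lasso

def nfl_neighborhood(X, a, D, lam, mu, tol=1e-8):
    # X: n x p data matrix, D: m x p local difference matrix, lam, mu: penalties
    n, p = X.shape
    others = [j for j in range(p) if j != a]
    Xa, Xrest = X[:, a], X[:, others]
    Da = D[D[:, a] == 0][:, others]                    # rows of D not involving node a
    Ga = np.vstack([np.eye(p - 1), (mu / lam) * Da])   # G_a = [I ; (mu/lam) D^a]
    Ga_pinv = np.linalg.pinv(Ga)                       # Moore-Penrose inverse G_a^+
    # standard lasso in the reparametrized design X G_a^+, cf. (3.4.2)
    fit = Lasso(alpha=lam / 2.0, fit_intercept=False).fit(Xrest @ Ga_pinv, Xa)
    theta = Ga_pinv @ fit.coef_                        # theta = G_a^+ omega, cf. (3.4.3)
    return {others[j] for j in range(p - 1) if abs(theta[j]) > tol}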
3.5 Asymptotics of Graphical Model Selection
From the discussion in the preceding sections, it can be seen that using the neighborhood-fused lasso
leads to efficient model selection when applied successively to all the nodes. In this
section, we show that our proposed method leads to asymptotically consistent model
selection similar to the procedure by Meinshausen and Buhlmann. The choice of the
regularization parameters is crucial in such cases. Moreover, we show that our pro-
posed choice of regularization parameters not only ensures convergence to the “true”
model but that the convergence is faster than Meinshausen-Buhlmann's method when
the underlying model is indeed locally constant. This property is also supported by
our simulation results shown in section 3.7.
We start with a lemma on the subdifferential conditions of our objective func-
tion. For proof, see section 3.8.
Lemma 3.5.1. Given θ ∈ R^{p(n)}, let G(θ) be a p(n)-dimensional vector with elements G_b(θ) = −(2/n)⟨X_a − Xθ, X_b⟩. Define

L_a(θ) = {b : [(D^a)′ sgn(D^a θ)]_b ≠ 0}.

A vector θ is a solution to the fused LASSO problem described above iff, for every b,

G_b(θ) = λ sgn(θ_b) + µ [(D^a)′ sgn(D^a θ)]_b   when θ_b ≠ 0, b ∈ L_a(θ),
λ sgn(θ_b) − µ ‖D^a_{·b}‖₁ ≤ G_b(θ) ≤ λ sgn(θ_b) + µ ‖D^a_{·b}‖₁   when θ_b ≠ 0, b ∉ L_a(θ),
−λ + µ [(D^a)′ sgn(D^a θ)]_b ≤ G_b(θ) ≤ λ + µ [(D^a)′ sgn(D^a θ)]_b   when θ_b = 0, b ∈ L_a(θ),
|G_b(θ)| ≤ λ + µ ‖D^a_{·b}‖₁   otherwise.
This lemma builds the foundation of several of the following results and will be
used frequently to prove them. Sign consistency is one of the major properties that a
potentially good model selection method should exhibit. Before we study the model
selection consistency of our estimators, we show, in the following lemma that the
neighborhood-fused lasso estimator is sign-consistent.
Lemma 3.5.2. Let θ^{a,λ,µ} be defined for all a. Under assumptions A1-A7, there exists some c > 0 such that for all a, P(sgn(θ^{a,λ,µ}_b) = sgn(θ^a_b) ∀ b ∈ ne_a) = 1 − O(exp(−c n^ε)).
Observe that this lemma, in turn, preserves the asymptotic equality of the signs of local neighbors. If X_b and X_{b′} are local neighbors, then it is likely, most of the time,
that when one regresses Xa on the remaining variables, the coefficients of Xb and Xb′
will have the same sign. This is a direct consequence of local constancy of estimated
regression coefficients.
Our results show that, just as for Meinshausen and Buhlmann, a rate slower than n^{−1/2} is necessary for the regularization parameters for consistent model selection in the high dimensional case, where the dimension may increase as a polynomial in the sample size. Specifically, if λ decays as n^{−(1−ε)/2} and µ decays as n^{−(1−ε)/2 − β₀}, where 0 < β₀ < κ < ε < ξ are as in assumptions A1-A8, the estimated neighborhood is
almost surely contained in the true neighborhood. Hence the type-I error probability
goes to 0. This is formally stated in the following theorem (see section 3.8 for a
proof).
Theorem 3.5.3. Let assumptions A1-A8 be fulfilled. Let the penalty parameters satisfy λ_n ∼ d₁ n^{−(1−ε)/2} and µ_n ∼ d₂ n^{−(1−ε)/2 − β₀} with some β₀ ≥ 0, 0 < κ < ε < ξ and d₁, d₂ > 0. Then there exists some c > 0 such that for all a ∈ Γ(n),

P(ne^{λ,µ}_a ⊆ ne_a) = 1 − O(exp(−c n^ε)).
The assumptions of neighborhood stability and local neighborhood stability are
not redundant. The following proposition shows that one can not relax the assump-
tions A7 and A8.
Proposition 3.5.4. If there exist some a, b ∈ Γ(n) with b ∉ ne_a such that |S_a(b)| > 1 and |T_a(b)| > ‖D^a_{·b}‖₁, then for λ_n, µ_n as in theorem 3.5.3, P(ne^{λ,µ}_a ⊆ ne_a) → 0 as n → ∞.
On the other hand, with the same set of assumptions, one can show that the type II error probability, i.e., the probability of failing to identify a true edge,
exponentially goes to 0. This has been formally stated and proved in the following
theorem (see section 3.8 for a proof).
Theorem 3.5.5. With all the assumptions of theorem 3.5.3 and λ, µ as before,

P(ne_a ⊆ ne^{λ,µ}_a) = 1 − O(exp(−c n^ε)).
On Model Selection in Graphs
It follows from the discussion so far that consistent estimation of nodes is possible
using our approach. However, one of the original goals of our method is to estimate
the entire underlying graphical model, which can be accomplished by combining the
estimated neighborhoods in some way. Two different methods of combination have
been proposed:

E^{λ,µ,∨} := {(a, b) : a ∈ ne^{λ,µ}_b ∨ b ∈ ne^{λ,µ}_a}   (Union)

E^{λ,µ,∧} := {(a, b) : a ∈ ne^{λ,µ}_b ∧ b ∈ ne^{λ,µ}_a}   (Intersection)
In our simulations, we combined them following the union method. It has been
observed that the differences vanish asymptotically when a regular lasso is applied.
Although we did not study this theoretically for our case, our experiments indicate that the two combinations do not differ much asymptotically.
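For completeness, a small sketch of the two combination rules in Python (the function name and data layout are illustrative only):

def combine_neighborhoods(ne, p, rule="union"):
    # ne: dict mapping each node a to its estimated neighborhood (a set of nodes)
    edges = set()
    for a in range(p):
        for b in range(a + 1, p):
            in_a, in_b = b in ne[a], a in ne[b]
            if rule == "union" and (in_a or in_b):            # E^{lambda,mu,v}
                edges.add((a, b))
            elif rule == "intersection" and (in_a and in_b):  # E^{lambda,mu,^}
                edges.add((a, b))
    return edges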
Finite Sample Choices for Penalty Parameter
Theoretical exploration in the asymptotic domain does not cast much light on the choice of regularization parameters in a real-life, finite-sample problem. It is hard to ensure consistency or absolute containment as in theorems 3.5.3 or 3.5.5. However,
following the idea proposed by Meinshausen and Buhlmann, one can consider the
connectivity component of a node, which is defined as the set of nodes which are
connected to it through a chain of edges. Generalizing the results from Meinshausen
& Buhlmann, it has been shown in the following theorem (proof in section 3.8)
that the estimated connectivity component derived from neighborhood-fused lasso
estimate will belong to the true connectivity component with probability (1−α), for
any chosen level of α ∈ (0, 1).
Theorem 3.5.6. Under assumptions A1-A8 and with the following choices of the penalty parameters,

λ = (σ̂_a / √n) Φ̃^{−1}( α / (2 p(n)²) ),    µ = (σ̂_a / (K n^{β₀ + 1/2})) Φ̃^{−1}( α / (2 p(n)²) ),

we have

P( ∃ a ∈ Γ(n) : C^{λ,µ}_a ⊄ C_a ) ≤ α

for all n. Here C_a and C^{λ,µ}_a are the true and estimated connectivity components of a, K and β₀ are certain constants, and Φ̃ = 1 − Φ.
The choice of K and β₀ depends on the rate of growth of the local neighborhood with increasing sample size. In our simulations, we found that for models with constant dimension and increasing sample size, K = 1 and β₀ ∈ (0, 1/2) work fine.
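In practice, the choices in theorem 3.5.6 can be computed directly from the data; a sketch is given below (it uses scipy's normal quantile function, and the default values of α, K and β₀ are only illustrative).

import numpy as np
from scipy.stats import norm

def nfl_penalties(X, a, alpha=0.05, K=1.0, beta0=0.25):
    # lambda and mu as in theorem 3.5.6; K and beta0 reflect the assumed
    # growth rate of the local neighborhood and are user-chosen constants
    n, p = X.shape
    sigma_a = np.std(X[:, a], ddof=1)          # estimate of sigma_a
    q = norm.isf(alpha / (2.0 * p ** 2))       # inverse of 1 - Phi at alpha / (2 p(n)^2)
    lam = sigma_a / np.sqrt(n) * q
    mu = sigma_a / (K * n ** (beta0 + 0.5)) * q
    return lam, mu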
This is, by all means, a weaker result than asymptotic consistency. But, quite
important corollaries can be derived from it. E.g., if the underlying graph is empty,
then it will be estimated by an empty graph. Also, if the underlying graph has dis-
connected components, then with high probability, one is going to have an estimated
graph which is also disconnected, with the estimated connected components a subset
of the true connected components.
As shown in the simulations, we get a higher convergence speed over Meinshausen-
Buhlmann’s method by applying a neighborhood-fused lasso penalty for nodewise
regression when the underlying model exhibits local constancy. This can be proved
using the following lemma (proof in section 3.8).
Lemma 3.5.7. (a) With the notation having the same meaning as in the proof of theorem 3.5.3, the following is true:

P( |2n^{−1}⟨X_a, V_b⟩| ≥ (1−δ₁)λ + (1−δ₂)Bµ ) ≤ exp[ −(1/σ*²) ( (d₁/2)(1−δ₁) + (d₂/2)(1−δ₂) n^{β₀} )² n^ε ].

(b) The upper bound on the type I error probability for the neighborhood-fused lasso is smaller than that of the usual lasso by a factor of

exp[ −(1/σ*²) ( (d₁d₂/2)(1−δ₁)(1−δ₂) n^{β₀} + (d₂²/4)(1−δ₂)² n^{2β₀} ) n^ε ],

where σ*² = E(X²_{a,i} V²_{a,i}).
Looking at the proof of theorem 3.5.3, it is seen that the term

P( |2n^{−1}⟨X_a, V_b⟩| ≥ (1−δ₁)λ + (1−δ₂)Bµ )

is the principal contributor to the probability of false positives. The corresponding term using the usual lasso is

P( |2n^{−1}⟨X_a, V_b⟩| ≥ (1−δ₁)λ ).
Lemma 3.5.7 shows that the maximal probability of false positives using the neighborhood-fused lasso is a proper fraction of the same quantity using the usual lasso, and the fraction is small when the local neighborhood grows. So this method invariably performs better than Meinshausen-Buhlmann's method with respect to a minimax criterion, and the improvement is greater when the local neighborhood grows faster. It also implies that, in the worst case, it can perform as badly as the usual lasso regression.
3.6 Compatibility and l1 properties
In our attempt at Gaussian graphical model learning, we adopt the Meinshausen-Buhlmann approach, i.e., we do a componentwise penalized regression. However, one should keep in mind that, in the process of doing so, the design matrix (which is random here) changes at every iteration. In the previous section we discussed the asymptotic model selection consistency of our estimator. In this section, we shall go beyond model selection and explore conditions under which our neighborhood-fused lasso estimate exhibits nice asymptotic l₁ properties. We shall carry out our theoretical
analysis under the assumption that the linear regression model holds exactly, with
some underlying “true” θ0. Buhlmann and Van de Geer[88] derived some consistency
results and oracle properties of usual lasso. I shall apply and extend their results in
this section.
We start with a quick recapitulation of the notation we are going to use here. If X = (X_1, X_2, ..., X_p) ∼ N(0, Σ), one assumes a node-wise regression model

X_a = X^{(a)} θ^a + ε^a   for a = 1, 2, ..., p   (3.6.1)

where X_a denotes the a-th component, X^{(a)} denotes all the components except a, and ε^a ∼ N(0, σ² I_n) for some σ². Also assume that Σ := ((σ_{ij}))_{i,j=1,...,p} and Ω := Σ^{−1} = ((σ^{ij}))_{i,j=1,...,p} is the precision matrix. If E denotes the edge set of the conditional independence graph, then (i, j) ∉ E ⟺ σ^{ij} = σ^{ji} = 0. The design matrix X^{(a)} is obtained by deleting the a-th column of the data matrix.
In the high dimensional setup, where one generally has fewer samples than the model dimension (n < p), we assume an inherent sparsity in the true θ^a. This sparsity is also ensured if we assume that the underlying conditional independence graph is sparse. Let us define

S⁰_a = {j : θ^a_j ≠ 0}

and s⁰_a = card(S⁰_a). Since S⁰_a is not known, one needs a regularization penalty.
Meinshausen and Buhlmann chose the l₁ penalty, i.e., the traditional lasso, and obtained the estimate

θ^λ_a = argmin_{θ^a} [ (1/n) ‖X_a − X^{(a)} θ^a‖² + λ‖θ^a‖₁ ].

In our situation, we have added another penalty term ‖D^a θ^a‖₁ along with the lasso penalty term. Hence, our estimator is given by

θ^{λ,µ}_a = argmin_{θ^a} [ (1/n) ‖X_a − X^{(a)} θ^a‖² + λ‖θ^a‖₁ + µ‖D^a θ^a‖₁ ].
Note that this new definition of our estimator is exactly the same as the one defined in (3.3.3).
The Compatibility Condition for NFLasso
As mentioned in the earlier section, we shall develop our theory based on the
assumption of a linear truth. We try to provide an upper bound on the prediction
error. The following lemma forms the basis of our derivation.
Lemma 3.6.1 (Basic inequality).

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + λ‖θ^{λ,µ}_a‖₁ + µ‖D^a θ^{λ,µ}_a‖₁ ≤ (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a) + λ‖θ⁰_a‖₁ + µ‖D^a θ⁰_a‖₁.
It should be noted that the number of non-zero elements in a column of the matrix D^a is the number of local neighbors of the corresponding node (other than node a). Let us assume that the number of local neighbors is O(n^{β₀}). Fixing n, let us also assume that the number of local neighbors is bounded by B. Then it follows from lemma 3.6.1, using the inequality | ‖x‖₁ − ‖y‖₁ | ≤ ‖x − y‖₁, that

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a) + (λ + Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁.
The basic objective of using a penalty parameter is to overrule the empirical process term (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a). It can easily be seen that

| (2/n) ε_a′ X^{(a)}(θ^{λ,µ}_a − θ⁰_a) | ≤ ( (2/n) max_{1≤j≤p} |ε_a′ X^{(a)}_j| ) ‖θ^{λ,µ}_a − θ⁰_a‖₁.
Our goal is to choose λ and µ such that the probability that the empirical process term on the right-hand side exceeds λ + Bµ is small, so that with high probability the right-hand side can be dominated by (λ + Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁. To that effect,
we define

Λ_a := { max_{1≤j≤p, j≠a} (2/n) |ε_a′ X^{(a)}_j| ≤ λ₀ + Bµ₀ }.
Our objective is to show that for a particular choice of λ₀ and µ₀, Λ_a has high probability. One should keep in mind that both ε_a and X^{(a)}_j are random here. One can make the valid assumption of their individual Gaussian law and independence. We formalize these notions in the following proposition.
Proposition 3.6.2. Under the assumption of a linear truth for the Gaussian graphical model, i.e., X_a = X^{(a)} θ⁰_a + ε_a for a = 1, 2, ..., p, we have ε_{a,i} i.i.d. ∼ N(0, σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ab}′), where

Σ = ( σ_{aa}   Σ_{ab}
      Σ_{ab}′  Σ_{bb} ).
The proof of this proposition is easy and hence skipped. From lemma 3.6.3 and the corollary following it, we see that for a suitable choice of λ₀ and µ₀, P(Λ_a) is high.
Lemma 3.6.3. Under the assumption that σ_i = 1 ∀ i = 1, 2, ..., p,

P(Λ_a) ≥ 1 − 2 exp[ log p − n ( λ₀/2 + 1 − √(λ₀ + Bµ₀ + 1) ) ].

In particular, with λ₀ = 2(t + log p)/n > 0 and µ₀ = (2/B) √( (2/n)(t + log p) ),

P(Λ_a) ≥ 1 − 2 exp(−t).
Corollary 3.6.4. Assume that σ_i = 1 ∀ i = 1, 2, ..., p and that p = O(n^γ). Let the regularization parameters be

λ = 2(t² + log p)/n,    µ = (1/B) √( 8(t² + log p)/n ).

Then the following is true:

P( (1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ 2(λ + Bµ) ‖θ⁰_a‖₁ + µB ‖θ^{λ,µ}_a‖₁ ) ≥ 1 − exp(−t²).
Assume that δ_min is the smallest non-zero singular value of X^{(a)}. Then the following lemma provides an upper bound on the quadratic norm of the estimation error.
Lemma 3.6.5. Assume that σ_i = 1 ∀ i = 1, 2, ..., p and that p = O(n^γ). Let the regularization parameters be

λ = 2(t² + log p)/n,    µ = (1/B) √( 8(t² + log p)/n ).

Also assume that δ_min is the smallest non-zero singular value of X^{(a)}. Then we have

P( (1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ 2( λ + Bµ(1 + √p/2) ) ‖θ⁰_a‖₁ + npBµ(λ + Bµ)/δ_min ) ≥ 1 − exp(−t²).
In the following subsection we study some oracle properties of the neighborhood-fused lasso estimator and derive better bounds than the above.
Oracle Inequalities
We now try to derive some oracle inequalities which will prove the l1 consistency of
our neighborhood-fused lasso estimate. Following Buhlmann’s notation, let us write,
for an index set S ⊂ {1, 2, ..., p},

θ^a_{j,S} := θ^a_j 1{j ∈ S},
θ^a_{j,S^c} := θ^a_j 1{j ∉ S}.
We introduce a few more notations to carry out the calculations. We split the matrix
D^a as follows:

D^a = ( D^a_{S,S}   0
        D^a_{S,0}   D^a_{0,S^c}
        0           D^a_{S^c,S^c} )

where

D^a_{S,S} consists of all rows such that both non-zero terms belong to S;

D^a_{S,0} consists of all rows and columns such that exactly one of the non-zero terms in that row belongs to S;

D^a_{0,S^c} consists of all rows and columns such that exactly one of the non-zero terms in that row belongs to S^c;

D^a_{S^c,S^c} consists of all rows such that both non-zero terms belong to S^c.

Thus, in simple words, D^a_{S,S} takes care of the local difference terms within S, [D^a_{S,0}  D^a_{0,S^c}] takes care of the local difference terms between S and S^c, and D^a_{S^c,S^c} takes care of the local difference terms within S^c. With these notations, the following lemma holds.
Lemma 3.6.6. On Λ_a, if we choose λ ≥ 2λ₀ and µ ≥ 2µ₀, we have

(2/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + (λ − 3Bµ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3λ + 5Bµ) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁.
It can easily be verified that if λ ≥ (3 + 14/∆) Bµ for some ∆ > 0, we have

λ − 3Bµ ≥ (3λ + 5Bµ) / (3 + ∆).

This helps us simplify the consequences of lemma 3.6.6, which implies that

(λ − 3Bµ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3λ + 5Bµ) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁.

Combining this with the aforementioned condition, we get

( (3λ + 5Bµ) / (3 + ∆) ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (λ − 3Bµ) ‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3λ + 5Bµ) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁,

or,

‖θ^{λ,µ}_{a,S₀^c}‖₁ ≤ (3 + ∆) ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁.
A standard way to incorporate the l₁ penalty ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁ into an l₂ penalty is to use the Cauchy-Schwarz inequality

‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖₁ ≤ √s₀ ‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖

and subsequently relate it to the l₂ term on the left-hand side,

(2/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² = (2/n) (θ^{λ,µ}_a − θ⁰_a)′ X^{(a)′} X^{(a)} (θ^{λ,µ}_a − θ⁰_a),

in the following manner:

‖θ^{λ,µ}_{a,S₀} − θ⁰_{a,S₀}‖² ≤ (θ^{λ,µ}_a − θ⁰_a)′ X^{(a)′} X^{(a)} (θ^{λ,µ}_a − θ⁰_a) / φ²_{0,a},
where φ_{0,a} > 0 is some constant. However, θ^{λ,µ}_a being random, this condition cannot hold uniformly for all θ ∈ R^p. Buhlmann provided a compatibility condition so that this is true. In our situation, we require a similar condition to carry out further analysis. It should be noted that our compatibility condition is weaker than the compatibility condition Buhlmann provided; however, we need to work under the assumption that λ ≥ (3 + 14/∆) Bµ. The definition of this compatibility condition is provided below.
Definition 3.6.7 (Compatibility Condition). We say that the compatibility condition is satisfied for a collection of nodes S₀ if, for all θ satisfying

‖θ_{S₀^c}‖₁ ≤ (3 + ∆) ‖θ_{S₀}‖₁,

the following holds:

‖θ_{S₀}‖₁² ≤ s₀ ( θ′ X^{(a)′} X^{(a)} θ ) / (n φ²_{0,a}).
Following Bickel et al.[89], we shall refer to φ²_{0,a} as the restricted eigenvalue. Here we provide a more detailed explanation of this quantity. Observe that the compatibility
condition can be rewritten as

φ²_{0,a} ≤ inf_{θ: ‖θ_{S₀^c}‖₁ ≤ (3+∆)‖θ_{S₀}‖₁}  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ_{S₀}‖₁²,

so that φ_{0,a} can be taken as the square root of the infimum attained by the right-hand side. Now, following the chain of inequalities

inf_{θ: ‖θ_{S₀^c}‖₁ ≤ (3+∆)‖θ_{S₀}‖₁}  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ_{S₀}‖₁²
  ≥ inf_θ  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ_{S₀}‖₁²
  ≥ inf_θ  s₀ ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ( s₀ ‖θ_{S₀}‖² )
  ≥ inf_θ  ( (1/n) θ′ X^{(a)′} X^{(a)} θ ) / ‖θ‖²
  = λ_min( (1/n) X^{(a)′} X^{(a)} ),
Thus, φ²_{0,a} can be taken as the minimum eigenvalue over the restricted set {θ : ‖θ_{S₀^c}‖₁ ≤ (3 + ∆) ‖θ_{S₀}‖₁}. With this compatibility condition imposed, we derive the following oracle inequality.
Theorem 3.6.8. Assume that the compatibility condition (from definition 3.6.7) holds for some ∆ > 0. Then on Λ_a, for λ ≥ 2λ₀, µ ≥ 2µ₀ and λ ≥ (3 + 14/∆) Bµ, we have

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + (λ − 3Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁ ≤ s₀ (2λ + Bµ)² / φ²_{0,a}.
Combining theorem 3.6.8 and lemma 3.6.3, we get

Theorem 3.6.9. Under the assumption that σ_i = 1 ∀ i = 1, 2, ..., p, if one chooses

λ = 4(t + log p)/n,    µ = (4/B) √( (2/n)(t + log p) ),

and

t ≥ 2n (3 + 14/∆)²,

then with probability 1 − exp(−t) we have

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² + (λ − 3Bµ) ‖θ^{λ,µ}_a − θ⁰_a‖₁ ≤ s₀ (2λ + Bµ)² / φ²_{0,a}.
From theorem 3.6.8 we see that, with high probability,

(1/n) ‖X^{(a)}(θ^{λ,µ}_a − θ⁰_a)‖² ≤ s₀ (2λ + Bµ)² / φ²_{0,a},

and

‖θ^{λ,µ}_a − θ⁰_a‖₁ ≤ s₀ (2λ + Bµ)² / ( φ²_{0,a} (λ − 3Bµ) ).

Since λ and Bµ go to 0 as n → ∞, both these quantities are small.
3.7 Simulations
3.7.1 Simulation 1
In this simulation we repeat the exact scenario presented in Honorio’s first sim-
ulation setting. The Gaussian graphical model consists of 9 variables as shown in
figure 3.7.1. It deals with both local and non-local interactions. Our method is
compared to Meinshausen-Buhlmann's method, the graphical lasso and Honorio's Coordinate Direction Descent algorithm. We run our simulation for 4 different sample sizes, n = 4, 50, 100 and 400. For each sample size, we run the simulation 50 times and
estimate the neighborhood for each iteration. We construct a graph with edges that
occur most frequently in all of those 50 iterations. The results are shown in figure
3.7.1.
3.7.2 Simulation 2
In this simulation study we take a 50 dimensional normal random vector with zero
mean. The diagonals of the precision matrix are all 1 and all the nonzero off-diagonal
entries are 0.2. The conditional (in)dependence graph of this vector consists of both
spatial (local) and non-spatial neighbors where the local neighborhood structure is
linear (one dimensional lattice). There are two groups of distant neighbors. We
generate n i.i.d. samples from the corresponding normal distribution. We run the
simulation for n = 10, 25, 50, 100, 500, 1000. We try to reconstruct the conditional
(in)dependence graph from the data using Graphical LASSO (we use an oracle ver-
sion of graphical lasso where the choice of penalty parameter is contingent on the
actual number of edges in the true model), Meinshausen - Buhlmann’s coordinate-
wise LASSO and our coordinate-wise generalized Fused LASSO approach. For each
[Figure 3.7.1 panels: the ground truth graph on 9 nodes and the NFL, GLASSO, MB and CDD estimates for n = 4, 50, 100 and 400.]
Figure 3.7.1: Comparison of NFL with GLASSO, Meinshausen-Buhlmann estimate and CDD in section 3.7.1
n, we run the simulation 50 times and calculate the number of correctly identified
edges and that of falsely identified edges for each iteration. The means and standard deviations are shown in tables 3.1 and 3.2. Figure 3.7.2 shows the relative performance of the competing methods for different sample sizes; sample size increases from top to bottom. In the figure we include comparisons for two additional sample sizes, n = 5000 and n = 10000.
It is quite evident from the two simulations that our method converges to the true
model faster than Meinshausen-Buhlmann’s method. It is also shown theoretically
in lemma 3.5.7.
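A minimal sketch of the data-generating mechanism of simulation 2 is given below; the chain (local) structure follows the description above, while the particular distant edges chosen here are purely illustrative.

import numpy as np

def simulate_locally_constant_ggm(n, p=50, rho=0.2, seed=0):
    rng = np.random.default_rng(seed)
    Omega = np.eye(p)
    for i in range(p - 1):                      # chain of local neighbors
        Omega[i, i + 1] = Omega[i + 1, i] = rho
    for i, j in [(4, 29), (5, 30), (9, 39)]:    # hypothetical distant (non-local) edges
        Omega[i, j] = Omega[j, i] = rho
    Sigma = np.linalg.inv(Omega)                # covariance = inverse of the precision matrix
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X, Omega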
Method  Parameters   n = 10 fp        n = 10 tp     n = 25 fp        n = 25 tp     n = 50 fp        n = 50 tp
GL      ρ = 0.00     NA               NA            NA               NA            NA               NA
GL      ρ = 0.05     239.78 (8.94)    24.98 (4.13)  352.79 (9.89)    40.33 (3.86)  364.65 (14.77)   52.08 (3.79)
GL      ρ = 0.10     217.71 (8.73)    23.20 (3.83)  273.30 (11.22)   36.10 (3.57)  248.21 (11.54)   45.58 (3.52)
GL      ρ = 0.15     195.75 (9.85)    21.78 (3.68)  210.74 (11.01)   31.92 (3.88)  165.44 (9.21)    38.64 (3.70)
GL      ρ = 0.20     175.32 (10.02)   20.60 (3.25)  161.56 (10.49)   27.88 (3.45)  104.28 (8.24)    31.64 (3.75)
GL      ρ = 0.25     156.95 (11.18)   18.86 (3.28)  121.98 (10.66)   23.42 (3.84)  59.92 (6.64)     24.74 (4.13)
GL      ρ = 0.30     138.76 (10.79)   17.32 (3.42)  89.40 (10.55)    19.34 (3.51)  33.08 (5.67)     17.53 (3.15)
MB                   0 (0)            0 (0)         0 (0)            0 (0)         0 (0)            0 (0)
NFL                  51.62 (5.74)     7.54 (2.69)   37.32 (4.84)     10.12 (2.69)  47.20 (6.05)     20.18 (4.02)

Table 3.1: Comparison of False Positives and True Positives
[Figure 3.7.2 panels: each row shows, from left to right, the Truth, OGL (oracle graphical lasso), MB and NFL estimates for one sample size.]
Figure 3.7.2: Comparison of NFL with GLASSO and Meinshausen-Buhlmann estimate in section 3.7.2; sample sizes from top to bottom are 10, 25, 50, 100, 500, 1000, 5000, 10000
Method  Parameter   n = 100: fp      tp            n = 500: fp      tp            n = 1000: fp     tp
GL      ρ = 0.00    579.72 (22.16)   64.06 (3.03)  580.64 (20.99)   69 (0)        584.70 (24.48)   69 (0)
        ρ = 0.05    336.84 (15.68)   60.34 (2.79)  166.10 (8.39)    68.24 (0.85)  83.16 (7.06)     68.90 (0.36)
        ρ = 0.10    187.14 (11.43)   52.46 (2.97)  23.90 (4.49)     61.26 (1.66)  2.70 (1.76)      62.74 (1.64)
        ρ = 0.15    94.48 (9.24)     42.50 (3.59)  1.50 (1.04)      49.62 (1.71)  0.02 (0.14)      50.28 (1.84)
        ρ = 0.20    38.80 (6.33)     32.48 (3.25)  0.04 (0.20)      33.62 (2.92)  0 (0)            35.14 (3.09)
        ρ = 0.25    14.30 (4.45)     21.76 (3.32)  0 (0)            15.12 (3.99)  0 (0)            12.86 (2.67)
        ρ = 0.30    3.96 (2.08)      13.32 (2.87)  0 (0)            3.60 (1.96)   0 (0)            1.12 (0.87)
MB                  0 (0)            0 (0)         0 (0)            0 (0)         0 (0)            0.95 (0.92)
NFL                 99.92 (7.08)     41.10 (3.73)  2.44 (1.40)      49.86 (2.42)  0.02 (0.14)      49.70 (2.25)

Table 3.2: Comparison of false positives (fp) and true positives (tp); standard deviations over the 50 runs are given in parentheses.
3.8 Theoretical Results
Proof [Proof of proposition 3.3.2] The subdifferential of $E\big(X_a-\sum_{k\in\Gamma(n)}\theta_{ak}X_k\big)^2 + \eta_1\|\theta\|_1 + \eta_2\|D_a\theta_a\|_1$ with respect to $\theta_{ak}$, $k \in \Gamma(n)$, is given by
$$-2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}X_m\big)X_k\Big) + \eta_1 e^{(1)}_k + \eta_2 e^{(2)}_k,$$
where $e^{(1)}_k = \mathrm{sgn}(\theta_{ak})$ if $\theta_{ak} \neq 0$ and $e^{(1)}_k \in [-1, 1]$ if $\theta_{ak} = 0$; and $e^{(2)}_k = (D_a'\mathrm{sgn}(D_a\theta_a))_k$ if $(D_a'\mathrm{sgn}(D_a\theta_a))_k \neq 0$, otherwise $e^{(2)}_k \in [-1, 1]$.
Using $ne_a(\eta_1, 0) = ne_a(0, 0)$, it follows from lemma 3.5.1 that for all $k \in ne_a$,
$$2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(\eta_1,0)X_m\big)X_k\Big) = \eta_1\,\mathrm{sgn}(\theta_{ak}),$$
and for $b \notin ne_a$,
$$\Big|2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(\eta_1,0)X_m\big)X_b\Big)\Big| \le \eta_1.$$
A variable $X_b$ with $b \notin ne_a$ can be written as $X_b = \sum_{k\in ne_a}\theta^{b,ne_a}_k X_k + W_b$, where $W_b$ is independent of $\{X_k : k \in cl_a\}$. Using this yields
$$\Big|2\sum_{k\in ne_a}\theta^{b,ne_a}_k E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(\eta_1,0)X_m\big)X_k\Big)\Big| \le \eta_1.$$
Thus, it follows that $\big|\sum_{k\in ne_a}\theta^{b,ne_a}_k \mathrm{sgn}(\theta^{a,ne_a}_k)\big| \le 1$.
Using $ne_a(0, \eta_2) = ne_a(0, 0)$, it follows from lemma 3.5.1 that for all $k \in L_a$,
$$2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(0,\eta_2)X_m\big)X_k\Big) = \eta_2\,(D_a'\mathrm{sgn}(D_a\theta_a))_k,$$
and for $b \notin L_a$,
$$\Big|2E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(0,\eta_2)X_m\big)X_b\Big)\Big| \le \eta_2\|D^a_{.b}\|_1.$$
A variable $X_b$ with $b \notin L_a$ can be written as $X_b = \sum_{k\in L_a}\theta^{b,L_a}_k X_k + Z_b$, where $Z_b$ is independent of $\{X_k : k \in L_a\}$. Using this yields
$$\Big|2\sum_{k\in L_a}\theta^{b,L_a}_k E\Big(\big(X_a - \sum_{m\in\Gamma(n)}\theta_{am}(0,\eta_2)X_m\big)X_k\Big)\Big| \le \eta_2\|D^a_{.b}\|_1.$$
Thus, it follows that
$$\Big|\sum_{k\in L_a}\theta^{b,L_a}_k (D_a'\mathrm{sgn}(D_a\theta_a))_k\Big| \le \|D^a_{.b}\|_1.$$
Proof [Proof of lemma 3.4.1] Let $\beta_0 = G^+\hat\omega$. We will show that $\beta_0 = \hat\beta$. We know from the definition of $\hat\omega$ that for all $\omega$,
$$\| y - XG^+\hat\omega \|^2 + \lambda\|\hat\omega\|_1 \le \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1.$$
Since the objective function is convex, the minimizer in $\mathcal{C}(G)$ is obtained by projecting the grand minimizer onto $\mathcal{C}(G)$. Hence,
$$\tilde\omega := \mathrm{argmin}_{\omega\in\mathcal{C}(G)}\ \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1 = GG^+\hat\omega.$$
Therefore, for any $\omega \in \mathcal{C}(G)$, we have
$$\| y - XG^+\tilde\omega \|^2 + \lambda\|\tilde\omega\|_1 \le \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1$$
$$\Rightarrow\ \| y - XG^+GG^+\hat\omega \|^2 + \lambda\|GG^+\hat\omega\|_1 \le \| y - XG^+\omega \|^2 + \lambda\|\omega\|_1$$
$$\Rightarrow\ \| y - X\beta_0 \|^2 + \lambda\|G\beta_0\|_1 \le \| y - XG^+G\theta \|^2 + \lambda\|G\theta\|_1 \quad \forall\theta.$$
Since $G$ has full column rank, $G^+G = I$. Hence,
$$\| y - X\beta_0 \|^2 + \lambda\|G\beta_0\|_1 \le \| y - X\theta \|^2 + \lambda\|G\theta\|_1 \quad \forall\theta.$$
Therefore we get that
$$\beta_0 = \mathrm{argmin}_{\beta\in\mathcal{C}(G^+)=\mathbb{R}^k}\ \| y - X\beta \|^2 + \lambda\|G\beta\|_1.$$
Proof [Proof of lemma 3.5.1] The subdifferential of $\frac1n\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 + \mu\|D_a\theta\|_1$ is given by $\{G(\theta) + \lambda e_1 + \mu e_2 : e_1 \in S_1, e_2 \in S_2\}$, where $S_1 = \{e \in \mathbb{R}^{p(n)} : e_b = \mathrm{sgn}(\theta_b) \text{ if } \theta_b \neq 0 \text{ and } e_b \in [-1,1] \text{ if } \theta_b = 0\}$ and $S_2 = \{e \in \mathbb{R}^{p(n)} : e_b = \alpha_b \text{ if } \alpha_b \neq 0 \text{ and } e_b \in [-1,1] \text{ if } \alpha_b = 0\}$, where $\alpha = D_a'\mathrm{sgn}(D_a\theta)$. Observe that $|(D_a'\mathrm{sgn}(D_a\theta))_b| \le \|D^a_{.b}\|_1$, where $D^a_{.b}$ denotes the $b$th column of the matrix $D_a$; this is the same as the total number of local neighbors of $b$ other than $a$. The lemma follows.
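For concreteness, the penalized criterion whose subdifferential is analyzed above can also be written down and solved directly with an off-the-shelf convex solver. The sketch below uses CVXPY purely for illustration; it is not the optimization method of section 3.4, and the matrix Da passed in is assumed to encode the local-difference penalty of node a.

```python
import cvxpy as cp
import numpy as np

def neighborhood_fused_lasso(Xa, Xothers, Da, lam, mu):
    """Minimize (1/n)||Xa - Xothers @ theta||^2 + lam*||theta||_1 + mu*||Da @ theta||_1."""
    n, q = Xothers.shape
    theta = cp.Variable(q)
    objective = (cp.sum_squares(Xa - Xothers @ theta) / n
                 + lam * cp.norm1(theta)
                 + mu * cp.norm1(Da @ theta))
    cp.Problem(cp.Minimize(objective)).solve()
    return theta.value
```

Here Xa is the n-vector of observations at node a and Xothers contains the remaining columns of the data matrix; thresholding the returned coefficients gives an estimated neighborhood of node a.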
Proof [Proof of lemma 3.5.2] Using Bonferroni's inequality and $|ne_a| = o(n)$ for $n \to \infty$, it suffices to show that there exists some $c > 0$ so that for every $a, b \in \Gamma(n)$ with $b \in ne_a$, $P\big(\mathrm{sgn}(\theta^{a,ne_a,B,\lambda,\mu}_b) = \mathrm{sgn}(\theta^a_b)\big) = 1 - O(\exp(-cn^{\varepsilon}))$.
Consider the definition of
$$\theta^{a,ne_a,B,\lambda,\mu} = \mathrm{argmin}_{\theta:\,\theta_k=0\ \forall k\notin ne_a,\ \theta_l-\theta_m=0\ \forall (l,m)\in B}\Big(\tfrac1n\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 + \mu\|D_a\theta\|_1\Big).$$
Assume now that component $b$ of this estimate is fixed at a constant value $\beta$, and denote the new estimate by $\theta^{a,b,B,\lambda,\mu}(\beta)$,
$$\theta^{a,b,B,\lambda,\mu}(\beta) = \mathrm{argmin}_{\theta\in\Theta_{a,b}(\beta)}\Big(n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1 + \mu\|D_a\theta\|_1\Big),$$
where
$$\Theta_{a,b}(\beta) = \{\theta \in \mathbb{R}^{p(n)} : \theta_b = \beta,\ \theta_k = 0\ \forall k\notin ne_a,\ \theta_l - \theta_m = 0\ \forall (l,m)\in B\}.$$
There will always exist a value $\beta = \theta^{a,ne_a,B,\lambda,\mu}_b$ such that $\theta^{a,b,B,\lambda,\mu}(\beta)$ is identical to $\theta^{a,ne_a,B,\lambda,\mu}$. Thus, if $\mathrm{sgn}(\theta^{a,ne_a,B,\lambda,\mu}_b) \neq \mathrm{sgn}(\theta^a_b)$, there would exist some $\beta$ with $\mathrm{sgn}(\beta)\mathrm{sgn}(\theta^a_b) \le 0$ so that $\theta^{a,b,B,\lambda,\mu}(\beta)$ would be a solution. Using $\mathrm{sgn}(\theta^a_b) \neq 0$ for all $b \in ne_a$, it is sufficient to show that for every $\beta$ with $\mathrm{sgn}(\beta)\mathrm{sgn}(\theta^a_b) < 0$, $\theta^{a,b,B,\lambda,\mu}(\beta)$ cannot be a solution with high probability.
We concentrate on the case $\theta^a_b > 0$; the case $\theta^a_b < 0$ follows analogously. If $\theta^a_b > 0$, it follows by lemma 3.5.1 that $\theta^{a,b,B,\lambda,\mu}(\beta)$ with $\theta^{a,b,B,\lambda,\mu}_b(\beta) = \beta \le 0$ can only be a solution if $G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) \ge -\lambda - B\mu$, where $B = \max_{a,b}\|D^a_{.b}\|_1$. Hence it suffices to show that for some $c > 0$ and all $b \in ne_a$ with $\theta^a_b > 0$, for $n\to\infty$,
$$P\big(\sup_{\beta\le 0} G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) < -\lambda - B\mu\big) = 1 - O(\exp(-cn^{\varepsilon})). \qquad (3.8.1)$$
Let in the following $R^{\lambda,\mu}_a(\beta)$ be the $n$-dimensional vector of residuals,
$$R^{\lambda,\mu}_a(\beta) = X_a - X\theta^{a,b,B,\lambda,\mu}(\beta).$$
We can write $X_b$ as
$$X_b = \sum_{k\in ne_a\setminus b}\theta^{b,ne_a\setminus b}_k X_k + W_b,$$
where $W_b$ is independent of $\{X_k : k \in ne_a\setminus b\}$. It follows that
$$G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) = -2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle - \sum_{k\in ne_a\setminus b}\theta^{b,ne_a\setminus b}_k\big(2n^{-1}\langle R^{\lambda,\mu}_a(\beta), X_k\rangle\big).$$
By lemma 3.5.1, for all $k\in ne_a\setminus b$, $|G_k(\theta^{a,b,B,\lambda,\mu}(\beta))| = |2n^{-1}\langle R^{\lambda,\mu}_a(\beta), X_k\rangle| \le \lambda + B\mu$. This together with the equation above yields
$$G_b(\theta^{a,b,B,\lambda,\mu}(\beta)) \le -2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle + (\lambda+B\mu)\|\theta^{b,ne_a\setminus b}\|_1.$$
Using Assumption 5, there exists some $\vartheta < \infty$ so that $\|\theta^{b,ne_a\setminus b}\|_1 < \vartheta$. It is therefore sufficient to show that for every $g > 0$ there exists some $c > 0$ so that it holds for all $b \in ne_a$ with $\theta^a_b > 0$, for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle > g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
Let $\mathcal{W}_{\parallel} \subseteq \mathbb{R}^n$ be the space spanned by the vectors $\{X_k,\ k \in ne_a\setminus b\}$ and let $\mathcal{W}_{\perp}$ be the orthogonal complement of $\mathcal{W}_{\parallel}$ in $\mathbb{R}^n$. Split the $n$-dimensional vector $W_b$ into the two vectors $W_b = W_b^{\perp} + W_b^{\parallel}$, where $W_b^{\parallel} \in \mathcal{W}_{\parallel}$ and $W_b^{\perp} \in \mathcal{W}_{\perp}$. The inner product can be written as
$$2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b\rangle = 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle + 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\perp}\rangle.$$
By the accessory lemma (A1) below, there exists for every $g > 0$ some $c > 0$ so that, for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle/(1 + Kn^{\beta_0}|\beta|) > -g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
To show the result, it is sufficient to prove that there exists for every $g > 0$ some $c > 0$ so that, for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\perp}\rangle - g(1+Kn^{\beta_0}|\beta|)(\lambda+B\mu) > g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
It holds for some random variable $V_a$, independent of $X_{ne_a}$, that
$$X_a = \sum_{k\in ne_a}\theta^a_k X_k + V_a.$$
Note that $V_a$ and $W_b$ are independent normally distributed random variables with variances $\sigma^2_a$ and $\sigma^2_b$ respectively. By assumption 2, $0 < v^2 \le \sigma^2_b, \sigma^2_a \le 1$. Note furthermore that $W_b$ and $X_{ne_a\setminus b}$ are independent. Using $\theta^a = \theta^{a,ne_a}$ and $X_b = \sum_{k\in ne_a\setminus b}\theta^{b,ne_a\setminus b}_k X_k + W_b$,
$$X_a = \sum_{k\in ne_a\setminus b}\big(\theta^a_k + \theta^a_b\theta^{b,ne_a\setminus b}_k\big)X_k + \theta^a_b W_b + V_a.$$
Using this, the definition of the residuals and the orthogonality property of $W_b^{\perp}$,
$$2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\perp}\rangle = 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle + 2n^{-1}\langle V_a, W_b^{\perp}\rangle \ge 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - 2n^{-1}|\langle V_a, W_b^{\perp}\rangle|.$$
The second term, $2n^{-1}|\langle V_a, W_b^{\perp}\rangle|$, is stochastically smaller than $2n^{-1}|\langle V_a, W_b\rangle|$. Due to the independence of $V_a$ and $W_b$, $E(V_aW_b) = 0$. Using the Bernstein inequality and $\lambda + B\mu \sim dn^{-\frac{1-\varepsilon}{2}}$ with $\varepsilon > 0$, there exists for every $g > 0$ some $c > 0$ so that
$$P\big(2n^{-1}|\langle V_a, W_b^{\perp}\rangle| \ge g(\lambda+B\mu)\big) \le P\big(2n^{-1}|\langle V_a, W_b\rangle| \ge g(\lambda+B\mu)\big) = O(\exp(-cn^{\varepsilon})).$$
Thus, it is sufficient to show that for every $g > 0$ there exists a $c > 0$ such that for $n\to\infty$,
$$P\big(\inf_{\beta\le 0} 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - g(1+Kn^{\beta_0}|\beta|)(\lambda+B\mu) > 2g(\lambda+B\mu)\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
Note that $\sigma_b^{-2}\langle W_b^{\perp}, W_b^{\perp}\rangle$ follows a $\chi^2_{n-|ne_a|}$-distribution. As $|ne_a| = o(n)$ and $\sigma^2_b \ge v^2$ (by Assumption 2), it follows that there exists some $k > 0$ so that for $n > n_0$ with some $n_0(k) \in \mathbb{N}$, and any $c > 0$,
$$P\big(2n^{-1}\langle W_b^{\perp}, W_b^{\perp}\rangle > k\big) = 1 - O(\exp(-cn^{\varepsilon})).$$
Hence, it suffices to show that for every $k, l > 0$ there exists some $n_0(k, l) \in \mathbb{N}$ so that for all $n \ge n_0$,
$$\inf_{\beta\le 0}\ (\theta^a_b - \beta)k - l(1+Kn^{\beta_0}|\beta|)(\lambda+B\mu) > 0.$$
By assumption 5, $|\pi_{ab}|$ is of order at least $n^{-\frac{1-\xi}{2}+\beta_0}$. Using
$$\pi_{ab} = \theta^a_b\big/\big(\mathrm{Var}(X_a|X_{\Gamma(n)\setminus a})\,\mathrm{Var}(X_b|X_{\Gamma(n)\setminus b})\big)^{\frac12}$$
and assumption 2, this implies that there exists some $q > 0$ so that $\theta^a_b \ge qn^{-\frac{1-\xi}{2}+\beta_0}$. As $\lambda \sim d_1 n^{-\frac{1-\varepsilon}{2}}$, $\mu \sim d_2 n^{-\frac{1-\varepsilon}{2}-\beta_0}$ and $\xi > \varepsilon$ by the assumption of theorem 1, it follows that for every $k, l > 0$ and large enough values of $n$,
$$\theta^a_b k - lKn^{\beta_0}|\beta|(\lambda+B\mu) > 0.$$
It remains to show that for any $k, l > 0$ there exists some $n_0(k, l)$ such that for all $n \ge n_0$,
$$\inf_{\beta\le 0}\ -\beta k - l(\lambda+B\mu) \ge 0.$$
This follows as $\lambda + B\mu \to 0$ for $n\to\infty$, which completes the proof.
Lemma (A1) (Accessory 1 to lemma 3.5.2). Assume the conditions of theorem 1 hold true. Let $R^{\lambda,\mu}_a(\beta)$ and $W_b^{\parallel}$ be defined as in the proof of the previous lemma. For any $g > 0$, there exists $c > 0$ so that it holds for all $a, b \in \Gamma(n)$, for $n\to\infty$,
$$P\left(\sup_{\beta\in\mathbb{R}}\ \frac{|2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle|}{1+Kn^{\beta_0}|\beta|} < g(\lambda+B\mu)\right) = 1 - O(\exp(-cn^{\varepsilon})).$$
Proof By the Cauchy-Schwarz inequality,
$$\frac{|2n^{-1}\langle R^{\lambda,\mu}_a(\beta), W_b^{\parallel}\rangle|}{1+Kn^{\beta_0}|\beta|} \le 2n^{-\frac12}\|W_b^{\parallel}\|_2\ \frac{n^{-\frac12}\|R^{\lambda,\mu}_a(\beta)\|_2}{1+Kn^{\beta_0}|\beta|}.$$
The sum of squares of the residuals is increasing in $\lambda$ and $\mu$. Thus, $\|R^{\lambda,\mu}_a(\beta)\|_2 \le \|R^{\infty,\infty}_a(\beta)\|_2$. By definition of $R^{\lambda,\mu}_a$,
$$\|R^{\infty,\infty}_a(\beta)\|_2^2 = \|X_a - \beta X_b - \beta X_{b_1} - \beta X_{b_2} - \cdots - \beta X_{b_w}\|_2^2,$$
where $w$ is the number of nodes which are in the same equivalence class as $b$. Hence,
$$\|R^{\infty,\infty}_a(\beta)\|_2^2 \le (1+(w+1)|\beta|)^2\max\{\|X_a\|_2^2, \|X_b\|_2^2, \|X_{b_1}\|_2^2, \cdots, \|X_{b_w}\|_2^2\} \le (1+Kn^{\beta_0}|\beta|)^2\max\{\|X_a\|_2^2, \|X_b\|_2^2, \|X_{b_1}\|_2^2, \cdots, \|X_{b_w}\|_2^2\}.$$
Hence, for any $q > 0$,
$$P\left(\sup_{\beta\in\mathbb{R}}\frac{n^{-\frac12}\|R^{\lambda,\mu}_a(\beta)\|_2}{1+Kn^{\beta_0}|\beta|} > q\right) \le P\Big(n^{-\frac12}\max\{\|X_a\|_2, \|X_b\|_2, \|X_{b_1}\|_2, \cdots, \|X_{b_w}\|_2\} > q\Big).$$
Note that $\|X_a\|_2^2$, $\|X_b\|_2^2$ and all of the $\|X_{b_k}\|_2^2$'s ($k = 1, 2, \cdots, w$) have a $\chi^2_n$ distribution. Thus, by the following lemma (Lemma 3.8.1), there exist $q > 1$ and $c > 0$ such that
$$P\left(\sup_{\beta\in\mathbb{R}}\frac{n^{-\frac12}\|R^{\lambda,\mu}_a(\beta)\|_2}{1+Kn^{\beta_0}|\beta|} > q\right) = O(\exp(-cn^{\varepsilon}))\quad\text{for } n\to\infty.$$
It remains to be shown that for every $g > 0$ there exists some $c > 0$ so that
$$P\Big(n^{-\frac12}\|W_b^{\parallel}\|_2 > g(\lambda+B\mu)\Big) = O(\exp(-cn^{\varepsilon}))\quad\text{for } n\to\infty.$$
The expression $\sigma_b^{-2}\langle W_b^{\parallel}, W_b^{\parallel}\rangle$ is $\chi^2_{|ne_a|-1}$ distributed. As $\sigma_b \le 1$ and $|ne_a| = O(n^{\kappa})$, it follows that $n^{-\frac12}\|W_b^{\parallel}\|_2$ is stochastically smaller than $tn^{-\frac{1-\kappa}{2}}\big(\tfrac{Z}{n^{\kappa}}\big)^{\frac12}$, for some $t > 0$ and some $Z \sim \chi^2_{n^{\kappa}}$. Thus, for every $g > 0$,
$$P\Big(n^{-\frac12}\|W_b^{\parallel}\|_2 > g(\lambda+B\mu)\Big) \le P\Big(\frac{Z}{n^{\kappa}} > \big(\tfrac{g}{t}\big)^2 n^{1-\kappa}(\lambda+B\mu)^2\Big).$$
As $\lambda \sim n^{-\frac{1-\varepsilon}{2}}$, $\mu \sim n^{-\frac{1-\varepsilon}{2}-\beta_0}$ and $B \sim n^{\beta_0}$, it follows that $n^{1-\kappa}(\lambda+B\mu)^2 \ge hn^{\varepsilon-\kappa}$ for some $h > 0$ and sufficiently large $n$. By the properties of the $\chi^2$ distribution and $\varepsilon > \kappa$ (by assumption in Theorem 4.6), the claim follows. This completes the proof.
Lemma 3.8.1 (Accessory 2 to lemma 3.5.2). If $Y \sim \chi^2_n$, then $P(n^{-\frac12}\sqrt{Y} > q) = O(\exp(-cn^{\varepsilon}))$ for some $q > 1$ and all $c > 0$.
Proof
$$P(n^{-\frac12}\sqrt{Y} > q) = P(n^{-1}Y > q^2) = P\Big(\frac1n\sum_{j=1}^n Y_j^2 > q^2\Big) \le \frac{\big(E\big(\exp(\tfrac{t}{n}Y_j^2)\big)\big)^n}{\exp(tq^2)},$$
by using the Markov inequality with the increasing function $\psi_t(x) = \exp(tx)$. The moment generating function of a $\chi^2_1$ variable is known to be $\psi(s) = (1-2s)^{-\frac12}$ for $0 \le s < \frac12$, and hence we obtain the upper bound
$$P(n^{-\frac12}\sqrt{Y} > q) \le \exp(-tq^2)\Big(1-\frac{2t}{n}\Big)^{-\frac{n}{2}} = \exp\Big(-tq^2 + \frac{n}{2}\log\Big(\frac{1}{1-\frac{2t}{n}}\Big)\Big).$$
Since the probability on the left-hand side does not depend on $t$, we can take the infimum over $t$ on the right (as long as the resulting $t$ satisfies the constraint $0 \le \frac{t}{n} < \frac12$ used above). This infimum is achieved at $t^* = \frac{n(q^2+1)}{2}$. This $t^*$ being too large (beyond the constraint region), together with the observation that $f'(t) < 0$ for smaller values of $t$, where $f(t) := -tq^2 + \frac{n}{2}\log\big(\frac{1}{1-\frac{2t}{n}}\big)$, tells us that a convenient choice for $t$ is $\frac{n}{4}$. With this choice, we get
$$P(n^{-\frac12}\sqrt{Y} > q) \le \exp\Big(-\frac{q^2n}{4} + \frac{n}{2}\log 2\Big).$$
If $q^2 = \eta^2 + 2\log 2$ for some $\eta > 0$ (so that $q > 1$), then this bound becomes $\le \exp\big(-\frac{\eta^2 n}{4}\big)$. Since $\eta^2 \ge 4cn^{\varepsilon-1}$ for large $n$ and for all constants $c$, $\exp\big(-\frac{\eta^2 n}{4}\big) \le \exp(-cn^{\varepsilon})$. Therefore
$$P(n^{-\frac12}\sqrt{Y} > q) = O(\exp(-cn^{\varepsilon}))\quad\text{for some } q > 1 \text{ and all } c > 0.$$
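As a quick sanity check of the Chernoff-type bound just derived (an illustration only, not part of the proof), one can compare it with the exact $\chi^2_n$ tail; with $q^2 = 2 + 2\log 2$ the bound reduces to $\exp(-n/2)$ and should dominate the exact probability.

```python
import numpy as np
from scipy.stats import chi2

q2 = 2.0 + 2.0 * np.log(2.0)            # q^2 = eta^2 + 2 log 2 with eta^2 = 2
for n in (50, 200, 1000):
    exact = chi2.sf(n * q2, df=n)        # P(Y > n q^2) for Y ~ chi^2_n
    bound = np.exp(-q2 * n / 4.0 + (n / 2.0) * np.log(2.0))   # equals exp(-n/2)
    print(n, exact, bound)               # the bound is looser but of the same exponential order
```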
Proof [Proof of theorem 3.5.3] The event $ne^{\lambda,\mu}_a \not\subseteq ne_a$ is equivalent to the event that there exists some node $b \in \Gamma(n)\setminus cl_a$ in the set of non-neighbors of node $a$ such that the estimated coefficient $\theta^{a,\lambda,\mu}_b$ is not zero. Thus,
$$P\big(ne^{\lambda,\mu}_a \subseteq ne_a\big) = 1 - P\big(\exists b \in \Gamma(n)\setminus cl_a : \theta^{a,\lambda,\mu}_b \neq 0\big).$$
Let $\mathcal{E}$ be the event that
$$\max_{k\in\Gamma(n)\setminus cl_a}\big|G_k\big(\theta^{a,ne_a,B,\lambda,\mu}\big)\big| < \lambda + B\mu.$$
Conditional on $\mathcal{E}$, it follows from Meinshausen-Buhlmann's discussion that $\theta^{a,ne_a,B,\lambda,\mu}$ is also a solution to the fused LASSO problem with $A = \Gamma(n)\setminus\{a\}$. As $\theta^{a,ne_a,B,\lambda,\mu}_b = 0$ for all $b \in \Gamma(n)\setminus cl_a$, it follows that $\theta^{a,\lambda,\mu}_b = 0$ for all $b \in \Gamma(n)\setminus cl_a$. By lemma 3.5.1,
$$P\big(\exists b \in \Gamma(n)\setminus cl_a : \theta^{a,\lambda,\mu}_b \neq 0\big) \le P\Big(\max_{k\in\Gamma(n)\setminus cl_a}\big|G_k\big(\theta^{a,ne_a,B,\lambda,\mu}\big)\big| \ge \lambda + B\mu\Big).$$
It suffices to show that there exists a constant $c > 0$ so that for all $b \in \Gamma(n)\setminus cl_a$,
$$P\Big(\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| \ge \lambda + B\mu\Big) = O(\exp(-cn^{\varepsilon})).$$
Write $X_b = \sum_{m\in ne_a}\theta^{b,ne_a}_m X_m + V_b$ for any $b \in \Gamma(n)\setminus cl_a$, where $V_b \sim N(0, \sigma^2_b)$ for some $\sigma^2_b \le 1$ and $V_b$ is independent of $\{X_m : m \in cl_a\}$. Hence,
$$G_b\big(\theta^{a,ne_a,\lambda,\mu}\big) = -2n^{-1}\sum_{m\in ne_a}\theta^{b,ne_a}_m\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, X_m\rangle - 2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle.$$
By lemma 3.5.2, there exists a $c > 0$ so that with probability $1 - O(\exp(-cn^{\varepsilon}))$,
$$\mathrm{sgn}\big(\theta^{a,ne_a,\lambda,\mu}_k\big) = \mathrm{sgn}\big(\theta^{a,ne_a}_k\big)\quad\forall k \in ne_a.$$
In this case, it holds by lemma 3.5.1 that
$$\Big|2n^{-1}\sum_{m\in ne_a}\theta^{b,ne_a}_m\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, X_m\rangle\Big| \le \Big|\sum_{m\in ne_a}\mathrm{sgn}\big(\theta^{a,ne_a}_m\big)\theta^{b,ne_a}_m\lambda\Big| + \Big|\sum_{m\in ne_a}\big[D_a'\mathrm{sgn}(D_a\theta)\big]_m\theta^{b,ne_a}_m\mu\Big| \le \delta_1\lambda + \delta_2 B\mu.$$
The absolute value of the coefficient $G_b$ is hence bounded by
$$\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| \le \delta_1\lambda + \delta_2 B\mu + \big|2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle\big|$$
with probability $1 - O(\exp(-cn^{\varepsilon}))$. Conditional on $X_{cl_a}$, the random variable $\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle$ is normally distributed with mean 0 and variance $\sigma^2_b\|X_a - X\theta^{a,ne_a,\lambda,\mu}\|^2$. By definition of $\theta^{a,ne_a,\lambda,\mu}$,
$$\|X_a - X\theta^{a,ne_a,\lambda,\mu}\|^2 \le \|X_a\|^2.$$
Since $\sigma^2_b \le 1$, $2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle$ is stochastically smaller than or equal to $|2n^{-1}\langle X_a, V_b\rangle|$. It remains to show that for some $c > 0$ and some $0 < \delta_1, \delta_2 < 1$,
$$P\big(|2n^{-1}\langle X_a, V_b\rangle| \ge (1-\delta_1)\lambda + (1-\delta_2)B\mu\big) = O(\exp(-cn^{\varepsilon})).$$
Since $X_a$ and $V_b$ are independent, $E(X_aV_b) = 0$. Using Gaussianity and the bounded variances of both $X_a$ and $V_b$, there exists some $g < \infty$ such that $E\big(\exp(|X_aV_b|)\big) < g$. Hence, using the Bernstein inequality and the boundedness of $\lambda$ and $\mu$, it holds for some $c > 0$ that for all $b \in \Gamma(n)\setminus cl_a$ (see lemma 3.5.7 for a detailed proof),
$$P\big(|2n^{-1}\langle X_a, V_b\rangle| \ge (1-\delta_1)\lambda + (1-\delta_2)B\mu\big) = O(\exp(-cn^{\varepsilon})),$$
which completes the proof.
• Lemma 3.5.7 delves into the precise constants of the above inequality and shows that, with a finite sample size, one can achieve a smaller type I error probability by using the neighborhood-fused lasso instead of the usual lasso.
Proof [Proof of proposition 3.5.4] We follow the proof of theorem 3.5.3 and claim that, under the above assumptions, for all $a, b$ with $b \notin ne_a$,
$$P\big(\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| > \lambda + B\mu\big) \to 1\quad\text{for } n\to\infty.$$
Following similar arguments afterwards, it can be concluded that for some $\delta_1 > 1$ and $\delta_2 > 1$,
$$P\big(\big|G_b\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| \ge \delta_1\lambda + \delta_2 B\mu - |2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle|\big) \to 1\quad\text{as } n\to\infty.$$
It holds for the third term that for any $g_1 > 0$, $g_2 > 0$,
$$P\big(|2n^{-1}\langle X_a - X\theta^{a,ne_a,\lambda,\mu}, V_b\rangle| > g_1\lambda + g_2 B\mu\big) \to 0\quad\text{as } n\to\infty,$$
which combined with the previous result proves the proposition.
Proof [Proof of theorem 3.5.5] Observe that
$$P\big(ne_a \subseteq ne^{\lambda,\mu}_a\big) = 1 - P\big(\exists b \in ne_a : \theta^{a,\lambda,\mu}_b = 0\big).$$
Let $\mathcal{E}$ be the event
$$\max_{k\in\Gamma(n)\setminus cl_a}\big|G_k\big(\theta^{a,ne_a,\lambda,\mu}\big)\big| < \lambda + B\mu.$$
On $\mathcal{E}$, following similar arguments as before, we can conclude that $\theta^{a,ne_a,\lambda,\mu} = \theta^{a,\lambda,\mu}$. Therefore
$$P\big(\exists b \in ne_a : \theta^{a,\lambda,\mu}_b = 0\big) \le P\big(\exists b \in ne_a : \theta^{a,ne_a,\lambda,\mu}_b = 0\big) + P(\mathcal{E}^c).$$
It follows from the proof of Theorem 4.6 that there exists some $c > 0$ so that $P(\mathcal{E}^c) = O(\exp(-cn^{\varepsilon}))$. Using Bonferroni's inequality, it hence remains to show that there exists some $c > 0$ so that for all $b \in ne_a$,
$$P\big(\theta^{a,ne_a,\lambda,\mu}_b = 0\big) = O(\exp(-cn^{\varepsilon})),$$
which follows from lemma 3.5.2.
Proof [Proof of theorem 3.5.6] $C^{\lambda,\mu}_a \not\subseteq C_a$ implies the existence of an edge in the estimated neighborhood that connects two nodes lying in two different connectivity components of the true underlying graph. Hence,
$$P\big(\exists a \in \Gamma(n),\ \exists b \in \Gamma(n)\setminus C_a : b \in ne^{\lambda,\mu}_a\big) \le p(n)\max_a P\big(\exists b \in \Gamma(n)\setminus C_a : b \in ne^{\lambda,\mu}_a\big).$$
Going by the same arguments used in proving theorem 4.6, we have
$$P\big(\exists b \in \Gamma(n)\setminus C_a : b \in ne^{\lambda,\mu}_a\big) \le P\Big(\max_{b\in\Gamma(n)\setminus C_a}\big|G_b\big(\theta^{a,C_a,\lambda,\mu}\big)\big| \ge \lambda + B\mu\Big).$$
Thus, it is sufficient to bound
$$p(n)^2\max_{a\in\Gamma(n),\ b\in\Gamma(n)\setminus C_a} P\big(\big|G_b\big(\theta^{a,C_a,\lambda,\mu}\big)\big| \ge \lambda + B\mu\big).$$
Now observe that, since $X_b$ and $\{X_k : k \in C_a\}$ lie in different connectivity components, they are, in fact, independent. Therefore, conditional on $X_{C_a}$,
$$G_b\big(\theta^{a,C_a,\lambda,\mu}\big) \sim N\Big(0,\ \frac{4\|X_a - X\theta^{a,C_a,\lambda,\mu}\|^2}{n^2}\Big),$$
making it stochastically smaller than $Z \sim N\big(0, \frac{4\|X_a\|^2}{n^2}\big)$. Hence it holds for all $a \in \Gamma(n)$ and $b \in \Gamma(n)\setminus C_a$ that
$$P\big(\big|G_b\big(\theta^{a,C_a,\lambda,\mu}\big)\big| > \lambda + B\mu\big) \le 2\tilde\Phi\Big(\frac{\sqrt{n}(\lambda+B\mu)}{2\sigma_a}\Big),$$
where $\tilde\Phi = 1 - \Phi$. Using the $\lambda$ and $\mu$ proposed, the right-hand side becomes $\frac{\alpha}{p(n)^2}$.
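Inverting the Gaussian tail bound displayed above gives a simple recipe for the combined penalty level at a prescribed significance level α: it suffices that λ + Bµ = (2σ_a/√n) Φ⁻¹(1 − α/(2p(n)²)). The snippet below evaluates this quantity; using the empirical standard deviation of X_a as a plug-in for σ_a is an assumption made here purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def combined_penalty_level(Xa, p, alpha):
    """lambda + B*mu making 2*(1 - Phi(sqrt(n)*(lambda+B*mu)/(2*sigma_a))) equal alpha/p^2."""
    n = len(Xa)
    sigma_a = np.std(Xa)                      # assumed plug-in estimate of sigma_a
    return 2.0 * sigma_a / np.sqrt(n) * norm.ppf(1.0 - alpha / (2.0 * p ** 2))
```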
Proof [Proof of lemma 3.5.7] This is a direct application of the Bernstein inequality. Since $\lambda \to 0$ and $\mu \to 0$ as $n \to \infty$, for large enough $n$ we have, by a version of Bernstein's inequality, that
$$P\big(\big|2n^{-1}\langle X_a, V_b\rangle\big| \ge (1-\delta_1)\lambda + (1-\delta_2)B\mu\big) \le \exp\left(-\frac{\big(\frac{d_1}{2}(1-\delta_1)n^{\frac{1+\varepsilon}{2}} + \frac{d_2}{2}(1-\delta_2)n^{\frac{1+\varepsilon}{2}+\beta_0}\big)^2}{nE(X^2_{a,i}V^2_{b,i})}\right) = \exp\left[-\frac{1}{\sigma^2_*}\Big(\frac{d_1}{2}(1-\delta_1) + \frac{d_2}{2}(1-\delta_2)n^{\beta_0}\Big)^2 n^{\varepsilon}\right].$$
This proves part (a). Expanding the above, we get
$$= \exp\left[-\frac{1}{\sigma^2_*}\Big(\frac{d_1^2}{4}(1-\delta_1)^2 n^{\varepsilon}\Big)\right]\cdot\exp\left[-\frac{1}{\sigma^2_*}\Big(\frac{d_1d_2}{2}(1-\delta_1)(1-\delta_2)n^{\beta_0} + \frac{d_2^2}{4}(1-\delta_2)^2 n^{2\beta_0}\Big)n^{\varepsilon}\right],$$
which proves part (b), since the first factor is the upper bound for the lasso.
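A quick numerical reading of part (b): for illustrative (hypothetical) values σ_*² = 1, d₁ = d₂ = 1, δ₁ = δ₂ = 0.5, n = 100, ε = 0.2 and β₀ = 0.1, the extra factor contributed by the fused penalty can be evaluated directly, showing how much smaller the neighborhood-fused-lasso bound is than the plain lasso bound.

```python
import numpy as np

sigma2, d1, d2, delta1, delta2 = 1.0, 1.0, 1.0, 0.5, 0.5
n, eps, beta0 = 100, 0.2, 0.1
ne = n ** eps

lasso_bound = np.exp(-(d1 ** 2 / 4) * (1 - delta1) ** 2 * ne / sigma2)
extra = np.exp(-((d1 * d2 / 2) * (1 - delta1) * (1 - delta2) * n ** beta0
                 + (d2 ** 2 / 4) * (1 - delta2) ** 2 * n ** (2 * beta0)) * ne / sigma2)
nfl_bound = lasso_bound * extra        # part (b): strictly smaller than the lasso bound
print(lasso_bound, nfl_bound)
```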
Proof [Proof of lemma 3.6.1] Since $\theta^{\lambda,\mu}_a$ is the fused lasso minimizer, we get
$$\frac1n\|X_a - X^a\theta^{\lambda,\mu}_a\|^2 + \lambda\|\theta^{\lambda,\mu}_a\|_1 + \mu\|D_a\theta^{\lambda,\mu}_a\|_1 \le \frac1n\|X_a - X^a\theta^0_a\|^2 + \lambda\|\theta^0_a\|_1 + \mu\|D_a\theta^0_a\|_1.$$
Plugging in $X_a = X^a\theta^0_a + \varepsilon_a$, we get
$$\frac1n\|X^a(\theta^{\lambda,\mu}_a - \theta^0_a)\|^2 - \frac2n\varepsilon_a'X^a(\theta^{\lambda,\mu}_a - \theta^0_a) + \frac1n\|\varepsilon_a\|^2 + \lambda\|\theta^{\lambda,\mu}_a\|_1 + \mu\|D_a\theta^{\lambda,\mu}_a\|_1 \le \frac1n\|\varepsilon_a\|^2 + \lambda\|\theta^0_a\|_1 + \mu\|D_a\theta^0_a\|_1.$$
Rewriting the above inequality yields the desired lemma.
Proof [Proof of lemma 3.6.3] We have $E(\varepsilon_{a,i}X^a_{j,i}) = 0$ for all $i = 1, 2, \cdots, n$. Also,
$$\frac1n\sum_{i=1}^n E\big[\big|\varepsilon_{a,i}X^a_{j,i}\big|^k\big] = \frac1n\sum_{i=1}^n\big[E\big|\varepsilon^k_{a,i}\big|\,E\big|X^a_{j,i}\big|^k\big] = E\big|\varepsilon^k_{a,1}\big|\,E\big(\big|X^a_{j,1}\big|^k\big).$$
Using the fact that if $Z \sim N(0, \sigma^2)$ then $E(|Z|^k) = \sigma^k\cdot\frac{2^{k/2}\Gamma(\frac{k+1}{2})}{\sqrt{\pi}}$, the above equals
$$\left[\sigma_2^k\cdot\frac{2^{k/2}\Gamma(\frac{k+1}{2})}{\sqrt{\pi}}\right]\cdot\left[(\sigma_{jj})^{k/2}\cdot\frac{2^{k/2}\Gamma(\frac{k+1}{2})}{\sqrt{\pi}}\right] = \frac{(2\sigma_2\sqrt{\sigma_{jj}})^k}{\pi}\left[\Gamma\Big(\frac{k+1}{2}\Big)\right]^2 \le \frac{(2\sigma_2\sqrt{\sigma_{jj}})^k}{\pi}\,\frac{2^{2-k}}{k+1}\,\Gamma(k+1) = \frac{4\sigma_2^2\sigma_{jj}}{\pi(k+1)}\,(\sigma_2\sqrt{\sigma_{jj}})^{k-2}\,k!.$$
Under the assumption that all the population variances are 1, the above is $\le \frac{k!}{2}$.
We assumed $\sigma_2^2 = \sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ab}'$ and used the following facts:
• $\beta\big(\frac{k+1}{2}, \frac{k+1}{2}\big) = \frac{\Gamma(\frac{k+1}{2})^2}{\Gamma(k+1)}$
• $\beta(x, x) = 2^{1-2x}\beta\big(x, \frac12\big)$
• If $(a-1)(b-1) \le 0$, then $\beta(a, b) \le \frac{1}{ab}$
Therefore, using Bernstein's inequality, we get for $t > 0$
$$P\left(\frac1n\sum_{i=1}^n\varepsilon_{a,i}X^a_{j,i} \ge t + \sqrt{2t}\right) \le \exp(-nt).$$
Hence
$$P(\Lambda_a^c) = P\left[\max_{\substack{1\le j\le p\\ j\neq a}}\frac2n\big|\varepsilon_a'X^a_j\big| > \lambda_0 + B\mu_0\right] \le \sum_{\substack{j=1\\ j\neq a}}^{p} P\left[\Big|\frac2n\sum_{i=1}^n\varepsilon_{a,i}X^a_{j,i}\Big| > \lambda_0 + B\mu_0\right] = 2\sum_{\substack{j=1\\ j\neq a}}^{p} P\left[\frac1n\sum_{i=1}^n\varepsilon_{a,i}X^a_{j,i} > \frac{\lambda_0 + B\mu_0}{2}\right].$$
If we choose $t = \frac{\lambda_0+B\mu_0}{2} + 1 - \sqrt{\lambda_0 + B\mu_0 + 1}$, then $t + \sqrt{2t} = \frac{\lambda_0+B\mu_0}{2}$. Therefore, using the above result, we get
$$P(\Lambda_a^c) \le 2(p-1)\exp\left[-n\Big(\frac{\lambda_0+B\mu_0}{2} + 1 - \sqrt{\lambda_0+B\mu_0+1}\Big)\right] \le 2\exp\left[\log p - n\Big(\frac{\lambda_0+B\mu_0}{2} + 1 - \sqrt{\lambda_0+B\mu_0+1}\Big)\right] \le 2\exp\left[\log p - n\Big(\frac{\lambda_0}{2} + 1 - \sqrt{\lambda_0+B\mu_0+1}\Big)\right].$$
Alternatively, if we take $\lambda_0 = \frac{2(t+\log p)}{n}$ and $\mu_0 = \frac2B\sqrt{\frac{2(t+\log p)}{n}}$, we get
$$P(\Lambda_a^c) \le 2\exp(-t).$$
Proof [Proof of lemma 3.6.5] The proof exploits the fact that $\theta^{\lambda,\mu}_a$ is the minimizer of the penalized least squares criterion, by setting the sub-differential of the objective function equal to 0. We start by substituting the assumed linear truth, i.e., $X_a = X^a\theta^0_a + \varepsilon_a$. Therefore, we get
$$\frac{\partial}{\partial\theta_a}\left[\frac1n\|X^a\theta^0_a - X^a\theta_a\|^2 + \frac1n\|\varepsilon_a\|^2 + \frac2n\varepsilon_a'X^a(\theta^0_a - \theta_a) + \lambda\,\mathrm{sgn}(\theta_a) + \mu\big(D_a'\mathrm{sgn}(D_a\theta_a)\big)\right]_{\theta_a = \theta^{\lambda,\mu}_a} = 0$$
$$\Rightarrow\ \frac2n\big(X^{a\prime}X^a\theta^{\lambda,\mu}_a - X^{a\prime}X^a\theta^0_a\big) - \frac2nX^{a\prime}\varepsilon_a + \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) + \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big) = 0$$
$$\Rightarrow\ X^{a\prime}X^a\theta^{\lambda,\mu}_a = X^{a\prime}X^a\theta^0_a + \frac n2\Big(\frac2nX^{a\prime}\varepsilon_a\Big) - \frac{n\lambda}{2}\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \frac{n\mu}{2}\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)$$
$$\Rightarrow\ \theta^{\lambda,\mu}_a = \big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a + \frac n2\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big).$$
With the given choices for $\lambda$ and $\mu$, if we define $\Lambda_a = \Big\{\max_{1\le j\le p,\ j\neq a}\frac2n\big|\varepsilon_a'X^a_j\big| \le \lambda + B\mu\Big\}$, then we have, with probability $1 - \exp(-t^2)$,
$$\left|\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\right| \le (\lambda+B\mu)\mathbf{1}_p + \lambda\mathbf{1}_p + B\mu\mathbf{1}_p = 2(\lambda+B\mu)\mathbf{1}_p,$$
where the inequality is to be interpreted componentwise and $\mathbf{1}_p$ is a vector of size $p$ consisting of 1's. Hence, we get
$$\|\theta^{\lambda,\mu}_a\|_1 = \left\|\big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a + \frac n2\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big)\right\|_1$$
$$\le \left\|\big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a\right\|_1 + \frac n2\left\|\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big)\right\|_1$$
$$\le \sqrt{p}\left\|\big(X^{a\prime}X^a\big)^+X^{a\prime}X^a\theta^0_a\right\|_2 + \frac{n\sqrt{p}}{2}\left\|\big(X^{a\prime}X^a\big)^+\Big(\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\Big)\right\|_2.$$
Using the facts that $\|Ax\|_2 \le s_{\max}\|x\|_2$, where $s_{\max}$ is the maximum singular value of $A$, and that $A^+A$ is idempotent, and hence its singular values are either 0 or 1, the above is
$$\le \sqrt{p}\|\theta^0_a\|_2 + \frac{n\sqrt{p}}{2\delta_{\min}}\left\|\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\right\|_2$$
$$\le \sqrt{p}\|\theta^0_a\|_2 + \frac{np}{2\delta_{\min}}\left\|\frac2nX^{a\prime}\varepsilon_a - \lambda\,\mathrm{sgn}\big(\theta^{\lambda,\mu}_a\big) - \mu\big(D_a'\mathrm{sgn}(D_a\theta^{\lambda,\mu}_a)\big)\right\|_{\infty} \le \sqrt{p}\|\theta^0_a\|_2 + \frac{np(\lambda+B\mu)}{\delta_{\min}}.$$
Plugging in the result obtained from the previous corollary, we get
$$P\left(\frac1n\|X^a(\theta^{\lambda,\mu}_a - \theta^0_a)\|^2 \le 2(\lambda+B\mu)\|\theta^0_a\|_1 + \mu B\Big(\sqrt{p}\|\theta^0_a\|_2 + \frac{np(\lambda+B\mu)}{\delta_{\min}}\Big)\right) \ge 1 - \exp(-t^2)$$
$$\Rightarrow\ P\left(\frac1n\|X^a(\theta^{\lambda,\mu}_a - \theta^0_a)\|^2 \le 2\Big(\lambda + B\mu\Big(1 + \frac{\sqrt{p}}{2}\Big)\Big)\|\theta^0_a\|_1 + \frac{npB\mu(\lambda+B\mu)}{\delta_{\min}}\right) \ge 1 - \exp(-t^2).$$
Proof [Proof of lemma 3.6.6] To start with, we derive a series of inequalities and apply them to the basic inequality. We have
$$\|\theta^{\lambda,\mu}_a\|_1 = \|\theta^{\lambda,\mu}_{a,S_0}\|_1 + \|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 \ge \|\theta^0_{a,S_0}\|_1 - \|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \|\theta^{\lambda,\mu}_{a,S_0^c}\|_1,$$
$$\|\theta^{\lambda,\mu}_a - \theta^0_a\|_1 = \|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \|\theta^{\lambda,\mu}_{a,S_0^c}\|_1.$$
We write $\theta^{\lambda,\mu}_{a,S_0} = \big(\theta^{\lambda,\mu}_{a,S_0,1}, 0\big)'$ and $\theta^{\lambda,\mu}_{a,S_0^c} = \big(0, \theta^{\lambda,\mu}_{a,S_0^c,1}\big)'$, so that $\theta^{\lambda,\mu}_a = \big(\theta^{\lambda,\mu}_{a,S_0,1}, \theta^{\lambda,\mu}_{a,S_0^c,1}\big)'$. Similarly we write $\theta^0_a = \theta^0_{a,S_0} = \big(\theta^0_{a,S_0,1}, 0\big)'$. Hence, we get
$$\|D_a\theta^{\lambda,\mu}_a\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0} & 0\\ D^a_{S_0,0} & D^a_{0,S_0^c}\\ 0 & D^a_{S_0^c,S_0^c}\end{pmatrix}\begin{pmatrix}\theta^{\lambda,\mu}_{a,S_0,1}\\ \theta^{\lambda,\mu}_{a,S_0^c,1}\end{pmatrix}\right\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0}\theta^{\lambda,\mu}_{a,S_0,1}\\ D^a_{S_0,0}\theta^{\lambda,\mu}_{a,S_0,1} + D^a_{0,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\\ D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\end{pmatrix}\right\|_1$$
$$\ge \|D^a_{S_0,S_0}\theta^{\lambda,\mu}_{a,S_0,1}\|_1 + \|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 + \|D^a_{S_0,0}\theta^{\lambda,\mu}_{a,S_0,1}\|_1 - \|D^a_{0,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1$$
$$\ge \|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 - \|D^a_{S_0,S_0}\big(\theta^{\lambda,\mu}_{a,S_0,1} - \theta^0_{a,S_0,1}\big)\|_1 + \|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 + \|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1 - \|D^a_{S_0,0}\big(\theta^{\lambda,\mu}_{a,S_0,1} - \theta^0_{a,S_0,1}\big)\|_1 - \|D^a_{0,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1$$
$$\ge \|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1 - 2B\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 - B\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1.$$
Similarly, we get
$$\|D_a\theta^0_a\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0} & 0\\ D^a_{S_0,0} & D^a_{0,S_0^c}\\ 0 & D^a_{S_0^c,S_0^c}\end{pmatrix}\begin{pmatrix}\theta^0_{a,S_0,1}\\ 0\end{pmatrix}\right\|_1 = \left\|\begin{pmatrix} D^a_{S_0,S_0}\theta^0_{a,S_0,1}\\ D^a_{S_0,0}\theta^0_{a,S_0,1}\\ 0\end{pmatrix}\right\|_1 = \|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1.$$
Plugging into the basic inequality, we get on $\Lambda_a$, with $\lambda \ge 2\lambda_0$ and $\mu \ge 2\mu_0$,
$$\frac1n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + \lambda\|\theta^0_{a,S_0}\|_1 - \lambda\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \lambda\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 + \mu\|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \mu\|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1 - 2B\mu\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \mu\|D^a_{S_0^c,S_0^c}\theta^{\lambda,\mu}_{a,S_0^c,1}\|_1 - B\mu\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1$$
$$\le \frac{\lambda+B\mu}{2}\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + \frac{\lambda+B\mu}{2}\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 + \lambda\|\theta^0_{a,S_0}\|_1 + \mu\|D^a_{S_0,S_0}\theta^0_{a,S_0,1}\|_1 + \mu\|D^a_{S_0,0}\theta^0_{a,S_0,1}\|_1.$$
From this, we get
$$\frac2n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1 \le (3\lambda + 5B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1.$$
Proof [Proof of lemma 3.6.8] The proof is a continuation of what we have already shown. We have
$$\frac2n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_a - \theta^0_a\|_1 = \frac2n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0^c}\|_1$$
$$\le (3\lambda + 5B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 + (\lambda - 3B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1 = 2(2\lambda + B\mu)\|\theta^{\lambda,\mu}_{a,S_0} - \theta^0_{a,S_0}\|_1$$
$$\le \frac{2\sqrt{s_0}(2\lambda + B\mu)}{\sqrt{n}\,\phi_{0,a}}\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\| \le \frac{s_0(2\lambda + B\mu)^2}{\phi^2_{0,a}} + \frac1n\|X^a\big(\theta^{\lambda,\mu}_a - \theta^0_a\big)\|^2.$$
Chapter 4
Smoothing of Diffusion Tensor
MRI Images∗
4.1 Introduction
Diffusion-weighted magnetic resonance imaging (MRI) has emerged as an important field of medical research in general, and of neuroscience in particular. MRI technology boosted clinical neuro-diagnostics and gave rise to new applications for studying the anatomy of the human brain in vivo and in a non-invasive manner. Modern clinical applications of diffusion tensor MRI are primarily based on contrast studies: for example, contrast in relaxation times for T1- or T2-weighted MR imaging, in time of flight for MR angiography, in blood oxygen level dependency for functional MR imaging, and in diffusion for apparent diffusion coefficient (ADC) imaging[41]. More advanced applications of diffusion-weighted MRI involve finding connectivity in human tissues where isotropic water movement is apparent. However, DW-MRI has been successfully applied to the anisotropic diffusion of water molecules owing to the internal fiber structure inside a tissue. The theoretical understanding of diffusion MRI has not developed much over the years; however, sophisticated data acquisition tools are available. Nevertheless, it is important for a medical practitioner to understand the basic principles of these tools to some degree. Diffusion tensor imaging is immensely successful in detecting neurological disorders because it can reveal abnormalities in white matter fiber structures and help build models for neural connectivity in the brain. The ability to visualize these connections has led neuroscientists to undertake the so-called Human Brain Connectome project[42].
∗This is a collaborative work with Prof. Owen Carmichael, Prof. Debashis Paul and Prof. Jie Peng.
4.2 Principles of Diffusion MRI
With its congregation of billions of neurons, the human brain is one of the most complex natural systems. Studying and identifying crucial biological phenomena in the human brain depends heavily on proper imaging technology. One common imaging technique, usually applied to animals, is histology followed by examination with light or electron microscopy. It is commonly performed by examining cells and tissues after sectioning and staining. Staining highlights the locations of proteins and genes of interest, and electron microscopy helps us analyze them at the molecular level. However, histology, being invasive and labor-intensive, is not suitable for analyzing the human brain. MRI, on the other hand, is noninvasive and quick to perform, and produces a condensed three-dimensional image of the brain. While compressed anatomical information is beneficial, a lot of biological detail is lost, which results in reduced specificity of the obtained information. Thus, going beyond conventional MRI technology to capture more detailed anatomy has been one of the primary objectives of the MRI research community. In an effort to achieve this goal, Diffusion Tensor MRI was introduced in the mid '90s[48]. Unlike conventional MRI, it can delineate the axonal organization of the brain. See [43-47] for applications of DT-MRI in animal studies.
4.2.1 Shortcomings of Conventional MRI and Contrast Generation
There are some problems with conventional MRI technology that make it a poor candidate for anatomy learning in complex tissues. Among these, the most critical ones are spatial resolution, contrast and, often, the size of the data. Theoretically, the MR image resolution should be approximately 10 µm because of the movement of water molecules during the exposure time[49]. However, this resolution is hard to achieve in practice because the obtained signal is hard to distinguish from noise, and hence a long exposure is required to detect weak signals. For in vivo studies, exposure time is typically short, which limits the workable resolution. For ex vivo studies, however, this does not pose a problem. Still, the data obtained from an MRI study are, more often than not, so voluminous that even with cutting-edge computing technology they can be hard to deal with.
A crucial drawback of conventional MR imaging is its lack of contrast. Two different anatomical regions having water molecules with similar properties will generate exactly the same image value, thereby failing to produce meaningful contrast. One way around this is to build contrast from physical properties of the water molecules: the proton density ($p_0$), the $T_1$ and $T_2$ relaxation times and the diffusion coefficient ($D$). A mathematical relationship between these parameters and the measured intensity is given by
$$S = p_0\big(1 - e^{-c/T_1}\big)e^{-d/T_2}e^{-bD}, \qquad (4.2.1)$$
where $b$, $c$ and $d$ are certain imaging constants that may depend on environmental factors like viscosity and the existence of macromolecules[49]. In this chapter, the diffusion factor will be our main concern; we will typically assume that the remaining factors stay constant during experiments.
4.2.2 Water Diffusion and Its Importance in MR Imaging
Diffusion is the random translational thermal motion of water molecules. Diffusion tensor imaging of a live (or dead, but not frozen) brain is a probe into its anatomical structure based on the movement of water molecules diffusing along the paths of the neuron bundles. Following [49], an interesting analogy is the shape of a spreading ink blot on a wet paper or cloth. The shape of the ink stain is regulated by the underlying fiber structure of the paper or cloth. If the stain is circular, the diffusion is said to be isotropic, whereas an elongated, pear-shaped stain is indicative of anisotropic diffusion. In DTI, this anisotropy is used to find the alignment of axonal bundles in the brain. When we characterize anisotropic diffusion, an entirely new image contrast is generated which is based on structural orientation[50-52].
A magnetic field is applied along a gradient and a diffusion-weighted image is obtained. However, this measurement is only along the gradient axis, so it is not possible to recover the three-dimensional axon bundle structure by applying the magnetic field in one direction. That is why multiple directions are used, and the measured intensity is related to the three-dimensional diffusion tensor matrix $D$ by the equation
$$S = S_0 e^{-bq^TDq}, \qquad (4.2.2)$$
where $S_0$ is a constant and $q$ is the applied gradient direction. See [48] for more details. In the diffusion tensor framework, a 3D ellipsoid is constructed which represents the average diffusion distance along each coordinate axis. The axis lengths of the ellipsoid denote the diffusion magnitudes and are given by the eigenvalues of the diffusion tensor matrix $D$; the eigenvectors determine their orientations.
A measure of anisotropy is given by quantifying the relative differences in the eigenvalues $\lambda_1$, $\lambda_2$ and $\lambda_3$. A popular metric is the fractional anisotropy, given by
$$FA := \sqrt{\frac{(\lambda_1-\lambda_2)^2 + (\lambda_2-\lambda_3)^2 + (\lambda_3-\lambda_1)^2}{2(\lambda_1^2 + \lambda_2^2 + \lambda_3^2)}}.$$
For isotropic tensors, $\lambda_1 = \lambda_2 = \lambda_3$ and hence $FA = 0$; for a highly anisotropic tensor, $FA$ is close to 1. There are a number of ways to represent the tensor information graphically. One can represent the FA values as a grayscale image, but that does not carry the directionality information. An alternative is to represent the FA values in a color-coded format where the colors indicate the directions of the tensors.
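A small sketch of the fractional anisotropy computation, applied to two hypothetical tensors, illustrates the two extremes just described.

```python
import numpy as np

def fractional_anisotropy(D):
    """FA of a 3x3 symmetric diffusion tensor, per the formula above."""
    lam = np.linalg.eigvalsh(D)                                  # eigenvalues
    num = ((lam[0] - lam[1]) ** 2 + (lam[1] - lam[2]) ** 2
           + (lam[2] - lam[0]) ** 2)
    den = 2.0 * np.sum(lam ** 2)
    return np.sqrt(num / den)

print(fractional_anisotropy(np.eye(3)))                          # isotropic: FA = 0
print(fractional_anisotropy(np.diag([1.0, 0.05, 0.05])))         # elongated: FA close to 1
```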
4.3 Tensor Smoothing
Removing noise from diffusion tensor images is quite challenging[53-56]. Tracing anatomical structures is extremely difficult in the presence of noise, which is of a complex Rician nature[53]. Among the different approaches for noise removal from diffusion-weighted images, smoothing with the help of Karcher means[57-59] is the most popular; this is equivalent to kernel smoothing in the tensor field. Yuan et al.[60] extended the weighted Karcher-mean approach to local polynomial smoothing in the tensor space with respect to different geometries. Carmichael et al.[61] studied the Karcher mean approach and analyzed the performance of linear and nonlinear estimators during regression, followed by a demonstration of the effect of Euclidean and geometric smoothers under various local structures of the underlying tensor field and different noise levels. They found that under certain interactions of the tensor field geometry and the noise level, Euclidean smoothers work better than geometric smoothers, contrary to popular belief. In the following, we introduce some notation and describe the two-stage smoothing discussed above.
4.3.1 Two-Stage Smoothing
The problem is to estimate the diffusion tensor $D$ based on the observed DWI intensities $\{S_q : q \in Q\}$. We denote $d = \mathrm{vec}(D) = (D_{11}, D_{22}, D_{33}, D_{12}, D_{13}, D_{23})^T$ and $x_q = (q_1^2, q_2^2, q_3^2, 2q_1q_2, 2q_1q_3, 2q_2q_3)$. Also, following [61], we assume that the choice of gradient directions ensures that $\sum_{q\in Q}x_qx_q^T$ is well-conditioned. The linear regression estimator is given by
$$D_{LS} := \mathrm{argmin}_{d\in\mathbb{R}^6}\sum_{q\in Q}\big(\log S_q - \log S_0 + x_q^Td\big)^2$$
and the nonlinear regression estimator is given by
$$D_{NL} := \mathrm{argmin}_{d\in\mathbb{R}^6}\sum_{q\in Q}\big(S_q - S_0\exp(-x_q^Td)\big)^2.$$
Even though the positive definiteness constraint is not imposed, it has been found that most of the time the resulting solution is positive definite; if not, one can set the negative eigenvalues to 0. The nonlinear regression problem can be solved using the Levenberg-Marquardt method. It has been proved in the literature that $D_{NL}$ is better than $D_{LS}$ in terms of bias[62] and asymptotic efficiency[61] as the noise variance goes to 0.
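A minimal sketch of the log-domain (linear) fit is given below; the b-value is assumed to be absorbed into the design vectors, and the symmetric tensor is rebuilt from the fitted vector d = (D11, D22, D33, D12, D13, D23).

```python
import numpy as np

def design_row(q):
    """x_q = (q1^2, q2^2, q3^2, 2q1q2, 2q1q3, 2q2q3) for a gradient direction q."""
    q1, q2, q3 = q
    return np.array([q1 * q1, q2 * q2, q3 * q3,
                     2 * q1 * q2, 2 * q1 * q3, 2 * q2 * q3])

def fit_tensor_ls(Q, S, S0):
    """Least-squares solution of sum_q (log S_q - log S_0 + x_q^T d)^2."""
    X = np.array([design_row(q) for q in Q])
    y = np.log(S0) - np.log(np.asarray(S))          # so that X d ≈ y
    d, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.array([[d[0], d[3], d[4]],
                     [d[3], d[1], d[5]],
                     [d[4], d[5], d[2]]])
```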
After the initial estimates are obtained, the smoothing is carried out on the space of $3\times3$ positive definite matrices, denoted by $\mathcal{P}_3$. In Euclidean space, suppose $f$ is a function from $\mathbb{R}^k$ to $\mathbb{R}^p$ and we have observations $\{(s_i, X_i)\}_{i=1}^n$, where $s_i$ denotes the position and $X_i$ the noise-corrupted observation of $f(s_i)$. A standard way of smoothing is to take the weighted average
$$\widehat{f}_E(s) = \frac{\sum_{i=1}^n w_i(s)X_i}{\sum_{i=1}^n w_i(s)}. \qquad (4.3.1)$$
The weights $w_i(s)$ are usually given by a nonnegative, integrable kernel $K_h(\cdot)$, where $h > 0$ denotes the bandwidth, which determines the size of the local neighborhood over which the smoothing is carried out. To be specific,
$$w_i(s) := K_h(\|s_i - s\|),\quad i = 1, \cdots, n.$$
Another way to interpret the solution is to view it as the minimizer of the weighted criterion
$$\widehat{f}(s) := \mathrm{argmin}_{c\in\mathcal{P}}\sum_{i=1}^n w_i(s)\,d^2_{\mathcal{P}}(X_i, c). \qquad (4.3.2)$$
Kernel smoothing thus takes the form of a weighted Karcher mean of the $X_i$'s[63]. From 4.3.2, it should be clear that different distance metrics on the manifold $\mathcal{P}$ may lead to different kernel smoothers. Imposing a Euclidean metric, we get the previous solution. Among other metrics, some popular choices are the log-Euclidean and the affine invariant metric.
The log-Euclidean metric is defined as
$$d_{LE}(X, Y) := \|\log X - \log Y\|,$$
where $\log$ denotes the matrix logarithm. With this distance metric, the solution to 4.3.2 is given by
$$\widehat{f}_{LE}(s) = \exp\left(\frac{\sum_{i=1}^n w_i(s)\log(X_i)}{\sum_{i=1}^n w_i(s)}\right). \qquad (4.3.3)$$
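A compact sketch of the two smoothers (4.3.1) and (4.3.3) with a Gaussian kernel is given below; the matrix logarithm and exponential are computed through an eigendecomposition, which is valid for symmetric positive definite inputs.

```python
import numpy as np

def spd_log(X):
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def spd_exp(X):
    w, V = np.linalg.eigh(X)
    return (V * np.exp(w)) @ V.T

def gaussian_weights(positions, s, h):
    d = np.linalg.norm(positions - s, axis=1)
    return np.exp(-0.5 * (d / h) ** 2)

def euclidean_smooth(positions, tensors, s, h):          # equation (4.3.1)
    w = gaussian_weights(positions, s, h)
    return np.tensordot(w, tensors, axes=1) / w.sum()

def log_euclidean_smooth(positions, tensors, s, h):      # equation (4.3.3)
    w = gaussian_weights(positions, s, h)
    logs = np.array([spd_log(X) for X in tensors])
    return spd_exp(np.tensordot(w, logs, axes=1) / w.sum())
```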
The affine invariant metric can be defined on the space of positive definite matrices, as they constitute a naturally reductive homogeneous space[64]. It is defined as
$$d_{Aff}(X, Y) := \Big[\mathrm{tr}\Big\{\big(\log\big(X^{-1/2}YX^{-1/2}\big)\big)^2\Big\}\Big]^{\frac12}.$$
The solution to 4.3.2 under $d_{Aff}$ is not available in closed form; gradient descent algorithms are usually used to compute it. In the literature, the Euclidean method is often criticized for its swelling effect, where the determinant of the average tensor is larger than the determinants of the tensors being averaged. However, in terms of estimation accuracy there is no clear winner in all circumstances: the performance depends heavily on the noise level and the local geometric structures.
4.3.2 Locally Constant Smoothing with Conjugate Gradient
Descent Algorithm
In this section, we reformulate the smoothing problem. Instead of the two-stage procedure (regression followed by smoothing), we consider an appropriately weighted objective function to minimize. In order to carry out the optimization, we apply a geometric version of the conjugate gradient descent algorithm. The principal idea is to introduce the weights into the objective function itself and to find the optimal solution under the positive definiteness constraint.
To be specific, we want to find the minimizer $D$ of
$$f_1(D) = \sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big[\log S_{j,i} - \log\big(S_0\exp(-br_i^TDr_i)\big)\big]^2 = \sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big[\log S_{j,i} - \log S_0 + br_i^TDr_i\big]^2, \qquad (4.3.4)$$
where $b > 0$ is given, $r_i \in \mathbb{R}^3$ are given vectors (gradient directions), $S_{j,i}$ is the signal intensity of the $j$th voxel in the $i$th gradient direction and $w_{j,i}$ is the weight for the $j$th voxel in the $i$th gradient direction. Without the log-transform, the objective function would be
$$f_2(D) = \sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big[S_{j,i} - S_0\exp\big(-br_i^TDr_i\big)\big]^2. \qquad (4.3.5)$$
Depending on the choice of objective function, the optimization problem can be formulated as
$$D_i = \mathrm{argmin}_{D\in\mathcal{P}}\,f_i(D)\quad\text{for } i = 1, 2.$$
The corresponding solution $D_1$ or $D_2$ is then the smoothed estimator of the underlying tensor. However, the optimization is carried out with the additional constraint of positive definiteness, which requires incorporating the geometry of this Riemannian surface in order to develop a conjugate gradient algorithm on it. For this, we need to compute the gradient and Hessian of the objective functions, which are given below.
$$\nabla f_1(D) = 2b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\big(\log S_{j,i} - \log S_0 + br_i^TDr_i\big)\,Dr_ir_i^TD \qquad (4.3.6)$$
$$\mathrm{Hess}(f_1)(X, Y) = 2b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\Big[\big(\log S_{j,i} - \log S_0 + br_i^TDr_i\big)r_i^TXD^{-1}Yr_i + b\big(r_i^TXr_i\big)\big(r_i^TYr_i\big)\Big] \qquad (4.3.7)$$
$$\nabla f_2(D) = 2S_0b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\exp\big(-br_i^TDr_i\big)\big(S_{j,i} - S_0\exp(-br_i^TDr_i)\big)\,Dr_ir_i^TD \qquad (4.3.8)$$
$$\mathrm{Hess}(f_2)(X, X) = 2S_0b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\exp\big(-br_i^TDr_i\big)\Big[\big(S_{j,i} - S_0\exp(-br_i^TDr_i)\big)r_i^TXD^{-1}Xr_i - b\big(S_{j,i} - 2S_0\exp(-br_i^TDr_i)\big)\big(r_i^TXr_i\big)^2\Big] \qquad (4.3.9)$$
$$\mathrm{Hess}(f_2)(X, Y) = S_0b\sum_{j=1}^n\sum_{i=1}^m w_{j,i}\exp\big(-br_i^TDr_i\big)\Big[\big(S_{j,i} - S_0\exp(-br_i^TDr_i)\big)r_i^T\big(XD^{-1}Y + YD^{-1}X\big)r_i - 2b\big(S_{j,i} - 2S_0\exp(-br_i^TDr_i)\big)\big(r_i^TXr_i\big)\big(r_i^TYr_i\big)\Big] \qquad (4.3.10)$$
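As a sketch, the weighted log-domain objective $f_1$ in (4.3.4) and the gradient expression (4.3.6) can be evaluated as follows; here S is an (n, m) array of voxel-by-gradient intensities, R the (m, 3) array of gradient directions and W the matching weight array.

```python
import numpy as np

def f1(D, S, S0, R, W, b):
    quad = b * np.einsum('ij,jk,ik->i', R, D, R)     # b * r_i^T D r_i, one value per gradient
    resid = np.log(S) - np.log(S0) + quad            # broadcasts to shape (n, m)
    return np.sum(W * resid ** 2)

def grad_f1(D, S, S0, R, W, b):
    quad = b * np.einsum('ij,jk,ik->i', R, D, R)
    resid = np.log(S) - np.log(S0) + quad
    G = np.zeros((3, 3))
    for i, r in enumerate(R):                        # accumulate the D r_i r_i^T D terms
        Dr = D @ r
        G += np.sum(W[:, i] * resid[:, i]) * np.outer(Dr, Dr)
    return 2.0 * b * G
```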
With these notations, we devised the conjugate gradient algorithm, which is given
in Algorithm 1.
4.3.3 Experimental Results
We emulate the setup of Carmichael et al.[61]. We construct a simulated tensor field on a 128 x 128 x 4 three-dimensional grid consisting of background regions with identical isotropic tensors and banded regions with three parallel vertical bands and three horizontal bands (for each of the four slices), where within each band the tensors are identical and aligned in either the X or the Y direction. The bands are of various widths and degrees of anisotropy. The purpose is to compare empirically the relative performance of the two-stage smoothers and our smoother.
The underlying truth is smooth. We compare the Frobenius distance of our estimated tensor from the true tensor. We chose a small 10 x 10 region for smoothing purposes. We compare the following six methods:
1. Linear regression with no smoothing.
2. Nonlinear regression with no smoothing.
3. Anisotropic smoothing with log-transformed data.
4. Anisotropic smoothing with original data.
5. Two stage anisotropic smoothing w.r.t. Euclidean distance.
6. Two stage anisotropic smoothing w.r.t. log-Euclidean distance.
Data: Riemannian manifold $\mathcal{P}$: the space of positive definite matrices.
Vector transport $\mathcal{T}$: $\mathcal{T}_{\alpha\eta,D}(\eta) := D^{1/2}\exp\big(\alpha D^{-1/2}\eta D^{-1/2}\big)D^{-1/2}\eta$.
Retraction map $R$: $R_D(tX) := D^{1/2}\exp\big(tD^{-1/2}XD^{-1/2}\big)D^{1/2}$.
Objective function: $f_i(D)$, $i = 1, 2$. Gradient of $f$: $\nabla f_i(D)$. Armijo parameters $\sigma$, $\alpha$ and $\beta$.
Result: a local minimizer of $f$ found by the conjugate gradient algorithm.
Initialize: initial iterate $x_0 \in \mathcal{P}$, taken to be the identity matrix. Set $\eta_0 = -\nabla f(x_0)$.
for $k = 0, 1, 2, \cdots$ do
    for $m = 1, 2, \cdots, 20$ do
        Compute the step size $s = \alpha\beta^m$.
        Compute $x_{k+1} = R_{x_k}(s\eta_k)$.
        Compute LHS $= f(x_k) - f(x_{k+1})$ and RHS $= \sigma s\langle\nabla f(x_k), \nabla f(x_k)\rangle_{x_k}$.
        if LHS > RHS then break the inner loop.
    end
    Compute $\beta_{k+1} = \langle\nabla f(x_{k+1}), \nabla f(x_{k+1})\rangle_{x_{k+1}} / \langle\nabla f(x_k), \nabla f(x_k)\rangle_{x_k}$.
    Set $\eta_{k+1} = -\nabla f(x_{k+1}) + \beta_{k+1}\mathcal{T}_{s\eta_k,x_k}(\eta_k)$.
end
Algorithm 1: Conjugate gradient on positive definite matrices for a certain noise level
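The geometric ingredients of Algorithm 1 can be sketched in a few lines; the retraction below follows the map R_D stated in the algorithm, while the Armijo test uses the Euclidean (trace) inner product as a simplifying assumption in place of the Riemannian one.

```python
import numpy as np

def _eig_fun(D, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh((D + D.T) / 2.0)
    return (V * fun(w)) @ V.T

def retract(D, X, t):
    """R_D(tX) = D^{1/2} exp(t D^{-1/2} X D^{-1/2}) D^{1/2}; keeps the iterate positive definite."""
    Dh = _eig_fun(D, np.sqrt)
    Dih = _eig_fun(D, lambda w: 1.0 / np.sqrt(w))
    return Dh @ _eig_fun(t * Dih @ X @ Dih, np.exp) @ Dh

def armijo_step(D, eta, f, grad, sigma=1e-4, alpha=1.0, beta=0.5, max_m=20):
    """One backtracking line search along eta, mirroring the inner loop of Algorithm 1."""
    g = grad(D)
    for m in range(1, max_m + 1):
        s = alpha * beta ** m
        D_new = retract(D, eta, s)
        if f(D) - f(D_new) > sigma * s * np.sum(g * g):   # Armijo sufficient-decrease test
            return D_new, s
    return D, 0.0
```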
The Frobenius distances are then plotted against the corresponding FA values. The graphs for the low (1%), medium (5%) and high (10%) noise levels are shown in figures 4.3.1, 4.3.2 and 4.3.3.
4.4 Discussion
From the simulations, we observe that at low noise levels our method with log-transformed data, the Euclidean smoother and the geometric smoother perform significantly better than the rest. This may be attributed to the fact that at low noise levels the geometry of the space influences the smoothing process, so smoothers that respect the geometry turn out to be better. The improvement diminishes as the noise level increases, but the geometric methods still perform better than the rest, especially in more anisotropic regions and at high noise levels. If the local heterogeneity of the tensor field is the dominant component of variation of the tensors, and the noiseless tensors follow a certain geometric regularity, then the geometric smoothers lead to smaller bias. As seen in [61], when sensor noise dominates the geometry, the Euclidean smoother performs better than the geometric smoother. However, our method sides with the winner most of the time, no matter which method is better.
[Plot: Frobenius error versus FA at the low noise level for Lin. Reg., Nonlin. Reg., Aniso. Sm., Nonlin Aniso. Sm., Two Step Euclidean and Two Step Log Euclidean.]
Figure 4.3.1: Comparison of smoothers: Low noise level
[Plot: Frobenius error versus FA at the medium noise level for the same six methods.]
Figure 4.3.2: Comparison of smoothers: Medium noise level
[Plot: Frobenius error versus FA at the high noise level for the same six methods.]
Figure 4.3.3: Comparison of smoothers: High noise level
References
[1] Koller, Daphne and Friedman, Nir (2009). “Probabilistic Graphical Models,
Principles and Techniques”. Cambridge, Massachusetts: The MIT Press.
[2] Friedman, Nir (2004). “Inferring Cellular Networks Using Probabilistic Graphi-
cal Models.”, Science, 6 February 2004, Vol. 303 no. 5659 pp. 799-805.
[3] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, G. M. Church, Nature
Genet. 22, 281 (1999).
[4] E. Segal, B. Taskar, A. Gasch, N. Friedman, D. Koller, Bioinformatics 17 (suppl.
1), S243 (2001).
[5] Wainwright, Martin J., Jordan, Michael I. (2008), “Graphical Models, Exponen-
tial Families, and Variational Inference”, Foundations and Trends in Machine
Learning, Vol. 1, Nos. 1-2 pp 1-305.
[6] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison (1998), eds., Biological Sequence
Analysis. Cambridge, UK: Cambridge University Press.
[7] Gales, Mark and Young, Steve (2007), “The Application of Hidden Markov
Models in Speech Recognition”, Foundations and Trends in Signal Processing,
Vol. 1, No. 3 pp 195-304.
[8] Malinici, Iulian Pruteanu and Carin, Lawrence, “Infinite Hidden Markov Models
for Unusual-Event Detection in Video”, IEEE Transactions in Image Processing,
Vol. 17, No. 5, May 2008.
[9] Aksoy, Selim, “Probabilistic Graphical Models Part III: Example Appli-
cations”, available at http://www.cs.bilkent.edu.tr/~saksoy/courses/
cs551/slides/cs551_pgm3.pdf.
[10] Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S. and Saul,
Lawrence K. “An Introduction to Variational Methods”. In Jordan, Michael I.,
“Learning in Graphical Models”. Cambridge, Massachusetts: The MIT Press.
[11] Lauritzen, Stephen L. (1996), “Graphical Models”. Oxford Statistical Science
Series: Clarendon Press, Oxford.
[12] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, “Introduction to Algorithms”.
Cambridge, MA: MIT Press, 1990.
[13] Hammersley, J. M., Clifford, P. (1971), “Markov Fields on Finite Graphs and
Lattices”, unpublished, available at http://www.statslab.cam.ac.uk/~grg/
books/hammfest/hamm-cliff.pdf
[14] Grimmett, G. R. (1973), “A Theorem about Random Fields”, Bulletin of the
London Mathematical Society 5 (1): 81-84.
[15] Preston, C. J. (1973), “Generalized Gibbs States and Markov Random Fields”,
Advances in Applied Probability 5 (2): 242-261.
[16] Sherman, S. (1973), “Markov Random Fields and Gibbs Random Fields”, Israel
Journal of Mathematics 14 (1): 92-103.
[17] Besag, J. (1974), “Spatial Interaction and the Statistical Analysis of Lattice
Systems”, Journal of the Royal Statistical Society, Series B 36 (2): 192-236.
[18] Ising, E. (1925), “Beitrag zur Theorie des Ferromagnetismus”, Z. Phys. 31:
253-258.
[19] Murphy, Kevin Patrick (2012), “Machine Learning: A Probabilistic Perspec-
tive”. The MIT Press.
[20] Cipra, Barry A., “The Ising Model is NP-Complete”, SIAM News, Volume 33,
Number 6.
[21] The Structure of Proteins, available at http://www.vivo.colostate.edu/
hbooks/genetics/biotech/basics/prostruct.html.
[22] Razavian, Narges Sharif, Moitra, Subhodeep, Kamisetty, Hetunandan, Ra-
manathan, Arvind, and Langmead, Christopher J. (2010), “Time- Varying Gaus-
sian Graphical Models of Molecular Dynamics Data” . Computer Science De-
partment. Paper 1061.
[23] Honorio, Jean, Ortiz, Luis, Samaras, Dimitris, Paragios, Nikos, Goldstein, Rita
(2009), “Sparse and Locally Constant Gaussian Graphical Models”, Advances in
Neural Information Processing Systems 22.
[24] Mansinghka, V. K., Kemp, C., Tenenbaum, J. B., Griffiths, T. L. (2006), “Struc-
tured Priors for Structure Learning”, Uncertainty in Artificial Intelligence.
[25] Shwe, M., Middleton, B., Heckerman, D., Henrion, M., Horvitz, E., Lehmann,
H. and Cooper, G. (1991), “Probabilistic diagnosis using a reformulation of the
INTERNIST- 1/QMR knowledge base I. - The probabilistic model and inference
algorithms”, Methods of Information in Medicine, 30:241 - 255, 1991.
[26] Pe'er, D., Tanay, A. and Regev, A. (2006), “MinReg: A Scalable Algorithm for
Learning Parsimonious Regulatory Networks in Yeast and Mammals”. Journal
of Machine Learning Research, 7:167-189.
[27] Goldstein,Rita Z., Alia-Klein,Nelly, Tomasi,Dardo, Zhang,Lei, Cottone,Lisa
A., Maloney,Thomas, Telang,Frank, Caparelli,Elizabeth C., Chang,Linda,
Ernst,Thomas, Samaras,Dimitris, Squires, Nancy K., Volkow, Nora D. (2007),
“Decreased prefrontal cortical sensitivity to monetary reward is associated with
impaired motivation and self-control in cocaine addiction”, The American Jour-
nal of Psychiatry. Jan 2007; 164(1): 43-51.
[28] Barndorff-Nielsen, O.E. (1978), “Information and Exponential Families”, Chich-
ester, UK: Wiley.
[29] Brown, L.D. (1986), “Fundamentals of Statistical Exponential Families”, Hay-
ward, CA: Institute of Mathematical Statistics.
[30] Efron, B. (1978), “The Geometry of Exponential Families”, Annals of Statistics,
Vol. 6, pp. 362-376.
[31] Borwein, J. and Lewis, A. (1999), “Convex Analysis”. New York: Springer-
Verlag.
[32] Hirriart-Urruty, J. and Lemarechal, C. (1993), “Convex Analysis and Minimiza-
tion Algorithms”, Vol. 1, New York: Springer-Verlag.
[33] Rockafellar, G. (1970), “Convex Analysis”, Princeton, NJ: Princeton University
Press.
[34] Jaynes, E.T. (1957), “Information Theory and Statistical Mechanics”, Physical
Review, Vol 106, pp. 620-630.
[35] Wu, N. (1997), “The Maximum Entropy Method”, New York: Springer.
[36] Buhl, Søren L. (1993), “On the Existence of Maximum Likelihood Estimators
for Graphical Gaussian Models”, Scandinavian Journal of Statistics, Vol. 20, No.
3 , pp. 263-270
[37] Uhler, Caroline (2012), “Geometry of Maximum Likelihood Estimation in Gaus-
sian Graphical Models”, The Annals of Statistics, Vol. 40, No. 1, 238-261.
[38] Barrett, W., Johnson, C. R. and Tarazaga, P. (1993). “The real positive def-
inite completion problem for a simple cycle”. Linear Algebra Appl., Vol. 192, pp.
3-31.
[39] Barrett, W. W., Johnson, C. R. and Loewy, R. (1996). “The real positive definite
completion problem: Cycle completability”. Mem. Amer. Math. Soc., Vol.122,
viii+69 pp.
[40] Sturmfels, B. and Uhler, C. (2010). “Multivariate Gaussian, semidefinite matrix
completion, and convex algebraic geometry”. Ann. Inst. Statist. Math, Vol. 62,
pp 603-638.
[41] Hagmann, P. , Fonasson, L., Maeder, P., Thiran, F., Wedeen, V., Meuli,
R. (2006). “Understanding Diffusion MR Imaging Techniques: From Scalar
Diffusion-weighted Imaging to Diffusion Tensor Imaging and Beyond”. Radio-
Graphics, Vol. 26: S205-S223.
[42] Dillow, Clay (2010). ”The Human Connectome Project Is a First-of-its-Kind
Map of the Brain’s Circuitry.” Popular Science. Sept .
[43] Ahrens, E.T., Laidlaw, D.H., Readhead, C., Brosnan, C.F., Fraser, S.E., and
Jacobs, R.E. (1998). MR microscopy of transgenic mice that spontaneously ac-
quire experimental allergic encephalomyelitis. Magn. Reson. Med. Vol. 40, pp.
119-132.
[44] Harsan, L.A., Poulet, P., Guignard, B., Steibel, J., Parizel, N., de Sousa, P.L.,
Boehm, N., Grucker, D., and Ghandour, M.S. (2006). Brain dysmyelination and
recovery assessment by noninvasive in vivo diffusion tensor magnetic resonance
imaging. J. Neurosci. Res. Vol. 83, pp. 392-402.
[45] Kroenke, C.D., Bretthorst, G.L., Inder, T.E., and Neil, J.J. (2006). Modeling
water diffusion anisotropy within fixed newborn primate brain using Bayesian
probability theory. Magn. Reson. Med. Vol. 55, pp. 187-197.
[46] Mori, S., Itoh, R., Zhang, J., Kaufmann, W.E., van Zijl, P.C.M., Solaiyappan,
M., and Yarowsky, P. (2001). Diffusion tensor imaging of the developing mouse
brain. Magn. Reson. Med. Vol. 46, pp. 18-23.
[47] Nair, G., Tanahashi, Y., Low, H.P., Billings-Gagliardi, S., Schwartz, W.J., and
Duong, T.Q. (2005). Myelination and long diffusion times alter diffusion-tensor-
imaging contrast in myelin-deficient shiverer mice. Neuroimage Vol. 28, pp. 165-
174.
[48] Basser, P.J., Mattiello, J., and Le Bihan, D. (1994). MR diffusion tensor spec-
troscopy and imaging. Biophys. J. Vol. 66, pp. 259-267.
[49] Mori, S. & Zhang, J. (2006). “Principles of Diffusion Tensor Imaging and Its
Applications to Basic Neuroscience Research”, Neuron Vol. 51, pp. 527-539.
[50] Chenevert, T.L., Brunberg, J.A., and Pipe, J.G. (1990). “Anisotropic diffusion
in human white matter: Demonstration with MR technique in vivo”. Radiology
Vol. 177, pp. 401-405.
[51] Moseley, M.E., Cohen, Y., Kucharczyk, J., Mintorovitch, J., Asgari, H.S., Wend-
land, M.F., Tsuruda, J., and Norman, D. (1990). “Diffusion-weighted MR imag-
ing of anisotropic water diffusion in cat central nervous system”. Radiology Vol.
176, pp 439-445.
[52] Turner, R., Le Bihan, D., Maier, J., Vavrek, R., Hedges, L.K., and Pekar, J.
(1990). “Echo-planar imaging of intravoxel incoherent motion”. Radiology Vol.
177, pp. 407-414.
[53] Gudbjartsson, H. and Patz, S. (1995). “The Rician distribution of noisy MRI
data”. Magnetic Resonance in Medicine, Vol. 34, pp. 910-914.
[54] Hahn, K. R., Prigarin, S., Heim, S., and Hasan, K. (2006). “Random noise
in diffusion tensor imaging, its destructive impact and some corrections”. In
Weickert, J. and Hagen, H., editors, Visualization and Processing of Tensor
Fields, pp 107-117, Springer. MR2210513.
[55] Hahn, K. R., Prigarin, S., Rodenacker, K., and Hasan, K. (2009). “Denoising
for diffusion tensor imaging with low signal to noise ratios: method and monte
carlo validation”. International Journal for Biomathematics and Biostatistics,
1:83-81.
[56] Zhu, H. T., Li, Y., Ibrahim, I. G., Shi, X., An, H., Chen, Y., Gao, W., Lin,
W., Rowe, D. B., and Peterson, B. S. (2009). “Regression models for identifying
noise sources in magnetic resonance images”. Journal of the American Statistical
Association, Vol. 104, pp. 623-637.
[57] Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2005). “Fast and simple
computations on tensors with log-euclidean metrics”. Technical report, Institut
National de Recherche en Informatique et en Automatique.
[58] Arsigny, V., Fillard, P., Pennec, X., and Ayache, N. (2006). “Log- euclidean
metrics for fast and simple calculus on diffusion tensors”. Magnetic Resonance
in Medicine, Vol. 56, pp. 411-421.
[59] Castano Moraga, C. A., Lenglet, C., Deriche, R., and Ruiz-Alzola, J. (2007). “A
Riemannian approach to anisotropic filtering of tensor fields”. Signal Processing,
Vol. 87, pp. 263-276.
[60] Yuan, Y., Zhu, H., Lin, W., and Marron, J. (2012). “Local polynomial regression
for symmetric positive definite matrices”. Journal of the Royal Statistical Society:
Series B (Statistical Methodology).
[61] Carmichael, O., Chen, J., Paul, D. and Peng, J. (2013). “Diffusion tensor
smoothing through weighted Karcher means”, Electronic Journal of Statistics,
Vol. 7, pp. 1913-1956.
[62] Polzehl, J. and Tabelow, K. (2008). “Structural adaptive smoothing in diffusion
tensor imaging: the R package dti”. Technical report, Weierstrass Institute,
Berlin.
[63] Karcher, H. (1977). “Riemannian center of mass and mollifier smoothing”. Com-
munications in Pure and Applied Mathematics, Vol. 30, pp. 509-541.
[64] Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). “Optimization Algorithms
on Matrix Manifolds”. Princeton University Press.
[65] Meinshausen, N. and Buhlmann, P. (2006). “High-Dimensional Graphs and Vari-
able Selection with the LASSO”, The Annals of Statistics, Vol. 34, No. 3, pp.
1436-1462.
[66] Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the Lasso”,
Journal of Royal Statistical Society, Series B (Methodological), Vol. 58, pp. 267-
288.
[67] Tibshirani, R., Saunders, M., Rosset, S., Zhu, Ji and Knight, K. (2005). “Sparsity
and Smoothness via the Fused Lasso”, Journal of Royal Statistical Society, Series
B (Methodological), Vol. 67, Part 1, pp. 91-108.
[68] Dempster, A. P. (1972). “Covariance Selection”, Biometrics, Vol. 28, No. 1,
Special Multivariate Issue, pp. 157-175.
[69] Edwards, David (2000). “Introduction to Graphical Modelling”, Second Edition,
Springer.
[70] Speed, T. P. and Kiiveri, H. T., “Gaussian Markov Distributions over Finite
Graphs”, The Annals of Statistics, Vol. 14, No. 1, pp. 138-150.
[71] Yuan, M. and Lin, Y. (2007). “Model Selection and Estimation in the Gaussian
Graphical Model”, Biometrika, Vol. 94, No. 1, pp. 19-35.
[72] Vandenberghe, L., Boyd, S. and Wu, S.-P. (1998). “Determinant Maximization
with Linear Matrix Inequality Constraints”, SIAM Journal on Matrix Analysis
and Applications, Vol. 19, No. 2, pp. 499-533.
[73] Breiman, Leo (1995). “Better Subset Regression Using the Nonnegative Gar-
rote”, Technometrics, Vol. 37, No. 4, pp. 373-384.
[74] Banerjee, O., Ghaoui, L. E., d’Aspremont, A. (2008). “Model Selection Through
Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary
Data”, Journal of Machine Learning Research, Vol. 9, pp. 485-516.
[75] Friedman, J., Hastie, T. and Tibshirani, R. (2007)“Sparse inverse covariance
estimation with the graphical lasso”. Biostatistics, Dec. 12, 2007, pp. 1-10.
[76] Chen, X., Lin, Q., Kim, S., Carbonell, J. G., Xing, E. P. (2010). “An Efficient
Proximal Gradient Method for General Structured Sparse Learning”, Journal of
Machine Learning Research, Vol. 11.
[77] Kovac, A. and Smith, A. D. A. C. (2012), “Nonparametric Regression on a
Graph”. Journal of Computational and Graphical Statistics. Vol. 20, No. 2, pp.
432-447.
[78] Greenshtein, E. and Ritov, Y. (2004). “Persistence in High-Dimensional Linear
Predictor Selection and the Virtue of Overparametrization”, Bernoulli, Vol. 10,
No. 6, pp. 971-988.
[79] Hoefling, H. (2010). “A Path Algorithm for the Fused Lasso Signal Approxi-
mator”. Journal of Computational and Graphical Statistics, Vol. 19, No. 4, pp.
984-1006.
[80] Tibshirani, R.J. and Taylor, J. (2011). “The Solution Path of the Generalized
Lasso”. The Annals of Statistics, Vol. 39, No. 3, pp. 1335-1371.
[81] Liu, J., Yuan, L., and Ye, J. (2010). “An Efficient Algorithm for a Class of Fused
Lasso Problems”. In The ACM SIG Knowledge Discovery and Data Mining.
ACM, Washington, DC.
[82] Chen, X., Lin, Q., Kim, S., Carbonell, J.G., and Xing, E.P. (2012). “Smoothing
Proximal Gradient Method for General Structured Sparse Regression”. Annals
of Applied Statistics, Vol. 6, No. 2, pp. 719-752.
[83] Nesterov, Y. (2005). “Smooth Minimization of Non-smooth Functions”. Mathe-
matical Programming, Vol. 103, pp. 127-152.
[84] Beck, A. and Teboulle, M. (2009). “A Fast Iterative Shrinkage-thresholding Al-
gorithm for Linear Inverse Problems”. SIAM Journal on Imaging Sciences, Vol.
2, pp. 183-202.
[85] Ye, G.B. and Xie, X. (2011). “Split Bregman Method for Large Scale Fused
Lasso”. Computational Statistics & Data Analysis, Vol. 55, No. 4, pp. 1552-1569.
[86] Hestenes, M.R. (1969). “Multiplier and gradient methods”. Journal Optimization
Theory & Applications, Vol. 4, pp. 303-320.
[87] Rockafellar, R.T. (1973), “A Dual Approach to Solving Nonlinear Programming
Problems by Unconstrained Optimization”. Mathematical Programming, Vol. 5,
pp. 354-373.
[88] Buhlmann, P. and van de Geer, S. (2011). “Statistics for High-Dimensional
Data”. Springer Series in Statistics.
[89] Bickel, P., Ritov, Y. and Tsybakov, A. (2009). “Simultaneous Analysis of Lasso
and Dantzig Selector”, Annals of Statistics, Vol. 37, No. 4, pp. 1705-1732.