Advances in Model Selection Techniques with Applications to Statistical Network
Analysis and Recommender Systems
Diego Franco Saldana
Submitted in partial fulfillment of the
requirements for the degree
of Doctor of Philosophy
in the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2016
© 2016
Diego Franco Saldana
All Rights Reserved
ABSTRACT
Advances in Model Selection Techniques with Applications to Statistical Network
Analysis and Recommender Systems
Diego Franco Saldana
This dissertation focuses on developing novel model selection techniques, the pro-
cess by which a statistician selects one of a number of competing models of varying
dimensions, under an array of different statistical assumptions on observed data. Tra-
ditionally, two main reasons have been advocated by researchers for performing model
selection strategies over classical maximum likelihood estimates (MLEs). The first
reason is prediction accuracy, where by shrinking or setting to zero some model pa-
rameters, one sacrifices the unbiasedness of MLEs for a reduced variance, which in
turn leads to an overall improvement in predictive performance. The second reason
relates to interpretability of the selected models in the presence of a large number
of predictors, where in order to obtain a parsimonious representation exhibiting the
relationship between the response and covariates, we are willing to sacrifice some of
the smaller details brought in by spurious predictors.
In the first part of this work, we revisit the family of variable selection techniques
known as sure independence screening procedures for generalized linear models and
the Cox proportional hazards model. After clever combination of some of its most
powerful variants, we propose new extensions based on the ideas of sample splitting,
data-driven thresholding, and combinations thereof. A publicly available package
developed in the R statistical software demonstrates considerable improvements, in
terms of model selection and at competitive computational time, of our enhanced
variable selection procedures over traditional penalized likelihood methods applied
directly to the full set of covariates.
Next, we develop model selection techniques within the framework of statistical
network analysis for two frequent problems arising in the context of stochastic block-
models: community number selection and change-point detection. In the second part
of this work, we propose a composite likelihood based approach for selecting the
number of communities in stochastic blockmodels and their variants, with robustness
considerations against possible misspecifications in the underlying conditional inde-
pendence assumptions of the stochastic blockmodel. Several simulation studies, as
well as two real data examples, demonstrate the superiority of our composite likeli-
hood approach when compared to the traditional Bayesian Information Criterion or
variational Bayes solutions. In the third part of this thesis, we extend our analysis on
static network data to the case of dynamic stochastic blockmodels, where our model
selection task is the segmentation of a time-varying network into temporal and spatial
components by means of a change-point detection hypothesis testing problem. We
propose a corresponding test statistic based on the idea of data aggregation across the
different temporal layers through kernel-weighted adjacency matrices computed be-
fore and after each candidate change-point, and illustrate our approach on synthetic
data and the Enron email corpus.
The matrix completion problem consists in the recovery of a low-rank data ma-
trix based on a small sampling of its entries. In the final part of this dissertation,
we extend prior work on nuclear norm regularization methods for matrix completion
by incorporating a continuum of penalty functions between the convex nuclear norm
and nonconvex rank functions. We propose an algorithmic framework for comput-
ing a family of nonconvex penalized matrix completion problems with warm-starts,
and present a systematic study of the resulting spectral thresholding operators. We
demonstrate that our proposed nonconvex regularization framework leads to improved
model selection properties in terms of finding low-rank solutions with better predic-
tive performance on a wide range of synthetic data and the famous Netflix
recommender system data.
Table of Contents

List of Figures

List of Tables

1 SIS: An R Package for Sure Independence Screening in Ultrahigh Dimensional Statistical Models
1.1 Introduction
1.2 General SIS and ISIS Methodological Framework
1.2.1 SIS and Feature Ranking by Maximum Marginal Likelihood Estimators
1.2.2 Iterative Sure Independence Screening
1.3 Variants of ISIS
1.3.1 First Variant of ISIS
1.3.2 Second Variant of ISIS
1.3.3 Data-driven Thresholding
1.3.4 Implementation Details
1.4 Model Selection and Timings
1.4.1 Model Selection and Statistical Accuracy
1.4.2 Scaling in n and p with Feature Screening
1.4.3 Real Data Analysis
1.4.4 Code Example
1.5 Discussion

2 How Many Communities Are There?
2.1 Introduction
2.2 Background
2.2.1 Stochastic Blockmodels
2.2.2 Spectral Clustering and SCORE
2.3 Model Selection for the Number of Communities
2.3.1 Motivation
2.3.2 Composite Likelihood Inference
2.3.3 Composite Likelihood BIC
2.3.4 Formulae
2.4 Experiments
2.4.1 Simulations
2.4.2 Real Data Analysis
2.5 Discussion

3 Kernel-Based Change-Point Detection in Dynamic Stochastic Blockmodels
3.1 Introduction
3.2 Background
3.2.1 Dynamic Stochastic Blockmodels
3.2.2 Change-Point Model
3.3 Change-Point Detection Methodology
3.3.1 Spectral Clustering and Parameter Estimation in SBM
3.3.2 Algorithm Description and Rationale
3.3.3 Implementation Details
3.4 Experiments
3.4.1 Simulations
3.4.2 Real Data Analysis
3.5 Discussion

4 NC-Impute: Scalable Matrix Completion with Nonconvex Penalties
4.1 Introduction
4.1.1 Contributions and Outline
4.2 Spectral Thresholding Operators
4.2.1 Properties of Spectral Thresholding Operators
4.2.2 Effective Degrees of Freedom for Spectral Thresholding Operators
4.3 The NC-Impute Algorithm
4.3.1 Convergence Analysis
4.3.2 Computing the Thresholding Operators
4.4 Numerical Experiments
4.4.1 Synthetic Examples
4.4.2 Real Data Examples: MovieLens and Netflix Data Sets
4.5 Conclusions

Bibliography

A Technical Material for NC-Impute
A.1 Proof of Proposition 4.5
A.2 Proof of Proposition 4.9
List of Figures

1.1 Median runtime in seconds taken over 10 trials (log-log scale).

2.1 (Color online) Comparisons between different methods for selecting the true community number K in the standard blockmodel settings of Simulations 1–3. Along the y-axis, we record the proportion of times the chosen number of communities for each of the different criteria for selecting K agrees with the truth.

2.2 (Color online) Largest connected component of the international trade network for the year 1995.

2.3 (Color online) Largest connected component of the school friendship network. Panel (c) shows the "true" grade community labels: 7th (blue), 8th (yellow), 9th (green), 10th (purple), 11th (red), and 12th (black).

3.1 (Color online) Mean values for the test statistic d(t) and the estimated quantiles c_α(t) for each candidate change-point in Simulations 1 and 2, where the choices of α = 0.10 (green), α = 0.05 (blue), and α = 0.01 (red) are considered.

3.2 (Color online) Scan statistic values and associated detected change-points for the Enron email data set (left panels). Kernel Change-Points test statistic d(t) along with the estimated quantiles c_α(t), with α = 0.01, for each candidate change-point (right panels). Red circles indicate common discovered change-points between the two approaches, whereas gold circles represent newly discovered change-points by the KCP statistic.

4.1 [Top panel] Examples of nonconvex penalties σ ↦ P(σ; λ, γ) with λ = 1 for different values of γ. [Bottom panel] The corresponding scalar thresholding operators σ ↦ s_{λ,γ}(σ). At σ = 1, some of the thresholding operators corresponding to the ℓ_γ penalty function are discontinuous, and some of the other thresholding functions are "close" to being so.

4.2 Figure showing the df for the MC+ thresholding operator for a matrix with m = n = 10, μ = 0 and v = 1. The df profile as a function of γ (in the log scale) is shown for three values of λ. The dashed lines correspond to the df of the spectral soft-thresholding operator, corresponding to γ = ∞. We propose calibrating the (λ, γ) grid such that the df corresponding to every value of γ matches the df of the soft-thresholding operator, as shown in Figure 4.3.

4.3 Figure showing the calibrated (λ, γ) lattice: for every fixed value of λ, the df of the MC+ spectral thresholding operators are the same across different γ values. The df computations have been performed on a null model using Proposition 4.5.

4.4 (Color online) Random Orthogonal Model (ROM) simulations with SNR = 1. The choice γ = +∞ refers to nuclear norm regularization as provided by the Soft-Impute algorithm. The least nonconvex alternatives at γ = 100 and γ = 80 behave similarly to nuclear norm, although with better prediction performance. The choices of γ = 5 and γ = 10 result in excessively aggressive fitting behavior for the true rank = 10 case, but improve significantly in prediction error and recovering the true rank in the sparser true rank = 5 setting. In both scenarios, the intermediate models with γ = 30 and γ = 20 fare the best, with the former achieving the smallest prediction error, while the latter estimates the actual rank of the matrix. Values of test error larger than one are not displayed in the figure.

4.5 (Color online) Random Orthogonal Model (ROM) simulations with SNR = 5. The benefits of nonconvex regularization are more evident in this high-sparsity, high-missingness scenario. While the γ = 100 and γ = 80 models distance themselves more from nuclear norm, the remaining members of the MC+ family essentially minimize prediction error while correctly estimating the true rank. This is especially true in panel (d), where the best predictive performance of the model γ = 5 at the correct rank is achieved under a low-rank truth and high SNR setting.

4.6 (Color online) Coherent and Non-Uniform Sampling (NUS) simulations with SNR = 10. Nonconvex regularization also proves to be a successful strategy in these challenging scenarios, particularly in the non-uniform sampling setting where the MC+ family exhibits a monotone decrease in prediction error as γ approaches 1. Again, the model γ = 5 estimates the correct rank under high SNR settings. Although nuclear norm achieves a relatively small prediction error, compared with previous simulation settings, the MC+ family still provides a superior and more robust mechanism for regularization.

4.7 (Color online) MovieLens 100k and 1m data. For each value of λ in the solution path, an operating rank threshold (capped at 250) larger than the rank of the previous solution was employed.

4.8 (Color online) Netflix competition data. The model γ = 10 achieves an optimal test set RMSE of 0.8276 for a solution rank of 105.
List of Tables

1.1 Summary of tuning parameters for variable selection using ISIS procedures within the SIS package, as well as associated defaults. All ISIS variants are implemented through the SIS function, which we describe in Section 1.4.4 using a gene expression data set.

1.2 Linear regression, Case 1, where results are given in the form of medians and robust standard deviations (in parentheses).

1.3 Logistic regression, Case 2, where results are given in the form of medians and robust standard deviations (in parentheses).

1.4 Poisson regression, Case 3, where results are given in the form of medians and robust standard deviations (in parentheses).

1.5 Cox proportional hazards regression, Case 4, where results are given in the form of medians and robust standard deviations (in parentheses).

1.6 Classification error rates and number of selected genes by various methods for the balanced Leukemia and Prostate cancer data sets. For the Leukemia data, the training and test samples are of size 36. For the Prostate cancer data, the training and test samples are of size 68. Results are given in the form of medians and robust standard deviations (in parentheses).

1.7 Classification error rates and number of selected genes by various methods for the balanced Lung and Neuroblastoma (NB) cancer data sets. For the Lung data, the training and test samples are of sizes 90 and 91, respectively. For the Neuroblastoma cancer data, the training and test samples are of size 123. Results are given in the form of medians and robust standard deviations (in parentheses).

2.1 Comparison of CL-BIC and BIC over 200 repetitions from Simulation 1, where Eq and Dec indicate equally correlated and exponentially decaying cases, respectively. Both the correlation of the multivariate Gaussian random variables (ρ_MVN) and the corresponding maximum correlation between Bernoulli variables (ρ_Ber) are presented.

2.2 Comparison of CL-BIC and BIC over 200 repetitions from Simulation 2, where Ind indicates ρ_jl = 0 for j ≠ l. For simplicity, we omit the correlation between the corresponding Bernoulli variables.

2.3 Comparison of CL-BIC and VB over 200 repetitions from Simulation 3. For simplicity, we omit the correlation between the corresponding Bernoulli variables.

2.4 Comparison of CL-BIC and BIC over 200 repetitions from Simulation 4. Before being scaled by the constant γ_n, we selected θ = (θ_ab; 1 ≤ a ≤ b ≤ 4)^⊤, where θ_aa = 7 for all a = 1, …, 4 and θ_ab = 1 for 1 ≤ a < b ≤ 4.

2.5 Comparison of CL-BIC and BIC over 200 repetitions from the DCBM case in Simulation 4, with (ρ_Eq = 0.2, γ_n = 0.03), where the individual effect parameters ω_i are now generated from a Uniform(1/5, 9/5) distribution.

3.1 Performance of KCP over 100 repetitions from Simulations 1 and 2. Results are given in the form of mean (standard deviation).
Acknowledgments
Throughout my years at Columbia in the pursuit of my doctoral degree, I have ben-
efited significantly from interactions with professors, colleagues and friends. I would
like to take this opportunity to express to them my most sincere gratitude and utmost
admiration.
First and foremost, I wish to express my appreciation and gratitude to my advisor,
Professor Yang Feng, for his guidance and continuous support during the research
phase of my PhD. Yang’s insights, thoughtful suggestions and dedicated involvement
in my research work greatly improved the content of this dissertation, as well as my
knowledge and skill set as I develop into a practicing statistician.
Next, I would like to thank Professor Rahul Mazumder for fruitful discussions re-
garding the intersection of statistics, machine learning and optimization. His steadfast
championing and encouragement transformed a topics class project into one of the
research projects presented in this dissertation.
I would also like to thank Professor Tian Zheng, Professor Peter Orbanz and
Professor Sumit Mukherjee for kindly agreeing to serve on my dissertation defense
committee. Their insightful comments and suggestions provided very helpful feedback
to improve the work presented herein.
I would also like to thank the professors in our department for their dedication and
encouragement over the years, in particular Ioannis Karatzas and Bodhisattva Sen for
their formative courses, as well as Richard Davis and Zhiliang Ying for their advice
and expertise as exemplary role models.
I am also deeply grateful to all my classmates in the Department of Statistics at
Columbia. Your discipline, hard work and unwavering drive are the key ingredients in
making our department a world-class community in which to pursue our graduate studies. I
would like to thank in particular my cohort classmates Rohit Patra and Zach Shahn
for innumerable conversations that always helped me to see the big picture throughout
my graduate studies, as well as Haolei Weng and Yi Yu for being not only the perfect
coauthors, but also an invaluable source of strength and encouragement during my
time at Columbia.
To all my Mexican friends here at Columbia University, and in New York in
general. Elia, Colin, Dani, Diego, Jorge, Rodrigo, Prospero, Emilio and Angeles: you
are the single most amazing group of friends I could have imagined through my PhD
and beyond. Thank you for so many wonderful memories and your unconditional
support throughout this journey. My time at Columbia would not have been the
same without you, thank you for being there along the way!
My deepest gratitude goes to Susan, my beloved partner and the most compassionate
and understanding person on this planet. Without her endless encouragement
and invaluable support, I would not have been even close to completing this disser-
tation. My memories of Columbia and New York are inseparably attached to hers. I
can only hope to be as supportive for her and her career as she has been for mine.
And finally, I would like to express my most sincere thank you to my parents Estela
Ivonne and Ernesto, and my brother Ernie. Your infinite love and encouragement have
been my day-to-day pillars throughout my PhD. You are the single best example of
hard work and perseverance that I have, and it is because of it that I have been able
to successfully overcome the many challenges I have faced throughout my life. I love
you all very much; to you this thesis is dedicated.
Gracias a la vida, que me ha dado tanto.
Me ha dado la marcha de mis pies cansados.
Con ellos anduve ciudades y charcos,
Playas y desiertos, montañas y llanos,
Y la casa tuya, tu calle y tu patio.
Thanks to life, which has given me so much.
It gave me the steps of my tired feet.
With them I have traversed cities and puddles,
Beaches and deserts, mountains and plains,
And your house, your street and your courtyard.
Violeta Parra
Chapter 1
SIS: An R Package for Sure
Independence Screening in
Ultrahigh Dimensional Statistical
Models
In this chapter, we revisit sure independence screening procedures for variable selec-
tion in generalized linear models and the Cox proportional hazards model. Through
the publicly available R package SIS, we provide a unified environment to carry out
variable selection using iterative sure independence screening (ISIS) and all of its
variants. For the regularization steps in the ISIS recruiting process, available penal-
ties include the LASSO, SCAD, and MC+, while the implemented variants for the
screening steps are sample splitting, data-driven thresholding, and novel combinations
thereof. Performance of these feature selection techniques is investigated by means
of real and simulated data sets, where we find considerable improvements in terms
of model selection and computational time between our algorithms and traditional
penalized pseudo-likelihood methods applied directly to the full set of covariates.
1.1 Introduction
With the remarkable development of modern technology, including computing power
and storage, more and more high-dimensional and high-throughput data of unprece-
dented size and complexity are being generated for contemporary statistical studies.
For instance, bioimaging technology has made it possible to collect a huge amount of
predictor information such as microarray, proteomic, and SNP data while observing
survival information and tumor classification on patients in clinical studies. A com-
mon feature of all these examples is that the number of variables p can be potentially
much larger than the number of observations n, i.e., the number of gene expression
profiles is in the order of tens of thousands while the number of patient samples is
in the order of tens or hundreds. By ultrahigh dimensionality, following Fan and Lv
(2008), we mean that the dimensionality grows exponentially in the sample size, i.e.,
$\log p = O(n^\xi)$ for some $\xi \in (0, 1/2)$. In order to provide more representative and
reasonable statistical models, it is typically assumed that only a small fraction of
predictors are associated with the outcome. This is the notion of sparsity which em-
phasizes the prominent role feature selection techniques play in ultrahigh dimensional
statistical modeling.
One popular family of variable selection methods for parametric models is based
on the penalized (pseudo-)likelihood approach. Examples include the LASSO (Tib-
shirani, 1996, 1997), SCAD (Fan and Li, 2001), the elastic net penalty (Zou and
Hastie, 2005), the MC+ (Zhang, 2010), and related methods. Nevertheless, in ultra-
high dimensional statistical learning problems, these methods may not perform well
due to the simultaneous challenges of computational expediency, statistical accuracy,
and algorithmic stability (Fan et al., 2009).
Motivated by these concerns, Fan and Lv (2008) introduced a new framework
for variable screening via independent correlation learning that tackles the afore-
mentioned challenges in the context of ultrahigh dimensional linear models. Their
proposed sure independence screening (SIS) is a two-stage procedure; first filtering
out the features that have weak marginal correlation with the response, effectively
reducing the dimensionality p to a moderate scale below the sample size n, and
then performing variable selection and parameter estimation simultaneously through
a lower dimensional penalized least squares method such as SCAD or LASSO. Under
certain regularity conditions, Fan and Lv (2008) showed surprisingly that this fast
feature selection method has a “sure screening property”; that is, with probability
tending to 1, the independence screening technique retains all of the important fea-
tures in the model. However, the SIS procedure in Fan and Lv (2008) only covers
ordinary linear regression models and their technical arguments do not extend easily
to more general models such as generalized linear models and hazard regression with
right-censored times.
In order to enhance finite sample performance, an important methodological ex-
tension, iterative sure independence screening (ISIS), was also proposed by Fan and
Lv (2008) to handle cases where the regularity conditions may fail, such as when some
important predictors are marginally uncorrelated with the response, or the reverse
situation where an unimportant predictor has higher marginal correlation than some
important features. Roughly speaking, the original ISIS procedure works by itera-
tively performing variable selection to recruit a small number of predictors, computing
residuals based on the model fitted using these recruited predictors, and then using
the residuals as the working response variable to continue recruiting new predictors.
With the purpose of handling more complex real data, Fan and Song (2010) extended
SIS to generalized linear models; and Fan et al. (2009) improved some important
steps of the original ISIS procedure, allowing variable deletion in the recruiting pro-
cess through penalized pseudo-likelihood, while dealing with more general loss based
models. In particular, they introduced the concept of conditional marginal regres-
sions and, with the aim of reducing the false discovery rate, proposed two new ISIS
variants based on the idea of splitting samples. Other extensions of ISIS include
Fan et al. (2010) to the Cox proportional hazards model, and Fan et al. (2011) to
nonparametric additive models.
In this chapter, we build on the work of Fan et al. (2009) and Fan et al. (2010)
to provide a publicly available package SIS (Fan et al., 2015), implemented in the
R statistical software (R Core Team, 2015), extending sure independence screening
and all of its variants to generalized linear models and the Cox proportional hazards
model. In particular, our codes are able to perform variable selection through the
proposed ISIS variants of Fan et al. (2009) and through the data-driven thresholding
approach of Fan et al. (2011). Furthermore, we combine these sample splitting and
data-driven thresholding ideas to provide two novel feature selection techniques.
Taking advantage of the fast cyclical coordinate descent algorithms developed in
the packages glmnet (Friedman et al., 2013) and ncvreg (Breheny, 2013), for convex
and nonconvex penalty functions, respectively, we are able to efficiently perform the
moderate scale penalized pseudo-likelihood steps from the ISIS procedure, thus yield-
ing variable selection techniques outperforming direct use of glmnet and ncvreg in
terms of both computational time and estimation error. Our procedures scale favor-
ably in both n and p, allowing us to expeditiously and accurately solve much larger
problems than with previous packages, particularly in the case of nonconvex penal-
ties. We would like to point out that the recent package apple (Yu and Feng, 2014a),
using a hybrid of the predictor-corrector method and coordinate descent procedures,
provides an alternative for the penalized pseudo-likelihood estimation with nonconvex
penalties. In the present work, we limit all numerical results to the use of ncvreg,
noting there are other available options to implement the nonconvex variable selec-
tion procedures performed by SIS. Similarly, although the package survHD (Bernau
et al., 2014) provides an efficient alternative for implementing Cox proportional haz-
ards regression, in the current presentation, we only make use of the survival package
(Therneau and Lumley, 2015) to compute conditional marginal regressions and of the
glmnet package (Friedman et al., 2013) to fit high-dimensional Cox models.
The remainder of the chapter is organized as follows. In Section 1.2, we describe
the vanilla SIS and ISIS variable selection procedures in the context of generalized
linear models and the Cox proportional hazards model. Section 1.3 discusses sev-
eral ISIS variants, as well as important implementation details. Simulation results
comparing model selection performance and run time trials are given in Section 1.4,
where we also analyze four gene expression data sets and work through an example
using our package with one of them. The chapter is concluded with a short discussion
in Section 1.5.
1.2 General SIS and ISIS Methodological Framework
Consider the usual generalized linear model (GLM) framework, where we have independent and identically distributed observations $\{(x_i, y_i) : i = 1, \ldots, n\}$ from the population $(x, y)$, where the predictor $x = (x_0, x_1, \ldots, x_p)^\top$ is a $(p+1)$-dimensional random vector with $x_0 = 1$ and $y$ is the response. We further assume the conditional distribution of $y$ given $x$ is from an exponential family taking the canonical form
$$f(y; x, \beta) = \exp\{y\theta - b(\theta) + c(y)\}, \tag{1.1}$$
where $\theta = x^\top \beta$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ is a vector of unknown regression parameters, and $b(\cdot)$, $c(\cdot)$ are known functions. As we are only interested in modeling the mean regression, the dispersion parameter is assumed known. In virtue of (1.1), inference about the parameter $\beta$ in the GLM context is made via maximization of the log-likelihood function
$$\ell(\beta) = \sum_{i=1}^n \{ y_i (x_i^\top \beta) - b(x_i^\top \beta) \}. \tag{1.2}$$
For the survival analysis framework, the observed data $\{(x_i, y_i, \delta_i) : x_i \in \mathbb{R}^p,\ y_i \in \mathbb{R}^+,\ \delta_i \in \{0, 1\},\ i = 1, \ldots, n\}$ is an independent and identically distributed random sample from a certain population $(x, y, \delta)$. Here, as in the context of linear models, $x = (x_1, x_2, \ldots, x_p)^\top$ is a $p$-dimensional random vector of predictors and $y$, the observed time, is a time of failure if $\delta = 1$, or a right-censored time if $\delta = 0$. Suppose that the sample comprises $m$ distinct uncensored failure times $t_1 < t_2 < \cdots < t_m$. Let $(j)$ denote the individual failing at time $t_j$ and $R(t_j)$ be the risk set just prior to time $t_j$, that is, $R(t_j) = \{i : y_i \geq t_j\}$. The main problem of interest is to study the relationship between the predictor variables and the failure time, and a common approach is through the Cox proportional hazards model (Cox, 1975). For a vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top$ of unknown regression parameters, the Cox model assumes a semiparametric form of the hazard function
$$h(t \mid x_i) = h_0(t)\, e^{x_i^\top \beta},$$
where $h_0(t)$ is an unknown arbitrary baseline hazard function giving the hazard when $x_i = 0$. Following the argument in Cox (1975), inference about $\beta$ is made via maximization of the partial likelihood function
$$L(\beta) = \prod_{j=1}^m \frac{e^{x_{(j)}^\top \beta}}{\sum_{k \in R(t_j)} e^{x_k^\top \beta}},$$
which is equivalent to maximizing the log-partial likelihood
$$\ell(\beta) = \sum_{i=1}^n \delta_i x_i^\top \beta - \sum_{i=1}^n \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_k^\top \beta) \Big\}. \tag{1.3}$$
We refer interested readers to Kalbfleisch and Prentice (2002) and references therein
for a comprehensive literature review on the Cox proportional hazards model.
For both statistical models, we assume all predictors x1, . . . , xp are standardized
to have mean zero and standard deviation one. Additionally, although our variable
selection procedures within the SIS package also handle the classical p < n setting,
we allow the dimensionality of the covariates p to grow much faster than the num-
ber of observations n. To be more specific, we assume the ultrahigh dimensionality
setup $\log p = O(n^\xi)$ for some $\xi \in (0, 1/2)$ presented in Fan and Song (2010). What
makes statistical inference possible in this “large p, small n” scenario is the sparsity
assumption; only a small subset of variables among predictors x1, . . . , xp contribute
to the response, which implies the parameter vector β is sparse. Therefore, varia-
ble selection techniques play a pivotal role in these ultrahigh dimensional statistical
models.
1.2.1 SIS and Feature Ranking by Maximum Marginal Likelihood Estimators
Let $\mathcal{M}_\star = \{1 \leq j \leq p : \beta_j^\star \neq 0\}$ be the true sparse model, where $\beta^\star = (\beta_0^\star, \beta_1^\star, \ldots, \beta_p^\star)^\top$ denotes the true value of the parameter vector and $\beta_0^\star = 0$ for the Cox model. In order to carry out the vanilla sure independence screening variable selection procedure, we initially fit marginal versions of models (1.2) and (1.3) with componentwise covariates. The maximum marginal likelihood estimator (MMLE) $\hat{\beta}_j^M$, for $j = 1, \ldots, p$, is defined in the GLM context as the maximizer of the componentwise regression
$$\hat{\beta}_j^M = (\hat{\beta}_{j,0}^M, \hat{\beta}_j^M) = \arg\max_{\beta_0, \beta_j} \sum_{i=1}^n \{ y_i(\beta_0 + x_{ij}\beta_j) - b(\beta_0 + x_{ij}\beta_j) \}, \tag{1.4}$$
where $x_i = (x_{i0}, x_{i1}, \ldots, x_{ip})^\top$ and $x_{i0} = 1$. Similarly, for each covariate $x_j$ ($1 \leq j \leq p$), one can define the MMLE for the Cox model as the maximizer of the log-partial likelihood with a single covariate
$$\hat{\beta}_j^M = \arg\max_{\beta_j} \Big( \sum_{i=1}^n \delta_i x_{ij}\beta_j - \sum_{i=1}^n \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{kj}\beta_j) \Big\} \Big), \tag{1.5}$$
with $x_i = (x_{i1}, \ldots, x_{ip})^\top$. Both componentwise estimators can be computed very rapidly and implemented modularly, avoiding the numerical instability associated with ultrahigh dimensional estimation problems.
The vanilla SIS procedure then ranks the importance of features according to the magnitude of their marginal regression coefficients, excluding the intercept in the case of GLM. Therefore, we select a set of variables
$$\hat{\mathcal{M}}_{\delta_n} = \{1 \leq j \leq p : |\hat{\beta}_j^M| \geq \delta_n\}, \tag{1.6}$$
where $\delta_n$ is a threshold value chosen so that we pick the $d$ top ranked covariates. Typically, one may take $d = \lfloor n / \log n \rfloor$, so that dimensionality is reduced from ultrahigh to below the sample size. As further discussed in Sections 1.3.3 and 1.3.4, the choice of $d$ may also be either data-driven or model-based. Under a mild set of technical conditions, Fan and Song (2010) show the magnitude of these marginal estimators can preserve the nonsparsity information about the joint model with full covariates. In other words, for a given sequence $\delta_n$, the sure screening property
$$P(\mathcal{M}_\star \subset \hat{\mathcal{M}}_{\delta_n}) \to 1 \quad \text{as } n \to \infty \tag{1.7}$$
holds for SIS, effectively reducing the dimensionality of the model from ultrahigh to below the sample size, and solving the aforementioned challenges of computational expediency, statistical accuracy, and algorithmic stability.
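To make the screening step concrete, the following minimal R sketch carries out the vanilla SIS screening step for logistic regression, written directly from the componentwise fits (1.4) and the selection rule (1.6). It is an illustration only, not the implementation inside the SIS package (which, as noted in Section 1.3.4, uses faster correlation learning for the first screening step):

# Componentwise MMLE screening for a binary response: fit an intercept plus
# one predictor at a time, rank features by |beta_j^M|, and keep the top d
marginal_screen <- function(x, y, d = floor(nrow(x) / log(nrow(x)))) {
  beta_m <- apply(x, 2, function(xj) coef(glm(y ~ xj, family = binomial()))[2])
  order(abs(beta_m), decreasing = TRUE)[seq_len(d)]
}

set.seed(1)
n <- 200; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(2 * x[, 1] - 1.5 * x[, 2]))
screened <- marginal_screen(x, y)  # indices of the d top ranked covariates

A penalized likelihood fit on x[, screened], e.g., SCAD via the ncvreg package, would then complete the vanilla SIS procedure.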
With features being crudely selected by the intensity of their marginal signals, the index set $\hat{\mathcal{M}}_{\delta_n}$ may also include a great deal of unimportant predictors. To further improve finite sample performance of vanilla SIS, variable selection and parameter estimation can be simultaneously achieved via penalized likelihood estimation, using the joint information of the covariates in $\hat{\mathcal{M}}_{\delta_n}$. Without loss of generality, by reordering the features if necessary, we may assume that $x_1, \ldots, x_d$ are the predictors recruited by SIS. We define $\beta_d = (\beta_0, \beta_1, \ldots, \beta_d)^\top$ and let $x_{i,d} = (x_{i0}, x_{i1}, \ldots, x_{id})^\top$ with $x_{i0} = 1$. Thus, our original problem of estimating the sparse $(p+1)$-vector $\beta$ in the GLM framework (1.2) reduces to estimating a sparse $(d+1)$-vector $\beta_d$ based on maximizing the moderate scale penalized likelihood
$$\hat{\beta}_d = \arg\max_{\beta_d} \Big[ \sum_{i=1}^n \{ y_i(x_{i,d}^\top \beta_d) - b(x_{i,d}^\top \beta_d) \} - \sum_{j=1}^d p_\lambda(|\beta_j|) \Big]. \tag{1.8}$$
Likewise, after defining $\beta_d = (\beta_1, \beta_2, \ldots, \beta_d)^\top$ and setting $x_{i,d} = (x_{i1}, x_{i2}, \ldots, x_{id})^\top$ for survival data within the Cox model, the moderate scale version of the penalized log-partial likelihood problem consists in maximizing
$$\hat{\beta}_d = \arg\max_{\beta_d} \Big[ \sum_{i=1}^n \delta_i x_{i,d}^\top \beta_d - \sum_{i=1}^n \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{k,d}^\top \beta_d) \Big\} - \sum_{j=1}^d p_\lambda(|\beta_j|) \Big]. \tag{1.9}$$
Here, $p_\lambda(\cdot)$ is a penalty function and $\lambda > 0$ is a regularization parameter which may be selected using the AIC (Akaike, 1973), BIC (Schwarz, 1978) or EBIC (Chen and Chen, 2008) criteria, or through ten-fold cross-validation and the modified cross-validation framework (Feng and Yu, 2013; Yu and Feng, 2014b), for example. Penalty functions whose resulting estimators yield sparse solutions to the maximization problems (1.8) and (1.9) include the LASSO penalty $p_\lambda(|\beta|) = \lambda|\beta|$ (Tibshirani, 1996); the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), which is a folded-concave quadratic spline with $p_\lambda(0) = 0$ and
$$p'_\lambda(|\beta|) = \lambda \Big\{ 1_{\{|\beta| \leq \lambda\}} + \frac{(\gamma\lambda - |\beta|)_+}{(\gamma - 1)\lambda} 1_{\{|\beta| > \lambda\}} \Big\}$$
for some $\gamma > 2$ and $|\beta| > 0$; and the minimax concave penalty (MC+), another folded-concave quadratic spline with $p_\lambda(0) = 0$ such that
$$p'_\lambda(|\beta|) = (\lambda - |\beta|/\gamma)_+$$
for some $\gamma > 0$ and $|\beta| > 0$ (Zhang, 2010). For the SCAD and MC+ penalties, the tuning parameter $\gamma$ adjusts the concavity of the penalty: the smaller $\gamma$ is, the more concave the penalty becomes, which makes finding a global optimum more difficult; on the other hand, the resulting estimators overcome the bias introduced by the LASSO penalty.
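For concreteness, the short R functions below transcribe the two derivative formulas displayed above; they are plain evaluations of $p'_\lambda(\cdot)$ for SCAD and MC+, not code from any package:

# SCAD derivative: equals lambda for |beta| <= lambda, decays linearly on
# (lambda, gamma*lambda], and vanishes beyond, for some gamma > 2
scad_deriv <- function(beta, lambda, gamma = 3.7) {
  b <- abs(beta)
  lambda * ((b <= lambda) + pmax(gamma * lambda - b, 0) /
              ((gamma - 1) * lambda) * (b > lambda))
}

# MC+ derivative: decays linearly from lambda, vanishing past gamma*lambda
mcp_deriv <- function(beta, lambda, gamma = 3) {
  pmax(lambda - abs(beta) / gamma, 0)
}

scad_deriv(c(0.5, 2, 10), lambda = 1)  # 1.00 0.63 0.00
mcp_deriv(c(0.5, 2, 10), lambda = 1)   # 0.83 0.33 0.00

Both derivatives match the LASSO near zero and vanish for large $|\beta|$, which is precisely how these folded-concave penalties remove the LASSO bias on large coefficients.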
Once penalized likelihood estimation is carried out in (1.8) and (1.9), a refined estimate of $\mathcal{M}_\star$ can be obtained from $\hat{\mathcal{M}}_1$, the index set of the nonzero components of the sparse regression parameter estimator. We summarize this initial screening procedure based on componentwise regressions through Algorithm 1.1 below.
Algorithm 1.1 Vanilla SIS (Van-SIS)

1. Inputs: Screening model size $d$. Penalty $p_\lambda(\cdot)$.

2. For every $j \in \{1, \ldots, p\}$, compute the MMLE $\hat{\beta}_j^M$ from (1.4) or (1.5).

3. Choose a threshold value $\delta_n$ in (1.6) such that $\hat{\mathcal{M}}_{\delta_n}$ consists of the $d$ top ranked covariates.

4. Obtain the parameter estimate $\hat{\beta}_d$ from the penalized likelihood estimation problems (1.8) or (1.9).

5. Outputs: Parameter estimate $\hat{\beta}_d$ and the corresponding estimate of the true sparse model $\hat{\mathcal{M}}_1 = \mathrm{supp}\{\hat{\beta}_d\}$.

1.2.2 Iterative Sure Independence Screening

The technical conditions in Fan and Lv (2008) and Fan and Song (2010) guaranteeing the sure screening property for SIS fail to hold if there is a predictor marginally unrelated, but jointly related with the response, or if a predictor is jointly uncorrelated with the response but has higher marginal correlation with the response than some important predictors in $\mathcal{M}_\star$. In the former case, the important predictor cannot be picked up by vanilla SIS, whereas in the latter case, unimportant predictors in $\mathcal{M}_\star^c$ are ranked too high, leaving out features from the true sparse model $\mathcal{M}_\star$.
To deal with these difficult scenarios in which the SIS methodology breaks down,
Fan and Lv (2008) and Fan et al. (2009) proposed iterative sure independence screen-
ing based on conditional marginal regressions and feature ranking. The approach
seeks to overcome the limitations of SIS, which is based on marginal models only,
by making more use of the joint covariate information while retaining computational
expediency and stability as in the original SIS.
In its first iteration, the vanilla ISIS variable selection procedure begins with using regular SIS to pick an index set $\mathcal{A}_1$ of size $k_1$, and then similarly applies a penalized likelihood estimation approach to select a subset $\hat{\mathcal{M}}_1$ of these indices. Note that the screening step only fits componentwise regressions, while the penalized likelihood step solves optimization problems of moderate scale $k_1$, typically below the sample size $n$. This is an attractive feature for any variable selection technique applied to ultrahigh dimensional statistical models. After the first iteration, we denote the resulting estimator with nonzero components and indices in $\hat{\mathcal{M}}_1$ by $\hat{\beta}_{\hat{\mathcal{M}}_1}$.
In an effort to use more fully the joint covariate information, in the second iteration of vanilla ISIS we compute the conditional marginal regression of each predictor $j$ that is not in $\hat{\mathcal{M}}_1$. That is, under the generalized linear model framework, we solve
$$\arg\max_{\beta_0, \beta_{\hat{\mathcal{M}}_1}, \beta_j} \sum_{i=1}^n \{ y_i(\beta_0 + x_{i,\hat{\mathcal{M}}_1}^\top \beta_{\hat{\mathcal{M}}_1} + x_{ij}\beta_j) - b(\beta_0 + x_{i,\hat{\mathcal{M}}_1}^\top \beta_{\hat{\mathcal{M}}_1} + x_{ij}\beta_j) \}, \tag{1.10}$$
whereas under the Cox model, we obtain
$$\arg\max_{\beta_{\hat{\mathcal{M}}_1}, \beta_j} \Big( \sum_{i=1}^n \delta_i(x_{i,\hat{\mathcal{M}}_1}^\top \beta_{\hat{\mathcal{M}}_1} + x_{ij}\beta_j) - \sum_{i=1}^n \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{k,\hat{\mathcal{M}}_1}^\top \beta_{\hat{\mathcal{M}}_1} + x_{kj}\beta_j) \Big\} \Big) \tag{1.11}$$
for each $j \in \{1, \ldots, p\} \setminus \hat{\mathcal{M}}_1$, where $x_{i,\hat{\mathcal{M}}_1}$ denotes the sub-vector of $x_i$ with indices in $\hat{\mathcal{M}}_1$, and similarly for $\beta_{\hat{\mathcal{M}}_1}$. These are again low-dimensional problems which can be solved quite efficiently. Similar to the componentwise regressions (1.4) and (1.5), let $\hat{\beta}_j^M$ denote the last coordinate of the maximizer in (1.10) and (1.11). The magnitude $|\hat{\beta}_j^M|$ reflects the additional contribution of the $j$th predictor given that all covariates with indices in $\hat{\mathcal{M}}_1$ have been included in the model.
Once the conditional marginal regressions have been computed for each predictor not in $\hat{\mathcal{M}}_1$, we perform conditional feature ranking by ordering $\{|\hat{\beta}_j^M| : j \in \hat{\mathcal{M}}_1^c\}$ and forming an index set $\mathcal{A}_2$ of size $k_2$, say, consisting of the indices with the top ranked elements. After this screening step, under the GLM framework, we maximize the moderate scale penalized likelihood
$$\sum_{i=1}^n \{ y_i(\beta_0 + x_{i,\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}) - b(\beta_0 + x_{i,\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}) \} - \sum_{j \in \hat{\mathcal{M}}_1 \cup \mathcal{A}_2} p_\lambda(|\beta_j|) \tag{1.12}$$
with respect to $\beta_{\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}$ to get a sparse estimator $\hat{\beta}_{\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}$. Similarly, in the Cox model, we obtain a sparse estimator by maximizing the moderate scale penalized log-partial likelihood
$$\sum_{i=1}^n \delta_i x_{i,\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\hat{\mathcal{M}}_1 \cup \mathcal{A}_2} - \sum_{i=1}^n \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{k,\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}) \Big\} - \sum_{j \in \hat{\mathcal{M}}_1 \cup \mathcal{A}_2} p_\lambda(|\beta_j|). \tag{1.13}$$
The indices of the nonzero coefficients of $\hat{\beta}_{\hat{\mathcal{M}}_1 \cup \mathcal{A}_2}$ provide an updated estimate $\hat{\mathcal{M}}_2$ of the true sparse model $\mathcal{M}_\star$.
In the screening step above, an alternative approach is to substitute the fitted regression parameter $\hat{\beta}_{\hat{\mathcal{M}}_1}$ from the first step of vanilla ISIS into problems (1.10) and (1.11), so that the optimization problems become again componentwise regressions. This approach is exactly an extension of the original ISIS proposal of Fan and Lv (2008) to generalized linear models and the Cox proportional hazards model. Even for the ordinary linear model, the conditional contributions of predictors are more relevant for variable selection in the second ISIS iteration than the original proposal of using the residuals $r_i = y_i - x_{i,\hat{\mathcal{M}}_1}^\top \hat{\beta}_{\hat{\mathcal{M}}_1}$ as the new response (see Fan et al., 2009). Furthermore, the penalized likelihood steps (1.12) and (1.13) allow the procedure to delete some features $\{x_j : j \in \hat{\mathcal{M}}_1\}$ that were previously selected. This is also an important deviation from Fan and Lv (2008), as their original ISIS procedure does not contemplate intermediate deletion steps.
Lastly, the vanilla ISIS procedure, which iteratively recruits and deletes predictors, can be repeated until some convergence criterion is reached. We adopt the criterion of having an index set $\hat{\mathcal{M}}_l$ which either has the prescribed size $d$, or satisfies $\hat{\mathcal{M}}_l = \hat{\mathcal{M}}_j$ for some $j < l$. In order to ensure that iterated SIS takes at least two iterations to terminate, in our implementation we fix $k_1 = \lfloor 2d/3 \rfloor$, and thereafter at the $l$th iteration we set $k_l = d - |\hat{\mathcal{M}}_{l-1}|$. A step-by-step description of the vanilla ISIS procedure is provided in Algorithm 1.2 below.
Algorithm 1.2 Vanilla ISIS (Van-ISIS)

1. Inputs: Screening model size $d$. Penalty $p_\lambda(\cdot)$. Maximum iteration number $l_{\max}$.

2. For every $j \in \{1, \ldots, p\}$, compute the MMLE $\hat{\beta}_j^M$ from problems (1.4) or (1.5). Select the $k_1$ top ranked covariates to form the index set $\mathcal{A}_1$.

3. Apply penalized likelihood estimation on the set $\mathcal{A}_1$ to obtain a subset of indices $\hat{\mathcal{M}}_1$.

4. For every $j \in \hat{\mathcal{M}}_1^c$, solve the conditional marginal regression problem (1.10) or (1.11) and sort $\{|\hat{\beta}_j^M| : j \in \hat{\mathcal{M}}_1^c\}$. Form the index set $\mathcal{A}_2$ with the $k_2$ top ranked covariates, and apply penalized likelihood estimation on $\hat{\mathcal{M}}_1 \cup \mathcal{A}_2$ as in (1.12) or (1.13) to obtain a new index set $\hat{\mathcal{M}}_2$.

5. Iterate the process in step 4 until we have an index set $\hat{\mathcal{M}}_l$ such that $|\hat{\mathcal{M}}_l| = d$, or $\hat{\mathcal{M}}_l = \hat{\mathcal{M}}_j$ for some $j < l$, or $l = l_{\max}$.

6. Outputs: $\hat{\mathcal{M}}_l$ from step 5 along with the parameter estimate from (1.12) or (1.13).

We conclude this section providing a simple overview of the main features of the vanilla SIS and ISIS procedures for applied practitioners. In the ultrahigh dimensional statistical model setting where $p \gg n$, and even in the classical $p < n$ setting with $p > 30$, variable screening is an essential tool in helping eliminate irrelevant predictors while reducing data gathering and storage requirements. The vanilla SIS procedure given in Algorithm 1.1 provides an extremely fast and efficient variable screening based on marginal regressions of each predictor with the response. While
under certain independence assumptions among predictors this may prove a successful strategy in terms of estimating the true sparse model $\mathcal{M}_\star$, there are well-known issues associated with variable screening using only information from marginal regressions, such as missing important predictors from $\mathcal{M}_\star$ which happen to have low marginal correlation with the response. The vanilla ISIS procedure addresses these issues by using more thoroughly the joint covariate information through the conditional marginal regressions (1.10) and (1.11), which aim at measuring the additional contribution of a predictor $x_j$ given the presence of the variables in $\hat{\mathcal{M}}_1$ in the current model, all while maintaining low computational costs. Finally, we would like to point out that the intermediate deletion steps of the vanilla ISIS procedure could be carried out with any other variable selection method, such as the weight vector ranking with support vector machines (Rakotomamonjy, 2003) or the greedy search strategies of forward selection and backward elimination (Guyon and Elisseeff, 2003). In our implementation within the SIS package we favor the penalized likelihood criteria (1.12) and (1.13), but in principle any variable selection technique could be employed to further filter unimportant predictors.
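As a quick illustration of how these procedures are invoked, the sketch below calls the SIS function from our package, following the interface we describe in Section 1.4.4; here iter = FALSE corresponds to vanilla SIS from Algorithm 1.1, while the iterative fit implements vanilla ISIS from Algorithm 1.2:

library(SIS)

set.seed(1)
n <- 200; p <- 2000
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 0.8 * x[, 2] - 0.6 * x[, 3] + rnorm(n)

# Vanilla SIS: one marginal screening step, then SCAD-penalized estimation
fit_sis <- SIS(x, y, family = "gaussian", penalty = "SCAD",
               tune = "bic", iter = FALSE)

# Vanilla ISIS: iterative recruiting with intermediate deletion steps
fit_isis <- SIS(x, y, family = "gaussian", penalty = "SCAD",
                tune = "bic", varISIS = "vanilla")

fit_isis$ix  # indices of the finally selected variables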
1.3 Variants of ISIS
By nature of their marginal approach, sure independence screening procedures have large false selection rates; namely, many unimportant predictors in $\mathcal{M}_\star^c$ are selected after the screening steps. In order to reduce the false selection rate, Fan et al. (2009) suggested the idea of sample splitting. Without loss of generality, we assume the sample size $n$ is even, and we randomly split the sample into two halves. Two variants of the ISIS methodology have been proposed in the literature, both of them relying on the idea of performing variable screening on the data in each partition separately and combining the results in a subsequent penalized likelihood step. We also revisit the approach of Fan et al. (2011), in which a random permutation of the observations is used to obtain a data-driven threshold for independence screening.
1.3.1 First Variant of ISIS
Let $\mathcal{A}_1^{(1)}$ and $\mathcal{A}_1^{(2)}$ be the two sets of indices, each of size $k_1$, obtained after applying regular SIS to the data in each partition. As the first crude estimates of the true sparse model $\mathcal{M}_\star$, both of them should have large false selection rates. Yet, under the technical conditions given in Fan and Song (2010), the estimates should satisfy
$$P(\mathcal{M}_\star \subset \mathcal{A}_1^{(j)}) \to 1 \quad \text{as } n \to \infty$$
for $j = 1, 2$. That is, important features should appear in both sets with probability tending to one. If we define $\mathcal{A}_1 = \mathcal{A}_1^{(1)} \cap \mathcal{A}_1^{(2)}$ as a new estimate of $\mathcal{M}_\star$, this new index set must also satisfy
$$P(\mathcal{M}_\star \subset \mathcal{A}_1) \to 1 \quad \text{as } n \to \infty.$$
However, by construction, the number of unimportant predictors in $\mathcal{A}_1$ should be much smaller, thus reducing the false selection rate. The reason is that, in order for an unimportant predictor to appear in $\mathcal{A}_1$, it has to be randomly included in both sets $\mathcal{A}_1^{(1)}$ and $\mathcal{A}_1^{(2)}$.
In their theoretical support for this variant based on random splitting, Fan et al.
(2009) provided a non-asymptotic upper bound for the probability of the event that
$m$ unimportant covariates are included in the intersection $\mathcal{A}_1$. The probability bound,
obtained under an exchangeability condition ensuring that each unimportant feature
is equally likely to be recruited by SIS, is decreasing in the dimensionality, showing
an apparent “blessing of dimensionality”. This is only part of the full story, since, as
pointed out in Fan et al. (2009), the probability of missing important predictors from
the true sparse model $\mathcal{M}_\star$ is expected to increase with $p$. However, as we show in our
simulation settings of Section 1.4.1, and further confirm in the real data analysis of
Section 1.4.3, the procedure is quite effective at obtaining a minimal set of features
that should be included in a final model.
The remainder of the first variant of ISIS proceeds as follows. After forming the intersection $\mathcal{A}_1 = \mathcal{A}_1^{(1)} \cap \mathcal{A}_1^{(2)}$, we perform penalized likelihood estimation as in Algorithm 1.2 to obtain a refined approximation $\hat{\mathcal{M}}_1$ to the true sparse model. We then perform the second iteration of the vanilla ISIS procedure, computing conditional marginal regressions on each partition separately to obtain sets of indices $\mathcal{A}_2^{(1)}$ and $\mathcal{A}_2^{(2)}$, each of size $k_2$. After taking the intersection $\mathcal{A}_2 = \mathcal{A}_2^{(1)} \cap \mathcal{A}_2^{(2)}$ of these two sets, we carry out sparse penalized likelihood estimation as in (1.12) and (1.13), obtaining a second approximation $\hat{\mathcal{M}}_2$ to the true model $\mathcal{M}_\star$. As before, the iteration continues until we reach an index set $\hat{\mathcal{M}}_l$ of size $d$, or one satisfying $\hat{\mathcal{M}}_l = \hat{\mathcal{M}}_j$ for some $j < l$.
1.3.2 Second Variant of ISIS
The variable selection performed by the first variant of ISIS could potentially lead to a very aggressive screening of predictors, reducing the overall false selection rate, but sometimes yielding undesirably small model sizes. The second variant of ISIS is a more conservative variable selection procedure, where we again apply regular SIS to each partition separately, but we now choose larger sets of indices $\mathcal{A}_1^{(1)}$ and $\mathcal{A}_1^{(2)}$ to ensure that their intersection $\mathcal{A}_1 = \mathcal{A}_1^{(1)} \cap \mathcal{A}_1^{(2)}$ has $k_1$ elements. In this way, the second variant guarantees that there are at least $k_1$ predictors included before the penalized likelihood step, making it considerably less aggressive than the first variant. After applying penalized likelihood to the predictors with indices in $\mathcal{A}_1$, we obtain a first estimate $\hat{\mathcal{M}}_1$ of the true sparse model. The second iteration computes conditional marginal regressions on each partition separately, recruiting enough features in index sets $\mathcal{A}_2^{(1)}$ and $\mathcal{A}_2^{(2)}$ to ensure that $\mathcal{A}_2 = \mathcal{A}_2^{(1)} \cap \mathcal{A}_2^{(2)}$ has $k_2$ elements. Penalized likelihood, as outlined in Section 1.2.2, is now applied to $\hat{\mathcal{M}}_1 \cup \mathcal{A}_2$, yielding a second estimate $\hat{\mathcal{M}}_2$ of the true model $\mathcal{M}_\star$. Stopping criteria remain the same as in the first variant.
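In the SIS package, these two sample-splitting variants are requested through the varISIS argument; the brief sketch below continues with the simulated x and y from the example in Section 1.2.2, with varISIS = "aggr" invoking the aggressive first variant and varISIS = "cons" the conservative second variant (see Section 1.4.4 and Table 1.1 for the full argument list):

# First variant: screen each half-sample, intersect the two recruited sets
fit_aggr <- SIS(x, y, family = "gaussian", penalty = "SCAD",
                tune = "bic", varISIS = "aggr", seed = 9)

# Second variant: enlarge each half-sample set so the intersection keeps
# at least k_l predictors before the penalized likelihood step
fit_cons <- SIS(x, y, family = "gaussian", penalty = "SCAD",
                tune = "bic", varISIS = "cons", seed = 9)

The seed argument fixes the random sample split so that the two calls operate on the same partitions.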
1.3.3 Data-driven Thresholding
Motivated by a false discovery rate criterion in Fan et al. (2011), the following variant of ISIS determines a data-driven threshold for marginal screening. Given GLM data of the form $\{(x_i, y_i) : i = 1, \ldots, n\}$, a random permutation $\pi$ of $\{1, \ldots, n\}$ is used to decouple $x_i$ and $y_i$, so that the resulting data $\{(x_{\pi(i)}, y_i) : i = 1, \ldots, n\}$ follow a null model, i.e., a model in which the features have no prediction power over the response. For the newly permuted data, we recalculate the marginal regression coefficients $(\hat{\beta}_j^M)^\ast$ for each predictor $j$, with $j = 1, \ldots, p$.

The motivation behind this approach is that whenever $j$ is the index of an important predictor in $\mathcal{M}_\star$, the MMLE $|\hat{\beta}_j^M|$ given in (1.4) should be larger than most of the $|(\hat{\beta}_j^M)^\ast|$, as the random permutation is meant to eliminate the prediction power of features. For a given $q \in [0, 1]$, let $\omega_{(q)}$ be the $q$th quantile of $\{|(\hat{\beta}_j^M)^\ast| : j = 1, \ldots, p\}$. The data-driven threshold allows only a $1 - q$ proportion of inactive variables to enter the model when $x$ and $y$ are not related (the null model). Thus, the initial index set $\mathcal{A}_1$ is defined as
$$\mathcal{A}_1 = \{1 \leq j \leq p : |\hat{\beta}_j^M| \geq \omega_{(q)}\},$$
and the modified ISIS iteration then carries out penalized likelihood estimation in the usual way to obtain a finer approximation $\hat{\mathcal{M}}_1$ of the true sparse model $\mathcal{M}_\star$. The complete procedure for this variant is detailed in Algorithm 1.3 below.

Algorithm 1.3 Permutation-based ISIS (Perm-ISIS)

1. Inputs: Screening model size $d$. Penalty $p_\lambda(\cdot)$. Quantile $q$. Maximum iteration number $l_{\max}$.

2. For every $j \in \{1, \ldots, p\}$, compute the MMLE $\hat{\beta}_j^M$. Obtain the randomly permuted data $\{(x_{\pi(i)}, y_i) : i = 1, \ldots, n\}$, and let $\omega_{(q)}$ be the $q$th quantile of $\{|(\hat{\beta}_j^M)^\ast| : j = 1, \ldots, p\}$, where $(\hat{\beta}_j^M)^\ast$ is the second coordinate of the solution to
$$\arg\max_{\beta_0, \beta_j} \sum_{i=1}^n \{ y_i(\beta_0 + x_{\pi(i)j}\beta_j) - b(\beta_0 + x_{\pi(i)j}\beta_j) \}.$$
Select the variables in the index set $\mathcal{A}_1 = \{1 \leq j \leq p : |\hat{\beta}_j^M| \geq \omega_{(q)}\}$.

3. Apply penalized likelihood estimation on the set $\mathcal{A}_1$ to obtain a subset of indices $\hat{\mathcal{M}}_1$.

4. For every $j \in \hat{\mathcal{M}}_1^c$, solve the conditional marginal regression problem (1.10) and obtain $\{|\hat{\beta}_j^M| : j \in \hat{\mathcal{M}}_1^c\}$. By randomly permuting only the variables in $\hat{\mathcal{M}}_1^c$, let $\omega_{(q)}$ be the $q$th quantile of $\{|(\hat{\beta}_j^M)^\ast| : j \in \hat{\mathcal{M}}_1^c\}$, where $(\hat{\beta}_j^M)^\ast$ is the last coordinate of the solution to
$$\arg\max_{\beta_0, \beta_{\hat{\mathcal{M}}_1}, \beta_j} \sum_{i=1}^n \{ y_i(\beta_0 + x_{i,\hat{\mathcal{M}}_1}^\top \beta_{\hat{\mathcal{M}}_1} + x_{\pi(i)j}\beta_j) - b(\beta_0 + x_{i,\hat{\mathcal{M}}_1}^\top \beta_{\hat{\mathcal{M}}_1} + x_{\pi(i)j}\beta_j) \}.$$
Select the variables in the index set $\mathcal{A}_2 = \{j \in \hat{\mathcal{M}}_1^c : |\hat{\beta}_j^M| \geq \omega_{(q)}\}$, and apply penalized likelihood estimation on $\hat{\mathcal{M}}_1 \cup \mathcal{A}_2$ to obtain a new subset $\hat{\mathcal{M}}_2$.

5. Iterate the process in step 4 until we have an index set $\hat{\mathcal{M}}_l$ such that $|\hat{\mathcal{M}}_l| \geq d$, or $\hat{\mathcal{M}}_l = \hat{\mathcal{M}}_j$ for some $j < l$, or $l = l_{\max}$.

6. Outputs: $\hat{\mathcal{M}}_l$ from step 5 along with the parameter estimate from (1.12) or (1.13).
A greedy modification of the above algorithm can be proposed to enhance variable selection performance. Specifically, we restrict the size of the sets $\mathcal{A}_j$ in the iterative screening steps to be at most $p_0$, a small positive integer. Moreover, a completely analogous algorithm can be proposed for survival data, with permutation $\pi$ and data-driven threshold $\omega_{(q)}$ defined accordingly. Details of this procedure are omitted in the current presentation.
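Within the SIS package, this permutation-based thresholding is exposed through the perm, q, greedy, and greedy.size arguments; the sketch below again reuses x and y from the earlier example (see Section 1.4.4 and Table 1.1 for defaults):

# Permutation-based ISIS: q = 1 takes the maximum absolute permuted
# estimate as the data-driven threshold, and the greedy modification
# caps each recruited set at greedy.size = 1 variable (p_0 = 1)
fit_perm <- SIS(x, y, family = "gaussian", penalty = "SCAD", tune = "bic",
                perm = TRUE, q = 1, greedy = TRUE, greedy.size = 1)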
1.3.4 Implementation Details
There are several important details in the implementation of the vanilla versions of
SIS and ISIS, as well as all of the above mentioned variants.
• In order to speed up computations, and exclusively for the first screening step,
all variable selection procedures use correlation learning (i.e., marginal Pearson
correlations) between each predictor and the response, instead of the compo-
nentwise GLM or partial likelihood fits (1.4) and (1.5). We found no major
differences in variable selection performance between this variant and the one
using the MMLEs.
• Although the asymptotic theory of Fan and Song (2010) guarantees the sure screening property (1.7) for a sequence $\delta_n \sim n^{-\theta^\ast}$, with $\theta^\ast > 0$ a fixed constant, in practice Fan et al. (2009) recommend using $d = \lfloor n / \log n \rfloor$ as a sensible choice since $\theta^\ast$ is typically unknown. Furthermore, Fan et al. (2009) also advocate using model-based choices of $d$, favoring smaller values in models where the response provides less information. In our numerical implementation we use $d = \lfloor n / (4 \log n) \rfloor$ for logistic regression and $d = \lfloor n / (2 \log n) \rfloor$ for Poisson regression, which are less informative than the real-valued response in a linear model, for which we select $d = \lfloor n / \log n \rfloor$. For the right-censored response in the survival analysis framework, we fix $d = \lfloor n / (4 \log n) \rfloor$.
• Regardless of the statistical model at hand, we set d = ⌊n/log n⌋ for the first variant of ISIS. Note that since the selected variables for this first variant are in the intersection of two sets of size k_l ≤ d, we typically end up with far fewer than d variables selected by this method. In any case, our procedures within the SIS package allow for a user-specified value of d.
• Variable selection under the traditional p < n setting can also be carried out using our screening procedures, for which we fix d = p as the default for all variants. In this regard, practicing data analysts can view sure independence screening procedures alongside classical criteria for variable selection such as the AIC (Akaike, 1973), BIC (Schwarz, 1978), Cp (Mallows, 1973), or the generalized information criterion GIC (Nishii, 1984) applied directly to the full set of covariates.
• The intermediate penalized likelihood problems (1.12) and (1.13) are solved using the glmnet and ncvreg packages. Our code has an option allowing the regularization parameter λ > 0 to be selected through the AIC, BIC, EBIC, or K-fold cross-validation criteria (see the sketch following this list). The concavity parameter γ in the SCAD and MC+ penalties can also be user-specified.
• In our permutation-based variant with data-driven thresholding, we use q = 1 (i.e., we take ω(q) to be the maximum absolute value of the permuted estimates) and take p_0 to be 1 in the greedy modification. Note that if no variable is recruited, that is, if none exceeds the threshold obtained from the null model, the stopping criterion in step 5 terminates the procedure.
• We can further combine the permutation-based approach of Section 1.3.3 with the sample splitting idea from the first two variants to define a new ISIS procedure. Concretely, we first select two subsets of indices A_1^(1) and A_1^(2), each consisting of variables whose MMLEs, or correlation with the response, exceed the data-driven thresholds ω^(1)(q) and ω^(2)(q) of their respective samples. If the size of their intersection is less than k_1, we define A_1 = A_1^(1) ∩ A_1^(2); otherwise, we reduce the size of A_1^(1) and A_1^(2) to ensure their intersection has at most k_1 elements. This is done to control the size of A_1 when too many variables exceed the thresholds. The rest of the iteration is carried out accordingly.
• Every variant of ISIS is coded to guarantee there will be at least two predictors
at the end of the first screening step.
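To make some of these defaults concrete, the following hedged R sketch illustrates the first-step correlation screening, the model-based choices of d, and BIC-based tuning of λ along a glmnet path; the helper names are ours and are not part of the SIS package.

library(glmnet)

# First screening step: rank predictors by absolute marginal Pearson
# correlation with the response and keep the top d.
screen.corr <- function(x, y, d) {
  order(abs(cor(x, y)), decreasing = TRUE)[1:d]
}

# Model-based defaults for the screening size d described above.
default.nsis <- function(n, family = c("gaussian", "binomial", "poisson", "cox")) {
  switch(match.arg(family),
         gaussian = floor(n / log(n)),        # real-valued response
         binomial = floor(n / (4 * log(n))),  # logistic regression
         poisson  = floor(n / (2 * log(n))),  # Poisson regression
         cox      = floor(n / (4 * log(n))))  # right-censored response
}

# BIC as a fast criterion for choosing lambda along a 100-value glmnet path.
tune.bic <- function(x, y, family = "binomial") {
  fit <- glmnet(x, y, family = family, nlambda = 100)
  bic <- deviance(fit) + log(nrow(x)) * fit$df
  fit$lambda[which.min(bic)]
}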
As the proposed ISIS variants grow more involved, the associated number of tuning parameters is bound to increase. While this may initially make some data practitioners feel uneasy, our intent here is to be as flexible as possible, providing all the tools that the powerful family of sure independence screening procedures has to offer. Additionally, it is important to distinguish the tuning parameters inherent to the screening procedures from the tuning parameters (e.g., those driven by a K-fold cross-validation approach) needed in the intermediate penalized likelihood procedures. In any case, we detail all available options implemented in our SIS package in Table 1.1 below, where we highlight recommended default settings for practicing researchers.
1.4 Model Selection and Timings
In this section we illustrate all independence screening procedures by studying their
performance on simulated data and on four popular gene expression data sets. Most
of the simulation settings are adapted from the work of Fan et al. (2009) and Fan
et al. (2010).
1.4.1 Model Selection and Statistical Accuracy
We first conduct simulation studies comparing the runtime of the vanilla version of
SIS (Van-SIS), its iterated vanilla version (Van-ISIS), the first variant (Var1-ISIS), the
second variant (Var2-ISIS), the permutation-based ISIS (Perm-ISIS), its greedy mod-
Parameter     Description                               SIS package options
pλ(·)         Penalty function for intermediate         penalty = "SCAD" (default) / = "MCP" /
              penalized likelihood estimation           = "lasso"
λ             Method for tuning the regularization      tune = "bic" (default) / = "ebic" /
              parameter of the penalty function pλ(·)   = "aic" / = "cv"
γ             Concavity parameter for the SCAD          concavity.parameter = 3.7 / = 3 are the
              and MC+ penalties                         defaults for SCAD/MC+. Any γ > 2 for SCAD
                                                        or γ > 1 for MC+ can be user-specified
d             Upper bound on the number of              nsis default is model-based, as explained
              predictors to be selected                 in Section 1.3.4. It can also be
                                                        user-specified
ISIS          Flags which ISIS version to perform       varISIS = "vanilla" (default) / = "aggr" /
variant                                                 = "cons"
Permutation   Flags whether to use the permutation      perm = FALSE (default) / = TRUE
variant       variant with data-driven thresholds
q             Quantile used in calculating              q = 1 (default) / can be any user-specified
              data-driven thresholds                    value in [0, 1]
p0            Maximum size of active sets in the        greedy.size = 1 (default) / can be any
              greedy modification                       user-specified integer less than p

Table 1.1: Summary of tuning parameters for variable selection using ISIS procedures within the SIS package, as well as associated defaults. All ISIS variants are implemented through the SIS function, which we describe in Section 1.4.4 using a gene expression data set.
ification (Perm-g-ISIS), the permutation-based variant with sample splitting (Perm-var-ISIS) and its greedy modification (Perm-var-g-ISIS), under both generalized linear models and the Cox proportional hazards model. We also demonstrate the power of ISIS and its variants, in terms of model selection and estimation, by comparing them with traditional LASSO and SCAD penalized estimation. Our SIS code is tested against the competing R packages glmnet (Friedman et al., 2013) and ncvreg (Breheny, 2013) for LASSO and SCAD penalization, respectively. All calculations were carried out on an Intel Xeon L5430 @ 2.66 GHz processor. We refer interested readers to Friedman et al. (2010), Breheny and Huang (2011), and Simon et al. (2011) for a detailed discussion of penalized likelihood estimation algorithms under generalized linear models and the Cox proportional hazards model.
Four different statistical models are considered here: Linear regression, Logistic regression, Poisson regression, and Cox proportional hazards regression. We select the configuration (n, p) = (400, 5000) for all models, and generate covariates x1, . . . , xp as follows:

Case 1: x1, . . . , xp are independent and identically distributed N(0, 1) random variables.

Case 2: x1, . . . , xp are jointly Gaussian, marginally distributed as N(0, 1), and with correlation structure Corr(xi, xj) = 1/2 if i ≠ j.

Case 3: x1, . . . , xp are jointly Gaussian, marginally distributed as N(0, 1), and with correlation structure Corr(xi, x4) = 1/√2 for all i ≠ 4 and Corr(xi, xj) = 1/2 if i and j are distinct elements of {1, . . . , p} \ {4} (see the sketch after this list).

Case 4: x1, . . . , xp are jointly Gaussian, marginally distributed as N(0, 1), and with correlation structure Corr(xi, x5) = 0 for all i ≠ 5, Corr(xi, x4) = 1/√2 for all i ∉ {4, 5}, and Corr(xi, xj) = 1/2 if i and j are distinct elements of {1, . . . , p} \ {4, 5}.
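For concreteness, the Case 3 design can be simulated directly from its covariance matrix, as in the following sketch (assuming MASS::mvrnorm, with p reduced from 5000 purely for illustration):

library(MASS)
n <- 400; p <- 500                       # p reduced from 5000 for illustration
Sigma <- matrix(0.5, p, p)               # equicorrelation at level 1/2
Sigma[4, ] <- Sigma[, 4] <- 1 / sqrt(2)  # Corr(x_i, x_4) = 1/sqrt(2)
diag(Sigma) <- 1
x <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)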
With independent features, Case 1 is the most straightforward for variable selection. In Cases 2–4, however, we have correlation among predictors that does not decay as |i − j| increases. We will see below that for Cases 3 and 4, the true sparse model M⋆ is chosen such that the response is marginally independent of, but jointly dependent on, x4. This type of dependence is ruled out in the asymptotic theory of SIS in Fan and Song (2010), so we should expect variable selection in these settings to be more challenging for the non-iterated version of SIS.
In the Cox proportional hazards scenario, the right-censoring time is generated from the exponential distribution with mean 10. This corresponds to fixing the baseline hazard function h0(t) = 0.1 for all t ≥ 0. The true regression coefficients from the sparse model M⋆ in each of the four settings are as follows:

Case 1: β⋆0 = 0, β⋆1 = −1.5140, β⋆2 = 1.2799, β⋆3 = −1.5307, β⋆4 = 1.5164, β⋆5 = −1.3019, β⋆6 = 1.5833, and β⋆j = 0 for j > 6.

Case 2: The coefficients are the same as in Case 1.

Case 3: β⋆0 = 0, β⋆1 = 0.6, β⋆2 = 0.6, β⋆3 = 0.6, β⋆4 = −0.9√2, and β⋆j = 0 for j > 4.

Case 4: β⋆1 = 4, β⋆2 = 4, β⋆3 = 4, β⋆4 = −6√2, β⋆5 = 4/3, and β⋆j = 0 for j > 5. The corresponding median censoring rate is 33.5%.
For Cases 1 and 2, the coefficients were randomly generated as (4 log n/√n + |Z|/4)U, with Z ∼ N(0, 1) and U = 1 with probability 0.5 and U = −1 with probability 0.5, independently of the value of Z. For Cases 3 and 4, the selected model ensures that even though β⋆4 ≠ 0, the associated predictor x4 and the response y are marginally independent. This is designed to make it challenging for the vanilla sure independence screening procedure to select this important variable. Furthermore, in Case 4, we add another important predictor x5 with a small coefficient to make it even more challenging to identify the true sparse model.
The results are given in Tables 1.2–1.5, in which the median and a robust estimate of the standard deviation (over 100 repetitions) of several performance measures are reported: ℓ1-estimation error, squared ℓ2-estimation error, true positives (TP), false positives (FP), and computational time in seconds (Time). In Cases 1 and 2, under the Linear and Logistic regression setups, for any type of SIS or ISIS, we employ the SCAD penalty (γ = 3.7) at the end of the screening steps; whereas the LASSO is applied for Cases 3 and 4, under the Poisson and Cox proportional hazards regression frameworks. For simplicity, we exclude the performance of MC+ based screening procedures from the current analysis. Whenever necessary, for all variable selection procedures considered here, the BIC criterion is used as a fast way to select the regularization parameter λ > 0, always chosen from a path of 100 candidate λ values.
As the covariates are all independent in Case 1, it is not surprising to see that Van-
SIS performs reasonably well. However, this non-iterative procedure fails in terms of
identifying the true model when correlation is present, particularly in the challenging
Cases 3 and 4. When predictors are dependent, the vanilla ISIS improves significantly
over SIS in terms of true positives. While the number of false positive variables may
be larger in some settings, Van-ISIS provides comparable estimation errors in Cases
1–3 but significant reduction in the complicated Case 4.
In terms of further reducing the false selection rate and estimation errors, while still selecting the true model M⋆, Var1-ISIS performs much better than Var2-ISIS. Being a more conservative variable selection approach, Var2-ISIS tends to have a higher number of false positives. This is particularly true in the Poisson regression scenario, in which the second variant even misses one important predictor.
Among the permutation-based variants, the ones that combine the sample splitting approach (Perm-var-ISIS and Perm-var-g-ISIS) outperform all other ISIS procedures in terms of true positives, low false selection rates, and small estimation errors, with Var1-ISIS following closely. In particular, for Perm-var-ISIS, the number of false positives is approximately zero across all examples. The only drawback seems to be their
Method           ‖β̂ − β⋆‖1       ‖β̂ − β⋆‖2²        TP       FP        Time
Van-SIS          0.24(0.10)      0.01(0.01)        6(0.00)  0(0.00)   0.26(0.02)
Van-ISIS         0.24(0.09)      0.01(0.01)        6(0.00)  0(0.00)   8.34(0.78)
Var1-ISIS        0.29(0.15)      0.02(0.02)        6(0.00)  0(0.74)   11.76(8.65)
Var2-ISIS        0.24(0.10)      0.01(0.01)        6(0.00)  0(0.00)   11.90(1.12)
Perm-ISIS        0.41(0.25)      0.05(0.05)        6(0.00)  1(1.49)   44.57(13.38)
Perm-g-ISIS      0.39(0.27)      0.04(0.05)        6(0.00)  1(1.49)   107.50(22.99)
Perm-var-ISIS    0.24(0.09)      0.01(0.01)        6(0.00)  0(0.00)   41.64(1.01)
Perm-var-g-ISIS  0.24(0.09)      0.01(0.01)        6(0.00)  0(0.00)   82.81(15.80)
SCAD             0.24(0.09)      0.01(0.01)        6(0.00)  0(0.00)   6.89(5.56)

Table 1.2: Linear regression, Case 1, where results are given in the form of medians and robust standard deviations (in parentheses).
Method           ‖β̂ − β⋆‖1       ‖β̂ − β⋆‖2²        TP         FP        Time
Van-SIS          2.79(2.20)      2.02(2.55)        5.5(0.75)  0(0.75)   0.36(0.05)
Van-ISIS         4.06(7.78)      2.77(7.35)        6(0.00)    2(7.46)   48.73(11.72)
Var1-ISIS        1.86(1.25)      0.79(1.07)        6(0.00)    0(0.75)   76.47(20.56)
Var2-ISIS        5.29(7.99)      4.43(7.75)        6(0.00)    5(6.72)   96.25(59.99)
Perm-ISIS        2.94(8.80)      1.87(9.03)        6(0.00)    2(7.46)   128.94(42.74)
Perm-g-ISIS      18.65(6.10)     23.38(14.40)      6(0.00)    10(0.93)  739.84(89.96)
Perm-var-ISIS    1.55(1.38)      0.53(1.43)        6(0.75)    0(0.00)   153.53(43.84)
Perm-var-g-ISIS  1.64(1.37)      0.63(1.18)        6(0.00)    0(0.75)   251.21(63.66)
SCAD             409.22(104.33)  8403.18(4844.63)  6(0.00)    20(2.24)  304.98(66.65)

Table 1.3: Logistic regression, Case 2, where results are given in the form of medians and robust standard deviations (in parentheses).
Method           ‖β̂ − β⋆‖1       ‖β̂ − β⋆‖2²        TP       FP          Time
Van-SIS          3.10(0.55)      2.08(0.26)        3(0.00)  9.5(18.66)  0.14(0.04)
Van-ISIS         5.21(0.79)      2.21(2.20)        4(0.75)  29(0.00)    86.23(64.17)
Var1-ISIS        0.53(0.39)      0.09(0.11)        4(0.00)  1(0.75)     88.05(28.02)
Var2-ISIS        5.05(0.67)      2.15(0.22)        3(0.75)  29(0.75)    207.67(100.74)
Perm-ISIS        6.16(1.40)      6.56(5.54)        3(0.00)  30(0.75)    130.93(221.45)
Perm-g-ISIS      6.76(0.77)      1.70(0.74)        4(0.00)  29(0.00)    2202.81(136.08)
Perm-var-ISIS    0.26(0.21)      0.02(0.04)        4(0.00)  0(0.75)     174.90(42.50)
Perm-var-g-ISIS  0.43(0.36)      0.07(0.08)        4(0.00)  1(1.49)     231.81(89.62)
LASSO            2.97(0.07)      2.14(0.18)        3(0.00)  20(9.70)    1.47(0.36)

Table 1.4: Poisson regression, Case 3, where results are given in the form of medians and robust standard deviations (in parentheses).
Method           ‖β̂ − β⋆‖1       ‖β̂ − β⋆‖2²         TP       FP          Time
Van-SIS          21.27(0.49)     95.64(3.63)        3(0.75)  12(0.75)    0.15(0.04)
Van-ISIS         3.13(1.20)      1.02(1.35)         5(0.00)  11(0.00)    92.29(45.61)
Var1-ISIS        1.33(0.62)      0.38(0.42)         5(0.00)  2(2.24)     207.16(45.69)
Var2-ISIS        2.80(1.12)      0.93(1.10)         5(0.00)  11(0.00)    189.24(84.92)
Perm-ISIS        21.44(0.25)     93.93(6.42)        3(0.00)  13(0.00)    136.47(69.53)
Perm-g-ISIS      9.19(1.95)      8.26(5.53)         5(0.00)  11(0.00)    1102.28(96.11)
Perm-var-ISIS    0.95(0.76)      0.24(0.38)         5(0.00)  0(0.00)     386.87(68.42)
Perm-var-g-ISIS  1.24(0.66)      0.35(0.41)         5(0.00)  1(1.49)     509.85(147.38)
LASSO            163.09(14.17)   1035.23(173.27)    4(0.00)  313(10.63)  35.67(3.87)

Table 1.5: Cox proportional hazards regression, Case 4, where results are given in the form of medians and robust standard deviations (in parentheses).
relatively large computational cost, being at least twice as large as that of Var1-ISIS. This is to be expected considering the amount of extra work these procedures have to perform: two rounds of marginal fits to obtain sample-specific data-driven thresholds ω^(1)(q) and ω^(2)(q), plus two additional rounds of marginal fits to compute the corresponding index sets A_1^(1) and A_1^(2). Computational costs potentially increase further when carrying out the conditional marginal regression steps described in Section 1.2.2; however, the gains in statistical accuracy and model selection offset the increased timings, particularly in the equally correlated Case 2 with nonconvex penalties.
Tables 1.2–1.3 show that SCAD also enjoys the sure screening property for the relatively easy Cases 1 and 2; however, its model sizes and estimation errors are significantly larger than those of any ISIS procedure in the correlated scenario. On the other hand, for the difficult Cases 3 and 4, the LASSO surprisingly rarely includes the important predictor x4, even though it is not a marginal screening based method. While exhibiting competitive performance with some of the ISIS variants in the Poisson regression scenario, the LASSO performs poorly in the Cox model setup, yielding prohibitively large model sizes and estimation errors.
The ncvreg package with SCAD outperforms all ISIS variants in terms of compu-
tational cost for the uncorrelated Case 1. Still, the vanilla SIS procedure identifies the
true model faster than ncvreg. For the correlated Case 2, however, only the greedy
modification Perm-g-ISIS is slower than SCAD. In the Poisson and Cox model setups,
while the computational cost of LASSO with the glmnet package is smaller than any
of the ISIS procedures, the vanilla SIS shows better performance in terms of timings
and estimation errors.
As they become more elaborate, ISIS procedures become more computationally expensive. Yet, vanilla ISIS and most of its variants presented here, particularly Var1-ISIS and Perm-var-ISIS, are clearly competitive variable selection methods in ultrahigh dimensional statistical models, adequately using the joint covariate information while exhibiting a very low false selection rate and a competitive computational cost.
1.4.2 Scaling in n and p with Feature Screening
In addition to comparing our SIS codes with glmnet and ncvreg, we would like to
know how the timings of the vanilla SIS and ISIS procedures scale in n and p. We
simulated data sets from Cases 1–3 above and, for a variety of n and p values, we took
the median running time over 10 repetitions. Again, for each (n, p) pair, whenever
necessary, the BIC criterion was used to select the best λ value among a path of 100
possible candidates. Figure 1.1 shows timings for fixed n as p grows (Cases 1–3), and
for fixed p as n grows (Case 2).
From the plots we see that independence screening procedures run uniformly faster than ncvreg with the SCAD penalty. For Poisson regression, vanilla SIS also outperforms glmnet with the LASSO, particularly in the n = p scenario, where glmnet exhibits unusually slow performance. It is worth pointing out that iterative variable screening procedures typically do not show strictly monotone timings as n or p increase. This is due to the varying number of iterations needed to recruit d predictors, the random splitting of the sample in the first two ISIS variants, and the random permutation in the data-driven thresholding, among other factors.
1.4.3 Real Data Analysis
We now evaluate the performance of all variable screening procedures on four well-
studied gene expression data sets: Leukemia (Golub et al., 1999), Prostate cancer
(Singh et al., 2002), Lung cancer (Gordon et al., 2002) and Neuroblastoma (Oberthuer
et al., 2006). The first three data sets come with predetermined, separate training
and test sets of data vectors. The Leukemia data set contains p = 7129 genes for 27
acute lymphoblastic leukemia and 11 acute myeloid leukemia vectors in the training
set. The test set includes 20 acute lymphoblastic leukemia and 14 acute myeloid
leukemia vectors. The Prostate cancer data set consists of gene expression profiles
[Figure 1.1 appears here; panels (log-log scale): Time vs p for Case 1 (n = 500, 1000; Van-SIS vs SCAD/ncvreg), Time vs p for Case 2 (Van-ISIS vs SCAD/ncvreg), Time vs p for Case 3 (Van-SIS vs LASSO/glmnet), and Time vs n for Case 2 (p = 10,000 and 20,000; Van-ISIS vs SCAD/ncvreg).]

Figure 1.1: Median runtime in seconds taken over 10 trials (log-log scale).
from p = 12600 genes for 52 prostate “tumor” samples and 50 “normal” prostate
samples in the training data set. The test set is from a different experiment and
contains 25 tumor and 9 normal samples. The lung cancer data set contains 32 tissue
samples in the training set (16 from malignant pleural mesothelioma and 16 from
adenocarcinoma) and 149 in the test set (15 from malignant pleural mesothelioma
and 134 from adenocarcinoma). Each sample consists of p = 12533 genes. The
neuroblastoma data set consists of gene expression profiles for p = 10707 genes from
246 patients of the German neuroblastoma trials NB90-NB2004, diagnosed between
1989 and 2004. We analyzed the gene expression data by means of the 3-year event-
free survival, indicating whether a patient survived 3 years after the diagnosis of
neuroblastoma. Combining the original training and test sets, the data consists of 56
positive and 190 negative cases. For purposes of the present analysis, in each of these
gene expression data sets, we initially combine the training and test samples and then
perform a 50% - 50% random splitting of the observed data into new training and
test data for which the number of cases remains balanced across these new samples.
In this manner, for the Leukemia data, the balanced training and test samples are of
size 36, for the Prostate data we have balanced training and test samples of size 68,
whereas the Neuroblastoma data set has balanced training and test samples of size
123. The balanced training and test samples for the Lung cancer data are of sizes 90
and 91, respectively. Interested readers can find more details about these data sets
in Golub et al. (1999), Singh et al. (2002), Gordon et al. (2002) and Oberthuer et al.
(2006).
Following the approach of Dudoit et al. (2002), before variable screening and
classification, we first standardize each sample to zero mean and unit variance. We
compare the performance of all described variable screening procedures with the Near-
est Shrunken Centroids (NSC) method of Tibshirani et al. (2002), the Independence
Rule (IR) in the high-dimensional setting (Bickel and Levina, 2004) and the LASSO
(Tibshirani, 1996), which uses ten-fold cross-validation to select its tuning parameter,
applied to the full set of covariates. Under a working independence assumption in
the feature space, NSC selects an important subset of variables for classification by
thresholding a corresponding two-sample t statistic, whereas IR makes use of the full
set of predictors.
Tables 1.6–1.7 show the median and robust standard deviation of the classifica-
tion error rates and model sizes for all procedures, taken over 100 random splittings
into 50% - 50% balanced training and test data. At each intermediate step of the
(I)SIS procedures, we employ the LASSO with ten-fold cross-validation to further
filter unimportant predictors for classification purposes. To determine a data-driven
                 Leukemia    Leukemia    Leukemia     Prostate    Prostate    Prostate
Method           training    test        model        training    test        model
                 error rate  error rate  size         error rate  error rate  size
Van-SIS          0.00(0.00)  0.06(0.04)  16(2.43)     0.00(0.01)  0.22(0.05)  14(3.92)
Van-ISIS         0.00(0.00)  0.06(0.04)  15(2.99)     0.00(0.01)  0.20(0.06)  14(2.43)
Var1-ISIS        0.00(0.02)  0.08(0.04)  5(1.49)      0.04(0.04)  0.19(0.09)  6(2.24)
Var2-ISIS        0.00(0.00)  0.06(0.02)  15(2.24)     0.00(0.01)  0.18(0.06)  12(2.24)
Perm-ISIS        0.00(0.00)  0.06(0.04)  15(2.99)     0.00(0.01)  0.20(0.05)  13(2.99)
Perm-g-ISIS      0.03(0.02)  0.08(0.04)  3(1.49)      0.07(0.05)  0.23(0.11)  4(1.49)
Perm-Var-ISIS    0.00(0.02)  0.08(0.04)  5(1.49)      0.04(0.04)  0.19(0.09)  5(2.24)
Perm-Var-g-ISIS  0.03(0.04)  0.08(0.04)  2(0.00)      0.08(0.04)  0.22(0.10)  4(0.93)
LASSO-CV(10)     0.00(0.00)  0.06(0.04)  17(2.99)     0.03(0.06)  0.22(0.07)  19(8.21)
NSC              0.03(0.02)  0.06(0.04)  143(456.72)  0.07(0.03)  0.20(0.12)  13(15.30)
IR               0.03(0.02)  0.14(0.10)  7129(0.00)   0.29(0.04)  0.32(0.06)  12600(0.00)

Table 1.6: Classification error rates and number of selected genes by various methods for the balanced Leukemia and Prostate cancer data sets. For the Leukemia data, the training and test samples are of size 36. For the Prostate cancer data, the training and test samples are of size 68. Results are given in the form of medians and robust standard deviations (in parentheses).
threshold for independence screening, we fix q = 0.95 for all permutation-based varia-
ble selection procedures. Lastly, for each data set considered, we apply all screening
procedures to reduce dimensionality from the corresponding p to d = 100.
From the results in Tables 1.6–1.7, we observe that all ISIS variants perform sim-
ilarly in terms of test error rates, whereas the main differences lie in the estimated
model sizes. Compared with the LASSO applied to the full set of covariates, a major-
ity of ISIS procedures select smaller models while retaining competitive classification
                 Lung        Lung        Lung         NB          NB          NB
Method           training    test        model        training    test        model
                 error rate  error rate  size         error rate  error rate  size
Van-SIS          0.00(0.00)  0.02(0.01)  14(2.43)     0.09(0.02)  0.19(0.02)  14(2.99)
Van-ISIS         0.00(0.00)  0.02(0.01)  13(1.49)     0.00(0.00)  0.22(0.03)  38(5.22)
Var1-ISIS        0.00(0.00)  0.02(0.01)  9(1.68)      0.14(0.04)  0.21(0.03)  3(2.24)
Var2-ISIS        0.00(0.00)  0.02(0.01)  13(2.24)     0.00(0.00)  0.22(0.03)  33(5.41)
Perm-ISIS        0.00(0.00)  0.02(0.01)  13(1.49)     0.00(0.00)  0.22(0.02)  38(5.97)
Perm-g-ISIS      0.00(0.01)  0.02(0.02)  2(0.75)      0.03(0.02)  0.26(0.04)  10(3.73)
Perm-Var-ISIS    0.00(0.00)  0.02(0.01)  9(2.24)      0.14(0.04)  0.21(0.03)  4(2.24)
Perm-Var-g-ISIS  0.01(0.01)  0.02(0.01)  2(0.75)      0.14(0.03)  0.21(0.04)  3(1.49)
LASSO-CV(10)     0.01(0.01)  0.02(0.02)  15(2.24)     0.14(0.04)  0.26(0.04)  23(9.70)
NSC              0.00(0.00)  0.00(0.02)  6(22.76)     0.18(0.02)  0.20(0.03)  361(5150.37)
IR               0.13(0.05)  0.14(0.06)  12533(0.00)  0.15(0.02)  0.20(0.02)  10707(0.00)

Table 1.7: Classification error rates and number of selected genes by various methods for the balanced Lung and Neuroblastoma (NB) cancer data sets. For the Lung data, the training and test samples are of sizes 90 and 91, respectively. For the Neuroblastoma cancer data, the training and test samples are of size 123. Results are given in the form of medians and robust standard deviations (in parentheses).
error rates. This is in agreement with our simulation results, which highlight the benefits of variable screening over a direct high-dimensional regularized logistic regression approach. In particular, we observe that the variants Var1-ISIS and Perm-var-g-ISIS provide the most parsimonious models across all four data sets, yielding optimal test error rates while using only 2 features in the case of the Lung cancer data set. Nonetheless, due to its robust performance in both the simulated data and these four gene expression data sets, and its reduced computational cost compared with all available ISIS variants, we select the vanilla ISIS of Algorithm 1.2 as the default variable selection procedure within our SIS package.
While the NSC method achieves competitive test error rates, it typically makes
use of larger sets of genes which vary considerably across the different 50% - 50%
training and test data splittings. The Independence Rule exhibits poor test error
performance, except for the Neuroblastoma data set, where it even outperforms some
of the ISIS procedures. However, this approach uses all features without performing
variable selection, thus yielding models of little practical use for researchers.
1.4.4 Code Example
All described independence screening procedures are straightforward to run using the
SIS package. We demonstrate the SIS function on the Leukemia data set from the
previous section. We first load the predictors and response vector from the training
and test data sets, and then carry out the balanced sample splitting as outlined above.
R> set.seed(9)
R> data("leukemia.train", package = "SIS")
R> data("leukemia.test", package = "SIS")
R> y1 = leukemia.train[, dim(leukemia.train)[2]]
R> x1 = as.matrix(leukemia.train[, -dim(leukemia.train)[2]])
R> y2 = leukemia.test[, dim(leukemia.test)[2]]
R> x2 = as.matrix(leukemia.test[, -dim(leukemia.test)[2]])
R> ### Combine Datasets ###
R> x = rbind(x1, x2)
R> y = c(y1, y2)
R> ### Split Datasets ###
R> n = dim(x)[1]; aux = 1:n
R> ind.train1 = sample(aux[y == 0], 23, replace = FALSE)
R> ind.train2 = sample(aux[y == 1], 13, replace = FALSE)
R> ind.train = c(ind.train1, ind.train2)
R> x.train = scale(x[ind.train,])
R> y.train = y[ind.train]
R> ind.test1 = setdiff(aux[y == 0], ind.train1)
R> ind.test2 = setdiff(aux[y == 1], ind.train2)
R> ind.test = c(ind.test1, ind.test2)
R> x.test = scale(x[ind.test,])
R> y.test = y[ind.test]
We now perform variable selection using the Var1-ISIS and Perm-var-ISIS proce-
dures paired with the LASSO penalty and the ten-fold cross-validation method for
choosing the regularization parameter.
R> model1 = SIS(x.train, y.train, family = "binomial",
+ penalty = "lasso", tune = "cv", nfolds = 10, nsis = 100,
+ varISIS = "aggr", seed = 9, standardize = FALSE)
R> model2 = SIS(x.train, y.train, family = "binomial",
+ penalty = "lasso", tune = "cv", nfolds = 10, nsis = 100,
+ varISIS = "aggr", perm = TRUE, q = 0.95, seed = 9,
+ standardize = FALSE)
R> model1$ix
[1] 1834 4377 4847 6281 6493
R> model2$ix
[1] 1834 4377 4847 6281 6493
Here we modified the default value d = ⌊n/(4 log n)⌋ to make both iterative procedures select models with at most 100 predictors. The value of q ∈ [0, 1], from which we obtain the data-driven threshold ω(q) for Perm-var-ISIS, was also changed from its default q = 1.
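To assess out-of-sample performance from the selected variables, one can refit a logistic model on the screened features; the refit-and-threshold approach below is a hedged sketch of ours, not the package's built-in prediction utility.

R> ### Refit on the selected features and compute the test error rate ###
R> df.train = data.frame(x.train[, model1$ix])
R> df.test = data.frame(x.test[, model1$ix])
R> fit = glm(y.train ~ ., data = df.train, family = binomial)
R> pred = as.numeric(predict(fit, newdata = df.test, type = "response") > 0.5)
R> mean(pred != y.test)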
1.5 Discussion
Sure independence screening is a powerful family of methods for performing variable selection in statistical models when the dimension is much larger than the sample size, as well as in the classical setting where p < n. The focus of the chapter is on
iterative sure independence screening, which iteratively applies a large scale screening
by means of conditional marginal regressions, filtering out unimportant predictors,
and a moderate scale variable selection through penalized pseudo-likelihood methods,
which further selects the unfiltered predictors. With the goal of providing further
flexibility to the iterative screening paradigm, special attention is also paid to powerful
variants which reduce the number of false positives by means of sample splitting and
data-driven thresholding approaches. Compared with the versions of LASSO and
SCAD we used, the iterative procedures presented in this chapter are much more
accurate in selecting important variables and achieving small estimation errors. In
addition, computational time is also reduced, particularly in the case of nonconvex
penalties, thus resulting in a robust family of procedures for model selection and
estimation in ultrahigh dimensional statistical models. Extensions of the current
package to more general loss-based models and nonparametric independence screening
procedures, as well as implementation of conditional marginal regressions through
support vector machine methods are lines of future work.
Chapter 2

How Many Communities Are There?
Stochastic blockmodels and variants thereof are among the most widely used approaches to community detection for social networks and relational data. A stochastic blockmodel partitions the nodes of a network into disjoint sets, called communities. The approach is inherently related to clustering with mixture models and raises a similar model selection problem for the number of communities. The Bayesian information criterion (BIC) is a popular solution; however, for stochastic blockmodels, the conditional independence assumption among different edges given the communities of their endpoints is usually violated in practice. In this regard, we propose composite likelihood BIC (CL-BIC) to select the number of communities, and we show it is robust against possible misspecifications in the underlying stochastic blockmodel assumptions. We derive the requisite methodology and illustrate the approach using both simulated and real data.
2.1 Introduction
Enormous network datasets are being generated and analyzed, reflecting an increasing interest from researchers in studying the underlying structures of a complex networked world. The potential benefits span traditional scientific fields such as epidemiology and physics, as well as emerging industries, especially large-scale internet companies. Among the variety of interesting problems arising with network data, in this chapter we focus on community detection in undirected networks G := (V, E), where V and E are the sets of nodes and edges, respectively. In this framework, the community detection problem can be formulated as finding the true disjoint partition V = V1 ⊔ · · · ⊔ VK, where K is the number of communities. Although it is difficult to give a rigorous definition, communities are often regarded as tightly knit groups of nodes that are only loosely connected to one another.
The community detection problem has close connections with graph partitioning, which can be traced back to Euler, while having its own characteristics owing to the concrete physical meaning of the underlying dataset (Newman and Girvan, 2004). Over the last decade, there has been a considerable amount of work on the problem, including minimizing the ratio cut (Wei and Cheng, 1989), minimizing the normalized cut (Shi and Malik, 2000), maximizing modularity (Newman and Girvan, 2004), hierarchical clustering (Newman, 2004), and edge-removal methods (Newman and Girvan, 2004), to name a few. Among all the progress made by peer researchers, spectral clustering (Donath and Hoffman, 1973) based on stochastic blockmodels (Holland et al., 1983) has attracted the majority of attention. We refer interested readers to Spielmat and Teng (1996) and Goldenberg et al. (2010) for comprehensive reviews on the history of spectral clustering and stochastic blockmodels, respectively.
Compared to the amount of work on spectral clustering or stochastic blockmodels,
to the best of our knowledge, there is little work on the selection of the community
number K. In most of the previously mentioned community detection methods, the
number of communities is generally input as a pre-specified quantity. For the lit-
erature addressing the problem of selecting K, besides the block-wise edge splitting
method of Chen and Lei (2014), a common practice is to use BIC-type criteria (Airoldi
et al., 2008; Daudin et al., 2008) or a variational Bayes approach (Latouche et al.,
2012; Hunter et al., 2012). An inherently related problem is that of selecting the
number of components in mixture models, where the birth-and-death point process
of Stephens (2000) and the allocation sampler of Nobile and Fearnside (2007) provide
two fully Bayesian approaches in the case where K is finite but unknown. Based
on the allocation sampler, McDaid et al. (2013) propose an efficient Bayesian clus-
tering algorithm which directly estimates the number of communities in stochastic
blockmodels, and which exhibits similar results to the variational Bayes approach of
Latouche et al. (2012). Nonparametric Bayesian methods based on Dirichlet process
mixtures (Ferguson, 1973) have also been used to estimate the number of components
in this finite but unknown K setting (Fearnhead, 2004), although the inconsistency of
this approach has recently been shown by Miller and Harrison (2014). The community or mixture component number K, as a vital part of model selection procedures, depends heavily on the model assumptions. For instance, the well-known stochastic blockmodel imposes restrictive assumptions in the form of independent Bernoulli observations once the community assignments are known.
In this chapter, we study the community number selection problem with robustness considerations against model misspecification in the stochastic blockmodel and its variants. Our motivation is that the conditional independence assumption among edges, when the communities of their endpoints are given, is usually violated in real applications. In addition, we do not restrict our interest only to exchangeable graphs. Using techniques from the composite likelihood paradigm (Lindsay, 1988), we develop a composite likelihood BIC (CL-BIC) approach (Gao and Song, 2010) for selecting the community number in situations where the assumed independencies in the stochastic blockmodel and other exchangeable graph models do not hold. The procedure is tested on simulated and real data, and is shown to outperform two competitors –
traditional BIC and the variational Bayes criterion of Latouche et al. (2012), in terms
of model selection consistency.
The rest of the chapter is organized as follows. The background for stochastic
blockmodels and spectral clustering is introduced in Section 2.2, and the proposed
CL-BIC methodology is developed in Section 2.3. In Section 2.4, several simulation
examples as well as two real data sets are analyzed. The chapter is concluded with a
short discussion in Section 2.5.
2.2 Background
First, we would like to introduce some notation. For an N-node undirected, simple, and connected network G, its symmetric adjacency matrix A is defined by A_ij := 1 if (i, j) is an element of E, and A_ij := 0 otherwise. The diagonal {A_ii}_{i=1}^N is fixed to zero (i.e., self-edges are not allowed). Moreover, D and L denote the degree matrix and the Laplacian matrix, respectively. Here, D_ii := d_i and D_ij := 0 for i ≠ j, where d_i is the degree of node i, i.e., the number of edges with endpoint node i; and L := D^{−1/2} A D^{−1/2}. As isolated nodes are discarded, D^{−1/2} is well-defined.
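In code, the normalized Laplacian is immediate to form; a minimal R sketch (the helper name is ours):

# Normalized Laplacian L = D^{-1/2} A D^{-1/2}; assumes A is a symmetric
# 0/1 adjacency matrix with no isolated nodes, so all degrees are positive.
laplacian <- function(A) {
  d.inv.sqrt <- 1 / sqrt(rowSums(A))
  A * outer(d.inv.sqrt, d.inv.sqrt)
}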
2.2.1 Stochastic Blockmodels
2.2.1.1 Standard Stochastic Blockmodel
Stochastic blockmodels were first introduced in Holland et al. (1983). They posit independent Bernoulli random variables {A_ij}_{1≤i<j≤N} with success probabilities P_ij which depend on the communities of their endpoints i and j. Consequently, all edges are conditionally independent given the corresponding communities. Moreover, each node is associated with one and only one community, with label Z_i, where Z_i ∈ {1, . . . , K}. Following Rohe et al. (2011) and Choi et al. (2012), throughout this chapter we assume each Z_i is fixed and unknown, thus yielding P(A_ij = 1; Z_i = z_i, Z_j = z_j) = θ_{z_i z_j}. Treating the node assignments Z_1, . . . , Z_N as latent random variables is another popular approach in the community detection literature, and various methods, including the variational Bayes criterion of Latouche et al. (2012) and the belief propagation algorithm of Decelle et al. (2011), efficiently approximate the corresponding observed-data log-likelihood of the stochastic blockmodel without having to add K^N multinomial terms accounting for all possible label assignments.

For θ := (θ_ab; 1 ≤ a ≤ b ≤ K)′ and any fixed community assignment z ∈ {1, . . . , K}^N, the log-likelihood under the standard Stochastic Blockmodel (SBM) is given as

$$\ell(\theta; A) := \sum_{i<j}\left[A_{ij}\log\theta_{z_i z_j} + (1 - A_{ij})\log(1 - \theta_{z_i z_j})\right]. \quad (2.1)$$
For the remainder of the chapter, denote by N_a the size of community a, and by n_ab the maximum number of possible edges between communities a and b, i.e., n_ab := N_a N_b for a ≠ b, and n_aa := N_a(N_a − 1)/2. Also, let m_ab := Σ_{i<j} A_ij 1{z_i = a, z_j = b}, and let θ̂_ab := m_ab/n_ab be the MLE of θ_ab in (2.1).

Under this framework, Choi et al. (2012) showed that the fraction of misclustered nodes converges in probability to zero under maximum likelihood fitting when K is allowed to grow no faster than √N. By means of a regularized maximum likelihood estimation approach, Rohe et al. (2014) further proved that this weak convergence can be achieved for K = O(N/log⁵ N).
2.2.1.2 Degree-Corrected Stochastic Blockmodel
Heteroscedasticity of node degrees within communities is often observed in real-
world networks. To tackle this problem, Karrer and Newman (2011) proposed the
Degree-Corrected Blockmodel (DCBM), in which the success probabilities P_ij are also functions of individual effects. To be more precise, the DCBM assumes that P(A_ij = 1; Z_i = z_i, Z_j = z_j) = ω_i ω_j θ_{z_i z_j}, where ω := (ω_1, . . . , ω_N)′ are individual effect parameters satisfying the identifiability constraint Σ_i ω_i 1{z_i = a} = 1 for each community 1 ≤ a ≤ K.
To simplify technical derivations, Karrer and Newman (2011) allowed networks to contain both multi-edges and self-edges. Thus, they assumed the random variables {A_ij}_{1≤i≤j≤N} to be independent Poisson, with the previously defined success probabilities P_ij of an edge between vertices i and j replaced by the expected number of such edges. Under this framework, and for any fixed community assignment z ∈ {1, . . . , K}^N, Karrer and Newman (2011) arrived at the log-likelihood ℓ(θ, ω; A) of observing the adjacency matrix A = (A_ij) under the DCBM,

$$\ell(\theta, \omega; A) := 2\sum_{i} d_i \log \omega_i + \sum_{a,b}\left(m_{ab}\log\theta_{ab} - \theta_{ab}\right). \quad (2.2)$$

After allowing for the identifiability constraint on ω, the MLEs of the parameters θ_ab and ω_i are given by θ̂_ab := m_ab and ω̂_i := d_i / Σ_{j: z_j = z_i} d_j, respectively.
As mentioned in Zhao et al. (2012), there is no practical difference in performance
between the log-likelihood (2.2) and its slightly more elaborate version based on the
true Bernoulli observations. The reason is that the Bernoulli distribution with a small
mean is well approximated by the Poisson distribution, and the sparser the network
is, the better the approximation works (Perry and Wolfe, 2012).
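As a quick illustration, the DCBM MLEs above can be computed directly from the adjacency matrix and a fixed assignment z; a minimal sketch (the helper name is ours):

# DCBM MLEs under (2.2): theta_ab = m_ab, omega_i = d_i / (sum of degrees
# in node i's community), matching the identifiability constraint.
dcbm.mle <- function(A, z, k) {
  d <- rowSums(A)
  omega <- d / ave(d, z, FUN = sum)
  m <- matrix(0, k, k)
  for (a in 1:k) for (b in 1:k)
    m[a, b] <- sum(A[z == a, z == b, drop = FALSE])
  diag(m) <- diag(m) / 2   # within-community edges are counted twice above
  list(theta = m, omega = omega)
}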
2.2.1.3 Mixed Membership Stochastic Blockmodel
As a methodological extension in which nodes are allowed to belong to more than one community, Airoldi et al. (2008) proposed the Mixed Membership Stochastic Blockmodel (MMB) for directed relational data {A_ij}_{1≤i,j≤N}. For instance, when a social actor interacts with its different neighbors, an array of different social contexts may be taking place and thus the actor may be taking on different latent roles.
The model assumes the observed network is generated according to node-specific distributions of community membership and edge-specific indicator vectors denoting membership in one of the K communities. More specifically, each vertex i is associated with a randomly drawn vector π⃗_i, with π_ia denoting the probability of node i belonging to community a. Additionally, let the indicator vector z⃗_{i→j} denote the community membership of node i when he sends a message to node j, and z⃗_{i←j} denote the community membership of node j when he receives a message from node i. If, in order to account for the asymmetric interactions, we denote by θ := (θ_ab) the K × K matrix where θ_ab represents the probability of having an edge from a social actor in community a to a social actor in community b, the MMB posits that the {A_ij}_{1≤i,j≤N} are drawn from the following generative process:
• For each node i ∈ V:

– Draw a K-dimensional mixed membership vector π⃗_i ∼ Dirichlet(α), with the vector α = (α_1, . . . , α_K)′ being a hyper-parameter.

• For each possible edge variable A_ij:

– Draw the membership indicator vector for the initiator, z⃗_{i→j} ∼ Multinomial(π⃗_i).

– Draw the membership indicator vector for the receiver, z⃗_{i←j} ∼ Multinomial(π⃗_j).

– Sample the interaction A_ij ∼ Bernoulli(z⃗′_{i→j} θ z⃗_{i←j}).
Upon defining the set of mixed membership vectors Π := {π⃗_i : i ∈ V} and the sets of membership indicator vectors Z→ := {z⃗_{i→j} : i, j ∈ V} and Z← := {z⃗_{i←j} : i, j ∈ V}, following Airoldi et al. (2008), we obtain the complete data log-likelihood of the hyper-parameters {θ, α} as

$$\ell(\theta,\alpha;A,\Pi,\mathcal{Z}_{\rightarrow},\mathcal{Z}_{\leftarrow}) := \sum_{i,j}\left[A_{ij}\log(\vec{z}^{\,\prime}_{i\rightarrow j}\,\theta\,\vec{z}_{i\leftarrow j}) + (1-A_{ij})\log(1-\vec{z}^{\,\prime}_{i\rightarrow j}\,\theta\,\vec{z}_{i\leftarrow j})\right]$$
$$\qquad + N\left(\log\Gamma\Big(\sum_a \alpha_a\Big) - \sum_a \log\Gamma(\alpha_a)\right) + \sum_i\sum_a (\alpha_a - 1)\log\pi_{ia} + \mathrm{const}, \quad (2.3)$$

where A corresponds to the observed data and Π, Z→, Z← are the latent variables.
In order to carry out posterior inference of the latent variables given the observa-
tions A, Airoldi et al. (2008) proposed an efficient coordinate ascent algorithm based
on a variational approximation to the true posterior. Therefore, one can compute
expected posterior mixed membership vectors and posterior membership indicator
vectors. We refer interested readers to Section 3 in Airoldi et al. (2008) for further
details.
Consequently, following the same profile likelihood approach, for any fixed set {Π, Z→, Z←}, the MLE of θ_ab is given by

$$\hat{\theta}_{ab} := \sum_{i,j} A_{ij}\,\vec{z}_{i\rightarrow j,a}\,\vec{z}_{i\leftarrow j,b} \Big/ \sum_{i,j} \vec{z}_{i\rightarrow j,a}\,\vec{z}_{i\leftarrow j,b}. \quad (2.4)$$
As the MLE of α_a does not admit a closed form, Minka (2000) proposed an efficient Newton–Raphson procedure for obtaining parameter estimates in Dirichlet models, where the gradient and Hessian matrix of the complete data log-likelihood (2.3) with respect to α are

$$\frac{\partial \ell(\theta,\alpha;A)}{\partial \alpha_a} = N\left(\Psi\Big(\sum_a \alpha_a\Big) - \Psi(\alpha_a)\right) + \sum_i \log\pi_{ia},$$
$$\frac{\partial^2 \ell(\theta,\alpha;A)}{\partial \alpha_a\,\partial \alpha_b} = N\left(\Psi'\Big(\sum_a \alpha_a\Big) - \Psi'(\alpha_a)\,1\{a=b\}\right), \quad (2.5)$$

and Ψ is the digamma function (i.e., the logarithmic derivative of the gamma function).
2.2.2 Spectral Clustering and SCORE
Although there is a parametric framework for the standard stochastic blockmodel, given the computational burden, it is intractable to directly estimate both parameters θ and z by exact maximization of the log-likelihood (2.1). Researchers have instead resorted to spectral clustering as a computationally feasible algorithm. For comprehensive reviews, we refer interested readers to von Luxburg (2007) and Rohe et al. (2011), where the authors proved the consistency of spectral clustering in the standard stochastic blockmodel under proper conditions imposed on the density of the network and the eigen-structure of the Laplacian matrix. The algorithm finds the eigenvectors u_1, . . . , u_K associated with the K eigenvalues of L that are largest in magnitude, forming an N × K matrix U := (u_1, . . . , u_K), and then applies the K-means algorithm to the rows of U. A minimal sketch is given below.
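The following R sketch illustrates this procedure, reusing the laplacian() helper defined in Section 2.2 (names are ours; K is taken as known):

# Spectral clustering: K-means on the k eigenvectors of L whose
# eigenvalues are largest in magnitude.
spectral.cluster <- function(A, k) {
  eig <- eigen(laplacian(A), symmetric = TRUE)
  idx <- order(abs(eig$values), decreasing = TRUE)[1:k]
  kmeans(eig$vectors[, idx], centers = k, nstart = 20)$cluster
}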
Similarly, Jin (2015) proposed a variant of spectral clustering for the DCBM, called Spectral Clustering On Ratios-of-Eigenvectors (SCORE). Instead of using the Laplacian matrix L, SCORE collects the eigenvectors v_1, . . . , v_K associated with the K eigenvalues of A that are largest in magnitude, and then forms the N × K matrix V := (1, v_2/v_1, . . . , v_K/v_1), where the division operator is taken entry-wise, i.e., for vectors a, b ∈ R^n with Π_{ℓ=1}^n b_ℓ ≠ 0, a/b := (a_1/b_1, . . . , a_n/b_n)′. SCORE then applies the K-means algorithm to the rows of V. The corresponding consistency results for the DCBM are also provided in Jin (2015).
2.3 Model Selection for the Number of Communities
2.3.1 Motivation
In much of the previous work, e.g., Airoldi et al. (2008), Daudin et al. (2008), and Handcock et al. (2007), researchers have used a BIC-penalized version of the log-likelihood (2.1) to choose the community number K. However, we are mindful of possible misspecifications in the underlying stochastic blockmodel assumptions and of the loss of precision from the computational relaxation brought in by spectral clustering.

Firstly, in network data, edges are not necessarily independent when only the communities of their endpoints are given. For instance, if two different edges A_ij and A_il share the endpoint i, it is highly likely that they are dependent even given the community labels of their endpoints. This misspecification problem exists in both the standard stochastic blockmodel and its variants, such as the DCBM (Karrer and Newman, 2011) and the MMB (Airoldi et al., 2008). Secondly, as previously mentioned,
spectral clustering is a feasible relaxation, but some loss of precision is inevitable. Several examples of this can be found in Guattery and Miller (1998). Hence, we introduce CL-BIC out of concern for robustness against misspecifications in the underlying stochastic blockmodel.
We would like to emphasize that CL-BIC is not a new community detection
method. Instead, under the SBM, DCBM, or MMB assumptions, it can be combined
with existing community detection methods to choose the true community number.
2.3.2 Composite Likelihood Inference
The CL-BIC approach extends the concepts and theory of conventional BIC for likelihoods to the composite likelihood paradigm (Lindsay, 1988; Varin et al., 2011). Composite likelihood aims at relaxing the computational complexity of statistical inference based on exact likelihoods. For instance, when the dependence structure for relational data is too complicated to implement, a working independence assumption can effectively recover some properties of the usual maximum likelihood estimators (Cox and Reid, 2004; Varin et al., 2011). However, under this misspecification framework, the asymptotic variance of the resulting estimators is usually underestimated when taken to be the inverse Fisher information. Composite marginal likelihoods (also known as independence likelihoods) have the same form as conventional likelihoods, being a product of marginal densities (Varin, 2008), while statistical inference based on them can capture this loss of variance. Consequently, in pursuing the "true" model, CL-BIC penalizes the number of parameters more heavily than BIC does for dependent relational data.
Before going into details, we would like to give the rationale for using stochastic blockmodels under a misspecification framework. In order to estimate the true joint density g of {A_ij}_{1≤i<j≤N}, we consider the stochastic blockmodel family P = {p_θ : θ ∈ Θ}, where Θ = [0, 1]^{K(K+1)/2} for the standard stochastic blockmodel, and Θ = [0, 1]^{K(K+1)/2+N} for the DCBM. The true joint density g may or may not belong to P, which is a parametric family imposing independence among the {A_ij}_{i<j} when only the communities of the endpoints are given.

Due to the difficulty of specifying the full, highly structured $\binom{N}{2}$-dimensional density g, while having access to the univariate densities p_ij(·; θ) of A_ij under the blockmodel family P, the composite marginal likelihood paradigm compounds the first-order log-likelihood contributions to form the composite log-likelihood

$$\mathrm{cl}(\theta; A) := \sum_{i<j} \log p_{ij}(A_{ij}; \theta), \quad (2.6)$$
where cl(·; A) corresponds to (2.1) under the standard stochastic blockmodel, and to (2.2) in the DCBM framework. Since each component of cl(θ; A) in (2.6) is a valid log-likelihood object, the composite score estimating equation ∇_θ cl(θ; A) = 0 is unbiased under the usual regularity conditions. The associated Composite Likelihood Estimator (CLE) θ̂_C, defined as the solution to ∇_θ cl(θ; A) = 0, suggests a natural estimator of the form p̂ = p_{θ̂_C} minimizing the expected composite Kullback–Leibler divergence (Varin and Vidoni, 2005) between the assumed blockmodel p_θ and the true, but unknown, joint density g,

$$\Delta_C(g, p; \theta) := \sum_{i<j} \mathbb{E}_g\left\{\log g_{ij}(A_{ij}) - \log p_{ij}(A_{ij};\theta)\right\},$$

where g_ij denotes the marginal density of g for the corresponding marginal event.
In terms of the asymptotic properties of the CLE, following the discussion in Cox and Reid (2004), it is important to distinguish whether the available data consist of many independent replicates from a common distribution function or form a few individually large sequences. While, in the first scenario, consistency and asymptotic normality of the corresponding θ̂_C hold under some regularity conditions from the classical theory of estimating equations (Varin et al., 2011), some difficulties arise in the second one, which includes our observations {A_ij}_{i<j}. Indeed, as argued in Cox and Reid (2004), if there is too much internal correlation present among the individual components of the composite score ∇_θ cl(θ; A), the estimator θ̂_C will not be consistent. The CLE will retain good properties as long as the data are not too highly correlated, which is the case for spatial data with exponential correlation decay. Under this setting, Heagerty and Lele (1998) proved consistency and asymptotic normality of θ̂_C in a scenario where the data are not sampled independently from a study population. Under more general settings, consistency results are expected upon using limit theorems and parametric estimation for random fields (e.g., Guyon, 1995); however, applying the corresponding results requires a properly defined distance on networks and α-mixing conditions based on such a distance.
2.3.3 Composite Likelihood BIC
Taking into account the measure of model complexity in the context of composite marginal likelihoods (Varin and Vidoni, 2005), we define the following criterion for selecting the community number K:

$$\text{CL-BIC}_k := -2\,\mathrm{cl}(\hat{\theta}_C; A) + d^*_k \log\left(N(N-1)/2\right), \quad (2.7)$$

where k is the number of communities under consideration in the current model, used as the model index, d^*_k := trace(H_k^{-1} V_k), H_k := E_θ(−∇²_θ cl(θ; A)), and V_k := Var_θ(∇_θ cl(θ; A)). The resulting estimator for the community number is then

$$\hat{K}_{\text{CL-BIC}} := \arg\min_k\ \text{CL-BIC}_k.$$
Note that the CLE is a function of k, since a different model index yields a different estimator θ̂_C := θ̂_C(k). Assuming independent and identically distributed data replicates, which lead to consistent and asymptotically normally distributed estimators θ̂_C, Gao and Song (2010) established the model selection consistency of a similar composite likelihood BIC approach for high-dimensional parametric models. While allowing the number of potential model parameters to increase to infinity, their consistency result only holds when the true model sparsity is bounded by a universal constant.
Even though, under a misspecification framework for the blockmodel family P, the observed data {A_ij}_{i<j} do not form independent replicates from a common population, we anticipate the CL-BIC criterion (2.7) to be consistent in selecting the true community number K, at least when the correlation among the {A_ij}_{i<j} is not severe and the estimators θ̂_C are consistent and asymptotically normal, as in Heagerty and Lele (1998). Since all the moment conditions in the consistency results from Gao and Song (2010) hold automatically given the specific forms of the blockmodel composite log-likelihoods (2.1)–(2.3), under a properly defined mixing condition on {A_ij}_{i<j} (Guyon, 1995), and for a bounded community number K ≤ k_0, we conjecture that P(K̂_CL-BIC = K) → 1 as the number of nodes N in the network grows to infinity. We leave this theoretical study as future work.
2.3.4 Formulae
2.3.4.1 Standard Stochastic Blockmodel
Following our discussion in the previous section, we treat (2.1) as the composite marginal likelihood, under the working independence assumption that, given the community labels of their endpoints, the Bernoulli random variables {A_ij}_{i<j} are independent. The first-order partial derivative of ℓ(θ; A) with respect to θ is denoted u(θ) = (u(θ_ab); 1 ≤ a ≤ b ≤ k)′, where

$$u(\theta_{ab}) = \sum_{i<j}\left[\frac{A_{ij}}{\theta_{z_i z_j}} - \frac{1-A_{ij}}{1-\theta_{z_i z_j}}\right] I^{a,b}_{i,j},$$

and

$$I^{a,b}_{i,j} = \min\left(1\{z_i = a, z_j = b\} + 1\{z_i = b, z_j = a\},\ 1\right).$$

Furthermore, the second-order partial derivatives of ℓ(θ; A) have the following components,

$$\frac{\partial^2 \ell(\theta;A)}{\partial\theta_{a_1 b_1}\,\partial\theta_{a_2 b_2}} = 0, \quad \text{if } (a_1,b_1) \neq (a_2,b_2),$$

and

$$\frac{\partial^2 \ell(\theta;A)}{\partial\theta^2_{ab}} = -\sum_{i<j}\left[\frac{A_{ij}}{\theta^2_{z_i z_j}} + \frac{1-A_{ij}}{(1-\theta_{z_i z_j})^2}\right] I^{a,b}_{i,j}.$$

Define the Hessian matrix H_k(θ) = E_θ(−∂u(θ)/∂θ); then

$$H_k(\theta) = \mathbb{E}_\theta\left(\mathrm{diag}\left\{-\partial^2\ell(\theta;A)/\partial\theta^2_{ab};\ 1 \le a \le b \le k\right\}\right).$$
Define the variability matrix V_k(θ) = Var_θ(u(θ)) and, following Varin and Vidoni (2005), the model complexity d^*_k = trace[H_k(θ)^{-1} V_k(θ)]. If the underlying model is indeed a correctly specified standard stochastic blockmodel, we have d^*_k = k(k+1)/2 and CL-BIC reduces to the traditional BIC. Indexed by 1 ≤ k ≤ k_0, the estimated criterion functions for the CL-BIC sequence (2.7) are

$$\widehat{\text{CL-BIC}}_k = -2\,\mathrm{cl}(\hat{\theta}_C; A) + \hat{d}^*_k \log\left(N(N-1)/2\right), \quad (2.8)$$

where θ̂_C and d̂^*_k are estimators of θ and d^*_k, respectively. For a given k, the explicit estimator forms are given below:

$$\hat{H}_k(\hat{\theta}_C) = \mathrm{diag}\left\{\sum_{i<j}\left[\frac{A_{ij}}{\hat{\theta}^2_{z_i z_j}} + \frac{1-A_{ij}}{(1-\hat{\theta}_{z_i z_j})^2}\right] I^{a,b}_{i,j}\right\} \quad \text{and} \quad \hat{V}_k(\hat{\theta}_C) = u(\hat{\theta}_C)\,[u(\hat{\theta}_C)]^T.$$
As noted in Gao and Song (2010), the above naive estimator for $V_k(\theta)$ vanishes when evaluated at the CLE $\hat{\theta}_C$. An alternative proposed in Varin et al. (2011) is to use a jackknife covariance matrix estimator for the asymptotic covariance matrix of $\hat{\theta}_C$, of the form
$$\widehat{\operatorname{Var}}_{\text{jack}}(\hat{\theta}_C) = \frac{N-1}{N}\sum_{l=1}^{N}\left(\hat{\theta}_C^{(-l)} - \hat{\theta}_C\right)\left(\hat{\theta}_C^{(-l)} - \hat{\theta}_C\right)^T, \qquad (2.9)$$
where $\hat{\theta}_C^{(-l)}$ is the composite likelihood estimator of $\theta$ with the $l$-th vertex deleted. Let $A^{(-l)}$ be the $(N-1)\times(N-1)$ matrix obtained after deleting the $l$-th row and column from the original adjacency matrix $A$. An explicit form for $\hat{\theta}_C^{(-l)}$ is given by
$$\hat{\theta}_{ab}^{(-l)} = \frac{1}{n_{ab}^{(-l)}}\sum_{i<j} A_{ij}^{(-l)}\,\mathbf{1}\{z_i = a,\, z_j = b\},$$
with $n_{ab}^{(-l)} = N_a^{(-l)} N_b^{(-l)}$ for $a \neq b$, and $n_{aa}^{(-l)} = N_a^{(-l)}(N_a^{(-l)} - 1)/2$; naturally, $N_a^{(-l)} = N_a - 1$ if $z_l = a$ and $N_a^{(-l)} = N_a$ otherwise.
Since the asymptotic covariance matrix of $\hat{\theta}_C$ is given by the inverse Godambe information matrix, $G_k(\theta)^{-1} = H_k(\theta)^{-1} V_k(\theta) H_k(\theta)^{-1}$ (see Gao and Song, 2010, and Varin et al., 2011), an explicit estimator for $d_k^{*}$ can be obtained by right-multiplying the jackknife covariance matrix estimator (2.9) by $\hat{H}_k(\hat{\theta}_C)$ to obtain
$$\hat{d}_k^{*} = \operatorname{trace}\left[\widehat{\operatorname{Var}}_{\text{jack}}(\hat{\theta}_C)\,\hat{H}_k(\hat{\theta}_C)\right] = \sum_{1 \le a \le b \le k}\left\{\widehat{\operatorname{Var}}_{\text{jack}}(\hat{\theta}_{ab}) \times \sum_{i<j}\left[\frac{A_{ij}}{\hat{\theta}_{z_i z_j}^2} + \frac{1 - A_{ij}}{(1 - \hat{\theta}_{z_i z_j})^2}\right] I^{a,b}_{i,j}\right\}.$$
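To make the preceding formulas concrete, the following R sketch assembles the criterion (2.8) for the standard stochastic blockmodel: block-wise estimates $\hat{\theta}_{ab} = m_{ab}/n_{ab}$, the diagonal Hessian entries, the jackknife variances (2.9), and the resulting $\hat{d}_k^{*}$. This is a minimal illustration written for this exposition, not the code of the accompanying R package; function names such as blk_counts and clbic_sbm are ours, and we assume the label vector z has already been produced by a community detection algorithm, with all $\hat{\theta}_{ab}$ strictly inside $(0,1)$.

blk_counts <- function(A, z, k) {
  # m_ab = observed edges between communities a and b; n_ab = possible dyads
  m <- matrix(0, k, k); n <- matrix(0, k, k); up <- upper.tri(A)
  for (a in 1:k) for (b in a:k) {
    sel <- (outer(z == a, z == b) | outer(z == b, z == a)) & up
    n[a, b] <- sum(sel); m[a, b] <- sum(A[sel])
  }
  list(m = m, n = n, theta = m / pmax(n, 1))
}

clbic_sbm <- function(A, z, k) {
  N  <- nrow(A)
  bc <- blk_counts(A, z, k)
  ut <- upper.tri(bc$theta, diag = TRUE)          # one entry per pair a <= b
  m  <- bc$m[ut]; n <- bc$n[ut]; th <- bc$theta[ut]
  cl <- sum(m * log(th) + (n - m) * log(1 - th))  # composite log-likelihood (2.1)
  H  <- m / th^2 + (n - m) / (1 - th)^2           # diagonal Hessian entries
  # Jackknife (2.9): recompute theta with each vertex deleted; only the
  # diagonal entries Var_jack(theta_ab) enter the trace in d*_k
  jk <- sapply(1:N, function(l) blk_counts(A[-l, -l], z[-l], k)$theta[ut])
  vj <- (N - 1) / N * rowSums((jk - th)^2)
  d_star <- sum(vj * H)                           # estimated model complexity
  -2 * cl + d_star * log(N * (N - 1) / 2)         # CL-BIC_k, equation (2.8)
}

# Selection: evaluate the criterion over candidate k and take the minimizer,
# with labels z_list[[k]] supplied by, e.g., spectral clustering for each k:
# k_hat <- which.min(sapply(1:k0, function(k) clbic_sbm(A, z_list[[k]], k)))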
2.3.4.2 Degree-Corrected Stochastic Blockmodel
Similarly, we develop the corresponding parallel results for the DCBM. The first- and second-order partial derivatives of $\ell(\theta,\omega;A)$ with respect to $\theta$ are defined as follows:
$$\frac{\partial \ell(\theta,\omega;A)}{\partial\theta} = u(\theta) = (u(\theta_{ab});\ 1 \le a \le b \le k)', \qquad u(\theta_{ab}) = \frac{m_{ab}}{\theta_{ab}} - 1,$$
$$\frac{\partial^2 \ell(\theta,\omega;A)}{\partial\theta_{a_1 b_1}\,\partial\theta_{a_2 b_2}} = 0 \quad \text{if } (a_1,b_1) \neq (a_2,b_2), \qquad \frac{\partial^2 \ell(\theta,\omega;A)}{\partial\theta_{ab}^2} = -\frac{m_{ab}}{\theta_{ab}^2},$$
$$\hat{H}_k(\hat{\theta}_C) = \operatorname{diag}\left\{\frac{1}{\hat{\theta}_{ab}};\ 1 \le a \le b \le k\right\},$$
which yields
$$\hat{d}_k^{*} = \sum_{1 \le a \le b \le k}\widehat{\operatorname{Var}}_{\text{jack}}(\hat{\theta}_{ab})\big/\hat{\theta}_{ab}.$$
2.3.4.3 Mixed Membership Stochastic Blockmodel
The estimated model complexity for the MMB now involves second-order partial derivatives of $\ell(\theta,\alpha;A)$ with respect to the hyperparameters $\theta$ and $\alpha$. Upon noticing the form of the first term of the complete-data log-likelihood (2.3), and recalling the Hessian matrix with respect to $\alpha$ detailed in (2.5), it is easy to see that $\hat{H}_k(\hat{\theta}_C, \hat{\alpha}_C)$ is a block matrix of the form
$$\hat{H}_k(\hat{\theta}_C, \hat{\alpha}_C) = \begin{pmatrix}\hat{H}_k(\hat{\theta}_C) & 0\\ 0 & \hat{H}_k(\hat{\alpha}_C)\end{pmatrix},$$
where $\hat{H}_k(\hat{\theta}_C)$ is a $k^2 \times k^2$ diagonal matrix given by
$$\hat{H}_k(\hat{\theta}_C) = \operatorname{diag}\left\{\sum_{i,j}\left[\frac{A_{ij}}{\hat{\theta}_{z_{i\to j}, z_{i\gets j}}^2} + \frac{(1 - A_{ij})}{(1 - \hat{\theta}_{z_{i\to j}, z_{i\gets j}})^2}\right]\mathbf{1}\{z_{i\to j} = a,\ z_{i\gets j} = b\}\right\},$$
and $\hat{H}_k(\hat{\alpha}_C) = (\hat{H}_k(\hat{\alpha}_C)_{ab})$ is a $k \times k$ matrix with entries
$$\hat{H}_k(\hat{\alpha}_C)_{ab} = N\left(\Psi'(\hat{\alpha}_a)\,\mathbf{1}\{a = b\} - \Psi'\Big(\sum_{a}\hat{\alpha}_a\Big)\right).$$
In a slight abuse of notation, we denote by $z_{i\to j}$ above the label assignment corresponding to node $i$ when it sends a message to node $j$, and similarly for $z_{i\gets j}$. The estimated model complexity is thus $\hat{d}_k^{*} = \operatorname{trace}[\widehat{\operatorname{Var}}_{\text{jack}}(\hat{\theta}_C, \hat{\alpha}_C)\,\hat{H}_k(\hat{\theta}_C, \hat{\alpha}_C)]$, where the jackknife matrix $\widehat{\operatorname{Var}}_{\text{jack}}(\hat{\theta}_C, \hat{\alpha}_C)$, assuming a form similar to (2.9) with $\hat{\theta}_C^{(-l)}$ and $\hat{\alpha}_C^{(-l)}$ estimated as explained in Section 2.2, provides the corresponding asymptotic covariance matrix estimator of the CLE $(\hat{\theta}_C, \hat{\alpha}_C)$.
We would like to remark that our CL-BIC approach for selecting the community
number K extends beyond the realm of stochastic blockmodels. Indeed, both the
latent space cluster model of Handcock et al. (2007) and the local dependence model
of Schweinberger and Handcock (2015), as well as any other (composite) likelihood-based approach which requires selecting a value of $K$, can employ our proposed CL-BIC methodology for selecting the number of communities. We leave the details of this
further investigation for future research.
2.4 Experiments
In this section, we show the advantages of the CL-BIC approach over the traditional
BIC as well as the variational Bayes approach in selecting the true number of communities via simulations and two real datasets.
2.4.1 Simulations
For simplicity of the presentation, we consider only the SBM and the DCBM in our
simulations. For each setting, we relax the assumption that the $A_{ij}$'s are conditionally independent given the labels $(Z_i = z_i, Z_j = z_j)$, varying both the dependence structure of the adjacency matrix $A \in \mathbb{R}^{N \times N}$ and the value of the parameters $(\theta, \omega)$.
The models introduced are correlation-contaminated stochastic blockmodels, i.e., we
bring different types of correlation into the stochastic blockmodels, both standard
and degree-corrected, mimicking real-world networks.
All of our simulated adjacency matrices have independent rows. That is, the binary variables $A_{ik}$ and $A_{jl}$ are independent whenever $i \neq j$, given the corresponding community labels of their endpoints. However, for a fixed node $i \in V$, correlation does exist across different columns, i.e., between the binary variables $A_{ij}$ and $A_{il}$. For the standard stochastic blockmodel, correlated binary random variables are generated, following the approach in Leisch et al. (1998), by thresholding a multivariate Gaussian vector with correlation matrix $\Sigma$ satisfying $\Sigma_{jl} = \rho_{jl}$. Specifically, for any choice of $|\rho_{jl}| \le 1$, we simulate correlated variables $A_{ij}$ and $A_{il}$ such that $\operatorname{Cov}(A_{ij}, A_{il}) = L(-\mu_j, -\mu_l, \rho_{jl}) - \theta_{z_i z_j}\theta_{z_i z_l}$. Here, following Leisch et al. (1998), we have $L(-\mu_j, -\mu_l, \rho_{jl}) = P(W_j \ge -\mu_j,\, W_l \ge -\mu_l)$, $\mu_j = \Phi^{-1}(\theta_{z_i z_j})$ and $\mu_l = \Phi^{-1}(\theta_{z_i z_l})$, where $(W_j, W_l)$ is standard bivariate normal with correlation $\rho_{jl}$. Correlated Bernoulli variables for the degree-corrected blockmodel are generated in a similar fashion.
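As an illustration of this generator, the short R sketch below draws one row of correlated edges by thresholding a multivariate Gaussian vector, shown here for the equal-correlation case $\rho_{jl} = \rho$ (with $\rho$ in the admissible range for positive definiteness); the function name correlated_row is ours, and the sampler comes from the mvtnorm package.

library(mvtnorm)

correlated_row <- function(probs, rho) {
  # probs: target marginals theta_{z_i z_j} for the partners j of node i
  n     <- length(probs)
  Sigma <- matrix(rho, n, n); diag(Sigma) <- 1  # equal-correlation matrix
  w     <- drop(rmvnorm(1, sigma = Sigma))      # W ~ N(0, Sigma)
  mu    <- qnorm(probs)                         # mu_j = Phi^{-1}(theta_{z_i z_j})
  as.integer(w >= -mu)                          # P(W_j >= -mu_j) = probs[j]
}

# Example: node i in community 1 facing partners from communities 1 and 2
# correlated_row(probs = c(0.35, 0.35, 0.05, 0.05), rho = 0.15)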
In each experiment, carried out over 200 randomly generated adjacency matrices, we
record the proportion of times the chosen number of communities for each of the
different criteria for selecting K agrees with the truth. Apart from CL-BIC and
BIC, we also consider the Integrated Likelihood Variational Bayes (VB) approach
of Latouche et al. (2012). To estimate the true community number, their method
selects the candidate value $k$ which maximizes a variational Bayes approximation to the observed-data log-likelihood.
Table 2.1: Comparison of CL-BIC and BIC over 200 repetitions from Simulation 1, where Eq and Dec indicate the equally correlated and exponentially decaying cases, respectively. Both the correlation of the multivariate Gaussian random variables (ρ MVN) and the corresponding maximum correlation between Bernoulli variables (ρ Ber.) are presented.

                           PROP               MEDIAN DEV
  ρ MVN        ρ Ber.      CL-BIC    BIC      CL-BIC      BIC
  0.10 (Eq)    0.06        1.00      0.40     0.0 (0.0)   2.0 (1.5)
  0.15 (Eq)    0.09        0.92      0.14     1.0 (0.0)   3.0 (2.2)
  0.20 (Eq)    0.12        0.81      0.03     1.0 (0.4)   5.0 (3.0)
  0.40 (Dec)   0.25        1.00      0.35     0.0 (0.0)   2.0 (1.5)
  0.50 (Dec)   0.32        1.00      0.21     0.0 (0.0)   2.0 (1.5)
  0.60 (Dec)   0.40        0.99      0.12     1.0 (0.0)   3.0 (1.5)

NOTE: PROP, proportion of correct selections; MEDIAN DEV, median deviation among the incorrectly selected trials, in the form median (robust standard deviation).
We restrict attention to candidate values for the true but unknown $K$ in the range $k \in \{1, \dots, 18\}$, both in the simulations and the real data analysis section. For
Simulations 1 – 3, spectral clustering is used to obtain the community labels for each
candidate k, whereas in the DCBM setting of Simulation 4, the SCORE algorithm
is employed. Additionally, among the incorrectly selected community number trials,
we calculate the median deviation between the selected community number and the
true K = 4, as well as its robust standard deviation.
Simulation 1: Correlation among the edges within and between communities is introduced simultaneously throughout all blocks in the network, rather than proceeding in a block-by-block fashion. Concretely, for each node $i$, all edges $\{A_{ij}\}_{j>i}$ are generated by thresholding a correlated $(N-i)$-dimensional Gaussian random vector with correlation matrix $\Sigma = (\rho_{jl})$. Thus, in this scenario, all edges $A_{ij}$ and $A_{il}$ with common endpoint $i$ are correlated, regardless of whether $j$ and $l$ belong to the same community or not. We examine the cases $\rho_{jl} = \rho$ and $\rho_{jl} = \rho^{|j-l|}$ for several choices of $\rho$. We consider a 4-community network, $\theta = (\theta_{ab};\ 1 \le a \le b \le 4)'$, where $\theta_{aa} = 0.35$ for all $a = 1, \dots, 4$ and $\theta_{ab} = 0.05$ for $1 \le a < b \le 4$. Community sizes are 60, 90, 120, and 150, respectively. Results are collected in Table 2.1.
Table 2.2: Comparison of CL-BIC and BIC over 200 repetitions from Simulation 2, where Ind indicates $\rho_{jl} = 0$ for $j \neq l$. For simplicity, we omit the correlation between the corresponding Bernoulli variables.

                              PROP               MEDIAN DEV
  ρ W.         ρ B.           CL-BIC    BIC      CL-BIC      BIC
  0.10 (Eq)    Ind            1.00      0.64     0.0 (0.0)   2.0 (0.7)
  0.15 (Eq)    Ind            0.98      0.36     1.0 (0.7)   2.0 (1.5)
  0.20 (Eq)    Ind            0.80      0.08     1.0 (0.7)   3.0 (2.2)
  0.40 (Dec)   Ind            1.00      0.33     0.0 (0.0)   2.0 (0.7)
  0.50 (Dec)   Ind            1.00      0.29     0.0 (0.0)   2.0 (1.5)
  0.60 (Dec)   Ind            1.00      0.14     0.0 (0.0)   2.0 (1.5)
  0.10 (Eq)    0.40 (Dec)     1.00      0.59     0.0 (0.0)   1.0 (1.5)
  0.10 (Eq)    0.50 (Dec)     1.00      0.54     0.0 (0.0)   2.0 (0.7)
  0.10 (Eq)    0.60 (Dec)     1.00      0.53     0.0 (0.0)   2.0 (1.1)
  0.15 (Eq)    0.40 (Dec)     0.98      0.32     1.0 (0.0)   2.0 (1.5)
  0.15 (Eq)    0.50 (Dec)     0.97      0.30     1.0 (0.4)   3.0 (1.5)
  0.15 (Eq)    0.60 (Dec)     0.95      0.25     1.0 (0.0)   3.0 (1.5)
Simulation 2: Correlation among the edges within ($\rho$ W.) and between ($\rho$ B.) communities is introduced block-wise. Concretely, for each node $i$, all edges $A_{ij}$ and $A_{il}$ are generated independently whenever $j$ and $l$ belong to different communities. If $j$ and $l$ belong to the same community, the edges $A_{ij}$ and $A_{il}$ are generated by thresholding a correlated Gaussian random vector with correlation matrix $\Sigma = (\rho_{jl})$. Parameter settings are identical to Simulation 1, with results collected in Table 2.2.
Simulation 3: Correlation settings are the same as in Simulation 2, but we change the value of the parameter $\theta$ to allow for more general network topologies. We set $\theta = (\theta_{ab};\ 1 \le a \le b \le 4)'$ with $\theta_{aa} = \theta_{b4} = 0.35$ for all $a = 1, \dots, 4$ and $b = 1, 2, 3$. The remaining entries of $\theta$ are set to 0.05. Hence, following Latouche et al. (2012), vertices from community 4 connect with probability 0.35 to any other vertices in the network, forming a community consisting solely of hubs. Community sizes are the same as in Simulation 1, with results collected in Table 2.3.
Table 2.3: Comparison of CL-BIC and VB over 200 repetitions from Simulation 3. For simplicity, we omit the correlation between the corresponding Bernoulli variables.

                          PROP               MEDIAN DEV
  ρ W.         ρ B.       CL-BIC    VB       CL-BIC      VB
  0.00 (Eq)    Ind        1.00      1.00     0.0 (0.0)   0.0 (0.0)
  0.10 (Eq)    Ind        0.96      0.00     1.0 (0.0)   2.0 (0.0)
  0.15 (Eq)    Ind        0.88      0.00     1.0 (0.0)   4.0 (2.2)
  0.20 (Eq)    Ind        0.85      0.00     1.0 (0.0)   5.0 (1.5)
  0.00 (Dec)   Ind        1.00      1.00     0.0 (0.0)   0.0 (0.0)
  0.40 (Dec)   Ind        1.00      1.00     0.0 (0.0)   0.0 (0.0)
  0.50 (Dec)   Ind        1.00      0.94     0.0 (0.0)   1.0 (0.0)
  0.60 (Dec)   Ind        1.00      0.56     0.0 (0.0)   1.0 (0.0)
Simulation 4: We follow the approach of Zhao et al. (2012) in choosing the parameters $(\theta, \omega)$ to generate networks from the degree-corrected blockmodel. Thus, the identifiability constraint $\sum_i \omega_i\,\mathbf{1}\{z_i = a\} = 1$ for each community $1 \le a \le K$ is replaced by the requirement that the $\omega_i$ be independently generated from a distribution with unit expectation, fixed here to be
$$\omega_i = \begin{cases} \eta_i & \text{w.p. } 0.8,\\ 2/11 & \text{w.p. } 0.1,\\ 20/11 & \text{w.p. } 0.1,\end{cases}$$
where $\eta_i$ is uniformly distributed on the interval $[0, 2]$. The vector $\theta$, in a slight abuse of notation, is reparametrized as $\theta_n = \gamma_n\theta$, where we vary the constant $\gamma_n$ to obtain different expected degrees of the network. Correlation settings and community sizes are the same as in Simulation 1, with results presented in Table 2.4, where the choices for $\gamma_n$ and $\theta$ are specified.
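For concreteness, a few illustrative lines of R suffice to draw these degree parameters (a sketch, not package code; the mixture has unit expectation since $0.8 \cdot 1 + 0.1 \cdot 2/11 + 0.1 \cdot 20/11 = 1$):

# Draw N individual effects omega_i from the mixture used in Simulation 4
draw_omega <- function(N) {
  u <- runif(N)
  ifelse(u < 0.8, runif(N, 0, 2),       # eta_i ~ Uniform[0, 2], w.p. 0.8
         ifelse(u < 0.9, 2/11, 20/11))  # point masses, w.p. 0.1 each
}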
When the stochastic blockmodels are contaminated by the imposed correlation
structure, which is expected in real-world networks, CL-BIC outperforms BIC over-
whelmingly. Tables 2.1–2.2 show the improvement is more significant when the im-
posed correlation is larger. For instance, in the block-wise correlated case of Table
2.2, when we only have within-community correlation $\rho_{\text{Dec}} = 0.60$, CL-BIC makes the right selection in all cases, while BIC is successful in only 14% of the 200 trials.
Table 2.4: Comparison of CL-BIC and BIC over 200 repetitions from Simulation 4. Before being scaled by the constant $\gamma_n$, we selected $\theta = (\theta_{ab};\ 1 \le a \le b \le 4)'$, where $\theta_{aa} = 7$ for all $a = 1, \dots, 4$ and $\theta_{ab} = 1$ for $1 \le a < b \le 4$.

                         PROP               MEDIAN DEV
  ρ MVN        γn        CL-BIC    BIC      CL-BIC       BIC
  0.20 (Eq)    0.02      0.84      0.35      1.0 (1.5)   2.0 (1.5)
  0.20 (Eq)    0.03      0.96      0.58      1.0 (0.0)   3.0 (1.5)
  0.30 (Eq)    0.02      0.70      0.31      1.0 (0.4)   2.0 (0.7)
  0.30 (Eq)    0.03      0.93      0.52      1.0 (0.6)   3.0 (1.5)
  0.40 (Eq)    0.02      0.43      0.21      1.0 (1.5)   2.0 (1.5)
  0.40 (Eq)    0.03      0.85      0.51      1.0 (1.9)   3.0 (1.9)
  0.60 (Dec)   0.02      0.92      0.52     -1.0 (1.5)   2.0 (1.5)
  0.60 (Dec)   0.03      1.00      0.81      0.0 (0.0)   1.0 (1.5)
  0.70 (Dec)   0.02      0.83      0.41     -1.0 (1.5)   2.0 (1.5)
  0.70 (Dec)   0.03      1.00      0.77      0.0 (0.0)   1.0 (0.7)
  0.80 (Dec)   0.02      0.69      0.22     -1.0 (1.5)   3.0 (2.4)
  0.80 (Dec)   0.03      0.98      0.69     -1.0 (0.4)   1.0 (1.5)
As shown in Table 2.3 for the model with a community of only hubs, if the network
is generated from a purely stochastic blockmodel, or if the contaminating correlation
is not too strong, CL-BIC and VB have similar performance in selecting the correct
K = 4. But again, as the imposed correlation increases, VB fails to make the right
selection more often than CL-BIC. This is particularly true in the ρEq = 0.20 case,
where CL-BIC makes the right selection in 85% of simulated networks, whereas VB
fails in all cases, yielding models with a median of 9 communities.
The same pattern translates into the DCBM setting of Table 2.4, where smaller
values of γn yield sparser networks. The community number selection problem be-
comes more difficult as $\gamma_n$ decreases, as degrees for many nodes are small, yielding noisy individual effect estimates $\hat{\omega}_i = d_i\big/\sum_{j: z_j = z_i} d_j$. Nevertheless, the CL-BIC ap-
proach consistently selects the correct number of communities more frequently than
BIC over different correlation settings.
[Figure 2.1: line plot. Legend: Simulation 1 CL-BIC and BIC ($\rho_{\text{Eq}} = 0.10$); Simulation 2 CL-BIC and BIC ($\rho_{\text{Eq}} = 0.10$); Simulation 3 CL-BIC and VB ($\rho_{\text{Dec}} = 0.60$). x-axis: True Community Number $K$; y-axis: Proportion.]

Figure 2.1: (Color online) Comparisons between different methods for selecting the true community number $K$ in the standard blockmodel settings of Simulations 1 – 3. Along the y-axis, we record the proportion of times the chosen number of communities for each of the different criteria for selecting $K$ agrees with the truth.

In addition, Figure 2.1 presents simulation results where the true community number $K$ increases from 2 to 8. Following our previous examples, community
sizes grow according to the sequence (60, 90, 120, 150, 60, 90, 120, 150). The selected
correlation-contaminated stochastic blockmodels are ρEq = 0.10 from Simulation 1,
within-community correlation ρEq = 0.10 from Simulation 2, and within-community
correlation ρDec = 0.60 from Simulation 3. As K increases and enough vertices are
added into the network, CL-BIC tends to correctly estimate the true community
number in all simulation settings. Even in this scenario with a growing number of
communities, the proportion of times CL-BIC selects the true K is always greater
than the corresponding BIC or VB estimates.
Before moving to our last simulation example, we would like to define two measures
to quantify the accuracy of a given node label assignment. The first measure is a "goodness-of-fit" (GF) measure defined as
$$GF(z, \hat{z}_k) = \sum_{i<j}\left(\mathbf{1}\{z_i = z_j\}\,\mathbf{1}\{\hat{z}_i = \hat{z}_j\} + \mathbf{1}\{z_i \neq z_j\}\,\mathbf{1}\{\hat{z}_i \neq \hat{z}_j\}\right)\Big/\binom{N}{2}, \qquad (2.10)$$
where $z$ represents the true community labels and $\hat{z}_k$ represents the community assignments from an estimator. Thus, the measure $GF(z, \hat{z}_k)$ calculates the proportion of pairs whose estimated assignments agree with the correct labels in terms of being assigned to the same or different communities, and is commonly known as the Rand Index (Rand, 1971) in the cluster analysis literature.
The second measure is motivated by the notion of "assortativity". The ratio of the median within-community edge number to the median between-community edge number (MR) is defined as
$$MR(\hat{z}_k) = \operatorname{median}_{a=1,\dots,k}(m_{aa})\Big/\operatorname{median}_{a \neq b}(m_{ab}), \qquad (2.11)$$
where $k$ is the number of communities implied by $\hat{z}_k$ and $m_{ab}$ is the total number of edges between communities $a$ and $b$, as given by the community assignment $\hat{z}_k$. It is clear that, for both measures, a higher value indicates a better community detection performance.
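Both measures are straightforward to compute; the following R sketch (written for this exposition, with illustrative function names) evaluates (2.10) and (2.11) given the true labels z, an estimated assignment zk, and the adjacency matrix A:

# GF (2.10): Rand index of the estimated labels against the truth
gf_measure <- function(z, zk) {
  up <- upper.tri(matrix(0, length(z), length(z)))
  same_true <- outer(z, z, "==")[up]
  same_est  <- outer(zk, zk, "==")[up]
  mean(same_true == same_est)        # proportion of concordant pairs
}

# MR (2.11): median within-community edge count over median between-community
mr_measure <- function(A, zk) {
  k <- max(zk)
  m <- matrix(0, k, k)
  for (a in 1:k) for (b in a:k) {
    sel <- (outer(zk == a, zk == b) | outer(zk == b, zk == a)) & upper.tri(A)
    m[a, b] <- m[b, a] <- sum(A[sel])
  }
  median(diag(m)) / median(m[upper.tri(m)])
}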
As a final simulation example, we analyze the performance of CL-BIC and BIC
for a growing number of communities under the degree-corrected blockmodel. While
the reparametrized vector $\theta_n = \gamma_n\theta$ remains as in Simulation 4, the $\omega_i$ are now independently generated from a Uniform$(1/5, 9/5)$ distribution. The results are collected in Table
2.5, where we also record the performance of the SCORE algorithm under the true
K, along with the goodness-of-fit (GF) and median ratio (MR) performance measures
introduced in (2.10) and (2.11), respectively.
Table 2.5: Comparison of CL-BIC and BIC over 200 repetitions from the DCBM case in Simulation 4, with ($\rho_{\text{Eq}} = 0.2$, $\gamma_n = 0.03$), where the individual effect parameters $\omega_i$ are now generated from a Uniform$(1/5, 9/5)$ distribution.

      SCORE Performance                    CL-BIC                          BIC
  K   Misc. R.  Orac. Err.  Est. Err.  PROP  MD  RSD   GF    MR      PROP  MD  RSD   GF    MR
  2   0.02      0.51        0.54       0.88  1   0     0.96  7.33    0.10  2   0.75  0.73  3.37
  3   0.03      0.53        0.55       0.93  1   0.75  0.97  7.87    0.09  3   1.49  0.88  3.91
  4   0.03      0.55        0.58       0.86  1   0     0.97  8.11    0.16  3   1.49  0.91  5.41
  5   0.04      0.58        0.62       0.56  1   0     0.96  6.35    0.09  3   2.05  0.92  5.92
  6   0.05      0.60        0.64       0.47  1   1.49  0.96  7.10    0.09  2   2.24  0.94  7.08
  7   0.05      0.63        0.66       0.39  1   1.49  0.97  6.77    0.09  3   1.49  0.95  6.73
  8   0.08      0.63        0.66       0.29  1   1.49  0.97  7.24    0.02  3   2.24  0.96  6.80

NOTE: PROP, proportion; MD, median deviation; RSD, robust standard deviation; GF, goodness-of-fit measure; MR, median ratio measure. Misc. R. denotes the misclustering rate of the SCORE algorithm. For $\Omega = \mathbb{E}_{\theta}(A)$, Orac. Err. and Est. Err. are $\|\hat{\Omega}_O - \Omega\|/\|\Omega\|$ and $\|\hat{\Omega}_{SC} - \Omega\|/\|\Omega\|$, respectively, where $\|\cdot\|$ denotes the Frobenius norm. Here, $\hat{\Omega}_O$ denotes the estimate of $\Omega$ under the oracle scenario where we know the true community assignment $z \in \{1, \dots, K\}^N$, and $\hat{\Omega}_{SC}$ is the estimate of $\Omega$ using the SCORE labeling vector.

The true community number and community sizes grow as in the case of the standard blockmodel described in Figure 2.1. Although CL-BIC performs uniformly better than BIC across all validating criteria and throughout all $K$, the procedure does not appear to yield model selection consistent results in this example. Aside from the fact that the introduced correlation is not exponentially decaying, this poor
performance as K increases can also be explained by the difficulty in estimating the
DCBM parameters (θ,ω) in a scenario where several vertices have potentially low
degrees. Indeed, even in the oracle scenario where we know the true community labels
$z_i$ ahead of time, and for a relatively small misclustering rate of the SCORE algorithm, Table 2.5 exhibits the difficulty in obtaining accurate estimates $(\hat{\theta}_C, \hat{\omega}_C)$, and in
evaluating the CL-BIC criterion functions (2.8), under this increasing K scenario for
the DCBM. Whether the increased number of parameters in the DCBM has an effect
on the consistency results of CL-BIC as K increases is also an interesting line of future
work.
2.4.2 Real Data Analysis
2.4.2.1 International Trade Networks
We first study an international trade dataset originally analyzed in Westveld and Hoff (2011), containing yearly international trade data between $N = 58$ countries from 1981–2000. For a more detailed description of this dataset, we refer the interested reader to the Appendix in Westveld and Hoff (2011). In our numerical comparisons between CL-BIC and BIC paired with the standard stochastic blockmodel log-likelihood (2.1), we focus on data from the year 1995. For this network, an adjacency matrix $A$ can be formed by first considering a weight matrix $W$ with $W_{ij} = \text{Trade}_{i,j} + \text{Trade}_{j,i}$, where $\text{Trade}_{i,j}$ denotes the value of exports from country $i$ to country $j$. Finally, we define $A_{ij} = 1$ if $W_{ij} \ge W_{\alpha}$, and $A_{ij} = 0$ otherwise; here $W_{\alpha}$ denotes the $\alpha$-th quantile of $\{W_{ij}\}_{1 \le i < j \le N}$. For the choice of $\alpha = 0.5$, Fig-
ure 2.2 shows the largest connected component of the resulting network. Panel (a)
shows CL-BIC selecting 3 communities, corresponding to countries with the highest
GDPs (dark blue), industrialized European and Asian countries with medium-level
GDPs (green), and developing countries in South America with the smallest GDPs
(yellow). Next, in panel (b) we also show the variational Bayes solution correspon-
ding to k = 7, providing finer communities for some Central and South American
neighboring countries (yellow and pink, respectively) but fragmenting the high- and
medium-level GDP countries into ambiguous communities. For instance, it is not
clear why countries like Bolivia and Nepal belong to the same community (orange) or
why the Netherlands, rather than Brazil or Italy, joined the community of countries
with the highest GDPs (light blue). Lastly, panel (c) corresponds to the final BIC
model selecting 10 communities. Under this partition, South American countries are
now split into 6 “noisy” communities, while high GDP countries are unnecessarily
fragmented into two.
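The adjacency construction above is simple enough to state in a few lines of R; the sketch below mirrors the thresholding rule $A_{ij} = \mathbf{1}\{W_{ij} \ge W_{\alpha}\}$, where trade is a hypothetical $N \times N$ matrix of export values and the function name is ours:

# Build the trade network adjacency matrix from a hypothetical matrix `trade`,
# whose (i, j) entry is the value of exports from country i to country j
trade_adjacency <- function(trade, alpha = 0.5) {
  W <- trade + t(trade)                        # W_ij = Trade_ij + Trade_ji
  w_alpha <- quantile(W[upper.tri(W)], alpha)  # alpha-quantile of {W_ij}, i < j
  A <- (W >= w_alpha) * 1L                     # threshold into a binary network
  diag(A) <- 0L                                # no self-loops
  A
}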
[Figure 2.2: node-link diagrams over three panels, with vertices labeled by country name. Panels: (a) k = 3 (CL-BIC-SBM); (b) k = 7 (VB-SBM); (c) k = 10 (BIC-SBM).]

Figure 2.2: (Color online) Largest connected component of the international trade network for the year 1995.

We believe CL-BIC provides a better model than traditional BIC, yielding communities with countries sharing similar GDP values without dividing an entire continent
into 6 smaller communities. On the contrary, BIC selects a model containing commu-
nities of size as small as one, which are of little, if any, practical use. The variational
Bayes approach provides a meaningful solution in this example, exhibiting performance similar to that in Latouche et al. (2012) in terms of providing some finer community
assignments.
[Figure 2.3: node-link diagrams over three panels, with vertices labeled by student ID. Panels: (a) k = 6 (CL-BIC-DCBM); (b) k = 9 (BIC-DCBM); (c) "Truth" (K = 6).]

Figure 2.3: (Color online) Largest connected component of the school friendship network. Panel (c) shows the "true" grade community labels: 7th (blue), 8th (yellow), 9th (green), 10th (purple), 11th (red), and 12th (black).
2.4.2.2 School Friendship Networks
Now, we consider a school friendship network obtained from the National Longitudinal
Study of Adolescent Health (http://www.cpc.unc.edu/projects/addhealth). For this
network, Aij = 1 if either student i or j reported a close friendship tie between the
two, and Aij = 0 otherwise. We focus on the network of school 7 from this dataset,
and our comparisons between CL-BIC and BIC are done with respect to the degree-
corrected blockmodel log-likelihood (2.2). With 433 vertices, Figure 2.3 shows the
largest connected component of the resulting network. As shown in panel (a), CL-
BIC selects the true community number K = 6, roughly agreeing with the actual
grade labels, except for the black community. BIC, shown in panel (b), selects 9
communities, unnecessarily splitting the 7th and 8th graders. The “true” friendship
network is shown in panel (c).
We still conclude CL-BIC performs better than traditional BIC. Except for the
misallocation of the black community of 12th graders, the model selected by CL-BIC
correctly labels most of the remaining network. While BIC partially separates the
10th graders and the 12th graders, a substantial portion of the 10th graders are ab-
sorbed into the 9th grader community (green). In addition, BIC further fragments
7th and 8th graders into “noisy” communities. This is an extremely difficult commu-
nity detection problem since, even for a “correctly” specified k = 6, SCORE fails
to assign all 12th graders to their corresponding true grade. The black community
selected by SCORE in panel (a) mainly corresponds to female students and hispanic
males, reflecting perhaps closer friendship ties among a subgroup of students recently
starting junior high school.
Using the “goodness-of-fit” measure defined in (2.10), we found that the CL-BIC
community assignment leads to $GF(z, \hat{z}_6) = 0.811$, which is slightly better than the $GF(z, \hat{z}_9) = 0.810$ obtained for BIC. For the MR measure given in (2.11), the results for CL-BIC and BIC are $MR(\hat{z}_6) = 40.8$ and $MR(\hat{z}_9) = 33.3$, respectively, again
indicating the superiority of the CL-BIC solution paired with SCORE.
In both examples, BIC tends to overestimate the "true" community number $K$, rendering very small communities which are in turn penalized under the CL-BIC approach. This suggests CL-BIC successfully remedies the robustness issues brought in by spectral clustering under misspecification of the underlying stochastic blockmodels, and effectively accounts for the extra variability that traditional BIC fails to capture.
2.5 Discussion
There has been a tremendous amount of research in recovering the underlying struc-
tures of network data, especially on the community detection problem. Most of the
existing work has focused on studying the properties of the stochastic blockmodel
and its variants without looking at the possible model misspecification problem. In
this chapter, under the standard stochastic blockmodel and its variants, we advocate
the use of composite likelihood BIC for selecting the number of communities due to
its simplicity in implementation and its robustness against correlated binary data.
Some extensions are possible. For instance, the proposed methodology in this work
is based on the spectral clustering and SCORE algorithms, and it would be interesting
to explore combining CL-BIC with other community detection methods. In addition, most examples considered here are dense graphs, which are common but by no means exhaust all scenarios arising in real applications. Another open problem is to study whether the CL-BIC approach is consistent for the degree-corrected stochastic blockmodel, a property which our numerical studies suggest need not hold.
Chapter 3
Kernel-Based Change-Point Detection in Dynamic Stochastic Blockmodels
The stochastic blockmodel and its variants have been one of the most widely used
approaches for spatial segmentation of static network data. As most complex phenom-
ena studied through statistical network analysis are dynamic in nature, static network
models fail to capture the relevant temporal features. In this chapter, we formulate
the segmentation of a time-varying network into temporal and spatial components by
means of a change-point detection hypothesis testing problem on dynamic stochastic
blockmodels. We propose a Kernel Change-Points (KCP) test statistic based on the
idea of data aggregation across the different temporal layers through kernel-weighted
adjacency matrices computed before and after each candidate change-point. We de-
rive the required theoretical framework and illustrate our anomaly detection approach
using a wide array of synthetic data sets. In addition, we apply our proposed KCP
methodology to the real-world Enron network, a dynamic social network of email
communications.
3.1 Introduction
The study of several complex systems having interacting units through network or
graph representations has emerged as a topic of great research interest in recent years.
Examples of modeled phenomena include online friendship interactions within social
networks, regulatory protein-protein interactions in systems biology, hyperlinks be-
tween webpages in information systems, and coauthorship within citation networks.
While various probabilistic models have been proposed for static networks (Gold-
enberg et al., 2010), which represent a single time snapshot of the phenomenon of
interest, many of these systems are dynamic in nature, meaning their evolving struc-
ture can be represented as a sequence of networks snapshots taken at discrete time
points over a common set of nodes. Among a myriad of research problems arising
with time-varying network data, in this chapter we tackle the problem of change-
point detection within dynamic stochastic blockmodels, where the goal is to find the
M discrete time snapshots at which a subset of vertices of the network alter their con-
nectivity patterns under a time-dependent version of the classic stochastic blockmodel
(Holland et al., 1983) with T observed time points. Our aim is thus the segmenta-
tion of dynamic networks into a temporal component, where the network generative
mechanism remains fixed between presumed change-points, and a spatial component,
in which we seek to partition the nodes of the network into tightly-knit communities
or clusters.
Significant progress has been made recently in generalizing static network mo-
dels such as the stochastic blockmodel and its mixed membership variant (Airoldi
et al., 2008) to their dynamic counterparts. Proposed extensions of the stochas-
tic blockmodel include the state-space model for dynamic networks of Xu and Hero
(2014), where the state evolution of the blockmodel parameters is assumed to follow
a stochastic dynamic system, and the multi-graph stochastic blockmodel of Han et al.
(2015), where network vertices share the same block structure over the multiple tem-
poral layers but class connection probabilities are allowed to vary. Under some mild
conditions on the sequence of block probabilities, Han et al. (2015) show a modified
version of the classic spectral clustering algorithm (Donath and Hoffman, 1973) to
be consistent at estimating the fixed block labels for a diverging number of snap-
shots. In the opposite setting of nodes changing their community membership over
time under fixed block probabilities, Ghasemian et al. (2015) develop detectability
thresholds for community detection in dynamic stochastic blockmodels as a function
of community strength and the rate of change of nodes to different communities. The
more general dynamic mixed membership stochastic blockmodel (Xing et al., 2010;
Ho et al., 2011) has also been proposed, where all block probabilities, membership
indicators and mixed membership vectors are assumed to follow a state-space model.
Aside from community detection in temporally evolving networks, researchers have also focused on modeling dynamic relational network data in the presence of covari-
ates through longitudinal data structures (Westveld and Hoff, 2011) and continuous
time, point process models paired with partial likelihood inference procedures (Perry
and Wolfe, 2013). We refer interested readers to Aggarwal and Subbian (2014) for a
comprehensive survey in related time-dependent network methodology.
While some of the above models follow complex dynamics in their treatment
of temporally evolving networks, we favor the simpler approach of multiple-change
point detection, also known as temporal segmentation in time series analysis (Cho
and Fryzlewicz, 2015). Among a growing literature of methodology for this network
problem, the scan statistics approach for anomaly detection of Wang et al. (2014)
represents one of the few works where limiting distributions and power computations
are provided for the associated change-point hypothesis testing problem. The main
idea in their paper is to scan the observed network data over small time and spatial
windows, calculating some locality statistic or measure for each window, with the
computed scan statistic being the maximum of these locality statistics. Although they do not obtain explicit power calculations, the latent process model of Lee and Priebe (2011) and the latent position model with an EM-type partition algorithm for identifying multiple change-points of Robinson and Priebe (2015) provide a more flexible change-point detection framework than Wang et al. (2014). Recently,
Cribben and Yu (2015) tackled temporal segmentation of brain network dynamics
using principal angles between subspaces as a measure of local distance for adjacent
network snapshots. A common feature of all these change-point detection techniques
is the aim at sensing sudden changes in the local behavior of the connectivity patterns
of the network through a set of spatial or temporal measurement windows.
Motivated by the idea of capturing the local characteristics of dynamic networks,
and making use of the consistency result for spectral clustering on a mean adjacency
matrix (Han et al., 2015), showing the benefits of data aggregation for time-dependent
networks, in this chapter, we propose a kernel-based regularized distance statistic for
multiple change-point detection under dynamic stochastic blockmodels. While allow-
ing for changes both in the class connection probabilities and the node community
memberships, our approach first computes kernel-weighted adjacency matrices which
aggregate information across the temporal layers before and after each candidate
change-point. By the consistency result in Han et al. (2015), spectral clustering on
these weighted adjacency matrices provides a more precise estimation of the underly-
ing block structure in between presumed change-points. With the help of the profile
likelihood paradigm, we use the obtained community labels to compute estimated
expectation matrices of the aggregated networks before and after each candidate
change-point, and whose difference in the Frobenius norm yields a tailor-made test
statistic for the corresponding hypothesis testing problem. The resulting procedure,
Kernel Change-Points (KCP), is then tested on synthetic networks and the Enron
email data set, yielding a performance comparable to the scan statistic techniques
(Priebe et al., 2005; Wang et al., 2014) in terms of change-point selection.
The remainder of the chapter is organized as follows. In Section 3.2 we provide a
background on dynamic stochastic blockmodels, and introduce a rigorous definition
of our multiple change-point detection problem. The proposed Kernel Change-Point
methodology is developed in Section 3.3, with simulation examples and a real-world
network data analysis on the Enron email corpus carried out in Section 3.4. We
conclude the chapter with a short discussion in Section 3.5.
3.2 Background
We first introduce some notation and present a general overview of our time-varying network data framework. Consider a temporal series of simple, undirected networks $\mathcal{G}(1), \dots, \mathcal{G}(T)$ over a fixed set of nodes $V$, where $\mathcal{G}(t) := (V, E(t))$ represents the network snapshot observed at time $t = 1, \dots, T$ and $E(t)$ denotes the corresponding set of edges over $N = |V|$ vertices. The corresponding sequence of $N \times N$ symmetric adjacency matrices is thus $\{A(t)\}_{t=1}^{T}$, where, as usual, $A_{ij}(t) := 1$ if $(i,j) \in E(t)$, and $A_{ij}(t) := 0$ otherwise. Self-edges are not allowed, and so $\{A_{ii}(t)\}_{i=1}^{N}$ is fixed to zero across all $t$ values.
3.2.1 Dynamic Stochastic Blockmodels
The original, static stochastic blockmodel for a single time snapshot network $\mathcal{G} := (V, E)$, with $N \times N$ adjacency matrix $A$, was first proposed by Holland et al. (1983). The model posits that each node $i \in V$ is associated with one and only one community from $\{V_1, \dots, V_K\}$, with label $C_i$, where $V = V_1 \sqcup \cdots \sqcup V_K$ and $C_i \in \{1, \dots, K\}$. Further, for all possible pairs $(i,j)$, the model assumes the corresponding edge variables $\{A_{ij}\}_{i<j}$ are conditionally independent Bernoulli random variables given the communities of their endpoints, i.e., $P(A_{ij} = 1;\ C_i = c_i, C_j = c_j) = \theta_{c_i c_j}$. In other words, a standard Stochastic Blockmodel (SBM) with $K$ communities is parametrized by a class connection probability matrix $\theta \in [0,1]^{K \times K}$, where $\theta_{ab}$ denotes the probability of forming an edge between nodes from communities $a$ and $b$, and a community assignment matrix $Z = (z_1, \dots, z_N)' \in \{0,1\}^{N \times K}$, where $z_{i,k} = 1$ if and only if $c_i = k$, and $z_{i,k} = 0$ otherwise.
Under the Dynamic Stochastic Blockmodel (DSBM) framework proposed in this chapter, we allow nodes $i \in V$ to change their community membership over time, as well as let the class connection probability matrices vary across different layers. As such, each network $\mathcal{G}(t)$ in the time series has its own community assignment matrix $Z(t) = (z_1(t), \dots, z_N(t))' \in \{0,1\}^{N \times K}$, with $z_{i,k}(t) = 1$ if and only if $c_i(t) = k$, and $z_{i,k}(t) = 0$ otherwise. Moreover, the sequence of $K \times K$ class connection probability matrices $\{\theta(t)\}_{t=1}^{T}$ satisfies
$$P\{A_{ij}(t) = 1;\ C_i(t) = c_i(t),\, C_j(t) = c_j(t)\} = \theta(t)_{c_i(t) c_j(t)}. \qquad (3.1)$$
For the DSBM, we define the expectation matrix $\Omega(t) := \mathbb{E}(A(t))$. A few algebra steps show that this matrix can be expressed as the product
$$\Omega(t) = Z(t)\,\theta(t)\,Z(t)', \qquad (3.2)$$
with $\operatorname{diag}(\Omega(t)) := 0$ for all $t = 1, \dots, T$. In a slight abuse of notation, we write
$$A(t) \sim \text{SBM}(\Omega(t)), \qquad (3.3)$$
to mean that the individual network snapshot data $A(t)$ follow a standard stochastic blockmodel with fixed but unknown parameters $\theta(t)$ and $Z(t)$ for each $t = 1, \dots, T$.
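A minimal R sketch of this generative mechanism, assuming given inputs theta_t (a $K \times K$ matrix) and z_t (a length-$N$ label vector) and using an illustrative function name, is:

# Simulate one DSBM snapshot A(t) from (theta(t), Z(t)), following (3.1)-(3.3)
sample_snapshot <- function(theta_t, z_t) {
  N  <- length(z_t)
  Omega <- theta_t[z_t, z_t]               # Omega(t) = Z theta Z', entrywise
  A  <- matrix(0L, N, N)
  up <- upper.tri(A)
  A[up] <- rbinom(sum(up), 1, Omega[up])   # conditionally independent dyads
  A + t(A)                                 # symmetrize; diagonal stays zero
}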
While not as general as the dynamic Mixed Membership Stochastic Blockmodel
from Xing et al. (2010), the DSBM proposed in equations (3.1) – (3.3) allows for
enough flexibility in modeling dynamic relational network data. As opposed to mo-
dels where the block membership is fixed from the initial observation time (Wang
et al., 2014; Han et al., 2015), or settings where the edges in E(t) are generated in-
dependently for each t according to the same class connection probability matrix θ
(Ghasemian et al., 2015), our DSBM framework above allows for time-evolving θ(t),
Z(t) or both. Since we aim at modeling various change-point phenomena throughout
the time series G(1), . . . ,G(T ), we expect (θ(t),Z(t)) to vary only at a limited
number of snapshots. In this regard, the DSBM setting presented in this chapter is
more stringent than the linear dynamic state-space model of Xu and Hero (2014).
3.2.2 Change-Point Model
Before estimating the parameters (θ(t),Z(t)) driving the network evolution, we are
first interested in detecting change-points, the discrete time snapshots at which the
underlying characteristics of the evolving network are altered. Following Lee and
Priebe (2011), we make the following definition of a change-point.
Definition 3.1. A fixed but unknown constant $t^* \in \{1, 2, 3, \dots\} \cup \{+\infty\}$ is a change-point for the temporal series $\{\mathcal{G}(t)\}_{t \ge 1}$ if there exist distinct expectation matrices $\Omega_0, \Omega_A \in [0,1]^{N \times N}$ such that prior to time $t^*$, the sequence $A(t)$ follows the null dynamics, i.e.,
$$A(t) \sim \text{SBM}(\Omega_0) \quad \text{for } t \le t^*;$$
and after time $t^*$, the sequence $A(t)$ starts following the alternative dynamics
$$A(t) \sim \text{SBM}(\Omega_A) \quad \text{for } t > t^*.$$
For any given time snapshot $t \in \mathbb{N}$, the associated change-point hypothesis testing problem is
$$H_0: t^* > t \quad \text{versus} \quad H_A: t^* = t. \qquad (3.4)$$
As a convention, if $t^* = +\infty$, then we assume the sequence $\{A(t)\}_{t \ge 1}$ evolves according to the null dynamics for all $t$. The above definition extends naturally to the case of multiple change-points within the time series of graphs $\{\mathcal{G}(t)\}_{t \ge 1}$; for simplicity of the presentation, we omit the details of this definition. We do assume, however, that there is an unknown number $M$ of distinct change-points $1 \le t^*_1 < \cdots < t^*_M \le +\infty$ in our observed data $\{A(t)\}_{t \ge 1}$. While in principle the above definition considers all time snapshots as candidate change-points, in our numerical implementation we bar any two adjacent time points from being labeled as such; a minimum separation requirement common to the literature on multiple change-point detection for multivariate time series (Cho and Fryzlewicz, 2015). Modulo this consideration for adjacent time snapshots, every
time point t is considered a candidate change-point for the hypothesis testing problem
(3.4). In the next section, we build upon the spectral clustering algorithm (Donath
and Hoffman, 1973) to construct a test statistic comparing a form of distance between networks before and after each candidate change-point.
3.3 Change-Point Detection Methodology
3.3.1 Spectral Clustering and Parameter Estimation in SBM
Spectral clustering is a popular, computationally feasible algorithm for estimating the
community structure $Z \in \{0,1\}^{N \times K}$ of the single network snapshot SBM. Proven to
be consistent for sufficiently dense networks as N → ∞ (Rohe et al., 2011), the
algorithm can be derived as a convex relaxation (von Luxburg, 2007) of assortative
mixing, NP-hard criteria such as minimizing normalized cut (Shi and Malik, 2000).
The procedure, given in detail in Algorithm 3.1 for a generic weighted adjacency
matrix $W \in [0,1]^{N \times N}$, is based on the eigenvalue decomposition of the normalized
graph Laplacian L (von Luxburg, 2007) and on the K-means algorithm applied to
the eigenvectors of L.
Following Choi et al. (2012), and for a fixed time snapshot in (3.1), the log-likelihood of observing an adjacency matrix $A$ under the SBM is
$$\ell(\theta, c; A) := \sum_{i<j}\left[A_{ij}\log\theta_{c_i c_j} + (1 - A_{ij})\log(1 - \theta_{c_i c_j})\right], \qquad (3.5)$$
with $c \in \{1, \dots, K\}^N$ being a community membership vector directly parametrizing $Z$. Considering the computational intractability of directly estimating both parameters $(\theta, c)$ based on exact maximum likelihood fitting, we follow a profile likelihood approach paired with the spectral clustering relaxation to estimate the model parameters and the associated expectation matrix $\Omega$. Denote by $N_a$ the size of community $a$, and by $n_{ab}$ the maximum number of possible edges between communities $a$ and $b$, i.e., $n_{ab} := N_a N_b$ for $a \neq b$, and $n_{aa} := N_a(N_a - 1)/2$.
Algorithm 3.1 Spectral Clustering
1. Input: Weighted adjacency matrix $W \in [0,1]^{N \times N}$. Pre-specified number of communities $K$.
2. Compute the normalized graph Laplacian $L := D^{-1/2} W D^{-1/2}$, where $D_{ii} := d_i$ and $D_{ij} := 0$ for $i \neq j$. Here, $d_i$ is the degree of node $i$.
3. Find the eigenvectors $u_1, \dots, u_K$ associated with the $K$ eigenvalues of $L$ that are largest in magnitude, forming an $N \times K$ matrix $U := (u_1, \dots, u_K)$.
4. Apply the $K$-means algorithm to the rows of $U$ and obtain estimated community labels $\hat{c} := (\hat{c}_1, \dots, \hat{c}_N)$ and an estimated community assignment matrix $\hat{Z} := (\hat{z}_1, \dots, \hat{z}_N)' \in \{0,1\}^{N \times K}$, with $\hat{z}_{i,k} = 1$ if and only if $\hat{c}_i = k$, and $\hat{z}_{i,k} = 0$ otherwise.
5. Output: $(\hat{c}, \hat{Z}) \in \mathbb{R}^N \otimes \mathbb{R}^{N \times K}$.
For a fixed community assignment $c \in \{1, \dots, K\}^N$, let $m_{ab} := \sum_{i<j} A_{ij}\,\mathbf{1}\{c_i = a,\, c_j = b\}$, so that the MLE of $\theta_{ab}$ in (3.5) is given by $\hat{\theta}_{ab} := m_{ab}/n_{ab}$. Instead of using a Gibbs sampler to explore the space of candidate community assignments, we resort to spectral clustering to obtain an estimated $\hat{Z}$, and thus define
$$\hat{\Omega} := \hat{\Omega}(A) = \hat{Z}\hat{\theta}\hat{Z}', \qquad (3.6)$$
the estimated expectation matrix for a fixed time snapshot SBM with adjacency matrix $A$.
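Putting Algorithm 3.1 and the profile likelihood step (3.6) together in R, one possible sketch is the following (illustrative names; the eigen decomposition assumes $W$ has no isolated nodes, so that $D^{-1/2}$ is well defined):

# Algorithm 3.1: spectral clustering on a weighted adjacency matrix W
spectral_cluster <- function(W, K) {
  d <- rowSums(W)
  L <- diag(1 / sqrt(d)) %*% W %*% diag(1 / sqrt(d))     # normalized Laplacian
  eig <- eigen(L, symmetric = TRUE)
  idx <- order(abs(eig$values), decreasing = TRUE)[1:K]  # K largest in magnitude
  U <- eig$vectors[, idx, drop = FALSE]
  kmeans(U, centers = K, nstart = 20)$cluster            # estimated labels c_hat
}

# Profile likelihood step (3.6): Omega_hat = Z_hat theta_hat Z_hat'
omega_hat <- function(W, z, K) {
  theta <- matrix(0, K, K)
  for (a in 1:K) for (b in a:K) {
    sel <- (outer(z == a, z == b) | outer(z == b, z == a)) & upper.tri(W)
    theta[a, b] <- theta[b, a] <- mean(W[sel])           # theta_ab = m_ab / n_ab
  }
  Om <- theta[z, z]
  diag(Om) <- 0
  Om
}

Note that omega_hat applies unchanged to the kernel-weighted (hence non-binary) adjacency matrices introduced below, with block averages playing the role of the profile MLEs.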
3.3.2 Algorithm Description and Rationale
3.3.2.1 Motivation
As our main goal is multiple change-point detection under the dynamic stochastic
blockmodel setting for $\{A(t)\}_{t=1}^{T}$, a natural question is how one can extend the spectral clustering algorithm for the SBM to discover community structure in the DSBM, and how one can use the aggregated information in $\{\mathcal{G}(t)\}_{t=1}^{T}$ to compute a test statistic for problem (3.4) using a regularized distance between networks before and after
each candidate change-point. Bazzi et al. (2015) argue that two main approaches
have been followed to detect communities in temporally evolving networks. The first
involves constructing a single static network by aggregating information from the
different snapshots into a mean graph, or a weighted average graph considering the
total edge weight for each edge across time points t = 1, . . . , T ; and then using stan-
dard techniques such as spectral clustering on this aggregated network (Han et al.,
2015). The second approach involves using standard community detection techniques
to infer Z(t) independently on each network snapshot, and then tracking the commu-
nities across the sequence (Macon et al., 2012; Fenn et al., 2012) or averaging them
by means of majority voting. Han et al. (2015) show the effectiveness of the former
approach by, under some identifiability conditions on $\{\theta(t)\}_{t=1}^{T}$, proving a consistency result for spectral clustering on the mean adjacency matrix
$$\bar{A} := \frac{1}{T}\sum_{t=1}^{T} A(t) \qquad (3.7)$$
as the number of snapshots $T \to \infty$ with $N$ fixed. Moreover, their paper also finds
the majority voting scheme underperforming spectral clustering on the mean graph
in instances where the SBM on a single time snapshot is below the detectability limit
developed in Decelle et al. (2011), thus showing difficulty in learning the underlying
community structure from the multiple temporal layers for this heuristic.
Motivated by these results favoring data aggregation for discovering community
structure within DSBM, and keeping in mind the necessity of specifying a metric
comparing distances between adjacent networks for each candidate change-point, we
define the following sequences of weighted before and after adjacency matrices:
$$B(t) := \frac{\sum_{i=1}^{t-1} K_{t-1,i,h}\,A(i)}{\sum_{i=1}^{t-1} K_{t-1,i,h}} \quad\text{and}\quad \bar{A}(t) := \frac{\sum_{i=t}^{T} K_{t,i,h}\,A(i)}{\sum_{i=t}^{T} K_{t,i,h}}, \qquad (3.8)$$
for $t = 2, \dots, T$, where $K_{t,i,h} := K(|t-i|/h)$, $K(\cdot)$ is a Gaussian kernel, and $h$ is a kernel bandwidth. Spectral clustering on $B(t)$ helps unveil the community structure before each candidate $t$, and together with the profile likelihood approach of Section 3.3.1, it provides an estimate $\hat{\Omega}(B(t))$ of the parameters driving the evolution of the dynamic stochastic blockmodel just prior to time snapshot $t$. Similarly, through the same approach outlined in equation (3.6), $\bar{A}(t)$ provides an estimate $\hat{\Omega}(\bar{A}(t))$ of the DSBM parameters for snapshot $t$ and afterwards.
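In R, the aggregation step (3.8) can be sketched as below, given a list A_list holding the $T$ adjacency matrices (function names ours):

# Kernel-weighted "before" and "after" adjacency matrices from (3.8)
weighted_mats <- function(A_list, t, h) {
  T_len <- length(A_list)
  K <- function(u) exp(-u^2 / 2)          # Gaussian kernel
  wavg <- function(idx, ref) {            # weighted average of snapshots
    w <- K(abs(ref - idx) / h)
    Reduce(`+`, Map(`*`, A_list[idx], w)) / sum(w)
  }
  list(B = wavg(1:(t - 1), t - 1),        # aggregates layers 1, ..., t-1
       A_bar = wavg(t:T_len, t))          # aggregates layers t, ..., T
}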
Just as in the case of univariate kernel density estimation (Silverman, 1986), the
choice of bandwidth h poses a natural bias-variance tradeoff. A small value of h
in (3.8) leads to under-smoothing, potentially yielding estimates $(\hat{\Omega}(B(t)), \hat{\Omega}(\bar{A}(t)))$ showing features in the dynamic network data that are not really present (e.g., spurious change-points), whereas a large value of $h$ leads to over-smoothing, possibly resulting in overlooking important features present in $\{A(t)\}_{t=1}^{T}$. Given our goal of detecting
multiple change-points within DSBM data, the choice of bandwidth h will play a
crucial role in evaluating the deviance between adjacent networks before and after a
given candidate change-point.
3.3.2.2 Test Statistic d(t)
For a candidate change-point $t \in \{2, \dots, T\}$ under the observed temporal series $\{A(t)\}_{t=1}^{T}$, we define the following test statistic for the hypothesis testing problem (3.4):
$$d(t) := \left\|\hat{\Omega}(B(t)) - \hat{\Omega}(\bar{A}(t))\right\|_F, \qquad (3.9)$$
where $(\hat{\Omega}(B(t)), \hat{\Omega}(\bar{A}(t)))$ are the estimated expectation matrices for the weighted adjacency matrices given in (3.8), and $\|\cdot\|_F$ denotes the usual Frobenius norm. Naturally, one would reject the null hypothesis that the candidate $t$ is not a change-point for large observed values of $d(t)$.
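Combining the sketches above, the statistic (3.9) takes only a few lines (again with illustrative names, building on weighted_mats, spectral_cluster, and omega_hat):

# Test statistic d(t): Frobenius distance between the estimated expectation
# matrices computed before and after candidate change-point t
d_stat <- function(A_list, t, h, K) {
  wm <- weighted_mats(A_list, t, h)
  zB <- spectral_cluster(wm$B, K)       # labels just before t
  zA <- spectral_cluster(wm$A_bar, K)   # labels from t onwards
  norm(omega_hat(wm$B, zB, K) - omega_hat(wm$A_bar, zA, K), type = "F")
}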
The procedure for determining whether a candidate t is indeed a change-point,
described in Algorithm 3.2 below, incorporates the following aspects of the observed
data $\{A(t)\}_{t=1}^{T}$ and the DSBM defined in equations (3.1) – (3.3). Firstly, as a result of our flexible dynamic stochastic blockmodel framework, the test statistic $d(t)$ is able to detect structural changes in $\theta(t)$, $Z(t)$, or both.
Algorithm 3.2 Kernel Change-Points (KCP)
1. Input: Observed temporal series of adjacency matrices $\{A(t)\}_{t=1}^{T}$. Pre-specified number of communities $K$ and significance level $\alpha$.
2. For every candidate change-point $t \in \{2, \dots, T\}$:
(a) Compute the weighted before and after adjacency matrices $B(t)$, $\bar{A}(t)$ from (3.8).
(b) For the provided $K$, apply the spectral clustering algorithm on $B(t)$ and $\bar{A}(t)$.
(c) Follow the profile likelihood approach to produce estimates $\hat{\Omega}(B(t))$, $\hat{\Omega}(\bar{A}(t))$ of the corresponding expectation matrices.
(d) Calculate the regularized distance test statistic $d(t)$ from (3.9).
(e) Define $\delta(t) := 1$ if $d(t) \ge c_{\alpha}(t)$, the upper $\alpha$-quantile of the distribution of $d(t)$ under the null hypothesis $H_0$; otherwise, let $\delta(t) := 0$.
3. Output: The estimated change-points $\{t : t \in \{2, \dots, T\},\ \delta(t) = 1\}$ of the dynamic network.
Our approach is thus more flexible than the scan statistics procedures presented in Wang et al. (2014), where
the proposed methodology only deals with the anomaly detection problem in which
only one diagonal entry from θ(t) increases in value at the presumed change-point.
Secondly, given the consistency result of spectral clustering on the mean adjacency
matrix (3.7) presented in Han et al. (2015), and considering the flexibility provided
by the Gaussian kernel family Kt,i,h, our proposed test statistic d(t) both success-
fully aggregates information from the different snapshots to produce more accurate
estimates (θ(t), Z(t)), and provides a regularized distance between adjacent networks
which is essential for detecting multiple change-points along G(t)Tt=1.
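As an end-to-end illustration under the assumptions of the sketches above, the snippet below simulates a DSBM with a single change in $\theta$ and traces $d(t)$ over the admissible candidates; the thresholding step (e) against $c_{\alpha}(t)$ is deliberately omitted, as its computation is discussed in Section 3.3.3.4. All function names come from the earlier sketches.

set.seed(1)
N <- 60; K <- 2; T_len <- 50; t_star <- 25
z <- rep(1:2, each = N / 2)                         # fixed block labels
theta0 <- matrix(c(0.30, 0.05, 0.05, 0.30), 2, 2)   # null dynamics
thetaA <- matrix(c(0.30, 0.15, 0.15, 0.30), 2, 2)   # alternative dynamics
A_list <- lapply(1:T_len, function(t)
  sample_snapshot(if (t <= t_star) theta0 else thetaA, z))

h_star <- 1.06 * T_len^(-1/5)                       # bandwidth, Section 3.3.3.3
t_min  <- 5                                         # exclude first/last snapshots
cands  <- (t_min + 1):(T_len - t_min)
d_trace <- sapply(cands, function(t) d_stat(A_list, t, h_star, K))
cands[which.max(d_trace)]                           # expected to sit near t*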
Dynamic relational network data represent complex phenomena. Through our
test statistic (3.9), we aim to construct a simple procedure capable of detecting time-
varying structure in real-world network systems, effectively segmenting the observed data $\{A(t)\}_{t=1}^{T}$ into different behavioral patterns. If the desired network analysis
requires attention to more subtle features that emerge from every given snapshot to
the next, then a change-point model like the one proposed in this chapter does not
seem appropriate, with the state-space models of Xing et al. (2010) and Xu and Hero
(2014) being more fitting in this situation.
3.3.3 Implementation Details
In this subsection, we clarify several implementation details related to Algorithm 3.2
when applied to real-world dynamic network data sets.
3.3.3.1 Selecting the Number of Communities K
From our definition of the dynamic stochastic blockmodel in equations (3.1) – (3.3),
we have treated the number of communities K as a pre-specified model input. While
the problem of selecting the true community number K remains largely unexplored
in the DSBM, some recent progress has been made in the case of static stochastic
blockmodels (Chen and Lei, 2014; Saldana et al., 2015). We offer three main alternatives
for specifying K in Algorithm 3.2. The first involves applying strategies such
as the composite likelihood BIC approach of Saldana et al. (2015), or the network
cross-validation of Chen and Lei (2014), to the mean adjacency matrix (3.7) in order
to determine a plausible choice of K. While nodes are allowed to change communities in
our DSBM framework, the input K remains fixed throughout the temporal series, so
aggregation can only be expected to help these methods consistently
estimate the total number of communities.
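As a hedged illustration of this first alternative, the sketch below applies a simple eigengap heuristic to the mean adjacency matrix as a stand-in for the CL-BIC and network cross-validation criteria cited above, which are not reproduced here; the function name is hypothetical.

```r
# Estimate K from the mean adjacency matrix: average the snapshots, then pick
# the K with the largest gap between consecutive eigenvalue magnitudes.
# A.list: list of N x N adjacency matrices; requires N > K.max.
estimate_K <- function(A.list, K.max = 10) {
  A.bar <- Reduce(`+`, A.list) / length(A.list)
  ev <- sort(abs(eigen(A.bar, symmetric = TRUE, only.values = TRUE)$values),
             decreasing = TRUE)
  which.max(ev[1:K.max] - ev[2:(K.max + 1)])
}
```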
The second approach, proposed in Cribben and Yu (2015), consists of specifying an
over-estimated value for the true K. Since an under-estimated value misses important
directions in the spectral clustering algorithm, the authors show that this approach
yields better change-point detection results than the correct K in their provided
simulation settings.
Although more computationally demanding, a third alternative is to provide a
different value for K at each snapshot. While our DSBM framework (3.1)–(3.3) does
not explicitly model a change in the number of communities, this approach provides
a slight generalization of our initial assumptions, in which we seek to detect a
change-point whenever the number of available communities changes from one snapshot to
the next. Existing methods like the ones described above for estimating the true K
at fixed snapshots typically require additional computational resources, thus posing
a trade-off between modeling flexibility and computation time.
3.3.3.2 Candidate Change-Points
As mentioned in Section 3.2.2, we bar adjacent time points from being labeled as
change-points. Thus, whenever the null hypothesis in (3.4) is rejected for three consecutive
candidate snapshots (i.e., δ(t) = 1 in Algorithm 3.2), we take the one with
the maximum value of d(t) as the true change-point, disregarding the other two candidates.
Additionally, in order to improve the finite sample behavior of the weighted
adjacency matrices B(t) and A(t), we exclude the first and last t_min snapshots as
candidate change-points. In our simulation settings and real-data analysis, we choose
t_min = 5 and 10, respectively, for simplicity.
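One possible implementation of this pruning rule, collapsing each maximal run of consecutive rejections to the snapshot with the largest d(t), is sketched below; the function name and the exact treatment of runs longer than three reflect our own reading of the rule.

```r
# delta: 0/1 rejection indicators by snapshot; d: test statistic values.
# Returns the retained change-point locations.
prune_changepoints <- function(delta, d) {
  rej <- which(delta == 1)
  if (length(rej) == 0) return(integer(0))
  runs <- split(rej, cumsum(c(1, diff(rej) != 1)))  # maximal consecutive runs
  unname(sapply(runs, function(run) run[which.max(d[run])]))
}
```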
3.3.3.3 Choice of Bandwidth h
Following Silverman's rule of thumb (Silverman, 1986), we select the kernel bandwidth
as h* = 1.06 × T^{−1/5}. Although neither B(t) nor A(t) in (3.8) is computed from
exactly T data points, nor is the sequence {A(t)}_{t=1}^T independent and identically
distributed, we find that this choice of h* performs favorably in the numerical experiments
of the next section.
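A minimal sketch of this choice is given below. The precise form of the kernel weights in (3.8) is not reproduced in this subsection, so the Gaussian weighting shown here, normalized over the snapshots it covers, is an assumption; the names are hypothetical.

```r
# Silverman-style rule-of-thumb bandwidth for a series of T snapshots.
T.len <- 50
h.star <- 1.06 * T.len^(-1/5)

# Hypothetical Gaussian kernel weights centered at candidate change-point t,
# evaluated over snapshot indices idx and rescaled to sum to one.
kernel_weights <- function(t, idx, h, T.len) {
  w <- exp(-0.5 * ((idx - t) / (h * T.len))^2)
  w / sum(w)
}
```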
3.3.3.4 Quantile c_α(t) Computation

The test statistic d(t) in (3.9) depends on the results of the spectral clustering
algorithm applied to B(t) and A(t), on the profile maximum likelihood estimators, as
well as on the number and location of the true change-points of the temporal series
{G(t)}_{t≥1}. As such, exact finite-sample or asymptotic distributional results for d(t) are
difficult to come by, thus requiring the use of Monte Carlo replicates or bootstrap
schemes (Cribben and Yu, 2015) to conduct the change-point inferential procedures.
Our data-driven approach to estimating the quantile c_α(t) of Algorithm 3.2 works
as follows. We first note that Ω̂(B(t)) and Ω̂(A(t)) both parametrize fixed-snapshot
stochastic blockmodels from which we can simulate independent and identically
distributed sequences of adjacency matrices {B(τ)}_{τ=1}^T and {A(τ)}_{τ=1}^T, respectively.
Note that, as opposed to our observed temporal series {A(t)}_{t=1}^T, there are no change-points
present in either of these two simulated sequences; their properties are therefore
crucial in describing the behavior of d(t) under the null hypothesis of no change-point
at t. Additionally, for the simulated sequences {B(τ)}_{τ=1}^T and {A(τ)}_{τ=1}^T, the associated
before and after weighted adjacency matrices can be computed, giving rise to regularized
distances d^{(1)}(t) and d^{(2)}(t) for each candidate change-point. After following
this approach for q Monte Carlo replicates, we are able to compute the corresponding
upper αth quantiles c_α^{(1)}(t) and c_α^{(2)}(t) for each distance sequence, thus yielding our
estimated c_α(t) as:

c_α(t) := max{c_α^{(1)}(t), c_α^{(2)}(t)}.   (3.10)
We summarize this fully data-driven procedure in Algorithm 3.3 below.
Algorithm 3.3 Quantile c_α(t) Computation

1. Input: Observed temporal series of adjacency matrices {A(t)}_{t=1}^T. Significance level α.

2. For every candidate change-point t ∈ {2, . . . , T}:

   (a) Compute the estimated expectation matrices Ω̂(B(t)), Ω̂(A(t)) based on B(t), A(t).

   (b) For i = 1, . . . , q Monte Carlo replicates:

       (i) Generate B_1^{(i)}, . . . , B_T^{(i)} ∼ SBM(Ω̂(B(t))) and A_1^{(i)}, . . . , A_T^{(i)} ∼ SBM(Ω̂(A(t))).

       (ii) Calculate the weighted adjacency matrices B^{(i)}(t) and A^{(i)}(t) based on (3.8).

       (iii) Compute the associated regularized distances d_i^{(1)}(t) and d_i^{(2)}(t).

   (c) Define c_α^{(1)}(t) and c_α^{(2)}(t) as the upper αth quantiles of each distance sequence.

3. Output: The estimated upper αth quantile c_α(t) of d(t) under H0, as given in (3.10).
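A minimal R sketch of the Monte Carlo step of Algorithm 3.3 follows; sample_from_Omega and kcp_distance are hypothetical placeholders for the SBM sampler and for the computation of d(t) in (3.9) on a simulated series.

```r
# Estimate the upper alpha-th quantile of d(t) under H0 by simulating q
# change-point-free series from a fitted expectation matrix Omega.
mc_quantile <- function(Omega, T.len, t, alpha, q,
                        sample_from_Omega, kcp_distance) {
  d.sim <- replicate(q, {
    A.sim <- lapply(seq_len(T.len), function(s) sample_from_Omega(Omega))
    kcp_distance(A.sim, t)   # d(t) recomputed on a null series
  })
  unname(quantile(d.sim, probs = 1 - alpha))
}
```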
3.4 Experiments

In this section, we present a comprehensive numerical study of the temporal segmentation
properties of our Kernel Change-Points (KCP) detection approach for different
choices of kernel bandwidth. We perform our experiments under two major synthetic
network structures for the underlying dynamic stochastic blockmodels, and compare
our methodology with the scan statistics approach of Wang et al. (2014) on the Enron
email data set (Priebe et al., 2005).
3.4.1 Simulations
We study two different examples of network-structure generating mechanisms, where,
for the observed data {A(t)}_{t=1}^T, we consider a time-evolving θ(t) while keeping Z(t)
fixed, as well as the opposite scenario of a fixed θ(t) with time-varying Z(t). We
demonstrate the robustness of our approach by covering both sparse and non-sparse
network settings.
In each experiment, carried out over 100 randomly generated dynamic stochastic
blockmodels for the temporal series {A(t)}_{t=1}^T, we fix the number of network snapshots
at T = 50 and introduce M = 3 distinct change-points at locations (t_1^*, t_2^*, t_3^*) =
(15, 30, 40). While the number of available vertices N = 900 and the number of
communities K = 3 are fixed along the sequence {A(t)}_{t=1}^T, the community sizes are
allowed to vary only in our second experiment. Following Algorithm 3.3, for each
candidate change-point t ∈ {2, . . . , 50}, we generate q = 50 Monte Carlo replicates
to compute the estimated quantile c_α(t) given in (3.10).
Simulation 1: In this fixed Z(t) := Z scenario, the community sizes are 300, 300,
and 300. Accordingly, we generate {A(t)}_{t=1}^{50} following a dynamic stochastic blockmodel
with expectation matrix Ω(t) = Zθ(t)Z′. At times (t_1^*, t_2^*, t_3^*) = (15, 30, 40), we
introduce change-points that modify the underlying structure of the temporally evolving
networks according to

θ(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.02 \end{pmatrix} for t ≤ t_1^*,   θ(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.04 \end{pmatrix} for t_1^* < t ≤ t_2^*,

and

θ(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.05 \end{pmatrix} for t_2^* < t ≤ t_3^*,   θ(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.07 \end{pmatrix} for t > t_3^*.
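Under the parametrization Ω(t) = Zθ(t)Z′ above, a minimal R sketch for drawing one adjacency snapshot is given below; the function name is hypothetical and self-loops are excluded.

```r
# z: membership vector with values in 1:K; theta: K x K connectivity matrix.
sample_sbm <- function(z, theta) {
  N <- length(z)
  P <- theta[z, z]                    # expectation matrix Omega = Z theta Z'
  A <- matrix(0, N, N)
  up <- upper.tri(P)
  A[up] <- rbinom(sum(up), 1, P[up])  # independent Bernoulli edges
  A + t(A)                            # symmetric, zero diagonal
}
```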
Simulation 2: In this fixed θ(t) := θ setting, the time series {A(t)}_{t=1}^{50} is generated
according to the community assignments

Z(t): community sizes (300, 300, 300) for t ≤ t_1^*,   (295, 295, 310) for t_1^* < t ≤ t_2^*,

and

(285, 285, 330) for t_2^* < t ≤ t_3^*,   (250, 250, 400) for t > t_3^*,

where, in a slight abuse of notation, we mean that the third community in the time
series {G(t)}_{t=1}^{50} increases its size to 310, 330, and 400 at the designated change-points
(t_1^*, t_2^*, t_3^*) = (15, 30, 40), respectively. The class connection probability matrix is set to
θ_aa = 0.15 for a = 1, 2, 3 and θ_ab = 0.05 for 1 ≤ a < b ≤ 3.
Table 3.1: Performance of KCP over 100 repetitions from Simulations 1 and 2. Results
are given in the form of mean (standard deviation).

                            Simulation 1                  Simulation 2
Conf. Level    h        TP            FP             TP            FP
(1 - α)
90%          0.24    2.60 (0.57)   4.76 (1.36)    3.00 (0.00)   1.62 (1.05)
             0.48    2.87 (0.34)   3.10 (1.40)    3.00 (0.00)   1.99 (1.31)
             0.96    2.88 (0.33)   1.65 (1.25)    3.00 (0.00)   1.89 (1.23)
95%          0.24    2.56 (0.57)   4.08 (1.31)    3.00 (0.00)   1.00 (0.86)
             0.48    2.82 (0.39)   2.18 (1.23)    3.00 (0.00)   1.09 (1.15)
             0.96    2.83 (0.38)   0.90 (0.99)    3.00 (0.00)   0.98 (1.01)
99%          0.24    2.41 (0.59)   2.91 (1.37)    3.00 (0.00)   0.50 (0.63)
             0.48    2.72 (0.45)   1.16 (1.01)    3.00 (0.00)   0.28 (0.59)
             0.96    2.67 (0.47)   0.34 (0.64)    3.00 (0.00)   0.34 (0.62)

NOTE: TP, true positives; FP, false positives; α, significance level; h, choice of bandwidth.
To simplify matters, for purposes of change-point detection in this simulation
study, we compute the KCP test statistic (3.9) under a correctly specified choice of
K = 3. Additionally, we measure the robustness of our approach with the choices of
bandwidth h = 0.5h*, h*, and 2h*, where h* is the Silverman rule of thumb discussed
in Section 3.3.3. Along with the mean values of d(t) for each candidate change-point,
we plot in Figure 3.1 the mean estimated quantile c_α(t) at each location, thus depicting
the average rejection thresholds for the test statistics along the time-evolving networks
across all simulations. Results are summarized in Table 3.1 above.
As we can see from the true positive (TP) and false positive (FP) rates collected,
KCP performs reasonably well in both scenarios. In the “chatty” group setting of
Simulation 1, where the third community increases its network communication pattern
to an abnormal level, for α = 0.05 the highest true positives and smallest false
positives occur at bandwidth h = 0.96. While it may appear that KCP benefits from
network data aggregation by simply increasing the bandwidth h, the implicit bias-variance
tradeoff is already evident in Table 3.1 when the confidence level is 99%. As
shown in Figure 3.1, the change-point at t_2^* = 30 is considerably harder to detect in
this scenario, contributing the vast majority of the missed true positives in Table 3.1.
Nevertheless, and in spite of the poor performance of the spectral clustering algorithm
on sparse networks like the one in Simulation 1 (Qin and Rohe, 2013), these results
put KCP forward as a valuable novel tool for change-point detection in dynamic
network data.
The right panels of Figure 3.1 show the values of our test statistic d(t) easily
exceeding the estimated upper αth quantiles c_α(t) at the change-point locations
t = t_1^*, t_2^*, t_3^*, thus shifting the analysis of Simulation 2 exclusively to false positive
rates. This is confirmed by Table 3.1, where, under this setting, the choice of α = 0.01
naturally leads to the smallest false positive rates. Again, due to the bias-variance
tradeoff, the optimal choice of h = 0.48 yields the smallest number of false positives
across all randomly generated dynamic stochastic blockmodels. While the performance
of KCP is again fairly accurate at detecting change-points, panel (f) of Figure
3.1 underscores the importance of our interpretation of Definition 3.1, which forbids
two adjacent time points from being labeled as change-points, effectively reducing the
false positive rate by discarding neighboring points that also exceed the estimated
quantile c_α(t). Lastly, we would like to point out that, in an effort to measure the
false positive rate of KCP when no change-points are present in the time series
{A(t)}_{t=1}^{50}, we repeated Simulation 1 with θ_aa(t) = 0.02 for a = 1, 2, 3 and θ_ab(t) = 0.01
for 1 ≤ a < b ≤ 3, for t = 1, . . . , 50. For the choice of h = 0.48, only 1 out of 100 repetitions delivered
candidate change-points, with the rest correctly yielding zero false positives.
Although our “chatty” setting of Simulation 1 is inspired by the work of Wang
et al. (2014), we decided not to compare our KCP procedure with their scan statistics
approach, as their asymptotic power derivations are not easily extended to our choices
of (θ(t), Z(t)). Furthermore, their theoretical framework does not cover the case of
a fixed θ(t) with time-varying Z(t) presented here in Simulation 2.
3.4.2 Real Data Analysis
For our real data experiment, we use the Enron email corpus described in Priebe et al.
(2005); Wang et al. (2014). The data set consists of a time series of networks {G(t)}_{t=1}^T
with N = 184 vertices, taken over a period of T = 189 weeks from 1998 through 2002.
Consisting mostly of Enron executives, for each network, A_ij(t) = 1 if employee i
sends at least one email message to employee j (or vice versa) during the one-week
period t, and A_ij(t) = 0 otherwise. We refer interested readers to Priebe et al. (2005)
for a more detailed description of this email corpus.
For purposes of change-point detection on this data set, Wang et al. (2014) rely
on the idea of scan statistics (Glaz et al., 2001), which consist of vertex-dependent
normalizations and temporal normalizations of locality statistics. For a given network
G := (V, E) and V′ ⊂ V, denote by H(V′, G) the subgraph of G induced by V′.
Given a fixed time snapshot t, and for all k ≥ 0 and v ∈ V, Wang et al. (2014) define
a first locality statistic as

Ψ_{t;k}(v) = |E(H(N_k(v; G(t)), G(t)))|.

More specifically, the statistic Ψ_{t;k}(v) counts the number of edges in the subgraph of
G(t) induced by N_k(v; G(t)), the subset of vertices u at distance at most k from
v in G(t). Additionally, for given snapshots t and t′ with t′ ≤ t, Wang et al. (2014)
also define the locality statistic

Φ_{t,t′;k}(v) = |E(H(N_k(v; G(t)), G(t′)))|.
[Figure 3.1 panels: (a) time-evolving θ(t), h = 0.24; (b) time-evolving Z(t), h = 0.24; (c) time-evolving θ(t), h = 0.48; (d) time-evolving Z(t), h = 0.48; (e) time-evolving θ(t), h = 0.96; (f) time-evolving Z(t), h = 0.96. Each panel plots the test statistic d(t) against the time snapshot, together with the 90%, 95%, and 99% confidence levels.]

Figure 3.1: (Color online) Mean values for the test statistic d(t) and the estimated quantiles
c_α(t) for each candidate change-point in Simulations 1 and 2, where the choices of α = 0.10
(green), α = 0.05 (blue), and α = 0.01 (red) are considered.
In this case, the locality statistic Φ_{t,t′;k}(v) counts the number of edges in the subgraph
of G(t′) induced by N_k(v; G(t)). After a fixed number of τ ≥ 0 vertex-dependent
normalizations and λ ≥ 0 temporal normalizations, the resulting scan
statistics S_{τ,λ,k}(t; Ψ) and S_{τ,λ,k}(t; Φ) are employed by Wang et al. (2014) as a tool for
change-point detection. For some additional work on scan statistics, see the papers
Priebe et al. (2005, 2010) and references therein.
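For concreteness, a sketch of Ψ_{t;k}(v) using the igraph package is given below, assuming g.t holds the snapshot G(t); this is an illustrative reimplementation, not the code of Wang et al. (2014).

```r
library(igraph)
# Count the edges of the subgraph of g.t induced by the k-neighborhood of v.
psi_stat <- function(g.t, v, k) {
  nbhd <- ego(g.t, order = k, nodes = v)[[1]]  # vertices within distance k of v
  ecount(induced_subgraph(g.t, nbhd))
}
```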
We compare the performance of the scan statistic approach with our KCP methodology
in Figure 3.2 below. After truncating the first 40 weeks due to the τ = λ = 20
normalization steps, we analyze the remaining period of 149 weeks, from August 1999
to June 2002, for different values of k in the scan statistics S_{τ,λ,k}(t; ·), as well as different
choices of kernel bandwidth h for the KCP statistic d(t). For our KCP methodology,
in order to avoid cluttered displays, we only present results for a fixed K = 2 and
kernel bandwidths h = h*, h* + 0.09, and h* + 2·0.09. Following Wang et al. (2014),
change-points for the scan statistics are declared whenever S_{τ,λ,k}(t; ·) > 5. For
purposes of this numerical experiment, we select α = 0.01 in our KCP approach and
plot the estimated upper αth quantile c_α(t) based on q = 200 Monte Carlo replicates.
The red circles in Figure 3.2 correspond to change-points that were identified
by both algorithms, whereas the gold ones refer to the most representative
change-points discovered only by KCP. A summary of our findings, based
on the Enron Chronology (http://www.desdemonadespair.net/2010/09/bushenron-chronology.html),
follows below:
• Most scan statistics, as well as d(t) with h = 0.47 (panel (d)), signal a change-point
at t∗ = 58 in December 1999. This period coincides with the finalization
of a sham energy deal between Enron and Merrill Lynch in order to meet profit
expectations and boost the stock price.
• A second common change-point is t∗ = 94 during August 2000, where Enron
stock peaks at $90.56 and a large-scale insider selling scheme is underway to
profit while the price is still high.
[Figure 3.2 panels, left column: scan statistics S_{τ,λ,k}(t; Ψ) and S_{τ,λ,k}(t; Φ) for k = 0, 1, 2 (panels (a), (c), (e)); right column: KCP test statistic d(t) with K = 2 and its 99% confidence level, for h = 0.38, 0.47, 0.56 (panels (b), (d), (f)).]

Figure 3.2: (Color online) Scan statistic values and associated detected change-points for
the Enron email data set (left panels). Kernel Change-Points test statistic d(t) along with
the estimated quantiles c_α(t), with α = 0.01, for each candidate change-point (right panels).
Red circles indicate common discovered change-points between the two approaches, whereas
gold circles represent newly discovered change-points by the KCP statistic.
• The third common change-point is t∗ = 115 in December 2000. According to
the Enron Chronology, Jeff Skilling has just been promoted to CEO of Enron
and has begun selling 10,000 shares of stock weekly.

• The final common change-point between the scan statistic approach and KCP
occurs at t∗ = 145 during August 2001. This anomaly occurs as Enron CEO
Skilling steps down, citing personal reasons, amid public criticism over a number
of accounting and trading issues.
• Two further change-points are detected only by d(t) with h = 0.38. The first,
at t∗ = 67 in February 2000, coincides with a period in which Enron CFO
Andrew Fastow is involved in selling Enron subsidiaries at increased values
to bolster financial statements. Lastly, at t∗ = 137 during May 2001,
the anomaly occurs as Enron's largest foreign investment, the Dabhol Power
Company in India, shuts down amid heavy protests and environmental damage.
In conclusion, we note that while the theoretical framework of the scan statistic
approach does not support arbitrary time-varying (θ(t), Z(t)), the red circles in
Figure 3.2 demonstrate that in some instances both S_{τ,λ,k}(t; ·) and d(t) are able to
capture the same anomalous connectivity patterns. Furthermore, thanks to the adaptability
provided by the different kernel bandwidths, the KCP statistic d(t) is
able to detect relevant change-points overlooked by S_{τ,λ,k}(t; ·).
3.5 Discussion
The analysis and collection of network data observed over time has recently enjoyed
tremendous popularity among researchers trying to understand time-evolving
complex phenomena. In this chapter, we focus on the change-point detection problem
under the dynamic stochastic blockmodel framework. We develop a novel methodology,
KCP, advocating the use of kernel-weighted adjacency matrices for data
aggregation, and test statistics based on the spectral clustering algorithm for anomaly
detection. Our flexible dynamic stochastic blockmodel setting allows for a variety of
network models to be considered, while the adaptability of our KCP methodology through
different kernel bandwidths enables the detection of different types of change-points.
Model selection rates on synthetic network data demonstrate the robustness of our
approach in terms of change-point detection, with the Enron email corpus providing
an example of commensurate performance with competing methods on real-world
network data sets.
There are several interesting extensions of our current work. First, we can consider
even more sophisticated network models, such as a dynamic degree-corrected
blockmodel or mixed membership models, to carry out change-point detection. In
addition, given that the major computational bottleneck of our approach lies in calculating
the quantile c_α(t), we leave an online implementation using parallel computing as future
work. Finally, based on the classical analysis of kernel density estimation, a further
direction of investigation is to study the limiting properties and power characteristics of
the KCP test statistic.
Chapter 4
NC-Impute: Scalable Matrix
Completion with Nonconvex
Penalties
In this chapter, we study the popularly dubbed matrix completion problem in its
noisy setting, where the task is to “fill in” the unobserved entries of a matrix from
a small subset of observed entries, under the assumption that the underlying matrix
is (approximately) low-rank. The main contribution of this chapter is to present
a systematic computational study of nonconvex regularized matrix completion problems.
The present work complements and enhances our previous work on nuclear
norm regularized problems for matrix completion (Mazumder et al., 2010), and also
provides a systematic study of spectral thresholding operators. Inspired by Soft-Impute
(Mazumder et al., 2010; Hastie et al., 2016), we propose NC-Impute, an
algorithmic framework for computing a family of nonconvex penalized matrix completion
problems with warm-starts. Using structured low-rank SVD computations,
we demonstrate the computational scalability of our proposal for problems up to the
Netflix size (approximately, a 500,000 × 20,000 matrix with 10^8 observed entries).
We demonstrate that on a wide range of synthetic and real data instances, our proposed
nonconvex regularization framework leads to low-rank solutions with better
predictive performance when compared to convex nuclear norm problems. Implementations
of the algorithms proposed herein, written in the R programming language,
are made available on github.
4.1 Introduction
In several problems of contemporary interest arising, for instance, in recommender
system applications such as the Netflix Prize competition (see SIGKDD and
Netflix (2007)), the observed data is in the form of a large sparse matrix Y_ij, (i, j) ∈ Ω,
where Ω ⊂ {1, . . . , m} × {1, . . . , n} with |Ω| ≪ mn. Popularly dubbed the matrix
completion problem (Candes and Recht, 2009), the task is to predict the unobserved
entries, under the assumption that the underlying matrix is of low-rank. This leads
to the natural rank regularized optimization problem:

minimize_X  (1/2)‖PΩ(X − Y)‖_F^2 + λ rank(X),   (4.1)

where PΩ(X) denotes the projection of X onto the observed matrix indices Ω (and is
zero otherwise), and ‖·‖_F denotes the usual Frobenius norm of a matrix. Problem (4.1),
however, is computationally difficult due to the presence of the combinatorial rank
term (Chistov and Grigor'ev, 1984). A natural convexification (Fazel, 2002;
Recht et al., 2010) of rank(X) is ‖X‖_*, the nuclear norm of X, which leads to the
following surrogate of Problem (4.1):

minimize_X  (1/2)‖PΩ(X − Y)‖_F^2 + λ‖X‖_*.   (4.2)
Candes and Recht (2009); Candes and Plan (2010) show that under some assumptions
on the underlying “population” matrix, a solution to Problem (4.2) approximates
a solution to Problem (4.1) reasonably well. The estimator obtained from
Problem (4.2) enjoys several advantages: the nuclear norm shrinks the singular values
and simultaneously sets many of the singular values to zero, thereby encouraging
low-rank solutions. It is thus not surprising that Problem (4.2) has enjoyed a significant
amount of attention in the wider statistical community over the past several
years. There have been impressive advances in understanding its statistical properties
(Negahban and Wainwright, 2012; Candes and Plan, 2010; Recht et al., 2010;
Rohde and Tsybakov, 2011). Motivated by the work of Candes and Recht (2009);
Cai et al. (2010), the authors in Mazumder et al. (2010) proposed Soft-Impute,
an EM-flavored (Dempster et al., 1977) algorithm for optimizing Problem (4.2). For
some other computational work on developing scalable algorithms for this problem,
see the papers Hastie et al. (2016); Jaggi and Sulovsky (2010); Freund et al. (2015),
and references therein. Typical assumptions under which the nuclear norm works as
a good proxy for the low-rank problem require the entries of the singular vectors of
the “true” low-rank matrix to be sufficiently spread out, and the missingness pattern to be
roughly uniform. The proportion of observed entries also needs to be sufficiently larger
than the number of parameters of the matrix, O((m + n)r), where r denotes the
rank of the true underlying matrix. Negahban and Wainwright (2012) propose improvements
with a (convex) weighted nuclear norm penalty, in addition to spikiness
constraints, for the noisy matrix completion problem.
The nuclear norm penalization framework, however, has limitations. If the conditions
mentioned above fail, Problem (4.2) may fall short of delivering reliably low-rank
estimators with good prediction properties (on the missing entries). In addition, since
the nuclear norm shrinks the singular values, in order to obtain an estimator with
good explanatory power, it often results in a matrix estimator with high numerical
rank, thereby leading to models of higher rank than might be desirable.
The limitations mentioned above, however, should not come as a surprise to
an expert, especially if one draws a parallel to the Lasso (Tibshirani,
1996), a popular sparsity-inducing shrinkage mechanism effectively used in the context
of sparse linear modeling and regression. In the linear regression context, the Lasso
often leads to dense models and suffers when the features are highly correlated; the
limitations of the Lasso are quite well known in the statistics literature, and there
have been major strides in moving beyond the convex ℓ1-penalty to more aggressive
forms of nonconvex penalties (Fan and Li, 2001; Mazumder et al., 2011; Bertsimas
et al., 2016; Zou and Li, 2008; Zhang, 2010; Zhang et al., 2012). The key principle
in these methods is the use of nonconvex regularizers that better approximate the
ℓ0-penalty, leading to possibly nonconvex estimation problems. Thusly motivated,
we study herein the following family of nonconvex regularized estimators for the task
of (noisy) matrix completion:
minimize_X  f(X) := (1/2)‖PΩ(X − Y)‖_F^2 + ∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ),   (4.3)

where σ_i(X), i ≥ 1, are the singular values of X_{m×n}, and σ → P(σ; λ, γ) is a concave
penalty function on [0, ∞). We will denote an estimator obtained from (4.3)
by X̂_{λ,γ}. The family of penalty functions P(σ; λ, γ) is indexed by the parameters
(λ, γ); together they control the amount of nonconvexity and shrinkage — see, for
example, Mazumder et al. (2011); Zhang et al. (2012), and also Section 4.2 herein, for
examples of such nonconvex families.
A caveat in considering problems of the form (4.3) is that they lead to nonconvex
optimization problems, and thus obtaining a certifiably optimal global minimizer is
generally difficult. Fairly recently, Bertsimas et al. (2016); Mazumder and Radchenko
(2015) have shown that subset selection problems in sparse linear regression can
be solved using advances in mixed integer quadratic optimization. Such global
optimization methods, however, do not apply to matrix variate problems involving
spectral¹ penalties, as in Problems (4.1) or (4.3). The main focus of our work herein is
to develop a computationally scalable algorithmic framework that allows us to obtain
high-quality stationary points or upper bounds² for Problem (4.3) — we obtain a
path of solutions X̂_{λ,γ} across a grid of values of (λ, γ) for Problem (4.3) by employing
warm-starts, following the path-following scheme proposed in Mazumder et al. (2010).
Leveraging problem structure and modern advances in computationally scalable low-rank
SVDs, as successfully employed in Mazumder et al. (2010); Hastie et al. (2016),
we empirically demonstrate the computational scalability of our method for problems
of the size of the Netflix data set, a matrix of size (approx.) 480,000 × 18,000
with ∼10^8 observed entries. Perhaps most importantly, we demonstrate empirically
that the resultant estimators lead to better statistical properties (i.e., the estimators
have lower rank and enjoy better prediction performance) over nuclear norm based
estimates, on a variety of problem instances.

¹We say that a function is a spectral function of a matrix X if it depends only upon the singular
values of X. The state of the art algorithmics for mixed integer semidefinite optimization problems
is in its nascent stage, and not nearly as advanced as the technology for mixed integer quadratic
optimization.

²Since the problems under consideration are nonconvex, our methods are not guaranteed to reach
the global minimum; we thus refer to the solutions obtained as upper bounds. In many synthetic
examples, however, the solutions are indeed seen to be globally optimal. We do show rigorously,
however, that these solutions are first order stationary points for the optimization problems under
consideration.
4.1.1 Contributions and Outline
The main contributions of this chapter can be summarized as follows:
• We propose a computational framework for nonconvex penalized matrix completion
problems of the form (4.3). Our algorithm, NC-Impute, is an adaptation
of the EM-stylized procedure Soft-Impute (Mazumder et al., 2010) to more
general nonconvex penalized thresholding operators.
• We present an in-depth investigation of the nonconvex spectral thresholding
operators, which form the main building block of our algorithm. We also study
their effective degrees of freedom, which provide a simple and intuitive way to
calibrate the two-dimensional grid of tuning parameters, extending the method
proposed in nonconvex penalized regression by Mazumder et al. (2011).
• We provide comprehensive computational guarantees of our algorithm in terms
of the number of iterations needed to reach a first order stationary point and the
asymptotic convergence of the sequence of estimates produced by NC-Impute.
• Every iteration of NC-Impute requires the computation of a low-rank SVD
of a structured matrix, for which we propose new methods. Using efficient
warm-start tricks to speed up the low-rank computations, we demonstrate the
effectiveness of our proposal on large scale instances up to the Netflix size, in
reasonable computation times.
• Over a wide range of synthetic and real-data examples, we show that our proposed
nonconvex penalized framework leads to high quality solutions with excellent
statistical properties, which are often found to be significantly better than
nuclear norm regularized solutions in terms of producing low-rank solutions
with good predictive performances.
• Implementations of our algorithms in the R programming language have been
made publicly available on github at the following url: https://github.com/diegofrasal/ncImpute.
The remainder of the chapter is organized as follows. Section 4.2 studies several
properties of nonconvex spectral penalties and associated spectral thresholding operators,
including their effective degrees of freedom. Section 4.3 describes our algorithmic
framework NC-Impute and studies the convergence properties of the algorithm. Section
4.4 presents numerical experiments demonstrating the usefulness of nonconvex
penalized estimation procedures in terms of superior statistical properties on several
synthetic data sets — we also show the usefulness of these estimators on several real
data instances. Section 4.5 contains the conclusions. To improve readability, some
technical material is relegated to Appendix A.
Notation: For a matrix A_{m×n}, we denote its (i, j)th entry by a_ij. PΩ(A) is the
matrix whose (i, j)th entry is given by a_ij for (i, j) ∈ Ω and zero otherwise, where
Ω ⊂ {1, . . . , m} × {1, . . . , n}. We use the notation P⊥Ω(A) = A − PΩ(A) to denote
projection onto the complement of Ω. Let σ_i(A), i = 1, . . . , max{m, n}, denote the
singular values of A, with σ_i(A) ≥ σ_{i+1}(A) (for all i); we will use the notation σ(A)
to denote the vector of singular values. When clear from the context, we will simply
write σ instead of σ(A). For a vector a = (a_1, . . . , a_n) ∈ R^n, we will use the notation
diag(a) to denote the n × n diagonal matrix with ith diagonal entry a_i.
4.2 Spectral Thresholding Operators
We begin our analysis by considering the fully observed version of Problem (4.3),
given by:

S_{λ,γ}(Z) ∈ arg min_X { g(X) := (1/2)‖X − Z‖_F^2 + ∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ) },   (4.4)

where, for a given matrix Z, a minimizer S_{λ,γ}(Z) of the function g(X) is the
spectral thresholding operator induced by the spectral penalty function
∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ). Suppose U diag(σ) V′ denotes the SVD of Z. For
the nuclear norm regularized problem, with P(σ_i(X); λ, γ) = λσ_i(X), the thresholding
operator, denoted by S_{λ,ℓ1}(Z) (say), is given by the familiar soft-thresholding
operator (Cai et al., 2010; Mazumder et al., 2010):

S_{λ,ℓ1}(Z) := U diag(s_{λ,ℓ1}(σ)) V′, with s_{λ,ℓ1}(σ_i) := (σ_i − λ)_+,   (4.5)

where (·)_+ = max{·, 0} and s_{λ,ℓ1}(σ_i) is the ith entry of s_{λ,ℓ1}(σ) (due to the separability
of the thresholding operator). Here, s_{λ,ℓ1}(σ) is the soft-thresholding operator
on the singular values of Z, and plays a crucial role in the Soft-Impute algorithm
(Mazumder et al., 2010). For the rank regularized problem, with penalty given
by P(σ_i(X); λ, γ) = λ1(σ_i(X) > 0), the thresholding operator, denoted by S_{λ,ℓ0}(Z), is
given by the familiar hard-thresholding operator (Mazumder et al., 2010):

S_{λ,ℓ0}(Z) := U diag(s_{λ,ℓ0}(σ)) V′, with s_{λ,ℓ0}(σ_i) = σ_i 1(σ_i > √(2λ)).   (4.6)
A closely related thresholding operator, which retains the top r singular values and sets
the remaining ones to zero, formed the basis of the Hard-Impute algorithm in Mazumder
et al. (2010); Troyanskaya et al. (2001). The results in (4.5) and (4.6) suggest a
curious link: the spectral thresholding operators (for the two specific choices of the
spectral penalty functions given above) are tied to corresponding thresholding
functions that operate only on the singular values of the matrix; in other words,
the operators S_{λ,ℓ1}(Z) and S_{λ,ℓ0}(Z) do not affect the singular vectors of the matrix Z. It
turns out that a similar result holds true for more general spectral penalty functions
P(·; λ, γ), as the following proposition illustrates.
Proposition 4.1. Let Z = U diag(σ) V′ denote the SVD of Z, and let s_{λ,γ}(σ) denote the
following thresholding operator on the singular values of Z:

s_{λ,γ}(σ) ∈ arg min_{α≥0} { g(α) := (1/2)‖α − σ‖_2^2 + ∑_{i=1}^{min{m,n}} P(α_i; λ, γ) }.   (4.7)

Then S_{λ,γ}(Z) = U diag(s_{λ,γ}(σ)) V′.
Proof. Note that by the Wielandt-Hoffman inequality (Horn and Johnson, 2012), we
have that ‖X − Z‖_F^2 ≥ ‖σ(X) − σ(Z)‖_2^2, where, for a vector a ∈ R^m, ‖a‖_2 denotes
the standard Euclidean norm. Equality holds when X and Z share the same left and
right singular vectors. This leads to:

(1/2)‖X − Z‖_F^2 + ∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ) ≥ (1/2)‖σ(X) − σ(Z)‖_2^2 + ∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ).   (4.8)

In the above inequality, note that the left hand side is g(X) and the right hand side is
g(σ(X)). It follows that

min_X g(X) ≥ min_{σ(X)} g(σ(X)) = g(s_{λ,γ}(σ)),   (4.9)

where we used the observation that σ(X) ≥ 0 and that s_{λ,γ}(σ), as defined in (4.7),
minimizes g(σ(X)). In addition, this minimum is attained by the function g(X), at
the choice X = U diag(s_{λ,γ}(σ)) V′. This completes the proof of the proposition.
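Proposition 4.1 reduces spectral thresholding to a scalar operation on the singular values, which translates directly into the following R sketch; spectral_threshold is a hypothetical name, and s.fun stands for any univariate operator s_{λ,γ}.

```r
# Apply a scalar thresholding rule to the singular values of Z and rebuild
# U diag(s(sigma)) V', as in Proposition 4.1.
spectral_threshold <- function(Z, s.fun) {
  sv <- svd(Z)
  d.thr <- s.fun(sv$d)          # threshold the singular values only
  sv$u %*% (d.thr * t(sv$v))    # rows of t(V) scaled by the thresholded values
}
```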
Due to the separability of the optimization problem (4.7) across the coordinates,
i.e., g(α) = ∑_i g_i(α_i), it suffices to consider each of the subproblems separately. Let
s_{λ,γ}(σ_i) denote a minimizer of g_i(α) over α ≥ 0, i.e.,

s_{λ,γ}(σ_i) ∈ arg min_{α≥0} { g_i(α) := (1/2)(α − σ_i)^2 + P(α; λ, γ) }.   (4.10)

It is easy to see that the ith coordinate of s_{λ,γ}(σ) is given by s_{λ,γ}(σ_i). This discussion
suggests that our understanding of the spectral thresholding operator S_{λ,γ}(Z)
is intimately tied to the univariate thresholding operator (4.10). Thusly motivated,
in the following we present a concise discussion of univariate penalty functions
and the resultant thresholding operators. We begin with some examples of concave
penalties that are popularly used in the statistics literature in the context of sparse
linear modeling.
Families of Nonconvex Penalty Functions: Several types of nonconvex penalties
are popularly used in the high-dimensional regression framework — see, for example,
Nikolova (2000); Lv and Fan (2009); Zhang et al. (2012). Note that for our setup,
since these penalty functions operate on the singular values of a matrix, it suffices to
consider nonconvex functions that are defined only on the nonnegative real numbers.
We present a few examples below:

• The ℓγ penalty (Frank and Friedman, 1993), given by P(σ; λ, γ) = λσ^γ (0 ≤ γ < 1) on σ ≥ 0.

• The SCAD penalty (Fan and Li, 2001), defined via

  P′(σ; λ, γ) = λ1(σ ≤ λ) + \frac{(γλ − σ)_+}{γ − 1} 1(σ > λ), for σ ≥ 0, γ > 2,

  where P′(σ; λ, γ) denotes the derivative of σ → P(σ; λ, γ) on σ ≥ 0, with P(0; λ, γ) = 0.

• The MC+ penalty (Zhang, 2010; Mazumder et al., 2011), defined as

  P(σ; λ, γ) = λ (σ − σ^2/(2λγ)) 1(0 ≤ σ < λγ) + (λ^2 γ/2) 1(σ ≥ λγ),

  for λ ≥ 0, γ ≥ 0.

• The log-penalty, with P(σ; λ, γ) = λ log(γσ + 1)/log(γ + 1), for γ > 0 and λ ≥ 0.
Figure 4.1 shows some members of the above nonconvex penalty families. The ℓγ
penalty function is not differentiable at σ = 0, due to the unboundedness of P′(σ; λ, γ)
as σ → 0+. The non-zero derivative at σ = 0+ encourages sparsity in the estimated
σ. The ℓγ penalty functions show a clear transition from the ℓ1 penalty, i.e., σ, to
the ℓ0 penalty, i.e., 1(σ > 0); similarly, the resultant thresholding operators show a
passage from the soft-thresholding to the hard-thresholding operator. Let us examine
the analytic form of the thresholding function induced by the MC+ penalty (for any
γ > 1):

s_{λ,γ}(σ) = \begin{cases} 0, & \text{if } σ ≤ λ \\ \frac{σ − λ}{1 − 1/γ}, & \text{if } λ < σ ≤ λγ \\ σ, & \text{if } σ > λγ. \end{cases}   (4.11)

It is interesting to note that for the MC+ penalty, the derivatives are all bounded
and the thresholding functions are continuous for all γ > 1. As γ → ∞, the
thresholding operator (4.11) coincides with the soft-thresholding operator. However, as
γ → 1+, the thresholding operator approaches the discontinuous hard-thresholding
operator σ1(σ ≥ λ); this is illustrated in Figure 4.1 and can also be observed by
inspecting (4.11). Note that the ℓ1 penalty penalizes small and large singular values
in a similar fashion, thereby incurring an increased bias in estimating the larger
coefficients. The MC+ and SCAD penalties, on the other hand, penalize the
larger coefficients less severely than the ℓ1 penalty, while simultaneously penalizing
the smaller coefficients in a manner similar to that of the ℓ1 penalty. Lastly,
the ℓγ penalty (for small values of γ) imposes a more severe penalty for values
of σ ≈ 0, quite different from the behavior of the other penalties.
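The MC+ operator in (4.11) is straightforward to code; combined with the SVD-based sketch given after Proposition 4.1, it yields the corresponding spectral thresholding operator.

```r
# Scalar MC+ thresholding operator (4.11), valid for gamma > 1 (vectorized).
s_mcplus <- function(sigma, lambda, gamma) {
  ifelse(sigma <= lambda, 0,
         ifelse(sigma <= lambda * gamma,
                (sigma - lambda) / (1 - 1 / gamma),
                sigma))
}

# Example: S_{lambda,gamma}(Z) via the earlier sketch.
# Z.thr <- spectral_threshold(Z, function(s) s_mcplus(s, lambda = 1, gamma = 2))
```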
[Figure 4.1 panels, top row: the ℓγ, SCAD, and MC+ penalties as functions of σ for several values of γ; bottom row: the corresponding threshold operators.]

Figure 4.1: [Top panel] Examples of nonconvex penalties σ → P(σ; λ, γ) with λ = 1
for different values of γ. [Bottom panel] The corresponding scalar thresholding operators:
σ → s_{λ,γ}(σ). At σ = 1, some of the thresholding operators corresponding to the ℓγ penalty
function are discontinuous, and some of the other thresholding functions are “close” to
being so.
4.2.1 Properties of Spectral Thresholding Operators

The nonconvex penalty functions described in the previous section are concave functions
on the nonnegative real line. We will now discuss measures that characterize
the amount of concavity in these functions. For a univariate penalty function
α → P(α; λ, γ) on α ≥ 0, assumed to be differentiable on (0, ∞), we introduce
the following quantity φ_P, which measures the amount of concavity (see also Zhang
(2010)) of P(α; λ, γ):

φ_P := inf_{α,α′>0} \frac{P′(α; λ, γ) − P′(α′; λ, γ)}{α − α′},   (4.12)
where P′(α; λ, γ) denotes the derivative of P(α; λ, γ) with respect to α on α > 0.
We say that the function g(X) is τ-strongly convex if the following condition holds:

g(X̃) ≥ g(X) + ⟨∇g(X), X̃ − X⟩ + (τ/2)‖X̃ − X‖_F^2,   (4.13)

for some τ ≥ 0 and all X, X̃. In (4.13), ∇g(X) denotes any subgradient (assuming
it exists) of g(X) at X. If τ = 0, then the function is simply convex³. Using standard
properties of spectral functions (Borwein and Lewis, 2006; Lewis, 1995), it follows
that g(X) is τ-strongly convex iff the vector function

g(α) = (1/2)‖α − σ(Z)‖_2^2 + ∑_{i=1}^{min{m,n}} P(α_i; λ, γ)   (4.14)

is τ-strongly convex on {α : α ≥ 0}, where σ(Z) denotes the singular values of Z.
Let us recall the separable decomposition g(α) = ∑_i g_i(α_i), with g_i(α) as defined
in (4.10). Clearly, the function α → g(α) is τ-strongly convex (on the nonnegative
reals) iff each summand g_i(α_i) is τ-strongly convex on α_i ≥ 0. Towards this end,
notice that g_i(α) is convex on α ≥ 0 iff 1 + φ_P ≥ 0; in particular, g_i(α) is τ-strongly
convex with parameter τ = 1 + φ_P, provided this number is nonnegative. In this vein,
we have the following proposition:

³Note that we consider τ ≥ 0 in the definition so that it includes the case of convexity.
Proposition 4.2. Suppose φ_P > −1. Then the function X → g(X) is τ-strongly
convex with τ = 1 + φ_P > 0.

For the MC+ penalty, the condition τ = 1 + φ_P > 0 is equivalent to γ > 1. For
the ℓγ penalty function with γ < 1, the parameter τ = −∞.
Proposition 4.3. Suppose 1 + φ_P > 0. Then Z → S_{λ,γ}(Z) is Lipschitz continuous
with constant 1/(1 + φ_P):

‖S_{λ,γ}(Z_1) − S_{λ,γ}(Z_2)‖_F ≤ \frac{1}{1 + φ_P} ‖Z_1 − Z_2‖_F.   (4.15)
Proof. We rewrite g(X) as:

g(X) = { (1/2)‖X − Z‖_F^2 − (ψ/2)‖X‖_F^2 } + { ∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ) + (ψ/2)‖X‖_F^2 }.   (4.16)

We have that ‖X‖_F^2 = ∑_{i=1}^{min{m,n}} σ_i^2(X). Using the shorthand notation
P̄(σ_i(X)) = P(σ_i(X); λ, γ) + (ψ/2)σ_i^2(X), and rearranging the terms in g(X), it
follows that S_{λ,γ}(Z), a minimizer of g(X), is given by:

S_{λ,γ}(Z) ∈ arg min_X { \frac{1 − ψ}{2} ‖X − \frac{1}{1 − ψ} Z‖_F^2 + ∑_{i=1}^{min{m,n}} P̄(σ_i(X)) }.   (4.17)

If ψ + φ_P > 0, the function σ_i → P̄(σ_i) is convex for every i. If 1 − ψ > 0, then
the first term appearing in the objective function in (4.17) is convex. Thus, assuming
ψ + φ_P > 0 and 1 − ψ > 0, both summands in the above objective function are convex. In
particular, the optimization problem (4.17) is convex, and Z → S_{λ,γ}(Z) can be viewed
as a convex proximal map (Rockafellar, 1970). Using standard contraction properties
of proximal maps, we have that:

‖S_{λ,γ}(Z_1) − S_{λ,γ}(Z_2)‖_F ≤ ‖\frac{Z_1}{1 − ψ} − \frac{Z_2}{1 − ψ}‖_F ≤ \frac{1}{1 − ψ} ‖Z_1 − Z_2‖_F.

Since the above holds true for any ψ chosen as above, optimizing over the values of
ψ for which Problem (4.17) remains convex gives ψ = −φ_P, i.e., 1/(1 − ψ) =
1/(1 + φ_P), thereby leading to (4.15).
4.2.2 Effective Degrees of Freedom for Spectral Thresholding Operators

The effective degrees of freedom, or df, is a popularly used statistical notion that
measures the amount of “fitting” performed by an estimator (Efron et al., 2004;
Hastie et al., 2009; Stein, 1981). In the case of classical linear regression, for example,
the df is simply given by the number of features used in the linear model. This notion
applies more generally to additive fitting procedures. Following Efron et al. (2004);
Stein (1981), let us consider an additive model of the form:

Z_ij = μ_ij + ε_ij, with ε_ij ~iid N(0, v^2), i = 1, . . . , m, j = 1, . . . , n.   (4.18)
The df of μ̂ := μ̂(Z), for the fully observed model (4.18), is given by:

df(μ̂) = ∑_{ij} Cov(μ̂_ij, Z_ij)/v^2,

where μ̂_ij denotes the (i, j)th entry of the matrix μ̂. For the particular case of a
spectral thresholding operator, we have μ̂ = S_{λ,γ}(Z). When Z → μ̂(Z) satisfies a weak
differentiability condition, the df may be computed via the divergence formula (Stein,
1981; Efron et al., 2004):

df(μ̂) = E(∇ · μ̂(Z)), where ∇ · μ̂(Z) = ∑_{ij} ∂μ̂_ij(Z)/∂Z_ij.   (4.19)

For the spectral thresholding operator S_{λ,γ}(·), expression (4.19) holds if the map
Z → S_{λ,γ}(Z) is Lipschitz, and hence weakly differentiable — see, for example, Candes
et al. (2013). In the light of Proposition 4.3, the map Z → S_{λ,γ}(Z) is Lipschitz when
φ_P + 1 > 0. Under the model (4.18), the singular values of Z will have a multiplicity
of one with probability one. With these assumptions in place, the divergence formula
for S_{λ,γ}(Z) can be obtained following Candes et al. (2013). We assume that the
univariate thresholding operators are differentiable, i.e., that s′_{λ,γ}(·) exists.
Proposition 4.4. Assume that 1 + φ_P > 0 and that the model (4.18) is in place. Then
the degrees of freedom of the estimator S_{λ,γ}(Z) are given by:

df(S_{λ,γ}(Z)) = E( ∑_i ( s′_{λ,γ}(σ_i) + |m − n| \frac{s_{λ,γ}(σ_i)}{σ_i} ) + 2 ∑_{i≠j} \frac{σ_i s_{λ,γ}(σ_i)}{σ_i^2 − σ_j^2} ),   (4.20)

where the σ_i's are the singular values of Z.
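A hedged Monte Carlo approximation of (4.20) for the MC+ penalty under the null model μ = 0, v = 1 is sketched below; the derivative used is the one implied by (4.11) for γ > 1, and the function name is hypothetical.

```r
# Average the divergence formula (4.20) over nrep Gaussian matrices.
df_mc <- function(m, n, lambda, gamma, nrep = 100) {
  s.fun   <- function(s) ifelse(s <= lambda, 0,
                         ifelse(s <= lambda * gamma,
                                (s - lambda) / (1 - 1 / gamma), s))
  s.deriv <- function(s) ifelse(s <= lambda, 0,
                         ifelse(s <= lambda * gamma, 1 / (1 - 1 / gamma), 1))
  mean(replicate(nrep, {
    sig <- svd(matrix(rnorm(m * n), m, n), nu = 0, nv = 0)$d
    term1 <- sum(s.deriv(sig) + abs(m - n) * s.fun(sig) / sig)
    S2 <- outer(sig^2, sig^2, "-")          # sigma_i^2 - sigma_j^2
    R  <- (sig * s.fun(sig)) / S2           # [i,j] = sigma_i s(sigma_i)/(sigma_i^2 - sigma_j^2)
    term1 + 2 * sum(R[row(S2) != col(S2)])  # off-diagonal sum over i != j
  }))
}
```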
We note that the above expression is true for any value of 1 + φ_P > 0; in
particular, for the MC+ penalty function, expression (4.20) holds for γ > 1. As soon
as γ ≤ 1, the above method of deriving the df does not apply, due to the discontinuity
of the map Z → S_{λ,γ}(Z). Values of γ close to one (but larger), however, give an
expression for the df near the hard-thresholding spectral operator, which corresponds
to γ = 1.

[Figure 4.2: df/(mn) plotted against log(γ) for λ = 0.5, 1, 1.5.]

Figure 4.2: Figure showing the df for the MC+ thresholding operator for a matrix with
m = n = 10, μ = 0 and v = 1. The df profile as a function of γ (in the log scale) is shown
for three values of λ. The dashed lines correspond to the df of the spectral soft-thresholding
operator, corresponding to γ = ∞. We propose calibrating the (λ, γ) grid so that the df
corresponding to every value of γ matches the df of the soft-thresholding operator, as
shown in Figure 4.3.
To understand the behavior of the df as a function of (λ, γ), let us consider the
null model with μ = 0 and the MC+ penalty function. In this case, for any fixed λ
(see Figure 4.2), the df is seen to increase with smaller γ values: the soft-thresholding
function shrinks the large coefficients and sets all coefficients smaller than λ to zero,
whereas the more aggressive shrinkage operators shrink less for larger values of σ while
still setting all coefficients smaller than λ to zero. Thus, intuitively, the more aggressive
thresholding operators should have larger df, since they do more “fitting” — this is
indeed observed in Figure 4.2.
[Figure 4.3: the calibrated (λ, γ) lattice, plotting γ (log scale) against λ.]

Figure 4.3: Figure showing the calibrated (λ, γ) lattice — for every fixed value of λ, the
df of the MC+ spectral threshold operators are the same across different γ values. The df
computations have been performed on a null model using Proposition 4.5.
Mazumder et al. (2011) studied the df of the univariate thresholding operators in the
linear regression problem, and observed a similar pattern in the behavior of the df
across (λ, γ) values. For the linear regression problem, Mazumder et al. (2011) argued
that it is desirable to choose a parametrization for (λ, γ) such that, for a fixed λ, the df
stays the same as one moves across γ. We follow the same strategy for the spectral
regularization problem considered herein: we reparametrize the two-dimensional grid
of (λ, γ) values so that the df remains calibrated in the sense described above — this
is illustrated in Figure 4.3.

The study of the df presented herein provides a simple and intuitive explanation of
the roles of (λ, γ) for the fully observed problem. The notion of calibration proposed
herein gives a new parametrization of the family of penalties. The general algorithmic
framework presented in this chapter (see Section 4.3) computes a regularization
surface using warm-starts across adjacent (λ, γ) values on a two-dimensional grid; it
is thus desirable for adjacent solutions to be close, and the df calibration ensures this
in a simple and intuitive manner.
Computation of df: The df estimate implied by Proposition 4.4 depends only
upon the singular values (and not the singular vectors) of a matrix, and can hence be
computed with cost O(min{m, n}^2). The expectation can be approximated via Monte
Carlo simulation — these computations are easy to parallelize and can be done offline.
Since we compute the df for the null model, for larger values of m, n we resort to
the Marchenko-Pastur law for iid Gaussian matrix ensembles to approximate the
df expression (4.20). We illustrate the method using the MC+ penalty for γ > 1.
Towards this end, let us define the following function on β ≥ 0:

g_{ζ,γ}(β) = \begin{cases} 0, & \text{if } √β ≤ ζ \\ \frac{γ}{γ − 1}\left(1 − \frac{ζ}{√β}\right), & \text{if } ζ < √β ≤ ζγ \\ 1, & \text{if } √β > ζγ. \end{cases}   (4.21)
The following proposition provides a method to approximate the df of the spectral
thresholding operator for the null model, using the Marchenko-Pastur distribution — see
Lemma 1 in Appendix A. For the following proposition, we will assume (for simplicity)
that m ≥ n.

Proposition 4.5. Let m, n → ∞ with n/m → α ∈ (0, 1]. Then, under the model
Z_ij ~iid N(0, 1), we have

lim_{m,n→∞} \frac{df(S_{λ,γ}(Z))}{mn} = \begin{cases} 0, & \text{if } λ/√m → ∞ \\ (1 − α) E[g_{ζ,γ}(T_1)] + α E\left[\frac{T_1 g_{ζ,γ}(T_1) − T_2 g_{ζ,γ}(T_2)}{T_1 − T_2}\right], & \text{if } λ/√m → ζ \\ 1, & \text{if } λ/√m → 0, \end{cases}   (4.22)

where S_{λ,γ}(Z) is the thresholding operator corresponding to the MC+ penalty with
λ ≥ 0, γ > 1, and the expectation is taken with respect to T_1 and T_2, independently
generated from the Marchenko-Pastur distribution (as described in Appendix A).

Proof. For a proof, see Appendix A.1.
Note that the variance v^2 in model (4.18) can always be assumed to be one (by
adjusting the value of the tuning parameter accordingly⁴).
4.3 The NC-Impute Algorithm

In this section, we present the main contribution of this chapter: the algorithm NC-Impute.
The algorithm is inspired by an EM-stylized procedure, similar to Soft-Impute
(Mazumder et al., 2010). It is helpful to recall that, for observed data
PΩ(Y), the algorithm Soft-Impute relies on the following update sequence:

X_{k+1} = S_{λ,ℓ1}(PΩ(Y) + P⊥Ω(X_k)),   (4.23)

which can be interpreted as computing the nuclear norm regularized spectral thresholding
operator for the following “fully observed” problem:

X_{k+1} ∈ arg min_X { (1/2)‖X − (PΩ(Y) + P⊥Ω(X_k))‖_F^2 + λ‖X‖_* },

where the missing entries are filled in by the current estimate, i.e., P⊥Ω(X_k). We refer
the reader to Mazumder et al. (2010) for a detailed study of the algorithm. Mazumder
et al. (2010) suggest in passing the notion of extending Soft-Impute to more general
thresholding operators; however, such generalizations were not pursued by those
authors. In this chapter, we present a more thorough investigation of nonconvex
generalized thresholding operators — we study their convergence properties and scalability
aspects, and demonstrate their superior statistical performance across a wide
range of numerical experiments.
Update (4.23) suggests a natural generalization to more general nonconvex penalty
functions, obtained by simply replacing the spectral thresholding operator S_{λ,ℓ1}(·) with the
more general operator S_{λ,γ}(·):

X_{k+1} = S_{λ,γ}(PΩ(Y) + P⊥Ω(X_k)).   (4.24)

⁴This follows from the simple observation that s_{aλ,γ}(ax) = a·s_{λ,γ}(x) and s′_{aλ,γ}(ax) = s′_{λ,γ}(x).
Algorithm 4.1 NC-Impute

1. Input: A search grid λ_1 > λ_2 > · · · > λ_N; +∞ := γ_1 > γ_2 > · · · > γ_M. Tolerance ε.

2. Compute the solutions X̂_{λ_i,γ_1}, for i = 1, . . . , N, for the nuclear norm regularized problem.

3. For every (γ_j, λ_i) ∈ {γ_2, . . . , γ_M} × {λ_1, . . . , λ_N}:

   (a) Initialize X_old = arg min { f(X) : X ∈ {X̂_{λ_{i−1},γ_j}, X̂_{λ_i,γ_{j−1}}} }.

   (b) Repeat until convergence, i.e., until ‖X_new − X_old‖_F^2 < ε‖X_old‖_F^2:

       (i) Compute X_new ∈ arg min_X F_ℓ(X; X_old).

       (ii) Assign X_old ← X_new.

   (c) Assign X̂_{λ_i,γ_j} ← X_new.

4. Output: X̂_{λ_i,γ_j} for i = 1, . . . , N, j = 1, . . . , M.
While the above update rule works quite well in our numerical experiments, it enjoys
limited computational guarantees, as suggested by our convergence analysis in
Section 4.3.1. We thus propose and study a seemingly minor generalization of
rule (4.24); this modified rule enjoys superior finite-time convergence rates to a first
order stationary point.
Towards this end, let us define the following function:

F_ℓ(X; X_k) := (1/2)‖PΩ(X − Y)‖_F^2 + (1/2)‖P⊥Ω(X − X_k)‖_F^2 + (ℓ/2)‖X − X_k‖_F^2 + ∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ),   (4.25)

for ℓ ≥ 0. Note that F_ℓ(X; X_k) majorizes the function f(X), i.e., F_ℓ(X; X_k) ≥ f(X)
for any X and X_k, with equality holding at X = X_k. In an attempt to obtain a
minimum of Problem (4.3), our algorithm iteratively minimizes an upper bound on
f(X), given by F_ℓ(X; X_k), to obtain X_{k+1} — more formally, this leads to the following
update sequence:

X_{k+1} ∈ arg min_X F_ℓ(X; X_k).   (4.26)

Note that X_{k+1} is easy to compute; by some rearrangement of (4.25), we see that:

X_{k+1} ∈ arg min_X { \frac{ℓ + 1}{2} ‖X − X̃_k‖_F^2 + ∑_{i=1}^{min{m,n}} P(σ_i(X); λ, γ) } := S^ℓ_{λ,γ}(X̃_k),   (4.27)

where X̃_k = (PΩ(Y) + P⊥Ω(X_k) + ℓX_k)/(ℓ + 1). Note that (4.27) is a minor modification
of (4.24) — in particular, if ℓ = 0, then these two update rules coincide.
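A minimal sketch of this update in R is given below, reusing the hypothetical spectral_threshold from Section 4.2; note that the effective scalar operator must correspond to the penalty rescaled by 1/(ℓ + 1), a detail absorbed here into the supplied s.fun.

```r
# One NC-Impute update (4.27). Omega.mask: 0/1 matrix of observed entries;
# Y: observed data with zeros at unobserved positions; X: current iterate.
ncimpute_step <- function(X, Y, Omega.mask, s.fun, l = 0) {
  X.tilde <- (Omega.mask * Y + (1 - Omega.mask) * X + l * X) / (l + 1)
  spectral_threshold(X.tilde, s.fun)  # S^l_{lambda,gamma}; l = 0 recovers (4.24)
}
```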
The sequence {X_k} defined via (4.27) has desirable convergence properties, as we discuss in Section 4.3.1. In particular, as k → ∞, the sequence reaches (in a sense that will be made more precise later) a first order stationary point for Problem (4.3). We, however, seek to compute an entire regularization surface of solutions to Problem (4.3) over a two-dimensional grid of (λ, γ) values, for which we use warm-starts. We take the MC+ family of functions as a running example, with (λ, γ) ∈ {λ_1 > λ_2 > ··· > λ_N} × {∞ =: γ_1 > γ_2 > ··· > γ_M}. At the beginning, we compute a path of solutions for the nuclear norm penalized problem, i.e., Problem (4.3) with γ = ∞, on the grid of λ values. For a fixed value of λ, we then compute solutions to Problem (4.3) for smaller values of γ, gradually moving away from the convex problems. In this scheme, we found the following strategies useful:

• For every value of (λ_i, γ_j), we apply two copies of the iterative scheme (4.26), initialized with the solutions obtained at the two neighboring points (λ_{i−1}, γ_j) and (λ_i, γ_{j−1}). From these two candidates, we select the one that leads to the smaller value of the objective function f(·) at (λ_i, γ_j).

• Instead of using a two-dimensional rectangular lattice, one can also use the recalibrated lattice suggested in Section 4.2.2 as the two-dimensional grid of tuning parameters.
The algorithm outlined above, called NC-Impute, is summarized as Algorithm 4.1.
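For concreteness, the following R sketch implements Algorithm 4.1 for small dense problems under the MC+ penalty. The scalar rule s^ℓ_{λ,γ} below is derived from the stationarity conditions of the one-dimensional problem underlying (4.27) (made explicit in (4.35) below), assuming γ > 1/(ℓ+1); with γ = Inf it reduces to soft thresholding, which recovers the nuclear norm solutions required in step 2. All helper names (mcp_penalty, s_ell, f_obj, nc_impute_fixed, nc_impute_path) are ours, for illustration only.

# MC+ penalty P(t; lambda, gamma) for t >= 0.
mcp_penalty <- function(t, lambda, gamma)
  ifelse(t <= gamma * lambda, lambda * t - t^2 / (2 * gamma), gamma * lambda^2 / 2)

# Scalar thresholding rule induced by (4.27), valid for gamma > 1/(ell + 1).
s_ell <- function(x, lambda, gamma, ell) {
  mid <- ((ell + 1) * x - lambda) / ((ell + 1) - 1 / gamma)
  ifelse(x <= lambda / (ell + 1), 0, ifelse(x >= gamma * lambda, x, mid))
}

# Objective f(X) of Problem (4.3); Omega is a logical mask of observed entries.
f_obj <- function(X, Y, Omega, lambda, gamma)
  0.5 * sum(((X - Y)[Omega])^2) + sum(mcp_penalty(svd(X)$d, lambda, gamma))

# Inner loop: iterate (4.27) from a warm start X0 at fixed (lambda, gamma).
nc_impute_fixed <- function(Y, Omega, lambda, gamma, ell, X0,
                            eps = 1e-4, maxit = 100) {
  X <- X0
  for (it in 1:maxit) {
    Xtil <- X                                  # off-Omega entries of Xtil equal X
    Xtil[Omega] <- (Y[Omega] + ell * X[Omega]) / (ell + 1)
    sv   <- svd(Xtil)
    Xnew <- sv$u %*% (s_ell(sv$d, lambda, gamma, ell) * t(sv$v))
    if (sum((Xnew - X)^2) < eps * max(sum(X^2), 1)) { X <- Xnew; break }
    X <- Xnew
  }
  X
}

# Outer loop (Algorithm 4.1): sweep the (lambda, gamma) grid with warm starts,
# keeping, at each grid point, the better of the two neighboring initializations.
nc_impute_path <- function(Y, Omega, lambdas, gammas, ell = 1) {
  N <- length(lambdas); M <- length(gammas)
  sols <- matrix(vector("list", N * M), N, M)
  for (j in 1:M) for (i in 1:N) {
    starts <- list(matrix(0, nrow(Y), ncol(Y)))
    if (i > 1) starts <- c(starts, sols[i - 1, j])
    if (j > 1) starts <- c(starts, sols[i, j - 1])
    fits  <- lapply(starts, function(X0)
      nc_impute_fixed(Y, Omega, lambdas[i], gammas[j], ell, X0))
    fvals <- sapply(fits, f_obj, Y = Y, Omega = Omega,
                    lambda = lambdas[i], gamma = gammas[j])
    sols[[i, j]] <- fits[[which.min(fvals)]]
  }
  sols
}

Supplying gammas = c(Inf, ...) makes the first column of the grid play the role of the nuclear norm path, exactly as in step 2 of Algorithm 4.1.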
We now present an elementary convergence analysis of the update sequence (4.27).
Since the problems under investigation herein are nonconvex, our analysis requires
new ideas and techniques beyond those used in Mazumder et al. (2010) for the convex
nuclear norm regularized problem.
4.3.1 Convergence Analysis
By the definition of X_{k+1}, we have that:

F_ℓ(X_{k+1}; X_k) = min_X F_ℓ(X; X_k) ≤ F_ℓ(X_k; X_k) = f(X_k).    (4.28)
Let us define the quantities:

ν(ℓ) := 1 + φ_P + ℓ   and   ν†(ℓ) := max{ ν(ℓ), 0 },    (4.29)

where, if ν(ℓ) ≥ 0, the function X ↦ F_ℓ(X; X_k) is ν(ℓ)-strongly convex. In particular, from (4.26), it follows that ∇F_ℓ(X_{k+1}; X_k), a subgradient of the map X ↦ F_ℓ(X; X_k), equals zero. We thus have:

F_ℓ(X_k; X_k) − F_ℓ(X_{k+1}; X_k) ≥ (ν(ℓ)/2) ‖X_{k+1} − X_k‖_F^2.    (4.30)

Now note that, since X_{k+1} minimizes F_ℓ(X; X_k), we always have:

F_ℓ(X_k; X_k) − F_ℓ(X_{k+1}; X_k) ≥ (ν†(ℓ)/2) ‖X_{k+1} − X_k‖_F^2.    (4.31)

In addition, we have:
F_ℓ(X_{k+1}; X_k) = (1/2) ‖P_Ω(X_{k+1} − Y)‖_F^2 + (1/2) ‖P_Ω^⊥(X_{k+1} − X_k)‖_F^2 + (ℓ/2) ‖X_{k+1} − X_k‖_F^2 + Σ_{i=1}^{min{m,n}} P(σ_i(X_{k+1}); λ, γ)    (4.32)
                  = f(X_{k+1}) + (1/2) ‖P_Ω^⊥(X_{k+1} − X_k)‖_F^2 + (ℓ/2) ‖X_{k+1} − X_k‖_F^2.
Combining (4.31) and (4.32), and observing that F_ℓ(X_k; X_k) = f(X_k), we have:

f(X_k) − f(X_{k+1}) ≥ (ν†(ℓ)/2) ‖X_{k+1} − X_k‖_F^2 + (ℓ/2) ‖X_{k+1} − X_k‖_F^2 + (1/2) ‖P_Ω^⊥(X_{k+1} − X_k)‖_F^2
                    = ((ν†(ℓ) + ℓ)/2) ‖X_{k+1} − X_k‖_F^2 + (1/2) ‖P_Ω^⊥(X_{k+1} − X_k)‖_F^2 =: Δ_ℓ(X_k; X_{k+1}).    (4.33)
Since Δ_ℓ(X_k; X_{k+1}) ≥ 0, the above inequality immediately implies that f(X_k) ≥ f(X_{k+1}) for all k, and that the improvement in objective values is at least as large as the quantity Δ_ℓ(X_k; X_{k+1}). The term Δ_ℓ(X_k; X_{k+1}) is a measure of the progress of the algorithm, as formalized by the following proposition.
Proposition 4.6. (a) Let ν†(ℓ) + ℓ > 0 and, for any X_a, consider the update X_{a+1} ∈ argmin_X F_ℓ(X; X_a). Then the following are equivalent:

(i) f(X_{a+1}) = f(X_a).
(ii) Δ_ℓ(X_a; X_{a+1}) = 0.
(iii) X_a is a fixed point, i.e., X_{a+1} = X_a.

(b) If ν†(ℓ) = ℓ = 0 and Δ_ℓ(X_a; X_{a+1}) = 0, then X_{a+1} is a fixed point.
Proof. Proof of Part (a): We show that (i) ⟹ (ii) ⟹ (iii) ⟹ (i) by analyzing (4.33). If f(X_{a+1}) = f(X_a), then (4.33) gives Δ_ℓ(X_a; X_{a+1}) = 0. If Δ_ℓ(X_a; X_{a+1}) = 0, then, since ν†(ℓ) + ℓ > 0, we have X_{a+1} = X_a. Finally, X_{a+1} = X_a trivially implies (i).

Proof of Part (b): If ν†(ℓ) + ℓ = 0, Part (a) needs to be slightly modified. Note that Δ_ℓ(X_a; X_{a+1}) = 0 iff P_Ω^⊥(X_{a+1}) = P_Ω^⊥(X_a). Since ℓ = 0, we have X_{a+2} = S_{λ,γ}( P_Ω(Y) + P_Ω^⊥(X_{a+1}) ). The condition P_Ω^⊥(X_{a+1}) = P_Ω^⊥(X_a) then implies that S_{λ,γ}( P_Ω(Y) + P_Ω^⊥(X_{a+1}) ) = S_{λ,γ}( P_Ω(Y) + P_Ω^⊥(X_a) ), where the term on the right equals X_{a+1}. Thus X_{a+1} = X_{a+2} = ···, i.e., X_{a+1} is a fixed point.
Since the f(X_k)'s form a decreasing sequence that is bounded from below, they converge to a limit, say f̄; by (4.33), this implies that Δ_ℓ(X_k; X_{k+1}) → 0 as k → ∞. Let us now consider two cases, depending upon the value of ν†(ℓ) + ℓ. If ν†(ℓ) + ℓ > 0, then X_{k+1} − X_k → 0 as k → ∞. On the other hand, if ν†(ℓ) = ℓ = 0, the conclusion needs to be modified: Δ_ℓ(X_k; X_{k+1}) → 0 implies that P_Ω^⊥(X_{k+1} − X_k) → 0 as k → ∞.
Motivated by the above discussion, we make the following definition of a first order
stationary point for Problem (4.3).
Definition 4.1. X_a is said to be a first order stationary point for Problem (4.3) if Δ_ℓ(X_a; X_{a+1}) = 0. X_a is said to be an ε-accurate first order stationary point for Problem (4.3) if Δ_ℓ(X_a; X_{a+1}) ≤ ε.
Proposition 4.7. The sequence f(X_k) is decreasing; suppose it converges to f̄. Then the rate at which the sequence X_k reaches an approximate first order stationary point is given by:

min_{1≤k≤K} Δ_ℓ(X_k; X_{k+1}) ≤ (1/K) ( f(X_1) − f̄ ).    (4.34)
Proof. The arguments preceding Proposition 4.7 establish that the sequence f(X_k) is decreasing and converges to f̄. Consider (4.33) for any 1 ≤ k ≤ K: Δ_ℓ(X_k; X_{k+1}) ≤ f(X_k) − f(X_{k+1}). Summing this inequality over k = 1, ..., K, we have:

K min_{1≤k≤K} Δ_ℓ(X_k; X_{k+1}) ≤ Σ_{1≤k≤K} Δ_ℓ(X_k; X_{k+1}) ≤ f(X_1) − f(X_{K+1}) ≤ f(X_1) − f̄,

where the last inequality uses the simple fact that f(X_k) ↓ f̄. Gathering the left and right ends of the above chain of inequalities leads to (4.34).
Proposition 4.7 shows that the sequence X_k reaches an ε-accurate first order stationary point within K_ε = ( f(X_1) − f̄ )/ε iterations. The number of iterations K_ε depends upon how close the initial objective value f(X_1) is to the eventual limit f̄. Since NC-Impute employs warm-starts, this rate suggests that the number of iterations required to reach an approximate first order stationary point is quite low; this is indeed observed in our experiments, and this feature makes our algorithm particularly attractive from a practical viewpoint.
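In practice, the progress measure Δ_ℓ can be monitored directly and used as the stopping rule of Definition 4.1. A minimal R sketch follows; it assumes that, for the MC+ penalty, the concavity parameter satisfies φ_P = −1/γ (our reading of the definition of φ_P earlier in the chapter), so that ν†(ℓ) = max{1 − 1/γ + ℓ, 0}.

# Progress measure Delta_ell(Xold; Xnew) from (4.33); Omega is the logical
# mask of observed entries. Stopping when this quantity drops below eps
# certifies an eps-accurate first order stationary point (Definition 4.1).
delta_ell <- function(Xold, Xnew, Omega, ell, gamma) {
  nu_dag <- max(1 - 1 / gamma + ell, 0)  # nu^dagger(ell), assuming phi_P = -1/gamma
  D <- Xnew - Xold
  0.5 * (nu_dag + ell) * sum(D^2) + 0.5 * sum((D[!Omega])^2)
}

Replacing the relative-change test in nc_impute_fixed above with delta_ell(X, Xnew, Omega, ell, gamma) <= eps yields the ( f(X_1) − f̄ )/ε iteration bound of Proposition 4.7 directly.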
Rank Stabilization: Let us consider the thresholding function S^ℓ_{λ,γ}(X̃_k) defined in (4.27), which expresses X_{k+1} as a function of X_k. Using the development in Section 4.2, it is easy to see that the spectral operator S^ℓ_{λ,γ}(·) is closely tied to the following vector thresholding operator (4.35), acting on the singular values of X̃_k. Formally, for a given nonnegative vector x, if we denote:

s^ℓ_{λ,γ}(x) ∈ argmin_{α ≥ 0}  ((ℓ+1)/2) ‖α − x‖_2^2 + Σ_{i=1}^{min{m,n}} P(α_i; λ, γ),    (4.35)

then S^ℓ_{λ,γ}(X) = U diag( s^ℓ_{λ,γ}(x) ) V′, where X = U diag(x) V′ is the SVD of X. Thus, properties of the spectral thresholding function S^ℓ_{λ,γ}(X) are closely related to those of the vector thresholding operator s^ℓ_{λ,γ}(x). Since the vector thresholding operator s^ℓ_{λ,γ}(x) is separable across the coordinates of x, we denote by s^ℓ_{λ,γ}(x_i) the ith coordinate of s^ℓ_{λ,γ}(x).
We now investigate what happens to the rank of the sequence X_k defined via (4.26). In particular, does this rank converge? We show that the rank indeed stabilizes after finitely many iterations, under the additional assumption that the spectral thresholding operator is discontinuous; see Figure 4.1 for examples of discontinuous thresholding functions.

Proposition 4.8. Consider the update sequence X_{k+1} = S^ℓ_{λ,γ}(X̃_k) as defined in (4.27), and let ν†(ℓ) + ℓ > 0. Suppose that there is a λ_S > 0 such that, for any scalar x ≥ 0, the following holds: s^ℓ_{λ,γ}(x) ≠ 0 ⟹ |s^ℓ_{λ,γ}(x)| > λ_S; i.e., the scalar thresholding operator x ↦ s^ℓ_{λ,γ}(x) is discontinuous. Then there exist integers K* and r such that rank(X_k) = r for all k ≥ K*, i.e., the rank stabilizes after finitely many iterations.
Proof. Using (4.33), it follows that

f(X_k) − f(X_{k+1}) ≥ ((ν†(ℓ) + ℓ)/2) ‖X_{k+1} − X_k‖_F^2 ≥ ((ν†(ℓ) + ℓ)/2) ‖σ_{k+1} − σ_k‖_2^2,

where the last inequality follows from the Wielandt-Hoffman inequality (Horn and Johnson, 2012) and σ_k := σ(X_k) denotes the vector of singular values of X_k. Let 1(σ) be the indicator vector whose ith coordinate equals 1(σ_i ≠ 0). We prove rank stabilization by contradiction. Suppose the rank does not stabilize; then 1(σ_{k+1}) ≠ 1(σ_k) for infinitely many values of k. Thus there are infinitely many values k′ such that:

‖σ_{k′+1} − σ_{k′}‖_2^2 ≥ σ_{k′+1,i}^2,

where i is taken such that σ_{k′+1,i} ≠ 0 but σ_{k′,i} = 0 (the case σ_{k′,i} ≠ 0 but σ_{k′+1,i} = 0 is handled analogously). By the assumed property of the thresholding function s^ℓ_{λ,γ}(·), we have that s^ℓ_{λ,γ}(x) ≠ 0 ⟹ |s^ℓ_{λ,γ}(x)| > λ_S. This implies that ‖σ_{k′+1} − σ_{k′}‖_2^2 ≥ λ_S^2 for infinitely many values of k′, which contradicts the convergence f(X_k) − f(X_{k+1}) → 0. Thus the support of σ(X_k) converges, necessarily after finitely many iterations, leading to the existence of an iteration number K*, after which the rank of X_k remains fixed. This completes the proof of the proposition.
Remark 4.1. If ℓ = 0, the discontinuity of the thresholding operator s_{λ,γ}(·) demanded by Proposition 4.8 occurs for the MC+ penalty function as soon as γ ≤ 1. For general ℓ > 0, discontinuity of s^ℓ_{λ,γ}(·) occurs as soon as γ ≤ 1/(ℓ + 1). Note that the condition ℓ > 0 implies that ν†(ℓ) + ℓ > 0.
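The discontinuity in Remark 4.1 is easy to visualize numerically. The sketch below solves the scalar problem behind (4.35) by brute force over a grid of candidate values, reusing mcp_penalty from the sketch above; the helper name s_ell_num is ours. With ℓ = 1, so 1/(ℓ+1) = 0.5, the choice γ = 0.4 exhibits a jump, whereas γ = 2 yields a continuous map.

# Brute-force scalar thresholding: minimize (ell+1)/2 (a - x)^2 + P(a; lambda, gamma)
# over a fine grid of candidate values a >= 0.
s_ell_num <- function(x, lambda, gamma, ell, ngrid = 4000) {
  a   <- seq(0, max(2 * x, 2 * gamma * lambda, 1), length.out = ngrid)
  obj <- (ell + 1) / 2 * (a - x)^2 + mcp_penalty(a, lambda, gamma)
  a[which.min(obj)]
}

xs <- seq(0, 2, length.out = 400)
# gamma = 0.4 <= 1/(ell + 1): a visible jump (solid); gamma = 2: continuous (dashed).
plot(xs, sapply(xs, s_ell_num, lambda = 1, gamma = 0.4, ell = 1), type = "l")
lines(xs, sapply(xs, s_ell_num, lambda = 1, gamma = 2, ell = 1), lty = 2)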
Asymptotic Convergence: We now investigate the asymptotic convergence properties of the sequence X_k, k ≥ 1. Proposition 4.8 shows that, under suitable assumptions, the sequence rank(X_k), k ≥ 1, converges. In this situation, the existence of a limit point of X_k is guaranteed provided the singular values σ(X_k) remain bounded. It is not immediately clear that this is the case, since several spectral penalty functions (like the MC+ penalty) are bounded⁵. We address herein the existence of a limit point of the sequence σ(X_k), and hence of the sequence X_k. For the following proposition, we will assume that the concave penalty function σ ↦ P(σ; λ, γ) on σ ≥ 0 is differentiable and that its gradient is bounded.

⁵ Due to the boundedness of the penalty function, the boundedness of the objective function does not necessarily imply that the sequence σ(X_k) will remain bounded.
Proposition 4.9. Let U_k diag(σ_k) V′_k denote the rank-reduced SVD of X_k. Let the matrices Ū_{m×r}, V̄_{n×r} denote a limit point of the sequence {U_k, V_k}, k ≥ 1, such that (U_{n_k}, V_{n_k}) → (Ū, V̄) along a subsequence n_k → ∞. Let ū_i denote the ith column of Ū (and similarly v̄_i for V̄), and let us denote Θ = [ vec(P_Ω(ū_1 v̄′_1)), ..., vec(P_Ω(ū_r v̄′_r)) ]. We have the following:

(a) If rank(Θ) = r, then the sequence X_{n_k} has a limit point, which is a first order stationary point.

(b) If λ_min(Θ′Θ) + φ_P > 0, then the sequence X_{n_k} converges to a first order stationary point X̄ = Ū diag(σ̄) V̄′, where σ_{n_k} → σ̄.

Proof. See Appendix A.2.
Proposition 4.8 describes sufficient conditions under which the rank of the sequence X_k stabilizes after finitely many iterations; it does not address the boundedness of the sequence X_k, which is the subject of Proposition 4.9. Note that Proposition 4.9 does not imply that the rank of the sequence X_k stabilizes after finitely many iterations (recall that Proposition 4.9 does not assume that the thresholding operators are discontinuous, an assumption required by Proposition 4.8).
4.3.2 Computing the Thresholding Operators
The operator (4.27) requires computing a thresholded SVD of the matrix X̃_k, as demonstrated by Proposition 4.1. The thresholded singular values s^ℓ_{λ,γ}(·) in (4.35) will have many zero coordinates due to the "sparsity promoting" nature of the concave penalty. Thus, computing the thresholding operator (4.27) will typically require performing a low-rank SVD of the matrix X̃_k. While direct factorization based SVD methods can be used for smaller problems (i.e., when min{m,n} is of the order of a thousand or so), for larger matrices such methods become computationally prohibitive; we thus resort to iterative methods for computing low-rank SVDs in large scale problems. Algorithms like the block power method (also known as block QR iterations), or those based on the Lanczos method (Golub and Van Loan, 1983), are quite effective in computing the top few singular values/vectors of a matrix A, especially when the operations A b_1 and A′ b_2 (for vectors b_1, b_2 of matching dimensions) can be carried out efficiently. Indeed, such matrix-vector multiplications turn out to be quite computationally attractive for our problem, since the computational cost of multiplying X̃_k and X̃′_k with vectors of matching dimensions is quite low. This is due to the structure of:
X̃_k = ( P_Ω(Y) + P_Ω^⊥(X_k) + ℓ X_k ) / (ℓ + 1) = (1/(ℓ+1)) P_Ω(Y − X_k) [sparse] + X_k [low-rank],    (4.36)
which admits a decomposition as the sum of a sparse matrix, given by (1/(ℓ+1)) P_Ω(Y − X_k), and a low-rank matrix⁶, namely X_k. Note that the sparse matrix has the same sparsity pattern as the set of observed indices Ω. Decomposition (4.36) is inspired by a similar decomposition that was exploited effectively in the algorithm Soft-Impute (Mazumder et al., 2010), where the authors use PROPACK (Larsen, 2004) to compute the low-rank SVDs. In this chapter, we instead use the Alternating Least Squares (ALS)-stylized procedure proposed in Hastie et al. (2016), which computes a low-rank SVD by solving the nonlinear optimization problem:

minimize_{U ∈ R^{m×r}, V ∈ R^{n×r}}  (1/2) ‖ X̃_k − U V′ ‖_F^2

using alternating least squares; this turns out to be equivalent to the block power method (Hastie et al., 2016; Golub and Van Loan, 1983) for computing a rank-r SVD of the matrix X̃_k. Across the iterations of NC-Impute, we pass the warm-start information in the U, V's obtained from a low-rank SVD of X̃_k to compute the low-rank SVD of X̃_{k+1}. Empirically, this warm-start strategy is found to be significantly more advantageous than a black-box low-rank SVD approach, as used in Soft-Impute. Using warm-start information across successive iterations (i.e., across k) leads to dramatic gains in computational speed, often reducing the total time needed to compute a family of solutions by orders of magnitude when compared to black-box SVD methods that do not rely on such warm-start strategies.

⁶ We note that it is not guaranteed that the X_k's will be of low rank across the iterations of the algorithm for k ≥ 1, even if they eventually are, for k sufficiently large. However, in the presence of warm-starts across (λ, γ), they are indeed empirically found to have low rank, as long as the regularization parameters are large enough to result in a small-rank solution. Typically, as we have observed in our experiments, in the presence of warm-starts the rank of X_k remains low across all iterations.
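A minimal R sketch of this strategy is given below: the sparse-plus-low-rank multiply implied by (4.36), and a few warm-started block power (block QR) iterations that recover a rank-r SVD. It assumes X_k is stored in factored form U diag(d) V′ and uses the Matrix package for the sparse part; the names and structure are illustrative only, not the thesis software.

library(Matrix)

# Multiply Xtil = S + U diag(d) V' (and its transpose) by a block B without
# ever forming Xtil; S = P_Omega(Y - X_k)/(ell + 1) is the sparse part of (4.36).
mult  <- function(S, U, d, V, B) as.matrix(S %*% B) + U %*% (d * crossprod(V, B))
multT <- function(S, U, d, V, B) as.matrix(crossprod(S, B)) + V %*% (d * crossprod(U, B))

# Rank-r SVD of Xtil by block power iterations, warm-started at the left
# singular subspace U0 from the previous NC-Impute iteration (assumes
# ncol(U0) >= r).
lowrank_svd <- function(S, U, d, V, r, U0, iters = 20) {
  Q <- qr.Q(qr(U0[, 1:r, drop = FALSE]))
  for (it in 1:iters) {
    Qn <- qr.Q(qr(multT(S, U, d, V, Q)))   # n x r block
    Q  <- qr.Q(qr(mult(S, U, d, V, Qn)))   # m x r block
  }
  B  <- multT(S, U, d, V, Q)               # B = Xtil' Q, an n x r matrix
  sv <- svd(B)                             # small SVD recovers the factorization
  list(u = Q %*% sv$v, d = sv$d, v = sv$u) # Xtil ~ u diag(d) v'
}

A call such as lowrank_svd(S, U, d, V, r = 20, U0 = U) reuses the previous left factor as the warm start, which is precisely the subspace information passed across iterations in NC-Impute.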
4.4 Numerical Experiments
In this section, we present a systematic experimental study of the statistical properties of estimators obtained from (4.3) for different choices of penalty functions. We perform our experiments on a wide array of synthetic and real data instances.
4.4.1 Synthetic Examples
We study three different examples where, for the true low-rank matrix M = L Φ R^T, we vary both the structure of the left and right singular vectors in L and R, and the sampling scheme used to obtain the observed entries in Ω. Our basic model is Y_ij = M_ij + ε_ij, where we observe the entries (i, j) ∈ Ω. We consider different types of missing patterns for Ω and various signal-to-noise ratios for the Gaussian error term ε, defined here to be:

SNR = Var(vec(M)) / Var(vec(ε)).
Accordingly, the (standardized) training and test errors for the model are defined as:

Training Error = ‖P_Ω(Y − M̂)‖_F^2 / ‖P_Ω(Y)‖_F^2   and   Test Error = ‖P_Ω^⊥(L Φ R^T − M̂)‖_F^2 / ‖P_Ω^⊥(L Φ R^T)‖_F^2,

where a test error greater than one indicates that the computed estimate M̂ does a worse job at estimating M than the zero solution, and the training error measures the fraction of the error on the observed entries explained by the estimate M̂, relative to the zero solution.
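In R, with Omega a logical mask of observed entries, M the truth, and Mhat the estimate, these quantities can be computed as follows (the helper names are ours):

snr_of    <- function(M, E) var(as.vector(M)) / var(as.vector(E))
train_err <- function(Y, Mhat, Omega) sum(((Y - Mhat)[Omega])^2) / sum((Y[Omega])^2)
test_err  <- function(M, Mhat, Omega) sum(((M - Mhat)[!Omega])^2) / sum((M[!Omega])^2)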
Example-A: In our first simulation setting, we use the observation model Y_{m×n} = L_{m×r} Φ_{r×r} R^T_{r×n} + ε_{m×n}, where L and R are generated from the random orthogonal model (Candes and Recht, 2009), and the singular values Φ = diag(φ_1, ..., φ_r) are randomly selected as φ_1, ..., φ_r ~ iid Uniform(0, 100). The set Ω is sampled uniformly at random. Recall that for this model, exact matrix completion in the noiseless setting is guaranteed as long as |Ω| ≥ C m r log⁴ m, for some universal constant C (Recht, 2011). Under the noisy setting, Mazumder et al. (2010) show superior performance of nuclear norm regularization vis-a-vis other matrix recovery algorithms (Cai et al., 2010; Keshavan et al., 2010) in terms of achieving smaller test error. For the purposes herein, we fix (m, n) = (800, 400) and set the fraction of missing entries to |Ω^c|/mn = 0.9 and |Ω^c|/mn = 0.95.
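For reference, a minimal R generator for this setting is sketched below; rom_data is our helper name, with dimensions and parameters as in the text.

# Example-A style data: random orthogonal model with uniform missingness.
rom_data <- function(m, n, r, pmiss, snr) {
  L <- qr.Q(qr(matrix(rnorm(m * r), m, r)))    # random orthogonal factors
  R <- qr.Q(qr(matrix(rnorm(n * r), n, r)))
  M <- L %*% (runif(r, 0, 100) * t(R))         # L diag(phi) R'
  E <- matrix(rnorm(m * n), m, n)
  E <- E * sqrt(var(as.vector(M)) / (snr * var(as.vector(E))))  # match the SNR
  Omega <- matrix(runif(m * n) > pmiss, m, n)  # TRUE = observed entry
  list(Y = M + E, M = M, Omega = Omega)
}
dat <- rom_data(800, 400, 10, 0.9, 1)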
Example-B: In our second setting, we again consider the observation model Y_{m×n} = L_{m×r} Φ_{r×r} R^T_{r×n} + ε_{m×n}, but we now select matrices L and R which do not satisfy the incoherence conditions required for full matrix recovery. Specifically, for the choices (m, n, r) = (800, 400, 10) and |Ω^c|/mn = 0.9, we select L and R to be block-diagonal matrices of the form L = diag(L_1, ..., L_5) and R = diag(R_1, ..., R_5), where L_i ∈ R^{160×2} and R_i ∈ R^{80×2}, i = 1, ..., 5, are random matrices with scaled Gaussian entries. The singular values are again sampled as φ_1, ..., φ_r ~ iid Uniform(0, 100), with Ω uniformly random over the set of indices. For this model, successful matrix completion is not guaranteed even for the noiseless problem with the nuclear norm relaxation, as the left and right singular vectors are not sufficiently spread. We would like to investigate the performance of nonconvex regularizers based on the MC+ family of penalties in this challenging scenario. A superior performance of these penalties over the usual nuclear norm regularization would signal that nonconvex penalties are able to weaken the strong incoherence conditions required for successful nuclear norm matrix reconstruction.
Example-C: In our third simulation setting, for the choice (m, n, r) = (100, 100, 10), we also generate Y_{m×n} = L_{m×r} Φ_{r×r} R^T_{r×n} + ε_{m×n} from the random orthogonal model as in our first setting, but we now allow the observed entries in Ω to follow a non-uniform sampling scheme. In particular, we fix Ω^c = { (i, j) : 1 ≤ i ≤ 50, 51 ≤ j ≤ 100 }, so that, in 2 × 2 block notation,

P_Ω(Y) = [ Y_{11}, 0 ; Y_{21}, Y_{22} ],   where   Y = [ Y_{11}, Y_{12} ; Y_{21}, Y_{22} ],

with the fraction of missing entries thus being |Ω^c|/mn = 0.25. This is again a challenging simulation setting, in which both the uniform (Candes and Recht, 2009) and independent (Chen et al., 2014) sampling scheme assumptions on Ω that are necessary for full matrix recovery are violated. Our aim, again, is to explore whether the nonconvex MC+ family is able to outperform nuclear norm regularization even in this setting, where no theoretical guarantees for exact matrix reconstruction exist.
For all three settings, we choose a 100 × 25 grid of (λ, γ) values as follows. In each simulation instance we fix λ_1 = ‖P_Ω(Y)‖_2, the smallest value of λ for which the nuclear norm regularized solution is zero, and set λ_100 = 0.001 · λ_1. Keeping in mind that NC-Impute benefits greatly from warm starts, we construct an equally spaced sequence of 100 values of λ decreasing from λ_1 to λ_100, and choose 25 values of γ on a logarithmic grid from 5000 down to 1.1. The results displayed in Figures 4.4-4.6 show training and test errors, as well as the recovered ranks of the solution matrix M̂_{λ,γ}, averaged over 50 simulations under all three problem instances. The plots including rank reveal how effective the MC+ family is at recovering the true rank while minimizing prediction error. Throughout the simulations, we keep an upper bound of 50 on the operating rank.
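Continuing the synthetic example above, the tuning grids can be built as follows; here λ_1 is the largest singular value of Y with the unobserved entries set to zero.

Y0 <- dat$Y; Y0[!dat$Omega] <- 0        # P_Omega(Y)
lam1    <- svd(Y0)$d[1]                 # spectral norm: smallest lambda giving the zero solution
lambdas <- seq(lam1, 0.001 * lam1, length.out = 100)
gammas  <- exp(seq(log(5000), log(1.1), length.out = 25))
# e.g., fits <- nc_impute_path(dat$Y, dat$Omega, lambdas, gammas)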
4.4.1.1 Discussion of Experimental Results
We devote Figures 4.4 and 4.5 to analyzing the simpler random orthogonal model (Example-A), leaving the more challenging coherent and non-uniform sampling settings (Example-B and Example-C) for Figure 4.6. In each case, the captions detail the results, which we summarize here. The noise is quite high in Figure 4.4, with SNR = 1 and 90% of the entries missing in both displayed settings, while the model complexity decreases from a true rank of 10 to 5. The underlying true ranks remain the same in Figure 4.5, but the noise level decreases to SNR = 5, with the fraction of missing entries increasing to 95%. For each model setting considered, all nonconvex methods from the MC+ family outperform nuclear norm regularization in terms of prediction performance, while members of the MC+ family with smaller values of γ are better at estimating the correct rank. The choices γ = 30 and γ = 20 have the best performance in Figure 4.4 (best prediction errors around the true ranks), while more nonconvex alternatives fare better in the high-sparsity, low-noise setting of Figure 4.5. In both figures, the performance of nuclear norm regularization is somewhat similar to that of the least nonconvex alternative displayed, γ = 100; however, the bias induced in the estimation of the singular values of the low-rank matrix M leads to the worst bias-variance trade-off among all training versus test error plots for the settings considered.
While the nuclear norm relaxation provides a good convex lower approximation to the rank of a matrix (Recht et al., 2010), these examples show that nonconvex regularization methods provide a superior mechanism for rank estimation. This is reminiscent of the performance of the MC+ penalty in the context of variable selection within sparse, high-dimensional regression models. Although the ℓ_1 penalty function represents the best convex approximation to the ℓ_0 penalty, the gap bridged by the nonconvex MC+ penalty family P(·; λ, γ) provides a better basis for model selection, and hence for rank estimation in the low-rank matrix completion setting.
[Figure 4.4 about here. Super-title: Example-A (low SNR, fewer missing entries). Panels: (a) test error versus training error, ROM, 90% missing, SNR = 1, true rank = 10; (b) same, true rank = 5; (c) test error versus rank, true rank = 10; (d) test error versus rank, true rank = 5. Curves: γ = +∞, 100, 80, 30, 20, 10, 5.]

Figure 4.4: (Color online) Random Orthogonal Model (ROM) simulations with SNR = 1. The choice γ = +∞ refers to nuclear norm regularization as provided by the Soft-Impute algorithm. The least nonconvex alternatives, at γ = 100 and γ = 80, behave similarly to nuclear norm, although with better prediction performance. The choices γ = 5 and γ = 10 result in excessively aggressive fitting behavior in the true rank = 10 case, but improve significantly in prediction error and in recovering the true rank in the sparser true rank = 5 setting. In both scenarios, the intermediate models with γ = 30 and γ = 20 fare the best, with the former achieving the smallest prediction error, while the latter estimates the actual rank of the matrix. Values of test error larger than one are not displayed in the figure.
[Figure 4.5 about here. Super-title: Example-A (high SNR, more missing entries). Panels: (a) test error versus training error, ROM, 95% missing, SNR = 5, true rank = 10; (b) same, true rank = 5; (c) test error versus rank, true rank = 10; (d) test error versus rank, true rank = 5. Curves: γ = +∞, 100, 80, 30, 20, 10, 5.]

Figure 4.5: (Color online) Random Orthogonal Model (ROM) simulations with SNR = 5. The benefits of nonconvex regularization are more evident in this high-sparsity, high-missingness scenario. While the γ = 100 and γ = 80 models distance themselves more from nuclear norm, the remaining members of the MC+ family essentially minimize prediction error while correctly estimating the true rank. This is especially true in panel (d), where the best predictive performance of the model γ = 5, at the correct rank, is achieved under a low-rank truth and a high SNR setting.
[Figure 4.6 about here. Super-titles: Example-B and Example-C. Panels: (a) test error versus training error, coherent model, 90% missing, SNR = 10, true rank = 10; (b) test error versus training error, NUS, 25% missing, SNR = 10, true rank = 10; (c) test error versus rank, coherent model; (d) test error versus rank, NUS. Curves: γ = +∞, 100, 80, 30, 20, 10, 5.]

Figure 4.6: (Color online) Coherent and Non-Uniform Sampling (NUS) simulations with SNR = 10. Nonconvex regularization also proves to be a successful strategy in these challenging scenarios, particularly in the non-uniform sampling setting, where the MC+ family exhibits a monotone decrease in prediction error as γ approaches 1. Again, the model γ = 5 estimates the correct rank under high SNR settings. Although nuclear norm achieves a relatively small prediction error compared with the previous simulation settings, the MC+ family still provides a superior and more robust mechanism for regularization.
For the coherent and non-uniform sampling settings of Figure 4.6, we choose the small-noise scenario SNR = 10 in order to favor all considered models.
absence of any theoretical guarantees for successful matrix recovery, the nuclear norm
regularization approach achieves a relatively small prediction error in all displayed
instances. Nevertheless, the nonconvex MC+ family of penalties appears empirically more adept at overcoming the limitations of nuclear norm penalized matrix completion in these challenging simulation settings. In particular, the most aggressive nonconvex fitting behavior, at γ = 5, achieves almost uniformly the best prediction performance in the non-uniform sampling setting, while correctly estimating the true rank for the coherent model. In spite of these promising initial findings, a more thorough theoretical investigation of whether nonconvex regularization methods are able to overcome the incoherence and uniform sampling assumptions required for successful matrix reconstruction in the noisy setting (Candes and Plan, 2010; Keshavan et al., 2010) is deferred to future work.
4.4.2 Real Data Examples: MovieLens and Netflix Data Sets
We now use the real-world recommendation system data sets ml100k and ml1m provided by MovieLens (http://grouplens.org/datasets/movielens/), as well as the famous Netflix competition data, to compare the usual nuclear norm approach with the MC+ regularizers. The data set ml100k consists of 100,000 movie ratings (1-5) from 943 users on 1,682 movies, whereas ml1m includes 1,000,209 anonymous ratings from 6,040 users on 3,952 movies. In both data sets, and for all regularization methods considered, a random subset of 80% of the ratings was used for training purposes; the remaining ratings were used as the test set.
We also choose a similar 100 × 25 grid of (λ, γ) values, but for each value of λ
in the decreasing sequence, we use an “operating rank” threshold somewhat larger
than the rank of the previous solution, with the goal of always obtaining solution
ranks smaller than the operating threshold. Following the approach of Hastie et al.
(2016), we perform row and column centering of the corresponding (incomplete) data matrices as a preprocessing step.

[Figure 4.7 about here. Super-title: Real Data Example: MovieLens. Panels: (a) test RMSE versus training RMSE, MovieLens100k, 20% test data; (b) test RMSE versus training RMSE, MovieLens1m, 20% test data; (c) test RMSE versus rank, MovieLens100k; (d) test RMSE versus rank, MovieLens1m. Curves: γ = +∞, 100, 80, 30, 20, 10, 5 for ml100k, and γ = +∞, 100, 80, 30, 20, 15, 10 for ml1m.]

Figure 4.7: (Color online) MovieLens 100k and 1m data. For each value of λ in the solution path, an operating rank threshold (capped at 250) larger than the rank of the previous solution was employed.
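As a concrete illustration of the data preparation, a minimal R sketch for ml100k follows: it reads the raw ratings file, holds out a random 20% of the ratings, and runs a few alternating passes of row/column centering over the observed entries. Hastie et al. (2016) use a more refined centering scheme (biScale, in their softImpute package); the file path and helper code here are ours.

library(Matrix)
ud   <- read.table("ml-100k/u.data", col.names = c("user", "item", "rating", "ts"))
test <- sample(nrow(ud), round(0.2 * nrow(ud)))   # 20% held-out ratings
tr   <- ud[-test, ]
Ytr  <- sparseMatrix(i = tr$user, j = tr$item, x = tr$rating, dims = c(943, 1682))

# Alternating row/column centering over the observed entries only.
a <- numeric(943); b <- numeric(1682)
for (pass in 1:5) {
  a <- as.vector(tapply(tr$rating - b[tr$item],
                        factor(tr$user, levels = 1:943), mean)); a[is.na(a)] <- 0
  b <- as.vector(tapply(tr$rating - a[tr$user],
                        factor(tr$item, levels = 1:1682), mean)); b[is.na(b)] <- 0
}
centered <- tr$rating - a[tr$user] - b[tr$item]   # centered training ratings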
Figure 4.7 compares the performance of nuclear norm regularization with the
MC+ family of penalties on these data sets, in terms of the prediction error (RMSE)
obtained from the left out portion of the data. While the fitting behavior at γ = 5
is overly aggressive in these instances, the choice γ = 10 achieves the best test set
RMSE with a minimum solution rank of 20 for the ml100k data. With a higher test RMSE, nuclear norm regularization achieves its minimum with a less parsimonious model of rank 62. Similar results hold for the ml1m data, where the model γ = 15 achieves near-optimal test RMSE at a solution rank of 115, while the best estimation accuracy of Soft-Impute occurs for ranks well over 200.

[Figure 4.8 about here. Super-title: Real Data Example: Netflix. Panels: (a) test RMSE versus training RMSE; (b) test RMSE versus rank; test data = 1,500,000 ratings. Curves: γ = +∞, 100, 20, 15, 10, 8, 6.]

Figure 4.8: (Color online) Netflix competition data. The model γ = 10 achieves an optimal test set RMSE of 0.8276 for a solution rank of 105.
The Netflix competition data consists of 100,480,507 ratings from 480,189 users on 17,770 movies. A designated probe set, a subset of 1,408,395 of these ratings, was distributed to participants for calibration purposes, leaving 99,072,112 ratings for training. We did not consider the probe set as part of this numerical experiment, instead choosing 1,500,000 randomly selected entries as test data, with the remaining 97,572,112 used for training purposes. As with the MovieLens data, we select a 20 × 25 grid of (λ, γ) values, adaptively choosing an operating rank threshold, and we also remove row and column means for prediction purposes.
As shown in Figure 4.8, the MC+ family again yields better prediction performance with more parsimonious models. On average, and for a convergence tolerance of 0.001 in Algorithm 4.1, the sequence of twenty models took under 10.5 hours of computing on an Intel E5-2650L cluster with a 2.6 GHz processor. Our main goal here is to show the feasibility of applying NC-Impute with the MC+ family, in batch mode, to such a large data set, and to show that it works better in terms of statistical properties when compared to the nuclear norm regularized problem.
4.5 Conclusions
In this project we present a computational study of the noisy matrix completion problem with nonconvex spectral penalties; we consider a family of spectral penalties that bridge the convex nuclear norm penalty and the rank penalty, leading to a family of estimators with varying degrees of shrinkage and nonconvexity. We propose NC-Impute, an algorithm inspired by the EM-stylized procedure Soft-Impute (Mazumder et al., 2010), which effectively computes a two-dimensional family of solutions with specialized warm-start strategies. The main computational bottleneck of our algorithm is a low-rank SVD of a matrix that can be written as the sum of a sparse and a low-rank matrix, which is performed using a block QR stylized strategy that makes effective use of singular subspace warm-start information across iterations. We discuss computational guarantees of our algorithm, including a finite-time complexity analysis for reaching a first order stationary point. We present a systematic study of various properties of spectral thresholding functions, a main tool in our algorithm. We also demonstrate the impressive gains in statistical properties of our framework on a wide array of synthetic and real data sets.
Bibliography
Aggarwal, C. and Subbian, K. (2014). Evolutionary network analysis: A survey. ACM
Computing Surveys, 47:10:1–10:36.
Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008). Mixed membership
stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. In Petrov, B. N. and Csaki, F., editors, Proceedings of the 2nd Interna-
tional Symposium on Information Theory. Akademiai Kiado, Budapest.
Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random
Matrices. Springer.
Bazzi, M., Porter, M. A., Williams, S., McDonald, M., Fenn, D. J., and Howison,
S. D. (2015). Community detection in temporal multilayer networks, and its ap-
plication to correlation networks. Multiscale Modeling and Simulation: A SIAM
Interdisciplinary Journal, to appear.
Bernau, C., Waldron, L., and Riester, M. (2014). survHD: Synthesis of High-Dimensional Survival Analysis. R package version 0.99.1, URL https://bitbucket.org/lwaldron/survhd.
Bertsimas, D., King, A., and Mazumder, R. (2016). Best subset selection via a modern
optimization lens. Annals of Statistics (to appear).
Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989-1010.
Borwein, J. and Lewis, A. (2006). Convex Analysis and Nonlinear Optimization.
Springer.
Breheny, P. (2013). ncvreg: Regularization Paths for SCAD- and MCP-Penalized Regression Models. R package version 2.6-0, URL http://CRAN.R-project.org/package=ncvreg.
Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232-253.
Cai, J.-F., Candes, E. J., and Shen, Z. (2010). A singular value thresholding algorithm
for matrix completion. SIAM Journal on Optimization, 20:1956–1982.
Candes, E., Sing-Long, C., and Trzasko, J. D. (2013). Unbiased risk estimates for
singular value thresholding and spectral estimators. Signal Processing, IEEE Trans-
actions on, 61(19):4643–4657.
Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the
IEEE, 98:925–936.
Candes, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization.
Foundations of Computational mathematics, 9:717–772.
Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759-771.
Chen, K. and Lei, J. (2014). Network cross-validation for determining the number of
communities in network data. Available at arXiv:1411.1715v1.
Chen, Y., Bhojanapalli, S., Sanghavi, S., and Ward, R. (2014). Coherent matrix com-
pletion. In Proceedings of the 31st International Conference on Machine Learning,
pages 674–682. JMLR.
Chistov, A. L. and Grigor’ev, D. Y. (1984). Complexity of quantifier elimination in
the theory of algebraically closed fields. In Mathematical Foundations of Computer
Science 1984, pages 17–31. Springer.
Cho, H. and Fryzlewicz, P. (2015). Multiple-change-point detection for high dimen-
sional time series via sparsified binary segmentation. Journal of the Royal Statistical
Society, Series B, 77:475–507.
Choi, D. S., Wolfe, P. J., and Airoldi, E. M. (2012). Stochastic blockmodels with a
growing number of classes. Biometrika, 99:273–284.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62(2):269–276.
Cox, D. R. and Reid, N. (2004). A note on pseudolikelihood constructed from marginal
densities. Biometrika, 91:729–737.
Cribben, I. and Yu, Y. (2015). Estimating whole brain dynamics using spectral
clustering. Available at arXiv:1509.03730v1.
Daudin, J.-J., Picard, F., and Robin, S. (2008). A mixture model for random graphs.
Statistics and Computing, 18:173–183.
Decelle, A., Krzakala, F., Moore, C., and Zdeborova, L. (2011). Asymptotic analysis
of the stochastic block model for modular networks and its algorithmic applications.
Physical Review E, 84:066106.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.
Donath, W. E. and Hoffman, A. J. (1973). Lower bounds for the partitioning of
graphs. IBM Journal of Research and Development, 17:420–425.
Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination
methods for the classification of tumors using gene expression data. Journal of the
American Statistical Association, 97(457):77–87.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression
(with discussion). Annals of Statistics, 32(2):407–499.
Fan, J., Feng, Y., Saldana, D. F., Samworth, R., and Wu, Y. (2015). SIS: Sure Independence Screening. R package version 0.7-6, URL http://CRAN.R-project.org/package=SIS.
Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in
sparse ultra-high-dimensional additive models. Journal of the American Statistical
Association, 106(494):544–557.
Fan, J., Feng, Y., and Wu, Y. (2010). High-dimensional variable selection for Cox's proportional hazards model. IMS Collections, 6:70-86.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and
its oracle properties. Journal of the American Statistical Association, 96:1348–1360.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional
feature space. Journal of the Royal Statistical Society, Series B, 70(5):849–911.
Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection:
Beyond the linear model. Journal of Machine Learning Research, 10:2013–2038.
Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38(6):3567-3604.
Fazel, M. (2002). Matrix Rank Minimization with Applications. PhD thesis, Stanford
University.
Fearnhead, P. (2004). Particle filters for mixture models with an unknown number of
components. Statistics and Computing, 14:11–21.
Feng, Y. and Yu, Y. (2013). Consistent cross-validation for tuning parameter selection
in high-dimensional variable selection. manuscript.
Fenn, D. J., Porter, M. A., Mucha, P. J., McDonald, M., Williams, S., Johnson, N. F.,
and Jones, N. S. (2012). Dynamical clustering of exchange rates. Quantitative
Finance, 12:1493–1520.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209-230.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics
regression tools. Technometrics, 35:109–135.
Freund, R. M., Grigas, P., and Mazumder, R. (2015). An Extended Frank-Wolfe
Method with “In-Face” Directions, and its Application to Low-Rank Matrix Com-
pletion. ArXiv e-prints.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for general-
ized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–
22.
Friedman, J., Hastie, T., and Tibshirani, R. (2013). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 1.9-5, URL http://CRAN.R-project.org/package=glmnet.
Gao, X. and Song, P. X.-K. (2010). Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of The American Statistical Association, 105:1531-1540.
Ghasemian, A., Zhang, P., Clauset, A., Moore, C., and Peel, L. (2015). Detectability
thresholds and optimal algorithms for community structure in dynamic networks.
Available at arXiv:1506.06179v1.
Glaz, J., Naus, J., and Wallenstein, S. (2001). Scan Statistics. Springer, New York.
Goldenberg, A., Zheng, A. X., Fienberg, S. E., and Airoldi, E. M. (2010). A survey of
statistical network models. Foundations and Trends in Machine Learning, 2:129–
233.
Golub, G. and Van Loan, C. (1983). Matrix Computations. Johns Hopkins University
Press, Baltimore.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and
Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class
prediction by gene expression monitoring. Science, 286(5439):531–537.
Gordon, G. J., Jensen, R. V., Hsiao, L. L., Gullans, S. R., Blumenstock, J. E.,
Ramaswamy, S., Richards, W. G., Sugarbaker, D. J., and Bueno, R. (2002). Trans-
lation of microarray data into clinically relevant cancer diagnostic tests using gene
expression ratios in lung cancer and mesothelioma. Cancer Research, 62:4963–4967.
Guattery, S. and Miller, G. L. (1998). On the quality of spectral separators. SIAM
Journal on Matrix Analysis and Applications, 19:701–719.
Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection.
Journal of Machine Learning Research, 3:1157–1182.
Guyon, X. (1995). Random Fields on a Network: Modeling, Statistics, and Applica-
tions. Springer-Verlag, New York.
Han, Q., Xu, K. S., and Airoldi, E. M. (2015). Consistent estimation of dynamic and
multi-layer block models. In Proceedings of the 32nd International Conference on
Machine Learning, pages 1511–1520. JMLR.
Handcock, M. S., Raftery, A. E., and Tantrum, J. M. (2007). Model-based clustering
for social networks. Journal of the Royal Statistical Society, Series A, 170:301–354.
Hastie, T., Mazumder, R., Lee, J. D., and Zadeh, R. (2016). Matrix completion
and low-rank svd via fast alternating least squares. Journal of Machine Learning
Research, to appear.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learn-
ing: Prediction, Inference and Data Mining (Second Edition). Springer Verlag, New
York.
Heagerty, P. J. and Lele, S. R. (1998). A composite likelihood approach to binary
spatial data. Journal of the American Statistical Association, 93:1099–1111.
Ho, Q., Song, L., and Xing, E. P. (2011). Evolving cluster mixed-membership block-
model for time-varying networks. In Proceedings of the 14th International Confer-
ence on Artificial Intelligence and Statistics, pages 342–350. JMLR.
Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels:
First steps. Social Networks, 5:109–137.
Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge university press.
Hunter, D. R., Krivitsky, P. N., and Schweinberger, M. (2012). Computational statis-
tical methods for social network models. Journal of Computational and Graphical
Statistics, 21:856–882.
Jaggi, M. and Sulovsky, M. (2010). A simple algorithm for nuclear norm regularized
problems. In Proceedings of the 27th International Conference on Machine Learning
(ICML-10), pages 471–478.
Jin, J. (2015). Fast community detection by SCORE. The Annals of Statistics,
43:57–89.
Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time
Data. John Wiley & Sons, New Jersey, 2nd edition.
Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community
structure in networks. Physical Review E, 83:016107.
Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from noisy
entries. Journal of Machine Learning Research, 11:2057–2078.
Larsen, R. (2004). PROPACK: software for large and sparse SVD calculations. Available at http://sun.stanford.edu/~rmunk/PROPACK.
Latouche, P., Birmele, E., and Ambroise, C. (2012). Variational Bayesian inference
and complexity control for stochastic block models. Statistical Modelling, 12:93–
115.
Lee, N. H. and Priebe, C. E. (2011). A latent process model for time series of
attributed random graphs. Statistical Inference for Stochastic Processes, 14:231–
253.
Leisch, F., Weingessel, A., and Hornik, K. (1998). On the generation of correlated artificial binary data. Working Paper Series, SFB, Adaptive Information Systems and Modelling in Economics and Management Science.
Lewis, A. S. (1995). The convex analysis of unitarily invariant matrix functions.
Journal of Convex Analysis, 2:173–183.
Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics,
80:221–239.
Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery
using regularized least squares. The Annals of Statistics, 37:3498–3528.
Macon, K. T., Mucha, P. J., and Porter, M. A. (2012). Community structure in the
united nations general assembly. Physica A, 391:343–361.
Mallows, C. L. (1973). Some comments on C_p. Technometrics, 15(4):661-675.
Mazumder, R., Friedman, J. H., and Hastie, T. (2011). Sparsenet: coordinate de-
scent with nonconvex penalties. Journal of the American Statistical Association,
106:1125–1138.
Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization al-
gorithms for learning large incomplete matrices. Journal of Machine Learning
Research, 11:2287–2322.
Mazumder, R. and Radchenko, P. (2015). The discrete Dantzig selector: Estimating sparse linear models via mixed integer linear optimization. arXiv preprint arXiv:1508.01922.
McDaid, A. F., Murphy, T. B., Friel, N., and Hurley, N. J. (2013). Improved Bayesian inference for the stochastic blockmodel with application to large networks. Computational Statistics and Data Analysis, 60:12-31.
Miller, J. W. and Harrison, M. T. (2014). Inconsistency of Pitman-Yor process mixtures for the number of components. Journal of Machine Learning Research, 15:3333-3370.
Minka, T. P. (2000). Estimating a Dirichlet distribution. Technical report, Microsoft Research.
Negahban, S. N. and Wainwright, M. J. (2012). Restricted strong convexity and
weighted matrix completion: optimal bounds with noise. Journal of Machine
Learning Research, 13:1665–1697.
Newman, M. E. J. (2004). Detecting community structure in networks. The European
Physical Journal B, 38:321–330.
Newman, M. E. J. and Girvan, M. (2004). Finding and evaluating community struc-
ture in networks. Physical Review E, 69:026113.
Nikolova, M. (2000). Local strong homogeneity of a regularized estimator. SIAM
Journal on Applied Mathematics, 61:633–658.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in mul-
tiple regression. The Annals of Statistics, 12(2):758–765.
Nobile, A. and Fearnside, A. (2007). Bayesian finite mixtures with an unknown
number of components: The allocation sampler. Statistics and Computing, 17:147–
162.
Oberthuer, A., Berthold, F., Warnat, P., Hero, B., Kahlert, Y., Spitz, R., Ernestus,
K., Konig, R., Haas, S., Eils, R., Schwab, M., Brors, B., Westermann, F., and
Fischer, M. (2006). Customized oligonucleotide microarray gene expression-based
classification of neuroblastoma patients outperforms current clinical risk stratifica-
tion. Journal of Clinical Oncology, 24(31):5070–5078.
Perry, P. O. and Wolfe, P. J. (2012). Null models for network data. Available at
arXiv:1201.5871v1.
Perry, P. O. and Wolfe, P. J. (2013). Point process modelling for directed interaction
networks. Journal of the Royal Statistical Society, Series B, 75:821–849.
Priebe, C. E., Conroy, J. M., Marchette, D. J., and Park, Y. (2005). Scan statistics on Enron graphs. Computational and Mathematical Organization Theory, 11:229-247.
Priebe, C. E., Park, Y., Marchette, D. J., Conroy, J. M., Grothendieck, J., and Gorin, A. L. (2010). Statistical inference on attributed random graphs: Fusion of graph features and content: An experiment on time series of Enron graphs. Computational Statistics and Data Analysis, 54:1766-1776.
Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-
corrected stochastic blockmodel. In Advances in Neural Information Processing
Systems 26, pages 3120–3128.
Rakotomamonjy, A. (2003). Variable selection using svm-based criteria. Journal of
Machine Learning Research, 3:1357–1370.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods.
Journal of the American Statistical Association, 66:846–850.
Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine
Learning Research, 12:3413–3430.
Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions
of linear matrix equations via nuclear norm minimization. SIAM Review, 52:471–
501.
Robinson, L. F. and Priebe, C. E. (2015). Detecting time-dependent structure in
network data via a new class of latent process models. Computational Statistics,
to appear.
Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton,
New Jersey.
Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank
matrices. The Annals of Statistics, 39:887–930.
Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-
dimensional stochastic blockmodel. The Annals of Statistics, 39:1878–1915.
BIBLIOGRAPHY 139
Rohe, K., Qin, T., and Fan, H. (2014). The highest dimensional stochastic blockmodel
with a regularized estimator. Statistica Sinica, 24:1771–1786.
Saldana, D. F., Yu, Y., and Feng, Y. (2015). How many communities are there?
Journal of Computational and Graphical Statistics, to appear.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics,
6(2):461–464.
Schweinberger, M. and Handcock, M. S. (2015). Local dependence in random graph
models: characterization, properties and statistical inference. Journal of the Royal
Statistical Society, Series B, 77:647–676.
Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 22:888–905.
SIGKDD, A. and Netflix (2007). Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In Proceedings of KDD Cup and Workshop. Available at http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings.html.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chap-
man & Hall/CRC, London.
Simon, H., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5):1-13.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P.,
Renshaw, A. A., D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff,
P. W., Golub, T. R., and Sellers, W. R. (2002). Gene expression correlates of
clinical prostate cancer behavior. Cancer Cell, 1(2):203–209.
Spielmat, D. A. and Teng, S.-H. (1996). Planar graphs and finite element meshes.
In Foundations of Computer Science, 1996. Proceedings., 37th Annual Symposium
on, pages 96–105. IEEE.
Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution.
The Annals of Statistics, pages 1135–1151.
Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of
components - an alternative to reversible jump methods. The Annals of Statistics,
28:40–74.
R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Therneau, T. M. and Lumley, T. (2015). survival: Survival Analysis. R package
version 2.38-3, URL http://CRAN.R-project.org/package=survival.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Series B, 58:267–288.
Tibshirani, R. (1997). The lasso method for variable selection in the cox model.
Statistics in Medicine, 16:385–395.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple
cancer types by shrunken centroids of gene expression. Proceedings of the National
Academy of Sciences of the United States of America, 99(10):6567–6572.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520-525.
Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical
Analysis, 92:1–28.
Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood
methods. Statistica Sinica, 21:5–42.
Varin, C. and Vidoni, P. (2005). A note on composite likelihood inference and model
selection. Biometrika, 92:519–528.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing,
17:395–416.
Wang, H., Tang, M., Park, Y., and Priebe, C. E. (2014). Locality statistics for
anomaly detection in time series of graphs. IEEE Transactions on Signal Process-
ing, 62(3):703–717.
Wei, Y.-C. and Cheng, C.-K. (1989). Towards efficient hierarchical designs by ratio
cut partitioning. In Computer-Aided Design, 1989. ICCAD-89. Digest of Technical
Papers., 1989 IEEE International Conference on, pages 298–301.
Westveld, A. H. and Hoff, P. D. (2011). A mixed effects model for longitudinal
relational and network data, with applications to international trade and conflict.
The Annals of Applied Statistics, 5:843–872.
Xing, E. P., Fu, W., and Song, L. (2010). A state-space mixed membership blockmodel
for dynamic network tomography. The Annals of Applied Statistics, 4:535–566.
Xu, K. S. and Hero, A. O. (2014). Dynamic stochastic blockmodels for time-evolving
social networks. IEEE Journal of Selected Topics in Signal Processing, 8(4):552–
562.
Yu, Y. and Feng, Y. (2014a). Apple: Approximate path for penalized likelihood
estimators. Statistics and Computing, 24:803–819.
Yu, Y. and Feng, Y. (2014b). Modified cross-validation for penalized high-dimensional
linear regression models. Journal of Computational and Graphical Statistics,
23(4):1009–1027.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave
penalty. The Annals of Statistics, 38:894–942.
Zhang, C.-H., Zhang, T., et al. (2012). A general theory of concave regularization for
high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593.
Zhao, Y., Levina, E., and Zhu, J. (2012). Consistency of community detection in
networks under degree-corrected stochastic block models. The Annals of Statistics,
40:2266–2292.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society, Series B, 67(2):301–320.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likeli-
hood models. The Annals of Statistics, 36(4):1509–1533.
Appendix A
Technical Material for NC-Impute
Lemma 1 (Marchenko-Pastur law; Bai and Silverstein (2010)). Let X ∈ R^{m×n}, where the X_ij are iid with E(X_ij) = 0, E(X_ij^2) = 1, and m > n. Let λ_1 ≤ λ_2 ≤ ··· ≤ λ_n be the eigenvalues of Q_m = (1/m) X^T X. Define the random spectral measure

μ_n = (1/n) Σ_{i=1}^n δ_{λ_i}.

Then, assuming n/m → α ∈ (0, 1], we have μ_n(·, ω) → μ a.s., where μ is a deterministic measure with density

dμ/dx = √( (α_+ − x)(x − α_−) ) / (2παx) · 1(α_− ≤ x ≤ α_+).    (A.1)

Here, α_+ = (1 + √α)^2 and α_− = (1 − √α)^2.
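As a quick sanity check of Lemma 1, the following R snippet compares the empirical eigenvalues of Q_m = X^T X/m with the density (A.1); the dimensions are arbitrary choices of ours.

m <- 2000; n <- 500; alpha <- n / m
X  <- matrix(rnorm(m * n), m, n)
ev <- eigen(crossprod(X) / m, symmetric = TRUE, only.values = TRUE)$values
ap <- (1 + sqrt(alpha))^2; am <- (1 - sqrt(alpha))^2
hist(ev, breaks = 40, freq = FALSE, main = "Marchenko-Pastur check")
curve(sqrt(pmax((ap - x) * (x - am), 0)) / (2 * pi * alpha * x),
      from = am, to = ap, add = TRUE, lwd = 2)   # density (A.1)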
A.1 Proof of Proposition 4.5.
Throughout the proof, we make use of the following notation. For two positive sequences a_k and b_k, we say a_k = Θ_2(b_k) if there exists a constant c > 0 such that a_k ≥ c b_k, and we say a_k = Θ_1(b_k) whenever a_k = Θ_2(b_k) and b_k = Θ_2(a_k).
We first consider the case λ_n = Θ_1(√m). For simplicity, we assume λ_n = ζ√m for some constant ζ > 0. Denote df( S_{λ_n,γ}(Z) ) = D_{λ_n,γ}. Adopting the notation from Lemma 1, it is not hard to verify that

D_{λ_n,γ} = n E_{μ_n}[ s′_{λ_n,γ}(√(m t_1)) + |m − n| s_{λ_n,γ}(√(m t_1)) / √(m t_1) ]
          + n^2 E_{μ_n}[ ( √(m t_1) s_{λ_n,γ}(√(m t_1)) − √(m t_2) s_{λ_n,γ}(√(m t_2)) ) / ( m t_1 − m t_2 ) · 1(t_1 ≠ t_2) ],

where t_1, t_2 ~ iid μ_n.
A quick check of the relation between s_{λ_n,γ} and g_{ζ,γ} yields

D_{λ_n,γ} / (mn) = (1/m) E_{μ_n} s′_{λ_n,γ}(√(m t_1)) + (1 − n/m) E_{μ_n} g_{ζ,γ}(t_1)
                 + (n/m) E_{μ_n}[ ( t_1 g_{ζ,γ}(t_1) − t_2 g_{ζ,γ}(t_2) ) / ( t_1 − t_2 ) · 1(t_1 ≠ t_2) ].
Due to the Lipschitz continuity of the functions s_{λ_n,γ}(x) and x g_{ζ,γ}(x), we obtain

| D_{λ_n,γ} / (mn) | ≤ γ / ( m(γ − 1) ) + (1 − n/m) + (n/m) · (2γ − 1)/(2γ − 2).

Hence, there exists a positive constant C_α such that, for sufficiently large n,

| D_{λ_n,γ} / (mn) | ≤ C_α   a.s.
Let $T_1, T_2$ be two independent random variables generated from the Marchenko–Pastur distribution $\mu$. If we can show
$$\frac{D_{\lambda_n,\gamma}}{mn} \to (1 - \alpha)\,\mathbb{E}\, g_{\zeta,\gamma}(T_1) + \alpha\,\mathbb{E}\left(\frac{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)}{T_1 - T_2}\right) \quad \text{a.s.},$$
then by the Dominated Convergence Theorem (DCT), we conclude the proof in the $\lambda_n = \Theta_1(\sqrt{m})$ regime. Note immediately that
$$\frac{1}{m}\,\mathbb{E}_{\mu_n} s'_{\lambda_n,\gamma}(\sqrt{m}\,t_1) \to 0 \quad \text{a.s.} \tag{A.2}$$
Moreover, since $g_{\zeta,\gamma}(\cdot)$ is bounded and continuous, the Marchenko–Pastur result in Lemma 1 gives
$$\left(1 - \frac{n}{m}\right)\mathbb{E}_{\mu_n}\, g_{\zeta,\gamma}(t_1) \to (1 - \alpha)\,\mathbb{E}_\mu\, g_{\zeta,\gamma}(T_1) \quad \text{a.s.} \tag{A.3}$$
Since $(t_1, t_2) \overset{d}{\to} (T_1, T_2)$, and given that the discontinuity set of the measurable function $\frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,\mathbf{1}(t_1 \neq t_2)$ has zero probability under the measure induced by $(T_1, T_2)$, the continuous mapping theorem yields
$$\frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,\mathbf{1}(t_1 \neq t_2) \overset{d}{\to} \frac{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)}{T_1 - T_2}\,\mathbf{1}(T_1 \neq T_2) \quad \text{as } n \to \infty.$$
Due to the boundedness of this function, we further have
$$\mathbb{E}_{\mu_n}\left[\frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,\mathbf{1}(t_1 \neq t_2)\right] \to \mathbb{E}_\mu\left[\frac{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)}{T_1 - T_2}\,\mathbf{1}(T_1 \neq T_2)\right] \tag{A.4}$$
almost surely. Combining (A.2)–(A.4) completes the proof for the $\lambda_n = \Theta_1(\sqrt{m})$ case.
When $\lambda_n = o(\sqrt{m})$, we can readily see that $\mathbb{E}_{\mu_n}\mathbf{1}(\sqrt{m}\,t_1 \ge \lambda_n\gamma) \to 1$ a.s. Using the fact that $\frac{s_{\lambda_n,\gamma}(\sqrt{m}\,t_1)}{\sqrt{m}\,t_1}$ and $\frac{\sqrt{m}\,t_1 s_{\lambda_n,\gamma}(\sqrt{m}\,t_1) - \sqrt{m}\,t_2 s_{\lambda_n,\gamma}(\sqrt{m}\,t_2)}{m t_1 - m t_2}\,\mathbf{1}(t_1 \neq t_2)$ are bounded, we have, almost surely,
$$\mathbb{E}_{\mu_n}\frac{s_{\lambda_n,\gamma}(\sqrt{m}\,t_1)}{\sqrt{m}\,t_1} = \mathbb{E}_{\mu_n}\mathbf{1}(\sqrt{m}\,t_1 \ge \lambda_n\gamma) + \mathbb{E}_{\mu_n}\left[\frac{s_{\lambda_n,\gamma}(\sqrt{m}\,t_1)}{\sqrt{m}\,t_1}\,\mathbf{1}(\sqrt{m}\,t_1 < \lambda_n\gamma)\right] \to 1$$
and
$$\mathbb{E}_{\mu_n}\left[\frac{\sqrt{m}\,t_1 s_{\lambda_n,\gamma}(\sqrt{m}\,t_1) - \sqrt{m}\,t_2 s_{\lambda_n,\gamma}(\sqrt{m}\,t_2)}{m t_1 - m t_2}\,\mathbf{1}(t_1 \neq t_2)\right] = \mathbb{E}_{\mu_n}\left[\mathbf{1}(\sqrt{m}\,t_1 \ge \lambda_n\gamma)\,\mathbf{1}(\sqrt{m}\,t_2 \ge \lambda_n\gamma)\right] + o(1) \to 1.$$
Invoking the Dominated Convergence Theorem completes the proof. Similar arguments hold for the case $\lambda_n = \Theta_2(\sqrt{m})$.
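As a small numerical sanity check on the $\lambda_n = o(\sqrt{m})$ step above, the following R sketch tracks the average of $s_{\lambda_n,\gamma}(\sqrt{m}\,t)/(\sqrt{m}\,t)$ over the eigenvalues of $Q_m$ as $m$ grows, with $\lambda_n = m^{1/4}$. For this sketch only, we take $s_{\lambda,\gamma}$ to be a firm-thresholding operator, one form consistent with the boundedness properties used above; the precise definition lives in the main text.

# Key step in the lambda_n = o(sqrt(m)) case: the mean ratio
# s(sqrt(m) t) / (sqrt(m) t) over the eigenvalues of Q_m should
# approach 1 as m grows. The firm-thresholding form of s below is
# an assumption made for this illustration only.
firm <- function(x, lam, gam)
  ifelse(x >= lam * gam, x, gam * pmax(0, x - lam) / (gam - 1))
set.seed(2)
gam <- 2
for (m in c(400, 1600, 6400)) {
  n <- m / 4
  X <- matrix(rnorm(m * n), m, n)
  t <- eigen(crossprod(X) / m, symmetric = TRUE, only.values = TRUE)$values
  lam.n <- m^0.25                      # lambda_n = o(sqrt(m))
  cat(sprintf("m = %5d: mean ratio = %.4f\n",
              m, mean(firm(sqrt(m) * t, lam.n, gam) / (sqrt(m) * t))))
}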
A.2 Proof of Proposition 4.9
Proof of Part (a):
Let us write the stationarity conditions for every update $X_{k+1} = \arg\min_X F_\ell(X; X_k)$. We set the subdifferential of the map $X \mapsto F_\ell(X; X_k)$ to zero at $X = X_{k+1}$:
$$\left(X_{k+1} - \left(P_\Omega(Y) + P_\Omega^\perp(X_k)\right)\right) + \ell\,(X_{k+1} - X_k) + U_{k+1}\nabla_{k+1}V'_{k+1} = 0, \tag{A.5}$$
where $X_{k+1} = U_{k+1}\,\mathrm{diag}(\sigma_{k+1})\,V'_{k+1}$ is the singular value decomposition of $X_{k+1}$. Note that the term $U_{k+1}\nabla_{k+1}V'_{k+1}$ in (A.5) is a subdifferential (Lewis, 1995) of the spectral function $X \mapsto \sum_i P(\sigma_i(X); \lambda, \gamma)$, where $\nabla_{k+1}$ is a diagonal matrix whose $i$th diagonal entry is a derivative of the map $\sigma_i \mapsto P(\sigma_i; \lambda, \gamma)$ (on $\sigma_i \ge 0$), denoted by $\partial P(\sigma_{k+1,i}; \lambda, \gamma)/\partial\sigma_i$ for all $i$. Note that (A.5) can be rewritten as
$$P_\Omega(X_{k+1}) - P_\Omega(Y) + \underbrace{\left(P_\Omega^\perp(X_{k+1} - X_k) + \ell\,(X_{k+1} - X_k)\right)}_{(a)} + U_{k+1}\nabla_{k+1}V'_{k+1} = 0. \tag{A.6}$$
As $k \to \infty$, term (a) converges to zero (see Proposition 4.7), and thus we have
$$P_\Omega(X_{k+1}) - P_\Omega(Y) + U_{k+1}\nabla_{k+1}V'_{k+1} \to 0. \tag{A.7}$$
Let us denote the $i$th column of $U_k$ by $u_{k,i}$, and use similar notation for $V_k$ and $v_{k,i}$. Let $r_{k+1}$ denote the rank of $X_{k+1}$. Hence, we have
$$\sum_{i=1}^{r_{k+1}} \sigma_{k+1,i}\,P_\Omega(u_{k+1,i}v'_{k+1,i}) - P_\Omega(Y) + U_{k+1}\nabla_{k+1}V'_{k+1} \to 0.$$
Multiplying the above on the left by $u'_{k+1,j}$ and on the right by $v_{k+1,j}$, we obtain
$$\sum_{i=1}^{r_{k+1}} \sigma_{k+1,i}\,u'_{k+1,j}P_\Omega(u_{k+1,i}v'_{k+1,i})\,v_{k+1,j} - u'_{k+1,j}P_\Omega(Y)\,v_{k+1,j} + \nabla_{k+1,j} \to 0, \tag{A.8}$$
for $j = 1, \ldots, r_{k+1}$. Let $\{\bar{U}, \bar{V}\}$ denote a limit point of the sequence $\{U_k, V_k\}$ (which exists since the sequence is bounded), and let $\bar{r}$ be the rank of $\bar{U}$ and $\bar{V}$. Let us now study the following equations (note that we do not assume that $\sigma_k$ has a limit point):
$$\sum_{i=1}^{\bar{r}} \sigma_i\,\bar{u}'_j P_\Omega(\bar{u}_i\bar{v}'_i)\,\bar{v}_j - \bar{u}'_j P_\Omega(Y)\,\bar{v}_j + \bar{\nabla}_j = 0, \qquad j = 1, \ldots, \bar{r}. \tag{A.9}$$
Using the notation $\bar{\theta}_j = \mathrm{vec}\left(P_\Omega(\bar{u}_j\bar{v}'_j)\right)$ and $y = \mathrm{vec}(P_\Omega(Y))$, we note that (A.9) gives the first order stationarity conditions for a point $\bar{\sigma}$ of the following penalized regression problem:
$$\underset{\sigma}{\text{minimize}} \quad \frac{1}{2}\Big\|\sum_{j=1}^{\bar{r}} \sigma_j\bar{\theta}_j - y\Big\|_2^2 + \sum_{j=1}^{\bar{r}} P(\sigma_j; \lambda, \gamma), \tag{A.10}$$
with $\sigma \ge 0$.
If the matrix $\bar{\Theta} = [\bar{\theta}_1, \ldots, \bar{\theta}_{\bar{r}}]$ (note that $\bar{\Theta} \in \mathbb{R}^{mn \times \bar{r}}$) has rank $\bar{r}$, then any $\sigma$ that satisfies (A.9) is finite; in particular, the sequence $\sigma_k$ is bounded and has a limit point $\bar{\sigma}$, which satisfies the first order stationarity condition (A.9).
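To make the reduction to (A.10) concrete, the following R sketch assembles the design matrix $\bar{\Theta}$ from vectorized, projected rank-one matrices; the inputs U, V, and Omega are hypothetical names for this illustration, not part of the NC-Impute package.

# Assemble Theta = [theta_1, ..., theta_r] from the rank-one matrices
# u_j v_j', projected onto the observed entries and vectorized, so that
# (A.10) becomes an ordinary penalized least squares problem in sigma.
make_theta <- function(U, V, Omega) {
  # Omega: logical m x n matrix flagging the observed entries;
  # P_Omega zeroes out the unobserved ones.
  sapply(seq_len(ncol(U)), function(j) {
    M <- tcrossprod(U[, j], V[, j])   # the rank-one matrix u_j v_j'
    M[!Omega] <- 0                    # apply P_Omega
    as.vector(M)                      # vec(.)
  })
}
# The corresponding response would be the vectorization of P_Omega(Y),
# e.g. Y with its unobserved entries zeroed out and then as.vector'd.

With $\bar{\Theta}$ and $y$ in hand, checking the full-column-rank condition of Part (a) reduces to a rank computation on $\bar{\Theta}$.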
Proof of Part (b):
Furthermore, if we assume that
$$\lambda_{\min}(\bar{\Theta}'\bar{\Theta}) + \phi_P > 0,$$
then (A.10) admits a unique solution $\bar{\sigma}$, which implies that $\sigma_k$ has a unique limit point, and hence the sequence $\sigma_k$ necessarily converges.