
Advances in Model Selection Techniques with Applications to Statistical Network

Analysis and Recommender Systems

Diego Franco Saldana

Submitted in partial fulfillment of the

requirements for the degree

of Doctor of Philosophy

in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2016


© 2016

Diego Franco Saldana

All Rights Reserved


ABSTRACT

Advances in Model Selection Techniques with Applications to Statistical Network

Analysis and Recommender Systems

Diego Franco Saldana

This dissertation focuses on developing novel model selection techniques, the process by which a statistician selects one of a number of competing models of varying dimensions, under an array of different statistical assumptions on observed data. Traditionally, two main reasons have been advocated by researchers for preferring model selection strategies over classical maximum likelihood estimates (MLEs). The first reason is prediction accuracy: by shrinking or setting to zero some model parameters, one sacrifices the unbiasedness of MLEs for a reduced variance, which in turn leads to an overall improvement in predictive performance. The second reason relates to interpretability of the selected models in the presence of a large number of predictors, where, in order to obtain a parsimonious representation exhibiting the relationship between the response and covariates, we are willing to sacrifice some of the smaller details brought in by spurious predictors.

In the first part of this work, we revisit the family of variable selection techniques known as sure independence screening procedures for generalized linear models and the Cox proportional hazards model. By cleverly combining some of their most powerful variants, we propose new extensions based on the ideas of sample splitting, data-driven thresholding, and combinations thereof. A publicly available package developed in the R statistical software demonstrates considerable improvements in model selection, at a competitive computational time, for our enhanced variable selection procedures relative to traditional penalized likelihood methods applied directly to the full set of covariates.

Next, we develop model selection techniques within the framework of statistical network analysis for two frequent problems arising in the context of stochastic blockmodels: community number selection and change-point detection. In the second part of this work, we propose a composite likelihood-based approach for selecting the number of communities in the stochastic blockmodel and its variants, with robustness considerations against possible misspecifications in the underlying conditional independence assumptions of the stochastic blockmodel. Several simulation studies, as well as two real data examples, demonstrate the superiority of our composite likelihood approach when compared to the traditional Bayesian Information Criterion or variational Bayes solutions. In the third part of this thesis, we extend our analysis of static network data to the case of dynamic stochastic blockmodels, where our model selection task is the segmentation of a time-varying network into temporal and spatial components by means of a change-point detection hypothesis testing problem. We propose a corresponding test statistic based on the idea of data aggregation across the different temporal layers through kernel-weighted adjacency matrices computed before and after each candidate change-point, and illustrate our approach on synthetic data and the Enron email corpus.

The matrix completion problem consists in the recovery of a low-rank data matrix based on a small sampling of its entries. In the final part of this dissertation, we extend prior work on nuclear norm regularization methods for matrix completion by incorporating a continuum of penalty functions between the convex nuclear norm and nonconvex rank functions. We propose an algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm-starts, and present a systematic study of the resulting spectral thresholding operators. We demonstrate that our proposed nonconvex regularization framework leads to improved model selection properties in terms of finding low-rank solutions with better predictive performance on a wide range of synthetic data and the famous Netflix recommender system data.


Table of Contents

List of Figures
List of Tables

1 SIS: An R Package for Sure Independence Screening in Ultrahigh Dimensional Statistical Models
1.1 Introduction
1.2 General SIS and ISIS Methodological Framework
1.2.1 SIS and Feature Ranking by Maximum Marginal Likelihood Estimators
1.2.2 Iterative Sure Independence Screening
1.3 Variants of ISIS
1.3.1 First Variant of ISIS
1.3.2 Second Variant of ISIS
1.3.3 Data-driven Thresholding
1.3.4 Implementation Details
1.4 Model Selection and Timings
1.4.1 Model Selection and Statistical Accuracy
1.4.2 Scaling in n and p with Feature Screening
1.4.3 Real Data Analysis
1.4.4 Code Example
1.5 Discussion

2 How Many Communities Are There?
2.1 Introduction
2.2 Background
2.2.1 Stochastic Blockmodels
2.2.2 Spectral Clustering and SCORE
2.3 Model Selection for the Number of Communities
2.3.1 Motivation
2.3.2 Composite Likelihood Inference
2.3.3 Composite Likelihood BIC
2.3.4 Formulae
2.4 Experiments
2.4.1 Simulations
2.4.2 Real Data Analysis
2.5 Discussion

3 Kernel-Based Change-Point Detection in Dynamic Stochastic Blockmodels
3.1 Introduction
3.2 Background
3.2.1 Dynamic Stochastic Blockmodels
3.2.2 Change-Point Model
3.3 Change-Point Detection Methodology
3.3.1 Spectral Clustering and Parameter Estimation in SBM
3.3.2 Algorithm Description and Rationale
3.3.3 Implementation Details
3.4 Experiments
3.4.1 Simulations
3.4.2 Real Data Analysis
3.5 Discussion

4 NC-Impute: Scalable Matrix Completion with Nonconvex Penalties
4.1 Introduction
4.1.1 Contributions and Outline
4.2 Spectral Thresholding Operators
4.2.1 Properties of Spectral Thresholding Operators
4.2.2 Effective Degrees of Freedom for Spectral Thresholding Operators
4.3 The NC-Impute Algorithm
4.3.1 Convergence Analysis
4.3.2 Computing the Thresholding Operators
4.4 Numerical Experiments
4.4.1 Synthetic Examples
4.4.2 Real Data Examples: MovieLens and Netflix Data Sets
4.5 Conclusions

Bibliography

A Technical Material for NC-Impute
A.1 Proof of Proposition 4.5
A.2 Proof of Proposition 4.9


List of Figures

1.1 Median runtime in seconds taken over 10 trials (log-log scale).

2.1 (Color online) Comparisons between different methods for selecting the true community number K in the standard blockmodel settings of Simulations 1–3. Along the y-axis, we record the proportion of times the chosen number of communities for each of the different criteria for selecting K agrees with the truth.

2.2 (Color online) Largest connected component of the international trade network for the year 1995.

2.3 (Color online) Largest connected component of the school friendship network. Panel (c) shows the "true" grade community labels: 7th (blue), 8th (yellow), 9th (green), 10th (purple), 11th (red), and 12th (black).

3.1 (Color online) Mean values for the test statistic d(t) and the estimated quantiles c_α(t) for each candidate change-point in Simulations 1 and 2, where the choices of α = 0.10 (green), α = 0.05 (blue), and α = 0.01 (red) are considered.

3.2 (Color online) Scan statistic values and associated detected change-points for the Enron email data set (left panels). Kernel Change-Points test statistic d(t) along with the estimated quantiles c_α(t), with α = 0.01, for each candidate change-point (right panels). Red circles indicate common discovered change-points between the two approaches, whereas gold circles represent newly discovered change-points by the KCP statistic.

4.1 [Top panel] Examples of nonconvex penalties σ ↦ P(σ; λ, γ) with λ = 1 for different values of γ. [Bottom panel] The corresponding scalar thresholding operators σ ↦ s_{λ,γ}(σ). At σ = 1, some of the thresholding operators corresponding to the ℓ_γ penalty function are discontinuous, and some of the other thresholding functions are "close" to being so.

4.2 Figure showing the df for the MC+ thresholding operator for a matrix with m = n = 10, μ = 0 and v = 1. The df profile as a function of γ (in the log scale) is shown for three values of λ. The dashed lines correspond to the df of the spectral soft-thresholding operator, corresponding to γ = ∞. We propose calibrating the (λ, γ) grid to a (λ, γ) grid such that the df corresponding to every value of γ matches the df of the soft-thresholding operator, as shown in Figure 4.3.

4.3 Figure showing the calibrated (λ, γ) lattice: for every fixed value of λ, the df of the MC+ spectral threshold operators are the same across different γ values. The df computations have been performed on a null model using Proposition 4.5.

4.4 (Color online) Random Orthogonal Model (ROM) simulations with SNR = 1. The choice γ = +∞ refers to nuclear norm regularization as provided by the Soft-Impute algorithm. The least nonconvex alternatives at γ = 100 and γ = 80 behave similarly to nuclear norm, although with better prediction performance. The choices of γ = 5 and γ = 10 result in excessively aggressive fitting behavior for the true rank = 10 case, but improve significantly in prediction error and recovering the true rank in the sparser true rank = 5 setting. In both scenarios, the intermediate models with γ = 30 and γ = 20 fare the best, with the former achieving the smallest prediction error, while the latter estimates the actual rank of the matrix. Values of test error larger than one are not displayed in the figure.

4.5 (Color online) Random Orthogonal Model (ROM) simulations with SNR = 5. The benefits of nonconvex regularization are more evident in this high-sparsity, high-missingness scenario. While the γ = 100 and γ = 80 models distance themselves more from nuclear norm, the remaining members of the MC+ family essentially minimize prediction error while correctly estimating the true rank. This is especially true in panel (d), where the best predictive performance of the model γ = 5 at the correct rank is achieved under a low-rank truth and high SNR setting.

4.6 (Color online) Coherent and Non-Uniform Sampling (NUS) simulations with SNR = 10. Nonconvex regularization also proves to be a successful strategy in these challenging scenarios, particularly in the non-uniform sampling setting where the MC+ family exhibits a monotone decrease in prediction error as γ approaches 1. Again, the model γ = 5 estimates the correct rank under high SNR settings. Although nuclear norm achieves a relatively small prediction error, compared with previous simulation settings, the MC+ family still provides a superior and more robust mechanism for regularization.

4.7 (Color online) MovieLens 100k and 1m data. For each value of λ in the solution path, an operating rank threshold (capped at 250) larger than the rank of the previous solution was employed.

4.8 (Color online) Netflix competition data. The model γ = 10 achieves optimal test set RMSE of 0.8276 for a solution rank of 105.


List of Tables

1.1 Summary of tuning parameters for variable selection using ISIS procedures within the SIS package, as well as associated defaults. All ISIS variants are implemented through the SIS function, which we describe in Section 1.4.4 using a gene expression data set.

1.2 Linear regression, Case 1, where results are given in the form of medians and robust standard deviations (in parentheses).

1.3 Logistic regression, Case 2, where results are given in the form of medians and robust standard deviations (in parentheses).

1.4 Poisson regression, Case 3, where results are given in the form of medians and robust standard deviations (in parentheses).

1.5 Cox proportional hazards regression, Case 4, where results are given in the form of medians and robust standard deviations (in parentheses).

1.6 Classification error rates and number of selected genes by various methods for the balanced Leukemia and Prostate cancer data sets. For the Leukemia data, the training and test samples are of size 36. For the Prostate cancer data, the training and test samples are of size 68. Results are given in the form of medians and robust standard deviations (in parentheses).

1.7 Classification error rates and number of selected genes by various methods for the balanced Lung and Neuroblastoma (NB) cancer data sets. For the Lung data, the training and test samples are of sizes 90 and 91, respectively. For the Neuroblastoma cancer data, the training and test samples are of size 123. Results are given in the form of medians and robust standard deviations (in parentheses).

2.1 Comparison of CL-BIC and BIC over 200 repetitions from Simulation 1, where Eq and Dec indicate equally correlated and exponential decaying cases, respectively. Both the correlation of the multivariate Gaussian random variables (ρ_MVN) and the corresponding maximum correlation between Bernoulli variables (ρ_Ber) are presented.

2.2 Comparison of CL-BIC and BIC over 200 repetitions from Simulation 2, where Ind indicates ρ_jl = 0 for j ≠ l. For simplicity, we omit the correlation between the corresponding Bernoulli variables.

2.3 Comparison of CL-BIC and VB over 200 repetitions from Simulation 3. For simplicity, we omit the correlation between the corresponding Bernoulli variables.

2.4 Comparison of CL-BIC and BIC over 200 repetitions from Simulation 4. Before being scaled by the constant γ_n, we selected θ = (θ_ab; 1 ≤ a ≤ b ≤ 4)′, where θ_aa = 7 for all a = 1, . . . , 4 and θ_ab = 1 for 1 ≤ a < b ≤ 4.

2.5 Comparison of CL-BIC and BIC over 200 repetitions from the DCBM case in Simulation 4, with (ρ_Eq = 0.2, γ_n = 0.03), where the individual effect parameters ω_i are now generated from a Uniform(1/5, 9/5) distribution.

3.1 Performance of KCP over 100 repetitions from Simulations 1 and 2. Results are given in the form of mean (standard deviation).


Acknowledgments

Throughout my years at Columbia in the pursuit of my doctoral degree, I have benefited significantly from interactions with professors, colleagues and friends. I would like to take this opportunity to express to them my most sincere gratitude and utmost admiration.

First and foremost, I wish to express my appreciation and gratitude to my advisor, Professor Yang Feng, for his guidance and continuous support during the research phase of my PhD. Yang's insights, thoughtful suggestions and dedicated involvement in my research work greatly improved the content of this dissertation, as well as my knowledge and skill set as I develop into a practicing statistician.

Next, I would like to thank Professor Rahul Mazumder for fruitful discussions regarding the intersection of statistics, machine learning and optimization. His steadfast championing and encouragement transformed a topics class project into one of the research projects presented in this dissertation.

I would also like to thank Professor Tian Zheng, Professor Peter Orbanz and Professor Sumit Mukherjee for kindly agreeing to serve on my dissertation defense committee. Their insightful comments and suggestions provided very helpful feedback to improve the work presented herein.

I am also thankful to the professors in our department for their dedication and encouragement over the years, in particular Ioannis Karatzas and Bodhisattva Sen for their formative courses, as well as Richard Davis and Zhiliang Ying for their advice and expertise as exemplary role models.

I am also deeply grateful to all my classmates in the Department of Statistics at Columbia. Your discipline, hard work and unwavering drive are the key ingredients in making our department a world-class community in which to pursue graduate studies. I would like to thank in particular my cohort classmates Rohit Patra and Zach Shahn for innumerable conversations that always helped me to see the big picture throughout my graduate studies, as well as Haolei Weng and Yi Yu for being not only the perfect coauthors, but also an invaluable source of strength and encouragement during my time at Columbia.

To all my Mexican friends here at Columbia University, and in New York in general. Elia, Colin, Dani, Diego, Jorge, Rodrigo, Prospero, Emilio and Angeles: you are the single most amazing group of friends I could have imagined through my PhD and beyond. Thank you for so many wonderful memories and your unconditional support throughout this journey. My time at Columbia would not have been the same without you; thank you for being there along the way!

My deepest gratitude goes to Susan, my beloved partner and the most compassionate and understanding person on this planet. Without her endless encouragement and invaluable support, I would not have been even close to completing this dissertation. My memories of Columbia and New York are inseparably attached to hers. I can only hope to be as supportive of her and her career as she has been of mine.

And finally, I would like to express my most sincere thanks to my parents, Estela Ivonne and Ernesto, and my brother Ernie. Your infinite love and encouragement have been my day-to-day pillars throughout my PhD. You are the single best example of hard work and perseverance that I have, and it is because of it that I have been able to successfully overcome the many challenges I have faced throughout my life. I love you all very much; to you this thesis is dedicated.


Gracias a la vida, que me ha dado tanto.

Me ha dado la marcha de mis pies cansados.

Con ellos anduve ciudades y charcos,

Playas y desiertos, montañas y llanos,

Y la casa tuya, tu calle y tu patio.

Thanks to life, which has given me so much.

It gave me the steps of my tired feet.

With them I have traversed cities and puddles,

Beaches and deserts, mountains and plains,

And your house, your street and your courtyard.

Violeta Parra



Chapter 1

SIS: An R Package for Sure Independence Screening in Ultrahigh Dimensional Statistical Models

In this chapter, we revisit sure independence screening procedures for variable selection in generalized linear models and the Cox proportional hazards model. Through the publicly available R package SIS, we provide a unified environment to carry out variable selection using iterative sure independence screening (ISIS) and all of its variants. For the regularization steps in the ISIS recruiting process, available penalties include the LASSO, SCAD, and MC+, while the implemented variants for the screening steps are sample splitting, data-driven thresholding, and novel combinations thereof. Performance of these feature selection techniques is investigated by means of real and simulated data sets, where we find considerable improvements in terms of model selection and computational time between our algorithms and traditional penalized pseudo-likelihood methods applied directly to the full set of covariates.


1.1 Introduction

With the remarkable development of modern technology, including computing power and storage, more and more high-dimensional and high-throughput data of unprecedented size and complexity are being generated for contemporary statistical studies. For instance, bioimaging technology has made it possible to collect a huge amount of predictor information such as microarray, proteomic, and SNP data while observing survival information and tumor classification on patients in clinical studies. A common feature of all these examples is that the number of variables p can be potentially much larger than the number of observations n, i.e., the number of gene expression profiles is on the order of tens of thousands while the number of patient samples is on the order of tens or hundreds. By ultrahigh dimensionality, following Fan and Lv (2008), we mean that the dimensionality grows exponentially in the sample size, i.e., log p = O(n^ξ) for some ξ ∈ (0, 1/2). In order to provide more representative and reasonable statistical models, it is typically assumed that only a small fraction of predictors are associated with the outcome. This is the notion of sparsity, which emphasizes the prominent role feature selection techniques play in ultrahigh dimensional statistical modeling.

One popular family of variable selection methods for parametric models is based on the penalized (pseudo-)likelihood approach. Examples include the LASSO (Tibshirani, 1996, 1997), SCAD (Fan and Li, 2001), the elastic net penalty (Zou and Hastie, 2005), the MC+ (Zhang, 2010), and related methods. Nevertheless, in ultrahigh dimensional statistical learning problems, these methods may not perform well due to the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009).

Motivated by these concerns, Fan and Lv (2008) introduced a new framework for variable screening via independent correlation learning that tackles the aforementioned challenges in the context of ultrahigh dimensional linear models. Their proposed sure independence screening (SIS) is a two-stage procedure; first filtering out the features that have weak marginal correlation with the response, effectively reducing the dimensionality p to a moderate scale below the sample size n, and then performing variable selection and parameter estimation simultaneously through a lower dimensional penalized least squares method such as SCAD or LASSO. Under certain regularity conditions, Fan and Lv (2008) showed surprisingly that this fast feature selection method has a "sure screening property"; that is, with probability tending to 1, the independence screening technique retains all of the important features in the model. However, the SIS procedure in Fan and Lv (2008) only covers ordinary linear regression models, and their technical arguments do not extend easily to more general models such as generalized linear models and hazard regression with right-censored times.

In order to enhance finite sample performance, an important methodological extension, iterative sure independence screening (ISIS), was also proposed by Fan and Lv (2008) to handle cases where the regularity conditions may fail, such as when some important predictors are marginally uncorrelated with the response, or the reverse situation where an unimportant predictor has higher marginal correlation than some important features. Roughly speaking, the original ISIS procedure works by iteratively performing variable selection to recruit a small number of predictors, computing residuals based on the model fitted using these recruited predictors, and then using the residuals as the working response variable to continue recruiting new predictors. With the purpose of handling more complex real data, Fan and Song (2010) extended SIS to generalized linear models; and Fan et al. (2009) improved some important steps of the original ISIS procedure, allowing variable deletion in the recruiting process through penalized pseudo-likelihood, while dealing with more general loss based models. In particular, they introduced the concept of conditional marginal regressions and, with the aim of reducing the false discovery rate, proposed two new ISIS variants based on the idea of splitting samples. Other extensions of ISIS include Fan et al. (2010) to the Cox proportional hazards model, and Fan et al. (2011) to nonparametric additive models.

In this chapter, we build on the work of Fan et al. (2009) and Fan et al. (2010) to provide a publicly available package SIS (Fan et al., 2015), implemented in the R statistical software (R Core Team, 2015), extending sure independence screening and all of its variants to generalized linear models and the Cox proportional hazards model. In particular, our codes are able to perform variable selection through the proposed ISIS variants of Fan et al. (2009) and through the data-driven thresholding approach of Fan et al. (2011). Furthermore, we combine these sample splitting and data-driven thresholding ideas to provide two novel feature selection techniques.
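As a preview of the package interface described in Section 1.4.4, the following minimal sketch illustrates how such a variable selection run might look; the argument names (family, penalty, tune, iter) and the output component ix shown here are assumptions made for illustration, and the exact interface is documented in Section 1.4.4 and the package manual.

    # Illustrative sketch only: the argument and output names below are
    # assumptions; see Section 1.4.4 for the documented interface.
    library(SIS)

    set.seed(1)
    n <- 100; p <- 1000
    x <- matrix(rnorm(n * p), n, p)
    y <- rbinom(n, 1, plogis(x[, 1] - x[, 2] + x[, 3]))

    # Hypothetical ISIS run with a SCAD-penalized refit, tuned by BIC.
    fit <- SIS(x, y, family = "binomial", penalty = "SCAD",
               tune = "bic", iter = TRUE)
    fit$ix   # assumed to contain the indices of the selected variables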

Taking advantage of the fast cyclical coordinate descent algorithms developed in the packages glmnet (Friedman et al., 2013) and ncvreg (Breheny, 2013), for convex and nonconvex penalty functions, respectively, we are able to efficiently perform the moderate scale penalized pseudo-likelihood steps from the ISIS procedure, thus yielding variable selection techniques outperforming direct use of glmnet and ncvreg in terms of both computational time and estimation error. Our procedures scale favorably in both n and p, allowing us to expeditiously and accurately solve much larger problems than with previous packages, particularly in the case of nonconvex penalties. We would like to point out that the recent package apple (Yu and Feng, 2014a), using a hybrid of the predictor-corrector method and coordinate descent procedures, provides an alternative for the penalized pseudo-likelihood estimation with nonconvex penalties. In the present work, we limit all numerical results to the use of ncvreg, noting there are other available options to implement the nonconvex variable selection procedures performed by SIS. Similarly, although the package survHD (Bernau et al., 2014) provides an efficient alternative for implementing Cox proportional hazards regression, in the current presentation, we only make use of the survival package (Therneau and Lumley, 2015) to compute conditional marginal regressions and of the glmnet package (Friedman et al., 2013) to fit high-dimensional Cox models.

The remainder of the chapter is organized as follows. In Section 1.2, we describe the vanilla SIS and ISIS variable selection procedures in the context of generalized linear models and the Cox proportional hazards model. Section 1.3 discusses several ISIS variants, as well as important implementation details. Simulation results comparing model selection performance and run time trials are given in Section 1.4, where we also analyze four gene expression data sets and work through an example using our package with one of them. The chapter is concluded with a short discussion in Section 1.5.

1.2 General SIS and ISIS Methodological Framework

Consider the usual generalized linear model (GLM) framework, where we have independent and identically distributed observations {(x_i, y_i) : i = 1, . . . , n} from the population (x, y), where the predictor x = (x_0, x_1, . . . , x_p)^⊤ is a (p + 1)-dimensional random vector with x_0 = 1 and y is the response. We further assume the conditional distribution of y given x is from an exponential family taking the canonical form

f(y; x, \beta) = \exp\{ y\theta - b(\theta) + c(y) \},  (1.1)

where θ = x^⊤β, β = (β_0, β_1, . . . , β_p)^⊤ is a vector of unknown regression parameters and b(·), c(·) are known functions. As we are only interested in modeling the mean regression, the dispersion parameter is assumed known. In virtue of (1.1), inference about the parameter β in the GLM context is made via maximization of the log-likelihood function

\ell(\beta) = \sum_{i=1}^{n} \{ y_i (x_i^\top \beta) - b(x_i^\top \beta) \}.  (1.2)

For the survival analysis framework, the observed data {(x_i, y_i, δ_i) : x_i ∈ R^p, y_i ∈ R^+, δ_i ∈ {0, 1}, i = 1, . . . , n} is an independent and identically distributed random sample from a certain population (x, y, δ). Here, as in the context of linear models, x = (x_1, x_2, . . . , x_p)^⊤ is a p-dimensional random vector of predictors and y, the observed time, is a time of failure if δ = 1, or a right-censored time if δ = 0. Suppose that the sample comprises m distinct uncensored failure times t_1 < t_2 < · · · < t_m. Let (j) denote the individual failing at time t_j and R(t_j) be the risk set just prior to time t_j, that is, R(t_j) = {i : y_i ≥ t_j}. The main problem of interest is to study the relationship between the predictor variables and the failure time, and a common approach is through the Cox proportional hazards model (Cox, 1975). For a vector β = (β_1, β_2, . . . , β_p)^⊤ of unknown regression parameters, the Cox model assumes a semiparametric form of the hazard function

h(t \mid x_i) = h_0(t) e^{x_i^\top \beta},

where h_0(t) is an unknown arbitrary baseline hazard function giving the hazard when x_i = 0. Following the argument in Cox (1975), inference about β is made via maximization of the partial likelihood function

L(\beta) = \prod_{j=1}^{m} \frac{e^{x_{(j)}^\top \beta}}{\sum_{k \in R(t_j)} e^{x_k^\top \beta}},

which is equivalent to maximizing the log-partial likelihood

\ell(\beta) = \sum_{i=1}^{n} \delta_i x_i^\top \beta - \sum_{i=1}^{n} \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_k^\top \beta) \Big\}.  (1.3)

We refer interested readers to Kalbfleisch and Prentice (2002) and references therein for a comprehensive literature review on the Cox proportional hazards model.

For both statistical models, we assume all predictors x_1, . . . , x_p are standardized to have mean zero and standard deviation one. Additionally, although our variable selection procedures within the SIS package also handle the classical p < n setting, we allow the dimensionality of the covariates p to grow much faster than the number of observations n. To be more specific, we assume the ultrahigh dimensionality setup log p = O(n^ξ) for some ξ ∈ (0, 1/2) presented in Fan and Song (2010). What makes statistical inference possible in this "large p, small n" scenario is the sparsity assumption; only a small subset of variables among predictors x_1, . . . , x_p contribute to the response, which implies the parameter vector β is sparse. Therefore, variable selection techniques play a pivotal role in these ultrahigh dimensional statistical models.

1.2.1 SIS and Feature Ranking by Maximum Marginal Likelihood Estimators

Let M_⋆ = {1 ≤ j ≤ p : β^⋆_j ≠ 0} be the true sparse model, where β^⋆ = (β^⋆_0, β^⋆_1, . . . , β^⋆_p)^⊤ denotes the true value of the parameter vector and β^⋆_0 = 0 for the Cox model. In order to carry out the vanilla sure independence screening variable selection procedure, we initially fit marginal versions of models (1.2) and (1.3) with componentwise covariates. The maximum marginal likelihood estimator (MMLE) β̂^M_j, for j = 1, . . . , p, is defined in the GLM context as the maximizer of the componentwise regression

\hat{\beta}_j^M = (\hat{\beta}_{j,0}^M, \hat{\beta}_j^M) = \arg\max_{\beta_0, \beta_j} \sum_{i=1}^{n} \{ y_i (\beta_0 + x_{ij}\beta_j) - b(\beta_0 + x_{ij}\beta_j) \},  (1.4)

where x_i = (x_{i0}, x_{i1}, . . . , x_{ip})^⊤ and x_{i0} = 1. Similarly, for each covariate x_j (1 ≤ j ≤ p), one can define the MMLE for the Cox model as the maximizer of the log-partial likelihood with a single covariate

\hat{\beta}_j^M = \arg\max_{\beta_j} \Big( \sum_{i=1}^{n} \delta_i x_{ij}\beta_j - \sum_{i=1}^{n} \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{kj}\beta_j) \Big\} \Big),  (1.5)

with x_i = (x_{i1}, . . . , x_{ip})^⊤. Both componentwise estimators can be computed very rapidly and implemented modularly, avoiding the numerical instability associated with ultrahigh dimensional estimation problems.
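For concreteness, the componentwise fits in (1.4) and (1.5) can be obtained directly from standard fitting routines; the minimal sketch below does so for a single predictor with glm() in the logistic case and with survival::coxph() in the censored case, on simulated data that serve only as an illustration.

    # Minimal sketch of the componentwise regressions (1.4) and (1.5).
    # The simulated data and object names are illustrative assumptions.
    library(survival)

    set.seed(1)
    n  <- 200
    x1 <- rnorm(n)

    # GLM case (logistic regression): MMLE for predictor x1.
    y        <- rbinom(n, 1, plogis(0.8 * x1))
    beta_glm <- coef(glm(y ~ x1, family = binomial()))[2]

    # Cox case: single-covariate maximizer of the log-partial likelihood.
    time     <- rexp(n, rate = exp(0.8 * x1))
    status   <- rbinom(n, 1, 0.7)        # 1 = observed failure, 0 = censored
    beta_cox <- coef(coxph(Surv(time, status) ~ x1))

    c(GLM = unname(beta_glm), Cox = unname(beta_cox))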

The vanilla SIS procedure then ranks the importance of features according to the magnitude of their marginal regression coefficients, excluding the intercept in the case of GLM. Therefore, we select a set of variables

\widehat{\mathcal{M}}_{\delta_n} = \{ 1 \le j \le p : |\hat{\beta}_j^M| \ge \delta_n \},  (1.6)

where δ_n is a threshold value chosen so that we pick the d top ranked covariates. Typically, one may take d = ⌊n/log n⌋, so that dimensionality is reduced from ultrahigh to below the sample size. As further discussed in Sections 1.3.3 and 1.3.4, the choice of d may also be either data-driven or model-based. Under a mild set of technical conditions, Fan and Song (2010) show the magnitude of these marginal estimators can preserve the nonsparsity information about the joint model with full covariates. In other words, for a given sequence δ_n, the sure screening property

P(\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\delta_n}) \to 1 \quad \text{as } n \to \infty  (1.7)

holds for SIS, effectively reducing the dimensionality of the model from ultrahigh to below the sample size, and solving the aforementioned challenges of computational expediency, statistical accuracy, and algorithmic stability.
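The screening step (1.6) therefore amounts to p componentwise fits followed by keeping the d = ⌊n/log n⌋ largest marginal slopes in absolute value. A minimal sketch for the logistic case, on simulated data used purely for illustration, is given below.

    # Sketch of the vanilla SIS screening step (1.6), logistic regression case.
    # The data-generating model is an illustrative assumption.
    set.seed(2)
    n <- 200; p <- 1000
    x <- scale(matrix(rnorm(n * p), n, p))      # standardized predictors
    y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 2 * x[, 3]))

    # MMLE slope of each componentwise regression, excluding the intercept.
    beta_M <- apply(x, 2, function(xj) coef(glm(y ~ xj, family = binomial()))[2])

    d     <- floor(n / log(n))                  # typical screening model size
    M_scr <- order(abs(beta_M), decreasing = TRUE)[1:d]   # indices kept by SIS
    sort(M_scr)[1:10]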

With features being crudely selected by the intensity of their marginal signals, the index set M̂_{δ_n} may also include a great deal of unimportant predictors. To further improve finite sample performance of vanilla SIS, variable selection and parameter estimation can be simultaneously achieved via penalized likelihood estimation, using the joint information of the covariates in M̂_{δ_n}. Without loss of generality, by reordering the features if necessary, we may assume that x_1, . . . , x_d are the predictors recruited by SIS. We define β_d = (β_0, β_1, . . . , β_d)^⊤ and let x_{i,d} = (x_{i0}, x_{i1}, . . . , x_{id})^⊤ with x_{i0} = 1. Thus, our original problem of estimating the sparse (p + 1)-vector β in the GLM framework (1.2) reduces to estimating a sparse (d + 1)-vector β_d = (β_0, β_1, . . . , β_d)^⊤ based on maximizing the moderate scale penalized likelihood

\hat{\beta}_d = \arg\max_{\beta_d} \sum_{i=1}^{n} \{ y_i (x_{i,d}^\top \beta_d) - b(x_{i,d}^\top \beta_d) \} - \sum_{j=1}^{d} p_\lambda(|\beta_j|).  (1.8)

Likewise, after defining β_d = (β_1, β_2, . . . , β_d)^⊤ and setting x_{i,d} = (x_{i1}, x_{i2}, . . . , x_{id})^⊤ for survival data within the Cox model, the moderate scale version of the penalized log-partial likelihood problem consists in maximizing

\hat{\beta}_d = \arg\max_{\beta_d} \sum_{i=1}^{n} \delta_i x_{i,d}^\top \beta_d - \sum_{i=1}^{n} \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{k,d}^\top \beta_d) \Big\} - \sum_{j=1}^{d} p_\lambda(|\beta_j|).  (1.9)


Here, p_λ(·) is a penalty function and λ > 0 is a regularization parameter which may be selected using the AIC (Akaike, 1973), BIC (Schwarz, 1978) or EBIC (Chen and Chen, 2008) criteria, or through ten-fold cross-validation and the modified cross-validation framework (Feng and Yu, 2013; Yu and Feng, 2014b), for example. Penalty functions whose resulting estimators yield sparse solutions to the maximization problems (1.8) and (1.9) include the LASSO penalty p_λ(|β|) = λ|β| (Tibshirani, 1996); the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), which is a folded-concave quadratic spline with p_λ(0) = 0 and

p'_\lambda(|\beta|) = \lambda \Big\{ 1_{\{|\beta| \le \lambda\}} + \frac{(\gamma\lambda - |\beta|)_+}{(\gamma - 1)\lambda} 1_{\{|\beta| > \lambda\}} \Big\},

for some γ > 2 and |β| > 0; and the minimax concave penalty (MC+), another folded-concave quadratic spline with p_λ(0) = 0 such that

p'_\lambda(|\beta|) = (\lambda - |\beta|/\gamma)_+,

for some γ > 0 and |β| > 0 (Zhang, 2010). For the SCAD and MC+ penalties, the tuning parameter γ is used to adjust the concavity of the penalty. The smaller γ is, the more concave the penalty becomes, which means finding a global minimizer is more difficult; but on the other hand, the resulting estimators overcome the bias introduced by the LASSO penalty.
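The derivative formulas above translate directly into code; the short sketch below evaluates p'_λ(|β|) for the SCAD and MC+ penalties (the function names and the default concavity values shown are ours, chosen only for illustration).

    # Derivatives p'_lambda(|beta|) of the SCAD and MC+ penalties, as given above.
    # Function names and default gamma values are illustrative choices.
    scad_deriv <- function(beta, lambda, gamma = 3.7) {
      ab <- abs(beta)
      lambda * ifelse(ab <= lambda, 1,
                      pmax(gamma * lambda - ab, 0) / ((gamma - 1) * lambda))
    }
    mcplus_deriv <- function(beta, lambda, gamma = 3) {
      pmax(lambda - abs(beta) / gamma, 0)
    }

    b <- seq(0, 4, by = 0.5)
    cbind(beta   = b,
          SCAD   = scad_deriv(b, lambda = 1),
          MCplus = mcplus_deriv(b, lambda = 1))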

Once penalized likelihood estimation is carried out in (1.8) and (1.9), a refined estimate of M_⋆ can be obtained from M̂_1, the index set of the nonzero components of the sparse regression parameter estimator. We summarize this initial screening procedure based on componentwise regressions through Algorithm 1.1 below.

Algorithm 1.1 Vanilla SIS (Van-SIS)

1. Inputs: Screening model size d. Penalty p_λ(·).

2. For every j ∈ {1, . . . , p}, compute the MMLE β̂^M_j from (1.4) or (1.5).

3. Choose a threshold value δ_n in (1.6) such that M̂_{δ_n} consists of the d top ranked covariates.

4. Obtain the parameter estimate β̂_d from the penalized likelihood estimation problems (1.8) or (1.9).

5. Outputs: Parameter estimate β̂_d and the corresponding estimate of the true sparse model M̂_1 = supp{β̂_d}.
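Steps 4 and 5 of Algorithm 1.1 can be sketched by refitting a SCAD-penalized logistic regression on the screened covariates only; the snippet below, which uses cv.ncvreg for the tuning of λ and simulated data purely for illustration, is a minimal sketch rather than the package implementation.

    # Sketch of steps 4-5 of Algorithm 1.1: penalized refit on the screened set.
    # Simulated data; a minimal illustration, not the SIS package implementation.
    library(ncvreg)

    set.seed(3)
    n <- 200; p <- 1000
    x <- scale(matrix(rnorm(n * p), n, p))
    y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 2 * x[, 3]))

    # Screening step, as in (1.6).
    beta_M <- apply(x, 2, function(xj) coef(glm(y ~ xj, family = binomial()))[2])
    d      <- floor(n / log(n))
    keep   <- order(abs(beta_M), decreasing = TRUE)[1:d]

    # Moderate scale penalized likelihood (1.8), with lambda chosen by CV.
    cvfit  <- cv.ncvreg(x[, keep], y, family = "binomial", penalty = "SCAD")
    b_hat  <- coef(cvfit)[-1]                 # slopes at the selected lambda
    M1_hat <- keep[b_hat != 0]                # estimated sparse model
    M1_hat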

1.2.2 Iterative Sure Independence Screening

The technical conditions in Fan and Lv (2008) and Fan and Song (2010) guaranteeing the sure screening property for SIS fail to hold if there is a predictor marginally unrelated, but jointly related, to the response, or if a predictor is jointly uncorrelated with the response but has higher marginal correlation with the response than some important predictors in M_⋆. In the former case, the important predictor cannot be picked up by vanilla SIS, whereas in the latter case, unimportant predictors in M_⋆^c are ranked too high, leaving out features from the true sparse model M_⋆.

To deal with these difficult scenarios in which the SIS methodology breaks down, Fan and Lv (2008) and Fan et al. (2009) proposed iterative sure independence screening based on conditional marginal regressions and feature ranking. The approach seeks to overcome the limitations of SIS, which is based on marginal models only, by making more use of the joint covariate information while retaining computational expediency and stability as in the original SIS.

In its first iteration, the vanilla ISIS variable selection procedure begins with using regular SIS to pick an index set A_1 of size k_1, and then similarly applies a penalized likelihood estimation approach to select a subset M̂_1 of these indices. Note that the screening step only fits componentwise regressions, while the penalized likelihood step solves optimization problems of moderate scale k_1, typically below the sample size n. This is an attractive feature for any variable selection technique applied to ultrahigh dimensional statistical models. After the first iteration, we denote the resulting estimator with nonzero components and indices in M̂_1 by β̂_{M̂_1}.


In an effort to use more fully the joint covariate information, in the second iteration of vanilla ISIS we compute the conditional marginal regression of each predictor j that is not in M̂_1. That is, under the generalized linear model framework, we solve

\arg\max_{\beta_0, \beta_{\widehat{\mathcal{M}}_1}, \beta_j} \sum_{i=1}^{n} \{ y_i (\beta_0 + x_{i,\widehat{\mathcal{M}}_1}^\top \beta_{\widehat{\mathcal{M}}_1} + x_{ij}\beta_j) - b(\beta_0 + x_{i,\widehat{\mathcal{M}}_1}^\top \beta_{\widehat{\mathcal{M}}_1} + x_{ij}\beta_j) \},  (1.10)

whereas under the Cox model, we obtain

\arg\max_{\beta_{\widehat{\mathcal{M}}_1}, \beta_j} \Big( \sum_{i=1}^{n} \delta_i (x_{i,\widehat{\mathcal{M}}_1}^\top \beta_{\widehat{\mathcal{M}}_1} + x_{ij}\beta_j) - \sum_{i=1}^{n} \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{k,\widehat{\mathcal{M}}_1}^\top \beta_{\widehat{\mathcal{M}}_1} + x_{kj}\beta_j) \Big\} \Big)  (1.11)

for each j ∈ {1, . . . , p} \ M̂_1, where x_{i,M̂_1} denotes the sub-vector of x_i with indices in M̂_1, and similarly for β_{M̂_1}. These are again low-dimensional problems which can be solved quite efficiently. Similar to the componentwise regressions (1.4) and (1.5), let β̂^M_j denote the last coordinate of the maximizer in (1.10) and (1.11). The magnitude |β̂^M_j| reflects the additional contribution of the jth predictor given that all covariates with indices in M̂_1 have been included in the model.
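Each conditional marginal regression in (1.10) is a low-dimensional GLM fit that includes the previously selected covariates plus one candidate predictor, with the last coefficient measuring that predictor's additional contribution. A minimal sketch, with simulated data and a pretend selected set standing in for M̂_1, follows.

    # Sketch of the conditional marginal regressions (1.10): refit with the
    # selected covariates plus each remaining x_j, keeping |beta_j|.
    # Simulated data and the pretend selected set are illustrative assumptions.
    set.seed(4)
    n <- 200; p <- 500
    x <- scale(matrix(rnorm(n * p), n, p))
    y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2]))

    M1_hat     <- c(1, 2)                     # pretend output of the first ISIS step
    candidates <- setdiff(1:p, M1_hat)

    cond_coef <- sapply(candidates, function(j) {
      fit <- glm(y ~ x[, M1_hat] + x[, j], family = binomial())
      abs(tail(coef(fit), 1))                 # last coordinate: contribution of x_j
    })

    k2 <- 5
    A2 <- candidates[order(cond_coef, decreasing = TRUE)[1:k2]]
    A2                                        # top-ranked conditional contributions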

Once the conditional marginal regressions have been computed for each predictor not in M̂_1, we perform conditional feature ranking by ordering {|β̂^M_j| : j ∈ M̂_1^c} and forming an index set A_2 of size k_2, say, consisting of the indices with the top ranked elements. After this screening step, under the GLM framework, we maximize the moderate scale penalized likelihood

\sum_{i=1}^{n} \{ y_i (\beta_0 + x_{i,\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}) - b(\beta_0 + x_{i,\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}) \} - \sum_{j \in \widehat{\mathcal{M}}_1 \cup \mathcal{A}_2} p_\lambda(|\beta_j|)  (1.12)

with respect to β_{M̂_1 ∪ A_2} to get a sparse estimator β̂_{M̂_1 ∪ A_2}. Similarly, in the Cox model, we obtain a sparse estimator by maximizing the moderate scale penalized log-partial likelihood

\sum_{i=1}^{n} \delta_i (x_{i,\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}) - \sum_{i=1}^{n} \delta_i \log\Big\{ \sum_{k \in R(y_i)} \exp(x_{k,\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}^\top \beta_{\widehat{\mathcal{M}}_1 \cup \mathcal{A}_2}) \Big\} - \sum_{j \in \widehat{\mathcal{M}}_1 \cup \mathcal{A}_2} p_\lambda(|\beta_j|).  (1.13)

The indices of the nonzero coefficients of β̂_{M̂_1 ∪ A_2} provide an updated estimate M̂_2 of the true sparse model M_⋆.

In the screening step above, an alternative approach is to substitute the fitted regression parameter β̂_{M̂_1} from the first step of vanilla ISIS into problems (1.10) and (1.11), so that the optimization problems become again componentwise regressions. This approach is exactly an extension of the original ISIS proposal of Fan and Lv (2008) to generalized linear models and the Cox proportional hazards model. Even for the ordinary linear model, the conditional contributions of predictors are more relevant for variable selection in the second ISIS iteration than the original proposal of using the residuals r_i = y_i − x_{i,M̂_1}^⊤ β̂_{M̂_1} as the new response (see Fan et al., 2009). Furthermore, the penalized likelihood steps (1.12) and (1.13) allow the procedure to delete some features {x_j : j ∈ M̂_1} that were previously selected. This is also an important deviation from Fan and Lv (2008), as their original ISIS procedure does not contemplate intermediate deletion steps.

Lastly, the vanilla ISIS procedure, which iteratively recruits and deletes predictors, can be repeated until some convergence criterion is reached. We adopt the criterion of having an index set M̂_l which either has the prescribed size d, or satisfies M̂_l = M̂_j for some j < l. In order to ensure that iterated SIS takes at least two iterations to terminate, in our implementation we fix k_1 = ⌊2d/3⌋, and thereafter at the lth iteration we set k_l = d − |M̂_{l−1}|. A step-by-step description of the vanilla ISIS procedure is provided in Algorithm 1.2.

Algorithm 1.2 Vanilla ISIS (Van-ISIS)

1. Inputs: Screening model size d. Penalty p_λ(·). Maximum iteration number l_max.

2. For every j ∈ {1, . . . , p}, compute the MMLE β̂^M_j from problems (1.4) or (1.5). Select the k_1 top ranked covariates to form the index set A_1.

3. Apply penalized likelihood estimation on the set A_1 to obtain a subset of indices M̂_1.

4. For every j ∈ M̂_1^c, solve the conditional marginal regression problem (1.10) or (1.11) and sort {|β̂^M_j| : j ∈ M̂_1^c}. Form the index set A_2 with the k_2 top ranked covariates, and apply penalized likelihood estimation on M̂_1 ∪ A_2 as in (1.12) or (1.13) to obtain a new index set M̂_2.

5. Iterate the process in step 4 until we have an index set M̂_l such that |M̂_l| = d or M̂_l = M̂_j for some j < l or l = l_max.

6. Outputs: M̂_l from step 5 along with the parameter estimate from (1.12) or (1.13).
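The recruiting sizes and the stopping rule of Algorithm 1.2 are simple to track in code. In the toy sketch below the actual screening and penalized likelihood steps are replaced by random stand-ins, purely to illustrate the bookkeeping k_1 = ⌊2d/3⌋, k_l = d − |M̂_{l−1}|, and the three stopping conditions.

    # Toy sketch of the recruiting sizes and stopping rule in Algorithm 1.2.
    # The screening and penalized likelihood steps are replaced by random
    # stand-ins; only the bookkeeping is meant to be illustrative.
    set.seed(5)
    p <- 1000; n <- 200; l_max <- 10
    d      <- floor(n / log(n))
    k      <- floor(2 * d / 3)                # k_1 = floor(2d/3)
    M_old  <- integer(0)
    M_list <- list()

    for (l in 1:l_max) {
      A_l   <- sample(setdiff(1:p, M_old), k)                  # stand-in screening step
      pool  <- union(M_old, A_l)
      M_new <- sort(sample(pool, min(length(pool), d %/% 2)))  # stand-in penalized fit
      M_list[[l]] <- M_new
      stop_now <- length(M_new) == d || l == l_max ||
        any(vapply(M_list[seq_len(l - 1)], identical, logical(1), y = M_new))
      if (stop_now) break
      k     <- d - length(M_new)                               # k_l = d - |M_{l-1}|
      M_old <- M_new
    }
    length(M_new)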

We conclude this section by providing a simple overview of the main features of the vanilla SIS and ISIS procedures for applied practitioners. In the ultrahigh dimensional statistical model setting where p ≫ n, and even in the classical p < n setting with p > 30, variable screening is an essential tool in helping eliminate irrelevant predictors while reducing data gathering and storage requirements. The vanilla SIS procedure given in Algorithm 1.1 provides an extremely fast and efficient variable screening based on marginal regressions of each predictor with the response. While under certain independence assumptions among predictors this may prove a successful strategy in terms of estimating the true sparse model M_⋆, there are well-known issues associated with variable screening using only information from marginal regressions, such as missing important predictors from M_⋆ which happen to have low marginal correlation with the response. The vanilla ISIS procedure addresses these issues by using more thoroughly the joint covariate information through the conditional marginal regressions (1.10) and (1.11), which aim at measuring the additional contribution of a predictor x_j given the presence of the variables in M̂_1 in the current model, all while maintaining low computational costs. Finally, we would like to point out that the intermediate deletion steps of the vanilla ISIS procedure could be carried out with any other variable selection methods, such as the weight vector ranking with support vector machines (Rakotomamonjy, 2003) or the greedy search strategies of forward selection and backward elimination (Guyon and Elisseeff, 2003). In our implementation within the SIS package we favor the penalized likelihood criteria (1.12) and (1.13), but in principle any variable selection technique could be employed to further filter unimportant predictors.

1.3 Variants of ISIS

By nature of their marginal approach, sure independence screening procedures have large false selection rates; namely, many unimportant predictors in M_⋆^c are selected after the screening steps. In order to reduce the false selection rate, Fan et al. (2009) suggested the idea of sample splitting. Without loss of generality, we assume the sample size n is even, and we randomly split the sample into two halves. Two variants of the ISIS methodology have been proposed in the literature, both of them relying on the idea of performing variable screening on the data in each partition separately and combining the results in a subsequent penalized likelihood step. We also revisit the approach of Fan et al. (2011), in which a random permutation of the observations is used to obtain a data-driven threshold for independence screening.

1.3.1 First Variant of ISIS

Let A_1^{(1)} and A_1^{(2)} be the two sets of indices, each of size k_1, obtained after applying regular SIS to the data in each partition. As the first crude estimates of the true sparse model M_⋆, both of them should have large false selection rates. Yet, under the technical conditions given in Fan and Song (2010), the estimates should satisfy

P(\mathcal{M}_\star \subset \mathcal{A}_1^{(j)}) \to 1 \quad \text{as } n \to \infty

for j = 1, 2. That is, important features should appear in both sets with probability tending to one. If we define A_1 = A_1^{(1)} ∩ A_1^{(2)} as a new estimate of M_⋆, this new index set must also satisfy

P(\mathcal{M}_\star \subset \mathcal{A}_1) \to 1 \quad \text{as } n \to \infty.

However, by construction, the number of unimportant predictors in A_1 should be much smaller, thus reducing the false selection rate. The reason is that, in order for an unimportant predictor to appear in A_1, it has to be included in both sets A_1^{(1)} and A_1^{(2)} randomly.

In their theoretical support for this variant based on random splitting, Fan et al. (2009) provided a non-asymptotic upper bound for the probability of the event that m unimportant covariates are included in the intersection A_1. The probability bound, obtained under an exchangeability condition ensuring that each unimportant feature is equally likely to be recruited by SIS, is decreasing in the dimensionality, showing an apparent "blessing of dimensionality". This is only part of the full story, since, as pointed out in Fan et al. (2009), the probability of missing important predictors from the true sparse model M_⋆ is expected to increase with p. However, as we show in our simulation settings of Section 1.4.1, and further confirm in the real data analysis of Section 1.4.3, the procedure is quite effective at obtaining a minimal set of features that should be included in a final model.

The remainder of the first variant of ISIS proceeds as follows. After forming the intersection A_1 = A_1^{(1)} ∩ A_1^{(2)}, we perform penalized likelihood estimation as in Algorithm 1.2 to obtain a refined approximation M̂_1 to the true sparse model. We then perform the second iteration of the vanilla ISIS procedure, computing conditional marginal regressions on each partition separately to obtain sets of indices A_2^{(1)} and A_2^{(2)}, each of size k_2. After taking the intersection A_2 = A_2^{(1)} ∩ A_2^{(2)} of these two sets, we carry out sparse penalized likelihood estimation as in (1.12) and (1.13), obtaining a second approximation M̂_2 to the true model M_⋆. As before, the iteration continues until we have an index set M̂_l of size d, or satisfying M̂_l = M̂_j for some j < l.
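The screening stage of this first variant reduces to screening each half-sample separately and intersecting the two recruited index sets; a minimal sketch using marginal correlation learning (cf. Section 1.3.4) on simulated linear-model data is shown below.

    # Sketch of the first-variant screening step: screen each random half of the
    # sample separately and intersect the recruited index sets. Illustrative data.
    set.seed(6)
    n <- 200; p <- 1000; k1 <- 30
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1] - x[, 2] + 2 * x[, 3] + rnorm(n)

    half1 <- sample(1:n, n / 2)
    half2 <- setdiff(1:n, half1)

    screen_top <- function(rows, k) {
      r <- abs(cor(x[rows, ], y[rows]))       # marginal correlation learning
      order(r, decreasing = TRUE)[1:k]
    }

    A1 <- intersect(screen_top(half1, k1), screen_top(half2, k1))
    A1                                        # typically much smaller than k1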

1.3.2 Second Variant of ISIS

The variable selection performed by the first variant of ISIS could potentially lead to a very aggressive screening of predictors, reducing the overall false selection rate, but sometimes yielding undesirably minimal model sizes. The second variant of ISIS is a more conservative variable selection procedure, where we again apply regular SIS to each partition separately, but we now choose larger sets of indices A_1^{(1)} and A_1^{(2)} to ensure that their intersection A_1 = A_1^{(1)} ∩ A_1^{(2)} has k_1 elements. In this way, the second variant guarantees that there are at least k_1 predictors included before the penalized likelihood step, making it considerably less aggressive than the first variant.

After applying penalized likelihood to the predictors with indices in A_1, we obtain a first estimate M̂_1 of the true sparse model. The second iteration computes conditional marginal regressions on each partition separately, recruiting enough features in index sets A_2^{(1)} and A_2^{(2)} to ensure that A_2 = A_2^{(1)} ∩ A_2^{(2)} has k_2 elements. Penalized likelihood, as outlined in Section 1.2.2, is now applied to M̂_1 ∪ A_2, yielding a second estimate M̂_2 of the true model M_⋆. Stopping criteria remain the same as in the first variant.
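In the second variant the two half-sample rankings are enlarged jointly until their intersection reaches the prescribed size k_1; one way to sketch this, continuing the same toy correlation-screening setup, is the following.

    # Sketch of the second-variant screening step: enlarge the per-half rankings
    # until their intersection contains at least k1 indices. Illustrative data.
    set.seed(7)
    n <- 200; p <- 1000; k1 <- 30
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1] - x[, 2] + 2 * x[, 3] + rnorm(n)

    half1 <- sample(1:n, n / 2)
    half2 <- setdiff(1:n, half1)
    rank1 <- order(abs(cor(x[half1, ], y[half1])), decreasing = TRUE)
    rank2 <- order(abs(cor(x[half2, ], y[half2])), decreasing = TRUE)

    size <- k1
    repeat {
      A1 <- intersect(rank1[1:size], rank2[1:size])
      if (length(A1) >= k1 || size == p) break
      size <- size + 1                        # recruit more until |A1| = k1
    }
    c(size = size, selected = length(A1))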

1.3.3 Data-driven Thresholding

Motivated by a false discovery rate criterion in Fan et al. (2011), the following variant of ISIS determines a data-driven threshold for marginal screening. Given GLM data of the form {(x_i, y_i) : i = 1, . . . , n}, a random permutation π of {1, . . . , n} is used to decouple x_i and y_i so that the resulting data {(x_{π(i)}, y_i) : i = 1, . . . , n} follow a null model, i.e., a model in which the features have no prediction power over the response. For the newly permuted data, we recalculate marginal regression coefficients (β̂^M_j)^∗ for each predictor j, with j = 1, . . . , p.

The motivation behind this approach is that whenever j is the index of an im-

portant predictor in M?, the MMLE |βMj | given in (1.4) should be larger than most

of |(βMj )∗|, as the random permutation is meant to eliminate the prediction power of

features. For a given q ∈ [0, 1], let ω(q) be the qth quantile of |(βMj )∗| : j = 1, . . . , p.

The data-driven threshold allows only a 1−q proportion of inactive variables to enter

the model when x and y are not related (the null model). Thus, the initial index set


Algorithm 1.3 Permutation-based ISIS (Perm-ISIS)

1. Inputs: Screening model size d. Penalty p_λ(·). Quantile q. Maximum iteration number l_max.

2. For every j ∈ {1, . . . , p}, compute the MMLE β_j^M. Obtain the randomly permuted data {(x_π(i), y_i) : i = 1, . . . , n}, and let ω(q) be the qth quantile of {|(β_j^M)*| : j = 1, . . . , p}, where (β_j^M)* is the second coordinate of the solution to

       argmax_{β_0, β_j} Σ_{i=1}^n [ y_i (β_0 + x_{π(i)j} β_j) − b(β_0 + x_{π(i)j} β_j) ].

   Select the variables in the index set A1 = {1 ≤ j ≤ p : |β_j^M| ≥ ω(q)}.

3. Apply penalized likelihood estimation on the set A1 to obtain a subset of indices M1.

4. For every j ∈ M1^c, solve the conditional marginal regression problem (1.10) and obtain {|β_j^M| : j ∈ M1^c}. By randomly permuting only the variables in M1^c, let ω(q) be the qth quantile of {|(β_j^M)*| : j ∈ M1^c}, where (β_j^M)* is the last coordinate of the solution to

       argmax_{β_0, β_{M1}, β_j} Σ_{i=1}^n [ y_i (β_0 + x_{i,M1}′ β_{M1} + x_{π(i)j} β_j) − b(β_0 + x_{i,M1}′ β_{M1} + x_{π(i)j} β_j) ].

   Select the variables in the index set A2 = {j ∈ M1^c : |β_j^M| ≥ ω(q)}, and apply penalized likelihood estimation on M1 ∪ A2 to obtain a new subset M2.

5. Iterate the process in step 4 until we have an index set Ml such that |Ml| ≥ d, or Ml = Mj for some j < l, or l = l_max.

6. Outputs: Ml from step 5 along with the parameter estimate from (1.12) or (1.13).

Thus, the initial index set A1 is defined as A1 = {1 ≤ j ≤ p : |β_j^M| ≥ ω(q)}, and the modified ISIS iteration then carries out penalized likelihood estimation in the usual way to obtain a finer approximation M1 of the true sparse model M*. The


complete procedure for this variant is detailed in Algorithm 1.3 above.

A greedy modification of the above algorithm can be proposed to enhance variable selection performance. Specifically, we restrict the size of the sets Al in the iterative screening steps to be at most p0, a small positive integer. Moreover, a completely analogous algorithm can be proposed for survival data, with the permutation π and data-driven threshold ω(q) defined accordingly. Details of this procedure are omitted from the current presentation.
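As a concrete illustration of the data-driven threshold, the following minimal R sketch (an illustration only, not the SIS package's internal code) uses marginal correlations as the screening statistic, permutes the response to mimic the null model, and keeps the predictors whose observed statistic exceeds the qth quantile of the permuted statistics.

perm_screening <- function(x, y, q = 1) {
  # marginal statistics on the observed data
  obs_stats <- abs(cor(x, y))
  # decouple x and y through a random permutation (null model)
  null_stats <- abs(cor(x, y[sample(length(y))]))
  # data-driven threshold omega(q): qth quantile of the null statistics
  omega_q <- quantile(null_stats, probs = q)
  which(obs_stats >= omega_q)
}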

1.3.4 Implementation Details

There are several important details in the implementation of the vanilla versions of

SIS and ISIS, as well as all of the above mentioned variants.

• In order to speed up computations, and exclusively for the first screening step,

all variable selection procedures use correlation learning (i.e., marginal Pearson

correlations) between each predictor and the response, instead of the compo-

nentwise GLM or partial likelihood fits (1.4) and (1.5). We found no major

differences in variable selection performance between this variant and the one

using the MMLEs.

• Although the asymptotic theory of Fan and Song (2010) guarantees the sure screening property (1.7) for a sequence δ_n ∼ n^{−θ*}, with θ* > 0 a fixed constant, in practice Fan et al. (2009) recommend using d = ⌊n/log n⌋ as a sensible choice since θ* is typically unknown. Furthermore, Fan et al. (2009) also advocate using model-based choices of d, favoring smaller values in models where the response provides less information. In our numerical implementation we use d = ⌊n/(4 log n)⌋ for logistic regression and d = ⌊n/(2 log n)⌋ for Poisson regression, whose responses are less informative than the real-valued response in a linear model, for which we select d = ⌊n/log n⌋. For the right-censored response in the survival analysis framework, we fix d = ⌊n/(4 log n)⌋ (see the short sketch after this list).


• Regardless of the statistical model at hand, we set d = ⌊n/log n⌋ for the first variant of ISIS. Note that since the selected variables for this first variant are in the intersection of two sets of size k_l ≤ d, we typically end up with far fewer than d variables selected by this method. In any case, our procedures within the SIS package allow for a user-specified value of d.

• Variable selection under the traditional p < n setting can also be carried out using our screening procedures, for which we fix d = p as the default for all variants. In this regard, practicing data analysts can view sure independence screening procedures alongside classical criteria for variable selection such as the AIC (Akaike, 1973), BIC (Schwarz, 1978), Cp (Mallows, 1973) or the generalized information criterion GIC (Nishii, 1984) applied directly to the full set of covariates.

• The intermediate penalized likelihood problems (1.12) and (1.13) are solved

using the glmnet and ncvreg packages. Our code has an option allowing the

regularization parameter λ > 0 to be selected through the AIC, BIC, EBIC or

K-fold cross-validation criteria. The concavity parameter γ in the SCAD and

MC+ penalties can also be user-specified.

• In our permutation-based variant with data-driven thresholding, we use q = 1 (i.e., we take ω(q) to be the maximum absolute value of the permuted estimates) and take p0 to be 1 in the greedy modification. Note that if none of the variables is recruited, that is, if no variable exceeds the threshold obtained under the null model, the stopping criterion in step 5 terminates the procedure.

• We can further combine the permutation-based approach of Section 1.3.3 with the sample splitting idea from the first two variants to define a new ISIS procedure. Concretely, we first select two subsets of indices A1^(1) and A1^(2), each consisting of variables whose MMLEs, or correlations with the response, exceed the data-driven thresholds ω^(1)(q) and ω^(2)(q) of their respective samples. If the size of their intersection is less than k1, we define A1 = A1^(1) ∩ A1^(2); otherwise, we reduce the size of A1^(1) and A1^(2) to ensure their intersection has at most k1 elements. This is done to control the size of A1 when too many variables exceed the thresholds. The rest of the iteration is carried out accordingly.

• Every variant of ISIS is coded to guarantee there will be at least two predictors

at the end of the first screening step.
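The model-based defaults for the screening size d mentioned above can be written down directly; the helper below is a small illustration of those choices (hypothetical code, not the internal implementation behind the nsis argument).

default_nsis <- function(n, family = c("gaussian", "binomial", "poisson", "cox")) {
  family <- match.arg(family)
  switch(family,
         gaussian = floor(n / log(n)),          # linear regression
         binomial = floor(n / (4 * log(n))),    # logistic regression
         poisson  = floor(n / (2 * log(n))),    # Poisson regression
         cox      = floor(n / (4 * log(n))))    # right-censored survival data
}
default_nsis(400, "binomial")   # 16 for the (n, p) = (400, 5000) settings below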

As the proposed ISIS variants grow more involved, the associated number of tuning

parameters is bound to increase. While this may initially make some data practition-

ers feel uneasy, our intent here is to be as flexible as possible, providing all available

tools that the powerful family of sure independence screening procedures has to offer.

Additionally, it is important to clarify that there exist tuning parameters inherent to

the screening procedures, which are fundamentally different from tuning parameters

(e.g., driven by a K-fold cross-validation approach) needed in the intermediate penal-

ized likelihood procedures. In any case, we detail all available options implemented in

our SIS package in Table 1.1 below, where we highlight recommended default settings

for practicing researchers.

1.4 Model Selection and Timings

In this section we illustrate all independence screening procedures by studying their

performance on simulated data and on four popular gene expression data sets. Most

of the simulation settings are adapted from the work of Fan et al. (2009) and Fan

et al. (2010).

1.4.1 Model Selection and Statistical Accuracy

We first conduct simulation studies comparing the runtime of the vanilla version of

SIS (Van-SIS), its iterated vanilla version (Van-ISIS), the first variant (Var1-ISIS), the

second variant (Var2-ISIS), the permutation-based ISIS (Perm-ISIS), its greedy modification (Perm-g-ISIS), the permutation-based variant with sample splitting (Perm-var-ISIS), and its greedy modification (Perm-var-g-ISIS), under both generalized linear models and the Cox proportional hazards model.

Parameter | Description | SIS package options
pλ(·) | Penalty function for the intermediate penalized likelihood estimation | penalty = "SCAD" (default) / = "MCP" / = "lasso"
λ | Method for tuning the regularization parameter of the penalty function pλ(·) | tune = "bic" (default) / = "ebic" / = "aic" / = "cv"
γ | Concavity parameter for the SCAD and MC+ penalties | concavity.parameter = 3.7 / = 3 are the defaults for SCAD/MC+; any γ > 2 for SCAD or γ > 1 for MC+ can be user-specified
d | Upper bound on the number of predictors to be selected | nsis; the default is model-based as explained in Section 1.3.4, and it can also be user-specified
ISIS variant | Flags which ISIS version to perform | varISIS = "vanilla" (default) / = "aggr" / = "cons"
Permutation variant | Flags whether to use the permutation variant with data-driven thresholds | perm = FALSE (default) / = TRUE
q | Quantile used in calculating the data-driven thresholds | q = 1 (default) / can be any user-specified value in [0, 1]
p0 | Maximum size of the active sets in the greedy modification | greedy.size = 1 (default) / can be any user-specified integer less than p

Table 1.1: Summary of tuning parameters for variable selection using ISIS procedures within the SIS package, as well as associated defaults. All ISIS variants are implemented through the SIS function, which we describe in Section 1.4.4 using a gene expression data set.


We also demonstrate the power of

ISIS and its variants, in terms of model selection and estimation, by comparing them

with traditional LASSO and SCAD penalized estimation. Our SIS code is tested

against the competing R packages glmnet (Friedman et al., 2013) and ncvreg (Bre-

heny, 2013) for LASSO and SCAD penalization, respectively. All calculations were

carried out on an Intel Xeon L5430 @ 2.66 GHz processor. We refer interested readers

to Friedman et al. (2010), Breheny and Huang (2011), and Simon et al. (2011) for

a detailed discussion of penalized likelihood estimation algorithms under generalized

linear models and the Cox proportional hazards model.

Four different statistical models are considered here: Linear regression, Logistic

regression, Poisson regression, and Cox proportional hazards regression. We select the

configuration (n, p) = (400, 5000) for all models, and generate covariates x1, . . . , xp

as follows:

Case 1: x1, . . . , xp are independent and identically distributed N(0, 1) random variables.

Case 2: x1, . . . , xp are jointly Gaussian, marginally distributed as N(0, 1), and with correlation structure Corr(xi, xj) = 1/2 if i ≠ j.

Case 3: x1, . . . , xp are jointly Gaussian, marginally distributed as N(0, 1), and with correlation structure Corr(xi, x4) = 1/√2 for all i ≠ 4 and Corr(xi, xj) = 1/2 if i and j are distinct elements of {1, . . . , p} \ {4}.

Case 4: x1, . . . , xp are jointly Gaussian, marginally distributed as N(0, 1), and with correlation structure Corr(xi, x5) = 0 for all i ≠ 5, Corr(xi, x4) = 1/√2 for all i ∉ {4, 5}, and Corr(xi, xj) = 1/2 if i and j are distinct elements of {1, . . . , p} \ {4, 5}.


With independent features, Case 1 is the most straightforward for variable selection. In Cases 2–4, however, the predictors are correlated, and Corr(xi, xj) does not decay as |i − j| increases. We will see below that for Cases 3 and 4, the true sparse model M* is chosen such that the response is marginally independent of, but jointly dependent on, x4. This type of dependence is ruled out in the asymptotic theory of SIS in Fan and Song (2010), so we should expect variable selection in these settings to be more challenging for the non-iterated version of SIS.

In the Cox proportional hazards scenario, the right-censoring time is generated

from the exponential distribution with mean 10. This corresponds to fixing the base-

line hazard function h0(t) = 0.1 for all t ≥ 0. The true regression coefficients from

the sparse model M* in each of the four settings are as follows:

Case 1: β*_0 = 0, β*_1 = −1.5140, β*_2 = 1.2799, β*_3 = −1.5307, β*_4 = 1.5164, β*_5 = −1.3019, β*_6 = 1.5833, and β*_j = 0 for j > 6.

Case 2: The coefficients are the same as in Case 1.

Case 3: β*_0 = 0, β*_1 = 0.6, β*_2 = 0.6, β*_3 = 0.6, β*_4 = −0.9√2, and β*_j = 0 for j > 4.

Case 4: β*_1 = 4, β*_2 = 4, β*_3 = 4, β*_4 = −6√2, β*_5 = 4/3, and β*_j = 0 for j > 5. The corresponding median censoring rate is 33.5%.

For Cases 1 and 2, the coefficients were randomly generated as (4 log n/√n + |Z|/4)U, with Z ∼ N(0, 1) and U = 1 with probability 0.5 and −1 with probability 0.5, independently of the value of Z. For Cases 3 and 4, the selected model ensures that even though β*_4 ≠ 0, the associated predictor x4 and the response y are marginally independent. This is designed to make it challenging for the vanilla sure independence screening procedure to select this important variable. Furthermore, in Case 4, we add another important predictor x5 with a small coefficient to make it even more challenging to identify the true sparse model.
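One simple way to generate covariates with the Case 3 correlation structure, and to verify the marginal independence of x4, is through a common latent factor; the following R sketch (illustrative only, with a reduced p, and not necessarily how the simulations were actually coded) uses x_j = (z_j + w)/√2 for j ≠ 4 and x_4 = w, with w and the z_j independent standard normals.

set.seed(1)
n <- 400; p <- 200                      # reduced p for illustration
w <- rnorm(n)
z <- matrix(rnorm(n * p), n, p)
x <- (z + w) / sqrt(2)                  # Corr(x_i, x_j) = 1/2 for i, j != 4
x[, 4] <- w                             # Corr(x_i, x_4) = 1/sqrt(2) for i != 4
beta <- c(0.6, 0.6, 0.6, -0.9 * sqrt(2), rep(0, p - 4))
eta <- drop(x %*% beta)                 # linear predictor of the true model
cor(x[, 4], eta)                        # near zero: x4 is marginally uncorrelated
                                        # with the signal despite beta_4 != 0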


The results are given in Tables 1.2–1.5, in which the median and robust estimate

of the standard deviation (over 100 repetitions) of several performance measures are

reported: ℓ1-estimation error, squared ℓ2-estimation error, true positives (TP), false

positives (FP), and computational time in seconds (Time). In Cases 1 and 2, under

the Linear and Logistic regression setups, for any type of SIS or ISIS, we employ

the SCAD penalty (γ = 3.7) at the end of the screening steps; whereas LASSO is

applied for Cases 3 and 4, under the Poisson and Cox proportional hazards regression

frameworks. For simplicity, we exclude the performance of MC+ based screening

procedures in the current analysis. Whenever necessary, for all variable selection

procedures considered here, the BIC criterion is used as a fast way to select the

regularization parameter λ > 0, always chosen from a path of 100 candidate λ values.

As the covariates are all independent in Case 1, it is not surprising to see that Van-

SIS performs reasonably well. However, this non-iterative procedure fails in terms of

identifying the true model when correlation is present, particularly in the challenging

Cases 3 and 4. When predictors are dependent, the vanilla ISIS improves significantly

over SIS in terms of true positives. While the number of false positive variables may

be larger in some settings, Van-ISIS provides comparable estimation errors in Cases

1–3 but a significant reduction in the complicated Case 4.

In terms of further reducing the false selection rate and estimation errors, while

still selecting the true model M*, Var1-ISIS performs much better than Var2-ISIS.

Being a more conservative variable selection approach, Var2-ISIS tends to have a

higher number of false positives. This is particularly true in the Poisson regression

scenario, in which the second variant even misses one important predictor.

From the permutation-based variants, the ones that combine the sample splitting

approach (Perm-var-ISIS and Perm-var-g-ISIS) outperform all other ISIS procedures

in terms of true positives, low false selection rates, and small estimation errors, with

Var1-ISIS following closely. In particular, for Perm-var-ISIS, the number of false

positives is approximately zero for all examples. The only drawback seems to be their relatively large computational cost, being at least twice as large as that of Var1-ISIS.

Method   ‖β − β*‖1   ‖β − β*‖2²   TP   FP   Time

Van-SIS 0.24(0.10) 0.01(0.01) 6(0.00) 0(0.00) 0.26(0.02)

Van-ISIS 0.24(0.09) 0.01(0.01) 6(0.00) 0(0.00) 8.34(0.78)

Var1-ISIS 0.29(0.15) 0.02(0.02) 6(0.00) 0(0.74) 11.76(8.65)

Var2-ISIS 0.24(0.10) 0.01(0.01) 6(0.00) 0(0.00) 11.90(1.12)

Perm-ISIS 0.41(0.25) 0.05(0.05) 6(0.00) 1(1.49) 44.57(13.38)

Perm-g-ISIS 0.39(0.27) 0.04(0.05) 6(0.00) 1(1.49) 107.50(22.99)

Perm-var-ISIS 0.24(0.09) 0.01(0.01) 6(0.00) 0(0.00) 41.64(1.01)

Perm-var-g-ISIS 0.24(0.09) 0.01(0.01) 6(0.00) 0(0.00) 82.81(15.80)

SCAD 0.24(0.09) 0.01(0.01) 6(0.00) 0(0.00) 6.89(5.56)

Table 1.2: Linear regression, Case 1, where results are given in the form of medians

and robust standard deviations (in parentheses).

Method   ‖β − β*‖1   ‖β − β*‖2²   TP   FP   Time

Van-SIS 2.79(2.20) 2.02(2.55) 5.5(0.75) 0(0.75) 0.36(0.05)

Van-ISIS 4.06(7.78) 2.77(7.35) 6(0.00) 2(7.46) 48.73(11.72)

Var1-ISIS 1.86(1.25) 0.79(1.07) 6(0.00) 0(0.75) 76.47(20.56)

Var2-ISIS 5.29(7.99) 4.43(7.75) 6(0.00) 5(6.72) 96.25(59.99)

Perm-ISIS 2.94(8.80) 1.87(9.03) 6(0.00) 2(7.46) 128.94(42.74)

Perm-g-ISIS 18.65(6.10) 23.38(14.40) 6(0.00) 10(0.93) 739.84(89.96)

Perm-var-ISIS 1.55(1.38) 0.53(1.43) 6(0.75) 0(0.00) 153.53(43.84)

Perm-var-g-ISIS 1.64(1.37) 0.63(1.18) 6(0.00) 0(0.75) 251.21(63.66)

SCAD 409.22(104.33) 8403.18(4844.63) 6(0.00) 20(2.24) 304.98(66.65)

Table 1.3: Logistic regression, Case 2, where results are given in the form of medians

and robust standard deviations (in parentheses).


Method   ‖β − β*‖1   ‖β − β*‖2²   TP   FP   Time

Van-SIS 3.10(0.55) 2.08(0.26) 3(0.00) 9.5(18.66) 0.14(0.04)

Van-ISIS 5.21(0.79) 2.21(2.20) 4(0.75) 29(0.00 ) 86.23(64.17)

Var1-ISIS 0.53(0.39) 0.09(0.11) 4(0.00) 1(0.75) 88.05(28.02)

Var2-ISIS 5.05(0.67) 2.15(0.22) 3(0.75) 29(0.75) 207.67(100.74)

Perm-ISIS 6.16(1.40) 6.56(5.54) 3(0.00) 30(0.75) 130.93(221.45)

Perm-g-ISIS 6.76(0.77) 1.70(0.74) 4(0.00) 29(0.00) 2202.81(136.08)

Perm-var-ISIS 0.26(0.21) 0.02(0.04) 4(0.00) 0(0.75) 174.90(42.50)

Perm-var-g-ISIS 0.43(0.36) 0.07(0.08) 4(0.00) 1(1.49) 231.81(89.62)

LASSO 2.97(0.07) 2.14(0.18) 3(0.00) 20(9.70) 1.47(0.36)

Table 1.4: Poisson regression, Case 3, where results are given in the form of medians

and robust standard deviations (in parentheses).

Method   ‖β − β*‖1   ‖β − β*‖2²   TP   FP   Time

Van-SIS 21.27(0.49) 95.64(3.63) 3(0.75) 12(0.75) 0.15(0.04)

Van-ISIS 3.13(1.20) 1.02(1.35) 5(0.00) 11(0.00) 92.29(45.61)

Var1-ISIS 1.33(0.62) 0.38(0.42) 5(0.00) 2(2.24) 207.16(45.69)

Var2-ISIS 2.80(1.12) 0.93(1.10) 5(0.00) 11(0.00) 189.24(84.92)

Perm-ISIS 21.44(0.25) 93.93(6.42) 3(0.00) 13(0.00) 136.47(69.53)

Perm-g-ISIS 9.19(1.95) 8.26(5.53) 5(0.00) 11(0.00) 1102.28(96.11)

Perm-var-ISIS 0.95(0.76) 0.24(0.38) 5(0.00) 0(0.00) 386.87(68.42)

Perm-var-g-ISIS 1.24(0.66) 0.35(0.41) 5(0.00) 1(1.49) 509.85(147.38)

LASSO 163.09(14.17) 1035.23(173.27) 4(0.00) 313(10.63) 35.67(3.87)

Table 1.5: Cox proportional hazards regression, Case 4, where results are given in the

form of medians and robust standard deviations (in parentheses).


This is to be expected considering the amount of extra work these procedures have to perform: two rounds of marginal fits to obtain the sample-specific data-driven thresholds ω^(1)(q) and ω^(2)(q), plus two additional rounds of marginal fits to compute the corresponding index sets A1^(1) and A1^(2). Computational costs potentially increase further when carrying out the conditional marginal regression steps described in Section 1.2.2; however, the gains in statistical accuracy and model selection offset the increased timings, particularly in the equally correlated Case 2 with nonconvex penalties.

Tables 1.2–1.3 show that SCAD also enjoys the sure screening property for the relatively easy Cases 1 and 2; however, its model sizes and estimation errors are significantly larger than those of any of the ISIS procedures in the correlated scenario. On the other hand, for the difficult Cases 3 and 4, surprisingly, LASSO rarely includes the important predictor x4 even though it is not a marginal screening based method. While exhibiting competitive performance with some of the ISIS variants in the Poisson regression scenario, LASSO performs poorly in the Cox model setup, having prohibitively large model sizes and estimation errors.

The ncvreg package with SCAD outperforms all ISIS variants in terms of compu-

tational cost for the uncorrelated Case 1. Still, the vanilla SIS procedure identifies the

true model faster than ncvreg. For the correlated Case 2, however, only the greedy

modification Perm-g-ISIS is slower than SCAD. In the Poisson and Cox model setups,

while the computational cost of LASSO with the glmnet package is smaller than any

of the ISIS procedures, the vanilla SIS shows better performance in terms of timings

and estimation errors.

As they become more elaborate, ISIS procedures become more computationally

expensive. Yet, vanilla ISIS and most of its variants presented here, particularly

Var1-ISIS and Perm-var-ISIS, are clearly competitive variable selection methods in

ultrahigh dimensional statistical models, adequately using the joint covariate infor-

mation, while exhibiting a very low false selection rate and competitive computational cost.

1.4.2 Scaling in n and p with Feature Screening

In addition to comparing our SIS codes with glmnet and ncvreg, we would like to

know how the timings of the vanilla SIS and ISIS procedures scale in n and p. We

simulated data sets from Cases 1–3 above and, for a variety of n and p values, we took

the median running time over 10 repetitions. Again, for each (n, p) pair, whenever

necessary, the BIC criterion was used to select the best λ value among a path of 100

possible candidates. Figure 1.1 shows timings for fixed n as p grows (Cases 1–3), and

for fixed p as n grows (Case 2).

From the plots we see that independence screening procedures perform uniformly

faster than ncvreg with the SCAD penalty. For Poisson regression, vanilla SIS also

outperforms glmnet with LASSO, particularly in the n = p scenario, where glmnet

exhibits unusually slow performance. It is worth pointing out that iterative varia-

ble screening procedures typically do not show a strictly monotone timing as n or p

increase. This is due to the varying number of iterations it takes to recruit d predictors, the random splitting of the sample in the first two ISIS variants, and the random permutation in the data-driven thresholding, among other factors.

Figure 1.1: Median runtime in seconds taken over 10 trials (log-log scale). The four panels plot log10(T) against log10(p) for Cases 1–3, comparing Van-SIS or Van-ISIS with SCAD/ncvreg or LASSO/glmnet at n = 500 and n = 1000, and against log10(n) for Case 2, comparing Van-ISIS with SCAD/ncvreg at p = 10,000 and p = 20,000.

1.4.3 Real Data Analysis

We now evaluate the performance of all variable screening procedures on four well-

studied gene expression data sets: Leukemia (Golub et al., 1999), Prostate cancer

(Singh et al., 2002), Lung cancer (Gordon et al., 2002) and Neuroblastoma (Oberthuer

et al., 2006). The first three data sets come with predetermined, separate training

and test sets of data vectors. The Leukemia data set contains p = 7129 genes for 27

acute lymphoblastic leukemia and 11 acute myeloid leukemia vectors in the training

set. The test set includes 20 acute lymphoblastic leukemia and 14 acute myeloid

leukemia vectors. The Prostate cancer data set consists of gene expression profiles


from p = 12600 genes for 52 prostate “tumor” samples and 50 “normal” prostate

samples in the training data set. The test set is from a different experiment and

contains 25 tumor and 9 normal samples. The lung cancer data set contains 32 tissue

samples in the training set (16 from malignant pleural mesothelioma and 16 from

adenocarcinoma) and 149 in the test set (15 from malignant pleural mesothelioma

and 134 from adenocarcinoma). Each sample consists of p = 12533 genes. The

neuroblastoma data set consists of gene expression profiles for p = 10707 genes from

246 patients of the German neuroblastoma trials NB90-NB2004, diagnosed between

1989 and 2004. We analyzed the gene expression data by means of the 3-year event-free survival, indicating whether a patient survived 3 years after the diagnosis of

neuroblastoma. Combining the original training and test sets, the data consists of 56

positive and 190 negative cases. For purposes of the present analysis, in each of these

gene expression data sets, we initially combine the training and test samples and then

perform a 50% - 50% random splitting of the observed data into new training and

test data for which the number of cases remains balanced across these new samples.

In this manner, for the Leukemia data, the balanced training and test samples are of

size 36, for the Prostate data we have balanced training and test samples of size 68,

whereas the Neuroblastoma data set has balanced training and test samples of size

123. The balanced training and test samples for the Lung cancer data are of sizes 90

and 91, respectively. Interested readers can find more details about these data sets

in Golub et al. (1999), Singh et al. (2002), Gordon et al. (2002) and Oberthuer et al.

(2006).

Following the approach of Dudoit et al. (2002), before variable screening and

classification, we first standardize each sample to zero mean and unit variance. We

compare the performance of all described variable screening procedures with the Near-

est Shrunken Centroids (NSC) method of Tibshirani et al. (2002), the Independence

Rule (IR) in the high-dimensional setting (Bickel and Levina, 2004) and the LASSO

(Tibshirani, 1996), which uses ten-fold cross-validation to select its tuning parameter,

applied to the full set of covariates. Under a working independence assumption in

the feature space, NSC selects an important subset of variables for classification by

thresholding a corresponding two-sample t statistic, whereas IR makes use of the full

set of predictors.

Tables 1.6–1.7 show the median and robust standard deviation of the classifica-

tion error rates and model sizes for all procedures, taken over 100 random splittings

into 50% - 50% balanced training and test data. At each intermediate step of the

(I)SIS procedures, we employ the LASSO with ten-fold cross-validation to further

filter unimportant predictors for classification purposes.

Method | Leukemia training error rate | Leukemia test error rate | Leukemia model size | Prostate training error rate | Prostate test error rate | Prostate model size
Van-SIS | 0.00(0.00) | 0.06(0.04) | 16(2.43) | 0.00(0.01) | 0.22(0.05) | 14(3.92)
Van-ISIS | 0.00(0.00) | 0.06(0.04) | 15(2.99) | 0.00(0.01) | 0.20(0.06) | 14(2.43)
Var1-ISIS | 0.00(0.02) | 0.08(0.04) | 5(1.49) | 0.04(0.04) | 0.19(0.09) | 6(2.24)
Var2-ISIS | 0.00(0.00) | 0.06(0.02) | 15(2.24) | 0.00(0.01) | 0.18(0.06) | 12(2.24)
Perm-ISIS | 0.00(0.00) | 0.06(0.04) | 15(2.99) | 0.00(0.01) | 0.20(0.05) | 13(2.99)
Perm-g-ISIS | 0.03(0.02) | 0.08(0.04) | 3(1.49) | 0.07(0.05) | 0.23(0.11) | 4(1.49)
Perm-Var-ISIS | 0.00(0.02) | 0.08(0.04) | 5(1.49) | 0.04(0.04) | 0.19(0.09) | 5(2.24)
Perm-Var-g-ISIS | 0.03(0.04) | 0.08(0.04) | 2(0.00) | 0.08(0.04) | 0.22(0.10) | 4(0.93)
LASSO-CV(10) | 0.00(0.00) | 0.06(0.04) | 17(2.99) | 0.03(0.06) | 0.22(0.07) | 19(8.21)
NSC | 0.03(0.02) | 0.06(0.04) | 143(456.72) | 0.07(0.03) | 0.20(0.12) | 13(15.30)
IR | 0.03(0.02) | 0.14(0.10) | 7129(0.00) | 0.29(0.04) | 0.32(0.06) | 12600(0.00)

Table 1.6: Classification error rates and number of selected genes by various methods for the balanced Leukemia and Prostate cancer data sets. For the Leukemia data, the training and test samples are of size 36. For the Prostate cancer data, the training and test samples are of size 68. Results are given in the form of medians and robust standard deviations (in parentheses).

To determine a data-driven threshold for independence screening, we fix q = 0.95 for all permutation-based variable selection procedures. Lastly, for each data set considered, we apply all screening procedures to reduce dimensionality from the corresponding p to d = 100.

From the results in Tables 1.6–1.7, we observe that all ISIS variants perform sim-

ilarly in terms of test error rates, whereas the main differences lie in the estimated

model sizes. Compared with the LASSO applied to the full set of covariates, a major-

ity of ISIS procedures select smaller models while retaining competitive classification error rates.

Method | Lung training error rate | Lung test error rate | Lung model size | NB training error rate | NB test error rate | NB model size
Van-SIS | 0.00(0.00) | 0.02(0.01) | 14(2.43) | 0.09(0.02) | 0.19(0.02) | 14(2.99)
Van-ISIS | 0.00(0.00) | 0.02(0.01) | 13(1.49) | 0.00(0.00) | 0.22(0.03) | 38(5.22)
Var1-ISIS | 0.00(0.00) | 0.02(0.01) | 9(1.68) | 0.14(0.04) | 0.21(0.03) | 3(2.24)
Var2-ISIS | 0.00(0.00) | 0.02(0.01) | 13(2.24) | 0.00(0.00) | 0.22(0.03) | 33(5.41)
Perm-ISIS | 0.00(0.00) | 0.02(0.01) | 13(1.49) | 0.00(0.00) | 0.22(0.02) | 38(5.97)
Perm-g-ISIS | 0.00(0.01) | 0.02(0.02) | 2(0.75) | 0.03(0.02) | 0.26(0.04) | 10(3.73)
Perm-Var-ISIS | 0.00(0.00) | 0.02(0.01) | 9(2.24) | 0.14(0.04) | 0.21(0.03) | 4(2.24)
Perm-Var-g-ISIS | 0.01(0.01) | 0.02(0.01) | 2(0.75) | 0.14(0.03) | 0.21(0.04) | 3(1.49)
LASSO-CV(10) | 0.01(0.01) | 0.02(0.02) | 15(2.24) | 0.14(0.04) | 0.26(0.04) | 23(9.70)
NSC | 0.00(0.00) | 0.00(0.02) | 6(22.76) | 0.18(0.02) | 0.20(0.03) | 361(5150.37)
IR | 0.13(0.05) | 0.14(0.06) | 12533(0.00) | 0.15(0.02) | 0.20(0.02) | 10707(0.00)

Table 1.7: Classification error rates and number of selected genes by various methods for the balanced Lung and Neuroblastoma (NB) cancer data sets. For the Lung data, the training and test samples are of sizes 90 and 91, respectively. For the Neuroblastoma cancer data, the training and test samples are of size 123. Results are given in the form of medians and robust standard deviations (in parentheses).

This is in agreement with our simulation results, which highlight the ben-

efits of variable screening over a direct high-dimensional regularized logistic regression

approach. In particular, we observe the variants Var1-ISIS and Perm-var-g-ISIS pro-

vide the most parsimonious models across all four data sets, yielding optimal test error

rates while using only 2 features in the case of the Lung cancer data set. Nonetheless,

due to its robust performance in both the simulated data and these four gene expres-

sion data sets, and its reduced computational cost compared with all available ISIS


variants, we select the vanilla ISIS of Algorithm 1.2 as the default variable selection

procedure within our SIS package.

While the NSC method achieves competitive test error rates, it typically makes

use of larger sets of genes which vary considerably across the different 50% - 50%

training and test data splittings. The Independence Rule exhibits poor test error

performance, except for the Neuroblastoma data set, where it even outperforms some

of the ISIS procedures. However, this approach uses all features without performing

variable selection, thus yielding models of little practical use for researchers.

1.4.4 Code Example

All described independence screening procedures are straightforward to run using the

SIS package. We demonstrate the SIS function on the Leukemia data set from the

previous section. We first load the predictors and response vector from the training

and test data sets, and then carry out the balanced sample splitting as outlined above.

R> set.seed(9)

R> data("leukemia.train", package = "SIS")

R> data("leukemia.test", package = "SIS")

R> y1 = leukemia.train[, dim(leukemia.train)[2]]

R> x1 = as.matrix(leukemia.train[, -dim(leukemia.train)[2]])

R> y2 = leukemia.test[, dim(leukemia.test)[2]]

R> x2 = as.matrix(leukemia.test[, -dim(leukemia.test)[2]])

R> ### Combine Datasets ###

R> x = rbind(x1, x2)

R> y = c(y1, y2)

R> ### Split Datasets ###

R> n = dim(x)[1]; aux = 1:n

R> ind.train1 = sample(aux[y == 0], 23, replace = FALSE)


R> ind.train2 = sample(aux[y == 1], 13, replace = FALSE)

R> ind.train = c(ind.train1, ind.train2)

R> x.train = scale(x[ind.train,])

R> y.train = y[ind.train]

R> ind.test1 = setdiff(aux[y == 0], ind.train1)

R> ind.test2 = setdiff(aux[y == 1], ind.train2)

R> ind.test = c(ind.test1, ind.test2)

R> x.test = scale(x[ind.test,])

R> y.test = y[ind.test]

We now perform variable selection using the Var1-ISIS and Perm-var-ISIS proce-

dures paired with the LASSO penalty and the ten-fold cross-validation method for

choosing the regularization parameter.

R> model1 = SIS(x.train, y.train, family = "binomial",

+ penalty = "lasso", tune = "cv", nfolds = 10, nsis = 100,

+ varISIS = "aggr", seed = 9, standardize = FALSE)

R> model2 = SIS(x.train, y.train, family = "binomial",

+ penalty = "lasso", tune = "cv", nfolds = 10, nsis = 100,

+ varISIS = "aggr", perm = TRUE, q = 0.95, seed = 9,

+ standardize = FALSE)

R> model1$ix

[1] 1834 4377 4847 6281 6493

R> model2$ix

[1] 1834 4377 4847 6281 6493

Here we modified the default value d = ⌊n/(4 log n)⌋ to make both iterative procedures select models with at most 100 predictors. The value of q ∈ [0, 1], from which we obtain the data-driven threshold ω(q) for Perm-var-ISIS, was also customized from its default q = 1.
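As a simple follow-up (not part of the SIS function itself), one way to gauge out-of-sample performance of the selected genes is to refit an unpenalized logistic regression on the training data and compute the misclassification rate on the held-out test sample; on such a small, well-separated training set, glm may warn about fitted probabilities of 0 or 1.

R> train.df = data.frame(y = y.train, x.train[, model1$ix])
R> test.df = data.frame(x.test[, model1$ix])
R> refit = glm(y ~ ., data = train.df, family = "binomial")
R> prob = predict(refit, newdata = test.df, type = "response")
R> mean((prob > 0.5) != y.test)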


1.5 Discussion

Sure independence screening is a powerful family of methods for performing variable

selection in statistical models when the dimension is much larger than the sample

size, as well as in the classical setting where p < n. The focus of the chapter is on

iterative sure independence screening, which iteratively applies a large scale screening

by means of conditional marginal regressions, filtering out unimportant predictors,

and a moderate scale variable selection through penalized pseudo-likelihood methods,

which further selects the unfiltered predictors. With the goal of providing further

flexibility to the iterative screening paradigm, special attention is also paid to powerful

variants which reduce the number of false positives by means of sample splitting and

data-driven thresholding approaches. Compared with the versions of LASSO and

SCAD we used, the iterative procedures presented in this chapter are much more

accurate in selecting important variables and achieving small estimation errors. In

addition, computational time is also reduced, particularly in the case of nonconvex

penalties, thus resulting in a robust family of procedures for model selection and

estimation in ultrahigh dimensional statistical models. Extensions of the current

package to more general loss-based models and nonparametric independence screening

procedures, as well as implementation of conditional marginal regressions through

support vector machine methods are lines of future work.


Chapter 2

How Many Communities Are There?

Stochastic blockmodels and variants thereof are among the most widely used ap-

proaches to community detection for social networks and relational data. A stochastic

blockmodel partitions the nodes of a network into disjoint sets, called communities.

The approach is inherently related to clustering with mixture models, and raises a similar model selection problem for the number of communities. The Bayesian information criterion (BIC) is a popular solution; however, for stochastic blockmodels, the conditional independence assumption among different edges, given the communities of their endpoints, is usually violated in practice. In this regard, we propose compos-

ite likelihood BIC (CL-BIC) to select the number of communities, and we show it

is robust against possible misspecifications in the underlying stochastic blockmodel

assumptions. We derive the requisite methodology and illustrate the approach using

both simulated and real data.


2.1 Introduction

Enormous network datasets are being generated and analyzed along with an increasing

interest from researchers in studying the underlying structures of a complex networked

world. The potential benefits span traditional scientific fields such as epidemiology

and physics, but also emerging industries, especially large-scale internet companies.

Among a variety of interesting problems arising with network data, in this chapter, we

focus on community detection in undirected networks G := (V,E), where V and E are

the sets of nodes and edges, respectively. In this framework, the community detection

problem can be formulated as finding the true disjoint partition V = V_1 ⊔ · · · ⊔ V_K, where K is the number of communities. Although it is difficult to give a rigorous definition, communities are often regarded as tightly-knit groups of nodes, with only loose connections between different groups.

The community detection problem has close connections with graph partitioning,

which could be traced back to Euler, while it has its own characteristics due to the

concrete physical meanings from the underlying dataset (Newman and Girvan, 2004).

Over the last decade, there has been a considerable amount of work on it, including

minimizing ratio cut (Wei and Cheng, 1989), minimizing normalized cut (Shi and

Malik, 2000), maximizing modularity (Newman and Girvan, 2004), hierarchical clus-

tering (Newman, 2004) and edge-removal methods (Newman and Girvan, 2004), to

name a few. Among all the progress made by peer researchers, spectral clustering

(Donath and Hoffman, 1973) based on stochastic blockmodels (Holland et al., 1983)

debuted and soon gained a majority of attention. We refer the interested readers to

Spielmat and Teng (1996) and Goldenberg et al. (2010) as comprehensive reviews on

the history of spectral clustering and stochastic blockmodels, respectively.

Compared to the amount of work on spectral clustering or stochastic blockmodels,

to the best of our knowledge, there is little work on the selection of the community

number K. In most of the previously mentioned community detection methods, the

number of communities is generally input as a pre-specified quantity. For the lit-


erature addressing the problem of selecting K, besides the block-wise edge splitting

method of Chen and Lei (2014), a common practice is to use BIC-type criteria (Airoldi

et al., 2008; Daudin et al., 2008) or a variational Bayes approach (Latouche et al.,

2012; Hunter et al., 2012). An inherently related problem is that of selecting the

number of components in mixture models, where the birth-and-death point process

of Stephens (2000) and the allocation sampler of Nobile and Fearnside (2007) provide

two fully Bayesian approaches in the case where K is finite but unknown. Based

on the allocation sampler, McDaid et al. (2013) propose an efficient Bayesian clus-

tering algorithm which directly estimates the number of communities in stochastic

blockmodels, and which exhibits similar results to the variational Bayes approach of

Latouche et al. (2012). Nonparametric Bayesian methods based on Dirichlet process

mixtures (Ferguson, 1973) have also been used to estimate the number of components

in this finite but unknown K setting (Fearnhead, 2004), although the inconsistency of

this approach has been recently shown by Miller and Harrison (2014). This community or mixture component number K, a vital part of any model selection procedure, depends strongly on the model assumptions. For instance, the standard stochastic blockmodel imposes the restrictive assumption that, once the community assignments are known, the edges are independent Bernoulli observations.

In this chapter, we study the community number selection problem with robust-

ness consideration against model misspecification in the stochastic blockmodel and

its variants. Our motivation is that, the conditional independence assumption among

edges, when the communities of their endpoints are given, is usually violated in real

applications. In addition, we do not restrict our interest only to exchangeable graphs.

Using techniques from the composite likelihood paradigm (Lindsay, 1988), we develop

a composite likelihood BIC (CL-BIC) approach (Gao and Song, 2010) for selecting

the community number in the situation where assumed independencies in the stochas-

tic blockmodel and other exchangeable graph models do not hold. The procedure is

tested on simulated and real data, and is shown to outperform two competitors, traditional BIC and the variational Bayes criterion of Latouche et al. (2012), in terms

of model selection consistency.

The rest of the chapter is organized as follows. The background for stochastic

blockmodels and spectral clustering is introduced in Section 2.2, and the proposed

CL-BIC methodology is developed in Section 2.3. In Section 2.4, several simulation

examples as well as two real data sets are analyzed. The chapter is concluded with a

short discussion in Section 2.5.

2.2 Background

First, we would like to introduce some notation. For an N-node undirected, simple, and connected network G, its symmetric adjacency matrix A is defined as A_ij := 1 if (i, j) is an element in E, and A_ij := 0 otherwise. The diagonal {A_ii}_{i=1}^N is fixed to zero (i.e., self-edges are not allowed). Moreover, D and L denote the degree matrix and Laplacian matrix, respectively. Here, D_ii := d_i, and D_ij := 0 for i ≠ j, where d_i is the degree of node i, i.e., the number of edges with endpoint node i; and L := D^{-1/2} A D^{-1/2}. As isolated nodes are discarded, D^{-1/2} is well-defined.

2.2.1 Stochastic Blockmodels

2.2.1.1 Standard Stochastic Blockmodel

Stochastic blockmodels were first introduced in Holland et al. (1983). They posit

independent Bernoulli random variables {A_ij}_{1≤i<j≤N} with success probabilities P_ij which depend on the communities of their endpoints i and j. Consequently, all edges are conditionally independent given the corresponding communities. Moreover, each node is associated with one and only one community, with label Z_i, where Z_i ∈ {1, . . . , K}. Following Rohe et al. (2011) and Choi et al. (2012), throughout this chapter we assume each Z_i is fixed and unknown, thus yielding P(A_ij = 1; Z_i = z_i, Z_j = z_j) = θ_{z_i z_j}. Treating the node assignments Z_1, . . . , Z_N as latent random


variables is another popular approach in the community detection literature, and

various methods including the variational Bayes criterion of Latouche et al. (2012)

and the belief propagation algorithm of Decelle et al. (2011) efficiently approximate

the corresponding observed-data log-likelihood of the stochastic blockmodel, without

having to add K^N multinomial terms accounting for all possible label assignments.

For θ := (θ_ab; 1 ≤ a ≤ b ≤ K)′ and for any fixed community assignment z ∈ {1, . . . , K}^N, the log-likelihood under the standard Stochastic Blockmodel (SBM) is given as

    ℓ(θ; A) := Σ_{i<j} [ A_ij log θ_{z_i z_j} + (1 − A_ij) log(1 − θ_{z_i z_j}) ].    (2.1)

For the remainder of the chapter, denote N_a as the size of community a, and n_ab as the maximum number of possible edges between communities a and b, i.e., n_ab := N_a N_b for a ≠ b, and n_aa := N_a(N_a − 1)/2. Also, let m_ab := Σ_{i<j} A_ij 1{z_i = a, z_j = b}, and θ̂_ab := m_ab/n_ab be the MLE of θ_ab in (2.1).
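For concreteness, the profile log-likelihood (2.1), with each θ_ab replaced by its MLE m_ab/n_ab, can be computed directly from A and a label vector z; the R sketch below is an illustration of these formulas only (assuming every community has at least two nodes), not code from any particular package.

sbm_profile_loglik <- function(A, z, eps = 1e-10) {
  K <- max(z)
  Z <- outer(z, 1:K, "==") + 0              # N x K membership indicator matrix
  M <- t(Z) %*% A %*% Z                     # M[a,b] = m_ab for a != b; M[a,a] = 2 * m_aa
  Ncom <- colSums(Z)                        # community sizes N_a
  n <- Ncom %o% Ncom                        # n_ab = N_a * N_b for a != b
  diag(n) <- Ncom * (Ncom - 1) / 2          # n_aa = N_a (N_a - 1) / 2
  m <- M
  diag(m) <- diag(M) / 2                    # m_aa counted once
  theta <- pmin(pmax(m / n, eps), 1 - eps)  # MLEs, bounded away from 0 and 1
  keep <- upper.tri(m, diag = TRUE)         # sum over unordered pairs a <= b
  sum(m[keep] * log(theta[keep]) + (n[keep] - m[keep]) * log(1 - theta[keep]))
}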

Under this framework, Choi et al. (2012) showed that the fraction of misclustered nodes converges in probability to zero under maximum likelihood fitting when K is allowed to grow no faster than √N. By means of a regularized maximum likelihood estimation approach, Rohe et al. (2014) further proved that this weak convergence can be achieved for K = O(N/log^5 N).

2.2.1.2 Degree-Corrected Stochastic Blockmodel

Heteroscedasticity of node degrees within communities is often observed in real-

world networks. To tackle this problem, Karrer and Newman (2011) proposed the

Degree-Corrected Blockmodel (DCBM), in which the success probabilities P_ij are also functions of individual effects. To be more precise, the DCBM assumes that P(A_ij = 1; Z_i = z_i, Z_j = z_j) = ω_i ω_j θ_{z_i z_j}, where ω := (ω_1, . . . , ω_N)′ are individual effect parameters satisfying the identifiability constraint Σ_i ω_i 1{z_i = a} = 1 for each community 1 ≤ a ≤ K.

To simplify technical derivations, Karrer and Newman (2011) allowed networks

to contain both multi-edges and self-edges. Thus, they assumed the random variables {A_ij}_{1≤i≤j≤N} to be independent Poisson, with the previously defined success probabilities P_ij of an edge between vertices i and j replaced by the expected number of such edges. Under this framework, and for any fixed community assignment z ∈ {1, . . . , K}^N, Karrer and Newman (2011) arrived at the log-likelihood ℓ(θ, ω; A) of observing the adjacency matrix A = (A_ij) under the DCBM,

    ℓ(θ, ω; A) := 2 Σ_i d_i log ω_i + Σ_{a,b} (m_ab log θ_ab − θ_ab).    (2.2)

After allowing for the identifiability constraint on ω, the MLEs of the parameters θ_ab and ω_i are given by θ̂_ab := m_ab and ω̂_i := d_i / Σ_{j: z_j = z_i} d_j, respectively.
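These closed-form MLEs are straightforward to compute from A and z; the short R sketch below (an illustration of the formulas above, not code from an existing package) returns both sets of estimates, assuming the labels z take each value in 1, . . . , K at least once.

dcbm_mle <- function(A, z) {
  K <- max(z)
  d <- rowSums(A)                          # node degrees d_i
  Z <- outer(z, 1:K, "==") + 0             # membership indicator matrix
  M <- t(Z) %*% A %*% Z
  m <- M
  diag(m) <- diag(M) / 2                   # m_ab, with within-community edges counted once
  comm_deg <- as.numeric(tapply(d, factor(z, levels = 1:K), sum))
  list(theta = m,                          # theta_ab = m_ab
       omega = d / comm_deg[z])            # omega_i = d_i / sum of degrees in i's community
}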

As mentioned in Zhao et al. (2012), there is no practical difference in performance

between the log-likelihood (2.2) and its slightly more elaborate version based on the

true Bernoulli observations. The reason is that the Bernoulli distribution with a small

mean is well approximated by the Poisson distribution, and the sparser the network

is, the better the approximation works (Perry and Wolfe, 2012).
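A quick numerical check of this approximation, with an arbitrary illustrative value: for a success probability as small as 0.05, the Bernoulli and Poisson(0.05) probabilities of the two possible edge outcomes differ only in the third decimal place.

p <- 0.05
c(bernoulli = p, poisson = dpois(1, lambda = p))        # 0.0500 vs 0.0476
c(bernoulli = 1 - p, poisson = dpois(0, lambda = p))    # 0.9500 vs 0.9512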

2.2.1.3 Mixed Membership Stochastic Blockmodel

As a methodological extension in which nodes are allowed to belong to more than one

community, Airoldi et al. (2008) proposed the Mixed Membership Stochastic Block-

model (MMB) for directed relational data {A_ij}_{1≤i,j≤N}. For instance, when a social

actor interacts with its different neighbors, an array of different social contexts may

be taking place and thus the actor may be taking on different latent roles.

The model assumes the observed network is generated according to node-specific

distributions of community membership and edge-specific indicator vectors denoting

membership in one of the K communities. More specifically, each vertex i is associated with a randomly drawn vector π_i, with π_ia denoting the probability of node i belonging to community a. Additionally, let the indicator vector z_{i→j} denote the community membership of node i when it sends a message to node j, and z_{i←j} denote the community membership of node j when it receives a message from node i. If, in order to account for the asymmetric interactions, we denote by θ := (θ_ab) the K × K matrix where θ_ab represents the probability of having an edge from a social actor in community a to a social actor in community b, the MMB posits that the {A_ij}_{1≤i,j≤N} are drawn from the following generative process (a small simulation sketch is given after the list):

• For each node i ∈ V:

  – Draw a K-dimensional mixed membership vector π_i ∼ Dirichlet(α), with the vector α = (α_1, . . . , α_K)′ being a hyper-parameter.

• For each possible edge variable A_ij:

  – Draw the membership indicator vector for the initiator, z_{i→j} ∼ Multinomial(π_i).
  – Draw the membership indicator vector for the receiver, z_{i←j} ∼ Multinomial(π_j).
  – Sample the interaction A_ij ∼ Bernoulli(z_{i→j}′ θ z_{i←j}).
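The following R sketch simulates a small directed network from this generative process; the values of N, K, α, and θ are illustrative choices, not estimates from any data set.

set.seed(1)
N <- 50; K <- 3
alpha <- rep(0.3, K)                          # Dirichlet hyper-parameter
theta <- matrix(0.05, K, K); diag(theta) <- 0.5
# mixed membership vectors: Dirichlet draws via normalized gammas
Pi <- matrix(rgamma(N * K, shape = alpha), N, K, byrow = TRUE)
Pi <- Pi / rowSums(Pi)
A <- matrix(0, N, N)
for (i in 1:N) for (j in 1:N) {
  if (i == j) next
  zi <- sample(K, 1, prob = Pi[i, ])          # community of i as the initiator
  zj <- sample(K, 1, prob = Pi[j, ])          # community of j as the receiver
  A[i, j] <- rbinom(1, 1, theta[zi, zj])
}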

Upon defining the set of mixed membership vectors Π := {π_i : i ∈ V} and the sets of membership indicator vectors Z_→ := {z_{i→j} : i, j ∈ V} and Z_← := {z_{i←j} : i, j ∈ V}, following Airoldi et al. (2008), we obtain the complete data log-likelihood of the hyper-parameters θ, α as

    ℓ(θ, α; A, Π, Z_→, Z_←) := Σ_{i,j} [ A_ij log(z_{i→j}′ θ z_{i←j}) + (1 − A_ij) log(1 − z_{i→j}′ θ z_{i←j}) ]
                               + N ( log Γ(Σ_a α_a) − Σ_a log Γ(α_a) ) + Σ_i Σ_a (α_a − 1) log π_ia + const,    (2.3)

where A corresponds to the observed data and Π, Z_→, Z_← are the latent variables.

In order to carry out posterior inference of the latent variables given the observa-

tions A, Airoldi et al. (2008) proposed an efficient coordinate ascent algorithm based

on a variational approximation to the true posterior. Therefore, one can compute


expected posterior mixed membership vectors and posterior membership indicator

vectors. We refer interested readers to Section 3 in Airoldi et al. (2008) for further

details.

Consequently, following the same profile likelihood approach, for any fixed set Π, Z_→, Z_←, the MLE of θ_ab is given by

    θ̂_ab := Σ_{i,j} A_ij · z_{i→j,a} z_{i←j,b} / Σ_{i,j} z_{i→j,a} z_{i←j,b}.    (2.4)

As the MLE of α_a does not admit a closed form, Minka (2000) proposed an efficient Newton-Raphson procedure for obtaining parameter estimates in Dirichlet models, where the gradient and Hessian matrix of the complete data log-likelihood (2.3) with respect to α are

    ∂ℓ(θ, α; A)/∂α_a = N ( Ψ(Σ_a α_a) − Ψ(α_a) ) + Σ_i log π_ia,
    ∂²ℓ(θ, α; A)/(∂α_a ∂α_b) = N ( Ψ′(Σ_a α_a) − Ψ′(α_a) 1{a = b} ),    (2.5)

and Ψ is known as the digamma function (i.e., the logarithmic derivative of the gamma function).

2.2.2 Spectral Clustering and SCORE

Although there is a parametric framework for the standard stochastic blockmodel,

considering the computational burden, it is intractable to directly estimate both pa-

rameters θ and z based on exact maximization of the log-likelihood (2.1). Researchers

have instead resorted to spectral clustering as a computationally feasible algorithm.

For comprehensive reviews, we refer interested readers to von Luxburg (2007) and

Rohe et al. (2011), in which the authors proved the consistency of spectral clustering

in the standard stochastic blockmodel under proper conditions imposed on the den-

sity of the network and the eigen-structure of the Laplacian matrix. The algorithm

finds the eigenvectors u_1, . . . , u_K associated with the K eigenvalues of L that are largest in magnitude, forming an N × K matrix U := (u_1, . . . , u_K), and then applies the K-means algorithm to the rows of U.

Similarly, Jin (2015) proposed a variant of spectral clustering for the DCBM,

called Spectral Clustering On Ratios-of-Eigenvectors (SCORE). Instead of using the

Laplacian matrix L, SCORE collects the eigenvectors v_1, . . . , v_K associated with the K eigenvalues of A that are largest in magnitude, and then forms the N × K matrix V := (1, v_2/v_1, . . . , v_K/v_1), where the division operator is taken entry-wise, i.e., for vectors a, b ∈ R^n with Π_{ℓ=1}^n b_ℓ ≠ 0, a/b := (a_1/b_1, . . . , a_n/b_n)′. SCORE then applies the K-means algorithm to the rows of V. The corresponding consistency results for the DCBM are also provided in Jin (2015).
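A minimal R sketch of the spectral clustering step just described (illustrative only; it assumes a connected network with no isolated nodes and is not code from an existing package):

spectral_cluster <- function(A, K) {
  d <- rowSums(A)
  L <- diag(1 / sqrt(d)) %*% A %*% diag(1 / sqrt(d))   # L = D^{-1/2} A D^{-1/2}
  e <- eigen(L, symmetric = TRUE)
  idx <- order(abs(e$values), decreasing = TRUE)[1:K]  # K eigenvalues largest in magnitude
  U <- e$vectors[, idx, drop = FALSE]
  kmeans(U, centers = K, nstart = 20)$cluster          # K-means on the rows of U
}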

2.3 Model Selection for the Number of Communities

2.3.1 Motivation

In much of the previous work, e.g. Airoldi et al. (2008), Daudin et al. (2008) and

Handcock et al. (2007), researchers have used a BIC-penalized version of the log-

likelihood (2.1) to choose the community number K. However, we are aware of the

possible misspecifications in the underlying stochastic blockmodel assumptions and of the loss of precision from the computational relaxation brought in by spectral

clustering.

Firstly, in network data, edges are not necessarily independent if only the commu-

nities of their endpoints are given. For instance, if two different edges A_ij and A_il share the endpoint i, it is highly likely that they are dependent even given the

community labels of their endpoints. This misspecification problem exists in both

the standard stochastic blockmodel and its variants, such as DCBM (Karrer and

Newman, 2011) and MMB (Airoldi et al., 2008). Secondly, as previously mentioned,


spectral clustering is a feasible relaxation, but the loss of precision is inevitable. Sev-

eral examples of this can be found in Guattery and Miller (1998). Hence, we introduce CL-BIC out of a concern for robustness against misspecifications in

the underlying stochastic blockmodel.

We would like to emphasize that CL-BIC is not a new community detection

method. Instead, under the SBM, DCBM, or MMB assumptions, it can be combined

with existing community detection methods to choose the true community number.

2.3.2 Composite Likelihood Inference

The CL-BIC approach extends the concepts and theory of conventional BIC on likeli-

hoods to the composite likelihood paradigm (Lindsay, 1988; Varin et al., 2011). Com-

posite likelihood aims at a relaxation of the computational complexity of statistical

inference based on exact likelihoods. For instance, when the dependence structure for

relational data is too complicated to implement, a working independence assumption

can effectively recover some properties of the usual maximum likelihood estimators

(Cox and Reid, 2004; Varin et al., 2011). However, under this misspecification frame-

work, the asymptotic variance of the resulting estimators is usually underestimated

as the Fisher information. Composite marginal likelihoods (also known as indepen-

dence likelihoods) have the same formula as conventional likelihoods in terms of being

a product of marginal densities (Varin, 2008), while statistical inference based on

them can capture this loss of variance. Consequently, to pursue the “true” model,

CL-BIC penalizes the number of parameters more than what BIC does for dependent

relational data.

Before going into details, we would like to give the rationale of using stochastic

blockmodels under a misspecification framework. In order to estimate the true joint

density g of {A_ij}_{1≤i<j≤N}, we consider the stochastic blockmodel family P = {p_θ : θ ∈ Θ},
where Θ = [0, 1]^{K(K+1)/2} for the standard stochastic blockmodel, and Θ = [0, 1]^{K(K+1)/2+N}
for the DCBM. The true joint density g may or may not belong to P, which is a parametric
family imposing independence among the {A_ij}_{i<j} when only the communities of the
endpoints are given.

Due to the difficulty in specifying the full, highly structured, N(N − 1)/2-dimensional
density g, while having access to the univariate densities p_ij(·; θ) of A_ij under the
blockmodel family P, the composite marginal likelihood paradigm compounds the
first-order log-likelihood contributions to form the composite log-likelihood

cl(θ; A) := Σ_{i<j} log p_ij(A_ij; θ),    (2.6)

where cl(·;A) corresponds to (2.1) under the standard stochastic blockmodel, and cor-

responds to (2.2) in the DCBM framework. Since each component of cl(θ;A) in (2.6)

is a valid log-likelihood object, the composite score estimating equation ∇θ cl(θ;A) =

0 is unbiased under usual regularity conditions. The associated Composite Likelihood

Estimator (CLE) θC, defined as the solution to ∇θ cl(θ;A) = 0, suggests a natural

estimator of the form p̂ = p_θ̂_C to minimize the expected composite Kullback–Leibler
divergence (Varin and Vidoni, 2005) between the assumed blockmodel p_θ and the
true, but unknown, joint density g,

∆_C(g, p; θ) := Σ_{i<j} E_g[ log g({A_ij}_{i<j} ∈ A_ij) − log p_ij(A_ij; θ) ],

where {A_ij}_{i<j} denotes the corresponding set of marginal events.
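
As a small illustration, and assuming labels z and block probabilities θ are already available for a candidate model, the composite log-likelihood (2.6) under the standard blockmodel can be evaluated in R roughly as follows (a sketch only; it assumes all fitted probabilities lie strictly between 0 and 1).

# Composite (independence) log-likelihood (2.6) under the standard SBM.
# A: N x N binary symmetric adjacency matrix; z: length-N labels in 1..k; theta: k x k matrix.
composite_loglik <- function(A, z, theta) {
  ut <- upper.tri(A)                                  # sum over pairs i < j only
  p  <- theta[cbind(z[row(A)[ut]], z[col(A)[ut]])]    # theta_{z_i z_j} for each pair
  sum(A[ut] * log(p) + (1 - A[ut]) * log(1 - p))
}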

In terms of the asymptotic properties of the CLE, following the discussion in Cox

and Reid (2004), it is important to distinguish whether the available data consist

of many independent replicates from a common distribution function or form a few

individually large sequences. While, in the first scenario, consistency and asymptotic

normality of the corresponding θC hold under some regularity conditions from the

classical theory of estimating equations (Varin et al., 2011), some difficulties arise

in the second one, which includes our observations {A_ij}_{i<j}. Indeed, as argued in

Cox and Reid (2004), if there is too much internal correlation present among the

individual components of the composite score ∇θ cl(θ;A), the estimator θC will not


be consistent. The CLE will retain good properties as long as the data are not too

highly correlated, which is the case for spatial data with exponential correlation decay.

Under this setting, Heagerty and Lele (1998) proved consistency and asymptotic

normality of θC in a scenario where the data are not sampled independently from

a study population. Under more general settings, consistency results are expected

upon using limit theorems and parametric estimation for fields (e.g. Guyon, 1995);

however, applying the corresponding results requires a properly defined distance on

networks and α-mixing conditions based on such distance.

2.3.3 Composite Likelihood BIC

Taking into account the measure of model complexity in the context of composite

marginal likelihoods (Varin and Vidoni, 2005), we define the following criterion for

selecting the community number K:

CL-BIC_k := −2 cl(θ̂_C; A) + d*_k log(N(N − 1)/2),    (2.7)

where k is the number of communities under consideration in the current model and is
used as the model index, d*_k := trace(H_k^{−1} V_k), H_k := E_θ(−∇²_θ cl(θ; A)), and
V_k := Var_θ(∇_θ cl(θ; A)). Then the resulting estimator for the community number is

k̂_CL-BIC := arg min_k CL-BIC_k.

Note that the CLE is a function of k, since a different model index yields a

different estimator θC := θC(k). Assuming independent and identically distributed

data replicates, which lead to consistent and asymptotically normally distributed

estimators θC, Gao and Song (2010) established the model selection consistency of a

similar composite likelihood BIC approach for high-dimensional parametric models.

While allowing for the number of potential model parameters to increase to infinity,

their consistency result only holds when the true model sparsity is bounded by a

universal constant.


Even though, under a misspecification framework for the blockmodel family P,
the observed data {A_ij}_{i<j} do not form independent replicates from a common popu-
lation, we anticipate the CL-BIC criterion (2.7) to be consistent in selecting the true
community number K, at least when the correlation among the {A_ij}_{i<j} is not severe
and the estimators θ̂_C are consistent and asymptotically normal, as in Heagerty and
Lele (1998). Since all the moment conditions in the consistency results from Gao and
Song (2010) hold automatically upon noticing the specific forms of the blockmodel
composite log-likelihoods (2.1) – (2.3), we conjecture that, under a properly defined
mixing condition on {A_ij}_{i<j} (Guyon, 1995) and for a bounded community number
K ≤ k_0, P{k̂_CL-BIC = K} → 1 as the number of nodes N in the network grows to
infinity. A formal theoretical study is left for future work.

2.3.4 Formulae

2.3.4.1 Standard Stochastic Blockmodel

Following our discussions in the previous section, we treat (2.1) as the composite
marginal likelihood, under the working independence assumption that, given the
community labels of their endpoints, the Bernoulli random variables {A_ij}_{i<j} are
independent. The first-order partial derivative of ℓ(θ; A) with respect to θ is denoted as
u(θ) = (u(θ_ab); 1 ≤ a ≤ b ≤ k)′, where

u(θ_ab) = Σ_{i<j} [ A_ij/θ_{z_i z_j} − (1 − A_ij)/(1 − θ_{z_i z_j}) ] I^{a,b}_{i,j},

and

I^{a,b}_{i,j} = min( 1{z_i = a, z_j = b} + 1{z_i = b, z_j = a}, 1 ).

Furthermore, the second-order partial derivative of ℓ(θ; A) has the following components,

∂²ℓ(θ; A) / ∂θ_{a_1 b_1} ∂θ_{a_2 b_2} = 0, if (a_1, b_1) ≠ (a_2, b_2),

and

∂²ℓ(θ; A) / ∂θ²_{ab} = −Σ_{i<j} [ A_ij/θ²_{z_i z_j} + (1 − A_ij)/(1 − θ_{z_i z_j})² ] I^{a,b}_{i,j}.

Define the Hessian matrix H_k(θ) = E_θ(−∂u(θ)/∂θ); then

H_k(θ) = E_θ( diag{ −∂²ℓ(θ; A)/∂θ²_{ab} ; 1 ≤ a ≤ b ≤ k } ).

Define the variability matrix V_k(θ) = Var_θ(u(θ)) and, following Varin and Vidoni
(2005), the model complexity d*_k = trace[H_k(θ)^{−1} V_k(θ)]. If the underlying model is
indeed a correctly specified standard stochastic blockmodel, we have d*_k = k(k + 1)/2
and CL-BIC reduces to the traditional BIC. Indexed by 1 ≤ k ≤ k_0, the estimated
criterion functions for the CL-BIC sequence (2.7) are

CL-BIC_k = −2 cl(θ̂_C; A) + d̂*_k log(N(N − 1)/2),    (2.8)

where θ̂_C and d̂*_k are estimators of θ and d*_k, respectively. For a given k, the explicit
estimator forms are given below:

Ĥ_k(θ̂_C) = diag{ Σ_{i<j} [ A_ij/θ̂²_{z_i z_j} + (1 − A_ij)/(1 − θ̂_{z_i z_j})² ] I^{a,b}_{i,j} ; 1 ≤ a ≤ b ≤ k }

and V̂_k(θ̂_C) = u(θ̂_C)[u(θ̂_C)]^T.

As noted in Gao and Song (2010), the above naive estimator for V_k(θ) vanishes
when evaluated at the CLE θ̂_C. An alternative proposed in Varin et al. (2011) is to
use a jackknife covariance matrix estimator, for the asymptotic covariance matrix of
θ̂_C, of the form

Var_jack(θ̂_C) = ((N − 1)/N) Σ_{l=1}^{N} (θ̂^{(−l)}_C − θ̂_C)(θ̂^{(−l)}_C − θ̂_C)^T,    (2.9)

where θ̂^{(−l)}_C is the composite likelihood estimator of θ with the l-th vertex deleted.
Let A^{(−l)} be the (N − 1) × (N − 1) matrix obtained after deleting the l-th row and
column from the original adjacency matrix A. An explicit form for θ̂^{(−l)}_C is given by
θ̂^{(−l)}_{ab} = (1/n^{(−l)}_{ab}) Σ_{i<j} A^{(−l)}_{ij} 1{z_i = a, z_j = b}, with n^{(−l)}_{ab} = N^{(−l)}_a N^{(−l)}_b for a ≠ b,
and n^{(−l)}_{aa} = N^{(−l)}_a (N^{(−l)}_a − 1)/2; naturally, N^{(−l)}_a = N_a − 1 if z_l = a and N^{(−l)}_a = N_a
otherwise.

Since the asymptotic covariance matrix of θ̂_C is given by the inverse Godambe in-
formation matrix, G_k(θ)^{−1} = H_k(θ)^{−1} V_k(θ) H_k(θ)^{−1} (see Gao and Song, 2010, and
Varin et al., 2011), an explicit estimator for d*_k can be obtained by right-multiplying
the jackknife covariance matrix estimator (2.9) by Ĥ_k(θ̂_C) to obtain

d̂*_k = trace[ Var_jack(θ̂_C) Ĥ_k(θ̂_C) ]
     = Σ_{1≤a≤b≤k} { Var_jack(θ̂_{ab}) × Σ_{i<j} [ A_ij/θ̂²_{z_i z_j} + (1 − A_ij)/(1 − θ̂_{z_i z_j})² ] I^{a,b}_{i,j} }.
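
The following R sketch mirrors these estimators for a single candidate k, assuming the label vector z has already been obtained (e.g., by spectral clustering); it is an illustration of the displayed formulae under those assumptions, not the code released with this work.

# Jackknife-based CL-BIC (2.8) for the standard SBM at a candidate community number k.
# A: N x N binary symmetric adjacency matrix (zero diagonal); z: length-N labels in 1..k.
theta_hat <- function(A, z, k) {
  th <- matrix(NA_real_, k, k)
  for (a in 1:k) for (b in a:k) {
    block <- A[z == a, z == b, drop = FALSE]
    n_ab  <- if (a == b) sum(z == a) * (sum(z == a) - 1) / 2 else sum(z == a) * sum(z == b)
    m_ab  <- if (a == b) sum(block) / 2 else sum(block)
    th[a, b] <- th[b, a] <- m_ab / n_ab                    # MLE theta_ab = m_ab / n_ab
  }
  th
}

cl_bic <- function(A, z, k) {
  N  <- nrow(A)
  th <- theta_hat(A, z, k)
  ut <- upper.tri(A)
  p  <- th[cbind(z[row(A)[ut]], z[col(A)[ut]])]
  cl <- sum(A[ut] * log(p) + (1 - A[ut]) * log(1 - p))     # composite log-likelihood (2.6)
  idx   <- which(upper.tri(th, diag = TRUE))               # entries theta_ab with a <= b
  pairs <- which(upper.tri(th, diag = TRUE), arr.ind = TRUE)
  t_hat  <- th[idx]
  t_jack <- matrix(sapply(1:N, function(l) theta_hat(A[-l, -l, drop = FALSE], z[-l], k)[idx]),
                   nrow = length(idx))                     # leave-one-vertex-out estimates
  V_jack <- (N - 1) / N * tcrossprod(t_jack - t_hat)       # jackknife covariance (2.9)
  H_diag <- numeric(length(idx))                           # diagonal of H_k(theta_hat)
  for (s in seq_along(idx)) {
    a <- pairs[s, 1]; b <- pairs[s, 2]
    sel <- ut & ((z[row(A)] == a & z[col(A)] == b) | (z[row(A)] == b & z[col(A)] == a))
    H_diag[s] <- sum(A[sel] / th[a, b]^2 + (1 - A[sel]) / (1 - th[a, b])^2)
  }
  d_star <- sum(diag(V_jack) * H_diag)                     # trace[Var_jack %*% H_k], H_k diagonal
  -2 * cl + d_star * log(N * (N - 1) / 2)                  # CL-BIC_k
}

The selected community number is then the candidate k minimizing cl_bic(A, z_k, k), with z_k re-estimated for each k by spectral clustering (or by SCORE in the degree-corrected case).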

2.3.4.2 Degree-Corrected Stochastic Blockmodel

Similarly, we develop the corresponding parallel results for the DCBM. The first- and second-
order partial derivatives of ℓ(θ, ω; A) with respect to θ are defined as follows,

∂ℓ(θ, ω; A)/∂θ = u(θ) = (u(θ_ab); 1 ≤ a ≤ b ≤ k)′,   with u(θ_ab) = m_ab/θ_ab − 1,

∂²ℓ(θ, ω; A)/∂θ_{a_1 b_1} ∂θ_{a_2 b_2} = 0, if (a_1, b_1) ≠ (a_2, b_2),

∂²ℓ(θ, ω; A)/∂θ²_{ab} = −m_ab/θ²_{ab},

Ĥ_k(θ̂_C) = diag{ 1/θ̂_ab ; 1 ≤ a ≤ b ≤ k },

which yields

d̂*_k = Σ_{1≤a≤b≤k} Var_jack(θ̂_ab)/θ̂_ab.

2.3.4.3 Mixed Membership Stochastic Blockmodel

The estimated model complexity for the MMB now involves second-order partial deriva-
tives of ℓ(θ, α; A) with respect to the hyper-parameters θ and α. Upon noticing
the form of the first term of the complete data log-likelihood (2.3), and recalling the
Hessian matrix with respect to α detailed in (2.5), it is easy to see that Ĥ_k(θ̂_C, α̂_C)
is a block matrix of the form

Ĥ_k(θ̂_C, α̂_C) = [ Ĥ_k(θ̂_C)     0
                        0      Ĥ_k(α̂_C) ],

where Ĥ_k(θ̂_C) is a k² × k² diagonal matrix given by

Ĥ_k(θ̂_C) = diag{ Σ_{i,j} [ A_ij/θ̂²_{z_{i→j}, z_{i←j}} + (1 − A_ij)/(1 − θ̂_{z_{i→j}, z_{i←j}})² ] 1{z_{i→j} = a, z_{i←j} = b} },

and Ĥ_k(α̂_C) = (Ĥ_k(α̂_C)_{ab}) is a k × k matrix with entries

Ĥ_k(α̂_C)_{ab} = N( Ψ′(α̂_a) 1{a = b} − Ψ′(Σ_a α̂_a) ).

In a slight abuse of notation, we denote by z_{i→j} above the label assignment corres-
ponding to node i when it sends a message to node j, and similarly for z_{i←j}. The
estimated model complexity is thus d̂*_k = trace[ Var_jack(θ̂_C, α̂_C) Ĥ_k(θ̂_C, α̂_C) ], where the
jackknife matrix Var_jack(θ̂_C, α̂_C), assuming a similar form as in (2.9) with θ̂^{(−l)}_C and
α̂^{(−l)}_C estimated as explained in Section 2.2, provides the corresponding asymptotic
covariance matrix estimator of the CLE (θ̂_C, α̂_C).

We would like to remark that our CL-BIC approach for selecting the community

number K extends beyond the realm of stochastic blockmodels. Indeed, both the

latent space cluster model of Handcock et al. (2007) and the local dependence model

of Schweinberger and Handcock (2015), as well as any other (composite) likelihood-

based approach that requires selecting a value of K, can employ our proposed CL-BIC

methodology for selecting the number of communities. We leave the details of this

further investigation for future research.

2.4 Experiments

In this section, we show the advantages of the CL-BIC approach over the traditional

BIC, as well as the variational Bayes approach, in selecting the true number of
communities via simulations and two real datasets.

2.4.1 Simulations

For simplicity of the presentation, we consider only the SBM and the DCBM in our

simulations. For each setting, we relax the assumption that the Aij’s are condition-

ally independent given the labels (Zi = zi, Zj = zj), varying both the dependence

structure of the adjacency matrix A ∈ RN×N and the value of the parameters (θ,ω).

The models introduced are correlation-contaminated stochastic blockmodels, i.e., we

bring different types of correlation into the stochastic blockmodels, both standard

and degree-corrected, mimicking real-world networks.

All of our simulated adjacency matrices have independent rows. That is, the
binary variables A_ik and A_jl are independent, whenever i ≠ j, given the corres-
ponding community labels of their endpoints. However, for a fixed node i ∈ V,
correlation does exist across different columns in the binary variables A_ij and A_il.
For the standard stochastic blockmodel, correlated binary random variables are gen-
erated, following the approach in Leisch et al. (1998), by thresholding a multivari-
ate Gaussian vector with correlation matrix Σ satisfying Σ_jl = ρ_jl. Specifically,
for any choice of |ρ_jl| ≤ 1, we simulate correlated variables A_ij and A_il such that
Cov(A_ij, A_il) = L(−µ_j, −µ_l, ρ_jl) − θ_{z_i z_j} θ_{z_i z_l}. Here, following Leisch et al. (1998), we
have L(−µ_j, −µ_l, ρ_jl) = P(W_j ≥ −µ_j, W_l ≥ −µ_l), µ_j = Φ^{−1}(θ_{z_i z_j}) and µ_l = Φ^{−1}(θ_{z_i z_l}),
where (W_j, W_l) is standard bivariate normal with correlation ρ_jl. Correlated Bernoulli
variables for the degree-corrected blockmodel are generated in a similar fashion.
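
A sketch of this thresholding construction in R, for a single node i, a target probability vector p_j = θ_{z_i z_j}, and the equally correlated latent Gaussian case, could look as follows; the function name and the common correlation argument rho are ours.

# Correlated Bernoulli edges for one node by thresholding a multivariate Gaussian vector.
# p: target edge probabilities theta_{z_i z_j}; rho: common correlation of the latent Gaussians.
library(MASS)
correlated_edges <- function(p, rho) {
  n     <- length(p)
  Sigma <- matrix(rho, n, n); diag(Sigma) <- 1   # equally correlated; use rho^|j-l| for the decaying case
  mu    <- qnorm(p)                              # mu_j = Phi^{-1}(theta_{z_i z_j})
  W     <- as.numeric(mvrnorm(1, mu = rep(0, n), Sigma = Sigma))
  as.integer(W >= -mu)                           # P(A_ij = 1) = P(W_j >= -mu_j) = p_j
}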

In each experiment, carried over 200 randomly generated adjacency matrices, we

record the proportion of times the chosen number of communities for each of the

different criteria for selecting K agrees with the truth. Apart from CL-BIC and

BIC, we also consider the Integrated Likelihood Variational Bayes (VB) approach

of Latouche et al. (2012). To estimate the true community number, their method

selects the candidate value k which maximizes a variational Bayes approximation to


Table 2.1: Comparison of CL-BIC and BIC over 200 repetitions from Simulation 1, where
Eq and Dec indicate the equally correlated and exponentially decaying cases, respectively.
Both the correlation of the multivariate Gaussian random variables (ρ MVN) and the
corresponding maximum correlation between Bernoulli variables (ρ Ber.) are presented.

ρ MVN      ρ Ber.   PROP CL-BIC   PROP BIC   MEDIAN DEV CL-BIC   MEDIAN DEV BIC
0.10 Eq    0.06     1.00          0.40       0.0(0.0)            2.0(1.5)
0.15 Eq    0.09     0.92          0.14       1.0(0.0)            3.0(2.2)
0.20 Eq    0.12     0.81          0.03       1.0(0.4)            5.0(3.0)
0.40 Dec   0.25     1.00          0.35       0.0(0.0)            2.0(1.5)
0.50 Dec   0.32     1.00          0.21       0.0(0.0)            2.0(1.5)
0.60 Dec   0.40     0.99          0.12       1.0(0.0)            3.0(1.5)

NOTE: PROP, proportion; MEDIAN DEV, median deviation. In the MEDIAN DEV columns,
results are in the form of median (robust standard deviation).

the observed-data log-likelihood.

We restrict attention to candidate values for the true but unknown K in the

range k ∈ {1, . . . , 18}, both in simulations and the real data analysis section. For

Simulations 1 – 3, spectral clustering is used to obtain the community labels for each

candidate k, whereas in the DCBM setting of Simulation 4, the SCORE algorithm

is employed. Additionally, among the incorrectly selected community number trials,

we calculate the median deviation between the selected community number and the

true K = 4, as well as its robust standard deviation.

Simulation 1: Correlation among the edges within and between communities is

introduced simultaneously throughout all blocks in the network, rather than proceeding in a
block-by-block fashion. Concretely, for each node i, all edges {A_ij}_{j>i} are generated by
thresholding a correlated (N − i)-dimensional Gaussian random vector with correlation
matrix Σ = (ρ_jl). Thus, in this scenario, all edges A_ij and A_il with common endpoint
i are correlated, regardless of whether j and l belong to the same community or not.
The cases ρ_jl = ρ and ρ_jl = ρ^{|j−l|}, with several choices of ρ, are considered. We consider a
4-community network, θ = (θ_ab; 1 ≤ a ≤ b ≤ 4)′, where θ_aa = 0.35 for all a = 1, . . . , 4


Table 2.2: Comparison of CL-BIC and BIC over 200 repetitions from Simulation 2, where
Ind indicates ρ_jl = 0 for j ≠ l. For simplicity, we omit the correlation between the
corresponding Bernoulli variables.

ρ W.       ρ B.        PROP CL-BIC   PROP BIC   MEDIAN DEV CL-BIC   MEDIAN DEV BIC
0.10 Eq    Ind         1.00          0.64       0.0(0.0)            2.0(0.7)
0.15 Eq    Ind         0.98          0.36       1.0(0.7)            2.0(1.5)
0.20 Eq    Ind         0.80          0.08       1.0(0.7)            3.0(2.2)
0.40 Dec   Ind         1.00          0.33       0.0(0.0)            2.0(0.7)
0.50 Dec   Ind         1.00          0.29       0.0(0.0)            2.0(1.5)
0.60 Dec   Ind         1.00          0.14       0.0(0.0)            2.0(1.5)
0.10 Eq    0.40 Dec    1.00          0.59       0.0(0.0)            1.0(1.5)
0.10 Eq    0.50 Dec    1.00          0.54       0.0(0.0)            2.0(0.7)
0.10 Eq    0.60 Dec    1.00          0.53       0.0(0.0)            2.0(1.1)
0.15 Eq    0.40 Dec    0.98          0.32       1.0(0.0)            2.0(1.5)
0.15 Eq    0.50 Dec    0.97          0.30       1.0(0.4)            3.0(1.5)
0.15 Eq    0.60 Dec    0.95          0.25       1.0(0.0)            3.0(1.5)

and θab = 0.05 for 1 ≤ a < b ≤ 4. Community sizes are 60, 90, 120 and 150,

respectively. Results are collected in Table 2.1.

Simulation 2: Correlation among the edges within (ρ W.) and between (ρ B.)

communities is introduced block-wise. Concretely, for each node i, all edges A_ij and

Ail are generated independently whenever j and l belong to different communities. If

j and l belong to the same community, edges Aij and Ail are generated by thresholding

a correlated Gaussian random vector with correlation matrix Σ = (ρjl). Parameter

settings are identical to Simulation 1, with results collected in Table 2.2.

Simulation 3: Correlation settings are the same as in Simulation 2, but we change

the value of the parameter θ to allow for more general network topologies. We set

θ = (θab; 1 ≤ a ≤ b ≤ 4)′ with θaa = θb4 = 0.35 for all a = 1, . . . , 4 and b = 1, 2, 3.

The remaining entries of θ are set to 0.05. Hence, following Latouche et al. (2012),

vertices from community 4 connect with probability 0.35 to any other vertices in the

network, forming a community of only hubs. Community sizes are the same as in

Simulation 1, with results collected in Table 2.3.


Table 2.3: Comparison of CL-BIC and VB over 200 repetitions from Simulation 3. For
simplicity, we omit the correlation between the corresponding Bernoulli variables.

ρ W.       ρ B.   PROP CL-BIC   PROP VB   MEDIAN DEV CL-BIC   MEDIAN DEV VB
0.00 Eq    Ind    1.00          1.00      0.0(0.0)            0.0(0.0)
0.10 Eq    Ind    0.96          0.00      1.0(0.0)            2.0(0.0)
0.15 Eq    Ind    0.88          0.00      1.0(0.0)            4.0(2.2)
0.20 Eq    Ind    0.85          0.00      1.0(0.0)            5.0(1.5)
0.00 Dec   Ind    1.00          1.00      0.0(0.0)            0.0(0.0)
0.40 Dec   Ind    1.00          1.00      0.0(0.0)            0.0(0.0)
0.50 Dec   Ind    1.00          0.94      0.0(0.0)            1.0(0.0)
0.60 Dec   Ind    1.00          0.56      0.0(0.0)            1.0(0.0)

Simulation 4: We follow the approach of Zhao et al. (2012) in choosing the pa-

rameters (θ,ω) to generate networks from the degree-corrected blockmodel. Thus, the

identifiability constraint Σ_i ω_i 1{z_i = a} = 1 for each community 1 ≤ a ≤ K is re-
placed by the requirement that the ω_i be independently generated from a distribution
with unit expectation, fixed here to be

ω_i =  η_i,    with probability 0.8,
       2/11,   with probability 0.1,
       20/11,  with probability 0.1,

where ηi is uniformly distributed on the interval [0, 2]. The vector θ, in a slight abuse

of notation, is reparametrized as θn = γnθ, where we vary the constant γn to obtain

different expected degrees of the network. Correlation settings and community sizes

are the same as in Simulation 1, with results presented in Table 2.4, where choices

for γn and θ are specified.
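
A quick R sketch of this individual-effect mechanism (for illustration only; the function name is ours):

# Degree-correction parameters with unit expectation:
# Uniform(0, 2) w.p. 0.8, 2/11 w.p. 0.1, 20/11 w.p. 0.1.
draw_omega <- function(N) {
  u <- runif(N)
  ifelse(u < 0.8, runif(N, 0, 2), ifelse(u < 0.9, 2 / 11, 20 / 11))
}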

When the stochastic blockmodels are contaminated by the imposed correlation

structure, which is expected in real-world networks, CL-BIC outperforms BIC over-

whelmingly. Tables 2.1–2.2 show the improvement is more significant when the im-

posed correlation is larger. For instance, in the block-wise correlated case of Table

2.2, when we only have within-community correlation ρDec = 0.60, CL-BIC does the


Table 2.4: Comparison of CL-BIC and BIC over 200 repetitions from Simulation 4. Before
being scaled by the constant γn, we selected θ = (θ_ab; 1 ≤ a ≤ b ≤ 4)′, where θ_aa = 7 for
all a = 1, . . . , 4 and θ_ab = 1 for 1 ≤ a < b ≤ 4.

ρ MVN      γn     PROP CL-BIC   PROP BIC   MEDIAN DEV CL-BIC   MEDIAN DEV BIC
0.20 Eq    0.02   0.84          0.35       1.0(1.5)            2.0(1.5)
0.20 Eq    0.03   0.96          0.58       1.0(0.0)            3.0(1.5)
0.30 Eq    0.02   0.70          0.31       1.0(0.4)            2.0(0.7)
0.30 Eq    0.03   0.93          0.52       1.0(0.6)            3.0(1.5)
0.40 Eq    0.02   0.43          0.21       1.0(1.5)            2.0(1.5)
0.40 Eq    0.03   0.85          0.51       1.0(1.9)            3.0(1.9)
0.60 Dec   0.02   0.92          0.52       -1.0(1.5)           2.0(1.5)
0.60 Dec   0.03   1.00          0.81       0.0(0.0)            1.0(1.5)
0.70 Dec   0.02   0.83          0.41       -1.0(1.5)           2.0(1.5)
0.70 Dec   0.03   1.00          0.77       0.0(0.0)            1.0(0.7)
0.80 Dec   0.02   0.69          0.22       -1.0(1.5)           3.0(2.4)
0.80 Dec   0.03   0.98          0.69       -1.0(0.4)           1.0(1.5)

right selection in all cases, while BIC is only successful in 14% of 200 trials.

As shown in Table 2.3 for the model with a community of only hubs, if the network

is generated from a purely stochastic blockmodel, or if the contaminating correlation

is not too strong, CL-BIC and VB have similar performance in selecting the correct

K = 4. But again, as the imposed correlation increases, VB fails to make the right

selection more often than CL-BIC. This is particularly true in the ρEq = 0.20 case,

where CL-BIC makes the right selection in 85% of simulated networks, whereas VB

fails in all cases, yielding models with a median of 9 communities.

The same pattern translates into the DCBM setting of Table 2.4, where smaller

values of γn yield sparser networks. The community number selection problem be-

comes more difficult as γn decreases, as degrees for many nodes are small, yielding

noisy individual effect estimates ω̂_i = d_i / Σ_{j: z_j = z_i} d_j. Nevertheless, the CL-BIC ap-

proach consistently selects the correct number of communities more frequently than

BIC over different correlation settings.

In addition, Figure 2.1 presents simulation results where the true community

number K increases from 2 to 8. Following our previous examples, community


[Figure 2.1 about here: proportion of correct selections (y-axis, "Proportion", 0.0 to 1.0) plotted
against the true community number K (x-axis, 2 to 8), with curves for Simulation 1 CL-BIC and
BIC (ρEq = 0.10), Simulation 2 CL-BIC and BIC (ρEq = 0.10), and Simulation 3 CL-BIC and
VB (ρDec = 0.60).]

Figure 2.1: (Color online) Comparisons between different methods for selecting the true

community number K in the standard blockmodel settings of Simulations 1 – 3. Along the

y-axis, we record the proportion of times the chosen number of communities for each of the

different criteria for selecting K agrees with the truth.

sizes grow according to the sequence (60, 90, 120, 150, 60, 90, 120, 150). The selected

correlation-contaminated stochastic blockmodels are ρEq = 0.10 from Simulation 1,

within-community correlation ρEq = 0.10 from Simulation 2, and within-community

correlation ρDec = 0.60 from Simulation 3. As K increases and enough vertices are

added into the network, CL-BIC tends to correctly estimate the true community

number in all simulation settings. Even in this scenario with a growing number of

communities, the proportion of times CL-BIC selects the true K is always greater

than the corresponding BIC or VB estimates.

Before moving to our last simulation example, we would like to define two measures

to quantify the accuracy of a given node label assignment. The first measure is a


“goodness-of-fit” (GF) measure defined as

GF(z, ẑ_k) = Σ_{i<j} ( 1{z_i = z_j} 1{ẑ_i = ẑ_j} + 1{z_i ≠ z_j} 1{ẑ_i ≠ ẑ_j} ) / ( N(N − 1)/2 ),    (2.10)

where z represents the true community labels and zk represents the community as-

signments from an estimator. Thus, the measure GF (z, zk) calculates the proportion

of pairs whose estimated assignments agree with the correct labels in terms of being

assigned to the same or different communities, and is commonly known as the Rand

Index (Rand, 1971) in the cluster analysis literature.

The second measure is motivated from the “assortativity” notion. The ratio of

the median within community edge number to that of the between community edge

number (MR) is defined as

MR(ẑ_k) = median_{a=1,...,k}(m̂_aa) / median_{a≠b}(m̂_ab),    (2.11)

where k is the number of communities implied by zk and mab is the total number of

edges between communities a and b, as given by the community assignment zk. It is

clear that for both measures, a higher value indicates a better community detection

performance.
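
Both measures are straightforward to compute; a small R sketch (ours, for illustration) is given below, with z and zhat the true and estimated label vectors and A the adjacency matrix.

# Goodness-of-fit (Rand index) between true labels z and estimated labels zhat, as in (2.10).
gf_measure <- function(z, zhat) {
  N     <- length(z)
  ut    <- upper.tri(matrix(0, N, N))
  same  <- outer(z, z, "==")[ut]
  same2 <- outer(zhat, zhat, "==")[ut]
  mean(same == same2)                        # agreements over all N(N-1)/2 pairs
}

# Median ratio of within- to between-community edge counts, as in (2.11).
mr_measure <- function(A, zhat) {
  k <- max(zhat)
  m <- matrix(0, k, k)
  for (a in 1:k) for (b in a:k) {
    m[a, b] <- if (a == b) sum(A[zhat == a, zhat == b]) / 2 else sum(A[zhat == a, zhat == b])
  }
  median(diag(m)) / median(m[upper.tri(m)])  # median(m_aa) / median(m_ab), a != b
}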

As a final simulation example, we analyze the performance of CL-BIC and BIC

for a growing number of communities under the degree-corrected blockmodel. While

the reparametrized vector θn = γnθ remains as in Simulation 4, the ωi are now

independently generated from Uniform(1/5, 9/5). The results are collected in Table

2.5, where we also record the performance of the SCORE algorithm under the true

K, along with the goodness-of-fit (GF) and median ratio (MR) performance measures

introduced in (2.10) and (2.11), respectively.

The true community number and community sizes grow as in the case for the

standard blockmodel described in Figure 2.1. Although CL-BIC performs uniformly

better than BIC across all validating criteria and throughout all K, the procedure

does not appear to yield model selection consistent results in this example. Aside

from the fact that the introduced correlation is not exponentially decaying, this poor


Table 2.5: Comparison of CL-BIC and BIC over 200 repetitions from the DCBM case in

Simulation 4, with (ρEq = 0.2,γn = 0.03), where the individual effect parameters ωi are

now generated from a Uniform(1/5, 9/5) distribution.

                                          CL-BIC                          BIC
K   Misc. R.  Orac. Err.  Est. Err.   PROP  MD  RSD   GF    MR       PROP  MD  RSD   GF    MR
2   0.02      0.51        0.54        0.88  1   0     0.96  7.33     0.10  2   0.75  0.73  3.37
3   0.03      0.53        0.55        0.93  1   0.75  0.97  7.87     0.09  3   1.49  0.88  3.91
4   0.03      0.55        0.58        0.86  1   0     0.97  8.11     0.16  3   1.49  0.91  5.41
5   0.04      0.58        0.62        0.56  1   0     0.96  6.35     0.09  3   2.05  0.92  5.92
6   0.05      0.60        0.64        0.47  1   1.49  0.96  7.10     0.09  2   2.24  0.94  7.08
7   0.05      0.63        0.66        0.39  1   1.49  0.97  6.77     0.09  3   1.49  0.95  6.73
8   0.08      0.63        0.66        0.29  1   1.49  0.97  7.24     0.02  3   2.24  0.96  6.80

NOTE: PROP, proportion; MD, median deviation; RSD, robust standard deviation; GF,

goodness-of-fit measure; MR: median ratio measure. Misc. R. denotes the miscluster-

ing rate of the SCORE algorithm. For Ω = Eθ(A), Orac. Err. and Est. Err. are

‖Ω_O − Ω‖/‖Ω‖ and ‖Ω_SC − Ω‖/‖Ω‖, respectively, where ‖·‖ denotes the Frobenius norm.
Here, Ω_O denotes the estimate of Ω under the oracle scenario where we know the true co-
mmunity assignment z ∈ {1, . . . , K}^N, and Ω_SC is the estimate of Ω using the SCORE
labeling vector.

performance as K increases can also be explained by the difficulty in estimating the

DCBM parameters (θ,ω) in a scenario where several vertices have potentially low

degrees. Indeed, even in the oracle scenario where we know the true community labels

zi ahead of time, and for a relatively small misclustering rate of the SCORE algorithm,

Table 2.5 exhibits the difficulty in obtaining accurate estimates (θC, ωC), and in

evaluating the CL-BIC criterion functions (2.8), under this increasing K scenario for

the DCBM. Whether the increased number of parameters in the DCBM has an effect

on the consistency results of CL-BIC as K increases is also an interesting line of future

work.


2.4.2 Real Data Analysis

2.4.2.1 International Trade Networks

We first study an international trade dataset originally analyzed in Westveld and

Hoff (2011), containing yearly international trade data between N = 58 countries

from 1981 to 2000. For a more detailed description of this dataset, we refer the

interested reader to the Appendix in Westveld and Hoff (2011). In our numerical

comparisons between CL-BIC and BIC paired with the standard stochastic block-

model log-likelihood (2.1), we focus on data from year 1995. For this network, an

adjacency matrix A can be formed by first considering a weight matrix W with

W_ij = Trade_{i,j} + Trade_{j,i}, where Trade_{i,j} denotes the value of exports from country
i to country j. Finally, we define A_ij = 1 if W_ij ≥ W_α, and A_ij = 0 otherwise;
here W_α denotes the α-th quantile of {W_ij}_{1≤i<j≤N}. For the choice of α = 0.5, Fig-

ure 2.2 shows the largest connected component of the resulting network. Panel (a)

shows CL-BIC selecting 3 communities, corresponding to countries with the highest

GDPs (dark blue), industrialized European and Asian countries with medium-level

GDPs (green), and developing countries in South America with the smallest GDPs

(yellow). Next, in panel (b) we also show the variational Bayes solution correspon-

ding to k = 7, providing finer communities for some Central and South American

neighboring countries (yellow and pink, respectively) but fragmenting the high- and

medium-level GDP countries into ambiguous communities. For instance, it is not

clear why countries like Bolivia and Nepal belong to the same community (orange) or

why the Netherlands, rather than Brazil or Italy, joined the community of countries

with the highest GDPs (light blue). At last, panel (c) corresponds to the final BIC

model selecting 10 communities. Under this partition, South American countries are

now split into 6 “noisy” communities, while high GDP countries are unnecessarily

fragmented into two.
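
For concreteness, the thresholding construction of the trade adjacency matrix described above might be coded as follows; trade is assumed to be the N × N matrix of export values (Trade_{i,j} in row i, column j), and the function name is ours.

# Binary trade network: W_ij = Trade_ij + Trade_ji, edge kept when W_ij exceeds the alpha-quantile.
trade_adjacency <- function(trade, alpha = 0.5) {
  W     <- trade + t(trade)
  w_cut <- quantile(W[upper.tri(W)], probs = alpha)   # alpha-th quantile of {W_ij}, i < j
  A     <- (W >= w_cut) * 1L
  diag(A) <- 0L
  A
}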

We believe CL-BIC provides a better model than traditional BIC, yielding commu-


[Figure 2.2 about here: three network plots of the largest connected component of the 1995
trade network, with countries as labeled nodes colored by estimated community. Panel (a):
k = 3 (CL-BIC-SBM); panel (b): k = 7 (VB-SBM); panel (c): k = 10 (BIC-SBM).]

Figure 2.2: (Color online) Largest connected component of the international trade network
for the year 1995.

nities with countries sharing similar GDP values without dividing an entire continent

into 6 smaller communities. On the contrary, BIC selects a model containing commu-

nities of size as small as one, which are of little, if any, practical use. The variational

Bayes approach provides a meaningful solution in this example, exhibiting a similar

performance as in Latouche et al. (2012) in terms of providing some finer community

assignments.


[Figure 2.3 about here: three network plots of the largest connected component of the school
friendship network, with students as numbered nodes colored by community. Panel (a): k = 6
(CL-BIC-DCBM); panel (b): k = 9 (BIC-DCBM); panel (c): "Truth" (K = 6).]

Figure 2.3: (Color online) Largest connected component of the school friendship network.
Panel (c) shows the "true" grade community labels: 7th (blue), 8th (yellow), 9th (green),
10th (purple), 11th (red), and 12th (black).

2.4.2.2 School Friendship Networks

Now, we consider a school friendship network obtained from the National Longitudinal

Study of Adolescent Health (http://www.cpc.unc.edu/projects/addhealth). For this

network, Aij = 1 if either student i or j reported a close friendship tie between the

two, and Aij = 0 otherwise. We focus on the network of school 7 from this dataset,

and our comparisons between CL-BIC and BIC are done with respect to the degree-


corrected blockmodel log-likelihood (2.2). With 433 vertices, Figure 2.3 shows the

largest connected component of the resulting network. As shown in panel (a), CL-

BIC selects the true community number K = 6, roughly agreeing with the actual

grade labels, except for the black community. BIC, shown in panel (b), selects 9

communities, unnecessarily splitting the 7th and 8th graders. The “true” friendship

network is shown in panel (c).

We still conclude CL-BIC performs better than traditional BIC. Except for the

misallocation of the black community of 12th graders, the model selected by CL-BIC

correctly labels most of the remaining network. While BIC partially separates the

10th graders and the 12th graders, a substantial portion of the 10th graders are ab-

sorbed into the 9th grader community (green). In addition, BIC further fragments

7th and 8th graders into “noisy” communities. This is an extremely difficult commu-

nity detection problem since, even for a “correctly” specified k = 6, SCORE fails

to assign all 12th graders to their corresponding true grade. The black community

selected by SCORE in panel (a) mainly corresponds to female students and hispanic

males, reflecting perhaps closer friendship ties among a subgroup of students recently

starting junior high school.

Using the “goodness-of-fit” measure defined in (2.10), we found that the CL-BIC

community assignment leads to GF (z, z6) = 0.811, which is slightly better than the

GF (z, z9) = 0.810 obtained for BIC. For the MR measure given in (2.11), the results

for CL-BIC and BIC are MR(z6) = 40.8 and MR(z9) = 33.3, respectively, again

indicating the superiority of the CL-BIC solution paired with SCORE.

In both examples, BIC tends to overestimate the “true” community number K,

rendering very small communities which are in turn penalized under the CL-BIC ap-

proach. This means CL-BIC successfully remedies the robustness issues brought in

by spectral clustering, due to the misspecification of the underlying stochastic block-

models, and effectively captures the loss of variance produced by using traditional

BIC.


2.5 Discussion

There has been a tremendous amount of research in recovering the underlying struc-

tures of network data, especially on the community detection problem. Most of the

existing work has focused on studying the properties of the stochastic blockmodel

and its variants without looking at the possible model misspecification problem. In

this chapter, under the standard stochastic blockmodel and its variants, we advocate

the use of composite likelihood BIC for selecting the number of communities due to

its simplicity in implementation and its robustness against correlated binary data.

Some extensions are possible. For instance, the proposed methodology in this work

is based on the spectral clustering and SCORE algorithms, and it would be interesting

to explore the combination of the CL-BIC with other community detection methods.

In addition, most examples considered here are dense graphs, which are common

but cannot exhaust all scenarios in real applications. Another open problem is to

study whether the CL-BIC approach is consistent for the degree-corrected stochastic

blockmodel; our numerical studies suggest this need not be the case.


Chapter 3

Kernel-Based Change-Point

Detection in Dynamic Stochastic

Blockmodels

The stochastic blockmodel and its variants have been one of the most widely used

approaches for spatial segmentation of static network data. As most complex phenom-

ena studied through statistical network analysis are dynamic in nature, static network

models fail to capture the relevant temporal features. In this chapter, we formulate

the segmentation of a time-varying network into temporal and spatial components by

means of a change-point detection hypothesis testing problem on dynamic stochastic

blockmodels. We propose a Kernel Change-Points (KCP) test statistic based on the

idea of data aggregation across the different temporal layers through kernel-weighted

adjacency matrices computed before and after each candidate change-point. We de-

rive the required theoretical framework and illustrate our anomaly detection approach

using a wide array of synthetic data sets. In addition, we apply our proposed KCP

methodology to the real-world Enron network, a dynamic social network of email

communications.


3.1 Introduction

The study of several complex systems having interacting units through network or

graph representations has emerged as a topic of great research interest in recent years.

Examples of modeled phenomena include online friendship interactions within social

networks, regulatory protein-protein interactions in systems biology, hyperlinks be-

tween webpages in information systems, and coauthorship within citation networks.

While various probabilistic models have been proposed for static networks (Gold-

enberg et al., 2010), which represent a single time snapshot of the phenomenon of

interest, many of these systems are dynamic in nature, meaning their evolving struc-

ture can be represented as a sequence of networks snapshots taken at discrete time

points over a common set of nodes. Among a myriad of research problems arising

with time-varying network data, in this chapter we tackle the problem of change-

point detection within dynamic stochastic blockmodels, where the goal is to find the

M discrete time snapshots at which a subset of vertices of the network alter their con-

nectivity patterns under a time-dependent version of the classic stochastic blockmodel

(Holland et al., 1983) with T observed time points. Our aim is thus the segmenta-

tion of dynamic networks into a temporal component, where the network generative

mechanism remains fixed between presumed change-points, and a spatial component,

in which we seek to partition the nodes of the network into tightly-knit communities

or clusters.

Significant progress has been made recently in generalizing static network mo-

dels such as the stochastic blockmodel and its mixed membership variant (Airoldi

et al., 2008) to their dynamic counterparts. Proposed extensions of the stochas-

tic blockmodel include the state-space model for dynamic networks of Xu and Hero

(2014), where the state evolution of the blockmodel parameters is assumed to follow

a stochastic dynamic system, and the multi-graph stochastic blockmodel of Han et al.

(2015), where network vertices share the same block structure over the multiple tem-

poral layers but class connection probabilities are allowed to vary. Under some mild


conditions on the sequence of block probabilities, Han et al. (2015) show a modified

version of the classic spectral clustering algorithm (Donath and Hoffman, 1973) to

be consistent at estimating the fixed block labels for a diverging number of snap-

shots. In the opposite setting of nodes changing their community membership over

time under fixed block probabilities, Ghasemian et al. (2015) develop detectability

thresholds for community detection in dynamic stochastic blockmodels as a function

of community strength and the rate of change of nodes to different communities. The

more general dynamic mixed membership stochastic blockmodel (Xing et al., 2010;

Ho et al., 2011) has also been proposed, where all block probabilities, membership

indicators and mixed membership vectors are assumed to follow a state-space model.

Aside from community detection in temporally evolving networks, researchers have

also focused on modeling dynamic relational network data in the presence of covari-

ates through longitudinal data structures (Westveld and Hoff, 2011) and continuous

time, point process models paired with partial likelihood inference procedures (Perry

and Wolfe, 2013). We refer interested readers to Aggarwal and Subbian (2014) for a

comprehensive survey in related time-dependent network methodology.

While some of the above models follow complex dynamics in their treatment

of temporally evolving networks, we favor the simpler approach of multiple-change

point detection, also known as temporal segmentation in time series analysis (Cho

and Fryzlewicz, 2015). Among a growing literature of methodology for this network

problem, the scan statistics approach for anomaly detection of Wang et al. (2014)

represents one of the few works where limiting distributions and power computations

are provided for the associated change-point hypothesis testing problem. The main

idea in their paper is to scan the observed network data over small time and spatial

windows, calculating some locality statistic or measure for each window, with the

computed scan statistic being the maximum of these locality statistics. Although

without obtaining explicit power calculations, the latent process model of Lee and

Priebe (2011) and the latent position model with an EM-type partition algorithm


for identifying multiple change-points of Robinson and Priebe (2015) provide a more

flexible change-point detection framework than in Wang et al. (2014). Recently,

Cribben and Yu (2015) tackled temporal segmentation of brain network dynamics

using principal angles between subspaces as a measure of local distance for adjacent

network snapshots. A common feature of all these change-point detection techniques

is the aim at sensing sudden changes in the local behavior of the connectivity patterns

of the network through a set of spatial or temporal measurement windows.

Motivated by the idea of capturing the local characteristics of dynamic networks,

and making use of the consistency result for spectral clustering on a mean adjacency

matrix (Han et al., 2015), showing the benefits of data aggregation for time-dependent

networks, in this chapter, we propose a kernel-based regularized distance statistic for

multiple change-point detection under dynamic stochastic blockmodels. While allow-

ing for changes both in the class connection probabilities and the node community

memberships, our approach first computes kernel-weighted adjacency matrices which

aggregate information across the temporal layers before and after each candidate

change-point. By the consistency result in Han et al. (2015), spectral clustering on

these weighted adjacency matrices provides a more precise estimation of the underly-

ing block structure in between presumed change-points. With the help of the profile

likelihood paradigm, we use the obtained community labels to compute estimated

expectation matrices of the aggregated networks before and after each candidate

change-point, and whose difference in the Frobenius norm yields a tailor-made test

statistic for the corresponding hypothesis testing problem. The resulting procedure,

Kernel Change-Points (KCP), is then tested on synthetic networks and the Enron

email data set, yielding a performance comparable to the scan statistic techniques

(Priebe et al., 2005; Wang et al., 2014) in terms of change-point selection.

The remainder of the chapter is organized as follows. In Section 3.2 we provide a

background on dynamic stochastic blockmodels, and introduce a rigorous definition

of our multiple change-point detection problem. The proposed Kernel Change-Point


methodology is developed in Section 3.3, with simulation examples and a real-world

network data analysis on the Enron email corpus carried out in Section 3.4. We

conclude the chapter with a short discussion in Section 3.5.

3.2 Background

We first introduce some notation and present a general overview of our time-varying

network data framework. Consider a temporal series of simple, undirected networks

G(1), . . . ,G(T ) over a fixed set of nodes V , where G(t) := (V,E(t)) represents the

network snapshot observed at time t = 1, . . . , T and E(t) denotes the corresponding

set of edges over N = |V | vertices. The corresponding sequence of N ×N symmetric

adjacency matrices is thus A(t)Tt=1; where, as usual, Aij(t) := 1 if (i, j) ∈ E(t), and

Aij(t) := 0 otherwise. Self-edges are not allowed, and so Aii(t)Ni=1 is fixed to zero

across all t values.

3.2.1 Dynamic Stochastic Blockmodels

The original, static stochastic blockmodel for a single time snapshot network G :=

(V,E), with N ×N adjacency matrix A, was first proposed by Holland et al. (1983).

The model posits that each node i ∈ V is associated with one and only one community
from {V_1, . . . , V_K}, with label C_i, where V = V_1 ⊔ · · · ⊔ V_K and C_i ∈ {1, . . . , K}.
Further, for all possible pairs (i, j), the model assumes the corresponding edge va-
riables {A_ij}_{i<j} are conditionally independent Bernoulli random variables given the
communities of their endpoints, i.e., P(A_ij = 1; C_i = c_i, C_j = c_j) = θ_{c_i c_j}. In other
words, a standard Stochastic Blockmodel (SBM) with K communities is parametrized
by a class connection probability matrix θ ∈ [0, 1]^{K×K}, where θ_ab denotes the proba-
bility of forming an edge between nodes from communities a and b, and a community
assignment matrix Z = (z_1, . . . , z_N)′ ∈ {0, 1}^{N×K}, where z_{i,k} = 1 if and only if c_i = k,
and z_{i,k} = 0 otherwise.


Under the Dynamic Stochastic Blockmodel (DSBM) framework proposed in this

chapter, we allow for nodes i ∈ V to change their community membership over time,

as well as let the class connection probability matrices vary across different layers. As

such, each network G(t) in the time series has its own community assignment matrix

Z(t) = (z_1(t), . . . , z_N(t))′ ∈ {0, 1}^{N×K}, with z_{i,k}(t) = 1 if and only if c_i(t) = k, and
z_{i,k}(t) = 0 otherwise. Moreover, the sequence of K × K class connection probability
matrices {θ(t)}_{t=1}^T satisfies

P{ A_ij(t) = 1; C_i(t) = c_i(t), C_j(t) = c_j(t) } = θ(t)_{c_i(t) c_j(t)}.    (3.1)

For the DSBM, we define the expectation matrix Ω(t) := E(A(t)). A few algebraic

steps show that this matrix can be expressed as the product

Ω(t) = Z(t)θ(t)Z(t)′, (3.2)

with diag(Ω(t)) := 0 for all t = 1, . . . , T . In a slight abuse of notation, we write

A(t) ∼ SBM(Ω(t)), (3.3)

to mean the individual network snapshot data A(t) follows a standard stochastic

blockmodel with fixed but unknown parameters θ(t) and Z(t) for each t = 1, . . . , T .

While not as general as the dynamic Mixed Membership Stochastic Blockmodel

from Xing et al. (2010), the DSBM proposed in equations (3.1) – (3.3) allows for

enough flexibility in modeling dynamic relational network data. As opposed to mo-

dels where the block membership is fixed from the initial observation time (Wang

et al., 2014; Han et al., 2015), or settings where the edges in E(t) are generated in-

dependently for each t according to the same class connection probability matrix θ

(Ghasemian et al., 2015), our DSBM framework above allows for time-evolving θ(t),

Z(t) or both. Since we aim at modeling various change-point phenomena throughout

the time series G(1), . . . ,G(T ), we expect (θ(t),Z(t)) to vary only at a limited

number of snapshots. In this regard, the DSBM setting presented in this chapter is

more stringent than the linear dynamic state-space model of Xu and Hero (2014).
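
As a small illustration of (3.1) – (3.3), one snapshot of the DSBM can be simulated in R as follows; this is a sketch under the stated parametrization, with function and argument names of our own choosing.

# Simulate one DSBM snapshot A(t) given Z (N x K binary assignment) and theta (K x K symmetric).
simulate_snapshot <- function(Z, theta) {
  Omega <- Z %*% theta %*% t(Z)                         # expectation matrix, as in (3.2)
  N     <- nrow(Omega)
  A     <- matrix(0L, N, N)
  ut    <- upper.tri(Omega)
  A[ut] <- rbinom(sum(ut), size = 1, prob = Omega[ut])  # independent Bernoulli edges for i < j
  A + t(A)                                              # symmetrize; diagonal remains zero
}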


3.2.2 Change-Point Model

Before estimating the parameters (θ(t),Z(t)) driving the network evolution, we are

first interested in detecting change-points, the discrete time snapshots at which the

underlying characteristics of the evolving network are altered. Following Lee and

Priebe (2011), we make the following definition of a change-point.

Definition 3.1. A fixed but unknown constant t* ∈ {1, 2, 3, . . .} ∪ {+∞} is a change-
point for the temporal series {G(t)}_{t≥1} if there exist distinct expectation matrices
Ω_0, Ω_A ∈ [0, 1]^{N×N} such that prior to time t*, the sequence A(t) follows the null
dynamics, i.e.,

A(t) ∼ SBM(Ω_0) for t ≤ t*;

and after time t*, the sequence A(t) starts following the alternative dynamics

A(t) ∼ SBM(Ω_A) for t > t*.

For any given time snapshot t ∈ N, the associated change-point hypothesis testing
problem is

H_0 : t* > t    versus    H_A : t* = t.    (3.4)

As a convention, if t* = +∞, then we assume the sequence {A(t)}_{t≥1} evolves
according to the null dynamics for all t. The above definition extends naturally to
the case of multiple change-points within the time series of graphs {G(t)}_{t≥1}; for sim-
plicity of the presentation, we omit the details of this definition. We do assume, however,
that there is an unknown number M of distinct change-points 1 ≤ t*_1 < · · · < t*_M ≤ +∞
in our observed data {A(t)}_{t≥1}. While in principle the above definition considers all
time snapshots as candidate change-points, in our numerical implementation we bar
any two adjacent time points from being labeled as such, a minimum-separation
requirement common to the literature on multiple change-point detection for
multivariate time series (Cho and Fryzlewicz, 2015). Modulo this consideration for
adjacent time snapshots, every


time point t is considered a candidate change-point for the hypothesis testing problem

(3.4). In the next section, we build upon the spectral clustering algorithm (Donath

and Hoffman, 1973) to construct a test statistic comparing a form of distance between

networks before and after each candidate change-point for the testing problem.

3.3 Change-Point Detection Methodology

3.3.1 Spectral Clustering and Parameter Estimation in SBM

Spectral clustering is a popular, computationally feasible algorithm for estimating the community structure $Z \in \{0,1\}^{N \times K}$ of the single network snapshot SBM. Proven to be consistent for sufficiently dense networks as $N \to \infty$ (Rohe et al., 2011), the algorithm can be derived as a convex relaxation (von Luxburg, 2007) of NP-hard assortative mixing criteria such as minimizing the normalized cut (Shi and Malik, 2000). The procedure, given in detail in Algorithm 3.1 for a generic weighted adjacency matrix $W \in [0,1]^{N \times N}$, is based on the eigenvalue decomposition of the normalized graph Laplacian L (von Luxburg, 2007) and on the K-means algorithm applied to the eigenvectors of L.

Following Choi et al. (2012), and for a fixed time snapshot in (3.1), the log-likelihood of observing an adjacency matrix A under the SBM is

$$\ell(\theta, c; A) := \sum_{i<j}\big[A_{ij}\log\theta_{c_i c_j} + (1 - A_{ij})\log(1 - \theta_{c_i c_j})\big], \qquad (3.5)$$

with $c \in \{1, \ldots, K\}^N$ being a community membership vector directly parametrizing Z. Considering the computational intractability of directly estimating both parameters (θ, c) by exact maximum likelihood fitting, we follow a profile likelihood approach paired with the spectral clustering relaxation to estimate the model parameters and the associated expectation matrix Ω. Denote by $N_a$ the size of community a, and by $n_{ab}$ the maximum number of possible edges between communities a and b, i.e., $n_{ab} := N_a N_b$ for $a \neq b$, and $n_{aa} := N_a(N_a - 1)/2$.


Algorithm 3.1 Spectral Clustering

1. Input: Weighted adjacency matrix $W \in [0,1]^{N \times N}$. Pre-specified number of communities K.

2. Compute the normalized graph Laplacian $L := D^{-1/2} W D^{-1/2}$, where $D_{ii} := d_i$ and $D_{ij} := 0$ for $i \neq j$. Here, $d_i$ is the degree of node i.

3. Find the eigenvectors $u_1, \ldots, u_K$ associated with the K eigenvalues of L that are largest in magnitude, forming an $N \times K$ matrix $U := (u_1, \ldots, u_K)$.

4. Apply the K-means algorithm to the rows of U and obtain estimated community labels $\hat{c} := (\hat{c}_1, \ldots, \hat{c}_N)$, and the estimated community assignment matrix $\hat{Z} := (\hat{z}_1, \ldots, \hat{z}_N)' \in \{0,1\}^{N \times K}$; with $\hat{z}_{i,k} = 1$ if and only if $\hat{c}_i = k$, and $\hat{z}_{i,k} = 0$ otherwise.

5. Output: $(\hat{c}, \hat{Z}) \in \mathbb{R}^N \otimes \mathbb{R}^{N \times K}$.

For a fixed community assignment $c \in \{1, \ldots, K\}^N$, let $m_{ab} := \sum_{i<j} A_{ij}\,\mathbf{1}\{c_i = a,\, c_j = b\}$, so that the MLE of $\theta_{ab}$ in (3.5) is given by $\hat{\theta}_{ab} := m_{ab}/n_{ab}$. Instead of using a Gibbs sampler to explore the space of candidate community assignments, we resort to spectral clustering to obtain an estimated $\hat{Z}$, and thus define

$$\hat{\Omega} := \hat{\Omega}(A) = \hat{Z}\,\hat{\theta}\,\hat{Z}', \qquad (3.6)$$

the estimated expectation matrix for a fixed time snapshot SBM with adjacency

matrix A.
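To make the estimation pipeline concrete, a minimal R sketch of Algorithm 3.1 combined with the profile likelihood estimate (3.6) is given below. The function names spectral_cluster and estimate_omega are our own shorthand for this illustration (they are not part of any released package), and the small numerical safeguards, such as the guard for isolated nodes and the number of K-means restarts, are choices made for the sketch rather than prescriptions from the text.

# Minimal sketch of Algorithm 3.1 and of the estimate in (3.6); W is an
# undirected, weighted adjacency matrix in [0,1]^{N x N} with zero diagonal.
spectral_cluster <- function(W, K) {
  d <- rowSums(W)
  d[d == 0] <- 1e-10                       # guard against isolated nodes
  L <- diag(1 / sqrt(d)) %*% W %*% diag(1 / sqrt(d))
  eig <- eigen(L, symmetric = TRUE)
  idx <- order(abs(eig$values), decreasing = TRUE)[1:K]
  U <- eig$vectors[, idx, drop = FALSE]    # K eigenvectors largest in magnitude
  kmeans(U, centers = K, nstart = 20)$cluster
}

estimate_omega <- function(W, K) {
  N <- nrow(W)
  c_hat <- spectral_cluster(W, K)          # assumes every estimated community is non-empty
  Z <- 1 * outer(c_hat, 1:K, "==")         # N x K estimated membership matrix
  m <- t(Z) %*% W %*% Z                    # (weighted) edge totals between blocks
  n_ab <- t(Z) %*% (1 - diag(N)) %*% Z     # numbers of possible edges, no self-loops
  # Within-block entries double count pairs in both m and n_ab, so the entrywise
  # ratio below coincides with the profile MLE theta_hat_ab = m_ab / n_ab.
  theta_hat <- m / n_ab
  omega_hat <- Z %*% theta_hat %*% t(Z)
  diag(omega_hat) <- 0
  list(communities = c_hat, theta = theta_hat, omega = omega_hat)
}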

3.3.2 Algorithm Description and Rationale

3.3.2.1 Motivation

As our main goal is multiple change-point detection under the dynamic stochastic blockmodel setting for $\{A(t)\}_{t=1}^T$, a natural question is how one can extend the spectral clustering algorithm for the SBM to discover community structure in the DSBM, and how one can use the aggregated information in $\{G(t)\}_{t=1}^T$ to compute a test statistic for problem (3.4) using a regularized distance between networks before and after


each candidate change-point. Bazzi et al. (2015) argue that two main approaches have been followed to detect communities in temporally evolving networks. The first involves constructing a single static network by aggregating information from the different snapshots into a mean graph, or a weighted average graph considering the total edge weight for each edge across time points $t = 1, \ldots, T$, and then using standard techniques such as spectral clustering on this aggregated network (Han et al., 2015). The second approach involves using standard community detection techniques to infer Z(t) independently on each network snapshot, and then tracking the communities across the sequence (Macon et al., 2012; Fenn et al., 2012) or averaging them by means of majority voting. Han et al. (2015) show the effectiveness of the former approach by proving, under some identifiability conditions on $\{\theta(t)\}_{t=1}^T$, a consistency result for spectral clustering on the mean adjacency matrix

$$\bar{A} := \frac{1}{T}\sum_{t=1}^{T} A(t) \qquad (3.7)$$

as the number of snapshots $T \to \infty$ with N fixed. Moreover, their paper also finds that the majority voting scheme underperforms spectral clustering on the mean graph in instances where the SBM on a single time snapshot is below the detectability limit developed in Decelle et al. (2011), thus showing that this heuristic has difficulty learning the underlying community structure from the multiple temporal layers.

Motivated by these results favoring data aggregation for discovering community

structure within DSBM, and keeping in mind the necessity of specifying a metric

comparing distances between adjacent networks for each candidate change-point, we

define the following sequences of weighted before and after adjacency matrices:

$$B(t) := \frac{\sum_{i=1}^{t-1} K_{t-1,i,h}\, A(i)}{\sum_{i=1}^{t-1} K_{t-1,i,h}} \qquad \text{and} \qquad A(t) := \frac{\sum_{i=t}^{T} K_{t,i,h}\, A(i)}{\sum_{i=t}^{T} K_{t,i,h}}, \qquad (3.8)$$

for $t = 2, \ldots, T$, where $K_{t,i,h} := K(|t - i|/h)$, $K(\cdot)$ is a Gaussian kernel, and h is a kernel bandwidth. Spectral clustering on B(t) helps unveil the community structure before each candidate t, and together with the profile likelihood approach of Section


3.3.1, it provides an estimate $\hat{\Omega}(B(t))$ of the parameters driving the evolution of the dynamic stochastic blockmodel just prior to time snapshot t. Similarly, through the same approach outlined in equation (3.6), A(t) provides an estimate $\hat{\Omega}(A(t))$ of the DSBM parameters for snapshot t and afterwards.
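For concreteness, the kernel-weighted aggregation in (3.8) can be sketched in a few lines of R; here A_list stands for the list of observed adjacency matrices A(1), ..., A(T), the standard Gaussian density shape is used for K(·) (an assumption on our part, since only "a Gaussian kernel" is specified), and the helper names are again our own.

# Gaussian kernel weights K_{t,i,h} := K(|t - i| / h), as used in (3.8).
gauss_kernel <- function(u) exp(-u^2 / 2)

# Weighted "before" and "after" adjacency matrices B(t) and A(t); valid for t = 2, ..., T.
weighted_before_after <- function(A_list, t, h) {
  T_len <- length(A_list)
  wB <- gauss_kernel(abs((t - 1) - (1:(t - 1))) / h)
  B_t <- Reduce(`+`, Map(`*`, wB, A_list[1:(t - 1)])) / sum(wB)
  wA <- gauss_kernel(abs(t - (t:T_len)) / h)
  A_t <- Reduce(`+`, Map(`*`, wA, A_list[t:T_len])) / sum(wA)
  list(before = B_t, after = A_t)
}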

Just as in the case of univariate kernel density estimation (Silverman, 1986), the choice of bandwidth h poses a natural bias-variance tradeoff. A small value of h in (3.8) leads to under-smoothing, potentially yielding estimates $(\hat{\Omega}(B(t)), \hat{\Omega}(A(t)))$ that show features in the dynamic network data that are not really present (e.g., spurious change-points), whereas a large value of h leads to over-smoothing, possibly resulting in overlooking important features present in $\{A(t)\}_{t=1}^T$. Given our goal of detecting

multiple change-points within DSBM data, the choice of bandwidth h will play a

crucial role in evaluating the deviance between adjacent networks before and after a

given candidate change-point.

3.3.2.2 Test Statistic d(t)

For a candidate change-point $t \in \{2, \ldots, T\}$ under the observed temporal series $\{A(t)\}_{t=1}^T$, we define the following test statistic for the hypothesis testing problem (3.4):

$$d(t) := \big\|\hat{\Omega}(B(t)) - \hat{\Omega}(A(t))\big\|_F, \qquad (3.9)$$

where $(\hat{\Omega}(B(t)), \hat{\Omega}(A(t)))$ are the estimated expectation matrices for the weighted adjacency matrices given in (3.8), and $\|\cdot\|_F$ denotes the usual Frobenius norm. Naturally, one would reject the null hypothesis that the candidate t is not a change-point for large observed values of d(t).
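Combining the helpers sketched earlier in this chapter, the statistic in (3.9) for a single candidate t reduces to a few lines of R; this is again an illustrative sketch under our own naming conventions.

# Regularized distance d(t) of (3.9) for one candidate change-point t.
kcp_statistic <- function(A_list, t, K, h) {
  w <- weighted_before_after(A_list, t, h)        # B(t) and A(t) from (3.8)
  omega_before <- estimate_omega(w$before, K)$omega
  omega_after  <- estimate_omega(w$after,  K)$omega
  sqrt(sum((omega_before - omega_after)^2))       # Frobenius norm of the difference
}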

The procedure for determining whether a candidate t is indeed a change-point,

described in Algorithm 3.2 below, incorporates the following aspects of the observed

data $\{A(t)\}_{t=1}^T$ and the DSBM defined in equations (3.1) – (3.3). Firstly, as a result

of our flexible dynamic stochastic blockmodel framework, the test statistic d(t) is

able to detect structural changes in θ(t), Z(t) or both. Our approach is thus more


Algorithm 3.2 Kernel Change-Points (KCP)

1. Input: Observed temporal series of adjacency matrices $\{A(t)\}_{t=1}^T$. Pre-specified number of communities K and significance level α.

2. For every candidate change-point $t \in \{2, \ldots, T\}$:

(a) Compute the weighted before and after adjacency matrices B(t), A(t) from (3.8).

(b) For the provided K, apply the spectral clustering algorithm on B(t) and A(t).

(c) Follow the profile likelihood approach to produce estimates $\hat{\Omega}(B(t))$, $\hat{\Omega}(A(t))$ of the corresponding expectation matrices.

(d) Calculate the regularized distance, i.e., the test statistic d(t) from (3.9).

(e) Define δ(t) := 1 if $d(t) \geq c_\alpha(t)$, the upper αth quantile of the distribution of d(t) under the null hypothesis $H_0$; otherwise, let δ(t) := 0.

3. Output: The estimated change-points $\{t : t \in \{2, \ldots, T\},\ \delta(t) = 1\}$ of the dynamic network.

flexible than the scan statistics procedures presented in Wang et al. (2014), where

the proposed methodology only deals with the anomaly detection problem in which

only one diagonal entry from θ(t) increases in value at the presumed change-point.

Secondly, given the consistency result of spectral clustering on the mean adjacency matrix (3.7) presented in Han et al. (2015), and considering the flexibility provided by the Gaussian kernel family $K_{t,i,h}$, our proposed test statistic d(t) both successfully aggregates information from the different snapshots to produce more accurate estimates $(\hat{\theta}(t), \hat{Z}(t))$, and provides a regularized distance between adjacent networks which is essential for detecting multiple change-points along $\{G(t)\}_{t=1}^T$.

Dynamic relational network data represent complex phenomena. Through our test statistic (3.9), we aim to construct a simple procedure capable of detecting


time-varying structure in real-world network systems, effectively segmenting the observed data $\{A(t)\}_{t=1}^T$ into different behavioral patterns. If the desired network analysis requires attention to more subtle features that emerge from one snapshot to the next, then a change-point model like the one proposed in this chapter does not seem appropriate; the state-space models of Xing et al. (2010) and Xu and Hero (2014) are more fitting in this situation.

3.3.3 Implementation Details

In this subsection, we clarify several implementation details related to Algorithm 3.2

when applied to real-world dynamic network data sets.

3.3.3.1 Selecting the Number of Communities K

From our definition of the dynamic stochastic blockmodel in equations (3.1) – (3.3),

we have treated the number of communities K as a pre-specified model input. While

the problem of selecting the true community number K remains largely unexplored

in the DSBM, some recent progress has been made in the case of static stochastic

blockmodels (Chen and Lei, 2014; Saldana et al., 2015). We offer three main alternatives to specify K in Algorithm 3.2. The first one involves applying strategies such as the composite likelihood BIC approach of Saldana et al. (2015), or the network cross-validation of Chen and Lei (2014), to the mean adjacency matrix (3.7) in order to determine a plausible choice for K. While nodes are allowed to change communities in our DSBM framework, the input K remains fixed throughout the temporal series, so aggregation is only expected to improve the ability of these methods to correctly estimate the total number of communities.

The second approach, proposed in Cribben and Yu (2015), consists of specifying an over-estimated value for the true K. Since an under-estimated value misses important directions in the spectral clustering algorithm, the authors show this approach yields better change-point detection results than the correct K for their provided


simulation settings.

Although more computationally demanding, a third alternative is to provide a

different value for K at each snapshot. While our DSBM framework (3.1) – (3.3) does

not explicitly model a change in the number of communities, this approach provides

a slight generalization from our initial assumptions, in which we seek to detect a

change-point if the number of available communities changes from one snapshot to

the next. Existing methods like the ones described above for estimating the true K

at fixed snapshots typically require additional computational resources, thus posing

a trade-off between modeling flexibility and computation time.

3.3.3.2 Candidate Change-Points

As mentioned in Section 3.2.2, we bar any adjacent time points from being labeled as change-points. Thus, whenever the null hypothesis in (3.4) is rejected for three consecutive candidate snapshots (i.e., δ(t) = 1 in Algorithm 3.2), we take the one with the maximum value of d(t) as the true change-point, disregarding the other two candidates. Additionally, in order to improve the finite sample behavior of the weighted adjacency matrices B(t) and A(t), we exclude the first and last $t_{\min}$ snapshots as candidate change-points. In our simulation settings and real-data analysis, we choose $t_{\min} = 5$ and 10, respectively, for simplicity.

3.3.3.3 Choice of Bandwidth h

Following Silverman's rule of thumb (Silverman, 1986), we select the kernel bandwidth as $h^* = 1.06 \times T^{-1/5}$. Although neither B(t) nor A(t) in (3.8) is calculated with exactly T data points, nor is the sequence $\{A(t)\}_{t=1}^T$ independent and identically distributed, we find that this choice of $h^*$ performs favorably in the numerical experiments of the next section.
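As a quick numerical check, for the T = 50 snapshots used in the simulations of the next section this rule gives h* ≈ 0.48 after rounding, so the bandwidths 0.24, 0.48 and 0.96 considered there correspond to 0.5h*, h* and 2h*.

T_len <- 50                       # number of snapshots in Section 3.4.1
h_star <- 1.06 * T_len^(-1/5)     # Silverman-type rule of thumb
round(h_star, 2)
# [1] 0.48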


3.3.3.4 Quantile cα(t) Computation

The test statistic d(t) in (3.9) depends on the results of the spectral clustering algorithm applied to B(t) and A(t), on the profile maximum likelihood estimators, as well as on the number and location of the true change-points of the temporal series $\{G(t)\}_{t \geq 1}$. As such, exact finite-sample or asymptotic results on the behavior of d(t) are difficult to come by, thus requiring the use of Monte Carlo replicates or bootstrap schemes (Cribben and Yu, 2015) to conduct the change-point inferential procedures.

Our data-driven approach to estimate the quantile $c_\alpha(t)$ of Algorithm 3.2 works as follows. We first note that $\hat{\Omega}(B(t))$ and $\hat{\Omega}(A(t))$ both parametrize fixed time snapshot stochastic blockmodels from which we can simulate independent and identically distributed sequences of adjacency matrices $\{B(\tau)\}_{\tau=1}^T$ and $\{A(\tau)\}_{\tau=1}^T$, respectively. Note that, as opposed to our observed temporal series $\{A(t)\}_{t=1}^T$, there are no change-points present in either of these two simulated sequences; therefore, their properties are crucial in describing the behavior of d(t) under the null hypothesis of no change-point at t. Additionally, for the simulated sequences $\{B(\tau)\}_{\tau=1}^T$ and $\{A(\tau)\}_{\tau=1}^T$, the associated before and after weighted adjacency matrices can be computed, giving rise to regularized distances $d^{(1)}(t)$ and $d^{(2)}(t)$ for each candidate change-point. After following this approach for q Monte Carlo replicates, we are able to compute the corresponding upper αth quantiles $c^{(1)}_\alpha(t)$ and $c^{(2)}_\alpha(t)$ for each distance sequence, thus yielding our estimated $c_\alpha(t)$ as:

$$c_\alpha(t) := \max\big\{c^{(1)}_\alpha(t),\ c^{(2)}_\alpha(t)\big\}. \qquad (3.10)$$

We summarize this fully data-driven procedure in Algorithm 3.3 below.
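A compact R sketch of this Monte Carlo scheme, reusing the kcp_statistic helper sketched earlier and drawing each null sequence from the fitted blockmodels, might look as follows; the symmetric Bernoulli sampling of the upper triangle matches the SBM of Section 3.2, while the function names remain our own.

# Sketch of the quantile computation: simulate q null sequences from each fitted
# SBM and take the larger of the two upper-alpha quantiles, as in (3.10).
kcp_null_quantile <- function(A_list, t, K, h, alpha, q = 50) {
  T_len <- length(A_list)
  N <- nrow(A_list[[1]])
  w <- weighted_before_after(A_list, t, h)
  omega_B <- estimate_omega(w$before, K)$omega
  omega_A <- estimate_omega(w$after,  K)$omega
  sim_sequence <- function(omega) {
    lapply(1:T_len, function(i) {
      A <- matrix(0, N, N)
      upper <- upper.tri(A)
      A[upper] <- rbinom(sum(upper), 1, omega[upper])
      A + t(A)                                    # symmetric adjacency, zero diagonal
    })
  }
  d1 <- replicate(q, kcp_statistic(sim_sequence(omega_B), t, K, h))
  d2 <- replicate(q, kcp_statistic(sim_sequence(omega_A), t, K, h))
  max(quantile(d1, 1 - alpha), quantile(d2, 1 - alpha))
}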

3.4 Experiments

In this section, we present a comprehensive numerical study of the temporal segmentation properties of our Kernel Change-Points (KCP) detection approach for different choices of kernel bandwidth. We perform our experiments under two major synthetic


Algorithm 3.3 Quantile $c_\alpha(t)$ Computation

1. Input: Observed temporal series of adjacency matrices $\{A(t)\}_{t=1}^T$. Significance level α.

2. For every candidate change-point $t \in \{2, \ldots, T\}$:

(a) Compute the estimated expectation matrices $\hat{\Omega}(B(t))$, $\hat{\Omega}(A(t))$ based on B(t), A(t).

(b) For $i = 1, \ldots, q$ Monte Carlo replicates:

(i) Generate $B^{(i)}_1, \ldots, B^{(i)}_T \sim \mathrm{SBM}(\hat{\Omega}(B(t)))$ and $A^{(i)}_1, \ldots, A^{(i)}_T \sim \mathrm{SBM}(\hat{\Omega}(A(t)))$.

(ii) Calculate the weighted adjacency matrices $B^{(i)}(t)$ and $A^{(i)}(t)$ based on (3.8).

(iii) Compute the associated regularized distances $d^{(1)}_i(t)$ and $d^{(2)}_i(t)$.

(c) Define $c^{(1)}_\alpha(t)$ and $c^{(2)}_\alpha(t)$ as the upper αth quantiles of each distance sequence.

3. Output: The estimated upper αth quantile $c_\alpha(t)$ of d(t) under $H_0$ as given in (3.10).

network structures for the underlying dynamic stochastic blockmodels, and compare

our methodology with the scan statistics approach of Wang et al. (2014) for the Enron

email data set (Priebe et al., 2005).

3.4.1 Simulations

We study two different examples of network structure generating mechanisms, where for the observed data $\{A(t)\}_{t=1}^T$, we consider a time-evolving θ(t) while keeping Z(t)

fixed, as well as the opposite scenario of θ(t) fixed with time-varying Z(t). We

demonstrate the robustness of our approach by covering both sparse and non-sparse

network settings.

In each experiment, carried out over 100 randomly generated dynamic stochastic blockmodels for the temporal series $\{A(t)\}_{t=1}^T$, we fix the number of network snapshots at T = 50 and introduce M = 3 distinct change-points at locations $(t^*_1, t^*_2, t^*_3) = (15, 30, 40)$. While the number of available vertices N = 900 and the number of communities K = 3 are fixed along the sequence $\{A(t)\}_{t=1}^T$, the community sizes are allowed to vary only in our second experiment. Following Algorithm 3.3, for each candidate change-point $t \in \{2, \ldots, 50\}$, we generate q = 50 Monte Carlo replicates to compute the estimated quantile $c_\alpha(t)$ given in (3.10).

Simulation 1: In this fixed Z(t) := Z scenario, community sizes are 300, 300 and 300. Accordingly, we generate $\{A(t)\}_{t=1}^{50}$ following a dynamic stochastic blockmodel with expectation matrix $\Omega(t) = Z\theta(t)Z'$. At times $(t^*_1, t^*_2, t^*_3) = (15, 30, 40)$, we introduce change-points that modify the underlying structure of the temporally evolving networks according to

$$\theta(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.02 \end{pmatrix} \ \text{for } t \leq t^*_1, \qquad \theta(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.04 \end{pmatrix} \ \text{for } t^*_1 < t \leq t^*_2,$$

and

$$\theta(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.05 \end{pmatrix} \ \text{for } t^*_2 < t \leq t^*_3, \qquad \theta(t) = \begin{pmatrix} 0.02 & 0.01 & 0.01 \\ 0.01 & 0.02 & 0.01 \\ 0.01 & 0.01 & 0.07 \end{pmatrix} \ \text{for } t > t^*_3.$$
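The Simulation 1 sequence can be generated with a short R script such as the one below; the block sizes, change-point locations and θ(t) matrices are exactly those stated above, while the helper name dsbm_snapshot is our own.

# Generate the Simulation 1 data: N = 900, K = 3, T = 50, equal block sizes,
# with the (3,3) entry of theta(t) jumping at t = 15, 30 and 40.
sizes <- c(300, 300, 300)
Z <- 1 * outer(rep(1:3, times = sizes), 1:3, "==")      # fixed membership matrix

theta_at <- function(t) {
  theta <- matrix(0.01, 3, 3)
  diag(theta) <- 0.02
  theta[3, 3] <- if (t <= 15) 0.02 else if (t <= 30) 0.04 else if (t <= 40) 0.05 else 0.07
  theta
}

dsbm_snapshot <- function(Z, theta) {
  omega <- Z %*% theta %*% t(Z)
  N <- nrow(omega)
  A <- matrix(0, N, N)
  upper <- upper.tri(A)
  A[upper] <- rbinom(sum(upper), 1, omega[upper])
  A + t(A)
}

A_list <- lapply(1:50, function(t) dsbm_snapshot(Z, theta_at(t)))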

Simulation 2: In this fixed θ(t) := θ setting, the time series $\{A(t)\}_{t=1}^{50}$ is generated according to the community assignments

$$Z(t) = \begin{pmatrix} 300 \\ 300 \\ 300 \end{pmatrix} \ \text{for } t \leq t^*_1, \qquad Z(t) = \begin{pmatrix} 295 \\ 295 \\ 310 \end{pmatrix} \ \text{for } t^*_1 < t \leq t^*_2,$$

and

$$Z(t) = \begin{pmatrix} 285 \\ 285 \\ 330 \end{pmatrix} \ \text{for } t^*_2 < t \leq t^*_3, \qquad Z(t) = \begin{pmatrix} 250 \\ 250 \\ 400 \end{pmatrix} \ \text{for } t > t^*_3,$$


Table 3.1: Performance of KCP over 100 repetitions from Simulations 1 and 2. Results are given in the form of mean (standard deviation).

Conf. level (1 - α)   h      Simulation 1 TP   Simulation 1 FP   Simulation 2 TP   Simulation 2 FP
90%                   0.24   2.60 (0.57)       4.76 (1.36)       3.00 (0.00)       1.62 (1.05)
90%                   0.48   2.87 (0.34)       3.10 (1.40)       3.00 (0.00)       1.99 (1.31)
90%                   0.96   2.88 (0.33)       1.65 (1.25)       3.00 (0.00)       1.89 (1.23)
95%                   0.24   2.56 (0.57)       4.08 (1.31)       3.00 (0.00)       1.00 (0.86)
95%                   0.48   2.82 (0.39)       2.18 (1.23)       3.00 (0.00)       1.09 (1.15)
95%                   0.96   2.83 (0.38)       0.90 (0.99)       3.00 (0.00)       0.98 (1.01)
99%                   0.24   2.41 (0.59)       2.91 (1.37)       3.00 (0.00)       0.50 (0.63)
99%                   0.48   2.72 (0.45)       1.16 (1.01)       3.00 (0.00)       0.28 (0.59)
99%                   0.96   2.67 (0.47)       0.34 (0.64)       3.00 (0.00)       0.34 (0.62)

NOTE: TP, true positives; FP, false positives; α, significance level; h, choice of bandwidth.

where, in a slight abuse of notation, we mean that the third community in the time series $\{G(t)\}_{t=1}^{50}$ increases its size to 310, 330 and 400 at the designated change-points $(t^*_1, t^*_2, t^*_3) = (15, 30, 40)$, respectively. The class connection probability matrix is set to be $\theta_{aa} = 0.15$ for a = 1, 2, 3 and $\theta_{ab} = 0.05$ for $1 \leq a < b \leq 3$.

To simplify matters, for purposes of change-point detection in this simulation study, we compute the KCP test statistic (3.9) under a correctly specified choice of K = 3. Additionally, we assess the robustness of our approach to the choice of bandwidth by considering h = 0.5h*, h*, and 2h*, where h* is the Silverman rule of thumb discussed in Section 3.3.3. Along with the mean values of d(t) for each candidate change-point, we plot in Figure 3.1 the mean estimated quantile $c_\alpha(t)$ at each location, thus depicting the average rejection values for the test statistics along the time-evolving networks across all simulations. Results are summarized in Table 3.1 above.


As we can see from the true positive (TP) and false positive (FP) rates collected, KCP performs reasonably well in both scenarios. In the "chatty" group setting from Simulation 1, where the third community increases its network communication pattern to an abnormal level, the highest true positives and smallest false positives for α = 0.05 occur at bandwidth h = 0.96. While it may appear that KCP straightforwardly benefits from network data aggregation by simply increasing the bandwidth h, the implicit bias-variance tradeoff is already evident in Table 3.1 when the confidence level is at 99%. As shown in Figure 3.1, the change-point at $t^*_2 = 30$ is considerably harder to detect in this scenario, contributing to the vast majority of missed true positives in Table 3.1. Nevertheless, and in spite of the poor performance of the spectral clustering algorithm on sparse networks like the one in Simulation 1 (Qin and Rohe, 2013), these results put KCP forward as a valuable new tool for change-point detection in dynamic network data.

The right panels of Figure 3.1 show the values of our test statistic d(t) easily exceeding the estimated upper αth quantiles $c_\alpha(t)$ at the change-point locations $t = t^*_1, t^*_2, t^*_3$, thus shifting the analysis of Simulation 2 exclusively to false positive rates. This is confirmed by Table 3.1, where, under this setting, the choice of α = 0.01 naturally leads to the smallest false positive rates. Again, due to the bias-variance tradeoff, the optimal choice of h = 0.48 yields the smallest number of false positives across all randomly generated dynamic stochastic blockmodels. While the performance of KCP is again fairly accurate at detecting change-points, panel (f) of Figure 3.1 underscores the importance of our interpretation of Definition 3.1, which forbids two adjacent time points from being labeled as change-points, effectively reducing the false positive rate by discarding neighboring points which also exceed the estimated quantile $c_\alpha(t)$. Lastly, we would like to point out that, in an effort to measure the false positive rate of KCP when there are no change-points in the time series $\{A(t)\}_{t=1}^{50}$, we repeated Simulation 1 with $\theta_{aa}(t) = 0.02$ for a = 1, 2, 3 and $\theta_{ab}(t) = 0.01$ for $1 \leq a < b \leq 3$, for all $t = 1, \ldots, 50$. For the choice of h = 0.48, only 1 out of 100 repetitions delivered


candidate change-points with the rest correctly yielding zero false positives.

Although our “chatty” setting of Simulation 1 is inspired by the work of Wang

et al. (2014), we decided not to compare our KCP procedure with their scan statistics

approach as their asymptotic power derivations are not easily extended to our choices

of (θ(t),Z(t)). Furthermore, their theoretical framework does not cover the case of

a fixed θ(t) with time-varying Z(t) presented here in Simulation 2.

3.4.2 Real Data Analysis

For our real data experiment, we use the Enron email corpus described in Priebe et al. (2005); Wang et al. (2014). The data set collects a time series of networks $\{G(t)\}_{t=1}^T$ with N = 184 vertices taken over a period of T = 189 weeks from 1998 through 2002. The vertices consist mostly of Enron executives; for each network, $A_{ij}(t) = 1$ if employee i sends at least one email message to employee j (or vice versa) during the one-week period t, and $A_{ij}(t) = 0$ otherwise. We refer interested readers to Priebe et al. (2005) for a more detailed description of this email corpus.

For purposes of change-point detection on this data set, Wang et al. (2014) rely on the idea of scan statistics (Glaz et al., 2001), which consist of vertex-dependent normalizations and temporal normalizations of locality statistics. For a given network $G := (V, E)$ with $V' \subset V$, denote by $H(V', G)$ the subgraph of G induced by $V'$. Given a fixed time snapshot t, and for all $k \geq 0$ and $v \in V$, Wang et al. (2014) define a first locality statistic as

$$\Psi_{t;k}(v) = \big|E\big(H(N_k(v; G(t)),\, G(t))\big)\big|.$$

More specifically, the statistic $\Psi_{t;k}(v)$ counts the number of edges in the subgraph of G(t) induced by $N_k(v; G(t))$, the subset of vertices u at a distance at most k from v in G(t). Additionally, for given snapshots t and $t'$ with $t' \leq t$, Wang et al. (2014) also define the locality statistic

$$\Phi_{t,t';k}(v) = \big|E\big(H(N_k(v; G(t)),\, G(t'))\big)\big|.$$


[Figure 3.1 appears here. Panels (a), (c), (e): Simulation 1 (time-evolving θ(t)) with h = 0.24, 0.48, 0.96; panels (b), (d), (f): Simulation 2 (time-evolving Z(t)) with h = 0.24, 0.48, 0.96. Each panel plots the test statistic d(t) against the time snapshot, together with the 90%, 95%, and 99% confidence levels.]

Figure 3.1: (Color online) Mean values for the test statistic d(t) and the estimated quantiles $c_\alpha(t)$ for each candidate change-point in Simulations 1 and 2, where the choices of α = 0.10 (green), α = 0.05 (blue), and α = 0.01 (red) are considered.


In this case, the locality statistic $\Phi_{t,t';k}(v)$ counts the number of edges in the subgraph of $G(t')$ induced by $N_k(v; G(t))$. After a fixed number of $\tau \geq 0$ vertex-dependent normalizations and $\lambda \geq 0$ temporal normalizations, the resulting scan statistics $S_{\tau,\lambda,k}(t; \Psi)$ and $S_{\tau,\lambda,k}(t; \Phi)$ are employed by Wang et al. (2014) as a tool for change-point detection. For some additional work on scan statistics, see the papers Priebe et al. (2005, 2010) and references therein.
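For readers who wish to reproduce the locality statistics, the following R sketch based on the igraph package computes Ψ and Φ for a single vertex; it is our own illustration of the definitions above and not code from Wang et al. (2014).

library(igraph)

# Psi_{t;k}(v): edges of G(t) induced by the k-neighborhood N_k(v; G(t)).
locality_psi <- function(A_t, v, k) {
  g_t <- graph_from_adjacency_matrix(A_t, mode = "undirected")
  nbhd <- as.integer(ego(g_t, order = k, nodes = v)[[1]])
  ecount(induced_subgraph(g_t, nbhd))
}

# Phi_{t,t';k}(v): edges of G(t') induced by the k-neighborhood taken in G(t).
locality_phi <- function(A_t, A_tprime, v, k) {
  g_t      <- graph_from_adjacency_matrix(A_t, mode = "undirected")
  g_tprime <- graph_from_adjacency_matrix(A_tprime, mode = "undirected")
  nbhd <- as.integer(ego(g_t, order = k, nodes = v)[[1]])
  ecount(induced_subgraph(g_tprime, nbhd))
}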

We compare the performance of the scan statistic approach with our KCP methodology in Figure 3.2 below. After discarding the first 40 weeks due to the τ = λ = 20 normalization steps, we analyze the remaining period of 149 weeks from August 1999 to June 2002 for different values of k in the scan statistics $S_{\tau,\lambda,k}(t; \cdot)$, as well as different choices of kernel bandwidth h for the KCP statistic d(t). For our KCP methodology, in order to avoid cluttered displays, we only present results for a fixed K = 2 and kernel bandwidths h = h*, h* + 0.09, and h* + 2 · 0.09. Following Wang et al. (2014), change-point detections for scan statistics are declared whenever $S_{\tau,\lambda,k}(t; \cdot) > 5$. For purposes of this numerical experiment, we select α = 0.01 in our KCP approach and plot the estimated upper αth quantile $c_\alpha(t)$ based on q = 200 Monte Carlo replicates.

The red circles in Figure 3.2 correspond to change-points that were identified by both algorithms, whereas the gold ones refer to the most representative change-points discovered only by KCP. A summary of our findings based on the Enron Chronology (http://www.desdemonadespair.net/2010/09/bushenron-chronology.html) follows below:

• Most scan statistics as well as d(t) with h = 0.47, panel (d), signal a change-

point at t∗ = 58 in December 1999. This period coincides with the finalization

of a sham energy deal between Enron and Merrill Lynch in order to meet profit

expectations and boost stock price.

• A second common change-point is t∗ = 94 during August 2000, where Enron

stock peaks at $90.56 and a large-scale insider selling scheme is underway to

profit while the price is still high.


[Figure 3.2 appears here. Left panels (a), (c), (e): scan statistics $S_{\tau,\lambda,k}(t; \Psi)$ and $S_{\tau,\lambda,k}(t; \Phi)$ for k = 0, 1, 2. Right panels (b), (d), (f): KCP test statistic d(t) with K = 2 and h = 0.38, 0.47, 0.56, together with the 99% confidence level. The time axis runs from 08/99 to 04/02.]

Figure 3.2: (Color online) Scan statistic values and associated detected change-points for the Enron email data set (left panels). Kernel Change-Points test statistic d(t) along with the estimated quantiles $c_\alpha(t)$, with α = 0.01, for each candidate change-point (right panels). Red circles indicate common discovered change-points between the two approaches, whereas gold circles represent newly discovered change-points by the KCP statistic.


• The third common change-point is t* = 115 in December 2000. According to the Enron Chronology, Jeff Skilling has just been promoted to CEO of Enron and has begun selling 10,000 shares of stock weekly.

• The final common change-point between the scan statistic approach and KCP occurs at t* = 145 during August 2001. This anomaly occurs as Enron CEO Skilling steps down, citing personal reasons, amid public criticism of a number of accounting and trading issues.

• Two further change-points are only detected by d(t) with h = 0.38. The first

one at t∗ = 67 in February 2000 coincides with a period in which Enron CFO

Andrew Fastow is involved in selling Enron subsidiaries at an increased value

to bolster financial statements. To conclude, at t∗ = 137 during May 2001,

the anomaly occurs as Enron’s largest foreign investment, the Dabhol Power

Company in India, shuts down amid heavy protests and environmental damage.

In conclusion, we notice that while the theoretical framework of the scan statistic approach does not support arbitrary time-varying (θ(t), Z(t)), the red circles in Figure 3.2 demonstrate that in some instances both $S_{\tau,\lambda,k}(t; \cdot)$ and d(t) are able to capture the same anomalous connectivity patterns. Furthermore, due to the adaptability of d(t) provided by the different kernel bandwidths, the KCP statistic d(t) is able to detect relevant change-points overlooked by $S_{\tau,\lambda,k}(t; \cdot)$.

3.5 Discussion

The analysis and collection of network data observed over time has recently enjoyed tremendous popularity among researchers trying to understand time-evolving complex phenomena. In this chapter, we focus on the change-point detection problem under the dynamic stochastic blockmodel framework. We develop a novel methodology, KCP, advocating the use of kernel-weighted adjacency matrices for data


aggregation and test statistics based on the spectral clustering algorithm for anomaly detection. Our flexible dynamic stochastic blockmodel setting allows for a variety of network models to be considered, while the adaptability of our KCP methodology through different kernel bandwidths enables the detection of different types of change-points. Model selection rates on synthetic network data demonstrate the robustness of our approach in terms of change-point detection, with the Enron email corpus providing an example of commensurate performance with competing methods on real-world network data sets.

There are several interesting extensions of our current work. First, we can consider even more sophisticated network models, such as a dynamic degree-corrected blockmodel or mixed membership models, to carry out change-point detection. In addition, given that the major computational bottleneck of our approach is the calculation of the quantile $c_\alpha(t)$, we leave an online implementation using parallel computing as future work. Finally, based on the classical analysis of kernel density estimation, a further investigation is to study the limiting properties and power characteristics of the KCP test statistic.


Chapter 4

NC-Impute: Scalable Matrix Completion with Nonconvex Penalties

In this chapter, we study the popularly dubbed matrix completion problem in its noisy setting, where the task is to "fill in" the unobserved entries of a matrix from a small subset of observed entries under the assumption that the underlying matrix is (approximately) of low rank. The main contribution of this chapter is to present a systematic computational study of nonconvex regularized matrix completion problems. The present work complements and enhances our previous work on nuclear norm regularized problems for matrix completion (Mazumder et al., 2010), and also provides a systematic study of spectral thresholding operators. Inspired by Soft-Impute (Mazumder et al., 2010; Hastie et al., 2016), we propose NC-Impute, an algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm-starts. Using structured low-rank SVD computations, we demonstrate the computational scalability of our proposal for problems up to the Netflix size (approximately, a 500,000 × 20,000 matrix with $10^8$ observed entries). We demonstrate that on a wide range of synthetic and real data instances, our


proposed nonconvex regularization framework leads to low-rank solutions with better predictive performance when compared to convex nuclear norm problems. Implementations of the algorithms proposed herein, written in the R programming language, are made available on github.

4.1 Introduction

In several problems of contemporary interest, arising for instance in recommender system applications such as the Netflix Prize competition (see SIGKDD and Netflix (2007)), observed data is in the form of a large sparse matrix, $Y_{ij}$, $(i, j) \in \Omega$, where $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$, with $|\Omega| \ll mn$. Popularly dubbed as the matrix completion problem (Candes and Recht, 2009), the task is to predict the unobserved entries, under the assumption that the underlying matrix is of low rank. This leads to the natural rank regularized optimization problem:

$$\underset{X}{\text{minimize}} \quad \frac{1}{2}\|P_\Omega(X - Y)\|_F^2 + \lambda\,\mathrm{rank}(X), \qquad (4.1)$$

where, PΩ(X) denotes the projection of X onto the observed matrix indices Ω and is

zero otherwise; and ‖·‖F denotes the usual Frobenius norm of a matrix. Problem (4.1),

however, is computationally difficult due to the presence of the combinatorial rank

constraint (Chistov and Grigor’ev, 1984). A natural convexification (Fazel, 2002;

Recht et al., 2010) of rank(X) is ‖X‖∗, the nuclear norm of X, which leads to the

following surrogate of Problem (4.1) given by:

$$\underset{X}{\text{minimize}} \quad \frac{1}{2}\|P_\Omega(X - Y)\|_F^2 + \lambda\|X\|_*. \qquad (4.2)$$

Candes and Recht (2009); Candes and Plan (2010) show that under some assumptions on the underlying "population" matrix, a solution to Problem (4.2) approximates a solution to Problem (4.1) reasonably well. The estimator obtained from Problem (4.2) enjoys several advantages: the nuclear norm shrinks the singular values and simultaneously sets many of the singular values to zero, thereby encouraging


low-rank solutions. It is thus not surprising that Problem (4.2) has enjoyed a significant amount of attention in the wider statistical community over the past several years. There have been impressive advances in understanding its statistical properties (Negahban and Wainwright, 2012; Candes and Plan, 2010; Recht et al., 2010; Rohde and Tsybakov, 2011). Motivated by the work of Candes and Recht (2009); Cai et al. (2010), the authors in Mazumder et al. (2010) proposed Soft-Impute, an EM-flavored (Dempster et al., 1977) algorithm for optimizing Problem (4.2). For some other computational work on developing scalable algorithms for the problem, see the papers Hastie et al. (2016); Jaggi and Sulovsky (2010); Freund et al. (2015), and references therein. Typical assumptions under which the nuclear norm works as a good proxy for the low-rank problem require the entries of the singular vectors of the "true" low-rank matrix to be sufficiently spread out, and the missing pattern to be roughly uniform. The number of observed entries needs to be sufficiently larger than the number of parameters of the matrix, $O((m+n)r)$, where r denotes the rank of the true underlying matrix. Negahban and Wainwright (2012) propose improvements with a (convex) weighted nuclear norm penalty in addition to spikiness constraints for the noisy matrix completion problem.
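For reference, the basic Soft-Impute iteration of Mazumder et al. (2010) alternates between imputing the missing entries with the current estimate and soft-thresholding the singular values of the completed matrix. The dense-matrix R sketch below conveys the idea; it deliberately ignores the sparse-plus-low-rank representations and warm starts that make the actual implementation scale.

# Plain dense-matrix sketch of Soft-Impute for Problem (4.2); Omega is a logical
# m x n matrix of observed entries, and Y holds the observed values on Omega.
soft_impute <- function(Y, Omega, lambda, maxit = 100, tol = 1e-4) {
  X <- matrix(0, nrow(Y), ncol(Y))
  for (it in 1:maxit) {
    Z <- ifelse(Omega, Y, X)          # impute unobserved entries with current X
    s <- svd(Z)
    d <- pmax(s$d - lambda, 0)        # soft-threshold the singular values
    X_new <- s$u %*% (d * t(s$v))     # equals U diag(d) V'
    if (sum((X_new - X)^2) <= tol * max(1, sum(X^2))) {
      X <- X_new
      break
    }
    X <- X_new
  }
  X
}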

The nuclear norm penalization framework, however, has limitations. If the conditions mentioned above fail, Problem (4.2) may fall short of delivering reliably low-rank estimators with good prediction properties (on the missing entries). In addition, since the nuclear norm shrinks the singular values, in order to obtain an estimator with good explanatory power one often ends up with a matrix estimator of high numerical rank — thereby leading to models that have higher rank than what might be desirable. The limitations mentioned above, however, should not come as a surprise to an expert — especially if one draws a parallel connection to the Lasso (Tibshirani, 1996), a popular sparsity-inducing shrinkage mechanism effectively used in the context of sparse linear modeling and regression. In the linear regression context, the Lasso often leads to dense models and suffers when the features are highly correlated — the


limitations of the Lasso are quite well known in the statistics literature, and there have been major strides in moving beyond the convex $\ell_1$-penalty to more aggressive forms of nonconvex penalties (Fan and Li, 2001; Mazumder et al., 2011; Bertsimas et al., 2016; Zou and Li, 2008; Zhang, 2010; Zhang et al., 2012). The key principle in these methods is the use of nonconvex regularizers that better approximate the $\ell_0$-penalty, leading to possibly nonconvex estimation problems. Thus motivated, we study herein the following family of nonconvex regularized estimators for the task of (noisy) matrix completion:

$$\underset{X}{\text{minimize}} \quad f(X) := \frac{1}{2}\|P_\Omega(X - Y)\|_F^2 + \sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X); \lambda, \gamma), \qquad (4.3)$$

where $\sigma_i(X)$, $i \geq 1$, are the singular values of $X_{m \times n}$ and $\sigma \mapsto P(\sigma; \lambda, \gamma)$ is a concave penalty function on $[0, \infty)$. We will denote an estimator obtained from (4.3) by $\hat{X}_{\lambda,\gamma}$. The family of penalty functions $P(\sigma; \lambda, \gamma)$ is indexed by the parameters (λ, γ) — together they control the amount of nonconvexity and shrinkage — see for example Mazumder et al. (2011); Zhang et al. (2012) and also Section 4.2, herein, for examples of such nonconvex families.

A caveat in considering problems of the form (4.3) is that they lead to nonconvex optimization problems, and thus obtaining a certifiably optimal global minimizer is generally difficult. Fairly recently, Bertsimas et al. (2016); Mazumder and Radchenko (2015) have shown that subset selection problems in sparse linear regression can be computed using advances in mixed integer quadratic optimization. Such global optimization methods, however, do not apply to matrix variate problems involving spectral¹ penalties, as in Problems (4.1) or (4.3). The main focus in our work herein is to develop a computationally scalable algorithmic framework that allows us to obtain

¹We say that a function is a spectral function of a matrix X if it depends only upon the singular values of X. The state of the art algorithmics in mixed integer semidefinite optimization problems is in its nascent stage, and not nearly as advanced as the technology for mixed integer quadratic optimization.


high quality stationary points or upper bounds² for Problem (4.3) — we obtain a path of solutions $\hat{X}_{\lambda,\gamma}$ across a grid of values of (λ, γ) for Problem (4.3) by employing warm-starts, following the path-following scheme proposed in Mazumder et al. (2010). Leveraging problem structure and modern advances in computationally scalable low-rank SVDs, as successfully employed in Mazumder et al. (2010); Hastie et al. (2016), we empirically demonstrate the computational scalability of our method for problems of the size of the Netflix data set, a matrix of size (approx.) 480,000 × 18,000 with ∼ $10^8$ observed entries. Perhaps most importantly, we demonstrate empirically that the resultant estimators lead to better statistical properties (i.e., the estimators have lower rank and enjoy better prediction performance) over nuclear norm based estimates, on a variety of problem instances.

4.1.1 Contributions and Outline

The main contributions of this chapter can be summarized as follows:

• We propose a computational framework for nonconvex penalized matrix comple-

tion problems of the form (4.3). Our algorithm, NC-Impute, is an adaptation

of the EM-stylized procedure Soft-Impute (Mazumder et al., 2010) to more

general nonconvex penalized thresholding operators.

• We present an in-depth investigation of the nonconvex spectral thresholding

operators, which form the main building block of our algorithm. We also study

their effective degrees of freedom, which provide a simple and intuitive way to

calibrate the two-dimensional grid of tuning parameters, extending the method

proposed in nonconvex penalized regression by Mazumder et al. (2011).

²Since the problems under consideration are nonconvex, our methods are not guaranteed to reach

the global minimum – we thus refer to the solutions obtained as upper bounds. In many synthetic

examples, however, the solutions are indeed seen to be globally optimal. We do show rigorously,

however, that these solutions are first order stationary points for the optimization problems under

consideration.


• We provide comprehensive computational guarantees of our algorithm in terms

of the number of iterations needed to reach a first order stationary point and the

asymptotic convergence of the sequence of estimates produced by NC-Impute.

• Every iteration of NC-Impute requires the computation of a low-rank SVD

of a structured matrix, for which we propose new methods. Using efficient

warm-start tricks to speed up the low-rank computations, we demonstrate the

effectiveness of our proposal to large scale instances up to the Netflix size in

reasonable computation times.

• Over a wide range of synthetic and real-data examples, we show that our pro-

posed nonconvex penalized framework leads to high quality solutions with excel-

lent statistical properties, which are often found to be significantly better than

nuclear norm regularized solutions in terms of producing low-rank solutions

with good predictive performances.

• Implementations of our algorithms in the R programming language have been made publicly available on github at the following url: https://github.com/diegofrasal/ncImpute.

The remainder of the chapter is organized as follows. Section 4.2 studies several

properties of nonconvex spectral penalties and associated spectral thresholding opera-

tors, including their effective degrees of freedom. Section 4.3 describes our algorithmic

framework NC-Impute and studies the convergence properties of the algorithm. Sec-

tion 4.4 presents numerical experiments demonstrating the usefulness of nonconvex

penalized estimation procedures in terms of superior statistical properties on several

synthetic data sets — we also show the usefulness of these estimators on several real

data instances. Section 4.5 contains the conclusions. To improve readability, some

technical material is relegated to Appendix A.


Notation: For a matrix $A_{m \times n}$, we denote its (i, j)th entry by $a_{ij}$. $P_\Omega(A)$ is a matrix with its (i, j)th entry given by $a_{ij}$ for $(i, j) \in \Omega$ and zero otherwise, with $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$. We use the notation $P^\perp_\Omega(A) = A - P_\Omega(A)$ to denote the projection onto the complement of Ω. Let $\sigma_i(A)$, $i = 1, \ldots, \max\{m, n\}$, denote the singular values of A, with $\sigma_i(A) \geq \sigma_{i+1}(A)$ (for all i); we will use the notation σ(A) to denote the vector of singular values. When clear from the context, we will simply write σ instead of σ(A). For a vector $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$, we will use the notation diag(a) to denote an $n \times n$ diagonal matrix with ith diagonal entry $a_i$.

4.2 Spectral Thresholding Operators

We begin our analysis by considering the fully observed version of Problem (4.3), given by:

$$S_{\lambda,\gamma}(Z) \in \arg\min_{X}\ \Big\{ g(X) := \frac{1}{2}\|X - Z\|_F^2 + \sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X); \lambda, \gamma) \Big\}, \qquad (4.4)$$

where, for a given matrix Z, a minimizer of the function g(X), given by $S_{\lambda,\gamma}(Z)$, is the spectral thresholding operator induced by the spectral penalty function $\sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X); \lambda, \gamma)$. Suppose $U\mathrm{diag}(\sigma)V'$ denotes the SVD of Z. For the nuclear norm regularized problem, with $P(\sigma_i(X); \lambda, \gamma) = \lambda\sigma_i(X)$, the thresholding operator, denoted by $S_{\lambda,\ell_1}(Z)$ (say), is given by the familiar soft-thresholding operator (Cai et al., 2010; Mazumder et al., 2010):

$$S_{\lambda,\ell_1}(Z) := U\mathrm{diag}(s_{\lambda,\ell_1}(\sigma))V', \quad \text{with} \quad s_{\lambda,\ell_1}(\sigma_i) := (\sigma_i - \lambda)_+, \qquad (4.5)$$

where $(\cdot)_+ = \max\{\cdot, 0\}$ and $s_{\lambda,\ell_1}(\sigma_i)$ is the ith entry of $s_{\lambda,\ell_1}(\sigma)$ (due to separability of the thresholding operator). Here, $s_{\lambda,\ell_1}(\sigma)$ is the soft-thresholding operator on the singular values of Z and plays a crucial role in the Soft-Impute algorithm (Mazumder et al., 2010). For the rank regularized problem, with penalty given as $P(\sigma_i(X); \lambda, \gamma) = \lambda\,\mathbf{1}(\sigma_i(X) > 0)$, the thresholding operator denoted by $S_{\lambda,\ell_0}(Z)$ is


given by the familiar hard-thresholding operator (Mazumder et al., 2010):

$$S_{\lambda,\ell_0}(Z) := U\mathrm{diag}(s_{\lambda,\ell_0}(\sigma))V', \quad \text{with} \quad s_{\lambda,\ell_0}(\sigma_i) = \sigma_i\,\mathbf{1}(\sigma_i > \sqrt{2\lambda}). \qquad (4.6)$$

A closely related thresholding operator that retains the top r singular values and sets the remaining ones to zero formed the basis of the Hard-Impute algorithm in Mazumder et al. (2010); Troyanskaya et al. (2001). The results in (4.5) and (4.6) suggest a curious link — the spectral thresholding operators (for the two specific choices of the spectral penalty functions given above) are tied to the corresponding thresholding functions that operate only on the singular values of the matrix — in other words, the operators $S_{\lambda,\ell_1}(Z)$ and $S_{\lambda,\ell_0}(Z)$ do not affect the singular vectors of the matrix Z. It turns out that a similar result holds true for more general spectral penalty functions $P(\cdot; \lambda, \gamma)$, as the following proposition illustrates.

Proposition 4.1. Let $Z = U\mathrm{diag}(\sigma)V'$ denote the SVD of Z, and let $s_{\lambda,\gamma}(\sigma)$ denote the following thresholding operator on the singular values of Z:

$$s_{\lambda,\gamma}(\sigma) \in \arg\min_{\alpha \geq 0}\ \Big\{ g(\alpha) := \frac{1}{2}\|\alpha - \sigma\|_2^2 + \sum_{i=1}^{\min\{m,n\}} P(\alpha_i; \lambda, \gamma) \Big\}. \qquad (4.7)$$

Then $S_{\lambda,\gamma}(Z) = U\mathrm{diag}(s_{\lambda,\gamma}(\sigma))V'$.

Proof. Note that by the Wielandt-Hoffman inequality (Horn and Johnson, 2012) we have that $\|X - Z\|_F^2 \geq \|\sigma(X) - \sigma(Z)\|_2^2$, where, for a vector $a \in \mathbb{R}^m$, $\|a\|_2$ denotes the standard Euclidean norm. Equality holds when X and Z share the same left and right singular vectors. This leads to:

$$\frac{1}{2}\|X - Z\|_F^2 + \sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X); \lambda, \gamma) \;\geq\; \frac{1}{2}\|\sigma(X) - \sigma(Z)\|_2^2 + \sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X); \lambda, \gamma). \qquad (4.8)$$

In the above inequality, note that the left hand side is g(X) and the right hand side is g(σ(X)). It follows that

$$\min_{X}\, g(X) \;\geq\; \min_{\sigma(X)}\, g(\sigma(X)) = g\big(s_{\lambda,\gamma}(\sigma)\big), \qquad (4.9)$$


where we used the observation that $\sigma(X) \geq 0$ and that $s_{\lambda,\gamma}(\sigma)$, as defined in (4.7), minimizes g(σ(X)). In addition, this minimum is attained by the function g(X) at the choice $X = U\mathrm{diag}(s_{\lambda,\gamma}(\sigma))V'$. This completes the proof of the proposition.
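Proposition 4.1 translates directly into code: one computes an SVD, applies the scalar thresholding rule to the singular values, and reassembles the matrix. The short R sketch below uses the soft-thresholding rule (4.5) as the plug-in scalar operator; the function names are ours.

# Generic spectral thresholding operator of Proposition 4.1:
# S_{lambda,gamma}(Z) = U diag(s_{lambda,gamma}(sigma)) V'.
spectral_threshold <- function(Z, scalar_rule, ...) {
  s <- svd(Z)
  d <- scalar_rule(s$d, ...)
  s$u %*% (d * t(s$v))
}

# Scalar soft-thresholding rule from (4.5).
soft_rule <- function(sigma, lambda) pmax(sigma - lambda, 0)

# Example: the nuclear norm operator S_{lambda, l1}(Z) for a random 5 x 4 matrix.
Z <- matrix(rnorm(20), 5, 4)
X_soft <- spectral_threshold(Z, soft_rule, lambda = 0.5)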

Due to the separability of the optimization problem (4.7) across the coordinates, i.e., $g(\alpha) = \sum_i g_i(\alpha_i)$, it suffices to consider each of the subproblems separately. Let $s_{\lambda,\gamma}(\sigma_i)$ denote a minimizer of $g_i(\alpha_i)$ (over $\alpha_i \geq 0$), i.e.,

$$s_{\lambda,\gamma}(\sigma_i) \in \arg\min_{\alpha \geq 0}\ \Big\{ g_i(\alpha) := \frac{1}{2}(\alpha - \sigma_i)^2 + P(\alpha; \lambda, \gamma) \Big\}. \qquad (4.10)$$

It is easy to see that the ith coordinate of $s_{\lambda,\gamma}(\sigma)$ is given by $s_{\lambda,\gamma}(\sigma_i)$. This discussion suggests that our understanding of the spectral thresholding operator $S_{\lambda,\gamma}(Z)$ is intimately tied to the univariate thresholding operator (4.10). Thus motivated, in the following we present a concise discussion of univariate penalty functions and the resultant thresholding operators. We begin with some examples of concave penalties that are popularly used in the statistics literature in the context of sparse linear modeling.

Families of Nonconvex Penalty Functions: Several types of nonconvex penalties are popularly used in the high-dimensional regression framework — see, for example, Nikolova (2000); Lv and Fan (2009); Zhang et al. (2012). Note that for our setup, since these penalty functions operate on the singular values of a matrix, it suffices to consider nonconvex functions that are defined only on the nonnegative real numbers. We present a few examples below:

• The $\ell_\gamma$ penalty (Frank and Friedman, 1993), given by $P(\sigma; \lambda, \gamma) = \lambda\sigma^\gamma$ ($0 \leq \gamma < 1$) on $\sigma \geq 0$.

• The SCAD penalty (Fan and Li, 2001), defined via

$$P'(\sigma; \lambda, \gamma) = \lambda\,\mathbf{1}(\sigma \leq \lambda) + \frac{(\gamma\lambda - \sigma)_+}{\gamma - 1}\,\mathbf{1}(\sigma > \lambda), \quad \text{for } \sigma \geq 0,\ \gamma > 2,$$

where $P'(\sigma; \lambda, \gamma)$ denotes the derivative of $\sigma \mapsto P(\sigma; \lambda, \gamma)$ on $\sigma \geq 0$, with $P(0; \lambda, \gamma) = 0$.

• The MC+ penalty (Zhang, 2010; Mazumder et al., 2011), defined as

$$P(\sigma; \lambda, \gamma) = \lambda\left(\sigma - \frac{\sigma^2}{2\lambda\gamma}\right)\mathbf{1}(0 \leq \sigma < \lambda\gamma) + \frac{\lambda^2\gamma}{2}\,\mathbf{1}(\sigma \geq \lambda\gamma),$$

on $\lambda \geq 0$, $\gamma \geq 0$.

• The log-penalty, with $P(\sigma; \lambda, \gamma) = \lambda\log(\gamma\sigma + 1)/\log(\gamma + 1)$ on $\gamma > 0$ and $\lambda \geq 0$.
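For reference, the MC+ and log penalties listed above can be coded directly from their definitions; the small R sketch below is our own transcription (the SCAD penalty, which is specified through its derivative, is omitted).

# MC+ penalty P(sigma; lambda, gamma), vectorized in sigma >= 0.
mcplus_penalty <- function(sigma, lambda, gamma) {
  ifelse(sigma < lambda * gamma,
         lambda * (sigma - sigma^2 / (2 * lambda * gamma)),
         lambda^2 * gamma / 2)
}

# Log-penalty P(sigma; lambda, gamma) = lambda * log(gamma * sigma + 1) / log(gamma + 1).
log_penalty <- function(sigma, lambda, gamma) {
  lambda * log(gamma * sigma + 1) / log(gamma + 1)
}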

Figure 4.1 shows some members of the above nonconvex penalty families. The $\ell_\gamma$ penalty function is non-differentiable at σ = 0, due to the unboundedness of $P'(\sigma; \lambda, \gamma)$ as $\sigma \to 0+$. The non-zero derivative at $\sigma = 0+$ encourages sparsity in the estimated σ. The $\ell_\gamma$ penalty functions show a clear transition from the $\ell_1$ penalty, i.e., σ, to the $\ell_0$ penalty, i.e., $\mathbf{1}(\sigma > 0)$ — similarly, the resultant thresholding operators show a passage from the soft-thresholding to the hard-thresholding operator. Let us examine the analytic form of the thresholding function induced by the MC+ penalty (for any $\gamma > 1$):

$$s_{\lambda,\gamma}(\sigma) = \begin{cases} 0, & \text{if } \sigma \leq \lambda, \\[4pt] \dfrac{\sigma - \lambda}{1 - 1/\gamma}, & \text{if } \lambda < \sigma \leq \lambda\gamma, \\[4pt] \sigma, & \text{if } \sigma > \lambda\gamma. \end{cases} \qquad (4.11)$$

It is interesting to note that for the MC+ penalty, the derivatives are all bounded and the thresholding functions are all continuous for all $\gamma > 1$. As $\gamma \to \infty$, the threshold operator (4.11) coincides with the soft-thresholding operator. However, as $\gamma \to 1+$ the thresholding operator approaches the discontinuous hard-thresholding operator $\sigma\,\mathbf{1}(\sigma \geq \lambda)$ — this is illustrated in Figure 4.1 and can also be observed by inspecting (4.11). Note that the $\ell_1$ penalty penalizes small and large singular values in a similar fashion, thereby incurring an increased bias in estimating the larger coefficients. For the MC+ and SCAD penalties, we observe that they penalize the larger coefficients less severely than the $\ell_1$ penalty — simultaneously, they penalize the smaller coefficients in a manner similar to that of the $\ell_1$-penalty. On the other hand, the $\ell_\gamma$ penalty (for small values of γ) imposes a more severe penalty for values of σ ≈ 0, quite different from the behavior of the other penalties.
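The MC+ rule (4.11) is equally easy to transcribe in R, and it can be passed to the spectral_threshold sketch given earlier in place of the soft-thresholding rule; as before, this is an illustrative snippet rather than the packaged implementation.

# MC+ scalar thresholding operator from (4.11), valid for gamma > 1; vectorized in sigma.
mcplus_rule <- function(sigma, lambda, gamma) {
  ifelse(sigma <= lambda, 0,
         ifelse(sigma <= lambda * gamma, (sigma - lambda) / (1 - 1 / gamma), sigma))
}

mcplus_rule(c(0.3, 1.5, 4), lambda = 1, gamma = 2)
# [1] 0 1 4   (gamma -> Inf recovers soft-thresholding; gamma -> 1+ approaches hard-thresholding)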


[Figure 4.1 appears here. Top panels: the ℓγ, SCAD, and MC+ penalties as functions of σ for several values of γ (ℓγ: 0.01–0.99; SCAD: 2.1–100; MC+: 1.1–100). Bottom panels: the corresponding ℓγ, SCAD, and MC+ threshold operators.]

Figure 4.1: [Top panel] Examples of nonconvex penalties $\sigma \mapsto P(\sigma; \lambda, \gamma)$ with λ = 1 for different values of γ. [Bottom Panel] The corresponding scalar thresholding operators: $\sigma \mapsto s_{\lambda,\gamma}(\sigma)$. At σ = 1, some of the thresholding operators corresponding to the ℓγ penalty function are discontinuous, and some of the other thresholding functions are "close" to being so.

4.2.1 Properties of Spectral Thresholding Operators

The nonconvex penalty functions described in the previous section are concave functions on the nonnegative real line. We will now discuss measures that characterize the amount of concavity in these functions. For a univariate penalty function $\alpha \mapsto P(\alpha; \lambda, \gamma)$ on $\alpha \geq 0$, assumed to be differentiable on $(0, \infty)$, we introduce the following quantity ($\phi_P$) that measures the amount of concavity (see also Zhang (2010)) of $P(\alpha; \lambda, \gamma)$:

$$\phi_P := \inf_{\alpha, \alpha' > 0}\ \frac{P'(\alpha; \lambda, \gamma) - P'(\alpha'; \lambda, \gamma)}{\alpha - \alpha'}, \qquad (4.12)$$


where $P'(\alpha; \lambda, \gamma)$ denotes the derivative of $P(\alpha; \lambda, \gamma)$ with respect to α on $\alpha > 0$. We say that the function g(X) is τ-strongly convex if the following condition holds:

$$g(X) \;\geq\; g(\tilde{X}) + \langle \nabla g(\tilde{X}),\, X - \tilde{X} \rangle + \frac{\tau}{2}\|X - \tilde{X}\|_F^2, \qquad (4.13)$$

for some $\tau \geq 0$ and all $X, \tilde{X}$. In (4.13), $\nabla g(\tilde{X})$ denotes any subgradient (assuming it exists) of g(X) at $\tilde{X}$. If τ = 0, then the function is simply convex³. Using standard

properties of spectral functions (Borwein and Lewis, 2006; Lewis, 1995), it follows

that g(X) is τ -strongly convex iff the vector function:

\[
g(\alpha) = \frac{1}{2}\|\alpha - \sigma(Z)\|_2^2 + \sum_{i=1}^{\min\{m,n\}} P(\alpha_i;\lambda,\gamma) \tag{4.14}
\]
is τ-strongly convex on {α : α ≥ 0}, where σ(Z) denotes the singular values of Z.

Let us recall the separable decomposition g(α) = ∑_i g_i(α_i), with g_i(α) as defined in (4.10). Clearly, the function α ↦ g(α) is τ-strongly convex (on the nonnegative reals) iff each summand g_i(α) is τ-strongly convex on α_i ≥ 0. Towards this end, notice that g_i(α) is convex on α ≥ 0 iff 1 + φP ≥ 0 — in particular, g_i(α) is τ-strongly convex with parameter τ = 1 + φP, provided this number is nonnegative. In this vein, we have the following proposition:

Proposition 4.2. Suppose φP > −1; then the function X ↦ g(X) is τ-strongly convex with τ = 1 + φP > 0.

For the MC+ penalty, the condition τ = 1 + φP > 0 is equivalent to γ > 1. For the ℓγ penalty function with γ < 1, the parameter τ = −∞.
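As a numerical illustration of (4.12), the following small R sketch (our own helper names, and assuming the standard MC+ form P(σ; λ, γ) = λ∫₀^σ (1 − x/(γλ))₊ dx, whose derivative is P′(σ) = (λ − σ/γ)₊) recovers the MC+ concavity parameter φP = −1/γ, so the condition 1 + φP > 0 of Proposition 4.2 is precisely γ > 1.

```r
# Derivative of the MC+ penalty (assumed standard form): P'(sigma) = max(lambda - sigma/gamma, 0).
mcplus_deriv <- function(sigma, lambda, gamma) pmax(lambda - sigma / gamma, 0)

# Numerical approximation of phi_P in (4.12): the smallest difference quotient of P'.
# For this piecewise-linear derivative the infimum is attained on the steepest piece.
phi_P_numeric <- function(lambda, gamma) {
  grid <- seq(1e-4, 2 * lambda * gamma, length.out = 4000)
  d <- mcplus_deriv(grid, lambda, gamma)
  min(diff(d) / diff(grid))
}

phi_P_numeric(lambda = 1, gamma = 2)    # approximately -1/2 = -1/gamma
phi_P_numeric(lambda = 1, gamma = 1.25) # approximately -0.8
```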

Proposition 4.3. Suppose 1 + φP > 0; then Z ↦ Sλ,γ(Z) is Lipschitz continuous with constant 1/(1 + φP):
\[
\|S_{\lambda,\gamma}(Z_1)-S_{\lambda,\gamma}(Z_2)\|_F \;\le\; \frac{1}{1+\phi_P}\,\|Z_1-Z_2\|_F. \tag{4.15}
\]

³Note that we consider τ ≥ 0 in the definition so that it includes the case of convexity.


Proof. We rewrite g(X) as:
\[
g(X) = \left\{\frac{1}{2}\|X-Z\|_F^2 - \frac{\psi}{2}\|X\|_F^2\right\} + \left\{\sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X);\lambda,\gamma) + \frac{\psi}{2}\|X\|_F^2\right\}. \tag{4.16}
\]

We have that ‖X‖²F = ∑_{i=1}^{min{m,n}} σ²ᵢ(X). Using the shorthand notation P̄(σᵢ(X)) = P(σᵢ(X); λ, γ) + (ψ/2)σ²ᵢ(X), and rearranging the terms in g(X), it follows that Sλ,γ(Z), a minimizer of g(X), is given by:
\[
S_{\lambda,\gamma}(Z) \in \operatorname*{arg\,min}_{X}\;\left\{\frac{1-\psi}{2}\Big\|X-\frac{1}{1-\psi}Z\Big\|_F^2 + \sum_{i=1}^{\min\{m,n\}} \bar{P}(\sigma_i(X))\right\}. \tag{4.17}
\]

If ψ + φP > 0, the function σᵢ ↦ P̄(σᵢ) is convex for every i. If 1 − ψ > 0, then the first term appearing in the objective function in (4.17) is convex. Thus, assuming ψ + φP > 0 and 1 − ψ > 0, both summands in the above objective function are convex. In particular, the optimization problem (4.17) is convex and Z ↦ Sλ,γ(Z) can be viewed as a convex proximal map (Rockafellar, 1970). Using standard contraction properties of proximal maps, we have that:

\[
\|S_{\lambda,\gamma}(Z_1)-S_{\lambda,\gamma}(Z_2)\|_F \;\le\; \left\|\frac{Z_1}{1-\psi}-\frac{Z_2}{1-\psi}\right\|_F \;\le\; \frac{1}{1-\psi}\,\|Z_1-Z_2\|_F.
\]

Since the above holds true for any ψ as chosen above, optimizing over the value of

ψ such that Problem (4.17) remains convex gives us ψ = −φP , i.e., 1/(1 − ψ) =

1/(1 + φP ), thereby leading to (4.15).

4.2.2 Effective Degrees of Freedom for Spectral Thresholding Operators

The effective degrees of freedom, or df, is a widely used statistical notion that

measures the amount of “fitting” performed by an estimator (Efron et al., 2004;

Hastie et al., 2009; Stein, 1981). In the case of classical linear regression, for example,

df is simply given by the number of features used in the linear model. This notion


applies more generally to additive fitting procedures. Following Efron et al. (2004) and Stein (1981), let us consider an additive model of the form:
\[
Z_{ij} = \mu_{ij} + \varepsilon_{ij} \;\;\text{with}\;\; \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, v^2), \quad i = 1,\ldots,m,\; j = 1,\ldots,n. \tag{4.18}
\]
The df of μ̂ := μ̂(Z), for the fully observed model (4.18) above, is given by:
\[
\mathrm{df}(\hat{\mu}) = \sum_{ij} \operatorname{Cov}(\hat{\mu}_{ij}, Z_{ij})/v^2,
\]

where μ̂ᵢⱼ denotes the (i, j)th entry of the matrix μ̂. For the particular case of a spectral thresholding operator we have μ̂ = Sλ,γ(Z). When Z ↦ μ̂(Z) satisfies a weak differentiability condition, the df may be computed via a divergence formula (Stein, 1981; Efron et al., 2004):
\[
\mathrm{df}(\hat{\mu}) = \mathbb{E}\big((\nabla\cdot\hat{\mu})(Z)\big), \tag{4.19}
\]
where (∇·μ̂)(Z) = ∑ᵢⱼ ∂μ̂ᵢⱼ(Z)/∂Zᵢⱼ. For the spectral thresholding operator

Sλ,γ(·), expression (4.19) holds if the map Z ↦ Sλ,γ(Z) is Lipschitz and hence weakly differentiable — see for example Candes et al. (2013). In light of Proposition 4.3, the map Z ↦ Sλ,γ(Z) is Lipschitz when φP + 1 > 0. Under the model (4.18), the

singular values of Z will have a multiplicity of one with probability one. With these

assumptions in place, the divergence formula for Sλ,γ(Z) can be obtained follow-

ing Candes et al. (2013). We assume that the univariate thresholding operators are

differentiable, i.e., s′λ,γ(·) exist.

Proposition 4.4. Assume that 1 + φP > 0 and the model (4.18) is in place. Then the degrees of freedom of the estimator Sλ,γ(Z) are given by:
\[
\mathrm{df}(S_{\lambda,\gamma}(Z)) = \mathbb{E}\left(\sum_{i}\left(s'_{\lambda,\gamma}(\sigma_i) + |m-n|\,\frac{s_{\lambda,\gamma}(\sigma_i)}{\sigma_i}\right) + 2\sum_{i\neq j}\frac{\sigma_i\, s_{\lambda,\gamma}(\sigma_i)}{\sigma_i^2 - \sigma_j^2}\right), \tag{4.20}
\]
where the σᵢ's are the singular values of Z.

We note that the above expression is true for any value of 1 + φP > 0; in particular, for the MC+ penalty function, expression (4.20) holds for γ > 1.


[Figure 4.2 here — df/(mn) for the MC+ spectral thresholding operator as a function of log(γ), shown for λ = 0.5, 1, 1.5.]

Figure 4.2: Figure showing the df for the MC+ thresholding operator for a matrix with m = n = 10, µ = 0 and v = 1. The df profile as a function of γ (on the log scale) is shown for three values of λ. The dashed lines correspond to the df of the spectral soft-thresholding operator, corresponding to γ = ∞. We propose calibrating the (λ, γ) grid so that, for every value of γ, the df matches the df of the soft-thresholding operator — as shown in Figure 4.3.

As soon as γ ≤ 1, the above method of deriving df does not apply, due to the discontinuity in the map Z ↦ Sλ,γ(Z). Values of γ close to one (but larger), however, give an expression for the df near the hard-thresholding spectral operator, which corresponds to γ = 1.
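Expression (4.20) depends on Z only through its singular values, so under the null model it can be approximated by straightforward Monte-Carlo simulation. The R sketch below (our own helper names, reusing mcplus_threshold() from the earlier snippet) evaluates (4.20) literally for the MC+ operator with γ > 1; its output, scaled by mn, can be compared with the df/(mn) profiles of Figure 4.2.

```r
# Derivative of the MC+ thresholding rule (4.11) on sigma > 0, for gamma > 1.
mcplus_threshold_deriv <- function(sigma, lambda, gamma)
  ifelse(sigma <= lambda, 0, ifelse(sigma <= lambda * gamma, 1 / (1 - 1 / gamma), 1))

# Monte-Carlo estimate of df(S_{lambda,gamma}(Z)) in (4.20) under mu = 0, v = 1.
df_mcplus <- function(m, n, lambda, gamma, nrep = 200) {
  one_draw <- function() {
    sv <- svd(matrix(rnorm(m * n), m, n), nu = 0, nv = 0)$d
    s  <- mcplus_threshold(sv, lambda, gamma)
    ds <- mcplus_threshold_deriv(sv, lambda, gamma)
    diag_part <- sum(ds + abs(m - n) * s / sv)
    den <- outer(sv^2, sv^2, "-")
    diag(den) <- Inf                     # drop the i = j terms
    off_part <- 2 * sum((sv * s) / den)  # sigma_i s(sigma_i) / (sigma_i^2 - sigma_j^2), i != j
    diag_part + off_part
  }
  mean(replicate(nrep, one_draw()))
}

df_mcplus(10, 10, lambda = 1, gamma = 2) / 100  # df/(mn), cf. Figure 4.2
```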

To understand the behavior of the df as a function of (λ, γ), let us consider the null model with µ = 0 and the MC+ penalty function. In this case, for any fixed λ (see Figure 4.2), the df is seen to increase as γ decreases: the soft-thresholding function shrinks the large coefficients and sets all coefficients smaller than λ to zero, whereas the more aggressive shrinkage operators shrink less for larger values of σ while still setting all coefficients smaller than λ to zero. Thus, intuitively, the more aggressive thresholding operators should have larger df since they do more “fitting” — this is indeed observed in Figure 4.2.


[Figure 4.3 here — the calibrated (λ, γ) lattice, with λ on the horizontal axis and γ (on a log scale) on the vertical axis.]

Figure 4.3: Figure showing the calibrated (λ, γ) lattice — for every fixed value of λ, the df of the MC+ spectral threshold operators is the same across different γ values. The df computations have been performed on a null model using Proposition 4.5.

Mazumder et al. (2011) studied the df of the univariate thresholding operators in the linear regression problem, and observed a similar pattern in the behavior of the df across (λ, γ) values. For the linear regression problem, Mazumder et al. (2011) argued that it is desirable to choose a parametrization for (λ, γ) such that, for a fixed λ, the df stays the same as one moves across γ. We follow the same strategy for the spectral regularization problem considered herein — we reparametrize the two-dimensional grid of (λ, γ) values so that the df remain calibrated in the sense described above — this is illustrated in Figure 4.3.

The study of df presented herein provides a simple and intuitive explanation of

the roles of (λ, γ) for the fully observed problem. The notion of calibration proposed

herein gives a new parametrization of the family of penalties. The general algorith-

mic framework presented in this chapter (see Section 4.3) computes a regularization

surface using warm-starts across adjacent (λ, γ) values on a two-dimensional grid; it

is thus desirable for the adjacent values to be close — the df calibration ensures this

in a simple and intuitive manner.


Computation of df : The df estimate as implied by Proposition 4.4 depends only

upon singular values (and not the vectors) of a matrix and can hence be computed

with cost O(min{m, n}²). The expectation can be approximated via Monte-Carlo

simulation — these computations are easy to parallelize and can be done offline.

Since we compute the df for the null model, for larger values of m,n we resort to

the Marchenko-Pastur law for iid Gaussian matrix ensembles to approximate the

df expression (4.20). We illustrate the method using the MC+ penalty for γ > 1.

Towards this end, let us define a function on β ≥ 0:

\[
g_{\zeta,\gamma}(\beta) =
\begin{cases}
0, & \text{if } \sqrt{\beta} \le \zeta,\\[4pt]
\dfrac{\gamma}{\gamma-1}\left(1-\dfrac{\zeta}{\sqrt{\beta}}\right), & \text{if } \zeta < \sqrt{\beta} \le \zeta\gamma,\\[4pt]
1, & \text{if } \sqrt{\beta} > \zeta\gamma.
\end{cases}
\tag{4.21}
\]

The following proposition provides a method to approximate the df of the spectral

threshold operator for the null model, using the Marchenko-Pastur distribution— see

Lemma 1 in Appendix A. For the following proposition, we will assume (for simplicity)

that m ≥ n.

Proposition 4.5. Let m, n → ∞ with n/m → α ∈ (0, 1]; then, under the model Zᵢⱼ iid∼ N(0, 1), we have
\[
\lim_{m,n\to\infty} \frac{\mathrm{df}(S_{\lambda,\gamma}(Z))}{mn} =
\begin{cases}
0, & \text{if } \lambda/\sqrt{m} \to \infty,\\[6pt]
(1-\alpha)\,\mathbb{E}\, g_{\zeta,\gamma}(T_1) + \alpha\,\mathbb{E}\!\left(\dfrac{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)}{T_1 - T_2}\right), & \text{if } \lambda/\sqrt{m} \to \zeta,\\[10pt]
1, & \text{if } \lambda/\sqrt{m} \to 0,
\end{cases}
\tag{4.22}
\]

where Sλ,γ(Z) is the thresholding operator corresponding to the MC+ penalty with

λ ≥ 0, γ > 1 and the expectation is taken with respect to T1 and T2 independently

generated from the Marchenko-Pastur distribution (as described in Appendix A).

Proof. For a proof, see Appendix A.1.


Note that the variance v² in model (4.18) can always be assumed to be one (by adjusting the value of the tuning parameter accordingly⁴).
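As a small illustration of the calibration behind Figure 4.3 (a sketch under our own naming, reusing df_mcplus() from the previous snippet), one can search, for each γ, for the value of λ whose df matches the df of the soft-thresholding operator at a reference value λ0; fixing the random seed gives common random numbers, so the Monte-Carlo df is effectively a smooth, decreasing function of λ and a simple root-finder suffices.

```r
# df with common random numbers, so that it is a deterministic function of (lambda, gamma).
df_fixed <- function(lambda, gamma, m = 10, n = 10, nrep = 100) {
  set.seed(123)
  df_mcplus(m, n, lambda, gamma, nrep)
}

# For a given gamma, find the lambda whose df matches the soft-thresholding df at lambda0.
calibrate_lambda <- function(gamma, lambda0 = 1, m = 10, n = 10) {
  target <- df_fixed(lambda0, gamma = 1e6, m, n)   # gamma = 1e6 mimics soft-thresholding
  uniroot(function(l) df_fixed(l, gamma, m, n) - target,
          interval = c(1e-3, 10 * lambda0))$root
}

sapply(c(100, 30, 10, 5, 2), calibrate_lambda)  # one calibrated lambda per gamma
```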

4.3 The NC-Impute Algorithm

In this section, we present the main contribution of this chapter: the algorithm NC-Impute. The algorithm is inspired by an EM-stylized procedure, similar to Soft-Impute (Mazumder et al., 2010). It is helpful to recall that, for observed data PΩ(Y), the algorithm Soft-Impute relies on the following update sequence

\[
X_{k+1} = S_{\lambda,\ell_1}\!\left(P_\Omega(Y) + P_\Omega^\perp(X_k)\right), \tag{4.23}
\]

which can be interpreted as computing the nuclear norm regularized spectral thresh-

olding operator for the following “fully observed” problem:

\[
X_{k+1} \in \operatorname*{arg\,min}_X\;\left\{\frac{1}{2}\Big\|X-\left(P_\Omega(Y) + P_\Omega^\perp(X_k)\right)\Big\|_F^2 + \lambda\|X\|_*\right\},
\]

where the missing entries are filled in by the current estimate, i.e., P⊥Ω (Xk). We refer

the reader to Mazumder et al. (2010) for a detailed study of the algorithm. Mazumder

et al. (2010) suggest in passing the notion of extending Soft-Impute to more gen-

eral thresholding operators; however, such generalizations were not pursued by the

authors. In this chapter, we present a more thorough investigation of nonconvex

generalized thresholding operators — we study their convergence properties, scala-

bility aspects and demonstrate their superior statistical performance across a wide

range of numerical experiments.

Update (4.23) suggests a natural generalization to more general nonconvex penalty functions, by simply replacing the spectral thresholding operator S_{λ,ℓ1}(·) with more general operators S_{λ,γ}(·):
\[
X_{k+1} = S_{\lambda,\gamma}\!\left(P_\Omega(Y) + P_\Omega^\perp(X_k)\right). \tag{4.24}
\]

⁴This follows from the simple observation that s_{aλ,γ}(ax) = a·s_{λ,γ}(x) and s′_{aλ,γ}(ax) = s′_{λ,γ}(x).


Algorithm 4.1 NC-Impute

1. Input: A search grid λ1 > λ2 > · · · > λN; +∞ := γ1 > γ2 > · · · > γM. Tolerance ε.

2. Compute solutions X_{λi,γ1}, for i = 1, . . . , N, for the nuclear norm regularized problem.

3. For every (γj, λi) ∈ {γ2, . . . , γM} × {λ1, . . . , λN}:

   (a) Initialize X_old = arg min { f(X) : X ∈ {X_{λ_{i−1},γ_j}, X_{λ_i,γ_{j−1}}} }.

   (b) Repeat until convergence, i.e., ‖X_new − X_old‖²_F < ε‖X_old‖²_F:

       (i) Compute X_new ∈ arg min_X F_ℓ(X; X_old).

       (ii) Assign X_old ← X_new.

   (c) Assign X_{λ_i,γ_j} ← X_new.

4. Output: X_{λ_i,γ_j} for i = 1, . . . , N, j = 1, . . . , M.

While the above update rule works quite well in our numerical experiments, it en-

joys limited computational guarantees, as suggested by our convergence analysis in

Section 4.3.1. We thus propose and study a seemingly minor generalization of the

rule (4.24) — this modified rule enjoys superior finite time convergence rates to a first

order stationary point.

Towards this end, let us define the following function:

\[
F_\ell(X; X_k) := \frac{1}{2}\|P_\Omega(X-Y)\|_F^2 + \frac{1}{2}\|P_\Omega^\perp(X-X_k)\|_F^2 + \frac{\ell}{2}\|X-X_k\|_F^2 + \sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X);\lambda,\gamma), \tag{4.25}
\]

for ℓ ≥ 0. Note that F_ℓ(X; Xk) majorizes the function f(X), i.e., F_ℓ(X; Xk) ≥ f(X) for any X and Xk, with equality holding at X = Xk. In an attempt to obtain a minimum of Problem (4.3), our algorithm iteratively minimizes an upper bound to f(X), given by F_ℓ(X; Xk), to obtain Xk+1 — more formally, this leads to the following


update sequence:

\[
X_{k+1} \in \operatorname*{arg\,min}_X\; F_\ell(X; X_k). \tag{4.26}
\]

Note that Xk+1 is easy to compute; by some rearrangement of (4.25) we see:
\[
X_{k+1} \in \operatorname*{arg\,min}_X\;\left\{\frac{\ell+1}{2}\|X-\bar{X}_k\|_F^2 + \sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X);\lambda,\gamma)\right\} := S^{\ell}_{\lambda,\gamma}(\bar{X}_k), \tag{4.27}
\]
where X̄k = (PΩ(Y) + P⊥Ω(Xk) + ℓXk)/(ℓ + 1). Note that (4.27) is a minor modification of (4.24) — in particular, if ℓ = 0, then these two update rules coincide.
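For the ℓ = 0 case, where (4.27) reduces to (4.24), the spectral operator amounts to an SVD followed by coordinatewise application of the scalar rule (4.11) to the singular values. The R sketch below (our own function name, reusing mcplus_threshold() from earlier) is a direct, non-optimized implementation; for ℓ > 0 one would simply replace the scalar rule by the minimizer of (4.35).

```r
# Spectral MC+ thresholding S_{lambda,gamma}(Z): threshold the singular values of Z
# with the scalar rule (4.11) and reconstruct the matrix (ell = 0 case of (4.27)).
spectral_mcplus <- function(Z, lambda, gamma) {
  sv <- svd(Z)
  d  <- mcplus_threshold(sv$d, lambda, gamma)
  sv$u %*% (d * t(sv$v))   # equals U diag(d) V'
}
```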

The sequence Xk defined via (4.27) has desirable convergence properties, as we

discuss in Section 4.3.1. In particular, as k →∞, the sequence reaches (in a sense that

will be made more precise later) a first order stationary point for Problem (4.3). We, however, seek to compute an entire regularization surface of solutions to Problem (4.3) over a two-dimensional grid of (λ, γ), for which we use warm-starts. We take the MC+ family of functions as a running example, with (λ, γ) ∈ {λ1 > λ2 > · · · > λN} × {∞ := γ1 > γ2 > · · · > γM}. At the beginning, we compute a path of solutions

for the nuclear norm penalized problem, i.e., Problem (4.3) for γ =∞ on the grid of

λ values. For a fixed value of λ, we compute solutions to Problem (4.3) for smaller

values of γ, gradually moving away from the convex problems. In this scheme, we

found the following strategies useful:

• For every value of (λi, γj), we apply two copies of the iterative scheme (4.26)

initialized with solutions obtained from its two neighboring points (λi−1, γj) and

(λi, γj−1). From these two candidates, we select the one that leads to a smaller

value of the objective function f(·) at (λi, γj).

• Instead of using a two-dimensional rectangular lattice, one can also use the

recalibrated lattice, suggested in Section 4.2.2, as the two-dimensional grid of

tuning parameters.

The algorithm outlined above, called NC-Impute, is summarized as Algorithm 4.1.
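A minimal sketch of one NC-Impute run at a fixed (λ, γ), following the inner loop (step 3(b)) of Algorithm 4.1 with ℓ = 0, i.e., update (4.24), is given below; the function and variable names are ours, and a full implementation would additionally sweep the (λ, γ) grid with the warm-start rules described above.

```r
# One run of the inner loop of Algorithm 4.1 at a fixed (lambda, gamma), with ell = 0.
# Y: m x n matrix of observations; Omega: logical m x n mask of observed entries;
# X0: optional warm start (e.g., the solution at a neighboring grid point).
ncimpute_fixed <- function(Y, Omega, lambda, gamma, X0 = NULL, eps = 1e-4, maxit = 500) {
  X_old <- if (is.null(X0)) matrix(0, nrow(Y), ncol(Y)) else X0
  for (k in seq_len(maxit)) {
    filled <- ifelse(Omega, Y, X_old)                 # P_Omega(Y) + P_Omega^perp(X_old)
    X_new  <- spectral_mcplus(filled, lambda, gamma)  # update (4.24)
    # convergence criterion of Algorithm 4.1; the small constant guards a zero start
    if (sum((X_new - X_old)^2) < eps * (sum(X_old^2) + 1e-12)) break
    X_old <- X_new
  }
  X_new
}

# Tiny usage example on simulated data.
set.seed(1)
M     <- tcrossprod(matrix(rnorm(50 * 3), 50, 3), matrix(rnorm(40 * 3), 40, 3))
Omega <- matrix(runif(50 * 40) < 0.5, 50, 40)
Y     <- M + 0.1 * matrix(rnorm(50 * 40), 50, 40)
Xhat  <- ncimpute_fixed(Y, Omega, lambda = 1, gamma = 20)
```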


We now present an elementary convergence analysis of the update sequence (4.27).

Since the problems under investigation herein are nonconvex, our analysis requires

new ideas and techniques beyond those used in Mazumder et al. (2010) for the convex

nuclear norm regularized problem.

4.3.1 Convergence Analysis

By the definition of Xk+1, we have that:

\[
F_\ell(X_{k+1}; X_k) = \min_X F_\ell(X; X_k) \;\le\; F_\ell(X_k; X_k) = f(X_k). \tag{4.28}
\]

Let us define the quantities:
\[
\nu(\ell) := 1 + \phi_P + \ell \quad\text{and}\quad \nu^\dagger(\ell) := \max\{\nu(\ell), 0\}, \tag{4.29}
\]
where, if ν(ℓ) ≥ 0, then the function X ↦ F_ℓ(X; Xk) is ν(ℓ)-strongly convex. In particular, from (4.26), it follows that ∇F_ℓ(Xk+1; Xk), a subgradient of the map X ↦ F_ℓ(X; Xk), equals zero. We thus have:
\[
F_\ell(X_k; X_k) - F_\ell(X_{k+1}; X_k) \;\ge\; \frac{\nu(\ell)}{2}\|X_{k+1}-X_k\|_F^2. \tag{4.30}
\]

Now note that, by the definition of Xk+1, we always have:

\[
F_\ell(X_k; X_k) - F_\ell(X_{k+1}; X_k) \;\ge\; \frac{\nu^\dagger(\ell)}{2}\|X_{k+1}-X_k\|_F^2, \tag{4.31}
\]

since Xk+1 minimizes F_ℓ(X; Xk). In addition, we have:

\[
\begin{aligned}
F_\ell(X_{k+1}; X_k) \;=\;& \frac{1}{2}\|P_\Omega(X_{k+1}-Y)\|_F^2 + \frac{1}{2}\|P_\Omega^\perp(X_{k+1}-X_k)\|_F^2 + \frac{\ell}{2}\|X_{k+1}-X_k\|_F^2\\
& + \sum_{i=1}^{\min\{m,n\}} P(\sigma_i(X_{k+1});\lambda,\gamma)\\[4pt]
\;=\;& f(X_{k+1}) + \frac{1}{2}\|P_\Omega^\perp(X_{k+1}-X_k)\|_F^2 + \frac{\ell}{2}\|X_{k+1}-X_k\|_F^2.
\end{aligned}
\tag{4.32}
\]


Combining (4.31) and (4.32), and observing that F_ℓ(Xk; Xk) = f(Xk), we have:
\[
\begin{aligned}
f(X_k) - f(X_{k+1}) \;&\ge\; \frac{\nu^\dagger(\ell)}{2}\|X_{k+1}-X_k\|_F^2 + \frac{\ell}{2}\|X_{k+1}-X_k\|_F^2 + \frac{1}{2}\|P_\Omega^\perp(X_{k+1}-X_k)\|_F^2\\
&=\; \underbrace{\frac{\nu^\dagger(\ell)+\ell}{2}\|X_{k+1}-X_k\|_F^2 + \frac{1}{2}\|P_\Omega^\perp(X_{k+1}-X_k)\|_F^2}_{:=\,\Delta_\ell(X_k;X_{k+1})}.
\end{aligned}
\tag{4.33}
\]

Since Δ_ℓ(Xk; Xk+1) ≥ 0, the above inequality immediately implies that f(Xk) ≥ f(Xk+1) for all k; and the improvement in objective values is at least as large as the quantity Δ_ℓ(Xk; Xk+1). The term Δ_ℓ(Xk; Xk+1) is a measure of progress of the algorithm, as formalized by the following proposition.

Proposition 4.6. (a): Let ν†(ℓ) + ℓ > 0 and, for any Xa, let us consider the update Xa+1 ∈ arg min_X F_ℓ(X; Xa). Then the following are equivalent:

(i) f(Xa+1) = f(Xa).

(ii) Δ_ℓ(Xa; Xa+1) = 0.

(iii) Xa is a fixed point, i.e., Xa+1 = Xa.

(b): If ν†(ℓ) = ℓ = 0 and Δ_ℓ(Xa; Xa+1) = 0, then Xa+1 is a fixed point.

Proof. Proof of Part (a): We will show that (i) ⟹ (ii) ⟹ (iii) ⟹ (i), by analyzing (4.33). If f(Xa+1) = f(Xa), then (4.33) gives Δ_ℓ(Xa; Xa+1) = 0. Since ν†(ℓ) + ℓ > 0, Δ_ℓ(Xa; Xa+1) = 0 forces Xa+1 = Xa, which in turn trivially implies (i).

Proof of Part (b): If ν†(ℓ) + ℓ = 0, Part (a) needs to be slightly modified. Note that Δ_ℓ(Xa; Xa+1) = 0 iff P⊥Ω(Xa+1) = P⊥Ω(Xa). Since ℓ = 0, we have that Xa+2 = Sλ,γ(PΩ(Y) + P⊥Ω(Xa+1)). The condition P⊥Ω(Xa+1) = P⊥Ω(Xa) trivially implies that Sλ,γ(PΩ(Y) + P⊥Ω(Xa+1)) = Sλ,γ(PΩ(Y) + P⊥Ω(Xa)), where the term on the right equals Xa+1. Thus, Xa+1 = Xa+2 = · · · , i.e., Xa+1 is a fixed point.

Since the f(Xk)'s form a decreasing sequence which is bounded from below, they converge to a limit f̄, say — this implies that Δ_ℓ(Xk; Xk+1) → 0 as k → ∞. Let us now


consider two cases, depending upon the value of ν†(ℓ) + ℓ. If ν†(ℓ) + ℓ > 0, then we have Xk+1 − Xk → 0 as k → ∞. On the other hand, if ν†(ℓ) = ℓ = 0, the conclusion needs to be modified: Δ_ℓ(Xk; Xk+1) → 0 implies that P⊥Ω(Xk+1 − Xk) → 0 as k → ∞.

Motivated by the above discussion, we make the following definition of a first order

stationary point for Problem (4.3).

Definition 4.1. Xa is said to be a first order stationary point for Problem (4.3) if Δ_ℓ(Xa; Xa+1) = 0. Xa is said to be an ε-accurate first order stationary point for Problem (4.3) if Δ_ℓ(Xa; Xa+1) ≤ ε.

Proposition 4.7. The sequence f(Xk) is decreasing; suppose it converges to f̄. Then the rate of convergence of Xk to a first order stationary point is given by:
\[
\min_{1\le k\le K} \Delta_\ell(X_k; X_{k+1}) \;\le\; \frac{1}{K}\left(f(X_1)-\bar{f}\right). \tag{4.34}
\]

Proof. The arguments presented preceding Proposition 4.7 establish that the sequence f(Xk) is decreasing and converges to f̄, say.

Consider (4.33) for any 1 ≤ k ≤ K: Δ_ℓ(Xk; Xk+1) ≤ f(Xk) − f(Xk+1) — summing this inequality over k = 1, . . . , K, we have:
\[
K \min_{1\le k\le K} \Delta_\ell(X_k; X_{k+1}) \;\le\; \sum_{1\le k\le K} \Delta_\ell(X_k; X_{k+1}) \;\le\; f(X_1) - f(X_{K+1}) \;\le\; f(X_1) - \bar{f},
\]
where in the last inequality we used the simple fact that f(Xk) ↓ f̄ — gathering the left and right parts of the above chain of inequalities leads to (4.34).

Proposition 4.7 shows that the sequence Xk reaches an ε-accurate first order stationary point within Kε = (f(X1) − f̄)/ε many iterations. The number of iterations Kε depends upon how close the initial objective value f(X1) is to the eventual value f̄. Since NC-Impute employs warm-starts, this rate suggests that the number of iterations required to reach an approximate first order stationary point is quite low — this is indeed observed in our experiments, and this feature makes our algorithm particularly attractive from a practical viewpoint.


Rank Stabilization: Let us consider the thresholding function S^ℓ_{λ,γ}(X̄k) defined in (4.27), which expresses Xk+1 as a function of X̄k. Using the development in Section 4.2, it is easy to see that the spectral operator S^ℓ_{λ,γ}(X̄k) is closely tied to the following vector thresholding operator (4.35), acting on the singular values of X̄k. Formally, for a given nonnegative vector x, if we denote:
\[
s^{\ell}_{\lambda,\gamma}(x) \in \operatorname*{arg\,min}_{\alpha\ge 0}\;\frac{\ell+1}{2}\|\alpha-x\|_2^2 + \sum_{i=1}^{\min\{m,n\}} P(\alpha_i;\lambda,\gamma), \tag{4.35}
\]
then S^ℓ_{λ,γ}(X) = U diag(s^ℓ_{λ,γ}(x)) V′, where X = U diag(x) V′ is the SVD of X. Thus, properties of the thresholding function S^ℓ_{λ,γ}(X) are closely related to those of the vector thresholding operator s^ℓ_{λ,γ}(x). Due to the separability of the vector thresholding operator s^ℓ_{λ,γ}(x) across the coordinates of x, we denote by s^ℓ_{λ,γ}(xᵢ) the ith coordinate of s^ℓ_{λ,γ}(x).

We now investigate what happens to the rank of the sequence Xk as defined

via (4.26). In particular, does this rank converge? We show that the rank indeed

stabilizes after finitely many iterations, under an additional assumption — namely,

the spectral thresholding operator is discontinuous — see Figure 4.1 for examples of

discontinuous thresholding functions.

Proposition 4.8. Consider the update sequence Xk+1 = S^ℓ_{λ,γ}(X̄k) as defined in (4.27), and let ν†(ℓ) + ℓ > 0. Suppose that there is a λ_S > 0 such that, for any scalar x ≥ 0, the following holds: s^ℓ_{λ,γ}(x) ≠ 0 ⟹ |s^ℓ_{λ,γ}(x)| > λ_S — i.e., the scalar thresholding operator x ↦ s^ℓ_{λ,γ}(x) is discontinuous. Then there exists an integer K∗ such that for all k ≥ K∗ we have rank(Xk) = r, for some fixed integer r; i.e., the rank stabilizes after finitely many iterations.

Proof. Using (4.33) it follows that
\[
f(X_k) - f(X_{k+1}) \;\ge\; \frac{\nu^\dagger(\ell)+\ell}{2}\|X_{k+1}-X_k\|_F^2 \;\ge\; \frac{\nu^\dagger(\ell)+\ell}{2}\|\sigma_{k+1}-\sigma_k\|_2^2,
\]
where the last inequality follows from the Wielandt-Hoffman inequality (Horn and Johnson, 2012) and σk := σ(Xk) denotes the vector of singular values of Xk. Let 1(σ)


be an indicator vector whose ith coordinate equals 1(σᵢ ≠ 0). We will prove the rank stabilization result by contradiction. Suppose the rank does not stabilize; then 1(σ_{k+1}) ≠ 1(σ_k) for infinitely many values of k. Thus there are infinitely many k′ values such that:
\[
\|\sigma_{k'+1} - \sigma_{k'}\|_2^2 \;\ge\; \sigma_{k'+1,i}^2,
\]
where i is taken such that σ_{k′+1,i} ≠ 0 but σ_{k′,i} = 0. Note that by the property of the thresholding function s^ℓ_{λ,γ}(·) we have that s^ℓ_{λ,γ}(x) ≠ 0 ⟹ |s^ℓ_{λ,γ}(x)| > λ_S. This implies that ‖σ_{k′+1} − σ_{k′}‖²₂ ≥ λ²_S for infinitely many k′ values, which contradicts the convergence f(Xk) − f(Xk+1) → 0. Thus the support of σ(Xk) converges, necessarily after finitely many iterations — leading to the existence of an iteration number K∗ after which the rank of Xk remains fixed. This completes the proof of the proposition.

Remark 4.1. If ℓ = 0, the discontinuity of the thresholding operator sλ,γ(·) (as demanded by Proposition 4.8) occurs for the MC+ penalty function as soon as γ ≤ 1. For a general ℓ > 0, discontinuity in s^ℓ_{λ,γ}(·) occurs as soon as γ ≤ 1/(ℓ + 1). Note that the condition ℓ > 0 implies that ν†(ℓ) + ℓ > 0.

Asymptotic Convergence: We now investigate the asymptotic convergence prop-

erties of the sequence Xk, k ≥ 1. Proposition 4.8 shows that under suitable assum-

ptions, the sequence rank(Xk), k ≥ 1, converges. In this situation, the existence of a

limit point of Xk is guaranteed if the singular values σ(Xk) remain bounded. It is not immediately clear whether the sequence σ(Xk) will remain bounded, since several spectral penalty functions (like the MC+ penalty) are bounded⁵. We address herein the existence of a limit point of the sequence σ(Xk), and hence of the sequence Xk. For the following proposition, we will assume that the concave penalty function σ ↦ P(σ; λ, γ) on σ ≥ 0 is differentiable with bounded gradient.

⁵Due to the boundedness of the penalty function, the boundedness of the objective function does not necessarily imply that the sequence σ(Xk) will remain bounded.


Proposition 4.9. Let Uk diag(σk) V′k denote the rank-reduced SVD of Xk. Let the matrices Ū_{m×r}, V̄_{n×r} denote a limit point of the sequence {Uk, Vk}, k ≥ 1, such that (U_{nk}, V_{nk}) → (Ū, V̄) along a subsequence nk → ∞. Let ūᵢ denote the ith column of Ū (and similarly v̄ᵢ of V̄), and let us denote Θ = [vec(PΩ(ū1v̄′1)), . . . , vec(PΩ(ūrv̄′r))]. We have the following:

(a) If rank(Θ) = r, then the sequence X_{nk} has a limit point which is a first order stationary point.

(b) If λmin(Θ′Θ) + φP > 0, then the sequence X_{nk} converges to a first order stationary point: X̄ = Ū diag(σ̄) V̄′, where σ_{nk} → σ̄.

Proof. See Appendix A.2.

Proposition 4.8 describes sufficient conditions under which the rank of the se-

quence Xk stabilizes after finitely many iterations — it does not describe the bound-

edness of the sequence Xk, which is addressed in Proposition 4.9. Note that Propo-

sition 4.9 does not imply that the rank of the sequence Xk stabilizes after finitely

many iterations (recall that Proposition 4.9 does not assume that the thresholding

operators are discontinuous, an assumption required by Proposition 4.8).

4.3.2 Computing the Thresholding Operators

The operator (4.27) requires computing a thresholded SVD of the matrix X̄k, as demonstrated by Proposition 4.1. The thresholded singular values s^ℓ_{λ,γ}(·) as in (4.35) will have many zero coordinates due to the “sparsity promoting” nature of the concave penalty. Thus, computing the thresholding operator (4.27) will typically require performing a low-rank SVD of the matrix X̄k. While direct factorization based SVD methods can be used for smaller problems (i.e., when min{m, n} is of the order of a thousand or so), for larger matrices such methods become computationally prohibitive — we thus resort to iterative methods for computing low-rank SVDs for large


scale problems. Algorithms like the block power method, also known as block QR iterations, or those based on the Lanczos method (Golub and Van Loan, 1983), are quite effective in computing the top few singular values/vectors of a matrix A, especially when the operations of multiplying Ab1 and A′b2 (for vectors b1, b2 of matching

dimensions) can be done efficiently. Indeed, such matrix-vector multiplications turn

out to be quite computationally attractive for our problem, since the computational

cost of multiplying X̄k and X̄′k with vectors of matching dimensions is quite low. This is due to the structure of:
\[
\begin{aligned}
\bar{X}_k &= \left(P_\Omega(Y) + P_\Omega^\perp(X_k) + \ell X_k\right)/(\ell+1)\\
&= \underbrace{\frac{1}{\ell+1}\,P_\Omega(Y-X_k)}_{\text{Sparse}} \;+\; \underbrace{X_k}_{\text{Low-rank}},
\end{aligned}
\tag{4.36}
\]
which admits a decomposition as the sum of a sparse matrix (given by (1/(ℓ+1)) PΩ(Y − Xk)) and a low-rank matrix⁶, i.e., Xk. Note that the sparse matrix has the same sparsity

pattern as the observed indices Ω. Decomposition (4.36) is inspired by a similar de-

composition that was exploited effectively in the algorithm Soft-Impute (Mazumder

et al., 2010), where the authors use PROPACK (Larsen, 2004) to compute the low-

rank SVDs. In this chapter, we use the Alternating Least Squares (ALS)-stylized

procedure proposed in Hastie et al. (2016), which computes a low-rank SVD by solv-

ing the following nonlinear optimization problem:

\[
\operatorname*{minimize}_{U_{m\times r},\,V_{n\times r}}\;\;\frac{1}{2}\|\bar{X}_k - UV'\|_F^2,
\]

using alternating least squares—this turns out to be, in fact, equivalent to the block

power method (Hastie et al., 2016; Golub and Van Loan, 1983), in computing a rank

⁶We note that it is not guaranteed that the Xk's will be of low rank across the iterations of the algorithm for k ≥ 1, even if they are eventually, for k sufficiently large. However, in the presence of warm-starts across (λ, γ), they are indeed empirically found to have low rank, as long as the regularization parameters are large enough to result in a small rank solution. Typically, as we have observed in our experiments, in the presence of warm-starts the rank of Xk is found to remain low across all iterations.


r SVD of the matrix X̄k. Across the iterations of NC-Impute, we pass the warm-start information in the U, V's obtained from a low-rank SVD of X̄k to compute the low-rank SVD of X̄k+1. Empirically, this warm-start strategy is found to be

significantly more advantageous than a black-box low-rank SVD stylized approach,

as used in Soft-Impute. Using warm-start information across successive iterations

(i.e., k values) leads to dramatic gains in computational speed (often reduces the total

time to compute a family of solutions by orders of magnitude), when compared to

black-box SVD stylized methods that do not rely on such warm-start strategies.
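The only operations a block power / ALS scheme needs are products of X̄k and X̄′k with vectors, and (4.36) lets these be computed without ever forming X̄k densely. A small R sketch using the Matrix package is given below; the object names are ours, with S standing for the sparse part (1/(ℓ+1))PΩ(Y − Xk) and A, B for factors of the low-rank part Xk = AB′.

```r
library(Matrix)

# Products with Xbar = S + A B' (sparse plus low-rank), without forming Xbar.
xbar_times  <- function(S, A, B, b) as.vector(S %*% b + A %*% crossprod(B, b))          # Xbar  %*% b
xbar_ttimes <- function(S, A, B, b) as.vector(crossprod(S, b) + B %*% crossprod(A, b))  # Xbar' %*% b

# Quick check against the dense product on a small random instance.
m <- 1000; n <- 500; r <- 5
S <- rsparsematrix(m, n, density = 0.01)
A <- matrix(rnorm(m * r), m, r); B <- matrix(rnorm(n * r), n, r)
b <- rnorm(n)
max(abs(xbar_times(S, A, B, b) - as.vector((S + A %*% t(B)) %*% b)))  # numerically ~ 0
```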

4.4 Numerical Experiments

In this section, we present a systematic experimental study of the statistical proper-

ties of estimators obtained from (4.3) for different choices of penalty functions. We

perform our experiments on a wide array of synthetic and real data instances.

4.4.1 Synthetic Examples

We study three different examples, where, for the true low-rank matrix M = LΦR^T,

we vary both the structure of the left and right singular vectors in L and R, as well

as the sampling scheme used to obtain the observed entries in Ω. Our basic model is

Yij = Mij + εij, where we observe entries (i, j) ∈ Ω. We consider different types of

missing patterns for Ω, and various signal-to-noise ratios for the Gaussian error term

ε, defined here to be:

\[
\mathrm{SNR} = \frac{\operatorname{Var}(\operatorname{vec}(M))}{\operatorname{Var}(\operatorname{vec}(\varepsilon))}.
\]

Accordingly, the (standardized) training and test error for the model are defined as:

\[
\text{Training Error} = \frac{\|P_\Omega(Y-\hat{M})\|_F^2}{\|P_\Omega(Y)\|_F^2} \qquad\text{and}\qquad \text{Test Error} = \frac{\|P_\Omega^\perp(L\Phi R^T-\hat{M})\|_F^2}{\|P_\Omega^\perp(L\Phi R^T)\|_F^2},
\]

where a value greater than one for the test error indicates that the computed estimate M̂ does a worse job at estimating M than the zero solution, and the training error


corresponds to the fraction of the error explained on the observed entries by the estimate M̂ relative to the zero solution.

Example-A: In our first simulation setting, we use the observation model Y_{m×n} = L_{m×r}Φ_{r×r}R^T_{r×n} + ε_{m×n}, where L and R are matrices generated from the random orthogonal model (Candes and Recht, 2009), and the singular values Φ = diag(φ1, . . . , φr) are randomly selected as φ1, . . . , φr iid∼ Uniform(0, 100). The set Ω is sampled uniformly at random. Recall that for this model, exact matrix completion in the noiseless setting is guaranteed as long as |Ω| ≥ C m r log⁴ m, for some universal constant C (Recht, 2011). Under the noisy setting, Mazumder et al. (2010) show superior performance of nuclear norm regularization vis-a-vis other matrix recovery algorithms (Cai et al., 2010; Keshavan et al., 2010) in terms of achieving smaller test error. For the purposes herein, we fix (m, n) = (800, 400) and set the fraction of missing entries to |Ωc|/mn = 0.9 and |Ωc|/mn = 0.95.
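A sketch of the Example-A setup in R (object names are ours), generating the random orthogonal factors, the uniformly missing pattern, and the SNR, training error, and test error quantities defined above:

```r
set.seed(2016)
m <- 800; n <- 400; r <- 10; miss_frac <- 0.9

L   <- qr.Q(qr(matrix(rnorm(m * r), m, r)))   # random orthonormal columns
R   <- qr.Q(qr(matrix(rnorm(n * r), n, r)))
Phi <- diag(runif(r, 0, 100))
M   <- L %*% Phi %*% t(R)                     # true low-rank matrix

snr   <- 1
noise <- matrix(rnorm(m * n, sd = sqrt(var(as.vector(M)) / snr)), m, n)
Y     <- M + noise
Omega <- matrix(runif(m * n) > miss_frac, m, n)   # roughly 10% of entries observed

train_error <- function(Mhat) sum((Y - Mhat)[Omega]^2) / sum(Y[Omega]^2)
test_error  <- function(Mhat) sum((M - Mhat)[!Omega]^2) / sum(M[!Omega]^2)
```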

Example-B: In our second setting, we also consider the observation model Y_{m×n} = L_{m×r}Φ_{r×r}R^T_{r×n} + ε_{m×n}, but we now select matrices L and R which do not satisfy the incoherence conditions required for full matrix recovery. Specifically, for the choices of (m, n, r) = (800, 400, 10) and |Ωc|/mn = 0.9, we select L and R to be block-diagonal matrices of the form L = diag(L1, . . . , L5) and R = diag(R1, . . . , R5), where Li ∈ ℝ^{160×2} and Ri ∈ ℝ^{80×2}, i = 1, . . . , 5, are random matrices with scaled Gaussian entries. The singular values are again sampled as φ1, . . . , φr iid∼ Uniform(0, 100), with

Ω being uniformly random over the set of indices. For this model, successful matrix

completion is not guaranteed even for the noiseless problem with the nuclear norm

relaxation, as the left and right singular vectors are not sufficiently spread. We would

like to investigate the performance of nonconvex regularizers based on the MC+

family of penalties in this challenging scenario. A superior performance of these

penalties over the usual nuclear norm regularization signals that nonconvex penalties

are able to weaken the strong incoherence conditions required for successful nuclear


norm matrix reconstruction.

Example-C: In our third simulation setting, for the given choice of (m,n, r) =

(100, 100, 10), we also generate Y_{m×n} = L_{m×r}Φ_{r×r}R^T_{r×n} + ε_{m×n} from the random orthogonal model as in our first setting, but we now allow the observed entries in Ω to follow a non-uniform sampling scheme. In particular, we fix Ωc = {1 ≤ i, j ≤ 100 : 1 ≤ i ≤ 50, 51 ≤ j ≤ 100} so that
\[
P_\Omega(Y) = \begin{pmatrix} Y_{11} & 0 \\ Y_{21} & Y_{22} \end{pmatrix}, \qquad \text{where } Y = \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix},
\]
with the fraction of missing entries thus being |Ωc|/mn = 0.25. This is again a

challenging simulation setting in which both the uniform (Candes and Recht, 2009)

and independent (Chen et al., 2014) sampling scheme assumptions in Ω that are

necessary for full matrix recovery are violated. Our aim again is to explore whether

the nonconvex MC+ family is able to outperform nuclear norm regularization even

in this setting where no theoretical guarantees for exact matrix reconstruction exist.

For all three settings, we choose a 100 × 25 grid of (λ, γ) values as follows. In

each simulation instance we fix λ1 = ‖PΩ(Y )‖2, the smallest value of λ for which

the nuclear norm regularized solution is zero, and set λ100 = 0.001 · λ1. Keeping

in mind that NC-Impute benefits greatly from using warm starts, we construct an

equally spaced sequence of 100 values of λ decreasing from λ1 to λ100. We choose

25 γ-values in a logarithmic grid from 5000 to 1.1. The results displayed in Figures

4.4 – 4.6 show averages of training and test errors, as well as recovered ranks of

the solution matrix M̂λ,γ for the values of (λ, γ), taken over 50 simulations under

all three problem instances. The plots including rank reveal how effective the MC+

family is at recovering the true rank while minimizing prediction error. Throughout

the simulations we keep an upper bound of the operating rank as 50.
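The corresponding (λ, γ) grid can be built directly from these definitions; a short sketch follows (continuing the R snippet above, with our own object names):

```r
PY <- ifelse(Omega, Y, 0)                      # P_Omega(Y), zeros on the unobserved entries
lambda1   <- svd(PY, nu = 0, nv = 0)$d[1]      # ||P_Omega(Y)||_2: smallest lambda giving a zero solution
lambda100 <- 0.001 * lambda1

lambda_grid <- seq(lambda1, lambda100, length.out = 100)      # equally spaced, decreasing
gamma_grid  <- exp(seq(log(5000), log(1.1), length.out = 25)) # logarithmic grid from 5000 to 1.1
```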


4.4.1.1 Discussion of Experimental Results

We devote Figures 4.4 and 4.5 to the analysis of the simpler random orthogonal model

(Example-A), leaving the more challenging coherent and non-uniform sampling set-

tings (Example-B and Example-C) for Figure 4.6. In each case, the captions detail

the results which we summarize here. The noise is quite high in Figure 4.4 with

SNR = 1 and 90% of the entries missing in both displayed settings, while the model

complexity decreases from a true rank of 10 to 5. The underlying true ranks remain

the same in Figure 4.5, but the noise level has decreased to SNR = 5 with the missing

entries increasing to 95%. For each model setting considered, all nonconvex methods

from the MC+ family outperform nuclear norm regularization in terms of prediction

performance, while members of the MC+ family with smaller values of γ are better

at estimating the correct rank. The choices of γ = 30 and γ = 20 have the best

performance in Figure 4.4 (best prediction errors around the true ranks), while more

nonconvex alternatives fare better in the high-sparsity, low-noise setting of Figure

4.5. In both figures, the performance of nuclear norm regularization is somewhat similar to the least nonconvex alternative displayed at γ = 100; however, the bias

induced in the estimation of the singular values of the low-rank matrix M leads to

the worst bias-variance trade-off among all training versus test error plots for the

settings considered.

While the nuclear norm relaxation provides a good convex lower approximation

for the rank of a matrix (Recht et al., 2010), these examples show that nonconvex

regularization methods provide a superior mechanism for rank estimation. This is

reminiscent of the performance of the MC+ penalty in the context of variable selection

within sparse, high-dimensional regression models. Although the ℓ1 penalty function represents the best convex approximation to the ℓ0 penalty, the gap bridged by the

nonconvex MC+ penalty family P (·;λ, γ) provides a better basis for model selection,

and hence rank estimation in the low-rank matrix completion setting.

For the coherent and non-uniform sampling settings of Figure 4.6, we choose the


Example-A (Low SNR, less missing entries)

[Figure 4.4 here — four panels: (a) test error vs. training error, ROM, 90% missing, SNR = 1, true rank = 10; (b) test error vs. training error, ROM, 90% missing, SNR = 1, true rank = 5; (c) test error vs. rank, ROM, 90% missing, SNR = 1, true rank = 10; (d) test error vs. rank, ROM, 90% missing, SNR = 1, true rank = 5. Curves shown for γ = +∞, 100, 80, 30, 20, 10, 5.]

Figure 4.4: (Color online) Random Orthogonal Model (ROM) simulations with SNR = 1.

The choice γ = +∞ refers to nuclear norm regularization as provided by the Soft-Impute

algorithm. The least nonconvex alternatives at γ = 100 and γ = 80 behave similarly to

nuclear norm, although with better prediction performance. The choices of γ = 5 and γ = 10

result in excessively aggressive fitting behavior for the true rank = 10 case, but improve

significantly in prediction error and recovering the true rank in the sparser true rank = 5

setting. In both scenarios, the intermediate models with γ = 30 and γ = 20 fare the best,

with the former achieving the smallest prediction error, while the latter estimates the actual

rank of the matrix. Values of test error larger than one are not displayed in the figure.


Example-A (High SNR, more missing entries)

[Figure 4.5 here — four panels: (a) test error vs. training error, ROM, 95% missing, SNR = 5, true rank = 10; (b) test error vs. training error, ROM, 95% missing, SNR = 5, true rank = 5; (c) test error vs. rank, ROM, 95% missing, SNR = 5, true rank = 10; (d) test error vs. rank, ROM, 95% missing, SNR = 5, true rank = 5. Curves shown for γ = +∞, 100, 80, 30, 20, 10, 5.]

Figure 4.5: (Color online) Random Orthogonal Model (ROM) simulations with SNR =

5. The benefits of nonconvex regularization are more evident in this high-sparsity, high-

missingness scenario. While the γ = 100 and γ = 80 models distance themselves more from

nuclear norm, the remaining members of the MC+ family essentially minimize prediction

error while correctly estimating the true rank. This is especially true in panel (d), where

the best predictive performance of the model γ = 5 at the correct rank is achieved under a

low-rank truth and high SNR setting.


Example-B and Example-C

[Figure 4.6 here — four panels: (a) test error vs. training error, Coherent, 90% missing, SNR = 10, true rank = 10; (b) test error vs. training error, NUS, 25% missing, SNR = 10, true rank = 10; (c) test error vs. rank, Coherent, 90% missing, SNR = 10, true rank = 10; (d) test error vs. rank, NUS, 25% missing, SNR = 10, true rank = 10. Curves shown for γ = +∞, 100, 80, 30, 20, 10, 5.]

Figure 4.6: (Color online) Coherent and Non-Uniform Sampling (NUS) simulations with

SNR = 10. Nonconvex regularization also proves to be a successful strategy in these chal-

lenging scenarios, particularly in the non-uniform sampling setting where the MC+ family

exhibits a monotone decrease in prediction error as γ approaches 1. Again, the model γ = 5

estimates the correct rank under high SNR settings. Although nuclear norm achieves a rel-

atively small prediction error, compared with previous simulation settings, the MC+ family

still provides a superior and more robust mechanism for regularization.


small noise scenario SNR = 10 in order to favor all considered models. Despite the

absence of any theoretical guarantees for successful matrix recovery, the nuclear norm

regularization approach achieves a relatively small prediction error in all displayed

instances. Nevertheless, the nonconvex MC+ family of penalties seems empirically

more adept at overcoming the limitations of nuclear norm penalized matrix com-

pletion in these challenging simulation settings. In particular, the most aggressive

nonconvex fitting behavior at γ = 5 achieves almost uniform best prediction perfor-

mance in the non-uniform sampling setting while correctly estimating the true rank

for the coherent model. In spite of these promising initial findings, a more thorough

theoretical investigation of whether nonconvex regularization methods are able to

overcome the incoherence and uniform sampling assumptions required for successful

matrix reconstruction in the noisy setting (Candes and Plan, 2010; Keshavan et al.,

2010) is deferred for future work.

4.4.2 Real Data Examples: MovieLens and Netflix Data Sets

We now use the real world recommendation system data sets ml100k and ml1m pro-

vided by MovieLens (http://grouplens.org/datasets/movielens/), as well as the fa-

mous Netflix competition data to compare the usual nuclear norm approach with

the MC+ regularizers. The data set ml100k consists of 100,000 movie ratings (1–5) from 943 users on 1,682 movies, whereas ml1m includes 1,000,209 anonymous ratings from 6,040 users on 3,952 movies. In both data sets, for all regularization methods

considered, a random subset of 80% of the ratings were used for training purposes;

the remaining were used as the test set.

We also choose a similar 100 × 25 grid of (λ, γ) values, but for each value of λ

in the decreasing sequence, we use an “operating rank” threshold somewhat larger

than the rank of the previous solution, with the goal of always obtaining solution

ranks smaller than the operating threshold. Following the approach of Hastie et al.

(2016), we perform row and column centering of the corresponding (incomplete) data


Real Data Example: MovieLens

[Figure 4.7 here — four panels: (a) test RMSE vs. training RMSE, MovieLens100k, 20% test data; (b) test RMSE vs. training RMSE, MovieLens1m, 20% test data; (c) test RMSE vs. rank, MovieLens100k, 20% test data; (d) test RMSE vs. rank, MovieLens1m, 20% test data. Curves shown for γ = +∞, 100, 80, 30, 20, 10, 5 (ml100k) and γ = +∞, 100, 80, 30, 20, 15, 10 (ml1m).]

Figure 4.7: (Color online) MovieLens 100k and 1m data. For each value of λ in the

solution path, an operating rank threshold (capped at 250) larger than the rank of the

previous solution was employed.

matrices as a preprocessing step.

Figure 4.7 compares the performance of nuclear norm regularization with the

MC+ family of penalties on these data sets, in terms of the prediction error (RMSE)

obtained from the left out portion of the data. While the fitting behavior at γ = 5

is overly aggressive in these instances, the choice γ = 10 achieves the best test set

RMSE with a minimum solution rank of 20 for the ml100k data. With a higher test


Real Data Example: Netflix

[Figure 4.8 here — two panels: (a) test RMSE vs. training RMSE, Netflix, test data = 1,500,000 ratings; (b) test RMSE vs. rank, Netflix, test data = 1,500,000 ratings. Curves shown for γ = +∞, 100, 20, 15, 10, 8, 6.]

Figure 4.8: (Color online) Netflix competition data. The model γ = 10 achieves optimal

test set RMSE of 0.8276 for a solution rank of 105.

RMSE, nuclear norm regularization achieves its minimum with a less parsimonious

model of rank 62. Similar results hold for the ml1m data, where the model γ = 15

achieves near optimal test RMSE at a solution rank of 115, while the best estimation

accuracy of Soft-Impute occurs for ranks well over 200.

The Netflix competition data consists of 100,480,507 ratings from 480,189 users on 17,770 movies. A designated probe set, a subset of 1,408,395 of these ratings, was distributed to participants for calibration purposes, leaving 99,072,112 for training. We did not consider the probe set as part of this numerical experiment, instead choosing 1,500,000 randomly selected entries as test data, with the remaining 97,572,112 used for training purposes. Similar to the MovieLens data, we select a 20 × 25 grid of (λ, γ) values which adaptively chooses an operating rank threshold, and also remove row and column means for prediction purposes.

As shown in Figure 4.8, the MC+ family again yields better prediction perfor-

mance under more parsimonious models. On average, and for a convergence tolerance

of 0.001 in Algorithm 4.1, the sequence of twenty models took under 10.5 hours of

computing on an Intel E5-2650L cluster with a 2.6 GHz processor. Our main goal here


is to show the feasibility of applying NC-Impute to the MC+ family in batch mode

to such a large data set — and that it works better in terms of statistical properties

when compared to the nuclear norm regularized problem.

4.5 Conclusions

In this project we present a computational study for the noisy matrix completion

problem with nonconvex spectral penalties — we consider a family of spectral penal-

ties that bridge the convex nuclear norm penalty and the rank penalty, leading to

a family of estimators with varying degrees of shrinkage and nonconvexity. We pro-

pose NC-Impute, an algorithm inspired by the EM-stylized procedure Soft-Impute (Mazumder et al., 2010), which effectively computes a two-dimensional family

of solutions with specialized warm-start strategies. The main computational bottle-

neck of our algorithm is a low-rank SVD of a matrix that can be written as the sum of

a sparse and low-rank matrix, which is performed using a block QR stylized strategy

that makes effective use of singular subspace warm-start information across itera-

tions. We discuss computational guarantees of our algorithm, including a finite time

complexity analysis to a first order stationary point. We present a systematic study

of various properties of spectral thresholding functions, a main tool in our algorithm.

We also demonstrate the impressive gains in statistical properties of our framework

on a wide array of synthetic and real data sets.


Bibliography

Aggarwal, C. and Subbian, K. (2014). Evolutionary network analysis: A survey. ACM

Computing Surveys, 47:10:1–10:36.

Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008). Mixed membership

stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood

principle. In Petrov, B. N. and Csaki, F., editors, Proceedings of the 2nd Interna-

tional Symposium on Information Theory. Akademiai Kiado, Budapest.

Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random

Matrices. Springer.

Bazzi, M., Porter, M. A., Williams, S., McDonald, M., Fenn, D. J., and Howison,

S. D. (2015). Community detection in temporal multilayer networks, and its ap-

plication to correlation networks. Multiscale Modeling and Simulation: A SIAM

Interdisciplinary Journal, to appear.

Bernau, C., Waldron, L., and Riester, M. (2014). survHD: Synthesis of High-

Dimensional Survival Analysis. R package version 0.99.1, URL https://bitbucket.org/lwaldron/survhd.

Bertsimas, D., King, A., and Mazumder, R. (2016). Best subset selection via a modern

optimization lens. Annals of Statistics (to appear).


Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010.

Borwein, J. and Lewis, A. (2006). Convex Analysis and Nonlinear Optimization.

Springer.

Breheny, P. (2013). ncvreg: Regularization Paths for SCAD- and MCP-Penalized

Regression Models. R package version 2.6-0, URL http://CRAN.R-project.org/package=ncvreg.

Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253.

Cai, J.-F., Candes, E. J., and Shen, Z. (2010). A singular value thresholding algorithm

for matrix completion. SIAM Journal on Optimization, 20:1956–1982.

Candes, E., Sing-Long, C., and Trzasko, J. D. (2013). Unbiased risk estimates for

singular value thresholding and spectral estimators. Signal Processing, IEEE Trans-

actions on, 61(19):4643–4657.

Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the

IEEE, 98:925–936.

Candes, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization.

Foundations of Computational mathematics, 9:717–772.

Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771.

Chen, K. and Lei, J. (2014). Network cross-validation for determining the number of

communities in network data. Available at arXiv:1411.1715v1.


Chen, Y., Bhojanapalli, S., Sanghavi, S., and Ward, R. (2014). Coherent matrix com-

pletion. In Proceedings of the 31st International Conference on Machine Learning,

pages 674–682. JMLR.

Chistov, A. L. and Grigor’ev, D. Y. (1984). Complexity of quantifier elimination in

the theory of algebraically closed fields. In Mathematical Foundations of Computer

Science 1984, pages 17–31. Springer.

Cho, H. and Fryzlewicz, P. (2015). Multiple-change-point detection for high dimen-

sional time series via sparsified binary segmentation. Journal of the Royal Statistical

Society, Series B, 77:475–507.

Choi, D. S., Wolfe, P. J., and Airoldi, E. M. (2012). Stochastic blockmodels with a

growing number of classes. Biometrika, 99:273–284.

Cox, D. R. (1975). Partial likelihood. Biometrika, 62(2):269–276.

Cox, D. R. and Reid, N. (2004). A note on pseudolikelihood constructed from marginal

densities. Biometrika, 91:729–737.

Cribben, I. and Yu, Y. (2015). Estimating whole brain dynamics using spectral

clustering. Available at arXiv:1509.03730v1.

Daudin, J.-J., Picard, F., and Robin, S. (2008). A mixture model for random graphs.

Statistics and Computing, 18:173–183.

Decelle, A., Krzakala, F., Moore, C., and Zdeborova, L. (2011). Asymptotic analysis

of the stochastic block model for modular networks and its algorithmic applications.

Physical Review E, 84:066106.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.


Donath, W. E. and Hoffman, A. J. (1973). Lower bounds for the partitioning of

graphs. IBM Journal of Research and Development, 17:420–425.

Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination

methods for the classification of tumors using gene expression data. Journal of the

American Statistical Association, 97(457):77–87.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression

(with discussion). Annals of Statistics, 32(2):407–499.

Fan, J., Feng, Y., Saldana, D. F., Samworth, R., and Wu, Y. (2015). SIS: Sure

Independence Screening. R package version 0.7-6, URL http://CRAN.R-project.org/package=SIS.

Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in

sparse ultra-high-dimensional additive models. Journal of the American Statistical

Association, 106(494):544–557.

Fan, J., Feng, Y., and Wu, Y. (2010). High-dimensional variable selection for Cox's proportional hazards model. IMS Collections, 6:70–86.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and

its oracle properties. Journal of the American Statistical Association, 96:1348–1360.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional

feature space. Journal of the Royal Statistical Society, Series B, 70(5):849–911.

Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection:

Beyond the linear model. Journal of Machine Learning Research, 10:2013–2038.

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models

with np-dimensionality. The Annals of Statistics, 38(6):3567–3604.


Fazel, M. (2002). Matrix Rank Minimization with Applications. PhD thesis, Stanford

University.

Fearnhead, P. (2004). Particle filters for mixture models with an unknown number of

components. Statistics and Computing, 14:11–21.

Feng, Y. and Yu, Y. (2013). Consistent cross-validation for tuning parameter selection

in high-dimensional variable selection. manuscript.

Fenn, D. J., Porter, M. A., Mucha, P. J., McDonald, M., Williams, S., Johnson, N. F.,

and Jones, N. S. (2012). Dynamical clustering of exchange rates. Quantitative

Finance, 12:1493–1520.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230.

Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics

regression tools. Technometrics, 35:109–135.

Freund, R. M., Grigas, P., and Mazumder, R. (2015). An Extended Frank-Wolfe

Method with “In-Face” Directions, and its Application to Low-Rank Matrix Com-

pletion. ArXiv e-prints.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for general-

ized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–

22.

Friedman, J., Hastie, T., and Tibshirani, R. (2013). glmnet: Lasso and Elastic-

Net Regularized Generalized Linear Models. R package version 1.9-5, URL http://CRAN.R-project.org/package=glmnet.

Gao, X. and Song, P. X.-K. (2010). Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105:1531–1540.


Ghasemian, A., Zhang, P., Clauset, A., Moore, C., and Peel, L. (2015). Detectability

thresholds and optimal algorithms for community structure in dynamic networks.

Available at arXiv:1506.06179v1.

Glaz, J., Naus, J., and Wallenstein, S. (2001). Scan Statistics. Springer, New York.

Goldenberg, A., Zheng, A. X., Fienberg, S. E., and Airoldi, E. M. (2010). A survey of

statistical network models. Foundations and Trends in Machine Learning, 2:129–

233.

Golub, G. and Van Loan, C. (1983). Matrix Computations. Johns Hopkins University

Press, Baltimore.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,

Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and

Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class

prediction by gene expression monitoring. Science, 286(5439):531–537.

Gordon, G. J., Jensen, R. V., Hsiao, L. L., Gullans, S. R., Blumenstock, J. E.,

Ramaswamy, S., Richards, W. G., Sugarbaker, D. J., and Bueno, R. (2002). Trans-

lation of microarray data into clinically relevant cancer diagnostic tests using gene

expression ratios in lung cancer and mesothelioma. Cancer Research, 62:4963–4967.

Guattery, S. and Miller, G. L. (1998). On the quality of spectral separators. SIAM

Journal on Matrix Analysis and Applications, 19:701–719.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection.

Journal of Machine Learning Research, 3:1157–1182.

Guyon, X. (1995). Random Fields on a Network: Modeling, Statistics, and Applica-

tions. Springer-Verlag, New York.


Han, Q., Xu, K. S., and Airoldi, E. M. (2015). Consistent estimation of dynamic and

multi-layer block models. In Proceedings of the 32nd International Conference on

Machine Learning, pages 1511–1520. JMLR.

Handcock, M. S., Raftery, A. E., and Tantrum, J. M. (2007). Model-based clustering

for social networks. Journal of the Royal Statistical Society, Series A, 170:301–354.

Hastie, T., Mazumder, R., Lee, J. D., and Zadeh, R. (2016). Matrix completion

and low-rank SVD via fast alternating least squares. Journal of Machine Learning

Research, to appear.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learn-

ing: Data Mining, Inference, and Prediction (Second Edition). Springer-Verlag, New

York.

Heagerty, P. J. and Lele, S. R. (1998). A composite likelihood approach to binary

spatial data. Journal of the American Statistical Association, 93:1099–1111.

Ho, Q., Song, L., and Xing, E. P. (2011). Evolving cluster mixed-membership block-

model for time-varying networks. In Proceedings of the 14th International Confer-

ence on Artificial Intelligence and Statistics, pages 342–350. JMLR.

Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). Stochastic blockmodels:

First steps. Social Networks, 5:109–137.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press.

Hunter, D. R., Krivitsky, P. N., and Schweinberger, M. (2012). Computational statis-

tical methods for social network models. Journal of Computational and Graphical

Statistics, 21:856–882.

Jaggi, M. and Sulovsky, M. (2010). A simple algorithm for nuclear norm regularized

problems. In Proceedings of the 27th International Conference on Machine Learning

(ICML-10), pages 471–478.


Jin, J. (2015). Fast community detection by SCORE. The Annals of Statistics,

43:57–89.

Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time

Data. John Wiley & Sons, New Jersey, 2nd edition.

Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community

structure in networks. Physical Review E, 83:016107.

Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from noisy

entries. Journal of Machine Learning Research, 11:2057–2078.

Larsen, R. (2004). PROPACK: software for large and sparse SVD calculations. Available

at http://sun.stanford.edu/~rmunk/PROPACK.

Latouche, P., Birmele, E., and Ambroise, C. (2012). Variational Bayesian inference

and complexity control for stochastic block models. Statistical Modelling, 12:93–

115.

Lee, N. H. and Priebe, C. E. (2011). A latent process model for time series of

attributed random graphs. Statistical Inference for Stochastic Processes, 14:231–

253.

Leisch, F., Weingessel, A., and Hornik, K. (1998). On the generation of correlated

artificial binary data. Working Paper Series, SFB, Adaptive Information Systems

and Modelling in Economics and Management Science.

Lewis, A. S. (1995). The convex analysis of unitarily invariant matrix functions.

Journal of Convex Analysis, 2:173–183.

Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics,

80:221–239.


Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery

using regularized least squares. The Annals of Statistics, 37:3498–3528.

Macon, K. T., Mucha, P. J., and Porter, M. A. (2012). Community structure in the

United Nations General Assembly. Physica A, 391:343–361.

Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15(4):661–675.

Mazumder, R., Friedman, J. H., and Hastie, T. (2011). SparseNet: coordinate de-

scent with nonconvex penalties. Journal of the American Statistical Association,

106:1125–1138.

Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization al-

gorithms for learning large incomplete matrices. Journal of Machine Learning

Research, 11:2287–2322.

Mazumder, R. and Radchenko, P. (2015). The discrete Dantzig selector: Estimat-

ing sparse linear models via mixed integer linear optimization. arXiv preprint

arXiv:1508.01922.

McDaid, A. F., Murphy, T. B., Friel, N., and Hurley, N. J. (2013). Improved Bayesian

inference for the stochastic blockmodel with application to large networks. Com-

putational Statistics and Data Analysis, 60:12–31.

Miller, J. W. and Harrison, M. T. (2014). Inconsistency of Pitman-Yor process

mixtures for the number of components. Journal of Machine Learning Research,

15:3333–3370.

Minka, T. P. (2000). Estimating a Dirichlet distribution. Technical report, Microsoft

Research.

Negahban, S. N. and Wainwright, M. J. (2012). Restricted strong convexity and

weighted matrix completion: optimal bounds with noise. Journal of Machine

Learning Research, 13:1665–1697.


Newman, M. E. J. (2004). Detecting community structure in networks. The European

Physical Journal B, 38:321–330.

Newman, M. E. J. and Girvan, M. (2004). Finding and evaluating community struc-

ture in networks. Physical Review E, 69:026113.

Nikolova, M. (2000). Local strong homogeneity of a regularized estimator. SIAM

Journal on Applied Mathematics, 61:633–658.

Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in mul-

tiple regression. The Annals of Statistics, 12(2):758–765.

Nobile, A. and Fearnside, A. (2007). Bayesian finite mixtures with an unknown

number of components: The allocation sampler. Statistics and Computing, 17:147–

162.

Oberthuer, A., Berthold, F., Warnat, P., Hero, B., Kahlert, Y., Spitz, R., Ernestus,

K., Konig, R., Haas, S., Eils, R., Schwab, M., Brors, B., Westermann, F., and

Fischer, M. (2006). Customized oligonucleotide microarray gene expression-based

classification of neuroblastoma patients outperforms current clinical risk stratifica-

tion. Journal of Clinical Oncology, 24(31):5070–5078.

Perry, P. O. and Wolfe, P. J. (2012). Null models for network data. Available at

arXiv:1201.5871v1.

Perry, P. O. and Wolfe, P. J. (2013). Point process modelling for directed interaction

networks. Journal of the Royal Statistical Society, Series B, 75:821–849.

Priebe, C. E., Conroy, J. M., Marchette, D. J., and Park, Y. (2005). Scan statistics on

Enron graphs. Computational and Mathematical Organization Theory, 11:229–247.

Priebe, C. E., Park, Y., Marchette, D. J., Conroy, J. M., Grothendieck, J., and Gorin,

A. L. (2010). Statistical inference on attributed random graphs: Fusion of graph


features and content: An experiment on time series of Enron graphs. Computational

Statistics and Data Analysis, 54:1766–1776.

Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-

corrected stochastic blockmodel. In Advances in Neural Information Processing

Systems 26, pages 3120–3128.

Rakotomamonjy, A. (2003). Variable selection using SVM-based criteria. Journal of

Machine Learning Research, 3:1357–1370.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods.

Journal of the American Statistical Association, 66:846–850.

Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine

Learning Research, 12:3413–3430.

Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions

of linear matrix equations via nuclear norm minimization. SIAM Review, 52:471–

501.

Robinson, L. F. and Priebe, C. E. (2015). Detecting time-dependent structure in

network data via a new class of latent process models. Computational Statistics,

to appear.

Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton,

New Jersey.

Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank

matrices. The Annals of Statistics, 39:887–930.

Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-

dimensional stochastic blockmodel. The Annals of Statistics, 39:1878–1915.


Rohe, K., Qin, T., and Fan, H. (2014). The highest dimensional stochastic blockmodel

with a regularized estimator. Statistica Sinica, 24:1771–1786.

Saldana, D. F., Yu, Y., and Feng, Y. (2015). How many communities are there?

Journal of Computational and Graphical Statistics, to appear.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics,

6(2):461–464.

Schweinberger, M. and Handcock, M. S. (2015). Local dependence in random graph

models: characterization, properties and statistical inference. Journal of the Royal

Statistical Society, Series B, 77:647–676.

Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence, 22:888–905.

ACM SIGKDD and Netflix (2007). Soft modelling by latent variables: the nonlinear

iterative partial least squares (NIPALS) approach. In Proceedings of KDD Cup and

Workshop. Available at http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings.html.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chap-

man & Hall/CRC, London.

Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization paths

for Cox's proportional hazards model via coordinate descent. Journal of Statistical

Software, 39(5):1–13.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P.,

Renshaw, A. A., D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff,

P. W., Golub, T. R., and Sellers, W. R. (2002). Gene expression correlates of

clinical prostate cancer behavior. Cancer Cell, 1(2):203–209.


Spielman, D. A. and Teng, S.-H. (1996). Planar graphs and finite element meshes.

In Proceedings of the 37th Annual Symposium on Foundations of Computer Science,

pages 96–105. IEEE.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution.

The Annals of Statistics, 9(6):1135–1151.

Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of

components - an alternative to reversible jump methods. The Annals of Statistics,

28:40–74.

R Core Team (2015). R: A Language and Environment for Statistical Comput-

ing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Therneau, T. M. and Lumley, T. (2015). survival: Survival Analysis. R package

version 2.38-3, URL http://CRAN.R-project.org/package=survival.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of

the Royal Statistical Society, Series B, 58:267–288.

Tibshirani, R. (1997). The lasso method for variable selection in the Cox model.

Statistics in Medicine, 16:385–395.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple

cancer types by shrunken centroids of gene expression. Proceedings of the National

Academy of Sciences of the United States of America, 99(10):6567–6572.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R.,

Botstein, D., and Altman, R. B. (2001). Missing value estimation methods for DNA

microarrays. Bioinformatics, 17(6):520–525.

Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical

Analysis, 92:1–28.


Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood

methods. Statistica Sinica, 21:5–42.

Varin, C. and Vidoni, P. (2005). A note on composite likelihood inference and model

selection. Biometrika, 92:519–528.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing,

17:395–416.

Wang, H., Tang, M., Park, Y., and Priebe, C. E. (2014). Locality statistics for

anomaly detection in time series of graphs. IEEE Transactions on Signal Process-

ing, 62(3):703–717.

Wei, Y.-C. and Cheng, C.-K. (1989). Towards efficient hierarchical designs by ratio

cut partitioning. In 1989 IEEE International Conference on Computer-Aided Design

(ICCAD-89), Digest of Technical Papers, pages 298–301.

Westveld, A. H. and Hoff, P. D. (2011). A mixed effects model for longitudinal

relational and network data, with applications to international trade and conflict.

The Annals of Applied Statistics, 5:843–872.

Xing, E. P., Fu, W., and Song, L. (2010). A state-space mixed membership blockmodel

for dynamic network tomography. The Annals of Applied Statistics, 4:535–566.

Xu, K. S. and Hero, A. O. (2014). Dynamic stochastic blockmodels for time-evolving

social networks. IEEE Journal of Selected Topics in Signal Processing, 8(4):552–

562.

Yu, Y. and Feng, Y. (2014a). APPLE: Approximate path for penalized likelihood

estimators. Statistics and Computing, 24:803–819.

Yu, Y. and Feng, Y. (2014b). Modified cross-validation for penalized high-dimensional

linear regression models. Journal of Computational and Graphical Statistics,

23(4):1009–1027.


Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave

penalty. The Annals of Statistics, 38:894–942.

Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for

high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593.

Zhao, Y., Levina, E., and Zhu, J. (2012). Consistency of community detection in

networks under degree-corrected stochastic block models. The Annals of Statistics,

40:2266–2292.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic

net. Journal of the Royal Statistical Society, Series B, 67(2):301–320.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likeli-

hood models. The Annals of Statistics, 36(4):1509–1533.


Appendix A

Technical Material for NC-Impute

Lemma 1 (Marchenko-Pastur law; Bai and Silverstein (2010)). Let $X \in \mathbb{R}^{m \times n}$, where the $X_{ij}$ are iid with $\mathbb{E}(X_{ij}) = 0$, $\mathbb{E}(X_{ij}^2) = 1$, and $m > n$. Let $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ be the eigenvalues of $Q_m = \frac{1}{m} X^T X$. Define the random spectral measure
\[
\mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{\lambda_i}.
\]
Then, assuming $n/m \to \alpha \in (0, 1]$, we have
\[
\mu_n(\cdot, \omega) \to \mu \quad \text{a.s.},
\]
where $\mu$ is a deterministic measure with density
\[
\frac{d\mu}{dx} = \frac{\sqrt{(\alpha_+ - x)(x - \alpha_-)}}{2\pi\alpha x} \, I(\alpha_- \leq x \leq \alpha_+). \tag{A.1}
\]
Here, $\alpha_+ = (1 + \sqrt{\alpha})^2$ and $\alpha_- = (1 - \sqrt{\alpha})^2$.
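The convergence in Lemma 1 is easy to visualize numerically. The following R sketch compares the empirical spectrum of $Q_m$ with the density in (A.1); the dimensions, the seed, and the standard Gaussian entries are illustrative choices only.

# Illustrative check of Lemma 1: empirical spectrum of Q_m = (1/m) X'X versus
# the Marchenko-Pastur density in (A.1).
set.seed(1)
m <- 2000; n <- 1000                      # so alpha = n/m = 0.5
X <- matrix(rnorm(m * n), m, n)           # iid entries with mean 0, variance 1
t.vals <- eigen(crossprod(X) / m, symmetric = TRUE, only.values = TRUE)$values

alpha   <- n / m
a.plus  <- (1 + sqrt(alpha))^2
a.minus <- (1 - sqrt(alpha))^2
mp.dens <- function(x) sqrt((a.plus - x) * (x - a.minus)) / (2 * pi * alpha * x)

hist(t.vals, breaks = 50, freq = FALSE,
     xlab = "eigenvalue of Q_m", main = "Empirical spectrum vs. (A.1)")
curve(mp.dens, from = a.minus, to = a.plus, add = TRUE, lwd = 2)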

A.1 Proof of Proposition 4.5

Throughout the proof, we make use of the notation $\Theta_1(\cdot)$ and $\Theta_2(\cdot)$, defined as follows. For two positive sequences $a_k$ and $b_k$, we say $a_k = \Theta_2(b_k)$ if there exists a constant $c > 0$ such that $a_k \geq c b_k$, and we say $a_k = \Theta_1(b_k)$ whenever $a_k = \Theta_2(b_k)$ and $b_k = \Theta_2(a_k)$.


We first consider the case $\lambda_n = \Theta_1(\sqrt{m})$. For simplicity, we assume $\lambda_n = \zeta\sqrt{m}$ for some constant $\zeta > 0$. Denote $\mathrm{df}(S_{\lambda_n,\gamma}(Z)) = D_{\lambda_n,\gamma}$. Adopting the notation from Lemma 1, it is not hard to verify that
\[
D_{\lambda_n,\gamma} = n\,\mathbb{E}_{\mu_n}\!\left[ s'_{\lambda_n,\gamma}(\sqrt{mt_1}) + |m-n|\,\frac{s_{\lambda_n,\gamma}(\sqrt{mt_1})}{\sqrt{mt_1}} \right] + n^2\,\mathbb{E}_{\mu_n}\!\left[ \frac{\sqrt{mt_1}\,s_{\lambda_n,\gamma}(\sqrt{mt_1}) - \sqrt{mt_2}\,s_{\lambda_n,\gamma}(\sqrt{mt_2})}{mt_1 - mt_2}\,1(t_1 \neq t_2) \right],
\]
where $t_1, t_2 \overset{\text{iid}}{\sim} \mu_n$. A quick check of the relation between $s_{\lambda_n,\gamma}$ and $g_{\zeta,\gamma}$ yields
\[
\frac{D_{\lambda_n,\gamma}}{mn} = \frac{1}{m}\,\mathbb{E}_{\mu_n} s'_{\lambda_n,\gamma}(\sqrt{mt_1}) + \left(1 - \frac{n}{m}\right)\mathbb{E}_{\mu_n} g_{\zeta,\gamma}(t_1) + \frac{n}{m}\,\mathbb{E}_{\mu_n}\!\left[ \frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,1(t_1 \neq t_2) \right].
\]
Due to the Lipschitz continuity of the functions $s_{\lambda_n,\gamma}(x)$ and $x\,g_{\zeta,\gamma}(x)$, we obtain
\[
\left| \frac{D_{\lambda_n,\gamma}}{mn} \right| \leq \frac{\gamma}{m(\gamma-1)} + \left(1 - \frac{n}{m}\right) + \frac{n}{m}\left(\frac{2\gamma-1}{2\gamma-2}\right).
\]
Hence, there exists a positive constant $C_\alpha$ such that for sufficiently large $n$,
\[
\left| \frac{D_{\lambda_n,\gamma}}{mn} \right| \leq C_\alpha \quad \text{a.s.}
\]
Let $T_1, T_2$ be two independent random variables generated from the Marchenko-Pastur distribution $\mu$. If we can show
\[
\frac{D_{\lambda_n,\gamma}}{mn} \to (1-\alpha)\,\mathbb{E}\, g_{\zeta,\gamma}(T_1) + \alpha\,\mathbb{E}\!\left[ \frac{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)}{T_1 - T_2} \right] \quad \text{a.s.},
\]
then by the Dominated Convergence Theorem (DCT), we conclude the proof in the $\lambda_n = \Theta_1(\sqrt{m})$ regime. Note immediately that
\[
\frac{1}{m}\,\mathbb{E}_{\mu_n} s'_{\lambda_n,\gamma}(\sqrt{mt_1}) \to 0 \quad \text{a.s.} \tag{A.2}
\]
Moreover, since $g_{\zeta,\gamma}(\cdot)$ is bounded and continuous, by the Marchenko-Pastur result in Lemma 1,
\[
\left(1 - \frac{n}{m}\right)\mathbb{E}_{\mu_n} g_{\zeta,\gamma}(t_1) \to (1-\alpha)\,\mathbb{E}_{\mu}\, g_{\zeta,\gamma}(T_1) \quad \text{a.s.} \tag{A.3}
\]


Since $(t_1, t_2) \overset{d}{\to} (T_1, T_2)$, and given that the discontinuity set of the measurable function $\frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,1(t_1 \neq t_2)$ has zero probability under the measure induced by $(T_1, T_2)$, by the continuous mapping theorem,
\[
\frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,1(t_1 \neq t_2) \overset{d}{\to} \frac{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)}{T_1 - T_2}\,1(T_1 \neq T_2) \quad \text{as } n \to \infty.
\]
Due to the boundedness of the function $\frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,1(t_1 \neq t_2)$,
\[
\mathbb{E}_{\mu_n}\!\left[ \frac{t_1 g_{\zeta,\gamma}(t_1) - t_2 g_{\zeta,\gamma}(t_2)}{t_1 - t_2}\,1(t_1 \neq t_2) \right] \to \mathbb{E}_{\mu}\!\left[ \frac{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)}{T_1 - T_2}\,1(T_1 \neq T_2) \right] \tag{A.4}
\]
almost surely. Combining (A.2)–(A.4) completes the proof for the $\lambda_n = \Theta_1(\sqrt{m})$ case.

When $\lambda_n = o(\sqrt{m})$, we can readily see that $\mathbb{E}_{\mu_n} 1(\sqrt{mt_1} \geq \lambda_n\gamma) \to 1$ a.s. Using that $\frac{s_{\lambda_n,\gamma}(\sqrt{mt_1})}{\sqrt{mt_1}}$ and $\frac{\sqrt{mt_1}\,s_{\lambda_n,\gamma}(\sqrt{mt_1}) - \sqrt{mt_2}\,s_{\lambda_n,\gamma}(\sqrt{mt_2})}{mt_1 - mt_2}\,1(t_1 \neq t_2)$ are bounded, we have, almost surely,
\[
\mathbb{E}_{\mu_n}\frac{s_{\lambda_n,\gamma}(\sqrt{mt_1})}{\sqrt{mt_1}} = \mathbb{E}_{\mu_n} 1(\sqrt{mt_1} \geq \lambda_n\gamma) + \mathbb{E}_{\mu_n}\!\left[ \frac{s_{\lambda_n,\gamma}(\sqrt{mt_1})}{\sqrt{mt_1}}\,1(\sqrt{mt_1} < \lambda_n\gamma) \right] \to 1
\]
and
\[
\mathbb{E}_{\mu_n}\!\left[ \frac{\sqrt{mt_1}\,s_{\lambda_n,\gamma}(\sqrt{mt_1}) - \sqrt{mt_2}\,s_{\lambda_n,\gamma}(\sqrt{mt_2})}{mt_1 - mt_2}\,1(t_1 \neq t_2) \right] = \mathbb{E}_{\mu_n} 1(\sqrt{mt_1} \geq \lambda_n\gamma)\,1(\sqrt{mt_2} \geq \lambda_n\gamma) + o(1) \to 1.
\]
Invoking the Dominated Convergence Theorem completes the proof. Similar arguments hold for the case $\lambda_n = \Theta_2(\sqrt{m})$.
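For intuition, the almost-sure limit $(1-\alpha)\,\mathbb{E}\,g_{\zeta,\gamma}(T_1) + \alpha\,\mathbb{E}\big[\{T_1 g_{\zeta,\gamma}(T_1) - T_2 g_{\zeta,\gamma}(T_2)\}/(T_1 - T_2)\big]$ can be approximated by Monte Carlo. The R sketch below is illustrative only: draws from $\mu$ are approximated by the eigenvalues of a large simulated $Q_m$, and the function g.zeta.gamma is a hypothetical placeholder, written as the limit of $s_{\lambda_n,\gamma}(\sqrt{mt})/\sqrt{mt}$ when $s_{\lambda_n,\gamma}$ is taken to be the MCP thresholding operator and $\lambda_n = \zeta\sqrt{m}$; it should be replaced by the $g_{\zeta,\gamma}$ defined in the main text.

# Hedged Monte Carlo sketch of the a.s. limit of D_{lambda_n,gamma}/(mn).
set.seed(2)
m <- 2000; n <- 1000; alpha <- n / m
zeta <- 0.6; gam <- 3
X <- matrix(rnorm(m * n), m, n)
t.pool <- eigen(crossprod(X) / m, symmetric = TRUE, only.values = TRUE)$values

# Placeholder for g_{zeta,gamma}(t): assumed MCP-based form (an assumption,
# not the thesis definition).
g.zeta.gamma <- function(t) {
  s <- sqrt(t)
  ifelse(s <= zeta, 0,
  ifelse(s <= gam * zeta, (1 - zeta / s) / (1 - 1 / gam), 1))
}

B  <- 1e5
T1 <- sample(t.pool, B, replace = TRUE)   # approximate draws from mu
T2 <- sample(t.pool, B, replace = TRUE)
ratio <- ifelse(T1 == T2, 0,
                (T1 * g.zeta.gamma(T1) - T2 * g.zeta.gamma(T2)) / (T1 - T2))
(1 - alpha) * mean(g.zeta.gamma(T1)) + alpha * mean(ratio)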

A.2 Proof of Proposition 4.9

Proof of Part (a):

Let us write the stationary conditions for every update $X_{k+1} = \arg\min_X F_\ell(X; X_k)$. We set the subdifferential of the map $X \mapsto F_\ell(X; X_k)$ to zero at $X = X_{k+1}$:
\[
\left(X_{k+1} - \big(P_\Omega(Y) + P_\Omega^\perp(X_k)\big)\right) + \ell(X_{k+1} - X_k) + U_{k+1}\nabla_{k+1}V'_{k+1} = 0, \tag{A.5}
\]


where $X_{k+1} = U_{k+1}\,\mathrm{diag}(\sigma_{k+1})\,V'_{k+1}$ is the singular value decomposition of $X_{k+1}$. Note that the term $U_{k+1}\nabla_{k+1}V'_{k+1}$ in (A.5) is a subdifferential (Lewis, 1995) of the spectral function $X \mapsto \sum_i P(\sigma_i(X); \lambda, \gamma)$, where $\nabla_{k+1}$ is a diagonal matrix whose $i$th diagonal entry is a derivative of the map $\sigma_i \mapsto P(\sigma_i; \lambda, \gamma)$ (on $\sigma_i \geq 0$), denoted by $\partial P(\sigma_{k+1,i}; \lambda, \gamma)/\partial\sigma_i$ for all $i$. Note that (A.5) can be rewritten as:
\[
P_\Omega(X_{k+1}) - P_\Omega(Y) + \underbrace{\left(P_\Omega^\perp(X_{k+1} - X_k) + \ell(X_{k+1} - X_k)\right)}_{(a)} + U_{k+1}\nabla_{k+1}V'_{k+1} = 0. \tag{A.6}
\]
As $k \to \infty$, term (a) converges to zero (see Proposition 4.7) and thus we have:
\[
P_\Omega(X_{k+1}) - P_\Omega(Y) + U_{k+1}\nabla_{k+1}V'_{k+1} \to 0. \tag{A.7}
\]
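For $\ell = 0$, condition (A.5) is the stationarity condition of a single spectral thresholding step applied to $P_\Omega(Y) + P_\Omega^\perp(X_k)$: fill the unobserved entries with the current iterate, take an SVD, and shrink the singular values. The R sketch below is a minimal illustration only; the MCP-based function sv.threshold is an assumed stand-in for the operator $s_{\lambda,\gamma}$ of the main text, and one.update is a hypothetical helper, not the packaged NC-Impute implementation.

# Minimal sketch (ell = 0) of one update whose stationarity condition has the
# form of (A.5).  The MCP-based threshold is an assumption, not the thesis's
# exact operator.
sv.threshold <- function(s, lambda, gamma) {
  ifelse(s <= lambda, 0,
  ifelse(s <= gamma * lambda, (s - lambda) / (1 - 1 / gamma), s))
}

one.update <- function(Xk, Y, Omega, lambda, gamma) {
  # Omega: logical matrix of observed entries; Y: data matrix (used on Omega)
  Z  <- ifelse(Omega, Y, Xk)               # P_Omega(Y) + P_Omega^perp(X_k)
  sv <- svd(Z)
  d  <- sv.threshold(sv$d, lambda, gamma)
  sv$u %*% (d * t(sv$v))                   # U diag(d) V'
}
# Example call: X1 <- one.update(matrix(0, nrow(Y), ncol(Y)), Y, Omega, 1, 3)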

Let us denote the $i$th column of $U_k$ by $u_{k,i}$, and use a similar notation for $V_k$ and $v_{k,i}$. Let $r_{k+1}$ denote the rank of $X_{k+1}$. Hence, we have:
\[
\sum_{i=1}^{r_{k+1}} \sigma_{k+1,i}\,P_\Omega(u_{k+1,i} v'_{k+1,i}) - P_\Omega(Y) + U_{k+1}\nabla_{k+1}V'_{k+1} \to 0.
\]
Multiplying the left and right hand sides of the above by $u'_{k+1,j}$ and $v_{k+1,j}$, we have the following:
\[
\sum_{i=1}^{r_{k+1}} \sigma_{k+1,i}\,u'_{k+1,j}P_\Omega(u_{k+1,i} v'_{k+1,i})v_{k+1,j} - u'_{k+1,j}P_\Omega(Y)v_{k+1,j} + \nabla_{k+1,j} \to 0, \tag{A.8}
\]

for $j = 1, \ldots, r_{k+1}$. Let $\bar{U}, \bar{V}$ denote a limit point of the sequence $U_k, V_k$ (which exists since the sequence is bounded), and let $\bar{r}$ be the rank of $\bar{U}$ and $\bar{V}$. Let us now study the following equations (note that we do not assume that $\sigma_k$ has a limit point):
\[
\sum_{i=1}^{\bar{r}} \bar{\sigma}_i\,\bar{u}'_j P_\Omega(\bar{u}_i \bar{v}'_i)\bar{v}_j - \bar{u}'_j P_\Omega(Y)\bar{v}_j + \bar{\nabla}_j = 0, \quad j = 1, \ldots, \bar{r}. \tag{A.9}
\]
Using the notation $\bar{\theta}_j = \mathrm{vec}\big(P_\Omega(\bar{u}_j \bar{v}'_j)\big)$ and $y = \mathrm{vec}(P_\Omega(Y))$, we note that (A.9) are the first order stationary conditions for a point $\bar{\sigma}$ for the following penalized regression problem:
\[
\underset{\sigma}{\text{minimize}} \;\; \frac{1}{2}\Big\|\sum_{j=1}^{\bar{r}} \sigma_j \bar{\theta}_j - y\Big\|_2^2 + \sum_{j=1}^{\bar{r}} P(\sigma_j; \lambda, \gamma), \tag{A.10}
\]


with $\sigma \geq 0$.

If the matrix $\bar{\Theta} = [\bar{\theta}_1, \ldots, \bar{\theta}_{\bar{r}}]$ (note that $\bar{\Theta} \in \mathbb{R}^{mn \times \bar{r}}$) has rank $\bar{r}$, then any $\bar{\sigma}$ that satisfies (A.9) is finite; in particular, the sequence $\sigma_k$ is bounded and has a limit point $\bar{\sigma}$ which satisfies the first order stationary condition (A.9).

Proof of Part (b):

Furthermore, if we assume that
\[
\lambda_{\min}(\bar{\Theta}'\bar{\Theta}) + \phi_P > 0,
\]
then (A.10) admits a unique solution $\bar{\sigma}$, which implies that $\sigma_k$ has a unique limit point, and hence the sequence $\sigma_k$ necessarily converges.