FUNCTIONAL LINEAR REGRESSION IN HIGH DIMENSIONS
by
Kaijie Xue
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Statistical Science
University of Toronto
© Copyright 2017 by Kaijie Xue
Abstract
Functional Linear Regression in High Dimensions
Kaijie Xue
Doctor of Philosophy
Graduate Department of Statistical Science
University of Toronto
2017
Functional linear regression has occupied a central position in the area of functional data
analysis, and attracted substantial research attention in the past decade. With increasingly
complex data of this type collected in modern experiments, we conduct further investigations
in response to the great need of statistical tools that are capable of handling functional objects
in high-dimensional spaces.
In the first project, we deal with the situation in which functional and non-functional data are
encountered simultaneously: observations are sampled from random processes together with a
potentially large number of scalar covariates, and it is difficult to apply existing methods for model
selection and estimation. We propose a new class of partially functional linear models to char-
acterize the regression between a scalar response and those covariates, including both func-
tional and scalar types. The new approach provides a unified and flexible framework to simul-
taneously take into account multiple functional and ultra-high dimensional scalar predictors,
identify important features and improve interpretability of the estimators. The underlying pro-
cesses of the functional predictors are considered to be infinite-dimensional, and one of our
contributions is to characterize the impact of regularization on the resulting estimators. We
establish consistency and oracle properties under mild conditions, illustrate the performance of
the proposed method with simulation studies, and apply it to air pollution data.
In the second project, we further explore functional linear regression by focusing on the large-scale
scenario in which the scalar response is related to a potentially ultra-large number of functional
predictors, leading to a more challenging model framework. The emphasis of our investigation is
to establish valid testing procedures for a general hypothesis on an arbitrary subset of regression
coefficient functions. Specifically, we exploit the techniques developed for post-regularization
inference, and propose a score test for the large-scale functional linear regression based on the
so-called de-correlated score function that separates the primary and nuisance parameters in
functional spaces. The proposed score test is shown to converge uniformly to the prescribed
significance level, and its finite-sample performance is illustrated via simulation studies.
Acknowledgements
First of all, I would like to express my sincere gratitude to my supervisor, Professor Fang
Yao, for his guidance and patience throughout my PhD program. Without his precious help and
support, it would not have been possible for me to finish this thesis.
Secondly, I would also like to show my gratitude to my thesis committee members, Professor
Keith Knight and Professor Zhou Zhou, for their insightful and constructive comments.
Thirdly, I am also very grateful to all the faculty members, especially Professor Mike
Evans, Professor Nancy Reid, Professor Keith Knight, Professor Zhou Zhou, Professor Andrey
Feuerverger and Professor Lawrence J. Brunner for teaching me insightful statistical courses.
In addition, I would also like to thank all the staff members, especially Andrea Carter,
Christine Bulguryemez and Dermot Whelan for their help and support during my research.
Finally, I would like to thank my parents for their help and support.
Contents

1 Partially Functional Linear Regression in High Dimensions
1.1 Introduction
1.2 Regularized Partially Functional Linear Regression
1.2.1 Classical functional linear model via principal components
1.2.2 Partially functional linear regression with regularization
1.2.3 Algorithms and parameter tuning
1.3 Asymptotic Properties
1.4 Simulation Studies
1.5 Application
1.6 Proof of Lemmas and Main Theorems
1.6.1 Regularity Conditions
1.6.2 Auxiliary Lemmas
1.6.3 Proof of Lemmas
1.6.4 Proofs of Main Theorems

2 Testing General Hypothesis in Large-scale FLR
2.1 Introduction
2.2 Proposed Methodology
2.2.1 Regularized estimation by group penalized least squares
2.2.2 General hypothesis and score decorrelation
2.2.3 Bootstrapped score test for general hypothesis in large-scale FLR
2.3 Asymptotic Properties
2.4 Simulation Studies
2.5 Proof of Lemmas and Main Theorems
2.5.1 Algorithm and parameter tuning
2.5.2 Technical Conditions on Regularizers
2.5.3 Notations
2.5.4 Auxiliary Lemmas
2.5.5 Proof of Lemmas
2.5.6 Proofs of Main Theorems

Bibliography
List of Tables
1.1 Simulation results with sample size n = 200 based on 200 Monte Carlo repli-
cates for Designs I and II. Shown are the Monte Carlo averages (standard errors
in parentheses) for the number of false zero functional predictors (FZf ), the
number of the false nonzero functional predictors (FNf ), the functional mean
squared error (MSEf ), the number of false zero scalar covariates (FZs), the
number of false nonzero scalar covariates (FNs), the scalar mean squared error
(MSEs), and the prediction error (PE). We first use ABIC to choose the tuning
parameter λn and a common truncation sn, then tune snj jointly with AIC by
refitting the selected model using ordinary least squares. In Design II, Step 1
results are based on the original sample in each Monte Carlo run, while Step 2
contains the improved results by fitting the penalized procedure to the selected
model in Step 1 with an additional sample of n = 200.
1.2 Simulation results with sample size n = 200 based on 200 Monte Carlo repli-
cates for Designs III and IV. Shown are the Monte Carlo averages (standard
errors in parentheses) for the number of false zero functional predictors (FZf ),
the number of the false nonzero functional predictors (FNf ), the functional
mean squared error (MSEf ), the number of false zero scalar covariates (FZs),
the number of false nonzero scalar covariates (FNs), the scalar mean squared
error (MSEs), and the prediction error (PE). We first use ABIC to choose the
tuning parameter λn and a common truncation sn, then tune snj jointly with
AIC by refitting the selected model using ordinary least squares. In Design IV,
Step 1 results are based on the original sample in each Monte Carlo run, while
Step 2 contains the improved results by fitting the penalized procedure to the
selected model in Step 1 with an additional sample of n = 200.
1.3 Simulation results using the same settings as Designs I & II in Section 4 of the
paper and Designs III & IV as above, based on 200 Monte Carlo replicates,
except the sample size n = 400.
1.4 Simulation results using the same settings as Designs I & II in Section 4 of the
paper and Designs III & IV as above, based on 200 Monte Carlo replicates,
except the variance of the regression error σ² = 2.
2.1 Results for several different combinations of null hypotheses and settings
of βj(t), measured over 300 simulations.
List of Figures
1.1 The estimated regression parameter functions and 95% confidence bands based
on 1000 bootstrap samples for PM2.5 and temperature, with sn1 = 2 and
sn2 = 2, respectively. The solid lines denote the estimated regression parameter
functions and the dashed lines denote the 95% bootstrap confidence bands.
The left and right panels are for PM2.5 and temperature, respectively.
Chapter 1
Partially Functional Linear Regression in
High Dimensions
1.1 Introduction
Functional linear regression is widely used to model the prediction of a functional predictor
through a linear operator, often realized by an integral form of a regression parameter func-
tion; see Ramsay and Dalzell [1991], Cardot et al. [2003], Cuevas et al. [2002], Yao et al.
[2005a], and Ramsay and Silverman [2005]. To capture the regression relation of the response
with a functional predictor, regularization is necessary. One common approach is functional
principal component analysis, which has been studied by Rice and Silverman [1991], and re-
cently by Yao et al. [2005b], Hall et al. [2006], Cai and Hall [2006], Zhang and Chen [2007],
and Hall and Horowitz [2007], among others. Functional linear models have been extended
to generalized functional linear models [Escabias et al., 2004, Cardot and Sarda, 2005, Muller
and Stadtmuller, 2005], varying-coefficient models [Fan and Zhang, 2000, Fan et al., 2003],
wavelet-based functional models [Morris et al., 2003], functional additive models [Muller and
Yao, 2008] and quadratic models [Yao and Muller, 2010].
Classical functional linear regression is designed to describe the relation between a real-valued
response and one functional explanatory variable. However, in many real problems, it
is common to also collect a large number of non-functional predictors. How to incorporate
scalar predictors in functional linear regression and perform model selection/regularization
is an important issue. For a standard linear regression with scalar covariates only, various
penalization procedures have been proposed and studied, including the lasso [Tibshirani, 1996],
the smoothly clipped absolute deviation [Fan and Li, 2001] and the adaptive lasso [Zou, 2006].
In this work, we develop a class of partially functional linear regression models, to han-
dle multiple functional and non-functional predictors and automatically identify important risk
factors by suitable regularization. Shin [2009] and Lu et al. [2014] have considered simi-
lar partially functional linear and quantile models, respectively, but did not deal with vari-
able selection or with multiple functional predictors and high-dimensional scalar covariates.
We propose a unified framework that regularizes each functional predictor as a whole, com-
bined with a penalty on high-dimensional scalar covariates. Due to the differences between
the functional and scalar predictors, we use two regularizing operations. Shrinkage penalties
are imposed on the effects of both functional predictors and scalar covariates to achieve model
selection and enhance interpretability, while a data-adaptive truncation that plays the role of
a tuning parameter is applied to functional predictors. We treat the functional predictors as
infinite-dimensional processes, which distinguishes our work from work that fixes the number
of principal components [Li et al., 2010]. A main contribution is to quantify the theoretical im-
pact of functional principal component estimation with diverging truncation, especially when
the number of scalar covariates is permitted to diverge at an exponential order of the sample
size.
1.2 Regularized Partially Functional Linear Regression
1.2.1 Classical functional linear model via principal components
Let X(·) be a square-integrable random function defined on a closed interval T of the real line,
with continuous mean and covariance functions denoted by E{X(t)} = µ(t) and cov{X(s), X(t)} =
K(s, t), respectively. The classical functional linear model is

Y = µY + ∫_T {X(t) − µ(t)} β(t) dt + ε,    (1.1)

where the regression parameter function β(·) is assumed to be square-integrable, and ε is a
random error independent of X(t). Mercer's theorem implies that there exists a complete and
orthonormal basis {φk} in L²(T) and a non-increasing sequence of non-negative eigenvalues
{wk} such that K(s, t) = ∑_{k=1}^∞ wk φk(s) φk(t) with ∑_{k=1}^∞ wk < ∞. We further assume that
w1 > w2 > · · · ≥ 0. Let (yi, xi), i = 1, . . . , n, be independent and identically distributed
random samples from (Y, X). The Karhunen–Loève expansion xi(t) = µ(t) + ∑_{k=1}^∞ ξik φk(t)
forms the foundation of functional principal component analysis, where the coefficients
ξik = ∫_T {xi(t) − µ(t)} φk(t) dt are uncorrelated random variables with mean zero and variances
E(ξik²) = wk, also called the functional principal component scores. Expanded on the
orthonormal eigenbasis {φk}, the regression function becomes β(t) = ∑_{k=1}^∞ bk φk(t), and the
functional linear model (1.1) can be written as yi = µY + ∑_{k=1}^∞ bk ξik + εi. The basis with
respect to which the regression parameter b is expanded is determined by the covariance function
K. This is not unnatural, since {φk} is the unique canonical basis leading to a generalized
Fourier series that gives the most rapidly convergent representation of X in the L² sense.
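To make the estimation steps above concrete, here is a minimal numerical sketch (not the thesis's implementation; the dense common grid, Riemann-sum integrals, and function names are our assumptions) that computes eigenvalues, eigenfunctions, and FPC scores from observed curves, then fits the truncated model yi ≈ µY + ∑_{k≤s} bk ξik by least squares:

```python
import numpy as np

def fpca_scores(X, n_components):
    """Eigenvalues, eigenfunctions, and FPC scores from dense curves.

    X is an (n, m) array of curves on a common grid over [0, 1]; integrals
    are approximated by Riemann sums with spacing dt = 1/m.
    """
    n, m = X.shape
    dt = 1.0 / m
    Xc = X - X.mean(axis=0)                    # center: x_i(t) - mu(t)
    K_hat = Xc.T @ Xc / n                      # sample covariance K(s, t)
    w, phi = np.linalg.eigh(K_hat * dt)        # discretized eigenproblem
    order = np.argsort(w)[::-1][:n_components]
    w, phi = w[order], phi[:, order] / np.sqrt(dt)  # rescale to unit L2 norm
    xi = Xc @ phi * dt                         # xi_ik = <x_i - mu, phi_k>
    return w, phi, xi

def fit_truncated_flm(y, X, s):
    """Least-squares fit of y on the first s estimated FPC scores."""
    w, phi, xi = fpca_scores(X, s)
    design = np.column_stack([np.ones(len(y)), xi])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta_hat = phi @ coef[1:]                  # beta(t) = sum_k b_k phi_k(t)
    return coef[0], coef[1:], beta_hat
```

Note the sign indeterminacy of eigenfunctions: each score coefficient is identified only up to the sign of its φ̂k, which cancels in the reconstructed β̂(t).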
1.2.2 Partially functional linear regression with regularization
We now consider functional linear regression with multiple functional and scalar predictors.
Suppose the data are {Y, X(·), Z}, where Y is a scalar continuous response, X(·) = {Xj(·) :
j = 1, . . . , d} are d functional predictors, and Z = (Z1, . . . , Zpn)T is a pn-dimensional vector
of scalar covariates. This is motivated by commonly encountered situations where both func-
tional and non-functional predictors may have significant impacts on the response. As with
many real examples, we assume the number of functional predictors d to be fixed, while the
number of scalar covariates pn may grow with the sample size. Specifically, we allow pn to be
ultra-high dimensional, such that log pn = O(n^α) for some α > 0. Without loss of generality,
we assume that the response Y, the functional predictors {Xj : j = 1, . . . , d} and the scalar
covariates {Zl : l = 1, . . . , pn} have been centered to have mean zero. We then model the
linear relationship between Y and (X, Z) by

Y = ∑_{j=1}^d ∫_T Xj(t) βj(t) dt + ZTγ + ε,    (1.2)

where {βj(·) : j = 1, . . . , d} are square-integrable regression parameter functions, γ =
(γ1, . . . , γpn)T contains the regression coefficients of the non-functional covariates, and ε is
a random error independent of {Xj(·) : j = 1, . . . , d} and Z, with E(ε) = 0 and var(ε) = σ².
For convenience, assume that the first qn scalar covariates are significant, while the rest are not.
In other words, the true value of the regression coefficient vector γ0 equals (γ(1)T0, γ(2)T0)T, where γ(1)0
is a qn × 1 vector corresponding to significant effects and γ(2)0 is a (pn − qn) × 1 zero vector.
We also assume that only the first g functional predictors are significant; equivalently, the true
regression functions satisfy βj0(t) ≡ 0 for j = g + 1, . . . , d. Each functional predictor Xj(·)
is an infinite-dimensional process and requires regularization. Therefore the proposed model
has a partially functional structure that combines the multiple functional and high-dimensional
scalar components into one linear framework.
Let {(yi, xi, zi) : i = 1, . . . , n} denote independent and identically distributed realizations
from the population (Y, X, Z). Let xij denote the jth component of xi for j = 1, . . . , d, and
let zil be the lth component of zi for l = 1, . . . , pn. We further write YM = (y1, . . . , yn)T
and ZM = (z1, . . . , zn)T. To estimate the functions {βj(·) : j = 1, . . . , d} and the regression
coefficients {γl : l = 1, . . . , pn}, we consider the least squares loss, which couples
βj(t) = ∑_k bjk φjk(t) with xij(t) = ∑_k ξijk φjk(t) for each j = 1, . . . , d, given the complete
orthonormal basis series {φjk}k=1,2,...,

L(b, γ | Dn) = ∑_{i=1}^n {yi − ∑_{j=1}^d ∫_T xij(t) βj(t) dt − ziTγ}²    (1.3)
             = ∑_{i=1}^n (yi − ∑_{j=1}^d ∑_{k=1}^∞ bjk ξijk − ziTγ)²,

where Dn = {(yi, xi, zi) : i = 1, . . . , n}, and b = (bT1, . . . , bTd)T with bj = (bj1, bj2, . . .)T
for each j.
for each j. It is evident that the loss function (1.3) should not be directly minimized due
to the infinite expansions of the functional predictors and high-dimensional scalar covariates,
requiring suitable regularization for both X and Z.
One primary goal for (1.2) is to extract useful information from Z and X , whereas the
classical functional linear model focuses only on a single functional predictor. It is thus essen-
tial to select and estimate the nonzero coefficients in γ and nonzero functions in b1, . . . , bd to
enhance model prediction and interpretability. To achieve simultaneous variable selection and
estimation, we introduce a shrinkage penalty function Jλ(·) associated with a tuning parameter
λ. Many penalty choices are available for variable selection. We use the smoothly clipped
absolute deviation penalty of Fan and Li [2001], whose derivative is J′λ(|γ|) = λ[I(|γ| ≤
λ) + I(|γ| > λ)(aλ − |γ|)+/{(a − 1)λ}], with a = 3.7 as suggested by Fan and Li [2001] for
implementation.
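In code, this derivative is a one-liner (the function name and the vectorized form are ours, not from the thesis):

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """SCAD penalty derivative J'_lam(theta) for theta >= 0.

    Equal to lam on [0, lam], decaying as (a*lam - theta)_+ / (a - 1)
    on (lam, a*lam], and zero beyond a*lam, so large coefficients are
    eventually left unpenalized.
    """
    theta = np.asarray(theta, dtype=float)
    taper = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, taper)
```

The taper is continuous at θ = λ, which is what makes the penalty's solution path continuous, unlike hard thresholding.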
Due to the infinite dimensionality of the functional predictors, smoothing and regulariza-
tion are necessary in estimation. It is sensible to control the complexity of βj(t) as a whole
function, rather than treating its basis terms as separate predictors. We adopt the simple yet ef-
fective truncation approach in the spirit of controlling smoothness as in classical nonparametric
regression. Denote the truncated form by Xj^{sj}(t) = µj(t) + ∑_{k=1}^{sj} ξjk φjk(t) for j = 1, . . . , d,
where sj is the truncation parameter. Correspondingly, for βj(t) = ∑_{k=1}^∞ bjk φjk(t), write
bj = (b(1)Tj, b(2)Tj)T, where b(1)j = (bj1, . . . , bj,sj)T and b(2)j = (bj,sj+1, . . .)T. Unlike with non-functional
effects, there is no underlying separation between b(1)j and b(2)j. For the sake of
adaptivity, we allow sj to vary with the sample size, sj ≡ snj, so that it behaves like a
smoothing parameter balancing the trade-off between bias and variance. The coefficients in
b(2)j associated with higher-order basis functions are nonzero but decay rapidly. It is of interest
to study the impact of snj on the convergence rates of the resulting estimators.
In practice we do not observe the entire trajectories xij, and only have intermittent noisy
measurements Wijl = xij(tijl) + εijl, where the εijl are independent and identically
distributed measurement errors independent of xij, satisfying E(εijl) = 0 and var(εijl) = σ²xj for
i = 1, . . . , n and l = 1, . . . , mij. When the repeated observations are sufficiently dense for
each subject, a common practice is to run a smoother through {(tijl, Wijl) : l = 1, . . . , mij},
and then the estimates x̂ij, i = 1, . . . , n, j = 1, . . . , d, are used to construct the covariance,
eigenvalues/eigenbasis, and functional principal component scores; details are given in Section 1.6.
A theoretical justification of the asymptotic equivalence between the estimators obtained from
x̂ij and those from the true xij is also given in Section 1.6. The unobservable functional principal
component scores {ξijk : k = 1, . . . , sn; j = 1, . . . , d; i = 1, . . . , n} are estimated
by functional principal component analysis based on the observed data {(tijl, Wijl) : l =
1, . . . , mij; j = 1, . . . , d; i = 1, . . . , n}. Therefore we minimize
min_{b(1), γ} ∑_{i=1}^n (yi − ∑_{j=1}^d ∑_{k=1}^{snj} ξ̂ijk bjk − ziTγ)² + 2n ∑_{j=1}^d Jλjn(‖b(1)j‖) + 2n ∑_{l=1}^{pn} Jλn(|γl|),    (1.4)
given suitable choices of snj , λn and λjn, where ‖b(1)j ‖ is the Euclidean norm invoking a group
penalty that shrinks the regression functions of unimportant functional predictors to zero. To
regularize all predictors on a comparable scale, one often standardizes the predictors before
imposing a penalty associated with a common tuning parameter [Fan and Li, 2001]. Thus we
standardize (z1l, . . . , znl)T to have unit variance. The variability of the jth functional predictor
can be approximated by ∑_{k=1}^{snj} ŵjk, where ŵjk is the kth estimated eigenvalue of the jth predictor.
Since standardization is equivalent to adding weights to the penalty function, we suggest
using λjn = λn(∑_{k=1}^{snj} ŵjk)^{1/2}, which simplifies both the computation and theoretical analysis.
The estimated regression parameter functions are β̂j(t) = ∑_{k=1}^{snj} b̂jk φ̂jk(t).
1.2.3 Algorithms and parameter tuning
The optimization of (1.4) can be seen as a group smoothly clipped absolute deviation problem
with different weights on penalties, and each individual γl can be treated as a group of size one.
We propose two algorithms to solve the minimization problem (1.4), adapting to the
dimension pn. Generally, when pn is moderately large, say pn < n, we modify the local linear
approximation algorithm [Zou and Li, 2008], which inherits the computational efficiency and
sparsity of lasso-type solutions. For ultra-high pn, especially pn ≫ n, the local linear approximation
algorithm may not be applicable, and then we modify the concave convex procedure
used in Kim et al. [2008]. The following gives the details. Recall that YM = (y1, . . . , yn)T and
ZM = (z1, . . . , zn)T, where zi = (zi1, . . . , zipn)T. In addition, Mj is an n × sn matrix with (i, k)th
element ξ̂ijk, M = (M1, . . . , Md), and N = (M, ZM) = (N1, . . . , Nn)T is an n × (dsn + pn)
matrix. Further, η = (b(1)T, γT)T. The solution to (1.4) is equivalent to

argmin_η (2n)^{−1}‖YM − Nη‖² + ∑_{r=1}^R Jλr(‖ηr‖),

where R = d + pn. The tuning parameter λr = λrn with group size Kr = sn if r = 1, . . . , d,
and λr = λn with group size Kr = 1 if r = d + 1, . . . , d + pn.
When pn is moderately large, say pn < n, one can modify the local linear approximation
algorithm of Zou and Li [2008], which inherits the computational efficiency and sparsity of
lasso-type solutions. Denote the initial estimate from the ordinary least squares solution by
η(0), and solve η(1) = argmin_η (2n)^{−1}‖YM − Nη‖² + ∑_{r=1}^R J′λr(‖η(0)r‖)‖ηr‖. Since
some of the J′λr(‖η(0)r‖) are zero, we use an algorithm similar to that proposed by Zou and Li [2008].
Denote V = {r : J′λr(‖η(0)r‖) = 0}, W = {r : J′λr(‖η(0)r‖) > 0}, N = (NV, NW) and
η(1) = (η(1)TV, η(1)TW)T. Our algorithm is as follows:
1. Reparameterize the response vector by Y*M = Nη(0), and reparameterize the observed
data matrix by N*r = Nr Kr^{1/2}/J′λr(‖η(0)r‖) for r ∈ W, and N*r = Nr for r ∈ V.

2. Let PV denote the projection matrix onto the space spanned by {N*r : r ∈ V}, where PV = NV(NTV NV)^{−1}NTV.
Then calculate Y**M = Y*M − PV Y*M and N**W = N*W − PV N*W.

3. Find η*W = argmin_η (2n)^{−1}‖Y**M − N**W η‖² + ∑_{r∈W} Kr^{1/2}‖ηr‖.

4. Compute η*V = (N*TV N*V)^{−1} N*TV (Y*M − N*W η*W).

5. Let η(0) = η(1), where η(1)V = η*V and η(1)r = η*r Kr^{1/2}/J′λr(‖η(0)r‖) for r ∈ W.
We repeat steps 1–5 until convergence; the final η(0) is the regularized estimator. Step 3 essen-
tially solves a group lasso, so we adopt the shooting algorithm of Fu [1998] and Yuan and Lin
[2006].
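A toy version of this scheme for the special case where every group has size one, so that steps 1–5 collapse to an iteratively reweighted lasso solved by coordinate descent (the scalar analogue of the shooting algorithm; the function names, iteration counts, and ridge-stabilized initializer are our choices, not the thesis's):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lla_scad(Y, N, lam, a=3.7, n_outer=10, n_cd=50):
    """Local linear approximation for SCAD, groups of size one only.

    Each outer pass freezes the tangent weights J'_lam(|eta_r^(0)|) and
    solves the resulting weighted lasso by cyclic coordinate descent.
    """
    n, p = N.shape
    # ridge-stabilized least squares stands in for the OLS initializer
    eta = np.linalg.solve(N.T @ N + 1e-6 * np.eye(p), N.T @ Y)
    col_sq = (N ** 2).sum(axis=0) / n
    for _ in range(n_outer):
        absx = np.abs(eta)
        # SCAD derivative as the per-coordinate lasso weight
        w = np.where(absx <= lam, lam,
                     np.maximum(a * lam - absx, 0.0) / (a - 1.0))
        for _ in range(n_cd):
            for r in range(p):
                partial = Y - N @ eta + N[:, r] * eta[r]  # leave-one-out residual
                z = N[:, r] @ partial / n
                eta[r] = soft_threshold(z, w[r]) / col_sq[r]
    return eta
```

Because the SCAD derivative vanishes for large coefficients, strong effects end up essentially unpenalized after the first reweighting pass, which is the source of the oracle behavior discussed later.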
For pn ≫ n, it is likely that N*TV N*V in step 4 is singular, so the local linear approximation
algorithm is inapplicable. We modify the concave convex procedure used in Kim
et al. [2008]. Let J̃λr(‖ηr‖) = Jλr(‖ηr‖) − λr‖ηr‖ for each r. The convex part of the objective
function is Cvex(η) = (2n)^{−1}‖YM − Nη‖² + ∑_{r=1}^R λr‖ηr‖, and the concave part is
Ccav(η) = ∑_{r=1}^R J̃λr(‖ηr‖). Begin with the initial estimator η(0) = 0 and iteratively update the
solution until convergence,

η(1) = argmin_η {Cvex(η) + ∇Ccav(η(0))Tη} = argmin_η {M(η | η(0)) + ∑_{r=1}^R λr‖ηr‖},

where M(η | η(0)) = (2n)^{−1}‖YM − Nη‖² + ∇Ccav(η(0))Tη is quadratic in η. The proximal
gradient method of Parikh and Boyd [2013] is adopted to solve the above minimization
problem.
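A compact sketch of one such concave-convex update, with the inner problem solved by proximal gradient and group soft-thresholding (the group layout, fixed step size, and iteration count are our simplifications, not the thesis's implementation):

```python
import numpy as np

def group_soft_threshold(v, t):
    """Prox of t * ||v||_2: shrink the whole group toward zero."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def cccp_step(Y, N, groups, lams, eta0, a=3.7, n_prox=500):
    """One concave-convex (CCCP) update for the SCAD group problem.

    Minimizes M(eta | eta0) + sum_r lam_r ||eta_r|| by proximal gradient,
    where M collects the quadratic loss plus the linearized concave part.
    `groups` is a list of index arrays; `lams` the per-group lam_r.
    """
    n, p = N.shape
    # gradient of the linearized concave part sum_r Jtilde_r at eta0
    g = np.zeros(p)
    for idx, lam_r in zip(groups, lams):
        nrm = np.linalg.norm(eta0[idx])
        if nrm > 0:
            # Jtilde' = J' - lam_r, applied along the group direction
            jp = lam_r if nrm <= lam_r else max(a * lam_r - nrm, 0.0) / (a - 1.0)
            g[idx] = (jp - lam_r) * eta0[idx] / nrm
    step = 1.0 / np.linalg.eigvalsh(N.T @ N / n).max()  # 1 / Lipschitz constant
    eta = eta0.copy()
    for _ in range(n_prox):
        grad = N.T @ (N @ eta - Y) / n + g
        z = eta - step * grad
        for idx, lam_r in zip(groups, lams):
            eta[idx] = group_soft_threshold(z[idx], step * lam_r)
    return eta
```

Repeating `cccp_step` from the previous solution reproduces the outer loop described above; only matrix-vector products and group-wise shrinkage are needed, so nothing requires inverting N*TV N*V.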
Two sets of tuning parameters play crucial roles in the penalized procedure (1.4). The
parameter λn in the smoothly clipped absolute deviation penalty directly controls the sparsity of both
the functional and non-functional predictors. Wang et al. [2007] showed that minimizing the
BIC can identify the true model consistently, while generalized cross-validation might lead to
over-fitting. The truncation parameters snj control the dimensions of the functional spaces that
approximate the true functional parameters. Previous work mostly chose snj based on the functional
principal component representation, such as leave-one-curve-out cross-validation [Rice
and Silverman, 1991] and the pseudo-AIC [Yao et al., 2005a]. However, a sensible tuning
criterion for snj in a regression setting should take into account its impact on the response. The
process Xj is infinite-dimensional, and its coefficient function βj(t) = ∑_{k=1}^∞ bjk φjk(t) does
not have a finite cut-off. This is similar to the situation in which the true model does not lie in the
space formed by finite-dimensional candidate models. Therefore we propose a hybrid tuning
procedure, which in principle combines BIC for tuning λn and AIC for choosing snj, due to the
infinite-dimensional parameter spaces of βj, j = 1, . . . , d. In practice it is computationally
prohibitive to choose {snj : j = 1, . . . , d} simultaneously in the penalized procedure (1.4).
Table 1.1 suggests that, when using a common truncation, the selection of both functional
and scalar covariates is accurate and stable for a wide range of sn. Thus we propose to use a
common truncation parameter sn when solving (1.4), then refit the selected model with the
significant functional and scalar predictors using ordinary least squares, while the different truncation
parameters snj are tuned simultaneously by AIC for the retained functional predictors.
Specifically, for a fixed pair (sn, λn), the ABIC criterion is defined as

ABIC(sn, λn) = log{RSS(sn, λn)} + 2 ĝ(sn, λn) sn/n + n^{−1} df(sn, λn) log(n),

where

RSS(sn, λn) = ∑_{i=1}^n {yi − ∑_{j=1}^d ∑_{k=1}^{sn} ξ̂ijk b̂jk(sn, λn) − ziT γ̂(sn, λn)}²,

and ĝ(sn, λn) is the number of non-zero estimates of the regression functions, ĝ(sn, λn) =
∑_{j=1}^d I(β̂j; sn, λn). The degrees of freedom df(sn, λn) equal I(γ̂; sn, λn) + ∑_{j=1}^d ∑_{k=1}^{sn} ŵjk I(β̂j; sn, λn),
with I(γ̂; sn, λn) denoting the number of non-zero elements in γ̂. This procedure requires estimation
using the whole data only once and is computationally fast.
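Once a penalized fit is summarized, the criterion itself is a few lines of arithmetic; a minimal numeric sketch (the argument layout is our assumption):

```python
import numpy as np

def abic(y, fitted, sn, n_selected_gammas, eigvals_selected):
    """ABIC(sn, lambda_n) for one fitted penalized model.

    `eigvals_selected` holds, for each functional predictor kept in the
    model, its first sn estimated eigenvalues; the degrees of freedom
    follow the definition in the text.
    """
    n = len(y)
    rss = float(np.sum((np.asarray(y) - np.asarray(fitted)) ** 2))
    g_hat = len(eigvals_selected)                 # number of nonzero beta_j
    df = n_selected_gammas + sum(float(np.sum(w)) for w in eigvals_selected)
    return np.log(rss) + 2.0 * g_hat * sn / n + df * np.log(n) / n
```

The pair (sn, λn) minimizing this value over a two-dimensional grid is then carried into the refit step.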
For the refit step, denote the index set of the selected functional predictors by D ⊂ {1, . . . , d},
and the index set of the selected scalar covariates by S ⊂ {1, . . . , pn}. We minimize

AIC({snj : j ∈ D}) = log{RSS({snj : j ∈ D})} + 2 n^{−1} ∑_{j∈D} snj,

with respect to combinations of {snj : j ∈ D}, where

RSS({snj : j ∈ D}) = ∑_{i=1}^n {yi − ∑_{j∈D} ∑_{k=1}^{snj} ξ̂ijk b̂*jk(snj) − ∑_{l∈S} zil γ̂*l(snj)}²,

and b̂*jk(snj) and γ̂*l(snj) are the refitted values using ordinary least squares.
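The refit-and-tune loop can be sketched as an exhaustive search over truncation combinations (the helper name and argument shapes are our assumptions; for a large retained set D one would restrict the grid):

```python
import itertools
import numpy as np

def tune_snj_by_aic(y, score_mats, z_sel, sn_grid):
    """Grid search over truncation combinations {snj : j in D} by AIC.

    `score_mats[j]` is an (n, s_max) matrix of estimated FPC scores for
    the jth retained functional predictor, and `z_sel` collects the
    selected scalar covariates; each candidate is refitted by OLS.
    """
    n = len(y)
    best_aic, best_combo = np.inf, None
    for combo in itertools.product(sn_grid, repeat=len(score_mats)):
        design = np.column_stack(
            [scores[:, :s] for scores, s in zip(score_mats, combo)] + [z_sel]
        )
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ coef) ** 2))
        aic = np.log(rss) + 2.0 * sum(combo) / n
        if aic < best_aic:
            best_aic, best_combo = aic, combo
    return best_combo
```

Because only the selected predictors enter the refit, the number of candidate combinations |sn_grid|^|D| stays small in typical applications.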
1.3 Asymptotic Properties
Denote the true values of b(1) and γ by b(1)0 and γ0, and similarly for the rest of the parameters.
Recall that the boundedness of the covariance functions Kj(s, t) and regression operators
implies that ∑_{k=1}^∞ wjk < ∞ and ∑_{k=1}^∞ b²jk0 < ∞. We impose mild conditions on the decay rates
of the eigenvalues {wjk} and regression coefficients {bjk0}, similar to those adopted by Hall and
Horowitz [2007] and Lei [2014]. We assume that, for j = 1, . . . , d:

(A1) wjk − wj(k+1) ≥ C k^{−a−1} for k ≥ 1.

This implies that wjk ≥ C k^{−a}. As the covariance functions K1, . . . , Kd are bounded, one has
a > 1. Regarding the regression function βj(·), in order to prevent the coefficients bjk0 from
decreasing too slowly, we assume that

(A2) |bjk0| ≤ C k^{−b} for k > 1.

The decay conditions are needed only to control the tail behaviors for large k, and so are not as
restrictive as they appear. Without loss of generality, we use a common truncation parameter
sn in the theoretical analysis. It is important to control sn appropriately. On one hand, sn cannot be
too large, due to increasingly unstable functional principal component estimates,

(A3) (sn^{2a+2} + sn^{a+4})/n = o(1).

On the other hand, sn cannot be too small, so that the covariances between Z and the unobservable
{ξjk : k ≥ sn + 1} are asymptotically negligible,

(A4) sn^{2b−1}/n → ∞ as n → ∞.

Combining (A3) and (A4) entails b > max(a + 3/2, a/2 + 5/2) as a sufficient condition for such
an sn to exist. This implies that the regression function is smoother than the lower bound on
the smoothness of Kj. Regarding the dimension of scalar covariates, assume that the number
of significant covariates satisfies

(A5) sn^{a+2} qn²/n = o(1).

Such qn = o(n^{1/2} sn^{−a/2−1}) does exist and is allowed to diverge with the sample size given (A3).
The dimension of the candidate set, pn, is allowed to be ultra-high,

(A6) pn = O{exp(n^α)} for some α ∈ (0, 1/2).

Lastly, we require the following to hold for the tuning parameter λn and the sparsity of γ to
achieve consistent estimation:

(A7) λn = o(1), max{n^{2α−1}, n^{−1}(qn + sn)} = o(λn²), and min_{l=1,...,qn} |γl0|/λn → ∞.

We defer to Section 1.6 the standard conditions (B1)–(B5) on the underlying processes xij,
how the data are sampled and smoothed, and the moments of the scalar covariates, followed by the
auxiliary lemmas and proofs.
To facilitate the theoretical analysis, we re-parameterize by writing b̃jk = wjk^{1/2} bjk, so that the
functional principal component scores serving as predictor variables are on a common scale
of variability. This re-parameterization is only used for technical derivations, and does not
appear in the estimation procedure. Let η̃ = (b̃(1)T, γT)T, where b̃(1) = (b̃(1)T1, . . . , b̃(1)Td)T,
b̃(1)j = Aj b(1)j, and Aj is the sn × sn diagonal matrix with Aj(k, k) = wjk^{1/2}. Then the minimization
of (1.4) is equivalent to minimizing

Qn(η̃) = ∑_{i=1}^n {yi − ∑_{j=1}^d ∑_{k=1}^{sn} (ξ̂ijk wjk^{−1/2}) b̃jk − ziTγ}² + 2n ∑_{l=1}^{pn} Jλn(|γl|) + 2n ∑_{j=1}^d Jλjn(‖b̃(1)j‖).
Theorem 1 establishes the estimation and selection consistency for both the functional and
scalar regression parameters. For a random variable ε with mean zero, we call ε subgaussian
if there exists some constant C1 > 0 such that pr(|ε| > t) ≤ exp(−2^{−1} C1 t²) for all t ≥ 0.
Let b̃̂(1) denote the estimate of b̃(1).

Theorem 1 If ε1, . . . , εn are independent and identically distributed subgaussian random variables,
then under conditions (A1)–(A7) and (B1)–(B5), there exists a local minimizer η̃̂ =
(b̃̂(1)T, γ̂T)T of Qn(η̃) such that ‖η̃̂ − η̃0‖ = Op[{(qn + sn)/n}^{1/2}] and pr(γ̂(2) = 0, b̃̂(1)j = 0, j =
g + 1, . . . , d) → 1.
The estimation consistency result is expressed in terms of b̃(1), not the original parameter
b(1) = (b(1)T1, . . . , b(1)Td)T. For estimation, given b̂(1)j = Aj^{−1} b̃̂(1)j, it is easy to deduce that
‖β̂j − βj0‖²L2 = Op{sn^a (qn + sn)/n}, where β̂j = ∑_{k=1}^{sn} b̂jk φ̂jk and βj0 = ∑_{k=1}^∞ bjk0 φjk.
Theorem 2 establishes the asymptotic normality of the $q_n$-dimensional vector $\hat\gamma^{(1)}$. Write $\Sigma_1 = E(z_i^{(1)} z_i^{(1)T})$ and $\hat\Sigma_1 = n^{-1}\sum_{i=1}^{n} z_i^{(1)} z_i^{(1)T}$ with $z_i^{(1)} = (z_{i1}, \dots, z_{iq_n})^{T}$.

Theorem 2 If $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed subgaussian random variables and $q_n = o(n^{1/3})$, then under conditions (A1)–(A7) and (B1)–(B5), for the local minimizer in Theorem 1, $n^{1/2} A_n \hat\Sigma_1 (\hat\gamma^{(1)} - \gamma^{(1)}_0) \to N(0, \sigma^2 H^* + B^*)$ in distribution, for any $r \times q_n$ matrix $A_n$ such that $G = \lim_{n\to\infty} A_n A_n^{T}$ is positive definite, where $\sigma^2 = \mathrm{var}(\varepsilon)$, $H^* = \lim_{n\to\infty} A_n \Sigma_1 A_n^{T}$, and $B^* = \lim_{n\to\infty} A_n B_n A_n^{T}$ with
\[
B_n = \mathrm{cov}\Big[ \sum_{j=1}^{g}\sum_{k=1}^{s_n}\sum_{v \neq k} b_{jk0} (w_{jk} - w_{jv})^{-1} \langle \Xi_j, \phi_{jv} \rangle \int (x_{ij} \otimes x_{ij})\, \phi_{jk}\phi_{jv} \Big],
\]
where $\Xi_j = (\Xi_{j1}, \dots, \Xi_{jq_n})^{T}$, $E\{X_j(t) Z_l\} = \Xi_{jl}(t)$, and $(x_{ij} \otimes x_{ij})(s,t) = x_{ij}(s)x_{ij}(t)$.
The asymptotic covariance is inflated by estimating the unobservable functional principal com-
ponent scores. The inflation is quantified by a convergent sequence Bn associated with the
truncation size sn.
1.4 Simulation Studies
The simulated data $y_i$, $i = 1, \dots, n$, are generated from the model
\[
y_i = \sum_{j=1}^{d} \int_0^1 \beta_j(t)\, x_{ij}(t)\, dt + z_i^{T}\gamma + \varepsilon_i = \sum_{j=1}^{d}\sum_{k} b_{jk}\,\xi_{ijk} + z_i^{T}\gamma + \varepsilon_i,
\]
with $d = 4$ functional predictors and $p_n$ scalar covariates, where the errors $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed as $N(0, \sigma^2)$, and $\gamma$ is the vector of scalar coefficients. The functional predictors have mean zero and covariance function derived from the Fourier basis $\phi_{2\ell-1}(t) = 2^{1/2}\cos\{(2\ell-1)\pi t\}$ and $\phi_{2\ell}(t) = 2^{1/2}\sin\{(2\ell-1)\pi t\}$, $\ell = 1, \dots, 25$, $t \in \mathcal{T} = [0, 1]$.
The underlying regression function is $\beta_j(t) = \sum_{k=1}^{50} b_{jk}\phi_k(t)$, a linear combination of the eigenbasis. The scalar covariates $z_i = (z_{i1}, \dots, z_{ip_n})^{T}$ are jointly normal with mean zero, unit variance and AR(0·5) correlation structure. Next, we describe how to generate the $d = 4$ functional predictors $x_{ij}(t)$. For $j = 1, \dots, 4$, define $V_{ij}(t) = \sum_{k=1}^{50}\xi_{ijk}\phi_k(t)$, where the $\xi_{ijk}$, $i = 1, \dots, n$, are independent and identically distributed $N(0, 16k^{-2})$ across $i$ and $j$. The four functional predictors are then defined through the linear transformations
\[
x_{i1} = V_{i1} + 0{\cdot}5(V_{i2} + V_{i3}), \quad x_{i2} = V_{i2} + 0{\cdot}5(V_{i1} + V_{i3}), \quad x_{i3} = V_{i3} + 0{\cdot}5(V_{i1} + V_{i2}), \quad x_{i4} = V_{i4}.
\]
Here the first three functional predictors are correlated with each other. To be more realistic, we allow a moderate correlation between $V_{i1}$ and $z_i$ for $i = 1, \dots, n$, by specifying the correlation structure between $\xi_{i1} = (\xi_{i11}, \xi_{i12}, \xi_{i13}, \xi_{i14})^{T}$ and $z_i = (z_{i1}, \dots, z_{ip_n})^{T}$ as $\mathrm{corr}(\xi_{i1k}, z_{il}) = r^{|k-l|+1}$ $(k = 1, \dots, 4;\ l = 1, \dots, p_n)$, with $r = 0{\cdot}2$. For the actual observations, we assume they are realizations of $x_{ij}(\cdot)$, $j = 1, 2, 3, 4$, at 100 equally spaced times $t_{ijl} \in \mathcal{T}$, $l = 1, \dots, 100$, contaminated with independent and identically distributed noise $\varepsilon_{ijl} \sim N(0, 1)$.
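As a concrete illustration, the data-generating mechanism above can be sketched as follows. This is our own sketch: for brevity it draws the scores and the scalar covariates independently, omitting the cross-correlation $\mathrm{corr}(\xi_{i1k}, z_{il}) = r^{|k-l|+1}$ described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_n, K = 200, 20, 50

# Fourier eigenbasis on a grid of 100 equally spaced times in [0, 1]
t = np.linspace(0, 1, 100)
phi = np.empty((K, t.size))
phi[0::2] = [np.sqrt(2) * np.cos((2 * l - 1) * np.pi * t) for l in range(1, 26)]
phi[1::2] = [np.sqrt(2) * np.sin((2 * l - 1) * np.pi * t) for l in range(1, 26)]

# scores xi_ijk ~ N(0, 16 k^{-2}), drawn independently across i and j
sd = 4.0 / np.arange(1, K + 1)
xi = rng.normal(size=(n, 4, K)) * sd            # (i, j, k)
V = xi @ phi                                    # V_ij(t) on the grid

# linear transformations defining the functional predictors
x = np.empty_like(V)
x[:, 0] = V[:, 0] + 0.5 * (V[:, 1] + V[:, 2])
x[:, 1] = V[:, 1] + 0.5 * (V[:, 0] + V[:, 2])
x[:, 2] = V[:, 2] + 0.5 * (V[:, 0] + V[:, 1])
x[:, 3] = V[:, 3]

# AR(0.5) scalar covariates and N(0, 1) observation noise
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p_n), np.arange(p_n)))
z = rng.multivariate_normal(np.zeros(p_n), Sigma, size=n)
W = x + rng.normal(size=x.shape)                # noisy discrete observations
```

The noisy matrix `W` plays the role of the raw observations that are subsequently smoothed before functional principal component analysis.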
We use 200 Monte Carlo runs for model assessment. Since inferences on both the parametric component $\gamma$ and the functional components $\beta_j$ are of interest, we report the Monte Carlo averages of the numbers of false nonzero and false zero functional predictors, and the functional mean squared error $\mathrm{MSE}_f = \sum_{j=1}^{d} E(\|\hat\beta_j - \beta_j\|^2_{L^2})$. For the scalar covariates, we report the Monte Carlo averages of the numbers of false nonzero and false zero scalar covariates, and the scalar mean squared error $\mathrm{MSE}_s = E(\|\hat\gamma - \gamma\|^2)$. The prediction error is assessed using an independent test set of size $N = 1000$ for each Monte Carlo repetition, $\mathrm{PE} = N^{-1}\sum_{i=1}^{N}(y_i^* - \hat y_i^*)^2 - \sigma^2$, where $x_{ij}^*, z_i^*, y_i^*$, $j = 1, \dots, 4$, are the testing data generated from the same model, and the predictions are $\hat y_i^* = \sum_j\sum_k \hat\xi_{ijk}^* \hat b_{jk} + z_i^{*T}\hat\gamma$, obtained by plugging in estimates from the corresponding training sample.
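The assessment criteria above are straightforward to compute; a minimal sketch (ours, with a hand-rolled trapezoid rule to keep the block self-contained):

```python
import numpy as np

def trap(f, t):
    """Trapezoid-rule integral of f over grid t."""
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(t)))

def prediction_error(y_star, y_hat, sigma2):
    """PE = N^{-1} sum_i (y*_i - yhat*_i)^2 - sigma^2, on an independent test set."""
    return float(np.mean((y_star - y_hat) ** 2) - sigma2)

def mse_functional(beta_hat, beta_true, t):
    """sum_j ||beta_hat_j - beta_j||_{L2}^2 for one run, each norm by quadrature."""
    return sum(trap((bh - bt) ** 2, t) for bh, bt in zip(beta_hat, beta_true))
```

Subtracting $\sigma^2$ in the prediction error removes the irreducible noise, so PE measures only the estimation-induced part of the squared prediction loss.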
Design I considers a moderate number of scalar covariates with sample size $n = 200$ and error variance $\sigma^2 = 1$. Specifically, for $j = 1, 2$, $b_{j1} = 1$, $b_{j2} = 0{\cdot}8$, $b_{j3} = 0{\cdot}6$, $b_{j4} = 0{\cdot}5$ and $b_{jk} = 8(k-2)^{-4}$ $(k = 5, \dots, 50)$, $\beta_3 = \beta_4 = 0$, and $\gamma = (1_5^{T}, 0_{15}^{T})^{T}$. Thus $p_n = 20$ and $q_n = 5$. To illustrate the impact of the choice of $s_n$, we inspect the results for $s_n$ ranging from 1 to 16 with $\lambda_n$ chosen by BIC in Table 1.1. The selection of functional and scalar predictors is quite accurate and stable over a wide range of $s_n$, apart from a very small number of false nonzero scalars. For the functional predictors, the functional mean squared error improves until $s_n$ reaches an optimal level, then deteriorates as $s_n$ continues to increase. For the scalar covariates, the mean squared error and prediction error appear more stable beyond the optimal level. We then use ABIC with a common $s_n$ to select both $s_n$ and $\lambda_n$. It yields results similar to those at the optimal mean squared and prediction errors, selecting an average $s_n = 4{\cdot}30$ with standard error 0·050. Refitting the selected model using ordinary least squares with jointly tuned $s_{nj}$ via AIC improves the estimation of the functional coefficients and the overall prediction.
Design II illustrates the situation with an ultra-high dimension of scalar covariates, $\gamma = (1_5^{T}, 0_{995}^{T})^{T}$ with $p_n = 1000$, and other settings the same as in Design I. The ABIC yields results similar to those giving the optimal estimation and prediction. The number of false nonzero scalar covariates, the scalar mean squared error and the prediction error in Step 1 become larger than those in Design I, mainly due to the ultra-high number of insignificant scalar covariates. The functional mean squared error is also higher, as the correlation between functional predictors and scalar covariates becomes greater for a larger $p_n$. To improve the estimation and prediction, for each Monte Carlo run, after obtaining the estimates based on ABIC in Step 1, we generate an additional sample of size 200 and implement the penalized procedure using the significant variables and $s_n$ selected in Step 1. The results summarized in Step 2 are dramatically improved and become comparable to those for Design I. This hints at a promising two-step procedure via sample splitting when $p_n$ is ultra-high, in a similar spirit to Fan et al. [2012]. A further improvement can be achieved by refitting the selected model with jointly tuned $s_{nj}$ using ordinary least squares.

Design III illustrates the situation in which the response does not necessarily depend on the leading principal components and the regression coefficients may decay more slowly than theoretically required. Specifically, we use $b_{j1} = 0$, $b_{j2} = 1$, $b_{j3} = 0{\cdot}5$, $b_{jk} = (k-2)^{-3}$ for $k = 4, \dots, 50$, $j = 1, 2$, with the other settings the same as in Design I. The results across $s_n$ follow a similar pattern, while the automated ABIC captures the regression relationship adaptively and resembles the optimal estimation and prediction. The refitting step using least squares with jointly tuned $s_{nj}$ behaves similarly to Design I. Design IV contains an ultra-high number of scalar covariates, $\gamma = (1_5^{T}, 0_{995}^{T})^{T}$ with $p_n = 1000$, and other settings the same as in Design III. The results exhibit phenomena similar to those in Design II. Moreover, Table 1.3 reports the results obtained from the same settings as Designs I–IV except for a larger sample size $n = 400$. For Table 1.4, the only difference in settings is that the regression error $\varepsilon_i$ is generated from $N(0, 2)$ instead of $N(0, 1)$. As expected, a larger sample size reduces the estimation and prediction errors, while a higher noise level increases them.
Design I (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     0·95   0      4·1 (0·03)      0·39   7·1    4·7 (0·15)       27·9 (0·6)
  2     0·35   0      2·2 (0·10)      0·05   3·2    1·1 (0·06)       7·9 (0·3)
  3     0      0      0·6 (0·01)      0      0·91   0·22 (0·013)     1·5 (0·03)
  4     0      0      0·12 (0·005)    0      0·36   0·067 (0·005)    0·21 (0·007)
  5     0      0      0·14 (0·005)    0      0·38   0·069 (0·004)    0·19 (0·006)
  6     0      0      0·19 (0·007)    0      0·44   0·072 (0·004)    0·19 (0·006)
  10    0      0      0·65 (0·03)     0      0·38   0·073 (0·004)    0·22 (0·006)
  16    0·03   0·08   3·1 (0·13)      0      0·11   0·074 (0·006)    0·54 (0·1)
  ABIC [sn = 4·30 (0·050)]:
        0      0      0·13 (0·007)    0      0·34   0·067 (0·004)    0·20 (0·006)
  TUNE snj [sn1 = 4·78 (0·075), sn2 = 4·87 (0·071)]:
        0      0      0·09 (0·003)    0      0·28   0·065 (0·004)    0·18 (0·006)

Design II (pn = 1000):

  STEP 1 [sn = 4·07 (0·034)]:
        0      0·04   0·18 (0·004)    0      7·4    0·36 (0·018)     1·1 (0·032)
  STEP 2:
        0      0      0·095 (0·005)   0      0·10   0·047 (0·003)    0·17 (0·004)
  TUNE snj [sn1 = 4·76 (0·066), sn2 = 4·62 (0·055)]:
        0      0      0·076 (0·003)   0      0·09   0·046 (0·003)    0·16 (0·004)

Table 1.1: Simulation results with sample size n = 200 based on 200 Monte Carlo replicates for Designs I and II. Shown are the Monte Carlo averages (standard errors in parentheses) of the number of false zero functional predictors (FZf), the number of false nonzero functional predictors (FNf), the functional mean squared error (MSEf), the number of false zero scalar covariates (FZs), the number of false nonzero scalar covariates (FNs), the scalar mean squared error (MSEs), and the prediction error (PE). We first use ABIC to choose the tuning parameter λn and a common truncation sn, then tune snj jointly with AIC by refitting the selected model using ordinary least squares. In Design II, Step 1 results are based on the original sample in each Monte Carlo run, while Step 2 contains the improved results from fitting the penalized procedure to the model selected in Step 1 with an additional sample of n = 200.
Design III (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     2·0    0      2·5 (0)         1·5    1·1    4·2 (0·15)       26·0 (0·1)
  2     0·70   0      1·7 (0·06)      0·02   1·2    0·49 (0·028)     3·9 (0·1)
  3     0      0      0·076 (0·004)   0      0·38   0·069 (0·004)    0·21 (0·009)
  4     0      0      0·069 (0·003)   0      0·38   0·065 (0·004)    0·13 (0·005)
  5     0      0      0·11 (0·005)    0      0·40   0·067 (0·004)    0·14 (0·005)
  6     0      0      0·15 (0·006)    0      0·41   0·068 (0·004)    0·15 (0·005)
  10    0      0·01   0·60 (0·02)     0      0·37   0·071 (0·004)    0·19 (0·006)
  16    0·17   0·21   3·4 (0·1)       0      0·05   0·089 (0·007)    0·65 (0·05)
  ABIC [sn = 3·73 (0·048)]:
        0      0      0·072 (0·004)   0      0·38   0·066 (0·004)    0·14 (0·005)
  TUNE snj [sn1 = 3·99 (0·083), sn2 = 3·98 (0·079)]:
        0      0      0·056 (0·003)   0      0·32   0·063 (0·004)    0·13 (0·005)

Design IV (pn = 1000):

  STEP 1 [sn = 3·50 (0·040)]:
        0      0·09   0·10 (0·003)    0      6·5    0·41 (0·02)      0·81 (0·021)
  STEP 2:
        0      0      0·059 (0·003)   0      0·13   0·048 (0·003)    0·13 (0·005)
  TUNE snj [sn1 = 4·05 (0·067), sn2 = 4·11 (0·065)]:
        0      0      0·042 (0·002)   0      0·09   0·045 (0·003)    0·11 (0·004)

Table 1.2: Simulation results with sample size n = 200 based on 200 Monte Carlo replicates for Designs III and IV. Shown are the Monte Carlo averages (standard errors in parentheses) of the number of false zero functional predictors (FZf), the number of false nonzero functional predictors (FNf), the functional mean squared error (MSEf), the number of false zero scalar covariates (FZs), the number of false nonzero scalar covariates (FNs), the scalar mean squared error (MSEs), and the prediction error (PE). We first use ABIC to choose the tuning parameter λn and a common truncation sn, then tune snj jointly with AIC by refitting the selected model using ordinary least squares. In Design IV, Step 1 results are based on the original sample in each Monte Carlo run, while Step 2 contains the improved results from fitting the penalized procedure to the model selected in Step 1 with an additional sample of n = 200.
Design I (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     0·90   0      4·1 (0·04)      0·15   5·1    2·3 (0·08)       25·8 (0·2)
  2     0      0      1·3 (0·009)     0      1·4    0·34 (0·01)      4·6 (0·04)
  3     0      0      0·58 (0·008)    0      0·23   0·11 (0·004)     1·5 (0·02)
  4     0      0      0·063 (0·002)   0      0·23   0·027 (0·002)    0·13 (0·004)
  5     0      0      0·071 (0·003)   0      0·27   0·028 (0·002)    0·12 (0·004)
  6     0      0      0·095 (0·003)   0      0·24   0·028 (0·002)    0·12 (0·004)
  10    0      0      0·32 (0·01)     0      0·31   0·029 (0·002)    0·13 (0·004)
  16    0      0      1·2 (0·04)      0      0·10   0·026 (0·002)    0·15 (0·004)
  ABIC [sn = 4·22 (0·047)]:
        0      0      0·067 (0·003)   0      0·23   0·027 (0·002)    0·13 (0·004)
  TUNE snj [sn1 = 4·67 (0·058), sn2 = 4·68 (0·061)]:
        0      0      0·049 (0·002)   0      0·20   0·026 (0·002)    0·12 (0·004)

Design II (pn = 1000):

  STEP 1 [sn = 4·10 (0·027)]:
        0      0      0·11 (0·004)    0      4·6    0·13 (0·007)     0·51 (0·01)
  STEP 2:
        0      0      0·052 (0·002)   0      0·07   0·025 (0·001)    0·11 (0·004)
  TUNE snj [sn1 = 4·72 (0·056), sn2 = 4·60 (0·053)]:
        0      0      0·041 (0·001)   0      0·06   0·025 (0·001)    0·11 (0·004)

Design III (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     2·0    0      2·5 (0)         1·0    0·43   2·7 (0·1)        25·2 (0·09)
  2     0·10   0      0·72 (0·04)     0      0·43   0·19 (0·01)      2·8 (0·06)
  3     0      0      0·059 (0·002)   0      0·25   0·032 (0·002)    0·16 (0·006)
  4     0      0      0·033 (0·001)   0      0·23   0·026 (0·001)    0·073 (0·004)
  5     0      0      0·048 (0·002)   0      0·26   0·027 (0·002)    0·075 (0·004)
  6     0      0      0·072 (0·003)   0      0·25   0·027 (0·002)    0·078 (0·004)
  10    0      0      0·28 (0·01)     0      0·25   0·028 (0·002)    0·099 (0·004)
  16    0      0·03   1·2 (0·04)      0      0·02   0·024 (0·001)    0·13 (0·004)
  ABIC [sn = 3·91 (0·036)]:
        0      0      0·034 (0·002)   0      0·23   0·026 (0·001)    0·075 (0·004)
  TUNE snj [sn1 = 4·37 (0·064), sn2 = 4·38 (0·066)]:
        0      0      0·026 (0·001)   0      0·22   0·026 (0·001)    0·070 (0·004)

Design IV (pn = 1000):

  STEP 1 [sn = 3·81 (0·043)]:
        0      0·01   0·047 (0·001)   0      3·8    0·17 (0·009)     0·35 (0·008)
  STEP 2:
        0      0      0·031 (0·002)   0      0·08   0·024 (0·001)    0·073 (0·004)
  TUNE snj [sn1 = 4·28 (0·061), sn2 = 4·37 (0·067)]:
        0      0      0·023 (0·001)   0      0·07   0·024 (0·001)    0·058 (0·004)

Table 1.3: Simulation results using the same settings as Designs I & II in Section 4 of the paper and Designs III & IV as above, based on 200 Monte Carlo replicates, except that the sample size is n = 400.
Design I (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     0·94   0      4·1 (0·03)      0·41   7·3    4·9 (0·2)        28·0 (0·6)
  2     0·56   0      2·8 (0·1)       0·08   3·3    1·4 (0·07)       9·9 (0·3)
  3     0·01   0      0·63 (0·02)     0      1·7    0·38 (0·03)      1·7 (0·07)
  4     0      0      0·17 (0·007)    0      0·58   0·15 (0·009)     0·31 (0·01)
  5     0      0      0·23 (0·1)      0      0·58   0·15 (0·009)     0·30 (0·01)
  6     0      0      0·33 (0·01)     0      0·63   0·15 (0·009)     0·32 (0·01)
  10    0·01   0·01   1·3 (0·05)      0      0·49   0·15 (0·009)     0·44 (0·05)
  16    0·31   0·18   7·7 (0·3)       0·08   0·20   0·36 (0·03)      3·6 (0·3)
  ABIC [sn = 4·12 (0·026)]:
        0      0      0·18 (0·009)    0      0·56   0·15 (0·009)     0·30 (0·01)
  TUNE snj [sn1 = 4·59 (0·069), sn2 = 4·61 (0·060)]:
        0      0      0·14 (0·006)    0      0·48   0·14 (0·008)     0·28 (0·01)

Design II (pn = 1000):

  STEP 1 [sn = 4·04 (0·021)]:
        0      0·04   0·29 (0·008)    0      7·3    0·55 (0·03)      1·9 (0·07)
  STEP 2:
        0      0      0·15 (0·008)    0      0·16   0·096 (0·005)    0·26 (0·02)
  TUNE snj [sn1 = 4·53 (0·056), sn2 = 4·46 (0·050)]:
        0      0      0·12 (0·006)    0      0·15   0·095 (0·005)    0·24 (0·008)

Design III (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     2·0    0      2·5 (0)         1·5    1·2    4·3 (0·15)       26·1 (0·1)
  2     0·33   0      1·1 (0·06)      0·02   0·69   0·43 (0·03)      3·3 (0·1)
  3     0      0      0·10 (0·005)    0      0·39   0·13 (0·008)     0·29 (0·01)
  4     0      0      0·12 (0·006)    0      0·39   0·13 (0·008)     0·22 (0·01)
  5     0      0      0·20 (0·009)    0      0·41   0·13 (0·008)     0·24 (0·01)
  6     0      0      0·29 (0·01)     0      0·41   0·14 (0·008)     0·26 (0·01)
  10    0·03   0·07   1·4 (0·06)      0      0·17   0·12 (0·007)     0·41 (0·03)
  16    0·02   1·3    11·8 (0·4)      0·02   0·39   0·18 (0·02)      1·0 (0·03)
  ABIC [sn = 3·76 (0·065)]:
        0      0      0·15 (0·02)     0      0·36   0·13 (0·008)     0·24 (0·01)
  TUNE snj [sn1 = 3·76 (0·064), sn2 = 3·79 (0·066)]:
        0      0      0·070 (0·004)   0      0·29   0·12 (0·007)     0·20 (0·01)

Design IV (pn = 1000):

  STEP 1 [sn = 3·33 (0·040)]:
        0      0·06   0·18 (0·006)    0      6·4    0·63 (0·03)      1·5 (0·04)
  STEP 2:
        0      0      0·091 (0·005)   0      0·12   0·099 (0·005)    0·23 (0·009)
  TUNE snj [sn1 = 3·96 (0·060), sn2 = 3·73 (0·048)]:
        0      0      0·065 (0·003)   0      0·10   0·095 (0·005)    0·18 (0·009)

Table 1.4: Simulation results using the same settings as Designs I & II in Section 4 of the paper and Designs III & IV as above, based on 200 Monte Carlo replicates, except that the variance of the regression error is σ² = 2.
1.5 Application
We applied our method to a dataset from the National Mortality, Morbidity, and Air Pollution Study that contains air pollution measurements and mortality counts for U.S. cities, together with U.S. census information for the year 2000. A main goal of the study is to investigate the impact of air pollution on the non-accidental mortality rate across different cities in the United States, taking into account climate patterns and census information. Previous studies
conducted a two-stage analysis: first modelling the short-term effect of certain air pollutants on
the mortality count for each city, then combining the estimates across cities [Peng et al., 2005,
2006]. By contrast, we apply the partially functional linear regression model to the data for
different cities. In particular, we are interested in studying the effect of particulate matter with an aerodynamic diameter of less than 2·5 µm, abbreviated PM2·5 and measured in µg m⁻³, as its negative effect on health, revealed by recent toxicological and epidemiological studies, has brought it to the public's attention. Other studies [Samoli et al., 2013, Pascal et al., 2014] have
shown that PM2·5 has a larger effect on mortality in warm weather, so we focused on the daily
concentration measurements of PM2·5 from April 1, 2000 to August 31, 2000, which along with
the daily observations of temperature and humidity were treated as the functional predictors.
After removing the cities which have more than ten consecutive missing measurements of
PM2·5, we included a total of 69 cities in our analysis. The response of interest is the log-
transformed total non-accidental mortality rate in the following month, September 2000, of
the population of age 65 and older, which accounts for the majority of non-accidental deaths.
The scalar covariates available from the U.S. census for each city are land area per individual,
water area per individual, proportion of the urban population, proportion of the population with
at least a high school diploma, proportion of the population with at least a university degree,
proportion of the population below poverty line, and proportion of household owners.
The ABIC was used to first choose significant predictors with a common truncation, fol-
lowed by a least squares refitting using AIC to tune snj jointly. Among scalar covariates, our
analysis shows that only the proportion of household owners has a significant negative effect, −1·80 with a standard error of 0·41, indicating that household owners tend to incur a lower mortality rate; the standard error was obtained from 1000 bootstrap samples by fitting the selected model using ordinary least squares. Our method also selected two significant functional predictors, PM2·5 and temperature. The least squares refitting chose the truncation numbers
sn1 = 2 and sn2 = 2. The estimated regression parameter functions with their 95% bootstrap
confidence bands are shown in Figure 1.1. We observe that higher PM2·5 concentrations during
the summer, especially July and August, can lead to an increased mortality during the imme-
diately following period, which coincides with the findings in Samoli et al. [2013] and Pascal
et al. [2014]. This may be partially explained by the proximity of the pollution period to the
time of death. Higher temperatures in the summer, contrasted with lower temperatures in April,
may also increase the mortality rate, which agrees with Curriero et al. [2002]. To better under-
stand the effects of functional predictors, we fitted a linear regression using only the selected
scalar covariate, giving R2 = 0·15. Including temperature leads to R2 = 0·25, and including
both temperature and PM2·5 yields R² = 0·38. A heuristic F test for the significance of the two principal component scores of temperature gives a p-value of 0·01, and adding two additional principal component scores of PM2·5 gives a p-value of 0·0008. For comparison, we also fitted the marginal models containing only PM2·5 or temperature using classical functional linear regression. The marginal F tests for temperature and PM2·5 yield p-values of 0·0001 and 0·004, respectively.
The regression parameter functions from these marginal fits show similar patterns and are omitted. We conclude that, after adjusting for temperature and household ownership, summer PM2·5 concentrations have a significant impact on the near-future mortality rate of elderly citizens in the U.S.
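The heuristic F tests above compare nested least-squares fits. A minimal sketch of the statistic, under the usual nested-model formula (our illustration, with hypothetical inputs; the p-value would then come from the F(df_extra, n − p_full) distribution):

```python
def nested_f(rss_reduced, rss_full, df_extra, n, p_full):
    """F statistic for comparing nested least-squares fits:
    rss_reduced/rss_full are residual sums of squares of the two models,
    df_extra is the number of added predictors (e.g. 2 principal component
    scores), n the sample size, p_full the number of parameters in the
    full model. This helper is our illustration, not code from the thesis."""
    df_resid = n - p_full
    return ((rss_reduced - rss_full) / df_extra) / (rss_full / df_resid)
```

With n = 69 cities, adding a small number of score predictors keeps the residual degrees of freedom comfortably large, which is what makes such heuristic tests informative here.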
1.6 Proof of Lemmas and Main Theorems
1.6.1 Regularity Conditions
[Figure 1.1: two panels, (a) PM2.5 and (b) Temperature, plotting Effect against Time (day).]

Figure 1.1: The estimated regression parameter functions and 95% confidence bands based on 1000 bootstrap samples for PM2·5 and temperature with $s_{n1} = 2$ and $s_{n2} = 2$, respectively. The solid line denotes the estimated regression parameter function and the dashed lines denote the 95% bootstrap confidence bands. The left and right panels are for PM2·5 and temperature, respectively.

Without loss of generality, we assume that $X_j$, $j = 1, \dots, d$, $Y$ and $Z$ have been centred to have mean zero. With $W_{ijl} = x_{ij}(t_{ijl}) + \varepsilon_{ijl}$, for definiteness we consider the local linear smoother for each set of subjects using bandwidths $h_{ij}$, $j = 1, \dots, d$, and denote the smoothed trajectories by $\hat x_{ij}$. Denote the minimum and maximum eigenvalues of a symmetric matrix $A$ by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$. Recall that the first $g$ functional predictors are significant, while the rest are not. Define the $(gs_n + q_n) \times 1$ vector $N = (\xi_{11}w_{11}^{-1/2}, \dots, \xi_{1s_n}w_{1s_n}^{-1/2}, \dots, \xi_{g1}w_{g1}^{-1/2}, \dots, \xi_{gs_n}w_{gs_n}^{-1/2}, Z_1, \dots, Z_{q_n})^{T}$, which combines all significant functional and scalar predictors.
Condition (B1) consists of regularity assumptions for functional data; for example, a Gaussian process with Hölder continuous sample paths satisfies (B1), see Hall and Hosseini-Nasab [2006]. Condition (B2) is standard for local linear smoothers, (B3)–(B4) concern how the functional predictors are sampled and smoothed, while (B5) is for the moments of the non-functional covariates $Z = (Z_1, \dots, Z_{p_n})^{T}$.

(B1) For $j = 1, \dots, d$ and any $C > 0$ there exists an $\epsilon > 0$ such that
\[
\sup_{t \in \mathcal{T}} E\{|X_j(t)|^C\} < \infty, \qquad \sup_{s,t \in \mathcal{T}} E\{|s - t|^{-\epsilon}|X_j(s) - X_j(t)|^C\} < \infty.
\]
For each integer $r \ge 1$, $w_{jk}^{-r} E(\xi_{jk}^{2r})$ is bounded uniformly in $k$.
(B2) For $j = 1, \dots, d$, $X_j$ is twice continuously differentiable on $\mathcal{T}$ with probability 1, and $\int E\{X_j^{(2)}(t)\}^4\, dt < \infty$, where $X_j^{(2)}(\cdot)$ denotes the second derivative of $X_j(\cdot)$.

The following condition concerns the design on which $x_{ij}$ is observed and the local linear smoother $\hat x_{ij}$. When a function is said to be smooth, we mean that it is continuously differentiable to an adequate order.
(B3) For $j = 1, \dots, d$, the $t_{ijl}$, $l = 1, \dots, m_{ij}$, are considered deterministic and ordered increasingly for $i = 1, \dots, n$. There exist densities $g_{ij}$, uniformly smooth over $i$, satisfying $\int_{\mathcal{T}} g_{ij}(t)\,dt = 1$ and $0 < c_1 < \inf_i\inf_{t\in\mathcal{T}} g_{ij}(t) < \sup_i\sup_{t\in\mathcal{T}} g_{ij}(t) < c_2 < \infty$, that generate the $t_{ijl}$ according to $t_{ijl} = G_{ij}^{-1}\{(l-1)/(m_{ij}-1)\}$, where $G_{ij}^{-1}$ is the inverse of $G_{ij}(t) = \int_{-\infty}^{t} g_{ij}(s)\,ds$. For each $j = 1, \dots, d$, there exists a common sequence of bandwidths $h_j$ such that $0 < c_1 < \inf_i h_{ij}/h_j < \sup_i h_{ij}/h_j < c_2 < \infty$, where $h_{ij}$ is the bandwidth for $\hat x_{ij}$. The kernel density function is smooth and compactly supported.
Let $\mathcal{T} = [a_0, b_0]$, $t_{ij0} = a_0$, $t_{ij,m_{ij}+1} = b_0$, $\Delta_{ij} = \sup\{t_{ij,l+1} - t_{ij,l},\ l = 0, \dots, m_{ij}\}$, and $m_j = m_j(n) = \inf_{i=1,\dots,n} m_{ij}$. The condition below lets the smoothed estimate $\hat x_{ij}$ serve as well as the true $x_{ij}$ in the asymptotic analysis, where $a_n \sim b_n$ denotes $0 < \lim a_n/b_n < \infty$.

(B4) For $j = 1, \dots, d$, $\sup_i \Delta_{ij} = O(m_j^{-1})$, $h_j \sim m_j^{-1/5}$, and $m_j n^{-5/4} \to \infty$.
The condition for the scalar covariates is given below.

(B5) $Z_1, \dots, Z_{p_n}$ are subgaussian random variables such that $\mathrm{pr}(|Z_l| > t) \le \exp(-2^{-1}Ct^2)$ for any $t \ge 0$ and some $C > 0$ that does not depend on $l$, $E(Z_l^4)$ is uniformly bounded for $l = 1, \dots, p_n$, $\lambda_{\max}(\Sigma) \le c_1 < \infty$ and $0 < c_2 \le \lambda_{\min}(U_1) \le \lambda_{\max}(U_1) \le c_3 < \infty$ for all $n$, where $\Sigma = E(ZZ^{T})$ and $U_1 = E(NN^{T})$.
1.6.2 Auxiliary Lemmas
For each $j = 1, \dots, d$, given the estimated covariance $\hat K_j(s,t) = n^{-1}\sum_{i=1}^{n}\hat x_{ij}(s)\hat x_{ij}(t)$, the eigenvalues/eigenfunctions and functional principal component scores are estimated by
\[
\int_{\mathcal{T}} \hat K_j(s,t)\,\hat\phi_{jk}(s)\,ds = \hat w_{jk}\hat\phi_{jk}(t), \qquad \hat\xi_{ijk} = \int_{\mathcal{T}} \hat x_{ij}(t)\hat\phi_{jk}(t)\,dt, \tag{1.5}
\]
subject to $\int_{\mathcal{T}}\hat\phi_{jk}^2(t)\,dt = 1$ and $\int_{\mathcal{T}}\hat\phi_{jk_1}(t)\hat\phi_{jk_2}(t)\,dt = 0$ for $k_1 \neq k_2$. Denote $\hat\Delta_j^2 = |||\hat K_j - K_j|||^2 = \int_{\mathcal{T}}\int_{\mathcal{T}}\{\hat K_j(s,t) - K_j(s,t)\}^2\,ds\,dt$, $\hat\Re_j(s,t) = n^{1/2}\{\hat K_j(s,t) - K_j(s,t)\}$, $(x_{ij}\otimes x_{ij})(s,t) = x_{ij}(s)x_{ij}(t)$, and the second derivative of $x_{ij}$ by $x^{(2)}_{ij}$. Let the $n \times p_n$ matrix $Z_M = (z_1, \dots, z_n)^{T} = (Z_M^{(1)}, Z_M^{(2)})$, with $Z_M^{(1)}$ containing the first $q_n$ columns of $Z_M$. Define $M_j$ as the $n \times s_n$ matrix with $(i,k)$th element $\xi_{ijk}$, and $M = (M_1, \dots, M_d)$, $M^{(1)} = (M_1, \dots, M_g)$, $M^{(2)} = (M_{g+1}, \dots, M_d)$. Similarly, we define $\tilde M_j$, $\tilde M^{(1)}$ and $\tilde M^{(2)}$, in which $\xi_{ijk}$ is replaced by $\xi_{ijk}w_{jk}^{-1/2}$. Moreover, we denote by $\hat M_j$ the $n \times s_n$ matrix with $(i,k)$th element $\hat\xi_{ijk}$, and set $\hat M = (\hat M_1, \dots, \hat M_d)$, $\hat M^{(1)} = (\hat M_1, \dots, \hat M_g)$ and $\hat M^{(2)} = (\hat M_{g+1}, \dots, \hat M_d)$; likewise $\hat{\tilde M}_j$, $\hat{\tilde M}^{(1)}$ and $\hat{\tilde M}^{(2)}$, in which $\hat\xi_{ijk}$ is replaced by $\hat\xi_{ijk}w_{jk}^{-1/2}$. Combining the functional and scalar covariates, we have $\tilde N_1 = (\tilde M^{(1)}, Z_M^{(1)}) = (N_1, \dots, N_n)^{T}$, where the $N_i = (N_{i1}, \dots, N_{i,gs_n+q_n})^{T}$, $i = 1, \dots, n$, are independently and identically distributed as $N$. We further denote $\hat N_1 = (\hat{\tilde M}^{(1)}, Z_M^{(1)}) = (\hat N_1, \dots, \hat N_n)^{T}$, $N^{(1)} = (M^{(1)}, Z_M^{(1)})$, $\hat N^{(1)} = (\hat M^{(1)}, Z_M^{(1)})$, and let $E(N_iN_i^{T}) = U_1$. Recall that $\hat b^{(1)}$ denotes the estimate of $b^{(1)}$, $\hat\eta = (\hat b^{(1)T}, \hat\gamma^{T})^{T}$, and we further denote by $\hat b_j^{(1)}$ the estimate of $b_j^{(1)}$. Suppose $\alpha(\cdot)$, $\beta(\cdot)$ and $G(\cdot,\cdot)$ are square-integrable functions on $\mathcal{T}$ and $\mathcal{T}\times\mathcal{T}$. Write $\|\alpha\|$, $\int\alpha\beta$ (or $\langle\alpha,\beta\rangle$) and $\int G\alpha\beta$ for $\{\int_{\mathcal{T}}\alpha^2(t)\,dt\}^{1/2}$, $\int_{\mathcal{T}}\alpha(t)\beta(t)\,dt$ and $\int\!\!\int_{\mathcal{T}^2} G(s,t)\alpha(s)\beta(t)\,ds\,dt$, respectively.
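The eigen-decomposition in (1.5) can be carried out on a grid. The sketch below is a generic discretized FPCA under the assumption of an equally spaced grid; it is our illustration of the estimator, not code from the thesis.

```python
import numpy as np

def fpca(x_hat, t, n_comp):
    """Empirical FPCA from smoothed trajectories x_hat (n, m) on a common grid t:
    eigendecompose K_hat(s, t) = n^{-1} sum_i x_i(s) x_i(t), then compute scores
    xi_hat_ik = int x_i(t) phi_hat_k(t) dt by Riemann-sum quadrature."""
    n, m = x_hat.shape
    dt = t[1] - t[0]                      # equally spaced grid assumed
    K = x_hat.T @ x_hat / n               # discretized covariance surface
    w, v = np.linalg.eigh(K * dt)         # eigenvalues of the integral operator
    idx = np.argsort(w)[::-1][:n_comp]    # eigh returns ascending order
    lam = w[idx]
    phi = v[:, idx].T / np.sqrt(dt)       # rescale so int phi_k^2 dt = 1
    scores = x_hat @ phi.T * dt           # (n, n_comp) principal component scores
    return lam, phi, scores
```

Multiplying the discretized covariance by the grid spacing converts matrix eigenvalues into estimates of the eigenvalues of the covariance operator, and the $1/\sqrt{dt}$ rescaling restores the $L^2$ normalization of the eigenfunctions.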
Lemma 1 provides results for the estimates obtained by functional principal component analysis. We first quantify the smoothing error of $\hat x_{ij}$ and its influence carried over to the covariance and functional principal component estimates, via the decomposition $\hat\xi_{ijk} - \xi_{ijk} = (\hat\xi_{ijk} - \tilde\xi_{ijk}) + (\tilde\xi_{ijk} - \xi_{ijk}) = \int \hat x_{ij}(\hat\phi_{jk} - \phi_{jk}) + \int(\hat x_{ij} - x_{ij})\phi_{jk}$. Then we obtain upper bounds for the differences between the estimated and true eigenfunctions in terms of covariance perturbation and eigenvalue spacings, quantifying the increased estimation error of the higher-order terms. For each $j = 1, \dots, d$, denote
\[
\delta_{jk} = \min_{l=1,\dots,k}(w_{jl} - w_{j,l+1}), \qquad J_{jn} = \{k = 1, \dots, \infty : w_{jk} - w_{j,k+1} > 2\hat\Delta_j\},
\]
where $\hat\Delta_j = |||\hat K_j - K_j|||$; that is, consider $k \in J_{jn}$ for which the distance of $w_{jk}$ to the nearest other eigenvalues does not fall below $2\hat\Delta_j$. It is known that $\phi_{jk}$ can be consistently estimated for $k \in J_{jn}$ [Theorem 1, Hall and Hosseini-Nasab, 2006], $\|\hat\phi_{jk} - \phi_{jk}\| = O_p(\delta_{jk}^{-1}\hat\Delta_j) = O_p(k^{a+1}n^{-1/2})$, implying $\sup J_{jn} = o\{n^{1/(2a+2)}\}$. However, for our theoretical analysis, we need a sharper bound without compromising the number of eigenfunctions considered. Define the set of realizations such that, for sample size $n$, some $C$ and any $\tau < 1$,
\[
F_{s_n,j} = \{(\hat w_{jk_1} - \hat w_{jk_2})^{-2} \le 2(w_{jk_1} - w_{jk_2})^{-2} \le Cn^{\tau},\ k_1, k_2 = 1, \dots, s_n,\ k_1 \neq k_2\}.
\]
Lemma 1 (a) Under conditions (B1)–(B4), for each $i = 1, \dots, n$ and each $j = 1, \dots, d$, we have $E(\|\hat x_{ij} - x_{ij}\|^2) = o(n^{-1})$ and $E\{\int(\hat x_{ij} - x_{ij})^4\} = o(n^{-2})$. For each $j = 1, \dots, d$, we have $E(|||\hat K_j - K_j|||^2) = O(n^{-1})$.

(b) Under conditions (A1), (A3), (B1)–(B4), for each $j = 1, \dots, d$, we have $s_n \in J_{jn}$ for large $n$, and $P(F_{s_n,j}) \to 1$ as $n \to \infty$ for any $\tau < 1$. Moreover, for each $k = 1, \dots, s_n$, $\|\hat\phi_{jk} - \phi_{jk}\| = O_p(k\, n^{-1/2})$, where the $O_p(\cdot)$ is uniform in $k = 1, \dots, s_n$.

(c) Under conditions (A1), (A3), (B1)–(B4), for each $j = 1, \dots, d$,
\[
\hat\phi_{jk}(t) - \phi_{jk}(t) = n^{-1/2}\sum_{v: v \neq k}(w_{jk} - w_{jv})^{-1}\phi_{jv}(t)\int \hat\Re_j\,\phi_{jk}\phi_{jv} + \alpha_{jk}(t),
\]
where $\|\alpha_{jk}\| = O_p(k^{a+2}n^{-1})$, $\hat\Re_j = n^{1/2}(\hat K_j - K_j)$, and the $O_p(\cdot)$ is uniform in $k = 1, \dots, s_n$.

(d) Under conditions (A1), (B1)–(B4) and $n^{-1/2}s_n = o(1)$, for any $j = 1, \dots, d$, we have $c_1\lambda_n \le \lambda_{jn} \le c_2\lambda_n$ for some positive constants $c_1, c_2$.
The next lemma quantifies the asymptotic orders of several important types of expressions that will be encountered in the proofs of our lemmas and main theorems. For convenience, define the following, for $l, l_1, l_2 = 1, \dots, p_n$, $k, k_1, k_2 = 1, \dots, s_n$ and $j, j_1, j_2 = 1, \dots, d$:
\begin{align*}
\theta^{(1)}_{jk} &= \sum_{i=1}^{n}(\hat\xi_{ijk} - \xi_{ijk})^2 w_{jk}^{-1}, &
\theta^{(2)}_{j_1j_2k_1k_2} &= n^{-1}\sum_{i=1}^{n}(\hat\xi_{ij_1k_1}\hat\xi_{ij_2k_2} - \xi_{ij_1k_1}\xi_{ij_2k_2})(w_{j_1k_1}w_{j_2k_2})^{-1/2}, \\
\theta^{(3)}_{j_1j_2k_1k_2} &= n^{-1}\sum_{i=1}^{n}\{\xi_{ij_1k_1}\xi_{ij_2k_2} - E(\xi_{j_1k_1}\xi_{j_2k_2})\}(w_{j_1k_1}w_{j_2k_2})^{-1/2}, &
\theta^{(4)}_{j_1j_2k_1k_2} &= \theta^{(2)}_{j_1j_2k_1k_2} + \theta^{(3)}_{j_1j_2k_1k_2}, \\
\theta^{(5)}_{jkl} &= n^{-1}\sum_{i=1}^{n}(\hat\xi_{ijk} - \xi_{ijk})z_{il}\, w_{jk}^{-1/2}, &
\theta^{(6)}_{jkl} &= n^{-1}\sum_{i=1}^{n}\{\xi_{ijk}z_{il} - E(\xi_{ijk}z_{il})\}w_{jk}^{-1/2}, \\
\theta^{(7)}_{jkl} &= \theta^{(5)}_{jkl} + \theta^{(6)}_{jkl}, &
\theta^{(8)}_{l_1l_2} &= n^{-1}\sum_{i=1}^{n}\{z_{il_1}z_{il_2} - E(z_{il_1}z_{il_2})\}, \\
\vartheta^{(1)}_{jl} &= \sum_{i=1}^{n}\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}z_{il}, &
\vartheta^{(2)}_{jl} &= \sum_{i=1}^{n}\sum_{k=1}^{s_n}(\hat\xi_{ijk} - \xi_{ijk})(\hat b_{jk} - b_{jk0})z_{il}, \\
\vartheta^{(3)}_{jl} &= \sum_{i=1}^{n}\sum_{k=1}^{s_n}(\tilde\xi_{ijk} - \xi_{ijk})b_{jk0}z_{il}, &
\vartheta^{(4)}_{jl} &= \sum_{i=1}^{n}\sum_{k=1}^{s_n}(\hat\xi_{ijk} - \tilde\xi_{ijk})b_{jk0}z_{il}, \\
\vartheta^{(5)}_{jl} &= \sum_{k=1}^{s_n}b_{jk0}\int\Big(\sum_{i=1}^{n}\hat x_{ij}z_{il} - n\Xi_{jl}\Big)(\hat\phi_{jk} - \phi_{jk}), &
\vartheta^{(6)}_{jl} &= n\sum_{k=1}^{s_n}b_{jk0}\langle\Xi_{jl}, \alpha_{jk}\rangle, \\
\vartheta^{(7)}_{jl} &= n^{1/2}\sum_{k=1}^{s_n}\sum_{v \neq k}b_{jk0}(w_{jk} - w_{jv})^{-1}\langle\Xi_{jl}, \phi_{jv}\rangle\int(\hat\Re_j - \Re^*_j)\phi_{jk}\phi_{jv}, \span\span \\
\vartheta^{(8)}_{jl} &= n^{1/2}\sum_{k=1}^{s_n}\sum_{v \neq k}b_{jk0}(w_{jk} - w_{jv})^{-1}\langle\Xi_{jl}, \phi_{jv}\rangle\int\Re^*_j\,\phi_{jk}\phi_{jv}, \span\span
\end{align*}
where $\vartheta^{(m)}_j = (\vartheta^{(m)}_{j1}, \dots, \vartheta^{(m)}_{jq_n})^{T}$ denote the corresponding vectors and $\Re^*_j = n^{1/2}(\tilde K_j - K_j)$, with $\tilde K_j = n^{-1}\sum_{i=1}^{n}x_{ij}\otimes x_{ij}$.
Lemma 2 (a) Under conditions (A1), (A3), (B1)–(B5), we have
\begin{align*}
\theta^{(1)}_{jk} &= O_p(k^{a+2}), & \theta^{(2)}_{j_1j_2k_1k_2} &= O_p(k_1^{a/2+1}n^{-1/2} + k_2^{a/2+1}n^{-1/2}), \\
\theta^{(3)}_{j_1j_2k_1k_2} &= O_p(n^{-1/2}), & \theta^{(4)}_{j_1j_2k_1k_2} &= O_p(k_1^{a/2+1}n^{-1/2} + k_2^{a/2+1}n^{-1/2}), \\
\theta^{(5)}_{jkl} &= O_p(k^{a/2+1}n^{-1/2}), & \theta^{(6)}_{jkl} &= O_p(n^{-1/2}), \\
\theta^{(7)}_{jkl} &= O_p(k^{a/2+1}n^{-1/2}), & \theta^{(8)}_{l_1l_2} &= O_p(n^{-1/2}),
\end{align*}
where the $O_p(\cdot)$ terms are uniform in $k, k_1, k_2 = 1, \dots, s_n$ and $l, l_1, l_2 = 1, \dots, p_n$.

(b) Under conditions (A1)–(A7), (B1)–(B5), uniformly in $l = 1, \dots, p_n$,
\[
\vartheta^{(1)}_{jl} = \vartheta^{(2)}_{jl} = \vartheta^{(3)}_{jl} = \vartheta^{(5)}_{jl} = \vartheta^{(6)}_{jl} = \vartheta^{(7)}_{jl} = o_p(n^{1/2}), \qquad \vartheta^{(4)}_{jl} = \vartheta^{(5)}_{jl} + \vartheta^{(6)}_{jl} + \vartheta^{(7)}_{jl} + \vartheta^{(8)}_{jl}.
\]
Lemma 3 characterizes the eigenvalues of the design matrix $\hat N_1$, and Lemma 4 concerns the asymptotic order of $\Lambda_1 = P_{\hat N_1}(Y_M - \hat N_1\eta_{10})$, where $P_{\hat N_1} = \hat N_1(\hat N_1^{T}\hat N_1)^{-1}\hat N_1^{T}$, $Y_M = (y_1, \dots, y_n)^{T}$, and $\eta_{10}$ is the true value of $\eta_1$.

Lemma 3 Under conditions (A1), (A3), (A5), (B1)–(B5), we have $|\lambda_{\min}(\hat N_1^{T}\hat N_1/n) - \lambda_{\min}(U_1)| = o_p(1)$ and $|\lambda_{\max}(\hat N_1^{T}\hat N_1/n) - \lambda_{\max}(U_1)| = o_p(1)$.

Lemma 4 Under conditions (A1)–(A5), (B1)–(B5), we have $\|\Lambda_1\|_2^2 = O_p(r_n^2)$, where $r_n^2 = q_n + s_n$.
1.6.3 Proof of Lemmas
Proof of Lemma 1. We shall show Lemma 1 for an arbitrary fixed $j = 1, \dots, d$, and suppress the subscript $j$ in this proof for convenience. For part (a), recall that $W_{il} = x_i(t_{il}) + \varepsilon_{il}$, where the errors $\varepsilon_{il}$ are independent of $x_i$. Thus one can factor the probability space as $\Omega = \Omega_1 \times \Omega_2$, where $\Omega_1$ is for $\{x_i\}_{i=1,\dots,n}$ and $\Omega_2$ for $\{\varepsilon_{il}\}_{l=1,\dots,m_i}$. Given the data for a single subject, a specific realization $x_i$ corresponds to fixing a value $\omega_i \in \Omega_1$, that is, $x_i(\cdot) = x_i(\cdot, \omega_i)$. We write $E_\varepsilon$ for expectations with respect to the probability measure on $\Omega_2$, and $E_X$ with respect to $\Omega_1$. For each fixed $\omega_i$, the errors $\varepsilon_{il}$ are independent and identically distributed over $l$, with $E_\varepsilon(\varepsilon_{il}) = 0$ and $E_\varepsilon(\varepsilon_{il}^2) = \sigma_x^2$. Given condition (B3) together with a local linear smoother, one has $E(\|\hat x_i - x_i\|^2) = E_X[E_\varepsilon\{\int(\hat x_i - x_i)^2\}]$. Using a standard Taylor expansion argument, together with the dominated convergence theorem given $E(\|x_i^{(2)}\|^4) < \infty$, it is easy to verify that $E_X[E_\varepsilon\{\int(\hat x_i - x_i)^2\}] \le C\int[E_X\{(x_i^{(2)})^2\}h^4 + (mh)^{-1}]$ under (B3), yielding $E(\|\hat x_i - x_i\|^2) = o(n^{-1})$ by (B4). Similar arguments lead to $E_X[E_\varepsilon\{\int(\hat x_i - x_i)^4\}] \le C\int[E_X\{(x_i^{(2)})^4\}h^8 + (mh)^{-2}]$, thus $E\{\int(\hat x_i - x_i)^4\} = o(n^{-2})$. Regarding $\hat K = n^{-1}\sum_{i=1}^{n}\hat x_i\otimes\hat x_i$, notice that
\[
\hat K - K = n^{-1}\sum_{i=1}^{n}\{(\hat x_i - x_i)\otimes x_i + (\hat x_i - x_i)\otimes(\hat x_i - x_i) + x_i\otimes(\hat x_i - x_i)\} + \Big(n^{-1}\sum_{i=1}^{n}x_i\otimes x_i - K\Big).
\]
It is easy to see that the smoothing part is dominated in norm by a constant multiple of $n^{-1}\sum_{i=1}^{n}|||(\hat x_i - x_i)\otimes x_i|||$. Since $\{x_i, \hat x_i\}$ is independent of $\{x_{i'}, \hat x_{i'}\}$ for $i \neq i'$, we have
\begin{align*}
E\Big[\Big\{n^{-1}\sum_{i=1}^{n}|||(\hat x_i - x_i)\otimes x_i|||\Big\}^2\Big]
&= \{E|||(\hat x_i - x_i)\otimes x_i|||\}^2 + n^{-1}\mathrm{var}(|||(\hat x_i - x_i)\otimes x_i|||) \\
&\le E(\|\hat x_i - x_i\|^2)E(\|x_i\|^2) + n^{-1}\{E(\|\hat x_i - x_i\|^4)E(\|x_i\|^4)\}^{1/2} = o(n^{-1}).
\end{align*}
For the last part, $E(|||n^{-1}\sum_{i=1}^{n}x_i\otimes x_i - K|||^2) = O(n^{-1})$ by Lemma 3.3 of Hall and Hosseini-Nasab [2009]. This leads to $E(|||\hat K - K|||^2) = O(n^{-1})$.
We next obtain the bound in (b) for $k = 1, \dots, s_n$. First notice that $F_n$ is implied by $\min_{k=1,\dots,s_n}(w_k - w_{k+1}) \ge s_n^{-a-1} \ge Cn^{-\tau/2}$ for some $C$ and any $\tau < 1$, which is fulfilled by $s_n^{a+1}n^{-1/2} \to 0$ in (A3), leading to $P(F_n) \to 1$ as $n \to \infty$. It is easy to see that for $k = 1, \dots, s_n$, $w_k - w_{k+1} \ge s_n^{-a-1} > 2\hat\Delta$ as $n \to \infty$, that is, $s_n \in J_n$ for large $n$. Now we bound $\|\hat\phi_k - \phi_k\|^2$ for $k = 1, \dots, s_n$. By (5.16) in Hall and Horowitz [2007], one has $\|\hat\phi_k - \phi_k\|^2 \le 2u_k^2$, where $u_k^2 = \sum_{v: v\neq k}(w_k - w_v)^{-2}\{\int(\hat K - K)\hat\phi_k\phi_v\}^2$. Also, $P(F_n) \to 1$ implies that $(\hat w_k - \hat w_v)^{-2} \le Cn^{\tau}$ with probability tending to 1. Then we have
\begin{align*}
u_k^2 &\le \sum_{v: v\neq k}(w_k - w_v)^{-2}\Big[2\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2 + 2\Big\{\int(\hat K - K)(\hat\phi_k - \phi_k)\phi_v\Big\}^2\Big] \\
&\le 2\sum_{v: v\neq k}(w_k - w_v)^{-2}\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2 + 2Cn^{\tau}\sum_{v\neq k}\Big\{\int(\hat K - K)(\hat\phi_k - \phi_k)\phi_v\Big\}^2 \\
&\le 4\sum_{v: v\neq k}(w_k - w_v)^{-2}\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2 + 2Cn^{\tau}\hat\Delta^2\|\hat\phi_k - \phi_k\|^2.
\end{align*}
Plugging this into $\|\hat\phi_k - \phi_k\|^2 \le 2u_k^2$, one has
\[
(1 - 4Cn^{\tau}\hat\Delta^2)\|\hat\phi_k - \phi_k\|^2 \le 8\sum_{v: v\neq k}(w_k - w_v)^{-2}\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2.
\]
As $\hat\Delta = O_p(n^{-1/2})$ and $\tau < 1$, we have $n^{\tau}\hat\Delta^2 = o_p(1)$, and hence $\|\hat\phi_k - \phi_k\|^2 \le C\sum_{v\neq k}(w_k - w_v)^{-2}\{\int(\hat K - K)\phi_k\phi_v\}^2$. Given the results in (a), by analogy to (5.22) in Hall and Horowitz [2007], $nE[\sum_{v\neq k}(w_k - w_v)^{-2}\{\int(\hat K - K)\phi_k\phi_v\}^2] = O(k^2)$ still holds, and the result follows by Chebyshev's inequality.
To obtain (c), we use Lemma 5.1 of Hall and Horowitz [2007] with $\psi_k = \phi_k$, $\lambda_k = w_k$ and $\hat{L} = \hat{K}$. Since $\inf_{v: v \ne k}|w_k - w_v| > 0$, we have
\[
\hat{\phi}_k - \phi_k = \sum_{v: v \ne k}(\hat{w}_k - w_v)^{-1}\phi_v\int(\hat{K} - K)\hat{\phi}_k\phi_v + \phi_k\int(\hat{\phi}_k - \phi_k)\phi_k
\]
\[
= \sum_{v: v \ne k}(w_k - w_v)^{-1}\phi_v\int(\hat{K} - K)\phi_k\phi_v + \phi_k\int(\hat{\phi}_k - \phi_k)\phi_k + \sum_{v: v \ne k}(w_k - w_v)^{-1}\phi_v\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v - \sum_{v: v \ne k}(\hat{w}_k - w_k)\{(\hat{w}_k - w_v)(w_k - w_v)\}^{-1}\phi_v\int(\hat{K} - K)\hat{\phi}_k\phi_v.
\]
Denote the last three terms by $\hat{\alpha}_k \equiv \hat{\alpha}_{1k} + \hat{\alpha}_{2k} + \hat{\alpha}_{3k}$. For $\hat{\alpha}_{1k}$, one has
\[
\|\hat{\phi}_k - \phi_k\|^2 = 2 - 2\int\hat{\phi}_k\phi_k = 2\Big(\int\phi_k^2 - \int\hat{\phi}_k\phi_k\Big) = -2\int(\hat{\phi}_k - \phi_k)\phi_k,
\]
that is, $\hat{\alpha}_{1k} = -\{\|\hat{\phi}_k - \phi_k\|^2/2\}\phi_k$. From part (b), one has $\|\hat{\phi}_k - \phi_k\| = O_p(kn^{-1/2})$ uniformly in $k = 1, \dots, s_n$, so $\|\hat{\alpha}_{1k}\| = O_p(k^2n^{-1})$. To bound $\hat{\alpha}_{2k}$, noticing that $\|\sum_v c_v\phi_v\|^2 = \sum_v c_v^2$ by orthonormality of $\{\phi_v\}$,
\[
\|\hat{\alpha}_{2k}\| \le \Big[\sum_{v \ne k}(w_k - w_v)^{-2}\Big\{\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v\Big\}^2\Big]^{1/2} \le \delta_k^{-1}\hat{\Delta}\|\hat{\phi}_k - \phi_k\| = O_p(k^{a+2}n^{-1}).
\]
For $\hat{\alpha}_{3k}$, noticing that $\sup_{k \ge 1}|\hat{w}_k - w_k| = O_p(\hat{\Delta})$, we can bound $\|\hat{\alpha}_{3k}\|$ as follows:
\[
\|\hat{\alpha}_{3k}\| \le 2|\hat{w}_k - w_k|\Big[\sum_{v: v \ne k}(w_k - w_v)^{-4}\Big\{\int(\hat{K} - K)\hat{\phi}_k\phi_v\Big\}^2\Big]^{1/2}
\]
\[
\le C\hat{\Delta}\Big[\sum_{v: v \ne k}(w_k - w_v)^{-4}\Big\{\int(\hat{K} - K)\phi_k\phi_v\Big\}^2 + \sum_{v: v \ne k}(w_k - w_v)^{-4}\Big\{\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v\Big\}^2\Big]^{1/2} = C\hat{\Delta}(A_1 + A_2)^{1/2},
\]
where $A_1 = \sum_{v: v \ne k}(w_k - w_v)^{-4}\{\int(\hat{K} - K)\phi_k\phi_v\}^2$ and $A_2 = \sum_{v: v \ne k}(w_k - w_v)^{-4}\{\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v\}^2$. By the derivations of part (b), we have
\[
A_1 \le \delta_k^{-2}\sum_{v: v \ne k}(w_k - w_v)^{-2}\Big\{\int(\hat{K} - K)\phi_k\phi_v\Big\}^2 = O_p(k^2\delta_k^{-2}n^{-1}), \qquad A_2 \le \delta_k^{-4}\hat{\Delta}^2\|\hat{\phi}_k - \phi_k\|^2 = O_p(\delta_k^{-4}k^2\hat{\Delta}^4).
\]
Notice that $\delta_k^{-2} = O\{k^{2(a+1)}\} = o(n)$ by (A3), which indicates $A_2 = o_p(k^2\delta_k^{-2}n^{-1})$. Thus $\|\hat{\alpha}_{3k}\| = O_p(k\delta_k^{-1}n^{-1})$, which leads to $\|\hat{\alpha}_k\| = O_p(k^{a+2}n^{-1})$.

For part (d), since $\sum_{k=1}^{s_n}w_k < \infty$ and $\sup_{k=1,\dots,s_n}|\hat{w}_k - w_k| \le |||\hat{K} - K||| = O_p(n^{-1/2})$ by Theorem 1 of Hall and Hosseini-Nasab [2006] and part (a), we can see that $(\sum_{k=1}^{s_n}\hat{w}_k)^{1/2} = O_p(1)$ given $n^{-1/2}s_n = o(1)$, which completes the proof.
Proof of Lemma 2. For $\hat{\theta}_{jk}^{(1)}$ and any fixed $j$, since $\hat{\xi}_{ijk} - \xi_{ijk} = \int\hat{x}_{ij}(\hat{\phi}_{jk} - \phi_{jk}) + \int(\hat{x}_{ij} - x_{ij})\phi_{jk}$, we have
\[
\hat{\theta}_{jk}^{(1)} = \sum_{i=1}^n(\hat{\xi}_{ijk} - \xi_{ijk})^2w_{jk}^{-1} \le 2\sum_{i=1}^n\Big\{\int\hat{x}_{ij}(\hat{\phi}_{jk} - \phi_{jk})\Big\}^2w_{jk}^{-1} + 2\sum_{i=1}^n\Big\{\int(\hat{x}_{ij} - x_{ij})\phi_{jk}\Big\}^2w_{jk}^{-1}
\le 2\sum_{i=1}^n\big(\|\hat{x}_{ij}\|^2\|\hat{\phi}_{jk} - \phi_{jk}\|^2 + \|\phi_{jk}\|^2\|\hat{x}_{ij} - x_{ij}\|^2\big)w_{jk}^{-1}.
\]
Given Lemma 1 and (B1), we know $E(\|\hat{x}_{ij}\|^2) \le 2E(\|\hat{x}_{ij} - x_{ij}\|^2) + 2E(\|x_{ij}\|^2) = O(1)$; hence $\|\hat{x}_{ij}\|^2 = O_p(1)$. From Lemma 1, we have $\hat{\theta}_{jk}^{(1)} = \sum_{i=1}^n\{O_p(1)O_p(k^2n^{-1}) + o_p(n^{-1})\}k^a = O_p(k^{a+2})$, uniformly for $k = 1, \dots, s_n$.

For $\hat{\theta}_{j_1j_2k_1k_2}^{(2)}$, it is obvious that
\[
|\hat{\theta}_{j_1j_2k_1k_2}^{(2)}| = \Big|n^{-1}\sum_{i=1}^n(\hat{\xi}_{ij_1k_1}\hat{\xi}_{ij_2k_2} - \xi_{ij_1k_1}\xi_{ij_2k_2})(w_{j_1k_1}w_{j_2k_2})^{-1/2}\Big|
\]
\[
= n^{-1}\Big|\sum_{i=1}^n\hat{\xi}_{ij_1k_1}(\hat{\xi}_{ij_2k_2} - \xi_{ij_2k_2})(w_{j_1k_1}w_{j_2k_2})^{-1/2} + \sum_{i=1}^n\xi_{ij_2k_2}(\hat{\xi}_{ij_1k_1} - \xi_{ij_1k_1})(w_{j_1k_1}w_{j_2k_2})^{-1/2}\Big|
\]
\[
\le n^{-1}\Big(\sum_{i=1}^n\hat{\xi}_{ij_1k_1}^2w_{j_1k_1}^{-1}\Big)^{1/2}\Big\{\sum_{i=1}^n(\hat{\xi}_{ij_2k_2} - \xi_{ij_2k_2})^2w_{j_2k_2}^{-1}\Big\}^{1/2} + n^{-1}\Big(\sum_{i=1}^n\xi_{ij_2k_2}^2w_{j_2k_2}^{-1}\Big)^{1/2}\Big\{\sum_{i=1}^n(\hat{\xi}_{ij_1k_1} - \xi_{ij_1k_1})^2w_{j_1k_1}^{-1}\Big\}^{1/2}.
\]
Since $E(\sum_{i=1}^n\xi_{ij_2k_2}^2w_{j_2k_2}^{-1}) = n$ for any $k_2 = 1, \dots, s_n$, we have $\sum_{i=1}^n\xi_{ij_2k_2}^2w_{j_2k_2}^{-1} = O_p(n)$, uniformly for $k_2 = 1, \dots, s_n$. Moreover,
\[
\sum_{i=1}^n\hat{\xi}_{ij_1k_1}^2w_{j_1k_1}^{-1} \le 2\sum_{i=1}^n(\hat{\xi}_{ij_1k_1} - \xi_{ij_1k_1})^2w_{j_1k_1}^{-1} + 2\sum_{i=1}^n\xi_{ij_1k_1}^2w_{j_1k_1}^{-1} = O_p(k_1^{a+2} + n) = O_p(n),
\]
uniformly for $k_1 = 1, \dots, s_n$. In conclusion, we have
\[
|\hat{\theta}_{j_1j_2k_1k_2}^{(2)}| = n^{-1}O_p(n^{1/2})O_p(k_2^{a/2+1}) + n^{-1}O_p(n^{1/2})O_p(k_1^{a/2+1}) = O_p(k_1^{a/2+1}n^{-1/2} + k_2^{a/2+1}n^{-1/2}),
\]
uniformly for $k_1, k_2 = 1, \dots, s_n$.
For $\hat{\theta}_{j_1j_2k_1k_2}^{(3)}$, $E(\hat{\theta}_{j_1j_2k_1k_2}^{(3)})^2 \le n^{-1}\{E(\xi_{ij_1k_1}^4w_{j_1k_1}^{-2})E(\xi_{ij_2k_2}^4w_{j_2k_2}^{-2})\}^{1/2} \le c_1n^{-1}$, uniformly for $k_1, k_2 = 1, \dots, s_n$, by (B1). It follows that $\hat{\theta}_{j_1j_2k_1k_2}^{(3)} = O_p(n^{-1/2})$, uniformly for $k_1, k_2 = 1, \dots, s_n$. Moreover, it is immediate that $\hat{\theta}_{j_1j_2k_1k_2}^{(4)} = \hat{\theta}_{j_1j_2k_1k_2}^{(2)} + \hat{\theta}_{j_1j_2k_1k_2}^{(3)} = O_p(n^{-1/2}k_1^{a/2+1} + n^{-1/2}k_2^{a/2+1})$, uniformly for $k_1, k_2 = 1, \dots, s_n$.
Based on (B5), we conclude that $z_{il}^4 = O_p(1)$, uniformly for $l = 1, \dots, p_n$. For $\hat{\theta}_{jkl}^{(5)}$, we have
\[
|\hat{\theta}_{jkl}^{(5)}| \le n^{-1}\Big\{\sum_{i=1}^n(\hat{\xi}_{ijk} - \xi_{ijk})^2w_{jk}^{-1}\Big\}^{1/2}\Big(\sum_{i=1}^nz_{il}^2\Big)^{1/2} = n^{-1}O_p(k^{a/2+1})O_p(n^{1/2}) = O_p(k^{a/2+1}n^{-1/2}),
\]
uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. Furthermore,
\[
E(\hat{\theta}_{jkl}^{(6)})^2 \le n^{-1}\{E(\xi_{ijk}^4w_{jk}^{-2})E(z_{il}^4)\}^{1/2} = O(n^{-1}),
\]
uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. Thus $\hat{\theta}_{jkl}^{(6)} = O_p(n^{-1/2})$, uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. It then follows that $\hat{\theta}_{jkl}^{(7)} = \hat{\theta}_{jkl}^{(5)} + \hat{\theta}_{jkl}^{(6)} = O_p(k^{a/2+1}n^{-1/2})$, uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. For $\hat{\theta}_{l_1l_2}^{(8)}$, we have $E(\hat{\theta}_{l_1l_2}^{(8)})^2 \le n^{-1}\{E(z_{il_1}^4)E(z_{il_2}^4)\}^{1/2} = O(n^{-1})$, uniformly for $l_1, l_2 = 1, \dots, p_n$, and this entails that $\hat{\theta}_{l_1l_2}^{(8)} = O_p(n^{-1/2})$, uniformly for $l_1, l_2 = 1, \dots, p_n$.
For $\hat{\vartheta}_{jl}^{(1)}$, given (A4), we have
\[
E|\hat{\vartheta}_{jl}^{(1)}| \le \sum_{i=1}^n\sum_{k=s_n+1}^{\infty}|b_{jk0}|E|\xi_{ijk}z_{il}| \le \sum_{i=1}^n\sum_{k=s_n+1}^{\infty}|b_{jk0}|\{E(\xi_{ijk}^2)E(z_{il}^2)\}^{1/2} \le c_1\sum_{i=1}^n\sum_{k=s_n+1}^{\infty}k^{-b}k^{-1/2} = O(ns_n^{-b+1/2}) = o(n^{1/2}).
\]
Hence $\hat{\vartheta}_{jl}^{(1)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$. For $\hat{\vartheta}_{jl}^{(2)}$, we have
\[
\hat{\vartheta}_{jl}^{(2)} = \sum_{i=1}^n\sum_{k=1}^{s_n}\Big\{\int\hat{x}_{ij}(\hat{\phi}_{jk} - \phi_{jk}) + \int(\hat{x}_{ij} - x_{ij})\phi_{jk}\Big\}(\hat{b}_{jk} - b_{jk0})z_{il}
\]
\[
= \int\Big(\sum_{i=1}^n\hat{x}_{ij}z_{il}\Big)\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})(\hat{b}_{jk} - b_{jk0}) + \int\Big\{\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\}\sum_{k=1}^{s_n}\phi_{jk}(\hat{b}_{jk} - b_{jk0}).
\]
It follows that
\[
(\hat{\vartheta}_{jl}^{(2)})^2 \le 2\Big\|\sum_{i=1}^n\hat{x}_{ij}z_{il}\Big\|^2\Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})(\hat{b}_{jk} - b_{jk0})\Big\|^2 + 2\Big\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\|^2\Big\|\sum_{k=1}^{s_n}\phi_{jk}(\hat{b}_{jk} - b_{jk0})\Big\|^2.
\]
Since
\[
E\Big\|\sum_{i=1}^n\hat{x}_{ij}z_{il}\Big\|^2 \le E\Big(\sum_{i=1}^n\|\hat{x}_{ij}\||z_{il}|\Big)^2 \le n^2\{E(\|\hat{x}_{1j}\||z_{1l}|)\}^2 + n\{E(\|\hat{x}_{1j}\|^4)E(z_{1l}^4)\}^{1/2} = O(n^2),
\]
we have $\|\sum_{i=1}^n\hat{x}_{ij}z_{il}\|^2 = O_p(n^2)$. Similarly,
\[
E\Big\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\|^2 \le n^2\{E(\|\hat{x}_{ij} - x_{ij}\||z_{il}|)\}^2 + n\{E(\|\hat{x}_{ij} - x_{ij}\|^4)E(z_{il}^4)\}^{1/2} = o(n),
\]
leading to $\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\|^2 = o_p(n)$. Moreover,
\[
\Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})(\hat{b}_{jk} - b_{jk0})\Big\|^2 \le \|\hat{b}_j^{(1)} - b_{j0}^{(1)}\|_2^2\sum_{k=1}^{s_n}\|\hat{\phi}_{jk} - \phi_{jk}\|^2w_{jk}^{-1} = O_p(r_n^2/n)O_p\Big\{\sum_{k=1}^{s_n}(k^2/n)k^a\Big\} = O_p(r_n^2s_n^{a+3}/n^2),
\]
and
\[
\Big\|\sum_{k=1}^{s_n}\phi_{jk}(\hat{b}_{jk} - b_{jk0})\Big\|^2 \le \Big(\sum_{k=1}^{s_n}|\hat{b}_{jk} - b_{jk0}|\Big)^2 \le \|\hat{b}_j^{(1)} - b_{j0}^{(1)}\|_2^2\sum_{k=1}^{s_n}w_{jk}^{-1} = O_p(r_n^2s_n^{a+1}/n),
\]
by Theorem 1. In summary, we get $(\hat{\vartheta}_{jl}^{(2)})^2 = o_p(r_n^2s_n^{a+3}) = o_p(q_ns_n^{a+3} + s_n^{a+4})$, uniformly for $l = 1, \dots, p_n$. Under conditions (A3) and (A5), $\hat{\vartheta}_{jl}^{(2)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.
For $\hat{\vartheta}_{jl}^{(3)}$, we have
\[
(\hat{\vartheta}_{jl}^{(3)})^2 = \Big[\int\Big\{\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\}\Big(\sum_{k=1}^{s_n}\phi_{jk}b_{jk0}\Big)\Big]^2 \le \Big\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\|^2\Big\|\sum_{k=1}^{s_n}\phi_{jk}b_{jk0}\Big\|^2 \le \Big(\sum_{k=1}^{s_n}|b_{jk0}|\Big)^2o_p(n) = o_p(n),
\]
hence $\hat{\vartheta}_{jl}^{(3)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.
Next, we show that $\hat{\vartheta}_{jl}^{(4)} = \hat{\vartheta}_{jl}^{(5)} + \hat{\vartheta}_{jl}^{(6)} + \hat{\vartheta}_{jl}^{(7)} + \hat{\vartheta}_{jl}^{(8)}$. Recall that $\Xi_j = (\Xi_{j1}, \dots, \Xi_{jq_n})^{\mathrm{T}}$ with $E\{x_{ij}(t)z_{il}\} = \Xi_{jl}(t)$. By Lemma 1, we have the expression
\[
\hat{\phi}_{jk}(t) - \phi_{jk}(t) = n^{-1/2}\sum_{v: v \ne k}(w_{jk} - w_{jv})^{-1}\phi_{jv}(t)\int\Re_j\phi_{jk}\phi_{jv} + \hat{\alpha}_{jk}(t),
\]
where $\|\hat{\alpha}_{jk}\| = O_p(k^{a+2}n^{-1})$, $\Re_j = n^{1/2}(\hat{K}_j - K_j)$, and the $O_p(\cdot)$ is uniform in $k = 1, \dots, s_n$. Since
\[
\hat{\vartheta}_{jl}^{(4)} = \sum_{i=1}^n\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk0}z_{il} = \int\Big(\sum_{i=1}^nx_{ij}z_{il}\Big)\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})b_{jk0},
\]
substituting the expression for $\hat{\phi}_{jk}(t) - \phi_{jk}(t)$ into the above equation, we immediately get $\hat{\vartheta}_{jl}^{(4)} = \hat{\vartheta}_{jl}^{(5)} + \hat{\vartheta}_{jl}^{(6)} + \hat{\vartheta}_{jl}^{(7)} + \hat{\vartheta}_{jl}^{(8)}$.
For $\hat{\vartheta}_{jl}^{(5)}$, it is obvious that
\[
(\hat{\vartheta}_{jl}^{(5)})^2 \le \Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})b_{jk0}\Big\|^2\Big\|\sum_{i=1}^n(x_{ij}z_{il} - \Xi_{jl})\Big\|^2,
\]
where
\[
E\Big\|\sum_{i=1}^n(x_{ij}z_{il} - \Xi_{jl})\Big\|^2 = \int E\Big[\Big\{\sum_{i=1}^n(x_{ij}z_{il} - \Xi_{jl})\Big\}^2\Big] = n\int\mathrm{var}(x_{ij}z_{il}) \le n\int\{E(x_{ij}^4)E(z_{il}^4)\}^{1/2} = O(n),
\]
and
\[
\Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})b_{jk0}\Big\| \le \sum_{k=1}^{s_n}|b_{jk0}|\|\hat{\phi}_{jk} - \phi_{jk}\| = O_p\Big(\sum_{k=1}^{s_n}k^{-b}kn^{-1/2}\Big) = O_p(n^{-1/2}),
\]
thus $(\hat{\vartheta}_{jl}^{(5)})^2 = O_p(1) = o_p(n)$, and it follows that $\hat{\vartheta}_{jl}^{(5)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.

For $\hat{\vartheta}_{jl}^{(6)}$, we have
\[
(\hat{\vartheta}_{jl}^{(6)})^2 \le n^2\int(Ex_{ij}z_{il})^2\Big\|\sum_{k=1}^{s_n}b_{jk0}\hat{\alpha}_{jk}\Big\|^2 \le n^2\Big\{\int E(x_{ij}^4)E(z_{il}^4)\Big\}^{1/2}\Big(\sum_{k=1}^{s_n}|b_{jk0}|\|\hat{\alpha}_{jk}\|\Big)^2 = n^2O_p\Big(\sum_{k=1}^{s_n}k^{-b}k^{a+2}n^{-1}\Big)^2 = o_p(n),
\]
hence $\hat{\vartheta}_{jl}^{(6)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.

From the proof of Lemma 1, it is easy to see that $\Re_j - \Re_j^* = n^{1/2}(\hat{K}_j - \tilde{K}_j) = o_p(1)$, where $\tilde{K}_j = n^{-1}\sum_{i=1}^nx_{ij}\otimes x_{ij}$, from which it follows that $\hat{\vartheta}_{jl}^{(7)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.
Proof of Lemma 3. First, it is obvious that $|\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\min}(U_1)| \le \|\hat{N}_1^{\mathrm{T}}\hat{N}_1/n - U_1\|_1$, where $\|\cdot\|_1$ is the $L_1$ norm for matrices. Since
\[
\|\hat{N}_1^{\mathrm{T}}\hat{N}_1/n - U_1\|_1 \le O_p\Big(\sum_{k_1=1}^{s_n}|\hat{\theta}_{j_1j_2k_1s_n}^{(4)}| + \sum_{l=1}^{q_n}|\hat{\theta}_{j_1s_nl}^{(7)}| + \sum_{k_1=1}^{s_n}|\hat{\theta}_{j_1k_1q_n}^{(7)}| + \sum_{l_1=1}^{q_n}|\hat{\theta}_{l_1q_n}^{(8)}|\Big)
\]
\[
= O_p\Big\{\sum_{k_1=1}^{s_n}(k_1^{a/2+1}n^{-1/2} + s_n^{a/2+1}n^{-1/2}) + \sum_{l=1}^{q_n}s_n^{a/2+1}n^{-1/2} + \sum_{k_1=1}^{s_n}k_1^{a/2+1}n^{-1/2} + \sum_{l_1=1}^{q_n}n^{-1/2}\Big\} = O_p(s_n^{a/2+2}n^{-1/2} + q_ns_n^{a/2+1}n^{-1/2})
\]
by Lemma 2(a), we have $|\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\min}(U_1)| = O_p(s_n^{a/2+2}n^{-1/2} + q_ns_n^{a/2+1}n^{-1/2})$. Under conditions (A3) and (A5), it is obvious that $|\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\min}(U_1)| = o_p(1)$. Similarly, $|\lambda_{\max}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\max}(U_1)| \le \|\hat{N}_1^{\mathrm{T}}\hat{N}_1/n - U_1\|_1 = o_p(1)$, and we skip the details here.
Proof of Lemma 4. By Lemma 3 and (B5), we know that $\hat{N}_1^{\mathrm{T}}\hat{N}_1$ is invertible; hence $P_{\hat{N}_1}$ exists. For $\Lambda_1$, we have
\[
\Lambda_1 = P_{\hat{N}_1}(Y_M - \hat{N}_1\eta_{10}) = P_{\hat{N}_1}\{Y_M - N_1\eta_{10} - \nu + \nu + (N_1 - \hat{N}_1)\eta_{10}\} = P_{\hat{N}_1}\{\varepsilon + \nu + (N_1 - \hat{N}_1)\eta_{10}\},
\]
where $\varepsilon = Y_M - N_1\eta_{10} - \nu = (\varepsilon_1, \dots, \varepsilon_n)^{\mathrm{T}}$ and $\nu = (\nu_1, \dots, \nu_n)^{\mathrm{T}}$ with $\nu_i = \sum_{j=1}^g\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}$.

For $P_{\hat{N}_1}\varepsilon$, we have
\[
E\|P_{\hat{N}_1}\varepsilon\|^2 = E(\varepsilon^{\mathrm{T}}P_{\hat{N}_1}\varepsilon) = E\{E(\varepsilon^{\mathrm{T}}P_{\hat{N}_1}\varepsilon \mid \hat{N}_1)\} = E[\mathrm{tr}\{P_{\hat{N}_1}E(\varepsilon\varepsilon^{\mathrm{T}})\}] = \sigma^2\mathrm{tr}(P_{\hat{N}_1}) = \sigma^2(q_n + gs_n) = O(q_n + s_n),
\]
hence $\|P_{\hat{N}_1}\varepsilon\|^2 = O_p(q_n + s_n)$.

For $P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}$, writing $\tilde{\xi}_{ijk} = \int\hat{x}_{ij}\phi_{jk}$, we have
\[
\|P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}\|^2 \le \|(N_1 - \hat{N}_1)\eta_{10}\|^2 \le O\Big[\sum_{i=1}^n\Big\{\sum_{j=1}^g\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk0}\Big\}^2\Big]
\]
\[
\le O\Big[2g\sum_{i=1}^n\sum_{j=1}^g\Big\{\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \tilde{\xi}_{ijk})b_{jk0}\Big\}^2 + 2g\sum_{i=1}^n\sum_{j=1}^g\Big\{\sum_{k=1}^{s_n}(\tilde{\xi}_{ijk} - \xi_{ijk})b_{jk0}\Big\}^2\Big]
\]
\[
\le O\Big[\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij}\|^2O_p\Big\{\Big(\sum_{k=1}^{s_n}k^{-b}kn^{-1/2}\Big)^2\Big\}\Big] + O_p\Big[\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij} - x_{ij}\|^2\Big(\sum_{k=1}^{s_n}k^{-b}\Big)^2\Big] = O_p(1),
\]
since $b > 2$ implies that $\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij}\|^2O_p\{(\sum_{k=1}^{s_n}k^{-b}kn^{-1/2})^2\} = O_p(1)$, and Lemma 1 entails that $\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij} - x_{ij}\|^2(\sum_{k=1}^{s_n}k^{-b})^2 = O_p(1)$. It then follows that $\|P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}\|^2 = O_p(1)$.

For $P_{\hat{N}_1}\nu$, it is obvious that $\|P_{\hat{N}_1}\nu\|^2 \le \|\nu\|^2$. For $i = 1, \dots, n$, we have
\[
E(\nu_i^2) = O\Big\{\sum_{j=1}^g\mathrm{var}\Big(\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}\Big)\Big\} = O\Big(\sum_{j=1}^g\sum_{k=s_n+1}^{\infty}b_{jk0}^2w_{jk}\Big) = O\Big(\sum_{j=1}^g\sum_{k=s_n+1}^{\infty}k^{-2b}k^{-1}\Big) = O(s_n^{-2b}).
\]
It follows that $\|P_{\hat{N}_1}\nu\|^2 = O_p(ns_n^{-2b})$. In summary,
\[
\|\Lambda_1\|_2^2 \le O(\|P_{\hat{N}_1}\varepsilon\|^2 + \|P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}\|^2 + \|P_{\hat{N}_1}\nu\|^2) = O_p(q_n + s_n + 1 + ns_n^{-2b}) = O_p(q_n + s_n) = O_p(r_n^2),
\]
where $r_n^2 = q_n + s_n$. This completes the proof.
1.6.4 Proofs of Main Theorems
Proof of Theorem 1. First, we restrict $\bar{Q}_n(\eta)$ to the subspace in which the true zero parameters are set to $0$, that is, $\{\eta \in \mathbb{R}^{ds_n+p_n}: b_j^{(1)} = 0,\ j = g+1, \dots, d,\ \gamma^{(2)} = 0\}$, and prove consistency in this $(gs_n + q_n)$-dimensional space. Define the constrained penalized function
\[
\bar{Q}_n(\eta_1) = \sum_{i=1}^n\Big\{y_i - \sum_{j=1}^g\sum_{k=1}^{s_n}(\hat{\xi}_{ijk}\hat{w}_{jk}^{-1/2})b_{jk} - z_i^{(1)\mathrm{T}}\gamma^{(1)}\Big\}^2 + 2n\sum_{l=1}^{q_n}J_{\lambda_n}(|\gamma_l|) + 2n\sum_{j=1}^gJ_{\lambda_{jn}}(\|b_j^{(1)}\|),
\]
where $\eta_1 = (b_1^{(1)\mathrm{T}}, \dots, b_g^{(1)\mathrm{T}}, \gamma^{(1)\mathrm{T}})^{\mathrm{T}}$ and $z_i^{(1)} = (z_{i1}, \dots, z_{iq_n})^{\mathrm{T}}$. We now show that there exists a local minimizer $\hat{\eta}_1$ such that $\|\hat{\eta}_1 - \eta_{10}\| = O_p(r_nn^{-1/2})$ with $r_n = (q_n + s_n)^{1/2}$. Let $\alpha_n = r_nn^{-1/2}$; we aim to show that for any $\epsilon > 0$ there exists a large constant $C$ such that $P\{\inf_{\|u\|=C}\bar{Q}_n(\eta_{10} + \alpha_nu) > \bar{Q}_n(\eta_{10})\} \ge 1 - \epsilon$ for large $n$, which implies that there exists a local minimizer $\hat{\eta}_1$ of $\bar{Q}_n(\eta_1)$ such that $\|\hat{\eta}_1 - \eta_{10}\| = O_p(\alpha_n)$. Here $u = (u^{(1)\mathrm{T}}, u^{\gamma\mathrm{T}})^{\mathrm{T}}$ with $u^{(1)} = (u_1^{(1)\mathrm{T}}, \dots, u_g^{(1)\mathrm{T}})^{\mathrm{T}}$ and $u^{\gamma} = (u_1^{\gamma}, \dots, u_{q_n}^{\gamma})^{\mathrm{T}}$. We have
\[
\bar{Q}_n(\eta_{10} + \alpha_nu) - \bar{Q}_n(\eta_{10}) \ge \|\hat{N}_1\alpha_nu\|^2 - 2\Lambda_1^{\mathrm{T}}\hat{N}_1\alpha_nu + 2n\Big[\sum_{l=1}^{q_n}\{J_{\lambda_n}(|\gamma_{l0} + \alpha_nu_l^{\gamma}|) - J_{\lambda_n}(|\gamma_{l0}|)\} + \sum_{j=1}^g\{J_{\lambda_{jn}}(\|b_{j0}^{(1)} + \alpha_nA_j^{-1}u_j^{(1)}\|) - J_{\lambda_{jn}}(\|b_{j0}^{(1)}\|)\}\Big]
\]
\[
\ge n\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n^2\|u\|^2 - 2n^{1/2}\|\Lambda_1\|_2\lambda_{\max}^{1/2}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n\|u\|
\]
\[
+ 2n\Big[\sum_{l=1}^{q_n}\{J'_{\lambda_n}(|\gamma_{l0}|)\mathrm{sgn}(\gamma_{l0})\alpha_nu_l^{\gamma} + J''_{\lambda_n}(|\gamma_{l0}|)\alpha_n^2(u_l^{\gamma})^2(1 + o(1))\} + \sum_{j=1}^g\{J'_{\lambda_{jn}}(\|b_{j0}^{(1)}\|)\alpha_n\|A_j^{-1}u_j^{(1)}\| + J''_{\lambda_{jn}}(\|b_{j0}^{(1)}\|)\alpha_n^2\|A_j^{-1}u_j^{(1)}\|^2(1 + o(1))\}\Big].
\]
The Taylor expansion in the second inequality holds because $\alpha_n\hat{w}_{js_n}^{-1/2} \le r_ns_n^{a/2}n^{-1/2} = o(1)$ by condition (A5). As for the smoothly clipped absolute deviation penalty, we have $J'_{\lambda_n}(|\gamma_{l0}|) = J''_{\lambda_n}(|\gamma_{l0}|) = 0$ for all $l = 1, \dots, q_n$ under condition (A7), and $J'_{\lambda_{jn}}(\|b_{j0}^{(1)}\|) = J''_{\lambda_{jn}}(\|b_{j0}^{(1)}\|) = 0$ for all $j = 1, \dots, g$, since $\|b_{j0}^{(1)}\| \ge C_1$ for some constant $C_1$. Thus, we can see that
\[
\bar{Q}_n(\eta_{10} + \alpha_nu) - \bar{Q}_n(\eta_{10}) \ge n\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n^2\|u\|_2^2 - 2n^{1/2}\|\Lambda_1\|_2\lambda_{\max}^{1/2}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n\|u\|_2 \ge c_3n\alpha_n^2\|u\|^2 - c_4n^{1/2}r_n\alpha_n\|u\| = c_3r_n^2\|u\|^2 - c_4r_n^2\|u\|,
\]
where $c_3, c_4$ are positive constants, and the second inequality is by Lemma 3 and (B5). When $C$ is large enough, we have $\bar{Q}_n(\eta_{10} + \alpha_nu) - \bar{Q}_n(\eta_{10}) > 0$, which implies that there exists a local minimizer $\hat{\eta}_1$ of $\bar{Q}_n(\eta_1)$ such that $\|\hat{\eta}_1 - \eta_{10}\| = O_p(\alpha_n)$.
Next, we denote $\hat{\eta} = (\hat{b}_1^{(1)\mathrm{T}}, \dots, \hat{b}_g^{(1)\mathrm{T}}, 0^{\mathrm{T}}, \dots, 0^{\mathrm{T}}, \hat{\gamma}^{(1)\mathrm{T}}, 0^{\mathrm{T}})^{\mathrm{T}}$, where $\hat{\eta}_1 = (\hat{b}_1^{(1)\mathrm{T}}, \dots, \hat{b}_g^{(1)\mathrm{T}}, \hat{\gamma}^{(1)\mathrm{T}})^{\mathrm{T}}$; our second goal is to show that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$ over the whole space $\mathbb{R}^{ds_n+p_n}$. Denote $S_l(\eta) = \partial(2n)^{-1}\|Y_M - \hat{M}b^{(1)} - Z_M\gamma\|^2/\partial\gamma_l$ and $S_j^*(\eta) = \partial(2n)^{-1}\|Y_M - \hat{M}b^{(1)} - Z_M\gamma\|^2/\partial b_j^{(1)}$. By the Karush-Kuhn-Tucker condition, it suffices to show that $\hat{\eta}$ satisfies the following conditions:
\[
S_l(\hat{\eta}) = 0 \text{ and } |\hat{\gamma}_l| \ge a\lambda_n \text{ for } l = 1, \dots, q_n;
\]
\[
|S_l(\hat{\eta})| \le \lambda_n \text{ and } |\hat{\gamma}_l| < \lambda_n \text{ for } l = q_n+1, \dots, p_n;
\]
\[
S_j^*(\hat{\eta}) = 0 \text{ and } \|\hat{b}_j^{(1)}\| \ge a\lambda_{jn} \text{ for } j = 1, \dots, g;
\]
\[
\|S_j^*(\hat{\eta})\| \le \lambda_{jn} \text{ and } \|\hat{b}_j^{(1)}\| < \lambda_{jn} \text{ for } j = g+1, \dots, d;
\]
so that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$. When $l = 1, \dots, q_n$, since $\min_{l=1,\dots,q_n}|\hat{\gamma}_l| \ge \min_{l=1,\dots,q_n}|\gamma_{l0}| - \|\hat{\gamma}^{(1)} - \gamma_0^{(1)}\|$ and $\|\hat{\gamma}^{(1)} - \gamma_0^{(1)}\| = o_p(\lambda_n)$, under (A7) we have $\min_{l=1,\dots,q_n}|\hat{\gamma}_l|/\lambda_n \to \infty$ in probability. It follows that
\[
P(|\hat{\gamma}_l| \ge a\lambda_n \text{ for } l = 1, \dots, q_n) \to 1. \quad (1.6)
\]
When $j = 1, \dots, g$, since $\|\hat{b}_j^{(1)}\| \ge \|b_{j0}^{(1)}\| - \|\hat{b}_j^{(1)} - b_{j0}^{(1)}\|$, $\|\hat{b}_j^{(1)} - b_{j0}^{(1)}\| = o_p(s_n^{a/2}\alpha_n) = o_p(1)$ and $\min_{j=1,\dots,g}\|b_{j0}^{(1)}\| > C_1$ for some positive constant $C_1$, we have $\min_{j=1,\dots,g}\|\hat{b}_j^{(1)}\|/\lambda_{jn} \to \infty$ in probability. It follows that
\[
P(\|\hat{b}_j^{(1)}\| \ge a\lambda_{jn} \text{ for } j = 1, \dots, g) \to 1. \quad (1.7)
\]
When $l = 1, \dots, q_n$ and $j = 1, \dots, g$, $S_l(\hat{\eta}) = 0$ and $S_j^*(\hat{\eta}) = 0$ hold trivially, since (1.6) and (1.7) hold and $\hat{\eta}_1$ is a local minimizer of $\bar{Q}_n(\eta_1)$.

Next, we show that $\hat{\eta}$ satisfies $|S_l(\hat{\eta})| \le \lambda_n$ and $|\hat{\gamma}_l| < \lambda_n$ for $l = q_n+1, \dots, p_n$, as well as $\|S_j^*(\hat{\eta})\| \le \lambda_{jn}$ and $\|\hat{b}_j^{(1)}\| < \lambda_{jn}$ for $j = g+1, \dots, d$. Since $\hat{\gamma}_l = 0$ for $l = q_n+1, \dots, p_n$ and $\hat{b}_j^{(1)} = 0$ for $j = g+1, \dots, d$, it suffices to show that
\[
P(|S_l(\hat{\eta})| > \lambda_n \text{ for some } l = q_n+1, \dots, p_n) \to 0, \quad (1.8)
\]
and
\[
P(\|S_j^*(\hat{\eta})\| > \lambda_{jn} \text{ for some } j = g+1, \dots, d) \to 0. \quad (1.9)
\]
Denote $d_n = (S_{q_n+1}(\hat{\eta}), \dots, S_{p_n}(\hat{\eta}))^{\mathrm{T}} = -n^{-1}Z_M^{(2)\mathrm{T}}(Y_M - \hat{N}_1\hat{\eta}_1)$, where
\[
Y_M - \hat{N}_1\hat{\eta}_1 = \varepsilon + \nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1),
\]
and $\nu = (\nu_1, \dots, \nu_n)^{\mathrm{T}}$ with $\nu_i = \sum_{j=1}^g\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}$. From the proof of Lemma 4 and the previous derivations, we have
\[
\|\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\|^2 = O_p(r_n^2) = o_p(n\lambda_n^2).
\]
Moreover, we have $\max_{l=q_n+1,\dots,p_n}\sum_{i=1}^nz_{il}^2 = O_p(n)$ by (B5). Thus
\[
\|n^{-1}Z_M^{(2)\mathrm{T}}\{\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\}\|_{\infty} \le n^{-1}\Big(\max_{l=q_n+1,\dots,p_n}\sum_{i=1}^nz_{il}^2\Big)^{1/2}\|\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\| = o_p(\lambda_n).
\]
Next, we denote $n^{-1/2}Z_M^{(2)\mathrm{T}}\varepsilon = (\delta_{q_n+1}, \dots, \delta_{p_n})^{\mathrm{T}}$ with $\delta_l = n^{-1/2}\sum_{i=1}^nz_{il}\varepsilon_i$ for $l = q_n+1, \dots, p_n$. As $\{\varepsilon_i, i = 1, \dots, n\}$ and $\{z_{il}, i = 1, \dots, n\}$ are sub-Gaussian random variables and independent of each other given (B5), each $\delta_l$ is a sub-exponential random variable with uniformly bounded second moment for $l = q_n+1, \dots, p_n$. Thus, for any constant $C > 0$, there exist positive constants $C_1$ and $C_2$ such that, uniformly for $l = q_n+1, \dots, p_n$,
\[
P(|\delta_l| > Cn^{1/2}\lambda_n) \le C_2\exp(-2^{-1}C_1n^{1/2}\lambda_n),
\]
and it follows that
\[
P(\|n^{-1/2}Z_M^{(2)\mathrm{T}}\varepsilon\|_{\infty} > Cn^{1/2}\lambda_n) \le \sum_{l=q_n+1}^{p_n}P(|\delta_l| > Cn^{1/2}\lambda_n) \le O\{p_n\exp(-2^{-1}C_1n^{1/2}\lambda_n)\} = O\{\exp(n^{\alpha} - 2^{-1}C_1n^{1/2}\lambda_n)\} = o(1),
\]
where the last two equalities are by (A6) and (A7). Hence, it is obvious that $\|n^{-1}Z_M^{(2)\mathrm{T}}\varepsilon\|_{\infty} = o_p(\lambda_n)$. In conclusion, we have $\|d_n\|_{\infty} \le \|n^{-1}Z_M^{(2)\mathrm{T}}\varepsilon\|_{\infty} + \|n^{-1}Z_M^{(2)\mathrm{T}}\{\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\}\|_{\infty} = o_p(\lambda_n)$, and this implies that (1.8) holds.
Next, we show (1.9). For $j = g+1, \dots, d$, we have $S_j^*(\hat{\eta}) = -n^{-1}\hat{M}_j^{\mathrm{T}}(Y_M - \hat{N}_1\hat{\eta}_1) = (S_{j1}, \dots, S_{js_n})^{\mathrm{T}}$, such that $\sum_{k=1}^{s_n}S_{jk}^2$ is bounded by
\[
2\sum_{k=1}^{s_n}\Big(n^{-1}\sum_{i=1}^n\hat{\xi}_{ijk}\varepsilon_i\Big)^2 + 2n^{-2}\sum_{k=1}^{s_n}\sum_{i=1}^n\hat{\xi}_{ijk}^2\|\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\|^2.
\]
For $\sum_{k=1}^{s_n}(n^{-1}\sum_{i=1}^n\hat{\xi}_{ijk}\varepsilon_i)^2$, we have
\[
E\sum_{k=1}^{s_n}\Big(n^{-1}\sum_{i=1}^n\hat{\xi}_{ijk}\varepsilon_i\Big)^2 = n^{-2}\sum_{k=1}^{s_n}\sum_{i=1}^n\sigma^2E(\hat{\xi}_{ijk}^2) \le n^{-2}\sum_{k=1}^{s_n}\sum_{i=1}^n\sigma^2E(\|\hat{x}_{ij}\|^2) = O(s_nn^{-1}).
\]
It is easy to see that $\sum_{k=1}^{s_n}\sum_{i=1}^n\hat{\xi}_{ijk}^2 = O_p(n\sum_{k=1}^{s_n}w_{jk}) = O_p(n)$. Thus
\[
\sum_{k=1}^{s_n}S_{jk}^2 = O_p(s_nn^{-1}) + n^{-2}O_p(n)O_p(r_n^2) = O_p\{(q_n + s_n)n^{-1}\},
\]
and it follows that
\[
\|S_j^*(\hat{\eta})\|^2 = \sum_{k=1}^{s_n}S_{jk}^2 = O_p\{n^{-1}(q_n + s_n)\} = o_p(\lambda_{jn}^2)
\]
under (A7), which entails (1.9) immediately. Hence, we conclude that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$ such that $\|\hat{\gamma} - \gamma_0\| = O_p(r_nn^{-1/2})$, $\|\hat{b}^{(1)} - b_0^{(1)}\| = O_p(r_nn^{-1/2})$, $\|\hat{\beta}^{(1)} - \beta_0^{(1)}\| = O_p(r_ns_n^{a/2}n^{-1/2})$, and $P(\hat{b}_j^{(1)} = 0,\ j = g+1, \dots, d,\ \hat{\gamma}^{(2)} = 0) \to 1$. This completes the proof.
Proof of Theorem 2. For convenience, define the following:
\[
L_n(\eta) = \sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{\infty}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)^2,
\]
\[
T_{1n}(\eta) = -\sum_{i=1}^n\Big(\sum_{j=1}^d\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk}\Big)^2 + 2\sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{s_n}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)\Big(\sum_{j=1}^d\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk}\Big),
\]
\[
T_{2n}(\eta) = -2\sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{s_n}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)\sum_{j=1}^d\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk},
\]
\[
T_{3n}(\eta) = \sum_{i=1}^n\Big\{\sum_{j=1}^d\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk}\Big\}^2,
\]
\[
T_{4n}(\eta) = 2n\Big\{\sum_{l=1}^{p_n}J_{\lambda_n}(|\gamma_l|) + \sum_{j=1}^dJ_{\lambda_{jn}}(\|b_j^{(1)}\|)\Big\},
\]
\[
P_n(\eta) = L_n(\eta) + T_{1n}(\eta) = \sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{s_n}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)^2.
\]
Then we have $\bar{Q}_n(\eta) = L_n(\eta) + T_{1n}(\eta) + T_{2n}(\eta) + T_{3n}(\eta) + T_{4n}(\eta)$; denote $\nabla\bar{Q}_n(\eta) = \partial\bar{Q}_n(\eta)/\partial\gamma^{(1)} = (\partial\bar{Q}_n(\eta)/\partial\gamma_1, \dots, \partial\bar{Q}_n(\eta)/\partial\gamma_{q_n})^{\mathrm{T}}$ and $\nabla^2\bar{Q}_n(\eta) = \partial^2\bar{Q}_n(\eta)/(\partial\gamma^{(1)}\partial\gamma^{(1)\mathrm{T}})$. Recall that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$; hence, we have $\nabla\bar{Q}_n(\hat{\eta}) = \nabla L_n(\hat{\eta}) + \nabla T_{1n}(\hat{\eta}) + \nabla T_{2n}(\hat{\eta}) + \nabla T_{3n}(\hat{\eta}) + \nabla T_{4n}(\hat{\eta}) = 0$.

It is obvious that $\nabla T_{3n}(\hat{\eta}) = 0$, because $T_{3n}(\eta)$ does not depend on $\gamma^{(1)}$. Moreover, by the properties of $\hat{\eta}$ derived before and (A7), we have that, uniformly for $l = 1, \dots, q_n$, $\partial T_{4n}(\hat{\eta})/\partial\gamma_l = 2nJ'_{\lambda_n}(|\hat{\gamma}_l|)\mathrm{sgn}(\hat{\gamma}_l) = 0$ with probability tending to one; thus $\nabla T_{4n}(\hat{\eta}) = 0$. For $\nabla P_n(\hat{\eta})$, by Taylor expansion we have
\[
\nabla P_n(\hat{\eta}) = \nabla P_n(\eta_0) + \nabla^2P_n(\eta_0)(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = \nabla L_n(\eta_0) + \nabla T_{1n}(\eta_0) + \nabla^2L_n(\eta_0)(\hat{\gamma}^{(1)} - \gamma_0^{(1)})
\]
\[
= \Big(-2\sum_{i=1}^n\varepsilon_iz_i^{(1)}\Big) + \Big(-2\sum_{j=1}^d\hat{\vartheta}_j^{(1)}\Big) + 2\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = \Big(-2\sum_{i=1}^n\varepsilon_iz_i^{(1)}\Big) + 2\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) + o_p(n^{1/2})1_{q_n},
\]
where the last equality is by Lemma 2(b), and $1_{q_n}$ is the vector of ones with dimension $q_n$. For $\nabla T_{2n}(\hat{\eta})$,
\[
\nabla T_{2n}(\hat{\eta}) = 2\sum_{j=1}^d(\hat{\vartheta}_j^{(2)} + \hat{\vartheta}_j^{(3)} + \hat{\vartheta}_j^{(4)}) = 2\sum_{j=1}^d(\hat{\vartheta}_j^{(2)} + \hat{\vartheta}_j^{(3)} + \hat{\vartheta}_j^{(5)} + \hat{\vartheta}_j^{(6)} + \hat{\vartheta}_j^{(7)} + \hat{\vartheta}_j^{(8)}) = o_p(n^{1/2})1_{q_n} + 2\sum_{j=1}^d\hat{\vartheta}_j^{(8)},
\]
where the last two equalities are by Lemma 2(b). Noticing that $\hat{\vartheta}_j^{(8)} = 0$ for $j = g+1, \dots, d$,
\[
\nabla\bar{Q}_n(\hat{\eta}) = o_p(n^{1/2})1_{q_n} + 2\sum_{j=1}^g\hat{\vartheta}_j^{(8)} + \Big(-2\sum_{i=1}^n\varepsilon_iz_i^{(1)}\Big) + 2\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = 0,
\]
and it follows that
\[
n^{-1/2}\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = n^{-1/2}\sum_{i=1}^n\varepsilon_iz_i^{(1)} - n^{-1/2}\sum_{j=1}^g\hat{\vartheta}_j^{(8)} + o_p(1)1_{q_n} = \sum_{i=1}^nW_{in} + o_p(1)1_{q_n},
\]
with $W_{in} = n^{-1/2}\varepsilon_iz_i^{(1)} - n^{-1/2}\sum_{j=1}^g\sum_{k=1}^{s_n}\sum_{v \ne k}b_{jk0}(w_{jk} - w_{jv})^{-1}\langle\Xi_j, \phi_{jv}\rangle\int(x_{ij}\otimes x_{ij} - K_j)\phi_{jk}\phi_{jv}$. It is obvious that $W_{in}$, $i = 1, \dots, n$, are independent and identically distributed with zero mean and $\mathrm{cov}(W_{in}) = n^{-1}(\sigma^2\Sigma_1 + B_n)$. It suffices to show that $V_n^{-1/2}A_n\sum_{i=1}^nW_{in}$ converges to $N(0_r, I_r)$ in distribution, where $V_n = A_n(\sigma^2\Sigma_1 + B_n)A_n^{\mathrm{T}} \to V^* = \sigma^2H^* + B^*$, and $V^*$ is positive definite. We show this by the Lindeberg-Feller theorem.
First, we denote $\Pi_{in} = V_n^{-1/2}A_nW_{in}$; it is trivial that $\Pi_{in}$, $i = 1, \dots, n$, are independent and identically distributed centered random vectors with $\mathrm{cov}(\sum_{i=1}^n\Pi_{in}) = I_r$. Second, for any $\zeta > 0$, we have
\[
\sum_{i=1}^nE[\|\Pi_{in}\|_2^2\,1\{\|\Pi_{in}\| > \zeta\}] \le n\{E(\|\Pi_{1n}\|^4)\}^{1/2}\{P(\|\Pi_{1n}\| > \zeta)\}^{1/2},
\]
where
\[
\|\Pi_{1n}\|^2 = \Pi_{1n}^{\mathrm{T}}\Pi_{1n} = W_{1n}^{\mathrm{T}}A_n^{\mathrm{T}}V_n^{-1}A_nW_{1n} \le \lambda_{\max}(A_n^{\mathrm{T}}V_n^{-1}A_n)\|W_{1n}\|^2 \le \lambda_{\max}(V_n^{-1})\lambda_{\max}(A_n^{\mathrm{T}}A_n)\|W_{1n}\|^2 = O(1)\|W_{1n}\|^2.
\]
It follows that $E(\|\Pi_{1n}\|^4) = O(1)E(\|W_{1n}\|^4) = O(1)q_n^2/n^2$. Moreover,
\[
P(\|\Pi_{1n}\| > \zeta) = P(\|V_n^{-1/2}A_nW_{1n}\| > \zeta) \le E(\|V_n^{-1/2}A_nW_{1n}\|^2)/\zeta^2 \le \lambda_{\max}(A_n^{\mathrm{T}}V_n^{-1}A_n)E(\|W_{1n}\|^2)/\zeta^2 = O(1)E(\|W_{1n}\|^2) = O(1)q_n/n.
\]
By the previous derivations and assumptions, we know
\[
\sum_{i=1}^nE[\|\Pi_{in}\|^2\,1\{\|\Pi_{in}\| > \zeta\}] = nO(q_n/n)O(q_n^{1/2}n^{-1/2}) = O(q_n^{3/2}n^{-1/2}) = o(1),
\]
since $q_n = o(n^{1/3})$. By the Lindeberg-Feller theorem, we conclude that $V_n^{-1/2}A_n\sum_{i=1}^nW_{in}$ converges to $N(0_r, I_r)$ in distribution, which completes the proof.
Chapter 2

Testing General Hypothesis in Large-scale FLR
2.1 Introduction
The classical functional linear regression (FLR) has been widely adopted for modelling the linear relationship between a scalar response $Y$ and a functional predictor that is often assumed to be sampled from an $L^2(\mathcal{T})$ random process $X(t)$ defined on a compact interval $\mathcal{T} \subseteq \mathbb{R}$. Specifically, given $n$ independent and identically distributed (i.i.d.) pairs $\{Y_i, X_i(\cdot)\}$, the classical FLR is formulated as
\[
Y_i = \int_{\mathcal{T}}X_i(t)\beta(t)\,dt + \varepsilon_i, \quad i = 1, \dots, n, \quad (2.1)
\]
where both $Y_i$ and $X_i$ are centred without loss of generality, i.e., $EY_i = 0$ and $EX_i(t) = 0$ for $t \in \mathcal{T}$; the unknown regression parameter function $\beta(t)$ is square-integrable, i.e., $\beta \in L^2(\mathcal{T})$; and the i.i.d. regression errors $\varepsilon_i$ are independent of $X_i$ with mean zero and variance $0 < \sigma^2 < \infty$. This model has been extensively researched in the functional data analysis literature [Ramsay and Dalzell, 1991, Cardot et al., 1999, Fan and Zhang, 2000, Yuan and Cai, 2010, among others], including theoretical considerations [Cai and Hall, 2006, Hall and Horowitz, 2007, Cai and Yuan, 2012] and statistical inference [Cardot et al., 2003, Lei, 2014, Hilgert et al., 2013], and we refer readers to Ramsay and Silverman [2005] for an overview and numerical examples. There have also been numerous works extending the classical FLR to various settings, such as the functional response [Faraway, 1997, Cuevas et al., 2002, Yao et al., 2005b], the generalized FLR [Escabias et al., 2004, James, 2002, Muller and Stadtmuller, 2005, Shang and Cheng, 2015], the partially FLR [Lian, 2011, Kong et al., 2016], and the additive regression [Muller and Yao, 2008, Zhu et al., 2014, Fan et al., 2015], among many others.
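A minimal simulation of model (2.1) illustrates how $\beta$ can be recovered by least squares on a truncated basis expansion. The choices of basis, slope function, and noise level below are arbitrary and not tied to any dataset discussed here:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_grid, K = 500, 201, 5
t = np.linspace(0, 1, n_grid)
dt = t[1] - t[0]

# Orthonormal cosine basis on [0, 1] and a smooth true slope function beta
basis = np.array([np.sqrt(2) * np.cos(k * np.pi * t) for k in range(1, K + 1)])
eta_true = np.array([1.0, -0.5, 0.25, 0.0, 0.1])   # basis coefficients of beta
beta = eta_true @ basis

# Predictors X_i with independent basis scores; responses from model (2.1)
scores = rng.standard_normal((n, K)) * (1.0 / np.arange(1, K + 1))
X = scores @ basis
Y = X @ beta * dt + 0.1 * rng.standard_normal(n)   # Riemann sum for the integral

# Truncated least squares: regress Y on the projection scores <X_i, b_k>
Theta = X @ basis.T * dt                           # n x K projection scores
eta_hat = np.linalg.lstsq(Theta, Y, rcond=None)[0]
print(np.max(np.abs(eta_hat - eta_true)))          # small estimation error
```

With a well-specified truncation, the recovered coefficients are close to the truth; the truncation level and bias-variance issues that arise when $\beta$ has infinitely many nonzero coefficients are exactly what the later development addresses.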
In modern scientific experiments, the response $Y$ can potentially be associated with multiple or even a large number of functional predictors; e.g., Lian [2013] proposed a FLR involving a fixed number of functional predictors, and Kong et al. [2016] considered regularized estimation and variable selection for a partially FLR that contains high-dimensional scalar covariates in addition to a finite number of functional predictors. However, in large-scale data analysis where a FLR is exploited, the number of potential functional predictors $p_n$ can possibly be much larger than the sample size $n$, although the significant ones, of size $q_n$, are usually assumed to be sparse or of a fractional polynomial order of $n$. Examples can be found in neuroimaging analysis, where one studies the relationship between a certain disease marker and a number of brain regions of interest (ROI) scanned over time. This consideration motivates what we call a large-scale FLR model, as follows:
\[
Y_i = \sum_{j=1}^{p_n}\int_{\mathcal{T}}X_{ij}(t)\beta_j(t)\,dt + \varepsilon_i, \quad i = 1, \dots, n, \quad (2.2)
\]
where $p_n$ is allowed to grow exponentially in the sample size $n$; without loss of generality, the first $q_n$ regression parameter functions $\{\beta_j: j = 1, \dots, q_n\}$ are assumed nonzero (i.e., the important ones), while the rest are zero, and the i.i.d. errors $\varepsilon_i$ are independent of $\{X_{ij}: j = 1, \dots, p_n\}$ with mean zero and variance $\sigma^2$. It is common to use a set of pre-fixed (e.g., B-splines, wavelets) or data-driven (e.g., eigenfunctions) basis functions to represent the underlying process $X_j$ of each predictor $\{X_{ij}: i = 1, \dots, n\}$. Since data-driven bases, such as eigenfunctions, have to be estimated from $p_n$ separate functional principal component analysis procedures, which is computationally prohibitive when $p_n \gg n$ and does not necessarily produce efficient representations for the sake of regression/prediction of $Y$, we adopt a common pre-fixed basis $\{b_k: k \ge 1\}$ that is complete and orthonormal in $L^2(\mathcal{T})$ for all processes $X_j$, $j = 1, \dots, p_n$, and do not further pursue other, more complicated basis-seeking procedures, e.g., functional partial least squares [Reiss and Ogden, 2007].
To be specific, the functional predictors and the associated regression functions can be expressed as linear combinations of the complete orthonormal basis $\{b_k: k \ge 1\}$, such that $\beta_j = \sum_{k=1}^{\infty}\eta_{jk}b_k$ and $X_{ij} = \sum_{k=1}^{\infty}\theta_{ijk}b_k$, where the coefficients $\theta_{ijk} = \int_{\mathcal{T}}X_{ij}(t)b_k(t)\,dt$, coinciding with the projections, are mean-zero random variables whose variances are $E(\theta_{ijk}^2) = \omega_{jk} > 0$. As a result, model (2.2) can be reformulated as
\[
Y_i = \sum_{j=1}^{p_n}\sum_{k=1}^{\infty}\theta_{ijk}\eta_{jk} + \varepsilon_i. \quad (2.3)
\]
To perform estimation and inference on the regression functions of primary interest, a direct minimization of the squared loss with respect to the infinite sequences of unknown coefficients $\eta_{jk}$ is not feasible. A common practice is to truncate at the first $s_n$ leading terms, with $s_n$ allowed to grow with $n$, where $s_n$ controls the complexity of each $\beta_j$ as a whole function and balances the bias-variance tradeoff, in a similar spirit to classical nonparametric regression.
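To make the truncation concrete, the following sketch (with made-up dimensions and a hypothetical smooth-path generator, purely for illustration) builds the $n \times (p_ns_n)$ matrix of truncated projection scores implied by (2.3) under a common basis:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_n, s_n, n_grid = 200, 50, 4, 101
t = np.linspace(0, 1, n_grid)
dt = t[1] - t[0]
basis = np.array([np.sqrt(2) * np.sin(k * np.pi * t) for k in range(1, s_n + 1)])

# p_n functional predictors observed on a common grid (here: smooth random paths)
X = np.empty((n, p_n, n_grid))
for j in range(p_n):
    scores = rng.standard_normal((n, s_n)) / np.arange(1, s_n + 1)
    X[:, j, :] = scores @ basis

# theta[i, j, k] approximates <X_ij, b_k>; truncating (2.3) at s_n leaves an
# n x (p_n * s_n) design matrix for (penalized) least squares
theta = np.einsum('ijt,kt->ijk', X, basis) * dt
design = theta.reshape(n, p_n * s_n)
print(design.shape)      # (200, 200)
```
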
To our knowledge, this type of large-scale FLR first appeared in Fan et al. [2015], which considered a penalized procedure for model estimation and selection, while our primary interest is hypothesis testing. A careful inspection of Condition 1(A) in Fan et al. [2015, Appendix B], which requires $\sum_{k=1}^{\infty}\theta_{ijk}^2k^4 < C^2$ for a universal constant $C$ for $i = 1, \dots, n$, $j = 1, \dots, p_n$, reveals that all random processes $X_j$ are bounded in $L^2(\mathcal{T})$ uniformly in $j$, thus excluding even the case of Gaussian processes. Moreover, Condition 2(D) there assumes that the minimal eigenvalue of $n^{-1}\Theta_j'\Theta_j$ is bounded from below by a constant $c_0$ uniformly in $1 \le j \le q_n$ (i.e., the important ones), where $\Theta_j = (\theta_{ijk})_{1 \le i \le n; 1 \le k \le s_n}$ is the $n \times s_n$ design matrix induced by $X_j$; note that Fan et al. [2015] used $q_n$ to denote the truncation size, with some abuse of notation. Unfortunately, this crucial condition is not valid for an infinite-dimensional $L^2$ process, as the minimal eigenvalue necessarily approaches $0$ when $s_n$ diverges, with a typical illustration given by the Karhunen-Loeve expansion. By contrast, we do not make these stringent assumptions, so that the predictor processes are genuinely functional in the large-scale FLR (2.2).
The main contribution of this paper is to develop a rigorous testing procedure for general hypotheses on an arbitrary subset of the regression functions $\{\beta_k: k = 1, \dots, p_n\}$. The challenge arises not only from the ultrahigh dimensionality in $p_n$, which can be exponentially large in $n$, but also from the intrinsic infinite-dimensionality of each $X_j$, $j = 1, \dots, p_n$. Although the FLR (and its variants) has been well studied, there are only a few contributions on inference procedures; e.g., Hilgert et al. [2013] and Lei [2014] considered adaptive tests of a single regression function in the classical FLR, and Shang and Cheng [2015] did so for the generalized FLR. In the current exposition, we adopt a general class of nonconvex penalty functions [Loh and Wainwright, 2015] that includes the LASSO [Tibshirani, 1996], SCAD [Fan and Li, 2001] and MCP [Zhang, 2010] penalties as special cases, and whose theoretical properties in high-dimensional linear regression have been extensively studied [Meinshausen and Buhlmann, 2006, van de Geer, 2008, Meinshausen and Yu, 2009, Bickel et al., 2009, Zhang, 2009, Fan and Lv, 2011, Wang et al., 2013, 2014, Fan et al., 2014, Loh and Wainwright, 2015, among many others]. Recently, research on inference for high-dimensional linear regression has flourished, especially for the LASSO-type convex penalty [Tibshirani, 1996]; see, e.g., Wasserman and Roeder [2009], Meinshausen and Buhlmann [2010] and Shah and Samworth [2013] based on sample splitting or subsampling, Zhang and Zhang [2014] and van de Geer et al. [2014] on bias-correction methods, and Lockhart et al. [2014] and Taylor et al. [2014] for conditional inference on the event that some covariates have been selected, among others.
In this article, we are inspired by the unconditional inference based on the decorrelated score function proposed in Ning and Liu [2016], owing to its generality and its freedom from data splitting or strong minimal signal conditions. We first exploit a penalized least squares procedure treating the truncated coefficients of each $\beta_k$ altogether in a group fashion, and obtain estimation consistency, with no need of oracle properties, under weaker minimal signal conditions that allow for a wider class of suitable settings. Then we devise the decorrelated score function in the context of large-scale FLR for testing a general null hypothesis on any subset of $\{\beta_k: k \ge 1\}$. Unlike testing a null hypothesis on a single parameter in high-dimensional linear regression, the limiting distribution for such a general null hypothesis is intractable. Finally, we adopt the multiplier bootstrap method to approximate the limiting distribution of the score test statistic under the null hypothesis, and provide theoretical guarantees for all possible sizes in a uniform manner.
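In stripped-down form, the multiplier bootstrap idea can be sketched as follows: given mean-zero score vectors, the null distribution of a maximum-type statistic is approximated by reweighting the summands with independent standard normal multipliers. This is a generic illustration with placeholder inputs, not the exact test statistic constructed later:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 40
U = rng.standard_normal((n, p))          # stand-in for mean-zero score vectors

# Observed max-type statistic T = max_j |n^{-1/2} sum_i U_ij|
T_obs = np.max(np.abs(U.sum(axis=0)) / np.sqrt(n))

# Multiplier bootstrap: reweight rows by i.i.d. N(0, 1) multipliers e_i;
# each resample mimics the fluctuation of the original sum around its mean
B = 1000
T_boot = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)
    T_boot[b] = np.max(np.abs(e @ U) / np.sqrt(n))

crit = np.quantile(T_boot, 0.95)         # bootstrap 95% critical value
print(round(crit, 2))                    # compare T_obs against crit
```

The appeal of this scheme is that the same resamples approximate the joint law of all coordinates at once, which is what allows the general (arbitrary-subset) null hypotheses here to be handled uniformly.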
The rest of the article is organized as follows. In Section 2.2, we describe the general hypothesis we want to test and the construction of the proposed score test, based on a penalized estimator and a well-defined decorrelated score function. Section 2.3 is devoted to the theoretical properties of the penalized estimator and the score test defined in Section 2.2 under some regularity conditions. Section 2.4 presents simulation results that support the theoretical properties. The algorithm to obtain the penalized estimator, the technical conditions on penalty functions, the notation used throughout the paper, and the proofs of lemmas and theorems are deferred to Section 2.5.
2.2 Proposed Methodology
2.2.1 Regularized estimation by group penalized least squares
Recall the large-scale FLR defined in (2.2): the underlying predictor processes $X_j$ and the corresponding regression functions $\beta_j$ are expressed through a set of complete and orthonormal basis functions $\{b_k: k \ge 1\}$, leading to the infinite-dimensional representation (2.3). Since the squared loss function $\sum_{i=1}^n(Y_i - \sum_{j=1}^{p_n}\sum_{k=1}^{\infty}\theta_{ijk}\eta_{jk})^2$ cannot be directly minimized with respect to $\eta_{jk}$ due to infinite-dimensionality, we control the complexity of each $\beta_j$ as a whole function and adopt the commonly used truncation for the purpose of regularization, rather than viewing its basis terms as separate predictors. Here the truncation parameter is allowed to grow with $n$, and plays the role of a smoothing parameter balancing the trade-off between approximation bias and sampling variability.
REMARK. Ideally one could use a different truncation size for each $\beta_j$; however, selecting different truncation sizes for a large number of functional predictors is computationally infeasible. Therefore, in practice, we may adopt the strategy suggested by Kong et al. [2016]: use a common $s_n$ to perform regularized estimation, then amend a refit step using ordinary least squares for the retained variables and choose different truncations via $K$-fold cross-validation, say $K = 5$. Nonetheless, the case of a common $s_n$ suffices for methodological development and theoretical analysis.
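The refit-and-cross-validate step of this strategy can be sketched as follows, for a single retained predictor; `cv_truncation` and its inputs are hypothetical names for illustration, with an OLS refit on the predictor's projection scores:

```python
import numpy as np

def cv_truncation(Theta_full, Y, sizes, K=5, seed=0):
    """Pick a truncation size by K-fold CV on an OLS refit.

    Theta_full: n x s_max matrix of projection scores for one retained
    predictor (hypothetical input); sizes: candidate truncations to try.
    """
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % K
    errs = []
    for s in sizes:
        sse = 0.0
        for f in range(K):
            tr, te = folds != f, folds == f
            coef = np.linalg.lstsq(Theta_full[tr, :s], Y[tr], rcond=None)[0]
            sse += np.sum((Y[te] - Theta_full[te, :s] @ coef) ** 2)
        errs.append(sse)
    return sizes[int(np.argmin(errs))]

# Toy check: only the first two scores matter, so small truncations should win
rng = np.random.default_rng(4)
Theta = rng.standard_normal((300, 8))
Y = Theta[:, 0] - 0.5 * Theta[:, 1] + 0.1 * rng.standard_normal(300)
print(cv_truncation(Theta, Y, sizes=[1, 2, 4, 8]))
```
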
Besides the truncation, it is essential to impose a suitable penalty on each regression function as a whole, utilizing a functional version of group regularization [Yuan and Lin, 2006]. To regularize all predictors on a comparable scale, one often standardizes scalar predictors in linear regression [Fan and Li, 2001]. For the functional predictors $X_j$, we choose to account for the variability of the grouped projection coefficients $\theta_{ijk}$ through the $n \times s_n$ design matrix $\Theta_j = (\theta_{ijk})_{1 \le i \le n; 1 \le k \le s_n}$. Hence the penalty imposed on $n^{-1/2}\|\Theta_j\eta_j\|_2$ invokes a group penalty that shrinks the unimportant regression functions to zero, where $\|\cdot\|_2$ is the Euclidean or $\ell_2$ norm (of an infinite sequence, where applicable). For technical convenience, we scale up the penalty parameter $\lambda_n$ by $s_n^{1/2}$, which does not affect the relative weighting of penalties given the common group size $s_n$.
Therefore, denoting η = (η_1', …, η_{p_n}')' with vectors η_j = (η_{j1}, …, η_{js_n})', and ||·||_1 the ℓ_1 norm,
our objective is to minimize the penalized squared loss

min_{η: ||η||_1 ≤ R_n} Q_n(η) = L_n(η) + P_{λ_n}(η),  where
L_n(η) = (2n)^{−1} ∑_{i=1}^n (Y_i − ∑_{j=1}^{p_n} ∑_{k=1}^{s_n} θ_{ijk}η_{jk})^2,
P_{λ_n}(η) = ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j η_j||_2),  (2.4)
where ρ_λ(·) with tuning parameter λ belongs to a general class of nonconvex penalty functions
satisfying conditions (P1)–(P5) in Section 2.5, which includes the most commonly used penalties,
such as the LASSO, SCAD and MCP [Loh and Wainwright, 2015]. The positive
constraint radius R_n should be chosen carefully so that the true value η* is a feasible point, i.e.,
||η*||_1 ≤ R_n. Upon solving the optimization problem (2.4), which is guaranteed to have a
global minimum by the Weierstrass extreme value theorem when ρ_λ(·) is continuous, the regularized
estimator of each β_j(t) can be expressed as β̂_j(t) = ∑_{k=1}^{s_n} η̂_{jk} b_k(t), where η̂ is obtained by
solving (2.4). The implementation by a coordinate descent algorithm, similar to Ravikumar
et al. [2008] with slight modification, is presented in Section 2.5, while the tuning parameters
λ_n and s_n are chosen by K-fold cross-validation, say K = 5 or 10. It is worth mentioning
that, for the purpose of general hypothesis testing, it suffices to obtain consistent estimation
of η from (2.4) in both the ℓ_1 and ℓ_2 sense stated in Theorem 3; selection consistency and
the oracle property are not necessary.
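To fix ideas, the objective (2.4) can be evaluated numerically. The sketch below uses the group LASSO penalty ρ_λ(t) = λt as a default; all function and variable names are illustrative rather than part of the proposed method:

```python
import numpy as np

def penalized_loss(Y, Thetas, etas, lam, rho=lambda lam, t: lam * t):
    """Q_n(eta) from (2.4): squared loss L_n plus the group penalty P_lambda.

    Y      : (n,) response vector
    Thetas : list of p_n design matrices Theta_j, each (n, s_n)
    etas   : list of p_n coefficient vectors eta_j, each (s_n,)
    lam    : base tuning parameter lambda_n (scaled by s_n^{1/2} internally)
    rho    : penalty function rho_lambda(t); default is the group LASSO
    """
    n = len(Y)
    s_n = Thetas[0].shape[1]
    fit = sum(Th @ eta for Th, eta in zip(Thetas, etas))
    L_n = np.sum((Y - fit) ** 2) / (2 * n)
    # penalty applied to the empirical group scale n^{-1/2}||Theta_j eta_j||_2
    P = sum(rho(lam * np.sqrt(s_n), np.linalg.norm(Th @ eta) / np.sqrt(n))
            for Th, eta in zip(Thetas, etas))
    return L_n + P

rng = np.random.default_rng(0)
n, p, s = 50, 3, 4
Thetas = [rng.normal(size=(n, s)) for _ in range(p)]
etas = [rng.normal(size=s) for _ in range(p)]
Y = sum(Th @ e for Th, e in zip(Thetas, etas)) + rng.normal(size=n)
print(penalized_loss(Y, Thetas, etas, lam=0.1))
```

In practice the minimization is carried out by the coordinate descent algorithm of Section 2.5.1, not by evaluating Q_n on a grid; this sketch only makes the two components of the objective concrete.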
2.2.2 General hypothesis and score decorrelation
Our goal is to test a class of hypotheses of full generality in the framework of large-scale
FLR. Denote by P_n = {1, …, p_n} the index set of all functional predictors, let H_n ⊆ P_n be
an arbitrary nonempty subset of P_n with cardinality |H_n| = h_n ≤ p_n, and denote the complement
of H_n by H_n^c = P_n \ H_n. Then the general hypothesis can be expressed as

H_0: ||β_j||_{L^2} = 0 for all j ∈ H_n  vs.  H_a: ||β_j||_{L^2} > 0 for at least one j ∈ H_n,  (2.5)

which is asymptotically equivalent to the hypothesis

H_0: η_j = 0 for all j ∈ H_n  vs.  H_a: η_j ≠ 0 for at least one j ∈ H_n,

with each η_j = (η_{j1}, …, η_{js_n})', as s_n → ∞ given n → ∞. It is worth noting that the
cardinality h_n can be as large as the cardinality p_n of the full predictor set, allowing for any
hypothesis of interest on {β_j : j = 1, …, p_n}.
For testing the general null hypothesis in (2.5), the building block originates from the
combination of consistently estimated regression functions and a new type of score function.
As illustrated in Ning and Liu [2016], the motivation for considering a decorrelated score
function is that the high dimensionality of the nuisance parameter space indexed by H_n^c = P_n \ H_n makes
the limiting distribution of the estimated nuisance parameter, constrained by the null hypothesis,
intractable [Fu and Knight, 2000]. Hence the key is to decorrelate the score function of the
primary parameter in H_n from that of the nuisance parameter in H_n^c, in order to control the
variability induced by the high-dimensional nuisance parameter. As a result, the decorrelation
operation is a natural extension of the profile score to the high-dimensional case, and leads to a test
that is asymptotically equivalent to the classical Rao score test in the low-dimensional case [Cox and
Hinkley, 1979, Ning and Liu, 2016].
We first introduce some notation for adopting the score-function decorrelation in the pro-
posed large-scale FLR. Recall that ω_{jk} is the variance of the i.i.d. projection coefficients
{θ_{ijk} = ∫_T X_i(t) b_k(t) dt : i = 1, …, n}. Denote Λ_j = diag{ω_{j1}^{1/2}, …, ω_{js_n}^{1/2}} for j ≤ p_n,
the block diagonal matrix Λ_{H_n} = diag{Λ_j : j ∈ H_n}, and similarly Λ_{P_n} ≡ Λ. Let
Θ = (G_1', …, G_n')' = (Θ_{H_n}, Θ_{H_n^c}), Θ_{H_n} = (E_1', …, E_n')', Θ_{H_n^c} = (F_1', …, F_n')', where G_i, E_i
and F_i are the vectors containing the coefficients θ_{ijk} from the corresponding functional predictors for
the ith subject, respectively; Θ_{H_n} is formed by concatenating {Θ_j : j ∈ H_n} in a row, and simi-
larly for Θ_{H_n^c}, where Θ_j is the n × s_n design matrix containing θ_{ijk} as its (i, k)th entry. Also denote
η = (η_{H_n}', η_{H_n^c}')' and Y = (Y_1, …, Y_n)', where η_{H_n} stacks {η_j : j ∈ H_n} in a column, and
similarly for η_{H_n^c}. Here we view the least squares ℓ(η) = ℓ(η_{H_n}, η_{H_n^c}) = n^{−1}(Y − Θη)'(Y − Θη)
as the likelihood function of η without introducing extra notation. Further we define

w = I_{H_n^c H_n^c}^{−1} I_{H_n^c H_n},  where I_{H_n^c H_n} = E(F_i E_i'), I_{H_n^c H_n^c} = E(F_i F_i').
For the purpose of decorrelation, a new score function with respect to the primary parameter
η_{H_n}, denoted by S(η), is defined in the context of our large-scale FLR as follows,

S(η) = S(η_{H_n}, η_{H_n^c}) = n^{−1} Λ_{H_n}^{−1} (w'Θ_{H_n^c}' − Θ_{H_n}')(Y − Θ_{H_n} η_{H_n} − Θ_{H_n^c} η_{H_n^c})
     = n^{−1} ∑_{i=1}^n Λ_{H_n}^{−1} (w'F_i − E_i)(Y_i − E_i' η_{H_n} − F_i' η_{H_n^c}).  (2.6)

It is easy to verify that this new score function with respect to the primary parameter η_{H_n} is
uncorrelated with the traditional score function with respect to the nuisance parameter η_{H_n^c},
i.e., E{S(η) ∇_{η_{H_n^c}} ℓ(η)'} = 0, as in Ning and Liu [2016], where ∇_γ denotes the gradient
taken with respect to γ.
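The second line of (2.6) translates directly into a finite-sample computation once w and Λ_{H_n} are replaced by estimates. A minimal sketch, with all names illustrative and the inputs given as stacked coefficient matrices:

```python
import numpy as np

def decorrelated_score(Y, E, F, eta_H, eta_Hc, w, Lambda_H_diag):
    """Decorrelated score S(eta) from (2.6), averaged over subjects.

    E : (n, h*s) coefficients theta_{ijk} for the tested predictors (rows E_i)
    F : (n, (p-h)*s) coefficients for the nuisance predictors (rows F_i)
    w : ((p-h)*s, h*s) decorrelation matrix
    Lambda_H_diag : (h*s,) diagonal of Lambda_{H_n}
    """
    resid = Y - E @ eta_H - F @ eta_Hc           # Y_i - E_i' eta_H - F_i' eta_Hc
    scores = (F @ w - E) * resid[:, None]        # rows: (w'F_i - E_i) * resid_i
    return scores.mean(axis=0) / Lambda_H_diag   # Lambda_{H_n}^{-1} applied entrywise
```

By construction, the score vanishes when the residuals are exactly zero, which gives a quick sanity check of an implementation.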
2.2.3 Bootstrapped score test for general hypothesis in large-scale FLR
Given the consistent estimation of the regression functions and the decorrelated score function, we
are ready to construct the proposed score test for the general hypothesis (2.5) in large-scale
FLR. Note that the decorrelated score function S(η) defined in (2.6) cannot be calculated
directly from the observed data, due to the unknown quantities w = I_{H_n^c H_n^c}^{−1} I_{H_n^c H_n} and Λ_{H_n}. It is
straightforward to estimate Λ_{H_n} by plugging in ω̂_{jk} = n^{−1} ∑_{i=1}^n θ_{ijk}^2, yielding Λ̂_{H_n}, and similarly
Λ̂ and Λ̂_{H_n^c}. To estimate w, a natural choice is the moment estimator ŵ = Î_{H_n^c H_n^c}^{−1} Î_{H_n^c H_n},
where Î_{H_n^c H_n^c} = n^{−1} Θ_{H_n^c}' Θ_{H_n^c} estimates I_{H_n^c H_n^c} = E(F_i F_i') and Î_{H_n^c H_n} = n^{−1} Θ_{H_n^c}' Θ_{H_n} estimates
I_{H_n^c H_n} = E(F_i E_i'). However, this estimator may not exist, since the matrix Î_{H_n^c H_n^c} can be singular in
high-dimensional settings. We follow the suggestion of Ning and Liu [2016] and adopt the
Dantzig selector [Candes and Tao, 2007] to estimate the (p_n − h_n)s_n × h_n s_n unknown matrix
w column by column; alternative procedures could also be used but are not pursued here for brevity.
Specifically, for each l = 1, …, h_n s_n, we solve

ŵ_l ∈ argmin_{w_l} ||w_l||_1  subject to  ||n^{−1} ∑_{i=1}^n E_{il} F_i' − w_l' n^{−1} ∑_{i=1}^n F_i F_i'||_∞ ≤ τ_n,  (2.7)
where τ_n is a common tuning parameter chosen by K-fold cross-validation in practice, giving
the resulting estimator ŵ. The estimated decorrelated score function then follows as

Ŝ(η) = Ŝ(η_{H_n}, η_{H_n^c}) = n^{−1} Λ̂_{H_n}^{−1} (ŵ'Θ_{H_n^c}' − Θ_{H_n}')(Y − Θ_{H_n} η_{H_n} − Θ_{H_n^c} η_{H_n^c})
     = n^{−1} ∑_{i=1}^n Λ̂_{H_n}^{−1} (ŵ'F_i − E_i)(Y_i − E_i' η_{H_n} − F_i' η_{H_n^c}).  (2.8)
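Each column problem in (2.7) is a linear program and can be solved with an off-the-shelf LP solver. The following is one possible implementation sketch using `scipy.optimize.linprog`, not necessarily the solver used in this thesis; the split w = u − v with u, v ≥ 0 is the standard ℓ_1 reformulation:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_column(A, b, tau):
    """Solve min ||w||_1 subject to ||b - A w||_inf <= tau by linear programming.

    For (2.7), A = n^{-1} sum_i F_i F_i' (symmetric) and b = n^{-1} sum_i E_{il} F_i,
    so ||b - A w||_inf is exactly the constraint for one column w_l.
    """
    d = A.shape[0]
    # w = u - v with u, v >= 0; minimize sum(u) + sum(v) = ||w||_1
    c = np.ones(2 * d)
    A_ub = np.vstack([np.hstack([A, -A]),      #  A(u - v) <= b + tau
                      np.hstack([-A, A])])     # -A(u - v) <= tau - b
    b_ub = np.concatenate([b + tau, tau - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v
```

For instance, with A the identity the solution reduces to coordinatewise soft-thresholding of b at level τ, which makes the tuning role of τ_n transparent.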
Then we plug in the estimator η̂ obtained by minimizing (2.4) to construct the decorrelated
score test statistic under the null hypothesis η_{H_n} = 0, leading to

T = n^{1/2} Ŝ(0, η̂_{H_n^c}) = n^{−1/2} ∑_{i=1}^n Ŝ_i,  where Ŝ_i = Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i' η̂_{H_n^c}).  (2.9)
Note that the null hypothesis (2.5) is of full generality, with dimension h_n s_n, where s_n
grows with n (often at a fractional polynomial order) to approximate the infinite-dimensional
functional spaces, and h_n can be as large as p_n. Unlike testing a finite-dimensional null hypoth-
esis, the limiting distribution of T is intricate and likely intractable, even in the case h_n = 1.
Hence we use the infinity norm ||T||_∞ = max{|T_l| : l = 1, …, h_n s_n}
to test the null hypothesis (2.5), and adopt a computationally efficient and theoretically
guaranteed bootstrap method to approximate the distribution of ||T||_∞. Since a stan-
dard bootstrap is computationally intensive, requiring repeated estimation of η and w, we consider the
multiplier bootstrap method studied by Chernozhukov et al. [2014]. Specifically,
denote T_e = n^{−1/2} ∑_{i=1}^n e_i Ŝ_i, where {e_1, …, e_n} is a set of i.i.d. standard normal random
variables independent of the data, and define c_B(α) = inf{t ∈ R : P_e(||T_e||_∞ ≥ t) ≤ α}
as the 100(1 − α)th percentile of ||T_e||_∞, where P_e(·) denotes probability with respect to
{e_1, …, e_n}. Based on this critical value, we reject the null hypothesis at significance level
α provided that ||T||_∞ ≥ c_B(α). In Section 2.3, we show that, under the null hypothesis and
some mild conditions, the Kolmogorov distance between the distributions of ||T||_∞ and ||T_e||_∞
converges to zero as the sample size grows, which provides a theoretical guarantee for the proposed
score test.
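The multiplier bootstrap approximation of c_B(α) is cheap because only the multipliers e_i are redrawn, never η̂ or ŵ. A minimal sketch, assuming the estimated scores Ŝ_i of (2.9) are available as the rows of a matrix (all names illustrative):

```python
import numpy as np

def multiplier_bootstrap_critical_value(S_hat, alpha=0.05, B=1000, seed=0):
    """Approximate c_B(alpha), the 100(1-alpha)th percentile of ||T_e||_inf.

    S_hat : (n, d) matrix whose rows are the estimated scores S_i from (2.9).
    """
    n, d = S_hat.shape
    rng = np.random.default_rng(seed)
    stats = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(n)           # multipliers e_1, ..., e_n
        T_e = (e @ S_hat) / np.sqrt(n)       # n^{-1/2} sum_i e_i S_i
        stats[b] = np.max(np.abs(T_e))       # ||T_e||_inf
    return np.quantile(stats, 1 - alpha)

def score_test(S_hat, alpha=0.05, B=1000):
    """Reject H_0 iff ||T||_inf >= c_B(alpha)."""
    T = S_hat.sum(axis=0) / np.sqrt(S_hat.shape[0])
    return np.max(np.abs(T)) >= multiplier_bootstrap_critical_value(S_hat, alpha, B)
```

The design choice here mirrors the text: the expensive quantities (η̂, ŵ, Λ̂) are computed once, and the B bootstrap replicates cost only matrix-vector products.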
REMARK. Based on the decorrelated score function (2.6), one can in principle construct the
counterparts of other classical tests, such as the Wald and likelihood ratio tests, for a variety of
high-dimensional models, for instance the Cox proportional hazards model [Fang et al., 2014].
We expect that these asymptotically equivalent tests also carry over to the large-scale FLR
model, which deserves further investigation.
2.3 Asymptotic Properties
Before introducing the main theorems, we impose some mild conditions on the covariance
structure of the functional predictors and on the orders of the various tuning parameters. First, recall that
ω_{jk} = E(θ_{ijk}^2) > 0. Accordingly, we assume that the expected norms of the functional
predictors are uniformly bounded:

(A1) sup_{j ≤ p_n} ∑_{k=1}^∞ ω_{jk} < ∞.
Notice that (A1) implies that λ_max(Λ) = O(1). Before discussing the next condition, which
describes the distributions of several sub-gaussian random variables, we intro-
duce some useful notation. For any sub-gaussian random variable X, define
||X||_{φ_1} = sup_{q≥1} q^{−1/2}(E|X|^q)^{1/q} as the sub-gaussian norm, and for any sub-exponential ran-
dom variable X, define ||X||_{φ_2} = sup_{q≥1} q^{−1}(E|X|^q)^{1/q} as the sub-exponential norm, similar
to those adopted in Ning and Liu [2016].

(A2) Assume that ε_i, ω_{jk}^{−1/2}θ_{ijk}, and (w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2} are centred sub-gaussian random
variables satisfying ||ε_i||_{φ_1} ≤ c, ||ω_{jk}^{−1/2}θ_{ijk}||_{φ_1} ≤ c, ||(w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2}||_{φ_1} ≤ c
for some positive constant c, uniformly in i = 1, …, n, j = 1, …, p_n, k = 1, …, ∞,
l = 1, …, h_n s_n.

Notice that (A1) together with (A2) implies that θ_{ijk} and w_l'F_i − E_{il} are also centred sub-gaussian
random variables satisfying ||θ_{ijk}||_{φ_1} ≤ c_1 and ||w_l'F_i − E_{il}||_{φ_1} ≤ c_1 for some positive constant
c_1, uniformly in 1 ≤ i ≤ n, 1 ≤ j ≤ p_n, 1 ≤ l ≤ h_n s_n and k ≥ 1. Before introducing
the subsequent conditions, we denote the information matrix and the partial information matrix
by I = E(G_iG_i') and I_{H_n|H_n^c} = I_{H_nH_n} − w'I_{H_n^cH_n}, respectively. Similarly, the standardized
versions of the information matrix and the partial information matrix are denoted by Ī = Λ^{−1}IΛ^{−1}
and Ī_{H_n|H_n^c} = Λ_{H_n}^{−1} I_{H_n|H_n^c} Λ_{H_n}^{−1}, respectively. Next, we assume that

(A3) λ_min(Λ) ≥ c s_n^{−a/2} for some constants c > 0 and a > 1.

(A4) 0 < m_0 ≤ λ_min(Ī) ≤ λ_max(Ī) ≤ m_1 < ∞ for some constants m_1 > m_0 > 0, satisfying
m_0 > 2^{−1} m_1 μ.

Note that μ > 0 is a constant such that ρ_{λ,μ}(t) is convex in t; see the Appendix.
Combining (A3) and (A4), it is easy to see that λ_min(I) = λ_min(ΛĪΛ) ≥ c s_n^{−a}, for some c > 0.
In addition, we need to control the correlation between the functional part being tested and the
rest of the functional predictors, so we assume

(A5) 0 < c_1 ≤ λ_min(Ī_{H_n|H_n^c}) ≤ λ_max(Ī_{H_n|H_n^c}) ≤ c_2 < ∞ for some constants c_2 > c_1 > 0.
Next, the overall number of functional predictors is permitted to grow exponentially in the sample
size. For any two sequences a_n and b_n, write a_n ∼ b_n if and only if c_1 ≤ lim_{n→∞} |a_n/b_n| ≤ c_2 for
some positive constants c_1, c_2. Then we assume

(A6) p_n ∼ exp(n^β) for some β ∈ (0, 9^{−1}).

Regarding the first q_n nonzero regression functions, we assume that they belong to a Sobolev
ball:

(A7) sup_{j ≤ q_n} ∑_{k=1}^∞ η_{jk}^2 k^{2δ} < c for some positive constants δ and c.
Next, in (A8), we quantify the relationship among the four parameters q_n, ρ_n, s_n, R_n and the
sample size n. Here q_n is the number of significant predictors; ρ_n = sup_{l ≤ h_n s_n} ||w_l||_0, where
the vector norm ||·||_0 denotes the number of nonzero entries of a vector; R_n is a general parameter
such that ||η*||_1 ≤ R_n, where η* represents the true value of η; and s_n is the truncation size for
each functional predictor. We assume

(A8) max{ n^{3β/2} ρ_n q_n s_n^{3a/2−δ} log s_n, n^{β} q_n^2 s_n^{a+1−δ}, n^{2β} q_n^2 s_n^{a/2+1−δ}, n^{5β/2−1/2} ρ_n q_n^2 s_n^{2a+1−δ},
n^{2β−1/2} (log n)^{1/2} ρ_n s_n^{3a/2}, n^{5β/2−1/2} R_n q_n s_n^{a/2+1}, n^{3β−1} R_n ρ_n q_n s_n^{2a+1},
n^{3β/2−1/2} R_n q_n s_n^{a+1}, n^{β+1/2} q_n s_n^{−δ} log s_n } = o(1).
Note that (A8) entails δ > a + 1 > 2, and (A8) also implies that both ρ_n and q_n are small relative to
the sample size, reflecting the sparseness of the model. Next, for a concrete example, if s_n, R_n, ρ_n
and q_n are of the same order, (A8) can be rewritten as

max{ n^{3β/2} s_n^{3a/2+2−δ} log s_n, n^{β} s_n^{a+3−δ}, n^{2β} s_n^{a/2+3−δ}, n^{5β/2−1/2} s_n^{2a+4−δ}, n^{2β−1/2} s_n^{3a/2+1} (log n)^{1/2},
n^{5β/2−1/2} s_n^{a/2+3}, n^{3β−1} s_n^{2a+4}, n^{3β/2−1/2} s_n^{a+3}, n^{β+1/2} s_n^{1−δ} log s_n } = o(1).  (2.10)
It is easy to verify that there exists s_n satisfying (2.10) if and only if

min{ (2δ − 3a − 4)/(3β), (δ − a − 3)/β, (2δ − 2)/(2β + 1) }
  > max{ (4a + 8 − 2δ)/(1 − 5β), (3a + 2)/(1 − 4β), (a + 6)/(1 − 5β), (2a + 6)/(1 − 3β) },  (2.11)

and it is easy to show that, for instance, the setting a = 2, δ = 7, β < 28^{−1} satisfies (2.11).
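The claim that a = 2, δ = 7, β < 28^{−1} satisfies (2.11) can be verified numerically; the following sketch checks the inequality with exact rational arithmetic:

```python
from fractions import Fraction

def condition_2_11_holds(a, delta, beta):
    """Check inequality (2.11) exactly, using rational arithmetic."""
    a, delta, beta = Fraction(a), Fraction(delta), Fraction(beta)
    lhs = min((2*delta - 3*a - 4) / (3*beta),
              (delta - a - 3) / beta,
              (2*delta - 2) / (2*beta + 1))
    rhs = max((4*a + 8 - 2*delta) / (1 - 5*beta),
              (3*a + 2) / (1 - 4*beta),
              (a + 6) / (1 - 5*beta),
              (2*a + 6) / (1 - 3*beta))
    return lhs > rhs

# With a = 2, delta = 7 the binding terms are 12/(2*beta + 1) and 10/(1 - 3*beta),
# which cross exactly at beta = 1/28.
print(condition_2_11_holds(2, 7, Fraction(1, 30)))   # True
print(condition_2_11_holds(2, 7, Fraction(1, 28)))   # False (boundary case)
```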
Next, in (A9), we impose some conditions on the tuning parameter λ_n for the regularizer in (2.4)
to ensure estimation consistency.

(A9) n^{β} q_n s_n^{a+1} = o(λ_n^{−1}), n^{2β} q_n s_n^{a/2+1} = o(λ_n^{−1}), n^{5β/2−1/2} ρ_n q_n s_n^{2a+1} = o(λ_n^{−1}),
n^{β/2−1/2} R_n = o(λ_n), q_n s_n^{−δ} = o(λ_n).

Note that (A8) is sufficient for the existence of a λ_n satisfying (A9). Lastly, in (A10),
we assume the tuning parameter τ_n for the Dantzig selector satisfies

(A10) τ_n ∼ [log(p_n s_n)/n]^{1/2}.
Combining (A6) and (A8) with (A10), we have τ_n ∼ n^{β/2−1/2}. In the following, we establish
the main theorems; several auxiliary lemmas and the proofs of those lemmas and the main
theorems are deferred to the Appendix. Theorem 3 establishes the estimation consistency of the
estimator obtained from (2.4) under some mild conditions, where the conditions (P1)–(P5) are
deferred to the Appendix.
Theorem 3 Under conditions (A1)–(A4), (A6)–(A9) and (P1)–(P5), every local minimizer η̂
of Q_n(η) obtained from (2.4) satisfies

1) ||η̂ − η*||_2 ≤ c_0 λ_n s_n^{a/2+1/2} q_n^{1/2}, with probability tending to one, for some constant c_0 > 0;

2) ||η̂ − η*||_1 ≤ c_1 λ_n s_n^{a/2+1} q_n, with probability tending to one, for some constant c_1 > 0.

In particular, regarding the consistency of the estimated regression curves β̂_j(t) = ∑_{k=1}^{s_n} η̂_{jk} b_k(t), it is straightforward to see that

sup_{j ≤ p_n} ||β̂_j − β_j||_{L^2} ≤ sup_{j ≤ p_n} ||η̂_j − η_j*||_2 + s_n^{−δ} sup_{j ≤ q_n} ( ∑_{k=s_n+1}^∞ η_{jk}^2 k^{2δ} )^{1/2}
  = O(λ_n s_n^{a/2+1/2} q_n^{1/2} + s_n^{−δ}).
Next, we present Theorem 4, which lays the foundation for hypothesis testing through the
multiplier bootstrap method.

Theorem 4 Under conditions (A1)–(A10) and (P1)–(P5), using the local minimizer η̂ from
Theorem 3, then under H_0: ||β_j||_{L^2} = 0 for all j ∈ H_n, the Kolmogorov distance between the
distributions of ||T||_∞ and ||T_e||_∞ satisfies

lim_{n→∞} sup_{t ≥ 0} | P(||T||_∞ ≤ t) − P_e(||T_e||_∞ ≤ t) | = 0,

and consequently,

lim_{n→∞} sup_{α ∈ (0,1)} | P(||T||_∞ ≥ c_B(α)) − α | = 0.
2.4 Simulation Studies

The simulated data {y_i : i = 1, …, n} are generated from the model

y_i = ∑_{j=1}^{p_n} ∫_0^1 β_j(t) x_{ij}(t) dt + ε_i = ∑_{j=1}^{p_n} ∑_{k=1}^{50} η_{jk} θ_{ijk} + ε_i,

with p_n functional predictors, where the errors ε_1, …, ε_n are independent and identically distributed
as N(0, σ^2), and the functional predictors have mean zero and covariance function derived
from the Fourier basis φ_1 = 1, φ_{2ℓ} = 2^{1/2} cos{ℓπ(2t − 1)}, ℓ = 1, …, 25, and φ_{2ℓ−1} =
2^{1/2} sin{(ℓ − 1)π(2t − 1)}, ℓ = 2, …, 25, with t ∈ T = [0, 1]. The underlying regression
functions are β_j(t) = ∑_{k=1}^{50} η_{jk} φ_k(t) for j ≤ q_n, where η_{jk} = c_j(1.2 − 0.2k) for k ≤ 4 and
η_{jk} = 0.4 c_j (k − 3)^{−8} for 5 ≤ k ≤ 50, with constants {c_j : j ≤ q_n} that are chosen
differently across settings, and the remaining β_j(t) = 0. Next, we describe how to generate the p_n
functional predictors x_{ij}(t). For j = 1, …, p_n, define V_{ij}(t) = ∑_{k=1}^{50} θ̃_{ijk} φ_k(t), where the θ̃_{ijk}
are independently distributed as N(0, k^{−2}) across i, j and k. The p_n functional predictors
are then defined through the AR(ρ)-structured linear transformations

x_{ij}(t) = ∑_{j'=1}^{p_n} ρ^{|j−j'|} V_{ij'}(t) = ∑_{k=1}^{50} ∑_{j'=1}^{p_n} ρ^{|j−j'|} θ̃_{ij'k} φ_k(t) = ∑_{k=1}^{50} θ_{ijk} φ_k(t),

with θ_{ijk} = ∑_{j'=1}^{p_n} ρ^{|j−j'|} θ̃_{ij'k}, where the constant ρ ∈ (0, 1) controls the correlation among the
functional predictors; we set ρ = 0.3 in our simulation study. For the actual observations,
we assume they are realizations of x_{ij}(·), j = 1, …, p_n, at 100 equally spaced time points
{t_{ijl} : l = 1, …, 100} ⊆ T. To be more realistic, we fit each observed functional datum with a
known orthonormal cubic spline basis containing m = 30 basis functions, rather than the Fourier
basis, to obtain the corresponding estimated coefficients. After obtaining the estimated x_{ij}(t),
we apply the algorithm in Section 2.5.1 to obtain the estimated β_j(t) under the SCAD
penalty. Then, we construct the test statistic and its associated α = 5% empirical
quantile through the multiplier (wild) bootstrap, using N = 1000 bootstrap samples, to test several
null hypotheses against their alternatives. We summarize and compare the finite-sample performance
for several null hypotheses, under various settings of the regression curves specified by {c_j : j ≤ q_n},
based on the rejection proportion over 300 Monte Carlo runs in the table below.
Table 2.1: Results for several combinations of null hypotheses and settings of β_j(t), measured over 300 simulations (n = 100, p_n = 200, q_n = 3, σ^2 = 1)

Settings of β_j(t)               Null hypothesis H_0                        Rejection proportion
c_1 = 0,  c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .0467
c_1 = .1, c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .1067
c_1 = .2, c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .2400
c_1 = .6, c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .9133
c_1 = 1,  c_2 = 1, c_3 = 1       H_0: β_1(t) = 0                            .9967
                                 H_0: β_1(t) = · · · = β_5(t) = 0           1
                                 H_0: β_1(t) = · · · = β_20(t) = 0          1
                                 H_0: β_5(t) = · · · = β_20(t) = 0          .0533
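The data-generating mechanism of this section is fully determined by the coefficients θ_{ijk}; a minimal sketch of the generation (the Fourier-basis evaluation is omitted, and all names are illustrative):

```python
import numpy as np

def simulate_coefficients(n, p, K=50, rho=0.3, seed=0):
    """theta_{ijk} = sum_{j'} rho^{|j-j'|} theta~_{ij'k}, with independent
    innovations theta~_{ijk} ~ N(0, k^{-2}), as in the AR(rho) construction."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, K + 1.0)
    tilde = rng.normal(size=(n, p, K)) / k          # sd k^{-1}, variance k^{-2}
    mix = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    # theta[i, j, k] = sum_{m} mix[j, m] * tilde[i, m, k]
    return np.einsum('jm,imk->ijk', mix, tilde)

def eta_setting(c, q, p, K=50):
    """eta_{jk} = c_j (1.2 - 0.2k) for k <= 4, 0.4 c_j (k-3)^{-8} for 5 <= k <= K;
    rows q+1, ..., p stay zero (null predictors)."""
    eta = np.zeros((p, K))
    k = np.arange(1, K + 1.0)
    profile = np.concatenate([1.2 - 0.2 * k[:4], 0.4 * (k[4:] - 3.0) ** -8])
    eta[:q] = np.asarray(c, dtype=float)[:, None] * profile
    return eta

def simulate_response(theta, eta, sigma=1.0, seed=1):
    """y_i = sum_j sum_k eta_{jk} theta_{ijk} + eps_i."""
    rng = np.random.default_rng(seed)
    return np.einsum('ijk,jk->i', theta, eta) + sigma * rng.normal(size=theta.shape[0])
```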
From Table 2.1, we find that the rejection proportions for the first four null hypotheses,
where c_2 and c_3 are held at zero, are in ascending order as the signal of the first predictor
increases, consistent with the shape of a power curve. In addition, we also find that
the rejection proportions for the first and the last null hypotheses are close to the pre-specified
significance level α = 5%, which is to be expected, since these null hypotheses are true under
the underlying model and the rejection proportion then represents the average Type I error of
the multiplier decorrelated score test. In particular, when the minimal signal of the significant
functional predictors is strong enough to be detected (i.e., c_1 = c_2 = c_3 = 1), we observe
that the rejection proportions for the associated null hypotheses are close to one, as
expected. In conclusion, the multiplier decorrelated score test performs
well in various settings, which supports the validity of the proposed score test.
2.5 Proof of Lemmas and Main Theorems
2.5.1 Algorithm and parameter tuning
First, without loss of generality, we assume that the data are centred, so that

n^{−1} ∑_{i=1}^n Y_i = 0,  n^{−1} ∑_{i=1}^n θ_{ijk} = 0,

for any j = 1, …, p_n, k = 1, …, s_n. In addition, for each j = 1, …, p_n, we denote f̂_j = Θ_j η̂_j,
where η̂_j is an estimator of η_j, and U_j = Θ_j(Θ_j'Θ_j)^{−1}Θ_j'. The optimization of (2.4) can be
achieved by adopting the coordinate descent method similar to those used in Ravikumar et al.
[2008] and Fan et al. [2015], with the slight modification that ρ_{λ_n}(·) is replaced by ρ_{λ_n s_n^{1/2}}(·). In
the following, we give an algorithm for general penalty functions satisfying
the technical conditions (P1)–(P5) in the Appendix.
Algorithm for general ρ_λ(·)

(i) Start with the initial estimator f̂_j = 0, for each j = 1, …, p_n.

(ii) Calculate the residual R_j = Y − ∑_{k ≠ j} f̂_k, while fixing the values of {f̂_k : k ≠ j}.

(iii) Calculate P_j = U_j R_j.

(iv) Let f̂_j = max{ 1 − ρ'_{λ_n s_n^{1/2}}(n^{−1/2}||f̂_j||_2) n^{1/2}/||P_j||_2, 0 } P_j.

(v) Let f̂_j = f̂_j − n^{−1} 1_n' f̂_j 1_n, where 1_n denotes the n × 1 vector of ones.

(vi) Repeat (ii) to (v) for j = 1, …, p_n and iterate until convergence to obtain the final
estimates f̂_j, for j = 1, …, p_n.

(vii) Compute η̂_j = (Θ_j'Θ_j)^{−1}Θ_j' f̂_j using the final estimates f̂_j from step (vi) to get
the final estimates η̂_j, for j = 1, …, p_n.
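The backfitting steps (i)–(vii) can be sketched as follows for the group LASSO penalty ρ_λ(t) = λt, for which ρ'_λ ≡ λ and step (iv) reduces to a group soft-threshold. This is a simplified stand-in for the general nonconvex case, with all names illustrative:

```python
import numpy as np

def backfit_group_lasso(Y, Thetas, lam, n_iter=100, tol=1e-8):
    """Coordinate descent (i)-(vii) with rho_lambda(t) = lambda * t, so that
    rho'(.) = lambda_n * s_n^{1/2} and step (iv) is a group soft-threshold."""
    n, p = len(Y), len(Thetas)
    s = Thetas[0].shape[1]
    lam_eff = lam * np.sqrt(s)                                    # lambda_n s_n^{1/2}
    U = [Th @ np.linalg.pinv(Th.T @ Th) @ Th.T for Th in Thetas]  # projections U_j
    f = [np.zeros(n) for _ in range(p)]                           # step (i)
    for _ in range(n_iter):
        f_old = [fj.copy() for fj in f]
        for j in range(p):
            R = Y - sum(f[k] for k in range(p) if k != j)         # step (ii)
            P = U[j] @ R                                          # step (iii)
            norm_P = np.linalg.norm(P)
            shrink = max(1 - lam_eff * np.sqrt(n) / norm_P, 0) if norm_P > 0 else 0
            f[j] = shrink * P                                     # step (iv)
            f[j] -= f[j].mean()                                   # step (v), recentring
        if max(np.linalg.norm(f[j] - f_old[j]) for j in range(p)) < tol:
            break                                                 # step (vi)
    return [np.linalg.pinv(Th.T @ Th) @ Th.T @ fj                 # step (vii)
            for Th, fj in zip(Thetas, f)]
```

For a sufficiently large λ_n, every group is thresholded to zero in the first sweep, which matches the shrink-to-zero behaviour of the group penalty.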
In addition, several tuning parameters play critical roles in the penalized procedure. The
optimal values of the parameters λ_n and s_n are chosen automatically by cross-
validation.
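A generic K-fold grid search over (λ_n, s_n) can be sketched as below. For brevity, a simple ridge fit stands in for the penalized estimator, so this sketch illustrates only the tuning loop, not the actual fitting procedure; all names are illustrative:

```python
import numpy as np

def cv_select(Y, Theta_full, lams, sizes, K=5, seed=0):
    """K-fold CV over (lambda, s): truncate each predictor to its first s
    coefficients, fit a ridge stand-in with penalty lambda, and score by
    held-out squared error.

    Theta_full : (n, p, S_max) array of projection coefficients theta_{ijk}.
    """
    n = Theta_full.shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % K
    best = (None, np.inf)
    for lam in lams:
        for s in sizes:
            X = Theta_full[:, :, :s].reshape(n, -1)   # common truncation s_n
            err = 0.0
            for k in range(K):
                tr, te = folds != k, folds == k
                d = X.shape[1]
                # ridge stand-in: (X'X + lam I)^{-1} X'Y on the training fold
                beta = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d),
                                       X[tr].T @ Y[tr])
                err += np.sum((Y[te] - X[te] @ beta) ** 2)
            if err < best[1]:
                best = ((lam, s), err)
    return best[0]
```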
2.5.2 Technical Conditions on Regularizers
We impose some conditions on the (possibly nonconvex) penalty function ρ_λ, similar to those adopted in
Loh and Wainwright [2015].

(P1) ρ_λ is an even function, and ρ_λ(0) = 0.

(P2) For t ≥ 0, ρ_λ(t) is nondecreasing in t.

(P3) The function g_λ(t) = ρ_λ(t)/t is nonincreasing in t, for t > 0.

(P4) The function ρ_λ(t) is differentiable everywhere except at t = 0. At t = 0, we have
lim_{t→0+} ρ'_λ(t) = λL, for some positive constant L.

(P5) The function ρ_{λ,μ}(t) is convex in t, for some positive constant μ, where ρ_{λ,μ}(t) = ρ_λ(t) +
2^{−1}μt^2.

From Loh and Wainwright [2015], most commonly used regularizers, such as the LASSO, SCAD and
MCP, meet these conditions. Note that Lemma 5 is devoted to summa-
rizing the general properties of those penalty functions.
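As a sanity check, conditions (P1)–(P5) can be verified numerically for the SCAD penalty, for which L = 1 and μ = 1/(a − 1); the closed form below follows Fan and Li [2001]:

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty rho_lambda(t); satisfies (P1)-(P5) with L = 1, mu = 1/(a-1)."""
    t = np.abs(t)  # (P1): even, with scad(0) = 0
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam, (2*a*lam*t - t**2 - lam**2) / (2*(a - 1)),
                    lam**2 * (a + 1) / 2))

lam, a = 0.5, 3.7
ts = np.linspace(1e-6, 5, 2000)
vals = scad(ts, lam, a)
# (P2): nondecreasing; (P3): rho(t)/t nonincreasing on a fine grid
print(np.all(np.diff(vals) >= -1e-12), np.all(np.diff(vals / ts) <= 1e-12))
```

The same numerical checks apply to the MCP; for the LASSO, ρ_λ(t) = λ|t| satisfies (P1)–(P5) trivially with μ arbitrarily small.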
2.5.3 Notations
We summarize most of the notation used throughout the paper as follows. First, for
any vector u = (u_1, …, u_s)', we let ||u||_q = (∑_{l=1}^s |u_l|^q)^{1/q} for q ≥ 1, ||u||_∞ = max_{l≤s} |u_l|,
and ||u||_0 denote the cardinality of supp(u), where supp(u) = {l : u_l ≠ 0}. Moreover, if
S = supp(u), we let u_S be the vector containing only the nonzero elements of u, while u_{S^c} contains only
the zero elements. For any matrix C = [c_{ij}], we denote ||C||_∞ = max_{i,j} |c_{ij}|; if C is symmetric,
λ_min(C) and λ_max(C) denote its minimum and maximum eigenvalues. For sequences a_n and b_n,
a_n ∼ b_n if and only if c_1 ≤ lim_{n→∞} |a_n/b_n| ≤ c_2 for some positive constants c_1, c_2. For any
sub-gaussian random variable X, define ||X||_{φ_1} = sup_{q≥1} q^{−1/2}(E|X|^q)^{1/q} as the sub-gaussian
norm, and for any sub-exponential random variable X, define ||X||_{φ_2} = sup_{q≥1} q^{−1}(E|X|^q)^{1/q}
as the sub-exponential norm, similar to those adopted in Ning and Liu [2016].
Second, recall that P_n = {1, …, p_n} is the index set representing all functional predictors,
while H_n ⊆ P_n is any nonempty subset of P_n, with cardinality |H_n| = h_n ≤ p_n, and we also
denote H_n^c = P_n \ H_n. For each j ≤ p_n, we let Λ_j = diag{ω_{j1}^{1/2}, …, ω_{js_n}^{1/2}}. Moreover, we
also let Λ = Λ_{P_n} be the block diagonal matrix with {Λ_j : j ≤ p_n} as its diagonal submatrices
and let Λ_{H_n} be the block diagonal matrix with {Λ_j : j ∈ H_n} as its diagonal submatrices.
Similarly, we define Λ̂_j = diag{ω̂_{j1}^{1/2}, …, ω̂_{js_n}^{1/2}}, Λ̂ = Λ̂_{P_n} and Λ̂_{H_n}. Furthermore, we denote
η̄ = Λη = (η̄_1', …, η̄_{p_n}')', with each η̄_j = Λ_j η_j = (η_{j1}ω_{j1}^{1/2}, …, η_{js_n}ω_{js_n}^{1/2})'. Analogously, we
denote ν = η̂ − η* and ν̄ = Λν = Λ(η̂ − η*), where η* and η̄* = Λη* are the true versions of η and η̄,
respectively. For any H_n ⊆ P_n, ν̄_{H_n} is constructed by stacking the vectors {ν̄_j = Λ_j(η̂_j − η_j*) :
j ∈ H_n} in a column.

Third, without loss of generality, we let Θ = (G_1, …, G_n)' = [Θ_{H_n}, Θ_{H_n^c}], Θ_{H_n} =
(E_1, …, E_n)', Θ_{H_n^c} = (F_1, …, F_n)', where Θ_{H_n} is constructed by stacking {Θ_j : j ∈ H_n} in a
row. Similarly, we denote Θ̄ = (Ḡ_1, …, Ḡ_n)' = [Θ̄_{H_n}, Θ̄_{H_n^c}] = ΘΛ^{−1} = [Θ_{H_n}Λ_{H_n}^{−1}, Θ_{H_n^c}Λ_{H_n^c}^{−1}],
Θ̄_{H_n} = (Ē_1, …, Ē_n)', Θ̄_{H_n^c} = (F̄_1, …, F̄_n)', where Θ̄_{H_n} is constructed by stacking {Θ̄_j =
Θ_jΛ_j^{−1} : j ∈ H_n} in a row.
Fourth, we denote a series of terms that can be constructed from the information matrix I:

I = E(G_iG_i'), I_{H_nH_n} = E(E_iE_i'), I_{H_nH_n^c} = E(E_iF_i'),
I_{H_n^cH_n} = E(F_iE_i'), I_{H_n^cH_n^c} = E(F_iF_i'), I_{H_n|H_n^c} = I_{H_nH_n} − w'I_{H_n^cH_n},
w = I_{H_n^cH_n^c}^{−1} I_{H_n^cH_n} = (w_1, …, w_{h_ns_n}), Ī = E(Ḡ_iḠ_i') = Λ^{−1}IΛ^{−1},
Ī_{H_nH_n} = E(Ē_iĒ_i') = Λ_{H_n}^{−1}I_{H_nH_n}Λ_{H_n}^{−1}, Ī_{H_nH_n^c} = E(Ē_iF̄_i') = Λ_{H_n}^{−1}I_{H_nH_n^c}Λ_{H_n^c}^{−1},
Ī_{H_n^cH_n} = E(F̄_iĒ_i') = Λ_{H_n^c}^{−1}I_{H_n^cH_n}Λ_{H_n}^{−1}, Ī_{H_n^cH_n^c} = E(F̄_iF̄_i') = Λ_{H_n^c}^{−1}I_{H_n^cH_n^c}Λ_{H_n^c}^{−1},
Ī_{H_n|H_n^c} = Λ_{H_n}^{−1}I_{H_n|H_n^c}Λ_{H_n}^{−1} = Λ_{H_n}^{−1}(I_{H_nH_n} − w'I_{H_n^cH_n})Λ_{H_n}^{−1},
ρ_n = max_{l≤h_ns_n} ||w_l||_0 = max_{l≤h_ns_n} ρ_{nl},  ρ_{nl} = ||w_l||_0.
Finally, by using the η̂ generated from (2.4) under the conditions of Theorem 3, we further
denote

T = n^{−1/2} ∑_{i=1}^n Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i'η̂_{H_n^c}) = n^{−1/2} ∑_{i=1}^n Ŝ_i,
T* = n^{−1/2} ∑_{i=1}^n Λ_{H_n}^{−1}(w'F_i − E_i)ε_i = n^{−1/2} ∑_{i=1}^n S_i,
T_e = n^{−1/2} ∑_{i=1}^n e_iŜ_i,  T_e* = n^{−1/2} ∑_{i=1}^n e_iS_i,
φ'(α) = inf{t ∈ R : P_e(||T_e||_∞ ≤ t) ≥ α},  α ∈ (0, 1),
φ*(α) = inf{t ∈ R : P_e(||T_e*||_∞ ≤ t) ≥ α},  α ∈ (0, 1),
Ŝ_i = (Ŝ_{i1}, …, Ŝ_{i,h_ns_n})' = Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i'η̂_{H_n^c}),
S_i = (S_{i1}, …, S_{i,h_ns_n})' = Λ_{H_n}^{−1}(w'F_i − E_i)ε_i,

where {e_i : i ≤ n} is a set of independent and identically distributed standard normal random
variables independent of the data, P_e(·) denotes probability with respect to e, φ'(α) denotes
the α quantile of ||T_e||_∞, and φ*(α) denotes the α quantile of ||T_e*||_∞. It is also worth noting
that {S_i : i ≤ n} are i.i.d. centred random vectors such that E(S_iS_i') = σ^2 Ī_{H_n|H_n^c}.
2.5.4 Auxiliary Lemmas
Lemma 5 (a) Under conditions (P1)–(P4), we have |ρ_λ(t_1) − ρ_λ(t_2)| ≤ λL|t_1 − t_2|, for all
t_1, t_2 ∈ R.

(b) Under conditions (P1)–(P4), we have |ρ'_λ(t)| ≤ λL, for all t ≠ 0.

(c) Under conditions (P1)–(P5), we have λL|t| ≤ ρ_λ(t) + 2^{−1}μt^2, for all t ∈ R.
Lemma 6 Under conditions (P1)–(P4), if P_{λ_n}(η*) − P_{λ_n}(η̂) ≥ 0, where
P_{λ_n}(η) = ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη_j||_2) and η* is the true value of η, then we have

0 ≤ P_{λ_n}(η*) − P_{λ_n}(η̂) ≤ λ_n s_n^{1/2} L ( ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 − ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 ),

where A_n ⊆ P_n is the index set corresponding to the largest q_n elements of {n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 :
j ≤ p_n} in magnitude, while A_n^c = P_n \ A_n.
Lemma 7 Under conditions (A2), (A6), (A8), we have

1) ||Λ̂Λ^{−1} − I||_∞ ≤ c_1 n^{β/2−1/2}, with probability tending to one, for some constant c_1 > 0;

2) ||ΛΛ̂^{−1} − I||_∞ ≤ c_2 n^{β/2−1/2}, with probability tending to one, for some constant c_2 > 0,

where I denotes the p_ns_n × p_ns_n identity matrix.
Lemma 8 Under conditions (A1), (A2), (A6), (A8), we have

1) ||n^{−1}Θ'Θ − E(G_iG_i')||_∞ ≤ c_0 n^{β/2−1/2}, with probability tending to one, for some con-
stant c_0 > 0;

2) ||n^{−1}Θ̄'Θ̄ − E(Ḡ_iḠ_i')||_∞ ≤ c_1 n^{β/2−1/2}, with probability tending to one, for some constant
c_1 > 0;

3) ||n^{−1}Θ'ε||_∞ ≤ c_2 n^{β/2−1/2}, with probability tending to one, for some constant c_2 > 0;

4) ||n^{−1}Θ̄'ε||_∞ ≤ c_3 n^{β/2−1/2}, with probability tending to one, for some constant c_3 > 0;

5) max_{l≤h_ns_n} ||n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})F_i'||_∞ ≤ c_4 n^{β/2−1/2}, with probability tending to one,
for some constant c_4 > 0;

6) max_{l≤h_ns_n} ||n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2}F_i'||_∞ ≤ c_5 n^{β/2−1/2}, with probability
tending to one, for some constant c_5 > 0;

7) max_{l≤h_ns_n} |n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})ε_i| ≤ c_6 n^{β/2−1/2}, with probability tending to one, for
some constant c_6 > 0;

8) max_{l≤h_ns_n} |n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2}ε_i| ≤ c_7 n^{β/2−1/2}, with probability tending
to one, for some constant c_7 > 0,

where ε denotes the n × 1 random error vector.
Lemma 9 Under conditions (A1), (A2), (A6)–(A8), and H_0: β_j(t) = 0 for all j ∈ H_n, we
have

1) ||n^{−1}Θ_{H_n^c}'(Y − Θ_{H_n^c}η_{H_n^c})||_∞ ≤ c_0(n^{β/2−1/2} + q_ns_n^{−δ}), with probability tending to one, for
some constant c_0 > 0;

2) ||n^{−1}Θ̄_{H_n^c}'(Y − Θ_{H_n^c}η_{H_n^c})||_∞ ≤ c_1(n^{β/2−1/2} + q_ns_n^{−δ}), with probability tending to one,
for some constant c_1 > 0.
Lemma 10 Under conditions (A1), (A2), (A4), (A6), (A8), we have

1) n^{−1} ∑_{j=1}^{p_n} ||Θ_j(η̂_j − η_j*)||_2^2 ≤ m_1[1 + o(1)] ||ν̄||_2^2, with probability tending to one, where
the constant m_1 is defined in (A4);

2) ||ν||_1 ≤ c_0 s_n^{1/2} ∑_{j=1}^{p_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2, with probability tending to one, for some
constant c_0 > 0;

3) λ_n||ν||_1 ≤ c_1 [P_{λ_n}(η*) + P_{λ_n}(η̂) + ||ν̄||_2^2], with probability tending to one, for some
constant c_1 > 0.

Recall that ν̄ = Λν = Λ(η̂ − η*).
Lemma 11 Under conditions (A1)–(A4), (A6), (A8), (A10), we have

1) max_{l≤h_ns_n} ||n^{−1} ∑_{i=1}^n F_iF_i'(ŵ_l − w_l)||_∞ ≤ c_0 n^{β/2−1/2}, with probability tending to one,
for some constant c_0 > 0;

2) max_{l≤h_ns_n} ||ŵ_l − w_l||_2 ≤ c_1 ρ_n^{1/2} s_n^{a} n^{β/2−1/2}, with probability tending to one, for some
constant c_1 > 0;

3) max_{l≤h_ns_n} ||ŵ_l − w_l||_1 ≤ c_2 ρ_n s_n^{a} n^{β/2−1/2}, with probability tending to one, for some
constant c_2 > 0.

Recall that ρ_n = max_{l≤h_ns_n} ||w_l||_0 = max_{l≤h_ns_n} ρ_{nl}, with ρ_{nl} = ||w_l||_0.
Lemma 12 Under conditions (A1)–(A4), (A6)–(A10), and (P1)–(P5), we have

1) max_{l≤h_ns_n} |n^{−1/2} ∑_{i=1}^n (Ŝ_{il} − S_{il})| ≤ c_0(n^{β−1/2}ρ_ns_n^{3a/2} + λ_n n^{β/2}q_ns_n^{a+1} + n^{β/2+1/2}q_ns_n^{−δ} log s_n +
n^{β}ρ_nq_ns_n^{3a/2−δ} log s_n), with probability tending to one, for some constant c_0 > 0;

2) max_{l≤h_ns_n} [n^{−1} ∑_{i=1}^n (Ŝ_{il} − S_{il})^2]^{1/2} ≤ c_1 [λ_n n^{β}q_ns_n^{a/2+1} + n^{β/2}q_ns_n^{−δ} log s_n +
ρ_ns_n^{3a/2}n^{β−1/2}(log n)^{1/2} + λ_nρ_nq_ns_n^{2a+1}n^{3β/2−1/2} + n^{β−1/2}ρ_nq_ns_n^{3a/2−δ} log s_n], with prob-
ability tending to one, for some constant c_1 > 0.

Recall that we denote

Ŝ_i = (Ŝ_{i1}, …, Ŝ_{i,h_ns_n})' = Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i'η̂_{H_n^c}),
S_i = (S_{i1}, …, S_{i,h_ns_n})' = Λ_{H_n}^{−1}(w'F_i − E_i)ε_i.
2.5.5 Proof of Lemmas
Proof of Lemma 5. For part (a), if |t1| = |t2|, then by condition (P1), it is easy to see that
|ρλ(t1)− ρλ(t2)| = |ρλ(|t1|)− ρλ(|t2|)| = 0 ≤ λL|t1 − t2|. If |t1| 6= |t2|, then, without loss of
generality, we assume |t1| > |t2|. By conditions (P1)–(P3), we have
|ρλ(t1)− ρλ(t2)| ≤ (|t1| − |t2|)ρλ(|t1| − |t2|)/(|t1| − |t2|) ≤ |t1 − t2| limt→0+ρλ(t)/t.
Given conditons (P1), (P3) and (P4), it is easy to see that limt→0+ρλ(t)/t = limt→0+ ρ′λ(t) =
λL. It follows that |ρλ(t1)− ρλ(t2)| ≤ λL|t1 − t2|, which completes the proof of part (a).
For part (b), by definition, for all t 6= 0, we have |ρ′λ(t)| = | lim∆→0(ρλ(t + ∆) −
ρλ(t))/∆|. By Lemma 5(a), we have |ρ′λ(t)| ≤ λL, which completes the proof of part (b).
CHAPTER 2. TESTING GENERAL HYPOTHESIS IN LARGE-SCALE FLR 70
For part (c), if t = 0, the inequality holds trivially. If t 6= 0, without loss of generality, we
let t > 0. Since ρλ,µ(s) = ρλ(s) + 2−1µs2 is convex for s ∈ [0, t], and is differentiable for
s ∈ (0, t], it is easy to see that ρλ,µ(t) − ρλ,µ(0) ≥ t lims1→0+ρ′λ(s1) + µs1 = tλL, which
completes the proof of part (c).
Proof of Lemma 6. First, we define f_n(t) = t/ρ_{λ_n s_n^{1/2}}(t) for t > 0, and f_n(0) = (λ_n s_n^{1/2} L)^{−1}.
Under conditions (P1)–(P4), it is easy to see that f_n(t) is nondecreasing in t for
t ≥ 0. It then follows that

∑_{j∈A_n^c} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)
= ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 [f_n(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1}
≥ [f_n(max_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1} ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2.  (2.12)

Moreover, we also have

∑_{j∈A_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)
= ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 [f_n(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1}
≤ [f_n(max_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1} ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2.  (2.13)

Combining (2.12), (2.13) and P_{λ_n}(η*) − P_{λ_n}(η̂) ≥ 0, we have

0 ≤ P_{λ_n}(η*) − P_{λ_n}(η̂)
= ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη_j*||_2) − ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
= ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη_j*||_2) − ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
≤ ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2) + ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2) − ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
= ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2) − ∑_{j=q_n+1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
≤ ∑_{j∈A_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2) − ∑_{j∈A_n^c} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)
≤ [f_n(max_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1} ( ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 − ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 )
≤ λ_n s_n^{1/2} L ( ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 − ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 ),

which completes the proof.
Proof of Lemma 7. First, under condition (A2) and by the Bernstein inequality, for any t > 0,
\[
P\Bigl(\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \ge t\Bigr) \le 2\exp\bigl(-c_0\min\{c_1^{-2}t^2, c_1^{-1}t\}n\bigr),
\]
for some universal constants c0, c1 > 0, uniformly in j = 1, ..., pn and k = 1, ..., sn. It then follows from the union bound that
\[
P\Bigl(\max_{j\le p_n}\max_{k\le s_n}\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \ge t\Bigr) \le 2p_ns_n\exp\bigl(-c_0\min\{c_1^{-2}t^2, c_1^{-1}t\}n\bigr).
\]
Choosing t = c2[log(pnsn)/n]^{1/2} for some sufficiently large c2 > 0, it is easy to see that
\[
\max_{j\le p_n}\max_{k\le s_n}\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \le c_2[\log(p_ns_n)/n]^{1/2},
\]
with probability tending to one. Accordingly, writing \(\hat\Lambda\) for the sample analogue of \(\Lambda\), it is easy to see that
\[
\|\hat\Lambda\Lambda^{-1} - I\|_\infty = \max_{j\le p_n}\max_{k\le s_n}\Bigl|\Bigl(n^{-1}\sum_{i=1}^n\omega_{jk}^{-1}\theta_{ijk}^2\Bigr)^{1/2} - 1\Bigr| \le \max_{j\le p_n}\max_{k\le s_n}\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \le c_2[\log(p_ns_n)/n]^{1/2},
\]
with probability tending to one, where the first inequality uses \(|x^{1/2} - 1| \le |x - 1|\) for x ≥ 0. Under (A6) and (A8), it is obvious that \(\|\hat\Lambda\Lambda^{-1} - I\|_\infty \le c_3n^{\beta/2-1/2}\), with probability tending to one, for some constant c3 > 0. Moreover, we have
\begin{align*}
\|\Lambda\hat\Lambda^{-1} - I\|_\infty &= \max_{j\le p_n}\max_{k\le s_n}\Bigl|\omega_{jk}^{1/2}\Bigl(n^{-1}\sum_{i=1}^n\theta_{ijk}^2\Bigr)^{-1/2} - 1\Bigr|\\
&\le \Bigl[\max_{j\le p_n}\max_{k\le s_n}\omega_{jk}^{1/2}\Bigl(n^{-1}\sum_{i=1}^n\theta_{ijk}^2\Bigr)^{-1/2}\Bigr]\|\hat\Lambda\Lambda^{-1} - I\|_\infty\\
&\le (\|\Lambda\hat\Lambda^{-1} - I\|_\infty + 1)\|\hat\Lambda\Lambda^{-1} - I\|_\infty,
\end{align*}
which further implies that
\[
\|\Lambda\hat\Lambda^{-1} - I\|_\infty \le \frac{\|\hat\Lambda\Lambda^{-1} - I\|_\infty}{1 - \|\hat\Lambda\Lambda^{-1} - I\|_\infty} \le \frac{c_3n^{\beta/2-1/2}}{1 - c_3n^{\beta/2-1/2}} \le 2c_3n^{\beta/2-1/2},
\]
with probability tending to one, which completes the proof.
Proof of Lemma 8. The lemma follows by applying the Bernstein inequality and the union bound repeatedly.
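The rate [log(pnsn)/n]^{1/2} delivered by the Bernstein-plus-union-bound argument in Lemmas 7 and 8 is easy to see in simulation. A hedged sketch with standard Gaussian scores (so that \(\omega_{jk}^{-1}\theta_{ijk}^2\) is chi-squared with one degree of freedom); the dimensions and the constant 5 are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_deviation(n, p, s):
    """max_{j,k} |n^{-1} sum_i (omega^{-1} theta_ijk^2 - 1)| for N(0,1) scores."""
    theta = rng.standard_normal((n, p * s))  # already standardized: omega_jk = 1
    return np.abs((theta**2 - 1.0).mean(axis=0)).max()

n, p, s = 2000, 50, 10
dev = max_deviation(n, p, s)
rate = np.sqrt(np.log(p * s) / n)
print(f"max deviation {dev:.4f} vs rate sqrt(log(ps)/n) = {rate:.4f}")
# the maximum over p*s averages stays within a modest constant times the rate
assert dev <= 5 * rate
```

Increasing n while holding p, s fixed shrinks the deviation proportionally to the rate, matching the bound used in the proofs.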
Proof of Lemma 9. First, under H0 : βj(t) = 0 for all j ∈ Hn, it is easy to see that
\begin{align*}
\|n^{-1}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty
&= \Bigl\|n^{-1}\sum_{i=1}^nF_i(Y_i - F_i'\eta_{H_n^c})\Bigr\|_\infty\\
&= \Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^nF_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr\|_\infty\\
&= \Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk} + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_i\theta_{ijk})\eta_{jk}\Bigr\|_\infty\\
&\le \Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk}\Bigr\|_\infty + \Bigl\|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_i\theta_{ijk})\eta_{jk}\Bigr\|_\infty. \tag{2.14}
\end{align*}
In addition, for each l = 1, ..., (pn − hn)sn, we have
\[
E\Bigl[\Bigl(n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr)^2\Bigr] = n^{-1}\sigma^2 \sim n^{-1}. \tag{2.15}
\]
Moreover, for each l = 1, ..., (pn − hn)sn, we have
\begin{align*}
&E\Bigl(n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr)^2
= n^{-2}\sum_{i=1}^nE\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr)^2\\
&\quad\le n^{-2}\sum_{i=1}^nE\Bigl(\sum_{j\in H_n^c}\Bigl|\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr|\Bigr)^2
\le n^{-2}\sum_{i=1}^nE\Bigl(\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr|\Bigr)^2\\
&\quad\le n^{-2}q_n\sum_{i=1}^nE\Bigl(\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr|^2\Bigr)
\le n^{-2}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\eta_{jk}^2k^{2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty\mathrm{Var}(F_{il}\theta_{ijk_1})k_1^{-2\delta}\Bigr)\\
&\quad\le n^{-2}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\eta_{jk}^2k^{2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty E(F_{il}^2\theta_{ijk_1}^2)k_1^{-2\delta}\Bigr)
= n^{-2}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\eta_{jk}^2k^{2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty E(F_{il}^2\bar\theta_{ijk_1}^2)\omega_{jk_1}k_1^{-2\delta}\Bigr)\\
&\quad\le cn^{-1}q_n^2s_n^{-2\delta} = o(n^{-1}), \tag{2.16}
\end{align*}
where \(\bar\theta_{ijk} = \omega_{jk}^{-1/2}\theta_{ijk}\) denotes the standardized score, and the last inequality is by (A1), (A2), (A7) and (A8). Therefore, by combining (2.15) with (2.16), for any l = 1, ..., (pn − hn)sn there exists a universal constant c0 > 0 such that
\[
\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \le c_0\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr|\Bigr) = 1,
\]
which implies that for any t > 0,
\begin{align*}
&\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \ge t\Bigr)\\
&\quad\le \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge t\Bigr) + \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \ge t\Bigr)\\
&\quad\le \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge t\Bigr) + \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_0^{-1}t\Bigr)
\le 2\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_1t\Bigr), \tag{2.17}
\end{align*}
where c1 = min{1, c0^{-1}} > 0. Moreover, under (A2) and by the Bernstein inequality, for any l = 1, ..., (pn − hn)sn and any t > 0,
\[
\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_1t\Bigr) \le \lim_{n\to\infty}2\exp\bigl(-c_2\min\{c_3^{-2}t^2, c_3^{-1}t\}n\bigr), \tag{2.18}
\]
where c2 and c3 are some universal positive constants. By combining (2.17) with (2.18), for any l = 1, ..., (pn − hn)sn and any t > 0,
\[
\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \ge t\Bigr) \le 2\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_1t\Bigr) \le \lim_{n\to\infty}4\exp\bigl(-c_2\min\{c_3^{-2}t^2, c_3^{-1}t\}n\bigr).
\]
Invoking the union bound, for any t > 0,
\[
\lim_{n\to\infty}P\Bigl(\Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk}\Bigr\|_\infty \ge t\Bigr) \le \lim_{n\to\infty}4p_ns_n\exp\bigl(-c_2\min\{c_3^{-2}t^2, c_3^{-1}t\}n\bigr).
\]
Choosing t = c4[log(pnsn)/n]^{1/2} for some sufficiently large c4 > 0, it is easy to see that
\[
\Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk}\Bigr\|_\infty \le c_4[\log(p_ns_n)/n]^{1/2} \le c_5n^{\beta/2-1/2}, \tag{2.19}
\]
with probability tending to one, for some c5 > 0. Furthermore, we have
\begin{align*}
\Bigl\|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_i\theta_{ijk})\eta_{jk}\Bigr\|_\infty
&= \max_{l\le(p_n-h_n)s_n}\Bigl|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_{il}\theta_{ijk})\eta_{jk}\Bigr|\\
&\le \max_{l\le(p_n-h_n)s_n}n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\Bigl|\sum_{k=s_n+1}^\infty E(F_{il}\theta_{ijk})\eta_{jk}\Bigr|
\le \max_{l\le(p_n-h_n)s_n}n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty E(F_{il}\theta_{ijk})\eta_{jk}\Bigr|\\
&\le \max_{l\le(p_n-h_n)s_n}n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\bigl[E(F_{il}\theta_{ijk})\bigr]^2k^{-2\delta}\Bigr)^{1/2}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\\
&\le n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\omega_{jk}k^{-2\delta}\Bigr)^{1/2}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2} \le c_6q_ns_n^{-\delta}, \tag{2.20}
\end{align*}
for some constant c6 > 0, where the last inequality is by (A1) and (A7). By combining (2.14), (2.19) with (2.20), we conclude that \(\|n^{-1}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty \le c_7(n^{\beta/2-1/2} + q_ns_n^{-\delta})\), with probability tending to one, for some constant c7 > 0, which completes the proof of part 1). For part 2), it is easy to observe that
\begin{align*}
\|n^{-1}(\Theta_{H_n^c}\Lambda_{H_n^c})'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty &= \|n^{-1}\Lambda_{H_n^c}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty\\
&\le \lambda_{\max}(\Lambda_{H_n^c})\|n^{-1}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty \le c_8(n^{\beta/2-1/2} + q_ns_n^{-\delta}),
\end{align*}
with probability tending to one, for some constant c8 > 0, which completes the proof.
Proof of Lemma 10. First, recalling that \(\tilde\Theta = \Theta\Lambda^{-1}\) and \(\bar\nu = \Lambda\nu\) (so that \(\tilde\Theta_j\bar\nu_j = \Theta_j(\eta_j - \eta_j^*)\)), it is easy to see that
\begin{align*}
n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2 &= n^{-1}\sum_{j=1}^{p_n}\|\tilde\Theta_j\bar\nu_j\|_2^2 = \sum_{j=1}^{p_n}\bar\nu_j'(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j\\
&= \sum_{j=1}^{p_n}\bar\nu_j'E(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j + \sum_{j=1}^{p_n}\bar\nu_j'\bigl[n^{-1}\tilde\Theta_j'\tilde\Theta_j - E(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bigr]\bar\nu_j\\
&\le \lambda_{\max}(I)\|\bar\nu\|_2^2 + \|n^{-1}\tilde\Theta'\tilde\Theta - E(n^{-1}\tilde\Theta'\tilde\Theta)\|_\infty\sum_{j=1}^{p_n}\|\bar\nu_j\|_1^2\\
&\le \bigl[\lambda_{\max}(I) + s_n\|n^{-1}\tilde\Theta'\tilde\Theta - E(n^{-1}\tilde\Theta'\tilde\Theta)\|_\infty\bigr]\|\bar\nu\|_2^2
\le (m_1 + c_2s_nn^{\beta/2-1/2})\|\bar\nu\|_2^2,
\end{align*}
for some constant c2 > 0, where the last inequality is by (A4) and Lemma 8. Since \(s_nn^{\beta/2-1/2} = o(1)\), it follows that \(n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2 \le m_1[1+o(1)]\|\bar\nu\|_2^2\), with probability tending to one, which completes the proof of part 1). Moreover, we have
\begin{align*}
\|\bar\nu\|_1 &= \sum_{j=1}^{p_n}\sum_{k=1}^{s_n}|\bar\eta_{jk} - \bar\eta_{jk}^*| \le s_n^{1/2}\sum_{j=1}^{p_n}\|\bar\nu_j\|_2
= \bigl[\lambda_{\min}(I)\bigr]^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\bigl[\lambda_{\min}(I)\bigr]^{1/2}\|\bar\nu_j\|_2\\
&\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\bigl[\lambda_{\min}(I)\bigr]^{1/2}\|\bar\nu_j\|_2
\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\bigl[\bar\nu_j'E(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j\bigr]^{1/2}\\
&= m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\Bigl(\bar\nu_j'(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j + \bar\nu_j'\bigl[E(n^{-1}\tilde\Theta_j'\tilde\Theta_j) - n^{-1}\tilde\Theta_j'\tilde\Theta_j\bigr]\bar\nu_j\Bigr)^{1/2}\\
&\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\Bigl(n^{-1}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2 + \|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty\|\bar\nu_j\|_1^2\Bigr)^{1/2}\\
&\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\Bigl(n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2 + \|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty^{1/2}\|\bar\nu_j\|_1\Bigr)\\
&= \Bigl(m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\Bigr) + \bigl(m_0^{-1/2}s_n^{1/2}\|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty^{1/2}\|\bar\nu\|_1\bigr)\\
&\le \Bigl(m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\Bigr) + \bigl(c_3s_n^{1/2}n^{\beta/4-1/4}\|\bar\nu\|_1\bigr),
\end{align*}
for some constant c3 > 0, where the last inequality is by Lemma 8. Invoking (A8), it is easy to see that \(\|\bar\nu\|_1 \le c_4s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\), with probability tending to one, for some constant c4 > 0, which completes the proof of part 2). For part 3), it is easy to observe that by combining part 2) with Lemma 5(c), we have
\begin{align*}
\lambda_n\|\bar\nu\|_1 &\le c_4\lambda_ns_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2
= c_4L^{-1}\sum_{j=1}^{p_n}\lambda_ns_n^{1/2}L\,n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\\
&\le c_4L^{-1}\sum_{j=1}^{p_n}\bigl[\rho_{\lambda_ns_n^{1/2}}\bigl(n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\bigr) + 2^{-1}\mu n^{-1}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2\bigr]\\
&\le c_4L^{-1}\Bigl[P_{\lambda_n}(\eta^*) + P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2\Bigr]. \tag{2.21}
\end{align*}
By combining (2.21) with part 1), it is easy to see that \(\lambda_n\|\bar\nu\|_1 \le c_5\bigl[P_{\lambda_n}(\eta^*) + P_{\lambda_n}(\eta) + \|\bar\nu\|_2^2\bigr]\), with probability tending to one, for some constant c5 > 0, which completes the proof of part 3).
Proof of Lemma 11. First, by Lemma 8, we have
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le c_0n^{\beta/2-1/2},
\]
with probability tending to one, for some constant c0 > 0. Moreover, under (A6), (A8), and (A10), we have τn ∼ n^{β/2−1/2}. Choosing τn = c0n^{β/2−1/2}, it is easy to see that
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le \tau_n = c_0n^{\beta/2-1/2}. \tag{2.22}
\]
Under (2.7), we have
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(\hat w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le \tau_n = c_0n^{\beta/2-1/2}. \tag{2.23}
\]
Combining (2.22) with (2.23), it is easy to see that
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^nF_iF_i'(\hat w_l - w_l)\Bigr\|_\infty \le \max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(\hat w_l'F_i - E_{il})F_i'\Bigr\|_\infty + \max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le 2\tau_n = 2c_0n^{\beta/2-1/2}, \tag{2.24}
\]
which completes the proof of part 1). In addition, by the definition of the Dantzig selector method, it is easy to see that
\[
P\Bigl(\bigcap_{l=1}^{h_ns_n}\{\|\hat w_l\|_1 \le \|w_l\|_1\}\Bigr) \to 1, \quad\text{as } n\to\infty. \tag{2.25}
\]
Denote \(S_l = \{j : w_{lj} \ne 0\}\) as the support set of each vector \(w_l\); then (2.25) implies that
\begin{align*}
\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1 &\le \max_{l\le h_ns_n}\|(\hat w_l - w_l)_{S_l}\|_1 + \max_{l\le h_ns_n}\|(\hat w_l)_{S_l^c}\|_1\\
&\le \max_{l\le h_ns_n}\|(\hat w_l - w_l)_{S_l}\|_1 + \max_{l\le h_ns_n}\bigl[\|(w_l)_{S_l}\|_1 - \|(\hat w_l)_{S_l}\|_1\bigr]\\
&\le 2\max_{l\le h_ns_n}\|(\hat w_l - w_l)_{S_l}\|_1 \le 2\max_{l\le h_ns_n}\rho_{nl}^{1/2}\|(\hat w_l - w_l)_{S_l}\|_2
\le 2\rho_n^{1/2}\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2, \tag{2.26}
\end{align*}
with probability tending to one. Furthermore, we have
\[
\max_{l\le h_ns_n}(\hat w_l - w_l)'\Bigl(n^{-1}\sum_{i=1}^nF_iF_i'\Bigr)(\hat w_l - w_l) \le \max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigl\|n^{-1}\sum_{i=1}^nF_iF_i'(\hat w_l - w_l)\Bigr\|_\infty \le 2c_0n^{\beta/2-1/2}\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1, \tag{2.27}
\]
with probability tending to one, where the last inequality is by (2.24). In addition, we also have
\begin{align*}
\max_{l\le h_ns_n}(\hat w_l - w_l)'\Bigl(n^{-1}\sum_{i=1}^nF_iF_i'\Bigr)(\hat w_l - w_l)
&= \max_{l\le h_ns_n}\Bigl((\hat w_l - w_l)'E(F_iF_i')(\hat w_l - w_l) - (\hat w_l - w_l)'\Bigl[E(F_iF_i') - n^{-1}\sum_{i=1}^nF_iF_i'\Bigr](\hat w_l - w_l)\Bigr)\\
&\ge \max_{l\le h_ns_n}\Bigl(\lambda_{\min}(I)\|\hat w_l - w_l\|_2^2 - \|\hat w_l - w_l\|_1^2\Bigl\|E(F_iF_i') - n^{-1}\sum_{i=1}^nF_iF_i'\Bigr\|_\infty\Bigr)\\
&\ge \max_{l\le h_ns_n}\bigl(\lambda_{\min}(I)\|\hat w_l - w_l\|_2^2 - \|\hat w_l - w_l\|_1^2\|E(G_iG_i') - n^{-1}\Theta'\Theta\|_\infty\bigr)\\
&\ge \lambda_{\min}(I)\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2^2\Bigr) - \|E(G_iG_i') - n^{-1}\Theta'\Theta\|_\infty\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1^2\Bigr)\\
&\ge c_1s_n^{-a}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2^2\Bigr) - c_2n^{\beta/2-1/2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1^2\Bigr)\\
&\ge (c_1s_n^{-a} - c_3\rho_nn^{\beta/2-1/2})\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2^2\Bigr), \tag{2.28}
\end{align*}
with probability tending to one, where the second last inequality is by (A3), (A4), and Lemma 8, while the last inequality is by (2.26). Combining (2.26), (2.27), (2.28) with (A8) entails that
\[
\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2 \le c_4\rho_n^{1/2}s_n^an^{\beta/2-1/2}, \tag{2.29}
\]
with probability tending to one, for some constant c4 > 0, which completes the proof of part 2). Lastly, combining (2.26) with (2.29) entails that
\[
\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1 \le c_5\rho_ns_n^an^{\beta/2-1/2},
\]
with probability tending to one, for some constant c5 > 0, which completes the proof of part 3).
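The Dantzig-selector-type program around (2.7), which minimizes the l1 norm subject to a sup-norm constraint on the sample Gram residual, is a linear program. A hedged illustration only (tiny dimensions, synthetic data, scipy's LP solver), splitting w into positive and negative parts:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig(G, e, tau):
    """min ||w||_1 subject to ||G w - e||_inf <= tau, via LP with w = u - v, u, v >= 0."""
    d = G.shape[1]
    c = np.ones(2 * d)                   # objective: sum(u) + sum(v) = ||w||_1
    A = np.vstack([np.hstack([G, -G]),   #  G(u - v) - e <= tau
                   np.hstack([-G, G])])  # -G(u - v) + e <= tau
    b = np.concatenate([tau + e, tau - e])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v

rng = np.random.default_rng(2)
n, d = 500, 8
F = rng.standard_normal((n, d))
G = F.T @ F / n                          # sample Gram matrix n^{-1} sum_i F_i F_i'
e = np.zeros(d); e[0] = 1.0              # first coordinate vector
w_hat = dantzig(G, e, tau=0.05)
assert np.max(np.abs(G @ w_hat - e)) <= 0.05 + 1e-8
```

The program is always feasible here since the exact solution of Gw = e satisfies the constraint with zero residual; tightening tau trades sparsity of w against the residual level, mirroring the role of tau_n in Lemma 11.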
Proof of Lemma 12. First, it is easy to see that for each l = 1, ..., hnsn, we have
\begin{align*}
\hat S_{il} - S_{il} &= \Bigl[n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr]^{-1/2}(\hat w_l'F_i - E_{il})(Y_i - F_i'\hat\eta_{H_n^c}) - \bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\\
&= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l'F_i - E_{il})(Y_i - F_i'\hat\eta_{H_n^c}) - \bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\\
&= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}\bigl[(w_l'F_i - E_{il}) + (\hat w_l - w_l)'F_i\bigr]\Bigl[\varepsilon_i + F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c}) + \sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr]\\
&\qquad - \bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\\
&= \Delta_{1il} + \Delta_{2il} + \Delta_{3il} + \Delta_{4il} + \Delta_{5il} + \Delta_{6il},
\end{align*}
where we denote
\begin{align*}
\Delta_{1il} &= \Bigl(\Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2} - 1\Bigr)\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i,\\
\Delta_{2il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c}),\\
\Delta_{3il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr),\\
\Delta_{4il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\varepsilon_i,\\
\Delta_{5il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_iF_i'(\eta_{H_n^c} - \hat\eta_{H_n^c}),\\
\Delta_{6il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr).
\end{align*}
Accordingly, it is easy to observe that
\[
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(\hat S_{il} - S_{il})\Bigr| \le \sum_{m=1}^6\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{mil}\Bigr|. \tag{2.30}
\]
For the first term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{1il}\Bigr| &\le \|\Lambda\hat\Lambda^{-1} - I\|_\infty\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}\varepsilon_i\Bigr|\\
&= n^{1/2}\|\Lambda\hat\Lambda^{-1} - I\|_\infty\max_{l\le h_ns_n}\Bigl|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}\varepsilon_i\Bigr| \le c_0n^{\beta-1/2}, \tag{2.31}
\end{align*}
with probability tending to one, for some constant c0 > 0, where the last inequality is by Lemma 7 and Lemma 8.
For the second term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{2il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c})\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\|\hat\eta_{H_n^c} - \eta_{H_n^c}\|_1\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}F_i'\Bigr\|_\infty\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\|\hat\eta - \eta\|_1\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}F_i'\Bigr\|_\infty
\le c_1\lambda_nn^{\beta/2}q_ns_n^{a/2+1}, \tag{2.32}
\end{align*}
with probability tending to one, for some constant c1 > 0, where the last inequality is by Lemma 7, Lemma 8 and Theorem 3.
For the third term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{3il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\Bigl(\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr|\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl|\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr). \tag{2.33}
\end{align*}
Regarding the maximum, it is easy to see from (A2), (A6) and (A8) that
\[
\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr| \le c_1[\log(np_ns_n)]^{1/2} \le c_2n^{\beta/2}, \tag{2.34}
\]
with probability tending to one, for some constants c1, c2 > 0. Regarding the average, we have
\begin{align*}
E\Bigl(n^{-1}\sum_{i=1}^n\Bigl|\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)
&\le E\Bigl(n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)
\le E\Bigl(n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)^{1/2}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\Bigr)\\
&= n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}E\Bigl[\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)^{1/2}\Bigr]\\
&\le n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\Bigl[E\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)\Bigr]^{1/2}\\
&= n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\Bigl(\sum_{k=s_n+1}^\infty\omega_{jk}k^{-2\delta}\Bigr)^{1/2} = O(q_ns_n^{-\delta}), \tag{2.35}
\end{align*}
where the last bound is by (A1) and (A7). Hence, by combining (2.33), (2.34) with (2.35), it is easy to see that
\[
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{3il}\Bigr| \le O_p(n^{\beta/2+1/2}q_ns_n^{-\delta}). \tag{2.36}
\]
For the fourth term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{4il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\varepsilon_i\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\Bigl(\max_{l\le h_ns_n}\bigl[E(E_{il}^2)\bigr]^{-1/2}\Bigr)\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\|n^{-1}\Theta'\varepsilon\|_\infty\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\|n^{-1}\Theta'\varepsilon\|_\infty
\le c_3n^{\beta-1/2}\rho_ns_n^{3a/2}, \tag{2.37}
\end{align*}
with probability tending to one, for some constant c3 > 0, where the last inequality is by Lemma 7, Lemma 8, Lemma 11 and (A3).
For the fifth term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{5il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_iF_i'(\eta_{H_n^c} - \hat\eta_{H_n^c})\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\|\hat\eta - \eta\|_1\Bigl(\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^nF_iF_i'(\hat w_l - w_l)\Bigr\|_\infty\Bigr)
\le c_4\lambda_nn^{\beta/2}q_ns_n^{a+1}, \tag{2.38}
\end{align*}
with probability tending to one, for some constant c4 > 0, where the last inequality is by Lemma 7, Lemma 11, Theorem 3 and (A3).
For the sixth term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{6il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\Bigl\|n^{-1}\sum_{i=1}^nF_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr\|_\infty\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\Bigl(\max_{l\le(p_n-h_n)s_n}\max_{i\le n}|F_{il}|\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl|\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)\\
&\le O_p(n^{\beta}\rho_nq_ns_n^{3a/2-\delta}), \tag{2.39}
\end{align*}
where the last inequality is by Lemma 7, Lemma 11, (A2), (A3) and (2.35).
In summary, by combining (2.30), (2.31), (2.32), (2.36), (2.37), (2.38) with (2.39), it is easy to see that
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(\hat S_{il} - S_{il})\Bigr| &\le c_0n^{\beta-1/2} + c_1\lambda_nn^{\beta/2}q_ns_n^{a/2+1} + O_p(n^{\beta/2+1/2}q_ns_n^{-\delta})
+ c_3n^{\beta-1/2}\rho_ns_n^{3a/2} + c_4\lambda_nn^{\beta/2}q_ns_n^{a+1} + O_p(n^{\beta}\rho_nq_ns_n^{3a/2-\delta})\\
&\le c_5\bigl(n^{\beta-1/2}\rho_ns_n^{3a/2} + \lambda_nn^{\beta/2}q_ns_n^{a+1} + n^{\beta/2+1/2}q_ns_n^{-\delta}\log s_n + n^{\beta}\rho_nq_ns_n^{3a/2-\delta}\log s_n\bigr),
\end{align*}
with probability tending to one, for some constant c5 > 0, which completes the proof of part 1).
Next, we prove part 2). First, it is easy to observe that
\[
\max_{l\le h_ns_n}\Bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\Bigr] \le 100\sum_{m=1}^6\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{mil}^2. \tag{2.40}
\]
For the first term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{1il}^2 &\le \|\Lambda\hat\Lambda^{-1} - I\|_\infty^2\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\bigl(\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\bigr)^2\\
&\le \|\Lambda\hat\Lambda^{-1} - I\|_\infty^2\max_{l\le h_ns_n}\max_{i\le n}\bigl(\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\bigr)^2 \le c_6n^{2\beta-1}\log n, \tag{2.41}
\end{align*}
with probability tending to one, for some constant c6 > 0, where the last inequality is by Lemma 7 and (A2).
For the second term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{2il}^2 &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\bigl(\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c})\bigr)^2\\
&\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\|\hat\eta - \eta\|_1^2\Bigl(\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\bigl\|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'\bigr\|_\infty^2\Bigr)\\
&\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\|\hat\eta - \eta\|_1^2\Bigl(\max_{l\le h_ns_n}\max_{i\le n}\max_{l_1\le(p_n-h_n)s_n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_{il_1}\bigr|^2\Bigr)
\le c_7\lambda_n^2n^{2\beta}q_n^2s_n^{a+2}, \tag{2.42}
\end{align*}
with probability tending to one, for some constant c7 > 0, where the last inequality is by Lemma 7, Theorem 3 and (A2).
For the third term, we have
\[
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{3il}^2 \le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\Bigl(\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr|^2\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)^2\Bigr). \tag{2.43}
\]
Regarding the maximum, it is easy to see that
\[
\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr|^2 \le c_8\log(nh_ns_n) \le c_8\log(np_ns_n) \le c_9n^{\beta}, \tag{2.44}
\]
with probability tending to one, for some constants c8, c9 > 0, where the last inequality is by (A2), (A6) and (A8). Regarding the average, it is easy to see that
\begin{align*}
E\Bigl[n^{-1}\sum_{i=1}^n\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)^2\Bigr]
&\le E\Bigl[n^{-1}\sum_{i=1}^n\Bigl(\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)^2\Bigr]
\le E\Bigl[n^{-1}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|^2\Bigr]\\
&\le E\Bigl[n^{-1}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)\Bigr]\\
&= n^{-1}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\omega_{jk}k^{-2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr) = O(q_n^2s_n^{-2\delta}). \tag{2.45}
\end{align*}
Hence, by combining (2.43), (2.44), (2.45) with Lemma 7, it is easy to show that
\[
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{3il}^2 \le O_p(n^{\beta}q_n^2s_n^{-2\delta}). \tag{2.46}
\]
For the fourth term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{4il}^2 &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(n^{-1}\sum_{i=1}^n\|F_i\varepsilon_i\|_\infty^2\Bigr)\\
&\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(\max_{i\le n}\max_{l\le(p_n-h_n)s_n}|F_{il}\varepsilon_i|\Bigr)^2
\le c_{10}\rho_n^2s_n^{3a}n^{2\beta-1}\log n, \tag{2.47}
\end{align*}
with probability tending to one, for some constant c10 > 0, where the last inequality is by Lemma 7, Lemma 11, (A1), (A2) and (A3).
For the fifth term, we have
\[
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{5il}^2 \le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\|\hat\eta - \eta\|_1^2\Bigl(\max_{i\le n}\|F_i\|_\infty^4\Bigr) \le c_{11}\lambda_n^2\rho_n^2q_n^2s_n^{4a+2}n^{3\beta-1}, \tag{2.48}
\]
with probability tending to one, for some constant c11 > 0, where the last inequality is by Lemma 7, Lemma 11, Theorem 3, (A1), (A2) and (A3).
For the sixth term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{6il}^2 &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(n^{-1}\sum_{i=1}^n\Bigl\|F_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr\|_\infty^2\Bigr)\\
&= \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(\max_{i\le n}\|F_i\|_\infty^2\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)^2\Bigr)\\
&\le O_p(\rho_n^2q_n^2s_n^{3a-2\delta}n^{2\beta-1}), \tag{2.49}
\end{align*}
where the last inequality is by Lemma 7, Lemma 11, (2.45), (A1), (A2) and (A3).
In summary, by combining (2.41), (2.42), (2.46), (2.47), (2.48), (2.49) with (2.40), it is easy to observe that
\[
\max_{l\le h_ns_n}\Bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\Bigr] \le c_{12}\bigl[\lambda_n^2n^{2\beta}q_n^2s_n^{a+2} + n^{\beta}q_n^2s_n^{-2\delta}(\log s_n)^2 + \rho_n^2s_n^{3a}n^{2\beta-1}\log n + \lambda_n^2\rho_n^2q_n^2s_n^{4a+2}n^{3\beta-1} + n^{2\beta-1}\rho_n^2q_n^2s_n^{3a-2\delta}(\log s_n)^2\bigr], \tag{2.50}
\]
with probability tending to one, for some constant c12 > 0, which further implies that
\[
\max_{l\le h_ns_n}\Bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\Bigr]^{1/2} \le c_{13}\bigl[\lambda_nn^{\beta}q_ns_n^{a/2+1} + n^{\beta/2}q_ns_n^{-\delta}\log s_n + \rho_ns_n^{3a/2}n^{\beta-1/2}(\log n)^{1/2} + \lambda_n\rho_nq_ns_n^{2a+1}n^{3\beta/2-1/2} + n^{\beta-1/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n\bigr], \tag{2.51}
\]
with probability tending to one, for some constant c13 > 0, which completes the proof of part 2).
2.5.6 Proofs of Main Theorems
Proof of Theorem 3. Recall that we denote \(\nu = \eta - \eta^*\), and \(\bar\nu = \bar\eta - \bar\eta^* = \Lambda\nu\). By the first-order necessary condition of optimization theory, any local minimizer η of Qn(η) from (2.4) must satisfy \(\eta \in \{\eta : \langle\nabla L_n(\eta) + \nabla P_{\lambda_n}(\eta), -\nu\rangle \ge 0,\ \|\bar\nu\|_1 \le R_n\}\), where 〈a, b〉 = a′b for any vectors a, b. Hence, in order to prove Theorem 3, it is sufficient to show that parts 1) and 2) of Theorem 3 hold for any η in this set. We therefore assume
\[
\eta \in \{\eta : \langle\nabla L_n(\eta) + \nabla P_{\lambda_n}(\eta), -\nu\rangle \ge 0,\ \|\bar\nu\|_1 \le R_n\}. \tag{2.52}
\]
Moreover, we have
\begin{align*}
\langle\nabla L_n(\eta) - \nabla L_n(\eta^*), \nu\rangle &= \nu'(n^{-1}\Theta'\Theta)\nu
= \bar\nu'E(n^{-1}\tilde\Theta'\tilde\Theta)\bar\nu - \bar\nu'\bigl[E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\bigr]\bar\nu\\
&\ge \lambda_{\min}(I)\|\bar\nu\|_2^2 - \|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty\|\bar\nu\|_1^2
\ge m_0\|\bar\nu\|_2^2 - c_0n^{\beta/2-1/2}\|\bar\nu\|_1^2, \tag{2.53}
\end{align*}
with probability tending to one, for some constant c0 > 0, where the last inequality is by (A4) and Lemma 8. In addition, with a slight abuse of notation, denote
\[
P_{\lambda_n,\mu}(\eta) = P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j\eta_j\|_2^2 = \sum_{j=1}^{p_n}\bigl(\rho_{\lambda_ns_n^{1/2}}\bigl(n^{-1/2}\|\Theta_j\eta_j\|_2\bigr) + 2^{-1}\mu n^{-1}\|\Theta_j\eta_j\|_2^2\bigr).
\]
Under (P5), (A3) and (A4), it is easy to see that \(P_{\lambda_n,\mu}(\eta)\) is convex in η, which implies that
\[
P_{\lambda_n,\mu}(\eta^*) - P_{\lambda_n,\mu}(\eta) \ge \langle\nabla P_{\lambda_n,\mu}(\eta), -\nu\rangle,
\]
which further implies that
\[
\langle\nabla P_{\lambda_n}(\eta), -\nu\rangle \le P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2. \tag{2.54}
\]
By combining (2.52), (2.53) with (2.54), we have
\begin{align*}
m_0\|\bar\nu\|_2^2 - c_0n^{\beta/2-1/2}\|\bar\nu\|_1^2 &\le \langle\nabla L_n(\eta) - \nabla L_n(\eta^*), \nu\rangle
= \langle\nabla L_n(\eta) + \nabla P_{\lambda_n}(\eta), \nu\rangle + \langle\nabla P_{\lambda_n}(\eta), -\nu\rangle + \langle-\nabla L_n(\eta^*), \nu\rangle\\
&\le P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2 + \|n^{-1}\tilde\Theta'(Y - \Theta\eta^*)\|_\infty\|\bar\nu\|_1. \tag{2.55}
\]
By combining (2.52), (2.55) with Lemma 9 (the term \(c_0n^{\beta/2-1/2}\|\bar\nu\|_1^2\) is absorbed using the constraint \(\|\bar\nu\|_1 \le R_n\)), we have
\[
m_0\|\bar\nu\|_2^2 \le P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2 + c_1(n^{\beta/2-1/2}R_n + q_ns_n^{-\delta})\|\bar\nu\|_1, \tag{2.56}
\]
with probability tending to one, for some constant c1 > 0. By combining (2.56) with Lemma 10 and (A9), we have
\[
\bigl[m_0 - 2^{-1}m_1\mu(1 + o(1))\bigr]\|\bar\nu\|_2^2 \le (1 + o(1))\bigl[P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta)\bigr]. \tag{2.57}
\]
By combining (2.57) with Lemma 6 and (A4), we have
\[
0 \le \bigl[m_0 - 2^{-1}m_1\mu(1 + o(1))\bigr]\|\bar\nu\|_2^2 \le c_2\lambda_ns_n^{1/2}\Bigl(\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2 - \sum_{j\in A_n^c}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2\Bigr), \tag{2.58}
\]
with probability tending to one, for some constant c2 > 0. On one hand, (2.58) implies that
\begin{align*}
\|\bar\nu\|_2^2 &\le c_3\lambda_ns_n^{1/2}\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2
\le c_3\lambda_ns_n^{1/2}q_n^{1/2}\Bigl(\sum_{j\in A_n}n^{-1}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2\Bigr)^{1/2}\\
&\le c_3\lambda_ns_n^{1/2}q_n^{1/2}\Bigl(\sum_{j=1}^{p_n}n^{-1}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2\Bigr)^{1/2}
\le c_4\lambda_ns_n^{1/2}q_n^{1/2}\|\bar\nu\|_2,
\end{align*}
with probability tending to one, for some constants c3, c4 > 0, where the last inequality is by Lemma 10. Then, it follows that \(\|\bar\nu\|_2 \le c_4\lambda_ns_n^{1/2}q_n^{1/2}\), which further entails that
\[
\|\nu\|_2 = \|\Lambda^{-1}\bar\nu\|_2 \le \lambda_{\max}(\Lambda^{-1})\|\bar\nu\|_2 = \bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\|\bar\nu\|_2 \le c_5\lambda_ns_n^{a/2+1/2}q_n^{1/2},
\]
with probability tending to one, for some constant c5 > 0, which completes the proof of part 1). On the other hand, (2.58) also implies that
\[
\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2 \ge \sum_{j\in A_n^c}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2. \tag{2.59}
\]
Under Lemma 10 and (2.59), we have
\begin{align*}
\|\bar\nu\|_1 &\le c_5s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2 \le 2c_5s_n^{1/2}\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2\\
&\le 2c_5s_n^{1/2}q_n^{1/2}\Bigl(\sum_{j=1}^{p_n}n^{-1}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2\Bigr)^{1/2} \le c_6s_n^{1/2}q_n^{1/2}\|\bar\nu\|_2 \le c_7\lambda_ns_nq_n,
\end{align*}
with probability tending to one, for some constants c5, c6, c7 > 0, which further entails that
\[
\|\nu\|_1 = \|\Lambda^{-1}\bar\nu\|_1 \le \lambda_{\max}(\Lambda^{-1})\|\bar\nu\|_1 = \bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\|\bar\nu\|_1 \le c_8\lambda_ns_n^{a/2+1}q_n,
\]
with probability tending to one, for some constant c8 > 0, which completes the proof of part 2).
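The estimator analyzed in Theorem 3 minimizes a least-squares criterion plus a groupwise folded concave penalty. As a hedged numerical sketch only, the same structure can be illustrated with a convex group-lasso surrogate in place of the folded concave penalty, fitted by proximal-gradient (groupwise soft-thresholding) steps; all dimensions, the step size, and the tuning value below are illustrative assumptions:

```python
import numpy as np

def group_prox_step(eta, Theta_blocks, y, lam, step):
    """One proximal-gradient step for 0.5 * n^{-1} ||y - sum_j Theta_j eta_j||^2
    plus lam * sum_j ||eta_j||_2 (convex stand-in for the folded concave penalty)."""
    n = len(y)
    resid = y - sum(T @ e for T, e in zip(Theta_blocks, eta))
    new = []
    for T, e in zip(Theta_blocks, eta):
        g = e + step * (T.T @ resid) / n            # gradient step on block j
        norm = np.linalg.norm(g)
        shrink = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
        new.append(shrink * g)                      # group soft-thresholding
    return new

rng = np.random.default_rng(4)
n, s = 200, 4
blocks = [rng.standard_normal((n, s)) for _ in range(6)]
beta = [np.ones(s)] + [np.zeros(s) for _ in range(5)]   # only block 1 is active
y = sum(T @ b for T, b in zip(blocks, beta)) + 0.1 * rng.standard_normal(n)
eta = [np.zeros(s) for _ in range(6)]
for _ in range(300):
    eta = group_prox_step(eta, blocks, y, lam=0.05, step=0.5)
# the active block is recovered; inactive blocks are shrunk to (near) zero
assert np.linalg.norm(eta[0] - beta[0]) < 0.5
assert all(np.linalg.norm(eta[j]) < 0.2 for j in range(1, 6))
```

The groupwise thresholding is what produces whole-block sparsity, the finite-sample analogue of the set An versus its complement in the proof above.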
Proof of Theorem 4. First, we have
\[
\bigl|\|T\|_\infty - \|T^*\|_\infty\bigr| \le \|T - T^*\|_\infty = \Bigl\|n^{-1/2}\sum_{i=1}^n(\hat S_i - S_i)\Bigr\|_\infty = \max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(\hat S_{il} - S_{il})\Bigr| \le c_0g(n), \tag{2.60}
\]
with probability tending to one, for some constant c0 > 0, where \(g(n) = n^{\beta-1/2}\rho_ns_n^{3a/2} + \lambda_nn^{\beta/2}q_ns_n^{a+1} + n^{\beta/2+1/2}q_ns_n^{-\delta}\log s_n + n^{\beta}\rho_nq_ns_n^{3a/2-\delta}\log s_n\), and the last inequality is by Lemma 12. Second, we have
\[
\bigl|\|T_e\|_\infty - \|T_e^*\|_\infty\bigr| \le \|T_e - T_e^*\|_\infty = \Bigl\|n^{-1/2}\sum_{i=1}^ne_i(\hat S_i - S_i)\Bigr\|_\infty = \max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr|. \tag{2.61}
\]
Since {ei : i ≤ n} is a set of independent and identically distributed standard normal random variables independent of the data, the Hoeffding inequality gives, for any l = 1, ..., hnsn and t > 0,
\[
P_e\Bigl(\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr| \ge t\Bigr) \le 2\exp\Bigl(-\frac{t^2}{2n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2}\Bigr),
\]
where \(P_e(\cdot)\) denotes probability with respect to e. Then, the union bound yields
\[
P_e\Bigl(\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr| \ge t\Bigr) \le 2h_ns_n\exp\Bigl(-\frac{t^2}{2\max_{l\le h_ns_n}\bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\bigr]}\Bigr).
\]
Together with Lemma 12, it is easy to see that
\[
P_e\Bigl(\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr| \le c_1f(n)\Bigr) \to 1,
\]
with probability tending to one, for some constant c1 > 0, where
\[
f(n) = \lambda_nn^{3\beta/2}q_ns_n^{a/2+1} + n^{\beta}q_ns_n^{-\delta}\log s_n + \rho_ns_n^{3a/2}n^{3\beta/2-1/2}(\log n)^{1/2} + \lambda_n\rho_nq_ns_n^{2a+1}n^{2\beta-1/2} + n^{3\beta/2-1/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n.
\]
Together with (2.61), we obtain
\[
P_e\bigl(\bigl|\|T_e\|_\infty - \|T_e^*\|_\infty\bigr| \le c_1f(n)\bigr) \to 1. \tag{2.62}
\]
Moreover, we have
\[
g(n)[\log(p_ns_n)]^{1/2} \sim g(n)n^{\beta/2} = n^{3\beta/2-1/2}\rho_ns_n^{3a/2} + \lambda_nn^{\beta}q_ns_n^{a+1} + n^{\beta+1/2}q_ns_n^{-\delta}\log s_n + n^{3\beta/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n = o(1), \tag{2.63}
\]
under (A6), (A8) and (A9). In addition, we also have
\[
f(n)[\log(p_ns_n)]^{1/2} \sim f(n)n^{\beta/2} = \lambda_nn^{2\beta}q_ns_n^{a/2+1} + n^{3\beta/2}q_ns_n^{-\delta}\log s_n + \rho_ns_n^{3a/2}n^{2\beta-1/2}(\log n)^{1/2} + \lambda_n\rho_nq_ns_n^{2a+1}n^{5\beta/2-1/2} + n^{2\beta-1/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n = o(1), \tag{2.64}
\]
under (A6), (A8) and (A9). Furthermore, (A5) implies that
\[
E(S_{il}^2) \ge c_2, \tag{2.65}
\]
for some universal constant c2 > 0, and (A2) implies that
\[
\|S_{il}\|_{\varphi_2} \le c_3, \tag{2.66}
\]
for some universal constant c3 > 0. Hence, by combining (2.60), (2.62), (2.63), (2.64), (2.65), (2.66) with (A6), we have
\[
\lim_{n\to\infty}\sup_{\alpha\in(0,1)}\bigl|P(\|T\|_\infty \le \phi'(\alpha)) - \alpha\bigr| = 0,
\]
by Lemma H.7 in Ning and Liu [2016], which completes the proof.
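The Gaussian multiplier bootstrap behind Te can be sketched numerically: given score vectors Si, the statistic \(\|n^{-1/2}\sum_i e_iS_i\|_\infty\) is recomputed over many draws of standard normal multipliers, and its conditional quantile calibrates the test. A toy illustration only (Gaussian scores, dimensions and replication counts purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def multiplier_bootstrap_quantile(S, alpha, n_boot=300):
    """Conditional (1 - alpha) quantile of ||n^{-1/2} sum_i e_i S_i||_inf."""
    n = S.shape[0]
    e = rng.standard_normal((n_boot, n))
    T_e = np.abs(e @ S / np.sqrt(n)).max(axis=1)  # one sup-norm per bootstrap draw
    return np.quantile(T_e, 1 - alpha)

# Under the null, T = ||n^{-1/2} sum_i S_i||_inf should exceed the
# bootstrap quantile with probability roughly alpha.
n, d, alpha = 200, 20, 0.05
rejections, n_rep = 0, 100
for _ in range(n_rep):
    S = rng.standard_normal((n, d))
    T = np.abs(S.sum(axis=0) / np.sqrt(n)).max()
    rejections += T > multiplier_bootstrap_quantile(S, alpha)
print(f"empirical size {rejections / n_rep:.3f} (nominal {alpha})")
```

The empirical rejection rate should land near the nominal level, which is exactly the uniform-in-alpha validity statement established by Theorem 4.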
Bibliography
James O. Ramsay and C. J. Dalzell. Some tools for functional data analysis. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 53(3):539–572, 1991. ISSN
0035-9246.
Hervé Cardot, Frédéric Ferraty, André Mas, and Pascal Sarda. Testing hypotheses in the functional linear model. Scandinavian Journal of Statistics. Theory and Applications, 30(1):241–255, 2003. ISSN 0303-6898.
Antonio Cuevas, Manuel Febrero, and Ricardo Fraiman. Linear functional regression: the
case of fixed design and functional response. Canadian Journal of Statistics. La Revue
Canadienne de Statistique, 30(2):285–300, 2002. ISSN 0319-5724.
Fang Yao, Hans-Georg Müller, and Jane-Ling Wang. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100(470):577–590, 2005a. ISSN 0162-1459. doi: 10.1198/016214504000001745.
James O. Ramsay and B. W. Silverman. Functional data analysis. Springer Series in Statistics.
Springer, New York, second edition, 2005. ISBN 978-0387-40080-8; 0-387-40080-X.
John A. Rice and B. W. Silverman. Estimating the mean and covariance structure nonparamet-
rically when the data are curves. Journal of the Royal Statistical Society, Series B, 53(1):
233–243, 1991. ISSN 0035-9246.
Fang Yao, Hans-Georg Müller, and Jane-Ling Wang. Functional linear regression analysis for longitudinal data. Annals of Statistics, 33(6):2873–2903, 2005b.
P. Hall, H. G. Müller, and J. L. Wang. Properties of principal component methods for functional and longitudinal data analysis. Annals of Statistics, 34(3):1493–1517, 2006. ISSN 0090-5364. doi: 10.1214/009053606000000272.
Tony Cai and Peter Hall. Prediction in functional linear regression. Annals of Statistics, 34:2159–2179, 2006.
Jin-Ting Zhang and Jianwei Chen. Statistical inferences for functional data. Annals of Statis-
tics, 35:1052–1079, 2007.
Peter Hall and Joe L. Horowitz. Methodology and convergence rates for functional linear
regression. Annals of Statistics, 35:70–91, 2007.
M. Escabias, A. M. Aguilera, and M. J. Valderrama. Principal component estimation of func-
tional logistic regression: discussion of two different approaches. Journal of Nonparametric
Statistics, 16(3-4):365–384, 2004. ISSN 1048-5252.
Hervé Cardot and Pascal Sarda. Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, 92(1):24–41, 2005. ISSN 0047-259X. doi: 10.1016/j.jmva.2003.08.008.
Hans-Georg Müller and Ulrich Stadtmüller. Generalized functional linear models. Annals of Statistics, 33(2):774–805, 2005. ISSN 0090-5364. doi: 10.1214/009053604000001156.
Jianqing Fan and Jin-Ting Zhang. Two-step estimation of functional linear models with ap-
plications to longitudinal data. Journal of the Royal Statistical Society, Series B, 62(2):
303–322, 2000. ISSN 1369-7412.
Jianqing Fan, Qiwei Yao, and Zongwu Cai. Adaptive varying-coefficient linear models. Jour-
nal of the Royal Statistical Society, Series B, 65(1):57–80, 2003. ISSN 1369-7412. doi:
10.1111/1467-9868.00372.
Jeffrey S. Morris, Marina Vannucci, Philip J. Brown, and Raymond J. Carroll. Wavelet-
based nonparametric modeling of hierarchical functions in colon carcinogenesis. Journal
of the American Statistical Association, 98(463):573–597, 2003. ISSN 0162-1459. doi:
10.1198/016214503000000422.
Hans-Georg Müller and Fang Yao. Functional additive models. Journal of the American
Statistical Association, 103(484):1534–1544, 2008. ISSN 0162-1459. doi:
10.1198/016214508000000751.
Fang Yao and Hans-Georg Müller. Functional quadratic regression. Biometrika, 97(1):49–64,
2010. ISSN 0006-3444. doi: 10.1093/biomet/asp069.
Robert J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58(1):267–288, 1996.
Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American Statistical Association, 96:1348–1360, December
2001.
Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Asso-
ciation, 101(476):1418–1429, 2006. ISSN 0162-1459. doi: 10.1198/016214506000000735.
Hyejin Shin. Partial functional linear regression. Journal of Statistical Planning and Inference,
139(10):3405–3418, 2009. doi: 10.1016/j.jspi.2009.03.001.
Ying Lu, Jiang Du, and Zhimeng Sun. Functional partially linear quantile regression model.
Metrika, 77(2):317–332, 2014. doi: 10.1007/s00184-013-0439-7.
Yehua Li, Naisyin Wang, and Raymond J. Carroll. Generalized functional linear models with
semiparametric single-index interactions. Journal of the American Statistical Association,
105(490):621–633, 2010. ISSN 0162-1459. doi: 10.1198/jasa.2010.tm09313.
Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models.
Annals of Statistics, 36:1509–1533, 2008.
Yongdai Kim, Hosik Choi, and Hee-Seok Oh. Smoothly clipped absolute deviation on high
dimensions. Journal of the American Statistical Association, 103(484):1665–1673, 2008.
doi: 10.1198/016214508000001066.
Wenjiang J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational
and Graphical Statistics, 7(3):397–416, 1998.
Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization,
1(3):127–239, 2014.
Hansheng Wang, Runze Li, and Chih-Ling Tsai. Tuning parameter selectors for the smoothly
clipped absolute deviation method. Biometrika, 94(3):553–568, 2007.
Jing Lei. Adaptive global testing for functional linear models. Journal of the American Statis-
tical Association, 109(506):624–634, 2014.
Jianqing Fan, Shaojun Guo, and Ning Hao. Variance estimation using refitted cross-validation
in ultrahigh dimensional regression. Journal of the Royal Statistical Society, Series B, 74:
37–65, 2012.
Roger D. Peng, Francesca Dominici, Roberto Pastor-Barriuso, Scott L. Zeger, and Jonathan M.
Samet. Seasonal analyses of air pollution and mortality in 100 US cities. American Journal
of Epidemiology, 161(6):585–594, 2005.
Roger D. Peng, Francesca Dominici, and Thomas A. Louis. Model choice in time series studies
of air pollution and mortality. Journal of the Royal Statistical Society, Series A, 169(2):
179–203, 2006.
Evangelia Samoli, Massimo Stafoggia, Sophia Rodopoulou, Bart Ostro, Christophe Declercq,
Ester Alessandrini, Julio Díaz, Angeliki Karanasiou, Apostolos G. Kelessis, Alain Le Tertre,
et al. Associations between fine and coarse particles and mortality in Mediterranean cities:
results from the MED-PARTICLES project. Environmental Health Perspectives, 121(8):932,
2013.
Mathilde Pascal, Grégoire Falq, Vérène Wagner, Édouard Chatignoux, Magali Corso, Myriam
Blanchard, Sabine Host, Laurence Pascal, and Sophie Larrieu. Short-term impacts of particulate
matter (PM10, PM10–2.5, PM2.5) on mortality in nine French cities. Atmospheric
Environment, 95:175–184, 2014.
Frank C. Curriero, Karlyn S. Heiner, Jonathan M. Samet, Scott L. Zeger, Lisa Strug, and
Jonathan A. Patz. Temperature and mortality in 11 cities of the Eastern United States. American
Journal of Epidemiology, 155(1):80–87, 2002.
Peter Hall and Mohammad Hosseini-Nasab. On properties of functional principal components
analysis. Journal of the Royal Statistical Society: Series B, 68(1):109–126, 2006.
Peter Hall and Mohammad Hosseini-Nasab. Theory for high-order bounds in functional prin-
cipal components analysis. Mathematical Proceedings of the Cambridge Philosophical So-
ciety, 146(1):225–256, 2009. doi: 10.1017/S0305004108001850.
Hervé Cardot, Frédéric Ferraty, and Pascal Sarda. Functional linear model. Statistics &
Probability Letters, 45(1):11–22, 1999.
Ming Yuan and T. Tony Cai. A reproducing kernel Hilbert space approach to functional linear
regression. Annals of Statistics, 38(6):3412–3444, 2010.
Tony Cai and Ming Yuan. Minimax and adaptive prediction for functional linear regression.
Journal of the American Statistical Association, 107:1201–1216, 2012.
Nadine Hilgert, Andre Mas, and Nicolas Verzelen. Minimax adaptive tests for the functional
linear model. Annals of Statistics, 41(2):838–869, 2013.
Julian J. Faraway. Regression analysis for a functional response. Technometrics, 39(3):254–
261, 1997.
Gareth M. James. Generalized linear models with functional predictors. Journal of the Royal
Statistical Society, Series B, 64(3):411–432, 2002. doi: 10.1111/1467-9868.00342.
Zuofeng Shang and Guang Cheng. Nonparametric inference in generalized functional linear
models. Annals of Statistics, 43(4):1742–1773, 2015.
Heng Lian. Functional partial linear model. Journal of Nonparametric Statistics, 23(1):115–
128, 2011.
Dehan Kong, Kaijie Xue, Fang Yao, and Hao H. Zhang. Partially functional linear regression
in high dimensions. Biometrika, 103(1):147–159, 2016.
H. Zhu, F. Yao, and H. H. Zhang. Structured functional additive regression in reproducing
kernel Hilbert spaces. Journal of the Royal Statistical Society, Series B, 76:581–603, 2014.
Yingying Fan, Gareth M. James, and Peter Radchenko. Functional additive regression. Annals
of Statistics, 43(5):2296–2325, 2015.
Heng Lian. Shrinkage estimation and selection for multiple functional regression. Statistica
Sinica, 23(1):51–74, 2013.
Philip T Reiss and R Todd Ogden. Functional principal component regression and functional
partial least squares. Journal of the American Statistical Association, 102(479):984–996,
2007.
Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: statistical
and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616,
2015.
Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals
of Statistics, 38(2):894–942, 2010.
Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection
with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.
Sara A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of
Statistics, 36(2):614–645, 2008.
Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-
dimensional data. Annals of Statistics, 37(1):246–270, 2009.
Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso
and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
Tong Zhang. Some sharp performance bounds for least squares regression with ℓ1 regularization.
Annals of Statistics, 37(5A):2109–2144, 2009.
Jianqing Fan and Jinchi Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE
Transactions on Information Theory, 57(8):5467–5484, 2011.
Lan Wang, Yongdai Kim, and Runze Li. Calibrating nonconvex penalized regression in ultra-
high dimension. Annals of Statistics, 41(5):2505–2536, 2013.
Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of
convergence for sparse nonconvex learning problems. Annals of Statistics, 42(6):2164–2201,
2014.
Jianqing Fan, Lingzhou Xue, and Hui Zou. Strong oracle optimality of folded concave penal-
ized estimation. Annals of Statistics, 42(3):819–849, 2014.
Larry Wasserman and Kathryn Roeder. High-dimensional variable selection. Annals of Statis-
tics, 37(5A):2178–2201, 2009.
Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72(4):417–473, 2010. doi: 10.1111/j.1467-9868.2010.00740.x.
Rajen D. Shah and Richard J. Samworth. Variable selection with error control: another look at
stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy), 75(1):55–80, 2013. doi: 10.1111/j.1467-9868.2011.01034.x.
Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parame-
ters in high dimensional linear models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 76(1):217–242, 2014. doi: 10.1111/rssb.12026.
Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically
optimal confidence regions and tests for high-dimensional models. Annals of Statistics, 42
(3):1166–1202, 2014.
Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, and Robert Tibshirani. A significance
test for the lasso. Annals of Statistics, 42(2):413–468, 2014.
Jonathan Taylor, Richard Lockhart, Ryan J. Tibshirani, and Robert Tibshirani. Post-selection
adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889, 2014.
Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions for sparse
high dimensional models. Annals of Statistics, 2016.
Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models.
Journal of the Royal Statistical Society, Series B, 71(5):1009–1030, 2009.
Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28
(5):1356–1378, 2000.
D. R. Cox and D. V. Hinkley. Theoretical Statistics. CRC Press, 1979.
Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is
much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of
suprema of empirical processes. Annals of Statistics, 42(4):1564–1597, 2014.
Ethan X Fang, Yang Ning, and Han Liu. Testing and confidence intervals for high dimensional
proportional hazards model. arXiv preprint arXiv:1412.5158, 2014.