FUNCTIONAL LINEAR REGRESSION IN HIGH DIMENSIONS
by
Kaijie Xue
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Statistical Science
University of Toronto
© Copyright 2017 by Kaijie Xue
Abstract
Functional Linear Regression in High Dimensions
Kaijie Xue
Doctor of Philosophy
Graduate Department of Statistical Science
University of Toronto
2017
Functional linear regression has occupied a central position in the area of functional data
analysis, and attracted substantial research attention in the past decade. With increasingly
complex data of this type collected in modern experiments, we conduct further investigations
in response to the great need of statistical tools that are capable of handling functional objects
in high-dimensional spaces.
In the first project, we deal with the situation in which functional and non-functional data are
encountered simultaneously: observations are sampled from random processes together with a
potentially large number of scalar covariates, and it is difficult to apply existing methods for model
selection and estimation. We propose a new class of partially functional linear models to char-
acterize the regression between a scalar response and those covariates, including both func-
tional and scalar types. The new approach provides a unified and flexible framework to simul-
taneously take into account multiple functional and ultra-high dimensional scalar predictors,
identify important features and improve interpretability of the estimators. The underlying pro-
cesses of the functional predictors are considered to be infinite-dimensional, and one of our
contributions is to characterize the impact of regularization on the resulting estimators. We
establish consistency and oracle properties under mild conditions, illustrate the performance of
the proposed method with simulation studies, and apply it to air pollution data.
In the second project, we further explore functional linear regression by focusing on the large-scale
scenario in which the scalar response is related to a potentially ultra-large number of functional
predictors, leading to a more challenging model framework. The emphasis of our investigation is
to establish valid testing procedures for a general hypothesis on an arbitrary subset of regression
coefficient functions. Specifically, we exploit the techniques developed for post-regularization
inference, and propose a score test for the large-scale functional linear regression based on the
so-called de-correlated score function that separates the primary and nuisance parameters in
functional spaces. The proposed score test is shown to converge uniformly to the prescribed
significance level, and its finite-sample performance is illustrated via simulation studies.
Acknowledgements
First of all, I would like to express my sincere gratitude to my supervisor, Professor Fang
Yao, for his guidance and patience throughout my PhD program. Without his precious help and
support, it would not have been possible for me to finish this thesis.
Secondly, I would also like to show my gratitude to my thesis committee members, Professor
Keith Knight and Professor Zhou Zhou, for their insightful and constructive comments.
Thirdly, I am also very grateful to all the faculty members, especially Professor Mike
Evans, Professor Nancy Reid, Professor Keith Knight, Professor Zhou Zhou, Professor Andrey
Feuerverger and Professor Lawrence J. Brunner for teaching me insightful statistical courses.
In addition, I would also like to thank all the staff members, especially Andrea Carter,
Christine Bulguryemez and Dermot Whelan for their help and support during my research.
Finally, I would like to thank my parents for their help and support.
Contents

1 Partially Functional Linear Regression in High Dimensions
1.1 Introduction
1.2 Regularized Partially Functional Linear Regression
1.2.1 Classical functional linear model via principal components
1.2.2 Partially functional linear regression with regularization
1.2.3 Algorithms and parameter tuning
1.3 Asymptotic Properties
1.4 Simulation Studies
1.5 Application
1.6 Proof of Lemmas and Main Theorems
1.6.1 Regularity Conditions
1.6.2 Auxiliary Lemmas
1.6.3 Proof of Lemmas
1.6.4 Proofs of Main Theorems

2 Testing General Hypothesis in Large-scale FLR
2.1 Introduction
2.2 Proposed Methodology
2.2.1 Regularized estimation by group penalized least squares
2.2.2 General hypothesis and score decorrelation
2.2.3 Bootstrapped score test for general hypothesis in large-scale FLR
2.3 Asymptotic Properties
2.4 Simulation Studies
2.5 Proof of Lemmas and Main Theorems
2.5.1 Algorithm and parameter tuning
2.5.2 Technical Conditions on Regularizers
2.5.3 Notations
2.5.4 Auxiliary Lemmas
2.5.5 Proof of Lemmas
2.5.6 Proofs of Main Theorems

Bibliography
List of Tables
1.1 Simulation results with sample size n = 200 based on 200 Monte Carlo repli-
cates for Designs I and II. Shown are the Monte Carlo averages (standard errors
in parentheses) for the number of false zero functional predictors (FZf ), the
number of the false nonzero functional predictors (FNf ), the functional mean
squared error (MSEf ), the number of false zero scalar covariates (FZs), the
number of false nonzero scalar covariates (FNs), the scalar mean squared error
(MSEs), and the prediction error (PE). We first use ABIC to choose the tuning
parameter λn and a common truncation sn, then tune snj jointly with AIC by
refitting the selected model using ordinary least squares. In Design II, Step 1
results are based on the original sample in each Monte Carlo run, while Step 2
contains the improved results by fitting the penalized procedure to the selected
model in Step 1 with an additional sample of n = 200.
1.2 Simulation results with sample size n = 200 based on 200 Monte Carlo repli-
cates for Designs III and IV. Shown are the Monte Carlo averages (standard
errors in parentheses) for the number of false zero functional predictors (FZf ),
the number of the false nonzero functional predictors (FNf ), the functional
mean squared error (MSEf ), the number of false zero scalar covariates (FZs),
the number of false nonzero scalar covariates (FNs), the scalar mean squared
error (MSEs), and the prediction error (PE). We first use ABIC to choose the
tuning parameter λn and a common truncation sn, then tune snj jointly with
AIC by refitting the selected model using ordinary least squares. In Design IV,
Step 1 results are based on the original sample in each Monte Carlo run, while
Step 2 contains the improved results by fitting the penalized procedure to the
selected model in Step 1 with an additional sample of n = 200.
1.3 Simulation results using the same settings as Designs I & II in Section 4 of the
paper and Designs III & IV as above, based on 200 Monte Carlo replicates,
except the sample size n = 400.
1.4 Simulation results using the same settings as Designs I & II in Section 4 of the
paper and Designs III & IV as above, based on 200 Monte Carlo replicates,
except the variance of the regression error σ² = 2.
2.1 Results for several different combinations of null hypotheses and settings
of βj(t), measured over 300 simulations.
List of Figures
1.1 The estimated regression parameter functions and 95% confidence bands based
on 1000 bootstrap samples for PM2.5 and temperature, with sn1 = 2 and
sn2 = 2, respectively. The solid lines denote the estimated regression parameter
functions and the dashed lines denote the 95% bootstrap confidence bands.
The left and right panels are for PM2.5 and temperature, respectively.
Chapter 1
Partially Functional Linear Regression in
High Dimensions
1.1 Introduction
Functional linear regression is widely used to model the prediction of a functional predictor
through a linear operator, often realized by an integral form of a regression parameter func-
tion; see Ramsay and Dalzell [1991], Cardot et al. [2003], Cuevas et al. [2002], Yao et al.
[2005a], and Ramsay and Silverman [2005]. To capture the regression relation of the response
with a functional predictor, regularization is necessary. One common approach is functional
principal component analysis, which has been studied by Rice and Silverman [1991], and re-
cently by Yao et al. [2005b], Hall et al. [2006], Cai and Hall [2006], Zhang and Chen [2007],
and Hall and Horowitz [2007], among others. Functional linear models have been extended
to generalized functional linear models [Escabias et al., 2004, Cardot and Sarda, 2005, Muller
and Stadtmuller, 2005], varying-coefficient models [Fan and Zhang, 2000, Fan et al., 2003],
wavelet-based functional models [Morris et al., 2003], functional additive models [Muller and
Yao, 2008] and quadratic models [Yao and Muller, 2010].
Classical functional linear regression is designed to describe the relation between a real-valued
response and one functional explanatory variable. However, in many real problems, it
is common to also collect a large number of non-functional predictors. How to incorporate
scalar predictors in functional linear regression and perform model selection/regularization
is an important issue. For a standard linear regression with scalar covariates only, various
penalization procedures have been proposed and studied, including the lasso [Tibshirani, 1996],
the smoothly clipped absolute deviation [Fan and Li, 2001] and the adaptive lasso [Zou, 2006].
In this work, we develop a class of partially functional linear regression models, to han-
dle multiple functional and non-functional predictors and automatically identify important risk
factors by suitable regularization. Shin [2009] and Lu et al. [2014] have considered simi-
lar partially functional linear and quantile models, respectively, but did not deal with vari-
able selection or with multiple functional predictors and high-dimensional scalar covariates.
We propose a unified framework that regularizes each functional predictor as a whole, com-
bined with a penalty on high-dimensional scalar covariates. Due to the differences between
the functional and scalar predictors, we use two regularizing operations. Shrinkage penalties
are imposed on the effects of both functional predictors and scalar covariates to achieve model
selection and enhance interpretability, while a data-adaptive truncation that plays the role of
a tuning parameter is applied to functional predictors. We treat the functional predictors as
infinite-dimensional processes, which distinguishes our work from work that fixes the number
of principal components [Li et al., 2010]. A main contribution is to quantify the theoretical im-
pact of functional principal component estimation with diverging truncation, especially when
the number of scalar covariates is permitted to diverge at an exponential order of the sample
size.
1.2 Regularized Partially Functional Linear Regression
1.2.1 Classical functional linear model via principal components
Let X(·) be a square-integrable random function defined on a closed interval T of the real line,
with continuous mean and covariance functions denoted by E{X(t)} = µ(t) and cov{X(s), X(t)} =
K(s, t), respectively. The classical functional linear model is

Y = µY + ∫_T {X(t) − µ(t)} β(t) dt + ε,    (1.1)

where the regression parameter function β(·) is assumed to be square-integrable, and ε is a
random error independent of X(t). Mercer's theorem implies that there exists a complete and
orthonormal basis {φk} in L²(T) and a non-increasing sequence of non-negative eigenvalues
{wk} such that K(s, t) = ∑_{k=1}^∞ wk φk(s) φk(t) with ∑_{k=1}^∞ wk < ∞. We further assume that
w1 > w2 > · · · ≥ 0. Let (yi, xi), i = 1, . . . , n, be independent and identically distributed
random samples from (Y, X). The Karhunen–Loève expansion xi(t) = µ(t) + ∑_{k=1}^∞ ξik φk(t)
forms the foundation of functional principal component analysis, where the coefficients
ξik = ∫_T {xi(t) − µ(t)} φk(t) dt are uncorrelated random variables with mean zero and variances
E(ξik²) = wk, also called the functional principal component scores. Expanded on the
orthonormal eigenbasis {φk}, the regression function becomes β(t) = ∑_{k=1}^∞ bk φk(t), and the
functional linear model (1.1) can be written as yi = µY + ∑_{k=1}^∞ bk ξik + εi. The basis with
respect to which the regression parameter b is expanded is determined by the covariance function
K. This is not unnatural, since {φk} is the unique canonical basis leading to a generalized
Fourier series that gives the most rapidly convergent representation of X in the L² sense.
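To make the estimation steps above concrete, here is a minimal numerical sketch (not the thesis's implementation; the dense common grid, Riemann-sum integrals, and function names are our assumptions) that computes eigenvalues, eigenfunctions, and FPC scores from observed curves, then fits the truncated model yi ≈ µY + ∑_{k≤s} bk ξik by least squares:

```python
import numpy as np

def fpca_scores(X, n_components):
    """Eigenvalues, eigenfunctions, and FPC scores from dense curves.

    X is an (n, m) array of curves on a common grid over [0, 1]; integrals
    are approximated by Riemann sums with spacing dt = 1/m.
    """
    n, m = X.shape
    dt = 1.0 / m
    Xc = X - X.mean(axis=0)                    # center: x_i(t) - mu(t)
    K_hat = Xc.T @ Xc / n                      # sample covariance K(s, t)
    w, phi = np.linalg.eigh(K_hat * dt)        # discretized eigenproblem
    order = np.argsort(w)[::-1][:n_components]
    w, phi = w[order], phi[:, order] / np.sqrt(dt)  # rescale to unit L2 norm
    xi = Xc @ phi * dt                         # xi_ik = <x_i - mu, phi_k>
    return w, phi, xi

def fit_truncated_flm(y, X, s):
    """Least-squares fit of y on the first s estimated FPC scores."""
    w, phi, xi = fpca_scores(X, s)
    design = np.column_stack([np.ones(len(y)), xi])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta_hat = phi @ coef[1:]                  # beta(t) = sum_k b_k phi_k(t)
    return coef[0], coef[1:], beta_hat
```

Note the sign indeterminacy of eigenfunctions: each score coefficient is identified only up to the sign of its φ̂k, which cancels in the reconstructed β̂(t).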
1.2.2 Partially functional linear regression with regularization
We now consider functional linear regression with multiple functional and scalar predictors.
Suppose the data are {Y, X(·), Z}, where Y is a scalar continuous response, X(·) = {Xj(·) :
j = 1, . . . , d} are d functional predictors, and Z = (Z1, . . . , Zpn)T is a pn-dimensional vector
of scalar covariates. This is motivated by commonly encountered situations where both func-
tional and non-functional predictors may have significant impacts on the response. As with
many real examples, we assume the number of functional predictors d to be fixed, while the
number of scalar covariates pn may grow with the sample size. Specifically, we allow pn to be
ultra-high dimensional, such that log pn = O(n^α) for some α > 0. Without loss of generality,
we assume that the response Y, the functional predictors {Xj : j = 1, . . . , d} and the scalar
covariates {Zl : l = 1, . . . , pn} have been centered to have mean zero. We then model the
linear relationship between Y and (X, Z) by

Y = ∑_{j=1}^d ∫_T Xj(t) βj(t) dt + ZTγ + ε,    (1.2)

where {βj(·) : j = 1, . . . , d} are square-integrable regression parameter functions, γ =
(γ1, . . . , γpn)T contains the regression coefficients of the non-functional covariates, and ε is
a random error independent of {Xj(·) : j = 1, . . . , d} and Z, with E(ε) = 0 and var(ε) = σ².
For convenience, assume that the first qn scalar covariates are significant, while the rest are not.
In other words, the true value of the regression coefficient vector γ0 equals (γ(1)T0, γ(2)T0)T, where γ(1)0
is a qn × 1 vector corresponding to significant effects and γ(2)0 is a (pn − qn) × 1 zero vector.
We also assume that only the first g functional predictors are significant; equivalently, the true
regression functions satisfy βj0(t) ≡ 0 for j = g + 1, . . . , d. Each functional predictor Xj(·)
is an infinite-dimensional process and requires regularization. Therefore the proposed model
has a partially functional structure that combines the multiple functional and high-dimensional
scalar components into one linear framework.
Let {(yi, xi, zi) : i = 1, . . . , n} denote independent and identically distributed realizations
from the population (Y, X, Z). Let xij denote the jth component of xi for j = 1, . . . , d, and
let zil be the lth component of zi for l = 1, . . . , pn. We further write YM = (y1, . . . , yn)T
and ZM = (z1, . . . , zn)T. To estimate the functions {βj(·) : j = 1, . . . , d} and the regression
coefficients {γl : l = 1, . . . , pn}, we consider the least squares loss, which couples
βj(t) = ∑_k bjk φjk(t) with xij(t) = ∑_k ξijk φjk(t) for each j = 1, . . . , d, given the complete
orthonormal basis series {φjk}k=1,2,...,

L(b, γ | Dn) = ∑_{i=1}^n {yi − ∑_{j=1}^d ∫_T xij(t) βj(t) dt − ziTγ}²    (1.3)
             = ∑_{i=1}^n (yi − ∑_{j=1}^d ∑_{k=1}^∞ bjk ξijk − ziTγ)²,

where Dn = {(yi, xi, zi) : i = 1, . . . , n}, and b = (bT1, . . . , bTd)T with bj = (bj1, bj2, . . .)T
for each j.
for each j. It is evident that the loss function (1.3) should not be directly minimized due
to the infinite expansions of the functional predictors and high-dimensional scalar covariates,
requiring suitable regularization for both X and Z.
One primary goal for (1.2) is to extract useful information from Z and X , whereas the
classical functional linear model focuses only on a single functional predictor. It is thus essen-
tial to select and estimate the nonzero coefficients in γ and nonzero functions in b1, . . . , bd to
enhance model prediction and interpretability. To achieve simultaneous variable selection and
estimation, we introduce a shrinkage penalty function Jλ(·) associated with a tuning parameter
λ. Many penalty choices are available for variable selection. We use the smoothly clipped
absolute deviation penalty of Fan and Li [2001], whose derivative is J′λ(|γ|) = λ[I(|γ| ≤
λ) + I(|γ| > λ)(aλ − |γ|)+/{(a − 1)λ}], with a = 3.7 as suggested by Fan and Li [2001] for
implementation.
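In code, this derivative is a one-liner (the function name and the vectorized form are ours, not from the thesis):

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """SCAD penalty derivative J'_lam(theta) for theta >= 0.

    Equal to lam on [0, lam], decaying as (a*lam - theta)_+ / (a - 1)
    on (lam, a*lam], and zero beyond a*lam, so large coefficients are
    eventually left unpenalized.
    """
    theta = np.asarray(theta, dtype=float)
    taper = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, taper)
```

The taper is continuous at θ = λ, which is what makes the penalty's solution path continuous, unlike hard thresholding.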
Due to the infinite dimensionality of the functional predictors, smoothing and regulariza-
tion are necessary in estimation. It is sensible to control the complexity of βj(t) as a whole
function, rather than treating its basis terms as separate predictors. We adopt the simple yet ef-
fective truncation approach in the spirit of controlling smoothness as in classical nonparametric
regression. Denote the truncated form by Xj^{sj}(t) = µj(t) + ∑_{k=1}^{sj} ξjk φjk(t) for j = 1, . . . , d,
where sj is the truncation parameter. Correspondingly, for βj(t) = ∑_{k=1}^∞ bjk φjk(t), write
bj = (b(1)Tj, b(2)Tj)T, where b(1)j = (bj1, . . . , bj,sj)T and b(2)j = (bj,sj+1, . . .)T. Unlike with non-functional
effects, there is no underlying separation between b(1)j and b(2)j. For the sake of
adaptivity, we allow sj to vary with the sample size, sj ≡ snj, so that it behaves like a
smoothing parameter balancing the trade-off between bias and variance. The coefficients in
b(2)j associated with higher-order basis functions are nonzero but decay rapidly. It is of interest
to study the impact of snj on the convergence rates of the resulting estimators.
In practice we do not observe the entire trajectories xij, and only have intermittent noisy
measurements Wijl = xij(tijl) + εijl, where the εijl are independent and identically
distributed measurement errors independent of xij, satisfying E(εijl) = 0 and var(εijl) = σ²xj for
i = 1, . . . , n and l = 1, . . . , mij. When the repeated observations are sufficiently dense for
each subject, a common practice is to run a smoother through {(tijl, Wijl) : l = 1, . . . , mij},
and then the estimates x̂ij, i = 1, . . . , n, j = 1, . . . , d, are used to construct the covariance,
eigenvalues/eigenbasis, and functional principal component scores; details are given in Section 1.6.
A theoretical justification of the asymptotic equivalence between the estimators obtained from
x̂ij and those from the true xij is also given in Section 1.6. The unobservable functional principal
component scores {ξijk : k = 1, . . . , sn; j = 1, . . . , d; i = 1, . . . , n} are estimated
by functional principal component analysis based on the observed data {(tijl, Wijl) : l =
1, . . . , mij; j = 1, . . . , d; i = 1, . . . , n}. Therefore we minimize
min_{b(1), γ} ∑_{i=1}^n (yi − ∑_{j=1}^d ∑_{k=1}^{snj} ξ̂ijk bjk − ziTγ)² + 2n ∑_{j=1}^d Jλjn(‖b(1)j‖) + 2n ∑_{l=1}^{pn} Jλn(|γl|),    (1.4)
given suitable choices of snj , λn and λjn, where ‖b(1)j ‖ is the Euclidean norm invoking a group
penalty that shrinks the regression functions of unimportant functional predictors to zero. To
regularize all predictors on a comparable scale, one often standardizes the predictors before
imposing a penalty associated with a common tuning parameter [Fan and Li, 2001]. Thus we
standardize (z1l, . . . , znl)T to have unit variance. The variability of the jth functional predictor
can be approximated by ∑_{k=1}^{snj} ŵjk, where ŵjk is the kth estimated eigenvalue of the jth predictor.
Since standardization is equivalent to adding weights to the penalty function, we suggest
using λjn = λn(∑_{k=1}^{snj} ŵjk)^{1/2}, which simplifies both the computation and theoretical analysis.
The estimated regression parameter functions are β̂j(t) = ∑_{k=1}^{snj} b̂jk φ̂jk(t).
1.2.3 Algorithms and parameter tuning
The optimization of (1.4) can be seen as a group smoothly clipped absolute deviation problem
with different weights on penalties, and each individual γl can be treated as a group of size one.
We propose two algorithms to solve the minimization problem (1.4), adapting to the
dimension pn. Generally, when pn is moderately large, say pn < n, we modify the local linear
approximation algorithm [Zou and Li, 2008], which inherits the computational efficiency and
sparsity of lasso-type solutions. For ultra-high pn, especially pn ≫ n, the local linear approximation
algorithm may not be applicable, and then we modify the concave convex procedure
used in Kim et al. [2008]. The following gives the details. Recall that YM = (y1, . . . , yn)T and
ZM = (z1, . . . , zn)T, where zi = (zi1, . . . , zipn)T. In addition, Mj is an n × sn matrix with (i, k)th
element ξ̂ijk, M = (M1, . . . , Md), and N = (M, ZM) = (N1, . . . , Nn)T is an n × (dsn + pn)
matrix. Further, η = (b(1)T, γT)T. The solution to (1.4) is equivalent to

argmin_η (2n)^{−1}‖YM − Nη‖² + ∑_{r=1}^R Jλr(‖ηr‖),

where R = d + pn. The tuning parameter λr = λrn with group size Kr = sn if r = 1, . . . , d,
and λr = λn with group size Kr = 1 if r = d + 1, . . . , d + pn.
When pn is moderately large, say pn < n, one can modify the local linear approximation
algorithm of Zou and Li [2008], which inherits the computational efficiency and sparsity of
lasso-type solutions. Denote the initial estimate from the ordinary least squares solution by
η(0), and solve η(1) = argmin_η (2n)^{−1}‖YM − Nη‖² + ∑_{r=1}^R J′λr(‖η(0)r‖)‖ηr‖. Since
some of the J′λr(‖η(0)r‖) are zero, we use an algorithm similar to that proposed by Zou and Li [2008].
Denote V = {r : J′λr(‖η(0)r‖) = 0}, W = {r : J′λr(‖η(0)r‖) > 0}, N = (NV, NW) and
η(1) = (η(1)TV, η(1)TW)T. Our algorithm is as follows:
1. Reparameterize the response vector by Y*M = Nη(0), and reparameterize the observed
data matrix by N*r = Nr Kr^{1/2}/J′λr(‖η(0)r‖) for r ∈ W, and N*r = Nr for r ∈ V.

2. Let PV denote the projection matrix onto the space spanned by {N*r : r ∈ V}, where PV = NV(NTV NV)^{−1}NTV.
Then calculate Y**M = Y*M − PV Y*M and N**W = N*W − PV N*W.

3. Find η*W = argmin_η (2n)^{−1}‖Y**M − N**W η‖² + ∑_{r∈W} Kr^{1/2}‖ηr‖.

4. Compute η*V = (N*TV N*V)^{−1} N*TV (Y*M − N*W η*W).

5. Let η(0) = η(1), where η(1)V = η*V and η(1)r = η*r Kr^{1/2}/J′λr(‖η(0)r‖) for r ∈ W.
We repeat steps 1–5 until convergence; the final η(0) is the regularized estimator. Step 3 essen-
tially solves a group lasso, so we adopt the shooting algorithm of Fu [1998] and Yuan and Lin
[2006].
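A toy version of this scheme for the special case where every group has size one, so that steps 1–5 collapse to an iteratively reweighted lasso solved by coordinate descent (the scalar analogue of the shooting algorithm; the function names, iteration counts, and ridge-stabilized initializer are our choices, not the thesis's):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lla_scad(Y, N, lam, a=3.7, n_outer=10, n_cd=50):
    """Local linear approximation for SCAD, groups of size one only.

    Each outer pass freezes the tangent weights J'_lam(|eta_r^(0)|) and
    solves the resulting weighted lasso by cyclic coordinate descent.
    """
    n, p = N.shape
    # ridge-stabilized least squares stands in for the OLS initializer
    eta = np.linalg.solve(N.T @ N + 1e-6 * np.eye(p), N.T @ Y)
    col_sq = (N ** 2).sum(axis=0) / n
    for _ in range(n_outer):
        absx = np.abs(eta)
        # SCAD derivative as the per-coordinate lasso weight
        w = np.where(absx <= lam, lam,
                     np.maximum(a * lam - absx, 0.0) / (a - 1.0))
        for _ in range(n_cd):
            for r in range(p):
                partial = Y - N @ eta + N[:, r] * eta[r]  # leave-one-out residual
                z = N[:, r] @ partial / n
                eta[r] = soft_threshold(z, w[r]) / col_sq[r]
    return eta
```

Because the SCAD derivative vanishes for large coefficients, strong effects end up essentially unpenalized after the first reweighting pass, which is the source of the oracle behavior discussed later.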
For pn ≫ n, it is likely that N*TV N*V in step 4 is singular, so the local linear approximation
algorithm is inapplicable. We modify the concave convex procedure used in Kim
et al. [2008]. Let J̃λr(‖ηr‖) = Jλr(‖ηr‖) − λr‖ηr‖ for each r. The convex part of the objective
function is Cvex(η) = (2n)^{−1}‖YM − Nη‖² + ∑_{r=1}^R λr‖ηr‖, and the concave part is
Ccav(η) = ∑_{r=1}^R J̃λr(‖ηr‖). Begin with the initial estimator η(0) = 0 and iteratively update the
solution until convergence,

η(1) = argmin_η {Cvex(η) + ∇Ccav(η(0))Tη} = argmin_η {M(η | η(0)) + ∑_{r=1}^R λr‖ηr‖},

where M(η | η(0)) = (2n)^{−1}‖YM − Nη‖² + ∇Ccav(η(0))Tη is quadratic in η. The proximal
gradient method of Parikh and Boyd [2013] is adopted to solve the above minimization
problem.
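A compact sketch of one such concave-convex update, with the inner problem solved by proximal gradient and group soft-thresholding (the group layout, fixed step size, and iteration count are our simplifications, not the thesis's implementation):

```python
import numpy as np

def group_soft_threshold(v, t):
    """Prox of t * ||v||_2: shrink the whole group toward zero."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def cccp_step(Y, N, groups, lams, eta0, a=3.7, n_prox=500):
    """One concave-convex (CCCP) update for the SCAD group problem.

    Minimizes M(eta | eta0) + sum_r lam_r ||eta_r|| by proximal gradient,
    where M collects the quadratic loss plus the linearized concave part.
    `groups` is a list of index arrays; `lams` the per-group lam_r.
    """
    n, p = N.shape
    # gradient of the linearized concave part sum_r Jtilde_r at eta0
    g = np.zeros(p)
    for idx, lam_r in zip(groups, lams):
        nrm = np.linalg.norm(eta0[idx])
        if nrm > 0:
            # Jtilde' = J' - lam_r, applied along the group direction
            jp = lam_r if nrm <= lam_r else max(a * lam_r - nrm, 0.0) / (a - 1.0)
            g[idx] = (jp - lam_r) * eta0[idx] / nrm
    step = 1.0 / np.linalg.eigvalsh(N.T @ N / n).max()  # 1 / Lipschitz constant
    eta = eta0.copy()
    for _ in range(n_prox):
        grad = N.T @ (N @ eta - Y) / n + g
        z = eta - step * grad
        for idx, lam_r in zip(groups, lams):
            eta[idx] = group_soft_threshold(z[idx], step * lam_r)
    return eta
```

Repeating `cccp_step` from the previous solution reproduces the outer loop described above; only matrix-vector products and group-wise shrinkage are needed, so nothing requires inverting N*TV N*V.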
Two sets of tuning parameters play crucial roles in the penalized procedure (1.4). The
parameter λn in the smoothly clipped absolute deviation penalty directly controls the sparsity of both
the functional and non-functional predictors. Wang et al. [2007] showed that minimizing the
BIC can identify the true model consistently, while generalized cross-validation might lead to
over-fitting. The truncation parameters snj control the dimensions of the functional spaces that
approximate the true functional parameters. Previous work mostly chose snj based on the functional
principal component representation, such as leave-one-curve-out cross-validation [Rice
and Silverman, 1991] and the pseudo-AIC [Yao et al., 2005a]. However, a sensible tuning
criterion for snj in a regression setting should take into account its impact on the response. The
process Xj is infinite-dimensional, and its coefficient function βj(t) = ∑_{k=1}^∞ bjk φjk(t) does
not have a finite cut-off. This is similar to the situation in which the true model does not lie in the
space formed by finite-dimensional candidate models. Therefore we propose a hybrid tuning
procedure, which in principle combines BIC for tuning λn and AIC for choosing snj, due to the
infinite-dimensional parameter spaces of βj, j = 1, . . . , d. In practice it is computationally
prohibitive to choose {snj : j = 1, . . . , d} simultaneously in the penalized procedure (1.4).
Table 1.1 suggests that, when using a common truncation, the selection of both functional
and scalar covariates is accurate and stable for a wide range of sn. Thus we propose to use a
common truncation parameter sn when solving (1.4), then refit the selected model with the
significant functional and scalar predictors using ordinary least squares, while the different truncation
parameters snj are tuned simultaneously by AIC for the retained functional predictors.
Specifically, for a fixed pair (sn, λn), the ABIC criterion is defined as

ABIC(sn, λn) = log{RSS(sn, λn)} + 2 ĝ(sn, λn) sn/n + n^{−1} df(sn, λn) log(n),

where

RSS(sn, λn) = ∑_{i=1}^n {yi − ∑_{j=1}^d ∑_{k=1}^{sn} ξ̂ijk b̂jk(sn, λn) − ziT γ̂(sn, λn)}²,

and ĝ(sn, λn) is the number of non-zero estimates of the regression functions, ĝ(sn, λn) =
∑_{j=1}^d I(β̂j; sn, λn). The degrees of freedom df(sn, λn) equal I(γ̂; sn, λn) + ∑_{j=1}^d ∑_{k=1}^{sn} ŵjk I(β̂j; sn, λn),
with I(γ̂; sn, λn) denoting the number of non-zero elements in γ̂. This procedure requires estimation
using the whole data only once and is computationally fast.
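Once a penalized fit is summarized, the criterion itself is a few lines of arithmetic; a minimal numeric sketch (the argument layout is our assumption):

```python
import numpy as np

def abic(y, fitted, sn, n_selected_gammas, eigvals_selected):
    """ABIC(sn, lambda_n) for one fitted penalized model.

    `eigvals_selected` holds, for each functional predictor kept in the
    model, its first sn estimated eigenvalues; the degrees of freedom
    follow the definition in the text.
    """
    n = len(y)
    rss = float(np.sum((np.asarray(y) - np.asarray(fitted)) ** 2))
    g_hat = len(eigvals_selected)                 # number of nonzero beta_j
    df = n_selected_gammas + sum(float(np.sum(w)) for w in eigvals_selected)
    return np.log(rss) + 2.0 * g_hat * sn / n + df * np.log(n) / n
```

The pair (sn, λn) minimizing this value over a two-dimensional grid is then carried into the refit step.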
For the refit step, denote the index set of the selected functional predictors by D ⊂ {1, . . . , d},
and the index set of the selected scalar covariates by S ⊂ {1, . . . , pn}. We minimize

AIC({snj : j ∈ D}) = log{RSS({snj : j ∈ D})} + 2 n^{−1} ∑_{j∈D} snj,

with respect to combinations of {snj : j ∈ D}, where

RSS({snj : j ∈ D}) = ∑_{i=1}^n {yi − ∑_{j∈D} ∑_{k=1}^{snj} ξ̂ijk b̂*jk(snj) − ∑_{l∈S} zil γ̂*l(snj)}²,

and b̂*jk(snj) and γ̂*l(snj) are the refitted values using ordinary least squares.
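The refit-and-tune loop can be sketched as an exhaustive search over truncation combinations (the helper name and argument shapes are our assumptions; for a large retained set D one would restrict the grid):

```python
import itertools
import numpy as np

def tune_snj_by_aic(y, score_mats, z_sel, sn_grid):
    """Grid search over truncation combinations {snj : j in D} by AIC.

    `score_mats[j]` is an (n, s_max) matrix of estimated FPC scores for
    the jth retained functional predictor, and `z_sel` collects the
    selected scalar covariates; each candidate is refitted by OLS.
    """
    n = len(y)
    best_aic, best_combo = np.inf, None
    for combo in itertools.product(sn_grid, repeat=len(score_mats)):
        design = np.column_stack(
            [scores[:, :s] for scores, s in zip(score_mats, combo)] + [z_sel]
        )
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ coef) ** 2))
        aic = np.log(rss) + 2.0 * sum(combo) / n
        if aic < best_aic:
            best_aic, best_combo = aic, combo
    return best_combo
```

Because only the selected predictors enter the refit, the number of candidate combinations |sn_grid|^|D| stays small in typical applications.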
1.3 Asymptotic Properties
Denote the true values of b(1) and γ by b(1)0 and γ0, and similarly for the rest of the parameters.
Recall that the boundedness of the covariance functions Kj(s, t) and regression operators
implies that ∑_{k=1}^∞ wjk < ∞ and ∑_{k=1}^∞ b²jk0 < ∞. We impose mild conditions on the decay rates
of the eigenvalues {wjk} and regression coefficients {bjk0}, similar to those adopted by Hall and
Horowitz [2007] and Lei [2014]. We assume that, for j = 1, . . . , d:

(A1) wjk − wj(k+1) ≥ C k^{−a−1} for k ≥ 1.

This implies that wjk ≥ C k^{−a}. As the covariance functions K1, . . . , Kd are bounded, one has
a > 1. Regarding the regression function βj(·), in order to prevent the coefficients bjk0 from
decreasing too slowly, we assume that

(A2) |bjk0| ≤ C k^{−b} for k > 1.

The decay conditions are needed only to control the tail behaviors for large k, and so are not as
restrictive as they appear. Without loss of generality, we use a common truncation parameter
sn in the theoretical analysis. It is important to control sn appropriately. On one hand, sn cannot be
too large, due to increasingly unstable functional principal component estimates,

(A3) (sn^{2a+2} + sn^{a+4})/n = o(1).

On the other hand, sn cannot be too small, so that the covariances between Z and the unobservable
{ξjk : k ≥ sn + 1} are asymptotically negligible,

(A4) sn^{2b−1}/n → ∞ as n → ∞.

Combining (A3) and (A4) entails b > max(a + 3/2, a/2 + 5/2) as a sufficient condition for such
an sn to exist. This implies that the regression function is smoother than the lower bound on
the smoothness of Kj. Regarding the dimension of scalar covariates, assume that the number
of significant covariates satisfies

(A5) sn^{a+2} qn²/n = o(1).

Such qn = o(n^{1/2} sn^{−a/2−1}) does exist and is allowed to diverge with the sample size given (A3).
The dimension of the candidate set, pn, is allowed to be ultra-high,

(A6) pn = O{exp(n^α)} for some α ∈ (0, 1/2).

Lastly, we require the following to hold for the tuning parameter λn and the sparsity of γ to
achieve consistent estimation:

(A7) λn = o(1), max{n^{2α−1}, n^{−1}(qn + sn)} = o(λn²), and min_{l=1,...,qn} |γl0|/λn → ∞.

We defer to Section 1.6 the standard conditions (B1)–(B5) on the underlying processes xij,
how the data are sampled and smoothed, and the moments of the scalar covariates, followed by the
auxiliary lemmas and proofs.
To facilitate the theoretical analysis, we re-parameterize by writing b̃jk = wjk^{1/2} bjk, so that the
functional principal component scores serving as predictor variables are on a common scale
of variability. This re-parameterization is only used for technical derivations, and does not
appear in the estimation procedure. Let η̃ = (b̃(1)T, γT)T, where b̃(1) = (b̃(1)T1, . . . , b̃(1)Td)T,
b̃(1)j = Aj b(1)j, and Aj is the sn × sn diagonal matrix with Aj(k, k) = wjk^{1/2}. Then the minimization
of (1.4) is equivalent to minimizing

Qn(η̃) = ∑_{i=1}^n {yi − ∑_{j=1}^d ∑_{k=1}^{sn} (ξ̂ijk wjk^{−1/2}) b̃jk − ziTγ}² + 2n ∑_{l=1}^{pn} Jλn(|γl|) + 2n ∑_{j=1}^d Jλjn(‖b̃(1)j‖).
Theorem 1 establishes the estimation and selection consistency for both the functional and
scalar regression parameters. For a random variable ε with mean zero, we call ε subgaussian
if there exists some constant C1 > 0 such that pr(|ε| > t) ≤ exp(−2^{−1} C1 t²) for all t ≥ 0.
Let b̃̂(1) denote the estimate of b̃(1).

Theorem 1 If ε1, . . . , εn are independent and identically distributed subgaussian random variables,
then under conditions (A1)–(A7) and (B1)–(B5), there exists a local minimizer η̃̂ =
(b̃̂(1)T, γ̂T)T of Qn(η̃) such that ‖η̃̂ − η̃0‖ = Op[{(qn + sn)/n}^{1/2}] and pr(γ̂(2) = 0, b̃̂(1)j = 0, j =
g + 1, . . . , d) → 1.
The estimation consistency result is expressed in terms of b̃(1), not the original parameter
b(1) = (b(1)T1, . . . , b(1)Td)T. For estimation, given b̂(1)j = Aj^{−1} b̃̂(1)j, it is easy to deduce that
‖β̂j − βj0‖²L2 = Op{sn^a (qn + sn)/n}, where β̂j = ∑_{k=1}^{sn} b̂jk φ̂jk and βj0 = ∑_{k=1}^∞ bjk0 φjk.
Theorem 2 establishes the asymptotic normality of the $q_n$-dimensional vector $\hat\gamma^{(1)}$. Write $\Sigma_1 = E(z_i^{(1)} z_i^{(1)T})$ and $\hat\Sigma_1 = n^{-1}\sum_{i=1}^{n} z_i^{(1)} z_i^{(1)T}$ with $z_i^{(1)} = (z_{i1}, \dots, z_{iq_n})^{T}$.

Theorem 2 If $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed subgaussian random variables and $q_n = o(n^{1/3})$, then under conditions (A1)–(A7) and (B1)–(B5), for the local minimizer in Theorem 1, $n^{1/2} A_n \hat\Sigma_1 (\hat\gamma^{(1)} - \gamma^{(1)}_0) \to N(0, \sigma^2 H^* + B^*)$ in distribution, for any $r \times q_n$ matrix $A_n$ such that $G = \lim_{n\to\infty} A_n A_n^{T}$ is positive definite, where $\sigma^2 = \mathrm{var}(\varepsilon)$, $H^* = \lim_{n\to\infty} A_n \Sigma_1 A_n^{T}$, and $B^* = \lim_{n\to\infty} A_n B_n A_n^{T}$ with
\[
B_n = \mathrm{cov}\Big[ \sum_{j=1}^{g}\sum_{k=1}^{s_n}\sum_{v \neq k} b_{jk0} (w_{jk} - w_{jv})^{-1} \langle \Xi_j, \phi_{jv} \rangle \int (x_{ij} \otimes x_{ij})\, \phi_{jk}\phi_{jv} \Big],
\]
where $\Xi_j = (\Xi_{j1}, \dots, \Xi_{jq_n})^{T}$, $E\{X_j(t) Z_l\} = \Xi_{jl}(t)$, and $(x_{ij} \otimes x_{ij})(s,t) = x_{ij}(s)x_{ij}(t)$.
The asymptotic covariance is inflated by estimating the unobservable functional principal com-
ponent scores. The inflation is quantified by a convergent sequence Bn associated with the
truncation size sn.
1.4 Simulation Studies
The simulated data $y_i$, $i = 1, \dots, n$, are generated from the model
\[
y_i = \sum_{j=1}^{d} \int_0^1 \beta_j(t)\, x_{ij}(t)\, dt + z_i^{T}\gamma + \varepsilon_i = \sum_{j=1}^{d}\sum_{k} b_{jk}\,\xi_{ijk} + z_i^{T}\gamma + \varepsilon_i,
\]
with $d = 4$ functional predictors and $p_n$ scalar covariates, where the errors $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed as $N(0, \sigma^2)$, and $\gamma$ is the vector of scalar coefficients. The functional predictors have mean zero and covariance function derived from the Fourier basis $\phi_{2\ell-1}(t) = 2^{1/2}\cos\{(2\ell-1)\pi t\}$ and $\phi_{2\ell}(t) = 2^{1/2}\sin\{(2\ell-1)\pi t\}$, $\ell = 1, \dots, 25$, $t \in \mathcal{T} = [0, 1]$.
The underlying regression function is $\beta_j(t) = \sum_{k=1}^{50} b_{jk}\phi_k(t)$, a linear combination of the eigenbasis. The scalar covariates $z_i = (z_{i1}, \dots, z_{ip_n})^{T}$ are jointly normal with mean zero, unit variance and AR(0·5) correlation structure. Next, we describe how to generate the $d = 4$ functional predictors $x_{ij}(t)$. For $j = 1, \dots, 4$, define $V_{ij}(t) = \sum_{k=1}^{50}\xi_{ijk}\phi_k(t)$, where the $\xi_{ijk}$, $i = 1, \dots, n$, are independent and identically distributed $N(0, 16k^{-2})$ across $i$ and $j$. The four functional predictors are then defined through the linear transformations
\[
x_{i1} = V_{i1} + 0{\cdot}5(V_{i2} + V_{i3}), \quad x_{i2} = V_{i2} + 0{\cdot}5(V_{i1} + V_{i3}), \quad x_{i3} = V_{i3} + 0{\cdot}5(V_{i1} + V_{i2}), \quad x_{i4} = V_{i4}.
\]
Here the first three functional predictors are correlated with each other. To be more realistic, we allow a moderate correlation between $V_{i1}$ and $z_i$ for $i = 1, \dots, n$, by specifying the correlation structure between $\xi_{i1} = (\xi_{i11}, \xi_{i12}, \xi_{i13}, \xi_{i14})^{T}$ and $z_i = (z_{i1}, \dots, z_{ip_n})^{T}$ as $\mathrm{corr}(\xi_{i1k}, z_{il}) = r^{|k-l|+1}$ $(k = 1, \dots, 4;\ l = 1, \dots, p_n)$, with $r = 0{\cdot}2$. For the actual observations, we assume they are realizations of $x_{ij}(\cdot)$, $j = 1, 2, 3, 4$, at 100 equally spaced times $t_{ijl} \in \mathcal{T}$, $l = 1, \dots, 100$, contaminated with independent and identically distributed noise $\varepsilon_{ijl} \sim N(0, 1)$.
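As a concrete illustration, the data-generating mechanism above can be sketched as follows. This is our own sketch: for brevity it draws the scores and the scalar covariates independently, omitting the cross-correlation $\mathrm{corr}(\xi_{i1k}, z_{il}) = r^{|k-l|+1}$ described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_n, K = 200, 20, 50

# Fourier eigenbasis on a grid of 100 equally spaced times in [0, 1]
t = np.linspace(0, 1, 100)
phi = np.empty((K, t.size))
phi[0::2] = [np.sqrt(2) * np.cos((2 * l - 1) * np.pi * t) for l in range(1, 26)]
phi[1::2] = [np.sqrt(2) * np.sin((2 * l - 1) * np.pi * t) for l in range(1, 26)]

# scores xi_ijk ~ N(0, 16 k^{-2}), drawn independently across i and j
sd = 4.0 / np.arange(1, K + 1)
xi = rng.normal(size=(n, 4, K)) * sd            # (i, j, k)
V = xi @ phi                                    # V_ij(t) on the grid

# linear transformations defining the functional predictors
x = np.empty_like(V)
x[:, 0] = V[:, 0] + 0.5 * (V[:, 1] + V[:, 2])
x[:, 1] = V[:, 1] + 0.5 * (V[:, 0] + V[:, 2])
x[:, 2] = V[:, 2] + 0.5 * (V[:, 0] + V[:, 1])
x[:, 3] = V[:, 3]

# AR(0.5) scalar covariates and N(0, 1) observation noise
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p_n), np.arange(p_n)))
z = rng.multivariate_normal(np.zeros(p_n), Sigma, size=n)
W = x + rng.normal(size=x.shape)                # noisy discrete observations
```

The noisy matrix `W` plays the role of the raw observations that are subsequently smoothed before functional principal component analysis.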
We use 200 Monte Carlo runs for model assessment. Since inferences on both the parametric component $\gamma$ and the functional components $\beta_j$ are of interest, we report the Monte Carlo averages of the numbers of false nonzero and false zero functional predictors, and the functional mean squared error $\mathrm{MSE}_f = \sum_{j=1}^{d} E(\|\hat\beta_j - \beta_j\|^2_{L^2})$. For the scalar covariates, we report the Monte Carlo averages of the numbers of false nonzero and false zero scalar covariates, and the scalar mean squared error $\mathrm{MSE}_s = E(\|\hat\gamma - \gamma\|^2)$. The prediction error is assessed using an independent test set of size $N = 1000$ for each Monte Carlo repetition, $\mathrm{PE} = N^{-1}\sum_{i=1}^{N}(y_i^* - \hat y_i^*)^2 - \sigma^2$, where $x_{ij}^*, z_i^*, y_i^*$, $j = 1, \dots, 4$, are the testing data generated from the same model, and the predictions are $\hat y_i^* = \sum_j\sum_k \hat\xi_{ijk}^* \hat b_{jk} + z_i^{*T}\hat\gamma$, obtained by plugging in estimates from the corresponding training sample.
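The assessment criteria above are straightforward to compute; a minimal sketch (ours, with a hand-rolled trapezoid rule to keep the block self-contained):

```python
import numpy as np

def trap(f, t):
    """Trapezoid-rule integral of f over grid t."""
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(t)))

def prediction_error(y_star, y_hat, sigma2):
    """PE = N^{-1} sum_i (y*_i - yhat*_i)^2 - sigma^2, on an independent test set."""
    return float(np.mean((y_star - y_hat) ** 2) - sigma2)

def mse_functional(beta_hat, beta_true, t):
    """sum_j ||beta_hat_j - beta_j||_{L2}^2 for one run, each norm by quadrature."""
    return sum(trap((bh - bt) ** 2, t) for bh, bt in zip(beta_hat, beta_true))
```

Subtracting $\sigma^2$ in the prediction error removes the irreducible noise, so PE measures only the estimation-induced part of the squared prediction loss.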
Design I considers a moderate number of scalar covariates with sample size $n = 200$ and error variance $\sigma^2 = 1$. Specifically, for $j = 1, 2$, $b_{j1} = 1$, $b_{j2} = 0{\cdot}8$, $b_{j3} = 0{\cdot}6$, $b_{j4} = 0{\cdot}5$ and $b_{jk} = 8(k-2)^{-4}$ $(k = 5, \dots, 50)$, $\beta_3 = \beta_4 = 0$, and $\gamma = (1_5^{T}, 0_{15}^{T})^{T}$. Thus $p_n = 20$ and $q_n = 5$. To illustrate the impact of the choice of $s_n$, we inspect the results for $s_n$ ranging from 1 to 16 with $\lambda_n$ chosen by BIC in Table 1.1. The selection of functional and scalar predictors is quite accurate and stable over a wide range of $s_n$, apart from a very small number of false nonzero scalars. For the functional predictors, the functional mean squared error improves until $s_n$ reaches an optimal level, then deteriorates as $s_n$ continues to increase. For the scalar covariates, the mean squared error and prediction error appear more stable beyond the optimal level. We then use ABIC with a common $s_n$ to select both $s_n$ and $\lambda_n$. It yields results similar to those at the optimal mean squared and prediction errors, selecting an average $s_n = 4{\cdot}30$ with standard error 0·050. Refitting the selected model using ordinary least squares with jointly tuned $s_{nj}$ via AIC improves the estimation of the functional coefficients and the overall prediction.
Design II illustrates the situation with an ultra-high dimension of scalar covariates, $\gamma = (1_5^{T}, 0_{995}^{T})^{T}$ with $p_n = 1000$, and other settings the same as in Design I. The ABIC yields results similar to those giving the optimal estimation and prediction. The number of false nonzero scalar covariates, the scalar mean squared error and the prediction error in Step 1 become larger than those in Design I, mainly due to the ultra-high number of insignificant scalar covariates. The functional mean squared error is also higher, as the correlation between functional predictors and scalar covariates becomes greater for a larger $p_n$. To improve the estimation and prediction, for each Monte Carlo run, after obtaining the estimates based on ABIC in Step 1, we generate an additional sample of size 200 and implement the penalized procedure using the significant variables and $s_n$ selected in Step 1. The results summarized in Step 2 are dramatically improved and become comparable to those for Design I. This hints at a promising two-step procedure via sample splitting when $p_n$ is ultra-high, in a similar spirit to Fan et al. [2012]. A further improvement can be achieved by refitting the selected model with jointly tuned $s_{nj}$ using ordinary least squares.

Design III illustrates the situation in which the response does not necessarily depend on the leading principal components and the regression coefficients may decay more slowly than theoretically required. Specifically, we use $b_{j1} = 0$, $b_{j2} = 1$, $b_{j3} = 0{\cdot}5$, $b_{jk} = (k-2)^{-3}$ for $k = 4, \dots, 50$, $j = 1, 2$, with the other settings the same as in Design I. The results across $s_n$ follow a similar pattern, while the automated ABIC captures the regression relationship adaptively and resembles the optimal estimation and prediction. The refitting step using least squares with jointly tuned $s_{nj}$ behaves similarly to Design I. Design IV contains an ultra-high number of scalar covariates, $\gamma = (1_5^{T}, 0_{995}^{T})^{T}$ with $p_n = 1000$, and other settings the same as in Design III. The results exhibit phenomena similar to those in Design II. Moreover, Table 1.3 reports the results obtained from the same settings as Designs I–IV except for a larger sample size $n = 400$. For Table 1.4, the only difference in settings is that the regression error $\varepsilon_i$ is generated from $N(0, 2)$ instead of $N(0, 1)$. As expected, a larger sample size reduces the estimation and prediction errors, while a higher noise level increases them.
Design I (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     0·95   0      4·1 (0·03)      0·39   7·1    4·7 (0·15)       27·9 (0·6)
  2     0·35   0      2·2 (0·10)      0·05   3·2    1·1 (0·06)       7·9 (0·3)
  3     0      0      0·6 (0·01)      0      0·91   0·22 (0·013)     1·5 (0·03)
  4     0      0      0·12 (0·005)    0      0·36   0·067 (0·005)    0·21 (0·007)
  5     0      0      0·14 (0·005)    0      0·38   0·069 (0·004)    0·19 (0·006)
  6     0      0      0·19 (0·007)    0      0·44   0·072 (0·004)    0·19 (0·006)
  10    0      0      0·65 (0·03)     0      0·38   0·073 (0·004)    0·22 (0·006)
  16    0·03   0·08   3·1 (0·13)      0      0·11   0·074 (0·006)    0·54 (0·1)
  ABIC [sn = 4·30 (0·050)]:
        0      0      0·13 (0·007)    0      0·34   0·067 (0·004)    0·20 (0·006)
  TUNE snj [sn1 = 4·78 (0·075), sn2 = 4·87 (0·071)]:
        0      0      0·09 (0·003)    0      0·28   0·065 (0·004)    0·18 (0·006)

Design II (pn = 1000):

  STEP 1 [sn = 4·07 (0·034)]:
        0      0·04   0·18 (0·004)    0      7·4    0·36 (0·018)     1·1 (0·032)
  STEP 2:
        0      0      0·095 (0·005)   0      0·10   0·047 (0·003)    0·17 (0·004)
  TUNE snj [sn1 = 4·76 (0·066), sn2 = 4·62 (0·055)]:
        0      0      0·076 (0·003)   0      0·09   0·046 (0·003)    0·16 (0·004)

Table 1.1: Simulation results with sample size n = 200 based on 200 Monte Carlo replicates for Designs I and II. Shown are the Monte Carlo averages (standard errors in parentheses) of the number of false zero functional predictors (FZf), the number of false nonzero functional predictors (FNf), the functional mean squared error (MSEf), the number of false zero scalar covariates (FZs), the number of false nonzero scalar covariates (FNs), the scalar mean squared error (MSEs), and the prediction error (PE). We first use ABIC to choose the tuning parameter λn and a common truncation sn, then tune snj jointly with AIC by refitting the selected model using ordinary least squares. In Design II, Step 1 results are based on the original sample in each Monte Carlo run, while Step 2 contains the improved results from fitting the penalized procedure to the model selected in Step 1 with an additional sample of n = 200.
Design III (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     2·0    0      2·5 (0)         1·5    1·1    4·2 (0·15)       26·0 (0·1)
  2     0·70   0      1·7 (0·06)      0·02   1·2    0·49 (0·028)     3·9 (0·1)
  3     0      0      0·076 (0·004)   0      0·38   0·069 (0·004)    0·21 (0·009)
  4     0      0      0·069 (0·003)   0      0·38   0·065 (0·004)    0·13 (0·005)
  5     0      0      0·11 (0·005)    0      0·40   0·067 (0·004)    0·14 (0·005)
  6     0      0      0·15 (0·006)    0      0·41   0·068 (0·004)    0·15 (0·005)
  10    0      0·01   0·60 (0·02)     0      0·37   0·071 (0·004)    0·19 (0·006)
  16    0·17   0·21   3·4 (0·1)       0      0·05   0·089 (0·007)    0·65 (0·05)
  ABIC [sn = 3·73 (0·048)]:
        0      0      0·072 (0·004)   0      0·38   0·066 (0·004)    0·14 (0·005)
  TUNE snj [sn1 = 3·99 (0·083), sn2 = 3·98 (0·079)]:
        0      0      0·056 (0·003)   0      0·32   0·063 (0·004)    0·13 (0·005)

Design IV (pn = 1000):

  STEP 1 [sn = 3·50 (0·040)]:
        0      0·09   0·10 (0·003)    0      6·5    0·41 (0·02)      0·81 (0·021)
  STEP 2:
        0      0      0·059 (0·003)   0      0·13   0·048 (0·003)    0·13 (0·005)
  TUNE snj [sn1 = 4·05 (0·067), sn2 = 4·11 (0·065)]:
        0      0      0·042 (0·002)   0      0·09   0·045 (0·003)    0·11 (0·004)

Table 1.2: Simulation results with sample size n = 200 based on 200 Monte Carlo replicates for Designs III and IV. Shown are the Monte Carlo averages (standard errors in parentheses) of the number of false zero functional predictors (FZf), the number of false nonzero functional predictors (FNf), the functional mean squared error (MSEf), the number of false zero scalar covariates (FZs), the number of false nonzero scalar covariates (FNs), the scalar mean squared error (MSEs), and the prediction error (PE). We first use ABIC to choose the tuning parameter λn and a common truncation sn, then tune snj jointly with AIC by refitting the selected model using ordinary least squares. In Design IV, Step 1 results are based on the original sample in each Monte Carlo run, while Step 2 contains the improved results from fitting the penalized procedure to the model selected in Step 1 with an additional sample of n = 200.
Design I (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     0·90   0      4·1 (0·04)      0·15   5·1    2·3 (0·08)       25·8 (0·2)
  2     0      0      1·3 (0·009)     0      1·4    0·34 (0·01)      4·6 (0·04)
  3     0      0      0·58 (0·008)    0      0·23   0·11 (0·004)     1·5 (0·02)
  4     0      0      0·063 (0·002)   0      0·23   0·027 (0·002)    0·13 (0·004)
  5     0      0      0·071 (0·003)   0      0·27   0·028 (0·002)    0·12 (0·004)
  6     0      0      0·095 (0·003)   0      0·24   0·028 (0·002)    0·12 (0·004)
  10    0      0      0·32 (0·01)     0      0·31   0·029 (0·002)    0·13 (0·004)
  16    0      0      1·2 (0·04)      0      0·10   0·026 (0·002)    0·15 (0·004)
  ABIC [sn = 4·22 (0·047)]:
        0      0      0·067 (0·003)   0      0·23   0·027 (0·002)    0·13 (0·004)
  TUNE snj [sn1 = 4·67 (0·058), sn2 = 4·68 (0·061)]:
        0      0      0·049 (0·002)   0      0·20   0·026 (0·002)    0·12 (0·004)

Design II (pn = 1000):

  STEP 1 [sn = 4·10 (0·027)]:
        0      0      0·11 (0·004)    0      4·6    0·13 (0·007)     0·51 (0·01)
  STEP 2:
        0      0      0·052 (0·002)   0      0·07   0·025 (0·001)    0·11 (0·004)
  TUNE snj [sn1 = 4·72 (0·056), sn2 = 4·60 (0·053)]:
        0      0      0·041 (0·001)   0      0·06   0·025 (0·001)    0·11 (0·004)

Design III (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     2·0    0      2·5 (0)         1·0    0·43   2·7 (0·1)        25·2 (0·09)
  2     0·10   0      0·72 (0·04)     0      0·43   0·19 (0·01)      2·8 (0·06)
  3     0      0      0·059 (0·002)   0      0·25   0·032 (0·002)    0·16 (0·006)
  4     0      0      0·033 (0·001)   0      0·23   0·026 (0·001)    0·073 (0·004)
  5     0      0      0·048 (0·002)   0      0·26   0·027 (0·002)    0·075 (0·004)
  6     0      0      0·072 (0·003)   0      0·25   0·027 (0·002)    0·078 (0·004)
  10    0      0      0·28 (0·01)     0      0·25   0·028 (0·002)    0·099 (0·004)
  16    0      0·03   1·2 (0·04)      0      0·02   0·024 (0·001)    0·13 (0·004)
  ABIC [sn = 3·91 (0·036)]:
        0      0      0·034 (0·002)   0      0·23   0·026 (0·001)    0·075 (0·004)
  TUNE snj [sn1 = 4·37 (0·064), sn2 = 4·38 (0·066)]:
        0      0      0·026 (0·001)   0      0·22   0·026 (0·001)    0·070 (0·004)

Design IV (pn = 1000):

  STEP 1 [sn = 3·81 (0·043)]:
        0      0·01   0·047 (0·001)   0      3·8    0·17 (0·009)     0·35 (0·008)
  STEP 2:
        0      0      0·031 (0·002)   0      0·08   0·024 (0·001)    0·073 (0·004)
  TUNE snj [sn1 = 4·28 (0·061), sn2 = 4·37 (0·067)]:
        0      0      0·023 (0·001)   0      0·07   0·024 (0·001)    0·058 (0·004)

Table 1.3: Simulation results using the same settings as Designs I & II in Section 4 of the paper and Designs III & IV as above, based on 200 Monte Carlo replicates, except that the sample size is n = 400.
Design I (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     0·94   0      4·1 (0·03)      0·41   7·3    4·9 (0·2)        28·0 (0·6)
  2     0·56   0      2·8 (0·1)       0·08   3·3    1·4 (0·07)       9·9 (0·3)
  3     0·01   0      0·63 (0·02)     0      1·7    0·38 (0·03)      1·7 (0·07)
  4     0      0      0·17 (0·007)    0      0·58   0·15 (0·009)     0·31 (0·01)
  5     0      0      0·23 (0·1)      0      0·58   0·15 (0·009)     0·30 (0·01)
  6     0      0      0·33 (0·01)     0      0·63   0·15 (0·009)     0·32 (0·01)
  10    0·01   0·01   1·3 (0·05)      0      0·49   0·15 (0·009)     0·44 (0·05)
  16    0·31   0·18   7·7 (0·3)       0·08   0·20   0·36 (0·03)      3·6 (0·3)
  ABIC [sn = 4·12 (0·026)]:
        0      0      0·18 (0·009)    0      0·56   0·15 (0·009)     0·30 (0·01)
  TUNE snj [sn1 = 4·59 (0·069), sn2 = 4·61 (0·060)]:
        0      0      0·14 (0·006)    0      0·48   0·14 (0·008)     0·28 (0·01)

Design II (pn = 1000):

  STEP 1 [sn = 4·04 (0·021)]:
        0      0·04   0·29 (0·008)    0      7·3    0·55 (0·03)      1·9 (0·07)
  STEP 2:
        0      0      0·15 (0·008)    0      0·16   0·096 (0·005)    0·26 (0·02)
  TUNE snj [sn1 = 4·53 (0·056), sn2 = 4·46 (0·050)]:
        0      0      0·12 (0·006)    0      0·15   0·095 (0·005)    0·24 (0·008)

Design III (pn = 20):

  sn    FZf    FNf    MSEf            FZs    FNs    MSEs             PE
  1     2·0    0      2·5 (0)         1·5    1·2    4·3 (0·15)       26·1 (0·1)
  2     0·33   0      1·1 (0·06)      0·02   0·69   0·43 (0·03)      3·3 (0·1)
  3     0      0      0·10 (0·005)    0      0·39   0·13 (0·008)     0·29 (0·01)
  4     0      0      0·12 (0·006)    0      0·39   0·13 (0·008)     0·22 (0·01)
  5     0      0      0·20 (0·009)    0      0·41   0·13 (0·008)     0·24 (0·01)
  6     0      0      0·29 (0·01)     0      0·41   0·14 (0·008)     0·26 (0·01)
  10    0·03   0·07   1·4 (0·06)      0      0·17   0·12 (0·007)     0·41 (0·03)
  16    0·02   1·3    11·8 (0·4)      0·02   0·39   0·18 (0·02)      1·0 (0·03)
  ABIC [sn = 3·76 (0·065)]:
        0      0      0·15 (0·02)     0      0·36   0·13 (0·008)     0·24 (0·01)
  TUNE snj [sn1 = 3·76 (0·064), sn2 = 3·79 (0·066)]:
        0      0      0·070 (0·004)   0      0·29   0·12 (0·007)     0·20 (0·01)

Design IV (pn = 1000):

  STEP 1 [sn = 3·33 (0·040)]:
        0      0·06   0·18 (0·006)    0      6·4    0·63 (0·03)      1·5 (0·04)
  STEP 2:
        0      0      0·091 (0·005)   0      0·12   0·099 (0·005)    0·23 (0·009)
  TUNE snj [sn1 = 3·96 (0·060), sn2 = 3·73 (0·048)]:
        0      0      0·065 (0·003)   0      0·10   0·095 (0·005)    0·18 (0·009)

Table 1.4: Simulation results using the same settings as Designs I & II in Section 4 of the paper and Designs III & IV as above, based on 200 Monte Carlo replicates, except that the variance of the regression error is σ² = 2.
1.5 Application
We applied our method to a dataset from the National Mortality, Morbidity, and Air Pollution Study that contains air pollution measurements and mortality counts for U.S. cities, together with U.S. census information for the year 2000. A main goal of the study is to investigate the impact of air pollution on the non-accidental mortality rate across different cities in the United States, taking into account climate patterns and census information. Previous studies
conducted a two-stage analysis: first modelling the short-term effect of certain air pollutants on
the mortality count for each city, then combining the estimates across cities [Peng et al., 2005,
2006]. By contrast, we apply the partially functional linear regression model to the data for
different cities. In particular, we are interested in studying the effect of particulate matter with an aerodynamic diameter of less than 2·5 µm, abbreviated PM2·5 and measured in µg m⁻³, as its negative effect on health, revealed by recent toxicological and epidemiological studies, has brought it to the public's attention. Other studies [Samoli et al., 2013, Pascal et al., 2014] have
shown that PM2·5 has a larger effect on mortality in warm weather, so we focused on the daily
concentration measurements of PM2·5 from April 1, 2000 to August 31, 2000, which along with
the daily observations of temperature and humidity were treated as the functional predictors.
After removing the cities which have more than ten consecutive missing measurements of
PM2·5, we included a total of 69 cities in our analysis. The response of interest is the log-
transformed total non-accidental mortality rate in the following month, September 2000, of
the population of age 65 and older, which accounts for the majority of non-accidental deaths.
The scalar covariates available from the U.S. census for each city are land area per individual,
water area per individual, proportion of the urban population, proportion of the population with
at least a high school diploma, proportion of the population with at least a university degree,
proportion of the population below poverty line, and proportion of household owners.
The ABIC was used to first choose significant predictors with a common truncation, fol-
lowed by a least squares refitting using AIC to tune snj jointly. Among scalar covariates, our
analysis shows that only the proportion of household owners has a significant negative effect, −1·80 with a standard error of 0·41, indicating that household owners tend to incur a lower mortality rate; the standard error was obtained from 1000 bootstrap samples by fitting the selected model using ordinary least squares. Our method also selected two significant functional predictors, PM2·5 and temperature. The least squares refitting chose the truncation numbers
sn1 = 2 and sn2 = 2. The estimated regression parameter functions with their 95% bootstrap
confidence bands are shown in Figure 1.1. We observe that higher PM2·5 concentrations during
the summer, especially July and August, can lead to an increased mortality during the imme-
diately following period, which coincides with the findings in Samoli et al. [2013] and Pascal
et al. [2014]. This may be partially explained by the proximity of the pollution period to the
time of death. Higher temperatures in the summer, contrasted with lower temperatures in April,
may also increase the mortality rate, which agrees with Curriero et al. [2002]. To better under-
stand the effects of functional predictors, we fitted a linear regression using only the selected
scalar covariate, giving R2 = 0·15. Including temperature leads to R2 = 0·25, and including
both temperature and PM2·5 yields R² = 0·38. A heuristic F test for the significance of the two principal component scores of temperature gives a p-value of 0·01, and adding two additional principal component scores of PM2·5 gives a p-value of 0·0008. For comparison, we also fitted the marginal models containing only PM2·5 or temperature using classical functional linear regression. The marginal F tests for temperature and PM2·5 yield p-values of 0·0001 and 0·004, respectively.
The regression parameter functions from these marginal fits show similar patterns and are omitted. We conclude that, after adjusting for temperature and household ownership, summer PM2·5 concentrations have a significant impact on the near-future mortality rate of elderly citizens in the U.S.
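The heuristic F tests above compare nested least-squares fits. A minimal sketch of the statistic, under the usual nested-model formula (our illustration, with hypothetical inputs; the p-value would then come from the F(df_extra, n − p_full) distribution):

```python
def nested_f(rss_reduced, rss_full, df_extra, n, p_full):
    """F statistic for comparing nested least-squares fits:
    rss_reduced/rss_full are residual sums of squares of the two models,
    df_extra is the number of added predictors (e.g. 2 principal component
    scores), n the sample size, p_full the number of parameters in the
    full model. This helper is our illustration, not code from the thesis."""
    df_resid = n - p_full
    return ((rss_reduced - rss_full) / df_extra) / (rss_full / df_resid)
```

With n = 69 cities, adding a small number of score predictors keeps the residual degrees of freedom comfortably large, which is what makes such heuristic tests informative here.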
1.6 Proof of Lemmas and Main Theorems
1.6.1 Regularity Conditions
[Figure 1.1: two panels, (a) PM2.5 and (b) Temperature, plotting Effect against Time (day).]

Figure 1.1: The estimated regression parameter functions and 95% confidence bands based on 1000 bootstrap samples for PM2·5 and temperature with $s_{n1} = 2$ and $s_{n2} = 2$, respectively. The solid line denotes the estimated regression parameter function and the dashed lines denote the 95% bootstrap confidence bands. The left and right panels are for PM2·5 and temperature, respectively.

Without loss of generality, we assume that $X_j$, $j = 1, \dots, d$, $Y$ and $Z$ have been centred to have mean zero. With $W_{ijl} = x_{ij}(t_{ijl}) + \varepsilon_{ijl}$, for definiteness we consider the local linear smoother for each set of subjects using bandwidths $h_{ij}$, $j = 1, \dots, d$, and denote the smoothed trajectories by $\hat x_{ij}$. Denote the minimum and maximum eigenvalues of a symmetric matrix $A$ by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$. Recall that the first $g$ functional predictors are significant, while the rest are not. Define the $(gs_n + q_n) \times 1$ vector $N = (\xi_{11}w_{11}^{-1/2}, \dots, \xi_{1s_n}w_{1s_n}^{-1/2}, \dots, \xi_{g1}w_{g1}^{-1/2}, \dots, \xi_{gs_n}w_{gs_n}^{-1/2}, Z_1, \dots, Z_{q_n})^{T}$, which combines all significant functional and scalar predictors.
Condition (B1) consists of regularity assumptions for functional data; for example, a Gaussian process with Hölder continuous sample paths satisfies (B1), see Hall and Hosseini-Nasab [2006]. Condition (B2) is standard for local linear smoothers, (B3)–(B4) concern how the functional predictors are sampled and smoothed, while (B5) is for the moments of the non-functional covariates $Z = (Z_1, \dots, Z_{p_n})^{T}$.

(B1) For $j = 1, \dots, d$ and any $C > 0$ there exists an $\epsilon > 0$ such that
\[
\sup_{t \in \mathcal{T}} E\{|X_j(t)|^C\} < \infty, \qquad \sup_{s,t \in \mathcal{T}} E\{|s - t|^{-\epsilon}|X_j(s) - X_j(t)|^C\} < \infty.
\]
For each integer $r \ge 1$, $w_{jk}^{-r} E(\xi_{jk}^{2r})$ is bounded uniformly in $k$.
(B2) For $j = 1, \dots, d$, $X_j$ is twice continuously differentiable on $\mathcal{T}$ with probability 1, and $\int E\{X_j^{(2)}(t)\}^4\, dt < \infty$, where $X_j^{(2)}(\cdot)$ denotes the second derivative of $X_j(\cdot)$.

The following condition concerns the design on which $x_{ij}$ is observed and the local linear smoother $\hat x_{ij}$. When a function is said to be smooth, we mean that it is continuously differentiable to an adequate order.
(B3) For $j = 1, \dots, d$, the $t_{ijl}$, $l = 1, \dots, m_{ij}$, are considered deterministic and ordered increasingly for $i = 1, \dots, n$. There exist densities $g_{ij}$, uniformly smooth over $i$, satisfying $\int_{\mathcal{T}} g_{ij}(t)\,dt = 1$ and $0 < c_1 < \inf_i\inf_{t\in\mathcal{T}} g_{ij}(t) < \sup_i\sup_{t\in\mathcal{T}} g_{ij}(t) < c_2 < \infty$, that generate the $t_{ijl}$ according to $t_{ijl} = G_{ij}^{-1}\{(l-1)/(m_{ij}-1)\}$, where $G_{ij}^{-1}$ is the inverse of $G_{ij}(t) = \int_{-\infty}^{t} g_{ij}(s)\,ds$. For each $j = 1, \dots, d$, there exists a common sequence of bandwidths $h_j$ such that $0 < c_1 < \inf_i h_{ij}/h_j < \sup_i h_{ij}/h_j < c_2 < \infty$, where $h_{ij}$ is the bandwidth for $\hat x_{ij}$. The kernel density function is smooth and compactly supported.
Let $\mathcal{T} = [a_0, b_0]$, $t_{ij0} = a_0$, $t_{ij,m_{ij}+1} = b_0$, $\Delta_{ij} = \sup\{t_{ij,l+1} - t_{ij,l},\ l = 0, \dots, m_{ij}\}$, and $m_j = m_j(n) = \inf_{i=1,\dots,n} m_{ij}$. The condition below lets the smoothed estimate $\hat x_{ij}$ serve as well as the true $x_{ij}$ in the asymptotic analysis, where $a_n \sim b_n$ denotes $0 < \lim a_n/b_n < \infty$.

(B4) For $j = 1, \dots, d$, $\sup_i \Delta_{ij} = O(m_j^{-1})$, $h_j \sim m_j^{-1/5}$, and $m_j n^{-5/4} \to \infty$.
The condition for the scalar covariates is given below.

(B5) $Z_1, \dots, Z_{p_n}$ are subgaussian random variables such that $\mathrm{pr}(|Z_l| > t) \le \exp(-2^{-1}Ct^2)$ for any $t \ge 0$ and some $C > 0$ that does not depend on $l$, $E(Z_l^4)$ is uniformly bounded for $l = 1, \dots, p_n$, $\lambda_{\max}(\Sigma) \le c_1 < \infty$ and $0 < c_2 \le \lambda_{\min}(U_1) \le \lambda_{\max}(U_1) \le c_3 < \infty$ for all $n$, where $\Sigma = E(ZZ^{T})$ and $U_1 = E(NN^{T})$.
1.6.2 Auxiliary Lemmas
For each $j = 1, \dots, d$, given the estimated covariance $\hat K_j(s,t) = n^{-1}\sum_{i=1}^{n}\hat x_{ij}(s)\hat x_{ij}(t)$, the eigenvalues/eigenfunctions and functional principal component scores are estimated by
\[
\int_{\mathcal{T}} \hat K_j(s,t)\,\hat\phi_{jk}(s)\,ds = \hat w_{jk}\hat\phi_{jk}(t), \qquad \hat\xi_{ijk} = \int_{\mathcal{T}} \hat x_{ij}(t)\hat\phi_{jk}(t)\,dt, \tag{1.5}
\]
subject to $\int_{\mathcal{T}}\hat\phi_{jk}^2(t)\,dt = 1$ and $\int_{\mathcal{T}}\hat\phi_{jk_1}(t)\hat\phi_{jk_2}(t)\,dt = 0$ for $k_1 \neq k_2$. Denote $\hat\Delta_j^2 = |||\hat K_j - K_j|||^2 = \int_{\mathcal{T}}\int_{\mathcal{T}}\{\hat K_j(s,t) - K_j(s,t)\}^2\,ds\,dt$, $\hat\Re_j(s,t) = n^{1/2}\{\hat K_j(s,t) - K_j(s,t)\}$, $(x_{ij}\otimes x_{ij})(s,t) = x_{ij}(s)x_{ij}(t)$, and the second derivative of $x_{ij}$ by $x^{(2)}_{ij}$. Let the $n \times p_n$ matrix $Z_M = (z_1, \dots, z_n)^{T} = (Z_M^{(1)}, Z_M^{(2)})$, with $Z_M^{(1)}$ containing the first $q_n$ columns of $Z_M$. Define $M_j$ as the $n \times s_n$ matrix with $(i,k)$th element $\xi_{ijk}$, and $M = (M_1, \dots, M_d)$, $M^{(1)} = (M_1, \dots, M_g)$, $M^{(2)} = (M_{g+1}, \dots, M_d)$. Similarly, we define $\tilde M_j$, $\tilde M^{(1)}$ and $\tilde M^{(2)}$, in which $\xi_{ijk}$ is replaced by $\xi_{ijk}w_{jk}^{-1/2}$. Moreover, we denote by $\hat M_j$ the $n \times s_n$ matrix with $(i,k)$th element $\hat\xi_{ijk}$, and set $\hat M = (\hat M_1, \dots, \hat M_d)$, $\hat M^{(1)} = (\hat M_1, \dots, \hat M_g)$ and $\hat M^{(2)} = (\hat M_{g+1}, \dots, \hat M_d)$; likewise $\hat{\tilde M}_j$, $\hat{\tilde M}^{(1)}$ and $\hat{\tilde M}^{(2)}$, in which $\hat\xi_{ijk}$ is replaced by $\hat\xi_{ijk}w_{jk}^{-1/2}$. Combining the functional and scalar covariates, we have $\tilde N_1 = (\tilde M^{(1)}, Z_M^{(1)}) = (N_1, \dots, N_n)^{T}$, where the $N_i = (N_{i1}, \dots, N_{i,gs_n+q_n})^{T}$, $i = 1, \dots, n$, are independently and identically distributed as $N$. We further denote $\hat N_1 = (\hat{\tilde M}^{(1)}, Z_M^{(1)}) = (\hat N_1, \dots, \hat N_n)^{T}$, $N^{(1)} = (M^{(1)}, Z_M^{(1)})$, $\hat N^{(1)} = (\hat M^{(1)}, Z_M^{(1)})$, and let $E(N_iN_i^{T}) = U_1$. Recall that $\hat b^{(1)}$ denotes the estimate of $b^{(1)}$, $\hat\eta = (\hat b^{(1)T}, \hat\gamma^{T})^{T}$, and we further denote by $\hat b_j^{(1)}$ the estimate of $b_j^{(1)}$. Suppose $\alpha(\cdot)$, $\beta(\cdot)$ and $G(\cdot,\cdot)$ are square-integrable functions on $\mathcal{T}$ and $\mathcal{T}\times\mathcal{T}$. Write $\|\alpha\|$, $\int\alpha\beta$ (or $\langle\alpha,\beta\rangle$) and $\int G\alpha\beta$ for $\{\int_{\mathcal{T}}\alpha^2(t)\,dt\}^{1/2}$, $\int_{\mathcal{T}}\alpha(t)\beta(t)\,dt$ and $\int\!\!\int_{\mathcal{T}^2} G(s,t)\alpha(s)\beta(t)\,ds\,dt$, respectively.
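The eigen-decomposition in (1.5) can be carried out on a grid. The sketch below is a generic discretized FPCA under the assumption of an equally spaced grid; it is our illustration of the estimator, not code from the thesis.

```python
import numpy as np

def fpca(x_hat, t, n_comp):
    """Empirical FPCA from smoothed trajectories x_hat (n, m) on a common grid t:
    eigendecompose K_hat(s, t) = n^{-1} sum_i x_i(s) x_i(t), then compute scores
    xi_hat_ik = int x_i(t) phi_hat_k(t) dt by Riemann-sum quadrature."""
    n, m = x_hat.shape
    dt = t[1] - t[0]                      # equally spaced grid assumed
    K = x_hat.T @ x_hat / n               # discretized covariance surface
    w, v = np.linalg.eigh(K * dt)         # eigenvalues of the integral operator
    idx = np.argsort(w)[::-1][:n_comp]    # eigh returns ascending order
    lam = w[idx]
    phi = v[:, idx].T / np.sqrt(dt)       # rescale so int phi_k^2 dt = 1
    scores = x_hat @ phi.T * dt           # (n, n_comp) principal component scores
    return lam, phi, scores
```

Multiplying the discretized covariance by the grid spacing converts matrix eigenvalues into estimates of the eigenvalues of the covariance operator, and the $1/\sqrt{dt}$ rescaling restores the $L^2$ normalization of the eigenfunctions.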
Lemma 1 provides results for the estimates obtained by functional principal component analysis. We first quantify the smoothing error of $\hat x_{ij}$ and its influence carried over to the covariance and functional principal component estimates, via the decomposition $\hat\xi_{ijk} - \xi_{ijk} = (\hat\xi_{ijk} - \tilde\xi_{ijk}) + (\tilde\xi_{ijk} - \xi_{ijk}) = \int \hat x_{ij}(\hat\phi_{jk} - \phi_{jk}) + \int(\hat x_{ij} - x_{ij})\phi_{jk}$. Then we obtain upper bounds for the differences between the estimated and true eigenfunctions in terms of covariance perturbation and eigenvalue spacings, quantifying the increased estimation error of the higher-order terms. For each $j = 1, \dots, d$, denote
\[
\delta_{jk} = \min_{l=1,\dots,k}(w_{jl} - w_{j,l+1}), \qquad J_{jn} = \{k = 1, \dots, \infty : w_{jk} - w_{j,k+1} > 2\hat\Delta_j\},
\]
where $\hat\Delta_j = |||\hat K_j - K_j|||$; that is, consider $k \in J_{jn}$ for which the distance of $w_{jk}$ to the nearest other eigenvalues does not fall below $2\hat\Delta_j$. It is known that $\phi_{jk}$ can be consistently estimated for $k \in J_{jn}$ [Theorem 1, Hall and Hosseini-Nasab, 2006], $\|\hat\phi_{jk} - \phi_{jk}\| = O_p(\delta_{jk}^{-1}\hat\Delta_j) = O_p(k^{a+1}n^{-1/2})$, implying $\sup J_{jn} = o\{n^{1/(2a+2)}\}$. However, for our theoretical analysis, we need a sharper bound without compromising the number of eigenfunctions considered. Define the set of realizations such that, for sample size $n$, some $C$ and any $\tau < 1$,
\[
F_{s_n,j} = \{(\hat w_{jk_1} - \hat w_{jk_2})^{-2} \le 2(w_{jk_1} - w_{jk_2})^{-2} \le Cn^{\tau},\ k_1, k_2 = 1, \dots, s_n,\ k_1 \neq k_2\}.
\]
Lemma 1 (a) Under conditions (B1)–(B4), for each $i = 1, \dots, n$ and each $j = 1, \dots, d$, we have $E(\|\hat x_{ij} - x_{ij}\|^2) = o(n^{-1})$ and $E\{\int(\hat x_{ij} - x_{ij})^4\} = o(n^{-2})$. For each $j = 1, \dots, d$, we have $E(|||\hat K_j - K_j|||^2) = O(n^{-1})$.

(b) Under conditions (A1), (A3), (B1)–(B4), for each $j = 1, \dots, d$, we have $s_n \in J_{jn}$ for large $n$, and $P(F_{s_n,j}) \to 1$ as $n \to \infty$ for any $\tau < 1$. Moreover, for each $k = 1, \dots, s_n$, $\|\hat\phi_{jk} - \phi_{jk}\| = O_p(k\, n^{-1/2})$, where the $O_p(\cdot)$ is uniform in $k = 1, \dots, s_n$.

(c) Under conditions (A1), (A3), (B1)–(B4), for each $j = 1, \dots, d$,
\[
\hat\phi_{jk}(t) - \phi_{jk}(t) = n^{-1/2}\sum_{v: v \neq k}(w_{jk} - w_{jv})^{-1}\phi_{jv}(t)\int \hat\Re_j\,\phi_{jk}\phi_{jv} + \alpha_{jk}(t),
\]
where $\|\alpha_{jk}\| = O_p(k^{a+2}n^{-1})$, $\hat\Re_j = n^{1/2}(\hat K_j - K_j)$, and the $O_p(\cdot)$ is uniform in $k = 1, \dots, s_n$.

(d) Under conditions (A1), (B1)–(B4) and $n^{-1/2}s_n = o(1)$, for any $j = 1, \dots, d$, we have $c_1\lambda_n \le \lambda_{jn} \le c_2\lambda_n$ for some positive constants $c_1, c_2$.
The next lemma quantifies the asymptotic orders of several important types of expressions that will be encountered in the proofs of our lemmas and main theorems. For convenience, define the following, for $l, l_1, l_2 = 1, \dots, p_n$, $k, k_1, k_2 = 1, \dots, s_n$ and $j, j_1, j_2 = 1, \dots, d$:
\begin{align*}
\theta^{(1)}_{jk} &= \sum_{i=1}^{n}(\hat\xi_{ijk} - \xi_{ijk})^2 w_{jk}^{-1}, &
\theta^{(2)}_{j_1j_2k_1k_2} &= n^{-1}\sum_{i=1}^{n}(\hat\xi_{ij_1k_1}\hat\xi_{ij_2k_2} - \xi_{ij_1k_1}\xi_{ij_2k_2})(w_{j_1k_1}w_{j_2k_2})^{-1/2}, \\
\theta^{(3)}_{j_1j_2k_1k_2} &= n^{-1}\sum_{i=1}^{n}\{\xi_{ij_1k_1}\xi_{ij_2k_2} - E(\xi_{j_1k_1}\xi_{j_2k_2})\}(w_{j_1k_1}w_{j_2k_2})^{-1/2}, &
\theta^{(4)}_{j_1j_2k_1k_2} &= \theta^{(2)}_{j_1j_2k_1k_2} + \theta^{(3)}_{j_1j_2k_1k_2}, \\
\theta^{(5)}_{jkl} &= n^{-1}\sum_{i=1}^{n}(\hat\xi_{ijk} - \xi_{ijk})z_{il}\, w_{jk}^{-1/2}, &
\theta^{(6)}_{jkl} &= n^{-1}\sum_{i=1}^{n}\{\xi_{ijk}z_{il} - E(\xi_{ijk}z_{il})\}w_{jk}^{-1/2}, \\
\theta^{(7)}_{jkl} &= \theta^{(5)}_{jkl} + \theta^{(6)}_{jkl}, &
\theta^{(8)}_{l_1l_2} &= n^{-1}\sum_{i=1}^{n}\{z_{il_1}z_{il_2} - E(z_{il_1}z_{il_2})\}, \\
\vartheta^{(1)}_{jl} &= \sum_{i=1}^{n}\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}z_{il}, &
\vartheta^{(2)}_{jl} &= \sum_{i=1}^{n}\sum_{k=1}^{s_n}(\hat\xi_{ijk} - \xi_{ijk})(\hat b_{jk} - b_{jk0})z_{il}, \\
\vartheta^{(3)}_{jl} &= \sum_{i=1}^{n}\sum_{k=1}^{s_n}(\tilde\xi_{ijk} - \xi_{ijk})b_{jk0}z_{il}, &
\vartheta^{(4)}_{jl} &= \sum_{i=1}^{n}\sum_{k=1}^{s_n}(\hat\xi_{ijk} - \tilde\xi_{ijk})b_{jk0}z_{il}, \\
\vartheta^{(5)}_{jl} &= \sum_{k=1}^{s_n}b_{jk0}\int\Big(\sum_{i=1}^{n}\hat x_{ij}z_{il} - n\Xi_{jl}\Big)(\hat\phi_{jk} - \phi_{jk}), &
\vartheta^{(6)}_{jl} &= n\sum_{k=1}^{s_n}b_{jk0}\langle\Xi_{jl}, \alpha_{jk}\rangle, \\
\vartheta^{(7)}_{jl} &= n^{1/2}\sum_{k=1}^{s_n}\sum_{v \neq k}b_{jk0}(w_{jk} - w_{jv})^{-1}\langle\Xi_{jl}, \phi_{jv}\rangle\int(\hat\Re_j - \Re^*_j)\phi_{jk}\phi_{jv}, \span\span \\
\vartheta^{(8)}_{jl} &= n^{1/2}\sum_{k=1}^{s_n}\sum_{v \neq k}b_{jk0}(w_{jk} - w_{jv})^{-1}\langle\Xi_{jl}, \phi_{jv}\rangle\int\Re^*_j\,\phi_{jk}\phi_{jv}, \span\span
\end{align*}
where $\vartheta^{(m)}_j = (\vartheta^{(m)}_{j1}, \dots, \vartheta^{(m)}_{jq_n})^{T}$ denote the corresponding vectors and $\Re^*_j = n^{1/2}(\tilde K_j - K_j)$, with $\tilde K_j = n^{-1}\sum_{i=1}^{n}x_{ij}\otimes x_{ij}$.
Lemma 2 (a) Under conditions (A1), (A3), (B1)–(B5), we have
\begin{align*}
\theta^{(1)}_{jk} &= O_p(k^{a+2}), & \theta^{(2)}_{j_1j_2k_1k_2} &= O_p(k_1^{a/2+1}n^{-1/2} + k_2^{a/2+1}n^{-1/2}), \\
\theta^{(3)}_{j_1j_2k_1k_2} &= O_p(n^{-1/2}), & \theta^{(4)}_{j_1j_2k_1k_2} &= O_p(k_1^{a/2+1}n^{-1/2} + k_2^{a/2+1}n^{-1/2}), \\
\theta^{(5)}_{jkl} &= O_p(k^{a/2+1}n^{-1/2}), & \theta^{(6)}_{jkl} &= O_p(n^{-1/2}), \\
\theta^{(7)}_{jkl} &= O_p(k^{a/2+1}n^{-1/2}), & \theta^{(8)}_{l_1l_2} &= O_p(n^{-1/2}),
\end{align*}
where the $O_p(\cdot)$ terms are uniform in $k, k_1, k_2 = 1, \dots, s_n$ and $l, l_1, l_2 = 1, \dots, p_n$.

(b) Under conditions (A1)–(A7), (B1)–(B5), uniformly in $l = 1, \dots, p_n$,
\[
\vartheta^{(1)}_{jl} = \vartheta^{(2)}_{jl} = \vartheta^{(3)}_{jl} = \vartheta^{(5)}_{jl} = \vartheta^{(6)}_{jl} = \vartheta^{(7)}_{jl} = o_p(n^{1/2}), \qquad \vartheta^{(4)}_{jl} = \vartheta^{(5)}_{jl} + \vartheta^{(6)}_{jl} + \vartheta^{(7)}_{jl} + \vartheta^{(8)}_{jl}.
\]
Lemma 3 characterizes the eigenvalues of the design matrix $\hat N_1$, and Lemma 4 concerns the asymptotic order of $\Lambda_1 = P_{\hat N_1}(Y_M - \hat N_1\eta_{10})$, where $P_{\hat N_1} = \hat N_1(\hat N_1^{T}\hat N_1)^{-1}\hat N_1^{T}$, $Y_M = (y_1, \dots, y_n)^{T}$, and $\eta_{10}$ is the true value of $\eta_1$.

Lemma 3 Under conditions (A1), (A3), (A5), (B1)–(B5), we have $|\lambda_{\min}(\hat N_1^{T}\hat N_1/n) - \lambda_{\min}(U_1)| = o_p(1)$ and $|\lambda_{\max}(\hat N_1^{T}\hat N_1/n) - \lambda_{\max}(U_1)| = o_p(1)$.

Lemma 4 Under conditions (A1)–(A5), (B1)–(B5), we have $\|\Lambda_1\|_2^2 = O_p(r_n^2)$, where $r_n^2 = q_n + s_n$.
1.6.3 Proof of Lemmas
Proof of Lemma 1. We shall show Lemma 1 for an arbitrary fixed $j = 1, \dots, d$, and suppress the subscript $j$ in this proof for convenience. For part (a), recall that $W_{il} = x_i(t_{il}) + \varepsilon_{il}$, where the errors $\varepsilon_{il}$ are independent of $x_i$. Thus one can factor the probability space as $\Omega = \Omega_1 \times \Omega_2$, where $\Omega_1$ is for $\{x_i\}_{i=1,\dots,n}$ and $\Omega_2$ for $\{\varepsilon_{il}\}_{l=1,\dots,m_i}$. Given the data for a single subject, a specific realization $x_i$ corresponds to fixing a value $\omega_i \in \Omega_1$, that is, $x_i(\cdot) = x_i(\cdot, \omega_i)$. We write $E_\varepsilon$ for expectations with respect to the probability measure on $\Omega_2$, and $E_X$ with respect to $\Omega_1$. For each fixed $\omega_i$, the errors $\varepsilon_{il}$ are independent and identically distributed over $l$, with $E_\varepsilon(\varepsilon_{il}) = 0$ and $E_\varepsilon(\varepsilon_{il}^2) = \sigma_x^2$. Given condition (B3) together with a local linear smoother, one has $E(\|\hat x_i - x_i\|^2) = E_X[E_\varepsilon\{\int(\hat x_i - x_i)^2\}]$. Using a standard Taylor expansion argument, together with the dominated convergence theorem given $E(\|x_i^{(2)}\|^4) < \infty$, it is easy to verify that $E_X[E_\varepsilon\{\int(\hat x_i - x_i)^2\}] \le C\int[E_X\{(x_i^{(2)})^2\}h^4 + (mh)^{-1}]$ under (B3), yielding $E(\|\hat x_i - x_i\|^2) = o(n^{-1})$ by (B4). Similar arguments lead to $E_X[E_\varepsilon\{\int(\hat x_i - x_i)^4\}] \le C\int[E_X\{(x_i^{(2)})^4\}h^8 + (mh)^{-2}]$, thus $E\{\int(\hat x_i - x_i)^4\} = o(n^{-2})$. Regarding $\hat K = n^{-1}\sum_{i=1}^{n}\hat x_i\otimes\hat x_i$, notice that
\[
\hat K - K = n^{-1}\sum_{i=1}^{n}\{(\hat x_i - x_i)\otimes x_i + (\hat x_i - x_i)\otimes(\hat x_i - x_i) + x_i\otimes(\hat x_i - x_i)\} + \Big(n^{-1}\sum_{i=1}^{n}x_i\otimes x_i - K\Big).
\]
It is easy to see that the smoothing part is dominated in norm by a constant multiple of $n^{-1}\sum_{i=1}^{n}|||(\hat x_i - x_i)\otimes x_i|||$. Since $\{x_i, \hat x_i\}$ is independent of $\{x_{i'}, \hat x_{i'}\}$ for $i \neq i'$, we have
\begin{align*}
E\Big[\Big\{n^{-1}\sum_{i=1}^{n}|||(\hat x_i - x_i)\otimes x_i|||\Big\}^2\Big]
&= \{E|||(\hat x_i - x_i)\otimes x_i|||\}^2 + n^{-1}\mathrm{var}(|||(\hat x_i - x_i)\otimes x_i|||) \\
&\le E(\|\hat x_i - x_i\|^2)E(\|x_i\|^2) + n^{-1}\{E(\|\hat x_i - x_i\|^4)E(\|x_i\|^4)\}^{1/2} = o(n^{-1}).
\end{align*}
For the last part, $E(|||n^{-1}\sum_{i=1}^{n}x_i\otimes x_i - K|||^2) = O(n^{-1})$ by Lemma 3.3 of Hall and Hosseini-Nasab [2009]. This leads to $E(|||\hat K - K|||^2) = O(n^{-1})$.
We next obtain the bound in (b) for $k = 1, \dots, s_n$. First notice that $F_n$ is implied by $\min_{k=1,\dots,s_n}(w_k - w_{k+1}) \ge s_n^{-a-1} \ge Cn^{-\tau/2}$ for some $C$ and any $\tau < 1$, which is fulfilled by $s_n^{a+1}n^{-1/2} \to 0$ in (A3), leading to $P(F_n) \to 1$ as $n \to \infty$. It is easy to see that for $k = 1, \dots, s_n$, $w_k - w_{k+1} \ge s_n^{-a-1} > 2\hat\Delta$ as $n \to \infty$, that is, $s_n \in J_n$ for large $n$. Now we bound $\|\hat\phi_k - \phi_k\|^2$ for $k = 1, \dots, s_n$. By (5.16) in Hall and Horowitz [2007], one has $\|\hat\phi_k - \phi_k\|^2 \le 2u_k^2$, where $u_k^2 = \sum_{v: v\neq k}(w_k - w_v)^{-2}\{\int(\hat K - K)\hat\phi_k\phi_v\}^2$. Also, $P(F_n) \to 1$ implies that $(\hat w_k - \hat w_v)^{-2} \le Cn^{\tau}$ with probability tending to 1. Then we have
\begin{align*}
u_k^2 &\le \sum_{v: v\neq k}(w_k - w_v)^{-2}\Big[2\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2 + 2\Big\{\int(\hat K - K)(\hat\phi_k - \phi_k)\phi_v\Big\}^2\Big] \\
&\le 2\sum_{v: v\neq k}(w_k - w_v)^{-2}\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2 + 2Cn^{\tau}\sum_{v\neq k}\Big\{\int(\hat K - K)(\hat\phi_k - \phi_k)\phi_v\Big\}^2 \\
&\le 4\sum_{v: v\neq k}(w_k - w_v)^{-2}\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2 + 2Cn^{\tau}\hat\Delta^2\|\hat\phi_k - \phi_k\|^2.
\end{align*}
Plugging this into $\|\hat\phi_k - \phi_k\|^2 \le 2u_k^2$, one has
\[
(1 - 4Cn^{\tau}\hat\Delta^2)\|\hat\phi_k - \phi_k\|^2 \le 8\sum_{v: v\neq k}(w_k - w_v)^{-2}\Big\{\int(\hat K - K)\phi_k\phi_v\Big\}^2.
\]
As $\hat\Delta = O_p(n^{-1/2})$ and $\tau < 1$, we have $n^{\tau}\hat\Delta^2 = o_p(1)$, and hence $\|\hat\phi_k - \phi_k\|^2 \le C\sum_{v\neq k}(w_k - w_v)^{-2}\{\int(\hat K - K)\phi_k\phi_v\}^2$. Given the results in (a), by analogy to (5.22) in Hall and Horowitz [2007], $nE[\sum_{v\neq k}(w_k - w_v)^{-2}\{\int(\hat K - K)\phi_k\phi_v\}^2] = O(k^2)$ still holds, and the result follows by Chebyshev's inequality.
To obtain (c), we use Lemma 5.1 of Hall and Horowitz [2007] with $\psi_k = \phi_k$, $\lambda_k = w_k$ and $\hat{L} = \hat{K}$. Since $\inf_{v: v \ne k}|w_k - w_v| > 0$, we have
\[
\hat{\phi}_k - \phi_k = \sum_{v: v \ne k}(\hat{w}_k - w_v)^{-1}\phi_v\int(\hat{K} - K)\hat{\phi}_k\phi_v + \phi_k\int(\hat{\phi}_k - \phi_k)\phi_k
\]
\[
= \sum_{v: v \ne k}(w_k - w_v)^{-1}\phi_v\int(\hat{K} - K)\phi_k\phi_v + \phi_k\int(\hat{\phi}_k - \phi_k)\phi_k + \sum_{v: v \ne k}(w_k - w_v)^{-1}\phi_v\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v - \sum_{v: v \ne k}(\hat{w}_k - w_k)\{(\hat{w}_k - w_v)(w_k - w_v)\}^{-1}\phi_v\int(\hat{K} - K)\hat{\phi}_k\phi_v.
\]
Denote the last three terms by $\hat{\alpha}_k \equiv \hat{\alpha}_{1k} + \hat{\alpha}_{2k} + \hat{\alpha}_{3k}$. For $\hat{\alpha}_{1k}$, one has
\[
\|\hat{\phi}_k - \phi_k\|^2 = 2 - 2\int\hat{\phi}_k\phi_k = 2\Big(\int\phi_k^2 - \int\hat{\phi}_k\phi_k\Big) = -2\int(\hat{\phi}_k - \phi_k)\phi_k,
\]
that is, $\hat{\alpha}_{1k} = -\{\|\hat{\phi}_k - \phi_k\|^2/2\}\phi_k$. From part (b), one has $\|\hat{\phi}_k - \phi_k\| = O_p(kn^{-1/2})$ uniformly in $k = 1, \dots, s_n$, so $\|\hat{\alpha}_{1k}\| = O_p(k^2n^{-1})$. To bound $\hat{\alpha}_{2k}$, noticing that $\|\sum_v c_v\phi_v\|^2 = \sum_v c_v^2$ by orthonormality of $\{\phi_v\}$,
\[
\|\hat{\alpha}_{2k}\| \le \Big[\sum_{v \ne k}(w_k - w_v)^{-2}\Big\{\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v\Big\}^2\Big]^{1/2} \le \delta_k^{-1}\hat{\Delta}\|\hat{\phi}_k - \phi_k\| = O_p(k^{a+2}n^{-1}).
\]
For $\hat{\alpha}_{3k}$, noticing that $\sup_{k \ge 1}|\hat{w}_k - w_k| = O_p(\hat{\Delta})$, we can bound $\|\hat{\alpha}_{3k}\|$ as follows:
\[
\|\hat{\alpha}_{3k}\| \le 2|\hat{w}_k - w_k|\Big[\sum_{v: v \ne k}(w_k - w_v)^{-4}\Big\{\int(\hat{K} - K)\hat{\phi}_k\phi_v\Big\}^2\Big]^{1/2}
\]
\[
\le C\hat{\Delta}\Big[\sum_{v: v \ne k}(w_k - w_v)^{-4}\Big\{\int(\hat{K} - K)\phi_k\phi_v\Big\}^2 + \sum_{v: v \ne k}(w_k - w_v)^{-4}\Big\{\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v\Big\}^2\Big]^{1/2} = C\hat{\Delta}(A_1 + A_2)^{1/2},
\]
where $A_1 = \sum_{v: v \ne k}(w_k - w_v)^{-4}\{\int(\hat{K} - K)\phi_k\phi_v\}^2$ and $A_2 = \sum_{v: v \ne k}(w_k - w_v)^{-4}\{\int(\hat{K} - K)(\hat{\phi}_k - \phi_k)\phi_v\}^2$. By the derivations of part (b), we have
\[
A_1 \le \delta_k^{-2}\sum_{v: v \ne k}(w_k - w_v)^{-2}\Big\{\int(\hat{K} - K)\phi_k\phi_v\Big\}^2 = O_p(k^2\delta_k^{-2}n^{-1}), \qquad A_2 \le \delta_k^{-4}\hat{\Delta}^2\|\hat{\phi}_k - \phi_k\|^2 = O_p(\delta_k^{-4}k^2\hat{\Delta}^4).
\]
Notice that $\delta_k^{-2} = O\{k^{2(a+1)}\} = o(n)$ by (A3), which indicates $A_2 = o_p(k^2\delta_k^{-2}n^{-1})$. Thus $\|\hat{\alpha}_{3k}\| = O_p(k\delta_k^{-1}n^{-1})$, which leads to $\|\hat{\alpha}_k\| = O_p(k^{a+2}n^{-1})$.

For part (d), since $\sum_{k=1}^{s_n}w_k < \infty$ and $\sup_{k=1,\dots,s_n}|\hat{w}_k - w_k| \le |||\hat{K} - K||| = O_p(n^{-1/2})$ by Theorem 1 of Hall and Hosseini-Nasab [2006] and part (a), we can see that $(\sum_{k=1}^{s_n}\hat{w}_k)^{1/2} = O_p(1)$ given $n^{-1/2}s_n = o(1)$, which completes the proof.
Proof of Lemma 2. For $\hat{\theta}_{jk}^{(1)}$ and any fixed $j$, since $\hat{\xi}_{ijk} - \xi_{ijk} = \int\hat{x}_{ij}(\hat{\phi}_{jk} - \phi_{jk}) + \int(\hat{x}_{ij} - x_{ij})\phi_{jk}$, we have
\[
\hat{\theta}_{jk}^{(1)} = \sum_{i=1}^n(\hat{\xi}_{ijk} - \xi_{ijk})^2w_{jk}^{-1} \le 2\sum_{i=1}^n\Big\{\int\hat{x}_{ij}(\hat{\phi}_{jk} - \phi_{jk})\Big\}^2w_{jk}^{-1} + 2\sum_{i=1}^n\Big\{\int(\hat{x}_{ij} - x_{ij})\phi_{jk}\Big\}^2w_{jk}^{-1}
\le 2\sum_{i=1}^n\big(\|\hat{x}_{ij}\|^2\|\hat{\phi}_{jk} - \phi_{jk}\|^2 + \|\phi_{jk}\|^2\|\hat{x}_{ij} - x_{ij}\|^2\big)w_{jk}^{-1}.
\]
Given Lemma 1 and (B1), we know $E(\|\hat{x}_{ij}\|^2) \le 2E(\|\hat{x}_{ij} - x_{ij}\|^2) + 2E(\|x_{ij}\|^2) = O(1)$; hence $\|\hat{x}_{ij}\|^2 = O_p(1)$. From Lemma 1, we have $\hat{\theta}_{jk}^{(1)} = \sum_{i=1}^n\{O_p(1)O_p(k^2n^{-1}) + o_p(n^{-1})\}k^a = O_p(k^{a+2})$, uniformly for $k = 1, \dots, s_n$.

For $\hat{\theta}_{j_1j_2k_1k_2}^{(2)}$, it is obvious that
\[
|\hat{\theta}_{j_1j_2k_1k_2}^{(2)}| = \Big|n^{-1}\sum_{i=1}^n(\hat{\xi}_{ij_1k_1}\hat{\xi}_{ij_2k_2} - \xi_{ij_1k_1}\xi_{ij_2k_2})(w_{j_1k_1}w_{j_2k_2})^{-1/2}\Big|
\]
\[
= n^{-1}\Big|\sum_{i=1}^n\hat{\xi}_{ij_1k_1}(\hat{\xi}_{ij_2k_2} - \xi_{ij_2k_2})(w_{j_1k_1}w_{j_2k_2})^{-1/2} + \sum_{i=1}^n\xi_{ij_2k_2}(\hat{\xi}_{ij_1k_1} - \xi_{ij_1k_1})(w_{j_1k_1}w_{j_2k_2})^{-1/2}\Big|
\]
\[
\le n^{-1}\Big(\sum_{i=1}^n\hat{\xi}_{ij_1k_1}^2w_{j_1k_1}^{-1}\Big)^{1/2}\Big\{\sum_{i=1}^n(\hat{\xi}_{ij_2k_2} - \xi_{ij_2k_2})^2w_{j_2k_2}^{-1}\Big\}^{1/2} + n^{-1}\Big(\sum_{i=1}^n\xi_{ij_2k_2}^2w_{j_2k_2}^{-1}\Big)^{1/2}\Big\{\sum_{i=1}^n(\hat{\xi}_{ij_1k_1} - \xi_{ij_1k_1})^2w_{j_1k_1}^{-1}\Big\}^{1/2}.
\]
Since $E(\sum_{i=1}^n\xi_{ij_2k_2}^2w_{j_2k_2}^{-1}) = n$ for any $k_2 = 1, \dots, s_n$, we have $\sum_{i=1}^n\xi_{ij_2k_2}^2w_{j_2k_2}^{-1} = O_p(n)$, uniformly for $k_2 = 1, \dots, s_n$. Moreover,
\[
\sum_{i=1}^n\hat{\xi}_{ij_1k_1}^2w_{j_1k_1}^{-1} \le 2\sum_{i=1}^n(\hat{\xi}_{ij_1k_1} - \xi_{ij_1k_1})^2w_{j_1k_1}^{-1} + 2\sum_{i=1}^n\xi_{ij_1k_1}^2w_{j_1k_1}^{-1} = O_p(k_1^{a+2} + n) = O_p(n),
\]
uniformly for $k_1 = 1, \dots, s_n$. In conclusion, we have
\[
|\hat{\theta}_{j_1j_2k_1k_2}^{(2)}| = n^{-1}O_p(n^{1/2})O_p(k_2^{a/2+1}) + n^{-1}O_p(n^{1/2})O_p(k_1^{a/2+1}) = O_p(k_1^{a/2+1}n^{-1/2} + k_2^{a/2+1}n^{-1/2}),
\]
uniformly for $k_1, k_2 = 1, \dots, s_n$.
For $\hat{\theta}_{j_1j_2k_1k_2}^{(3)}$, $E(\hat{\theta}_{j_1j_2k_1k_2}^{(3)})^2 \le n^{-1}\{E(\xi_{ij_1k_1}^4w_{j_1k_1}^{-2})E(\xi_{ij_2k_2}^4w_{j_2k_2}^{-2})\}^{1/2} \le c_1n^{-1}$, uniformly for $k_1, k_2 = 1, \dots, s_n$, by (B1). It follows that $\hat{\theta}_{j_1j_2k_1k_2}^{(3)} = O_p(n^{-1/2})$, uniformly for $k_1, k_2 = 1, \dots, s_n$. Moreover, it is immediate that $\hat{\theta}_{j_1j_2k_1k_2}^{(4)} = \hat{\theta}_{j_1j_2k_1k_2}^{(2)} + \hat{\theta}_{j_1j_2k_1k_2}^{(3)} = O_p(n^{-1/2}k_1^{a/2+1} + n^{-1/2}k_2^{a/2+1})$, uniformly for $k_1, k_2 = 1, \dots, s_n$.
Based on (B5), we conclude that $z_{il}^4 = O_p(1)$, uniformly for $l = 1, \dots, p_n$. For $\hat{\theta}_{jkl}^{(5)}$, we have
\[
|\hat{\theta}_{jkl}^{(5)}| \le n^{-1}\Big\{\sum_{i=1}^n(\hat{\xi}_{ijk} - \xi_{ijk})^2w_{jk}^{-1}\Big\}^{1/2}\Big(\sum_{i=1}^nz_{il}^2\Big)^{1/2} = n^{-1}O_p(k^{a/2+1})O_p(n^{1/2}) = O_p(k^{a/2+1}n^{-1/2}),
\]
uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. Furthermore,
\[
E(\hat{\theta}_{jkl}^{(6)})^2 \le n^{-1}\{E(\xi_{ijk}^4w_{jk}^{-2})E(z_{il}^4)\}^{1/2} = O(n^{-1}),
\]
uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. Thus $\hat{\theta}_{jkl}^{(6)} = O_p(n^{-1/2})$, uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. It then follows that $\hat{\theta}_{jkl}^{(7)} = \hat{\theta}_{jkl}^{(5)} + \hat{\theta}_{jkl}^{(6)} = O_p(k^{a/2+1}n^{-1/2})$, uniformly for $k = 1, \dots, s_n$ and $l = 1, \dots, p_n$. For $\hat{\theta}_{l_1l_2}^{(8)}$, we have $E(\hat{\theta}_{l_1l_2}^{(8)})^2 \le n^{-1}\{E(z_{il_1}^4)E(z_{il_2}^4)\}^{1/2} = O(n^{-1})$, uniformly for $l_1, l_2 = 1, \dots, p_n$, and this entails that $\hat{\theta}_{l_1l_2}^{(8)} = O_p(n^{-1/2})$, uniformly for $l_1, l_2 = 1, \dots, p_n$.
For $\hat{\vartheta}_{jl}^{(1)}$, given (A4), we have
\[
E|\hat{\vartheta}_{jl}^{(1)}| \le \sum_{i=1}^n\sum_{k=s_n+1}^{\infty}|b_{jk0}|E|\xi_{ijk}z_{il}| \le \sum_{i=1}^n\sum_{k=s_n+1}^{\infty}|b_{jk0}|\{E(\xi_{ijk}^2)E(z_{il}^2)\}^{1/2} \le c_1\sum_{i=1}^n\sum_{k=s_n+1}^{\infty}k^{-b}k^{-1/2} = O(ns_n^{-b+1/2}) = o(n^{1/2}).
\]
Hence $\hat{\vartheta}_{jl}^{(1)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$. For $\hat{\vartheta}_{jl}^{(2)}$, we have
\[
\hat{\vartheta}_{jl}^{(2)} = \sum_{i=1}^n\sum_{k=1}^{s_n}\Big\{\int\hat{x}_{ij}(\hat{\phi}_{jk} - \phi_{jk}) + \int(\hat{x}_{ij} - x_{ij})\phi_{jk}\Big\}(\hat{b}_{jk} - b_{jk0})z_{il}
\]
\[
= \int\Big(\sum_{i=1}^n\hat{x}_{ij}z_{il}\Big)\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})(\hat{b}_{jk} - b_{jk0}) + \int\Big\{\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\}\sum_{k=1}^{s_n}\phi_{jk}(\hat{b}_{jk} - b_{jk0}).
\]
It follows that
\[
(\hat{\vartheta}_{jl}^{(2)})^2 \le 2\Big\|\sum_{i=1}^n\hat{x}_{ij}z_{il}\Big\|^2\Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})(\hat{b}_{jk} - b_{jk0})\Big\|^2 + 2\Big\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\|^2\Big\|\sum_{k=1}^{s_n}\phi_{jk}(\hat{b}_{jk} - b_{jk0})\Big\|^2.
\]
Since
\[
E\Big\|\sum_{i=1}^n\hat{x}_{ij}z_{il}\Big\|^2 \le E\Big(\sum_{i=1}^n\|\hat{x}_{ij}\||z_{il}|\Big)^2 \le n^2\{E(\|\hat{x}_{1j}\||z_{1l}|)\}^2 + n\{E(\|\hat{x}_{1j}\|^4)E(z_{1l}^4)\}^{1/2} = O(n^2),
\]
we have $\|\sum_{i=1}^n\hat{x}_{ij}z_{il}\|^2 = O_p(n^2)$. Similarly,
\[
E\Big\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\|^2 \le n^2\{E(\|\hat{x}_{ij} - x_{ij}\||z_{il}|)\}^2 + n\{E(\|\hat{x}_{ij} - x_{ij}\|^4)E(z_{il}^4)\}^{1/2} = o(n),
\]
leading to $\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\|^2 = o_p(n)$. Moreover,
\[
\Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})(\hat{b}_{jk} - b_{jk0})\Big\|^2 \le \|\hat{b}_j^{(1)} - b_{j0}^{(1)}\|_2^2\sum_{k=1}^{s_n}\|\hat{\phi}_{jk} - \phi_{jk}\|^2w_{jk}^{-1} = O_p(r_n^2/n)O_p\Big\{\sum_{k=1}^{s_n}(k^2/n)k^a\Big\} = O_p(r_n^2s_n^{a+3}/n^2),
\]
and
\[
\Big\|\sum_{k=1}^{s_n}\phi_{jk}(\hat{b}_{jk} - b_{jk0})\Big\|^2 \le \Big(\sum_{k=1}^{s_n}|\hat{b}_{jk} - b_{jk0}|\Big)^2 \le \|\hat{b}_j^{(1)} - b_{j0}^{(1)}\|_2^2\sum_{k=1}^{s_n}w_{jk}^{-1} = O_p(r_n^2s_n^{a+1}/n),
\]
by Theorem 1. In summary, we get $(\hat{\vartheta}_{jl}^{(2)})^2 = o_p(r_n^2s_n^{a+3}) = o_p(q_ns_n^{a+3} + s_n^{a+4})$, uniformly for $l = 1, \dots, p_n$. Under conditions (A3) and (A5), $\hat{\vartheta}_{jl}^{(2)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.
For $\hat{\vartheta}_{jl}^{(3)}$, we have
\[
(\hat{\vartheta}_{jl}^{(3)})^2 = \Big[\int\Big\{\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\}\Big(\sum_{k=1}^{s_n}\phi_{jk}b_{jk0}\Big)\Big]^2 \le \Big\|\sum_{i=1}^n(\hat{x}_{ij} - x_{ij})z_{il}\Big\|^2\Big\|\sum_{k=1}^{s_n}\phi_{jk}b_{jk0}\Big\|^2 \le \Big(\sum_{k=1}^{s_n}|b_{jk0}|\Big)^2o_p(n) = o_p(n),
\]
hence $\hat{\vartheta}_{jl}^{(3)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.
Next, we show that $\hat{\vartheta}_{jl}^{(4)} = \hat{\vartheta}_{jl}^{(5)} + \hat{\vartheta}_{jl}^{(6)} + \hat{\vartheta}_{jl}^{(7)} + \hat{\vartheta}_{jl}^{(8)}$. Recall that $\Xi_j = (\Xi_{j1}, \dots, \Xi_{jq_n})^{\mathrm{T}}$ with $E\{x_{ij}(t)z_{il}\} = \Xi_{jl}(t)$. By Lemma 1, we have the expression
\[
\hat{\phi}_{jk}(t) - \phi_{jk}(t) = n^{-1/2}\sum_{v: v \ne k}(w_{jk} - w_{jv})^{-1}\phi_{jv}(t)\int\Re_j\phi_{jk}\phi_{jv} + \hat{\alpha}_{jk}(t),
\]
where $\|\hat{\alpha}_{jk}\| = O_p(k^{a+2}n^{-1})$, $\Re_j = n^{1/2}(\hat{K}_j - K_j)$, and the $O_p(\cdot)$ is uniform in $k = 1, \dots, s_n$. Since
\[
\hat{\vartheta}_{jl}^{(4)} = \sum_{i=1}^n\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk0}z_{il} = \int\Big(\sum_{i=1}^nx_{ij}z_{il}\Big)\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})b_{jk0},
\]
substituting the expression for $\hat{\phi}_{jk}(t) - \phi_{jk}(t)$ into the above equation, we immediately get $\hat{\vartheta}_{jl}^{(4)} = \hat{\vartheta}_{jl}^{(5)} + \hat{\vartheta}_{jl}^{(6)} + \hat{\vartheta}_{jl}^{(7)} + \hat{\vartheta}_{jl}^{(8)}$.
For $\hat{\vartheta}_{jl}^{(5)}$, it is obvious that
\[
(\hat{\vartheta}_{jl}^{(5)})^2 \le \Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})b_{jk0}\Big\|^2\Big\|\sum_{i=1}^n(x_{ij}z_{il} - \Xi_{jl})\Big\|^2,
\]
where
\[
E\Big\|\sum_{i=1}^n(x_{ij}z_{il} - \Xi_{jl})\Big\|^2 = \int E\Big[\Big\{\sum_{i=1}^n(x_{ij}z_{il} - \Xi_{jl})\Big\}^2\Big] = n\int\mathrm{var}(x_{ij}z_{il}) \le n\int\{E(x_{ij}^4)E(z_{il}^4)\}^{1/2} = O(n),
\]
and
\[
\Big\|\sum_{k=1}^{s_n}(\hat{\phi}_{jk} - \phi_{jk})b_{jk0}\Big\| \le \sum_{k=1}^{s_n}|b_{jk0}|\|\hat{\phi}_{jk} - \phi_{jk}\| = O_p\Big(\sum_{k=1}^{s_n}k^{-b}kn^{-1/2}\Big) = O_p(n^{-1/2}),
\]
thus $(\hat{\vartheta}_{jl}^{(5)})^2 = O_p(1) = o_p(n)$, and it follows that $\hat{\vartheta}_{jl}^{(5)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.

For $\hat{\vartheta}_{jl}^{(6)}$, we have
\[
(\hat{\vartheta}_{jl}^{(6)})^2 \le n^2\int(Ex_{ij}z_{il})^2\Big\|\sum_{k=1}^{s_n}b_{jk0}\hat{\alpha}_{jk}\Big\|^2 \le n^2\Big\{\int E(x_{ij}^4)E(z_{il}^4)\Big\}^{1/2}\Big(\sum_{k=1}^{s_n}|b_{jk0}|\|\hat{\alpha}_{jk}\|\Big)^2 = n^2O_p\Big(\sum_{k=1}^{s_n}k^{-b}k^{a+2}n^{-1}\Big)^2 = o_p(n),
\]
hence $\hat{\vartheta}_{jl}^{(6)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.

From the proof of Lemma 1, it is easy to see that $\Re_j - \Re_j^* = n^{1/2}(\hat{K}_j - \tilde{K}_j) = o_p(1)$, where $\tilde{K}_j = n^{-1}\sum_{i=1}^nx_{ij}\otimes x_{ij}$, from which it follows that $\hat{\vartheta}_{jl}^{(7)} = o_p(n^{1/2})$, uniformly for $l = 1, \dots, p_n$.
Proof of Lemma 3. First, it is obvious that $|\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\min}(U_1)| \le \|\hat{N}_1^{\mathrm{T}}\hat{N}_1/n - U_1\|_1$, where $\|\cdot\|_1$ is the $L_1$ norm for matrices. Since
\[
\|\hat{N}_1^{\mathrm{T}}\hat{N}_1/n - U_1\|_1 \le O_p\Big(\sum_{k_1=1}^{s_n}|\hat{\theta}_{j_1j_2k_1s_n}^{(4)}| + \sum_{l=1}^{q_n}|\hat{\theta}_{j_1s_nl}^{(7)}| + \sum_{k_1=1}^{s_n}|\hat{\theta}_{j_1k_1q_n}^{(7)}| + \sum_{l_1=1}^{q_n}|\hat{\theta}_{l_1q_n}^{(8)}|\Big)
\]
\[
= O_p\Big\{\sum_{k_1=1}^{s_n}(k_1^{a/2+1}n^{-1/2} + s_n^{a/2+1}n^{-1/2}) + \sum_{l=1}^{q_n}s_n^{a/2+1}n^{-1/2} + \sum_{k_1=1}^{s_n}k_1^{a/2+1}n^{-1/2} + \sum_{l_1=1}^{q_n}n^{-1/2}\Big\} = O_p(s_n^{a/2+2}n^{-1/2} + q_ns_n^{a/2+1}n^{-1/2})
\]
by Lemma 2(a), we have $|\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\min}(U_1)| = O_p(s_n^{a/2+2}n^{-1/2} + q_ns_n^{a/2+1}n^{-1/2})$. Under conditions (A3) and (A5), it is obvious that $|\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\min}(U_1)| = o_p(1)$. Similarly, $|\lambda_{\max}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n) - \lambda_{\max}(U_1)| \le \|\hat{N}_1^{\mathrm{T}}\hat{N}_1/n - U_1\|_1 = o_p(1)$, and we skip the details here.
Proof of Lemma 4. By Lemma 3 and (B5), we know that $\hat{N}_1^{\mathrm{T}}\hat{N}_1$ is invertible; hence $P_{\hat{N}_1}$ exists. For $\Lambda_1$, we have
\[
\Lambda_1 = P_{\hat{N}_1}(Y_M - \hat{N}_1\eta_{10}) = P_{\hat{N}_1}\{Y_M - N_1\eta_{10} - \nu + \nu + (N_1 - \hat{N}_1)\eta_{10}\} = P_{\hat{N}_1}\{\varepsilon + \nu + (N_1 - \hat{N}_1)\eta_{10}\},
\]
where $\varepsilon = Y_M - N_1\eta_{10} - \nu = (\varepsilon_1, \dots, \varepsilon_n)^{\mathrm{T}}$ and $\nu = (\nu_1, \dots, \nu_n)^{\mathrm{T}}$ with $\nu_i = \sum_{j=1}^g\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}$.

For $P_{\hat{N}_1}\varepsilon$, we have
\[
E\|P_{\hat{N}_1}\varepsilon\|^2 = E(\varepsilon^{\mathrm{T}}P_{\hat{N}_1}\varepsilon) = E\{E(\varepsilon^{\mathrm{T}}P_{\hat{N}_1}\varepsilon \mid \hat{N}_1)\} = E[\mathrm{tr}\{P_{\hat{N}_1}E(\varepsilon\varepsilon^{\mathrm{T}})\}] = \sigma^2\mathrm{tr}(P_{\hat{N}_1}) = \sigma^2(q_n + gs_n) = O(q_n + s_n),
\]
hence $\|P_{\hat{N}_1}\varepsilon\|^2 = O_p(q_n + s_n)$.

For $P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}$, writing $\tilde{\xi}_{ijk} = \int\hat{x}_{ij}\phi_{jk}$, we have
\[
\|P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}\|^2 \le \|(N_1 - \hat{N}_1)\eta_{10}\|^2 \le O\Big[\sum_{i=1}^n\Big\{\sum_{j=1}^g\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk0}\Big\}^2\Big]
\]
\[
\le O\Big[2g\sum_{i=1}^n\sum_{j=1}^g\Big\{\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \tilde{\xi}_{ijk})b_{jk0}\Big\}^2 + 2g\sum_{i=1}^n\sum_{j=1}^g\Big\{\sum_{k=1}^{s_n}(\tilde{\xi}_{ijk} - \xi_{ijk})b_{jk0}\Big\}^2\Big]
\]
\[
\le O\Big[\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij}\|^2O_p\Big\{\Big(\sum_{k=1}^{s_n}k^{-b}kn^{-1/2}\Big)^2\Big\}\Big] + O_p\Big[\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij} - x_{ij}\|^2\Big(\sum_{k=1}^{s_n}k^{-b}\Big)^2\Big] = O_p(1),
\]
since $b > 2$ implies that $\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij}\|^2O_p\{(\sum_{k=1}^{s_n}k^{-b}kn^{-1/2})^2\} = O_p(1)$, and Lemma 1 entails that $\sum_{j=1}^g\sum_{i=1}^n\|\hat{x}_{ij} - x_{ij}\|^2(\sum_{k=1}^{s_n}k^{-b})^2 = O_p(1)$. It then follows that $\|P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}\|^2 = O_p(1)$.

For $P_{\hat{N}_1}\nu$, it is obvious that $\|P_{\hat{N}_1}\nu\|^2 \le \|\nu\|^2$. For $i = 1, \dots, n$, we have
\[
E(\nu_i^2) = O\Big\{\sum_{j=1}^g\mathrm{var}\Big(\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}\Big)\Big\} = O\Big(\sum_{j=1}^g\sum_{k=s_n+1}^{\infty}b_{jk0}^2w_{jk}\Big) = O\Big(\sum_{j=1}^g\sum_{k=s_n+1}^{\infty}k^{-2b}k^{-1}\Big) = O(s_n^{-2b}).
\]
It follows that $\|P_{\hat{N}_1}\nu\|^2 = O_p(ns_n^{-2b})$. In summary,
\[
\|\Lambda_1\|_2^2 \le O(\|P_{\hat{N}_1}\varepsilon\|^2 + \|P_{\hat{N}_1}(N_1 - \hat{N}_1)\eta_{10}\|^2 + \|P_{\hat{N}_1}\nu\|^2) = O_p(q_n + s_n + 1 + ns_n^{-2b}) = O_p(q_n + s_n) = O_p(r_n^2),
\]
where $r_n^2 = q_n + s_n$. This completes the proof.
1.6.4 Proofs of Main Theorems
Proof of Theorem 1. First, we restrict $\bar{Q}_n(\eta)$ to the subspace in which the true zero parameters are set to $0$, that is, $\{\eta \in \mathbb{R}^{ds_n+p_n}: b_j^{(1)} = 0,\ j = g+1, \dots, d,\ \gamma^{(2)} = 0\}$, and prove consistency in this $(gs_n + q_n)$-dimensional space. Define the constrained penalized function
\[
\bar{Q}_n(\eta_1) = \sum_{i=1}^n\Big\{y_i - \sum_{j=1}^g\sum_{k=1}^{s_n}(\hat{\xi}_{ijk}\hat{w}_{jk}^{-1/2})b_{jk} - z_i^{(1)\mathrm{T}}\gamma^{(1)}\Big\}^2 + 2n\sum_{l=1}^{q_n}J_{\lambda_n}(|\gamma_l|) + 2n\sum_{j=1}^gJ_{\lambda_{jn}}(\|b_j^{(1)}\|),
\]
where $\eta_1 = (b_1^{(1)\mathrm{T}}, \dots, b_g^{(1)\mathrm{T}}, \gamma^{(1)\mathrm{T}})^{\mathrm{T}}$ and $z_i^{(1)} = (z_{i1}, \dots, z_{iq_n})^{\mathrm{T}}$. We now show that there exists a local minimizer $\hat{\eta}_1$ such that $\|\hat{\eta}_1 - \eta_{10}\| = O_p(r_nn^{-1/2})$ with $r_n = (q_n + s_n)^{1/2}$. Let $\alpha_n = r_nn^{-1/2}$; we aim to show that for any $\epsilon > 0$ there exists a large constant $C$ such that $P\{\inf_{\|u\|=C}\bar{Q}_n(\eta_{10} + \alpha_nu) > \bar{Q}_n(\eta_{10})\} \ge 1 - \epsilon$ for large $n$, which implies that there exists a local minimizer $\hat{\eta}_1$ of $\bar{Q}_n(\eta_1)$ such that $\|\hat{\eta}_1 - \eta_{10}\| = O_p(\alpha_n)$. Here $u = (u^{(1)\mathrm{T}}, u^{\gamma\mathrm{T}})^{\mathrm{T}}$ with $u^{(1)} = (u_1^{(1)\mathrm{T}}, \dots, u_g^{(1)\mathrm{T}})^{\mathrm{T}}$ and $u^{\gamma} = (u_1^{\gamma}, \dots, u_{q_n}^{\gamma})^{\mathrm{T}}$. We have
\[
\bar{Q}_n(\eta_{10} + \alpha_nu) - \bar{Q}_n(\eta_{10}) \ge \|\hat{N}_1\alpha_nu\|^2 - 2\Lambda_1^{\mathrm{T}}\hat{N}_1\alpha_nu + 2n\Big[\sum_{l=1}^{q_n}\{J_{\lambda_n}(|\gamma_{l0} + \alpha_nu_l^{\gamma}|) - J_{\lambda_n}(|\gamma_{l0}|)\} + \sum_{j=1}^g\{J_{\lambda_{jn}}(\|b_{j0}^{(1)} + \alpha_nA_j^{-1}u_j^{(1)}\|) - J_{\lambda_{jn}}(\|b_{j0}^{(1)}\|)\}\Big]
\]
\[
\ge n\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n^2\|u\|^2 - 2n^{1/2}\|\Lambda_1\|_2\lambda_{\max}^{1/2}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n\|u\|
\]
\[
+ 2n\Big[\sum_{l=1}^{q_n}\{J'_{\lambda_n}(|\gamma_{l0}|)\mathrm{sgn}(\gamma_{l0})\alpha_nu_l^{\gamma} + J''_{\lambda_n}(|\gamma_{l0}|)\alpha_n^2(u_l^{\gamma})^2(1 + o(1))\} + \sum_{j=1}^g\{J'_{\lambda_{jn}}(\|b_{j0}^{(1)}\|)\alpha_n\|A_j^{-1}u_j^{(1)}\| + J''_{\lambda_{jn}}(\|b_{j0}^{(1)}\|)\alpha_n^2\|A_j^{-1}u_j^{(1)}\|^2(1 + o(1))\}\Big].
\]
The Taylor expansion in the second inequality holds because $\alpha_n\hat{w}_{js_n}^{-1/2} \le r_ns_n^{a/2}n^{-1/2} = o(1)$ by condition (A5). As for the smoothly clipped absolute deviation penalty, we have $J'_{\lambda_n}(|\gamma_{l0}|) = J''_{\lambda_n}(|\gamma_{l0}|) = 0$ for all $l = 1, \dots, q_n$ under condition (A7), and $J'_{\lambda_{jn}}(\|b_{j0}^{(1)}\|) = J''_{\lambda_{jn}}(\|b_{j0}^{(1)}\|) = 0$ for all $j = 1, \dots, g$, since $\|b_{j0}^{(1)}\| \ge C_1$ for some constant $C_1$. Thus, we can see that
\[
\bar{Q}_n(\eta_{10} + \alpha_nu) - \bar{Q}_n(\eta_{10}) \ge n\lambda_{\min}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n^2\|u\|_2^2 - 2n^{1/2}\|\Lambda_1\|_2\lambda_{\max}^{1/2}(\hat{N}_1^{\mathrm{T}}\hat{N}_1/n)\alpha_n\|u\|_2 \ge c_3n\alpha_n^2\|u\|^2 - c_4n^{1/2}r_n\alpha_n\|u\| = c_3r_n^2\|u\|^2 - c_4r_n^2\|u\|,
\]
where $c_3, c_4$ are positive constants, and the second inequality is by Lemma 3 and (B5). When $C$ is large enough, we have $\bar{Q}_n(\eta_{10} + \alpha_nu) - \bar{Q}_n(\eta_{10}) > 0$, which implies that there exists a local minimizer $\hat{\eta}_1$ of $\bar{Q}_n(\eta_1)$ such that $\|\hat{\eta}_1 - \eta_{10}\| = O_p(\alpha_n)$.
Next, we denote $\hat{\eta} = (\hat{b}_1^{(1)\mathrm{T}}, \dots, \hat{b}_g^{(1)\mathrm{T}}, 0^{\mathrm{T}}, \dots, 0^{\mathrm{T}}, \hat{\gamma}^{(1)\mathrm{T}}, 0^{\mathrm{T}})^{\mathrm{T}}$, where $\hat{\eta}_1 = (\hat{b}_1^{(1)\mathrm{T}}, \dots, \hat{b}_g^{(1)\mathrm{T}}, \hat{\gamma}^{(1)\mathrm{T}})^{\mathrm{T}}$; our second goal is to show that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$ over the whole space $\mathbb{R}^{ds_n+p_n}$. Denote $S_l(\eta) = \partial(2n)^{-1}\|Y_M - \hat{M}b^{(1)} - Z_M\gamma\|^2/\partial\gamma_l$ and $S_j^*(\eta) = \partial(2n)^{-1}\|Y_M - \hat{M}b^{(1)} - Z_M\gamma\|^2/\partial b_j^{(1)}$. By the Karush-Kuhn-Tucker condition, it suffices to show that $\hat{\eta}$ satisfies the following conditions:
\[
S_l(\hat{\eta}) = 0 \text{ and } |\hat{\gamma}_l| \ge a\lambda_n \text{ for } l = 1, \dots, q_n;
\]
\[
|S_l(\hat{\eta})| \le \lambda_n \text{ and } |\hat{\gamma}_l| < \lambda_n \text{ for } l = q_n+1, \dots, p_n;
\]
\[
S_j^*(\hat{\eta}) = 0 \text{ and } \|\hat{b}_j^{(1)}\| \ge a\lambda_{jn} \text{ for } j = 1, \dots, g;
\]
\[
\|S_j^*(\hat{\eta})\| \le \lambda_{jn} \text{ and } \|\hat{b}_j^{(1)}\| < \lambda_{jn} \text{ for } j = g+1, \dots, d;
\]
so that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$. When $l = 1, \dots, q_n$, since $\min_{l=1,\dots,q_n}|\hat{\gamma}_l| \ge \min_{l=1,\dots,q_n}|\gamma_{l0}| - \|\hat{\gamma}^{(1)} - \gamma_0^{(1)}\|$ and $\|\hat{\gamma}^{(1)} - \gamma_0^{(1)}\| = o_p(\lambda_n)$, under (A7) we have $\min_{l=1,\dots,q_n}|\hat{\gamma}_l|/\lambda_n \to \infty$ in probability. It follows that
\[
P(|\hat{\gamma}_l| \ge a\lambda_n \text{ for } l = 1, \dots, q_n) \to 1. \quad (1.6)
\]
When $j = 1, \dots, g$, since $\|\hat{b}_j^{(1)}\| \ge \|b_{j0}^{(1)}\| - \|\hat{b}_j^{(1)} - b_{j0}^{(1)}\|$, $\|\hat{b}_j^{(1)} - b_{j0}^{(1)}\| = o_p(s_n^{a/2}\alpha_n) = o_p(1)$ and $\min_{j=1,\dots,g}\|b_{j0}^{(1)}\| > C_1$ for some positive constant $C_1$, we have $\min_{j=1,\dots,g}\|\hat{b}_j^{(1)}\|/\lambda_{jn} \to \infty$ in probability. It follows that
\[
P(\|\hat{b}_j^{(1)}\| \ge a\lambda_{jn} \text{ for } j = 1, \dots, g) \to 1. \quad (1.7)
\]
When $l = 1, \dots, q_n$ and $j = 1, \dots, g$, $S_l(\hat{\eta}) = 0$ and $S_j^*(\hat{\eta}) = 0$ hold trivially, since (1.6) and (1.7) hold and $\hat{\eta}_1$ is a local minimizer of $\bar{Q}_n(\eta_1)$.

Next, we show that $\hat{\eta}$ satisfies $|S_l(\hat{\eta})| \le \lambda_n$ and $|\hat{\gamma}_l| < \lambda_n$ for $l = q_n+1, \dots, p_n$, as well as $\|S_j^*(\hat{\eta})\| \le \lambda_{jn}$ and $\|\hat{b}_j^{(1)}\| < \lambda_{jn}$ for $j = g+1, \dots, d$. Since $\hat{\gamma}_l = 0$ for $l = q_n+1, \dots, p_n$ and $\hat{b}_j^{(1)} = 0$ for $j = g+1, \dots, d$, it suffices to show that
\[
P(|S_l(\hat{\eta})| > \lambda_n \text{ for some } l = q_n+1, \dots, p_n) \to 0, \quad (1.8)
\]
and
\[
P(\|S_j^*(\hat{\eta})\| > \lambda_{jn} \text{ for some } j = g+1, \dots, d) \to 0. \quad (1.9)
\]
Denote $d_n = (S_{q_n+1}(\hat{\eta}), \dots, S_{p_n}(\hat{\eta}))^{\mathrm{T}} = -n^{-1}Z_M^{(2)\mathrm{T}}(Y_M - \hat{N}_1\hat{\eta}_1)$, where
\[
Y_M - \hat{N}_1\hat{\eta}_1 = \varepsilon + \nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1),
\]
and $\nu = (\nu_1, \dots, \nu_n)^{\mathrm{T}}$ with $\nu_i = \sum_{j=1}^g\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk0}$. From the proof of Lemma 4 and the previous derivations, we have
\[
\|\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\|^2 = O_p(r_n^2) = o_p(n\lambda_n^2).
\]
Moreover, we have $\max_{l=q_n+1,\dots,p_n}\sum_{i=1}^nz_{il}^2 = O_p(n)$ by (B5). Thus
\[
\|n^{-1}Z_M^{(2)\mathrm{T}}\{\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\}\|_{\infty} \le n^{-1}\Big(\max_{l=q_n+1,\dots,p_n}\sum_{i=1}^nz_{il}^2\Big)^{1/2}\|\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\| = o_p(\lambda_n).
\]
Next, we denote $n^{-1/2}Z_M^{(2)\mathrm{T}}\varepsilon = (\delta_{q_n+1}, \dots, \delta_{p_n})^{\mathrm{T}}$ with $\delta_l = n^{-1/2}\sum_{i=1}^nz_{il}\varepsilon_i$ for $l = q_n+1, \dots, p_n$. As $\{\varepsilon_i, i = 1, \dots, n\}$ and $\{z_{il}, i = 1, \dots, n\}$ are sub-Gaussian random variables and independent of each other given (B5), each $\delta_l$ is a sub-exponential random variable with uniformly bounded second moment for $l = q_n+1, \dots, p_n$. Thus, for any constant $C > 0$, there exist positive constants $C_1$ and $C_2$ such that, uniformly for $l = q_n+1, \dots, p_n$,
\[
P(|\delta_l| > Cn^{1/2}\lambda_n) \le C_2\exp(-2^{-1}C_1n^{1/2}\lambda_n),
\]
and it follows that
\[
P(\|n^{-1/2}Z_M^{(2)\mathrm{T}}\varepsilon\|_{\infty} > Cn^{1/2}\lambda_n) \le \sum_{l=q_n+1}^{p_n}P(|\delta_l| > Cn^{1/2}\lambda_n) \le O\{p_n\exp(-2^{-1}C_1n^{1/2}\lambda_n)\} = O\{\exp(n^{\alpha} - 2^{-1}C_1n^{1/2}\lambda_n)\} = o(1),
\]
where the last two equalities are by (A6) and (A7). Hence, it is obvious that $\|n^{-1}Z_M^{(2)\mathrm{T}}\varepsilon\|_{\infty} = o_p(\lambda_n)$. In conclusion, we have $\|d_n\|_{\infty} \le \|n^{-1}Z_M^{(2)\mathrm{T}}\varepsilon\|_{\infty} + \|n^{-1}Z_M^{(2)\mathrm{T}}\{\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\}\|_{\infty} = o_p(\lambda_n)$, and this implies that (1.8) holds.
Next, we show (1.9). For $j = g+1, \dots, d$, we have $S_j^*(\hat{\eta}) = -n^{-1}\hat{M}_j^{\mathrm{T}}(Y_M - \hat{N}_1\hat{\eta}_1) = (S_{j1}, \dots, S_{js_n})^{\mathrm{T}}$, such that $\sum_{k=1}^{s_n}S_{jk}^2$ is bounded by
\[
2\sum_{k=1}^{s_n}\Big(n^{-1}\sum_{i=1}^n\hat{\xi}_{ijk}\varepsilon_i\Big)^2 + 2n^{-2}\sum_{k=1}^{s_n}\sum_{i=1}^n\hat{\xi}_{ijk}^2\|\nu + (N_1 - \hat{N}_1)\eta_{10} + \hat{N}_1(\eta_{10} - \hat{\eta}_1)\|^2.
\]
For $\sum_{k=1}^{s_n}(n^{-1}\sum_{i=1}^n\hat{\xi}_{ijk}\varepsilon_i)^2$, we have
\[
E\sum_{k=1}^{s_n}\Big(n^{-1}\sum_{i=1}^n\hat{\xi}_{ijk}\varepsilon_i\Big)^2 = n^{-2}\sum_{k=1}^{s_n}\sum_{i=1}^n\sigma^2E(\hat{\xi}_{ijk}^2) \le n^{-2}\sum_{k=1}^{s_n}\sum_{i=1}^n\sigma^2E(\|\hat{x}_{ij}\|^2) = O(s_nn^{-1}).
\]
It is easy to see that $\sum_{k=1}^{s_n}\sum_{i=1}^n\hat{\xi}_{ijk}^2 = O_p(n\sum_{k=1}^{s_n}w_{jk}) = O_p(n)$. Thus
\[
\sum_{k=1}^{s_n}S_{jk}^2 = O_p(s_nn^{-1}) + n^{-2}O_p(n)O_p(r_n^2) = O_p\{(q_n + s_n)n^{-1}\},
\]
and it follows that
\[
\|S_j^*(\hat{\eta})\|^2 = \sum_{k=1}^{s_n}S_{jk}^2 = O_p\{n^{-1}(q_n + s_n)\} = o_p(\lambda_{jn}^2)
\]
under (A7), which entails (1.9) immediately. Hence, we conclude that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$ such that $\|\hat{\gamma} - \gamma_0\| = O_p(r_nn^{-1/2})$, $\|\hat{b}^{(1)} - b_0^{(1)}\| = O_p(r_nn^{-1/2})$, $\|\hat{\beta}^{(1)} - \beta_0^{(1)}\| = O_p(r_ns_n^{a/2}n^{-1/2})$, and $P(\hat{b}_j^{(1)} = 0,\ j = g+1, \dots, d,\ \hat{\gamma}^{(2)} = 0) \to 1$. This completes the proof.
Proof of Theorem 2. For convenience, define the following:
\[
L_n(\eta) = \sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{\infty}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)^2,
\]
\[
T_{1n}(\eta) = -\sum_{i=1}^n\Big(\sum_{j=1}^d\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk}\Big)^2 + 2\sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{s_n}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)\Big(\sum_{j=1}^d\sum_{k=s_n+1}^{\infty}\xi_{ijk}b_{jk}\Big),
\]
\[
T_{2n}(\eta) = -2\sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{s_n}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)\sum_{j=1}^d\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk},
\]
\[
T_{3n}(\eta) = \sum_{i=1}^n\Big\{\sum_{j=1}^d\sum_{k=1}^{s_n}(\hat{\xi}_{ijk} - \xi_{ijk})b_{jk}\Big\}^2,
\]
\[
T_{4n}(\eta) = 2n\Big\{\sum_{l=1}^{p_n}J_{\lambda_n}(|\gamma_l|) + \sum_{j=1}^dJ_{\lambda_{jn}}(\|b_j^{(1)}\|)\Big\},
\]
\[
P_n(\eta) = L_n(\eta) + T_{1n}(\eta) = \sum_{i=1}^n\Big(y_i - \sum_{j=1}^d\sum_{k=1}^{s_n}\xi_{ijk}b_{jk} - z_i^{\mathrm{T}}\gamma\Big)^2.
\]
Then we have $\bar{Q}_n(\eta) = L_n(\eta) + T_{1n}(\eta) + T_{2n}(\eta) + T_{3n}(\eta) + T_{4n}(\eta)$; denote $\nabla\bar{Q}_n(\eta) = \partial\bar{Q}_n(\eta)/\partial\gamma^{(1)} = (\partial\bar{Q}_n(\eta)/\partial\gamma_1, \dots, \partial\bar{Q}_n(\eta)/\partial\gamma_{q_n})^{\mathrm{T}}$ and $\nabla^2\bar{Q}_n(\eta) = \partial^2\bar{Q}_n(\eta)/(\partial\gamma^{(1)}\partial\gamma^{(1)\mathrm{T}})$. Recall that $\hat{\eta}$ is a local minimizer of $\bar{Q}_n(\eta)$; hence, we have $\nabla\bar{Q}_n(\hat{\eta}) = \nabla L_n(\hat{\eta}) + \nabla T_{1n}(\hat{\eta}) + \nabla T_{2n}(\hat{\eta}) + \nabla T_{3n}(\hat{\eta}) + \nabla T_{4n}(\hat{\eta}) = 0$.

It is obvious that $\nabla T_{3n}(\hat{\eta}) = 0$, because $T_{3n}(\eta)$ does not depend on $\gamma^{(1)}$. Moreover, by the properties of $\hat{\eta}$ derived before and (A7), we have that, uniformly for $l = 1, \dots, q_n$, $\partial T_{4n}(\hat{\eta})/\partial\gamma_l = 2nJ'_{\lambda_n}(|\hat{\gamma}_l|)\mathrm{sgn}(\hat{\gamma}_l) = 0$ with probability tending to one; thus $\nabla T_{4n}(\hat{\eta}) = 0$. For $\nabla P_n(\hat{\eta})$, by Taylor expansion we have
\[
\nabla P_n(\hat{\eta}) = \nabla P_n(\eta_0) + \nabla^2P_n(\eta_0)(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = \nabla L_n(\eta_0) + \nabla T_{1n}(\eta_0) + \nabla^2L_n(\eta_0)(\hat{\gamma}^{(1)} - \gamma_0^{(1)})
\]
\[
= \Big(-2\sum_{i=1}^n\varepsilon_iz_i^{(1)}\Big) + \Big(-2\sum_{j=1}^d\hat{\vartheta}_j^{(1)}\Big) + 2\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = \Big(-2\sum_{i=1}^n\varepsilon_iz_i^{(1)}\Big) + 2\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) + o_p(n^{1/2})1_{q_n},
\]
where the last equality is by Lemma 2(b), and $1_{q_n}$ is the vector of ones with dimension $q_n$. For $\nabla T_{2n}(\hat{\eta})$,
\[
\nabla T_{2n}(\hat{\eta}) = 2\sum_{j=1}^d(\hat{\vartheta}_j^{(2)} + \hat{\vartheta}_j^{(3)} + \hat{\vartheta}_j^{(4)}) = 2\sum_{j=1}^d(\hat{\vartheta}_j^{(2)} + \hat{\vartheta}_j^{(3)} + \hat{\vartheta}_j^{(5)} + \hat{\vartheta}_j^{(6)} + \hat{\vartheta}_j^{(7)} + \hat{\vartheta}_j^{(8)}) = o_p(n^{1/2})1_{q_n} + 2\sum_{j=1}^d\hat{\vartheta}_j^{(8)},
\]
where the last two equalities are by Lemma 2(b). Noticing that $\hat{\vartheta}_j^{(8)} = 0$ for $j = g+1, \dots, d$,
\[
\nabla\bar{Q}_n(\hat{\eta}) = o_p(n^{1/2})1_{q_n} + 2\sum_{j=1}^g\hat{\vartheta}_j^{(8)} + \Big(-2\sum_{i=1}^n\varepsilon_iz_i^{(1)}\Big) + 2\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = 0,
\]
and it follows that
\[
n^{-1/2}\sum_{i=1}^nz_i^{(1)}z_i^{(1)\mathrm{T}}(\hat{\gamma}^{(1)} - \gamma_0^{(1)}) = n^{-1/2}\sum_{i=1}^n\varepsilon_iz_i^{(1)} - n^{-1/2}\sum_{j=1}^g\hat{\vartheta}_j^{(8)} + o_p(1)1_{q_n} = \sum_{i=1}^nW_{in} + o_p(1)1_{q_n},
\]
with $W_{in} = n^{-1/2}\varepsilon_iz_i^{(1)} - n^{-1/2}\sum_{j=1}^g\sum_{k=1}^{s_n}\sum_{v \ne k}b_{jk0}(w_{jk} - w_{jv})^{-1}\langle\Xi_j, \phi_{jv}\rangle\int(x_{ij}\otimes x_{ij} - K_j)\phi_{jk}\phi_{jv}$. It is obvious that $W_{in}$, $i = 1, \dots, n$, are independent and identically distributed with zero mean and $\mathrm{cov}(W_{in}) = n^{-1}(\sigma^2\Sigma_1 + B_n)$. It suffices to show that $V_n^{-1/2}A_n\sum_{i=1}^nW_{in}$ converges to $N(0_r, I_r)$ in distribution, where $V_n = A_n(\sigma^2\Sigma_1 + B_n)A_n^{\mathrm{T}} \to V^* = \sigma^2H^* + B^*$, and $V^*$ is positive definite. We show this by the Lindeberg-Feller theorem.
First, we denote $\Pi_{in} = V_n^{-1/2}A_nW_{in}$; it is trivial that $\Pi_{in}$, $i = 1, \dots, n$, are independent and identically distributed centered random vectors with $\mathrm{cov}(\sum_{i=1}^n\Pi_{in}) = I_r$. Second, for any $\zeta > 0$, we have
\[
\sum_{i=1}^nE[\|\Pi_{in}\|_2^2\,1\{\|\Pi_{in}\| > \zeta\}] \le n\{E(\|\Pi_{1n}\|^4)\}^{1/2}\{P(\|\Pi_{1n}\| > \zeta)\}^{1/2},
\]
where
\[
\|\Pi_{1n}\|^2 = \Pi_{1n}^{\mathrm{T}}\Pi_{1n} = W_{1n}^{\mathrm{T}}A_n^{\mathrm{T}}V_n^{-1}A_nW_{1n} \le \lambda_{\max}(A_n^{\mathrm{T}}V_n^{-1}A_n)\|W_{1n}\|^2 \le \lambda_{\max}(V_n^{-1})\lambda_{\max}(A_n^{\mathrm{T}}A_n)\|W_{1n}\|^2 = O(1)\|W_{1n}\|^2.
\]
It follows that $E(\|\Pi_{1n}\|^4) = O(1)E(\|W_{1n}\|^4) = O(1)q_n^2/n^2$. Moreover,
\[
P(\|\Pi_{1n}\| > \zeta) = P(\|V_n^{-1/2}A_nW_{1n}\| > \zeta) \le E(\|V_n^{-1/2}A_nW_{1n}\|^2)/\zeta^2 \le \lambda_{\max}(A_n^{\mathrm{T}}V_n^{-1}A_n)E(\|W_{1n}\|^2)/\zeta^2 = O(1)E(\|W_{1n}\|^2) = O(1)q_n/n.
\]
By the previous derivations and assumptions, we know
\[
\sum_{i=1}^nE[\|\Pi_{in}\|^2\,1\{\|\Pi_{in}\| > \zeta\}] = nO(q_n/n)O(q_n^{1/2}n^{-1/2}) = O(q_n^{3/2}n^{-1/2}) = o(1),
\]
since $q_n = o(n^{1/3})$. By the Lindeberg-Feller theorem, we conclude that $V_n^{-1/2}A_n\sum_{i=1}^nW_{in}$ converges to $N(0_r, I_r)$ in distribution, which completes the proof.
Chapter 2

Testing General Hypothesis in Large-scale FLR
2.1 Introduction
The classical functional linear regression (FLR) has been widely adopted for modelling the linear relationship between a scalar response $Y$ and a functional predictor that is often assumed to be sampled from an $L^2(\mathcal{T})$ random process $X(t)$ defined on a compact interval $\mathcal{T} \subseteq \mathbb{R}$. Specifically, given $n$ independent and identically distributed (i.i.d.) pairs $\{Y_i, X_i(\cdot)\}$, the classical FLR is formulated as
\[
Y_i = \int_{\mathcal{T}}X_i(t)\beta(t)\,dt + \varepsilon_i, \quad i = 1, \dots, n, \quad (2.1)
\]
where both $Y_i$ and $X_i$ are centred without loss of generality, i.e., $EY_i = 0$ and $EX_i(t) = 0$ for $t \in \mathcal{T}$; the unknown regression parameter function $\beta(t)$ is square-integrable, i.e., $\beta \in L^2(\mathcal{T})$; and the i.i.d. regression errors $\varepsilon_i$ are independent of $X_i$ with mean zero and variance $0 < \sigma^2 < \infty$. This model has been extensively researched in the functional data analysis literature [Ramsay and Dalzell, 1991, Cardot et al., 1999, Fan and Zhang, 2000, Yuan and Cai, 2010, among others], including theoretical considerations [Cai and Hall, 2006, Hall and Horowitz, 2007, Cai and Yuan, 2012] and statistical inference [Cardot et al., 2003, Lei, 2014, Hilgert et al., 2013], and we refer readers to Ramsay and Silverman [2005] for an overview and numerical examples. There have also been numerous works extending the classical FLR to various settings, such as the functional response [Faraway, 1997, Cuevas et al., 2002, Yao et al., 2005b], the generalized FLR [Escabias et al., 2004, James, 2002, Muller and Stadtmuller, 2005, Shang and Cheng, 2015], the partially FLR [Lian, 2011, Kong et al., 2016], and the additive regression [Muller and Yao, 2008, Zhu et al., 2014, Fan et al., 2015], among many others.
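A minimal simulation of model (2.1) illustrates how $\beta$ can be recovered by least squares on a truncated basis expansion. The choices of basis, slope function, and noise level below are arbitrary and not tied to any dataset discussed here:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_grid, K = 500, 201, 5
t = np.linspace(0, 1, n_grid)
dt = t[1] - t[0]

# Orthonormal cosine basis on [0, 1] and a smooth true slope function beta
basis = np.array([np.sqrt(2) * np.cos(k * np.pi * t) for k in range(1, K + 1)])
eta_true = np.array([1.0, -0.5, 0.25, 0.0, 0.1])   # basis coefficients of beta
beta = eta_true @ basis

# Predictors X_i with independent basis scores; responses from model (2.1)
scores = rng.standard_normal((n, K)) * (1.0 / np.arange(1, K + 1))
X = scores @ basis
Y = X @ beta * dt + 0.1 * rng.standard_normal(n)   # Riemann sum for the integral

# Truncated least squares: regress Y on the projection scores <X_i, b_k>
Theta = X @ basis.T * dt                           # n x K projection scores
eta_hat = np.linalg.lstsq(Theta, Y, rcond=None)[0]
print(np.max(np.abs(eta_hat - eta_true)))          # small estimation error
```

With a well-specified truncation, the recovered coefficients are close to the truth; the truncation level and bias-variance issues that arise when $\beta$ has infinitely many nonzero coefficients are exactly what the later development addresses.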
In modern scientific experiments, the response $Y$ can potentially be associated with multiple or even a large number of functional predictors; e.g., Lian [2013] proposed a FLR involving a fixed number of functional predictors, and Kong et al. [2016] considered regularized estimation and variable selection for a partially FLR that contains high-dimensional scalar covariates in addition to a finite number of functional predictors. However, in large-scale data analysis where a FLR is exploited, the number of potential functional predictors $p_n$ can possibly be much larger than the sample size $n$, although the significant ones, of size $q_n$, are usually assumed to be sparse or of a fractional polynomial order of $n$. Examples can be found in neuroimaging analysis, where one studies the relationship between a certain disease marker and a number of brain regions of interest (ROI) scanned over time. This consideration motivates what we call a large-scale FLR model, as follows:
\[
Y_i = \sum_{j=1}^{p_n}\int_{\mathcal{T}}X_{ij}(t)\beta_j(t)\,dt + \varepsilon_i, \quad i = 1, \dots, n, \quad (2.2)
\]
where $p_n$ is allowed to grow exponentially in the sample size $n$; without loss of generality, the first $q_n$ regression parameter functions $\{\beta_j: j = 1, \dots, q_n\}$ are assumed nonzero (i.e., the important ones), while the rest are zero, and the i.i.d. errors $\varepsilon_i$ are independent of $\{X_{ij}: j = 1, \dots, p_n\}$ with mean zero and variance $\sigma^2$. It is common to use a set of pre-fixed (e.g., B-splines, wavelets) or data-driven (e.g., eigenfunctions) basis functions to represent the underlying process $X_j$ of each predictor $\{X_{ij}: i = 1, \dots, n\}$. Since data-driven bases, such as eigenfunctions, have to be estimated from $p_n$ separate functional principal component analysis procedures, which is computationally prohibitive when $p_n \gg n$ and does not necessarily produce efficient representations for the sake of regression/prediction of $Y$, we adopt a common pre-fixed basis $\{b_k: k \ge 1\}$ that is complete and orthonormal in $L^2(\mathcal{T})$ for all processes $X_j$, $j = 1, \dots, p_n$, and do not further pursue other, more complicated basis-seeking procedures, e.g., functional partial least squares [Reiss and Ogden, 2007].
To be specific, the functional predictors and the associated regression functions can be expressed as linear combinations of the complete orthonormal basis $\{b_k: k \ge 1\}$, such that $\beta_j = \sum_{k=1}^{\infty}\eta_{jk}b_k$ and $X_{ij} = \sum_{k=1}^{\infty}\theta_{ijk}b_k$, where the coefficients $\theta_{ijk} = \int_{\mathcal{T}}X_{ij}(t)b_k(t)\,dt$, coinciding with the projections, are mean-zero random variables whose variances are $E(\theta_{ijk}^2) = \omega_{jk} > 0$. As a result, model (2.2) can be reformulated as
\[
Y_i = \sum_{j=1}^{p_n}\sum_{k=1}^{\infty}\theta_{ijk}\eta_{jk} + \varepsilon_i. \quad (2.3)
\]
To perform estimation and inference on the regression functions of primary interest, a direct minimization of the squared loss with respect to the infinite sequences of unknown coefficients $\eta_{jk}$ is not feasible. A common practice is to truncate at the first $s_n$ leading terms, with $s_n$ allowed to grow with $n$, where $s_n$ controls the complexity of each $\beta_j$ as a whole function and balances the bias-variance tradeoff, in a similar spirit to classical nonparametric regression.
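To make the truncation concrete, the following sketch (with made-up dimensions and a hypothetical smooth-path generator, purely for illustration) builds the $n \times (p_ns_n)$ matrix of truncated projection scores implied by (2.3) under a common basis:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_n, s_n, n_grid = 200, 50, 4, 101
t = np.linspace(0, 1, n_grid)
dt = t[1] - t[0]
basis = np.array([np.sqrt(2) * np.sin(k * np.pi * t) for k in range(1, s_n + 1)])

# p_n functional predictors observed on a common grid (here: smooth random paths)
X = np.empty((n, p_n, n_grid))
for j in range(p_n):
    scores = rng.standard_normal((n, s_n)) / np.arange(1, s_n + 1)
    X[:, j, :] = scores @ basis

# theta[i, j, k] approximates <X_ij, b_k>; truncating (2.3) at s_n leaves an
# n x (p_n * s_n) design matrix for (penalized) least squares
theta = np.einsum('ijt,kt->ijk', X, basis) * dt
design = theta.reshape(n, p_n * s_n)
print(design.shape)      # (200, 200)
```
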
To our knowledge, this type of large-scale FLR first appeared in Fan et al. [2015], which considered a penalized procedure for model estimation and selection, while our primary interest is hypothesis testing. A careful inspection of Condition 1(A) in Fan et al. [2015, Appendix B], which requires $\sum_{k=1}^{\infty}\theta_{ijk}^2k^4 < C^2$ for a universal constant $C$ for $i = 1, \dots, n$, $j = 1, \dots, p_n$, reveals that all random processes $X_j$ are bounded in $L^2(\mathcal{T})$ uniformly in $j$, thus excluding even the case of Gaussian processes. Moreover, Condition 2(D) there assumes that the minimal eigenvalue of $n^{-1}\Theta_j'\Theta_j$ is bounded from below by a constant $c_0$ uniformly in $1 \le j \le q_n$ (i.e., the important ones), where $\Theta_j = (\theta_{ijk})_{1 \le i \le n; 1 \le k \le s_n}$ is the $n \times s_n$ design matrix induced by $X_j$; note that Fan et al. [2015] used $q_n$ to denote the truncation size, with some abuse of notation. Unfortunately, this crucial condition is not valid for an infinite-dimensional $L^2$ process, as the minimal eigenvalue necessarily approaches $0$ when $s_n$ diverges, with a typical illustration given by the Karhunen-Loeve expansion. By contrast, we do not make these stringent assumptions, so that the predictor processes are genuinely functional in the large-scale FLR (2.2).
The main contribution of this paper is to develop a rigorous testing procedure for general hypotheses on an arbitrary subset of the regression functions $\{\beta_k: k = 1, \dots, p_n\}$. The challenge arises not only from the ultrahigh dimensionality in $p_n$, which can be exponentially large in $n$, but also from the intrinsic infinite-dimensionality of each $X_j$, $j = 1, \dots, p_n$. Although the FLR (and its variants) has been well studied, there are only a few contributions on inference procedures; e.g., Hilgert et al. [2013] and Lei [2014] considered adaptive tests of a single regression function in the classical FLR, and Shang and Cheng [2015] did so for the generalized FLR. In the current exposition, we adopt a general class of nonconvex penalty functions [Loh and Wainwright, 2015] that includes the LASSO [Tibshirani, 1996], SCAD [Fan and Li, 2001] and MCP [Zhang, 2010] penalties as special cases, and whose theoretical properties in high-dimensional linear regression have been extensively studied [Meinshausen and Buhlmann, 2006, van de Geer, 2008, Meinshausen and Yu, 2009, Bickel et al., 2009, Zhang, 2009, Fan and Lv, 2011, Wang et al., 2013, 2014, Fan et al., 2014, Loh and Wainwright, 2015, among many others]. Recently, research on inference for high-dimensional linear regression has flourished, especially for the LASSO-type convex penalty [Tibshirani, 1996]; see, e.g., Wasserman and Roeder [2009], Meinshausen and Buhlmann [2010] and Shah and Samworth [2013] based on sample splitting or subsampling, Zhang and Zhang [2014] and van de Geer et al. [2014] on bias-correction methods, and Lockhart et al. [2014] and Taylor et al. [2014] for conditional inference on the event that some covariates have been selected, among others.
In this article, we are inspired by the unconditional inference based on the decorrelated score function proposed in Ning and Liu [2016], owing to its generality and its freedom from data splitting or strong minimal signal conditions. We first exploit a penalized least squares procedure treating the truncated coefficients of each $\beta_k$ altogether in a group fashion, and obtain estimation consistency, with no need of oracle properties, under weaker minimal signal conditions that allow for a wider class of suitable settings. Then we devise the decorrelated score function in the context of large-scale FLR for testing a general null hypothesis on any subset of $\{\beta_k: k \ge 1\}$. Unlike testing a null hypothesis on a single parameter in high-dimensional linear regression, the limiting distribution for such a general null hypothesis is intractable. Finally, we adopt the multiplier bootstrap method to approximate the limiting distribution of the score test statistic under the null hypothesis, and provide theoretical guarantees for all possible sizes in a uniform manner.
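In stripped-down form, the multiplier bootstrap idea can be sketched as follows: given mean-zero score vectors, the null distribution of a maximum-type statistic is approximated by reweighting the summands with independent standard normal multipliers. This is a generic illustration with placeholder inputs, not the exact test statistic constructed later:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 40
U = rng.standard_normal((n, p))          # stand-in for mean-zero score vectors

# Observed max-type statistic T = max_j |n^{-1/2} sum_i U_ij|
T_obs = np.max(np.abs(U.sum(axis=0)) / np.sqrt(n))

# Multiplier bootstrap: reweight rows by i.i.d. N(0, 1) multipliers e_i;
# each resample mimics the fluctuation of the original sum around its mean
B = 1000
T_boot = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)
    T_boot[b] = np.max(np.abs(e @ U) / np.sqrt(n))

crit = np.quantile(T_boot, 0.95)         # bootstrap 95% critical value
print(round(crit, 2))                    # compare T_obs against crit
```

The appeal of this scheme is that the same resamples approximate the joint law of all coordinates at once, which is what allows the general (arbitrary-subset) null hypotheses here to be handled uniformly.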
The rest of the article is organized as follows. In Section 2.2, we describe the general hypothesis we want to test and the construction of the proposed score test, based on a penalized estimator and a well-defined decorrelated score function. Section 2.3 is devoted to the theoretical properties of the penalized estimator and the score test defined in Section 2.2 under some regularity conditions. Section 2.4 presents simulation results that support the theoretical properties. The algorithm to obtain the penalized estimator, the technical conditions on penalty functions, the notation used throughout the paper, and the proofs of lemmas and theorems are deferred to Section 2.5.
2.2 Proposed Methodology
2.2.1 Regularized estimation by group penalized least squares
Recall the large-scale FLR defined in (2.2): the underlying predictor processes $X_j$ and the corresponding regression functions $\beta_j$ are expressed through a set of complete and orthonormal basis functions $\{b_k: k \ge 1\}$, leading to the infinite-dimensional representation (2.3). Since the squared loss function $\sum_{i=1}^n(Y_i - \sum_{j=1}^{p_n}\sum_{k=1}^{\infty}\theta_{ijk}\eta_{jk})^2$ cannot be directly minimized with respect to $\eta_{jk}$ due to infinite-dimensionality, we control the complexity of each $\beta_j$ as a whole function and adopt the commonly used truncation for the purpose of regularization, rather than viewing its basis terms as separate predictors. Here the truncation parameter is allowed to grow with $n$, and plays the role of a smoothing parameter balancing the trade-off between approximation bias and sampling variability.
REMARK. Ideally one could use a different truncation size for each $\beta_j$; however, selecting different truncation sizes for a large number of functional predictors is computationally infeasible. Therefore, in practice, we may adopt the strategy suggested by Kong et al. [2016]: use a common $s_n$ to perform regularized estimation, then amend a refit step using ordinary least squares for the retained variables and choose different truncations via $K$-fold cross-validation, say $K = 5$. Nonetheless, the case of a common $s_n$ suffices for methodological development and theoretical analysis.
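The refit-and-cross-validate step of this strategy can be sketched as follows, for a single retained predictor; `cv_truncation` and its inputs are hypothetical names for illustration, with an OLS refit on the predictor's projection scores:

```python
import numpy as np

def cv_truncation(Theta_full, Y, sizes, K=5, seed=0):
    """Pick a truncation size by K-fold CV on an OLS refit.

    Theta_full: n x s_max matrix of projection scores for one retained
    predictor (hypothetical input); sizes: candidate truncations to try.
    """
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % K
    errs = []
    for s in sizes:
        sse = 0.0
        for f in range(K):
            tr, te = folds != f, folds == f
            coef = np.linalg.lstsq(Theta_full[tr, :s], Y[tr], rcond=None)[0]
            sse += np.sum((Y[te] - Theta_full[te, :s] @ coef) ** 2)
        errs.append(sse)
    return sizes[int(np.argmin(errs))]

# Toy check: only the first two scores matter, so small truncations should win
rng = np.random.default_rng(4)
Theta = rng.standard_normal((300, 8))
Y = Theta[:, 0] - 0.5 * Theta[:, 1] + 0.1 * rng.standard_normal(300)
print(cv_truncation(Theta, Y, sizes=[1, 2, 4, 8]))
```
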
Besides the truncation, it is essential to impose a suitable penalty on each regression function as a whole, utilizing a functional version of group regularization [Yuan and Lin, 2006]. To regularize all predictors on a comparable scale, one often standardizes scalar predictors in linear regression [Fan and Li, 2001]. For the functional predictors $X_j$, we choose to account for the variability of the grouped projection coefficients $\theta_{ijk}$ through the $n \times s_n$ design matrix $\Theta_j = (\theta_{ijk})_{1 \le i \le n; 1 \le k \le s_n}$. Hence the penalty imposed on $n^{-1/2}\|\Theta_j\eta_j\|_2$ invokes a group penalty that shrinks the unimportant regression functions to zero, where $\|\cdot\|_2$ is the Euclidean or $\ell_2$ norm (of an infinite sequence, where applicable). For technical convenience, we scale up the penalty parameter $\lambda_n$ by $s_n^{1/2}$, which does not affect the relative weighting of penalties given the common group size $s_n$.
Therefore, denoting η = (η_1', …, η_{p_n}')' with vectors η_j = (η_{j1}, …, η_{js_n})', and ||·||_1 the ℓ_1 norm,
our objective is to minimize the penalized squared loss

min_{η: ||η||_1 ≤ R_n} Q_n(η) = L_n(η) + P_{λ_n}(η),  where
L_n(η) = (2n)^{−1} ∑_{i=1}^n (Y_i − ∑_{j=1}^{p_n} ∑_{k=1}^{s_n} θ_{ijk}η_{jk})^2,
P_{λ_n}(η) = ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j η_j||_2),  (2.4)
where ρ_λ(·) with tuning parameter λ belongs to a general class of nonconvex penalty functions
satisfying conditions (P1)–(P5) in Section 2.5, which includes the most commonly used penalties,
such as the LASSO, SCAD and MCP [Loh and Wainwright, 2015]. The positive
constraint radius R_n should be chosen carefully so that the true value η* is a feasible point, i.e.,
||η*||_1 ≤ R_n. Upon solving the optimization problem (2.4), which is guaranteed to have a
global minimum by the Weierstrass extreme value theorem when ρ_λ(·) is continuous, the regularized
estimator of each β_j(t) can be expressed as β̂_j(t) = ∑_{k=1}^{s_n} η̂_{jk} b_k(t), where η̂ is obtained by
solving (2.4). The implementation by a coordinate descent algorithm, similar to Ravikumar
et al. [2008] with slight modification, is presented in Section 2.5, while the tuning parameters
λ_n and s_n are chosen by K-fold cross-validation, say K = 5 or 10. It is worth mentioning
that, for the purpose of general hypothesis testing, it suffices to obtain consistent estimation
of η from (2.4) in both the ℓ_1 and ℓ_2 sense stated in Theorem 3; selection consistency and
the oracle property are not necessary.
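To fix ideas, the objective (2.4) can be evaluated numerically. The sketch below uses the group LASSO penalty ρ_λ(t) = λt as a default; all function and variable names are illustrative rather than part of the proposed method:

```python
import numpy as np

def penalized_loss(Y, Thetas, etas, lam, rho=lambda lam, t: lam * t):
    """Q_n(eta) from (2.4): squared loss L_n plus the group penalty P_lambda.

    Y      : (n,) response vector
    Thetas : list of p_n design matrices Theta_j, each (n, s_n)
    etas   : list of p_n coefficient vectors eta_j, each (s_n,)
    lam    : base tuning parameter lambda_n (scaled by s_n^{1/2} internally)
    rho    : penalty function rho_lambda(t); default is the group LASSO
    """
    n = len(Y)
    s_n = Thetas[0].shape[1]
    fit = sum(Th @ eta for Th, eta in zip(Thetas, etas))
    L_n = np.sum((Y - fit) ** 2) / (2 * n)
    # penalty applied to the empirical group scale n^{-1/2}||Theta_j eta_j||_2
    P = sum(rho(lam * np.sqrt(s_n), np.linalg.norm(Th @ eta) / np.sqrt(n))
            for Th, eta in zip(Thetas, etas))
    return L_n + P

rng = np.random.default_rng(0)
n, p, s = 50, 3, 4
Thetas = [rng.normal(size=(n, s)) for _ in range(p)]
etas = [rng.normal(size=s) for _ in range(p)]
Y = sum(Th @ e for Th, e in zip(Thetas, etas)) + rng.normal(size=n)
print(penalized_loss(Y, Thetas, etas, lam=0.1))
```

In practice the minimization is carried out by the coordinate descent algorithm of Section 2.5.1, not by evaluating Q_n on a grid; this sketch only makes the two components of the objective concrete.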
2.2.2 General hypothesis and score decorrelation
Our goal is to test a class of hypotheses of full generality in the framework of large-scale
FLR. Denote by P_n = {1, …, p_n} the index set of all functional predictors, let H_n ⊆ P_n be
an arbitrary nonempty subset of P_n with cardinality |H_n| = h_n ≤ p_n, and denote the complement
of H_n by H_n^c = P_n \ H_n. Then the general hypothesis can be expressed as

H_0: ||β_j||_{L^2} = 0 for all j ∈ H_n  vs.  H_a: ||β_j||_{L^2} > 0 for at least one j ∈ H_n,  (2.5)

which is asymptotically equivalent to the hypothesis

H_0: η_j = 0 for all j ∈ H_n  vs.  H_a: η_j ≠ 0 for at least one j ∈ H_n,

with each η_j = (η_{j1}, …, η_{js_n})', as s_n → ∞ given n → ∞. It is worth noting that the
cardinality h_n can be as large as the cardinality p_n of the full predictor set, allowing for any
hypothesis of interest on {β_j : j = 1, …, p_n}.
For testing the general null hypothesis in (2.5), the building block originates from the
combination of consistently estimated regression functions and a new type of score function.
As illustrated in Ning and Liu [2016], the motivation for considering a decorrelated score
function is that the high dimensionality of the nuisance parameter space indexed by H_n^c = P_n \ H_n makes
the limiting distribution of the estimated nuisance parameter, constrained by the null hypothesis,
intractable [Fu and Knight, 2000]. Hence the key is to decorrelate the score function of the
primary parameter in H_n from that of the nuisance parameter in H_n^c, in order to control the
variability induced by the high-dimensional nuisance parameter. As a result, the decorrelation
operation is a natural extension of the profile score to the high-dimensional case, and leads to a test
that is asymptotically equivalent to the classical Rao score test in the low-dimensional case [Cox and
Hinkley, 1979, Ning and Liu, 2016].
We first introduce some notation for adopting the score-function decorrelation in the pro-
posed large-scale FLR. Recall that ω_{jk} is the variance of the i.i.d. projection coefficients
{θ_{ijk} = ∫_T X_i(t) b_k(t) dt : i = 1, …, n}. Denote Λ_j = diag{ω_{j1}^{1/2}, …, ω_{js_n}^{1/2}} for j ≤ p_n,
the block diagonal matrix Λ_{H_n} = diag{Λ_j : j ∈ H_n}, and similarly Λ_{P_n} ≡ Λ. Let
Θ = (G_1', …, G_n')' = (Θ_{H_n}, Θ_{H_n^c}), Θ_{H_n} = (E_1', …, E_n')', Θ_{H_n^c} = (F_1', …, F_n')', where G_i, E_i
and F_i are the vectors containing the coefficients θ_{ijk} from the corresponding functional predictors for
the ith subject, respectively; Θ_{H_n} is formed by concatenating {Θ_j : j ∈ H_n} in a row, and simi-
larly for Θ_{H_n^c}, where Θ_j is the n × s_n design matrix containing θ_{ijk} as its (i, k)th entry. Also denote
η = (η_{H_n}', η_{H_n^c}')' and Y = (Y_1, …, Y_n)', where η_{H_n} stacks {η_j : j ∈ H_n} in a column, and
similarly for η_{H_n^c}. Here we view the least squares ℓ(η) = ℓ(η_{H_n}, η_{H_n^c}) = n^{−1}(Y − Θη)'(Y − Θη)
as the likelihood function of η without introducing extra notation. Further we define

w = I_{H_n^c H_n^c}^{−1} I_{H_n^c H_n},  where I_{H_n^c H_n} = E(F_i E_i'), I_{H_n^c H_n^c} = E(F_i F_i').
For the purpose of decorrelation, a new score function with respect to the primary parameter
η_{H_n}, denoted by S(η), is defined in the context of our large-scale FLR as follows,

S(η) = S(η_{H_n}, η_{H_n^c}) = n^{−1} Λ_{H_n}^{−1} (w'Θ_{H_n^c}' − Θ_{H_n}')(Y − Θ_{H_n} η_{H_n} − Θ_{H_n^c} η_{H_n^c})
     = n^{−1} ∑_{i=1}^n Λ_{H_n}^{−1} (w'F_i − E_i)(Y_i − E_i' η_{H_n} − F_i' η_{H_n^c}).  (2.6)

It is easy to verify that this new score function with respect to the primary parameter η_{H_n} is
uncorrelated with the traditional score function with respect to the nuisance parameter η_{H_n^c},
i.e., E{S(η) ∇_{η_{H_n^c}} ℓ(η)'} = 0, as in Ning and Liu [2016], where ∇_γ denotes the gradient
taken with respect to γ.
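The second line of (2.6) translates directly into a finite-sample computation once w and Λ_{H_n} are replaced by estimates. A minimal sketch, with all names illustrative and the inputs given as stacked coefficient matrices:

```python
import numpy as np

def decorrelated_score(Y, E, F, eta_H, eta_Hc, w, Lambda_H_diag):
    """Decorrelated score S(eta) from (2.6), averaged over subjects.

    E : (n, h*s) coefficients theta_{ijk} for the tested predictors (rows E_i)
    F : (n, (p-h)*s) coefficients for the nuisance predictors (rows F_i)
    w : ((p-h)*s, h*s) decorrelation matrix
    Lambda_H_diag : (h*s,) diagonal of Lambda_{H_n}
    """
    resid = Y - E @ eta_H - F @ eta_Hc           # Y_i - E_i' eta_H - F_i' eta_Hc
    scores = (F @ w - E) * resid[:, None]        # rows: (w'F_i - E_i) * resid_i
    return scores.mean(axis=0) / Lambda_H_diag   # Lambda_{H_n}^{-1} applied entrywise
```

By construction, the score vanishes when the residuals are exactly zero, which gives a quick sanity check of an implementation.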
2.2.3 Bootstrapped score test for general hypothesis in large-scale FLR
Given the consistent estimation of the regression functions and the decorrelated score function, we
are ready to construct the proposed score test for the general hypothesis (2.5) in large-scale
FLR. Note that the decorrelated score function S(η) defined in (2.6) cannot be calculated
directly from the observed data, due to the unknown quantities w = I_{H_n^c H_n^c}^{−1} I_{H_n^c H_n} and Λ_{H_n}. It is
straightforward to estimate Λ_{H_n} by plugging in ω̂_{jk} = n^{−1} ∑_{i=1}^n θ_{ijk}^2, yielding Λ̂_{H_n}, and similarly
Λ̂ and Λ̂_{H_n^c}. To estimate w, a natural choice is the moment estimator ŵ = Î_{H_n^c H_n^c}^{−1} Î_{H_n^c H_n},
where Î_{H_n^c H_n^c} = n^{−1} Θ_{H_n^c}' Θ_{H_n^c} estimates I_{H_n^c H_n^c} = E(F_i F_i') and Î_{H_n^c H_n} = n^{−1} Θ_{H_n^c}' Θ_{H_n} estimates
I_{H_n^c H_n} = E(F_i E_i'). However, this estimator may not exist, since the matrix Î_{H_n^c H_n^c} can be singular in
high-dimensional settings. We follow the suggestion of Ning and Liu [2016] and adopt the
Dantzig selector [Candes and Tao, 2007] to estimate the (p_n − h_n)s_n × h_n s_n unknown matrix
w column by column; alternative procedures could also be used but are not pursued here for brevity.
Specifically, for each l = 1, …, h_n s_n, we solve

ŵ_l ∈ argmin_{w_l} ||w_l||_1  subject to  ||n^{−1} ∑_{i=1}^n E_{il} F_i' − w_l' n^{−1} ∑_{i=1}^n F_i F_i'||_∞ ≤ τ_n,  (2.7)
where τ_n is a common tuning parameter chosen by K-fold cross-validation in practice, giving
the resulting estimator ŵ. The estimated decorrelated score function then follows as

Ŝ(η) = Ŝ(η_{H_n}, η_{H_n^c}) = n^{−1} Λ̂_{H_n}^{−1} (ŵ'Θ_{H_n^c}' − Θ_{H_n}')(Y − Θ_{H_n} η_{H_n} − Θ_{H_n^c} η_{H_n^c})
     = n^{−1} ∑_{i=1}^n Λ̂_{H_n}^{−1} (ŵ'F_i − E_i)(Y_i − E_i' η_{H_n} − F_i' η_{H_n^c}).  (2.8)
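Each column problem in (2.7) is a linear program and can be solved with an off-the-shelf LP solver. The following is one possible implementation sketch using `scipy.optimize.linprog`, not necessarily the solver used in this thesis; the split w = u − v with u, v ≥ 0 is the standard ℓ_1 reformulation:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_column(A, b, tau):
    """Solve min ||w||_1 subject to ||b - A w||_inf <= tau by linear programming.

    For (2.7), A = n^{-1} sum_i F_i F_i' (symmetric) and b = n^{-1} sum_i E_{il} F_i,
    so ||b - A w||_inf is exactly the constraint for one column w_l.
    """
    d = A.shape[0]
    # w = u - v with u, v >= 0; minimize sum(u) + sum(v) = ||w||_1
    c = np.ones(2 * d)
    A_ub = np.vstack([np.hstack([A, -A]),      #  A(u - v) <= b + tau
                      np.hstack([-A, A])])     # -A(u - v) <= tau - b
    b_ub = np.concatenate([b + tau, tau - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v
```

For instance, with A the identity the solution reduces to coordinatewise soft-thresholding of b at level τ, which makes the tuning role of τ_n transparent.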
Then we plug in the estimator η̂ obtained by minimizing (2.4) to construct the decorrelated
score test statistic under the null hypothesis η_{H_n} = 0, leading to

T = n^{1/2} Ŝ(0, η̂_{H_n^c}) = n^{−1/2} ∑_{i=1}^n Ŝ_i,  where Ŝ_i = Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i' η̂_{H_n^c}).  (2.9)
Note that the null hypothesis (2.5) is of full generality, with dimension h_n s_n, where s_n
grows with n (often at a fractional polynomial order) to approximate the infinite-dimensional
functional spaces, and h_n can be as large as p_n. Unlike testing a finite-dimensional null hypoth-
esis, the limiting distribution of T is intricate and likely intractable, even in the case h_n = 1.
Hence we use the infinity norm ||T||_∞ = max{|T_l| : l = 1, …, h_n s_n}
to test the null hypothesis (2.5), and adopt a computationally efficient and theoretically
guaranteed bootstrap method to approximate the distribution of ||T||_∞. Since a stan-
dard bootstrap is computationally intensive, requiring repeated estimation of η and w, we consider the
multiplier bootstrap method studied by Chernozhukov et al. [2014]. Specifically,
denote T_e = n^{−1/2} ∑_{i=1}^n e_i Ŝ_i, where {e_1, …, e_n} is a set of i.i.d. standard normal random
variables independent of the data, and define c_B(α) = inf{t ∈ R : P_e(||T_e||_∞ ≥ t) ≤ α}
as the 100(1 − α)th percentile of ||T_e||_∞, where P_e(·) denotes probability with respect to
{e_1, …, e_n}. Based on this critical value, we reject the null hypothesis at significance level
α provided that ||T||_∞ ≥ c_B(α). In Section 2.3, we show that, under the null hypothesis and
some mild conditions, the Kolmogorov distance between the distributions of ||T||_∞ and ||T_e||_∞
converges to zero as the sample size grows, which provides a theoretical guarantee for the proposed
score test.
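The multiplier bootstrap approximation of c_B(α) is cheap because only the multipliers e_i are redrawn, never η̂ or ŵ. A minimal sketch, assuming the estimated scores Ŝ_i of (2.9) are available as the rows of a matrix (all names illustrative):

```python
import numpy as np

def multiplier_bootstrap_critical_value(S_hat, alpha=0.05, B=1000, seed=0):
    """Approximate c_B(alpha), the 100(1-alpha)th percentile of ||T_e||_inf.

    S_hat : (n, d) matrix whose rows are the estimated scores S_i from (2.9).
    """
    n, d = S_hat.shape
    rng = np.random.default_rng(seed)
    stats = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(n)           # multipliers e_1, ..., e_n
        T_e = (e @ S_hat) / np.sqrt(n)       # n^{-1/2} sum_i e_i S_i
        stats[b] = np.max(np.abs(T_e))       # ||T_e||_inf
    return np.quantile(stats, 1 - alpha)

def score_test(S_hat, alpha=0.05, B=1000):
    """Reject H_0 iff ||T||_inf >= c_B(alpha)."""
    T = S_hat.sum(axis=0) / np.sqrt(S_hat.shape[0])
    return np.max(np.abs(T)) >= multiplier_bootstrap_critical_value(S_hat, alpha, B)
```

The design choice here mirrors the text: the expensive quantities (η̂, ŵ, Λ̂) are computed once, and the B bootstrap replicates cost only matrix-vector products.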
REMARK. Based on the decorrelated score function (2.6), one can in principle construct the
counterparts of other classical tests, such as the Wald and likelihood ratio tests, for a variety of
high-dimensional models, for instance the Cox proportional hazards model [Fang et al., 2014].
We expect that these asymptotically equivalent tests also carry over to the large-scale FLR
model, which deserves further investigation.
2.3 Asymptotic Properties
Before introducing the main theorems, we impose some mild conditions on the covariance
structure of the functional predictors and on the orders of the various tuning parameters. First, recall that
ω_{jk} = E(θ_{ijk}^2) > 0. Accordingly, we assume that the expected norms of the functional
predictors are uniformly bounded:

(A1) sup_{j ≤ p_n} ∑_{k=1}^∞ ω_{jk} < ∞.
Notice that (A1) implies that λ_max(Λ) = O(1). Before discussing the next condition, which
describes the distributions of several sub-gaussian random variables, we intro-
duce some useful notation. For any sub-gaussian random variable X, define
||X||_{φ_1} = sup_{q≥1} q^{−1/2}(E|X|^q)^{1/q} as the sub-gaussian norm, and for any sub-exponential ran-
dom variable X, define ||X||_{φ_2} = sup_{q≥1} q^{−1}(E|X|^q)^{1/q} as the sub-exponential norm, similar
to those adopted in Ning and Liu [2016].

(A2) Assume that ε_i, ω_{jk}^{−1/2}θ_{ijk}, and (w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2} are centred sub-gaussian random
variables satisfying ||ε_i||_{φ_1} ≤ c, ||ω_{jk}^{−1/2}θ_{ijk}||_{φ_1} ≤ c, ||(w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2}||_{φ_1} ≤ c
for some positive constant c, uniformly in i = 1, …, n, j = 1, …, p_n, k = 1, …, ∞,
l = 1, …, h_n s_n.

Notice that (A1) together with (A2) implies that θ_{ijk} and w_l'F_i − E_{il} are also centred sub-gaussian
random variables satisfying ||θ_{ijk}||_{φ_1} ≤ c_1 and ||w_l'F_i − E_{il}||_{φ_1} ≤ c_1 for some positive constant
c_1, uniformly in 1 ≤ i ≤ n, 1 ≤ j ≤ p_n, 1 ≤ l ≤ h_n s_n and k ≥ 1. Before introducing
the subsequent conditions, we denote the information matrix and the partial information matrix
by I = E(G_iG_i') and I_{H_n|H_n^c} = I_{H_nH_n} − w'I_{H_n^cH_n}, respectively. Similarly, the standardized
versions of the information matrix and the partial information matrix are denoted by Ī = Λ^{−1}IΛ^{−1}
and Ī_{H_n|H_n^c} = Λ_{H_n}^{−1} I_{H_n|H_n^c} Λ_{H_n}^{−1}, respectively. Next, we assume that

(A3) λ_min(Λ) ≥ c s_n^{−a/2} for some constants c > 0 and a > 1.

(A4) 0 < m_0 ≤ λ_min(Ī) ≤ λ_max(Ī) ≤ m_1 < ∞ for some constants m_1 > m_0 > 0, satisfying
m_0 > 2^{−1} m_1 μ.

Note that μ > 0 is a constant such that ρ_{λ,μ}(t) is convex in t; see the Appendix.
Combining (A3) and (A4), it is easy to see that λ_min(I) = λ_min(ΛĪΛ) ≥ c s_n^{−a}, for some c > 0.
In addition, we need to control the correlation between the functional part being tested and the
rest of the functional predictors, so we assume

(A5) 0 < c_1 ≤ λ_min(Ī_{H_n|H_n^c}) ≤ λ_max(Ī_{H_n|H_n^c}) ≤ c_2 < ∞ for some constants c_2 > c_1 > 0.
Next, the overall number of functional predictors is permitted to grow exponentially in the sample
size. For any two sequences a_n and b_n, write a_n ∼ b_n if and only if c_1 ≤ lim_{n→∞} |a_n/b_n| ≤ c_2 for
some positive constants c_1, c_2. Then we assume

(A6) p_n ∼ exp(n^β) for some β ∈ (0, 9^{−1}).

Regarding the first q_n nonzero regression functions, we assume that they belong to a Sobolev
ball:

(A7) sup_{j ≤ q_n} ∑_{k=1}^∞ η_{jk}^2 k^{2δ} < c for some positive constants δ and c.
Next, in (A8), we quantify the relationship among the four parameters q_n, ρ_n, s_n, R_n and the
sample size n. Here q_n is the number of significant predictors; ρ_n = sup_{l ≤ h_n s_n} ||w_l||_0, where
the vector norm ||·||_0 denotes the number of nonzero entries of a vector; R_n is a general parameter
such that ||η*||_1 ≤ R_n, where η* represents the true value of η; and s_n is the truncation size for
each functional predictor. We assume

(A8) max{ n^{3β/2} ρ_n q_n s_n^{3a/2−δ} log s_n, n^{β} q_n^2 s_n^{a+1−δ}, n^{2β} q_n^2 s_n^{a/2+1−δ}, n^{5β/2−1/2} ρ_n q_n^2 s_n^{2a+1−δ},
n^{2β−1/2} (log n)^{1/2} ρ_n s_n^{3a/2}, n^{5β/2−1/2} R_n q_n s_n^{a/2+1}, n^{3β−1} R_n ρ_n q_n s_n^{2a+1},
n^{3β/2−1/2} R_n q_n s_n^{a+1}, n^{β+1/2} q_n s_n^{−δ} log s_n } = o(1).
Note that (A8) entails δ > a + 1 > 2, and (A8) also implies that both ρ_n and q_n are small relative to
the sample size, reflecting the sparseness of the model. Next, for a concrete example, if s_n, R_n, ρ_n
and q_n are of the same order, (A8) can be rewritten as

max{ n^{3β/2} s_n^{3a/2+2−δ} log s_n, n^{β} s_n^{a+3−δ}, n^{2β} s_n^{a/2+3−δ}, n^{5β/2−1/2} s_n^{2a+4−δ}, n^{2β−1/2} s_n^{3a/2+1} (log n)^{1/2},
n^{5β/2−1/2} s_n^{a/2+3}, n^{3β−1} s_n^{2a+4}, n^{3β/2−1/2} s_n^{a+3}, n^{β+1/2} s_n^{1−δ} log s_n } = o(1).  (2.10)
It is easy to verify that there exists s_n satisfying (2.10) if and only if

min{ (2δ − 3a − 4)/(3β), (δ − a − 3)/β, (2δ − 2)/(2β + 1) }
  > max{ (4a + 8 − 2δ)/(1 − 5β), (3a + 2)/(1 − 4β), (a + 6)/(1 − 5β), (2a + 6)/(1 − 3β) },  (2.11)

and it is easy to show that, for instance, the setting a = 2, δ = 7, β < 28^{−1} satisfies (2.11).
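The claim that a = 2, δ = 7, β < 28^{−1} satisfies (2.11) can be verified numerically; the following sketch checks the inequality with exact rational arithmetic:

```python
from fractions import Fraction

def condition_2_11_holds(a, delta, beta):
    """Check inequality (2.11) exactly, using rational arithmetic."""
    a, delta, beta = Fraction(a), Fraction(delta), Fraction(beta)
    lhs = min((2*delta - 3*a - 4) / (3*beta),
              (delta - a - 3) / beta,
              (2*delta - 2) / (2*beta + 1))
    rhs = max((4*a + 8 - 2*delta) / (1 - 5*beta),
              (3*a + 2) / (1 - 4*beta),
              (a + 6) / (1 - 5*beta),
              (2*a + 6) / (1 - 3*beta))
    return lhs > rhs

# With a = 2, delta = 7 the binding terms are 12/(2*beta + 1) and 10/(1 - 3*beta),
# which cross exactly at beta = 1/28.
print(condition_2_11_holds(2, 7, Fraction(1, 30)))   # True
print(condition_2_11_holds(2, 7, Fraction(1, 28)))   # False (boundary case)
```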
Next, in (A9), we impose some conditions on the tuning parameter λ_n for the regularizer in (2.4)
to ensure estimation consistency.

(A9) n^{β} q_n s_n^{a+1} = o(λ_n^{−1}), n^{2β} q_n s_n^{a/2+1} = o(λ_n^{−1}), n^{5β/2−1/2} ρ_n q_n s_n^{2a+1} = o(λ_n^{−1}),
n^{β/2−1/2} R_n = o(λ_n), q_n s_n^{−δ} = o(λ_n).

Note that (A8) is sufficient for the existence of a λ_n satisfying (A9). Lastly, in (A10),
we assume the tuning parameter τ_n for the Dantzig selector satisfies

(A10) τ_n ∼ [log(p_n s_n)/n]^{1/2}.
Combining (A6) and (A8) with (A10), we have τ_n ∼ n^{β/2−1/2}. In the following, we establish
the main theorems; several auxiliary lemmas and the proofs of those lemmas and the main
theorems are deferred to the Appendix. Theorem 3 establishes the estimation consistency of the
estimator obtained from (2.4) under some mild conditions, where the conditions (P1)–(P5) are
deferred to the Appendix.
Theorem 3 Under conditions (A1)–(A4), (A6)–(A9) and (P1)–(P5), every local minimizer η̂
of Q_n(η) obtained from (2.4) satisfies

1) ||η̂ − η*||_2 ≤ c_0 λ_n s_n^{a/2+1/2} q_n^{1/2}, with probability tending to one, for some constant c_0 > 0;

2) ||η̂ − η*||_1 ≤ c_1 λ_n s_n^{a/2+1} q_n, with probability tending to one, for some constant c_1 > 0.

In particular, regarding the consistency of the estimated regression curves β̂_j(t) = ∑_{k=1}^{s_n} η̂_{jk} b_k(t), it is straightforward to see that

sup_{j ≤ p_n} ||β̂_j − β_j||_{L^2} ≤ sup_{j ≤ p_n} ||η̂_j − η_j*||_2 + s_n^{−δ} sup_{j ≤ q_n} ( ∑_{k=s_n+1}^∞ η_{jk}^2 k^{2δ} )^{1/2}
  = O(λ_n s_n^{a/2+1/2} q_n^{1/2} + s_n^{−δ}).
Next, we present Theorem 4, which lays the foundation for hypothesis testing through the
multiplier bootstrap method.

Theorem 4 Under conditions (A1)–(A10) and (P1)–(P5), using the local minimizer η̂ from
Theorem 3, then under H_0: ||β_j||_{L^2} = 0 for all j ∈ H_n, the Kolmogorov distance between the
distributions of ||T||_∞ and ||T_e||_∞ satisfies

lim_{n→∞} sup_{t ≥ 0} | P(||T||_∞ ≤ t) − P_e(||T_e||_∞ ≤ t) | = 0,

and consequently,

lim_{n→∞} sup_{α ∈ (0,1)} | P(||T||_∞ ≥ c_B(α)) − α | = 0.
2.4 Simulation Studies

The simulated data {y_i : i = 1, …, n} are generated from the model

y_i = ∑_{j=1}^{p_n} ∫_0^1 β_j(t) x_{ij}(t) dt + ε_i = ∑_{j=1}^{p_n} ∑_{k=1}^{50} η_{jk} θ_{ijk} + ε_i,

with p_n functional predictors, where the errors ε_1, …, ε_n are independent and identically distributed
as N(0, σ^2), and the functional predictors have mean zero and covariance function derived
from the Fourier basis φ_1 = 1, φ_{2ℓ} = 2^{1/2} cos{ℓπ(2t − 1)}, ℓ = 1, …, 25, and φ_{2ℓ−1} =
2^{1/2} sin{(ℓ − 1)π(2t − 1)}, ℓ = 2, …, 25, with t ∈ T = [0, 1]. The underlying regression
functions are β_j(t) = ∑_{k=1}^{50} η_{jk} φ_k(t) for j ≤ q_n, where η_{jk} = c_j(1.2 − 0.2k) for k ≤ 4 and
η_{jk} = 0.4 c_j (k − 3)^{−8} for 5 ≤ k ≤ 50, with constants {c_j : j ≤ q_n} that are chosen
differently across settings, and the remaining β_j(t) = 0. Next, we describe how to generate the p_n
functional predictors x_{ij}(t). For j = 1, …, p_n, define V_{ij}(t) = ∑_{k=1}^{50} θ̃_{ijk} φ_k(t), where the θ̃_{ijk}
are independently distributed as N(0, k^{−2}) across i, j and k. The p_n functional predictors
are then defined through the AR(ρ)-structured linear transformations

x_{ij}(t) = ∑_{j'=1}^{p_n} ρ^{|j−j'|} V_{ij'}(t) = ∑_{k=1}^{50} ∑_{j'=1}^{p_n} ρ^{|j−j'|} θ̃_{ij'k} φ_k(t) = ∑_{k=1}^{50} θ_{ijk} φ_k(t),

with θ_{ijk} = ∑_{j'=1}^{p_n} ρ^{|j−j'|} θ̃_{ij'k}, where the constant ρ ∈ (0, 1) controls the correlation among the
functional predictors; we set ρ = 0.3 in our simulation study. For the actual observations,
we assume they are realizations of x_{ij}(·), j = 1, …, p_n, at 100 equally spaced time points
{t_{ijl} : l = 1, …, 100} ⊆ T. To be more realistic, we fit each observed functional datum with a
known orthonormal cubic spline basis containing m = 30 basis functions, rather than the Fourier
basis, to obtain the corresponding estimated coefficients. After obtaining the estimated x_{ij}(t),
we apply the algorithm in Section 2.5.1 to obtain the estimated β_j(t) under the SCAD
penalty. Then, we construct the test statistic and its associated α = 5% empirical
quantile through the multiplier (wild) bootstrap, using N = 1000 bootstrap samples, to test several
null hypotheses against their alternatives. We summarize and compare the finite-sample performance
for several null hypotheses, under various settings of the regression curves specified by {c_j : j ≤ q_n},
based on the rejection proportion over 300 Monte Carlo runs in the table below.
Table 2.1: Results for several combinations of null hypotheses and settings of β_j(t), measured over 300 simulations (n = 100, p_n = 200, q_n = 3, σ^2 = 1)

Settings of β_j(t)               Null hypothesis H_0                        Rejection proportion
c_1 = 0,  c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .0467
c_1 = .1, c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .1067
c_1 = .2, c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .2400
c_1 = .6, c_2 = 0, c_3 = 0       H_0: β_1(t) = 0                            .9133
c_1 = 1,  c_2 = 1, c_3 = 1       H_0: β_1(t) = 0                            .9967
                                 H_0: β_1(t) = · · · = β_5(t) = 0           1
                                 H_0: β_1(t) = · · · = β_20(t) = 0          1
                                 H_0: β_5(t) = · · · = β_20(t) = 0          .0533
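The data-generating mechanism of this section is fully determined by the coefficients θ_{ijk}; a minimal sketch of the generation (the Fourier-basis evaluation is omitted, and all names are illustrative):

```python
import numpy as np

def simulate_coefficients(n, p, K=50, rho=0.3, seed=0):
    """theta_{ijk} = sum_{j'} rho^{|j-j'|} theta~_{ij'k}, with independent
    innovations theta~_{ijk} ~ N(0, k^{-2}), as in the AR(rho) construction."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, K + 1.0)
    tilde = rng.normal(size=(n, p, K)) / k          # sd k^{-1}, variance k^{-2}
    mix = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    # theta[i, j, k] = sum_{m} mix[j, m] * tilde[i, m, k]
    return np.einsum('jm,imk->ijk', mix, tilde)

def eta_setting(c, q, p, K=50):
    """eta_{jk} = c_j (1.2 - 0.2k) for k <= 4, 0.4 c_j (k-3)^{-8} for 5 <= k <= K;
    rows q+1, ..., p stay zero (null predictors)."""
    eta = np.zeros((p, K))
    k = np.arange(1, K + 1.0)
    profile = np.concatenate([1.2 - 0.2 * k[:4], 0.4 * (k[4:] - 3.0) ** -8])
    eta[:q] = np.asarray(c, dtype=float)[:, None] * profile
    return eta

def simulate_response(theta, eta, sigma=1.0, seed=1):
    """y_i = sum_j sum_k eta_{jk} theta_{ijk} + eps_i."""
    rng = np.random.default_rng(seed)
    return np.einsum('ijk,jk->i', theta, eta) + sigma * rng.normal(size=theta.shape[0])
```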
From Table 2.1, we find that the rejection proportions for the first four null hypotheses,
where c_2 and c_3 are held at zero, are in ascending order as the signal of the first predictor
increases, consistent with the shape of a power curve. In addition, we also find that
the rejection proportions for the first and the last null hypotheses are close to the pre-specified
significance level α = 5%, which is to be expected, since these null hypotheses are true under
the underlying model and the rejection proportion then represents the average Type I error of
the multiplier decorrelated score test. In particular, when the minimal signal of the significant
functional predictors is strong enough to be detected (i.e., c_1 = c_2 = c_3 = 1), we observe
that the rejection proportions for the associated null hypotheses are close to one, as
expected. In conclusion, the multiplier decorrelated score test performs
well in various settings, which supports the validity of the proposed score test.
2.5 Proof of Lemmas and Main Theorems
2.5.1 Algorithm and parameter tuning
First, without loss of generality, we assume that the data are centred, so that

n^{−1} ∑_{i=1}^n Y_i = 0,  n^{−1} ∑_{i=1}^n θ_{ijk} = 0,

for any j = 1, …, p_n, k = 1, …, s_n. In addition, for each j = 1, …, p_n, we denote f̂_j = Θ_j η̂_j,
where η̂_j is an estimator of η_j, and U_j = Θ_j(Θ_j'Θ_j)^{−1}Θ_j'. The optimization of (2.4) can be
achieved by adopting the coordinate descent method similar to those used in Ravikumar et al.
[2008] and Fan et al. [2015], with the slight modification that ρ_{λ_n}(·) is replaced by ρ_{λ_n s_n^{1/2}}(·). In
the following, we give an algorithm for general penalty functions satisfying
the technical conditions (P1)–(P5) in the Appendix.
Algorithm for general ρ_λ(·)

(i) Start with the initial estimator f̂_j = 0, for each j = 1, …, p_n.

(ii) Calculate the residual R_j = Y − ∑_{k ≠ j} f̂_k, while fixing the values of {f̂_k : k ≠ j}.

(iii) Calculate P_j = U_j R_j.

(iv) Let f̂_j = max{ 1 − ρ'_{λ_n s_n^{1/2}}(n^{−1/2}||f̂_j||_2) n^{1/2}/||P_j||_2, 0 } P_j.

(v) Let f̂_j = f̂_j − n^{−1} 1_n' f̂_j 1_n, where 1_n denotes the n × 1 vector of ones.

(vi) Repeat (ii) to (v) for j = 1, …, p_n and iterate until convergence to obtain the final
estimates f̂_j, for j = 1, …, p_n.

(vii) Compute η̂_j = (Θ_j'Θ_j)^{−1}Θ_j' f̂_j using the final estimates f̂_j from step (vi) to get
the final estimates η̂_j, for j = 1, …, p_n.
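The backfitting steps (i)–(vii) can be sketched as follows for the group LASSO penalty ρ_λ(t) = λt, for which ρ'_λ ≡ λ and step (iv) reduces to a group soft-threshold. This is a simplified stand-in for the general nonconvex case, with all names illustrative:

```python
import numpy as np

def backfit_group_lasso(Y, Thetas, lam, n_iter=100, tol=1e-8):
    """Coordinate descent (i)-(vii) with rho_lambda(t) = lambda * t, so that
    rho'(.) = lambda_n * s_n^{1/2} and step (iv) is a group soft-threshold."""
    n, p = len(Y), len(Thetas)
    s = Thetas[0].shape[1]
    lam_eff = lam * np.sqrt(s)                                    # lambda_n s_n^{1/2}
    U = [Th @ np.linalg.pinv(Th.T @ Th) @ Th.T for Th in Thetas]  # projections U_j
    f = [np.zeros(n) for _ in range(p)]                           # step (i)
    for _ in range(n_iter):
        f_old = [fj.copy() for fj in f]
        for j in range(p):
            R = Y - sum(f[k] for k in range(p) if k != j)         # step (ii)
            P = U[j] @ R                                          # step (iii)
            norm_P = np.linalg.norm(P)
            shrink = max(1 - lam_eff * np.sqrt(n) / norm_P, 0) if norm_P > 0 else 0
            f[j] = shrink * P                                     # step (iv)
            f[j] -= f[j].mean()                                   # step (v), recentring
        if max(np.linalg.norm(f[j] - f_old[j]) for j in range(p)) < tol:
            break                                                 # step (vi)
    return [np.linalg.pinv(Th.T @ Th) @ Th.T @ fj                 # step (vii)
            for Th, fj in zip(Thetas, f)]
```

For a sufficiently large λ_n, every group is thresholded to zero in the first sweep, which matches the shrink-to-zero behaviour of the group penalty.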
In addition, several tuning parameters play critical roles in the penalized procedure. The
optimal values of the parameters λ_n and s_n are chosen automatically by cross-
validation.
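A generic K-fold grid search over (λ_n, s_n) can be sketched as below. For brevity, a simple ridge fit stands in for the penalized estimator, so this sketch illustrates only the tuning loop, not the actual fitting procedure; all names are illustrative:

```python
import numpy as np

def cv_select(Y, Theta_full, lams, sizes, K=5, seed=0):
    """K-fold CV over (lambda, s): truncate each predictor to its first s
    coefficients, fit a ridge stand-in with penalty lambda, and score by
    held-out squared error.

    Theta_full : (n, p, S_max) array of projection coefficients theta_{ijk}.
    """
    n = Theta_full.shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % K
    best = (None, np.inf)
    for lam in lams:
        for s in sizes:
            X = Theta_full[:, :, :s].reshape(n, -1)   # common truncation s_n
            err = 0.0
            for k in range(K):
                tr, te = folds != k, folds == k
                d = X.shape[1]
                # ridge stand-in: (X'X + lam I)^{-1} X'Y on the training fold
                beta = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d),
                                       X[tr].T @ Y[tr])
                err += np.sum((Y[te] - X[te] @ beta) ** 2)
            if err < best[1]:
                best = ((lam, s), err)
    return best[0]
```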
2.5.2 Technical Conditions on Regularizers
We impose some conditions on the (possibly nonconvex) penalty function ρ_λ, similar to those adopted in
Loh and Wainwright [2015].

(P1) ρ_λ is an even function, and ρ_λ(0) = 0.

(P2) For t ≥ 0, ρ_λ(t) is nondecreasing in t.

(P3) The function g_λ(t) = ρ_λ(t)/t is nonincreasing in t, for t > 0.

(P4) The function ρ_λ(t) is differentiable everywhere except at t = 0. At t = 0, we have
lim_{t→0+} ρ'_λ(t) = λL, for some positive constant L.

(P5) The function ρ_{λ,μ}(t) is convex in t, for some positive constant μ, where ρ_{λ,μ}(t) = ρ_λ(t) +
2^{−1}μt^2.

From Loh and Wainwright [2015], most commonly used regularizers, such as the LASSO, SCAD and
MCP, meet these conditions. Note that Lemma 5 is devoted to summa-
rizing the general properties of those penalty functions.
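As a sanity check, conditions (P1)–(P5) can be verified numerically for the SCAD penalty, for which L = 1 and μ = 1/(a − 1); the closed form below follows Fan and Li [2001]:

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty rho_lambda(t); satisfies (P1)-(P5) with L = 1, mu = 1/(a-1)."""
    t = np.abs(t)  # (P1): even, with scad(0) = 0
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam, (2*a*lam*t - t**2 - lam**2) / (2*(a - 1)),
                    lam**2 * (a + 1) / 2))

lam, a = 0.5, 3.7
ts = np.linspace(1e-6, 5, 2000)
vals = scad(ts, lam, a)
# (P2): nondecreasing; (P3): rho(t)/t nonincreasing on a fine grid
print(np.all(np.diff(vals) >= -1e-12), np.all(np.diff(vals / ts) <= 1e-12))
```

The same numerical checks apply to the MCP; for the LASSO, ρ_λ(t) = λ|t| satisfies (P1)–(P5) trivially with μ arbitrarily small.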
2.5.3 Notations
We summarize most of the notation used throughout the paper as follows. First, for
any vector u = (u_1, …, u_s)', we let ||u||_q = (∑_{l=1}^s |u_l|^q)^{1/q} for q ≥ 1, ||u||_∞ = max_{l≤s} |u_l|,
and ||u||_0 denote the cardinality of supp(u), where supp(u) = {l : u_l ≠ 0}. Moreover, if
S = supp(u), we let u_S be the vector containing only the nonzero elements of u, while u_{S^c} contains only
the zero elements. For any matrix C = [c_{ij}], we denote ||C||_∞ = max_{i,j} |c_{ij}|; if C is symmetric,
λ_min(C) and λ_max(C) denote its minimum and maximum eigenvalues. For sequences a_n and b_n,
a_n ∼ b_n if and only if c_1 ≤ lim_{n→∞} |a_n/b_n| ≤ c_2 for some positive constants c_1, c_2. For any
sub-gaussian random variable X, define ||X||_{φ_1} = sup_{q≥1} q^{−1/2}(E|X|^q)^{1/q} as the sub-gaussian
norm, and for any sub-exponential random variable X, define ||X||_{φ_2} = sup_{q≥1} q^{−1}(E|X|^q)^{1/q}
as the sub-exponential norm, similar to those adopted in Ning and Liu [2016].
Second, recall that P_n = {1, …, p_n} is the index set representing all functional predictors,
while H_n ⊆ P_n is any nonempty subset of P_n, with cardinality |H_n| = h_n ≤ p_n, and we also
denote H_n^c = P_n \ H_n. For each j ≤ p_n, we let Λ_j = diag{ω_{j1}^{1/2}, …, ω_{js_n}^{1/2}}. Moreover, we
also let Λ = Λ_{P_n} be the block diagonal matrix with {Λ_j : j ≤ p_n} as its diagonal submatrices
and let Λ_{H_n} be the block diagonal matrix with {Λ_j : j ∈ H_n} as its diagonal submatrices.
Similarly, we define Λ̂_j = diag{ω̂_{j1}^{1/2}, …, ω̂_{js_n}^{1/2}}, Λ̂ = Λ̂_{P_n} and Λ̂_{H_n}. Furthermore, we denote
η̄ = Λη = (η̄_1', …, η̄_{p_n}')', with each η̄_j = Λ_j η_j = (η_{j1}ω_{j1}^{1/2}, …, η_{js_n}ω_{js_n}^{1/2})'. Analogously, we
denote ν = η̂ − η* and ν̄ = Λν = Λ(η̂ − η*), where η* and η̄* = Λη* are the true versions of η and η̄,
respectively. For any H_n ⊆ P_n, ν̄_{H_n} is constructed by stacking the vectors {ν̄_j = Λ_j(η̂_j − η_j*) :
j ∈ H_n} in a column.

Third, without loss of generality, we let Θ = (G_1, …, G_n)' = [Θ_{H_n}, Θ_{H_n^c}], Θ_{H_n} =
(E_1, …, E_n)', Θ_{H_n^c} = (F_1, …, F_n)', where Θ_{H_n} is constructed by stacking {Θ_j : j ∈ H_n} in a
row. Similarly, we denote Θ̄ = (Ḡ_1, …, Ḡ_n)' = [Θ̄_{H_n}, Θ̄_{H_n^c}] = ΘΛ^{−1} = [Θ_{H_n}Λ_{H_n}^{−1}, Θ_{H_n^c}Λ_{H_n^c}^{−1}],
Θ̄_{H_n} = (Ē_1, …, Ē_n)', Θ̄_{H_n^c} = (F̄_1, …, F̄_n)', where Θ̄_{H_n} is constructed by stacking {Θ̄_j =
Θ_jΛ_j^{−1} : j ∈ H_n} in a row.
Fourth, we denote a series of terms that can be constructed from the information matrix I:

I = E(G_iG_i'), I_{H_nH_n} = E(E_iE_i'), I_{H_nH_n^c} = E(E_iF_i'),
I_{H_n^cH_n} = E(F_iE_i'), I_{H_n^cH_n^c} = E(F_iF_i'), I_{H_n|H_n^c} = I_{H_nH_n} − w'I_{H_n^cH_n},
w = I_{H_n^cH_n^c}^{−1} I_{H_n^cH_n} = (w_1, …, w_{h_ns_n}), Ī = E(Ḡ_iḠ_i') = Λ^{−1}IΛ^{−1},
Ī_{H_nH_n} = E(Ē_iĒ_i') = Λ_{H_n}^{−1}I_{H_nH_n}Λ_{H_n}^{−1}, Ī_{H_nH_n^c} = E(Ē_iF̄_i') = Λ_{H_n}^{−1}I_{H_nH_n^c}Λ_{H_n^c}^{−1},
Ī_{H_n^cH_n} = E(F̄_iĒ_i') = Λ_{H_n^c}^{−1}I_{H_n^cH_n}Λ_{H_n}^{−1}, Ī_{H_n^cH_n^c} = E(F̄_iF̄_i') = Λ_{H_n^c}^{−1}I_{H_n^cH_n^c}Λ_{H_n^c}^{−1},
Ī_{H_n|H_n^c} = Λ_{H_n}^{−1}I_{H_n|H_n^c}Λ_{H_n}^{−1} = Λ_{H_n}^{−1}(I_{H_nH_n} − w'I_{H_n^cH_n})Λ_{H_n}^{−1},
ρ_n = max_{l≤h_ns_n} ||w_l||_0 = max_{l≤h_ns_n} ρ_{nl},  ρ_{nl} = ||w_l||_0.
Finally, by using the η̂ generated from (2.4) under the conditions of Theorem 3, we further
denote

T = n^{−1/2} ∑_{i=1}^n Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i'η̂_{H_n^c}) = n^{−1/2} ∑_{i=1}^n Ŝ_i,
T* = n^{−1/2} ∑_{i=1}^n Λ_{H_n}^{−1}(w'F_i − E_i)ε_i = n^{−1/2} ∑_{i=1}^n S_i,
T_e = n^{−1/2} ∑_{i=1}^n e_iŜ_i,  T_e* = n^{−1/2} ∑_{i=1}^n e_iS_i,
φ'(α) = inf{t ∈ R : P_e(||T_e||_∞ ≤ t) ≥ α},  α ∈ (0, 1),
φ*(α) = inf{t ∈ R : P_e(||T_e*||_∞ ≤ t) ≥ α},  α ∈ (0, 1),
Ŝ_i = (Ŝ_{i1}, …, Ŝ_{i,h_ns_n})' = Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i'η̂_{H_n^c}),
S_i = (S_{i1}, …, S_{i,h_ns_n})' = Λ_{H_n}^{−1}(w'F_i − E_i)ε_i,

where {e_i : i ≤ n} is a set of independent and identically distributed standard normal random
variables independent of the data, P_e(·) denotes probability with respect to e, φ'(α) denotes
the α quantile of ||T_e||_∞, and φ*(α) denotes the α quantile of ||T_e*||_∞. It is also worth noting
that {S_i : i ≤ n} are i.i.d. centred random vectors such that E(S_iS_i') = σ^2 Ī_{H_n|H_n^c}.
2.5.4 Auxiliary Lemmas
Lemma 5 (a) Under conditions (P1)–(P4), we have |ρ_λ(t_1) − ρ_λ(t_2)| ≤ λL|t_1 − t_2|, for all
t_1, t_2 ∈ R.

(b) Under conditions (P1)–(P4), we have |ρ'_λ(t)| ≤ λL, for all t ≠ 0.

(c) Under conditions (P1)–(P5), we have λL|t| ≤ ρ_λ(t) + 2^{−1}μt^2, for all t ∈ R.
Lemma 6 Under conditions (P1)–(P4), if P_{λ_n}(η*) − P_{λ_n}(η̂) ≥ 0, where
P_{λ_n}(η) = ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη_j||_2) and η* is the true value of η, then we have

0 ≤ P_{λ_n}(η*) − P_{λ_n}(η̂) ≤ λ_n s_n^{1/2} L ( ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 − ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 ),

where A_n ⊆ P_n is the index set corresponding to the largest q_n elements of {n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 :
j ≤ p_n} in magnitude, while A_n^c = P_n \ A_n.
Lemma 7 Under conditions (A2), (A6), (A8), we have

1) ||Λ̂Λ^{−1} − I||_∞ ≤ c_1 n^{β/2−1/2}, with probability tending to one, for some constant c_1 > 0;

2) ||ΛΛ̂^{−1} − I||_∞ ≤ c_2 n^{β/2−1/2}, with probability tending to one, for some constant c_2 > 0,

where I denotes the p_ns_n × p_ns_n identity matrix.
Lemma 8 Under conditions (A1), (A2), (A6), (A8), we have

1) ||n^{−1}Θ'Θ − E(G_iG_i')||_∞ ≤ c_0 n^{β/2−1/2}, with probability tending to one, for some con-
stant c_0 > 0;

2) ||n^{−1}Θ̄'Θ̄ − E(Ḡ_iḠ_i')||_∞ ≤ c_1 n^{β/2−1/2}, with probability tending to one, for some constant
c_1 > 0;

3) ||n^{−1}Θ'ε||_∞ ≤ c_2 n^{β/2−1/2}, with probability tending to one, for some constant c_2 > 0;

4) ||n^{−1}Θ̄'ε||_∞ ≤ c_3 n^{β/2−1/2}, with probability tending to one, for some constant c_3 > 0;

5) max_{l≤h_ns_n} ||n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})F_i'||_∞ ≤ c_4 n^{β/2−1/2}, with probability tending to one,
for some constant c_4 > 0;

6) max_{l≤h_ns_n} ||n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2}F_i'||_∞ ≤ c_5 n^{β/2−1/2}, with probability
tending to one, for some constant c_5 > 0;

7) max_{l≤h_ns_n} |n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})ε_i| ≤ c_6 n^{β/2−1/2}, with probability tending to one, for
some constant c_6 > 0;

8) max_{l≤h_ns_n} |n^{−1} ∑_{i=1}^n (w_l'F_i − E_{il})[E(E_{il}^2)]^{−1/2}ε_i| ≤ c_7 n^{β/2−1/2}, with probability tending
to one, for some constant c_7 > 0,

where ε denotes the n × 1 random error vector.
Lemma 9 Under conditions (A1), (A2), (A6)–(A8), and H_0: β_j(t) = 0 for all j ∈ H_n, we
have

1) ||n^{−1}Θ_{H_n^c}'(Y − Θ_{H_n^c}η_{H_n^c})||_∞ ≤ c_0(n^{β/2−1/2} + q_ns_n^{−δ}), with probability tending to one, for
some constant c_0 > 0;

2) ||n^{−1}Θ̄_{H_n^c}'(Y − Θ_{H_n^c}η_{H_n^c})||_∞ ≤ c_1(n^{β/2−1/2} + q_ns_n^{−δ}), with probability tending to one,
for some constant c_1 > 0.
Lemma 10 Under conditions (A1), (A2), (A4), (A6), (A8), we have

1) n^{−1} ∑_{j=1}^{p_n} ||Θ_j(η̂_j − η_j*)||_2^2 ≤ m_1[1 + o(1)] ||ν̄||_2^2, with probability tending to one, where
the constant m_1 is defined in (A4);

2) ||ν||_1 ≤ c_0 s_n^{1/2} ∑_{j=1}^{p_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2, with probability tending to one, for some
constant c_0 > 0;

3) λ_n||ν||_1 ≤ c_1 [P_{λ_n}(η*) + P_{λ_n}(η̂) + ||ν̄||_2^2], with probability tending to one, for some
constant c_1 > 0.

Recall that ν̄ = Λν = Λ(η̂ − η*).
Lemma 11 Under conditions (A1)–(A4), (A6), (A8), (A10), we have

1) max_{l≤h_ns_n} ||n^{−1} ∑_{i=1}^n F_iF_i'(ŵ_l − w_l)||_∞ ≤ c_0 n^{β/2−1/2}, with probability tending to one,
for some constant c_0 > 0;

2) max_{l≤h_ns_n} ||ŵ_l − w_l||_2 ≤ c_1 ρ_n^{1/2} s_n^{a} n^{β/2−1/2}, with probability tending to one, for some
constant c_1 > 0;

3) max_{l≤h_ns_n} ||ŵ_l − w_l||_1 ≤ c_2 ρ_n s_n^{a} n^{β/2−1/2}, with probability tending to one, for some
constant c_2 > 0.

Recall that ρ_n = max_{l≤h_ns_n} ||w_l||_0 = max_{l≤h_ns_n} ρ_{nl}, with ρ_{nl} = ||w_l||_0.
Lemma 12 Under conditions (A1)–(A4), (A6)–(A10), and (P1)–(P5), we have

1) max_{l≤h_ns_n} |n^{−1/2} ∑_{i=1}^n (Ŝ_{il} − S_{il})| ≤ c_0(n^{β−1/2}ρ_ns_n^{3a/2} + λ_n n^{β/2}q_ns_n^{a+1} + n^{β/2+1/2}q_ns_n^{−δ} log s_n +
n^{β}ρ_nq_ns_n^{3a/2−δ} log s_n), with probability tending to one, for some constant c_0 > 0;

2) max_{l≤h_ns_n} [n^{−1} ∑_{i=1}^n (Ŝ_{il} − S_{il})^2]^{1/2} ≤ c_1 [λ_n n^{β}q_ns_n^{a/2+1} + n^{β/2}q_ns_n^{−δ} log s_n +
ρ_ns_n^{3a/2}n^{β−1/2}(log n)^{1/2} + λ_nρ_nq_ns_n^{2a+1}n^{3β/2−1/2} + n^{β−1/2}ρ_nq_ns_n^{3a/2−δ} log s_n], with prob-
ability tending to one, for some constant c_1 > 0.

Recall that we denote

Ŝ_i = (Ŝ_{i1}, …, Ŝ_{i,h_ns_n})' = Λ̂_{H_n}^{−1}(ŵ'F_i − E_i)(Y_i − F_i'η̂_{H_n^c}),
S_i = (S_{i1}, …, S_{i,h_ns_n})' = Λ_{H_n}^{−1}(w'F_i − E_i)ε_i.
2.5.5 Proof of Lemmas
Proof of Lemma 5. For part (a), if |t1| = |t2|, then by condition (P1), it is easy to see that
|ρλ(t1)− ρλ(t2)| = |ρλ(|t1|)− ρλ(|t2|)| = 0 ≤ λL|t1 − t2|. If |t1| 6= |t2|, then, without loss of
generality, we assume |t1| > |t2|. By conditions (P1)–(P3), we have
|ρλ(t1)− ρλ(t2)| ≤ (|t1| − |t2|)ρλ(|t1| − |t2|)/(|t1| − |t2|) ≤ |t1 − t2| limt→0+ρλ(t)/t.
Given conditons (P1), (P3) and (P4), it is easy to see that limt→0+ρλ(t)/t = limt→0+ ρ′λ(t) =
λL. It follows that |ρλ(t1)− ρλ(t2)| ≤ λL|t1 − t2|, which completes the proof of part (a).
For part (b), by definition, for all t 6= 0, we have |ρ′λ(t)| = | lim∆→0(ρλ(t + ∆) −
ρλ(t))/∆|. By Lemma 5(a), we have |ρ′λ(t)| ≤ λL, which completes the proof of part (b).
CHAPTER 2. TESTING GENERAL HYPOTHESIS IN LARGE-SCALE FLR 70
For part (c), if t = 0, the inequality holds trivially. If t 6= 0, without loss of generality, we
let t > 0. Since ρλ,µ(s) = ρλ(s) + 2−1µs2 is convex for s ∈ [0, t], and is differentiable for
s ∈ (0, t], it is easy to see that ρλ,µ(t) − ρλ,µ(0) ≥ t lims1→0+ρ′λ(s1) + µs1 = tλL, which
completes the proof of part (c).
Proof of Lemma 6. First, we define f_n(t) = t/ρ_{λ_n s_n^{1/2}}(t) for t > 0, and f_n(0) = (λ_n s_n^{1/2} L)^{−1}.
Under conditions (P1)–(P4), it is easy to see that f_n(t) is nondecreasing in t for
t ≥ 0. It then follows that

∑_{j∈A_n^c} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)
= ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 [f_n(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1}
≥ [f_n(max_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1} ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2.  (2.12)

Moreover, we also have

∑_{j∈A_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)
= ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 [f_n(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1}
≤ [f_n(max_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1} ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2.  (2.13)

Combining (2.12), (2.13) and P_{λ_n}(η*) − P_{λ_n}(η̂) ≥ 0, we have

0 ≤ P_{λ_n}(η*) − P_{λ_n}(η̂)
= ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη_j*||_2) − ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
= ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη_j*||_2) − ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
≤ ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2) + ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2) − ∑_{j=1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
= ∑_{j=1}^{q_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2) − ∑_{j=q_n+1}^{p_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_jη̂_j||_2)
≤ ∑_{j∈A_n} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2) − ∑_{j∈A_n^c} ρ_{λ_n s_n^{1/2}}(n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)
≤ [f_n(max_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2)]^{−1} ( ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 − ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 )
≤ λ_n s_n^{1/2} L ( ∑_{j∈A_n} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 − ∑_{j∈A_n^c} n^{−1/2}||Θ_j(η̂_j − η_j*)||_2 ),

which completes the proof.
Proof of Lemma 7. First, under condition (A2) and by the Bernstein inequality, for any t > 0,
\[
P\Bigl(\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \ge t\Bigr) \le 2\exp\bigl(-c_0\min\{c_1^{-2}t^2, c_1^{-1}t\}n\bigr),
\]
for some universal constants c0, c1 > 0, uniformly in j = 1, ..., pn and k = 1, ..., sn. It then follows from the union bound that
\[
P\Bigl(\max_{j\le p_n}\max_{k\le s_n}\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \ge t\Bigr) \le 2p_ns_n\exp\bigl(-c_0\min\{c_1^{-2}t^2, c_1^{-1}t\}n\bigr).
\]
Choosing t = c2[log(pnsn)/n]^{1/2} for some sufficiently large c2 > 0, it is easy to see that
\[
\max_{j\le p_n}\max_{k\le s_n}\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \le c_2[\log(p_ns_n)/n]^{1/2},
\]
with probability tending to one. Accordingly, writing \(\hat\Lambda\) for the sample analogue of \(\Lambda\), it is easy to see that
\[
\|\hat\Lambda\Lambda^{-1} - I\|_\infty = \max_{j\le p_n}\max_{k\le s_n}\Bigl|\Bigl(n^{-1}\sum_{i=1}^n\omega_{jk}^{-1}\theta_{ijk}^2\Bigr)^{1/2} - 1\Bigr| \le \max_{j\le p_n}\max_{k\le s_n}\Bigl|n^{-1}\sum_{i=1}^n(\omega_{jk}^{-1}\theta_{ijk}^2 - 1)\Bigr| \le c_2[\log(p_ns_n)/n]^{1/2},
\]
with probability tending to one, where the first inequality uses \(|x^{1/2} - 1| \le |x - 1|\) for x ≥ 0. Under (A6) and (A8), it is obvious that \(\|\hat\Lambda\Lambda^{-1} - I\|_\infty \le c_3n^{\beta/2-1/2}\), with probability tending to one, for some constant c3 > 0. Moreover, we have
\begin{align*}
\|\Lambda\hat\Lambda^{-1} - I\|_\infty &= \max_{j\le p_n}\max_{k\le s_n}\Bigl|\omega_{jk}^{1/2}\Bigl(n^{-1}\sum_{i=1}^n\theta_{ijk}^2\Bigr)^{-1/2} - 1\Bigr|\\
&\le \Bigl[\max_{j\le p_n}\max_{k\le s_n}\omega_{jk}^{1/2}\Bigl(n^{-1}\sum_{i=1}^n\theta_{ijk}^2\Bigr)^{-1/2}\Bigr]\|\hat\Lambda\Lambda^{-1} - I\|_\infty\\
&\le (\|\Lambda\hat\Lambda^{-1} - I\|_\infty + 1)\|\hat\Lambda\Lambda^{-1} - I\|_\infty,
\end{align*}
which further implies that
\[
\|\Lambda\hat\Lambda^{-1} - I\|_\infty \le \frac{\|\hat\Lambda\Lambda^{-1} - I\|_\infty}{1 - \|\hat\Lambda\Lambda^{-1} - I\|_\infty} \le \frac{c_3n^{\beta/2-1/2}}{1 - c_3n^{\beta/2-1/2}} \le 2c_3n^{\beta/2-1/2},
\]
with probability tending to one, which completes the proof.
Proof of Lemma 8. The lemma follows by applying the Bernstein inequality and the union bound repeatedly.
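The rate [log(pnsn)/n]^{1/2} delivered by the Bernstein-plus-union-bound argument in Lemmas 7 and 8 is easy to see in simulation. A hedged sketch with standard Gaussian scores (so that \(\omega_{jk}^{-1}\theta_{ijk}^2\) is chi-squared with one degree of freedom); the dimensions and the constant 5 are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_deviation(n, p, s):
    """max_{j,k} |n^{-1} sum_i (omega^{-1} theta_ijk^2 - 1)| for N(0,1) scores."""
    theta = rng.standard_normal((n, p * s))  # already standardized: omega_jk = 1
    return np.abs((theta**2 - 1.0).mean(axis=0)).max()

n, p, s = 2000, 50, 10
dev = max_deviation(n, p, s)
rate = np.sqrt(np.log(p * s) / n)
print(f"max deviation {dev:.4f} vs rate sqrt(log(ps)/n) = {rate:.4f}")
# the maximum over p*s averages stays within a modest constant times the rate
assert dev <= 5 * rate
```

Increasing n while holding p, s fixed shrinks the deviation proportionally to the rate, matching the bound used in the proofs.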
Proof of Lemma 9. First, under H0 : βj(t) = 0 for all j ∈ Hn, it is easy to see that
\begin{align*}
\|n^{-1}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty
&= \Bigl\|n^{-1}\sum_{i=1}^nF_i(Y_i - F_i'\eta_{H_n^c})\Bigr\|_\infty\\
&= \Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^nF_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr\|_\infty\\
&= \Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk} + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_i\theta_{ijk})\eta_{jk}\Bigr\|_\infty\\
&\le \Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk}\Bigr\|_\infty + \Bigl\|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_i\theta_{ijk})\eta_{jk}\Bigr\|_\infty. \tag{2.14}
\end{align*}
In addition, for each l = 1, ..., (pn − hn)sn, we have
\[
E\Bigl[\Bigl(n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr)^2\Bigr] = n^{-1}\sigma^2 \sim n^{-1}. \tag{2.15}
\]
Moreover, for each l = 1, ..., (pn − hn)sn, we have
\begin{align*}
&E\Bigl(n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr)^2
= n^{-2}\sum_{i=1}^nE\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr)^2\\
&\quad\le n^{-2}\sum_{i=1}^nE\Bigl(\sum_{j\in H_n^c}\Bigl|\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr|\Bigr)^2
\le n^{-2}\sum_{i=1}^nE\Bigl(\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr|\Bigr)^2\\
&\quad\le n^{-2}q_n\sum_{i=1}^nE\Bigl(\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr|^2\Bigr)
\le n^{-2}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\eta_{jk}^2k^{2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty\mathrm{Var}(F_{il}\theta_{ijk_1})k_1^{-2\delta}\Bigr)\\
&\quad\le n^{-2}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\eta_{jk}^2k^{2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty E(F_{il}^2\theta_{ijk_1}^2)k_1^{-2\delta}\Bigr)
= n^{-2}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\eta_{jk}^2k^{2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty E(F_{il}^2\bar\theta_{ijk_1}^2)\omega_{jk_1}k_1^{-2\delta}\Bigr)\\
&\quad\le cn^{-1}q_n^2s_n^{-2\delta} = o(n^{-1}), \tag{2.16}
\end{align*}
where \(\bar\theta_{ijk} = \omega_{jk}^{-1/2}\theta_{ijk}\) denotes the standardized score, and the last inequality is by (A1), (A2), (A7) and (A8). Therefore, by combining (2.15) with (2.16), for any l = 1, ..., (pn − hn)sn there exists a universal constant c0 > 0 such that
\[
\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \le c_0\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr|\Bigr) = 1,
\]
which implies that for any t > 0,
\begin{align*}
&\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \ge t\Bigr)\\
&\quad\le \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge t\Bigr) + \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \ge t\Bigr)\\
&\quad\le \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge t\Bigr) + \lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_0^{-1}t\Bigr)
\le 2\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_1t\Bigr), \tag{2.17}
\end{align*}
where c1 = min{1, c0^{-1}} > 0. Moreover, under (A2) and by the Bernstein inequality, for any l = 1, ..., (pn − hn)sn and any t > 0,
\[
\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_1t\Bigr) \le \lim_{n\to\infty}2\exp\bigl(-c_2\min\{c_3^{-2}t^2, c_3^{-1}t\}n\bigr), \tag{2.18}
\]
where c2 and c3 are some universal positive constants. By combining (2.17) with (2.18), for any l = 1, ..., (pn − hn)sn and any t > 0,
\[
\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_{il}\theta_{ijk} - E(F_{il}\theta_{ijk})\bigr]\eta_{jk}\Bigr| \ge t\Bigr) \le 2\lim_{n\to\infty}P\Bigl(\Bigl|n^{-1}\sum_{i=1}^nF_{il}\varepsilon_i\Bigr| \ge c_1t\Bigr) \le \lim_{n\to\infty}4\exp\bigl(-c_2\min\{c_3^{-2}t^2, c_3^{-1}t\}n\bigr).
\]
Invoking the union bound, for any t > 0,
\[
\lim_{n\to\infty}P\Bigl(\Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk}\Bigr\|_\infty \ge t\Bigr) \le \lim_{n\to\infty}4p_ns_n\exp\bigl(-c_2\min\{c_3^{-2}t^2, c_3^{-1}t\}n\bigr).
\]
Choosing t = c4[log(pnsn)/n]^{1/2} for some sufficiently large c4 > 0, it is easy to see that
\[
\Bigl\|n^{-1}\sum_{i=1}^nF_i\varepsilon_i + n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\bigl[F_i\theta_{ijk} - E(F_i\theta_{ijk})\bigr]\eta_{jk}\Bigr\|_\infty \le c_4[\log(p_ns_n)/n]^{1/2} \le c_5n^{\beta/2-1/2}, \tag{2.19}
\]
with probability tending to one, for some c5 > 0. Furthermore, we have
\begin{align*}
\Bigl\|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_i\theta_{ijk})\eta_{jk}\Bigr\|_\infty
&= \max_{l\le(p_n-h_n)s_n}\Bigl|n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty E(F_{il}\theta_{ijk})\eta_{jk}\Bigr|\\
&\le \max_{l\le(p_n-h_n)s_n}n^{-1}\sum_{i=1}^n\sum_{j\in H_n^c}\Bigl|\sum_{k=s_n+1}^\infty E(F_{il}\theta_{ijk})\eta_{jk}\Bigr|
\le \max_{l\le(p_n-h_n)s_n}n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty E(F_{il}\theta_{ijk})\eta_{jk}\Bigr|\\
&\le \max_{l\le(p_n-h_n)s_n}n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\bigl[E(F_{il}\theta_{ijk})\bigr]^2k^{-2\delta}\Bigr)^{1/2}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\\
&\le n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\omega_{jk}k^{-2\delta}\Bigr)^{1/2}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2} \le c_6q_ns_n^{-\delta}, \tag{2.20}
\end{align*}
for some constant c6 > 0, where the last inequality is by (A1) and (A7). By combining (2.14), (2.19) with (2.20), we conclude that \(\|n^{-1}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty \le c_7(n^{\beta/2-1/2} + q_ns_n^{-\delta})\), with probability tending to one, for some constant c7 > 0, which completes the proof of part 1). For part 2), it is easy to observe that
\begin{align*}
\|n^{-1}(\Theta_{H_n^c}\Lambda_{H_n^c})'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty &= \|n^{-1}\Lambda_{H_n^c}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty\\
&\le \lambda_{\max}(\Lambda_{H_n^c})\|n^{-1}\Theta_{H_n^c}'(Y - \Theta_{H_n^c}\eta_{H_n^c})\|_\infty \le c_8(n^{\beta/2-1/2} + q_ns_n^{-\delta}),
\end{align*}
with probability tending to one, for some constant c8 > 0, which completes the proof.
Proof of Lemma 10. First, recalling that \(\tilde\Theta = \Theta\Lambda^{-1}\) and \(\bar\nu = \Lambda\nu\) (so that \(\tilde\Theta_j\bar\nu_j = \Theta_j(\eta_j - \eta_j^*)\)), it is easy to see that
\begin{align*}
n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2 &= n^{-1}\sum_{j=1}^{p_n}\|\tilde\Theta_j\bar\nu_j\|_2^2 = \sum_{j=1}^{p_n}\bar\nu_j'(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j\\
&= \sum_{j=1}^{p_n}\bar\nu_j'E(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j + \sum_{j=1}^{p_n}\bar\nu_j'\bigl[n^{-1}\tilde\Theta_j'\tilde\Theta_j - E(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bigr]\bar\nu_j\\
&\le \lambda_{\max}(I)\|\bar\nu\|_2^2 + \|n^{-1}\tilde\Theta'\tilde\Theta - E(n^{-1}\tilde\Theta'\tilde\Theta)\|_\infty\sum_{j=1}^{p_n}\|\bar\nu_j\|_1^2\\
&\le \bigl[\lambda_{\max}(I) + s_n\|n^{-1}\tilde\Theta'\tilde\Theta - E(n^{-1}\tilde\Theta'\tilde\Theta)\|_\infty\bigr]\|\bar\nu\|_2^2
\le (m_1 + c_2s_nn^{\beta/2-1/2})\|\bar\nu\|_2^2,
\end{align*}
for some constant c2 > 0, where the last inequality is by (A4) and Lemma 8. Since \(s_nn^{\beta/2-1/2} = o(1)\), it follows that \(n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2 \le m_1[1+o(1)]\|\bar\nu\|_2^2\), with probability tending to one, which completes the proof of part 1). Moreover, we have
\begin{align*}
\|\bar\nu\|_1 &= \sum_{j=1}^{p_n}\sum_{k=1}^{s_n}|\bar\eta_{jk} - \bar\eta_{jk}^*| \le s_n^{1/2}\sum_{j=1}^{p_n}\|\bar\nu_j\|_2
= \bigl[\lambda_{\min}(I)\bigr]^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\bigl[\lambda_{\min}(I)\bigr]^{1/2}\|\bar\nu_j\|_2\\
&\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\bigl[\lambda_{\min}(I)\bigr]^{1/2}\|\bar\nu_j\|_2
\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\bigl[\bar\nu_j'E(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j\bigr]^{1/2}\\
&= m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\Bigl(\bar\nu_j'(n^{-1}\tilde\Theta_j'\tilde\Theta_j)\bar\nu_j + \bar\nu_j'\bigl[E(n^{-1}\tilde\Theta_j'\tilde\Theta_j) - n^{-1}\tilde\Theta_j'\tilde\Theta_j\bigr]\bar\nu_j\Bigr)^{1/2}\\
&\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\Bigl(n^{-1}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2 + \|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty\|\bar\nu_j\|_1^2\Bigr)^{1/2}\\
&\le m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}\Bigl(n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2 + \|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty^{1/2}\|\bar\nu_j\|_1\Bigr)\\
&= \Bigl(m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\Bigr) + \bigl(m_0^{-1/2}s_n^{1/2}\|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty^{1/2}\|\bar\nu\|_1\bigr)\\
&\le \Bigl(m_0^{-1/2}s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\Bigr) + \bigl(c_3s_n^{1/2}n^{\beta/4-1/4}\|\bar\nu\|_1\bigr),
\end{align*}
for some constant c3 > 0, where the last inequality is by Lemma 8. Invoking (A8), it is easy to see that \(\|\bar\nu\|_1 \le c_4s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\), with probability tending to one, for some constant c4 > 0, which completes the proof of part 2). For part 3), it is easy to observe that by combining part 2) with Lemma 5(c), we have
\begin{align*}
\lambda_n\|\bar\nu\|_1 &\le c_4\lambda_ns_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2
= c_4L^{-1}\sum_{j=1}^{p_n}\lambda_ns_n^{1/2}L\,n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\\
&\le c_4L^{-1}\sum_{j=1}^{p_n}\bigl[\rho_{\lambda_ns_n^{1/2}}\bigl(n^{-1/2}\|\Theta_j(\eta_j-\eta_j^*)\|_2\bigr) + 2^{-1}\mu n^{-1}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2\bigr]\\
&\le c_4L^{-1}\Bigl[P_{\lambda_n}(\eta^*) + P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j-\eta_j^*)\|_2^2\Bigr]. \tag{2.21}
\end{align*}
By combining (2.21) with part 1), it is easy to see that \(\lambda_n\|\bar\nu\|_1 \le c_5\bigl[P_{\lambda_n}(\eta^*) + P_{\lambda_n}(\eta) + \|\bar\nu\|_2^2\bigr]\), with probability tending to one, for some constant c5 > 0, which completes the proof of part 3).
Proof of Lemma 11. First, by Lemma 8, we have
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le c_0n^{\beta/2-1/2},
\]
with probability tending to one, for some constant c0 > 0. Moreover, under (A6), (A8), and (A10), we have τn ∼ n^{β/2−1/2}. Choosing τn = c0n^{β/2−1/2}, it is easy to see that
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le \tau_n = c_0n^{\beta/2-1/2}. \tag{2.22}
\]
Under (2.7), we have
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(\hat w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le \tau_n = c_0n^{\beta/2-1/2}. \tag{2.23}
\]
Combining (2.22) with (2.23), it is easy to see that
\[
\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^nF_iF_i'(\hat w_l - w_l)\Bigr\|_\infty \le \max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(\hat w_l'F_i - E_{il})F_i'\Bigr\|_\infty + \max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})F_i'\Bigr\|_\infty \le 2\tau_n = 2c_0n^{\beta/2-1/2}, \tag{2.24}
\]
which completes the proof of part 1). In addition, by the definition of the Dantzig selector method, it is easy to see that
\[
P\Bigl(\bigcap_{l=1}^{h_ns_n}\{\|\hat w_l\|_1 \le \|w_l\|_1\}\Bigr) \to 1, \quad\text{as } n\to\infty. \tag{2.25}
\]
Denote \(S_l = \{j : w_{lj} \ne 0\}\) as the support set of each vector \(w_l\); then (2.25) implies that
\begin{align*}
\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1 &\le \max_{l\le h_ns_n}\|(\hat w_l - w_l)_{S_l}\|_1 + \max_{l\le h_ns_n}\|(\hat w_l)_{S_l^c}\|_1\\
&\le \max_{l\le h_ns_n}\|(\hat w_l - w_l)_{S_l}\|_1 + \max_{l\le h_ns_n}\bigl[\|(w_l)_{S_l}\|_1 - \|(\hat w_l)_{S_l}\|_1\bigr]\\
&\le 2\max_{l\le h_ns_n}\|(\hat w_l - w_l)_{S_l}\|_1 \le 2\max_{l\le h_ns_n}\rho_{nl}^{1/2}\|(\hat w_l - w_l)_{S_l}\|_2
\le 2\rho_n^{1/2}\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2, \tag{2.26}
\end{align*}
with probability tending to one. Furthermore, we have
\[
\max_{l\le h_ns_n}(\hat w_l - w_l)'\Bigl(n^{-1}\sum_{i=1}^nF_iF_i'\Bigr)(\hat w_l - w_l) \le \max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigl\|n^{-1}\sum_{i=1}^nF_iF_i'(\hat w_l - w_l)\Bigr\|_\infty \le 2c_0n^{\beta/2-1/2}\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1, \tag{2.27}
\]
with probability tending to one, where the last inequality is by (2.24). In addition, we also have
\begin{align*}
\max_{l\le h_ns_n}(\hat w_l - w_l)'\Bigl(n^{-1}\sum_{i=1}^nF_iF_i'\Bigr)(\hat w_l - w_l)
&= \max_{l\le h_ns_n}\Bigl((\hat w_l - w_l)'E(F_iF_i')(\hat w_l - w_l) - (\hat w_l - w_l)'\Bigl[E(F_iF_i') - n^{-1}\sum_{i=1}^nF_iF_i'\Bigr](\hat w_l - w_l)\Bigr)\\
&\ge \max_{l\le h_ns_n}\Bigl(\lambda_{\min}(I)\|\hat w_l - w_l\|_2^2 - \|\hat w_l - w_l\|_1^2\Bigl\|E(F_iF_i') - n^{-1}\sum_{i=1}^nF_iF_i'\Bigr\|_\infty\Bigr)\\
&\ge \max_{l\le h_ns_n}\bigl(\lambda_{\min}(I)\|\hat w_l - w_l\|_2^2 - \|\hat w_l - w_l\|_1^2\|E(G_iG_i') - n^{-1}\Theta'\Theta\|_\infty\bigr)\\
&\ge \lambda_{\min}(I)\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2^2\Bigr) - \|E(G_iG_i') - n^{-1}\Theta'\Theta\|_\infty\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1^2\Bigr)\\
&\ge c_1s_n^{-a}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2^2\Bigr) - c_2n^{\beta/2-1/2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1^2\Bigr)\\
&\ge (c_1s_n^{-a} - c_3\rho_nn^{\beta/2-1/2})\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2^2\Bigr), \tag{2.28}
\end{align*}
with probability tending to one, where the second last inequality is by (A3), (A4), and Lemma 8, while the last inequality is by (2.26). Combining (2.26), (2.27), (2.28) with (A8) entails that
\[
\max_{l\le h_ns_n}\|\hat w_l - w_l\|_2 \le c_4\rho_n^{1/2}s_n^an^{\beta/2-1/2}, \tag{2.29}
\]
with probability tending to one, for some constant c4 > 0, which completes the proof of part 2). Lastly, combining (2.26) with (2.29) entails that
\[
\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1 \le c_5\rho_ns_n^an^{\beta/2-1/2},
\]
with probability tending to one, for some constant c5 > 0, which completes the proof of part 3).
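The Dantzig-selector-type program around (2.7), which minimizes the l1 norm subject to a sup-norm constraint on the sample Gram residual, is a linear program. A hedged illustration only (tiny dimensions, synthetic data, scipy's LP solver), splitting w into positive and negative parts:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig(G, e, tau):
    """min ||w||_1 subject to ||G w - e||_inf <= tau, via LP with w = u - v, u, v >= 0."""
    d = G.shape[1]
    c = np.ones(2 * d)                   # objective: sum(u) + sum(v) = ||w||_1
    A = np.vstack([np.hstack([G, -G]),   #  G(u - v) - e <= tau
                   np.hstack([-G, G])])  # -G(u - v) + e <= tau
    b = np.concatenate([tau + e, tau - e])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v

rng = np.random.default_rng(2)
n, d = 500, 8
F = rng.standard_normal((n, d))
G = F.T @ F / n                          # sample Gram matrix n^{-1} sum_i F_i F_i'
e = np.zeros(d); e[0] = 1.0              # first coordinate vector
w_hat = dantzig(G, e, tau=0.05)
assert np.max(np.abs(G @ w_hat - e)) <= 0.05 + 1e-8
```

The program is always feasible here since the exact solution of Gw = e satisfies the constraint with zero residual; tightening tau trades sparsity of w against the residual level, mirroring the role of tau_n in Lemma 11.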
Proof of Lemma 12. First, it is easy to see that for each l = 1, ..., hnsn, we have
\begin{align*}
\hat S_{il} - S_{il} &= \Bigl[n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr]^{-1/2}(\hat w_l'F_i - E_{il})(Y_i - F_i'\hat\eta_{H_n^c}) - \bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\\
&= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l'F_i - E_{il})(Y_i - F_i'\hat\eta_{H_n^c}) - \bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\\
&= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}\bigl[(w_l'F_i - E_{il}) + (\hat w_l - w_l)'F_i\bigr]\Bigl[\varepsilon_i + F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c}) + \sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr]\\
&\qquad - \bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\\
&= \Delta_{1il} + \Delta_{2il} + \Delta_{3il} + \Delta_{4il} + \Delta_{5il} + \Delta_{6il},
\end{align*}
where we denote
\begin{align*}
\Delta_{1il} &= \Bigl(\Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2} - 1\Bigr)\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i,\\
\Delta_{2il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c}),\\
\Delta_{3il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr),\\
\Delta_{4il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\varepsilon_i,\\
\Delta_{5il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_iF_i'(\eta_{H_n^c} - \hat\eta_{H_n^c}),\\
\Delta_{6il} &= \Bigl[E(E_{il}^2)\Big/\Bigl(n^{-1}\sum_{i_1=1}^nE_{i_1l}^2\Bigr)\Bigr]^{1/2}\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr).
\end{align*}
Accordingly, it is easy to observe that
\[
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(\hat S_{il} - S_{il})\Bigr| \le \sum_{m=1}^6\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{mil}\Bigr|. \tag{2.30}
\]
For the first term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{1il}\Bigr| &\le \|\Lambda\hat\Lambda^{-1} - I\|_\infty\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}\varepsilon_i\Bigr|\\
&= n^{1/2}\|\Lambda\hat\Lambda^{-1} - I\|_\infty\max_{l\le h_ns_n}\Bigl|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}\varepsilon_i\Bigr| \le c_0n^{\beta-1/2}, \tag{2.31}
\end{align*}
with probability tending to one, for some constant c0 > 0, where the last inequality is by Lemma 7 and Lemma 8.
For the second term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{2il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c})\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\|\hat\eta_{H_n^c} - \eta_{H_n^c}\|_1\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}F_i'\Bigr\|_\infty\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\|\hat\eta - \eta\|_1\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^n(w_l'F_i - E_{il})\bigl[E(E_{il}^2)\bigr]^{-1/2}F_i'\Bigr\|_\infty
\le c_1\lambda_nn^{\beta/2}q_ns_n^{a/2+1}, \tag{2.32}
\end{align*}
with probability tending to one, for some constant c1 > 0, where the last inequality is by Lemma 7, Lemma 8 and Theorem 3.
For the third term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{3il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\Bigl(\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr|\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl|\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr). \tag{2.33}
\end{align*}
Regarding the maximum, it is easy to see from (A2), (A6) and (A8) that
\[
\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr| \le c_1[\log(np_ns_n)]^{1/2} \le c_2n^{\beta/2}, \tag{2.34}
\]
with probability tending to one, for some constants c1, c2 > 0. Regarding the average, we have
\begin{align*}
E\Bigl(n^{-1}\sum_{i=1}^n\Bigl|\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)
&\le E\Bigl(n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)
\le E\Bigl(n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)^{1/2}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\Bigr)\\
&= n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}E\Bigl[\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)^{1/2}\Bigr]\\
&\le n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\Bigl[E\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)\Bigr]^{1/2}\\
&= n^{-1}\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)^{1/2}\Bigl(\sum_{k=s_n+1}^\infty\omega_{jk}k^{-2\delta}\Bigr)^{1/2} = O(q_ns_n^{-\delta}), \tag{2.35}
\end{align*}
where the last bound is by (A1) and (A7). Hence, by combining (2.33), (2.34) with (2.35), it is easy to see that
\[
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{3il}\Bigr| \le O_p(n^{\beta/2+1/2}q_ns_n^{-\delta}). \tag{2.36}
\]
For the fourth term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{4il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\varepsilon_i\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\Bigl(\max_{l\le h_ns_n}\bigl[E(E_{il}^2)\bigr]^{-1/2}\Bigr)\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\|n^{-1}\Theta'\varepsilon\|_\infty\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\|n^{-1}\Theta'\varepsilon\|_\infty
\le c_3n^{\beta-1/2}\rho_ns_n^{3a/2}, \tag{2.37}
\end{align*}
with probability tending to one, for some constant c3 > 0, where the last inequality is by Lemma 7, Lemma 8, Lemma 11 and (A3).
For the fifth term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{5il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_iF_i'(\eta_{H_n^c} - \hat\eta_{H_n^c})\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\|\hat\eta - \eta\|_1\Bigl(\max_{l\le h_ns_n}\Bigl\|n^{-1}\sum_{i=1}^nF_iF_i'(\hat w_l - w_l)\Bigr\|_\infty\Bigr)
\le c_4\lambda_nn^{\beta/2}q_ns_n^{a+1}, \tag{2.38}
\end{align*}
with probability tending to one, for some constant c4 > 0, where the last inequality is by Lemma 7, Lemma 11, Theorem 3 and (A3).
For the sixth term, we have
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\Delta_{6il}\Bigr| &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n\bigl[E(E_{il}^2)\bigr]^{-1/2}(\hat w_l - w_l)'F_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr|\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\Bigl\|n^{-1}\sum_{i=1}^nF_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr\|_\infty\\
&\le n^{1/2}\bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)\bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)\Bigl(\max_{l\le(p_n-h_n)s_n}\max_{i\le n}|F_{il}|\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl|\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)\\
&\le O_p(n^{\beta}\rho_nq_ns_n^{3a/2-\delta}), \tag{2.39}
\end{align*}
where the last inequality is by Lemma 7, Lemma 11, (A2), (A3) and (2.35).
In summary, by combining (2.30), (2.31), (2.32), (2.36), (2.37), (2.38) with (2.39), it is easy to see that
\begin{align*}
\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(\hat S_{il} - S_{il})\Bigr| &\le c_0n^{\beta-1/2} + c_1\lambda_nn^{\beta/2}q_ns_n^{a/2+1} + O_p(n^{\beta/2+1/2}q_ns_n^{-\delta})
+ c_3n^{\beta-1/2}\rho_ns_n^{3a/2} + c_4\lambda_nn^{\beta/2}q_ns_n^{a+1} + O_p(n^{\beta}\rho_nq_ns_n^{3a/2-\delta})\\
&\le c_5\bigl(n^{\beta-1/2}\rho_ns_n^{3a/2} + \lambda_nn^{\beta/2}q_ns_n^{a+1} + n^{\beta/2+1/2}q_ns_n^{-\delta}\log s_n + n^{\beta}\rho_nq_ns_n^{3a/2-\delta}\log s_n\bigr),
\end{align*}
with probability tending to one, for some constant c5 > 0, which completes the proof of part 1).
Next, we prove part 2). First, it is easy to observe that
\[
\max_{l\le h_ns_n}\Bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\Bigr] \le 100\sum_{m=1}^6\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{mil}^2. \tag{2.40}
\]
For the first term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{1il}^2 &\le \|\Lambda\hat\Lambda^{-1} - I\|_\infty^2\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\bigl(\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\bigr)^2\\
&\le \|\Lambda\hat\Lambda^{-1} - I\|_\infty^2\max_{l\le h_ns_n}\max_{i\le n}\bigl(\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\varepsilon_i\bigr)^2 \le c_6n^{2\beta-1}\log n, \tag{2.41}
\end{align*}
with probability tending to one, for some constant c6 > 0, where the last inequality is by Lemma 7 and (A2).
For the second term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{2il}^2 &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\bigl(\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'(\eta_{H_n^c} - \hat\eta_{H_n^c})\bigr)^2\\
&\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\|\hat\eta - \eta\|_1^2\Bigl(\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\bigl\|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_i'\bigr\|_\infty^2\Bigr)\\
&\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\|\hat\eta - \eta\|_1^2\Bigl(\max_{l\le h_ns_n}\max_{i\le n}\max_{l_1\le(p_n-h_n)s_n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})F_{il_1}\bigr|^2\Bigr)
\le c_7\lambda_n^2n^{2\beta}q_n^2s_n^{a+2}, \tag{2.42}
\end{align*}
with probability tending to one, for some constant c7 > 0, where the last inequality is by Lemma 7, Theorem 3 and (A2).
For the third term, we have
\[
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{3il}^2 \le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\Bigl(\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr|^2\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)^2\Bigr). \tag{2.43}
\]
Regarding the maximum, it is easy to see that
\[
\max_{l\le h_ns_n}\max_{i\le n}\bigl|\bigl[E(E_{il}^2)\bigr]^{-1/2}(w_l'F_i - E_{il})\bigr|^2 \le c_8\log(nh_ns_n) \le c_8\log(np_ns_n) \le c_9n^{\beta}, \tag{2.44}
\]
with probability tending to one, for some constants c8, c9 > 0, where the last inequality is by (A2), (A6) and (A8). Regarding the average, it is easy to see that
\begin{align*}
E\Bigl[n^{-1}\sum_{i=1}^n\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)^2\Bigr]
&\le E\Bigl[n^{-1}\sum_{i=1}^n\Bigl(\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|\Bigr)^2\Bigr]
\le E\Bigl[n^{-1}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl|\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr|^2\Bigr]\\
&\le E\Bigl[n^{-1}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\theta_{ijk}^2k^{-2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr)\Bigr]\\
&= n^{-1}q_n\sum_{i=1}^n\sum_{j=1}^{q_n}\Bigl(\sum_{k=s_n+1}^\infty\omega_{jk}k^{-2\delta}\Bigr)\Bigl(\sum_{k_1=s_n+1}^\infty\eta_{jk_1}^2k_1^{2\delta}\Bigr) = O(q_n^2s_n^{-2\delta}). \tag{2.45}
\end{align*}
Hence, by combining (2.43), (2.44), (2.45) with Lemma 7, it is easy to show that
\[
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{3il}^2 \le O_p(n^{\beta}q_n^2s_n^{-2\delta}). \tag{2.46}
\]
For the fourth term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{4il}^2 &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(n^{-1}\sum_{i=1}^n\|F_i\varepsilon_i\|_\infty^2\Bigr)\\
&\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(\max_{i\le n}\max_{l\le(p_n-h_n)s_n}|F_{il}\varepsilon_i|\Bigr)^2
\le c_{10}\rho_n^2s_n^{3a}n^{2\beta-1}\log n, \tag{2.47}
\end{align*}
with probability tending to one, for some constant c10 > 0, where the last inequality is by Lemma 7, Lemma 11, (A1), (A2) and (A3).
For the fifth term, we have
\[
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{5il}^2 \le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\|\hat\eta - \eta\|_1^2\Bigl(\max_{i\le n}\|F_i\|_\infty^4\Bigr) \le c_{11}\lambda_n^2\rho_n^2q_n^2s_n^{4a+2}n^{3\beta-1}, \tag{2.48}
\]
with probability tending to one, for some constant c11 > 0, where the last inequality is by Lemma 7, Lemma 11, Theorem 3, (A1), (A2) and (A3).
For the sixth term, we have
\begin{align*}
\max_{l\le h_ns_n}n^{-1}\sum_{i=1}^n\Delta_{6il}^2 &\le \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(n^{-1}\sum_{i=1}^n\Bigl\|F_i\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)\Bigr\|_\infty^2\Bigr)\\
&= \bigl(1 + \|\Lambda\hat\Lambda^{-1} - I\|_\infty\bigr)^2\bigl[\lambda_{\min}(\Lambda)\bigr]^{-2}\Bigl(\max_{l\le h_ns_n}\|\hat w_l - w_l\|_1\Bigr)^2\Bigl(\max_{i\le n}\|F_i\|_\infty^2\Bigr)\Bigl(n^{-1}\sum_{i=1}^n\Bigl(\sum_{j\in H_n^c}\sum_{k=s_n+1}^\infty\theta_{ijk}\eta_{jk}\Bigr)^2\Bigr)\\
&\le O_p(\rho_n^2q_n^2s_n^{3a-2\delta}n^{2\beta-1}), \tag{2.49}
\end{align*}
where the last inequality is by Lemma 7, Lemma 11, (2.45), (A1), (A2) and (A3).
In summary, by combining (2.41), (2.42), (2.46), (2.47), (2.48), (2.49) with (2.40), it is easy to observe that
\[
\max_{l\le h_ns_n}\Bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\Bigr] \le c_{12}\bigl[\lambda_n^2n^{2\beta}q_n^2s_n^{a+2} + n^{\beta}q_n^2s_n^{-2\delta}(\log s_n)^2 + \rho_n^2s_n^{3a}n^{2\beta-1}\log n + \lambda_n^2\rho_n^2q_n^2s_n^{4a+2}n^{3\beta-1} + n^{2\beta-1}\rho_n^2q_n^2s_n^{3a-2\delta}(\log s_n)^2\bigr], \tag{2.50}
\]
with probability tending to one, for some constant c12 > 0, which further implies that
\[
\max_{l\le h_ns_n}\Bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\Bigr]^{1/2} \le c_{13}\bigl[\lambda_nn^{\beta}q_ns_n^{a/2+1} + n^{\beta/2}q_ns_n^{-\delta}\log s_n + \rho_ns_n^{3a/2}n^{\beta-1/2}(\log n)^{1/2} + \lambda_n\rho_nq_ns_n^{2a+1}n^{3\beta/2-1/2} + n^{\beta-1/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n\bigr], \tag{2.51}
\]
with probability tending to one, for some constant c13 > 0, which completes the proof of part 2).
2.5.6 Proofs of Main Theorems
Proof of Theorem 3. Recall that we denote \(\nu = \eta - \eta^*\), and \(\bar\nu = \bar\eta - \bar\eta^* = \Lambda\nu\). By the first-order necessary condition of optimization theory, any local minimizer η of Qn(η) from (2.4) must satisfy \(\eta \in \{\eta : \langle\nabla L_n(\eta) + \nabla P_{\lambda_n}(\eta), -\nu\rangle \ge 0,\ \|\bar\nu\|_1 \le R_n\}\), where 〈a, b〉 = a′b for any vectors a, b. Hence, in order to prove Theorem 3, it is sufficient to show that parts 1) and 2) of Theorem 3 hold for any η in this set. We therefore assume
\[
\eta \in \{\eta : \langle\nabla L_n(\eta) + \nabla P_{\lambda_n}(\eta), -\nu\rangle \ge 0,\ \|\bar\nu\|_1 \le R_n\}. \tag{2.52}
\]
Moreover, we have
\begin{align*}
\langle\nabla L_n(\eta) - \nabla L_n(\eta^*), \nu\rangle &= \nu'(n^{-1}\Theta'\Theta)\nu
= \bar\nu'E(n^{-1}\tilde\Theta'\tilde\Theta)\bar\nu - \bar\nu'\bigl[E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\bigr]\bar\nu\\
&\ge \lambda_{\min}(I)\|\bar\nu\|_2^2 - \|E(n^{-1}\tilde\Theta'\tilde\Theta) - n^{-1}\tilde\Theta'\tilde\Theta\|_\infty\|\bar\nu\|_1^2
\ge m_0\|\bar\nu\|_2^2 - c_0n^{\beta/2-1/2}\|\bar\nu\|_1^2, \tag{2.53}
\end{align*}
with probability tending to one, for some constant c0 > 0, where the last inequality is by (A4) and Lemma 8. In addition, with a slight abuse of notation, denote
\[
P_{\lambda_n,\mu}(\eta) = P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j\eta_j\|_2^2 = \sum_{j=1}^{p_n}\bigl(\rho_{\lambda_ns_n^{1/2}}\bigl(n^{-1/2}\|\Theta_j\eta_j\|_2\bigr) + 2^{-1}\mu n^{-1}\|\Theta_j\eta_j\|_2^2\bigr).
\]
Under (P5), (A3) and (A4), it is easy to see that \(P_{\lambda_n,\mu}(\eta)\) is convex in η, which implies that
\[
P_{\lambda_n,\mu}(\eta^*) - P_{\lambda_n,\mu}(\eta) \ge \langle\nabla P_{\lambda_n,\mu}(\eta), -\nu\rangle,
\]
which further implies that
\[
\langle\nabla P_{\lambda_n}(\eta), -\nu\rangle \le P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2. \tag{2.54}
\]
By combining (2.52), (2.53) with (2.54), we have
\begin{align*}
m_0\|\bar\nu\|_2^2 - c_0n^{\beta/2-1/2}\|\bar\nu\|_1^2 &\le \langle\nabla L_n(\eta) - \nabla L_n(\eta^*), \nu\rangle
= \langle\nabla L_n(\eta) + \nabla P_{\lambda_n}(\eta), \nu\rangle + \langle\nabla P_{\lambda_n}(\eta), -\nu\rangle + \langle-\nabla L_n(\eta^*), \nu\rangle\\
&\le P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2 + \|n^{-1}\tilde\Theta'(Y - \Theta\eta^*)\|_\infty\|\bar\nu\|_1. \tag{2.55}
\]
By combining (2.52), (2.55) with Lemma 9 (the term \(c_0n^{\beta/2-1/2}\|\bar\nu\|_1^2\) is absorbed using the constraint \(\|\bar\nu\|_1 \le R_n\)), we have
\[
m_0\|\bar\nu\|_2^2 \le P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta) + 2^{-1}\mu n^{-1}\sum_{j=1}^{p_n}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2 + c_1(n^{\beta/2-1/2}R_n + q_ns_n^{-\delta})\|\bar\nu\|_1, \tag{2.56}
\]
with probability tending to one, for some constant c1 > 0. By combining (2.56) with Lemma 10 and (A9), we have
\[
\bigl[m_0 - 2^{-1}m_1\mu(1 + o(1))\bigr]\|\bar\nu\|_2^2 \le (1 + o(1))\bigl[P_{\lambda_n}(\eta^*) - P_{\lambda_n}(\eta)\bigr]. \tag{2.57}
\]
By combining (2.57) with Lemma 6 and (A4), we have
\[
0 \le \bigl[m_0 - 2^{-1}m_1\mu(1 + o(1))\bigr]\|\bar\nu\|_2^2 \le c_2\lambda_ns_n^{1/2}\Bigl(\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2 - \sum_{j\in A_n^c}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2\Bigr), \tag{2.58}
\]
with probability tending to one, for some constant c2 > 0. On one hand, (2.58) implies that
\begin{align*}
\|\bar\nu\|_2^2 &\le c_3\lambda_ns_n^{1/2}\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2
\le c_3\lambda_ns_n^{1/2}q_n^{1/2}\Bigl(\sum_{j\in A_n}n^{-1}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2\Bigr)^{1/2}\\
&\le c_3\lambda_ns_n^{1/2}q_n^{1/2}\Bigl(\sum_{j=1}^{p_n}n^{-1}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2\Bigr)^{1/2}
\le c_4\lambda_ns_n^{1/2}q_n^{1/2}\|\bar\nu\|_2,
\end{align*}
with probability tending to one, for some constants c3, c4 > 0, where the last inequality is by Lemma 10. Then, it follows that \(\|\bar\nu\|_2 \le c_4\lambda_ns_n^{1/2}q_n^{1/2}\), which further entails that
\[
\|\nu\|_2 = \|\Lambda^{-1}\bar\nu\|_2 \le \lambda_{\max}(\Lambda^{-1})\|\bar\nu\|_2 = \bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\|\bar\nu\|_2 \le c_5\lambda_ns_n^{a/2+1/2}q_n^{1/2},
\]
with probability tending to one, for some constant c5 > 0, which completes the proof of part 1). On the other hand, (2.58) also implies that
\[
\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2 \ge \sum_{j\in A_n^c}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2. \tag{2.59}
\]
Under Lemma 10 and (2.59), we have
\begin{align*}
\|\bar\nu\|_1 &\le c_5s_n^{1/2}\sum_{j=1}^{p_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2 \le 2c_5s_n^{1/2}\sum_{j\in A_n}n^{-1/2}\|\Theta_j(\eta_j - \eta_j^*)\|_2\\
&\le 2c_5s_n^{1/2}q_n^{1/2}\Bigl(\sum_{j=1}^{p_n}n^{-1}\|\Theta_j(\eta_j - \eta_j^*)\|_2^2\Bigr)^{1/2} \le c_6s_n^{1/2}q_n^{1/2}\|\bar\nu\|_2 \le c_7\lambda_ns_nq_n,
\end{align*}
with probability tending to one, for some constants c5, c6, c7 > 0, which further entails that
\[
\|\nu\|_1 = \|\Lambda^{-1}\bar\nu\|_1 \le \lambda_{\max}(\Lambda^{-1})\|\bar\nu\|_1 = \bigl[\lambda_{\min}(\Lambda)\bigr]^{-1}\|\bar\nu\|_1 \le c_8\lambda_ns_n^{a/2+1}q_n,
\]
with probability tending to one, for some constant c8 > 0, which completes the proof of part 2).
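The estimator analyzed in Theorem 3 minimizes a least-squares criterion plus a groupwise folded concave penalty. As a hedged numerical sketch only, the same structure can be illustrated with a convex group-lasso surrogate in place of the folded concave penalty, fitted by proximal-gradient (groupwise soft-thresholding) steps; all dimensions, the step size, and the tuning value below are illustrative assumptions:

```python
import numpy as np

def group_prox_step(eta, Theta_blocks, y, lam, step):
    """One proximal-gradient step for 0.5 * n^{-1} ||y - sum_j Theta_j eta_j||^2
    plus lam * sum_j ||eta_j||_2 (convex stand-in for the folded concave penalty)."""
    n = len(y)
    resid = y - sum(T @ e for T, e in zip(Theta_blocks, eta))
    new = []
    for T, e in zip(Theta_blocks, eta):
        g = e + step * (T.T @ resid) / n            # gradient step on block j
        norm = np.linalg.norm(g)
        shrink = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
        new.append(shrink * g)                      # group soft-thresholding
    return new

rng = np.random.default_rng(4)
n, s = 200, 4
blocks = [rng.standard_normal((n, s)) for _ in range(6)]
beta = [np.ones(s)] + [np.zeros(s) for _ in range(5)]   # only block 1 is active
y = sum(T @ b for T, b in zip(blocks, beta)) + 0.1 * rng.standard_normal(n)
eta = [np.zeros(s) for _ in range(6)]
for _ in range(300):
    eta = group_prox_step(eta, blocks, y, lam=0.05, step=0.5)
# the active block is recovered; inactive blocks are shrunk to (near) zero
assert np.linalg.norm(eta[0] - beta[0]) < 0.5
assert all(np.linalg.norm(eta[j]) < 0.2 for j in range(1, 6))
```

The groupwise thresholding is what produces whole-block sparsity, the finite-sample analogue of the set An versus its complement in the proof above.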
Proof of Theorem 4. First, we have
\[
\bigl|\|T\|_\infty - \|T^*\|_\infty\bigr| \le \|T - T^*\|_\infty = \Bigl\|n^{-1/2}\sum_{i=1}^n(\hat S_i - S_i)\Bigr\|_\infty = \max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^n(\hat S_{il} - S_{il})\Bigr| \le c_0g(n), \tag{2.60}
\]
with probability tending to one, for some constant c0 > 0, where \(g(n) = n^{\beta-1/2}\rho_ns_n^{3a/2} + \lambda_nn^{\beta/2}q_ns_n^{a+1} + n^{\beta/2+1/2}q_ns_n^{-\delta}\log s_n + n^{\beta}\rho_nq_ns_n^{3a/2-\delta}\log s_n\), and the last inequality is by Lemma 12. Second, we have
\[
\bigl|\|T_e\|_\infty - \|T_e^*\|_\infty\bigr| \le \|T_e - T_e^*\|_\infty = \Bigl\|n^{-1/2}\sum_{i=1}^ne_i(\hat S_i - S_i)\Bigr\|_\infty = \max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr|. \tag{2.61}
\]
Since {ei : i ≤ n} is a set of independent and identically distributed standard normal random variables independent of the data, the Hoeffding inequality gives, for any l = 1, ..., hnsn and t > 0,
\[
P_e\Bigl(\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr| \ge t\Bigr) \le 2\exp\Bigl(-\frac{t^2}{2n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2}\Bigr),
\]
where \(P_e(\cdot)\) denotes probability with respect to e. Then, the union bound yields
\[
P_e\Bigl(\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr| \ge t\Bigr) \le 2h_ns_n\exp\Bigl(-\frac{t^2}{2\max_{l\le h_ns_n}\bigl[n^{-1}\sum_{i=1}^n(\hat S_{il} - S_{il})^2\bigr]}\Bigr).
\]
Together with Lemma 12, it is easy to see that
\[
P_e\Bigl(\max_{l\le h_ns_n}\Bigl|n^{-1/2}\sum_{i=1}^ne_i(\hat S_{il} - S_{il})\Bigr| \le c_1f(n)\Bigr) \to 1,
\]
with probability tending to one, for some constant c1 > 0, where
\[
f(n) = \lambda_nn^{3\beta/2}q_ns_n^{a/2+1} + n^{\beta}q_ns_n^{-\delta}\log s_n + \rho_ns_n^{3a/2}n^{3\beta/2-1/2}(\log n)^{1/2} + \lambda_n\rho_nq_ns_n^{2a+1}n^{2\beta-1/2} + n^{3\beta/2-1/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n.
\]
Together with (2.61), we obtain
\[
P_e\bigl(\bigl|\|T_e\|_\infty - \|T_e^*\|_\infty\bigr| \le c_1f(n)\bigr) \to 1. \tag{2.62}
\]
Moreover, we have
\[
g(n)[\log(p_ns_n)]^{1/2} \sim g(n)n^{\beta/2} = n^{3\beta/2-1/2}\rho_ns_n^{3a/2} + \lambda_nn^{\beta}q_ns_n^{a+1} + n^{\beta+1/2}q_ns_n^{-\delta}\log s_n + n^{3\beta/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n = o(1), \tag{2.63}
\]
under (A6), (A8) and (A9). In addition, we also have
\[
f(n)[\log(p_ns_n)]^{1/2} \sim f(n)n^{\beta/2} = \lambda_nn^{2\beta}q_ns_n^{a/2+1} + n^{3\beta/2}q_ns_n^{-\delta}\log s_n + \rho_ns_n^{3a/2}n^{2\beta-1/2}(\log n)^{1/2} + \lambda_n\rho_nq_ns_n^{2a+1}n^{5\beta/2-1/2} + n^{2\beta-1/2}\rho_nq_ns_n^{3a/2-\delta}\log s_n = o(1), \tag{2.64}
\]
under (A6), (A8) and (A9). Furthermore, (A5) implies that
\[
E(S_{il}^2) \ge c_2, \tag{2.65}
\]
for some universal constant c2 > 0, and (A2) implies that
\[
\|S_{il}\|_{\varphi_2} \le c_3, \tag{2.66}
\]
for some universal constant c3 > 0. Hence, by combining (2.60), (2.62), (2.63), (2.64), (2.65), (2.66) with (A6), we have
\[
\lim_{n\to\infty}\sup_{\alpha\in(0,1)}\bigl|P(\|T\|_\infty \le \phi'(\alpha)) - \alpha\bigr| = 0,
\]
by Lemma H.7 in Ning and Liu [2016], which completes the proof.
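The Gaussian multiplier bootstrap behind Te can be sketched numerically: given score vectors Si, the statistic \(\|n^{-1/2}\sum_i e_iS_i\|_\infty\) is recomputed over many draws of standard normal multipliers, and its conditional quantile calibrates the test. A toy illustration only (Gaussian scores, dimensions and replication counts purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def multiplier_bootstrap_quantile(S, alpha, n_boot=300):
    """Conditional (1 - alpha) quantile of ||n^{-1/2} sum_i e_i S_i||_inf."""
    n = S.shape[0]
    e = rng.standard_normal((n_boot, n))
    T_e = np.abs(e @ S / np.sqrt(n)).max(axis=1)  # one sup-norm per bootstrap draw
    return np.quantile(T_e, 1 - alpha)

# Under the null, T = ||n^{-1/2} sum_i S_i||_inf should exceed the
# bootstrap quantile with probability roughly alpha.
n, d, alpha = 200, 20, 0.05
rejections, n_rep = 0, 100
for _ in range(n_rep):
    S = rng.standard_normal((n, d))
    T = np.abs(S.sum(axis=0) / np.sqrt(n)).max()
    rejections += T > multiplier_bootstrap_quantile(S, alpha)
print(f"empirical size {rejections / n_rep:.3f} (nominal {alpha})")
```

The empirical rejection rate should land near the nominal level, which is exactly the uniform-in-alpha validity statement established by Theorem 4.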
Bibliography
James O. Ramsay and C. J. Dalzell. Some tools for functional data analysis. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 53(3):539–572, 1991. ISSN
0035-9246.
Hervé Cardot, Frédéric Ferraty, André Mas, and Pascal Sarda. Testing hypotheses in the functional linear model. Scandinavian Journal of Statistics. Theory and Applications, 30(1):241–255, 2003. ISSN 0303-6898.
Antonio Cuevas, Manuel Febrero, and Ricardo Fraiman. Linear functional regression: the
case of fixed design and functional response. Canadian Journal of Statistics. La Revue
Canadienne de Statistique, 30(2):285–300, 2002. ISSN 0319-5724.
Fang Yao, Hans-Georg Müller, and Jane-Ling Wang. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100(470):577–590, 2005a. ISSN 0162-1459. doi: 10.1198/016214504000001745.
James O. Ramsay and B. W. Silverman. Functional data analysis. Springer Series in Statistics.
Springer, New York, second edition, 2005. ISBN 978-0387-40080-8; 0-387-40080-X.
John A. Rice and B. W. Silverman. Estimating the mean and covariance structure nonparamet-
rically when the data are curves. Journal of the Royal Statistical Society, Series B, 53(1):
233–243, 1991. ISSN 0035-9246.
Fang Yao, Hans-Georg Müller, and Jane-Ling Wang. Functional linear regression analysis for longitudinal data. Annals of Statistics, 33(6):2873–2903, 2005b.
P. Hall, H. G. Müller, and J. L. Wang. Properties of principal component methods for functional and longitudinal data analysis. Annals of Statistics, 34(3):1493–1517, 2006. ISSN 0090-5364. doi: 10.1214/009053606000000272.
Tony Cai and Peter Hall. Prediction in functional linear regression. Annals of Statistics, 34:2159–2179, 2006.
Jin-Ting Zhang and Jianwei Chen. Statistical inferences for functional data. Annals of Statis-
tics, 35:1052–1079, 2007.
Peter Hall and Joe L. Horowitz. Methodology and convergence rates for functional linear
regression. Annals of Statistics, 35:70–91, 2007.
M. Escabias, A. M. Aguilera, and M. J. Valderrama. Principal component estimation of func-
tional logistic regression: discussion of two different approaches. Journal of Nonparametric
Statistics, 16(3-4):365–384, 2004. ISSN 1048-5252.
Hervé Cardot and Pascal Sarda. Estimation in generalized linear models for functional data via penalized likelihood. Journal of Multivariate Analysis, 92(1):24–41, 2005. ISSN 0047-259X. doi: 10.1016/j.jmva.2003.08.008.
Hans-Georg Müller and Ulrich Stadtmüller. Generalized functional linear models. Annals of Statistics, 33(2):774–805, 2005. ISSN 0090-5364. doi: 10.1214/009053604000001156.
Jianqing Fan and Jin-Ting Zhang. Two-step estimation of functional linear models with ap-
plications to longitudinal data. Journal of the Royal Statistical Society, Series B, 62(2):
303–322, 2000. ISSN 1369-7412.
Jianqing Fan, Qiwei Yao, and Zongwu Cai. Adaptive varying-coefficient linear models. Jour-
nal of the Royal Statistical Society, Series B, 65(1):57–80, 2003. ISSN 1369-7412. doi:
10.1111/1467-9868.00372.
Jeffrey S. Morris, Marina Vannucci, Philip J. Brown, and Raymond J. Carroll. Wavelet-
based nonparametric modeling of hierarchical functions in colon carcinogenesis. Journal
of the American Statistical Association, 98(463):573–597, 2003. ISSN 0162-1459. doi:
10.1198/016214503000000422.
Hans-Georg Müller and Fang Yao. Functional additive models. Journal of the American
Statistical Association, 103(484):1534–1544, 2008. ISSN 0162-1459. doi:
10.1198/016214508000000751.
Fang Yao and Hans-Georg Müller. Functional quadratic regression. Biometrika, 97(1):49–64,
2010. ISSN 0006-3444. doi: 10.1093/biomet/asp069.
Robert J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58(1):267–288, 1996.
Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American Statistical Association, 96:1348–1360, December
2001.
Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Asso-
ciation, 101(476):1418–1429, 2006. ISSN 0162-1459. doi: 10.1198/016214506000000735.
Hyejin Shin. Partial functional linear regression. Journal of Statistical Planning and Inference,
139(10):3405–3418, 2009. doi: 10.1016/j.jspi.2009.03.001.
Ying Lu, Jiang Du, and Zhimeng Sun. Functional partially linear quantile regression model.
Metrika, 77(2):317–332, 2014. doi: 10.1007/s00184-013-0439-7.
Yehua Li, Naisyin Wang, and Raymond J. Carroll. Generalized functional linear models with
semiparametric single-index interactions. Journal of the American Statistical Association,
105(490):621–633, 2010. ISSN 0162-1459. doi: 10.1198/jasa.2010.tm09313.
Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models.
Annals of Statistics, 36:1509–1533, 2008.
Yongdai Kim, Hosik Choi, and Hee-Seok Oh. Smoothly clipped absolute deviation on high
dimensions. Journal of the American Statistical Association, 103(484):1665–1673, 2008.
doi: 10.1198/016214508000001066.
Wenjiang J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational
and Graphical Statistics, 7(3):397–416, 1998.
Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization,
1(3):127–239, 2014.
Hansheng Wang, Runze Li, and Chih-Ling Tsai. Tuning parameter selectors for the smoothly
clipped absolute deviation method. Biometrika, 94(3):553–568, 2007.
Jing Lei. Adaptive global testing for functional linear models. Journal of the American Statis-
tical Association, 109(506):624–634, 2014.
Jianqing Fan, Shaojun Guo, and Ning Hao. Variance estimation using refitted cross-validation
in ultrahigh dimensional regression. Journal of the Royal Statistical Society, Series B, 74:
37–65, 2012.
Roger D. Peng, Francesca Dominici, Roberto Pastor-Barriuso, Scott L. Zeger, and Jonathan M.
Samet. Seasonal analyses of air pollution and mortality in 100 US cities. American Journal
of Epidemiology, 161(6):585–594, 2005.
Roger D. Peng, Francesca Dominici, and Thomas A. Louis. Model choice in time series studies
of air pollution and mortality. Journal of the Royal Statistical Society, Series A, 169(2):
179–203, 2006.
Evangelia Samoli, Massimo Stafoggia, Sophia Rodopoulou, Bart Ostro, Christophe Declercq,
Ester Alessandrini, Julio Díaz, Angeliki Karanasiou, Apostolos G. Kelessis, Alain Le Tertre,
et al. Associations between fine and coarse particles and mortality in Mediterranean cities:
results from the MED-PARTICLES project. Environmental Health Perspectives, 121(8):932,
2013.
Mathilde Pascal, Grégoire Falq, Vérène Wagner, Édouard Chatignoux, Magali Corso, Myriam
Blanchard, Sabine Host, Laurence Pascal, and Sophie Larrieu. Short-term impacts of particulate
matter (PM10, PM10–2.5, PM2.5) on mortality in nine French cities. Atmospheric
Environment, 95:175–184, 2014.
Frank C. Curriero, Karlyn S. Heiner, Jonathan M. Samet, Scott L. Zeger, Lisa Strug, and
Jonathan A. Patz. Temperature and mortality in 11 cities of the Eastern United States. American
Journal of Epidemiology, 155(1):80–87, 2002.
Peter Hall and Mohammad Hosseini-Nasab. On properties of functional principal components
analysis. Journal of the Royal Statistical Society: Series B, 68(1):109–126, 2006.
Peter Hall and Mohammad Hosseini-Nasab. Theory for high-order bounds in functional prin-
cipal components analysis. Mathematical Proceedings of the Cambridge Philosophical So-
ciety, 146(1):225–256, 2009. doi: 10.1017/S0305004108001850.
Hervé Cardot, Frédéric Ferraty, and Pascal Sarda. Functional linear model. Statistics &
Probability Letters, 45(1):11–22, 1999.
Ming Yuan and T. Tony Cai. A reproducing kernel Hilbert space approach to functional linear
regression. Annals of Statistics, 38(6):3412–3444, 2010.
Tony Cai and Ming Yuan. Minimax and adaptive prediction for functional linear regression.
Journal of the American Statistical Association, 107:1201–1216, 2012.
Nadine Hilgert, Andre Mas, and Nicolas Verzelen. Minimax adaptive tests for the functional
linear model. Annals of Statistics, 41(2):838–869, 2013.
Julian J. Faraway. Regression analysis for a functional response. Technometrics, 39(3):254–
261, 1997.
Gareth M. James. Generalized linear models with functional predictors. Journal of the Royal
Statistical Society, Series B, 64(3):411–432, 2002. doi: 10.1111/1467-9868.00342.
Zuofeng Shang and Guang Cheng. Nonparametric inference in generalized functional linear
models. Annals of Statistics, 43(4):1742–1773, 2015.
Heng Lian. Functional partial linear model. Journal of Nonparametric Statistics, 23(1):115–
128, 2011.
Dehan Kong, Kaijie Xue, Fang Yao, and Hao H. Zhang. Partially functional linear regression
in high dimensions. Biometrika, 103(1):147–159, 2016.
H. Zhu, F. Yao, and H. H. Zhang. Structured functional additive regression in reproducing
kernel Hilbert spaces. Journal of the Royal Statistical Society, Series B, 76:581–603, 2014.
Yingying Fan, Gareth M. James, and Peter Radchenko. Functional additive regression. Annals
of Statistics, 43(5):2296–2325, 2015.
Heng Lian. Shrinkage estimation and selection for multiple functional regression. Statistica
Sinica, 23(1):51–74, 2013.
Philip T Reiss and R Todd Ogden. Functional principal component regression and functional
partial least squares. Journal of the American Statistical Association, 102(479):984–996,
2007.
Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: statistical
and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616,
2015.
Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals
of Statistics, 38(2):894–942, 2010.
Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection
with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.
Sara A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of
Statistics, 36(2):614–645, 2008.
Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-
dimensional data. Annals of Statistics, 37(1):246–270, 2009.
Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso
and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
Tong Zhang. Some sharp performance bounds for least squares regression with ℓ1 regularization.
Annals of Statistics, 37(5A):2109–2144, 2009.
Jianqing Fan and Jinchi Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE
Transactions on Information Theory, 57(8):5467–5484, 2011.
Lan Wang, Yongdai Kim, and Runze Li. Calibrating nonconvex penalized regression in ultra-
high dimension. Annals of Statistics, 41(5):2505–2536, 2013.
Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of
convergence for sparse nonconvex learning problems. Annals of Statistics, 42(6):2164–2201,
2014.
Jianqing Fan, Lingzhou Xue, and Hui Zou. Strong oracle optimality of folded concave penal-
ized estimation. Annals of Statistics, 42(3):819–849, 2014.
Larry Wasserman and Kathryn Roeder. High-dimensional variable selection. Annals of Statis-
tics, 37(5A):2178–2201, 2009.
Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72(4):417–473, 2010. doi: 10.1111/j.1467-9868.2010.00740.x.
Rajen D. Shah and Richard J. Samworth. Variable selection with error control: another look at
stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy), 75(1):55–80, 2013. doi: 10.1111/j.1467-9868.2011.01034.x.
Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parame-
ters in high dimensional linear models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 76(1):217–242, 2014. doi: 10.1111/rssb.12026.
Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically
optimal confidence regions and tests for high-dimensional models. Annals of Statistics, 42
(3):1166–1202, 2014.
Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, and Robert Tibshirani. A significance
test for the lasso. Annals of Statistics, 42(2):413–468, 2014.
Jonathan Taylor, Richard Lockhart, Ryan J. Tibshirani, and Robert Tibshirani. Post-selection
adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889, 2014.
Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions for sparse
high dimensional models. Annals of Statistics, 2016.
Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models.
Journal of the Royal Statistical Society, Series B, 71(5):1009–1030, 2009.
Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28
(5):1356–1378, 2000.
D. R. Cox and D. V. Hinkley. Theoretical Statistics. CRC Press, 1979.
Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is
much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of
suprema of empirical processes. Annals of Statistics, 42(4):1564–1597, 2014.
Ethan X Fang, Yang Ning, and Han Liu. Testing and confidence intervals for high dimensional
proportional hazards model. arXiv preprint arXiv:1412.5158, 2014.