Shrinkage Estimation of Common Breaks in Panel Data Models
via Adaptive Group Fused Lasso∗
Junhui Qian
Antai College of Economics and Management, Shanghai Jiao Tong University
Liangjun Su
School of Economics, Singapore Management University
January 30, 2014
Abstract
In this paper we consider estimation and inference of common breaks in panel data models via
adaptive group fused lasso. We consider two approaches — penalized least squares (PLS) for first-
differenced models without endogenous regressors, and penalized GMM (PGMM) for first-differenced
models with endogeneity. We show that with probability tending to one both methods can correctly
determine the unknown number of breaks and estimate the common break dates consistently. We
obtain estimates of the regression coefficients via post-Lasso and establish their asymptotic distributions. We also propose and validate a data-driven method to determine the tuning parameter
used in the Lasso procedure. Monte Carlo simulations demonstrate that both the PLS and PGMM
estimation methods work well in finite samples. We apply our PGMM method to study the effect of
foreign direct investment (FDI) on economic growth using a panel of 88 countries and regions from
1973 to 2012 and find multiple breaks in the model.
JEL Classification: C13, C23, C33, C51
Key Words: Adaptive Lasso; Change point; Group Lasso; Fused Lasso; Panel data; Penalized least
squares; Penalized GMM; Structural change
∗The authors express their sincere appreciation to Chihwa Kao for discussions on the subject matter. Su gratefully
acknowledges the Singapore Ministry of Education for Academic Research Fund under grant number MOE2012-T2-2-021.
Address Correspondence to: Liangjun Su, School of Economics, Singapore Management University, 90 Stamford Road,
Singapore 178903; E-mail: [email protected], Phone: +65 6828 0386.
1 Introduction
Recently there has been a growing literature on the estimation and tests of common breaks in panel
data models in which there are N individual units and T time series observations for each individual.
Depending on whether T is allowed to pass to infinity, the model is called "short" for fixed T and "large" (or of large dimension) if T passes to infinity. Implicitly, one usually allows N to pass to infinity in panel data models.¹ Most of the literature falls into two categories depending on whether the parameters of
interest are allowed to be heterogeneous across individuals or not. The first category focuses on homogeneous panel data models and includes De Wachter and Tzavalis (2005), Baltagi et al. (2012), and De Wachter and Tzavalis (2012). De Wachter and Tzavalis (2005) compare the relative performance of two model and moment selection methods in detecting breaks in short panels; Baltagi et al. (2012) consider the estimation and identification of change points in large dimensional panel models with either stationary or nonstationary regressors and error terms; De Wachter and Tzavalis (2012) develop a testing procedure for
common breaks in short linear dynamic panel data models. The second category considers estimation and inference of common breaks in heterogeneous panel data models; see Bai (2010), Kim (2011, 2012), Hsu and Lin (2012), and Baltagi et al. (2013), among others. Bai (2010) establishes the asymptotic properties of the estimated break point in a location-scale heterogeneous panel data model with either fixed or large T; Kim (2011) extends Bai's (2010) method and develops an estimation procedure for a common deterministic time trend break in large heterogeneous panels with a multi-factor error structure; Kim (2012) continues the study by estimating the common break date and common factors jointly; Hsu and Lin (2012) extend Bai's (2010) theory to nonstationary panel data models where the error terms follow an I(1) process; Baltagi et al. (2013) study the estimation of large dimensional static heterogeneous panels with a common break by extending Pesaran's (2006) common correlated effects (CCE) estimation procedure. In addition, Chan et al. (2008) extend the testing procedure of Andrews (2003) from time series to heterogeneous panels where the breaks may occur at different time points across individuals; Liao and Wang (2012) study the estimation of individual-specific structural breaks that exhibit a common distribution in a location-scale panel data model; Yamazaki and Kurozumi (2013) develop an LM-type test for slope homogeneity along the time dimension in fixed-effects panel data models with fixed N and large T.²
A common feature of all of the above works is that a one-time break, common or not, is assumed in the
estimation procedure. Although the assumption of a single break greatly facilitates the estimation and
inference procedure, inferences based on it could be misleading if the underlying model has an unknown
number of multiple breaks. For this reason, a large literature on the estimation and inference of models
with multiple structural changes has been developed in the single or multiple time series framework; see,
¹Bai (1997a), Bai et al. (1998), and Qu and Perron (2007) extend the estimation of single time series models to multiple ones with simultaneous structural breaks where the number of equations is fixed.
²Pesaran and Yamagata (2008) and Su and Chen (2013) propose LM-type tests for slope homogeneity along the cross section dimension in large dimensional linear panel data models with additive fixed effects and interactive fixed effects, respectively.
e.g., Bai (1997a, 1997b), Bai and Perron (1998), Qu and Perron (2007), Su and White (2010), Kurozumi
(2012), and Qian and Su (2013). In view of the fact that the conventional avg- and exp-type test statistics for multiple structural changes require the evaluation of all permissible partitions of the sample, whose number can be prohibitively large, Qian and Su (2013) propose shrinkage estimation of regression models with multiple
structural changes by extending the fused Lasso of Tibshirani et al. (2005) to the time series regression
framework.
In this paper we propose a shrinkage-based methodology for estimating panel data models with an
unknown number of structural changes. The new methodology is most suitable for the view that the regression coefficients in a panel data model may be time-varying but at the same time exhibit a certain sparseness in abrupt changes or breaks. This view seems pertinent in many applied studies using panel data that have a long time span measured in decades. Over such a long time span, shocks to technologies, preferences, policies, and so on may change the statistical relationship that applied economists seek to discover; but the shocks tend to be small over a relatively short time interval, so that they do not alter the statistical relationship in the short run. In this case, one has to allow the parameters in the model to change over time in an unknown way while recognizing that the parameters do not alter from one time period to the next in every period. Multiple structural breaks may occur during the whole time span, but the number of breaks is generally small in comparison with the total number of time periods in the data, resulting in the sparseness of the breaks.
In terms of econometrics methodology, this paper extends the Lasso-type shrinkage approach in Qian
and Su (2013) to panel data settings. To the best of our knowledge, this is the first work in the literature that deals explicitly with panel data models with possibly multiple structural changes.³ To stay focused, we consider homogeneous linear panel data models with an unknown number of common breaks, and we do not allow cross section dependence. The extension to heterogeneous panel data models and to panel data
models with cross section dependence will be discussed at the end of Section 7. For the advantages of using panel data to study common breaks, we refer the reader directly to Bai (2010) and De Wachter and Tzavalis (2012). Despite the fact that Lasso-type shrinkage estimation has a long history and
wide applications in statistics (see, e.g., Tibshirani (1996), Knight and Fu (2000), Fan and Li (2001)),
the application of Lasso-type shrinkage techniques in econometrics has a relatively short history, but the number of applications has been increasing rapidly in the last few years. For example,
Caner (2009) and Fan and Liao (2011) consider covariate selection in GMM estimation; Belloni et al.
(2012) and García (2011) consider selection of instruments in the GMM framework; Liao (2013) provides
a shrinkage GMM method for moment selection and Cheng and Liao (2013) consider the selection of valid
and relevant moments via penalized GMM; Liao and Phillips (2014) apply adaptive shrinkage techniques
³Bai (2010, Section 6) discusses the case of multiple breaks. As he remarks, if the number of breaks is given, the one-
at-a-time approach of Bai (1997b) can be used to estimate the break dates, and if the number of breaks is unknown, a test
for existence of break point can be applied to each subsample before estimating a break point. Alternatively, one can use
information criteria to determine the number of breaks in the latter case, but further investigation is called for.
to cointegrated systems; Kock (2013) considers Bridge estimators of static linear panel data models
with random or fixed effects; Caner and Knight (2013) apply Bridge estimators to differentiate a unit
root from a stationary alternative; Caner and Han (2013) propose a Bridge estimator for pure factor models and show its selection consistency; Lu and Su (2013) apply the adaptive group Lasso to choose
both regressors and the number of factors in panel data models with factor structures; Su et al. (2013)
propose a procedure called the classifier-Lasso to estimate a latent panel structure; Cheng et al. (2014)
provide an adaptive group Lasso estimator for pure factor structures with a one-time structural break.
This paper adds to the literature by applying the shrinkage idea to panel data models with an unknown
number of breaks.
We propose two approaches, penalized least squares (PLS) and penalized generalized method of moments
(PGMM), for the estimation of the panel data model with an unknown number of breaks. We apply first
differencing to remove the fixed effects in the equation and focus on the first-differenced equation. When
there is no endogeneity issue in the first-differenced equation, we propose to apply the PLS to estimate
the unknown number of break points and the regime-specific regression coefficients jointly, where the
penalty term is imposed through the adaptive group fused Lasso (AGFL) component. In the presence
of endogeneity in the first-differenced equation, which may arise from endogenous regressors or lagged
dependent variables in the original fixed-effects equation, we propose to apply the PGMM to estimate
the unknown number of break points and the regime-specific regression coefficients jointly where, again, the penalty term is imposed through the AGFL component. Unlike Qian and Su (2013), who can only establish that the group fused Lasso cannot underestimate the number of breaks in a time series regression and that all the break fractions (but not the break dates) can be consistently estimated as in Bai and Perron (1998), we show that with probability approaching one (w.p.a.1) both of our PLS and PGMM methods can correctly determine the unknown number of breaks and estimate the common break dates consistently. We obtain estimates of the regression coefficients via post-Lasso and establish their asymptotic distributions. We also propose and validate a data-driven method to determine the tuning parameter used in the Lasso procedure.
Both the PLS and PGMM problems can be solved numerically using a fast block-coordinate descent algorithm.
Monte Carlo simulations show that our methods perform well in finite samples. First, the probability
of correctly estimating the number of breaks (0, 1, and 2), as N increases from 50 to 500, converges
to 100% quickly. Even when N = 50 and T = 6, our methods are reliable in detecting the number of
breaks in most cases. Second, conditional on the correct estimation of the number of breaks, our methods
accurately estimate the break dates in finite samples.
As an empirical illustration, we employ our PGMM method to evaluate the effect of foreign direct
investment (FDI) inflow on economic growth. We estimate a dynamic panel data model with possibly
multiple breaks using the PGMM approach. We find that, with a tuning parameter selected via minimiz-
ing a BIC-type information criterion, there are four breaks (five regimes) in the span of seven five-year
periods. In each regime, the post-Lasso estimation finds a significant positive effect of FDI inflow on GDP
growth. In contrast, if we estimate a usual dynamic panel data model with time-invariant parameters, we
would find this effect to be negative, although statistically insignificant. This empirical example illustrates
the perils of employing panel data models with restrictions on the number of breaks. Our contribution makes such restrictions unnecessary.
The rest of the paper is organized as follows. Section 2 introduces our fixed-effects panel data model and the PLS and PGMM estimation of the model, depending on whether endogeneity is present in the first-differenced equation. Sections 3 and 4 analyze the asymptotic properties of the PLS and PGMM estimators,
respectively. Section 5 reports the Monte Carlo simulation results. Section 6 provides an empirical
application and Section 7 concludes.
NOTATION. Throughout the paper we adopt the following notation. For an m × n real matrix A, we denote its transpose as A′, its Frobenius norm as ‖A‖, and its spectral norm as ‖A‖_sp. When A is symmetric, we use µ_max(A) and µ_min(A) to denote its largest and smallest eigenvalues, respectively. I_p denotes a p × p identity matrix and 0_{a×b} an a × b matrix of zeros. We use "p.d." to abbreviate "positive definite". The operator →_P denotes convergence in probability, →_D convergence in distribution, and plim probability limit. Let ∆ and ∆² denote the difference operators of order 1 and 2, respectively. In addition, we use TriD(·,·)_T to denote a symmetric block tridiagonal matrix:

TriD(A, D)_T ≡
\begin{pmatrix}
D_1 & -A_2' & & & \\
-A_2 & D_2 & -A_3' & & \\
 & -A_3 & D_3 & \ddots & \\
 & & \ddots & \ddots & -A_T' \\
 & & & -A_T & D_T
\end{pmatrix}

where the D_t's are symmetric, the A_t's are square matrices, and empty blocks denote matrices of zeros. By Molinari (2008), the determinant of TriD(A, D)_T is given by det(TriD(A, D)_T) = ∏_{t=1}^T det(Λ_t), where Λ_1 = D_1 and Λ_t = D_t − A_t Λ_{t−1}^{−1} A_t′ for t = 2, ..., T. By Meurant (1992) and Ran and Huang (2006), one can also calculate the inverse of TriD(A, D)_T recursively.
2 Shrinkage estimation of linear panel data models with multiple breaks
In this section we consider a linear panel data model with an unknown number of breaks, which we
estimate via the adaptive group fused Lasso.
2.1 The model
Consider the following linear panel data model
y_it = µ_i + β_t′x_it + u_it,  i = 1, ..., N,  t = 1, ..., T (T ≥ 2),  (2.1)

where x_it is a p × 1 vector of regressors, u_it is the error term with zero mean, β_t is a p × 1 vector of unknown coefficients, and µ_i is an individual fixed effect that may be correlated with x_it. We assume that N passes to infinity and T can either be fixed or pass to infinity.
Like Qian and Su (2013), we assume that β_1, ..., β_T exhibit a certain sparse nature such that the total number of distinct vectors in the set is given by m + 1, which is unknown but typically much smaller than T. More specifically, we assume that

β_t = α_j  for t = T_{j−1}, ..., T_j − 1 and j = 1, ..., m + 1,

where we adopt the convention that T_0 = 1 and T_{m+1} = T + 1. The indices T_1, ..., T_m indicate the m unobserved break points/dates, and m + 1 denotes the total number of regimes. We are interested in estimating the unknown number m of breaks, the break dates, and the regression coefficients. Let α_m = (α_1′, ..., α_{m+1}′)′ and T_m = {T_1, ..., T_m}. Throughout, we denote the true value of a parameter with a superscript 0. In particular, we use m0, α0_{m0} = (α0_1′, ..., α0_{m0+1}′)′, and T0_{m0} = {T0_1, ..., T0_{m0}} to denote the true number of breaks, the vector of true regression coefficients, and the set of true break dates, respectively. We assume that m0 is a fixed finite integer and T0_1 ≥ 2 but allow T0_{m0} = T. When T0_{m0} = T, the last break occurs at the end of the sample (cf. Andrews (2003)) and the (m0 + 1)th regime has only one observation for each individual time series.
To eliminate the effect of µ_i in the estimation procedure, we consider the first-differenced equation

∆y_it = β_t′x_it − β_{t−1}′x_{i,t−1} + ∆u_it
      = β_t′∆x_it + (β_t − β_{t−1})′x_{i,t−1} + ∆u_it,

where, e.g., ∆y_it = y_it − y_{i,t−1} for i = 1, ..., N and t = 2, ..., T. We consider two cases:

(a) E[∆u_it x_it] = 0 and E[∆u_it x_{i,t−1}] = 0;

(b) E[∆u_it x_it] ≠ 0 or E[∆u_it x_{i,t−1}] ≠ 0.

Case (a) occurs when x_it is strictly exogenous in the sense that E(u_it|x_i) = 0 a.s., where x_i = (x_i1, ..., x_iT)′. But strict exogeneity is not necessary for case (a); a sufficient condition for (a) to hold is E(∆u_it|x_it, x_{i,t−1}) = 0. Case (b) occurs when x_it contains either lagged dependent variables (e.g., y_{i,t−1}) or endogenous regressors that are correlated with u_it. In case (b) we assume the existence of a q × 1 vector of instruments z_it for (x_it, x_{i,t−1}), where q ≥ p.

Since neither m nor the break dates are known and m is typically much smaller than T, we are motivated to consider the estimation of the β_t's and T_m via a variant of the fused Lasso à la Tibshirani et al. (2005). We propose two approaches: PLS estimation for case (a) and PGMM estimation for case (b).
2.2 Penalized least squares (PLS) estimation
In case (a), we propose to estimate β = (β_1′, ..., β_T′)′ by minimizing the following PLS objective function

V_{1NT,λ1}(β) = (1/N) ∑_{i=1}^{N} ∑_{t=2}^{T} (∆y_it − β_t′x_it + β_{t−1}′x_{i,t−1})² + λ1 ∑_{t=2}^{T} ẇ_t ‖β_t − β_{t−1}‖,  (2.2)

where λ1 = λ1(N, T) ≥ 0 is a tuning parameter and ẇ_t is a data-driven weight defined by

ẇ_t = ‖β̇_t − β̇_{t−1}‖^{−κ1},  t = 2, ..., T,  (2.3)

where the β̇_t are preliminary estimates of β_t and κ1 is a user-specified positive constant that usually takes the value 2 in the literature. Since the objective function in (2.2) is convex in β, it is easy to obtain the solution β̂ = (β̂_1′, ..., β̂_T′)′, where we suppress the dependence of β̂_t = β̂_t(λ1) on λ1 as long as no confusion arises. We will propose a data-driven method to choose λ1 in Section 3.4.
For a given solution {β̂_t} to (2.2), the set of estimated break dates is given by T̂_m̂ = {T̂_1, ..., T̂_m̂}, where 1 < T̂_1 < ... < T̂_m̂ ≤ T, such that ‖β̂_t − β̂_{t−1}‖ ≠ 0 at t = T̂_j for some j ∈ {1, ..., m̂}. T̂_m̂ divides the time interval [1, T] into m̂ + 1 regimes such that the parameter estimates remain constant within each regime. Let T̂_0 = 1 and T̂_{m̂+1} = T + 1. Define α̂_j = α̂_j(T̂_m̂) = β̂_{T̂_{j−1}} as the estimate of α_j for j = 1, ..., m̂ + 1. Frequently we suppress the dependence of α̂_j on T̂_m̂ (and λ1) unless necessary. Let α̂_m̂ = α̂_m̂(T̂_m̂) = (α̂_1(T̂_m̂)′, ..., α̂_{m̂+1}(T̂_m̂)′)′.
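The break-date bookkeeping is mechanical; a small sketch (ours, with a numerical tolerance standing in for the exact-zero condition) is:

```python
import numpy as np

def break_dates(beta_hat, tol=1e-8):
    """Return estimated break dates {t : ||beta_t - beta_{t-1}|| != 0}, using the
    paper's 1-indexed time t = 2..T (row 0 of beta_hat is period 1); tol guards
    against floating-point noise in place of the exact-zero condition."""
    T = beta_hat.shape[0]
    return [t + 1 for t in range(1, T) if np.linalg.norm(beta_hat[t] - beta_hat[t - 1]) > tol]

# Example: T = 6 periods, p = 1, one break between periods 3 and 4.
beta_hat = np.array([[1.0], [1.0], [1.0], [2.0], [2.0], [2.0]])
print(break_dates(beta_hat))  # [4]
```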
Apparently, the objective function in (2.2) is closely related to the literature on the adaptive Lasso (Zou (2006)), the group Lasso (Yuan and Lin (2006)), the fused Lasso (Tibshirani et al. (2005) and Rinaldo (2009)), and the group fused Lasso (Qian and Su (2013)). Like Qian and Su (2013), we use the Frobenius norm ‖·‖ for the vector difference β_t − β_{t−1}, which generalizes the fused Lasso to the group fused Lasso. Unlike Qian and Su (2013), who do not have any weights to use in their time series regression, our panel regression allows us to apply the adaptive weights {ẇ_t}, yielding an adaptive Lasso procedure. For this reason, we call our estimation procedure an adaptive group fused Lasso (AGFL) procedure.
To obtain {ẇ_t}, we obtain the preliminary estimate β̇ = (β̇_1′, ..., β̇_T′)′ by minimizing the first term in the definition of V_{1NT,λ1}(β) in (2.2). Let φ_{ab,ts} = (1/N) ∑_{i=1}^N a_it b_is′ and φ_{ab,t} = φ_{ab,tt} for t, s = 1, ..., T and a, b = x, ∆x, ∆y, ∆²y, ∆u, or ∆²u. For example, φ_{x∆²y,t,t+1} = (1/N) ∑_{i=1}^N x_it ∆²y_{i,t+1} for t = 2, ..., T − 1. We can readily demonstrate that β̇ = Q̇_{NT}^{−1} Ṙ^y_{NT}, where

Q̇_{NT} = TriD(Q†, Q)_T,  (2.4)

Ṙ^a_{NT} = (−φ′_{x∆a,2}, −φ′_{x∆²a,2,3}, −φ′_{x∆²a,3,4}, ..., −φ′_{x∆²a,T−1,T}, φ′_{x∆a,T})′,  a = y or u,  (2.5)

Q_t = φ_{xx,t} for t = 1 and T, Q_t = 2φ_{xx,t} for 2 ≤ t ≤ T − 1, and Q†_t = φ_{xx,t,t−1} for t = 2, ..., T.
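The preliminary estimator is a single least-squares solve; the sketch below (our own illustration, with the array conventions used above) accumulates the normal equations of the first term of (2.2), whose coefficient matrix has exactly the block tridiagonal structure TriD(Q†, Q)_T of (2.4):

```python
import numpy as np

def preliminary_beta(dy, x):
    """Solve the unpenalized first term of (2.2) for beta_dot = (beta_1', ..., beta_T')'.
    Builds the Tp x Tp normal-equation system (equivalent to Q_NT^{-1} R_NT in
    (2.4)-(2.5)) and solves it by least squares.
    dy: (N, T) first differences (column 0 unused); x: (N, T, p) regressors."""
    N, T, p = x.shape
    Q = np.zeros((T * p, T * p))
    R = np.zeros(T * p)
    for t in range(1, T):  # normal equations of sum_i (dy_t - b_t'x_t + b_{t-1}'x_{t-1})^2
        Xt, Xs = x[:, t], x[:, t - 1]
        Q[t*p:(t+1)*p, t*p:(t+1)*p] += Xt.T @ Xt / N
        Q[(t-1)*p:t*p, (t-1)*p:t*p] += Xs.T @ Xs / N
        Q[t*p:(t+1)*p, (t-1)*p:t*p] -= Xt.T @ Xs / N
        Q[(t-1)*p:t*p, t*p:(t+1)*p] -= Xs.T @ Xt / N
        R[t*p:(t+1)*p] += Xt.T @ dy[:, t] / N
        R[(t-1)*p:t*p] -= Xs.T @ dy[:, t] / N
        # Q is block tridiagonal, as in TriD(Q_dagger, Q)_T
    return np.linalg.lstsq(Q, R, rcond=None)[0].reshape(T, p)
```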
2.2.1 Post-Lasso estimation
For any α_m = (α_1′, ..., α_{m+1}′)′ and T_m = {T_1, ..., T_m} with 1 < T_1 < ··· < T_m ≤ T, we define⁴

Q_{1NT}(α_m; T_m) = (1/N) ∑_{j=1}^{m+1} ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N (∆y_it − α_j′∆x_it)² + (1/N) ∑_{j=1}^m ∑_{i=1}^N (∆y_{iT_j} − α_{j+1}′x_{iT_j} + α_j′x_{i,T_j−1})²,  (2.6)

where ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N (∆y_it − α_j′∆x_it)² corresponds to the "sum of squared errors" for observations in the jth artificial regime, with time series observations indexed by integers in the interval [T_{j−1}, T_j − 1], and ∑_{i=1}^N (∆y_{iT_j} − α_{j+1}′x_{iT_j} + α_j′x_{i,T_j−1})² corresponds to the "sum of squared errors" for observations when one moves from the jth regime to the (j+1)th regime. The second term in (2.6) is important and helps to improve the asymptotic efficiency when T is fixed. It can be omitted if min_{0≤j≤m} |T_{j+1} − T_j| → ∞ as N → ∞ and only asymptotic efficiency is of concern, but we keep it to improve the finite sample performance of the post-Lasso estimate in this case. One can choose α_m to minimize the objective function in (2.6). We denote the solution as α̂^p_m(T_m) = (α̂^p_1(T_m)′, ..., α̂^p_{m+1}(T_m)′)′. Setting T_m to T̂_m̂, the set of estimated break dates from the AGFL procedure, we obtain the post-Lasso estimator

α̂^p_m̂ = α̂^p_m̂(T̂_m̂) = Φ̂^{−1}_{NT} Ψ̂^y_{NT},

where Φ̂_{NT} and Ψ̂^y_{NT} are p(m̂+1) × p(m̂+1) and p(m̂+1) × 1 matrices that are respectively defined in (A.4) and (A.5) in the appendix. We shall study the limiting distribution of α̂^p_m̂ below.
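Minimizing (2.6) is itself a least-squares problem once the break dates are fixed; the sketch below (our own illustration, with a 0-based array convention) stacks the within-regime and junction residuals and solves for the α_j jointly:

```python
import numpy as np

def post_lasso(dy, x, breaks):
    """Minimize Q_{1NT} in (2.6) by stacked least squares.
    breaks: estimated break dates in the paper's 1-indexed time, e.g. [4] means
    regimes {1..3} and {4..T}. Returns an (m+1, p) array of alpha_j estimates.
    dy: (N, T) first differences (column 0 unused); x: (N, T, p) regressors."""
    N, T, p = x.shape
    Tj = [1] + list(breaks) + [T + 1]   # T_0 = 1, ..., T_{m+1} = T + 1
    m1 = len(Tj) - 1                    # number of regimes, m + 1
    rows_X, rows_y = [], []
    for j in range(m1):
        # within-regime terms: paper time t runs over T_{j-1}+1 .. T_j - 1
        for t in range(Tj[j] + 1, Tj[j + 1]):
            Z = np.zeros((N, m1 * p))
            Z[:, j*p:(j+1)*p] = x[:, t - 1] - x[:, t - 2]   # Delta x_it
            rows_X.append(Z); rows_y.append(dy[:, t - 1])
        # junction term at t = T_j, linking regimes j and j+1
        if j + 1 < m1:
            tb = Tj[j + 1]
            Z = np.zeros((N, m1 * p))
            Z[:, (j+1)*p:(j+2)*p] = x[:, tb - 1]            # alpha_{j+1} loads x_{iT_j}
            Z[:, j*p:(j+1)*p] = -x[:, tb - 2]               # alpha_j loads -x_{i,T_j-1}
            rows_X.append(Z); rows_y.append(dy[:, tb - 1])
    X = np.vstack(rows_X); y = np.concatenate(rows_y)
    return np.linalg.lstsq(X, y, rcond=None)[0].reshape(m1, p)
```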
2.3 Penalized GMM (PGMM) estimation
In case (b), we propose to estimate β by minimizing the following PGMM objective function

V_{2NT,λ2}(β) = ∑_{t=2}^{T} [(1/N) ∑_{i=1}^N ρ_it(β_t, β_{t−1})]′ W_t [(1/N) ∑_{i=1}^N ρ_it(β_t, β_{t−1})] + λ2 ∑_{t=2}^{T} ẇ_t ‖β_t − β_{t−1}‖,  (2.7)

where ρ_it(β_t, β_{t−1}) = z_it(∆y_it − β_t′x_it + β_{t−1}′x_{i,t−1}), λ2 = λ2(N, T) ≥ 0 is a tuning parameter, W_t = W_{tNT} is a q × q symmetric positive definite weight matrix for t = 2, ..., T, and ẇ_t is a data-driven weight defined by

ẇ_t = ‖β̇_t − β̇_{t−1}‖^{−κ2},  t = 2, ..., T,  (2.8)

where the β̇_t are preliminary estimates of β_t and κ2 is a user-specified positive constant that usually takes the value 2 in the literature. Clearly, the first term in the definition of V_{2NT,λ2}(β) in (2.7) is different from the usual GMM objective function in the panel setting with time-invariant parameters, where only one weight matrix (W, say) is needed and the double summation ∑_{t=2}^T ∑_{i=1}^N occurs twice, once before the single weight matrix and once after it. It is also different from the GMM-type objective function in Andrews (1993), who considers the test of a single structural change in a time series regression. Since the objective function in (2.7) is convex in β, it is easy to obtain the solution β̂ = (β̂_1′, ..., β̂_T′)′, where we frequently suppress the dependence of β̂_t = β̂_t(λ2) on λ2. We will propose a data-driven method to choose λ2 in Section 4.4.

⁴By default, the summation ∑_{t=a}^{b} is zero if b < a.
For a given solution {β̂_t} to (2.7), we can find the set of estimated break dates T̂_m̂ = {T̂_1, ..., T̂_m̂} as in Section 2.2. As before, T̂_m̂ divides [1, T] into m̂ + 1 regimes such that the parameter estimates remain constant within each regime and ‖β̂_t − β̂_{t−1}‖ ≠ 0 whenever t = T̂_j for some j = 1, ..., m̂. Let T̂_0 = 1 and T̂_{m̂+1} = T + 1. Define α̂_j = α̂_j(T̂_m̂) = β̂_{T̂_{j−1}} as the estimate of α_j for j = 1, ..., m̂ + 1. Let α̂_m̂ = α̂_m̂(T̂_m̂) = (α̂_1(T̂_m̂)′, ..., α̂_{m̂+1}(T̂_m̂)′)′.
To obtain the adaptive weights {ẇ_t}, we obtain the preliminary estimate β̇ = (β̇_1′, ..., β̇_T′)′ by minimizing the first term in the definition of V_{2NT,λ2}(β) in (2.7). Let Q_{ab,t,s} = φ′_{ab,t,s} W_t φ_{ab,t,s} and Q_{ab,t} = Q_{ab,t,t} for t, s = 1, 2, ..., T. Let Q†_{zx,t,t−1} = φ′_{zx,t} W_t φ_{zx,t,t−1} for t = 2, ..., T. It is easy to show that β̇ = Q̇^{−1}_{NT} Ṙ^y_{NT}, where

Q̇_{NT} = TriD(Q†, Q)_T,  (2.9)

Ṙ^a_{NT} = (−(φ′_{zx,2}W_2φ_{z∆a,2})′, (φ′_{zx,2}W_2φ_{z∆a,2} − φ′_{zx,3,2}W_3φ_{z∆a,3})′, ...,
  (φ′_{zx,T−1}W_{T−1}φ_{z∆a,T−1} − φ′_{zx,T,T−1}W_Tφ_{z∆a,T})′, (φ′_{zx,T}W_Tφ_{z∆a,T})′)′,  a = y or u,  (2.10)

Q_1 = Q_{zx,1,2}, Q_t = Q_{zx,t} + Q_{zx,t+1,t} for t = 2, ..., T − 1, Q_T = Q_{zx,T}, and Q†_t = Q†_{zx,t,t−1} for t = 2, ..., T.
2.3.1 Post-Lasso estimation
For any α_m = (α_1′, ..., α_{m+1}′)′ and T_m = {T_1, ..., T_m} with 1 < T_1 < ··· < T_m ≤ T, we define⁵

Q_{2NT}(α_m; T_m) = ∑_{j=1}^{m+1} [(1/N) ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N ρ_it(α_j)]′ W^p_j [(1/N) ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N ρ_it(α_j)]
  + ∑_{j=1}^{m} [(1/N) ∑_{i=1}^N ρ_{1iT_j}(α_{j+1}, α_j)]′ W_{T_j} [(1/N) ∑_{i=1}^N ρ_{1iT_j}(α_{j+1}, α_j)],  (2.11)

where ρ_it(α_j) = z_it(∆y_it − α_j′∆x_it), ρ_{1iT_j}(α_{j+1}, α_j) = z_{iT_j}(∆y_{iT_j} − α_{j+1}′x_{iT_j} + α_j′x_{i,T_j−1}), and W^p_j is a regime-specific q × q symmetric weight matrix that is positive definite in large samples. As in the case of PLS estimation, the second term in (2.11) is important when T is fixed and can be omitted in the case where min_{0≤j≤m} |T_{j+1} − T_j| → ∞ as N → ∞. Let α̂^p_m(T_m) = (α̂^p_1(T_m)′, ..., α̂^p_{m+1}(T_m)′)′ denote the minimizer of Q_{2NT} defined in (2.11). Setting T_m to T̂_m̂, the set of estimated break dates, we obtain the post-Lasso estimator

α̂^p_m̂ = α̂^p_m̂(T̂_m̂) = Υ̂^{−1}_{NT} Ξ̂^y_{NT},

⁵By default, the summation ∑_{t=a}^{b} is zero if b < a.
where Υ̂_{NT} and Ξ̂^y_{NT} are p(m̂+1) × p(m̂+1) and p(m̂+1) × 1 matrices that are defined in (B.3) in the appendix. We shall study the limiting distribution of α̂^p_m̂ below.

To obtain the PGMM estimate and the associated post-Lasso estimate, one needs to choose the weight matrices W_t (t = 2, ..., T) and W^p_j (j = 1, ..., m̂ + 1). In the simulation and application below, we adopt a two-step strategy for determining both sets of weights. For W_t, we first obtain the estimate β̂_t by choosing the q × q identity matrix I_q as the weight matrix. In the second step, we specify W_t as the inverse of the estimated covariance matrix of ρ_it(β̂_t, β̂_{t−1}) and obtain an updated estimate of β̂_t. A similar procedure is adopted for determining the weights in post-Lasso estimation.
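A sketch of the second-step weight construction (our own illustration; `rho` stacks the sample moments ρ_it evaluated at the first-step, identity-weighted estimates):

```python
import numpy as np

def second_step_weight(rho):
    """Given rho: (N, q) array of moments rho_it(beta_hat_t, beta_hat_{t-1}) from
    the identity-weighted first step, return W_t as the inverse of the estimated
    covariance matrix of the moments. The uncentered second-moment matrix is used
    (the moments have mean zero under correct specification), and a pseudo-inverse
    guards against near-singularity in small samples."""
    N = rho.shape[0]
    S = rho.T @ rho / N
    return np.linalg.pinv(S)
```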
3 Asymptotic properties of the PLS estimators
In this section we address the asymptotic properties of the PLS estimators.
3.1 Basic assumptions
Let I0_j = T0_j − T0_{j−1} for j = 1, ..., m0 + 1. Define

I_min = min_{1≤j≤m0+1} I0_j,  J_min = min_{1≤j≤m0} ‖α0_{j+1} − α0_j‖,  and  J_max = max_{1≤j≤m0} ‖α0_{j+1} − α0_j‖.

Apparently, I_min denotes the minimum interval length among the m0 + 1 regimes, and J_min and J_max denote the minimum and maximum jump sizes, respectively. In the case of fixed T, I_min does not pass to infinity as N → ∞. If we allow T → ∞, then I_min can either pass to infinity or stay fixed unless otherwise stated. We will maintain the assumption that J_max is always a fixed constant, but J_min can be either fixed or shrinking to zero as either N → ∞ or (N, T) → ∞, where (N, T) → ∞ denotes that both N and T pass to infinity simultaneously.
Let Φ_{ab,l} = (1/N) ∑_{t=T0_{l−1}+1}^{T0_l−1} ∑_{i=1}^N a_it b_it′ for l = 1, ..., m0 + 1 and a, b = ∆x, x, ∆y, and ∆u. Define the p(m0+1) × p(m0+1) matrix Φ_{NT} and the p(m0+1) × 1 vector Ψ^a_{NT}, respectively, by

Φ_{NT} = TriD(Φ†, Φ)_{m0+1},  (3.1)

Ψ^a_{NT} = (Φ′_{∆x∆a,1} − φ′_{x∆a,T0_1−1,T0_1}, Φ′_{∆x∆a,2} − φ′_{x∆a,T0_2−1,T0_2} + φ′_{x∆a,T0_1}, ...,
  Φ′_{∆x∆a,m0} − φ′_{x∆a,T0_{m0}−1,T0_{m0}} + φ′_{x∆a,T0_{m0−1}}, Φ′_{∆x∆a,m0+1} + φ′_{x∆a,T0_{m0}})′,  a = y or u,  (3.2)

where Φ_1 = Φ_{∆x∆x,1} + φ_{xx,T0_1−1}, Φ_l = Φ_{∆x∆x,l} + φ_{xx,T0_l−1} + φ_{xx,T0_{l−1}} for l = 2, ..., m0, Φ_{m0+1} = Φ_{∆x∆x,m0+1} + φ_{xx,T0_{m0}}, and Φ†_{l+1} = φ_{xx,T0_l,T0_l−1} for l = 1, ..., m0. Let D_{m0+1} = diag(√I0_1, ..., √I0_{m0+1}) ⊗ I_p. To study the asymptotic properties of the PLS estimators, we make the following assumptions.
Assumption A.1. (i) Let u_i = (u_i1, ..., u_iT)′. {x_i, u_i} are independently distributed over i.
(ii) E(x_it ∆u_it) = 0 and E(x_{i,t−1} ∆u_it) = 0 for i = 1, ..., N and t = 2, ..., T. max_{1≤i≤N} max_{1≤t≤T} E‖ς_it‖⁴ < C < ∞ for ς = x and u.
(iii) There exists a matrix Q_0 > 0 such that ‖Q̇_{NT} − Q_0‖_sp = o_P(1). There exist two constants c_{Q0} and c̄_{Q0} such that 0 < c_{Q0} ≤ λ_min(Q_0) ≤ λ_max(Q_0) ≤ c̄_{Q0} < ∞.

Assumption A.2. (i) J_max = O(1) and N^{1/2} J_min → c_J ∈ (0, ∞] as N → ∞ or (N, T) → ∞.
(ii) N^{1/2} λ1 J_min^{−κ1} → c ∈ [0, ∞) as N → ∞ or (N, T) → ∞.
(iii) N^{(κ1+1)/2} λ1 → ∞ as N → ∞ or (N, T) → ∞.

Assumption A.3. (i) D^{−1}_{m0+1} Φ_{NT} D^{−1}_{m0+1} →_P Φ_0 > 0.
(ii) √N D^{−1}_{m0+1} Ψ^u_{NT} →_D N(0, Ω_0).
Assumption A.1(i) requires that {x_i, u_i} be independently distributed across i. It may be relaxed to allow for weak forms of cross-sectional dependence at the cost of very lengthy arguments. A.1(ii) specifies moment conditions on {x_it, u_it}. If E(u_it|x_{i,t+1}, x_it) = 0 a.s. for each i and t, then the first part of A.1(ii) is satisfied. In conjunction with A.1(i), A.1(ii) implies that each block element of √N Ṙ^u_{NT} is O_P(1) and T^{−1}N‖Ṙ^u_{NT}‖² = O_P(1) by the Chebyshev inequality. A.1(iii) requires that the limiting matrix Q_0 of the Tp × Tp matrix Q̇_{NT} be well behaved. Let φ0_{xx,t,s} = lim_{N→∞} φ_{xx,t,s} and φ0_{xx,t} = φ0_{xx,t,t}. Let ∆0_1 = φ0_{xx,1}, ∆0_t = 2φ0_{xx,t} − φ0_{xx,t,t−1}(∆0_{t−1})^{−1}φ0′_{xx,t,t−1} for t = 2, ..., T − 1, and ∆0_T = φ0_{xx,T} − φ0_{xx,T,T−1}(∆0_{T−1})^{−1}φ0′_{xx,T,T−1}. Then Q_0 is p.d. if and only if the matrices ∆0_1, ..., ∆0_T are all p.d. Combining A.1(i)-(iii), we prove in Lemma A.1 that √N(β̇_t − β0_t) = O_P(1) for each t = 1, ..., T. Assumption A.2 mainly specifies conditions on J_min, λ1, and N. Note that we allow the minimum break size J_min to shrink to zero as N → ∞, but it cannot shrink to zero faster than N^{−1/2}. In the special case where J_min is bounded away from zero, A.2 can be simplified to

Assumption A.2*. N^{1/2}λ1 → c ∈ [0, ∞) and N^{(κ1+1)/2}λ1 → ∞ as N → ∞ or (N, T) → ∞.

Assumption A.3 specifies conditions that ensure the asymptotic normality of the post-Lasso estimators.
3.2 Consistency
The following theorem establishes the consistency of β̂_t.

Theorem 3.1 Suppose that Assumption A.1 holds. Then (i) T^{−1}‖β̂ − β0‖² = O_P(N^{−1}), and (ii) β̂_t − β0_t = O_P(N^{−1/2}) for each t = 1, ..., T.

Theorems 3.1(i) and (ii) establish the mean square and pointwise convergence rates of β̂_t, respectively. The two results are equivalent in the case of fixed T. When T is allowed to pass to infinity as N → ∞, the proof of Theorem 3.1(ii) demands some extra effort. In particular, we need a close examination of the factorization and inversion properties of symmetric block tridiagonal matrices. See the proof of Theorem 3.1(ii) in Appendix A.
Let T^{0c}_{m0} = {2, ..., T}\T0_{m0}. Let θ0_1 = β0_1 and θ0_t = β0_t − β0_{t−1} for t = 2, ..., T. Let θ̂_1 = β̂_1 and θ̂_t = β̂_t − β̂_{t−1} for t = 2, ..., T. The following theorem establishes selection consistency.

Theorem 3.2 Suppose that Assumptions A.1-A.2 hold. Then P(‖θ̂_t‖ = 0 for all t ∈ T^{0c}_{m0}) → 1 as N → ∞.

Theorem 3.2 says that w.p.a.1 all the zero vectors in {θ0_t, 2 ≤ t ≤ T} must be estimated as exactly zero by the PLS method, so that the number of estimated breaks m̂ cannot be larger than m0 when N is sufficiently large. On the other hand, by Theorem 3.1(ii), we know that the estimates of the nonzero vectors in {θ0_t, 2 ≤ t ≤ T} must be consistent, noting that β̂_t − β̂_{t−1} consistently estimates θ0_t for t ≥ 2. Put together, Theorems 3.1 and 3.2 imply that the AGFL has the ability to identify the true regression model with the correct number of breaks consistently when the minimum break size J_min does not shrink to zero too fast.

Corollary 3.3 Suppose that Assumptions A.1-A.2 hold with c_J = ∞ in Assumption A.2(i). Then (i) lim_{N→∞} P(m̂ = m0) = 1, and (ii) lim_{N→∞} P(T̂_1 = T0_1, ..., T̂_{m0} = T0_{m0} | m̂ = m0) = 1.
The above corollary implies that, as long as J_min remains fixed or shrinks to zero at a rate slower than N^{−1/2} as N → ∞, we can estimate the number of structural changes and all the break dates consistently, regardless of whether T is fixed or passes to infinity. In contrast, Qian and Su (2013, Theorem 3.3) only establish that the group fused Lasso procedure cannot underestimate the number of breaks in a time series regression and that all the break fractions (but not the break dates) can be consistently estimated as in Bai and Perron (1998). More precisely, letting D(A, B) ≡ sup_{b∈B} inf_{a∈A} |a − b| for any two sets A and B, Qian and Su (2013, Theorem 3.2) establish that lim_{T→∞} P(D(T̂_m̂, T0_{m0}) ≤ Tδ_T) = 1 for some sequence δ_T such that δ_T → 0 and Tδ_T → ∞ as T → ∞. In our panel setting, the availability of N cross-sectional units for each time period permits us to obtain the set of consistent preliminary estimates {β̇_t} used to construct the adaptive weights {ẇ_t}. The adaptive nature of our group fused Lasso procedure helps us to identify the exact set of break dates and yields stronger results than those in Qian and Su (2013).
3.3 Limiting distribution of the post-Lasso estimator
In this subsection we study the asymptotic distribution of the post-Lasso estimator α̂^p_m̂(T̂_m̂). Corollary 3.3 implies that w.p.a.1, m̂ = m^0 and T̂_j = T^0_j for j = 1, ..., m^0. It follows that α̂^p_m̂(T̂_m̂) is asymptotically equivalent to the infeasible estimator α̂^p_{m^0}(T^0_{m^0}), which is obtained if one knows the exact set T^0_{m^0} of true break dates. Note that

α̂^p_{m^0}(T^0_{m^0}) = Φ̂^{−1}_{NT} Ψ̂_{yNT},

where Φ̂_{NT} and Ψ̂_{yNT} are defined in (3.1) and (3.2), respectively.
The following theorem reports the limiting distribution of α̂^p_m̂(T̂_m̂) conditional on the large probability event {m̂ = m^0}.
Theorem 3.4 Suppose that Assumptions A.1-A.3 hold with c_J = ∞ in Assumption A.2(i). Then, conditional on m̂ = m^0, we have √N D_{m^0+1}(α̂^p_m̂(T̂_m̂) − α^0) →_D N(0, Φ^{−1}_0 Ω_0 Φ^{−1}_0).
Since we allow I^0_j to be either fixed or diverging to infinity in the case of large T, the α̂^p_j(T̂_m̂)'s may have different convergence rates to their true values. In the special case where I^0_j is proportional to T, α̂^p_j(T̂_m̂) achieves the usual √(NT)-rate of consistency.
3.4 Choosing the tuning parameter λ1
Let α̂_{m̂λ1} ≡ α̂_{m̂λ1}(T̂_{m̂λ1}) = (α̂_1(T̂_{m̂λ1})′, ..., α̂_{m̂λ1+1}(T̂_{m̂λ1})′)′ denote the set of post-Lasso estimates of the regression coefficients based on the break dates in T̂_{m̂λ1} = T̂_{m̂λ1}(λ1), where we make the dependence of various estimates on λ1 explicit. Let σ̂²_{T̂_{m̂λ1}} ≡ (T − 1)^{−1} Q_{1NT}(α̂_{m̂λ1}; T̂_{m̂λ1}). We propose to select the tuning parameter λ1 by minimizing the following information criterion:

IC(λ1) = σ̂²_{T̂_{m̂λ1}} + ρ_{1NT} p (m̂_{λ1} + 1). (3.3)
Denote Ω = [0, λ_max], a bounded interval in R_+. We divide Ω into three subsets Ω_0, Ω_− and Ω_+ as follows:

Ω_0 = {λ1 ∈ Ω : m̂_{λ1} = m^0}, Ω_− = {λ1 ∈ Ω : m̂_{λ1} < m^0}, and Ω_+ = {λ1 ∈ Ω : m̂_{λ1} > m^0}.

Clearly, Ω_0, Ω_− and Ω_+ denote the three subsets of Ω in which the correct, under- and over-estimated numbers of breaks are selected by the adaptive group fused Lasso, respectively. Let λ^0_{1NT} denote an element in Ω_0 that satisfies the conditions on λ1 in Assumptions A.2(ii)-(iii).
Let σ̂²_{NT} ≡ [N(T − 1)]^{−1} Σ_{i=1}^N Σ_{t=2}^T (Δu_it)² and σ²_0 ≡ plim σ̂²_{NT}. To state the next result, we add the following assumptions.
Assumption A.4. (i) plim_{N→∞} min_{1≤j≤m^0} min_{α∈R^p} (N J²_min)^{−1} Σ_{i=1}^N [(α^0_{j+1} − α)′ x_{iT^0_j} − (α^0_j − α)′ x_{i,T^0_j−1}]² ≥ c_α > 0.
(ii) [N(T − 1)]^{−1/2} Σ_{t=2}^T Σ_{i=1}^N Δx_it Δu_it = O_P(1).
(iii) As N → ∞ or (N, T) → ∞, T/[(I_min J_min)² N] → 0.

Assumption A.5. As N → ∞ or (N, T) → ∞, (1 + T/(I_min J²_min)) ρ_{1NT} → 0 and N ρ_{1NT} → ∞.
A.4(i) imposes conditions on the parameters and the observations that are either at the break dates or immediately preceding the break dates. The scaling factor J²_min reflects the fact that we allow the minimum break size J_min to shrink to zero. In the latter case, pooling observations in two adjacent regimes with a break size of order O(J_min) to estimate the regression coefficients within these two regimes is still consistent, with a J^{−1}_min-rate of consistency. Under A.2(i)-(ii), A.4(ii) can be verified under various weak dependence conditions, say, strong mixing or martingale-difference-sequence-type conditions. A.4(iii) imposes restrictions on I_min, J_min and the sample sizes. It is trivially satisfied if I_min ∝ T and J_min remains fixed as N → ∞ or (N, T) → ∞, and it reduces to the condition that c_J = ∞ in Assumption A.2(i) in the case where T is fixed. A.5 reflects the usual conditions for the consistency of model selection, that is, the penalty coefficient ρ_{1NT} cannot shrink to zero either too fast or too slowly. If I_min ∝ T and J^{−1}_min = O(1), the first part of A.5 requires that ρ_{1NT} → 0, which is standard for an information criterion. N^{−1} indicates the probability order of the distance between the first term of the criterion function for an over-parametrized model and that for the true model.
Theorem 3.5 Suppose that Assumptions A.1, A.2(i) and A.3-A.5 hold with c_J = ∞ in Assumption A.2(i). Then

P(inf_{λ1∈Ω_−∪Ω_+} IC(λ1) > IC(λ^0_{1NT})) → 1 as N → ∞.
Theorem 3.5 implies that the λ1's that yield an over- or under-estimated number of breaks fail to minimize the information criterion w.p.a.1. Consequently, the minimizer of IC(λ1) can only be one that produces the correct number of estimated breaks in large samples. Note that we prove the above theorem without requiring λ1 to satisfy Assumptions A.2(ii)-(iii). This indicates that if the correct determination of the number of breaks is our major concern, we can simply choose λ1 to minimize IC(λ1).
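The IC-based choice of λ1 is a one-dimensional search. A hedged sketch, where fit(λ) stands in for the full AGFL-plus-post-Lasso step and returns the pair (σ̂², m̂) for a given λ1 (the toy fit below is purely illustrative, not the estimator):

```python
import numpy as np

def select_lambda(fit, lam_grid, rho, p):
    """Minimize IC(lam) = sigma2(lam) + rho * p * (m_hat(lam) + 1) over a grid.

    `fit` is a placeholder for the estimation step; it must return (sigma2, m_hat).
    """
    ics = []
    for lam in lam_grid:
        sigma2, m_hat = fit(lam)
        ics.append(sigma2 + rho * p * (m_hat + 1))
    return lam_grid[int(np.argmin(ics))]

# Toy fit: under-penalization gives 3 breaks, the right zone gives 1, heavy
# penalization gives 0 and inflates the fit term (true m0 = 1 here).
def toy_fit(lam):
    m_hat = 3 if lam < 0.05 else (1 if lam < 1.0 else 0)
    sigma2 = 1.0 if m_hat >= 1 else 1.5
    return sigma2, m_hat

best = select_lambda(toy_fit, list(np.logspace(-2, 1, 40)), rho=0.02, p=1)
print(toy_fit(best)[1])  # 1  (the selected lambda recovers the true number of breaks)
```

An under-penalized λ pays through the ρ·p·(m̂+1) term, an over-penalized λ through the inflated σ̂², so the minimizer lands in the correct-selection zone.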
4 Asymptotic properties of the PGMM estimators
In this section we address the statistical properties of the PGMM estimators.
4.1 Assumptions
Let φ†_{ab,l+1} = φ′_{ab,T^0_l} W_{T^0_l} φ_{ab,T^0_l,T^0_l−1} for l = 1, ..., m^0 and a, b = z, x, Δx. Define the p(m^0 + 1) × p(m^0 + 1) matrix Υ_{NT} and the p(m^0 + 1) × 1 vector Ξ_{aNT}, respectively, as

Υ_{NT} = TriD(Υ†, Υ)_{m^0+1}, Ξ_{aNT} = (Ξ′_{a,1}, Ξ′_{a,2}, ..., Ξ′_{a,m^0+1})′, a = y or u, (4.1)

where Υ_1 = Φ′_{zΔx,1} W^p_1 Φ_{zΔx,1} + φ′_{zx,T^0_1,T^0_1−1} W_{T^0_1} φ_{zx,T^0_1,T^0_1−1}, Υ_l = Φ′_{zΔx,l} W^p_l Φ_{zΔx,l} + φ′_{zx,T^0_l,T^0_l−1} W_{T^0_l} φ_{zx,T^0_l,T^0_l−1} + φ′_{zx,T^0_{l−1}} W_{T^0_{l−1}} φ_{zx,T^0_{l−1}} for l = 2, ..., m^0, Υ_{m^0+1} = Φ′_{zΔx,m^0+1} W^p_{m^0+1} Φ_{zΔx,m^0+1} + φ′_{zx,T^0_{m^0}} W_{T^0_{m^0}} φ_{zx,T^0_{m^0}}, and Υ†_l = φ†_{xx,l} for l = 2, ..., m^0 + 1. In addition, for a = y or u, Ξ_{a,1} = Φ′_{zΔx,1} W^p_1 Φ_{zΔa,1} − φ′_{zx,T^0_1,T^0_1−1} W_{T^0_1} φ_{zΔa,T^0_1}, Ξ_{a,l} = Φ′_{zΔx,l} W^p_l Φ_{zΔa,l} − φ′_{zx,T^0_l,T^0_l−1} W_{T^0_l} φ_{zΔa,T^0_l} + φ′_{zx,T^0_{l−1}} W_{T^0_{l−1}} φ_{zΔa,T^0_{l−1}} for l = 2, ..., m^0, and Ξ_{a,m^0+1} = Φ′_{zΔx,m^0+1} W^p_{m^0+1} Φ_{zΔa,m^0+1} + φ′_{zx,T^0_{m^0}} W_{T^0_{m^0}} φ_{zΔa,T^0_{m^0}}.
To study the asymptotic properties of the PGMM estimators, we make the following assumptions.
Assumption B.1. (i) Let z_i = (z_{i1}, ..., z_{iT})′. (x_i, z_i, u_i) are independently distributed over i.
(ii) E(z_it Δu_it) = 0 for i = 1, ..., N and t = 2, ..., T. max_{1≤i≤N} max_{1≤t≤T} E‖ς_it‖⁴ < C < ∞ for ς_it = x_it, z_it, and u_it.
(iii) There exists a matrix Q_0 > 0 such that ‖Q̂_{NT} − Q_0‖_sp = o_P(1). There exist two constants c_{Q0} and c̄_{Q0} such that 0 < c_{Q0} ≤ λ_min(Q_0) ≤ λ_max(Q_0) ≤ c̄_{Q0} < ∞.
Assumption B.2. (i) J_max = O(1) and N^{1/2} J_min → c_J ∈ (0, ∞] as N → ∞ or (N, T) → ∞.
(ii) N^{1/2} λ2 J^{−κ2}_min → c ∈ [0, ∞) as N → ∞ or (N, T) → ∞.
(iii) N^{(κ2+1)/2} λ2 → ∞ as N → ∞ or (N, T) → ∞.
Assumption B.3. (i) D^{−1}_{m^0+1} Υ_{NT} D^{−1}_{m^0+1} →_P Υ_0 > 0.
(ii) √N D^{−1}_{m^0+1} Ξ_{uNT} →_D N(0, Σ_0).
Assumptions B.1(i)-(iii) parallel Assumptions A.1(i)-(iii). B.1(ii) specifies moment conditions on x_it, z_it, and u_it. In conjunction with B.1(i), B.1(ii) implies that each block element of √N R_{uNT} is O_P(1) and that T^{−1} N ‖R_{uNT}‖² = O_P(1) by the Chebyshev inequality. Combining B.1(i)-(iii), we prove in Lemma B.1 that √N(β̇_t − β^0_t) = O_P(1) for each t = 1, ..., T. Assumption B.2 mainly specifies conditions on J_min, λ2, and N. Note that we allow the minimum break size J_min to shrink to zero as N → ∞. In the special case where J_min is bounded away from zero, B.2 can be simplified to

Assumption B.2*. N^{1/2} λ2 → c ∈ [0, ∞) and N^{(κ2+1)/2} λ2 → ∞ as N → ∞ or (N, T) → ∞.

Assumption B.3 specifies conditions to ensure the asymptotic normality of the post-Lasso estimator.
4.2 Consistency
The following theorem establishes the consistency of β̃_t.

Theorem 4.1 Suppose that Assumption B.1 holds. Then (i) T^{−1} ‖β̃ − β^0‖² = O_P(N^{−1}), and (ii) β̃_t − β^0_t = O_P(N^{−1/2}) for each t = 1, ..., T.
Theorems 4.1(i) and (ii) establish the mean square and pointwise convergence rates of β̃_t, respectively. The two results are equivalent in the case of fixed T but not in the case of large T. If T → ∞ as N → ∞, the proof of Theorem 4.1(ii) requires the use of the factorization and inversion properties of symmetric block tridiagonal matrices, as in the proof of Theorem 3.1(ii).
Let θ̃_1 = β̃_1 and θ̃_t = β̃_t − β̃_{t−1} for t = 2, ..., T. The following theorem establishes the selection consistency.

Theorem 4.2 Suppose that Assumptions B.1-B.2 hold. Then P(‖θ̃_t‖ = 0 for all t ∈ T^{0c}_{m0}) → 1 as N → ∞.
Theorem 4.2 says that w.p.a.1 all the zero vectors in {θ^0_t, 2 ≤ t ≤ T} must be estimated as exactly zero by the PGMM method. On the other hand, by Theorem 4.1(ii), we know that the estimates of the nonzero vectors in {θ^0_t, 2 ≤ t ≤ T} must be consistent, by noting that β̃_t − β̃_{t−1} consistently estimates θ^0_t for t ≥ 2. Put together, Theorems 4.1 and 4.2 imply that the AGFL has the ability to identify the true regression model with the correct number of breaks consistently when the minimum break size J_min does not shrink to zero too fast.
Corollary 4.3 Suppose that Assumptions B.1-B.2 hold with c_J = ∞ in Assumption B.2(i). Then (i) lim_{N→∞} P(m̃ = m^0) = 1, and (ii) lim_{N→∞} P(T̃_1 = T^0_1, ..., T̃_{m^0} = T^0_{m^0} | m̃ = m^0) = 1.
The above corollary implies that the PGMM method helps us to estimate the number of structural
changes and all the break dates consistently regardless of whether T is fixed or passes to infinity.
4.3 Limiting distribution of the post-Lasso estimator
In this subsection we study the asymptotic distribution of the post-Lasso estimator α̃^p_m̃(T̃_m̃). Corollary 4.3 implies that w.p.a.1, m̃ = m^0 and T̃_j = T^0_j for j = 1, ..., m^0. It follows that α̃^p_m̃(T̃_m̃) is asymptotically equivalent to the infeasible estimator α̃^p_{m^0}(T^0_{m^0}), which is obtained if one knows the exact set T^0_{m^0} of true break dates. Note that

α̃^p_{m^0}(T^0_{m^0}) = Υ^{−1}_{NT} Ξ_{yNT},

where Υ_{NT} and Ξ_{yNT} are defined in (4.1).
The following theorem reports the limiting distribution of α̃^p_m̃(T̃_m̃) conditional on the large probability event {m̃ = m^0}.

Theorem 4.4 Suppose that Assumptions B.1-B.3 hold. Then, conditional on m̃ = m^0, we have √N D_{m^0+1}(α̃^p_m̃(T̃_m̃) − α^0) →_D N(0, Υ^{−1}_0 Σ_0 Υ^{−1}_0).
Since we allow I^0_j to be either fixed or diverging to infinity in the case of large T, the α̃^p_j(T̃_m̃)'s may have different convergence rates to their true values. In the special case where I^0_j is proportional to T, α̃^p_j(T̃_m̃) achieves the usual √(NT)-rate of consistency.
4.4 Choosing the tuning parameter λ2
Let α̃_{m̃λ2} ≡ α̃_{m̃λ2}(T̃_{m̃λ2}) = (α̃_1(T̃_{m̃λ2})′, ..., α̃_{m̃λ2+1}(T̃_{m̃λ2})′)′ denote the set of post-Lasso estimates of the regression coefficients based on the break dates in T̃_{m̃λ2} = T̃_{m̃λ2}(λ2), where we make the dependence of various estimates on λ2 explicit. Let σ̃²_{T̃_{m̃λ2}} ≡ (T − 1)^{−1} Q_{2NT}(α̃_{m̃λ2}, T̃_{m̃λ2}). We propose to select the tuning parameter λ2 by minimizing the following information criterion:

IC2(λ2) = σ̃²_{T̃_{m̃λ2}} + ρ_{2NT} p (m̃_{λ2} + 1). (4.2)

Denote Ω_2 = [0, λ_{2max}], a bounded interval in R_+. We divide Ω_2 into three subsets Ω_{20}, Ω_{2−} and Ω_{2+} as follows:

Ω_{20} = {λ2 ∈ Ω_2 : m̃_{λ2} = m^0}, Ω_{2−} = {λ2 ∈ Ω_2 : m̃_{λ2} < m^0}, and Ω_{2+} = {λ2 ∈ Ω_2 : m̃_{λ2} > m^0}.
Let λ^0_{2NT} denote an element in Ω_{20} that also satisfies the conditions on λ2 in Assumptions B.2(ii)-(iii). To state the next result, we add the following assumptions.

Assumption B.4. (i) plim_{N→∞} min_{1≤j≤m^0} min_{α∈R^p} J^{−2}_min η_j(α)′ W_{T^0_j} η_j(α) ≥ c_α > 0, where η_j(α) = N^{−1} Σ_{i=1}^N [(α^0_{j+1} − α)′ x_{iT^0_j} − (α^0_j − α)′ x_{i,T^0_j−1}] z_{iT^0_j}.
(ii) [N(I^0_j − 1)]^{−1/2} Σ_{t=T^0_{j−1}+1}^{T^0_j−1} Σ_{i=1}^N z_it Δu_it = O_P(1) for each j = 1, ..., m^0 + 1.
(iii) If T → ∞ as N → ∞, then I_min → ∞ and T/[(I_min J_min)² N] → 0.

Assumption B.5. As N → ∞ or (N, T) → ∞, (1 + T/(I_min J²_min)) ρ_{2NT} → 0 and N ρ_{2NT} → ∞.
Assumptions B.4-B.5 parallel A.4-A.5. Note that we now require Imin →∞ in the case of large T. The
following theorem implies that the minimizer of IC2 (λ2) can only be the one that produces the correct
number of estimated breaks in large samples.
Theorem 4.5 Suppose that Assumptions B.1, B.2(i) and B.3-B.5 hold with c_J = ∞ in Assumption B.2(i). Then

P(inf_{λ2∈Ω_{2−}∪Ω_{2+}} IC2(λ2) > IC2(λ^0_{2NT})) → 1 as N → ∞.
5 Monte Carlo simulations
In this section we conduct a set of Monte Carlo experiments to evaluate the finite sample performance
of our AGFL method. The first set of experiments is concerned with the PLS or PGMM estimation of static panel data models. We first evaluate the probability of falsely detecting breaks when there are none. Then we experiment on data generating processes (DGPs) with one or two breaks. In this case, we evaluate both the probability of correctly detecting the number of breaks and the accuracy of estimating the break dates. The second set of experiments deals with the PGMM estimation of dynamic panel data models. We focus on DGPs with a lagged dependent variable and an exogenous variable. As in the static panel case, we evaluate the probability of correctly detecting the number of breaks and, when there are indeed breaks, the accuracy of break date estimation.
For fast computation, we use the block-coordinate descent algorithm (see, e.g., Angelosante and Giannakis (2012)) to solve the minimization problem in (2.2) for the PLS case and (2.7) for the PGMM case. We select the tuning parameters λ1 and λ2 that minimize the information criteria in (3.3) and (4.2) for the cases of PLS and PGMM estimation, respectively. Specifically, we choose a tuning parameter λ_max that would yield zero breaks in every DGP and a λ_min that would yield many breaks. In practice, we can easily find such λ_max and λ_min by trial and error. We then search for the optimal tuning parameter over 40 evenly distributed logarithmic grid points in the interval [λ_min, λ_max]. We choose ρ_{1NT} = ρ_{2NT} = 1/√(N(T − 1)) in (3.3) or (4.2) for the static panels and ρ_{2NT} = log(N(T − 1))/(N(T − 1)) in (4.2) for the dynamic panels. Note that the latter choice specifies exactly the same rate as required by the Bayesian Information Criterion (BIC). Both choices are acceptable in theory for every DGP we experiment on, but their finite-sample performances do differ.
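Each coordinate update in such block-coordinate descent algorithms reduces to a multivariate ("group") soft-thresholding step. A sketch of that building block only, not the full AGFL solver:

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Solve argmin_x 0.5 * ||x - v||^2 + lam * ||x||: shrink the whole block
    v toward zero, and set it exactly to zero when ||v|| <= lam."""
    nv = np.linalg.norm(v)
    if nv <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / nv) * v

print(group_soft_threshold(np.array([3.0, 4.0]), 1.0))  # [2.4 3.2]
print(group_soft_threshold(np.array([0.3, 0.4]), 1.0))  # [0. 0.]
```

The exact zeroing of whole coefficient blocks is what allows the procedure to set entire differences θ_t to zero, i.e., to declare "no break" at date t.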
Following the literature on adaptive Lasso, we set κ1 = κ2 = 2 in the construction of the adaptive weights ŵ_t and w̃_t that are used for the PLS and PGMM estimation, respectively. In addition, we choose all weight matrices W_t, t = 2, ..., T, and W^p_j, j = 1, ..., m̂ + 1, as detailed in the last paragraph of Section 2.3. The number of repetitions in all subsequent Monte Carlo experiments is 500.
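The functional form of the weights is not restated in this section; assuming the standard adaptive group fused Lasso form ŵ_t = ‖β̇_t − β̇_{t−1}‖^{−κ} built from the preliminary estimates, with κ = 2 as above, a sketch:

```python
import numpy as np

def adaptive_weights(beta_prelim, kappa=2):
    """Assumed weight form w_t = ||beta_t - beta_{t-1}||^(-kappa), t = 2..T,
    computed from a (T x p) array of preliminary estimates. Large weights on
    small preliminary differences push those differences to exactly zero."""
    diffs = np.linalg.norm(np.diff(beta_prelim, axis=0), axis=1)
    return diffs ** (-float(kappa))   # diffs are a.s. nonzero in practice

beta_prelim = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(adaptive_weights(beta_prelim))  # [0.04 0.04]
```

With √N-consistent preliminary estimates, the weights at true no-break dates diverge at rate N^{κ/2}, which is what drives the selection consistency of Theorems 3.2 and 4.2.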
5.1 The case of static panel
We consider the following DGPs:
y_it = β_t x_it + μ_i + u_it, i = 1, ..., N, t = 1, ..., T,

where μ_i = T^{−1} Σ_{t=1}^T x_it and
• DGP-1: x_it ∼ i.i.d. N(0, 1), u_it = σ_u η_it, η_it ∼ i.i.d. N(0, 1).

• DGP-2: Same as DGP-1 except that η_it ∼ AR(1) for each i: η_it = 0.5 η_{i,t−1} + ε_it, ε_it ∼ i.i.d. N(0, 0.75).

• DGP-3: Same as DGP-1 except that η_it ∼ GARCH(1, 1) for each i: η_it = √h_it ε_it, h_it = 0.05 + 0.05 η²_{i,t−1} + 0.9 h_{i,t−1}, ε_it ∼ i.i.d. N(0, 1).

• DGP-4: x_it = ξ_it + 0.3 η_it, where η_it and ξ_it are i.i.d. N(0, 1) and mutually independent, u_it = σ_u η_it, z_it = ξ_it + 0.3 ε_it, ε_it ∼ i.i.d. N(0, 1).

• DGP-5: Same as DGP-4 except that ξ_it ∼ AR(1) for each i: ξ_it = 0.5 ξ_{i,t−1} + ε_it, ε_it ∼ i.i.d. N(0, 0.75).

• DGP-6: Same as DGP-4 except that η_it ∼ GARCH(1, 1) for each i: η_it = √h_it ε_it, h_it = 0.05 + 0.05 η²_{i,t−1} + 0.9 h_{i,t−1}, ε_it ∼ i.i.d. N(0, 1).
We consider T = 6 or 12, and N = 50, 100, 200, and 500. For each DGP, we set β_t = 1 for all t when no break exists, β_t = 1{1 ≤ t ≤ T/2} when there is one break, and β_t = 1{1 ≤ t ≤ T/2} + 2 · 1{T/2 < t ≤ 2T/3} when there are two breaks, where 1{·} denotes the usual indicator function. If T = 6, the last case allows consecutive breaks at t = 4 and 5.
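The one-break version of DGP-1 can be simulated in a few lines (a sketch for illustration, not the authors' code):

```python
import numpy as np

def simulate_dgp1(N, T, sigma_u, seed=0):
    """DGP-1, one-break version: y_it = beta_t * x_it + mu_i + u_it with
    beta_t = 1{1 <= t <= T/2}, x_it and eta_it i.i.d. N(0,1),
    u_it = sigma_u * eta_it, and mu_i the within (time) average of x_it."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, T))
    u = sigma_u * rng.standard_normal((N, T))
    beta = (np.arange(1, T + 1) <= T // 2).astype(float)   # break at t = T/2 + 1
    mu = x.mean(axis=1, keepdims=True)                     # "fixed effect" mu_i
    y = beta * x + mu + u
    return y, x, beta

y, x, beta = simulate_dgp1(N=50, T=6, sigma_u=np.sqrt(0.5))
print(beta)  # [1. 1. 1. 0. 0. 0.]
```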
Note that the individual effects μ_i are generated as within averages of the regressor and are thus regarded as "fixed effects". In the first three DGPs, no endogeneity issue exists and we use PLS to estimate the models. DGP-1 serves as the benchmark case where both the regressor and the idiosyncratic error processes are strong white noise. DGP-2 allows serial correlation in the idiosyncratic error process and DGP-3 allows conditional heteroskedasticity. DGP-4 through DGP-6 contain an endogenous variable x_it and a variable z_it that generates a valid IV. We apply PGMM to estimate the models, using (z_it, z_{i,t−1})′ as the instrument. DGP-4 serves as the benchmark case where both the regressor and the error terms are i.i.d. across i and t. Here x_it and u_it are correlated due to the common component η_it, and z_it is correlated with x_it due to the presence of ξ_it in both. DGP-5 allows serial correlation in x_it, and DGP-6 allows conditional heteroskedasticity in u_it.
To evaluate the performance of the PLS or PGMM estimation under different noise levels, we select the scale parameter σ_u to be √(1/2) or 1. In DGP-1, these values of σ_u correspond to signal-to-noise ratios of 2 and 1 (or, in terms of the goodness of fit R² of the model, 0.67 and 0.5), respectively.
Tables 1 and 2 report simulation results from the above DGPs. The first panel of Table 1 reports the
percentages of falsely detecting breaks when there are none (m0 = 0). The second and the third panels
Table 1: The determination of the number of breaks for DGPs 1-6 (static panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 0, % of falsely detecting breaks when there are none
1     √2/2       0.2    0      0      0      0      0      0      0
      1          5.6    1.8    1.6    0      0.6    0      0      0
2     √2/2       0      0      0      0      0      0      0      0
      1          0      0      0      0      0      0      0      0
3     √2/2       0.2    0      0      0      0      0      0      0
      1          6.8    3.6    1      0.6    0      0      0      0
4     √2/2       0      0      0      0      0      0      0      0
      1          0      0      0      0      0      0      0      0
5     √2/2       0      0      0      0      0      0      0      0
      1          0.2    0      0      0      0      0      0      0
6     √2/2       0.4    0      0      0      0      0      0      0
      1          0      0      0      0      0      0      0      0
m0 = 1, % of correctly detecting one break
1     √2/2       99.4   99.8   100    100    100    100    100    100
      1          94.4   95.8   99.0   99.8   99.6   100    100    100
2     √2/2       100    100    100    100    100    100    100    100
      1          99.8   100    100    100    100    100    100    100
3     √2/2       99.2   100    100    100    100    100    100    100
      1          95.8   97.8   97.0   99.6   99.4   100    100    100
4     √2/2       95.8   95.8   99.8   99.6   100    100    100    100
      1          85.4   79.2   97.2   96.4   99.6   99.4   100    100
5     √2/2       87.6   95.2   95.4   100    99.2   100    99.8   100
      1          64.8   80.4   89.2   95.6   97.2   99.8   99.8   100
6     √2/2       97.8   94.8   100    99.6   100    100    100    100
      1          48.4   54.0   81.4   81.2   97.6   97.8   100    100
m0 = 2, % of correctly detecting two breaks
1     √2/2       99.0   99.8   100    100    100    100    100    100
      1          92.8   97.4   99.0   98.8   100    100    100    100
2     √2/2       99.8   100    100    100    100    100    100    100
      1          99.2   99.8   100    100    100    100    100    100
3     √2/2       98.6   99.8   100    100    100    100    100    100
      1          93.2   96.2   99.0   99.6   99.8   99.8   100    100
4     √2/2       12.2   43.0   27.8   77.8   91.8   99.4   100    100
      1          2.6    7.8    4.0    14.8   25.4   82.4   98.4   99.8
5     √2/2       4.8    9.2    26.8   40.2   84.4   97.2   99.0   100
      1          0.8    1.8    5.6    3.2    34.8   46.2   95.2   99.8
6     √2/2       11.8   44.2   31.6   83.2   92.4   99.4   100    100
      1          2.0    12.6   4.2    14.8   28.0   86.2   98.8   100
Table 2: The accuracy of estimating the break dates for DGPs 1-6 (static panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 1
1     √2/2       .034   .017   .000   .000   .000   .000   .000   .000
      1          .000   .035   .000   .000   .000   .000   .000   .000
2     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .000   .000   .000   .000   .000   .000   .000   .000
3     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .104   .119   .000   .000   .000   .000   .000   .000
4     √2/2       .070   .070   .000   .000   .000   .000   .000   .000
      1          .468   .463   .069   .121   .000   .000   .000   .000
5     √2/2       .038   .018   .000   .000   .000   .000   .000   .000
      1          .051   .311   .037   .035   .000   .000   .000   .000
6     √2/2       .068   .035   .000   .000   .000   .000   .000   .000
      1          1.240  2.037  .532   .616   .034   .068   .000   .000
m0 = 2
1     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .180   .171   .000   .000   .000   .000   .000   .000
2     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .034   .033   .000   .000   .000   .000   .000   .000
3     √2/2       .000   .017   .000   .000   .000   .000   .000   .000
      1          .179   .139   .000   .000   .000   .000   .000   .000
4     √2/2       .546   .620   .000   .214   .000   .000   .000   .000
      1          2.564  1.709  .000   .338   .000   .020   .000   .000
5     √2/2       .000   .000   .124   .083   .000   .000   .000   .000
      1          4.167  .000   .000   .000   .000   .000   .000   .000
6     √2/2       .000   .339   .000   .060   .000   .000   .000   .000
      1          .000   1.720  .794   .113   .000   .039   .000   .000

Note: The table reports the ratio of the average Hausdorff distance between the estimated and true sets of break dates to T, i.e., 100 · HD(T̂_m̂, T^0_{m^0})/T in DGPs 1-3 and 100 · HD(T̃_m̃, T^0_{m^0})/T in DGPs 4-6.
report the percentages of correctly estimating the number of breaks when the true number of breaks is 1 and 2, respectively. We summarize some important findings from Table 1. First, when there are no breaks, the probability of falsely detecting breaks declines to zero as either N or T increases. This is true both for the PLS estimation in DGP-1 to DGP-3 in the case of no endogenous regressor and for the PGMM estimation in DGP-4 to DGP-6 in the case of an endogenous regressor. Even with N = 50 and T = 6, the probabilities of false detection of breaks are very small for all DGPs under investigation. Second, when there is one break, the probabilities of correctly detecting one break converge to 100% as N increases. In the case of PLS, the probabilities of correct detection are high at both noise levels even when N = 50 and T = 6. In the case of PGMM, however, they are much lower at the high noise level than at the low noise level when the sample sizes are small, although they converge quickly to 100% as N increases. As T increases from 6 to 12, the probability of correct detection improves in general. Third, when there are two breaks, the probabilities of correctly detecting two breaks converge to 100% as N increases from 50 to 500. When T = 6, there are two consecutive breaks at t = 4 and 5, and the percentage of correctly estimating the number of breaks tends to be very low in DGP-4 to DGP-6 if N is not large enough (50 or 100). But it improves quickly when T increases to 12, in which case there are no consecutive breaks, and it also tends to 100% rapidly as N increases from 50 to 500.
Table 2 reports the ratio of the average Hausdorff distance (HD) between the estimated and true sets of break dates to T, i.e., 100 · HD(T̂_m̂, T^0_{m^0})/T in the case of PLS estimation and 100 · HD(T̃_m̃, T^0_{m^0})/T in the case of PGMM estimation, conditional on correct estimation of the number of breaks.6 Conditional on the correct estimation of the number of breaks, both PLS and PGMM estimate the break dates very accurately. Even with N = 50 and T = 6, the average ratios of the Hausdorff distance to T are close to zero for PLS at both noise levels. For DGPs with endogeneity, the estimation of break dates is only slightly less accurate.
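The Hausdorff distance used in Table 2 (defined in footnote 6) is straightforward to compute for finite sets of break dates; a sketch:

```python
def hausdorff(A, B):
    """HD(A, B) = max{D(A, B), D(B, A)} with D(A, B) = sup_{b in B} inf_{a in A} |a - b|,
    matching footnote 6; A and B are nonempty finite sets of break dates."""
    def D(A, B):
        return max(min(abs(a - b) for a in A) for b in B)
    return max(D(A, B), D(B, A))

print(hausdorff({4, 9}, {4, 8}))  # 1
print(hausdorff({4, 9}, {4, 9}))  # 0
```

Dividing by T and averaging over the replications in which m̂ = m^0 gives the entries reported in Tables 2 and 4.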
5.2 The case of dynamic panel
We consider the following DGP’s with an AR(1) dynamics:
yit = β1tyi,t−1 + β2tx2it + µi + uit,
where µi ∼ i.i.d. Uniform[−0.1, 0.1] and
• DGP-1d: x2it ∼ i.i.d. N(0, 1), uit = σuηit, ηit ∼ i.i.d. N(0, 1).
• DGP-2d: Same as DGP-1d except x2it ∼ AR(1) for each i : x2it = 0.5x2i,t−1 + vit, vit ∼i.i.d. N(0, 0.75).
6Let D (A,B) ≡ supb∈B infa∈A |a− b| for any two sets A and B. The Hausdorff distance between A and B is defined as
HD (A,B) ≡ maxD (A,B) ,D (B,A).
21
• DGP-3d: Same as DGP-1d except ηit ∼ GARCH(1, 1) for each i : ηit =√hitεit, hit = 0.05 +
0.05η2i,t−1 + 0.9hi,t−1, εit ∼ i.i.d. N(0, 1).
As in the static case, we take T = 6 or 12, and N = 50, 100, 200, and 500. For each DGP, we set either β_{1t} = β_{2t} = 0.5 or, more persistently, β_{1t} = β_{2t} = 0.8 for all t when no break exists; β_{1t} = β_{2t} = 0.3 · 1{1 ≤ t ≤ T/2} + 0.7 · 1{T/2 < t ≤ T} when there is one break; and β_{1t} = β_{2t} = 0.3 · 1{1 ≤ t ≤ T/2} + 0.7 · 1{T/2 + 1 ≤ t ≤ 2T/3} + 0.3 · 1{2T/3 + 1 ≤ t ≤ T} when there are two breaks. Note that when T = 6, there are consecutive breaks at t = 4 and 5.
DGP-1d is the benchmark case with x_it and u_it i.i.d. across both i and t. DGP-2d allows serial correlation in x_it and DGP-3d allows conditional heteroskedasticity in u_it. We choose the scale parameter σ_u to be 0.2, 0.3, and 0.5, corresponding to signal-to-noise ratios of 4, 2, and 1, respectively, in DGP-1d with β_{1t} = β_{2t} = 0.5. The relatively lower noise levels are justified by the usually high goodness of fit of many dynamic panels in applications. To obtain the PGMM estimates, we use z_it = (y_{i,t−2}, x_{2it}, x_{2i,t−1})′ as the instrument.
Table 3 reports the estimation of the number of breaks for these three DGPs. The first two panels report the percentages of falsely detecting breaks when there are none (m^0 = 0); the AR coefficient is 0.5 in the first panel and 0.8 in the second. The third and fourth panels report the percentages of correctly estimating the number of breaks when the true number of breaks is 1 and 2, respectively. We summarize the results in Table 3. (i) When there are no breaks, the probabilities of falsely detecting breaks are small and, in general, become smaller as N or T increases. When the AR coefficient increases from 0.5 to 0.8 and the dynamic panel becomes more persistent, the probabilities of false detection remain low. In fact, for some DGPs (e.g., DGP-3d, N = 500), the probabilities of false detection at the higher persistence level are generally lower than those at the moderate persistence level, thanks to the fact that the signal-to-noise ratio is higher at the high persistence level. (ii) When there is one break, the probabilities of correctly detecting one break converge to 100% across all choices of N, T, and noise levels. (iii) When there are two breaks, we see relatively lower probabilities of correct estimation, especially at high noise levels. But as N increases, the probabilities of correct estimation also converge to 100% across all noise levels.
Table 4 reports the ratio of the average Hausdorff distance between the estimated and true sets of break dates to T, i.e., 100 · HD(T̃_m̃, T^0_{m^0})/T, for DGP-1d to DGP-3d. As in the static panel case, conditional on the correct estimation of the number of breaks, our procedure estimates the break dates very accurately. Even with N = 50 and T = 6, the average ratio of the Hausdorff distance to T is very close to zero at all noise levels.
Table 3: The determination of the number of breaks for DGPs 1d-3d (dynamic panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 0, β1t = β2t = 0.5, % of falsely detecting breaks when there are none
1d    .2         2.6    1.4    0.6    0.4    0.6    1      1      0.8
      .3         1.8    1.6    0.6    0.6    0.4    0.2    0.6    0.6
      .5         2.8    1.2    1.6    0.8    0.8    0.8    0      0
2d    .2         1.4    0      1.2    0.8    0.2    0.2    0.2    0
      .3         1.4    1.2    0.4    0.6    1      0.2    0.4    0.2
      .5         0.8    1.6    1.2    1.4    0.6    0.8    0.2    0.2
3d    .2         1.8    1.4    0.6    0.6    0.8    0.8    0.4    0.4
      .3         1.8    1.4    1      1.4    0.6    0.2    0.2    0.6
      .5         2.2    1.4    1.2    0.6    1.4    1.2    0.2    0.6
m0 = 0, β1t = β2t = 0.8, % of falsely detecting breaks when there are none
1d    .2         1.8    1.2    1      1.4    0.8    1      0      0.2
      .3         1.4    1.4    0.6    0.6    1      0.8    0.6    0.4
      .5         3.6    2      1.2    0.6    0.8    0.6    1      0.6
2d    .2         3      1      1.8    0.6    0.6    0.6    0.2    0.2
      .3         1.6    0.6    1.2    0.8    0.4    0.2    0.6    0.2
      .5         1.4    1      1.6    1      0.8    1      0.2    0.2
3d    .2         3.2    1.4    1      0.8    0.2    1.2    0.2    0.2
      .3         2.2    1      1.6    0.6    1      0.6    0.2    0.4
      .5         1.6    1      0.8    0.2    0.8    0.4    0.2    0.2
m0 = 1, % of correctly detecting one break
1d    .2         98.4   94.8   98.6   98.4   99.2   99.2   100    100
      .3         96.6   91.8   98.8   98.8   99     98.8   99.8   99.8
      .5         87     83.4   96.6   97.2   99.2   99.4   99.6   99.4
2d    .2         97.2   98.4   97.6   98.2   99.8   99     100    100
      .3         94.2   94.4   98.6   99.2   99.6   99.2   99.4   99.6
      .5         84.8   82.4   97.8   98     99     100    99     100
3d    .2         97.2   95     98.6   98.6   99     99     99.8   99.8
      .3         95.6   90.2   99.6   98.6   99.4   99     99.4   100
      .5         86.4   79.8   97.4   96.6   98.8   99.6   99.4   99.8
m0 = 2, % of correctly detecting two breaks
1d    .2         90.4   80.6   97.6   95.6   99.4   98.8   99.8   100
      .3         81     63.6   94     86.8   99     98.2   100    99.8
      .5         46.6   36.6   87     73     99.4   94.2   100    100
2d    .2         89.6   86     97.6   97.4   99.6   99.8   99.6   99.8
      .3         81.2   63.6   94.2   91.2   96.6   98.8   98.2   99.4
      .5         54.4   38.4   85     69     93     86     97.6   93.8
3d    .2         88.2   79.8   97.6   94.2   99.6   99.6   99.6   99.8
      .3         79.4   60.8   95     90     99     98.6   100    99.8
      .5         45.8   41.4   90.2   72     98     94.8   99.8   99.6
Table 4: The accuracy of estimating the break dates for DGPs 1d-3d (dynamic panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 1
1d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .036   .000   .000   .000   .000   .000   .000
      .5         .421   .719   .000   .034   .000   .000   .000   .000
2d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .000   .000   .000   .000   .000   .000   .000
      .5         .118   .222   .000   .017   .000   .000   .000   .000
3d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .055   .000   .000   .000   .000   .000   .000
      .5         .386   .606   .000   .000   .000   .000   .000   .000
m0 = 2
1d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .041   .079   .000   .000   .000   .000   .000   .000
      .5         .572   .273   .000   .137   .000   .000   .000   .000
2d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .000   .000   .000   .000   .000   .000   .000
      .5         .184   .174   .000   .048   .000   .000   .000   .000
3d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .055   .000   .000   .000   .000   .000   .000
      .5         .509   .362   .037   .000   .000   .000   .000   .000

Note: The table reports the ratio of the average Hausdorff distance between the estimated and true sets of break dates to T, i.e., 100 · HD(T̃_m̃, T^0_{m^0})/T.
6 An empirical application
In this section we offer an illustration of the use of our method. We seek to evaluate the effect of FDI inflow
on economic growth with a dynamic panel data model with an unknown number of breaks. The possible
existence of breaks may be justified theoretically. In the endogenous growth model of Romer (1986),
for example, economic growth may behave differently in different policy environments. Furthermore, in
the growth model of Jones (2002), the regime shifts may be common across countries in “a world of
ideas”, assuming that ideas propagate fast enough. Empirically, there is ample evidence for the existence of breaks in growth paths (e.g., Ben-David and Papell (1995)). However, most existing studies rely on time series structural break tests for individual economies, the United States in particular.
In this empirical exercise, we use a panel dataset of 88 countries or regions from 1973 to 2012. We take data from the UNCTAD (United Nations Conference on Trade and Development) and construct two variables, the per capita GDP growth and the ratio of FDI inflow to GDP, for each economy in the sample.7 These are annual data, but following the literature on growth empirics (e.g., Islam (1995)), we take five-year averages of the two variables and denote them by y_it and fdi_it, respectively. Here the subscript t indexes the five-year periods, and the averaging gives us eight five-year periods for each economy. Because there is one lagged dependent variable in the model, the effective number of data points for each economy is seven. We apply the PGMM method to estimate the following dynamic panel data model with an unknown number of breaks:

y_it = μ_i + β_{1t} y_{i,t−1} + β_{2t} fdi_it + u_it, t = 1, ..., 7.

As in the simulations, we set κ2 = 2 in the construction of the adaptive weights, choose the weight matrices (W_t, W^p_j) as detailed in the last paragraph of Section 2.3, and adopt z_it = (y_{i,t−2}, fdi_it, fdi_{i,t−1})′ as the instrument.

7 The UNCTAD database covers 237 countries and regions. We delete those economies with missing values over 1973-2012.

Figure 1: Selecting the optimal tuning parameter by minimizing the information criterion (IC). Horizontal axis: tuning parameter; left vertical axis: IC; right vertical axis: number of breaks.
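The five-year averaging step described above can be sketched as follows; the start year and window width match the text, while the input format (parallel year/value sequences) is a hypothetical convenience, not the UNCTAD file layout:

```python
import numpy as np

def five_year_average(years, values, start=1973, width=5):
    """Average an annual series over consecutive five-year windows
    (1973-1977, ..., 2008-2012), yielding one observation per period t."""
    years = np.asarray(list(years))
    values = np.asarray(list(values), dtype=float)
    period = (years - start) // width
    return np.array([values[period == k].mean() for k in np.unique(period)])

print(five_year_average(range(1973, 1983), range(1, 11)))  # [3. 8.]
```

Applying this to 1973-2012 yields eight periods per economy, of which seven are usable once the lagged dependent variable is included.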
We first select the optimal tuning parameter that minimizes the BIC, choosing ρ_{2NT} = log(N(T − 1))/(N(T − 1)) in (4.2). We choose λ_max = 10, which results in zero breaks, and λ_min = 0.01, which results in six breaks. We then search on the interval [λ_min, λ_max] over thirty evenly distributed logarithmic grid points. We find that the number of breaks is four and that the breaks occur at t = 2, 5, 6 and 7, that is, 1983-1987, 1998-2002, 2003-2007, and 2008-2012. Figure 1 shows how the BIC (left axis) and the estimated number of breaks (right axis) change with the tuning parameter λ2. We can see that the BIC declines until the estimated number of breaks reaches four and rises as λ2 gets bigger. It is notable that there are five λ2's that result in four breaks, ranging from 0.053 to 0.137, and that the IC curve is flat over this segment (and
similarly over several other segments).8 This suggests that the penalized GMM estimation is not very sensitive to the tuning parameter.
It is well known that the BIC, like other information criteria, may not be able to select the right model in finite samples. It is thus prudent to examine the cases with numbers of breaks other than four. Table 5 shows the regime segmentation, parameter estimates, and standard errors (in parentheses) from the post-Lasso estimation for the cases where m = 0, 1, ..., 6. Note that in the last case (m = 6), there is a structural break at every time point.
As shown in Table 5, the set of break dates is an increasing sequence as the tuning parameter decreases.
It starts from an empty set when m = 0. When m = 1, we have one break at t = 2, which corresponds to
the five-year period of 1983-1987. As the tuning parameter decreases, another break (in addition to the
one at t = 2) is detected at t = 7, which corresponds to 2008-2012. When m = 3, we have an additional
break at t = 6, corresponding to 2003-2007. As the tuning parameter decreases more, we arrive at the
case of m = 4 that achieves the minimum BIC. When m = 5, there is another break at t = 3 and the set
of break dates becomes 2, 3, 5, 6, 7.Table 5 also shows that the determination of structural change in our model is crucial for the quan-
titative evaluation of the effect of FDI on the economic growth. If we assume that no break exists and
estimate a textbook dynamic panel data model, then we may conclude that FDI has a negative, albeit statistically insignificant, effect on growth. In the model chosen by BIC (m = 4), in stark contrast, the coefficient of FDI is significantly positive in all regimes. In models with more than four break dates, there are also negative coefficients on FDI in the five-year span of 1983-1987. In models with at least one but fewer than four breaks, the coefficients on FDI are positive in all regimes but not statistically significant in some of them. This exercise suggests that the time-invariant parameters assumed in the textbook dynamic panel data model are unnecessarily restrictive and may lead to dubious conclusions. Our shrinkage-based method, by allowing multiple breaks in panel data models, provides applied economists with a natural approach to relaxing this assumption.
7 Conclusion
We propose two shrinkage procedures for the determination of the number of structural changes in linear
panel data models via adaptive group fused Lasso: PLS estimation for first-differenced models without
endogeneity and PGMM estimation for first-differenced models with endogeneity. We show that with
probability tending to one our methods can correctly determine the number of breaks and estimate the
break dates consistently. Simulation results suggest that our methods perform well in finite samples.
8 When λ2 changes from 0.053 to 0.137, the number of breaks and the set of estimated break dates remain unchanged, so that neither the first term (corresponding to the post-Lasso regression) nor the second term (the penalty term) in (4.2) changes.

Table 5: The effect of FDI on economic growth (88 countries and regions, 1973-2012)

m   Regime       y_{i,t-1}          fdi_{it}           BIC
0   1978-2012    -.151 (.058)***    -.070 (.052)       .178
1   1978-1982    -.144 (.092)        .523 (.258)**
    1983-2012    -.170 (.069)**      .060 (.050)       .189
2   1978-1982    -.132 (.095)        .610 (.270)**
    1983-2007    -.065 (.074)        .050 (.047)
    2008-2012    -.971 (.160)***     .176 (.084)**     .161
3   1978-1982    -.127 (.097)        .654 (.275)**
    1983-2002    -.030 (.075)        .064 (.054)
    2003-2007     .302 (.155)*       .229 (.066)***
    2008-2012    -.438 (.202)**      .174 (.080)**     .155
4   1978-1982    -.114 (.103)       1.170 (.294)***
    1983-1997     .074 (.072)        .492 (.096)***
    1998-2002    -.232 (.122)*       .161 (.046)***
    2003-2007     .266 (.171)        .260 (.057)***
    2008-2012    -.441 (.230)*       .192 (.084)**     .142
5   1978-1982    -.109 (.115)        .625 (.287)**
    1983-1987     .221 (.102)**     -.252 (.251)
    1988-1997     .072 (.080)        .468 (.097)***
    1998-2002    -.247 (.116)**      .157 (.047)***
    2003-2007     .251 (.171)        .256 (.056)***
    2008-2012    -.460 (.224)**      .194 (.083)**     .153
6   1978-1982    -.107 (.117)        .653 (.285)**
    1983-1987     .228 (.101)**     -.222 (.240)
    1988-1992     .094 (.104)        .503 (.125)***
    1993-1997     .088 (.094)        .461 (.101)***
    1998-2002    -.240 (.115)**      .158 (.047)***
    2003-2007     .257 (.167)        .258 (.057)***
    2008-2012    -.453 (.220)**      .193 (.083)**     .176

Note: Numbers in parentheses are standard errors. *** denotes statistical significance at the 1% level, ** at the 5% level, and * at the 10% level. For each m, regimes are delimited by the estimated break dates reported in the text; the BIC is reported once per model.

There are several interesting topics for further research. First, we do not allow cross sectional dependence in our models. Given the large literature on cross sectional dependence, it is interesting to
extend our methodology to panel data models with cross sectional dependence. Second, if we model the
cross sectional dependence through a factor structure, the factor loadings may also exhibit structural
changes over time (see, e.g., Breitung and Eickmeier (2011) and Cheng et al. (2014)) and this further
complicates the analysis. Third, we consider common breaks in homogeneous panel data models. It is also interesting to consider heterogeneous panel data models and to allow the break dates to differ across individuals. We leave these topics for future research.
APPENDIX
A Proof of the results in Section 3
Let $V_{1NT}(\beta) \equiv \frac{1}{N}\sum_{i=1}^{N}\sum_{t=2}^{T}\left(\Delta y_{it} - \beta_t' x_{it} + \beta_{t-1}' x_{i,t-1}\right)^2$. Let $\beta_t = \beta_t^0 + N^{-1/2}b_t$ for $t = 1,\ldots,T$ with $b \equiv (b_1',\ldots,b_T')'$ satisfying $T^{-1/2}\|b\| = L$. Note that $\beta = \beta^0 + N^{-1/2}b$. We first prove a technical lemma.

Lemma A.1 Suppose Assumption A.1 holds. Then $\dot\beta_t - \beta_t^0 = O_P(N^{-1/2})$ for each $t = 1,2,\ldots,T$, where $\dot\beta = \arg\min_{\beta}V_{1NT}(\beta)$.
Proof. Let $\dot b_t = N^{1/2}(\dot\beta_t - \beta_t^0)$ and $\dot b = (\dot b_1',\ldots,\dot b_T')'$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}(x_{it}'b_t - x_{i,t-1}'b_{t-1})$, we have
\begin{align*}
N\left[V_{1NT}(\beta) - V_{1NT}(\beta^0)\right]
&= \sum_{i=1}^{N}\sum_{t=2}^{T}\left\{\left[\Delta u_{it} - N^{-1/2}(x_{it}'b_t - x_{i,t-1}'b_{t-1})\right]^2 - (\Delta u_{it})^2\right\}\\
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=2}^{T}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)^2 - \frac{2}{N^{1/2}}\sum_{i=1}^{N}\sum_{t=2}^{T}\Delta u_{it}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)\\
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u \equiv A_1(b) - 2A_2(b), \text{ say},
\end{align*}
where $Q_{NT}$ and $R_{NT}^u$ are defined in (2.4) and (2.5), respectively. Under Assumption A.1(iii), w.p.a.1
$$\lambda_{\min}(Q_{NT}) = \min_{\|\kappa\|=1}\left\{\kappa'Q_0\kappa + \kappa'(Q_{NT}-Q_0)\kappa\right\} \ge \lambda_{\min}(Q_0) - \left\|Q_{NT}-Q_0\right\|_{sp} \ge c_{Q_0}/2.$$
Under Assumptions A.1(i)-(ii), $T^{-1/2}\|\sqrt{N}R_{NT}^u\| = O_P(1)$ by the Chebyshev inequality. It follows that w.p.a.1
$$T^{-1}\left[A_1(b) - 2A_2(b)\right] \ge (c_{Q_0}/2)\,T^{-1}\|b\|^2 - T^{-1/2}\|b\|\,O_P(1) > 0$$
if $T^{-1/2}\|b\| = L$ is sufficiently large, in which case the quadratic term $A_1(b)$ dominates the linear term $A_2(b)$. Consequently, $N[V_{1NT}(\beta) - V_{1NT}(\beta^0)] > 0$ w.p.a.1 if $T^{-1/2}\|b\| = L$ is large, and $V_{1NT}(\beta)$ cannot be minimized in this case. This further implies that $T^{-1/2}\|\dot b\|$ must be stochastically bounded.

When $T$ is fixed, the above result also implies that $\dot b_t$ is stochastically bounded for each $t = 1,\ldots,T$. We now consider the case of large $T$. Let $L$ denote the block lower triangular part of the symmetric block tridiagonal matrix $Q_{NT}$. By Meurant (1995), $Q_{NT}$ can be factorized as $Q_{NT} = (\Delta + L)\Delta^{-1}(\Delta + L')$, where $\Delta = \mathrm{diag}(\Delta_1,\ldots,\Delta_T)$ is a block diagonal matrix, $\Delta_1 = \phi_{xx,1}$, $\Delta_t = 2\phi_{xx,t} - \phi_{xx,t,t-1}(\Delta_{t-1})^{-1}\phi_{xx,t,t-1}'$ for $t = 2,\ldots,T-1$, and $\Delta_T = \phi_{xx,T} - \phi_{xx,T,T-1}(\Delta_{T-1})^{-1}\phi_{xx,T,T-1}'$. Let $b^{\dagger} = (\Delta + L')b = (b_1^{\dagger\prime},\ldots,b_T^{\dagger\prime})'$ and $R_{NT}^{\dagger} = \sqrt{N}(\Delta + L)^{-1}R_{NT}^u = (R_{1NT}^{\dagger\prime},\ldots,R_{TNT}^{\dagger\prime})'$, where the $b_t^{\dagger}$'s and $R_{tNT}^{\dagger}$'s are all $p\times 1$ vectors. In addition, $R_{tNT}^{\dagger} = O_P(1)$ for each $t = 1,\ldots,T$ under Assumption A.1. Then
$$N\left[V_{1NT}(\beta) - V_{1NT}(\beta^0)\right] = \sum_{t=1}^{T}\left[b_t^{\dagger\prime}\Delta_t^{-1}b_t^{\dagger} - 2b_t^{\dagger\prime}R_{tNT}^{\dagger}\right] \equiv V_{1NT}^{\dagger}(b^{\dagger}), \text{ say}.$$

Let $\dot\beta \equiv (\dot\beta_1',\ldots,\dot\beta_T')'$ and $\dot b^{\dagger} \equiv (\Delta + L')\dot b \equiv (\dot b_1^{\dagger\prime},\ldots,\dot b_T^{\dagger\prime})'$. In view of the fact that $\dot\beta$ minimizes $V_{1NT}(\beta)$, we have
$$0 \ge N\left[V_{1NT}(\dot\beta) - V_{1NT}(\beta^0)\right] = \sum_{t=1}^{T}\left[\dot b_t^{\dagger\prime}\Delta_t^{-1}\dot b_t^{\dagger} - 2\dot b_t^{\dagger\prime}R_{tNT}^{\dagger}\right].$$
The last result implies that $\dot b_t^{\dagger} = O_P(1)$ for each $t$; otherwise $\dot b^{\dagger}$ cannot minimize $V_{1NT}^{\dagger}(b^{\dagger})$, which further implies that $\dot\beta$ cannot minimize $V_{1NT}(\beta)$.

To finish the proof, we still need to show that $\dot b_t = O_P(1)$ for each $t$ based on the result that $\dot b_t^{\dagger} = O_P(1)$ for each $t$. Noting that $\Delta + L'$ is a nonsingular upper block triangular matrix, we can apply the fact that the inverse of a nonsingular upper block triangular matrix is also upper block triangular (see, e.g., Harville (1997, p. 95)) and write $(\Delta + L')^{-1} = \{\omega_{ts}\}_{t,s=1}^{T}$, where the $\omega_{ts}$ are $p\times p$ matrices that are $O_P(1)$ for $s \ge t$ and zero otherwise. The exact formula for $\omega_{ts}$ in terms of elements of $\Delta$ and $L'$ can be obtained recursively from Harville (1997, p. 95). Thus $\dot b_t = \sum_{s=t}^{T}\omega_{ts}\dot b_s^{\dagger}$, and $\dot b_t = O_P(1)$ for any $t \ge T - r$, where $r$ is a finite integer that does not depend on $T$. Now, suppose that $\dot b_{\tau} \ne O_P(1)$ for some $1 \le \tau < T - r$. By the relationship between $\dot b^{\dagger}$ and $\dot b$, we have
$$\dot b_{\tau}^{\dagger} = \Delta_{\tau}\dot b_{\tau} + \phi_{xx,\tau+1,\tau}'\dot b_{\tau+1},$$
or, equivalently,
$$\Delta_{\tau}^{-1}\dot b_{\tau}^{\dagger} = \dot b_{\tau} + \Delta_{\tau}^{-1}\phi_{xx,\tau+1,\tau}'\dot b_{\tau+1}.$$
Since $\Delta_{\tau}^{-1} = O_P(1)$, $\dot b_{\tau}^{\dagger} = O_P(1)$, $\phi_{xx,\tau+1,\tau}' = O_P(1)$, and $\dot b_{\tau} \ne O_P(1)$, in order for the above equality to hold we must have $\dot b_{\tau+1} \ne O_P(1)$. Deducing recursively, we must have $\dot b_{T-r} \ne O_P(1)$, a contradiction. It follows that $\dot b_t = N^{1/2}(\dot\beta_t - \beta_t^0) = O_P(1)$ for each $t$.
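The factorization $Q_{NT} = (\Delta + L)\Delta^{-1}(\Delta + L')$ of a symmetric block tridiagonal matrix used above can be checked numerically. Below is a minimal sketch with scalar blocks ($p = 1$) for a generic symmetric tridiagonal matrix; the function name is ours.

```python
import numpy as np

# Numerical check (scalar blocks, p = 1) of the block factorization used in
# the proof: Q = (D + L) D^{-1} (D + L'), where L is the strictly lower part
# of a symmetric tridiagonal Q and D = diag(D_1, ..., D_T) is built by the
# recursion D_1 = q_11, D_t = q_tt - q_{t,t-1}^2 / D_{t-1}.
def tridiag_factor(Q):
    T = Q.shape[0]
    D = np.zeros(T)
    D[0] = Q[0, 0]
    for t in range(1, T):
        D[t] = Q[t, t] - Q[t, t - 1] ** 2 / D[t - 1]
    L = np.tril(Q, k=-1)  # strictly lower triangular part
    return D, L
```

Reassembling $(D + L)D^{-1}(D + L')$ recovers $Q$ exactly, since the off-diagonal cross terms $L D^{-1} L'$ contribute precisely $q_{t,t-1}^2/D_{t-1}$ on the diagonal.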
Proof of Theorem 3.1. (i) Let $\hat b_t = N^{1/2}(\hat\beta_t - \beta_t^0)$ and $\hat b = N^{1/2}(\hat\beta - \beta^0)$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}(x_{it}'b_t - x_{i,t-1}'b_{t-1})$, we have
\begin{align*}
N\left[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)\right]
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=2}^{T}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)^2 - \frac{2}{N^{1/2}}\sum_{i=1}^{N}\sum_{t=2}^{T}\Delta u_{it}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)\\
&\quad + N\lambda_1\sum_{t=2}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}(b_t - b_{t-1})\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u + N\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}(b_t - b_{t-1})\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&\quad + N\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^{0c}}^{T}\dot w_t\left\|N^{-1/2}(b_t - b_{t-1})\right\|\\
&\equiv A_1(b) - 2A_2(b) + A_3(b) + A_4(b), \text{ say},
\end{align*}
where $Q_{NT}$ and $R_{NT}^u$ are defined in (2.4) and (2.5), respectively. By Lemma A.1 and Assumption A.2(i),
$$\max_{t\in\mathcal{T}_{m^0}^0}\dot w_t = \max_{t\in\mathcal{T}_{m^0}^0}\left\|\dot\beta_t - \dot\beta_{t-1}\right\|^{-\kappa_1} = \max_{t\in\mathcal{T}_{m^0}^0}\left\|\beta_t^0 - \beta_{t-1}^0 + O_P(N^{-1/2})\right\|^{-\kappa_1} = O_P\left(J_{\min}^{-\kappa_1}\right).$$
By the triangle and Jensen inequalities, the fact that $m^0$ is a fixed finite integer, and Assumption A.2(ii),
\begin{align}
\left|T^{-1}A_3(b)\right| &\le m^0 T^{-1}N^{1/2}\lambda_1\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\,\frac{1}{m^0}\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\|b_t - b_{t-1}\|\nonumber\\
&\le m^0 T^{-1}N^{1/2}\lambda_1\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\left\{\frac{1}{m^0}\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\|b_t - b_{t-1}\|^2\right\}^{1/2}\nonumber\\
&\le (2m^0)^{1/2}T^{-1/2}N^{1/2}\lambda_1\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\,T^{-1/2}\|b\|\nonumber\\
&= O_P\left(N^{1/2}\lambda_1 T^{-1/2}J_{\min}^{-\kappa_1}\right)T^{-1/2}\|b\| = O_P(1)\,T^{-1/2}\|b\|.\tag{A.1}
\end{align}
In conjunction with the analyses of $A_1(b)$ and $A_2(b)$ in the proof of Lemma A.1, this implies that w.p.a.1
$$T^{-1}\left[A_1(b) - 2A_2(b) + A_3(b)\right] \ge \lambda_{\min}(Q_{NT})\,T^{-1}\|b\|^2 - O_P(1)\,T^{-1/2}\|b\| > 0$$
if $T^{-1/2}\|b\| = L$ is sufficiently large. That is, $A_1(b)$ dominates $-2A_2(b) + A_3(b)$ for large $L$. In addition, $A_4(b) \ge 0$. Consequently, $N[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)] > 0$ w.p.a.1 for large $L$ and $V_{1NT,\lambda_1}(\beta)$ cannot be minimized in this case. This further implies that $T^{-1/2}\|\hat b\|$ has to be stochastically bounded, and Theorem 3.1(i) holds.

(ii) The result follows from (i) in the case of fixed $T$, so we consider the case of large $T$. Let $L$, $\Delta$, $b^{\dagger}$, $R_{NT}^{\dagger}$, and $\{\omega_{ts}\}_{t,s=1}^{T}$ be as defined in the proof of Lemma A.1. Let $\omega_{ts}^{\dagger} = \omega_{ts} - \omega_{t-1,s}$. Then $b_t - b_{t-1} = \sum_{s=t}^{T}\omega_{ts}b_s^{\dagger} - \sum_{s=t-1}^{T}\omega_{t-1,s}b_s^{\dagger} = \sum_{s=t-1}^{T}\omega_{ts}^{\dagger}b_s^{\dagger}$, as $\omega_{ts} = 0$ for $s = t-1$. So we can rewrite $N[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)]$ in terms of $b^{\dagger}$:
\begin{align*}
N\left[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)\right]
&= \sum_{t=1}^{T}\left[b_t^{\dagger\prime}\Delta_t^{-1}b_t^{\dagger} - 2b_t^{\dagger\prime}R_{tNT}^{\dagger}\right]
+ N\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}\sum_{s=t-1}^{T}\omega_{ts}^{\dagger}b_s^{\dagger}\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&\quad + N^{1/2}\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^{0c}}^{T}\dot w_t\left\|\sum_{s=t-1}^{T}\omega_{ts}^{\dagger}b_s^{\dagger}\right\| \equiv NV_{1NT,\lambda_1}^{\dagger}(b^{\dagger}), \text{ say}.
\end{align*}
Noting that $\|R_{tNT}^{\dagger}\| = O_P(1)$, $N^{1/2}\lambda_1 m^0\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s = O_P(N^{1/2}\lambda_1 J_{\min}^{-\kappa_1}) = O_P(1)$, and $\max_{s\in\mathcal{T}_{m^0}^0}\|\omega_{st}^{\dagger}\| = O_P(1)$, we have by the triangle and Jensen inequalities (as in the derivation of (A.1))
\begin{align*}
0 \ge NV_{1NT,\lambda_1}^{\dagger}(\hat b^{\dagger})
&\ge \sum_{t=1}^{T}\left[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - 2\hat b_t^{\dagger\prime}R_{tNT}^{\dagger}\right] - N^{1/2}\lambda_1 m^0\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\max_{s\in\mathcal{T}_{m^0}^0}\left\|\sum_{t=s-1}^{T}\omega_{st}^{\dagger}\hat b_t^{\dagger}\right\|\\
&\ge \sum_{t=1}^{T}\left[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - \left(2\|R_{tNT}^{\dagger}\| + N^{1/2}\lambda_1(m^0)^{3/2}\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\max_{s\in\mathcal{T}_{m^0}^0}\|\omega_{st}^{\dagger}\|\right)\|\hat b_t^{\dagger}\|\right]\\
&= \sum_{t=1}^{T}\left[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - O_P(1)\|\hat b_t^{\dagger}\|\right].
\end{align*}
It follows that $0 \ge N[V_{1NT,\lambda_1}^{\dagger}(\hat b^{\dagger}) - V_{1NT,\lambda_1}^{\dagger}(0_{Tp\times 1})] \ge \sum_{t=1}^{T}[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - O_P(1)\|\hat b_t^{\dagger}\|]$ and $\hat b_t^{\dagger} = O_P(1)$ for each $t$; otherwise $\hat b^{\dagger}$ cannot minimize $V_{1NT,\lambda_1}^{\dagger}(b^{\dagger})$. This implies that $\hat b_t = N^{1/2}(\hat\beta_t - \beta_t^0) = O_P(1)$ by the same arguments as used in the proof of Lemma A.1.
Proof of Theorem 3.2. We want to demonstrate that
$$P\left(\|\hat\theta_t\| = 0 \text{ for all } t\in\mathcal{T}_{m^0}^{0c}\right) \to 1 \text{ as } N\to\infty.\tag{A.2}$$
Suppose to the contrary that $\hat\theta_t = \hat\beta_t - \hat\beta_{t-1} \ne 0$ for some $t\in\mathcal{T}_{m^0}^{0c}$ for sufficiently large $N$ or $(N,T)$. Then there exists $r\in\{1,\ldots,p\}$ such that $|\hat\theta_{t,r}| = \max\{|\hat\theta_{t,l}|,\ l = 1,\ldots,p\}$, where for any $p\times 1$ vector $a_t$, $a_{t,l}$ denotes its $l$th element. Without loss of generality (wlog) assume that $r = p$, implying that $|\hat\theta_{t,p}|/\|\hat\theta_t\| \ge 1/\sqrt{p}$. To consider the first order condition (FOC) with respect to (wrt) $\beta_t$, $t \ge 2$, based on subdifferential calculus (e.g., Bertsekas (1995, Appendix B.5)), we distinguish two cases: (a) $2 \le t \le T-1$, and (b) $t = T$ and $T\in\mathcal{T}_{m^0}^{0c}$.

In case (a), we consider two subcases: (a1) $t+1 = T_j^0\in\mathcal{T}_{m^0}^0$ for some $j = 1,\ldots,m^0$, and (a2) $t+1\in\mathcal{T}_{m^0}^{0c}$. In either case, we can apply the FOC wrt $\beta_{t,p}$ and the equality $\Delta y_{it} = \beta_t^{0\prime}x_{it} - \beta_{t-1}^{0\prime}x_{i,t-1} + \Delta u_{it}$ to obtain
\begin{align}
0 &= \frac{-2}{\sqrt{N}}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\beta_t'x_{it} + \hat\beta_{t-1}'x_{i,t-1}\right)x_{it,p} + \frac{2}{\sqrt{N}}\sum_{i=1}^{N}\left(\Delta y_{i,t+1} - \hat\beta_{t+1}'x_{i,t+1} + \hat\beta_t'x_{it}\right)x_{it,p}\nonumber\\
&\quad + \sqrt{N}\lambda_1\dot w_t\frac{\hat\theta_{t,p}}{\|\hat\theta_t\|} - \sqrt{N}\lambda_1\dot w_{t+1}\hat e_{t+1,p}\tag{A.3}\\
&= -\frac{2}{\sqrt{N}}\sum_{i=1}^{N}\left[(\hat\beta_{t+1} - \beta_{t+1}^0)'x_{i,t+1} - 2(\hat\beta_t - \beta_t^0)'x_{it} + (\hat\beta_{t-1} - \beta_{t-1}^0)'x_{i,t-1}\right]x_{it,p}\nonumber\\
&\quad + \frac{2}{\sqrt{N}}\sum_{i=1}^{N}\Delta^2 u_{i,t+1}x_{it,p} + \sqrt{N}\lambda_1\dot w_t\frac{\hat\theta_{t,p}}{\|\hat\theta_t\|} - \sqrt{N}\lambda_1\dot w_{t+1}\hat e_{t+1,p}\nonumber\\
&\equiv B_{1t} + B_{2t} + B_{3t} - B_{4t}, \text{ say},\nonumber
\end{align}
where $\hat e_{t+1} = \hat\theta_{t+1}/\|\hat\theta_{t+1}\|$ if $\|\hat\theta_{t+1}\| \ne 0$ and $\|\hat e_{t+1}\| \le 1$ otherwise, and $\hat e_{t+1,p}$ is the $p$th element of $\hat e_{t+1}$. By Assumptions A.1(i)-(ii) and Theorem 3.1, $B_{1t} = O_P(1)$ and $B_{2t} = O_P(1)$. In view of the fact that $\dot w_t^{-1} = O_P(N^{-\kappa_1/2})$ for $t\in\mathcal{T}_{m^0}^{0c}$, $|B_{3t}| \ge \sqrt{N}\lambda_1\dot w_t/\sqrt{p}$, which is explosive in probability under Assumption A.2(iii) (i.e., $N^{(\kappa_1+1)/2}\lambda_1\to\infty$).

To bound the probability order of $B_{4t}$, we distinguish two subcases. In subcase (a1), noting that $\dot\beta_{t+1} - \dot\beta_t\stackrel{P}{\to}\theta_{t+1}^0 \ne 0$ by Theorem 3.1, we have $\dot w_{t+1} = \|\theta_{t+1}^0 + O_P(N^{-1/2})\|^{-\kappa_1} = O_P(J_{\min}^{-\kappa_1})$ and $B_{4t} = \sqrt{N}\lambda_1\dot w_{t+1}\hat e_{t+1,p} = O_P(\sqrt{N}\lambda_1 J_{\min}^{-\kappa_1}) = O_P(1)$. Consequently, $|B_{3t}| \gg |B_{1t} + B_{2t} - B_{4t}|$ so that (A.3) cannot hold for sufficiently large $N$ or $(N,T)$. We then conclude that w.p.a.1 $\hat\theta_t$ must lie at a point where $\|\hat\theta_t\|$ is not differentiable in subcase (a1). In addition, a direct implication of this result is that if $t = T_j^0 - 1\in\mathcal{T}_{m^0}^{0c}$ for some $j = 1,\ldots,m^0$, then $P(\|\hat\theta_{T_j^0-1}\| = 0)\to 1$ as $N\to\infty$ and $\sqrt{N}\lambda_1\dot w_{T_j^0-1}\hat e_{T_j^0-1} = O_P(1)$ in order for the FOC to hold for $t = T_j^0 - 1$.

In subcase (a2), a difficulty arises as $\dot w_{t+1} = O_P(N^{\kappa_1/2})$ and $\sqrt{N}\lambda_1\dot w_{t+1} = O_P(N^{(1+\kappa_1)/2}\lambda_1)$. But we can apply the implication of the result in subcase (a1) recursively. When $t = T_j^0 - 2\in\mathcal{T}_{m^0}^{0c}$ for some $j = 1,\ldots,m^0$, $B_{4t} = \sqrt{N}\lambda_1\dot w_{T_j^0-1}\hat e_{T_j^0-1,p} = O_P(1)$ and $|B_{3t}| \gg |B_{1t} + B_{2t} - B_{4t}|$. Thus (A.3) cannot hold for $t = T_j^0 - 2\in\mathcal{T}_{m^0}^{0c}$ either, and we must have $P(\|\hat\theta_{T_j^0-2}\| = 0)\to 1$ as $N\to\infty$ and $\sqrt{N}\lambda_1\dot w_{T_j^0-2}\hat e_{T_j^0-2} = O_P(1)$ in order for the FOC to hold for $t = T_j^0 - 2$. Deducing in this way, we proceed until we reach $t = T_{j-1}^0 + 1\in\mathcal{T}_{m^0}^{0c}$. Consequently, $\hat\theta_t$ must lie at a point where $\|\hat\theta_t\|$ is not differentiable for all $t\in\mathcal{T}_{m^0}^{0c}$ with $t \ne T$.

In case (b), noting that only one term in the penalty $\lambda_1\sum_{t=2}^{T}\dot w_t\|\beta_t - \beta_{t-1}\|$ involves $\beta_T$, it is easy to show that $\hat\theta_T = \hat\beta_T - \hat\beta_{T-1}$ must lie at a point where $\|\hat\theta_T\|$ is not differentiable if $T\in\mathcal{T}_{m^0}^{0c}$. Consequently, (A.2) follows.
Proof of Corollary 3.3. We consider two cases: (a) $t\in\mathcal{T}_{m^0}^{0c}$, and (b) $t\in\mathcal{T}_{m^0}^0$. In case (a), Theorem 3.2 implies that asymptotically no time point in $\mathcal{T}_{m^0}^{0c}$ can be identified as an estimated break date, so that $\hat m \le m^0$. In case (b), we want to show that all break points in $\mathcal{T}_{m^0}^0$ must be identified as estimated break points. Suppose not. Then there exists $t\in\mathcal{T}_{m^0}^0$ such that $\|\hat\theta_t\| = 0$. By the $\sqrt{N}$-consistency of $\hat\theta_t$ and the fact that $\hat\theta_t = \hat\beta_t - \hat\beta_{t-1} = \beta_t^0 - \beta_{t-1}^0 + O_P(N^{-1/2}) = \theta_t^0 + O_P(N^{-1/2})$ by Theorem 3.1, we have $\|\theta_t^0\| = O(N^{-1/2})$, which contradicts the assumption that $N^{1/2}J_{\min}\to\infty$ as $N\to\infty$, since $\|\theta_t^0\| \ge J_{\min}$ for any $t\in\mathcal{T}_{m^0}^0$.
Proof of Theorem 3.4. Note that $\hat\alpha_{\hat m}^p(\hat{\mathcal{T}}_{\hat m}) = (\hat\alpha_1^p(\hat{\mathcal{T}}_{\hat m})',\ldots,\hat\alpha_{\hat m+1}^p(\hat{\mathcal{T}}_{\hat m})')' = \arg\min_{\alpha_{\hat m}}Q_{1NT}(\alpha_{\hat m};\hat{\mathcal{T}}_{\hat m})$. The FOCs for this minimization problem are
\begin{align*}
0_{p\times 1} &= \frac{-2}{N}\sum_{t=2}^{\hat T_1-1}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_1^{p\prime}\Delta x_{it}\right)\Delta x_{it} + \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_1} - \hat\alpha_2^{p\prime}x_{i\hat T_1} + \hat\alpha_1^{p\prime}x_{i,\hat T_1-1}\right)x_{i,\hat T_1-1},\\
0_{p\times 1} &= \frac{-2}{N}\sum_{t=\hat T_{j-1}+1}^{\hat T_j-1}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_j^{p\prime}\Delta x_{it}\right)\Delta x_{it} + \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_j} - \hat\alpha_{j+1}^{p\prime}x_{i\hat T_j} + \hat\alpha_j^{p\prime}x_{i,\hat T_j-1}\right)x_{i,\hat T_j-1}\\
&\quad - \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_{j-1}} - \hat\alpha_j^{p\prime}x_{i\hat T_{j-1}} + \hat\alpha_{j-1}^{p\prime}x_{i,\hat T_{j-1}-1}\right)x_{i\hat T_{j-1}} \quad\text{for } j = 2,\ldots,\hat m, \text{ and}\\
0_{p\times 1} &= \frac{-2}{N}\sum_{t=\hat T_{\hat m}+1}^{T}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_{\hat m+1}^{p\prime}\Delta x_{it}\right)\Delta x_{it} - \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_{\hat m}} - \hat\alpha_{\hat m+1}^{p\prime}x_{i\hat T_{\hat m}} + \hat\alpha_{\hat m}^{p\prime}x_{i,\hat T_{\hat m}-1}\right)x_{i\hat T_{\hat m}},
\end{align*}
where we suppress the dependence of the $\hat\alpha_j^p$ on $\hat{\mathcal{T}}_{\hat m}$. Let $\hat\Phi_{ab,l} = \frac{1}{N}\sum_{t=\hat T_{l-1}+1}^{\hat T_l-1}\sum_{i=1}^{N}a_{it}b_{it}'$ for $l = 1,\ldots,\hat m+1$ and $a, b = \Delta x, x$, or $\Delta y$. One can readily solve for $\hat\alpha_{\hat m}^p$ to obtain $\hat\alpha_{\hat m}^p = \hat\Phi_{NT}^{-1}\hat\Psi_{NT}^y$, where
$$\hat\Phi_{NT} = \mathrm{TriD}\left(\hat\Phi^{\dagger},\hat\Phi\right)_{\hat m+1},\tag{A.4}$$
$$\hat\Psi_{NT}^y = \left(\hat\Phi_{\Delta x\Delta y,1}' - \phi_{x\Delta y,\hat T_1-1,\hat T_1}',\ \hat\Phi_{\Delta x\Delta y,2}' - \phi_{x\Delta y,\hat T_2-1,\hat T_2}' + \phi_{x\Delta y,\hat T_1}',\ \ldots,\ \hat\Phi_{\Delta x\Delta y,\hat m}' - \phi_{x\Delta y,\hat T_{\hat m}-1,\hat T_{\hat m}}' + \phi_{x\Delta y,\hat T_{\hat m-1}}',\ \hat\Phi_{\Delta x\Delta y,\hat m+1}' + \phi_{x\Delta y,\hat T_{\hat m}}'\right)',\tag{A.5}$$
$\hat\Phi_1 = \hat\Phi_{\Delta x\Delta x,1} + \phi_{xx,\hat T_1-1}$, $\hat\Phi_l = \hat\Phi_{\Delta x\Delta x,l} + \phi_{xx,\hat T_l-1} + \phi_{xx,\hat T_{l-1}}$ for $l = 2,\ldots,\hat m$, $\hat\Phi_{\hat m+1} = \hat\Phi_{\Delta x\Delta x,\hat m+1} + \phi_{xx,\hat T_{\hat m}}$, and $\hat\Phi_{l+1}^{\dagger} = \phi_{xx,\hat T_l,\hat T_l-1}$ for $l = 1,\ldots,\hat m$.

By Corollary 3.3, $\hat\alpha_{\hat m}^p(\hat{\mathcal{T}}_{\hat m}) = \hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0)$ w.p.a.1. Therefore we can study the asymptotic distribution of $\hat\alpha_{\hat m}(\hat{\mathcal{T}}_{\hat m})$ by studying that of $\hat\alpha_{m^0}(\mathcal{T}_{m^0}^0)$. Note that $\hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0) = \Phi_{NT}^{-1}\Psi_{NT}^y$, where $\Phi_{NT}$ and $\Psi_{NT}^y$ are defined in (3.1) and (3.2), respectively. It is easy to verify that
$$\sqrt{N}D_{m^0+1}\left(\hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0) - \alpha^0\right) = \left(D_{m^0+1}^{-1}\Phi_{NT}D_{m^0+1}^{-1}\right)^{-1}\sqrt{N}D_{m^0+1}^{-1}\Psi_{NT}^u,$$
where $\Psi_{NT}^u$ is defined in (3.2). By Assumption A.3(i), $D_{m^0+1}^{-1}\Phi_{NT}D_{m^0+1}^{-1}\stackrel{P}{\to}\Phi_0 > 0$. By Assumption A.3(ii), $\sqrt{N}D_{m^0+1}^{-1}\Psi_{NT}^u\stackrel{D}{\to}N(0,\Omega_0)$. Then by the Slutsky lemma, $\sqrt{N}D_{m^0+1}(\hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0) - \alpha^0)\stackrel{D}{\to}N(0,\Phi_0^{-1}\Omega_0\Phi_0^{-1})$. This completes the proof of the theorem.
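To fix ideas, here is a deliberately simplified sketch of post-Lasso estimation given break dates: pooled OLS of $\Delta y$ on $\Delta x$ within each regime, dropping the break-date observations themselves (which, in the estimator above, couple adjacent regimes through the off-diagonal blocks of $\hat\Phi_{NT}$). All names are ours; this is an illustration, not the paper's exact estimator.

```python
import numpy as np

# Simplified sketch of post-Lasso estimation given break dates: within each
# regime, pooled OLS of Delta y on Delta x over periods t = 2, ..., T,
# excluding the break-date observations. Illustration only -- the paper's
# estimator also uses the break-date observations, which link regimes.
def post_lasso_ols(dy, dx, breaks, T):
    """dy: (N, T-1); dx: (N, T-1, p); columns index t = 2..T; breaks sorted."""
    edges = [2] + list(breaks) + [T + 1]
    alphas = []
    for j in range(len(edges) - 1):
        # regime j covers t in [edges[j], edges[j+1] - 1], minus break dates
        ts = [t for t in range(edges[j], edges[j + 1]) if t not in breaks]
        X = dx[:, [t - 2 for t in ts], :].reshape(-1, dx.shape[2])
        y = dy[:, [t - 2 for t in ts]].reshape(-1)
        alphas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return alphas
```

With noiseless data generated from two regimes, the sketch recovers the regime-specific slopes exactly.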
Proof of Theorem 3.5. Recall that $\hat\alpha_{m_{\lambda_1}}^p(\hat{\mathcal{T}}_{m_{\lambda_1}}) = (\hat\alpha_1^p(\hat{\mathcal{T}}_{m_{\lambda_1}})',\ldots,\hat\alpha_{m_{\lambda_1}+1}^p(\hat{\mathcal{T}}_{m_{\lambda_1}})')'$ denotes the set of post-Lasso OLS estimates of the regression coefficients based on the break dates in $\hat{\mathcal{T}}_{m_{\lambda_1}} = \{\hat T_1(\lambda_1),\ldots,\hat T_{m_{\lambda_1}}(\lambda_1)\}$, where we make the dependence of the various estimates on $\lambda_1$ explicit. Let $\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 \equiv \frac{1}{T-1}Q_{1NT}(\hat\alpha_{m_{\lambda_1}}^p(\hat{\mathcal{T}}_{m_{\lambda_1}});\hat{\mathcal{T}}_{m_{\lambda_1}})$. For any $\lambda_{1NT}^0\in\Omega_0$, we have $\lim_{N\to\infty}P(m_{\lambda_{1NT}^0} = m^0) = 1$ and $\lim_{N\to\infty}P(\hat T_j(\lambda_{1NT}^0) = T_j^0,\ j = 1,\ldots,m^0) = 1$ by Corollary 3.3, as $\lambda_{1NT}^0$ also satisfies Assumptions A.2(ii)-(iii). It follows that w.p.a.1 $\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 = \hat\sigma_{\mathcal{T}_{m^0}}^2$. Using the $\sqrt{NI_j^0}$-consistency of $\hat\alpha_j^p(\mathcal{T}_{m^0})$ and the expressions $\Delta y_{it} = \alpha_j^{0\prime}\Delta x_{it} + \Delta u_{it}$ if $t\in[T_{j-1}^0+1, T_j^0-1]$ and $\Delta y_{it} = \alpha_{j+1}^{0\prime}x_{it} - \alpha_j^{0\prime}x_{i,t-1} + \Delta u_{it}$ if $t = T_j^0$, we can readily show that
\begin{align*}
\hat\sigma_{\mathcal{T}_{m^0}}^2
&= \frac{1}{N(T-1)}\sum_{j=1}^{m^0+1}\sum_{t=T_{j-1}^0+1}^{T_j^0-1}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_{j,\mathcal{T}_{m^0}}'\Delta x_{it}\right)^2\\
&\quad + \frac{1}{N(T-1)}\sum_{j=1}^{m^0}\sum_{i=1}^{N}\left(\Delta y_{iT_j^0} - \hat\alpha_{j+1,\mathcal{T}_{m^0}}'x_{iT_j^0} + \hat\alpha_{j,\mathcal{T}_{m^0}}'x_{i,T_j^0-1}\right)^2\\
&= \hat\sigma_{NT}^2 + O_P\left[(NI_{\min})^{-1}\right],
\end{align*}
where $\hat\sigma_{NT}^2 \equiv \frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta u_{it}^2\stackrel{P}{\to}\sigma_0^2 \equiv \lim_{(N,T)\to\infty}\frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}E(\Delta u_{it}^2)$ under Assumptions A.1(i)-(ii). Then by Assumption A.5 and the Slutsky lemma, $IC(\lambda_{1NT}^0) = \hat\sigma_{\mathcal{T}_{m^0}}^2 + \rho_{1NT}\,p\,(m^0+1)\stackrel{P}{\to}\sigma_0^2$. We consider the cases of under- and over-fitted models separately.

Case 1: Under-fitted model ($m_{\lambda_1} < m^0$). By Lemma A.2 below, $\inf_{\lambda_1\in\Omega_-}\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 - \hat\sigma_{\mathcal{T}_{m^0}^0}^2 \ge c_0$, where $c_0 = \frac{I_{\min}J_{\min}^2}{T-1}[c + o_P(1)]$ for some $c > 0$. Then by Assumption A.5,
\begin{align*}
P\left(\inf_{\lambda_1\in\Omega_-}IC(\lambda_1) > IC(\lambda_{1NT}^0)\right)
&= P\left(\inf_{\lambda_1\in\Omega_-}\left[\left(\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right) + \rho_{1NT}\,p\,(m_{\lambda_1} - m^0)\right] > 0\right)\\
&\ge P\left(\frac{I_{\min}J_{\min}^2}{\rho_{1NT}(T-1)}\left[c + o_P(1)\right] + O_P(1) > 0\right)\to 1.
\end{align*}

Case 2: Over-fitted model ($m_{\lambda_1} > m^0$). Let $\mathbb{T}_m \equiv \{\mathcal{T}_m = \{T_1,\ldots,T_m\} : 2 \le T_1 < \cdots < T_m \le T\}$. Given $\mathcal{T}_m = \{T_1,\ldots,T_m\}\in\mathbb{T}_m$, let $\bar{\mathcal{T}}_{m^*+m^0} = \{\bar T_1,\bar T_2,\ldots,\bar T_{m^*+m^0}\}$ denote the union of $\mathcal{T}_m$ and $\mathcal{T}_{m^0}^0$ with elements ordered in non-descending order: $2 \le \bar T_1 < \bar T_2 < \cdots < \bar T_{m^*+m^0} \le T$ for some $m^*\in\{0,1,\ldots,m\}$. Let $\hat\alpha_m^p(\mathcal{T}_m) \equiv (\hat\alpha_1^p(\mathcal{T}_m)',\ldots,\hat\alpha_{m+1}^p(\mathcal{T}_m)')' = \arg\min_{\alpha_m}Q_{1NT}(\alpha_m;\mathcal{T}_m)$ and $\hat\sigma_{\mathcal{T}_m}^2 \equiv \frac{1}{T-1}Q_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m)$; $\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2$ is analogously defined. In view of the facts that $\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2 \le \hat\sigma_{\mathcal{T}_m}^2$ for all $\mathcal{T}_m\in\mathbb{T}_m$, that $N(\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2 - \hat\sigma_{NT}^2) = O_P(1)$ uniformly in $\mathcal{T}_m\in\mathbb{T}_m$ by Lemma A.3 below, and that $N\rho_{1NT}\to\infty$ by Assumption A.5, we have
\begin{align*}
P\left(\inf_{\lambda_1\in\Omega_+}IC(\lambda_1) > IC(\lambda_{1NT}^0)\right)
&\ge P\left(\min_{m^0<m\le m_{\max}}\inf_{\mathcal{T}_m\in\mathbb{T}_m}\left[N\left(\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right) + N\rho_{1NT}\,p\,(m - m^0)\right] > 0\right)\\
&\ge P\left(\min_{m^0<m\le m_{\max}}\inf_{\mathcal{T}_m\in\mathbb{T}_m}\left[N\left(\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right) + N\rho_{1NT}\,p\,(m - m^0)\right] > 0\right)\to 1 \text{ as } N\to\infty.
\end{align*}
Lemma A.2 Let $\mathbb{T}_m = \{\mathcal{T}_m = \{T_1,\ldots,T_m\} : 2 \le T_1 < \cdots < T_m \le T\}$, $T_0 = 1$, and $T_{m+1} = T+1$. Then $\min_{0\le m<m^0}\inf_{\mathcal{T}_m\in\mathbb{T}_m}\frac{T-1}{I_{\min}J_{\min}^2}\left(\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{\mathcal{T}_{m^0}^0}^2\right) \ge c + o_P(1)$ for some $c > 0$.
Proof. First, by standard results for least squares regressions, we can readily show that $\hat\sigma_{\mathcal{T}_{m^0}^0}^2 = \hat\sigma_{NT}^2 + O_P((NI_{\min})^{-1})$. In view of the facts that $\hat\alpha_m^p = \arg\min_{\alpha_m}D_{1NT}(\alpha_m;\mathcal{T}_m)$, where
\begin{align*}
D_{1NT}(\alpha_m;\mathcal{T}_m) &\equiv Q_{1NT}(\alpha_m;\mathcal{T}_m) - Q_{1NT}(\alpha_{m^0}^0;\mathcal{T}_{m^0}^0)\\
&= \frac{1}{N}\sum_{j=1}^{m+1}\sum_{t=T_{j-1}+1}^{T_j-1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \alpha_j'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&\quad + \frac{1}{N}\sum_{j=1}^{m}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_j} - \alpha_{j+1}'x_{iT_j} + \alpha_j'x_{i,T_j-1}\right)^2 - (\Delta u_{iT_j})^2\right],
\end{align*}
and $(T-1)(\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{NT}^2) = D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m)$, it suffices to show that $\frac{1}{I_{\min}J_{\min}^2}D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m) \ge c + o_P(1)$ uniformly in $\mathcal{T}_m\in\mathbb{T}_m$ for each $m\in[0, m^0-1]$. We consider three cases: (a) $m^0 = 1$, (b) $m^0 = 2$, and (c) $3 \le m^0 \le m_{\max}$.

In case (a), $m = 0$ and $\mathcal{T}_m = \mathcal{T}_0$ is empty, so that the post-Lasso estimate $\hat\alpha_m^p(\mathcal{T}_m) = \hat\alpha_0^p(\mathcal{T}_0) = \hat\alpha_1^p(\mathcal{T}_0)$ becomes the OLS estimate in the first-differenced model: $\hat\alpha_1^p(\mathcal{T}_0) = \left(\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'\right)^{-1}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta y_{it}$. Noting that
$$\Delta y_{it} = x_{it}'\beta_t^0 - x_{i,t-1}'\beta_{t-1}^0 + \Delta u_{it} =
\begin{cases}
(\Delta x_{it})'\alpha_1^0 + \Delta u_{it} & \text{if } 2 \le t \le T_1^0 - 1,\\
x_{it}'\alpha_2^0 - x_{i,t-1}'\alpha_1^0 + \Delta u_{it} & \text{if } t = T_1^0,\\
(\Delta x_{it})'\alpha_2^0 + \Delta u_{it} & \text{if } T_1^0 + 1 \le t \le T,
\end{cases}$$
we have
\begin{align*}
\hat\alpha_1^p(\mathcal{T}_0) &= M_{NT}^{-1}M_{1NT}\alpha_1^0 + M_{NT}^{-1}M_{2NT}\alpha_2^0 + M_{NT}^{-1}\frac{1}{N(T-1)}\sum_{i=1}^{N}\Delta x_{iT_1^0}\left(x_{iT_1^0}'\alpha_2^0 - x_{i,T_1^0-1}'\alpha_1^0\right)\\
&\quad + M_{NT}^{-1}\frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta u_{it}\\
&= M_{NT}^{-1}M_{1NT}\alpha_1^0 + M_{NT}^{-1}M_{2NT}\alpha_2^0 + M_{NT}^{-1}\frac{1}{T-1}\left(\phi_{\Delta xx,T_1^0}\alpha_2^0 - \phi_{\Delta xx,T_1^0,T_1^0-1}\alpha_1^0\right) + O_P\left((N(T-1))^{-1/2}\right),
\end{align*}
where $M_{NT} = \frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'$, $M_{1NT} = \frac{1}{N(T-1)}\sum_{t=2}^{T_1^0-1}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'$, $M_{2NT} = \frac{1}{N(T-1)}\sum_{t=T_1^0+1}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'$, and the last line follows from Assumption A.1.$^9$ Note that
\begin{align*}
D_{1NT}(\hat\alpha_0^p(\mathcal{T}_0);\mathcal{T}_0) &= \frac{1}{N}\sum_{t=2}^{T}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_1^p(\mathcal{T}_0)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&= \frac{1}{N}\sum_{t=2}^{T}\sum_{i=1}^{N}\left[\left(\beta_t^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{it} - \left(\beta_{t-1}^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{i,t-1}\right]^2\\
&\quad + \frac{2}{N}\sum_{t=2}^{T}\sum_{i=1}^{N}\left[\left(\beta_t^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{it} - \left(\beta_{t-1}^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{i,t-1}\right]\Delta u_{it} \equiv \bar D_1 + 2\bar D_2, \text{ say}.
\end{align*}
Further,
\begin{align*}
\bar D_1 &= \frac{1}{N}\sum_{t=2}^{T_1^0-1}\sum_{i=1}^{N}\left\{\left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'\Delta x_{it}\right\}^2 + \frac{1}{N}\sum_{t=T_1^0+1}^{T}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'\Delta x_{it}\right\}^2\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{iT_1^0} - \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{i,T_1^0-1}\right\}^2\\
&= (T-1)\left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{1NT}\left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right] + (T-1)\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{2NT}\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{iT_1^0} - \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{i,T_1^0-1}\right\}^2 \equiv \bar D_{11} + \bar D_{12} + \bar D_{13}, \text{ say}.
\end{align*}
Let $d_1^0 = \alpha_2^0 - \alpha_1^0$. Then $\hat\alpha_1^p(\mathcal{T}_0) - \alpha_1^0 = M_{NT}^{-1}\tilde M_{2NT}d_1^0 + O_P\left((N(T-1))^{-1/2}\right)$ and $\hat\alpha_1^p(\mathcal{T}_0) - \alpha_2^0 = -M_{NT}^{-1}\tilde M_{1NT}d_1^0 + O_P\left((N(T-1))^{-1/2}\right)$, where $\tilde M_{2NT} = M_{2NT} + \frac{1}{T-1}\phi_{\Delta xx,T_1^0}$ and $\tilde M_{1NT} = M_{1NT} - \frac{1}{T-1}\phi_{\Delta xx,T_1^0,T_1^0-1}$.

Noting that $\|d_1^0\| = J_{\min}$, $\phi_{\Delta xx,T_1^0} = O_P(1)$, $\phi_{\Delta xx,T_1^0,T_1^0-1} = O_P(1)$, $M_{NT}^{-1} = O_P(1)$, $M_{1NT} = O_P\left(\frac{I_1^0-1}{T-1}\right)$, $M_{2NT} = O_P\left(\frac{I_2^0-1}{T-1}\right)$, $\tilde M_{1NT} = O_P\left(\frac{I_1^0-1}{T-1}\right)$, and $\tilde M_{2NT} = O_P\left(\frac{I_2^0-1}{T-1}\right)$, we have
$$d_1^{0\prime}\tilde M_{2NT}'M_{NT}^{-1}M_{1NT}M_{NT}^{-1}\tilde M_{2NT}d_1^0 \ge \frac{J_{\min}^2\left(I_1^0-1\right)\left(I_2^0-1\right)^2}{(T-1)^3}c_{1NT},$$
and
$$d_1^{0\prime}\tilde M_{1NT}'M_{NT}^{-1}M_{2NT}M_{NT}^{-1}\tilde M_{1NT}d_1^0 \ge \frac{J_{\min}^2\left(I_1^0-1\right)^2\left(I_2^0-1\right)}{(T-1)^3}c_{2NT},$$
where $c_{1NT} = \frac{(T-1)^3}{(I_1^0-1)(I_2^0-1)^2}\lambda_{\min}\left(\tilde M_{2NT}'M_{NT}^{-1}M_{1NT}M_{NT}^{-1}\tilde M_{2NT}\right)$ and $c_{2NT} = \frac{(T-1)^3}{(I_1^0-1)^2(I_2^0-1)}\lambda_{\min}\left(\tilde M_{1NT}'M_{NT}^{-1}M_{2NT}M_{NT}^{-1}\tilde M_{1NT}\right)$.

$^9$Strictly speaking, we need $3 \le T_1^0 \le T-1$. If $T_1^0 = 2$ (resp. $T$), then $M_{1NT} = 0$ (resp. $M_{2NT} = 0$) as $\sum_{t=a}^{b} = 0$ when $a > b$.

To bound $\bar D_1$, we consider two subcases: (a1) $I_1^0 \ge 2$ and $I_2^0 \ge 2$, and (a2) $I_1^0 = 1$ or $I_2^0 = 1$. Observe that in subcase (a1)
2 = 1. Observe that in subcase (a1)
D11 + D12
IminJ2min
=T − 1
IminJ2min
[α0
1 − αp1(T0)
]′M1NT
[α0
1 − αp1(T0)
]+[α0
2 − αp1(T0)
]′M2NT
[α0
2 − αp1(T0)
]≥ T − 1
IminJ2min
J2
min
(I01 − 1
) (I02 − 1
)2(T − 1)
3 c1NT +J2
min
(I01 − 1
)2 (I02 − 1
)(T − 1)
3 c2NT
+T − 1
IminJ2min
OP((N(T − 1))−1
)+ s.m.
≥ 1
Imin
(I01 − 1
) (I02 − 1
)2(T − 1)
2 +
(I01 − 1
)2 (I02 − 1
)(T − 1)
2
cNT + oP (1)
=
(I01 − 1
) (I02 − 1
)(T − 2)
Imin (T − 1)2 cNT + oP (1)
where cNT = min(c1NT , c2NT ) and s.m. denotes terms that are of smaller order than the expressed
terms. Then 1IminJ2min
D1 ≥ c+ oP (1) where c = limT→∞(I01−1)(I02−1)(T−2)
Imin(T−1)2plim(NT ) cNT > 0. In subcase
(a2), D11 = 0 or D12 = 0, (D11 + D12)/(IminJ2min) = oP (1) , and we need to show that 1
IminJ2minD13 is
stochastically bounded below by a positive constant. By Assumption A.4(i),
D13
IminJ2min
≥ 1
IminJ2min
minα1
1
N
N∑i=1
[(α0
2 − α1
)′xiT 01 −
(α0
1 − α1
)′xi,T 01−1
]2≥ cαImin
+ oP (1) .
Then 1IminJ2min
D1 ≥ cαImin
+ oP (1).
To determine the probability order of $\bar D_2$, we make the following decomposition:
\begin{align*}
\bar D_2 &= \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{1NT}^u + \left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{2NT}^u\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{iT_1^0} - \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{i,T_1^0-1}\right\}\Delta u_{iT_1^0} \equiv \bar D_{21} + \bar D_{22} + \bar D_{23}, \text{ say},
\end{align*}
where $M_{1NT}^u = \frac{1}{N}\sum_{t=2}^{T_1^0-1}\sum_{i=1}^{N}\Delta x_{it}\Delta u_{it}$ and $M_{2NT}^u = \frac{1}{N}\sum_{t=T_1^0+1}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta u_{it}$. Noting that $\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0) = J_{\min}O_P\left(\frac{I_2^0-1}{T-1}\right)$, $\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0) = J_{\min}O_P\left(\frac{I_1^0-1}{T-1}\right)$, $M_{1NT}^u = O_P\left(\sqrt{(I_1^0-1)/N}\right)$, and $M_{2NT}^u = O_P\left(\sqrt{(I_2^0-1)/N}\right)$, we have
\begin{align*}
\frac{\bar D_{21} + \bar D_{22}}{I_{\min}J_{\min}^2}
&= \frac{1}{I_{\min}J_{\min}}\left[O_P\left(\frac{I_2^0-1}{T-1}\right)O_P\left(\sqrt{\frac{I_1^0-1}{N}}\right) + O_P\left(\frac{I_1^0-1}{T-1}\right)O_P\left(\sqrt{\frac{I_2^0-1}{N}}\right)\right]\\
&= \frac{1}{I_{\min}(T-1)J_{\min}\sqrt{N}}O_P\left(\left(I_2^0-1\right)\sqrt{I_1^0-1} + \left(I_1^0-1\right)\sqrt{I_2^0-1}\right) = o_P(1).
\end{align*}
Similarly, noting that $\frac{1}{N}\sum_{i=1}^{N}x_{it}\Delta u_{iT_1^0} = O_P(N^{-1/2})$ for $t = T_1^0$ and $T_1^0-1$, we have
$$\frac{\bar D_{23}}{I_{\min}J_{\min}^2} = \frac{1}{I_{\min}J_{\min}\sqrt{N}}\left[O_P\left(\frac{I_1^0-1}{T-1}\right) + O_P\left(\frac{I_2^0-1}{T-1}\right)\right] = o_P(1).$$
It follows that $\frac{\bar D_2}{I_{\min}J_{\min}^2} = o_P(1)$. Consequently, we have $\frac{1}{I_{\min}J_{\min}^2}D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m) \ge c + o_P(1)$.
In cases (b)-(c), it suffices to consider the case where $m = m^0 - 1$. [If $m < m^0 - 1$, one can always augment the set $\mathcal{T}_m$ by $m^0 - 1 - m$ true break points that are not inside $\mathcal{T}_m$ to make $D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m)$ smaller.] For case (b) with $m = 1$, we consider three subcases: (b.1) $2 \le T_1 < T_1^0$, (b.2) $T_1^0 \le T_1 \le T_2^0$, and (b.3) $T_2^0 < T_1 \le T$, where (b.3) is redundant if $T_2^0 = T$ (i.e., the second true break occurs at the end of the sample). In subcase (b.1), we can focus on the interval $[T_1+1, T]$, which contains two true break points $T_1^0$ and $T_2^0$ that are not accounted for by the post-Lasso estimate $\hat\alpha_1^p(\mathcal{T}_1) = (\hat\alpha_1^p(\mathcal{T}_1)', \hat\alpha_2^p(\mathcal{T}_1)')'$. Observe that
\begin{align*}
D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) &= \frac{1}{N}\sum_{t=2}^{T_1-1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_1^p(\mathcal{T}_1)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&\quad + \frac{1}{N}\sum_{t=T_1+1}^{T}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_2^p(\mathcal{T}_1)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_1} - \hat\alpha_2^p(\mathcal{T}_1)'x_{iT_1} + \hat\alpha_1^p(\mathcal{T}_1)'x_{i,T_1-1}\right)^2 - (\Delta u_{iT_1})^2\right]\\
&\equiv \bar D_3 + \bar D_4 + \bar D_5, \text{ say}.
\end{align*}
It is easy to show that $\hat\alpha_1^p(\mathcal{T}_1) - \alpha_1^0 = O_P(N^{-1/2})$. With this, one can readily show that $\bar D_3 = O_P(N^{-1})$. Let $\check\alpha = \arg\min_{\alpha}\frac{1}{N}\sum_{i=1}^{N}\left(\Delta y_{iT_1} - \alpha'x_{iT_1} + \hat\alpha_1^p(\mathcal{T}_1)'x_{i,T_1-1}\right)^2$. By standard results for OLS regressions, $\check\alpha - \beta_{T_1}^0 = O_P(N^{-1/2})$ and
$$\check D_5 = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_1} - \check\alpha'x_{iT_1} + \hat\alpha_1^p(\mathcal{T}_1)'x_{i,T_1-1}\right)^2 - (\Delta u_{iT_1})^2\right] = O_P(N^{-1}).$$
It follows that $\bar D_5 \ge \check D_5 = O_P(N^{-1})$ and $D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) \ge \bar D_4 + O_P(N^{-1})$. A simple repetition of the argument used in case (a) (now with two true breaks) yields $\frac{1}{I_{\min}J_{\min}^2}\bar D_4 \ge c + o_P(1)$ for some $c > 0$. Then, by the fact that $NJ_{\min}^2\to c_J = \infty$, we have $\frac{1}{I_{\min}J_{\min}^2}D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) \ge c + o_P(1)$.

For subcase (b.2), wlog we assume that $T_1 - T_1^0 \ge T_2^0 - T_1$, which implies that $T_1 - T_1^0 \ge I_{\min}/2$. Then we can focus on the interval $[2, T_1]$, which contains a true break point $T_1^0$. As in subcase (b.1), we can show that $D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) \ge \check D_{1NT} + O_P(N^{-1})$, where $\check D_{1NT} = \frac{1}{N}\sum_{t=2}^{T_1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_1^p(\mathcal{T}_1)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]$. A simple repetition of the argument used in case (a) yields $\frac{1}{I_{\min}J_{\min}^2}\check D_{1NT} \ge c + o_P(1)$ for some $c > 0$. Subcase (b.3) is analogous to subcase (b.1). Hence the conclusion follows in case (b). Case (c) can be studied analogously. This completes the proof of the lemma.
Lemma A.3 Let $\bar{\mathbb{T}}_m = \{\mathcal{T}_m = \{T_1,\ldots,T_m\} : \mathcal{T}_{m^0}\subset\mathcal{T}_m,\ 2 \le T_1 < \cdots < T_m \le T\}$, where $m^0 < m \le m_{\max}$. Then $\max_{m^0<m\le m_{\max}}\sup_{\mathcal{T}_m\in\bar{\mathbb{T}}_m}N\left|\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right| = O_P(1)$.

Proof. Let $\mathcal{T}_m\in\bar{\mathbb{T}}_m$, where $m^0 < m \le m_{\max}$. In view of the facts that $\hat\sigma_{\mathcal{T}_{m^0}^0}^2 \ge \hat\sigma_{\mathcal{T}_m}^2$ and $\hat\sigma_{\mathcal{T}_{m^0}^0}^2 = \hat\sigma_{NT}^2 + O_P((NI_{\min})^{-1})$, we have
$$0 \le \hat\sigma_{\mathcal{T}_{m^0}^0}^2 - \hat\sigma_{\mathcal{T}_m}^2 = \hat\sigma_{NT}^2 - \hat\sigma_{\mathcal{T}_m}^2 + O_P\left((NI_{\min})^{-1}\right) \le (m+1)\bar J_{NT} + O_P\left((NI_{\min})^{-1}\right),\tag{A.6}$$
where
$$\bar J_{NT} \equiv \max_{\substack{0\le s\le m\\ (T_s,\,T_{s+1}-1]\text{ contains no break point}}}\left|\inf_{(\alpha,\beta)}S_s(\alpha,\beta)\right|$$
and $S_s(\alpha,\beta) = \frac{1}{N(T-1)}\sum_{t=T_s+1}^{T_{s+1}-1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \alpha'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right] + \frac{1}{N(T-1)}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_{s+1}} - \beta'x_{iT_{s+1}} + \alpha'x_{i,T_{s+1}-1}\right)^2 - (\Delta u_{iT_{s+1}})^2\right]$ for $s < m$, and $S_m(\alpha,\beta) = S_m(\alpha) = \frac{1}{N(T-1)}\sum_{t=T_m+1}^{T}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \alpha'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]$. Let $(\hat\alpha_s,\hat\beta_s) = \arg\min_{(\alpha,\beta)}S_s(\alpha,\beta)$ and $\hat\gamma_s = (\hat\alpha_s',\hat\beta_s')'$ when $s < m$, and $\hat\alpha_m = \arg\min_{\alpha}S_m(\alpha)$. To study $\inf_{(\alpha,\beta)}S_s(\alpha,\beta)$ for $s = 0,1,\ldots,m$, we consider three cases: (a) $T_{s+1} - T_s = 1$, $s < m$; (b) $T_{s+1} - T_s \ge 2$, $s < m$; and (c) $T_{s+1} - T_s \ge 2$, $s = m$.$^{10}$

In case (a), noting that the first term in the definition of $S_s(\alpha,\beta)$ is zero, we have $\hat\gamma_s = (X_s'X_s)^{-1}X_s'\Delta Y_s$, where $X_s = (X_{1s},\ldots,X_{Ns})'$, $X_{is} = (-x_{i,T_{s+1}-1}', x_{iT_{s+1}}')'$, and $\Delta Y_s = (\Delta y_{1T_{s+1}},\ldots,\Delta y_{NT_{s+1}})'$. Let $\gamma_s^0 = (\beta_{T_{s+1}-1}^{0\prime}, \beta_{T_{s+1}}^{0\prime})'$. One can readily show that $\hat\gamma_s - \gamma_s^0 = O_P(N^{-1/2})$. In addition,
\begin{align*}
(T-1)S_s(\hat\alpha_s,\hat\beta_s) &= \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_{s+1}} - \hat\beta_s'x_{iT_{s+1}} + \hat\alpha_s'x_{i,T_{s+1}-1}\right)^2 - (\Delta u_{iT_{s+1}})^2\right]\\
&= \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta u_{iT_{s+1}} - (\hat\gamma_s - \gamma_s^0)'X_{is}\right)^2 - (\Delta u_{iT_{s+1}})^2\right]\\
&= (\hat\gamma_s - \gamma_s^0)'\left(\frac{1}{N}\sum_{i=1}^{N}X_{is}X_{is}'\right)(\hat\gamma_s - \gamma_s^0) - 2(\hat\gamma_s - \gamma_s^0)'\left(\frac{1}{N}\sum_{i=1}^{N}X_{is}\Delta u_{iT_{s+1}}\right)\\
&= O_P(N^{-1/2})O_P(1)O_P(N^{-1/2}) - 2O_P(N^{-1/2})O_P(N^{-1/2}) = O_P(N^{-1}).
\end{align*}
In case (c), $\hat\alpha_m = \left(\sum_{t=T_m+1}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'\right)^{-1}\sum_{t=T_m+1}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta y_{it}$. It is easy to verify that $\hat\alpha_m - \alpha_{m^0+1}^0 = O_P(N^{-1/2})$ and $S_m(\hat\alpha_m) = O_P(N^{-1})$. Similarly, in case (b), one can verify that $\hat\gamma_s - \gamma_s^0 = O_P(N^{-1/2})$ and $S_s(\hat\alpha_s,\hat\beta_s) = O_P(N^{-1})$. It follows that $\bar J_{NT} = O_P(N^{-1})$. This, in conjunction with (A.6), implies that
$$\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{NT}^2 = O_P(N^{-1}),$$
which holds for all $m\in\{m^0+1,\ldots,m_{\max}\}$ and $\mathcal{T}_m = \{T_1,\ldots,T_m\}$. Then the conclusion follows.
$^{10}$To see why we do not need to consider the case where $T_{m+1} - T_m = 1$, note that if $s = m$, we have $T_{m+1} - T_m = T + 1 - T_m$ as $T_{m+1} = T + 1$ by default. If $T_{m+1} - T_m = 1$, we have $T_m = T$ so that $S_m(\alpha) = 0$ in this case.

B Proof of the results in Section 4

Let $V_{2NT}(\beta) \equiv \sum_{t=2}^{T}\left[\frac{1}{N}\sum_{i=1}^{N}\rho_{it}(\beta_t,\beta_{t-1})\right]'W_t\left[\frac{1}{N}\sum_{i=1}^{N}\rho_{it}(\beta_t,\beta_{t-1})\right]$, where $\rho_{it}(\beta_t,\beta_{t-1}) = z_{it}(\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1})$. We first prove a technical lemma.
Lemma B.1 Suppose Assumption B.1 holds. Then $\dot\beta_t - \beta_t^0 = O_P(N^{-1/2})$ for each $t = 1,2,\ldots,T$, where $\dot\beta = \arg\min_{\beta}V_{2NT}(\beta)$.

Proof. The proof is analogous to that of Lemma A.1, and we only sketch it. Let $\dot b_t = N^{1/2}(\dot\beta_t - \beta_t^0)$ and $\dot b = N^{1/2}(\dot\beta - \beta^0)$. Let $\xi_{it} = x_{it}'b_t - x_{i,t-1}'b_{t-1}$, where recall $b_t = N^{1/2}(\beta_t - \beta_t^0)$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}\xi_{it}$, we have
\begin{align*}
N\left[V_{2NT}(\beta) - V_{2NT}(\beta^0)\right]
&= \sum_{t=2}^{T}\left(\frac{1}{N}\sum_{i=1}^{N}\xi_{it}z_{it}'\right)W_t\left(\frac{1}{N}\sum_{i=1}^{N}z_{it}\xi_{it}\right) - 2N^{1/2}\sum_{t=2}^{T}\left(\frac{1}{N}\sum_{i=1}^{N}\xi_{it}z_{it}'\right)W_t\left(\frac{1}{N}\sum_{i=1}^{N}z_{it}\Delta u_{it}\right)\\
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u \equiv B_1(b) - 2B_2(b), \text{ say},
\end{align*}
where $Q_{NT}$ and $R_{NT}^u$ are defined in (2.9) and (2.10), respectively. As in the proof of Lemma A.1, under Assumption B.1 we can readily show that w.p.a.1
$$T^{-1}\left[B_1(b) - 2B_2(b)\right] \ge (c_{Q_0}/2)\,T^{-1}\|b\|^2 - T^{-1/2}\|b\|\,O_P(1) > 0$$
if $T^{-1/2}\|b\|$ is sufficiently large. It follows that $T^{-1/2}\|\dot b\|$ must be stochastically bounded, and the result follows if $T$ is fixed.

In the case of large $T$, we can show that $\dot\beta_t - \beta_t^0 = O_P(N^{-1/2})$ for each $t$ by the same arguments as used in the second part of the proof of Lemma A.1, as $Q_{NT}$ is a symmetric block tridiagonal matrix that is asymptotically nonsingular.
Proof of Theorem 4.1. (i) The proof parallels that of Theorem 3.1, and we only sketch it. Let $\hat b_t = N^{1/2}(\hat\beta_t - \beta_t^0)$ and $\hat b = N^{1/2}(\hat\beta - \beta^0)$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}\xi_{it}$, where $\xi_{it} = x_{it}'b_t - x_{i,t-1}'b_{t-1}$, we have
\begin{align*}
N\left[V_{2NT,\lambda_2}(\beta) - V_{2NT,\lambda_2}(\beta^0)\right]
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u + N\lambda_2\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}(b_t - b_{t-1})\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&\quad + N\lambda_2\sum_{t=2,\,t\in\mathcal{T}_{m^0}^{0c}}^{T}\dot w_t\left\|N^{-1/2}(b_t - b_{t-1})\right\| \equiv B_1(b) - 2B_2(b) + B_3(b) + B_4(b), \text{ say}.
\end{align*}
As in the proof of Theorem 3.1, we can show that $|T^{-1}B_3(b)| = O_P\left(N^{1/2}\lambda_2 T^{-1/2}J_{\min}^{-\kappa_2}\right)T^{-1/2}\|b\| = O_P(1)\,T^{-1/2}\|b\|$ and w.p.a.1
$$\left[B_1(b) - 2B_2(b) + B_3(b)\right]/T \ge \lambda_{\min}(Q_{NT})\,T^{-1}\|b\|^2 - O_P(1)\,T^{-1/2}\|b\| > 0$$
if $T^{-1/2}\|b\| = L$ is sufficiently large. Consequently, $N[V_{2NT,\lambda_2}(\beta) - V_{2NT,\lambda_2}(\beta^0)] > 0$ w.p.a.1 for large $L$, and $V_{2NT,\lambda_2}(\beta)$ cannot be minimized in this case. This further implies that $T^{-1/2}\|\hat b\|$ has to be stochastically bounded.

(ii) The result follows from (i) in the case of fixed $T$. In the case of large $T$, the proof is analogous to that of the second part of Theorem 3.1, utilizing the fact that $Q_{NT}$ is an asymptotically nonsingular symmetric block tridiagonal matrix.
Proof of Theorem 4.2. We want to demonstrate that
P(∥∥∥θt∥∥∥ = 0 for all t ∈ T 0c
m0
)→ 1 as N →∞. (B.1)
Suppose that to the contrary, θt = βt − βt−1 6= 0 for some t ∈ T 0cm0 for suffi ciently large N or (N,T ) .
To consider the optimization conditions wrt βt, t ≥ 2, based on subdifferential calculus (e.g., Bersekas
(1995, Appendix B.5)), we distinguish two cases: (a) 2 ≤ t ≤ T − 1 and (b) t = T and T ∈ T 0cm0 .
In case (a), we consider two subcases: (a1) t+1 = T 0j ∈ T 0
m0 for some j = 1, ...,m0, and (a2) t+1 ∈ T 0cm0 .
In either case, we can apply the FOC wrt βt and the equality ∆yit = β0′t xit−β0′
t−1xi,t−1 + ∆uit to obtain
0p×1 = − 2
N
N∑i=1
xitz′itWt
1√N
N∑i=1
zit
[∆yit − β
′txit + β
′t−1xi,t−1
](B.2)
+2
N
N∑i=1
x′itzi,t+1Wt+11√N
N∑i=1
zi,t+1
[∆yi,t+1 − β
′t+1xi,t+1 + β
′txit
]+√Nλ1wt
θt,p∥∥∥θt∥∥∥ −√Nλ1wt+1et+1
= −2φ′zx,tWt1√N
N∑i=1
zit
[∆uit −
(βt − β0
t
)′xit +
(βt−1 − β0
t−1
)′xi,t−1
]
+2φ′zx,t+1,tWt+11√N
N∑i=1
zi,t+1
[∆ui,t+1 −
(βt+1 − β0
t+1
)′xi,t+1 +
(βt − β0
t
)′xit
]
+√Nλ2wt
θt,p∥∥∥θt∥∥∥ −√Nλ2wt+1et+1
= −2√N [φ′zx,t+1,tWt+1φzx,t+1
(βt+1 − β0
t+1
)− φ′zx,tWt+1φzx,t
(βt − β0
t
)−φ′zx,t+1,tWt+1φzx,t+1,t
(βt − β0
t
)+ φ′zx,tWtφzx,t,t−1
(βt−1 − β0
t−1
)′]
+2√N(φ′zx,t+1,tWt+1φz∆u,t+1 − φzx,tWtφz∆u,t
)+√Nλ2wt
θt,p∥∥∥θt∥∥∥ −√Nλ2wt+1et+1,p
≡ B1t +B2t +B3t −B4t, say,
where et+1 = θt+1/∥∥∥θt+1
∥∥∥ if ∥∥∥θt+1
∥∥∥ 6= 0 and ‖et+1‖ ≤ 1 otherwise.
Since $\hat\theta_t \neq 0$, there exists $r \in \{1, \ldots, p\}$ such that $|\hat\theta_{t,r}| = \max\{|\hat\theta_{t,l}|,\ l = 1, \ldots, p\}$, where for any $p \times 1$ vector $a_t$, $a_{t,l}$ denotes its $l$th element. Wlog assume that $r = p$, implying that $|\hat\theta_{t,p}|/\|\hat\theta_t\| \geq 1/\sqrt p$. By Assumptions B.1(i)-(ii) and Theorem 4.1, $B_{1t,p} = O_P(1)$ and $B_{2t,p} = O_P(1)$. In view of the fact that $w_t^{-1} = O_P(N^{-\kappa_2/2})$ for $t \in \mathcal{T}^{0c}_{m_0}$, $|B_{3t,p}| \geq \sqrt N \lambda_2 w_t/\sqrt p$, which is explosive in probability under Assumption B.2(iii) ($N^{(\kappa_2+1)/2}\lambda_2 \to \infty$). To bound the probability order of $B_{4t,p}$, we distinguish two subcases. In subcase (a1), noting that $\hat\beta_{t+1} - \hat\beta_t \stackrel{P}{\to} \theta^0_{t+1} \neq 0$ by Theorem 4.1, we have $w_{t+1} = \|\theta^0_{t+1} + O_P(N^{-1/2})\|^{-\kappa_2} = O_P\big(J_{\min}^{-\kappa_2}\big)$ and $B_{4t,p} = \sqrt N \lambda_2 w_{t+1} e_{t+1,p} = O_P\big(\sqrt N \lambda_2 J_{\min}^{-\kappa_2}\big) = O_P(1)$. Consequently, $|B_{3t,p}| \gg |B_{1t,p} + B_{2t,p} + B_{4t,p}|$, so that (B.2) cannot be true for sufficiently large $N$ or $(N,T)$. Then we conclude that w.p.a.1, $\hat\theta_t$ must be in a position where $\|\hat\theta_t\|$ is not differentiable in subcase (a1). In addition, a direct application of this result is that if $T^0_j - 1 \in \mathcal{T}^{0c}_{m_0}$ for some $j = 1, \ldots, m_0$, then $P\big(\|\hat\theta_{T^0_j-1}\| = 0\big) \to 1$ as $N \to \infty$ and $\sqrt N \lambda_2 w_{T^0_j-1} e_{T^0_j-1} = O_P(1)$ in order for the FOC to hold for $t = T^0_j - 1$.
In subcase (a2), we apply deductive arguments as used in the proof of Theorem 3.2, together with the result in subcase (a1), to show that $\hat\theta_t$ must be in a position where $\|\hat\theta_t\|$ is not differentiable for all $t \in \mathcal{T}^{0c}_{m_0}$ with $t \neq T$.
In case (b), noting that only one term in the penalty $\lambda_2 \sum_{t=2}^{T} w_t \|\beta_t - \beta_{t-1}\|$ involves $\beta_T$, it is easy to show that $\hat\theta_T = \hat\beta_T - \hat\beta_{T-1}$ must be in a position where $\|\hat\theta_T\|$ is not differentiable if $T \in \mathcal{T}^{0c}_{m_0}$. Consequently, (B.1) follows.
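The mechanism behind (B.1) — that the FOC can only hold at the kink of $\|\theta_t\|$, forcing exact zeros — is the same one that makes the group-lasso proximal operator return exact zeros. A minimal numerical sketch (illustrative only; `block_soft_threshold` is not the paper's estimator, merely the prox of a single penalty term $\lambda\|\theta\|$):

```python
import numpy as np

def block_soft_threshold(v, lam):
    """Minimizer of 0.5*||theta - v||^2 + lam*||theta|| (group-lasso prox).

    The subdifferential of ||theta|| at 0 is {e : ||e|| <= 1}, so theta = 0
    solves the FOC iff ||v|| <= lam; otherwise v is shrunk radially toward 0.
    """
    norm_v = np.linalg.norm(v)
    if norm_v <= lam:
        return np.zeros_like(v)       # exact zero: the non-differentiable point
    return (1.0 - lam / norm_v) * v   # radial shrinkage, direction preserved

lam = 1.0
v_small = np.array([0.3, -0.4])       # ||v|| = 0.5 <= lam -> exactly zero jump
v_large = np.array([3.0, -4.0])       # ||v|| = 5.0 >  lam -> shrunk jump
print(block_soft_threshold(v_small, lam))  # [0. 0.]
print(block_soft_threshold(v_large, lam))  # [ 2.4 -3.2]
```

Divergent adaptive weights $w_t$ at non-break dates correspond to an effectively larger $\lambda$ for those coordinates, which is why those jumps are zeroed out w.p.a.1.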
Proof of Corollary 4.3. The proof is analogous to that of Corollary 3.3 by using Theorems 4.1-4.2
instead.
Proof of Theorem 4.4. Note that $\hat\alpha^p_m(\hat{\mathcal T}_m) = \big(\hat\alpha^p_1(\hat{\mathcal T}_m)', \ldots, \hat\alpha^p_{m+1}(\hat{\mathcal T}_m)'\big)' = \arg\min_{\alpha_m} Q_{2NT}\big(\alpha_m; \hat{\mathcal T}_m\big)$. The first order conditions for this minimization problem are
$$\begin{aligned}
0_{p\times 1} ={}& -\frac{2}{N}\sum_{t=2}^{\hat T_1-1}\sum_{i=1}^N \Delta x_{it} z'_{it} \hat W^p_1 \frac{1}{N}\sum_{t=2}^{\hat T_1-1}\sum_{i=1}^N z_{it}\big(\Delta y_{it} - \hat\alpha^{p\prime}_1 \Delta x_{it}\big) \\
&+ \frac{2}{N}\sum_{i=1}^N x_{i,\hat T_1-1} z'_{i\hat T_1} W_{\hat T_1} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_1}\big(\Delta y_{i\hat T_1} - \hat\alpha^{p\prime}_2 x_{i\hat T_1} + \hat\alpha^{p\prime}_1 x_{i,\hat T_1-1}\big),
\end{aligned}$$
$$\begin{aligned}
0_{p\times 1} ={}& -\frac{2}{N}\sum_{t=\hat T_{j-1}+1}^{\hat T_j-1}\sum_{i=1}^N \Delta x_{it} z'_{it} \hat W^p_j \frac{1}{N}\sum_{t=\hat T_{j-1}+1}^{\hat T_j-1}\sum_{i=1}^N z_{it}\big(\Delta y_{it} - \hat\alpha^{p\prime}_j \Delta x_{it}\big) \\
&+ \frac{2}{N}\sum_{i=1}^N x_{i,\hat T_j-1} z'_{i\hat T_j} W_{\hat T_j} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_j}\big(\Delta y_{i\hat T_j} - \hat\alpha^{p\prime}_{j+1} x_{i\hat T_j} + \hat\alpha^{p\prime}_j x_{i,\hat T_j-1}\big) \\
&- \frac{2}{N}\sum_{i=1}^N x_{i\hat T_{j-1}} z'_{i\hat T_{j-1}} W_{\hat T_{j-1}} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_{j-1}}\big(\Delta y_{i\hat T_{j-1}} - \hat\alpha^{p\prime}_j x_{i\hat T_{j-1}} + \hat\alpha^{p\prime}_{j-1} x_{i,\hat T_{j-1}-1}\big) \quad \text{for } j = 2, \ldots, m,
\end{aligned}$$
$$\begin{aligned}
0_{p\times 1} ={}& -\frac{2}{N}\sum_{t=\hat T_m+1}^{T}\sum_{i=1}^N \Delta x_{it} z'_{it} \hat W^p_{m+1} \frac{1}{N}\sum_{t=\hat T_m+1}^{T}\sum_{i=1}^N z_{it}\big(\Delta y_{it} - \hat\alpha^{p\prime}_{m+1} \Delta x_{it}\big) \\
&- \frac{2}{N}\sum_{i=1}^N x_{i\hat T_m} z'_{i\hat T_m} W_{\hat T_m} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_m}\big(\Delta y_{i\hat T_m} - \hat\alpha^{p\prime}_{m+1} x_{i\hat T_m} + \hat\alpha^{p\prime}_m x_{i,\hat T_m-1}\big),
\end{aligned}$$
where we suppress the dependence of the $\hat\alpha^p_j$'s on $\hat{\mathcal T}_m$.
Let $\hat\Phi_{ab,l} = \frac{1}{N}\sum_{t=\hat T_{l-1}}^{\hat T_l-1}\sum_{i=1}^N a_{it} b'_{it}$ for $l = 1, \ldots, m+1$, and $a, b = \Delta x$, $x$, or $\Delta y$. Let $\hat\phi^{\dagger}_{ab,l+1} = \hat\phi'_{ab,\hat T_l} W_{\hat T_l} \hat\phi_{ab,\hat T_l,\hat T_l-1}$ for $l = 1, \ldots, m$. One can readily solve for $\hat\alpha^p_m$ to obtain $\hat\alpha^p_m = \hat\Upsilon^{-1}_{NT} \hat\Xi^y_{NT}$, where
$$\hat\Upsilon_{NT} = \mathrm{TriD}\big(\hat\Upsilon^{\dagger}, \hat\Upsilon\big)_{m+1}, \quad \hat\Xi^y_{NT} = \big(\hat\Xi'_{y,1}, \hat\Xi'_{y,2}, \ldots, \hat\Xi'_{y,m+1}\big)', \tag{B.3}$$
$\hat\Upsilon_1 = \hat\Phi'_{z\Delta x,1} \hat W^p_1 \hat\Phi_{z\Delta x,1} + \hat\phi'_{zx,\hat T_1,\hat T_1-1} W_{\hat T_1} \hat\phi_{zx,\hat T_1,\hat T_1-1}$, $\hat\Upsilon_l = \hat\Phi'_{z\Delta x,l} \hat W^p_l \hat\Phi_{z\Delta x,l} + \hat\phi'_{zx,\hat T_l,\hat T_l-1} W_{\hat T_l} \hat\phi_{zx,\hat T_l,\hat T_l-1} + \hat\phi'_{zx,\hat T_{l-1}} W_{\hat T_{l-1}} \hat\phi_{zx,\hat T_{l-1}}$ for $l = 2, \ldots, m$, $\hat\Upsilon_{m+1} = \hat\Phi'_{z\Delta x,m+1} \hat W^p_{m+1} \hat\Phi_{z\Delta x,m+1} + \hat\phi'_{zx,\hat T_m} W_{\hat T_m} \hat\phi_{zx,\hat T_m}$, and $\hat\Upsilon^{\dagger}_l = \hat\phi^{\dagger}_{xx,l}$ for $l = 2, \ldots, m+1$. In addition, $\hat\Xi_{y,1} = \hat\Phi'_{z\Delta x,1} \hat W^p_1 \hat\Phi_{z\Delta y,1} - \hat\phi'_{zx,\hat T_1,\hat T_1-1} W_{\hat T_1} \hat\phi_{z\Delta y,\hat T_1}$, $\hat\Xi_{y,l} = \hat\Phi'_{z\Delta x,l} \hat W^p_l \hat\Phi_{z\Delta y,l} - \hat\phi'_{zx,\hat T_l,\hat T_l-1} W_{\hat T_l} \hat\phi_{z\Delta y,\hat T_l} + \hat\phi'_{zx,\hat T_{l-1}} W_{\hat T_{l-1}} \hat\phi_{z\Delta y,\hat T_{l-1}}$ for $l = 2, \ldots, m$, and $\hat\Xi_{y,m+1} = \hat\Phi'_{z\Delta x,m+1} \hat W^p_{m+1} \hat\Phi_{z\Delta y,m+1} + \hat\phi'_{zx,\hat T_m} W_{\hat T_m} \hat\phi_{z\Delta y,\hat T_m}$.
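The closed form $\hat\alpha^p_m = \hat\Upsilon^{-1}_{NT}\hat\Xi^y_{NT}$ only requires assembling and inverting a block tridiagonal matrix. A hedged sketch of the $\mathrm{TriD}(\cdot,\cdot)$ construction (hypothetical helper `tri_d`; the symmetric case is shown for illustration, with arbitrary toy blocks in place of the $\hat\Upsilon_l$ and $\hat\phi^{\dagger}$ moments):

```python
import numpy as np

def tri_d(off_blocks, diag_blocks):
    """Assemble a symmetric block tridiagonal matrix TriD(off, diag).

    diag_blocks: list of m+1 (p x p) diagonal blocks (e.g. Upsilon_1..Upsilon_{m+1}).
    off_blocks:  list of m (p x p) sub-diagonal blocks (e.g. Upsilon_dagger_2..);
                 the super-diagonal is filled with their transposes.
    """
    p = diag_blocks[0].shape[0]
    n = len(diag_blocks)
    A = np.zeros((n * p, n * p))
    for l, D in enumerate(diag_blocks):
        A[l*p:(l+1)*p, l*p:(l+1)*p] = D
    for l, O in enumerate(off_blocks):            # block row l+1, block column l
        A[(l+1)*p:(l+2)*p, l*p:(l+1)*p] = O
        A[l*p:(l+1)*p, (l+1)*p:(l+2)*p] = O.T     # symmetric counterpart
    return A

rng = np.random.default_rng(0)
p, m_plus_1 = 2, 3
diags = [np.eye(p) * 4.0 for _ in range(m_plus_1)]              # dominant diagonal blocks
offs = [rng.standard_normal((p, p)) * 0.1 for _ in range(m_plus_1 - 1)]
Upsilon = tri_d(offs, diags)
Xi = rng.standard_normal(m_plus_1 * p)
alpha_hat = np.linalg.solve(Upsilon, Xi)          # alpha = Upsilon^{-1} Xi
assert np.allclose(Upsilon @ alpha_hat, Xi)
```

In practice one would exploit the banded structure (e.g., a block Thomas algorithm) rather than a dense solve, but the dense version makes the structure of (B.3) transparent.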
By Corollary 4.3, $\hat\alpha^p_m(\hat{\mathcal T}_m) = \hat\alpha^p_{m_0}(\hat{\mathcal T}_{m_0})$ w.p.a.1. Therefore we can study the asymptotic distribution of $\hat\alpha^p_m(\hat{\mathcal T}_m)$ by studying that of $\hat\alpha_{m_0}(\hat{\mathcal T}_{m_0})$. Note that $\hat\alpha_{m_0}(\hat{\mathcal T}_{m_0}) = \hat\Upsilon^{-1}_{NT}\hat\Xi_{NT}$, where $\hat\Upsilon_{NT}$ and $\hat\Xi_{NT}$ are defined in (4.1). It is easy to verify that
$$\sqrt N D_{m_0+1}\big(\hat\alpha^p_{m_0}(\hat{\mathcal T}_{m_0}) - \alpha^0\big) = \big(D^{-1}_{m_0+1}\hat\Upsilon_{NT} D^{-1}_{m_0+1}\big)^{-1}\sqrt N D^{-1}_{m_0+1}\hat\Xi^u_{NT},$$
where $\hat\Xi^u_{NT}$ is defined in (4.1). Then by Assumption B.3(i), $D^{-1}_{m_0+1}\hat\Upsilon_{NT} D^{-1}_{m_0+1} \stackrel{P}{\to} \Upsilon_0 > 0$. By Assumption B.3(ii), $\sqrt N D^{-1}_{m_0+1}\hat\Xi^u_{NT} \stackrel{D}{\to} N(0, \Sigma_0)$. Then by the Slutsky lemma,
$$\sqrt N D_{m_0+1}\big(\hat\alpha^p_{m_0}(\hat{\mathcal T}_{m_0}) - \alpha^0\big) \stackrel{D}{\to} N\big(0, \Upsilon^{-1}_0 \Sigma_0 \Upsilon^{-1}_0\big).$$
This completes the proof of the theorem.
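The limiting covariance has the familiar GMM sandwich form $\Upsilon_0^{-1}\Sigma_0\Upsilon_0^{-1}$, so in practice one would plug in consistent estimates of $\Upsilon_0$ and $\Sigma_0$. A toy sketch (hypothetical helper `sandwich_cov`, not code from the paper):

```python
import numpy as np

def sandwich_cov(Upsilon_hat, Sigma_hat):
    """Plug-in estimate of the sandwich covariance Upsilon^{-1} Sigma Upsilon^{-1}."""
    Upsilon_inv = np.linalg.inv(Upsilon_hat)
    return Upsilon_inv @ Sigma_hat @ Upsilon_inv

# Toy 2x2 example with diagonal matrices so the result is easy to verify by hand.
U = np.array([[2.0, 0.0], [0.0, 4.0]])
S = np.eye(2)
V = sandwich_cov(U, S)
print(V)  # [[0.25 0.  ] [0.   0.0625]]
```

When $\hat\Sigma = \hat\Upsilon$ (e.g., under an efficient weighting), the sandwich collapses to $\hat\Upsilon^{-1}$, the usual simplification.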
Proof of Theorem 4.5. The proof is analogous to that of Theorem 3.5 and thus omitted.
REFERENCES
Andrews, D. W. K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821-856.
Andrews, D. W. K., 2003. End-of-sample instability tests. Econometrica 71, 1661-1694.
Angelosante, D., Giannakis, G. B., 2012. Group Lassoing change-points in piecewise-constant AR processes. EURASIP Journal on Advances in Signal Processing 1(70), 1-16.
Bai, J., 1997a. Estimation of a change point in multiple regression models. Review of Economics and Statistics 79, 551-563.
Bai, J., 1997b. Estimating multiple breaks one at a time. Econometric Theory 13, 315-352.
Bai, J., 2010. Common breaks in means and variances for panel data. Journal of Econometrics 157, 78-92.
Bai, J., Lumsdaine, R. L., Stock, J., 1998. Testing and dating common breaks in multivariate time series. Review of Economic Studies 65, 395-432.
Bai, J., Perron, P., 1998. Estimating and testing linear models with multiple structural changes. Econometrica 66, 47-78.
Baltagi, B. H., Feng, Q., Kao, C., 2013. Estimation of heterogeneous panels with structural breaks. Working paper, Syracuse University.
Baltagi, B. H., Kao, C., Liu, L., 2012. Estimation and identification of change points in panel models with nonstationary or stationary regressors and error terms. Working paper, Syracuse University.
Belloni, A., Chernozhukov, V., Hansen, C., 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80, 2369-2429.
Ben-David, D., Papell, D. H., 1995. The great wars, the great crash, and steady state growth: some new evidence about an old stylized fact. Journal of Monetary Economics 36, 453-475.
Bertsekas, D., 1995. Nonlinear Programming. Athena Scientific, Belmont, MA.
Breitung, J., Eickmeier, S., 2011. Testing for structural breaks in dynamic factor models. Journal of Econometrics 163, 71-84.
Caner, M., 2009. Lasso-type GMM estimator. Econometric Theory 25, 270-290.
Caner, M., Han, X., 2013. Selecting the correct number of factors in approximate factor models: the large panel case with Bridge estimators. Working paper, North Carolina State University.
Caner, M., Knight, K., 2013. An alternative to unit root tests: Bridge estimators differentiate between nonstationary versus stationary models and select optimal lag. Journal of Statistical Planning and Inference 143, 691-715.
Chan, F., Mancini-Griffoli, T., Pauwels, L. L., 2008. Stability tests for heterogeneous panel. Working paper, Curtin University of Technology.
Cheng, X., Liao, Z., 2013. Select the valid and relevant moments: an information-based LASSO for GMM with many moments. Working paper, University of Pennsylvania.
Cheng, X., Liao, Z., Schorfheide, F., 2014. Shrinkage estimation of high-dimensional factor models with structural instabilities. NBER Working Paper Series 19792, National Bureau of Economic Research.
De Wachter, S., Tzavalis, E., 2005. Monte Carlo comparison of model and moment selection and classical inference approaches to break detection in panel data models. Economics Letters 99, 91-96.
De Wachter, S., Tzavalis, E., 2012. Detection of structural breaks in linear dynamic panel data models. Computational Statistics and Data Analysis 56, 3020-3034.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.
Fan, J., Liao, Y., 2011. Ultra high dimensional variable selection with endogenous covariates. Working paper, Princeton University.
García, P. E., 2011. Instrumental variable estimation and selection with many weak and irrelevant instruments. Working paper, University of Wisconsin, Madison.
Harville, D. A., 1997. Matrix Algebra from a Statistician’s Perspective. Springer, New York.
Hsu, C-C., Lin, C-C., 2012. Change-point estimation for nonstationary panel. Working paper, National Central University.
Islam, N., 1995. Growth empirics: a panel data approach. The Quarterly Journal of Economics 110(4), 1127-1170.
Jones, C. I., 2002. Sources of U.S. economic growth in a world of ideas. American Economic Review 92, 220-239.
Kim, D., 2011. Estimating a common deterministic time trend break in large panels with cross sectional dependence. Journal of Econometrics 164, 310-330.
Kim, D., 2012. Common breaks in time trends for large panel data with a factor structure. Working paper, University of Virginia.
Knight, K., Fu, W., 2000. Asymptotics for Lasso-type estimators. Annals of Statistics 28, 1356-1378.
Kock, A. B., 2013. Oracle efficient variable selection in random and fixed effects panel data models. Econometric Theory 29, 115-152.
Kurozumi, E., 2012. Testing for multiple structural changes with non-homogeneous regressors. Working paper, Hitotsubashi University.
Liao, W., Wang, P., 2012. Structural breaks in panel data models: a common distribution approach. Working paper, HKUST.
Liao, Z., 2013. Adaptive GMM shrinkage estimation with consistent moment selection. Econometric Theory 29, 857-904.
Liao, Z., Phillips, P. C. B., 2014. Automated estimation of vector error correction models. Econometric Theory, forthcoming.
Lu, X., Su, L., 2013. Shrinkage estimation of dynamic panel data models with interactive fixed effects. Working paper, HKUST.
Meurant, G., 1992. A review on the inverse of symmetric tridiagonal and block tridiagonal matrices. SIAM Journal on Matrix Analysis and Applications 13, 707-728.
Molinari, L. G., 2008. Determinant of block tridiagonal matrices. Linear Algebra and Its Applications 429, 2221-2226.
Pesaran, M. H., 2006. Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74, 967-1012.
Pesaran, M. H., Yamagata, T., 2008. Testing slope homogeneity in large panels. Journal of Econometrics 142, 50-93.
Qian, J., Su, L., 2013. Shrinkage estimation of regression models with multiple structural changes. Working paper, Singapore Management University.
Qu, Z., Perron, P., 2007. Estimating and testing structural changes in multiple regressions. Econometrica 75, 459-502.
Ran, R-S., Huang, T-Z., 2006. The inverses of block tridiagonal matrices. Applied Mathematics and Computation 179, 243-247.
Rinaldo, A., 2009. Properties and refinement of the fused Lasso. Annals of Statistics 37, 2922-2952.
Romer, P. M., 1986. Increasing returns and long-run growth. Journal of Political Economy 94, 1002-1037.
Su, L., Chen, Q., 2013. Testing homogeneity in panel data models with interactive fixed effects. Econometric Theory 29, 1079-1135.
Su, L., Shi, Z., Phillips, P. C. B., 2013. Identifying latent structures in panel. Working paper, Dept. of Economics, Yale University.
Su, L., White, H., 2010. Testing structural change in partially linear models. Econometric Theory 26, 1761-1806.
Tibshirani, R. J., 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K., 2005. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society, Series B 67, 91-108.
Yamazaki, D., Kurozumi, E., 2013. Testing for parameter constancy in the time series direction in fixed-effect panel data models. Working paper, Department of Economics, Hitotsubashi University.
Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68, 49-67.
Zou, H., 2006. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.