Shrinkage Estimation of Common Breaks in Panel Data Models
via Adaptive Group Fused Lasso∗
Junhui Qian
Antai College of Economics and Management, Shanghai Jiao Tong University
Liangjun Su
School of Economics, Singapore Management University
January 30, 2014
Abstract
In this paper we consider estimation and inference of common breaks in panel data models via
adaptive group fused lasso. We consider two approaches — penalized least squares (PLS) for first-
differenced models without endogenous regressors, and penalized GMM (PGMM) for first-differenced
models with endogeneity. We show that with probability tending to one both methods can correctly
determine the unknown number of breaks and estimate the common break dates consistently. We
obtain estimates of the regression coefficients via post-Lasso and establish their asymptotic distributions. We also propose and validate a data-driven method to determine the tuning parameter
used in the Lasso procedure. Monte Carlo simulations demonstrate that both the PLS and PGMM
estimation methods work well in finite samples. We apply our PGMM method to study the effect of
foreign direct investment (FDI) on economic growth using a panel of 88 countries and regions from
1973 to 2012 and find multiple breaks in the model.
JEL Classification: C13, C23, C33, C51
Key Words: Adaptive Lasso; Change point; Group Lasso; Fused Lasso; Panel data; Penalized least
squares; Penalized GMM; Structural change
∗The authors express their sincere appreciation to Chihwa Kao for discussions on the subject matter. Su gratefully
acknowledges the Singapore Ministry of Education for Academic Research Fund under grant number MOE2012-T2-2-021.
Address Correspondence to: Liangjun Su, School of Economics, Singapore Management University, 90 Stamford Road,
Singapore 178903; E-mail: [email protected], Phone: +65 6828 0386.
1 Introduction
Recently there has been a growing literature on the estimation and tests of common breaks in panel
data models in which there are N individual units and T time series observations for each individual.
Depending on whether T is allowed to pass to infinity, the model is called "short" for fixed T and "large" (or of large dimension) if T passes to infinity. Implicitly, one usually allows N to pass to infinity in panel data models.¹ Most of the literature falls into two categories depending on whether the parameters of
interest are allowed to be heterogeneous across individuals or not. The first category focuses on homogeneous panel data models and includes De Wachter and Tzavalis (2005), Baltagi et al. (2012), and De Wachter and Tzavalis (2012). De Wachter and Tzavalis (2005) compare the relative performance of two model and moment selection methods in detecting breaks in short panels; Baltagi et al. (2012) consider the estimation and identification of change points in large dimensional panel models with either stationary or nonstationary regressors and error terms; De Wachter and Tzavalis (2012) develop a testing procedure for
common breaks in short linear dynamic panel data models. The second category considers estimation and inference of common breaks in heterogeneous panel data models; see Bai (2010), Kim (2011, 2012), Hsu and Lin (2012), and Baltagi et al. (2013), among others. Bai (2010) establishes the asymptotic properties of the estimated break point in a location-scale heterogeneous panel data model with either fixed or large T; Kim (2011) extends Bai's (2010) method and develops an estimation procedure for a common deterministic time trend break in large heterogeneous panels with a multi-factor error structure; Kim (2012) continues the study by estimating the common break date and common factors jointly; Hsu and Lin (2012) extend Bai's (2010) theory to nonstationary panel data models where the error terms follow an I(1) process; Baltagi et al. (2013) study the estimation of large dimensional static heterogeneous panels with a common break by extending Pesaran's (2006) common correlated effects (CCE) estimation procedure. In addition, Chan et al. (2008) extend the testing procedure of Andrews (2003) from time series to heterogeneous panels where the breaks may occur at different time points across individuals; Liao and Wang (2012) study the estimation of individual-specific structural breaks that exhibit a common distribution in a location-scale panel data model; Yamazaki and Kurozumi (2013) develop an LM-type test for slope homogeneity along the time dimension in fixed-effects panel data models with fixed N and large T.²
A common feature of all of the above works is that a one-time break, common or not, is assumed in the
estimation procedure. Although the assumption of a single break greatly facilitates the estimation and
inference procedure, inferences based on it could be misleading if the underlying model has an unknown
number of multiple breaks. For this reason, a large literature on the estimation and inference of models
with multiple structural changes has been developed in the single or multiple time series framework; see,
¹Bai (1997a), Bai et al. (1998), and Qu and Perron (2007) extend the estimation of single time series models to multiple ones with simultaneous structural breaks where the number of equations is fixed.
²Pesaran and Yamagata (2008) and Su and Chen (2013) propose LM-type tests for slope homogeneity along the cross section dimension in large dimensional linear panel data models with additive fixed effects and interactive fixed effects, respectively.
e.g., Bai (1997a, 1997b), Bai and Perron (1998), Qu and Perron (2007), Su and White (2010), Kurozumi
(2012), and Qian and Su (2013). In view of the fact that the conventional avg- and exp-type test statistics for multiple structural changes require the evaluation of all permissible partitions of the sample, whose number can be prohibitively large, Qian and Su (2013) propose shrinkage estimation of regression models with multiple
structural changes by extending the fused Lasso of Tibshirani et al. (2005) to the time series regression
framework.
In this paper we propose a shrinkage-based methodology for estimating panel data models with an
unknown number of structural changes. The new methodology is most suitable for the view that the regression coefficients in a panel data model may be time-varying but at the same time exhibit a certain sparseness in abrupt changes or breaks. This view seems pertinent in many applied studies using panel data that have a long time span measured in decades. Over such a long time span, shocks to technologies, preferences, policies, and so on may change the statistical relationship that applied economists seek to discover; but the shocks tend to be small over a relatively short time interval, so that they do not alter the statistical relationship in the short run. In this case, one has to allow the parameters in the model to change over time in an unknown way while recognizing that the parameters do not alter from one time period to the next in every period. Multiple structural breaks may occur during the whole time span, but the number of breaks is generally small in comparison with the total number of time periods in the data, resulting in the sparseness of the breaks.
In terms of econometrics methodology, this paper extends the Lasso-type shrinkage approach in Qian
and Su (2013) to panel data settings. To the best of our knowledge, this is the first work in the literature that deals explicitly with panel data models with possibly multiple structural changes.³ To stay focused, we consider homogeneous linear panel data models with an unknown number of common breaks, and we do not allow cross section dependence. The extension to heterogeneous panel data models and to panel data
models with cross section dependence will be discussed at the end of Section 7. For the advantages of using panel data to study common breaks, we refer the reader directly to Bai (2010) and De Wachter and Tzavalis (2012). Despite the fact that Lasso-type shrinkage estimation has a long history and
wide applications in statistics (see, e.g., Tibshirani (1996), Knight and Fu (2000), Fan and Li (2001)),
the application of Lasso-type shrinkage techniques in econometrics has a relatively short history, but the number of applications has been increasing rapidly in the last few years. For example,
Caner (2009) and Fan and Liao (2011) consider covariate selection in GMM estimation; Belloni et al.
(2012) and García (2011) consider selection of instruments in the GMM framework; Liao (2013) provides
a shrinkage GMM method for moment selection and Cheng and Liao (2013) consider the selection of valid
and relevant moments via penalized GMM; Liao and Phillips (2014) apply adaptive shrinkage techniques
³Bai (2010, Section 6) discusses the case of multiple breaks. As he remarks, if the number of breaks is given, the one-
at-a-time approach of Bai (1997b) can be used to estimate the break dates, and if the number of breaks is unknown, a test
for existence of break point can be applied to each subsample before estimating a break point. Alternatively, one can use
information criteria to determine the number of breaks in the latter case, but further investigation is called for.
to cointegrated systems; Kock (2013) considers Bridge estimators of static linear panel data models
with random or fixed effects; Caner and Knight (2013) apply Bridge estimators to differentiate a unit
root from a stationary alternative; Caner and Han (2013) propose a Bridge estimator for pure factor models and show its selection consistency; Lu and Su (2013) apply the adaptive group Lasso to choose
both regressors and the number of factors in panel data models with factor structures; Su et al. (2013)
propose a procedure called the classifier-Lasso to estimate a latent panel structure; Cheng et al. (2014)
provide an adaptive group Lasso estimator for pure factor structures with a one-time structural break.
This paper adds to the literature by applying the shrinkage idea to panel data models with an unknown
number of breaks.
We propose two approaches, penalized least squares (PLS) and penalized generalized method of moments
(PGMM), for the estimation of the panel data model with an unknown number of breaks. We apply first
differencing to remove the fixed effects in the equation and focus on the first-differenced equation. When
there is no endogeneity issue in the first-differenced equation, we propose to apply the PLS to estimate
the unknown number of break points and the regime-specific regression coefficients jointly, where the
penalty term is imposed through the adaptive group fused Lasso (AGFL) component. In the presence
of endogeneity in the first-differenced equation, which may arise from endogenous regressors or lagged
dependent variables in the original fixed-effects equation, we propose to apply the PGMM to estimate
the unknown number of break points and the regime-specific regression coefficients jointly where, again, the penalty term is imposed through the AGFL component. Unlike Qian and Su (2013), who can only establish that the group fused Lasso cannot underestimate the number of breaks in a time series regression and that all the break fractions (but not the break dates) can be consistently estimated as in Bai and Perron (1998), we show that with probability approaching one (w.p.a.1) both of our PLS and PGMM methods can correctly determine the unknown number of breaks and estimate the common break dates consistently. We obtain estimates of the regression coefficients via post-Lasso and establish their asymptotic distributions. We also propose and validate a data-driven method to determine the tuning parameter used in the Lasso procedure.
Both the PLS and PGMM problems can be solved numerically using a fast block-coordinate descent algorithm.
Monte Carlo simulations show that our methods perform well in finite samples. First, the probability
of correctly estimating the number of breaks (0, 1, and 2), as N increases from 50 to 500, converges
to 100% quickly. Even when N = 50 and T = 6, our methods are reliable in detecting the number of
breaks in most cases. Second, conditional on the correct estimation of the number of breaks, our methods
accurately estimate the break dates in finite samples.
As an empirical illustration, we employ our PGMM method to evaluate the effect of foreign direct
investment (FDI) inflow on economic growth. We estimate a dynamic panel data model with possibly
multiple breaks using the PGMM approach. We find that, with a tuning parameter selected via minimiz-
ing a BIC-type information criterion, there are four breaks (five regimes) in the span of seven five-year
periods. In each regime, the post-Lasso estimation finds a significant positive effect of FDI inflow on GDP
growth. In contrast, if we estimate a usual dynamic panel data model with time-invariant parameters, we
would find this effect to be negative, although statistically insignificant. This empirical example illustrates
the perils of employing panel data models with restrictions on the number of breaks. Our contribution makes such restrictions unnecessary.
The rest of the paper is organized as follows. Section 2 introduces our fixed-effects panel data model and the PLS and PGMM estimation of the model, depending on whether endogeneity is present in the first-differenced equation. Sections 3 and 4 analyze the asymptotic properties of the PLS and PGMM estimators,
respectively. Section 5 reports the Monte Carlo simulation results. Section 6 provides an empirical
application and Section 7 concludes.
NOTATION. Throughout the paper we adopt the following notation. For an m × n real matrix A, we denote its transpose as A′, its Frobenius norm as ‖A‖, and its spectral norm as ‖A‖_sp. When A is symmetric, we use µ_max(A) and µ_min(A) to denote its largest and smallest eigenvalues, respectively. I_p denotes a p × p identity matrix and 0_{a×b} an a × b matrix of zeros. We use "p.d." to abbreviate "positive definite". The operator →_P denotes convergence in probability, →_D convergence in distribution, and plim probability limit. Let ∆ and ∆² denote the difference operators of order 1 and 2, respectively. In addition, we use TriD(·,·)_T to denote a symmetric block tridiagonal matrix:

TriD(A, D)_T ≡
\begin{pmatrix}
D_1 & -A_2' & & & \\
-A_2 & D_2 & -A_3' & & \\
 & -A_3 & D_3 & \ddots & \\
 & & \ddots & \ddots & -A_T' \\
 & & & -A_T & D_T
\end{pmatrix}

where the D_t's are symmetric, the A_t's are square matrices, and empty blocks denote matrices of zeros. By Molinari (2008), the determinant of TriD(A, D)_T is given by det(TriD(A, D)_T) = ∏_{t=1}^T det(Λ_t), where Λ_1 = D_1 and Λ_t = D_t − A_t Λ_{t−1}^{−1} A_t′ for t = 2, ..., T. By Meurant (1992) and Ran and Huang (2006), one can also calculate the inverse of TriD(A, D)_T recursively.
2 Shrinkage estimation of linear panel data models with multiple breaks
In this section we consider a linear panel data model with an unknown number of breaks, which we
estimate via the adaptive group fused Lasso.
2.1 The model
Consider the following linear panel data model
y_it = µ_i + β_t′x_it + u_it,  i = 1, ..., N,  t = 1, ..., T (T ≥ 2),  (2.1)

where x_it is a p × 1 vector of regressors, u_it is the error term with zero mean, β_t is a p × 1 vector of unknown coefficients, and µ_i is an individual fixed effect that may be correlated with x_it. We assume that N passes to infinity and T can either be fixed or pass to infinity.
Like Qian and Su (2013), we assume that β_1, ..., β_T exhibit a certain sparse nature such that the total number of distinct vectors in the set is given by m + 1, which is unknown but typically much smaller than T. More specifically, we assume that

β_t = α_j  for t = T_{j−1}, ..., T_j − 1 and j = 1, ..., m + 1,

where we adopt the convention that T_0 = 1 and T_{m+1} = T + 1. The indices T_1, ..., T_m indicate the m unobserved break points/dates, and m + 1 denotes the total number of regimes. We are interested in estimating the unknown number m of breaks, the break dates, and the regression coefficients. Let α_m = (α_1′, ..., α_{m+1}′)′ and T_m = {T_1, ..., T_m}. Throughout, we denote the true value of a parameter with a superscript 0. In particular, we use m0, α0_{m0} = (α0_1′, ..., α0_{m0+1}′)′, and T0_{m0} = {T0_1, ..., T0_{m0}} to denote the true number of breaks, the vector of true regression coefficients, and the set of true break dates, respectively. We assume that m0 is a fixed finite integer and T0_1 ≥ 2 but allow T0_{m0} = T. When T0_{m0} = T, the last break occurs at the end of the sample (cf. Andrews (2003)) and the (m0 + 1)th regime has only one observation for each individual time series.
To eliminate the effect of µ_i in the estimation procedure, we consider the first-differenced equation

∆y_it = β_t′x_it − β_{t−1}′x_{i,t−1} + ∆u_it
      = β_t′∆x_it + (β_t − β_{t−1})′x_{i,t−1} + ∆u_it,

where, e.g., ∆y_it = y_it − y_{i,t−1} for i = 1, ..., N and t = 2, ..., T. We consider two cases:

(a) E[∆u_it x_it] = 0 and E[∆u_it x_{i,t−1}] = 0;

(b) E[∆u_it x_it] ≠ 0 or E[∆u_it x_{i,t−1}] ≠ 0.

Case (a) occurs when x_it is strictly exogenous in the sense that E(u_it|x_i) = 0 a.s., where x_i = (x_i1, ..., x_iT)′. But strict exogeneity is not necessary for case (a); a sufficient condition for (a) to hold is E(∆u_it|x_it, x_{i,t−1}) = 0. Case (b) occurs when x_it contains either lagged dependent variables (e.g., y_{i,t−1}) or endogenous regressors that are correlated with u_it. In case (b) we assume the existence of a q × 1 vector of instruments z_it for (x_it, x_{i,t−1}), where q ≥ p.

Since neither m nor the break dates are known and m is typically much smaller than T, we are motivated to consider the estimation of the β_t's and T_m via a variant of the fused Lasso à la Tibshirani et al. (2005). We propose two approaches: PLS estimation for case (a) and PGMM estimation for case (b).
2.2 Penalized least squares (PLS) estimation
In case (a), we propose to estimate β = (β_1′, ..., β_T′)′ by minimizing the following PLS objective function

V_{1NT,λ1}(β) = (1/N) ∑_{i=1}^{N} ∑_{t=2}^{T} (∆y_it − β_t′x_it + β_{t−1}′x_{i,t−1})² + λ1 ∑_{t=2}^{T} ẇ_t ‖β_t − β_{t−1}‖,  (2.2)

where λ1 = λ1(N, T) ≥ 0 is a tuning parameter and ẇ_t is a data-driven weight defined by

ẇ_t = ‖β̇_t − β̇_{t−1}‖^{−κ1},  t = 2, ..., T,  (2.3)

where the β̇_t are preliminary estimates of β_t and κ1 is a user-specified positive constant that usually takes the value 2 in the literature. Since the objective function in (2.2) is convex in β, it is easy to obtain the solution β̂ = (β̂_1′, ..., β̂_T′)′, where we suppress the dependence of β̂_t = β̂_t(λ1) on λ1 as long as no confusion arises. We will propose a data-driven method to choose λ1 in Section 3.4.
For a given solution {β̂_t} to (2.2), the set of estimated break dates is given by T̂_m̂ = {T̂_1, ..., T̂_m̂}, where 1 < T̂_1 < ... < T̂_m̂ ≤ T, such that ‖β̂_t − β̂_{t−1}‖ ≠ 0 at t = T̂_j for some j ∈ {1, ..., m̂}. T̂_m̂ divides the time interval [1, T] into m̂ + 1 regimes such that the parameter estimates remain constant within each regime. Let T̂_0 = 1 and T̂_{m̂+1} = T + 1. Define α̂_j = α̂_j(T̂_m̂) = β̂_{T̂_{j−1}} as the estimate of α_j for j = 1, ..., m̂ + 1. Frequently we suppress the dependence of α̂_j on T̂_m̂ (and λ1) unless necessary. Let α̂_m̂ = α̂_m̂(T̂_m̂) = (α̂_1(T̂_m̂)′, ..., α̂_{m̂+1}(T̂_m̂)′)′.
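The break-date bookkeeping is mechanical; a small sketch (ours, with a numerical tolerance standing in for the exact-zero condition) is:

```python
import numpy as np

def break_dates(beta_hat, tol=1e-8):
    """Return estimated break dates {t : ||beta_t - beta_{t-1}|| != 0}, using the
    paper's 1-indexed time t = 2..T (row 0 of beta_hat is period 1); tol guards
    against floating-point noise in place of the exact-zero condition."""
    T = beta_hat.shape[0]
    return [t + 1 for t in range(1, T) if np.linalg.norm(beta_hat[t] - beta_hat[t - 1]) > tol]

# Example: T = 6 periods, p = 1, one break between periods 3 and 4.
beta_hat = np.array([[1.0], [1.0], [1.0], [2.0], [2.0], [2.0]])
print(break_dates(beta_hat))  # [4]
```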
Apparently, the objective function in (2.2) is closely related to the literature on the adaptive Lasso (Zou (2006)), the group Lasso (Yuan and Lin (2006)), the fused Lasso (Tibshirani et al. (2005) and Rinaldo (2009)), and the group fused Lasso (Qian and Su (2013)). Like Qian and Su (2013), we use the Frobenius norm ‖·‖ for the vector difference β_t − β_{t−1}, which generalizes the fused Lasso to the group fused Lasso. Unlike Qian and Su (2013), who do not have any weights to use in their time series regression, our panel regression allows us to apply the adaptive weights {ẇ_t}, yielding an adaptive Lasso procedure. For this reason, we call our estimation procedure an adaptive group fused Lasso (AGFL) procedure.
To obtain {ẇ_t}, we obtain the preliminary estimate β̇ = (β̇_1′, ..., β̇_T′)′ by minimizing the first term in the definition of V_{1NT,λ1}(β) in (2.2). Let φ_{ab,ts} = (1/N) ∑_{i=1}^N a_it b_is′ and φ_{ab,t} = φ_{ab,tt} for t, s = 1, ..., T and a, b = x, ∆x, ∆y, ∆²y, ∆u, or ∆²u. For example, φ_{x∆²y,t,t+1} = (1/N) ∑_{i=1}^N x_it ∆²y_{i,t+1} for t = 2, ..., T − 1. We can readily demonstrate that β̇ = Q̇_{NT}^{−1} Ṙ^y_{NT}, where

Q̇_{NT} = TriD(Q†, Q)_T,  (2.4)

Ṙ^a_{NT} = (−φ′_{x∆a,2}, −φ′_{x∆²a,2,3}, −φ′_{x∆²a,3,4}, ..., −φ′_{x∆²a,T−1,T}, φ′_{x∆a,T})′,  a = y or u,  (2.5)

Q_t = φ_{xx,t} for t = 1 and T, Q_t = 2φ_{xx,t} for 2 ≤ t ≤ T − 1, and Q†_t = φ_{xx,t,t−1} for t = 2, ..., T.
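The preliminary estimator is a single least-squares solve; the sketch below (our own illustration, with the array conventions used above) accumulates the normal equations of the first term of (2.2), whose coefficient matrix has exactly the block tridiagonal structure TriD(Q†, Q)_T of (2.4):

```python
import numpy as np

def preliminary_beta(dy, x):
    """Solve the unpenalized first term of (2.2) for beta_dot = (beta_1', ..., beta_T')'.
    Builds the Tp x Tp normal-equation system (equivalent to Q_NT^{-1} R_NT in
    (2.4)-(2.5)) and solves it by least squares.
    dy: (N, T) first differences (column 0 unused); x: (N, T, p) regressors."""
    N, T, p = x.shape
    Q = np.zeros((T * p, T * p))
    R = np.zeros(T * p)
    for t in range(1, T):  # normal equations of sum_i (dy_t - b_t'x_t + b_{t-1}'x_{t-1})^2
        Xt, Xs = x[:, t], x[:, t - 1]
        Q[t*p:(t+1)*p, t*p:(t+1)*p] += Xt.T @ Xt / N
        Q[(t-1)*p:t*p, (t-1)*p:t*p] += Xs.T @ Xs / N
        Q[t*p:(t+1)*p, (t-1)*p:t*p] -= Xt.T @ Xs / N
        Q[(t-1)*p:t*p, t*p:(t+1)*p] -= Xs.T @ Xt / N
        R[t*p:(t+1)*p] += Xt.T @ dy[:, t] / N
        R[(t-1)*p:t*p] -= Xs.T @ dy[:, t] / N
        # Q is block tridiagonal, as in TriD(Q_dagger, Q)_T
    return np.linalg.lstsq(Q, R, rcond=None)[0].reshape(T, p)
```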
2.2.1 Post-Lasso estimation
For any α_m = (α_1′, ..., α_{m+1}′)′ and T_m = {T_1, ..., T_m} with 1 < T_1 < ··· < T_m ≤ T, we define⁴

Q_{1NT}(α_m; T_m) = (1/N) ∑_{j=1}^{m+1} ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N (∆y_it − α_j′∆x_it)² + (1/N) ∑_{j=1}^m ∑_{i=1}^N (∆y_{iT_j} − α_{j+1}′x_{iT_j} + α_j′x_{i,T_j−1})²,  (2.6)

where ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N (∆y_it − α_j′∆x_it)² corresponds to the "sum of squared errors" for observations in the jth artificial regime, with time series observations indexed by integers in the interval [T_{j−1}, T_j − 1], and ∑_{i=1}^N (∆y_{iT_j} − α_{j+1}′x_{iT_j} + α_j′x_{i,T_j−1})² corresponds to the "sum of squared errors" for observations when one moves from the jth regime to the (j+1)th regime. The second term in (2.6) is important and helps to improve the asymptotic efficiency when T is fixed. It can be omitted if min_{0≤j≤m} |T_{j+1} − T_j| → ∞ as N → ∞ and only asymptotic efficiency is of concern, but we keep it to improve the finite sample performance of the post-Lasso estimate in this case. One can choose α_m to minimize the objective function in (2.6). We denote the solution as α̂^p_m(T_m) = (α̂^p_1(T_m)′, ..., α̂^p_{m+1}(T_m)′)′. Setting T_m to T̂_m̂, the set of estimated break dates from the AGFL procedure, we obtain the post-Lasso estimator

α̂^p_m̂ = α̂^p_m̂(T̂_m̂) = Φ̂^{−1}_{NT} Ψ̂^y_{NT},

where Φ̂_{NT} and Ψ̂^y_{NT} are p(m̂+1) × p(m̂+1) and p(m̂+1) × 1 matrices that are respectively defined in (A.4) and (A.5) in the appendix. We shall study the limiting distribution of α̂^p_m̂ below.
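Minimizing (2.6) is itself a least-squares problem once the break dates are fixed; the sketch below (our own illustration, with a 0-based array convention) stacks the within-regime and junction residuals and solves for the α_j jointly:

```python
import numpy as np

def post_lasso(dy, x, breaks):
    """Minimize Q_{1NT} in (2.6) by stacked least squares.
    breaks: estimated break dates in the paper's 1-indexed time, e.g. [4] means
    regimes {1..3} and {4..T}. Returns an (m+1, p) array of alpha_j estimates.
    dy: (N, T) first differences (column 0 unused); x: (N, T, p) regressors."""
    N, T, p = x.shape
    Tj = [1] + list(breaks) + [T + 1]   # T_0 = 1, ..., T_{m+1} = T + 1
    m1 = len(Tj) - 1                    # number of regimes, m + 1
    rows_X, rows_y = [], []
    for j in range(m1):
        # within-regime terms: paper time t runs over T_{j-1}+1 .. T_j - 1
        for t in range(Tj[j] + 1, Tj[j + 1]):
            Z = np.zeros((N, m1 * p))
            Z[:, j*p:(j+1)*p] = x[:, t - 1] - x[:, t - 2]   # Delta x_it
            rows_X.append(Z); rows_y.append(dy[:, t - 1])
        # junction term at t = T_j, linking regimes j and j+1
        if j + 1 < m1:
            tb = Tj[j + 1]
            Z = np.zeros((N, m1 * p))
            Z[:, (j+1)*p:(j+2)*p] = x[:, tb - 1]            # alpha_{j+1} loads x_{iT_j}
            Z[:, j*p:(j+1)*p] = -x[:, tb - 2]               # alpha_j loads -x_{i,T_j-1}
            rows_X.append(Z); rows_y.append(dy[:, tb - 1])
    X = np.vstack(rows_X); y = np.concatenate(rows_y)
    return np.linalg.lstsq(X, y, rcond=None)[0].reshape(m1, p)
```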
2.3 Penalized GMM (PGMM) estimation
In case (b), we propose to estimate β by minimizing the following PGMM objective function

V_{2NT,λ2}(β) = ∑_{t=2}^{T} [(1/N) ∑_{i=1}^N ρ_it(β_t, β_{t−1})]′ W_t [(1/N) ∑_{i=1}^N ρ_it(β_t, β_{t−1})] + λ2 ∑_{t=2}^{T} ẇ_t ‖β_t − β_{t−1}‖,  (2.7)

where ρ_it(β_t, β_{t−1}) = z_it(∆y_it − β_t′x_it + β_{t−1}′x_{i,t−1}), λ2 = λ2(N, T) ≥ 0 is a tuning parameter, W_t = W_{tNT} is a q × q symmetric positive definite weight matrix for t = 2, ..., T, and ẇ_t is a data-driven weight defined by

ẇ_t = ‖β̇_t − β̇_{t−1}‖^{−κ2},  t = 2, ..., T,  (2.8)

where the β̇_t are preliminary estimates of β_t and κ2 is a user-specified positive constant that usually takes the value 2 in the literature. Clearly, the first term in the definition of V_{2NT,λ2}(β) in (2.7) is different from the usual GMM objective function in the panel setting with time-invariant parameters, where only one weight matrix (W, say) is needed and the double summation ∑_{t=2}^T ∑_{i=1}^N occurs twice, once before the single weight matrix and once after it. It is also different from the GMM-type objective function in Andrews (1993), who considers the test of a single structural change in a time series regression. Since the objective function in (2.7) is convex in β, it is easy to obtain the solution β̂ = (β̂_1′, ..., β̂_T′)′, where we frequently suppress the dependence of β̂_t = β̂_t(λ2) on λ2. We will propose a data-driven method to choose λ2 in Section 4.4.

⁴By default, the summation ∑_{t=a}^{b} is zero if b < a.
For a given solution {β̂_t} to (2.7), we can find the set of estimated break dates T̂_m̂ = {T̂_1, ..., T̂_m̂} as in Section 2.2. As before, T̂_m̂ divides [1, T] into m̂ + 1 regimes such that the parameter estimates remain constant within each regime and ‖β̂_t − β̂_{t−1}‖ ≠ 0 whenever t = T̂_j for some j = 1, ..., m̂. Let T̂_0 = 1 and T̂_{m̂+1} = T + 1. Define α̂_j = α̂_j(T̂_m̂) = β̂_{T̂_{j−1}} as the estimate of α_j for j = 1, ..., m̂ + 1. Let α̂_m̂ = α̂_m̂(T̂_m̂) = (α̂_1(T̂_m̂)′, ..., α̂_{m̂+1}(T̂_m̂)′)′.
To obtain the adaptive weights {ẇ_t}, we obtain the preliminary estimate β̇ = (β̇_1′, ..., β̇_T′)′ by minimizing the first term in the definition of V_{2NT,λ2}(β) in (2.7). Let Q_{ab,t,s} = φ′_{ab,t,s} W_t φ_{ab,t,s} and Q_{ab,t} = Q_{ab,t,t} for t, s = 1, 2, ..., T. Let Q†_{zx,t,t−1} = φ′_{zx,t} W_t φ_{zx,t,t−1} for t = 2, ..., T. It is easy to show that β̇ = Q̇^{−1}_{NT} Ṙ^y_{NT}, where

Q̇_{NT} = TriD(Q†, Q)_T,  (2.9)

Ṙ^a_{NT} = (−(φ′_{zx,2}W_2φ_{z∆a,2})′, (φ′_{zx,2}W_2φ_{z∆a,2} − φ′_{zx,3,2}W_3φ_{z∆a,3})′, ...,
  (φ′_{zx,T−1}W_{T−1}φ_{z∆a,T−1} − φ′_{zx,T,T−1}W_Tφ_{z∆a,T})′, (φ′_{zx,T}W_Tφ_{z∆a,T})′)′,  a = y or u,  (2.10)

Q_1 = Q_{zx,1,2}, Q_t = Q_{zx,t} + Q_{zx,t+1,t} for t = 2, ..., T − 1, Q_T = Q_{zx,T}, and Q†_t = Q†_{zx,t,t−1} for t = 2, ..., T.
2.3.1 Post-Lasso estimation
For any α_m = (α_1′, ..., α_{m+1}′)′ and T_m = {T_1, ..., T_m} with 1 < T_1 < ··· < T_m ≤ T, we define⁵

Q_{2NT}(α_m; T_m) = ∑_{j=1}^{m+1} [(1/N) ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N ρ_it(α_j)]′ W^p_j [(1/N) ∑_{t=T_{j−1}+1}^{T_j−1} ∑_{i=1}^N ρ_it(α_j)]
  + ∑_{j=1}^{m} [(1/N) ∑_{i=1}^N ρ_{1iT_j}(α_{j+1}, α_j)]′ W_{T_j} [(1/N) ∑_{i=1}^N ρ_{1iT_j}(α_{j+1}, α_j)],  (2.11)

where ρ_it(α_j) = z_it(∆y_it − α_j′∆x_it), ρ_{1iT_j}(α_{j+1}, α_j) = z_{iT_j}(∆y_{iT_j} − α_{j+1}′x_{iT_j} + α_j′x_{i,T_j−1}), and W^p_j is a regime-specific q × q symmetric weight matrix that is positive definite in large samples. As in the case of PLS estimation, the second term in (2.11) is important when T is fixed and can be omitted in the case where min_{0≤j≤m} |T_{j+1} − T_j| → ∞ as N → ∞. Let α̂^p_m(T_m) = (α̂^p_1(T_m)′, ..., α̂^p_{m+1}(T_m)′)′ denote the minimizer of Q_{2NT} defined in (2.11). Setting T_m to T̂_m̂, the set of estimated break dates, we obtain the post-Lasso estimator

α̂^p_m̂ = α̂^p_m̂(T̂_m̂) = Υ̂^{−1}_{NT} Ξ̂^y_{NT},

⁵By default, the summation ∑_{t=a}^{b} is zero if b < a.
where Υ̂_{NT} and Ξ̂^y_{NT} are p(m̂+1) × p(m̂+1) and p(m̂+1) × 1 matrices that are defined in (B.3) in the appendix. We shall study the limiting distribution of α̂^p_m̂ below.

To obtain the PGMM estimate and the associated post-Lasso estimate, one needs to choose the weight matrices W_t (t = 2, ..., T) and W^p_j (j = 1, ..., m̂ + 1). In the simulation and application below, we adopt a two-step strategy for determining both sets of weights. For W_t, we first obtain the estimate β̂_t by choosing the q × q identity matrix I_q as the weight matrix. In the second step, we specify W_t as the inverse of the estimated covariance matrix of ρ_it(β̂_t, β̂_{t−1}) and obtain an updated estimate of β̂_t. A similar procedure is adopted for determining the weights in post-Lasso estimation.
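A sketch of the second-step weight construction (our own illustration; `rho` stacks the sample moments ρ_it evaluated at the first-step, identity-weighted estimates):

```python
import numpy as np

def second_step_weight(rho):
    """Given rho: (N, q) array of moments rho_it(beta_hat_t, beta_hat_{t-1}) from
    the identity-weighted first step, return W_t as the inverse of the estimated
    covariance matrix of the moments. The uncentered second-moment matrix is used
    (the moments have mean zero under correct specification), and a pseudo-inverse
    guards against near-singularity in small samples."""
    N = rho.shape[0]
    S = rho.T @ rho / N
    return np.linalg.pinv(S)
```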
3 Asymptotic properties of the PLS estimators
In this section we address the asymptotic properties of the PLS estimators.
3.1 Basic assumptions
Let I0_j = T0_j − T0_{j−1} for j = 1, ..., m0 + 1. Define

I_min = min_{1≤j≤m0+1} I0_j,  J_min = min_{1≤j≤m0} ‖α0_{j+1} − α0_j‖,  and  J_max = max_{1≤j≤m0} ‖α0_{j+1} − α0_j‖.

Apparently, I_min denotes the minimum interval length among the m0 + 1 regimes, and J_min and J_max denote the minimum and maximum jump sizes, respectively. In the case of fixed T, I_min does not pass to infinity as N → ∞. If we allow T → ∞, then I_min can either pass to infinity or stay fixed unless otherwise stated. We will maintain the assumption that J_max is always a fixed constant, but J_min can be either fixed or shrinking to zero as either N → ∞ or (N, T) → ∞, where (N, T) → ∞ denotes that both N and T pass to infinity simultaneously.
Let Φ_{ab,l} = (1/N) ∑_{t=T0_{l−1}+1}^{T0_l−1} ∑_{i=1}^N a_it b_it′ for l = 1, ..., m0 + 1 and a, b = ∆x, x, ∆y, and ∆u. Define the p(m0+1) × p(m0+1) matrix Φ_{NT} and the p(m0+1) × 1 vector Ψ^a_{NT}, respectively, by

Φ_{NT} = TriD(Φ†, Φ)_{m0+1},  (3.1)

Ψ^a_{NT} = (Φ′_{∆x∆a,1} − φ′_{x∆a,T0_1−1,T0_1}, Φ′_{∆x∆a,2} − φ′_{x∆a,T0_2−1,T0_2} + φ′_{x∆a,T0_1}, ...,
  Φ′_{∆x∆a,m0} − φ′_{x∆a,T0_{m0}−1,T0_{m0}} + φ′_{x∆a,T0_{m0−1}}, Φ′_{∆x∆a,m0+1} + φ′_{x∆a,T0_{m0}})′,  a = y or u,  (3.2)

where Φ_1 = Φ_{∆x∆x,1} + φ_{xx,T0_1−1}, Φ_l = Φ_{∆x∆x,l} + φ_{xx,T0_l−1} + φ_{xx,T0_{l−1}} for l = 2, ..., m0, Φ_{m0+1} = Φ_{∆x∆x,m0+1} + φ_{xx,T0_{m0}}, and Φ†_{l+1} = φ_{xx,T0_l,T0_l−1} for l = 1, ..., m0. Let D_{m0+1} = diag(√I0_1, ..., √I0_{m0+1}) ⊗ I_p. To study the asymptotic properties of the PLS estimators, we make the following assumptions.
Assumption A.1. (i) Let u_i = (u_i1, ..., u_iT)′. {x_i, u_i} are independently distributed over i.
(ii) E(x_it ∆u_it) = 0 and E(x_{i,t−1} ∆u_it) = 0 for i = 1, ..., N and t = 2, ..., T. max_{1≤i≤N} max_{1≤t≤T} E‖ς_it‖⁴ < C < ∞ for ς = x and u.
(iii) There exists a matrix Q_0 > 0 such that ‖Q̇_{NT} − Q_0‖_sp = o_P(1). There exist two constants c_{Q0} and c̄_{Q0} such that 0 < c_{Q0} ≤ λ_min(Q_0) ≤ λ_max(Q_0) ≤ c̄_{Q0} < ∞.

Assumption A.2. (i) J_max = O(1) and N^{1/2} J_min → c_J ∈ (0, ∞] as N → ∞ or (N, T) → ∞.
(ii) N^{1/2} λ1 J_min^{−κ1} → c ∈ [0, ∞) as N → ∞ or (N, T) → ∞.
(iii) N^{(κ1+1)/2} λ1 → ∞ as N → ∞ or (N, T) → ∞.

Assumption A.3. (i) D^{−1}_{m0+1} Φ_{NT} D^{−1}_{m0+1} →_P Φ_0 > 0.
(ii) √N D^{−1}_{m0+1} Ψ^u_{NT} →_D N(0, Ω_0).
Assumption A.1(i) requires that {x_i, u_i} be independently distributed across i. It may be relaxed to allow for weak forms of cross-sectional dependence at the cost of very lengthy arguments. A.1(ii) specifies moment conditions on {x_it, u_it}. If E(u_it|x_{i,t+1}, x_it) = 0 a.s. for each i and t, then the first part of A.1(ii) is satisfied. In conjunction with A.1(i), A.1(ii) implies that each block element of √N Ṙ^u_{NT} is O_P(1) and T^{−1}N‖Ṙ^u_{NT}‖² = O_P(1) by the Chebyshev inequality. A.1(iii) requires that the limiting matrix Q_0 of the Tp × Tp matrix Q̇_{NT} be well behaved. Let φ0_{xx,t,s} = lim_{N→∞} φ_{xx,t,s} and φ0_{xx,t} = φ0_{xx,t,t}. Let ∆0_1 = φ0_{xx,1}, ∆0_t = 2φ0_{xx,t} − φ0_{xx,t,t−1}(∆0_{t−1})^{−1}φ0′_{xx,t,t−1} for t = 2, ..., T − 1, and ∆0_T = φ0_{xx,T} − φ0_{xx,T,T−1}(∆0_{T−1})^{−1}φ0′_{xx,T,T−1}. Then Q_0 is p.d. if and only if the matrices ∆0_1, ..., ∆0_T are all p.d. Combining A.1(i)-(iii), we prove in Lemma A.1 that √N(β̇_t − β0_t) = O_P(1) for each t = 1, ..., T. Assumption A.2 mainly specifies conditions on J_min, λ1, and N. Note that we allow the minimum break size J_min to shrink to zero as N → ∞, but it cannot shrink to zero faster than N^{−1/2}. In the special case where J_min is bounded away from zero, A.2 can be simplified to

Assumption A.2*. N^{1/2}λ1 → c ∈ [0, ∞) and N^{(κ1+1)/2}λ1 → ∞ as N → ∞ or (N, T) → ∞.

Assumption A.3 specifies conditions that ensure the asymptotic normality of the post-Lasso estimators.
3.2 Consistency
The following theorem establishes the consistency of β̂_t.

Theorem 3.1 Suppose that Assumption A.1 holds. Then (i) T^{−1}‖β̂ − β0‖² = O_P(N^{−1}), and (ii) β̂_t − β0_t = O_P(N^{−1/2}) for each t = 1, ..., T.

Theorems 3.1(i) and (ii) establish the mean square and pointwise convergence rates of β̂_t, respectively. The two results are equivalent in the case of fixed T. When T is allowed to pass to infinity as N → ∞, the proof of Theorem 3.1(ii) demands some extra effort. In particular, we need a close examination of the factorization and inversion properties of symmetric block tridiagonal matrices. See the proof of Theorem 3.1(ii) in Appendix A.
Let T^{0c}_{m0} = {2, ..., T}\T0_{m0}. Let θ0_1 = β0_1 and θ0_t = β0_t − β0_{t−1} for t = 2, ..., T. Let θ̂_1 = β̂_1 and θ̂_t = β̂_t − β̂_{t−1} for t = 2, ..., T. The following theorem establishes selection consistency.

Theorem 3.2 Suppose that Assumptions A.1-A.2 hold. Then P(‖θ̂_t‖ = 0 for all t ∈ T^{0c}_{m0}) → 1 as N → ∞.

Theorem 3.2 says that w.p.a.1 all the zero vectors in {θ0_t, 2 ≤ t ≤ T} must be estimated as exactly zero by the PLS method, so that the number of estimated breaks m̂ cannot be larger than m0 when N is sufficiently large. On the other hand, by Theorem 3.1(ii), we know that the estimates of the nonzero vectors in {θ0_t, 2 ≤ t ≤ T} must be consistent, noting that β̂_t − β̂_{t−1} consistently estimates θ0_t for t ≥ 2. Put together, Theorems 3.1 and 3.2 imply that the AGFL has the ability to identify the true regression model with the correct number of breaks consistently when the minimum break size J_min does not shrink to zero too fast.

Corollary 3.3 Suppose that Assumptions A.1-A.2 hold with c_J = ∞ in Assumption A.2(i). Then (i) lim_{N→∞} P(m̂ = m0) = 1, and (ii) lim_{N→∞} P(T̂_1 = T0_1, ..., T̂_{m0} = T0_{m0} | m̂ = m0) = 1.
The above corollary implies that, as long as J_min remains fixed or shrinks to zero at a rate slower than N^{−1/2} as N → ∞, we can estimate the number of structural changes and all the break dates consistently, regardless of whether T is fixed or passes to infinity. In contrast, Qian and Su (2013, Theorem 3.3) only establish that the group fused Lasso procedure cannot underestimate the number of breaks in a time series regression and that all the break fractions (but not the break dates) can be consistently estimated as in Bai and Perron (1998). More precisely, letting D(A, B) ≡ sup_{b∈B} inf_{a∈A} |a − b| for any two sets A and B, Qian and Su (2013, Theorem 3.2) establish that lim_{T→∞} P(D(T̂_m̂, T0_{m0}) ≤ Tδ_T) = 1 for some sequence δ_T such that δ_T → 0 and Tδ_T → ∞ as T → ∞. In our panel setting, the availability of N cross-sectional units for each time period permits us to obtain the set of consistent preliminary estimates {β̇_t} used to construct the adaptive weights {ẇ_t}. The adaptive nature of our group fused Lasso procedure helps us to identify the exact set of break dates and yields stronger results than those in Qian and Su (2013).
3.3 Limiting distribution of the post-Lasso estimator
In this subsection we study the asymptotic distribution of the post-Lasso estimator α̂^p_m̂(T̂_m̂). Corollary 3.3 implies that w.p.a.1, m̂ = m^0 and T̂_j = T^0_j for j = 1, ..., m^0. It follows that α̂^p_m̂(T̂_m̂) is asymptotically equivalent to the infeasible estimator α̂^p_{m^0}(T^0_{m^0}), which is obtained if one knows the exact set T^0_{m^0} of true break dates. Note that

α̂^p_{m^0}(T^0_{m^0}) = Φ̂^{−1}_{NT} Ψ̂_{yNT},

where Φ̂_{NT} and Ψ̂_{yNT} are defined in (3.1) and (3.2), respectively.
The following theorem reports the limiting distribution of α̂^p_m̂(T̂_m̂) conditional on the large probability event {m̂ = m^0}.
Theorem 3.4 Suppose that Assumptions A.1-A.3 hold with c_J = ∞ in Assumption A.2(i). Then, conditional on m̂ = m^0, we have √N D_{m^0+1}(α̂^p_m̂(T̂_m̂) − α^0) →_D N(0, Φ^{−1}_0 Ω_0 Φ^{−1}_0).
Since we allow I^0_j to be either fixed or diverging to infinity in the case of large T, the α̂^p_j(T̂_m̂)'s may have different convergence rates to their true values. In the special case where I^0_j is proportional to T, α̂^p_j(T̂_m̂) achieves the usual √(NT)-rate of consistency.
3.4 Choosing the tuning parameter λ1
Let α̂_{m̂λ1} ≡ α̂_{m̂λ1}(T̂_{m̂λ1}) = (α̂_1(T̂_{m̂λ1})′, ..., α̂_{m̂λ1+1}(T̂_{m̂λ1})′)′ denote the set of post-Lasso estimates of the regression coefficients based on the break dates in T̂_{m̂λ1} = T̂_{m̂λ1}(λ1), where we make the dependence of various estimates on λ1 explicit. Let σ̂²_{T̂_{m̂λ1}} ≡ (T − 1)^{−1} Q_{1NT}(α̂_{m̂λ1}; T̂_{m̂λ1}). We propose to select the tuning parameter λ1 by minimizing the following information criterion:

IC(λ1) = σ̂²_{T̂_{m̂λ1}} + ρ_{1NT} p (m̂_{λ1} + 1). (3.3)
Denote Ω = [0, λ_max], a bounded interval in R_+. We divide Ω into three subsets Ω_0, Ω_− and Ω_+ as follows:

Ω_0 = {λ1 ∈ Ω : m̂_{λ1} = m^0}, Ω_− = {λ1 ∈ Ω : m̂_{λ1} < m^0}, and Ω_+ = {λ1 ∈ Ω : m̂_{λ1} > m^0}.

Clearly, Ω_0, Ω_− and Ω_+ denote the three subsets of Ω in which the correct, under- and over-estimated numbers of breaks are selected by the adaptive group fused Lasso, respectively. Let λ^0_{1NT} denote an element in Ω_0 that satisfies the conditions on λ1 in Assumptions A.2(ii)-(iii).
Let σ̂²_{NT} ≡ [N(T − 1)]^{−1} Σ_{i=1}^N Σ_{t=2}^T (Δu_it)² and σ²_0 ≡ plim σ̂²_{NT}. To state the next result, we add the following assumptions.
Assumption A.4. (i) plim_{N→∞} min_{1≤j≤m^0} min_{α∈R^p} (N J²_min)^{−1} Σ_{i=1}^N [(α^0_{j+1} − α)′ x_{iT^0_j} − (α^0_j − α)′ x_{i,T^0_j−1}]² ≥ c_α > 0.
(ii) [N(T − 1)]^{−1/2} Σ_{t=2}^T Σ_{i=1}^N Δx_it Δu_it = O_P(1).
(iii) As N → ∞ or (N, T) → ∞, T/[(I_min J_min)² N] → 0.

Assumption A.5. As N → ∞ or (N, T) → ∞, (1 + T/(I_min J²_min)) ρ_{1NT} → 0 and N ρ_{1NT} → ∞.
A.4(i) imposes conditions on the parameters and the observations that are either at the break dates or immediately preceding the break dates. The scaling factor J²_min reflects the fact that we allow the minimum break size J_min to shrink to zero. In the latter case, pooling observations in two adjacent regimes with a break size of order O(J_min) to estimate the regression coefficients within these two regimes is still consistent, with a J^{−1}_min-rate of consistency. Under A.2(i)-(ii), A.4(ii) can be verified under various weak dependence conditions, say, strong mixing or martingale-difference-sequence-type conditions. A.4(iii) imposes restrictions on I_min, J_min and the sample sizes. It is trivially satisfied if I_min ∝ T and J_min remains fixed as N → ∞ or (N, T) → ∞, and it reduces to the condition that c_J = ∞ in Assumption A.2(i) in the case where T is fixed. A.5 reflects the usual conditions for the consistency of model selection, that is, the penalty coefficient ρ_{1NT} cannot shrink to zero either too fast or too slowly. If I_min ∝ T and J^{−1}_min = O(1), the first part of A.5 requires that ρ_{1NT} → 0, which is standard for an information criterion. N^{−1} indicates the probability order of the distance between the first term of the criterion function for an over-parametrized model and that for the true model.
Theorem 3.5 Suppose that Assumptions A.1, A.2(i) and A.3-A.5 hold with c_J = ∞ in Assumption A.2(i). Then

P(inf_{λ1∈Ω_−∪Ω_+} IC(λ1) > IC(λ^0_{1NT})) → 1 as N → ∞.
Theorem 3.5 implies that the λ1's that yield an over- or under-estimated number of breaks fail to minimize the information criterion w.p.a.1. Consequently, the minimizer of IC(λ1) can only be one that produces the correct number of estimated breaks in large samples. Note that we prove the above theorem without requiring λ1 to satisfy Assumptions A.2(ii)-(iii). This indicates that if the correct determination of the number of breaks is our major concern, we can simply choose λ1 to minimize IC(λ1).
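The IC-based choice of λ1 is a one-dimensional search. A hedged sketch, where fit(λ) stands in for the full AGFL-plus-post-Lasso step and returns the pair (σ̂², m̂) for a given λ1 (the toy fit below is purely illustrative, not the estimator):

```python
import numpy as np

def select_lambda(fit, lam_grid, rho, p):
    """Minimize IC(lam) = sigma2(lam) + rho * p * (m_hat(lam) + 1) over a grid.

    `fit` is a placeholder for the estimation step; it must return (sigma2, m_hat).
    """
    ics = []
    for lam in lam_grid:
        sigma2, m_hat = fit(lam)
        ics.append(sigma2 + rho * p * (m_hat + 1))
    return lam_grid[int(np.argmin(ics))]

# Toy fit: under-penalization gives 3 breaks, the right zone gives 1, heavy
# penalization gives 0 and inflates the fit term (true m0 = 1 here).
def toy_fit(lam):
    m_hat = 3 if lam < 0.05 else (1 if lam < 1.0 else 0)
    sigma2 = 1.0 if m_hat >= 1 else 1.5
    return sigma2, m_hat

best = select_lambda(toy_fit, list(np.logspace(-2, 1, 40)), rho=0.02, p=1)
print(toy_fit(best)[1])  # 1  (the selected lambda recovers the true number of breaks)
```

An under-penalized λ pays through the ρ·p·(m̂+1) term, an over-penalized λ through the inflated σ̂², so the minimizer lands in the correct-selection zone.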
4 Asymptotic properties of the PGMM estimators
In this section we address the statistical properties of the PGMM estimators.
4.1 Assumptions
Let φ†_{ab,l+1} = φ′_{ab,T^0_l} W_{T^0_l} φ_{ab,T^0_l,T^0_l−1} for l = 1, ..., m^0 and a, b = z, x, Δx. Define the p(m^0 + 1) × p(m^0 + 1) matrix Υ_{NT} and the p(m^0 + 1) × 1 vector Ξ_{aNT}, respectively, as

Υ_{NT} = TriD(Υ†, Υ)_{m^0+1}, Ξ_{aNT} = (Ξ′_{a,1}, Ξ′_{a,2}, ..., Ξ′_{a,m^0+1})′, a = y or u, (4.1)

where Υ_1 = Φ′_{zΔx,1} W^p_1 Φ_{zΔx,1} + φ′_{zx,T^0_1,T^0_1−1} W_{T^0_1} φ_{zx,T^0_1,T^0_1−1}, Υ_l = Φ′_{zΔx,l} W^p_l Φ_{zΔx,l} + φ′_{zx,T^0_l,T^0_l−1} W_{T^0_l} φ_{zx,T^0_l,T^0_l−1} + φ′_{zx,T^0_{l−1}} W_{T^0_{l−1}} φ_{zx,T^0_{l−1}} for l = 2, ..., m^0, Υ_{m^0+1} = Φ′_{zΔx,m^0+1} W^p_{m^0+1} Φ_{zΔx,m^0+1} + φ′_{zx,T^0_{m^0}} W_{T^0_{m^0}} φ_{zx,T^0_{m^0}}, and Υ†_l = φ†_{xx,l} for l = 2, ..., m^0 + 1. In addition, for a = y or u, Ξ_{a,1} = Φ′_{zΔx,1} W^p_1 Φ_{zΔa,1} − φ′_{zx,T^0_1,T^0_1−1} W_{T^0_1} φ_{zΔa,T^0_1}, Ξ_{a,l} = Φ′_{zΔx,l} W^p_l Φ_{zΔa,l} − φ′_{zx,T^0_l,T^0_l−1} W_{T^0_l} φ_{zΔa,T^0_l} + φ′_{zx,T^0_{l−1}} W_{T^0_{l−1}} φ_{zΔa,T^0_{l−1}} for l = 2, ..., m^0, and Ξ_{a,m^0+1} = Φ′_{zΔx,m^0+1} W^p_{m^0+1} Φ_{zΔa,m^0+1} + φ′_{zx,T^0_{m^0}} W_{T^0_{m^0}} φ_{zΔa,T^0_{m^0}}.
To study the asymptotic properties of the PGMM estimators, we make the following assumptions.
Assumption B.1. (i) Let z_i = (z_{i1}, ..., z_{iT})′. (x_i, z_i, u_i) are independently distributed over i.
(ii) E(z_it Δu_it) = 0 for i = 1, ..., N and t = 2, ..., T. max_{1≤i≤N} max_{1≤t≤T} E‖ς_it‖⁴ < C < ∞ for ς_it = x_it, z_it, and u_it.
(iii) There exists a matrix Q_0 > 0 such that ‖Q̂_{NT} − Q_0‖_sp = o_P(1). There exist two constants c_{Q0} and c̄_{Q0} such that 0 < c_{Q0} ≤ λ_min(Q_0) ≤ λ_max(Q_0) ≤ c̄_{Q0} < ∞.
Assumption B.2. (i) J_max = O(1) and N^{1/2} J_min → c_J ∈ (0, ∞] as N → ∞ or (N, T) → ∞.
(ii) N^{1/2} λ2 J^{−κ2}_min → c ∈ [0, ∞) as N → ∞ or (N, T) → ∞.
(iii) N^{(κ2+1)/2} λ2 → ∞ as N → ∞ or (N, T) → ∞.
Assumption B.3. (i) D^{−1}_{m^0+1} Υ_{NT} D^{−1}_{m^0+1} →_P Υ_0 > 0.
(ii) √N D^{−1}_{m^0+1} Ξ_{uNT} →_D N(0, Σ_0).
Assumptions B.1(i)-(iii) parallel Assumptions A.1(i)-(iii). B.1(ii) specifies moment conditions on x_it, z_it, and u_it. In conjunction with B.1(i), B.1(ii) implies that each block element of √N R_{uNT} is O_P(1) and that T^{−1} N ‖R_{uNT}‖² = O_P(1) by the Chebyshev inequality. Combining B.1(i)-(iii), we prove in Lemma B.1 that √N(β̇_t − β^0_t) = O_P(1) for each t = 1, ..., T. Assumption B.2 mainly specifies conditions on J_min, λ2, and N. Note that we allow the minimum break size J_min to shrink to zero as N → ∞. In the special case where J_min is bounded away from zero, B.2 can be simplified to

Assumption B.2*. N^{1/2} λ2 → c ∈ [0, ∞) and N^{(κ2+1)/2} λ2 → ∞ as N → ∞ or (N, T) → ∞.

Assumption B.3 specifies conditions to ensure the asymptotic normality of the post-Lasso estimator.
4.2 Consistency
The following theorem establishes the consistency of β̃_t.

Theorem 4.1 Suppose that Assumption B.1 holds. Then (i) T^{−1} ‖β̃ − β^0‖² = O_P(N^{−1}), and (ii) β̃_t − β^0_t = O_P(N^{−1/2}) for each t = 1, ..., T.
Theorems 4.1(i) and (ii) establish the mean square and pointwise convergence rates of β̃_t, respectively. The two results are equivalent in the case of fixed T but not in the case of large T. If T → ∞ as N → ∞, the proof of Theorem 4.1(ii) requires the use of the factorization and inversion properties of symmetric block tridiagonal matrices, as in the proof of Theorem 3.1(ii).
Let θ̃_1 = β̃_1 and θ̃_t = β̃_t − β̃_{t−1} for t = 2, ..., T. The following theorem establishes the selection consistency.

Theorem 4.2 Suppose that Assumptions B.1-B.2 hold. Then P(‖θ̃_t‖ = 0 for all t ∈ T^{0c}_{m0}) → 1 as N → ∞.
Theorem 4.2 says that w.p.a.1 all the zero vectors in {θ^0_t, 2 ≤ t ≤ T} must be estimated as exactly zero by the PGMM method. On the other hand, by Theorem 4.1(ii), we know that the estimates of the nonzero vectors in {θ^0_t, 2 ≤ t ≤ T} must be consistent, by noting that β̃_t − β̃_{t−1} consistently estimates θ^0_t for t ≥ 2. Put together, Theorems 4.1 and 4.2 imply that the AGFL has the ability to identify the true regression model with the correct number of breaks consistently when the minimum break size J_min does not shrink to zero too fast.
Corollary 4.3 Suppose that Assumptions B.1-B.2 hold with c_J = ∞ in Assumption B.2(i). Then (i) lim_{N→∞} P(m̃ = m^0) = 1, and (ii) lim_{N→∞} P(T̃_1 = T^0_1, ..., T̃_{m^0} = T^0_{m^0} | m̃ = m^0) = 1.
The above corollary implies that the PGMM method helps us to estimate the number of structural
changes and all the break dates consistently regardless of whether T is fixed or passes to infinity.
4.3 Limiting distribution of the post-Lasso estimator
In this subsection we study the asymptotic distribution of the post-Lasso estimator α̃^p_m̃(T̃_m̃). Corollary 4.3 implies that w.p.a.1, m̃ = m^0 and T̃_j = T^0_j for j = 1, ..., m^0. It follows that α̃^p_m̃(T̃_m̃) is asymptotically equivalent to the infeasible estimator α̃^p_{m^0}(T^0_{m^0}), which is obtained if one knows the exact set T^0_{m^0} of true break dates. Note that

α̃^p_{m^0}(T^0_{m^0}) = Υ^{−1}_{NT} Ξ_{yNT},

where Υ_{NT} and Ξ_{yNT} are defined in (4.1).
The following theorem reports the limiting distribution of α̃^p_m̃(T̃_m̃) conditional on the large probability event {m̃ = m^0}.

Theorem 4.4 Suppose that Assumptions B.1-B.3 hold. Then, conditional on m̃ = m^0, we have √N D_{m^0+1}(α̃^p_m̃(T̃_m̃) − α^0) →_D N(0, Υ^{−1}_0 Σ_0 Υ^{−1}_0).
Since we allow I^0_j to be either fixed or diverging to infinity in the case of large T, the α̃^p_j(T̃_m̃)'s may have different convergence rates to their true values. In the special case where I^0_j is proportional to T, α̃^p_j(T̃_m̃) achieves the usual √(NT)-rate of consistency.
4.4 Choosing the tuning parameter λ2
Let α̃_{m̃λ2} ≡ α̃_{m̃λ2}(T̃_{m̃λ2}) = (α̃_1(T̃_{m̃λ2})′, ..., α̃_{m̃λ2+1}(T̃_{m̃λ2})′)′ denote the set of post-Lasso estimates of the regression coefficients based on the break dates in T̃_{m̃λ2} = T̃_{m̃λ2}(λ2), where we make the dependence of various estimates on λ2 explicit. Let σ̃²_{T̃_{m̃λ2}} ≡ (T − 1)^{−1} Q_{2NT}(α̃_{m̃λ2}, T̃_{m̃λ2}). We propose to select the tuning parameter λ2 by minimizing the following information criterion:

IC2(λ2) = σ̃²_{T̃_{m̃λ2}} + ρ_{2NT} p (m̃_{λ2} + 1). (4.2)

Denote Ω_2 = [0, λ_{2max}], a bounded interval in R_+. We divide Ω_2 into three subsets Ω_{20}, Ω_{2−} and Ω_{2+} as follows:

Ω_{20} = {λ2 ∈ Ω_2 : m̃_{λ2} = m^0}, Ω_{2−} = {λ2 ∈ Ω_2 : m̃_{λ2} < m^0}, and Ω_{2+} = {λ2 ∈ Ω_2 : m̃_{λ2} > m^0}.
Let λ^0_{2NT} denote an element in Ω_{20} that also satisfies the conditions on λ2 in Assumptions B.2(ii)-(iii). To state the next result, we add the following assumptions.

Assumption B.4. (i) plim_{N→∞} min_{1≤j≤m^0} min_{α∈R^p} J^{−2}_min η_j(α)′ W_{T^0_j} η_j(α) ≥ c_α > 0, where η_j(α) = N^{−1} Σ_{i=1}^N [(α^0_{j+1} − α)′ x_{iT^0_j} − (α^0_j − α)′ x_{i,T^0_j−1}] z_{iT^0_j}.
(ii) [N(I^0_j − 1)]^{−1/2} Σ_{t=T^0_{j−1}+1}^{T^0_j−1} Σ_{i=1}^N z_it Δu_it = O_P(1) for each j = 1, ..., m^0 + 1.
(iii) If T → ∞ as N → ∞, then I_min → ∞ and T/[(I_min J_min)² N] → 0.

Assumption B.5. As N → ∞ or (N, T) → ∞, (1 + T/(I_min J²_min)) ρ_{2NT} → 0 and N ρ_{2NT} → ∞.
Assumptions B.4-B.5 parallel A.4-A.5. Note that we now require Imin →∞ in the case of large T. The
following theorem implies that the minimizer of IC2 (λ2) can only be the one that produces the correct
number of estimated breaks in large samples.
Theorem 4.5 Suppose that Assumptions B.1, B.2(i) and B.3-B.5 hold with c_J = ∞ in Assumption B.2(i). Then

P(inf_{λ2∈Ω_{2−}∪Ω_{2+}} IC2(λ2) > IC2(λ^0_{2NT})) → 1 as N → ∞.
5 Monte Carlo simulations
In this section we conduct a set of Monte Carlo experiments to evaluate the finite sample performance
of our AGFL method. The first set of experiments is concerned with the PLS or PGMM estimation of static panel data models. We first evaluate the probability of falsely detecting breaks when there are none. Then we experiment on data generating processes (DGPs) with one or two breaks. In this case, we evaluate both the probability of correctly detecting the number of breaks and the accuracy of estimating the break dates. The second set of experiments deals with the PGMM estimation of dynamic panel data models. We focus on DGPs with a lagged dependent variable and an exogenous variable. As in the static panel case, we evaluate the probability of correctly detecting the number of breaks and, when there are indeed breaks, the accuracy of break date estimation.
For fast computation, we use the block-coordinate descent algorithm (see, e.g., Angelosante and Giannakis (2012)) to solve the minimization problem in (2.2) for the PLS case and (2.7) for the PGMM case. We select the tuning parameters λ1 and λ2 that minimize the information criteria in (3.3) and (4.2) for the cases of PLS and PGMM estimation, respectively. Specifically, we choose a tuning parameter λ_max that would yield zero breaks in every DGP and a λ_min that would yield many breaks. In practice, we can easily find such λ_max and λ_min by trial and error. We then search for the optimal tuning parameter over 40 evenly distributed logarithmic grid points in the interval [λ_min, λ_max]. We choose ρ_{1NT} = ρ_{2NT} = 1/√(N(T − 1)) in (3.3) or (4.2) for the static panels and ρ_{2NT} = log(N(T − 1))/(N(T − 1)) in (4.2) for the dynamic panels. Note that the latter choice specifies exactly the same rate as required by the Bayesian Information Criterion (BIC). Both choices are acceptable in theory for every DGP we experiment on, but their finite-sample performances do differ.
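Each coordinate update in such block-coordinate descent algorithms reduces to a multivariate ("group") soft-thresholding step. A sketch of that building block only, not the full AGFL solver:

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Solve argmin_x 0.5 * ||x - v||^2 + lam * ||x||: shrink the whole block
    v toward zero, and set it exactly to zero when ||v|| <= lam."""
    nv = np.linalg.norm(v)
    if nv <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / nv) * v

print(group_soft_threshold(np.array([3.0, 4.0]), 1.0))  # [2.4 3.2]
print(group_soft_threshold(np.array([0.3, 0.4]), 1.0))  # [0. 0.]
```

The exact zeroing of whole coefficient blocks is what allows the procedure to set entire differences θ_t to zero, i.e., to declare "no break" at date t.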
Following the literature on adaptive Lasso, we set κ1 = κ2 = 2 in the construction of the adaptive weights ŵ_t and w̃_t that are used for the PLS and PGMM estimation, respectively. In addition, we choose all weight matrices W_t, t = 2, ..., T, and W^p_j, j = 1, ..., m̂ + 1, as detailed in the last paragraph of Section 2.3. The number of repetitions in all subsequent Monte Carlo experiments is 500.
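The functional form of the weights is not restated in this section; assuming the standard adaptive group fused Lasso form ŵ_t = ‖β̇_t − β̇_{t−1}‖^{−κ} built from the preliminary estimates, with κ = 2 as above, a sketch:

```python
import numpy as np

def adaptive_weights(beta_prelim, kappa=2):
    """Assumed weight form w_t = ||beta_t - beta_{t-1}||^(-kappa), t = 2..T,
    computed from a (T x p) array of preliminary estimates. Large weights on
    small preliminary differences push those differences to exactly zero."""
    diffs = np.linalg.norm(np.diff(beta_prelim, axis=0), axis=1)
    return diffs ** (-float(kappa))   # diffs are a.s. nonzero in practice

beta_prelim = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(adaptive_weights(beta_prelim))  # [0.04 0.04]
```

With √N-consistent preliminary estimates, the weights at true no-break dates diverge at rate N^{κ/2}, which is what drives the selection consistency of Theorems 3.2 and 4.2.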
5.1 The case of static panel
We consider the following DGPs:
y_it = β_t x_it + μ_i + u_it, i = 1, ..., N, t = 1, ..., T,

where μ_i = T^{−1} Σ_{t=1}^T x_it and
• DGP-1: x_it ∼ i.i.d. N(0, 1), u_it = σ_u η_it, η_it ∼ i.i.d. N(0, 1).

• DGP-2: Same as DGP-1 except that η_it ∼ AR(1) for each i: η_it = 0.5 η_{i,t−1} + ε_it, ε_it ∼ i.i.d. N(0, 0.75).

• DGP-3: Same as DGP-1 except that η_it ∼ GARCH(1, 1) for each i: η_it = √h_it ε_it, h_it = 0.05 + 0.05 η²_{i,t−1} + 0.9 h_{i,t−1}, ε_it ∼ i.i.d. N(0, 1).

• DGP-4: x_it = ξ_it + 0.3 η_it, where η_it and ξ_it are i.i.d. N(0, 1) and mutually independent, u_it = σ_u η_it, z_it = ξ_it + 0.3 ε_it, ε_it ∼ i.i.d. N(0, 1).

• DGP-5: Same as DGP-4 except that ξ_it ∼ AR(1) for each i: ξ_it = 0.5 ξ_{i,t−1} + ε_it, ε_it ∼ i.i.d. N(0, 0.75).

• DGP-6: Same as DGP-4 except that η_it ∼ GARCH(1, 1) for each i: η_it = √h_it ε_it, h_it = 0.05 + 0.05 η²_{i,t−1} + 0.9 h_{i,t−1}, ε_it ∼ i.i.d. N(0, 1).
We consider T = 6 or 12, and N = 50, 100, 200, and 500. For each DGP, we set β_t = 1 for all t when no break exists, β_t = 1{1 ≤ t ≤ T/2} when there is one break, and β_t = 1{1 ≤ t ≤ T/2} + 2 · 1{T/2 < t ≤ 2T/3} when there are two breaks, where 1{·} denotes the usual indicator function. If T = 6, the last case allows consecutive breaks at t = 4 and 5.
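The one-break version of DGP-1 can be simulated in a few lines (a sketch for illustration, not the authors' code):

```python
import numpy as np

def simulate_dgp1(N, T, sigma_u, seed=0):
    """DGP-1, one-break version: y_it = beta_t * x_it + mu_i + u_it with
    beta_t = 1{1 <= t <= T/2}, x_it and eta_it i.i.d. N(0,1),
    u_it = sigma_u * eta_it, and mu_i the within (time) average of x_it."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, T))
    u = sigma_u * rng.standard_normal((N, T))
    beta = (np.arange(1, T + 1) <= T // 2).astype(float)   # break at t = T/2 + 1
    mu = x.mean(axis=1, keepdims=True)                     # "fixed effect" mu_i
    y = beta * x + mu + u
    return y, x, beta

y, x, beta = simulate_dgp1(N=50, T=6, sigma_u=np.sqrt(0.5))
print(beta)  # [1. 1. 1. 0. 0. 0.]
```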
Note that the individual effects μ_i are generated as within averages of the regressor and are thus regarded as "fixed effects". In the first three DGPs, no endogeneity issue exists and we use PLS to estimate the models. DGP-1 serves as the benchmark case where both the regressor and the idiosyncratic error processes are strong white noise. DGP-2 allows serial correlation in the idiosyncratic error process and DGP-3 allows conditional heteroskedasticity. DGP-4 through DGP-6 contain an endogenous variable x_it and a variable z_it that generates a valid IV. We apply PGMM to estimate the models, using (z_it, z_{i,t−1})′ as the instrument. DGP-4 serves as the benchmark case where both the regressor and the error terms are i.i.d. across i and t. Here x_it and u_it are correlated due to the common component η_it, and z_it is correlated with x_it due to the presence of ξ_it in both. DGP-5 allows serial correlation in x_it, and DGP-6 allows conditional heteroskedasticity in u_it.
To evaluate the performance of the PLS or PGMM estimation under different noise levels, we select the scale parameter σ_u to be √(1/2) or 1. In DGP-1, these values of σ_u correspond to signal-to-noise ratios of 2 and 1 (or, in terms of the goodness of fit R² of the model, 0.67 and 0.5), respectively.
Tables 1 and 2 report simulation results from the above DGPs. The first panel of Table 1 reports the
percentages of falsely detecting breaks when there are none (m0 = 0). The second and the third panels
Table 1: The determination of the number of breaks for DGPs 1-6 (static panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 0, % of falsely detecting breaks when there are none
1     √2/2       0.2    0      0      0      0      0      0      0
      1          5.6    1.8    1.6    0      0.6    0      0      0
2     √2/2       0      0      0      0      0      0      0      0
      1          0      0      0      0      0      0      0      0
3     √2/2       0.2    0      0      0      0      0      0      0
      1          6.8    3.6    1      0.6    0      0      0      0
4     √2/2       0      0      0      0      0      0      0      0
      1          0      0      0      0      0      0      0      0
5     √2/2       0      0      0      0      0      0      0      0
      1          0.2    0      0      0      0      0      0      0
6     √2/2       0.4    0      0      0      0      0      0      0
      1          0      0      0      0      0      0      0      0
m0 = 1, % of correctly detecting one break
1     √2/2       99.4   99.8   100    100    100    100    100    100
      1          94.4   95.8   99.0   99.8   99.6   100    100    100
2     √2/2       100    100    100    100    100    100    100    100
      1          99.8   100    100    100    100    100    100    100
3     √2/2       99.2   100    100    100    100    100    100    100
      1          95.8   97.8   97.0   99.6   99.4   100    100    100
4     √2/2       95.8   95.8   99.8   99.6   100    100    100    100
      1          85.4   79.2   97.2   96.4   99.6   99.4   100    100
5     √2/2       87.6   95.2   95.4   100    99.2   100    99.8   100
      1          64.8   80.4   89.2   95.6   97.2   99.8   99.8   100
6     √2/2       97.8   94.8   100    99.6   100    100    100    100
      1          48.4   54.0   81.4   81.2   97.6   97.8   100    100
m0 = 2, % of correctly detecting two breaks
1     √2/2       99.0   99.8   100    100    100    100    100    100
      1          92.8   97.4   99.0   98.8   100    100    100    100
2     √2/2       99.8   100    100    100    100    100    100    100
      1          99.2   99.8   100    100    100    100    100    100
3     √2/2       98.6   99.8   100    100    100    100    100    100
      1          93.2   96.2   99.0   99.6   99.8   99.8   100    100
4     √2/2       12.2   43.0   27.8   77.8   91.8   99.4   100    100
      1          2.6    7.8    4.0    14.8   25.4   82.4   98.4   99.8
5     √2/2       4.8    9.2    26.8   40.2   84.4   97.2   99.0   100
      1          0.8    1.8    5.6    3.2    34.8   46.2   95.2   99.8
6     √2/2       11.8   44.2   31.6   83.2   92.4   99.4   100    100
      1          2.0    12.6   4.2    14.8   28.0   86.2   98.8   100
Table 2: The accuracy of estimating the break dates for DGPs 1-6 (static panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 1
1     √2/2       .034   .017   .000   .000   .000   .000   .000   .000
      1          .000   .035   .000   .000   .000   .000   .000   .000
2     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .000   .000   .000   .000   .000   .000   .000   .000
3     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .104   .119   .000   .000   .000   .000   .000   .000
4     √2/2       .070   .070   .000   .000   .000   .000   .000   .000
      1          .468   .463   .069   .121   .000   .000   .000   .000
5     √2/2       .038   .018   .000   .000   .000   .000   .000   .000
      1          .051   .311   .037   .035   .000   .000   .000   .000
6     √2/2       .068   .035   .000   .000   .000   .000   .000   .000
      1          1.240  2.037  .532   .616   .034   .068   .000   .000
m0 = 2
1     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .180   .171   .000   .000   .000   .000   .000   .000
2     √2/2       .000   .000   .000   .000   .000   .000   .000   .000
      1          .034   .033   .000   .000   .000   .000   .000   .000
3     √2/2       .000   .017   .000   .000   .000   .000   .000   .000
      1          .179   .139   .000   .000   .000   .000   .000   .000
4     √2/2       .546   .620   .000   .214   .000   .000   .000   .000
      1          2.564  1.709  .000   .338   .000   .020   .000   .000
5     √2/2       .000   .000   .124   .083   .000   .000   .000   .000
      1          4.167  .000   .000   .000   .000   .000   .000   .000
6     √2/2       .000   .339   .000   .060   .000   .000   .000   .000
      1          .000   1.720  .794   .113   .000   .039   .000   .000

Note: The table reports the ratio of the average Hausdorff distance between the estimated and true sets of break dates to T, i.e., 100 · HD(T̂_m̂, T^0_{m^0})/T in DGPs 1-3 and 100 · HD(T̃_m̃, T^0_{m^0})/T in DGPs 4-6.
report the percentages of correctly estimating the number of breaks when the true number of breaks is 1 and 2, respectively. We summarize some important findings from Table 1. First, when there are no breaks, the probability of falsely detecting breaks declines to zero as either N or T increases. This is true both for the PLS estimation in DGP-1 to DGP-3 in the case of no endogenous regressor and for the PGMM estimation in DGP-4 to DGP-6 in the case of an endogenous regressor. Even with N = 50 and T = 6, the probabilities of false detection of breaks are very small for all DGPs under investigation. Second, when there is one break, the probabilities of correctly detecting one break converge to 100% as N increases. In the case of PLS, the probabilities of correct detection are high at both noise levels even when N = 50 and T = 6. In the case of PGMM, however, they are much lower at the high noise level than at the low noise level when the sample sizes are small, although they converge quickly to 100% as N increases. As T increases from 6 to 12, the probability of correct detection improves in general. Third, when there are two breaks, the probabilities of correctly detecting two breaks converge to 100% as N increases from 50 to 500. When T = 6, there are two consecutive breaks at t = 4 and 5, and the percentage of correctly estimating the number of breaks tends to be very low in DGP-4 to DGP-6 if N is not large enough (50 or 100). But it improves quickly when T increases to 12, in which case there are no consecutive breaks, and it also tends to 100% rapidly as N increases from 50 to 500.
Table 2 reports the ratio of the average Hausdorff distance (HD) between the estimated and true sets of break dates to T, i.e., 100 · HD(T̂_m̂, T^0_{m^0})/T in the case of PLS estimation and 100 · HD(T̃_m̃, T^0_{m^0})/T in the case of PGMM estimation, conditional on correct estimation of the number of breaks.6 Conditional on the correct estimation of the number of breaks, both PLS and PGMM estimate the break dates very accurately. Even with N = 50 and T = 6, the average ratios of the Hausdorff distance to T are close to zero for PLS at both noise levels. For DGPs with endogeneity, the estimation of break dates is only slightly less accurate.
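The Hausdorff distance used in Table 2 (defined in footnote 6) is straightforward to compute for finite sets of break dates; a sketch:

```python
def hausdorff(A, B):
    """HD(A, B) = max{D(A, B), D(B, A)} with D(A, B) = sup_{b in B} inf_{a in A} |a - b|,
    matching footnote 6; A and B are nonempty finite sets of break dates."""
    def D(A, B):
        return max(min(abs(a - b) for a in A) for b in B)
    return max(D(A, B), D(B, A))

print(hausdorff({4, 9}, {4, 8}))  # 1
print(hausdorff({4, 9}, {4, 9}))  # 0
```

Dividing by T and averaging over the replications in which m̂ = m^0 gives the entries reported in Tables 2 and 4.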
5.2 The case of dynamic panel
We consider the following DGP’s with an AR(1) dynamics:
yit = β1tyi,t−1 + β2tx2it + µi + uit,
where µi ∼ i.i.d. Uniform[−0.1, 0.1] and
• DGP-1d: x2it ∼ i.i.d. N(0, 1), uit = σuηit, ηit ∼ i.i.d. N(0, 1).
• DGP-2d: Same as DGP-1d except x2it ∼ AR(1) for each i : x2it = 0.5x2i,t−1 + vit, vit ∼i.i.d. N(0, 0.75).
6Let D (A,B) ≡ supb∈B infa∈A |a− b| for any two sets A and B. The Hausdorff distance between A and B is defined as
HD (A,B) ≡ maxD (A,B) ,D (B,A).
21
• DGP-3d: Same as DGP-1d except ηit ∼ GARCH(1, 1) for each i : ηit =√hitεit, hit = 0.05 +
0.05η2i,t−1 + 0.9hi,t−1, εit ∼ i.i.d. N(0, 1).
As in the static case, we take T = 6 or 12, and N = 50, 100, 200, and 500. For each DGP, we set either β_{1t} = β_{2t} = 0.5 or, more persistently, β_{1t} = β_{2t} = 0.8 for all t when no break exists; β_{1t} = β_{2t} = 0.3 · 1{1 ≤ t ≤ T/2} + 0.7 · 1{T/2 < t ≤ T} when there is one break; and β_{1t} = β_{2t} = 0.3 · 1{1 ≤ t ≤ T/2} + 0.7 · 1{T/2 + 1 ≤ t ≤ 2T/3} + 0.3 · 1{2T/3 + 1 ≤ t ≤ T} when there are two breaks. Note that when T = 6, there are consecutive breaks at t = 4 and 5.
DGP-1d is the benchmark case with x_it and u_it i.i.d. across both i and t. DGP-2d allows serial correlation in x_it and DGP-3d allows conditional heteroskedasticity in u_it. We choose the scale parameter σ_u to be 0.2, 0.3, and 0.5, corresponding to signal-to-noise ratios of 4, 2, and 1, respectively, in DGP-1d with β_{1t} = β_{2t} = 0.5. The relatively lower noise levels are justified by the usually high goodness of fit of many dynamic panels in applications. To obtain the PGMM estimates, we use z_it = (y_{i,t−2}, x_{2it}, x_{2i,t−1})′ as the instrument.
Table 3 reports the estimation of the number of breaks for these three DGPs. The first two panels report the percentages of falsely detecting breaks when there are none (m^0 = 0); the AR coefficient is 0.5 in the first panel and 0.8 in the second. The third and fourth panels report the percentages of correctly estimating the number of breaks when the true number of breaks is 1 and 2, respectively. We summarize the results in Table 3. (i) When there are no breaks, the probabilities of falsely detecting breaks are small and, in general, become smaller as N or T increases. When the AR coefficient increases from 0.5 to 0.8 and the dynamic panel becomes more persistent, the probabilities of false detection remain low. In fact, for some DGPs (e.g., DGP-3d, N = 500), the probabilities of false detection at the higher persistence level are generally lower than those at the moderate persistence level, thanks to the fact that the signal-to-noise ratio is higher at the high persistence level. (ii) When there is one break, the probabilities of correctly detecting one break converge to 100% across all choices of N, T, and noise levels. (iii) When there are two breaks, we see relatively lower probabilities of correct estimation, especially at high noise levels. But as N increases, the probabilities of correct estimation also converge to 100% across all noise levels.
Table 4 reports the ratio of the average Hausdorff distance between the estimated and true sets of break dates to T, i.e., 100 · HD(T̃_m̃, T^0_{m^0})/T, for DGP-1d to DGP-3d. As in the static panel case, conditional on the correct estimation of the number of breaks, our procedure estimates the break dates very accurately. Even with N = 50 and T = 6, the average ratio of the Hausdorff distance to T is very close to zero at all noise levels.
Table 3: The determination of the number of breaks for DGPs 1d-3d (dynamic panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 0, β1t = β2t = 0.5, % of falsely detecting breaks when there are none
1d    .2         2.6    1.4    0.6    0.4    0.6    1      1      0.8
      .3         1.8    1.6    0.6    0.6    0.4    0.2    0.6    0.6
      .5         2.8    1.2    1.6    0.8    0.8    0.8    0      0
2d    .2         1.4    0      1.2    0.8    0.2    0.2    0.2    0
      .3         1.4    1.2    0.4    0.6    1      0.2    0.4    0.2
      .5         0.8    1.6    1.2    1.4    0.6    0.8    0.2    0.2
3d    .2         1.8    1.4    0.6    0.6    0.8    0.8    0.4    0.4
      .3         1.8    1.4    1      1.4    0.6    0.2    0.2    0.6
      .5         2.2    1.4    1.2    0.6    1.4    1.2    0.2    0.6
m0 = 0, β1t = β2t = 0.8, % of falsely detecting breaks when there are none
1d    .2         1.8    1.2    1      1.4    0.8    1      0      0.2
      .3         1.4    1.4    0.6    0.6    1      0.8    0.6    0.4
      .5         3.6    2      1.2    0.6    0.8    0.6    1      0.6
2d    .2         3      1      1.8    0.6    0.6    0.6    0.2    0.2
      .3         1.6    0.6    1.2    0.8    0.4    0.2    0.6    0.2
      .5         1.4    1      1.6    1      0.8    1      0.2    0.2
3d    .2         3.2    1.4    1      0.8    0.2    1.2    0.2    0.2
      .3         2.2    1      1.6    0.6    1      0.6    0.2    0.4
      .5         1.6    1      0.8    0.2    0.8    0.4    0.2    0.2
m0 = 1, % of correctly detecting one break
1d    .2         98.4   94.8   98.6   98.4   99.2   99.2   100    100
      .3         96.6   91.8   98.8   98.8   99     98.8   99.8   99.8
      .5         87     83.4   96.6   97.2   99.2   99.4   99.6   99.4
2d    .2         97.2   98.4   97.6   98.2   99.8   99     100    100
      .3         94.2   94.4   98.6   99.2   99.6   99.2   99.4   99.6
      .5         84.8   82.4   97.8   98     99     100    99     100
3d    .2         97.2   95     98.6   98.6   99     99     99.8   99.8
      .3         95.6   90.2   99.6   98.6   99.4   99     99.4   100
      .5         86.4   79.8   97.4   96.6   98.8   99.6   99.4   99.8
m0 = 2, % of correctly detecting two breaks
1d    .2         90.4   80.6   97.6   95.6   99.4   98.8   99.8   100
      .3         81     63.6   94     86.8   99     98.2   100    99.8
      .5         46.6   36.6   87     73     99.4   94.2   100    100
2d    .2         89.6   86     97.6   97.4   99.6   99.8   99.6   99.8
      .3         81.2   63.6   94.2   91.2   96.6   98.8   98.2   99.4
      .5         54.4   38.4   85     69     93     86     97.6   93.8
3d    .2         88.2   79.8   97.6   94.2   99.6   99.6   99.6   99.8
      .3         79.4   60.8   95     90     99     98.6   100    99.8
      .5         45.8   41.4   90.2   72     98     94.8   99.8   99.6
Table 4: The accuracy of estimating the break dates for DGPs 1d-3d (dynamic panels)

                 N = 50        N = 100       N = 200       N = 500
DGP   σu         T=6    T=12   T=6    T=12   T=6    T=12   T=6    T=12
m0 = 1
1d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .036   .000   .000   .000   .000   .000   .000
      .5         .421   .719   .000   .034   .000   .000   .000   .000
2d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .000   .000   .000   .000   .000   .000   .000
      .5         .118   .222   .000   .017   .000   .000   .000   .000
3d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .055   .000   .000   .000   .000   .000   .000
      .5         .386   .606   .000   .000   .000   .000   .000   .000
m0 = 2
1d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .041   .079   .000   .000   .000   .000   .000   .000
      .5         .572   .273   .000   .137   .000   .000   .000   .000
2d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .000   .000   .000   .000   .000   .000   .000
      .5         .184   .174   .000   .048   .000   .000   .000   .000
3d    .2         .000   .000   .000   .000   .000   .000   .000   .000
      .3         .000   .055   .000   .000   .000   .000   .000   .000
      .5         .509   .362   .037   .000   .000   .000   .000   .000

Note: The table reports the ratio of the average Hausdorff distance between the estimated and true sets of break dates to T, i.e., 100 · HD(T̃_m̃, T^0_{m^0})/T.
6 An empirical application
In this section we offer an illustration of the use of our method. We seek to evaluate the effect of FDI inflow
on economic growth with a dynamic panel data model with an unknown number of breaks. The possible
existence of breaks may be justified theoretically. In the endogenous growth model of Romer (1986),
for example, economic growth may behave differently in different policy environments. Furthermore, in
the growth model of Jones (2002), the regime shifts may be common across countries in “a world of
ideas”, assuming that ideas propagate fast enough. Empirically, there is ample evidence for the existence of breaks in growth paths (e.g., Ben-David and Papell (1995)). However, most existing studies rely on time series structural break tests for individual economies, the United States in particular.
In this empirical exercise, we use a panel dataset of 88 countries or regions from 1973 to 2012. We take data from the UNCTAD (United Nations Conference on Trade and Development) and construct two variables, the per capita GDP growth and the ratio of FDI inflow to GDP, for each economy in the sample.7 These are annual data, but following the literature on growth empirics (e.g., Islam (1995)), we take five-year averages of the two variables and denote them by y_it and fdi_it, respectively. Here the subscript t indexes the five-year periods, and the averaging gives us eight five-year periods for each economy. Because there is one lagged dependent variable in the model, the effective number of data points for each economy is seven. We apply the PGMM method to estimate the following dynamic panel data model with an unknown number of breaks:

y_it = μ_i + β_{1t} y_{i,t−1} + β_{2t} fdi_it + u_it, t = 1, ..., 7.

As in the simulations, we set κ2 = 2 in the construction of the adaptive weights, choose the weight matrices (W_t, W^p_j) as detailed in the last paragraph of Section 2.3, and adopt z_it = (y_{i,t−2}, fdi_it, fdi_{i,t−1})′ as the instrument.

7 The UNCTAD database covers 237 countries and regions. We delete those economies with missing values over 1973-2012.

Figure 1: Selecting the optimal tuning parameter by minimizing the information criterion (IC). Horizontal axis: tuning parameter; left vertical axis: IC; right vertical axis: number of breaks.
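The five-year averaging step described above can be sketched as follows; the start year and window width match the text, while the input format (parallel year/value sequences) is a hypothetical convenience, not the UNCTAD file layout:

```python
import numpy as np

def five_year_average(years, values, start=1973, width=5):
    """Average an annual series over consecutive five-year windows
    (1973-1977, ..., 2008-2012), yielding one observation per period t."""
    years = np.asarray(list(years))
    values = np.asarray(list(values), dtype=float)
    period = (years - start) // width
    return np.array([values[period == k].mean() for k in np.unique(period)])

print(five_year_average(range(1973, 1983), range(1, 11)))  # [3. 8.]
```

Applying this to 1973-2012 yields eight periods per economy, of which seven are usable once the lagged dependent variable is included.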
We first select the optimal tuning parameter that minimizes the BIC, choosing ρ_{2NT} = log(N(T − 1))/(N(T − 1)) in (4.2). We choose λ_max = 10, which results in zero breaks, and λ_min = 0.01, which results in six breaks. We then search on the interval [λ_min, λ_max] over thirty evenly distributed logarithmic grid points. We find that the number of breaks is four and that the breaks occur at t = 2, 5, 6 and 7, that is, 1983-1987, 1998-2002, 2003-2007, and 2008-2012. Figure 1 shows how the BIC (left axis) and the estimated number of breaks (right axis) change with the tuning parameter λ2. We can see that the BIC declines until the estimated number of breaks reaches four and rises as λ2 gets bigger. It is notable that there are five λ2's that result in four breaks, ranging from 0.053 to 0.137, and that the IC curve is flat over this segment (and
similarly over several other segments).8 This suggests that the penalized GMM estimation is not very sensitive to the tuning parameter.
It is well known that the BIC, like other information criteria, may not be able to select the right model in finite samples. It is thus prudent to examine the cases with numbers of breaks other than four. Table 5 shows the regime segmentation, parameter estimates, and standard errors (in parentheses) from the post-Lasso estimation for the cases where m = 0, 1, ..., 6. Note that in the last case (m = 6), there is a structural break at every time point.
As shown in Table 5, the set of break dates is an increasing sequence as the tuning parameter decreases.
It starts from an empty set when m = 0. When m = 1, we have one break at t = 2, which corresponds to
the five-year period of 1983-1987. As the tuning parameter decreases, another break (in addition to the
one at t = 2) is detected at t = 7, which corresponds to 2008-2012. When m = 3, we have an additional
break at t = 6, corresponding to 2003-2007. As the tuning parameter decreases more, we arrive at the
case of m = 4 that achieves the minimum BIC. When m = 5, there is another break at t = 3 and the set
of break dates becomes 2, 3, 5, 6, 7.Table 5 also shows that the determination of structural change in our model is crucial for the quan-
titative evaluation of the effect of FDI on the economic growth. If we assume that no break exists and
estimate a textbook dynamic panel data model, then we may conclude that FDI has a negative, albeit statistically insignificant, effect on growth. In the model chosen by BIC (m = 4), in stark contrast, the coefficient of FDI is significantly positive in all regimes. In models with more than four break dates, there are also negative coefficients on FDI in the five-year span of 1983-1987. In models with at least one but fewer than four breaks, the coefficients on FDI are positive in all regimes but not statistically significant in some of them. This exercise suggests that the time-invariant parameters assumed in the textbook dynamic panel data model are unnecessarily restrictive and may lead to dubious conclusions. Our shrinkage-based method, by allowing multiple breaks in panel data models, provides applied economists with a natural approach to relaxing this assumption.
7 Conclusion
We propose two shrinkage procedures for the determination of the number of structural changes in linear
panel data models via adaptive group fused Lasso: PLS estimation for first-differenced models without
endogeneity and PGMM estimation for first-differenced models with endogeneity. We show that with
probability tending to one our methods can correctly determine the number of breaks and estimate the
break dates consistently. Simulation results suggest that our methods perform well in finite samples.
8 When λ2 changes from 0.053 to 0.137, the number of breaks and the set of estimated break dates remain unchanged, so that neither the first term (corresponding to the post-Lasso regression) nor the second term (the penalty term) in (4.2) changes.

Table 5: The effect of FDI on economic growth (88 countries and regions, 1973-2012)

m   Regime       y_{i,t-1}          fdi_{it}           BIC
0   1978-2012    -.151 (.058)***    -.070 (.052)       .178
1   1978-1982    -.144 (.092)        .523 (.258)**
    1983-2012    -.170 (.069)**      .060 (.050)       .189
2   1978-1982    -.132 (.095)        .610 (.270)**
    1983-2007    -.065 (.074)        .050 (.047)
    2008-2012    -.971 (.160)***     .176 (.084)**     .161
3   1978-1982    -.127 (.097)        .654 (.275)**
    1983-2002    -.030 (.075)        .064 (.054)
    2003-2007     .302 (.155)*       .229 (.066)***
    2008-2012    -.438 (.202)**      .174 (.080)**     .155
4   1978-1982    -.114 (.103)       1.170 (.294)***
    1983-1997     .074 (.072)        .492 (.096)***
    1998-2002    -.232 (.122)*       .161 (.046)***
    2003-2007     .266 (.171)        .260 (.057)***
    2008-2012    -.441 (.230)*       .192 (.084)**     .142
5   1978-1982    -.109 (.115)        .625 (.287)**
    1983-1987     .221 (.102)**     -.252 (.251)
    1988-1997     .072 (.080)        .468 (.097)***
    1998-2002    -.247 (.116)**      .157 (.047)***
    2003-2007     .251 (.171)        .256 (.056)***
    2008-2012    -.460 (.224)**      .194 (.083)**     .153
6   1978-1982    -.107 (.117)        .653 (.285)**
    1983-1987     .228 (.101)**     -.222 (.240)
    1988-1992     .094 (.104)        .503 (.125)***
    1993-1997     .088 (.094)        .461 (.101)***
    1998-2002    -.240 (.115)**      .158 (.047)***
    2003-2007     .257 (.167)        .258 (.057)***
    2008-2012    -.453 (.220)**      .193 (.083)**     .176

Note: Numbers in parentheses are standard errors. *** denotes statistical significance at the 1% level, ** at the 5% level, and * at the 10% level. For each m, regimes are delimited by the estimated break dates reported in the text; the BIC is reported once per model.

There are several interesting topics for further research. First, we do not allow cross sectional dependence in our models. Given the large literature on cross sectional dependence, it is interesting to
extend our methodology to panel data models with cross sectional dependence. Second, if we model the
cross sectional dependence through a factor structure, the factor loadings may also exhibit structural
changes over time (see, e.g., Breitung and Eickmeier (2011) and Cheng et al. (2014)) and this further
complicates the analysis. Third, we consider common breaks in homogeneous panel data models. It is also interesting to consider heterogeneous panel data models and to allow the break dates to differ across individuals. We leave these topics for future research.
APPENDIX
A Proof of the results in Section 3
Let $V_{1NT}(\beta) \equiv \frac{1}{N}\sum_{i=1}^{N}\sum_{t=2}^{T}\left(\Delta y_{it} - \beta_t' x_{it} + \beta_{t-1}' x_{i,t-1}\right)^2$. Let $\beta_t = \beta_t^0 + N^{-1/2}b_t$ for $t = 1,\ldots,T$ with $b \equiv (b_1',\ldots,b_T')'$ satisfying $T^{-1/2}\|b\| = L$. Note that $\beta = \beta^0 + N^{-1/2}b$. We first prove a technical lemma.

Lemma A.1 Suppose Assumption A.1 holds. Then $\dot\beta_t - \beta_t^0 = O_P(N^{-1/2})$ for each $t = 1,2,\ldots,T$, where $\dot\beta = \arg\min_{\beta}V_{1NT}(\beta)$.
Proof. Let $\dot b_t = N^{1/2}(\dot\beta_t - \beta_t^0)$ and $\dot b = (\dot b_1',\ldots,\dot b_T')'$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}(x_{it}'b_t - x_{i,t-1}'b_{t-1})$, we have
\begin{align*}
N\left[V_{1NT}(\beta) - V_{1NT}(\beta^0)\right]
&= \sum_{i=1}^{N}\sum_{t=2}^{T}\left\{\left[\Delta u_{it} - N^{-1/2}(x_{it}'b_t - x_{i,t-1}'b_{t-1})\right]^2 - (\Delta u_{it})^2\right\}\\
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=2}^{T}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)^2 - \frac{2}{N^{1/2}}\sum_{i=1}^{N}\sum_{t=2}^{T}\Delta u_{it}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)\\
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u \equiv A_1(b) - 2A_2(b), \text{ say},
\end{align*}
where $Q_{NT}$ and $R_{NT}^u$ are defined in (2.4) and (2.5), respectively. Under Assumption A.1(iii), w.p.a.1
$$\lambda_{\min}(Q_{NT}) = \min_{\|\kappa\|=1}\left\{\kappa'Q_0\kappa + \kappa'(Q_{NT}-Q_0)\kappa\right\} \ge \lambda_{\min}(Q_0) - \left\|Q_{NT}-Q_0\right\|_{sp} \ge c_{Q_0}/2.$$
Under Assumptions A.1(i)-(ii), $T^{-1/2}\|\sqrt{N}R_{NT}^u\| = O_P(1)$ by the Chebyshev inequality. It follows that w.p.a.1
$$T^{-1}\left[A_1(b) - 2A_2(b)\right] \ge (c_{Q_0}/2)\,T^{-1}\|b\|^2 - T^{-1/2}\|b\|\,O_P(1) > 0$$
if $T^{-1/2}\|b\| = L$ is sufficiently large, in which case the quadratic term $A_1(b)$ dominates the linear term $A_2(b)$. Consequently, $N[V_{1NT}(\beta) - V_{1NT}(\beta^0)] > 0$ w.p.a.1 if $T^{-1/2}\|b\| = L$ is large, and $V_{1NT}(\beta)$ cannot be minimized in this case. This further implies that $T^{-1/2}\|\dot b\|$ must be stochastically bounded.

When $T$ is fixed, the above result also implies that $\dot b_t$ is stochastically bounded for each $t = 1,\ldots,T$. We now consider the case of large $T$. Let $L$ denote the block lower triangular part of the symmetric block tridiagonal matrix $Q_{NT}$. By Meurant (1995), $Q_{NT}$ can be factorized as $Q_{NT} = (\Delta + L)\Delta^{-1}(\Delta + L')$, where $\Delta = \mathrm{diag}(\Delta_1,\ldots,\Delta_T)$ is a block diagonal matrix, $\Delta_1 = \phi_{xx,1}$, $\Delta_t = 2\phi_{xx,t} - \phi_{xx,t,t-1}(\Delta_{t-1})^{-1}\phi_{xx,t,t-1}'$ for $t = 2,\ldots,T-1$, and $\Delta_T = \phi_{xx,T} - \phi_{xx,T,T-1}(\Delta_{T-1})^{-1}\phi_{xx,T,T-1}'$. Let $b^{\dagger} = (\Delta + L')b = (b_1^{\dagger\prime},\ldots,b_T^{\dagger\prime})'$ and $R_{NT}^{\dagger} = \sqrt{N}(\Delta + L)^{-1}R_{NT}^u = (R_{1NT}^{\dagger\prime},\ldots,R_{TNT}^{\dagger\prime})'$, where the $b_t^{\dagger}$'s and $R_{tNT}^{\dagger}$'s are all $p\times 1$ vectors. In addition, $R_{tNT}^{\dagger} = O_P(1)$ for each $t = 1,\ldots,T$ under Assumption A.1. Then
$$N\left[V_{1NT}(\beta) - V_{1NT}(\beta^0)\right] = \sum_{t=1}^{T}\left[b_t^{\dagger\prime}\Delta_t^{-1}b_t^{\dagger} - 2b_t^{\dagger\prime}R_{tNT}^{\dagger}\right] \equiv V_{1NT}^{\dagger}(b^{\dagger}), \text{ say}.$$

Let $\dot\beta \equiv (\dot\beta_1',\ldots,\dot\beta_T')'$ and $\dot b^{\dagger} \equiv (\Delta + L')\dot b \equiv (\dot b_1^{\dagger\prime},\ldots,\dot b_T^{\dagger\prime})'$. In view of the fact that $\dot\beta$ minimizes $V_{1NT}(\beta)$, we have
$$0 \ge N\left[V_{1NT}(\dot\beta) - V_{1NT}(\beta^0)\right] = \sum_{t=1}^{T}\left[\dot b_t^{\dagger\prime}\Delta_t^{-1}\dot b_t^{\dagger} - 2\dot b_t^{\dagger\prime}R_{tNT}^{\dagger}\right].$$
The last result implies that $\dot b_t^{\dagger} = O_P(1)$ for each $t$; otherwise $\dot b^{\dagger}$ cannot minimize $V_{1NT}^{\dagger}(b^{\dagger})$, which further implies that $\dot\beta$ cannot minimize $V_{1NT}(\beta)$.

To finish the proof, we still need to show that $\dot b_t = O_P(1)$ for each $t$ based on the result that $\dot b_t^{\dagger} = O_P(1)$ for each $t$. Noting that $\Delta + L'$ is a nonsingular upper block triangular matrix, we can apply the fact that the inverse of a nonsingular upper block triangular matrix is also upper block triangular (see, e.g., Harville (1997, p. 95)) and write $(\Delta + L')^{-1} = \{\omega_{ts}\}_{t,s=1}^{T}$, where the $\omega_{ts}$ are $p\times p$ matrices that are $O_P(1)$ for $s \ge t$ and zero otherwise. The exact formula for $\omega_{ts}$ in terms of elements of $\Delta$ and $L'$ can be obtained recursively from Harville (1997, p. 95). Thus $\dot b_t = \sum_{s=t}^{T}\omega_{ts}\dot b_s^{\dagger}$, and $\dot b_t = O_P(1)$ for any $t \ge T - r$, where $r$ is a finite integer that does not depend on $T$. Now, suppose that $\dot b_{\tau} \ne O_P(1)$ for some $1 \le \tau < T - r$. By the relationship between $\dot b^{\dagger}$ and $\dot b$, we have
$$\dot b_{\tau}^{\dagger} = \Delta_{\tau}\dot b_{\tau} + \phi_{xx,\tau+1,\tau}'\dot b_{\tau+1},$$
or, equivalently,
$$\Delta_{\tau}^{-1}\dot b_{\tau}^{\dagger} = \dot b_{\tau} + \Delta_{\tau}^{-1}\phi_{xx,\tau+1,\tau}'\dot b_{\tau+1}.$$
Since $\Delta_{\tau}^{-1} = O_P(1)$, $\dot b_{\tau}^{\dagger} = O_P(1)$, $\phi_{xx,\tau+1,\tau}' = O_P(1)$, and $\dot b_{\tau} \ne O_P(1)$, in order for the above equality to hold we must have $\dot b_{\tau+1} \ne O_P(1)$. Deducing recursively, we must have $\dot b_{T-r} \ne O_P(1)$, a contradiction. It follows that $\dot b_t = N^{1/2}(\dot\beta_t - \beta_t^0) = O_P(1)$ for each $t$.
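The factorization $Q_{NT} = (\Delta + L)\Delta^{-1}(\Delta + L')$ of a symmetric block tridiagonal matrix used above can be checked numerically. Below is a minimal sketch with scalar blocks ($p = 1$) for a generic symmetric tridiagonal matrix; the function name is ours.

```python
import numpy as np

# Numerical check (scalar blocks, p = 1) of the block factorization used in
# the proof: Q = (D + L) D^{-1} (D + L'), where L is the strictly lower part
# of a symmetric tridiagonal Q and D = diag(D_1, ..., D_T) is built by the
# recursion D_1 = q_11, D_t = q_tt - q_{t,t-1}^2 / D_{t-1}.
def tridiag_factor(Q):
    T = Q.shape[0]
    D = np.zeros(T)
    D[0] = Q[0, 0]
    for t in range(1, T):
        D[t] = Q[t, t] - Q[t, t - 1] ** 2 / D[t - 1]
    L = np.tril(Q, k=-1)  # strictly lower triangular part
    return D, L
```

Reassembling $(D + L)D^{-1}(D + L')$ recovers $Q$ exactly, since the off-diagonal cross terms $L D^{-1} L'$ contribute precisely $q_{t,t-1}^2/D_{t-1}$ on the diagonal.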
Proof of Theorem 3.1. (i) Let $\hat b_t = N^{1/2}(\hat\beta_t - \beta_t^0)$ and $\hat b = N^{1/2}(\hat\beta - \beta^0)$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}(x_{it}'b_t - x_{i,t-1}'b_{t-1})$, we have
\begin{align*}
N\left[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)\right]
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=2}^{T}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)^2 - \frac{2}{N^{1/2}}\sum_{i=1}^{N}\sum_{t=2}^{T}\Delta u_{it}\left(x_{it}'b_t - x_{i,t-1}'b_{t-1}\right)\\
&\quad + N\lambda_1\sum_{t=2}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}(b_t - b_{t-1})\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u + N\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}(b_t - b_{t-1})\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&\quad + N\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^{0c}}^{T}\dot w_t\left\|N^{-1/2}(b_t - b_{t-1})\right\|\\
&\equiv A_1(b) - 2A_2(b) + A_3(b) + A_4(b), \text{ say},
\end{align*}
where $Q_{NT}$ and $R_{NT}^u$ are defined in (2.4) and (2.5), respectively. By Lemma A.1 and Assumption A.2(i),
$$\max_{t\in\mathcal{T}_{m^0}^0}\dot w_t = \max_{t\in\mathcal{T}_{m^0}^0}\left\|\dot\beta_t - \dot\beta_{t-1}\right\|^{-\kappa_1} = \max_{t\in\mathcal{T}_{m^0}^0}\left\|\beta_t^0 - \beta_{t-1}^0 + O_P(N^{-1/2})\right\|^{-\kappa_1} = O_P\left(J_{\min}^{-\kappa_1}\right).$$
By the triangle and Jensen inequalities, the fact that $m^0$ is a fixed finite integer, and Assumption A.2(ii),
\begin{align}
\left|T^{-1}A_3(b)\right| &\le m^0 T^{-1}N^{1/2}\lambda_1\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\,\frac{1}{m^0}\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\|b_t - b_{t-1}\|\nonumber\\
&\le m^0 T^{-1}N^{1/2}\lambda_1\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\left\{\frac{1}{m^0}\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\|b_t - b_{t-1}\|^2\right\}^{1/2}\nonumber\\
&\le (2m^0)^{1/2}T^{-1/2}N^{1/2}\lambda_1\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\,T^{-1/2}\|b\|\nonumber\\
&= O_P\left(N^{1/2}\lambda_1 T^{-1/2}J_{\min}^{-\kappa_1}\right)T^{-1/2}\|b\| = O_P(1)\,T^{-1/2}\|b\|.\tag{A.1}
\end{align}
In conjunction with the analyses of $A_1(b)$ and $A_2(b)$ in the proof of Lemma A.1, this implies that w.p.a.1
$$T^{-1}\left[A_1(b) - 2A_2(b) + A_3(b)\right] \ge \lambda_{\min}(Q_{NT})\,T^{-1}\|b\|^2 - O_P(1)\,T^{-1/2}\|b\| > 0$$
if $T^{-1/2}\|b\| = L$ is sufficiently large. That is, $A_1(b)$ dominates $-2A_2(b) + A_3(b)$ for large $L$. In addition, $A_4(b) \ge 0$. Consequently, $N[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)] > 0$ w.p.a.1 for large $L$ and $V_{1NT,\lambda_1}(\beta)$ cannot be minimized in this case. This further implies that $T^{-1/2}\|\hat b\|$ has to be stochastically bounded, and Theorem 3.1(i) holds.

(ii) The result follows from (i) in the case of fixed $T$, so we consider the case of large $T$. Let $L$, $\Delta$, $b^{\dagger}$, $R_{NT}^{\dagger}$, and $\{\omega_{ts}\}_{t,s=1}^{T}$ be as defined in the proof of Lemma A.1. Let $\omega_{ts}^{\dagger} = \omega_{ts} - \omega_{t-1,s}$. Then $b_t - b_{t-1} = \sum_{s=t}^{T}\omega_{ts}b_s^{\dagger} - \sum_{s=t-1}^{T}\omega_{t-1,s}b_s^{\dagger} = \sum_{s=t-1}^{T}\omega_{ts}^{\dagger}b_s^{\dagger}$, as $\omega_{ts} = 0$ for $s = t-1$. So we can rewrite $N[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)]$ in terms of $b^{\dagger}$:
\begin{align*}
N\left[V_{1NT,\lambda_1}(\beta) - V_{1NT,\lambda_1}(\beta^0)\right]
&= \sum_{t=1}^{T}\left[b_t^{\dagger\prime}\Delta_t^{-1}b_t^{\dagger} - 2b_t^{\dagger\prime}R_{tNT}^{\dagger}\right]
+ N\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}\sum_{s=t-1}^{T}\omega_{ts}^{\dagger}b_s^{\dagger}\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&\quad + N^{1/2}\lambda_1\sum_{t=2,\,t\in\mathcal{T}_{m^0}^{0c}}^{T}\dot w_t\left\|\sum_{s=t-1}^{T}\omega_{ts}^{\dagger}b_s^{\dagger}\right\| \equiv NV_{1NT,\lambda_1}^{\dagger}(b^{\dagger}), \text{ say}.
\end{align*}
Noting that $\|R_{tNT}^{\dagger}\| = O_P(1)$, $N^{1/2}\lambda_1 m^0\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s = O_P(N^{1/2}\lambda_1 J_{\min}^{-\kappa_1}) = O_P(1)$, and $\max_{s\in\mathcal{T}_{m^0}^0}\|\omega_{st}^{\dagger}\| = O_P(1)$, we have by the triangle and Jensen inequalities (as in the derivation of (A.1))
\begin{align*}
0 \ge NV_{1NT,\lambda_1}^{\dagger}(\hat b^{\dagger})
&\ge \sum_{t=1}^{T}\left[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - 2\hat b_t^{\dagger\prime}R_{tNT}^{\dagger}\right] - N^{1/2}\lambda_1 m^0\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\max_{s\in\mathcal{T}_{m^0}^0}\left\|\sum_{t=s-1}^{T}\omega_{st}^{\dagger}\hat b_t^{\dagger}\right\|\\
&\ge \sum_{t=1}^{T}\left[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - \left(2\|R_{tNT}^{\dagger}\| + N^{1/2}\lambda_1(m^0)^{3/2}\max_{s\in\mathcal{T}_{m^0}^0}\dot w_s\max_{s\in\mathcal{T}_{m^0}^0}\|\omega_{st}^{\dagger}\|\right)\|\hat b_t^{\dagger}\|\right]\\
&= \sum_{t=1}^{T}\left[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - O_P(1)\|\hat b_t^{\dagger}\|\right].
\end{align*}
It follows that $0 \ge N[V_{1NT,\lambda_1}^{\dagger}(\hat b^{\dagger}) - V_{1NT,\lambda_1}^{\dagger}(0_{Tp\times 1})] \ge \sum_{t=1}^{T}[\hat b_t^{\dagger\prime}\Delta_t^{-1}\hat b_t^{\dagger} - O_P(1)\|\hat b_t^{\dagger}\|]$ and $\hat b_t^{\dagger} = O_P(1)$ for each $t$; otherwise $\hat b^{\dagger}$ cannot minimize $V_{1NT,\lambda_1}^{\dagger}(b^{\dagger})$. This implies that $\hat b_t = N^{1/2}(\hat\beta_t - \beta_t^0) = O_P(1)$ by the same arguments as used in the proof of Lemma A.1.
Proof of Theorem 3.2. We want to demonstrate that
$$P\left(\|\hat\theta_t\| = 0 \text{ for all } t\in\mathcal{T}_{m^0}^{0c}\right) \to 1 \text{ as } N\to\infty.\tag{A.2}$$
Suppose to the contrary that $\hat\theta_t = \hat\beta_t - \hat\beta_{t-1} \ne 0$ for some $t\in\mathcal{T}_{m^0}^{0c}$ for sufficiently large $N$ or $(N,T)$. Then there exists $r\in\{1,\ldots,p\}$ such that $|\hat\theta_{t,r}| = \max\{|\hat\theta_{t,l}|,\ l = 1,\ldots,p\}$, where for any $p\times 1$ vector $a_t$, $a_{t,l}$ denotes its $l$th element. Without loss of generality (wlog) assume that $r = p$, implying that $|\hat\theta_{t,p}|/\|\hat\theta_t\| \ge 1/\sqrt{p}$. To consider the first order condition (FOC) with respect to (wrt) $\beta_t$, $t \ge 2$, based on subdifferential calculus (e.g., Bertsekas (1995, Appendix B.5)), we distinguish two cases: (a) $2 \le t \le T-1$, and (b) $t = T$ and $T\in\mathcal{T}_{m^0}^{0c}$.

In case (a), we consider two subcases: (a1) $t+1 = T_j^0\in\mathcal{T}_{m^0}^0$ for some $j = 1,\ldots,m^0$, and (a2) $t+1\in\mathcal{T}_{m^0}^{0c}$. In either case, we can apply the FOC wrt $\beta_{t,p}$ and the equality $\Delta y_{it} = \beta_t^{0\prime}x_{it} - \beta_{t-1}^{0\prime}x_{i,t-1} + \Delta u_{it}$ to obtain
\begin{align}
0 &= \frac{-2}{\sqrt{N}}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\beta_t'x_{it} + \hat\beta_{t-1}'x_{i,t-1}\right)x_{it,p} + \frac{2}{\sqrt{N}}\sum_{i=1}^{N}\left(\Delta y_{i,t+1} - \hat\beta_{t+1}'x_{i,t+1} + \hat\beta_t'x_{it}\right)x_{it,p}\nonumber\\
&\quad + \sqrt{N}\lambda_1\dot w_t\frac{\hat\theta_{t,p}}{\|\hat\theta_t\|} - \sqrt{N}\lambda_1\dot w_{t+1}\hat e_{t+1,p}\tag{A.3}\\
&= -\frac{2}{\sqrt{N}}\sum_{i=1}^{N}\left[(\hat\beta_{t+1} - \beta_{t+1}^0)'x_{i,t+1} - 2(\hat\beta_t - \beta_t^0)'x_{it} + (\hat\beta_{t-1} - \beta_{t-1}^0)'x_{i,t-1}\right]x_{it,p}\nonumber\\
&\quad + \frac{2}{\sqrt{N}}\sum_{i=1}^{N}\Delta^2 u_{i,t+1}x_{it,p} + \sqrt{N}\lambda_1\dot w_t\frac{\hat\theta_{t,p}}{\|\hat\theta_t\|} - \sqrt{N}\lambda_1\dot w_{t+1}\hat e_{t+1,p}\nonumber\\
&\equiv B_{1t} + B_{2t} + B_{3t} - B_{4t}, \text{ say},\nonumber
\end{align}
where $\hat e_{t+1} = \hat\theta_{t+1}/\|\hat\theta_{t+1}\|$ if $\|\hat\theta_{t+1}\| \ne 0$ and $\|\hat e_{t+1}\| \le 1$ otherwise, and $\hat e_{t+1,p}$ is the $p$th element of $\hat e_{t+1}$. By Assumptions A.1(i)-(ii) and Theorem 3.1, $B_{1t} = O_P(1)$ and $B_{2t} = O_P(1)$. In view of the fact that $\dot w_t^{-1} = O_P(N^{-\kappa_1/2})$ for $t\in\mathcal{T}_{m^0}^{0c}$, $|B_{3t}| \ge \sqrt{N}\lambda_1\dot w_t/\sqrt{p}$, which is explosive in probability under Assumption A.2(iii) (i.e., $N^{(\kappa_1+1)/2}\lambda_1\to\infty$).

To bound the probability order of $B_{4t}$, we distinguish two subcases. In subcase (a1), noting that $\dot\beta_{t+1} - \dot\beta_t\stackrel{P}{\to}\theta_{t+1}^0 \ne 0$ by Theorem 3.1, we have $\dot w_{t+1} = \|\theta_{t+1}^0 + O_P(N^{-1/2})\|^{-\kappa_1} = O_P(J_{\min}^{-\kappa_1})$ and $B_{4t} = \sqrt{N}\lambda_1\dot w_{t+1}\hat e_{t+1,p} = O_P(\sqrt{N}\lambda_1 J_{\min}^{-\kappa_1}) = O_P(1)$. Consequently, $|B_{3t}| \gg |B_{1t} + B_{2t} - B_{4t}|$ so that (A.3) cannot hold for sufficiently large $N$ or $(N,T)$. We then conclude that w.p.a.1 $\hat\theta_t$ must lie at a point where $\|\hat\theta_t\|$ is not differentiable in subcase (a1). In addition, a direct implication of this result is that if $t = T_j^0 - 1\in\mathcal{T}_{m^0}^{0c}$ for some $j = 1,\ldots,m^0$, then $P(\|\hat\theta_{T_j^0-1}\| = 0)\to 1$ as $N\to\infty$ and $\sqrt{N}\lambda_1\dot w_{T_j^0-1}\hat e_{T_j^0-1} = O_P(1)$ in order for the FOC to hold for $t = T_j^0 - 1$.

In subcase (a2), a difficulty arises as $\dot w_{t+1} = O_P(N^{\kappa_1/2})$ and $\sqrt{N}\lambda_1\dot w_{t+1} = O_P(N^{(1+\kappa_1)/2}\lambda_1)$. But we can apply the implication of the result in subcase (a1) recursively. When $t = T_j^0 - 2\in\mathcal{T}_{m^0}^{0c}$ for some $j = 1,\ldots,m^0$, $B_{4t} = \sqrt{N}\lambda_1\dot w_{T_j^0-1}\hat e_{T_j^0-1,p} = O_P(1)$ and $|B_{3t}| \gg |B_{1t} + B_{2t} - B_{4t}|$. Thus (A.3) cannot hold for $t = T_j^0 - 2\in\mathcal{T}_{m^0}^{0c}$ either, and we must have $P(\|\hat\theta_{T_j^0-2}\| = 0)\to 1$ as $N\to\infty$ and $\sqrt{N}\lambda_1\dot w_{T_j^0-2}\hat e_{T_j^0-2} = O_P(1)$ in order for the FOC to hold for $t = T_j^0 - 2$. Deducing in this way, we proceed until we reach $t = T_{j-1}^0 + 1\in\mathcal{T}_{m^0}^{0c}$. Consequently, $\hat\theta_t$ must lie at a point where $\|\hat\theta_t\|$ is not differentiable for all $t\in\mathcal{T}_{m^0}^{0c}$ with $t \ne T$.

In case (b), noting that only one term in the penalty $\lambda_1\sum_{t=2}^{T}\dot w_t\|\beta_t - \beta_{t-1}\|$ involves $\beta_T$, it is easy to show that $\hat\theta_T = \hat\beta_T - \hat\beta_{T-1}$ must lie at a point where $\|\hat\theta_T\|$ is not differentiable if $T\in\mathcal{T}_{m^0}^{0c}$. Consequently, (A.2) follows.
Proof of Corollary 3.3. We consider two cases: (a) $t\in\mathcal{T}_{m^0}^{0c}$, and (b) $t\in\mathcal{T}_{m^0}^0$. In case (a), Theorem 3.2 implies that asymptotically no time point in $\mathcal{T}_{m^0}^{0c}$ can be identified as an estimated break date, so that $\hat m \le m^0$. In case (b), we want to show that all break points in $\mathcal{T}_{m^0}^0$ must be identified as estimated break points. Suppose not. Then there exists $t\in\mathcal{T}_{m^0}^0$ such that $\|\hat\theta_t\| = 0$. By the $\sqrt{N}$-consistency of $\hat\theta_t$ and the fact that $\hat\theta_t = \hat\beta_t - \hat\beta_{t-1} = \beta_t^0 - \beta_{t-1}^0 + O_P(N^{-1/2}) = \theta_t^0 + O_P(N^{-1/2})$ by Theorem 3.1, we have $\|\theta_t^0\| = O(N^{-1/2})$, which contradicts the assumption that $N^{1/2}J_{\min}\to\infty$ as $N\to\infty$, since $\|\theta_t^0\| \ge J_{\min}$ for any $t\in\mathcal{T}_{m^0}^0$.
Proof of Theorem 3.4. Note that $\hat\alpha_{\hat m}^p(\hat{\mathcal{T}}_{\hat m}) = (\hat\alpha_1^p(\hat{\mathcal{T}}_{\hat m})',\ldots,\hat\alpha_{\hat m+1}^p(\hat{\mathcal{T}}_{\hat m})')' = \arg\min_{\alpha_{\hat m}}Q_{1NT}(\alpha_{\hat m};\hat{\mathcal{T}}_{\hat m})$. The FOCs for this minimization problem are
\begin{align*}
0_{p\times 1} &= \frac{-2}{N}\sum_{t=2}^{\hat T_1-1}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_1^{p\prime}\Delta x_{it}\right)\Delta x_{it} + \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_1} - \hat\alpha_2^{p\prime}x_{i\hat T_1} + \hat\alpha_1^{p\prime}x_{i,\hat T_1-1}\right)x_{i,\hat T_1-1},\\
0_{p\times 1} &= \frac{-2}{N}\sum_{t=\hat T_{j-1}+1}^{\hat T_j-1}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_j^{p\prime}\Delta x_{it}\right)\Delta x_{it} + \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_j} - \hat\alpha_{j+1}^{p\prime}x_{i\hat T_j} + \hat\alpha_j^{p\prime}x_{i,\hat T_j-1}\right)x_{i,\hat T_j-1}\\
&\quad - \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_{j-1}} - \hat\alpha_j^{p\prime}x_{i\hat T_{j-1}} + \hat\alpha_{j-1}^{p\prime}x_{i,\hat T_{j-1}-1}\right)x_{i\hat T_{j-1}} \quad\text{for } j = 2,\ldots,\hat m, \text{ and}\\
0_{p\times 1} &= \frac{-2}{N}\sum_{t=\hat T_{\hat m}+1}^{T}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_{\hat m+1}^{p\prime}\Delta x_{it}\right)\Delta x_{it} - \frac{2}{N}\sum_{i=1}^{N}\left(\Delta y_{i\hat T_{\hat m}} - \hat\alpha_{\hat m+1}^{p\prime}x_{i\hat T_{\hat m}} + \hat\alpha_{\hat m}^{p\prime}x_{i,\hat T_{\hat m}-1}\right)x_{i\hat T_{\hat m}},
\end{align*}
where we suppress the dependence of the $\hat\alpha_j^p$ on $\hat{\mathcal{T}}_{\hat m}$. Let $\hat\Phi_{ab,l} = \frac{1}{N}\sum_{t=\hat T_{l-1}+1}^{\hat T_l-1}\sum_{i=1}^{N}a_{it}b_{it}'$ for $l = 1,\ldots,\hat m+1$ and $a, b = \Delta x, x$, or $\Delta y$. One can readily solve for $\hat\alpha_{\hat m}^p$ to obtain $\hat\alpha_{\hat m}^p = \hat\Phi_{NT}^{-1}\hat\Psi_{NT}^y$, where
$$\hat\Phi_{NT} = \mathrm{TriD}\left(\hat\Phi^{\dagger},\hat\Phi\right)_{\hat m+1},\tag{A.4}$$
$$\hat\Psi_{NT}^y = \left(\hat\Phi_{\Delta x\Delta y,1}' - \phi_{x\Delta y,\hat T_1-1,\hat T_1}',\ \hat\Phi_{\Delta x\Delta y,2}' - \phi_{x\Delta y,\hat T_2-1,\hat T_2}' + \phi_{x\Delta y,\hat T_1}',\ \ldots,\ \hat\Phi_{\Delta x\Delta y,\hat m}' - \phi_{x\Delta y,\hat T_{\hat m}-1,\hat T_{\hat m}}' + \phi_{x\Delta y,\hat T_{\hat m-1}}',\ \hat\Phi_{\Delta x\Delta y,\hat m+1}' + \phi_{x\Delta y,\hat T_{\hat m}}'\right)',\tag{A.5}$$
$\hat\Phi_1 = \hat\Phi_{\Delta x\Delta x,1} + \phi_{xx,\hat T_1-1}$, $\hat\Phi_l = \hat\Phi_{\Delta x\Delta x,l} + \phi_{xx,\hat T_l-1} + \phi_{xx,\hat T_{l-1}}$ for $l = 2,\ldots,\hat m$, $\hat\Phi_{\hat m+1} = \hat\Phi_{\Delta x\Delta x,\hat m+1} + \phi_{xx,\hat T_{\hat m}}$, and $\hat\Phi_{l+1}^{\dagger} = \phi_{xx,\hat T_l,\hat T_l-1}$ for $l = 1,\ldots,\hat m$.

By Corollary 3.3, $\hat\alpha_{\hat m}^p(\hat{\mathcal{T}}_{\hat m}) = \hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0)$ w.p.a.1. Therefore we can study the asymptotic distribution of $\hat\alpha_{\hat m}(\hat{\mathcal{T}}_{\hat m})$ by studying that of $\hat\alpha_{m^0}(\mathcal{T}_{m^0}^0)$. Note that $\hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0) = \Phi_{NT}^{-1}\Psi_{NT}^y$, where $\Phi_{NT}$ and $\Psi_{NT}^y$ are defined in (3.1) and (3.2), respectively. It is easy to verify that
$$\sqrt{N}D_{m^0+1}\left(\hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0) - \alpha^0\right) = \left(D_{m^0+1}^{-1}\Phi_{NT}D_{m^0+1}^{-1}\right)^{-1}\sqrt{N}D_{m^0+1}^{-1}\Psi_{NT}^u,$$
where $\Psi_{NT}^u$ is defined in (3.2). By Assumption A.3(i), $D_{m^0+1}^{-1}\Phi_{NT}D_{m^0+1}^{-1}\stackrel{P}{\to}\Phi_0 > 0$. By Assumption A.3(ii), $\sqrt{N}D_{m^0+1}^{-1}\Psi_{NT}^u\stackrel{D}{\to}N(0,\Omega_0)$. Then by the Slutsky lemma, $\sqrt{N}D_{m^0+1}(\hat\alpha_{m^0}^p(\mathcal{T}_{m^0}^0) - \alpha^0)\stackrel{D}{\to}N(0,\Phi_0^{-1}\Omega_0\Phi_0^{-1})$. This completes the proof of the theorem.
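To fix ideas, here is a deliberately simplified sketch of post-Lasso estimation given break dates: pooled OLS of $\Delta y$ on $\Delta x$ within each regime, dropping the break-date observations themselves (which, in the estimator above, couple adjacent regimes through the off-diagonal blocks of $\hat\Phi_{NT}$). All names are ours; this is an illustration, not the paper's exact estimator.

```python
import numpy as np

# Simplified sketch of post-Lasso estimation given break dates: within each
# regime, pooled OLS of Delta y on Delta x over periods t = 2, ..., T,
# excluding the break-date observations. Illustration only -- the paper's
# estimator also uses the break-date observations, which link regimes.
def post_lasso_ols(dy, dx, breaks, T):
    """dy: (N, T-1); dx: (N, T-1, p); columns index t = 2..T; breaks sorted."""
    edges = [2] + list(breaks) + [T + 1]
    alphas = []
    for j in range(len(edges) - 1):
        # regime j covers t in [edges[j], edges[j+1] - 1], minus break dates
        ts = [t for t in range(edges[j], edges[j + 1]) if t not in breaks]
        X = dx[:, [t - 2 for t in ts], :].reshape(-1, dx.shape[2])
        y = dy[:, [t - 2 for t in ts]].reshape(-1)
        alphas.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return alphas
```

With noiseless data generated from two regimes, the sketch recovers the regime-specific slopes exactly.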
Proof of Theorem 3.5. Recall that $\hat\alpha_{m_{\lambda_1}}^p(\hat{\mathcal{T}}_{m_{\lambda_1}}) = (\hat\alpha_1^p(\hat{\mathcal{T}}_{m_{\lambda_1}})',\ldots,\hat\alpha_{m_{\lambda_1}+1}^p(\hat{\mathcal{T}}_{m_{\lambda_1}})')'$ denotes the set of post-Lasso OLS estimates of the regression coefficients based on the break dates in $\hat{\mathcal{T}}_{m_{\lambda_1}} = \{\hat T_1(\lambda_1),\ldots,\hat T_{m_{\lambda_1}}(\lambda_1)\}$, where we make the dependence of the various estimates on $\lambda_1$ explicit. Let $\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 \equiv \frac{1}{T-1}Q_{1NT}(\hat\alpha_{m_{\lambda_1}}^p(\hat{\mathcal{T}}_{m_{\lambda_1}});\hat{\mathcal{T}}_{m_{\lambda_1}})$. For any $\lambda_{1NT}^0\in\Omega_0$, we have $\lim_{N\to\infty}P(m_{\lambda_{1NT}^0} = m^0) = 1$ and $\lim_{N\to\infty}P(\hat T_j(\lambda_{1NT}^0) = T_j^0,\ j = 1,\ldots,m^0) = 1$ by Corollary 3.3, as $\lambda_{1NT}^0$ also satisfies Assumptions A.2(ii)-(iii). It follows that w.p.a.1 $\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 = \hat\sigma_{\mathcal{T}_{m^0}}^2$. Using the $\sqrt{NI_j^0}$-consistency of $\hat\alpha_j^p(\mathcal{T}_{m^0})$ and the expressions $\Delta y_{it} = \alpha_j^{0\prime}\Delta x_{it} + \Delta u_{it}$ if $t\in[T_{j-1}^0+1, T_j^0-1]$ and $\Delta y_{it} = \alpha_{j+1}^{0\prime}x_{it} - \alpha_j^{0\prime}x_{i,t-1} + \Delta u_{it}$ if $t = T_j^0$, we can readily show that
\begin{align*}
\hat\sigma_{\mathcal{T}_{m^0}}^2
&= \frac{1}{N(T-1)}\sum_{j=1}^{m^0+1}\sum_{t=T_{j-1}^0+1}^{T_j^0-1}\sum_{i=1}^{N}\left(\Delta y_{it} - \hat\alpha_{j,\mathcal{T}_{m^0}}'\Delta x_{it}\right)^2\\
&\quad + \frac{1}{N(T-1)}\sum_{j=1}^{m^0}\sum_{i=1}^{N}\left(\Delta y_{iT_j^0} - \hat\alpha_{j+1,\mathcal{T}_{m^0}}'x_{iT_j^0} + \hat\alpha_{j,\mathcal{T}_{m^0}}'x_{i,T_j^0-1}\right)^2\\
&= \hat\sigma_{NT}^2 + O_P\left[(NI_{\min})^{-1}\right],
\end{align*}
where $\hat\sigma_{NT}^2 \equiv \frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta u_{it}^2\stackrel{P}{\to}\sigma_0^2 \equiv \lim_{(N,T)\to\infty}\frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}E(\Delta u_{it}^2)$ under Assumptions A.1(i)-(ii). Then by Assumption A.5 and the Slutsky lemma, $IC(\lambda_{1NT}^0) = \hat\sigma_{\mathcal{T}_{m^0}}^2 + \rho_{1NT}\,p\,(m^0+1)\stackrel{P}{\to}\sigma_0^2$. We consider the cases of under- and over-fitted models separately.

Case 1: Under-fitted model ($m_{\lambda_1} < m^0$). By Lemma A.2 below, $\inf_{\lambda_1\in\Omega_-}\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 - \hat\sigma_{\mathcal{T}_{m^0}^0}^2 \ge c_0$, where $c_0 = \frac{I_{\min}J_{\min}^2}{T-1}[c + o_P(1)]$ for some $c > 0$. Then by Assumption A.5,
\begin{align*}
P\left(\inf_{\lambda_1\in\Omega_-}IC(\lambda_1) > IC(\lambda_{1NT}^0)\right)
&= P\left(\inf_{\lambda_1\in\Omega_-}\left[\left(\hat\sigma_{\hat{\mathcal{T}}_{m_{\lambda_1}}}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right) + \rho_{1NT}\,p\,(m_{\lambda_1} - m^0)\right] > 0\right)\\
&\ge P\left(\frac{I_{\min}J_{\min}^2}{\rho_{1NT}(T-1)}\left[c + o_P(1)\right] + O_P(1) > 0\right)\to 1.
\end{align*}

Case 2: Over-fitted model ($m_{\lambda_1} > m^0$). Let $\mathbb{T}_m \equiv \{\mathcal{T}_m = \{T_1,\ldots,T_m\} : 2 \le T_1 < \cdots < T_m \le T\}$. Given $\mathcal{T}_m = \{T_1,\ldots,T_m\}\in\mathbb{T}_m$, let $\bar{\mathcal{T}}_{m^*+m^0} = \{\bar T_1,\bar T_2,\ldots,\bar T_{m^*+m^0}\}$ denote the union of $\mathcal{T}_m$ and $\mathcal{T}_{m^0}^0$ with elements ordered in non-descending order: $2 \le \bar T_1 < \bar T_2 < \cdots < \bar T_{m^*+m^0} \le T$ for some $m^*\in\{0,1,\ldots,m\}$. Let $\hat\alpha_m^p(\mathcal{T}_m) \equiv (\hat\alpha_1^p(\mathcal{T}_m)',\ldots,\hat\alpha_{m+1}^p(\mathcal{T}_m)')' = \arg\min_{\alpha_m}Q_{1NT}(\alpha_m;\mathcal{T}_m)$ and $\hat\sigma_{\mathcal{T}_m}^2 \equiv \frac{1}{T-1}Q_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m)$; $\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2$ is analogously defined. In view of the facts that $\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2 \le \hat\sigma_{\mathcal{T}_m}^2$ for all $\mathcal{T}_m\in\mathbb{T}_m$, that $N(\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2 - \hat\sigma_{NT}^2) = O_P(1)$ uniformly in $\mathcal{T}_m\in\mathbb{T}_m$ by Lemma A.3 below, and that $N\rho_{1NT}\to\infty$ by Assumption A.5, we have
\begin{align*}
P\left(\inf_{\lambda_1\in\Omega_+}IC(\lambda_1) > IC(\lambda_{1NT}^0)\right)
&\ge P\left(\min_{m^0<m\le m_{\max}}\inf_{\mathcal{T}_m\in\mathbb{T}_m}\left[N\left(\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right) + N\rho_{1NT}\,p\,(m - m^0)\right] > 0\right)\\
&\ge P\left(\min_{m^0<m\le m_{\max}}\inf_{\mathcal{T}_m\in\mathbb{T}_m}\left[N\left(\hat\sigma_{\bar{\mathcal{T}}_{m^*+m^0}}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right) + N\rho_{1NT}\,p\,(m - m^0)\right] > 0\right)\to 1 \text{ as } N\to\infty.
\end{align*}
Lemma A.2 Let $\mathbb{T}_m = \{\mathcal{T}_m = \{T_1,\ldots,T_m\} : 2 \le T_1 < \cdots < T_m \le T\}$, $T_0 = 1$, and $T_{m+1} = T+1$. Then $\min_{0\le m<m^0}\inf_{\mathcal{T}_m\in\mathbb{T}_m}\frac{T-1}{I_{\min}J_{\min}^2}\left(\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{\mathcal{T}_{m^0}^0}^2\right) \ge c + o_P(1)$ for some $c > 0$.
Proof. First, by standard results for least squares regressions, we can readily show that $\hat\sigma_{\mathcal{T}_{m^0}^0}^2 = \hat\sigma_{NT}^2 + O_P((NI_{\min})^{-1})$. In view of the facts that $\hat\alpha_m^p = \arg\min_{\alpha_m}D_{1NT}(\alpha_m;\mathcal{T}_m)$, where
\begin{align*}
D_{1NT}(\alpha_m;\mathcal{T}_m) &\equiv Q_{1NT}(\alpha_m;\mathcal{T}_m) - Q_{1NT}(\alpha_{m^0}^0;\mathcal{T}_{m^0}^0)\\
&= \frac{1}{N}\sum_{j=1}^{m+1}\sum_{t=T_{j-1}+1}^{T_j-1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \alpha_j'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&\quad + \frac{1}{N}\sum_{j=1}^{m}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_j} - \alpha_{j+1}'x_{iT_j} + \alpha_j'x_{i,T_j-1}\right)^2 - (\Delta u_{iT_j})^2\right],
\end{align*}
and $(T-1)(\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{NT}^2) = D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m)$, it suffices to show that $\frac{1}{I_{\min}J_{\min}^2}D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m) \ge c + o_P(1)$ uniformly in $\mathcal{T}_m\in\mathbb{T}_m$ for each $m\in[0, m^0-1]$. We consider three cases: (a) $m^0 = 1$, (b) $m^0 = 2$, and (c) $3 \le m^0 \le m_{\max}$.

In case (a), $m = 0$ and $\mathcal{T}_m = \mathcal{T}_0$ is empty, so that the post-Lasso estimate $\hat\alpha_m^p(\mathcal{T}_m) = \hat\alpha_0^p(\mathcal{T}_0) = \hat\alpha_1^p(\mathcal{T}_0)$ becomes the OLS estimate in the first-differenced model: $\hat\alpha_1^p(\mathcal{T}_0) = \left(\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'\right)^{-1}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta y_{it}$. Noting that
$$\Delta y_{it} = x_{it}'\beta_t^0 - x_{i,t-1}'\beta_{t-1}^0 + \Delta u_{it} =
\begin{cases}
(\Delta x_{it})'\alpha_1^0 + \Delta u_{it} & \text{if } 2 \le t \le T_1^0 - 1,\\
x_{it}'\alpha_2^0 - x_{i,t-1}'\alpha_1^0 + \Delta u_{it} & \text{if } t = T_1^0,\\
(\Delta x_{it})'\alpha_2^0 + \Delta u_{it} & \text{if } T_1^0 + 1 \le t \le T,
\end{cases}$$
we have
\begin{align*}
\hat\alpha_1^p(\mathcal{T}_0) &= M_{NT}^{-1}M_{1NT}\alpha_1^0 + M_{NT}^{-1}M_{2NT}\alpha_2^0 + M_{NT}^{-1}\frac{1}{N(T-1)}\sum_{i=1}^{N}\Delta x_{iT_1^0}\left(x_{iT_1^0}'\alpha_2^0 - x_{i,T_1^0-1}'\alpha_1^0\right)\\
&\quad + M_{NT}^{-1}\frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta u_{it}\\
&= M_{NT}^{-1}M_{1NT}\alpha_1^0 + M_{NT}^{-1}M_{2NT}\alpha_2^0 + M_{NT}^{-1}\frac{1}{T-1}\left(\phi_{\Delta xx,T_1^0}\alpha_2^0 - \phi_{\Delta xx,T_1^0,T_1^0-1}\alpha_1^0\right) + O_P\left((N(T-1))^{-1/2}\right),
\end{align*}
where $M_{NT} = \frac{1}{N(T-1)}\sum_{t=2}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'$, $M_{1NT} = \frac{1}{N(T-1)}\sum_{t=2}^{T_1^0-1}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'$, $M_{2NT} = \frac{1}{N(T-1)}\sum_{t=T_1^0+1}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'$, and the last line follows from Assumption A.1.$^9$ Note that
\begin{align*}
D_{1NT}(\hat\alpha_0^p(\mathcal{T}_0);\mathcal{T}_0) &= \frac{1}{N}\sum_{t=2}^{T}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_1^p(\mathcal{T}_0)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&= \frac{1}{N}\sum_{t=2}^{T}\sum_{i=1}^{N}\left[\left(\beta_t^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{it} - \left(\beta_{t-1}^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{i,t-1}\right]^2\\
&\quad + \frac{2}{N}\sum_{t=2}^{T}\sum_{i=1}^{N}\left[\left(\beta_t^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{it} - \left(\beta_{t-1}^0 - \hat\alpha_1^p(\mathcal{T}_0)\right)'x_{i,t-1}\right]\Delta u_{it} \equiv \bar D_1 + 2\bar D_2, \text{ say}.
\end{align*}
Further,
\begin{align*}
\bar D_1 &= \frac{1}{N}\sum_{t=2}^{T_1^0-1}\sum_{i=1}^{N}\left\{\left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'\Delta x_{it}\right\}^2 + \frac{1}{N}\sum_{t=T_1^0+1}^{T}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'\Delta x_{it}\right\}^2\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{iT_1^0} - \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{i,T_1^0-1}\right\}^2\\
&= (T-1)\left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{1NT}\left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right] + (T-1)\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{2NT}\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{iT_1^0} - \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{i,T_1^0-1}\right\}^2 \equiv \bar D_{11} + \bar D_{12} + \bar D_{13}, \text{ say}.
\end{align*}
Let $d_1^0 = \alpha_2^0 - \alpha_1^0$. Then $\hat\alpha_1^p(\mathcal{T}_0) - \alpha_1^0 = M_{NT}^{-1}\tilde M_{2NT}d_1^0 + O_P\left((N(T-1))^{-1/2}\right)$ and $\hat\alpha_1^p(\mathcal{T}_0) - \alpha_2^0 = -M_{NT}^{-1}\tilde M_{1NT}d_1^0 + O_P\left((N(T-1))^{-1/2}\right)$, where $\tilde M_{2NT} = M_{2NT} + \frac{1}{T-1}\phi_{\Delta xx,T_1^0}$ and $\tilde M_{1NT} = M_{1NT} - \frac{1}{T-1}\phi_{\Delta xx,T_1^0,T_1^0-1}$.

Noting that $\|d_1^0\| = J_{\min}$, $\phi_{\Delta xx,T_1^0} = O_P(1)$, $\phi_{\Delta xx,T_1^0,T_1^0-1} = O_P(1)$, $M_{NT}^{-1} = O_P(1)$, $M_{1NT} = O_P\left(\frac{I_1^0-1}{T-1}\right)$, $M_{2NT} = O_P\left(\frac{I_2^0-1}{T-1}\right)$, $\tilde M_{1NT} = O_P\left(\frac{I_1^0-1}{T-1}\right)$, and $\tilde M_{2NT} = O_P\left(\frac{I_2^0-1}{T-1}\right)$, we have
$$d_1^{0\prime}\tilde M_{2NT}'M_{NT}^{-1}M_{1NT}M_{NT}^{-1}\tilde M_{2NT}d_1^0 \ge \frac{J_{\min}^2\left(I_1^0-1\right)\left(I_2^0-1\right)^2}{(T-1)^3}c_{1NT},$$
and
$$d_1^{0\prime}\tilde M_{1NT}'M_{NT}^{-1}M_{2NT}M_{NT}^{-1}\tilde M_{1NT}d_1^0 \ge \frac{J_{\min}^2\left(I_1^0-1\right)^2\left(I_2^0-1\right)}{(T-1)^3}c_{2NT},$$
where $c_{1NT} = \frac{(T-1)^3}{(I_1^0-1)(I_2^0-1)^2}\lambda_{\min}\left(\tilde M_{2NT}'M_{NT}^{-1}M_{1NT}M_{NT}^{-1}\tilde M_{2NT}\right)$ and $c_{2NT} = \frac{(T-1)^3}{(I_1^0-1)^2(I_2^0-1)}\lambda_{\min}\left(\tilde M_{1NT}'M_{NT}^{-1}M_{2NT}M_{NT}^{-1}\tilde M_{1NT}\right)$.

$^9$Strictly speaking, we need $3 \le T_1^0 \le T-1$. If $T_1^0 = 2$ (resp. $T$), then $M_{1NT} = 0$ (resp. $M_{2NT} = 0$) as $\sum_{t=a}^{b} = 0$ when $a > b$.

To bound $\bar D_1$, we consider two subcases: (a1) $I_1^0 \ge 2$ and $I_2^0 \ge 2$, and (a2) $I_1^0 = 1$ or $I_2^0 = 1$. Observe that in subcase (a1)
2 = 1. Observe that in subcase (a1)
D11 + D12
IminJ2min
=T − 1
IminJ2min
[α0
1 − αp1(T0)
]′M1NT
[α0
1 − αp1(T0)
]+[α0
2 − αp1(T0)
]′M2NT
[α0
2 − αp1(T0)
]≥ T − 1
IminJ2min
J2
min
(I01 − 1
) (I02 − 1
)2(T − 1)
3 c1NT +J2
min
(I01 − 1
)2 (I02 − 1
)(T − 1)
3 c2NT
+T − 1
IminJ2min
OP((N(T − 1))−1
)+ s.m.
≥ 1
Imin
(I01 − 1
) (I02 − 1
)2(T − 1)
2 +
(I01 − 1
)2 (I02 − 1
)(T − 1)
2
cNT + oP (1)
=
(I01 − 1
) (I02 − 1
)(T − 2)
Imin (T − 1)2 cNT + oP (1)
where cNT = min(c1NT , c2NT ) and s.m. denotes terms that are of smaller order than the expressed
terms. Then 1IminJ2min
D1 ≥ c+ oP (1) where c = limT→∞(I01−1)(I02−1)(T−2)
Imin(T−1)2plim(NT ) cNT > 0. In subcase
(a2), D11 = 0 or D12 = 0, (D11 + D12)/(IminJ2min) = oP (1) , and we need to show that 1
IminJ2minD13 is
stochastically bounded below by a positive constant. By Assumption A.4(i),
D13
IminJ2min
≥ 1
IminJ2min
minα1
1
N
N∑i=1
[(α0
2 − α1
)′xiT 01 −
(α0
1 − α1
)′xi,T 01−1
]2≥ cαImin
+ oP (1) .
Then 1IminJ2min
D1 ≥ cαImin
+ oP (1).
To determine the probability order of $\bar D_2$, we make the following decomposition:
\begin{align*}
\bar D_2 &= \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{1NT}^u + \left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'M_{2NT}^u\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left\{\left[\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{iT_1^0} - \left[\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0)\right]'x_{i,T_1^0-1}\right\}\Delta u_{iT_1^0} \equiv \bar D_{21} + \bar D_{22} + \bar D_{23}, \text{ say},
\end{align*}
where $M_{1NT}^u = \frac{1}{N}\sum_{t=2}^{T_1^0-1}\sum_{i=1}^{N}\Delta x_{it}\Delta u_{it}$ and $M_{2NT}^u = \frac{1}{N}\sum_{t=T_1^0+1}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta u_{it}$. Noting that $\alpha_1^0 - \hat\alpha_1^p(\mathcal{T}_0) = J_{\min}O_P\left(\frac{I_2^0-1}{T-1}\right)$, $\alpha_2^0 - \hat\alpha_1^p(\mathcal{T}_0) = J_{\min}O_P\left(\frac{I_1^0-1}{T-1}\right)$, $M_{1NT}^u = O_P\left(\sqrt{(I_1^0-1)/N}\right)$, and $M_{2NT}^u = O_P\left(\sqrt{(I_2^0-1)/N}\right)$, we have
\begin{align*}
\frac{\bar D_{21} + \bar D_{22}}{I_{\min}J_{\min}^2}
&= \frac{1}{I_{\min}J_{\min}}\left[O_P\left(\frac{I_2^0-1}{T-1}\right)O_P\left(\sqrt{\frac{I_1^0-1}{N}}\right) + O_P\left(\frac{I_1^0-1}{T-1}\right)O_P\left(\sqrt{\frac{I_2^0-1}{N}}\right)\right]\\
&= \frac{1}{I_{\min}(T-1)J_{\min}\sqrt{N}}O_P\left(\left(I_2^0-1\right)\sqrt{I_1^0-1} + \left(I_1^0-1\right)\sqrt{I_2^0-1}\right) = o_P(1).
\end{align*}
Similarly, noting that $\frac{1}{N}\sum_{i=1}^{N}x_{it}\Delta u_{iT_1^0} = O_P(N^{-1/2})$ for $t = T_1^0$ and $T_1^0-1$, we have
$$\frac{\bar D_{23}}{I_{\min}J_{\min}^2} = \frac{1}{I_{\min}J_{\min}\sqrt{N}}\left[O_P\left(\frac{I_1^0-1}{T-1}\right) + O_P\left(\frac{I_2^0-1}{T-1}\right)\right] = o_P(1).$$
It follows that $\frac{\bar D_2}{I_{\min}J_{\min}^2} = o_P(1)$. Consequently, we have $\frac{1}{I_{\min}J_{\min}^2}D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m) \ge c + o_P(1)$.
In cases (b)-(c), it suffices to consider the case where $m = m^0 - 1$. [If $m < m^0 - 1$, one can always augment the set $\mathcal{T}_m$ by $m^0 - 1 - m$ true break points that are not inside $\mathcal{T}_m$ to make $D_{1NT}(\hat\alpha_m^p(\mathcal{T}_m);\mathcal{T}_m)$ smaller.] For case (b) with $m = 1$, we consider three subcases: (b.1) $2 \le T_1 < T_1^0$, (b.2) $T_1^0 \le T_1 \le T_2^0$, and (b.3) $T_2^0 < T_1 \le T$, where (b.3) is redundant if $T_2^0 = T$ (i.e., the second true break occurs at the end of the sample). In subcase (b.1), we can focus on the interval $[T_1+1, T]$, which contains two true break points $T_1^0$ and $T_2^0$ that are not accounted for by the post-Lasso estimate $\hat\alpha_1^p(\mathcal{T}_1) = (\hat\alpha_1^p(\mathcal{T}_1)', \hat\alpha_2^p(\mathcal{T}_1)')'$. Observe that
\begin{align*}
D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) &= \frac{1}{N}\sum_{t=2}^{T_1-1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_1^p(\mathcal{T}_1)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&\quad + \frac{1}{N}\sum_{t=T_1+1}^{T}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_2^p(\mathcal{T}_1)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]\\
&\quad + \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_1} - \hat\alpha_2^p(\mathcal{T}_1)'x_{iT_1} + \hat\alpha_1^p(\mathcal{T}_1)'x_{i,T_1-1}\right)^2 - (\Delta u_{iT_1})^2\right]\\
&\equiv \bar D_3 + \bar D_4 + \bar D_5, \text{ say}.
\end{align*}
It is easy to show that $\hat\alpha_1^p(\mathcal{T}_1) - \alpha_1^0 = O_P(N^{-1/2})$. With this, one can readily show that $\bar D_3 = O_P(N^{-1})$. Let $\check\alpha = \arg\min_{\alpha}\frac{1}{N}\sum_{i=1}^{N}\left(\Delta y_{iT_1} - \alpha'x_{iT_1} + \hat\alpha_1^p(\mathcal{T}_1)'x_{i,T_1-1}\right)^2$. By standard results for OLS regressions, $\check\alpha - \beta_{T_1}^0 = O_P(N^{-1/2})$ and
$$\check D_5 = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_1} - \check\alpha'x_{iT_1} + \hat\alpha_1^p(\mathcal{T}_1)'x_{i,T_1-1}\right)^2 - (\Delta u_{iT_1})^2\right] = O_P(N^{-1}).$$
It follows that $\bar D_5 \ge \check D_5 = O_P(N^{-1})$ and $D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) \ge \bar D_4 + O_P(N^{-1})$. A simple repetition of the argument used in case (a) (now with two true breaks) yields $\frac{1}{I_{\min}J_{\min}^2}\bar D_4 \ge c + o_P(1)$ for some $c > 0$. Then, by the fact that $NJ_{\min}^2\to c_J = \infty$, we have $\frac{1}{I_{\min}J_{\min}^2}D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) \ge c + o_P(1)$.

For subcase (b.2), wlog we assume that $T_1 - T_1^0 \ge T_2^0 - T_1$, which implies that $T_1 - T_1^0 \ge I_{\min}/2$. Then we can focus on the interval $[2, T_1]$, which contains a true break point $T_1^0$. As in subcase (b.1), we can show that $D_{1NT}(\hat\alpha_1^p(\mathcal{T}_1);\mathcal{T}_1) \ge \check D_{1NT} + O_P(N^{-1})$, where $\check D_{1NT} = \frac{1}{N}\sum_{t=2}^{T_1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \hat\alpha_1^p(\mathcal{T}_1)'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]$. A simple repetition of the argument used in case (a) yields $\frac{1}{I_{\min}J_{\min}^2}\check D_{1NT} \ge c + o_P(1)$ for some $c > 0$. Subcase (b.3) is analogous to subcase (b.1). Hence the conclusion follows in case (b). Case (c) can be studied analogously. This completes the proof of the lemma.
Lemma A.3 Let $\bar{\mathbb{T}}_m = \{\mathcal{T}_m = \{T_1,\ldots,T_m\} : \mathcal{T}_{m^0}\subset\mathcal{T}_m,\ 2 \le T_1 < \cdots < T_m \le T\}$, where $m^0 < m \le m_{\max}$. Then $\max_{m^0<m\le m_{\max}}\sup_{\mathcal{T}_m\in\bar{\mathbb{T}}_m}N\left|\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{\mathcal{T}_{m^0}}^2\right| = O_P(1)$.

Proof. Let $\mathcal{T}_m\in\bar{\mathbb{T}}_m$, where $m^0 < m \le m_{\max}$. In view of the facts that $\hat\sigma_{\mathcal{T}_{m^0}^0}^2 \ge \hat\sigma_{\mathcal{T}_m}^2$ and $\hat\sigma_{\mathcal{T}_{m^0}^0}^2 = \hat\sigma_{NT}^2 + O_P((NI_{\min})^{-1})$, we have
$$0 \le \hat\sigma_{\mathcal{T}_{m^0}^0}^2 - \hat\sigma_{\mathcal{T}_m}^2 = \hat\sigma_{NT}^2 - \hat\sigma_{\mathcal{T}_m}^2 + O_P\left((NI_{\min})^{-1}\right) \le (m+1)\bar J_{NT} + O_P\left((NI_{\min})^{-1}\right),\tag{A.6}$$
where
$$\bar J_{NT} \equiv \max_{\substack{0\le s\le m\\ (T_s,\,T_{s+1}-1]\text{ contains no break point}}}\left|\inf_{(\alpha,\beta)}S_s(\alpha,\beta)\right|$$
and $S_s(\alpha,\beta) = \frac{1}{N(T-1)}\sum_{t=T_s+1}^{T_{s+1}-1}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \alpha'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right] + \frac{1}{N(T-1)}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_{s+1}} - \beta'x_{iT_{s+1}} + \alpha'x_{i,T_{s+1}-1}\right)^2 - (\Delta u_{iT_{s+1}})^2\right]$ for $s < m$, and $S_m(\alpha,\beta) = S_m(\alpha) = \frac{1}{N(T-1)}\sum_{t=T_m+1}^{T}\sum_{i=1}^{N}\left[\left(\Delta y_{it} - \alpha'\Delta x_{it}\right)^2 - (\Delta u_{it})^2\right]$. Let $(\hat\alpha_s,\hat\beta_s) = \arg\min_{(\alpha,\beta)}S_s(\alpha,\beta)$ and $\hat\gamma_s = (\hat\alpha_s',\hat\beta_s')'$ when $s < m$, and $\hat\alpha_m = \arg\min_{\alpha}S_m(\alpha)$. To study $\inf_{(\alpha,\beta)}S_s(\alpha,\beta)$ for $s = 0,1,\ldots,m$, we consider three cases: (a) $T_{s+1} - T_s = 1$, $s < m$; (b) $T_{s+1} - T_s \ge 2$, $s < m$; and (c) $T_{s+1} - T_s \ge 2$, $s = m$.$^{10}$

In case (a), noting that the first term in the definition of $S_s(\alpha,\beta)$ is zero, we have $\hat\gamma_s = (X_s'X_s)^{-1}X_s'\Delta Y_s$, where $X_s = (X_{1s},\ldots,X_{Ns})'$, $X_{is} = (-x_{i,T_{s+1}-1}', x_{iT_{s+1}}')'$, and $\Delta Y_s = (\Delta y_{1T_{s+1}},\ldots,\Delta y_{NT_{s+1}})'$. Let $\gamma_s^0 = (\beta_{T_{s+1}-1}^{0\prime}, \beta_{T_{s+1}}^{0\prime})'$. One can readily show that $\hat\gamma_s - \gamma_s^0 = O_P(N^{-1/2})$. In addition,
\begin{align*}
(T-1)S_s(\hat\alpha_s,\hat\beta_s) &= \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta y_{iT_{s+1}} - \hat\beta_s'x_{iT_{s+1}} + \hat\alpha_s'x_{i,T_{s+1}-1}\right)^2 - (\Delta u_{iT_{s+1}})^2\right]\\
&= \frac{1}{N}\sum_{i=1}^{N}\left[\left(\Delta u_{iT_{s+1}} - (\hat\gamma_s - \gamma_s^0)'X_{is}\right)^2 - (\Delta u_{iT_{s+1}})^2\right]\\
&= (\hat\gamma_s - \gamma_s^0)'\left(\frac{1}{N}\sum_{i=1}^{N}X_{is}X_{is}'\right)(\hat\gamma_s - \gamma_s^0) - 2(\hat\gamma_s - \gamma_s^0)'\left(\frac{1}{N}\sum_{i=1}^{N}X_{is}\Delta u_{iT_{s+1}}\right)\\
&= O_P(N^{-1/2})O_P(1)O_P(N^{-1/2}) - 2O_P(N^{-1/2})O_P(N^{-1/2}) = O_P(N^{-1}).
\end{align*}
In case (c), $\hat\alpha_m = \left(\sum_{t=T_m+1}^{T}\sum_{i=1}^{N}\Delta x_{it}(\Delta x_{it})'\right)^{-1}\sum_{t=T_m+1}^{T}\sum_{i=1}^{N}\Delta x_{it}\Delta y_{it}$. It is easy to verify that $\hat\alpha_m - \alpha_{m^0+1}^0 = O_P(N^{-1/2})$ and $S_m(\hat\alpha_m) = O_P(N^{-1})$. Similarly, in case (b), one can verify that $\hat\gamma_s - \gamma_s^0 = O_P(N^{-1/2})$ and $S_s(\hat\alpha_s,\hat\beta_s) = O_P(N^{-1})$. It follows that $\bar J_{NT} = O_P(N^{-1})$. This, in conjunction with (A.6), implies that
$$\hat\sigma_{\mathcal{T}_m}^2 - \hat\sigma_{NT}^2 = O_P(N^{-1}),$$
which holds for all $m\in\{m^0+1,\ldots,m_{\max}\}$ and $\mathcal{T}_m = \{T_1,\ldots,T_m\}$. Then the conclusion follows.
$^{10}$To see why we do not need to consider the case where $T_{m+1} - T_m = 1$, note that if $s = m$, we have $T_{m+1} - T_m = T + 1 - T_m$ as $T_{m+1} = T + 1$ by default. If $T_{m+1} - T_m = 1$, we have $T_m = T$ so that $S_m(\alpha) = 0$ in this case.

B Proof of the results in Section 4

Let $V_{2NT}(\beta) \equiv \sum_{t=2}^{T}\left[\frac{1}{N}\sum_{i=1}^{N}\rho_{it}(\beta_t,\beta_{t-1})\right]'W_t\left[\frac{1}{N}\sum_{i=1}^{N}\rho_{it}(\beta_t,\beta_{t-1})\right]$, where $\rho_{it}(\beta_t,\beta_{t-1}) = z_{it}(\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1})$. We first prove a technical lemma.
Lemma B.1 Suppose Assumption B.1 holds. Then $\dot\beta_t - \beta_t^0 = O_P(N^{-1/2})$ for each $t = 1,2,\ldots,T$, where $\dot\beta = \arg\min_{\beta}V_{2NT}(\beta)$.

Proof. The proof is analogous to that of Lemma A.1, and we only sketch it. Let $\dot b_t = N^{1/2}(\dot\beta_t - \beta_t^0)$ and $\dot b = N^{1/2}(\dot\beta - \beta^0)$. Let $\xi_{it} = x_{it}'b_t - x_{i,t-1}'b_{t-1}$, where recall $b_t = N^{1/2}(\beta_t - \beta_t^0)$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}\xi_{it}$, we have
\begin{align*}
N\left[V_{2NT}(\beta) - V_{2NT}(\beta^0)\right]
&= \sum_{t=2}^{T}\left(\frac{1}{N}\sum_{i=1}^{N}\xi_{it}z_{it}'\right)W_t\left(\frac{1}{N}\sum_{i=1}^{N}z_{it}\xi_{it}\right) - 2N^{1/2}\sum_{t=2}^{T}\left(\frac{1}{N}\sum_{i=1}^{N}\xi_{it}z_{it}'\right)W_t\left(\frac{1}{N}\sum_{i=1}^{N}z_{it}\Delta u_{it}\right)\\
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u \equiv B_1(b) - 2B_2(b), \text{ say},
\end{align*}
where $Q_{NT}$ and $R_{NT}^u$ are defined in (2.9) and (2.10), respectively. As in the proof of Lemma A.1, under Assumption B.1 we can readily show that w.p.a.1
$$T^{-1}\left[B_1(b) - 2B_2(b)\right] \ge (c_{Q_0}/2)\,T^{-1}\|b\|^2 - T^{-1/2}\|b\|\,O_P(1) > 0$$
if $T^{-1/2}\|b\|$ is sufficiently large. It follows that $T^{-1/2}\|\dot b\|$ must be stochastically bounded, and the result follows if $T$ is fixed.

In the case of large $T$, we can show that $\dot\beta_t - \beta_t^0 = O_P(N^{-1/2})$ for each $t$ by the same arguments as used in the second part of the proof of Lemma A.1, as $Q_{NT}$ is a symmetric block tridiagonal matrix that is asymptotically nonsingular.
Proof of Theorem 4.1. (i) The proof parallels that of Theorem 3.1, and we only sketch it. Let $\hat b_t = N^{1/2}(\hat\beta_t - \beta_t^0)$ and $\hat b = N^{1/2}(\hat\beta - \beta^0)$. Noting that $\Delta y_{it} - x_{it}'\beta_t + x_{i,t-1}'\beta_{t-1} = \Delta u_{it} - N^{-1/2}\xi_{it}$, where $\xi_{it} = x_{it}'b_t - x_{i,t-1}'b_{t-1}$, we have
\begin{align*}
N\left[V_{2NT,\lambda_2}(\beta) - V_{2NT,\lambda_2}(\beta^0)\right]
&= b'Q_{NT}b - 2b'\sqrt{N}R_{NT}^u + N\lambda_2\sum_{t=2,\,t\in\mathcal{T}_{m^0}^0}^{T}\dot w_t\left[\left\|\beta_t^0 - \beta_{t-1}^0 + N^{-1/2}(b_t - b_{t-1})\right\| - \left\|\beta_t^0 - \beta_{t-1}^0\right\|\right]\\
&\quad + N\lambda_2\sum_{t=2,\,t\in\mathcal{T}_{m^0}^{0c}}^{T}\dot w_t\left\|N^{-1/2}(b_t - b_{t-1})\right\| \equiv B_1(b) - 2B_2(b) + B_3(b) + B_4(b), \text{ say}.
\end{align*}
As in the proof of Theorem 3.1, we can show that $|T^{-1}B_3(b)| = O_P\left(N^{1/2}\lambda_2 T^{-1/2}J_{\min}^{-\kappa_2}\right)T^{-1/2}\|b\| = O_P(1)\,T^{-1/2}\|b\|$ and w.p.a.1
$$\left[B_1(b) - 2B_2(b) + B_3(b)\right]/T \ge \lambda_{\min}(Q_{NT})\,T^{-1}\|b\|^2 - O_P(1)\,T^{-1/2}\|b\| > 0$$
if $T^{-1/2}\|b\| = L$ is sufficiently large. Consequently, $N[V_{2NT,\lambda_2}(\beta) - V_{2NT,\lambda_2}(\beta^0)] > 0$ w.p.a.1 for large $L$, and $V_{2NT,\lambda_2}(\beta)$ cannot be minimized in this case. This further implies that $T^{-1/2}\|\hat b\|$ has to be stochastically bounded.

(ii) The result follows from (i) in the case of fixed $T$. In the case of large $T$, the proof is analogous to that of the second part of Theorem 3.1, utilizing the fact that $Q_{NT}$ is an asymptotically nonsingular symmetric block tridiagonal matrix.
Proof of Theorem 4.2. We want to demonstrate that
P(∥∥∥θt∥∥∥ = 0 for all t ∈ T 0c
m0
)→ 1 as N →∞. (B.1)
Suppose that to the contrary, θt = βt − βt−1 6= 0 for some t ∈ T 0cm0 for suffi ciently large N or (N,T ) .
To consider the optimization conditions wrt βt, t ≥ 2, based on subdifferential calculus (e.g., Bersekas
(1995, Appendix B.5)), we distinguish two cases: (a) 2 ≤ t ≤ T − 1 and (b) t = T and T ∈ T 0cm0 .
In case (a), we consider two subcases: (a1) t+1 = T 0j ∈ T 0
m0 for some j = 1, ...,m0, and (a2) t+1 ∈ T 0cm0 .
In either case, we can apply the FOC wrt βt and the equality ∆yit = β0′t xit−β0′
t−1xi,t−1 + ∆uit to obtain
0p×1 = − 2
N
N∑i=1
xitz′itWt
1√N
N∑i=1
zit
[∆yit − β
′txit + β
′t−1xi,t−1
](B.2)
+2
N
N∑i=1
x′itzi,t+1Wt+11√N
N∑i=1
zi,t+1
[∆yi,t+1 − β
′t+1xi,t+1 + β
′txit
]+√Nλ1wt
θt,p∥∥∥θt∥∥∥ −√Nλ1wt+1et+1
= −2φ′zx,tWt1√N
N∑i=1
zit
[∆uit −
(βt − β0
t
)′xit +
(βt−1 − β0
t−1
)′xi,t−1
]
+2φ′zx,t+1,tWt+11√N
N∑i=1
zi,t+1
[∆ui,t+1 −
(βt+1 − β0
t+1
)′xi,t+1 +
(βt − β0
t
)′xit
]
+√Nλ2wt
θt,p∥∥∥θt∥∥∥ −√Nλ2wt+1et+1
= −2√N [φ′zx,t+1,tWt+1φzx,t+1
(βt+1 − β0
t+1
)− φ′zx,tWt+1φzx,t
(βt − β0
t
)−φ′zx,t+1,tWt+1φzx,t+1,t
(βt − β0
t
)+ φ′zx,tWtφzx,t,t−1
(βt−1 − β0
t−1
)′]
+2√N(φ′zx,t+1,tWt+1φz∆u,t+1 − φzx,tWtφz∆u,t
)+√Nλ2wt
θt,p∥∥∥θt∥∥∥ −√Nλ2wt+1et+1,p
≡ B1t +B2t +B3t −B4t, say,
where et+1 = θt+1/∥∥∥θt+1
∥∥∥ if ∥∥∥θt+1
∥∥∥ 6= 0 and ‖et+1‖ ≤ 1 otherwise.
Since $\hat\theta_t \neq 0$, there exists $r \in \{1, \ldots, p\}$ such that $|\hat\theta_{t,r}| = \max\{|\hat\theta_{t,l}|,\ l = 1, \ldots, p\}$, where for any $p \times 1$ vector $a_t$, $a_{t,l}$ denotes its $l$th element. Wlog assume that $r = p$, implying that $|\hat\theta_{t,p}|/\|\hat\theta_t\| \geq 1/\sqrt p$. By Assumptions B.1(i)-(ii) and Theorem 4.1, $B_{1t,p} = O_P(1)$ and $B_{2t,p} = O_P(1)$. In view of the fact that $w_t^{-1} = O_P(N^{-\kappa_2/2})$ for $t \in \mathcal{T}^{0c}_{m_0}$, $|B_{3t,p}| \geq \sqrt N \lambda_2 w_t/\sqrt p$, which is explosive in probability under Assumption B.2(iii) ($N^{(\kappa_2+1)/2}\lambda_2 \to \infty$). To bound the probability order of $B_{4t,p}$, we distinguish two subcases. In subcase (a1), noting that $\hat\beta_{t+1} - \hat\beta_t \stackrel{P}{\to} \theta^0_{t+1} \neq 0$ by Theorem 4.1, we have $w_{t+1} = \|\theta^0_{t+1} + O_P(N^{-1/2})\|^{-\kappa_2} = O_P\big(J_{\min}^{-\kappa_2}\big)$ and $B_{4t,p} = \sqrt N \lambda_2 w_{t+1} e_{t+1,p} = O_P\big(\sqrt N \lambda_2 J_{\min}^{-\kappa_2}\big) = O_P(1)$. Consequently, $|B_{3t,p}| \gg |B_{1t,p} + B_{2t,p} + B_{4t,p}|$, so that (B.2) cannot be true for sufficiently large $N$ or $(N,T)$. Then we conclude that w.p.a.1, $\hat\theta_t$ must be in a position where $\|\hat\theta_t\|$ is not differentiable in subcase (a1). In addition, a direct application of this result is that if $T^0_j - 1 \in \mathcal{T}^{0c}_{m_0}$ for some $j = 1, \ldots, m_0$, then $P\big(\|\hat\theta_{T^0_j-1}\| = 0\big) \to 1$ as $N \to \infty$ and $\sqrt N \lambda_2 w_{T^0_j-1} e_{T^0_j-1} = O_P(1)$ in order for the FOC to hold for $t = T^0_j - 1$.
In subcase (a2), we apply deductive arguments as used in the proof of Theorem 3.2, together with the result in subcase (a1), to show that $\hat\theta_t$ must be in a position where $\|\hat\theta_t\|$ is not differentiable for all $t \in \mathcal{T}^{0c}_{m_0}$ with $t \neq T$.
In case (b), noting that only one term in the penalty $\lambda_2 \sum_{t=2}^{T} w_t \|\beta_t - \beta_{t-1}\|$ involves $\beta_T$, it is easy to show that $\hat\theta_T = \hat\beta_T - \hat\beta_{T-1}$ must be in a position where $\|\hat\theta_T\|$ is not differentiable if $T \in \mathcal{T}^{0c}_{m_0}$. Consequently, (B.1) follows.
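The mechanism behind (B.1) — that the FOC can only hold at the kink of $\|\theta_t\|$, forcing exact zeros — is the same one that makes the group-lasso proximal operator return exact zeros. A minimal numerical sketch (illustrative only; `block_soft_threshold` is not the paper's estimator, merely the prox of a single penalty term $\lambda\|\theta\|$):

```python
import numpy as np

def block_soft_threshold(v, lam):
    """Minimizer of 0.5*||theta - v||^2 + lam*||theta|| (group-lasso prox).

    The subdifferential of ||theta|| at 0 is {e : ||e|| <= 1}, so theta = 0
    solves the FOC iff ||v|| <= lam; otherwise v is shrunk radially toward 0.
    """
    norm_v = np.linalg.norm(v)
    if norm_v <= lam:
        return np.zeros_like(v)       # exact zero: the non-differentiable point
    return (1.0 - lam / norm_v) * v   # radial shrinkage, direction preserved

lam = 1.0
v_small = np.array([0.3, -0.4])       # ||v|| = 0.5 <= lam -> exactly zero jump
v_large = np.array([3.0, -4.0])       # ||v|| = 5.0 >  lam -> shrunk jump
print(block_soft_threshold(v_small, lam))  # [0. 0.]
print(block_soft_threshold(v_large, lam))  # [ 2.4 -3.2]
```

Divergent adaptive weights $w_t$ at non-break dates correspond to an effectively larger $\lambda$ for those coordinates, which is why those jumps are zeroed out w.p.a.1.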
Proof of Corollary 4.3. The proof is analogous to that of Corollary 3.3 by using Theorems 4.1-4.2
instead.
Proof of Theorem 4.4. Note that $\hat\alpha^p_m(\hat{\mathcal T}_m) = \big(\hat\alpha^p_1(\hat{\mathcal T}_m)', \ldots, \hat\alpha^p_{m+1}(\hat{\mathcal T}_m)'\big)' = \arg\min_{\alpha_m} Q_{2NT}\big(\alpha_m; \hat{\mathcal T}_m\big)$. The first order conditions for this minimization problem are
$$\begin{aligned}
0_{p\times 1} ={}& -\frac{2}{N}\sum_{t=2}^{\hat T_1-1}\sum_{i=1}^N \Delta x_{it} z'_{it} \hat W^p_1 \frac{1}{N}\sum_{t=2}^{\hat T_1-1}\sum_{i=1}^N z_{it}\big(\Delta y_{it} - \hat\alpha^{p\prime}_1 \Delta x_{it}\big) \\
&+ \frac{2}{N}\sum_{i=1}^N x_{i,\hat T_1-1} z'_{i\hat T_1} W_{\hat T_1} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_1}\big(\Delta y_{i\hat T_1} - \hat\alpha^{p\prime}_2 x_{i\hat T_1} + \hat\alpha^{p\prime}_1 x_{i,\hat T_1-1}\big),
\end{aligned}$$
$$\begin{aligned}
0_{p\times 1} ={}& -\frac{2}{N}\sum_{t=\hat T_{j-1}+1}^{\hat T_j-1}\sum_{i=1}^N \Delta x_{it} z'_{it} \hat W^p_j \frac{1}{N}\sum_{t=\hat T_{j-1}+1}^{\hat T_j-1}\sum_{i=1}^N z_{it}\big(\Delta y_{it} - \hat\alpha^{p\prime}_j \Delta x_{it}\big) \\
&+ \frac{2}{N}\sum_{i=1}^N x_{i,\hat T_j-1} z'_{i\hat T_j} W_{\hat T_j} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_j}\big(\Delta y_{i\hat T_j} - \hat\alpha^{p\prime}_{j+1} x_{i\hat T_j} + \hat\alpha^{p\prime}_j x_{i,\hat T_j-1}\big) \\
&- \frac{2}{N}\sum_{i=1}^N x_{i\hat T_{j-1}} z'_{i\hat T_{j-1}} W_{\hat T_{j-1}} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_{j-1}}\big(\Delta y_{i\hat T_{j-1}} - \hat\alpha^{p\prime}_j x_{i\hat T_{j-1}} + \hat\alpha^{p\prime}_{j-1} x_{i,\hat T_{j-1}-1}\big) \quad \text{for } j = 2, \ldots, m,
\end{aligned}$$
$$\begin{aligned}
0_{p\times 1} ={}& -\frac{2}{N}\sum_{t=\hat T_m+1}^{T}\sum_{i=1}^N \Delta x_{it} z'_{it} \hat W^p_{m+1} \frac{1}{N}\sum_{t=\hat T_m+1}^{T}\sum_{i=1}^N z_{it}\big(\Delta y_{it} - \hat\alpha^{p\prime}_{m+1} \Delta x_{it}\big) \\
&- \frac{2}{N}\sum_{i=1}^N x_{i\hat T_m} z'_{i\hat T_m} W_{\hat T_m} \frac{1}{N}\sum_{i=1}^N z_{i\hat T_m}\big(\Delta y_{i\hat T_m} - \hat\alpha^{p\prime}_{m+1} x_{i\hat T_m} + \hat\alpha^{p\prime}_m x_{i,\hat T_m-1}\big),
\end{aligned}$$
where we suppress the dependence of the $\hat\alpha^p_j$'s on $\hat{\mathcal T}_m$.
Let $\hat\Phi_{ab,l} = \frac{1}{N}\sum_{t=\hat T_{l-1}}^{\hat T_l-1}\sum_{i=1}^N a_{it} b'_{it}$ for $l = 1, \ldots, m+1$, and $a, b = \Delta x$, $x$, or $\Delta y$. Let $\hat\phi^{\dagger}_{ab,l+1} = \hat\phi'_{ab,\hat T_l} W_{\hat T_l} \hat\phi_{ab,\hat T_l,\hat T_l-1}$ for $l = 1, \ldots, m$. One can readily solve for $\hat\alpha^p_m$ to obtain $\hat\alpha^p_m = \hat\Upsilon^{-1}_{NT} \hat\Xi^y_{NT}$, where
$$\hat\Upsilon_{NT} = \mathrm{TriD}\big(\hat\Upsilon^{\dagger}, \hat\Upsilon\big)_{m+1}, \quad \hat\Xi^y_{NT} = \big(\hat\Xi'_{y,1}, \hat\Xi'_{y,2}, \ldots, \hat\Xi'_{y,m+1}\big)', \tag{B.3}$$
$\hat\Upsilon_1 = \hat\Phi'_{z\Delta x,1} \hat W^p_1 \hat\Phi_{z\Delta x,1} + \hat\phi'_{zx,\hat T_1,\hat T_1-1} W_{\hat T_1} \hat\phi_{zx,\hat T_1,\hat T_1-1}$, $\hat\Upsilon_l = \hat\Phi'_{z\Delta x,l} \hat W^p_l \hat\Phi_{z\Delta x,l} + \hat\phi'_{zx,\hat T_l,\hat T_l-1} W_{\hat T_l} \hat\phi_{zx,\hat T_l,\hat T_l-1} + \hat\phi'_{zx,\hat T_{l-1}} W_{\hat T_{l-1}} \hat\phi_{zx,\hat T_{l-1}}$ for $l = 2, \ldots, m$, $\hat\Upsilon_{m+1} = \hat\Phi'_{z\Delta x,m+1} \hat W^p_{m+1} \hat\Phi_{z\Delta x,m+1} + \hat\phi'_{zx,\hat T_m} W_{\hat T_m} \hat\phi_{zx,\hat T_m}$, and $\hat\Upsilon^{\dagger}_l = \hat\phi^{\dagger}_{xx,l}$ for $l = 2, \ldots, m+1$. In addition, $\hat\Xi_{y,1} = \hat\Phi'_{z\Delta x,1} \hat W^p_1 \hat\Phi_{z\Delta y,1} - \hat\phi'_{zx,\hat T_1,\hat T_1-1} W_{\hat T_1} \hat\phi_{z\Delta y,\hat T_1}$, $\hat\Xi_{y,l} = \hat\Phi'_{z\Delta x,l} \hat W^p_l \hat\Phi_{z\Delta y,l} - \hat\phi'_{zx,\hat T_l,\hat T_l-1} W_{\hat T_l} \hat\phi_{z\Delta y,\hat T_l} + \hat\phi'_{zx,\hat T_{l-1}} W_{\hat T_{l-1}} \hat\phi_{z\Delta y,\hat T_{l-1}}$ for $l = 2, \ldots, m$, and $\hat\Xi_{y,m+1} = \hat\Phi'_{z\Delta x,m+1} \hat W^p_{m+1} \hat\Phi_{z\Delta y,m+1} + \hat\phi'_{zx,\hat T_m} W_{\hat T_m} \hat\phi_{z\Delta y,\hat T_m}$.
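The closed form $\hat\alpha^p_m = \hat\Upsilon^{-1}_{NT}\hat\Xi^y_{NT}$ only requires assembling and inverting a block tridiagonal matrix. A hedged sketch of the $\mathrm{TriD}(\cdot,\cdot)$ construction (hypothetical helper `tri_d`; the symmetric case is shown for illustration, with arbitrary toy blocks in place of the $\hat\Upsilon_l$ and $\hat\phi^{\dagger}$ moments):

```python
import numpy as np

def tri_d(off_blocks, diag_blocks):
    """Assemble a symmetric block tridiagonal matrix TriD(off, diag).

    diag_blocks: list of m+1 (p x p) diagonal blocks (e.g. Upsilon_1..Upsilon_{m+1}).
    off_blocks:  list of m (p x p) sub-diagonal blocks (e.g. Upsilon_dagger_2..);
                 the super-diagonal is filled with their transposes.
    """
    p = diag_blocks[0].shape[0]
    n = len(diag_blocks)
    A = np.zeros((n * p, n * p))
    for l, D in enumerate(diag_blocks):
        A[l*p:(l+1)*p, l*p:(l+1)*p] = D
    for l, O in enumerate(off_blocks):            # block row l+1, block column l
        A[(l+1)*p:(l+2)*p, l*p:(l+1)*p] = O
        A[l*p:(l+1)*p, (l+1)*p:(l+2)*p] = O.T     # symmetric counterpart
    return A

rng = np.random.default_rng(0)
p, m_plus_1 = 2, 3
diags = [np.eye(p) * 4.0 for _ in range(m_plus_1)]              # dominant diagonal blocks
offs = [rng.standard_normal((p, p)) * 0.1 for _ in range(m_plus_1 - 1)]
Upsilon = tri_d(offs, diags)
Xi = rng.standard_normal(m_plus_1 * p)
alpha_hat = np.linalg.solve(Upsilon, Xi)          # alpha = Upsilon^{-1} Xi
assert np.allclose(Upsilon @ alpha_hat, Xi)
```

In practice one would exploit the banded structure (e.g., a block Thomas algorithm) rather than a dense solve, but the dense version makes the structure of (B.3) transparent.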
By Corollary 4.3, $\hat\alpha^p_m(\hat{\mathcal T}_m) = \hat\alpha^p_{m_0}(\hat{\mathcal T}_{m_0})$ w.p.a.1. Therefore we can study the asymptotic distribution of $\hat\alpha^p_m(\hat{\mathcal T}_m)$ by studying that of $\hat\alpha_{m_0}(\hat{\mathcal T}_{m_0})$. Note that $\hat\alpha_{m_0}(\hat{\mathcal T}_{m_0}) = \hat\Upsilon^{-1}_{NT}\hat\Xi_{NT}$, where $\hat\Upsilon_{NT}$ and $\hat\Xi_{NT}$ are defined in (4.1). It is easy to verify that
$$\sqrt N D_{m_0+1}\big(\hat\alpha^p_{m_0}(\hat{\mathcal T}_{m_0}) - \alpha^0\big) = \big(D^{-1}_{m_0+1}\hat\Upsilon_{NT} D^{-1}_{m_0+1}\big)^{-1}\sqrt N D^{-1}_{m_0+1}\hat\Xi^u_{NT},$$
where $\hat\Xi^u_{NT}$ is defined in (4.1). Then by Assumption B.3(i), $D^{-1}_{m_0+1}\hat\Upsilon_{NT} D^{-1}_{m_0+1} \stackrel{P}{\to} \Upsilon_0 > 0$. By Assumption B.3(ii), $\sqrt N D^{-1}_{m_0+1}\hat\Xi^u_{NT} \stackrel{D}{\to} N(0, \Sigma_0)$. Then by the Slutsky lemma,
$$\sqrt N D_{m_0+1}\big(\hat\alpha^p_{m_0}(\hat{\mathcal T}_{m_0}) - \alpha^0\big) \stackrel{D}{\to} N\big(0, \Upsilon^{-1}_0 \Sigma_0 \Upsilon^{-1}_0\big).$$
This completes the proof of the theorem.
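The limiting covariance has the familiar GMM sandwich form $\Upsilon_0^{-1}\Sigma_0\Upsilon_0^{-1}$, so in practice one would plug in consistent estimates of $\Upsilon_0$ and $\Sigma_0$. A toy sketch (hypothetical helper `sandwich_cov`, not code from the paper):

```python
import numpy as np

def sandwich_cov(Upsilon_hat, Sigma_hat):
    """Plug-in estimate of the sandwich covariance Upsilon^{-1} Sigma Upsilon^{-1}."""
    Upsilon_inv = np.linalg.inv(Upsilon_hat)
    return Upsilon_inv @ Sigma_hat @ Upsilon_inv

# Toy 2x2 example with diagonal matrices so the result is easy to verify by hand.
U = np.array([[2.0, 0.0], [0.0, 4.0]])
S = np.eye(2)
V = sandwich_cov(U, S)
print(V)  # [[0.25 0.  ] [0.   0.0625]]
```

When $\hat\Sigma = \hat\Upsilon$ (e.g., under an efficient weighting), the sandwich collapses to $\hat\Upsilon^{-1}$, the usual simplification.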
Proof of Theorem 4.5. The proof is analogous to that of Theorem 3.5 and thus omitted.
REFERENCES
Andrews, D. W. K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821-856.
Andrews, D. W. K., 2003. End-of-sample instability tests. Econometrica 71, 1661-1694.
Angelosante, D., Giannakis, G. B., 2012. Group Lassoing change-points in piecewise-constant AR processes. EURASIP Journal on Advances in Signal Processing 1(70), 1-16.
Bai, J., 1997a. Estimation of a change point in multiple regression models. Review of Economics and Statistics 79, 551-563.
Bai, J., 1997b. Estimating multiple breaks one at a time. Econometric Theory 13, 315-352.
Bai, J., 2010. Common breaks in means and variances for panel data. Journal of Econometrics 157, 78-92.
Bai, J., Lumsdaine, R. L., Stock, J., 1998. Testing and dating common breaks in multivariate time series. Review of Economic Studies 65, 395-432.
Bai, J., Perron, P., 1998. Estimating and testing linear models with multiple structural changes. Econometrica 66, 47-78.
Baltagi, B. H., Feng, Q., Kao, C., 2013. Estimation of heterogeneous panels with structural breaks. Working paper, Syracuse University.
Baltagi, B. H., Kao, C., Liu, L., 2012. Estimation and identification of change points in panel models with nonstationary or stationary regressors and error terms. Working paper, Syracuse University.
Belloni, A., Chernozhukov, V., Hansen, C., 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80, 2369-2429.
Ben-David, D., Papell, D. H., 1995. The great wars, the great crash, and steady state growth: some new evidence about an old stylized fact. Journal of Monetary Economics 36, 453-475.
Bertsekas, D., 1995. Nonlinear Programming. Athena Scientific, Belmont, MA.
Breitung, J., Eickmeier, S., 2011. Testing for structural breaks in dynamic factor models. Journal of Econometrics 163, 71-84.
Caner, M., 2009. Lasso-type GMM estimator. Econometric Theory 25, 270-290.
Caner, M., Han, X., 2013. Selecting the correct number of factors in approximate factor models: the large panel case with Bridge estimators. Working paper, North Carolina State University.
Caner, M., Knight, K., 2013. An alternative to unit root tests: Bridge estimators differentiate between nonstationary versus stationary models and select optimal lag. Journal of Statistical Planning and Inference 143, 691-715.
Chan, F., Mancini-Griffoli, T., Pauwels, L. L., 2008. Stability tests for heterogeneous panel. Working paper, Curtin University of Technology.
Cheng, X., Liao, Z., 2013. Select the valid and relevant moments: an information-based LASSO for GMM with many moments. Working paper, University of Pennsylvania.
Cheng, X., Liao, Z., Schorfheide, F., 2014. Shrinkage estimation of high-dimensional factor models with structural instabilities. NBER Working Paper Series 19792, National Bureau of Economic Research.
De Wachter, S., Tzavalis, E., 2005. Monte Carlo comparison of model and moment selection and classical inference approaches to break detection in panel data models. Economics Letters 99, 91-96.
De Wachter, S., Tzavalis, E., 2012. Detection of structural breaks in linear dynamic panel data models. Computational Statistics and Data Analysis 56, 3020-3034.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.
Fan, J., Liao, Y., 2011. Ultra high dimensional variable selection with endogenous covariates. Working paper, Princeton University.
García, P. E., 2011. Instrumental variable estimation and selection with many weak and irrelevant instruments. Working paper, University of Wisconsin, Madison.
Harville, D. A., 1997. Matrix Algebra from a Statistician’s Perspective. Springer, New York.
Hsu, C-C., Lin, C-C., 2012. Change-point estimation for nonstationary panel. Working paper, National Central University.
Islam, N., 1995. Growth empirics: a panel data approach. The Quarterly Journal of Economics 110(4), 1127-1170.
Jones, C. I., 2002. Sources of U.S. economic growth in a world of ideas. American Economic Review 92, 220-239.
Kim, D., 2011. Estimating a common deterministic time trend break in large panels with cross sectional dependence. Journal of Econometrics 164, 310-330.
Kim, D., 2012. Common breaks in time trends for large panel data with a factor structure. Working paper, University of Virginia.
Knight, K., Fu, W., 2000. Asymptotics for Lasso-type estimators. Annals of Statistics 28, 1356-1378.
Kock, A. B., 2013. Oracle efficient variable selection in random and fixed effects panel data models. Econometric Theory 29, 115-152.
Kurozumi, E., 2012. Testing for multiple structural changes with non-homogeneous regressors. Working paper, Hitotsubashi University.
Liao, W., Wang, P., 2012. Structural breaks in panel data models: a common distribution approach. Working paper, HKUST.
Liao, Z., 2013. Adaptive GMM shrinkage estimation with consistent moment selection. Econometric Theory 29, 857-904.
Liao, Z., Phillips, P. C. B., 2014. Automated estimation of vector error correction models. Econometric Theory, forthcoming.
Lu, X., Su, L., 2013. Shrinkage estimation of dynamic panel data models with interactive fixed effects. Working paper, HKUST.
Meurant, G., 1992. A review on the inverse of symmetric tridiagonal and block tridiagonal matrices. SIAM Journal on Matrix Analysis and Applications 13, 707-728.
Molinari, L. G., 2008. Determinant of block tridiagonal matrices. Linear Algebra and Its Applications 429, 2221-2226.
Pesaran, M. H., 2006. Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74, 967-1012.
Pesaran, M. H., Yamagata, T., 2008. Testing slope homogeneity in large panels. Journal of Econometrics 142, 50-93.
Qian, J., Su, L., 2013. Shrinkage estimation of regression models with multiple structural changes. Working paper, Singapore Management University.
Qu, Z., Perron, P., 2007. Estimating and testing structural changes in multiple regressions. Econometrica 75, 459-502.
Ran, R-S., Huang, T-Z., 2006. The inverses of block tridiagonal matrices. Applied Mathematics and Computation 179, 243-247.
Rinaldo, A., 2009. Properties and refinement of the fused Lasso. Annals of Statistics 37, 2922-2952.
Romer, P. M., 1986. Increasing returns and long-run growth. Journal of Political Economy 94, 1002-1037.
Su, L., Chen, Q., 2013. Testing homogeneity in panel data models with interactive fixed effects. Econometric Theory 29, 1079-1135.
Su, L., Shi, Z., Phillips, P. C. B., 2013. Identifying latent structures in panel. Working paper, Dept. of Economics, Yale University.
Su, L., White, H., 2010. Testing structural change in partially linear models. Econometric Theory 26, 1761-1806.
Tibshirani, R. J., 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K., 2005. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society, Series B 67, 91-108.
Yamazaki, D., Kurozumi, E., 2013. Testing for parameter constancy in the time series direction in fixed-effect panel data models. Working paper, Department of Economics, Hitotsubashi University.
Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68, 49-67.
Zou, H., 2006. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.