36 doi:10.1017/S0266466619000069 ESTIMATION …...Gaussian pseudo maximum likelihood estimation of unknown parameters, mo-tivated by its asymptotic efﬁciency when Gaussianity holds

Econometric Theory, 36, 2020, 185–222.doi:10.1017/S0266466619000069

ESTIMATION FOR DYNAMIC PANELDATA WITH INDIVIDUAL EFFECTS

PETER M. ROBINSONLondon School of Economics

CARLOS VELASCOUniversidad Carlos III de Madrid

The article discusses statistical inference in parametric models for panel data. Themodels feature dynamics of a general nature, individual effects, and possible ex-planatory variables. The focus is on large-cross-section inference on Gaussianpseudo maximum likelihood estimates with temporal dimension kept fixed, partiallycomplementing and extending recent work of the authors. We focus on a particularkind of initial condition but go on to discuss implications of alternative initial con-ditions. Some possible further developments are briefly reviewed.

1. INTRODUCTION

The proliferation of econometric panel data sets has prompted considerable devel-opment in the modeling of such data and consequent methods of point estimationand statistical inference, with associated theoretical justification. The literaturegoes back a long way and includes also much work by statisticians under theheading of longitudinal data. A fundamental monograph is Hsiao (2014).

In general, we describe a panel data set as a rectangular array of scalars yit ,i = 1, . . . ,N, t = 0,1, . . . ,T, so we have observations on N cross-sectional unitsat T + 1 consecutive, equally spaced points of time, along with perhaps observ-able explanatory variables. Except for the very simplest of models, such as linearregression under classical conditions, only asymptotic justification of statisticalmodels is feasible. Mirroring the longitudinal character of many data sets, whereT can be very small, even T = 2, it is often most reasonable to develop asymptotictheory with N → ∞ but T kept fixed. However, theory that requires N to divergeis problematic when, for example, the model for yit incorporates an additive, un-observed, individual effect parameter or random variable ζi . Thus, the number ofunknowns increases with N, indeed there are precisely N such ζi . This is the ‘in-cidental parameters’ problem pointed out by Neyman and Scott (1948), and sev-eral approaches have been suggested for dealing with it, and with more generalversions of the problem. Typically, these involve some procedure for eliminating,or approximately eliminating, the ζi (which are commonly regarded as nuisance

We are grateful to Peter Phillips and three referees for constructive comments that have improved the article. Addresscorrespondence to Peter M. Robinson, London School of Economics, London, UK; e-mail: [email protected].

c© Cambridge University Press 2019 185

terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0266466619000069Downloaded from https://www.cambridge.org/core. IP address: 54.39.106.173, on 24 Jun 2020 at 13:57:44, subject to the Cambridge Core

https://www.cambridge.org/core/terms

https://doi.org/10.1017/S0266466619000069

https://www.cambridge.org/core

186 PETER M. ROBINSON AND CARLOS VELASCO

parameters), leaving us to study features of interest in a transformed or modifiedmodel.

These features typically include one or more of the following: explanatory vari-ables (modeled parametrically or nonparametrically), instantaneous temporal ef-fects (such as an unknown additive quantity varying over time), cross-sectionaldependence, and modeling of temporal dependence. Many econometric panel datamodels have described the latter in terms of autoregressive (AR) models or autore-gressive moving average (ARMA) models, to include the possibility of an I (1)unit root, reflecting the preoccupation with unit roots in much of the macroecono-metric time series literature. The literature on AR and ARMA panel data modelsis now very well developed, including much work on unit root testing, modify-ing econometric time series methods (for a recent review see Moon, Perron, andPhillips, 2015). Some of it concerns asymptotics with N → ∞ but T kept fixed,but also there is work, motivated by some data sets, which entails T → ∞ and Nfixed, or with both T → ∞ and N → ∞, including with N increasing at somerate as a function of T or vice versa; the precise form of the model often dictateswhat type of asymptotics is possible or desirable.

Despite the popularity of AR models, there is in principle any number of dy-namic models, even any number which nest I (1) behaviour. One such class whichhas been studied a good deal in the time series literature is that of fractional mod-els. Whereas AR models cover certain I (0) processes (when the AR coefficientlies in the stationary region) as well as certain I (δ) processes for any integer δ,and also explosive processes, fractional processes describe certain I (δ) processesfor real values of δ. These can include negative δ, to describe antipersistence ornoninvertibility, but the main interest has been in moderately positive values ofδ, such as δ ∈ (0,2], where δ ∈ (0,1/2) implies stationarity and δ ≥ 1/2 impliesnonstationarity. The most striking difference between AR models and fractionalmodels with respect to statistical inference is as follows: whereas in an AR settingthe limit distribution as T → ∞ of statistics such as LS estimates of the AR co-efficient α are asymptotically normally distributed with T 1/2 norming rate when|α| < 1, and have a nonstandard limit distribution with rate T when α = 1 (anddifferent behaviours again when α= −1 and |α|> 1), in fractional models, on theother hand, there exist estimates (which have an approximate Gaussian maximumlikelihood interpretation and corresponding efficiency properties) of the memoryparameter δ, which are asymptotically normal with norming rate T 1/2 whateverthe value of δ (see Hualde and Robinson, 2011a). This latter property is due toan essential ‘smoothness’ of the fractional model. Thus, whereas testing α = 1against α �= 1 typically involves a nonstandard approximate distribution, testingδ = 1 against δ �= 1 involves a standard approximate distribution.

Motivated by this time series experience, Robinson and Velasco (2015, 2017)have developed asymptotic statistical inference on certain panel data models withfractional dynamics. All their asymptotics is based on T diverging, with N al-lowed to be either fixed or increasing with T . The requirement that T → ∞ inthe simple model of Robinson and Velasco (2015) (hereafter RV) is due to the use



https://doi.org/10.1017/S0266466619000069


ESTIMATION FOR DYNAMIC PANEL DATA 187

of an approximation to the Gaussian pseudo likelihood and the need for the ef-fect of an initial condition to be asymptotically negligible. The work of Robinsonand Velasco (2017) in a more general model needs T → ∞ not only for that samereason but also because of the nonparametric modeling of (possibly time-varying)individual effects and the nonparametric estimation of the cross-sectional covari-ance matrix. Fractional modeling of panel data has also been studied by Hassler,Demetrescu, and Tarcolea (2011). Fractional models might also be used in placeof AR and the ARMA ones used by, e.g., Ejrnaes and Browning (2014), for in-come dynamics.

However, as with AR-based modeling, it is possible to develop theory that isbased on N diverging with T fixed, as in the classical longitudinal data setting.This seems particularly feasible when there is no cross-sectional dependence,whence we can appeal essentially to a central limit theorem for a weighted sum ofN independent random variables, rather than (in the T → ∞ theory) to a centrallimit theorem for a sum of dependent or approximately whitened time series ob-servations. It might be argued that in the time series ‘long memory’ literature, inwhich fractional models have been extensively used, it is often considered naturalto expect that T be large. However, parametric fractional models generate formu-lae for point estimates and other statistics for any T, just as AR models do, and itseems equally legitimate to study them in the fixed T case. Using a similar esti-mation approach to ours for a model with AR dynamics, Han and Phillips (2013)found peculiarities in the objective function, with implications for asymptotic the-ory with T diverging, that are removed in asymptotic theory with N diverging.

The distinctive properties of AR and fractional time series models alluded toabove all arise in the asymptotic T → ∞ regime, and when T is kept fixed in apanel data setting they are no longer relevant. Thus, in the N → ∞, fixed T caseessentially similar asymptotic properties would arise from any number of param-eterizations of time series dynamics. We consider a modeling strategy which isgeneral in this sense, extending work for particular parameterizations in the fixedT case, and complementing some of the work which depends on a diverging T . Inparticular, the article partly complements RV, like them incorporating individualeffects but also including explanatory variables as well as generalizing the dy-namics and relying on diverging N rather than T ; the explanatory variables couldinclude period-fixed effects accounting for any potential trend or level change. Asusual, on the one hand, parsimony in modeling is desirable, on the other, mis-specification is to be avoided, and our asymptotic theory can be applied in modeltesting (e.g., by Wald, likelihood-ratio, and Lagrange multiplier type tests) as wellas in interval estimation.

One additional issue which we explore is the generalization of initial condi-tions. A stationary time series process, such as a linear process, is typically mod-eled over all time points in Z ={0,±1, . . .}. But allowance for nonstationary dy-namics of a process requires initial conditions (on the process prior to the obser-vation period) to ensure the variance remains finite at any time point t, even ifit diverges as t → ∞. For an AR(1) process (with a possible unit root in mind),



https://doi.org/10.1017/S0266466619000069



it suffices to impose an initial condition on the process at a single t, e.g., that itis zero at t = −1. But for fractional processes, possible nonstationarity requiresin general a condition on an infinite past, e.g., that the process takes zero valuesfor t < −1. (The ‘e.g.’ is important here, because the condition might equallybe imposed only for t < −m for any integer m ≥ 1.) But the choice of m ispart of the model specification in that an incorrect choice can lead to inconsis-tent estimation of parameters of interest, particularly when T is kept fixed in theasymptotics.

The following section describes our dynamic model with fixed effects and pos-sible regressors (under the simplest, and most usual, kind of initial conditions thatcover fractional models, for example). Section 3 employs one of the standard ap-proaches to eliminating individual effects, first differencing. Section 4 developsGaussian pseudo maximum likelihood estimation of unknown parameters, mo-tivated by its asymptotic efficiency when Gaussianity holds and its retention ofconsistency and asymptotic normality, with the same norming, under more gen-eral conditions. The conditions, and strong consistency and asymptotic normal-ity properties, are described in Section 5 (with proofs left to the final Sections9 and 10, the latter focussing on the evaluation of the asymptotic variance ma-trix under Gaussianity). Section 6 contains a Monte Carlo study of finite sampleperformance, while Section 7 discusses modifications to the methodology underalternative initial conditions. Section 8 summarizes and briefly lists possible ex-tensions to modeling and inference.

2. DYNAMIC PANEL MODEL WITH INDIVIDUAL EFFECTSAND REGRESSORS

The observable scalar array {yit } and q ×1 vector array {xit } are supposed to berelated by the model

λt (L; θ0)(yit − ζi − x ′

itβ0) = εit , (1)

εit = 0, t < 0, (2)

xit = 0, t < 0, (3)

for i = 1, . . . ,N, t = 0,1, . . . ,T ≥ 2, with the prime denoting transposition.The ingredients of (1) are described as follows. For each t , the εit are indepen-

dent and identically distributed (iid) across i = 1, . . . ,N. For each i , the εit areuncorrelated across t = 0,1, . . . ,T, with mean zero and unknown, finite, positivevariance σ 2

0 . The ζi are unobserved fixed effects. The xit consists of explanatoryvariables. The p ×1 vector θ0 and q ×1 vector β0 have unknown elements whoseestimation is of interest, though in an important special case β0 = 0 a priori, sothe model contains no explanatory variables. Denoting by L the lag operator, andθ any admissible value of θ0, we define the operator



https://doi.org/10.1017/S0266466619000069



λt (L; θ)=t∑

j=0

λj (θ) L j ,

where the λj (θ) are known functions of θ with λ0 (θ) = 1 for all θ. The trun-cation means that the yit need not be defined for t < 0. For fixed T , the λj (θ)can be chosen quite arbitrarily, but leading choices are associated with regardingλt (L; θ) as truncating the expansion

λ(L; θ)=∞∑

j=0

λj (θ) L j , (4)

where λ(L; θ) is one of the AR operators originally arising in the stationary timeseries literature, though here no stationarity assumptions are imposed on parame-ters, especially as T remains fixed in our theory.

For example, taking θj to be the j th element of θ :(i) The autoregressive moving average or autoregressive integrated moving av-

erage operator

λ(L; θ)=(

1 −p1∑

j=1θj L j

)(1 +

p∑j=p1+1

θj L j−p1

)−1

, (5)

where 0 ≤ p1 ≤ p, with the understanding that the first and second sums are,respectively, void when p1 = 0 (the pure MA case) and p1 = p (the pure AR case).The dynamic panel data literature has heavily stressed the AR(1) case p1 = p = 1,in which there has been great interest in testing the unit root hypothesis θ01 = 1,taking θ0 j to be the j th element of θ0.(i i) The fractional operator (see Adenstedt, 1974)

λ(L; θ)=θ, (6)

with = 1 − L, so p = 1. The operatorδ has the expansion

δ =∞∑

j=0

πj (δ) L j , πj (δ)= �( j − δ)�(−δ)�( j + 1)

,

for noninteger δ > 0, while for integer δ = 0,1, . . . ,πj (δ)= 1( j = 0,1, . . . ,δ)(−1) j δ (δ− 1) · · · (δ− j + 1)/j !, taking 0/0 = 1. Thiscase has been studied by RV, as has the hybrid model(i i i)

λ(L; θ)=θ1

(1 −

p1+1∑j=2

θj L j−1

)(1 +

p∑j=p1+2

θj L j−1−p1

)−1

, (7)



https://doi.org/10.1017/S0266466619000069



where the first and second sums are, respectively, void when p1 = 0 and p1 = p −1. This is known as a fractional ARIMA model (FARIMA(p1,θ1, p − p1 − 1)) so(6) is FARIMA(0,θ,0) .

Condition (2) is an initial condition, which ensures that yit − ζi − x ′itβ0 has

bounded variance even if λt (L; θ0)−1 εit has infinite variance when (2) does not

hold but instead the conditions on εit in the second paragraph of the currentsection hold for all t = 0,±1,±2, . . . , as is the case for ‘nonstationary’ filters

λ(L; θ), for example in (5) when at least one zero of 1 −p1∑

j=1θj z j falls on or in

the unit circle on the complex plane or in (6) or (7) when θ ≥ 1/2 or θ1 ≥ 1/2,

respectively (for the latter model even when all zeroes of 1−p1+1∑j=2

θj z j−1 fall out-

side the unit circle). An initial condition such as (2) would not be needed werewe to assume λ(L; θ0) is a ‘stationary’ filter, but the dynamic panel literature haspaid a good deal of attention to the possibility of nonstationarity, in particular anAR unit root. Condition (3) is a similar initial condition on xit .

We single out two important restrictions that are implied by our assumptions.One is that each cross-sectional unit has the same dynamics. The other is thatconditional on the ζi and xit , yit is cross-sectionally independent. In this connec-tion, our asymptotic theory entails N diverging while T is kept fixed, so the ζi

cannot be consistently estimated and their presence is an obstacle to consistentestimation of θ0 and β0, indicating an incidental parameters problem which thefollowing section commences by eliminating.

3. DIFFERENCED MODEL

Of possible approaches to eliminate the ζi , we employ the popular one of firsttemporal differencing. Given (1) and (2), and defining

vit = λ−1t (L; θ0)εit , t = 0, . . . ,T, i = 1, . . . ,N, (8)

we solve and then take first differences,

yit −x ′itβ0 =vit , t = 1, . . . ,T, i = 1, . . . ,N. (9)

In general, the vit are not white noise, so (as in RV in the fractional case) weattempt a full whitening in order to estimate the parameters. For any θ, β, define

zit (θ,β)= τt−1 (L; θ)(yit −x ′itβ

), t = 1, . . . ,T, i = 1, . . . ,N, (10)

where

τt (L; θ)= 1 +t∑

j=1τj (θ) L j ,

for

τj (θ)= λj (1; θ)=j∑

i=0λi (θ) ,



https://doi.org/10.1017/S0266466619000069



so that τt (L; θ) truncates the expansion

τ (L; θ)=∞∑j=0τj (θ) L j

of

τ (L; θ)= λ(L; θ)/.Then (cf. RV), for t ≥ 1,

zit (θ,β)= τt−1 (L ; θ)x ′it (β0 −β)+ τt−1 (L ; θ)vit

= τt (L ; θ)x ′it (β0 −β)+ (

τt−1 (L ; θ)− τt (L ; θ))x ′it (β0 −β)

+τt (L ; θ)vit + (τt−1 (L ; θ)− τt (L ; θ))vit

= λt (L ; θ) x ′it (β0 −β)− τt (θ)x ′

i0 (β0 −β)+λt (L ; θ)vit − τt (θ)εi0

= λt (L ; θ){vit − x ′it (β−β0)

}− τt (θ){εi0 − x ′

i0 (β−β0)}, (11)

with the last step imposing (2) and (3). Thus,

zit (θ,β0)= λt (L; θ)vit − τt (θ)εi0 (12)

and

zit (θ0,β0)= εit − τt (θ0)εi0. (13)

As (13) indicates, the zit (θ0,β0) are not white noise across t . As noted in RV,τt (θ) = O

(t−θ1

)as t → ∞, in the case of models (ii) and (iii) of the preceding

section (where, for example, τ (L; θ) = θ1−1 in (ii)). Thus in these cases thiscorrupting term is negligible under stationary and nonstationary long memory,θ1 > 0, as t → ∞, with faster the decay greater the memory, though on theother hand for negative dependence, θ1 < 0 (not discussed by RV), the corruptingterm dominates as t → ∞, while for θ1 = 0, it is generally of exact order O(1).The latter is also the case for stationary versions of model (i). In general, with Tfixed there is a nonnegligible source of bias incurred by employing methods thatassume the zit (θ0,β0) are uncorrelated across t .

If explanatory variables xit are present, then in view of (10) we adopt the con-vention that in the original model (1) no element of xit is constant across t, sothat any intercept is incorporated in ζi . It is also important to acknowledge from(9) that in the presence of xit , under the conditions we will impose to justifythe more elaborate estimates proposed in the following section, β0 can unsur-prisingly instead be consistently and asymptotically normally estimated by, forexample, ordinary least squares regression of the yit on the xit . Moreover,the norming factor in the central limit theorem under the same N → ∞, fixed Tregime is N1/2, like our estimates. In view of the autocorrelation in the errors vit

which the general dynamics in (1) anticipate, ordinary least squares will gener-ally be inefficient, but generalised least squares, employing an estimated T × T



https://doi.org/10.1017/S0266466619000069



error covariance matrix estimated simply from sums of squares and products ofN least squares residuals without recourse to the parametric dynamics imposed in(1), will be asymptotically more efficient. Moreover, under conditions which im-ply that the limiting covariance matrix of the estimates proposed in the followingsection is block diagonal with respect to θ0 and β0, these estimates of β0 will beasymptotically no more efficient than the computationally far simpler generalizedleast squares. However, the estimates of the following section seem worthwhilebecause the model (like RV’s) may not include any xit ; because estimation of pa-rameters θ0 can help in inference on dynamics, such as possible nonstationarityand might be used in forecasting; because with correct specification of dynamicsin (1) with p small relative to T our method may have better finite sample proper-ties than the generalized least squares approach described above; and because (1)allows a comparison between our asymptotics and those of RV.

4. GAUSSIAN PSEUDO MAXIMUM LIKELIHOOD ESTIMATION

Gaussian pseudo maximum likelihood estimation is widely used in statistics andeconometrics due to its asymptotic efficiency under Gaussianity and its consis-tency robustness under much broader conditions. It has been used in panel datamodels by, e.g., Hsiao, Pesaran, and Tahmiscioglu (2002). Given (10) define fori = 1, . . . ,N the T ×1 vectors

zi (θ,β)= (zi1 (θ,β), . . . ,ziT (θ,β))′

=ϒ (L; θ)(yi −xiβ),

with the T ×1 vector, T ×q matrix and T × T diagonal matrix operator

yi = (yi1, . . . ,yiT )′ ,

xi = (xi1, . . . ,xiT )′ ,

ϒ (L; θ)= diag (τ0 (L; θ), . . . ,τT −1 (L; θ)) .From (13), zi (θ0,β0) has zero mean vector and covariance matrix σ 2

0�(θ0) ,where

�(θ)= IT + τ (θ)τ ′ (θ) , (14)

introducing the T ×1 vector

τ (θ)= (τ1 (θ) , . . . ,τT (θ))′

and with IT the T × T identity matrix. Thus, define (cf. RV)

σ 2 (θ,β)= 1

NT

N∑i=1

z′i (θ,β)�(θ)

−1 zi (θ,β) (15)

and

L (θ,β)= |�(θ) |1/T σ 2 (θ,β) . (16)



https://doi.org/10.1017/S0266466619000069



The Gaussian pseudo maximum likelihood estimate (PMLE) is(θ, β

) = arg minθ∈�,β L (θ,β) , (17)

where � is a compact subset of Rp.For computations note from RV that

�(θ)−1 = IT − τ (θ)τ ′ (θ)|�(θ)| , (18)

|�(θ) | = 1 + τ ′ (θ)τ (θ), (19)

while of course we can concentrate out β, defining

β (θ)=(

N∑i=1(ϒ (L; θ)xi )

′�(θ)−1ϒ (L; θ)xi

)−1 N∑i=1(ϒ (L; θ)xi )

′�(θ)−1ϒ (L; θ)yi ,

so that

θ = argminθ∈� L

(θ, β (θ)

), β = β

(θ).

5. ASYMPTOTIC STATISTICAL PROPERTIES

To establish strong consistency of(θ, β

), we introduce the following assumptions.

Assumptions A. (i) The {εit ,xit ,1 ≤ i ≤ N, 1 ≤ t ≤ T } are iid across i.

(ii) E (ε1t | x1s,1 ≤ s ≤ T )= 0, a.s., 0 ≤ t ≤ T .

(iii) E (ε1sε1t | x1r ,1 ≤ r ≤ T )= σ 20 δst , a.s., 0 ≤ s, t ≤ T, for σ 2

0 <∞ and δst

the Kronecker delta.

(iv) xit does not contain an intercept and E ‖x1t‖2 <∞, 0 ≤ t ≤ T .

(v) θ0 ∈�, which is compact.

(vi) For 1 ≤ t ≤ T , the λt (θ) are continuous in θ.

(vii) For at least one t ∈ [1,T ], λt (θ) �= λt (θ0) for all θ ∈�−{θ0} .(viii) The matrix

E((ϒ (L; θ0)x1)

′�(θ0)−1ϒ (L; θ0)x1

)is positive definite.

Assumption (i) ensures that a strong law of large numbers for iid random vari-ables can be used. Assumptions (ii) and (iii) require strong exogeneity of xit . Oneor more elements of xit might be deterministic and entail trending, e.g., linearlyin t (because T is fixed), but xit cannot include incidental tends. The first partof assumption (iv) merely means that any intercept is regarded as incorporated inthe individual effects ζi , which are differenced out. Assumptions (v) and (vi) per-mit uniform convergence arguments and are readily checked, and, for example,�



https://doi.org/10.1017/S0266466619000069



can be chosen to cover the possibility of an AR unit root in model (5). Assump-tions (vii) and (viii) ensure identifiability, with (viii) ruling out multicollinearityin regressors and (vii) being automatically satisfied in case of model (6), and sat-isfied in case of (5) and (7) if p ≤ T and the autoregressive and moving averageoperators have no zeros in common.

THEOREM 1. Let (1), (2) and Assumptions A hold. Then, as N → ∞, almostsurely (a.s.)

θ → θ0, β → β0.

To develop a useful asymptotic normality result, as is standard we use Theorem1 and the mean value theorem (see (A.10) in the proofs in Section 9 below) andconsider the vector of first partial derivatives of L (θ,β) evaluated at θ0,β0. Now(cf. (A.12) of Section 9 below) define, for j = 1, . . . , p,

r1 j i (θ,β)= 1

T|�T (θ)| 1

T

(1

Ttr

(�−1 (θ)� j (θ)

)z′

i (θ,β)�(θ)−1 zi (θ,β)

−z′i (θ,β)�

−1 (θ)� j (θ)�−1 (θ) zi (θ,β)

+2 z j ′i (θ,β)�

−1 (θ) zi (θ,β)). (20)

Here,

� j (θ)= (∂/∂θj

)�(θ)= τ j (θ)τ (θ)′ + τ (θ) τ j (θ)′ (21)

with

τ j (θ)= (∂/∂θj

)τ (θ), (22)

and z j ′i (θ,β) is the j th row of

z′i (θ,β)= (∂/∂θ) z′

i (θ,β),

namely, the transpose of(∂/∂θj

)(ϒ (L; θ) (yi −xiβ))= � j (L; θ)(yi −xiβ),

where

� j (L; θ)= (∂/∂θj

)ϒ (L; θ)

= diag(τ

j0 (L; θ), . . . , τ j

T −1 (L; θ))

with τ j0 (L; θ)≡ 0 and for t ≥ 1

τjt (L; θ)= (

∂/∂θj)τ t (L; θ)=

t∑k=1τ

jk (θ) Lk ,



https://doi.org/10.1017/S0266466619000069



in which the τ jk (θ) are given by τ j (θ)=

(τ

j1 (θ) , . . . , τ

jT (θ)

)′. Next define (cf.

(A.13) of Section 9 below)

r2i (θ,β)= − 2

T|�(θ)| 1

T (ϒ (L; θ)xi)′�(θ)−1 zi (θ,β),

and then,

ri (θ,β)= (r1i (θ,β)

′,r21i (θ,β)′)′ ,

and

C (θ,β)= 1

N

N∑i=1

ri (θ,β)ri (θ,β)′.

Now define, for j,k = 1, . . . , p,

b1 j ki (θ,β)= 1

T|�(θ)| 1

T

(σ 2 (θ,β) tr

(�−1 (θ)�k (θ)�−1 (θ)� j (θ)

)− 1

Tσ 2 (θ,β) tr

(�−1 (θ)� j (θ)

)tr

(�−1 (θ)�k (θ)

)−2zk′

i (θ,β)�−1 (θ)� j (θ)�−1 (θ) zi (θ,β)

−2z j ′i (θ,β)�

−1 (θ)�k (θ)�−1 (θ) zi (θ,β)

+2z j ′i (θ,β)�

−1 (θ) zki (θ,β)

)and the q ×q matrix

B2i (θ)= 2 |�(θ)| 1T (ϒ (L; θ)xi )

′�(θ)−1ϒ (L; θ)xi/T .

Let B1i (θ,β) be the p × p matrix with jkth element b1 j ki (θ,β) and define theblock diagonal matrix

B (θ,β)= 1

N

N∑i=1

(B1i (θ,β) 0

0 B2i(θ)

).

Alternative formulae might be used in place of B1i (θ,β), but ours is a relativelysimple one. For computations note again (19), (18) and, for example,

�−1 (θ)� j (θ)�−1 (θ) = � j (θ)− τ (θ)τ (θ)′

|�(θ)| � j (θ0)−� j (θ0)τ (θ)τ (θ)′

|�(θ)|+ τ (θ)τ (θ)

′

|�(θ)|(τ j (θ)τ (θ)′ + τ (θ) τ j (θ)′

) τ (θ)τ (θ)′|�(θ)|

= 1|�(θ)|

(τ j (θ)τ (θ)′ + τ (θ) τ j (θ)′

)−2

τ j (θ)′ τ (θ)|�(θ)|2 τ (θ)τ (θ)′ ,



https://doi.org/10.1017/S0266466619000069



tr(�−1 (θ0)�

j (θ0))

= tr

((IT − τ (θ)τ (θ)′

|�(θ)|)(τ j (θ)τ (θ)′ + τ (θ) τ j (θ)′

))= 2τ j (θ)′ τ (θ)

(1− τ (θ)′ τ (θ)

|�(θ)|)

= 2τ j (θ)′ τ (θ)

|�(θ)| .

The proof of asymptotic normality is based on the following additionalconditions.

Assumptions B. (i) Eε41t <∞.

(ii) For 1 ≤ t ≤ T the λt (θ) are twice continuously differentiable in θ.

(iii) θ0 is an interior point of �.

(iv) In a neighbourhood of θ0,β0, the matrix E B (θ,β) is nonsingular.

(v) The matrix EC (θ0,β0) is nonsingular.

Condition (i) seems unavoidable. Condition (ii) can be verified by inspection.Condition (iii) is standard. Our other conditions ensure existence of the matricesin (iv) and (v), with (v) being a local identifiability condition, which, along with(iv), may be checkable for specific choices of the λt (θ) .

THEOREM 2. Let (1), (2) and Assumptions A and B hold. Then, asN → ∞,(

N B(θ, β

)C

(θ, β

)−1B(θ, β

))1/2(θ − θ0

β−β0

)→d N(0, Ip+q),

where D1/2 denotes the unique nonnegative definite square root of a positive def-inite matrix D.

Under normality of ε1t , θ , β are asymptotically effi-cient and we may replace the studentizing factor in the

theorem by N1/2 (T/2) σ−1(θ) ∣∣�T

(θ)∣∣− 1

T C(θ, β

)1/2or by

N1/2 (T/2)1/2 σ−1(θ)∣∣�T

(θ)∣∣− 1

2T B(θ, β

)1/2, the additional normaliza-

tion for C and B correcting for L being proportional to a likelihood, instead ofbeing a log-likelihood (see Section 10 below). Furthermore, it would be possibleto use restricted versions replacing the elements not in the p × p and q × qmatrices spanning their main diagonals by zeroes, reflecting the asymptoticindependence of θ and β. Comparison with Theorem 2 of RV, which coversmodels (ii) and (iii) in Section 1 with T → ∞ (but without explanatory variables),illustrates the much greater complexity of the asymptotic variance matrix basedon N → ∞ and fixed T asymptotics compared to T → ∞ asymptotics. Thiscan be better understood by inspecting the formulae for ∂L (θ0,β0)/∂θ in theproof in Section 8, in particular the term (A.15), which is the i th summand



https://doi.org/10.1017/S0266466619000069



in its j th element. It contains terms in ε2it , t = 1, . . . ,T, which can be seen

to make an asymptotically negligible contribution, by a law of large numbers,only as T → ∞. There are also the terms involving ε2

i0 for which this argumentdoes not apply, but collecting them together reveals that they make a negligiblecontribution. The dominating term in (A.15) as T → ∞ is the penultimate one,and this is the basis for the central limit theorem of RV. But though their formulafor asymptotic variance estimation is far simpler than ours, ours is still valid forlarge T (if N also is large) and might be preferred in case T is feared too smallfor RV’s formula to provide a good approximation.

6. FINITE-SAMPLE PERFORMANCE

In this section, we explore by Monte Carlo simulations finite sample performanceespecially for estimation of θ0, including our asymptotic variance estimates de-veloped for finite T in comparison with the asymptotic version of RV obtainedfor increasing T, though we also compare our PMLE of β0 with GLS and OLSestimation.

FIGURE 1. Asymptotic variance of PMLE estimates for the FARIMA(0,δ0,0)model (6), δ0 ∈ [0,1.5] in horizontal axis, each line corresponds to T ∈{3,4,5,7,10,15,20,40,100,1,000}, from top to bottom.



https://doi.org/10.1017/S0266466619000069



TABLE 1. Simulated size t-test N = 100, model (6),FARIMA(0,θ0,0) . Gaussian innovations. Nominal size5%

θ0 T BC B C B (BC B)0 Asymp

0.0 20 3.68 2.32 2.51 2.47 9.0710 4.38 2.78 2.95 2.72 14.53

5 5.48 2.96 3.41 2.56 21.894 5.81 3.46 3.65 3.01 24.853 7.13 4.58 4.26 3.44 29.07

0.6 20 5.65 5.09 5.07 5.07 11.9210 5.91 4.93 5.06 5.06 17.91

5 6.51 4.50 5.08 4.72 28.234 7.34 4.81 5.77 5.64 33.943 6.94 4.69 5.23 5.13 39.75

1.0 20 5.47 4.75 4.82 4.86 6.9110 6.05 4.79 5.20 5.19 9.05

5 5.54 4.37 4.61 4.65 12.564 6.19 4.76 5.18 5.05 15.353 5.50 4.35 4.65 4.67 18.70

1.4 20 5.68 4.63 4.92 4.94 6.3210 6.07 4.94 5.10 5.18 7.89

5 5.54 4.65 4.73 4.83 9.134 6.39 4.77 5.32 5.26 11.273 6.23 4.83 5.15 5.17 12.91

We initially consider models without regressors, thus estimating θ0 only andimplying a focus on the first diagonal block in B and C.We consider the following(alternative) studentization factors for θ in Theorem 2:

1.(

N B(θ)

C(θ)−1

B(θ))1/2

as in Theorem 2, denoted as BC B .

2. N1/2 (T/2) σ−1(θ) ∣∣�T

(θ)∣∣− 1

T C(θ)1/2

, following the discussion afterTheorem 2 for Gaussian series (setting σ 2

0 = 1 wlog), the factor

(T/2)∣∣�T

(θ)∣∣− 1

T correcting for L being only proportional to the likeli-hood, denoted as C.

3. N1/2 (T/2)1/2 σ−1(θ) ∣∣�T

(θ)∣∣− 1

2T B(θ)1/2

, exploiting the proportionalityof B0 (θ0) and C0 (θ0) under Gaussianity, denoted as B.

4.(

N B0(θ)

C0(θ)−1

B0(θ))1/2 = N1/2 (T/2)

∣∣�T(θ)∣∣− 1

T C0(θ)1/2

= N1/2 (T/2)1/2∣∣�T

(θ)∣∣− 1

2T B0(θ)1/2

where C0 (θ0) and B0 (θ0) are theexpectations of C (θ0) and B (θ0) under Gaussianity as evaluated in Ap-pendix B using the same normalizations as in 2. and 3., denoted as (BC B)0 .

5. Asymptotic variance for T → ∞ from RV, denoted as Asymp.



https://doi.org/10.1017/S0266466619000069



TABLE 2. Simulated size t-test N = 200, model (6),FARIMA(0,θ0,0) . Gaussian innovations. Nominal size5%


0.0 20 3.02 2.70 2.54 2.67 9.3510 3.59 2.68 2.69 2.63 14.17

5 4.63 2.91 2.80 2.73 22.294 4.34 2.89 2.76 2.61 24.033 5.56 3.68 3.60 2.85 29.39

0.6 20 5.57 5.01 5.17 5.12 11.8110 5.62 4.78 4.94 4.87 17.41

5 5.43 4.76 4.89 4.96 28.354 5.98 5.01 5.29 5.09 33.513 5.47 4.74 4.73 4.90 40.92

1.0 20 5.44 4.98 5.13 5.13 7.1210 5.31 4.74 4.92 4.92 8.61

5 5.32 4.57 4.71 4.69 12.614 5.44 4.99 4.87 4.85 15.093 5.29 4.81 5.00 5.04 18.43

1.4 20 5.40 4.88 5.10 5.09 6.7010 5.42 4.83 4.91 4.88 7.68

5 5.33 4.76 4.93 4.83 9.594 5.85 5.13 5.05 5.06 11.263 5.65 4.91 5.16 5.17 13.06

We first calculate in the simple fractional model (6), i.e., FARIMA(0,θ0,0) , theasymptotic variance (BC B)0 of the PMLE of the memory parameter θ0 computedfor a grid of values of θ0 ∈ [0,1.5] and T ∈ {3,4,5,7,10,15,20,40,100,1,000}.In Figure 1, we plot T B0 (θ0)

−1 C0 (θ0)B0 (θ0)−1 to make possible comparisons

with the asymptotic variance of√

NT(θ − θ0

)as T → ∞. The (scaled) asymp-

totic variance decreases very fast with T until T = 20, so we focus in ourMonte Carlo simulations on these values, where the asymptotic variance basedon T → ∞ asymptotics severely underestimates the actual variability. Thereis also an important (monotone) dependence of the asymptotic variance on thepersistence of the model for finite T since our estimation method is based onpredifferenced data. Thus, capturing with precision values of θ0 lower that 0.5seems very challenging, while from about θ0 = 1 upwards the asymptotic vari-ance shows an almost flat pattern, at least for T ≥ 5. For θ0 = 1.5 and T = 1,000,the finite T variance (BC B)0 equals 0.6107, which is very close to the asymptoticone (for T increasing), namely 6/π2 = 0.6079, but for T = 3 it equals 1.0059,65% larger. However, for θ0 = 0, the finite T = 3 variance is about 13 times largerthan the asymptotic T → ∞ one.

We next explore the performance of the various studentizations in infer-ence based on Theorem 2 by means of the simulated size of a nominal 5%Wald test under different model designs. We consider two basic cross-section



https://doi.org/10.1017/S0266466619000069



TABLE 3. Simulated size t-test N = 100, model (6),FARIMA(0,θ0,0) . Exponential innovations. Nominalsize 5%.


0.0 20 4.36 2.81 2.84 2.37 9.0410 4.65 3.74 2.88 2.84 14.705 6.09 5.24 3.63 3.36 22.454 7.14 6.92 4.49 4.26 27.063 7.62 10.04 6.30 6.28 32.13

0.6 20 5.77 5.69 5.16 5.20 12.2010 6.85 7.54 6.59 6.64 20.455 6.36 8.22 6.05 6.70 31.714 7.03 9.26 6.92 7.49 37.523 6.85 9.92 6.50 7.50 45.10

1.0 20 6.14 5.07 5.04 5.10 7.0410 6.46 5.05 5.17 5.20 9.115 7.11 4.44 4.82 4.80 12.624 6.88 4.25 4.44 4.47 14.483 7.13 4.45 4.76 4.77 18.39

1.4 20 5.84 4.98 4.88 4.91 6.4510 6.59 5.30 5.44 5.55 8.035 6.88 6.13 5.80 5.94 11.034 6.69 5.83 5.41 5.55 11.993 7.08 6.21 5.98 6.02 14.46

sizes, N = 100,200, and five different values for the time series dimensionT ∈ {3,4,5,10,20}. First, we consider the FARIMA(0,θ10,0) model (6) withp = 1 and θ10 ∈ {0.0,0.6,1.0,1.4} for Gaussian εit and also for highly asym-metric exponential innovations for which the Gaussian standardization (BC B)0should be inappropriate. Then, we consider the FARIMA(1,θ0,0) model (7) withp = 2 for Gaussian εit and the same values of θ0 with θ20 ∈ {−0.5,0.5} .

In Tables 1 and 2, we report the simulated size for the t-test based on our fivepossible standardizations for the Gaussian FARIMA(0,θ0,0)model and N = 100and 200, respectively. Except for the asymptotic T →∞ standardization (which isvery oversized except for the largest value T = 20 and the most persistent models),all provide reasonable approximations to the nominal 5% size for both values ofN . The T → ∞ standardization is particularly problematic in the zero-persistencecase θ0 = 0, likely due to bias in estimation based on initially differenced data andthe higher variance as reported in Figure 1.

Tables 3 and 4 deal with the same design but using (centred) exponential in-novations for which the t-tests based on the first three finite T standardizationswork similarly as for Gaussian innovations. However, the (BC B)0 standardiza-tion provides a marginally worse simulated size for both values of N , while theasymptotic one remains unreliable.



https://doi.org/10.1017/S0266466619000069



TABLE 4. Simulated size t-test N = 200, model (6),FARIMA(0,θ0,0) . Exponential innovations. Nominalsize 5%.


0.0 20 3.06 2.86 2.24 2.51 9.4510 3.61 3.71 2.72 2.77 14.445 4.23 4.79 3.12 3.26 22.364 4.84 7.18 4.06 4.45 27.083 5.86 9.97 5.43 6.01 32.36

0.6 20 5.41 5.93 5.30 5.44 12.3310 5.55 6.63 5.89 5.88 19.495 5.85 8.77 6.66 7.07 32.264 5.60 9.50 6.59 7.16 37.683 5.62 11.16 6.94 7.98 46.23

1.0 20 5.59 5.09 4.97 5.03 7.0510 5.79 4.77 4.79 4.86 8.575 5.99 4.72 4.71 4.74 12.184 6.43 4.82 4.75 4.79 14.673 6.48 4.39 4.59 4.84 18.01

1.4 20 5.80 5.13 5.12 5.12 6.6810 5.24 5.26 4.98 4.89 7.645 6.31 6.26 5.83 5.84 10.774 6.39 6.89 6.13 6.25 12.973 6.22 6.90 6.02 6.10 15.19

Tables 5 and 6 provide the equivalent results for the Wald test corresponding totesting the Gaussian FARIMA(1,θ10,0) model, with tables named “a” for θ20 =0.5, and tables “b” for θ20 = −0.5. Here, the asymptotic standardization is againunable to approximate the actual variability of the estimates, with the exceptionof the (simultaneously) largest N , T , and θ10. The standardizations based on finiteT asymptotic theory have problems for the smallest N, the performance of BC Bbeing inferior to that of C or B, which in turn is similar to that of (BC B)0.

For the PMLE of β0, we consider the same experimental design and specifica-tions as for Tables 1–2, while for each i, xit is generated as a univariate GaussianFARIMA(0,θx ,0) time series with θx = 0.6 (Tables 7a and 7b) and θx = 1.0 (Ta-bles 8a and 8b) with innovations having variance equal to that of εit , the resultsnot being very sensitive to this choice. We do not report parallel results for PMLEof θ0, because they are very similar to those of Tables 1 and 2, but consider twoalternative estimates of β0. These are OLS and GLS estimates based on first dif-ferencesyi andxi , with the GLS estimate using the OLS residuals to estimatethe covariance matrix ofyi −xiβ.We report simulated bias for β0 = 1 for thethree estimates, relative efficiency with respect to the PMLE based on empiricalmean square error, and simulated size of a Wald test for β0 with 5% nominal sizeusing the results of Theorem 2 for the PMLE and direct adaptations for the GLSand OLS estimates.



https://doi.org/10.1017/S0266466619000069



TABLE 5a. Simulated size Wald-test N = 100, model(7), FARIMA(1,θ10,0) , θ20 = 0.5. Gaussian innovations.Nominal size 5%


0.5 20 14.55 8.18 8.89 8.26 31.8910 12.60 7.93 7.21 7.90 34.81

5 14.16 7.69 6.82 7.42 41.454 17.46 8.46 8.13 8.18 44.603 21.10 9.90 9.95 9.97 49.57

0.6 20 11.68 10.12 10.62 10.56 23.5410 14.94 12.31 13.25 13.10 34.56

5 14.07 11.63 12.35 12.29 39.084 15.07 12.56 13.40 13.49 42.543 15.59 12.75 14.27 14.43 43.54

1.0 20 10.80 9.48 9.56 9.63 18.6410 14.71 12.39 12.94 13.04 29.15

5 13.37 11.02 11.55 11.75 33.684 14.23 11.75 12.48 12.43 36.983 15.11 12.34 13.45 13.21 38.83

1.4 20 10.54 9.27 9.55 9.59 18.4910 15.13 12.66 13.23 13.20 28.95

5 13.19 10.90 11.44 11.56 33.224 13.54 11.51 12.10 12.13 35.913 14.06 11.67 12.45 12.41 37.66

In terms of bias, the three estimates perform similarly, both in terms of mag-nitude and sign of the bias. The sign can shift with the values of T and θ0 andthe magnitude mainly falls with increasing T and N. The relative efficiency ofGLS is typically higher than 97% for the smallest values of T, but deterioratesfor T = 20 and N = 100, possibly due to the lack of precision in the estimationof a moderately large covariance matrix of residuals, the results improving sub-stantially for N = 200. OLS estimates can be very inefficient for θ0 = 0.0 due tofirst differencing introducing strong (negative) correlation in the regression errorterms with efficiency deteriorating as T increases from 0.68 (resp. 0.61) for T = 3to 0.36 (resp. 0.13) for T = 20 and θx = 0.6 (resp. 1.0). However, for other val-ues of θ0, the relative performance of OLS is more stable across T . For θ0 = 1,OLS estimation performs similarly to PMLE for both θx = 0.6 and 1.0, as inthis case first differencing is prewhitening exactly the errors. For the other valuesof θ0, the relative performance of OLS estimation worsens for the larger valueof θx .

Simulated size is very good for the Wald tests based on the PMLE and OLS es-timate, improving with N, but not being monotone with T . The GLS-based test isseriously oversized for the largest values of T (and small N), related with the effi-ciency problems associated with the dimension of the residual sample covariancematrix.



https://doi.org/10.1017/S0266466619000069



TABLE 5b. Simulated size Wald-test N = 100,model (7),FARIMA(1,θ10,0) , θ20 = −0.5. Gaussian innovations.Nominal size 5%


0.0 20 6.39 3.16 3.84 3.21 10.6010 8.49 3.79 5.05 3.79 17.715 12.27 4.20 5.98 3.86 28.344 14.05 5.12 6.29 4.38 32.023 16.73 6.82 7.84 5.61 37.52

0.6 20 6.62 5.10 5.24 5.19 13.3410 7.38 5.23 5.60 5.44 22.895 11.00 4.81 6.46 5.09 42.544 14.24 5.39 7.62 5.38 50.573 21.55 6.41 10.50 6.37 64.83

1.0 20 6.32 5.02 5.12 5.12 7.9710 6.70 5.20 5.26 5.34 11.335 6.08 4.44 4.74 4.73 20.174 5.88 4.08 4.23 4.20 27.103 7.14 4.39 5.02 4.68 41.06

1.4 20 6.32 5.03 5.16 5.17 7.0910 6.52 5.14 5.15 5.16 9.395 6.44 4.55 4.89 5.06 14.444 6.54 4.62 4.76 4.87 18.593 6.51 4.52 5.01 4.89 29.18

7. GENERALIZATION OF INITIAL CONDITIONS

Condition (2) is quite drastic, requiring εit = 0 for all t < 0, motivated by non-stationary versions of fractional models (6) and (7) whereas for the nonstationaryAR(1) covered by (5) it is only necessary to impose a condition on εit for a singlet (for an extensive discussion of initial values in the AR(1) case see Andersonand Hsiao, 1981). The time series version of (2) (see e.g., Hualde and Robin-son, 2011a) has sometimes been imposed in the fractional literature with littlediscussion, whereas in fact it plays a crucial role in model specification, and con-sequent asymptotic statistical properties. Also, in a time series context, Johansenand Nielsen (2010) instead treat initial conditions as bounded constants.

Here, we consider the impact on our panel model by replacing (2) by the con-dition

εit = 0, a.s. t <−m, (23)

for a specified positive integer m. This condition was considered in a time seriescontext by Hualde and Robinson (2011b) and Johansen and Nielsen (2016). Thelarger m, the closer one appears to get to the initial-condition-free setup usual inthe stationary time series literature. However, (23) for m ≥ 1 is not really a milderassumption than (2), rather, it replaces εit = 0, −1 ≤ t ≤ −m by the assumption



https://doi.org/10.1017/S0266466619000069



TABLE 6a. Simulated size Wald-test N = 200, model(7), FARIMA(1,θ10,0) , θ20 = 0.5. Gaussian innovations.Nominal size 5%


0.0 20 12.82 8.46 8.98 8.41 34.0110 10.37 7.83 6.84 7.68 37.47

5 10.33 7.27 6.35 7.11 43.074 11.67 7.36 6.55 6.86 46.653 14.81 7.52 7.42 7.41 52.36

0.6 20 9.19 8.32 8.32 8.33 23.0210 13.82 12.50 12.92 12.96 37.82

5 13.49 12.27 12.60 12.72 44.754 14.19 12.88 13.16 13.15 46.273 15.52 13.73 14.39 14.30 45.30

1.0 20 8.01 7.03 7.28 7.32 17.3210 13.73 12.23 12.82 12.79 31.59

5 13.10 11.60 12.25 12.36 39.934 13.86 12.44 12.75 12.79 41.733 14.21 12.66 13.17 13.18 41.85

1.4 20 7.89 6.94 7.18 7.18 16.9710 13.69 12.54 12.97 12.90 31.07

5 13.24 11.70 12.18 12.21 39.344 13.66 12.22 12.69 12.69 40.793 13.61 12.10 12.60 12.51 40.75

that these εit are iid across i with the same distribution as the iid nondegenerate εit

for t ≥ 0.We retain the initial condition (3) on xit .We now require that T ≥ m +2.Corresponding to (23), we write in place of (1)

λt+m (L; θ0)(yit − ζi − x ′

itβ0) = εit , (24)

for i = 1, . . . ,N, t = 0,1, . . . ,T, where λt+m (L; θ) = ∑t+mj=0 λj (θ) L j truncates

λ(L; θ). Now redefine

vit = λ−1t+m (L; θ0)εit , (25)

and (cf. (11))

zit (θ,β)= τt−1 (L; θ)x ′it (β0 −β)+ τt−1 (L; θ)vit

= τt+m (L; θ)x ′it (β0 −β)+ (τt−1 (L; θ)− τt+m (L; θ))x ′

it (β0 −β)+τt+m (L; θ)vit + (τt−1 (L; θ)− τt+m (L; θ))vit

= λt+m (L; θ)x ′it (β0 −β)−

t+m∑j=tτj (θ)x ′

i,t− j (β0 −β)

+λt+m (L; θ)vit −t+m∑j=tτj (θ)vit− j

= λt+m (L; θ){vit − x ′it (β−β0)

}− τ t+mt (θ)′vm

i + τt (θ)x ′i0 (β−β0) ,



https://doi.org/10.1017/S0266466619000069



TABLE 6b. Simulated size Wald-test N = 200, model(7), FARIMA(1,θ10,0) , θ20 = −0.5. Gaussian innova-tions. Nominal size 5%


0.0 20 5.06 3.63 3.74 3.58 11.4210 6.74 3.53 4.08 3.58 17.835 9.25 4.22 5.10 3.78 28.964 10.53 4.12 5.04 3.67 32.003 12.35 5.51 6.04 4.27 37.92

0.6 20 6.01 5.19 5.16 5.25 13.4810 6.22 5.09 5.08 5.10 22.925 8.29 5.24 6.00 5.22 41.334 9.57 4.79 6.03 4.88 49.913 15.32 5.71 8.42 5.69 64.41

1.0 20 5.86 5.24 5.08 5.13 8.1110 5.68 4.77 4.93 4.96 10.995 5.48 4.79 4.77 4.78 19.554 5.48 4.36 4.51 4.51 25.353 5.90 4.59 4.80 4.69 40.57

1.4 20 5.77 5.23 5.28 5.27 7.2410 5.76 4.85 4.91 4.95 8.715 5.51 4.73 4.90 4.83 14.174 5.66 4.82 4.95 4.93 17.523 5.97 4.80 5.00 4.94 28.66

where

τ t+mt (θ)= (τt (θ) , . . . ,τt+m (θ))

′

and

vmi = (

vi0, . . . ,vi,−m)′..

Thus,

zit (θ,β0)= λt+m (L; θ)vit − τ t+mt (θ)′vm

i ,

zit (θ0,β0)= εit − τ t+mt (θ0)

′vmi . (26)

Now for −m ≤ t ≤ 0,

vit = λ−1t+m (L; θ0)εit −λ−1

t+m (L; θ0)εi,t−1

=m+t∑

0φj (θ0)εi,t− j −

m+t−1∑0

φj (θ0)εi,t− j−1

= εit +m+t∑

1

(φj (θ0)−φj−1 (θ0)

)εi,t− j , (27)



https://doi.org/10.1017/S0266466619000069



TABLE 7a. Estimation of slope coefficient β0 : N = 100, xit ∼FARIMA(0,0.6,0), λ(L; θ) ∼ FARIMA(0,θ0,0) . Gaussian innovations.Nominal size 5%

Bias Rel. efficiency Size Wald-testθ0 T PMLE GLS OLS GLS OLS PMLE GLS OLS

0.0 20 0.0002 0.0001 0.0001 0.7957 0.3511 5.47 12.39 5.5710 –0.0003 –0.0003 –0.0003 0.9081 0.4514 5.24 8.36 5.525 –0.0009 –0.0009 –0.0009 0.9574 0.5773 5.58 6.75 5.104 0.0004 0.0004 0.0007 0.9633 0.6124 5.42 6.73 5.173 –0.0002 –0.0002 –0.0018 0.9726 0.6743 5.69 6.28 5.87

0.6 20 0.0001 0.0000 0.0001 0.8551 0.8574 5.77 11.57 5.6710 –0.0002 –0.0002 –0.0002 0.9229 0.8629 5.40 7.82 5.395 –0.0004 –0.0004 –0.0004 0.9533 0.8752 5.49 6.64 5.004 0.0011 0.0010 0.0011 0.9667 0.8828 5.14 5.92 5.013 –0.0004 –0.0003 –0.0010 0.9723 0.8969 6.03 6.66 5.85

1.0 20 0.0000 –0.0000 0.0000 0.8622 1.0005 5.69 11.23 5.7110 –0.0001 –0.0002 –0.0001 0.9260 1.0006 5.30 7.81 5.355 –0.0001 –0.0001 –0.0001 0.9511 1.0032 5.40 6.55 5.354 0.0013 0.0011 0.0013 0.9664 1.0036 5.15 6.15 5.053 –0.0007 –0.0006 –0.0006 0.9718 1.0061 6.06 6.83 5.82

1.4 20 –0.0000 0.0000 –0.0000 0.8532 0.8255 5.44 11.32 5.6210 –0.0001 –0.0002 –0.0000 0.9233 0.8135 5.26 7.70 5.225 0.0000 0.0000 0.0002 0.9500 0.8169 5.36 6.62 5.294 0.0012 0.0011 0.0014 0.9654 0.8012 5.03 5.92 5.023 –0.0006 –0.0006 –0.0004 0.9715 0.8114 5.94 7.06 5.77

TABLE 7b. Estimation of slope coefficient β0 : N = 200, xit ∼FARIMA(0,0.6,0), λ(L; θ) ∼ FARIMA(0,θ0,0) . Gaussian innovations.Nominal size 5%


0.0 20 –0.0003 –0.0004 0.0001 0.9081 0.3571 5.18 7.78 5.0310 0.0002 0.0002 0.0002 0.9467 0.4409 5.15 6.49 5.325 –0.0003 –0.0004 –0.0006 0.9756 0.5654 5.20 5.71 5.324 0.0000 0.0002 0.0002 0.9770 0.6092 5.44 5.76 5.413 –0.0001 –0.0001 –0.0005 0.9880 0.6767 5.29 5.57 5.00

0.6 20 –0.0001 –0.0001 0.0000 0.9208 0.8541 5.03 7.43 4.9710 –0.0000 –0.0001 –0.0001 0.9503 0.8553 5.30 6.75 5.325 –0.0003 –0.0004 –0.0004 0.9715 0.8657 5.18 5.94 5.334 0.0004 0.0005 0.0004 0.9769 0.8786 5.42 5.86 5.463 –0.0007 –0.0007 –0.0008 0.9855 0.8907 5.18 5.54 5.10

1.0 20 0.0000 0.0000 0.0000 0.9220 1.0004 5.02 7.40 4.9910 –0.0002 –0.0002 –0.0002 0.9532 1.0011 5.08 6.50 5.095 –0.0002 –0.0003 –0.0002 0.9704 1.0001 5.65 6.20 5.624 0.0005 0.0006 0.0005 0.9766 1.0002 5.19 5.68 5.143 –0.0009 –0.0009 –0.0009 0.9850 1.0032 4.99 5.51 4.96

1.4 20 0.0000 0.0001 0.0001 0.9210 0.8289 4.91 7.43 5.0910 –0.0002 –0.0003 –0.0003 0.9522 0.8202 5.00 6.13 5.375 –0.0001 –0.0002 –0.0001 0.9718 0.8110 5.45 6.23 5.294 0.0005 0.0006 0.0005 0.9778 0.8237 5.15 5.39 4.923 –0.0010 –0.0009 –0.0010 0.9866 0.8296 5.09 5.44 4.80



https://doi.org/10.1017/S0266466619000069



TABLE 8a. Estimation of slope coefficient β0 : N = 100, xit ∼FARIMA(0,1.0,0), λ(L; θ) ∼ FARIMA(0,θ0,0) . Gaussian innovations.Nominal size 5%


0.0 20 0.0001 0.0001 0.0001 0.6832 0.1338 5.52 15.46 5.5410 –0.0002 –0.0001 –0.0003 0.8884 0.2534 5.60 8.60 5.425 –0.0006 –0.0006 –0.0010 0.9562 0.4423 6.04 7.13 5.194 0.0001 0.0002 0.0002 0.9611 0.5005 5.59 6.73 5.403 –0.0004 –0.0004 –0.0021 0.9765 0.6059 5.95 6.49 5.69

0.6 20 0.0001 0.0001 0.0001 0.8376 0.6514 5.79 11.93 5.7810 –0.0003 –0.0003 –0.0003 0.9190 0.7220 5.25 8.09 5.345 –0.0005 –0.0004 –0.0007 0.9536 0.7984 5.60 6.76 5.204 0.0008 0.0007 0.0008 0.9691 0.8313 5.34 6.27 5.043 –0.0005 –0.0004 –0.0012 0.9759 0.8584 5.87 6.31 5.56

1.0 20 0.0001 –0.0000 0.0001 0.8605 1.0020 5.58 11.48 5.5810 –0.0003 –0.0004 –0.0003 0.9228 1.0017 5.38 7.62 5.365 –0.0004 –0.0004 –0.0004 0.9519 1.0038 5.28 6.80 5.214 0.0012 0.0010 0.0012 0.9671 1.0050 5.02 5.89 5.003 –0.0008 –0.0006 –0.0008 0.9735 1.0081 6.16 6.64 6.03

1.4 20 0.0000 –0.0000 –0.0000 0.8397 0.6030 5.50 11.68 5.5510 –0.0002 –0.0003 –0.0004 0.9187 0.6422 5.32 7.82 5.195 –0.0002 –0.0002 –0.0001 0.9495 0.7044 5.42 6.65 5.284 0.0013 0.0012 0.0015 0.9646 0.7036 5.08 6.19 5.363 –0.0008 –0.0006 –0.0005 0.9719 0.7589 6.15 7.11 6.05

TABLE 8b. Estimation of slope coefficient β0 : N = 200, xit ∼FARIMA(0,1.0,0), λ(L; θ) ∼ FARIMA(0,θ0,0) . Gaussian innovations.Nominal size 5%


0.0 20 –0.0002 –0.0002 0.0000 0.8588 0.1335 4.91 8.39 5.1410 0.0000 0.0000 0.0003 0.9473 0.2452 5.03 6.40 5.425 –0.0001 –0.0001 –0.0006 0.9794 0.4288 5.51 5.95 5.184 –0.0004 –0.0003 –0.0003 0.9796 0.5010 5.41 5.68 5.263 0.0002 0.0002 –0.0002 0.9869 0.5986 5.39 5.54 5.24

0.6 20 –0.0002 –0.0002 –0.0001 0.9182 0.6546 5.19 7.71 5.2210 0.0001 0.0001 0.0001 0.9499 0.7128 5.23 6.51 5.425 –0.0002 –0.0003 –0.0004 0.9763 0.7862 5.24 5.83 5.124 –0.0000 0.0000 0.0000 0.9792 0.8215 5.61 6.01 5.353 –0.0006 –0.0006 –0.0007 0.9851 0.8537 5.23 5.65 5.18

1.0 20 –0.0001 –0.0001 –0.0001 0.9229 1.0009 5.19 7.51 5.1610 –0.0000 –0.0001 –0.0000 0.9518 1.0025 5.19 6.61 5.125 –0.0003 –0.0004 –0.0003 0.9733 1.0018 5.51 6.14 5.374 0.0002 0.0003 0.0002 0.9770 1.0007 5.40 5.86 5.363 –0.0011 –0.0010 –0.0010 0.9842 1.0035 5.26 5.68 5.25

1.4 20 0.0000 0.0000 0.0000 0.9173 0.6026 5.04 7.63 4.8210 –0.0002 –0.0002 –0.0002 0.9517 0.6505 5.07 6.43 5.305 –0.0002 –0.0003 –0.0002 0.9710 0.6980 5.52 6.34 5.464 0.0004 0.0005 0.0002 0.9769 0.7424 5.32 5.73 5.023 –0.0012 –0.0011 –0.0014 0.9857 0.7703 4.98 5.34 4.89



https://doi.org/10.1017/S0266466619000069



where the moving average weights φj (θ) are defined by

λ(L; θ)−1 =∞∑0φj (θ) L j (28)

and the second term in (27) is absent for t = −m. Then, we can write

vmi = U (θ0)ε

mi ,

where we introduce

εmi = (

εi0,εi,−1, . . . ,εi,−m)′

and the (m + 1)× (m + 1) upper-triangular matrix

U (θ)=

⎛⎜⎜⎜⎜⎜⎜⎜⎜⎝

1 φ1 (θ)− 1 φ2 (θ)−φ1 (θ) . . . φm (θ)−φm−1 (θ)

0 1 φ1 (θ)− 1 . . . φm−1 (θ)−φm−2 (θ)

0 0 1 . . . φm−2 (θ)−φm−3 (θ)

. . . . . . . . . . . . . . .

0 0 0 . . . φ1 (θ)− 1

0 0 0 . . . 1

⎞⎟⎟⎟⎟⎟⎟⎟⎟⎠.

It follows from (26) that

zi (θ0,β0)= εi − τm (θ0)U (θ0)εmi ,

where

τm (θ)=(τ 1+m

1 (θ) , . . . ,τ T +mT (θ)

)′.

Thus, zi (θ0,β0) has covariance matrix σ 20�(θ0) , with (14) replaced by

�(θ)= IT + τm (θ0)U (θ0)U (θ0)′ τm (θ0)

′ . (29)

We can thus employ the same definition of pseudo likelihood L (θ,β) (16) withthe same formula (15) for σ 2 (θ,β) in both cases redefining �(θ) as (29), andthence the formula (17) for the estimates θ , β that are now based on the initialcondition (23) in place of (2). Writing Wm (θ)= τm (θ)U (θ) , note that by Wood-bury’s identity

�(θ)−1 = IT − Wm (θ)(Im+1 + Wm (θ)

′ Wm (θ))−1

Wm (θ)′

and by Silvester’s identity

|�(θ)| = ∣∣Im+1 + Wm (θ)′ Wm (θ)

∣∣ ,terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0266466619000069Downloaded from https://www.cambridge.org/core. IP address: 54.39.106.173, on 24 Jun 2020 at 13:57:44, subject to the Cambridge Core


https://doi.org/10.1017/S0266466619000069



thereby reducing relevant computations to the inverse and determinant of a(m + 1)× (m + 1)matrix. Asymptotic properties analogous to those of Theorems1 and 2 will follow under slightly modified assumptions.

If asymptotics with T → ∞ are instead pursued it may be possible to allowm → ∞ simultaneously. But in practice m must be chosen and the implicationsof increasing m will typically be an increase in the variance of yit conditional onζi and xit – in particular there will be a monotonic increase with m if all movingaverage weights are nonnegative, as in the AR(1) case of (5) with positive θ01 (asunder a unit root) or the long memory case θ0 > 0 of (6). The models for varyingm, m = 0,1, . . . , are nonnested and, for given λ(L; θ), m might be determinedby a suitable model-selection procedure.

There are other possible implications for the choice of initial conditions thatmight be studied in our model setting. Hahn (1999) compared semiparametricefficiency bounds under rival initial conditions. Moon, Perron, and Phillips (2007)compared initial conditions in unit root testing in panel models with incidentaltrends, where T diverges with N. Instead of the zero initial conditions assumedabove, one might consider ones that are heterogeneous across i.

To explore the latter possibility, assume now

εi,−1 = ξi , εit = 0, a.s. t <−1, (30)

where as we plan to eliminate it, no assumptions are required on ξi , as was thecase with ζi . (An analogous argument to that below can apply to the alterna-tive setting εi0 = ξi , εit = 0, a.s. t < 0, and to ones with more than one het-erogeneous initial condition.) With the notation in (28), vit = λ−1

t (L; θ0)εit =∞∑

j=0φj (θ0)εi,t− j , see (8), we have yit −ζi − x ′

itβ0 =ψit (θ0)+φt+1 (θ0)ξi , where

ψit (θ)=t∑

j=0φj (θ)εi,t− j . Thence, from (9),

yit −x ′itβ0 =ψit (θ0)+φt+1 (θ0)ξi , t = 1, . . . ,T . (31)

If φt+1 (θ0) = 0, ξi is eliminated, but assuming that is not the case, form(yit −x ′

itβ0)/φt+1 (θ0) = ψit (θ0)/φt+1 (θ0)+ ξi , whence after further

differencing

{(yit −x ′

itβ0)/φt+1 (θ0)

} ={ψit (θ0)/φt+1 (θ0)} , t = 2, . . . ,T .

(32)

Noting thatψit (θ)= εit +t∑

j=1φj (θ)εi,t− j , after rearrangement the right-hand

side of (32) can be expressed as

{ψit (θ0)/φt+1 (θ0)} =t∑

j=0χj t (θ0)εi,t− j ,

where



https://doi.org/10.1017/S0266466619000069



χ0t (θ)= 1/φt+1 (θ) ,

χ1t (θ)=(φ1 (θ)

φt+1 (θ)− 1

φt (θ)

),

χj t (θ)=(φj (θ)

φt+1 (θ)− φj−1 (θ)

φt (θ)

), j = 2, . . . , t .

Now form the (T − 1)×1 vectors

wi (θ,β)=(

{(yi2 −x ′

i2β)/φ3 (θ)

}, . . . ,

{(yiT −x ′

iT β)/φT +1 (θ)

})′,

γi (θ)=(

2∑j=0

χj2 (θ)εi,t− j , . . . ,T∑

j=0χ j T (θ)εi,t− j

)′.

Now γi (θ0) has zero mean vector and covariance matrix σ 20� (θ0) , where � (θ)

has (l,m)th element

min(l,m)+1∑j=0

χj,l+1 (θ)χj+|l−m|,m+1 (θ) .

A PMLE of θ0,β0, based on the transformed model (32) and under the het-erogeneous initial condition (30), thus minimizes |� (θ) |1/(T −1)σ 2 (θ,β), where

σ 2 (θ,β) = 1N(T −1)

N∑i=1w′

i (θ,β)� (θ)−1wi (θ,β), cf. (15), (16). Its asymptotic

properties can be derived from similar arguments to those used in proving The-orems 1 and 2 with the important exception of the identifiability argument. Withrespect to this, note that under the simple fractional model (6) for λt (L; θ) witha fractional unit root, θ0 = 1, in (1) we have φt+1 (θ0) = 0 for all t so that ξi

is eliminated from (31). But the practitioner would not know that θ0 = 1 and sowould be inclined to use the procedure based on (32). But this clearly breaksdown at θ = 1 so � must exclude θ = 1 and so it must be assumed that θ0 �= 1.However, we can cover the possibility of a fractional unit root under, say, theFARIMA(1,θ1,0) structure (7), where� can be chosen to include θ1 = 1 so longas it also entails 0< |θ2|< 1.

8. FINAL COMMENTS

We have discussed inference in panel data models with general parametric dy-namics, individual effects and possible linearly involved explanatory variables,with asymptotic theory based on cross-sectional dimension N diverging but tem-poral dimension T remaining fixed. For T → ∞ similarly desirable asymptoticproperties are available but typically with simpler formulae for the large samplevariance matrix of estimates. Obviously, one might wish to consider a modified ormore general panel data model, and given the literature and range of potential ap-plications, the possibilities are too numerous to list in full. But we briefly mentionsome possible developments in connection with our model.



https://doi.org/10.1017/S0266466619000069



1. An alternative form of inference is prompted by the rather cumber-some covariance matrix of Theorem 2. A suitable bootstrap procedurecan avoid this, and since it may achieve an Edgeworth correction is li-able to have better finite sample properties than first-order asymptoticinference.

2. Though Theorem 2 does not assume normality of the εit , if normality doesnot hold one expects greater efficiency to be achievable by maximum likeli-hood estimates with respect to a correctly specified nonnormal distribution.However, not only may these be inconsistent if the density of εit is mis-specified but also the joint density of the εit − τt (θ0)εi0, t = 1, . . . ,T, is aconvolution of the underlying density of εit , and is thus potentially compli-cated. In principle, at least, it would be possible to construct semiparametricestimates that achieve equal efficiency without parameterizing the density ofεit , being adaptive in the sense of Stone (1975) and involving nonparametricestimation of the relevant density or score function.

3. Our iid (across i) assumption on εit ,xit allows strong consistency of es-timates to be established under minimal moment conditions. But there isconcern for robustness to departures from some of our assumptions. For ex-ample, it should be straightforward to extend our proof of consistency toallow for unconditional heteroscedasticity across i of the εit (heteroscedas-ticity across t can be incorporated in the model for fixed T ). But our limitingcovariance matrix estimate in Theorem 2 can be robustified with respect tounconditional heteroscedasticity across i via the nonparametric approach ofEicker (1963). On the other hand, unanticipated heteroscedasticity entailsloss of efficiency, so more ambitiously one could develop asymptoticallyefficient estimates in a semiparametric extension of our model with condi-tionally (on xit ) heteroscedastic εit , with conditional variances estimated bynonparametric smoothing (as in, say, Robinson, 1987). Our estimates shouldalso be consistency-robust to cross-sectional dependence, but valid large-N, fixed T inference would likely require specification and estimation of aparametric model for the N × N cross-sectional covariance matrix of the εit ,based perhaps on a factor or spatial mode1. Likewise relaxation of, respec-tively, identity of distribution and independence of the xit across i would bepossible. Though we do not assume the εit are independent of the xit , if ourstrong exogeneity assumption were relaxed to orthogonality generalized-method-of-moments estimates can be considered, albeit with some loss inefficiency.

4. The linearity of the regression component x ′itβ0 reflects popular practice, but

in our setting it is easily extended to a general nonlinear parametric compo-nent, since our general dynamics require implicitly defined extremum esti-mation in any case so our asymptotic proofs can be straightforwardly mod-ified. Nonparametric regression would entail a more challenging extension,but here we might consider a series approximation to the regression com-



https://doi.org/10.1017/S0266466619000069



ponent, where, in asymptotic theory, the number of terms would increaseslowly with N.

5. Arellano and Bonhomme (2012) considered a static regression model, i.e.,with λt (L; θ0)≡ 1 in (1), but with possibly AR errors εit , in which our indi-vidual effect term ζi is generalized to z′

itϕi , where zit is a vector of observ-able explanatory variables and ϕi a vector of unknown individual-specificparameters. They discussed inter alia, nonparametric identification and es-timation of the conditional distribution of ϕi . It may be of some interestto develop an extension to our dynamic model. Of course, many paramet-ric, semiparametric, and nonparametric approaches can be applied to paneldata, bearing in mind the need to reconcile issues of correct specification,parsimony, and curse-of-dimensionality, and reflecting absolute and relativesizes of N and T .

6. Fractional modeling could be applied to errors of static models, in placeof the popular ARMA modeling. In particular, issues of identification andestimation might be studied when the shocks to the errors comprise botha permanent and transitory component, as in e.g., Ejrnaes and Browning(2014).

7. Our ζi could also be extended to incidental or heterogeneous trends (seeMoon and Phillips, 1999; Moon, Perron, and Phillips, 2007). Here, for ex-ample, a polynomial trend with individual-specific slope might be elimi-nated by higher order differencing and then a modification of our PMLEdeveloped.

REFERENCES

Adenstedt, R.K. (1974) On large-sample estimation for the mean of a stationary random sequence.Annals of Statistics 6, 1095–1107.

Anderson, T.W. & C. Hsiao (1981) Estimation of dynamic models with error components. Journal ofthe American Statistical Association 76, 598–606.

Arellano, M. & S. Bonhomme (2012) Identifying distributional characteristics in random coefficientspanel data models. The Review of Economic Studies 79, 987–1020.

Eicker, F. (1963) Asymptotic normality and consistency of the least squares estimators for families oflinear regressions. Annals of Mathematical Statistics 34, 447–456.

Ejrnaes, M. & M. Browning (2014) The persistent-transitory representation for earnings processes.Quantitative Economics 5, 555–581.

Hahn, J. (1999) How informative is the initial condition in the dynamic panel model with fixed effects?Journal of Econometrics 93, 309–326.

Han, C. & P.C.B. Phillips (2013) First difference maximum likelihood and dynamic panel estimation.Journal of Econometrics 175, 35–45.

Hassler, U., M. Demetrescu, & A.L. Tarcolea (2011) Asymptotic normal tests for integration in panelswith cross-dependent units. Advances in Statistical Analysis 95, 187–204.

Hsiao, C. (2014) Analysis of Panel Data, 3rd ed. Cambridge University Press.Hsiao, C., M.H. Pesaran, & A.K. Tahmiscioglu (2002) Maximum likelihood estimation of fixed effects

dynamic panel data models covering short time periods. Journal of Econometrics 109, 107–150.Hualde, J. & P.M. Robinson (2011a) Gaussian pseudo-maximum likelihood estimation of fractional

time series models. Annals of Statistics 39, 3152–3181.



https://doi.org/10.1017/S0266466619000069



Hualde, J. & P.M. Robinson (2011b) Supplement to Gaussian pseudo-maximum like-lihood estimation of fractional time series models. Annals of Statistics. Available athttps://projecteuclid.org/download/suppdf 1/euclid.aos/1330958676.

Jennrich, R.I. (1969) Asymptotic properties of non-linear least squares estimators. Annals of Mathe-matical Statistics 2, 633–643.

Johansen, S. & M. Nielsen (2010) Likelihood inference for a nonstationary fractional autoregressivemodel. Journal of Econometrics 158, 51–66.

Johansen, S. & M. Nielsen (2016) The role of initial values in conditional sum-of-squares estimationof nonstationary fractional time-series models. Econometric Theory 32, 1095–1139.

Moon, H.R., B. Perron, & P.C. Phillips (2007) Incidental trends and the power of panel unit root tests.Journal of Econometrics 141, 416–459.

Moon, H.R., B. Perron, & P.C.B. Phillips (2015) Incidental parameters and dynamic panel models. InB.H. Baltagi (ed.), The Oxford Handbook of Panel Data, pp. 111–148. Oxford University Press.

Moon, H.R. & P.C.B. Phillips (1999) Maximum likelihood estimation in panels with incidental trends.Oxford Bulletin of Economics and Statistics 61, 711–747.

Nagar, A.L. (1959) The bias and moment matrix of the general k-class estimators of the parameters insimultaneous equations. Econometrica 27, 575–595.

Neyman, J. & E. Scott (1948) Consistent estimates based on partially consistent observations. Econo-metrica 16, 1–31.

Robinson, P.M. (1987) Asymptotically efficient estimation in the presence of heteroskedasticity ofunknown form. Econometrica 55, 875–891

Robinson, P.M. & C. Velasco (2015) Efficient inference on fractionally integrated panel data modelswith fixed effects. Journal of Econometrics 185, 435–452.

Robinson, P.M. & C. Velasco (2017) Inference on trending panel data. Journal of Econometrics, forth-coming.

Stone, C.J. (1975) Adaptive maximum likelihood estimators of a location parameter. Annals of Statis-tics 3, 267–284.

APPENDIX A: Theorem Proofs

Proof of Theorem 1. We have

σ 2 (θ, β (θ)

) = 1

N T

N∑i=1

(ϒ (L; θ)(yi −xi β (θ)

))′�(θ)−1 (

ϒ (L; θ)(yi −xi β (θ))),

which straightforwardly equals

1

NT

N∑i=1

(ϒ (L ; θ)yi )′�(θ)−1 (ϒ (L ; θ)yi )

− 1

NT

N∑i=1

(ϒ (L ; θ)yi )′�(θ)−1ϒ (L ; θ)xi

×(

1

NT

N∑i=1

(ϒ (L ; θ)xi )′�(θ)−1ϒ (L ; θ)xi

)−1

× 1

NT

N∑i=1

(ϒ (L ; θ)xi )′�(θ)−1ϒ (L ; θ)yi

= A(y, y,θ)− A(y, x,θ)A(x, x,θ)−1 A(x, y,θ), (A.1)



https://doi.org/10.1017/S0266466619000069



where, for example,

A(x, y,θ)= 1

NT

N∑i=1

(ϒ (L ; θ)xi )′�(θ)−1ϒ (L ; θ)yi .

With the same kind of notation, (A.1) straightforwardly equals

A(v,v,θ)− A(v, x,θ)A(x, x,θ)−1 A(x,v,θ).

Assumption A(vi) and finiteness of T implies the τt (θ) are continuous in θ, and thusuniformly continuous on the compact set �. Also, from (18), (19), �(θ)−1 is continuousin θ, and thus uniformly continuous on the compact set �.

Now, from finiteness of T and conditions Assumptions A(i)–(vi), as N → ∞, uniformlyin θ ∈�, a.s.

A(v,v,θ)→ E A(v,v,θ), A(v, x,θ)→ E A(v, x,θ)= 0, A(x, x,θ)→ E A(x, x,θ),

(A.2)

where the last expression is positive definite from condition (viii). Here, pointwise conver-gence follows from Assumption A(1), while equicontinuity, and thus uniform convergence,follow since Assumption A(vi) and finiteness of T implies the τt (θ) are continuous in θ,and thus uniformly continuous on the compact set �, while from (18), (19), �(θ)−1 iscontinuous in θ, and thus uniformly continuous on the compact set �. It follows that asN → ∞, uniformly in θ ∈�, a.s.

σ 2 (θ, β (θ)

) → E A(v,v,θ). (A.3)

Now,

E A(v,v,θ)= 1

Ttr

(�(θ)−1 Ezi (θ,β0) z′

i (θ,β0)),

where

zi (θ,β0)= (zi1 (θ,β), . . . , ziT (θ,β))′

= V (θ)εi + (ν (θ)− τ (θ))εi0, (A.4)

defining the T × T upper-triangular matrix

V (θ)=

⎛⎜⎜⎜⎜⎜⎝1 ν1 (θ) ν2 (θ) . . . νT −1 (θ)

0 1 ν1 (θ) . . . νT −2 (θ)

0 0 1 . . . νT −3 (θ)

. . . . . . . . . . . . . . .

0 0 . . . . . . 1

⎞⎟⎟⎟⎟⎟⎠ , (A.5)

where the νj (θ) are given by

ν (L ; θ,θ0)= λ(L ; θ)/λ(L ; θ0)=∞∑

j=0

νj (θ)L j , (A.6)



https://doi.org/10.1017/S0266466619000069



and the T ×1 vector

ν (θ)= (ν1 (θ) ,ν2 (θ) , . . . ,νT (θ))′ . (A.7)

The evaluation (A.4) follows because

zit (θ,β0)= τt−1 (L ; θ)vit

= λt (L ; θ)vit − τt (θ)εi0,

= λt (L ; θ0)−1 λt (L ; θ)εit − τt (θ)εi0

=t∑

j=0νj (θ)εi,t− j − τt (θ)εi0

=t−1∑j=0νj (θ)εi,t− j + (νt (θ)− τt (θ))εi0.

From (A.4)

Ezi (θ,β0) zi (θ,β0)′ /σ 2

0 = V (θ)′ V (θ)+ (ν (θ)− τ (θ))(ν (θ)− τ (θ))′ ,and so

E A(v,v,θ)/σ 20 = 1

Ttr

(�(θ)−1 (

V (θ)′ V (θ)+ (ν (θ)− τ (θ)) (ν (θ)− τ (θ))′)) . (A.8)

By the inequality between arithmetic and geometric means, the last expression is no lessthan∣∣∣�(θ)−1 (

V (θ)′ V (θ)+ (ν (θ)− τ (θ))(ν (θ)− τ (θ))′)∣∣∣1/T. (A.9)

It is readily seen from (A.5) and (A.7) after evaluating the νj (θ) from (A.6) (eg ν1 (θ) =λ1 (θ)− λ1 (θ0)) that V (θ)′ V (θ)+(ν (θ)− τ (θ))(ν (θ)− τ (θ))′ equals�(θ) plus a sym-metric matrix all of whose elements are functions of λt (θ)− λt (θ0) , t = 1, . . . ,T and θ0only, such that all eigenvalues of �(θ)−1 (

V (θ)′ V (θ)+ (ν (θ)− τ (θ))(ν (θ)− τ (θ))′)are equal, and thence the lower bound (A.9) is attained, when and only when λt (θ) =λt (θ0) , j = 1, . . . ,T . But from condition (vii) the latter holds only when θ = θ0.

Since (A.9) equals(1+ τ ′ (θ0)τ (θ0)

)−1/T = |�(θ0)|−1/T at θ = θ0 we have shown that

L (θ,β (θ))/σ 20 converges uniformly a.s. to a function that is bounded below uniquely by

the limit, 1, of L (θ0,β (θ0))/σ20 , since σ2 (

θ0, β (θ0)) → σ 2

0 a.s. from (A.3). Thus, by astandard argument (see e.g., Jennrich, 1969) it follows that θ → θ0 a.s. Using this prop-erty and some of the arguments above, finally, β = β

(θ) = A(x, x, θ )−1 A(x, y, θ)→ β0

a.s. �

Proof of Theorem 2. By Assumptions B (ii) and the usual mean value theorem argu-ment

∂

∂(θ ′,β′)′ L

(θ, β

) = ∂

∂(θ ′,β′)′ L (θ0,β0)+ M

(θ −θ0

β−β0

), (A.10)

where M is the matrix derived from

M (θ,β)= ∂2

∂(θ ′,β′)′ ∂ (θ ′,β′) L (θ,β) (A.11)



https://doi.org/10.1017/S0266466619000069



by evaluating the j th row at θ = θ( j ), β = β

( j )satisfying∥∥∥θ( j )−θ0

∥∥∥ ≤ ∥∥θ −θ0∥∥ ,∥∥∥β( j )−β0

∥∥∥ ≤ ∥∥β−β0∥∥ . Now for j = 1, . . . , p,

∂

∂θjL (θ0,β0)= 1

NT

N∑i=1

r1 j i (θ0,β0), (A.12)

and

∂

∂βL (θ0,β0)= 1

N

N∑i=1

r2i (θ0,β0). (A.13)

The expression (A.12) follows by noting that from the proof of Theorem 4.4 of RV the lefthand side is

|�(θ0)|1T

(σ2 (θ0,β0)

Ttr

(�−1 (θ0)�

j (θ0))

− 1

NT

N∑i=1

z′i (θ0,β0)�

−1 (θ0)�j (θ0)�

−1 (θ0) zi (θ0,β0)

⎞⎠+ 2

NT

N∑i=1

z j ′i (θ0,β0)�

−1 (θ0) zi (θ0,β0) ,

and applying (15) and (20).To analyze the right side of (A.12), we make heavy use of expressions in the proof of

Theorem 4.4 of RV, reproducing them here because their central limit theorem is basedon T → ∞ whereas ours is based on N → ∞; hence, the representation in (A.13) as anarithmetic mean over i = 1, . . . ,N, instead of one over t . First, we have

σ 2 (θ0,β0)= 1

NT

N∑i=1

T∑t=1

(εit − τ0

t εi0

)2 − 1

NT S0ττ

N∑i=1

⎛⎝ T∑t=1

τ0t

(εit − τ0

t εi0

)⎞⎠2

where

S0ττ = |�(θ0)| = 1+ τ0′τ0,

and from RV

E σ 2 (θ0,β0)= σ 20 .

Also, defining

S0τ τ j = τ0′τ0 j , τ0 j = τ j (θ0) ,

we have

z′i (θ0,β0)�

−1 (θ0)�j (θ0)�

−1 (θ0) zi (θ0,β0)

= 2

S0ττ

⎛⎝ T∑t=1

(εit − τ0

t εi0

)τ0

t

⎞⎠⎛⎝ T∑t=1

(εit − τ0

t εi0

)τ

0 jt

⎞⎠−2

S0τ τ j

S02ττ

⎛⎝ T∑t=1

(εit − τ0

t εi0

)τ0

t

⎞⎠2

,



https://doi.org/10.1017/S0266466619000069



which has expectation tr(�−1 (θ0)�

j (θ0)). Finally, z j ′

i (θ0,β0) is the j th row of

z′i (θ0,β0)= fi − τ 0εi0, (A.14)

in which we have the p × T matrices τ0 = τ (θ0) =(τ01′, . . . , τ0p′)′

and fi =( fi1, . . . , fiT ) , where

fit =t−1∑j=0

χt− j (θ0)εi j ,

with the vectors χj (θ) defined by

χ (L ; θ)= ∂

∂θlogλ(L ; θ)=

∞∑j=0

χj (θ)L j .

The representation (A.14) follows from

∂

∂θzit (θ,β)= ∂

∂θλt (L ; θ){vit − x ′

it (β−β0)}− ∂

∂θτt (θ)

{εi0 − x ′

i0 (β−β0)}

= χ (L ; θ)λ(L ; θ)λ−1t (L ; θ0)εit −χ (L ; θ)λ(L ; θ) x ′

it (β−β0)

−τ t (θ){εi0 − x ′

i0 (β−β0)},

and hence,

∂

∂θzit (θ,β0)= χ (L ; θ)λ(L ; θ)λ−1

t (L ; θ0)εit − τ t (θ)εi0,

∂

∂θzit (θ0,β0)= χ (L ; θ0)εit − τ 0

t εi0

= fit − τ 0t εi0.

Then, with f jit the j th element of fit ,

z j ′i (θ0,β0)�

−1 (θ0) zi (θ0,β0)

=T∑

t=1

(f jit − τ 0 j

t εi0

)(εit − τ0

t εi0

)− 1

S0ττ

T∑t=1

(f jit − τ 0 j

t εi0

)τ0

t

×⎛⎝ T∑

t=1

(εit − τ0

t εi0

)τ0

t

⎞⎠ ,which, as shown by RV, has expectation σ 2

0 S0τ τ j/S0

ττ .



https://doi.org/10.1017/S0266466619000069



Altogether we have, writing S0ττ =

(S0τ τ1, . . . , S0

τ τ p

)′,

T r1i (θ0,β0) = S0τ τ(

S0ττ

)1−1/T

⎧⎪⎨⎪⎩ 1

T

T∑t=1

(εit − τ0

t εi0

)2 − 1

T S0ττ

⎛⎝ T∑t=1

τ0t

(εit − τ0

t εi0

)⎞⎠2⎫⎪⎬⎪⎭

− 2(S0ττ

)1−1/T

⎛⎝ T∑t=1

(εit − τ0

t εi0

)τ0

t

⎞⎠⎛⎝ T∑t=1

(εit − τ0

t εi0

)τ0

t

⎞⎠

+2τ0′τ0(

S0ττ

)2−1/T

⎛⎝ T∑t=1

(εit − τ0

t εi0

)τ0

t

⎞⎠2

+2(

S0ττ

)1/T T∑t=1

(fit − τ0εi0

)(εit − τ0

t εi0

)

− 2(S0ττ

)1−1/T

⎛⎝ T∑t=1

(fit − τ0

t εi0

)τ0

t

⎞⎠⎛⎝ T∑t=1

(εit − τ0

t εi0

)τ0

t

⎞⎠ , (A.15)

and Er1 j i(θ0,β0) = 0. The r2i (θ0,β0) can be similarly but more simply expressed and itis readily confirmed that Er2i (θ0,β0) = 0, and then from Assumptions A, conditions (i)and the central limit theorem for iid random vectors,

N1/2 ∂

∂(θ ′,β′)′ L (θ0,β0)→d N (0,EC (θ0,β0)) .

Next, using also Theorem 1 and Assumption B(ii)–(v), it follows much as in the proofof (A.2) that we have uniform convergence on a neighbourhood of (θ0,β0) of B (θ,β),C (θ,β) and M (θ,β), whence

C(θ, β

) →p EC (θ0,β0) ,

B(θ, β

) →p E B (θ0,β0) ,

M − M (θ0,β0)→p 0

and

M (θ0,β0)= B (θ0,β0)+op (1)→p E B (θ0,β0) ,

the op (1) terms including negligible ones such as those in second derivatives of elementsof �(θ), zi (θ,β). �

APPENDIX B: Evaluation of the Asymptotic Variance ofθ under Gaussianity

Here, we pursue closed form expressions for the asymptotic variance of θ that can beestimated by replacing θ0 by θ when there are no regressors in the model. We suppressdependence of β in the notation and write z0

i = zi (θ0)= zi (θ0,β0)= εi − τ (θ0)εi0, with



https://doi.org/10.1017/S0266466619000069



a similar interpretation for z0 ji = z j

i (θ0) , its derivative with respect to θj . We also write

�−10 =�−1 (θ0) and � j

0 =� j (θ0) .

We first evaluate the variance of the score of L for Gaussian data, setting σ 20 = 1 wlog,

E[C (θ0)

]j,k = E

[r1i (θ0)r1i (θ0)

′]j,k

= 1

T 2|�T (θ0)|

2T

3∑a,b=1

E[(w0

a j i − E[w0

a j i

])(w0

bki − E[w0

bki

])]

= 1

T 2|�T (θ0)|

2T

3∑a,b=1

E[w0

a j iw0bki

]− E

[w0

a j i

]E

[w0

bki

]for coordinates j,k = 1, . . . , p, where

w01 j i = 1

Ttr

(�−1

0 �j0

)z′

i�−10 z0

i

w02 j i = −z0′

i �−10 �

j0�

−10 z0

i

w03 j i = 2z0 j ′

i �−10 z0

i ,

and E[w0

1 j i

]= −E

[w0

2 j i

]= tr

(�−1

0 �j0

)because

E[z0′

i �−10 z0

i

]= E

[(εit − τt (θ0)εi0)

′�(θ0)−1 (εit − τt (θ0)εi0)

]= tr

{�−1

0 E[(εit − τt (θ0)εi0)

′ (εit − τt (θ0)εi0)]}

= tr {IT } = T

while

E[w0

2 j i

]= −E

[z0′

i �−10 �

j0�

−10 z0

i

]= −tr

(�−1

0 �j0 E

[z0′

i �−10 z0

i

])= −tr

(�−1

0 �j0

)and E

[w0

3 j i

]= tr

(�−1

0 E(

z0i z0 j ′

i

))= 0 from the proof of Theorem 4.4 in RV.

Next,

E[w0

1 j iw01ki

]= 1

T 2tr

(�−1

0 �j0

)tr

(�−1

0 �k0

)[(z0′

i �−10 z0

i

)2]

=(

1+ 2

T

)tr

(�−1

0 �j0

)tr

(�−1

0 �k0

)because using Gaussianity, see Nagar (1959), with E

[(εi − τ (θ0)εi0) (εi − τ (θ0)εi0)

′] =�0,

E

[(z0′

i �(θ0)−1 z0

i

)2]

= E

[((εi − τ (θ0)εi0)

′�(θ0)−1 (εi − τ (θ0)εi0)

)2]

= tr{�−1

0 �0

}2 +2tr{�−1

0 �0�−10 �0

}= tr (IT )

2 +2tr (IT )= T 2 +2T .



https://doi.org/10.1017/S0266466619000069



Next, using again Gaussianity,

E[w0

2 j iw02ki

]= E

[(z0′

i �−10 �

j0�

−10 z0

i

)(z0′

i �−10 �k

0�−10 z0

i

)]= tr

(�−1

0 �j0

)tr

(�−1

0 �k0

)+2tr

(�−1

0 �j0�

−10 �k

0

).

Write

z0i =�0 εi , �0 =�(θ0) , � =�(θ)=

(−τ(θ) ... IT

)

while from

z0 jit =

t∑a=1

χj

a (θ0)εit−a − τ t (θ0)εi0

we have

z0 ji = A0

j εi , A0j = Aj (θ0) , Aj = Aj (θ)=

(� j (θ)

... 0T ×1

)−

(τ j (θ)

... 0T ×T

)

where

� j (θ)=

⎛⎜⎜⎜⎜⎜⎜⎝χ

j1 (θ) 0 · · · 0

χj

2 (θ) χj

1 (θ)...

.... . .

. . . 0

χj

T (θ) · · · χj

2 (θ) χj

1 (θ)

⎞⎟⎟⎟⎟⎟⎟⎠ .

Then, writing w03 j i = 2z0 j ′

i �−10 z0

i = ε′i B0j εi for symmetric B0

j =(A0′

j �−10 �0 +

(A0′

j �−10 �0

)′),

E[w0

3 j iw03ki

]= 4E

[(z0 j ′

i �−10 z0

i

)(z0k′

i �−10 z0

i

)]= 2tr

(A0′

j �−10 �0 A0′

k �−10 �0

)+2tr

(A0′

j �−10 �0

(A0′

k �−10 �0

)′)+2tr

((A0′

j �−10 �0

)′A0′

k �−10 �0

)+2tr

((A0′

j �−10 �0

)′ (A0′

k �−10 �0

)′)= 2tr

(A0′

j �−10 �0

(A0′

k �−10 �0

)′)+2tr

((A0′

j �−10 �0

)′A0′

k �−10 �0

)

because tr(

A0′j �

−10 �0

)= tr

(�−1

0 �0 A0′j

)= tr

(�−1

0 E(

z0i z0 j ′

i

))= 0 and

tr(

A0′j �

−10 �0 A0′

k �−10 �0

)= tr

((A0′

j �−10 �0

)′ (A0′

k �−10 �0

)′) = 0 extending

the arguments in the proof of Theorem 4.4 of RV.



https://doi.org/10.1017/S0266466619000069



Finally, the covariance terms are

E[w0

1 j iw02ki

]= − 1

Ttr

(�−1 (θ0)�

j (θ0))

E[(

z0′i �

−10 z0

i

)(z0′

i �−10 �k

0�−10 z0

i

)]= −

(1+ 2

T

)tr

(�−1

0 �j0

)tr

(�−1

0 �k0

)because

E[(

z0′i �

−10 z0

i

)(z0′

i �−10 �k

0�−10 z0

i

)]= tr

(�−1

0 �0

)tr

(�−1

0 �k0�

−10 �0

)+2tr

(�−1

0 �0�−10 �k

0�−10 �0

)= (T +2) tr

(�−1

0 �k0

)with E

[w0

2 j iw01ki

]= E

[w0

1 j iw02ki

], while

E[w0

1 j iw03ki

]= 2

1

Ttr

(�−1

0 �j0

)E

[(z0′

i �−10 z0

i

)(z0k′

i �−10 z0

i

)]= 2

(1+ 2

T

)tr

(�−1

0 �j0

)tr

(A0′

k �−10 �0

)= 0,

because

E[(

z0′i �

−10 z0

i

)(z0k′

i �−10 z0

i

)]= tr

(�−1

0 �0

)tr

(A0′

k �−10 �0

)+ tr

(�−1

0 �0 A0′k �

−10 �0

)+ tr

(�−1

0 �0

(A0′

k �−10 �0

)′)= (T +2) tr

(A′

k�−10 �0

)= 0.

Similarly E[w0

3 j iw01ki

]= 0, and

E[w0

2 j iw03ki

]= −2E

[(z0′

i �−10 �

j0�

−10 z0

i

)(z0k′

i �−10 z0

i

)]= −2tr

(�0′�−1

0 �j0�

−10 �0 A′

k�−10 �0

)−2tr

(�0′�−1

0 �j0�

−10 �0

(A0′

k �−10 �0

)′)

and a similar expression follows for E[w0

2kiw03 j i

]interchanging j and k.

Compiling all the terms we set[C0 (θ)

]j,k = 1

T 2|�(θ)| 2

T (− 2

Ttr

(�−1� j

)tr

(�−1�k

)+2tr

(�−1� j�−1�k

)+2tr

(A′

j�−1�

(A′

k�−1�

)′)+2tr

((A′

j�−1�

)′A′

k�−1�

)−2tr

(�′�−1� j�−1�A′

k�−1�

)−2tr

(�′�−1� j�−1�

(A′

k�−1�

)′)−2tr

(�′�−1�k�−1�A′

j�−1�

)−2tr

(�′�−1�k�−1�

(A′

j�−1�

)′))where �−1 = �−1 (θ) , � j = � j (θ) , Aj = Aj (θ) and � = �(θ) are evaluated at θ

instead of �−10 , etc., which are evaluated at θ0.



https://doi.org/10.1017/S0266466619000069



For the evaluation of the probability limit of the Hessian, B1i (θ0) , we define b01 j ki (θ)

as b1 j ki (θ) with σ 2 (θ) replaced by σ 20 (= 1 wlog), j,k = 1, . . . , p, so that

E[b0

1 j ki (θ0)]

= 1

T|�(θ0)|

1T

(tr

(�−1 (θ0)�

k (θ0)�−1 (θ0)�

j (θ0))

− 1

Ttr

(�−1 (θ0)�

j (θ0))

tr(�−1 (θ0)�

k (θ0))

−2tr(�−1 (θ0)�

j (θ0)�−1 (θ0)E

[z0

i z0k′i

])−2tr

(�−1 (θ0)�

k (θ0)�−1 (θ0)E

[z0

i z0 j ′i

])+2tr

(�−1 (θ0)E

[z0k

i z0 j ′i

])),

and then set[B0 (θ)

]j,k = 1

T|�(θ)| 1

T

(tr

(�−1 (θ)�k (θ)�−1 (θ)� j (θ)

)− 1

Ttr

(�−1 (θ)� j (θ)

)tr

(�−1 (θ)�k (θ)

)−2tr

(�−1 (θ)� j (θ)�−1 (θ) E(θ)

[zi zk′

i

])−2tr

(�−1 (θ)�k (θ)�−1 (θ) E(θ)

[zi z j ′

i

])+2tr

(�−1 (θ) E(θ)

[zk

i z j ′i

]))where E(θ)

[zi z j ′

i

]=�(θ) A′

j (θ)=�A′j and E(θ)

[zk

i z j ′i

]= Ak (θ) A′

j (θ)= Ak A′j .

Then, the equivalent normalizing matrices for(θ −θ0

)based on these ex-

pected values under Gaussianity satisfy(

N B0(θ)

C0(θ)−1 B0

(θ))1/2 =

N1/2 (T/2)∣∣�T

(θ)∣∣− 1

T C0(θ)1/2 = N1/2 (T/2)1/2

∣∣�T(θ)∣∣− 1

2T B0(θ)1/2 exploiting

the proportionality of B0 (θ) and C0 (θ) under Gaussianity, because �(θ)� (θ)′ = �(θ) ,so that

tr

((A′

j�−1�

)′A′

k�−1�

)= tr

(�−1 Aj A′

k�−1��′) = tr

(�−1 Aj A′

k

),

tr(�′�−1� j�−1�A′

k�−1�

)= tr

(�−1� j�−1�A′

k

),

the additional factors correcting for the fact of L being only proportional to the likelihood.



https://doi.org/10.1017/S0266466619000069


Documents

36 doi:10.1017/S0266466619000069 ESTIMATION …...Gaussian pseudo maximum likelihood estimation of unknown parameters, mo-tivated by its asymptotic efﬁciency when Gaussianity holds