Robust Granger Causality Tests in the VARX Frameworkhomes.chass.utoronto.ca/~amaynard/papers/grangint.pdf · and Granger causality testing often rely heavily on the choice of model

Robust Granger Causality Tests in the VARXFramework

Dietmar BauerArsenal Research

Vienna, Austria

Alex Maynard∗

Department of Economics

University of Toronto

February 21, 2006

Preliminary and Incomplete. Comments Welcome.

AbstractThe estimated spectral densities of many financial and macroeconomic time serieshave long been known to peak at frequency zero. This low frequency behavior maybe potentially generated by a large number of time series models, including autore-gressive models with roots close or equal to unity, long-memory models, and modelsincorporating structural breaks. Although such alternatives are not always distin-guished with confidence in practice, common inference procedures in predictabilityand Granger causality testing often rely heavily on the choice of model. Doladoand Lutkepohl (1996) and Toda and Yamamoto (1995) introduced tests in I(1) VARmodels based on an additional lag in the specification of the estimated equationwhich is not included in the tests. This approach, which we refer to as the surpluslag method, leads to conventional test distributions for stationary and (co)integrateddata generating processes. By extending the surplus lag methodology to the VARXframework, we show that it can provide Granger causality tests that accommodatestationary, nonstationary, near-nonstationary, long-memory, and unmodelled struc-tural break processes within the context of a single χ2 null limiting distribution.Since the distribution under the null hypothesis is the same in all cases, no priorknowledge, first-stage testing, or estimation is required.JEL Classification: C12, C32 Keywords: Granger causality, surplus lag, nonstation-ary VAR, local-to-unity, long-memory, structural breaks

∗Corresponding author. We thank J.M. Dufour, seminar participants at the University of Washingtonand conference participants at the 2005 Canadian Econometric Study Group and Midwest EconometricsGroup meetings for useful comments and discussion. Part of this work was completed while the authorswere visiting the Cowles Foundation and we gratefully acknowledge their hospitality. The postdoc positionof Bauer at the Cowles Foundation has been financed by the Max Kade Foundation which is gratefullyacknowledged. Maynard thanks the SSHRC for research funding.

1

1 Introduction

This paper develops a simple but flexible augmented VARX approach to causality testingthat is highly robust to the degree and nature of the persistence in the causing variables.In particular, the same estimator and test statistic may be employed to test causality,regardless of whether the true, but unknown, data generating process is characterized by(unmodelled) structural breaks, long-memory/fractional integration, stationarity, a local-to-unity process, or I(1) behavior. Since the Granger causality test is based on the sameWald statistic and Chi-squared limiting distribution in all five cases, no prior knowledge,estimation, or testing is required to distinguish between these processes.

These are desirable characteristics for two reasons. First, it is frequently difficult todetermine the degree and nature of the persistence of the forcing variables with muchconfidence. Secondly, this can often matter in both theory and practice for second stagemodel specification and inference.

The difficulties associated in distinguishing I(1) and I(0) processes have long beenknown. In short, unit root tests may have low power, confidence intervals on the largestroot are often wide in practice, and problems of near observational equivalence put boundson our ability to distinguish between true I(1) series and persistent I(0) series in finitesample (Faust, 1996; Faust, 1999). Moreover, for many purposes, processes with near unitroots may be better modelled as local-to-unity processes, which depend on a local-to-unityparameter, which cannot be consistently estimated in a time series context (Phillips, 1987;Chan, 1988; Nabeya and Sørensen, 1994). These problems may be further complicatedwith the allowance for structural breaks, in which numerous modelling possibilities arise,unit root tests become more complicated, and the similarity between break-stationaryand unit root processes, which contain a break each period, increases with the numberof breaks. Recent literature has also highlighted the difficulties in distinguishing betweenstructural breaks and long-memory/fractionally integrated processes (Diebold and Inoue,2001; Gourieroux and Jasiak, 2001; Granger and Hyung, 2004). Thus while it is often easyto recognize that a given series is persistent, it can be far more difficult to determine withconfidence the right approach for modelling this persistence. As Phillips (2003, p. C35)puts it “no one really understands trends, even though most of us see trends when we lookat economic data.”

These distinctions matter in two ways for causality testing. First, they are importantto model specification. A typical VAR may be specified in levels if unit roots are rejected,in first-differences if the variables are individually integrated but not cointegrated, and inerror-correction format if the variables are cointegrated. Likewise, structural breaks requireexplicit modelling and long-memory processes are not easily accommodated in a VARsetting. These choices can have important practical, as well as, econometric implications.A recent controversial example involves the role of technology shocks in macroeconomicmodels, for which VARs provide evidence consistent with New Keynesian models if hoursworked is treated as nonstationary and enters in differences (Gali, 1999) but supportsinstead the conclusions of standard Real Business Cycle models when the same variable ispresumed stationary and enters in levels (Christiano et al., 2003).

2

The persistence of the causing variable can also matter for second-stage inference. Evenin simple regression models, different critical values may apply depending on whether ornot the regressor contains a unit root. Moreover, both stationary and unit root asymptoticscan be misleading when roots are close, but not equal, to unity, as often modelled by thenear unit root/local-to-unity model mentioned above (Cavanagh et al., 1995; Elliott, 1998).Such inference problems may be further complicated once one allows more realistically forthe possibility of structural breaks or long-memory processes. They also matter in practicalapplications and have played a particularly important role in the predictive regressionliterature, including tests of stock return predictability (Stambaugh, 1999; Lewellen, 2004;Torous et al., 2005), the expectations hypothesis of the term structure (Lanne, 2002), anduncovered interest rate parity (Baillie and Bollerslev, 2000; Maynard and Phillips, 2001),all of which constitute special cases of Granger causality testing.

Our approach builds on several previous papers which develop the augmented lag ap-proach in auto and vector-autoregressive frameworks. Although regression parameters onpure I(1) regressors typically have nonstandard limit behavior, it is well known (Park andPhillips, 1989; Sims et al., 1990) that the parameters on the I(0) components have a nor-mal asymptotic distribution with a root-T rate of convergence. In the context of unit roottesting, Choi (1993) recognized that, with the addition of an extra, unnecessary lag, theautoregressive model could be rewritten so that all the parameters of interest are expressedas coefficients on stationary transformations of the data. Thus, at some cost, in terms ofefficiency, inference procedures could be simplified, via the avoidance of nonstandard dis-tributions. Toda and Yamamoto (1995), and Dolado and Lutkepohl (1996) showed howthe same surplus lag approach could be applied to provide inference in finite order vectorautoregression, without knowing which components are stationary and which have unitroots. Saikkonen and Lutkepohl (1996) extended these results to infinite order VARs.

These results provide a very flexible approach to inference in general models. On theother hand, the pure VAR framework considered in these papers cannot accommodate long-memory, nor can it accommodate breaks unless they are explicitly modelled. Yet, sincebreaks are often tested for in conjunction with unit roots, the requirement that they bespecifically modelled detracts somewhat from the advantageous features of the surplus lagapproach. By incorporating an exogenously modelled component, we may accommodatea richer class of persistent processes in the VARX framework, including those with long-memory/fractionally integration or unmodelled structural breaks. Moreover, we find thatwith incorporation of the surplus lag, the limit distribution continues to be unaffected bythe particular form of this persistence.1

1Some related results are provided by Dolado and Marmol (2004), who extend the results of Sims etal. (1990) to cover nonstationary fractionally integrated processes, in the context of a single variable au-toregressive distributed lag model, whose variables include non-stationary fractionally integrated processeswith a priori known orders of integration. Like Dolado and Marmol (2004), we also consider nonstation-ary fractionally integrated regressors as one of several possible forcing processes, but in the more generalcontext of a multivariate surplus lag VARX model, in which we do not require known orders of integrationand may also accommodate stationary long-memory. Amihud and Hurvich (2004) also suggest a finitesample bias correction to predictive regression in finance, which they also refer to as a surplus regression.

3

The simplicity and generality of the surplus lag approach does not come without cost.Naturally, the addition of an additional unnecessary lag reduces the efficiency of estimation,thereby leading to reduced power relative to a correctly specified model. However, asprevious literature reports, the magnitude of this effects varies considerably. Efficiencylosses are substantial in unit root and cointegration tests, which are based exclusively onsuperconsistent regression estimators in the absence of the extra lag. Generally, the surpluslag is not recommended in this case, even by its proponents (Toda and Yamamoto, 1995;Saikkonen and Lutkepohl, 1996). However, efficiency losses are often far more moderate inthe type of Granger causality tests considered here, particularly when the baseline modelalready includes a number of lags, as in common macroeconomic applications.

There are a number of more efficient alternatives to the surplus lag approach. However,none to our knowledge achieves the same level of generality with respect to the range ofpossible models characterizing the forcing variables. In the I(1) case when the number ofcointegrating vectors and orders of integration are known, error correction models providea natural context for efficient causality testing (Toda and Phillips, 1994), although inpractice pre-tests are required. The fully modified VAR (Phillips, 1995; Kitamura andPhillips, 1997) is also efficient, but does not require a priori knowledge on the number of I(0)and unit root components. On the other hand, Elliott (1998) shows that common I(0)/I(1)cointegration procedures, such as FM regression, can perform poorly in the presence of nearunit roots, whereas we show that the surplus lag approach works equally well with local-to-unity regressors. A recently developing literature (Jansson and Moreira, 2004; Eliasz, 2005)provides optimal predictive regression tests with local-to-unity regressions. This is a verypromising development still in its earlier stages, yet so far, the use of this technique has beenrestricted primarily to near cointegration and predictive regressions models, which whileimportant in finance, are not yet sufficiently flexible for many macroeconomic applications.More importantly, it is not yet clear how easily this approach extends to other forms ofpersistence, such as the long-memory or unmodelled structural break regressors consideredhere.

A second limitation of the surplus lag approach is that, like many econometric esti-mators, it provides only correct large sample, not finite sample size. In a simple bivariatepredictive regression context, sign and sign rank tests are both non-parametric and exact infinite samples, thus providing a very attractive alternative (Campbell and Dufour, 1997).Unfortunately, they do not easily generalize to more complicated models. As recent workhas demonstrated (Dufour and Jouini, 2005), Monte-Carlo methods may also be employedto provide finite sample inference even in quite complicated parametric models. In princi-ple, the generality of this approach is limited only by its computational complexity. On theother hand, the econometrician must simulate from all possible parametric models for theforcing variable, a set which may become increasingly large once we consider the possibilityof long-memory and breaks. Moreover, the fact that exactly the same VARX based surpluslag test statistic works without modification in all the standard cases considered here hints

However, this surplus regression involves the inclusion of a first-stage residual rather than a surplus lagand thus, despite the similar name, it is fundamentally different from the approach discussed here.

4

at the possibility that the technique may work for a much wider class of processes. Thismay be an appealing aspect to those who suspect that the true mechanisms generating thepersistence in the data are likely more sophisticated than the econometric models typicallyemployed to capture them (Phillips, 2003, for example).

The remainder of the paper is organized as follows. Section 2 presents the model andexplains the basic intuition behind our results. Section 3 presents the large sample results,showing that the VARX based excess lag Wald statistic has the same null limiting Chi-squared distribution for a variety of forcing processes. Appendix A collects some technicalresults and the proofs of the main theorems are provided in Appendix B.

Finally, a word on notation. Lag orders are given by p and dimensions are given byk. Capital letters denote regression matrices. Variables with time subscripts are lowercase. A variable with a minus sign in the superscript (e.g. x−t ) includes all relevant lags.Autoregressive coefficients are denoted by π, coefficients on exogenously modelled variablesare given by ψ, and cointegrating coefficients are given by γ. The dependent variable isdenoted by yt, the variables about which Granger causality is tested are often writtenin terms of xt when referring to particular applications, but as z1t for the theoreticalresults, and any additional control variables are included as z2t. We define || · ||2 as theEuclidian norm ‖x‖2 =

√x′x, when applied to the vector x and as the induced matrix

norm max ‖Ax‖2 : x(n× 1), ‖x‖2 = 1 when applied to the m× n matrix A.

2 The model

Throughout this paper we consider three basic variables: by yt we denote a ky vector ofdependent variables, by z1t we denote a kz1 vector of exogenously modelled forcing variables,and z2t denotes an optional kz2 vector of endogenously modelled control variables.2

We consider tests of the null hypothesis that z1t does not Granger cause yt after con-trolling for z2t. Let Ft,y,z1,z2 define the information set generated by the past history ofall three variables (i.e. by yt−j, z1t−j, z2t−j for j ≥ 0). Likewise, let Ft,y,z2 denote theinformation set generated by the past history of the endogenous variables only, (i.e. byyt−j, z2t−j for j ≥ 0). Then we test the Granger noncausality condition

E [yt|Ft−1,y,z1,z2] = E [yt|Ft−1,y,z2] .

In practice this hypothesis is often tested by means of parameter restrictions on a jointVAR involving all three variables. However, our interest lies in cases in which the forcingvariable z1t displays persistent behavior, which may potentially be modelled in a variety ofdifferent ways. In this context, the pure VAR may be too restrictive. To maintain flexibility,the forcing variable z1t is instead exogenously modelled in the sense that we do not explicitlymodel its possible dependence on past yt−j and z2t−j. This allows us to consider a numberof alternative data generating processes for z1t on a case by case basis, including I(0), I(1),

2While inclusion of z2t is optional, the results of Dufour and Renault (1998) underline its potentialimportance.

5

local-to-unity, stationary and nonstationary long-memory processes, and processes withstructural breaks. We will defer discussion of the exact modelling assumptions on z1t toSection 3. However, it should be noted that although z1t is exogenously modelled, it is notassumed to be exogenous in a statistical sense. The innovations in the process for z1t maycorrelate with the past innovations to yt and z2t.

We maintain more traditional assumptions regarding the behavior of the endogenouslymodelled variables. Under the null hypothesis the true joint process for wt := [y′t, z

′2t]′ will

be assumed to be approximable by a VAR model, i.e. we assume that

wt =∞∑

j=1

πwjwt−j + εt (1)

where (εt)t∈Z is a martingale difference sequence (MDS), whose assumptions are detailedlater in Section 3.

In practice, the infinite order VAR in (1) is approximated using a finite autoregression.Our primary interest lies in the process for yt, which is approximated by

yt =

p∑j=1

(πyjyt−j + πz2jz2t−j) + εyt,p. (2)

In order to consider linear alternatives to Granger noncausality, we must also include lagsof z1t in the empirical specification. Thus, we estimate the VARX model3

yt =

p∑j=1

(ψyjyt−j + ψz2jz2t−j) +

pz1+1∑j=1

ψz1jz1t−j + εyt,p (3)

and test the joint parameter restriction ψz1j = 0 for 1 ≤ j ≤ pz1 using a standard Waldtest. Note that the estimated model includes a surplus lag of the forcing variable, z1t−pz1−1,which is not tested. The role of the surplus lag becomes more apparent after reparameter-izing (3) as

yt =

p∑j=1

(ψyjyt−j + ψz2jz2t−j) +

pz1∑j=1

ψz1j∆z1t−j + ψz1pz1+1z1t−pz1−1 + εyt,p, (4)

where ∆ denotes the first difference and ψz1j :=∑j

i=1 ψz1j. When z1t is integrated of orderless than 1.5 the parameters restricted under the null hypothesis (i.e. ψz1j for 1 ≤ j ≤ pz1)are expressed as the coefficients on the covariance stationary variables ∆z1t−j and maybe shown to follow a joint normal limiting distribution under suitable conditions. Forinstance, when z1t is I(1) it well known that they have a

√T convergence rate and joint

normal limiting distribution (Park and Phillips, 1989; Sims et al., 1990). If the integrationorder of z1t were to exceed 1.5, e.g. z1t ∼ I(2), a second surplus lag would be required.However, we do not consider this possibility.

3When the null hypothesis holds ψyj = πyj and ψz2j = πz2j .

6

The choice of pz1, the lag order of z1t, will generally influence the power of the test,but not its large sample size. In particular, the test has power against a more generalset of alternatives for large pz1, but has greater power against parsimonious alternatives,when the lag-length is small. Also, pz1 need not be set equal to p, the lag order of wt.This is another way in which the VARX provides additional flexibility. Even if modellingz1t requires many lags, e.g. if z1t has long-memory, it may still be possible to model wt

parsimoniously. Likewise, in the pure VAR framework we require a surplus lag of allvariables, whereas in the VARX we require only an extra lag of z1t. This may improveefficiency, particularly when the number of lags is small, but the dimension of wt is large.

In order to rewrite (3) in compact form define y−t := [y′t−1, . . . , y′t−p]

′, z−2t := [z′2t−1, . . . , z′2t−p]

′,ψy := [ψy1, . . . , ψyp], ψz2 := [ψz21, . . . , ψz2p], and z−1t := [z′1t−1, . . . , z

′1t−pz1

]′, so that εyt,p =yt − ψyy

−t − ψz2z

−2t. We define by x−1t = z−1t the regressors whose coefficients ψx1 are to be

tested. The remaining regressors, including the surplus lag, are then grouped together asx−2t := [(y−t )′, (z−2t)

′, (z1t−pz1−1)′]′. Thus, the estimated equation in (3) may be rewritten in

single equation form as

yt = ψx1x−1t + ψx2x

−2t + εyt,p (5)

where ψx1 ∈ Rky×pz1kz1 and ψx2 ∈ Rky×(kyp+kz2p+kz1) or in stacked form as

Y = X1ψ′x1 + X2ψ

′x2 + Ep, (6)

where Y =[

y−pmax+1, . . . , y−T]′, for pmax = maxp, pz1 + 1, and X1, X2, and Ep stack

x−1t, x−2t and ε−yt,p in identical fashion.The null hypothesis of no-Granger causality is then H0 : ψx1 = 0 and the alternative

hypothesis is HA : ψx1 6= 0. Defining X1.2 = X1 − X2(X′2X2)

−1X ′2X1, with rows denoted

by(x−1.2t

)′, as the residual from the projection of X1 on X2, we estimate the parameter of

interest, ψx1 by ψx1 = Y ′X1.2 (X ′1.2X1.2)

−1 and the variance of vec(ψx1

)is estimated in

the standard way by

Σx1 :=((X ′

1.2X1.2)−1 ⊗ Σε

),

for Σε := 1TE ′pEp, with the rows of Ep given by ε′yt,p for εyt,p := yt − ψx1x

−1t − ψx2x

−2t. The

standard Wald test for ψx1 = 0 then takes the form:

W := vec(ψx1)′Σ−1

x1 vec(ψx1) = vec(Y ′X1.2)′((X ′

1.2X1.2)−1 ⊗ Σ−1

ε

)vec(Y ′X1.2). (7)

In the section below, we show that under suitable regularity conditions W has a χ2 nulllimiting distribution for a wide variety of data generating processes under which z1t mayexhibit persistent behavior.

3 Large sample robustness results

In this section we show that the Wald statistic W for a test of Granger noncausality inthe surplus lag VARX obeys a standard Chi-squared null limiting distribution under a

7

variety of assumptions regarding the nature of the persistence in z1t. We begin by statingthe assumptions on the innovation process for the endogenous variables wt specified in (1):4

Assumption N: The noise (εt)t∈Z is a strictly stationary ergodic martingale differencesequence adapted to the increasing sequence of sigma algebras Ft generated by εt, εt−1, . . ..Further assume that Eεtε

′t|Ft−1 = Eεtε

′t = Σ > 0 and that Eεt,aεt,bεt,c|Ft−1 = ωa,b,c

(constant) where εt,a denotes the a-th coordinate of the vector εt. Finally finite fourth mo-ments are assumed: Eε4

t,i < ∞.

Many of the results presented below may be proved under more general assumptionson the innovations. In particular, finite fourth moments are often unnecessary. However,the above assumptions are standard in VAR models (Saikkonen and Lutkepohl, 1996, usesimilar but stronger assumptions) and provide a single set of assumptions that are sufficientfor most our results. A second restriction is the assumed conditional homoskedasticity ofthe innovations. If this restriction is dropped the asymptotic distributions change and theestimators have to be adapted to account for possible heteroskedasticity. We will not gointo details in this respect.

Under the null hypothesis we have that yt = εt,p + ψx2x−2t and hence Y ′X1.2 = E ′pX1.2.

This motivates the following high level assumptions where Γ1.2 := T−1X ′1.2X1.2 is used:

Assumption HL: Let p = p(T ) be a function of the sample size T and pz1 be a fixedinteger. Then assume that the following conditions have been verified:(i) Σε → Σ in probability.(ii) Γ1.2 → Γ1.2 in probability for some matrix Γ1.2 ∈ Rkz1pz1×kz1pz1 , Γ1.2 > 0.

(iii) p(T ) is such that T−1/2vec(∑T

t=p+1 εt,p(x−1.2t)

′)d→ Zε where Zε ∈ Rkz1ky is normally

distributed with mean zero and variance Γ1.2 ⊗ Σ.

From these high level assumptions the standard asymptotics for the Wald test areimmediate from (7):

Theorem 1 Let Assumption HL hold for εt,p = yt − ψx2x−2t − ψx1x

−1t where the integer p

is chosen as in HL (iii). Then under the null hypothesis H0 : ψx1 = 0 the Wald statisticW converges in distribution to a χ2 distributed random variable with kypz1kz1 degrees offreedom where pz1kz1 equals the dimension of x−1t.

Of course the high level assumptions are not directly applicable. In the following it will beshown that in a multitude of circumstances the high level conditions are fulfilled.

4Note that Ft−1,y,z2 = Ft−1 under the null hypothesis.

8

3.1 Stationary VARX(p,q) processes

Let the true data generating process be a V ARX(p, q) of the form

yt =

p∑j=1

πyjyt−j +

q∑j=1

ψz2jz2t−j + εyt

where z2t is an exogenous sequence. Further assume that the joint process zt = [z′1t, z′2t]′

is a stationary ergodic process with finite second moments independent of the noise εyt.Assume that the polynomial πy(z) = I−∑p

j=1 πyjzj has all its roots outside the unit circle.

Then, assuming that the process (yt)t∈Z is the unique stationary solution, the sample secondmoments converge to the true population moments (Hannan and Deistler, 1988, Theorem4.1.1.). This implies that Γp := [X1, X2]

′[X1, X2]/T → Γp ≥ 0 and therefore under the

assumption that Γp > 0 is positive definite it follows that Γ1.2 → Γ1.2 > 0 (HL (ii)). Also

since Σε → Σ due to the consistency of the parameter estimation it is straightforwardto verify (HL (i)). Finally HL (iii) can be shown using standard central limit theoremssuch as Hannan and Deistler (1988, Lemma 4.3.4.). Hence the high level assumptions HLhold and Theorem 1 can be applied. Of course this situation is standard and involves theunrealistic assumption that the true system has finite and known lag lengths. This willseldom be the case.

3.2 Stationary V ARX(∞,∞) processes

In many cases the data generating process can be approximated reasonably well usinga VARX(p,q) specification. In fact, a large area of research in time series started withthe seminal papers of Berk (1974) and Lewis and Reinsel (1985) dealing with finite orderautoregressive approximations to processes of this kind. Under the null we will assumethat the joint process wt := [y′t, z

′2t]′ admits a V AR(∞) representation of the form given

in (1) where εt =[

ε′yt ε′z2t

]′fulfills Assumption N. The noise assumptions are in line

with Theorem 7.4.8. of Hannan and Deistler (1988) which extends Theorem 4 of Lewisand Reinsel (1985) where (εt)t∈Z was assumed to be i.i.d. Many results building on Lewisand Reinsel (1985) also use independent noise (Lutkepohl and Saikkonen, 1999; Lutkepohland Saikkonen, 1997; Dolado and Lutkepohl, 1996).

As described in Section 2, the test is based on the auxiliary model given in (3), whichfor p = ∞ nests the true model. Here H0 : ψz1j = 0, j = 1, . . . , pz1 specifies the nullhypothesis while a wide range of alternatives can be included for large pz1. In the followingwe will only consider the case that pz1 is some pre-specified integer rather than the moregeneral case, in which pz1 → ∞ as a function of the sample size. It should be noted,however, that pz1 → ∞ in some situations could also be dealt with leading to a morecomplicated asymptotic theory (Saikkonen and Lutkepohl, 1996). However, it is not clearhow the integer pz1 in such a setting could be chosen. Additionally the assumptions on z1t

in order for such results to hold are more restrictive. We will not go into details in thisrespect.

9

We formalize the notion of reasonable approximability as is usually done in the litera-ture (Lewis and Reinsel, 1985; Lutkepohl and Saikkonen, 1997):

Assumption P1:(i) The noise (εt)t∈Z fulfills Assumption N.(ii)

∑∞j=1 ‖πw,j‖2 < ∞. For πw(z) := I −∑∞

j=1 πw,jzj it holds that det πw(z) 6= 0, |z| ≤ 1.

(iii) The integer p tends to infinity as a function of the sample size such that T 1/2∑∞

j=p+1 ‖πw,j‖2 →0 and p3/T → 0.(iv) The process (z1t)t∈Z is generated according to the equation

z1t = νt +∞∑

j=1

θjνt−j +∞∑

j=1

φjεt−j

where (νt)t∈Z fulfills Assumption N with Eνtν′t > 0 and is independent of the process (εt)t∈Z.

Here∑∞

j=0 ‖[θj, φj]‖2 < ∞ is assumed.

Except for the assumptions on (εt)t∈Z and (z1t)t∈Z these are exactly the same assump-tions as given in Lewis and Reinsel (1985, Theorem 2, p. 398). One main difference inour setup is that the process (z1t)t∈Z is not modelled endogenously. In other words theVAR framework of Lewis and Reinsel (1985) is exchanged with VARX modelling here. Animportant advantage of such an approach is that it allows us to vary the integer pz1 freely,i.e. the choice of pz1 is not tied to the approximation properties. Also we do not needto assume that (z1t)t∈Z has a V AR(∞) representation. Hence the assumptions includeoverdifferenced processes which are excluded in Lewis and Reinsel (1985). Furthermore z1t

may be a function of lagged yt−j and z2t−j, j ∈ N.The assumptions on (z1t)t∈Z nevertheless exclude a number of interesting cases: Due to

the summability assumptions on the coefficients θj and φj the process (z1t)t∈Z is stationary.Integrated processes will be dealt with in later sections. Also certain long memory processesare excluded (see below). These are the topic of section 3.5. The assumption that (z1t)t∈Zis linearly filtered white noise might also be viewed as somewhat restrictive. It should benoted that these assumptions are by no means necessary.

The following result is an easy consequence of the proof of Theorem 3, p. 399, of Lewisand Reinsel (1985):

Theorem 2 Let x−2t := [y′t−1, . . . , y′t−p, z

′2t−1, . . . , z

′2t−p, z

′1t−pz1−1]

′ and x−1t := [z′1t−1, . . . , z′1t−pz1

]′.Then Assumption P1 implies Assumption HL.

The theorem shows that in situations where the true process is an infinite VARX model thetest statistic for the Wald test can be used as if the true process was a V ARX(p, q). Themain advantage in comparison to previously obtained results, such as those in Dolado andLutkepohl (1996), is that the regressors z1t are not modelled endogenously. This leads to amore parsimonious model in the case where yt can be modelled using only few lags relativeto a full VAR model also containing z1t. From the proof of the theorem it is obvious that

10

the result also holds if the variable z1t−pz1−1 is not included in x−2t. The inclusion of thisadditional term reduces the power of the tests where the decrease in power depends onthe characteristics of z1t and the lag length pz1. Dolado and Lutkepohl (1996) contains adiscussion on these issues. This power loss is the downside to the robustness properties ofthe test provided in the following sections.

3.3 Nonstationary V ARX(∞,∞)

In the last subsection we dealt with stationary processes. One of the main motivationsbehind the surplus lag approach was to obtain results without unit root and cointegrationpre-testing in cases where components of yt and/or zt might be (co)integrated. This isallowed for in the following assumption.

Assumption P2 :(i) There exists a nonsingular matrix Γ = [γ⊥, γ], γ ∈ R(ky+kz2)×n, 0 ≤ n ≤ ky + kz2 suchthat the process (vt)t∈N obtained as (using a random initial value w0 that has finite fourthmoments and is independent of εt, ηt, t > 0)

vt :=

[γ′⊥(wt − wt−1)

γ′wt

]

has an autoregressive representation as vt =∑∞

j=1 πv,jvt−j + εt where (εt)t∈Z fulfills As-sumption N.(ii) For πv(z) := I −∑∞

j=1 πv,jzj we assume det πv(z) 6= 0, |z| ≤ 1.

(iii) Summability of power series:∑∞

j=1 j‖πv,j‖2 < ∞.

(iv) The integer p is chosen as a function of the sample size such that p3/T → 0 andT 1/2

∑∞j=p+1 ‖πv,j‖2 → 0.

(v) The process (z1t − z1t−1)t∈N for random z10 (finite fourth moment, independent ofεt, ηt, t > 0) fulfills Assumption P1(iv) where additionally

∑∞j=0 j‖[θj, φj]‖2 < ∞ holds.

Here γ denotes the cointegrating relations as (γ′wt)t∈N is assumed to be stationary.Note that these assumptions also include many processes fulfilling Assumption P1. Theylead to the following Theorem:


′2t−1, . . . , z

′2t−p, z

′1t−pz1−1]

′ and x−1t := [z′1t−1, . . . , z′1t−pz1


The theorem extends the results of Theorem 5 of Saikkonen and Lutkepohl (1996) slightlyby using the VARX rather than the VAR framework with the same advantages as in thestationary case: The integer pz1 is not tied to the approximation quality and more generalprocesses z1t are allowed for. An appealing and very useful feature of these results is that thesame test statistic can be used to perform the Granger-noncausality test in many differentframeworks: All processes yt, z1t and z2t can be stationary, integrated or cointegrated and

11

we do not use any information on possible cointegrating relations. This robustness propertyfollows from the fact that the properties of εt,p are identical across all cases and that in allcases there exists a sequence of matrices τp such that x−1t − τpx

−2t is stationary. The fact

that this is sufficient for standard asymptotics to hold is also acknowledged in Theorem 1of Toda and Phillips (1994) and footnote 3 of Sims et al. (1990).

Of course this robustness comes at a price: If z1t is stationary then one lag of z1t isunnecessarily omitted from the test restriction, leading to a loss of power with respect totests under correct specification as noted by Dolado and Lutkepohl (1996). From the proofof Theorem 3 it is obvious that for stationary z1t the variable z1t−pz1−1 can be omitted in

the definition of x−2t and the asymptotic distribution of W is unchanged but the resultingtest has higher power since it is based on a smaller model. If z1t is in fact integrated but isconsidered to be stationary such that no lag of z1t is included in x−2t then the asymptoticdistribution is incorrect as is the size of the test. In situations where z1t clearly isstationary rather than integrated hence the omission of z1t−pz1−1 seems to be the betterprocedure. If this decision cannot be made with certainty then there is an argument forusing the robust version of the test given the fact that for large pz1 the expected loss inpower is small whereas the effect of misspecification can be substantial.

If additionally the cointegrating rank of the joint process [y′t, z′2t, z

′1t]′ is known then

superior tests can exploit this knowledge (Toda and Phillips, 1994). In this situation thepower loss can be substantial due to the T -consistency of the estimators of the cointegratingvectors as compared to the

√T consistency of the excess lag estimators. Note however,

that in any case there is a risk of misspecification. The proposed tests given in thispaper sacrifice power in special cases for obtaining robust inference under a wide varietyof possible assumptions on the data generating process.

3.4 Local-to-unity processes

This subsection demonstrates that in fact the Wald test based on W is also robust underlocal-to-unity assumptions. Since the test has the same asymptotic distribution irrespectiveof whether the various processes are stationary or integrated it may seem to come atno surprise that the same holds true for local-to-unity processes. In fact the results ofSaikkonen and Lutkepohl (1996) can be easily extended to the local-to-unity framework of(Phillips, 1987; Chan, 1988).

The basic idea behind local-to-unity processes is to let the largest pole depend onthe sample size: While for an integrated process (z1t)t∈N the first difference z1t − z1t−1

(for suitable z10) is stationary the main assumption in a local-to-unity framework is thatz1t − aT z1t−1 is stationary (for suitable z10) where aT = 1 + c/T . Here typically c ≤ 0in order to exclude explosive processes. For c = 0 the corresponding process is integratedwhile for c < 0 it is stationary. These models bridge the gap between asymptotic theoryfor integrated processes and stationary processes. For example, since the evidence on unitroots is not always clear-cut in macroeconomics and financial time series and since near-unit behavior is often poorly approximated by both stationary and unit root asymptotics,local-to-unity models are often employed to robustify inference in this context. We will

12

use the following assumptions:

Assumption P3 :(i) There exists a nonsingular matrix Γ = [γ⊥, γ], γ ∈ R(ky+kz2)×n, 0 ≤ n ≤ ky + kz2 suchthat the process (vt)t∈N obtained as (for suitable initial value w0)

vt :=

[γ′⊥(wt − aT wt−1)

γ′wt

]

has an autoregressive representation as vt =∑∞

j=1 πv,jvt−j + εt where (εt)t∈Z fulfills As-sumption N. Here aT = 1 + c/T, c ≤ 0.(ii) For πv(z) := I −∑∞

j=1 πv,jzj we assume det πv(z) 6= 0, |z| ≤ 1.

(iii) Summability of power series:∑∞

j=1 j‖πv,j‖2 < ∞.

(iv) The integer p increases with the sample size such that p3/T → 0 and T 1/2∑∞

j=p+1 ‖πv,j‖2 →0.(v) The process (z1t−aT z1t−1)t∈N for some z10 fulfills Assumption P1(iv) where additionally∑∞

j=0 j‖[θj, φj]‖2 < ∞ holds.

Again these assumptions are sufficient for the asymptotics to hold:


′2t−1, . . . , z

′2t−p, z

′1t−pz1−1]

′ and x−1t := [z′1t−1, . . . , z′1t−pz1


It is obvious that Assumption P3 implies Assumption P2 with c = 0. The extension ofthis robustness result for the Wald test from the standard I(1) framework to the local-to-unity framework is to the best of our knowledge new although not surprising. Again themain technical reason is that there exists a sequence of matrices τp such that x−1t − τpx

−2t

is stationary. Under Assumption P3 each of the processes yt, z1t and z2t can be integrated,stationary or near-integrated/local-to-unity. Cointegrating relations may exist. The samediscussion as for integrated processes also applies here: Using the VARX framework mightbe advantageous in situations where z1t requires high lag lengths in order to capture thedynamics whereas the lag length needed in the VARX approximation of yt are small. Ifz1t indeed is stationary rather than local-to-unity then omitting the factor z1t−pz1−1 in thedefinition of x−2t leads to a more powerful test at the drawback of loosing robustness.

3.5 Long-memory processes

Assumption P1 for stationary VARX(∞,∞) processes imposed summability assumptionson the impulse response sequence of the exogenous process (z1t)t∈N. These assumptionsexclude long-memory processes such as fractionally integrated processes with 0 < d < 0.5.In this section we will provide less restrictive assumptions including such processes:

Assumption P4 :(i) Assumption P1, (i) - (iii) hold. Additionally (εt)t∈Z is assumed to be i.i.d.

13

(ii) The process (z1t)t∈Z is generated according to the equation

z1t = νt +∞∑

j=1

θjνt−j +∞∑

j=1

φjεt−j

where (νt)t∈Z fulfills Assumption N and is independent of the process (εt)t∈Z. Here ‖[θj, φj]‖2 ≤cjd−1 for some constant 0 < c < ∞ and −0.5 < d < 0.5 is assumed.(iii) p is chosen such that p = o(T 1−2d) and Assumption P1(iii) is fulfilled for this choiceof p.

These assumptions on the exogenous inputs include many long-memory processes andin particular fractionally integrated processes (see Baillie (1996) for a recent survey). Infact the assumptions are much weaker than this including sums of fractionally integratedprocesses. Since the squared coefficients for d ≈ 0.5, d ≤ 0.5 are just summable, the con-ditions on the impulse response sequences are close to minimal. On the other hand, it hasbeen found necessary to introduce an additional condition on p, the number of lags includedin the approximation for 1/3 < d < 1/2 since in this case the estimates of the covariancesequence, including the cross covariance between lags of yt and z2t, are extremely unreli-able. In fact, their covariances are of order O(T 4d−2) and hence arbitrarily small fractionsof the sample size are obtained as convergence orders for values close to d = 0.5. This inturn limits the range of admitted processes via the assumption T 1/2

∑∞j=p+1 ‖πw,j‖2 → 0.

In some situations this is not a severe limitation: If the joint process wt is a VARMAprocess then any rate of the form p = T c will fulfill the approximation restriction andchoosing c < 1− 2d also the condition on p can be easily met.

In this setting the advantage of the VARX framework is most clearly visible: If onewas to model the process [y′t, z

′1t, z

′2t]′ using the VAR framework then overly large orders

are needed in order to obtain a small approximation error εyt,p − εyt due to the slowdecrease of the coefficients in the VAR(∞) representation of the true process. In theVARX framework this difficulty does not arise. Moreover, overdifferenced processes canalso be used as regressors since they do not need to be approximated using autoregressiveterms.

Again it can be shown that Assumption HL holds:


′2t−1, . . . , z

′2t−p, z

′1t−pz1−1]

′ and x−1t := [z′1t−1, . . . , z′1t−pz1


This result for stationary long-memory does not depend on the surplus lag and also holdsif z1t−pz1−1 is omitted in the definition of x−2t. Again this results in a power-robustnesstradeoff.

It has been observed in the literature that the estimates of d for fractionally integratedprocesses often are close to d = 0.5 (cf. the references given in Baillie, 1996, section 6,p. 43). The last theorem showed that the Wald test is robust with respect to fractionalintegration for −0.5 < d < 0.5. Earlier on robustness with respect to integration has

14

been given in Theorem 3. In fact it is not hard to combine these two results to concluderobustness also under the following set of assumptions:Assumption P5 :(i) Assumption P1, (i) - (iii) hold. Additionally (εt)t∈Z is assumed to be i.i.d.(ii) There exists full column rank matrices β ∈ Rkz1×(kz1−cz1) and β⊥ ∈ Rkz1×cz1 , β′β⊥ = 0such that for β′⊥z10 = 0 [

β′⊥(z1t − z1t−1)β′z1t

]= vt, t ∈ N

where

vt =∞∑

j=0

Γ(j + d)

Γ(d)Γ(j + 1)νt−j.

(νt)t∈Z is i.i.d. with zero mean and independent of εt, with Eνtν′t > 0 and finite fourth

moments. Here −0.5 < d < 0.5.(iii) p is chosen such that

p =

o(Tmin(1/3,1−2d)) , 0.25 < d < 0.5,o(T 1/3), , 0 < d ≤ 0.25,

o(T 1/3(1+2d)), d ≤ 0

Furthermore T 1/2∑∞

j=1 ‖πw,j‖2 → 0 for this choice of p.

These assumptions are far from being minimal. In particular the assumptions on vt

appear to be overly strong. Again we stress that we are not interested in the most generalsetup but only in providing cases in which the standard asymptotics for the Wald test hold:


′2t−1, . . . , z

′2t−p, z

′1t−pz1−1]

′ and x−1t := [z′1t−1, . . . , z′1t−pz1


The theorem also shows that in the case that the exogenous regressors are fractionallyintegrated of order 0.5 < d < 1.5 the asymptotics of the Wald test for Granger-noncausalityare standard. With respect to the previously used assumptions the restrictive assumptionson the increase of p as a function of the sample size is striking: Assumption P4 showedproblems for the fractional integration parameter d close to 0.5 due to the bad estimatesof the covariance sequence. Assumption P5 contains additionally a condition showingproblems at d close to −0.5 and hence the integrated process having a d close to butbigger than 0.5. The reason for problems in that case is the slow divergence rate of thenonstationary components. The borderline case d = 0.5 has not been analyzed.

3.6 Stationary processes with structural breaks

We next expand the stationary VARX(∞,∞) process to allow for the occurrence of a fixednumber (J) of historical breaks in the intercept of the exogenously modelled variable z1t,which occur at fixed fractions of the sample size. Although the true data generating process

15

includes breaks, we do not assume that any breaks are included in the estimated model.In particular, we wish to avoid any first stage inference regarding the existence of and/ornumber of breaks. Breaks in the process for the endogenously modelled variables wt wouldhave to be explicitly modelled and thus are not considered. Breaks in the coefficients ψx1

governing the impact of x1t on yt are also excluded under the null hypothesis, under whichthese coefficients are fixed at zero.

Assumption P6 :(i) Assumption P1, (i) - (iii) hold. Additionally (εt)t∈Z is assumed to be i.i.d.(ii) Let J be a fixed integer denoting the number of breaks. Defining ω0 := 0 and letting

ωj j = 1, . . . , J denote the fraction of the sample spent in regime j between the (j − 1)th

break and the jth break, with∑J

j=1 ωj = 1, the process (z1t)t∈Z is generated according tothe equation

z1t =J∑

j=1

φjI

(1 +

⌊j−1∑

k=0

ωkT

⌋≤ t ≤

⌊j∑

k=1

ωkT

⌋)+ νt +

∞∑j=1

θjνt−j +∞∑

j=1

φjεt−j

where bxc denotes the greatest integer less than x, (νt)t∈Z fulfills Assumption N withEνtν

′t > 0 and is independent of the process (εt)t∈Z. Here

∑∞j=0 ‖[θj, φj]‖ < ∞ is as-

sumed.

While we do not assume that zt is explicitely modelled in the empirical analysis, theestimated VARX must either include an intercept, or at a minimum, z1t must be de-meaned prior to estimation. It will be convenient to work with deviations from means. Let

x−t :=[ (

x−1t

)′ (x−2t

)′ ]′denote the full set of regressors and define

µ(j) := E[x−t I (t ∈ Sj)

]and µ :=

J∑j=1

ωjµ(j)

as the mean within regime j and the average mean across regimes, respectively. Here we

define Sj =

pz1 + 1 + b∑j−1k=0 ωkT c, . . . , b

∑jk=1 ωkT c

as the data range that would result

if restricted to regime j only. Note that since breaks are allowed only in z1t, µ(j)− µ has

the form[

(µ1(j)− µ1)′ : 0 : µ′pz1+1

− µ′pz1+1

]′where the three partitions have dimension

kz1 (pz1 + 1) , p(ky + kz2) and kz1 respectively, and where pz1 is fixed.

Let W (x−) denote the value of the wald statistic introduced earlier when the originaldata x−t is replaced by x−t − x− To keep the proofs simple, we first show that the infeasible

estimator W (µ) has the correct large sample distribution.


′2t−1, . . . , z

′2t−p, z

′1t−pz1−1]

′ and x−1t := [z′1t−1, . . . , z′1t−pz1

]′.Then Assumption P6 implies Assumption HL is satisfied for the infeasible Wald statisticW (µ) .

16

Then, since ‖z − µ‖2 →p 0 and W (·) is continuous, the result is easily extended to the

feasible statistic W (x−) in the following corollary, the proof of which is omitted.

Corollary Assume that P6 holds. Then under the null hypothesis H0 : ψx1 = 0 the Waldstatistic W (x−) converges in distribution to a χ2 distributed random variable with kypz1kz1

degrees of freedom where pz1kz1 equals the dimension of x−1t.

4 Conclusion

Employing a surplus lag in VAR based tests has been known to provide for inferencewhich is robust to possible I(1) nonstationarity without necessitating unit root or coin-tegration pre-tests (Toda and Yamamoto, 1995; Dolado and Lutkepohl, 1996; Saikkonenand Lutkepohl, 1996). This provides for robust inference at some cost in terms of effi-ciency. In this paper we have shown, in the context of Granger causality testing, thatby applying the surplus lag to a VARX, in which the causing variables are exogenouslymodelled, one can significantly enhance the robustness of the surplus lag approach withoutnecessarily incurring further costs in terms of its efficiency. In fact, we have shown thatthe same χ2 test statistic and critical values can be used to test Granger causality undera variety of possible data generating processes that may characterize the persistence inthe forcing variable. These include the I(0), I(1) and cointegrated models considered ear-lier in the VAR context, as well as stationary and nonstationary long-memory, structuralbreaks, and local-to-unity models. In keeping with the earlier literature, no estimates ofthe long-memory parameter or first-stage confidence intervals on the local-to-unity param-eter are required and the structural breaks need not be tested for or explicitely modelled.The VARX framework turns out to be particularly useful in allowing for long-memoryand unmodelled structural breaks, which are not easily incorporated into a pure VAR.However, it is only in the context of the surplus lag that nonstationary processes, such asnon-stationary fractionally integrated processes, can be accomodated without altering thelimit distribution of the test statistic.

References

Amihud, Yakov and Clifford M. Hurvich (2004). Predictive regressions: A reduced-biasestimation method. Journal of Financial and Quantitative Analysis 39, 813–841.

Baillie, Richard T. (1996). Long memory processes and fractional integration in economet-rics. Journal of Econometrics 73, 5–59.

Baillie, Richard T. and Tim Bollerslev (2000). The forward premium anomaly is not asbad as you think. Journal of International Money and Finance 19, 471–488.

17

Berk, Kenneth N. (1974). Consistent autoregressive spectral estimates. Ann. Statist.2, 489–502.

Campbell, Bryan and Jean-Marie Dufour (1997). Exact nonparametric tests of orthogo-nality and random walk in the presence of a drift parameter. International EconomicReview 38(1), 151–173.

Cavanagh, Christopher L., Graham Elliott and James H. Stock (1995). Inference in modelswith nearly integrated regressors. Econometric Theory 11, 1131–1147.

Chan, Ngai Hang (1988). The parameter inference for nearly nonstationary time series.Journal of the American Statistical Association 83(403), 857–862.

Chan, Ngai Hang and Wilfredo Palma (1998). State space modeling of long-memory pro-cesses. The Annals of Statistics 26, 719–740.

Choi, In (1993). Asymptotic normality of the least-squares estiamtes for higher order autor-gressive integrated processes with some applications. Econometric Theory 9, 263–282.

Christiano, Lawrence, Martin Eichenbaum and Robert Vigfusson (2003). What happensafter a technology shock. Mimeo, Northwestern University.

Davidson, James (1994). Stochastic Limit Theory. Oxford University Press.

Davidson, James (2004). Convergence to stochastic integrals with fractionally integratedprocesses: Theory, and applications to cointegrating regression. Technical report.Cardiff University.

Davidson, James and Robert de Jong (2000). The functional central limit theorem andweak convergence to stochastic integrals II. Econometric Theory 16, 643–666.

Diebold, Francis X. and Atsushi Inoue (2001). Long memory and regime switching. Journalof Econometrics 105, 131–159.

Dolado, Juan and Francesc Marmol (2004). Asymptotic inference results for multivariatelong-memory processes. Econometrics Journal 7, 168–190.

Dolado, Juan and Helmut Lutkepohl (1996). Making wald tests work for cointegrated VARsystems. Econometric Reviews 15, 369–386.

Dufour, Jean-Marie and Eric Renault (1998). Short run and long run causality in timeseries: Theory. Econometrica 66, 1099–1125.

Dufour, Jean-Marie and Tarek Jouini (2005). Finite-sample simulation-based inference inVAR models with applications to order selection and causality testing. Journal ofEconometrics. forthcoming.

18

Eliasz, Piotr (2005). Optimal median unbiased estimation of coefficients on highly persis-tent regressors. Job market paper, Princeton University.

Elliott, Graham (1998). On the robustness of cointegration methods when regressors havealmost unit roots. Econometrica 66(1), 149–158.

Faust, Jon (1996). Near observational equivalence and theoretical size problems with unitroot tests. Econometric Theory 12, 724–731.

Faust, Jon (1999). Conventional confidence intervals for points on the spectrum have con-fidence level zero. Econometrica 67, 629–37.

Gali, Jordi (1999). Technology, employment, and the business cycle: Do technology shocksexplain aggregrate fluctuations?”. American Economic Review 89(1), 249–271.

Gourieroux, Christian and Joan Jasiak (2001). Memory and infrequent breaks. EconomicsLetters 70, 29–41.

Granger, Clive W.J. and Namwon Hyung (2004). Occasional structural breaks and longmemory with an application to the S&P 500 absolute stock returns. Journal of Em-pirical Finance 11(3), 399–421.

Hall, Peter and Chris C. Heyde (1980). Martingale Limit Theory and its Application.Academic Press.

Hannan, Edward J. (1976). The asymptotic dibribution of serial covariances. Annals ofStatistics 4(2), 396–399.

Hannan, Edward J. and Manfred Deistler (1988). The Statistical Theory of Linear Systems.John Wiley. New York.

Hosking, Jonathan R. M. (1996). Asymptotic distributions of the sample mean, autoco-variances, and autocorrelations of long-memory time series. Journal of Econometrics73, 261–284.

Jansson, Michael and Marcelo J. Moreira (2004). Optimal inference in regression modelswith nearly integrated regressors. Econometrica. forthcoming.

Kitamura, Yuichi and Peter C. B. Phillips (1997). Fully modified IV, GIVE and GMMestimation with possibly nonstationary regressors and instruments. Journal of Econo-metrics 80, 85–123.

Lanne, Markku (2002). Testing the predictability of stock returns. The Review of Eco-nomics and Statistics 84(3), 407–415.

Lewellen, Jonathan (2004). Predicting returns with financial ratios. Journal of FinancialEconomics 74(2), 209–235.

19

Lewis, Richard and Gregory C. Reinsel (1985). Prediction of multivariate time series byautoregressive model fitting. Journal of Multivariate Analysis 16, 393–411.

Lutkepohl, Helmut (1996). Handbook of Matrices. John Wiley and Sons.

Lutkepohl, Helmut and Pentti Saikkonen (1997). Impulse response analysis in infinite ordercointegrated vector autoregressive processes. Journal of Econometrics 81, 127–157.

Lutkepohl, Helmut and Pentti Saikkonen (1999). Order Selection in Testing for the Coin-tegrating Rank of a VAR Process. Oxford University Press. Engle, White (eds.):Cointegration, Causality and Forecasting - A Festschrift in honour of Clive W. J.Granger.

Maynard, Alex and Peter C. B. Phillips (2001). Rethinking an old empirical puzzle: Econo-metric evidence on the forward discount anomaly. Journal of Applied Econometrics16(6), 671–708.

Nabeya, Seiji and Bent E. Sørensen (1994). Asymptotic distributions of the least-squaresestimators and test statistics in the near unit root model with non-zero initial valueand local drift and trend. Econometric Theory 10, 937–966.

Park, Joon Y. and Peter C.B. Phillips (1989). Statistical inference in regressions withintegrated processes: Part II. Econometric Theory 5, 95–131.

Phillips, Peter C. B. (1987). Towards a unified asymptotic theory for autoregression.Biometrika 74(3), 535–547.

Phillips, Peter C. B. (1995). Fully modified least squares and vector autoregression. Econo-metrica 63(5), 1023–1078.

Phillips, Peter C. B. (2003). Laws and limits of econometrics. Economic Journal 113(3), pp.C26–C52.

Phillips, Peter C. B. and Victor Solo (1992). Asymptotics for linear processes. Annals ofStatistics 20(2), 971–1001.

Saikkonen, Pentti and Helmut Lutkepohl (1996). Infinite-order cointegrated vector autore-gressive processes. Econometric Theory 12, 814–844.

Sims, Christopher A., James H. Stock and Mark W. Watson (1990). Inference in lineartime series models with some unit roots. Econometrica 58, 113–144.

Stambaugh, Robert F. (1999). Predictive regressions. Journal of Financial Economics54, 375–421.

Toda, Hiro and Peter C.B. Phillips (1994). Vector autoregression and causality: a theoret-ical overview and simulation study. Econometric Reviews 13, 259–285.

20

Toda, Hiro Y. and Taku Yamamoto (1995). Statistical inference in vector autoregressionswith possibly integrated processes. Journal of Econometrics 66, 225–250.

Torous, Walter, Rossen Valkanov and Shu Yan (2005). On predicting stock returns withnearly integrated explanatory variables. Journal of Business 78(1), 937–966.

A Technical lemmas

Lemma 1 Let wt =∑∞

j=0 φw,jεt−j where (εt)t∈Z is an i.i.d. sequence of random variables

having zero mean and finite fourth moments. Let γj := T−1∑T

t=1+p wtw′t−j and γj :=

Ewtw′t−j. Assume that φw,j = O(jd−1) where −0.5 < d < 0.5. Then:

Evec(γj − Eγj)vec(γk − Eγk)′ =

O(T 4d−2) , for 0.25 < d < 0.5O(T−1 log T ) , d = 0.25,

O(T−1) , −0.5 < d < 0.25

All O(.) terms hold uniformly in 1 ≤ j, k ≤ p and 1 ≤ p ≤ T .

Proof: The proof uses the results of Theorem 1, 3 and 5 of Hosking (1996). The maindifference is that we are interested in expressions uniformly in the lag whereas Hosking(1996) deals with fixed lags.First note that using Ω := Eεtε

′t we have for some constant 0 < K < ∞ not depending on

j ∈ Z

‖γj‖2 = ‖∞∑i=j

φw,iΩφ′w,i+j‖2 ≤ C

∞∑i=j

‖φw,i‖2‖φw,i+j‖2 ≤ CC2k

∞∑i=j

i2d−2 ≤ Kj2d−1

since ‖Ω‖2 < C, ‖φw,i‖2 ≤ Ckid−1 for some Ck < ∞, i < i + j and Lemma 3.2 of Chan and

Palma (1998). The vector case is only notationally more complex and hence we only showthe result for the case of scalar wt.Then we obtain

Eγj γk = T−2

T∑t,s=1+p

Ewt+jwtwsws+k.

Note that Ewtwswrw0 = γt−sγr + γt−rγs + γtγs−r + κ4(t, s, r) for

κ4(t, s, r) :=∞∑

a=−∞φw,a+tφw,a+sφw,a+rφw,a(Eε4

t − (Eε2t )

2)

where for notational simplicity φw,a = 0, a < 0 is used. It follows that Ew40 ≤ M4 < ∞

since ‖φ4w,a‖2 = O(a4d−4) = o(a−2). Next

T−2

T∑t,s=1+p

Ewt+jwtwsws+k = T−2

T∑t,s=1+p

γjγk+γt−s+jγt−s−k+γt+j−s−kγt−s+κ4(t−s, t−s+j, k).

(8)

21

The first term here is equal to (T − p)2T−2γjγk = EγjEγk independent of the value of d.The derivation of the bounds for the remaining terms in (8) will be done separately forthe different cases for d. Thus let 0.25 < d < 0.5 for the moment. The last term in (8)is majorized by the first term in (A.2) of Hosking (1996) and hence can be bounded byM4εT

−1γjγk where M4ε is the fourth cumulant of εt. In fact this holds for any d < 0.5.The two middle terms can be dealt with using ‖γl‖2 ≤ Kl2d−1 as shown above:

∣∣∣∣∣T−2

T∑t,s=1+p

γt−s+jγt−s−k

∣∣∣∣∣ ≤ T−1

T−1−p∑

l=1−T+p

|γl+jγl−k|T − |l| − p

T

≤ T−1

(T−1−p∑

l=1−T+p

γ2l+j

)1/2 (T−1−p∑

l=1−T+p

γ2l−k

)1/2

and for j ≥ 0 we have

T−1−p∑

l=1−T+p

γ2l+j ≤

T−1+2j−p∑

l=1−T+p

γ2l+j =

T−1+j−p∑

l=1−T−j+p

γ2l = O((T − p + j)4d−1) = O(T 4d−1)

again using Lemma 3.2. (i) of Chan and Palma (1998). This holds for d 6= 0.25. Ford = 0.25 the same argument shows the bound O(log T ) (cf. Hosking, 1996, top of p. 278).For j ≤ 0 the analogous argument can be used extending the sum to the negative integers.Combining these expressions we obtain Eγj γk − EγjEγk = ∆j,k where E|∆j,k| ≤ MT 4d−2

for 0.25 < d < 0.5.For d = 0.25 the bound on the last term (8) is identical to the case 0.25 < d < 0.5. FurtherE|∆j,k| ≤ M(log T )/T for d = 0.25 according to standard summability arguments showing

that∑T

j=1 j−1 = O(T log T ) (see e.g. Hosking (1996), top of p. 278). This shows the claimfor d = 0.25.For d < 0.25 it follows that the middle two terms are of order O(T−1) independent ofj, k, p. Hence E|∆j,k| ≤ M/T for d < 0.25. All bounds hold uniformly in 1 ≤ j, k ≤ p and1 ≤ p ≤ T . ¤

Inspecting the proof it follows that it also applies to linear processes vt =∑∞

j=0 θv,jεt−j

where (εt)t∈Z fulfills Assumption N if∑∞

j=0 ‖θv,j‖2 < ∞: In this case the results of thelemma hold with d = 0.

Lemma 2 Let (εt)t∈Z fulfill Assumption N. Let vt,p =∑∞

j=0 φp,jεt−j, t ∈ Z, p ∈ N. Then if

supp∈N∑∞

j=0 ‖φp,j‖22 < ∞ it follows that supp∈N E‖vt,p‖4

2 < ∞.

Proof: The proof for the multivariate case is only notationally more complex, hence onlythe univariate case will be dealt with. Then Ev4

t,p = 3(Ev2t,p)

2 + κ4,p (see e.g. the proof ofLemma 1 given above). Next since Ev2

t,p =∑∞

j=0 φ2p,jEε2

t it follows that supp∈N Ev2t,p < ∞.

Further

κ4,p =∞∑

j=0

φ4p,j(E(ε2

t − (Eεt)2)2 ≤ Eε4

t

( ∞∑j=0

φ2p,j

)2

.

22

Hence supp Eκ4,p < ∞. ¤

Lemma 3 Let ∆vt = ut for v0 = 0 where ut =∑∞

j=0 θu,jεt−j for (1−z)−d =∑∞

j=0 θu,jzj,−1/2 <

d < 1/2. Further let wt =∑∞

j=0 θw,jεt−j for∑∞

j=0 j‖θw,j‖2 < ∞ and (εt)t∈Z is i.i.d. withmean zero and finite fourth moments. Then using d0 := max(d, 0) we have

(i) T−2(d+1)

T∑t=p+1

vtv′t

d→ Ξd,

(ii) max0≤j≤HT

‖T−(d0+1)

T∑t=p+1

vtw′t−j‖2 = OP (1),

(iii) T−(2d0+1)

T∑t=p+1

vtu′t = OP (1),

(iv) T−d+1

T∑t=p+1

vt−1ε′t = OP (1)

where HT = o(T 1/3) and Ξd is a random variable. All results hold uniformly in p = o(T 1/3).

Proof: (i) The convergence of T−2(d+1)∑T

t=1 vtv′t follows from Theorem 4.1. (−0.5 <

d < 0) and Theorem 4.2 (0 < d < 0.5) of Davidson and de Jong (2000) and Phillips andSolo (1992), Theorem 3.4. (d = 0) and the usual application of the continuous mappingtheorem.(ii) The convergence in distribution of T−(d0+1)

∑Tt=p+1 vtw

′t+1 follows from Theorem 4.1.

of Davidson (2004). The uniform result in j can be derived from the following argument:

T−(d0+1)

T∑t=p+1

vtw′t−j = T−(d0+1)

T∑t=p+1

(vt − vt−j−1)w′t−j + T−(d0+1)

T∑t=p+1

vt−j−1w′t−j

= T−(d0+1)

j∑i=0

T∑t=p+1

∆vt−iw′t−j + T−(d0+1)

T∑t=p+1

vt−j−1w′t−j

= T−d0

j∑i=0

(T−1

T∑t=p+1

ut−iw′t−j

)+ T−(d0+1)

T∑t=p+1

vt−j−1w′t−j.

The first term is the sum of j + 1 estimated covariances which can be dealt with usingLemma 1:

j∑i=0

(T−1

T∑t=p+1

ut−iw′t−j

)=

j∑i=0

Eut−iw′t−j+

j∑i=0

(T−1

T∑t=p+1

ut−iw′t−j − Eut−iw

′t−j

)+O(pT−1)

23

which is of order O(T d0) + OP ((j + 1)fT ) where fT = T 2d−1 for 0.25 < d < 0.5, fT =T−1/2

√log T for d = 0.25 and fT = T−1/2 for d < 0.25. Here

∑ji=1 Eut−i−1w

′t−j =

O(T d0) is used which is straightforward to derive. Hence the first term above is of or-der O(1) + OP (jfT T−d) = OP (1) for d > 0 and of order O(1) + OP (jT−1/2) = OP (1) ford < 0 uniformly in 0 ≤ j ≤ T 1/3.(iii) and (iv) can be found in Lemma 4.2. and Theorem 4.1 of Davidson (2004). ¤

Lemma 4 Let vt,T − aT vt−1,T = ut, t ∈ N, aT = 1 − c/T, c ≥ 0 where ut is stationaryand ergodic with finite second moments generated according to

∑∞j=0 πu,jut−j = εt where

(εt)t∈Z fulfills Assumption N, for πu(z) :=∑∞

j=0 πu,jzj we have det πu(z) 6= 0, |z| ≤ 1 and∑∞

j=0 ‖πu,j‖2 < ∞. The recursions are started at v0,T = v0, T ∈ N which is assumedto be deterministic. Further let wt =

∑∞j=0 φε

w,jεt−j + φηw,jηt−j where

∑∞j=0 j‖φε

w,j‖2 <∞,

∑∞j=0 ‖φη

w,j‖2 < ∞ and (ηt)t∈Z fulfills Assumption N and is independent of (εt)t∈Z.Then:

(i) E‖vt,T‖22 = O(t) uniformly in T .

(ii) E‖T−3/2∑T

t=p+1 vt,T w′t‖2

2 = O(T−1).

(iii) T−2∑T

t=p+1 vt,T v′t,Td→ ∫ 1

0Jc(w)Jc(w)′dw where Jc(w) denotes an Ornstein-Uhlenbeck

process.

(iv) T−1∑T

t=p+1 vt,T u′td→ ∫ 1

0Jc(w)dB(w)′ + σu for some matrix σu. Here B(w) denotes

the Brownian motion associated with T−1/2ut.

Proof: (i) According to the assumptions it follows that ut =∑∞

j=0 φu,jεt (Lewis andReinsel, 1985, p. 395, l.3). Further

∑∞j=−∞ ‖Eu0u

′j‖2 < ∞ follows. The recursive definition

of vt,T implies that vt,T = atT v0 +

∑t−1i=0 ai

T ut−i. Consequently

E‖vt,T‖22 = E(at

T v0 +t−1∑i=0

aiT ut−i)

′(atT v0 +

t−1∑i=0

aiT ut−i) = a2t

T Ev′0v0 +t−1∑

i,j=0

aiT aj

TEu′t−iut−j.

Since c ≥ 0 it follow that aT ≤ 1 and hence a2tT v′0v0 = O(1). For the second term note that

|t−1∑

i,j=0

aiT aj

TEu′t−iut−j| ≤t−1∑

i,j=0

‖Eut−iu′t−j‖2 ≤ t

∞∑j=−∞

‖Eu0u′j‖2 = O(t).

(ii) We will only deal with the univariate case, the multivariate case is only notationallymore difficult. The process (wt)t∈N can be decomposed as wt := wε

t +wηt = (

∑∞j=0 φε

w,jεt−j)+(∑∞

j=0 φηw,jηt−j). Since εs and ηt are independent it follows that

Evt,T vs,T wtws = Evt,T vs,T wεt w

εs + Evt,T vs,TEwη

t wηs

24

because Evt,T vs,T wεt w

ηs = Evt,T vs,T wε

tEwηs = 0. Therefore the evaluations can be given for

wεt and wη

t separately. All expectations exist due to assumed finite fourth moments. Thecontribution to E‖T−3/2

∑Tt=p+1 vt,T wt‖2

2 of the second term involving wηt can be bounded

as

T−3

T∑t=1+p

T∑s=1+p

|Evt,T vs,TEwηt w

ηs | ≤ T−3

T∑t=1+p

T∑s=1+p

t1/2s1/2|Ewηt w

ηs | = O(T−1)

due to∑∞

j=−∞ ‖Ewηt w

ηt−j‖2 < ∞.

With respect to the first term we use the Beveridge-Nelson decomposition (Phillips andSolo, 1992) wε

t = φw(1)εt + w∗t −w∗

t−1. The main strategy is to rewrite∑T

j=p+1 vt,T wεt as a

sum of several terms and then show that the expectation of the square of each summandis of the required order. Of course, the cross terms are then of the same order, as isstraightforward to verify. It follows that

T−3/2∑T

t=1+p vt,T wεt = T−3/2

∑Tt=1+p vt,T εtφw(1) + T−3/2

∑Tt=1+p vt,T (w∗

t − w∗t−1)

= T−3/2∑T

t=1+p vt,T εtφw(1)− T−3/2∑T−1

t=p (vt+1,T − vt,T )w∗t

+T−3/2vT,T w∗T − T−3/2vp,T w∗

p

(9)

Since vT,T = aTT v0 +

∑T−1i=0 ai

T uT−i it follows from finite fourth moments of ut that Ev4T,T =

O(T 4) and finite fourth moments of w∗T (see the proof of Lemma 1) then imply via the

Cauchy-Schwartz inequality that Ev2T,T (w∗

T )2 = O(T 2). Therefore the two last terms in

the expression above contribute terms of the order O(T−1) to E‖T−3/2∑T

t=p+1 vt,T wt‖22 as

required. Further vt,T = aT vt−1,T + ut and

E

(T−3/2

T∑t=1+p

vt−1,T εt

)2

= T−3

T∑t,s=1+p

Evt−1,T εtvs,T εs = T−3

T∑t=1+p

Ev2t−1,TEε2

t

due to the conditional homoskedasticity assumption Eεtε′t|Ft−1 = Eεtε

′t. Since Ev2

t,T =

O(t) the whole sum is O(T−1) as required. Obviously E(T−3/2∑T

t=p+1 utεt)2 = O(T−1).

Finally vt,T − vt−1,T = vt,T − aT vt−1,T + (aT − 1)vt−1,T = ut − c/Tvt−1,T and therefore thesquare of the second term in (9) equals

T−3

T∑t,s=1+p

ut+1us+1w∗t w

∗s −

c

T(vt,T us+1w

∗t w

∗s + vs,T ut+1w

∗t w

∗s) +

c2

T 2vt,T vs,T w∗

sw∗t .

Since vt,T is the sum of t terms of bounded fourth moments it follows as above thatEv4

t,T = O(t4) and hence Evt,T us+1w∗t w

∗s ≤ (Ev4

t,T )1/4(Eu4s+1)

1/4(E(w∗t )

4)1/2 = O(t). There-fore (ii) follows.(iii) and (iv) are proved in Lemma 1 of Phillips (1987) under different assumptions on the

process ut. The main fact used there, however, is that the process XT (t) = T−1/2σ−1∑btT c

s=1 us, 0 ≤t ≤ T converges weakly to a Brownian motion. It is a standard result that this holds underour assumptions (see e.g. Hall and Heyde, 1980, Theorem 4.1.). ¤

25

Lemma 5 Let the process (wt)t∈Z be generated according to wt =∑∞

j=1 πw,jwt−j +εt where

(εt)t∈Z fulfills Assumption N and∑∞

j=1 ‖πw,j‖2 < ∞. Further for πw(z) := I−∑∞j=1 πw,jz

j

assume that det πw(z) 6= 0, |z| ≤ 1. Let wt be partitioned as w′t = [y′t, z

′2t]′ and ac-

cordingly let εyt denote the first block of εt. Define εyt,p := yt −∑p

j=1[Is, 0]πw,jwt−j =εyt +

∑∞j=p+1[Is, 0]πw,jwt−j. Then

E(‖εyt,p − εyt‖22)

1/2 ≤ c

∞∑j=p+1

‖πw,j‖2 (10)

for suitable constant c < ∞ not depending on p.

Proof: See Lewis and Reinsel (1985), p. 397, (2.9). ¤

Lemma 6 Let RT ∈ RgT×gT denote a sequence of (possibly random) matrices whose di-mension gT depend on the sample size T . Let RT denote a sequence of random matricessuch that ‖RT −RT‖2 = OP (fT ) where fT → 0. Then if supT∈N ‖R−1

T ‖2 < ∞ a.s. it follows

that ‖R−1T −R−1

T ‖2 = OP (fT ).

Proof: See Lewis and Reinsel (1985), p. 397, l. 11. ¤

Lemma 7[

A BC D

]−1

=

[A−1 00 0

]+

[ −A−1BI

] [D − CA−1B

]−1 [ −CA−1 I]

(11)

Proof: This can be verified by simple algebraic manipulations. ¤

Lemma 8 Under Assumption P1(i), (ii) and (iv) let Γp := E(x−t )(x−t )′ where x−t =[(x−2t)

′, (x−1t)′]′ as defined in Theorem 2. Then supp∈N ‖Γ−1

p ‖2 < ∞.

Proof: Note that z1t = zν1t + zε

1t where zν1t = νt +

∑∞j=1 θjνt−j and zε

1t =∑∞

j=1 φjεt−j aremutually independent. Consequently Ez1t−iz

′1t−j = Ezν

1t−i(zν1t−j)

′ + Ezε1t−i(z

ε1t−j)

′. Lettingxε

1t and xν1t denote the components of x−1t generated from εt and νt respectively, it follows

that

Γp = E

y−t (y−t )′ y−t (z−2t)′ y−t (zε

1t−pz1−1)′ y−t (xε

1t)′

z−2t(y−t )′ z−2t(z

−2t)

′ z−2t(zε1t−pz1−1)

′ z−2t(xε1t)

′

zε1t−pz1−1(y

−t )′ zε

1t−pz1−1(z−2t)

′ zε1t−pz1−1(z

ε1t−pz1−1)

′ zε1t−pz1−1(x

ε1t)

′

xε1t(y

−t )′ xε

1t(z−2t)

′ xε1t(z

ε1t−pz1−1)

′ xε1t(x

ε1t)

′

+E

0 0 0 00 0 0 00 0 zν

1t−pz1−1(zν1t−pz1−1)

′ zν1t−pz1−1(x

ν1t)

′

0 0 xν1t(z

ν1t−pz1−1)

′ xν1t(x

ν1t)

′

def= Γε

p + Γνp

26

Clearly 0 ≤ Γεp, 0 ≤ Γν

p. Also the largest eigenvalue of both matrices are bounded uniformlyin p (see Theorem 6.6.10. of Hannan and Deistler (1988) for Γε

p; the nonzero eigenvaluesof Γν

p do not depend on p). Furthermore the matrix in the third and fourth block row andblock column of Γν

p is positive definite, since z1t contains the term νt.For the heading subblock built from the first and second block row and columns of Γε

p

the smallest eigenvalue is bounded uniformly in p by Theorem 6.6.10. on p. 265 of Han-nan and Deistler (1988). Assume then that the uniform bound on the eigenvalues ofΓp does not hold. Then there exists a sequence pT → ∞ and a sequence of unit normvectors xp such that x′pΓpxp → 0. Then x′pΓ

εpxp + x′pΓ

νpxp → 0 and hence partition-

ing xp = [x′p,1, x′p,2, x

′p,3, x

′p,4]

′ where xp,i corresponds to the partitioning used previouslyit follows that E(x′p,3z

ν1t−pz1−1 + x′p,4x

ν1t)(x

′p,3z

ν1t−pz1−1 + x′p,4x

ν1t)

′ → 0. It follows that‖xp,3‖2 + ‖xp,4‖2 → 0. From Theorem 6.6.10 of Hannan and Deistler (1988) it also followsthat E(x′p,1y

−t + x′p,2z

−2t)(x

′p,1y

−t + x′p,2z

−2t)

′ → 0 implies ‖xp,1‖2 + ‖xp,2‖2 → 0. But thisproduces a contradiction to ‖x‖2 = 1. This shows the claim. ¤

B Proof of the Theorems

The proof of the theorems will be given based on the following lemma which introduces anew set of high level conditions sufficient for Assumptions HL to hold:

Lemma 9 Let the process (wt)t∈Z be generated according to wt =∑∞

j=1 πw,jwt−j +εt where

(εt)t∈Z fulfills Assumption N and∑∞

j=1 ‖πw,j‖2 < ∞. Further for πw(z) := I−∑∞j=1 πw,jz

j

assume that det πw(z) 6= 0, |z| ≤ 1. Let wt be partitioned as w′t = [y′t, z

′2t]′ and accord-

ingly let εyt denote the first block of εt. Define εyt,p := yt −∑p

j=1[Is, 0]πw,jwt−j = εyt +∑∞j=p+1[Is, 0]πw,jwt−j. Assume that z−t ∈ Rkzp is a vector which is Ft−1 measurable such

that yt = A(p)z−t +εt,p = [A1(p), A2(p), A3(p)][(z−t,1

)′,(z−t,2,p

)′, z′3,t]

′+εt,p where z−t ∈ Rkzp is

partitioned as z−t = [(z−1,t

)′,(z−2,t,p

)′, z′3,t]

′ such that z−t,1 =[z′t−1,1, . . . , z

′t−p1,1

]′ ∈ Rkz1 (where

p1 is fixed) and z3,t ∈ Rkz3 do not depend on p and z2,t,p =[z′2t−1, . . . , z

′2t−p

]′depends on

p. Further let p tend to infinity as a function of the sample size such that p3/T → 0 andT 1/2

∑∞j=p+1 ‖πw,j‖2 → 0.

Then the following conditions are sufficient for the high level conditions Assumption HL tohold: There exists a matrix RT and a scaling matrix DT = diag(Ikz1T

−1/2, IT−1/2, Ikz3fT )such that (λmax denotes the maximal eigenvalue)

maxT∈N λmax(ERT ) = O(1) , λmax(RT ) = OP (1), λmax(R

−1T ) = OP (1),(12)

RT =

R1,1 RT,1,2 0RT,2,1 RT,2,2 0

0 0 RT,3,3

, (13)

RT := DT

T∑t=p+1

z−t (z−t )′DT , such that ‖RT −RT‖2 = oP (p−1/2), (14)

27

supl∈Rkzp ,‖l‖2=1

T−1/2

T∑t=p+1

(E‖l′DT z−t ‖22)

1/2 = O(1), (15)

vec

T∑t=p+1

εt(z−t )′DT R−1

T

I00

d→ Z (16)

where Z is Gaussian distributed with zero mean and variance Γ−11.2 ⊗ Σ where Γ1.2 :=

limT→∞ R1,1 −RT,1,2R−1T,2,2RT,2,1 > 0.

Proof: Consider

A(p) :=T∑

t=p+1

yt(z−t )′(

T∑t=p+1

z−t (z−t )′)−1 = A(p) +T∑

t=p+1

εyt,p(z−t )′DT (DT

T∑t=p+1

z−t (z−t )′DT )−1DT

= A(p) +

(T∑

t=p+1

εyt,p(z−t )′DT

)R−1

T DT .

NowT∑

t=p+1

εyt,p(z−t )′DT =

T∑t=p+1

εyt(z−t )′DT +

T∑t=p+1

(εyt,p − εyt)(z−t )′DT .

Here

E‖T∑

t=p+1

(εyt,p − εyt)(z−t )′DT‖2 ≤

T∑t=p+1

(E‖εyt,p − εyt‖22)

1/2(E‖DT (z−t )‖22)

1/2

= (T 1/2(E‖εy1,p − εy1‖2

2)1/2

)(

T−1/2

T∑t=p+1

(E‖DT (z−t )‖22)

1/2

)= o(p1/2).

(17)

Here (15) and Lemma 5 are used. Moreover letting εyt,i, i = 1, . . . , ky, denote a coordinateof εyt we have

E(T∑

t=p+1

εyt,i(z−t )′DT )′(

T∑t=p+1

εyt,i(z−t )′DT ) =

T∑t=p+1

Eε2yt,iEDT z−t (z−t )′DT = Eε2

y1,iERT

using the martingale difference property. Therefore ‖∑Tt=p+1 εyt,p(z

−t )′DT‖2 = OP (p1/2).

Consequently we obtain ‖(A(p) − A(p))D−1T ‖2 = OP (p1/2) using (15) and (12,14) in com-

bination with Lemma 6.Then consider Σε := T−1

∑Tt=p+1 εtε

′t: We obtain

Σε =1

T

T∑t=p+1

(yt − A(p)z−t )(yt − A(p)z−t )′

28

=1

T

T∑t=p+1

(εyt,p − (A(p)− A(p))z−t )(εyt,p − (A(p)− A(p))z−t )′

=1

T

T∑t=p+1

εyt,pε′yt,p −

1

T

T∑t=p+1

εyt,p(z−t )′(A(p)− A(p))′ − 1

T

T∑t=p+1

(A(p)− A(p))z−t ε′yt,p

+(A(p)− A(p))

(1

T

T∑t=p+1

z−t (z−t )′)

(A(p)− A(p))′

= Σ + oP (1) + OP (p/T ) = Σ + oP (1).

Here the bound follows from T−1∑T

t=p+1 εyt,pε′yt,p → Σ which can be shown using Lemma 5

and ergodicity of (εt)t∈Z implying T−1∑T

t=p+1 εtε′t → Σ almost surely. Further ‖(A(p) −

A(p))D−1T ‖2 = OP (p1/2), ‖RT‖2 = OP (1) and ‖∑T

t=p+1 DT z−t ε′yt,p‖2 = OP (p1/2) are used.This shows HL (i).With respect to HL (ii) note that Γ−1

1.2 equals the (1,1) block of R−1T . Then (14) in combi-

nation with the bound on the norm of RT and R−1T given in (12) imply HL (ii).

With respect to HL (iii) note that x−1.2t = [Γ1.2, 0]D−1T R−1

T DT z−t . Therefore

T−1/2

T∑t=p+1

εyt,p(x−1.2t)

′ =T∑

t=p+1

εyt,p(z−t )′DT R−1

T

Γ1.2

00

=

T∑t=p+1

εyt(z−t )′DT R−1

T

Γ1.2

00

+oP (1)

since ‖∑Tt=p+1(εyt − εyt,p)(z

−t )′DT l‖2 = oP (1) similar to (17) and

∑Tt=p+1 εyt(z

−t )′DT =

OP (p1/2) as used above. Then (16) and HL (ii) imply HL (iii). ¤

B.1 Proof of Theorem 2

Proof: The proof uses a number of results of Lewis and Reinsel (1985), henceforth calledLR. We will verify the conditions of Lemma 9 where z−1,t := x−1t, z

−2,t,,p := x−2t and z3,t does

not occur. Thus kzp = kz1pz1 + p(ky + kz2) + kz1. Consequently DT = T−1/2I. Also theassumptions of the theorem imply that all occurring variables are stationary with boundedvariance. Then ERT = (T − p)/TRT . The maximum eigenvalue of RT is bounded uni-formly in T ∈ N since z−t is a vector containing only lags of the vector process [w′

t, z′1t]′

which has bounded spectrum due to the summability assumptions on the autoregressioncoefficients (see e.g. Hannan and Deistler, 1988, p. 265). The bound on the minimumeigenvalue of RT is derived in Lemma 8. This verifies (12), (13) and (15).Each entry in RT − RT is equivalent to an estimated covariance at some lag up to anapproximation error due to the different limits of summation. Lemma 1 shows that thevariance of the estimators of the covariances are of order O(T−1), see also Hannan (1976),Chapter 4. The change in the summation introduces an error of order OP (pT−1) since thedifference is a sum of a maximum of p terms each of variance O(T−1). This shows that all

29

entries in RT −RT are of order OP (T−1/2) and therefore ‖RT −RT‖2 = OP (pT−1/2). Thenp/T 3 → 0 implies that pT−1/2 = o(p−1/2) showing (14).Finally (16) follows as in Theorem 3 of LR (see also Theorem 7.4.9. of Hannan and Deistler,1988). The only change in the arguments lies in the different definition of the regressorsand correspondingly the replacement of Γp of LR by RT . In the proof the uniform boundon λmax(R

−1T ) derived above is crucial. Details are omitted. ¤


Proof: The proof will follow closely the results contained in Saikkonen and Lutkepohl(1996) henceforth denoted as SP96. Essential in the developments is a reparameterizationof the auxiliary model using Γ according to

∆yt = Ψ0(γ′⊥wt−1)+

p−1∑j=1

Πjvt−j+Πpvt−p+

(pz1+1∑j=1

ψz1j

)z1t−pz1−1+

pz1∑j=1

ψz1j(z1t−j−z1t−pz1−1)+εyt,p.

(18)Here the coefficients Πj are functions of the coefficients πw,j such that

∑∞j=1 ‖Πj‖2 < ∞

due to Assumption P2 (iii) See e.g. equation (2.6) of SP96 for detailed expressions. Sincewe are interested in testing the significance of ψz1j, j = 0, . . . , pz1 and not in the otherparameters the reparameterization is immaterial to our purposes since the estimates ofψz1j in both, the original model and equation (18), coincide.Note that in (18) there are two variables containing integrated regressors: (γ′⊥wt−1) andz1t−pz1−1. Assumption P2 implies that there exists full column rank matrices β ∈ R(n+kz1)×nz

and β⊥ ∈ R(n+kz1)×(n+kz1−nz) such that β′β⊥ = 0 where (nt,⊥)t∈N, nt,⊥ := β′[(γ′⊥wt−1)′, z′1t−pz1−1]

′

is stationary and (nt)t∈N, nt := β′⊥[(γ′⊥wt−1)′, z′1t−pz1−1]

′ is integrated allowing for no coin-tegrating relation. Thus instead of the original regression we can consider the regression

∆yt = [ψx1, ψx2, Ψ0]

z−1,t

z−2,t,p

z3,t

+ εyt,p = A(p)z−t + εyt,p

where z−1,t := x−1t = [(z1t−z1t−pz1−1)′, . . . , (z1t−pz1−z1t−pz1−1)

′]′, z−2,t,p := [n′t,⊥, v′t−1, . . . , v′t−p+1, (γ

′vt−p)′]′

and z3,t := nt analogously to the definition in Lemma A.3. of SP96. Here (z−1,t)t∈Z and(z−2,t,p)t∈Z are stationary, the latter for given value of p. Thus we clearly are in the situationof Lemma 9 and it is sufficient to verify the conditions given there.Define DT := diag(T−1/2I, T−1/2I, T−1I) where the partitioning refers to the partitioningof z−t into z−1,t, z

−2,t,p and z3,t and let RT := DT (

∑Tt=p+1 z−t (z−t )′)DT . Note that in z−t the

last kz3 coordinates are integrated the remaining ones being stationary. Further let

RT :=

Ez−1,t(z

−1,t)

′ Ez−1,t(z−2,t)

′ 0Ez−2,t(z

−1,t)

′ Ez−2,t(z−2,t)

′ 0

0 0 T−2∑T

t=p+1 ntn′t

30

such that obviously (13) holds. Here the submatrix built of the first two block rows andcolumns of RT has uniformly bounded eigenvalues (both from below and from above) dueto Lemma 8 as in the proof of Theorem 2. The nonsingularity (in probability) of the(3,3) block of RT follows from the convergence in distribution (cf. Lemma 4 (iii), c = 0)to an almost sure positive definite random matrix. Therefore λmax(RT ) = OP (1) andλmax(R

−1T ) = OP (1) establishing (12).

Next Lemma 1 for d = 0 and Lemma 4 (ii) (for c = 0) imply that each entry in RT−RT hasvariance uniformly of order O(T−1). Accordingly ‖RT −RT‖2 = OP (p/T−1/2) establishing(14) for p = o(T 1/3).Next consider E‖l′DT z−t ‖2

2 = E(T−1‖l′1z−1,t‖22 + T−1‖l′2z−2,t‖2

2 + T−2‖l′3z3,t‖22) where l′ =

[l′1, l′2, l

′3] is partitioned according to the partitioning of z−t . According to Lemma 4 (i)

for c = 0 we have E‖z3,t‖22 = O(t). Due to stationarity of the remaining terms we have

E‖l′DT z−t ‖22 = O(T−1) analogously to the proof in Theorem 2. Therefore (15) follows.

Finally in∑T

t=p+1 εyt(z−t )′DT R−1

T [I, 0, 0]′ the nonstationary terms do not occur due to theblock diagonal structure of RT . Thus analogous arguments as in the proof of Theorem 2imply that (16) holds. This concludes the proof. ¤


Proof: The proof proceeds by mimicking the arguments of the proof of Theorem 3 whilereplacing the difference ∆ with the pseudo-difference ∆c := (1 − aT L). We immediatelyobtain

∆cyt = [ψx1, ψx2, Ψ0]

z−1,t

z−2,t,p

z−3,t

+ εyt,p = A(p)z−t + εyt,p

where z−2,t,p := [n′t,⊥, v′t−1, . . . , v′t−k+1, (γ

′vt−k)′]′ where (nt,⊥)t∈N, nt,⊥ := β′[(γ′⊥wt−1)

′, z′1t−pz1−1]′

is stationary (up to lower order terms, see the discussion around (18)) and (z3,t)t∈N, z3,t :=β′⊥[(γ′⊥wt−1)

′, z′1t−pz1−1]′ is the sum of two local-to-unity processes allowing for no cointe-

grating relation.The variable z3,t is a local-to-unity process with local-to-unity parameter c. Further

z−1,t := [(z1t−z1t−pz1−1)′, . . . , (z1t−pz1−z1t−pz1−1)

′]′ behaves essentially as a stationary process

since z1t−j−apz1+1−jT z1t−pz1−1 is stationary (as a finite sum of stationary terms) and therefore

z1t−j − z1t−pz1−1 = z1t−j − apz1−j+1T z1t−pz1−1 + (apz1−j+1

T − 1)z1t−pz1−1

where apz1−j+1T − 1 = O(T−1). Therefore it follows from Lemma 4 that the second term

does not influence any of the results. Then reconsidering the proof of Theorem 3 showsthat the only change consists in omitting the statement ‘for c = 0’ in the arguments. Thisshows the theorem. ¤

31


Proof: The proof follows the same route as the proof of Theorem 2. The main differencelies in the fact that the impulse response sequence corresponding to z1t is not summable.Note, however, that only the process (z1t) has long-memory whereas yt and z2t remainshort-memory processes.Hence let DT = T−1/2I and RT = Ez−t (z−t )′ where z−t is defined as in the proof of Theo-rem 2. Accordingly RT := T−1

∑Tt=p+1 z−t (z−t )′. In order to show ‖RT −RT‖2 = oP (p−1/2)

note that every entry in this matrix converges in mean square since according to Lemma 1the variances are of order O(Tmax(4d−2,−1)) for d 6= 0.25 and of order O(T−1 log T ) ford = 0.25. Note that Eγj = (T − p)/Tγj. Therefore the expectation of the sum of

squared entries of RT − RT is of order O(T 4d−2p + p2T−1) for 0.25 < d < 0.5, of orderO(pT−1 log T +p2T−1) for d = 0.25 and of order O(pT−1+p2T−1) for d < 0.25. Here we usethe fact that there are only O(p) terms involving the long-memory processes since yt andz2t are short memory processes contributing p2 terms of order O(T−1). Therefore in orderfor ‖RT − RT‖2 = oP (p−1/2) it is sufficient that p2T 4d−2 + p3T−1 → 0 for 0.25 < d < 0.5,that (p2 log T + p3)/T → 0 for d = 0.25 and in all other cases p3T−1 → 0. This shows(14). The bounds (12) on the eigenvalues of RT follow from Lemma 8 (which did not usethe short memory assumption on z1t) as in the proof of Theorem 2. Since z3,t does notoccur (13) follows trivially. Stationarity and finite variances of (z1t)t∈N implies (15) as inthe proof of Theorem 2.It remains to verify (16). In the following we will only deal with the scalar output case(i.e. ky = 1). The multivariate case is only notationally more difficult. It is sufficient to

show that T−1/2∑T

t=p+1 εyt(α′pz−t ) is asymptotically normal with α′pRT αp → α′∞R∞α∞ for

vector sequences αp such that 0 < c < infp∈N ‖αp‖2 ≤ supp∈N ‖αp‖2 ≤ C for some constants

0 < c < C < ∞ and ‖[α′p, 0]′ − α∞‖2 → 0 holds. Clearly the columns of R−1T fulfill these

requirements. In this respect we use the three series criterion of Hall and Heyde (1980,Theorem 3.2, p. 58): With XTt = εyt(α

′pz−t )/

√T we obtain that (XTt)1≤t≤T is a martin-

gale difference sequence with respect to the sigma field generated by εs, νs, s ≤ t. In thefollowing we will only deal with the univariate case. The multivariate case follows as usualfrom the Cramer-Wold device (see e.g. Davidson, 1994, Theorem 25.5.). Then Theorem

3.2. states that∑T

t=1 XTtd→ N (0, η2) if

(i) max1≤t≤T

|XTt| p→ 0,

(ii)T∑

t=1

X2Tt

p→ η2(a constant),

(iii) E max1≤t≤T

X2Tt is bounded in T.

Assume that α′pRT αp → η2 (for some constant η) as p →∞. Then it holds that

Eε2yt(α

′pz−t )2 = Eε2

ytE(αpz−t )2 < M

32

for some constant 0 < M < ∞ uniformly in p ∈ N due to the conditional homoskedasticityand the assumption of finite second moments RT of z−t . Then

E max1≤t≤T

X2Tt ≤

T∑t=1

EX2Tt ≤ M

such that (iii) follows. Secondly,

T∑t=1

X2Tt = T−1

T∑t=1

ε2yt(α

′pz−t )2 = T−1

T∑t=1

(ε2yt−Eε2

yt)α′pz−t (z−t )′αp+

(T−1

T∑t=1

α′pz−t (z−t )′

)αpEε2

yt

where α′p(T−1

∑Tt=1 z−t (z−t )′)αp = α′pRT αp → η2 since ‖RT − RT‖2 → 0. Therefore it is

sufficient to show that T−1∑T

t=1(ε2yt − Eε2

yt)α′pz−t (z−t )′αp converges to zero. According to

Davidson (1994, Theorem 19.7) this hold for our assumptions if |(ε2yt−Eε2

yt)(α′pz−t )2| can be

shown to be uniformly integrable (uniformly over t and p). Now E(ε2yt − Eε2

yt)2(α′pz

−t )4 =

(E(ε2yt−(Eε2

yt))2)(Eα′pz

−t )4 due to the i.i.d. assumption on (εt)t∈Z. But E(ε2

yt−(Eε2yt))

2 < ∞due to finite fourth moments. In order to show that supp∈N E(α′pz

−t )4 < ∞ for supp ‖αp‖2 <

∞ we use Lemma 2: Clearly α′pz−t =

∑∞j=0 φν

p,jνt−j + φεp,jεt−j. Therefore it is sufficient

to show that supp

∑∞j=0 ‖[φν

p,j, φεp,j]‖2

2 < ∞. But this follows since supp ‖αp‖2 is boundedby assumption and for each of yt, z1t and z2t the summability assumption holds which isstraightforward to verify. Uniform integrability then follows from Davidson (1994, Theorem12.10.). Hence we follow that (ii) holds.Finally (i) holds since it is implied by (I(.) denoting the indicator function)

T∑t=1

E[X2

TtI(X2Tt > ε)

]= TE

[X2

T1I(X2T1 > ε)

] → 0

for each ε > 0 (see Hall and Heyde, 1980, (3.6), p. 53). Here convergence is implied byE[εy1(α

′pz1)]

4 = Eε4y1E(α′pz1)

4 < ∞ as shown previously. This concludes the proof. ¤


Proof: The proof of Theorem 6 combines the arguments from the proof of Theorem 3 andTheorem 5. Analogously to equation (18) we obtain

yt =

p−1∑j=1

πjyt−j+

p∑j=1

ψjz2t−j+

(pz1+1∑j=1

ψz1j

)B−1(Bz1t−pz1−1)+

pz1∑j=1

ψz1j(z1t−j−z1t−pz1−1)+εyt,p

where B := [β, β⊥]. Note that z1t−j−z1t−pz1−1 =∑pz1−1

i=j ∆z1t−i =∑pz1−1

i=j x1t−i is stationary

for each 0 ≤ j < pz1−1. Define z−1t :=[z′1t−1 − (z1t−pz1−1)

′, . . . , z′1t−pz1− (z1t−pz1−1)

′]′ , z−2,t,p :=[(y−t )′, (z−2t)

′, (β′z1t−pz1−1)′]′ and z3,t := β′⊥z1t−pz1−1. Then in z−t := [(z−1,t)

′, (z−2,t,p)′, z′3,t]

′ the

33

last coordinates (i.e. z3,t) are fractionally integrated with parameter d + 1 while the re-maining coordinates are stationary. Define DT := diag(T−1/2I, T−(d+1)I) and moreoverRT := DT

∑Tt=p+1 z−t (z−t )′DT and

RT :=

Ez−1,t(z

−1,t)

′ Ez−1,t(z−2,t)

′ 0Ez−2,t(z

−1,t)

′ Ez−2,t(z−2,t)

′ 0

0 0 T−2(d+1)∑T

t=p+1 ntn′t

.

Obviously (13) holds with this choice. Then RT − RT consists of six types of subblocks:The terms involving only z−1,t and z−2,t can be analyzed exactly as in the proof of Theorem 5:The upper bound on the increase of p as a function of the sample size shows that the sumof squares of these entries is of order OP (p−1). The (3, 3) block of RT −RT is zero by defi-nition. The remaining two terms include terms of the form T−d−3/2

∑Tt=p+1 z3,t(β

′z1t−j)′ =

OP (T 2d0+1−d−3/2), T−d−3/2∑T

t=p+1 z3,tx′1t−j = OP (T 2d0+1−d−3/2) and T−d−3/2

∑Tt=p+1 z3,t[y

′t−j, z

′2t−j] =

OP (T d0+1−d−3/2) by Lemma 3 where d0 := max(0, d). Furthermore it follows from Lemma 3(ii) that

max0≤j≤HT

‖T−d−3/2

T∑t=p+1

z3,t[y′t−j, z

′2t−j]‖2 = OP (T−d−1/2+d0)

so that the bound also holds uniformly for all entries in the matrix. Therefore the sumover these terms is of order OP (pT−1/2+d0−d). Using the bounds derived in the proof ofTheorem 5 this shows that

‖R−R‖2 =

OP (p1/2T 2d−1 + pT−1/2 + T d−1/2) 0.25 < d < 0.5

OP (p1/2√

log TT

+ pT−1/2 + T d−1/2) d = 0.25,

OP (pT−1/2 + T d−1/2) 0 < d < 0.25,OP (pT−d−1/2) −0.5 < d ≤ 0

.

Then (14) holds under the restrictions on p imposed in Assumption P5 .The uniform bound on the eigenvalues of RT follows as in the proof of Theorem 5 and thefact that

T−2(d+1)

T∑t=p+1

z3,tz′3,t

d→ Ξ

for some random variable Ξ which is a.s. positive by Theorem 4.1. (d < 0) or Theorem4.2. (d > 0) of Davidson and de Jong (2000). Consequently (12) holds.According to Lemma 3.2. of (Davidson and de Jong, 2000) Ez3,tz

′3,t = O(T 2d+1) follows

and hence the contribution of this block to E‖l′DT z−t ‖22 is of order O(T−1) showing (15).

Finally the arguments to show (16) are analogous to the ones used in the proof of Theo-rem 5 since the nonstationary components are not involved. This concludes the proof. ¤

34


Proof: The stategy of the proof is to apply, where possible, the previously proved resultswithin each regime. We will verify the conditions of Lemma 9 where z−1,t := x−1t, z

−2,t,p := x−2t

and z3,t does not occur. Thus kzp = kz1 (pz1 + 1) + p(ky + kz2) and DT = T−1/2I. De-

fine Sj =

pz1 + 2 + b∑j−1k=0 ωkT c, . . . , b

∑jk=1 ωkT c

as the data range that would re-

sult if restricted to regime j only. Sj ommits pz1 + 1 discarded lags, denoted by Dj :=b∑j−1

k=0 ωkT c+ 1, . . . , pz1 + 1 + b∑j−1k=0 ωkT c

. Let D :=

⋃Jj=1 Dj. Define the within-

regime variance Γ(j) := E[z−t (z−t )′I (t ∈ Sj)

] − µ(j)µ(j)′ and define R :=∑J

j=1 ωjR(j),

where R(j) := E[(

z−t − µ) (

z−t − µ)′

I (t ∈ Sj)]

as a measure of the overall average

variation. Noting that R(j) = Γ(j) + (µ(j)− µ) (µ(j)− µ)′ we decompose R as R =∑Jj=1 ωjΓ(j) +

∑Jj=1 ωj (µ(j)− µ) (µ(j)− µ)′ .

Using the same argument as was used directly for R in the proof of Theorem 5 for Γ(i) wehave λmax(Γ(i)), λmax (Γ(i)−1) = O(1). We also have λmax

((µ(j)− µ) (µ(j)− µ)′

)= O(1)

despite the fact that the dimension µ(j)− µ grows in p, since it consists of pz1 +1 repeatedcopies of the same vector extended to the correct dimension by adding zeros. Here pz1 isfixed independently of the sample size. Then it follows (see (Lutkepohl, 1996, p. 74) that

λmax (R) ≤J∑

j=1

ωjλmax (Γ(j)) +J∑

j=1

ωjλmax

((µ(j)− µ) (µ(j)− µ)′

)= O (1) and

λmax

(R−1

) ≤(

J∑j=1

ωjλmin (Γ(j))

)−1

= O(1)

where J is fixed. This shows (12).Next define sample counterparts:

z(j) := bωjT c−1∑t∈Sj

z−t , z := T−1

T∑t=p+1

z−t , Γ(j) := bωjT c−1∑t∈Sj

(z−t − µ (j)

) (z−t − µ (j)

)′,

R(j) := bωjT c−1∑t∈Sj

(z−t − µ

) (z−t − µ

)′

and note that

E∥∥∥R(j)−R(j)

∥∥∥2≤ E

∥∥∥Γ(j)− Γ(j)∥∥∥

2+ 2E ‖(z(j)− µ (j))‖2

(∥∥µ (j)′∥∥

2+ ‖µ′‖2

)

+(pz1 + 1)∥∥[ωjT ]−1µµ′

∥∥2

where the last term results from the sum over the pz1 + 1 discarded lags in Dj.

E ‖(z(j)− µ (j)) I (t ∈ Sj)‖22 = (pz1 + 1)

kz1∑i=1

E[(x1ti(j)− E [x1tiI (t ∈ Sj)])

2 I (t ∈ Sj)]

35

+ p

ky+kz2∑i=1

E[(x2ti(j)− E [x2tiI (t ∈ Sj)])

2 I (t ∈ Sj)]

= O(pT−1

).

(19)

By similar argument ‖µ‖2, ‖µ(j)‖2 = OP (1) and ‖pz1[ωjT ]−1µµ′‖2 ≤ OP (pT−1). Thus

∥∥∥R(j)−R(j)∥∥∥

2≤

∥∥∥Γ(j)− Γ(j)∥∥∥

2+ OP

(pT−1

). (20)

Next define R := T−1∑T

t=p+1

(z−t − z

) (z−t − z

)′and note that

R =J∑

j=1

ωjR(j) +J∑

j=1

T−1∑t∈Dj

(z−t − z

) (z−t − z

)′=

J∑j=1

ωjR(j) + OP

(√pT−1

),(21)

since (pz1 + 1)T−1E∥∥∥(z−t − z

) (z−t − z

)′∥∥∥2≤ (pz1 + 1)T−1E

[∥∥(z−t − z

)∥∥2

2

]= O (pT−1) ,

where the last step follows by an argument similar to (19). Then by (20) and (21)

∥∥∥R−R∥∥∥

2≤

J∑j=1

ωj

∥∥∥Γ(j)− Γ(j)∥∥∥

2+ OP

(√pT−1

). (22)

The same arguments as in the proofs of Theorems 2 and 5 show∥∥∥Γ(j)− Γ(j)

∥∥∥2

= oP

(p−1/2

)

since these do not involve breaks. This shows (14).Next write

T−1

T∑t=p+1

(E

∥∥l′(z−t − µ

)∥∥2

2

)1/2

≤ 21/2

J∑j=1

T−1∑t∈Sj

(E

∥∥l′(z−t − µ(j)

)∥∥2

2

)1/2

+21/2

J∑j=1

T−1∑t∈Sj

(‖l′ (µ(j)− µ)‖2

2

)1/2

+J∑

j=1

T−1∑t∈Dj

(E

∥∥l′(z−t − µ

)∥∥2

2

)1/2

.

For the last term in (23) we have(E

∥∥l′(z−t − µ

)∥∥2

2

)1/2

= OP (p) by arguments simi-

lar to those directly above (19). It follows that∑J

j=1 T−1∑

t∈Dj

(E

∥∥l′(z−t − µ

)∥∥2

2

)1/2

=

O (pT−1) = o (1). For the middle term in (23) we have∑J

j=1 T−1∑

t∈Sj

(‖l′ (µ(j)− µ)‖22

)1/2=∑J

j=1b(Tωj − pz1− 1)c/T ‖l′ (µ(j)− µ)‖2 = O (1) by argument similar to (19). Finally the

first term in (23) is also O(1) since J is fixed and T−1∑

t∈Sj

(E

∥∥l′(z−t − µ(j)

)∥∥2

2

)1/2

=

O (1) by the same arguments as in the proofs of theorems 2 and 5. This establishes (15).As in the proof of Theorem 5, we will show T−1/2

∑Tt=1 εytα

′p

(z−t − µ

)converges to the

normal distribution given in (16) by verifying the three conditions of (Hall and Heyde, 1980,

36

Theorem 3.2, p. 58) for XTt := εytαp

(z−t − µ

)/√

T in the scalar case. The multivariatecase again follows from the Cramer-Wold device.

Condition (ii) of Hall and Heyde (1980, Theorem 3.2, p. 58) follows from

E max1≤t≤T

X2Tt ≤

T∑t=1

EX2Tt = E

[ε2

yt

]α′p

J∑j=1

ωjE[(

z−t − µ) (

z−t − µ)′]

αp = E[ε2

yt

]α′pRαp.

For condition (ii)

T∑t=1

X2Tt = E

[ε2

ty

]α′pR

′αp + T−1

T∑t=1

(ε2

ty − E[ε2

ty

])α′p

(z−t − µ

) (z−t − µ

)′αp (23)

Note that∥∥∥Γ (j)− Γ(j)

∥∥∥2→p 0 by the same arguments as in Theorems 2 and 5 and

therefore by (22), this implies that∥∥∥R−R

∥∥∥2→p 0, so that the first term in (23) con-

verges in probability to η2 = E[ε2

ty

]α′pRαp. The second term in (23) converges in prob-

ability to zero by the same arguments as in the proof of Theorem 5 (Lemma 2 impliesthat E

[(α′p

(z−t − µ(j)

))4

]and therefore E

[(α′p

(z−t − µ

))4

]is bounded). Noting that∑T

t=1 E [X2TtI (X2

Tt > ε)] =∑J

j=1bTωjcE [X2TtI (t ∈ Sj) I (X2

Tt > ε)], condition (i) also fol-lows by the similar aguments as in Theorem 5. ¤

37

Documents

Robust Granger Causality Tests in the VARX Frameworkhomes.chass.utoronto.ca/~amaynard/papers/grangint.pdf · and Granger causality testing often rely heavily on the choice of model