Testing a large set of zero restrictions in regression models,
with an application to mixed frequency Granger causality∗
Eric Ghysels†
UNC Chapel Hill
Jonathan B. Hill‡
UNC Chapel Hill
Kaiji Motegi§
Kobe University
This draft: November 10, 2019
Abstract
This paper proposes a new test for a large set of zero restrictions in regression models based
on a seemingly overlooked, but simple, dimension reduction technique. The procedure in-
volves multiple parsimonious regression models where key regressors are split across simple
regressions. Each parsimonious regression model has one key regressor and other regressors
not associated with the null hypothesis. The test is based on the maximum of the squared
parameters of the key regressors. Parsimony ensures sharper estimates and therefore improves power in small samples. We present the general theory of our test and focus on mixed
frequency Granger causality as a prominent application involving many zero restrictions.
Keywords: dimension reduction, Granger causality test, max test, Mixed Data Sampling
(MIDAS), parsimonious regression models.
JEL classification: C12, C22, C51.
∗We are grateful to the editor, Marine Carrasco, and two anonymous referees for their helpful comments. We also thank Manfred Deistler, Shigeyuki Hamori, Peter R. Hansen, Atsushi Inoue, Helmut Lütkepohl, Massimiliano Marcellino, Yasumasa Matsuda, Daisuke Nagakura, Eric Renault, Peter Robinson, seminar participants at NYU Stern, Université Libre de Bruxelles, Keio University, and UNC Chapel Hill, and participants at the 2014 NBER-NSF Time Series Conference, 25th EC2 Conference, SETA 2015, 11th World Congress of the Econometric Society, 2015 JJSM, Recent Developments in Time Series and Related Fields, Workshop on Advances in Econometrics 2017, and 2019 JJSM for helpful comments. The third author is grateful for financial support from JSPS KAKENHI (Grant Number 16K17104), the Kikawada Foundation, the Mitsubishi UFJ Trust Scholarship Foundation, the Nomura Foundation, and the Suntory Foundation.
†Corresponding author. Department of Economics and Department of Finance, Kenan-Flagler Business School, University of North Carolina at Chapel Hill, and CEPR. Address: Chapel Hill, NC 27599, USA. E-mail: [email protected]
‡Department of Economics, University of North Carolina at Chapel Hill. Address: Chapel Hill, NC 27599, USA. E-mail: [email protected]
§Graduate School of Economics, Kobe University. Address: Kobe, Hyogo 657-8501, Japan. E-mail: [email protected]
1 Introduction
We propose a new test designed for a large set of zero restrictions in regression models. The test
is based on a seemingly overlooked, but simple, dimension reduction technique for regression
models. Suppose that the underlying data generating process for a scalar yt is
yt = z′ta + x′tb + ϵt,
where zt is assumed to have a small dimension p while xt may have a large but finite dimension
h. We want to test the null hypothesis H0 : b = 0 against the general alternative hypothesis H1 : b ≠ 0.
A classical approach for testing H0 relies on what we call a full regression model yt = z′tα + x′tβ + ut and a Wald test. This approach may result in imprecise inference when b has a large
dimension relative to sample size n. The asymptotic χ2-test may suffer from size distortions in
small sample due to parameter proliferation. A bootstrap method can be employed to improve
empirical size, but this generally results in the size-corrected bootstrap test having low power
due to large size distortions (cf. Davidson and MacKinnon (2006)). A shrinkage estimator can
be used, including Lasso, Adaptive Lasso, or Ridge Regression, but these are valid under a
sparsity assumption. This means that when applied to our setting, the shrinkage estimator may
not be valid under the alternative because the high dimensional parameter set being tested may
have all non-zero components, contradicting the sparsity assumption.1
To circumvent the issue of parameter proliferation, we propose to split the key regressors xt = [x1t, . . . , xht]′ across separate regression models. This approach leads to what
we call parsimonious regression models:
yt = z′tαi + βixit + uit for i = 1, . . . , h.
The ith parsimonious regression model has zt and the ith element of xt only, so that parameter
proliferation is not an issue. We then compute a max test statistic:
Tn = max{(√n βn1)², . . . , (√n βnh)²}.
The asymptotic distribution of Tn is non-standard under H0 : b = 0, but it is straightforward to approximate a p-value by drawing directly from an asymptotically valid approximation of the asymptotic distribution. We will prove that, under H1 : b ≠ 0, at least one of {βn1, . . . , βnh} has a nonzero probability limit under fairly weak conditions. This key result ensures that,
although the proposed approach does not deliver a consistent estimator of b, we can reject the
null hypothesis with probability approaching one for any direction b ≠ 0 under the alternative
hypothesis. The max test is therefore a consistent test.
The maximum of a sequence of statistics with (or without) pointwise Gaussian limits has
1Under the null, the sparsity assumption is true by construction. Even if evidence points to the null, a Type II error may be occurring, in which case the test statistic need not have its intended asymptotic properties.
a long history (e.g. Gnedenko (1943)), including the maximum correlation over an increasing sequence of integer displacements; see, e.g., Berman (1964) and Hannan (1974). A max statistic across econometric models is used in White's (2000) predictive model selection criterion. Evidently ours is the first application of a maximum statistic to a regression model as a core test statistic for zero restrictions.
We demonstrate via Monte Carlo simulations that in comparison to the bootstrapped Wald
test, the max test is more powerful particularly as h increases and/or there is a stronger corre-
lation among regressors. In some cases, the max test is more than three times as powerful as
the bootstrapped Wald test. An intuitive explanation for the dominance of the max test is as
follows. Performing the Wald test requires the computation of Γ̂ = (1/n) ∑_{t=1}^{n} XtX′t, where Xt = [z′t,x′t]′ is the vector of regressors in the full regression model. The resulting Γ̂ becomes a less precise estimator of E[XtX′t] as the dimension p + h increases or as the correlation among the regressors strengthens. Moreover, the bootstrapped Wald test tends to lose finite sample power under such circumstances.
The max test, by contrast, bypasses the computation of Γ̂ since it only requires the computation of Γ̂ii = (1/n) ∑_{t=1}^{n} XitX′it for i = 1, . . . , h, where Xit = [z′t, xit]′ is the (p + 1) × 1 vector of regressors in the ith parsimonious regression model. Since Γ̂ii has a much smaller dimension than Γ̂, its accuracy improves considerably. Hence the max test remains powerful as the dimension h or the contemporaneous correlation within Xt increases. Since large dimensionality and
strong correlation among regressors (i.e. multicollinearity) are indeed an issue in many economic
applications, the max test is a promising alternative to the Wald test.
After presenting the general theory of the max test, we focus on Mixed Data Sampling (MI-
DAS) Granger causality as a prominent application involving many zero restrictions. Classical
tests for Granger’s (1969) causality are typically designed for single-frequency data (i.e. they
aggregate data with possibly different frequencies to the common lowest frequency). A popular
Granger causality test is a Wald test based on multi-horizon vector autoregressive (VAR) mod-
els. An advantage of the VAR approach is the capability of handling causal chains among more
than two variables.2
Time series are often sampled at different frequencies, and it is well known that temporal
aggregation generates or conceals Granger causality.3 Standard VAR models are designed for
single-frequency data, and causality tests based on those models may produce misleading results
of spurious (non-)causality. To alleviate the adverse impact of temporal aggregation, Ghysels,
Hill, and Motegi (2016) develop Granger causality tests that explicitly take advantage of mixed
frequency data. They extend Dufour, Pelletier, and Renault’s (2006) VAR-based causality test,
exploiting Ghysels’ (2016) mixed frequency vector autoregressive (MF-VAR) models.4
A challenge not addressed by Ghysels, Hill, and Motegi (2016) is parameter proliferation in
MF-VAR models which has a negative impact on the finite sample performance of their tests.
2See Lütkepohl (1993), Dufour and Renault (1998), and Dufour, Pelletier, and Renault (2006) among others.
3See Ghysels, Hill, and Motegi (2016) and references therein for the consequences of temporal aggregation on Granger causality.
4Foroni, Ghysels, and Marcellino (2013) provide a survey of MF-VAR models.
Indeed, if we let m be the ratio of high and low frequencies (e.g. m = 3 in mixed monthly and
quarterly data), then for bivariate mixed frequency settings the MF-VAR is of dimension m + 1.
Parameter proliferation occurs when m is large, and worsens as the lag length of VAR increases.
In such a case, an asymptotic Wald test results in size distortions, while a bootstrapped Wald
test features low size-corrected power, a common consequence when a size-distorted asymptotic
test is bootstrapped (cf. Davidson and MacKinnon (2006)).5
The max test proposed in the present paper is a useful solution to the large dimensionality
associated with mixed frequency Granger causality. This paper focuses on testing for Granger
causality from a high frequency variable xH to a low frequency variable xL.6 We show via Monte
Carlo simulations that the max test dominates the Wald test in terms of empirical power. In
some cases, the max test is more than four times as powerful as the bootstrapped Wald test.
As an empirical application, we analyze Granger causality from a weekly interest rate spread
to real gross domestic product (GDP) growth in the United States. Based on a rolling window
analysis, the max test with mixed frequency data yields the intuitive result that the interest rate spread causes GDP growth until the 1990s, after which the causality vanishes. The Wald test with mixed frequency data yields the counter-intuitive result that significant causality emerges only recently.
The remainder of the paper is organized as follows. In Section 2, we present the general theory
of the parsimonious regression models and max tests. In Section 3, we focus on mixed fre-
quency Granger causality tests as a specific application. In Section 4, we perform Monte Carlo
simulations for a generic linear regression framework and MIDAS frameworks. The empirical
application on interest rate spread and GDP is presented in Section 5, and Section 6 concludes
the paper. In Technical Appendices, we provide omitted proofs and some technical details.
2 General theory
Consider a data generating process (DGP) in which a univariate time series {yt} depends linearly on two groups of regressors zt = [z1t, . . . , zpt]′ and xt = [x1t, . . . , xht]′. Define the σ-field Ft = σ(Yτ : τ ≤ t + 1) with all variables Yt = [yt−1,X′t]′, where Xt = [z′t,x′t]′. Thus, Ft−1 contains information on {yt−1, yt−2, . . .} and the regressors {(zt,xt), (zt−1,xt−1), . . .}.
Assumption 2.1. The true DGP is

yt = ∑_{k=1}^{p} akzkt + ∑_{i=1}^{h} bixit + ϵt. (2.1)

The error {ϵt} is a stationary martingale difference sequence (mds) with respect to the increasing σ-field filtration Ft ⊂ Ft+1, and σ² ≡ E[ϵt²] > 0.
5Götz, Hecq, and Smeekes (2016) propose reduced rank regression and Bayesian estimation to improve the finite sample performance of Ghysels, Hill, and Motegi's (2016) test under a large-dimensional MF-VAR.
6Granger causality from xL to xH poses a harder challenge due to the difficulty of ensuring parsimony under the null and identifying parameters under the alternative. See Section 3.2 for discussion.
Remark 2.1. The mds assumption allows for conditional heteroskedasticity of unknown form,
including GARCH-type processes. We can also easily allow for stochastic volatility or other
random volatility errors by expanding the definition of the σ-field Ft. The mds assumption
can be relaxed at the expense of more technical proofs, and a more complicated asymptotic
variance structure and subsequent estimator. Since this is all well known, we do not consider
such generalizations in the present paper.
Let n denote the sample size, and define the n × (p + h) matrix of regressors X = [X1, . . . ,Xn]′.
Assumption 2.2 is a standard assumption that rules out perfect multicollinearity among the
regressors.
Assumption 2.2. X is of full column rank p+ h almost surely.
Assumption 2.3. {Xt, ϵt} are strictly stationary and ergodic. Xt is square integrable.
Using standard vector notation, e.g., a = [a1, . . . , ap]′, DGP (2.1) is rewritten as

yt = z′ta + x′tb + ϵt, (2.2)
where {zt} are auxiliary regressors whose coefficients are not our main target. The number
of auxiliary regressors, p, is assumed to be relatively small. We want to test the zero restrictions with respect to the main regressors {xt}:
H0 : b = 0.
The number of zero restrictions, h, is assumed to be finite but potentially large relative to sample
size n.
A classical approach to testing H0 : b = 0 is the Wald test based on what we call a full regression model:

yt = ∑_{k=1}^{p} αkzkt + ∑_{i=1}^{h} βixit + ut, (2.3)
where we take it for granted that model (2.3) is correctly specified relative to the true DGP (2.1)
in order to focus ideas. Given model (2.3), it is straightforward to compute a Wald statistic
with respect to H0 : b = 0. The statistic has an asymptotic χ2 distribution with h degrees of
freedom under Assumptions 2.1-2.3. A potential problem of this approach is that the asymptotic
approximation may be poor when there are many zero restrictions relative to sample size. A
parametric or wild bootstrap can be used to control for the size of the test, but this typically
leads to low size-corrected power. It is therefore useful to propose a new test that achieves sharper size and higher power.
We propose parsimonious regression models:

yt = ∑_{k=1}^{p} αkizkt + βixit + uit, i = 1, . . . , h. (2.4)
A crucial difference between the full regression model (2.3) and the parsimonious regression models (2.4) is that the latter comprises h equations, where only the key regressor xit, along with the auxiliary regressors {z1t, . . . , zpt}, appears in the ith equation. The parameters {α1i, . . . , αpi} may differ across the equations i = 1, . . . , h, and unless the null hypothesis is true, they are generally
not equal to the true values {a1, . . . , ap}. We run least squares for each parsimonious regression
model to get {βn1, . . . , βnh}. Then we formulate the max test statistic

Tn = max{(√n βn1)², . . . , (√n βnh)²}. (2.5)
Equations (2.4) and (2.5) form a core part of our approach. The number of regressors in each
parsimonious regression model is p+ 1, which is much smaller than p+ h in the full regression
model. As a result, the precision of βni as an estimator of its probability limit β∗i = plim_{n→∞} βni improves for each i.
Under H0 : b = 0, each parsimonious regression model is correctly specified, and therefore βni →p β∗i = bi = 0 straightforwardly. The asymptotic distribution of Tn under H0 is non-standard, but we can compute a simulation-based p-value by drawing directly from an asymptotically valid approximation to the asymptotic distribution (see Theorems 2.1-2.3 below). Under H1 : b ≠ 0, each parsimonious regression model is in general misspecified due to an omitted regressor, and therefore βni →p β∗i ≠ bi. A key result, proven in Theorems 2.4 and 2.5, is that at least one of {β∗1, . . . , β∗h} must be nonzero under H1 : b ≠ 0. Hence the max test achieves consistency, Tn →p ∞, although βni itself may not converge in probability to the true value bi.
We can potentially generalize the test statistic (2.5) by adding a weight to each term:

Tn = max{(√n wn1βn1)², . . . , (√n wnhβnh)²},

where {wn1, . . . , wnh} is a sequence of possibly stochastic L2-bounded positive scalar weights with non-random positive probability limits. We restrict ourselves to the equal weights wni = 1 in order to focus on the key implications of (2.4) and (2.5). Exploring non-equal weights is a possible future task.7
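To make the construction concrete, the two steps above (run h parsimonious regressions, then take the maximum squared scaled slope) can be sketched in a few lines. This is purely illustrative: the simulated DGP, parameter values, and function name below are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP purely for illustration: p = 2 auxiliary regressors z_t,
# h = 10 key regressors x_t, n = 200 observations, with H0 : b = 0 imposed.
n, p, h = 200, 2, 10
z = rng.standard_normal((n, p))
x = rng.standard_normal((n, h))
y = z @ np.array([0.5, -0.3]) + rng.standard_normal(n)

def max_test_stat(y, z, x):
    """T_n = max_i (sqrt(n) * beta_ni)^2, where beta_ni is the OLS slope on
    x_it in the i-th parsimonious regression of y_t on (z_t, x_it)."""
    n, h = x.shape
    betas = np.empty(h)
    for i in range(h):
        X_i = np.column_stack([z, x[:, i]])   # only p + 1 regressors
        betas[i] = np.linalg.lstsq(X_i, y, rcond=None)[0][-1]
    return n * np.max(betas ** 2), betas

T_n, betas = max_test_stat(y, z, x)
```

Each least squares fit involves only p + 1 parameters, which is the dimension reduction the paper exploits.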
2.1 Asymptotic theory under the null hypothesis
We derive the asymptotic distribution of Tn in (2.5) under H0 : b = 0. Rewrite each parsimonious regression model (2.4) as

yt = X′itθi + uit, i = 1, . . . , h, (2.6)

where Xit = [z1t, . . . , zpt, xit]′ and θi = [α′i, βi]′ = [α1i, . . . , αpi, βi]′. Stack all parameters across the h models as θ = [θ′1, . . . ,θ′h]′. Define a selection matrix R that selects β = [β1, . . . , βh]′ from
7A natural choice would be the inverted standard error V̂ni^{−1/2}, where V̂ni is a consistent estimator for the asymptotic variance of √n βni under the null. This allows for a control of different sampling and asymptotic dispersions of the estimators across parsimonious regression models, which may improve empirical and local power. For the sake of compactness, we do not explore such generality here.
θ. R is an h × (p + 1)h full row rank matrix such that β = Rθ, hence

R = [ 01×p 1 01×p 0 · · · 01×p 0
      01×p 0 01×p 1 · · · 01×p 0
       ⋮   ⋮   ⋮   ⋮   ⋱   ⋮   ⋮
      01×p 0 01×p 0 · · · 01×p 1 ]. (2.7)
Theorem 2.1. Under H0 : b = 0, we have Tn →d max{N1², . . . ,Nh²} as n → ∞, where N = [N1, . . . ,Nh]′ is distributed N(0,V) with covariance matrix

V ≡ RSR′ ∈ Rh×h, (2.8)

where S = [Σij]i,j ∈ R(p+1)h×(p+1)h with

Γij = E[XitX′jt], Λij = E[ϵt²XitX′jt], Σij = Γ−1ii Λij Γ−1jj, i, j ∈ {1, . . . , h}. (2.9)
All proofs appear in Technical Appendix A.
Remark 2.2. Under Assumption 2.1, the error {ϵt} is an adapted martingale difference. Suppose also that {ϵt² − σ²} is an adapted martingale difference with σ² = E[ϵt²] (i.e. conditional homoskedasticity). Then we get the well-known simplification Λij = σ²Γij.
Remark 2.3. {X1t, . . . ,Xht} share the common regressors zt. It is therefore easy to prove that
S is a positive semi-definite and singular matrix.8
Remark 2.4. We do not have a general proof that the asymptotic covariance matrix V is
nonsingular, except for a simple case with (p, h) = (1, 2). This is irrelevant for performing
the max test since we do not need to invert V . As we show below, it is easy to compute
an asymptotically valid p-value by drawing from an asymptotically valid approximation to the
distribution N(0,V ).
2.2 Simulated p-value
Let V̂n be a consistent estimator for V, and draw M samples of vectors {N(1), . . . ,N(M)} independently from N(0, V̂n). Now compute the artificial test statistics Tn(j) = max{(N1(j))², . . . , (Nh(j))²} for j = 1, . . . ,M. An asymptotic p-value approximation for Tn is

pn,M = (1/M) ∑_{j=1}^{M} I(Tn(j) > Tn),

8Observe that Σij = E[ϵt²X̃itX̃′jt], where X̃it = (E[XitX′it])−1Xit. Now define the set of all (non-unique) regressors across all parsimonious regression models: X̃t = [X̃′1t, . . . , X̃′ht]′ ∈ R(p+1)h. Let λ = [λ′1, . . . ,λ′h]′, where λi = [λi1, . . . , λi,p+1]′ is (p + 1) × 1 and λ′λ = 1. Then λ′Sλ = ∑_{i=1}^{h} ∑_{j=1}^{h} E[ϵt²λ′iX̃itX̃′jtλj] = E[ϵt²(λ′X̃t)²] ≥ 0. Since each X̃it contains zt, it is possible to find a non-zero λ such that λ′X̃t = 0, e.g. λ1,p+1 = 0, λ2 = −λ1, and all other λi = 0. Therefore S is singular and positive semi-definite.
where I(A) is the indicator function that equals 1 if A occurs and 0 otherwise. Since the N(j) are i.i.d. over j, and M can be made arbitrarily large, by the Glivenko-Cantelli Theorem pn,M can be made arbitrarily close to P(Tn(1) > Tn). The proposed max test rejects H0 at level α when pn,Mn < α and accepts H0 when pn,Mn ≥ α, where {Mn}n≥1 is a sequence of positive integers satisfying Mn → ∞.9
Define the max test limit distribution under H0 as F0(c) = P(max1≤i≤h (Ni(1))² ≤ c). The asymptotic p-value is therefore F̄0(Tn) ≡ 1 − F0(Tn) = P(max1≤i≤h (Ni(1))² ≥ Tn). By an argument identical to Hansen (1996: Theorem 2), we have the following link between the p-value approximation P(Tn(1) > Tn) and the asymptotic p-value for Tn.
Theorem 2.2. Let {Mn}n≥1 be a sequence of positive integers, Mn → ∞. Under Assumptions 2.1-2.3, P(Tn(1) > Tn) = F̄0(Tn) + op(1), hence pn,Mn = F̄0(Tn) + op(1). Therefore under H0, P(pn,Mn < α) → α for any α ∈ (0, 1).
A consistent estimator V̂n is computed as follows. Run least squares for each parsimonious regression model to get θni = [α′ni, βni]′ and residuals ûit = yt − X′itθni. Define Γ̂ij = (1/n) ∑_{t=1}^{n} XitX′jt, Λ̂ij = (1/n) ∑_{t=1}^{n} ûit²XitX′jt, Σ̂ij = Γ̂−1ii Λ̂ij Γ̂−1jj, Ŝ = [Σ̂ij]i,j, and V̂n = RŜR′.
Theorem 2.3. Under Assumptions 2.1-2.3, V̂n →p V∗ where V∗ is some matrix that satisfies ||V∗|| < ∞. Specifically, V∗ = V ≡ RSR′ under H0, while under H1, V∗ = RS∗R′ where S∗ = [Γ−1ii Λ∗ij Γ−1jj]i,j.
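A minimal sketch of the simulated p-value of Section 2.2, combining the covariance estimator with the Monte Carlo draws. This is our illustration, not the authors' code; one deliberate deviation, flagged in the comments, is that the Λ̂ blocks are weighted by ûit·ûjt rather than the paper's ûit², so that the estimated covariance matrix is symmetric positive semi-definite by construction (the two weightings agree asymptotically under H0).

```python
import numpy as np

def max_test_pvalue(y, z, x, M=10000, seed=0):
    """Simulated p-value for the max test (a sketch of Section 2.2).

    Fits the h parsimonious regressions of y on (z, x_i), forms
    T_n = max_i (sqrt(n) * beta_ni)^2, estimates V = R S R', draws M
    vectors N^(j) ~ N(0, V_n), and returns the share of draws whose
    max_i (N_i^(j))^2 exceeds T_n, together with T_n itself.
    """
    rng = np.random.default_rng(seed)
    n, h = x.shape
    betas = np.empty(h)
    resid = np.empty((n, h))
    Xi = [np.column_stack([z, x[:, i]]) for i in range(h)]
    Ginv = []
    for i in range(h):
        theta, *_ = np.linalg.lstsq(Xi[i], y, rcond=None)
        betas[i] = theta[-1]
        resid[:, i] = y - Xi[i] @ theta
        Ginv.append(np.linalg.inv(Xi[i].T @ Xi[i] / n))
    # V[i, j] is the (beta_i, beta_j) entry of G_ii^-1 Lam_ij G_jj^-1;
    # we weight by u_it * u_jt (not the paper's u_it^2) so that V is
    # symmetric PSD by construction -- equivalent under H0.
    V = np.empty((h, h))
    for i in range(h):
        for j in range(h):
            Lam = (Xi[i] * (resid[:, i] * resid[:, j])[:, None]).T @ Xi[j] / n
            V[i, j] = (Ginv[i] @ Lam @ Ginv[j])[-1, -1]
    V = (V + V.T) / 2.0                      # enforce exact symmetry
    T_n = n * np.max(betas ** 2)
    draws = rng.multivariate_normal(np.zeros(h), V, size=M)
    return float(np.mean(np.max(draws ** 2, axis=1) > T_n)), float(T_n)
```

Only the beta-row/column of each Σ̂ij block is needed, which is what indexing [-1, -1] extracts in place of an explicit selection matrix R.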
2.3 Limits under the null and alternative hypotheses
Under the alternative hypothesis H1 : b ≠ 0, βni in general does not converge in probability to the true value bi due to omitted regressors. Let β∗i = plim_{n→∞} βni be the so-called pseudo-true value of βi. The same notation applies to α∗i and θ∗i = [α∗′i, β∗i]′. Let βn = [βn1, . . . , βnh]′ and β∗ = [β∗1, . . . , β∗h]′. We can characterize θ∗i as follows.
Theorem 2.4. Let Assumptions 2.1-2.3 hold. Let Γii = E[XitX′it] ∈ R(p+1)×(p+1) and Ci = E[Xitx′t] ∈ R(p+1)×h. Then θn →p θ∗ = [θ∗′1, . . . ,θ∗′h]′, where

θ∗i = [α∗1i, . . . , α∗pi, β∗i]′ = [a1, . . . , ap, 0]′ + Γ−1ii Cib, i = 1, . . . , h. (2.10)

Further, βn →p β∗ = Rθ∗ by construction.

9The computation of pn,M is simple and fast. The latter follows since the N(j) i.i.d.∼ N(0, V̂n) are easily computed by standard modern software packages. The test statistics Tn(j) are then trivially easy to compute.
Theorem 2.4 provides useful insight into the relationship between the underlying coefficient b and the pseudo-true value β∗. First, it is clear from (2.10) that β∗ = 0 whenever b = 0. This is an intuitive result since each parsimonious regression model is correctly specified under H0 : b = 0. Second, as the next theorem proves, b = 0 whenever β∗ = 0. This is a key result that allows us to identify the null and alternative hypotheses exactly. Of course, our approach cannot identify all of {b1, . . . , bh} under H1. We can, however, identify that at least one of {b1, . . . , bh} must be non-zero, which is sufficient for rejecting H0 : b = 0.
Theorem 2.5. Let Assumptions 2.1-2.3 hold. Then β∗ = 0 implies b = 0, hence β∗ = 0 if and only if b = 0. Therefore βn →p 0 if and only if b = 0.

The proof of Theorem 2.5 exploits the non-singularity of E[ztz′t] and E[xtx′t], which is ensured by Assumption 2.2. We next present a simple example which illustrates Theorem 2.5.
Example 2.1 (Identification). Suppose that the true DGP is

yt = a1 + b1x1t + b2x2t + ϵt

and we run two parsimonious regression models

yt = α11 + β1x1t + u1t and yt = α12 + β2x2t + u2t.

Notice that we have only one common regressor z1t = 1 and two main regressors {x1t, x2t}. Assume for simplicity that E[xit] = 0 and E[xit²] = 1 for i = 1, 2. Finally, since Assumption 2.2 rules out perfect multicollinearity, we have ρ12 = E[x1tx2t] ∈ (−1, 1).
In this setting, it follows that Xit = [1, xit]′, xt = [x1t, x2t]′, and

Γii = E[XitX′it] = [1 0; 0 1], C1 = E[X1tx′t] = [0 0; 1 ρ12], C2 = E[X2tx′t] = [0 0; ρ12 1].

Substitute these quantities into (2.10) to get

β∗1 = b1 + ρ12b2 and β∗2 = b2 + ρ12b1. (2.11)
Now we verify Theorem 2.5 by showing that β∗ = 0 =⇒ b = 0. Assume β∗1 = β∗2 = 0. If ρ12 = 0, then it is trivial from (2.11) that b1 = b2 = 0. If ρ12 ≠ 0, then (2.11) implies that

[(1 + ρ12)(1 − ρ12)/ρ12] × b1 = 0.

Since ρ12 ∈ (−1, 1), it must be the case that b1 = 0 and therefore b2 = 0.
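The identification argument of Example 2.1 is easy to check numerically: with standardized, correlated regressors, the OLS slopes of the two parsimonious regressions converge to the pseudo-true values in (2.11). The coefficient values below are arbitrary illustrations, not taken from the paper.

```python
import numpy as np

# Hypothetical coefficients; under the stated moments the parsimonious OLS
# slopes should converge to beta1* = b1 + rho*b2 and beta2* = b2 + rho*b1.
rng = np.random.default_rng(1)
n, rho, a1, b1, b2 = 200_000, 0.6, 1.0, 0.5, -0.4
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
x = rng.standard_normal((n, 2)) @ L.T      # E[x_it]=0, E[x_it^2]=1, corr rho
y = a1 + b1 * x[:, 0] + b2 * x[:, 1] + rng.standard_normal(n)

slopes = []
for i in range(2):
    X = np.column_stack([np.ones(n), x[:, i]])   # intercept + one regressor
    slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
# slopes[0] is close to b1 + rho*b2 = 0.26; slopes[1] to b2 + rho*b1 = -0.10
```

With both pseudo-true values nonzero, the max test would reject even though neither slope equals its true coefficient, exactly as Theorem 2.5 predicts.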
Theorems 2.1-2.5 together imply that the max test statistic has its intended limit properties under either hypothesis. First, observe from the construction (2.5) that Tn →p ∞ at rate n if and only if β∗ ≠ 0, and by Theorems 2.4 and 2.5, βn →p β∗ ≠ 0 under H1 : b ≠ 0. In conjunction with the p-value approximation consistency in Theorem 2.2, the consistency of the max test follows.

Theorem 2.6. Let Assumptions 2.1-2.3 hold. Then Tn →p ∞ at rate n, and therefore P(pn,Mn < α) → 1 for any α ∈ (0, 1), if and only if H1 : b ≠ 0 is true. In particular, H1 : b ≠ 0 is true if and only if Tn/n →p max{(β∗1)², . . . , (β∗h)²} > 0.
An immediate consequence of Theorems 2.1, 2.2, 2.5, and 2.6 is that the limiting null distri-
bution arises if and only if H0 is true.
Corollary 2.7. Let {Mn}n≥1 be a sequence of positive integers, Mn → ∞. Let Assumptions 2.1-2.3 hold. Then Tn →d max{N1², . . . ,Nh²} as n → ∞, and therefore P(pn,Mn < α) → α for any α ∈ (0, 1), if and only if H0 : b = 0 is true.
To conclude this section, we briefly discuss a misspecification problem. Consider the case where the true DGP has h̄ regressors {x1t, . . . , xh̄t}, while we run only h parsimonious regression models with h < h̄, so that {xh+1,t, . . . , xh̄t} are not used. In such a case, the max test is generally inconsistent, just as the Wald test is. Observe that (2.10) still holds for i = 1, . . . , h when there is under-specification (see the proof of Theorem 2.4 for the derivation). When h < h̄, having β∗1 = · · · = β∗h = 0 does not imply that b = 0. Below is a simple counter-example.
Example 2.2 (Non-identification due to Under-specification). Continue Example 2.1, except that now h = 1: we run only one regression model, yt = α11 + β1x1t + u1t. In this simple set-up, the parsimonious regression approach is identical to the full regression approach with h = 1. By (2.11), β∗1 = b1 + ρ12b2. For any given ρ12 ∈ (−1, 1), having β∗1 = 0 does not imply b1 = b2 = 0 because (b1, b2) can take any value as long as b1 = −ρ12b2.
3 Mixed frequency Granger causality
In this section, we discuss testing for Granger causality between bivariate mixed frequency time
series as a prominent example involving a large set of zero restrictions. We restrict ourselves
to the bivariate case where we have a high frequency variable xH and a low frequency variable
xL.10
10The trivariate case involves causal chains whose complexity would distract us from the main theme of dimension reduction. See Dufour and Renault (1998), Dufour, Pelletier, and Renault (2006), and Hill (2007) for further discussion of causal chains.
We first set up a DGP of xH and xL. Let m denote the ratio of sampling frequencies, i.e.
the number of high frequency time periods in each low frequency time period τ ∈ Z. We assume
throughout the paper that m is fixed (e.g. m = 3 months per quarter) in order to focus ideas and
reduce notation. All of our main results would carry over to a time-varying sequence {m(τ)} (e.g. daily xH and monthly xL) in a straightforward way.
Example 3.1 (Empirical Example of Mixed Frequency Data). Mixed frequency data arise
when we analyze a monthly variable xH and a quarterly variable xL, where m = 3. Suppose
that xH(τ, 1) is the first monthly observation in quarter τ, xH(τ, 2) is the second, and xH(τ, 3)
is the third. A leading example in macroeconomics is quarterly real GDP growth xL(τ), where
existing analyses on Granger causality use unemployment, oil prices, inflation, interest rates,
etc., often aggregated into quarters (see Hill (2007) for references). Consider monthly infla-
tion in quarter τ, denoted as [xH(τ, 1), xH(τ, 2), xH(τ, 3)]′. The resulting stacked system is
{xH(τ, 1), xH(τ, 2), xH(τ, 3), xL(τ)}. The assumption that xL(τ) is observed after xH(τ,m) is
merely a convention.
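The stacking convention of Example 3.1 amounts to reshaping the high frequency series into m columns and appending the low frequency series as a final column. A toy sketch with made-up numbers (the data are purely illustrative):

```python
import numpy as np

# Toy monthly series for 4 quarters (m = 3) and a toy quarterly series.
m = 3
x_H_monthly = np.arange(1.0, 13.0)          # 12 monthly observations
x_L_quarterly = np.array([10.0, 20.0, 30.0, 40.0])

# Row tau of X is the K = m + 1 vector
# [x_H(tau,1), x_H(tau,2), x_H(tau,3), x_L(tau)].
X = np.column_stack([x_H_monthly.reshape(-1, m), x_L_quarterly])
```

The first row of X is then [1, 2, 3, 10]: the three monthly observations of the first quarter followed by that quarter's low frequency observation.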
In the bivariate case, we have a K × 1 mixed frequency vector
X(τ) = [xH(τ, 1), . . . , xH(τ,m), xL(τ)]′,
where K = m + 1. Define the σ-field Fτ ≡ σ(X(τ ′) : τ ′ ≤ τ). We assume as in Ghysels (2016)
and Ghysels, Hill, and Motegi (2016) that E[X(τ)|Fτ−1] has a version that is almost surely
linear in {X(τ − 1), . . . ,X(τ − p)} for some finite p ≥ 1.
Assumption 3.1. The mixed frequency vector X(τ) is governed by a MF-VAR(p) for finite p ≥ 1:

X(τ) = ∑_{k=1}^{p} AkX(τ − k) + ϵ(τ), (3.1)

where X(τ) = [xH(τ, 1), . . . , xH(τ,m), xL(τ)]′, ϵ(τ) = [ϵH(τ, 1), . . . , ϵH(τ,m), ϵL(τ)]′, and each K × K coefficient matrix is

Ak = [ d11,k · · · d1m,k  c(k−1)m+1
         ⋮    ⋱     ⋮        ⋮
       dm1,k · · · dmm,k  ckm
       bkm  · · · b(k−1)m+1  ak ].

The error {ϵ(τ)} is a strictly stationary martingale difference sequence (mds) with respect to the increasing filtration Fτ ⊂ Fτ+1, with a positive definite covariance matrix E[ϵ(τ)ϵ(τ)′].
Remark 3.1. The present paper borrows notation from Ghysels, Hill, and Motegi (2016). There
it is assumed that all variables follow a high frequency VAR process, but only some are observable
in a lower frequency. A central contribution is the recovery of true high frequency causal patterns
based on a mixed frequency model like (3.1). The focus here is decidedly different, with an aim
for improved inference under parameter proliferation as in mixed frequency models. We therefore
do not work with a high frequency model, nor consider the link between inference under mixed
versus high frequency models.
Remark 3.2. Model (3.1) captures bi-directional mixed frequency Granger causality. It therefore does not allow instantaneous causality or now-casting (e.g. Götz and Hecq (2014)).
The coefficients d and a in (3.1) govern the autoregressive properties of xH and xL, re-
spectively. The coefficients b and c are relevant for Granger causality, so we explain how
they are labeled. b1 represents the impact of the most recent past observation of xH (i.e.
xH(τ − 1,m)) on xL(τ), b2 represents the impact of the second most recent past observation of
xH (i.e. xH(τ − 1,m− 1)) on xL(τ), and so on through bpm. In general, bj represents the impact
of xH on xL with j high frequency lags.
Similarly, c1 represents the impact of xL(τ−1) on the nearest observation of xH (i.e. xH(τ, 1)),
c2 represents the impact of xL(τ−1) on the second nearest observation of xH (i.e. xH(τ, 2)), cm+1
represents the impact of xL(τ−2) on the (m+1)-th nearest observation of xH (i.e. xH(τ, 1)), and
so forth. Finally, cpm represents the impact of xL(τ − p) on xH(τ,m). In general, cj represents
the impact of xL on xH with j high frequency lags.
We impose the usual stationarity and ergodicity conditions.
Assumption 3.2. All roots of the polynomial det(IK − ∑_{k=1}^{p} Akz^k) = 0 lie outside the unit circle, where det(·) denotes the determinant.
Assumption 3.3. X(τ) and ϵ(τ) are ergodic.
Granger causality from xH to xL is what we call high-to-low causality, while the reverse
direction is what we call low-to-high causality. Since high-to-low causality and low-to-high
causality pose fundamentally different challenges, we treat them separately in Sections 3.1 and
3.2.
3.1 Granger causality from high to low frequency
Pick the last row of the entire system (3.1):

xL(τ) = ∑_{k=1}^{p} akxL(τ − k) + ∑_{i=1}^{pm} bixH(τ − 1,m + 1 − i) + ϵL(τ), ϵL(τ) ∼ mds(0, σL²), σL² > 0. (3.2)
The index i ∈ {1, . . . , pm} is in high frequency terms, and the second argument m + 1 − i of
xH can be less than 1 since i > m occurs when p > 1. Allowing any integer value in the second
argument of xH , including those smaller than 1 or larger than m, does not cause any confusion,
and simplifies equations below. It is understood, for example, that xH(τ, 0) = xH(τ − 1,m), xH(τ,−1) = xH(τ − 1,m − 1), and xH(τ,m + 1) = xH(τ + 1, 1). More generally, we interchangeably write xH(τ − τ′, i) = xH(τ, i − mτ′) for any τ, τ′, i ∈ Z. See Technical Appendix B for details.
Based on the classic theory of Dufour and Renault (1998) and the mixed frequency extension
made by Ghysels, Hill, and Motegi (2016), we know that xH does not Granger cause xL given
the mixed frequency information set Fτ = σ(X(τ ′) : τ ′ ≤ τ) if and only if
H0 : b1 = · · · = bpm = 0.
A well-known issue here is that the number of zero restrictions, pm, may be very large in some
applications, depending on the ratio of sampling frequencies m. Consider a case of weekly versus
quarterly data for instance.11 Each quarter has approximately m = 12 weeks, and the VAR
lag length is p quarters. Then pm = 36 when p = 3, and pm = 48 when p = 4, etc. Running
parsimonious regression models and the max test is our proposed solution to the parameter
proliferation arising from a large m.
There is a clear correspondence between the general linear DGP (2.1) and the mixed frequency DGP (3.2). The regressand yt is xL(τ); the common regressors {z1t, . . . , zpt} are the low frequency lags {xL(τ − 1), . . . , xL(τ − p)}; the main regressors split across parsimonious regression models {x1t, . . . , xht} are the high frequency lags {xH(τ − 1, m + 1 − 1), . . . , xH(τ − 1, m + 1 − pm)}. The parsimonious regression models are therefore written as

xL(τ) = ∑_{k=1}^{p} αk,i xL(τ − k) + βi xH(τ − 1, m + 1 − i) + uL,i(τ),   i = 1, . . . , pm.
Here we are using h = pm models, which matches the true high frequency lag length. If h < pm,
then there is under-specification and the max test loses consistency (cf. Example 2.2). If h ≥ pm,
then consistency is ensured although finite sample performance may be poorer due to redundant
regressors.
Assumptions 3.1-3.3 imply Assumptions 2.1-2.3. Thus, under Assumptions 3.1-3.3, Theorems
2.1-2.6 and Corollary 2.7 carry over to the max test for high-to-low causality.
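To fix ideas, the high-to-low max test can be sketched as follows. The code simulates a bivariate mixed frequency system under high-to-low non-causality, runs the pm parsimonious OLS regressions, and computes the max statistic; all sizes (p = 2, m = 3, n = 500), the AR coefficient 0.3, and the use of plain `numpy` least squares are illustrative choices, not values or methods from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 2, 3, 500            # illustrative sizes, not those of the paper
h = p * m                      # pm zero restrictions under H0

# Simulate under high-to-low NON-causality: xL depends only on its own lags.
hf = rng.standard_normal((n + p) * m)          # high frequency series, flat indexing
xL = np.zeros(n + p)
for t in range(1, n + p):
    xL[t] = 0.3 * xL[t - 1] + rng.standard_normal()

taus = np.arange(p, n + p)                     # usable low frequency periods
y = xL[taus]
low_lags = np.column_stack([xL[taus - k] for k in range(1, p + 1)])

# pm parsimonious regressions: common low frequency lags plus ONE high frequency lag.
betas = np.empty(h)
for i in range(1, h + 1):
    xi = hf[taus * m - i]                      # xH(tau - 1, m + 1 - i) in flat indexing
    X = np.column_stack([np.ones(len(taus)), low_lags, xi])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    betas[i - 1] = coef[-1]                    # key coefficient beta_i

T_n = np.max(n * betas ** 2)                   # max test statistic
```

The p-value would then be obtained by comparing T_n against draws from the limit distribution under the null, as described in Section 2.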
3.2 Granger causality from low to high frequency
We now consider Granger causality from xL to xH . Recall from (3.1) that the DGP is MF-
VAR(p). As defined in Ghysels, Hill, and Motegi (2016), the null hypothesis of low-to-high
non-causality is written as H0 : c1 = · · · = cpm = 0 or compactly c = [c1, . . . , cpm]′ = 0.
The first m equations of (3.1) motivate us to run the following pm parsimonious regressions
to test for H0 : c = 0:
xH(τ, 1) = ∑_{j=1}^{pm} δ1j,i xH(τ − 1, m + 1 − j) + γ1i xL(τ − i) + uH,i(τ, 1),   i = 1, . . . , p,   (3.3)

⋮

xH(τ, m) = ∑_{j=1}^{pm} δmj,i xH(τ − 1, m + 1 − j) + γmi xL(τ − i) + uH,i(τ, m),   i = 1, . . . , p.   (3.4)
[11] In Section 5, we indeed analyze Granger causality from the weekly interest rate spread to quarterly GDP growth in the U.S.
In each of the pm parsimonious regression models (3.3)-(3.4), we put one high frequency τ-period series on the left-hand side and one low frequency lag on the right-hand side, augmented with lagged high frequency regressors. Each of the pm models is correctly specified relative to (3.1) under low-to-high non-causality H0 : c = 0.
For each fixed high frequency increment k ∈ {1, . . . , m}, Assumptions 2.1-2.3 and hence Theorems 2.1-2.3 carry over, so we can easily run the max test with respect to γk1 = · · · = γkp = 0. Low-to-high non-causality, however, implies joint zero restrictions over all increments k ∈ {1, . . . , m}. Deriving the asymptotic distribution of the max test statistic over all pm models under the null is left as a future task.
Another open question is the characterization of the pseudo-true values of the key parameters {γ11, . . . , γ1p, . . . , γm1, . . . , γmp}. To prove the consistency of the proposed max test for low-to-high causality, we would need to show that at least one of those γ's has a nonzero pseudo-true value under H1 : c ≠ 0.
Last but not least, the parsimonious regression models (3.3)-(3.4) may suffer from parameter proliferation even under H0 : c = 0. The number of regressors in each model is pm + 1, which may be large enough to distort the finite sample performance of the max test. A possible remedy is to impose a MIDAS polynomial on the lagged high frequency variables in order to reduce the dimension of the common regressors, while keeping the low-to-high causality part unrestricted.[12] An issue with this approach is potential misspecification of the MIDAS polynomials. If the MIDAS polynomials are misspecified, the resulting estimators of the γ's are not guaranteed to converge in probability to 0 under H0 : c = 0.
4 Monte Carlo simulations
In this section, we perform Monte Carlo simulations and compare the finite sample performance
of the max tests and Wald tests. We consider three scenarios in order to highlight practical
advantages of the max tests in generic and MIDAS frameworks. In Section 4.1, we consider
generic DGPs that can include cross-section and single-frequency time series data. In Section
4.2, we consider a variety of DGPs and models with mixed frequency data. In Section 4.3, we
study a more specific MIDAS framework that resembles the empirical application in Section 5.
4.1 Generic framework
4.1.1 Simulation design
We consider a generic DGP yt = x′t b + ϵt with ϵt ∼ i.i.d. N(0, 1). For simplicity, we do not include any common regressors zt (i.e. p = 0) and include only the key regressors xt. The dimension of xt is h ∈ {10, 100, 200}, and the sample size is n ∈ {10h, 50h}. We assume that xt ∼ i.i.d. N(0, Γ),
[12] See Ghysels, Santa-Clara, and Valkanov (2006) and Ghysels, Sinko, and Valkanov (2007), among others, for further discussion of MIDAS polynomials. Various software packages, including the MIDAS Matlab Toolbox (Ghysels (2013)), the R package midasr (Ghysels, Kvedaras, and Zemlys (2016)), EViews, and Gretl, cover a variety of polynomial specifications.
where the diagonal elements of Γ are all equal to 1 and the off-diagonal elements are all equal
to ρ ∈ {0.0, 0.3, 0.6}. The parameter ρ measures the contemporaneous correlation among the
regressors. We consider four patterns for the true coefficients b = [b1, . . . , bh]′:
DGP-1 bi = 0 for all i ∈ {1, . . . , h}.

DGP-2 bi = I(i ≤ h/10) × 10(5h)−1 for all i ∈ {1, . . . , h}.

DGP-3 bi = I(i ≤ h/2) × 2(5h)−1 for all i ∈ {1, . . . , h}.

DGP-4 bi = (5h)−1 for all i ∈ {1, . . . , h}.
DGP-1 is designed to investigate the empirical size of the max and Wald tests. Nonzero coefficients are put on the first 10% of the regressors under DGP-2, on the first half under DGP-3, and on all regressors under DGP-4. For each of the latter three DGPs, the coefficients sum to ∑_{i=1}^{h} bi = 1/5.
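Assuming the inclusive thresholds i ≤ h/10 and i ≤ h/2 (which are what make the coefficients sum to 1/5), the four patterns can be written compactly as follows; the sketch verifies the sum for each DGP and dimension.

```python
import numpy as np

def coeffs(dgp, h):
    """True coefficient vectors for the four generic DGPs of Section 4.1.1."""
    i = np.arange(1, h + 1)
    if dgp == 1:
        return np.zeros(h)
    if dgp == 2:
        return (i <= h / 10) * 10 / (5 * h)    # first 10% of regressors nonzero
    if dgp == 3:
        return (i <= h / 2) * 2 / (5 * h)      # first half nonzero
    return np.full(h, 1 / (5 * h))             # DGP-4: all regressors nonzero

# Coefficients sum to 1/5 under DGP-2, DGP-3, and DGP-4 for every h considered.
for h in (10, 100, 200):
    for dgp in (2, 3, 4):
        assert abs(coeffs(dgp, h).sum() - 0.2) < 1e-12
```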
We run the parsimonious regression models yt = βi xit + uit for i = 1, . . . , h, and compute the max test statistic Tn = max{(√n βn1)², . . . , (√n βnh)²}. To compute a p-value, we draw M = 5000 samples from the asymptotic distribution under H0 : b = 0. We use the asymptotic distribution without the assumption of conditional homoskedasticity (i.e. Λij = E[ϵt² Xit X′jt]).[13]
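The p-value computation can be sketched as follows: draw M samples of the scaled coefficient vector from its limit distribution under H0, take the maximum of squares in each draw, and report the exceedance frequency. The covariance matrix V and the observed statistic below are hypothetical placeholders; in practice V is estimated from the data (the robust version uses Λij = E[ϵt² Xit X′jt]).

```python
import numpy as np

rng = np.random.default_rng(5)
h, M = 10, 5000

# Hypothetical limit covariance of [sqrt(n)*beta_1, ..., sqrt(n)*beta_h] under H0;
# an equicorrelated matrix is used purely for illustration.
V = 0.3 * np.ones((h, h)) + 0.7 * np.eye(h)

draws = rng.multivariate_normal(np.zeros(h), V, size=M)
T_sim = (draws ** 2).max(axis=1)               # M draws of the null max statistic

T_obs = 4.0                                    # hypothetical observed statistic
p_value = (T_sim >= T_obs).mean()              # simulated p-value
```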
For comparison, we also run the full regression model yt = x′t β + ut and perform asymptotic and bootstrapped Wald tests. The asymptotic Wald test is the well-known χ²-test with h degrees of freedom. For the bootstrapped Wald test, we generate B = 1000 bootstrap samples based on Goncalves and Kilian's (2004, Sec. 3.1) recursive-design wild bootstrap with standard normal innovations. An advantage of this bootstrap is that it is robust to conditional heteroskedasticity of unknown form.
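For intuition, a minimal bootstrapped Wald test might look as follows. This is a fixed-design wild bootstrap with a homoskedastic Wald form, kept deliberately simple; the paper uses Goncalves and Kilian's (2004) recursive-design version, which additionally regenerates the lagged regressors recursively. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, B = 200, 5, 499
X = rng.standard_normal((n, h))
y = rng.standard_normal(n)                     # data generated under H0: b = 0

XtX = X.T @ X
bhat = np.linalg.solve(XtX, X.T @ y)
resid = y - X @ bhat

def wald(b, r):
    """Homoskedastic Wald statistic b' (s^2 (X'X)^{-1})^{-1} b = b' X'X b / s^2."""
    s2 = (r @ r) / (n - h)
    return float(b @ XtX @ b) / s2

W = wald(bhat, resid)

# Wild bootstrap: rescale residuals by i.i.d. N(0,1) draws and re-estimate.
exceed = 0
for _ in range(B):
    ystar = X @ bhat + resid * rng.standard_normal(n)
    bstar = np.linalg.solve(XtX, X.T @ ystar)
    rstar = ystar - X @ bstar
    exceed += wald(bstar - bhat, rstar) > W    # recenter at bhat under the bootstrap
p_value = exceed / B
```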
For each test, we draw J = 1000 Monte Carlo samples and compute rejection frequencies
with respect to nominal size α = 0.05.
4.1.2 Simulation results
See Table 1 for rejection frequencies with ρ = 0 and Table 2 for rejection frequencies with ρ = 0.3. We refrain from reporting results with ρ = 0.6 since they are qualitatively similar to those with ρ = 0.3.[14]
Focusing on DGP-1, the asymptotic Wald test suffers from severe size distortions. See, for
example, the case with (ρ, h, n) = (0, 200, 2000), where the empirical size of the asymptotic Wald
test is 0.320 (Table 1). The max test and the bootstrapped Wald test are correctly sized for all
cases.
We now compare the empirical power of the max test and the bootstrapped Wald test. When ρ = 0, the bootstrapped Wald test is slightly more powerful than the max test in most cases. See, for example, DGP-2 with (ρ, h, n) = (0, 100, 5000), where the empirical power is 0.234 for the max test and 0.339 for the Wald test (Table 1).

[13] We omit the max test with the assumption of conditional homoskedasticity (i.e. Λij = σ²E[Xit X′jt]), since it produces almost the same rejection frequencies as Λij = E[ϵt² Xit X′jt] for all cases considered (see (2.8), (2.9), and Remark 2.2).

[14] The results with ρ = 0.6 are available upon request.
When ρ ∈ {0.3, 0.6}, the max test dominates the Wald test. The empirical power of the max test is sometimes more than three times as high as that of the Wald test. See, for example, DGP-4 with (ρ, h, n) = (0.3, 100, 1000), where the empirical power is 0.815 for the max test and 0.227 for the Wald test (Table 2). We observe similarly overwhelming results in many other cases with ρ ∈ {0.3, 0.6}.

An intuitive explanation for the remarkably high power of the max test is as follows. Performing the Wald test requires the computation of Γ̂ = (1/n) ∑_{t=1}^{n} xt x′t. The resulting Γ̂ becomes an increasingly imprecise estimator of Γ as the dimension h grows or the correlation among the regressors strengthens. The bootstrapped Wald test tends to lose finite sample power in those situations. The max test, in contrast, bypasses the computation of Γ̂ since it only requires the scalars Γ̂ii = (1/n) ∑_{t=1}^{n} xit² for i = 1, . . . , h. Since each Γ̂ii has a much smaller dimension than Γ̂, the accuracy of estimation improves considerably. Hence the max test retains high power when the dimension h or the correlation within xt is relatively large. High dimensionality and strong correlation among regressors (i.e. multicollinearity) are indeed an issue in many economic applications. The max test is a promising alternative to the Wald test in those challenging cases.
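This intuition is easy to check numerically: with equicorrelated regressors, the diagonal entries of Γ̂ (all the max test needs) remain precise, while the full matrix the Wald test must invert becomes badly conditioned. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h, rho = 500, 100, 0.6

# Equicorrelation matrix: unit diagonal, off-diagonal entries rho.
Gamma = np.full((h, h), rho)
np.fill_diagonal(Gamma, 1.0)
x = rng.standard_normal((n, h)) @ np.linalg.cholesky(Gamma).T

Gamma_hat = x.T @ x / n                        # full matrix needed by the Wald test
max_diag_err = np.abs(np.diag(Gamma_hat) - 1.0).max()   # error of the h scalars Gamma_ii
cond_full = np.linalg.cond(Gamma_hat)          # conditioning of the full estimate
cond_pop = (1 + (h - 1) * rho) / (1 - rho)     # population condition number
```

Each diagonal element is a simple sample second moment and stays close to 1, whereas the condition number of the full sample matrix exceeds even the already large population value, so inverting it is numerically delicate.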
4.2 MIDAS framework
4.2.1 Simulation design
We slightly modify the simulation design of Section 4.1 in order to set up a MIDAS framework.
Suppose that there are one high frequency variable xH and one low frequency variable xL, and
that the ratio of their sampling frequencies is m. Assume that xH follows a univariate high
frequency AR(1):

xH(τ, i) = ϕ · xH(τ, i − 1) + ϵH(τ, i),   τ = 1, . . . , n,   i = 1, . . . , m,

where ϕ ∈ {0.8, 0.95} and ϵH(τ, i) ∼ i.i.d. N(0, 1). The choice of ϕ ∈ {0.8, 0.95} is reasonable since many economic and financial time series, especially those observed at a very high frequency, exhibit substantial persistence in levels or even after taking first differences.
The DGP of xL is given by

xL(τ) = ∑_{i=1}^{h} bi xH(τ − 1, m + 1 − i) + ϵL(τ),   τ = 1, . . . , n,

where ϵL(τ) ∼ i.i.d. N(0, 1). The dimension of the regressors is h ∈ {10, 100, 200}, and the low frequency sample size is n ∈ {10h, 50h}.

The true coefficients b = [b1, . . . , bh]′ represent (unidirectional) high-to-low Granger causality. We consider four causal patterns:
DGP-1 bi = 0 for all i ∈ {1, . . . , h}.

DGP-2 bi = I(i ≤ h/10) × 10(20h)−1 for all i ∈ {1, . . . , h}.

DGP-3 bi = I(i ≤ h/2) × 2(20h)−1 for all i ∈ {1, . . . , h}.

DGP-4 bi = (20h)−1 for all i ∈ {1, . . . , h}.
DGP-1 is considered in order to investigate the empirical size of each causality test. The first 10% of the h high frequency lags of xH have nonzero impacts on xL under DGP-2; the first half have nonzero impacts under DGP-3; all lags have nonzero impacts under DGP-4. For each of the latter three DGPs, the magnitude of causality sums to ∑_{i=1}^{h} bi = 1/20.[15]
We choose m = 12 since it occurs in many empirical applications involving (i) monthly and
annual data and (ii) weekly and quarterly data. In extra simulations not reported here, we also
considered other common cases of m = 3 (i.e. monthly and quarterly mixture) and m = 22 (i.e.
daily and monthly mixture). We found that rejection frequencies are not sensitive to the choice
of m given our simulation design.
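The data generation in this design can be sketched as follows, using flat chronological indexing for the high frequency series so that xH(τ − 1, m + 1 − i) is simply the i-th most recent high frequency observation before low frequency period τ. The sizes here are smaller than those in the simulations, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
phi, m, n, h = 0.8, 12, 100, 10
b = np.full(h, 1 / (20 * h))                   # DGP-4 pattern, sums to 1/20

# High frequency AR(1), generated in flat (chronological) indexing.
T = (n + 1) * m                                # one extra LF period for initial lags
xH = np.zeros(T)
for t in range(1, T):
    xH[t] = phi * xH[t - 1] + rng.standard_normal()

# Low frequency series driven by the h most recent high frequency lags.
xL = np.empty(n)
for tau in range(n):
    start = (tau + 1) * m                      # first HF index of LF period tau
    lags = xH[start - 1 : start - h - 1 : -1]  # xH(tau-1, m+1-i), i = 1..h
    xL[tau] = b @ lags + rng.standard_normal()
```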
We test H0 : b = 0 via the max test without the assumption of conditional homoskedasticity. We run the parsimonious regression models xL(τ) = βi xH(τ − 1, m + 1 − i) + uL,i(τ) for i = 1, . . . , h, and compute the max test statistic Tn = max{(√n βn1)², . . . , (√n βnh)²}. A p-value is computed with M = 5000 draws from the asymptotic distribution under H0.
For comparison, we also run the full regression model xL(τ) = ∑_{i=1}^{h} βi xH(τ − 1, m + 1 − i) + uL(τ) and perform asymptotic and bootstrapped Wald tests. For the bootstrapped Wald test, we generate B = 1000 bootstrap samples based on Goncalves and Kilian's (2004) recursive-design wild bootstrap with standard normal innovations.
For each test, we draw J = 1000 Monte Carlo samples and compute rejection frequencies
with respect to nominal size α = 0.05.
4.2.2 Simulation results
See Table 3 for rejection frequencies with ϕ = 0.8 and Table 4 for ϕ = 0.95. Our results are essentially similar to those in the generic framework (Section 4.1). First, the asymptotic Wald test suffers from severe size distortions. Second, the max test and the bootstrapped Wald test are correctly sized in all cases.
Third, and importantly, the max test becomes increasingly more powerful than the bootstrapped Wald test as the persistence in xH increases. When ϕ = 0.8, the max test is roughly as powerful as the Wald test in many cases and much more powerful in some cases. See, for example, DGP-2 with (ϕ, h, n) = (0.8, 100, 5000), where the empirical power is 0.856 for the max test and 0.349 for the Wald test (Table 3). When ϕ = 0.95, the max test dominates the Wald test. The empirical power of the max test is sometimes more than four times as high as that of the Wald test. See, for example, DGP-3 with (ϕ, h, n) = (0.95, 200, 2000), where the empirical power is 0.794 for the max test and 0.192 for the Wald test (Table 4). We observe equally overwhelming results in many other cases.

[15] One might be tempted to consider an extra case where, for example, the last 10% of the h high frequency lags of xH have nonzero impacts on xL. Rejection frequencies in such a case are virtually identical to those under DGP-2, since xH follows a simple AR(1) process in our setting.
4.3 Empirical framework
In this section, we consider a more specific MIDAS framework which resembles our empirical
application in Section 5. Consider a low frequency variable xL and a high frequency variable
xH with m = 12. In our empirical application, xL is quarterly U.S. GDP growth and xH is
the weekly interest rate spread in the U.S. Assume that xH follows a univariate high frequency AR(1): xH(τ, i) = ϕH · xH(τ, i − 1) + ϵH(τ, i) with ϵH(τ, i) ∼ i.i.d. N(0, 1). The main DGP is xL(τ) = ϕL · xL(τ − 1) + ∑_{i=1}^{24} bi xH(τ − 1, m + 1 − i) + ϵL(τ) with ϵL(τ) ∼ i.i.d. N(0, 1). We choose ϕH = 0.95 and ϕL = 0.85 to mimic the strong persistence of the interest rate spread and GDP growth. The high frequency lag length of h = 24 weeks also matches our empirical analysis.
As in Section 4.2, we consider four cases for b = [b1, . . . , b24]′. We assume that bi = 0 under
DGP-1; bi = (1/30)× I(i ≤ 3) under DGP-2; bi = (1/120)× I(i ≤ 12) under DGP-3; bi = 1/240
under DGP-4. The low frequency sample size is set to be n ∈ {80, 200} in line with our empirical
application. The smaller value of n = 80 quarters matches a rolling window analysis, while the
larger value of n = 200 quarters resembles a full sample analysis.
We perform the max test without the assumption of conditional homoskedasticity, the asymptotic Wald test, and the bootstrapped Wald test across J = 1000 Monte Carlo samples. The parsimonious regression models used for the max test are xL(τ) = α0i + ∑_{k=1}^{2} αki xL(τ − k) + βi xH(τ − 1, m + 1 − i) + uL,i(τ) for i ∈ {1, . . . , 24}. The full regression model used for the Wald tests is xL(τ) = α0 + ∑_{k=1}^{2} αk xL(τ − k) + ∑_{i=1}^{24} βi xH(τ − 1, m + 1 − i) + uL(τ). Residual diagnostics in the empirical section suggest that these are reasonable specifications given our data.
For comparison, we aggregate the high frequency variable as xH(τ) = xH(τ, m) and rerun the max and Wald tests. This aggregation scheme is called stock sampling since it picks the last high frequency observation in each low frequency time period τ. In our empirical application, we implement stock sampling since the term spread is a stock variable.
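Stock sampling is a one-line operation once the high frequency data are arranged by low frequency period; the flow (average) alternative below is shown only for contrast and is not part of the paper's design.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 8, 12
xH = rng.standard_normal((n, m))               # xH(tau, i), tau = 1..n, i = 1..m

xH_stock = xH[:, -1]                           # stock sampling: last HF obs per period
xH_flow = xH.mean(axis=1)                      # flow (average) sampling, for contrast
```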
The parsimonious regression models used for the low frequency max test are xL(τ) = α0i + ∑_{k=1}^{2} αki xL(τ − k) + βi xH(τ − i) + uL,i(τ) for i ∈ {1, 2, 3}. The full regression model used for the low frequency Wald test is xL(τ) = α0 + ∑_{k=1}^{2} αk xL(τ − k) + ∑_{i=1}^{3} βi xH(τ − i) + uL(τ). Since the lag i is now in low frequency terms, a relatively small lag length hLF = 3 is enough to capture the causality from xH to xL. The nominal size is 5% for each test.
See Table 5 for rejection frequencies. Focusing on the mixed frequency models, the asymp-
totic Wald test suffers from severe size distortions as expected. Its empirical size is 0.491 for
n = 80 and 0.164 for n = 200. The max test and the bootstrapped Wald test have correct size
for both sample sizes.
The max test achieves much higher empirical power than the bootstrapped Wald test for each sample size and causal pattern. See, for example, DGP-3 with n = 80, where the empirical power is 0.538 for the max test and 0.180 for the bootstrapped Wald test. These results highlight the remarkable advantage of the max test in small samples with many parameters.
Focusing on the low frequency models, all tests, including the asymptotic Wald test, are correctly sized. That is not surprising since the hMF = 24 high frequency lags of xH are now replaced with hLF = 3 low frequency lags of the aggregated xH. Given our causal patterns, the max and Wald tests attain equally high power under DGP-2, DGP-3, and DGP-4. The stock aggregation of xH does not result in spurious non-causality under these DGPs.
5 Empirical application
As an empirical application, we test for Granger causality from a weekly term spread (i.e.
difference between long-term and short-term interest rates) to quarterly real GDP growth in the
United States. A decline in the interest rate spread has historically been regarded as a strong predictor of a recession, but recent events cast doubt on its use for such prediction.[16] Recall that, in 2005, the interest rate spread fell substantially due to a relatively constant long-term rate and a soaring short-term rate (known as "Greenspan's Conundrum"), yet a recession did not follow immediately. The subprime mortgage crisis started nearly two years later, in December 2007, and therefore may not be directly related to the 2005 plummet in the term spread. It is therefore of interest to test for causality from the term spread to GDP.
5.1 Data and preliminary analysis
We use seasonally-adjusted quarterly real GDP growth as a business cycle measure. In order
to remove potential seasonal effects remaining after seasonal adjustment, we use annual growth
(i.e. log xL(τ)− log xL(τ − 4)). The short and long term interest rates used for the term spread
are respectively the effective federal funds (FF) rate and 10-year Treasury constant maturity
rate. We aggregate each daily series into a weekly series by picking the last observation in each week (recall that interest rates are stock variables). The sample period covers January 5, 1962 to December 31, 2013, which contains 2,736 weeks or 208 quarters.[17]
Figure 1 shows time series plots of weekly FF rate, weekly 10-year rate, their spread (10Y−FF),
and quarterly GDP growth. The shaded areas represent recession periods defined by the Na-
tional Bureau of Economic Research (NBER). In the first half of the sample period, a sharp
decline of the spread seems to be immediately followed by a recession. In the second half of the
sample period, there appears to be a weaker association, and a larger time lag between a spread
drop and a recession.
See Table 6 for sample statistics. The 10-year rate is about one percentage point higher than the FF rate on average, while average GDP growth is 3.15%. It is interesting that both the short-term and long-term interest rates are positively skewed while their difference has a large negative skewness of −1.198. The term spread has a relatively large kurtosis of 5.611, whereas GDP growth has a smaller kurtosis of 3.543. Because of the large skewness and kurtosis, both the Kolmogorov-Smirnov (KS) test and the Anderson-Darling (AD) test reject the null hypothesis of normality for the FF rate, the 10-year rate, and their spread. The normality hypothesis for GDP growth is accepted by the KS test but rejected by the AD test.

[16] See Stock and Watson (2003) for a survey of the historical relationship between the term spread and the business cycle.

[17] All data are downloaded from the Federal Reserve Economic Data maintained by the Federal Reserve Bank of St. Louis.
In Table 7, we perform Phillips and Perron’s (1988) unit root tests for the weekly term spread
and quarterly GDP growth. We consider a regression model with a constant and drift as well as
a regression model with a constant, drift, and linear time trend. The number of lags included
in the Newey-West long-run variance estimator is l ∈ {2, 4}. The null hypothesis of a unit root is rejected at the 5% level for each variable, model specification, and lag length.[18] Hence the weekly term spread and quarterly GDP growth are likely to be stationary. Note, however, that they exhibit strong persistence: the estimated AR(1) parameter is approximately 0.95 for the spread and 0.87 for GDP growth. In view of our simulation results in Section 4, the max test is expected to be more powerful than the bootstrapped Wald test when the variables of interest are persistent.
The number of weeks contained in each quarter τ , denoted by m(τ), is not constant. In our
sample period, 13 quarters have 12 weeks each, 150 quarters have 13 weeks each, and 45 quarters
have 14 weeks each. While the max test can be applied with time-varying m(τ), we simplify
the analysis by taking a sample average at the end of each quarter, resulting in the following modified term spread {x*H(τ, i)}_{i=1}^{12}:

x*H(τ, i) = xH(τ, i) for i = 1, . . . , 11,   x*H(τ, 12) = (1/(m(τ) − 11)) ∑_{k=12}^{m(τ)} xH(τ, k).

This modification gives us a dataset with n = 208 quarters, m = 12, and therefore T = m × n = 2,496 weeks.
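A sketch of this modification follows; the function name is chosen for illustration, and the 13-week example quarter is hypothetical data.

```python
import numpy as np

def modify_spread(weeks, m_tau):
    """Collapse an m(tau)-week quarter to exactly 12 values: keep weeks 1-11 and
    replace week 12 by the average of weeks 12..m(tau), as in Section 5.1."""
    assert m_tau >= 12 and len(weeks) == m_tau
    out = np.empty(12)
    out[:11] = weeks[:11]                      # x*_H(tau, i) = x_H(tau, i), i <= 11
    out[11] = np.mean(weeks[11:])              # average of the last m(tau) - 11 weeks
    return out

q13 = modify_spread(np.arange(1.0, 14.0), 13)  # a hypothetical 13-week quarter
```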
5.2 Models and tests
Our main interest lies in a rolling window analysis, since we want to observe the dynamic evolution of Granger causality from the term spread to GDP. We set the window size to 80 quarters (i.e. 20 years), so that there are 129 windows in total. The first subsample covers the first quarter of 1962 through the fourth quarter of 1981 (written as 1962:I-1981:IV), the second covers 1962:II-1982:I, and the last covers 1994:I-2013:IV. As a supplemental analysis, we also perform a full sample analysis using all 208 quarters (i.e. 52 years).
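The window count follows from 208 − 80 + 1 = 129. A sketch using 0-based start indices:

```python
# Rolling windows of 80 quarters over 208 quarters give 208 - 80 + 1 = 129 windows.
n_total, window = 208, 80
windows = [(start, start + window) for start in range(n_total - window + 1)]

first = windows[0]     # corresponds to 1962:I-1981:IV in the paper's labeling
last = windows[-1]     # corresponds to 1994:I-2013:IV
```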
We use the same model specifications for both the full sample analysis and the rolling window analysis. First, the full regression model with mixed frequency (MF) data is specified as

xL(τ) = α0 + ∑_{k=1}^{2} αk xL(τ − k) + ∑_{i=1}^{24} βi x*H(τ − 1, 12 + 1 − i) + uL(τ).   (5.1)

[18] The unit root hypothesis is also rejected when the lag length is l = 8. The result with l = 8 is omitted to save space, but is available upon request.
We are regressing the quarterly GDP growth xL(τ) onto a constant, p = 2 quarters of lagged
GDP growth, and hMF = 24 weeks of lagged interest rate spread. We choose p = 2 since the
resulting residuals appear to be serially uncorrelated. We perform the Wald test based on model
(5.1). P-values are computed after generating B = 1000 bootstrap samples based on Goncalves
and Kilian’s (2004) recursive-design wild bootstrap with standard normal innovations. The
asymptotic test is not performed since it is subject to serious size distortions in view of our
simulation results.
Naturally, the parsimonious regression models with MF data are specified as

xL(τ) = α0i + ∑_{k=1}^{2} αki xL(τ − k) + βi x*H(τ − 1, 12 + 1 − i) + uL,i(τ),   i = 1, . . . , 24.   (5.2)
We perform the max test based on model (5.2). P-values are computed based on the robust
covariance matrix with M = 100000 draws from an approximation to the limit distribution
under non-causality.
For comparison, we also perform low frequency (LF) analysis by aggregating the weekly
term spread to the quarterly level. Since the term spread is a stock variable, we aggregate it as
x*H(τ) = x*H(τ, m). The full regression model with LF data is specified as

xL(τ) = α0 + ∑_{k=1}^{2} αk xL(τ − k) + ∑_{i=1}^{3} βi x*H(τ − i) + uL(τ).   (5.3)
This model has p = 2 quarters of lagged xL and hLF = 3 quarters of lagged x*H. We perform the Wald test based on model (5.3), again with Goncalves and Kilian's (2004) recursive-design wild bootstrap.
Similarly, the parsimonious regression models with LF data are specified as

xL(τ) = α0i + ∑_{k=1}^{2} αki xL(τ − k) + βi x*H(τ − i) + uL,i(τ),   i = 1, 2, 3.
We perform the max test based on this model with M = 100000 draws from an approximation
to the limit distribution under non-causality.
5.3 Empirical results
Figure 2 plots rolling window p-values for each causality test over the 129 windows. Unless
otherwise stated, the nominal size is fixed at 5%. All tests except for the MF Wald test find
significant causality from term spread to GDP in early periods. Significant causality is detected
up until 1979:I-1998:IV by the MF max test; up until 1975:III-1995:II by the LF max test;
up until 1974:III-1994:II by the LF Wald test. The MF max test has the longest period of
significant causality. These three tests all agree that there is non-causality in recent periods,
possibly reflecting some structural change in the middle of the entire sample.
The MF Wald test, in contrast, suggests that there is significant causality only after subsample 1990:IV-2010:III, which is somewhat counter-intuitive. This result may stem from parameter proliferation: the full regression model with mixed frequency data has many more parameters than any other model. The results of the MF max test thus seem more intuitive and preferable to those of the MF Wald test.
As a diagnostic check, we perform the Ljung-Box Q-test of serial uncorrelatedness of the
least squares residuals from the MF full regression model (5.1) and LF full regression model
(5.3) in order to check whether these models are well specified. Since the true innovations may
not be serially independent, we use Horowitz, Lobato, Nankervis and Savin’s (2006) double
blocks-of-blocks bootstrap with block size b ∈ {4, 10, 20}. The number of bootstrap samples is
{M1,M2} = {999, 249} for the first and second stages. We perform the Q-tests with 4, 8, or 12
lags for each window and model.
When the block size is b = 4, the null hypothesis of residual uncorrelatedness in the MF case
is rejected at the 5% level in only {13, 5, 1} windows out of 129 for {4, 8, 12} lags, suggesting
that the MF model is well specified. In the LF case, the null hypothesis is rejected at the 5%
level in {51, 23, 33} windows with {4, 8, 12} lags, suggesting that the LF model may not be well
specified. The MF model generally produces fewer rejections than the LF model under larger
block sizes b ∈ {10, 20}. Overall, the MF model seems to yield a better fit than the LF model
in terms of residual uncorrelatedness.
As a brief supplemental analysis, we perform a full sample analysis using all 208 quarters. We
consider the same model specifications and testing procedures as in the rolling window analysis
for easy comparison. Resulting p-values are 0.000 for the MF max test, 0.359 for the MF Wald
test, 0.000 for the LF max test, and 0.008 for the LF Wald test. Only the MF Wald test fails
to reject non-causality, the same result repeatedly observed in the first half of rolling windows.
It again suggests that the MF Wald test is less robust against parameter proliferation.
6 Conclusions
We propose a new test designed for a large set of zero restrictions in regression models. A classical Wald test may have poor finite sample performance when the number of zero restrictions is relatively large. Our proposed solution to the dimensionality problem is to split the key regressors across many parsimonious regression models. The i-th parsimonious regression model contains the small-dimensional common regressors and the i-th key regressor only, so that parameter proliferation is less of an issue. We then take the maximum of the squared estimators across all parsimonious regression models.
The asymptotic distribution of our max test statistic is non-standard under the null hypothesis, but an approximate p-value is readily available by drawing directly from an asymptotically valid approximation to that distribution. Under the alternative hypothesis, at least one of the key estimators has a nonzero probability limit under fairly weak conditions. The max test is therefore consistent.
After presenting the general theory of the max test, the paper focuses on mixed frequency
Granger causality as a prominent application involving many zero restrictions. Granger causality
from a high frequency variable to a low frequency variable is straightforward to handle. The
reverse direction poses a harder challenge in terms of deriving the asymptotic distribution under
the null, identifying the pseudo-true values of key parameters under the alternative, and keeping
parsimony under the null.
We perform Monte Carlo simulations in generic and MIDAS settings. The asymptotic Wald test leads to serious size distortions, while the max test and the bootstrapped Wald test have sharp size. The max test becomes increasingly more powerful than the bootstrapped Wald test as the dimension of the model or the correlation among regressors increases. In many cases, the max test is three or four times as powerful as the bootstrapped Wald test. An essential reason for the remarkably high power of the max test is that, unlike the Wald test, it bypasses the computation of the sample covariance matrix of all regressors.
As an empirical application, we test for Granger causality from a weekly interest rate spread
to real GDP growth in the U.S. The max test with mixed frequency data yields an intuitive
result that the interest rate spread causes GDP growth until the 1990s, after which causal-
ity vanishes. The Wald test with mixed frequency data yields a counter-intuitive result that
significant causality emerges only recently.
We close this paper by discussing a broad range of future applications of the max test. The max test can easily be generalized to an increasing number of parameters, and would therefore apply to, for example, nonparametric regression models using Fourier flexible forms (Gallant and Souza (1991)); Chebyshev, Laguerre, or Hermite polynomials (see e.g. Draper, Smith, and Pownell (1966)); and splines (Rice and Rosenblatt (1983), Friedman (1991)), where our test can detect redundant terms. Another application is a max-correlation white noise test for weakly dependent time series (see e.g. Xiao and Wu (2014), Hill and Motegi (2019a), and Hill and Motegi (2019b)). These are only a few of many examples involving a large, possibly infinite, set of zero restrictions. We leave them as areas of future research.
References
Andersen, N. T., and V. Dobric (1987): “The central limit theorem for stochastic processes,”Annals of Probability, 15, 164–177.
Berman, S. M. (1964): “Limit theorems for the maximum term in stationary sequences,”Annals of Mathematical Statistics, 35, 502–516.
22
Billingsley, P. (1961): "The Lindeberg-Levy theorem for martingales," Proceedings of the American Mathematical Society, 12, 788–792.

Davidson, R., and J. G. MacKinnon (2006): "The power of bootstrap and asymptotic tests," Journal of Econometrics, 133, 421–441.

Draper, N. R., H. Smith, and E. Pownell (1966): Applied Regression Analysis. Wiley, New York.

Dudley, R. M. (1978): "Central limit theorems for empirical measures," Annals of Probability, 6, 899–929.

Dufour, J. M., D. Pelletier, and E. Renault (2006): "Short run and long run causality in time series: Inference," Journal of Econometrics, 132, 337–362.

Dufour, J. M., and E. Renault (1998): "Short run and long run causality in time series: Theory," Econometrica, 66, 1099–1125.

Foroni, C., E. Ghysels, and M. Marcellino (2013): "Mixed-frequency vector autoregressive models," in VAR Models in Macroeconomics – New Developments and Applications: Essays in Honor of Christopher A. Sims, ed. by T. B. Fomby, L. Kilian, and A. Murphy, vol. 32, pp. 247–272. Emerald Group Publishing Limited.

Friedman, J. H. (1991): "Multivariate adaptive regression splines," Annals of Statistics, 19, 1–67.

Gallant, A. R., and G. Souza (1991): "On the asymptotic normality of Fourier flexible form estimates," Journal of Econometrics, 50, 329–353.

Ghysels, E. (2013): "Matlab Toolbox for mixed sampling frequency data analysis using MIDAS regression models," Available at http://www.mathworks.com/matlabcentral/fileexchange/45150-midas-matlab-toolbox.

Ghysels, E. (2016): "Macroeconomics and the reality of mixed frequency data," Journal of Econometrics, 193, 294–314.

Ghysels, E., J. B. Hill, and K. Motegi (2016): "Testing for Granger causality with mixed frequency data," Journal of Econometrics, 192, 207–230.

Ghysels, E., V. Kvedaras, and V. Zemlys (2016): "Mixed frequency data sampling regression models: The R package midasr," Journal of Statistical Software, 72, 1–35.

Ghysels, E., P. Santa-Clara, and R. Valkanov (2006): "Predicting volatility: Getting the most out of return data sampled at different frequencies," Journal of Econometrics, 131, 59–95.

Ghysels, E., A. Sinko, and R. Valkanov (2007): "MIDAS regressions: Further results and new directions," Econometric Reviews, 26, 53–90.

Gnedenko, B. V. (1943): "Sur la distribution limite du terme maximum d'une série aléatoire," Annals of Mathematics, 44, 423–453.

Gonçalves, S., and L. Kilian (2004): "Bootstrapping autoregressions with conditional heteroskedasticity of unknown form," Journal of Econometrics, 123, 89–120.
Götz, T. B., and A. Hecq (2014): "Nowcasting causality in mixed frequency vector autoregressive models," Economics Letters, 122, 74–78.

Götz, T. B., A. Hecq, and S. Smeekes (2016): "Testing for Granger causality in large mixed-frequency VARs," Journal of Econometrics, 193, 418–432.

Granger, C. W. J. (1969): "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, 37, 424–438.

Hannan, E. J. (1974): "The uniform convergence of autocovariances," Annals of Statistics, 2, 803–806.

Hansen, B. (1996): "Inference when a nuisance parameter is not identified under the null hypothesis," Econometrica, 64, 413–430.

Hill, J. B. (2007): "Efficient tests of long-run causation in trivariate VAR processes with a rolling window study of the money-income relationship," Journal of Applied Econometrics, 22, 747–765.

Hill, J. B., and K. Motegi (2019a): "A max-correlation white noise test for weakly dependent time series," Econometric Theory, forthcoming.

Hill, J. B., and K. Motegi (2019b): "Testing the white noise hypothesis of stock returns," Economic Modelling, 76, 231–242.

Horowitz, J., I. N. Lobato, J. C. Nankervis, and N. E. Savin (2006): "Bootstrapping the Box-Pierce Q test: A robust test of uncorrelatedness," Journal of Econometrics, 133, 841–862.

Lütkepohl, H. (1993): "Testing for causation between two variables in higher-dimensional VAR models," in Studies in Applied Econometrics, ed. by H. Schneeweiß and K. Zimmermann, pp. 75–91. Physica-Verlag, Heidelberg.

Phillips, P. C. B., and P. Perron (1988): "Testing for a unit root in time series regression," Biometrika, 75, 335–346.

Rice, J., and M. Rosenblatt (1983): "Smoothing splines: Regression, derivatives and deconvolution," Annals of Statistics, 11, 141–156.

Stock, J. H., and M. W. Watson (2003): "Forecasting output and inflation: The role of asset prices," Journal of Economic Literature, 41, 788–829.

White, H. (2000): "A reality check for data snooping," Econometrica, 68, 1097–1126.

Xiao, H., and W. B. Wu (2014): "Portmanteau test and simultaneous inference for serial covariances," Statistica Sinica, 24, 577–600.
Technical Appendices
A Proofs of main results
In this section we prove Theorems 2.1–2.5. Proofs of Theorem 2.6 and Corollary 2.7 are omitted since they are straightforward.
A.1 Proof of Theorem 2.1
Recall from (2.6) that the $i$th parsimonious regression model is written as $y_t = X_{it}'\theta_i + u_{it}$, where $X_{it} = [z_{1t}, \ldots, z_{pt}, x_{it}]'$ and $\theta_i = [\alpha_i', \beta_i]' = [\alpha_{1i}, \ldots, \alpha_{pi}, \beta_i]'$. Let $\hat{\theta}_{ni} = [\hat{\alpha}_{ni}', \hat{\beta}_{ni}]' = [\hat{\alpha}_{n1i}, \ldots, \hat{\alpha}_{npi}, \hat{\beta}_{ni}]'$ be the least squares estimator of $\theta_i$.

In order to characterize the limit distribution of $\mathcal{T}_n = \max_{1 \le i \le h} (\sqrt{n}\hat{\beta}_{ni})^2$, we must show convergence of the finite dimensional distributions of $\{\sqrt{n}\hat{\beta}_{ni}\}_{i=1}^{h}$ and stochastic equicontinuity (e.g. Dudley (1978) and Andersen and Dobric (1987)). By discreteness of $i$, note that $\forall (\epsilon, \eta) > 0$ there exists $\delta \in (0,1)$ such that $\sup_{1 \le i, \tilde{i} \le h :\, |i - \tilde{i}| \le \delta} |\sqrt{n}\hat{\beta}_{ni} - \sqrt{n}\hat{\beta}_{n\tilde{i}}| = 0$ a.s. Therefore $\lim_{n \to \infty} P(\sup_{1 \le i, \tilde{i} \le h :\, |i - \tilde{i}| \le \delta} |\sqrt{n}\hat{\beta}_{ni} - \sqrt{n}\hat{\beta}_{n\tilde{i}}| > \eta) \le \epsilon$ for some $\delta > 0$, hence $\{\sqrt{n}\hat{\beta}_{ni}\}_{i=1}^{h}$ is stochastically equicontinuous.

Let $\beta = [\beta_1, \ldots, \beta_h]'$ and $\hat{\beta}_n = [\hat{\beta}_{n1}, \ldots, \hat{\beta}_{nh}]'$. Stack all parameters across the $h$ parsimonious regression models as $\theta = [\theta_1', \ldots, \theta_h']'$ and $\hat{\theta}_n = [\hat{\theta}_{n1}', \ldots, \hat{\theta}_{nh}']'$. Define an $h \times (p+1)h$ full row rank selection matrix $R$ such that $\hat{\beta}_n = R\hat{\theta}_n$. See (2.7) for the precise construction of $R$.

To prove Theorem 2.1, it suffices to show that $\sqrt{n}\hat{\beta}_n \xrightarrow{d} N(0, V)$ under $H_0 : b = 0$, where $V$ is defined in (2.8). The main statement that $\mathcal{T}_n = \max_{1 \le i \le h} (\sqrt{n}\hat{\beta}_{ni})^2 \xrightarrow{d} \max_{1 \le i \le h} \mathcal{N}_i^2$, where $\mathcal{N} = [\mathcal{N}_1, \ldots, \mathcal{N}_h]' \sim N(0, V)$, then follows instantly from the continuous mapping theorem since $\max\{\cdot\}$ is a continuous function.

Stack the underlying coefficients as $\theta_{0i} = [a_1, \ldots, a_p, b_i]'$ and $\theta_0 = [\theta_{01}', \ldots, \theta_{0h}']'$. It is sufficient to show that $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, S)$ under $H_0 : b = 0$, where $S$ is defined in (2.9). This result suffices since $\sqrt{n}\hat{\beta}_n = R \times \sqrt{n}(\hat{\theta}_n - \theta_0)$ under $H_0$ and $V = RSR'$ by construction.

Under $H_0$, we have that $\theta_{0i} = [a_1, \ldots, a_p, 0]'$ for $i = 1, \ldots, h$. Hence the $i$th parsimonious regression model includes the true DGP as $y_t = X_{it}'\theta_{0i} + \epsilon_t$. Define $\Gamma_{ij} = E[X_{it}X_{jt}']$, $\Lambda_{ij} = E[\epsilon_t^2 X_{it}X_{jt}']$, $\Sigma_{ij} = \Gamma_{ii}^{-1}\Lambda_{ij}\Gamma_{jj}^{-1}$, and $S = [\Sigma_{ij}]$ for $i, j \in \{1, \ldots, h\}$, as in (2.9).

Under Assumption 2.3, $X_t$ is stationary and ergodic. Coupled with square integrability, the ergodic theorem yields

$$\hat{\Gamma}_{ii} = \frac{1}{n}\sum_{t=1}^{n} X_{it}X_{it}' \xrightarrow{p} \Gamma_{ii}, \tag{A.1}$$

where $\Gamma_{ii}$ is a positive definite matrix under Assumption 2.2. By (A.1),

$$\sqrt{n}(\hat{\theta}_{ni} - \theta_{0i}) = \sqrt{n}\left(\sum_{t=1}^{n} X_{it}X_{it}'\right)^{-1}\sum_{t=1}^{n} X_{it}\epsilon_t = \Gamma_{ii}^{-1} \times \frac{1}{\sqrt{n}}\sum_{t=1}^{n} X_{it}\epsilon_t + o_p(1).$$

Pick any $\lambda = [\lambda_1', \ldots, \lambda_h']'$ with $\lambda_i \in \mathbb{R}^{p+1}$ and $\lambda_i'\lambda_i = 1$. Define $\mathcal{X}_t(\lambda) = \sum_{i=1}^{h} \lambda_i'\Gamma_{ii}^{-1}X_{it}$. We have

$$\lambda' \times \sqrt{n}(\hat{\theta}_n - \theta_0) = \sum_{i=1}^{h} \lambda_i'\sqrt{n}(\hat{\theta}_{ni} - \theta_{0i}) = \sum_{i=1}^{h} \lambda_i'\left(\Gamma_{ii}^{-1} \times \frac{1}{\sqrt{n}}\sum_{t=1}^{n} X_{it}\epsilon_t\right) + o_p(1)$$
$$= \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\left(\sum_{i=1}^{h} \lambda_i'\Gamma_{ii}^{-1}X_{it}\right)\epsilon_t + o_p(1) = \frac{1}{\sqrt{n}}\sum_{t=1}^{n} \mathcal{X}_t(\lambda)\epsilon_t + o_p(1).$$

Observe that

$$E[\mathcal{X}_t(\lambda)^2\epsilon_t^2] = \sum_{i=1}^{h}\sum_{j=1}^{h} \lambda_i'\Gamma_{ii}^{-1}\Lambda_{ij}\Gamma_{jj}^{-1}\lambda_j = \lambda' S\lambda < \infty.$$

Under Assumptions 2.1–2.3, $\{\mathcal{X}_t(\lambda)\epsilon_t\}$ is a stationary, ergodic, square integrable martingale difference. Therefore Billingsley's (1961) central limit theorem applies to yield $(1/\sqrt{n})\sum_{t=1}^{n} \mathcal{X}_t(\lambda)\epsilon_t \xrightarrow{d} N(0, \lambda' S\lambda)$. By the Cramér–Wold theorem, $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, S)$. Hence

$$\sqrt{n}\hat{\beta}_n = R \times \sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$$

and therefore

$$\mathcal{T}_n = \max\{(\sqrt{n}\hat{\beta}_{n1})^2, \ldots, (\sqrt{n}\hat{\beta}_{nh})^2\} \xrightarrow{d} \max\{\mathcal{N}_1^2, \ldots, \mathcal{N}_h^2\}. \quad \text{QED}$$
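The construction underlying $\mathcal{T}_n$ is easy to illustrate numerically. The sketch below is our own illustration, not the authors' code (all variable and function names are ours): it runs the $h$ parsimonious OLS regressions on simulated data satisfying $H_0 : b = 0$ and computes the max statistic.

```python
import numpy as np

def max_test_statistic(y, Z, X):
    """T_n = max_i (sqrt(n) * beta_ni)^2, where beta_ni is the OLS
    coefficient on the single key regressor x_it in the i-th
    parsimonious regression of y_t on [z_t', x_it]."""
    n, h = X.shape
    stats = []
    for i in range(h):
        W = np.column_stack([Z, X[:, i]])            # regressors [z_t', x_it]
        theta, *_ = np.linalg.lstsq(W, y, rcond=None)
        stats.append(n * theta[-1] ** 2)             # (sqrt(n) * beta_ni)^2
    return max(stats)

rng = np.random.default_rng(0)
n, p, h = 500, 2, 10
Z = rng.standard_normal((n, p))                      # regressors not under test
X = rng.standard_normal((n, h))                      # key regressors, split across models
y = Z @ np.array([0.5, -0.3]) + rng.standard_normal(n)   # H0: b = 0 holds
T_n = max_test_statistic(y, Z, X)
```

Under Theorem 2.1, $\mathcal{T}_n$ would then be compared against simulated quantiles of $\max_i \mathcal{N}_i^2$ with $\mathcal{N} \sim N(0, \hat{V}_n)$; that step is omitted here.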
A.2 Proof of Theorem 2.2
$\mathcal{T}_n = \max_{1 \le i \le h} (\sqrt{n}\hat{\beta}_{ni})^2$ operates on a discrete-valued stochastic function $g_n(i) \equiv \hat{\beta}_{ni}$. As shown in the proof of Theorem 2.1, this implies that weak convergence of $\{g_n(1), \ldots, g_n(h)\}$ is identical to convergence of the finite dimensional distributions of $\{g_n(1), \ldots, g_n(h)\}$. The proof of Hansen's (1996) Theorem 2 therefore carries over to the present theorem. QED
A.3 Proof of Theorem 2.3
Under Assumptions 2.1–2.3, $\{y_t, X_{it}, \epsilon_t\}$ are square integrable, stationary, and ergodic. Hence $\hat{\Gamma}_{ij} \xrightarrow{p} \Gamma_{ij}$ and $\hat{\theta}_{ni} \xrightarrow{p} (E[X_{it}X_{it}'])^{-1}E[X_{it}y_t] \equiv \theta_i^*$. Under $H_0$, the parsimonious regression model is identically $y_t = X_{it}'\theta_{0i} + \epsilon_t$ with $\theta_{0i} = [a_1, \ldots, a_p, 0]'$ for $i = 1, \ldots, h$. Hence $\hat{\theta}_{ni} \xrightarrow{p} \theta_{0i}$ and $\theta_i^* = \theta_{0i}$.

Combined with stationarity, ergodicity, and square integrability, $\hat{\Lambda}_{ij} \xrightarrow{p} \Lambda_{ij}^* \equiv E[(y_t - X_{it}'\theta_i^*)^2 X_{it}X_{jt}']$ with $\|\Lambda_{ij}^*\| < \infty$. Also, $\Lambda_{ij}^* = \Lambda_{ij}$ under $H_0$. Therefore $\hat{V}_n \xrightarrow{p} V \equiv RSR'$ under $H_0$. Under $H_1 : b \neq 0$, it follows that $\hat{V}_n \xrightarrow{p} RS^*R'$ with $S^* = [\Gamma_{ii}^{-1}\Lambda_{ij}^*\Gamma_{jj}^{-1}]_{i,j}$ and $\|RS^*R'\| < \infty$. QED
A.4 Proof of Theorem 2.4
Recall that the $i$th parsimonious regression model is $y_t = X_{it}'\theta_i + u_{it}$. In view of stationarity, square integrability, and ergodicity, the least squares estimator satisfies $\hat{\theta}_{ni} \xrightarrow{p} \theta_i^*$, where $\theta_i^* = (E[X_{it}X_{it}'])^{-1}E[X_{it}y_t]$. Recall from (2.2) that the DGP is $y_t = z_t'a + x_t'b + \epsilon_t$. Therefore

$$\theta_i^* = \left(E[X_{it}X_{it}']\right)^{-1}E[X_{it}(z_t'a + x_t'b + \epsilon_t)] = \left(E[X_{it}X_{it}']\right)^{-1}\left(E[X_{it}z_t']a + E[X_{it}x_t']b + E[X_{it}\epsilon_t]\right)$$
$$= \left(E[X_{it}X_{it}']\right)^{-1}\left(E[X_{it}z_t']a + E[X_{it}x_t']b\right) = \Gamma_{ii}^{-1}\left(E[X_{it}z_t']a + C_i b\right), \tag{A.2}$$

where the third equality holds by the mds assumption on $\epsilon_t$. Note that $z_t = [I_p, 0_{p \times 1}]X_{it}$ for $i \in \{1, \ldots, h\}$ by construction. Hence

$$E[X_{it}z_t'] = E[X_{it}X_{it}'] \times \begin{bmatrix} I_p \\ 0_{1 \times p} \end{bmatrix} = \Gamma_{ii} \times \begin{bmatrix} I_p \\ 0_{1 \times p} \end{bmatrix}. \tag{A.3}$$

Substitute (A.3) into (A.2) to get the desired result (2.10). QED
A.5 Proof of Theorem 2.5
Pick the last row of (2.10). The lower left block of $\Gamma_{ii}^{-1}$ is

$$-n_i^{-1}E[x_{it}z_t']\left(E[z_tz_t']\right)^{-1},$$

while the lower right block is simply $n_i^{-1}$, where

$$n_i = E[x_{it}^2] - E[x_{it}z_t']\left(E[z_tz_t']\right)^{-1}E[z_tx_{it}].$$

Hence, the last row of $\Gamma_{ii}^{-1}C_i$ appearing in (2.10) is $n_i^{-1}d_i'$, where

$$d_i = E[x_tx_{it}] - E[x_tz_t']\left(E[z_tz_t']\right)^{-1}E[z_tx_{it}]. \tag{A.4}$$

Impose $\beta^* = 0$ hereafter. Then $n_i^{-1}d_i'b = 0$ for any $i \in \{1, \ldots, h\}$ in view of (2.10). Since $n_i$ is a nonzero finite scalar for any $i \in \{1, \ldots, h\}$ by the nonsingularity of $E[X_{it}X_{it}']$, we have that $d_i'b = 0$ for all $i \in \{1, \ldots, h\}$. Stack these equations to get $Db = 0$, where $D = [d_1, \ldots, d_h]' \in \mathbb{R}^{h \times h}$. To prove the main statement that $b = 0$, it is sufficient to show that $D$ is nonsingular. Equation (A.4) implies that

$$D = E[x_tx_t'] - E[x_tz_t']\left(E[z_tz_t']\right)^{-1}E[z_tx_t'].$$

Now define

$$\Gamma = E\left[\begin{bmatrix} z_t \\ x_t \end{bmatrix}\begin{bmatrix} z_t' & x_t' \end{bmatrix}\right].$$

$\Gamma$ is trivially nonsingular by Assumption 2.2. Note that $D$ is the Schur complement of $\Gamma$ with respect to $E[z_tz_t']$. Therefore, by the classic argument of partitioned matrix inversion, $D$ is nonsingular, as desired. Thus $\beta^* = 0$ implies $b = 0$. QED
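The Schur complement step can be checked numerically: if the covariance matrix $\Gamma$ of $[z_t', x_t']'$ is positive definite, then $D$, its Schur complement with respect to $E[z_tz_t']$, is itself positive definite and hence nonsingular. A minimal sketch with an arbitrary positive definite $\Gamma$ (the matrix and names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, h = 3, 4
A = rng.standard_normal((p + h, p + h))
Gamma = A @ A.T + (p + h) * np.eye(p + h)    # positive definite by construction

Ezz = Gamma[:p, :p]                          # E[z_t z_t'] block
Ezx = Gamma[:p, p:]                          # E[z_t x_t'] block
Exx = Gamma[p:, p:]                          # E[x_t x_t'] block

# D = E[x_t x_t'] - E[x_t z_t'] (E[z_t z_t'])^{-1} E[z_t x_t'],
# the Schur complement of Gamma with respect to E[z_t z_t']
D = Exx - Ezx.T @ np.linalg.inv(Ezz) @ Ezx

min_eig = np.linalg.eigvalsh(D).min()        # positive iff D is positive definite
```

A positive smallest eigenvalue confirms that $D$ is nonsingular, exactly as the partitioned-inverse argument asserts.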
B Double time indices
In this section, we introduce useful notation for handling mixed frequency data. Consider a high frequency variable $x_H$ and a low frequency variable $x_L$. Let $\tau \in \mathbb{Z}$ be a low frequency time period, and suppose that each low frequency time period contains $m$ high frequency time periods. In period $\tau$, we observe $\{x_H(\tau, 1), x_H(\tau, 2), \ldots, x_H(\tau, m), x_L(\tau)\}$ sequentially.
It is often useful to adopt a notational convention that allows the second argument of $x_H$ to be an arbitrary integer. It is understood, for example, that $x_H(\tau, 0) = x_H(\tau - 1, m)$, $x_H(\tau, -1) = x_H(\tau - 1, m - 1)$, and $x_H(\tau, m + 1) = x_H(\tau + 1, 1)$. In general, what we call a high frequency simplification operates as follows:

$$x_H(\tau, i) = \begin{cases} x_H\left(\tau - \left\lceil \frac{1-i}{m} \right\rceil,\; m\left\lceil \frac{1-i}{m} \right\rceil + i\right) & \text{if } i \le 0, \\[4pt] x_H\left(\tau + \left\lfloor \frac{i-1}{m} \right\rfloor,\; i - m\left\lfloor \frac{i-1}{m} \right\rfloor\right) & \text{if } i \ge m + 1. \end{cases} \tag{B.1}$$

Here $\lceil x \rceil$ is the smallest integer not smaller than $x$, while $\lfloor x \rfloor$ is the largest integer not larger than $x$. By applying the high frequency simplification, any integer in the second argument of $x_H$ is transformed into a natural number between 1 and $m$, with the first argument modified appropriately. One can indeed verify that $m\lceil (1-i)/m \rceil + i \in \{1, \ldots, m\}$ when $i \le 0$, and $i - m\lfloor (i-1)/m \rfloor \in \{1, \ldots, m\}$ when $i \ge m + 1$.

Since the high frequency simplification allows both arguments of $x_H$ to be any integer, the following low frequency simplification is well defined:

$$x_H(\tau - \tau', i) = x_H(\tau, i - m\tau'), \quad \forall \tau, \tau', i \in \mathbb{Z}. \tag{B.2}$$

Equation (B.2) states that any lag or lead $\tau'$ appearing in the first argument can be removed by modifying the second argument appropriately. The second argument may then fall below 1 or above $m$, but such cases are covered by the high frequency simplification (B.1). Equations (B.1) and (B.2) are of practical use when writing DGPs or regression models involving mixed frequency data.
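A small helper makes the two simplification rules concrete. The sketch below (the function name is ours) maps any index pair $(\tau, i)$ to its normalized form with second argument in $\{1, \ldots, m\}$, implementing (B.1); applying it after subtracting $m\tau'$ from the second argument implements (B.2).

```python
import math

def simplify(tau, i, m):
    """Normalize x_H(tau, i) so the second argument lies in {1, ..., m},
    following the high frequency simplification (B.1)."""
    if i <= 0:
        k = math.ceil((1 - i) / m)          # shift back k low frequency periods
        return tau - k, m * k + i
    if i >= m + 1:
        k = math.floor((i - 1) / m)         # shift forward k low frequency periods
        return tau + k, i - m * k
    return tau, i                           # already in {1, ..., m}

# Examples from the text, with m = 12:
print(simplify(5, 0, 12))    # x_H(5, 0)  = x_H(4, 12)
print(simplify(5, -1, 12))   # x_H(5, -1) = x_H(4, 11)
print(simplify(5, 13, 12))   # x_H(5, 13) = x_H(6, 1)
```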
Table 1: Rejection frequencies under generic framework ($\rho = 0$)

DGP-1: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = 0$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .029      .032        .040       .051        .049       .052
    Wald (asy)     .110      .051        .193       .081        .320       .084
    Wald (boot)    .054      .043        .062       .054        .061       .063

DGP-2: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = I(i \le h/10) \times 10(5h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .175      .949        .057       .234        .049       .128
    Wald (asy)     .306      .896        .302       .410        .393       .339
    Wald (boot)    .199      .874        .088       .339        .082       .275

DGP-3: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = I(i \le h/2) \times 2(5h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .051      .145        .049       .073        .051       .052
    Wald (asy)     .155      .225        .227       .133        .335       .111
    Wald (boot)    .080      .206        .075       .098        .067       .084

DGP-4: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = (5h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .048      .084        .038       .065        .041       .045
    Wald (asy)     .118      .130        .205       .096        .307       .089
    Wald (boot)    .069      .120        .050       .079        .045       .072

Rejection frequencies of the max test without the assumption of conditional homoskedasticity, the asymptotic Wald test, and the bootstrapped Wald test across J = 1000 Monte Carlo samples. The dimension of the regressors is $h \in \{10, 100, 200\}$, and the sample size is $n \in \{10h, 50h\}$. Any pair of regressors has correlation coefficient $\rho = 0$. The nominal size is 5%.
Table 2: Rejection frequencies under generic framework ($\rho = 0.3$)

DGP-1: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = 0$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .031      .051        .046       .054        .047       .033
    Wald (asy)     .101      .052        .210       .085        .320       .083
    Wald (boot)    .066      .050        .043       .058        .062       .061

DGP-2: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = I(i \le h/10) \times 10(5h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .232      .949        .833      1.000        .992      1.000
    Wald (asy)     .320      .879        .537       .994        .748      1.000
    Wald (boot)    .235      .877        .258       .991        .368      1.000

DGP-3: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = I(i \le h/2) \times 2(5h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .137      .609        .813      1.000        .987      1.000
    Wald (asy)     .203      .484        .489       .980        .730      1.000
    Wald (boot)    .123      .464        .210       .970        .306       .998

DGP-4: $y_t = \sum_{i=1}^{h} b_i x_{it} + \epsilon_t$ with $b_i = (5h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .113      .582        .815      1.000        .993      1.000
    Wald (asy)     .175      .436        .527       .978        .715      1.000
    Wald (boot)    .106      .420        .227       .963        .313      1.000

Rejection frequencies of the max test without the assumption of conditional homoskedasticity, the asymptotic Wald test, and the bootstrapped Wald test across J = 1000 Monte Carlo samples. The dimension of the regressors is $h \in \{10, 100, 200\}$, and the sample size is $n \in \{10h, 50h\}$. Any pair of regressors has correlation coefficient $\rho = 0.3$. The nominal size is 5%.
Table 3: Rejection frequencies under MIDAS framework ($\phi = 0.8$)

DGP-1: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = 0$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .042      .033        .040       .050        .057       .048
    Wald (asy)     .108      .048        .217       .075        .301       .074
    Wald (boot)    .068      .040        .047       .056        .041       .046

DGP-2: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = I(i \le h/10) \times 10(20h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .073      .270        .163       .856        .157       .870
    Wald (asy)     .131      .208        .296       .409        .414       .407
    Wald (boot)    .075      .187        .097       .349        .084       .312

DGP-3: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = I(i \le h/2) \times 2(20h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .070      .252        .081       .217        .059       .163
    Wald (asy)     .141      .165        .239       .130        .308       .142
    Wald (boot)    .081      .149        .060       .098        .056       .092

DGP-4: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = (20h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .076      .204        .052       .123        .046       .095
    Wald (asy)     .122      .123        .253       .121        .317       .103
    Wald (boot)    .069      .104        .064       .087        .066       .060

Rejection frequencies of the max test without the assumption of conditional homoskedasticity, the asymptotic Wald test, and the bootstrapped Wald test across J = 1000 Monte Carlo samples. The dimension of the regressors is $h \in \{10, 100, 200\}$, and the low frequency sample size is $n \in \{10h, 50h\}$. $x_H$ follows a univariate high frequency AR(1): $x_H(\tau, i) = \phi \times x_H(\tau, i - 1) + \epsilon_H(\tau, i)$ with $\phi = 0.8$. The nominal size is 5%.
Table 4: Rejection frequencies under MIDAS framework ($\phi = 0.95$)

DGP-1: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = 0$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .043      .039        .044       .042        .037       .051
    Wald (asy)     .121      .045        .222       .086        .306       .075
    Wald (boot)    .067      .035        .062       .060        .064       .046

DGP-2: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = I(i \le h/10) \times 10(20h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .274      .910        .969      1.000        .999      1.000
    Wald (asy)     .241      .701        .663      1.000        .872      1.000
    Wald (boot)    .150      .684        .386      1.000        .536      1.000

DGP-3: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = I(i \le h/2) \times 2(20h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .258      .915        .800      1.000        .794      1.000
    Wald (asy)     .231      .649        .486       .986        .592       .978
    Wald (boot)    .147      .627        .206       .979        .192       .951

DGP-4: $x_L(\tau) = \sum_{i=1}^{h} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$, $b_i = (20h)^{-1}$

                       h = 10                h = 100                h = 200
                 n = 100   n = 500    n = 1000   n = 5000    n = 2000   n = 10000
    Max            .258      .870        .550       .999        .451      1.000
    Wald (asy)     .232      .605        .376       .835        .464       .733
    Wald (boot)    .151      .573        .129       .790        .113       .645

Rejection frequencies of the max test without the assumption of conditional homoskedasticity, the asymptotic Wald test, and the bootstrapped Wald test across J = 1000 Monte Carlo samples. The dimension of the regressors is $h \in \{10, 100, 200\}$, and the low frequency sample size is $n \in \{10h, 50h\}$. $x_H$ follows a univariate high frequency AR(1): $x_H(\tau, i) = \phi \times x_H(\tau, i - 1) + \epsilon_H(\tau, i)$ with $\phi = 0.95$. The nominal size is 5%.
Table 5: Rejection frequencies under empirical framework

Mixed frequency models

                     DGP-1             DGP-2             DGP-3             DGP-4
                 n = 80  n = 200   n = 80  n = 200   n = 80  n = 200   n = 80  n = 200
    Max           .042     .049     .557     .949     .538     .948     .486     .900
    Wald (asy)    .491     .164     .791     .838     .743     .766     .666     .678
    Wald (boot)   .043     .047     .218     .643     .180     .523     .174     .465

Low frequency models

                     DGP-1             DGP-2             DGP-3             DGP-4
                 n = 80  n = 200   n = 80  n = 200   n = 80  n = 200   n = 80  n = 200
    Max           .044     .045     .533     .955     .455     .890     .381     .868
    Wald (asy)    .066     .053     .584     .946     .506     .871     .424     .846
    Wald (boot)   .052     .005     .506     .939     .427     .860     .356     .836

Rejection frequencies of the max test without the assumption of conditional homoskedasticity, the asymptotic Wald test, and the bootstrapped Wald test across J = 1000 Monte Carlo samples. $x_H$ follows a univariate high frequency AR(1): $x_H(\tau, j) = 0.95 \times x_H(\tau, j - 1) + \epsilon_H(\tau, j)$. The main DGP is $x_L(\tau) = 0.85 \times x_L(\tau - 1) + \sum_{i=1}^{24} b_i x_H(\tau - 1, m + 1 - i) + \epsilon_L(\tau)$ with $m = 12$, where $b_i = 0$ under DGP-1; $b_i = (1/30) \times I(i \le 3)$ under DGP-2; $b_i = (1/120) \times I(i \le 12)$ under DGP-3; $b_i = 1/240$ under DGP-4. The sample size in terms of low frequency is $n \in \{80, 200\}$. Parsimonious regression models used for the MF max test are $x_L(\tau) = \alpha_{0i} + \sum_{k=1}^{2} \alpha_{ki} x_L(\tau - k) + \beta_i x_H(\tau - 1, m + 1 - i) + u_{L,i}(\tau)$ for $i \in \{1, \ldots, 24\}$. The full regression model used for the MF Wald test is $x_L(\tau) = \alpha_0 + \sum_{k=1}^{2} \alpha_k x_L(\tau - k) + \sum_{i=1}^{24} \beta_i x_H(\tau - 1, m + 1 - i) + u_L(\tau)$. Parsimonious regression models used for the LF max test are $x_L(\tau) = \alpha_{0i} + \sum_{k=1}^{2} \alpha_{ki} x_L(\tau - k) + \beta_i x_H(\tau - i, m) + u_{L,i}(\tau)$ for $i \in \{1, 2, 3\}$. The full regression model used for the LF Wald test is $x_L(\tau) = \alpha_0 + \sum_{k=1}^{2} \alpha_k x_L(\tau - k) + \sum_{i=1}^{3} \beta_i x_H(\tau - i, m) + u_L(\tau)$. The nominal size is 5%.
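The mixed frequency DGP described in the note above can be sketched as follows. This is a simplified illustration with our own function name and burn-in choice, not the authors' simulation code:

```python
import numpy as np

def simulate_empirical_dgp(n, m=12, h=24, phi=0.95, rho=0.85, b=None, seed=0):
    """Simulate x_H(tau, j) = phi * x_H(tau, j-1) + eps_H(tau, j) and
    x_L(tau) = rho * x_L(tau-1) + sum_i b_i x_H(tau-1, m+1-i) + eps_L(tau)."""
    rng = np.random.default_rng(seed)
    if b is None:
        b = np.zeros(h)                      # DGP-1: b_i = 0 (non-causality)
    burn = 50
    T = (n + burn) * m
    xH = np.zeros(T)                         # high frequency series, flattened
    for t in range(1, T):
        xH[t] = phi * xH[t - 1] + rng.standard_normal()
    xL = np.zeros(n + burn)
    for tau in range(2, n + burn):
        end = tau * m                        # position just past x_H(tau-1, m)
        lags = xH[end - h:end][::-1]         # x_H(tau-1, m+1-i) for i = 1..h
        xL[tau] = rho * xL[tau - 1] + lags @ b + rng.standard_normal()
    return xL[burn:], xH[burn * m:].reshape(n, m)

xL, xH = simulate_empirical_dgp(n=200, b=np.full(24, 1 / 240))   # DGP-4
```

Setting b to $(1/30) \times I(i \le 3)$ or $(1/120) \times I(i \le 12)$ instead reproduces DGP-2 and DGP-3, and b = 0 gives DGP-1.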
Table 6: Sample statistics of U.S. interest rates and real GDP growth
# Obs. Mean Median Stdev Min Max Skew Kurt p-KS p-AD
FF 2736 5.563 5.250 3.643 0.040 22.00 0.928 4.615 0.000 0.001
10Y 2736 6.555 6.210 2.734 1.470 15.68 0.781 3.488 0.000 0.001
SPR 2736 0.991 1.160 1.800 −9.570 4.440 −1.198 5.611 0.000 0.001
GDP 208 3.151 3.250 2.349 −4.100 8.500 −0.461 3.543 0.177 0.002
FF: Weekly effective federal funds rate. 10Y: Weekly 10-year Treasury constant maturity rate. SPR: Weekly term spread (10Y − FF). GDP: Percentage growth rate of quarterly GDP. For each variable, we report the number of observations, sample mean, median, standard deviation, minimum, maximum, skewness, kurtosis, the p-value of the Kolmogorov-Smirnov test for normality, and the p-value of the Anderson-Darling test for normality. The sample period covers January 5, 1962 through December 31, 2013.
Table 7: Results of Phillips-Perron unit root tests
Weekly term spread Quarterly GDP growth
Drift Drift + Trend Drift Drift + Trend
Lag l l = 2 l = 4 l = 2 l = 4 l = 2 l = 4 l = 2 l = 4
ϕn 0.958 0.958 0.952 0.952 0.878 0.878 0.867 0.867
se(ϕn) 0.006 0.006 0.006 0.006 0.032 0.032 0.034 0.034
Sn −5.849 −5.709 −6.381 −6.263 −4.591 −4.682 −4.760 −4.861
5% c.v. −2.861 −2.864 −3.414 −3.414 −2.876 −2.876 −3.434 −3.434
R/A Reject Reject Reject Reject Reject Reject Reject Reject
We perform Phillips and Perron’s (1988) unit root tests for weekly term spread and quarterly GDP
growth. We consider a regression model with a constant and drift (Drift) as well as a regression model
with a constant, drift, and linear time trend (Drift + Trend). The number of lags included in the Newey-
West long-run variance estimator is l ∈ {2, 4}. ϕn is an OLS estimate for the AR(1) parameter; se(ϕn)
is the standard error of ϕn; Sn is the test statistic; “5% c.v.” is the 5% critical value; “R/A” signifies the
rejection or acceptance of H0 : ϕ = 1. The sample period covers January 5, 1962 through December 31,
2013, which contains 2,736 weeks or 208 quarters.
Figure 1: Time series plots of U.S. interest rates and real GDP growth
[Figure: four time series panels. (a) Weekly federal funds rate; (b) Weekly 10-year bond yield; (c) Weekly yield spread; (d) Quarterly GDP growth.]
In this figure we present time series plots of (a) weekly effective federal funds rate, (b) weekly 10-year
Treasury constant maturity rate, (c) their spread (10Y−FF), and (d) quarterly real GDP growth from
previous year. The sample period covers January 5, 1962 through December 31, 2013, which contains
2,736 weeks or 208 quarters. The shaded areas represent recession periods defined by the National Bureau
of Economic Research (NBER).
Figure 2: P-values for tests of non-causality from interest rate spread to GDP
[Figure: four panels of rolling window p-values. (a) Mixed frequency max test; (b) Mixed frequency Wald test; (c) Low frequency max test; (d) Low frequency Wald test.]
Rolling window p-values based on (a) the mixed frequency max test, (b) the mixed frequency Wald test, (c) the
low frequency max test, and (d) the low frequency Wald test. The mixed frequency tests use weekly term spread
and quarterly GDP growth, while the low frequency tests use quarterly term spread and GDP growth. The
sample period covers January 5, 1962 through December 31, 2013, which contains 2,736 weeks or 208 quarters.
The window size is 80 quarters. Each point of the x-axis represents the beginning date of a given window. The
shaded area is [0, 0.05], hence any p-value in that range indicates rejection of non-causality from the interest rate
spread to GDP growth at the 5% level in a given window.