Large Vector Autoregressions with stochastic volatility and …...Large Vector Autoregressions with stochastic volatility and non-conjugate priors Andrea Carriero 1 Todd E. Clark 2

Large Vector Autoregressions with stochastic volatilityand non-conjugate priors

Andrea Carriero 1 Todd E. Clark 2 Massimiliano Marcellino 3

Norges Bank, 3 October 2017

1Queen Mary, University of London2Federal Reserve Bank of Cleveland3Bocconi University and CEPRCarriero, Clark, Marcellino () Large VARs September 2017 1 / 22

Introduction

Introduction - two ingredients

Two main ingredients are key for the specification of a good Vector Autoregressivemodel (VAR) for forecasting and structural analysis of macroeconomic data:

A large cross section. Banbura, Giannone, and Reichlin (2010), Carriero, Clark,and Marcellino (2015), Giannone, Lenza, and Primiceri (2015) and Koop (2013)

Time variation in the volatilities. Clark (2011), Clark and Ravazzolo (2015),Cogley and Sargent (2005), D’Agostino, Gambetti and Giannone (2013), andPrimiceri (2005)

There are no papers which jointly allow for both time variation and large datasets

Carriero, Clark, Marcellino () Large VARs September 2017 2 / 22

Introduction

Introduction - heteroskedasticity

The reason lies in the structure of the likelihood function

Homoskedastic VARs are SUR models with the same set of regressors in eachequation −→ Kronecker structure in the likelihood−→ OLS equation by equation

Equation-specific stochastic volatility breaks this symmetry because each equationis driven by a different volatility

The system would need to be vectorised, and the conditional posterior involves

manipulation of a matrix of dimension pN2 (N=number of variables, p=number oflags)

The computational complexity is therefore N23= N6


Introduction

Introduction - asymmetric priors

In a Bayesian framework, simmetry is not only needed in the likelihood, but also in

the prior

Kronecker structure in the likelihood+Kronecker structure in the prior= Kronecker

structure in the posterior

For example, the VAR estimated by Banbura, Giannone, and Reichlin (2010) is aVAR with 130 variables, but in order to make this estimation possible one needs to

assume:

(i) Homoskedasticity of the disturbances

(ii) A specific structure for the prior

Without either (i) or (ii) the system would need to be vectorised prior to estimation


Introduction

The problem

Consider the VAR of a N-dimensional vector yt :

yt = Π(L)yt−1 + vt ; vt ∼ iid N(0,Σt ) (1)

Define Xt = [1, y ′t−1, ..., y′t−p ]

′ and Π = [Π0 |Π1 |...|Πp ]

In general we have the posterior vec(Π)|Σ, y ∼ N(vec(µΠ),ΩΠ) with posteriorprecision:

Ω−1Π = Ω−1ΠPrior

+T

∑t=1

(Σ−1t ⊗ XtX ′t )Likelihood

(2)

The precision matrix Ω−1Π is of size N(Np + 1). Its manipulation requires(pN2)3 = O(N6) elementary operations

For N very large modern computers (laptops/desktops) can’t even store such amatrix in RAM (e.g. N = 125 needs 330 GB of RAM).


Introduction

The usual solution

In general we have the posterior vec(Π)|Σ, y ∼ N(vec(µΠ),ΩΠ) with


+T

∑t=1


(2)

Now assume that

(i) Σt = Σ (homoskedasticity)

(ii) ΩΠ = Σ⊗Ω0 (conjugate prior)

Ω−1Π = Ω−1Π︸︷︷︸Σ−1⊗Ω−10

+T

∑t=1(Σ−1t︸︷︷︸

Σ−1

⊗ XtX ′t ) = Σ−1 ⊗(

Ω−10 +T

∑t=1

XtX ′t

), (3)

and the two terms can be manipulated separately, reducing complexity by O(N3)

Classical homoskedastic VARs can be estimated equation by equation.


Introduction

Problems with the the usual solution

The Natural-conjugate homoskedastic approach allows to use large datasets, but it hasimportant limitations:

It imposes homoskedasticity, against the overwhelming evidence in macroeconomicand financial data

The prior structure Σ⊗Ω0 is restrictive (Rothemberg (1963), Sims and Zha

(1998))

It prevents any asymmetry in the prior across equations, because the

coeffi cients of each equation feature the same prior variance Ω0 (up to a scale

factor given by the elements of Σ).It has the unappealing consequence that prior beliefs must be correlated across

equations, with a correlation structure proportional to that of the shocks (as

described by Σ).


Introduction

A new algorithm

In this paper we propose a new algorithm that makes possible to use:

A heteroskedastic model

The more general and less restrictive independent Normal - Inverse Wishart

(and Normal-diffuse) prior

Our procedure is based on a simple factorization of the likelihood, which allows todraw the VAR coeffi cients equation by equation

This reduces the computational complexity from N6 to N4.

Our new algorithm is very simple and can be easily inserted in any pre-existingalgorithm for estimation of BVAR models.


Introduction

The Model

Consider the following VAR model for a N-dimensional yt with stochastic volatility:

yt = Π0 +Π(L)yt−1 + vt ; (1)

vt = A−1Λ0.5t εt , εt ∼ iid N(0, IN ) (2)

where Λt is a diagonal matrix with generic j-th element hj ,t and A−1 is a lowertriangular matrix with ones on its main diagonal.

The bottleneck is drawing vec(Π)|A,ΛT , yT ∼ N(vec(µΠ),ΩΠ); To obtain a draw

one needs to i) invert


+T

∑t=1


(3)

ii) compute its Cholesky factor and iii) multiply the Cholesky factor by a random

vector

Each of the above operations is of complexity N6


Introduction

An algorithm for large VARs

Consider again the decomposition vt = A−1Λ0.5t εt :v1,tv2,t...vN ,t

=

1 0 ... 0a∗2,1 1 ...... 1 0a∗N ,1 ... a∗N ,N−1 1

h0.51,t 0 ... 00 h0.52,t ...... ... 00 ... 0 h0.5N ,t

ε1,tε2,t...

εN ,t

,where a∗j ,i denotes the generic element of the matrix A

−1 which is available underknowledge of A.


Introduction


The VAR can be written as:

y1,t = π(0)1 +

N

∑i=1

p

∑l=1

π(i )1,l yi ,t−l + h

0.51,t ε1,t

y2,t = π(0)2 +

N

∑i=1

p

∑l=1

π(i )2,l yi ,t−l + a

∗2,1h

0.51,t ε1,t + h

0.52,t ε2,t

...

yN ,t = π(0)N +

N

∑i=1

p

∑l=1

π(i )N ,l yi ,t−l + a

∗N ,1h

0.51,t ε1,t + · · ·+ a∗N ,N−1h0.5N−1,t εN−1,t + h0.5N ,t εN ,t ,

with the generic equation for variable j :

yj ,t − (a∗j ,1h0.51,t ε1,t + ...+ a∗j ,,j−1h

0.5j−1,t εj−1,t )︸︷︷︸

y ∗j ,t

= π(0)j +

N

∑i=1

p

∑l=1

π(i )j ,l yi ,t−l + hj ,t εj ,t . (4)

When drawing the coeffi cients of equation j the term y ∗j ,t is known, since it is given bythe difference between the dependent variable of that equation and the realized residualsof all the previous j − 1 equations. Hence (4) is a standard generalized linear regressionmodel with i.i.d. Gaussian disturbances.


Introduction


The full conditional posterior distribution of the conditional mean coeffi cients can befactorized as:

p(Π|A,ΛT , y ) = p(π(N )|π(N−1),π(N−2), . . . ,π(1),A,ΛT , y )

×p(π(N−1)|π(N−2), . . . ,π(1),A,ΛT , y )...

×p(π(1)|A,ΛT , y ),and one can draw the coeffi cients in Π in separate blocks:

Πj|Π1:j−1,A,ΛT , y ∼ N(µΠj |1:j−1 ,ΩΠj |1:j−1 )

with

µΠj |1:j−1 = ΩΠj |1:j−1

T

∑t=1

Xj ,th−1j ,t y

∗′j ,t +Ω−1

Πj |1:j−1µΠj |1:j−1

Ω−1Πj |1:j−1 = Ω−1Πj |1:j−1 +

T

∑t=1

Xj ,th−1j ,t X

′j ,t ,

where µΠj |1:j−1 and ΩΠj |1:j−1 are moments of Πj|Π1:j−1 ∼ N(µ

Πj |1:j−1 ,ΩΠj |1:j−1 )


Introduction


The conditional posterior of Π obtained is the same as the one from thesystem-wide algorithm

The algorithm will produce draws numerically identical to those of thesystem-wide sampler

This is true regardelss of the ordering, which is irrelevant to the conditionalposterior of Π

The total computational complexity of this estimation algorithm is O(N4), with a

gain of N2.

Uses equations with at most Np + 1 regressors, and the correlation across

equations typical of SUR models is implicitly accounted for by the factorization

The dimension of the posterior variance matrix Ω−1Πj is (Np + 1), which

means that its manipulation only involves operations of order O(N3).


A numerical comparison of the estimation methods

Computational complexity and speed of simulation

time for producing 10 draws as a function of N

Size of the crosssection (N)0 1 2 3 4 5 6 7 8 9 10

Sec

onds

0

2

4

6

8

10

12

14

16

18time for producing 10 draws as function of N

System wide agorithm

Triangular algorithm

Size of the crosssection (N)0 1 2 3 4 5 6 7 8 9 10

Num

ber o

f ele

men

tary

ope

ratio

ns

0

10

20

30

40

50

60

70

80

90

100Theoretical and Actual difference in computational complexity

Actual difference

Theoretical difference



Computational complexity and speed of simulation

time for producing 10 draws as a function of N - log scale

Size of the crosssection (N)0 5 10 15 20 25 30 35 40

Sec

onds

10 1

10 0

10 1

10 2

10 3

10 4 time for producing 10 draws as function of N

System wide algorithm

Triangular algorithm

Size of the crosssection (N)0 5 10 15 20 25 30 35 40

Num

ber o

f ele

men

tary

ope

ratio

ns

10 1

10 0

10 1

10 2

10 3

10 4 Theoretical and Actual difference in computational complexity

Actual differenceTheoretical dif ference



Convergence and mixing

Regardless of the power of the computers used to perform the simulation thetriangular algorithm will always produce many more draws than the traditional

system-wide algorithm in a given unit of time.

This has important consequences in terms of producing draws with good mixingand convergence properties.

The triangular algorithm can produce draws many times closer to i.i.d. samplingin the same amount of time.

These computational and storage gains increase quadratically with the system size



Convergence and mixing

Ineffi ciency factors= distance from i.i.d sampling: ideally should be around 1.

5 0 5 10 15 20 250

0.1

0.2

0.3

0.4

0.5

0.6

0.7Conditinal mean parameters, systemwide algorithm

0 .5 0 0.5 1 1.5 2 2.5 3 3.50

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8Conditinal mean parameters, triangular algorithm

10 0 10 20 30 40 50 60 700

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2Covariances, systemwide algorithm

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 20

0.5

1

1.5

2

2.5

3Covariances, triangular algorithm

2 4 6 8 10 12 14 16 18 200

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2Volatility factors (averaged across time), systemwide algorithm

0.7 0.75 0.8 0.85 0.9 0.95 1 1.050

2

4

6

8

10

12

14

16

18

20Volatility factors (averaged across time), triangular algorithm

Systemwide algorithm, 5000 draws.

20 0 20 40 60 80 100 120 140 160 1800

0.005

0.01

0.015

0.02

0.025

0.03Volatility innovation variance, systemwide algorithm

Triangular algorithm results are based on 1305000 draws with skipsampling of 261, producing an effective sample of 5000 draws.

0 .5 0 0.5 1 1.5 2 2.5 3 3.50

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6Volatility innovation variance, triangular algorithm


A large structural VAR with drifting volatilities

Empirical applications

As an illustration we estimate a VAR with stochastic volatilities, using 13 lags and a

cross-section of 125 variables from FRED-MD

For a model of this size the system-wide algorithm would have a covariance matrix

of the coeffi cients of dimension 203250, which would require about 330 GB of RAM(2032502 × 8/109).

Our estimation algorithm can produce 5000 draws in just above 7 hours on a 3.5

GHz Intel Core i7.

We find that:

The variance of the shocks was clearly unstable over time

There is a factor structure in the volatilities

The combined use of both time variation in volatilities and a large data-set

improves point and density forecasts, more that what these two ingredients do

if used separately.


0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 1250.4

0.2

0

0.2

0.4

Var explained 73.2837%

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 1250.1

0

0.1

0.2

0.3

0.4

0.5


0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 1250.4

0.2

0

0.2

0.4

0.6


0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 1250.4

0.2

0

0.2

0.4

0.6


0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 1250.30.20.1

00.10.20.30.4


FFRNB reserves

CES1021000001

PCEPIHoursRPI

interest rates, exchange rates, andfinancial indicators

Monetaryaggregates

Real variables PricesSurveys

Figure 11: Principal components loadings of the variance-covariance of the volatilities (matrix ).

PCA of the variance matrix of the shocks to volatilities

heteroskedast ic2 2.5 3 3.5

hom

osch

edas

tic

2

2.5

3

3.5

RPI SCORE

heteroskedast ic3.7 3.8 3.9

hom

osch

edas

tic3.65

3.7

3.75

3.8

3.85

3.9

3.95

DPCERA3M086SBEA SCORE

heteroskedast ic3 3.1 3.2

hom

osch

edas

tic

2.95

3

3.05

3.1

3.15

3.2

3.25

CMRMTSPLx SCORE

heteroskedast ic3.3 3.4 3.5

hom

osch

edas

tic

3.3

3.35

3.4

3.45

3.5

3.55

INDPRO SCORE

heteroskedast ic2.5 2 1.5 1

hom

osch

edas

tic

2.5

2

1.5

1

CUMFNS SCORE

heteroskedast ic1 0.5 0

hom

osch

edas

tic

1.21

0.80.60.40.2

00.20.4

UNRATE SCORE

heteroskedast ic4.6 4.7 4.8 4.9 5

hom

osch

edas

tic

4.6

4.7

4.8

4.9

5

PAYEMS SCORE

heteroskedast ic1 0.8 0.6 0.4 0.2

hom

osch

edas

tic

1

0.8

0.6

0.4

0.2

CES0600000007 SCORE

heteroskedast ic4.2 4.3 4.4 4.5

hom

osch

edas

tic

4.2

4.25

4.3

4.35

4.4

4.45

4.5

CES0600000008 SCORE

heteroskedast ic3.4 3.5 3.6 3.7 3.8

hom

osch

edas

tic

3.43.45

3.53.55

3.63.65

3.73.75

3.8

PPIFGS SCORE

heteroskedast ic2 2.05 2.1

hom

osch

edas

tic

1.96

1.98

2

2.02

2.04

2.06

2.08

2.1

PPICMM SCORE

heteroskedast ic4.5 4.6 4.7 4.8 4.9

hom

osch

edas

tic

4.54.554.6

4.654.7

4.754.8

4.854.9

PCEPI SCORE

heteroskedast ic2.5 3 3.5 4 4.5

hom

osch

edas

tic

2.5

3

3.5

4

4.5

FEDFUNDS SCORE

heteroskedast ic0 0.5 1

hom

osch

edas

tic

0.2

0

0.2

0.4

0.6

0.8

1

HOUST SCORE


hom

osch

edas

tic

1.75

1.8

1.85

1.9

S&P 500 SCORE


hom

osch

edas

tic

2.25

2.3

2.35

2.4

EXUSUKx SCORE

heteroskedast ic1.5 1 0.5 0

hom

osch

edas

tic

1.5

1

0.5

0

T1YFFM SCORE


hom

osch

edas

tic

2

1.5

1

0.5

T10YFFM SCORE


hom

osch

edas

tic

2

1.5

1

0.5

BAAFFM SCORE

heteroskedast ic3.6 3.4 3.2 3 2.8

hom

osch

edas

tic

3.6

3.4

3.2

3

2.8

NAPMNOI SCORE

Figure 17: Comparison of density forecast accuracy. Each panel describes a different variable. The

x axis reports the (log) density score obtained using the BVAR with stochastic volatility (het-

eroschedastic), the y axis reports the (log) density score obtained using the homoschedastic BVAR.

Each point corresponds to a different forecast horizon from 1 to 12 step-ahead.

Score comparison: homoskedastic model (y axis) vs heteroskedastic model (x axis)

heteroskedastic 1035.7 5.75 5.8 5.85

hom

osch

edas

tic

103

5.7

5.75

5.8

5.85

RPI RMSFE

heteroskedastic 10 35 5.1 5.2

hom

osch

edas

tic

10 3

5

5.05

5.1

5.15

5.2

DPCERA3M086SBEA RMSFE

heteroskedastic 10 39.4 9.6 9.8

hom

osch

edas

tic

103

9.3

9.4

9.5

9.6

9.7

9.8

CMRMTSPLx RMSFE

heteroskedastic 10 36.4 6.6 6.8 7 7.2 7.4

hom

osch

edas

tic

10 3

6.4

6.6

6.8

7

7.2

7.4

INDPRO RMSFE

heteroskedastic1 2 3

hom

osch

edas

tic

1

1.5

2

2.5

3

CUMFNS RMSFE

heteroskedastic0.2 0.4 0.6 0.8

hom

osch

edas

tic

0.2

0.3

0.4

0.5

0.6

0.7

0.8

UNRATE RMSFE

heteroskedastic 10 31.5 1.6 1.7 1.8 1.9

hom

osch

edas

tic

103

1.5

1.6

1.7

1.8

1.9

PAYEMS RMSFE

heteroskedastic0.3 0.4 0.5

hom

osch

edas

tic

0.3

0.35

0.4

0.45

0.5

0.55

CES0600000007 RMSFE

heteroskedastic 1032.6 2.7 2.8 2.9

hom

osch

edas

tic

10 3

2.62.652.7

2.752.8

2.852.9

2.95

CES0600000008 RMSFE

heteroskedastic 10 35.6 5.8 6 6.2

hom

osch

edas

tic

10 3

5.65.75.85.9

66.16.26.3

PPIFGS RMSFE


hom

osch

edas

tic

0.03

0.0305

0.031

0.0315

0.032

0.0325

PPICMM RMSFE

heteroskedastic 10 32 2.2 2.4

hom

osch

edas

tic

10 3

1.9

2

2.1

2.2

2.3

2.4

2.5

PCEPI RMSFE


hom

osch

edas

tic

0.01

0.015

0.02

FEDFUNDS RMSFE

heteroskedastic0.1 0.15 0.2 0.25

hom

osch

edas

tic

0.1

0.15

0.2

0.25

HOUST RMSFE

heteroskedastic0.03680.0370.03720.03740.0376

hom

osch

edas

tic

0.0368

0.037

0.0372

0.0374

0.0376

S&P 500 RMSFE


hom

osch

edas

tic

0.0228

0.023

0.0232

0.0234

0.0236

0.0238

0.024

EXUSUKx RMSFE

heteroskedastic0.5 0.6 0.7 0.8 0.9

hom

osch

edas

tic

0.5

0.6

0.7

0.8

0.9

T1YFFM RMSFE

heteroskedastic0.6 0.8 1 1.2 1.4 1.6

hom

osch

edas

tic

0.6

0.8

1

1.2

1.4

1.6

T10YFFM RMSFE

heteroskedastic0.6 0.8 1 1.2 1.4 1.6 1.8

hom

osch

edas

tic

0.6

0.8

1

1.2

1.4

1.6

1.8

BAAFFM RMSFE

heteroskedastic4 5 6 7

hom

osch

edas

tic

44.5

55.5

66.5

77.5

NAPMNOI RMSFE

Figure 16: Comparison of point forecast accuracy. Each panel describes a different variable. The x

axis reports the RMSFE obtained using the BVAR with stochastic volatility (heteroschedastic), the

y axis reports the RMSFE obtained using the homoschedastic BVAR. Each point corresponds to a

different forecast horizon from 1 to 12 step-ahead (in most cases, a higher RMSFE corresponds to a

longer forecast horizon).

RMSFE comparison: homoskedastic model (y axis) vs heteroskedastic model (x axis)

Conclusions

Conclusions

The assumptions of conjugacy and homoskedasticity in a VARs are hardly

defendable, but a more general specification is only manageable with a smallcross-section.

We have proposed a new estimation method VARs with non-conjugate priors anddrifting volatilities which can be applied with large models

The method is based on a straightforward triangularization of the system, and it is

very simple to implement.

Indeed, if a researcher already has algorithms to produce draws from a VAR with anindependent N-IW prior and stochastic volatility, only a single needs to be slightlymodified with a few lines of code.

Given its simplicity and the advantages in terms of speed, mixing, and convergence,we argue that the proposed algorithm should be preferred in empirical applications,especially those involving large datasets.


Conclusions

Prior dependence

We assumed that the prior variance was diagonal. This can be relaxed.

With a prior dependent across equations, the general form of the posterior can be

obtained using the triangularization also on the joint prior distribution, and is:

Πj|Π1:j−1,A,ΛT , y ∼ N(µΠj |1:j−1 ,ΩΠj |1:j−1 )

with

µΠj |1:j−1 = ΩΠj |1:j−1

T

∑t=1

Xj ,th−1j ,t y

∗′j ,t +Ω−1

Πj |1:j−1µΠj |1:j−1

Ω−1Πj |1:j−1 = Ω−1Πj |1:j−1 +

T

∑t=1

Xj ,th−1j ,t X

′j ,t ,

where µΠj |1:j−1 and ΩΠj |1:j−1 are moments of

Πj|Π1:j−1 ∼ N(µΠj |1:j−1 ,ΩΠj |1:j−1 ), i.e. the conditional priors implied by the

joint prior specification.

The moments of Πj|Π1:j−1 can be found recursively from the joint prior


Forecasting

Model size, stochastic volatility, and forecasting

Pseudo out of sample exercise performed recursively, starting with the estimationsample 1960:3 to 1970:2 and ending with 1960:3 to 2014:5.

We consider four models.

1 A small homoskedastic VAR including the growth rate of industrial production

(∆ ln IP), the inflation rate based on consumption expenditures (∆ lnPECEPI )and the effective Federal Funds Rate (FFR).

2 A large (20 variables) homoskedastic VAR along the lines of Carriero, Clark,

and Marcellino (2015), Giannone, Lenza, and Primiceri (2015), and Koop

(2013).3 A small VAR with time variation in volatilities along the lines of Clark (2011),

Cogley and Sargent (2005) and Primiceri (2005).4 The fourth model includes both time variation in the volatilities and a large

(20 variables) information set.


Forecasting

Forecasting

Direct effects:

The use of a larger dataset improves point forecasts via a better specification

of the conditional means.

The inclusion of time variation in volatilities improves density forecasts via a

better modelling of error variances,

Interactions:

A better point forecast improves the density forecast as well, by centering the

predictive density around a more reliable mean

Time varying volatilities improve the point forecasts at longer horizons -

because the heteroskedastic model will provide more effi cient estimates

(through a GLS argument) and a therefore a better characterization of the

predictive densities


Documents

Large Vector Autoregressions with stochastic volatility and …...Large Vector Autoregressions with stochastic volatility and non-conjugate priors Andrea Carriero 1 Todd E. Clark 2