Source: math.unice.fr/~frapetti/CorsoP/Chapitre_4_IMEA_1.pdf

Lecture 4: Estimation of ARIMA models

Florian Pelgrin

University of Lausanne, École des HEC; Department of mathematics (IMEA-Nice)

Sept. 2011 - Dec. 2011

Florian Pelgrin (HEC) Univariate time series Sept. 2011 - Dec. 2011 1 / 50


Road map

1 Introduction: Overview; Framework

2 Sample moments

3 (Nonlinear) Least squares method: Least squares estimation; Nonlinear least squares estimation; Discussion

4 (Generalized) Method of moments: Method of moments and Yule-Walker estimation; Generalized method of moments

5 Maximum likelihood estimation: Overview; Estimation


Introduction Overview

1. Introduction
1.1. Overview

Provide an overview of some estimation methods for linear time series models:

Sample moments estimation;

(Non)linear least squares methods;

Maximum likelihood estimation;

(Generalized) method of moments.

Other methods are available (e.g., Bayesian estimation or Kalman filtering through state-space models)!


Introduction Overview

Broadly speaking, these methods consist in estimating the parameters of interest (autoregressive coefficients, moving average coefficients, and the variance of the innovation process) as follows:

$$\underset{\delta}{\text{Optimize}}\ Q_T(\delta) \quad \text{s.t. (non)linear equality and inequality constraints},$$

where $Q_T$ is the objective function and $\delta$ is the vector of parameters of interest.

Remark: The (non)linear equality and inequality constraints may represent the fundamentalness conditions.


Introduction Overview

Examples

(Non)linear least squares methods aim at minimizing the sum of squared residuals;

Maximum likelihood estimation aims at maximizing the (log-)likelihood function;

The generalized method of moments aims at minimizing the distance between the empirical moments and zero (using a weighting matrix).


Introduction Overview

These methods differ in terms of:

Assumptions;

"Nature" of the model (e.g., MA versus AR);

Finite- and large-sample properties;

Etc.

High-quality software programs (EViews, SAS, S-Plus, Stata, etc.) are available! However, it is important to know the estimation options (default procedure, optimization algorithm, choice of initial conditions) and to keep in mind that these estimation techniques do not perform equally well and do depend on the nature of the model...


Introduction Framework

1.2. Framework

Suppose that $(X_t)$ is correctly specified as an ARIMA(p,d,q) model

$$\Phi(L)\Delta^d X_t = \Theta(L)\varepsilon_t$$

where $(\varepsilon_t)$ is a weak white noise $(0, \sigma_\varepsilon^2)$ and

$$\Phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p \quad\text{with } \phi_p \neq 0$$
$$\Theta(L) = 1 + \theta_1 L + \cdots + \theta_q L^q \quad\text{with } \theta_q \neq 0.$$

Assume that:
1 The model order (p, d, and q) is known;
2 The data have zero mean;
3 The series is weakly stationary (e.g., after d-th differencing).


Sample moments

2. Sample moments

Let $(X_1, \cdots, X_T)$ represent a sample of size $T$ from a covariance-stationary process with

$$E[X_t] = m_X \quad\text{for all } t$$
$$\text{Cov}[X_t, X_{t-h}] = \gamma_X(h) \quad\text{for all } t.$$


Sample moments

If nothing is known about the distribution of the time series, (non-parametric) estimation can be provided by:

Mean

$$\hat m_X \equiv \bar X_T = \frac{1}{T}\sum_{t=1}^{T} X_t$$

Autocovariance

$$\hat\gamma_X(h) = \frac{1}{T}\sum_{t=1}^{T-|h|}\left(X_t - \bar X_T\right)\left(X_{t+|h|} - \bar X_T\right)$$

or

$$\hat\gamma_X(h) = \frac{1}{T-|h|}\sum_{t=1}^{T-|h|}\left(X_t - \bar X_T\right)\left(X_{t+|h|} - \bar X_T\right).$$

Autocorrelation

$$\hat\rho_X(h) = \frac{\hat\gamma_X(h)}{\hat\gamma_X(0)}.$$
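These estimators are straightforward to compute; a minimal numpy sketch (not from the slides, and the function names are mine):

```python
import numpy as np

def sample_mean(x):
    """Sample mean: (1/T) * sum_t X_t."""
    return float(np.mean(x))

def sample_autocov(x, h, biased=True):
    """Sample autocovariance gamma_hat_X(h); divisor T (biased) or T - |h|."""
    x = np.asarray(x, dtype=float)
    T, h = len(x), abs(h)
    xbar = x.mean()
    s = np.sum((x[:T - h] - xbar) * (x[h:] - xbar))
    return s / T if biased else s / (T - h)

def sample_autocorr(x, h):
    """Sample autocorrelation rho_hat_X(h) = gamma_hat(h) / gamma_hat(0)."""
    return sample_autocov(x, h) / sample_autocov(x, 0)

# Gaussian white noise: rho_hat(h) should be close to 0 for h != 0
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
```

The `biased` (divisor $T$) version is the usual default since it guarantees a positive semi-definite autocovariance sequence.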


(Nonlinear) Least squares method Least squares estimation

3. (Nonlinear) Least squares method
3.1. Least squares estimation

Consider the linear regression model ($t = 1,\cdots,T$):

$$y_t = x_t' b_0 + \varepsilon_t$$

where $(y_t, x_t')$ are i.i.d. (H1), the regressors are stochastic (H2), $x_t'$ is the $t$-th row of the $X$ matrix, and the following assumptions hold:

1 Exogeneity (H3): $E[\varepsilon_t \mid x_t] = 0$ for all $t$;
2 Spherical error terms (H4): $V[\varepsilon_t \mid x_t] = \sigma_0^2$ for all $t$, and $\text{Cov}[\varepsilon_t, \varepsilon_{t'} \mid x_t, x_{t'}] = 0$ for $t \neq t'$;
3 Rank condition (H5): $\text{rank}\left[E(x_t x_t')\right] = k$.


(Nonlinear) Least squares method Least squares estimation

Definition

The ordinary least squares estimator of $b_0$, $\hat b_{ols}$, defined from the following minimization program

$$\hat b_{ols} = \arg\min_{b_0} \sum_{t=1}^{T} \varepsilon_t^2 = \arg\min_{b_0} \sum_{t=1}^{T} \left(y_t - x_t' b_0\right)^2,$$

is given by

$$\hat b_{ols} = \left(\sum_{t=1}^{T} x_t x_t'\right)^{-1}\left(\sum_{t=1}^{T} x_t y_t\right).$$


(Nonlinear) Least squares method Least squares estimation

Proposition

Under H1-H5, the ordinary least squares estimator of $b_0$ is weakly consistent,

$$\hat b_{ols} \xrightarrow[T\to\infty]{p} b_0,$$

and is asymptotically normally distributed,

$$\sqrt{T}\left(\hat b_{ols} - b_0\right) \xrightarrow{d} N\left(0_{k\times 1},\ \sigma_0^2\, E\left[x_t x_t'\right]^{-1}\right).$$


(Nonlinear) Least squares method Least squares estimation

Example : AR(1) estimation

Let $(X_t)$ be a covariance-stationary process defined by the fundamental representation ($|\phi| < 1$):

$$X_t = \phi X_{t-1} + \varepsilon_t$$

where $(\varepsilon_t)$ is the innovation process of $(X_t)$.

The ordinary least squares estimator of $\phi$ is defined to be:

$$\hat\phi_{ols} = \left(\sum_{t=2}^{T} x_{t-1}^2\right)^{-1}\left(\sum_{t=2}^{T} x_{t-1} x_t\right).$$

Moreover,

$$\sqrt{T}\left(\hat\phi_{ols} - \phi\right) \xrightarrow{d} N\left(0,\ 1-\phi^2\right).$$
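A quick simulation sketch of this estimator (assumptions: numpy, Gaussian innovations; `ar1_ols` is an illustrative helper, not from the slides):

```python
import numpy as np

def ar1_ols(x):
    """phi_hat = (sum_{t=2}^T x_{t-1}^2)^{-1} * (sum_{t=2}^T x_{t-1} x_t)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2))

# simulate an AR(1) with phi = 0.6
rng = np.random.default_rng(1)
phi_true, T = 0.6, 5_000
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + eps[t]

phi_hat = ar1_ols(x)
# asymptotic standard error implied by sqrt(T)(phi_hat - phi) -> N(0, 1 - phi^2)
se_hat = float(np.sqrt((1 - phi_hat ** 2) / T))
```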


(Nonlinear) Least squares method Nonlinear least squares estimation

3.2. Nonlinear least squares estimation

Definition

Consider the nonlinear regression model defined by ($t = 1,\cdots,T$):

$$y_t = h(x_t'; b_0) + \varepsilon_t$$

where $h$ is a nonlinear function (w.r.t. $b_0$). The nonlinear least squares estimator of $b_0$, $\hat b_{nls}$, satisfies the following minimization program:

$$\hat b_{nls} = \arg\min_{b_0} \sum_{t=1}^{T} \varepsilon_t^2 = \arg\min_{b_0} \sum_{t=1}^{T} \left(y_t - h(x_t'; b_0)\right)^2.$$


(Nonlinear) Least squares method Nonlinear least squares estimation

Example : MA(1) estimation

Let $(X_t)$ be a covariance-stationary process defined by the fundamental representation ($|\theta| < 1$):

$$X_t = \varepsilon_t + \theta\varepsilon_{t-1}$$

where $(\varepsilon_t)$ is the innovation process of $(X_t)$.

The ordinary least squares estimator of $\theta$, which is defined by

$$\hat\theta_{ols} = \arg\min_{\theta} \sum_{t=2}^{T} \left(x_t - \theta\varepsilon_{t-1}\right)^2,$$

is not feasible (since $\varepsilon_{t-1}$ is not observable!).


(Nonlinear) Least squares method Nonlinear least squares estimation

Example : MA(1) estimation (cont’d)

Conditionally on $\varepsilon_0$,

$$x_1 = \varepsilon_1 + \theta\varepsilon_0 \;\Leftrightarrow\; \varepsilon_1 = x_1 - \theta\varepsilon_0$$
$$x_2 = \varepsilon_2 + \theta\varepsilon_1 \;\Leftrightarrow\; \varepsilon_2 = x_2 - \theta\varepsilon_1 = x_2 - \theta(x_1 - \theta\varepsilon_0)$$
$$\vdots$$
$$x_{t-1} = \varepsilon_{t-1} + \theta\varepsilon_{t-2} \;\Leftrightarrow\; \varepsilon_{t-1} = \sum_{k=0}^{t-2}(-\theta)^k x_{t-1-k} + (-\theta)^{t-1}\varepsilon_0.$$

The (conditional) nonlinear least squares estimator of $\theta$ is defined by (assuming that $\varepsilon_0$ is negligible)

$$\hat\theta_{nls} = \arg\min_{\theta} \sum_{t=2}^{T} \left(x_t - \theta\sum_{k=0}^{t-2}(-\theta)^k x_{t-1-k}\right)^2$$

and

$$\sqrt{T}\left(\hat\theta_{nls} - \theta\right) \xrightarrow{d} N\left(0,\ 1-\theta^2\right).$$
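The conditional sum of squares can be evaluated with the equivalent recursion $\hat\varepsilon_t = x_t - \theta\hat\varepsilon_{t-1}$, $\hat\varepsilon_0 = 0$; a grid-search sketch of the resulting estimator (helper names are mine, not from the slides):

```python
import numpy as np

def ma1_conditional_ssr(theta, x):
    """Conditional SSR with eps_0 = 0: eps_t = x_t - theta * eps_{t-1}."""
    e, ssr = 0.0, 0.0
    for xt in x:
        e = xt - theta * e
        ssr += e * e
    return ssr

def ma1_cnls(x, grid=None):
    """Grid-search minimizer of the conditional SSR over the invertible region."""
    if grid is None:
        grid = np.linspace(-0.99, 0.99, 397)
    ssrs = [ma1_conditional_ssr(th, x) for th in grid]
    return float(grid[int(np.argmin(ssrs))])

# simulate an MA(1) with theta = 0.5
rng = np.random.default_rng(2)
theta_true, T = 0.5, 4_000
eps = rng.standard_normal(T + 1)
x = eps[1:] + theta_true * eps[:-1]
theta_hat = ma1_cnls(x)
```

In practice one would use a proper numerical optimizer instead of a grid; the grid keeps the sketch dependency-free.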


(Nonlinear) Least squares method Nonlinear least squares estimation

Example : Least squares estimation using backcasting procedure

Consider the fundamental representation of an MA(q):

$$X_t = u_t, \qquad u_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q}.$$

Given some initial values $\delta^{(0)} = \left(\theta_1^{(0)},\cdots,\theta_q^{(0)}\right)'$:

Compute the unconditional residuals $\hat u_t = X_t$!

Backcast values of $\varepsilon_t$ for $t = -(q-1),\cdots,T$ using the backward recursion

$$\hat\varepsilon_t = \hat u_t - \theta_1^{(0)}\hat\varepsilon_{t+1} - \cdots - \theta_q^{(0)}\hat\varepsilon_{t+q}$$

where the $q$ values of $\varepsilon_t$ beyond the estimation sample are set to zero, $\hat\varepsilon_{T+1} = \cdots = \hat\varepsilon_{T+q} = 0$, and $\hat u_t = 0$ for $t < 1$.


(Nonlinear) Least squares method Nonlinear least squares estimation

Example : Least squares estimation using backcasting procedure(cont’d)

Estimate the values of $\varepsilon_t$ using a forward recursion

$$\hat\varepsilon_t = \hat u_t - \theta_1^{(0)}\hat\varepsilon_{t-1} - \cdots - \theta_q^{(0)}\hat\varepsilon_{t-q}.$$

Minimize the sum of squared residuals using the fitted values $\hat\varepsilon_t$,

$$\sum_{t=q+1}^{T}\left(X_t - \hat\varepsilon_t - \theta_1\hat\varepsilon_{t-1} - \cdots - \theta_q\hat\varepsilon_{t-q}\right)^2,$$

and find $\delta^{(1)} = \left(\theta_1^{(1)},\cdots,\theta_q^{(1)}\right)'$.

Repeat the backcast step, forward recursion, and minimization procedures until the estimates of the moving average part converge.

Remark: The backcast step can be turned off ($\hat\varepsilon_{-(q-1)} = \cdots = \hat\varepsilon_0 = 0$), in which case only the forward recursion is used.
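For q = 1 the loop above can be sketched as follows. This is a simplified variant, not the exact slide procedure: the minimization step is replaced by the closed-form regression of $x_t$ on the fitted lagged residuals, and all names are illustrative:

```python
import numpy as np

def ma1_backcast_fit(x, theta0=0.0, n_iter=100, tol=1e-8):
    """Iterate backcast -> forward recursion -> LS update for an MA(1)."""
    x = np.asarray(x, dtype=float)
    theta = theta0
    for _ in range(n_iter):
        u = x                          # unconditional residuals u_t = X_t
        # backward recursion (eps_{T+1} = 0, u_t = 0 for t < 1) to backcast eps_0
        eb = 0.0
        for ut in u[::-1]:
            eb = ut - theta * eb       # after the loop, eb = backcast eps_1
        eps0 = -theta * eb             # eps_0 = u_0 - theta * eps_1 with u_0 = 0
        # forward recursion for the fitted residuals
        eps = np.empty(len(u))
        prev = eps0
        for t in range(len(u)):
            eps[t] = u[t] - theta * prev
            prev = eps[t]
        # LS update: regress x_t on the fitted lagged residual
        new_theta = float(np.clip(np.sum(eps[:-1] * x[1:]) / np.sum(eps[:-1] ** 2),
                                  -0.99, 0.99))
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta

rng = np.random.default_rng(7)
theta_true, T = 0.5, 4_000
eps_sim = rng.standard_normal(T + 1)
x = eps_sim[1:] + theta_true * eps_sim[:-1]
theta_hat = ma1_backcast_fit(x)
```

Starting from $\theta^{(0)} = 0$ the first update is just $\hat\rho_X(1)$, and subsequent iterations refine it toward the least squares estimate.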


(Nonlinear) Least squares method Discussion

3.3. Discussion

(Linear and nonlinear) least squares estimation is often used in practice. These methods lead to simple algorithms (e.g., the backcasting procedure for ARMA models) and can often be used to initialize more "sophisticated" estimation methods.

These methods are opposed to exact methods (see maximum likelihood estimation).

Nonlinear estimation requires numerical optimization algorithms; issues include:

Initial conditions;

Number of iterations and stopping criteria;

Local versus global convergence;

Fundamental representation and constraints.


(Generalized) Method of moments Methods of moments and Yule-Walker estimation

4. (Generalized) Method of moments
4.1. Method of moments and Yule-Walker estimation

Definition

Suppose there is a set of $k$ conditions

$$S_T - g(\delta) = 0_{k\times 1}$$

where $S_T \in \mathbb{R}^k$ denotes a vector of theoretical moments, $\delta \in \mathbb{R}^k$ is a vector of parameters, and $g : \mathbb{R}^k \to \mathbb{R}^k$ defines a (bijective) mapping between $S_T$ and $\delta$. The method of moments estimator of $\delta$, $\hat\delta_{mm}$, is defined to be the value of $\delta$ such that

$$\hat S_T - g\left(\hat\delta_{mm}\right) = 0_{k\times 1}$$

where $\hat S_T$ is the estimate (empirical counterpart) of $S_T$.


(Generalized) Method of moments Methods of moments and Yule-Walker estimation

Example 1 : MA(1) estimation

Let $(X_t)$ be a covariance-stationary process defined by the fundamental representation ($|\theta| < 1$):

$$X_t = \varepsilon_t + \theta\varepsilon_{t-1}$$

where $(\varepsilon_t)$ is the innovation process of $(X_t)$.

One has

$$S_T - g(\delta) = \begin{pmatrix} \gamma_X(0) - (1+\theta^2)\sigma_\varepsilon^2 \\ \gamma_X(1) - \theta\sigma_\varepsilon^2 \end{pmatrix} = 0_{2\times 1}$$

where $\delta = (\theta, \sigma_\varepsilon^2)'$ and $S_T = (\gamma_X(0), \gamma_X(1))' = \left(E[X_t^2],\ E[X_t X_{t-1}]\right)'$.


(Generalized) Method of moments Methods of moments and Yule-Walker estimation

The method of moments estimator of $\delta$, $\hat\delta_{mm}$, solves

$$\hat S_T - g(\hat\delta_{mm}) = \begin{pmatrix} \hat\gamma_X(0) - (1+\hat\theta_{mm}^2)\hat\sigma_{\varepsilon,mm}^2 \\ \hat\gamma_X(1) - \hat\theta_{mm}\hat\sigma_{\varepsilon,mm}^2 \end{pmatrix} = 0_{2\times 1}.$$

Therefore (using the constraint $|\theta| < 1$)

$$\hat\theta_{mm} = \frac{1 - \sqrt{1 - 4\hat\rho_X^2(1)}}{2\hat\rho_X(1)}$$

and

$$\hat\sigma_{\varepsilon,mm}^2 = \frac{2\hat\gamma_X(0)\hat\rho_X^2(1)}{1 - \sqrt{1 - 4\hat\rho_X^2(1)}} = \frac{\hat\gamma_X(0)}{1 + \hat\theta_{mm}^2}.$$
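A sketch of these closed-form estimates, recovering the variance through the equivalent expression $\hat\sigma^2 = \hat\gamma_X(0)/(1+\hat\theta^2)$ (helper names are mine):

```python
import numpy as np

def ma1_method_of_moments(x):
    """MA(1) method of moments: invertible root from rho_hat(1)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    T = len(x)
    g0 = np.sum((x - xbar) ** 2) / T
    g1 = np.sum((x[:-1] - xbar) * (x[1:] - xbar)) / T
    rho1 = g1 / g0
    if abs(rho1) >= 0.5:
        raise ValueError("no real solution: an MA(1) requires |rho(1)| < 1/2")
    theta = (1.0 - np.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)
    sigma2 = g0 / (1.0 + theta ** 2)
    return float(theta), float(sigma2)

rng = np.random.default_rng(3)
theta_true, T = 0.5, 10_000
e = rng.standard_normal(T + 1)
x = e[1:] + theta_true * e[:-1]
theta_hat, sigma2_hat = ma1_method_of_moments(x)
```

The `ValueError` guard reflects the identification constraint: an MA(1) satisfies $|\rho_X(1)| \le 1/2$, so larger sample autocorrelations admit no real solution.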


(Generalized) Method of moments Methods of moments and Yule-Walker estimation

Example 2: Yule-Walker estimation

Let $(X_t)$ be a causal AR(p). The Yule-Walker equations write

$$E\left[X_{t-k}\left(X_t - \sum_{j=1}^{p}\phi_j X_{t-j}\right)\right] = 0 \quad\text{for all } k = 0,\cdots,p$$

or

$$S_T - g(\delta) = 0_{(p+1)\times 1} \;\Leftrightarrow\; \begin{cases} \gamma_X(0) - \phi'\gamma_p = \sigma_\varepsilon^2 \\ \gamma_p - \Gamma_p\phi = 0_{p\times 1} \end{cases}$$

where $\gamma_p = (\gamma_X(1),\cdots,\gamma_X(p))' = \left(E[X_tX_{t-1}],\cdots,E[X_tX_{t-p}]\right)'$, $S_T = (\gamma_X(0), \gamma_p')'$, $\phi = (\phi_1,\cdots,\phi_p)'$, and $\delta = (\phi', \sigma_\varepsilon^2)'$.


(Generalized) Method of moments Methods of moments and Yule-Walker estimation

The method of moments estimator of $(\phi', \sigma_\varepsilon^2)'$ solves

$$\hat\phi_{mm} = \hat\Gamma_p^{-1}\hat\gamma_p$$

and

$$\hat\sigma_{\varepsilon,mm}^2 = \hat\gamma_X(0) - \hat\phi_{mm}'\hat\gamma_p.$$
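A minimal numpy sketch of Yule-Walker estimation, building $\hat\Gamma_p$ as a Toeplitz matrix of sample autocovariances (an illustrative helper, not from the slides):

```python
import numpy as np

def yule_walker(x, p):
    """Solve Gamma_p phi = gamma_p; sigma2 = gamma(0) - phi' gamma_p."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    gamma = np.array([np.sum(x[:T - h] * x[h:]) / T for h in range(p + 1)])
    # Toeplitz matrix of autocovariances: Gamma[i, j] = gamma(|i - j|)
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, gamma[1:])
    sigma2 = float(gamma[0] - phi @ gamma[1:])
    return phi, sigma2

# simulate a causal AR(2)
rng = np.random.default_rng(4)
phi_true = np.array([0.5, -0.3])
T = 8_000
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(2, T):
    x[t] = phi_true[0] * x[t - 1] + phi_true[1] * x[t - 2] + eps[t]

phi_hat, sigma2_hat = yule_walker(x, 2)
```

Using the biased (divisor $T$) autocovariances guarantees that $\hat\Gamma_p$ is positive definite, so the linear system is always solvable.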

Question: Difference(s) with the ordinary least squares estimation of an AR(p)?


(Generalized) Method of moments Methods of moments and Yule-Walker estimation

Discussion

It is necessary to use theoretical moments that can be well estimated from the data (given the sample size).

Method of moments estimation depends on the mapping $g$, and especially on the "smoothness" of its inverse $g^{-1}$ (one needs to avoid an inverse map with a "large derivative").

Summarizing the data through the sample moments incurs a loss of information; the main exception is when the sample moments are sufficient (in a statistical sense).

Instead of reducing a sample of size $T$ to a "sample" of empirical moments of size $k$, one may reduce the sample to more "empirical moments" than there are parameters... the so-called generalized method of moments.


(Generalized) Method of moments Generalized method of moments

4.2. Generalized method of moments

Intuition:

Consider the linear regression model in an i.i.d. context,

$$y_t = x_t' b_0 + \varepsilon_t.$$

The exogeneity assumption $E[\varepsilon_t \mid x_t] = 0$ implies the $k$ moment conditions (for $t = 1,\cdots,T$)

$$E\left[x_t(y_t - x_t'b_0)\right] = 0_{k\times 1}.$$

Using the empirical counterpart (analogy principle),

$$\frac{1}{T}\sum_{t=1}^{T} x_t(y_t - x_t'b_0) = 0_{k\times 1}.$$

This is a system of $k$ equations with $k$ unknowns (just-identified):

$$\hat b_{ols} = \left(\sum_{t=1}^{T} x_t x_t'\right)^{-1}\left(\sum_{t=1}^{T} x_t y_t\right).$$


(Generalized) Method of moments Generalized method of moments

More generally, suppose that there exists a set of instruments $z_t \in \mathbb{R}^p$ (with $p \geq k$), the instruments being (i) uncorrelated with the error terms and (ii) correlated with the explanatory variables, such that

$$E\left[z_t(y_t - x_t'b_0)\right] = 0_{p\times 1} \;\Leftrightarrow\; E\left[h(z_t; \delta_0)\right] = 0_{p\times 1}.$$

Using the empirical counterpart (analogy principle),

$$\frac{1}{T}\sum_{t=1}^{T} z_t(y_t - x_t'b_0) = 0_{p\times 1} \;\Leftrightarrow\; \frac{1}{T}\sum_{t=1}^{T} h(z_t; \delta_0) = 0_{p\times 1}.$$

This system is overidentified for $p > k$. How can we find an estimate of $b_0$?


(Generalized) Method of moments Generalized method of moments

The answer is provided by the generalized method of moments, i.e., an estimation method based on estimating equations that imposes the nullity of the expectation of a vector function of the observations and the parameters of interest.

The basic idea is to choose a value for $\delta$ such that the empirical counterpart is as close as possible to zero.

"As close as possible to zero" means that one minimizes the distance between $\frac{1}{T}\sum_{t=1}^{T} z_t(y_t - x_t'b_0)$ and $0_{p\times 1}$ using a weighting matrix (say, $W_T$).

Two issues:
1 Choice of instruments;
2 Choice of the (optimal) weighting matrix.


(Generalized) Method of moments Generalized method of moments

Definition

Consider the set of moment conditions

$$E\left[h(z_t, \delta_0)\right] = 0_{p\times 1}$$

where $\delta_0 \in \mathbb{R}^k$, $p \geq k$, and $h$ is a $p$-dimensional (non)linear function of the parameters of interest that depends on the (i.i.d.) observations. The generalized method of moments estimator of $\delta_0$ associated with $W_T$ (a symmetric positive definite matrix) is a solution to the problem

$$\min_{\delta}\left[\frac{1}{T}\sum_{t=1}^{T} h(z_t;\delta)\right]' W_T \left[\frac{1}{T}\sum_{t=1}^{T} h(z_t;\delta)\right].$$


(Generalized) Method of moments Generalized method of moments

Definition

The generalized method of moments estimator of $\delta_0$ solves the first-order conditions

$$\left[\frac{1}{T}\sum_{t=1}^{T}\frac{\partial h(z_t;\hat\delta_{gmm})}{\partial\delta}\right]' W_T \left[\frac{1}{T}\sum_{t=1}^{T} h(z_t;\hat\delta_{gmm})\right] = 0_{k\times 1}.$$

This estimator is consistent and asymptotically normally distributed.

Remark: The first-order conditions correspond to $k$ linear combinations of the moment conditions.


(Generalized) Method of moments Generalized method of moments

This estimator depends on $W_T$. There exists a best GMM estimator, i.e., an optimal weighting matrix.

The optimal weighting matrix is given by (i.i.d. context)

$$W_T^* = V^{-1}\left[h(z_t,\delta_0)\right] \equiv E\left[h(z_t,\delta_0)h(z_t,\delta_0)'\right]^{-1}$$

and the sample analog is

$$\left[\frac{1}{T}\sum_{t=1}^{T} h(z_t,\delta_0)h(z_t,\delta_0)'\right]^{-1}.$$

It does depend on $\delta_0$!

How can we proceed in practice?

Two-step GMM or iterated GMM methods;
One-step methods (continuously updating estimator, exponential tilting estimator, empirical likelihood estimator, etc.).


(Generalized) Method of moments Generalized method of moments

Example : AR(1) estimation

Let $(X_t)$ be a covariance-stationary process defined by the fundamental representation ($|\phi| < 1$):

$$X_t = \phi X_{t-1} + \varepsilon_t.$$

Since $(\varepsilon_t)$ is the innovation process of $(X_t)$, one has the following moment conditions:

$$E[x_{t-k}\varepsilon_t] = 0 \quad\text{for } k > 0.$$

⇒ Past values $x_{t-1},\cdots,x_{t-h}$ can be used as instruments.

Let $z_t$ denote the set of instruments:

$$z_t = (x_{t-1},\cdots,x_{t-h})'.$$


(Generalized) Method of moments Generalized method of moments

The moment conditions write

$$E\left[h(z_t,\delta_0)\right] = 0_{h\times 1}$$

where

$$h(z_t,\delta_0) = z_t(x_t - \phi_1 x_{t-1}).$$

The empirical counterpart is defined to be

$$\frac{1}{T}\sum_{t=h+1}^{T}\begin{pmatrix} x_{t-1} \\ \vdots \\ x_{t-h} \end{pmatrix}(x_t - \phi_1 x_{t-1}) = 0_{h\times 1}.$$


(Generalized) Method of moments Generalized method of moments

The 2S-GMM estimator of $\phi$ is a solution to the problem

$$\left[\sum_{t=h+1}^{T}\frac{\partial h(z_t;\hat\delta_{gmm})}{\partial\delta}\right]' \hat\Omega_T^{-1} \left[\sum_{t=h+1}^{T} h(z_t;\hat\delta_{gmm})\right] = 0$$

where $\hat\Omega_T^{-1}$ is a first-step consistent estimate of the inverse of the variance-covariance matrix of the moment conditions; it accounts for the dependence of the moment conditions.

Remark: In a first step, one solves

$$\left[\sum_{t=h+1}^{T}\frac{\partial h(z_t;\hat\delta_{1,gmm})}{\partial\delta}\right]' I_T \left[\sum_{t=h+1}^{T} h(z_t;\hat\delta_{1,gmm})\right] = 0$$

and then computes $\hat\Omega_T^{-1}(\hat\delta_{1,gmm}) \equiv \hat\Omega_T^{-1}$.
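Because the moment conditions are linear in $\phi$, the sample moment vector is $\bar m(\phi) = \bar a - \phi\,\bar c$, and each GMM step has the closed form $\hat\phi = (\bar c'W\bar a)/(\bar c'W\bar c)$. A two-step sketch (i.i.d.-style weighting with no HAC correction; all names are mine):

```python
import numpy as np

def ar1_gmm(x, h=2):
    """Two-step GMM for an AR(1) with instruments z_t = (x_{t-1}, ..., x_{t-h})'."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    Z = np.column_stack([x[h - k:T - k] for k in range(1, h + 1)])  # rows z_t
    y = x[h:]              # x_t
    xlag = x[h - 1:T - 1]  # x_{t-1}
    n = len(y)
    a = Z.T @ y / n        # (1/T) sum z_t x_t
    c = Z.T @ xlag / n     # (1/T) sum z_t x_{t-1}
    # step 1: identity weighting
    phi1 = float(c @ a / (c @ c))
    # step 2: weighting from first-step residuals
    u = y - phi1 * xlag
    m = Z * u[:, None]                       # h(z_t; phi1) for each t
    W = np.linalg.inv(m.T @ m / n)           # inverse moment covariance
    phi2 = float((c @ W @ a) / (c @ W @ c))
    return phi2

rng = np.random.default_rng(5)
phi_true, T = 0.6, 5_000
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + eps[t]

phi_gmm = ar1_gmm(x)
```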


(Generalized) Method of moments Generalized method of moments

Discussion

Choice of instruments (optimal instruments?) and of the weighting matrix;

Bias-efficiency trade-off;

Numerical procedures can be cumbersome!

Other methods can be written as a GMM estimation.


Maximum likelihood estimation Overview

5. Maximum likelihood estimation
5.1. Overview

Consider the linear regression model ($t = 1,\cdots,T$):

$$y_t = x_t'b_0 + \varepsilon_t$$

where $(y_t, x_t')$ are i.i.d., the regressors are stochastic, $x_t'$ is the $t$-th row of the $X$ matrix, and

$$\varepsilon_t \mid x_t \sim \text{i.i.d. } N(0, \sigma_\varepsilon^2).$$

It follows that

$$y_t \mid x_t \sim \text{i.i.d. } N(x_t'b_0, \sigma_\varepsilon^2).$$


Maximum likelihood estimation Overview

One observes a sample of (conditionally) i.i.d. observations $(y_t, x_t')$. These observations are realizations of random variables with a known distribution but unknown parameters.

The basic idea is to maximize the probability of observing these realizations (given a known parametric distribution with unknown parameters).

The probability of observing these realizations is provided by the joint density (or pdf) of the observations, evaluated at the observations.

This joint density is the likelihood function, with the exception that it is viewed as a function of the unknown parameters (and not of the observations!).


Maximum likelihood estimation Estimation

5.2. Estimation

Definition

The exact likelihood function of the linear regression model is defined to be

$$L(y, x; \delta_0) = \prod_{t=1}^{T} L_t(y_t, x_t; \delta_0)$$

where $L_t$ is the (exact) likelihood function for observation $t$,

$$L_t(y_t, x_t; \delta_0) = f_{Y_t|X_t}(y_t \mid x_t; \delta_0) \times f_{X_t}(x_t),$$

and $f_{Y_t|X_t}$ (respectively, $f_{X_t}$) is the (conditional) probability density function of $Y_t \mid X_t$ (respectively, of $X_t$). Accordingly, the exact log-likelihood function is defined to be

$$\ell(y, x; \delta_0) = \sum_{t=1}^{T}\ell_t(y_t, x_t; \delta_0) \equiv \sum_{t=1}^{T}\log L_t(y_t, x_t; \delta_0).$$


Maximum likelihood estimation Estimation

The exact likelihood function is given by

$$L(y, x; \delta_0) = \prod_{t=1}^{T} f_{Y_t|X_t}(y_t \mid x_t; \delta_0) \prod_{t=1}^{T} f_{X_t}(x_t).$$

The second right-hand-side term is often neglected (especially if $f_X$ does not depend on $\delta_0$) and one considers the conditional likelihood function

$$L(y \mid x; \delta_0) = \prod_{t=1}^{T} f_{Y_t|X_t}(y_t \mid x_t; \delta_0) = \prod_{t=1}^{T}\left(2\pi\sigma_\varepsilon^2\right)^{-1/2}\exp\left(-\frac{1}{2\sigma_\varepsilon^2}\left(y_t - x_t'b_0\right)^2\right) = \left(2\pi\sigma_\varepsilon^2\right)^{-T/2}\exp\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^{T}\left(y_t - x_t'b_0\right)^2\right),$$

and the conditional log-likelihood function is given by

$$\ell(y \mid x; \delta_0) = -\frac{T}{2}\log(\sigma_\varepsilon^2) - \frac{T}{2}\log(2\pi) - \frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^{T}\left(y_t - x_t'b_0\right)^2.$$
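As a check, this conditional log-likelihood is maximized exactly at the OLS coefficients together with $\hat\sigma^2 = \frac{1}{T}\sum_t(y_t - x_t'\hat b)^2$; a small numpy sketch on simulated data (names are mine):

```python
import numpy as np

def cond_loglik(b, sigma2, y, X):
    """Conditional Gaussian log-likelihood of the linear regression model."""
    resid = y - X @ b
    T = len(y)
    return float(-T / 2 * np.log(sigma2) - T / 2 * np.log(2 * np.pi)
                 - resid @ resid / (2 * sigma2))

rng = np.random.default_rng(8)
T = 500
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
b_true = np.array([1.0, 2.0])
y = X @ b_true + rng.standard_normal(T)

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # argmax of the log-likelihood in b
s2_ml = float(np.mean((y - X @ b_ols) ** 2))   # ML variance estimate (divisor T)
```

By construction, `cond_loglik(b_ols, s2_ml, y, X)` is at least as large as the log-likelihood evaluated at any other $(b, \sigma^2)$, including the true parameters.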


Maximum likelihood estimation Estimation

Definition

The maximum likelihood estimator of $\delta_0$ in the linear regression model, which is the solution of the problem

$$\max_{\delta} L(y \mid x; \delta) \;\Leftrightarrow\; \max_{\delta}\ell(y \mid x; \delta),$$

is asymptotically unbiased, efficient, and normally distributed:

$$\sqrt{T}\left(\hat\delta_{ml} - \delta_0\right) \xrightarrow{d} N\left(0_{k\times 1},\ I_F^{-1}\right)$$

where $I_F$ is the Fisher information matrix.


Maximum likelihood estimation Estimation

Example: AR(1) estimation

Consider the following AR(1) representation ($|\phi| < 1$):

$$X_t = \phi X_{t-1} + \varepsilon_t$$

where $\varepsilon_t \mid X_{t-1} \sim \text{i.i.d. } N(0, \sigma_\varepsilon^2)$. Therefore

$$X_t \mid X_{t-1} = x_{t-1} \sim N\left(\phi x_{t-1},\ \sigma_\varepsilon^2\right), \qquad X_1 \sim N\left(0,\ \frac{\sigma_\varepsilon^2}{1-\phi^2}\right).$$

Accordingly,

$$f_{X_t|X_{t-1}}(x_t \mid x_{t-1}; \delta_0) = \left(2\pi\sigma_\varepsilon^2\right)^{-1/2}\exp\left(-\frac{1}{2\sigma_\varepsilon^2}(x_t - \phi x_{t-1})^2\right)$$

$$f_{X_1}(x_1; \delta_0) = \left(2\pi\frac{\sigma_\varepsilon^2}{1-\phi^2}\right)^{-1/2}\exp\left(-\frac{1-\phi^2}{2\sigma_\varepsilon^2}x_1^2\right).$$


Maximum likelihood estimation Estimation

Exact (log-)likelihood function

The exact likelihood function is given by

$$L(x; \delta_0) = L_1(x_1; \delta_0) \times \prod_{\tau=2}^{T} L_\tau(x_\tau \mid x_{\tau-1}; \delta_0) = f_{X_1}(x_1; \delta_0) \times \prod_{\tau=2}^{T} f_{X_\tau|X_{\tau-1}}(x_\tau \mid x_{\tau-1}; \delta_0)$$

$$= \left(2\pi\frac{\sigma_\varepsilon^2}{1-\phi^2}\right)^{-1/2}\exp\left(-\frac{1-\phi^2}{2\sigma_\varepsilon^2}x_1^2\right) \times \left(2\pi\sigma_\varepsilon^2\right)^{-\frac{T-1}{2}}\exp\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{t=2}^{T}(x_t - \phi x_{t-1})^2\right).$$

The exact log-likelihood is

$$\ell(x; \delta_0) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\log\left(\frac{\sigma_\varepsilon^2}{1-\phi^2}\right) - \frac{1-\phi^2}{2\sigma_\varepsilon^2}x_1^2 - \frac{T-1}{2}\log(\sigma_\varepsilon^2) - \frac{1}{2\sigma_\varepsilon^2}\sum_{t=2}^{T}(x_t - \phi x_{t-1})^2.$$


Maximum likelihood estimation Estimation

Conditional (log-)likelihood function

The conditional likelihood function is given by

$$L(x \mid x_{-1}; \delta_0) = \prod_{\tau=2}^{T} L_\tau(x_\tau \mid x_{\tau-1}; \delta_0) = \prod_{\tau=2}^{T} f_{X_\tau|X_{\tau-1}}(x_\tau \mid x_{\tau-1}; \delta_0) = \left(2\pi\sigma_\varepsilon^2\right)^{-\frac{T-1}{2}}\exp\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{t=2}^{T}(x_t - \phi x_{t-1})^2\right).$$

The conditional log-likelihood is

$$\ell(x \mid x_{-1}; \delta_0) = -\frac{T-1}{2}\log(2\pi) - \frac{T-1}{2}\log(\sigma_\varepsilon^2) - \frac{1}{2\sigma_\varepsilon^2}\sum_{t=2}^{T}(x_t - \phi x_{t-1})^2.$$
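Both likelihoods can be maximized directly. In the exact case $\sigma^2$ can be concentrated out, $\hat\sigma^2(\phi) = \left[(1-\phi^2)x_1^2 + \sum_{t=2}^{T}(x_t - \phi x_{t-1})^2\right]/T$, leaving a one-dimensional search over $\phi$. A sketch (grid search for simplicity; helper names are mine):

```python
import numpy as np

def ar1_exact_loglik(phi, x):
    """Exact Gaussian log-likelihood of an AR(1), with sigma^2 concentrated out."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    ssr = (1 - phi ** 2) * x[0] ** 2 + np.sum((x[1:] - phi * x[:-1]) ** 2)
    sigma2 = ssr / T                     # first-order condition in sigma^2
    return (-T / 2 * np.log(2 * np.pi) + 0.5 * np.log(1 - phi ** 2)
            - T / 2 * np.log(sigma2) - T / 2)

def ar1_eml(x, grid=None):
    """Exact MLE of phi by grid search over (-1, 1)."""
    if grid is None:
        grid = np.linspace(-0.995, 0.995, 399)
    ll = [ar1_exact_loglik(p, x) for p in grid]
    return float(grid[int(np.argmax(ll))])

def ar1_cml(x):
    """Conditional MLE of phi, which coincides with OLS."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2))

rng = np.random.default_rng(6)
phi_true, T = 0.6, 3_000
x = np.zeros(T)
eps = rng.standard_normal(T)
x[0] = eps[0] / np.sqrt(1 - phi_true ** 2)   # draw x_1 from the stationary law
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + eps[t]

phi_eml, phi_cml = ar1_eml(x), ar1_cml(x)
```

The two estimates differ only through the treatment of $x_1$, so their gap shrinks quickly as $T$ grows.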


Maximum likelihood estimation Estimation

The two methods lead to different maximum likelihood estimates:

$$\hat\delta_{cml} \neq \hat\delta_{eml}.$$

The conditional maximum likelihood estimator of $\phi$ is also the ordinary least squares estimator of $\phi$:

$$\hat\phi_{cml} = \hat\phi_{ols}.$$

This also holds for any AR(p).

The Yule-Walker, conditional maximum likelihood, and (ordinary) least squares estimators are asymptotically equivalent.


Maximum likelihood estimation Estimation

Discussion

Generally the maximum likelihood estimates are built as if the observations were Gaussian, although it is not necessary that $(X_t)$ be Gaussian when doing the estimation.

If the model is misspecified, the only difference (to some extent...) is that the estimates may be less efficient. In this case, one can use pseudo- or quasi-maximum likelihood estimation.

In the case of linear models, there may not be much gain in using exact maximum likelihood over the conditional likelihood method (or the least squares method).

Maximum likelihood estimation often requires nonlinear numerical optimization procedures.

Estimation of the Fisher information matrix can be cumbersome (especially in finite samples).


Appendix

Appendix: ML estimation of an MA(q)

Step 1: Writing $\varepsilon = (\varepsilon_1,\cdots,\varepsilon_T)'$

One has

$$\varepsilon_{1-q} = \varepsilon_{1-q}, \quad\ldots,\quad \varepsilon_0 = \varepsilon_0,$$
$$\varepsilon_1 = x_1 - \theta_1\varepsilon_0 - \cdots - \theta_q\varepsilon_{1-q},$$
$$\varepsilon_2 = x_2 - \theta_1\varepsilon_1 - \cdots - \theta_q\varepsilon_{2-q},$$
$$\vdots$$
$$\varepsilon_T = x_T - \theta_1\varepsilon_{T-1} - \cdots - \theta_q\varepsilon_{T-q}.$$


Appendix

Therefore

$$\varepsilon = NX + Z\varepsilon_* = \begin{pmatrix} Z & N \end{pmatrix}\begin{pmatrix} \varepsilon_* \\ X \end{pmatrix}$$

with $X = (x_1,\cdots,x_T)'$, $\varepsilon_* = (\varepsilon_{1-q}, \varepsilon_{2-q},\cdots,\varepsilon_{-1},\varepsilon_0)'$, and

$$N = \begin{pmatrix} 0_{q\times T} \\ A_1(\theta) \end{pmatrix}, \qquad Z = \begin{pmatrix} I_q \\ A_2(\theta) \end{pmatrix}$$

where $A_1(\theta)$ is a lower triangular $(T\times T)$ matrix, $A_2(\theta)$ is a $T\times q$ matrix, and $I_q$ is the identity matrix of order $q$.


Appendix

Step 2: Determination of the joint density of $\varepsilon$

The joint probability density function of $\varepsilon$ is defined to be

$$\left(2\pi\sigma_\varepsilon^2\right)^{-\frac{T+q}{2}}\exp\left[-\frac{1}{2\sigma_\varepsilon^2}\left(NX + Z\varepsilon_*\right)'\left(NX + Z\varepsilon_*\right)\right].$$

Step 3: Determination of the marginal density of $X$

The marginal density of $X$ is defined to be

$$\left(2\pi\sigma_\varepsilon^2\right)^{-\frac{T}{2}}\left(\det X'X\right)^{-\frac{1}{2}}\exp\left[-\frac{1}{2\sigma_\varepsilon^2}\left(NX + Z\hat\varepsilon_*\right)'\left(NX + Z\hat\varepsilon_*\right)\right]$$

with $\hat\varepsilon_* = -(Z'Z)^{-1}Z'NX$.


Appendix

Step 4: Determination of the log-likelihood function

The log-likelihood function is defined to be

$$\ell(x;\delta) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma_\varepsilon^2) - \frac{1}{2}\log\left(\det X'X\right) - \frac{S(\theta)}{2\sigma_\varepsilon^2}$$

where $\delta = (\theta', \sigma_\varepsilon^2)'$ and

$$S(\theta) = \left(NX + Z\hat\varepsilon_*\right)'\left(NX + Z\hat\varepsilon_*\right).$$

Step 5: Determination of $\hat\sigma_{\varepsilon,ml}^2$

Using the first-order condition w.r.t. $\sigma_\varepsilon^2$, one has

$$\hat\sigma_{\varepsilon,ml}^2 = \frac{S(\hat\theta_{ml})}{T}.$$


Appendix

Step 6: Determination of the concentrated log-likelihood function

The concentrated log-likelihood function is defined to be

$$\ell^*(x;\theta) = -T\log\left[S(\theta)\right] - \log\left[\det X'X\right].$$

Step 7: Determination of $\hat\theta_{ml}$

$\hat\theta_{ml}$ solves

$$\min_{\theta}\; T\log\left[S(\theta)\right] + \log\left[\det X'X\right].$$

Remarks:

1 Least squares methods only focus on the first right-hand-side term of the log-likelihood function; exact methods account for both right-hand-side terms.

2 This method generalizes to ARMA(p,q) models.
