Cambridge Time Series Notes


    TIME SERIES

    Contents

Syllabus
Books
Keywords

1 Models for time series
1.1 Time series data
1.2 Trend, seasonality, cycles and residuals
1.3 Stationary processes
1.4 Autoregressive processes
1.5 Moving average processes
1.6 White noise
1.7 The turning point test

2 Models of stationary processes
2.1 Purely indeterministic processes
2.2 ARMA processes
2.3 ARIMA processes
2.4 Estimation of the autocovariance function
2.5 Identifying a MA(q) process
2.6 Identifying an AR(p) process
2.7 Distributions of the ACF and PACF

3 Spectral methods
3.1 The discrete Fourier transform
3.2 The spectral density
3.3 Analysing the effects of smoothing

4 Estimation of the spectrum
4.1 The periodogram
4.2 Distribution of spectral estimates
4.3 The fast Fourier transform

5 Linear filters
5.1 The Filter Theorem
5.2 Application to autoregressive processes
5.3 Application to moving average processes
5.4 The general linear process


5.5 Filters and ARMA processes
5.6 Calculating autocovariances in ARMA models

6 Estimation of trend and seasonality
6.1 Moving averages
6.2 Centred moving averages
6.3 The Slutzky-Yule effect
6.4 Exponential smoothing
6.5 Calculation of seasonal indices

7 Fitting ARIMA models
7.1 The Box-Jenkins procedure
7.2 Identification
7.3 Estimation
7.4 Verification
7.5 Tests for white noise
7.6 Forecasting with ARMA models

8 State space models
8.1 Models with unobserved states
8.2 The Kalman filter
8.3 Prediction
8.4 Parameter estimation revisited


    Syllabus

Time series analysis refers to problems in which observations are collected at regular time intervals and there are correlations among successive observations. Applications cover virtually all areas of Statistics, but some of the most important include economic and financial time series, and many areas of environmental or ecological data.

In this course, I shall cover some of the most important methods for dealing with these problems. In the case of time series, these include the basic definitions of autocorrelations etc., then time-domain model fitting including autoregressive and moving average processes, spectral methods, and some discussion of the effect of time series correlations on other kinds of statistical inference, such as the estimation of means and regression coefficients.

    Books

1. P.J. Brockwell and R.A. Davis, Time Series: Theory and Methods, Springer Series in Statistics (1986).

2. C. Chatfield, The Analysis of Time Series: Theory and Practice, Chapman and Hall (1975). Good general introduction, especially for those completely new to time series.

3. P.J. Diggle, Time Series: A Biostatistical Introduction, Oxford University Press (1990).

    4. M. Kendall, Time Series, Charles Griffin (1976).


    Keywords

ACF; Akaike's AIC; AR(p); ARIMA(p, d, q); ARMA(p, q); autocorrelation function; autocovariance function; autoregressive integrated moving average process; autoregressive moving average process; autoregressive process; backshift operator; Box-Jenkins; Box-Pierce; centred average of fours; centred moving average; classical decomposition; correlogram; estimation; fast Fourier transform; filter generating function; Fourier frequencies; Gaussian process; general linear process; identifiability; identification; invertible process; Kalman filter updating equations; Levinson-Durbin recursion; linear filter; MA(q); moving average process; nonnegative definite sequence; PACF; periodogram; purely indeterministic; sample partial autocorrelation coefficient; second order stationary; simple exponential smoothing; Slutzky-Yule effect; spectral density function; spectral distribution function; state space model; strictly stationary; strongly stationary; symmetric moving average; transfer function; turning point test; variate difference method; verification; weakly stationary; white noise; Yule-Walker equations.


    1 Models for time series

    1.1 Time series data

A time series is a set of statistics, usually collected at regular intervals. Time series data occur naturally in many application areas:

- economics - e.g., monthly data for unemployment, hospital admissions, etc.
- finance - e.g., daily exchange rates, share prices, etc.
- environmental - e.g., daily rainfall, air quality readings.
- medicine - e.g., ECG brain wave activity every $2^{-8}$ secs.

The methods of time series analysis pre-date those for general stochastic processes and Markov chains. The aims of time series analysis are to describe and summarise time series data, fit low-dimensional models, and make forecasts.

We write our real-valued series of observations as $\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$, a doubly infinite sequence of real-valued random variables indexed by $\mathbb{Z}$.

    1.2 Trend, seasonality, cycles and residuals

One simple method of describing a series is that of classical decomposition. The notion is that the series can be decomposed into four elements:

Trend ($T_t$): long term movements in the mean;

Seasonal effects ($I_t$): cyclical fluctuations related to the calendar;

Cycles ($C_t$): other cyclical fluctuations (such as business cycles);

Residuals ($E_t$): other random or systematic fluctuations.

The idea is to create separate models for these four elements and then combine them, either additively

$X_t = T_t + I_t + C_t + E_t$

or multiplicatively

$X_t = T_t \cdot I_t \cdot C_t \cdot E_t .$

    1.3 Stationary processes

1. A sequence $\{X_t,\ t \in \mathbb{Z}\}$ is strongly stationary or strictly stationary if

$(X_{t_1}, \ldots, X_{t_k}) \overset{D}{=} (X_{t_1+h}, \ldots, X_{t_k+h})$

for all sets of time points $t_1, \ldots, t_k$ and integer $h$.


2. A sequence is weakly stationary, or second order stationary, if

(a) $E(X_t) = \mu$, and

(b) $\mathrm{cov}(X_t, X_{t+k}) = \gamma_k$,

where $\mu$ is constant and $\gamma_k$ is independent of $t$.

3. The sequence $\{\gamma_k,\ k \in \mathbb{Z}\}$ is called the autocovariance function.

4. We also define

$\rho_k = \gamma_k/\gamma_0 = \mathrm{corr}(X_t, X_{t+k})$

and call $\{\rho_k,\ k \in \mathbb{Z}\}$ the autocorrelation function (ACF).

Remarks.

1. A strictly stationary process is weakly stationary.

2. If the process is Gaussian, that is, $(X_{t_1}, \ldots, X_{t_k})$ is multivariate normal for all $t_1, \ldots, t_k$, then weak stationarity implies strong stationarity.

3. $\gamma_0 = \mathrm{var}(X_t) > 0$, assuming $X_t$ is genuinely random.

4. By symmetry, $\gamma_k = \gamma_{-k}$, for all $k$.

    1.4 Autoregressive processes

The autoregressive process of order $p$ is denoted AR(p), and defined by

$X_t = \sum_{r=1}^{p} \phi_r X_{t-r} + \epsilon_t \qquad (1.1)$

where $\phi_1, \ldots, \phi_p$ are fixed constants and $\{\epsilon_t\}$ is a sequence of independent (or uncorrelated) random variables with mean 0 and variance $\sigma^2$.

The AR(1) process is defined by

$X_t = \phi_1 X_{t-1} + \epsilon_t . \qquad (1.2)$

To find its autocovariance function we make successive substitutions, to get

$X_t = \epsilon_t + \phi_1(\epsilon_{t-1} + \phi_1(\epsilon_{t-2} + \cdots)) = \epsilon_t + \phi_1\epsilon_{t-1} + \phi_1^2\epsilon_{t-2} + \cdots$

The fact that $\{X_t\}$ is second order stationary follows from the observation that $E(X_t) = 0$ and that the autocovariance function can be calculated as follows:

$\gamma_0 = E\left[(\epsilon_t + \phi_1\epsilon_{t-1} + \phi_1^2\epsilon_{t-2} + \cdots)^2\right] = (1 + \phi_1^2 + \phi_1^4 + \cdots)\sigma^2 = \frac{\sigma^2}{1 - \phi_1^2},$

$\gamma_k = E\left[\left(\sum_{r=0}^{\infty}\phi_1^r\epsilon_{t-r}\right)\left(\sum_{s=0}^{\infty}\phi_1^s\epsilon_{t+k-s}\right)\right] = \frac{\sigma^2\phi_1^{|k|}}{1 - \phi_1^2} .$


There is an easier way to obtain these results. Multiply equation (1.2) by $X_{t-k}$ and take the expected value, to give

$E(X_t X_{t-k}) = E(\phi_1 X_{t-1} X_{t-k}) + E(\epsilon_t X_{t-k}) .$

Thus $\gamma_k = \phi_1\gamma_{k-1}$, $k = 1, 2, \ldots$

Similarly, squaring (1.2) and taking the expected value gives

$E(X_t^2) = \phi_1^2 E(X_{t-1}^2) + 2\phi_1 E(X_{t-1}\epsilon_t) + E(\epsilon_t^2) = \phi_1^2 E(X_{t-1}^2) + 0 + \sigma^2$

and so $\gamma_0 = \sigma^2/(1 - \phi_1^2)$.

More generally, the AR(p) process is defined as

$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \epsilon_t . \qquad (1.3)$

Again, the autocorrelation function can be found by multiplying (1.3) by $X_{t-k}$, taking the expected value and dividing by $\gamma_0$, thus producing the Yule-Walker equations

$\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2} + \cdots + \phi_p\rho_{k-p}, \quad k = 1, 2, \ldots$

These are linear recurrence relations, with general solution of the form

$\rho_k = C_1\omega_1^{|k|} + \cdots + C_p\omega_p^{|k|},$

where $\omega_1, \ldots, \omega_p$ are the roots of

$\omega^p - \phi_1\omega^{p-1} - \phi_2\omega^{p-2} - \cdots - \phi_p = 0$

and $C_1, \ldots, C_p$ are determined by $\rho_0 = 1$ and the equations for $k = 1, \ldots, p-1$. It is natural to require $\rho_k \to 0$ as $k \to \infty$, in which case the roots must lie inside the unit circle, that is, $|\omega_i| < 1$. Thus there is a restriction on the values of $\phi_1, \ldots, \phi_p$ that can be chosen.
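To see the Yule-Walker recursion in action, the following short sketch (not part of the original notes) computes the theoretical autocorrelations of an AR(2) process; the coefficient values are arbitrary illustrative choices.

    def ar2_acf(phi1, phi2, max_lag):
        """Theoretical ACF of a stationary AR(2) process.

        From the Yule-Walker equations: rho_1 = phi1*rho_0 + phi2*rho_1,
        so rho_1 = phi1/(1 - phi2); thereafter
        rho_k = phi1*rho_{k-1} + phi2*rho_{k-2}.
        """
        rho = [1.0, phi1 / (1.0 - phi2)]
        for k in range(2, max_lag + 1):
            rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])
        return rho

    print(ar2_acf(0.5, 0.3, 10))   # decays geometrically towards zero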

    1.5 Moving average processes

The moving average process of order $q$ is denoted MA(q) and defined by

$X_t = \sum_{s=0}^{q}\theta_s\epsilon_{t-s} \qquad (1.4)$

where $\theta_1, \ldots, \theta_q$ are fixed constants, $\theta_0 = 1$, and $\{\epsilon_t\}$ is a sequence of independent (or uncorrelated) random variables with mean 0 and variance $\sigma^2$.

It is clear from the definition that this is second order stationary and that

$\gamma_k = \begin{cases} 0, & |k| > q \\ \sigma^2\sum_{s=0}^{q-|k|}\theta_s\theta_{s+|k|}, & |k| \le q \end{cases}$


We remark that two moving average processes can have the same autocorrelation function. For example,

$X_t = \epsilon_t + \theta\epsilon_{t-1} \quad\text{and}\quad X_t = \epsilon_t + (1/\theta)\epsilon_{t-1}$

both have $\rho_1 = \theta/(1+\theta^2)$, $\rho_k = 0$, $|k| > 1$. However, the first gives

$\epsilon_t = X_t - \theta\epsilon_{t-1} = X_t - \theta(X_{t-1} - \theta\epsilon_{t-2}) = X_t - \theta X_{t-1} + \theta^2 X_{t-2} - \cdots$

This is only valid for $|\theta| < 1$, a so-called invertible process. No two invertible processes have the same autocorrelation function.

1.6 White noise

The sequence $\{\epsilon_t\}$, consisting of independent (or uncorrelated) random variables with mean 0 and variance $\sigma^2$, is called white noise (for reasons that will become clear later). It is a second order stationary series with $\gamma_0 = \sigma^2$ and $\gamma_k = 0$, $k \ne 0$.

    1.7 The turning point test

We may wish to test whether a series can be considered to be white noise, or whether a more complicated model is required. In later chapters we shall consider various ways to do this; for example, we might estimate the autocovariance function, say $\{\hat\gamma_k\}$, and observe whether or not $\hat\gamma_k$ is near zero for all $k > 0$.

However, a very simple diagnostic is the turning point test, which examines a series $\{X_t\}$ to test whether it is purely random. The idea is that if $\{X_t\}$ is purely random then three successive values are equally likely to occur in any of the six possible orders.

In four cases there is a turning point in the middle. Thus in a series of $n$ points we might expect $(2/3)(n-2)$ turning points.

In fact, it can be shown that for large $n$, the number of turning points should be distributed as about $N(2n/3,\ 8n/45)$. We reject (at the 5% level) the hypothesis that the series is unsystematic if the number of turning points lies outside the range $2n/3 \pm 1.96\sqrt{8n/45}$.
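The turning point count is easy to compute; here is an illustrative sketch (not from the notes) that standardises the count using the $N(2n/3, 8n/45)$ approximation above.

    import numpy as np

    def turning_point_test(x):
        """Return the number of turning points and its standardised value
        under the white-noise approximation N(2n/3, 8n/45)."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        mid, left, right = x[1:-1], x[:-2], x[2:]
        tp = np.sum(((mid > left) & (mid > right)) | ((mid < left) & (mid < right)))
        z = (tp - 2 * n / 3) / np.sqrt(8 * n / 45)
        return tp, z        # reject the white-noise hypothesis at 5% if |z| > 1.96

    rng = np.random.default_rng(1)
    print(turning_point_test(rng.normal(size=200)))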


    2 Models of stationary processes

    2.1 Purely indeterministic processes

Suppose $\{X_t\}$ is a second order stationary process, with mean 0. Its autocovariance function is

$\gamma_k = E(X_t X_{t+k}) = \mathrm{cov}(X_t, X_{t+k}), \quad k \in \mathbb{Z}.$

1. As $\{X_t\}$ is stationary, $\gamma_k$ does not depend on $t$.

2. A process is said to be purely indeterministic if the regression of $X_t$ on $X_{t-q}, X_{t-q-1}, \ldots$ has explanatory power tending to 0 as $q \to \infty$. That is, the residual variance tends to $\mathrm{var}(X_t)$.

An important theorem due to Wold (1938) states that every purely indeterministic second order stationary process $\{X_t\}$ can be written in the form

$X_t = \mu + \psi_0 Z_t + \psi_1 Z_{t-1} + \psi_2 Z_{t-2} + \cdots$

where $\{Z_t\}$ is a sequence of uncorrelated random variables.

3. A Gaussian process is one for which $X_{t_1}, \ldots, X_{t_n}$ has a joint normal distribution for all $t_1, \ldots, t_n$. No two distinct Gaussian processes have the same autocovariance function.

    2.2 ARMA processes

The autoregressive moving average process, ARMA(p, q), is defined by

$X_t - \sum_{r=1}^{p}\phi_r X_{t-r} = \sum_{s=0}^{q}\theta_s\epsilon_{t-s}$

where again $\{\epsilon_t\}$ is white noise. This process is stationary for appropriate $\phi, \theta$.

Example 2.1

Consider the state space model

$X_t = \phi X_{t-1} + \epsilon_t ,$
$Y_t = X_t + \eta_t .$

Suppose $\{X_t\}$ is unobserved, $\{Y_t\}$ is observed and $\{\epsilon_t\}$ and $\{\eta_t\}$ are independent white noise sequences. Note that $\{X_t\}$ is AR(1). We can write

$W_t = Y_t - \phi Y_{t-1}$
$\quad = (X_t + \eta_t) - \phi(X_{t-1} + \eta_{t-1})$
$\quad = (X_t - \phi X_{t-1}) + (\eta_t - \phi\eta_{t-1})$
$\quad = \epsilon_t + \eta_t - \phi\eta_{t-1}$


Now $W_t$ is stationary and $\mathrm{cov}(W_t, W_{t+k}) = 0$, $k \ge 2$. As such, $W_t$ can be modelled as a MA(1) process and $\{Y_t\}$ as ARMA(1, 1).

    2.3 ARIMA processes

If the original process $\{Y_t\}$ is not stationary, we can look at the first order difference process

$X_t = \nabla Y_t = Y_t - Y_{t-1}$

or the second order differences

$X_t = \nabla^2 Y_t = \nabla(\nabla Y)_t = Y_t - 2Y_{t-1} + Y_{t-2}$

and so on. If we ever find that the differenced process is a stationary process we can look for an ARMA model of that.

The process $\{Y_t\}$ is said to be an autoregressive integrated moving average process, ARIMA(p, d, q), if $X_t = \nabla^d Y_t$ is an ARMA(p, q) process.

AR, MA, ARMA and ARIMA processes can be used to model many time series.

    A key tool in identifying a model is an estimate of the autocovariance function.

    2.4 Estimation of the autocovariance function

Suppose we have data $(X_1, \ldots, X_T)$ from a stationary time series. We can estimate

- the mean by $\bar X = (1/T)\sum_{t=1}^{T} X_t$,
- the autocovariance by $c_k = \hat\gamma_k = (1/T)\sum_{t=k+1}^{T}(X_t - \bar X)(X_{t-k} - \bar X)$, and
- the autocorrelation by $r_k = \hat\rho_k = \hat\gamma_k/\hat\gamma_0$.

The plot of $r_k$ against $k$ is known as the correlogram. If it is known that $\mu$ is 0 there is no need to correct for the mean and $\gamma_k$ can be estimated by

$\hat\gamma_k = (1/T)\sum_{t=k+1}^{T} X_t X_{t-k} .$

Notice that in defining $\hat\gamma_k$ we divide by $T$ rather than by $(T-k)$. When $T$ is large relative to $k$ it does not much matter which divisor we use. However, for mathematical simplicity and other reasons there are advantages in dividing by $T$.
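As a concrete sketch (not part of the notes), the estimates above can be computed as follows; the divisor is $T$, as discussed.

    import numpy as np

    def sample_acf(x, max_lag):
        """Sample autocovariances c_k (divisor T) and autocorrelations r_k = c_k/c_0."""
        x = np.asarray(x, dtype=float)
        T = len(x)
        d = x - x.mean()
        c = np.array([np.sum(d[k:] * d[:T - k]) / T for k in range(max_lag + 1)])
        return c, c / c[0]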

Suppose the stationary process $\{X_t\}$ has autocovariance function $\{\gamma_k\}$. Then

$\mathrm{var}\left(\sum_{t=1}^{T} a_t X_t\right) = \sum_{t=1}^{T}\sum_{s=1}^{T} a_t a_s\,\mathrm{cov}(X_t, X_s) = \sum_{t=1}^{T}\sum_{s=1}^{T} a_t a_s\,\gamma_{|t-s|} \ge 0 .$

A sequence $\{\gamma_k\}$ for which this holds for every $T \ge 1$ and set of constants $(a_1, \ldots, a_T)$ is called a nonnegative definite sequence. The following theorem states that $\{\gamma_k\}$ is a valid autocovariance function if and only if it is nonnegative definite.


Theorem 2.2 (Bochner) The following are equivalent.

1. There exists a stationary sequence with autocovariance function $\{\gamma_k\}$.

2. $\{\gamma_k\}$ is nonnegative definite.

3. The spectral density function,

$f(\omega) = \frac{1}{\pi}\sum_{k=-\infty}^{\infty}\gamma_k e^{-i\omega k} = \frac{1}{\pi}\gamma_0 + \frac{2}{\pi}\sum_{k=1}^{\infty}\gamma_k\cos(\omega k) ,$

is positive if it exists.

Dividing by $T$ rather than by $(T-k)$ in the definition of $\hat\gamma_k$

- ensures that $\{\hat\gamma_k\}$ is nonnegative definite (and thus that it could be the autocovariance function of a stationary process), and
- can reduce the $L^2$-error of $r_k$.

    2.5 Identifying a MA(q) process

In a later lecture we consider the problem of identifying an ARMA or ARIMA model for a given time series. A key tool in doing this is the correlogram.

The MA(q) process $X_t$ has $\rho_k = 0$ for all $k$, $|k| > q$. So a diagnostic for MA(q) is that $|r_k|$ drops to near zero beyond some threshold.

    2.6 Identifying an AR(p) process

The AR(p) process has $\rho_k$ decaying exponentially. This can be difficult to recognise in the correlogram. Suppose we have a process $X_t$ which we believe is AR(k) with

$X_t = \sum_{j=1}^{k}\phi_{j,k}X_{t-j} + \epsilon_t$

with $\epsilon_t$ independent of $X_1, \ldots, X_{t-1}$.

Given the data $X_1, \ldots, X_T$, the least squares estimates of $(\phi_{1,k}, \ldots, \phi_{k,k})$ are obtained by minimizing

$\frac{1}{T}\sum_{t=k+1}^{T}\left(X_t - \sum_{j=1}^{k}\phi_{j,k}X_{t-j}\right)^2 .$

This is approximately equivalent to solving equations similar to the Yule-Walker equations,

$\hat\gamma_j = \sum_{\ell=1}^{k}\hat\phi_{\ell,k}\hat\gamma_{|j-\ell|}, \quad j = 1, \ldots, k .$

These can be solved by the Levinson-Durbin recursion:


Step 0. $\hat\sigma_0^2 := \hat\gamma_0$, $\hat\phi_{1,1} := \hat\gamma_1/\hat\gamma_0$, $k := 0$.

Step 1. Repeat until $\hat\phi_{k,k}$ is near 0:

$k := k + 1$

$\hat\phi_{k,k} := \left(\hat\gamma_k - \sum_{j=1}^{k-1}\hat\phi_{j,k-1}\hat\gamma_{k-j}\right)\Big/\hat\sigma_{k-1}^2$

$\hat\phi_{j,k} := \hat\phi_{j,k-1} - \hat\phi_{k,k}\hat\phi_{k-j,k-1}, \quad j = 1, \ldots, k-1$

$\hat\sigma_k^2 := \hat\sigma_{k-1}^2\left(1 - \hat\phi_{k,k}^2\right)$

We test whether the order $k$ fit is an improvement over the order $k-1$ fit by looking to see if $\hat\phi_{k,k}$ is far from zero.

The statistic $\hat\phi_{k,k}$ is called the $k$th sample partial autocorrelation coefficient (PACF). If the process $X_t$ is genuinely AR(p) then the population PACF, $\phi_{k,k}$, is exactly zero for all $k > p$. Thus a diagnostic for AR(p) is that the sample PACFs are close to zero for $k > p$.
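A direct transcription of the recursion may be helpful; this is a sketch (not from the notes) which takes sample autocovariances $\hat\gamma_0, \ldots, \hat\gamma_m$ and returns the sample PACF values $\hat\phi_{k,k}$ and residual variances $\hat\sigma_k^2$.

    def levinson_durbin(gamma, max_order):
        """Levinson-Durbin recursion of Section 2.6.

        gamma[k] is the sample autocovariance at lag k, k = 0, ..., max_order."""
        sigma2 = [gamma[0]]          # sigma_0^2 := gamma_0
        phi_prev = []                # phi_{j,k-1}, j = 1, ..., k-1
        pacf = []
        for k in range(1, max_order + 1):
            num = gamma[k] - sum(phi_prev[j] * gamma[k - 1 - j] for j in range(k - 1))
            phi_kk = num / sigma2[-1]
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi.append(phi_kk)
            pacf.append(phi_kk)
            sigma2.append(sigma2[-1] * (1.0 - phi_kk ** 2))
            phi_prev = phi
        return pacf, sigma2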

    2.7 Distributions of the ACF and PACF

Both the sample ACF and PACF are approximately normally distributed about their population values, and have standard deviation of about $1/\sqrt{T}$, where $T$ is the length of the series. A rule of thumb for negligibility of $\rho_k$ (and similarly for $\phi_{k,k}$) is that $r_k$ (similarly $\hat\phi_{k,k}$) should lie between $\pm 2/\sqrt{T}$. (2 is an approximation to 1.96. Recall that if $Z_1, \ldots, Z_n \sim N(\mu, 1)$, a test of size 0.05 of the hypothesis $H_0: \mu = 0$ against $H_1: \mu \ne 0$ rejects $H_0$ if and only if $\bar Z$ lies outside $\pm 1.96/\sqrt{n}$.)

Care is needed in applying this rule of thumb. It is important to realize that the sample autocorrelations, $r_1, r_2, \ldots$ (and sample partial autocorrelations, $\hat\phi_{1,1}, \hat\phi_{2,2}, \ldots$) are not independently distributed. The probability that any one $r_k$ should lie outside $\pm 2/\sqrt{T}$ depends on the values of the other $r_k$.

A portmanteau test of white noise (due to Box & Pierce and Ljung & Box) can be based on the fact that approximately

$Q_m = T(T+2)\sum_{k=1}^{m}(T-k)^{-1}r_k^2 \sim \chi^2_m .$

The sensitivity of the test to departure from white noise depends on the choice of $m$. If the true model is ARMA(p, q) then greatest power is obtained (rejection of the white noise hypothesis is most probable) when $m$ is about $p + q$.
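For reference, the statistic is simple to compute from the sample autocorrelations; a minimal sketch (not from the notes):

    def ljung_box(r, T, m):
        """Q_m = T(T+2) * sum_{k=1}^m r_k^2/(T-k); r[k] is the kth sample
        autocorrelation (r[0] = 1 is unused).  Compare Q_m with the
        chi-squared distribution described above."""
        return T * (T + 2) * sum(r[k] ** 2 / (T - k) for k in range(1, m + 1))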


    3 Spectral methods

    3.1 The discrete Fourier transform

If $h(t)$ is defined for integers $t$, the discrete Fourier transform of $h$ is

$H(\omega) = \sum_{t=-\infty}^{\infty} h(t)e^{-i\omega t} .$

The inverse transform is

$h(t) = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\omega t}H(\omega)\,d\omega .$

If $h(t)$ is real-valued, and an even function such that $h(t) = h(-t)$, then

$H(\omega) = h(0) + 2\sum_{t=1}^{\infty} h(t)\cos(\omega t)$

and

$h(t) = \frac{1}{\pi}\int_{0}^{\pi}\cos(\omega t)H(\omega)\,d\omega .$

    3.2 The spectral density

The Wiener-Khintchine theorem states that for any real-valued stationary process there exists a spectral distribution function, $F(\omega)$, which is nondecreasing and right continuous on $[0, \pi]$ such that $F(0) = 0$, $F(\pi) = \gamma_0$ and

$\gamma_k = \int_{0}^{\pi}\cos(\omega k)\,dF(\omega) .$

The integral is a Lebesgue-Stieltjes integral and is defined even if $F$ has discontinuities. Informally, $F(\omega_2) - F(\omega_1)$ is the contribution to the variance of the series made by frequencies in the range $(\omega_1, \omega_2)$.

$F(\omega)$ can have jump discontinuities, but always can be decomposed as

$F(\omega) = F_1(\omega) + F_2(\omega)$

where $F_1(\omega)$ is a nondecreasing continuous function and $F_2(\omega)$ is a nondecreasing step function. This is a decomposition of the series into a purely indeterministic component and a deterministic component.

Suppose the process is purely indeterministic (which happens if and only if $\sum_k|\gamma_k| < \infty$). In this case

$f(\omega) = F'(\omega)$ exists, and is called the spectral density function. Apart from a multiplication by $1/\pi$ it is simply the discrete Fourier transform of the autocovariance function and is given by

$f(\omega) = \frac{1}{\pi}\sum_{k=-\infty}^{\infty}\gamma_k e^{-i\omega k} = \frac{1}{\pi}\left[\gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k\cos(\omega k)\right],$

with inverse

$\gamma_k = \int_{0}^{\pi}\cos(\omega k)f(\omega)\,d\omega .$

Note. Some authors define the spectral distribution function on $[-\pi, \pi]$; the use of negative frequencies makes the interpretation of the spectral distribution less intuitive and leads to a difference of a factor of 2 in the definition of the spectral density. Notice, however, that if $f$ is defined as above and extended to negative frequencies, $f(-\omega) = f(\omega)$, then we can write

$\gamma_k = \frac{1}{2}\int_{-\pi}^{\pi} e^{i\omega k}f(\omega)\,d\omega .$

    Example 3.1

(a) Suppose $\{X_t\}$ is i.i.d., $\gamma_0 = \mathrm{var}(X_t) = \sigma^2 > 0$ and $\gamma_k = 0$, $k \ge 1$. Then $f(\omega) = \sigma^2/\pi$. The fact that the spectral density is flat, so that all frequencies are equally present, accounts for our calling this sequence white noise.

(b) As an example of a process which is not purely indeterministic, consider $X_t = \cos(\omega_0 t + U)$ where $\omega_0$ is a value in $[0, \pi]$ and $U \sim U[-\pi, \pi]$. The process has zero mean, since

$E(X_t) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos(\omega_0 t + u)\,du = 0$

and autocovariance

$\gamma_k = E(X_t X_{t+k})$
$\quad = \frac{1}{2\pi}\int_{-\pi}^{\pi}\cos(\omega_0 t + u)\cos(\omega_0 t + \omega_0 k + u)\,du$
$\quad = \frac{1}{2\pi}\int_{-\pi}^{\pi}\tfrac{1}{2}\left[\cos(\omega_0 k) + \cos(2\omega_0 t + \omega_0 k + 2u)\right]du$
$\quad = \frac{1}{2\pi}\cdot\tfrac{1}{2}\left[2\pi\cos(\omega_0 k) + 0\right]$
$\quad = \tfrac{1}{2}\cos(\omega_0 k) .$


Hence $X_t$ is second order stationary and we have

$\gamma_k = \tfrac{1}{2}\cos(\omega_0 k), \quad F(\omega) = \tfrac{1}{2}I[\omega \ge \omega_0] \quad\text{and}\quad f(\omega) = \tfrac{1}{2}\delta_{\omega_0}(\omega) .$

Note that $F$ is a nondecreasing step function.

More generally, the spectral density

$f(\omega) = \sum_{j=1}^{n}\tfrac{1}{2}a_j^2\,\delta_{\omega_j}(\omega)$

corresponds to the process $X_t = \sum_{j=1}^{n}a_j\cos(\omega_j t + U_j)$ where $\omega_j \in [0, \pi]$ and $U_1, \ldots, U_n$ are i.i.d. $U[-\pi, \pi]$.

(c) The MA(1) process, $X_t = \theta_1\epsilon_{t-1} + \epsilon_t$, where $\{\epsilon_t\}$ is white noise. Recall $\gamma_0 = (1+\theta_1^2)\sigma^2$, $\gamma_1 = \theta_1\sigma^2$, and $\gamma_k = 0$, $k > 1$. Thus

$f(\omega) = \frac{\sigma^2}{\pi}\left(1 + 2\theta_1\cos\omega + \theta_1^2\right) .$

(d) The AR(1) process, $X_t = \phi_1 X_{t-1} + \epsilon_t$, where $\{\epsilon_t\}$ is white noise. Recall

$\mathrm{var}(X_t) = \phi_1^2\,\mathrm{var}(X_{t-1}) + \sigma^2 \implies \gamma_0 = \phi_1^2\gamma_0 + \sigma^2 \implies \gamma_0 = \sigma^2/(1-\phi_1^2),$

where we need $|\phi_1| < 1$. The spectral density is

$f(\omega) = \frac{\sigma^2}{\pi\left(1 - 2\phi_1\cos\omega + \phi_1^2\right)} .$

A process with $\phi_1 > 0$ has power at low frequency, whereas $\phi_1 < 0$ has power at high frequency.
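The closed forms in (c) and (d) are easily evaluated numerically; a small sketch (not in the notes), useful for reproducing plots like those below:

    import numpy as np

    def ar1_spectrum(omega, phi1, sigma2=1.0):
        """f(w) = sigma^2 / (pi * (1 - 2*phi1*cos(w) + phi1^2))."""
        return sigma2 / (np.pi * (1.0 - 2.0 * phi1 * np.cos(omega) + phi1 ** 2))

    def ma1_spectrum(omega, theta1, sigma2=1.0):
        """f(w) = (sigma^2/pi) * (1 + 2*theta1*cos(w) + theta1^2)."""
        return (sigma2 / np.pi) * (1.0 + 2.0 * theta1 * np.cos(omega) + theta1 ** 2)

    w = np.linspace(0.0, np.pi, 200)
    fx_low, fx_high = ar1_spectrum(w, 0.5), ar1_spectrum(w, -0.5)   # power at low vs high frequencies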

[Figure: spectral densities $f(\omega)$ for AR(1) processes with $\phi_1 = 1/2$ and $\phi_1 = -1/2$.]

The plots above are the spectral densities for AR(1) processes in which $\{\epsilon_t\}$ is Gaussian white noise, with $\sigma^2/\pi = 1$. Samples of 200 data points are shown below.

[Figure: sample paths of 200 points from AR(1) processes with $\phi_1 = 0.5$ and $\phi_1 = -0.5$.]

    3.3 Analysing the effects of smoothing

Let $\{a_s\}$ be a sequence of real numbers. A linear filter of $\{X_t\}$ is

$Y_t = \sum_{s=-\infty}^{\infty} a_s X_{t-s} .$

In Chapter 5 we show that the spectral density of $\{Y_t\}$ is given by

$f_Y(\omega) = |a(\omega)|^2 f_X(\omega) ,$

where $a(\omega)$ is the transfer function

$a(\omega) = \sum_{s=-\infty}^{\infty} a_s e^{-i\omega s} .$

    This result can be used to explore the effect of smoothing a series.

    Example 3.2

Suppose the AR(1) series above, with $\phi_1 = -0.5$, is smoothed by a moving average on three points, so that the smoothed series is

$Y_t = \tfrac{1}{3}\left[X_{t+1} + X_t + X_{t-1}\right] .$

Then

$|a(\omega)|^2 = \left|\tfrac{1}{3}e^{-i\omega} + \tfrac{1}{3} + \tfrac{1}{3}e^{i\omega}\right|^2 = \tfrac{1}{9}(1 + 2\cos\omega)^2 .$

Notice that $\gamma_X(0) = 4/3$, $\gamma_Y(0) = 2/9$, so $\{Y_t\}$ has 1/6 the variance of $\{X_t\}$. Moreover, all components of frequency $\omega = 2\pi/3$ (i.e., period 3) are eliminated in the smoothed series.

[Figure: the squared transfer function $|a(\omega)|^2$ and the smoothed spectral density $f_Y(\omega) = |a(\omega)|^2 f_X(\omega)$.]

    4 Estimation of the spectrum

    4.1 The periodogram

Suppose we have $T = 2m+1$ observations of a time series, $y_1, \ldots, y_T$. Define the Fourier frequencies, $\omega_j = 2\pi j/T$, $j = 1, \ldots, m$, and consider the regression model

$y_t = \alpha_0 + \sum_{j=1}^{m}\alpha_j\cos(\omega_j t) + \sum_{j=1}^{m}\beta_j\sin(\omega_j t) ,$

which can be written as a general linear model, $Y = X\theta + \epsilon$, where $Y = (y_1, \ldots, y_T)^\top$, $\theta = (\alpha_0, \alpha_1, \beta_1, \ldots, \alpha_m, \beta_m)^\top$, $\epsilon = (\epsilon_1, \ldots, \epsilon_T)^\top$, and $X$ is the $T \times T$ matrix whose $t$th row is $(1, c_{1t}, s_{1t}, \ldots, c_{mt}, s_{mt})$, with

$c_{jt} = \cos(\omega_j t), \quad s_{jt} = \sin(\omega_j t) .$

The least squares estimates in this model are given by

$\hat\theta = (X^\top X)^{-1}X^\top Y .$

Note that

$\sum_{t=1}^{T} e^{i\omega_j t} = \frac{e^{i\omega_j}(1 - e^{i\omega_j T})}{1 - e^{i\omega_j}} = 0 \implies \sum_{t=1}^{T}c_{jt} + i\sum_{t=1}^{T}s_{jt} = 0 \implies \sum_{t=1}^{T}c_{jt} = \sum_{t=1}^{T}s_{jt} = 0$

and

$\sum_{t=1}^{T}c_{jt}s_{jt} = \tfrac{1}{2}\sum_{t=1}^{T}\sin(2\omega_j t) = 0 ,$

$\sum_{t=1}^{T}c_{jt}^2 = \tfrac{1}{2}\sum_{t=1}^{T}\{1 + \cos(2\omega_j t)\} = T/2 , \qquad \sum_{t=1}^{T}s_{jt}^2 = \tfrac{1}{2}\sum_{t=1}^{T}\{1 - \cos(2\omega_j t)\} = T/2 ,$

$\sum_{t=1}^{T}c_{jt}s_{kt} = \sum_{t=1}^{T}c_{jt}c_{kt} = \sum_{t=1}^{T}s_{jt}s_{kt} = 0, \quad j \ne k .$


Using these, we have

$\hat\theta = \begin{pmatrix}\hat\alpha_0 \\ \hat\alpha_1 \\ \vdots \\ \hat\beta_m\end{pmatrix} = \begin{pmatrix} T & 0 & \cdots & 0 \\ 0 & T/2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & T/2\end{pmatrix}^{-1}\begin{pmatrix}\sum_t y_t \\ \sum_t c_{1t}y_t \\ \vdots \\ \sum_t s_{mt}y_t\end{pmatrix} = \begin{pmatrix}\bar y \\ (2/T)\sum_t c_{1t}y_t \\ \vdots \\ (2/T)\sum_t s_{mt}y_t\end{pmatrix}$

and the regression sum of squares is

$\hat Y^\top\hat Y = Y^\top X(X^\top X)^{-1}X^\top Y = T\bar y^2 + \sum_{j=1}^{m}\frac{2}{T}\left[\left(\sum_{t=1}^{T}c_{jt}y_t\right)^2 + \left(\sum_{t=1}^{T}s_{jt}y_t\right)^2\right] .$

Since we are fitting $T$ unknown parameters to $T$ data points, the model fits with no residual error, i.e., $\hat Y = Y$. Hence

$\sum_{t=1}^{T}(y_t - \bar y)^2 = \sum_{j=1}^{m}\frac{2}{T}\left[\left(\sum_{t=1}^{T}c_{jt}y_t\right)^2 + \left(\sum_{t=1}^{T}s_{jt}y_t\right)^2\right] .$

This motivates definition of the periodogram as

$I(\omega) = \frac{1}{\pi T}\left[\left(\sum_{t=1}^{T}y_t\cos(\omega t)\right)^2 + \left(\sum_{t=1}^{T}y_t\sin(\omega t)\right)^2\right] .$

A factor of $1/\pi$ has been introduced into this definition so that the sample variance, $\hat\gamma_0 = (1/T)\sum_{t=1}^{T}(y_t - \bar y)^2$, equates to the sum of the areas of $m$ rectangles, whose heights are $I(\omega_1), \ldots, I(\omega_m)$, whose widths are $2\pi/T$, and whose bases are centred at $\omega_1, \ldots, \omega_m$. I.e., $\hat\gamma_0 = (2\pi/T)\sum_{j=1}^{m}I(\omega_j)$. These rectangles approximate the area under the curve $I(\omega)$, $0 \le \omega \le \pi$.

[Figure: the periodogram $I(\omega)$ approximated by rectangles of width $2\pi/T$ centred at the Fourier frequencies; the rectangle centred at $\omega_5$ has height $I(\omega_5)$.]

Using the fact that $\sum_{t=1}^{T}c_{jt} = \sum_{t=1}^{T}s_{jt} = 0$, we can write

$\pi T I(\omega_j) = \left(\sum_{t=1}^{T}y_t\cos(\omega_j t)\right)^2 + \left(\sum_{t=1}^{T}y_t\sin(\omega_j t)\right)^2$
$\quad = \left(\sum_{t=1}^{T}(y_t - \bar y)\cos(\omega_j t)\right)^2 + \left(\sum_{t=1}^{T}(y_t - \bar y)\sin(\omega_j t)\right)^2$
$\quad = \left|\sum_{t=1}^{T}(y_t - \bar y)e^{-i\omega_j t}\right|^2$
$\quad = \sum_{t=1}^{T}(y_t - \bar y)e^{-i\omega_j t}\sum_{s=1}^{T}(y_s - \bar y)e^{i\omega_j s}$
$\quad = \sum_{t=1}^{T}(y_t - \bar y)^2 + 2\sum_{k=1}^{T-1}\sum_{t=k+1}^{T}(y_t - \bar y)(y_{t-k} - \bar y)\cos(\omega_j k) .$

Hence

$I(\omega_j) = \frac{1}{\pi}\left[\hat\gamma_0 + 2\sum_{k=1}^{T-1}\hat\gamma_k\cos(\omega_j k)\right] .$

$I(\omega)$ is therefore a sample version of the spectral density $f(\omega)$.
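A direct implementation of the definition (an illustrative sketch, not part of the notes), which also checks the rectangle identity $\hat\gamma_0 = (2\pi/T)\sum_{j=1}^{m}I(\omega_j)$ numerically:

    import numpy as np

    def periodogram(y):
        """I(w_j) = (1/(pi*T)) * [ (sum_t y_t cos(w_j t))^2 + (sum_t y_t sin(w_j t))^2 ]
        at the Fourier frequencies w_j = 2*pi*j/T, j = 1, ..., m."""
        y = np.asarray(y, dtype=float)
        T = len(y)
        m = (T - 1) // 2
        t = np.arange(1, T + 1)
        omegas = 2 * np.pi * np.arange(1, m + 1) / T
        I = np.array([((y * np.cos(w * t)).sum() ** 2 +
                       (y * np.sin(w * t)).sum() ** 2) / (np.pi * T)
                      for w in omegas])
        return omegas, I

    rng = np.random.default_rng(2)
    y = rng.normal(size=201)                         # T = 2m + 1
    w, I = periodogram(y)
    print(np.var(y), 2 * np.pi / len(y) * I.sum())   # the two numbers agree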

    4.2 Distribution of spectral estimates

If the process is stationary and the spectral density exists then $I(\omega)$ is an almost unbiased estimator of $f(\omega)$, but it is a rather poor estimator without some smoothing.

Suppose $\{y_t\}$ is Gaussian white noise, i.e., $y_1, \ldots, y_T$ are i.i.d. $N(0, \sigma^2)$. Then for any Fourier frequency $\omega = 2\pi j/T$,

$I(\omega) = \frac{1}{\pi T}\left[A(\omega)^2 + B(\omega)^2\right] , \qquad (4.1)$

where

$A(\omega) = \sum_{t=1}^{T}y_t\cos(\omega t) , \quad B(\omega) = \sum_{t=1}^{T}y_t\sin(\omega t) . \qquad (4.2)$

Clearly $A(\omega)$ and $B(\omega)$ have zero means, and

$\mathrm{var}[A(\omega)] = \sigma^2\sum_{t=1}^{T}\cos^2(\omega t) = T\sigma^2/2 , \quad \mathrm{var}[B(\omega)] = \sigma^2\sum_{t=1}^{T}\sin^2(\omega t) = T\sigma^2/2 ,$

$\mathrm{cov}[A(\omega), B(\omega)] = E\left[\sum_{t=1}^{T}\sum_{s=1}^{T}y_t y_s\cos(\omega t)\sin(\omega s)\right] = \sigma^2\sum_{t=1}^{T}\cos(\omega t)\sin(\omega t) = 0 .$

Hence $A(\omega)\sqrt{2/(T\sigma^2)}$ and $B(\omega)\sqrt{2/(T\sigma^2)}$ are independently distributed as $N(0, 1)$, and $2\left[A(\omega)^2 + B(\omega)^2\right]/(T\sigma^2)$ is distributed as $\chi^2_2$. This gives $I(\omega) \sim (\sigma^2/\pi)\chi^2_2/2$. Thus we see that $I(\omega)$ is an unbiased estimator of the spectrum, $f(\omega) = \sigma^2/\pi$, but it is not consistent, since $\mathrm{var}[I(\omega)] = \sigma^4/\pi^2$ does not tend to 0 as $T \to \infty$. This is perhaps surprising, but is explained by the fact that as $T$ increases we are attempting to estimate $I(\omega)$ for an increasing number of Fourier frequencies, with the consequence that the precision of each estimate does not change.

By a similar argument, we can show that for any two Fourier frequencies, $\omega_j$ and $\omega_k$, the estimates $I(\omega_j)$ and $I(\omega_k)$ are statistically independent. These conclusions hold more generally.

Theorem 4.1 Let $\{Y_t\}$ be a stationary Gaussian process with spectrum $f(\omega)$. Let $I(\omega)$ be the periodogram based on samples $Y_1, \ldots, Y_T$, and let $\omega_j = 2\pi j/T$, $j < T/2$, be a Fourier frequency. Then in the limit as $T \to \infty$,

(a) $I(\omega_j) \sim f(\omega_j)\chi^2_2/2$.
(b) $I(\omega_j)$ and $I(\omega_k)$ are independent for $j \ne k$.

Assuming that the underlying spectrum is smooth, $f(\omega)$ is nearly constant over a small range of $\omega$. This motivates use of an estimator for the spectrum of

$\hat f(\omega_j) = \frac{1}{2p+1}\sum_{\ell=-p}^{p}I(\omega_{j+\ell}) .$

Then $\hat f(\omega_j) \sim f(\omega_j)\chi^2_{2(2p+1)}/[2(2p+1)]$, which has variance $f(\omega_j)^2/(2p+1)$. The idea is to let $p \to \infty$ as $T \to \infty$.

    4.3 The fast Fourier transform

$I(\omega_j)$ can be calculated from (4.1)-(4.2), or from

$I(\omega_j) = \frac{1}{\pi T}\left|\sum_{t=1}^{T}y_t e^{-i\omega_j t}\right|^2 .$

Either way, this requires of order $T$ multiplications. Hence to calculate the complete periodogram, i.e., $I(\omega_1), \ldots, I(\omega_m)$, requires of order $T^2$ multiplications. Computation effort can be reduced significantly by use of the fast Fourier transform, which computes $I(\omega_1), \ldots, I(\omega_m)$ using only order $T\log_2 T$ multiplications.


    5 Linear filters

    5.1 The Filter Theorem

A linear filter of one random sequence $\{X_t\}$ into another sequence $\{Y_t\}$ is

$Y_t = \sum_{s=-\infty}^{\infty}a_s X_{t-s} . \qquad (5.1)$

Theorem 5.1 (the filter theorem) Suppose $X_t$ is a stationary time series with spectral density $f_X(\omega)$. Let $\{a_t\}$ be a sequence of real numbers such that $\sum_{t=-\infty}^{\infty}|a_t| < \infty$. Then the process $Y_t = \sum_{s=-\infty}^{\infty}a_s X_{t-s}$ is a stationary time series with spectral density function

$f_Y(\omega) = \left|A(e^{-i\omega})\right|^2 f_X(\omega) = |a(\omega)|^2 f_X(\omega) ,$

where $A(z)$ is the filter generating function

$A(z) = \sum_{s=-\infty}^{\infty}a_s z^s, \quad |z| \le 1 ,$

and $a(\omega) = A(e^{-i\omega})$ is the transfer function of the linear filter.

Proof.

$\mathrm{cov}(Y_t, Y_{t+k}) = \sum_{r\in\mathbb{Z}}\sum_{s\in\mathbb{Z}}a_r a_s\,\mathrm{cov}(X_{t-r}, X_{t+k-s})$
$\quad = \sum_{r,s\in\mathbb{Z}}a_r a_s\,\gamma_{k+r-s}$
$\quad = \sum_{r,s\in\mathbb{Z}}a_r a_s\,\frac{1}{2}\int_{-\pi}^{\pi}e^{i\omega(k+r-s)}f_X(\omega)\,d\omega$
$\quad = \int_{-\pi}^{\pi}A(e^{i\omega})A(e^{-i\omega})\,\frac{1}{2}e^{i\omega k}f_X(\omega)\,d\omega$
$\quad = \int_{-\pi}^{\pi}\frac{1}{2}e^{i\omega k}\left|A(e^{-i\omega})\right|^2 f_X(\omega)\,d\omega$
$\quad = \int_{-\pi}^{\pi}\frac{1}{2}e^{i\omega k}f_Y(\omega)\,d\omega .$

Thus $f_Y(\omega)$ is the spectral density for $Y$ and $Y$ is stationary.

    5.2 Application to autoregressive processes

Let us use the notation $B$ for the backshift operator:

$B^0 = I , \quad (B^0X)_t = X_t, \quad (BX)_t = X_{t-1}, \quad (B^2X)_t = X_{t-2}, \ldots$


Then the AR(p) process can be written as $\left(I - \sum_{r=1}^{p}\phi_r B^r\right)X = \epsilon$, or $\phi(B)X = \epsilon$, where $\phi$ is the function

$\phi(z) = 1 - \sum_{r=1}^{p}\phi_r z^r .$

By the filter theorem, $f_\epsilon(\omega) = |\phi(e^{-i\omega})|^2 f_X(\omega)$, so since $f_\epsilon(\omega) = \sigma^2/\pi$,

$f_X(\omega) = \frac{\sigma^2}{\pi|\phi(e^{-i\omega})|^2} . \qquad (5.2)$

As $f_X(\omega) = (1/\pi)\sum_{k=-\infty}^{\infty}\gamma_k e^{-i\omega k}$, we can calculate the autocovariances by expanding $f_X(\omega)$ as a power series in $e^{-i\omega}$. For this to work, the zeros of $\phi(z)$ must lie outside the unit circle in $\mathbb{C}$. This is the stationarity condition for the AR(p) process.

Example 5.2

For the AR(1) process, $X_t - \phi_1 X_{t-1} = \epsilon_t$, we have $\phi(z) = 1 - \phi_1 z$, with its zero at $z = 1/\phi_1$. The stationarity condition is $|\phi_1| < 1$.

5.3 Application to moving average processes


Example 5.3

For the MA(1), $X_t = \epsilon_t + \theta_1\epsilon_{t-1}$, $\theta(z) = 1 + \theta_1 z$ and

$f_X(\omega) = \frac{\sigma^2}{\pi}\left(1 + 2\theta_1\cos\omega + \theta_1^2\right) .$

As above, we can obtain the autocovariance function by expressing $f_X(\omega)$ as a power series in $e^{-i\omega}$. We have

$f_X(\omega) = \frac{\sigma^2}{\pi}\left[\theta_1 e^{-i\omega} + (1 + \theta_1^2) + \theta_1 e^{i\omega}\right] .$

So $\gamma_0 = \sigma^2(1 + \theta_1^2)$, $\gamma_1 = \theta_1\sigma^2$, $\gamma_k = 0$, $|k| > 1$.

As we remarked in Section 1.5, the autocovariance function of a MA(1) process with parameters $(\sigma^2, \theta_1)$ is identical to one with parameters $(\theta_1^2\sigma^2, \theta_1^{-1})$. That is,

$\tilde\gamma_0 = \theta_1^2\sigma^2(1 + 1/\theta_1^2) = \sigma^2(1 + \theta_1^2) = \gamma_0 ,$
$\tilde\rho_1 = \theta_1^{-1}/(1 + \theta_1^{-2}) = \theta_1/(1 + \theta_1^2) = \rho_1 .$

In general, the MA(q) process can be written as $X = \theta(B)\epsilon$, where

$\theta(z) = \sum_{k=0}^{q}\theta_k z^k = \prod_{i=1}^{q}(\omega_i - z) . \qquad (5.3)$

So the autocovariance generating function is

$g(z) = \sum_{k=-q}^{q}\gamma_k z^k = \sigma^2\theta(z)\theta(z^{-1}) = \sigma^2\prod_{i=1}^{q}(\omega_i - z)(\omega_i - z^{-1}) .$

Note that $(\omega_i - z)(\omega_i - z^{-1}) = \omega_i^2(\omega_i^{-1} - z)(\omega_i^{-1} - z^{-1})$. So $g(z)$ is unchanged when in (5.3) we replace $(\omega_i - z)$ by $\omega_i(\omega_i^{-1} - z)$. Thus (in the case that all roots of $\theta(z) = 0$ are real) there can be as many as $2^q$ different MA(q) processes with the same autocovariance function. For identifiability, we assume that all the roots of $\theta(z)$ lie outside the unit circle in $\mathbb{C}$. This is equivalent to the invertibility condition, that $\epsilon_t$ can be written as a convergent power series in $\{X_t, X_{t-1}, \ldots\}$.

    5.4 The general linear process

A special case of (5.1) is the general linear process,

$Y_t = \sum_{s=0}^{\infty}a_s X_{t-s} ,$

where $\{X_t\}$ is white noise. This has

$\mathrm{cov}(Y_t, Y_{t+k}) = \sigma^2\sum_{s=0}^{\infty}a_s a_{s+k} \le \sigma^2\sum_{s=0}^{\infty}a_s^2 ,$


where the inequality is an equality when $k = 0$. Thus $\{Y_t\}$ is stationary if and only if $\sum_{s=0}^{\infty}a_s^2 < \infty$. In practice the general linear model is useful when the $a_s$ are expressible in terms of a finite number of parameters which can be estimated. A rich class of such models are the ARMA models.

    5.5 Filters and ARMA processes

The ARMA(p, q) model can be written as $\phi(B)X = \theta(B)\epsilon$. Thus

$|\phi(e^{-i\omega})|^2 f_X(\omega) = |\theta(e^{-i\omega})|^2\frac{\sigma^2}{\pi} \implies f_X(\omega) = \left|\frac{\theta(e^{-i\omega})}{\phi(e^{-i\omega})}\right|^2\frac{\sigma^2}{\pi} .$

This is subject to the conditions that

- the zeros of $\phi$ lie outside the unit circle in $\mathbb{C}$, for stationarity;
- the zeros of $\theta$ lie outside the unit circle in $\mathbb{C}$, for identifiability;
- $\phi(z)$ and $\theta(z)$ have no common roots.

If there were a common root, say $1/\alpha$, so that $(I - \alpha B)\phi_1(B)X = (I - \alpha B)\theta_1(B)\epsilon$, then we could multiply both sides by $\sum_{n=0}^{\infty}\alpha^n B^n$ and deduce $\phi_1(B)X = \theta_1(B)\epsilon$, and thus that a more economical ARMA(p-1, q-1) model suffices.

    5.6 Calculating autocovariances in ARMA models

As above, the filter theorem can assist in calculating the autocovariances of a model. These can be compared with autocovariances estimated from the data. For example, an ARMA(1, 2) has

$\phi(z) = 1 - \phi z , \quad \theta(z) = 1 + \theta_1 z + \theta_2 z^2 , \quad\text{where } |\phi| < 1 .$


    6 Estimation of trend and seasonality

    6.1 Moving averages

Consider a decomposition into trend, seasonal, cyclic and residual components,

$X_t = T_t + I_t + C_t + E_t .$

Thus far we have been concerned with modelling $\{E_t\}$. We have also seen that the periodogram can be useful for recognising the presence of $\{C_t\}$.

We can estimate trend using a symmetric moving average,

$\hat T_t = \sum_{s=-k}^{k}a_s X_{t+s} ,$

where $a_s = a_{-s}$. In this case the transfer function is real-valued.

The choice of moving averages requires care. For example, we might try to estimate the trend with

$\hat T_t = \tfrac{1}{3}(X_{t-1} + X_t + X_{t+1}) .$

But suppose $X_t = T_t + \epsilon_t$, where the trend is the quadratic $T_t = a + bt + ct^2$. Then

$\hat T_t = T_t + \tfrac{2}{3}c + \tfrac{1}{3}(\epsilon_{t-1} + \epsilon_t + \epsilon_{t+1}) ,$

so $E\hat T_t = EX_t + \tfrac{2}{3}c$ and thus $\hat T_t$ is a biased estimator of the trend.

This problem is avoided if we estimate trend by fitting a polynomial of sufficient degree, e.g., to find a cubic that best fits seven successive points we minimize

$\sum_{t=-3}^{3}\left(X_t - b_0 - b_1 t - b_2 t^2 - b_3 t^3\right)^2 .$

So

$\sum X_t = 7b_0 + 28b_2 , \quad \sum tX_t = 28b_1 + 196b_3 ,$
$\sum t^2X_t = 28b_0 + 196b_2 , \quad \sum t^3X_t = 196b_1 + 1588b_3 .$

Then

$\hat b_0 = \tfrac{1}{21}\left(7\sum X_t - \sum t^2X_t\right) = \tfrac{1}{21}\left(-2X_{-3} + 3X_{-2} + 6X_{-1} + 7X_0 + 6X_1 + 3X_2 - 2X_3\right) .$

We estimate the trend at time 0 by $\hat T_0 = \hat b_0$, and similarly,

$\hat T_t = \tfrac{1}{21}\left(-2X_{t-3} + 3X_{t-2} + 6X_{t-1} + 7X_t + 6X_{t+1} + 3X_{t+2} - 2X_{t+3}\right) .$


A notation for this moving average is $\tfrac{1}{21}[-2, 3, 6, 7, 6, 3, -2]$. Note that the weights sum to 1. In general, we can fit a polynomial of degree $q$ to $2m+1$ points by applying a symmetric moving average. (We fit to an odd number of points so that the midpoint of the fitted range coincides with a point in time at which data is measured.)
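As an illustration (not from the notes), the filter can be applied with a discrete convolution; the three points at each end are left undefined because the seven-point window does not fit there.

    import numpy as np

    def cubic_trend(x):
        """Trend estimate by the symmetric moving average (1/21)[-2,3,6,7,6,3,-2]."""
        w = np.array([-2, 3, 6, 7, 6, 3, -2]) / 21.0      # symmetric, sums to 1
        x = np.asarray(x, dtype=float)
        trend = np.full(len(x), np.nan)
        trend[3:-3] = np.convolve(x, w, mode="valid")     # w symmetric, so convolution = moving average
        return trend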

A value for $q$ can be identified using the variate difference method: if $\{X_t\}$ is indeed a polynomial of degree $q$, plus residual error $\{\epsilon_t\}$, then the trend in $\nabla^r X_t$ is a polynomial of degree $q - r$ and

$\nabla^q X_t = \text{constant} + \nabla^q\epsilon_t = \text{constant} + \epsilon_t - \binom{q}{1}\epsilon_{t-1} + \binom{q}{2}\epsilon_{t-2} - \cdots + (-1)^q\epsilon_{t-q} .$

The variance of $\nabla^q X_t$ is therefore

$\mathrm{var}(\nabla^q\epsilon_t) = \left[1 + \binom{q}{1}^2 + \binom{q}{2}^2 + \cdots + 1^2\right]\sigma^2 = \binom{2q}{q}\sigma^2 ,$

where the simplification in the final line comes from looking at the coefficient of $z^q$ in expansions of both sides of

$(1+z)^q(1+z)^q = (1+z)^{2q} .$

Define $V_r = \mathrm{var}(\nabla^r X_t)\big/\binom{2r}{r}$. The fact that the plot of $V_r$ against $r$ should flatten out at $r \ge q$ can be used to identify $q$.

    6.2 Centred moving averages

If there is a seasonal component then a centred moving average is useful. Suppose data is measured quarterly; then applying twice the moving average $\tfrac{1}{4}[1, 1, 1, 1]$ is equivalent to applying once the moving average $\tfrac{1}{8}[1, 2, 2, 2, 1]$. Notice that this so-called centred average of fours weights each quarter equally. Thus if $X_t = I_t + \epsilon_t$, where $I_t$ has period 4 and $I_1 + I_2 + I_3 + I_4 = 0$, then $X_t - \hat T_t$ has no seasonal component. Similarly, if data were monthly we use a centred average of 12s, that is, $\tfrac{1}{24}[1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1]$.
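The claim that the centred average of fours removes a quarterly seasonal component summing to zero can be checked numerically; a small sketch (not in the notes), with an arbitrary seasonal pattern:

    import numpy as np

    rng = np.random.default_rng(3)
    I_season = np.tile([1.0, -0.5, 0.25, -0.75], 25)   # I1+I2+I3+I4 = 0, period 4
    x = I_season + 0.1 * rng.normal(size=len(I_season))

    w = np.array([1, 2, 2, 2, 1]) / 8.0                # centred average of fours
    smoothed = np.convolve(x, w, mode="valid")
    print(np.abs(smoothed).max())   # small relative to the seasonal amplitude: seasonality removed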

    6.3 The Slutzky-Yule effect

To remove both trend and seasonal components we might successively apply a number of moving averages, one or more to remove trend and another to remove seasonal effects. This is the procedure followed by some standard forecasting packages.

However, there is a danger that application of successive moving averages can introduce spurious effects. The Slutzky-Yule effect is concerned with the fact that a moving average repeatedly applied to a purely random series can introduce artificial cycles. Slutzky (1927) showed that some trade cycles of the nineteenth century were no more than artifacts of moving averages that had been used to smooth the data.


To illustrate this idea, suppose the moving average $\tfrac{1}{6}[-1, 2, 4, 2, -1]$ is applied $k$ times to a white noise series. This moving average has transfer function $a(\omega) = \tfrac{1}{6}(4 + 4\cos\omega - 2\cos 2\omega)$, which is maximal at $\omega = \pi/3$. The smoothed series has a spectral density, say $f_k(\omega)$, proportional to $a(\omega)^{2k}$, and hence for $\omega \ne \pi/3$, $f_k(\omega)/f_k(\pi/3) \to 0$ as $k \to \infty$. Thus in the limit the smoothed series is a periodic wave with period 6.

6.4 Exponential smoothing

Single exponential smoothing

Suppose the mean level of a series drifts slowly over time. A naive one-step-ahead forecast is $\hat X_t(1) = X_t$. However, we might let all past observations play a part in the forecast, but give greater weights to those that are more recent. Choose weights to decrease exponentially and let

$\hat X_t(1) = \frac{1-\lambda}{1-\lambda^t}\left[X_t + \lambda X_{t-1} + \lambda^2 X_{t-2} + \cdots + \lambda^{t-1}X_1\right],$

where $0 < \lambda < 1$.


For a series with a linear trend, $X_t = b_0 + b_1 t + \epsilon_t$, the smoothed series $S_t = (1-\lambda)\left[X_t + \lambda X_{t-1} + \lambda^2 X_{t-2} + \cdots\right]$ has

$E S_t = b_0 + b_1 t - \lambda b_1/(1-\lambda) = E X_{t+1} - b_1/(1-\lambda) .$

Thus the forecast has a bias of $-b_1/(1-\lambda)$. To eliminate this bias let $S^1_t = S_t$ be the first smoothing, and $S^2_t = (1-\lambda)S^1_t + \lambda S^2_{t-1}$ be the simple exponential smoothing of $S^1_t$. Then

$E S^2_t = E S^1_t - \lambda b_1/(1-\lambda) = E X_t - 2\lambda b_1/(1-\lambda) ,$
$E(2S^1_t - S^2_t) = b_0 + b_1 t , \quad E(S^1_t - S^2_t) = \lambda b_1/(1-\lambda) .$

This suggests the estimates $\hat b_0 + \hat b_1 t = 2S^1_t - S^2_t$ and $\hat b_1 = (1-\lambda)(S^1_t - S^2_t)/\lambda$. The forecasting equation is then

$\hat X_t(s) = \hat b_0 + \hat b_1(t+s) = (2S^1_t - S^2_t) + s(1-\lambda)(S^1_t - S^2_t)/\lambda .$

As with single exponential smoothing we can experiment with choices of $\lambda$ and find $S^1_0$ and $S^2_0$ by fitting a regression line, $X_t = \beta_0 + \beta_1 t$, to the first few points of the series and solving

$S^1_0 = \hat\beta_0 - \lambda\hat\beta_1/(1-\lambda) , \quad S^2_0 = \hat\beta_0 - 2\lambda\hat\beta_1/(1-\lambda) .$
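A minimal sketch of single exponential smoothing as a forecasting recursion (not from the notes; for simplicity it starts from $S_1 = X_1$ rather than the regression-based start described above):

    import numpy as np

    def single_exponential_smoothing(x, lam):
        """S_t = (1 - lam)*X_t + lam*S_{t-1}; S_t is used as the one-step-ahead
        forecast of X_{t+1} (the large-t approximation of the weighted sum above)."""
        x = np.asarray(x, dtype=float)
        s = np.empty(len(x))
        s[0] = x[0]
        for t in range(1, len(x)):
            s[t] = (1.0 - lam) * x[t] + lam * s[t - 1]
        return s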

    6.5 Calculation of seasonal indices

Suppose data is quarterly and we want to fit an additive model. Let $\bar I_1$ be the average of $X_1, X_5, X_9, \ldots$, let $\bar I_2$ be the average of $X_2, X_6, X_{10}, \ldots$, and so on for $\bar I_3$ and $\bar I_4$. The cumulative seasonal effects over the course of a year should cancel, so that if $X_t = a + I_t$, then $X_t + X_{t+1} + X_{t+2} + X_{t+3} = 4a$. To ensure this we take our final estimates of the seasonal indices as $\hat I_t = \bar I_t - \tfrac{1}{4}(\bar I_1 + \cdots + \bar I_4)$.

If the model is multiplicative and $X_t = aI_t$, we again wish to see the cumulative effects over a year cancel, so that $X_t + X_{t+1} + X_{t+2} + X_{t+3} = 4a$. This means that we should take $\hat I_t = \bar I_t - \tfrac{1}{4}(\bar I_1 + \cdots + \bar I_4) + 1$, adjusting so the mean of $\hat I_1, \hat I_2, \hat I_3, \hat I_4$ is 1.

When both trend and seasonality are to be extracted a two-stage procedure is recommended:

(a) Make a first estimate of the trend, say $\hat T^1_t$. Subtract this from $\{X_t\}$ and calculate first estimates of the seasonal indices, say $\hat I^1_t$, from $X_t - \hat T^1_t$. The first estimate of the deseasonalised series is $Y^1_t = X_t - \hat I^1_t$.

(b) Make a second estimate of the trend by smoothing $Y^1_t$, say $\hat T^2_t$. Subtract this from $\{X_t\}$ and calculate second estimates of the seasonal indices, say $\hat I^2_t$, from $X_t - \hat T^2_t$. The second estimate of the deseasonalised series is $Y^2_t = X_t - \hat I^2_t$.


    7 Fitting ARIMA models

    7.1 The Box-Jenkins procedure

A general ARIMA(p, d, q) model is $\phi(B)\nabla^d X = \theta(B)\epsilon$, where $\nabla = I - B$.

The Box-Jenkins procedure is concerned with fitting an ARIMA model to data. It has three parts: identification, estimation, and verification.

    7.2 Identification

The data may require pre-processing to make it stationary. To achieve stationarity we may do any of the following.

- Look at it.
- Re-scale it (for instance, by a logarithmic or exponential transform).
- Remove deterministic components.
- Difference it. That is, take $\nabla^d X$ until stationary. In practice $d = 1, 2$ should suffice.

We recognise stationarity by the observation that the autocorrelations decay to zero exponentially fast.

Once the series is stationary, we can try to fit an ARMA(p, q) model. We consider the correlogram $r_k = \hat\gamma_k/\hat\gamma_0$ and the partial autocorrelations $\hat\phi_{k,k}$. We have already made the following observations.

- An MA(q) process has negligible ACF after the $q$th term.
- An AR(p) process has negligible PACF after the $p$th term.

As we have noted, very approximately, both the sample ACF and PACF have standard deviation of around $1/\sqrt{T}$, where $T$ is the length of the series. A rule of thumb is that ACF and PACF values are negligible when they lie between $\pm 2/\sqrt{T}$. An ARMA(p, q) process has $k$th order sample ACF and PACF decaying geometrically for $k > \max(p, q)$.

    7.3 Estimation

AR processes

To fit a pure AR(p), i.e., $X_t = \sum_{r=1}^{p}\phi_r X_{t-r} + \epsilon_t$, we can use the Yule-Walker equations $\gamma_k = \sum_{r=1}^{p}\phi_r\gamma_{|k-r|}$. We fit by solving $\hat\gamma_k = \sum_{r=1}^{p}\hat\phi_r\hat\gamma_{|k-r|}$, $k = 1, \ldots, p$. These can be solved by a Levinson-Durbin recursion (similar to that used to solve for the partial autocorrelations in Section 2.6). This recursion also gives the estimated


residual variance $\hat\sigma_p^2$, and helps in the choice of $p$ through the approximate log likelihood $-2\log L \approx T\log(\hat\sigma_p^2)$.

Another popular way to choose $p$ is by minimizing Akaike's AIC (an information criterion), defined as $\mathrm{AIC} = -2\log L + 2k$, where $k$ is the number of parameters estimated (in the above case $p$). As motivation, suppose that in a general modelling context we attempt to fit a model with parameterised likelihood function $f(X \mid \theta)$, $\theta \in \Theta$, and this includes the true model for some $\theta_0 \in \Theta$. Let $X = (X_1, \ldots, X_n)$ be a vector of $n$ independent samples and let $\hat\theta(X)$ be the maximum likelihood estimator of $\theta$. Suppose $Y$ is a further independent sample. Then

$-2n\,E_Y E_X\log f\left(Y \mid \hat\theta(X)\right) = -2\,E_X\log f\left(X \mid \hat\theta(X)\right) + 2k + O\left(1/\sqrt{n}\right),$

where $k = |\Theta|$. The left hand side is $2n$ times the conditional entropy of $Y$ given $\hat\theta(X)$, i.e., the average number of bits required to specify $Y$ given $\hat\theta(X)$. The right hand side is approximately the AIC and this is to be minimized over a set of models, say $(f_1, \Theta_1), \ldots, (f_m, \Theta_m)$.

ARMA processes

Generally, we use the maximum likelihood estimators, or least squares numerical approximations to the MLEs. The essential idea is prediction error decomposition. We can factorise the joint density of $(X_1, \ldots, X_T)$ as

$f(X_1, \ldots, X_T) = f(X_1)\prod_{t=2}^{T}f(X_t \mid X_1, \ldots, X_{t-1}) .$

Suppose the conditional distribution of $X_t$ given $(X_1, \ldots, X_{t-1})$ is normal with mean $\hat X_t$ and variance $P_{t-1}$, and suppose also that $X_1 \sim N(\hat X_1, P_0)$. Here $\hat X_t$ and $P_{t-1}$ are functions of the unknown parameters $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ and the data.

The log likelihood is

$-2\log L = -2\log f = \sum_{t=1}^{T}\left[\log(2\pi) + \log P_{t-1} + \frac{(X_t - \hat X_t)^2}{P_{t-1}}\right] .$

We can minimize this with respect to $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ to fit ARMA(p, q).

Additionally, the second derivative matrix of $-\log L$ (at the MLE) is the observed information matrix, whose inverse is an approximation to the variance-covariance matrix of the estimators.

In practice, when fitting ARMA(p, q) the log likelihood ($-2\log L$) is modified to sum only over the range $\{m+1, \ldots, T\}$, where $m$ is small.

Example 7.1

For AR(p), take $m = p$ so that $\hat X_t = \sum_{r=1}^{p}\phi_r X_{t-r}$, $t \ge m+1$, and $P_{t-1} = \sigma^2$.


Note. When using this approximation to compare models with different numbers of parameters we should always use the same $m$.

Again we might choose $p$ and $q$ by minimizing the AIC of $-2\log L + 2k$, where $k = p + q$ is the total number of parameters in the model.

    7.4 Verification

The third stage in the Box-Jenkins algorithm is to check whether the model fits the data. There are several tools we may use.

Overfitting. Add extra parameters to the model and use a likelihood ratio test or t-test to check that they are not significant.

Residuals analysis. Calculate the residuals from the model and plot them, their ACFs, PACFs, spectral density estimates, etc., and confirm that they are consistent with white noise.

    7.5 Tests for white noise

Tests for white noise include the following.

(a) The turning point test (explained in Lecture 1) compares the number of peaks and troughs to the number that would be expected for a white noise series.

(b) The Box-Pierce test is based on the statistic

$Q_m = T\sum_{k=1}^{m}r_k^2 ,$

where $r_k$ is the $k$th sample autocorrelation coefficient of the residual series, and $p + q < m \ll T$. It is called a portmanteau test, because it is based on the all-inclusive statistic. If the model is correct then $Q_m \sim \chi^2_{m-p-q}$ approximately. In fact, $r_k$ has variance $(T-k)/(T(T+2))$, and a somewhat more powerful test uses the Ljung-Box statistic quoted in Section 2.7,

$Q_m = T(T+2)\sum_{k=1}^{m}(T-k)^{-1}r_k^2 ,$

where again, $Q_m \sim \chi^2_{m-p-q}$ approximately.

    where again,Qm 2mpqapproximately.(c) Another test for white noise can be constructed from the periodogram. Recall

    thatI(j) (2/)22/2 and that I(1), . . . , I (m) are mutually independent.DefineCj =

    jk=1 I(k) andUj =Cj/Cm. Recall that

    22is the same as the expo-

    nential distribution and that ifY1, . . . , Y mare i.i.d. exponential random variables,

    27

  • 8/12/2019 Cambridge Time Series Notes

    32/36

    then (Y1+ +Yj)/(Y1+ +Ym), j = 1, . . . , m 1, have the distribution ofan ordered sample ofm 1 uniform random variables drawn from [0, 1]. Henceunder the hypothesis that{Xt} is Gaussian white noise Uj, j = 1, . . . , m 1have the distribution of an ordered sample ofm 1 uniform random variableson [0, 1]. The standard test for this is the Kolomogorov-Smirnov test, which usesas a test statistic, D, defined as the maximum difference between the theoret-ical distribution function for U[0, 1], F(u) = u, and the empirical distributionF(u) = {#(Uj u)}/(m 1). Percentage points for D can be found in tables.

    7.6 Forecasting with ARMA models

Recall that $\phi(B)X = \theta(B)\epsilon$, so the power series coefficients of $C(z) = \theta(z)/\phi(z) = \sum_{r=0}^{\infty}c_r z^r$ give an expression for $X_t$ as $X_t = \sum_{r=0}^{\infty}c_r\epsilon_{t-r}$.

But also, $\epsilon = D(B)X$, where $D(z) = \phi(z)/\theta(z) = \sum_{r=0}^{\infty}d_r z^r$ as long as the zeros of $\theta$ lie strictly outside the unit circle, and thus $\epsilon_t = \sum_{r=0}^{\infty}d_r X_{t-r}$.

The advantage of the representation above is that given $(\ldots, X_{t-1}, X_t)$ we can calculate values for $(\ldots, \epsilon_{t-1}, \epsilon_t)$ and so can forecast $X_{t+1}$.

In general, if we want to forecast $X_{T+k}$ from $(\ldots, X_{T-1}, X_T)$ we use

$\hat X_{T,k} = \sum_{r=k}^{\infty}c_r\epsilon_{T+k-r} = \sum_{r=0}^{\infty}c_{k+r}\epsilon_{T-r} ,$

which has the least mean squared error over all linear combinations of $(\ldots, \epsilon_{T-1}, \epsilon_T)$. In fact,

$E\left[(\hat X_{T,k} - X_{T+k})^2\right] = \sigma^2\sum_{r=0}^{k-1}c_r^2 .$

In practice, there is an alternative recursive approach. Define

$\hat X_{T,k} = \begin{cases} X_{T+k}, & -(T-1) \le k \le 0 , \\ \text{optimal predictor of } X_{T+k} \text{ given } X_1, \ldots, X_T, & 1 \le k . \end{cases}$

We have the recursive relation

$\hat X_{T,k} = \sum_{r=1}^{p}\phi_r\hat X_{T,k-r} + \hat\epsilon_{T+k} + \sum_{s=1}^{q}\theta_s\hat\epsilon_{T+k-s} .$

For $k = -(T-1), -(T-2), \ldots, 0$ this gives estimates of $\epsilon_t$ for $t = 1, \ldots, T$. For $k > 0$, this gives a forecast $\hat X_{T,k}$ for $X_{T+k}$. We take $\hat\epsilon_t = 0$ for $t > T$.

But this needs to be started off. We need to know $(X_t,\ t \le 0)$ and $(\epsilon_t,\ t \le 0)$. There are two standard approaches, and a sketch of the first is given below.

1. Conditional approach: take $X_t = \epsilon_t = 0$, $t \le 0$.

2. Backcasting: we forecast the series in the reverse direction to determine estimators of $X_0, X_{-1}, \ldots$ and $\epsilon_0, \epsilon_{-1}, \ldots$.
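The following sketch (not part of the notes) implements the recursive forecasts under the conditional approach, for an ARMA(p, q) with given coefficient vectors phi and theta; the names are illustrative.

    import numpy as np

    def forecast_arma(x, phi, theta, k_max):
        """Forecast X_{T+1}, ..., X_{T+k_max} using the recursion above,
        with X_t = eps_t = 0 for t <= 0 and eps_t = 0 for t > T."""
        phi = np.asarray(phi, dtype=float)
        theta = np.asarray(theta, dtype=float)
        p, q = len(phi), len(theta)
        T = len(x)
        xs = np.concatenate([np.zeros(p), np.asarray(x, dtype=float), np.zeros(k_max)])
        eps = np.zeros(p + T + k_max)
        for t in range(p, p + T):                         # recover residuals eps_1, ..., eps_T
            ar = np.dot(phi, xs[t - 1::-1][:p]) if p else 0.0
            ma = np.dot(theta, eps[t - 1::-1][:q]) if q else 0.0
            eps[t] = xs[t] - ar - ma
        for t in range(p + T, p + T + k_max):             # forecasts, with future eps set to zero
            ar = np.dot(phi, xs[t - 1::-1][:p]) if p else 0.0
            ma = np.dot(theta, eps[t - 1::-1][:q]) if q else 0.0
            xs[t] = ar + ma
        return xs[p + T:]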


    8 State space models

    8.1 Models with unobserved states

State space models are an alternative formulation of time series with a number of advantages for forecasting.

1. All ARMA models can be written as state space models.

2. Nonstationary models (e.g., ARMA with time varying coefficients) are also state space models.

3. Multivariate time series can be handled more easily.

4. State space models are consistent with Bayesian methods.

In general, the model consists of

observed data: $X_t = F_t S_t + v_t$
unobserved state: $S_t = G_t S_{t-1} + w_t$
observation noise: $v_t \sim N(0, V_t)$
state noise: $w_t \sim N(0, W_t)$

where $v_t, w_t$ are independent and $F_t, G_t$ are known matrices, often time dependent (e.g., because of seasonality).

    Example 8.1

$X_t = S_t + v_t$, $S_t = S_{t-1} + w_t$. Define $Y_t = X_t - X_{t-1} = (S_t + v_t) - (S_{t-1} + v_{t-1}) = w_t + v_t - v_{t-1}$. The autocorrelations of $\{Y_t\}$ are zero at all lags greater than 1. So $\{Y_t\}$ is MA(1) and thus $\{X_t\}$ is ARMA(1, 1).

    Example 8.2

The general ARMA(p, q) model $X_t = \sum_{r=1}^{p}\phi_r X_{t-r} + \sum_{s=0}^{q}\theta_s\epsilon_{t-s}$ is a state space model. We write $X_t = F_t S_t$, where

$F_t = (\phi_1, \phi_2, \ldots, \phi_p, 1, \theta_1, \ldots, \theta_q), \quad S_t = (X_{t-1}, \ldots, X_{t-p}, \epsilon_t, \epsilon_{t-1}, \ldots, \epsilon_{t-q})^\top \in \mathbb{R}^{p+q+1},$


with $v_t = 0$, $V_t = 0$. The state evolves as

$S_t = G_t S_{t-1} + w_t ,$

where the first row of $G_t$ is $(\phi_1, \ldots, \phi_p, 1, \theta_1, \ldots, \theta_q)$; rows $2, \ldots, p$ simply shift the lagged values $X_{t-2}, \ldots, X_{t-p}$ down by one position; row $p+1$ is zero; rows $p+2, \ldots, p+q+1$ shift the lagged values $\epsilon_{t-1}, \ldots, \epsilon_{t-q}$ down by one position; and the state noise is $w_t = (0, \ldots, 0, \epsilon_t, 0, \ldots, 0)^\top$ with $\epsilon_t$ in position $p+1$.

    8.2 The Kalman filter

Given observed data $X_1, \ldots, X_t$ we want to find the conditional distribution of $S_t$ and a forecast of $X_{t+1}$.

Recall the following multivariate normal fact: if

$Y = \begin{pmatrix}Y_1 \\ Y_2\end{pmatrix} \sim N\left(\begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \begin{pmatrix}A_{11} & A_{12} \\ A_{21} & A_{22}\end{pmatrix}\right) \qquad (8.1)$

then

$(Y_1 \mid Y_2) \sim N\left(\mu_1 + A_{12}A_{22}^{-1}(Y_2 - \mu_2),\ A_{11} - A_{12}A_{22}^{-1}A_{21}\right) . \qquad (8.2)$

Conversely, if $(Y_1 \mid Y_2)$ satisfies (8.2), and $Y_2 \sim N(\mu_2, A_{22})$, then the joint distribution is as in (8.1).

Now let $\mathcal{F}_{t-1} = (X_1, \ldots, X_{t-1})$ and suppose we know that $(S_{t-1} \mid \mathcal{F}_{t-1}) \sim N(\hat S_{t-1}, P_{t-1})$. Then

$S_t = G_t S_{t-1} + w_t ,$

so

$(S_t \mid \mathcal{F}_{t-1}) \sim N\left(G_t\hat S_{t-1},\ G_t P_{t-1}G_t^\top + W_t\right) ,$

and also $(X_t \mid S_t, \mathcal{F}_{t-1}) \sim N(F_t S_t, V_t)$.

Put $Y_1 = X_t$ and $Y_2 = S_t$. Let $R_t = G_t P_{t-1}G_t^\top + W_t$. Taking all variables conditional on $\mathcal{F}_{t-1}$ we can use the converse of the multivariate normal fact and identify

$\mu_2 = G_t\hat S_{t-1} \quad\text{and}\quad A_{22} = R_t .$


Since $S_t$ is a random variable,

$\mu_1 + A_{12}A_{22}^{-1}(S_t - \mu_2) = F_t S_t \implies A_{12} = F_t R_t \quad\text{and}\quad \mu_1 = F_t\mu_2 .$

Also

$A_{11} - A_{12}A_{22}^{-1}A_{21} = V_t \implies A_{11} = V_t + F_t R_t R_t^{-1}R_t F_t^\top = V_t + F_t R_t F_t^\top .$

What this says is that

$\left(\begin{pmatrix}X_t \\ S_t\end{pmatrix}\,\Big|\,\mathcal{F}_{t-1}\right) \sim N\left(\begin{pmatrix}F_t G_t\hat S_{t-1} \\ G_t\hat S_{t-1}\end{pmatrix}, \begin{pmatrix}V_t + F_t R_t F_t^\top & F_t R_t \\ R_t F_t^\top & R_t\end{pmatrix}\right) .$

Now apply the multivariate normal fact directly to get $(S_t \mid X_t, \mathcal{F}_{t-1}) = (S_t \mid \mathcal{F}_t) \sim N(\hat S_t, P_t)$, where

$\hat S_t = G_t\hat S_{t-1} + R_t F_t^\top\left(V_t + F_t R_t F_t^\top\right)^{-1}\left(X_t - F_t G_t\hat S_{t-1}\right)$
$P_t = R_t - R_t F_t^\top\left(V_t + F_t R_t F_t^\top\right)^{-1}F_t R_t$

These are the Kalman filter updating equations.

Note the form of the right hand side of the expression for $\hat S_t$. It contains the term $G_t\hat S_{t-1}$, which is simply what we would predict if it were known that $S_{t-1} = \hat S_{t-1}$, plus a term that depends on the observed error in forecasting $X_t$, i.e., $X_t - F_t G_t\hat S_{t-1}$. This is similar to the forecast updating expression for simple exponential smoothing in Section 6.4.

All we need to start updating the estimates are the initial values $\hat S_0$ and $P_0$. Three ways are commonly used.

1. Use a Bayesian prior distribution.

2. If $F, G, V, W$ are independent of $t$ the process is stationary. We could use the stationary distribution of $S$ to start.

3. Choosing $\hat S_0 = 0$, $P_0 = kI$ ($k$ large) reflects prior ignorance.
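A compact sketch of the updating equations (not from the notes), assuming time-invariant matrices $F, G, V, W$ supplied as numpy arrays; the general time-varying case just indexes these by $t$.

    import numpy as np

    def kalman_filter(X, F, G, V, W, S0, P0):
        """Run the Kalman filter updating equations of Section 8.2.

        X is an iterable of observation vectors; S0, P0 are the initial
        state estimate and its covariance."""
        S_hat, P = S0, P0
        filtered = []
        for x in X:
            R = G @ P @ G.T + W                          # var(S_t | F_{t-1})
            pred = F @ G @ S_hat                         # forecast of X_t
            K = R @ F.T @ np.linalg.inv(V + F @ R @ F.T)
            S_hat = G @ S_hat + K @ (x - pred)           # updating equation for S_t
            P = R - K @ F @ R                            # updating equation for P_t
            filtered.append(S_hat)
        return filtered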

    8.3 Prediction

Suppose we want to predict $X_{T+k}$ given $(X_1, \ldots, X_T)$. We already have

$(X_{T+1} \mid X_1, \ldots, X_T) \sim N\left(F_{T+1}G_{T+1}\hat S_T,\ V_{T+1} + F_{T+1}R_{T+1}F_{T+1}^\top\right)$

which solves the problem for the case $k = 1$. By induction we can show that

$(S_{T+k} \mid X_1, \ldots, X_T) \sim N\left(\hat S_{T,k}, P_{T,k}\right)$


where

$\hat S_{T,0} = \hat S_T , \quad P_{T,0} = P_T ,$
$\hat S_{T,k} = G_{T+k}\hat S_{T,k-1} , \quad P_{T,k} = G_{T+k}P_{T,k-1}G_{T+k}^\top + W_{T+k} ,$

and hence that

$(X_{T+k} \mid X_1, \ldots, X_T) \sim N\left(F_{T+k}\hat S_{T,k},\ V_{T+k} + F_{T+k}P_{T,k}F_{T+k}^\top\right) .$

    8.4 Parameter estimation revisited

In practice, of course, we may not know the matrices $F_t, G_t, V_t, W_t$. For example, in ARMA(p, q) they will depend on the parameters $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2$, which we may not know.

We saw when performing the prediction error decomposition that we needed to calculate the distribution of $(X_t \mid X_1, \ldots, X_{t-1})$. This we have now done.

Example 8.3

Consider the state space model

observed data: $X_t = S_t + v_t ,$
unobserved state: $S_t = S_{t-1} + w_t ,$

where $v_t, w_t$ are independent errors, $v_t \sim N(0, V)$ and $w_t \sim N(0, W)$.

Then we have $F_t = 1$, $G_t = 1$, $V_t = V$, $W_t = W$, $R_t = P_{t-1} + W$. So if $(S_{t-1} \mid X_1, \ldots, X_{t-1}) \sim N(\hat S_{t-1}, P_{t-1})$ then $(S_t \mid X_1, \ldots, X_t) \sim N(\hat S_t, P_t)$, where

$\hat S_t = \hat S_{t-1} + R_t(V + R_t)^{-1}(X_t - \hat S_{t-1}) ,$
$P_t = R_t - \frac{R_t^2}{V + R_t} = \frac{V R_t}{V + R_t} = \frac{V(P_{t-1} + W)}{V + P_{t-1} + W} .$

Asymptotically, $P_t \to P$, where $P$ is the positive root of $P^2 + WP - WV = 0$, and $\hat S_t$ behaves like $\hat S_t = (1-\lambda)\sum_{r=0}^{\infty}\lambda^r X_{t-r}$, where $\lambda = V/(V + W + P)$. Note that this is simple exponential smoothing.

Equally, we can predict $S_{T+k}$ given $(X_1, \ldots, X_T)$ as $N(\hat S_{T,k}, P_{T,k})$, where

$\hat S_{T,0} = \hat S_T , \quad P_{T,0} = P_T , \quad \hat S_{T,k} = \hat S_T , \quad P_{T,k} = P_T + kW .$