7/23/2019 ACTL2003 Summary Notes
http://slidepdf.com/reader/full/actl2003-summary-notes 1/62
1
Zhi Ying Feng
ACTL2003 Stochastic Modelling
By Zhi Ying Feng
Contents

Part 1: Stochastic Processes ................................................................................................................. 4
Stochastic Processes ......................................................................................................................... 4
Increment Properties .................................................................................................................... 4
Markov Processes ............................................................................................................................ 4
Transition Probability .................................................................................................................. 5
Chapman-Kolmogorov Equations................................................................................................ 5
Classification of States ................................................................................................................. 6
Irreducible Markov Chain ............................................................................................................ 6
Recurrent and Transient States .................................................................................................... 6
Class Properties ............................................................................................................................ 8
Limiting Probabilities .................................................................................................................. 8
Mean Time in Transient States .................................................................................................... 9
Branching Processes .................................................................................................................... 9
Probability Generating Functions .............................................................................................. 10
Time Reversible Markov Chains ............................................................................................... 11
Exponential, Poisson and Gamma Distributions........................................................................ 11
Counting Process............................................................................................................................ 12
Poisson Process .......................................................................................................................... 13
Inter-arrival and Waiting Time .................................................................................................. 13
Order Statistics ........................................................................................................................... 13
Conditional Distribution of Arrival Time .................................................................................. 14
Thinning of Poisson Process ...................................................................................................... 14
Non-homogenous Poisson Process ............................................................................................ 15
Compound Poisson Process ....................................................................................................... 15
Continuous-time Markov Chains ................................................................................................... 16
Time Spent in a State ................................................................................................................. 16
Transition Rates and Probabilities ............................................................................................. 17
Chapman-Kolmogorov Equations.............................................................................................. 18
Limiting Probabilities ................................................................................................................ 19
Embedded Markov Chain .......................................................................................................... 19
Time Reversibility...................................................................................................................... 20
Birth and Death Process ................................................................................................................. 21
Transition Rates and Embedded Markov Chain ........................................................................ 21
Examples of Birth and Death Processes .................................................................................... 21
Expected Time in States ............................................................................................................. 22
Kolmogorov Equations .............................................................................................................. 23
Limiting Probabilities ................................................................................................................ 23
Application: Pure Birth Process ................................................................................................. 24
Application: Simple Sickness Model ......................................................................................... 24
Occupancy Probabilities and Time ............................................................................................ 25
First Holding Time ..................................................................................................................... 25
Non-homogenous Markov Jump Processes ................................................................................... 26
Residual Holding Time .............................................................................................................. 27
Current Holding Time ................................................................................................................ 27
Part 2: Time Series ............................................................................................................................. 28
Introduction to Time Series ............................................................................................................ 28
Classical Decomposition Model ................................................................................................ 28
Moving Average Linear Filters .................................................................................................. 28
Differencing ............................................................................................................................... 29
Stationarity ................................................................................................................................. 30
Sample Statistics ........................................................................................................................ 30
Noise .......................................................................................................................................... 31
Linear Processes ......................................................................................................................... 31
Time Series Models ....................................................................................................................... 32
Moving Average Process ........................................................................................................... 32
Autoregressive Process .............................................................................................................. 33
Autoregressive Moving Average Models .................................................................................. 34
Causality..................................................................................................................................... 34
Invertibility................................................................................................................................. 35
Calculation of ACF .................................................................................................................... 35
Partial Autocorrelation Function ................................................................................................ 37
Model Building .............................................................................................................................. 38
Model Selection ......................................................................................................................... 38
Parameter Estimation ................................................................................................................. 39
Model Diagnosis ........................................................................................................................ 40
Non-Stationarity ............................................................................................................................. 41
Stochastic Trends ....................................................................................................................... 41
ARIMA Model ........................................................................................................................... 42
SARIMA Model ......................................................................................................................... 42
Dickey-Fuller Test ..................................................................................................................... 43
Overdifferencing ........................................................................................................................ 44
Cointegrated Time Series ........................................................................................................... 44
Time Series Forecasting ................................................................................................................. 45
Time Series and Markov Property ............................................................................................. 45
k-step Ahead Predictor ............................................................................................................... 46
Best Linear Predictor ................................................................................................................. 47
Part 3: Brownian Motion.................................................................................................................... 48
Definitions .................................................................................................................................. 48
Properties of Brownian Motion.................................................................................................. 48
Brownian Motion and Symmetric Random Walk...................................................................... 49
Brownian Motion with Drift ...................................................................................................... 49
Geometric Brownian Motion ..................................................................................................... 49
Gaussian Processes .................................................................................................................... 50
Differential Form of Brownian Motion ..................................................................................... 50
Stochastic Differential Equations............................................................................................... 51
Stochastic Integration ................................................................................................................. 52
Part 4: Simulation............................................................................................................................... 53
Continuous Random Variables Simulation .................................................................................... 53
Pseudo-Random Numbers.......................................................................................................... 53
Inverse Transform Method......................................................................................................... 53
Discrete Random Variable Simulation ...................................................................................... 54
Acceptance-Rejection Method ................................................................................................... 55
Simulation Using Distributional Relationships.......................................................................... 56
Monte Carlo Simulation ................................................................................................................. 57
Expectation and Variance .......................................................................................................... 57
Antithetic Variables ................................................................................................................... 58
Control Variates ......................................................................................................................... 59
Importance Sampling ................................................................................................................. 60
Number of Simulations .............................................................................................................. 60
Part 1: Stochastic Processes
Stochastic Processes
A stochastic process is any collection of random variables, denoted as:

\{X(t), t \in T\}

- T is the index set of the process, usually the time parameter
- X(t) is the state of the process at time t
- S is the state space, i.e. the set of values that X(t) can take on
Stochastic processes can be classified according to the nature of the:
Index Set T
- Discrete time process: index set is finite or countably infinite
- Continuous time process: index set is continuous
State Space S
- Discrete state space: X(t) is a discrete random variable
- Continuous state space: X(t) is a continuous random variable
A sample path or a realisation of a stochastic process is a particular assignment of possible values
of X(t) for all t ∈ T .
Increment Properties
In a stochastic process, an increment is the random variable X(t_2) - X(t_1) for any t_1 < t_2.

A stochastic process has independent increments if X(t_1) - X(t_0), X(t_2) - X(t_1), \ldots, X(t_n) - X(t_{n-1}) are
independent for all t_0 < t_1 < \cdots < t_n. Equivalently, the r.v. X(t+s) - X(t) and X(t) are independent
for all s, t \ge 0, i.e. future increases are independent of the past or present.

A stochastic process has stationary increments if X(t_2 + s) - X(t_1 + s) and X(t_2) - X(t_1), i.e.
increments of the same length, have the same probability distribution, for all t_1 < t_2 and s > 0.
Markov Processes
A Markov process is a stochastic process that has the Markov property, where given the present
state, the future state is independent of the past states:

\Pr[X(t_{n+1}) = x_{n+1} \mid X(t_1) = x_1, \ldots, X(t_n) = x_n] = \Pr[X(t_{n+1}) = x_{n+1} \mid X(t_n) = x_n]

A Markov Chain is a Markov process on a discrete index set T = \{0, 1, 2, \ldots\}, denoted by

\{X_n, n = 0, 1, 2, \ldots\}

where X_n = k means the process is in state k at time n, with a finite or countable state space:

S = \{0, 1, 2, \ldots, n\} \quad \text{or} \quad S = \{0, 1, 2, \ldots\}

The Markov property for a Markov chain is given by:

\Pr[X_{n+1} = x_{n+1} \mid X_0 = x_0, \ldots, X_n = x_n] = \Pr[X_{n+1} = x_{n+1} \mid X_n = x_n]
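The Markov property lends itself directly to simulation: the next state is drawn using only the current state. A minimal sketch, assuming a hypothetical two-state chain (the transition matrix below is an illustration, not taken from these notes):

```python
import random

# Hypothetical two-state chain (states 0 and 1); the probabilities are
# assumed purely for illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def simulate_chain(P, x0, n_steps, rng):
    """Simulate X_0, X_1, ..., X_n: the next state depends only on the current one."""
    path = [x0]
    for _ in range(n_steps):
        current = path[-1]
        u = rng.random()
        # move to state 0 with probability P[current][0], else to state 1
        path.append(0 if u < P[current][0] else 1)
    return path

rng = random.Random(42)
path = simulate_chain(P, 0, 10, rng)
print(path)  # one realisation (sample path) of the chain
```

Each `path` produced this way is a sample path of the chain in the sense defined earlier.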
Transition Probability

The one-step transition probabilities of the Markov chain are the conditional probabilities of
moving to state j in one step, given that the process is in state i at present:

P_{ij}^{(n, n+1)} = \Pr[X_{n+1} = j \mid X_n = i]

If the one-step transition probabilities do NOT depend on time, i.e. are the same for all n to n+1, then the
Markov chain is homogenous with stationary transition probabilities:

P_{ij}^{(n, n+1)} = P_{ij} = \Pr[X_{n+1} = j \mid X_n = i]

The transition probability matrix is the matrix consisting of all transition probabilities:

P = [P_{ij}] = \begin{pmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ P_{20} & P_{21} & P_{22} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}

This matrix satisfies the properties:

P_{ij} \ge 0 \text{ for all } i, j = 0, 1, 2, \ldots \quad \text{and} \quad \sum_{j} P_{ij} = 1 \text{ for } i = 0, 1, 2, \ldots

Note: in the non-homogenous case, the time must be specified.

The n-step transition probability is the probability that a process in state i will be in state j after n
steps:

P_{ij}^{(n)} = \Pr[X_{n+m} = j \mid X_m = i]
Chapman-Kolmogorov Equations

The Chapman-Kolmogorov equations compute the (n+m)-step transition probabilities:

P_{ij}^{(n+m)} = \sum_{k=0}^{\infty} P_{ik}^{(n)} P_{kj}^{(m)} \quad \text{for all } n, m \ge 0

This is the sum of the transition probabilities of reaching an intermediate state k after n steps, then
reaching state j after m more steps. Or in matrix multiplication:

P^{(n+m)} = P^{(n)} P^{(m)}, \quad P^{(n)} = \underbrace{P \cdot P \cdots P}_{n \text{ times}}

where P^{(n)} = [P_{ij}^{(n)}] is the matrix consisting of the n-step transition probabilities.

Consider the matrix multiplication A * B = AB:
- To get the ith row of AB, multiply the ith row of A by B
- To get the jth column of AB, multiply A by the jth column of B

Kolmogorov Forward Equations: start in state i, transition into state k after n steps, then a one-
step transition to state j:

P_{ij}^{(n+1)} = \sum_{k=0}^{\infty} P_{ik}^{(n)} P_{kj} \quad \text{for all } n \ge 0

Kolmogorov Backward Equations: start in state i, make a one-step transition into state k, then an n-step
transition into state j:

P_{ij}^{(n+1)} = \sum_{k=0}^{\infty} P_{ik} P_{kj}^{(n)} \quad \text{for all } n \ge 0
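These identities are easy to check numerically, since the n-step transition matrix is just the nth matrix power of P. A small sketch with NumPy, using an assumed 2x2 transition matrix:

```python
import numpy as np

# Illustrative 2-state transition matrix (an assumption, not from the notes).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# n-step transition matrices: P^(n) is the nth matrix power of P
P2 = np.linalg.matrix_power(P, 2)
P3 = np.linalg.matrix_power(P, 3)
P5 = np.linalg.matrix_power(P, 5)

# Chapman-Kolmogorov: P^(2+3) = P^(2) @ P^(3)
assert np.allclose(P5, P2 @ P3)

# Forward equations: P^(n+1) = P^(n) @ P  (one-step transition last)
assert np.allclose(np.linalg.matrix_power(P, 4), P3 @ P)
# Backward equations: P^(n+1) = P @ P^(n)  (one-step transition first)
assert np.allclose(np.linalg.matrix_power(P, 4), P @ P3)
```

Every row of each `P^(n)` still sums to 1, since each power is itself a transition matrix.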
Classification of States
Absorbing state

A state i is said to be an absorbing state if P_{ii} = 1, i.e. P_{ij} = 0 for all j \ne i. An absorbing state is a state
in which, once the process arrives, it stays forever.

Accessible

State j is accessible from state i if P_{ij}^{(n)} > 0 for some n \ge 0. That is, state j is accessible from i if there is a
positive probability that the process will be in state j at some future time, given that the process is
currently in state i. This is written as:

i \to j

Note that:

\text{if } i \to j \text{ and } j \to k, \text{ then } i \to k

For an absorbing state, the only accessible state is itself.
Communicate

States i and j communicate if:

i \to j \text{ and } j \to i, \quad \text{i.e. } i \leftrightarrow j

That is, there is a positive probability that if the process is currently in state i, then at a future point
in time the process will return to state i in between visiting state j. Note that:

- i \leftrightarrow i for all i
- if i \leftrightarrow j then j \leftrightarrow i
- if i \leftrightarrow j and j \leftrightarrow k then i \leftrightarrow k

The class of states that communicate with state i is the set:

C(i) = \{ j \in S : i \leftrightarrow j \}

That is, there is a positive probability that if the process is in state i, then at a future point in time the
process will return to state i in between visiting at least one state in class C(i).
Irreducible Markov Chain
A Markov chain is irreducible if there is only ONE class, i.e. all states communicate with each
other. Properties:

- All states in a finite, irreducible Markov chain are recurrent
- The probability of returning to the current state in the long run is positive
- The probability of being in any state in the long run is also positive
Recurrent and Transient States
Let f_i be the probability that the process, starting in state i, will return to state i sometime in the future:

f_i = \Pr[X_n = i \text{ for some } n \ge 1 \mid X_0 = i]

A state is:

- Recurrent if f_i = 1
- Transient if f_i < 1
If state i is recurrent then, starting in state i, the process will return to state i at some point in the
future, and the process will enter state i infinitely often.
If state i is transient then, starting in state i, whether the process ever returns to state i is a Bernoulli trial:

X = \begin{cases} 0 & \text{if the process returns to state } i \\ 1 & \text{if the process does not return to state } i \end{cases}, \quad X \sim \text{Ber}(p = 1 - f_i)

Then the probability that the process will be in state i for exactly n time periods can be modelled
using the geometric distribution:

f_i^{\,n-1} (1 - f_i)

The total number of periods that the process is in state i is given by:

\sum_{n=0}^{\infty} I_n, \quad \text{where } I_n = \begin{cases} 1 & \text{if } X_n = i \\ 0 & \text{if } X_n \ne i \end{cases}

with the expected number of time periods, given that the process starts in state i:

E\left[\sum_{n=0}^{\infty} I_n \,\Big|\, X_0 = i\right] = \sum_{n=0}^{\infty} E[I_n \mid X_0 = i] = \sum_{n=0}^{\infty} \Pr[X_n = i \mid X_0 = i] = \sum_{n=0}^{\infty} P_{ii}^{(n)} = \frac{1}{1 - f_i}

An alternate definition for recurrent/transient states: a state i is

\text{recurrent if } \sum_{n=1}^{\infty} P_{ii}^{(n)} = \infty; \quad \text{transient if } \sum_{n=1}^{\infty} P_{ii}^{(n)} < \infty

That is, a transient state will only be visited a finite number of times.
Note:

- In a finite state Markov chain, AT LEAST one state must be recurrent and NOT ALL states
can be transient. Otherwise, after a finite number of steps, there would be no state left to visit!
- An absorbing state is recurrent, since it revisits itself infinitely often
- If state i communicates with state j and state i is:
  - Recurrent, then state j is also recurrent
  - Transient, then state j is also transient

A class of states is:

- Recurrent if all states in the class are recurrent
- Transient if all states in the class are transient
- Closed if all of the states in the class can only lead to states WITHIN the class. Hence,
states in a closed class are recurrent
- Open if the states in the class can lead to states OUTSIDE the class. Hence states in an open
class are transient
Class Properties

Period

State i has period d(i) if d(i) is the greatest common divisor of all n \ge 1 for which P_{ii}^{(n)} > 0, i.e. it is
the g.c.d. of the number of steps in all possible paths back to state i, if the process starts in state i. Note that:

- If P_{ii}^{(n)} = 0 for all n, then d(i) = 0
- If i \leftrightarrow j, then d(i) = d(j)
- A state with period 1 is called aperiodic

Positive Recurrent

A recurrent state i is positive recurrent if the expected time of return to itself is finite. In a finite
state Markov chain, ALL recurrent states are positive recurrent.

Ergodic

A state i is ergodic if it is positive recurrent and aperiodic.
Limiting Probabilities

Limiting probabilities are the long run probabilities that a process is in a certain state. For an
irreducible (only one class, all states communicate) and ergodic Markov chain, the limiting
probability exists and is independent of i, i.e. of where the process starts from:

\pi_j = \lim_{n \to \infty} P_{ij}^{(n)} > 0

where \pi_j is the unique non-negative solution to the set of equations:

\pi_j = \sum_{i=0}^{\infty} \pi_i P_{ij} \quad \text{and} \quad \sum_{j=0}^{\infty} \pi_j = 1

Or, in matrix form:

\pi = \pi P, \quad \text{where } \pi = (\pi_0, \pi_1, \pi_2, \ldots)

This can be interpreted as:

- The probability that the process is in state j at time t is the same as at t+1, as t \to \infty
- \pi_j is the long run proportion of time that the Markov chain is in state j
- Note that the limiting probabilities may not exist at all, or only for even/odd transitions
- If state j is transient and state i is recurrent, then \lim_{n \to \infty} P_{ij}^{(n)} = 0 and \lim_{n \to \infty} P_{ii}^{(n)} = 1

If the distribution of the initial states is chosen to be the limiting distribution, then the probability
of being in state j initially is the same as the probability of being in state j at time n:

\text{if } \Pr[X_0 = j] = \pi_j, \text{ then } \Pr[X_n = j] = \pi_j

The mean time between visits to state j is the expected number of transitions m_{jj} until a Markov
chain which starts in state j returns to state j, given by the mean of the geometric distribution:

m_{jj} = \frac{1}{\pi_j}

That is, the proportion of time in state j equals the inverse of the mean time between visits to state j.
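In practice the system \pi = \pi P with \sum_j \pi_j = 1 is a linear solve. A sketch for an assumed 2-state chain, also checking that a high matrix power of P converges to the limiting distribution and that m_{jj} = 1/\pi_j:

```python
import numpy as np

# Illustrative irreducible, aperiodic 2-state chain (assumed matrix).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Solve pi = pi P together with sum(pi) = 1: stack (P^T - I) with a
# normalisation row and least-squares solve the consistent system.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Every row of a high power of P converges to pi
assert np.allclose(np.linalg.matrix_power(P, 100)[0], pi)

# Mean time between visits to state j: m_jj = 1 / pi_j
m = 1.0 / pi
print(pi, m)
```

For this matrix the solution is \pi = (4/7, 3/7), so the mean return times are m = (7/4, 7/3).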
Mean Time in Transient States

For a finite state Markov chain, let its transient states be denoted by T = \{1, 2, \ldots, t\}. Let P_T denote the
transition matrix that ONLY contains the transient states. Note that its rows will sum to less than 1.

P_T = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1t} \\ P_{21} & P_{22} & \cdots & P_{2t} \\ \vdots & \vdots & & \vdots \\ P_{t1} & P_{t2} & \cdots & P_{tt} \end{pmatrix}

The mean time spent in transient states s_{ij} is the expected number of periods that the Markov
chain is in state j, given that the process starts in state i:

S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1t} \\ s_{21} & s_{22} & \cdots & s_{2t} \\ \vdots & \vdots & & \vdots \\ s_{t1} & s_{t2} & \cdots & s_{tt} \end{pmatrix}

Conditioning on the initial transition, we have for i, j \in T:

s_{ij} = \begin{cases} \sum_{k=1}^{t} P_{ik} s_{kj} & \text{if } i \ne j \\ 1 + \sum_{k=1}^{t} P_{ik} s_{kj} & \text{if } i = j \end{cases}

i.e. the mean time spent in state j, given the process is currently in state i, is the transition
probability into state k starting from state i, multiplied by the time spent in state j starting from state k,
summed over all possible k (plus the current period when i = j).

Then we have:

S = I + P_T S \quad \Rightarrow \quad S = (I - P_T)^{-1}

Note the 2x2 matrix inverse:

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad A^{-1} = \frac{1}{\det A} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}, \quad \det A = ad - bc
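The identity S = (I - P_T)^{-1} is a one-liner to verify numerically. A sketch, assuming a hypothetical chain with two transient states (the remaining probability mass in each row leads to recurrent states outside P_T):

```python
import numpy as np

# Hypothetical restriction of a transition matrix to transient states {1, 2};
# rows sum to less than 1 (the rest leads out of the transient class).
P_T = np.array([[0.4, 0.3],
                [0.2, 0.5]])

# Mean time in transient states: S = (I - P_T)^{-1}
S = np.linalg.inv(np.eye(2) - P_T)

# s_ij = expected number of periods in transient state j, starting from i;
# S satisfies the conditioning identity S = I + P_T S.
assert np.allclose(S, np.eye(2) + P_T @ S)
print(S)
```

Here det(I - P_T) = 0.6 * 0.5 - 0.3 * 0.2 = 0.24, so by the 2x2 inverse formula s_{11} = 0.5 / 0.24.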
Branching Processes

A branching process is a Markov chain that describes the size of a population where each member
in each generation produces a random number of offspring in the next generation.

- Let p_j, j \ge 0, be the probability that each individual in each generation produces j offspring
- Assume that the number of offspring of each individual is independent of the numbers produced by others
- Let X_0 be the number of individuals initially present, i.e. the zeroth generation
- Individuals produced from the nth generation belong to the (n+1)th generation
- Let X_n be the size of the nth generation.
Then \{X_n, n = 0, 1, \ldots\} is a branching process with:

- P_{00} = 1, and state 0 is recurrent
- If p_0 > 0, all states other than 0 are transient, i.e. the population will either die out or
converge to infinity, since transient states are visited a finite number of times

After n+1 generations, the size of the population is a random sum:

X_{n+1} = \sum_{i=1}^{X_n} Y_i

where the Y_i are i.i.d. r.v. that represent the number of offspring of the ith individual of the nth generation:

\Pr[Y_i = k] = p_k \text{ for } k = 0, 1, \ldots, \quad \sum_{k=0}^{\infty} p_k = 1

The mean and variance of the number of offspring for an individual are:

\mu = E[Y_i] = \sum_{k=0}^{\infty} k p_k, \quad \sigma^2 = \text{var}(Y_i) = \sum_{k=0}^{\infty} (k - \mu)^2 p_k

The mean and variance of the population size of the nth generation are:

E[X_n] = \mu^n, \quad \text{var}(X_n) = \begin{cases} \sigma^2 \mu^{n-1} \dfrac{\mu^n - 1}{\mu - 1} & \text{if } \mu \ne 1 \\ n \sigma^2 & \text{if } \mu = 1 \end{cases}

Under the assumption that X_0 = 1, the probability of extinction \pi_0, i.e. that everyone eventually dies, is given by the smallest non-negative solution of:

\pi_0 = \sum_{k=0}^{\infty} \pi_0^k p_k

Since the pgf of Y is G_Y(s) = \sum_{k} p_k s^k, we have \pi_0 = G_Y(\pi_0).

- If \mu > 1 then \pi_0 < 1
- If \mu \le 1 then \pi_0 = 1, i.e. extinction is certain
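The fixed-point equation \pi_0 = G_Y(\pi_0) can be solved by iterating the pgf from 0, which converges to the smallest root. A sketch with an assumed offspring distribution p_0 = 0.25, p_1 = 0.25, p_2 = 0.5 (so \mu = 1.25 > 1 and extinction is not certain):

```python
# Assumed offspring distribution, chosen for illustration: mu = 1.25 > 1.
p = [0.25, 0.25, 0.5]

def G(s, p):
    """pgf of the offspring distribution: G(s) = sum_k p_k s^k."""
    return sum(pk * s**k for k, pk in enumerate(p))

# Iterate pi0 <- G(pi0) from 0; this converges to the smallest
# non-negative root of pi0 = G(pi0), the extinction probability.
pi0 = 0.0
for _ in range(200):
    pi0 = G(pi0, p)

print(pi0)
```

For this distribution the equation 0.5 s^2 - 0.75 s + 0.25 = 0 factors as (2s - 1)(s - 1) = 0, so the smallest root, and hence the extinction probability, is 0.5.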
Probability Generating Functions

Let X be an integer-valued random variable with \Pr[X = i] = p_i for i = 0, 1, 2, \ldots. The p.g.f. is:

G_X(t) = E[t^X] = \sum_{x=0}^{\infty} t^x \Pr[X = x]

Properties:

- Relationship with the probability mass function of X:

p_x = \Pr[X = x] = \frac{1}{x!} \left. \frac{d^x}{dt^x} G_X(t) \right|_{t=0}

- Relationship with the moment generating function of X:

m_X(t) = E[e^{Xt}] = G_X(e^t), \quad G_X(t) = m_X(\log t)
Time Reversible Markov Chains

Consider a stationary ergodic Markov chain with transition probabilities P_{ij} and stationary
probabilities \pi_i. If we start at time n and work backwards, the reversed sequence of states X_n, X_{n-1}, \ldots
is also a Markov chain, since independence is a symmetric relationship.

The transition probabilities of this reversed process are:

Q_{ij} = \Pr[X_m = j \mid X_{m+1} = i] = \frac{\Pr[X_m = j, X_{m+1} = i]}{\Pr[X_{m+1} = i]} = \frac{\Pr[X_m = j] \Pr[X_{m+1} = i \mid X_m = j]}{\Pr[X_{m+1} = i]} = \frac{\pi_j P_{ji}}{\pi_i}

A time reversible Markov chain is one where:

Q_{ij} = P_{ij} \quad \text{for all } i, j

For a time reversible Markov chain:

\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j

Thus, the rate at which the process goes from state i to state j (LHS) is the same as the rate at which
it goes from state j to state i.
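The detailed-balance condition is easy to check directly. A sketch for a two-state chain (any ergodic two-state chain is time reversible; the matrix and its stationary distribution below are the same assumed example used throughout, not from the notes):

```python
import numpy as np

# Assumed two-state ergodic chain with stationary distribution pi = (4/7, 3/7).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([4/7, 3/7])

assert np.allclose(pi @ P, pi)                       # pi is stationary: pi = pi P
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])  # detailed balance

# Reversed-chain transition probabilities: Q_ij = pi_j * P_ji / pi_i
Q = (P.T * pi[None, :]) / pi[:, None]
assert np.allclose(Q, P)  # Q = P, so this chain is time reversible
```

The rows of Q sum to 1, confirming the reversed process is itself a Markov chain.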
Exponential, Poisson and Gamma Distributions

A r.v. X has an exponential distribution if its probability density function is of the form:

f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}

The cumulative distribution function and the survival function are of the form:

F(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}, \quad S(x) = 1 - F(x) = \begin{cases} e^{-\lambda x} & \text{if } x \ge 0 \\ 1 & \text{if } x < 0 \end{cases}

The moment generating function is given by:

M_X(t) = E[e^{tX}] = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda

The mean and variance of the exponential distribution are:

E[X] = \frac{1}{\lambda}, \quad \text{var}(X) = \frac{1}{\lambda^2}

The hazard (or failure) rate is given by:

\mu(x) = \frac{f(x)}{S(x)} = \frac{-\frac{d}{dx} S(x)}{1 - F(x)} = \frac{\lambda e^{-\lambda x}}{e^{-\lambda x}} = \lambda
A key property of the exponential distribution is that it is memoryless. X is said to be memoryless if,
for all s, t > 0:

\Pr[X > s + t \mid X > t] = \Pr[X > s], \quad \text{or} \quad \Pr[X > s + t] = \Pr[X > s] \Pr[X > t]

Consider independent X_i for i = 1, 2, \ldots, n, exponential with parameters \lambda_i. Then:

- If all the parameters are equal (\lambda_i = \lambda), then Y_n = X_1 + X_2 + \cdots + X_n has a Gamma(n, \lambda)
distribution with pdf:

f_{Y_n}(t) = \lambda e^{-\lambda t} \frac{(\lambda t)^{n-1}}{(n-1)!}

- X = \min(X_1, X_2, \ldots, X_n) also has an exponential distribution, with parameter and mean:

\lambda = \sum_{i=1}^{n} \lambda_i \quad \text{and} \quad E[X] = \frac{1}{\sum_{i=1}^{n} \lambda_i}

- The probability that X_i is the smallest is given by:

\frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}

- If the parameters are not all equal, i.e. \lambda_i \ne \lambda_j, then the sum of the n independent exponential
r.v. has a hypoexponential (generalised Erlang) distribution with pdf:

f_{X_1 + \cdots + X_n}(x) = \sum_{i=1}^{n} C_{i,n} \lambda_i e^{-\lambda_i x}, \quad \text{where } C_{i,n} = \prod_{j \ne i} \frac{\lambda_j}{\lambda_j - \lambda_i}
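The "which exponential is smallest" result is a useful Monte Carlo exercise. A sketch with assumed rates (1, 2, 3), so the smallest should be X_1, X_2, X_3 with probabilities 1/6, 2/6, 3/6:

```python
import random

# Monte Carlo check that Pr[X_i is the smallest] = lam_i / sum(lam).
# The rates below are assumptions chosen for illustration.
rng = random.Random(0)
lams = [1.0, 2.0, 3.0]
n_trials = 200_000

wins = [0, 0, 0]
for _ in range(n_trials):
    xs = [rng.expovariate(lam) for lam in lams]   # independent exponentials
    wins[xs.index(min(xs))] += 1                  # record which was smallest

probs = [w / n_trials for w in wins]
print(probs)  # should be close to [1/6, 2/6, 3/6]
```

With 200,000 trials the Monte Carlo error on each proportion is below 0.01.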
Counting Process

A counting process \{N(t), t \ge 0\} represents the number of events that occur up to time t, i.e. it is a
discrete state space, continuous time stochastic process. A counting process has the properties:

- N(t) \ge 0
- N(t) is integer-valued
- N(s) \le N(t) for s \le t, i.e. it must be non-decreasing
- For s < t, N(t) - N(s) is the number of events that occurred in the interval (s, t]

A counting process has independent increments if the number of events that occur in the interval
(s, t], N(t) - N(s), is independent of the number of events that occur up to time s.

A counting process has stationary increments if N(t_2 + s) - N(t_1 + s) has the same distribution as
N(t_2) - N(t_1), i.e. the number of events that occur in any time interval depends only on
the length of the interval.
Poisson Process

A Poisson process, denoted by \{N(t), t \ge 0\}, is a counting process that counts the number of events
which occur at a rate \lambda from time 0. It has the properties:

- N(0) = 0
- Independent and stationary increments
- The number of events in any interval of length t has a Poisson distribution with mean \lambda t:

\Pr[N(s + t) - N(s) = n] = \Pr[N(t) = n] = e^{-\lambda t} \frac{(\lambda t)^n}{n!} \quad \text{(stationary increments)}
Inter-arrival and Waiting Time

The inter-arrival or holding times \{T_n, n = 1, 2, \ldots\} are the times between the (n-1)th and nth events. For a
Poisson process, the inter-arrival times are i.i.d. exponential r.v. with:

f_{T_n}(t) = \lambda e^{-\lambda t}, \quad \Pr[T_n > t] = e^{-\lambda t}, \quad E[T_n] = \frac{1}{\lambda}

The waiting time S_n is the time until the nth event, given by:

S_n = T_1 + T_2 + \cdots + T_n = \sum_{i=1}^{n} T_i

The sum of exponential random variables each with the SAME parameter \lambda has a Gamma(n, \lambda)
distribution, therefore:

f_{S_n}(t) = \lambda e^{-\lambda t} \frac{(\lambda t)^{n-1}}{(n-1)!}

Note: if S_n is the waiting time in the aggregate of m independent claim processes, each with rate \lambda, then it has a Gamma(n, m\lambda) distribution. E.g.
for a motor insurer, each insured motorist makes a claim at a rate of 0.2 per year, and there are a
total of 200 insured. Then S_{100}, the waiting time until the 100th claim, has the distribution:

S_{100} \sim \text{Gamma}(100, 0.2 \times 200)
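Since the inter-arrival times are i.i.d. Exp(\lambda), a Poisson process can be simulated by accumulating exponential draws until the clock passes t. A sketch with assumed values \lambda = 2 and t = 5, checking that E[N(t)] = \lambda t:

```python
import random

def poisson_process_count(lam, t, rng):
    """Count events in [0, t] by summing i.i.d. Exp(lam) inter-arrival times."""
    n, clock = 0, rng.expovariate(lam)
    while clock <= t:
        n += 1
        clock += rng.expovariate(lam)  # next inter-arrival time
    return n

rng = random.Random(1)
lam, t, n_sims = 2.0, 5.0, 50_000
counts = [poisson_process_count(lam, t, rng) for _ in range(n_sims)]
mean = sum(counts) / n_sims
print(mean)  # should be close to lam * t = 10
```

The kth accumulated clock value here is exactly the waiting time S_k from the text.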
Order Statistics

Let Y_{(i)} be the ith smallest value among Y_1, Y_2, \ldots, Y_n; then (Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}) is called the order statistic. If
the Y_i are i.i.d. continuous r.v. with pdf f(y), then the joint pdf of the order statistics is:

f_{Y_{(1)}, \ldots, Y_{(n)}}(y_1, y_2, \ldots, y_n) = n! \prod_{i=1}^{n} f(y_i), \quad \text{where } y_1 < y_2 < \cdots < y_n

If the Y_i are uniformly distributed over (0, t), then the joint pdf of the order statistics is:

f_{Y_{(1)}, \ldots, Y_{(n)}}(y_1, y_2, \ldots, y_n) = \frac{n!}{t^n}
Conditional Distribution of Arrival Times
Given that $N(t) = n$, i.e. $n$ events have occurred up until time $t$, the $n$ arrival times $S_1, \dots, S_n$ have the same distribution as the order statistics of $n$ independent r.v. uniformly distributed over $(0, t)$.
Note that the following two events are equivalent:
$$\{S_1 = s_1, S_2 = s_2, \dots, S_n = s_n, N(t) = n\} = \{T_1 = s_1, T_2 = s_2 - s_1, \dots, T_n = s_n - s_{n-1}, T_{n+1} > t - s_n\}$$
i.e. the inter-arrival time between the $(n-1)$th and $n$th events equals the waiting time for $n$ events minus the waiting time for $n-1$ events. Hence the joint conditional density is:
$$f(s_1, s_2, \dots, s_n \mid N(t) = n) = \frac{f(s_1, s_2, \dots, s_n,\, n)}{\Pr(N(t) = n)} = \frac{\lambda e^{-\lambda s_1} \cdot \lambda e^{-\lambda(s_2 - s_1)} \cdots \lambda e^{-\lambda(s_n - s_{n-1})} \cdot e^{-\lambda(t - s_n)}}{e^{-\lambda t}(\lambda t)^n / n!} = \frac{n!}{t^n}$$
The conditional marginal distribution, for $0 < s < t$ and $m \le n$, is binomial:
$$\Pr(N(s) = m \mid N(t) = n) = \frac{\Pr(N(s) = m,\, N(t) = n)}{\Pr(N(t) = n)} = \frac{\Pr(N(s) = m)\Pr(N(t) - N(s) = n - m)}{\Pr(N(t) = n)} = \binom{n}{m}\left(\frac{s}{t}\right)^m \left(1 - \frac{s}{t}\right)^{n-m}$$
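The binomial form of the conditional marginal can be computed directly; a small sketch (the values of $n$, $s$, $t$ are illustrative):

```python
import math

def cond_pmf(m, n, s, t):
    """Pr(N(s) = m | N(t) = n) for a Poisson process: Binomial(n, s/t)."""
    p = s / t
    return math.comb(n, m) * p**m * (1 - p)**(n - m)

n, s, t = 10, 2.0, 5.0
pmf = [cond_pmf(m, n, s, t) for m in range(n + 1)]
mean = sum(m * p_m for m, p_m in enumerate(pmf))   # binomial mean: n * s / t = 4
```

Note that the rate $\lambda$ cancels entirely, as the derivation above shows.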
Thinning of Poisson Process
Consider a Poisson process $\{N(t), t \ge 0\}$ with rate $\lambda$, and suppose that each time an event occurs, independently of all other events, it is either:
- a type I event with probability $p$
- a type II event with probability $1 - p$

Let $N_1(t)$ and $N_2(t)$ denote the numbers of type I and type II events occurring in $[0, t]$, so that $N(t) = N_1(t) + N_2(t)$. Then:
- $\{N_1(t), t \ge 0\}$ is a Poisson process with rate $\lambda p$
- $\{N_2(t), t \ge 0\}$ is a Poisson process with rate $\lambda(1 - p)$
- the two processes are independent, and $\{N(t), t \ge 0\}$ is a Poisson process with rate equal to the sum of the two rates, i.e. $\lambda = \lambda p + \lambda(1 - p)$
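Thinning is easy to verify by simulation; a minimal sketch (the rate, probability and horizon are illustrative):

```python
import random

random.seed(1)

lam, p, horizon = 5.0, 0.3, 1000.0   # illustrative rate, type-I probability, horizon

# Generate the Poisson process by exponential gaps and classify each event.
n1 = n2 = 0
clock = random.expovariate(lam)
while clock < horizon:
    if random.random() < p:
        n1 += 1   # type I event
    else:
        n2 += 1   # type II event
    clock += random.expovariate(lam)

rate1 = n1 / horizon   # should be near lam * p       = 1.5
rate2 = n2 / horizon   # should be near lam * (1 - p) = 3.5
```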
7/23/2019 ACTL2003 Summary Notes
http://slidepdf.com/reader/full/actl2003-summary-notes 15/62
15
Zhi Ying Feng
Non-homogenous Poisson Process
The counting process $\{N(t), t \ge 0\}$ is a non-homogeneous Poisson process with intensity function $\lambda(t)$, i.e. a non-constant rate, if:
- $N(0) = 0$
- It has independent increments
- It has unit jumps, i.e.
$$\Pr(N(t+h) - N(t) = 1) = \lambda(t)h + o(h), \qquad \Pr(N(t+h) - N(t) \ge 2) = o(h)$$

$\lambda(t)$ can be deterministic or itself a process. Note that if $\lambda(t) \equiv \lambda$ then it is a homogeneous Poisson process.
The mean value function of a non-homogeneous Poisson process is:
$$m(t) = \int_0^t \lambda(y)\,dy$$
Then:
- $N(t)$ is a Poisson r.v. with mean $m(t)$
- $N(s+t) - N(s)$ is a Poisson r.v. with mean $m(s+t) - m(s)$
- $\{N(m^{-1}(t)), t \ge 0\}$ is a homogeneous Poisson process with intensity 1
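The first bullet can be sampled directly from the mean value function; a small sketch with the illustrative choice $\lambda(u) = 2u$, so $m(t) = t^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Intensity lambda(u) = 2u gives mean value function m(t) = t^2 (illustrative choice).
t = 3.0
m_t = t**2                               # = 9
counts = rng.poisson(m_t, size=100_000)  # N(t) ~ Poisson(m(t))
emp_mean = counts.mean()
emp_var = counts.var()                   # Poisson: variance equals the mean
```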
Compound Poisson Process
A stochastic process $\{X(t), t \ge 0\}$ is a compound Poisson process if
$$X(t) = \sum_{i=1}^{N(t)} Y_i$$
Where:
- $\{N(t), t \ge 0\}$ is a Poisson process with rate $\lambda$
- $Y_i, i = 1, 2, \dots$ are i.i.d. random variables

This is useful for insurance companies, where the uncertainty in the total claim amount $X(t)$ comes both from the number of claims $N(t)$, which follows a Poisson process, and from the claim sizes $Y_i$, which can have any distribution as long as they are i.i.d. Note that the claim sizes $Y_i$ are independent of the number of claims $N(t)$.
The mean and variance of a compound Poisson process are given by:
$$E[X(t)] = \lambda t\,E[Y_i], \qquad \text{var}(X(t)) = \lambda t\,E[Y_i^2]$$
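Both moment formulas can be checked by simulation; a minimal sketch assuming exponential claim sizes (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

lam, t = 2.0, 10.0   # claim rate and time horizon (illustrative)
mean_y = 5.0         # exponential claim sizes: E[Y] = 5, E[Y^2] = 2 * 5^2 = 50

trials = 100_000
n = rng.poisson(lam * t, size=trials)                             # claims per path
x = np.array([rng.exponential(mean_y, size=k).sum() for k in n])  # aggregate claims

th_mean = lam * t * mean_y           # lambda * t * E[Y]   = 100
th_var = lam * t * 2 * mean_y**2     # lambda * t * E[Y^2] = 1000
```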
7/23/2019 ACTL2003 Summary Notes
http://slidepdf.com/reader/full/actl2003-summary-notes 16/62
16
Zhi Ying Feng
Continuous-time Markov Chains
A continuous-time Markov chain, or Markov jump process, is a Markov process in continuous time with a discrete state space. The Markov jump process $\{X(t), t \ge 0\}$ has a continuous version of the Markov property: for all $s, t \ge 0$ and states $i, j, x(u),\, 0 \le u < t$:
$$\Pr(X(t+s) = j \mid X(t) = i,\; X(u) = x(u),\, 0 \le u < t) = \Pr(X(t+s) = j \mid X(t) = i)$$
i.e. the process at time $t+s$ is conditional only on the state at time $t$ and is independent of its past states $x(u),\, 0 \le u < t$. The time path up to the state at time $t$ does not matter.
The Markov jump process is homogeneous, i.e. all transition probabilities are stationary, if:
$$\Pr(X(t+s) = j \mid X(t) = i) = \Pr(X(s) = j \mid X(0) = i) = P_{ij}(s), \quad \text{i.e. independent of } t$$
Hence the transition probability over a time period depends only on the duration of the period.
The Poisson process is a continuous-time Markov chain: for $j \ge i$,
$$\begin{aligned}
\Pr(X(t+s) = j \mid X(s) = i) &= \Pr(j - i \text{ jumps in } (s, s+t] \mid i \text{ jumps in } (0, s]) \\
&= \Pr(j - i \text{ jumps in } (s, s+t]) && \text{(independent increments)} \\
&= \Pr(j - i \text{ jumps in } (0, t]) && \text{(stationary increments)} \\
&= \frac{(\lambda t)^{j-i}}{(j-i)!}\,e^{-\lambda t}
\end{aligned}$$
Time Spent in a State
The time spent in a state $i$ before the next jump, denoted $T_i$, for a continuous-time Markov chain has the memoryless property: for $s, t \ge 0$,
$$\begin{aligned}
\Pr(T_i > s + t \mid T_i > s) &= \Pr(X(v) = i,\, s \le v \le s+t \mid X(u) = i,\, 0 \le u \le s) \\
&= \Pr(X(v) = i,\, s \le v \le s+t \mid X(s) = i) && \text{(Markov property)} \\
&= \Pr(X(v) = i,\, 0 \le v \le t \mid X(0) = i) && \text{(stationary increments)} \\
&= \Pr(T_i > t)
\end{aligned}$$
The only continuous distribution with the memoryless property is the exponential distribution. Hence the time spent in a state is exponentially distributed with:
$$T_i \sim \exp(v_i), \qquad \Pr(T_i \le t) = 1 - e^{-v_i t}$$
The expected time spent in state $i$ before transitioning into another state is:
$$E[T_i] = \frac{1}{v_i}$$
Where $v_i$ is the transition rate when in state $i$, i.e. the transition rate out of state $i$ to another state. It is the sum of all the rates of jumping from state $i$ to another state $j$, denoted $q_{ij}$:
$$v_i = \sum_{j \ne i} q_{ij} = -q_{ii}$$
Transition Rates and Probabilities
The transition probability of the continuous-time Markov chain is defined as:
$$P_{ij}(t, t+s) = \Pr(X(t+s) = j \mid X(t) = i)$$
For a homogeneous (stationary) process we have stationary transition probabilities:
$$P_{ij}(t, t+s) = \Pr(X(t+s) = j \mid X(t) = i) = \Pr(X(s) = j \mid X(0) = i) = P_{ij}(s)$$
With the initial conditions:
$$P_{ij}(0) = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases}$$

Transition Rates of Homogenous Processes
The instantaneous transition rate $q_{ij}$ from state $i$ to state $j$ of a continuous-time Markov chain is defined as the rate of change of the transition probability over a small period of time:
$$q_{ij} = \lim_{h \to 0^+} \frac{P_{ij}(h) - P_{ij}(0)}{h} = \frac{d}{dt}P_{ij}(t)\Big|_{t=0} = \begin{cases} \lim_{h \to 0^+} \dfrac{P_{ij}(h)}{h} & \text{if } i \ne j \\[4pt] \lim_{h \to 0^+} \dfrac{P_{ii}(h) - 1}{h} = -v_i & \text{if } i = j \end{cases}$$
Where $v_i = -q_{ii}$ denotes the transition rate out of state $i$ when the process is in state $i$. Note that:
$$\sum_j P_{ij}(h) = 1 \quad \text{(sum of all transition probabilities is 1)}, \qquad \sum_j q_{ij} = 0 \quad \text{(sum of all transition rates is 0)}$$
Therefore, we have that:
$$\sum_{j \ne i} q_{ij} + q_{ii} = 0 \quad \Rightarrow \quad \sum_{j \ne i} q_{ij} = -q_{ii} = v_i$$
Equivalently, the transition probability over a small time $h$ can be written as:
$$P_{ij}(h) = \begin{cases} q_{ij}h + o(h) & \text{if } i \ne j \\ 1 - v_i h + o(h) & \text{if } i = j \end{cases}$$
Transition Rates of Non-homogenous Processes
In the non-homogenous case, the transition rates are also a derivative of the transition probabilities:
0
0
0
,lim if
, ,, lim
, 1lim if
ij
ijhij ij
ijh
t s iiii i
h
P s s hq s i j
P s s h P s s h P s t
t h P s s hq s v s i j
h
Equivalently, the transition probability over a small time h can be defined as:
if ,
1 if ij
ijii
q s h o h i j P s s hq s h o h i j
Now, ignore when the transitions occur and how long is spent in each state. Consider only the series of states that the process transitions into. Let:
- $v_i$ be the transition rate OUT of state $i$ when the process is in state $i$
- $P_{ij}$ be the probability that the transition is into state $j$, conditional on the fact that a transition has OCCURRED and the process is currently in state $i$ (see embedded Markov chain)

Then $q_{ij}$, the transition rate INTO state $j$ when the process is in state $i$, is given by:
$$q_{ij} = v_i P_{ij}$$
using $\Pr(A \cap B) = \Pr(A)\Pr(B \mid A)$, where $A$ is the event of moving out of state $i$ and $B$ is the event of moving to state $j$. Then we also have:
$$\sum_{j \ne i} q_{ij} = v_i \sum_{j \ne i} P_{ij} = v_i = -q_{ii}$$
Therefore,
$$P_{ij} = \frac{q_{ij}}{v_i} = \frac{q_{ij}}{\sum_{k \ne i} q_{ik}}$$
Chapman-Kolmogorov Equations
For a homogeneous continuous-time Markov chain the Chapman-Kolmogorov equation is:
$$P_{ij}(t+s) = \sum_{k=0}^{\infty} P_{ik}(t)P_{kj}(s) \quad \text{for } s, t \ge 0 \text{ and all states } i, j$$
With the initial conditions:
$$P_{ij}(0) = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases}$$
Kolmogorov's backward equation is:
$$\frac{d}{dt}P_{ij}(t) = \sum_{k \ne i} q_{ik}P_{kj}(t) - v_i P_{ij}(t)$$
Kolmogorov's forward equation is:
$$\frac{d}{dt}P_{ij}(t) = \sum_{k \ne j} P_{ik}(t)q_{kj} - v_j P_{ij}(t)$$
Define the matrices $\mathbf{P}(t) = [P_{ij}(t)]$ and $\mathbf{Q} = [q_{ij}]$. In matrix form we have:
$$\begin{aligned}
\mathbf{P}(t+s) &= \mathbf{P}(t)\mathbf{P}(s) && \text{(Chapman-Kolmogorov equations)} \\
\mathbf{P}'(t) &= \mathbf{Q}\mathbf{P}(t) && \text{(Kolmogorov's backward equations)} \\
\mathbf{P}'(t) &= \mathbf{P}(t)\mathbf{Q} && \text{(Kolmogorov's forward equations)} \\
\mathbf{P}(0) &= \mathbf{I} && \text{(initial condition)}
\end{aligned}$$
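These matrix equations have the solution $\mathbf{P}(t) = e^{\mathbf{Q}t}$. A minimal sketch for a two-state chain (the rates are illustrative), computing the matrix exponential from its power series and checking against the known two-state closed form:

```python
import numpy as np

lam, mu = 2.0, 3.0   # illustrative rates: 0 -> 1 at rate lam, 1 -> 0 at rate mu
Q = np.array([[-lam, lam],
              [mu, -mu]])

def transition_matrix(Q, t, terms=60):
    """P(t) = exp(Qt), summed from the power series sum_k (Qt)^k / k!."""
    P = np.eye(Q.shape[0])
    term = np.eye(Q.shape[0])
    for k in range(1, terms):
        term = term @ Q * (t / k)   # (Qt)^k / k! built up incrementally
        P = P + term
    return P

t = 0.7
P = transition_matrix(Q, t)
# closed form for two states: P_00(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t)
p00 = mu / (lam + mu) + lam / (lam + mu) * np.exp(-(lam + mu) * t)
```

Each row of $\mathbf{P}(t)$ should sum to one, since it is a probability distribution over states.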
Limiting Probabilities
The limiting probability $P_j$ of a continuous-time Markov chain is the long-run probability, or the long-run proportion of time, that the process will be in state $j$, independent of the initial state:
$$P_j = \lim_{t \to \infty} P_{ij}(t)$$
To find the limiting probabilities, if they exist, we solve the balance equations together with the normalisation condition:
$$v_j P_j = \sum_{k \ne j} q_{kj}P_k \quad \text{(equivalently } \mathbf{P}\mathbf{Q} = \mathbf{0}\text{)}, \qquad \sum_k P_k = 1$$
The balance equation says that the rate at which the process leaves state $j$ equals the rate at which the process enters state $j$:
- $v_j P_j$ is the rate at which the process leaves state $j$, since $P_j$ is the long-run proportion of time the process is in state $j$ and $v_j$ is the rate of transition out of state $j$ when the process is in state $j$
- $\sum_{k \ne j} q_{kj}P_k$ is the rate at which the process enters state $j$, since $P_k$ is the long-run proportion of time the process is in state $k$ and $q_{kj}$ is the rate of transition from state $k$ to state $j$

For the limiting probabilities $\lim_{t \to \infty} P_{ij}(t)$ to exist, the conditions are:
- All states communicate, so that starting in state $i$ there is a positive probability of being in state $j$
- The Markov chain is positive recurrent, so that starting in any state, the mean time to return to that state is finite
- When the limiting probabilities exist, the Markov chain is ergodic; note that aperiodicity is not required, as periodicity does not apply to continuous-time Markov chains

If the initial state is chosen according to the limiting distribution, then the probability of being in state $j$ is the same for all $t$ (stationarity):
$$\Pr(X(0) = j) = P_j \;\Rightarrow\; \Pr(X(t) = j) = P_j \text{ for all } t$$
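The linear system $\mathbf{P}\mathbf{Q} = \mathbf{0}$, $\sum_k P_k = 1$ can be solved directly; a small sketch with an illustrative three-state generator (one redundant balance equation is replaced by the normalisation row):

```python
import numpy as np

# Illustrative generator: off-diagonal entries are q_ij, each row sums to zero.
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -3.0, 2.0],
              [2.0, 2.0, -4.0]])

# Solve P Q = 0 subject to sum(P) = 1: transpose the system and
# replace one (redundant) balance equation with the normalisation.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(Q.shape[0])
b[-1] = 1.0
P = np.linalg.solve(A, b)   # limiting distribution
```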
Embedded Markov Chain
For a continuous-time Markov chain that is ergodic with limiting probabilities $P_i$, the sequence of states visited (ignoring the time spent in each state) is a discrete-time Markov chain, known as the embedded Markov chain.
The embedded Markov chain has transition probabilities:
$$P_{ij} = \Pr(\text{transition to state } j \mid \text{transition out of state } i) = \frac{\Pr(\text{transition from } i \text{ to } j)}{\Pr(\text{transition out of } i)} = \frac{q_{ij}}{v_i} = \frac{q_{ij}}{\sum_{k \ne i} q_{ik}}$$
Note: in general $P_{ij} \ne P_{ij}(t)$! Do not get confused!
Assuming that the embedded Markov chain is ergodic, its limiting probabilities $\pi_i$, i.e. the long-run proportions of transitions into state $i$, are the unique solutions of the set of equations:
$$\pi_i = \sum_j \pi_j P_{ji} \quad \text{and} \quad \sum_i \pi_i = 1$$
The proportion of time the continuous-time process spends in state $i$, i.e. the limiting probability of the original continuous-time Markov chain, can then be found using:
$$P_i = \frac{\pi_i / v_i}{\sum_j \pi_j / v_j}$$
Here $\pi_i$, the limiting probability of the embedded Markov chain, is the proportion of transitions into state $i$; it is multiplied by $1/v_i$, the mean time spent in state $i$ during a visit.
Time Reversibility
Going backwards: given the process is in state $i$ at time $t$, the probability that the process has been in state $i$ for an amount of time greater than $s$ is $e^{-v_i s}$:
$$\begin{aligned}
\Pr(\text{process in state } i \text{ throughout } [t-s, t] \mid X(t) = i) &= \frac{\Pr(\text{process in state } i \text{ throughout } [t-s, t])}{\Pr(X(t) = i)} \\
&= \frac{\Pr(X(t-s) = i)\Pr(T_i > s)}{\Pr(X(t) = i)} \\
&= e^{-v_i s}
\end{aligned}$$
since for large $t$ the limiting probabilities give $\Pr(X(t-s) = i) \approx \Pr(X(t) = i) \approx P_i$. Therefore, going back in time, the amount of time the process spends in state $i$ is also exponential with rate $v_i$. Thus the reversed continuous-time Markov chain has the same holding-time rates as the forward-time process.
The sequence of states visited by the reversed process, i.e. its embedded chain, is a discrete-time Markov chain with transition probabilities:
$$Q_{ij} = \frac{\pi_j P_{ji}}{\pi_i}$$
Thus, the continuous-time Markov chain will be time reversible, i.e. have the same probability structure as the original process, if the embedded chain is time reversible, i.e.
$$\pi_i P_{ij} = \pi_j P_{ji}$$
Using the proportion of time the continuous-time chain spends in state $i$, $P_i = (\pi_i/v_i)\big/\sum_j (\pi_j/v_j)$, the condition $\pi_i P_{ij} = \pi_j P_{ji}$ becomes $P_i v_i P_{ij} = P_j v_j P_{ji}$. Since $q_{ij} = v_i P_{ij}$, we have an equivalent condition for time reversibility:
$$P_i q_{ij} = P_j q_{ji}$$
i.e. the rate at which the process goes directly from $i$ to $j$ is the same as the rate at which it goes directly from $j$ to $i$.
Birth and Death Process
A birth and death process is a continuous-time Markov chain with states $0, 1, 2, \dots$ for which transitions from state $n$ may only go into state $n+1$ (a birth) or state $n-1$ (a death). Suppose that the number of people in a population/system is $n$:
- New arrivals enter the population/system at a birth/arrival rate $\lambda_n$; the time until the next arrival is exponentially distributed with mean $1/\lambda_n$
- People leave the population/system at a death/departure rate $\mu_n$; the time until the next departure is exponentially distributed with mean $1/\mu_n$

Transition Rates and Embedded Markov Chain
Let $v_i$ be the transition rate out of state $i$ when the process is in state $i$. Initially, with a population of 0, there can only be a birth (the population can't be negative), so $\mu_0 = 0$ and:
$$v_0 = \lambda_0, \qquad v_i = \lambda_i + \mu_i \quad \text{for } i \ge 1$$
For the corresponding embedded Markov chain, let $P_{ij}$ denote the transition probabilities between states. If the Markov chain is in state 0 and a transition occurs, it must transition into state 1, therefore:
$$P_{01} = 1$$
To derive the transition probabilities of the embedded Markov chain, consider a population of $i \ge 1$. It will jump to $i+1$ if a birth occurs before a death, and jump to $i-1$ if a death occurs before a birth:
- The time to a birth $T_b$ is exponential with rate $\lambda_i$
- The time to a death $T_d$ is exponential with rate $\mu_i$

Therefore, the probability that a birth occurs before a death is given by (the minimum of independent exponentials):
$$\Pr(T_b < T_d) = \frac{\lambda_i}{\lambda_i + \mu_i}$$
Then the transition probabilities are:
$$P_{i,i+1} = \frac{\lambda_i}{\lambda_i + \mu_i}, \qquad P_{i,i-1} = \frac{\mu_i}{\lambda_i + \mu_i}, \qquad P_{i,k} = 0 \text{ for } k \ne i \pm 1$$
Also, since the time to the next jump, regardless of whether it is a birth or a death, is the minimum of the two exponential times, it is exponential with rate:
$$v_i = \lambda_i + \mu_i$$
Examples of Birth and Death Processes
Birth and death rates independent of n
A supermarket has one server and customers join a queue. Customers arrive at rate $\lambda$ with exponential inter-arrival times (a Poisson process). The server serves at rate $\mu$, with service times also exponentially distributed. Let $X(t)$ be the number in the queue at time $t$; then:
$$\lambda_n = \lambda \text{ for } n \ge 0, \qquad \mu_n = \begin{cases} \mu & \text{for } n \ge 1 \\ 0 & \text{for } n = 0 \end{cases}$$
The embedded chain has transition probabilities:
$$P_{i,i+1} = \begin{cases} 1 & \text{for } i = 0 \\ \dfrac{\lambda}{\lambda + \mu} & \text{for } i = 1, 2, \dots \end{cases} \qquad P_{i,i-1} = \begin{cases} 0 & \text{for } i = 0 \\ \dfrac{\mu}{\lambda + \mu} & \text{for } i = 1, 2, \dots \end{cases}$$
Birth and death rates dependent on n
If there are now $s$ checkout operators serving the queue, with customers joining a single queue and going to the first available server, then:
$$\lambda_n = \lambda \text{ for } n \ge 0, \qquad \mu_n = \begin{cases} n\mu & \text{for } 1 \le n \le s \\ s\mu & \text{for } n > s \end{cases}$$
The embedded chain has transition probabilities, for $i \ge 1$:
$$P_{i,i+1} = \frac{\lambda}{\lambda + \min(i, s)\mu}, \qquad P_{i,i-1} = \frac{\min(i, s)\mu}{\lambda + \min(i, s)\mu}$$

Poisson Process
A Poisson process is a special case of a birth and death process, with no deaths:
$$\lambda_n = \lambda \text{ for } n \ge 0, \qquad \mu_n = 0 \text{ for } n \ge 0$$
$$P_{i,i+1} = 1, \qquad P_{i,i-1} = 0 \qquad \text{for } i = 0, 1, 2, \dots$$

Population Growth
Each individual in a population gives birth at an exponential rate $\lambda$, plus there is immigration at an exponential rate $\theta$. Each individual has an exponential death rate $\mu$:
$$\lambda_n = n\lambda + \theta \text{ for } n \ge 0, \qquad \mu_n = n\mu \text{ for } n \ge 1$$
$$P_{n,n+1} = \frac{n\lambda + \theta}{n\lambda + \theta + n\mu}, \qquad P_{n,n-1} = \frac{n\mu}{n\lambda + \theta + n\mu} \qquad \text{for } n = 0, 1, 2, \dots$$
Expected Time in States
Let $T_i$ be the time for the process, starting from state $i$, to enter state $i+1$, i.e. the time until the population increases by one. Define the indicator $I_i$ as:
$$I_i = \begin{cases} 1 & \text{if the first transition from } i \text{ is to } i+1 \\ 0 & \text{if the first transition from } i \text{ is to } i-1 \end{cases}$$
Then the expected value of $T_i$ can be calculated by conditioning on the first transition:
$$E[T_i] = \frac{1}{\lambda_i + \mu_i} + \frac{\mu_i}{\lambda_i + \mu_i}\left(E[T_{i-1}] + E[T_i]\right) \quad \Rightarrow \quad E[T_i] = \frac{1}{\lambda_i} + \frac{\mu_i}{\lambda_i}E[T_{i-1}]$$
since the time to the first transition, regardless of whether it is a birth or death, is exponential with rate $\lambda_i + \mu_i$, and the time taken to reach $i+1$ if there is a death is the time from $i-1$ to $i$ plus the time from $i$ to $i+1$. Note:
$$E[T_0] = \frac{1}{\lambda_0}$$
In the case where the rates are homogeneous, i.e. $\lambda_i = \lambda,\, \mu_i = \mu$ for all $i$:
$$E[T_i] = \frac{1}{\lambda}\left(1 + \frac{\mu}{\lambda} + \left(\frac{\mu}{\lambda}\right)^2 + \dots + \left(\frac{\mu}{\lambda}\right)^i\right)$$
Then the expected time to go from state $i$ to a higher state $j$ is:
$$E[T_{i \to j}] = E[T_i] + E[T_{i+1}] + \dots + E[T_{j-1}]$$
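The recursion is easy to evaluate numerically; a minimal sketch with illustrative constant rates, checked against the closed form for the homogeneous case:

```python
lam, mu = 2.0, 1.0   # illustrative homogeneous birth and death rates

def expected_up_times(n, lam, mu):
    """E[T_i] for i = 0..n-1 via E[T_i] = 1/lam + (mu/lam) * E[T_{i-1}]."""
    et = [1.0 / lam]   # E[T_0] = 1/lam
    for _ in range(1, n):
        et.append(1.0 / lam + (mu / lam) * et[-1])
    return et

et = expected_up_times(5, lam, mu)
# closed form for constant rates: E[T_i] = (1/lam) * sum_{k=0}^{i} (mu/lam)^k
closed = [(1 / lam) * sum((mu / lam) ** k for k in range(i + 1)) for i in range(5)]
time_1_to_3 = et[1] + et[2]   # expected time to climb from state 1 to state 3
```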
Kolmogorov Equations
We can use the transition rates of a birth and death process to write down Kolmogorov's equations, and then solve these differential equations to find the transition probabilities. The rates are:
$$q_{0,1} = \lambda_0, \qquad q_{0,0} = -\lambda_0 = -v_0, \qquad q_{0,j} = 0 \text{ for all } j > 1$$
$$q_{i,i+1} = \lambda_i, \qquad q_{i,i-1} = \mu_i, \qquad q_{i,i} = -(\lambda_i + \mu_i) = -v_i, \qquad \text{otherwise zero}$$
Kolmogorov's backward equations:
$$\frac{d}{dt}P_{0,j}(t) = \lambda_0\left[P_{1,j}(t) - P_{0,j}(t)\right]$$
$$\frac{d}{dt}P_{i,j}(t) = \lambda_i P_{i+1,j}(t) + \mu_i P_{i-1,j}(t) - (\lambda_i + \mu_i)P_{i,j}(t)$$
Kolmogorov's forward equations:
$$\frac{d}{dt}P_{i,0}(t) = \mu_1 P_{i,1}(t) - \lambda_0 P_{i,0}(t)$$
$$\frac{d}{dt}P_{i,j}(t) = \lambda_{j-1}P_{i,j-1}(t) + \mu_{j+1}P_{i,j+1}(t) - (\lambda_j + \mu_j)P_{i,j}(t)$$
Limiting Probabilities
The limiting probabilities are determined by the balance equations:
$$v_j P_j = \sum_{k \ne j} q_{kj}P_k \quad \text{for all } j, \qquad \sum_k P_k = 1$$
This states that the rate at which the process leaves state $j$ (LHS) is equal to the rate at which the process enters state $j$ (RHS). For a birth and death process the balance equations are:
$$(\lambda_n + \mu_n)P_n = \lambda_{n-1}P_{n-1} + \mu_{n+1}P_{n+1}, \qquad \lambda_0 P_0 = \mu_1 P_1$$
which break down iteratively into:
$$\lambda_n P_n = \mu_{n+1}P_{n+1} \quad \Rightarrow \quad P_{n+1} = \frac{\lambda_n}{\mu_{n+1}}P_n$$
Then in general:
$$P_n = \frac{\lambda_0 \lambda_1 \cdots \lambda_{n-1}}{\mu_1 \mu_2 \cdots \mu_n}P_0$$
Substituting this into the other balance equation $\sum_{n \ge 0} P_n = 1$:
$$P_0\left(1 + \sum_{n=1}^{\infty}\frac{\lambda_0 \lambda_1 \cdots \lambda_{n-1}}{\mu_1 \mu_2 \cdots \mu_n}\right) = 1 \quad \Rightarrow \quad P_0 = \frac{1}{1 + \displaystyle\sum_{n=1}^{\infty}\frac{\lambda_0 \lambda_1 \cdots \lambda_{n-1}}{\mu_1 \mu_2 \cdots \mu_n}}$$
Substituting back for $P_n$ gives:
$$P_n = \frac{\dfrac{\lambda_0 \lambda_1 \cdots \lambda_{n-1}}{\mu_1 \mu_2 \cdots \mu_n}}{1 + \displaystyle\sum_{m=1}^{\infty}\frac{\lambda_0 \lambda_1 \cdots \lambda_{m-1}}{\mu_1 \mu_2 \cdots \mu_m}}$$
This also gives the condition for the long-run probabilities to exist, i.e.:
$$\sum_{n=1}^{\infty}\frac{\lambda_0 \lambda_1 \cdots \lambda_{n-1}}{\mu_1 \mu_2 \cdots \mu_n} < \infty$$
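The product formula can be evaluated numerically by truncating the state space; a minimal sketch (the M/M/1 rates and truncation level are illustrative), checked against the known geometric answer $P_n = (1-\rho)\rho^n$ with $\rho = \lambda/\mu$:

```python
def bd_limiting(lams, mus, n_max):
    """P_n proportional to (lam_0...lam_{n-1}) / (mu_1...mu_n), truncated at n_max."""
    w = [1.0]
    for n in range(1, n_max + 1):
        w.append(w[-1] * lams[n - 1] / mus[n])
    total = sum(w)
    return [x / total for x in w]

lam, mu = 1.0, 2.0   # illustrative M/M/1 rates, rho = 1/2 < 1
n_max = 200          # truncation level; the tail beyond it is negligible here
P = bd_limiting([lam] * n_max, [0.0] + [mu] * n_max, n_max)

rho = lam / mu
geometric = [(1 - rho) * rho**n for n in range(n_max + 1)]   # known M/M/1 answer
```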
Application: Pure Birth Process
The Poisson process is an example of a pure birth process, i.e.
$$\lambda_n = \lambda \text{ for } n \ge 0, \qquad \mu_n = 0 \text{ for } n \ge 0$$
In this case, rearranging the Kolmogorov backward equations for $i \ne j$:
$$\frac{d}{dt}P_{i,j}(t) = \lambda P_{i+1,j}(t) - \lambda P_{i,j}(t) \quad \Rightarrow \quad \frac{d}{dt}P_{i,j}(t) + \lambda P_{i,j}(t) = \lambda P_{i+1,j}(t)$$
Multiply both sides by $e^{\lambda t}$; by the product rule, this is equivalent to:
$$\frac{d}{dt}\left[e^{\lambda t}P_{i,j}(t)\right] = \lambda e^{\lambda t}P_{i+1,j}(t)$$
Integrate both sides w.r.t. $t$ from 0 to $s$, using $P_{i,j}(0) = 0$ for $i \ne j$:
$$e^{\lambda s}P_{i,j}(s) = \lambda\int_0^s e^{\lambda t}P_{i+1,j}(t)\,dt \quad \Rightarrow \quad P_{i,j}(s) = \int_0^s \lambda e^{-\lambda(s-t)}P_{i+1,j}(t)\,dt$$
- The LHS is the probability of moving from state $i$ to $j$ over the time period 0 to $s$
- The RHS is the probability of the process staying in state $i$ for a time $s - t$, i.e. $e^{-\lambda(s-t)}$, then jumping to state $i+1$ over the time interval $(s-t, s-t+dt)$, i.e. $\lambda\,dt$, then the probability that the process starts in $i+1$ and finishes in $j$ over the remaining interval, i.e. $P_{i+1,j}(t)$, integrated over all possible jump times, which is equivalent to $t$ varying from 0 to $s$
Application: Simple Sickness Model
In a simple sickness model, an individual can be in state 0 (healthy) or in state 1 (sick).
- An individual remains healthy for an exponential time with mean $1/\sigma$ before becoming sick
- An individual remains sick for an exponential time with mean $1/\rho$ before becoming healthy

In this birth and death process, we have:
$$\lambda_0 = \sigma \quad \text{and} \quad \mu_1 = \rho$$
with all other $\lambda_i, \mu_i$ being zero. Using Kolmogorov's backward/forward equations, the transition probabilities are:
$$P_{0,0}(s) = \frac{\rho}{\sigma + \rho} + \frac{\sigma}{\sigma + \rho}e^{-(\sigma+\rho)s} \qquad P_{0,1}(s) = \frac{\sigma}{\sigma + \rho}\left(1 - e^{-(\sigma+\rho)s}\right)$$
$$P_{1,0}(s) = \frac{\rho}{\sigma + \rho}\left(1 - e^{-(\sigma+\rho)s}\right) \qquad P_{1,1}(s) = \frac{\sigma}{\sigma + \rho} + \frac{\rho}{\sigma + \rho}e^{-(\sigma+\rho)s}$$
Occupancy Probabilities and Time
In the simple sickness model, the occupancy probabilities are defined as:
- $\bar{P}_{0,0}(s)$, the probability that a healthy individual stays healthy throughout a period of length $s$
- $\bar{P}_{1,1}(s)$, the probability that a sick individual remains sick throughout the same period

By the Markov property, the time spent in state 0 or state 1 is exponential, with the memoryless property:
$$\bar{P}_{0,0}(s) = \Pr(T_0 > s) = e^{-\sigma s}, \qquad \bar{P}_{1,1}(s) = \Pr(T_1 > s) = e^{-\rho s}$$
The occupancy time $O(t)$ is the total time that the process spends in a state during the interval $(0, t)$. If we define the indicator function:
$$I(s) = \begin{cases} 1 & \text{if } X(s) = 1 \\ 0 & \text{if } X(s) = 0 \end{cases}$$
then the occupation time for being sick is:
$$O(t) = \int_0^t I(s)\,ds$$
The expected occupation time being sick, given that the initial state is healthy, is:
$$\begin{aligned}
E[O(t) \mid X(0) = 0] &= E\left[\int_0^t I(s)\,ds \,\Big|\, X(0) = 0\right] = \int_0^t E[I(s) \mid X(0) = 0]\,ds \\
&= \int_0^t \Pr(X(s) = 1 \mid X(0) = 0)\,ds = \int_0^t P_{0,1}(s)\,ds \\
&= \frac{\sigma}{\sigma + \rho}\int_0^t \left(1 - e^{-(\sigma+\rho)s}\right)ds \\
&= \frac{\sigma}{\sigma + \rho}\left(t - \frac{1 - e^{-(\sigma+\rho)t}}{\sigma + \rho}\right)
\end{aligned}$$
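The closed form can be checked by integrating $P_{0,1}(s)$ numerically; a minimal sketch, writing the healthy-to-sick rate as `sigma` and the recovery rate as `rho` (the values are illustrative):

```python
import math

sigma, rho = 0.4, 1.2   # illustrative sickness and recovery rates
t = 5.0

def p01(s):
    """P_{0,1}(s): probability of being sick at time s, given healthy at time 0."""
    return sigma / (sigma + rho) * (1 - math.exp(-(sigma + rho) * s))

# Trapezoidal-rule integral of P_{0,1}(s) over (0, t).
n = 200_000
h = t / n
integral = h * (0.5 * p01(0) + sum(p01(k * h) for k in range(1, n)) + 0.5 * p01(t))

# Closed form for E[O(t) | X(0) = 0] derived above.
closed = sigma / (sigma + rho) * (t - (1 - math.exp(-(sigma + rho) * t)) / (sigma + rho))
```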
First Holding Time
The first holding time $T_0$ is the first time the process jumps out of the initial state:
$$T_0 = \inf\{t : X_t \ne X_0\}$$
For a homogeneous Markov jump process this is exponentially distributed: given $X_0 = i$, the rate is $\lambda_i$, the rate of jumping out of state $i$, previously denoted $v_i$:
$$\lambda_i = v_i = -q_{ii} = \sum_{j \ne i} q_{ij}$$
Thus, the first holding time has the c.d.f. and p.d.f.:
$$\Pr(T_0 \le t \mid X_0 = i) = 1 - e^{-\lambda_i t}, \qquad f_{T_0}(t \mid X_0 = i) = \lambda_i e^{-\lambda_i t}$$
The probability of the state to which the process jumps is:
$$\Pr(X_{T_0} = j \mid X_0 = i) = \frac{q_{ij}}{\lambda_i} = P_{ij}$$
where $X_{T_0}$ is independent of $T_0$.
Non-homogenous Markov Jump Processes
Let the transition rates be denoted $\sigma_{ij}(s)$, equivalent to the $q_{ij}(s)$ previously:
$$\sigma_{ij}(s) = \frac{\partial}{\partial t}P_{ij}(s, t)\Big|_{t=s}$$
so that the non-homogeneous transition probabilities over a small time period $h$ are:
$$P_{ij}(s, s+h) = \begin{cases} \sigma_{ij}(s)h + o(h) & \text{if } i \ne j \\ 1 + \sigma_{ii}(s)h + o(h) & \text{if } i = j \end{cases} \qquad \text{where } \sigma_{ii}(s) = -\sum_{j \ne i}\sigma_{ij}(s)$$
Hence:
$$\frac{P_{ij}(s, s+h) - P_{ij}(s, s)}{h} = \sigma_{ij}(s) + \frac{o(h)}{h}$$
and letting $h \to 0$ recovers $\sigma_{ij}(s)$ as the derivative of $P_{ij}(s, t)$ w.r.t. $t$ at $t = s$.
The Chapman-Kolmogorov equations are:
$$P_{ij}(s, t) = \sum_k P_{ik}(s, u)P_{kj}(u, t), \qquad s \le u \le t$$
or in matrix form $\mathbf{P}(s, t) = \mathbf{P}(s, u)\mathbf{P}(u, t)$ with $\mathbf{P}(s, s) = \mathbf{I}$.
Kolmogorov's forward equations are derived by differentiating the Chapman-Kolmogorov equation w.r.t. $t$, then setting $u = t$:
$$\frac{\partial}{\partial t}P_{ij}(s, t) = \sum_{k \ne j} P_{ik}(s, t)\sigma_{kj}(t) - v_j(t)P_{ij}(s, t), \qquad \frac{\partial}{\partial t}\mathbf{P}(s, t) = \mathbf{P}(s, t)\mathbf{Q}(t)$$
Kolmogorov's backward equations are derived by differentiating the Chapman-Kolmogorov equation w.r.t. $s$, then setting $u = s$:
$$\frac{\partial}{\partial s}P_{ij}(s, t) = -\sum_{k \ne i}\sigma_{ik}(s)P_{kj}(s, t) + v_i(s)P_{ij}(s, t), \qquad \frac{\partial}{\partial s}\mathbf{P}(s, t) = -\mathbf{Q}(s)\mathbf{P}(s, t)$$
where $\mathbf{Q}(t)$ is the matrix containing the $\sigma_{ij}(t)$ for all $i, j$.
Residual Holding Time
The residual holding time $R_s$ is the time between $s$ and the NEXT jump:
$$\{R_s > w,\, X_s = i\} = \{X_u = i,\; s \le u \le s + w\}$$
i.e. the process remains in the same state $i$ between times $s$ and $s + w$. Note that this is the non-homogeneous case of the first holding time. We can show that:
$$\Pr(R_s > w \mid X_s = i) = \exp\left(-\int_s^{s+w} v_i(u)\,du\right)$$
The density of $R_s \mid X_s = i$ is given by:
$$\frac{\partial}{\partial w}\Pr(R_s \le w \mid X_s = i) = v_i(s+w)\exp\left(-\int_s^{s+w} v_i(u)\,du\right)$$
Define $X_{s+R_s}$ as the state the process jumps to at the next jump. The density for this is:
$$\Pr(X_{s+R_s} = j \mid X_s = i,\, R_s = w) = \frac{\sigma_{ij}(s+w)}{v_i(s+w)}$$
Then we have the transition probability, for $i \ne j$:
$$P_{ij}(s, t) = \int_0^{t-s} \exp\left(-\int_s^{s+w} v_i(u)\,du\right)\sum_{I \ne i}\sigma_{iI}(s+w)\,P_{Ij}(s+w, t)\,dw$$
This is the conditional probability that the process is in state $j$ at time $t$, given that it started in state $i$ at time $s$. The transition can be built up by:
(1) staying in state $i$ from time $s$ for a duration $w$, i.e. $\exp\left(-\int_s^{s+w} v_i(u)\,du\right)$;
(2) jumping out of state $i$, at rate $v_i(s+w)$;
(3) given that a jump has occurred, jumping to an intermediate state $I$, i.e. $\sigma_{iI}(s+w)/v_i(s+w)$ — steps (2) and (3) combine to give $\sigma_{iI}(s+w)$;
(4) then jumping from state $I$ to $j$ over $(s+w, t)$ and being in state $j$ at time $t$, i.e. $P_{Ij}(s+w, t)$;
(5) integrated over all possible $w$ and summed over all possible intermediate states.
This is the integral form of the Kolmogorov backward equation, i.e. we consider the first jump (the $\sigma$ term) first.
Current Holding Time
For non-homogeneous processes the current holding time $C_t$ is the time between the last jump and $t$:
$$\{C_t > w,\, X_t = j\} = \{X_u = j,\; t - w \le u \le t\}$$
It can be shown that:
$$\Pr(C_t > w \mid X_t = j) = \exp\left(-\int_{t-w}^{t} v_j(u)\,du\right)$$
The density of $C_t \mid X_t = j$ is given by:
$$\frac{\partial}{\partial w}\Pr(C_t \le w \mid X_t = j) = v_j(t-w)\exp\left(-\int_{t-w}^{t} v_j(u)\,du\right)$$
Then we have the transition probability, for $i \ne j$:
$$P_{ij}(s, t) = \int_0^{t-s}\sum_{k \ne j} P_{ik}(s, t-w)\,\sigma_{kj}(t-w)\exp\left(-\int_{t-w}^{t} v_j(u)\,du\right)dw$$
This is the integral form of the Kolmogorov forward equation, i.e. we consider the last jump (the $\sigma$ term) last.
Part 2: Time Series
Introduction to Time Series
A time series is a sequence of observations that are recorded at regular time intervals, usually discrete and evenly spaced, e.g. daily or monthly. Let $x_t$ denote the observed data:
$$x_1, x_2, \dots, x_t$$
A time series model for $\{x_t\}$ is a family of distributions to which the joint distribution of $\{X_t\}$ is assumed to belong, where each $X_t$ is a random variable:
$$\dots, X_{t-1}, X_t, X_{t+1}, \dots$$
Therefore, $x_{t-1}, x_t, x_{t+1}$ are realisations of the random variables $X_{t-1}, X_t, X_{t+1}$.
Classical Decomposition Model
The classical decomposition model decomposes the original data $X_t$ into 3 components:
$$X_t = T_t + S_t + N_t, \qquad E[N_t] = 0, \qquad S_{t+d} = S_t, \qquad \sum_{j=1}^{d} S_j = 0$$
Where:
- $T_t$ is a deterministic trend component that is slowly changing and perfectly predictable
- $S_t$ is a deterministic seasonal component with a known period $d$ (e.g. $d = 4$ for quarterly data), also perfectly predictable; the seasonal component sums to zero over a complete cycle
- $N_t$ is a random component with expected value 0, as all systematic information should be captured in the trend and seasonal components; $N_t$ may be correlated over time and hence partially predictable
Moving Average Linear Filters
A moving average linear filter has the form:
$$\hat{T}_t = \sum_{j=-q}^{q} a_j X_{t+j} = \frac{1}{2q+1}\sum_{j=-q}^{q} X_{t+j}$$
i.e. a $(2q+1)$-point average with equal weights $a_j = 1/(2q+1)$.
Eliminating the Seasonality
A moving average linear filter can also be passed through $X_t$ to eliminate the seasonal component:
- If the period $d$ is odd, i.e. $d = 2q + 1$, use the filter:
$$\hat{T}_t = \frac{1}{d}\left(X_{t-q} + X_{t-q+1} + \dots + X_{t+q}\right)$$
- If the period $d$ is even, i.e. $d = 2q$, use the filter:
$$\hat{T}_t = \frac{1}{d}\left(\tfrac{1}{2}X_{t-q} + X_{t-q+1} + \dots + X_{t+q-1} + \tfrac{1}{2}X_{t+q}\right)$$
Estimating the Trend
A moving average linear filter can be applied to estimate the trend. Consider $X_t = T_t + N_t$ where the trend $T_t$ is linear in $t$; applying a $(2q+1)$-point moving average filter to $X_t$ gives:
$$\hat{T}_t = \frac{1}{2q+1}\sum_{j=-q}^{q} X_{t+j} = \frac{1}{2q+1}\sum_{j=-q}^{q}\left(T_{t+j} + N_{t+j}\right) \approx T_t$$
since the linear trend terms average back to $T_t$ and the noise terms average to approximately zero.
T T N t j N t T q q
7/23/2019 ACTL2003 Summary Notes
http://slidepdf.com/reader/full/actl2003-summary-notes 29/62
29
Zhi Ying Feng
For the classic decomposition model:
1 1 1
2 1 2 1 2 1
q q q q
t j t j t j t j t j
j q j q j q j q
T a X T S N q q q
This model works well, i.e. t t T T if:
The trend is approximately linear
The sum of N t is close to zero
The sum of S t is zero
The estimated noise, or the residuals, is:
t t t t N x T S
Differencing
The backshift operator, or lag operator, $B$ is defined by:
$$BX_t = X_{t-1}, \qquad B^j X_t = X_{t-j}$$
The difference operator $\nabla$ is defined by:
$$\nabla X_t = (1 - B)X_t = X_t - X_{t-1}$$
The powers of $\nabla$ are defined by:
$$\nabla^j X_t = (1 - B)^j X_t, \qquad \nabla^0 X_t = X_t$$
The difference operator with lag $d$ is defined by:
$$\nabla_d X_t = (1 - B^d)X_t = X_t - X_{t-d}$$
Eliminating the Trend
The trend can be eliminated by using differencing. E.g. consider a linear trend:
$$X_t = T_t + N_t, \qquad T_t = c_0 + c_1 t$$
Apply differencing to the power of one:
$$\nabla X_t = \nabla T_t + \nabla N_t = (c_0 + c_1 t) - (c_0 + c_1(t-1)) + \nabla N_t = c_1 + \nabla N_t$$
i.e. the trend becomes the constant $c_1$, which is stationary. This constant can be estimated by the sample average of $\nabla x_t$. In general, a trend that is a polynomial of degree $k$ can be reduced to a constant by differencing $k$ times:
$$T_t = c_0 + c_1 t + \dots + c_k t^k \quad \Rightarrow \quad \nabla^k X_t = k!\,c_k + \nabla^k N_t$$
which is a random process with mean $k!\,c_k$.
Differencing to Eliminate Seasonality
Seasonality can also be eliminated by differencing: difference once only, but at a lag $d$ equal to the period of the seasonality:
$$\nabla_d S_t = S_t - S_{t-d} = 0$$
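Both differencing results can be seen on synthetic data; a minimal sketch (the coefficients and period are illustrative):

```python
import numpy as np

# Quadratic trend T_t = c0 + c1 t + c2 t^2 (degree k = 2), no noise for clarity.
c0, c1, c2 = 3.0, -1.0, 0.5
t = np.arange(30, dtype=float)
x = c0 + c1 * t + c2 * t**2

d2 = np.diff(x, n=2)   # apply the difference operator twice: constant 2! * c2 = 1.0

# Seasonal component with period d = 4 (sums to zero over one cycle):
season = np.tile([1.0, -2.0, 0.5, 0.5], 10)
lag4 = season[4:] - season[:-4]   # lag-4 differencing removes it entirely
```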
Stationarity
The theory of stationary random processes is important in modelling time series, as stationarity
allows parameters to be estimated efficiently, since we can treat all samples as from the same
distribution. A non-stationary random process must be transformed into a stationary one before
analysis and modelling, i.e. by removing trend and seasonality, or by applying transformations.
A random process $\{X_t\}$ is said to be:
- Integrated of order 0, i.e. $I(0)$, if $X_t$ itself is a stationary time series process
- Integrated of order 1, i.e. $I(1)$, if $X_t$ is not stationary, but the increments obtained through differencing, $Y_t = \nabla X_t = X_t - X_{t-1}$, form a stationary process
- Integrated of order $n$, i.e. $I(n)$, if $\nabla^{n-1}X_t$ is still not stationary, but the $n$th differenced series $Y_t = \nabla^n X_t$ is a stationary time series process

However, the deseasonalised, detrended residuals, i.e. the random component, may still contain information: they may be correlated over time. Assuming stationarity in the residuals, we can try to fit a probability model to the residuals for forecasting purposes. To do so, we will need to look at some sample statistics of the residuals:
Sample Statistics
Let $\{X_t\}$ be a stochastic process such that $\text{var}(X_t) < \infty$ for all $t$.
The autocovariance function (ACVF) is defined by:
$$\gamma(\tau) = \text{Cov}(X_t, X_{t+\tau}), \qquad \gamma(0) = \text{Cov}(X_t, X_t) = \text{var}(X_t)$$
The autocorrelation function (ACF) is defined by:
$$\rho(\tau) = \frac{\gamma(\tau)}{\gamma(0)} = \text{Corr}(X_t, X_{t+\tau})$$
The covariance function is defined to be:
$$\text{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]E[Y]$$
Some properties of the covariance function are:
$$\text{Cov}(X, a) = 0$$
$$\text{Cov}(aX, bY) = ab\,\text{Cov}(X, Y)$$
$$\text{Cov}(aX + bY, cU + dV) = ac\,\text{Cov}(X, U) + ad\,\text{Cov}(X, V) + bc\,\text{Cov}(Y, U) + bd\,\text{Cov}(Y, V)$$
A process $\{X_t\}$ is said to be weakly stationary if:
$$E[X_t] = \mu \text{ for all } t, \qquad \text{Cov}(X_t, X_{t+\tau}) = \gamma(\tau)$$
i.e. the mean is constant and the covariance of the process depends only on the time difference $\tau$.
Noise
I.I.D Noise
$\{X_t\}$ is i.i.d. noise if the $X_t$ are independently and identically distributed with mean zero, i.e. with no covariance between terms:
$$X_t \sim \text{IID}(0, \sigma^2)$$
Assuming $E[X_t^2] = \sigma^2 < \infty$, as for a weakly stationary series we need bounded first and second moments, then:
$$\gamma_X(h) = \begin{cases} \sigma^2 & \text{if } h = 0 \\ 0 & \text{if } h \ne 0 \end{cases} \qquad \rho_X(h) = \begin{cases} 1 & \text{if } h = 0 \\ 0 & \text{if } h \ne 0 \end{cases}$$

White Noise
$\{X_t\}$ is white noise with zero mean, written $X_t \sim \text{WN}(0, \sigma^2)$, if it has mean zero and:
$$\gamma_X(h) = \begin{cases} \sigma^2 & \text{if } h = 0 \\ 0 & \text{if } h \ne 0 \end{cases}$$
- IID noise is white noise, but white noise is not necessarily IID noise
- White noise is weakly stationary
- Usually, we assume that the error terms have a normal distribution for the purpose of parameter estimation etc.:
$$X_t \sim N(0, \sigma^2)$$
Linear Processes
$\{X_t\}$ is a linear process if it can be represented as:
$$X_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j} = \left(\sum_{j=-\infty}^{\infty} \psi_j B^j\right)Z_t = \psi(B)Z_t$$
where $\{\psi_j\}$ is a sequence of constants with $\sum_j |\psi_j| < \infty$ and $Z_t \sim \text{WN}(0, \sigma^2)$.
Linear processes are stationary because they are linear combinations of stationary white noise terms. The regularity condition $\sum_j |\psi_j| < \infty$ means the series $\sum_j \psi_j$ is absolutely convergent. This ensures that the infinite sum can be manipulated the same way as a finite sum, i.e. two absolutely convergent series can be added or multiplied together.
In general, if $\{Y_t\}$ is stationary and $\sum_j |\psi_j| < \infty$ holds, then $X_t = \psi(B)Y_t$ is stationary. Essentially, ALL stationary processes can be represented by a linear process.
Time Series Models
Moving Average Process
$\{X_t\}$ is a first-order moving average process, MA(1), if there is a process $Z_t \sim \text{WN}(0, \sigma^2)$ and a constant $\theta$ such that:
$$X_t = Z_t + \theta Z_{t-1}$$
i.e. the current value is the current noise plus a proportion of the previous period's noise.
The mean is:
$$E[X_t] = E[Z_t] + \theta E[Z_{t-1}] = 0$$
The autocovariance function (ACVF) is:
$$\gamma(h) = \text{Cov}(X_t, X_{t+h}) = \begin{cases} \text{var}(Z_t + \theta Z_{t-1}) = (1 + \theta^2)\sigma^2 & \text{if } h = 0 \\ \text{Cov}(Z_t + \theta Z_{t-1},\, Z_{t+1} + \theta Z_t) = \theta\sigma^2 & \text{if } |h| = 1 \\ 0 & \text{if } |h| > 1 \end{cases}$$
The autocorrelation function (ACF) is:
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \begin{cases} 1 & \text{if } h = 0 \\ \dfrac{\theta}{1 + \theta^2} & \text{if } |h| = 1 \\ 0 & \text{if } |h| > 1 \end{cases}$$
The conditional mean is stochastic and depends on the history:
$$E[X_{t+1} \mid X_t, X_{t-1}, \dots] = E[Z_{t+1} + \theta Z_t \mid X_t, X_{t-1}, \dots] = \theta Z_t$$
The conditional variance is constant, i.e. independent of $X_t$:
$$\text{var}(X_{t+1} \mid X_t, X_{t-1}, \dots) = \text{var}(Z_{t+1}) = \sigma^2$$
Therefore, the conditional and unconditional means and variances are different.
$\{X_t\}$ is a moving average process of order $q$, MA($q$), if the process depends on its previous $q$ noise realisations:
$$X_t = Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}$$
Note:
- Moving average processes are stationary, as $X_t$ is a linear combination of stationary white noise terms
- The ACF of an MA($q$) process has non-zero values up until lag $q$, and near-zero sample values for all lags greater than $q$
- The conditional mean/variance is used for forecasts, whereas the unconditional mean/variance gives the long-run results
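The cut-off in the ACF at lag $q$ shows up clearly in simulation; a minimal MA(1) sketch (parameters and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

theta, sigma, n = 0.5, 1.0, 200_000   # illustrative MA(1) parameters
z = rng.normal(0.0, sigma, size=n + 1)
x = z[1:] + theta * z[:-1]            # MA(1): X_t = Z_t + theta * Z_{t-1}

def sample_acf(x, h):
    """Sample autocorrelation at lag h."""
    xc = x - x.mean()
    return float(np.dot(xc[:-h], xc[h:]) / np.dot(xc, xc))

rho1 = sample_acf(x, 1)   # theory: theta / (1 + theta^2) = 0.4
rho2 = sample_acf(x, 2)   # theory: 0 for any lag beyond q = 1
```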
Autoregressive Process
t X is a first-order autoregressive process AR(1) if t X is stationary and there is a process
2~ WN 0,t Z , uncorrelated with all X s for s t and a constant such that:
1t t t X X Z
i.e. the next term depends on the previous value of the process and slowly converges to its mean. Note that an AR(1) can also be written as:
1
1 1
1
1
0
...h h
t t h t t t h
hh j
t h t j
j
X X Z Z Z
X Z
The mean is:
E[X_t] = Σ_{j=0}^∞ ϕ^j E[Z_{t-j}] = 0
The autocovariance (ACVF) is:
γ(h) = Cov(X_t, X_{t+h}) = σ² ϕ^{|h|} / (1 − ϕ²)
in particular γ(0) = σ²/(1 − ϕ²).
The autocorrelation (ACF) is:
ρ(h) = γ(h)/γ(0) = ϕ^{|h|}, with ρ(0) = 1
The conditional mean is stochastic and depends on X_t:
E[X_{t+1} | X_t, X_{t-1}, ...] = ϕX_t + E[Z_{t+1} | X_t, ...] = ϕX_t
The conditional variance is constant, i.e. independent of X_t:
Var(X_{t+1} | X_t, X_{t-1}, ...) = Var(Z_{t+1}) = σ²
Therefore, the conditional and unconditional mean and variance are different.
X_t is an autoregressive process of order p, AR(p), if:
X_t = ϕ_1X_{t-1} + ϕ_2X_{t-2} + ... + ϕ_pX_{t-p} + Z_t
Note: For an AR(1) process to be stationary we need |ϕ| < 1, so that it can be expressed as a linear process. When ϕ = 1, the process is known as a random walk.
For a higher order AR(p) process, there are also conditions on ϕ_1, ϕ_2, ..., ϕ_p for stationarity.
An AR(1) process is equivalent to an MA(∞) process, and an invertible MA(1) process is equivalent to an AR(∞) process; in the AR(1) case:
X_{t+h} = ϕ^h X_t + Σ_{j=0}^{h-1} ϕ^j Z_{t+h-j} → Σ_{j=0}^∞ ϕ^j Z_{t+h-j} as h → ∞
An AR(p) process has, in absolute value, a generally decaying non-zero ACF at all lags.
The smaller |ϕ|, the faster the ACF decays. If ϕ is negative, the ACF has alternating signs.
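A small simulation sketch of the geometric ACF decay ρ(h) = ϕ^h (assuming Python with numpy; ϕ = 0.7 is an illustrative choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, sigma, n = 0.7, 1.0, 100_000

# AR(1): X_t = phi * X_{t-1} + Z_t, started at 0 (effect of the start dies out)
x = np.zeros(n)
z = rng.normal(0.0, sigma, n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + z[t]

def sample_acf(x, h):
    xc = x - x.mean()
    return (xc[h:] * xc[:len(x) - h]).sum() / (xc ** 2).sum()

# theoretical ACF is rho(h) = phi^|h|; compare at the first few lags
errs = [abs(sample_acf(x, h) - phi ** h) for h in (1, 2, 3)]
```

The sample ACF tracks ϕ^h to within sampling error, illustrating the exponential decay used later for model identification.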
Autoregressive Moving Average Models
X_t is a first-order autoregressive moving average model ARMA(1,1) if X_t is stationary and:
X_t = ϕX_{t-1} + Z_t + θZ_{t-1}
where ϕ and θ are constants and Z_t ~ WN(0, σ²). Alternatively, in operator form:
ϕ(B)X_t = θ(B)Z_t
where ϕ(B) = 1 − ϕB and θ(B) = 1 + θB.
Note that for X_t to be stationary, the condition is the same as for the AR(1) process, i.e. |ϕ| < 1.
X_t is an ARMA process of order (p,q), i.e. ARMA(p,q), if:
X_t = ϕ_1X_{t-1} + ... + ϕ_pX_{t-p} + Z_t + θ_1Z_{t-1} + ... + θ_qZ_{t-q}
i.e. ϕ(B)X_t = θ(B)Z_t
where:
ϕ(z) = 1 − ϕ_1z − ϕ_2z² − ... − ϕ_pz^p
θ(z) = 1 + θ_1z + θ_2z² + ... + θ_qz^q
and the polynomials ϕ(z) and θ(z) have no common roots.
The ARMA(p,q) process X_t is stationary if the equation ϕ(z) = 0 has no roots on the unit circle (note that the roots can be complex), i.e.:
ϕ(z) ≠ 0 for |z| = 1
X_t is an ARMA(p,q) process with drift μ if it is of the form:
X_t = μ + ϕ_1X_{t-1} + ... + ϕ_pX_{t-p} + Z_t + θ_1Z_{t-1} + ... + θ_qZ_{t-q}
If we estimate and then remove the mean, X becomes a standard ARMA(p,q) process. For a differenced series, the drift represents the expected change after differencing.
Causality
An ARMA(p,q) process X_t is causal if there exist constants ψ_j with Σ_j |ψ_j| < ∞ such that:
X_t = Σ_{j=0}^∞ ψ_j Z_{t-j}
i.e. X_t is expressible in terms of current and past noise terms. Causal processes are a subset of stationary processes, i.e. to be causal a process must first be stationary. Causality is important in practice, since if X_t is not causal then it depends on future noise terms, which does not make sense.
Theorem:
The X_t satisfying ϕ(B)X_t = θ(B)Z_t is causal if and only if all the roots of the equation ϕ(z) = 0 lie outside the unit circle.
Invertibility
An ARMA(p,q) process X_t is invertible if there exist constants π_j with Σ_j |π_j| < ∞ such that:
Z_t = Σ_{j=0}^∞ π_j X_{t-j}
i.e. Z_t is expressible in terms of current and past X_t. If X_t is not invertible then the noise depends on future values of the process, which again does not make sense.
Theorem:
The X_t satisfying ϕ(B)X_t = θ(B)Z_t is invertible if and only if all the roots of the equation θ(z) = 0 lie outside the unit circle.
Note:
All MA processes are causal, but AR processes might not be
All AR processes are invertible, but MA processes might not be
An equivalent condition for an AR(1) process to be causal is |ϕ| < 1
An equivalent condition for an MA(1) process to be invertible is |θ| < 1
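The root conditions in the two theorems are easy to check numerically. A minimal sketch (assuming Python with numpy; the example polynomials are illustrative, not from the notes):

```python
import numpy as np

def outside_unit_circle(coeffs_ascending):
    """True if all roots of the polynomial (coefficients in ascending
    powers of z) lie strictly outside the unit circle."""
    roots = np.roots(coeffs_ascending[::-1])  # np.roots expects descending order
    return bool(np.all(np.abs(roots) > 1.0))

# AR part phi(z) = 1 - 0.5z has root z = 2, outside the circle: causal
causal = outside_unit_circle([1.0, -0.5])

# MA part theta(z) = 1 + 2z has root z = -0.5, inside the circle: not invertible
invertible = outside_unit_circle([1.0, 2.0])
```

The same helper works for higher-order ϕ(z) or θ(z), including complex roots, since numpy returns them with their moduli intact.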
Calculation of ACF
Linear Filter Method
Consider the causal ARMA(p,q) process:
X_t = ϕ_1X_{t-1} + ... + ϕ_pX_{t-p} + Z_t + θ_1Z_{t-1} + ... + θ_qZ_{t-q}
This can be written as:
X_t = [θ(B)/ϕ(B)] Z_t = ψ(B)Z_t = Σ_{j=0}^∞ ψ_j Z_{t-j}
Note that the summation runs only from 0 to infinity, as the process is causal.
The first step is to determine the ψ_j by equating coefficients in ϕ(B)ψ(B)Z_t = θ(B)Z_t:
(1 − ϕ_1B − ... − ϕ_pB^p)(ψ_0 + ψ_1B + ψ_2B² + ...) = 1 + θ_1B + ... + θ_qB^q
Then calculate the ACVF by replacing the X_t by their linear filter form. For h ≥ 0:
γ(h) = Cov(X_{t+h}, X_t)
= Cov(Σ_{j=0}^∞ ψ_j Z_{t+h-j}, Σ_{j=0}^∞ ψ_j Z_{t-j})
= Cov(Σ_{j=0}^∞ ψ_{j+h} Z_{t-j}, Σ_{j=0}^∞ ψ_j Z_{t-j})   (push the index back so both sums run over Z_{t-j})
= σ² Σ_{j=0}^∞ ψ_{j+h} ψ_j   (since Cov(Z_t, Z_t) = σ² and Cov(Z_t, Z_s) = 0 for s ≠ t)
This method is convenient for MA processes, since they are already expressed as linear processes.
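The coefficient-matching step and the sum σ² Σ ψ_{j+h}ψ_j can be sketched numerically for an ARMA(1,1), where ψ_0 = 1, ψ_1 = ϕ + θ and ψ_j = ϕψ_{j-1} for j ≥ 2 (assuming Python with numpy; the parameter values are illustrative):

```python
import numpy as np

phi, theta, sigma2 = 0.5, 0.4, 1.0

# psi weights from equating coefficients in
# (1 - phi B)(psi_0 + psi_1 B + ...) = 1 + theta B:
# psi_0 = 1, psi_1 = phi + theta, psi_j = phi * psi_{j-1} for j >= 2
J = 200  # truncation; psi_j decays geometrically so the tail is negligible
psi = np.empty(J)
psi[0] = 1.0
psi[1] = phi + theta
for j in range(2, J):
    psi[j] = phi * psi[j - 1]

def acvf(h):
    # gamma(h) = sigma^2 * sum_j psi_{j+h} psi_j
    return sigma2 * (psi[h:] * psi[:J - h]).sum()

gamma0 = acvf(0)
# closed form for the ARMA(1,1) variance:
# gamma(0) = sigma^2 * (1 + (phi+theta)^2 / (1-phi^2))
gamma0_closed = sigma2 * (1 + (phi + theta) ** 2 / (1 - phi ** 2))
```

The truncated sum agrees with the closed form to machine precision, since the discarded tail is of order ϕ^{2J}.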
Yule-Walker Equations
Consider the causal ARMA(p,q) process:
X_t = ϕ_1X_{t-1} + ... + ϕ_pX_{t-p} + Z_t + θ_1Z_{t-1} + ... + θ_qZ_{t-q}
First, multiply both sides by X_{t-h} and then take the covariance:
Cov(X_{t-h}, X_t) = ϕ_1Cov(X_{t-h}, X_{t-1}) + ... + ϕ_pCov(X_{t-h}, X_{t-p}) + C_h
which is equivalent to:
γ(h) = ϕ_1γ(h − 1) + ... + ϕ_pγ(h − p) + C_h
where C_h collects the moving average components on the RHS:
C_h = Cov(X_{t-h}, Z_t) + θ_1Cov(X_{t-h}, Z_{t-1}) + ... + θ_qCov(X_{t-h}, Z_{t-q})
For h = 0, 1, 2, ..., p there are a total of p + 1 equations, which can be expressed in matrix form in the unknowns γ(0), γ(1), ..., γ(p), using γ(−h) = γ(h). Thus, given the C_j and ϕ_j, the ACF can be computed by solving this linear system, i.e. by taking the matrix inverse.
For h > p, we can find the ACVF recursively through:
γ(h) = ϕ_1γ(h − 1) + ... + ϕ_pγ(h − p) + C_h
To work out C_0, C_1, ..., C_p, if the ψ_j are available, we can use:
X_{t-h} = Σ_{j=0}^∞ ψ_j Z_{t-h-j} = Σ_{j=h}^∞ ψ_{j-h} Z_{t-j}
This gives C_h as:
C_h = Cov(Σ_{j=h}^∞ ψ_{j-h}Z_{t-j}, Z_t + θ_1Z_{t-1} + ... + θ_qZ_{t-q}) = σ² Σ_{j=h}^q θ_j ψ_{j-h}
where θ_0 = 1 and C_h = 0 for h > q. However, if the ψ_j are available, we may as well use method 1!
Therefore, this method is more convenient for AR processes, where C_0 = σ² and C_h = 0 for h ≥ 1.
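For an AR(2), the Yule-Walker system above (with C_0 = σ², C_h = 0 for h ≥ 1) can be solved directly as a 3×3 linear system, a minimal sketch assuming Python with numpy and illustrative parameter values:

```python
import numpy as np

phi1, phi2, sigma2 = 0.5, 0.3, 1.0

# Yule-Walker for AR(2): gamma(h) = phi1*gamma(h-1) + phi2*gamma(h-2) + C_h,
# with C_0 = sigma^2 and C_h = 0 for h >= 1, using gamma(-h) = gamma(h).
A = np.array([
    [1.0,   -phi1,       -phi2],  # h = 0
    [-phi1,  1.0 - phi2,  0.0],   # h = 1
    [-phi2, -phi1,        1.0],   # h = 2
])
g = np.linalg.solve(A, np.array([sigma2, 0.0, 0.0]))  # (gamma0, gamma1, gamma2)
rho1, rho2 = g[1] / g[0], g[2] / g[0]

# known closed forms for the AR(2) ACF:
# rho(1) = phi1 / (1 - phi2),  rho(2) = phi1 * rho(1) + phi2
```

Solving the matrix system reproduces the closed-form ACF values exactly, which is the "taking the inverse" step in the notes.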
Partial Autocorrelation Function
The autocorrelation function is used to determine the number of lags of an MA(q) process, while the partial autocorrelation function is needed to determine the number of lags of an AR(p) process. In finding the ACF, we set up the Yule-Walker equations and converted them into matrix form to solve for γ(h).
By omitting the first equation (h = 0), we can rewrite the system to solve for the parameters ϕ_j:
[ γ(0) γ(1) ... γ(p−1); γ(1) γ(0) ... γ(p−2); ... ; γ(p−1) γ(p−2) ... γ(0) ] (ϕ_1, ϕ_2, ..., ϕ_p)^T = (γ(1) − C_1, γ(2) − C_2, ..., γ(p) − C_p)^T
or equivalently:
Γ_p ϕ_p = γ_p − C_p
where Γ_p is the covariance matrix and the C_h are:
C_h = Cov(X_{t-h}, Z_t) + θ_1Cov(X_{t-h}, Z_{t-1}) + ... + θ_qCov(X_{t-h}, Z_{t-q})
Therefore, the parameters are given by:
ϕ_p = Γ_p^{-1}(γ_p − C_p)
Note that for an AR(p) process the C_h are all zero for h ≥ 1, since the noise terms are uncorrelated with the past:
ϕ_p = Γ_p^{-1} γ_p
Define the vector ϕ_h = (ϕ_{h1}, ϕ_{h2}, ..., ϕ_{hh})^T = Γ_h^{-1} γ_h. Then for an AR(p) process:
- If h = p, then ϕ_p = (ϕ_1, ..., ϕ_p)^T, which implies ϕ_pp = ϕ_p
- If h > p, then ϕ_h = (ϕ_1, ..., ϕ_p, 0, ..., 0)^T, which implies ϕ_hh = 0
  (because an AR(p) process is a special case of an AR(h) process with ϕ_t = 0 for t > p)
- If h < p, then we do not know precisely what ϕ_hh is
Therefore, for an AR(p) process X_t = ϕ_1X_{t-1} + ... + ϕ_pX_{t-p} + Z_t we have:
ϕ_hh = ? if h < p;  ϕ_p if h = p;  0 if h > p
The h-th partial autocorrelation function is defined by:
α(h) = 1 if h = 0;  ϕ_hh if h > 0
where ϕ_hh is the last element in the vector Γ_h^{-1} γ_h.
E.g., for an AR(2) process, the PACF at lag 2 is given by:
α(2) = ϕ_22, where (ϕ_21, ϕ_22)^T = [ γ(0) γ(1); γ(1) γ(0) ]^{-1} (γ(1), γ(2))^T
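The cut-off property ϕ_hh = 0 for h > p can be verified numerically. A sketch (assuming Python with numpy): for an AR(1) with ϕ = 0.6, the lag-2 PACF computed as the last element of Γ_2^{-1}γ_2 should be exactly zero:

```python
import numpy as np

phi = 0.6
gamma = lambda h: phi ** abs(h)  # AR(1) ACVF up to the factor gamma(0)

# PACF at lag 2: last element of Gamma_2^{-1} gamma_2
Gamma2 = np.array([[gamma(0), gamma(1)],
                   [gamma(1), gamma(0)]])
gamma2 = np.array([gamma(1), gamma(2)])
phi_21, phi_22 = np.linalg.solve(Gamma2, gamma2)

# for an AR(1): alpha(1) = phi and alpha(h) = 0 for h > 1,
# so phi_21 recovers phi and phi_22 vanishes
```

Note that the common factor γ(0) cancels in the system, so working with the ACF instead of the ACVF gives the same ϕ_hh.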
Model Building
For a stationary time series, model building can be classified into 3 stages:
Model Selection
To determine the appropriate model we need to look at the sample ACF and PACF:
The sample autocovariance function is:
γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)
The sample autocorrelation function (SACF) is:
ρ̂(h) = γ̂(h) / γ̂(0)
Note that since SPACF will be calculated from SACF, both measures will have estimation error.
Once we have graphed the sample ACF and sample PACF, we first check that the sample ACF
should quickly converge to zero, which shows that the time series is stationary.
If the sample ACF decreases slowly but steadily from a value near 1, then the data need to
be differenced before fitting the model
If the sample ACF exhibits a periodic oscillation, then there may be some seasonality still.
Then we compare the sample ACF and PACF to the theoretical ACF and PACF of different
processes to see if there is a match.
For an AR(p) process:
- Sample ACF shows exponential decay towards near-zero values
- Sample PACF shows significant values up to lag p, then near-zero values thereafter
For an MA(q) process:
- Sample ACF shows significant values up to lag q, then near-zero values thereafter
- Sample PACF shows exponential decay towards near-zero values
If neither of these situations occurs, then consider an ARMA(p,q) process. The sample ACF and PACF of an ARMA(p,q) process are very flexible, but in general they combine the behaviour of the AR(p) and MA(q) components. An ARMA(p,q) model should display:
- ACF that decays towards zero after lag q, either directly or with oscillation
- PACF that decays towards zero after lag p, either directly or with oscillation
Model Selection Criteria
When deciding the number of parameters, it is a trade-off between goodness of fit and the
simplicity of the model. More parameters mean more flexibility and a better sample fit. However,
more parameters also mean that each parameter is estimated with more uncertainty, i.e. higher
standard error.
A systematic method to choose p and q, i.e. the parameters, is to minimise an information criterion.
Information criteria have the form:
IC(p, q) = −2 log L(β̂) + P(n, p, q)
where the first term always decreases as parameters are added (the maximised likelihood can only improve), while the second "penalty" term, a function of the number of observations and the number of parameters, always increases with the number of parameters. Thus the IC seeks to balance out bias and variance.
Three common ICs are:
Information Criterion | Penalty Term P(n, p, q)
AIC  | 2(p + q + 1)
BIC  | (p + q + 1) log n
AICc | 2(p + q + 1)n / (n − p − q − 2)
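A minimal sketch of comparing candidate orders by information criterion (assuming Python; the log-likelihood values and n are hypothetical, purely for illustration):

```python
import math

def aic(loglik, p, q):
    return -2 * loglik + 2 * (p + q + 1)

def bic(loglik, p, q, n):
    return -2 * loglik + (p + q + 1) * math.log(n)

def aicc(loglik, p, q, n):
    return -2 * loglik + 2 * (p + q + 1) * n / (n - p - q - 2)

# hypothetical fitted models on n = 200 observations:
# ARMA(1,1) with log-likelihood -310.0 vs ARMA(2,2) with -308.5
n = 200
cands = {(1, 1): -310.0, (2, 2): -308.5}
best_bic = min(cands, key=lambda pq: bic(cands[pq], *pq, n))
```

Here the larger model improves the likelihood slightly, but BIC's heavier penalty still prefers the ARMA(1,1), illustrating the fit-versus-simplicity trade-off.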
Parameter Estimation
Mean Estimation
Let X_t be a weakly stationary process with mean μ. An estimator of μ is the sample mean:
X̄ = (1/n) Σ_{t=1}^n X_t
This estimator is unbiased:
E[X̄] = (1/n) Σ_{t=1}^n E[X_t] = μ
This estimator is also consistent:
Var(X̄) = (1/n²) Σ_{t=1}^n Σ_{s=1}^n Cov(X_t, X_s)
= (1/n²) Σ_{|h|<n} (n − |h|) γ(h)
= (1/n) Σ_{|h|<n} (1 − |h|/n) γ(h)
Thus, the variance → 0 as n → ∞, i.e. the estimator is consistent, provided Σ_h |γ(h)| < ∞, which is true since for a stationary time series the ACF should eventually converge to zero regardless of whether it is AR or MA. For a more detailed proof, see ACTL2003 Proofs.
Parameter Estimation
If a sufficient number of sample ACF values is known, then one way to estimate the parameters of a model, i.e. θ, ϕ, σ², is by equating them to the theoretical ACF derived from the Yule-Walker equations. Use this to set up equations in terms of the parameters and solve. If there is more than one solution, choose the one that makes the model causal and/or invertible.
Another method is to use maximum likelihood estimation. Suppose we have a set of errors that are assumed to be normal; then the X_t themselves are also normal, so:
X_n = (X_1, ..., X_n)^T ~ N(0, Γ_n(β))
where:
- β is the vector of parameters in the model, e.g. β = (θ, σ²)^T for an MA(1) model
- Γ_n is the symmetric covariance matrix of X_n, expressed in terms of the parameters:
Γ_n = [ γ(0) γ(1) γ(2) ... γ(n−1); γ(1) γ(0) γ(1) ... γ(n−2); γ(2) γ(1) γ(0) ... γ(n−3); ... ; γ(n−1) γ(n−2) γ(n−3) ... γ(0) ]
Assuming that the observations follow a multivariate normal distribution, the likelihood function is:
L(β) = (2π)^{−n/2} det(Γ_n)^{−1/2} exp(−½ X_n^T Γ_n^{−1} X_n)
The maximum likelihood estimator β̂ is the value that maximises the likelihood function L(β).
Under the normality assumption, we have the asymptotic distribution of the MLE:
β̂ ~ N(β, Var(β̂)) approximately
where the variance can be estimated by the inverse of the negative Hessian of the log-likelihood:
Var(β̂) ≈ [ −∂² ln L(β)/∂β ∂β^T ]^{−1}
This result can be used to compute confidence intervals for the parameters and for hypothesis testing about the parameters, e.g. whether to include certain parameters.
Model Diagnosis
The residuals of the proposed model are:
Ẑ_t = X_t − X̂_t
where X̂_t are the fitted values computed using the estimated parameters. If the proposed model is a good approximation to the underlying time series process, then the residuals should be approximately a white noise process. There are several methods to check this:
Plot of Residuals
If the plot of Ẑ_t against t shows any trend or pattern in the fluctuations, then the model is inadequate.
SACF of Residuals
If Ẑ_t is white noise, then it can be shown that the sample ACF of Ẑ_t at lag h ≥ 1 is approximately distributed as:
ρ̂_Z(h) ~ N(0, 1/n)
Therefore, at the 95% level, the sample ACF should lie within the range:
0 ± 1.96/√n
since for a white noise process ρ(h) = 0 for h ≥ 1. If too many values of the SACF lie outside this range, then the model does not fit the process well and more parameters will be needed.
Ljung-Box Test
We can test the null hypothesis that Ẑ_t is white noise using the Ljung-Box test statistic. This tests whether, jointly, all correlations at lags greater than zero are zero. Under the null hypothesis, for large n, the Ljung-Box test statistic is:
Q = n(n + 2) Σ_{j=1}^h ρ̂_Z²(j)/(n − j) ~ χ²_{h−p−q}
where n is the number of time series observations. In practice, h is chosen to be between 15 and 30, and n should be large, i.e. n ≥ 100.
Using the Ljung-Box test, we reject the null hypothesis, i.e. conclude the residuals are not white noise, if:
Q > χ²_{h−p−q, 1−α}
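The Q statistic is straightforward to compute from the residual SACF. A sketch (assuming Python with numpy; the simulated series stand in for actual model residuals):

```python
import numpy as np

def ljung_box_q(resid, h):
    """Q = n(n+2) * sum_{j=1}^h rho_hat(j)^2 / (n - j)."""
    n = len(resid)
    rc = resid - resid.mean()
    denom = (rc ** 2).sum()
    rho = np.array([(rc[j:] * rc[:n - j]).sum() / denom for j in range(1, h + 1)])
    return n * (n + 2) * float(np.sum(rho ** 2 / (n - np.arange(1, h + 1))))

rng = np.random.default_rng(2)
white = rng.normal(size=500)
q_white = ljung_box_q(white, h=20)   # under H0, roughly chi^2 with 20 df

# strongly autocorrelated "residuals" should give a much larger Q
ar = np.empty(500)
ar[0] = white[0]
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + white[t]
q_corr = ljung_box_q(ar, h=20)
```

For the white-noise series Q stays near its χ² mean of h = 20, while the autocorrelated series produces a Q far beyond any reasonable critical value, so the null would be rejected.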
Non-Stationarity
In practice, a non-stationary time series may exhibit a non-stationary mean, variance, or both. Transformations can be used to remove non-stationarity, e.g. taking the logarithm of an exponential trend removes the non-stationary mean and can also smooth out the variance.
Stochastic Trends
Apart from a deterministic trend or seasonality, a stochastic trend also causes non-stationarity. A
stochastic trend is when the noise terms have a permanent effect on the process. Consider the
random walk, rewritten iteratively:
X_t = X_{t-1} + Z_t = X_0 + Σ_{j=1}^t Z_j,  where Z_t ~ WN(0, σ²)
In this case, the effect of any Z_t on X_{t+h} is the SAME for all h ≥ 0, since ϕ = 1. This is not true for stationary processes like AR(1) or ARMA(1,1), where |ϕ| < 1: depending on h, Z_t has a different level of impact, since the coefficient ϕ^j changes, e.g.:
X_t = Σ_{j=0}^∞ ϕ^j Z_{t-j}
Since the noise terms have a lasting impact, the correlation between X_t and X_{t-h} is relatively high, so a distinctive feature of a random walk is a very slowly decaying positive ACF. Note that differencing the random walk once yields a stationary series:
Y_t = (1 − B)X_t = X_t − X_{t-1} = Z_t
ARIMA Model
The process Y_t is an autoregressive integrated moving average model ARIMA(p,d,q) with order of integration d if:
ϕ(B)(1 − B)^d Y_t = θ(B)Z_t
That is, Y_t becomes a stationary ARMA(p,q) process after differencing d times, i.e.:
W_t = (1 − B)^d Y_t ~ ARMA(p, q),  with ϕ(B)W_t = θ(B)Z_t
E.g. consider the process defined by:
X_t = 0.6X_{t-1} + 0.3X_{t-2} + 0.1X_{t-3} + Z_t − 0.25Z_{t-1}
This process can be rewritten as:
(1 − 0.6B − 0.3B² − 0.1B³)X_t = (1 − 0.25B)Z_t
(1 − B)(1 + 0.4B + 0.1B²)X_t = (1 − 0.25B)Z_t
Therefore, this is an ARIMA(2,1,1) process.
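The factorisation used in the example can be checked by multiplying the polynomials back together, a sketch assuming Python with numpy (polynomial coefficients stored in ascending powers of B):

```python
import numpy as np

# AR polynomial of the example: phi(B) = 1 - 0.6B - 0.3B^2 - 0.1B^3
full = np.array([1.0, -0.6, -0.3, -0.1])

# claimed factorisation: (1 - B)(1 + 0.4B + 0.1B^2)
factored = np.convolve([1.0, -1.0], [1.0, 0.4, 0.1])

# after removing the unit root (1 - B), the remaining AR(2) factor
# should have all roots outside the unit circle, i.e. be stationary
remaining = np.array([1.0, 0.4, 0.1])
stationary = bool(np.all(np.abs(np.roots(remaining[::-1])) > 1.0))
```

The convolution reproduces the original coefficients, confirming exactly one unit root, and the remaining quadratic factor is stationary, which is why d = 1 and p = 2.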
SARIMA Model
Suppose that X_t exhibits a stochastic seasonal trend, i.e. X_t depends not only on X_{t-1}, X_{t-2}, ... but also on X_{t-s}, X_{t-2s}, ....
To model this, we can use a SARIMA(p,d,q)×(P,D,Q)_s process, given by:
ϕ(B) Φ(B^s) (1 − B)^d (1 − B^s)^D X_t = θ(B) Θ(B^s) Z_t
where ϕ(B) is the AR(p) polynomial, θ(B) the MA(q) polynomial, (1 − B)^d the I(d) factor, and the seasonal AR(P), MA(Q) and I(D) factors Φ(B^s), Θ(B^s), (1 − B^s)^D are polynomials in B^s.
E.g. consider a SARIMA(1,0,0)×(0,1,1)_12 process given by:
(1 − ϕB)(1 − B^12)X_t = (1 + ΘB^12)Z_t
This can be written as:
(1 − ϕB − B^12 + ϕB^13)X_t = Z_t + ΘZ_{t-12}
X_t = ϕX_{t-1} + X_{t-12} − ϕX_{t-13} + Z_t + ΘZ_{t-12}
Note that X_t depends on X_{t-1}, X_{t-12}, X_{t-13} as well as Z_{t-12}.
SARIMA models can always be rewritten as ARIMA models, with some constraints on the new parameters. However, the unconstrained ARIMA form requires more parameters than the SARIMA model, which leads to a better in-sample fit but a worse out-of-sample fit, i.e. worse predictions. Thus, when forecasting, it is better to use the SARIMA model than to convert it to an ARIMA model.
Dickey-Fuller Test
The Dickey-Fuller test is a unit root test, i.e. it tests whether there is a unit root in the time series. Note that if the polynomial ϕ(z) has a unit root, then the time series is not stationary and requires differencing.
Consider a causal time series process X_t with:
X_t = α + βt + ϕX_{t-1} + Z_t,  Z_t ~ WN(0, σ²)
To test for a stochastic trend, we test the hypotheses:
H_0: ϕ = 1 against H_1: |ϕ| < 1
Note that if ϕ = 1 then there is a unit root, which leads to a stochastic trend, so X_t is not stationary.
We can rewrite the above model as:
X_t − X_{t-1} = α + βt + (ϕ − 1)X_{t-1} + Z_t
∇X_t = α + βt + ϕ*X_{t-1} + Z_t,  where ϕ* = ϕ − 1
Then the alternative and equivalent null hypothesis is:
H_0: ϕ* = 0
This is known as the Dickey-Fuller test, and the test statistic is:
τ = ϕ̂* / se(ϕ̂*)
Once the parameters α, β, ϕ* have been estimated, reject the null hypothesis ϕ* = 0 if τ is LESS than the critical value, which will be negative since ϕ* is negative under the alternative. Note that rejecting the null hypothesis implies that the time series is stationary, and accepting the null hypothesis implies that differencing is required.
The distribution of this test statistic is non-standard, depending on α and β, with asymptotic percentiles:
Probability to the left | 0.01  | 0.05  | 0.10
Standard Normal         | −2.33 | −1.65 | −1.28
DF with α = 0, β = 0    | −2.58 | −1.95 | −1.62
DF with β = 0           | −3.43 | −2.86 | −2.57
DF (unconstrained)      | −3.96 | −3.41 | −3.12
Note that the DF distributions are much more spread out than the standard normal. When choosing whether or not to impose α = 0 and/or β = 0, there is a trade-off:
- If α or β is set to 0 when in fact the true values are nonzero, the test becomes inconsistent and the asymptotic critical values are no longer valid. Decisions based on the test are likely to be wrong, i.e. it might confuse deterministic and stochastic trends.
- However, allowing α or β to be non-zero reduces the power of the test, i.e. it becomes harder to detect a false null hypothesis.
How to treat α and β usually depends on what type of series we have. E.g. if a linear trend exists, then we expect the differenced series to have only a constant, so α ≠ 0, β = 0.
Overdifferencing
Let U_t be an ARMA(p,q) process:
ϕ(B)U_t = θ(B)Z_t
Define V_t = (1 − B)^d U_t, i.e. d differences. Then we have:
ϕ(B)V_t = ϕ(B)(1 − B)^d U_t = (1 − B)^d θ(B)Z_t
Therefore, V_t becomes an ARMA(p, q+d) process, since (1 − B)^d θ(B) is a polynomial of order q + d.
However, this MA polynomial has a unit root, so the process V_t is not invertible. Therefore, we should avoid overdifferencing, as it gives a non-invertible process, even though that process is still stationary.
Cointegrated Time Series
Many time series in finance and economics are non-stationary (random walks), e.g. CPI and GDP, but at the same time do not move too far apart from each other. Cointegration is used to model non-stationary series that move together.
For a bivariate process X_t = (X_{t,1}, X_{t,2})^T, we define:
∇^d X_t = (∇^d X_{t,1}, ∇^d X_{t,2})^T
A bivariate process X_t = (X_{t,1}, X_{t,2})^T is integrated of order d, i.e. I(d), if ∇^d X_t is stationary but ∇^{d−1} X_t is not.
An I(d) bivariate process is cointegrated if there is a cointegrating vector β = (β_1, β_2)^T such that β^T X_t becomes stationary, i.e.:
β_1X_{t,1} + β_2X_{t,2} ~ I(0)
If X_{t,1} and X_{t,2} are cointegrated (with d = 1), then:
- X_{t,1} and X_{t,2} are both I(1)
- e_t = β^T X_t is I(0)
- Cointegration implies that there is a common stochastic trend between X_{t,1} and X_{t,2}
In many financial applications, the cointegrating vector is of the form:
β = (1, −a)^T
That is, X_{t,1} and X_{t,2} are random walks themselves, but the difference X_{t,1} − aX_{t,2} is stationary. The coefficient a can be estimated by the regression:
X_{t,1} = aX_{t,2} + ε_t
Then we expect that in the long run the two processes satisfy:
X_{t,1} ≈ aX_{t,2}
Time Series Forecasting
Time Series and Markov Property
Recall that a process {X_t, t = 0, 1, ...} has the Markov property if all future states of the process depend on its present state alone and not on any of its past states:
Pr(X_t ∈ A | X_{s_1} = x_1, ..., X_{s_n} = x_n, X_s = x) = Pr(X_t ∈ A | X_s = x)
for all times s_1 < s_2 < ... < s_n < s < t, all states x_1, x_2, ..., x_n, x ∈ S, and all subsets A of S.
AR Processes
An AR(1) process has the Markov property, since the conditional distribution of X_{n+1} given all previous values depends only on X_n. However, an AR(2) process does not have the Markov property, since the conditional distribution of X_{n+1} given all previous values depends on both X_n and X_{n-1}. Thus, in general, AR(p) processes do not have the Markov property for p greater than 1.
However, for an AR(2) process, if we define a vector-valued process Y by Y_t = (X_t, X_{t-1})^T, then Y has the Markov property, since the conditional distribution of Y_{n+1} given all previous Y_t depends only on Y_n. In general, for an AR(p) process we can define a vector-valued process with p elements that has the Markov property.
MA Processes
An MA(q) process can never have the Markov property, even in vector form, since the distribution of X_{n+1} depends on the value of Z_n, and in theory no knowledge of the value of X_n or of any finite collection (X_n, ..., X_{n-q})^T will ever be enough to deduce the value of Z_n. In practice, however, we can estimate Z_n very accurately, so Markov simulation techniques still apply.
ARIMA Processes
For an ARIMA(p,d,q) process, if q is zero, i.e. there is no moving average component, then the process behaves like AR(p) in terms of the Markov property: it might not be Markov itself, but an appropriate vector form can be.
If d is also zero and p is greater than 1, then it is essentially an AR(p) process. If both p and d are equal to 1, then the model can be written as:
(1 − ϕB)(1 − B)X_t = Z_t
(1 − (1 + ϕ)B + ϕB²)X_t = Z_t
X_t = (1 + ϕ)X_{t-1} − ϕX_{t-2} + Z_t
This is clearly not Markov. However, it can still be written as a vector-valued process that has the Markov property. In general, the vector process needs p + d components to be Markov.
If q is not equal to zero, i.e. there is a moving average part, then the process will never be Markov, for the same reason that MA(q) is never Markov.
k -step Ahead Predictor
Assume that we have the following information:
(1) All observations of X_t up until time n: x_1, ..., x_n
(2) An ARMA model has been fitted to the data
(3) All parameters of the model (θ, ϕ, σ²) have been estimated
(4) The process Z_t is known up until time n
The k-step ahead forecast X̂_{n+k|n} is one method to forecast/predict X_{n+k} using the given observations up until time n:
X̂_{n+k|n} = E[X_{n+k} | X_1, ..., X_n]
This is obtained by:
- Replacing the random variables X_1, ..., X_n by their observed values x_1, ..., x_n
- Replacing the random variables X_{n+1}, ..., X_{n+k-1} by their forecast values X̂_{n+1|n}, ..., X̂_{n+k-1|n}
- Replacing the random variables Z_{n+1}, ..., Z_{n+k} by their expectations, i.e. 0
- If Z_1, ..., Z_n are unknown, replacing them by the residuals Ẑ_1, ..., Ẑ_n, where Ẑ_i = E[Z_i | X_1, ..., X_n]
To forecast one step ahead:
X̂_{n+1|n} = E[X_{n+1} | X_1, ..., X_n]
= E[ϕ_1X_n + ... + ϕ_pX_{n+1-p} + Z_{n+1} + θ_1Z_n + ... + θ_qZ_{n+1-q} | X_1, ..., X_n]
= ϕ_1x_n + ... + ϕ_px_{n+1-p} + θ_1Ẑ_n + ... + θ_qẐ_{n+1-q}
To forecast two steps ahead:
X̂_{n+2|n} = E[X_{n+2} | X_1, ..., X_n]
= E[ϕ_1X_{n+1} + ϕ_2X_n + ... + ϕ_pX_{n+2-p} + Z_{n+2} + θ_1Z_{n+1} + θ_2Z_n + ... + θ_qZ_{n+2-q} | X_1, ..., X_n]
= ϕ_1X̂_{n+1|n} + ϕ_2x_n + ... + ϕ_px_{n+2-p} + θ_2Ẑ_n + ... + θ_qẐ_{n+2-q}
(the terms Z_{n+2} and Z_{n+1} have conditional expectation 0).
In practice we do not observe Z_t, so if there are MA terms in the model, then there are more values of Z_t than of X_t, and there is no way of determining all of them from the data. Consider the MA(1) process:
X_t = Z_t + 0.5Z_{t-1}  ⟺  Z_t = X_t − 0.5Z_{t-1}
Then by repeated substitution we have:
Z_n = Σ_{j=0}^{n-1} (−0.5)^j X_{n-j} + (−0.5)^n Z_0
To determine Z_n we first need Z_0; one simple way is to assume that Z_0 = 0. If the process is invertible, then this assumption will have a negligible effect on X̂_{n+k|n} if n is large, since (0.5)^n
will be small.
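The recursive recovery of the residuals under the Z_0 = 0 assumption can be sketched numerically (assuming Python with numpy; the simulated MA(1) stands in for real data):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n = 0.5, 2000

z = rng.normal(size=n + 1)
x = z[1:] + theta * z[:-1]          # MA(1): X_t = Z_t + 0.5 Z_{t-1}

# recover residuals recursively, assuming Z_0 = 0:
# Z_t = X_t - 0.5 * Z_{t-1}
z_hat = np.zeros(n)
prev = 0.0
for t in range(n):
    z_hat[t] = x[t] - theta * prev
    prev = z_hat[t]

# one-step-ahead forecast: X_hat_{n+1|n} = theta * Z_hat_n
forecast = theta * z_hat[-1]
err = abs(z_hat[-1] - z[-1])        # effect of Z_0 = 0 decays like (0.5)^n
```

The error from the Z_0 = 0 assumption contaminates the first few residuals but decays geometrically, so the late residuals (and hence the forecast) are essentially exact, which is why invertibility matters here.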
Best Linear Predictor
Under assumptions (1), (2) & (3), the best linear predictor P_nX_{n+h} of X_{n+h} has the form:
P_nX_{n+h} = a_0 + a_1X_n + a_2X_{n-1} + ... + a_nX_1
We need the values of a_0, ..., a_n that minimise the mean squared error:
MSE = E[(X_{n+h} − P_nX_{n+h})²]
The general solution is found from the n + 1 first-order conditions:
∂MSE/∂a_0 = 0:  E[X_{n+h} − P_nX_{n+h}] = 0
∂MSE/∂a_i = 0:  E[(X_{n+h} − P_nX_{n+h})X_{n+1-i}] = 0,  i = 1, ..., n
Note that, due to the very first condition, the expected prediction error is:
E[X_{n+h} − P_nX_{n+h}] = 0
i.e. the prediction is unbiased.
The first equation can be simplified to:
a_0 = μ(1 − a_1 − a_2 − ... − a_n)
while the remaining ones can be simplified using a trick: since the prediction error has mean zero,
E[(X_{n+h} − P_nX_{n+h})X_{n+1-i}] = Cov(X_{n+h} − P_nX_{n+h}, X_{n+1-i}) = 0
which gives, for i = 1, ..., n:
γ(h + i − 1) = a_1γ(i − 1) + a_2γ(i − 2) + ... + a_nγ(i − n)
Applying this trick to every subsequent MSE-minimising condition, we end up with a system of n equations (excluding the first one) that is very similar to the system for the PACF coefficients. This system can be represented in matrix form:
Γ_n a_n = γ_n(h)
where:
Γ_n = [ γ(0) γ(1) ... γ(n−1); γ(1) γ(0) ... γ(n−2); ... ; γ(n−1) γ(n−2) ... γ(0) ]
a_n = (a_1, a_2, ..., a_n)^T
γ_n(h) = (γ(h), γ(h+1), ..., γ(h+n−1))^T
Therefore, the solution is:
a_n = Γ_n^{-1} γ_n(h)
Once this is known, a_0 can be found by rewriting the first equation in matrix form:
a_0 = μ(1 − 1^T a_n)
where 1^T represents a row vector of 1's.
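A sketch of the solution a_n = Γ_n^{-1}γ_n(h) (assuming Python with numpy): for a zero-mean AR(1) with ACVF γ(k) ∝ ϕ^{|k|}, the best linear predictor of X_{n+h} is known to be ϕ^h X_n, so the solved weights should be (ϕ^h, 0, ..., 0):

```python
import numpy as np

phi, mu, n, h = 0.7, 0.0, 5, 2
gamma = lambda k: phi ** abs(k)   # AR(1) ACVF, scaled so gamma(0) = 1

# Gamma_n a_n = gamma_n(h)
Gamma = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
rhs = np.array([gamma(h + i) for i in range(n)])
a = np.linalg.solve(Gamma, rhs)

# intercept: a_0 = mu * (1 - 1^T a)
a0 = mu * (1 - a.sum())
```

Only the coefficient on the most recent observation X_n is non-zero, consistent with the AR(1) being Markov: the older observations add no linear predictive value.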
Part 3: Brownian Motion
Definitions
A stochastic process {X_t, t ≥ 0} is a Brownian motion process if:
(1) X_0 = 0
(2) {X_t, t ≥ 0} has stationary and independent increments
(3) For all t > 0, X_t ~ N(0, σ²t)
Conditions (2) and (3) together are equivalent to:
- X_t ~ N(0, σ²t) for t > 0
- {X_t, t ≥ 0} has stationary increments and Cov(X_s, X_t − X_s) = 0 for 0 < s < t, since two normal random variables with zero covariance are independent
A standard Brownian motion {B_t, t ≥ 0} is one whose volatility is one, i.e. σ = 1. Any Brownian motion {X_t, t ≥ 0} can be standardised by setting:
B_t = X_t / σ
Note that Brownian motion is continuous for all values of t.
A Brownian motion process exhibits some strange behaviour:
A Brownian motion is continuous w.r.t. time t everywhere, but differentiable nowhere
Brownian motion will eventually hit any and every real value no matter how large, or how
negative. Likewise, no matter how large, it will always (with probability 1) come back down
to zero at some future time
Once a Brownian motion hits a certain value, it immediately hits it again infinitely often,
and will continue to return after arbitrarily large times
Brownian motion is fractal, i.e. it looks the same regardless of what scale you examine it
Properties of Brownian Motion
Consider a Brownian motion {X_t, t ≥ 0} with volatility σ:
- For any s with 0 < s < t, X_s and X_t are NOT independent. However, by independence of increments, X_s and X_t − X_s are independent.
This can be shown by finding the covariance between any X_s and X_t:
Cov(X_s, X_t) = Cov(X_s, X_s + (X_t − X_s)) = Cov(X_s, X_s) + Cov(X_s, X_t − X_s) = σ²s
- For any s and t with s < t, X_s and X_t − X_s are both normally distributed, i.e.:
X_s ~ N(0, σ²s)
X_t − X_s ~ N(0, σ²(t − s))
Brownian Motion and Symmetric Random Walk
Define a symmetric random walk using the random variables:
Y_i = +1 w.p. 0.5,  −1 w.p. 0.5
with mean E[Y_i] = 0 and Var(Y_i) = 1.
Now divide the interval [0, t] into equal intervals of length Δt, so that there are n = t/Δt intervals. Suppose that at each step the process can go up or down by size Δx. Letting X(t) be the position at time t:
X(t) = Δx (Y_1 + Y_2 + ... + Y_{t/Δt})
Then, if we let Δx = σ√Δt and let Δt → 0, the limiting process of X(t) is a Brownian motion. This is because:
- {X_t, t ≥ 0} has independent increments, since the Y's are independent (changes in the value of the random walk over non-overlapping time intervals are independent)
- {X_t, t ≥ 0} has stationary increments, since the distribution of the change in position of the random walk over any time interval depends only on the length of the interval
- X(t) has an approximately normal distribution with mean 0 and variance σ²t as Δt converges to 0, by the Central Limit Theorem applied to the i.i.d. steps
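The scaled-random-walk construction can be sketched by simulation (assuming Python with numpy; σ, t and the step size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, t, dt = 2.0, 1.0, 1e-3
n = int(t / dt)
paths = 5_000

# X(t) = dx * (Y_1 + ... + Y_n) with dx = sigma * sqrt(dt), Y_i = +/-1 w.p. 1/2
y = rng.choice([-1.0, 1.0], size=(paths, n))
x_t = sigma * np.sqrt(dt) * y.sum(axis=1)

mean_hat, var_hat = x_t.mean(), x_t.var()  # should approach 0 and sigma^2 * t
```

Across many simulated paths the terminal value has mean near 0 and variance near σ²t = 4, matching the N(0, σ²t) limit.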
Brownian Motion with Drift
A stochastic process {X_t, t ≥ 0} is a Brownian motion process with drift coefficient μ and variance parameter σ² if it satisfies:
- X_0 = 0
- {X_t, t ≥ 0} has stationary and independent increments
- For all t > 0, X_t ~ N(μt, σ²t)
A Brownian motion with drift can be converted to a standard Brownian motion by defining:
B_t = (X_t − μt) / σ
Similarly, a standard Brownian motion can be converted to a Brownian motion with drift by defining:
X_t = μt + σB_t
Geometric Brownian Motion
Let {X_t, t ≥ 0} be a Brownian motion process with drift μ and volatility σ². Then the process {Y_t, t ≥ 0} defined by:
Y_t = exp(X_t)
is called a geometric Brownian motion. This is useful when you do not want negative values, e.g. for asset prices.
Now consider the behaviour of dt·dB_t (equivalently dB_t·dt):
E[Δt(B_{t+Δt} − B_t)] = Δt E[B_{t+Δt} − B_t] = 0
Var(Δt(B_{t+Δt} − B_t)) = (Δt)² Var(B_{t+Δt} − B_t) = (Δt)³ = o(Δt)
Then in differential form:
E[dt·dB_t] = 0,  Var(dt·dB_t) = o(dt)
Taking o(dt) as zero, this is again reduced to a constant:
dt·dB_t = 0
Finally, consider (dt)²:
(dt)² = o(dt) = 0
The following table summarises the results:
×    | dt | dB_t
dt   | 0  | 0
dB_t | 0  | dt
Stochastic Differential Equations
Assume that a stochastic process X_t satisfies a stochastic differential equation (SDE) of the form:
dX_t = μ(X_t, t)dt + σ(X_t, t)dB_t
Consider another stochastic process defined as:
Y_t = F(X_t, t)
Then the stochastic differential equation satisfied by Y_t is given by Itô's formula:
dY_t = ∂F/∂x dX_t + ∂F/∂t dt + ½ ∂²F/∂x² σ(X_t, t)² dt
with all partial derivatives evaluated at x = X_t.
Proof:
The differential of the second-order Taylor series expansion for a function of two variables is:
df(x, t) = f_x dx + f_t dt + ½ f_xx (dx)² + ½ f_tt (dt)² + f_xt dx dt
Applying this to Y_t gives:
dY_t = F_x dX_t + F_t dt + ½ F_xx (dX_t)² + ½ F_tt (dt)² + F_xt dX_t dt
However, we know from the previous section that (dt)² = 0 and dt·dB_t = 0, so:
dt·dX_t = μ(X_t, t)(dt)² + σ(X_t, t) dt·dB_t = 0
And:
(dX_t)² = (μ(X_t, t)dt + σ(X_t, t)dB_t)²
= μ(X_t, t)²(dt)² + 2μ(X_t, t)σ(X_t, t) dt·dB_t + σ(X_t, t)²(dB_t)²
= σ(X_t, t)² dt
Therefore, the expansion reduces to Itô's formula.
E.g. consider a Brownian motion {X_t, t ≥ 0} with zero drift and variance σ²; find the SDE for tX_t².
We have the SDE for X_t:
dX_t = σ dB_t
Let Y_t = tX_t² = F(X_t, t), where F(x, t) = tx², so the derivatives are:
∂F/∂x = 2tx,  ∂F/∂t = x²,  ∂²F/∂x² = 2t
Then, using Itô's formula, the SDE for tX_t² is given by:
d(tX_t²) = 2tX_t dX_t + X_t² dt + ½(2t)σ² dt
= 2σ tX_t dB_t + (X_t² + σ²t) dt
Stochastic Integration
Consider a Brownian motion {X_t, t ≥ 0} with zero drift and variance σ². Let f be a function with a continuous derivative on [a, b]. The stochastic process defined by the stochastic/Itô integral is:
Y = ∫_a^b f(t) dX_t = lim_{max(t_k − t_{k−1}) → 0} Σ_{k=1}^n f(t_{k−1})(X_{t_k} − X_{t_{k−1}})
where a = t_0 < t_1 < ... < t_n = b is a partition of [a, b].
This stochastic process satisfies the properties:
- The mean of the Itô integral is zero, i.e.
E[∫_a^b f(t) dX_t] = lim Σ_{k=1}^n E[f(t_{k−1})(X_{t_k} − X_{t_{k−1}})] = 0
  (Note that "functions of t" in this context may themselves be stochastic, e.g. ∫_0^t B_s dB_s.)
- Thus, the variance and the expectation of the square are equal:
Var(∫_a^b f(t) dX_t) = E[(∫_a^b f(t) dX_t)²]
= lim Σ_{k=1}^n f(t_{k−1})² σ²(t_k − t_{k−1})
= σ² ∫_a^b f(t)² dt
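Both properties can be sketched by Monte Carlo (assuming Python with numpy; f(t) = t on [0, 1] is an illustrative choice, for which σ²∫t²dt = σ²/3):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n_steps, n_paths = 1.0, 400, 10_000
dt = 1.0 / n_steps
t_grid = np.arange(n_steps) * dt              # left endpoints t_{k-1}

# Riemann-Ito sums: sum_k f(t_{k-1}) * (X_{t_k} - X_{t_{k-1}}) with f(t) = t
dX = sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
Y = (t_grid * dX).sum(axis=1)

mean_hat, var_hat = Y.mean(), Y.var()
var_theory = sigma ** 2 / 3                    # sigma^2 * integral of t^2 on [0,1]
```

Evaluating the integrand at the *left* endpoint of each subinterval is what makes this an Itô sum; the sample mean is near 0 and the sample variance near σ²/3, as the two properties predict.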
Part 4: Simulation

Continuous Random Variable Simulation

When simulating a continuous random variable, we work from its probability/cumulative distribution function. Cumulative probabilities always follow a Uniform(0,1) distribution: if $U \sim \text{Uniform}(0,1)$ then $F^{-1}(U)$ is distributed as $F$. Equivalently, subintervals of $(0,1)$ of equal length are equally likely to contain $U$. Therefore, the first step in simulation is usually to generate random variables from a Uniform(0,1).
Pseudo-Random Numbers

The procedure to generate pseudo-random numbers, i.e. approximately Uniform(0,1) values, is:

1. Start with a seed $X_0$ and specify positive integers $a$, $c$ and $m$, which are usually given
2. Generate pseudo-random numbers recursively using:

$$ X_{n+1} = (aX_n + c) \bmod m $$

3. $X_{n+1}/m$ will be an approximation to a Uniform(0,1) random variable
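The recursion above is a linear congruential generator. A minimal sketch in Python, using illustrative constants $a$, $c$, $m$ (the specific values here are assumptions, not part of the notes):

```python
def lcg(seed, a=1103515245, c=12345, m=2**31, n=5):
    """Linear congruential generator: X_{n+1} = (a*X_n + c) mod m, scaled to (0,1)."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)  # X_{n+1}/m approximates Uniform(0,1)
    return out

us = lcg(seed=2003)  # five values in [0,1), reproducible for a fixed seed
```

The same seed always reproduces the same stream, which is why these are called pseudo-random numbers.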
Inverse Transform Method

Consider a continuous random variable with cumulative distribution function $F_X$ and a Uniform(0,1) random variable $U$. Define:

$$ X = F_X^{-1}(U) $$

Then $X$ will have the distribution function:

$$ \Pr(X \le x) = \Pr\left( F_X^{-1}(U) \le x \right) = \Pr\left( U \le F_X(x) \right) \quad \text{since } F_X \text{ is strictly increasing} $$
$$ = F_X(x) \quad \text{using } \Pr(U \le y) = y $$

Then the inverse transform procedure to generate a r.v. from a cumulative distribution $F$ is:

1. Compute $F_X^{-1}$, from the p.d.f. or c.d.f., if possible
2. Generate a Uniform(0,1) random variable $U$
3. Set $X = F_X^{-1}(U)$; then $X$ will be from the distribution $F$

Note that for the inverse transform method to work, we must be able to calculate the inverse of the c.d.f., i.e. $F_X^{-1}$ must have an explicit expression.
E.g. to simulate an exponential random variable, first find $F_X^{-1}$:

$$ F_X(x) = \int_0^x \lambda e^{-\lambda s}\,ds = 1 - e^{-\lambda x} \quad \Rightarrow \quad F_X^{-1}(y) = -\frac{1}{\lambda}\ln(1-y) $$

Suppose we generate a random variable $U$ from Uniform(0,1). Then we set:

$$ x = F_X^{-1}(U) = -\frac{1}{\lambda}\ln(1-U) $$

Therefore $x$ is a random variable with an exponential distribution. (Since $1-U$ is also Uniform(0,1), setting $x = -\frac{1}{\lambda}\ln U$ works equally well.)
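The exponential example above can be sketched directly in Python; the function name and the sanity check on the sample mean are illustrative additions:

```python
import math
import random

def exp_inverse_transform(lam, rng=random.random):
    """Simulate Exp(lam) via the inverse transform X = -ln(1-U)/lam, U ~ Uniform(0,1)."""
    u = rng()
    return -math.log(1.0 - u) / lam

random.seed(0)
sample = [exp_inverse_transform(2.0) for _ in range(100_000)]
mean = sum(sample) / len(sample)  # should be close to 1/lam = 0.5
```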
Discrete Random Variable Simulation

The inverse transform method can also be applied to simulate discrete random variables. Consider one with probability mass function:

$$ \Pr(X = x_j) = P_j, \quad j = 1, 2, \dots, \qquad \sum_j P_j = 1 $$

Then the procedure to simulate a r.v. from this p.m.f. is:

1. Generate a Uniform(0,1) random variable $U$
2. Set:

$$ X = \begin{cases} x_1 & \text{if } U < P_1 \\ x_2 & \text{if } P_1 \le U < P_1 + P_2 \\ \;\vdots & \\ x_j & \text{if } \sum_{i=1}^{j-1} P_i \le U < \sum_{i=1}^{j} P_i \end{cases} $$
E.g. simulate random variables from a geometric distribution.

Consider the probability mass function for a geometric random variable:

$$ P_j = \Pr(X = j) = p(1-p)^{j-1}, \quad j \ge 1 $$

Notice that:

$$ \sum_{i=1}^{j-1} P_i = 1 - \Pr(X > j-1) = 1 - \sum_{i=j}^{\infty} p(1-p)^{i-1} = 1 - p\,\frac{(1-p)^{j-1}}{1-(1-p)} = 1 - (1-p)^{j-1} \quad \text{(geometric sum)} $$

Then we have $X = j$ when:

$$ \sum_{i=1}^{j-1} P_i \le U < \sum_{i=1}^{j} P_i \quad \Longleftrightarrow \quad 1 - (1-p)^{j-1} \le U < 1 - (1-p)^{j} \quad \Longleftrightarrow \quad (1-p)^{j} < 1-U \le (1-p)^{j-1} $$

Since $U$ is a Uniform(0,1) random variable, $1-U$ is as well, so this is equivalent to:

$$ (1-p)^{j} < U \quad \Longleftrightarrow \quad j\ln(1-p) < \ln U \quad \Longleftrightarrow \quad j > \frac{\ln U}{\ln(1-p)} \quad \text{since } \ln(1-p) < 0 $$

Therefore to generate a random variable from a geometric distribution:

1. Generate a uniform random number $U$
2. Set $X$ as $j$, where $j$ is the first integer for which:

$$ j > \frac{\ln U}{\ln(1-p)} $$
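The "first integer $j$ exceeding $\ln U / \ln(1-p)$" rule above is just a floor-plus-one. A sketch in Python (function name and the mean check are illustrative):

```python
import math
import random

def geometric_inverse_transform(p, rng=random.random):
    """X = first integer j with j > ln(U)/ln(1-p), i.e. floor(ln U / ln(1-p)) + 1."""
    u = rng()
    return math.floor(math.log(u) / math.log(1.0 - p)) + 1

random.seed(1)
sample = [geometric_inverse_transform(0.3) for _ in range(100_000)]
mean = sum(sample) / len(sample)  # E[X] = 1/p ≈ 3.33
```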
Acceptance-Rejection Method

Suppose we have a method, e.g. inverse transform, to simulate an r.v. with density $g(y)$. Then we can use this as the basis for simulating from a continuous distribution with density $f(x)$.

The procedure to simulate random variables using the rejection method is:

1. Choose a distribution $g$ for which you know you can simulate outcomes
2. Set $c$ to be some constant such that $f(y)/g(y) \le c$ for all $y$
3. Simulate a random variable $Y$ with density function $g(y)$
4. Simulate a Uniform(0,1) random variable $U$
5. Accept this as the random number, i.e. set $X = Y$, if:

$$ U \le \frac{f(Y)}{c\,g(Y)} $$

6. Otherwise, reject and return to step 3.

Therefore, the value for $X$ is $Y_N$ where $N$ is the number of iterations until a random number is accepted. We want to be as efficient as possible, i.e. minimise the number of iterations, by:

- choosing a density $g(y)$ similar in shape to $f(y)$, e.g. an exponential for a half-normal
- choosing the smallest value of $c$ that satisfies the inequality, found using calculus, i.e.:

$$ c = \max_y \frac{f(y)}{g(y)} $$

E.g. simulate the absolute value of a standard normal random variable, $X = |Z|$.

Firstly, $X = |Z|$ has the density:

$$ f(x) = \frac{2}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right), \quad x > 0 $$

1. Let another random variable $Y$ be from the density $g(x) = e^{-x}$, $x > 0$; note that this is comparable in shape to $f(x)$
2. Choose the smallest value of $c$ such that $f(y)/g(y) \le c$:

$$ c = \max_x \frac{f(x)}{g(x)} = \max_x \sqrt{\frac{2}{\pi}}\, e^{x - x^2/2} = \sqrt{\frac{2e}{\pi}} $$

since $x - x^2/2$ is maximised at $x = 1$

3. Generate $U_1$ and $U_2$ from Uniform(0,1)
4. Compute $Y$ from the density $g$ using the inverse transform with $U_1$, i.e.:

$$ Y = -\ln U_1 $$

5. Now check if:

$$ U_2 \le \frac{f(Y)}{c\,g(Y)} = \exp\left( -\frac{(Y-1)^2}{2} \right) $$

6. If true, then set $X = Y = -\ln U_1$; if false, return to step 3 and repeat.
7. To now generate a standard normal random variable $Z$, generate $U_3$ and set:

$$ Z = \begin{cases} X & \text{if } U_3 \le 0.5 \\ -X & \text{if } U_3 > 0.5 \end{cases} $$
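The $|Z|$ example above can be sketched as follows in Python; the function names and the moment checks at the end are illustrative:

```python
import math
import random

def abs_normal_rejection(rng=random.random):
    """Sample |Z| by rejection from g(x)=e^{-x}: accept Y=-ln(U1) when U2 <= exp(-(Y-1)^2/2)."""
    while True:
        y = -math.log(rng())                              # Y ~ Exp(1) via inverse transform
        if rng() <= math.exp(-((y - 1.0) ** 2) / 2.0):
            return y                                      # accepted: distributed as |Z|

def standard_normal(rng=random.random):
    """Attach a random sign to |Z| to recover Z ~ N(0,1)."""
    x = abs_normal_rejection(rng)
    return x if rng() <= 0.5 else -x

random.seed(2)
zs = [standard_normal() for _ in range(100_000)]
mean = sum(zs) / len(zs)                # ≈ 0
var = sum(z * z for z in zs) / len(zs)  # ≈ 1
```

On average the loop runs $c = \sqrt{2e/\pi} \approx 1.32$ times per accepted value.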
Simulation Using Distributional Relationships

Gamma Distribution

Recall that the waiting time to the $n$th event in a continuous-time Markov chain has a Gamma distribution. To simulate random variables from a Gamma$(n, \lambda)$ distribution, the procedure is:

1. Generate $n$ independent Uniform(0,1) random variables $U_1, U_2, \dots, U_n$
2. Simulate exponential random variables using the inverse transform method:

$$ F^{-1}(U_i) = -\frac{1}{\lambda}\ln U_i \sim \text{Exp}(\lambda) $$

3. Since the sum of $n$ independent Exp$(\lambda)$ random variables has a Gamma distribution, set:

$$ X = -\frac{1}{\lambda}\sum_{i=1}^{n}\ln U_i \sim \text{Gamma}(n, \lambda) $$
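The sum-of-exponentials construction can be sketched in one line of Python (function name and the mean check are illustrative):

```python
import math
import random

def gamma_from_uniforms(n, lam, rng=random.random):
    """Gamma(n, lam) as a sum of n Exp(lam) variates: X = -(1/lam) * sum(ln U_i)."""
    return -sum(math.log(rng()) for _ in range(n)) / lam

random.seed(3)
sample = [gamma_from_uniforms(5, 2.0) for _ in range(50_000)]
mean = sum(sample) / len(sample)  # E[X] = n/lam = 2.5
```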
Chi-Squared Distribution

The sum of squares of $n$ standard normal r.v.s has a chi-squared distribution with $n$ degrees of freedom:

$$ \sum_{i=1}^{n} Z_i^2 \sim \chi^2_n $$

Alternatively, a chi-squared distribution with an even degree of freedom $2k$ is equivalent to a Gamma$(k, 1/2)$ distribution. If the degree of freedom is odd, i.e. $2k+1$, we can add on an extra $Z^2$ term, where $Z$ is standard normal. That is:

$$ -2\sum_{i=1}^{k}\ln U_i \sim \chi^2_{2k}, \qquad Z^2 - 2\sum_{i=1}^{k}\ln U_i \sim \chi^2_{2k+1} $$
Poisson Distribution

Recall that the number of events within one period, where the inter-event times follow an Exp$(\lambda)$ distribution, is Poi$(\lambda)$. To simulate random variables from a Poisson$(\lambda)$ distribution, the procedure is:

1. Generate Exp$(\lambda)$ inter-arrival times $-\frac{1}{\lambda}\ln U_i$ using the above steps
2. Since the waiting time until the $n$th event has a Gamma$(n,\lambda)$ distribution, which is a sum of independent Exp$(\lambda)$ random variables, set:

$$ X = \max\left\{ n : -\frac{1}{\lambda}\sum_{i=1}^{n}\ln U_i \le 1 \right\} \sim \text{Poi}(\lambda) $$

i.e. the total number of arrivals within one period.
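Counting arrivals within one period can be sketched as follows (function name and the mean check are illustrative):

```python
import math
import random

def poisson_from_exponentials(lam, rng=random.random):
    """Count Exp(lam) inter-arrival times until the total waiting time exceeds 1."""
    n, total = 0, 0.0
    while True:
        total += -math.log(rng()) / lam  # next inter-arrival time
        if total > 1.0:
            return n                     # arrivals that fit within one period
        n += 1

random.seed(4)
sample = [poisson_from_exponentials(3.0) for _ in range(100_000)]
mean = sum(sample) / len(sample)  # E[X] = lam = 3
```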
Normal Distribution

One method of simulating random variables from a normal distribution is the Box-Muller approach:

1. Generate two random variables $U_1$ and $U_2$ from Uniform(0,1).
2. Set:

$$ X = \sqrt{-2\ln U_1}\,\cos(2\pi U_2), \qquad Y = \sqrt{-2\ln U_1}\,\sin(2\pi U_2) $$

Then $X$ and $Y$ are a pair of independent standard normal random variables.

3. A pair of uncorrelated $N(\mu, \sigma^2)$ random variables can then be derived using $\mu + \sigma X$ and $\mu + \sigma Y$.
4. Note that if $X$ is normal, then to generate a lognormal random variable $L$ we set $L = e^X$.
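The Box-Muller transform above can be sketched directly (function name and the moment checks are illustrative):

```python
import math
import random

def box_muller(rng=random.random):
    """Return a pair of independent standard normals from two Uniform(0,1) variates."""
    u1, u2 = rng(), rng()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

random.seed(5)
pairs = [box_muller() for _ in range(50_000)]
xs = [x for x, _ in pairs]
mean_x = sum(xs) / len(xs)                          # ≈ 0
cov_xy = sum(x * y for x, y in pairs) / len(pairs)  # ≈ 0, since X and Y are independent
```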
Monte Carlo Simulation

Suppose that $X = (X_1, \dots, X_n)$ is a random vector with a given joint density function $f(x_1, \dots, x_n)$ and we want to compute:

$$ E[g(X)] = \int \cdots \int g(x_1, \dots, x_n)\, f(x_1, \dots, x_n)\, dx_1 \cdots dx_n $$

but with a function $g$ for which this is impossible to compute analytically. Then we can approximate this integral by using Monte Carlo simulation.

Let $X^i$ and $Y^i$ be the $i$th simulated sample of $X$ and $Y$. Then the MC simulation procedure is:

1. Generate a random vector $X^1 = (X_1^1, \dots, X_n^1)$ with joint density $f(x_1, \dots, x_n)$ and compute $Y^1 = g(X^1)$
2. Generate a second random vector $X^2 = (X_1^2, \dots, X_n^2)$, independent of step 1, with joint density $f(x_1, \dots, x_n)$ and compute $Y^2 = g(X^2)$
3. Repeat this process until $r$ (a fixed number) i.i.d. random vectors are generated:

$$ Y^i = g(X^i), \quad i = 1, 2, \dots, r $$

4. Estimate $E[g(X)]$ by using the arithmetic average of the generated $Y$'s:

$$ \bar{Y} = \frac{Y^1 + \dots + Y^r}{r} \approx E[g(X)] $$

This method works due to the strong law of large numbers:

$$ \lim_{r\to\infty} \frac{Y^1 + \dots + Y^r}{r} = E[Y^i] = E[g(X)] $$
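The procedure above can be sketched with a one-dimensional example where the answer is known in closed form, $E[e^U] = e - 1$ for $U \sim$ Uniform(0,1) (the helper name and the choice of test integrand are illustrative):

```python
import math
import random

def monte_carlo(g, sample_x, r):
    """Average r i.i.d. evaluations Y^i = g(X^i) to estimate E[g(X)]."""
    return sum(g(sample_x()) for _ in range(r)) / r

random.seed(6)
# Estimate E[e^U] for U ~ Uniform(0,1); exact value is e - 1 ≈ 1.71828
estimate = monte_carlo(math.exp, random.random, 200_000)
```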
Expectation and Variance

$\bar{Y}$ is an unbiased estimate of $E[g(X)]$:

$$ E[\bar{Y}] = \frac{1}{r}\sum_{i=1}^{r} E[Y^i] = E\left[ g(X_1, \dots, X_n) \right] $$

since each $Y^i$ is independent and identically distributed with the distribution of $g(X_1, \dots, X_n)$.

The variance of $\bar{Y}$ is given by:

$$ \operatorname{var}(\bar{Y}) = \operatorname{var}\left( \frac{1}{r}\sum_{i=1}^{r} Y^i \right) = \frac{1}{r^2}\sum_{i=1}^{r}\operatorname{var}(Y^i) = \frac{\operatorname{var}(Y^i)}{r} $$

Note that usually we do not know $\operatorname{var}(Y^i)$, so we estimate it using the sample estimate:

$$ \widehat{\operatorname{var}}(Y^i) = \frac{1}{r-1}\sum_{i=1}^{r}\left( Y^i - \bar{Y} \right)^2 $$

In the next 3 sections, we will describe some techniques to REDUCE this variance.
Antithetic Variables

Reducing the variance of the estimate using antithetic variables involves generating pairs of estimates with negative correlation, then averaging these estimates to obtain a final estimate.

Assume that $r$ is even. The antithetic variates procedure is:

1. Generate a set of $n$ variates $X_1^1, \dots, X_n^1$ and determine $Y^1 = g(X_1^1, \dots, X_n^1)$
2. Generate a set of $n$ variates $\tilde{X}_1^1, \dots, \tilde{X}_n^1$, which are negatively correlated with $X_1^1, \dots, X_n^1$, and determine $\tilde{Y}^1 = g(\tilde{X}_1^1, \dots, \tilde{X}_n^1)$
3. Repeat steps 1 and 2 $r/2$ times to form $Y^1, \dots, Y^{r/2}$ and $\tilde{Y}^1, \dots, \tilde{Y}^{r/2}$
4. Calculate the arithmetic averages of the $Y$'s:

$$ \bar{Y}^{(1)} = \frac{2}{r}\sum_{i=1}^{r/2} Y^i \quad \text{and} \quad \bar{Y}^{(2)} = \frac{2}{r}\sum_{i=1}^{r/2} \tilde{Y}^i $$

5. Use:

$$ \bar{Y}_{AV} = \frac{\bar{Y}^{(1)} + \bar{Y}^{(2)}}{2} $$

as the final estimate for $E[g(X)]$.

Using this method, as long as the correlation between $\bar{Y}^{(1)}$ and $\bar{Y}^{(2)}$ is negative, the variance will be reduced. One example of this is by using:

- $X \sim$ Uniform(0,1) for the first set of $n$ variates, e.g. $Y^1$
- $1 - X$, which is also Uniform(0,1), for the second set of $n$ variates, e.g. $\tilde{Y}^1$

Then as long as $g(X)$ is a monotonic increasing/decreasing function, we will have a negative correlation, since:

$$ \operatorname{cov}(X, 1-X) = -\operatorname{var}(X) < 0 $$

To show why this method reduces the variance, consider the variance of the estimator using the antithetic variable, writing $\rho = \operatorname{corr}(\bar{Y}^{(1)}, \bar{Y}^{(2)})$:

$$ \operatorname{var}(\bar{Y}_{AV}) = \operatorname{var}\left( \frac{\bar{Y}^{(1)} + \bar{Y}^{(2)}}{2} \right) = \frac{1}{4}\left[ \operatorname{var}(\bar{Y}^{(1)}) + \operatorname{var}(\bar{Y}^{(2)}) + 2\operatorname{cov}\left( \bar{Y}^{(1)}, \bar{Y}^{(2)} \right) \right] $$
$$ = \frac{\operatorname{var}(\bar{Y}^{(1)})}{2}(1 + \rho) \quad \text{since } \operatorname{var}(\bar{Y}^{(1)}) = \operatorname{var}(\bar{Y}^{(2)}) $$
$$ = \frac{\operatorname{var}(Y^i)}{r}(1 + \rho) \quad \text{using } \operatorname{var}(\bar{Y}^{(1)}) = \frac{\operatorname{var}(Y^i)}{r/2}, \text{ i.e. only the first set up to } r/2 $$
$$ < \frac{\operatorname{var}(Y^i)}{r} = \operatorname{var}(\bar{Y}) \quad \text{for } \rho < 0 $$
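The $U$ and $1-U$ pairing can be sketched with the same test integrand as before, $g(u) = e^u$, which is monotone increasing, so the antithetic pair is negatively correlated (function name and integrand are illustrative):

```python
import math
import random

def antithetic_estimate(g, r):
    """Average g(U) and g(1-U) over r/2 antithetic pairs; g monotone => variance reduction."""
    total = 0.0
    for _ in range(r // 2):
        u = random.random()
        total += g(u) + g(1.0 - u)  # one negatively correlated pair
    return total / r

random.seed(7)
estimate = antithetic_estimate(math.exp, 100_000)  # E[e^U] = e - 1
```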
Control Variates

If we wish to evaluate the expected value $E[g(X)]$ and there is a function $f$ such that the expected value of $f(X)$ can be evaluated analytically, with:

$$ E[f(X)] = \mu $$

then we can evaluate $E[g(X)]$ using:

$$ E[g(X)] = E[g(X)] + a\left( E[f(X)] - \mu \right) = E\left[ g(X) + a\left( f(X) - \mu \right) \right] \tag{1} $$

where $a$ is a parameter we can choose.

Therefore, instead of evaluating $E[g(X)]$ directly, we evaluate (1) using the control variate estimator:

$$ \bar{Y}_{cv} = \frac{1}{r}\sum_{i=1}^{r}\left[ g(X^i) + a\left( f(X^i) - \mu \right) \right] $$

Then this estimator will have the variance:

$$ \operatorname{var}(\bar{Y}_{cv}) = \frac{1}{r}\operatorname{var}\left[ g(X^i) + a f(X^i) \right] = \frac{1}{r}\left[ \operatorname{var} g(X^i) + a^2 \operatorname{var} f(X^i) + 2a\operatorname{Cov}\left( g(X^i), f(X^i) \right) \right] $$

This variance can be minimised by solving:

$$ \frac{\partial \operatorname{var}(\bar{Y}_{cv})}{\partial a} = 0 $$

with the solution being the best choice of $a$:

$$ a^* = -\frac{\operatorname{Cov}\left( g(X^i), f(X^i) \right)}{\operatorname{var} f(X^i)} $$

Substituting this value back, we have that the minimised variance is:

$$ \operatorname{var}(\bar{Y}_{cv}) = \frac{1}{r}\left[ \operatorname{var} g(X^i) - \frac{\operatorname{Cov}^2\left( g(X^i), f(X^i) \right)}{\operatorname{var} f(X^i)} \right] \le \frac{\operatorname{var} g(X^i)}{r} $$

Therefore, using a control variate has decreased the variance compared to the original estimator.
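A minimal sketch, again estimating $E[e^U]$, using the control $f(U) = U$ with known mean $\mu = 1/2$. The coefficient $a \approx -1.69$ is my computation of the optimum $a^* = -\operatorname{Cov}(e^U, U)/\operatorname{var}(U)$, not a value from the notes:

```python
import math
import random

def control_variate_estimate(r, a=-1.69):
    """Estimate E[e^U] using f(U)=U as a control variate with E[f(U)] = 0.5."""
    total = 0.0
    for _ in range(r):
        u = random.random()
        total += math.exp(u) + a * (u - 0.5)  # g(X) + a*(f(X) - mu)
    return total / r

random.seed(8)
estimate = control_variate_estimate(100_000)  # E[e^U] = e - 1
```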
Importance Sampling

We can write:

$$ E_X[g(X)] = \int g(x)\,f(x)\,dx $$

where $f(x)$ is the density of $X$.

Now consider another probability density $h(x)$, equivalent to $f(x)$ in the sense that the zero-probability events agree under both densities, i.e. $f(x) = 0$ if and only if $h(x) = 0$. Let $Y$ be the random variable with density function $h(x)$.

Then we can write:

$$ E_X[g(X)] = \int g(x)\,\frac{f(x)}{h(x)}\,h(x)\,dx = E_Y\left[ g(Y)\,\frac{f(Y)}{h(Y)} \right] $$

Then, we can simulate $Y^i$ from density $h(x)$ and estimate the expectation $E_X[g(X)]$ with:

$$ \bar{Y}_h = \frac{1}{r}\sum_{i=1}^{r} g(Y^i)\,\frac{f(Y^i)}{h(Y^i)} $$

If it is possible to select a probability density $h(x)$ so that the random variable

$$ g(X)\,\frac{f(X)}{h(X)} $$

has a smaller variance, then the estimator will be more efficient, i.e.:

$$ \operatorname{var}\left[ g(X)\,\frac{f(X)}{h(X)} \right] < \operatorname{var}\left[ g(X) \right] $$

To do this, we need to select a density $h(x)$ such that the ratio of the two densities, i.e. $f(x)/h(x)$, is large when $g(x)$ is small and vice versa.
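A minimal sketch under my own choice of example, not one from the notes: estimate $E[U^4] = \int_0^1 x^4\,dx = 0.2$ by sampling from $h(y) = 2y$ (via $Y = \sqrt{U}$) and weighting by $f(y)/h(y) = 1/(2y)$. Here $f/h$ is large where $g(y) = y^4$ is small, matching the guidance above:

```python
import math
import random

def importance_sampling_estimate(r):
    """Estimate E[U^4] = 0.2 by sampling Y from h(y)=2y and weighting by f(y)/h(y)=1/(2y)."""
    total = 0.0
    for _ in range(r):
        y = math.sqrt(random.random())         # inverse transform for h(y) = 2y on (0,1)
        total += (y ** 4) * (1.0 / (2.0 * y))  # g(Y) * f(Y)/h(Y)
    return total / r

random.seed(9)
estimate = importance_sampling_estimate(200_000)  # exact value 0.2
```

For this integrand the weighted variate $y^3/2$ has a smaller variance than plain $U^4$, so the importance-sampling estimator is more efficient.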
Number of Simulations

Ideally, we want to carry out simulations as efficiently as possible, since the number of simulations required for accuracy can be quite large. Assume we generate $n$ samples from a known distribution. To estimate its mean we will use the sample average as an estimator:

$$ \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i $$

This is an unbiased estimator, i.e.:

$$ E[\bar{Y}] = \mu $$

Its variance is given by:

$$ \operatorname{var}(\bar{Y}) = \frac{\operatorname{var}(Y_i)}{n} = \frac{\sigma^2}{n} $$
However, the value of $\operatorname{var}(Y_i)$ is usually not known, so we estimate it using the sample variance from the first $k$ runs, where $k < n$ and usually $k \ge 30$. Then the estimates of the mean and variance become:

$$ \bar{Y} = \frac{1}{k}\sum_{i=1}^{k} Y_i, \qquad \widehat{\operatorname{var}}(Y_i) = \frac{1}{k-1}\sum_{i=1}^{k}\left( Y_i - \bar{Y} \right)^2 $$

For large values of $n$, we know that by the Central Limit Theorem this estimator will be approximately normal:

$$ \bar{Y} \sim N\left( \mu, \frac{\sigma^2}{n} \right) $$

Then, we can select $n$ such that the estimate is within a desired accuracy of the true mean, i.e. a percentage of the true mean with a specified probability.

E.g. random variates are generated from a Gamma distribution:

$$ f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}, \quad x > 0 $$

Twenty values are generated for each sample, and the mean and standard deviation of each sample are given as:
Sample 1 2 3 4 5 6 7 8 9 10
Mean 12.01 11.79 13.43 14.01 11.44 11.19 11.24 12.42 12.91 12.29
SD 6.15 4.73 6.42 7.02 4.30 3.90 3.84 4.35 3.59 4.60
Determine the number of simulated values required for the estimate of the mean to be within 5% of the true mean with 95% certainty.

We require $n$ such that:

$$ \Pr\left( \left| \bar{X} - \mu \right| < 0.05\mu \right) \ge 0.95 $$
$$ \Pr\left( -\frac{0.05\mu}{\sigma/\sqrt{n}} < Z < \frac{0.05\mu}{\sigma/\sqrt{n}} \right) \ge 0.95 $$
$$ 2\Pr\left( Z < \frac{0.05\mu\sqrt{n}}{\sigma} \right) - 1 \ge 0.95 $$
$$ \frac{0.05\mu\sqrt{n}}{\sigma} \ge 1.96 $$

We do not know the true values of $\mu$ and $\sigma$, so we must use the sample estimates. The sample estimate for the mean is given by averaging the mean of each sample:

$$ \bar{X} = \frac{1}{10 \times 20}\sum_{i=1}^{10}\sum_{j=1}^{20} X_{ij} = \frac{1}{10}\sum_{i=1}^{10}\bar{X}_i = 12.273 $$
To estimate the sample variance, use:

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left( X_i - \bar{X} \right)^2 = \frac{1}{n-1}\left[ \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right] $$

So the estimated variance is given by:

$$ s^2 = \frac{1}{10 \times 20 - 1}\left[ \sum_{i=1}^{10}\sum_{j=1}^{20} x_{ij}^2 - (10 \times 20)\,\bar{X}^2 \right] $$

where we have:

$$ \sum_{i=1}^{10}\sum_{j=1}^{20} x_{ij}^2 = \sum_{i=1}^{10}\left[ (20-1)\,s_i^2 + 20\,\bar{X}_i^2 \right] = 19\sum_{i=1}^{10} s_i^2 + 20\sum_{i=1}^{10}\bar{X}_i^2 = 4790.0596 + 30285.702 = 35075.7616 $$

Therefore the estimated variance and standard deviation are:

$$ s^2 = \frac{1}{199}\left[ 35075.7616 - 200 \times 12.273^2 \right] = 24.87666, \qquad s = 4.98765 $$

Substituting back into the previous equation, we have:

$$ \frac{0.05\mu\sqrt{n}}{\sigma} \ge 1.96 \quad \Rightarrow \quad \sqrt{n} \ge \frac{1.96 \times 4.98765}{0.05 \times 12.273} \quad \Rightarrow \quad n \ge 254 $$

We can also estimate the parameters of the Gamma distribution, since we know that:

$$ E[X] = \frac{\alpha}{\lambda}, \qquad \operatorname{var}(X) = \frac{\alpha}{\lambda^2} $$

Using our estimated values, $\alpha/\lambda = 12.273$ and $\alpha/\lambda^2 = 24.87666$, so $\hat{\lambda} = 12.273/24.87666 \approx 0.4934$ and $\hat{\alpha} = 12.273 \times \hat{\lambda} \approx 6.055$.
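The whole worked example can be reproduced from the table of sample means and SDs; this sketch follows the same pooled-variance and method-of-moments steps as above:

```python
import math

# Sample means and SDs from the table above (10 samples of 20 values each)
means = [12.01, 11.79, 13.43, 14.01, 11.44, 11.19, 11.24, 12.42, 12.91, 12.29]
sds = [6.15, 4.73, 6.42, 7.02, 4.30, 3.90, 3.84, 4.35, 3.59, 4.60]
m = 20  # values per sample

xbar = sum(means) / len(means)
sum_sq = sum((m - 1) * s**2 + m * x**2 for s, x in zip(sds, means))  # sum of x_ij^2
s2 = (sum_sq - len(means) * m * xbar**2) / (len(means) * m - 1)      # pooled variance

# Smallest n with 0.05 * mu * sqrt(n) / sigma >= 1.96
n = math.ceil((1.96 * math.sqrt(s2) / (0.05 * xbar)) ** 2)

# Method-of-moments Gamma parameters: E[X] = alpha/lam, var(X) = alpha/lam^2
lam_hat = xbar / s2
alpha_hat = xbar * lam_hat
```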