Jan Kallsen

Stochastic Filtering Theory

Lecture Notes

CAU Kiel, WS17/18, as of February 8, 2018





Contents

0 Motivation
  0.1 Discrete time
  0.2 Continuous time

1 Discrete-time models
  1.1 Optimal filtering in hidden Markov models
    1.1.1 Hidden Markov models
    1.1.2 Non-normalised recursions
    1.1.3 Normalised recursions
    1.1.4 More explicit computations
  1.2 Optimal filtering in linear Gaussian state-space models
    1.2.1 Kalman filter recursions
    1.2.2 Slightly more general Kalman filter recursions
  1.3 Maximum likelihood estimation
    1.3.1 Likelihood inference
    1.3.2 Parametric models with partial observations
    1.3.3 Likelihood estimation by filtering conditional densities
    1.3.4 Likelihood estimation by smoothing conditional score functions
    1.3.5 The Baum-Welch algorithm
    1.3.6 Likelihood estimation in linear Gaussian state space models

2 Continuous-time models
  2.1 The Kushner-Stratonovich and Zakai equations
  2.2 Outlook

A Some measure theory
  A.1 Some facts on conditional probabilities and expectations
  A.2 The multivariate normal distribution

Bibliography


Chapter 0

Motivation

In many applications not all components of a multivariate stochastic process are observable. In this course we discuss how to make inference on the unknown state based on the available observations. This issue differs from parameter estimation in the sense that the object of interest is itself random, whereas model parameters are usually considered as constants – unless one takes a Bayesian point of view.

0.1 Discrete time

Example 0.1 (Moving object I) Suppose that the location X(t) of an object is recorded only up to some measurement error. More precisely, we observe

Y (t) = X(t) + σY ∆WY (t), t = 0, 1, 2, . . . , (0.1)

where t represents time, ∆WY (t) are independent, standard Gaussian random variables, and σY is a parameter. Since we do not know the object’s movement, we consider it as random as well, e.g. of the form

X(t) = X(t− 1) + σX∆WX(t), (0.2)

where ∆WX(t), t = 1, 2, . . . is another independent sequence of standard Gaussian random variables. The parameter σX denotes the standard deviation of the object’s movement in any time period.

Example 0.2 (Moving object II) Model (0.2) may be appropriate for, say, a basketball player. By contrast, objects such as bicycles, cars or ships move more smoothly. This can be modelled by assuming

X(t) = X(t− 1) + v(t− 1)∆t, (0.3)

where v(t−1) denotes the object’s velocity in the interval (t−1, t] of length ∆t = 1. It is supposed to change according to

v(t) = v(t− 1)− λv(t− 1) + σv∆Wv(t), (0.4)

where again ∆Wv(t), t = 0, 1, . . . denotes a sequence of independent standard Gaussian random variables and λ, σv > 0 are parameters. The random part stands for unknown manoeuvres of the driver, whereas the mean-reverting term −λv(t−1) prevents the vehicle from getting much faster if its speed is already large.
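The models (0.1)–(0.4) are straightforward to simulate. The following minimal Python sketch generates a path of the smoothly moving object of Example 0.2 together with noisy observations as in (0.1); the function name and all parameter values are our own illustration, not part of the notes.

```python
import numpy as np

# Simulation sketch of Example 0.2 observed as in (0.1) (illustrative values):
#   X(t) = X(t-1) + v(t-1),  v(t) = (1 - lam) v(t-1) + sig_v dW_v(t),
#   Y(t) = X(t) + sig_Y dW_Y(t).
rng = np.random.default_rng(0)

def simulate(T=100, lam=0.1, sig_v=0.2, sig_Y=1.0):
    X = np.zeros(T + 1)  # true positions X(0), ..., X(T)
    v = np.zeros(T + 1)  # velocities v(0), ..., v(T)
    Y = np.zeros(T + 1)  # noisy observations
    Y[0] = X[0] + sig_Y * rng.standard_normal()
    for t in range(1, T + 1):
        X[t] = X[t - 1] + v[t - 1]                                   # (0.3), dt = 1
        v[t] = (1 - lam) * v[t - 1] + sig_v * rng.standard_normal()  # (0.4)
        Y[t] = X[t] + sig_Y * rng.standard_normal()                  # (0.1)
    return X, v, Y

X, v, Y = simulate()
print(X.shape, Y.shape)  # (101,) (101,)
```

Feeding the observations Y into the filters developed below and comparing with the hidden path X is an instructive experiment.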


In Examples 0.1 and 0.2 we typically want to estimate the location X(t) from the observations Y (s), s = 0, 1, . . . , t.

As an example from Mathematical Finance we consider

Example 0.3 (Stochastic volatility) Suppose that v(t), t = 0, 1, 2, . . . is a Markov process and

∆X(t) = √v(t−1) ∆WX(t), (0.5)

where WX denotes an independent random walk whose increments satisfy E(∆WX(t)) = 0, Var(∆WX(t)) = 1. If S(t) = S(0)e^{X(t)} denotes the price of an asset, √v(t) can be interpreted as its unobserved stochastic volatility. We may be interested in inferring the unobservable current level of squared volatility v(t−1) from the past observations of the stock price.

0.2 Continuous time

Chapter 1 concerns discrete-time models as in the examples above. Let us have a look at their continuous-time counterparts, which are discussed in Chapter 2.

Example 0.4 (Moving object I) Equation (0.1) can be rewritten as

∆Z(t) := Z(t) − Z(t−1) = X(t) + σY ∆WY (t), t = 1, 2, . . . (0.6)

for Z(t) := Σ_{s=1}^{t} Y (s). The continuous-time analogue of (0.6, 0.2) reads as

dZ(t) = X(t)dt+ σZdWZ(t),

dX(t) = σXdWX(t)

with Brownian motions WZ ,WX and parameters σZ , σX > 0.

Example 0.5 (Moving object II) Similarly as in the previous example, moving to the continuous-time limit turns (0.6, 0.3, 0.4) into

dZ(t) = X(t)dt+ σZdWZ(t),

dX(t) = v(t)dt,

dv(t) = −λv(t)dt+ σvdWv(t)

with Brownian motions WZ ,Wv and parameters σZ , σv > 0.
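For intuition, the system above can be discretised by a simple Euler scheme; the rigorous treatment follows in Chapter 2. The step size dt, the function name, and all parameter values below are our own illustration.

```python
import numpy as np

# Euler discretisation sketch of the system of Example 0.5:
#   dZ = X dt + sig_Z dW_Z,  dX = v dt,  dv = -lam v dt + sig_v dW_v.
rng = np.random.default_rng(1)

def euler_path(T=10.0, dt=0.01, lam=0.5, sig_Z=0.3, sig_v=0.2):
    n = int(round(T / dt))
    Z, X, v = np.zeros(n + 1), np.zeros(n + 1), np.zeros(n + 1)
    for k in range(n):
        Z[k + 1] = Z[k] + X[k] * dt + sig_Z * np.sqrt(dt) * rng.standard_normal()
        X[k + 1] = X[k] + v[k] * dt
        v[k + 1] = v[k] - lam * v[k] * dt + sig_v * np.sqrt(dt) * rng.standard_normal()
    return Z, X, v

Z, X, v = euler_path()
print(len(Z))  # 1001
```

Shrinking dt recovers, heuristically, the continuous-time dynamics; for dt = 1 one is back at the discrete-time model of Example 0.2.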

Example 0.6 (Stochastic volatility) The continuous-time analogue of (0.5) would be

dX(t) =√v(t)dWX(t),

where WX is a Wiener process and v some continuous-time Markov process. In this limiting model v is actually observable because we have d[X,X](t) = v(t)dt for the quadratic variation of X. Real high-frequency data, however, tends to be subject to so-called microstructure noise, i.e. we observe something like

dY (t) = X(t)dt+ σY dWY (t)

rather than X itself, where WY denotes a Wiener process and σY a parameter. As in Example 0.3 we may want to approximate the current squared volatility v(t) based on the past observations, namely Y (s), s ≤ t in this case.


Chapter 1

Discrete-time models

In this chapter we consider discrete-time models. The continuous-time counterpart in Chapter 2 requires tools from stochastic calculus.

1.1 Optimal filtering in hidden Markov models

1.1.1 Hidden Markov models

Filtering theory is typically cast in a Markovian framework. To this end, fix a filtered probability space (Ω, F, (Ft)t∈N, P ).

Definition 1.1 We call an adapted Rd-valued process X a Markov process if there is a transition kernel p : Rd × Bd → [0, 1] such that

PX(t)|Ft−1(ω, dxt) = p(X(ω, t−1), dxt)

or, equivalently,

E(f(X(t))|Ft−1) = ∫ f(xt) p(X(t−1), dxt)

almost surely for any measurable function f : Rd → R that is nonnegative or PX(t)-integrable.

This means that the conditional law PX(t)|Ft−1 depends on the past Ft−1 only through the value X(t−1). Moreover, it is homogeneous in the sense that the transition probabilities from time t−1 to t do not depend on t.

The examples in the introduction can be viewed as instances of hidden Markov models in the following sense.

Definition 1.2 A homogeneous Rm+n-valued Markov process (X, Y ) is called a hidden Markov model (HMM) if

1. X is a Markov process,

2. the transition kernel of (X, Y ) is of the form

P (X,Y )(t)|Ft−1(ω, d(xt, yt)) = K(xt, dyt)p(X(ω, t− 1), dxt)

for some kernel K : Rm × Bn → [0, 1], where p denotes the transition function of X, and


3. P(X,Y )(0)(d(x0, y0)) = K(x0, dy0)p0(dx0) for p0 := PX(0).

In the following we also suppose that (X, Y ) is partially dominated in the sense that

K(x, dy) = λ(x, y)ϕ(dy)

for some measurable function λ : Rm+n → R+ and some σ-finite measure ϕ on (Rn, Bn), i.e. the probability measures K(x, ·), x ∈ Rm are all absolutely continuous with respect to the same measure on Rn.

The unobservable or “hidden” first component X of the hidden Markov model is sometimes called the signal, and it is a Markov process itself. The observations Y (t) depend on the past only through X(t). They typically represent noisy versions of X(t).

For the remainder of this chapter we use the notation

xs:t := (xs, xs+1, . . . , xt) and Z(s : t) := (Z(s), . . . , Z(t))

for any sequence (xt)t∈N, any process (Z(t))t∈N, and any indices s ≤ t. We want to make statements about X(t) based on either Y (0 : t) (filtering problem), on Y (0 : s) with s < t (prediction problem), or Y (0 : s) with s > t (smoothing problem). More specifically, we aim at the conditional law PX(t)|Y (0:s) of X(t) given Y (0 : s). For any measurable, sufficiently integrable function f : Rm → R, the conditional expectation with respect to this law provides the best approximation to f(X(t)) in a mean square sense, i.e.

Z⋆ = E(f(X(t)) | Y (0 : s)) = ∫ f(xt) PX(t)|Y (0:s)(dxt)

minimizes E((f(X(t)) − Z)²) over all σ(Y (0 : s))-measurable random variables Z.

We start with some notation. We write πt|s : R(s+1)n × Bm → [0, 1] for the factorisation of PX(t)|Y (0:s), i.e.

πt|s(y0:s, dxt) = PX(t)|Y (0:s)=y0:s(dxt), y0:s ∈ R(s+1)n

or, equivalently,

πt|s(Y (0 : s), dxt) = PX(t)|Y (0:s)(dxt).

Moreover, we write

πt|s(y0:s, f) := ∫ f(xt) πt|s(y0:s, dxt) = E(f(X(t)) | Y (0 : s) = y0:s)

for y0:s ∈ R(s+1)n and measurable functions f : Rm → R such that the integral exists. The same kind of notation will be used for the non-normalised kernel ϱt|s that appears in Theorem 1.3 below.


1.1.2 Non-normalised recursions

The desired kernels πt|s can be determined recursively using the following relations.

Theorem 1.3 Let (yt)t∈N be a sequence in Rn and f : Rm → R a measurable nonnegative, bounded, or sufficiently integrable function. For s, t ∈ N define kernels ϱt|s : R(s+1)n × Bm → R+ recursively by

1. (inception step)

   ϱ0|0(y0, f) := ∫ f(x0) λ(x0, y0) p0(dx0), (1.1)

2. (prediction step)

   ϱt|s(y0:s, f) := ∫∫ f(xt) p(xt−1, dxt) ϱt−1|s(y0:s, dxt−1) (1.2)

   for t > s,

3. (correction step)

   ϱt|t(y0:t, f) := ∫ f(xt) λ(xt, yt) ϱt|t−1(y0:t−1, dxt), (1.3)

4. (smoothing step)

   ϱt|s(y0:s, dxt) := βt|s(yt+1:s, xt) ϱt|t(y0:t, dxt) (1.4)

   for t < s, where the densities βt|s : R(s−t)n × Rm → R+ are defined recursively by βs|s := 1 and

   βt|s(yt+1:s, xt) := ∫ λ(xt+1, yt+1) βt+1|s(yt+2:s, xt+1) p(xt, dxt+1). (1.5)

The desired filter, predictor, or smoother πt|s is then obtained by the Kallianpur-Striebel formula

πt|s(y0:s, f) = ϱt|s(y0:s, f) / ϱt|s(y0:s, 1). (1.6)

Proof. Ignoring the definition in the statement, we redefine the measure ϱt|s(y0:s, ·) on Rm by

ϱt|s(y0:s, B) = E(1B(X(t)) ∏_{r=0}^{s} λ(X(r), yr)), B ∈ Bm

and we set

ϱt|s(y0:s, f) = E(f(X(t)) ∏_{r=0}^{s} λ(X(r), yr)) (1.7)


for measurable functions f : Rm → R. Using Lemma A.1 in the second step, we obtain (1.6):

πt|s(y0:s, f) = E(f(X(t)) | Y (0 : s) = y0:s)

= [∫ f(xt) (dPY (0:s)|X(0:s)=x0:s/dϕ1+s)(y0:s) PX(0:t)(dx0:t)] / [∫ (dPY (0:s)|X(0:s)=x0:s/dϕ1+s)(y0:s) PX(0:t)(dx0:t)]

= [∫ f(xt) ∏_{r=0}^{s} λ(xr, yr) PX(0:t)(dx0:t)] / [∫ ∏_{r=0}^{s} λ(xr, yr) PX(0:t)(dx0:t)]

= ϱt|s(y0:s, f) / ϱt|s(y0:s, 1),

where we set ϕ1+s := ⊗_{r=0}^{s} ϕ. We also redefine the functions βt|s : R(s−t)n × Rm → R+ for t ≤ s, namely as

βt|s(yt+1:s, xt) = E(∏_{r=t+1}^{s} λ(X(r), yr) | X(t) = xt).

We now show that ϱt|s and βt|s satisfy (1.1–1.5) in the statement of the theorem.

1. The inception step is obvious from

   ϱ0|0(y0, f) = E(f(X(0)) λ(X(0), y0)) = ∫ f(x0) λ(x0, y0) p0(dx0).

2. For the prediction step observe that

   ϱt|s(y0:s, f) = ∫···∫ f(xt) ∏_{r=0}^{s} λ(xr, yr) p(xt−1, dxt) ··· p(x0, dx1) p0(dx0)

   = ∫···∫ g(xt−1) ∏_{r=0}^{s} λ(xr, yr) p(xt−2, dxt−1) ··· p(x0, dx1) p0(dx0)

   = E(g(X(t−1)) ∏_{r=0}^{s} λ(X(r), yr)) = ϱt−1|s(y0:s, g)

   with g(xt−1) := ∫ f(xt) p(xt−1, dxt).

3. The correction step follows from

   ϱt|t(y0:t, f) = E(f(X(t)) ∏_{r=0}^{t} λ(X(r), yr))

   = E(f(X(t)) λ(X(t), yt) ∏_{r=0}^{t−1} λ(X(r), yr)) = ϱt|t−1(y0:t−1, g)

   with g(xt) := f(xt) λ(xt, yt).


4. For (1.4) note that

   ϱt|s(y0:s, f) = E(f(X(t)) ∏_{r=0}^{s} λ(X(r), yr))

   = E(f(X(t)) ∏_{r=0}^{t} λ(X(r), yr) E(∏_{r=t+1}^{s} λ(X(r), yr) | Ft))

   = E(f(X(t)) βt|s(yt+1:s, X(t)) ∏_{r=0}^{t} λ(X(r), yr))

   = ϱt|t(y0:t, βt|s(yt+1:s, ·)f),

   where we use

   E(∏_{r=t+1}^{s} λ(X(r), yr) | Ft) = E(∏_{r=t+1}^{s} λ(X(r), yr) | X(t)) = βt|s(yt+1:s, X(t))

   in the third step. (1.5), on the other hand, follows from

   βt|s(yt+1:s, xt) = E(∏_{r=t+1}^{s} λ(X(r), yr) | X(t) = xt)

   = ∫ ∏_{r=t+1}^{s} λ(xr, yr) PX(t+1:s)|X(t)=xt(dxt+1:s)

   = ∫∫ ∏_{r=t+2}^{s} λ(xr, yr) PX(t+2:s)|X(t+1)=xt+1(dxt+2:s) λ(xt+1, yt+1) PX(t+1)|X(t)=xt(dxt+1)

   = ∫ E(∏_{r=t+2}^{s} λ(X(r), yr) | X(t+1) = xt+1) λ(xt+1, yt+1) p(xt, dxt+1)

   = ∫ βt+1|s(yt+2:s, xt+1) λ(xt+1, yt+1) p(xt, dxt+1). □

Remark 1.4 By setting ϱ0|−1 := p0, the inception step in Theorem 1.3 and likewise in Theorem 1.6 and Example 1.9 can be interpreted as a first correction step.

The kernels ϱt|s can be interpreted as a kind of non-normalised filter, predictor, or smoother, respectively. ϱt|t is computed starting from ϱ0|0 in the following order:

ϱ0|0 → ϱ1|0 → ϱ1|1 → ϱ2|1 → ϱ2|2 → ··· → ϱt|t−1 → ϱt|t,

i.e. alternating a prediction and a correction step. ϱt|s for t > s is obtained starting from the filter ϱs|s via the prediction steps

ϱs|s → ϱs+1|s → ϱs+2|s → ··· → ϱt−1|s → ϱt|s.


Finally, ϱt|s for t < s can be obtained from the filter ϱt|t by (1.4) and the smoothing steps

1 = βs|s → βs−1|s → βs−2|s → ··· → βt+1|s → βt|s.

Remark 1.5 The normalising constant ϱt|s(y0:s, 1) in the Kallianpur-Striebel formula has an intuitive interpretation. First note from (1.7) that it does not depend on t, i.e.

ϱt|s(y0:s, 1) = ϱs|s(y0:s, 1) (1.8)

for any t ∈ N. The announced interpretation is

ϱt|s(y0:s, 1) = (dPY (0:s)/dϕ1+s)(y0:s) (1.9)

for ϕ1+s-almost any y0:s, which means that ϱt|s(·, 1) is the Radon-Nikodym density or likelihood of the observations Y (0 : s) relative to ϕ1+s := ⊗_{r=0}^{s} ϕ. Indeed, we need to show that

P (Y (0 : s) ∈ A) = ∫ 1A(y0:s) ϱt|s(y0:s, 1) ϕ1+s(dy0:s)

for any Borel set A ⊂ Rn(1+s). This follows from (1.7) and

∫ 1A(y0:s) ϱt|s(y0:s, 1) ϕ1+s(dy0:s)

= ∫ 1A(y0:s) E(∏_{r=0}^{s} λ(X(r), yr)) ϕ1+s(dy0:s)

= ∫···∫ 1A(y0:s) λ(xs, ys)ϕ(dys) p(xs−1, dxs) × ··· × λ(x1, y1)ϕ(dy1) p(x0, dx1) λ(x0, y0)ϕ(dy0) p0(dx0)

= ∫ 1A(y0:s) P (X,Y )(0:s)(d(x, y)0:s)

= E(1A(Y (0 : s))) = P (Y (0 : s) ∈ A).

1.1.3 Normalised recursions

The filtering recursions can also be formulated directly in terms of πt|s rather than ϱt|s:

Theorem 1.6 Let (yt)t∈N be a sequence in Rn and f : Rm → R a measurable nonnegative, bounded, or sufficiently integrable function. The desired filter, predictor, or smoother πt|s satisfies the following recursive equations.

1. (inception step)

   π0|0(y0, f) = [∫ f(x0) λ(x0, y0) p0(dx0)] / [∫ λ(x0, y0) p0(dx0)], (1.10)

2. (prediction step)

   πt|s(y0:s, f) = ∫∫ f(xt) p(xt−1, dxt) πt−1|s(y0:s, dxt−1) (1.11)

   for t > s,


3. (correction step)

   πt|t(y0:t, f) = [∫ f(xt) λ(xt, yt) πt|t−1(y0:t−1, dxt)] / [∫ λ(xt, yt) πt|t−1(y0:t−1, dxt)], (1.12)

4. (smoothing step)

   πt|s(y0:s, dxt) = αt|s(yt+1:s, xt) πt|t(y0:t, dxt) (1.13)

   for t ≤ s, where the densities αt|s : R(s−t)n × Rm → R+ are defined recursively by αs|s := 1 and

   αt|s(yt+1:s, xt) := [∫ λ(xt+1, yt+1) αt+1|s(yt+2:s, xt+1) p(xt, dxt+1)] / [∫ λ(xt+1, yt+1) πt+1|t(y0:t, dxt+1)] (1.14)

   = [∫ λ(xt+1, yt+1) αt+1|s(yt+2:s, xt+1) p(xt, dxt+1)] / [∫∫ λ(xt+1, yt+1) p(xt, dxt+1) πt|t(y0:t, dxt)]. (1.15)

Proof. Define πt|s by (1.6) and

αt|s(yt+1:s, xt) := [ϱt|t(y0:t, 1) / ϱs|s(y0:s, 1)] βt|s(yt+1:s, xt).

We show by induction that these functions satisfy (1.10–1.15).

1. (1.10) follows directly from (1.1) and (1.6).

2. (1.11) follows by induction from (1.2):

   πt|s(y0:s, f) = ϱt|s(y0:s, f) / ϱt|s(y0:s, 1)

   = [∫∫ f(xt) p(xt−1, dxt) ϱt−1|s(y0:s, dxt−1)] / ϱt−1|s(y0:s, 1)

   = ∫∫ f(xt) p(xt−1, dxt) πt−1|s(y0:s, dxt−1).

3. Similarly, (1.12) follows by induction from (1.3):

   πt|t(y0:t, f) = ϱt|t(y0:t, f) / ϱt|t(y0:t, 1)

   = [∫ f(xt) λ(xt, yt) ϱt|t−1(y0:t−1, dxt)] / [∫ λ(xt, yt) ϱt|t−1(y0:t−1, dxt)]

   = [∫ f(xt) λ(xt, yt) πt|t−1(y0:t−1, dxt)] / [∫ λ(xt, yt) πt|t−1(y0:t−1, dxt)].


4. (1.14) is obtained from (1.5, 1.2, 1.3):

   αt|s(yt+1:s, xt) = [ϱt|t(y0:t, 1) / ϱs|s(y0:s, 1)] βt|s(yt+1:s, xt)

   = [ϱt+1|t(y0:t, 1) / ϱs|s(y0:s, 1)] ∫ λ(xt+1, yt+1) βt+1|s(yt+2:s, xt+1) p(xt, dxt+1)

   = [ϱt+1|t(y0:t, 1) ϱt+1|t+1(y0:t+1, 1) / ϱs|s(y0:s, 1)] · [∫ λ(xt+1, yt+1) βt+1|s(yt+2:s, xt+1) p(xt, dxt+1)] / [∫ λ(xt+1, yt+1) ϱt+1|t(y0:t, dxt+1)]

   = [∫ λ(xt+1, yt+1) αt+1|s(yt+2:s, xt+1) p(xt, dxt+1)] / [∫ λ(xt+1, yt+1) πt+1|t(y0:t, dxt+1)].

   The equality of (1.14) and (1.15) follows from (1.11). Recall from (1.8) that ϱs|s(y0:s, 1) = ϱt|s(y0:s, 1). Together with (1.4), this finally implies

   πt|s(y0:s, dxt) = ϱt|s(y0:s, dxt) / ϱt|s(y0:s, 1)

   = βt|s(yt+1:s, xt) ϱt|t(y0:t, dxt) / ϱt|s(y0:s, 1)

   = αt|s(yt+1:s, xt) πt|t(y0:t, dxt) ϱs|s(y0:s, 1) / ϱt|s(y0:s, 1)

   = αt|s(yt+1:s, xt) πt|t(y0:t, dxt),

   which is (1.13). □

As in the non-normalised case, the filtering problem of determining πt|t is now solved as follows, starting from π0|0:

π0|0 → π1|0 → π1|1 → π2|1 → π2|2 → ··· → πt|t−1 → πt|t,

i.e. alternating a prediction and a correction step. The prediction πt|s for t > s is obtained starting from the filter πs|s via

πs|s → πs+1|s → πs+2|s → ··· → πt−1|s → πt|s.

The smoother πt|s for t < s, on the other hand, can be computed starting from the filter πt|t and the backward recursion

1 = αs|s → αs−1|s → αs−2|s → ··· → αt+1|s → αt|s, (1.16)

where the step leading to αt|s involves the filter πt|t via the denominator of (1.15). This yields πt|s via (1.13). The denominators in (1.10, 1.12, 1.14, 1.15) correspond to a normalisation.

Sometimes, one may want to determine the joint law π0:t|s(y0:s, ·) of X(0 : t) given Y (0 : s) = y0:s for t ≥ s. It can be expressed in terms of ϱs|s(y0:s, 1):

Proposition 1.7 Let (yt)t∈N be a sequence in Rn and f : (Rm)t+1 → R a measurable nonnegative, bounded, or sufficiently integrable function. For s, t ∈ N with t ≥ s we have

π0:t|t(y0:t, f) = [∫···∫ f(x0:t) λ(xt, yt) p(xt−1, dxt) ··· λ(x1, y1) p(x0, dx1) λ(x0, y0) p0(dx0)] / ϱt|t(y0:t, 1)


and

π0:t|s(y0:s, f) = ∫···∫ f(x0:t) p(xs, dxs+1) ··· p(xt−1, dxt) π0:s|s(y0:s, dx0:s).

Proof. The first assertion follows from (1.7) and

π0:t|t(y0:t, f) = E(f(X(0 : t)) | Y (0 : t) = y0:t)

= [∫ f(x0:t) (dPY (0:t)|X(0:t)=x0:t/dϕ1+t)(y0:t) PX(0:t)(dx0:t)] / [∫ (dPY (0:t)|X(0:t)=x0:t/dϕ1+t)(y0:t) PX(0:t)(dx0:t)]

= [∫ f(x0:t) ∏_{r=0}^{t} λ(xr, yr) PX(0:t)(dx0:t)] / [∫ ∏_{r=0}^{t} λ(xr, yr) PX(0:t)(dx0:t)]

= [∫···∫ f(x0:t) ∏_{r=0}^{t} λ(xr, yr) p(xt−1, dxt) ··· p(x0, dx1) p0(dx0)] / ϱt|t(y0:t, 1),

where we use Lemma A.1 in the second step. For the second claim observe that

π0:t|s(y0:s, f) = E(f(X(0 : t)) | Y (0 : s) = y0:s)

= E(E(f(X(0 : t)) | Fs) | Y (0 : s) = y0:s)

= E(∫···∫ f(X(0 : s), xs+1:t) p(xt−1, dxt) ··· p(X(s), dxs+1) | Y (0 : s) = y0:s)

= ∫···∫ f(x0:t) p(xt−1, dxt) ··· p(xs, dxs+1) π0:s|s(y0:s, dx0:s). □

For purposes of parameter estimation, the joint law πt−1:t|s(y0:s, ·) of X(t−1 : t) given Y (0 : s) = y0:s for t ≤ s is of particular interest.

Proposition 1.8 Let (yt)t∈N be a sequence in Rn and f : (Rm)² → R a measurable nonnegative, bounded, or sufficiently integrable function. For s, t ∈ N with t ≤ s we have

πt−1:t|s(y0:s, f)

= [∫∫ f(xt−1, xt) λ(xt, yt) βt|s(yt+1:s, xt) p(xt−1, dxt) ϱt−1|t−1(y0:t−1, dxt−1)] / ϱs|s(y0:s, 1)

= [∫∫ f(xt−1, xt) λ(xt, yt) αt|s(yt+1:s, xt) p(xt−1, dxt) πt−1|t−1(y0:t−1, dxt−1)] / [∫ λ(xt, yt) πt|t−1(y0:t−1, dxt)],

where the kernels and functions are defined as in Theorems 1.3 and 1.6.

Proof. The previous proposition yields

πt−1:t|s(y0:s, f) = [∫···∫ f(xt−1:t) ∏_{r=0}^{s} λ(xr, yr) p(xs−1, dxs) ··· p(x0, dx1) p0(dx0)] / ϱs|s(y0:s, 1).


The numerator equals

∫···∫ [∫···∫ ∏_{r=t+1}^{s} λ(xr, yr) p(xs−1, dxs) ··· p(xt, dxt+1)] × f(xt−1:t) λ(xt, yt) ∏_{r=0}^{t−1} λ(xr, yr) p(xt−1, dxt) ··· p(x0, dx1) p0(dx0)

= ∫∫ βt|s(yt+1:s, xt) f(xt−1:t) λ(xt, yt) p(xt−1, dxt) ϱt−1|t−1(y0:t−1, dxt−1),

which shows the first equality. Using (1.3) for the numerator, we obtain

ϱt|t(y0:t, 1) / ϱt−1|t−1(y0:t−1, 1) = [∫ λ(xt, yt) ϱt|t−1(y0:t−1, dxt)] / ϱt|t−1(y0:t−1, 1) = ∫ λ(xt, yt) πt|t−1(y0:t−1, dxt),

which leads to the second equality. □

1.1.4 More explicit computations

In practice, it may not be obvious how to use the recursions in Theorems 1.3 and 1.6 because the kernels πt|s are quite “high-dimensional” objects. If, however, the state space is finite, we can represent them as matrices.

Example 1.9 (Finite state space) If the Markov process X has values in a finite set A = {a1, . . . , ak}, it is actually a Markov chain. Let us identify the law p0 with the vector (p0(a1), . . . , p0(ak)) ∈ Rk, again denoted by p0. Likewise, the kernel p is determined by the transition matrix (p(ai, aj))i,j=1,...,k, which we denote by p as well. Finally, the vector πt|s(y0:s) := (πt|s(y0:s)i)i=1,...,k := (πt|s(y0:s, ai))i=1,...,k represents the law πt|s(y0:s, ·).

With this notation, we can rephrase (1.10–1.14) as follows:

1. (inception step)

   π0|0(y0)i := λ(ai, y0)(p0)i / Σ_{j=1}^{k} λ(aj, y0)(p0)j, i = 1, . . . , k, (1.17)

2. (prediction step)

   πt|s(y0:s)i := Σ_{j=1}^{k} pji πt−1|s(y0:s)j, i = 1, . . . , k

   for t > s,

3. (correction step)

   πt|t(y0:t)i := λ(ai, yt) πt|t−1(y0:t−1)i / Σ_{j=1}^{k} λ(aj, yt) πt|t−1(y0:t−1)j, i = 1, . . . , k, (1.18)


4. (smoothing step)

   πt|s(y0:s)i = αt|s(yt+1:s)i πt|t(y0:t)i, i = 1, . . . , k,

   for t ≤ s, where the vectors αt|s(yt+1:s) ∈ Rk are defined recursively by αs|s := (1, . . . , 1) and

   αt|s(yt+1:s)i := Σ_{j=1}^{k} λ(aj, yt+1) αt+1|s(yt+2:s)j pij / Σ_{j=1}^{k} λ(aj, yt+1) πt+1|t(y0:t)j

   = Σ_{j=1}^{k} λ(aj, yt+1) αt+1|s(yt+2:s)j pij / Σ_{j,ℓ=1}^{k} λ(aj, yt+1) pℓj πt|t(y0:t)ℓ.

For later use we also mention πt−1:t|s(y0:s)ij := πt−1:t|s(y0:s, (ai, aj)) for t ≤ s and i, j = 1, . . . , k, which is obtained from Proposition 1.8 as

πt−1:t|s(y0:s)ij = λ(aj, yt) αt|s(yt+1:s)j pij πt−1|t−1(y0:t−1)i / Σ_{ℓ=1}^{k} λ(aℓ, yt) πt|t−1(y0:t−1)ℓ.

Observe that Y need not have values in a finite space to turn the filtering recursions into a set of finitely many equations. The expressions in Theorem 1.3 can be rephrased along the same lines, leading to linear recursions because the denominators are absent.
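The recursions (1.17)–(1.18) of Example 1.9 amount to a few lines of linear algebra. The following Python sketch alternates the prediction and correction steps for a finite state space; the function name and the two-state toy numbers are our own illustration, not part of the notes.

```python
import numpy as np

# Sketch of the finite-state recursions (1.17)-(1.18):
# p0: initial law (length k), p: transition matrix with p[i, j] = p(a_i, a_j),
# lam: k x (T+1) matrix with lam[i, t] = lambda(a_i, y_t).

def filter_finite(p0, p, lam):
    k, T1 = lam.shape
    pi = np.zeros((T1, k))          # pi[t] represents pi_{t|t}(y_{0:t})
    f = lam[:, 0] * p0              # inception step (1.17), before normalisation
    pi[0] = f / f.sum()
    for t in range(1, T1):
        pred = pi[t - 1] @ p        # prediction: sum_j p_{ji} pi_{t-1|t-1, j}
        f = lam[:, t] * pred        # correction step (1.18), before normalisation
        pi[t] = f / f.sum()
    return pi

# Two-state toy example (our own numbers):
p0 = np.array([0.5, 0.5])
p = np.array([[0.9, 0.1], [0.2, 0.8]])
lam = np.array([[0.8, 0.8, 0.1], [0.3, 0.3, 0.7]])  # lambda(a_i, y_t), t = 0, 1, 2
pi = filter_finite(p0, p, lam)
print(pi[-1])  # filter distribution pi_{2|2} for the toy data
```

Each row of `pi` is a probability vector; dropping the normalisations `f.sum()` yields the linear non-normalised recursions of Theorem 1.3 mentioned above.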

In the general case one typically applies numerical approximations in order to solve the recursions. One idea is to simply discretise the state space of X and apply the recursions of the previous example. A popular and successful alternative is to apply Monte Carlo simulation in order to sample approximately from the desired law πt|s.

Example 1.10 (Particle filter with bootstrap/condensation) The goal is to compute πt|t(y0:t, f) (or similarly πt|t−1(y0:t−1, f)) numerically. The idea is to produce a weighted sample (x(i)_t, w(i)_t), i = 1, . . . , N from this law. Since the empirical law Σ_{i=1}^{N} w(i)_t δ_{x(i)_t} should converge to πt|t(y0:t, ·) for N → ∞, we can use the approximation

πt|t(y0:t, f) ≈ Σ_{i=1}^{N} w(i)_t f(x(i)_t).

The weighted sample (x(i)_s, w(i)_s) is constructed sequentially for s = 0, . . . , t.

Step 1 (inception): Simulate a sample (x(1)_0, . . . , x(N)_0) from PX(0), leading to the weighted sample ((x(1)_0, w(1)_−1), . . . , (x(N)_0, w(N)_−1)) := ((x(1)_0, 1/N), . . . , (x(N)_0, 1/N)) for the law π0|−1 = PX(0).

Step 2s − 1 (correction): Define new weights

w(i)_s := λ(x(i)_s, ys) w(i)_{s−1} / Σ_{j=1}^{N} λ(x(j)_s, ys) w(j)_{s−1}, i = 1, . . . , N, (1.19)

which leads to a weighted sample (x(1)_s, w(1)_s), . . . , (x(N)_s, w(N)_s) for the law πs|s(y0:s, ·).


Step 2s (prediction): Simulate a sample x(i)_{s+1} from the law p(x(i)_s, ·) for i = 1, . . . , N. This leads to a weighted sample (x(1)_{s+1}, w(1)_s), . . . , (x(N)_{s+1}, w(N)_s) for the law πs+1|s(y0:s, ·).

This is repeated until we end up at Step 2t − 1. However, the above procedure suffers from the property that more and more weights tend to get very small, which means that the effective sample size is small. The way out is to insert a bootstrap step. The idea is to resample after the correction step. Specifically, one draws a sample x(1)_s, . . . , x(N)_s from the law Σ_{i=1}^{N} w(i)_s δ_{x(i)_s} and replaces the current weighted sample (x(1)_s, w(1)_s), . . . , (x(N)_s, w(N)_s) by the substitute (x(1)_s, 1/N), . . . , (x(N)_s, 1/N) with uniform weights.

Alternatively, we may want to simulate from the law π0:t|t(y0:t, ·) which is computed in Proposition 1.7. The idea is basically the same as above. But instead of sequentially constructing weighted samples (x(1)_s, w(1)_s), . . . , (x(N)_s, w(N)_s) from the law πs|s(y0:s, ·), we keep the old values in order to obtain weighted samples (x(1)_{0:s}, w(1)_s), . . . , (x(N)_{0:s}, w(N)_s) from the law π0:s|s(y0:s, ·). More specifically, step 1 is the same as above. In step 2s − 1 we define the above new weights in order to turn the weighted sample (x(1)_{0:s}, w(1)_{s−1}), . . . , (x(N)_{0:s}, w(N)_{s−1}) for π0:s|s−1(y0:s−1, ·) into a weighted sample (x(1)_{0:s}, w(1)_s), . . . , (x(N)_{0:s}, w(N)_s) for π0:s|s(y0:s, ·). In step 2s we simulate x(i)_{s+1} as above in order to turn the weighted sample (x(1)_{0:s}, w(1)_s), . . . , (x(N)_{0:s}, w(N)_s) for π0:s|s(y0:s, ·) into a weighted sample (x(1)_{0:s+1}, w(1)_s), . . . , (x(N)_{0:s+1}, w(N)_s) for π0:s+1|s(y0:s, ·). But as discussed earlier, the weights degenerate unless one introduces resampling steps every now and then. In the same vein as above, one draws a sample x(1)_{0:s}, . . . , x(N)_{0:s} from the law Σ_{i=1}^{N} w(i)_s δ_{x(i)_{0:s}} and replaces the current weighted sample (x(1)_{0:s}, w(1)_s), . . . , (x(N)_{0:s}, w(N)_s) by the substitute (x(1)_{0:s}, 1/N), . . . , (x(N)_{0:s}, 1/N) with uniform weights.

1.2 Optimal filtering in linear Gaussian state-space models

In linear Gaussian models the conditional laws πt|s(y0:s, ·) of the previous section are Gaussian as well, which implies that only their conditional means and covariance matrices need to be determined. This leads to a considerable simplification of the recursions. The resulting Kalman filter plays a predominant role in practice.

1.2.1 Kalman filter recursions

Definition 1.11 We call an Rm+n-valued process ((X, Y )(t))t∈N a linear Gaussian state space model if it is of the form

X(t) = aX + aXX X(t−1) + bX ZX(t), t = 1, 2, . . . , (1.20)
Y (t) = aY + aY X X(t) + bY ZY (t), t = 0, 1, 2, . . . , (1.21)

where aX ∈ Rm, aXX ∈ Rm×m, bX ∈ Rm×m, aY ∈ Rn, aY X ∈ Rn×m, bY ∈ Rn×n, the random vector X(0) is Gaussian with mean µX(0) and covariance matrix ΣX(0), and (ZX , ZY )(t), t ∈ N is a sequence of independent standard Gaussian random vectors in Rm+n, independent as well of X(0).

Examples 0.1, 0.2 from the introduction provide instances of such models.


Example 1.12 1. In Example 0.1 we have m = n = 1, aX = 0, aXX = 1, bX = σX, aY = 0, aY X = 1, bY = σY.

2. In Example 0.2, the pair (v,X) corresponds to what is called X = (X1, X2) above. In this sense, the parameters are

aX = (0, 0)⊤,

aXX = ( 1−λ  0
         1   1 ),

bX = ( σv  0
        0  0 ),

aY = 0, aY X = (0, 1), bY = σY.

Fix a linear Gaussian state space model in the sense of Definition 1.11. As in Section 1.1 we want to solve the filtering, prediction, or smoothing problem by determining the conditional law πt|s(y0:s, ·).

Theorem 1.13 1. πt|s(y0:s, ·) is Gaussian for any s, t ∈ N. We denote its mean and covariance matrix by µt|s(y0:s) and Σt|s(y0:s), respectively.

2. The conditional mean µt|s(y0:s) is an affine function of y0:s.

3. The conditional covariance matrix Σt|s(y0:s) does not depend on the observations y0:s.

4. µt|s(y0:s) and Σt|s satisfy the recursions:

(a) (inception step)

    µ0|−1 := µX(0),
    Σ0|−1 := ΣX(0),

(b) (prediction step)

    µt|s(y0:s) = aX + aXX µt−1|s(y0:s),
    Σt|s = aXX Σt−1|s aXX⊤ + bX bX⊤

    for t > s,

(c) (correction step)

    µt|t(y0:t) = µt|t−1(y0:t−1) + Σt|t−1 aY X⊤ (aY X Σt|t−1 aY X⊤ + bY bY⊤)−1 (yt − aY − aY X µt|t−1(y0:t−1)),
    Σt|t = Σt|t−1 − Σt|t−1 aY X⊤ (aY X Σt|t−1 aY X⊤ + bY bY⊤)−1 aY X Σt|t−1

    for t ≥ 0,

(d) (smoothing step)

    Gt := Σt|t aXX⊤ (Σt+1|t)−1,
    µt|s(y0:s) = µt|t(y0:t) + Gt (µt+1|s(y0:s) − µt+1|t(y0:t)),
    Σt|s = Σt|t + Gt (Σt+1|s − Σt+1|t) Gt⊤

    for t < s.


5. For s > t the conditional law πt:t+1|s(y0:s, ·) of X(t : t+1) given Y (0 : s) = y0:s is Gaussian with mean µt:t+1|s(y0:s) := (µt|s(y0:s), µt+1|s(y0:s)) and covariance matrix

Σt:t+1|s := ( Σt|s         Gt Σt+1|s
              Σt+1|s Gt⊤   Σt+1|s ).

Proof.

1. This follows from (A.2) because (X, Y )(0 : t) is a Gaussian random vector.

2. This follows from (A.2) as well.

3. This follows from (A.2) as well.

4. (a) This is a definition.

   (b) This follows from (1.20) and the fact that ZX(t) is independent of Ft−1.

   (c) (1.21) and the independence of ZY (t) and Ft−1 ∨ σ(X(t)) yield that the conditional law of (X(t), Y (t)) given Y (0 : t−1) = y0:t−1 is Gaussian with mean (µt|t−1, aY + aY X µt|t−1) and covariance matrix

       ( Σt|t−1         Σt|t−1 aY X⊤
         aY X Σt|t−1    aY X Σt|t−1 aY X⊤ + bY bY⊤ ).

       The assertion now follows from (A.2) and Lemma A.7.

   (d) By (1.20) the conditional law of (X(t), X(t+1)) given Y (0 : t) = y0:t is Gaussian with mean (µt|t, µt+1|t) and covariance matrix

       ( Σt|t        Σt|t aXX⊤
         aXX Σt|t    Σt+1|t ).

       Using (A.2) we conclude that the conditional law of X(t) given X(t+1) = xt+1 and Y (0 : t) = y0:t is Gaussian with mean µt|t + Σt|t aXX⊤ (Σt+1|t)−1 (xt+1 − µt+1|t) and covariance matrix Σt|t − Σt|t aXX⊤ (Σt+1|t)−1 aXX Σt|t. By Lemma A.2 and since σ(X(t+1), Y (0 : s)) = σ(X(t+1), Y (0 : t), bY ZY (t+1 : s)), this law coincides with the conditional law of X(t) given X(t+1) = xt+1 and Y (0 : s) = y0:s. Since the conditional law of X(t+1) given Y (0 : s) = y0:s is Gaussian with mean µt+1|s and covariance matrix Σt+1|s, we obtain from Lemma A.9 that the conditional law of (X(t), X(t+1)) given Y (0 : s) = y0:s is Gaussian with mean (µt|t + Gt(µt+1|s − µt+1|t), µt+1|s) and covariance matrix

       ( Σt|t + Gt(Σt+1|s − Σt+1|t)Gt⊤   Gt Σt+1|s
         Σt+1|s Gt⊤                      Σt+1|s ).

       Since the conditional law of X(t) given Y (0 : s) = y0:s is just the marginal of the conditional law of (X(t), X(t+1)) given Y (0 : s) = y0:s, the assertion follows.

5. This is shown in 4. □


1.2.2 Slightly more general Kalman filter recursions

In some applications the observation process may have an effect on the signal as well. We consider this more general case separately:

Definition 1.14 We call an R^{m+n}-valued process ((X, Y)(t))_{t∈N} a general linear Gaussian state space model if it is of the form

X(t) = a_X(t−1) + A_{XX}(t−1) X(t−1) + A_{XY}(t−1) Y(t−1) + B_X(t−1) Z(t),

Y(t) = a_Y(t−1) + A_{YX}(t−1) X(t−1) + A_{YY}(t−1) Y(t−1) + B_Y(t−1) Z(t),

where

a(t−1) = ( a_X(t−1)
           a_Y(t−1) ) ∈ R^{m+n},

A(t−1) = ( A_{XX}(t−1)  A_{XY}(t−1)
           A_{YX}(t−1)  A_{YY}(t−1) ) ∈ R^{(m+n)×(m+n)},

B(t−1) = ( B_X(t−1)
           B_Y(t−1) ) ∈ R^{(m+n)×(m+n)}

for t = 1, 2, . . . , the random vector (X(0), Y(0)) is Gaussian with mean µ(0) = (µ_X(0), µ_Y(0)) and covariance matrix

Σ(0) = ( Σ_{XX}(0)  Σ_{XY}(0)
         Σ_{YX}(0)  Σ_{YY}(0) ),

and Z(t), t = 1, 2, . . . is a sequence of independent standard Gaussian random vectors in R^{m+n} which are independent of (X(0), Y(0)) as well.
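A path of such a model can be simulated directly from the definition. The following sketch (function and argument names are ours) treats the coefficients a, A, B as callables in t−1, so that both time-homogeneous and time-dependent models are covered:

```python
import numpy as np

def simulate_glgssm(a, A, B, mu0, Sig0, T, rng=None):
    """Draw a path of the general linear Gaussian state space model
    (X, Y)(t) = a(t-1) + A(t-1) (X, Y)(t-1) + B(t-1) Z(t), Z(t) iid N(0, I)."""
    rng = np.random.default_rng(rng)
    d = len(mu0)                               # d = m + n
    xy = np.empty((T + 1, d))
    xy[0] = rng.multivariate_normal(mu0, Sig0)  # (X(0), Y(0))
    for t in range(1, T + 1):
        z = rng.standard_normal(d)             # Z(t) ~ N(0, I_{m+n})
        xy[t] = a(t - 1) + A(t - 1) @ xy[t - 1] + B(t - 1) @ z
    return xy                                  # columns: X-block first, then Y-block
```

Constant-coefficient models are obtained by passing lambdas that ignore their argument.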

Since X itself is not necessarily a Markov process and since the coefficients a, A, B may depend on time, a general linear Gaussian state space model may fail to be a hidden Markov model in the sense of Definition 1.2.

Before we extend Theorem 1.13 to this more general setup, we observe that Definition1.11 is truly a special case of Definition 1.14.

Lemma 1.15 (X, Y) in Definition 1.11 is a general linear Gaussian state space model with coefficients

a(t−1) := ( a_X
            a_Y + a_{YX} a_X ),

A(t−1) := ( a_{XX}          0
            a_{YX} a_{XX}   0 ),

B(t−1) := ( b_X          0
            a_{YX} b_X   b_Y ),

µ(0) := ( µ_X(0)
          a_Y + a_{YX} µ_X(0) ),

Σ(0) := ( Σ_X(0)            Σ_X(0) a_{YX}^⊤
          a_{YX} Σ_X(0)     a_{YX} Σ_X(0) a_{YX}^⊤ + b_Y b_Y^⊤ )

for t = 1, 2, . . .


Proof. Observe that Y(t) = a_Y + a_{YX} a_X + a_{YX} a_{XX} X(t−1) + a_{YX} b_X Z_X(t) + b_Y Z_Y(t).

Fix a general linear Gaussian state space model in the sense of Definition 1.14. We denote by π̄_{t|s}(y_{0:s}, ·) the conditional law of (X, Y)(t) given Y(0 : s) = y_{0:s}. Its marginal law π̄_{t|s}(y_{0:s}, · × R^n) = π_{t|s}(y_{0:s}, ·) is the object that we want to determine. The following result parallels Theorem 1.13.

Theorem 1.16 1. π̄_{t|s}(y_{0:s}, ·) is Gaussian for any s, t ∈ N. We denote its mean and covariance matrix by

µ̄_{t|s}(y_{0:s}) = ( µ^X_{t|s}(y_{0:s})
                      µ^Y_{t|s}(y_{0:s}) )

and

Σ̄_{t|s}(y_{0:s}) = ( Σ^{XX}_{t|s}(y_{0:s})  Σ^{XY}_{t|s}(y_{0:s})
                      Σ^{YX}_{t|s}(y_{0:s})  Σ^{YY}_{t|s}(y_{0:s}) )

                  = ( Σ^{·X}_{t|s}(y_{0:s})  Σ^{·Y}_{t|s}(y_{0:s}) ) = ( Σ^{X·}_{t|s}(y_{0:s})
                                                                         Σ^{Y·}_{t|s}(y_{0:s}) ),

respectively. Consequently, π_{t|s}(y_{0:s}, ·) is Gaussian as well, with mean µ^X_{t|s}(y_{0:s}) and covariance matrix Σ^X_{t|s}(y_{0:s}) := Σ^{XX}_{t|s}(y_{0:s}).

2. We have

µ̄_{t|s}(y_{0:s}) = ( µ^X_{t|s}(y_{0:s})
                      y_t )

and

Σ̄_{t|s}(y_{0:s}) = ( Σ^{XX}_{t|s}(y_{0:s})  0
                      0                      0 )

for s ≥ t.

3. The conditional mean µ̄_{t|s}(y_{0:s}) is an affine function of y_{0:s}.

4. The conditional covariance matrix Σ̄_{t|s}(y_{0:s}) does not depend on the observations y_{0:s}.

5. µ̄_{t|s}(y_{0:s}) and Σ̄_{t|s} satisfy the recursions:

(a) (inception step)

µ̄_{0|−1} := µ(0),

Σ̄_{0|−1} := Σ(0),

(b) (prediction step)

µ̄_{t|s}(y_{0:s}) = a(t−1) + A(t−1) µ̄_{t−1|s}(y_{0:s}),

Σ̄_{t|s} = A(t−1) Σ̄_{t−1|s} A(t−1)^⊤ + B(t−1) B(t−1)^⊤

for t > s,


(c) (correction step)

µ̄_{t|t}(y_{0:t}) = µ̄_{t|t−1}(y_{0:t−1}) + Σ^{·Y}_{t|t−1} (Σ^{YY}_{t|t−1})^{-1} (y_t − µ^Y_{t|t−1}(y_{0:t−1})),

Σ̄_{t|t} = Σ̄_{t|t−1} − Σ^{·Y}_{t|t−1} (Σ^{YY}_{t|t−1})^{-1} Σ^{Y·}_{t|t−1}

for t ≥ 0,

(d) (smoothing step)

Gt := Σt|tA(t)>Σ−1t+1|t,

µt|s(y0:s) = µt|t(y0:t) +Gt

(µt+1|s(y0:s)− µt+1|t(y0:t)

),

Σt|s = Σt|t +Gt

(Σt+1|s − Σt+1|t)G

>t

for t < s.

6. For s > t the conditional law π̄_{t:t+1|s}(y_{0:s}, ·) of (X, Y)(t : t+1) given Y(0 : s) = y_{0:s} is Gaussian with mean µ̄_{t:t+1|s}(y_{0:s}) := (µ̄_{t|s}(y_{0:s}), µ̄_{t+1|s}(y_{0:s})) and covariance matrix

Σ̄_{t:t+1|s} := ( Σ̄_{t|s}            G_t Σ̄_{t+1|s}
                  Σ̄_{t+1|s} G_t^⊤    Σ̄_{t+1|s} ).

Proof. This follows along the same lines as Theorem 1.13.
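The prediction and correction steps of Theorem 1.16 can be sketched as follows. The state is the joint vector with the X-block first (m components); a pseudo-inverse replaces the inverse in case Σ^{YY}_{t|t−1} is singular. Names and interface are our own choices.

```python
import numpy as np

def general_kalman_filter(y, a, A, B, mu0, Sig0, m):
    """Filter recursions of Theorem 1.16 for the joint state (X, Y);
    returns the filtered mean and covariance of the X-block given Y(0:t)."""
    mu, Sig = mu0.copy(), Sig0.copy()           # inception
    out = []
    for t, yt in enumerate(y):
        if t > 0:                                # prediction step
            mu = a(t - 1) + A(t - 1) @ mu
            Sig = A(t - 1) @ Sig @ A(t - 1).T + B(t - 1) @ B(t - 1).T
        SYY = Sig[m:, m:]                        # Sigma^{YY} block
        K = Sig[:, m:] @ np.linalg.pinv(SYY)     # Sigma^{.Y} (Sigma^{YY})^{-1}
        mu = mu + K @ (yt - mu[m:])              # correction step
        Sig = Sig - K @ Sig[m:, :]
        out.append((mu[:m].copy(), Sig[:m, :m].copy()))
    return out
```

After each correction the Y-block of the mean equals y_t and the Y-block of the covariance vanishes, in line with statement 2 of the theorem.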

Remark 1.17 (Optimal linear filtering) Suppose that (X̃, Ỹ) is a process that shares the first and second moments with a linear Gaussian state space model (X, Y) as in Definition 1.14, i.e. E(X̃_i(t)) = E(X_i(t)), E(Ỹ_i(t)) = E(Y_i(t)), Cov(X̃_i(s), X̃_j(t)) = Cov(X_i(s), X_j(t)), Cov(X̃_i(s), Ỹ_j(t)) = Cov(X_i(s), Y_j(t)), Cov(Ỹ_i(s), X̃_j(t)) = Cov(Y_i(s), X_j(t)), Cov(Ỹ_i(s), Ỹ_j(t)) = Cov(Y_i(s), Y_j(t)) for any applicable s, t, i, j. We may call (X, Y) a Gaussian equivalent of (X̃, Ỹ). In this case

Z⋆ := ∫ x_t π_{t|s}(y_{0:s}, dx_t)

with π_{t|s} from Theorem 1.16 provides the best linear filter/predictor/smoother of X̃(t) given Ỹ(0 : s) = y_{0:s} in the sense that it minimises

Z ↦ E((X̃(t) − Z)²)

over all random variables of the form Z = f(Ỹ(0 : s)) where f : R^{n(s+1)} → R^m is an affine function. This holds because the solution to this optimisation problem depends only on the first two moments of the process (X̃, Ỹ).

Example 1.18 As an example for Remark 1.17 we reconsider Example 0.3 with a Markovprocess v(t) satisfying E(v(t)|v(t− 1)) = α+βv(t− 1) for some constants α, β ∈ R. Thisholds e.g. for the popular Heston model if it is restricted to discrete time. Set X(t) = v(t)


and Y(t) = (Y_1(t), Y_2(t)) := (∆X(t), (∆X(t))²). It is easy to see that (X, Y) allows for a Gaussian equivalent as in Definition 1.14 with coefficients

a(t−1) = ( α
           0
           0 ),

A(t−1) = ( β  0  0
           0  0  0
           1  0  0 ),

B(t−1) = ( √(E((v(t) − α − βv(t−1))²))   0              0
           0                              √(E(v(t−1)))   0
           0                              0              √(3E(v(t−1))) )

for t = 1, 2, . . . Remark 1.17 now allows us to compute e.g. the best linear filter of v(t) based on linear combinations of past observations ∆X(s), (∆X(s))², s = 0, . . . , t.
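A sketch of this best linear filter, plugging the coefficients above into the recursions of Theorem 1.16: the stationary moment E(v) = α/(1−β), the initial mean and the initial covariance used below are simplifying assumptions of ours, as are all names.

```python
import numpy as np

def sv_linear_filter(dx, alpha, beta, s2):
    """Best linear filter of v(t) from dX(s), dX(s)^2, s <= t, via the
    Gaussian equivalent of Example 1.18. s2 is the one-step prediction
    variance E((v(t) - alpha - beta v(t-1))^2), assumed known."""
    Ev = alpha / (1.0 - beta)                  # stationary mean (assumption)
    a = np.array([alpha, 0.0, 0.0])
    A = np.array([[beta, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    B = np.diag([np.sqrt(s2), np.sqrt(Ev), np.sqrt(3.0 * Ev)])
    mu = np.array([Ev, 0.0, Ev])               # rough initial mean (assumption)
    Sig = np.eye(3)                             # rough initial covariance (assumption)
    est = []
    for t, r in enumerate(dx):
        if t > 0:                               # prediction step
            mu = a + A @ mu
            Sig = A @ Sig @ A.T + B @ B.T
        y = np.array([r, r * r])                # observe (dX(t), dX(t)^2)
        K = Sig[:, 1:] @ np.linalg.pinv(Sig[1:, 1:])
        mu = mu + K @ (y - mu[1:])              # correction step
        Sig = Sig - K @ Sig[1:, :]
        est.append(mu[0])                       # linear filter of v(t)
    return est
```

Being linear, the filter is not guaranteed to stay nonnegative even though v itself is a variance process.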

1.3 Maximum likelihood estimation

So far we assumed that the law of the process (X, Y) and hence e.g. the vectors and matrices a_X, a_{XX}, b_X, a_Y, a_{YX}, b_Y, µ_X(0), Σ_X(0) in Definition 1.14 were known. This may be the case in some applications as e.g. Examples 0.1 and 0.2. In others as e.g. Example 1.18 the natural way to determine the unknown parameters in the model is statistical inference based on the observations Y(0), Y(1), . . . . One may even be interested in this question of parameter estimation independently of our application to filtering. We consider the maximum likelihood approach in these notes because it is known to often produce reasonable and in some sense even asymptotically optimal results.

1.3.1 Likelihood inference

Before we study the more involved case of a hidden Markov model, we recall some general facts and notions. To this end we set Ω := (R^d)^N, we denote the R^d-valued canonical process by X and its natural filtration by (F_t)_{t∈N}. Moreover, we consider probability measures P_0 and P_ϑ for any ϑ ∈ Θ ⊂ R^k. The following simple lemma states that the joint density of X(0 : t) can be computed from the one-period conditional densities.

Lemma 1.19 Suppose that P^{X(t)|F_{t−1}}_ϑ ≪ P^{X(t)|F_{t−1}}_0 with density

x_t ↦ ϱ_{ϑ,t|t−1}(X(0 : t−1), x_t)

for any t ∈ N. For t = 0 we set F_{−1} := {∅, Ω} so that ϱ_{ϑ,0|−1} = dP^{X(0)}_ϑ/dP^{X(0)}_0. Then

P^{X(0:t)}_ϑ ≪ P^{X(0:t)}_0, t ∈ N (1.22)

with density

ϱ_{ϑ,t}(x_{0:t}) := ∏_{s=0}^{t} ϱ_{ϑ,s|s−1}(x_{0:s−1}, x_s). (1.23)
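For a concrete instance of (1.23), consider a fully observed Gaussian AR(1) model X(t) = ϑX(t−1) + Z(t) with iid standard normal Z(t). The densities below are taken with respect to Lebesgue measure rather than some P_0, which changes the likelihood only by a factor not depending on ϑ; model choice and names are ours.

```python
import math

def ar1_loglik(x, theta, sigma=1.0):
    """log of (1.23) for a Gaussian AR(1): X(t) | X(t-1) ~ N(theta*X(t-1), sigma^2),
    X(0) ~ N(0, sigma^2); one-period conditional log-densities are summed."""
    ll, prev = 0.0, 0.0
    for t, xt in enumerate(x):
        mean = theta * prev if t > 0 else 0.0
        ll += -0.5 * math.log(2 * math.pi * sigma ** 2) - (xt - mean) ** 2 / (2 * sigma ** 2)
        prev = xt
    return ll
```

For this model the likelihood is quadratic in ϑ, so the MLE is available in closed form: ϑ̂ = Σ_{t≥1} x_t x_{t−1} / Σ_{t≥1} x_{t−1}².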


Proof. Cf. Lemma A.6.

(1.22) means that P_ϑ is locally absolutely continuous with respect to P_0, which is written as P_ϑ ≪_loc P_0. In this case we can define the density process

Z_ϑ(t) := (dP^{X(0:t)}_ϑ/dP^{X(0:t)}_0)(X(0 : t)) = ϱ_{ϑ,t}(X(0 : t)), t ∈ N.

It has the following martingale property.

Lemma 1.20 The density process Z_ϑ = (Z_ϑ(t))_{t∈N} is a P_0-martingale, i.e. we have E_0(Z_ϑ(t)|F_{t−1}) = Z_ϑ(t−1) for any t ≥ 1.

Proof. This is left as an exercise.

For likelihood-based inference we consider the density ϱ_{ϑ,t}(X(0 : t)) or its logarithm log ϱ_{ϑ,t}(X(0 : t)) as a function of the unknown parameter ϑ. If it exists, the maximum likelihood estimator (MLE) is the value ϑ̂ maximising

ϑ ↦ ϱ_{ϑ,t}(X(0 : t))

or, equivalently,

ϑ ↦ log ϱ_{ϑ,t}(X(0 : t)).

If the model is regular enough, the optimiser ϑ = ϑ̂ is obtained by the corresponding first-order condition, i.e. by solving the estimating equation

s_{ϑ,t}(X(0 : t)) = 0,

where

s_{ϑ,t}(x_{0:t}) := ∇_ϑ log ϱ_{ϑ,t}(x_{0:t})

denotes the score function. Here and in the following, ∇_ϑ denotes the gradient of the following expression, interpreted as a function of ϑ ∈ Θ ⊂ R^k.

Lemma 1.21 In the statements below we suppose that the derivatives exist and that differentiation and integration can be interchanged in the proof of the second result.

1. Define the conditional score function by

s_{ϑ,t|t−1}(x_{0:t−1}, x_t) := ∇_ϑ log ϱ_{ϑ,t|t−1}(x_{0:t−1}, x_t)

for ϑ ∈ Θ, t ∈ N, x_{0:t} ∈ (R^d)^{1+t}. Then

s_{ϑ,t}(x_{0:t}) = ∑_{s=0}^{t} s_{ϑ,s|s−1}(x_{0:s−1}, x_s).

2. Define the score process as

S_ϑ(t) := s_{ϑ,t}(X(0 : t)).

Then E_ϑ(S_ϑ(0)) = 0 and S_ϑ = (S_ϑ(t))_{t∈N} is a P_ϑ-martingale.
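Both statements are easy to illustrate on the Gaussian AR(1) model X(t) = ϑX(t−1) + Z(t): the conditional scores are (x_s − ϑx_{s−1})x_{s−1}, and the zero-mean property of the score process can be checked by Monte Carlo. The following sketch (names and sample sizes are ours) does exactly that:

```python
import random

def ar1_score(x, theta):
    # Lemma 1.21.1 for a Gaussian AR(1): the score process is the sum of the
    # conditional scores s_{theta,s|s-1}(x_{0:s-1}, x_s) = (x_s - theta x_{s-1}) x_{s-1};
    # the s = 0 term does not depend on theta and therefore vanishes.
    return sum((x[s] - theta * x[s - 1]) * x[s - 1] for s in range(1, len(x)))

def mean_score(theta, T=5, n_paths=20000, seed=1):
    """Monte Carlo check of E_theta(S_theta(t)) = 0 (cf. Lemma 1.21.2)."""
    rng = random.Random(seed)
    tot = 0.0
    for _ in range(n_paths):
        x = [rng.gauss(0, 1)]                  # X(0) ~ N(0, 1)
        for _ in range(T):
            x.append(theta * x[-1] + rng.gauss(0, 1))
        tot += ar1_score(x, theta)
    return tot / n_paths
```

The Monte Carlo average should be close to zero when the scores are evaluated at the true parameter.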


Proof.

1. This follows from (1.23).

2. The martingale property follows from

E_ϑ(∆S_ϑ(t)|F_{t−1}) = E_ϑ( ∇_ϑ log ϱ_{ϑ,t|t−1}(X(0 : t−1), X(t)) | F_{t−1} )

= E_0( (∇_ϑ ϱ_{ϑ,t|t−1}(X(0 : t−1), X(t)) / ϱ_{ϑ,t|t−1}(X(0 : t−1), X(t))) ϱ_{ϑ,t|t−1}(X(0 : t−1), X(t)) | F_{t−1} )

= ∇_ϑ E_0( ϱ_{ϑ,t|t−1}(X(0 : t−1), X(t)) | F_{t−1} )

= ∇_ϑ 1 = 0.

The equality E_ϑ(S_ϑ(0)) = 0 follows similarly.

Provided that the parametric model is regular enough, we can hope for some agreeableproperties of the MLE.

1. Using some law of large numbers it often follows that the score function is consistent in the sense that the properly rescaled score process converges in P_ϑ-probability to 0 for t → ∞. This is a key property for showing that the MLE ϑ̂ itself is consistent, i.e. P_ϑ-lim_{t→∞} ϑ̂(t) = ϑ, where ϑ̂(t) denotes the MLE based on the observations X(0 : t).

2. Generalisations of the central limit theorem often yield that the score function is also asymptotically normal in the sense that the properly normalised score process converges in P_ϑ-law to a Gaussian random vector for t → ∞. This in turn is used to verify that the MLE is asymptotically normal, i.e. lim_{t→∞} L(C(t)^{-1}(ϑ̂(t) − ϑ)) = N_k(0, 1) weakly with some deterministic matrices C(t) ∈ R^{k×k}, t ∈ N. If we are lucky, it is even asymptotically efficient in the sense that the asymptotic covariance reaches the Cramér-Rao bound and turns out to be minimal among the asymptotically normal estimators.

3. Quasi-likelihood estimators are often based on estimating equations s̃_{ϑ,t}(X(0 : t)) = 0 with alternative estimating functions s̃_{ϑ,t} : (R^d)^{1+t} → R^k instead of the score function s_{ϑ,t}. If they are unbiased in the sense that E_ϑ(s̃_{ϑ,t}(X(0 : t))) = 0 and if additional regularity holds, we may hope for consistency and asymptotic normality of the corresponding estimator. Asymptotic efficiency, however, cannot be expected any more. We refer to [Sør12, Section 1.10] in this context.

1.3.2 Parametric models with partial observations

We distinguish three general setups below for addressing statistical inference. In order to derive the MLE for these classes of partially observed models, we consider two different approaches, one based on filtering the conditional density in Section 1.3.3, the second on smoothing the conditional score function in Section 1.3.4. The first approach is applied to linear Gaussian state space models in Section 1.3.6, the second to discrete state space models in Section 1.3.5.


1.3.2.1 Hidden Markov models

We start by considering hidden Markov models as in Section 1.1. For ease of exposition we work on the canonical space. Put differently, we set Ω := (R^{m+n})^N and denote the canonical R^{m+n}-valued process by (X, Y) and its natural filtration by (F_t)_{t∈N}. The probability measure P = P^{(X,Y)} on Ω now depends on a parameter or parameter vector ϑ ∈ Θ. This is indicated by writing P_ϑ, K_ϑ(x_t, dy_t), p_ϑ(x_{t−1}, dx_t), p_{ϑ,0}(dx_0), λ_ϑ(x_t, y_t) etc. instead of P, K(x_t, dy_t), p(x_{t−1}, dx_t), p_0(dx_0), λ(x_t, y_t) etc. in Section 1.1. In order to apply likelihood methods, we assume that the densities

λ_ϑ(x_t, y_t) = K_ϑ(x_t, dy_t)/ϕ(dy_t)

all refer to the same probability measure ϕ which does not depend on ϑ.

We introduce the space Ω̄ := (R^n)^N with canonical process Y and its filtration (F̄_t)_{t∈N}.

It represents the canonical space for the observations without the signal. The above laws P_ϑ = P^{(X,Y)}_ϑ induce probability measures P̄_ϑ := P^Y_ϑ on Ω̄. In line with Section 1.3.1 we write

ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, ·) := dP̄^{Y(t)|Y(0:t−1)=y_{0:t−1}}_ϑ/dϕ = dP^{Y(t)|Y(0:t−1)=y_{0:t−1}}_ϑ/dϕ

and

ϱ̄_{ϑ,t} := dP̄^{Y(0:t)}_ϑ/dϕ^{1+t} = dP^{Y(0:t)}_ϑ/dϕ^{1+t}

for Radon-Nikodym densities with respect to ϕ and ϕ^{1+t} = ⊗_{s=0}^{t} ϕ, respectively.

1.3.2.2 Hidden Markov models with dominating measure

As in the previous paragraph we set Ω := (R^{m+n})^N and denote the canonical R^{m+n}-valued process by (X, Y) and its natural filtration by (F_t)_{t∈N}. For the approach in Section 1.3.4 we need that not only the laws P^{Y(0:t)}_ϑ, ϑ ∈ Θ of the observation process Y but also the joint laws P^{(X,Y)(0:t)}_ϑ, ϑ ∈ Θ are absolutely continuous with respect to the same measure. To this end, consider probability measures P_0 and P_ϑ, ϑ ∈ Θ on Ω such that P^{(X,Y)(t)|F_{t−1}}_ϑ ≪ P^{(X,Y)(t)|F_{t−1}}_0, t ∈ N with density

(x_t, y_t) ↦ ϱ_{ϑ,t|t−1}((X, Y)(0 : t−1), (x_t, y_t)). (1.24)

In order to obtain a hidden Markov model as in Definition 1.2, we suppose that P_0 = P^X_0 ⊗ ϕ^N with some law P^X_0 on (R^m)^N such that X is a Markov process under P_0 with initial law P^{X(0)}_0 and transition kernel

P^{X(t)|F_{t−1}}_0(ω, dx_t) = p_0(X(ω, t−1), dx_t).

Moreover, we write ϕ^N := ⊗_{t∈N} ϕ. This means that the R^n-valued random variables Y(0), Y(1), . . . are iid under P_0 with law ϕ and also independent of X. The conditional densities (1.24) are supposed to be of the form

ϱ_{ϑ,t|t−1}((X, Y)(0 : t−1), (x_t, y_t)) = ϱ^X_ϑ(X(t−1), x_t) λ_ϑ(x_t, y_t) (1.25)


for t ≥ 1 and

ϱ_{ϑ,0|−1}(x_0, y_0) = ϱ^X_{ϑ,0}(x_0) λ_ϑ(x_0, y_0) (1.26)

with some functions ϱ^X_ϑ, λ_ϑ, ϱ^X_{ϑ,0}.

Lemma 1.22 Relative to P_ϑ, the process (X, Y) is a hidden Markov model in the sense of Definition 1.2. The corresponding functions are

p_ϑ(x_{t−1}, dx_t) := ϱ^X_ϑ(x_{t−1}, x_t) p_0(x_{t−1}, dx_t),

λ_ϑ(x_t, y_t) = λ_ϑ(x_t, y_t),

p_{ϑ,0}(dx_0) := ϱ^X_{ϑ,0}(x_0) P^{X(0)}_0(dx_0)

instead of p(x_{t−1}, dx_t), λ(x_t, y_t), p_0(dx_0) in Definition 1.2.

Proof. It is easy to verify that the transition kernel under Pϑ is as in Definition 1.2.

Finally, we introduce once more the space Ω̄ := (R^n)^N with canonical process Y and its filtration (F̄_t)_{t∈N}. It represents the canonical space for the observations without the signal. The above laws P_0 = P^{(X,Y)}_0, P_ϑ = P^{(X,Y)}_ϑ induce probability measures P̄_0 := P^Y_0 and P̄_ϑ := P^Y_ϑ on Ω̄.

1.3.2.3 General Markov process with dominating measure

The general linear Gaussian state space model of Section 1.2.2 is not necessarily of HMM type in the sense of Definition 1.2. In order to cover it as well, we introduce a third setup. Once more we consider the canonical process (X, Y) on Ω := (R^{m+n})^N with canonical filtration (F_t)_{t∈N} and laws P_0 as well as P_ϑ, ϑ ∈ Θ on that space such that (X, Y) is a Markov process and P^{(X,Y)(t)|(X,Y)(t−1)}_ϑ ≪ P^{(X,Y)(t)|(X,Y)(t−1)}_0, t ∈ N with densities

(x_t, y_t) ↦ ϱ_{ϑ,t|t−1}((X, Y)(t−1), (x_t, y_t)). (1.27)

We do not require an HMM structure as in Section 1.1.1 but we do suppose that Y(0), Y(1), . . . are independent and also independent of X under P_0. Put differently, we require

P_0 = P^X_0 ⊗ (⊗_{t∈N} P^{Y(t)}_0). (1.28)

As in Sections 1.3.2.1 and 1.3.2.2, we introduce the space Ω̄ := (R^n)^N with canonical process Y and its filtration (F̄_t)_{t∈N}, standing for the observations alone. Again, P_0 = P^{(X,Y)}_0, P_ϑ = P^{(X,Y)}_ϑ induce probability measures P̄_0 := P^Y_0 and P̄_ϑ := P^Y_ϑ on Ω̄.

1.3.3 Likelihood estimation by filtering conditional densities

We start by considering the HMM setup of Section 1.3.2.1. First recall the following imme-diate consequence of Remark 1.5.


Corollary 1.23 If ϱ_{ϑ,t|t} denotes the non-normalised filter of Section 1.1 relative to P_ϑ, we have

ϱ̄_{ϑ,t}(y_{0:t}) = ϱ_{ϑ,t|t}(y_{0:t}, 1) (1.29)

for ϕ^{1+t}-almost any y_{0:t}.

Example 1.24 In order to approximate ϱ_{ϑ,t|t}(y_{0:t}, f) rather than π_{ϑ,t|t}(y_{0:t}, f) with the particle filter in Example 1.10, use

ϱ_{t|t}(y_{0:t}, f) ≈ ∑_{i=1}^{N} w^{(i)}_t f(x^{(i)}_t)

with

w^{(i)}_s = λ(x^{(i)}_s, y_s) w^{(i)}_{s−1}

instead of (1.19). In the resampling step, draw x̃^{(1)}_s, . . . , x̃^{(N)}_s from

∑_{i=1}^{N} (w^{(i)}_s / ∑_{j=1}^{N} w^{(j)}_s) δ_{x^{(i)}_s}

and replace the current sample (x^{(1)}_s, w^{(1)}_s), . . . , (x^{(N)}_s, w^{(N)}_s) by (x̃^{(1)}_s, ∑_{i=1}^{N} w^{(i)}_s/N), . . . , (x̃^{(N)}_s, ∑_{i=1}^{N} w^{(i)}_s/N).

Alternatively, we can express the likelihood directly in terms of normalised filters:

Lemma 1.25 Denote by π_{ϑ,t|t} for any ϑ ∈ Θ, t ∈ N the filter π_{t|t} of Section 1.1 relative to P_ϑ. Then

ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, y_t) = π_{ϑ,t−1|t−1}(y_{0:t−1}, f_ϑ(·, y_t))

with

f_ϑ(x_{t−1}, y_t) := ∫ λ_ϑ(x_t, y_t) p_ϑ(x_{t−1}, dx_t)

for t ≥ 1 and

ϱ̄_{ϑ,0|−1}(y_0) = ∫ λ_ϑ(x_0, y_0) P^{X(0)}_ϑ(dx_0). (1.30)

Consequently,

ϱ̄_{ϑ,t}(y_{0:t}) = ∫ λ_ϑ(x_0, y_0) P^{X(0)}_ϑ(dx_0) ∏_{s=1}^{t} π_{ϑ,s−1|s−1}(y_{0:s−1}, f_ϑ(·, y_s)). (1.31)

Proof. (1.30) follows from (1.29) and (1.1). In view of (1.23, 1.29, 1.3, 1.2) we have

ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, y_t) = ϱ̄_{ϑ,t}(y_{0:t}) / ϱ̄_{ϑ,t−1}(y_{0:t−1})

= ϱ_{ϑ,t|t}(y_{0:t}, 1) / ϱ_{ϑ,t−1|t−1}(y_{0:t−1}, 1)

= ϱ_{ϑ,t−1|t−1}(y_{0:t−1}, f_ϑ(·, y_t)) / ϱ_{ϑ,t−1|t−1}(y_{0:t−1}, 1)

= π_{ϑ,t−1|t−1}(y_{0:t−1}, f_ϑ(·, y_t))

for t ≥ 1.

Now we turn instead to the framework of Section 1.3.2.3, which is not necessarily of HMM type. The counterpart of Lemma 1.25 reads as follows.


Lemma 1.26 1. We have P̄^{Y(t)|F̄_{t−1}}_ϑ ≪ P̄^{Y(t)|F̄_{t−1}}_0 = P^{Y(t)}_0 with corresponding density y_t ↦ ϱ̄_{ϑ,t|t−1}(Y(0 : t−1), y_t), where

ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, y_t) := E_ϑ( ∫ ϱ_{ϑ,t|t−1}((X, Y)(t−1), (x_t, y_t)) P^{X(t)|(X,Y)(t−1)}_0(dx_t) | Y(0 : t−1) = y_{0:t−1} ). (1.32)

2. We have P̄^{Y(0:t)}_ϑ ≪ P̄^{Y(0:t)}_0 with density

ϱ̄_{ϑ,t}(y_{0:t}) := ∏_{s=0}^{t} ϱ̄_{ϑ,s|s−1}(y_{0:s−1}, y_s). (1.33)

Proof.

1. First note that P^{Y(t)|Y(0:t−1)}_0 = P^{Y(t)}_0 and

P^{(X,Y)(t)|F_{t−1}}_0(d(x_t, y_t)) = P^{X(t)|F_{t−1}}_0(dx_t) P^{Y(t)}_0(dy_t)

by (1.28) and Lemma A.3. Set

ϱ^Y_{ϑ,t|t−1}((X, Y)(t−1), y_t) := ∫ ϱ_{ϑ,t|t−1}((X, Y)(t−1), (x_t, y_t)) P^{X(t)|(X,Y)(t−1)}_0(dx_t).

From

∫ 1_B(y_t) P^{Y(t)|Y(0:t−1)}_ϑ(dy_t)

= E_ϑ( ∫ 1_B(y_t) P^{Y(t)|F_{t−1}}_ϑ(dy_t) | Y(0 : t−1) )

= E_ϑ( ∫ 1_{R^m×B}(x_t, y_t) P^{(X,Y)(t)|F_{t−1}}_ϑ(d(x_t, y_t)) | Y(0 : t−1) )

= E_ϑ( ∫ 1_{R^m×B}(x_t, y_t) ϱ_{ϑ,t|t−1}((X, Y)(t−1), (x_t, y_t)) P^{(X,Y)(t)|F_{t−1}}_0(d(x_t, y_t)) | Y(0 : t−1) )

= E_ϑ( ∫ 1_B(y_t) ϱ^Y_{ϑ,t|t−1}((X, Y)(t−1), y_t) P^{Y(t)}_0(dy_t) | Y(0 : t−1) )

= ∫ 1_B(y_t) E_ϑ( ϱ^Y_{ϑ,t|t−1}((X, Y)(t−1), y_t) | Y(0 : t−1) ) P^{Y(t)}_0(dy_t)

= ∫ 1_B(y_t) ϱ̄_{ϑ,t|t−1}(Y(0 : t−1), y_t) P^{Y(t)}_0(dy_t)


for Borel sets B ⊂ R^n it follows that

∫ 1_B(y_t) P̄^{Y(t)|Y(0:t−1)=y_{0:t−1}}_ϑ(dy_t)

= ∫ 1_B(y_t) P^{Y(t)|Y(0:t−1)=y_{0:t−1}}_ϑ(dy_t)

= ∫ 1_B(y_t) ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, y_t) P^{Y(t)}_0(dy_t)

= ∫ 1_B(y_t) ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, y_t) P̄^{Y(t)}_0(dy_t)

and hence the first assertion.

2. This is evident from Lemma 1.19.

Both in (1.31) and in (1.32, 1.33), maximising the likelihood ϑ ↦ ϱ̄_{ϑ,t}(y_{0:t}) involves filtering some function of X(t−1).
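For the finite state space model of Example 1.9 this filtering-based likelihood is fully explicit: by Corollary 1.23 it is the total mass of the non-normalised filter, computed by a simple forward recursion. A sketch (names are ours; observations are encoded as integer column indices of the emission matrix):

```python
import numpy as np

def hmm_likelihood(y, p0, p, lam):
    """Likelihood rho_t(y_{0:t}) = sum_i rho_{t|t}(y_{0:t})_i for a finite
    state space HMM via the non-normalised filter recursion.
    p0: initial law (k,), p: transition matrix (k, k), lam[i, y]: observation density."""
    rho = lam[:, y[0]] * p0                 # rho_{0|0}(y_0)_i = lambda(a_i, y_0) (p0)_i
    for yt in y[1:]:
        rho = lam[:, yt] * (p.T @ rho)      # rho_{t|t}_i = lambda(a_i, y_t) sum_j p_ji rho_{t-1|t-1}_j
    return rho.sum()
```

For longer series one would work with logarithms or normalise each step and accumulate the normalising constants, since the raw masses underflow quickly.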

1.3.4 Likelihood estimation by smoothing conditional score functions

In this alternative approach we consider a parametric hidden Markov model as introduced in Section 1.3.2.2. We start from the score function or process provided that we have observed (X, Y)(0 : t), i.e. both the signal and the observation process up to time t. By Lemma 1.21 it is of the form

S_ϑ(t) = s_{ϑ,t}((X, Y)(0 : t))

= ∑_{s=0}^{t} ∇_ϑ log ϱ_{ϑ,s|s−1}((X, Y)(0 : s−1), (X, Y)(s))

= ∑_{s=0}^{t} ∇_ϑ log (ϱ^X_ϑ(X(s−1), X(s)) λ_ϑ(X(s), Y(s))). (1.34)

Our goal is to determine the score process S̄_ϑ(t) on Ω̄.

Lemma 1.27 In the statements below we suppose that the derivatives exist and that differentiation and integration can be interchanged in the proof.

Denote by π_{ϑ,s|t}, π_{ϑ,s−1:s|t} for any ϑ ∈ Θ, s ≤ t ∈ N the smoothers π_{s|t}, π_{s−1:s|t} of Section 1.1 relative to P_ϑ. Then

S̄_ϑ(y_{0:∞}, t) = E_ϑ( S_ϑ(t) | Y(0 : t) = y_{0:t} ) (1.35)

= π_{ϑ,0|t}(y_{0:t}, f_{ϑ,0}(·, y_0)) + ∑_{s=1}^{t} π_{ϑ,s−1:s|t}(y_{0:t}, f_ϑ(·, y_s)) (1.36)

with

f_ϑ(x_{s−1:s}, y_s) := ∇_ϑ log (ϱ^X_ϑ(x_{s−1}, x_s) λ_ϑ(x_s, y_s))

for t ≥ 1 and

f_{ϑ,0}(x_0, y_0) := ∇_ϑ log (ϱ^X_{ϑ,0}(x_0) λ_ϑ(x_0, y_0)).


Proof. First note that the density of P̄^{Y(0:t)}_ϑ = P^{Y(0:t)}_ϑ relative to P̄^{Y(0:t)}_0 = P^{Y(0:t)}_0 is given by

ϱ̄_{ϑ,t}(y_{0:t}) = ∫ ϱ_{ϑ,t}((x, y)_{0:t}) P^{X(0:t)}_0(dx_{0:t}).

Secondly, by Lemma A.5 and independence of X, Y under P_0 we have

E_0( g((X, Y)(0 : t)) | Y(0 : t) ) = ∫ g(x_{0:t}, Y(0 : t)) P^{X(0:t)}_0(dx_{0:t})

for any sufficiently integrable function g. Hence we obtain

E_ϑ(S_ϑ(t)|Y(0 : t)) = E_ϑ( ∇_ϑ ϱ_{ϑ,t}((X, Y)(0 : t)) / ϱ_{ϑ,t}((X, Y)(0 : t)) | Y(0 : t) )

= E_0( ∇_ϑ ϱ_{ϑ,t}((X, Y)(0 : t)) | Y(0 : t) ) / E_0( ϱ_{ϑ,t}((X, Y)(0 : t)) | Y(0 : t) )

= ∫ ∇_ϑ ϱ_{ϑ,t}(x_{0:t}, Y(0 : t)) P^{X(0:t)}_0(dx_{0:t}) / ∫ ϱ_{ϑ,t}(x_{0:t}, Y(0 : t)) P^{X(0:t)}_0(dx_{0:t})

= ∇_ϑ log ∫ ϱ_{ϑ,t}(x_{0:t}, Y(0 : t)) P^{X(0:t)}_0(dx_{0:t})

= ∇_ϑ log ϱ̄_{ϑ,t}(Y(0 : t)),

where we used Lemma A.5 in the second step and Lemma A.3 in the third. Since S̄_ϑ(y_{0:∞}, t) = ∇_ϑ log ϱ̄_{ϑ,t}(y_{0:t}), we obtain (1.35).

Equation (1.36) follows from (1.35) and (1.34).

Put differently, the score function involves smoothing f_ϑ(X(s−1 : s), y_s), which depends on two consecutive values of the signal process. This may be feasible but looks more complicated than the representations in Corollary 1.23 or Lemma 1.25. Therefore it does not seem to be used in this form.

Instead it has effectively been suggested to separate the parameters in the filter and in the conditional score functions, i.e. we consider the function

g(ϑ_1, ϑ_2) := π_{ϑ_2,0|t}(y_{0:t}, f_{ϑ_1,0}(·, y_0)) + ∑_{s=1}^{t} π_{ϑ_2,s−1:s|t}(y_{0:t}, f_{ϑ_1}(·, y_s))

for y_{0:t} = Y(0 : t). The EM algorithm (for expectation and maximisation) now consists of two steps that are successively alternated. We start with some parameter choice ϑ.

E-step In the expectation step one computes the smoothing distributions π_{ϑ,0|t}(y_{0:t}, ·) and π_{ϑ,s−1:s|t}(y_{0:t}, ·), s = 1, . . . , t for fixed ϑ.

M-step In the subsequent M-step one determines the solution ϑ_1 to g(ϑ_1, ϑ) = 0. Since this equation is the first-order condition for maximisation of the function

ϑ_1 ↦ G(ϑ_1, ϑ) := π_{ϑ,0|t}(y_{0:t}, F_{ϑ_1,0}(·, y_0)) + ∑_{s=1}^{t} π_{ϑ,s−1:s|t}(y_{0:t}, F_{ϑ_1}(·, y_s)) (1.37)


with

F_{ϑ_1}(x_{s−1:s}, y_s) := log (ϱ^X_{ϑ_1}(x_{s−1}, x_s) λ_{ϑ_1}(x_s, y_s)) (1.38)

for t ≥ 1 and

F_{ϑ_1,0}(x_0, y_0) := log (ϱ^X_{ϑ_1,0}(x_0) λ_{ϑ_1}(x_0, y_0)), (1.39)

this can be viewed as a maximisation step. Afterwards, the current value of ϑ is replaced by ϑ_1 and we turn back to the E-step.

It may seem far from obvious why or under what conditions this procedure truly converges to some MLE. The following lemma shows that the original likelihood ϑ ↦ ϱ̄_{ϑ,t}(y_{0:t}) is typically increased in the M-step. This more or less implies that the algorithm ends up at least at a local maximum of the likelihood.

Lemma 1.28 Maximisation of ϑ_1 ↦ G(ϑ_1, ϑ) in the M-step leads to higher likelihood in the sense that ϱ̄_{ϑ_1,t}(y_{0:t}) ≥ ϱ̄_{ϑ,t}(y_{0:t}).

Proof. Recall from Lemma 1.19 and (1.25, 1.26) that

log ϱ_{ϑ_1,t}((x, y)_{0:t}) = ∑_{s=0}^{t} log ϱ_{ϑ_1,s|s−1}((x, y)_{0:s−1}, (x_s, y_s))

= log (ϱ^X_{ϑ_1,0}(x_0) λ_{ϑ_1}(x_0, y_0)) + ∑_{s=1}^{t} log (ϱ^X_{ϑ_1}(x_{s−1}, x_s) λ_{ϑ_1}(x_s, y_s))

and hence G(ϑ_1, ϑ) = E_ϑ(log ϱ_{ϑ_1,t}((X, Y)(0 : t)) | Y(0 : t) = y_{0:t}) for any ϑ, ϑ_1 ∈ Θ. Since ϑ_1 maximises ϑ_1 ↦ G(ϑ_1, ϑ), we conclude

0 ≤ G(ϑ_1, ϑ) − G(ϑ, ϑ)

= E_ϑ( log (ϱ_{ϑ_1,t}((X, Y)(0 : t)) / ϱ_{ϑ,t}((X, Y)(0 : t))) | Y(0 : t) = y_{0:t} )

≤ log E_ϑ( ϱ_{ϑ_1,t}((X, Y)(0 : t)) / ϱ_{ϑ,t}((X, Y)(0 : t)) | Y(0 : t) = y_{0:t} )

= log (ϱ̄_{ϑ_1,t}(y_{0:t}) / ϱ̄_{ϑ,t}(y_{0:t})),

where we used Jensen's inequality in the third step and Lemma A.4 in the last.

Since the smoother in (1.37) looks more difficult than the filters needed for Lemma 1.25 or Corollary 1.23, one may wonder why one should prefer the EM algorithm to direct numerical maximisation of the likelihood computed in Section 1.3.3. In some models as in the following section, however, the maximisation in the M-step can be done explicitly, which is much less frequently the case for the original likelihood. Therefore the additional computational burden of the smoothing step may be outweighed by the fact that numerical maximisation of a function with possibly many degrees of freedom can be avoided.


1.3.5 The Baum-Welch algorithm

The Baum-Welch algorithm concerns likelihood estimation in the finite state space model of Example 1.9. We consider the transition matrix p ∈ R^{k×k} and the initial law p_0 ∈ R^k as the unknown parameters, which we want to estimate. For the time being, the transition function λ concerning the observation noise is assumed to be known. We proceed according to the EM algorithm from Section 1.3.4.

E-step Starting from a given parameter vector ϑ = (p_0, p) we determine the smoothing distributions π_{ϑ,s−1:s|t}(y_{0:t}) for s = 1, . . . , t recursively as explained in Example 1.9.

M-step In the subsequent M-step the maximiser ϑ_1 = (p̂_0, p̂) of

ϑ_1 ↦ G(ϑ_1, ϑ) := ∑_{i=1}^{k} π_{ϑ,0|t}(y_{0:t})_i log(p̂_0)_i + ∑_{s=1}^{t} ∑_{i,j=1}^{k} π_{ϑ,s−1:s|t}(y_{0:t})_{ij} log p̂_{ij}

is determined subject to the conditions that p̂_{i·} and p̂_0 represent probabilities, i.e. they are nonnegative and satisfy ∑_{j=1}^{k} p̂_{ij} = 1 for i = 1, . . . , k and ∑_{i=1}^{k} (p̂_0)_i = 1.

Lemma 1.29 The maximiser of ϑ_1 ↦ G(ϑ_1, ϑ) is given by

p̂_{ij} = ∑_{s=1}^{t} π_{ϑ,s−1:s|t}(y_{0:t})_{ij} / ∑_{s=1}^{t} π_{ϑ,s−1|t}(y_{0:t})_i

       = E_ϑ(∑_{s=1}^{t} 1_{{X(s−1)=i, X(s)=j}} | Y(0 : t) = y_{0:t}) / E_ϑ(∑_{s=1}^{t} 1_{{X(s−1)=i}} | Y(0 : t) = y_{0:t}) (1.40)

for i, j = 1, . . . , k such that the denominator does not vanish and

(p̂_0)_i = π_{ϑ,0|t}(y_{0:t})_i = P_ϑ(X(0) = i | Y(0 : t) = y_{0:t}) (1.41)

for i = 1, . . . , k.

Proof. Consider first-order conditions for maximisation of ϑ_1 ↦ G(ϑ_1, ϑ). In order to obtain (1.40) differentiate either with respect to p̂_{ij}, j = 1, . . . , k−1 or with respect to p̂_{ij}, j = 1, . . . , k using a Lagrange multiplier for the constraint ∑_{j=1}^{k} p̂_{ij} = 1. Equation (1.41) is derived along the same lines.

As explained above, we take (p̂_0, p̂) as updated value for ϑ and return to the E-step.

Now suppose that λ depends on an unknown parameter as well. We discuss here the case that Y has values in a finite set B = {b_1, . . . , b_ℓ} and that ϑ = (p_0, p, q) is considered as parameter vector, where q = (q_{ij})_{i=1,...,k; j=1,...,ℓ} with q_{ij} := λ(a_i, b_j).


E-step Starting from a given parameter vector ϑ = (p_0, p, q) we determine the smoothing distributions π_{ϑ,t−1:t|s}(y_{0:s}) for t = 1, . . . , s precisely as above relying on Example 1.9.

M-step Since λ_ϑ in (1.37–1.39) now depends on ϑ as well, we now need to consider

G(ϑ_1, ϑ) := ∑_{i=1}^{k} π_{ϑ,0|t}(y_{0:t})_i (log(p̂_0)_i + log q̂_{i,y_0}) + ∑_{s=1}^{t} ∑_{i,j=1}^{k} π_{ϑ,s−1:s|t}(y_{0:t})_{ij} (log p̂_{ij} + log q̂_{j,y_s}).

The maximiser ϑ_1 = (p̂_0, p̂, q̂) of ϑ_1 ↦ G(ϑ_1, ϑ) subject to the conditions of nonnegativity and normalisation is determined using Lemma 1.29 together with

Lemma 1.30 The maximiser q̂ is given by

q̂_{ij} = ∑_{s=0}^{t} π_{ϑ,s|t}(y_{0:t})_i 1_{{y_s=b_j}} / ∑_{s=0}^{t} π_{ϑ,s|t}(y_{0:t})_i

       = ∑_{s=0}^{t} E_ϑ(1_{{X(s)=i}} | Y(0 : t) = y_{0:t}) 1_{{y_s=b_j}} / ∑_{s=0}^{t} E_ϑ(1_{{X(s)=i}} | Y(0 : t) = y_{0:t})

for i = 1, . . . , k and j = 1, . . . , ℓ such that the denominator does not vanish.

Proof. This follows along the same lines as Lemma 1.29.

As usual, we take (p̂_0, p̂, q̂) as updated value for ϑ and return to the E-step.
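One full E/M sweep can be combined into a single function. The sketch below (names ours) computes the smoothers of Example 1.9 with the standard scaled forward-backward recursions and then applies the explicit maximisers (1.40), (1.41) and Lemma 1.30; observations are encoded as integer indices into the columns of q, and t ≥ 1 is assumed.

```python
import numpy as np

def baum_welch_step(y, p0, p, q):
    """One E/M sweep of the Baum-Welch algorithm for a finite state space HMM."""
    T, k = len(y), len(p0)
    alpha = np.empty((T, k)); beta = np.ones((T, k)); c = np.empty(T)
    alpha[0] = p0 * q[:, y[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                      # scaled forward recursion
        alpha[t] = (alpha[t - 1] @ p) * q[:, y[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):             # scaled backward recursion
        beta[t] = (p @ (q[:, y[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                       # smoothed marginals pi_{s|t}
    xi = np.zeros((k, k))                      # accumulated pairwise smoothers
    for t in range(1, T):
        xi += np.outer(alpha[t - 1], q[:, y[t]] * beta[t]) * p / c[t]
    p0_new = gamma[0]                          # maximiser (1.41)
    p_new = xi / xi.sum(axis=1, keepdims=True)  # maximiser (1.40)
    q_new = np.zeros_like(q)                   # maximiser of Lemma 1.30
    for j in range(q.shape[1]):
        q_new[:, j] = gamma[[s for s in range(T) if y[s] == j]].sum(axis=0)
    q_new /= q_new.sum(axis=1, keepdims=True)
    return p0_new, p_new, q_new
```

Iterating `baum_welch_step` produces a non-decreasing likelihood sequence, in line with Lemma 1.28.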

As a side remark note that we could base statistical inference alternatively on Corollary 1.23. By Theorem 1.3 and parallel to Example 1.9 the likelihood

ϱ̄_{ϑ,t}(y_{0:t}) = ϱ_{ϑ,t|t}(y_{0:t}, 1) = ∑_{i=1}^{k} ϱ_{ϑ,t|t}(y_{0:t})_i

is obtained from the recursion

ϱ_{ϑ,0|0}(y_0)_i := λ(a_i, y_0)(p_0)_i,

ϱ_{ϑ,t|t}(y_{0:t})_i := λ(a_i, y_t) ∑_{j=1}^{k} p_{ji} ϱ_{ϑ,t−1|t−1}(y_{0:t−1})_j

for i = 1, . . . , k. In Example 1.9 this amounts to multiplying the denominators in (1.17) and (1.18) up to t. One can derive explicit representations of the derivative with respect to ϑ as well, which proves to be useful for numerical maximisation of the likelihood.


1.3.6 Likelihood estimation in linear Gaussian state space models

In Gaussian models we can express the likelihood more explicitly. We consider a generallinear Gaussian state space model as in Section 1.2.2. Our aim is to determine the likelihoodand the score function. We proceed basically as in Section 1.3.3. However, densities areexpressed below with respect to Lebesgue measure rather than some P0. Since the scorefunction and the MLE do not depend on the dominating measure, this simplification doesnot affect the relevant results.

If the parametric model is of the form in Definition 1.14, the densities (1.27) are of the form

ϱ_{ϑ,t|t−1}((X, Y)(0 : t−1), (x_t, y_t)) = φ_{µ_{ϑ,t|t−1}((x_{t−1}, y_{t−1})), Σ_{ϑ,t|t−1}}(x_t, y_t),

where we write φ_{µ,Σ} for the density of a multivariate normal distribution N(µ, Σ),

µ_{ϑ,t|t−1}(x_{t−1}, y_{t−1}) = ( µ^X_{ϑ,t|t−1}(x_{t−1}, y_{t−1})
                                  µ^Y_{ϑ,t|t−1}(x_{t−1}, y_{t−1}) )

denotes the conditional expectation of (X, Y)(t) given (X, Y)(t−1) = (x_{t−1}, y_{t−1}), and

Σ_{ϑ,t|t−1} = ( Σ^{XX}_{ϑ,t|t−1}  Σ^{XY}_{ϑ,t|t−1}
                Σ^{YX}_{ϑ,t|t−1}  Σ^{YY}_{ϑ,t|t−1} )

the conditional covariance of (X, Y)(t) given (X, Y)(t−1) = (x_{t−1}, y_{t−1}). Direct calculation or (A.2) yield that Σ_{ϑ,t|t−1} does not depend on (x_{t−1}, y_{t−1}) at all. Moreover, µ_{ϑ,t|t−1}((x_{t−1}, y_{t−1})) is an affine function of (x_{t−1}, y_{t−1}). In particular, we can write

µ^Y_{ϑ,t|t−1}(x_{t−1}, y_{t−1}) = α_{ϑ,t} + β_{ϑ,t}(x_{t−1}, y_{t−1})^⊤

with some explicitly known coefficients α_{ϑ,t} ∈ R^n, β_{ϑ,t} ∈ R^{n×(m+n)}.

Recall that the P_ϑ-conditional distribution of X(t−1) given Y(0 : t−1) = y_{0:t−1} is Gaussian. Its conditional mean and variance are obtained from the Kalman filter recursions in Theorem 1.16. We denote the conditional mean by µ^X_{ϑ,t−1|t−1}(y_{0:t−1}) and the conditional variance by Σ^X_{ϑ,t−1|t−1}. Note that µ^X_{ϑ,t−1|t−1}(y_{0:t−1}) is an affine function of y_{0:t−1} and that Σ^X_{ϑ,t−1|t−1} does not depend on y_{0:t−1} at all. We set

µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}) := α_{ϑ,t} + β_{ϑ,t} (µ^X_{ϑ,t−1|t−1}(y_{0:t−1}), y_{t−1})^⊤,

Σ̂^Y_{ϑ,t|t−1} := β_{ϑ,t} ( Σ^X_{ϑ,t−1|t−1}  0
                            0                0 ) β_{ϑ,t}^⊤,

which denote the P_ϑ-conditional mean and variance of µ^Y_{ϑ,t|t−1}((X, Y)(t−1)) given Y(0 : t−1) = y_{0:t−1}.


We are now ready to compute (1.32):

ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, y_t)

= E_ϑ( ∫ ϱ_{ϑ,t|t−1}((X, Y)(0 : t−1), (x_t, y_t)) dx_t | Y(0 : t−1) = y_{0:t−1} )

= E_ϑ( φ_{µ^Y_{ϑ,t|t−1}((X,Y)(t−1)), Σ^{YY}_{ϑ,t|t−1}}(y_t) | Y(0 : t−1) = y_{0:t−1} ) (1.42)

= E_ϑ( φ_{y_t, Σ^{YY}_{ϑ,t|t−1}}(µ^Y_{ϑ,t|t−1}((X, Y)(t−1))) | Y(0 : t−1) = y_{0:t−1} )

= ∫ φ_{y_t, Σ^{YY}_{ϑ,t|t−1}}(z) φ_{µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}), Σ̂^Y_{ϑ,t|t−1}}(z) dz

= ∫ φ_{y_t, Σ^{YY}_{ϑ,t|t−1}}(z) φ_{−µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}), Σ̂^Y_{ϑ,t|t−1}}(−z) dz

= (φ_{y_t, Σ^{YY}_{ϑ,t|t−1}} ∗ φ_{−µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}), Σ̂^Y_{ϑ,t|t−1}})(0)

= φ_{y_t − µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}), Σ^{YY}_{ϑ,t|t−1} + Σ̂^Y_{ϑ,t|t−1}}(0)

= φ_{0, Σ^{YY}_{ϑ,t|t−1} + Σ̂^Y_{ϑ,t|t−1}}(y_t − µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}))

= φ_{µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}), Σ^{YY}_{ϑ,t|t−1} + Σ̂^Y_{ϑ,t|t−1}}(y_t),

where ∗ denotes convolution. Consequently, ϱ̄_{ϑ,t|t−1}(y_{0:t−1}, y_t) resembles the density φ_{µ^Y_{ϑ,t|t−1}((X,Y)(t−1)), Σ^{YY}_{ϑ,t|t−1}} inside (1.42) but the unobserved mean µ^Y_{ϑ,t|t−1}((X, Y)(t−1)) is replaced by its filtered value µ̂^Y_{ϑ,t|t−1}(y_{0:t−1}) and the covariance matrix Σ^{YY}_{ϑ,t|t−1} by its larger substitute Σ^{YY}_{ϑ,t|t−1} + Σ̂^Y_{ϑ,t|t−1}. The unconditional density is obtained from (1.33) as

ϱ̄_{ϑ,t}(y_{0:t}) = ∏_{s=0}^{t} ϱ̄_{ϑ,s|s−1}(y_{0:s−1}, y_s)

= ∏_{s=0}^{t} φ_{0, Σ^{YY}_{ϑ,s|s−1} + Σ̂^Y_{ϑ,s|s−1}}(y_s − µ̂^Y_{ϑ,s|s−1}(y_{0:s−1})).

Hence the log-likelihood and the score process equal

    log ϱ_{ϑ,t}(Y(0:t)) = ∑_{s=0}^{t} log ϕ_{0, Σ^{YY}_{ϑ,s|s−1} + Σ^Y_{ϑ,s|s−1}}( Y(s) − µ^Y_{ϑ,s|s−1}(Y(0:s−1)) )

resp.

    S_ϑ(t) = ∑_{s=0}^{t} ∇_ϑ log ϕ_{0, Σ^{YY}_{ϑ,s|s−1} + Σ^Y_{ϑ,s|s−1}}( Y(s) − µ^Y_{ϑ,s|s−1}(Y(0:s−1)) ).     (1.43)
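The sum above is the classical prediction-error decomposition of the Gaussian log-likelihood. As a rough numerical illustration, the following sketch accumulates it alongside the filter recursion for a hypothetical scalar model X(t) = aX(t−1) + bε_t, Y(t) = cX(t) + dη_t (function name, interface and the scalar specialisation are not taken from the notes; the notes use the matrix recursions of Theorem 1.16):

```python
import math

def kalman_loglik(y, a, b, c, d, m0, p0):
    """Log-likelihood of observations y under the hypothetical scalar model
    X(t) = a X(t-1) + b eps_t,  Y(t) = c X(t) + d eta_t,  X(0) ~ N(m0, p0),
    accumulated via the prediction-error decomposition."""
    loglik = 0.0
    m, p = m0, p0          # filtered mean/variance of the current state
    first = True
    for yt in y:
        if first:
            mp, pp = m, p  # at t = 0 the prior itself is the predictive law
            first = False
        else:
            mp, pp = a * m, a * a * p + b * b   # predict X(t)
        s = c * c * pp + d * d                  # predictive variance of Y(t)
        r = yt - c * mp                         # innovation
        loglik += -0.5 * (math.log(2 * math.pi * s) + r * r / s)
        k = pp * c / s                          # Kalman gain
        m = mp + k * r                          # update filtered mean
        p = (1 - k * c) * pp                    # update filtered variance
    return loglik
```

Maximising this function over the parameters (a, b, c, d) then yields the MLE discussed in Section 1.3.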

Observe that a crucial step in computing the log-likelihood or the score function is to determine µ^X_{ϑ,s−1|s−1}(y_{0:s−1}) and Σ^X_{ϑ,s−1|s−1}, s = 1, . . . , t, using the Kalman filter recursions.

Let us briefly mention that one could also use the EM algorithm if the linear Gaussian state space model is of HMM type as in Definition 1.11:


Remark 1.31 (EM algorithm) In the setup of Definition 1.11 suppose that the parameters a_X, a_{XX}, b_X, a_Y, a_{YX}, b_Y depend on ϑ. In the notation of Section 1.3.4 we have that ϱ^X_ϑ(x_{s−1}, ·), λ_ϑ(x_s, ·), ϱ^X_{ϑ,0}(·) are densities of the Gaussian laws N(a_X + a_{XX}x_{s−1}, b_X b_X^⊤), N(a_Y + a_{YX}x_s, b_Y b_Y^⊤), N(µ^X(0), Σ^X(0)), respectively. Lemma A.9 yields that (x_s, y_s) ↦ ϱ^X_ϑ(x_{s−1}, x_s) λ_ϑ(x_s, y_s) is a density of a Gaussian law with mean

    µ := ( a_X + a_{XX}x_{s−1} ; a_Y + a_{YX}a_X + a_{YX}a_{XX}x_{s−1} )

and covariance matrix

    Σ := ( b_X b_X^⊤ , b_X b_X^⊤ a_{YX}^⊤ ; a_{YX} b_X b_X^⊤ , a_{YX} b_X b_X^⊤ a_{YX}^⊤ + b_Y b_Y^⊤ ).

Similarly, (x_0, y_0) ↦ ϱ^X_{ϑ,0}(x_0) λ_ϑ(x_0, y_0) is a density of a Gaussian law with mean

    µ_0 := ( µ^X(0) ; a_Y + a_{YX}µ^X(0) )

and covariance matrix

    Σ_0 := ( Σ^X(0) , Σ^X(0) a_{YX}^⊤ ; a_{YX} Σ^X(0) , a_{YX} Σ^X(0) a_{YX}^⊤ + b_Y b_Y^⊤ ).

Consequently, the functions in (1.38), (1.39) are of the form

    F_ϑ(x_{s−1:s}, y_s) = −(d/2) log(2π) − (1/2) log|det Σ|
        − (1/2) ( x_s − a_X − a_{XX}x_{s−1} ; y_s − a_Y − a_{YX}a_X − a_{YX}a_{XX}x_{s−1} )^⊤ Σ^{−1} ( x_s − a_X − a_{XX}x_{s−1} ; y_s − a_Y − a_{YX}a_X − a_{YX}a_{XX}x_{s−1} )

and

    F_{ϑ,0}(x_0, y_0) = −(d/2) log(2π) − (1/2) log|det Σ_0|
        − (1/2) ( x_0 − µ^X(0) ; y_0 − a_Y − a_{YX}µ^X(0) )^⊤ Σ_0^{−1} ( x_0 − µ^X(0) ; y_0 − a_Y − a_{YX}µ^X(0) ).

Observe that these functions are second-order polynomials in x_{s−1}, x_s, which means that first and second moments of π_{ϑ,s−1:s|t} are needed for the explicit computation of (1.37). However, these have been determined in Theorem 1.16(6). If we can maximise ϑ_1 ↦ G(ϑ_1, ϑ) explicitly, the EM algorithm may make sense in this setup as well.
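To make the maximisation step concrete, here is a minimal one-dimensional sketch of the M-step for the observation parameters a_Y, a_{YX}, b_Y, assuming the smoothed first and second moments of the state (as provided by Theorem 1.16(6)) have already been computed in the E-step. The function name and interface are illustrative, not from the notes:

```python
def m_step_observation(y, xs_mean, xs_var):
    """One-dimensional M-step for Y(s) = aY + aYX * X(s) + bY * eta_s,
    given smoothed means E[X(s)|Y(0:t)] and variances Var[X(s)|Y(0:t)].
    Maximising the expected complete-data log-likelihood reduces to a
    linear regression of y on the smoothed state, with the smoothed
    variance correcting the second moment."""
    n = len(y)
    sx = sum(xs_mean)
    sxx = sum(m * m + v for m, v in zip(xs_mean, xs_var))  # sum of E[X(s)^2]
    sy = sum(y)
    syx = sum(ys * m for ys, m in zip(y, xs_mean))
    a_yx = (syx - sy * sx / n) / (sxx - sx * sx / n)
    a_y = (sy - a_yx * sx) / n
    # expected squared residual -> new value of bY^2
    b2 = sum((ys - a_y - a_yx * m) ** 2 + a_yx ** 2 * v
             for ys, m, v in zip(y, xs_mean, xs_var)) / n
    return a_y, a_yx, b2
```

The transition parameters a_X, a_{XX}, b_X admit an analogous closed-form update based on the smoothed cross-moments of (X(s−1), X(s)).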

Finally, we briefly comment on non-Gaussian setups:

Remark 1.32 (Quasi-likelihood estimation) Sometimes the Gaussian score function (1.43) can be applied in more general non-Gaussian setups. We consider a parametric model (X, Y) as in the first paragraph of Section 1.3.3. Suppose that it is equivalent in the sense of Remark 1.17 to some parametric general linear Gaussian state space model, i.e. for any ϑ ∈ Θ it shares the first and second moments of its Gaussian counterpart. This happens e.g. in Example 1.18 if we consider it as a parametric model.

In this case we may be tempted to estimate ϑ based on the estimating equation S_ϑ(t) = 0 with S_ϑ(t) from (1.43). However, since it is not related to the likelihood of our model, it does not seem obvious why this quasi-likelihood estimator should make any sense. To this end, recall that E_ϑ(S_ϑ(t)) = 0 holds in the equivalent Gaussian model by Lemma 1.21. Moreover, observe that S_ϑ(t) in (1.43) is a quadratic function of the observations Y(0:t). Since the first two moments coincide, the unbiasedness of the estimating function holds in our parametric model as well. As mentioned after Lemma 1.21, this may be used as a starting point to derive consistency and asymptotic normality of the quasi-likelihood estimator.


Chapter 2

Continuous-time models

Many results of Chapter 1 can be extended to continuous time. At this point we focus on two particularly well-known results, namely the Kushner-Stratonovich and the Zakai equations. They characterise the optimal filter for a diffusion or more general Markov process which is observed with additive white noise.

2.1 The Kushner-Stratonovich and Zakai equations

As usual, we fix a filtered probability space (Ω, F, (F_t)_{t∈R+}, P). We consider an R^{m+n}-valued adapted process (X, Y) of the form

    dX(t) = b(X(t))dt + σ(X(t))dW^X(t),                                                     (2.1)
    dY(t) = h(X(t))dt + dW^Y(t),

where W^X, W^Y denote independent R^m- resp. R^n-valued Wiener processes and b : R^m → R^m, σ : R^m → R^{m×m}, h : R^m → R^n measurable functions. It can be viewed as a continuous-time counterpart of an HMM, where the "infinitesimal" increment dY(t) corresponds to what was Y(t) in Section 1.1.1. In order for the following results to hold, we make the following

Assumption 2.1 We suppose that E(|X(0)|^3) < ∞, the mappings b and σ are globally Lipschitz, and h satisfies the linear growth condition

    |h(x)|^2 ≤ c(1 + |x|^2),   x ∈ R^m,

for some constant c.
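Under Assumption 2.1 the pair (2.1) is easy to simulate by an Euler–Maruyama discretisation. The following scalar sketch (m = n = 1; function name and interface are illustrative, not from the notes) generates a signal path together with the corresponding observation path:

```python
import math
import random

def simulate(b, sigma, h, x0, T, n, seed=0):
    """Euler-Maruyama discretisation of the scalar signal-observation pair
    dX = b(X)dt + sigma(X)dWX,  dY = h(X)dt + dWY  on [0, T] with n steps.
    Returns the sampled paths (X(t_k))_k and (Y(t_k))_k."""
    rng = random.Random(seed)
    dt = T / n
    sq = math.sqrt(dt)
    x, y = x0, 0.0
    xs, ys = [x], [y]
    for _ in range(n):
        dwx = sq * rng.gauss(0.0, 1.0)   # increment of WX
        dwy = sq * rng.gauss(0.0, 1.0)   # increment of WY
        # both updates use the left endpoint x, as Euler-Maruyama requires
        x, y = x + b(x) * dt + sigma(x) * dwx, y + h(x) * dt + dwy
        xs.append(x)
        ys.append(y)
    return xs, ys
```

For instance, an Ornstein-Uhlenbeck signal with b(x) = −x, σ ≡ 1 and linear observation h(x) = x satisfies Assumption 2.1.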

Parallel to Section 1.1 we interpret X as a signal which is to be recovered from observing Y. Using the notation

    x_{s:t} := (x_r)_{r∈[s,t]}   and   Z(s:t) := (Z(r))_{r∈[s,t]}

for functions x = (x_t)_{t∈R+} and processes Z = (Z(t))_{t∈R+}, the goal is to determine the kernel

    π_{t|s}(y_{0:s}, dx_t) := P^{X(t)|Y(0:s)=y_{0:s}}(dx_t)


for s = t (filtering), s < t (prediction), s > t (smoothing). As in Section 1.1 we use the notation

    π_{t|s}(y_{0:s}, f) := ∫ f(x_t) π_{t|s}(y_{0:s}, dx_t) = E( f(X(t)) | Y(0:s) = y_{0:s} )

for (y_t)_{t∈R+} : R+ → R^n and measurable functions f : R^m → R such that the integral exists. In these notes we focus on the filtering problem. Its solution can be expressed in terms of stochastic differential equations which can be viewed as a counterpart to the recursions in Sections 1.1.2 and 1.1.3.

As in discrete time we obtain simpler equations if we consider some kind of non-normalised filter. To this end we fix a maximal time horizon T ∈ R+. Its choice does not affect the results; it is only needed for technical reasons. Define the local martingale

    Z := E( −∫_0^· h(X(t))dW^Y(t) )^T,

where the superscript denotes stopping at T, i.e.,

    Z(t) = exp( −∫_0^{t∧T} h(X(s))dW^Y(s) − (1/2) ∫_0^{t∧T} |h(X(s))|^2 ds ).

Lemma 2.2 Z is a martingale.

Proof. Since Z is a nonnegative local martingale, it is a supermartingale. In order for the true martingale property to hold, it suffices to verify that E(Z(T)) = 1. To this end, note that

    E( Z(T) | X(0:T) = x_{0:T} ) = E( exp( −∫_0^T h(x_s)dW^Y(s) ) ) exp( −(1/2)∫_0^T |h(x_s)|^2 ds )
                                 = exp( (1/2)∫_0^T |h(x_s)|^2 ds ) exp( −(1/2)∫_0^T |h(x_s)|^2 ds )
                                 = 1.

This implies

    E(Z(T)) = E( E( Z(T) | X(0:T) ) ) = 1.

Since Z is a positive martingale, it is the density process of some probability measure Q ∼ P.

Lemma 2.3 1. Relative to Q, the processes W^X, Y are independent R^m- resp. R^n-valued Wiener processes on [0, T], i.e. the stopped process (W^X, Y)^T has the same law as an R^{m+n}-valued Wiener process which is stopped at T.


2. P^X = Q^X, i.e. X has the same law under P and Q.

3. Z̃ := 1/Z is the density process of P relative to Q. Moreover,

    Z̃(t) = E( ∫_0^· h(X(s))dY(s) )(t ∧ T)
          = exp( ∫_0^{t∧T} h(X(s))dY(s) − (1/2) ∫_0^{t∧T} |h(X(s))|^2 ds ),   t ∈ R+.

Proof.

1. Since (W^X, W^Y) is a Wiener process under P, Girsanov's theorem yields that (W^X(t), W^Y(t) + ∫_0^{t∧T} h(X(s))ds)_{t∈R+} is a Wiener process relative to Q. This process coincides with (W^X, Y) on [0, T], which yields the first claim.

2. The first statement yields that W^X is a Wiener process under both P and Q. The Lipschitz conditions in Assumption 2.1 imply that the SDE (2.1) has a unique strong solution, which in turn implies uniqueness in law. Put differently, the law of X coincides under P and Q.

3. This follows from straightforward calculations.

The non-normalised filter is now defined as

    ϱ_{t|s}(y_{0:s}, f) := E_Q( f(X(t)) Z̃(t) | Y(0:s) = y_{0:s} )

for (y_t)_{t∈R+} : R+ → R^n and measurable functions f : R^m → R such that the expectation exists; here Z̃ = 1/Z denotes the density process of P relative to Q from Lemma 2.3(3). The definition corresponds directly to the object (1.7) for s = t. As in discrete time, the Kallianpur-Striebel formula now yields the optimal filter.

Lemma 2.4 (Kallianpur-Striebel formula) Fix (y_t)_{t∈R+} : R+ → R^n and let f : R^m → R be a measurable nonnegative, bounded, or sufficiently integrable function. Then

    π_{t|t}(y_{0:t}, f) = ϱ_{t|t}(y_{0:t}, f) / ϱ_{t|t}(y_{0:t}, 1).

Proof. Using Lemma A.5, σ(Y(0:t)) ⊂ F_t, and E_Q(dP/dQ | F_t) = Z̃(t), we have

    E( f(X(t)) | Y(0:t) ) = E_Q( f(X(t)) dP/dQ | Y(0:t) ) / E_Q( dP/dQ | Y(0:t) )
                          = E_Q( f(X(t)) Z̃(t) | Y(0:t) ) / E_Q( Z̃(t) | Y(0:t) )
                          = ϱ_{t|t}(Y(0:t), f) / ϱ_{t|t}(Y(0:t), 1).
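The ratio in Lemma 2.4 suggests a simple Monte Carlo scheme: since P^X = Q^X and X and Y are independent under Q, one may simulate signal paths under their own law, weight each path with a discretisation of the density exp(∫h(X)dY − ½∫|h(X)|²ds) from Lemma 2.3(3), and form the self-normalised estimate of π_{t|t}(y_{0:t}, f). The following scalar sketch (hypothetical names, left-endpoint discretisation) is an illustration only, not a scheme from the notes:

```python
import math
import random

def ks_monte_carlo(f, h, b, sigma, x0, y_path, dt, n_paths=1000, seed=1):
    """Monte Carlo evaluation of the Kallianpur-Striebel ratio: simulate
    independent scalar signal paths dX = b(X)dt + sigma(X)dWX under their
    own law, weight each by exp( sum h(X) dY - 1/2 sum h(X)^2 dt ) along
    the observed path y_path, and return the self-normalised estimate."""
    rng = random.Random(seed)
    sq = math.sqrt(dt)
    num = den = 0.0
    for _ in range(n_paths):
        x, logw = x0, 0.0
        for k in range(len(y_path) - 1):
            dy = y_path[k + 1] - y_path[k]
            hx = h(x)
            logw += hx * dy - 0.5 * hx * hx * dt  # log-weight increment
            x += b(x) * dt + sigma(x) * sq * rng.gauss(0.0, 1.0)
        w = math.exp(logw)
        num += w * f(x)   # weighted terminal value, approximates rho(f)
        den += w          # approximates rho(1)
    return num / den
```

Particle filters refine this idea by resampling the weighted paths, which keeps the weights from degenerating over long horizons.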

Below it is shown that the non-normalised filter solves some kind of stochastic differential equation. It can be viewed as the adjoint of a stochastic partial differential equation which describes the temporal evolution of the non-normalised density of the law π_{t|t}(y_{0:t}, ·). To this end we need a technical lemma.


Lemma 2.5 Let ϕ be a predictable process with E_Q( ∫_0^t ϕ(s)^2 ds ) < ∞, t ∈ R+. Then

1.  E_Q( ∫_0^t ϕ(s)dY_i(s) | Y(0:t) ) = ∫_0^t E_Q( ϕ(s) | Y(0:t) ) dY_i(s)

    for i = 1, . . . , n and any t ≥ 0,

2.  E_Q( ∫_0^t ϕ(s)dW^X_i(s) | Y(0:t) ) = 0

    for i = 1, . . . , m and any t ≥ 0.

Proof. This is shown in [BC09, Lemma 3.21]. Here we give only an informal argument for simple integrands of the form

    ϕ(s) = ∑_{j=1}^{k} U_j 1_{(t_{j−1},t_j]}(s)

with t_0 ≤ . . . ≤ t_k ≤ t and F_{t_{j−1}}-measurable random variables U_j.

1. Since

       ∫_0^t ϕ(s)dY_i(s) = ∑_{j=1}^{k} U_j (Y_i(t_j) − Y_i(t_{j−1}))

   and

       E_Q( ϕ(s) | Y(0:t) ) = ∑_{j=1}^{k} E_Q( U_j | Y(0:t) ) 1_{(t_{j−1},t_j]}(s),

   we have

       E_Q( ∫_0^t ϕ(s)dY_i(s) | Y(0:t) ) = ∑_{j=1}^{k} E_Q( U_j | Y(0:t) ) (Y_i(t_j) − Y_i(t_{j−1}))
                                         = ∫_0^t E_Q( ϕ(s) | Y(0:t) ) dY_i(s).

2. Since

       ∫_0^t ϕ(s)dW^X_i(s) = ∑_{j=1}^{k} U_j (W^X_i(t_j) − W^X_i(t_{j−1})),


   we have

       E_Q( ∫_0^t ϕ(s)dW^X_i(s) | Y(0:t) ) = ∑_{j=1}^{k} E_Q( U_j (W^X_i(t_j) − W^X_i(t_{j−1})) | Y(0:t) )
                                           = ∑_{j=1}^{k} E_Q( U_j (W^X_i(t_j) − W^X_i(t_{j−1})) | Y(0:t_{j−1}) )
                                           = ∑_{j=1}^{k} E_Q( U_j E_Q( W^X_i(t_j) − W^X_i(t_{j−1}) | F_{t_{j−1}} ) | Y(0:t_{j−1}) )
                                           = 0,

   where we used that (Y(r) − Y(t_{j−1}))_{r≥t_{j−1}} is independent of F_{t_{j−1}} ∨ σ(W^X(0:t)), applied Lemma A.2, and used the martingale property of W^X.

For twice continuously differentiable functions f : R^m → R we define Gf : R^m → R by

    Gf(x) := ∑_{i=1}^{m} b_i(x) ∂f(x)/∂x_i + (1/2) ∑_{i,j=1}^{m} (σσ^⊤)_{ij}(x) ∂^2 f(x)/∂x_i ∂x_j.

Leaving aside questions of its domain, the mapping G : f ↦ Gf is the infinitesimal generator of X.

We are now ready to formulate our main result.

Theorem 2.6 (Zakai equation) Let f : R^m → R be twice continuously differentiable with compact support. Then we have

    dϱ_{t|t}(y_{0:t}, f) = ϱ_{t|t}(y_{0:t}, Gf)dt + ϱ_{t|t}(y_{0:t}, fh)dY(t)                   (2.2)

in the sense that

    ϱ_{t|t}(y_{0:t}, f) = E( f(X(0)) ) + ∫_0^t ϱ_{s|s}(y_{0:s}, Gf)ds + ∫_0^t ϱ_{s|s}(y_{0:s}, fh)dY(s).     (2.3)

Proof. A rigorous proof is to be found in [BC09, Theorem 3.24]. We give here a slightly heuristic proof which sweeps some integrability issues under the rug.

Recall that Z̃ = 1/Z denotes the density process of P relative to Q from Lemma 2.3(3). Integration by parts yields

    d( f(X(t)) Z̃(t) ) = f(X(t))dZ̃(t) + Z̃(t)df(X(t)) + d[Z̃, f(X)](t).

Since [W^Y, W^X] = 0, the term [Z̃, f(X)] vanishes. Applying Itô's formula, we obtain

    d( f(X(t)) Z̃(t) ) = f(X(t)) Z̃(t) h(X(t)) dY(t)
                       + Z̃(t) Gf(X(t)) dt + Z̃(t) ∇f(X(t)) σ(X(t)) dW^X(t)


and hence

    f(X(t)) Z̃(t) = f(X(0)) + ∫_0^t f(X(s)) Z̃(s) h(X(s)) dY(s)
                  + ∫_0^t Z̃(s) Gf(X(s)) ds
                  + ∫_0^t Z̃(s) ∇f(X(s)) σ(X(s)) dW^X(s)

in integral notation. Taking conditional expectations we obtain

    ϱ_{t|t}(y_{0:t}, f) = E_Q( f(X(t)) Z̃(t) | Y(0:t) )
      = E_Q( f(X(0)) | Y(0:t) )
      + E_Q( ∫_0^t f(X(s)) Z̃(s) h(X(s)) dY(s) | Y(0:t) )
      + E_Q( ∫_0^t Z̃(s) Gf(X(s)) ds | Y(0:t) )
      + E_Q( ∫_0^t Z̃(s) ∇f(X(s)) σ(X(s)) dW^X(s) | Y(0:t) ).                                 (2.4)

The first term equals E_Q( f(X(0)) ) by Lemma A.2. Applying Lemma 2.5(1) and subsequently Lemma A.2 to the second term, it can be rewritten as

    ∫_0^t E_Q( f(X(s)) Z̃(s) h(X(s)) | Y(0:t) ) dY(s) = ∫_0^t E_Q( f(X(s)) h(X(s)) Z̃(s) | Y(0:s) ) dY(s)
                                                      = ∫_0^t ϱ_{s|s}(y_{0:s}, fh) dY(s).

Using Fubini's theorem and once more Lemma A.2, the third term in (2.4) reads as

    ∫_0^t E_Q( Z̃(s) Gf(X(s)) | Y(0:t) ) ds = ∫_0^t E_Q( Z̃(s) Gf(X(s)) | Y(0:s) ) ds
                                            = ∫_0^t ϱ_{s|s}(y_{0:s}, Gf) ds.

Finally, the last term in (2.4) vanishes due to Lemma 2.5(2). Summing up, we obtain (2.3).

We do not discuss uniqueness of the solution to the Zakai equation here but refer instead to [BC09, Chapter 4].


Remark 2.7 The Zakai equation holds for the constant function f = 1 as well. Alternatively, ϱ_{t|t}(y_{0:t}, 1) can be obtained as a limit. Indeed, monotone convergence yields

    ϱ_{t|t}(y_{0:t}, f_n) ↑ ϱ_{t|t}(y_{0:t}, 1)

if (f_n) denotes a sequence of nonnegative functions as in Theorem 2.6 such that f_n ↑ 1.
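Numerically, the Zakai equation is often attacked by a splitting-up scheme: a density representing ϱ_{t|t}(y_{0:t}, ·) on a spatial grid is first propagated with the adjoint of G (the Fokker-Planck operator) and then multiplied by the observation likelihood. The following one-dimensional sketch with explicit finite differences and constant σ is an illustration under these assumptions, not a scheme discussed in the notes:

```python
import math

def zakai_step(rho, xs, b, sigma, h, dy, dt):
    """One splitting step for the unnormalised filtering density on a
    grid xs: an explicit finite-difference step with the Fokker-Planck
    operator -d/dx(b rho) + (sigma^2/2) d2/dx2 rho (the adjoint of G),
    followed by multiplication with the observation likelihood
    exp(h dY - h^2 dt / 2). Stability requires dt <= dx^2 / sigma^2."""
    n, dx = len(xs), xs[1] - xs[0]
    flux = [b(x) * r for x, r in zip(xs, rho)]    # b * rho at grid points
    new = list(rho)                               # boundary values kept fixed
    for i in range(1, n - 1):
        adv = (flux[i + 1] - flux[i - 1]) / (2 * dx)          # d/dx (b rho)
        dif = (rho[i + 1] - 2 * rho[i] + rho[i - 1]) / dx**2  # d2/dx2 rho
        new[i] = rho[i] + dt * (-adv + 0.5 * sigma**2 * dif)
    return [r * math.exp(h(x) * dy - 0.5 * h(x) ** 2 * dt)
            for x, r in zip(xs, new)]
```

Normalising the resulting grid values at any time recovers an approximation of π_{t|t}(y_{0:t}, ·) via Lemma 2.4.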

In view of Theorem 1.6 we may wonder whether we can directly state an equation for the original normalised filter, i.e. for π_{t|t}(y_{0:t}, f). This is indeed possible by applying Itô's formula to the Zakai equation.

Theorem 2.8 (Kushner-Stratonovich equation) Let f : R^m → R be twice continuously differentiable with compact support. Then we have

    dπ_{t|t}(y_{0:t}, f) = π_{t|t}(y_{0:t}, Gf)dt
        + ( π_{t|t}(y_{0:t}, fh) − π_{t|t}(y_{0:t}, f) π_{t|t}(y_{0:t}, h) ) ( dY(t) − π_{t|t}(y_{0:t}, h)dt ).      (2.5)

Proof. Letting U(t) = ϱ_{t|t}(y_{0:t}, f) and V(t) = ϱ_{t|t}(y_{0:t}, 1), integration by parts and Itô's formula yield

    dπ_{t|t}(y_{0:t}, f) = d(U/V)(t)
      = (1/V(t)) dU(t) + U(t) d(1/V)(t) + d[U, 1/V](t)
      = (1/V(t)) dU(t) − (U(t)/V(t)^2) dV(t) + (U(t)/V(t)^3) d[V, V](t) − (1/V(t)^2) d[U, V](t).

By Zakai's equation (2.2) this equals

    (ϱ_{t|t}(y_{0:t}, Gf)/ϱ_{t|t}(y_{0:t}, 1)) dt + (ϱ_{t|t}(y_{0:t}, fh)/ϱ_{t|t}(y_{0:t}, 1)) dY(t)
      − (ϱ_{t|t}(y_{0:t}, f) ϱ_{t|t}(y_{0:t}, h)/ϱ_{t|t}(y_{0:t}, 1)^2) dY(t)
      + (ϱ_{t|t}(y_{0:t}, f) ϱ_{t|t}(y_{0:t}, h)^2/ϱ_{t|t}(y_{0:t}, 1)^3) dt
      − (ϱ_{t|t}(y_{0:t}, fh) ϱ_{t|t}(y_{0:t}, h)/ϱ_{t|t}(y_{0:t}, 1)^2) dt,

which in turn coincides with the right-hand side of (2.5) by Lemma 2.4.

Again we refer to [BC09, Chapter 4] for a discussion of uniqueness of the solution to this equation.

2.2 Outlook

Let us briefly discuss some results that are missing in this chapter for lack of time. Theorems 2.6 and 2.8 can be extended to obtain expressions for the predictors ϱ_{t|s} resp. π_{t|s} with t > s.


The derivation of counterparts to the smoothing steps in Theorems 1.3 and 1.6 turns out to be more involved. It either requires introducing backward stochastic integration or some clever change of variables to avoid the latter, cf. [KE02].

The diffusion model of Equation (2.1) can be extended to more general hidden Markov models. In particular, the diffusion X can be replaced by a quite arbitrary continuous-time Markov process in R^m. Since the Zakai and Kushner-Stratonovich equations for ϱ_{t|t}(y_{0:t}, f) and π_{t|t}(y_{0:t}, f) involve e.g. fh on their right-hand side, which in turn depends on fh^2 etc., it is not obvious whether they reduce to some finite system of stochastic differential equations. And indeed, this is rarely the case. The most prominent example concerns linear Gaussian systems, namely the continuous-time counterpart of Section 1.2. The Kalman-Bucy filter amounts to solving an ordinary differential equation for the filter variance Σ_{t|t} and a linear stochastic differential equation for the conditional mean µ_{t|t}(y_{0:t}). Expressions for prediction and smoothing can be obtained as well.
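In the scalar case the Kalman-Bucy filter takes a particularly transparent form. The following Euler discretisation (hypothetical interface; signal dX = aX dt + σ dW^X, observation dY = cX dt + dW^Y) sketches the Riccati equation for the filter variance and the linear SDE for the conditional mean:

```python
def kalman_bucy(y_path, dt, a, sig, c, mu0, p0):
    """Euler discretisation of the scalar Kalman-Bucy filter for
    dX = a X dt + sig dWX,  dY = c X dt + dWY.
    The filter variance solves the (deterministic) Riccati ODE
        dP/dt = 2 a P + sig^2 - c^2 P^2,
    and the conditional mean solves the linear SDE
        d mu = a mu dt + P c (dY - c mu dt)."""
    mu, p = mu0, p0
    mus, ps = [mu], [p]
    for k in range(len(y_path) - 1):
        dy = y_path[k + 1] - y_path[k]
        # mean update uses the variance at the current time step
        mu = mu + a * mu * dt + p * c * (dy - c * mu * dt)
        p = p + (2 * a * p + sig * sig - c * c * p * p) * dt
        mus.append(mu)
        ps.append(p)
    return mus, ps
```

Note that the variance recursion consumes no observations at all, mirroring the discrete-time fact that Σ_{t|t} does not depend on y_{0:t}.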

The second example concerns the case where the Markov process X attains only finitely many values, i.e. it reduces to a continuous-time Markov chain. The resulting Wonham filter can be viewed as a continuous-time analogue of Example 1.9.

Likelihood inference rests on the same pillars as in Section 1.3. As in Corollary 1.23, the non-normalised filter ϱ_{ϑ,t|t}(y_{0:t}, 1) can be interpreted as the likelihood of y_{0:t} under P_ϑ, here relative to the Wiener measure. Lemmas 1.25 and 1.26 allow for continuous-time counterparts as well.

Parameter estimation for a noisy continuous-time Markov chain can be tackled by using the EM algorithm as in Section 1.3.5. It leads to an analogue of (1.41): the numerator is to be replaced by the estimated number of transitions from a_i to a_j, whereas the estimated time in state a_i is substituted for the denominator. Not surprisingly, a continuous counterpart exists for the estimating procedures of Section 1.3.6 for linear Gaussian models.


Appendix A

Some measure theory

A.1 Some facts on conditional probabilities and expectations

We recall some properties of conditional laws and expectations whose proofs are left to the reader.

Lemma A.1 Suppose that X, Y are random variables with

    P^{(X,Y)}(d(x,y)) = λ(x,y) ϕ(dy) P^X(dx)

for some σ-finite measure ϕ and some measurable function λ. Put differently, the conditional laws P^{Y|X=x}(·) are absolutely continuous with respect to ϕ with Radon-Nikodym density λ(x, ·) for any x. Then

    P^{X|Y=y}(A) = ∫ 1_A(x) λ(x,y) P^X(dx) / ∫ λ(x,y) P^X(dx)

for any event A and P^Y-almost any y.

Lemma A.2 If X is a random variable and C, D are σ-fields such that D is independent of σ(X) ∨ C, then

    E(X | C ∨ D) = E(X | C).

Lemma A.3 If X, Y are random variables, f is a real-valued function, and X is measurable with respect to the σ-field C, we have

    E( f(X,Y) | C ) = ∫ f(X,y) P^{Y|C}(dy).

If Y and C are independent, we have P^{Y|C} = P^Y and hence

    E( f(X,Y) | C ) = ∫ f(X,y) P^Y(dy).


Lemma A.4 For Q ≪ P with density dQ/dP and random variables X we have

    dQ^X/dP^X (x) = E( dQ/dP | X = x )

for P^X-almost all x.

Lemma A.5 For Q ≪ P with density dQ/dP, nonnegative or Q-integrable random variables X, and sub-σ-fields C we have

    E_Q(X | C) = E_P( X dQ/dP | C ) / E_P( dQ/dP | C ).

Lemma A.6 Let X = (X(t))_{t∈N} be the canonical process on Ω := (R^d)^N, generating the filtration (F_t)_{t∈N}. Suppose that P = P^X and Q = Q^X are probability measures on Ω with Q^{X(0)} ≪ P^{X(0)} and Q^{X(s)|X(0:s−1)} ≪ P^{X(s)|X(0:s−1)} for s = 1, . . . , t. Then Q^X|_{F_t} ≪ P^X|_{F_t} with density

    dQ^X|_{F_t} / dP^X|_{F_t} = dQ^{X(0)}/dP^{X(0)}(X(0)) ∏_{s=1}^{t} dQ^{X(s)|X(0:s−1)}/dP^{X(s)|X(0:s−1)}(X(0:s−1), X(s)).

Lemma A.7 Let X, Y, Z be Polish spaces. Define a mapping c : M_1(X × Y) × Y → M_1(X) by

    c(P^{(X,Y)}, y) := P^{X|Y=y}

for any X × Y-valued random variable (X, Y) on some probability space (Ω, F, P). Here, M_1(·) denotes the set of all probability measures on the space in parentheses. (Note that c is unique only up to equivalence classes connected to P^Y-null sets.) Now, let (X, Y, Z) be some X × Y × Z-valued random variable. Then

    P^{X|(Y,Z)=(y,z)} = c( P^{(X,Y)|Z=z}, y )                                                 (A.1)

for any y ∈ Y, z ∈ Z.

Proof. By definition c(P^{(X,Y)|Z=z}, y) satisfies

    ∫∫ 1_{A×B}(x,y) c(P^{(X,Y)|Z=z}, y)(dx) P^{Y|Z=z}(dy) = ∫ 1_{A×B}(x,y) P^{(X,Y)|Z=z}(d(x,y))

for any measurable sets A ⊂ X, B ⊂ Y. For (A.1) to hold we need to verify that

    ∫ 1_{B×C}(y,z) c(P^{(X,Y)|Z=z}, y)(A) P^{(Y,Z)}(d(y,z)) = ∫ 1_A(x) 1_{B×C}(y,z) P^{(X,Y,Z)}(d(x,y,z))

for any measurable sets A ⊂ X, B ⊂ Y, C ⊂ Z. This follows from

    ∫ 1_{B×C}(y,z) c(P^{(X,Y)|Z=z}, y)(A) P^{(Y,Z)}(d(y,z))
      = ∫_C ∫∫ 1_B(y) 1_A(x) c(P^{(X,Y)|Z=z}, y)(dx) P^{Y|Z=z}(dy) P^Z(dz)
      = ∫_C ∫ 1_B(y) 1_A(x) P^{(X,Y)|Z=z}(d(x,y)) P^Z(dz)
      = ∫ 1_A(x) 1_{B×C}(y,z) P^{(X,Y,Z)}(d(x,y,z))

as desired.

Remark A.8 Note that

    P^{(X,Y)} = P^{Y|X} ⊗ P^X

in the sense that P^{(X,Y)}(d(x,y)) = P^{Y|X=x}(dy) ⊗ P^X(dx). Similarly, we have

    P^{(X,Y)|Z} = P^{Y|(Z,X)} ⊗ P^{X|Z}

in the sense that P^{(X,Y)|Z=z}(d(x,y)) = P^{Y|(Z,X)=(z,x)}(dy) ⊗ P^{X|Z=z}(dx). Indeed, this follows from

    P^{(X,Y)|Z=z}(d(x,y)) P^Z(dz) = P^{(Z,X,Y)}(d(z,x,y)) = P^{Y|(Z,X)=(z,x)}(dy) P^{X|Z=z}(dx) P^Z(dz).

A.2 The multivariate normal distribution

Let us recall some properties of the multivariate normal distribution.

1. By definition, a random vector X = (X_1, . . . , X_d) is multivariate normal if and only if the univariate random variable a^⊤X = ∑_{i=1}^{d} a_i X_i is normally distributed for any a = (a_1, . . . , a_d) ∈ R^d.

2. The notation X ∼ Nd(µ,Σ) indicates that X is an Rd-valued random vector withE(X) = µ ∈ Rd, Cov(X) = Σ ∈ Rd×d.

3. X ∼ N_d(µ, Σ) holds if and only if the characteristic function of X is of the form

       ϕ_X(u) = E( exp(iu^⊤X) ) = exp( iu^⊤µ − (1/2) u^⊤Σu ),   u ∈ R^d.

4. Let µ ∈ R^d be a vector and Σ ∈ R^{d×d} a symmetric, positive definite matrix. Then X ∼ N_d(µ, Σ) holds if and only if

       f_X(x) = (2π)^{−d/2} (det Σ)^{−1/2} exp( −(1/2) (x − µ)^⊤ Σ^{−1} (x − µ) )

   holds for the density function.


5. Let X ∼ N_d(µ, Σ).

   (a) (Linear transformations) Then BX + b ∼ N_k(Bµ + b, BΣB^⊤) for b ∈ R^k, B ∈ R^{k×d}.

   (b) (Marginal laws) Let X_{(1)} = (X_1, . . . , X_k) and X_{(2)} = (X_{k+1}, . . . , X_d), which implies X = (X_{(1)}, X_{(2)}). Moreover, write

           µ = (µ_{(1)}, µ_{(2)}),   Σ = ( Σ_{(11)}  Σ_{(12)} ; Σ_{(21)}  Σ_{(22)} ).

       Then X_{(1)} ∼ N_k(µ_{(1)}, Σ_{(11)}) and X_{(2)} ∼ N_{d−k}(µ_{(2)}, Σ_{(22)}).

   (c) (Conditional laws) Suppose that det Σ_{(11)} ≠ 0. Then the conditional law of X_{(2)} given X_{(1)} = x_1 is

           P^{X_{(2)}|X_{(1)}=x_1} = N_{d−k}(µ_{2,1}, Σ_{22,1}),                              (A.2)

       where µ_{2,1} = µ_{(2)} + Σ_{(21)} Σ_{(11)}^{−1} (x_1 − µ_{(1)}) and Σ_{22,1} = Σ_{(22)} − Σ_{(21)} Σ_{(11)}^{−1} Σ_{(12)}. This holds for general Σ_{(11)} as well if we denote by Σ_{(11)}^{−1} the Moore-Penrose pseudoinverse of Σ_{(11)}.

   (d) (Convolution) If X ∼ N_d(µ, Σ) and Y ∼ N_d(µ̃, Σ̃) are independent, we have X + Y ∼ N_d(µ + µ̃, Σ + Σ̃).

   (e) (Representation based on standard normal vectors) X has the representation X = µ + AZ with Z ∼ N_k(0, 1_k) for some k ≤ d and some A ∈ R^{d×k} with AA^⊤ = Σ. If det Σ ≠ 0, then k = d.

Lemma A.9 Suppose that X_{(1)} = (X_1, . . . , X_k) and X_{(2)} = (X_{k+1}, . . . , X_d) are random vectors such that X_{(1)} ∼ N(µ_{(1)}, Σ_{(11)}) and

    P^{X_{(2)}|X_{(1)}=x_1} = N_{d−k}(α + βx_1, γ)

for some µ_{(1)} ∈ R^k, Σ_{(11)} ∈ R^{k×k}, α ∈ R^{d−k}, β ∈ R^{(d−k)×k}, γ ∈ R^{(d−k)×(d−k)}. Then X = (X_{(1)}, X_{(2)}) is Gaussian with mean µ = (µ_{(1)}, α + βµ_{(1)}) and covariance matrix

    Σ = ( Σ_{(11)}  Σ_{(11)}β^⊤ ; βΣ_{(11)}  βΣ_{(11)}β^⊤ + γ ).

Proof. The proof is left as an exercise.
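As a quick sanity check one can verify numerically that the conditional-law formula (A.2) and Lemma A.9 are inverse to each other in the bivariate case d = 2, k = 1 (helper names are hypothetical):

```python
def conditional_params(mu1, mu2, s11, s12, s22):
    """(A.2) for a bivariate normal: the law of X2 given X1 = x1 is
    N(mu2 + s12/s11 * (x1 - mu1), s22 - s12**2 / s11). Returns the
    affine coefficients alpha, beta and the residual variance gamma."""
    beta = s12 / s11
    alpha = mu2 - beta * mu1
    gamma = s22 - s12 * s12 / s11
    return alpha, beta, gamma

def joint_from_conditional(mu1, s11, alpha, beta, gamma):
    """Lemma A.9 in dimension d = 2: reassemble the mean and covariance
    of (X1, X2) from X1 ~ N(mu1, s11) and X2 | X1 = x1 ~ N(alpha + beta*x1, gamma)."""
    mu = (mu1, alpha + beta * mu1)
    cov = ((s11, beta * s11), (beta * s11, beta * beta * s11 + gamma))
    return mu, cov
```

Feeding the output of the first function into the second recovers the joint parameters one started from, which is exactly the content of Lemma A.9.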


Bibliography

[BC09] A. Bain and D. Crisan. Fundamentals of Stochastic Filtering. Springer, NewYork, 2009.

[CLG89] F. Campillo and F. Le Gland. MLE for partially observed diffusions: direct max-imization vs. the EM algorithm. Stochastic Process. Appl., 33:245–274, 1989.

[CMR05] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models.Springer, New York, 2005.

[Kal80] G. Kallianpur. Stochastic Filtering Theory. Springer, New York, 1980.

[KE02] V. Krishnamurthy and R. Elliott. Robust continuous-time smoothers without two-sided stochastic integrals. IEEE Trans. Automatic Contr., 47:1824–1841, 2002.

[LS01a] R. Liptser and A. Shiryaev. Statistics of Random Processes I. Springer, Berlin, second edition, 2001.

[LS01b] R. Liptser and A. Shiryaev. Statistics of Random Processes II. Springer, Berlin,second edition, 2001.

[Øks03] B. Øksendal. Stochastic Differential Equations. Springer, Berlin, sixth edition,2003.

[Pap14] A. Papanicolaou. Stochastic analysis seminar on filtering theory. arXiv preprintarXiv:1406.1936, 2014.

[Par] E. Pardoux. Équations du filtrage non linéaire de la prédiction et du lissage.Stochastics, 6:193–231.

[Run10] W. Runggaldier. Filtering. In R. Cont, editor, Encyclopedia of QuantitativeFinance. Wiley, New York, 2010.

[Sør12] Michael Sørensen. Estimating functions for diffusion-type processes. InM. Kessler, A. Lindner, and M. Sørensen, editors, Statistical Methods forStochastic Differential Equations, pages 1–107. CRC Press, Boca Raton, 2012.

[vH08] R. van Handel. Hidden Markov models. Lecture Notes, 2008.
