
Estimation of the score vector and observed information matrix in intractable models

Arnaud Doucet (University of Oxford)
Pierre E. Jacob (University of Oxford)

Sylvain Rubenthaler (Université Nice Sophia Antipolis)

April 15th, 2015


Outline

1 Context

2 General results

3 Monte Carlo

4 Hidden Markov models


Motivation

Derivatives of the likelihood help with optimization and sampling.

For many models they are not available.

One can resort to approximation techniques.


Motivation

Let ℓ(θ) denote a “log-likelihood”, although it could be any function.

Let L(θ) = exp ℓ(θ), the “likelihood”.

Assume that we have access to estimators $\hat{L}(\theta)$ of $L(\theta)$ such that

$$\mathbb{E}[\hat{L}(\theta)] = L(\theta) \quad\text{and}\quad \mathbb{V}[\hat{L}(\theta)] = \frac{L(\theta)^2\, v(\theta)}{M},$$

where $M$ quantifies the computational effort (e.g. a number of Monte Carlo samples).


Finite difference

First derivative:

$$\hat{\ell}^{(1)}(\theta_\star) = \frac{\log \hat{L}(\theta_\star + h) - \log \hat{L}(\theta_\star - h)}{2h}$$

converges to $\nabla\ell(\theta_\star)$ when $M \to \infty$ and $h \to 0$.

Second derivative:

$$\hat{\ell}^{(2)}(\theta_\star) = \frac{\log \hat{L}(\theta_\star + h) - 2\log \hat{L}(\theta_\star) + \log \hat{L}(\theta_\star - h)}{h^2}$$

converges to $\nabla^2\ell(\theta_\star)$ when $M \to \infty$ and $h \to 0$.


Finite difference

Optimal rate of convergence for the first derivative:

$h \sim M^{-1/6}$, leading to MSE $\sim M^{-2/3}$.

For the second derivative:

$h \sim M^{-1/8}$, leading to MSE $\sim M^{-1/2}$.
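As a minimal illustration of these finite-difference estimators (not code from the talk), assume a user-supplied routine `loglik_hat(theta, M)` returning a noisy log-likelihood estimate for a computational budget `M`; the toy model below is a placeholder.

```python
import numpy as np

def fd_score(loglik_hat, theta_star, h, M):
    """Central finite-difference estimate of the score at theta_star."""
    return (loglik_hat(theta_star + h, M) - loglik_hat(theta_star - h, M)) / (2 * h)

def fd_hessian(loglik_hat, theta_star, h, M):
    """Central finite-difference estimate of the second derivative at theta_star."""
    return (loglik_hat(theta_star + h, M)
            - 2 * loglik_hat(theta_star, M)
            + loglik_hat(theta_star - h, M)) / h ** 2

# Placeholder "estimated" log-likelihood: an exact Gaussian log-likelihood plus noise
# whose standard deviation shrinks as 1/sqrt(M), mimicking Monte Carlo error.
rng = np.random.default_rng(1)
y = 1.3
def loglik_hat(theta, M):
    return -0.5 * (y - theta) ** 2 + rng.normal(scale=1.0 / np.sqrt(M))

M = 1000
print(fd_score(loglik_hat, 0.5, h=M ** (-1 / 6), M=M))    # close to y - 0.5 = 0.8
print(fd_hessian(loglik_hat, 0.5, h=M ** (-1 / 8), M=M))  # close to -1
```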


Outline

1 Context

2 General results

3 Monte Carlo

4 Hidden Markov models


Iterated Filtering

Given a log-likelihood ℓ and a point $\theta_\star$, consider a prior

$$\theta \sim \mathcal{N}(\theta_\star, \tau^2).$$

Posterior expectation when the prior variance goes to zero

First-order moments give first-order derivatives:

$$\left|\tau^{-2}\left(\mathbb{E}[\theta \mid Y] - \theta_\star\right) - \nabla\ell(\theta_\star)\right| \le C\tau.$$

Phrased simply,

$$\frac{\text{posterior mean} - \text{prior mean}}{\text{prior variance}} \approx \text{score}.$$

Result from Ionides, Bhadra, Atchade, King, Iterated filtering, 2011.


Stein’s lemma

Stein’s lemma states that $\theta \sim \mathcal{N}(\theta_\star, \tau^2)$ if and only if, for any function $g$ such that $\mathbb{E}[|\nabla g(\theta)|] < \infty$,

$$\mathbb{E}[(\theta - \theta_\star)\, g(\theta)] = \tau^2\, \mathbb{E}[\nabla g(\theta)].$$

If we choose the function $g : \theta \mapsto \exp\ell(\theta)/Z$ with $Z = \mathbb{E}[\exp\ell(\theta)]$ and apply Stein’s lemma, we obtain

$$\frac{1}{Z}\mathbb{E}[\theta \exp\ell(\theta)] - \theta_\star = \frac{\tau^2}{Z}\mathbb{E}[\nabla\ell(\theta)\exp\ell(\theta)]
\;\Leftrightarrow\; \tau^{-2}\left(\mathbb{E}[\theta \mid Y] - \theta_\star\right) = \mathbb{E}[\nabla\ell(\theta) \mid Y].$$

Notation: $\mathbb{E}[\varphi(\theta) \mid Y] := \mathbb{E}[\varphi(\theta)\exp\ell(\theta)]/Z$.
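As a quick sanity check (not from the talk), Stein’s identity can be verified by Monte Carlo for a toy log-likelihood; the choice $\ell(\theta) = -(y-\theta)^2/2$ below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, tau, y = 0.5, 0.3, 1.3
theta = rng.normal(theta_star, tau, size=10**6)      # theta ~ N(theta_star, tau^2)

ell = lambda t: -0.5 * (t - y) ** 2                  # toy log-likelihood
grad_ell = lambda t: y - t
g = lambda t: np.exp(ell(t))                         # unnormalized: the constant Z cancels on both sides

lhs = np.mean((theta - theta_star) * g(theta))       # E[(theta - theta_star) g(theta)]
rhs = tau ** 2 * np.mean(grad_ell(theta) * g(theta)) # tau^2 E[grad g(theta)], since grad g = (grad ell) * g
print(lhs, rhs)                                      # the two Monte Carlo estimates nearly coincide
```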


Stein’s lemma

For the second derivative, we consider

$$h : \theta \mapsto (\theta - \theta_\star)\exp\ell(\theta)/Z.$$

Then

$$\mathbb{E}\left[(\theta - \theta_\star)^2 \mid Y\right] = \tau^2 + \tau^4\,\mathbb{E}\left[\nabla^2\ell(\theta) + \nabla\ell(\theta)^2 \mid Y\right].$$

Adding and subtracting terms also yields

$$\tau^{-4}\left(\mathbb{V}[\theta \mid Y] - \tau^2\right) = \mathbb{E}\left[\nabla^2\ell(\theta) \mid Y\right] + \left\{\mathbb{E}\left[\nabla\ell(\theta)^2 \mid Y\right] - \left(\mathbb{E}[\nabla\ell(\theta) \mid Y]\right)^2\right\}.$$

. . . but what we really want is $\nabla\ell(\theta_\star)$ and $\nabla^2\ell(\theta_\star)$, not $\mathbb{E}[\nabla\ell(\theta) \mid Y]$ and $\mathbb{E}[\nabla^2\ell(\theta) \mid Y]$.


Core Idea

The prior is a normal distribution $\mathcal{N}(\theta_\star, \tau^2)$. The prior moments behave like:

$$\mathbb{E}_\tau[\varphi(\Theta)] = \varphi(\theta_\star) + \frac{\tau^2}{2}\nabla^2\varphi(\theta_\star) + O(\tau^4).$$

The posterior moments behave like:

$$\mathbb{E}_\tau[\varphi(\Theta) \mid Y] = \varphi(\theta_\star) + \frac{\tau^2}{2}\left(\nabla^2\varphi(\theta_\star) + 2\nabla\varphi(\theta_\star)\nabla\ell(\theta_\star)\right) + O(\tau^4).$$

Our arXived proof suffers from an overdose of Taylor expansions.


Proof: prior moments

Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a four times continuously differentiable function. Assume that there exist a constant $M < \infty$ and $\delta > 0$ such that

$$\left|\frac{d^4\varphi(\theta)}{d\theta^4}\right| \le M \quad \text{for all } \theta \in B(\theta_\star, \delta).$$

Cut the expectation into two parts:

$$\mathbb{E}_\tau[\varphi(\Theta)] = \int_{B(\theta_\star,\delta)} \varphi(\theta)\, p_\tau(\theta)\, d\theta + \int_{B^c(\theta_\star,\delta)} \varphi(\theta)\, p_\tau(\theta)\, d\theta.$$

The second term is $o(\tau^k)$ for any power $k$.


Proof: prior moments

Let’s deal with the first term:

$$\int_{B(\theta_\star,\delta)} \varphi(\theta)\, p_\tau(d\theta).$$

Taylor expansion:

$$\forall \theta \in B(\theta_\star, \delta) \qquad \varphi(\theta) = \varphi(\theta_\star) + \sum_{k=1,2,3} \frac{d^k\varphi(\theta_\star)}{d\theta^k}\frac{1}{k!}(\theta - \theta_\star)^k + R_3(\theta, \theta_\star).$$

The Gaussian prior integrates any $(\theta - \theta_\star)^k$ with odd $k$ to zero over $B(\theta_\star, \delta)$, so

$$\int_{B(\theta_\star,\delta)} \varphi(\theta)\, p_\tau(d\theta) = \varphi(\theta_\star) + \frac{d^2\varphi(\theta_\star)}{d\theta^2}\int_{B(\theta_\star,\delta)} \frac{1}{2}(\theta - \theta_\star)^2\, p_\tau(d\theta) + \int_{B(\theta_\star,\delta)} R_3(\theta, \theta_\star)\, p_\tau(d\theta).$$


Proof: prior moments

For

$$\frac{d^2\varphi(\theta_\star)}{d\theta^2}\int_{B(\theta_\star,\delta)} \frac{1}{2}(\theta - \theta_\star)^2\, p_\tau(d\theta)$$

we “complete the integral” over all of $\mathbb{R}$ and subtract an integral over $B^c(\theta_\star, \delta)$, which is $o(\tau^k)$ for any $k$. We are left with

$$\frac{d^2\varphi(\theta_\star)}{d\theta^2}\,\frac{\tau^2}{2} + o(\tau^k) \text{ for any } k.$$

For

$$\int_{B(\theta_\star,\delta)} R_3(\theta, \theta_\star)\, p_\tau(d\theta),$$

we use the assumption to say that for all $\theta$ there is a $\tilde\theta$ in $B(\theta_\star, \delta)$ such that

$$R_3(\theta, \theta_\star) = \frac{d^4\varphi(\tilde\theta)}{d\theta^4}\,\frac{(\theta - \theta_\star)^4}{4!}.$$


Proof: prior moments

Since $|d^4\varphi(\theta)/d\theta^4|$ is upper bounded by $M$ by assumption:

$$\left|\int_{B(\theta_\star,\delta)} R_3(\theta,\theta_\star)\, p_\tau(d\theta)\right| \le M\int_{B(\theta_\star,\delta)} \frac{(\theta-\theta_\star)^4}{4!}\, p_\tau(d\theta) \le M\int_{\mathbb{R}} \frac{(\theta-\theta_\star)^4}{4!}\, p_\tau(d\theta) = \tau^4 \times C.$$

Combining all the terms, we obtain

$$\mathbb{E}_\tau[\varphi(\Theta)] = \varphi(\theta_\star) + \frac{\tau^2}{2}\nabla^2\varphi(\theta_\star) + O(\tau^4).$$


Proof: posterior moments

We want to obtain the posterior moments:

$$\mathbb{E}_\tau[\varphi(\Theta) \mid Y] = \varphi(\theta_\star) + \frac{\tau^2}{2}\left(\nabla^2\varphi(\theta_\star) + 2\nabla\varphi(\theta_\star)\nabla\ell(\theta_\star)\right) + O(\tau^4).$$

We write:

$$\mathbb{E}_\tau[\varphi(\Theta) \mid Y] = \frac{\mathbb{E}_\tau[\varphi(\Theta)\, L(\Theta)]}{\mathbb{E}_\tau[L(\Theta)]}.$$

Then we apply the prior moment expansion to $\varphi \times L$ and to $L$, and simplify the ratio of the two expansions.

We need to assume that the likelihood is four times continuously differentiable, with bounded fourth derivatives around $\theta_\star$.
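Spelling out that ratio (a step only sketched on the slide):

$$\frac{\varphi(\theta_\star)L(\theta_\star) + \frac{\tau^2}{2}\nabla^2(\varphi L)(\theta_\star) + O(\tau^4)}{L(\theta_\star) + \frac{\tau^2}{2}\nabla^2 L(\theta_\star) + O(\tau^4)} = \varphi(\theta_\star) + \frac{\tau^2}{2}\left(\frac{\nabla^2(\varphi L)(\theta_\star)}{L(\theta_\star)} - \varphi(\theta_\star)\frac{\nabla^2 L(\theta_\star)}{L(\theta_\star)}\right) + O(\tau^4),$$

and since $\nabla^2(\varphi L) = (\nabla^2\varphi)L + 2\nabla\varphi\,\nabla L + \varphi\,\nabla^2 L$ with $\nabla L = L\,\nabla\ell$, the bracket reduces to $\nabla^2\varphi(\theta_\star) + 2\nabla\varphi(\theta_\star)\nabla\ell(\theta_\star)$.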


Main results

In general, with a prior $\mathcal{N}(\theta_\star, \tau^2\Sigma)$, when $\Sigma$ is fixed and $\tau$ goes to zero,

$$\tau^{-2}\Sigma^{-1}\,\mathbb{E}_{\mathrm{post}}\left[\Theta - \theta_\star\right] = \nabla\ell(\theta_\star) + O(\tau^2),$$

$$\tau^{-4}\Sigma^{-1}\left(\mathbb{V}_{\mathrm{post}}[\Theta] - \tau^2\Sigma\right)\Sigma^{-1} = \nabla^2\ell(\theta_\star) + O(\tau^2).$$


Extension of Iterated Filtering

Posterior variance when the prior variance goes to zero

Second-order moments give second-order derivatives:

$$\left|\tau^{-4}\left(\mathrm{Cov}[\theta \mid Y] - \tau^2\right) - \nabla^2\ell(\theta_\star)\right| \le C\tau^2.$$

Phrased simply,

$$\frac{\text{posterior variance} - \text{prior variance}}{\text{prior variance}^2} \approx \text{Hessian}.$$

Result from Doucet, Jacob, Rubenthaler on arXiv, 2013.


Proximity mapping

Given a real function $f$ and a point $\theta_0$, consider for any $\tau^2 > 0$

$$\theta \mapsto f(\theta)\exp\left\{-\frac{1}{2\tau^2}(\theta - \theta_0)^2\right\}.$$

Figure: Example for $f : \theta \mapsto \exp(-|\theta|)$ and three values of $\tau^2$.


Proximity mapping

Proximity mapping

The $\tau^2$-proximity mapping is defined by

$$\mathrm{prox}_f : \theta_0 \mapsto \operatorname*{argmax}_{\theta\in\mathbb{R}}\; f(\theta)\exp\left\{-\frac{1}{2\tau^2}(\theta - \theta_0)^2\right\}.$$

Moreau approximation

The $\tau^2$-Moreau approximation is defined by

$$f_{\tau^2} : \theta_0 \mapsto C \sup_{\theta\in\mathbb{R}} f(\theta)\exp\left\{-\frac{1}{2\tau^2}(\theta - \theta_0)^2\right\},$$

where $C$ is a normalizing constant.


Proximity mapping

Figure: $\theta \mapsto f(\theta)$ and $\theta \mapsto f_{\tau^2}(\theta)$ for three values of $\tau^2$.


Proximity mapping

Property

Those objects are such that

$$\frac{\mathrm{prox}_f(\theta_0) - \theta_0}{\tau^2} = \nabla\log f_{\tau^2}(\theta_0) \xrightarrow[\tau^2\to 0]{} \nabla\log f(\theta_0).$$

Moreau (1962), Fonctions convexes duales et points proximaux dans un espace Hilbertien.

Pereyra (2013), Proximal Markov chain Monte Carlo algorithms.


Proximity mapping

Bayesian interpretation

If $f$ is seen as a likelihood function then

$$\theta \mapsto f(\theta)\exp\left\{-\frac{1}{2\tau^2}(\theta - \theta_0)^2\right\}$$

is an unnormalized posterior density function based on a Normal prior with mean $\theta_0$ and variance $\tau^2$.

Hence

$$\frac{\mathrm{prox}_f(\theta_0) - \theta_0}{\tau^2} \xrightarrow[\tau^2\to 0]{} \nabla\log f(\theta_0)$$

can be read as

$$\frac{\text{posterior mode} - \text{prior mode}}{\text{prior variance}} \approx \text{score}.$$
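As a small numerical illustration (not from the talk), a grid-based sketch of the proximity mapping for $f(\theta) = \exp(-|\theta|)$, the example plotted above, checking that $(\mathrm{prox}_f(\theta_0) - \theta_0)/\tau^2$ approaches $\nabla\log f(\theta_0) = -\mathrm{sign}(\theta_0)$:

```python
import numpy as np

def prox_grid(theta0, tau2, logf, grid):
    """Maximize log f(theta) - (theta - theta0)^2 / (2 tau^2) over a grid of theta values."""
    objective = logf(grid) - (grid - theta0) ** 2 / (2 * tau2)
    return grid[np.argmax(objective)]

logf = lambda t: -np.abs(t)           # f(theta) = exp(-|theta|)
grid = np.linspace(-5, 5, 200001)
theta0 = 1.5                          # a point away from the kink at zero

for tau2 in [1.0, 0.1, 0.01]:
    p = prox_grid(theta0, tau2, logf, grid)
    print(tau2, (p - theta0) / tau2)  # approaches d/dtheta log f(theta0) = -1
```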


Outline

1 Context

2 General results

3 Monte Carlo

4 Hidden Markov models


Moment shift estimator

How to estimate

$$S(\theta_\star) = \tau^{-2}\Sigma^{-1}\,\mathbb{E}_{\mathrm{post}}\left[\Theta - \theta_\star\right] = \nabla\ell(\theta_\star) + O(\tau^2)\,?$$

Importance sampling estimator with the prior as proposal:

$$\hat{S}_N(\theta_\star) = \tau^{-2}\Sigma^{-1}\left(\sum_{i=1}^N W_i\,\theta_i - \theta_\star\right),$$

where $\theta_1, \dots, \theta_N$ are drawn from the prior and

$$W_i = \frac{\hat{L}(\theta_i)}{\sum_{j=1}^N \hat{L}(\theta_j)}.$$

It turns out to be better to use

$$\hat{S}_N(\theta_\star) = \tau^{-2}\Sigma^{-1}\left(\sum_{i=1}^N W_i\,\theta_i - \frac{1}{N}\sum_{i=1}^N \theta_i\right).$$

Any idea why?
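A minimal sketch of this estimator (not from the talk) on a toy Gaussian model where the exact score is known; the model, sample sizes, and the use of the exact likelihood in place of an estimator $\hat{L}$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
y, theta_star, tau = 1.3, 0.5, 0.2
exact_score = y - theta_star                       # score of ell(theta) = -(y - theta)^2 / 2

def score_estimates(N):
    theta = rng.normal(theta_star, tau, size=N)    # draws from the prior N(theta_star, tau^2)
    logL = -0.5 * (y - theta) ** 2                 # log-likelihood at each draw
    W = np.exp(logL - logL.max()); W /= W.sum()    # self-normalized importance weights
    naive = (np.sum(W * theta) - theta_star) / tau ** 2
    improved = (np.sum(W * theta) - np.mean(theta)) / tau ** 2
    return naive, improved

reps = np.array([score_estimates(1000) for _ in range(500)])
print("exact score:", exact_score)
print("means      :", reps.mean(axis=0))           # both close to the score, up to an O(tau^2) bias
print("variances  :", reps.var(axis=0))            # subtracting the sample mean acts as a control variate
```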


Moment shift estimator

For the second-order derivative, we want to estimate

$$\tau^{-4}\Sigma^{-1}\left(\mathbb{V}_{\mathrm{post}}[\Theta] - \tau^2\Sigma\right)\Sigma^{-1} = \nabla^2\ell(\theta_\star) + O(\tau^2).$$

We propose:

$$\tau^{-4}\Sigma^{-1}\left[\sum_{i=1}^N W_i\left(\theta_i - \sum_{j=1}^N W_j\theta_j\right)^2 - \frac{1}{N}\sum_{i=1}^N\left(\theta_i - \frac{1}{N}\sum_{j=1}^N \theta_j\right)^2\right]\Sigma^{-1}.$$
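Continuing the toy sketch above (scalar $\theta$, so $\Sigma = 1$), a hedged illustration of this second-order estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
y, theta_star, tau = 1.3, 0.5, 0.2
exact_hessian = -1.0                                # second derivative of -(y - theta)^2 / 2

def hessian_estimate(N):
    theta = rng.normal(theta_star, tau, size=N)     # prior draws
    logL = -0.5 * (y - theta) ** 2
    W = np.exp(logL - logL.max()); W /= W.sum()
    post_var = np.sum(W * (theta - np.sum(W * theta)) ** 2)   # weighted (posterior) variance
    prior_var = np.mean((theta - np.mean(theta)) ** 2)        # empirical prior variance
    return (post_var - prior_var) / tau ** 4

print(exact_hessian, np.mean([hessian_estimate(1000) for _ in range(200)]))
```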


Moment shift estimator

We retrieve the same rates of convergence as with finite differences: e.g. if $\tau \sim N^{-1/6}$, the MSE is of order $N^{-2/3}$.

Then why bother? Why has better performance been observed in practice in some scenarios?

Different behaviour in the non-asymptotic regime.

For any function $v$ and tuning parameter $M$,

$$\mathbb{V}\left(\hat{S}_N(\theta_\star)\right) \le \tau^{-2} C_N,$$

where $C_N$ depends on neither $v$ nor $M$.


Outline

1 Context

2 General results

3 Monte Carlo

4 Hidden Markov models


Hidden Markov models

[Graph with parameter node $\theta$, latent states $X_0, X_1, X_2, \dots, X_T$ and observations $y_1, y_2, \dots, y_T$.]

Figure: Graph representation of a general hidden Markov model.


Hidden Markov models

Direct application of the previous results

1 Prior distribution $\mathcal{N}(\theta_0, \sigma^2)$ on the parameter $\theta$.

2 The derivative approximations involve $\mathbb{E}[\theta \mid Y]$ and $\mathrm{Cov}[\theta \mid Y]$.

3 Posterior moments for HMMs can be estimated by particle MCMC, SMC², ABC or your favourite method.

Ionides et al. proposed another approach.


Iterated Filtering

Modification of the model: $\theta$ is time-varying. The associated log-likelihood is

$$\ell(\theta_{1:T}) = \log p(y_{1:T}; \theta_{1:T}) = \log \int_{\mathcal{X}^{T+1}} \prod_{t=1}^T g(y_t \mid x_t, \theta_t)\; \mu(dx_1 \mid \theta_1) \prod_{t=2}^T f(dx_t \mid x_{t-1}, \theta_t).$$

Introducing $\theta \mapsto (\theta, \theta, \dots, \theta) := \theta^{[T]} \in \mathbb{R}^T$, we have

$$\ell(\theta^{[T]}) = \ell(\theta)$$

and the chain rule yields

$$\frac{d\ell(\theta)}{d\theta} = \sum_{t=1}^T \frac{\partial\ell(\theta^{[T]})}{\partial\theta_t}.$$


Iterated Filtering

Choice of prior on $\theta_{1:T}$:

$$\theta_1 = \theta_0 + V_1, \qquad V_1 \sim \tau^{-1}\kappa\left\{\tau^{-1}(\cdot)\right\},$$

$$\theta_{t+1} - \theta_0 = \rho\left(\theta_t - \theta_0\right) + V_{t+1}, \qquad V_{t+1} \sim \sigma^{-1}\kappa\left\{\sigma^{-1}(\cdot)\right\}.$$

Choose $\sigma^2$ such that $\tau^2 = \sigma^2/(1 - \rho^2)$. Covariance of the prior on $\theta_{1:T}$:

$$\Sigma_T = \tau^2\begin{pmatrix}
1 & \rho & \cdots & \cdots & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \cdots & \rho^{T-2} \\
\rho^2 & \rho & 1 & \ddots & & \rho^{T-3} \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
\rho^{T-2} & & & \ddots & 1 & \rho \\
\rho^{T-1} & \cdots & \cdots & \cdots & \rho & 1
\end{pmatrix},$$

i.e. $(\Sigma_T)_{s,t} = \tau^2\rho^{|s-t|}$.
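A small sketch (parameter values are placeholders) building $\Sigma_T$ from $\tau$, $\rho$ and $T$ using the compact form above:

```python
import numpy as np

def ar1_prior_cov(tau, rho, T):
    """Covariance of the stationary AR(1) prior on theta_{1:T}: (Sigma_T)_{s,t} = tau^2 rho^|s-t|."""
    idx = np.arange(T)
    return tau ** 2 * rho ** np.abs(idx[:, None] - idx[None, :])

Sigma_T = ar1_prior_cov(tau=0.2, rho=0.95, T=100)
print(Sigma_T.shape, Sigma_T[0, :3])   # (100, 100); first row starts tau^2 * (1, rho, rho^2)
```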


Iterated Filtering

Applying the general results for this prior yields, with $|x| = \sum_{t=1}^T |x_t|$:

$$\left|\nabla\ell(\theta_0^{[T]}) - \Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right| \le C\tau^2.$$

Moreover we have

$$\left|\sum_{t=1}^T \frac{\partial\ell(\theta^{[T]})}{\partial\theta_t} - \sum_{t=1}^T \left\{\Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right\}_t\right| \le \sum_{t=1}^T \left|\frac{\partial\ell(\theta^{[T]})}{\partial\theta_t} - \left\{\Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right\}_t\right|$$

and

$$\frac{d\ell(\theta)}{d\theta} = \sum_{t=1}^T \frac{\partial\ell(\theta^{[T]})}{\partial\theta_t}.$$


Iterated Filtering

The estimator of the score is thus given by

$$\sum_{t=1}^T \left\{\Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right\}_t,$$

which, given the form of $\Sigma_T^{-1}$, can be reduced to

$$S_{\tau,\rho,T}(\theta_0) = \frac{\tau^{-2}}{1+\rho}\left[(1-\rho)\sum_{t=2}^{T-1}\mathbb{E}\left(\theta_t \mid Y\right) - \left\{(1-\rho)T + 2\rho\right\}\theta_0 + \mathbb{E}\left(\theta_1 \mid Y\right) + \mathbb{E}\left(\theta_T \mid Y\right)\right].$$

Note that in the quantities $\mathbb{E}(\theta_t \mid Y)$, $Y = Y_{1:T}$ is the complete dataset, thus those expectations are with respect to the smoothing distribution.
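A sketch (with made-up smoothing means standing in for $\mathbb{E}[\theta_t \mid Y]$) checking numerically that this closed form matches the direct computation $\sum_t \{\Sigma_T^{-1}(m - \theta_0\mathbf{1})\}_t$:

```python
import numpy as np

tau, rho, T, theta0 = 0.2, 0.9, 50, 0.5
rng = np.random.default_rng(4)
m = theta0 + rng.normal(scale=0.1, size=T)          # placeholder smoothing means E[theta_t | Y]

idx = np.arange(T)
Sigma_T = tau ** 2 * rho ** np.abs(idx[:, None] - idx[None, :])

direct = np.sum(np.linalg.solve(Sigma_T, m - theta0))
closed = (1.0 / (tau ** 2 * (1 + rho))) * (
    (1 - rho) * np.sum(m[1:-1]) - ((1 - rho) * T + 2 * rho) * theta0 + m[0] + m[-1]
)
print(direct, closed)                                # should agree up to numerical error
```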


Iterated Filtering

If $\rho = 1$, then the parameters follow a random walk:

$$\theta_1 = \theta_0 + \mathcal{N}(0, \tau^2) \quad\text{and}\quad \theta_{t+1} = \theta_t + \mathcal{N}(0, \sigma^2).$$

In this case Ionides et al. proposed the estimator

$$S_{\tau,\sigma,T} = \tau^{-2}\left(\mathbb{E}\left(\theta_T \mid Y\right) - \theta_0\right)$$

as well as

$$S^{(\mathrm{bis})}_{\tau,\sigma,T} = \sum_{t=1}^T V_{P,t}^{-1}\left(\theta_{F,t} - \theta_{F,t-1}\right)$$

with $V_{P,t} = \mathrm{Cov}[\theta_t \mid y_{1:t-1}]$ and $\theta_{F,t} = \mathbb{E}[\theta_t \mid y_{1:t}]$.

Those expressions only involve expectations with respect to filtering distributions.


Iterated Filtering

If $\rho = 0$, then the parameters are i.i.d.:

$$\theta_1 = \theta_0 + \mathcal{N}(0, \tau^2) \quad\text{and}\quad \theta_{t+1} = \theta_0 + \mathcal{N}(0, \tau^2).$$

In this case the expression of the score estimator reduces to

$$S_{\tau,T} = \tau^{-2}\sum_{t=1}^T \left(\mathbb{E}\left(\theta_t \mid Y\right) - \theta_0\right),$$

which involves smoothing distributions.

There is only one parameter $\tau^2$ to choose for the prior. However, smoothing for general hidden Markov models is difficult, and typically resorts to “fixed lag approximations”.


Numerical results

Linear Gaussian state space model where the ground truth is available through the Kalman filter:

$$X_0 \sim \mathcal{N}(0, 1), \qquad X_t = \rho X_{t-1} + \mathcal{N}(0, V), \qquad Y_t = \eta X_t + \mathcal{N}(0, W).$$

Generate $T = 100$ observations and set $\rho = 0.9$, $V = 0.7$, $\eta = 0.9$ and $W = 0.1, 0.2, 0.4, 0.9$.

240 independent runs, matching the computational costs between methods in terms of the number of calls to the transition kernel.
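For reference, a minimal sketch simulating data from this model (the particle and Kalman machinery used for the actual experiments is not reproduced here):

```python
import numpy as np

def simulate_lgssm(T=100, rho=0.9, V=0.7, eta=0.9, W=0.2, seed=0):
    """Simulate the linear Gaussian state space model described above."""
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = rng.normal(0.0, 1.0)                              # X_0 ~ N(0, 1)
    for t in range(1, T + 1):
        x[t] = rho * x[t - 1] + rng.normal(0.0, np.sqrt(V))  # X_t | X_{t-1}
        y[t - 1] = eta * x[t] + rng.normal(0.0, np.sqrt(W))  # Y_t | X_t
    return x, y

x, y = simulate_lgssm()
print(y[:5])
```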


Numerical results

[Plots of RMSE (log scale, roughly 10 to 10000) against the tuning parameter ($h$ for Finite Difference, $\tau$ for Iterated Smoothing), for parameters 1 to 4.]

Figure: 240 runs for Iterated Smoothing and Finite Difference.


Numerical results

[Plots of RMSE (log scale, roughly 10 to 10000) against $\tau$, for parameters 1 to 4, for Iterated Smoothing, Iterated Filtering 1 and Iterated Filtering 2.]

Figure: 240 runs for Iterated Smoothing and Iterated Filtering.


Bibliography

Main references:

Inference for nonlinear dynamical systems, Ionides, Breto, King, PNAS, 2006.

Iterated filtering, Ionides, Bhadra, Atchade, King, Annals of Statistics, 2011.

Efficient iterated filtering, Lindstrom, Ionides, Frydendall, Madsen, 16th IFAC Symposium on System Identification.

Derivative-Free Estimation of the Score Vector and Observed Information Matrix, Doucet, Jacob, Rubenthaler, 2013 (on arXiv).
