Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data


Xiao-Li Meng's slides for his talks at Columbia, Sept. 2011, and ICERM, Nov. 2012


Let’s Practice What We Preach: Likelihood Methods for Monte Carlo Data

Xiao-Li Meng

Department of Statistics, Harvard University

September 24, 2011

Based on

Kong, McCullagh, Meng, Nicolae, and Tan (2003, JRSS-B, with discussions); Kong, McCullagh, Meng, and Nicolae (2006, Doksum Festschrift); Tan (2004, JASA); ...; Meng and Tan (201X)


Importance sampling (IS)

Estimand:

$$c_1 = \int_\Gamma q_1(x)\,\mu(dx) = \int_\Gamma \frac{q_1(x)}{p_2(x)}\, p_2(x)\,\mu(dx).$$

Data: $\{X_{i2},\ i = 1,\dots,n_2\} \sim p_2 = q_2/c_2$

Estimating Equation (EE):

$$r \equiv \frac{c_1}{c_2} = E_2\!\left[\frac{q_1(X)}{q_2(X)}\right].$$

The EE estimator:

$$\hat r = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}$$

Standard IS estimator for $c_1$ when $c_2 = 1$.
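Below is a minimal numerical sketch of this EE/IS estimator; the unnormalized densities $q_1$, $q_2$ and the sample size are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative unnormalized densities (assumed for this sketch, not from the talk):
def q1(x):                        # unnormalized N(0, 0.5^2): c1 = 0.5 * sqrt(2*pi)
    return np.exp(-2.0 * x**2)

def q2(x):                        # unnormalized N(0, 1):     c2 = sqrt(2*pi)
    return np.exp(-0.5 * x**2)

n2 = 100_000
x2 = rng.normal(0.0, 1.0, size=n2)        # {X_i2} ~ p2 = q2 / c2

r_hat = np.mean(q1(x2) / q2(x2))          # EE / standard IS estimator of r = c1/c2
print(f"r_hat = {r_hat:.4f}  (true r = 0.5)")
```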


What about MLE?

The “likelihood” is:

$$f(X_{12},\dots,X_{n_2 2}) = \prod_{i=1}^{n_2} p_2(X_{i2}),$$ which is free of the estimand $c_1$!

So why are $\{X_{i2},\ i = 1,\dots,n_2\}$ even relevant? Violation of the likelihood principle?

What are we “inferring”? What is the “unknown” model parameter?


Bridge sampling (BS)

Data: $\{X_{ij},\ i = 1,\dots,n_j\} \sim p_j = q_j/c_j$, $j = 1, 2$

Estimating Equation (Meng and Wong, 1996):

$$r \equiv \frac{c_1}{c_2} = \frac{E_2[\alpha(X)\, q_1(X)]}{E_1[\alpha(X)\, q_2(X)]}, \qquad \forall\, \alpha:\ 0 < \Big|\int \alpha\, q_1 q_2\, d\mu\Big| < \infty$$

Optimal choice: $\alpha_O(x) \propto [n_1 q_1(x) + n_2 r\, q_2(x)]^{-1}$

Optimal estimator $\hat r_O$, the limit of

$$\hat r_O^{(t+1)} = \frac{\dfrac{1}{n_2}\displaystyle\sum_{i=1}^{n_2} \left[\frac{q_1(X_{i2})}{s_1 q_1(X_{i2}) + s_2\, \hat r_O^{(t)} q_2(X_{i2})}\right]}{\dfrac{1}{n_1}\displaystyle\sum_{i=1}^{n_1} \left[\frac{q_2(X_{i1})}{s_1 q_1(X_{i1}) + s_2\, \hat r_O^{(t)} q_2(X_{i1})}\right]}$$
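A sketch of this iteration on toy unnormalized Gaussian densities; the densities, sample sizes, and the weight convention $s_j = n_j/(n_1+n_2)$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized densities (assumed for this sketch, not from the talk):
def q1(x): return np.exp(-2.0 * x**2)      # unnormalized N(0, 0.5^2): c1 = 0.5*sqrt(2*pi)
def q2(x): return np.exp(-0.5 * x**2)      # unnormalized N(0, 1):     c2 = sqrt(2*pi), r = 0.5

n1, n2 = 50_000, 50_000
x1 = rng.normal(0.0, 0.5, size=n1)         # {X_i1} ~ p1 = q1 / c1
x2 = rng.normal(0.0, 1.0, size=n2)         # {X_i2} ~ p2 = q2 / c2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)    # assumed convention: s_j = n_j / (n1 + n2)

r = 1.0                                    # initial guess r^(0)
for _ in range(50):                        # iterate the optimal-bridge recursion
    num = np.mean(q1(x2) / (s1 * q1(x2) + s2 * r * q2(x2)))
    den = np.mean(q2(x1) / (s1 * q1(x1) + s2 * r * q2(x1)))
    r = num / den
print(f"r_O estimate = {r:.4f}  (true r = 0.5)")
```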


What about MLE?

The “likelihood” is:

$$\prod_{j=1}^{2}\prod_{i=1}^{n_j} \frac{q_j(X_{ij})}{c_j} \propto c_1^{-n_1} c_2^{-n_2},$$ which is free of the data!

What went wrong: $c_j$ is not a “free parameter” because $c_j = \int_\Gamma q_j(x)\,\mu(dx)$ and $q_j$ is known.

So what is the “unknown” model parameter?

Turns out $\hat r_O$ is the same as Bennett’s (1976) optimal acceptance ratio estimator, as well as Geyer’s (1994) reversed logistic regression estimator.

So why is that? Can it be improved upon without any “sleight of hand”?


Pretending the measure is unknown!

Because

$$c = \int_\Gamma q(x)\,\mu(dx),$$

and $q$ is known in the sense that we can evaluate it at any sample value, the only way to make $c$ “unknown” is to assume the underlying measure $\mu$ is “unknown”.

This is natural because Monte Carlo simulation means we use samples to represent, and thus estimate/infer, the underlying population $q(x)\mu(dx)$, and hence estimate/infer $\mu$ since $q$ is known.

Monte Carlo integration is about finding a tractable discrete $\hat\mu$ to approximate the intractable $\mu$.


Importance Sampling Likelihood

Estimand: $c_1 = \int_\Gamma q_1(x)\,\mu(dx)$

Data: $\{X_{i2},\ i = 1,\dots,n_2\} \stackrel{i.i.d.}{\sim} c_2^{-1} q_2(x)\,\mu(dx)$

Likelihood for $\mu$:

$$L(\mu) = \prod_{i=1}^{n_2} c_2^{-1} q_2(X_{i2})\,\mu(X_{i2})$$

Note that $c_2$ is a functional of $\mu$.

The nonparametric MLE of $\mu$ is

$$\hat\mu(dx) = \frac{\hat P(dx)}{q_2(x)},$$ where $\hat P$ is the empirical measure.


Importance Sampling Likelihood

Thus the MLE for $r \equiv c_1/c_2$ is

$$\hat r = \int q_1(x)\,\hat\mu(dx) = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}$$

When $c_2 = 1$, $q_2 = p_2$, and the standard IS estimator for $c_1$ is obtained.

$\{X_{i2},\ i = 1,\dots,n_2\}$ is (minimal) sufficient for $\mu$ on $x \in S_2 = \{x : q_2(x) > 0\}$, and hence $\hat c_1$ is guaranteed to be consistent only when $S_1 \subset S_2$.


Bridge Sampling Likelihood

Estimand: $\propto c_j = \int_\Gamma q_j(x)\,\mu(dx)$, $j = 1,\dots,J$.

Data: $\{X_{ij},\ 1 \le i \le n_j\} \sim c_j^{-1} q_j(x)\,\mu(dx)$, $1 \le j \le J$

Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} c_j^{-1} q_j(X_{ij})\,\mu(X_{ij})$

Writing $\theta(x) = \log \mu(x)$, then

$$\log L(\mu) = n \int_\Gamma \theta(x)\, d\hat P - \sum_{j=1}^{J} n_j \log c_j(\theta),$$

where $\hat P$ is the empirical measure on $\{X_{ij},\ 1 \le i \le n_j,\ 1 \le j \le J\}$.


Bridge Sampling Likelihood

The MLE for $\mu$ is given by equating the canonical sufficient statistic $\hat P$ to its expectation:

$$n\hat P(dx) = \sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)\,\hat\mu(dx), \qquad \hat\mu(dx) = \frac{n\hat P(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)}. \tag{A}$$

Consequently, the MLE for $\{c_1,\dots,c_J\}$ must satisfy

$$\hat c_r = \int_\Gamma q_r(x)\, d\hat\mu = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q_r(x_{ij})}{\sum_{s=1}^{J} n_s\, \hat c_s^{-1} q_s(x_{ij})}. \tag{B}$$

(B) is the “dual” equation of (A), and is also the same as the equation for the optimal multiple bridge sampling estimator (Tan 2004).
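A sketch of solving equation (B) by fixed-point iteration with $J = 3$ toy samplers. Since (B) determines the $c_j$ only up to a common scale, the sketch normalizes by $\hat c_1$ and reports ratios; the densities and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy set-up (assumed for illustration): J = 3 unnormalized zero-mean Gaussians.
sigmas = np.array([0.5, 1.0, 2.0])
def q(j, x):                               # q_j(x), an unnormalized N(0, sigma_j^2)
    return np.exp(-0.5 * (x / sigmas[j])**2)

ns = np.array([20_000, 20_000, 20_000])
xs = [rng.normal(0.0, sigmas[j], size=ns[j]) for j in range(3)]
x_all = np.concatenate(xs)                 # pooled draws {x_ij}

c = np.ones(3)                             # initial guess for (c_1, ..., c_J)
for _ in range(200):                       # fixed-point iteration on equation (B)
    denom = sum(ns[s] / c[s] * q(s, x_all) for s in range(3))
    c = np.array([np.sum(q(r, x_all) / denom) for r in range(3)])
    c = c / c[0]                           # (B) fixes only ratios: normalize by c_1
print("estimated c_j / c_1:", np.round(c, 3))
print("true      c_j / c_1:", sigmas / sigmas[0])
```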


But We Can Ignore Less ...

Restrict the parameter space for $\mu$ by using some knowledge of the known $\mu$, that is, set up a sub-model.

The new MLE has a smaller asymptotic variance under the sub-model than under the full model.

Examples:

Group-invariance submodel
Linear submodel
Log-linear submodel


A Universally Improved IS

Estimand: $r = c_1/c_2$; $c_j = \int_{\mathbb{R}^d} q_j(x)\,\mu(dx)$

Data: $\{X_{i2},\ i = 1,\dots,n_2\} \stackrel{i.i.d.}{\sim} c_2^{-1} q_2\,\mu(dx)$

Taking $G = \{I_d, -I_d\}$ leads to

$$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2}) + q_1(-X_{i2})}{q_2(X_{i2}) + q_2(-X_{i2})}.$$

Because of the Rao-Blackwellization, $V(\hat r_G) \le V(\hat r)$.

Need twice as many evaluations, but typically this is a small insurance premium.

Consider $S_1 = \mathbb{R}$ & $S_2 = \mathbb{R}_+$. Then $\hat r_G$ is consistent for $r$:

$$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})} + \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(-X_{i2})}{q_2(X_{i2})}.$$

But standard IS $\hat r$ only estimates $\int_0^{\infty} q_1(x)\,\mu(dx)/c_2$.
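A sketch contrasting $\hat r$ and $\hat r_G$ in the $S_1 = \mathbb{R}$, $S_2 = \mathbb{R}_+$ situation above; the particular densities ($q_1$ a standard normal kernel on the whole line, $q_2$ an exponential kernel on the positive half-line) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy densities: q1 unnormalized N(0,1) on R; q2 unnormalized Exp(1) on R_+.
def q1(x): return np.exp(-0.5 * x**2)                   # c1 = sqrt(2*pi)
def q2(x): return np.where(x > 0, np.exp(-x), 0.0)      # c2 = 1, support S2 = R_+

n2 = 200_000
x2 = rng.exponential(1.0, size=n2)                      # draws from p2 = q2 / c2

r_plain = np.mean(q1(x2) / q2(x2))                      # standard IS: sees only x > 0
r_group = np.mean((q1(x2) + q1(-x2)) / (q2(x2) + q2(-x2)))   # G = {I_d, -I_d}

print(f"standard IS   : {r_plain:.4f}  (targets int_0^inf q1 dmu / c2 ~ 1.2533)")
print(f"symmetrized IS: {r_group:.4f}  (consistent for r = sqrt(2*pi) ~ 2.5066)")
```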


There are many more improvements ...

Define a sub-model by requiring $\mu$ to be $G$-invariant, where $G$ is a finite group on $\Gamma$.

The new MLE of $\mu$ is

$$\hat\mu_G(dx) = \frac{n\hat P_G(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j^{\,G}(x)},$$

where $\hat P_G(A) = \mathrm{ave}_{g \in G}\, \hat P(gA)$ and $q_j^{\,G}(x) = \mathrm{ave}_{g \in G}\, q_j(gx)$.

When the draws are i.i.d. within each $p_s\,d\mu$,

$$\hat\mu_G = E[\hat\mu \mid GX],$$

i.e., the Rao-Blackwellization of $\hat\mu$ given the orbit.

Consequently,

$$\hat c_j^{\,G} = \int_\Gamma q_j(x)\,\hat\mu_G(dx) = E[\hat c_j \mid GX].$$


Using Groups to model the trade-off

If $G_1 \supseteq G_2$, then

$$\mathrm{Var}\big(\hat{\vec c}^{\,G_1}\big) \le \mathrm{Var}\big(\hat{\vec c}^{\,G_2}\big).$$

The statistical efficiency increases with the size of $G_i$, but so does the computational cost needed for function evaluation (but not for sampling, because no additional samples are involved).


Linear submodel: stratified sampling (Tan 2004)

Data: $\{X_{ij},\ 1 \le i \le n_j\} \stackrel{i.i.d.}{\sim} p_j(x)\,\mu(dx)$, $1 \le j \le J$.

The sub-model has parameter space

$$\Big\{\mu : \int_\Gamma p_j(x)\,\mu(dx),\ 1 \le j \le J, \text{ are equal (to 1)}\Big\}$$

Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} p_j(X_{ij})\,\mu(X_{ij})$

The MLE is

$$\hat\mu_{\mathrm{lin}}(dx) = \frac{\hat P(dx)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x)},$$

where the $\hat\pi_j$ are MLEs from a mixture model:

the data $\stackrel{i.i.d.}{\sim} \sum_{j=1}^{J} \pi_j\, p_j(\cdot)$ with the $\pi_j$ unknown
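A sketch of the two-step recipe above: estimate the mixture proportions $\hat\pi_j$ by EM on the pooled, label-free data, then read off $\hat\mu_{\mathrm{lin}}$. The normalized samplers and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def npdf(x, s):                            # normalized N(0, s^2) density
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))

# Assumed normalized samplers p_1 = N(0,1), p_2 = N(0,2^2), equal sample sizes.
ps = [lambda x: npdf(x, 1.0), lambda x: npdf(x, 2.0)]
xs = np.concatenate([rng.normal(0, 1, 5000), rng.normal(0, 2, 5000)])   # pooled draws

P = np.column_stack([p(xs) for p in ps])   # n x J matrix of p_j(x_i)
pi = np.full(2, 0.5)                       # initial mixture proportions
for _ in range(200):                       # EM for the label-free mixture weights
    w = P * pi                             # E-step: unnormalized responsibilities
    w /= w.sum(axis=1, keepdims=True)
    pi = w.mean(axis=0)                    # M-step: updated proportions
print("pi_hat =", np.round(pi, 3))

# mu_lin puts mass (1/n) / sum_j pi_hat_j p_j(x_i) on each pooled x_i, so for any
# integrand q:  c_hat = mean( q(x_i) / sum_j pi_hat_j p_j(x_i) ).
```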


So why MLE?

Goal: to estimate $c = \int_\Gamma q(x)\,\mu(dx)$.

For an arbitrary vector $b$, consider the control-variate estimator (Owen and Zhou 2000)

$$\hat c_b \equiv \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q(x_{ji}) - b^\top g(x_{ji})}{\sum_{s=1}^{J} n_s\, p_s(x_{ji})},$$

where $g = (p_2 - p_1, \dots, p_J - p_1)^\top$.

A more general class: for $\sum_{j=1}^{J} \lambda_j(x) \equiv 1$ and $\sum_{j=1}^{J} \lambda_j(x) b_j(x) \equiv b$, consider (Veach and Guibas 1995 for $b_j \equiv 0$; Tan, 2004)

$$\hat c_{\lambda,B} = \sum_{j=1}^{J} \frac{1}{n_j}\sum_{i=1}^{n_j} \frac{\lambda_j(x_{ji})\, q(x_{ji}) - b_j^\top(x_{ji})\, g(x_{ji})}{p_j(x_{ji})}$$

Should $\hat c_{\lambda,B}$ be more efficient than $\hat c_b$? Could there be something even more efficient?
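A sketch of $\hat c_b$ with $J = 2$ and a scalar control variate $g = p_2 - p_1$; the samplers, the integrand, and the trial values of $b$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def npdf(x, s):                            # normalized N(0, s^2) density
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))

# Assumed toy set-up with J = 2 normalized samplers and an integrand q with c = 1.
p1 = lambda x: npdf(x, 1.0)
p2 = lambda x: npdf(x, 3.0)
q  = lambda x: 0.8 * npdf(x, 1.0) + 0.2 * npdf(x, 2.0)

n1 = n2 = 5000
x = np.concatenate([rng.normal(0, 1, n1), rng.normal(0, 3, n2)])   # pooled draws

g = p2(x) - p1(x)                          # control variate: integrates to 0
denom = n1 * p1(x) + n2 * p2(x)
for b in (0.0, 0.5, 1.0):                  # any fixed b gives a consistent estimator;
    c_b = np.sum((q(x) - b * g) / denom)   # b only changes the variance
    print(f"b = {b}: c_b = {c_b:.4f}  (true c = 1)")
```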


Three estimators for $c = \int_\Gamma q(x)\,\mu(dx)$:

IS:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$

where $\pi_j = n_j/n$ are the true proportions.

Reg:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i) - \hat\beta^\top g(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$

where $\hat\beta$ is the estimated regression coefficient, ignoring stratification.

Lik:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x_i)},$$

where the $\hat\pi_j$ are the estimated proportions, ignoring stratification.

Which one is most efficient? Least efficient?
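A sketch computing all three estimators on the same pooled draws, using a toy one-dimensional analogue of the experiment on the next slide; the samplers and integrand are assumptions, and $\hat\beta$ is taken here to be the simple least-squares slope.

```python
import numpy as np

rng = np.random.default_rng(5)

def npdf(x, s):                            # normalized N(0, s^2) density
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))

def q(x):                                  # integrand; its integral is c = 1
    return 0.8 * npdf(x, 1.0) + 0.2 * npdf(x, 2.0)

p = [lambda x: npdf(x, 1.0), lambda x: npdf(x, 3.0)]        # two assumed samplers
n1 = n2 = 2500
xs = np.concatenate([rng.normal(0, 1, n1), rng.normal(0, 3, n2)])
pi_true = np.array([n1, n2]) / (n1 + n2)

P = np.column_stack([pj(xs) for pj in p])                   # p_j(x_i)
mix_true = P @ pi_true

c_is = np.mean(q(xs) / mix_true)                            # IS: true proportions

y = q(xs) / mix_true                                        # Reg: one control variate,
g = (P[:, 1] - P[:, 0]) / mix_true                          #   g = (p2 - p1)/mixture
beta = np.cov(y, g, bias=True)[0, 1] / np.var(g)            #   least-squares slope
c_reg = np.mean(y - beta * g)

pi = np.array([0.5, 0.5])                                   # Lik: EM, labels ignored
for _ in range(200):
    w = P * pi
    w /= w.sum(axis=1, keepdims=True)
    pi = w.mean(axis=0)
c_lik = np.mean(q(xs) / (P @ pi))

print(f"IS = {c_is:.4f}   Reg = {c_reg:.4f}   Lik = {c_lik:.4f}   (true c = 1)")
```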


Let’s find out ...

$\Gamma = \mathbb{R}^{10}$ and $\mu$ is Lebesgue measure.

The integrand is

$$q(x) = 0.8 \prod_{j=1}^{10} \phi(x_j) + 0.2 \prod_{j=1}^{10} \psi(x_j; 4),$$

where $\phi(\cdot)$ is the standard normal density and $\psi(\cdot; 4)$ is the $t_4$ density.

Two sampling designs:

(i) $q_2(x)$ with $n$ draws, or
(ii) $q_1(x)$ and $q_2(x)$ each with $n/2$ draws,

where

$$q_1(x) = \prod_{j=1}^{10} \phi(x_j), \qquad q_2(x) = \prod_{j=1}^{10} \psi(x_j; 1)$$
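A sketch of this design in code: the integrand and the two samplers follow the slide, while the single-replication layout and the pooled-mixture estimator under design (ii) are choices made for illustration.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(6)
d = 10

def norm_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def t_pdf(x, df):                           # Student-t density with df degrees of freedom
    const = gamma((df + 1) / 2) / (gamma(df / 2) * np.sqrt(df * np.pi))
    return const * (1.0 + x**2 / df) ** (-(df + 1) / 2)

def q(x):   return 0.8 * norm_pdf(x).prod(axis=1) + 0.2 * t_pdf(x, 4).prod(axis=1)
def q1(x):  return norm_pdf(x).prod(axis=1)
def q2(x):  return t_pdf(x, 1).prod(axis=1)          # t_1 is the Cauchy density

n = 500
x_i  = rng.standard_t(1, size=(n, d))                        # design (i): q2 only
x_ii = np.vstack([rng.normal(size=(n // 2, d)),
                  rng.standard_t(1, size=(n // 2, d))])      # design (ii): half and half

print("design (i),  plain IS :", np.mean(q(x_i) / q2(x_i)))                     # c = 1
print("design (ii), pooled IS:", np.mean(q(x_ii) / (0.5 * q1(x_ii) + 0.5 * q2(x_ii))))
```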


A little surprise?

Table: Comparison of design and estimator

                       one sampler                      two samplers
            IS         Reg        Lik        IS         Reg        Lik
Sqrt MSE    .162       .00942     .00931     .0175      .00881     .00881
Std Err     .162       .00919     .00920     .0174      .00885     .00884

Note: Sqrt MSE is the square root of the mean squared error of the point estimates, and Std Err is the square root of the mean of the variance estimates, from 10000 repeated simulations of size n = 500.


Comparison of efficiency:

Statistical efficiency: IS < Reg ≈ Lik

IS is a stratified estimator, which uses only the labels.

Reg is the conventional method of control variates.

Lik is the constrained MLE, which uses the $p_j$’s but ignores the labels; it is exact if $q = p_j$ for any particular $j$.


Building intuition ...

Suppose we make $n = 2$ draws, one from $N(0, 1)$ and one from Cauchy$(0, 1)$, hence $\pi_1 = \pi_2 = 50\%$.

Suppose the draws are $\{1, 1\}$; what would be the MLE $(\hat\pi_1, \hat\pi_2)$?

Suppose the draws are $\{1, 3\}$; what would be the MLE $(\hat\pi_1, \hat\pi_2)$?

Suppose the draws are $\{3, 3\}$; what would be the MLE $(\hat\pi_1, \hat\pi_2)$?
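A sketch that answers the three questions numerically by maximizing the label-free mixture likelihood over $\pi_1$; the grid search is just a simple illustrative device.

```python
import numpy as np

def normal_pdf(x): return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
def cauchy_pdf(x): return 1.0 / (np.pi * (1.0 + x**2))

def pi1_mle(draws):
    """Label-free MLE of pi_1 for the two-component N(0,1)/Cauchy(0,1) mixture."""
    grid = np.linspace(0.0, 1.0, 100_001)
    loglik = sum(np.log(grid * normal_pdf(x) + (1 - grid) * cauchy_pdf(x))
                 for x in draws)
    return grid[np.argmax(loglik)]

for draws in ([1, 1], [1, 3], [3, 3]):
    p1 = pi1_mle(draws)
    print(f"draws {draws}: (pi1_hat, pi2_hat) = ({p1:.3f}, {1 - p1:.3f})")
```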


What Did I Learn?

Model what we ignore, not what we know!

Model comparison/selection is not about which model is true (as all of them are “true”), but about which model represents a better compromise among human, computational, and statistical efficiency.

There is a cure for our “schizophrenia”: we can now analyze Monte Carlo data using the same sound statistical principles and methods that we use for analyzing real data.


If you are looking for theoretical research topics ...

RE-EXAMINE OLD ONES AND DERIVE NEW ONES!

Prove it is the MLE, or a good approximation to the MLE.

Or derive the MLE or a cost-effective approximation to it.

Markov chain Monte Carlo (Tan 2006, 2008)

More ...
