Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data


Xiao-Li Meng's slides for his talks at Columbia, Sept. 2011, and ICERM, Nov. 2012


Let’s Practice What We Preach: Likelihood Methods for Monte Carlo Data

Xiao-Li Meng

Department of Statistics, Harvard University

September 24, 2011

Based on

Kong, McCullagh, Meng, Nicolae, and Tan (2003, JRSS-B, with discussions); Kong, McCullagh, Meng, and Nicolae (2006, Doksum Festschrift); Tan (2004, JASA); ...; Meng and Tan (201X)


Importance sampling (IS)

Estimand:

$$c_1 = \int_\Gamma q_1(x)\,\mu(dx) = \int_\Gamma \frac{q_1(x)}{p_2(x)}\, p_2(x)\,\mu(dx).$$

Data: $\{X_{i2},\ i = 1,\dots,n_2\} \sim p_2 = q_2/c_2$

Estimating Equation (EE):

$$r \equiv \frac{c_1}{c_2} = E_2\!\left[\frac{q_1(X)}{q_2(X)}\right].$$

The EE estimator:

$$\hat r = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}$$

Standard IS estimator for $c_1$ when $c_2 = 1$.
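Below is a minimal numerical sketch of this EE/IS estimator; the unnormalized densities $q_1$, $q_2$ and the sample size are illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative unnormalized densities (assumed for this sketch, not from the talk):
def q1(x):                        # unnormalized N(0, 0.5^2): c1 = 0.5 * sqrt(2*pi)
    return np.exp(-2.0 * x**2)

def q2(x):                        # unnormalized N(0, 1):     c2 = sqrt(2*pi)
    return np.exp(-0.5 * x**2)

n2 = 100_000
x2 = rng.normal(0.0, 1.0, size=n2)        # {X_i2} ~ p2 = q2 / c2

r_hat = np.mean(q1(x2) / q2(x2))          # EE / standard IS estimator of r = c1/c2
print(f"r_hat = {r_hat:.4f}  (true r = 0.5)")
```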


What about MLE?

The “likelihood” is:

$$f(X_{12},\dots,X_{n_2 2}) = \prod_{i=1}^{n_2} p_2(X_{i2}),$$ which is free of the estimand $c_1$!

So why are $\{X_{i2},\ i = 1,\dots,n_2\}$ even relevant? Violation of the likelihood principle?

What are we “inferring”? What is the “unknown” model parameter?


Bridge sampling (BS)

Data: $\{X_{ij},\ i = 1,\dots,n_j\} \sim p_j = q_j/c_j$, $j = 1, 2$

Estimating Equation (Meng and Wong, 1996):

$$r \equiv \frac{c_1}{c_2} = \frac{E_2[\alpha(X)\, q_1(X)]}{E_1[\alpha(X)\, q_2(X)]}, \qquad \forall\, \alpha:\ 0 < \Big|\int \alpha\, q_1 q_2\, d\mu\Big| < \infty$$

Optimal choice: $\alpha_O(x) \propto [n_1 q_1(x) + n_2 r\, q_2(x)]^{-1}$

Optimal estimator $\hat r_O$, the limit of

$$\hat r_O^{(t+1)} = \frac{\dfrac{1}{n_2}\displaystyle\sum_{i=1}^{n_2} \left[\frac{q_1(X_{i2})}{s_1 q_1(X_{i2}) + s_2\, \hat r_O^{(t)} q_2(X_{i2})}\right]}{\dfrac{1}{n_1}\displaystyle\sum_{i=1}^{n_1} \left[\frac{q_2(X_{i1})}{s_1 q_1(X_{i1}) + s_2\, \hat r_O^{(t)} q_2(X_{i1})}\right]}$$
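A sketch of this iteration on toy unnormalized Gaussian densities; the densities, sample sizes, and the weight convention $s_j = n_j/(n_1+n_2)$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized densities (assumed for this sketch, not from the talk):
def q1(x): return np.exp(-2.0 * x**2)      # unnormalized N(0, 0.5^2): c1 = 0.5*sqrt(2*pi)
def q2(x): return np.exp(-0.5 * x**2)      # unnormalized N(0, 1):     c2 = sqrt(2*pi), r = 0.5

n1, n2 = 50_000, 50_000
x1 = rng.normal(0.0, 0.5, size=n1)         # {X_i1} ~ p1 = q1 / c1
x2 = rng.normal(0.0, 1.0, size=n2)         # {X_i2} ~ p2 = q2 / c2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)    # assumed convention: s_j = n_j / (n1 + n2)

r = 1.0                                    # initial guess r^(0)
for _ in range(50):                        # iterate the optimal-bridge recursion
    num = np.mean(q1(x2) / (s1 * q1(x2) + s2 * r * q2(x2)))
    den = np.mean(q2(x1) / (s1 * q1(x1) + s2 * r * q2(x1)))
    r = num / den
print(f"r_O estimate = {r:.4f}  (true r = 0.5)")
```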


What about MLE?

The “likelihood” is:

$$\prod_{j=1}^{2}\prod_{i=1}^{n_j} \frac{q_j(X_{ij})}{c_j} \propto c_1^{-n_1} c_2^{-n_2},$$ which is free of the data!

What went wrong: $c_j$ is not a “free parameter” because $c_j = \int_\Gamma q_j(x)\,\mu(dx)$ and $q_j$ is known.

So what is the “unknown” model parameter?

Turns out $\hat r_O$ is the same as Bennett’s (1976) optimal acceptance ratio estimator, as well as Geyer’s (1994) reversed logistic regression estimator.

So why is that? Can it be improved upon without any “sleight of hand”?


Pretending the measure is unknown!

Because

$$c = \int_\Gamma q(x)\,\mu(dx),$$

and $q$ is known in the sense that we can evaluate it at any sample value, the only way to make $c$ “unknown” is to assume the underlying measure $\mu$ is “unknown”.

This is natural because Monte Carlo simulation means we use samples to represent, and thus estimate/infer, the underlying population $q(x)\mu(dx)$, and hence estimate/infer $\mu$ since $q$ is known.

Monte Carlo integration is about finding a tractable discrete $\hat\mu$ to approximate the intractable $\mu$.


Importance Sampling Likelihood

Estimand: $c_1 = \int_\Gamma q_1(x)\,\mu(dx)$

Data: $\{X_{i2},\ i = 1,\dots,n_2\} \stackrel{i.i.d.}{\sim} c_2^{-1} q_2(x)\,\mu(dx)$

Likelihood for $\mu$:

$$L(\mu) = \prod_{i=1}^{n_2} c_2^{-1} q_2(X_{i2})\,\mu(X_{i2})$$

Note that $c_2$ is a functional of $\mu$.

The nonparametric MLE of $\mu$ is

$$\hat\mu(dx) = \frac{\hat P(dx)}{q_2(x)},$$ where $\hat P$ is the empirical measure.


Importance Sampling Likelihood

Thus the MLE for $r \equiv c_1/c_2$ is

$$\hat r = \int q_1(x)\,\hat\mu(dx) = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})}$$

When $c_2 = 1$, $q_2 = p_2$, and the standard IS estimator for $c_1$ is obtained.

$\{X_{i2},\ i = 1,\dots,n_2\}$ is (minimal) sufficient for $\mu$ on $x \in S_2 = \{x : q_2(x) > 0\}$, and hence $\hat c_1$ is guaranteed to be consistent only when $S_1 \subset S_2$.


Bridge Sampling Likelihood

Estimand: $\propto c_j = \int_\Gamma q_j(x)\,\mu(dx)$, $j = 1,\dots,J$.

Data: $\{X_{ij},\ 1 \le i \le n_j\} \sim c_j^{-1} q_j(x)\,\mu(dx)$, $1 \le j \le J$

Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} c_j^{-1} q_j(X_{ij})\,\mu(X_{ij})$

Writing $\theta(x) = \log \mu(x)$, then

$$\log L(\mu) = n \int_\Gamma \theta(x)\, d\hat P - \sum_{j=1}^{J} n_j \log c_j(\theta),$$

where $\hat P$ is the empirical measure on $\{X_{ij},\ 1 \le i \le n_j,\ 1 \le j \le J\}$.


Bridge Sampling Likelihood

The MLE for $\mu$ is given by equating the canonical sufficient statistic $\hat P$ to its expectation:

$$n\hat P(dx) = \sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)\,\hat\mu(dx), \qquad \hat\mu(dx) = \frac{n\hat P(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)}. \tag{A}$$

Consequently, the MLE for $\{c_1,\dots,c_J\}$ must satisfy

$$\hat c_r = \int_\Gamma q_r(x)\, d\hat\mu = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q_r(x_{ij})}{\sum_{s=1}^{J} n_s\, \hat c_s^{-1} q_s(x_{ij})}. \tag{B}$$

(B) is the “dual” equation of (A), and is also the same as the equation for the optimal multiple bridge sampling estimator (Tan 2004).
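A sketch of solving equation (B) by fixed-point iteration with $J = 3$ toy samplers. Since (B) determines the $c_j$ only up to a common scale, the sketch normalizes by $\hat c_1$ and reports ratios; the densities and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy set-up (assumed for illustration): J = 3 unnormalized zero-mean Gaussians.
sigmas = np.array([0.5, 1.0, 2.0])
def q(j, x):                               # q_j(x), an unnormalized N(0, sigma_j^2)
    return np.exp(-0.5 * (x / sigmas[j])**2)

ns = np.array([20_000, 20_000, 20_000])
xs = [rng.normal(0.0, sigmas[j], size=ns[j]) for j in range(3)]
x_all = np.concatenate(xs)                 # pooled draws {x_ij}

c = np.ones(3)                             # initial guess for (c_1, ..., c_J)
for _ in range(200):                       # fixed-point iteration on equation (B)
    denom = sum(ns[s] / c[s] * q(s, x_all) for s in range(3))
    c = np.array([np.sum(q(r, x_all) / denom) for r in range(3)])
    c = c / c[0]                           # (B) fixes only ratios: normalize by c_1
print("estimated c_j / c_1:", np.round(c, 3))
print("true      c_j / c_1:", sigmas / sigmas[0])
```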


But We Can Ignore Less ...

Restrict the parameter space for $\mu$ by using some knowledge of the known $\mu$, that is, set up a sub-model.

The new MLE has a smaller asymptotic variance under the sub-model than under the full model.

Examples:

Group-invariance submodel
Linear submodel
Log-linear submodel


A Universally Improved IS

Estimand: $r = c_1/c_2$; $c_j = \int_{\mathbb{R}^d} q_j(x)\,\mu(dx)$

Data: $\{X_{i2},\ i = 1,\dots,n_2\} \stackrel{i.i.d.}{\sim} c_2^{-1} q_2\,\mu(dx)$

Taking $G = \{I_d, -I_d\}$ leads to

$$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2}) + q_1(-X_{i2})}{q_2(X_{i2}) + q_2(-X_{i2})}.$$

Because of the Rao-Blackwellization, $V(\hat r_G) \le V(\hat r)$.

Need twice as many evaluations, but typically this is a small insurance premium.

Consider $S_1 = \mathbb{R}$ & $S_2 = \mathbb{R}_+$. Then $\hat r_G$ is consistent for $r$:

$$\hat r_G = \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{q_2(X_{i2})} + \frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(-X_{i2})}{q_2(X_{i2})}.$$

But standard IS $\hat r$ only estimates $\int_0^{\infty} q_1(x)\,\mu(dx)/c_2$.
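A sketch contrasting $\hat r$ and $\hat r_G$ in the $S_1 = \mathbb{R}$, $S_2 = \mathbb{R}_+$ situation above; the particular densities ($q_1$ a standard normal kernel on the whole line, $q_2$ an exponential kernel on the positive half-line) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy densities: q1 unnormalized N(0,1) on R; q2 unnormalized Exp(1) on R_+.
def q1(x): return np.exp(-0.5 * x**2)                   # c1 = sqrt(2*pi)
def q2(x): return np.where(x > 0, np.exp(-x), 0.0)      # c2 = 1, support S2 = R_+

n2 = 200_000
x2 = rng.exponential(1.0, size=n2)                      # draws from p2 = q2 / c2

r_plain = np.mean(q1(x2) / q2(x2))                      # standard IS: sees only x > 0
r_group = np.mean((q1(x2) + q1(-x2)) / (q2(x2) + q2(-x2)))   # G = {I_d, -I_d}

print(f"standard IS   : {r_plain:.4f}  (targets int_0^inf q1 dmu / c2 ~ 1.2533)")
print(f"symmetrized IS: {r_group:.4f}  (consistent for r = sqrt(2*pi) ~ 2.5066)")
```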


There are many more improvements ...

Define a sub-model by requiring $\mu$ to be $G$-invariant, where $G$ is a finite group on $\Gamma$.

The new MLE of $\mu$ is

$$\hat\mu_G(dx) = \frac{n\hat P_G(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j^{\,G}(x)},$$

where $\hat P_G(A) = \mathrm{ave}_{g \in G}\, \hat P(gA)$ and $q_j^{\,G}(x) = \mathrm{ave}_{g \in G}\, q_j(gx)$.

When the draws are i.i.d. within each $p_s\,d\mu$,

$$\hat\mu_G = E[\hat\mu \mid GX],$$

i.e., the Rao-Blackwellization of $\hat\mu$ given the orbit.

Consequently,

$$\hat c_j^{\,G} = \int_\Gamma q_j(x)\,\hat\mu_G(dx) = E[\hat c_j \mid GX].$$


Using Groups to model the trade-off

If $G_1 \supseteq G_2$, then

$$\mathrm{Var}\big(\hat{\vec c}^{\,G_1}\big) \le \mathrm{Var}\big(\hat{\vec c}^{\,G_2}\big).$$

The statistical efficiency increases with the size of $G_i$, but so does the computational cost needed for function evaluation (but not for sampling, because no additional samples are involved).


Linear submodel: stratified sampling (Tan 2004)

Data: $\{X_{ij},\ 1 \le i \le n_j\} \stackrel{i.i.d.}{\sim} p_j(x)\,\mu(dx)$, $1 \le j \le J$.

The sub-model has parameter space

$$\Big\{\mu : \int_\Gamma p_j(x)\,\mu(dx),\ 1 \le j \le J, \text{ are equal (to 1)}\Big\}$$

Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} p_j(X_{ij})\,\mu(X_{ij})$

The MLE is

$$\hat\mu_{\mathrm{lin}}(dx) = \frac{\hat P(dx)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x)},$$

where the $\hat\pi_j$ are MLEs from a mixture model:

the data $\stackrel{i.i.d.}{\sim} \sum_{j=1}^{J} \pi_j\, p_j(\cdot)$ with the $\pi_j$ unknown
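A sketch of the two-step recipe above: estimate the mixture proportions $\hat\pi_j$ by EM on the pooled, label-free data, then read off $\hat\mu_{\mathrm{lin}}$. The normalized samplers and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def npdf(x, s):                            # normalized N(0, s^2) density
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))

# Assumed normalized samplers p_1 = N(0,1), p_2 = N(0,2^2), equal sample sizes.
ps = [lambda x: npdf(x, 1.0), lambda x: npdf(x, 2.0)]
xs = np.concatenate([rng.normal(0, 1, 5000), rng.normal(0, 2, 5000)])   # pooled draws

P = np.column_stack([p(xs) for p in ps])   # n x J matrix of p_j(x_i)
pi = np.full(2, 0.5)                       # initial mixture proportions
for _ in range(200):                       # EM for the label-free mixture weights
    w = P * pi                             # E-step: unnormalized responsibilities
    w /= w.sum(axis=1, keepdims=True)
    pi = w.mean(axis=0)                    # M-step: updated proportions
print("pi_hat =", np.round(pi, 3))

# mu_lin puts mass (1/n) / sum_j pi_hat_j p_j(x_i) on each pooled x_i, so for any
# integrand q:  c_hat = mean( q(x_i) / sum_j pi_hat_j p_j(x_i) ).
```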


So why MLE?

Goal: to estimate $c = \int_\Gamma q(x)\,\mu(dx)$.

For an arbitrary vector $b$, consider the control-variate estimator (Owen and Zhou 2000)

$$\hat c_b \equiv \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{q(x_{ji}) - b^\top g(x_{ji})}{\sum_{s=1}^{J} n_s\, p_s(x_{ji})},$$

where $g = (p_2 - p_1, \dots, p_J - p_1)^\top$.

A more general class: for $\sum_{j=1}^{J} \lambda_j(x) \equiv 1$ and $\sum_{j=1}^{J} \lambda_j(x) b_j(x) \equiv b$, consider (Veach and Guibas 1995 for $b_j \equiv 0$; Tan, 2004)

$$\hat c_{\lambda,B} = \sum_{j=1}^{J} \frac{1}{n_j}\sum_{i=1}^{n_j} \frac{\lambda_j(x_{ji})\, q(x_{ji}) - b_j^\top(x_{ji})\, g(x_{ji})}{p_j(x_{ji})}$$

Should $\hat c_{\lambda,B}$ be more efficient than $\hat c_b$? Could there be something even more efficient?
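A sketch of $\hat c_b$ with $J = 2$ and a scalar control variate $g = p_2 - p_1$; the samplers, the integrand, and the trial values of $b$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def npdf(x, s):                            # normalized N(0, s^2) density
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))

# Assumed toy set-up with J = 2 normalized samplers and an integrand q with c = 1.
p1 = lambda x: npdf(x, 1.0)
p2 = lambda x: npdf(x, 3.0)
q  = lambda x: 0.8 * npdf(x, 1.0) + 0.2 * npdf(x, 2.0)

n1 = n2 = 5000
x = np.concatenate([rng.normal(0, 1, n1), rng.normal(0, 3, n2)])   # pooled draws

g = p2(x) - p1(x)                          # control variate: integrates to 0
denom = n1 * p1(x) + n2 * p2(x)
for b in (0.0, 0.5, 1.0):                  # any fixed b gives a consistent estimator;
    c_b = np.sum((q(x) - b * g) / denom)   # b only changes the variance
    print(f"b = {b}: c_b = {c_b:.4f}  (true c = 1)")
```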


Three estimators for $c = \int_\Gamma q(x)\,\mu(dx)$:

IS:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$

where $\pi_j = n_j/n$ are the true proportions.

Reg:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i) - \hat\beta^\top g(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)},$$

where $\hat\beta$ is the estimated regression coefficient, ignoring stratification.

Lik:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{q(x_i)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x_i)},$$

where the $\hat\pi_j$ are the estimated proportions, ignoring stratification.

Which one is most efficient? Least efficient?
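A sketch computing all three estimators on the same pooled draws, using a toy one-dimensional analogue of the experiment on the next slide; the samplers and integrand are assumptions, and $\hat\beta$ is taken here to be the simple least-squares slope.

```python
import numpy as np

rng = np.random.default_rng(5)

def npdf(x, s):                            # normalized N(0, s^2) density
    return np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))

def q(x):                                  # integrand; its integral is c = 1
    return 0.8 * npdf(x, 1.0) + 0.2 * npdf(x, 2.0)

p = [lambda x: npdf(x, 1.0), lambda x: npdf(x, 3.0)]        # two assumed samplers
n1 = n2 = 2500
xs = np.concatenate([rng.normal(0, 1, n1), rng.normal(0, 3, n2)])
pi_true = np.array([n1, n2]) / (n1 + n2)

P = np.column_stack([pj(xs) for pj in p])                   # p_j(x_i)
mix_true = P @ pi_true

c_is = np.mean(q(xs) / mix_true)                            # IS: true proportions

y = q(xs) / mix_true                                        # Reg: one control variate,
g = (P[:, 1] - P[:, 0]) / mix_true                          #   g = (p2 - p1)/mixture
beta = np.cov(y, g, bias=True)[0, 1] / np.var(g)            #   least-squares slope
c_reg = np.mean(y - beta * g)

pi = np.array([0.5, 0.5])                                   # Lik: EM, labels ignored
for _ in range(200):
    w = P * pi
    w /= w.sum(axis=1, keepdims=True)
    pi = w.mean(axis=0)
c_lik = np.mean(q(xs) / (P @ pi))

print(f"IS = {c_is:.4f}   Reg = {c_reg:.4f}   Lik = {c_lik:.4f}   (true c = 1)")
```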


Let’s find out ...

$\Gamma = \mathbb{R}^{10}$ and $\mu$ is Lebesgue measure.

The integrand is

$$q(x) = 0.8 \prod_{j=1}^{10} \phi(x_j) + 0.2 \prod_{j=1}^{10} \psi(x_j; 4),$$

where $\phi(\cdot)$ is the standard normal density and $\psi(\cdot; 4)$ is the $t_4$ density.

Two sampling designs:

(i) $q_2(x)$ with $n$ draws, or
(ii) $q_1(x)$ and $q_2(x)$ each with $n/2$ draws,

where

$$q_1(x) = \prod_{j=1}^{10} \phi(x_j), \qquad q_2(x) = \prod_{j=1}^{10} \psi(x_j; 1)$$
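A sketch of this design in code: the integrand and the two samplers follow the slide, while the single-replication layout and the pooled-mixture estimator under design (ii) are choices made for illustration.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(6)
d = 10

def norm_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def t_pdf(x, df):                           # Student-t density with df degrees of freedom
    const = gamma((df + 1) / 2) / (gamma(df / 2) * np.sqrt(df * np.pi))
    return const * (1.0 + x**2 / df) ** (-(df + 1) / 2)

def q(x):   return 0.8 * norm_pdf(x).prod(axis=1) + 0.2 * t_pdf(x, 4).prod(axis=1)
def q1(x):  return norm_pdf(x).prod(axis=1)
def q2(x):  return t_pdf(x, 1).prod(axis=1)          # t_1 is the Cauchy density

n = 500
x_i  = rng.standard_t(1, size=(n, d))                        # design (i): q2 only
x_ii = np.vstack([rng.normal(size=(n // 2, d)),
                  rng.standard_t(1, size=(n // 2, d))])      # design (ii): half and half

print("design (i),  plain IS :", np.mean(q(x_i) / q2(x_i)))                     # c = 1
print("design (ii), pooled IS:", np.mean(q(x_ii) / (0.5 * q1(x_ii) + 0.5 * q2(x_ii))))
```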


A little surprise?

Table: Comparison of design and estimator

                       one sampler                      two samplers
            IS         Reg        Lik        IS         Reg        Lik
Sqrt MSE    .162       .00942     .00931     .0175      .00881     .00881
Std Err     .162       .00919     .00920     .0174      .00885     .00884

Note: Sqrt MSE is the square root of the mean squared error of the point estimates, and Std Err is the square root of the mean of the variance estimates, from 10000 repeated simulations of size n = 500.


Comparison of efficiency:

Statistical efficiency: IS < Reg ≈ Lik

IS is a stratified estimator, which uses only the labels.

Reg is the conventional method of control variates.

Lik is the constrained MLE, which uses the $p_j$’s but ignores the labels; it is exact if $q = p_j$ for any particular $j$.


Building intuition ...

Suppose we make $n = 2$ draws, one from $N(0, 1)$ and one from Cauchy$(0, 1)$, hence $\pi_1 = \pi_2 = 50\%$.

Suppose the draws are $\{1, 1\}$; what would be the MLE $(\hat\pi_1, \hat\pi_2)$?

Suppose the draws are $\{1, 3\}$; what would be the MLE $(\hat\pi_1, \hat\pi_2)$?

Suppose the draws are $\{3, 3\}$; what would be the MLE $(\hat\pi_1, \hat\pi_2)$?
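A sketch that answers the three questions numerically by maximizing the label-free mixture likelihood over $\pi_1$; the grid search is just a simple illustrative device.

```python
import numpy as np

def normal_pdf(x): return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
def cauchy_pdf(x): return 1.0 / (np.pi * (1.0 + x**2))

def pi1_mle(draws):
    """Label-free MLE of pi_1 for the two-component N(0,1)/Cauchy(0,1) mixture."""
    grid = np.linspace(0.0, 1.0, 100_001)
    loglik = sum(np.log(grid * normal_pdf(x) + (1 - grid) * cauchy_pdf(x))
                 for x in draws)
    return grid[np.argmax(loglik)]

for draws in ([1, 1], [1, 3], [3, 3]):
    p1 = pi1_mle(draws)
    print(f"draws {draws}: (pi1_hat, pi2_hat) = ({p1:.3f}, {1 - p1:.3f})")
```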


What Did I Learn?

Model what we ignore, not what we know!

Model comparison/selection is not about which model is true (as all of them are “true”), but about which model represents a better compromise among human, computational, and statistical efficiency.

There is a cure for our “schizophrenia”: we can now analyze Monte Carlo data using the same sound statistical principles and methods that we use for analyzing real data.


If you are looking for theoretical research topics ...

RE-EXAMINE OLD ONES AND DERIVE NEW ONES!

Prove it is the MLE, or a good approximation to the MLE.

Or derive the MLE or a cost-effective approximation to it.

Markov chain Monte Carlo (Tan 2006, 2008)

More ...
