
Page 1: Connections between MCMC and Likelihood Methods

Donald A. Pierce, with Ruggero Bellio

Winter 2010, OSU

Slides are at www.science.oregonstate.edu/~piercedo/osu-mcmc-mpl.ppt

Page 2

It is popular these days to be “Bayesian”, in large part due to the utility of MCMC and in particular (Win)BUGS

However, substantive prior information is seldom used, aiming for “objective Bayes”, and connections to likelihood inference are interesting

Largely, the gain in MCMC is in utilizing rather intractable likelihood functions: integrating over latent variates, e.g. latent cluster effects or covariates observed with error

However, if everything except observed data is a random variable, issues of inference become highly (too?) automatic

Page 3

A key issue in this is the contrast of profile and integrated likelihoods, namely

L_P(\psi; y) = \max_\lambda L(\psi, \lambda; y), \qquad L_I^{w}(\psi; y) = \int L(\psi, \lambda; y)\, w(\lambda)\, d\lambda

Modern higher-order likelihood theory suggests, surprisingly, that integrated likelihoods can overcome shortcomings of profile likelihood

A posterior for \psi is an instance of integrated likelihood

That is, \pi(\psi, \lambda \mid y) \propto L(\psi, \lambda; y)\, \pi(\psi, \lambda), so

\pi(\psi \mid y) \propto \pi(\psi) \int L(\psi, \lambda; y)\, \pi(\lambda \mid \psi)\, d\lambda
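As a concrete illustration (not from the slides; the model and numbers are assumed for the sketch), the two definitions can be computed side by side for a normal model with interest parameter ψ the mean and nuisance parameter λ the variance, using a flat weight w(λ) = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, size=20)   # simulated data

def loglik(psi, lam):
    # log L(psi, lam; y) for y_i ~ N(psi, lam)
    return -0.5 * len(y) * np.log(2 * np.pi * lam) - np.sum((y - psi) ** 2) / (2 * lam)

def profile_loglik(psi):
    # log L_P(psi; y): maximize over lam; here the constrained MLE is closed-form
    lam_hat_psi = np.mean((y - psi) ** 2)
    return loglik(psi, lam_hat_psi)

def integrated_lik(psi):
    # L_I^w(psi; y) with flat weight w(lam) = 1, by trapezoidal quadrature
    grid = np.linspace(1e-3, 50.0, 4000)
    vals = np.array([np.exp(loglik(psi, lam)) for lam in grid])
    return np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(grid))
```

With a single well-behaved nuisance parameter both curves peak near the sample mean; the interesting differences between them arise with many nuisance parameters.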

Page 4

An integrated likelihood is approximated very well by a Laplace approximation

L_I^{w}(\psi; y) \doteq L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, w(\hat\lambda_\psi)

Hence, the MCMC posterior for “flat” priors is essentially

\pi(\psi \mid y) \propto L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}

We will see that this depends substantially on the representation of the nuisance parameter \lambda --- something to be avoided in frequentist or likelihood inference

The approximation above is, within reason, valid for any such representation (not that this is so comforting)
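A quick numerical check of the Laplace approximation (a sketch with simulated data; for a normal-variance nuisance parameter the constrained MLE and observed information j_{λλ} are closed-form, and the (2π)^{1/2} factor is written out here rather than absorbed into the proportionality):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
y = rng.normal(2.0, 1.5, size=n)
psi = 2.0                              # fix the interest parameter (the mean)

S = np.sum((y - psi) ** 2)
lam_hat_psi = S / n                    # constrained MLE of the variance lam
ell_P = -0.5 * n * np.log(2 * np.pi * lam_hat_psi) - S / (2 * lam_hat_psi)
j_lam = n / (2 * lam_hat_psi ** 2)     # observed information for lam at lam_hat_psi

# Laplace approximation with flat weight w(lam) = 1:
#   L_I ~= L_P * (2*pi)^{1/2} * |j_lam|^{-1/2}
L_laplace = np.exp(ell_P) * np.sqrt(2 * np.pi / j_lam)

# Direct quadrature of the integrated likelihood, for comparison
grid = np.linspace(1e-3, 30.0, 20000)
vals = np.exp(-0.5 * n * np.log(2 * np.pi * grid) - S / (2 * grid))
L_numeric = np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(grid))
```

At n = 50 the raw Laplace value is within a few percent of the quadrature answer, and the relative error shrinks as n grows.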

Page 5

Regarding “flat” priors: in practice those used in the WinBUGS manual examples seem advisable, i.e. proper but very diffuse: for parameters on (-\infty, \infty), e.g. dnorm(0,1E-6), and implicitly for the logs of inherently positive parameters, e.g. dgamma(1E-6,1E-6)

The latter is to obtain approximate invariance to scale for scale parameters, a natural requirement

If, to facilitate convergence, the prior \pi(\psi) is chosen otherwise, then for likelihood analysis one should divide the posterior of \psi by the prior

Geyer & Thompson (1992, JRSS-B) gave a method for computing the likelihood using MCMC, but the proposal here is far simpler
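The divide-by-the-prior step can be sketched as follows (illustrative only; the draws, the prior, and the grid are stand-ins for real MCMC output):

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(6)
psi_draws = rng.normal(1.5, 0.4, size=10000)  # stand-in for MCMC draws of psi

prior = norm(0.0, 2.0)                        # the non-flat prior that was used
grid = np.linspace(0.0, 3.0, 301)

post = gaussian_kde(psi_draws)(grid)          # posterior density estimate
lik = post / prior.pdf(grid)                  # divide out the prior
lik /= lik.max()                              # scale to max 1, likelihood-style
```

Dividing by a prior that decreases over the plotted range tilts the resulting likelihood curve to the right of the posterior, as one would expect.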

Page 6

An attempt to generally improve on profile likelihood was the Cox-Reid approximate conditional likelihood

L_{AC}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}

requiring that the nuisance parameter \lambda be represented as ‘orthogonal’ to \psi, i.e. that \hat\lambda_\psi varies slowly with \psi

However, orthogonal parameters are not at all uniquely defined, resulting in an arbitrariness of the ACL that must be resolved

A partial indication of our interests is that the ACL is formally the same as the above approximation to the posterior for \psi using flat priors

Page 7

Barndorff-Nielsen developed the modified profile likelihood

L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi / \partial\hat\lambda|^{-1}

which is invariant to the representation of the nuisance parameter --- a really key issue

A remarkable stroke of intuition; B-N showed only that the MPL approximates what is desired in the primary special settings: exponential families, regression-scale models, etc.

We have been developing the idea that what the MPL in general approximates is a suitable integrated likelihood, hence with close connections to MCMC

Page 8

Example (Pierce & Peters 1992): case-control study, 40 sets with 2:1 matching, 30/80 of controls “exposed”

Solid line PL, dashed lines conditional likelihood and MPL

Page 9

Concept of ‘orthogonal’ parameter, for ACL and for MCMC, needs clarification

In principle there is an ‘ideal’ choice of orthogonal parameter such that the integrated likelihood, i.e. the Bayes posterior (with uniform priors), approximates the MPL

Some goals are: (a) to actually compute this, either from the likelihood or the posterior samples, (b) to recover the PL from the posterior distribution, and (c) to approximate the MPL in this way, even if not as in (a)

These are not completed, but some progress has been made

Page 10

Example: Binary data on 50 subjects, repeated observations at up to five times, total of 220 observations

Suitable for a logistic mixed model with latent random intercepts for subjects

Interest parameter \psi = \sigma, the standard deviation of the random intercepts. Seven nuisance parameters: constant term, 2 treatment parameters, 4 for time effects

The usual parametrization is not orthogonal: the vector \beta of canonical regression parameters is ‘attenuated’ as

\hat\beta_0 \approx \hat\beta / \sqrt{1 + 0.304\,\hat\sigma^2}

(with \hat\beta_0 the fit at \sigma = 0), suggesting an approximately orthogonal parameter

\lambda = \beta / \sqrt{1 + 0.304\,\sigma^2}
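The attenuation map above is a simple rescaling; a minimal sketch (the coefficient values here are hypothetical, and 0.304 is the constant from the slide):

```python
import numpy as np

# Hypothetical canonical regression parameters and random-intercept SD
beta = np.array([1.8, -0.6, 0.9, 0.3, -0.2, 0.5, 0.1])   # 7 nuisance parameters
sigma = 1.2

# Approximately orthogonal nuisance parameters: the attenuated coefficients
lam = beta / np.sqrt(1 + 0.304 * sigma ** 2)

# The map is invertible for any sigma, so (lam, sigma) is a valid parametrization
beta_back = lam * np.sqrt(1 + 0.304 * sigma ** 2)
```

In the MCMC model specification one would place the flat priors on (lam, sigma) and compute beta from them, rather than the reverse.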

Page 11

WinBUGS posterior densities of \sigma using flat priors: heavy line original parametrization, light line using the approximately orthogonal nuisance parameters

[Figure: the two posterior densities; x-axis Sigma, 0 to 5; y-axis Density, 0 to 1.0]

Page 12

Posterior samples: Sigma vs constant term, for original and orthogonal parametrizations

This provides a clue that we can use posterior samples to assess and correct for lack of orthogonality

[Figure: two scatterplots of posterior samples --- Sigma vs constant-orig (axes roughly 1 to 5 by 1 to 4) and Sigma vs constant-orthog (axes roughly 1.0 to 2.5 by 0.5 to 3.0)]

Page 13

Important but confusing issue --- clearly, if we transform the posterior samples as \{\psi, \lambda\} \to \{\psi, g(\psi, \lambda)\}, the marginal distribution of \psi is unchanged

Part of the reason reparametrization of \lambda matters is that this is done in the model specification, where in contrast to the above there is no (implicit) Jacobian involved in the density

Having samples from the joint distribution of \{\psi, \lambda\}, it would be possible but impractical to divide the density by the Jacobian, to avoid re-doing the MCMC

We can achieve this aim otherwise by resampling from the posterior samples with weights inversely proportional to the Jacobian
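The resampling step can be sketched as follows (illustrative; the draws and the transformation g are made up, and the aim is weights proportional to the reciprocal of the Jacobian, i.e. dividing the joint density by it):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy posterior draws of (psi, lam) -- stand-ins for MCMC output
psi = rng.normal(1.0, 0.3, size=5000)
lam = rng.normal(2.0, 0.5, size=5000)

def g(psi, lam):
    # hypothetical reparametrization eta = g(psi, lam)
    return lam * np.exp(0.5 * psi)

def dg_dlam(psi, lam):
    # Jacobian d eta / d lam of the transformation, at fixed psi
    return np.exp(0.5 * psi)

# Weights inversely proportional to the Jacobian: dividing the joint density
# by |d eta / d lam| mimics having specified the model (hence a flat prior)
# in terms of eta rather than lam
w = 1.0 / np.abs(dg_dlam(psi, lam))
w /= w.sum()
idx = rng.choice(len(psi), size=len(psi), replace=True, p=w)
psi_rs, eta_rs = psi[idx], g(psi[idx], lam[idx])
```

The resampled pairs (psi_rs, eta_rs) then stand in for a re-run of the MCMC under the new parametrization.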

Page 14

Recall that to very good approximation the MCMC posterior, for flat priors, is essentially

\pi(\psi \mid y) \propto L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}

which can be expressed approximately as

\pi(\psi \mid y) \propto L_P(\psi; y)\, |\mathrm{asyvar}(\lambda \mid \psi)|^{1/2}

We can approximate the final factor from the MCMC samples at hand, and thus approximate the PL by dividing the posterior density of \psi by our estimate of |\mathrm{asyvar}(\lambda \mid \psi)|^{1/2}

There are, however, issues involving the distinction between the posterior \mathrm{var}(\lambda \mid \psi) and the sampling-theory \mathrm{var}(\hat\lambda_\psi; \psi)

Page 15

A transparent way to do this, although there may be more accurate ways:

Choose bins for \psi (e.g. 20, using quantiles), for each of these compute |\mathrm{var}(\lambda \mid \psi)|, and then smooth (the logs of) these by quadratic regression on the bin classmarks

[Figure: log var(lam|psi) plotted against psi, roughly 1.0 to 3.5 on the x-axis and -10 to -4 on the y-axis, with the fitted quadratic]
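The binning-and-smoothing recipe above can be sketched as follows (illustrative; the draws are simulated stand-ins for posterior samples, with a single nuisance parameter so that |var(lam|psi)| is a scalar variance):

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy posterior draws; in practice these come from the MCMC output
psi = rng.gamma(4.0, 0.5, size=20000)
lam = rng.normal(1.0 + 0.5 * psi, 0.3 + 0.1 * psi)   # lam | psi heteroscedastic

# 1. Bin psi by quantiles (20 bins)
edges = np.quantile(psi, np.linspace(0, 1, 21))
which = np.clip(np.searchsorted(edges, psi, side="right") - 1, 0, 19)

# 2. Within-bin log-variance of lam, and bin classmarks (midpoints)
logv = np.array([np.log(np.var(lam[which == b])) for b in range(20)])
mid = (edges[:-1] + edges[1:]) / 2

# 3. Smooth the log-variances by quadratic regression on the classmarks
coef = np.polyfit(mid, logv, deg=2)
smooth_logv = np.polyval(coef, mid)
```

Exponentiating half of smooth_logv gives the estimate of |var(lam|psi)|^{1/2} by which the posterior density of psi is divided.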

Page 16

Red, right: MCMC posterior in the original parametrization. Red, left (dashed): after the above adjustment. Black: PL computed by quadrature

[Figure: the three curves; x-axis 0 to 5, y-axis 0 to 1.0]

Page 17

What should be the meaning of an ‘orthogonal’ parameter for use in the ACL?

It was said earlier that \hat\lambda_\psi should vary slowly with \psi, which is related to the more usual definition that the (expected) cross-information terms i_{\psi\lambda} are zero

But if \lambda satisfies this definition then so does any 1-1 transformation of it --- very unsatisfactory

Further, this could not be a requirement for validity of the ACL, since linear transformations of \lambda leave the ACL unchanged even though not conforming at all to such requirements

This suggests more difficulties than first thought in utilizing plots such as on slide 12 for such purposes

Page 18

There is in principle a reparametrization such that the MPL and IL agree (related to Severini, 2007 Bmtrka)

The constrained MLE can be thought of as a function \hat\lambda_\psi(\psi, \hat\psi, \hat\lambda) if (\hat\psi, \hat\lambda) is sufficient, and otherwise \hat\lambda_\psi(\psi, \hat\psi, \hat\lambda, a) for an ancillary a

If \hat\lambda there is taken as a variable, this defines a nuisance parameter representation \lambda^*(\psi, \lambda)

This representation of the NP depends on \hat\psi, or on (\hat\psi, a) --- no real problem for Bayesian methods

Define \lambda(\psi, \lambda^*) as the inverse function solving the equation \lambda^*(\psi, \lambda) = \lambda^*

Then the MPL is the Laplace approximation to the integrated likelihood based on the representation \lambda(\psi, \lambda^*) of the nuisance parameter

Page 19

Theory for this: Laplace approximations in the parametrizations \lambda and \lambda^* differ only by a Jacobian factor,

L_P(\psi)\, |j_{\lambda^*\lambda^*}(\hat\lambda^*_\psi)|^{-1/2} = L_P(\psi)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\lambda/\partial\lambda^*|, evaluated at the constrained MLE,

and we are matching that Jacobian with the final factor of

L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi/\partial\hat\lambda|^{-1}

Actually we need only the derivatives \partial\lambda^*/\partial\lambda = 1/\{\partial\lambda/\partial\lambda^*\}

The difficulty in all this is in utilizing, for likelihood, variations in (\hat\psi, \hat\lambda) while holding fixed a suitable ancillary a

Roughly speaking, a suitable ancillary is the ratio of observed to expected information for (\psi, \lambda)

Page 20

Ex: two exponential samples, each of size n, with means \psi\lambda and \lambda, so the interest parameter \psi is the ratio of means

Reparametrize orthogonally with means \lambda\sqrt{\psi} and \lambda/\sqrt{\psi}

Then the constrained MLE

\hat\lambda_\psi = \hat\lambda\,(1 + \hat\psi/\psi)/2

provides the corresponding parametric function

\lambda^*(\psi, \lambda) = \lambda\,(1 + \hat\psi/\psi)/2

Set this equal to \lambda^* and solve for the inverse

\lambda(\psi, \lambda^*) = 2\lambda^* / (1 + \hat\psi/\psi)

Then up to Laplace approximation the MPL is the IL for the nuisance parameter representation \lambda(\psi, \lambda^*)

log PL: \ell_P(\psi) = c - 2n\log(1 + \hat\psi/\psi) - n\log(\psi)

log ACL, and MCMC posterior with the “obvious” orthogonal parameter: \ell_{AC}(\psi) = c - (2n-1)\log(1 + \hat\psi/\psi) - (n - 1/2)\log(\psi)

but for this example MPL = PL
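The two log-likelihoods above can be compared numerically (a sketch; the data are simulated and the additive constants dropped):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
y1 = rng.exponential(2.0, size=n)   # mean psi * lam, with psi = 2, lam = 1
y2 = rng.exponential(1.0, size=n)   # mean lam

psi_hat = y1.mean() / y2.mean()     # MLE of the ratio of means

def ell_P(psi):
    # profile log-likelihood, up to a constant
    return -2 * n * np.log(1 + psi_hat / psi) - n * np.log(psi)

def ell_AC(psi):
    # ACL in the orthogonal parametrization, up to a constant
    return -(2 * n - 1) * np.log(1 + psi_hat / psi) - (n - 0.5) * np.log(psi)
```

Here \ell_{AC} = (1 - 1/(2n))\,\ell_P up to a constant, so both are maximized at \hat\psi, with the ACL slightly flatter --- the direction of the adjustment seen in such examples.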

Page 21

Our MCMC example is not very suitable for investigating all this --- MPL is (again) very near the PL

When the likelihood is intractable, or when the MLE is not sufficient, can we use the MCMC output to approximate the MPL?

Is it better to approximate the reparametrization for which IL = MPL, or better to compute the required Jacobian more directly?

An issue is whether there can, in principle, be enough information in the likelihood, or posterior samples, to approximate the MPL

Can we tell from the posterior samples how the joint distribution would change for slightly different data?

Page 22

There is yet another parametrization such that locally the nuisance parameter becomes a translation parameter

In this parametrization the answer to that question is “yes”

An aim is to capitalize on this without solving for that new parametrization, perhaps taking advantage of the fact that the product of the final two terms in the MPL

L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi/\partial\hat\lambda|^{-1}

is invariant to reparametrization

We have had some success for a single nuisance parameter, but there remains much to do