Relevant statistics for (approximate) Bayesian model choice
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
[email protected]

Pittsburgh and Toronto "Halloween US trip" seminars


These are the slides of the talks given in Pittsburgh and Toronto on Oct. 30 and 31, 2013.


Page 2: Pittsburgh and Toronto "Halloween US trip" seminars

a meeting of potential interest?!

MCMSki IV to be held in Chamonix Mont Blanc, France, from Monday, Jan. 6 to Wed., Jan. 8 (and BNP workshop on Jan. 9), 2014. All aspects of MCMC++ theory, methodology and applications, and more! All posters welcome!

Website http://www.pages.drexel.edu/~mwl25/mcmski/

Page 3: Pittsburgh and Toronto "Halloween US trip" seminars

Outline

Intractable likelihoods

ABC methods

ABC as an inference machine

ABC for model choice

Model choice consistency

Conclusion and perspectives

Page 4: Pittsburgh and Toronto "Halloween US trip" seminars

Intractable likelihood

Case of a well-defined statistical model where the likelihood function

ℓ(θ | y) = f(y1, . . . , yn | θ)

I is (really!) not available in closed form

I cannot (easily!) be either completed or demarginalised

I cannot be (at all!) estimated by an unbiased estimator

c© Prohibits direct implementation of a generic MCMC algorithm like Metropolis–Hastings


Page 6: Pittsburgh and Toronto "Halloween US trip" seminars

Getting approximative

Case of a well-defined statistical model where the likelihood function

ℓ(θ | y) = f(y1, . . . , yn | θ)

is out of reach

Empirical approximations to the original B problem

I Degrading the data precision down to a tolerance ε

I Replacing the likelihood with a non-parametric approximation

I Summarising/replacing the data with insufficient statistics


Page 9: Pittsburgh and Toronto "Halloween US trip" seminars

Different [levels of] worries about abc

Impact on B inference

I a mere computational issue (that will eventually end up being solved by more powerful computers, etc., even if too costly in the short term, as for regular Monte Carlo methods) Not!

I an inferential issue (opening opportunities for a new inference machine on complex/Big Data models, with legitimacy differing from the classical B approach)

I a Bayesian conundrum (how closely related to the/a B approach? more than recycling B tools? true B with raw data? a new type of empirical B inference?)


Page 12: Pittsburgh and Toronto "Halloween US trip" seminars

Approximate Bayesian computation

Intractable likelihoods

ABC methods: Genesis of ABC, ABC basics, Advances and interpretations, ABC as knn

ABC as an inference machine

ABC for model choice

Model choice consistency

Conclusion and perspectives

Page 13: Pittsburgh and Toronto "Halloween US trip" seminars

Genetic background of ABC

skip genetics

ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ)

This technique stemmed from population genetics models, about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC.

[Griffith & al., 1997; Tavare & al., 1999]

Page 14: Pittsburgh and Toronto "Halloween US trip" seminars

Demo-genetic inference

Each model is characterized by a set of parameters θ that cover historical (time divergence, admixture time, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors

The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time

Problem:

most of the time, we cannot calculate the likelihood of the polymorphism data f(y|θ)...



Page 19: Pittsburgh and Toronto "Halloween US trip" seminars

Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium

Observations: leaves of the tree; θ = ?

Kingman's genealogy: when the time axis is normalized, T(k) ∼ Exp(k(k − 1)/2)

Mutations according to the Simple stepwise Mutation Model (SMM): dates of the mutations ∼ Poisson process with intensity θ/2 over the branches; MRCA = 100; independent mutations: ±1 with probability 1/2

Page 20: Pittsburgh and Toronto "Halloween US trip" seminars

Much more interesting models. . .

I several independent loci: independent gene genealogies and mutations

I different populations linked by an evolutionary scenario made of divergences, admixtures, migrations between populations, selection pressure, etc.

I larger sample size, usually between 50 and 100 genes

A typical evolutionary scenario:

[Tree diagram: MRCA at the root, with divergence times τ1 and τ2 leading to populations POP 0, POP 1 and POP 2]

Page 21: Pittsburgh and Toronto "Halloween US trip" seminars

Instance of ecological questions

I How did the Asian ladybird beetle arrive in Europe?

I Why do they swarm right now?

I What are the routes of invasion?

I How to get rid of them?

I Why did the chicken cross the road?

[Lombaert & al., 2010, PLoS ONE]

Page 22: Pittsburgh and Toronto "Halloween US trip" seminars

Worldwide invasion routes of Harmonia axyridis

Worldwide routes of invasion of Harmonia axyridis

For each outbreak, the arrow indicates the most likely invasion pathway and the associated posterior probability, with 95% credible intervals in brackets

[Estoup et al., 2012, Molecular Ecology Res.]

Page 23: Pittsburgh and Toronto "Halloween US trip" seminars

c© Intractable likelihood

Missing (too much missing!) data structure:

f(y|θ) = ∫_G f(y | G, θ) f(G | θ) dG

cannot be computed in a manageable way... [Stephens & Donnelly, 2000]

The genealogies are considered as nuisance parameters

This modelling clearly differs from the phylogenetic perspective

where the tree is the parameter of interest.


Page 25: Pittsburgh and Toronto "Halloween US trip" seminars

A?B?C?

I A stands for approximate [wrong likelihood / pic?]

I B stands for Bayesian

I C stands for computation [producing a parameter sample]


[Figure: posterior density estimates of θ over repeated runs, with effective sample sizes (ESS) ranging from about 76 to 156]

Page 28: Pittsburgh and Toronto "Halloween US trip" seminars

How Bayesian is aBc?

Could we turn the resolution into a Bayesian answer?

I ideally so (not meaningful: requires an ∞-ly powerful computer)

I approximation error unknown (w/o costly simulation)

I true B for wrong model (formal and artificial)

I true B for noisy model (much more convincing)

I true B for estimated likelihood (back to econometrics?)

I illuminating the tension between information and precision

Page 29: Pittsburgh and Toronto "Halloween US trip" seminars

ABC methodology

Bayesian setting: target is π(θ)f(x|θ). When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:

Foundation

For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating

θ′ ∼ π(θ), z ∼ f(z|θ′),

until the auxiliary variable z is equal to the observed value, z = y, then the selected

θ′ ∼ π(θ|y)

[Rubin, 1984; Diggle & Gratton, 1984; Tavare et al., 1997]


Page 32: Pittsburgh and Toronto "Halloween US trip" seminars

A as A...pproximative

When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone

ρ(y, z) ≤ ε

where ρ is a distance. Output distributed from

π(θ) Pθ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε)

[Pritchard et al., 1999]


Page 34: Pittsburgh and Toronto "Halloween US trip" seminars

ABC algorithm

In most implementations, further degree of A...pproximation:

Algorithm 1 Likelihood-free rejection sampler

for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θi = θ′
end for

where η(y) defines a (not necessarily sufficient) statistic
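As a concrete illustration, here is a minimal Python sketch of Algorithm 1 on an assumed Gaussian toy model; the model, prior, tolerance and function names are illustrative placeholders, not part of the slides.

```python
import numpy as np

def abc_rejection(y_obs, prior_sampler, simulator, summary, dist, eps, N):
    """Likelihood-free rejection sampler (Algorithm 1): keep theta' ~ prior
    whenever dist(eta(z), eta(y_obs)) <= eps for z simulated from the model."""
    eta_obs = summary(y_obs)
    kept = []
    while len(kept) < N:
        theta = prior_sampler()
        z = simulator(theta)
        if dist(summary(z), eta_obs) <= eps:
            kept.append(theta)
    return np.array(kept)

# illustrative toy model (assumption): y_i ~ N(theta, 1), theta ~ N(0, 10),
# with the sample mean as summary statistic eta
rng = np.random.default_rng(0)
y_obs = rng.normal(1.5, 1.0, size=50)
posterior_sample = abc_rejection(
    y_obs,
    prior_sampler=lambda: rng.normal(0.0, np.sqrt(10.0)),
    simulator=lambda th: rng.normal(th, 1.0, size=50),
    summary=np.mean,
    dist=lambda a, b: abs(a - b),
    eps=0.05,
    N=500,
)
print(posterior_sample.mean(), posterior_sample.std())
```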

Page 35: Pittsburgh and Toronto "Halloween US trip" seminars

Output

The likelihood-free algorithm samples from the marginal in z of:

πε(θ, z | y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ,

where Aε,y = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:

πε(θ | y) = ∫ πε(θ, z | y) dz ≈ π(θ | y).

...does it?!


Page 38: Pittsburgh and Toronto "Halloween US trip" seminars

Output

The likelihood-free algorithm samples from the marginal in z of:

πε(θ, z | y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ,

where Aε,y = {z ∈ D | ρ(η(z), η(y)) < ε}.

The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the restricted posterior distribution:

πε(θ | y) = ∫ πε(θ, z | y) dz ≈ π(θ | η(y)).

Not so good..!

Page 39: Pittsburgh and Toronto "Halloween US trip" seminars

Comments

I Role of distance paramount (because ε ≠ 0)

I Scaling of components of η(y) is also crucial

I ε matters little if "small enough"

I representative of "curse of dimensionality"

I small is beautiful!

I the data as a whole may be paradoxically weakly informative for ABC

Page 40: Pittsburgh and Toronto "Halloween US trip" seminars

curse of dimensionality in summaries

Consider that

I the simulated summary statistics η(y1), . . . ,η(yN)

I the observed summary statistics η(yobs)

are iid with uniform distribution on [0, 1]^d

Take δ∞(d, N) = E[ min_{i=1,...,N} ‖η(y_obs) − η(y_i)‖_∞ ]

            N = 100    N = 1,000   N = 10,000   N = 100,000
δ∞(1, N)    0.0025     0.00025     0.000025     0.0000025
δ∞(2, N)    0.033      0.01        0.0033       0.001
δ∞(10, N)   0.28       0.22        0.18         0.14
δ∞(200, N)  0.48       0.48        0.47         0.46
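These orders of magnitude can be checked by direct simulation; a small Monte Carlo sketch (restricted to N ≤ 10,000 to keep the run short):

```python
import numpy as np

def delta_inf(d, N, reps=100, rng=np.random.default_rng(1)):
    """Monte Carlo estimate of E[min_i ||eta(y_obs) - eta(y_i)||_inf]
    for iid U[0,1]^d summary statistics."""
    vals = np.empty(reps)
    for r in range(reps):
        eta_obs = rng.random(d)
        etas = rng.random((N, d))
        vals[r] = np.abs(etas - eta_obs).max(axis=1).min()
    return vals.mean()

for d in (1, 2, 10, 200):
    print(d, [round(delta_inf(d, N), 4) for N in (100, 1_000, 10_000)])
```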

Page 41: Pittsburgh and Toronto "Halloween US trip" seminars

ABC (simul’) advances

how approximative is ABC? ABC as knn

Simulating from the prior is often poor in efficiency. Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y...

[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]

...or view the problem as conditional density estimation and develop techniques to allow for larger ε

[Beaumont et al., 2002]

...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]


Page 45: Pittsburgh and Toronto "Halloween US trip" seminars

ABC as knn

[Biau et al., 2013, Annales de l’IHP]

Practice of ABC: determine tolerance ε as a quantile on observed distances, say 10% or 1% quantile,

ε = εN = qα(d1, . . . , dN)

I Interpretation of ε as a nonparametric bandwidth is only an approximation of the actual practice

[Blum & Francois, 2010]

I ABC is a k-nearest neighbour (knn) method with kN = N εN
[Loftsgaarden & Quesenberry, 1965]
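In code, this calibration is one line: keep the kN ≈ αN simulations whose summaries fall closest to η(y). A sketch with hypothetical variable names:

```python
import numpy as np

def abc_knn_select(etas, thetas, eta_obs, alpha=0.01):
    """etas: (N, d) simulated summaries, thetas: (N, p) simulated parameters.
    Tolerance taken as the alpha-quantile of the observed distances, i.e.
    the k_N ~ alpha * N nearest neighbours of eta_obs are retained."""
    d = np.linalg.norm(etas - eta_obs, axis=1)   # distances d_1, ..., d_N
    eps = np.quantile(d, alpha)                  # eps_N = q_alpha(d_1, ..., d_N)
    keep = d <= eps
    return thetas[keep], eps
```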


Page 48: Pittsburgh and Toronto "Halloween US trip" seminars

ABC consistency

Provided

kN / log log N → ∞ and kN / N → 0

as N → ∞, for almost all s0 (with respect to the distribution of S), with probability 1,

(1 / kN) Σ_{j=1}^{kN} ϕ(θj) → E[ϕ(θj) | S = s0]

[Devroye, 1982]

Biau et al. (2013) also recall pointwise and integrated mean square error consistency results on the corresponding kernel estimate of the conditional posterior distribution, under constraints

kN → ∞, kN/N → 0, hN → 0 and h_N^p kN → ∞.


Page 50: Pittsburgh and Toronto "Halloween US trip" seminars

Rates of convergence

Further assumptions (on target and kernel) allow for precise (integrated mean square) convergence rates (as a power of the sample size N), derived from classical k-nearest neighbour regression, like

I when m = 1, 2, 3, kN ≈ N^{(p+4)/(p+8)} and rate N^{−4/(p+8)}

I when m = 4, kN ≈ N^{(p+4)/(p+8)} and rate N^{−4/(p+8)} log N

I when m > 4, kN ≈ N^{(p+4)/(m+p+4)} and rate N^{−4/(m+p+4)}

[Biau et al., 2013]

Drag: Only applies to sufficient summary statistics


Page 52: Pittsburgh and Toronto "Halloween US trip" seminars

ABC inference machine

Intractable likelihoods

ABC methods

ABC as an inference machine: Error inc., Exact BC and approximate targets, summary statistic

ABC for model choice

Model choice consistency

Conclusion and perspectives

Page 53: Pittsburgh and Toronto "Halloween US trip" seminars

How Bayesian is aBc..?

I may be a convergent method of inference (meaningful? sufficient? foreign?)

I approximation error unknown (w/o massive simulation)

I pragmatic/empirical B (there is no other solution!)

I many calibration issues (tolerance, distance, statistics)

I the NP side should be incorporated into the whole B picture

I the approximation error should also be part of the B inference

Page 54: Pittsburgh and Toronto "Halloween US trip" seminars

ABCµ

Idea: infer about the error as well as about the parameter. Use of a joint density

f(θ, ε | y) ∝ ξ(ε | y, θ) × πθ(θ) × πε(ε)

where y is the data, and ξ(ε | y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ). Warning! Replacement of ξ(ε | y, θ) with a non-parametric kernel approximation.

[Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]


Page 57: Pittsburgh and Toronto "Halloween US trip" seminars

Wilkinson’s exact BC (not exactly!)

ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, convolution of the true posterior with a kernel function

πε(θ, z | y) = π(θ) f(z|θ) Kε(y − z) / ∫ π(θ) f(z|θ) Kε(y − z) dz dθ,

with Kε a kernel parameterised by bandwidth ε.
[Wilkinson, 2013]

Theorem

The ABC algorithm based on the production of a randomised observation y⋆ = y + ξ, ξ ∼ Kε, and an acceptance probability of

Kε(y⋆ − z)/M

gives draws from the posterior distribution π(θ | y⋆).
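A minimal sketch of the corresponding accept/reject step, taking Kε Gaussian so the bound is M = Kε(0); the model, prior and ε are placeholders, and this is one reading of the randomised-observation scheme above, not a definitive implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def K_eps(u, eps):
    """Unnormalised Gaussian kernel with bandwidth eps; its maximum is K_eps(0) = 1."""
    return np.exp(-0.5 * np.sum(np.atleast_1d(u) ** 2) / eps ** 2)

def exact_bc(y, prior_sampler, simulator, eps, N):
    """Accept (theta, z) with probability K_eps(y - z) / M, here M = K_eps(0) = 1,
    targeting the posterior under the convolved model 'data = z + noise, noise ~ K_eps'."""
    kept = []
    while len(kept) < N:
        theta = prior_sampler()
        z = simulator(theta)
        if rng.random() < K_eps(y - z, eps):
            kept.append(theta)
    return np.array(kept)
```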


Page 59: Pittsburgh and Toronto "Halloween US trip" seminars

How exact a BC?

Pros

I Pseudo-data from true model vs. observed data from noisy model

I Outcome is completely controlled

I Link with ABCµ when assuming y observed with measurement error with density Kε

I Relates to the theory of model approximation
[Kennedy & O'Hagan, 2001]

I leads to "noisy ABC": just do it!, i.e. perturb the data down to precision ε and proceed

[Fearnhead & Prangle, 2012]

Cons

I Requires Kε to be bounded by M

I True approximation error never assessed


Page 61: Pittsburgh and Toronto "Halloween US trip" seminars

Which summary?

Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic

I Loss of statistical information balanced against gain in data roughening

I Approximation error and information loss remain unknown

I Choice of statistics induces choice of distance function towards standardisation

I may be imposed for external/practical reasons (e.g., DIYABC)

I may gather several non-B point estimates

I can learn about efficient combination


Page 64: Pittsburgh and Toronto "Halloween US trip" seminars

Semi-automatic ABC

Fearnhead and Prangle (2012) study ABC and the selection of the summary statistic in close proximity to Wilkinson's proposal

I ABC considered as an inferential method and calibrated as such

I randomised (or 'noisy') version of the summary statistics

η̃(y) = η(y) + τε

I derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic

[weak property]
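A minimal sketch of the pilot-regression idea behind this construction: estimate E[θ|y] by least squares on a pilot set of simulated pairs and use the fitted linear map as the summary statistic (the feature choice below is an assumption, not F&P's exact recipe).

```python
import numpy as np

def semi_automatic_stat(pilot_thetas, pilot_features):
    """pilot_thetas: (N,) parameters from a pilot ABC run; pilot_features: (N, q)
    transformations f(z) of the corresponding pseudo-data. Fits theta ~ f(z)
    by least squares; the returned map approximates eta(y) = E[theta | y]."""
    X = np.column_stack([np.ones(len(pilot_features)), pilot_features])
    beta, *_ = np.linalg.lstsq(X, pilot_thetas, rcond=None)
    return lambda f: float(np.concatenate(([1.0], np.atleast_1d(f))) @ beta)
```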

Page 65: Pittsburgh and Toronto "Halloween US trip" seminars

Summary [of F&P/statistics]

I optimality of the posterior expectation

E[θ|y]

of the parameter of interest as summary statistics η(y)! [requires iterative process]

I use of the standard quadratic loss function

(θ − θ0)^T A (θ − θ0)

I recent extension to model choice, optimality of Bayes factor

B12(y)

[F&P, ISBA 2012, Kyoto]


Page 67: Pittsburgh and Toronto "Halloween US trip" seminars

LDA summaries for model choice

In parallel to F&P semi-automatic ABC, selection of the most discriminant subvector out of a collection of summary statistics, based on Linear Discriminant Analysis (LDA)

[Estoup & al., 2012, Mol. Ecol. Res.]

Solution now implemented in DIYABC.2 [Cornuet & al., 2008, Bioinf.; Estoup & al., 2013]

Page 68: Pittsburgh and Toronto "Halloween US trip" seminars

Implementation

Step 1: Take a subset of the α% (e.g., 1%) best simulations from an ABC reference table usually including 10^6–10^9 simulations for each of the M compared scenarios/models

Selection based on normalized Euclidean distance computed between observed and simulated raw summaries

Step 2: run LDA on this subset to transform summaries into (M − 1) discriminant variables

When computing LDA functions, weight simulated data with an Epanechnikov kernel

Step 3: Estimation of the posterior probabilities of each competing scenario/model by polychotomous local logistic regression against the M − 1 most discriminant variables

[Cornuet & al., 2008, Bioinformatics]
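A minimal sketch of Steps 2–3 with scikit-learn, assuming a reference table of raw summaries S (n × p), model indices m, and observed summaries s_obs; the Epanechnikov weighting and the "local" aspect of the logistic regression are omitted for brevity.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

def lda_model_probabilities(S, m, s_obs):
    """S: (n, p) raw summaries of the alpha% best simulations, m: (n,) model
    indices, s_obs: (p,) observed summaries. Returns estimated posterior
    probabilities of the M competing scenarios."""
    lda = LinearDiscriminantAnalysis().fit(S, m)            # Step 2: M-1 discriminant axes
    D = lda.transform(S)
    d_obs = lda.transform(np.asarray(s_obs).reshape(1, -1))
    logit = LogisticRegression(max_iter=1000).fit(D, m)     # Step 3: polychotomous regression
    return logit.predict_proba(d_obs)[0]
```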

Page 69: Pittsburgh and Toronto "Halloween US trip" seminars

LDA advantages

I much faster computation of scenario probabilities via polychotomous regression

I a (much) lower number of explanatory variables improves the accuracy of the ABC approximation, reduces the tolerance ε and avoids extra costs in constructing the reference table

I allows for a large collection of initial summaries

I ability to evaluate Type I and Type II errors on more complex models

I LDA reduces correlation among explanatory variables

When available, using both simulated and real data sets, posterior probabilities of scenarios computed from LDA-transformed and raw summaries are strongly correlated


Page 71: Pittsburgh and Toronto "Halloween US trip" seminars

ABC for model choice

Intractable likelihoods

ABC methods

ABC as an inference machine

ABC for model choice: BMC principle, Gibbs random fields (counterexample), Generic ABC model choice

Model choice consistency

Conclusion and perspectives

Page 72: Pittsburgh and Toronto "Halloween US trip" seminars

Bayesian model choice

BMC Principle

Several models M1, M2, . . .

are considered simultaneously for dataset y, with the model index M central to inference.

Use of

I prior π(M = m), plus

I prior distribution on the parameter conditional on the value m of the model index, πm(θm)

Page 73: Pittsburgh and Toronto "Halloween US trip" seminars

Bayesian model choice

BMC Principle

Several models M1, M2, . . .

are considered simultaneously for dataset y, with the model index M central to inference.

Goal is to derive the posterior distribution of M,

π(M = m|data)

a challenging computational target when models are complex.

Page 74: Pittsburgh and Toronto "Halloween US trip" seminars

Generic ABC for model choice

Algorithm 2 Likelihood-free model choice sampler (ABC-MC)

for t = 1 to T do
  repeat
    Generate m from the prior π(M = m)
    Generate θm from the prior πm(θm)
    Generate z from the model fm(z|θm)
  until ρ{η(z), η(y)} < ε
  Set m(t) = m and θ(t) = θm
end for

[Grelaud & al., 2009; Toni & al., 2009]
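A minimal Python sketch of Algorithm 2, also returning the acceptance frequencies used as posterior probability estimates on the next slide; priors, simulators and tolerance are hypothetical placeholders.

```python
import numpy as np

def abc_model_choice(y_obs, model_prior, param_priors, simulators,
                     summary, dist, eps, T, rng=None):
    """Likelihood-free model choice sampler (ABC-MC).
    model_prior: probabilities of each model; param_priors[m](rng) draws theta_m;
    simulators[m](theta, rng) draws pseudo-data z from f_m(.|theta_m)."""
    rng = rng or np.random.default_rng()
    eta_obs = summary(y_obs)
    ms, thetas = [], []
    for _ in range(T):
        while True:
            m = rng.choice(len(model_prior), p=model_prior)   # m ~ pi(M = m)
            theta = param_priors[m](rng)                      # theta_m ~ pi_m(theta_m)
            z = simulators[m](theta, rng)                     # z ~ f_m(z | theta_m)
            if dist(summary(z), eta_obs) <= eps:
                break
        ms.append(m)
        thetas.append(theta)
    ms = np.asarray(ms)
    post = np.array([np.mean(ms == k) for k in range(len(model_prior))])
    return ms, thetas, post   # post[m] = (1/T) sum_t I(m^(t) = m)
```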


Page 76: Pittsburgh and Toronto "Halloween US trip" seminars

ABC estimates

Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m

(1/T) Σ_{t=1}^{T} I(m(t) = m).

Extension to a weighted polychotomous logistic regression estimate of π(M = m|y), with non-parametric kernel weights

[Cornuet et al., DIYABC, 2009]

Page 77: Pittsburgh and Toronto "Halloween US trip" seminars

Potts model

Skip MRFs

Potts model

Distribution with an energy function of the form

θ S(y) = θ Σ_{l∼i} δ_{yl=yi}

where l∼i denotes a neighbourhood structure

In most realistic settings, the summation

Zθ = Σ_{x∈X} exp{θ S(x)}

involves too many terms to be manageable and numerical approximations cannot always be trusted


Page 79: Pittsburgh and Toronto "Halloween US trip" seminars

Neighbourhood relations

Setup: choice to be made between M neighbourhood relations

i ∼_m i′ (0 ≤ m ≤ M − 1)

with Sm(x) = Σ_{i ∼_m i′} I{xi = xi′}

driven by the posterior probabilities of the models.

Page 80: Pittsburgh and Toronto "Halloween US trip" seminars

Model index

Computational target:

P(M = m | x) ∝ π(M = m) ∫_{Θm} fm(x|θm) πm(θm) dθm

If S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1),

P(M = m|x) = P(M = m|S(x)) .


Page 82: Pittsburgh and Toronto "Halloween US trip" seminars

Sufficient statistics in Gibbs random fields

Each model m has its own sufficient statistic Sm(·) and S(·) = (S0(·), . . . , SM−1(·)) is also (model-)sufficient.

Explanation: For Gibbs random fields,

x | M = m ∼ fm(x|θm) = f¹m(x | S(x)) f²m(S(x) | θm) = (1 / n(S(x))) f²m(S(x) | θm)

where n(S(x)) = #{x̃ ∈ X : S(x̃) = S(x)}

c© S(x) is sufficient for the joint parameters


Page 84: Pittsburgh and Toronto "Halloween US trip" seminars

Toy example

iid Bernoulli model versus two-state first-order Markov chain, i.e.

f0(x|θ0) = exp(θ0 Σ_{i=1}^n I{xi=1}) / {1 + exp(θ0)}^n,

versus

f1(x|θ1) = (1/2) exp(θ1 Σ_{i=2}^n I{xi=xi−1}) / {1 + exp(θ1)}^{n−1},

with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by "phase transition" boundaries).

Page 85: Pittsburgh and Toronto "Halloween US trip" seminars

About sufficiency

If η1(x) sufficient statistic for model m = 1 and parameter θ1 and η2(x) sufficient statistic for model m = 2 and parameter θ2, (η1(x), η2(x)) is not always sufficient for (m, θm)

c© Potential loss of information at the testing level


Page 87: Pittsburgh and Toronto "Halloween US trip" seminars

Poisson/geometric example

Sample x = (x1, . . . , xn)

from either a Poisson P(λ) or from a geometric G(p)

Sum S = Σ_{i=1}^n xi = η(x)

sufficient statistic for either model but not simultaneously
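A small simulation sketch of the ABC Bayes factor based on S alone, under priors that are not specified on the slide and are assumed here purely for illustration (λ ∼ Exp(1), p ∼ U(0, 1), geometric shifted to start at 0):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 20, 200_000
x_obs = rng.poisson(2.0, size=n)        # observed data, generated from the Poisson model
S_obs = x_obs.sum()                     # eta(x) = S = sum_i x_i

# ABC with eta = S and eps = 0: count exact matches of S under each model
lam = rng.exponential(1.0, size=T)      # assumed prior: lambda ~ Exp(1)
p = 1.0 - rng.random(T)                 # assumed prior: p ~ U(0, 1), kept away from 0
S1 = rng.poisson(lam[:, None], size=(T, n)).sum(axis=1)
S2 = (rng.geometric(p[:, None], size=(T, n)) - 1).sum(axis=1)  # shift support to {0,1,...}
B12_eta = np.mean(S1 == S_obs) / np.mean(S2 == S_obs)
print("ABC estimate of B12 based on S:", B12_eta)
```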

Page 88: Pittsburgh and Toronto "Halloween US trip" seminars

Limiting behaviour of B12 (T →∞)

ABC approximation

B12(y) = [Σ_{t=1}^T I(mt = 1) I(ρ{η(zt), η(y)} ≤ ε)] / [Σ_{t=1}^T I(mt = 2) I(ρ{η(zt), η(y)} ≤ ε)],

where the (mt, zt)'s are simulated from the (joint) prior.

As T goes to infinity, limit

Bε12(y) = ∫ I(ρ{η(z), η(y)} ≤ ε) π1(θ1) f1(z|θ1) dz dθ1 / ∫ I(ρ{η(z), η(y)} ≤ ε) π2(θ2) f2(z|θ2) dz dθ2
        = ∫ I(ρ{η, η(y)} ≤ ε) π1(θ1) f1^η(η|θ1) dη dθ1 / ∫ I(ρ{η, η(y)} ≤ ε) π2(θ2) f2^η(η|θ2) dη dθ2,

where f1^η(η|θ1) and f2^η(η|θ2) are the distributions of η(z)


Page 90: Pittsburgh and Toronto "Halloween US trip" seminars

Limiting behaviour of B12 (ε→ 0)

When ε goes to zero,

Bη12(y) = ∫ π1(θ1) f1^η(η(y)|θ1) dθ1 / ∫ π2(θ2) f2^η(η(y)|θ2) dθ2

c© Bayes factor based on the sole observation of η(y)


Page 92: Pittsburgh and Toronto "Halloween US trip" seminars

Limiting behaviour of B12 (under sufficiency)

If η(y) sufficient statistic in both models,

fi(y|θi) = gi(y) fi^η(η(y)|θi)

Thus

B12(y) = ∫_{Θ1} π(θ1) g1(y) f1^η(η(y)|θ1) dθ1 / ∫_{Θ2} π(θ2) g2(y) f2^η(η(y)|θ2) dθ2
       = [g1(y) ∫ π1(θ1) f1^η(η(y)|θ1) dθ1] / [g2(y) ∫ π2(θ2) f2^η(η(y)|θ2) dθ2]
       = [g1(y) / g2(y)] Bη12(y).

[Didelot, Everitt, Johansen & Lawson, 2011]

c© No discrepancy only when cross-model sufficiency


Page 94: Pittsburgh and Toronto "Halloween US trip" seminars

Poisson/geometric example (back)

Sample x = (x1, . . . , xn)

from either a Poisson P(λ) or from a geometric G(p)

Discrepancy ratio

g1(x) / g2(x) = (S! n^{−S} / ∏_i xi!) / (1 / (n+S−1 choose S))

Page 95: Pittsburgh and Toronto "Halloween US trip" seminars

Poisson/geometric discrepancy

Range of B12(x) versus Bη12(x): the values produced have nothing in common.


Page 97: Pittsburgh and Toronto "Halloween US trip" seminars

Formal recovery

Creating an encompassing exponential family

f(x|θ1, θ2, α1, α2) ∝ exp{θ1^T η1(x) + θ2^T η2(x) + α1 t1(x) + α2 t2(x)}

leads to a sufficient statistic (η1(x), η2(x), t1(x), t2(x))
[Didelot, Everitt, Johansen & Lawson, 2011]

In the Poisson/geometric case, if ∏_i xi! is added to S, no discrepancy


Only applies in genuine sufficiency settings...

c© Inability to evaluate information loss due to summary statistics

Page 99: Pittsburgh and Toronto "Halloween US trip" seminars

Meaning of the ABC-Bayes factor

The ABC approximation to the Bayes factor is based solely on the summary statistics...

In the Poisson/geometric case, if E[yi] = θ0 > 0,

lim_{n→∞} Bη12(y) = (θ0 + 1)² e^{−θ0} / θ0


Page 101: Pittsburgh and Toronto "Halloween US trip" seminars

ABC model choice consistency

Intractable likelihoods

ABC methods

ABC as an inference machine

ABC for model choice

Model choice consistency: Formalised framework, Consistency results, Summary statistics

Conclusion and perspectives

Page 102: Pittsburgh and Toronto "Halloween US trip" seminars

The starting point

Central question to the validation of ABC for model choice:

When is a Bayes factor based on an insufficient statistic T(y) consistent?

Note/warning: c© drawn on T(y) through B^T_12(y) necessarily differs from c© drawn on y through B12(y)
[Marin, Pillai, X, & Rousseau, JRSS B, 2013]



Page 105: Pittsburgh and Toronto "Halloween US trip" seminars

A benchmark, if toy, example

Comparison suggested by a referee of the PNAS paper [thanks!]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]

Model M1: y ∼ N(θ1, 1), opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one).

Four possible statistics:

1. sample mean ȳ (sufficient for M1 if not M2);

2. sample median med(y) (insufficient);

3. sample variance var(y) (ancillary);

4. median absolute deviation mad(y) = med(|y − med(y)|).

[Boxplots of the ABC posterior probabilities of model M1 (Gauss) versus M2 (Laplace) under the various summary statistics, n = 100]

Page 108: Pittsburgh and Toronto "Halloween US trip" seminars

Framework

Starting from sample

y = (y1, . . . , yn)

the observed sample, not necessarily iid with true distribution

y ∼ Pn

Summary statistics

T(y) = Tn = (T1(y),T2(y), · · · ,Td(y)) ∈ Rd

with true distribution Tn ∼ Gn.

Page 109: Pittsburgh and Toronto "Halloween US trip" seminars

Framework

c© Comparison of

– under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ Rp1

– under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2

turned into

– under M1, T(y) ∼ G1,n(·|θ1), and θ1|T(y) ∼ π1(·|Tn)

– under M2, T(y) ∼ G2,n(·|θ2), and θ2|T(y) ∼ π2(·|Tn)

Page 110: Pittsburgh and Toronto "Halloween US trip" seminars

Assumptions

A collection of asymptotic “standard” assumptions:

[A1] is a standard central limit theorem under the true model
[A2] controls the large deviations of the estimator Tn from the estimand µ(θ)
[A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter)
[A4] restricts the behaviour of the model density against the true density

[Think CLT!]

Page 111: Pittsburgh and Toronto "Halloween US trip" seminars

Assumptions

A collection of asymptotic “standard” assumptions:

[Think CLT!]

[A1] There exist

I a sequence {vn} converging to +∞,

I a distribution Q,

I a symmetric, d × d positive definite matrix V0 and

I a vector µ0 ∈ Rd

such that vn V0^{−1/2} (Tn − µ0) converges in distribution to Q as n → ∞, under Gn

Page 112: Pittsburgh and Toronto "Halloween US trip" seminars

Assumptions

A collection of asymptotic “standard” assumptions:

[Think CLT!]

[A2] For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and constants εi, τi, αi > 0 such that for all τ > 0,

sup_{θi ∈ Fn,i} Gi,n[ |Tn − µi(θi)| > τ |µi(θi) − µ0| ∧ εi | θi ] · (τ |µi(θi) − µ0| ∧ εi)^{−αi} ≲ v_n^{−αi}

with πi(F^c_{n,i}) = o(v_n^{−τi}).

Page 113: Pittsburgh and Toronto "Halloween US trip" seminars

Assumptions

A collection of asymptotic “standard” assumptions:

[Think CLT!]

[A3] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, defining (u > 0)

Sn,i(u) = {θi ∈ Fn,i; |µi(θi) − µ0| ≤ u v_n^{−1}},

there exist constants di < τi ∧ αi − 1 such that

πi(Sn,i(u)) ∼ u^{di} v_n^{−di}, ∀ u ≲ vn

Page 114: Pittsburgh and Toronto "Halloween US trip" seminars

Assumptions

A collection of asymptotic “standard” assumptions:

[Think CLT!]

[A4] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any ε > 0, there exist U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U),

En ⊂ {t; gi(t|θi) < δ gn(t)} and Gn(E^c_n) < ε.

Page 115: Pittsburgh and Toronto "Halloween US trip" seminars

Assumptions

A collection of asymptotic “standard” assumptions:

[Think CLT!]

Again (sum-up):
[A1] is a standard central limit theorem under the true model
[A2] controls the large deviations of the estimator Tn from the estimand µ(θ)
[A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter)
[A4] restricts the behaviour of the model density against the true density

Page 116: Pittsburgh and Toronto "Halloween US trip" seminars

Effective dimension

Understanding di in [A3]: defined only when µ0 ∈ {µi(θi), θi ∈ Θi},

πi(θi : |µi(θi) − µ0| < n^{−1/2}) = O(n^{−di/2})

is the effective dimension of the model Θi around µ0

Page 117: Pittsburgh and Toronto "Halloween US trip" seminars

Effective dimension

Understanding di in [A3]: when inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,

mi(Tn) / gn(Tn) ∼ v_n^{−di},

thus

log(mi(Tn) / gn(Tn)) ∼ −di log vn

and v_n^{−di} is the penalization factor resulting from integrating θi out (see the effective number of parameters in DIC)

Page 118: Pittsburgh and Toronto "Halloween US trip" seminars

Effective dimension

Understanding di in [A3]: in regular models, di is the dimension of T(Θi), leading to BIC; in non-regular models, di can be smaller

Page 119: Pittsburgh and Toronto "Halloween US trip" seminars

Asymptotic marginals

Asymptotically, under [A1]–[A4]

mi(t) = ∫_{Θi} gi(t|θi) πi(θi) dθi

is such that

(i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,

Cl v_n^{d−di} ≤ mi(Tn) ≤ Cu v_n^{d−di}

and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0

mi(Tn) = oPn[v_n^{d−τi} + v_n^{d−αi}].

Page 120: Pittsburgh and Toronto "Halloween US trip" seminars

Within-model consistency

Under the same assumptions, if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, the posterior distribution of µi(θi) given Tn is consistent at rate 1/vn provided αi ∧ τi > di.


Page 122: Pittsburgh and Toronto "Halloween US trip" seminars

Between-model consistency

Consequence of above is that the asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of Tn under both models. And only by this mean value!

Indeed, if

inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0

then

Cl v_n^{−(d1−d2)} ≤ m1(Tn)/m2(Tn) ≤ Cu v_n^{−(d1−d2)},

where Cl, Cu = OPn(1), irrespective of the true model.

c© Only depends on the difference d1 − d2: no consistency

Page 123: Pittsburgh and Toronto "Halloween US trip" seminars

Between-model consistency

Consequence of above is that the asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of Tn under both models. And only by this mean value!

Else, if

inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0

then

m1(Tn) / m2(Tn) > Cu min( v_n^{−(d1−α2)}, v_n^{−(d1−τ2)} )

Page 124: Pittsburgh and Toronto "Halloween US trip" seminars

Consistency theorem

If

inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0,

Bayes factor

B^T_12 = O(v_n^{−(d1−d2)})

irrespective of the true model. It is inconsistent since it always picks the model with the smallest dimension

Page 125: Pittsburgh and Toronto "Halloween US trip" seminars

Consistency theorem

If Pn belongs to one of the two models and if µ0 cannot be attained by the other one:

0 = min(inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) < max(inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2),

then the Bayes factor B^T_12 is consistent

Page 126: Pittsburgh and Toronto "Halloween US trip" seminars

Consequences on summary statistics

Bayes factor driven by the means µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2.

For ABC, this implies the simplest summary statistics Tn are ancillary statistics with different mean values under both models

Else, if Tn asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified



Page 129: Pittsburgh and Toronto "Halloween US trip" seminars

Toy example: Laplace versus Gauss [1]

If

Tn = n^{−1} Σ_{i=1}^n X_i^4,   µ1(θ) = 3 + θ^4 + 6θ^2,   µ2(θ) = 6 + · · ·

and the true distribution is Laplace with mean θ0 = 1; under the Gaussian model the value θ∗ = 2√3 − 3 also leads to µ0 = µ(θ∗)
[here d1 = d2 = d = 1]

c© A Bayes factor associated with such a statistic is inconsistent


Page 131: Pittsburgh and Toronto "Halloween US trip" seminars

Toy example: Laplace versus Gauss [1]

If

Tn = n^{−1} Σ_{i=1}^n X_i^4,   µ1(θ) = 3 + θ^4 + 6θ^2,   µ2(θ) = 6 + · · ·

[Boxplots of the fourth-moment statistic under models M1 and M2, n = 1000]

Page 132: Pittsburgh and Toronto "Halloween US trip" seminars

Toy example: Laplace versus Gauss [0]

When T(y) = (ȳ_n^(4), ȳ_n^(6)) (the empirical fourth and sixth moments)

and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗1) = 6 with θ∗1 = 2√3 − 3
[d1 = 1 and d2 = 1/2]

thus B12 ∼ n^{−1/4} → 0 : consistent

Under the Gaussian model µ0 = 3, µ2(θ2) > 6 > 3 ∀θ2

B12 → +∞ : consistent


Page 134: Pittsburgh and Toronto "Halloween US trip" seminars

Toy example: Laplace versus Gauss [0]

When T(y) = (ȳ_n^(4), ȳ_n^(6))

[Boxplots of the ABC posterior probabilities, Gauss vs Laplace, based on the fourth and sixth moments, for n = 100, 1000 and 10000]

Page 135: Pittsburgh and Toronto "Halloween US trip" seminars

Checking for adequate statistics

Run a practical check of the relevance (or non-relevance) of Tn

null hypothesis that both models are compatible with the statistic Tn

H0 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} = 0

against

H1 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} > 0

testing procedure provides estimates of the mean of Tn under each model and checks for equality

Page 136: Pittsburgh and Toronto "Halloween US trip" seminars

Checking in practice

I Under each model Mi, generate an ABC sample θi,l, l = 1, · · · , L

I For each θi,l, generate yi,l ∼ Fi,n(·|θi,l), derive Tn(yi,l) and compute

µ̂i = (1/L) Σ_{l=1}^L Tn(yi,l),  i = 1, 2.

I Conditionally on Tn(y),

√L (µ̂i − Eπ[µi(θi) | Tn(y)]) ⇝ N(0, Vi),

I Test for a common mean

H0 : µ̂1 ∼ N(µ0, V1), µ̂2 ∼ N(µ0, V2)

against the alternative of different means

H1 : µ̂i ∼ N(µi, Vi), with µ1 ≠ µ2.
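A minimal sketch of this two-sample check as a Wald-type test; the χ²-based form is an assumption in the spirit of the normalised χ² shown next, and the inputs are the Tn values simulated under each model.

```python
import numpy as np
from scipy import stats

def check_common_mean(T1, T2):
    """T1, T2: (L, d) arrays of T_n(y_{i,l}) simulated under models 1 and 2,
    with theta_{i,l} drawn from the respective ABC samples.
    Wald-type test of H0: the two means are equal (common mu_0)."""
    L, d = T1.shape
    mu1, mu2 = T1.mean(axis=0), T2.mean(axis=0)
    V1 = np.atleast_2d(np.cov(T1, rowvar=False)) / L
    V2 = np.atleast_2d(np.cov(T2, rowvar=False)) / L
    diff = mu1 - mu2
    chi2 = float(diff @ np.linalg.solve(V1 + V2, diff))   # ~ chi^2_d under H0
    return chi2, stats.chi2.sf(chi2, df=d)
```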

Page 137: Pittsburgh and Toronto "Halloween US trip" seminars

Toy example: Laplace versus Gauss

[Boxplots of the normalised χ² statistic for the Gauss and Laplace models, without and with mad]


Page 139: Pittsburgh and Toronto "Halloween US trip" seminars

A population genetic illustration

Two populations (1 and 2) having diverged at a fixed known time in the past and a third population (3) which diverged from one of those two populations (models 1 and 2, respectively).

Observation of 50 diploid individuals/population genotyped at 5, 50 or 100 independent microsatellite loci.

Stepwise mutation model: the number of repeats of the mutated gene increases or decreases by one. Mutation rate µ common to all loci set to 0.005 (single parameter) with uniform prior distribution

µ ∼ U[0.0001, 0.01]

Page 140: Pittsburgh and Toronto "Halloween US trip" seminars

A population genetic illustration

Summary statistics associated with the (δµ)² distance

xl,i,j repeat number of the allele at locus l = 1, . . . , L for individual i = 1, . . . , 100 within population j = 1, 2, 3. Then

(δµ)²_{j1,j2} = (1/L) Σ_{l=1}^L [ (1/100) Σ_{i1=1}^{100} x_{l,i1,j1} − (1/100) Σ_{i2=1}^{100} x_{l,i2,j2} ]².
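A direct transcription of this statistic (the array layout is an assumption):

```python
import numpy as np

def delta_mu_sq(x, j1, j2):
    """x: array of shape (L, 100, 3) holding the repeat numbers x_{l,i,j}
    for L loci, 100 individuals and 3 populations; returns (delta mu)^2_{j1,j2}."""
    diff = x[:, :, j1].mean(axis=1) - x[:, :, j2].mean(axis=1)  # per-locus mean difference
    return float(np.mean(diff ** 2))                            # average over the L loci
```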

Page 141: Pittsburgh and Toronto "Halloween US trip" seminars

A population genetic illustration

For two copies of locus l with allele sizes x_{l,i,j1} and x_{l,i′,j2}, most recent common ancestor at coalescence time τ_{j1,j2}, gene genealogy distance of 2τ_{j1,j2}, hence number of mutations Poisson with parameter 2µτ_{j1,j2}. Therefore,

E{ (x_{l,i,j1} − x_{l,i′,j2})² | τ_{j1,j2} } = 2µτ_{j1,j2}

and

                  Model 1    Model 2
E{(δµ)²_{1,2}}    2µ1 t′     2µ2 t′
E{(δµ)²_{1,3}}    2µ1 t      2µ2 t
E{(δµ)²_{2,3}}    2µ1 t′     2µ2 t

Page 142: Pittsburgh and Toronto "Halloween US trip" seminars

A population genetic illustration

Thus,

I Bayes factor based only on distance (δµ)²_{1,2} not convergent: if µ1 = µ2, same expectation

I Bayes factor based only on distance (δµ)²_{1,3} or (δµ)²_{2,3} not convergent: if µ1 = 2µ2 or 2µ1 = µ2, same expectation

I if two of the three distances are used, Bayes factor converges: there is no (µ1, µ2) for which all expectations are equal

Page 143: Pittsburgh and Toronto "Halloween US trip" seminars

A population genetic illustration

[Boxplots of the posterior probabilities that the data is from model 1, for 5, 50 and 100 loci, based on (δµ)²(1,2), on (δµ)²(1,3), and on (δµ)²(1,3) & (δµ)²(2,3)]

Page 144: Pittsburgh and Toronto "Halloween US trip" seminars

A population genetic illustration

[Boxplots for (δµ)²(1,2) and for (δµ)²(1,3) & (δµ)²(2,3): test statistics (left) and p-values (right)]


Page 146: Pittsburgh and Toronto "Halloween US trip" seminars

Embedded models

When M1 is a submodel of M2, and if the true distribution belongs to the smaller model M1, the Bayes factor is of order

v_n^{−(d1−d2)}

If the summary statistic is only informative on a parameter that is the same under both models, i.e. if d1 = d2, then

c© the Bayes factor is not consistent

Page 147: Pittsburgh and Toronto "Halloween US trip" seminars

Embedded models

When M1 is a submodel of M2, and if the true distribution belongs to the smaller model M1, the Bayes factor is of order

v_n^{−(d1−d2)}

Else, d1 < d2 and the Bayes factor is going to ∞ under M1. If the true distribution is not in M1, then

c© Bayes factor is consistent only if µ1 ≠ µ2 = µ0

Page 148: Pittsburgh and Toronto "Halloween US trip" seminars

Summary

I Model selection feasible with ABC

I Choice of summary statistics is paramount

I At best, ABC → π(. | T(y)) which concentrates around µ0

I For consistent estimation: {θ; µ(θ) = µ0} = θ0

I For consistent testing: {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅


Page 150: Pittsburgh and Toronto "Halloween US trip" seminars

Conclusion/perspectives

I abc part of a wider picture to handle complex/Big Data models, able to start from rudimentary machine learning summaries

I many formats of empirical [likelihood] Bayes methods available

I lack of comparative tools and of an assessment for information loss