Bayes and Classical Approaches: A Comparative Overview

Econ 690

Purdue University

December 23, 2011

Justin L. Tobias (Purdue) Comparative Overview December 23, 2011 1 / 46

Outline

1 Review

2 Exchange Paradox

3 Example #1

4 exchangeability

5 The Likelihood Principle
  Example #1
  Example #2

Review

The Classical (Frequentist) Approach

For both the Bayesian and the frequentist, the object of interest is denoted θ ∈ Θ ⊆ Rk, a parameter which characterizes some feature of the population at large (e.g., the average height of men in the U.S.).

A sample of data is collected, which must then be used to shed light on the value of θ.

A variety of estimators - rules taking the data y and mapping that data into a “guess” regarding the value of θ - are possible. The performance of an estimator is evaluated by the frequentist by analyzing its sampling distribution.


“Frequentist econometrics is tied to the notion of sample and population” (Lancaster, 2004, page 360). A “good” estimator is one which is “close” to θ, where “closeness” is measured by its performance averaged over hypothetical repeated sampling from the population.

Therefore, although a given estimate obtained by a particular researcher may be far away from θ, he or she could take comfort in the fact that other researchers, obtaining different samples from the population of interest, and making use of the same estimator, will obtain estimates that are “close” to θ, at least on average.


To illustrate, consider an unbiased estimator with a continuous sampling distribution.

For any sample of data, the realized estimate can never equal the true parameter of interest θ. Thus, ex post (i.e., after seeing the data), we know that the estimate is not “good” in the sense that, whenever the estimator is applied, there is no chance of realizing an estimate equal to the true parameter θ.

However, viewed according to the ex ante criterion of unbiasedness, the estimator is “close” to the true parameter θ in the sense that if we could apply our estimator over and over again, on average, we would get the “correct” θ.
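This ex ante property is easy to see in a short simulation (a sketch, not from the slides: it assumes a normal population with true mean θ = 5 and uses the sample mean as the estimator). No single realized estimate hits θ exactly, yet the estimates average out to θ across hypothetical repeated samples.

```python
import random

random.seed(42)
theta = 5.0          # true population mean (assumed for illustration)
n, reps = 25, 20000  # sample size and number of hypothetical repeated samples

estimates = []
for _ in range(reps):
    sample = [random.gauss(theta, 2.0) for _ in range(n)]
    estimates.append(sum(sample) / n)   # the estimator: the sample mean

# Ex post: a realized estimate essentially never equals theta exactly,
# since the estimator's sampling distribution is continuous.
print(any(e == theta for e in estimates))
# Ex ante: averaged over repeated samples, the estimator is centered at theta.
print(round(sum(estimates) / reps, 1))  # approximately 5.0
```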


The Bayesian Approach

Unlike the frequentist point of view, the Bayesian perspective provides no role for data that could have been observed, but were not. That is, its focus is ex post - conditioning on the observed data y - rather than ex ante, which involves averaging over the sampling distribution of y|θ.

In fairness, the Bayesian approach requires more inputs, and must specify the full joint distribution of the observables y that become known under sampling and the unobservables θ. This joint distribution is typically decomposed as

p(y , θ) = p(y |θ)p(θ),

into the likelihood p(y|θ) [which, viewed as a function of θ for given y, is denoted L(θ)] and a prior p(θ).


The demands required of the model specification open the Bayesian up to criticism regarding sensitivity of results to the specification of the likelihood and/or the prior. In frequentist econometrics, far less is required - no prior is needed and, in many cases, a complete specification of the sampling distribution p(y|θ) is not necessary; only moment conditions are assumed to hold.

However, in such cases, approximations must be used to characterize the sampling distribution of the estimator, since the model is not rich enough to allow for analytic calculation of this distribution. Thus, flexibility comes at a cost.


Bayesians regard the necessary additional structure as a worthy investment: by construction, Bayesian inference provides exact finite sample results.

In many cases, flexible representations of the likelihood (allowing for fat tails, skewness or multimodality) can go a long way toward easing concerns regarding its specification.

In sufficiently large samples, the influence of the prior generally vanishes, and it is typically wise to conduct a sensitivity analysis over alternate prior choices. (For testing purposes, the prior is far more important.)


Rather than having a variety of competing estimators for the same problem, the Bayesian “estimator” is unique - all information regarding θ is summarized in the posterior distribution p(θ|y). Different features of this posterior can be used for point estimates, e.g., the posterior mean or the posterior mode, depending on the relevant loss function.

Nuisance parameters can cause headaches for frequentist econometrics. In the Bayesian framework, they are handled intuitively and consistently across various models, by simply integrating them out of the posterior distribution.


Asymptotic Comparisons

In finite dimensional models with standard regularity conditions, the following are known:

Asymptotic Interval Agreement: there exist regions Rn such that

∫_{Rn} p(θ|yn) dθ = 1 − α

and Rn has frequentist coverage 1 − α + O(n^{−1}).


Exponential Family Example

Consider a set of scalar independent and identically distributed (iid) observations, denoted here by y1, y2, . . . , yn, from the natural exponential family of distributions:

p(yi |θ) = h(yi ) exp [θyi − g(θ)] , i = 1, 2, . . . , n,

where the functions h and g are known and yn will be used to denote the full vector of n observations.


Exponential Family Example: Sampling Theory

Provided one can interchange the order of integration and differentiation, differentiating the identity ∫ p(y|θ) dy = 1 with respect to θ yields

∫ pθ(y|θ) dy = ∫ [y − gθ(θ)] p(y|θ) dy = 0,

where pθ(y|θ) ≡ ∂p(y|θ)/∂θ and gθ(θ) is defined similarly.

This immediately implies E(y) = gθ(θ0).


By extension,

Var(y) = gθθ(θ0),

where gθθ(·) is analogously defined as the second derivative.

It is straightforward to show that the MLE is determined by solving

gθ(θ̂n) = ȳn,

where ȳn ≡ n^{−1} ∑_{i=1}^{n} yi and θ̂n represents the MLE.


Under suitable regularity conditions, we obtain the following asymptotic approximation to the sampling distribution of ψ̂n ≡ ȳn, the MLE of ψ ≡ gθ(θ):

√n (ψ̂n − ψ0) →d N(0, gθθ(θ0)),

from which large-sample confidence intervals can be constructed.


Bayesian Inference

Consider the prior

p(θ|a, b) ∝ exp[aθ − bg(θ)].

Using the same type of argument, it is straightforward to show, provided the moment exists, that

E(ψ) ≡ μψ = a/b,

with ψ ≡ gθ(θ).


Combining the likelihood and prior, we obtain via Bayes Theorem:

p(θ|yn) ∝ exp [(nȳn + a)θ − (n + b)g(θ)].

It therefore follows that

E(ψ|yn) = (nȳn + a)/(n + b),

which we write equivalently as

E(ψ|yn) = ωn ȳn + (1 − ωn) μψ, where ωn ≡ n/(n + b) and μψ = a/b.


The previous equation immediately reveals that the posterior mean is a weighted average of the sample mean (the MLE) and the prior mean μψ.

Since ωn → 1, we see that the Bayesian posterior mean and the frequentist MLE in this case are asymptotically equivalent.
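The shrinkage can be made concrete with a short script (a sketch, not from the slides: the Poisson member of this family, g(θ) = exp(θ) so that ψ = exp(θ) is the Poisson mean, and the values a = 4, b = 2 are assumed for illustration; in that case the prior above corresponds to a Gamma(a, b) density for ψ, with prior mean μψ = a/b).

```python
# Posterior mean of psi in the natural exponential family with the
# conjugate prior p(theta | a, b): E(psi | y) = (n*ybar + a) / (n + b).
a, b = 4.0, 2.0          # assumed prior hyperparameters
mu_psi = a / b           # prior mean of psi

def posterior_mean(ybar, n):
    return (n * ybar + a) / (n + b)

def omega(n):
    return n / (n + b)

ybar = 3.0               # sample mean (the MLE of psi in this family)
for n in [1, 10, 100, 10000]:
    w = omega(n)
    pm = posterior_mean(ybar, n)
    # Shrinkage identity: posterior mean = w*ybar + (1 - w)*mu_psi.
    assert abs(pm - (w * ybar + (1 - w) * mu_psi)) < 1e-12
    print(n, round(w, 4), round(pm, 4))
# As n grows, omega_n -> 1 and the posterior mean approaches ybar, the MLE.
```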


In order to conduct inference, for which finite sample results are seldom available, the frequentist typically relies on asymptotic approximations to the sampling distribution of the estimator.

But under this large sample metric, the sampling distribution of the Bayes rule is identical, suggesting that, according to his or her own recipe for inference, the prior does not matter.

To question the role of the prior, then, from a sampling theory perspective is to introduce the importance of finite sample results; otherwise concerns about the prior must be misplaced and unwarranted, as the Bayes rule simply represents a different estimator with identical large sample sampling properties.


Exchange Paradox

Another reason why one may pursue a Bayesian approach is that prior information is available (and should be used), while not using such information can lead to undesirable results.

An example of this is the Exchange Paradox (or “two envelope problem”): see, e.g., Christensen and Utts (1992, American Statistician).

An amount of money m is placed in one envelope and 2m is placed in another. One of the envelopes is randomly handed to a player and then opened. The player is permitted to keep the prize or, if desired, choose the other envelope.


The agent opens the envelope, revealing an amount x .

He reasons that the other envelope either contains x/2 or 2x .

Since the envelope was distributed randomly, the chance of each is 1/2.

His expected return from switching is

(1/2)(x/2) + (1/2)(2x) = 5x/4 > x,

so he switches.

However, your “opponent” thinks the same way and switches as well. How can this be reasonable?


A solution to this problem notes that the agent probably has some beliefs over the distribution of prize money, and this should be (and probably is) taken into account when making a decision.

Formally, let M be the prize placed in the first envelope, and X be the (ex ante) random amount placed in your envelope. Upon seeing X = x in your envelope, M = x or M = x/2.

The sampling distribution is

Pr(X = m|M = m) = 1/2, Pr(X = 2m|M = m) = 1/2.


Letting p(·) denote the prior mass function over the amount M placed in the first envelope, Bayes Theorem gives:

Pr(M = x|X = x) = (1/2)p(x) / [(1/2)p(x) + (1/2)p(x/2)] = p(x)/[p(x) + p(x/2)].

Likewise,

Pr(M = x/2|X = x) = p(x/2)/[p(x) + p(x/2)].


If the agent keeps the envelope, he gets x with certainty. If he switches, he gets 2x with probability p(x)/[p(x) + p(x/2)] and x/2 with probability p(x/2)/[p(x) + p(x/2)].

Therefore, the expected return from switching is:

[2x p(x) + (x/2) p(x/2)] / [p(x) + p(x/2)].


It is straightforward to show that, when 2p(x) = p(x/2), the expected return from switching is x. So, when

2p(x) > p(x/2),

it is optimal for the agent to switch. Perhaps most importantly, when will the expected return equal 5x/4?
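A small exact computation makes the resolution concrete (a sketch under an assumed family of priors, not from the slides: prize amounts m = 2^k with geometrically decaying weights q^k). Switching is profitable exactly when 2p(x) > p(x/2), which here amounts to q > 1/2; and the naive 5x/4 return arises only when p(x) = p(x/2), which no proper prior can deliver for every x.

```python
from fractions import Fraction

def expected_other(x_exp, q):
    """E[other envelope | X = 2**x_exp] under the assumed prior
    Pr(M = 2**k) proportional to q**k, k = 0, ..., K (truncated)."""
    K = 40
    prior = {k: Fraction(q) ** k for k in range(K + 1)}   # unnormalized
    x = Fraction(2) ** x_exp
    w_keep = prior.get(x_exp, Fraction(0))      # M = x   -> other holds 2x
    w_half = prior.get(x_exp - 1, Fraction(0))  # M = x/2 -> other holds x/2
    total = w_keep + w_half
    return (2 * x * w_keep + (x / 2) * w_half) / total

# q = 3/5 (> 1/2): 2p(x) > p(x/2), so switching beats keeping x = 32.
assert expected_other(5, Fraction(3, 5)) > Fraction(32)
# q = 2/5 (< 1/2): 2p(x) < p(x/2), so keeping is better.
assert expected_other(5, Fraction(2, 5)) < Fraction(32)
# q = 1/2: 2p(x) = p(x/2) exactly, and the expected return is x itself.
assert expected_other(5, Fraction(1, 2)) == Fraction(32)
print("switching pays only when 2p(x) > p(x/2)")
```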

Example #1

Ex Ante vs. Ex Post: An Example

This example is constructed to illustrate the differences between the ex ante and ex post perspectives. (It also helps to clarify the difference between a Bayesian posterior interval and a frequentist confidence interval.)


Given an unknown parameter θ, −∞ < θ < ∞, suppose Yi (i = 1, 2) are iid binary random variables with probability equally distributed over the points of support θ − 1 and θ + 1. That is,

1 Pr(Yi = θ − 1|θ) = 1/2 and

2 Pr(Yi = θ + 1|θ) = 1/2.

Suppose the prior for θ is constant.



(a) Suppose we observe y1 ≠ y2. Find the posterior distribution of θ.

(b) Suppose we observe y1 = y2. Find the posterior distribution of θ.

(c) Consider the estimator

θ̂ = (1/2)(Y1 + Y2) if Y1 ≠ Y2,
θ̂ = Y1 − 1 if Y1 = Y2.

What is the ex ante probability that this estimator contains (i.e., equals) θ?



(a) If y1 ≠ y2, then one of the values must equal θ − 1 and the other equals θ + 1. Ex post, averaging these two values, it is absolutely certain that θ = (1/2)(y1 + y2), i.e.,

Pr(θ = (y1 + y2)/2 | y1, y2) = 1.

(b) If y1 = y2 = y, say, then the common value y is either θ − 1 or θ + 1. Since the prior does not distinguish among values of θ, ex post it is equally uncertain whether θ = y − 1 or θ = y + 1, i.e.,

Pr(θ = y − 1 | y1, y2) = Pr(θ = y + 1 | y1, y2) = 1/2.


(c) Parts (a) and (b) suggest the ex post (i.e., posterior) probability that our estimator equals θ is either 1 or 1/2, depending on whether y1 ≠ y2 or y1 = y2.

From the ex ante perspective, however,

Pr(θ̂ = θ|θ) = Pr(Y1 ≠ Y2|θ) + Pr(Y1 = Y2 = θ + 1|θ) = 1/2 + 1/4 = 3/4.

The difficult question for the pure frequentist is: why use the realized data to estimate θ and then report the ex ante confidence level 75% instead of the appropriate ex post measure of uncertainty?
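The 75% figure is easy to reproduce by simulation (a sketch with an assumed true value θ = 10, which is not part of the slides): ex ante the estimator equals θ in about three quarters of repeated samples, yet ex post it is right with certainty when y1 ≠ y2 and only half the time when y1 = y2.

```python
import random

random.seed(0)
theta = 10       # true parameter value, assumed for the simulation
reps = 100000

hits = 0
hits_unequal, trials_unequal = 0, 0
hits_equal, trials_equal = 0, 0
for _ in range(reps):
    y1 = theta + random.choice([-1, 1])
    y2 = theta + random.choice([-1, 1])
    est = (y1 + y2) / 2 if y1 != y2 else y1 - 1
    hit = (est == theta)
    hits += hit
    if y1 != y2:
        trials_unequal += 1
        hits_unequal += hit
    else:
        trials_equal += 1
        hits_equal += hit

print(round(hits / reps, 2))                 # ex ante coverage: about 0.75
print(hits_unequal / trials_unequal)         # ex post, y1 != y2: exactly 1.0
print(round(hits_equal / trials_equal, 2))   # ex post, y1 == y2: about 0.5
```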

exchangeability

exchangeability and de Finetti

Consider a sequence of Bernoulli random variables x1, x2, · · · , xn and the joint probability

p(x1, x2, . . . , xn).

Suppose, additionally, that you feel the labels 1, 2, · · · , n are uninformative in the sense that all of the marginal distributions of the random quantities (and marginals of pairs, triples, etc.) are the same. Such a belief implies that

p(x1, . . . , xn) = p(xπ(1), . . . , xπ(n)),

where π is an operator that permutes the labels i = 1, . . . , n.

If the equation above holds for any permutation operator π, then the random quantities x1, x2, · · · , xn are regarded as exchangeable.


What would NOT be an exchangeable sequence?

Consider the case of “streaks” in sports, where a team that has just won its previous game is more likely to win the next and, conversely, a team that has just lost a game is more likely to lose the next. In this case,

p(1, 1, 1, 0, 0, 0)

would not be believed to equal

p(1, 0, 1, 0, 1, 0)

and thus the joint probability is not preserved under permutation. Thus, the sequence would not be regarded as exchangeable.


What would NOT be an exchangeable sequence?

[Bernardo and Smith (2000)] Suppose that x1, . . . , xn denote measurements of a particular substance, all made by the same individual using the same process and equipment. In this case, it may be reasonable to assume that the sequence of measurements is exchangeable.

If, however, the measurements are taken by a series of different labs, then one might be skeptical of assuming exchangeability of the entire sequence. One might, however, still be comfortable with assuming that exchangeability is appropriate for the sequences within each lab. Such beliefs are regarded as partially (or conditionally) exchangeable.


exchangeability and de Finetti: Example

Our common assumption of iid data is a stronger condition than assuming the data are exchangeable.

To see this, we need to show the equivalence of p(x1, x2, . . . , xn) and p(xπ(1), xπ(2), . . . , xπ(n)) when the data are iid:

p(x1, x2, . . . , xn) = ∏_{j=1}^{n} Pr(Xj = xj)
= ∏_{j=1}^{n} Pr(X = xj)
= Pr(X = xπ(1)) Pr(X = xπ(2)) · · · Pr(X = xπ(n))
= p(xπ(1), xπ(2), . . . , xπ(n)).


Suppose

Y = [Y1 Y2 · · · YT]′ ∼ N(0T, Σ),

where

Σ = (1 − α)IT + α ιι′

(with α restricted so that Σ is positive definite) and ι is a T × 1 vector of ones.

Let π(t) (t = 1, 2, · · · , T) be a permutation of 1, 2, · · · , T and write

[Yπ(1) Yπ(2) · · ·Yπ(T )]′ = AY ,

where A is a T × T selection matrix such that, for t = 1, 2, · · · , T, row t of A consists of all zeros except column π(t), which is unity.

Show that these beliefs are exchangeable but the events are not necessarily independent.


Note that AA′ = IT and Aι = ι. Then,

AY ∼ N(0T, Ω),

where

Ω = AΣA′ = (1 − α)AA′ + α(Aι)(Aι)′ = (1 − α)IT + α ιι′ = Σ.

Hence, beliefs regarding Yt (t = 1, 2, · · · , T) are exchangeable. Despite this exchangeability, if α ≠ 0, Yt (t = 1, 2, · · · , T) are not independent.
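The algebra can be checked numerically in a few lines (a sketch with assumed values T = 5, α = 0.3 and an arbitrary permutation, none of which appear in the slides; the equicorrelated covariance Σ has ones on the diagonal and α off the diagonal):

```python
# Verify that an equicorrelated covariance is invariant under permutation:
# if Sigma = (1 - alpha)*I_T + alpha*iota iota', then A Sigma A' = Sigma.
T, alpha = 5, 0.3                       # assumed values for illustration
perm = [2, 0, 4, 1, 3]                  # an arbitrary permutation pi

Sigma = [[1.0 if i == j else alpha for j in range(T)] for i in range(T)]
# Row t of A is all zeros except a one in column perm[t], so (AY)_t = Y_perm[t].
A = [[1.0 if j == perm[i] else 0.0 for j in range(T)] for i in range(T)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(T)) for j in range(T)]
            for i in range(T)]

At = [[A[j][i] for j in range(T)] for i in range(T)]   # A transpose
Omega = matmul(matmul(A, Sigma), At)

# The permuted vector AY has exactly the same covariance matrix as Y.
assert all(abs(Omega[i][j] - Sigma[i][j]) < 1e-12
           for i in range(T) for j in range(T))
print("A Sigma A' equals Sigma: the permuted beliefs match the original")
```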


de Finetti’s Representation Theorem (Loosely)

Suppose that x1, x2, . . . is an infinitely exchangeable sequence of Bernoulli random variables (every finite subsequence is exchangeable). Then the joint probability distribution p(x1, x2, · · · , xn) must take the form:

p(x1, x2, . . . , xn) = ∫₀¹ [ ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} ] dQ(θ)

for some distribution function Q(θ).
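To see the representation at work, take Q to be a Beta(a, b) distribution (an assumed choice for illustration; it is not specified in the slides). The integral then has the closed form B(a + s, b + n − s)/B(a, b), where s = ∑ xi, so the joint probability depends on the sequence only through its number of successes, which is exactly exchangeability.

```python
from math import gamma
from itertools import product

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def definetti_prob(xs, a=2.0, b=2.0):
    """p(x1,...,xn) = integral of prod theta^xi (1-theta)^(1-xi) dQ(theta)
    with Q = Beta(a, b); the mixture reduces to a Beta-function ratio."""
    n, s = len(xs), sum(xs)
    return beta_fn(a + s, b + n - s) / beta_fn(a, b)

# Two orderings with the same number of successes get the same probability:
p1 = definetti_prob([1, 1, 1, 0, 0, 0])
p2 = definetti_prob([1, 0, 1, 0, 1, 0])
assert abs(p1 - p2) < 1e-15

# ...and the 2^6 sequence probabilities sum to one, as a joint pmf must:
total = sum(definetti_prob(list(x)) for x in product([0, 1], repeat=6))
assert abs(total - 1.0) < 1e-12
print(round(p1, 5))
```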


This result is profound from the viewpoint of subjectivist probability modeling. It:

1 “justifies” thinking about the data as being conditionally independent Bernoulli random variables (conditioned on a parameter θ).

2 provides an explicit justification for the “prior” Q(θ).

3 shows that the assumption of exchangeability implies the Bayesian viewpoint of likelihood times prior.


The Likelihood Principle

Definition

The likelihood principle (loosely) asserts that if two experiments yield proportional likelihood functions for realized data y, i.e., L1(θ) = cL2(θ), where c does not depend on θ, then (under the same prior) identical inferences should be obtained about θ.

This is a controversial statement (see, e.g., Robins and Wasserman, JASA, 2000), although it follows from two seemingly reasonable conditions: conditionality and (weak) sufficiency.

In the Bayesian paradigm, this is not an imposed “principle,” but rather follows as a consequence of Bayes Theorem.


To see this, consider two experiments with realized data y1 and y2, which yield proportional likelihood functions L1(θ) and L2(θ), respectively.

The posterior under experiment 1 is:

p(θ|y1) = p(θ)L1(θ) / ∫ p(θ)L1(θ) dθ = p(θ)cL2(θ) / ∫ p(θ)cL2(θ) dθ = p(θ|y2).

Thus, upon normalization, it is seen that the posteriors under each experiment are identical, and thus identical inferences are obtained.


In frequentist calculations, proportional likelihoods may not yield identical inferences, since the Classical paradigm provides an explicit role for data that could have been observed, but were not.

The following simple example illustrates this point.

The Likelihood Principle Example #1


Consider two researchers, A and B.

Researcher A observes a random variable X having the Gamma distribution G(3, θ−1) with density

f(x|θ) = [θ³/Γ(3)] x² exp(−θx), x > 0.

Researcher B observes a random variable Y having the Poisson distribution Po(2θ) with mass function

f(y|θ) = (2θ)^y exp(−2θ)/y!, y = 0, 1, 2, . . . .

In collecting one observation each, Researcher A observes X = 2 and Researcher B observes Y = 3.


The respective likelihood functions are

LA(θ; x = 2) = [θ³/Γ(3)] · 2² · exp(−2θ) ∝ θ³ exp(−2θ),

LB(θ; y = 3) = (2θ)³ exp(−2θ)/3! ∝ θ³ exp(−2θ).
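A quick numerical check of proportionality (a sketch using only the two likelihoods above; the grid resolution is an arbitrary choice): the ratio LA/LB is constant in θ, and both functions are maximized at θ = 1.5.

```python
from math import exp, gamma

def L_A(theta):   # Gamma G(3, 1/theta) likelihood evaluated at x = 2
    return (theta ** 3 / gamma(3)) * 2 ** 2 * exp(-2 * theta)

def L_B(theta):   # Poisson Po(2*theta) likelihood evaluated at y = 3
    return (2 * theta) ** 3 * exp(-2 * theta) / 6

grid = [k / 1000 for k in range(1, 5001)]          # theta in (0, 5]
ratios = {round(L_A(t) / L_B(t), 10) for t in grid}
assert len(ratios) == 1          # proportional: L_A = c * L_B with c = 3/2
mle_A = max(grid, key=L_A)
mle_B = max(grid, key=L_B)
assert mle_A == mle_B == 1.5     # identical maximum likelihood point estimates
print("c =", ratios.pop(), " MLE =", mle_A)
```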


Since these likelihood functions are proportional, the evidence in the data about θ is the same in both data sets.

Because the priors and loss functions are the same, from the Bayesian standpoint the two researchers would obtain the same posterior distributions and, therefore, make identical inferences.

From the standpoint of maximum likelihood estimation, since the likelihoods are proportional, the maximum likelihood point estimates would be the same: θ̂ = 1.5.

However, the maximum likelihood estimators would be very different:

In the case of Researcher A, the sampling distribution would be continuous, whereas in the case of Researcher B the sampling distribution would be discrete.

The Likelihood Principle Example #2


A second (and more widely cited) example also emphasizes this point.

Suppose researcher A decides to flip a coin n times, and observes l successes (heads) during these n trials.

The likelihood function in this case is the standard binomial pmf:

L(θ) = C(n, l) θ^l (1 − θ)^{n−l},

where C(n, l) denotes the binomial coefficient.


Now, suppose that researcher B decides to flip a coin until she observes l heads. It happens that it takes her exactly n trials to observe l heads.

In this case, the appropriate likelihood is the negative binomial distribution, which describes the distribution over the number of trials (n) necessary to achieve l successes.

Since these two likelihoods are proportional, from the Bayesian point of view these two researchers will obtain identical posterior inferences provided they employ the same prior.


However, inference in the classical case regarding θ will differ across the two experiments. Specifically, suppose n = 12, l = 9. Then, in the binomial case:

Pr(X ≥ 9|H0 : θ = 1/2) = .073.

while in the Negative Binomial case

Pr(X ≥ 9|H0 : θ = 1/2) = .0327.
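Both tail probabilities can be reproduced directly (a sketch, not from the slides; note the assumption that the .0327 figure corresponds to the classic telling of this example, as in Lindley and Phillips (1976), in which the negative binomial sampling stops at the third tail, so the p-value sums the probabilities of seeing 9 or more heads before that third tail):

```python
from math import comb

# Binomial sampling: n = 12 fixed; P(X >= 9 heads | theta = 1/2).
p_binomial = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12

# Negative binomial sampling: flip until the 3rd tail (assumed stopping
# rule matching the .0327 figure); P(9 or more heads | theta = 1/2).
# P(k heads before the 3rd tail) = C(k + 2, 2) * (1/2)**(k + 3).
p_negbin = sum(comb(k + 2, 2) * 0.5 ** (k + 3) for k in range(9, 500))

print(round(p_binomial, 3))   # 0.073
print(round(p_negbin, 4))     # 0.0327
```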

In fact, the classical econometrician cannot make inferences about θ without knowing what the researcher intended to do before conducting the trials!

Depending on the intention of the researcher, you either reject (negative binomial) or fail to reject (binomial).
