Basics Loss Functions Quadratic Loss Asymmetric Linear Loss All-or-nothing Loss Example
Bayesian Point Estimation
Econ 690
Purdue University
Justin L. Tobias Point Estimation
Outline
1 Basic Framework
2 Common Loss Functions
3 Point Estimation Under Quadratic Loss
4 Point Estimation Under Asymmetric Linear Loss
5 Point Estimation Under All-or-Nothing Loss
6 Example
Framework
The Bayesian begins by specifying a loss (cost) function C(θ̂, θ). This is:
1 A nonnegative function satisfying C(θ, θ) = 0.
2 A function that measures the consequences of using θ̂ = θ̂(Y) (i.e., a particular function of the data) as an estimate when the “state of nature” is θ.
3 Usually, C(θ̂, θ) is a non-decreasing function of the (absolute value of the) sampling error θ̂ − θ.
4 Whether or not to impose symmetry of the sampling error around zero is at the discretion of the researcher and will depend on the problem at hand.
A good estimate is one which “minimizes” C(θ̂, θ) in some sense, but its randomness must first be eliminated.
From the frequentist sampling theory point of view, θ is non-stochastic, but C(θ̂, θ) is nonetheless stochastic because the estimator θ̂ = θ̂(Y) is a random variable.
An obvious way to circumvent the randomness of C(θ̂, θ) is to focus attention on its expected value. Frequentists therefore consider the Risk function:
R(θ̂|θ) = E_{Y|θ}[C(θ̂(Y), θ)],
where the expectation (assumed to exist) is taken with respect to the sampling density p(y|θ).
In contrast, the Bayesian perspective is entirely ex post (i.e., it conditions on the observed data y).
That is, the Bayesian uses θ̂ = θ̂(y) as a point estimate of the unknown parameter θ.
The Bayesian solution to the randomness of the loss function is similar to the frequentist solution: take its expectation before minimization.
The expectation, however, is with respect to the posterior distribution θ|y, and not the sampling distribution y|θ used to obtain the risk function.
The Bayesian point estimate is defined as the solution (assuming the expectation exists) to the following problem:
θ̂ = argmin_{θ̂} E_{θ|y}[C(θ̂, θ)],
where
E_{θ|y}[C(θ̂, θ)] = ∫ C(θ̂, θ) p(θ|y) dθ.
Most importantly, note that the posterior expectation removes θ from the criterion function, unlike the case of the risk function R(θ̂|θ).
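In practice, with draws from the posterior in hand, this minimization can be carried out numerically. A minimal sketch (the normal posterior and the quadratic loss below are illustrative placeholders, not part of these slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior draws theta ~ p(theta|y); here N(2, 0.5^2) for illustration.
draws = rng.normal(loc=2.0, scale=0.5, size=100_000)

def expected_loss(theta_hat, cost):
    # Monte Carlo approximation of E_{theta|y}[ C(theta_hat, theta) ].
    return np.mean(cost(theta_hat, draws))

quadratic = lambda th, theta: (th - theta) ** 2

# Grid search for the point estimate minimizing posterior expected loss.
grid = np.linspace(0.0, 4.0, 2001)
theta_hat = grid[int(np.argmin([expected_loss(th, quadratic) for th in grid]))]

# Under quadratic loss, the minimizer is (up to grid resolution) the posterior mean.
print(theta_hat, draws.mean())
```

The same grid-search scaffold works for any of the loss functions introduced below; only the `cost` function changes.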
Also note that if the researcher is interested only in a subset θ1 of the parameter vector θ = [θ1, θ2]′, this preference can be reflected in the loss function specification: C(θ̂, θ) = C(θ̂1, θ1). In this case:
E_{θ|y}[C(θ̂1, θ1)] = ∫∫ C(θ̂1, θ1) p(θ1, θ2|y) dθ2 dθ1 = ∫ C(θ̂1, θ1) p(θ1|y) dθ1.
Thus, nuisance parameters are simply marginalized out of the problem.
Popular Loss (Cost) Functions
Consider the case of a single parameter of interest θ. Let c, c1 and c2 be known constants.
The loss function
C(θ̂, θ) = c(θ̂ − θ)²
is known as a quadratic loss function.
The loss function
C(θ̂, θ) = c1|θ̂ − θ| if θ̂ ≤ θ, c2|θ̂ − θ| if θ̂ > θ
is known as an asymmetric linear loss function. If c1 = c2, this becomes a symmetric linear loss function, or the absolute loss function.
Given constants c > 0 and d > 0, the loss function
C(θ̂, θ) = c if |θ̂ − θ| > d, 0 if |θ̂ − θ| ≤ d
is known as an all-or-nothing loss function over |θ̂ − θ| ≤ d.
When there are several parameters of interest, the most popular loss functions are of the weighted squared error type:
C(θ̂, θ) = (θ̂ − θ)′Q(θ̂ − θ),
where Q is a positive definite matrix.
Loss Functions, Plotted
[Figure: quadratic, linear, and all-or-nothing loss functions plotted against the sampling error over [−2, 2].]
Point Estimation under (Weighted) Squared Error Loss
Consider the cost function
C(θ̂, θ) = (θ̂ − θ)′Q(θ̂ − θ).
Adding and subtracting E(θ|y), we can write this as:
C(θ̂, θ) = [θ̂ − E(θ|y)]′Q[θ̂ − E(θ|y)] + 2[θ̂ − E(θ|y)]′Q[E(θ|y) − θ] + [E(θ|y) − θ]′Q[E(θ|y) − θ].
Noting that only the last two terms of the previous expression involve θ, and that the middle term has posterior expectation zero (since E_{θ|y}[E(θ|y) − θ] = 0), the posterior expected loss is
E_{θ|y}[C(θ̂, θ)] = [θ̂ − E(θ|y)]′Q[θ̂ − E(θ|y)] + E_{θ|y}{[E(θ|y) − θ]′Q[E(θ|y) − θ]}.
Only the first term depends on θ̂, and it is nonnegative since Q is positive definite. Picking θ̂ = E(θ|y) minimizes this expression. Note that θ̂ does not depend on Q.
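A quick numerical check of the claim that the minimizer is the posterior mean regardless of Q (a sketch using an arbitrary illustrative bivariate posterior, not one from these slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative bivariate posterior draws (placeholder for p(theta|y)).
draws = rng.multivariate_normal([1.0, -0.5],
                                [[1.0, 0.3], [0.3, 0.5]], size=50_000)
post_mean = draws.mean(axis=0)

def expected_weighted_loss(theta_hat, Q):
    # E_{theta|y}[ (theta_hat - theta)' Q (theta_hat - theta) ] by Monte Carlo.
    diff = theta_hat - draws
    return np.mean(np.einsum("si,ij,sj->s", diff, Q, diff))

# Two different positive definite weight matrices.
Q1 = np.eye(2)
Q2 = np.array([[2.0, 0.5], [0.5, 1.0]])

# Moving away from the posterior mean raises expected loss under either Q.
bump = post_mean + np.array([0.05, 0.0])
for Q in (Q1, Q2):
    print(expected_weighted_loss(post_mean, Q) < expected_weighted_loss(bump, Q))
```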
Point Estimation Under Asymmetric Linear Loss
C(θ̂, θ) = c1|θ̂ − θ| if θ̂ ≤ θ, c2|θ̂ − θ| if θ̂ > θ.
Before working out the point estimate under this loss function, we need to recall Leibniz’s Rule (differentiation under the integral sign):
∂/∂x ∫_{u(x)}^{v(x)} f(t) dt = f[v(x)] ∂v/∂x − f[u(x)] ∂u/∂x.
Posterior expected loss is
E_{θ|y}[C(θ̂, θ)] = c2 ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|y) dθ + c1 ∫_{θ̂}^{∞} (θ − θ̂) p(θ|y) dθ,
where P(·) denotes the c.d.f. corresponding to p(·).
E_{θ|y}[C(θ̂, θ)] = c2 θ̂ P(θ̂|y) − c2 ∫_{−∞}^{θ̂} θ p(θ|y) dθ − c1 θ̂[1 − P(θ̂|y)] + c1 ∫_{θ̂}^{∞} θ p(θ|y) dθ.
Differentiating the above with respect to θ̂ (i.e., applying Leibniz’s Rule) yields
∂E_{θ|y}[C(θ̂, θ)]/∂θ̂ = −c1 + (c1 + c2)P(θ̂|y).
Equating this expression to zero and solving for θ̂ yields
θ̂ = P⁻¹_{θ|y}(c1/(c1 + c2)),
i.e., the c1/(c1 + c2) posterior quantile of θ.
Note that
If c1 = c2, then θ̂ = P⁻¹_{θ|y}(1/2), so that under absolute loss (symmetric linear loss), the posterior median is the optimal point estimate.
If c2 is large relative to c1, in the sense that c2 = kc1 for large k, then
θ̂ = P⁻¹_{θ|y}[(1 + k)⁻¹],
where the argument inside the inverse c.d.f. is a small number, and thus θ̂ moves into the left tail of p(θ|y). This makes sense since, in this structure, the relative penalty for choosing θ̂ to be “too big” has increased. That is, the penalty for overestimation is far greater than that for underestimation, resulting in a “small” θ̂.
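The quantile result can be verified numerically: the sketch below minimizes the Monte Carlo expected loss over a grid and compares the minimizer to the c1/(c1 + c2) quantile of the draws (the posterior used is an arbitrary placeholder):

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder posterior draws theta ~ p(theta|y).
draws = rng.normal(1.0, 0.4, size=200_000)
c1, c2 = 1.0, 2.0   # overestimation (c2) twice as costly as underestimation

def expected_asym_loss(th):
    # c1 charged when th <= theta (underestimate), c2 when th > theta.
    err = np.abs(th - draws)
    return np.mean(np.where(th <= draws, c1 * err, c2 * err))

grid = np.linspace(-0.5, 2.5, 3001)
theta_hat = grid[int(np.argmin([expected_asym_loss(th) for th in grid]))]

# The minimizer should match the c1/(c1+c2) = 1/3 posterior quantile,
# which lies below the posterior mean here since overestimation is costlier.
print(theta_hat, np.quantile(draws, c1 / (c1 + c2)))
```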
Point Estimation Under All-or-Nothing Loss
C(θ̂, θ) = c if |θ̂ − θ| > d, 0 if |θ̂ − θ| ≤ d.
Expected posterior loss is
E_{θ|y}[C(θ̂, θ)] = c(1 − [P(θ̂ + d|y) − P(θ̂ − d|y)]).
Inspecting this expression, it becomes clear that, for given d, we want the term P(θ̂ + d|y) − P(θ̂ − d|y) to be as large as possible.
Thus, θ̂ is chosen to be the center of an interval of width 2d having maximum posterior probability.
Differentiating expected (posterior) loss with respect to θ̂ yields
∂E_{θ|y}[C(θ̂, θ)]/∂θ̂ = −c[p(θ̂ + d|y) − p(θ̂ − d|y)].
Equating the above to zero and solving for θ̂ implies the end points of this interval have equal posterior density: p(θ̂ + d|y) = p(θ̂ − d|y).
Also note that as d → 0, θ̂ becomes the mode of p(θ|y).
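Both facts can be illustrated numerically, using a deliberately skewed (and purely illustrative) Gamma posterior so that mean, median and mode differ:

```python
import numpy as np
from scipy.stats import gamma

# Skewed illustrative posterior: theta|y ~ Gamma(3, 1), whose mode is 2.0.
a = 3.0
grid = np.linspace(0.0, 10.0, 20001)

def all_or_nothing_estimate(d):
    # theta_hat = center of the width-2d interval with maximal posterior
    # probability P(theta_hat + d | y) - P(theta_hat - d | y).
    prob = gamma.cdf(grid + d, a) - gamma.cdf(grid - d, a)
    return grid[int(np.argmax(prob))]

est_small, est_large = all_or_nothing_estimate(0.01), all_or_nothing_estimate(1.0)

# As d -> 0 the estimate approaches the posterior mode; at the optimum the
# interval's end points have (nearly) equal posterior density.
print(est_small, est_large)
print(gamma.pdf(est_large + 1.0, a), gamma.pdf(est_large - 1.0, a))
```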
Example with Log Wage Data
Consider again our illustrative example with log wage data.
The data set used contains 1,217 observations on three variables: hourly wages, education and a standardized test score.
We consider the model
y_i = β0 + β1 Ed_i + ε_i,
where y is the log of the hourly wage.
We also employ the flat prior
p(β0, β1, σ2) ∝ σ−2.
In the slides related to Bayesian inference in the linear regression model, we showed:
β1|y ∼ t(.0910, [.0066]², 1,215).
Derive the Bayesian point estimates of the return to education under quadratic, asymmetric linear (with c2 = 2c1) and all-or-nothing (as d → 0) losses.
Under quadratic loss, the optimal point estimate is theposterior mean, E (β1|y). Thus,
β̂1 = .091
is the optimal point estimate under this loss function.
Under all-or-nothing loss, as d → 0, the posterior mode is the optimal point estimate. Since the Student-t is symmetric, with mean, median and mode coinciding, it follows that
β̂1 = .091
is again the optimal point estimate under this loss function.
For the asymmetric linear loss function, with c2 = 2c1, we have (using crude notation)
β̂1 = T⁻¹_ν(1/3; .091, [.0066]²).
In other words, we seek a β̂1 such that
T_ν(β̂1; .091, [.0066]²) = 1/3.
[The above notation denotes the Student-t c.d.f. according to the given mean, variance, and degrees of freedom parameter.]
We can write this as the β̂1 such that
Pr((β1 − .091)/.0066 ≤ (β̂1 − .091)/.0066 | y) = 1/3,
or equivalently, as the β̂1 such that
T_ν((β̂1 − .091)/.0066) = 1/3,
with T_ν now denoting the c.d.f. of the standardized Student-t distribution with ν degrees of freedom.
Solving this, we obtain (using the “tinv” command in Matlab)
β̂1 = .091 + .0066 × T⁻¹_ν(1/3) ≈ .091 + .0066 × (−.4308) ≈ .0882.
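The same computation can be sketched in Python, with scipy’s t.ppf playing the role of Matlab’s tinv (this merely re-derives the number above):

```python
from scipy.stats import t

# Posterior for beta_1 | y from the slides: t(.0910, [.0066]^2, 1215).
mean, scale, df = 0.0910, 0.0066, 1215

# Asymmetric linear loss with c2 = 2*c1: take the c1/(c1+c2) = 1/3 quantile.
beta1_hat = mean + scale * t.ppf(1.0 / 3.0, df)
print(round(beta1_hat, 4))
```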
Other Loss Functions
Another possible loss function, though less widely used, is the LINEX loss function (Varian (1975); Zellner (1986 JASA, “Bayesian Estimation and Prediction Using Asymmetric Loss Functions”)):
C(θ̂, θ) = exp(aΔ) − aΔ − 1, where Δ = θ̂ − θ and a ≠ 0 is a known constant.
Note: exp(aΔ) ≈ 1 + aΔ + (1/2)a²Δ², and thus, for a ≈ 0, this behaves much like a quadratic loss function.
For a > 0, the exponential term dominates for large Δ, implying that overestimation is more costly than underestimation. The converse is true when a < 0.
Under the LINEX function, expected posterior loss is:
E_{θ|y}[C(θ̂, θ)] = exp(aθ̂)E_{θ|y}[exp(−aθ)] − a[θ̂ − E(θ|y)] − 1.
Provided the expectation exists, one can show the optimal point estimate is:
θ̂ = −(1/a) ln E_{θ|y}[exp(−aθ)].
Interestingly, under (say) a normal posterior, one can readily show that the posterior mean is inadmissible under LINEX loss (more on this later). Conversely, the LINEX estimator is inadmissible under quadratic loss.
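Under a normal posterior the LINEX estimator has a convenient closed form, θ̂ = μ − aσ²/2, which the following sketch verifies by Monte Carlo (μ, σ and a are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma, a = 1.0, 0.5, 2.0   # illustrative normal posterior and LINEX parameter
draws = rng.normal(mu, sigma, size=400_000)

# LINEX point estimate: theta_hat = -(1/a) * ln E_{theta|y}[ exp(-a*theta) ].
theta_hat = -np.log(np.mean(np.exp(-a * draws))) / a

# For theta|y ~ N(mu, sigma^2): E[exp(-a*theta)] = exp(-a*mu + a^2 sigma^2 / 2),
# so theta_hat = mu - a*sigma^2/2, shifted below the mean when a > 0.
print(theta_hat, mu - a * sigma**2 / 2)
```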
Other Loss Functions, Continued
A common problem in econometrics/statistics concerns estimation and inference for a ratio of parameters. In a regression model, for example,
y = Xβ + ε, ε|X ∼ N(0, σ²I_n),
suppose we wish to conduct inference on
θ = β1/β2,
the ratio of the first two elements of the regression coefficient vector (WLOG). Note that the ML estimator
θ̂ = β̂1/β̂2
has no finite sample moments, although the asymptotics are well-behaved (and easily characterized).
A reasonable loss function to consider in this case (e.g., Zellner 1978 JoE, “Estimation of Population Means and Regression Coefficients Including Structural Coefficients: A Minimum Expected Loss (MELO) Approach”) is
C(θ̂, θ) = β2²(θ̂ − θ)².
Straightforward derivations show:
θ̂ = [E_{θ|y}(β1)/E_{θ|y}(β2)] × {1 + Cov_{θ|y}(β1, β2)[E_{θ|y}(β1)E_{θ|y}(β2)]⁻¹} / {1 + Var_{θ|y}(β2)[E_{θ|y}(β2)]⁻²}.
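Given draws from the joint posterior of (β1, β2), the minimizer of this expected loss is simply E(β1β2)/E(β2²), which rearranges to the ratio expression above. A sketch with placeholder posterior moments:

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder joint posterior draws for (beta1, beta2).
b1, b2 = rng.multivariate_normal([2.0, 1.0],
                                 [[0.20, 0.05], [0.05, 0.10]],
                                 size=300_000).T

# MELO estimate under the loss beta2^2 (theta_hat - theta)^2.
melo = np.mean(b1 * b2) / np.mean(b2 ** 2)

# The equivalent ratio form from the text (posterior moments by Monte Carlo).
cov12 = np.mean(b1 * b2) - b1.mean() * b2.mean()
var2 = np.mean(b2 ** 2) - b2.mean() ** 2
ratio_form = (b1.mean() / b2.mean()) \
    * (1 + cov12 / (b1.mean() * b2.mean())) / (1 + var2 / b2.mean() ** 2)

print(melo, ratio_form)
```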
In the regression model application, under a flat (improper) prior for β, σ², this reduces to:
θ̂ = (β̂1/β̂2) × [1 + m12 s̄²(β̂1β̂2)⁻¹] / [1 + m22 s̄² β̂2⁻²],
where
ν = n − k > 2, mij is the (i, j) element of (X′X)⁻¹, s̄² = νs²/(ν − 2),
and β̂j denotes the jth element of the OLS vector.
The estimator above does have finite first and second moments, and shares the same asymptotic distribution as the standard ML estimator.