
CONTENTS

vii     General Preface (Dov Gabbay, Paul Thagard, and John Woods)

ix      Preface (Prasanta S. Bandyopadhyay and Malcolm R. Forster)

xiii    List of Contributors

Introduction

1       Philosophy of Statistics: An Introduction (Prasanta S. Bandyopadhyay and Malcolm R. Forster)

Part I. Probability & Statistics

53      Elementary Probability and Statistics: A Primer (Prasanta S. Bandyopadhyay and Steve Cherry)

Part II. Philosophical Controversies about Conditional Probability

99      Conditional Probability (Alan Hájek)

137     The Varieties of Conditional Probability (Kenny Easwaran)

Part III. Four Paradigms of Statistics

Classical Statistics Paradigm

153     Error Statistics (Deborah G. Mayo and Aris Spanos)

199     Significance Testing (Michael Dickson and Davis Baird)

Bayesian Paradigm

233     The Bayesian Decision-Theoretic Approach to Statistics (Paul Weirich)


263     Modern Bayesian Inference: Foundations and Objective Methods (José M. Bernardo)

307     Evidential Probability and Objective Bayesian Epistemology (Gregory Wheeler and Jon Williamson)

333     Confirmation Theory (James Hawthorne)

391     Challenges to Bayesian Confirmation Theory (John D. Norton)

441     Bayesianism as a Pure Logic of Inference (Colin Howson)

473     Bayesian Inductive Logic, Verisimilitude, and Statistics (Roberto Festa)

Likelihood Paradigm

493     Likelihood and its Evidential Framework (Jeffrey D. Blume)

513     Evidence, Evidence Functions, and Error Probabilities (Mark L. Taper and Subhash R. Lele)

Akaikean Paradigm

535     AIC Scores as Evidence — a Bayesian Interpretation (Malcolm Forster and Elliott Sober)

Part IV: The Likelihood Principle

553     The Likelihood Principle (Jason Grossman)

Part V: Recent Advances in Model Selection

583     AIC, BIC and Recent Advances in Model Selection (Arijit Chakrabarti and Jayanta K. Ghosh)

607     Posterior Model Probabilities (A. Philip Dawid)


Part VI: Attempts to Understand Different Aspects of “Randomness”

633     Defining Randomness (Deborah Bennett)

641     Mathematical Foundations of Randomness (Abhijit Dasgupta)

Part VII: Probabilistic and Statistical Paradoxes

713     Paradoxes of Probability (Susan Vineberg)

737     Statistical Paradoxes: Take It to The Limit (C. Andy Tsao)

Part VIII: Statistics and Inductive Inference

751     Statistics as Inductive Inference (Jan-Willem Romeijn)

Part IX: Various Issues about Causal Inference

777     Common Cause in Causal Inference (Peter Spirtes)

813     The Logic and Philosophy of Causal Inference: A Statistical Perspective (Sander Greenland)

Part X: Some Philosophical Issues Concerning Statistical Learning Theory

833     Statistical Learning Theory as a Framework for the Philosophy of Induction (Gilbert Harman and Sanjeev Kulkarni)

849     Testability and Statistical Learning Theory (Daniel Steel)


Part XI: Different Approaches to Simplicity Related to Inference and Truth

865     Luckiness and Regret in Minimum Description Length Inference (Steven de Rooij and Peter D. Grünwald)

901     MML, Hybrid Bayesian Network Graphical Models, Statistical Consistency, Invariance and Uniqueness (David L. Dowe)

983     Simplicity, Truth and Probability (Kevin T. Kelly)

Part XII: Special Problems in Statistics/Computer Science

1027    Normal Approximations (Robert J. Boik)

1073    Stein’s Phenomenon (Richard Charnigo and Cidambi Srinivasan)

1099    Data, Data, Everywhere: Statistical Issues in Data Mining (Choh Man Teng)

Part XIII: An Application of Statistics to Climate Change

1121    An Application of Statistics in Climate Change: Detection of Nonlinear Changes in a Streamflow Timing Measure in the Columbia and Missouri Headwaters (Mark C. Greenwood, Joel Harper and Johnnie Moore)

Part XIV: Historical Approaches to Probability/Statistics

1149    The Subjective and the Objective (Sandy L. Zabell)

1175    Probability in Ancient India (C. K. Raju)

1197    Index


MODERN BAYESIAN INFERENCE: FOUNDATIONS AND OBJECTIVE METHODS

José M. Bernardo

Handbook of the Philosophy of Science. Volume 7: Philosophy of Statistics. Volume editors: Prasanta S. Bandyopadhyay and Malcolm R. Forster. General editors: Dov M. Gabbay, Paul Thagard and John Woods. © 2011 Elsevier B.V. All rights reserved.

The field of statistics includes two major paradigms: frequentist and Bayesian. Bayesian methods provide a complete paradigm for both statistical inference and decision making under uncertainty. Bayesian methods may be derived from an axiomatic system and provide a coherent methodology which makes it possible to incorporate relevant initial information, and which solves many of the difficulties which frequentist methods are known to face. If no prior information is to be assumed, a situation often met in scientific reporting and public decision making, a formal initial prior function must be mathematically derived from the assumed model. This leads to objective Bayesian methods, objective in the precise sense that their results, like frequentist results, only depend on the assumed model and the data obtained. The Bayesian paradigm is based on an interpretation of probability as a rational conditional measure of uncertainty, which closely matches the sense of the word ‘probability’ in ordinary language. Statistical inference about a quantity of interest is described as the modification of the uncertainty about its value in the light of evidence, and Bayes’ theorem specifies how this modification should precisely be made.

1 INTRODUCTION

Scientific experimental or observational results generally consist of (possibly many) sets of data of the general form D = {x_1, . . . , x_n}, where the x_i’s are somewhat “homogeneous” (possibly multidimensional) observations. Statistical methods are then typically used to derive conclusions on both the nature of the process which has produced those observations, and on the expected behaviour at future instances of the same process. A central element of any statistical analysis is the specification of a probability model which is assumed to describe the mechanism which has generated the observed data D as a function of a (possibly multidimensional) parameter (vector) ω ∈ Ω, sometimes referred to as the state of nature, about whose value only limited information (if any) is available. All derived statistical conclusions are obviously conditional on the assumed probability model.

Unlike most other branches of mathematics, frequentist methods of statistical inference suffer from the lack of an axiomatic basis; as a consequence, their proposed desiderata are often mutually incompatible, and the analysis of the same data may well lead to incompatible results when different, apparently intuitive procedures are tried; see Lindley [1972] and Jaynes [1976] for many instructive examples. In marked contrast, the Bayesian approach to statistical inference is firmly based on axiomatic foundations which provide a unifying logical structure, and guarantee the mutual consistency of the methods proposed. Bayesian methods constitute a complete paradigm for statistical inference, a scientific revolution in Kuhn’s sense.

Bayesian statistics only require the mathematics of probability theory and the interpretation of probability which most closely corresponds to the standard use of this word in everyday language: it is no accident that some of the more important seminal books on Bayesian statistics, such as the works of de Laplace [1812], Jeffreys [1939] or de Finetti [1970], are actually entitled “Probability Theory”. The practical consequences of adopting the Bayesian paradigm are far reaching. Indeed, Bayesian methods (i) reduce statistical inference to problems in probability theory, thereby minimizing the need for completely new concepts, and (ii) serve to discriminate among conventional, typically frequentist, statistical techniques, by either providing a logical justification for some (and making explicit the conditions under which they are valid), or proving the logical inconsistency of others.

The main result from these foundations is the mathematical need to describe by means of probability distributions all uncertainties present in the problem. In particular, unknown parameters in probability models must have a joint probability distribution which describes the available information about their values; this is often regarded as the characteristic element of a Bayesian approach. Notice that (in sharp contrast to conventional statistics) parameters are treated as random variables within the Bayesian paradigm. This is not a description of their variability (parameters are typically fixed unknown quantities) but a description of the uncertainty about their true values.

A most important particular case arises when either no relevant prior information is readily available, or that information is subjective and an “objective” analysis is desired, one that is exclusively based on accepted model assumptions and well-documented public prior information. This is addressed by reference analysis, which uses information-theoretic concepts to derive formal reference prior functions which, when used in Bayes’ theorem, lead to posterior distributions encapsulating inferential conclusions on the quantities of interest solely based on the assumed model and the observed data.

In this article it is assumed that probability distributions may be described through their probability density functions, and no distinction is made between a random quantity and the particular values that it may take. Bold italic roman fonts are used for observable random vectors (typically data) and bold italic greek fonts are used for unobservable random vectors (typically parameters); lower case is used for variables and calligraphic upper case for their domain sets. Moreover, the standard mathematical convention of referring to functions, say f and g of x ∈ X, respectively by f(x) and g(x), will be used throughout. Thus, π(θ|D, C) and p(x|θ, C) respectively represent general probability densities of the unknown parameter θ ∈ Θ given data D and conditions C, and of the observable random vector x ∈ X conditional on θ and C. Hence, π(θ|D, C) ≥ 0, ∫_Θ π(θ|D, C) dθ = 1, and p(x|θ, C) ≥ 0, ∫_X p(x|θ, C) dx = 1. This admittedly imprecise notation will greatly simplify the exposition. If the random vectors are discrete, these functions naturally become probability mass functions, and integrals over their values become sums. Density functions of specific distributions are denoted by appropriate names. Thus, if x is a random quantity with a normal distribution of mean µ and standard deviation σ, its probability density function will be denoted N(x|µ, σ).

Bayesian methods make frequent use of the concept of logarithmic divergence, a very general measure of the goodness of the approximation of a probability density p(x) by another density p̃(x). The Kullback–Leibler, or logarithmic, divergence of a probability density p̃(x) of the random vector x ∈ X from its true probability density p(x) is defined as δ{p̃(x) | p(x)} = ∫_X p(x) log{p(x)/p̃(x)} dx. It may be shown that (i) the logarithmic divergence is non-negative (and it is zero if, and only if, p̃(x) = p(x) almost everywhere), and (ii) that δ{p̃(x) | p(x)} is invariant under one-to-one transformations of x.

This article contains a brief summary of the mathematical foundations of Bayesian statistical methods (Section 2), an overview of the paradigm (Section 3), a detailed discussion of objective Bayesian methods (Section 4), and a description of useful objective inference summaries, including estimation and hypothesis testing (Section 5).

Good introductions to objective Bayesian statistics include Lindley [1965], Zellner [1971], and Box and Tiao [1973]. For more advanced monographs, see [Berger, 1985; Bernardo and Smith, 1994].

2 FOUNDATIONS

A central element of the Bayesian paradigm is the use of probability distributions to describe all relevant unknown quantities, interpreting the probability of an event as a conditional measure of uncertainty, on a [0, 1] scale, about the occurrence of the event in some specific conditions. The limiting extreme values 0 and 1, which are typically inaccessible in applications, respectively describe impossibility and certainty of the occurrence of the event. This interpretation of probability includes and extends all other probability interpretations. There are two independent arguments which prove the mathematical inevitability of the use of probability distributions to describe uncertainties; these are summarized later in this section.

2.1 Probability as a Rational Measure of Conditional Uncertainty

Bayesian statistics uses the word probability in precisely the same sense in which this word is used in everyday language, as a conditional measure of uncertainty associated with the occurrence of a particular event, given the available information and the accepted assumptions. Thus, Pr(E|C) is a measure of (presumably rational) belief in the occurrence of the event E under conditions C. It is important to stress that probability is always a function of two arguments, the event E whose uncertainty is being measured, and the conditions C under which the measurement takes place; “absolute” probabilities do not exist. In typical applications, one is interested in the probability of some event E given the available data D, the set of assumptions A which one is prepared to make about the mechanism which has generated the data, and the relevant contextual knowledge K which might be available. Thus, Pr(E|D, A, K) is to be interpreted as a measure of (presumably rational) belief in the occurrence of the event E, given data D, assumptions A and any other available knowledge K, as a measure of how “likely” is the occurrence of E in these conditions. Sometimes, but certainly not always, the probability of an event under given conditions may be associated with the relative frequency of “similar” events in “similar” conditions. The following examples are intended to illustrate the use of probability as a conditional measure of uncertainty.

Probabilistic diagnosis. A human population is known to contain 0.2% of people infected by a particular virus. A person, randomly selected from that population, is subject to a test which is from laboratory data known to yield positive results in 98% of infected people and in 1% of non-infected, so that, if V denotes the event that a person carries the virus and + denotes a positive result, Pr(+|V) = 0.98 and Pr(+|V̄) = 0.01. Suppose that the result of the test turns out to be positive. Clearly, one is then interested in Pr(V|+, A, K), the probability that the person carries the virus, given the positive result, the assumptions A about the probability mechanism generating the test results, and the available knowledge K of the prevalence of the infection in the population under study (described here by Pr(V|K) = 0.002). An elementary exercise in probability algebra, which involves Bayes’ theorem in its simplest form (see Section 3), yields Pr(V|+, A, K) = 0.164. Notice that the four probabilities involved in the problem have the same interpretation: they are all conditional measures of uncertainty. Besides, Pr(V|+, A, K) is both a measure of the uncertainty associated with the event that the particular person who tested positive is actually infected, and an estimate of the proportion of people in that population (about 16.4%) that would eventually prove to be infected among those which yielded a positive test.
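
A minimal Python sketch of this calculation, using only the numbers quoted in the example, is the following:

    # Bayes' theorem in its simplest form for the diagnosis example above.
    prevalence = 0.002      # Pr(V | K)
    sens = 0.98             # Pr(+ | V)
    false_pos = 0.01        # Pr(+ | not V)

    p_positive = sens * prevalence + false_pos * (1 - prevalence)   # Pr(+ | A, K)
    p_infected_given_positive = sens * prevalence / p_positive      # Pr(V | +, A, K)
    print(round(p_infected_given_positive, 3))                      # 0.164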

Estimation of a proportion. A survey is conducted to estimate the proportion θ of individuals in a population who share a given property. A random sample of n elements is analyzed, r of which are found to possess that property. One is then typically interested in using the results from the sample to establish regions of [0, 1] where the unknown value of θ may plausibly be expected to lie; this information is provided by probabilities of the form Pr(a < θ < b | r, n, A, K), a conditional measure of the uncertainty about the event that θ belongs to (a, b) given the information provided by the data (r, n), the assumptions A made on the behaviour of the mechanism which has generated the data (a random sample of n Bernoulli trials), and any relevant knowledge K on the values of θ which might be available. For example, after a political survey in which 720 citizens out of a random sample of 1500 have declared their support to a particular political measure, one may conclude that Pr(θ < 0.5 | 720, 1500, A, K) = 0.933, indicating a probability of about 93% that a referendum on that issue would be lost. Similarly, after a screening test for an infection where 100 people have been tested, none of which has turned out to be infected, one may conclude that Pr(θ < 0.01 | 0, 100, A, K) = 0.844, or a probability of about 84% that the proportion of infected people is smaller than 1%.

Measurement of a physical constant. A team of scientists, intending to establish the unknown value of a physical constant µ, obtain data D = {x_1, . . . , x_n} which are considered to be measurements of µ subject to error. The probabilities of interest are then typically of the form Pr(a < µ < b | x_1, . . . , x_n, A, K), the probability that the unknown value of µ (fixed in nature, but unknown to the scientists) lies within an interval (a, b) given the information provided by the data D, the assumptions A made on the behaviour of the measurement mechanism, and whatever knowledge K might be available on the value of the constant µ. Again, those probabilities are conditional measures of uncertainty which describe the (necessarily probabilistic) conclusions of the scientists on the true value of µ, given available information and accepted assumptions. For example, after a classroom experiment to measure the gravitational field with a pendulum, a student may report (in m/sec²) something like Pr(9.788 < g < 9.829 | D, A, K) = 0.95, meaning that, under accepted knowledge K and assumptions A, the observed data D indicate that the true value of g lies between 9.788 and 9.829 with probability 0.95, a conditional uncertainty measure on a [0, 1] scale. This is naturally compatible with the fact that the value of the gravitational field at the laboratory may well be known with high precision from available literature or from precise previous experiments, but the student may have been instructed not to use that information as part of the accepted knowledge K. Under some conditions, it is also true that if the same procedure were actually used by many other students with similarly obtained data sets, their reported intervals would actually cover the true value of g in approximately 95% of the cases, thus providing a frequentist calibration of the student’s probability statement.

Prediction. An experiment is made to count the number r of times that an event E takes place in each of n replications of a well defined situation; it is observed that E does take place r_i times in replication i, and it is desired to forecast the number of times r that E will take place in a similar future situation. This is a prediction problem on the value of an observable (discrete) quantity r, given the information provided by data D, accepted assumptions A on the probability mechanism which generates the r_i’s, and any relevant available knowledge K. Computation of the probabilities {Pr(r | r_1, . . . , r_n, A, K)}, for r = 0, 1, . . ., is thus required. For example, the quality assurance engineer of a firm which produces automobile restraint systems may report something like Pr(r = 0 | r_1 = · · · = r_{10} = 0, A, K) = 0.953, after observing that the entire production of airbags in each of n = 10 consecutive months has yielded no complaints from their clients. This should be regarded as a measure, on a [0, 1] scale, of the conditional uncertainty, given observed data, accepted assumptions and contextual knowledge, associated with the event that no airbag complaint will come from next month’s production and, if conditions remain constant, this is also an estimate of the proportion of months expected to share this desirable property.

A similar problem may naturally be posed with continuous observables. For instance, after measuring some continuous magnitude in each of n randomly chosen elements within a population, it may be desired to forecast the proportion of items in the whole population whose magnitude satisfies some precise specifications. As an example, after measuring the breaking strengths {x_1, . . . , x_{10}} of 10 randomly chosen safety belt webbings to verify whether or not they satisfy the requirement of remaining above 26 kN, the quality assurance engineer may report something like Pr(x > 26 | x_1, . . . , x_{10}, A, K) = 0.9987. This should be regarded as a measure, on a [0, 1] scale, of the conditional uncertainty (given observed data, accepted assumptions and contextual knowledge) associated with the event that a randomly chosen safety belt webbing will support no less than 26 kN. If production conditions remain constant, it will also be an estimate of the proportion of safety belts which will conform to this particular specification.

Often, additional information about future observations is provided by related covariates. For instance, after observing the outputs {y_1, . . . , y_n} which correspond to a sequence {x_1, . . . , x_n} of different production conditions, it may be desired to forecast the output y which would correspond to a particular set x of production conditions. For instance, the viscosity of commercial condensed milk is required to be within specified values a and b; after measuring the viscosities {y_1, . . . , y_n} which correspond to samples of condensed milk produced under different physical conditions {x_1, . . . , x_n}, production engineers will require probabilities of the form Pr(a < y < b | x, (y_1, x_1), . . . , (y_n, x_n), A, K). This is a conditional measure of the uncertainty (always given observed data, accepted assumptions and contextual knowledge) associated with the event that condensed milk produced under conditions x will actually satisfy the required viscosity specifications.

2.2 Statistical Inference and Decision Theory

Decision theory not only provides a precise methodology to deal with decision problems under uncertainty, but its solid axiomatic basis also provides a powerful reinforcement to the logical force of the Bayesian approach. We now summarize the basic argument.

A decision problem exists whenever there are two or more possible courses of action; let A be the class of possible actions. Moreover, for each a ∈ A, let Θ_a be the set of relevant events which may affect the result of choosing a, and let c(a, θ) ∈ C_a, θ ∈ Θ_a, be the consequence of having chosen action a when event θ takes place. The class of pairs {(Θ_a, C_a), a ∈ A} describes the structure of the decision problem. Without loss of generality, it may be assumed that the possible actions are mutually exclusive, for otherwise one would work with the appropriate Cartesian product.

Different sets of principles have been proposed to capture a minimum collection of logical rules that could sensibly be required for “rational” decision-making. These all consist of axioms with a strong intuitive appeal; examples include the transitivity of preferences (if a_1 ≻ a_2 given C, and a_2 ≻ a_3 given C, then a_1 ≻ a_3 given C), and the sure-thing principle (if a_1 ≻ a_2 given C and E, and a_1 ≻ a_2 given C and not E, then a_1 ≻ a_2 given C). Notice that these rules are not intended as a description of actual human decision-making, but as a normative set of principles to be followed by someone who aspires to achieve coherent decision-making.

There are naturally different options for the set of acceptable principles (see e.g. [Ramsey, 1926; Savage, 1954; DeGroot, 1970; Bernardo and Smith, 1994, Ch. 2] and references therein), but all of them lead basically to the same conclusions, namely:

(i) Preferences among consequences should be measured with a real-valued bounded utility function U(c) = U(a, θ) which specifies, on some numerical scale, their desirability.

(ii) The uncertainty of relevant events should be measured with a set of probability distributions {(π(θ|C, a), θ ∈ Θ_a), a ∈ A} describing their plausibility given the conditions C under which the decision must be taken.

(iii) The desirability of the available actions is measured by their corresponding expected utility

(1)   U(a|C) = ∫_{Θ_a} U(a, θ) π(θ|C, a) dθ,   a ∈ A.

It is often convenient to work in terms of the non-negative loss function defined by

(2)   L(a, θ) = sup_{a′ ∈ A} {U(a′, θ)} − U(a, θ),

which directly measures, as a function of θ, the “penalty” for choosing a wrong action. The relative undesirability of available actions a ∈ A is then measured by their expected loss

(3)   L(a|C) = ∫_{Θ_a} L(a, θ) π(θ|C, a) dθ,   a ∈ A.
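
The following small numerical sketch illustrates equations (1)–(3) in the discrete case; the two actions, three events, utilities and probabilities are hypothetical values chosen purely for illustration (NumPy is assumed to be available).

    import numpy as np

    # Hypothetical decision problem: two actions (rows), three possible events theta_j (columns).
    U = np.array([[10.0, 2.0, -5.0],      # U(a_1, theta_j)
                  [ 4.0, 4.0,  3.0]])     # U(a_2, theta_j)
    p = np.array([0.2, 0.5, 0.3])         # pi(theta_j | C), here taken to be the same for both actions

    expected_utility = U @ p              # discrete version of equation (1)
    loss = U.max(axis=0) - U              # equation (2): penalty relative to the best action for each event
    expected_loss = loss @ p              # discrete version of equation (3)

    print(expected_utility)               # [1.5  3.7]
    print(expected_loss)                  # [3.4  1.2]  -> a_2 maximizes expected utility and minimizes expected loss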

Notice that, in particular, the argument described above establishes the need to quantify the uncertainty about all relevant unknown quantities (the actual values of the θ’s), and specifies that this quantification must have the mathematical structure of probability distributions. These probabilities are conditional on the circumstances C under which the decision is to be taken, which typically, but not necessarily, include the results D of some relevant experimental or observational data.

It has been argued that the development described above (which is not questioned when decisions have to be made) does not apply to problems of statistical inference, where no specific decision making is envisaged. However, there are two powerful counterarguments to this. Indeed, (i) a problem of statistical inference is typically considered worth analyzing because it may eventually help make sensible decisions; a lump of arsenic is poisonous because it may kill someone, not because it has actually killed someone [Ramsey, 1926], and (ii) it has been shown [Bernardo, 1979a] that statistical inference on θ actually has the mathematical structure of a decision problem, where the class of alternatives is the functional space

(4)   A = { π(θ|D);  π(θ|D) > 0,  ∫_Θ π(θ|D) dθ = 1 }

of the conditional probability distributions of θ given the data, and the utility function is a measure of the amount of information about θ which the data may be expected to provide.

2.3 Exchangeability and Representation Theorem

Available data often take the form of a set {x_1, . . . , x_n} of “homogeneous” (possibly multidimensional) observations, in the precise sense that only their values matter and not the order in which they appear. Formally, this is captured by the notion of exchangeability. The set of random vectors {x_1, . . . , x_n} is exchangeable if their joint distribution is invariant under permutations. An infinite sequence {x_j} of random vectors is exchangeable if all its finite subsequences are exchangeable. Notice that, in particular, any random sample from any model is exchangeable in this sense. The concept of exchangeability, introduced by de Finetti [1937], is central to modern statistical thinking. Indeed, the general representation theorem implies that if a set of observations is assumed to be a subset of an exchangeable sequence, then it constitutes a random sample from some probability model {p(x|ω), ω ∈ Ω}, x ∈ X, described in terms of (labeled by) some parameter vector ω; furthermore this parameter ω is defined as the limit (as n → ∞) of some function of the observations. Available information about the value of ω in prevailing conditions C is necessarily described by some probability distribution π(ω|C).

For example, in the case of a sequence {x_1, x_2, . . .} of dichotomous exchangeable random quantities x_j ∈ {0, 1}, de Finetti’s representation theorem establishes that the joint distribution of (x_1, . . . , x_n) has an integral representation of the form

(5)   p(x_1, . . . , x_n | C) = ∫_0^1 ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} π(θ|C) dθ,   θ = lim_{n→∞} r/n,

where r = Σ_j x_j is the number of positive trials. This is nothing but the joint distribution of a set of (conditionally) independent Bernoulli trials with parameter θ, over which some probability distribution π(θ|C) is therefore proven to exist. More generally, for sequences of arbitrary random quantities {x_1, x_2, . . .}, exchangeability leads to integral representations of the form

(6)   p(x_1, . . . , x_n | C) = ∫_Ω ∏_{i=1}^n p(x_i|ω) π(ω|C) dω,

where {p(x|ω), ω ∈ Ω} denotes some probability model, ω is the limit as n → ∞ of some function f(x_1, . . . , x_n) of the observations, and π(ω|C) is some probability distribution over Ω. This formulation includes “nonparametric” (distribution free) modelling, where ω may index, for instance, all continuous probability distributions on X. Notice that π(ω|C) does not describe a possible variability of ω (since ω will typically be a fixed unknown vector), but a description of the uncertainty associated with its actual value.

Under appropriate conditioning, exchangeability is a very general assumption, a powerful extension of the traditional concept of a random sample. Indeed, many statistical analyses directly assume data (or subsets of the data) to be a random sample of conditionally independent observations from some probability model, so that p(x_1, . . . , x_n | ω) = ∏_{i=1}^n p(x_i|ω); but any random sample is exchangeable, since ∏_{i=1}^n p(x_i|ω) is obviously invariant under permutations. Notice that the observations in a random sample are only independent conditional on the parameter value ω; as nicely put by Lindley, the mantra that the observations {x_1, . . . , x_n} in a random sample are independent is ridiculous when they are used to infer x_{n+1}. Notice also that, under exchangeability, the general representation theorem provides an existence theorem for a probability distribution π(ω|C) on the parameter space Ω, and that this is an argument which only depends on mathematical probability theory.

Another important consequence of exchangeability is that it provides a formal definition of the parameter ω which labels the model as the limit, as n → ∞, of some function f(x_1, . . . , x_n) of the observations; the function f obviously depends both on the assumed model and the chosen parametrization. For instance, in the case of a sequence of Bernoulli trials, the parameter θ is defined as the limit, as n → ∞, of the relative frequency r/n. It follows that, under exchangeability, the sentence “the true value of ω” has a well-defined meaning, if only asymptotically verifiable. Moreover, if two different models have parameters which are functionally related by their definition, then the corresponding posterior distributions may be meaningfully compared, for they refer to functionally related quantities. For instance, if a finite subset {x_1, . . . , x_n} of an exchangeable sequence of integer observations is assumed to be a random sample from a Poisson distribution Po(x|λ), so that E[x|λ] = λ, then λ is defined as lim_{n→∞} x̄_n, where x̄_n = Σ_j x_j / n; similarly, if for some fixed non-zero integer r, the same data are assumed to be a random sample for a negative binomial Nb(x|r, θ), so that E[x|θ, r] = r(1 − θ)/θ, then θ is defined as lim_{n→∞} {r/(x̄_n + r)}. It follows that θ ≡ r/(λ + r) and, hence, θ and r/(λ + r) may be treated as the same (unknown) quantity whenever this might be needed as, for example, when comparing the relative merits of these alternative probability models.

3 THE BAYESIAN PARADIGM

The statistical analysis of some observed data D typically begins with some informal descriptive evaluation, which is used to suggest a tentative, formal probability model {p(D|ω), ω ∈ Ω} assumed to represent, for some (unknown) value of ω, the probabilistic mechanism which has generated the observed data D. The arguments outlined in Section 2 establish the logical need to assess a prior probability distribution π(ω|K) over the parameter space Ω, describing the available knowledge K about the value of ω prior to the data being observed. It then follows from standard probability theory that, if the probability model is correct, all available information about the value of ω after the data D have been observed is contained in the corresponding posterior distribution whose probability density, π(ω|D, A, K), is immediately obtained from Bayes’ theorem,

(7)   π(ω|D, A, K) = p(D|ω) π(ω|K) / ∫_Ω p(D|ω) π(ω|K) dω,

where A stands for the assumptions made on the probability model. It is this systematic use of Bayes’ theorem to incorporate the information provided by the data that justifies the adjective Bayesian by which the paradigm is usually known. It is obvious from Bayes’ theorem that any value of ω with zero prior density will have zero posterior density. Thus, it is typically assumed (by appropriate restriction, if necessary, of the parameter space Ω) that prior distributions are strictly positive (as Savage put it, keep the mind open, or at least ajar). To simplify the presentation, the accepted assumptions A and the available knowledge K are often omitted from the notation, but the fact that all statements about ω given D are also conditional on A and K should always be kept in mind.

EXAMPLE 1 Bayesian inference with a finite parameter space. Let p(D|θ), θ ∈ {θ_1, . . . , θ_m}, be the probability mechanism which is assumed to have generated the observed data D, so that θ may only take a finite number of values. Using the finite form of Bayes’ theorem, and omitting the prevailing conditions from the notation, the posterior probability of θ_i after data D have been observed is

(8)   Pr(θ_i|D) = p(D|θ_i) Pr(θ_i) / Σ_{j=1}^m p(D|θ_j) Pr(θ_j),   i = 1, . . . , m.

For any prior distribution p(θ) = {Pr(θ_1), . . . , Pr(θ_m)} describing available knowledge about the value of θ, Pr(θ_i|D) measures how likely θ_i should be judged, given both the initial knowledge described by the prior distribution, and the information provided by the data D.
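
Equation (8) is trivially implemented as a vector operation; the sketch below uses hypothetical prior and likelihood values over a three-point parameter space, chosen only for illustration (NumPy assumed available).

    import numpy as np

    def finite_bayes(prior, likelihood):
        """Equation (8): posterior probabilities over {theta_1, ..., theta_m} given data D."""
        joint = np.asarray(prior) * np.asarray(likelihood)   # p(D | theta_i) Pr(theta_i)
        return joint / joint.sum()                           # normalize over the m values

    prior = [0.3, 0.5, 0.2]                  # Pr(theta_i)      (hypothetical)
    likelihood = [0.10, 0.02, 0.40]          # p(D | theta_i)   (hypothetical)
    print(finite_bayes(prior, likelihood))   # approx. [0.25  0.083  0.667]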


Figure 1. Posterior probability of infection Pr(V|+) given a positive test, as a function of the prior probability of infection Pr(V)

An important, frequent application of this simple technique is provided by probabilistic diagnosis. For example, consider the simple situation where a particular test designed to detect a virus is known from laboratory research to give a positive result in 98% of infected people and in 1% of non-infected. Then, the posterior probability that a person who tested positive is infected is given by Pr(V|+) = (0.98 p)/{0.98 p + 0.01 (1 − p)} as a function of p = Pr(V), the prior probability of a person being infected (the prevalence of the infection in the population under study). Figure 1 shows Pr(V|+) as a function of Pr(V).

As one would expect, the posterior probability is only zero if the prior probability is zero (so that it is known that the population is free of infection) and it is only one if the prior probability is one (so that it is known that the population is universally infected). Notice that if the infection is rare, then the posterior probability of a randomly chosen person being infected will be relatively low even if the test is positive. Indeed, for say Pr(V) = 0.002, one finds Pr(V|+) = 0.164, so that in a population where only 0.2% of individuals are infected, only 16.4% of those testing positive within a random sample will actually prove to be infected: most positives would actually be false positives.

In this section, we describe in some detail the learning process described by Bayes’ theorem, discuss its implementation in the presence of nuisance parameters, show how it can be used to forecast the value of future observations, and analyze its large sample behaviour.

3.1 The Learning Process

In the Bayesian paradigm, the process of learning from the data is systematically implemented by making use of Bayes’ theorem to combine the available prior information with the information provided by the data to produce the required posterior distribution. Computation of posterior densities is often facilitated by noting that Bayes’ theorem may be simply expressed as

(9)   π(ω|D) ∝ p(D|ω) π(ω)

(where ∝ stands for ‘proportional to’ and where, for simplicity, the accepted assumptions A and the available knowledge K have been omitted from the notation), since the missing proportionality constant [∫_Ω p(D|ω) π(ω) dω]^{-1} may always be deduced from the fact that π(ω|D), a probability density, must integrate to one. Hence, to identify the form of a posterior distribution it suffices to identify a kernel of the corresponding probability density, that is a function k(ω) such that π(ω|D) = c(D) k(ω) for some c(D) which does not involve ω. In the examples which follow, this technique will often be used.

An improper prior function is defined as a positive function π(ω) such that ∫_Ω π(ω) dω is not finite. Equation (9), the formal expression of Bayes’ theorem, remains technically valid if π(ω) is an improper prior function provided that ∫_Ω p(D|ω) π(ω) dω < ∞, thus leading to a well defined proper posterior density π(ω|D) ∝ p(D|ω) π(ω). In particular, as will later be justified (Section 4), it also remains philosophically valid if π(ω) is an appropriately chosen reference (typically improper) prior function.

Considered as a function of ω, l(ω, D) = p(D|ω) is often referred to as the likelihood function. Thus, Bayes’ theorem is simply expressed in words by the statement that the posterior is proportional to the likelihood times the prior. It follows from equation (9) that, provided the same prior π(ω) is used, two different data sets D_1 and D_2, with possibly different probability models p_1(D_1|ω) and p_2(D_2|ω) but yielding proportional likelihood functions, will produce identical posterior distributions for ω. This immediate consequence of Bayes’ theorem has been proposed as a principle on its own, the likelihood principle, and it is seen by many as an obvious requirement for reasonable statistical inference. In particular, for any given prior π(ω), the posterior distribution does not depend on the set of possible data values, or the sample space. Notice, however, that the likelihood principle only applies to inferences about the parameter vector ω once the data have been obtained. Consideration of the sample space is essential, for instance, in model criticism, in the design of experiments, in the derivation of predictive distributions, and in the construction of objective Bayesian procedures.
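
The following sketch illustrates this numerically: r = 7 successes in n = 24 Bernoulli trials are analyzed under two sampling models with proportional likelihoods, a binomial model (n fixed in advance) and a negative binomial model (sampling until the r-th success), and the two grid posteriors coincide. The data, grid and flat prior are illustrative assumptions, and SciPy is assumed to be available.

    import numpy as np
    from scipy.stats import binom, nbinom

    r, n = 7, 24                                     # illustrative data: 7 successes in 24 trials
    theta = np.linspace(0.001, 0.999, 999)           # grid of parameter values
    prior = np.ones_like(theta)                      # a flat prior on the grid, for illustration

    lik_binomial = binom.pmf(r, n, theta)            # proportional to theta^r (1 - theta)^(n - r)
    lik_negbinom = nbinom.pmf(n - r, r, theta)       # also proportional to theta^r (1 - theta)^(n - r)

    post_binomial = lik_binomial * prior
    post_binomial /= post_binomial.sum()
    post_negbinom = lik_negbinom * prior
    post_negbinom /= post_negbinom.sum()

    print(np.max(np.abs(post_binomial - post_negbinom)))   # essentially zero: identical posteriors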

Naturally, the terms prior and posterior are only relative to a particular set of data. As one would expect from the coherence induced by probability theory, if data D = {x_1, . . . , x_n} are sequentially presented, the final result will be the same whether data are globally or sequentially processed. Indeed, π(ω|x_1, . . . , x_{i+1}) ∝ p(x_{i+1}|ω) π(ω|x_1, . . . , x_i), for i = 1, . . . , n − 1, so that the “posterior” at a given stage becomes the “prior” at the next.
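
A minimal conjugate sketch of this sequential coherence, with an illustrative Be(1, 1) prior and made-up Bernoulli data:

    # Sequential versus global processing of Bernoulli data under a Be(alpha, beta) prior.
    data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]        # illustrative observations
    alpha, beta = 1.0, 1.0                       # illustrative Be(1, 1) prior

    a_seq, b_seq = alpha, beta
    for x in data:                               # one observation at a time
        a_seq, b_seq = a_seq + x, b_seq + (1 - x)

    a_all = alpha + sum(data)                    # all data processed at once
    b_all = beta + len(data) - sum(data)
    print((a_seq, b_seq) == (a_all, b_all))      # True: the same Be(a, b) posterior either way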

In most situations, the posterior distribution is “sharper” than the prior so that, in most cases, the density π(ω|x_1, . . . , x_{i+1}) will be more concentrated around the true value of ω than π(ω|x_1, . . . , x_i). However, this is not always the case: occasionally, a “surprising” observation will increase, rather than decrease, the uncertainty about the value of ω. For instance, in probabilistic diagnosis, a sharp posterior probability distribution (over the possible causes {ω_1, . . . , ω_k} of a syndrome) describing a “clear” diagnosis of disease ω_i (that is, a posterior with a large probability for ω_i) would typically update to a less concentrated posterior probability distribution over {ω_1, . . . , ω_k} if a new clinical analysis yielded data which were unlikely under ω_i.

For a given probability model, one may find that a particular function of the data t = t(D) is a sufficient statistic in the sense that, given the model, t(D) contains all information about ω which is available in D. Formally, t = t(D) is sufficient if (and only if) there exist nonnegative functions f and g such that the likelihood function may be factorized in the form p(D|ω) = f(ω, t) g(D). A sufficient statistic always exists, for t(D) = D is obviously sufficient; however, a much simpler sufficient statistic, with a fixed dimensionality which is independent of the sample size, often exists. In fact this is known to be the case whenever the probability model belongs to the generalized exponential family, which includes many of the more frequently used probability models. It is easily established that if t is sufficient, the posterior distribution of ω only depends on the data D through t(D), and may be directly computed in terms of p(t|ω), so that π(ω|D) = π(ω|t) ∝ p(t|ω) π(ω).

Naturally, for fixed data and model assumptions, different priors lead to different posteriors. Indeed, Bayes’ theorem may be described as a data-driven probability transformation machine which maps prior distributions (describing prior knowledge) into posterior distributions (representing combined prior and data knowledge). It is important to analyze whether or not sensible changes in the prior would induce noticeable changes in the posterior. Posterior distributions based on reference “noninformative” priors play a central role in this sensitivity analysis context. Investigation of the sensitivity of the posterior to changes in the prior is an important ingredient of the comprehensive analysis of the sensitivity of the final results to all accepted assumptions which any responsible statistical study should contain.

EXAMPLE 2 Inference on a binomial parameter. If the data D consist of n Bernoulli observations with parameter θ which contain r positive trials, then p(D|θ, n) = θ^r (1 − θ)^{n−r}, so that t(D) = {r, n} is sufficient. Suppose that prior knowledge about θ is described by a Beta distribution Be(θ|α, β), so that π(θ|α, β) ∝ θ^{α−1} (1 − θ)^{β−1}. Using Bayes’ theorem, the posterior density of θ is π(θ|r, n, α, β) ∝ θ^r (1 − θ)^{n−r} θ^{α−1} (1 − θ)^{β−1} ∝ θ^{r+α−1} (1 − θ)^{n−r+β−1}, the Beta distribution Be(θ|r + α, n − r + β).

Suppose, for example, that in the light of precedent surveys, available information on the proportion θ of citizens who would vote for a particular political measure in a referendum is described by a Beta distribution Be(θ|50, 50), so that it is judged to be equally likely that the referendum would be won or lost, and it is judged that the probability that either side wins less than 60% of the vote is 0.95.


Figure 2. Prior and posterior densities of the proportion θ of citizens that would vote in favour of a referendum

A random survey of size 1500 is then conducted, in which only 720 citizens declare to be in favour of the proposed measure. Using the results above, the corresponding posterior distribution is then Be(θ|770, 830). These prior and posterior densities are plotted in Figure 2; it may be appreciated that, as one would expect, the effect of the data is to drastically reduce the initial uncertainty on the value of θ and, hence, on the referendum outcome. More precisely, Pr(θ < 0.5 | 720, 1500, A, K) = 0.933 (shaded region in Figure 2) so that, after the information from the survey has been included, the probability that the referendum will be lost should be judged to be about 93%.
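
These numbers are easy to reproduce; a minimal sketch using SciPy (assumed available):

    from scipy.stats import beta

    a0, b0, r, n = 50, 50, 720, 1500             # Be(50, 50) prior; 720 supporters in 1500
    posterior = beta(a0 + r, b0 + n - r)         # Be(theta | 770, 830)

    print(round(posterior.cdf(0.5), 3))          # 0.933 = Pr(theta < 0.5 | 720, 1500, A, K)
    print(posterior.interval(0.95))              # central 95% posterior interval, approx. (0.457, 0.506)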

The general situation where the vector of interest is not the whole parameter vector ω, but some function θ = θ(ω) of possibly lower dimension than ω, will now be considered. Let D be some observed data, let {p(D|ω), ω ∈ Ω} be a probability model assumed to describe the probability mechanism which has generated D, let π(ω) be a probability distribution describing any available information on the value of ω, and let θ = θ(ω) ∈ Θ be a function of the original parameters over whose value inferences based on the data D are required. Any valid conclusion on the value of the vector of interest θ will then be contained in its posterior probability distribution π(θ|D), which is conditional on the observed data D and will naturally also depend, although not explicitly shown in the notation, on the assumed model {p(D|ω), ω ∈ Ω} and on the available prior information encapsulated by π(ω). The required posterior distribution π(θ|D) is found by standard use of probability calculus. Indeed, by Bayes’ theorem, π(ω|D) ∝ p(D|ω) π(ω). Moreover, let λ = λ(ω) ∈ Λ be some other function of the original parameters such that ψ = {θ, λ} is a one-to-one transformation of ω, and let J(ω) = (∂ψ/∂ω) be the corresponding Jacobian matrix. Naturally, the introduction of λ is not necessary if θ(ω) is a one-to-one transformation of ω. Using standard change-of-variable probability techniques, the posterior density of ψ is

(10)   π(ψ|D) = π(θ, λ|D) = [ π(ω|D) / |J(ω)| ]_{ω = ω(ψ)}

and the required posterior of θ is the appropriate marginal density, obtained by integration over the nuisance parameter λ,

(11)   π(θ|D) = ∫_Λ π(θ, λ|D) dλ.

Notice that elimination of unwanted nuisance parameters, a simple integration within the Bayesian paradigm, is, however, a difficult (often polemic) problem for frequentist statistics.

Sometimes, the range of possible values of ω is effectively restricted by contextual considerations. If ω is known to belong to Ω_c ⊂ Ω, the prior distribution is only positive in Ω_c and, using Bayes’ theorem, it is immediately found that the restricted posterior is

(12)   π(ω | D, ω ∈ Ω_c) = π(ω|D) / ∫_{Ω_c} π(ω|D) dω,   ω ∈ Ω_c,

and obviously vanishes if ω ∉ Ω_c. Thus, to incorporate a restriction on the possible values of the parameters, it suffices to renormalize the unrestricted posterior distribution to the set Ω_c ⊂ Ω of parameter values which satisfy the required condition. Incorporation of known constraints on the parameter values, a simple renormalization within the Bayesian paradigm, is another very difficult problem for conventional statistics. For further details on the elimination of nuisance parameters see [Liseo, 2005].

EXAMPLE 3 Inference on normal parameters. Let D = {x_1, . . . , x_n} be a random sample from a normal distribution N(x|µ, σ). The corresponding likelihood function is immediately found to be proportional to σ^{-n} exp[−n{s² + (x̄ − µ)²}/(2σ²)], with n x̄ = Σ_i x_i and n s² = Σ_i (x_i − x̄)². It may be shown (see Section 4) that absence of initial information on the value of both µ and σ may formally be described by a joint prior function which is uniform in both µ and log(σ), that is, by the (improper) prior function π(µ, σ) = σ^{-1}. Using Bayes’ theorem, the corresponding joint posterior is

(13)   π(µ, σ|D) ∝ σ^{-(n+1)} exp[−n{s² + (x̄ − µ)²}/(2σ²)].

Thus, using the Gamma integral in terms of λ = σ^{-2} to integrate out σ,

(14)   π(µ|D) ∝ ∫_0^∞ σ^{-(n+1)} exp[−n{s² + (x̄ − µ)²}/(2σ²)] dσ ∝ [s² + (x̄ − µ)²]^{-n/2},

which is recognized as a kernel of the Student density St(µ | x̄, s/√(n − 1), n − 1). Similarly, integrating out µ,

(15)   π(σ|D) ∝ ∫_{-∞}^∞ σ^{-(n+1)} exp[−n{s² + (x̄ − µ)²}/(2σ²)] dµ ∝ σ^{-n} exp[−n s²/(2σ²)].

Changing variables to the precision λ = σ^{-2} results in π(λ|D) ∝ λ^{(n−3)/2} e^{−n s² λ/2}, a kernel of the Gamma density Ga(λ | (n − 1)/2, n s²/2). In terms of the standard deviation σ this becomes π(σ|D) = π(λ|D) |∂λ/∂σ| = 2 σ^{-3} Ga(σ^{-2} | (n − 1)/2, n s²/2), a square-root inverted gamma density.

Figure 3. Posterior density π(g|m, s, n) of the value g of the gravitational field, given n = 20 normal measurements with mean m = 9.8087 and standard deviation s = 0.0428, (a) with no additional information, and (b) with g restricted to G_c = {g; 9.7803 < g < 9.8322}. Shaded areas represent 95%-credible regions of g

A frequent example of this scenario is provided by laboratory measurements made in conditions where central limit conditions apply, so that (assuming no experimental bias) those measurements may be treated as a random sample from a normal distribution centered at the quantity µ which is being measured, and with some (unknown) standard deviation σ. Suppose, for example, that in an elementary physics classroom experiment to measure the gravitational field g with a pendulum, a student has obtained n = 20 measurements of g yielding (in m/sec²) a mean x̄ = 9.8087 and a standard deviation s = 0.0428. Using no other information, the corresponding posterior distribution is π(g|D) = St(g | 9.8087, 0.0098, 19), represented in Figure 3(a). In particular, Pr(9.788 < g < 9.829 | D) = 0.95, so that, with the information provided by this experiment, the gravitational field at the location of the laboratory may be expected to lie between 9.788 and 9.829 with probability 0.95.

Formally, the posterior distribution of g should be restricted to g > 0; however, as is immediately obvious from Figure 3(a), this would not have any appreciable effect, due to the fact that the likelihood function is actually concentrated on positive g values.

Suppose now that the student is further instructed to incorporate into the analysis the fact that the value of the gravitational field g at the laboratory is known to lie between 9.7803 m/sec² (average value at the Equator) and 9.8322 m/sec² (average value at the poles). The updated posterior distribution will then be

(16)   π(g | D, g ∈ G_c) = St(g | m, s/√(n − 1), n − 1) / ∫_{G_c} St(g | m, s/√(n − 1), n − 1) dg,   g ∈ G_c,

represented in Figure 3(b), where G_c = {g; 9.7803 < g < 9.8322}. One-dimensional numerical integration may be used to verify that Pr(g > 9.792 | D, g ∈ G_c) = 0.95. Moreover, if inferences about the standard deviation σ of the measurement procedure are also requested, the corresponding posterior distribution is found to be π(σ|D) = 2 σ^{-3} Ga(σ^{-2} | 9.5, 0.0183). This has a mean E[σ|D] = 0.0458 and yields Pr(0.0334 < σ < 0.0642 | D) = 0.95.
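
The numerical claims of this example can be reproduced with a few lines of SciPy (assumed available); the renormalization in equation (16) is done here with the Student distribution function rather than by explicit numerical integration.

    import numpy as np
    from scipy.stats import t, gamma

    n, m, s = 20, 9.8087, 0.0428                              # sample size, mean and standard deviation
    post_g = t(df=n - 1, loc=m, scale=s / np.sqrt(n - 1))     # St(g | m, s/sqrt(n-1), n-1)

    print(post_g.interval(0.95))                              # approx. (9.788, 9.829)

    # Restricted posterior (16) on Gc = (9.7803, 9.8322): renormalize the unrestricted posterior.
    mass_Gc = post_g.cdf(9.8322) - post_g.cdf(9.7803)
    print((post_g.sf(9.792) - post_g.sf(9.8322)) / mass_Gc)   # approx. 0.95

    # Posterior for sigma via the precision lambda = sigma^-2 ~ Ga((n-1)/2, n s^2 / 2).
    lam = gamma(a=(n - 1) / 2, scale=2 / (n * s**2))
    lo, hi = lam.ppf(0.025), lam.ppf(0.975)
    print(1 / np.sqrt(hi), 1 / np.sqrt(lo))                   # approx. (0.0334, 0.0642)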

3.2 Predictive Distributions

Let D = {x_1, . . . , x_n}, x_i ∈ X, be a set of exchangeable observations, and consider now a situation where it is desired to predict the value of a future observation x ∈ X generated by the same random mechanism that has generated the data D. It follows from the foundations arguments discussed in Section 2 that the solution to this prediction problem is simply encapsulated by the predictive distribution p(x|D) describing the uncertainty on the value that x will take, given the information provided by D and any other available knowledge. Suppose that contextual information suggests the assumption that data D may be considered to be a random sample from a distribution in the family {p(x|ω), ω ∈ Ω}, and let π(ω) be a prior distribution describing available information on the value of ω. Since p(x|ω, D) = p(x|ω), it then follows from standard probability theory that

(17)   p(x|D) = ∫_Ω p(x|ω) π(ω|D) dω,

which is an average of the probability distributions of x conditional on the (unknown) value of ω, weighted with the posterior distribution of ω given D.

If the assumptions on the probability model are correct, the posterior predictive distribution p(x|D) will converge, as the sample size increases, to the distribution p(x|ω) which has generated the data. Indeed, the best technique to assess the quality of the inferences about ω encapsulated in π(ω|D) is to check against the observed data the predictive distribution p(x|D) generated by π(ω|D). For a good introduction to Bayesian predictive inference, see Geisser [1993].


EXAMPLE 4 Prediction in a Poisson process. Let D = {r_1, . . . , r_n} be a random sample from a Poisson distribution Pn(r|λ) with parameter λ, so that p(D|λ) ∝ λ^t e^{−λn}, where t = Σ r_i. It may be shown (see Section 4) that absence of initial information on the value of λ may be formally described by the (improper) prior function π(λ) = λ^{-1/2}. Using Bayes’ theorem, the corresponding posterior is

(18)   π(λ|D) ∝ λ^t e^{−λn} λ^{-1/2} ∝ λ^{t−1/2} e^{−λn},

the kernel of a Gamma density Ga(λ | t + 1/2, n), with mean (t + 1/2)/n. The corresponding predictive distribution is the Poisson–Gamma mixture

(19)   p(r|D) = ∫_0^∞ Pn(r|λ) Ga(λ | t + 1/2, n) dλ = [n^{t+1/2} / Γ(t + 1/2)] (1/r!) Γ(r + t + 1/2) / (1 + n)^{r+t+1/2}.

Suppose, for example, that in a firm producing automobile restraint systems, the entire production in each of 10 consecutive months has yielded no complaint from their clients. With no additional information on the average number λ of complaints per month, the quality assurance department of the firm may report that the probabilities that r complaints will be received in the next month of production are given by equation (19), with t = 0 and n = 10. In particular, p(r = 0|D) = 0.953, p(r = 1|D) = 0.043, and p(r = 2|D) = 0.003. Many other situations may be described with the same model. For instance, if meteorological conditions remain similar in a given area, p(r = 0|D) = 0.953 would describe the chances of no flash flood next year, given 10 years without flash floods in the area.
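
Equation (19) is straightforward to evaluate on the log scale; a minimal sketch (standard library only) reproducing the probabilities quoted above:

    from math import lgamma, log, exp

    def predictive(r, t, n):
        """Equation (19): Poisson-Gamma predictive probability of r events in the next period."""
        log_p = ((t + 0.5) * log(n) - lgamma(t + 0.5)     # n^(t+1/2) / Gamma(t+1/2)
                 - lgamma(r + 1)                          # 1 / r!
                 + lgamma(r + t + 0.5)                    # Gamma(r+t+1/2)
                 - (r + t + 0.5) * log(1 + n))            # (1+n)^-(r+t+1/2)
        return exp(log_p)

    print([round(predictive(r, t=0, n=10), 3) for r in range(3)])   # [0.953, 0.043, 0.003]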

EXAMPLE 5 Prediction in a Normal process. Consider now prediction of a continuous variable. Let $D = \{x_1, \ldots, x_n\}$ be a random sample from a normal distribution $\mathrm{N}(x\mid\mu,\sigma)$. As mentioned in Example 3, absence of initial information on the values of both $\mu$ and $\sigma$ is formally described by the improper prior function $\pi(\mu,\sigma) = \sigma^{-1}$, and this leads to the joint posterior density (13). The corresponding (posterior) predictive distribution is

(20) $p(x\mid D) = \int_0^{\infty}\!\!\int_{-\infty}^{\infty} \mathrm{N}(x\mid\mu,\sigma)\,\pi(\mu,\sigma\mid D)\,d\mu\,d\sigma = \mathrm{St}\Big(x\,\Big|\,\bar{x},\; s\sqrt{\tfrac{n+1}{n-1}},\; n-1\Big).$

If $\mu$ is known to be positive, the appropriate prior function will be the restricted function

(21) $\pi(\mu,\sigma) = \begin{cases}\sigma^{-1} & \text{if } \mu > 0,\\ 0 & \text{otherwise.}\end{cases}$

However, the result in equation (20) will still hold, provided the likelihood function $p(D\mid\mu,\sigma)$ is concentrated on positive $\mu$ values. Suppose, for example, that in the firm producing automobile restraint systems, the observed breaking strengths of $n = 10$ randomly chosen safety belt webbings have mean $\bar{x} = 28.011$ kN and standard deviation $s = 0.443$ kN, and that the relevant engineering specification requires breaking strengths to be larger than 26 kN. If data may truly be assumed to be a random sample from a normal distribution, the likelihood function is only appreciable for positive $\mu$ values and, if only the information provided by this small sample is to be used, the quality engineer may claim that the probability that a safety belt randomly chosen from the same batch as the sample tested would satisfy the required specification is $\Pr(x > 26\mid D) = 0.9987$. Besides, if production conditions remain constant, 99.87% of the safety belt webbings may be expected to have acceptable breaking strengths.
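The quoted probability follows directly from the Student predictive distribution of equation (20). The sketch below, added for this edition and relying on the summaries quoted in the text, evaluates it with scipy.

```python
import numpy as np
from scipy import stats

n, xbar, s = 10, 28.011, 0.443                    # summaries quoted in the text (kN)
scale = s * np.sqrt((n + 1) / (n - 1))            # Student scale in equation (20)
pred = stats.t(df=n - 1, loc=xbar, scale=scale)   # posterior predictive St(x | xbar, scale, n - 1)

print(pred.sf(26.0))                              # Pr(x > 26 | D), reported in the text as 0.9987
```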

3.3 Asymptotic Behaviour

The behaviour of posterior distributions when the sample size is large is now considered. This is important for, at least, two different reasons: (i) asymptotic results provide useful first-order approximations when actual samples are relatively large, and (ii) objective Bayesian methods typically depend on the asymptotic properties of the assumed model. Let $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, $\mathbf{x}_i \in \mathcal{X}$, be a random sample of size $n$ from $\{p(\mathbf{x}\mid\omega),\ \omega\in\Omega\}$. It may be shown that, as $n\to\infty$, the posterior distribution of a discrete parameter $\omega$ typically converges to a degenerate distribution which gives probability one to the true value of $\omega$, and that the posterior distribution of a continuous parameter $\omega$ typically converges to a normal distribution centered at its maximum likelihood estimate $\hat{\omega}$ (MLE), with a variance matrix which decreases with $n$ as $1/n$.

Consider first the situation where $\Omega = \{\omega_1, \omega_2, \ldots\}$ consists of a countable (possibly infinite) set of values, such that the probability model which corresponds to the true parameter value $\omega_t$ is distinguishable from the others in the sense that the logarithmic divergence $k\{p(\mathbf{x}\mid\omega_i)\mid p(\mathbf{x}\mid\omega_t)\}$ of each of the $p(\mathbf{x}\mid\omega_i)$ from $p(\mathbf{x}\mid\omega_t)$ is strictly positive. Taking logarithms in Bayes' theorem, defining $z_j = \log[p(\mathbf{x}_j\mid\omega_i)/p(\mathbf{x}_j\mid\omega_t)]$, $j = 1, \ldots, n$, and using the strong law of large numbers on the $n$ conditionally independent and identically distributed random quantities $z_1, \ldots, z_n$, it may be shown that

(22) $\lim_{n\to\infty} \Pr(\omega_t\mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = 1, \qquad \lim_{n\to\infty} \Pr(\omega_i\mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = 0, \quad i \neq t.$

Thus, under appropriate regularity conditions, the posterior probability of the trueparameter value converges to one as the sample size grows.

Consider now the situation where $\omega$ is a $k$-dimensional continuous parameter. Expressing Bayes' theorem as $\pi(\omega\mid \mathbf{x}_1,\ldots,\mathbf{x}_n) \propto \exp\{\log[\pi(\omega)] + \sum_{j=1}^{n}\log[p(\mathbf{x}_j\mid\omega)]\}$, expanding $\sum_j \log[p(\mathbf{x}_j\mid\omega)]$ about its maximum (the MLE $\hat{\omega}$), and assuming regularity conditions (to ensure that terms of order higher than quadratic may be ignored and that the sum of the terms from the likelihood will dominate the term from the prior), it is found that the posterior density of $\omega$ is the approximate $k$-variate normal

(23) $\pi(\omega\mid \mathbf{x}_1,\ldots,\mathbf{x}_n) \approx \mathrm{N}_k\{\hat{\omega},\, S(D,\hat{\omega})\}, \qquad S^{-1}(D,\hat{\omega})_{ij} = -\sum_{l=1}^{n}\left.\dfrac{\partial^2\log p(\mathbf{x}_l\mid\omega)}{\partial\omega_i\,\partial\omega_j}\right|_{\omega=\hat{\omega}}.$

A simpler, but somewhat poorer, approximation may be obtained by using the strong law of large numbers on the sums in (23) to establish that $S^{-1}(D,\hat{\omega}) \approx n\,F(\hat{\omega})$, where $F(\omega)$ is Fisher's information matrix, with general element

(24) $F_{ij}(\omega) = -\int_{\mathcal{X}} p(\mathbf{x}\mid\omega)\,\dfrac{\partial^2\log p(\mathbf{x}\mid\omega)}{\partial\omega_i\,\partial\omega_j}\,d\mathbf{x},$

so that

(25) $\pi(\omega\mid \mathbf{x}_1,\ldots,\mathbf{x}_n) \approx \mathrm{N}_k(\omega\mid\hat{\omega},\, n^{-1}F^{-1}(\hat{\omega})).$

Thus, under appropriate regularity conditions, the posterior probability density of the parameter vector $\omega$ approaches, as the sample size grows, a multivariate normal density centered at the MLE $\hat{\omega}$, with a variance matrix which decreases with $n$ as $n^{-1}$.

EXAMPLE 2, continued. Asymptotic approximation with binomial data. Let $D = (x_1, \ldots, x_n)$ consist of $n$ independent Bernoulli trials with parameter $\theta$, so that $p(D\mid\theta, n) = \theta^{r}(1-\theta)^{n-r}$. This likelihood function is maximized at $\hat{\theta} = r/n$, and Fisher's information function is $F(\theta) = \theta^{-1}(1-\theta)^{-1}$. Thus, using the results above, the posterior distribution of $\theta$ will be the approximate normal

(26) $\pi(\theta\mid r, n) \approx \mathrm{N}(\theta\mid\hat{\theta},\, s(\hat{\theta})/\sqrt{n}), \qquad s(\theta) = \{\theta(1-\theta)\}^{1/2},$

with mean $\hat{\theta} = r/n$ and variance $\hat{\theta}(1-\hat{\theta})/n$. This will provide a reasonable approximation to the exact posterior if (i) the prior $\pi(\theta)$ is relatively "flat" in the region where the likelihood function matters, and (ii) both $r$ and $n$ are moderately large. If, say, $n = 1500$ and $r = 720$, this leads to $\pi(\theta\mid D) \approx \mathrm{N}(\theta\mid 0.480, 0.013)$, and to $\Pr(\theta < 0.5\mid D) \approx 0.940$, which may be compared with the exact value $\Pr(\theta < 0.5\mid D) = 0.933$ obtained from the posterior distribution which corresponds to the prior $\mathrm{Be}(\theta\mid 50, 50)$.
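The two probabilities quoted above are easy to check numerically. The following sketch, added for this edition, compares the asymptotic normal approximation with the exact Beta posterior that follows from the $\mathrm{Be}(\theta\mid 50, 50)$ prior.

```python
from scipy import stats

n, r = 1500, 720
theta_hat = r / n
se = (theta_hat * (1 - theta_hat) / n) ** 0.5     # s(theta_hat) / sqrt(n)

approx = stats.norm(theta_hat, se).cdf(0.5)       # asymptotic approximation, about 0.940
exact = stats.beta(r + 50, n - r + 50).cdf(0.5)   # exact posterior from Be(theta | 50, 50), about 0.933
print(round(approx, 3), round(exact, 3))
```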

It follows from the joint posterior asymptotic behaviour of $\omega$ and from the properties of the multivariate normal distribution that, if the parameter vector is decomposed into $\omega = (\theta,\lambda)$, and Fisher's information matrix is correspondingly partitioned, so that

(27) $F(\omega) = F(\theta,\lambda) = \begin{pmatrix} F_{\theta\theta}(\theta,\lambda) & F_{\theta\lambda}(\theta,\lambda)\\ F_{\lambda\theta}(\theta,\lambda) & F_{\lambda\lambda}(\theta,\lambda)\end{pmatrix}$

and

(28) $S(\theta,\lambda) = F^{-1}(\theta,\lambda) = \begin{pmatrix} S_{\theta\theta}(\theta,\lambda) & S_{\theta\lambda}(\theta,\lambda)\\ S_{\lambda\theta}(\theta,\lambda) & S_{\lambda\lambda}(\theta,\lambda)\end{pmatrix},$

then the marginal posterior distribution of $\theta$ will be

(29) $\pi(\theta\mid D) \approx \mathrm{N}\{\theta\mid\hat{\theta},\, n^{-1}S_{\theta\theta}(\hat{\theta},\hat{\lambda})\},$

while the conditional posterior distribution of $\lambda$ given $\theta$ will be

(30) $\pi(\lambda\mid\theta, D) \approx \mathrm{N}\{\lambda\mid \hat{\lambda} - F^{-1}_{\lambda\lambda}(\hat{\theta},\hat{\lambda})\,F_{\lambda\theta}(\hat{\theta},\hat{\lambda})\,(\theta - \hat{\theta}),\; n^{-1}F^{-1}_{\lambda\lambda}(\hat{\theta},\hat{\lambda})\}.$


Notice that $F^{-1}_{\lambda\lambda} = S_{\lambda\lambda}$ if (and only if) $F$ is block diagonal, i.e. if (and only if) $\theta$ and $\lambda$ are asymptotically independent.

EXAMPLE 3, continued. Asymptotic approximation with normal data. Let $D = (x_1, \ldots, x_n)$ be a random sample from a normal distribution $\mathrm{N}(x\mid\mu,\sigma)$. The corresponding likelihood function $p(D\mid\mu,\sigma)$ is maximized at $(\hat{\mu}, \hat{\sigma}) = (\bar{x}, s)$, and Fisher's information matrix is diagonal, with $F_{\mu\mu} = \sigma^{-2}$. Hence, the posterior distribution of $\mu$ is approximately $\mathrm{N}(\mu\mid\bar{x}, s/\sqrt{n})$; this may be compared with the exact result $\pi(\mu\mid D) = \mathrm{St}(\mu\mid\bar{x}, s/\sqrt{n-1}, n-1)$ obtained previously under the assumption of no prior knowledge.

4 REFERENCE ANALYSIS

Under the Bayesian paradigm, the outcome of any inference problem (the posterior distribution of the quantity of interest) combines the information provided by the data with relevant available prior information. In many situations, however, either the available prior information on the quantity of interest is too vague to warrant the effort required to have it formalized in the form of a probability distribution, or it is too subjective to be useful in scientific communication or public decision making. It is therefore important to be able to identify the mathematical form of a "noninformative" prior, a prior that would have a minimal effect, relative to the data, on the posterior inference. More formally, suppose that the probability mechanism which has generated the available data $D$ is assumed to be $p(D\mid\omega)$, for some $\omega\in\Omega$, and that the quantity of interest is some real-valued function $\theta = \theta(\omega)$ of the model parameter $\omega$. Without loss of generality, it may be assumed that the probability model is of the form

(31) M = {p(D|!,"), D # D, ! # $," # '}

p(D|!,"), where " is some appropriately chosen nuisance parameter vector. Asdescribed in Section 3, to obtain the required posterior distribution of the quantityof interest &(!|D) it is necessary to specify a joint prior &(!,"). It is now requiredto identify the form of that joint prior &!(!,"|M,P), the !-reference prior, whichwould have a minimal e!ect on the corresponding posterior distribution of !,

(32) &(!|D) $(

!p(D|!,")&!(!,"|M,P) d",

within the class P of all the prior disributions compatible with whatever informa-tion about (!,") one is prepared to assume, which may just be the class P0 ofall strictly positive priors. To simplify the notation, when there is no danger ofconfusion the reference prior &!(!,"|M,P) is often simply denoted by &(!,"), butits dependence on the quantity of interest !, the assumed model M and the classP of priors compatible with assumed knowledge, should always be kept in mind.

To use a conventional expression, the reference prior "would let the data speak for themselves" about the likely value of $\theta$. Properly defined, reference posterior distributions have an important role to play in scientific communication, for they provide the answer to a central question in the sciences: conditional on the assumed model $p(D\mid\theta,\lambda)$, and on any further assumptions of the value of $\theta$ on which there might be universal agreement, the reference posterior $\pi(\theta\mid D)$ should specify what could be said about $\theta$ if the only available information about $\theta$ were some well-documented data $D$ and whatever information (if any) one is prepared to assume by restricting the prior to belong to an appropriate class $\mathcal{P}$.

Much work has been done to formulate "reference" priors which would make the idea described above mathematically precise. For historical details, see [Bernardo and Smith, 1994, Sec. 5.6.2; Kass and Wasserman, 1996; Bernardo, 2005a] and references therein. This section concentrates on an approach that is based on information theory to derive reference distributions, which may be argued to provide the most advanced general procedure available; this was initiated by Bernardo [1979b; 1981] and further developed by Berger and Bernardo [1989; 1992a; 1992b; 1992c], and in [Bernardo, 1997; 2005a; Bernardo and Ramon, 1998; Berger et al., 2009] and references therein. In the formulation described below, far from ignoring prior knowledge, the reference posterior exploits certain well-defined features of a possible prior, namely those describing a situation where relevant knowledge about the quantity of interest (beyond that universally accepted, as specified by the choice of $\mathcal{P}$) may be held to be negligible compared to the information about that quantity which repeated experimentation (from a specific data generating mechanism $\mathcal{M}$) might possibly provide. Reference analysis is appropriate in contexts where the set of inferences which could be drawn in this possible situation is considered to be pertinent.

Any statistical analysis contains a fair number of subjective elements; theseinclude (among others) the data selected, the model assumptions, and the choiceof the quantities of interest. Reference analysis may be argued to provide an“objective” Bayesian solution to statistical inference problems in just the samesense that conventional statistical methods claim to be “objective”: in that thesolutions only depend on model assumptions and observed data.

4.1 Reference Distributions

One parameter. Consider the experiment which consists of the observation of data $D$, generated by a random mechanism $p(D\mid\theta)$ which only depends on a real-valued parameter $\theta\in\Theta$, and let $t = t(D)\in T$ be any sufficient statistic (which may well be the complete data set $D$). In Shannon's general information theory, the amount of information $I^{\theta}\{T, \pi(\theta)\}$ which may be expected to be provided by $D$, or (equivalently) by $t(D)$, about the value of $\theta$ is defined by

(33) $I^{\theta}\{T, \pi(\theta)\} = k\{p(t)\,\pi(\theta)\mid p(t\mid\theta)\,\pi(\theta)\} = E_t\left[\int_{\Theta}\pi(\theta\mid t)\,\log\frac{\pi(\theta\mid t)}{\pi(\theta)}\,d\theta\right],$

the expected logarithmic divergence of the prior from the posterior. This is naturally a functional of the prior $\pi(\theta)$: the larger the prior information, the smaller the information which the data may be expected to provide. The functional $I^{\theta}\{T, \pi(\theta)\}$ is concave, non-negative, and invariant under one-to-one transformations of $\theta$. Consider now the amount of information $I^{\theta}\{T^{k}, \pi(\theta)\}$ about $\theta$ which may be expected from the experiment which consists of $k$ conditionally independent replications $\{t_1, \ldots, t_k\}$ of the original experiment. As $k\to\infty$, such an experiment would provide any missing information about $\theta$ which could possibly be obtained within this framework; thus, as $k\to\infty$, the functional $I^{\theta}\{T^{k}, \pi(\theta)\}$ will approach the missing information about $\theta$ associated with the prior $\pi(\theta)$. Intuitively, a $\theta$-"noninformative" prior is one which maximizes the missing information about $\theta$. Formally, if $\pi_k(\theta)$ denotes the prior density which maximizes $I^{\theta}\{T^{k}, \pi(\theta)\}$ in the class $\mathcal{P}$ of prior distributions which are compatible with accepted assumptions on the value of $\theta$ (which may well be the class $\mathcal{P}_0$ of all strictly positive proper priors), then the $\theta$-reference prior $\pi(\theta\mid\mathcal{M},\mathcal{P})$ is the limit as $k\to\infty$ (in a sense to be made precise) of the sequence of priors $\{\pi_k(\theta),\ k = 1, 2, \ldots\}$.

Notice that this limiting procedure is not some kind of asymptotic approximation, but an essential element of the definition of a reference prior. In particular, this definition implies that reference distributions only depend on the asymptotic behaviour of the assumed probability model, a feature which actually simplifies their derivation.

EXAMPLE 6 Maximum entropy. If $\theta$ may only take a finite number of values, so that the parameter space is $\Theta = \{\theta_1, \ldots, \theta_m\}$ and $\pi(\theta) = \{p_1, \ldots, p_m\}$, with $p_i = \Pr(\theta = \theta_i)$, and there is no topology associated to the parameter space $\Theta$, so that the $\theta_i$'s are just labels with no quantitative meaning, then the missing information associated to $\{p_1, \ldots, p_m\}$ reduces to

(34) $\lim_{k\to\infty} I^{\theta}\{T^{k}, \pi(\theta)\} = H(p_1, \ldots, p_m) = -\sum_{i=1}^{m} p_i\log(p_i),$

that is, the entropy of the prior distribution $\{p_1, \ldots, p_m\}$.

Thus, in the non-quantitative finite case, the reference prior $\pi(\theta\mid\mathcal{M},\mathcal{P})$ is that with maximum entropy in the class $\mathcal{P}$ of priors compatible with accepted assumptions. Consequently, the reference prior algorithm contains "maximum entropy" priors as the particular case which obtains when the parameter space is a finite set of labels, the only case where the original concept of entropy as a measure of uncertainty is unambiguous and well-behaved. In particular, if $\mathcal{P}$ is the class $\mathcal{P}_0$ of all priors over $\{\theta_1, \ldots, \theta_m\}$, then the reference prior is the uniform prior over the set of possible $\theta$ values, $\pi(\theta\mid\mathcal{M},\mathcal{P}_0) = \{1/m, \ldots, 1/m\}$.

Formally, the reference prior function $\pi(\theta\mid\mathcal{M},\mathcal{P})$ of a univariate parameter $\theta$ is defined to be the limit of the sequence of the proper priors $\pi_k(\theta)$ which maximize $I^{\theta}\{T^{k}, \pi(\theta)\}$, in the precise sense that, for any value of the sufficient statistic $t = t(D)$, the reference posterior, the intrinsic¹ limit $\pi(\theta\mid t)$ of the corresponding sequence of posteriors $\{\pi_k(\theta\mid t)\}$, may be obtained from $\pi(\theta\mid\mathcal{M},\mathcal{P})$ by formal use of Bayes theorem, so that $\pi(\theta\mid t) \propto p(t\mid\theta)\,\pi(\theta\mid\mathcal{M},\mathcal{P})$.

1A sequence {"k(#|t)} of posterior distributions converges intrinsically to a limit "(#|t) if thesequence of expected intrinsic discrepancies Et[${"k(#|t),"(#|t)}] converges to 0, where ${p, q} =min{k(p|q), k(q|p)}, and k(p|q) =

R

! q(#) log[q(#)/p(#)]d#. For details, see [Berger et al., 2009].


Reference prior functions are often simply called reference priors, even thoughthey are usually not probability distributions. They should not be considered asexpressions of belief, but technical devices to obtain (proper) posterior distribu-tions which are a limiting form of the posteriors which could have been obtainedfrom possible prior beliefs which were relatively uninformative with respect tothe quantity of interest when compared with the information which data couldprovide.

If (i) the sufficient statistic $t = t(D)$ is a consistent estimator $\tilde{\theta}$ of a continuous parameter $\theta$, and (ii) the class $\mathcal{P}$ contains all strictly positive priors, then the reference prior may be shown to have a simple form in terms of any asymptotic approximation to the posterior distribution of $\theta$. Notice that, by construction, an asymptotic approximation to the posterior does not depend on the prior. Specifically, if the posterior density $\pi(\theta\mid D)$ has an asymptotic approximation of the form $\pi(\theta\mid\tilde{\theta}, n)$, the (unrestricted) reference prior is simply

(35) $\pi(\theta\mid\mathcal{M},\mathcal{P}_0) \propto \pi(\theta\mid\tilde{\theta}, n)\Big|_{\tilde{\theta}=\theta}.$

One-parameter reference priors are invariant under reparametrization; thus, if $\psi = \psi(\theta)$ is a piecewise one-to-one function of $\theta$, then the $\psi$-reference prior is simply the appropriate probability transformation of the $\theta$-reference prior.

EXAMPLE 7 Jeffreys' prior. If $\theta$ is univariate and continuous, and the posterior distribution of $\theta$ given $\{x_1, \ldots, x_n\}$ is asymptotically normal with standard deviation $s(\tilde{\theta})/\sqrt{n}$, then, using (35), the reference prior function is $\pi(\theta) \propto s(\theta)^{-1}$. Under regularity conditions (often satisfied in practice, see Section 3.3), the posterior distribution of $\theta$ is asymptotically normal with variance $n^{-1}F^{-1}(\hat{\theta})$, where $F(\theta)$ is Fisher's information function and $\hat{\theta}$ is the MLE of $\theta$. Hence, the reference prior function in these conditions is $\pi(\theta\mid\mathcal{M},\mathcal{P}_0) \propto F(\theta)^{1/2}$, which is known as Jeffreys' prior. It follows that the reference prior algorithm contains Jeffreys' priors as the particular case which obtains when the probability model only depends on a single continuous univariate parameter, there are regularity conditions to guarantee asymptotic normality, and there is no additional information, so that the class of possible priors is the set $\mathcal{P}_0$ of all strictly positive priors over $\Theta$. These are precisely the conditions under which there is general agreement on the use of Jeffreys' prior as a "noninformative" prior.
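A short numerical sketch, added for this edition, may clarify the construction: for the Bernoulli model the expected negative second derivative of the log-likelihood can be computed by direct enumeration, and its square root gives (up to a constant) Jeffreys' prior used in the next example.

```python
import numpy as np

def fisher_info_bernoulli(theta):
    """F(theta) = E[-d^2 log p(x|theta) / dtheta^2], by direct enumeration over x in {0, 1}."""
    neg_d2 = lambda x: x / theta**2 + (1 - x) / (1 - theta)**2
    return theta * neg_d2(1) + (1 - theta) * neg_d2(0)

theta = 0.3
print(fisher_info_bernoulli(theta), 1 / (theta * (1 - theta)))   # identical: 1 / (theta (1 - theta))
# Jeffreys' (reference) prior is proportional to fisher_info_bernoulli(theta) ** 0.5,
# i.e. theta^(-1/2) (1 - theta)^(-1/2), the kernel of the Be(theta | 1/2, 1/2) distribution.
```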

EXAMPLE 2, continued. Reference prior for a binomial parameter. Let data $D = \{x_1, \ldots, x_n\}$ consist of a sequence of $n$ independent Bernoulli trials, so that $p(x\mid\theta) = \theta^{x}(1-\theta)^{1-x}$, $x\in\{0, 1\}$; this is a regular, one-parameter continuous model, whose Fisher's information function is $F(\theta) = \theta^{-1}(1-\theta)^{-1}$. Thus, the reference prior $\pi(\theta)$ is proportional to $\theta^{-1/2}(1-\theta)^{-1/2}$, so that the reference prior is the (proper) Beta distribution $\mathrm{Be}(\theta\mid 1/2, 1/2)$. Since the reference algorithm is invariant under reparametrization, the reference prior of $\phi(\theta) = 2\arcsin\sqrt{\theta}$ is $\pi(\phi) = \pi(\theta)/|\partial\phi/\partial\theta| = 1$; thus, the reference prior is uniform on the variance-stabilizing transformation $\phi(\theta) = 2\arcsin\sqrt{\theta}$, a feature generally true under regularity conditions.



Figure 4. Posterior distribution of the proportion of infected people in the population, given the results of $n = 100$ tests, none of which were positive.

In terms of $\theta$, the reference posterior is $\pi(\theta\mid D) = \pi(\theta\mid r, n) = \mathrm{Be}(\theta\mid r + 1/2,\, n - r + 1/2)$, where $r = \sum_j x_j$ is the number of positive trials.

Suppose, for example, that $n = 100$ randomly selected people have been tested for an infection and that all tested negative, so that $r = 0$. The reference posterior distribution of the proportion $\theta$ of people infected is then the Beta distribution $\mathrm{Be}(\theta\mid 0.5, 100.5)$, represented in Figure 4. It may well be known that the infection was rare, leading to the assumption that $\theta < \theta_0$, for some upper bound $\theta_0$; the (restricted) reference prior would then be of the form $\pi(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$ if $\theta < \theta_0$, and zero otherwise. However, provided the likelihood is concentrated in the region $\theta < \theta_0$, the corresponding posterior would virtually be identical to $\mathrm{Be}(\theta\mid 0.5, 100.5)$. Thus, just on the basis of the observed experimental results, one may claim that the proportion of infected people is surely smaller than 5% (for the reference posterior probability of the event $\theta > 0.05$ is 0.001), that $\theta$ is smaller than 0.01 with probability 0.844 (area of the shaded region in Figure 4), that it is equally likely to be over or below 0.23% (for the median, represented by a vertical line, is 0.0023), and that the probability that a person randomly chosen from the population is infected is 0.005 (the posterior mean, represented in the figure by a black circle), since $\Pr(x = 1\mid r, n) = E[\theta\mid r, n] = 0.005$. If a particular point estimate of $\theta$ is required (say a number to be quoted in the summary headline), the intrinsic estimator suggests itself (see Section 5); this is found to be $\theta^* = 0.0032$ (represented in the figure with a white circle). Notice that the traditional solution to this problem, based on the asymptotic behaviour of the MLE, here $\hat{\theta} = r/n = 0$ for any $n$, makes absolutely no sense in this scenario.

One nuisance parameter. The extension of the reference prior algorithm to the case of two parameters follows the usual mathematical procedure of reducing the problem to a sequential application of the established procedure for the single parameter case. Thus, if the probability model is $p(t\mid\theta,\lambda)$, $\theta\in\Theta$, $\lambda\in\Lambda$, and a $\theta$-reference prior $\pi_\theta(\theta,\lambda\mid\mathcal{M},\mathcal{P})$ is required, the reference algorithm proceeds in two steps:

(i) Conditional on $\theta$, $p(t\mid\theta,\lambda)$ only depends on the nuisance parameter $\lambda$ and, hence, the one-parameter algorithm may be used to obtain the conditional reference prior $\pi(\lambda\mid\theta, \mathcal{M},\mathcal{P})$.

(ii) If &("|!,M,P) is proper, this may be used to integrate out the nuisance pa-rameter thus obtaining the one-parameter integrated model p(t|!) =!! p(t|!,")&("|!,M,P) d", to which the one-parameter algorithm may be

applied again to obtain &(!|M,P). The !-reference prior is then&!(!,"|M,P) = &("|!,M,P)&(!|M,P), and the required reference poste-rior is &(!|t) $ p(t|!)&(!|M,P).

If the conditional reference prior is not proper, then the procedure is performed within an increasing sequence $\{\Lambda_i\}$ of subsets converging to $\Lambda$ over which $\pi(\lambda\mid\theta)$ is integrable. This makes it possible to obtain a corresponding sequence of $\theta$-reference posteriors $\{\pi_i(\theta\mid t)\}$ for the quantity of interest $\theta$, and the required reference posterior is the corresponding intrinsic limit $\pi(\theta\mid t) = \lim_i \pi_i(\theta\mid t)$.

A $\theta$-reference prior is then defined as a positive function $\pi_\theta(\theta,\lambda)$ which may be formally used in Bayes' theorem as a prior to obtain the reference posterior, i.e. such that, for any sufficient statistic $t\in T$ (which may well be the whole data set $D$), $\pi(\theta\mid t) \propto \int_{\Lambda} p(t\mid\theta,\lambda)\,\pi_\theta(\theta,\lambda)\,d\lambda$. The approximating sequences should be consistently chosen within a given model. Thus, given a probability model $\{p(\mathbf{x}\mid\omega),\ \omega\in\Omega\}$, an appropriate approximating sequence $\{\Omega_i\}$ should be chosen for the whole parameter space $\Omega$. Thus, if the analysis is done in terms of, say, $\psi = \{\psi_1,\psi_2\}\in\Psi(\Omega)$, the approximating sequence should be chosen such that $\Psi_i = \psi(\Omega_i)$. A natural approximating sequence in location-scale problems is $\{\mu, \log\sigma\}\in[-i, i]^2$.

The $\theta$-reference prior does not depend on the choice of the nuisance parameter $\lambda$; thus, for any $\psi = \psi(\theta,\lambda)$ such that $(\theta,\psi)$ is a one-to-one function of $(\theta,\lambda)$, the $\theta$-reference prior in terms of $(\theta,\psi)$ is simply $\pi_\theta(\theta,\psi) = \pi_\theta(\theta,\lambda)/|\partial(\theta,\psi)/\partial(\theta,\lambda)|$, the appropriate probability transformation of the $\theta$-reference prior in terms of $(\theta,\lambda)$. Notice, however, that the reference prior may depend on the parameter of interest; thus, the $\theta$-reference prior may differ from the $\phi$-reference prior unless either $\phi$ is a piecewise one-to-one transformation of $\theta$, or $\phi$ is asymptotically independent of $\theta$. This is an expected consequence of the fact that the conditions under which the missing information about $\theta$ is maximized are not generally the same as the conditions which maximize the missing information about an arbitrary function $\phi = \phi(\theta,\lambda)$.

The non-existence of a unique "noninformative prior" which would be appropriate for any inference problem within a given model was established by Dawid, Stone and Zidek [1973], when they showed that this is incompatible with consistent marginalization. Indeed, if, given the model $p(D\mid\theta,\lambda)$, the reference posterior of the quantity of interest $\theta$, $\pi(\theta\mid D) = \pi(\theta\mid t)$, only depends on the data through a statistic $t$ whose sampling distribution, $p(t\mid\theta,\lambda) = p(t\mid\theta)$, only depends on $\theta$, one would expect the reference posterior to be of the form $\pi(\theta\mid t) \propto \pi(\theta)\,p(t\mid\theta)$ for some prior $\pi(\theta)$. However, examples were found where this cannot be the case if a unique joint "noninformative" prior were to be used for all possible quantities of interest.

EXAMPLE 8 Regular two dimensional continuous reference prior functions. If the joint posterior distribution of $(\theta,\lambda)$ is asymptotically normal, then the $\theta$-reference prior may be derived in terms of the corresponding Fisher's information matrix, $F(\theta,\lambda)$. Indeed, if

(36) $F(\theta,\lambda) = \begin{pmatrix} F_{\theta\theta}(\theta,\lambda) & F_{\theta\lambda}(\theta,\lambda)\\ F_{\theta\lambda}(\theta,\lambda) & F_{\lambda\lambda}(\theta,\lambda)\end{pmatrix}$, and $S(\theta,\lambda) = F^{-1}(\theta,\lambda),$

then the unrestricted $\theta$-reference prior is $\pi_\theta(\theta,\lambda\mid\mathcal{M},\mathcal{P}_0) = \pi(\lambda\mid\theta)\,\pi(\theta)$, where

(37) $\pi(\lambda\mid\theta) \propto F^{1/2}_{\lambda\lambda}(\theta,\lambda), \quad \lambda\in\Lambda.$

If $\pi(\lambda\mid\theta)$ is proper,

(38) $\pi(\theta) \propto \exp\left\{\int_{\Lambda}\pi(\lambda\mid\theta)\,\log[S^{-1/2}_{\theta\theta}(\theta,\lambda)]\,d\lambda\right\}, \quad \theta\in\Theta.$

If $\pi(\lambda\mid\theta)$ is not proper, integrations are performed on an approximating sequence $\{\Lambda_i\}$ to obtain a sequence $\{\pi_i(\lambda\mid\theta)\,\pi_i(\theta)\}$ (where $\pi_i(\lambda\mid\theta)$ is the proper renormalization of $\pi(\lambda\mid\theta)$ to $\Lambda_i$), and the $\theta$-reference prior $\pi_\theta(\theta,\lambda)$ is defined as its appropriate limit. Moreover, if (i) both $F^{1/2}_{\lambda\lambda}(\theta,\lambda)$ and $S^{-1/2}_{\theta\theta}(\theta,\lambda)$ factorize, so that

(39) $S^{-1/2}_{\theta\theta}(\theta,\lambda) \propto f_\theta(\theta)\,g_\theta(\lambda), \qquad F^{1/2}_{\lambda\lambda}(\theta,\lambda) \propto f_\lambda(\theta)\,g_\lambda(\lambda),$

and (ii) the parameters $\theta$ and $\lambda$ are variation independent, so that $\Lambda$ does not depend on $\theta$, then the $\theta$-reference prior is simply $\pi_\theta(\theta,\lambda) = f_\theta(\theta)\,g_\lambda(\lambda)$, even if the conditional reference prior $\pi(\lambda\mid\theta) = \pi(\lambda) \propto g_\lambda(\lambda)$ (which will not depend on $\theta$) is actually improper.

EXAMPLE 3, continued. Reference priors for the normal model. The information matrix which corresponds to a normal model $\mathrm{N}(x\mid\mu,\sigma)$ is

(40) $F(\mu,\sigma) = \begin{pmatrix}\sigma^{-2} & 0\\ 0 & 2\sigma^{-2}\end{pmatrix}, \qquad S(\mu,\sigma) = F^{-1}(\mu,\sigma) = \begin{pmatrix}\sigma^{2} & 0\\ 0 & \tfrac{1}{2}\sigma^{2}\end{pmatrix};$

hence $F^{1/2}_{\sigma\sigma}(\mu,\sigma) = \sqrt{2}\,\sigma^{-1} = f_\sigma(\mu)\,g_\sigma(\sigma)$, with $g_\sigma(\sigma) = \sigma^{-1}$, and thus $\pi(\sigma\mid\mu) = \sigma^{-1}$. Similarly, $S^{-1/2}_{\mu\mu}(\mu,\sigma) = \sigma^{-1} = f_\mu(\mu)\,g_\mu(\sigma)$, with $f_\mu(\mu) = 1$, and thus $\pi(\mu) = 1$. Therefore, the $\mu$-reference prior is $\pi_\mu(\mu,\sigma\mid\mathcal{M},\mathcal{P}_0) = \pi(\sigma\mid\mu)\,\pi(\mu) = \sigma^{-1}$, as already anticipated. Moreover, as one would expect from the fact that $F(\mu,\sigma)$ is diagonal, and also anticipated, it is similarly found that the $\sigma$-reference prior is $\pi_\sigma(\mu,\sigma\mid\mathcal{M},\mathcal{P}_0) = \sigma^{-1}$, the same as before.


Suppose, however, that the quantity of interest is not the mean $\mu$ or the standard deviation $\sigma$, but the standardized mean $\phi = \mu/\sigma$. Fisher's information matrix in terms of the parameters $\phi$ and $\sigma$ is $F(\phi,\sigma) = J^{t}\,F(\mu,\sigma)\,J$, where $J = (\partial(\mu,\sigma)/\partial(\phi,\sigma))$ is the Jacobian of the inverse transformation; this yields

(41) $F(\phi,\sigma) = \begin{pmatrix}1 & \phi\sigma^{-1}\\ \phi\sigma^{-1} & \sigma^{-2}(2+\phi^{2})\end{pmatrix}, \qquad S(\phi,\sigma) = \begin{pmatrix}1 + \tfrac{1}{2}\phi^{2} & -\tfrac{1}{2}\phi\sigma\\ -\tfrac{1}{2}\phi\sigma & \tfrac{1}{2}\sigma^{2}\end{pmatrix}.$

Thus, $S^{-1/2}_{\phi\phi}(\phi,\sigma) \propto (1 + \tfrac{1}{2}\phi^{2})^{-1/2}$ and $F^{1/2}_{\sigma\sigma}(\phi,\sigma) \propto \sigma^{-1}(2 + \phi^{2})^{1/2}$. Hence, using again the results in Example 8, $\pi_\phi(\phi,\sigma\mid\mathcal{M},\mathcal{P}_0) = (1 + \tfrac{1}{2}\phi^{2})^{-1/2}\sigma^{-1}$. In the original parametrization, this is $\pi_\phi(\mu,\sigma\mid\mathcal{M},\mathcal{P}_0) = (1 + \tfrac{1}{2}(\mu/\sigma)^{2})^{-1/2}\sigma^{-2}$, which is very different from $\pi_\mu(\mu,\sigma\mid\mathcal{M},\mathcal{P}_0) = \pi_\sigma(\mu,\sigma\mid\mathcal{M},\mathcal{P}_0) = \sigma^{-1}$. The corresponding reference posterior of $\phi$ is $\pi(\phi\mid x_1,\ldots,x_n) \propto (1 + \tfrac{1}{2}\phi^{2})^{-1/2}\,p(t\mid\phi)$, where $t = (\sum_j x_j)/(\sum_j x_j^{2})^{1/2}$, a one-dimensional (marginally sufficient) statistic whose sampling distribution, $p(t\mid\mu,\sigma) = p(t\mid\phi)$, only depends on $\phi$. Thus, the reference prior algorithm is seen to be consistent under marginalization.

Many parameters. The reference algorithm is easily generalized to an arbitrary number of parameters. If the model is $p(t\mid\omega_1,\ldots,\omega_m)$, a joint reference prior

(42) $\pi(\theta_m\mid\theta_{m-1},\ldots,\theta_1)\times\cdots\times\pi(\theta_2\mid\theta_1)\times\pi(\theta_1)$

may sequentially be obtained for each ordered parametrization $\{\theta_1(\omega),\ldots,\theta_m(\omega)\}$ of interest, and these are invariant under reparametrization of any of the $\theta_i(\omega)$'s. The choice of the ordered parametrization $\{\theta_1,\ldots,\theta_m\}$ precisely describes the particular prior required, namely that which sequentially maximizes the missing information about each of the $\theta_i$'s, conditional on $\{\theta_1,\ldots,\theta_{i-1}\}$, for $i = m, m-1, \ldots, 1$.

EXAMPLE 9 Stein’s paradox. Let D be a random sample from a m-variatenormal distribution with mean µ = {µ1, . . . , µm} and unitary variance matrix. Thereference prior which corresponds to any permutation of the µi’s is uniform, andthis prior leads indeed to appropriate reference posterior distributions for any ofthe µi’s, namely &(µi|D) = N(µi|xi, 1/

)n). Suppose, however, that the quantity

of interest is ! =$

i µ2i , the distance of µ to the origin. As showed by Stein

[1959], the posterior distribution of ! based on that uniform prior (or in any “flat”proper approximation) has very undesirable properties; this is due to the fact that auniform (or nearly uniform) prior, although “noninformative” with respect to eachof the individual µi’s, is actually highly informative on the sum of their squares,introducing a severe positive bias (Stein’s paradox). However, the reference priorwhich corresponds to a parametrization of the form {!,"1, . . . ,"m!1} produces,for any choice of the nuisance parameters "i = "i(µ), the reference posterior&(!|D) = &(!|t) $ !!1/2/2(nt|m,n!), where t =

$i x2

i , and this posterior isshown to have the appropriate consistency properties.

Far from being specific to Stein's example, the inappropriate behaviour in problems with many parameters of specific marginal posterior distributions derived from multivariate "flat" priors (proper or improper) is indeed very frequent. Hence, sloppy, uncontrolled use of "flat" priors (rather than the relevant reference priors) is very strongly discouraged.

Limited information. Although often used in contexts where no universally agreed prior knowledge about the quantity of interest is available, the reference algorithm may be used to specify a prior which incorporates any acceptable prior knowledge; it suffices to maximize the missing information within the class $\mathcal{P}$ of priors which is compatible with such accepted knowledge. Indeed, by progressive incorporation of further restrictions into $\mathcal{P}$, the reference prior algorithm becomes a method of (prior) probability assessment. As described below, the problem has a fairly simple analytical solution when those restrictions take the form of known expected values. The incorporation of other types of restrictions usually involves numerical computations.

EXAMPLE 10 Univariate restricted reference priors. If the probability mechanism which is assumed to have generated the available data only depends on a univariate continuous parameter $\theta\in\Theta\subset\mathbb{R}$, and the class $\mathcal{P}$ of acceptable priors is a class of proper priors which satisfies some expected value restrictions, so that

(43) $\mathcal{P} = \Big\{\pi(\theta);\ \pi(\theta) > 0,\ \int_{\Theta}\pi(\theta)\,d\theta = 1,\ \int_{\Theta} g_i(\theta)\,\pi(\theta)\,d\theta = \beta_i,\ i = 1,\ldots,m\Big\},$

then the (restricted) reference prior is

(44) $\pi(\theta\mid\mathcal{M},\mathcal{P}) \propto \pi(\theta\mid\mathcal{M},\mathcal{P}_0)\,\exp\Big\{\textstyle\sum_{i=1}^{m}\gamma_i\, g_i(\theta)\Big\},$

where $\pi(\theta\mid\mathcal{M},\mathcal{P}_0)$ is the unrestricted reference prior and the $\gamma_i$'s are constants (the corresponding Lagrange multipliers), to be determined by the restrictions which define $\mathcal{P}$. Suppose, for instance, that data are considered to be a random sample from a location model centered at $\theta$, and that it is further assumed that $E[\theta] = \mu_0$ and that $\mathrm{Var}[\theta] = \sigma_0^{2}$. The unrestricted reference prior for any regular location problem may be shown to be uniform, so that here $\pi(\theta\mid\mathcal{M},\mathcal{P}_0) = 1$. Thus, the restricted reference prior must be of the form $\pi(\theta\mid\mathcal{M},\mathcal{P}) \propto \exp\{\gamma_1\theta + \gamma_2(\theta-\mu_0)^{2}\}$, with $\int_{\Theta}\theta\,\pi(\theta\mid\mathcal{M},\mathcal{P})\,d\theta = \mu_0$ and $\int_{\Theta}(\theta-\mu_0)^{2}\,\pi(\theta\mid\mathcal{M},\mathcal{P})\,d\theta = \sigma_0^{2}$. Hence, $\pi(\theta\mid\mathcal{M},\mathcal{P})$ is the normal distribution with the specified mean and variance, $\mathrm{N}(\theta\mid\mu_0,\sigma_0)$.

4.2 Frequentist Properties

Bayesian methods provide a direct solution to the problems typically posed instatistical inference; indeed, posterior distributions precisely state what can be saidabout unknown quantities of interest given available data and prior knowledge. Inparticular, unrestricted reference posterior distributions state what could be saidif no prior knowledge about the quantities of interest were available.


A frequentist analysis of the behaviour of Bayesian procedures under repeated sampling may, however, be illuminating, for this provides some interesting connections between frequentist and Bayesian inference. It is found that the frequentist properties of Bayesian reference procedures are typically excellent, and may be used to provide a form of calibration for reference posterior probabilities.

Point Estimation. It is generally accepted that, as the sample size increases, a "good" estimator $\tilde{\theta}$ of $\theta$ ought to get the correct value of $\theta$ eventually, that is, to be consistent. Under appropriate regularity conditions, any Bayes estimator $\phi^*$ of any function $\phi(\theta)$ converges in probability to $\phi(\theta)$, so that sequences of Bayes estimators are typically consistent. Indeed, it is known that if there is a consistent sequence of estimators, then Bayes estimators are consistent. The rate of convergence is often best for reference Bayes estimators.

It is also generally accepted that a "good" estimator should be admissible, that is, not dominated by any other estimator, in the sense that its expected loss under sampling (conditional on $\theta$) cannot be larger for all $\theta$ values than that corresponding to another estimator. Any proper Bayes estimator is admissible; moreover, as established by Wald [1950], a procedure must be Bayesian (proper or improper) to be admissible. Most published admissibility results refer to quadratic loss functions, but they often extend to more general loss functions. Reference Bayes estimators are typically admissible with respect to appropriate loss functions.

Notice, however, that many other apparently intuitive frequentist ideas on estimation have been proved to be potentially misleading. For example, given a sequence of $n$ Bernoulli observations with parameter $\theta$ resulting in $r$ positive trials, the best unbiased estimate of $\theta^{2}$ is found to be $r(r-1)/\{n(n-1)\}$, which yields an estimate of zero when $r = 1$; but to estimate the probability of two positive trials as zero, when one positive trial has been observed, is less than sensible. In marked contrast, any Bayes reference estimator provides a reasonable answer. For example, the intrinsic estimator of $\theta^{2}$ is simply $(\theta^*)^{2}$, where $\theta^*$ is the intrinsic estimator of $\theta$ described in Section 5.1. In particular, if $r = 1$ and $n = 2$ the intrinsic estimator of $\theta^{2}$ is (as one would naturally expect) $(\theta^*)^{2} = 1/4$.

Interval Estimation. As the sample size increases, the frequentist coverage probability of a posterior $q$-credible region typically converges to $q$ so that, for large samples, Bayesian credible intervals may (under regularity conditions) be interpreted as approximate frequentist confidence regions: under repeated sampling, a Bayesian $q$-credible region of $\theta$ based on a large sample will cover the true value of $\theta$ approximately $100q\%$ of the time. Detailed results are readily available for univariate problems. For instance, consider the probability model $\{p(D\mid\omega),\ \omega\in\Omega\}$, let $\theta = \theta(\omega)$ be any univariate quantity of interest, and let $t = t(D)\in T$ be any sufficient statistic. If $\theta_q(t)$ denotes the $100q\%$ quantile of the posterior distribution of $\theta$ which corresponds to some unspecified prior, so that

(45) $\Pr[\theta \le \theta_q(t)\mid t] = \int_{\theta\le\theta_q(t)}\pi(\theta\mid t)\,d\theta = q,$


then the coverage probability of the $q$-credible interval $\{\theta;\ \theta\le\theta_q(t)\}$,

(46) $\Pr[\theta_q(t)\ge\theta\mid\omega] = \int_{\{t;\,\theta_q(t)\ge\theta\}} p(t\mid\omega)\,dt,$

is such that

(47) $\Pr[\theta_q(t)\ge\theta\mid\omega] = \Pr[\theta\le\theta_q(t)\mid t] + O(n^{-1/2}).$

This asymptotic approximation is true for all (sufficiently regular) positive priors. However, the approximation is better, actually $O(n^{-1})$, for a particular class of priors known as (first-order) probability matching priors. For details on probability matching priors see Datta and Sweeting [2005] and references therein. Reference priors are typically found to be probability matching priors, so that they provide this improved asymptotic agreement. As a matter of fact, the agreement (in regular problems) is typically quite good even for relatively small samples.

EXAMPLE 11 Product of normal means. Consider the case where independent random samples $\{x_1,\ldots,x_n\}$ and $\{y_1,\ldots,y_m\}$ have respectively been taken from the normal densities $\mathrm{N}(x\mid\omega_1, 1)$ and $\mathrm{N}(y\mid\omega_2, 1)$, and suppose that the quantity of interest is the product of their means, $\phi = \omega_1\omega_2$ (for instance, one may be interested in inferences about the area $\phi$ of a rectangular piece of land, given measurements $\{x_i\}$ and $\{y_j\}$ of its sides). Notice that this is a simplified version of a problem that is often encountered in the sciences, where one is interested in the product of several magnitudes, all of which have been measured with error. Using the procedure described in Example 8, with the natural approximating sequence induced by $(\omega_1,\omega_2)\in[-i, i]^2$, the $\phi$-reference prior is found to be

(48) $\pi_\phi(\omega_1,\omega_2\mid\mathcal{M},\mathcal{P}_0) \propto (n\omega_1^{2} + m\omega_2^{2})^{-1/2},$

very different from the uniform prior $\pi_{\omega_1}(\omega_1,\omega_2\mid\mathcal{M},\mathcal{P}_0) = \pi_{\omega_2}(\omega_1,\omega_2\mid\mathcal{M},\mathcal{P}_0) = 1$ which should be used to make objective inferences about either $\omega_1$ or $\omega_2$. The prior $\pi_\phi(\omega_1,\omega_2)$ may be shown to provide approximate agreement between Bayesian credible regions and frequentist confidence intervals for $\phi$; indeed, this prior (with $m = n$) was originally suggested by Stein in the 1980s to obtain such approximate agreement. The same example was later used by Efron [1986] to stress the fact that, even within a fixed probability model $\{p(D\mid\omega),\ \omega\in\Omega\}$, the prior required to make objective inferences about some function of the parameters $\phi = \phi(\omega)$ must generally depend on the function $\phi$. For further details on the reference analysis of this problem, see [Berger and Bernardo, 1989].

The numerical agreement between reference Bayesian credible regions and frequentist confidence intervals is actually perfect in special circumstances. Indeed, as Lindley [1958] pointed out, this is the case in those problems of inference which may be transformed to location-scale problems.

EXAMPLE 3, continued. Inference on normal parameters. Let $D = \{x_1,\ldots,x_n\}$ be a random sample from a normal distribution $\mathrm{N}(x\mid\mu,\sigma)$. As mentioned before, the reference posterior of the quantity of interest $\mu$ is the Student distribution $\mathrm{St}(\mu\mid\bar{x}, s/\sqrt{n-1}, n-1)$. Thus, normalizing $\mu$, the posterior distribution of $t(\mu) = \sqrt{n-1}\,(\bar{x}-\mu)/s$, as a function of $\mu$ given $D$, is the standard Student $\mathrm{St}(t\mid 0, 1, n-1)$ with $n-1$ degrees of freedom. On the other hand, this function $t$ is recognized to be precisely the conventional $t$ statistic, whose sampling distribution is well known to also be standard Student with $n-1$ degrees of freedom. It follows that, for all sample sizes, posterior reference credible intervals for $\mu$ given the data will be numerically identical to frequentist confidence intervals based on the sampling distribution of $t$.

A similar result is obtained in inferences about the variance. Thus, the reference posterior distribution of $\lambda = \sigma^{-2}$ is the Gamma distribution $\mathrm{Ga}(\lambda\mid(n-1)/2,\, ns^{2}/2)$ and, hence, the posterior distribution of $r = ns^{2}/\sigma^{2}$, as a function of $\sigma^{2}$ given $D$, is a (central) $\chi^{2}$ with $n-1$ degrees of freedom. But the function $r$ is recognized to be a conventional statistic for this problem, whose sampling distribution is well known to also be $\chi^{2}$ with $n-1$ degrees of freedom. It follows that, for all sample sizes, posterior reference credible intervals for $\sigma^{2}$ (or any one-to-one function of $\sigma^{2}$) given the data will be numerically identical to frequentist confidence intervals based on the sampling distribution of $r$.

5 INFERENCE SUMMARIES

From a Bayesian viewpoint, the final outcome of a problem of inference about any unknown quantity is nothing but the corresponding posterior distribution. Thus, given some data $D$ and conditions $C$, all that can be said about any function $\omega$ of the parameters which govern the model is contained in the posterior distribution $\pi(\omega\mid D, C)$, and all that can be said about some function $y$ of future observations from the same model is contained in its posterior predictive distribution $p(y\mid D, C)$. Indeed, Bayesian inference may technically be described as a decision problem where the space of available actions is the class of those posterior probability distributions of the quantity of interest which are compatible with accepted assumptions.

However, to make it easier for the user to assimilate the appropriate conclusions, it is often convenient to summarize the information contained in the posterior distribution by (i) providing values of the quantity of interest which, in the light of the data, are likely to be "close" to its true value and by (ii) measuring the compatibility of the results with hypothetical values of the quantity of interest which might have been suggested in the context of the investigation. In this section, those Bayesian counterparts of traditional estimation and hypothesis testing problems are briefly considered.

5.1 Estimation

In one or two dimensions, a graph of the posterior probability density of the quantity of interest (or the probability mass function in the discrete case) immediately conveys an intuitive, "impressionist" summary of the main conclusions which may possibly be drawn on its value. Indeed, this is greatly appreciated by users, and may be quoted as an important asset of Bayesian methods. From a plot of its posterior density, the region where (given the data) a univariate quantity of interest is likely to lie is easily distinguished. For instance, all important conclusions about the value of the gravitational field in Example 3 are qualitatively available from Figure 3. However, this does not easily extend to more than two dimensions and, besides, quantitative conclusions (in a simpler form than that provided by the mathematical expression of the posterior distribution) are often required.

Point Estimation. Let $D$ be the available data, which are assumed to have been generated by a probability model $\{p(D\mid\omega),\ \omega\in\Omega\}$, and let $\theta = \theta(\omega)\in\Theta$ be the quantity of interest. A point estimator of $\theta$ is some function of the data $\tilde{\theta} = \tilde{\theta}(D)$ which could be regarded as an appropriate proxy for the actual, unknown value of $\theta$. Formally, to choose a point estimate for $\theta$ is a decision problem, where the action space is the class $\Theta$ of possible $\theta$ values. From a decision-theoretic perspective, to choose a point estimate $\tilde{\theta}$ of some quantity $\theta$ is a decision to act as though $\tilde{\theta}$ were $\theta$, not to assert something about the value of $\theta$ (although the desire to assert something simple may well be the reason to obtain an estimate). As prescribed by the foundations of decision theory (Section 2), to solve this decision problem it is necessary to specify a loss function $L(\tilde{\theta}, \theta)$ measuring the consequences of acting as if the true value of the quantity of interest were $\tilde{\theta}$, when it is actually $\theta$. The expected posterior loss if $\tilde{\theta}$ were used is

(49) $L[\tilde{\theta}\mid D] = \int_{\Theta} L(\tilde{\theta}, \theta)\,\pi(\theta\mid D)\,d\theta,$

and the corresponding Bayes estimator $\theta^*$ is that function of the data, $\theta^* = \theta^*(D)$, which minimizes this expectation.

EXAMPLE 12 Conventional Bayes estimators. For any given model and data, the Bayes estimator obviously depends on the chosen loss function. The loss function is context specific, and should be chosen in terms of the anticipated uses of the estimate; however, a number of conventional loss functions have been suggested for those situations where no particular uses are envisaged. These loss functions produce estimates which may be regarded as simple descriptions of the location of the posterior distribution. For example, if the loss function is quadratic, so that $L(\tilde{\theta},\theta) = (\tilde{\theta}-\theta)^{t}(\tilde{\theta}-\theta)$, then the Bayes estimator is the posterior mean $\theta^* = E[\theta\mid D]$, assuming that the mean exists. Similarly, if the loss function is a zero-one function, so that $L(\tilde{\theta},\theta) = 0$ if $\tilde{\theta}$ belongs to a ball of radius $\epsilon$ centered at $\theta$ and $L(\tilde{\theta},\theta) = 1$ otherwise, then the Bayes estimator $\theta^*$ tends to the posterior mode as the ball radius $\epsilon$ tends to zero, assuming that a unique mode exists. If $\theta$ is univariate and the loss function is linear, so that $L(\tilde{\theta},\theta) = c_1(\tilde{\theta}-\theta)$ if $\tilde{\theta}\ge\theta$, and $L(\tilde{\theta},\theta) = c_2(\theta-\tilde{\theta})$ otherwise, then the Bayes estimator is the posterior quantile of order $c_2/(c_1+c_2)$, so that $\Pr[\theta < \theta^*] = c_2/(c_1+c_2)$. In particular, if $c_1 = c_2$, the Bayes estimator is the posterior median. The results derived for linear loss functions clearly illustrate the fact that any possible parameter value may turn out to be the Bayes estimator: it all depends on the loss function describing the consequences of the anticipated uses of the estimate.
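The correspondence between conventional loss functions and posterior summaries is immediate to compute once the posterior is available. The sketch below, added for this edition, uses a hypothetical Beta posterior and illustrative loss constants.

```python
from scipy import stats

post = stats.beta(2.5, 8.5)          # a hypothetical univariate posterior, e.g. Be(theta | r + 1/2, n - r + 1/2)

print(post.mean())                   # quadratic loss        -> posterior mean
print(post.median())                 # symmetric linear loss -> posterior median
c1, c2 = 1.0, 4.0                    # asymmetric linear loss constants (illustrative)
print(post.ppf(c2 / (c1 + c2)))      # -> posterior quantile of order c2 / (c1 + c2)
```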

EXAMPLE 13 Intrinsic estimation. Conventional loss functions are typically non-invariant under reparametrization. It follows that the Bayes estimator $\phi^*$ of a one-to-one transformation $\phi = \phi(\theta)$ of the original parameter $\theta$ is not necessarily $\phi(\theta^*)$ (the univariate posterior median, which is invariant, is an interesting exception). Moreover, conventional loss functions focus on the "distance" between the estimate $\tilde{\theta}$ and the true value $\theta$, rather than on the "distance" between the probability models they label. Inference-oriented loss functions directly focus on how different the probability model $p(D\mid\theta,\lambda)$ is from its closest approximation within the family $\{p(D\mid\tilde{\theta},\lambda_i),\ \lambda_i\in\Lambda\}$, and typically produce invariant solutions. An attractive example is the intrinsic discrepancy, $\delta(\tilde{\theta},\theta)$, defined as the minimum logarithmic divergence between a probability model labeled by $\theta$ and a probability model labeled by $\tilde{\theta}$. When there are no nuisance parameters, this is given by

(50) $\delta(\tilde{\theta},\theta) = \min\{k(\tilde{\theta}\mid\theta),\, k(\theta\mid\tilde{\theta})\}, \qquad k(\theta_i\mid\theta) = \int_{T} p(t\mid\theta)\,\log\frac{p(t\mid\theta)}{p(t\mid\theta_i)}\,dt,$

where $t = t(D)\in T$ is any sufficient statistic (which may well be the whole data set $D$). The definition is easily extended to problems with nuisance parameters; in this case,

(51) $\delta(\tilde{\theta},\theta,\lambda) = \min_{\lambda_i\in\Lambda}\,\delta\{(\tilde{\theta},\lambda_i),\,(\theta,\lambda)\}$

measures the logarithmic divergence from $p(t\mid\theta,\lambda)$ of its closest approximation with $\theta = \tilde{\theta}$, and the loss function now depends on the complete parameter vector $(\theta,\lambda)$. Although not explicitly shown in the notation, the intrinsic discrepancy function typically depends on the sample size $n$; indeed, when the data consist of a random sample $D = \{\mathbf{x}_1,\ldots,\mathbf{x}_n\}$ from some model $p(\mathbf{x}\mid\theta)$, then $k(\theta_i\mid\theta) = n\int_{\mathcal{X}} p(\mathbf{x}\mid\theta)\log[p(\mathbf{x}\mid\theta)/p(\mathbf{x}\mid\theta_i)]\,d\mathbf{x}$, so that the discrepancy associated with the full model is simply $n$ times the discrepancy which corresponds to a single observation. The intrinsic discrepancy is a symmetric, non-negative loss function with a direct interpretation in information-theoretic terms as the minimum amount of information which is expected to be necessary to distinguish between the model $p(D\mid\theta,\lambda)$ and its closest approximation within the class $\{p(D\mid\tilde{\theta},\lambda_i),\ \lambda_i\in\Lambda\}$. Moreover, it is invariant under one-to-one reparametrization of the parameter of interest $\theta$, and does not depend on the choice of the nuisance parameter $\lambda$. The intrinsic estimator is naturally obtained by minimizing the reference posterior expected intrinsic discrepancy

(52) $d(\tilde{\theta}\mid D) = \int_{\Lambda}\int_{\Theta}\delta(\tilde{\theta},\theta,\lambda)\,\pi(\theta,\lambda\mid D)\,d\theta\,d\lambda.$

Since the intrinsic discrepancy is invariant under reparametrization, minimizing its posterior expectation produces invariant estimators. For further details on intrinsic point estimation see [Bernardo and Juarez, 2003; Bernardo, 2006].


EXAMPLE 2, continued. Intrinsic estimation of a binomial parameter. In the estimation of a binomial proportion $\theta$, given data $D = (n, r)$, the Bayes reference estimator associated with the quadratic loss (the corresponding posterior mean) is $E[\theta\mid D] = (r + \tfrac{1}{2})/(n+1)$, while the quadratic loss based estimator of, say, the log-odds $\phi(\theta) = \log[\theta/(1-\theta)]$, is found to be $E[\phi\mid D] = \psi(r+\tfrac{1}{2}) - \psi(n-r+\tfrac{1}{2})$ (where $\psi(x) = d\log[\Gamma(x)]/dx$ is the digamma function), which is not equal to $\phi(E[\theta\mid D])$. The intrinsic loss function in this problem is

(53) $\delta(\tilde{\theta},\theta) = n\,\min\{k(\tilde{\theta}\mid\theta),\, k(\theta\mid\tilde{\theta})\}, \qquad k(\theta_i\mid\theta) = \theta\log\frac{\theta}{\theta_i} + (1-\theta)\log\frac{1-\theta}{1-\theta_i},$

and the corresponding intrinsic estimator $\theta^*$ is obtained by minimizing the expected posterior loss $d(\tilde{\theta}\mid D) = \int\delta(\tilde{\theta},\theta)\,\pi(\theta\mid D)\,d\theta$. The exact value of $\theta^*$ may be obtained by numerical minimization, but a very good approximation is given by $\theta^* \approx (r + \tfrac{1}{3})/(n + \tfrac{2}{3})$.

Since intrinsic estimation is an invariant procedure, the intrinsic estimator of the log-odds will simply be the log-odds of the intrinsic estimator of $\theta$. As one would expect, when $r$ and $n-r$ are both large, all Bayes estimators of any well-behaved function $\phi(\theta)$ will cluster around $\phi(E[\theta\mid D])$.
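The numerical minimization mentioned above is straightforward in the binomial case. The following sketch is an editorial addition: it evaluates the expected intrinsic loss of equation (53) under the reference posterior by one-dimensional quadrature and minimizes it over the estimate; the data values $r = 2$, $n = 10$ are hypothetical, and the result should land close to the approximation $(r + 1/3)/(n + 2/3)$.

```python
import numpy as np
from scipy import integrate, optimize, stats

def intrinsic_loss(theta_t, theta, n):
    """delta(theta_t, theta) = n * min{k(theta_t | theta), k(theta | theta_t)}, as in equation (53)."""
    def k(a, b):   # directed logarithmic divergence of Bernoulli(a) from Bernoulli(b)
        return b * np.log(b / a) + (1 - b) * np.log((1 - b) / (1 - a))
    return n * min(k(theta_t, theta), k(theta, theta_t))

def intrinsic_estimator(r, n):
    post = stats.beta(r + 0.5, n - r + 0.5)                   # reference posterior
    def expected_loss(theta_t):
        f = lambda th: intrinsic_loss(theta_t, th, n) * post.pdf(th)
        return integrate.quad(f, 0.0, 1.0, limit=200)[0]
    res = optimize.minimize_scalar(expected_loss, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return res.x

r, n = 2, 10                                                  # hypothetical data
print(intrinsic_estimator(r, n))                              # numerical minimizer
print((r + 1/3) / (n + 2/3))                                  # approximation quoted above, about 0.219
```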

Interval Estimation. To describe the inferential content of the posterior distribution of the quantity of interest $\pi(\theta\mid D)$ it is often convenient to quote regions $R\subset\Theta$ of given probability under $\pi(\theta\mid D)$. For example, the identification of regions containing 50%, 90%, 95%, or 99% of the probability under the posterior may be sufficient to convey the general quantitative messages implicit in $\pi(\theta\mid D)$; indeed, this is the intuitive basis of graphical representations of univariate distributions like those provided by boxplots.

Any region $R\subset\Theta$ such that $\int_R \pi(\theta\mid D)\,d\theta = q$ (so that, given data $D$, the true value of $\theta$ belongs to $R$ with probability $q$) is said to be a posterior $q$-credible region of $\theta$. Notice that this provides immediately a direct intuitive statement about the unknown quantity of interest $\theta$ in probability terms, in marked contrast to the circumlocutory statements provided by frequentist confidence intervals. Clearly, for any given $q$ there are generally infinitely many credible regions. A credible region is invariant under reparametrization; thus, for any $q$-credible region $R$ of $\theta$, $\phi(R)$ is a $q$-credible region of $\phi = \phi(\theta)$. Sometimes, credible regions are selected to have minimum size (length, area, volume), resulting in highest probability density (HPD) regions, where all points in the region have larger probability density than all points outside. However, HPD regions are not invariant under reparametrization: the image $\phi(R)$ of an HPD region $R$ will be a credible region for $\phi$, but will not generally be HPD; indeed, there is no compelling reason to restrict attention to HPD credible regions. In one-dimensional problems, posterior quantiles are often used to derive credible regions. Thus, if $\theta_q = \theta_q(D)$ is the $100q\%$ posterior quantile of $\theta$, then $R = \{\theta;\ \theta\le\theta_q\}$ is a one-sided, typically unique, $q$-credible region, and it is invariant under reparametrization. Indeed, probability-centered $q$-credible regions of the form $R = \{\theta;\ \theta_{(1-q)/2}\le\theta\le\theta_{(1+q)/2}\}$ are easier to compute, and are often quoted in preference to HPD regions.
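A probability-centered credible region is just a pair of posterior quantiles, and its invariance under reparametrization is immediate to check numerically. The sketch below, added for this edition, uses the reference posterior of Example 2 ($r = 0$, $n = 100$) and the log-odds transformation as an illustration.

```python
import numpy as np
from scipy import stats

post = stats.beta(0.5, 100.5)                         # reference posterior of Example 2 (r = 0, n = 100)
lo, hi = post.ppf(0.025), post.ppf(0.975)             # probability-centered 95%-credible region
print(lo, hi)

# Invariance under reparametrization: the corresponding region for the log-odds
# phi = log(theta / (1 - theta)) is simply the image of the endpoints.
print(np.log(lo / (1 - lo)), np.log(hi / (1 - hi)))
```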

EXAMPLE 3. Inference on normal parameters, continued. In the numerical example about the value of the gravitational field described in Figure 3a, the interval [9.788, 9.829] in the unrestricted posterior density of $g$ is an HPD, 95%-credible region for $g$. Similarly, the interval [9.7803, 9.8322] in Figure 3b is also a 95%-credible region for $g$, but it is not HPD.

Decision theory may also be used to select credible regions. Thus, lowest posterior loss (LPL) regions are defined as those where all points in the region have smaller posterior expected loss than all points outside. Using the intrinsic discrepancy as a loss function yields intrinsic credible regions which, as one would expect from an invariant loss function, are coherent under one-to-one transformations. For details, see [Bernardo, 2005b; 2007].

The concept of a credible region for a function $\theta = \theta(\omega)$ of the parameter vector is trivially extended to prediction problems. Thus, a posterior $q$-credible region for $\mathbf{x}\in\mathcal{X}$ is a subset $R$ of the sample space $\mathcal{X}$ with posterior predictive probability $q$, so that $\int_R p(\mathbf{x}\mid D)\,d\mathbf{x} = q$.

5.2 Hypothesis Testing

The reference posterior distribution $\pi(\theta\mid D)$ of the quantity of interest $\theta$ conveys immediate intuitive information on those values of $\theta$ which, given the assumed model, may be taken to be compatible with the observed data $D$, namely, those with a relatively high probability density. Sometimes, a restriction $\theta\in\Theta_0\subset\Theta$ of the possible values of the quantity of interest (where $\Theta_0$ may possibly consist of a single value $\theta_0$) is suggested in the course of the investigation as deserving special consideration, either because restricting $\theta$ to $\Theta_0$ would greatly simplify the model, or because there are additional, context-specific arguments suggesting that $\theta\in\Theta_0$. Intuitively, the hypothesis $H_0\equiv\{\theta\in\Theta_0\}$ should be judged to be compatible with the observed data $D$ if there are elements in $\Theta_0$ with a relatively high posterior density. However, a more precise conclusion is often required and, once again, this is made possible by adopting a decision-oriented approach. Formally, testing the hypothesis $H_0\equiv\{\theta\in\Theta_0\}$ is a decision problem where the action space has only two elements, namely to accept ($a_0$) or to reject ($a_1$) the proposed restriction. To solve this decision problem, it is necessary to specify an appropriate loss function, $L(a_i,\theta)$, measuring the consequences of accepting or rejecting $H_0$ as a function of the actual value $\theta$ of the vector of interest. Notice that this requires the statement of an alternative $a_1$ to accepting $H_0$; this is only to be expected, for an action is taken not because it is good, but because it is better than anything else that has been imagined.

Given data $D$, the optimal action will be to reject $H_0$ if (and only if) the expected posterior loss of accepting, $\int_{\Theta} L(a_0,\theta)\,\pi(\theta\mid D)\,d\theta$, is larger than the expected posterior loss of rejecting, $\int_{\Theta} L(a_1,\theta)\,\pi(\theta\mid D)\,d\theta$, that is, if (and only if)


(54)  ∫_Θ [L(a0, θ) − L(a1, θ)] π(θ|D) dθ = ∫_Θ ΔL(θ) π(θ|D) dθ > 0.

Therefore, only the loss difference ΔL(θ) = L(a0, θ) − L(a1, θ), which measures the advantage of rejecting H0 as a function of θ, has to be specified. Thus, as common sense dictates, the hypothesis H0 should be rejected whenever the expected advantage of rejecting H0 is positive.

A crucial element in the specification of the loss function is a description of what is actually meant by rejecting H0. By assumption a0 means to act as if H0 were true, i.e. as if θ ∈ Θ0, but there are at least two options for the alternative action a1. This may either mean (i) the negation of H0, that is, to act as if θ ∉ Θ0 or, alternatively, it may rather mean (ii) to reject the simplification implied by H0 and to keep the unrestricted model, θ ∈ Θ, which is true by assumption. Both options have been analyzed in the literature, although it may be argued that the problems of scientific data analysis where hypothesis testing procedures are typically used are better described by the second alternative. Indeed, an established model, identified by H0 ≡ {θ ∈ Θ0}, is often embedded into a more general model, {θ ∈ Θ, Θ0 ⊂ Θ}, constructed to include possibly promising departures from H0, and it is required to verify whether presently available data D are still compatible with θ ∈ Θ0, or whether the extension to θ ∈ Θ is really required.

EXAMPLE 14 Conventional hypothesis testing. Let π(θ|D), θ ∈ Θ, be the posterior distribution of the quantity of interest, let a0 be the decision to work under the restriction θ ∈ Θ0 and let a1 be the decision to work under the complementary restriction θ ∉ Θ0. Suppose, moreover, that the loss structure has the simple, zero-one form given by {L(a0, θ) = 0, L(a1, θ) = 1} if θ ∈ Θ0 and, similarly, {L(a0, θ) = 1, L(a1, θ) = 0} if θ ∉ Θ0, so that the advantage ΔL(θ) of rejecting H0 is 1 if θ ∉ Θ0 and it is −1 otherwise. With this loss function it is immediately found that the optimal action is to reject H0 if (and only if) Pr(θ ∉ Θ0 | D) > Pr(θ ∈ Θ0 | D). Notice that this formulation requires that Pr(θ ∈ Θ0) > 0, that is, that the hypothesis H0 has a strictly positive prior probability. If θ is a continuous parameter and Θ0 has zero measure (for instance if H0 consists of a single point θ0), this requires the use of a non-regular “sharp” prior concentrating a positive probability mass on Θ0. For details see [Kass and Raftery, 1995] and references therein.
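To make the zero-one rule of Example 14 concrete for an interval hypothesis (so that no spike prior is needed), the sketch below uses a hypothetical Beta posterior, as would arise from binomial data under a uniform prior, and rejects H0 when its posterior probability falls below 1/2.

from scipy import stats

# Hypothetical posterior pi(theta | D): Beta(73, 29), e.g. 72 successes in
# 100 Bernoulli trials under a uniform prior; illustrative only.
posterior = stats.beta(a=73, b=29)

# H0: theta in Theta0 = [0.45, 0.55]; zero-one loss rule of Example 14.
p0 = posterior.cdf(0.55) - posterior.cdf(0.45)   # Pr(theta in Theta0 | D)
print("Pr(H0 | D) =", round(p0, 4))
print("reject H0" if p0 < 0.5 else "accept H0")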

EXAMPLE 15 Intrinsic hypothesis testing. Again, let π(θ|D), θ ∈ Θ, be the posterior distribution of the quantity of interest, and let a0 be the decision to work under the restriction θ ∈ Θ0, but let a1 now be the decision to keep the general, unrestricted model θ ∈ Θ. In this case, the advantage ΔL(θ) of rejecting H0 as a function of θ may safely be assumed to have the form ΔL(θ) = δ(Θ0, θ) − δ*, for some δ* > 0, where (i) δ(Θ0, θ) is some measure of the discrepancy between the assumed model p(D|θ) and its closest approximation within the class {p(D|θ0), θ0 ∈ Θ0}, such that δ(Θ0, θ) = 0 whenever θ ∈ Θ0, and (ii) δ* is a context-dependent utility constant which measures the (necessarily positive) advantage of being able to work with the simpler model when it is true. Choices for both δ(Θ0, θ) and δ* which may be appropriate for general use will now be described.


For reasons similar to those supporting its use in point estimation, an attractive choice for the function δ(Θ0, θ) is an appropriate extension of the intrinsic discrepancy; when there are no nuisance parameters, this is given by

(55)  δ(Θ0, θ) = inf_{θ0 ∈ Θ0} min{κ(θ0 | θ), κ(θ | θ0)},

where κ(θ0 | θ) = ∫_T p(t|θ) log{p(t|θ)/p(t|θ0)} dt, and t = t(D) ∈ T is any sufficient statistic, which may well be the whole data set D. As before, if the data D = {x1, . . . , xn} consist of a random sample from p(x|θ), then

(56)  κ(θ0 | θ) = n ∫_X p(x|θ) log{p(x|θ)/p(x|θ0)} dx.
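As a numerical check of (56) in a case where the integral is available in closed form, the sketch below evaluates κ(θ0 | θ) by quadrature for a normal model with known σ, where the directed logarithmic divergence equals n(θ − θ0)²/(2σ²); all numerical values are hypothetical.

import numpy as np
from scipy import stats
from scipy.integrate import quad

n, sigma, theta, theta0 = 10, 2.0, 1.3, 1.0      # hypothetical values

p  = stats.norm(loc=theta,  scale=sigma).pdf
p0 = stats.norm(loc=theta0, scale=sigma).pdf
integrand = lambda x: p(x) * np.log(p(x) / p0(x))

# Tails beyond 20 standard deviations contribute a negligible amount.
kl, _ = quad(integrand, theta - 20 * sigma, theta + 20 * sigma)

print("kappa by quadrature:", n * kl)
print("closed form        :", n * (theta - theta0) ** 2 / (2 * sigma ** 2))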

Naturally, the loss function δ(Θ0, θ) reduces to the intrinsic discrepancy δ(θ0, θ) of Example 13 when Θ0 contains a single element θ0. Besides, as in the case of estimation, the definition is easily extended to problems with nuisance parameters, with

(57)  δ(Θ0, θ, λ) = inf_{θ0 ∈ Θ0, λ0 ∈ Λ} δ(θ0, λ0, θ, λ).

The hypothesis H0 should be rejected if the posterior expected advantage of rejecting is

(58)  d(Θ0, D) = ∫_Λ ∫_Θ δ(Θ0, θ, λ) π(θ, λ|D) dθ dλ > δ*,

for some δ* > 0. As an expectation of a non-negative quantity, d(Θ0, D) is obviously nonnegative. Moreover, if φ = φ(θ) is a one-to-one transformation of θ, then d(φ(Θ0), D) = d(Θ0, D) so that, as one should clearly require, the expected intrinsic loss of rejecting H0 is invariant under reparametrization.

It may be shown that, as the sample size increases, the expected value of d(Θ0, D) under sampling tends to one when H0 is true, and tends to infinity otherwise; thus d(Θ0, D) may be regarded as a continuous, positive measure of how inappropriate (in loss of information units) it would be to simplify the model by accepting H0. In traditional language, d(Θ0, D) is a test statistic for H0 and the hypothesis should be rejected if the value of d(Θ0, D) exceeds some critical value δ*. In sharp contrast to conventional hypothesis testing, this critical value δ* is found to be a context-specific, positive utility constant, which may precisely be described as the number of information units which the decision maker is prepared to lose in order to be able to work with the simpler model H0, and which does not depend on the sampling properties of the probability model. The procedure may be used with standard, continuous regular priors even in sharp hypothesis testing, when Θ0 is a zero-measure set (as would be the case if θ is continuous and Θ0 contains a single point θ0).

Naturally, to implement the test, the utility constant δ* which defines the rejection region must be chosen. Values of d(Θ0, D) of about 1 should be regarded as an indication of no evidence against H0, since this is precisely the expected value of the test statistic d(Θ0, D) under repeated sampling from the null.


It follows from its definition that d(Θ0, D) is the reference posterior expectation of the log-likelihood ratio against the null. Hence, values of d(Θ0, D) of about log[12] ≈ 2.5 and log[150] ≈ 5 should respectively be regarded as an indication of mild evidence against H0, and of significant evidence against H0. In the canonical problem of testing a value µ = µ0 for the mean of a normal distribution with known variance (see below), these values correspond to the observed sample mean x̄ lying, respectively, 2 or 3 posterior standard deviations from the null value µ0. Notice that, in sharp contrast to frequentist hypothesis testing, where it is hazily recommended to adjust the significance level for dimensionality and sample size, this provides an absolute scale (in information units) which remains valid for any sample size and any dimensionality.
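To see how d(Θ0, D) is computed and calibrated in practice, the following sketch evaluates the statistic for a sharp null on a normal mean with known σ by Monte Carlo averaging of the intrinsic discrepancy over reference posterior draws, and compares the result with the exact value (1 + z²)/2 derived in Example 16 below; the data summaries are hypothetical and the threshold δ* = 5 corresponds to log[150] ≈ 5.

import numpy as np

rng = np.random.default_rng(1)
n, sigma, xbar, mu0 = 20, 1.0, 0.6, 0.0       # hypothetical summaries of D
delta_star = 5.0                              # context-dependent utility constant

# Reference posterior pi(mu | D) = N(xbar, sigma/sqrt(n)); intrinsic
# discrepancy delta(mu0, mu) = n (mu - mu0)^2 / (2 sigma^2).
mu = rng.normal(loc=xbar, scale=sigma / np.sqrt(n), size=200_000)
d_mc = np.mean(n * (mu - mu0) ** 2 / (2 * sigma ** 2))

z = (xbar - mu0) / (sigma / np.sqrt(n))
print("Monte Carlo d:", round(d_mc, 3), "  exact (1 + z^2)/2:", round((1 + z ** 2) / 2, 3))
print("reject H0" if d_mc > delta_star else "no significant evidence against H0")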

For further details on intrinsic hypothesis testing see [Bernardo and Rueda, 2002; Bernardo and Perez, 2007].

EXAMPLE 16 Testing the value of a normal mean. Let the data D = {x1, . . . , xn} be a random sample from a normal distribution N(x|µ, σ), where σ is assumed to be known, and consider the problem of testing whether these data are or are not compatible with some specific sharp hypothesis H0 ≡ {µ = µ0} on the value of the mean.

The conventional approach to this problem requires a non-regular prior which places a probability mass, say p0, on the value µ0 to be tested, with the remaining 1 − p0 probability continuously distributed over the real line. If this prior is chosen to be π(µ | µ ≠ µ0) = N(µ | µ0, σ0), Bayes’ theorem may be used to obtain the corresponding posterior probability,

(59)  Pr[µ0 | D, λ] = B01(D, λ) p0 / [(1 − p0) + p0 B01(D, λ)],

(60)  B01(D, λ) = (1 + n/λ)^{1/2} exp{−(1/2) [n/(n + λ)] z²},

where z = (x̄ − µ0)/(σ/√n) measures, in standard deviations, the distance between x̄ and µ0, and λ = σ²/σ0² is the ratio of model to prior variance. The function B01(D, λ), a ratio of (integrated) likelihood functions, is called the Bayes factor in favour of H0. With a conventional zero-one loss function, H0 should be rejected if Pr[µ0 | D, λ] < 1/2. The choices p0 = 1/2 and λ = 1 or λ = 1/2, describing particular forms of sharp prior knowledge, have been suggested in the literature for routine use. The conventional approach to sharp hypothesis testing deals with situations of concentrated prior probability; it assumes important prior knowledge about the value of µ and, hence, should not be used unless this is an appropriate assumption. Moreover [Bartlett, 1957], the resulting posterior probability is extremely sensitive to the specific prior specification. In most applications, H0 is really a hazily defined small region rather than a point.


For moderate sample sizes, the posterior probability Pr[µ0 | D, λ] is an approximation to the posterior probability Pr[µ0 − ε < µ < µ0 + ε | D, λ] for some small interval around µ0 which would have been obtained from a regular, continuous prior heavily concentrated around µ0; however, this approximation always breaks down for sufficiently large sample sizes. One consequence (which is immediately apparent from the last two equations) is that, for any fixed value of the pertinent statistic z, the posterior probability of the null, Pr[µ0 | D, λ], tends to one as n → ∞. Far from being specific to this example, this unappealing behaviour of posterior probabilities based on sharp, non-regular priors, generally known as Lindley’s paradox [Lindley, 1957], is always present in the conventional Bayesian approach to sharp hypothesis testing.
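The growth of Pr[µ0 | D, λ] with n for fixed z is easy to reproduce numerically from (59) and (60); the sketch below uses the conventional choices p0 = 1/2 and λ = 1 with a fixed z = 2 (nominally two standard deviations away from µ0), and the numbers are purely illustrative.

import numpy as np

def pr_null(z, n, lam=1.0, p0=0.5):
    # Bayes factor (60) and posterior probability of the null (59).
    b01 = np.sqrt(1 + n / lam) * np.exp(-0.5 * (n / (n + lam)) * z ** 2)
    return b01 * p0 / ((1 - p0) + p0 * b01)

for n in (10, 100, 10_000, 1_000_000):
    print(n, round(pr_null(z=2.0, n=n), 4))   # tends to one as n grows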

The intrinsic approach may be used without assuming any sharp prior knowledge. The intrinsic discrepancy is δ(µ0, µ) = n(µ − µ0)²/(2σ²), a simple transformation of the standardized distance between µ and µ0. The reference prior is uniform and the corresponding (proper) posterior distribution is π(µ|D) = N(µ | x̄, σ/√n). The expected value of δ(µ0, µ) with respect to this posterior is d(µ0, D) = (1 + z²)/2, where z = (x̄ − µ0)/(σ/√n) is the standardized distance between x̄ and µ0. As foretold by the general theory, the expected value of d(µ0, D) under repeated sampling is one if µ = µ0, and increases linearly with n if µ ≠ µ0. Moreover, in this canonical example, rejecting H0 whenever |z| > 2 or |z| > 3, that is, whenever µ0 is 2 or 3 posterior standard deviations away from x̄, respectively corresponds to rejecting H0 whenever d(µ0, D) is larger than 2.5, or larger than 5.

If σ is unknown, the reference prior is π(µ, σ) = σ^{−1}, and the intrinsic discrepancy becomes

(61)  δ(µ0, µ, σ) = (n/2) log[1 + ((µ − µ0)/σ)²].

The intrinsic test statistic d(µ0, D) is found as the expected value of δ(µ0, µ, σ) under the corresponding joint reference posterior distribution; this may be exactly expressed in terms of hypergeometric functions, and is well approximated by

(62)  d(µ0, D) ≈ 1/2 + (n/2) log(1 + t²/n),

where t is the traditional statistic t = √(n − 1) (x̄ − µ0)/s, with ns² = Σ_j (x_j − x̄)². For instance, for sample sizes 5, 30 and 1000, and using the utility constant δ* = 5, the hypothesis H0 would be rejected whenever |t| is, respectively, larger than 5.025, 3.240, and 3.007.

6 DISCUSSION

This article focuses on the basic concepts of the Bayesian paradigm, with special emphasis on the derivation of “objective” methods, where the results only depend on the data obtained and the model assumed. Many technical aspects have been spared; the interested reader is referred to the bibliography for further information. This final section briefly reviews the main arguments for an objective Bayesian approach.


6.1 Coherence

By using probability distributions to characterize all uncertainties in the problem, the Bayesian paradigm reduces statistical inference to applied probability, thereby ensuring the coherence of the proposed solutions. There is no need to investigate, on a case-by-case basis, whether or not the solution to a particular problem is logically correct: a Bayesian result is only a mathematical consequence of explicitly stated assumptions and hence, unless a logical mistake has been committed in its derivation, it cannot be formally wrong. In marked contrast, conventional statistical methods are plagued with counterexamples. These include, among many others, negative estimators of positive quantities, q-confidence regions (q < 1) which consist of the whole parameter space, empty sets of “appropriate” solutions, and incompatible answers from alternative methodologies simultaneously supported by the theory.

The Bayesian approach does require, however, the specification of a (prior) probability distribution over the parameter space. The sentence “a prior distribution does not exist for this problem” is often stated to justify the use of non-Bayesian methods. However, the general representation theorem proves the existence of such a distribution whenever the observations are assumed to be exchangeable (and, if they are assumed to be a random sample then, a fortiori, they are assumed to be exchangeable). To ignore this fact, and to proceed as if a prior distribution did not exist, just because it is not easy to specify, is mathematically untenable.

6.2 Objectivity

It is generally accepted that any statistical analysis is subjective, in the sense that it is always conditional on accepted assumptions (on the structure of the data, on the probability model, and on the outcome space) and those assumptions, although possibly well founded, are definitely subjective choices. It is, therefore, mandatory to make all assumptions very explicit.

Users of conventional statistical methods rarely dispute the mathematical foundations of the Bayesian approach, but claim to be able to produce “objective” answers in contrast to the possibly subjective elements involved in the choice of the prior distribution.

Bayesian methods do indeed require the choice of a prior distribution, and critics of the Bayesian approach systematically point out that in many important situations, including scientific reporting and public decision making, the results must exclusively depend on documented data which might be subject to independent scrutiny. This is of course true, but those critics choose to ignore the fact that this particular case is covered within the Bayesian approach by the use of reference prior distributions which (i) are mathematically derived from the accepted probability model (and, hence, they are “objective” insofar as the choice of that model might be objective) and, (ii) by construction, they produce posterior probability distributions which, given the accepted probability model, only contain the information about their values which data may provide and, optionally, any further contextual information over which there might be universal agreement.

6.3 Operational meaning

An issue related to objectivity is that of the operational meaning of reference posterior probabilities; it is found that the analysis of their behaviour under repeated sampling provides a suggestive form of calibration. Indeed, Pr[θ ∈ R | D] = ∫_R π(θ|D) dθ, the reference posterior probability that θ ∈ R, is both a measure of the conditional uncertainty (given the assumed model and the observed data D) about the event that the unknown value of θ belongs to R ⊂ Θ, and the limiting proportion of the regions which would cover θ under repeated sampling conditional on data “sufficiently similar” to D. Under broad conditions (to guarantee regular asymptotic behaviour), all large data sets from the same model are “sufficiently similar” among themselves in this sense and hence, given those conditions, reference posterior credible regions are approximate frequentist confidence regions.

The conditions for this approximate equivalence to hold exclude, however, important special cases, like those involving “extreme” or “relevant” observations. In very special situations, when probability models may be transformed to location-scale models, there is an exact equivalence; in those cases reference posterior credible intervals are, for any sample size, exact frequentist confidence intervals.
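This calibration is easy to examine by simulation; the sketch below (with hypothetical values) checks the coverage under repeated sampling of the reference posterior 0.95-credible interval x̄ ± 1.96 σ/√n for a normal mean with known σ, a location model for which the equivalence with the frequentist interval is exact.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu_true, sigma, n, n_rep = 3.0, 1.0, 10, 100_000   # hypothetical settings
z = stats.norm.ppf(0.975)                          # approximately 1.96

samples = rng.normal(mu_true, sigma, size=(n_rep, n))
xbar = samples.mean(axis=1)
half = z * sigma / np.sqrt(n)
covered = np.abs(xbar - mu_true) <= half           # interval covers mu_true?

print("empirical coverage:", covered.mean())       # close to 0.95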

6.4 Generality

In sharp contrast to most conventional statistical methods, which may only be exactly applied to a handful of relatively simple, stylized situations, Bayesian methods are defined to be totally general. Indeed, for a given probability model and prior distribution over its parameters, the derivation of posterior distributions is a well-defined mathematical exercise. In particular, Bayesian methods do not require any particular regularity conditions on the probability model, do not depend on the existence of sufficient statistics of finite dimension, do not rely on asymptotic relations, and do not require the derivation of any sampling distribution, nor (a fortiori) the existence of a “pivotal” statistic whose sampling distribution is independent of the parameters.

However, when used in complex models with many parameters, Bayesian methods often require the computation of multidimensional definite integrals and, for a long time in the past, this requirement effectively placed practical limits on the complexity of the problems which could be handled. This has dramatically changed in recent years with the general availability of large computing power, and the parallel development of simulation-based numerical integration techniques like importance sampling or Markov chain Monte Carlo (MCMC). These methods provide a structure within which many complex models may be analyzed using generic software. MCMC is numerical integration using Markov chains. Monte Carlo integration proceeds by drawing samples from the required distributions, and computing sample averages to approximate expectations. MCMC methods draw the required samples by running appropriately defined Markov chains for a long time; specific methods to construct those chains include the Gibbs sampler and the Metropolis algorithm, which originated in the 1950s in the literature of statistical physics. The development of improved algorithms and appropriate diagnostic tools to establish their convergence remains a very active research area.
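As a minimal, self-contained illustration of the Metropolis algorithm mentioned above (a sketch only, not tied to any particular model discussed in this chapter), the code below samples from a target density known only up to a normalizing constant using a Gaussian random-walk proposal, and approximates a posterior expectation by a sample average.

import numpy as np

rng = np.random.default_rng(3)

def log_target(theta):
    # Unnormalized log density, symmetric about 1.0; illustrative only.
    return -3.0 * np.log(1.0 + (theta - 1.0) ** 2 / 4.0)

def metropolis(n_iter=50_000, step=1.0, theta0=0.0):
    draws = np.empty(n_iter)
    theta, logp = theta0, log_target(theta0)
    for i in range(n_iter):
        prop = theta + step * rng.normal()             # random-walk proposal
        logp_prop = log_target(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # Metropolis accept step
            theta, logp = prop, logp_prop
        draws[i] = theta
    return draws

draws = metropolis()[5_000:]                           # discard burn-in
print("estimated expectation:", draws.mean())          # close to 1.0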

For an introduction to MCMC methods in Bayesian inference, see [Gilks et al., 1996; Mira, 2005], and references therein.

BIBLIOGRAPHY

[Bartlett, 1957] M. Bartlett. A comment on D. V. Lindley’s statistical paradox. Biometrika 44, 533–534, 1957.
[Berger and Bernardo, 1989] J. O. Berger and J. M. Bernardo. Estimating a product of means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84, 200–207, 1989.
[Berger and Bernardo, 1992a] J. O. Berger and J. M. Bernardo. Ordered group reference priors with applications to a multinomial problem. Biometrika 79, 25–37, 1992.
[Berger and Bernardo, 1992b] J. O. Berger and J. M. Bernardo. Reference priors in a variance components problem. Bayesian Analysis in Statistics and Econometrics, 323–340, 1992.
[Berger and Bernardo, 1992c] J. O. Berger and J. M. Bernardo. On the development of reference priors. Bayesian Statistics 4, 35–60, 1992 (with discussion).
[Berger et al., 2009] J. O. Berger, J. M. Bernardo, and D. Sun. The formal definition of reference priors. Ann. Statist. 37, 905–938, 2009.
[Berger, 1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Berlin: Springer, 1985.
[Bernardo, 1979a] J. M. Bernardo. Expected information as expected utility. Ann. Statist. 7, 686–690, 1979.
[Bernardo, 1979b] J. M. Bernardo. Reference posterior distributions for Bayesian inference. J. R. Statist. Soc. B 41, 113–147, 1979 (with discussion). Reprinted in Bayesian Inference 1 (G. C. Tiao and N. G. Polson, eds). Oxford: Edward Elgar, 229–263.
[Bernardo, 1981] J. M. Bernardo. Reference decisions. Symposia Mathematica 25, 85–94, 1981.
[Bernardo, 1997] J. M. Bernardo. Noninformative priors do not exist. J. Statist. Planning and Inference 65, 159–189, 1997 (with discussion).
[Bernardo, 2005a] J. M. Bernardo. Reference analysis. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.) Amsterdam: Elsevier, 17–90, 2005.
[Bernardo, 2005b] J. M. Bernardo. Intrinsic credible regions: An objective Bayesian approach to interval estimation. Test 14, 317–384, 2005 (with discussion).
[Bernardo, 2006] J. M. Bernardo. Intrinsic point estimation of the normal variance. Bayesian Statistics and its Applications (S. K. Upadhyay, U. Singh and D. K. Dey, eds.) New Delhi: Anamaya Pub, 110–121, 2006.
[Bernardo, 2007] J. M. Bernardo. Objective Bayesian point and region estimation in location-scale models. Sort 14, 3–44, 2007.
[Bernardo and Juarez, 2003] J. M. Bernardo and M. A. Juarez. Intrinsic estimation. Bayesian Statistics 7, 465–476, 2003.
[Bernardo and Perez, 2007] J. M. Bernardo and S. Perez. Comparing normal means: New methods for an old problem. Bayesian Analysis 2, 45–58, 2007.
[Bernardo and Ramon, 1998] J. M. Bernardo and J. M. Ramon. An introduction to Bayesian reference analysis: inference on the ratio of multinomial parameters. The Statistician 47, 1–35, 1998.
[Bernardo and Rueda, 2002] J. M. Bernardo and R. Rueda. Bayesian hypothesis testing: A reference approach. International Statistical Review 70, 351–372, 2002.
[Bernardo and Smith, 1994] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Chichester: Wiley, 1994. Second edition forthcoming.
[Box and Tiao, 1973] G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley, 1973.


[Datta and Sweeting, 2005] G. S. Datta and T. J. Sweeting. Probability matching priors. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.) Amsterdam: Elsevier, 91–114, 2005.
[Dawid et al., 1973] A. P. Dawid, M. Stone, and J. V. Zidek. Marginalization paradoxes in Bayesian and structural inference. J. R. Statist. Soc. B 35, 189–233, 1973 (with discussion).
[de Finetti, 1937] B. de Finetti. La prévision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincaré 7, 1–68, 1937. Reprinted in 1980 as ‘Foresight: its logical laws, its subjective sources’ in Studies in Subjective Probability, 93–158.
[de Finetti, 1970] B. de Finetti. Teoria delle Probabilità. Turin: Einaudi, 1970. English translation as Theory of Probability. Chichester: Wiley, 1975.
[DeGroot, 1970] M. H. DeGroot. Optimal Statistical Decisions. New York: McGraw-Hill, 1970.
[Efron, 1986] B. Efron. Why isn’t everyone a Bayesian? Amer. Statist. 40, 1–11, 1986 (with discussion).
[Geisser, 1993] S. Geisser. Predictive Inference: An Introduction. London: Chapman and Hall, 1993.
[Gilks et al., 1996] W. R. Gilks, S. Y. Richardson, and D. J. Spiegelhalter, eds. Markov Chain Monte Carlo in Practice. London: Chapman and Hall, 1996.
[Jaynes, 1976] E. T. Jaynes. Confidence intervals vs. Bayesian intervals. Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science 2 (W. L. Harper and C. A. Hooker, eds). Dordrecht: Reidel, 175–257, 1976 (with discussion).
[Jeffreys, 1939] H. Jeffreys. Theory of Probability. Oxford: Oxford University Press, 1939. Third edition in 1961, Oxford: Oxford University Press.
[Kass and Raftery, 1995] R. E. Kass and A. E. Raftery. Bayes factors. J. Amer. Statist. Assoc. 90, 773–795, 1995.
[Kass and Wasserman, 1996] R. E. Kass and L. Wasserman. The selection of prior distributions by formal rules. J. Amer. Statist. Assoc. 91, 1343–1370, 1996.
[Laplace, 1812] P. S. Laplace. Théorie Analytique des Probabilités. Paris: Courcier, 1812. Reprinted as Oeuvres Complètes de Laplace 7, 1878–1912. Paris: Gauthier-Villars.
[Lindley, 1957] D. V. Lindley. A statistical paradox. Biometrika 44, 187–192, 1957.
[Lindley, 1958] D. V. Lindley. Fiducial distribution and Bayes’ theorem. J. R. Statist. Soc. B 20, 102–107, 1958.
[Lindley, 1965] D. V. Lindley. Introduction to Probability and Statistics from a Bayesian Viewpoint. Cambridge: Cambridge University Press, 1965.
[Lindley, 1972] D. V. Lindley. Bayesian Statistics, a Review. Philadelphia, PA: SIAM, 1972.
[Liseo, 2005] B. Liseo. The elimination of nuisance parameters. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.) Amsterdam: Elsevier, 193–219, 2005.
[Mira, 2005] A. Mira. MCMC methods to estimate Bayesian parametric models. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds.) Amsterdam: Elsevier, 415–436, 2005.
[Ramsey, 1926] F. P. Ramsey. Truth and probability. The Foundations of Mathematics and Other Logical Essays (R. B. Braithwaite, ed.). London: Kegan Paul, 1926 (1931), 156–198. Reprinted in 1980 in Studies in Subjective Probability, 61–92.
[Savage, 1954] L. J. Savage. The Foundations of Statistics. New York: Wiley, 1954. Second edition in 1972, New York: Dover.
[Stein, 1959] C. Stein. An example of wide discrepancy between fiducial and confidence intervals. Ann. Math. Statist. 30, 877–880, 1959.
[Zellner, 1971] A. Zellner. An Introduction to Bayesian Inference in Econometrics. New York: Wiley, 1971. Reprinted in 1987, Melbourne, FL: Krieger.