Bayesian Inference
Thomas Bayes
• Bayesian statistics is named after Thomas Bayes (1702-1761), an English statistician, philosopher and Presbyterian minister.

Bayes' solution to a problem of inverse probability was presented in "An Essay towards solving a Problem in the Doctrine of Chances", which was read to the Royal Society in 1763 after Bayes' death. It contains a special case of Bayes' theorem.
Pierre-Simon Laplace
• In statistics, the Bayesian interpretation of probability was developed mainly by Pierre-Simon Laplace (1749-1827).

Laplace is remembered as the French Newton. He was Napoleon's examiner when Napoleon attended the École Militaire in Paris in 1784. Laplace became a count of the Empire in 1806 and was named a marquis in 1817.
Bayes Theorem
• Bayes' theorem for probability events A and B:

  P(A|B) = P(B|A) P(A) / P(B)

• Or, for a set of mutually exclusive and exhaustive events A_1, ..., A_k (i.e. Σ_i P(A_i) = P(∪_i A_i) = 1):

  P(A_i|B) = P(B|A_i) P(A_i) / Σ_j P(B|A_j) P(A_j)
Example – Diagnostic Test
• A new Ebola (EB) test is claimed to have
“95% sensitivity and 98% specificity”
• In a population with an EB prevalence of
1/1000, what is the chance that a patient
testing positive actually has EB?
Let A be the event that the patient is truly positive, and A' the event that they are truly negative.
Let B be the event that they test positive.
Diagnostic Test, ctd.
• We want P(A|B)
• "95% sensitivity" means that P(B|A) = 0.95
• "98% specificity" means that P(B'|A') = 0.98, so P(B|A') = 0.02

So from Bayes' theorem:

  P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|A') P(A')]
         = (0.95 × 0.001) / (0.95 × 0.001 + 0.02 × 0.999)
         ≈ 0.045

Thus over 95% of those testing positive will, in fact, not have EB.
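The slide's calculation can be reproduced in a few lines (a minimal sketch; the variable names are mine):

```python
# Ebola-test example via Bayes' theorem.
sens = 0.95    # P(B|A): test positive given disease
spec = 0.98    # P(B'|A'): test negative given no disease
prev = 0.001   # P(A): prevalence

p_pos = sens * prev + (1 - spec) * (1 - prev)   # P(B) by total probability
p_disease_given_pos = sens * prev / p_pos        # P(A|B) by Bayes' theorem
print(round(p_disease_given_pos, 3))             # 0.045
```

Even with a fairly accurate test, the low prevalence dominates, which is exactly the role the prior plays in the Bayesian machinery introduced next.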
Bayesian Inference
In Bayesian inference there is a fundamental distinction between
• Observable quantities x, i.e. the data
• Unknown quantities θ
θ can be statistical parameters, missing data, latent variables…
• Parameters are treated as random variables
In the Bayesian framework we make probability statements about model parameters
In the frequentist framework, parameters are fixed non-random quantities and the probability statements concern the data.
Prior distributions
As with all statistical analyses, we start by positing a model which specifies f(x|θ).

This is the likelihood, which relates all variables in a 'full probability model'.

However, from a Bayesian point of view:
• θ is unknown, so it should have a probability distribution reflecting our uncertainty about it before seeing the data
• Therefore we specify a prior distribution f(θ)

Note this is like the prevalence in the example.
Posterior Distributions
Also, x is known, so it should be conditioned on. Here we use Bayes' theorem to obtain the conditional distribution for the unobserved quantities given the data, which is known as the posterior distribution:

  f(θ|x) = f(θ) f(x|θ) / f(x) = f(θ) f(x|θ) / ∫ f(θ) f(x|θ) dθ

The prior distribution expresses our uncertainty about θ before seeing the data.
The posterior distribution expresses our uncertainty about θ after seeing the data.
Examples of Bayesian Inference
using the Normal distribution
Known variance, unknown mean

It is easier to consider first a model with one unknown parameter. Suppose we have a sample of Normal data:

  X_i ~ N(θ, σ²),  i = 1, ..., n

Let us assume we know the variance σ², and we assume a prior distribution for the mean θ, based on our prior beliefs:

  θ ~ N(μ₀, σ₀²)

Now we wish to construct the posterior distribution f(θ|x).
Posterior for Normal distribution
mean
So we have

  f(θ) = (2πσ₀²)^(-1/2) exp{ -(θ - μ₀)² / (2σ₀²) }

  f(x|θ) = ∏_{i=1}^n (2πσ²)^(-1/2) exp{ -(x_i - θ)² / (2σ²) }

hence

  f(θ|x) ∝ f(θ) f(x|θ)
         ∝ (2πσ₀²)^(-1/2) exp{ -(θ - μ₀)² / (2σ₀²) } × (2πσ²)^(-n/2) exp{ -Σ_i (x_i - θ)² / (2σ²) }
         ∝ exp{ -(1/2) [ θ² (1/σ₀² + n/σ²) - 2θ (μ₀/σ₀² + Σ_i x_i/σ²) ] + const }
Posterior for Normal distribution
mean (continued)
For a Normal distribution with response y, with mean μ₁ and variance σ₁², we have

  f(y) = (2πσ₁²)^(-1/2) exp{ -(y - μ₁)² / (2σ₁²) }
       ∝ exp{ -(1/2) [ y²/σ₁² - 2y μ₁/σ₁² ] + const }

We can equate this to our posterior,

  f(θ|x) ∝ exp{ -(1/2) [ θ² (1/σ₀² + n/σ²) - 2θ (μ₀/σ₀² + Σ_i x_i/σ²) ] + const },

as follows:

  1/σ₁² = 1/σ₀² + n/σ²   and   μ₁ = (μ₀/σ₀² + Σ_i x_i/σ²) σ₁²
Large sample properties
As n → ∞:

So the posterior variance

  σ₁² = (1/σ₀² + n/σ²)^(-1) → σ²/n

Posterior mean

  μ₁ = σ₁² (μ₀/σ₀² + n x̄/σ²) → x̄

And so the posterior distribution

  θ | x ≈ N(x̄, σ²/n)

compared to

  X̄ | θ ~ N(θ, σ²/n)

in the Frequentist setting.
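The conjugate update for the Normal mean can be sketched directly from these formulas (all names here are my own; the data and prior values are illustrative):

```python
import numpy as np

# Conjugate update for a Normal mean with known variance:
# prior theta ~ N(mu0, s0sq), data X_i ~ N(theta, s2), i = 1..n.
def normal_posterior(x, s2, mu0, s0sq):
    n = len(x)
    post_var = 1.0 / (1.0 / s0sq + n / s2)                # (1/s0^2 + n/s^2)^-1
    post_mean = post_var * (mu0 / s0sq + np.sum(x) / s2)  # precision-weighted mean
    return post_mean, post_var

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=1000)          # simulated data, true mean 5
m, v = normal_posterior(x, s2=4.0, mu0=0.0, s0sq=100.0)
# For large n the posterior approaches N(xbar, s2/n), as on the slide.
print(m, v, x.mean(), 4.0 / len(x))
```

With n = 1000 the posterior mean is almost exactly the sample mean and the posterior variance is almost exactly σ²/n, illustrating the large-sample behaviour above.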
Sufficient Statistic
• Intuitively, a sufficient statistic for a parameter is a statistic that captures all the information about a given parameter contained in the sample.
• Sufficiency Principle: If T(X) is a sufficient statistic for θ, then any inference about θ should depend on the sample X only through the value of T(X).
• That is, if x and y are two sample points such that T(x) = T(y), then the inference about θ should be the same whether X = x or X = y.
• Definition: A statistic T(X) is a sufficient statistic for θ if the conditional distribution of the sample X given T(X) does not depend on θ.
• Definition: Let X₁, X₂, ..., X_n denote a random sample of size n from a distribution that has pdf f(x; θ), θ ∈ Ω. Let Y₁ = u₁(X₁, X₂, ..., X_n) be a statistic whose pdf or pmf is f_{Y₁}(y₁; θ). Then Y₁ is a sufficient statistic for θ if and only if

  f(x₁; θ) f(x₂; θ) ··· f(x_n; θ) / f_{Y₁}(u₁(x₁, x₂, ..., x_n); θ) = H(x₁, x₂, ..., x_n)

where H(x₁, x₂, ..., x_n) does not depend on θ ∈ Ω.
• Example (Normal sufficient statistic): Let X₁, X₂, ..., X_n be independently and identically distributed N(μ, σ²), where the variance is known. The sample mean

  T(X) = X̄ = (1/n) Σ_{i=1}^n X_i

is the sufficient statistic for μ.
• Starting with the joint distribution function:

  f(x; μ) = ∏_{i=1}^n (2πσ²)^(-1/2) exp{ -(x_i - μ)² / (2σ²) }
          = (2πσ²)^(-n/2) exp{ -Σ_{i=1}^n (x_i - μ)² / (2σ²) }

• Next, we add and subtract the sample average, yielding

  f(x; μ) = (2πσ²)^(-n/2) exp{ -Σ_{i=1}^n (x_i - x̄ + x̄ - μ)² / (2σ²) }
          = (2πσ²)^(-n/2) exp{ -[ Σ_{i=1}^n (x_i - x̄)² + n(x̄ - μ)² ] / (2σ²) }

• where the last equality derives from the cross term vanishing:

  Σ_{i=1}^n (x_i - x̄)(x̄ - μ) = (x̄ - μ) Σ_{i=1}^n (x_i - x̄) = 0

• Given that the distribution of the sample mean is

  q(T(x); μ) = (2πσ²/n)^(-1/2) exp{ -n(x̄ - μ)² / (2σ²) }

• the ratio of the information in the sample to the information in the statistic becomes

  f(x; μ) / q(T(x); μ)
    = (2πσ²)^(-n/2) exp{ -[ Σ_i (x_i - x̄)² + n(x̄ - μ)² ] / (2σ²) } / [ (2πσ²/n)^(-1/2) exp{ -n(x̄ - μ)² / (2σ²) } ]
    = n^(-1/2) (2πσ²)^(-(n-1)/2) exp{ -Σ_i (x_i - x̄)² / (2σ²) }

which is a function of the data X₁, X₂, ..., X_n only, and does not depend on μ. Thus we have shown that the sample mean is a sufficient statistic for μ.
• Theorem (Factorization Theorem): Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,

  f(x|θ) = g(T(x)|θ) h(x)
Posterior Distribution Through
Sufficient Statistics
Theorem: The posterior distribution
depends only on sufficient statistics.
Proof: Let T(X) be a sufficient statistic for θ, so that f(x|θ) = g(T(x)|θ) H(x). Then

  f(θ|x) = f(θ) f(x|θ) / ∫ f(θ) f(x|θ) dθ
         = f(θ) g(T(x)|θ) H(x) / ∫ f(θ) g(T(x)|θ) H(x) dθ
         = f(θ) g(T(x)|θ) / ∫ f(θ) g(T(x)|θ) dθ
         = f(θ|T(x))
Posterior Distribution Through
Sufficient Statistics
Example: Posterior for Normal distribution
mean (with known variance)
Now, instead of using the entire sample, we
can derive the posterior distribution using
the sufficient statistic

  T(x) = x̄

Exercise: Please derive the posterior
distribution using this approach.
Example. Posterior distribution (& the conjugate priors)
Assume the parameter of interest is θ, the proportion with some
property of interest in the population. Thus the distribution of an
individual data point is Bernoulli(θ), and a reasonable prior
density for θ is the Beta density:

  f(θ; α, β) = θ^(α-1) (1 - θ)^(β-1) / B(α, β),  0 ≤ θ ≤ 1

where α > 0 and β > 0 are two (constant) parameters and

  B(α, β) = ∫₀¹ x^(α-1) (1 - x)^(β-1) dx,

the so-called Beta function.

[Figure: Beta densities on (0, 1) for various parameter choices: Beta(1,1), Beta(5,5), Beta(1,5), Beta(5,1), Beta(2,5), Beta(5,2), Beta(0.5,0.5), Beta(0.3,0.7), Beta(0.7,0.3)]
Now, assume a sample of size n from the population, in which Y of the
values possess the property of interest (the number of 'successes'). It is easy to see
that Y is a sufficient statistic following the Binomial distribution with
pdf:

  f(y; θ) = C(n, y) θ^y (1 - θ)^(n-y)

The posterior pdf based on this sufficient statistic is thus:

  f(θ|y) = f(θ) f(y; θ) / ∫₀¹ f(θ) f(y; θ) dθ
         = [θ^(α-1) (1-θ)^(β-1) / B(α, β)] C(n, y) θ^y (1-θ)^(n-y)
           / ∫₀¹ [θ^(α-1) (1-θ)^(β-1) / B(α, β)] C(n, y) θ^y (1-θ)^(n-y) dθ
         = θ^(y+α-1) (1-θ)^(n-y+β-1) / ∫₀¹ θ^(y+α-1) (1-θ)^(n-y+β-1) dθ
         = θ^(y+α-1) (1-θ)^(n-y+β-1) / B(y+α, n-y+β)

Thus, the posterior density is also a Beta density, with parameters y + α
and n - y + β.
Prior distributions that, combined with the
likelihood, give a posterior in the same
distributional family are named conjugate priors.
(Note that by a distributional family we mean
distributions that go under a common name:
Normal distribution, Binomial distribution, Poisson
distribution, etc.)
A conjugate prior always goes together with a particular
likelihood to produce the posterior.
We sometimes refer to a conjugate pair of
distributions, meaning
(prior distribution, sample distribution = likelihood)
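The Beta-Binomial conjugate update above reduces to simple arithmetic on the parameters; a minimal sketch (the prior and data values are my illustrative choices):

```python
# Beta-Binomial conjugacy: prior Beta(alpha, beta), y successes in n trials
# -> posterior Beta(alpha + y, beta + n - y).
def beta_binomial_update(alpha, beta, y, n):
    return alpha + y, beta + n - y

a, b = beta_binomial_update(2.0, 5.0, y=12, n=20)   # illustrative prior Beta(2, 5)
post_mean = a / (a + b)      # mean of a Beta(a, b) density
print(a, b, post_mean)       # Beta(14, 13), mean 14/27
```

No integration is needed: conjugacy means the data only shift the Beta parameters, which is precisely what makes conjugate priors convenient.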
In particular, if the sample distribution, i.e. f(x|θ), belongs to the k-
parameter exponential family of distributions:

  f(x|θ) = exp{ Σ_{j=1}^k A_j(θ) B_j(x) + C(x) + D(θ) }

we may put

  f(θ) = exp{ Σ_{j=1}^k ξ_j A_j(θ) + ξ_{k+1} D(θ) + K(ξ₁, ..., ξ_{k+1}) }

where ξ₁, ..., ξ_{k+1} are parameters of this prior distribution
and K(ξ₁, ..., ξ_{k+1}) is a function of ξ₁, ..., ξ_{k+1} only.
Then

  f(θ|x) ∝ f(θ) ∏_{i=1}^n f(x_i|θ)
        = exp{ Σ_j ξ_j A_j(θ) + ξ_{k+1} D(θ) + K(ξ₁, ..., ξ_{k+1}) } × exp{ Σ_j A_j(θ) Σ_i B_j(x_i) + Σ_i C(x_i) + n D(θ) }
        ∝ exp{ Σ_{j=1}^k ( ξ_j + Σ_{i=1}^n B_j(x_i) ) A_j(θ) + ( ξ_{k+1} + n ) D(θ) }

i.e. the posterior distribution is of the same form as the prior distribution, but
with parameters

  ( ξ₁ + Σ_i B₁(x_i), ..., ξ_k + Σ_i B_k(x_i), ξ_{k+1} + n )

instead of

  ( ξ₁, ..., ξ_{k+1} )
Some common cases:

  Conjugate prior           Sample distribution              Posterior
  θ ~ Beta(α, β)            X ~ Bin(n, θ)                    θ|x ~ Beta(α + x, β + n - x)
  θ ~ N(μ, τ²)              X_i ~ N(θ, σ²), known σ²         θ|x ~ N( (σ²μ + nτ²x̄)/(σ² + nτ²), σ²τ²/(σ² + nτ²) )
  λ ~ Gamma(α, β)           X_i ~ Po(λ)                      λ|x ~ Gamma(α + Σ_i x_i, β + n)
  θ ~ Pareto(β, α)          X_i ~ U(0, θ)                    θ|x ~ Pareto( max(β, max_i x_i), α + n )
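The Gamma-Poisson row of the table can be sketched the same way as the Beta-Binomial case; again the prior values and names are my illustrative choices:

```python
import numpy as np

# Gamma-Poisson conjugacy (rate parametrization): prior lambda ~ Gamma(alpha, beta),
# data X_i ~ Po(lambda) -> posterior Gamma(alpha + sum(x), beta + n).
def gamma_poisson_update(alpha, beta, x):
    return alpha + sum(x), beta + len(x)

rng = np.random.default_rng(1)
x = rng.poisson(3.0, size=50)                 # simulated counts, true rate 3
a, b = gamma_poisson_update(2.0, 1.0, x)      # illustrative prior Gamma(2, 1)
print(a, b, a / b)                            # posterior mean (alpha + sum x)/(beta + n)
```

The posterior mean a/b sits close to the true rate 3 because 50 observations dominate the weak Gamma(2, 1) prior.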
Non-informative priors
• We often do not have any prior information, although true Bayesians would argue we always have some prior information!
• We would hope to have good agreement between the Frequentist approach and the Bayesian approach with a non-informative prior.
• Diffuse or flat priors are often better terms to use, as no prior is strictly non-informative!
• For our example of an unknown mean, candidate priors are a Uniform distribution over a large range or a Normal distribution with a huge variance.
Improper priors
• The limiting prior of both the Uniform and the Normal is a Uniform prior on the whole real line.
• Such a prior is defined as improper, as it is not strictly a probability distribution and doesn't integrate to 1.
• Some care has to be taken with improper priors; however, in many cases they are acceptable provided they result in a proper posterior distribution.
• Uniform priors are often used as non-informative priors; however, it is worth noting that a uniform prior on one scale can be very informative on another.
• For example: if we have an unknown variance we may put a uniform prior on the variance, the standard deviation or log(variance), which will all have different effects.
Point and Interval Estimation
• In Bayesian inference the outcome of interest for a parameter is its full posterior distribution; however, we may be interested in summaries of this distribution.
• A simple point estimate would be the mean of the posterior (although the median and mode are alternatives).
• Interval estimates are also easy to obtain from the posterior distribution and are given several names, for example credible intervals, Bayesian confidence intervals and highest density regions (HDR). All of these refer to the same quantity.
Definition (Mean Square Error)

  M.S.E.(h) = E[ (h(x) - θ)² ] = ∫∫ (h(x) - θ)² f(θ|x) f(x) dθ dx
Bayes Estimator
Theorem: The Bayes estimator that will
minimize the mean square error is

  ĥ(x) = E(θ | x)

• That is, the posterior mean.
Bayes Estimator
Lemma: Suppose Z and W are real random variables. Then

  min_h E[ (Z - h(W))² ] = E[ (Z - E(Z|W))² ]

That is, the posterior mean E(Z|W) will minimize the quadratic loss (mean square error).
Bayes Estimator
Proof of the Lemma:

  E[ (Z - h(W))² ] = E[ (Z - E(Z|W) + E(Z|W) - h(W))² ]
                   = E[ (Z - E(Z|W))² ]
                     + 2 E[ (Z - E(Z|W)) (E(Z|W) - h(W)) ]
                     + E[ (E(Z|W) - h(W))² ]

Conditioning on W, the cross term is zero. Thus

  E[ (Z - h(W))² ] = E[ (Z - E(Z|W))² ] + E[ (E(Z|W) - h(W))² ] ≥ E[ (Z - E(Z|W))² ]
Bayes Estimator
Recall:
Theorem: The posterior distribution
depends only on sufficient
statistics.
Therefore, let T(X) be a sufficient
statistic; then

  θ̂ = E(θ | x) = E(θ | T(x))
In all cases
• The posterior mean is a compromise between
the prior mean and the MLE
• The posterior s.d. is less than both the prior s.d.
and the s.e.(MLE)
'A Bayesian is one who, vaguely expecting a
horse and catching a glimpse of a donkey,
strongly concludes he has seen a mule'
-- Senn, 1997 --
As n → ∞:
• The posterior mean → the MLE
• The posterior s.d. → the s.e.(MLE)
• The posterior does not depend on the prior.
Point Estimation via Decision Theory
The action is a particular point estimator θ̂
The state of nature is the true value of θ
The loss function is a measure of how good (desirable)
the estimator is of θ:

  L_S = L_S(θ, θ̂)

Prior information is quantified by the prior distribution
(p.d.f.) f(θ)
Data is the random sample x from a distribution with
p.d.f. f(x|θ)
Three loss functions

Zero-one loss:  L_S(θ, θ̂) = 0 if θ̂ = θ, and 1 if θ̂ ≠ θ

Absolute (error) loss:  L_S(θ, θ̂) = |θ̂ - θ|

Quadratic (error) loss:  L_S(θ, θ̂) = (θ̂ - θ)²
Bayes estimator with respect to a given loss function
We are looking for the estimator θ̂ that minimizes the prior expected risk

  R(θ̂) = ∫_X ∫ L_S(θ, θ̂) f(θ) f(x|θ) dθ dx
        = ∫_X [ ∫ L_S(θ, θ̂) f(θ|x) dθ ] f(x) dx

For any given value of x, what has to be minimized is therefore
the posterior expected loss:

  ∫ L_S(θ, θ̂) f(θ|x) dθ
Now, minimization with respect to different loss functions
will result in measures of location in the posterior
distribution of θ.

Zero-one loss: the Maximum A-Posteriori (MAP)
estimator (it maximizes the posterior pdf):
  θ̂ is the posterior mode for θ given x

Absolute error loss:
  θ̂ is the posterior median for θ given x

Quadratic loss: the Bayes estimator:
  θ̂ is the posterior mean for θ given x
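The three estimators can be compared numerically on a concrete posterior; here is a sketch for an illustrative Beta(3, 7) posterior (the grid-based median is an assumption-free numerical shortcut, not part of the slides):

```python
import numpy as np

# Bayes point estimates for a Beta(3, 7) posterior:
# quadratic loss -> posterior mean, absolute loss -> posterior median,
# zero-one loss -> posterior mode (MAP).
a, b = 3.0, 7.0
theta = np.linspace(1e-6, 1 - 1e-6, 200001)
dens = theta**(a - 1) * (1 - theta)**(b - 1)   # unnormalized Beta density
cdf = np.cumsum(dens)
cdf /= cdf[-1]                                  # numerical CDF on the grid

post_mean = a / (a + b)                         # 0.3
post_mode = (a - 1) / (a + b - 2)               # 0.25 (MAP)
post_median = theta[np.searchsorted(cdf, 0.5)]  # roughly 0.286
print(post_mean, post_mode, post_median)
```

For a skewed posterior like this one the three answers differ (mode < median < mean), which is why the choice of loss function matters.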
Credible Intervals (Regions)
• Considering a conjugate normal example with posterior distribution:
μ|x ~ N(165.23,2.222)
• Then a 95% credible interval for μ is
165.23±1.96×sqrt(2.222) = (162.31,168.15).
• Note that the credible intervals can be interpreted in the more natural way that there is a probability of 0.95 that the interval contains μ rather than the Frequentist conclusion that 95% of such intervals would contain μ.
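The slide's interval is a one-line computation (a sketch; names are mine):

```python
from math import sqrt

# 95% credible interval for mu | x ~ N(165.23, 2.222):
# posterior mean +/- 1.96 posterior standard deviations.
mean, var = 165.23, 2.222
half = 1.96 * sqrt(var)
lo, hi = mean - half, mean + half
print(round(lo, 2), round(hi, 2))   # 162.31 168.15
```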
Credible regions
Let Θ be the parameter space for θ and let S_x be a subset of Θ.
If

  ∫_{S_x} f(θ|x) dθ = 1 - α

then S_x is a 1 - α credible region for θ.
For a scalar θ we refer to it as a credible interval.
The important difference compared to confidence regions:
A credible region is a fixed region (conditional on the sample) to which
the random variable θ belongs with probability 1 - α.
A confidence region is a random region that covers the fixed θ with
probability 1 - α.
Highest posterior density (HPD) regions
If, for all θ₁ ∈ S_x and all θ₂ ∉ S_x,

  f(θ₁|x) ≥ f(θ₂|x)

then S_x is called a 1 - α highest posterior density (HPD) credible region.

Equal-tailed credible intervals
An equal-tailed 1 - α credible interval is a 1 - α credible interval (a, b)
such that

  Pr(θ < a | x) = α/2  and  Pr(θ > b | x) = α/2
Example:
In a consignment of 5000 pills suspected to contain the drug Ecstasy, a
sample of 10 pills is taken for chemical analysis. Let θ be the
unknown proportion of Ecstasy pills in the consignment, and assume we
have a high prior belief that this proportion is 100%.
Such a high prior belief can be modelled with a Beta density

  f(θ) = θ^(α-1) (1 - θ)^(β-1) / B(α, β)

where β is set to 1 and α is set to a fairly high value, say 20, i.e. Beta(20, 1).

[Figure: the Beta(20,1) density on (0, 1)]
Now, suppose that after chemical analysis of the 10 pills, all of them showed
to contain Ecstasy.
The posterior density for θ is Beta(α + x, β + n - x) (conjugate prior
with Binomial sample distribution, as the population is large):

  Beta(20 + 10, 1 + 10 - 10) = Beta(30, 1)

Then a lower-tail 99% credible interval (a, 1) for θ satisfies

  Pr(θ > a | x) = ∫_a^1 θ^(30-1) / B(30, 1) dθ = 0.99

Since B(30, 1) = 1/30, this gives

  ∫_a^1 30 θ^29 dθ = 1 - a^30 = 0.99  ⟹  a = 0.01^(1/30) ≈ 0.858

Thus with 99% certainty we can state that at least 85.8% of the pills in the
consignment consist of Ecstasy.
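Since the Beta(30, 1) CDF is simply θ^30, the credible bound has a closed form that is easy to check:

```python
# Lower-tail 99% credible bound for theta | data ~ Beta(30, 1):
# Pr(theta > a | x) = 1 - a**30 = 0.99, hence a = 0.01**(1/30).
a = 0.01 ** (1 / 30)
print(round(a, 3))   # 0.858
```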
Comments:
We have used the Binomial distribution for the sample. More correct
would be to use the hypergeometric distribution, but the Binomial is a
good approximation.
For a smaller consignment (i.e. population) we can benefit from using the
result that the posterior for the number of pills containing Ecstasy in the
rest of the consignment, after removing the sample, is beta-binomial.
This would however give similar results.
If a sample consists of 100% of one kind, how would a confidence
interval for θ be obtained?
Hypothesis Testing
Another big issue in statistical modelling is the ability to
test hypotheses and make model comparisons in general.
The Bayesian approach is in some ways more
straightforward. For an unknown parameter θ
we simply calculate the posterior probabilities

  p₀ = P(H₀ | x),  p₁ = P(H₁ | x)

and decide between H₀ and H₁ accordingly.
We also require the prior probabilities to achieve this:

  P(H₀),  P(H₁)
Decision theory applied to hypothesis testing
Test of H₀: θ = θ₀ vs. H₁: θ = θ₁
Decision procedure: δ_C = "Use a test with critical region C"
Action: δ_C(x) = "Reject H₀ if x ∈ C, otherwise accept H₀"
Loss function:
              H₀ true   H₁ true
  Accept H₀      0         b
  Reject H₀      a         0
Risk function
Assume a prior setting p₀ = Pr(H₀ is true) = Pr(θ = θ₀)
and p₁ = Pr(H₁ is true) = Pr(θ = θ₁).

  R(θ; δ_C) = E[ L_S(θ, δ_C(X)) ]

  R(θ₀; δ_C) = (loss when rejecting a true H₀) × Pr(X ∈ C | θ₀) = a · α,  where α = Pr(X ∈ C | θ₀)
  R(θ₁; δ_C) = (loss when accepting a false H₀) × Pr(X ∉ C | θ₁) = b · β,  where β = Pr(X ∉ C | θ₁)

The Bayes risk (prior expected risk) becomes

  E[ R(θ; δ_C) ] = a α p₀ + b β p₁
Bayes test:

  δ_B = argmin_C E[ R(θ; δ_C) ] = argmin_C ( a α p₀ + b β p₁ )

Minimax test:

  δ* = argmin_C max_θ R(θ; δ_C) = argmin_C max( a α, b β )

Lemma: Bayes tests and most powerful tests (Neyman-Pearson
lemma) are equivalent, in that every most powerful test is a Bayes
test for some values of p₀ and p₁, and every Bayes test is a most
powerful test that rejects H₀ when

  L(θ₁; x) / L(θ₀; x) ≥ p₀ a / (p₁ b)

or, equivalently,

  L(θ₀; x) / L(θ₁; x) ≤ p₁ b / (p₀ a)
Example:
Assume x = (x₁, x₂) is a random sample from Exp(θ), i.e.

  f(x|θ) = (1/θ) e^(-x/θ),  x > 0, θ > 0

We would like to test H₀: θ = 1 vs. H₁: θ = 2 with a Bayes test with
losses a = 2 and b = 1 and with prior probabilities p₀ and p₁.

  L(θ₁; x) / L(θ₀; x) = [ (1/2) e^(-x₁/2) · (1/2) e^(-x₂/2) ] / [ e^(-x₁) · e^(-x₂) ]
                      = (1/4) e^((x₁ + x₂)/2)

So the Bayes test rejects H₀ when

  (1/4) e^((x₁ + x₂)/2) ≥ p₀ a / (p₁ b) = 2 p₀ / p₁

i.e. when

  x₁ + x₂ ≥ 2 ln(8 p₀ / p₁)
Now, under H₀ (θ = 1) the size of the test with threshold t is

  Pr(X₁ + X₂ ≥ t | θ = 1)
    = ∫∫_{x₁ + x₂ ≥ t} e^(-x₁) e^(-x₂) dx₁ dx₂
    = ∫_t^∞ e^(-x₁) dx₁ + ∫_0^t e^(-x₁) [ ∫_{t - x₁}^∞ e^(-x₂) dx₂ ] dx₁
    = e^(-t) + ∫_0^t e^(-x₁) e^(-(t - x₁)) dx₁
    = e^(-t) + t e^(-t) = e^(-t) (1 + t)

With t = 2 ln(8 p₀/p₁) this becomes

  α = ( p₁ / (8 p₀) )² ( 1 + 2 ln(8 p₀/p₁) )

A fixed size α gives conditions on p₀ and p₁, and a certain
choice will give a minimized β.
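The size formula e^(-t)(1 + t) can be sanity-checked by simulation; the prior probabilities below are my illustrative choices, not values from the slides:

```python
import numpy as np

# Under H0 (theta = 1), the test rejects when X1 + X2 >= t with
# t = 2*ln(8*p0/p1); the closed form for the size is exp(-t) * (1 + t).
p0, p1 = 0.5, 0.5                        # illustrative prior probabilities
t = 2 * np.log(8 * p0 / p1)              # threshold of the Bayes test
alpha_exact = np.exp(-t) * (1 + t)

rng = np.random.default_rng(2)
s = rng.exponential(1.0, size=(200_000, 2)).sum(axis=1)  # X1 + X2 under H0
alpha_mc = np.mean(s > t)
print(alpha_exact, alpha_mc)
```

With equal priors, t = 2 ln 8 and the size comes out near 0.08; the Monte Carlo estimate agrees with the closed form to within simulation error.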
The Modern Bayesians
Bayesian analysis is increasingly popular thanks to
the advent of modern algorithms (Markov chain
Monte Carlo [MCMC]), computers and growing
databases. Some notable procedures are:
• Bayesian Hierarchical Models
• Empirical Bayes
• Bayesian Networks
• Naïve Bayes
• etc.
Acknowledgement
Part of this presentation was based on publicly
available resources posted online.