
Chapter 02: Probability Theory


Page 1: Chapter 02: Probability Theory

LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 02: Probability Theory

Dr. Martin Lauer

University of Freiburg, Machine Learning Lab

Karlsruhe Institute of Technology, Institute of Measurement and Control Systems


Page 2: Chapter 02: Probability Theory

References for this chapter

Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 2, Springer, 2006

Christian P. Robert, The Bayesian Choice, Springer, 1994

Robb J. Muirhead, Aspects of Multivariate Statistical Theory, Wiley, 1982


Page 3: Chapter 02: Probability Theory

Basic Concepts


Page 4: Chapter 02: Probability Theory

Probability theory

Probability theory deals with:

random experiments

random events

random variables

models to describe the occurrence of random events


Page 5: Chapter 02: Probability Theory

Random experiments

A random experiment is an experiment with an unknown outcome, e.g.

rolling a die

sensing physical values, e.g. temperature, humidity, magnetic field strength, voltage, ...

counting cars on a road segment

The set of all possible outcomes of a random experiment constitutes the sample space Ω


Page 6: Chapter 02: Probability Theory

Random events

A random event is a set of possible outcomes of a random experiment, i.e. A ⊆ Ω. E.g.

rolling a die yields a “6”:
Ω = {1, 2, 3, 4, 5, 6}, A = {6}

it is raining (or not) during the day at a certain place:
Ω = {raining, not raining}, A = {raining}

an observed train runs with a velocity between 100 km/h and 200 km/h:
Ω = ℝ, A = [100, 200]

a fruit which is put on the scales at a supermarket is an apple or a pear:
Ω = {strawberry, pear, peach, apple, banana, cherry}, A = {apple, pear}


Page 7: Chapter 02: Probability Theory

Random variables

Precise: Real-valued functions on Ω are called random variables.

Intuitive: Random variables are variables whose value is the result of a random experiment

Examples:

the present number of inhabitants of Freiburg: Ω = ℕ0, X(ω) = ω

the gray level at a certain position of a camera image: Ω = {0, 1, . . . , 255}, X(ω) = ω

the index in an enumeration of fruit: Ω = {strawberry, pear, peach, apple, banana, cherry},
X(ω) = 1 if ω = strawberry, 2 if ω = pear, 3 if ω = peach, ...


Page 8: Chapter 02: Probability Theory

Random variables

Functions of random variables are random variables themselves. E.g.

the average gray level of all pixels of a camera image

Z(ω) = f(X(ω)) + g(Y (ω))

Z(ω) = f(X(ω), Y (ω))

Notation: the dependency of X on ω is often not written, i.e. we write X instead of X(ω).


Page 9: Chapter 02: Probability Theory

Probability distributions

How can we describe the occurrence of random events?

A probability distribution is a function f that assigns a real number to each measurable subset of the sample space Ω and that meets the following axioms

f(A) ≥ 0 for all sets A ⊆ Ω

f(Ω) = 1

f(A ∪ B) = f(A) + f(B) for all disjoint sets A,B ⊆ Ω

f(A) is called the probability of A.

What does “measurable set” mean?

if Ω is finite or countable: every subset is measurable

if Ω = ℝⁿ: only those subsets A ⊆ Ω are measurable for which the integral ∫_A dx is defined (all subsets of practical relevance)


Page 10: Chapter 02: Probability Theory

Probability distributions

The definition formally specifies which functions can be used as probability distributions and which cannot

The definition does not specify which function is useful for a certain random experiment or which one is similar to observed frequencies

Examples (raining/not raining experiment):

f : ∅ ↦ 0, {raining} ↦ 0.3, {not raining} ↦ 0.7, {raining, not raining} ↦ 1

g : ∅ ↦ 0, {raining} ↦ 0.6, {not raining} ↦ 0.4, {raining, not raining} ↦ 1

h : ∅ ↦ 0, {raining} ↦ 1, {not raining} ↦ 0, {raining, not raining} ↦ 1
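To make the axioms concrete, here is a minimal Python sketch (my own, not part of the original slides) that represents each candidate distribution f, g, h by its values on the elementary outcomes and checks the axioms on all events of this finite sample space; all helper names are mine.

    # Minimal sketch, assuming Python 3: check the three axioms for the
    # raining / not-raining example; f, g, h are given by their outcome probabilities.
    from itertools import chain, combinations

    omega = ("raining", "not raining")
    candidates = {"f": {"raining": 0.3, "not raining": 0.7},
                  "g": {"raining": 0.6, "not raining": 0.4},
                  "h": {"raining": 1.0, "not raining": 0.0}}

    def events(space):
        # all subsets of a finite sample space
        return chain.from_iterable(combinations(space, r) for r in range(len(space) + 1))

    def prob(p, event):
        # probability of an event = sum of the outcome probabilities it contains
        return sum(p[w] for w in event)

    for name, p in candidates.items():
        assert all(prob(p, A) >= 0 for A in events(omega))      # f(A) >= 0
        assert abs(prob(p, omega) - 1.0) < 1e-12                # f(Omega) = 1
        # additivity for disjoint events holds because prob() sums outcome probabilities
        print(name, "satisfies the axioms")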


Page 11: Chapter 02: Probability Theory

Probability mass functions

How can we represent probability distributions efficiently?

probability mass function (pmf)

only for countable Ω

p : Ω → R≥0

p assigns to each ω ∈ Ω the probability of ω. Examples:

p1 : raining ↦ 0.3, not raining ↦ 0.7

p2 : strawberry ↦ 0.1, pear ↦ 0.2, peach ↦ 0.1, apple ↦ 0.3, banana ↦ 0.2, cherry ↦ 0.1

p3 : i ∈ ℕ0 ↦ 0.7 · 0.3^i
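As a quick numerical check (my own sketch, not from the slides), p3 is indeed normalised: the geometric series Σ_{i≥0} 0.7 · 0.3^i equals 0.7/(1 − 0.3) = 1.

    # Sketch: truncated sum of p3(i) = 0.7 * 0.3**i over i = 0, 1, 2, ...
    p3 = lambda i: 0.7 * 0.3 ** i
    print(sum(p3(i) for i in range(200)))   # ~1.0 (geometric series)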


Page 12: Chapter 02: Probability Theory

Probability density functions

probability density function (pdf)

only for Ω = Rn

p : Ω → R≥0

p assigns to each ω ∈ Ω a value so that ∫_A p(x) dx is the probability of the event A ⊆ Ω

Example:

p(x) = x/6 − 1/3 if 2 ≤ x ≤ 5,  −x/2 + 3 if 5 < x ≤ 6,  0 otherwise

[Plot: p(x) rises linearly from 0 at x = 2 to 0.5 at x = 5 and falls linearly back to 0 at x = 6]

P(X ∈ [3, 4]) = ∫_3^4 p(x) dx = 1/4

P(X < 3) = ∫_−∞^3 p(x) dx = 1/12
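These probabilities can be verified numerically; the following sketch (my own, assuming scipy is available) integrates the piecewise density with scipy.integrate.quad.

    # Sketch: numerical check of the probabilities for the piecewise linear density.
    from scipy.integrate import quad

    def p(x):
        if 2 <= x <= 5:
            return x / 6 - 1 / 3
        if 5 < x <= 6:
            return -x / 2 + 3
        return 0.0

    print(quad(p, 3, 4)[0])                          # P(X in [3,4]) ~ 0.25
    print(quad(p, -10, 3)[0])                        # P(X < 3)      ~ 0.0833 = 1/12
    print(quad(p, -10, 10, points=[2, 5, 6])[0])     # total probability mass ~ 1.0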


Page 13: Chapter 02: Probability Theory

Probability density functions

If two random variables depend on each other, how do their probability functions depend on each other?

Examples:

Y = X + c (c ∈ ℝ constant)

∫_a^b pY(y) dy = P(a < Y < b) = P(a < X + c < b) = P(a − c < X < b − c) = ∫_{a−c}^{b−c} pX(x) dx = ∫_a^b pX(y − c) dy   (substituting x = y − c)

It follows: pY(y) = pX(y − c)



Page 15: Chapter 02: Probability Theory

Probability density functions

Y = s · X (s > 0 constant)

∫_a^b pY(y) dy = P(a < Y < b) = P(a < sX < b) = P(a/s < X < b/s) = ∫_{a/s}^{b/s} pX(x) dx = ∫_a^b pX(y/s) · (1/s) dy   (substituting x = y/s)

It follows: pY(y) = (1/s) · pX(y/s)

Y = 1 if X ≥ 0, 0 otherwise

pY(0) = ∫_−∞^0 pX(x) dx,   pY(1) = ∫_0^∞ pX(x) dx
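A quick Monte-Carlo illustration of the scaling rule (my own sketch, assuming numpy; the standard normal is just a convenient choice for pX): sample X, set Y = s·X, and compare a histogram of Y with (1/s)·pX(y/s).

    # Sketch: empirical check of pY(y) = (1/s) * pX(y/s) for Y = s*X.
    import numpy as np

    rng = np.random.default_rng(0)
    s = 2.0
    x = rng.standard_normal(100_000)     # X ~ N(0, 1)
    y = s * x

    phi = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)   # pdf of N(0, 1)
    hist, edges = np.histogram(y, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - phi(centers / s) / s)))         # small deviation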


Page 16: Chapter 02: Probability Theory

Probability density functions

Y = 1/X  (if pX(x) = 0 for all x ≤ 0)

→ exercises


Page 17: Chapter 02: Probability Theory

Probability density functions

Z = X + Y

Not that easy! Example:

pX(x) = 1/3 if x = 1, 2/3 if x = 2        pY(y) = 1/3 if y = 1, 2/3 if y = 2

Possible distributions for Z:

pZ^(I)(z) = 1/9 if z = 2, 4/9 if z = 3, 4/9 if z = 4

pZ^(II)(z) = 1/3 if z = 2, 0 if z = 3, 2/3 if z = 4

The distribution of Z depends on the relationship between X and Y !
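The two candidate distributions can be reproduced from two different joint distributions that share the same marginals; in this sketch (my own, not from the slides) pZ^(I) arises when X and Y are independent and pZ^(II) when Y = X with probability 1.

    # Sketch: the distribution of Z = X + Y is determined by the joint of (X, Y).
    from collections import defaultdict

    # case I: X and Y independent -> joint is the product of the marginals
    joint_I = {(x, y): px * py
               for x, px in [(1, 1/3), (2, 2/3)]
               for y, py in [(1, 1/3), (2, 2/3)]}
    # case II: Y = X with probability 1 (same marginals, fully dependent)
    joint_II = {(1, 1): 1/3, (2, 2): 2/3}

    def dist_of_sum(joint):
        pz = defaultdict(float)
        for (x, y), p in joint.items():
            pz[x + y] += p
        return dict(pz)

    print(dist_of_sum(joint_I))    # {2: 1/9, 3: 4/9, 4: 4/9}
    print(dist_of_sum(joint_II))   # {2: 1/3, 4: 2/3}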


Page 18: Chapter 02: Probability Theory

Joint probabilities

A joint probability density function for two random variables X and Y is a probability density function for the pair (X, Y).

pX,Y : ℝ² → ℝ≥0

∫_−∞^∞ ∫_−∞^∞ pX,Y(x, y) dx dy = 1

if A ⊆ ℝ², then ∫_A pX,Y(x, y) d(x, y) is the joint probability of the event A

We can define joint probability mass functions analogously.


Page 19: Chapter 02: Probability Theory

Joint probabilities

Example:

pX,Y(x, y) = 1 if 0 < x < 2 and 0 < y < x/2,  0 otherwise

pX,Y(x, y) = 1/12 if (x, y) = (1, 1),  1/4 if (x, y) = (1, 2),  5/12 if (x, y) = (2, 1),  1/4 if (x, y) = (2, 2)


Page 20: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

How do pX,Y and pX, pY depend on each other?

∫_a^b ∫_−∞^∞ pX,Y(x, y) dy dx = P(a < X < b, −∞ < Y < ∞) = ∫_a^b pX(x) dx

Hence:

pX(x) = ∫_−∞^∞ pX,Y(x, y) dy   (for Y ∈ ℝ)

pY(y) = ∫_−∞^∞ pX,Y(x, y) dx   (for X ∈ ℝ)

pX(x) = Σ_{y∈ℕ0} pX,Y(x, y)   (for Y ∈ ℕ0)

pY(y) = Σ_{x∈ℕ0} pX,Y(x, y)   (for X ∈ ℕ0)

These equations are called the marginalisation of a random variable.

Page 21: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Example:

pX,Y(x, y) = 1/12 if (x, y) = (1, 1),  1/4 if (x, y) = (1, 2),  5/12 if (x, y) = (2, 1),  1/4 if (x, y) = (2, 2)

pX(x) = 1/3 if x = 1,  2/3 if x = 2

pY(y) = 1/2 if y = 1,  1/2 if y = 2

        y = 1   y = 2   Σ
x = 1   1/12    1/4     1/3
x = 2   5/12    1/4     2/3
Σ       1/2     1/2     1
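The marginalisation of this table amounts to row and column sums; a minimal numpy sketch (my own, not from the slides):

    # Sketch: marginalising the 2x2 joint pmf above by summing rows and columns.
    import numpy as np

    p_xy = np.array([[1/12, 1/4],    # row x = 1
                     [5/12, 1/4]])   # row x = 2; columns y = 1, 2

    p_x = p_xy.sum(axis=1)   # marginal of X: [1/3, 2/3]
    p_y = p_xy.sum(axis=0)   # marginal of Y: [1/2, 1/2]
    print(p_x, p_y, p_xy.sum())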


Page 22: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Example:

pX,Y(x, y) = 1 if 0 < x < 2 and 0 < y < x/2,  0 otherwise

pX(x) = ∫_0^{x/2} 1 dy = x/2 if 0 < x < 2,  0 otherwise

pY(y) = ∫_{2y}^2 1 dx = 2 − 2y if 0 < y < 1,  0 otherwise

[Figure: the triangular support of pX,Y(x, y) in the (x, y)-plane together with the marginal densities pX(x) and pY(y)]


Page 23: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Z = X + Y

How does pZ depend on pX,Y ?

discrete case X, Y, Z ∈ ℕ0:

pZ(z) = P(Z = z) = P(X + Y = z) = Σ_{x∈ℕ0} P(X = x, X + Y = z) = Σ_{x∈ℕ0} P(X = x, Y = z − x) = Σ_{x∈ℕ0} pX,Y(x, z − x)

continuous case X, Y, Z ∈ ℝ:

pZ(z) = ∫_−∞^∞ pX,Y(x, z − x) dx


Page 24: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Derivation of the continuous case:

∫_a^b pZ(z) dz = P(a < X + Y < b)
              = P(−∞ < X < ∞, a − X < Y < b − X)
              = ∫_−∞^∞ ∫_{a−x}^{b−x} pX,Y(x, y) dy dx
              = ∫_−∞^∞ ∫_a^b pX,Y(x, z − x) dz dx   (substituting z = y + x)
              = ∫_a^b ∫_−∞^∞ pX,Y(x, z − x) dx dz


Page 25: Chapter 02: Probability Theory

Conditional distributions

What is the distribution of Y if we know that X takes a certain value x?

A conditional distribution of Y given X is defined as

pY|X(y|x) = pX,Y(x, y) / pX(x)

Example:

pX,Y(x, y) = 1 if 0 < x < 2 and 0 < y < x/2,  0 otherwise

pX(x) = x/2 if 0 < x < 2,  0 otherwise

pY|X(y|x) = 2/x if 0 < y < x/2,  0 otherwise

[Figure: the joint density pX,Y(x, y) and the conditional density pY|X(y|x)]
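For a discrete joint distribution the same definition reduces to dividing the joint table by the marginal; a small sketch (my own, reusing the 2×2 example from the marginalisation slides):

    # Sketch: conditional pmf pY|X(y|x) = pX,Y(x, y) / pX(x) for the 2x2 example.
    import numpy as np

    p_xy = np.array([[1/12, 1/4],    # rows: x = 1, 2; columns: y = 1, 2
                     [5/12, 1/4]])
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y_given_x = p_xy / p_x
    print(p_y_given_x)               # each row sums to 1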


Page 26: Chapter 02: Probability Theory

Joint, marginal, and conditional distributions

joint distribution:

pX,Y(x, y)

marginal distribution:

pY(y) = ∫_−∞^∞ pX,Y(x, y) dx

conditional distribution:

pY|X(y|x) = pX,Y(x, y) / pX(x)

[Figure: the joint density pX,Y(x, y), the marginal density pY(y), and the conditional density pY|X(y|x)]


Page 27: Chapter 02: Probability Theory

Statistical independence

Statistical independence of two random variables means that knowing the value of one does not allow drawing conclusions about the value of the other.

Definition: X, Y are independent if for all values of x and y it holds that

pX,Y(x, y) = pX(x) · pY(y)

If X and Y are independent, then pX|Y(x|y) = pX(x) and pY|X(y|x) = pY(y)

Example:

[Figure: two example joint densities pX,Y(x, y); in one of them X, Y are not independent, in the other X, Y are independent]


Page 28: Chapter 02: Probability Theory

Statistical independence

Why is independence so important?

simplification of formulas

neglecting the influence of other variables

if variables are not independent, we can use knowledge of one variable to draw conclusions about the other

Important:

the concept of independence can be extended to 3, 4, ... random variables or to groups of random variables

statistical independence is not transitive.
Example: X, Y, Z ∈ {0, 1} with

pX,Y,Z(x, y, z) = 1/4 if x + y + z ∈ {1, 3},  0 otherwise
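A short check of this example (my own sketch, not from the slides): every pair of the three variables is independent, yet the three variables are not jointly independent.

    # Sketch: pairwise independence without joint independence for the example above.
    from itertools import product

    p = {t: (0.25 if sum(t) in (1, 3) else 0.0) for t in product((0, 1), repeat=3)}

    def marg1(i, a):
        return sum(q for t, q in p.items() if t[i] == a)

    def marg2(i, a, j, b):
        return sum(q for t, q in p.items() if t[i] == a and t[j] == b)

    pairs_ok = all(abs(marg2(i, a, j, b) - marg1(i, a) * marg1(j, b)) < 1e-12
                   for i, j in [(0, 1), (0, 2), (1, 2)]
                   for a in (0, 1) for b in (0, 1))
    print(pairs_ok)                                               # True: pairwise independent
    print(p[(1, 1, 1)], marg1(0, 1) * marg1(1, 1) * marg1(2, 1))  # 0.25 vs 0.125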


Page 29: Chapter 02: Probability Theory

Remarks on notation

probabilities

• complete (correct) notation: P (X = x), pX(x)

• simplified notation: P (X), p(X)

• for real-valued variables P(X) denotes the probability density function of X

pdf and pmf

• precise: pdf ≠ pmf

• convenient: unified notation, use ∫_a^b f(x) dx and Σ_{x=a}^b f(x) interchangeably


Page 30: Chapter 02: Probability Theory

Standard Distribution Families


Page 31: Chapter 02: Probability Theory

Standard distribution families

Every function that meets the axioms can serve as a pdf/pmf, but

Some families of functions have proved to be very useful for many purposes.

Relevant for us:

categorical distribution (discrete)

uniform distribution (univariate/multivariate real valued)

Gaussian/normal distribution (univariate/multivariate real-valued)

gamma distribution/inverse gamma distribution (real-valued)

Dirichlet distribution (multivariate real-valued)

Wishart distribution/Inverse Wishart distribution (multivariate real-valued)


Page 32: Chapter 02: Probability Theory

Categorical distribution

typically used to describe the value of a discrete random variable with finite range

parameters: q1, . . . , qN ∈ ℝ≥0 so that Σ_{i=1}^N qi = 1; qi describes the probability of outcome i.

pmf: p(i) = qi if 1 ≤ i ≤ N,  0 otherwise

examples:

• rolling a die

• election

• fruit on the scales at a supermarket

notation: X ∼ C(q1, . . . , qN)

a categorical distribution for a random variable which takes only the values 0 and 1 is called a Bernoulli distribution.


Page 33: Chapter 02: Probability Theory

Uniform distribution

univariate case:

• only (real) numbers in an interval [a, b] occur

• all numbers within the interval have the same probability

• parameters: a, b ∈ ℝ, a < b

• pdf: p(x) = 1/(b − a) if a ≤ x ≤ b,  0 otherwise

• example: quantisation error of an analog/digital converter

• notation: X ∼ U(a, b)

[Plot: pdf of a univariate uniform distribution]


Page 34: Chapter 02: Probability Theory

Uniform distribution

multivariate case:

• only vectors in A ⊆ ℝⁿ occur

• all vectors within A have the same probability

• parameters: A ⊆ ℝⁿ measurable

• pdf: p(x⃗) = 1 / (∫_A 1 dy⃗) if x⃗ ∈ A,  0 otherwise

• notation: X ∼ U(A)

[Plot: pdf of a multivariate uniform distribution over a region in the (x1, x2)-plane]


Page 35: Chapter 02: Probability Theory

Normal (Gaussian) distribution

most important distribution for real-valued random variables

for a random variable that takes values around a “central” value

univariate case:

• parameters: µ ∈ ℝ, σ² ∈ ℝ>0

• pdf: 1/(√(2π)·√σ²) · e^(−½·(x−µ)²/σ²)

• notation: X ∼ N(µ, σ²)

[Plot: pdf of a univariate Gaussian distribution]


Page 36: Chapter 02: Probability Theory

Normal (Gaussian) distribution

multivariate case:

• parameters: µ⃗ ∈ ℝⁿ, Σ ∈ ℝ^(n×n) symmetric, positive definite

• pdf: 1/((√(2π))ⁿ · √det(Σ)) · e^(−½·(x⃗−µ⃗)ᵀ Σ⁻¹ (x⃗−µ⃗))

• notation: X ∼ N(µ⃗, Σ)

[Plot: pdf of a bivariate Gaussian distribution]
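The pdf can be evaluated directly from this formula; a minimal numpy sketch (my own, with an arbitrary example µ⃗ and Σ):

    # Sketch: evaluating the multivariate Gaussian pdf from the formula above.
    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        n = len(mu)
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))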


Page 37: Chapter 02: Probability Theory

Gamma distribution

a distribution that assigns nonzero probability to positive numbers.

parameters: a > 0, b > 0

pdf: p(x) = (b^a / Γ(a)) · x^(a−1) · e^(−bx) if x ≥ 0,  0 otherwise

notation: X ∼ Γ(a, b)

[Plot: pdfs of gamma distributions]


Page 38: Chapter 02: Probability Theory

Inverse gamma distribution

if X ∼ Γ(a, b), then Y = 1/X is said to be distributed w.r.t. Γ⁻¹(a, b)

parameters: a > 0, b > 0

pdf: p(y) = (b^a / Γ(a)) · y^(−a−1) · e^(−b/y) if y ≥ 0,  0 otherwise

the inverse gamma distribution is used to model distributions over the parameter σ² of Gaussian distributions


Page 39: Chapter 02: Probability Theory

Wishart distribution

the multivariate extension of the gamma distribution

models a distribution over symmetric, positive definite matrices

parameters: W ∈ ℝ^(n×n) symmetric, positive definite, ν > n − 1

pdf: p(A) = 1/Z(W, ν) · det(A)^((ν−n−1)/2) · e^(−½·trace(W⁻¹A))

with Z(W, ν) = det(W)^(ν/2) · 2^(νn/2) · π^(n(n−1)/4) · ∏_{i=1}^n Γ((ν + 1 − i)/2)

notation: A ∼ W(W, ν)

the Wishart distribution for n = 1 is equal to the gamma distribution.


Page 40: Chapter 02: Probability Theory

Inverse Wishart distribution

if A ∼ W(W, ν), then B = A⁻¹ is said to be distributed w.r.t. W⁻¹(W, ν)

the inverse Wishart distribution is used to model distributions over the parameter Σ of multivariate Gaussian distributions

parameters: W ∈ ℝ^(n×n) symmetric, positive definite, ν > n − 1

pdf: p(B) = 1/Z(W, ν) · det(B)^(−(ν+n+1)/2) · e^(−½·trace(W⁻¹B⁻¹))

with Z(W, ν) = det(W)^(ν/2) · 2^(νn/2) · π^(n(n−1)/4) · ∏_{i=1}^n Γ((ν + 1 − i)/2)

the inverse Wishart distribution for n = 1 is equal to the inverse gamma distribution.


Page 41: Chapter 02: Probability Theory

Dirichlet distribution

a distribution over a simplex of numbers

models probabilities of a categorical distribution

parameters: α1, . . . , αn ≥ 0

pdf: p(q1, . . . , qn) = (Γ(Σ_{i=1}^n αi) / ∏_{i=1}^n Γ(αi)) · ∏_{i=1}^n qi^(αi−1)

notation: (q1, . . . , qn) ∼ D(α1, . . . , αn)

the Dirichlet distribution for n = 2 is called Beta distribution.

[Plot: Dirichlet densities for n = 2 and n = 3]


Page 42: Chapter 02: Probability Theory

Maximum-Likelihood Estimators


Page 43: Chapter 02: Probability Theory

Density estimators

How do we get meaningful probability distributions for real processes?

using background knowledge to find the appropriate distribution family

using histograms to compare their shape with the shape of possible distribution families

using estimators to find the “best parameters” for a sample


Examples:

rolling a die

• distribution family: categorical with n = 6 possible outcomes

• parameters qi: choose w.r.t. background knowledge and the assumption of an ideal die: qi = 1/6

• parameters qi: choose w.r.t. a random sample o1, . . . , oN as qi = |{j | oj = i}| / N


Page 44: Chapter 02: Probability Theory

Density estimators

Examples:

data from a certain sensor

which density family would be suitable?

how do we choose the parameters of that density family to achieve optimal results?

optimal fit for a normal distribution

[Figure: histogram of the sensor data together with the fitted normal density]


Page 45: Chapter 02: Probability Theory

Density estimators

problem specification

• given a sample x1, . . . , xN from an unknown distribution

• assuming a density family fθ parameterized by parameters θ

• find the value of θ so that the density fits the sample as well as possible

basic idea:

search θ that maximizes the likelihood p(x1, . . . , xN|θ)

assumption to simplify calculation: x1, . . . , xN are independent (i.i.d.)

p(x1, . . . , xN|θ) = ∏_{i=1}^N p(xi|θ) = ∏_{i=1}^N fθ(xi) → maximum

An estimator that maximizes the likelihood is called a maximum likelihood estimator (ML)

trick: often it is convenient to maximize log p(x1, . . . , xN|θ) instead of p(x1, . . . , xN|θ)
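For a univariate Gaussian the maximum of the (log-)likelihood has the well-known closed form µ̂ = sample mean and σ̂² = mean squared deviation; the following sketch (my own, with toy data) simply evaluates it.

    # Sketch: maximum-likelihood estimate of a univariate Gaussian from a toy sample.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=0.5, size=1000)           # toy data

    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()                  # note: 1/N, not 1/(N-1)
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigma2_hat) + (x - mu_hat) ** 2 / sigma2_hat)
    print(mu_hat, sigma2_hat, log_lik)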

Page 46: Chapter 02: Probability Theory

Maximum likelihood estimators

Example:

uniform distribution → blackboard

normal distribution → blackboard

categorical distribution → blackboard


Page 47: Chapter 02: Probability Theory

Maximum likelihood estimators

Maximum likelihood can also be used for conditional distributions

Example:

x1, . . . , xN given, constant

yi ∼ N (a · xi + b, σ2) with unknown parameters a, b, σ2

maximum-likelihood estimator:

maximize over a, b, σ²:   ∏_{i=1}^N 1/(√(2π)·√σ²) · e^(−½·(yi−(a·xi+b))²/σ²)

yields

( Σ_{i=1}^N xi²    Σ_{i=1}^N xi )   ( a )     ( Σ_{i=1}^N xi·yi )
( Σ_{i=1}^N xi     N            ) · ( b )  =  ( Σ_{i=1}^N yi    )

σ² = (1/N) · Σ_{i=1}^N (yi − (a·xi + b))²

→ linear regression
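The 2×2 system above can be solved directly; a small numpy sketch (my own, with toy data) that also evaluates the ML estimate of σ²:

    # Sketch: solving the ML normal equations for a, b and estimating sigma^2.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 5, 50)
    y = 1.5 * x - 0.7 + rng.normal(scale=0.3, size=x.size)   # toy data

    A = np.array([[np.sum(x ** 2), np.sum(x)],
                  [np.sum(x),      x.size   ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a, b = np.linalg.solve(A, rhs)
    sigma2 = np.mean((y - (a * x + b)) ** 2)
    print(a, b, sigma2)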


Page 48: Chapter 02: Probability Theory

Bayesian Inference


Page 49: Chapter 02: Probability Theory

Uncertainty in estimates

maximum likelihood yields a single parameter estimate, e.g. in linear regression one value for a and b, which can be used to predict y(x) for new values of x.

shortcoming: the ML estimate is computed from random data, i.e. it is random itself. How can we model the randomness in the parameter estimate and take it into account for prediction?


Page 50: Chapter 02: Probability Theory

Bayesian inference

modeling a distribution of parameters

p(θ | D),   where θ denotes the parameters and D the sample

Bayes’ theorem:

p(θ|D) = p(D|θ) · p(θ) / p(D) ∝ p(D|θ) · p(θ)

denominator is independent of θ, i.e. it is constant w.r.t. θ.


Page 51: Chapter 02: Probability Theory

Bayesian inference

p(θ|D) ∝ p(D|θ) · p(θ)

p(θ) is called the prior distribution.
Which parameters do we expect without having seen the data?

p(D|θ) is the data likelihood.
How well do the data fit a certain parameter?

p(θ|D) is called the posterior distribution.
Which parameters can we expect knowing the data?


Page 52: Chapter 02: Probability Theory

Bayesian inference

Example: Bernoulli experiment (e.g. tossing a coin) with unknown parameter θ

prior: we want to model that we have no idea about θ: p(θ) = 1 for all θ ∈ [0, 1]

assume we repeated the experiment 10 times with 3 successes

p(θ|D1) ∝ θ³ · (1 − θ)⁷

in another 90 trials we obtained 27 successes

p(θ|D1, D2) ∝ θ²⁷ · (1 − θ)⁶³ · θ³ · (1 − θ)⁷ = θ³⁰ · (1 − θ)⁷⁰

ML-estimate would be θ = 0.3

[Plot: posterior densities of θ after the first 10 and after all 100 trials]
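The posterior can be evaluated on a grid without any special machinery; a minimal sketch (my own, not from the slides):

    # Sketch: unnormalised posterior theta^30 * (1-theta)^70 on a grid, normalised numerically.
    import numpy as np

    theta = np.linspace(0.0, 1.0, 1001)
    post = theta ** 30 * (1 - theta) ** 70            # flat prior times likelihood
    post /= post.sum() * (theta[1] - theta[0])        # normalise to a density
    print(theta[np.argmax(post)])                     # posterior mode ~ 0.3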


Page 53: Chapter 02: Probability Theory

Bayesian inference

which are suitable prior/posterior distributions?

p(D|θ) · p(θ) should be a tractable distribution

p(θ) and p(θ|D) should belong to the same distribution family

→ conjugate prior


Page 54: Chapter 02: Probability Theory

Conjugate priors

Examples:

categorical distribution

• data likelihood: ∏_{i=1}^N qxi = ∏_{j=1}^k qj^(nj)  with nj = |{i | xi = j}|

• which density function over (q1, . . . , qk) looks similar?

• try the Dirichlet distribution as prior: (q1, . . . , qk) ∼ D(α1, . . . , αk)

p(q1, . . . , qk) ∝ ∏_{j=1}^k qj^(αj−1)

p(q1, . . . , qk|D) ∝ ∏_{j=1}^k qj^(nj) (likelihood) · ∏_{j=1}^k qj^(αj−1) (prior) = ∏_{j=1}^k qj^(αj+nj−1)

Hence, (q1, . . . , qk)|D ∼ D(α1 + n1, . . . , αk + nk)
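In code the conjugate update is just adding the observed counts to the prior parameters; a small sketch (my own, with made-up counts):

    # Sketch: Dirichlet-categorical conjugate update, alpha_post = alpha + counts.
    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0])             # prior D(alpha_1, ..., alpha_k)
    data = np.array([0, 2, 2, 1, 2, 0, 2])        # observed categories (0-based)
    counts = np.bincount(data, minlength=alpha.size)
    alpha_post = alpha + counts                   # posterior D(alpha + n)
    print(alpha_post, alpha_post / alpha_post.sum())   # parameters and posterior mean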

Page 55: Chapter 02: Probability Theory

Conjugate priors

parameter µ of a univariate Gaussian distribution

• data likelihood: ∏_{i=1}^N 1/(√(2π)·√σ²) · e^(−½·(xi−µ)²/σ²)

• which density function over µ looks similar?

• try a Gaussian distribution as prior: µ ∼ N(m, ρ²)

p(µ) = 1/(√(2π)·√ρ²) · e^(−½·(µ−m)²/ρ²)

→ blackboard


Page 56: Chapter 02: Probability Theory

Conjugate priors

Table of useful conjugate priors

distribution | parameter | conjugate prior | posterior
categorical C(q1, . . . , qk) | q1, . . . , qk | Dirichlet D(α1, . . . , αk) | D(α1 + n1, . . . , αk + nk)
univ. Gaussian N(µ, σ²) | µ | Gaussian N(m, ρ²) | N((σ²m + ρ²x)/(σ² + ρ²), σ²ρ²/(σ² + ρ²))
univ. Gaussian N(µ, σ²) | σ² | Inverse gamma Γ⁻¹(a, b) | Γ⁻¹(a + 1/2, b + (x − µ)²/2)
multivariate Gaussian N(µ⃗, Σ) | µ⃗ | Gaussian N(m⃗, Ψ) | N((Σ⁻¹ + Ψ⁻¹)⁻¹(Σ⁻¹x⃗ + Ψ⁻¹m⃗), (Σ⁻¹ + Ψ⁻¹)⁻¹)
multivariate Gaussian N(µ⃗, Σ) | Σ | Inv. Wishart W⁻¹(W, ν) | W⁻¹((W⁻¹ + (x⃗ − µ⃗)(x⃗ − µ⃗)ᵀ)⁻¹, ν + 1)
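As an illustration of the second row of the table (my own sketch, for a single observation x): the posterior of µ is again Gaussian with the stated mean and variance.

    # Sketch: conjugate update for the mean of a univariate Gaussian, one observation x.
    def gaussian_mean_posterior(x, sigma2, m, rho2):
        post_mean = (sigma2 * m + rho2 * x) / (sigma2 + rho2)
        post_var = (sigma2 * rho2) / (sigma2 + rho2)
        return post_mean, post_var

    print(gaussian_mean_posterior(x=1.2, sigma2=0.5, m=0.0, rho2=2.0))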


Page 57: Chapter 02: Probability Theory

Conjugate priors

Special case: conjugate distribution for the pair (µ, σ²) / (µ⃗, Σ) of a Gaussian. In the multivariate case, the conjugate prior is:

Σ ∼ W⁻¹(W, ν)

µ⃗ | Σ ∼ N(m⃗, (1/η)·Σ)

The posterior for observed data x⃗1, . . . , x⃗k with x̄ = (1/k)·Σ_{i=1}^k x⃗i and S = (1/k)·Σ_{i=1}^k (x⃗i − x̄)(x⃗i − x̄)ᵀ is:

Σ | x⃗1, . . . , x⃗k ∼ W⁻¹((W⁻¹ + k·S + (k·η/(k + η))·(x̄ − m⃗)(x̄ − m⃗)ᵀ)⁻¹, ν + k)

µ⃗ | Σ, x⃗1, . . . , x⃗k ∼ N((k/(k + η))·x̄ + (η/(k + η))·m⃗, (1/(k + η))·Σ)


Page 58: Chapter 02: Probability Theory

Priors

improper priors: a prior density p for which ∫ p(θ) dθ ≠ 1

non-informative priors: a prior that “is neutral”. Several definitions:

• a prior with uniform density over all parameter values

• a prior for which argmax_θ p(θ|D) = argmax_θ p(D|θ)

• Jeffreys prior


Page 59: Chapter 02: Probability Theory

Jeffreys prior

what is a non-informative prior?

• it should not prefer one parameter value over another

• it should not depend on the parametrization of a distribution

Jeffreys prior (Harold Jeffreys, 1946)

• based on information theory

• considers the way a parameter influences the estimate

Jeffreys prior is defined as:

p(θ⃗) ∝ √det(I(θ⃗))   with I(θ⃗) the Fisher information

I(θ⃗)i,j = ∫_−∞^∞ (−∂²/(∂θi ∂θj) log p(x|θ⃗)) · p(x|θ⃗) dx


Page 60: Chapter 02: Probability Theory

Jeffreys prior

Example: x ∼ N(µ, s). Determine the Jeffreys prior w.r.t. (µ, s) → blackboard

Example: categorical distribution. Non-informative priors:

• D(1, . . . , 1) yields a uniform distribution over all values

• D(0, . . . , 0) yields a prior so that argmax_θ p(θ|D) = argmax_θ p(D|θ)

• D(1/2, . . . , 1/2) yields the Jeffreys prior


Page 61: Chapter 02: Probability Theory

Maximum a posteriori estimator

Idea: use Bayesian analysis to obtain a single parameter estimate: the maximum a posteriori estimator (MAP)

θ̂ = argmax_θ P(θ|D) = argmax_θ (P(D|θ) · P(θ))

Example: estimating the parameters of a categorical distribution:

prior: D(α1, . . . , αk)

posterior: D(α1 + n1, . . . , αk + nk)

MAP estimator: qj^MAP = (nj + αj − 1) / Σ_{ν=1}^k (nν + αν − 1)

ML estimator: qj^ML = nj / Σ_{ν=1}^k nν
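A short sketch (my own, with made-up counts) comparing the two estimators using the closed-form expressions above:

    # Sketch: MAP vs. ML estimate of a categorical distribution with a Dirichlet prior.
    import numpy as np

    alpha = np.array([2.0, 2.0, 2.0])     # Dirichlet prior parameters
    n = np.array([5, 0, 3])               # observed counts n_j

    q_ml = n / n.sum()
    q_map = (n + alpha - 1) / (n + alpha - 1).sum()
    print(q_ml, q_map)                    # the prior keeps the unobserved outcome away from 0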


Page 62: Chapter 02: Probability Theory

Summary


Page 63: Chapter 02: Probability Theory

Summary

basic concepts of probability theory

• random events, random variables

• probability distributions, density functions

• joint, marginal, conditional distributions/probabilities

standard distribution families

• categorical

• uniform (univariate/multivariate)

• Gaussian (univariate/multivariate)

• gamma and inverse gamma

• Wishart and inverse Wishart

• Dirichlet


Page 64: Chapter 02: Probability Theory

Summary

maximum likelihood estimators

Bayesian inference

• Bayesian analysis, priors, posteriors

• conjugate priors

• non-informative priors

• maximum a posteriori estimators
