
Chapter 02: Probability Theory


Page 1: Chapter 02: Probability Theory

LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 02: Probability Theory

Dr. Martin Lauer

University of Freiburg, Machine Learning Lab

Karlsruhe Institute of Technology, Institute of Measurement and Control Systems


Page 2: Chapter 02: Probability Theory

References for this chapter

Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 2, Springer, 2006

Christian P. Robert, The Bayesian Choice, Springer, 1994

Robb J. Muirhead, Aspects of Multivariate Statistical Theory, Wiley, 1982


Page 3: Chapter 02: Probability Theory

Basic Concepts


Page 4: Chapter 02: Probability Theory

Probability theory

Probability theory deals with:

random experiments

random events

random variables

models to describe the occurrence of random events


Page 5: Chapter 02: Probability Theory

Random experiments

A random experiment is an experiment with an unknown outcome, e.g.

rolling a die

sensing physical values, e.g. temperature, humidity, magnetic field strength, voltage, ...

counting cars on a road segment

The set of all possible outcomes of a random experiment constitutes the sample space Ω


Page 6: Chapter 02: Probability Theory

Random events

A random event is a set of possible outcomes of a random experiment, i.e. A ⊆ Ω. E.g.

rolling a die yields a “6”:
Ω = {1, 2, 3, 4, 5, 6}, A = {6}

it is raining (or not) during the day at a certain place:
Ω = {raining, not raining}, A = {raining}

an observed train runs with a velocity between 100 km/h and 200 km/h:
Ω = ℝ, A = [100, 200]

a fruit which is put on the scales at a supermarket is an apple or a pear:
Ω = {strawberry, pear, peach, apple, banana, cherry}, A = {apple, pear}


Page 7: Chapter 02: Probability Theory

Random variables

Precise: Real-valued functions on Ω are called random variables.

Intuitive: Random variables are variables whose value is the result of a random experiment

Examples:

the present number of inhabitants of Freiburg: Ω = ℕ0, X(ω) = ω

the gray level at a certain position of a camera image: Ω = {0, 1, . . . , 255}, X(ω) = ω

the index in an enumeration of fruit: Ω = {strawberry, pear, peach, apple, banana, cherry},
X(ω) = 1 if ω = strawberry, 2 if ω = pear, 3 if ω = peach, ...


Page 8: Chapter 02: Probability Theory

Random variables

Functions of random variables are random variables themselves. E.g.

the average gray level of all pixels of a camera image

Z(ω) = f(X(ω)) + g(Y (ω))

Z(ω) = f(X(ω), Y (ω))

Notation: the dependency of X on ω is often not written, i.e. we write X instead of X(ω).


Page 9: Chapter 02: Probability Theory

Probability distributions

How can we describe the occurrence of random events?

A probability distribution is a function f that assigns a real number to each measurable subset of the sample space Ω and that meets the following axioms

f(A) ≥ 0 for all sets A ⊆ Ω

f(Ω) = 1

f(A ∪ B) = f(A) + f(B) for all disjoint sets A,B ⊆ Ω

f(A) is called the probability of A.

What does “measurable set” mean?

if Ω is finite or countable: every subset is measurable

if Ω = ℝⁿ: only those subsets A ⊆ Ω are measurable for which the integral ∫_A dx is defined (all subsets of practical relevance)


Page 10: Chapter 02: Probability Theory

Probability distributions

The definition formally specifies which functions can be used as probability distributions and which cannot

The definition does not specify which function is useful for a certain random experiment or which one is similar to observed frequencies

Examples (raining/not raining experiment):

f : ∅ ↦ 0, {raining} ↦ 0.3, {not raining} ↦ 0.7, {raining, not raining} ↦ 1

g : ∅ ↦ 0, {raining} ↦ 0.6, {not raining} ↦ 0.4, {raining, not raining} ↦ 1

h : ∅ ↦ 0, {raining} ↦ 1, {not raining} ↦ 0, {raining, not raining} ↦ 1
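To make the axioms concrete, here is a minimal Python sketch (my own, not part of the original slides) that represents each candidate distribution f, g, h by its values on the elementary outcomes and checks the axioms on all events of this finite sample space; all helper names are mine.

    # Minimal sketch, assuming Python 3: check the three axioms for the
    # raining / not-raining example; f, g, h are given by their outcome probabilities.
    from itertools import chain, combinations

    omega = ("raining", "not raining")
    candidates = {"f": {"raining": 0.3, "not raining": 0.7},
                  "g": {"raining": 0.6, "not raining": 0.4},
                  "h": {"raining": 1.0, "not raining": 0.0}}

    def events(space):
        # all subsets of a finite sample space
        return chain.from_iterable(combinations(space, r) for r in range(len(space) + 1))

    def prob(p, event):
        # probability of an event = sum of the outcome probabilities it contains
        return sum(p[w] for w in event)

    for name, p in candidates.items():
        assert all(prob(p, A) >= 0 for A in events(omega))      # f(A) >= 0
        assert abs(prob(p, omega) - 1.0) < 1e-12                # f(Omega) = 1
        # additivity for disjoint events holds because prob() sums outcome probabilities
        print(name, "satisfies the axioms")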


Page 11: Chapter 02: Probability Theory

Probability mass functions

How can we represent probability distributions efficiently?

probability mass function (pmf)

only for countable Ω

p : Ω → R≥0

p assigns to each ω ∈ Ω the probability of ω. Examples:

p1 : raining ↦ 0.3, not raining ↦ 0.7

p2 : strawberry ↦ 0.1, pear ↦ 0.2, peach ↦ 0.1, apple ↦ 0.3, banana ↦ 0.2, cherry ↦ 0.1

p3 : i ∈ ℕ0 ↦ 0.7 · 0.3^i
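As a quick numerical check (my own sketch, not from the slides), p3 is indeed normalised: the geometric series Σ_{i≥0} 0.7 · 0.3^i equals 0.7/(1 − 0.3) = 1.

    # Sketch: truncated sum of p3(i) = 0.7 * 0.3**i over i = 0, 1, 2, ...
    p3 = lambda i: 0.7 * 0.3 ** i
    print(sum(p3(i) for i in range(200)))   # ~1.0 (geometric series)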


Page 12: Chapter 02: Probability Theory

Probability density functions

probability density function (pdf)

only for Ω = Rn

p : Ω → R≥0

p assigns to each ω ∈ Ω a value so that ∫_A p(x) dx is the probability of the event A ⊆ Ω

Example:

p(x) = x/6 − 1/3 if 2 ≤ x ≤ 5,  −x/2 + 3 if 5 < x ≤ 6,  0 otherwise

[Plot: p(x) rises linearly from 0 at x = 2 to 0.5 at x = 5 and falls linearly back to 0 at x = 6]

P(X ∈ [3, 4]) = ∫_3^4 p(x) dx = 1/4

P(X < 3) = ∫_−∞^3 p(x) dx = 1/12
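These probabilities can be verified numerically; the following sketch (my own, assuming scipy is available) integrates the piecewise density with scipy.integrate.quad.

    # Sketch: numerical check of the probabilities for the piecewise linear density.
    from scipy.integrate import quad

    def p(x):
        if 2 <= x <= 5:
            return x / 6 - 1 / 3
        if 5 < x <= 6:
            return -x / 2 + 3
        return 0.0

    print(quad(p, 3, 4)[0])                          # P(X in [3,4]) ~ 0.25
    print(quad(p, -10, 3)[0])                        # P(X < 3)      ~ 0.0833 = 1/12
    print(quad(p, -10, 10, points=[2, 5, 6])[0])     # total probability mass ~ 1.0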


Page 13: Chapter 02: Probability Theory

Probability density functions

If two random variables depend on each other, how do their probability functions depend on each other?

Examples:

Y = X + c (c ∈ ℝ constant)

∫_a^b pY(y) dy = P(a < Y < b) = P(a < X + c < b) = P(a − c < X < b − c) = ∫_{a−c}^{b−c} pX(x) dx = ∫_a^b pX(y − c) dy   (substituting x = y − c)

It follows: pY(y) = pX(y − c)



Page 15: Chapter 02: Probability Theory

Probability density functions

Y = s · X (s > 0 constant)

∫_a^b pY(y) dy = P(a < Y < b) = P(a < sX < b) = P(a/s < X < b/s) = ∫_{a/s}^{b/s} pX(x) dx = ∫_a^b pX(y/s) · (1/s) dy   (substituting x = y/s)

It follows: pY(y) = (1/s) · pX(y/s)

Y = 1 if X ≥ 0, 0 otherwise

pY(0) = ∫_−∞^0 pX(x) dx,   pY(1) = ∫_0^∞ pX(x) dx
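A quick Monte-Carlo illustration of the scaling rule (my own sketch, assuming numpy; the standard normal is just a convenient choice for pX): sample X, set Y = s·X, and compare a histogram of Y with (1/s)·pX(y/s).

    # Sketch: empirical check of pY(y) = (1/s) * pX(y/s) for Y = s*X.
    import numpy as np

    rng = np.random.default_rng(0)
    s = 2.0
    x = rng.standard_normal(100_000)     # X ~ N(0, 1)
    y = s * x

    phi = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)   # pdf of N(0, 1)
    hist, edges = np.histogram(y, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - phi(centers / s) / s)))         # small deviation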


Page 16: Chapter 02: Probability Theory

Probability density functions

Y = 1/X  (if pX(x) = 0 for all x ≤ 0)

→ exercises


Page 17: Chapter 02: Probability Theory

Probability density functions

Z = X + Y

Not that easy! Example:

pX(x) = 1/3 if x = 1, 2/3 if x = 2        pY(y) = 1/3 if y = 1, 2/3 if y = 2

Possible distributions for Z:

pZ^(I)(z) = 1/9 if z = 2, 4/9 if z = 3, 4/9 if z = 4

pZ^(II)(z) = 1/3 if z = 2, 0 if z = 3, 2/3 if z = 4

The distribution of Z depends on the relationship between X and Y !
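The two candidate distributions can be reproduced from two different joint distributions that share the same marginals; in this sketch (my own, not from the slides) pZ^(I) arises when X and Y are independent and pZ^(II) when Y = X with probability 1.

    # Sketch: the distribution of Z = X + Y is determined by the joint of (X, Y).
    from collections import defaultdict

    # case I: X and Y independent -> joint is the product of the marginals
    joint_I = {(x, y): px * py
               for x, px in [(1, 1/3), (2, 2/3)]
               for y, py in [(1, 1/3), (2, 2/3)]}
    # case II: Y = X with probability 1 (same marginals, fully dependent)
    joint_II = {(1, 1): 1/3, (2, 2): 2/3}

    def dist_of_sum(joint):
        pz = defaultdict(float)
        for (x, y), p in joint.items():
            pz[x + y] += p
        return dict(pz)

    print(dist_of_sum(joint_I))    # {2: 1/9, 3: 4/9, 4: 4/9}
    print(dist_of_sum(joint_II))   # {2: 1/3, 4: 2/3}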


Page 18: Chapter 02: Probability Theory

Joint probabilities

A joint probability density function for two random variables X and Y is a probability density function for the pair (X, Y).

pX,Y : ℝ² → ℝ≥0

∫_−∞^∞ ∫_−∞^∞ pX,Y(x, y) dx dy = 1

if A ⊆ ℝ², then ∫_A pX,Y(x, y) d(x, y) is the joint probability of the event A

We can define joint probability mass functions analogously.


Page 19: Chapter 02: Probability Theory

Joint probabilities

Example:

pX,Y(x, y) = 1 if 0 < x < 2 and 0 < y < x/2,  0 otherwise

pX,Y(x, y) = 1/12 if (x, y) = (1, 1),  1/4 if (x, y) = (1, 2),  5/12 if (x, y) = (2, 1),  1/4 if (x, y) = (2, 2)


Page 20: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

How do pX,Y and pX, pY depend on each other?

∫_a^b ∫_−∞^∞ pX,Y(x, y) dy dx = P(a < X < b, −∞ < Y < ∞) = ∫_a^b pX(x) dx

Hence:

pX(x) = ∫_−∞^∞ pX,Y(x, y) dy   (for Y ∈ ℝ)

pY(y) = ∫_−∞^∞ pX,Y(x, y) dx   (for X ∈ ℝ)

pX(x) = Σ_{y∈ℕ0} pX,Y(x, y)   (for Y ∈ ℕ0)

pY(y) = Σ_{x∈ℕ0} pX,Y(x, y)   (for X ∈ ℕ0)

These equations are called the marginalisation of a random variable.

Page 21: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Example:

pX,Y(x, y) = 1/12 if (x, y) = (1, 1),  1/4 if (x, y) = (1, 2),  5/12 if (x, y) = (2, 1),  1/4 if (x, y) = (2, 2)

pX(x) = 1/3 if x = 1,  2/3 if x = 2

pY(y) = 1/2 if y = 1,  1/2 if y = 2

        y = 1   y = 2   Σ
x = 1   1/12    1/4     1/3
x = 2   5/12    1/4     2/3
Σ       1/2     1/2     1
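The marginalisation of this table amounts to row and column sums; a minimal numpy sketch (my own, not from the slides):

    # Sketch: marginalising the 2x2 joint pmf above by summing rows and columns.
    import numpy as np

    p_xy = np.array([[1/12, 1/4],    # row x = 1
                     [5/12, 1/4]])   # row x = 2; columns y = 1, 2

    p_x = p_xy.sum(axis=1)   # marginal of X: [1/3, 2/3]
    p_y = p_xy.sum(axis=0)   # marginal of Y: [1/2, 1/2]
    print(p_x, p_y, p_xy.sum())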


Page 22: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Example:

pX,Y(x, y) = 1 if 0 < x < 2 and 0 < y < x/2,  0 otherwise

pX(x) = ∫_0^{x/2} 1 dy = x/2 if 0 < x < 2,  0 otherwise

pY(y) = ∫_{2y}^2 1 dx = 2 − 2y if 0 < y < 1,  0 otherwise

[Figure: the triangular support of pX,Y(x, y) in the (x, y)-plane together with the marginal densities pX(x) and pY(y)]


Page 23: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Z = X + Y

How does pZ depend on pX,Y ?

discrete case X, Y, Z ∈ ℕ0:

pZ(z) = P(Z = z) = P(X + Y = z) = Σ_{x∈ℕ0} P(X = x, X + Y = z) = Σ_{x∈ℕ0} P(X = x, Y = z − x) = Σ_{x∈ℕ0} pX,Y(x, z − x)

continuous case X, Y, Z ∈ ℝ:

pZ(z) = ∫_−∞^∞ pX,Y(x, z − x) dx


Page 24: Chapter 02: Probability Theory

Joint probabilities and marginal probabilities

Derivation of the continuous case:

∫_a^b pZ(z) dz = P(a < X + Y < b)
              = P(−∞ < X < ∞, a − X < Y < b − X)
              = ∫_−∞^∞ ∫_{a−x}^{b−x} pX,Y(x, y) dy dx
              = ∫_−∞^∞ ∫_a^b pX,Y(x, z − x) dz dx   (substituting z = y + x)
              = ∫_a^b ∫_−∞^∞ pX,Y(x, z − x) dx dz


Page 25: Chapter 02: Probability Theory

Conditional distributions

What is the distribution of Y if we know that X takes a certain value x?

A conditional distribution of Y given X is defined as

pY|X(y|x) = pX,Y(x, y) / pX(x)

Example:

pX,Y(x, y) = 1 if 0 < x < 2 and 0 < y < x/2,  0 otherwise

pX(x) = x/2 if 0 < x < 2,  0 otherwise

pY|X(y|x) = 2/x if 0 < y < x/2,  0 otherwise

[Figure: the joint density pX,Y(x, y) and the conditional density pY|X(y|x)]
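For a discrete joint distribution the same definition reduces to dividing the joint table by the marginal; a small sketch (my own, reusing the 2×2 example from the marginalisation slides):

    # Sketch: conditional pmf pY|X(y|x) = pX,Y(x, y) / pX(x) for the 2x2 example.
    import numpy as np

    p_xy = np.array([[1/12, 1/4],    # rows: x = 1, 2; columns: y = 1, 2
                     [5/12, 1/4]])
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y_given_x = p_xy / p_x
    print(p_y_given_x)               # each row sums to 1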


Page 26: Chapter 02: Probability Theory

Joint, marginal, and conditional distributions

joint distribution:

pX,Y(x, y)

marginal distribution:

pY(y) = ∫_−∞^∞ pX,Y(x, y) dx

conditional distribution:

pY|X(y|x) = pX,Y(x, y) / pX(x)

[Figure: the joint density pX,Y(x, y), the marginal density pY(y), and the conditional density pY|X(y|x)]


Page 27: Chapter 02: Probability Theory

Statistical independence

Statistical independence of two random variables means that knowing the value of one does not allow drawing conclusions about the value of the other.

Definition: X, Y are independent if for all values of x and y it holds that

pX,Y(x, y) = pX(x) · pY(y)

If X and Y are independent, then pX|Y(x|y) = pX(x) and pY|X(y|x) = pY(y)

Example:

[Figure: two example joint densities pX,Y(x, y); in one of them X, Y are not independent, in the other X, Y are independent]


Page 28: Chapter 02: Probability Theory

Statistical independence

Why is independence so important?

simplification of formulas

neglecting the influence of other variables

if variables are not independent, we can use knowledge of one variable to draw conclusions about the other

Important:

the concept of independence can be extended to 3, 4, ... random variables or to groups of random variables

statistical independence is not transitive.
Example: X, Y, Z ∈ {0, 1} with

pX,Y,Z(x, y, z) = 1/4 if x + y + z ∈ {1, 3},  0 otherwise
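A short check of this example (my own sketch, not from the slides): every pair of the three variables is independent, yet the three variables are not jointly independent.

    # Sketch: pairwise independence without joint independence for the example above.
    from itertools import product

    p = {t: (0.25 if sum(t) in (1, 3) else 0.0) for t in product((0, 1), repeat=3)}

    def marg1(i, a):
        return sum(q for t, q in p.items() if t[i] == a)

    def marg2(i, a, j, b):
        return sum(q for t, q in p.items() if t[i] == a and t[j] == b)

    pairs_ok = all(abs(marg2(i, a, j, b) - marg1(i, a) * marg1(j, b)) < 1e-12
                   for i, j in [(0, 1), (0, 2), (1, 2)]
                   for a in (0, 1) for b in (0, 1))
    print(pairs_ok)                                               # True: pairwise independent
    print(p[(1, 1, 1)], marg1(0, 1) * marg1(1, 1) * marg1(2, 1))  # 0.25 vs 0.125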


Page 29: Chapter 02: Probability Theory

Remarks on notation

probabilities

• complete (correct) notation: P (X = x), pX(x)

• simplified notation: P (X), p(X)

• for real-valued variables P(X) denotes the probability density function of X

pdf and pmf

• precise: pdf ≠ pmf

• convenient: unified notation, use ∫_a^b f(x) dx and Σ_{x=a}^b f(x) interchangeably


Page 30: Chapter 02: Probability Theory

Standard Distribution Families


Page 31: Chapter 02: Probability Theory

Standard distribution families

Every function that meets the axioms can serve as a pdf/pmf, but

Some families of functions have proved to be very useful for many purposes.

Relevant for us:

categorical distribution (discrete)

uniform distribution (univariate/multivariate real valued)

Gaussian/normal distribution (univariate/multivariate real-valued)

gamma distribution/inverse gamma distribution (real-valued)

Dirichlet distribution (multivariate real-valued)

Wishart distribution/Inverse Wishart distribution (multivariate real-valued)


Page 32: Chapter 02: Probability Theory

Categorical distribution

typically used to describe the value of a discrete random variable with finite range

parameters: q1, . . . , qN ∈ ℝ≥0 so that Σ_{i=1}^N qi = 1; qi describes the probability of outcome i.

pmf: p(i) = qi if 1 ≤ i ≤ N,  0 otherwise

examples:

• rolling a die

• election

• fruit on the scales at a supermarket

notation: X ∼ C(q1, . . . , qN)

a categorical distribution for a random variable which takes only the values 0 and 1 is called a Bernoulli distribution.


Page 33: Chapter 02: Probability Theory

Uniform distribution

univariate case:

• only (real) numbers in an interval [a, b] occur

• all numbers within the interval have the same probability

• parameters: a, b ∈ ℝ, a < b

• pdf: p(x) = 1/(b − a) if a ≤ x ≤ b,  0 otherwise

• example: quantisation error of an analog/digital converter

• notation: X ∼ U(a, b)

[Plot: pdf of a univariate uniform distribution]


Page 34: Chapter 02: Probability Theory

Uniform distribution

multivariate case:

• only vectors in A ⊆ ℝⁿ occur

• all vectors within A have the same probability

• parameters: A ⊆ ℝⁿ measurable

• pdf: p(x⃗) = 1 / (∫_A 1 dy⃗) if x⃗ ∈ A,  0 otherwise

• notation: X ∼ U(A)

[Plot: pdf of a multivariate uniform distribution over a region in the (x1, x2)-plane]


Page 35: Chapter 02: Probability Theory

Normal (Gaussian) distribution

most important distribution for real-valued random variables

for a random variable that takes values around a “central” value

univariate case:

• parameters: µ ∈ ℝ, σ² ∈ ℝ>0

• pdf: 1/(√(2π)·√σ²) · e^(−½·(x−µ)²/σ²)

• notation: X ∼ N(µ, σ²)

[Plot: pdf of a univariate Gaussian distribution]


Page 36: Chapter 02: Probability Theory

Normal (Gaussian) distribution

multivariate case:

• parameters: µ⃗ ∈ ℝⁿ, Σ ∈ ℝ^(n×n) symmetric, positive definite

• pdf: 1/((√(2π))ⁿ · √det(Σ)) · e^(−½·(x⃗−µ⃗)ᵀ Σ⁻¹ (x⃗−µ⃗))

• notation: X ∼ N(µ⃗, Σ)

[Plot: pdf of a bivariate Gaussian distribution]
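The pdf can be evaluated directly from this formula; a minimal numpy sketch (my own, with an arbitrary example µ⃗ and Σ):

    # Sketch: evaluating the multivariate Gaussian pdf from the formula above.
    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        n = len(mu)
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))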


Page 37: Chapter 02: Probability Theory

Gamma distribution

a distribution that assigns nonzero probability to positive numbers.

parameters: a > 0, b > 0

pdf: p(x) = (b^a / Γ(a)) · x^(a−1) · e^(−bx) if x ≥ 0,  0 otherwise

notation: X ∼ Γ(a, b)

[Plot: pdfs of gamma distributions]


Page 38: Chapter 02: Probability Theory

Inverse gamma distribution

if X ∼ Γ(a, b), then Y = 1/X is said to be distributed w.r.t. Γ⁻¹(a, b)

parameters: a > 0, b > 0

pdf: p(y) = (b^a / Γ(a)) · y^(−a−1) · e^(−b/y) if y ≥ 0,  0 otherwise

the inverse gamma distribution is used to model distributions over the parameter σ² of Gaussian distributions


Page 39: Chapter 02: Probability Theory

Wishart distribution

the multivariate extension of the gamma distribution

models a distribution over symmetric, positive definite matrices

parameters: W ∈ ℝ^(n×n) symmetric, positive definite, ν > n − 1

pdf: p(A) = 1/Z(W, ν) · det(A)^((ν−n−1)/2) · e^(−½·trace(W⁻¹A))

with Z(W, ν) = det(W)^(ν/2) · 2^(νn/2) · π^(n(n−1)/4) · ∏_{i=1}^n Γ((ν + 1 − i)/2)

notation: A ∼ W(W, ν)

the Wishart distribution for n = 1 is equal to the gamma distribution.


Page 40: Chapter 02: Probability Theory

Inverse Wishart distribution

if A ∼ W(W, ν), then B = A⁻¹ is said to be distributed w.r.t. W⁻¹(W, ν)

the inverse Wishart distribution is used to model distributions over the parameter Σ of multivariate Gaussian distributions

parameters: W ∈ ℝ^(n×n) symmetric, positive definite, ν > n − 1

pdf: p(B) = 1/Z(W, ν) · det(B)^(−(ν+n+1)/2) · e^(−½·trace(W⁻¹B⁻¹))

with Z(W, ν) = det(W)^(ν/2) · 2^(νn/2) · π^(n(n−1)/4) · ∏_{i=1}^n Γ((ν + 1 − i)/2)

the inverse Wishart distribution for n = 1 is equal to the inverse gamma distribution.


Page 41: Chapter 02: Probability Theory

Dirichlet distribution

a distribution over a simplex of numbers

models probabilities of a categorical distribution

parameters: α1, . . . , αn ≥ 0

pdf: p(q1, . . . , qn) = (Γ(Σ_{i=1}^n αi) / ∏_{i=1}^n Γ(αi)) · ∏_{i=1}^n qi^(αi−1)

notation: (q1, . . . , qn) ∼ D(α1, . . . , αn)

the Dirichlet distribution for n = 2 is called Beta distribution.

[Plot: Dirichlet densities for n = 2 and n = 3]


Page 42: Chapter 02: Probability Theory

Maximum-Likelihood Estimators


Page 43: Chapter 02: Probability Theory

Density estimators

How do we get meaningful probability distributions for real processes?

using background knowledge to find the appropriate distribution family

using histograms to compare their shape with the shape of possible distribution families

using estimators to find the “best parameters” for a sample


Examples:

rolling a die

• distribution family: categorical with n = 6 possible outcomes

• parameters qi: choose w.r.t. background knowledge and the assumption of an ideal die: qi = 1/6

• parameters qi: choose w.r.t. a random sample o1, . . . , oN as qi = |{j | oj = i}| / N


Page 44: Chapter 02: Probability Theory

Density estimators

Examples:

data from a certain sensor

which density family would be suitable?

how do we choose the parameters of that density family to achieve optimal results?

optimal fit for a normal distribution

[Figure: histogram of the sensor data together with the fitted normal density]


Page 45: Chapter 02: Probability Theory

Density estimators

problem specification

• given a sample x1, . . . , xN from an unknown distribution

• assuming a density family fθ parameterized by parameters θ

• find the value of θ so that the density fits the sample as well as possible

basic idea:

search θ that maximizes the likelihood p(x1, . . . , xN|θ)

assumption to simplify calculation: x1, . . . , xN are independent (i.i.d.)

p(x1, . . . , xN|θ) = ∏_{i=1}^N p(xi|θ) = ∏_{i=1}^N fθ(xi) → maximum

An estimator that maximizes the likelihood is called a maximum likelihood estimator (ML)

trick: often it is convenient to maximize log p(x1, . . . , xN|θ) instead of p(x1, . . . , xN|θ)
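For a univariate Gaussian the maximum of the (log-)likelihood has the well-known closed form µ̂ = sample mean and σ̂² = mean squared deviation; the following sketch (my own, with toy data) simply evaluates it.

    # Sketch: maximum-likelihood estimate of a univariate Gaussian from a toy sample.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=0.5, size=1000)           # toy data

    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()                  # note: 1/N, not 1/(N-1)
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigma2_hat) + (x - mu_hat) ** 2 / sigma2_hat)
    print(mu_hat, sigma2_hat, log_lik)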

Page 46: Chapter 02: Probability Theory

Maximum likelihood estimators

Example:

uniform distribution → blackboard

normal distribution → blackboard

categorical distribution → blackboard


Page 47: Chapter 02: Probability Theory

Maximum likelihood estimators

Maximum likelihood can also be used for conditional distributions

Example:

x1, . . . , xN given, constant

yi ∼ N (a · xi + b, σ2) with unknown parameters a, b, σ2

maximum-likelihood estimator:

maximize over a, b, σ²:   ∏_{i=1}^N 1/(√(2π)·√σ²) · e^(−½·(yi−(a·xi+b))²/σ²)

yields

( Σ_{i=1}^N xi²    Σ_{i=1}^N xi )   ( a )     ( Σ_{i=1}^N xi·yi )
( Σ_{i=1}^N xi     N            ) · ( b )  =  ( Σ_{i=1}^N yi    )

σ² = (1/N) · Σ_{i=1}^N (yi − (a·xi + b))²

→ linear regression
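The 2×2 system above can be solved directly; a small numpy sketch (my own, with toy data) that also evaluates the ML estimate of σ²:

    # Sketch: solving the ML normal equations for a, b and estimating sigma^2.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 5, 50)
    y = 1.5 * x - 0.7 + rng.normal(scale=0.3, size=x.size)   # toy data

    A = np.array([[np.sum(x ** 2), np.sum(x)],
                  [np.sum(x),      x.size   ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a, b = np.linalg.solve(A, rhs)
    sigma2 = np.mean((y - (a * x + b)) ** 2)
    print(a, b, sigma2)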


Page 48: Chapter 02: Probability Theory

Bayesian Inference


Page 49: Chapter 02: Probability Theory

Uncertainty in estimates

maximum likelihood yields a single parameter estimate, e.g. in linear regression one value for a and b, which can be used to predict y(x) for new values of x.

shortcoming: the ML estimate is computed from random data, i.e. it is random itself. How can we model the randomness in the parameter estimate and take it into account for prediction?


Page 50: Chapter 02: Probability Theory

Bayesian inference

modeling a distribution of parameters

p(θ | D),   where θ denotes the parameters and D the sample

Bayes’ theorem:

p(θ|D) = p(D|θ) · p(θ) / p(D) ∝ p(D|θ) · p(θ)

denominator is independent of θ, i.e. it is constant w.r.t. θ.


Page 51: Chapter 02: Probability Theory

Bayesian inference

p(θ|D) ∝ p(D|θ) · p(θ)

p(θ) is called the prior distribution.
Which parameters do we expect without having seen the data?

p(D|θ) is the data likelihood.
How well do the data fit a certain parameter?

p(θ|D) is called the posterior distribution.
Which parameters can we expect knowing the data?


Page 52: Chapter 02: Probability Theory

Bayesian inference

Example: Bernoulli experiment (e.g. tossing a coin) with unknown parameter θ

prior: we want to model that we have no idea about θ: p(θ) = 1 for all θ ∈ [0, 1]

assume we repeated the experiment 10 times with 3 successes

p(θ|D1) ∝ θ³ · (1 − θ)⁷

in another 90 trials we obtained 27 successes

p(θ|D1, D2) ∝ θ²⁷ · (1 − θ)⁶³ · θ³ · (1 − θ)⁷ = θ³⁰ · (1 − θ)⁷⁰

ML-estimate would be θ = 0.3

[Plot: posterior densities of θ after the first 10 and after all 100 trials]
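The posterior can be evaluated on a grid without any special machinery; a minimal sketch (my own, not from the slides):

    # Sketch: unnormalised posterior theta^30 * (1-theta)^70 on a grid, normalised numerically.
    import numpy as np

    theta = np.linspace(0.0, 1.0, 1001)
    post = theta ** 30 * (1 - theta) ** 70            # flat prior times likelihood
    post /= post.sum() * (theta[1] - theta[0])        # normalise to a density
    print(theta[np.argmax(post)])                     # posterior mode ~ 0.3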


Page 53: Chapter 02: Probability Theory

Bayesian inference

which are suitable prior/posterior distributions?

p(D|θ) · p(θ) should be a tractable distribution

p(θ) and p(θ|D) should belong to the same distribution family

→ conjugate prior


Page 54: Chapter 02: Probability Theory

Conjugate priors

Examples:

categorical distribution

• data likelihood: ∏_{i=1}^N qxi = ∏_{j=1}^k qj^(nj)  with nj = |{i | xi = j}|

• which density function over (q1, . . . , qk) looks similar?

• try the Dirichlet distribution as prior: (q1, . . . , qk) ∼ D(α1, . . . , αk)

p(q1, . . . , qk) ∝ ∏_{j=1}^k qj^(αj−1)

p(q1, . . . , qk|D) ∝ ∏_{j=1}^k qj^(nj) (likelihood) · ∏_{j=1}^k qj^(αj−1) (prior) = ∏_{j=1}^k qj^(αj+nj−1)

Hence, (q1, . . . , qk)|D ∼ D(α1 + n1, . . . , αk + nk)
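In code the conjugate update is just adding the observed counts to the prior parameters; a small sketch (my own, with made-up counts):

    # Sketch: Dirichlet-categorical conjugate update, alpha_post = alpha + counts.
    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0])             # prior D(alpha_1, ..., alpha_k)
    data = np.array([0, 2, 2, 1, 2, 0, 2])        # observed categories (0-based)
    counts = np.bincount(data, minlength=alpha.size)
    alpha_post = alpha + counts                   # posterior D(alpha + n)
    print(alpha_post, alpha_post / alpha_post.sum())   # parameters and posterior mean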

Page 55: Chapter 02: Probability Theory

Conjugate priors

parameter µ of a univariate Gaussian distribution

• data likelihood: ∏_{i=1}^N 1/(√(2π)·√σ²) · e^(−½·(xi−µ)²/σ²)

• which density function over µ looks similar?

• try a Gaussian distribution as prior: µ ∼ N(m, ρ²)

p(µ) = 1/(√(2π)·√ρ²) · e^(−½·(µ−m)²/ρ²)

→ blackboard


Page 56: Chapter 02: Probability Theory

Conjugate priors

Table of useful conjugate priors

distribution | parameter | conjugate prior | posterior
categorical C(q1, . . . , qk) | q1, . . . , qk | Dirichlet D(α1, . . . , αk) | D(α1 + n1, . . . , αk + nk)
univ. Gaussian N(µ, σ²) | µ | Gaussian N(m, ρ²) | N((σ²m + ρ²x)/(σ² + ρ²), σ²ρ²/(σ² + ρ²))
univ. Gaussian N(µ, σ²) | σ² | Inverse gamma Γ⁻¹(a, b) | Γ⁻¹(a + 1/2, b + (x − µ)²/2)
multivariate Gaussian N(µ⃗, Σ) | µ⃗ | Gaussian N(m⃗, Ψ) | N((Σ⁻¹ + Ψ⁻¹)⁻¹(Σ⁻¹x⃗ + Ψ⁻¹m⃗), (Σ⁻¹ + Ψ⁻¹)⁻¹)
multivariate Gaussian N(µ⃗, Σ) | Σ | Inv. Wishart W⁻¹(W, ν) | W⁻¹((W⁻¹ + (x⃗ − µ⃗)(x⃗ − µ⃗)ᵀ)⁻¹, ν + 1)
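As an illustration of the second row of the table (my own sketch, for a single observation x): the posterior of µ is again Gaussian with the stated mean and variance.

    # Sketch: conjugate update for the mean of a univariate Gaussian, one observation x.
    def gaussian_mean_posterior(x, sigma2, m, rho2):
        post_mean = (sigma2 * m + rho2 * x) / (sigma2 + rho2)
        post_var = (sigma2 * rho2) / (sigma2 + rho2)
        return post_mean, post_var

    print(gaussian_mean_posterior(x=1.2, sigma2=0.5, m=0.0, rho2=2.0))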


Page 57: Chapter 02: Probability Theory

Conjugate priors

Special case: conjugate distribution for the pair (µ, σ²) / (µ⃗, Σ) of a Gaussian. In the multivariate case, the conjugate prior is:

Σ ∼ W⁻¹(W, ν)

µ⃗ | Σ ∼ N(m⃗, (1/η)·Σ)

The posterior for observed data x⃗1, . . . , x⃗k with x̄ = (1/k)·Σ_{i=1}^k x⃗i and S = (1/k)·Σ_{i=1}^k (x⃗i − x̄)(x⃗i − x̄)ᵀ is:

Σ | x⃗1, . . . , x⃗k ∼ W⁻¹((W⁻¹ + k·S + (k·η/(k + η))·(x̄ − m⃗)(x̄ − m⃗)ᵀ)⁻¹, ν + k)

µ⃗ | Σ, x⃗1, . . . , x⃗k ∼ N((k/(k + η))·x̄ + (η/(k + η))·m⃗, (1/(k + η))·Σ)


Page 58: Chapter 02: Probability Theory

Priors

improper priors: a prior density p for which ∫ p(θ) dθ ≠ 1

non-informative priors: a prior that “is neutral”. Several definitions:

• a prior with uniform density over all parameter values

• a prior for which argmax_θ p(θ|D) = argmax_θ p(D|θ)

• Jeffreys prior


Page 59: Chapter 02: Probability Theory

Jeffreys prior

what is a non-informative prior?

• it should not prefer one parameter value over another

• it should not depend on the parametrization of a distribution

Jeffreys prior (Harold Jeffreys, 1946)

• based on information theory

• considers the way a parameter influences the estimate

Jeffreys prior is defined as:

p(θ⃗) ∝ √det(I(θ⃗))   with I(θ⃗) the Fisher information

I(θ⃗)i,j = ∫_−∞^∞ (−∂²/(∂θi ∂θj) log p(x|θ⃗)) · p(x|θ⃗) dx


Page 60: Chapter 02: Probability Theory

Jeffreys prior

Example: x ∼ N(µ, s). Determine the Jeffreys prior w.r.t. (µ, s) → blackboard

Example: categorical distribution. Non-informative priors:

• D(1, . . . , 1) yields a uniform distribution over all values

• D(0, . . . , 0) yields a prior so that argmax_θ p(θ|D) = argmax_θ p(D|θ)

• D(1/2, . . . , 1/2) yields the Jeffreys prior


Page 61: Chapter 02: Probability Theory

Maximum a posteriori estimator

Idea: use Bayesian analysis to obtain a single parameter estimate: the maximum a posteriori estimator (MAP)

θ̂ = argmax_θ P(θ|D) = argmax_θ (P(D|θ) · P(θ))

Example: estimating the parameters of a categorical distribution:

prior: D(α1, . . . , αk)

posterior: D(α1 + n1, . . . , αk + nk)

MAP estimator: qj^MAP = (nj + αj − 1) / Σ_{ν=1}^k (nν + αν − 1)

ML estimator: qj^ML = nj / Σ_{ν=1}^k nν
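A short sketch (my own, with made-up counts) comparing the two estimators using the closed-form expressions above:

    # Sketch: MAP vs. ML estimate of a categorical distribution with a Dirichlet prior.
    import numpy as np

    alpha = np.array([2.0, 2.0, 2.0])     # Dirichlet prior parameters
    n = np.array([5, 0, 3])               # observed counts n_j

    q_ml = n / n.sum()
    q_map = (n + alpha - 1) / (n + alpha - 1).sum()
    print(q_ml, q_map)                    # the prior keeps the unobserved outcome away from 0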


Page 62: Chapter 02: Probability Theory

Summary


Page 63: Chapter 02: Probability Theory

Summary

basic concepts of probability theory

• random events, random variables

• probability distributions, density functions

• joint, marginal, conditional distributions/probabilities

standard distribution families

• categorical

• uniform (univariate/multivariate)

• Gaussian (univariate/multivariate)

• gamma and inverse gamma

• Wishart and inverse Wishart

• Dirichlet


Page 64: Chapter 02: Probability Theory

Summary

maximum likelihood estimators

Bayesian inference

• Bayesian analysis, priors, posteriors

• conjugate priors

• non-informative priors

• maximum a posteriori estimators
