Inference in Gaussian and Hybrid Bayesian Networks

ICS 275B

Gaussian Distribution

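$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Represented as N(μ, σ²) or as a triple (p, μ, σ²), where p = 1.

[Figure: gaussian(x,0,1) and gaussian(x,1,1), plotted for x from -3 to 3]

[Figure: gaussian(x,0,1) and gaussian(x,0,2), plotted for x from -3 to 3]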
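A minimal sketch of the curves named in the plot legends. It assumes the third argument of gaussian(x, mu, sigma) is the standard deviation; the slides do not say whether it is the standard deviation or the variance.

```python
import math

def gaussian(x, mu, sigma):
    """Density of a univariate Gaussian at x (sigma assumed to be the standard deviation)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Evaluate the three curves on the grid shown on the plot axes.
xs = [-3, -2, -1, 0, 1, 2, 3]
standard = [gaussian(x, 0, 1) for x in xs]
shifted = [gaussian(x, 1, 1) for x in xs]   # same shape, peak moved to x = 1
wider = [gaussian(x, 0, 2) for x in xs]     # lower, flatter peak at x = 0
```

Changing mu only shifts the curve, while increasing sigma lowers and widens it.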

Multivariate Gaussian

Definition:

Let X1,…,Xn be a set of random variables. A multivariate Gaussian distribution over X1,…,Xn is parameterized by an n-dimensional mean vector μ and an n x n positive definite covariance matrix Σ. It defines a joint density via:

$$P(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
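The joint density can be evaluated directly from the mean vector and covariance matrix. The sketch below assumes NumPy is available; the function name mvn_density is only illustrative.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate Gaussian density with mean mu and covariance Sigma, evaluated at x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    n = mu.shape[0]
    diff = x - mu
    # Solve Sigma * v = diff instead of forming the inverse explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm)

mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
peak = mvn_density(mu, mu, Sigma)   # density at the mean of a standard 2-D Gaussian
```

The positive definiteness of Σ is what guarantees that the solve step and the determinant are well defined.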

Linear Gaussian Distribution

Definition:

Let Y be a continuous node with continuous parents X1,…,Xk. We say that Y has a linear Gaussian model if it can be described using parameters β0,…,βk and σ² such that:

P(y | x1,…,xk) = N(μy + β1x1 + … + βkxk ; σ²), with μy = β0

= N([μy, β1,…,βk], σ²)

Example (A → B):

A ~ N(μa, σa²)

B ~ N([μb, w], σb²), that is, B given A = a has mean μb + w·a

[Figure: linear Gaussian density of Y plotted against X, for X from -10 to 10]

Linear Gaussian Network

Definition

A linear Gaussian Bayesian network is a Bayesian network all of whose variables are continuous and all of whose CPTs are linear Gaussians.

Linear Gaussian BN ⇔ Multivariate Gaussian

=> A linear Gaussian BN is a compact representation of a multivariate Gaussian
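The correspondence between a linear Gaussian BN and a multivariate Gaussian can be sketched as follows. The node encoding (offset, parent weights, variance) and the function name are assumptions made for this illustration, and the nodes are assumed to be listed in topological order.

```python
import numpy as np

def lg_network_to_joint(nodes):
    """nodes: list of (b0, weights, var) in topological order;
    weights maps a parent's index to its linear weight."""
    n = len(nodes)
    mu = np.zeros(n)
    Sigma = np.zeros((n, n))
    for i, (b0, weights, var) in enumerate(nodes):
        w = np.zeros(n)
        for parent, weight in weights.items():
            w[parent] = weight
        mu[i] = b0 + w @ mu
        # Covariance with every earlier variable follows from linearity.
        cov_row = Sigma[:i, :i] @ w[:i]
        Sigma[i, :i] = cov_row
        Sigma[:i, i] = cov_row
        Sigma[i, i] = var + w[:i] @ cov_row
    return mu, Sigma

# A -> B with A ~ N(1, 4) and B | A=a ~ N(2 + 0.5*a, 1)
mu, Sigma = lg_network_to_joint([(1.0, {}, 4.0), (2.0, {0: 0.5}, 1.0)])
```

The network stores one offset, one variance and one weight per edge, while the joint stores a full n x n covariance matrix, which is why the network form is the more compact of the two for sparse graphs.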

Inference in Continuous Networks

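P(A) = N(μA, σA²),  P(B|A) = N([μB, w], σB²)

P(B) = ∫ P(B|A) * P(A) dA

P(B|A) * P(A) = N(c, C)

where C = (K^{-1} + M^{-1})^{-1}, c is the corresponding mean vector, and K^{-1} and M^{-1} are the inverse covariance matrices of the two factors (with entries of the form 1/σ²), both extended to the domain (A, B).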

Marginalization

μ' is a vector containing two quantities.

Σ' is a matrix of four quantities.

(μ', Σ') is the required answer.

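Problems: When we multiply two arbitrary Gaussians!

P(B|A) * P(A) = N(c, C), where C = (K^{-1} + M^{-1})^{-1}

The inverses of K and M are always well defined.

However, the inverse of (K^{-1} + M^{-1}) is not!

Theoretical explanation: why is this the case? The inverse of a matrix of size n x n exists only when the matrix is of rank n.

Consider P(A|B) * P(B|X) = P(A,B|X), where all sigmas and w's are assumed to be 1. Over the variables (A, B, X):

$$K^{-1} = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad M^{-1} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix}, \quad K^{-1} + M^{-1} = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}$$

(K^{-1} + M^{-1}) is a 3 x 3 matrix of rank 2 and so is not invertible.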
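The rank argument can be checked numerically. The sketch below assumes NumPy and writes out the two precision matrices of P(A|B) and P(B|X) over (A, B, X) for unit variances and unit weights.

```python
import numpy as np

# Precision matrices over (A, B, X) with all sigmas and w's equal to 1.
K_inv = np.array([[1.0, -1.0, 0.0],
                  [-1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])   # from P(A|B)
M_inv = np.array([[0.0, 0.0, 0.0],
                  [0.0, 1.0, -1.0],
                  [0.0, -1.0, 1.0]])  # from P(B|X)

total = K_inv + M_inv
rank = np.linalg.matrix_rank(total)
det = np.linalg.det(total)
# The all-ones vector lies in the null space: shifting A, B and X together
# leaves P(A,B|X) unchanged, so no finite covariance over (A, B, X) exists.
null_check = total @ np.ones(3)
```

The rank is 2 and the determinant is 0, so the product cannot be written as N(c, C) over all three variables.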

Density vs conditional

However, Theorem: if the product of the Gaussians represents a multivariate Gaussian density, then the inverse always exists.

For example, if P(A|B)*P(B) = P(A,B) = N(c,C), then the inverse of C always exists: P(A,B) is a multivariate Gaussian (density).

But if P(A|B)*P(B|X) = P(A,B|X) = N(c,C), then the inverse of C may not exist: P(A,B|X) is a conditional Gaussian.

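Inference: A general algorithm

Computing marginal of a given variable, say Z.

Step 1: Convert all conditional Gaussians to canonical form.

Let w = [w1,…,wk]. Then

P(y | X1,…,Xk) = N(μ + w1x1 + … + wkxk, σ²) = C(x1,…,xk, y; g, h, K)

$$C(x; g, h, K) = \exp\left(g + h^T x - \tfrac{1}{2}\, x^T K x\right)$$

$$g = -\frac{\mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2), \quad h = \frac{\mu}{\sigma^2}\begin{pmatrix} -w \\ 1 \end{pmatrix}, \quad K = \frac{1}{\sigma^2}\begin{pmatrix} w w^T & -w \\ -w^T & 1 \end{pmatrix}$$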
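A sketch of Step 1 for a single linear Gaussian CPT, assuming NumPy and the variable order [x1,…,xk, y]. The function names are illustrative. The check at the end compares the canonical form against the log density computed directly.

```python
import numpy as np

def linear_gaussian_to_canonical(mu, w, sigma2):
    """Canonical (g, h, K) of N(y; mu + w.x, sigma2) over the stacked vector [x1..xk, y]."""
    w = np.asarray(w, dtype=float)
    v = np.append(-w, 1.0)            # coefficients of (y - w.x)
    K = np.outer(v, v) / sigma2
    h = (mu / sigma2) * v
    g = -0.5 * mu ** 2 / sigma2 - 0.5 * np.log(2.0 * np.pi * sigma2)
    return g, h, K

def canonical_log_value(z, g, h, K):
    """log of exp(g + h.z - z.K.z / 2)."""
    z = np.asarray(z, dtype=float)
    return g + h @ z - 0.5 * z @ K @ z

g, h, K = linear_gaussian_to_canonical(mu=1.0, w=[2.0, -1.0], sigma2=0.5)
x, y = np.array([0.3, 0.7]), 1.2
log_canonical = canonical_log_value(np.append(x, y), g, h, K)
mean = 1.0 + np.dot([2.0, -1.0], x)
log_direct = -0.5 * np.log(2.0 * np.pi * 0.5) - 0.5 * (y - mean) ** 2 / 0.5
```

K has rank 1 here whatever the number of parents, which is why a conditional Gaussian on its own has no covariance matrix.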

Step 2: Extend all g's, h's and K's to the same domain by adding 0's.

P(A|B) = (g, h, K)

P(B) = (g', h', K')

Extending K and K' to the same domain (A, B), K remains the same and K' is changed to

$$\begin{pmatrix} 0 & 0 \\ 0 & K' \end{pmatrix}$$

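Step 3: Add all g's, all h's and all K's.

Step 4: Let the variables involved in the computation be X1,…,Xk,Z. Convert the sum back to moment form, P(X1,X2,…,Xk,Z) = N(μ,Σ), where Σ = K^{-1} and μ = K^{-1} h:

$$\mu = \begin{pmatrix} \mu_{X_1} \\ \vdots \\ \mu_{X_k} \\ \mu_Z \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{X_1 X_1} & \cdots & \Sigma_{X_1 Z} \\ \vdots & \ddots & \vdots \\ \Sigma_{Z X_1} & \cdots & \Sigma_{ZZ} \end{pmatrix}$$

Step 5: Extract the marginal P(Z) = N(μ_Z, Σ_ZZ).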
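The five steps can be traced on a two-node network A → B. The numbers below are an assumed example, not taken from the slides, and the scope order is (A, B).

```python
import numpy as np

# Assumed example: P(A) = N(1, 4) and P(B|A=a) = N(2 + 0.5*a, 1).
mu_a, var_a = 1.0, 4.0
mu_b, w, var_b = 2.0, 0.5, 1.0

# Step 1: canonical forms. P(A) is over (A,), P(B|A) is over (A, B).
g1 = -0.5 * mu_a ** 2 / var_a - 0.5 * np.log(2 * np.pi * var_a)
h1 = np.array([mu_a / var_a])
K1 = np.array([[1.0 / var_a]])
v = np.array([-w, 1.0])
g2 = -0.5 * mu_b ** 2 / var_b - 0.5 * np.log(2 * np.pi * var_b)
h2 = (mu_b / var_b) * v
K2 = np.outer(v, v) / var_b

# Step 2: extend P(A) to the domain (A, B) by padding with zeros.
h1_ext = np.array([h1[0], 0.0])
K1_ext = np.zeros((2, 2))
K1_ext[0, 0] = K1[0, 0]

# Step 3: multiply by adding the parameters.
g, h, K = g1 + g2, h1_ext + h2, K1_ext + K2

# Step 4: back to moment form (possible here because the product is a density).
Sigma = np.linalg.inv(K)
mu = Sigma @ h

# Step 5: extract the marginal of B (index 1).
marg_mean, marg_var = mu[1], Sigma[1, 1]
```

The result agrees with the closed form for this chain: the mean of B is mu_b + w*mu_a and its variance is var_b + w²*var_a.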

Inference: Computing marginal of a given variable

For a continuous Gaussian Bayesian network, inference is polynomial, O(N^3), which is the complexity of matrix inversion.

So algorithms like belief propagation are not generally used when all variables are Gaussian.

Can we do better than N^3? Use Bucket elimination.

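Bucket elimination: Algorithm elim-bel (Dechter 1996)

In each bucket, the multiplication operator combines the functions and the marginalization operator eliminates the bucket's variable:

bucket B: P(b|a) P(d|b,a) P(e|b,c) → hB(a,d,c,e)

bucket C: P(c|a) hB(a,d,c,e) → hC(a,d,e)

bucket D: hC(a,d,e) → hD(a,e)

bucket E: e=0, hD(a,e) → hE(a)

bucket A: P(a) hE(a) → P(a|e=0)

W*=4, the "induced width" (max clique size)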

Multiplication Operator

Convert all functions to canonical form if necessary.

Extend all functions to the same variables.

(g1,h1,K1)*(g2,h2,K2) = (g1+g2, h1+h2, K1+K2)

Again our problem!

In bucket B, the product P(b|a) * P(d|b,a) * P(e|b,c) is marginalized over b to give hB(a,d,c,e).

hB(a,d,c,e) does not represent a density and so cannot be computed in our usual form N(μ,Σ).

Solution: Marginalize in canonical form

Although the intermediate functions computed in bucket elimination are conditional, we can marginalize in canonical form, so we can eliminate the problem of non-existence of the inverse completely.

Algorithm

In each bucket, convert all functions to canonical form if necessary, multiply them, and marginalize out the variable in the bucket in canonical form.

Theorem: P(A) is a density and is correct.

Complexity: time and space O((w+1)^3), where w is the width of the ordering used.
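Marginalizing in canonical form only needs the block of K that belongs to the eliminated variables to be invertible. The sketch below assumes NumPy and a symmetric K. It eliminates B from P(A|B) * P(B|X) with zero offsets, unit weights and unit variances, a product that has no moment form.

```python
import numpy as np

def marginalize_canonical(g, h, K, elim):
    """Integrate the variables at index positions `elim` out of exp(g + h.x - x.K.x/2)."""
    n = len(h)
    keep = [i for i in range(n) if i not in elim]
    K11 = K[np.ix_(elim, elim)]
    K21 = K[np.ix_(keep, elim)]
    K22 = K[np.ix_(keep, keep)]
    h1, h2 = h[elim], h[keep]
    sol_h = np.linalg.solve(K11, h1)        # K11^{-1} h1
    sol_K = np.linalg.solve(K11, K21.T)     # K11^{-1} K12 (K assumed symmetric)
    _, logdet = np.linalg.slogdet(K11)
    g_new = g + 0.5 * (len(elim) * np.log(2 * np.pi) - logdet + h1 @ sol_h)
    return g_new, h2 - K21 @ sol_h, K22 - K21 @ sol_K

# P(A|B) * P(B|X) over (A, B, X): the sum of the two precision matrices.
g = 2 * (-0.5 * np.log(2 * np.pi))
h = np.zeros(3)
K = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])      # rank 2, so N(mu, Sigma) does not exist

g_ax, h_ax, K_ax = marginalize_canonical(g, h, K, elim=[1])   # eliminate B
```

The result is the canonical form of P(A|X) = N(x, 2) over (A, X): only the 1 x 1 block for B was inverted, and the full matrix was never inverted.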

Continuous Node, Discrete Parents

Definition:

Let X be a continuous node, and let U={U1,U2,…,Un} be its discrete parents and Y={Y1,Y2,…,Yk} be its continuous parents. We say that X has a conditional linear Gaussian (CLG) CPT if, for every value u ∈ D(U), we have a set of (k+1) coefficients a_{u,0}, a_{u,1}, …, a_{u,k} and a variance σ_u² such that:

$$p(X \mid u, y) = N\left(a_{u,0} + \sum_{i=1}^{k} a_{u,i}\, y_i,\ \sigma_u^2\right)$$
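A CLG CPT can be stored as one set of linear Gaussian parameters per discrete parent configuration. The table below is a hypothetical example with one binary discrete parent and two continuous parents; the names clg_cpt and clg_density are illustrative.

```python
import math

# Hypothetical CLG CPT for X with one binary discrete parent U and two
# continuous parents (k = 2): u -> ([a_u0, a_u1, a_u2], variance).
clg_cpt = {
    0: ([0.0, 1.0, -1.0], 1.0),
    1: ([2.0, 0.5, 0.0], 4.0),
}

def clg_density(x, u, y, cpt):
    """p(x | u, y) = N(a_u0 + sum_i a_ui * y_i, sigma_u^2)."""
    coeffs, var = cpt[u]
    mean = coeffs[0] + sum(a * yi for a, yi in zip(coeffs[1:], y))
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

p0 = clg_density(1.0, 0, [3.0, 2.0], clg_cpt)   # mean 0 + 3 - 2 = 1
p1 = clg_density(1.0, 1, [3.0, 2.0], clg_cpt)   # mean 2 + 1.5 = 3.5
```

The table has one entry per value of U, so its size grows exponentially with the number of discrete parents.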

CLG Network

Definition:

A Bayesian network is called a CLG network if every discrete node has only discrete parents, and every continuous node has a CLG CPT.

Inference in CLGs

Can we use the same algorithm? Yes, but the algorithm is unbounded if we are not careful.

Reason: Marginalizing out discrete variables from an arbitrary function in a CLG is not bounded. Let x and y be continuous variables and i and k be discrete binary variables. If we marginalize out y and k from f(x,y,i,k), the result is a mixture of 4 Gaussians instead of 2.

Solution: Approximate the mixture of Gaussians by a single Gaussian.
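One common way to approximate a mixture by a single Gaussian is moment matching, which keeps the total weight, mean and covariance of the mixture. The sketch below assumes NumPy; the function name and the example mixture are illustrative.

```python
import numpy as np

def collapse_mixture(weights, means, covs):
    """Moment-match a Gaussian mixture with a single Gaussian.
    Returns (total weight, mean, covariance)."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    p = weights.sum()
    probs = weights / p
    mu = probs @ means
    Sigma = np.zeros_like(covs[0])
    for pi, m, S in zip(probs, means, covs):
        d = m - mu
        # Within-component covariance plus the spread of the component means.
        Sigma += pi * (S + np.outer(d, d))
    return p, mu, Sigma

p, mu, Sigma = collapse_mixture(
    weights=[0.5, 0.5],
    means=[[0.0], [2.0]],
    covs=[[[1.0]], [[1.0]]],
)
```

The collapsed Gaussian is wider than either component because the distance between the component means is added to the variance.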

Multiplication and Marginalization

Multiplication:

Convert all functions to canonical form if necessary.

Extend all functions to the same variables.

(g1,h1,K1)*(g2,h2,K2) = (g1+g2, h1+h2, K1+K2)

Marginalization:

Strong marginal when marginalizing continuous variables.

Weak marginal when marginalizing discrete variables.

Problem while using this marginalization in bucket elimination: the weak marginal requires computing Σ and μ, which is not possible due to non-existence of the inverse.

Solution: Use an ordering such that you never have to marginalize out discrete variables from a function that has both discrete and continuous Gaussian variables.

Special case: Compute the marginal at a discrete node

Homework: Derive a bucket elimination algorithm for computing the marginal of a continuous variable.

Special case: a marginal on a discrete variable in a CLG is to be computed. B, C and D are continuous variables and A and E are discrete.

bucket B: P(b|a,e) P(d|b,a) P(d|b,c) → hB(a,d,c,e)

bucket C: P(c|a) hB(a,d,c,e) → hC(a,d,e)

bucket D: hC(a,d,e) → hD(a,e)

bucket E: P(e) hD(a,e)

bucket A:

Complexity of the special case

Discrete-width (wd): maximum number of discrete variables in a clique.

Continuous-width (wc): maximum number of continuous variables in a clique.

Time: O(exp(wd)+wc^3)

Space: O(exp(wd)+wc^3)

Algorithm for the general case: Computing belief at a continuous node of a CLG

Convert all functions to canonical form.

Create a special tree-decomposition.

Assign functions to appropriate cliques (same as assigning functions to buckets).

Select a strong root.

Perform message passing.

Creating a special tree-decomposition

Moralize the Bayesian network.

Select an ordering such that all continuous variables are ordered before discrete variables (this increases the induced width).

Elimination order

Strong elimination order:

• First eliminate continuous variables.

• Eliminate a discrete variable only when no continuous variables are available.

W and X are discrete variables and Y and Z are continuous.

[Figure: Elimination order (1) on the moralized graph over W, X, Y and Z (Y and Z each have dim 2). The moralized graph has one added edge. The result is Clique 1 and Clique 2, joined by a separator.]

[Figure: Bucket tree or junction tree (1), with Clique 1, Clique 2 (the root) and the separator between them.]


Assigning functions to cliques

Select a function and place it in an arbitrary clique that mentions all variables in the function.


Strong Root

We define a strong root as any node R in the bucket-tree which satisfies the following property: for any pair (V,W) which are neighbors on the tree with W closer to R than V, we have

$$(V \setminus W) \subseteq \Gamma \quad \text{or} \quad (V \cap W) \subseteq \Delta$$

where Γ is the set of continuous variables and Δ is the set of discrete variables.

[Figure: example of a strong root]
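The strong root property can be checked mechanically on a small tree. The sketch below is an illustration with assumed data structures (clique names mapped to variable sets, plus a list of tree edges). It uses the two cliques wyz and wxy of the running example, with W and X discrete, and assumes every variable is either discrete or continuous.

```python
def is_strong_root(root, cliques, edges, discrete):
    """cliques: name -> set of variables; edges: list of (name, name) tree edges."""
    neighbors = {name: set() for name in cliques}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Walk the tree outward from the root; `w` is always the neighbor closer to it.
    stack, seen = [root], {root}
    while stack:
        w = stack.pop()
        for v in neighbors[w] - seen:
            residual = cliques[v] - cliques[w]      # V \ W
            separator = cliques[v] & cliques[w]     # V ∩ W
            residual_continuous = residual.isdisjoint(discrete)
            separator_discrete = separator <= discrete
            if not (residual_continuous or separator_discrete):
                return False
            seen.add(v)
            stack.append(v)
    return True

cliques = {"wyz": {"w", "y", "z"}, "wxy": {"w", "x", "y"}}
edges = [("wyz", "wxy")]
discrete = {"w", "x"}

root_wxy = is_strong_root("wxy", cliques, edges, discrete)
root_wyz = is_strong_root("wyz", cliques, edges, discrete)
```

With wxy as the root, the only message toward the root eliminates z, which is continuous, so wxy is a strong root. With wyz as the root, the message would have to eliminate the discrete variable x across a separator that contains the continuous variable y, so wyz is not.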

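Message passing at a typical node

Node "a" contains functions assigned to it according to the tree-decomposition scheme, denoted by pj(a). It receives the messages h_{x1→a}(sep(x1,a)), …, h_{xn→a}(sep(xn,a)) from its neighbors x1,…,xn, and sends to a neighbor b the message

$$h_{a \to b}(sep(a,b)) = \sum_{a \setminus sep(a,b)} \prod_j p_j(a) \prod_{i \neq b} h_{i \to a}(sep(i,a))$$

where the sum is an integral for continuous variables.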

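Message Passing

Two-pass algorithm: bucket-tree propagation

[Figure from P. Green: Collect (messages flow toward the root) and Distribute (messages flow away from the root)]

Let's look at the messages

[Figure: Collect Evidence. The messages sent toward the strong root are ∫Mout, ∫Min and ∫D, which integrate out continuous variables only.]

[Figure: Distribute Evidence. The messages sent away from the strong root are ∫E ∑W,B and ∫E ∑B, which also sum out discrete variables.]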

Lauritzen's theorem

When you perform message passing such that collect evidence contains only strong marginals and distribute evidence may contain weak marginals, the junction-tree algorithm is exact in the sense that the first moments (means) and second moments (variances) computed are the true moments.

Complexity

Polynomial in the number of continuous variables in a clique (n^3).

Exponential in the number of discrete variables in a clique.

Possible options for approximation

Ignore the strong root assumption and use an approximation like MBTE, IJGP or sampling.

Respect the strong root assumption and use an approximation like MBTE, IJGP or sampling. Inaccuracies are then due only to discrete variables if done in one pass of MBTE.

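Initialization (1)

[Figure: the example network with discrete variables W (values 0, 1) and X (values 0, 1) and continuous variables Y and Z, each of dim 2, annotated with the CPT parameters]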

Initialization (2)

Cliques: wyz (Clique 1) and wxy (Clique 2, root), with separator wy. All functions are converted to canonical form.

P(W):

w=0: g = log(0.5), h = [], K = []

w=1: g = log(0.5), h = [], K = []

P(X):

x=0: g = log(0.4), h = [], K = []

x=1: g = log(0.6), h = [], K = []

P(Y|X):

X=0: g = -4.2145, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]

X=1: g = -3.0310, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]

P(Z|W,Y), over (y, z):

W=0: g = -4.0629, h = [0.0889 -0.0111 -0.0556 0.0556]'

K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]

W=1: g = -2.7854, h = [0.0867 -0.0633 -0.1000 -0.1667]'

K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]

After multiplying P(Z|W,Y) by P(W): W=0: g = -4.7560; W=1: g = -3.4786 (h and K unchanged).

Initialization (3)

Clique potentials after assigning the functions (Clique 1: wyz, separator: wy, Clique 2: wxy).

Clique wyz:

W=0: g = -4.7560, h = [0.0889 -0.0111 -0.0556 0.0556]'

K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]

W=1: g = -3.4786, h = [0.0867 -0.0633 -0.1 -0.1667]'

K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]

Clique wxy:

wx=00: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]

wx=10: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]

wx=01: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]

wx=11: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]

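Message Passing

Cliques wyz and wxy, separator wy.

Collect evidence:

$$\phi^*_{wy} = \int \phi_{wyz}\, dz, \qquad \phi^*_{wxy} = \phi_{wxy}\, \phi^*_{wy}$$

Distribute evidence:

$$\phi^{**}_{wy} = \sum_x \phi^*_{wxy}, \qquad \phi^*_{wyz} = \phi_{wyz}\, \frac{\phi^{**}_{wy}}{\phi^*_{wy}}$$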

Collect evidence (1): marginalization

Marginalizing the continuous variables y1 (of dimension p) out of a canonical form over (y1, y2), with h = (h1, h2) and K = (K11 K12; K21 K22):

$$\int \phi(y_1, y_2; g, h, K)\, dy_1 = \phi(y_2; \hat g, \hat h, \hat K)$$

$$\hat g = g + \tfrac{1}{2}\left(p \log(2\pi) - \log|K_{11}| + h_1^T K_{11}^{-1} h_1\right), \quad \hat h = h_2 - K_{21} K_{11}^{-1} h_1, \quad \hat K = K_{22} - K_{21} K_{11}^{-1} K_{12}$$

Clique wyz (input): W=0: g = -4.7560; W=1: g = -3.4786; h and K as in Initialization (3).

Message on separator wy (z marginalized out):

W=0: g = -0.6931, h = [0.1388 0]' * 1.0e-16, K = [0.2776 -0.0694; 0.0347 0] * 1.0e-16

W=1: g = -0.6931, h = [0 0]', K = [0 0; 0 0]
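The marginalization step can be re-run on the W=0 potential of clique wyz. The sketch below assumes NumPy and the variable order (y1, y2, z1, z2), and it uses the four-digit values printed on the slides. Because of that rounding, the results come out near 1e-4 rather than the 1e-16 shown on the slide.

```python
import numpy as np

# Clique wyz for W=0, as printed on the slides; variable order (y1, y2, z1, z2).
g = -4.7560
h = np.array([0.0889, -0.0111, -0.0556, 0.0556])
K = np.array([[0.1444, -0.0089, -0.1, 0.0778],
              [-0.0089, 0.0378, -0.0333, -0.0556],
              [-0.1, -0.0333, 0.1111, 0.0],
              [0.0778, -0.0556, 0.0, 0.1111]])

z, y = [2, 3], [0, 1]            # integrate out z, keep y
K_zz = K[np.ix_(z, z)]
K_yz = K[np.ix_(y, z)]
K_yy = K[np.ix_(y, y)]

g_hat = g + 0.5 * (len(z) * np.log(2 * np.pi)
                   - np.log(np.linalg.det(K_zz))
                   + h[z] @ np.linalg.solve(K_zz, h[z]))
h_hat = h[y] - K_yz @ np.linalg.solve(K_zz, h[z])
K_hat = K_yy - K_yz @ np.linalg.solve(K_zz, K_yz.T)
```

The message reduces to g ≈ log(0.5) with h and K numerically zero: integrating z out of P(W) * P(Z|W,Y) leaves only P(W=0) = 0.5, with no dependence on y.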

Collect evidence: multiplication

The message on separator wy (W=0 and W=1: g = -0.6931, h ≈ [0 0]', K ≈ [0 0; 0 0]) is multiplied into clique wxy.

Clique wxy before multiplication:

wx=00 and wx=10: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]

wx=01 and wx=11: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]

Clique wxy after multiplication:

wx=00 and wx=10: g = -5.8239, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]

wx=01 and wx=11: g = -4.2350, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]

Distribute evidence (1): division

Clique wyz is divided by the message it sent on separator wy during collect (W=0 and W=1: g = -0.6931, h ≈ [0 0]', K ≈ [0 0; 0 0]). In canonical form, division subtracts the parameters: (g1,h1,K1)/(g2,h2,K2) = (g1-g2, h1-h2, K1-K2).

Clique wyz before division: W=0: g = -4.7560; W=1: g = -3.4786

Clique wyz after division: W=0: g = -4.0629; W=1: g = -2.7854

h and K are unchanged:

W=0: h = [0.0889 -0.0111 -0.0556 0.0556]', K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]

W=1: h = [0.0867 -0.0633 -0.1 -0.1667]', K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]

Distribute evidence: marginalize over x

Clique wxy (root):

wx=00 and wx=10: g = -5.8239, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]

wx=01 and wx=11: g = -4.2350, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]

Marginalizing over the discrete variable x is a weak marginal, so the result on separator wy is given in moment form:

w=0: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336; -0.6336 6.3616]

w=1: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336; -0.6336 6.3616]

Distribute evidence: conversion to canonical form

Clique wyz after division: W=0: g = -4.0629; W=1: g = -2.7854 (h and K as in Initialization (3)).

Separator wy in moment form:

w=0 and w=1: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336; -0.6336 6.3616]

Separator wy in canonical form, ready for multiplication into clique wyz:

w=0: g = -4.3316, h = [0.0927 -0.0096]', K = [0.1824 0.0182; 0.0182 0.159]

w=1: g = -4.3316, h = [0.0927 -0.0096]', K = [0.1824 0.0182; 0.0182 0.159]
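The conversion from moment form (p, mu, Sigma) to canonical form (g, h, K) can be checked against the separator values. The sketch below assumes NumPy; the function name is illustrative, and the inputs are the logp, mu and Sigma printed on the slides for separator wy.

```python
import numpy as np

def moment_to_canonical(logp, mu, Sigma):
    """(p, mu, Sigma) -> (g, h, K) with K = Sigma^{-1} and h = K mu."""
    K = np.linalg.inv(Sigma)
    h = K @ mu
    n = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)
    g = logp - 0.5 * (n * np.log(2 * np.pi) + logdet) - 0.5 * mu @ h
    return g, h, K

mu = np.array([0.52, -0.12])
Sigma = np.array([[5.5456, -0.6336],
                  [-0.6336, 6.3616]])
g, h, K = moment_to_canonical(np.log(0.5), mu, Sigma)
```

Since logp, mu and Sigma are the same for w=0 and w=1, the canonical parameters are also the same for both values of w.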

Clique wyz after multiplication by the separator:

W=0: g = -8.3945, h = [0.1816 -0.0207 -0.0556 0.0556]'

K = [0.3268 0.0093 -0.1 0.0778; 0.0093 0.1968 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]

W=1: g = -7.1170, h = [0.1793 -0.073 -0.1 -0.1667]'

K = [0.3907 -0.1285 0.15 -0.2333; -0.1285 0.2623 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]

After Message Passing

The cliques and the separator hold the local marginal distributions p(wyz), p(wy) and p(wxy).
