Inference in Gaussian and Hybrid Bayesian Networks
ICS 275B
Gaussian Distribution
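$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Represented as $N(\mu, \sigma)$ or as a triple $(p, \mu, \sigma)$, where $p = \frac{1}{\sqrt{2\pi}\,\sigma}$.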
[Figure: gaussian(x,0,1) and gaussian(x,1,1) plotted for x from -3 to 3, with density values from 0 to 0.4. The two $N(\mu, \sigma)$ curves differ only in $\mu$.]
[Figure: gaussian(x,0,1) and gaussian(x,0,2) plotted for x from -3 to 3, with density values from 0 to 0.4. The two $N(\mu, \sigma)$ curves differ only in $\sigma$.]
Multivariate Gaussian
Definition:
Let X1,…,Xn be a set of random variables. A multivariate Gaussian distribution over X1,…,Xn is parameterized by an n-dimensional mean vector $\mu$ and an n x n positive definite covariance matrix $\Sigma$. It defines a joint density via:
$P(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Linear Gaussian Distribution
Definition:
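Let Y be a continuous node with continuous parents X1,…,Xk. We say that Y has a linear Gaussian model if it can be described using parameters $\beta_0, \ldots, \beta_k$ and $\sigma^2$ such that:
$P(y \mid x_1,\ldots,x_k) = N(\mu_y + \beta_1 x_1 + \cdots + \beta_k x_k;\ \sigma^2) = N([\mu_y, \beta_1, \ldots, \beta_k], \sigma^2)$
Here $\mu_y$ plays the role of the intercept $\beta_0$.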
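Example: a two-node network A → B with
$A \sim N(\mu_a, \sigma_a)$
$B \sim N([\mu_b, w], \sigma_b)$
[Figure: surface plot of a linear Gaussian density over X and Y, both ranging from -10 to 10, with density values from 0 to 0.4.]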
Linear Gaussian Network
Definition
A linear Gaussian Bayesian network is a Bayesian network all of whose variables are continuous and all of whose CPTs are linear Gaussians.
Linear Gaussian BN ⇔ Multivariate Gaussian
=>Linear Gaussian BN has a compact representation
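To make the equivalence concrete, here is a minimal sketch for the two-node network A → B. The function name `joint_from_chain` and the numeric values are illustrative assumptions, not part of the slides; the variances are passed directly rather than standard deviations.

```python
def joint_from_chain(mu_a, var_a, mu_b, w, var_b):
    """Joint Gaussian over (A, B) for A -> B, where
    A ~ N(mu_a, var_a) and B | a ~ N(mu_b + w * a, var_b)."""
    mean = [mu_a, mu_b + w * mu_a]
    cov_ab = w * var_a                      # Cov(A, B)
    cov = [[var_a, cov_ab],
           [cov_ab, var_b + w * w * var_a]]  # Var(B) adds the propagated variance of A
    return mean, cov


# Hypothetical parameters, chosen only for illustration.
mean, cov = joint_from_chain(mu_a=1.0, var_a=2.0, mu_b=0.5, w=3.0, var_b=1.0)
print(mean)  # mean vector of (A, B)
print(cov)   # 2 x 2 covariance matrix
```

The network needs only five numbers here, while the joint is described by a mean vector and a full covariance matrix; for larger sparse networks the gap grows.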
Inference in Continuous Networks
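Network: A → B
$P(A) = N(\mu_A, \sigma_A^2)$, $P(B \mid A) = N([\mu_B, w], \sigma_B^2)$
$P(B) = \int P(B \mid A) \cdot P(A)\, dA$
$P(B \mid A) \cdot P(A) = N(c, C)$, where, with the variables ordered (A, B),
$C = (K^{-1} + M^{-1})^{-1}$
$c = C\,(K^{-1} [\mu_A, 0]^T + M^{-1} [0, \mu_B]^T)$
$K^{-1} = \begin{bmatrix} 1/\sigma_A^2 & 0 \\ 0 & 0 \end{bmatrix}$
$M^{-1} = \begin{bmatrix} w^2/\sigma_B^2 & -w/\sigma_B^2 \\ -w/\sigma_B^2 & 1/\sigma_B^2 \end{bmatrix}$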
Marginalization
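$P(B \mid A) \cdot P(A) = N(c, C)$
$c = [\mu'_a, \mu'_b]^T$ is a vector containing two quantities.
$C = \begin{bmatrix} \sigma'_{aa} & \sigma'_{ab} \\ \sigma'_{ba} & \sigma'_{bb} \end{bmatrix}$ is a matrix of four quantities.
$N(\mu'_b, \sigma'_{bb})$ is the required answer.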
Problem: multiplying two arbitrary Gaussians
The product of two Gaussian factors is written as $N(c, C)$ with $C = (K^{-1} + M^{-1})^{-1}$, where $K^{-1}$ and $M^{-1}$ are the inverse-covariance matrices of the two factors.
$K^{-1}$ and $M^{-1}$ are always well defined.
However, the inverse $(K^{-1} + M^{-1})^{-1}$ is not!
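Theoretical explanation: why is this the case? The inverse of a matrix of size n x n exists only when the matrix has rank n.
Consider the product $P(A \mid B) \cdot P(B \mid X) = P(A, B \mid X) = N(c, C)$ with $C = (K^{-1} + M^{-1})^{-1}$, where, with the variables ordered (X, B, A),
$K^{-1} = \frac{1}{\sigma_{bx}^2}\begin{bmatrix} w_{bx}^2 & -w_{bx} & 0 \\ -w_{bx} & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$, $M^{-1} = \frac{1}{\sigma_{ab}^2}\begin{bmatrix} 0 & 0 & 0 \\ 0 & w_{ab}^2 & -w_{ab} \\ 0 & -w_{ab} & 1 \end{bmatrix}$
If all sigmas and w's are assumed to be 1:
$K^{-1} + M^{-1} = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix}$
This 3 x 3 matrix $(K^{-1} + M^{-1})$ has rank 2 and so is not invertible.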
Density vs conditional
Theorem: If the product of the Gaussians represents a multivariate Gaussian density, then the inverse always exists.
For example, if P(A|B)*P(B) = P(A,B) = N(c,C), then the inverse of C always exists, because P(A,B) is a multivariate Gaussian (density).
But if P(A|B)*P(B|X) = P(A,B|X) = N(c,C), then the inverse of C may not exist, because P(A,B|X) is a conditional Gaussian.
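Inference: A general algorithm
Computing the marginal of a given variable, say Z.
Step 1: Convert all conditional Gaussians to canonical form.
$P(y \mid X_1, \ldots, X_k) = N(\mu_y, w_1, \ldots, w_k, \sigma_y^2) = (g, h, K)$
Let $w = [w_1, \ldots, w_k]'$. Then
$g = -\frac{\mu_y^2}{2\sigma_y^2} - \frac{1}{2}\log(2\pi\sigma_y^2)$
$h = \frac{\mu_y}{\sigma_y^2}\begin{bmatrix} 1 \\ -w \end{bmatrix}$
$K = \frac{1}{\sigma_y^2}\begin{bmatrix} 1 & -w' \\ -w & w w' \end{bmatrix}$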
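Step 1 can be sketched in a few lines. This is a minimal illustration, assuming the canonical form denotes exp(g + h'z - z'Kz/2) with z = (y, x1, …, xk); the function names are hypothetical and the numeric values are chosen only for the check at the end.

```python
import math


def linear_gaussian_to_canonical(mu_y, w, var_y):
    """Canonical form (g, h, K) of P(y | x) = N(mu_y + w.x, var_y),
    with variables ordered (y, x1, ..., xk)."""
    g = -mu_y ** 2 / (2.0 * var_y) - 0.5 * math.log(2.0 * math.pi * var_y)
    coeffs = [1.0] + [-wi for wi in w]           # the vector [1, -w]
    h = [mu_y / var_y * c for c in coeffs]
    K = [[ci * cj / var_y for cj in coeffs] for ci in coeffs]
    return g, h, K


def canonical_log_value(g, h, K, z):
    """Evaluate g + h'z - 0.5 * z'Kz."""
    n = len(z)
    lin = sum(h[i] * z[i] for i in range(n))
    quad = sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))
    return g + lin - 0.5 * quad


# Illustrative check: the canonical form reproduces the log-density.
g, h, K = linear_gaussian_to_canonical(mu_y=0.5, w=[3.0], var_y=2.0)
y, x = 2.0, 1.0
direct = -0.5 * math.log(2.0 * math.pi * 2.0) - (y - (0.5 + 3.0 * x)) ** 2 / (2.0 * 2.0)
print(canonical_log_value(g, h, K, [y, x]), direct)
```

Note that K has rank 1 for any number of parents, which is why a conditional Gaussian on its own has no covariance form.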
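Step 2: Extend all g's, h's and K's to the same domain by adding 0's.
Example, with the variables ordered (A, B):
$P(A \mid B) = (g, h, K)$, $K = \begin{bmatrix} 1/\sigma_{ab}^2 & -w_{ab}/\sigma_{ab}^2 \\ -w_{ab}/\sigma_{ab}^2 & w_{ab}^2/\sigma_{ab}^2 \end{bmatrix}$
$P(B) = (g', h', K')$, $K' = [1/\sigma_b^2]$
Extending K' and K to the same domain, K remains the same and K' is changed to $\begin{bmatrix} 0 & 0 \\ 0 & 1/\sigma_b^2 \end{bmatrix}$.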
Step 3: Add all g's, all h's and all K's.
Step 4: Let the variables involved in the computation be X1, X2, …, Xk, Z. The summed canonical form (g, h, K) represents P(X1,X2,…,Xk,Z) = N(μ,∑), with ∑ = K^-1 and μ = ∑h.
Step 5: Extract the marginal.
$\mu = [\mu_1, \ldots, \mu_k, \mu_z]'$, $\Sigma = \begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1k} & \sigma_{1Z} \\ \vdots & & \vdots & \vdots \\ \sigma_{Z1} & \cdots & \sigma_{Zk} & \sigma_{ZZ} \end{bmatrix}$
$P(Z) = N(\mu_z, \sigma_{ZZ})$
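The five steps can be traced on the two-node network A → B. The sketch below is an illustration under stated assumptions: the function name `marginal_of_b` and the numbers are hypothetical, variances are passed directly, and g is omitted because it does not affect the mean or the variance.

```python
def marginal_of_b(mu_a, var_a, mu_b, w, var_b):
    """Marginal of B in A -> B, where A ~ N(mu_a, var_a) and
    B | a ~ N(mu_b + w * a, var_b). Variables are ordered (A, B)."""
    # Steps 1-3: canonical forms of P(A) and P(B|A), extended to (A, B) and added.
    K = [[1.0 / var_a + w * w / var_b, -w / var_b],
         [-w / var_b, 1.0 / var_b]]
    h = [mu_a / var_a - w * mu_b / var_b, mu_b / var_b]
    # Step 4: Sigma = K^-1 (2 x 2 inverse written out), mu = Sigma h.
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    sigma = [[K[1][1] / det, -K[0][1] / det],
             [-K[1][0] / det, K[0][0] / det]]
    mu = [sigma[0][0] * h[0] + sigma[0][1] * h[1],
          sigma[1][0] * h[0] + sigma[1][1] * h[1]]
    # Step 5: read off the entries that belong to B.
    return mu[1], sigma[1][1]


mean_b, var_b_marginal = marginal_of_b(mu_a=1.0, var_a=2.0, mu_b=0.5, w=3.0, var_b=1.0)
print(mean_b, var_b_marginal)
```

The result should agree with the closed form mu_b + w * mu_a for the mean and var_b + w^2 * var_a for the variance, up to floating-point error.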
Inference: Computing marginal of a given variable
For a continuous Gaussian Bayesian Network, inference is polynomial, O(N^3), which is the complexity of matrix inversion.
So algorithms like belief propagation are not generally used when all variables are Gaussian.
Can we do better than N^3? Use Bucket elimination.
Bucket elimination: Algorithm elim-bel (Dechter 1996)
Query: P(a|e=0)
bucket B: P(b|a) P(d|b,a) P(e|b,c)
bucket C: P(c|a) $h^B(a,d,c,e)$
bucket D: $h^C(a,d,e)$
bucket E: e=0 $h^D(a,e)$
bucket A: P(a) $h^E(a)$
In each bucket the functions are combined with the multiplication operator, and the bucket's variable (b in bucket B) is then removed with the marginalization operator.
W*=4 "induced width" (max clique size)
Multiplication Operator
Convert all functions to canonical form if necessary.
Extend all functions to the same variables.
(g1,h1,K1)*(g2,h2,K2) = (g1+g2,h1+h2,K1+K2)
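A minimal sketch of the multiplication operator, assuming each factor is stored as a tuple (variables, g, h, K); this layout and the names `extend` and `multiply` are illustrative choices, not taken from the slides.

```python
def extend(factor, scope):
    """Pad a canonical-form factor with zeros so it is defined over `scope`."""
    variables, g, h, K = factor
    pos = {v: i for i, v in enumerate(variables)}
    h_ext = [h[pos[v]] if v in pos else 0.0 for v in scope]
    K_ext = [[K[pos[u]][pos[v]] if u in pos and v in pos else 0.0
              for v in scope] for u in scope]
    return (list(scope), g, h_ext, K_ext)


def multiply(f1, f2):
    """(g1,h1,K1)*(g2,h2,K2) = (g1+g2, h1+h2, K1+K2) after extension."""
    scope = list(f1[0]) + [v for v in f2[0] if v not in f1[0]]
    _, g1, h1, K1 = extend(f1, scope)
    _, g2, h2, K2 = extend(f2, scope)
    h = [a + b for a, b in zip(h1, h2)]
    K = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]
    return (scope, g1 + g2, h, K)


# P(A|B) and P(B|X) with all weights and variances equal to 1 and zero offsets
# (g is set to 0.0 here because only h and K are of interest).
p_a_given_b = (["A", "B"], 0.0, [0.0, 0.0], [[1.0, -1.0], [-1.0, 1.0]])
p_b_given_x = (["B", "X"], 0.0, [0.0, 0.0], [[1.0, -1.0], [-1.0, 1.0]])
scope, g, h, K = multiply(p_a_given_b, p_b_given_x)
print(scope)
print(K)
```

The resulting K is the rank-2 matrix discussed earlier for P(A|B)*P(B|X), so the product is well defined in canonical form even though it has no covariance form.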
Again our problem!
Query: P(a)
bucket B: P(b|a) P(d|b,a) P(e|b,c)
bucket C: P(c|a) $h^B(a,d,c,e)$
bucket D: $h^C(a,d,e)$
bucket E: P(e) $h^D(a,e)$
bucket A: P(a) $h^E(a)$
W*=4 "induced width" (max clique size)
$h^B(a,d,c,e)$ does not represent a density and so cannot be computed in our usual form N(μ,σ).
Solution: Marginalize in canonical form
Although the intermediate functions computed in bucket elimination are conditional, we can marginalize in canonical form, which eliminates the problem of a non-existent inverse completely.
Algorithm
In each bucket, convert all functions to canonical form if necessary, multiply them, and marginalize out the variable in the bucket as shown in the previous slide.
Theorem: P(A) is a density and is correct.
Complexity: Time and space: O((w+1)^3), where w is the width of the ordering used.
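The following sketch shows why marginalizing in canonical form avoids the problematic inverse: only the diagonal entry of the eliminated variable is inverted. It handles one scalar variable at a time; the tuple layout (variables, g, h, K) and the name `marginalize_out` are assumptions made for illustration.

```python
import math


def marginalize_out(factor, var):
    """Integrate one scalar variable out of a canonical form exp(g + h'z - z'Kz/2)."""
    variables, g, h, K = factor
    i = variables.index(var)
    keep = [j for j in range(len(variables)) if j != i]
    k_ii = K[i][i]
    if k_ii <= 0.0:
        raise ValueError("integral over %s does not exist" % var)
    g_new = g + 0.5 * (math.log(2.0 * math.pi) - math.log(k_ii) + h[i] ** 2 / k_ii)
    h_new = [h[j] - K[j][i] * h[i] / k_ii for j in keep]
    K_new = [[K[j][l] - K[j][i] * K[i][l] / k_ii for l in keep] for j in keep]
    return ([variables[j] for j in keep], g_new, h_new, K_new)


# Product P(A|B)*P(B|X) with all weights and variances 1 and zero offsets.
# Its K matrix over (X, B, A) has rank 2, so it has no covariance form.
g0 = 2.0 * (-0.5 * math.log(2.0 * math.pi))
product = (["X", "B", "A"], g0, [0.0, 0.0, 0.0],
           [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
variables, g, h, K = marginalize_out(product, "B")
print(variables, g, h, K)
```

The output should be the canonical form of P(A|X) = N(x, 2), which is again a conditional Gaussian and can be passed on to the next bucket.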
Continuous Node, Discrete Parents
Definition:
Let X be a continuous node, and let U={U1,U2,…,Un} be its discrete parents and Y={Y1,Y2,…,Yk} be its continuous parents. We say that X has a conditional linear Gaussian (CLG) CPT if, for every value u ∈ D(U), we have a set of (k+1) coefficients $a_{u,0}, a_{u,1}, \ldots, a_{u,k}$ and a variance $\sigma_u^2$ such that:
$p(X \mid u, y) = N\left(a_{u,0} + \sum_{i=1}^{k} a_{u,i}\, y_i,\ \sigma_u^2\right)$
CLG Network
Definition:
A Bayesian network is called a CLG network if every discrete node has only discrete parents, and every continuous node has a CLG CPT.
Inference in CLGs
Can we use the same algorithm? Yes, but the algorithm is unbounded if we are not careful.
Reason: Marginalizing out discrete variables from an arbitrary function in a CLG is not bounded. If we marginalize out y and k from f(x,y,i,k), the result is a mixture of 4 Gaussians instead of 2. Here x and y are continuous variables, and i and k are discrete binary variables.
Solution: Approximate the mixture of Gaussians by a single Gaussian.
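One common way to approximate a mixture by a single Gaussian is moment matching, which keeps the mixture's mean and covariance. The sketch below assumes that is the approximation intended; the function name `collapse` is hypothetical, and the component parameters are assumptions chosen so that the result can be compared with the values mu = [0.52 -0.12]' and Sigma reported on the "Distribute evidence (3)" slide.

```python
def collapse(weights, means, covs):
    """Moment-match a Gaussian mixture with a single Gaussian.
    Returns (total weight, mean vector, covariance matrix)."""
    total = sum(weights)
    p = [w / total for w in weights]
    comps = range(len(p))
    n = len(means[0])
    mu = [sum(p[c] * means[c][i] for c in comps) for i in range(n)]
    # Covariance = average within-component covariance + spread of the component means.
    sigma = [[sum(p[c] * (covs[c][i][j]
                          + (means[c][i] - mu[i]) * (means[c][j] - mu[j]))
                  for c in comps)
              for j in range(n)] for i in range(n)]
    return total, mu, sigma


# Assumed two-component mixture over a 2-dimensional variable.
weights = [0.4, 0.6]
means = [[-0.2, 1.2], [1.0, -1.0]]
covs = [[[10.0, 0.0], [0.0, 10.0]], [[2.0, 0.0], [0.0, 2.0]]]
total, mu, sigma = collapse(weights, means, covs)
print(total, mu, sigma)
```

The collapsed Gaussian has the same first and second moments as the mixture, but it is not the same distribution; this is the sense in which the later slides call the result a weak marginal.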
Multiplication and Marginalization
Multiplication
• Convert all functions to canonical form if necessary.
• Extend all functions to the same variables.
• (g1,h1,K1)*(g2,h2,K2) = (g1+g2,h1+h2,K1+K2)
Marginalization
• Strong marginal when marginalizing continuous variables
• Weak marginal when marginalizing discrete variables
Problem while using this marginalization in bucket elimination
• Requires computing ∑ and μ, which is not possible due to non-existence of the inverse.
• Solution: Use an ordering such that you never have to marginalize out discrete variables from a function that has both discrete and continuous Gaussian variables.
• Special case: Compute the marginal at a discrete node.
• Homework: Derive a bucket elimination algorithm for computing the marginal of a continuous variable.
Special case: a marginal on a discrete variable in a CLG is to be computed. B, C and D are continuous variables and A and E are discrete.
Query: P(a)
bucket B: P(b|a,e) P(d|b,a) P(d|b,c)
bucket C: P(c|a) $h^B(a,d,c,e)$
bucket D: $h^C(a,d,e)$
bucket E: P(e) $h^D(a,e)$
bucket A: P(a) $h^E(a)$
W*=4 "induced width" (max clique size)
Complexity of the special case
• Discrete-width (wd): Maximum number of discrete variables in a clique
• Continuous-width (wc): Maximum number of continuous variables in a clique
• Time: O(exp(wd)+wc^3)
• Space: O(exp(wd)+wc^3)
Algorithm for the general case: Computing Belief at a continuous node of a CLG
• Convert all functions to canonical form.
• Create a special tree-decomposition.
• Assign functions to appropriate cliques (same as assigning functions to buckets).
• Select a Strong Root.
• Perform message passing.
Creating a Special-tree decomposition
• Moralize the Bayesian Network.
• Select an ordering such that all continuous variables are ordered before discrete variables (this increases the induced width).
Elimination order
[Figure: graph over the nodes w, x, y and z; the moralized graph has one additional edge, marked in the figure.]
Strong elimination order:
• First eliminate continuous variables
• Eliminate a discrete variable when no continuous variables are available
W and X are discrete variables and Y and Z are continuous.
Elimination order (1) to (4)
[Figure sequence: the graph over w, x, y and z, where each node is labeled dim: 2 until it is eliminated. z is eliminated first (1), then y (2), then w (3), then x (4).]
Cliques 1: {w, y, z}
Cliques 2: {w, x, y}
separator: {w, y}
Bucket tree or Junction tree (1)
[Figure: Cliques 1 {w, y, z}, separator {w, y}, Cliques 2 {w, x, y} (root).]
Assigning Functions to cliques
Select a function and place it in an arbitrary clique that mentions all variables in the function.
Strong Root
We define a strong root as any node R in the bucket-tree which satisfies the following property: for any pair (V,W) which are neighbors on the tree with W closer to R than V, we have
$(V \setminus W) \subseteq \Gamma$ or $(V \cap W) \subseteq \Delta$,
where $\Gamma$ is the set of continuous variables and $\Delta$ is the set of discrete variables.
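The strong root property can be checked mechanically. The sketch below is illustrative: the function name `is_strong_root` and the data layout (clique ids mapped to sets of variable names) are assumptions, and every variable not listed as discrete is treated as continuous. The example tree uses two cliques {w, y, z} and {w, x, y} with w and x discrete.

```python
def is_strong_root(edges, cliques, root, discrete):
    """Check that for every neighboring pair (V, W) with W closer to the root,
    V \\ W contains only continuous variables or V & W contains only discrete ones."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    # Breadth-first search from the root records, for each clique, its neighbor toward the root.
    parent = {root: None}
    queue = [root]
    while queue:
        w = queue.pop(0)
        for v in adj.get(w, []):
            if v not in parent:
                parent[v] = w
                queue.append(v)
    for v, w in parent.items():
        if w is None:
            continue
        V, W = cliques[v], cliques[w]
        if not ((V - W).isdisjoint(discrete) or (V & W) <= discrete):
            return False
    return True


cliques = {1: {"w", "y", "z"}, 2: {"w", "x", "y"}}
edges = [(1, 2)]
discrete = {"w", "x"}
print(is_strong_root(edges, cliques, root=2, discrete=discrete))
print(is_strong_root(edges, cliques, root=1, discrete=discrete))
```

With clique 2 as the root, the message from clique 1 only removes the continuous variable z, so the property holds; with clique 1 as the root, the message from clique 2 would have to remove the discrete variable x across a separator containing the continuous variable y.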
[Figure: example of a tree with a Strong Root.]
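Message passing at a typical node
• Node "a" contains the functions assigned to it according to the tree-decomposition scheme, denoted by $p_j(a)$.
• Node "a" receives the messages $h_{(x_1,a)}(sep(x_1,a)), \ldots, h_{(x_n,a)}(sep(x_n,a))$ from its neighbors $x_1, \ldots, x_n$ other than b.
• Node "a" sends to its neighbor b the message
$h_{(a,b)}(sep(a,b)) = \sum_{a \setminus sep(a,b)} \prod_j p_j(a) \prod_{i \neq b} h_{(i,a)}(sep(i,a))$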
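Message Passing
[Figure from P. Green: Collect (messages flow toward the root) and Distribute (messages flow away from the root).]
Two pass algorithm: Bucket-tree propagation
Let's look at the messages
[Figure: Collect Evidence. Messages sent toward the Strong Root: ∫C, ∫L, ∫Mout, ∫Min, ∫D, ∫D.]
[Figure: Distribute Evidence. Messages sent away from the Strong Root: ∫E∑W,B, ∫E∑W,B, ∑W, ∫E∑B, ∑F.]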
Lauritzen's theorem
When message passing is performed such that collect evidence contains only strong marginals and distribute evidence may contain weak marginals, the junction-tree algorithm is exact in the sense that the first moments (means) and second moments (variances) computed are the true moments.
Complexity
• Polynomial in the number of continuous variables in a clique (n^3)
• Exponential in the number of discrete variables in a clique
Possible options for approximation
• Ignore the strong root assumption and use an approximation like MBTE, IJGP, Sampling
• Respect the strong root assumption and use an approximation like MBTE, IJGP, Sampling. Inaccuracies are then only due to discrete variables if done in one pass of MBTE.
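Initialization (1)
[Figure: network over w, x, y and z, each labeled dim: 2; w and x are discrete, y and z are continuous.]
P(w=0) = 0.5, P(w=1) = 0.5
P(x=0) = 0.4, P(x=1) = 0.6
X=0: p(y) = N(y; [-0.2 1.2]', [10 0; 0 10])
X=1: p(y) = N(y; [1.0 -1.0]', [2 0; 0 2])
W=0: p(z|y) = N(z; [-0.5 0.5]' + [0.9 0.3; -0.7 0.5] y, [9 0; 0 9])
W=1: p(z|y) = N(z; [-0.2 -0.5]' + [-0.3 0.2; 0.7 -0.5] y, [2 0; 0 3])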
Initialization (2)
Tree: Cliques 1 (wyz), separator (wy), Cliques 2 (wxy, root)
All functions are converted to canonical form.
w=0: g = log(0.5), h = [], K = []
w=1: g = log(0.5), h = [], K = []
x=0: g = log(0.4), h = [], K = []
x=1: g = log(0.6), h = [], K = []
p(y|x), over y:
X=0: g = -4.2145, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
X=1: g = -3.0310, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
p(z|y,w), over (y, z):
W=0: g = -4.0629, h = [0.0889 -0.0111 -0.0556 0.0556]', K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]
W=1: g = -2.7854, h = [0.0867 -0.0633 -0.1000 -0.1667]', K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]
p(w) p(z|y,w), assigned to Cliques 1:
W=0: g = -4.7560, with h and K as for p(z|y, W=0)
W=1: g = -3.4786, with h and K as for p(z|y, W=1)
Initialization (3)
Cliques 1 (wyz):
W=0: g = -4.7560, h = [0.0889 -0.0111 -0.0556 0.0556]', K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]
W=1: g = -3.4786, h = [0.0867 -0.0633 -0.1000 -0.1667]', K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]
Cliques 2 (wxy, root):
wx=00: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=10: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=01: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
wx=11: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
Separator (wy): empty
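Message Passing
Tree: Cliques 1 (wyz), separator (wy, initially empty), Cliques 2 (wxy, root)
Collect evidence:
$\phi^*(wy) = \phi(wyz)^{\downarrow wy}$
$\phi^*(wxy) = \phi(wxy)\,\frac{\phi^*(wy)}{\phi(wy)}$
Distribute evidence:
$\phi^{**}(wy) = \phi^*(wxy)^{\downarrow wy}$
$\phi^{**}(wyz) = \phi(wyz)\,\frac{\phi^{**}(wy)}{\phi^*(wy)}$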
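Collect evidence (1)
Tree: Cliques 1 (wyz), separator (wy, empty), Cliques 2 (wxy, root)
Marginalization in canonical form. Partition
$y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$, $h = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$, $K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}$
Then $\int \phi([y_1^T\ y_2^T]^T; g, h, K)\, dy_1 = \phi(y_2; \hat g, \hat h, \hat K)$ with
$\hat g = g + \frac{1}{2}\left(p \log(2\pi) - \log|K_{11}| + h_1^T K_{11}^{-1} h_1\right)$
$\hat h = h_2 - K_{21} K_{11}^{-1} h_1$
$\hat K = K_{22} - K_{21} K_{11}^{-1} K_{12}$
where p is the dimension of $y_1$.
[Figure: cliques y1y2 and y2y3 with separator y2; the function over (y1,y2) is marginalized to a function over (y2).]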
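As a numerical illustration of this marginalization, the sketch below builds the Cliques 1 potential for W=0 and integrates out z. The parameters a, B and s are assumptions, reconstructed so that they reproduce the g, h and K values printed for Cliques 1; the block formula is applied one scalar variable at a time, which gives the same result as eliminating the whole block.

```python
import math


def eliminate(g, h, K, i):
    """Integrate scalar variable i out of the canonical form (g, h, K)."""
    k = K[i][i]  # must be positive for the integral to exist
    keep = [j for j in range(len(h)) if j != i]
    g_new = g + 0.5 * (math.log(2 * math.pi) - math.log(k) + h[i] ** 2 / k)
    h_new = [h[j] - K[j][i] * h[i] / k for j in keep]
    K_new = [[K[j][l] - K[j][i] * K[i][l] / k for l in keep] for j in keep]
    return g_new, h_new, K_new


# Assumed parameters: z | y ~ N(a + B y, diag(s)) and p(w=0) = 0.5.
a = [-0.5, 0.5]
B = [[0.9, 0.3], [-0.7, 0.5]]
s = [9.0, 9.0]
log_p_w = math.log(0.5)

# Canonical form over (y1, y2, z1, z2).
K = [[0.0] * 4 for _ in range(4)]
h = [0.0] * 4
for i in range(2):
    for j in range(2):
        K[i][j] = sum(B[k][i] * B[k][j] / s[k] for k in range(2))
    for k in range(2):
        K[i][2 + k] = K[2 + k][i] = -B[k][i] / s[k]
    h[i] = -sum(B[k][i] * a[k] / s[k] for k in range(2))
for k in range(2):
    K[2 + k][2 + k] = 1.0 / s[k]
    h[2 + k] = a[k] / s[k]
g = log_p_w - 0.5 * sum(a[k] ** 2 / s[k] + math.log(2 * math.pi * s[k])
                        for k in range(2))

# Eliminate z2 (index 3), then z1 (index 2 of what remains).
g_hat, h_hat, K_hat = eliminate(*eliminate(g, h, K, 3), 2)
print(g_hat, h_hat, K_hat)
```

The message that remains over y should have g equal to log(0.5) and h and K equal to zero up to floating-point round-off, since integrating z out of p(w) p(z|y,w) leaves only p(w).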
Collect evidence (2)
Marginalization: z is marginalized out of Cliques 1 (wyz) to obtain the message on the separator (wy).
Cliques 1 (wyz):
W=0: g = -4.7560, h = [0.0889 -0.0111 -0.0556 0.0556]', K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]
W=1: g = -3.4786, h = [0.0867 -0.0633 -0.1000 -0.1667]', K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]
Separator (wy) after marginalization:
W=0: g = -0.6931, h = [0.1388 0]' * 1.0e-16, K = [0.2776 -0.0694; 0.0347 0] * 1.0e-16
W=1: g = -0.6931, h = [0 0]', K = [0 0; 0 0]
The entries of order 1.0e-16 are floating-point round-off: h and K are effectively zero, and g = log(0.5).
Collect evidence (3)
Multiplication: the separator (wy) message is multiplied into Cliques 2 (wxy, root).
Separator (wy):
W=0: g = -0.6931, h = [0.1388 0]' * 1.0e-16, K = [0.2776 -0.0694; 0.0347 0] * 1.0e-16
W=1: g = -0.6931, h = [0 0]', K = [0 0; 0 0]
Cliques 2 before multiplication:
wx=00: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=10: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=01: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
wx=11: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
Cliques 2 after multiplication:
wx=00: g = -5.8239, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=10: g = -5.8239, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=01: g = -4.2350, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
wx=11: g = -4.2350, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
Distribute evidence (1)
Division: the Cliques 1 (wyz) potential is divided by the separator (wy) potential computed during collect evidence, that is, the separator's g, h and K are subtracted.
Cliques 1 (wyz):
W=0: g = -4.7560, h = [0.0889 -0.0111 -0.0556 0.0556]', K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]
W=1: g = -3.4786, h = [0.0867 -0.0633 -0.1000 -0.1667]', K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]
Separator (wy):
W=0: g = -0.6931, h = [0.1388 0]' * 1.0e-16, K = [0.2776 -0.0694; 0.0347 0] * 1.0e-16
W=1: g = -0.6931, h = [0 0]', K = [0 0; 0 0]
Distribute evidence (2)
Cliques 1 (wyz) after the division:
W=0: g = -4.0629, h = [0.0889 -0.0111 -0.0556 0.0556]', K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]
W=1: g = -2.7854, h = [0.0867 -0.0633 -0.1000 -0.1667]', K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]
Distribute evidence (3)
Cliques 2 (wxy, root):
wx=00: g = -5.8239, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=10: g = -5.8239, h = [-0.02 0.12]', K = [0.1 0; 0 0.1]
wx=01: g = -4.2350, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
wx=11: g = -4.2350, h = [0.5 -0.5]', K = [0.5 0; 0 0.5]
Marginalize over x (weak marginal), giving the separator (wy) message:
w=0: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336; -0.6336 6.3616]
w=1: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336; -0.6336 6.3616]
Distribute evidence (4)
Multiplication: the separator (wy) message is converted to canonical form and multiplied into Cliques 1 (wyz).
Cliques 1 (wyz):
W=0: g = -4.0629, h = [0.0889 -0.0111 -0.0556 0.0556]', K = [0.1444 -0.0089 -0.1 0.0778; -0.0089 0.0378 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]
W=1: g = -2.7854, h = [0.0867 -0.0633 -0.1000 -0.1667]', K = [0.2083 -0.1467 0.15 -0.2333; -0.1467 0.1033 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]
Separator (wy), moment form:
w=0: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336; -0.6336 6.3616]
w=1: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336; -0.6336 6.3616]
Separator (wy), canonical form:
w=0: g = -4.3316, h = [0.0927 -0.0096]', K = [0.1824 0.0182; 0.0182 0.159]
w=1: g = -4.3316, h = [0.0927 -0.0096]', K = [0.1824 0.0182; 0.0182 0.159]
Distribute evidence (5)
Cliques 1 (wyz) after the multiplication:
W=0: g = -8.3945, h = [0.1816 -0.0207 -0.0556 0.0556]', K = [0.3268 0.0093 -0.1 0.0778; 0.0093 0.1968 -0.0333 -0.0556; -0.1 -0.0333 0.1111 0; 0.0778 -0.0556 0 0.1111]
W=1: g = -7.1170, h = [0.1793 -0.073 -0.1 -0.1667]', K = [0.3907 -0.1285 0.15 -0.2333; -0.1285 0.2623 -0.1 0.1667; 0.15 -0.1 0.5 0; -0.2333 0.1667 0 0.3333]
After Message Passing
Cliques 1 holds p(wyz), the separator holds p(wy), and Cliques 2 (root) holds p(wxy): the local marginal distributions.