Topics
1. What are graphical models (or PGMs)
2. Use of PGMs in Engineering and AI
3. Directionality in graphs
4. Bayesian Networks
5. Generative Models and Sampling
6. Using PGMs with fully Bayesian Models
7. Discrete Case
8. Complexity Issues
Discrete Variables
• Graphical models are useful when constructing more complex probability distributions from simpler (exponential family) distributions
• Graphical models have nice properties when each parent-child pair is conjugate
• Two cases of interest:
  – Both parent and child are discrete variables
  – Both parent and child are Gaussian variables
Discrete Case
• Probability distribution for a single discrete variable x having K states
• Using the 1-of-K representation
  – For K=6, when x3=1, x is represented as x=(0,0,1,0,0,0)^T
  – Note that the components satisfy \sum_{k=1}^{K} x_k = 1 (exactly one is active)
• If the probability of xk=1 is given by the parameter μk, the distribution is

  p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \quad \text{where } \boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^T

• The distribution is normalized:

  \sum_{k=1}^{K} \mu_k = 1

  – Hence only K-1 independent values of μk are needed to define the distribution

[Figure: node x with components x1,…,xK]
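As a concrete illustration (a minimal numpy sketch, not part of the original slides; the values of μ are assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6
mu = np.array([0.1, 0.1, 0.3, 0.2, 0.2, 0.1])   # K probabilities; K-1 are free
assert np.isclose(mu.sum(), 1.0)

state = rng.choice(K, p=mu)      # draw one of the K states
x = np.eye(K)[state]             # 1-of-K encoding, e.g. (0,0,1,0,0,0) for state 2

p_x = np.prod(mu ** x)           # p(x|mu) = prod_k mu_k^{x_k} = mu[state]
print(x, p_x)
```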
Two Discrete Variables
• x1 and x2, each with K states
• Denote the probability that both x1k=1 and x2l=1 by μkl
  – where x1k denotes the kth component of x1
• The joint distribution is

  p(\mathbf{x}_1, \mathbf{x}_2 \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \prod_{l=1}^{K} \mu_{kl}^{x_{1k} x_{2l}}

• Since the parameters are subject to the constraint \sum_k \sum_l \mu_{kl} = 1
  – there are K^2-1 independent parameters
• For an arbitrary distribution over M variables there are K^M-1 parameters

[Figure: component vectors x1: x11,…,x1K and x2: x21,…,x2K]
Graphical Models for Two Discrete Variables
• Joint distribution p(x1,x2)
• Using the product rule
  – it is factored as p(x2|x1)p(x1)
  – corresponding to a two-node graph x1 → x2
• The marginal distribution p(x1) has K-1 parameters
• The conditional distribution p(x2|x1) also requires K-1 parameters for each of the K values of x1
• The total number of parameters is (K-1)+K(K-1) = K^2-1
  – the same as before
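The factorization is easy to check numerically; a small sketch (an illustration, not from the slides): a marginal with K-1 free parameters plus a conditional table with K(K-1) free parameters reproduces a normalized K×K joint table:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4

p_x1 = rng.dirichlet(np.ones(K))                   # marginal: K-1 free parameters
p_x2_given_x1 = rng.dirichlet(np.ones(K), size=K)  # K rows, each K-1 free: K(K-1)

joint = p_x1[:, None] * p_x2_given_x1              # product rule: p(x1) p(x2|x1)
assert np.isclose(joint.sum(), 1.0)                # K^2 entries, K^2 - 1 free
```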
Two Independent Discrete Variables
• x1 and x2 are independent
  – The graphical model has two nodes with no link between them
• Each variable is described by a separate multinomial distribution
  – The total number of parameters is 2(K-1)
  – For M variables, the number of parameters is M(K-1)
• The number of parameters is reduced by dropping links in the graph
  – It grows linearly with the number of variables
Fully connected has high complexity
• General case of M discrete variables x1,…,xM
• If the BN is fully connected
  – it represents a completely general distribution with K^M-1 parameters
• If there are no links
  – the joint distribution factorizes into a product of marginals
  – the total number of parameters is M(K-1)
• Graphs with intermediate levels of connectivity
  – capture more general distributions than the fully factorized one
  – require fewer parameters than the general joint distribution
  – Example: a chain of nodes
Special Case: Chain of Nodes
• The marginal distribution p(x1) requires K-1 parameters
• Each of the M-1 conditional distributions p(xi|xi-1), for i=2,…,M, requires K(K-1) parameters
• The total parameter count is K-1+(M-1)K(K-1), which is quadratic in K
• It grows linearly (not exponentially) with the length of the chain, as the comparison sketch below shows
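A sanity check on the counts for the three structures discussed so far (an illustrative sketch, not from the slides):

```python
def num_parameters(M: int, K: int, structure: str) -> int:
    """Free parameters for M discrete variables, each with K states."""
    if structure == "full":          # fully connected BN: general joint
        return K**M - 1
    if structure == "independent":   # no links: product of marginals
        return M * (K - 1)
    if structure == "chain":         # x1 -> x2 -> ... -> xM
        return (K - 1) + (M - 1) * K * (K - 1)
    raise ValueError(structure)

# e.g. M=10, K=3: full=59048, independent=20, chain=56
print([num_parameters(10, 3, s) for s in ("full", "independent", "chain")])
```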
Alternative: Sharing Parameters
• Reduce the number of parameters by sharing (tying) parameters
• In the chain above,
  – all conditional distributions p(xi|xi-1), for i=2,…,M,
  – share the same set of K(K-1) parameters
• Together with the K-1 parameters governing the distribution of x1, a total of K^2-1 parameters is needed to specify the distribution
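A sketch of such a homogeneous chain (my illustration): one shared K×K transition table is reused for every conditional, so the parameter count no longer depends on M:

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 3, 8

p_x1 = rng.dirichlet(np.ones(K))        # K-1 free parameters for x1
A = rng.dirichlet(np.ones(K), size=K)   # shared transition table: K(K-1) free

x = [rng.choice(K, p=p_x1)]             # ancestral sampling along the chain
for _ in range(M - 1):
    x.append(rng.choice(K, p=A[x[-1]])) # the same table A at every step
print(x)                                # total parameters: K^2 - 1, independent of M
```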
Conversion into a Bayesian Model
• Given a graph over discrete variables
• We can turn it into a Bayesian model by introducing Dirichlet priors for the parameters
• Each discrete node acquires an additional parent representing the Dirichlet prior over its parameters
• If we tie the parameters governing the conditional distributions p(xi|xi-1), a single prior governs all of the shared parameters

[Figures: chain of nodes with priors; chain with shared (tied) parameters]
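A minimal sketch of this fully Bayesian view (hyperparameters assumed symmetric for illustration): the parameters are first drawn from their Dirichlet priors, then the chain is sampled given them; with tying, one set of prior draws serves every transition:

```python
import numpy as np

rng = np.random.default_rng(3)
K, M = 3, 8
alpha = np.ones(K)               # Dirichlet hyperparameters (assumed symmetric)

# The extra "parent" step: sample the parameters from their priors first
p_x1 = rng.dirichlet(alpha)      # prior draw for the root's parameters
A = rng.dirichlet(alpha, size=K) # one row per conditioning value, tied across i=2..M

# Then sample the chain given the sampled parameters
x = [rng.choice(K, p=p_x1)]
for _ in range(M - 1):
    x.append(rng.choice(K, p=A[x[-1]]))
print(x)
```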
Binomial: Beta Prior
• Bernoulli: p(x=1|μ) = μ
• Likelihood of the Bernoulli for D={x1,…,xN}:

  p(D \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}

• Binomial:

  \text{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}

  – Conjugate prior:

  \text{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}
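Conjugacy means the posterior is again a Beta; a scipy sketch (the hyperparameters and data are illustrative): after observing m ones in N trials, the posterior is Beta(a+m, b+N-m):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a, b = 2.0, 2.0                      # Beta prior hyperparameters (assumed)
mu_true = 0.7                        # illustrative ground truth

x = rng.random(50) < mu_true         # N=50 Bernoulli draws
m, N = int(x.sum()), x.size

posterior = stats.beta(a + m, b + N - m)   # Beta prior -> Beta posterior
print(posterior.mean())                    # approaches mu_true as N grows
```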
Multinomial: Dirichlet Prior
• Generalized Bernoulli (1-of-K), e.g. x=(0,0,1,0,0,0)^T for K=6:

  p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \quad \text{where } \boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^T

  – K=2 is the Bernoulli
• Multinomial:

  \text{Mult}(m_1, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1 m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k}

  – where the normalization coefficient

  \binom{N}{m_1 m_2 \ldots m_K} = \frac{N!}{m_1! \, m_2! \cdots m_K!}

  is the number of ways of partitioning N objects into K groups of sizes m1,…,mK
  – K=2 is the Binomial
• Conjugate prior distribution for the parameters μk:

  p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \text{where } 0 \le \mu_k \le 1 \text{ and } \sum_k \mu_k = 1

• The normalized form is the Dirichlet distribution:

  \text{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \text{where } \alpha_0 = \sum_{k=1}^{K} \alpha_k
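The K-state analogue of the Beta-Binomial update, as a numpy sketch (hyperparameters illustrative): with observed counts m=(m1,…,mK), the posterior is Dir(μ | α + m):

```python
import numpy as np

rng = np.random.default_rng(5)
K, N = 4, 100
alpha = np.ones(K)                     # Dirichlet hyperparameters (assumed)
mu_true = rng.dirichlet(alpha)

counts = rng.multinomial(N, mu_true)   # observed m = (m_1, ..., m_K), summing to N
alpha_post = alpha + counts            # conjugate update: Dir(mu | alpha + m)

print(alpha_post / alpha_post.sum())   # posterior mean, compare with mu_true
print(mu_true)
```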
Controlling the Number of Parameters: Parameterized Conditional Distributions
• Control the exponential growth of parameters in models of discrete variables
• Use parameterized models for the conditional distributions instead of complete tables of conditional probability values
Parameterized Conditional Distributions
• Consider a graph with binary variables
• Each parent variable xi is governed by a single parameter μi representing the probability p(xi=1)
  – M parameters in total for the parent nodes
• The conditional distribution p(y|x1,…,xM) requires 2^M parameters
  – one representing the probability p(y=1) for each of the 2^M settings of the parent variables
  – from (0,0,…,0) to (1,1,…,1)
Conditional Distribution Using a Logistic Sigmoid
• A parsimonious form of the conditional distribution
• Logistic sigmoid acting on a linear combination of the parent variables:

  p(y=1 \mid x_1, \ldots, x_M) = \sigma\left(w_0 + \sum_{i=1}^{M} w_i x_i\right) = \sigma(\mathbf{w}^T \mathbf{x})

  – where σ(a) = (1+exp(-a))^{-1} is the logistic sigmoid
  – x=(x0,x1,…,xM)^T is the vector of parent states augmented with x0=1
• The number of parameters grows linearly with M
• Analogous to choosing a restrictive form of covariance matrix in a multivariate Gaussian
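A sketch of this parameterization (the weights are illustrative): M+1 numbers replace a table of 2^M conditional probabilities:

```python
import numpy as np

def p_y_given_x(x: np.ndarray, w: np.ndarray) -> float:
    """p(y=1|x) = sigma(w0 + sum_i w_i x_i) for binary parents x in {0,1}^M."""
    a = w[0] + w[1:] @ x                # linear combination of parent states
    return 1.0 / (1.0 + np.exp(-a))     # logistic sigmoid

M = 5
w = np.array([-1.0, 0.5, 0.5, 1.0, -0.5, 2.0])  # M+1 parameters instead of 2^M
print(p_y_given_x(np.array([1, 0, 1, 0, 1]), w))
```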
Linear Gaussian Models
• Expressing a multivariate Gaussian as a directed graph
• Corresponds to a linear Gaussian model over the component variables
  – The mean of each conditional distribution is a linear function of the conditioning variables
• Allows interesting structure of the distribution to be expressed
  – The general Gaussian case and the diagonal covariance case represent opposite extremes
Graph with Continuous Random Variables
• Arbitrary acyclic graph over D variables
• Node i represents a single continuous random variable xi having a Gaussian distribution
• The mean of the distribution is a linear combination of the states of the parent nodes pa_i of node i:

  p(x_i \mid \text{pa}_i) = \mathcal{N}\left(x_i \,\Big|\, \sum_{j \in \text{pa}_i} w_{ij} x_j + b_i,\; v_i\right)

• where
  – wij and bi are parameters governing the mean
  – vi is the variance of the conditional distribution for xi
Joint Distribution
• The log of the joint distribution is

  \ln p(\mathbf{x}) = \sum_{i=1}^{D} \ln p(x_i \mid \text{pa}_i) = -\sum_{i=1}^{D} \frac{1}{2 v_i} \left(x_i - \sum_{j \in \text{pa}_i} w_{ij} x_j - b_i\right)^2 + \text{const}

  – where x=(x1,…,xD)^T and "const" denotes terms independent of x
• This is a quadratic function of x
  – Hence the joint distribution p(x) is a multivariate Gaussian
Mean and Covariance of the Joint Distribution
• Recursive formulation
• Since each variable xi has, conditional on the states of its parents, a Gaussian distribution, we can write

  x_i = \sum_{j \in \text{pa}_i} w_{ij} x_j + b_i + \sqrt{v_i}\, \epsilon_i

  – where εi is a zero-mean, unit-variance Gaussian random variable
  – satisfying E[εi]=0 and E[εi εj]=Iij
  – and Iij is the (i,j) element of the identity matrix
• Taking the expectation:

  \mathbb{E}[x_i] = \sum_{j \in \text{pa}_i} w_{ij} \mathbb{E}[x_j] + b_i

• Thus we can find the components of E[x]=(E[x1],…,E[xD])^T by starting at the lowest-numbered node and working recursively through the graph
• Similarly, the elements of the covariance matrix satisfy the recursion

  \text{cov}[x_i, x_j] = \sum_{k \in \text{pa}_j} w_{jk}\, \text{cov}[x_i, x_k] + I_{ij} v_j
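These recursions translate directly into code; a sketch (my illustration) for nodes indexed in topological order, with W strictly lower triangular so that parents always have lower indices:

```python
import numpy as np

def mean_and_cov(W: np.ndarray, b: np.ndarray, v: np.ndarray):
    """E[x] and cov[x] for x_i = sum_j W[i,j] x_j + b_i + sqrt(v_i) eps_i.

    W must be strictly lower triangular (each node's parents precede it)."""
    D = len(b)
    mean = np.zeros(D)
    cov = np.zeros((D, D))
    for i in range(D):                   # sweep nodes in topological order
        mean[i] = W[i, :i] @ mean[:i] + b[i]
        for j in range(i + 1):           # fill row i up to the diagonal
            cov[i, j] = W[i, :i] @ cov[:i, j] + (v[i] if i == j else 0.0)
            cov[j, i] = cov[i, j]        # covariance matrix is symmetric
    return mean, cov
```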
Three Cases for the Number of Parameters
• No links in the graph
  – 2D parameters
• Fully connected graph
  – D(D+1)/2 parameters
• Graphs with an intermediate level of complexity
  – e.g. a chain

[Figure: example graph over x1,…,x6]
Extreme Case with No Links
• D isolated nodes
  – There are no parameters wij
  – Only D parameters bi and D parameters vi
• The mean of p(x) is given by (b1,…,bD)^T
• The covariance matrix is diagonal, of the form diag(v1,…,vD)
• The joint distribution has a total of 2D parameters
• It represents a set of D independent univariate Gaussian distributions
Extreme Case with All Links
• Fully connected graph
• Each node has all lower-numbered nodes as parents
• The matrix of wij has i-1 entries in its ith row and hence is a lower-triangular matrix (with no entries on the leading diagonal)
• The total number of wij parameters is D(D-1)/2: take the D^2 elements of a D×D matrix, subtract D to account for the diagonal, and divide by 2 because only elements below the diagonal appear
Graph with Intermediate Complexity
• A three-node chain x1 → x2 → x3, with the link between x1 and x3 missing
• The mean and covariance of the joint distribution are

  \boldsymbol{\mu} = \left(b_1,\; b_2 + w_{21} b_1,\; b_3 + w_{32} b_2 + w_{32} w_{21} b_1\right)^T

  \Sigma = \begin{pmatrix} v_1 & w_{21} v_1 & w_{32} w_{21} v_1 \\ w_{21} v_1 & v_2 + w_{21}^2 v_1 & w_{32}\left(v_2 + w_{21}^2 v_1\right) \\ w_{32} w_{21} v_1 & w_{32}\left(v_2 + w_{21}^2 v_1\right) & v_3 + w_{32}^2\left(v_2 + w_{21}^2 v_1\right) \end{pmatrix}
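These closed forms can be checked against the mean_and_cov sketch given earlier (the parameter values here are illustrative):

```python
import numpy as np

# Chain x1 -> x2 -> x3, with no direct x1 - x3 link
w21, w32 = 0.8, -0.5
b = np.array([1.0, 0.0, 2.0])
v = np.array([1.0, 2.0, 0.5])
W = np.array([[0.0, 0.0, 0.0],
              [w21, 0.0, 0.0],
              [0.0, w32, 0.0]])

mean, cov = mean_and_cov(W, b, v)   # helper defined in the earlier sketch
print(mean)    # (b1, b2 + w21*b1, b3 + w32*b2 + w32*w21*b1)
print(cov)     # matches the closed-form covariance above
```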
Extension to Multivariate Gaussian Variables
• Nodes in the graph represent multivariate Gaussian variables
• Write the conditional distribution for node i in the form

  p(\mathbf{x}_i \mid \text{pa}_i) = \mathcal{N}\left(\mathbf{x}_i \,\Big|\, \sum_{j \in \text{pa}_i} \mathbf{W}_{ij} \mathbf{x}_j + \mathbf{b}_i,\; \Sigma_i\right)

• where Wij is a matrix (non-square if xi and xj have different dimensionalities)
Summary
1. PGMs allow visualizing probabilistic models
   – Joint distributions are directed/undirected PGMs
2. PGMs can be used to generate samples
   – Ancestral sampling with directed PGMs is simple
3. PGMs are useful for Bayesian statistics
   – Discrete-variable PGMs are represented using Dirichlet priors
4. Parameter explosion is controlled by tying parameters
5. A multivariate Gaussian can be expressed as a PGM
   – The graph is a linear Gaussian model over the components