Topics
1. What are graphical models (or PGMs)
2. Use of PGMs in Engineering and AI
3. Directionality in graphs
4. Bayesian Networks
5. Generative Models and Sampling
6. Using PGMs with fully Bayesian Models
7. Discrete Case
8. Complexity Issues
Discrete Variables
• Graphical models are useful when constructing more complex probability distributions from simpler (exponential family) distributions
• Graphical models have nice properties when each parent-child pair is conjugate
• Two cases of interest:
  – Both parent and child are discrete variables
  – Both parent and child are Gaussian variables
Discrete Case
• Probability distribution for a single discrete variable x having K states
• Using the 1-of-K representation
  – For K=6, when x3=1, x is represented as x=(0,0,1,0,0,0)^T
  – Note that the components satisfy \sum_{k=1}^{K} x_k = 1 (exactly one is active)
• If the probability of xk=1 is given by the parameter μk, the distribution is

  p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \quad \text{where } \boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^T

• The distribution is normalized:

  \sum_{k=1}^{K} \mu_k = 1

  – Hence only K-1 independent values of μk are needed to define the distribution

[Figure: node x with components x1,…,xK]
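As a concrete illustration (a minimal numpy sketch, not part of the original slides; the values of μ are assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6
mu = np.array([0.1, 0.1, 0.3, 0.2, 0.2, 0.1])   # K probabilities; K-1 are free
assert np.isclose(mu.sum(), 1.0)

state = rng.choice(K, p=mu)      # draw one of the K states
x = np.eye(K)[state]             # 1-of-K encoding, e.g. (0,0,1,0,0,0) for state 2

p_x = np.prod(mu ** x)           # p(x|mu) = prod_k mu_k^{x_k} = mu[state]
print(x, p_x)
```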
Two Discrete Variables
• x1 and x2, each with K states
• Denote the probability that both x1k=1 and x2l=1 by μkl
  – where x1k denotes the kth component of x1
• The joint distribution is

  p(\mathbf{x}_1, \mathbf{x}_2 \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \prod_{l=1}^{K} \mu_{kl}^{x_{1k} x_{2l}}

• Since the parameters are subject to the constraint \sum_k \sum_l \mu_{kl} = 1
  – there are K^2-1 independent parameters
• For an arbitrary distribution over M variables there are K^M-1 parameters

[Figure: component vectors x1: x11,…,x1K and x2: x21,…,x2K]
Graphical Models for Two Discrete Variables
• Joint distribution p(x1,x2)
• Using the product rule
  – it is factored as p(x2|x1)p(x1)
  – corresponding to a two-node graph x1 → x2
• The marginal distribution p(x1) has K-1 parameters
• The conditional distribution p(x2|x1) also requires K-1 parameters for each of the K values of x1
• The total number of parameters is (K-1)+K(K-1) = K^2-1
  – the same as before
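The factorization is easy to check numerically; a small sketch (an illustration, not from the slides): a marginal with K-1 free parameters plus a conditional table with K(K-1) free parameters reproduces a normalized K×K joint table:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4

p_x1 = rng.dirichlet(np.ones(K))                   # marginal: K-1 free parameters
p_x2_given_x1 = rng.dirichlet(np.ones(K), size=K)  # K rows, each K-1 free: K(K-1)

joint = p_x1[:, None] * p_x2_given_x1              # product rule: p(x1) p(x2|x1)
assert np.isclose(joint.sum(), 1.0)                # K^2 entries, K^2 - 1 free
```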
Two Independent Discrete Variables
• x1 and x2 are independent
  – The graphical model has two nodes with no link between them
• Each variable is described by a separate multinomial distribution
  – The total number of parameters is 2(K-1)
  – For M variables, the number of parameters is M(K-1)
• The number of parameters is reduced by dropping links in the graph
  – It grows linearly with the number of variables
Fully connected has high complexity
• General case of M discrete variables x1,…,xM
• If the BN is fully connected
  – it represents a completely general distribution with K^M-1 parameters
• If there are no links
  – the joint distribution factorizes into a product of marginals
  – the total number of parameters is M(K-1)
• Graphs with intermediate levels of connectivity
  – capture more general distributions than the fully factorized one
  – require fewer parameters than the general joint distribution
  – Example: a chain of nodes
Special Case: Chain of Nodes
• The marginal distribution p(x1) requires K-1 parameters
• Each of the M-1 conditional distributions p(xi|xi-1), for i=2,…,M, requires K(K-1) parameters
• The total parameter count is K-1+(M-1)K(K-1), which is quadratic in K
• It grows linearly (not exponentially) with the length of the chain, as the comparison sketch below shows
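A sanity check on the counts for the three structures discussed so far (an illustrative sketch, not from the slides):

```python
def num_parameters(M: int, K: int, structure: str) -> int:
    """Free parameters for M discrete variables, each with K states."""
    if structure == "full":          # fully connected BN: general joint
        return K**M - 1
    if structure == "independent":   # no links: product of marginals
        return M * (K - 1)
    if structure == "chain":         # x1 -> x2 -> ... -> xM
        return (K - 1) + (M - 1) * K * (K - 1)
    raise ValueError(structure)

# e.g. M=10, K=3: full=59048, independent=20, chain=56
print([num_parameters(10, 3, s) for s in ("full", "independent", "chain")])
```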
Alternative: Sharing Parameters
• Reduce the number of parameters by sharing (tying) parameters
• In the chain above,
  – all conditional distributions p(xi|xi-1), for i=2,…,M,
  – share the same set of K(K-1) parameters
• Together with the K-1 parameters governing the distribution of x1, a total of K^2-1 parameters is needed to specify the distribution
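A sketch of such a homogeneous chain (my illustration): one shared K×K transition table is reused for every conditional, so the parameter count no longer depends on M:

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 3, 8

p_x1 = rng.dirichlet(np.ones(K))        # K-1 free parameters for x1
A = rng.dirichlet(np.ones(K), size=K)   # shared transition table: K(K-1) free

x = [rng.choice(K, p=p_x1)]             # ancestral sampling along the chain
for _ in range(M - 1):
    x.append(rng.choice(K, p=A[x[-1]])) # the same table A at every step
print(x)                                # total parameters: K^2 - 1, independent of M
```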
Conversion into a Bayesian Model
• Given a graph over discrete variables
• We can turn it into a Bayesian model by introducing Dirichlet priors for the parameters
• Each discrete node acquires an additional parent representing the Dirichlet prior over its parameters
• If we tie the parameters governing the conditional distributions p(xi|xi-1), a single prior governs all of the shared parameters

[Figures: chain of nodes with priors; chain with shared (tied) parameters]
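A minimal sketch of this fully Bayesian view (hyperparameters assumed symmetric for illustration): the parameters are first drawn from their Dirichlet priors, then the chain is sampled given them; with tying, one set of prior draws serves every transition:

```python
import numpy as np

rng = np.random.default_rng(3)
K, M = 3, 8
alpha = np.ones(K)               # Dirichlet hyperparameters (assumed symmetric)

# The extra "parent" step: sample the parameters from their priors first
p_x1 = rng.dirichlet(alpha)      # prior draw for the root's parameters
A = rng.dirichlet(alpha, size=K) # one row per conditioning value, tied across i=2..M

# Then sample the chain given the sampled parameters
x = [rng.choice(K, p=p_x1)]
for _ in range(M - 1):
    x.append(rng.choice(K, p=A[x[-1]]))
print(x)
```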
Binomial: Beta Prior
• Bernoulli: p(x=1|μ) = μ
• Likelihood of the Bernoulli for D={x1,…,xN}:

  p(D \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n}

• Binomial:

  \text{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}

  – Conjugate prior:

  \text{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}
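Conjugacy means the posterior is again a Beta; a scipy sketch (the hyperparameters and data are illustrative): after observing m ones in N trials, the posterior is Beta(a+m, b+N-m):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a, b = 2.0, 2.0                      # Beta prior hyperparameters (assumed)
mu_true = 0.7                        # illustrative ground truth

x = rng.random(50) < mu_true         # N=50 Bernoulli draws
m, N = int(x.sum()), x.size

posterior = stats.beta(a + m, b + N - m)   # Beta prior -> Beta posterior
print(posterior.mean())                    # approaches mu_true as N grows
```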
Multinomial: Dirichlet Prior
• Generalized Bernoulli (1-of-K), e.g. x=(0,0,1,0,0,0)^T for K=6:

  p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \quad \text{where } \boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^T

  – K=2 is the Bernoulli
• Multinomial:

  \text{Mult}(m_1, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1 m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k}

  – where the normalization coefficient

  \binom{N}{m_1 m_2 \ldots m_K} = \frac{N!}{m_1! \, m_2! \cdots m_K!}

  is the number of ways of partitioning N objects into K groups of sizes m1,…,mK
  – K=2 is the Binomial
• Conjugate prior distribution for the parameters μk:

  p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \text{where } 0 \le \mu_k \le 1 \text{ and } \sum_k \mu_k = 1

• The normalized form is the Dirichlet distribution:

  \text{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \text{where } \alpha_0 = \sum_{k=1}^{K} \alpha_k
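The K-state analogue of the Beta-Binomial update, as a numpy sketch (hyperparameters illustrative): with observed counts m=(m1,…,mK), the posterior is Dir(μ | α + m):

```python
import numpy as np

rng = np.random.default_rng(5)
K, N = 4, 100
alpha = np.ones(K)                     # Dirichlet hyperparameters (assumed)
mu_true = rng.dirichlet(alpha)

counts = rng.multinomial(N, mu_true)   # observed m = (m_1, ..., m_K), summing to N
alpha_post = alpha + counts            # conjugate update: Dir(mu | alpha + m)

print(alpha_post / alpha_post.sum())   # posterior mean, compare with mu_true
print(mu_true)
```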
Controlling the Number of Parameters: Parameterized Conditional Distributions
• Control the exponential growth of parameters in models of discrete variables
• Use parameterized models for the conditional distributions instead of complete tables of conditional probability values
Parameterized Conditional Distributions
• Consider a graph with binary variables
• Each parent variable xi is governed by a single parameter μi representing the probability p(xi=1)
  – M parameters in total for the parent nodes
• The conditional distribution p(y|x1,…,xM) requires 2^M parameters
  – one representing the probability p(y=1) for each of the 2^M settings of the parent variables
  – from (0,0,…,0) to (1,1,…,1)
Conditional Distribution Using a Logistic Sigmoid
• A parsimonious form of the conditional distribution
• Logistic sigmoid acting on a linear combination of the parent variables:

  p(y=1 \mid x_1, \ldots, x_M) = \sigma\left(w_0 + \sum_{i=1}^{M} w_i x_i\right) = \sigma(\mathbf{w}^T \mathbf{x})

  – where σ(a) = (1+exp(-a))^{-1} is the logistic sigmoid
  – x=(x0,x1,…,xM)^T is the vector of parent states augmented with x0=1
• The number of parameters grows linearly with M
• Analogous to choosing a restrictive form of covariance matrix in a multivariate Gaussian
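A sketch of this parameterization (the weights are illustrative): M+1 numbers replace a table of 2^M conditional probabilities:

```python
import numpy as np

def p_y_given_x(x: np.ndarray, w: np.ndarray) -> float:
    """p(y=1|x) = sigma(w0 + sum_i w_i x_i) for binary parents x in {0,1}^M."""
    a = w[0] + w[1:] @ x                # linear combination of parent states
    return 1.0 / (1.0 + np.exp(-a))     # logistic sigmoid

M = 5
w = np.array([-1.0, 0.5, 0.5, 1.0, -0.5, 2.0])  # M+1 parameters instead of 2^M
print(p_y_given_x(np.array([1, 0, 1, 0, 1]), w))
```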
Linear Gaussian Models
• Expressing a multivariate Gaussian as a directed graph
• Corresponds to a linear Gaussian model over the component variables
  – The mean of each conditional distribution is a linear function of the conditioning variables
• Allows interesting structure of the distribution to be expressed
  – The general Gaussian case and the diagonal covariance case represent opposite extremes
Graph with Continuous Random Variables
• Arbitrary acyclic graph over D variables
• Node i represents a single continuous random variable xi having a Gaussian distribution
• The mean of the distribution is a linear combination of the states of the parent nodes pa_i of node i:

  p(x_i \mid \text{pa}_i) = \mathcal{N}\left(x_i \,\Big|\, \sum_{j \in \text{pa}_i} w_{ij} x_j + b_i,\; v_i\right)

• where
  – wij and bi are parameters governing the mean
  – vi is the variance of the conditional distribution for xi
Joint Distribution
• The log of the joint distribution is

  \ln p(\mathbf{x}) = \sum_{i=1}^{D} \ln p(x_i \mid \text{pa}_i) = -\sum_{i=1}^{D} \frac{1}{2 v_i} \left(x_i - \sum_{j \in \text{pa}_i} w_{ij} x_j - b_i\right)^2 + \text{const}

  – where x=(x1,…,xD)^T and "const" denotes terms independent of x
• This is a quadratic function of x
  – Hence the joint distribution p(x) is a multivariate Gaussian
Mean and Covariance of the Joint Distribution
• Recursive formulation
• Since each variable xi has, conditional on the states of its parents, a Gaussian distribution, we can write

  x_i = \sum_{j \in \text{pa}_i} w_{ij} x_j + b_i + \sqrt{v_i}\, \epsilon_i

  – where εi is a zero-mean, unit-variance Gaussian random variable
  – satisfying E[εi]=0 and E[εi εj]=Iij
  – and Iij is the (i,j) element of the identity matrix
• Taking the expectation:

  \mathbb{E}[x_i] = \sum_{j \in \text{pa}_i} w_{ij} \mathbb{E}[x_j] + b_i

• Thus we can find the components of E[x]=(E[x1],…,E[xD])^T by starting at the lowest-numbered node and working recursively through the graph
• Similarly, the elements of the covariance matrix satisfy the recursion

  \text{cov}[x_i, x_j] = \sum_{k \in \text{pa}_j} w_{jk}\, \text{cov}[x_i, x_k] + I_{ij} v_j
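These recursions translate directly into code; a sketch (my illustration) for nodes indexed in topological order, with W strictly lower triangular so that parents always have lower indices:

```python
import numpy as np

def mean_and_cov(W: np.ndarray, b: np.ndarray, v: np.ndarray):
    """E[x] and cov[x] for x_i = sum_j W[i,j] x_j + b_i + sqrt(v_i) eps_i.

    W must be strictly lower triangular (each node's parents precede it)."""
    D = len(b)
    mean = np.zeros(D)
    cov = np.zeros((D, D))
    for i in range(D):                   # sweep nodes in topological order
        mean[i] = W[i, :i] @ mean[:i] + b[i]
        for j in range(i + 1):           # fill row i up to the diagonal
            cov[i, j] = W[i, :i] @ cov[:i, j] + (v[i] if i == j else 0.0)
            cov[j, i] = cov[i, j]        # covariance matrix is symmetric
    return mean, cov
```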
Three Cases for the Number of Parameters
• No links in the graph
  – 2D parameters
• Fully connected graph
  – D(D+1)/2 parameters
• Graphs with an intermediate level of complexity
  – e.g. a chain

[Figure: example graph over x1,…,x6]
Extreme Case with No Links
• D isolated nodes
  – There are no parameters wij
  – Only D parameters bi and D parameters vi
• The mean of p(x) is given by (b1,…,bD)^T
• The covariance matrix is diagonal, of the form diag(v1,…,vD)
• The joint distribution has a total of 2D parameters
• It represents a set of D independent univariate Gaussian distributions
Extreme Case with All Links
• Fully connected graph
• Each node has all lower-numbered nodes as parents
• The matrix of wij has i-1 entries in its ith row and hence is a lower-triangular matrix (with no entries on the leading diagonal)
• The total number of wij parameters is D(D-1)/2: take the D^2 elements of a D×D matrix, subtract D to account for the diagonal, and divide by 2 because only elements below the diagonal appear
Graph with Intermediate Complexity
• A three-node chain x1 → x2 → x3, with the link between x1 and x3 missing
• The mean and covariance of the joint distribution are

  \boldsymbol{\mu} = \left(b_1,\; b_2 + w_{21} b_1,\; b_3 + w_{32} b_2 + w_{32} w_{21} b_1\right)^T

  \Sigma = \begin{pmatrix} v_1 & w_{21} v_1 & w_{32} w_{21} v_1 \\ w_{21} v_1 & v_2 + w_{21}^2 v_1 & w_{32}\left(v_2 + w_{21}^2 v_1\right) \\ w_{32} w_{21} v_1 & w_{32}\left(v_2 + w_{21}^2 v_1\right) & v_3 + w_{32}^2\left(v_2 + w_{21}^2 v_1\right) \end{pmatrix}
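These closed forms can be checked against the mean_and_cov sketch given earlier (the parameter values here are illustrative):

```python
import numpy as np

# Chain x1 -> x2 -> x3, with no direct x1 - x3 link
w21, w32 = 0.8, -0.5
b = np.array([1.0, 0.0, 2.0])
v = np.array([1.0, 2.0, 0.5])
W = np.array([[0.0, 0.0, 0.0],
              [w21, 0.0, 0.0],
              [0.0, w32, 0.0]])

mean, cov = mean_and_cov(W, b, v)   # helper defined in the earlier sketch
print(mean)    # (b1, b2 + w21*b1, b3 + w32*b2 + w32*w21*b1)
print(cov)     # matches the closed-form covariance above
```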
Extension to Multivariate Gaussian Variables
• Nodes in the graph represent multivariate Gaussian variables
• Write the conditional distribution for node i in the form

  p(\mathbf{x}_i \mid \text{pa}_i) = \mathcal{N}\left(\mathbf{x}_i \,\Big|\, \sum_{j \in \text{pa}_i} \mathbf{W}_{ij} \mathbf{x}_j + \mathbf{b}_i,\; \Sigma_i\right)

• where Wij is a matrix (non-square if xi and xj have different dimensionalities)
Summary
1. PGMs allow visualizing probabilistic models
   – Joint distributions are directed/undirected PGMs
2. PGMs can be used to generate samples
   – Ancestral sampling with directed PGMs is simple
3. PGMs are useful for Bayesian statistics
   – Discrete-variable PGMs are represented using Dirichlet priors
4. Parameter explosion is controlled by tying parameters
5. A multivariate Gaussian can be expressed as a PGM
   – The graph is a linear Gaussian model over the components