
Probabilistic Graphical Models

COMP 790-90 Seminar

Spring 2011

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Outline

Introduction
Representation
Bayesian network
Conditional Independence
Inference: Variable elimination
Learning
Markov Random Field
Clique
Pair-wise MRF
Inference: Belief Propagation
Conclusion

Introduction

Graphical Model: Probability Theory + Graph Theory

Probability theory: ensures consistency, provides an interface of models to data.

Graph theory: intuitively appealing interface for humans, efficient general-purpose algorithms.

Introduction

Modularity: a complex system is built by combining simpler parts.

Provides a natural tool for two problems: uncertainty and complexity.

Plays an important role in the design and analysis of machine learning algorithms.

Introduction

Many of the classical multivariate probabilistic systems are special cases of the general graphical model formalism:
Mixture models
Factor analysis
Hidden Markov Models
Kalman filters

The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.

Techniques that have been developed in one field can be transferred to other fields.

A framework for the design of new systems.

Representation

A graphical model represents probabilistic relationships between a set of random variables.

Variables are represented by nodes: binary events, discrete variables, continuous variables.

Conditional (in)dependence is represented by (absence of) edges.

Directed Graphical Model: Bayesian network
Undirected Graphical Model: Markov Random Field

Outline

Introduction
Representation
Bayesian network
Conditional Independence
Inference: Variable elimination
Learning
Markov Random Field
Clique
Pair-wise MRF
Inference: Belief Propagation
Conclusion

Bayesian Network

Directed acyclic graph (DAG).
Directed edges give causality relationships between variables.

For each variable X with parents pa(X) there exists a conditional probability P(X | pa(X)).
Discrete variables: Conditional Probability Table (CPT).

Description of a noisy "causal" process.

An Example: What Causes Wet Grass?

More Complex Example

Diagnose the engine start problem.

More Complex Example

Computer-based Patient Case Simulation system (CPCS-PM), developed by Parker and Miller.

422 nodes and 867 arcs: 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases.

Joint Distribution

P(X1, …, Xn)

If the variables are binary, we need O(2^n) parameters to describe P.

For the wet grass example, we need 2^4 - 1 = 15 parameters.

Can we do better?

Key idea: use properties of independence.

Independent Random Variables

X is independent of Y iff P(X = x | Y = y) = P(X = x) for all values x, y.

If X and Y are independent, then
P(X, Y) = P(X | Y) P(Y) = P(X) P(Y)
P(X1, …, Xn) = P(X1) P(X2) ⋯ P(Xn)

Unfortunately, most random variables of interest are not independent of each other (e.g., the wet grass example).

Conditional Independence

A more suitable notion is that of conditional independence.

X and Y are conditionally independent given Z iff
P(X, Y | Z) = P(X | Z) P(Y | Z)
P(X | Z, Y) = P(X | Z)

Notation: I(X, Y | Z)

The conditional independence structure in the grass example (graph C → S, C → R, S → W, R → W):
I(S, R | C)
I(C, W | S, R)
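To make the distinction concrete, here is a small numeric check, a sketch in Python that assumes the wet-grass CPT values appearing on the later inference slides: S and R are dependent marginally, but become independent once C is given.

```python
# Check I(S, R | C) numerically for the wet-grass network.
# CPT values are taken from the inference table later in the slides.
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}
P_R_given_C = {True: 0.8, False: 0.2}

def bern(p, value):
    """Probability of a binary value given P(value=True) = p."""
    return p if value else 1.0 - p

def p_src(s, r, c):
    """P(S=s, R=r, C=c) from the factorization P(C) P(S|C) P(R|C)."""
    return bern(P_C, c) * bern(P_S_given_C[c], s) * bern(P_R_given_C[c], r)

# Marginally: P(S=T, R=T) != P(S=T) P(R=T), so S and R are dependent.
p_sr = sum(p_src(True, True, c) for c in (True, False))
p_s = sum(p_src(True, r, c) for r in (True, False) for c in (True, False))
p_r = sum(p_src(s, True, c) for s in (True, False) for c in (True, False))
print(p_sr, p_s * p_r)                     # 0.09 vs 0.15

# Conditionally on C=T: P(S=T, R=T | C=T) = P(S=T|C=T) P(R=T|C=T), so I(S, R | C) holds.
print(p_src(True, True, True) / P_C, 0.1 * 0.8)   # both 0.08
```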

Conditional Independence

Directed Markov Property: each random variable X is conditionally independent of its non-descendants, given its parents Pa(X).

Formally, P(X | NonDesc(X), Pa(X)) = P(X | Pa(X))

Notation: I(X, NonDesc(X) | Pa(X))

[Figure: a node X with its parent, descendant, and non-descendant nodes Y1–Y4 labeled.]

Factorized Representation

The full joint distribution is defined in terms of local conditional distributions (obtained via the chain rule):

P(x1, …, xn) = ∏_i P(xi | pa(xi))

The graphical structure encodes conditional independences among random variables and represents the full joint distribution over the variables more compactly.

Complexity reduction:
Joint probability of n binary variables: O(2^n)
Factorized form: O(n·2^k), where k is the maximal number of parents of a node

Factorized Representation

The wet grass example:
P(C, S, R, W) = P(W|S,R) P(R|C) P(S|C) P(C)

Only need 1 + 2 + 2 + 4 = 9 parameters.
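As an illustration, the sketch below spells out those nine parameters and evaluates the factorized joint. The variable and function names are mine; the CPT numbers are the ones used in the inference table a few slides later.

```python
# Wet-grass factorization P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R) as a function.
P_C = 0.5                                        # P(C=T)                  (1 parameter)
P_S_given_C = {True: 0.1, False: 0.5}            # P(S=T | C)              (2 parameters)
P_R_given_C = {True: 0.8, False: 0.2}            # P(R=T | C)              (2 parameters)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R) (4 parameters)

def bern(p, value):
    """Probability of a binary value given P(value=True) = p."""
    return p if value else 1.0 - p

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) built from the four local conditional distributions."""
    return (bern(P_C, c) * bern(P_S_given_C[c], s) *
            bern(P_R_given_C[c], r) * bern(P_W_given_SR[(s, r)], w))

# e.g. the first row of the later inference table: 0.99 * 0.8 * 0.1 * 0.5 = 0.0396
print(joint(True, True, True, True))
```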

Inference

Computation of the conditional probability distribution of one set of nodes, given a model and another set of nodes.

Bottom-up: given observations (leaves), the probabilities of the causes can be calculated accordingly; "diagnosis" from effects to causes.

Top-down: knowledge influences the probability of the outcome; predict the effects.

Basic Computation

The value of x depends on y.
Dependency: conditional probability P(x|y)
Knowledge about y: prior probability P(y)

Product rule: P(x, y) = P(x|y) P(y)

Sum rule (marginalization): P(x) = Σ_y P(x, y),  P(y) = Σ_x P(x, y)

Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
where P(y|x) is the posterior, P(x|y) the (conditional) likelihood, and P(y) the prior.
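A tiny worked example of these three rules, with made-up numbers that are not from the slides:

```python
# Product rule, sum rule, and Bayes' rule for two binary variables x, y.
P_y = {True: 0.3, False: 0.7}                         # prior P(y)
P_x_given_y = {True: 0.9, False: 0.2}                 # likelihood P(x=True | y)

# Product rule: P(x, y) = P(x | y) P(y)
P_xy = {(x, y): (P_x_given_y[y] if x else 1 - P_x_given_y[y]) * P_y[y]
        for x in (True, False) for y in (True, False)}

# Sum rule (marginalization): P(x) = sum_y P(x, y)
P_x = {x: P_xy[(x, True)] + P_xy[(x, False)] for x in (True, False)}

# Bayes' rule: P(y=True | x=True) = P(x=True, y=True) / P(x=True)
posterior = P_xy[(True, True)] / P_x[True]
print(P_x[True], posterior)    # P(x=True) = 0.41, P(y=True | x=True) ≈ 0.659
```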

Inference: Bottom-Up

Observe: wet grass (denoted by W=T).
Two possible causes: rain or sprinkler. Which is more likely?

Apply Bayes' rule:
P(W=T) = Σ_{c,s,r} P(C=c, S=s, R=r, W=T)
       = 0.0396 + 0.009 + 0.324 + 0 + 0.0495 + 0.18 + 0.045 + 0
       = 0.6471

Inference: Bottom-Up

C S R W   P(C, S, R, W)
T T T T   0.99 * 0.8 * 0.1 * 0.5 = 0.0396
T T F T   0.9 * 0.2 * 0.1 * 0.5 = 0.009
T F T T   0.9 * 0.8 * 0.9 * 0.5 = 0.324
T F F T   0 * 0.2 * 0.9 * 0.5 = 0
F T T T   0.99 * 0.2 * 0.5 * 0.5 = 0.0495
F T F T   0.9 * 0.8 * 0.5 * 0.5 = 0.18
F F T T   0.9 * 0.2 * 0.5 * 0.5 = 0.045
F F F T   0 * 0.8 * 0.5 * 0.5 = 0

Inference: Bottom-Up

Observe: wet grass (denoted by W=T).
Two possible causes: rain or sprinkler. Which is more likely?

Apply Bayes' rule:
P(S=T | W=T) = P(S=T, W=T) / P(W=T)
             = Σ_{c,r} P(C=c, S=T, R=r, W=T) / P(W=T)
             = (0.0396 + 0.009 + 0.0495 + 0.18) / 0.6471
             = 0.2781 / 0.6471 ≈ 0.43

Inference: Bottom-Up

Observe: wet grass (denoted by W=T).
Two possible causes: rain or sprinkler. Which is more likely?

Apply Bayes' rule:
P(R=T | W=T) = P(R=T, W=T) / P(W=T)
             = Σ_{c,s} P(C=c, S=s, R=T, W=T) / P(W=T)
             = (0.0396 + 0.324 + 0.0495 + 0.045) / 0.6471
             = 0.4581 / 0.6471 ≈ 0.708

Since 0.708 > 0.43, rain is the more likely cause.
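The three numbers above (0.6471, 0.43, 0.708) can be reproduced by brute-force enumeration of the factorized joint. A minimal sketch, assuming the same CPT values as in the table two slides back; the helper names are illustrative, not the slides' code:

```python
# Bottom-up inference by enumerating the wet-grass joint (same CPTs and joint()
# as in the earlier factorization sketch).
import itertools

P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}
P_R_given_C = {True: 0.8, False: 0.2}
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}

def bern(p, value):
    return p if value else 1.0 - p

def joint(c, s, r, w):
    return (bern(P_C, c) * bern(P_S_given_C[c], s) *
            bern(P_R_given_C[c], r) * bern(P_W_given_SR[(s, r)], w))

def prob(**fixed):
    """Sum the joint over all assignments consistent with the fixed variables."""
    total = 0.0
    for c, s, r, w in itertools.product((True, False), repeat=4):
        assign = dict(C=c, S=s, R=r, W=w)
        if all(assign[k] == v for k, v in fixed.items()):
            total += joint(c, s, r, w)
    return total

p_w = prob(W=True)
print(round(p_w, 4))                          # 0.6471
print(round(prob(S=True, W=True) / p_w, 2))   # P(S=T | W=T) ≈ 0.43
print(round(prob(R=True, W=True) / p_w, 3))   # P(R=T | W=T) ≈ 0.708
```

The same prob() helper also answers the top-down query on the next slide, e.g. prob(C=True, W=True) / prob(C=True) for P(W=T | C=T).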

Inference: Top-Down

The probability that the grass will be wet given that it is cloudy:

P(W=T | C=T) = P(W=T, C=T) / P(C=T)
             = Σ_{S,R} P(C=T, S, R, W=T) / Σ_{S,R,W} P(C=T, S, R, W)

Inference Algorithms

Exact inference in general graphical models is NP-hard.

Exact inference:
Variable elimination
Message passing algorithm
Clustering and junction tree approach

Approximate inference:
Loopy belief propagation
Sampling (Monte Carlo) methods
Variational methods

Variable Elimination

Computing P(W=T).

Approach 1: blind approach.
Sum out all un-instantiated variables from the full joint.
Computation cost: O(2^n)

For the wet grass example:
Number of additions: 14
Number of products: ?

Solution: exploit the graph structure.

Variable Elimination

Approach 2: interleave sums and products.
The key idea is to push sums in as far as possible.

In the computation:
First compute: …
Then compute: …
And so on.

Computation cost: O(n·2^k)

For the wet grass example:
Number of additions: ?
Number of products: ?
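A sketch of this interleaving for P(W=T) on the wet-grass network, using the same assumed CPT values as before. The elimination order C, then R, then S is one natural choice; the slide's intended order may differ.

```python
# Variable elimination for P(W=T): sums are pushed inside the products instead of
# summing the full joint table.
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}
P_R_given_C = {True: 0.8, False: 0.2}
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}

def bern(p, value):
    return p if value else 1.0 - p

# Step 1: eliminate C, producing a factor f1(s, r) = sum_c P(c) P(s|c) P(r|c)
f1 = {(s, r): sum(bern(P_C, c) * bern(P_S_given_C[c], s) * bern(P_R_given_C[c], r)
                  for c in (True, False))
      for s in (True, False) for r in (True, False)}

# Step 2: eliminate R, producing f2(s) = sum_r P(W=T | s, r) f1(s, r)
f2 = {s: sum(P_W_given_SR[(s, r)] * f1[(s, r)] for r in (True, False))
      for s in (True, False)}

# Step 3: eliminate S
p_w = sum(f2[s] for s in (True, False))
print(round(p_w, 4))    # 0.6471, the same answer as blind enumeration
```

For this 4-variable network the saving over blind enumeration is small, but the gap between O(n·2^k) and O(2^n) grows quickly with n.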

Learning

Learning

Learn parameters or structure from data.
Structure learning: find the correct connectivity between existing nodes.
Parameter learning: find maximum likelihood estimates of the parameters of each conditional probability distribution.
A lot of knowledge (structures and probabilities) comes from domain experts.

Learning

Structure   Observation   Method
Known       Full          Maximum Likelihood (ML) estimation
Known       Partial       Expectation Maximization algorithm (EM)
Unknown     Full          Model selection
Unknown     Partial       EM + model selection

Model Selection Method

Select a "good" model from all possible models and use it as if it were the correct model.

Having defined a scoring function, a search algorithm is then used to find a network structure that receives the highest score, fitting the prior knowledge and data.

Unfortunately, the number of DAGs on n variables is super-exponential in n. The usual approach is therefore to use local search algorithms (e.g., greedy hill climbing) to search through the space of graphs.

EM Algorithm

Expectation (E) step: use the current parameters to estimate the unobserved data.

Maximization (M) step: use the estimated data to do ML/MAP estimation of the parameters.

Repeat the EM steps until convergence.
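A stripped-down illustration of the E and M steps, with a setting and data invented for illustration: in the wet-grass network, suppose Cloudy is never observed, the other CPTs are kept fixed at their known values, and we re-estimate only θ = P(C=T) from a handful of (S, R) observations. This is not the general algorithm for Bayesian networks, just the loop structure on one hidden variable.

```python
# EM for a single parameter: theta = P(C=T), with C hidden and (S, R) observed.
P_S_given_C = {True: 0.1, False: 0.5}      # kept fixed
P_R_given_C = {True: 0.8, False: 0.2}      # kept fixed

def bern(p, value):
    return p if value else 1.0 - p

# Invented data: observed (s, r) pairs, Cloudy unobserved in every case.
data = [(True, True), (False, False), (False, True), (True, False), (False, False)]

theta = 0.5                                # initial guess for P(C=T)
for step in range(100):
    # E-step: responsibility P(C=T | s, r) for each case under the current theta
    resp = []
    for s, r in data:
        num = theta * bern(P_S_given_C[True], s) * bern(P_R_given_C[True], r)
        den = num + (1 - theta) * bern(P_S_given_C[False], s) * bern(P_R_given_C[False], r)
        resp.append(num / den)
    # M-step: ML re-estimate of theta from the expected counts
    new_theta = sum(resp) / len(resp)
    if abs(new_theta - theta) < 1e-8:      # stop at convergence
        break
    theta = new_theta

print(round(theta, 3))
```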

Outline

Introduction
Representation
Bayesian network
Conditional Independence
Inference
Learning
Markov Random Field
Clique
Pair-wise MRF
Inference: Belief Propagation
Conclusion

Markov Random Fields

Undirected edges simply give correlations between variables.

The joint distribution is a product of local functions over the cliques of the graph:

P(x) = (1/Z) ∏_C P_C(x_C)

where P_C(x_C) are the clique potentials, and Z is a normalization constant.

Example (undirected graph over x, y, z, w):
P(x, y, z, w) = (1/Z) P_A(x, y, w) P_B(x, y, z)
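A minimal sketch of this definition for the four-node example above; the potential tables are random placeholders, and Z is obtained by brute-force summation.

```python
# Joint of the small undirected example: P(x,y,z,w) = (1/Z) P_A(x,y,w) P_B(x,y,z),
# with arbitrary non-negative potential tables over binary variables.
import itertools
import numpy as np

rng = np.random.default_rng(1)
P_A = rng.random((2, 2, 2))            # clique potential over (x, y, w)
P_B = rng.random((2, 2, 2))            # clique potential over (x, y, z)

# Normalization constant: sum of the unnormalized product over all assignments.
Z = sum(P_A[x, y, w] * P_B[x, y, z]
        for x, y, z, w in itertools.product((0, 1), repeat=4))

def joint(x, y, z, w):
    return P_A[x, y, w] * P_B[x, y, z] / Z

# Sanity check: the normalized joint sums to 1.
print(sum(joint(x, y, z, w) for x, y, z, w in itertools.product((0, 1), repeat=4)))
```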

The Clique

A clique: a set of variables which are the arguments of a local function.

The order of a clique: the number of variables in the clique.

Example:
P(x1, …, x5) = P_A(x1) P_B(x2) P_C(x1, x2, x3) P_D(x3, x4) P_E(x3, x5)
P_A and P_B are first-order cliques, P_C is a third-order clique, and P_D and P_E are second-order cliques.

Regular and Arbitrary Graph

Pair-wise MRF

The order of cliques is at most two.

Commonly used in computer vision applications: infer the underlying unknown variables through local observations and the smoothness prior.

[Figure: a 3×3 grid pair-wise MRF. Observed image nodes o1–o9 are connected to the underlying truth nodes i1–i9 by local evidence functions φx(ix); neighboring hidden nodes are connected by compatibility functions ψxy(ix, iy).]

Pair-wise MRF

[Figure: the same 3×3 grid pair-wise MRF of observed nodes o1–o9 and hidden nodes i1–i9.]

ψxy(ix, iy) is an nx × ny matrix, and φx(ix) is a vector of length nx, where nx is the number of states of ix.

Pair-wise MRF

[Figure: the same 3×3 grid pair-wise MRF.]

Given all the evidence nodes yi, we want to find the most likely state for all the hidden nodes xi, which is equivalent to maximizing

P({x}) = (1/Z) ∏_{ij} ψij(xi, xj) ∏_i φi(xi)

Belief Propagation

[Figure: the same 3×3 grid pair-wise MRF.]

Beliefs are used to approximate this probability:

b_x(ix) ∝ φx(ix) ∏_z m_{z→x}(ix)

m_{x→y}(iy) = Σ_{ix} φx(ix) ψxy(ix, iy) ∏_{z∈N(x)\y} m_{z→x}(ix)

where the products run over the neighbors z of x (excluding y in the message update).

Belief Propagation

[Figure: node i5 with its grid neighbors i2, i4, i6, i8 and observation o5; the messages m_{2→5}, m_{4→5}, m_{6→5}, m_{8→5} flow into i5.]

Beliefs are used to approximate this probability:

b_5(i5) ∝ φ5(i5) m_{2→5}(i5) m_{4→5}(i5) m_{6→5}(i5) m_{8→5}(i5)

Belief Propagation

[Figure: the message from i4 to i5 is assembled from φ4(i4), the edge potential ψ45(i4, i5), and the messages arriving at i4 from its other neighbors i1 and i7.]

b_5(i5) ∝ φ5(i5) m_{2→5}(i5) m_{4→5}(i5) m_{6→5}(i5) m_{8→5}(i5)

m_{4→5}(i5) = Σ_{i4} φ4(i4) ψ45(i4, i5) m_{1→4}(i4) m_{7→4}(i4)
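Written out with vectors and matrices, this single message update is just an element-wise product followed by a matrix-vector product. A sketch with made-up two-state numbers:

```python
# One message update m_{4->5} from the slide, as vector/matrix operations.
# phi_4, m_1to4, m_7to4 and psi_45 are illustrative numbers, not from the slides.
import numpy as np

phi_4  = np.array([0.7, 0.3])          # local evidence at node 4
m_1to4 = np.array([0.6, 0.4])          # messages into node 4 from neighbors other than 5
m_7to4 = np.array([0.5, 0.5])
psi_45 = np.array([[0.9, 0.1],         # psi_45[i_4, i_5]: compatibility of neighboring states
                   [0.1, 0.9]])

# m_{4->5}(i_5) = sum_{i_4} phi_4(i_4) psi_45(i_4, i_5) m_{1->4}(i_4) m_{7->4}(i_4)
m_4to5 = psi_45.T @ (phi_4 * m_1to4 * m_7to4)
m_4to5 /= m_4to5.sum()                 # optional normalization for numerical stability
print(m_4to5)
```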

Belief Propagation

Given φx(ix) and ψxy(ix, iy):
1. For every node ix, compute the message m_{z→x}(ix) for each neighbor iz.
2. Check whether the beliefs bx(ix) converge; if not, repeat step 1.
3. Once converged, compute bx(ix) and output the most likely state for every node ix.
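Putting the pieces together, here is a small loopy belief propagation sketch for a 3×3 pair-wise MRF like the one in the figures. The potentials are synthetic (a random φ per node and a smoothness-favoring ψ), messages are updated synchronously and normalized, and the MAP state is read off each belief; this illustrates the scheme on the slides, not code from the original application.

```python
# Loopy belief propagation on a 3x3 pair-wise MRF, following the message and
# belief equations on the preceding slides.
import numpy as np

H, W, K = 3, 3, 2                      # grid height, width, and number of states per node
rng = np.random.default_rng(0)

nodes = [(r, c) for r in range(H) for c in range(W)]

def neighbors(r, c):
    return [(r + dr, c + dc) for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < H and 0 <= c + dc < W]

phi = {x: rng.random(K) + 0.1 for x in nodes}   # local evidence (synthetic)
psi = np.array([[2.0, 1.0],
                [1.0, 2.0]])                    # favors equal neighboring states (smoothness)

# messages m[(x, y)]: message from node x to neighboring node y, initialized uniformly
m = {(x, y): np.ones(K) / K for x in nodes for y in neighbors(*x)}

for _ in range(100):
    new_m = {}
    for (x, y) in m:
        # m_{x->y}(i_y) = sum_{i_x} phi_x(i_x) psi(i_x, i_y) prod_{z in N(x)\y} m_{z->x}(i_x)
        incoming = np.prod([m[(z, x)] for z in neighbors(*x) if z != y], axis=0)
        msg = psi.T @ (phi[x] * incoming)
        new_m[(x, y)] = msg / msg.sum()         # normalize to avoid underflow
    delta = max(np.abs(new_m[k] - m[k]).max() for k in m)
    m = new_m
    if delta < 1e-6:                            # stop once messages have converged
        break

# b_x(i_x) ∝ phi_x(i_x) prod_{z in N(x)} m_{z->x}(i_x); output the most likely state
for x in nodes:
    b = phi[x] * np.prod([m[(z, x)] for z in neighbors(*x)], axis=0)
    b /= b.sum()
    print(x, "belief", np.round(b, 3), "MAP state", int(b.argmax()))
```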

Application: Learning-Based Image Super-Resolution

Extrapolate higher-resolution images from low-resolution inputs.

The basic assumption: there are correlations between low-frequency and high-frequency information.

A node corresponds to an image patch:
φx(xp): the probability of the high-frequency content given the observed low-frequency content.
ψxy(xp, xq): the smoothness prior between neighboring patches.

Image Super-Resolution

(a) Images from a "generic" example set. (b) Input (magnified ×4). (c) Cubic spline. (d) Super-resolution result. (e) Actual full-resolution image.

Conclusion

A graphical representation of the probabilistic structure of a set of random variables, along with functions that can be used to derive the joint probability distribution.

Intuitive interface for modeling.

Modular: a useful tool for managing complexity.

Common formalism for many models.

References

Kevin Murphy, Introduction to Graphical Models, Technical Report, May 2001.
M. I. Jordan, Learning in Graphical Models, MIT Press, 1999.
Yijuan Lu, Introduction to Graphical Models, http://www.cs.utsa.edu/~danlo/teaching/cs7123/Fall2005/Lyijuan.ppt.
Milos Hauskrecht, Probabilistic Graphical Models, http://www.cs.pitt.edu/~milos/courses/cs3710/Lectures/Class3.pdf.
P. Smyth, Belief networks, hidden Markov models, and Markov random fields: a unifying view, Pattern Recognition Letters, 1998.
F. R. Kschischang, B. J. Frey, and H. A. Loeliger, Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory, February 2001.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, Understanding Belief Propagation and Its Generalizations, IJCAI 2001 Distinguished Lecture track.
William T. Freeman, Thouis R. Jones, and Egon C. Pasztor, Example-based super-resolution, IEEE Computer Graphics and Applications, March/April 2002.
W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, Learning Low-Level Vision, International Journal of Computer Vision, 40(1), pp. 25-47, 2000.