
Lecture11 xing



Page 1: Lecture11 xing

Machine Learning

Exact Inference

Eric Xing

Lecture 11, August 14, 2010

[Figure: the example graph at successive stages of elimination, producing intermediate messages m_b, m_c, m_d, m_e, m_f, m_g, m_h]

Reading:

Page 2: Lecture11 xing

Inference and Learning

We now have compact representations of probability distributions: GMs

A BN M describes a unique probability distribution P

Typical tasks:

Task 1: How do we answer queries about P?

We use inference as a name for the process of computing answers to such queries

Task 2: How do we estimate a plausible model M from data D?

i. We use learning as a name for the process of obtaining a point estimate of M.

ii. But Bayesians seek p(M | D), which is actually an inference problem.

iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.

Page 3: Lecture11 xing

Inferential Query 1: Likelihood

Most of the queries one may ask involve evidence

Evidence x_v is an assignment of values to a set X_v of nodes in the GM over variable set X = {X_1, X_2, …, X_n}

Without loss of generality, X_v = {X_{k+1}, …, X_n}

Write X_H = X \ X_v as the set of hidden variables; X_H can be ∅ or X

Simplest query: compute probability of evidence

P(x_v) = Σ_{x_1} … Σ_{x_k} P(x_1, …, x_k, x_v) = Σ_{x_H} P(X_H = x_H, X_v = x_v)

this is often referred to as computing the likelihood of xv
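As a concrete (toy) illustration of this query, the sketch below computes the likelihood by brute-force summation over the hidden variables of a small joint table. The joint table, the variable names, and the evidence value are made up for illustration; they are not from the lecture.

import numpy as np

# Hypothetical joint P(X1, X2, X3) over binary variables, stored as a 2x2x2 table
# (axes ordered X1, X2, X3; the numbers are made up and sum to 1).
joint = np.array([[[0.06, 0.04], [0.10, 0.05]],
                  [[0.15, 0.10], [0.20, 0.30]]])

x3 = 1                                   # evidence: X_v = {X3}, observed X3 = 1
likelihood = joint[:, :, x3].sum()       # P(X3 = 1) = sum over all assignments of the hidden X1, X2
print("P(X3 = 1) =", likelihood)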

Page 4: Lecture11 xing

Inferential Query 2: Conditional Probability

Often we are interested in the conditional probability distribution of a variable given the evidence

P(X_H | X_v = x_v) = P(X_H, x_v) / P(x_v) = P(X_H, x_v) / Σ_{x_H} P(X_H = x_H, x_v)

this is the a posteriori belief in X_H, given evidence x_v

We usually query a subset Y of all hidden variables X_H = {Y, Z} and "don't care" about the remaining Z:

P(Y | x_v) = Σ_z P(Y, Z = z | x_v)


the process of summing out the "don't care" variables z is called marginalization, and the resulting P(Y|xv) is called a marginal prob.

Page 5: Lecture11 xing

Applications of a posteriori Belief

Prediction: what is the probability of an outcome given the starting condition?
the query node is a descendant of the evidence

Diagnosis: what is the probability of disease/fault given symptoms?
the query node is an ancestor of the evidence

Learning under partial observation
fill in the unobserved values under an "EM" setting (more later)

The directionality of information flow between variables is not restricted by the directionality of the edges in a GM: probabilistic inference can combine evidence from all parts of the network

Page 6: Lecture11 xing

Inferential Query 3: Most Probable Assignment

In this query we want to find the most probable joint assignment (MPA) for some variables of interest

Such reasoning is usually performed under some given evidence x_v, and ignoring (the values of) other variables Z:

Y* | x_v = argmax_y P(Y = y | x_v) = argmax_y Σ_z P(Y = y, Z = z | x_v)

this is the maximum a posteriori configuration of Y.
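Continuing the same kind of toy setup (hypothetical numbers, not from the slides), the MPA of a query variable Y can be read off by summing out the "don't care" variable Z and maximizing over Y:

import numpy as np

# Hypothetical joint P(Y, Z, Xv) over binary variables (axes ordered Y, Z, Xv; values invented).
joint = np.array([[[0.06, 0.04], [0.10, 0.05]],
                  [[0.15, 0.10], [0.20, 0.30]]])

xv = 1                                    # observed evidence value
p_y = joint[:, :, xv].sum(axis=1)         # sum out Z: unnormalized P(Y, xv)
p_y = p_y / p_y.sum()                     # normalize by P(xv) to get P(Y | xv)
y_star = int(np.argmax(p_y))              # maximum a posteriori configuration of Y
print("P(Y | xv) =", p_y, "  Y* =", y_star)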


Page 7: Lecture11 xing

Complexity of Inference

Thm: Computing P(X_H = x_H | x_v) in an arbitrary GM is NP-hard

Hardness does not mean we cannot solve inference

It implies that we cannot find a general procedure that works efficiently for arbitrary GMs

For particular families of GMs, we can have provably efficient procedures

Page 8: Lecture11 xing

Approaches to Inference

Exact inference algorithms
The elimination algorithm
Belief propagation
The junction tree algorithms (not covered in detail here)

Approximate inference techniques
Variational algorithms
Stochastic simulation / sampling methods
Markov chain Monte Carlo methods

Page 9: Lecture11 xing

Inference on General BN via Variable Elimination

General idea:

Write the query in the form

P(X_1, e) = Σ_{x_n} … Σ_{x_3} Σ_{x_2} Π_i P(x_i | pa_i)

this suggests an "elimination order" of latent variables to be marginalized

Iteratively:
Move all irrelevant terms outside of the innermost sum
Perform the innermost sum, getting a new term
Insert the new term into the product

Wrap-up:
P(X_1 | e) = P(X_1, e) / P(e),   where   P(e) = Σ_{x_1} P(X_1 = x_1, e)
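A minimal sketch of this procedure, written for this transcript rather than taken from the lecture: each factor is a NumPy array tagged with its variable names, and eliminating a variable multiplies all factors that mention it and sums it out, producing a new intermediate factor.

import numpy as np
from functools import reduce

def multiply(f1, f2):
    # A factor is (tuple_of_variable_names, ndarray); multiply by aligning scopes with einsum.
    (v1, t1), (v2, t2) = f1, f2
    out = tuple(dict.fromkeys(v1 + v2))            # union of the two scopes, order preserved
    letters = {v: chr(ord('a') + i) for i, v in enumerate(out)}
    spec = "{},{}->{}".format("".join(letters[v] for v in v1),
                              "".join(letters[v] for v in v2),
                              "".join(letters[v] for v in out))
    return out, np.einsum(spec, t1, t2)

def sum_out(factor, var):
    # Marginalize one variable out of a factor.
    vs, t = factor
    return tuple(v for v in vs if v != var), t.sum(axis=vs.index(var))

def variable_elimination(factors, order):
    # Eliminate the latent variables in the given order; return the product of what remains.
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        rest = [f for f in factors if var not in f[0]]
        new_factor = sum_out(reduce(multiply, touching), var)   # the intermediate term
        factors = rest + [new_factor]
    return reduce(multiply, factors)

Each new_factor created in the loop corresponds to one of the intermediate messages m_x(...) that appear in the worked example on the following slides.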

Page 10: Lecture11 xing

Hidden Markov Model

[Figure: HMM with hidden chain y_1, y_2, y_3, …, y_T and emissions x_1, x_2, x_3, …, x_T]

p(x, y) = p(x_1, …, x_T, y_1, …, y_T)
= p(y_1) p(x_1 | y_1) p(y_2 | y_1) p(x_2 | y_2) … p(y_T | y_{T-1}) p(x_T | y_T)
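As a quick sketch of this factorization (toy parameters invented for illustration, not given in the lecture), the joint probability of one particular state/observation sequence is just the product of the CPDs along the chain:

import numpy as np

pi = np.array([0.6, 0.4])                    # p(y_1); hypothetical values
A  = np.array([[0.7, 0.3], [0.2, 0.8]])      # A[i, j] = p(y_{t+1} = j | y_t = i)
B  = np.array([[0.9, 0.1], [0.3, 0.7]])      # B[i, k] = p(x_t = k | y_t = i)

def joint_prob(x, y):
    # p(x, y) = p(y1) p(x1|y1) * prod over t of p(y_t | y_{t-1}) p(x_t | y_t)
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t-1], y[t]] * B[y[t], x[t]]
    return p

print(joint_prob(x=[0, 1, 0], y=[0, 0, 1]))  # probability of one joint assignment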

Page 11: Lecture11 xing

Hidden Markov Model

Conditional probability:

[Figure: the same HMM graphical model]

Page 12: Lecture11 xing

A Bayesian network: a food web

[Figure: food-web Bayesian network over nodes A, B, C, D, E, F, G, H with edges A→D, A→F, B→C, C→E, D→E, E→G, E→H, F→H]

What is the probability that hawks are leaving given that the grass condition is poor?

Page 13: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Need to eliminate: B, C, D, E, F, G, H

Initial factors:
P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) P(g|e) P(h|e,f)

Choose an elimination order: H, G, F, E, D, C, B

Step 1: Conditioning (fix the evidence node (i.e., h) on its observed value (i.e., h~)):
m_h(e,f) = p(h~ | e,f)

This step is isomorphic to a marginalization step:
m_h(e,f) = Σ_h p(h | e,f) δ(h, h~)

Page 14: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Need to eliminate: B, C, D, E, F, G

Current factors:
P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) P(g|e) m_h(e,f)

Step 2: Eliminate G, compute
m_g(e) = Σ_g p(g|e) = 1

⇒ P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) m_g(e) m_h(e,f)
= P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) m_h(e,f)

Page 15: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Need to eliminate: B, C, D, E, F

Current factors:
P(a) P(b) P(c|b) P(d|a) P(e|c,d) P(f|a) m_h(e,f)

Step 3: Eliminate F, compute
m_f(e,a) = Σ_f p(f|a) m_h(e,f)

⇒ P(a) P(b) P(c|b) P(d|a) P(e|c,d) m_f(e,a)

Page 16: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Need to eliminate: B, C, D, E

Current factors:
P(a) P(b) P(c|b) P(d|a) P(e|c,d) m_f(e,a)

Step 4: Eliminate E, compute
m_e(a,c,d) = Σ_e p(e|c,d) m_f(e,a)

⇒ P(a) P(b) P(c|b) P(d|a) m_e(a,c,d)

Page 17: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Need to eliminate: B, C, D

Current factors:
P(a) P(b) P(c|b) P(d|a) m_e(a,c,d)

Step 5: Eliminate D, compute
m_d(a,c) = Σ_d p(d|a) m_e(a,c,d)

⇒ P(a) P(b) P(c|b) m_d(a,c)

Page 18: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Need to eliminate: B, C

Current factors:
P(a) P(b) P(c|b) m_d(a,c)

Step 6: Eliminate C, compute
m_c(a,b) = Σ_c p(c|b) m_d(a,c)

⇒ P(a) P(b) m_c(a,b)

Page 19: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Need to eliminate: B

Current factors:
P(a) P(b) m_c(a,b)

Step 7: Eliminate B, compute
m_b(a) = Σ_b p(b) m_c(a,b)

⇒ P(a) m_b(a)

Page 20: Lecture11 xing

Example: Variable Elimination

Query: P(A | h)

Current factors:
P(a) m_b(a)

Step 8: Wrap-up

p(a, h~) = p(a) m_b(a)
p(h~) = Σ_a p(a) m_b(a)

P(a | h~) = p(a) m_b(a) / Σ_a p(a) m_b(a)
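As a numerical sanity check of these eight steps (my own sketch; the lecture gives no CPT values, so random ones are used), the elimination order H, G, F, E, D, C, B can be followed literally with einsum and compared against brute-force summation over the full joint:

import numpy as np

rng = np.random.default_rng(0)

def cpt(*shape):
    # Random conditional probability table, normalized over its last axis.
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# All variables binary; CPTs match P(a)P(b)P(c|b)P(d|a)P(e|c,d)P(f|a)P(g|e)P(h|e,f).
Pa, Pb = cpt(2), cpt(2)
Pc_b, Pd_a, Pf_a, Pg_e = cpt(2, 2), cpt(2, 2), cpt(2, 2), cpt(2, 2)
Pe_cd, Ph_ef = cpt(2, 2, 2), cpt(2, 2, 2)
h_obs = 1                                        # the observed value h~

# Brute force: build the full joint, condition on h, sum out everything except A.
joint = np.einsum('a,b,bc,ad,cde,af,eg,efh->abcdefgh',
                  Pa, Pb, Pc_b, Pd_a, Pe_cd, Pf_a, Pg_e, Ph_ef)
p_ah = joint[..., h_obs].sum(axis=(1, 2, 3, 4, 5, 6))      # p(a, h~)
print('brute force  P(A | h~):', p_ah / p_ah.sum())

# Variable elimination, step by step as on the slides.
m_h = Ph_ef[:, :, h_obs]                           # Step 1: m_h(e,f) = p(h~ | e,f)
m_g = Pg_e.sum(axis=1)                             # Step 2: m_g(e) = sum_g p(g|e) = 1
m_f = np.einsum('af,ef->ea', Pf_a, m_h)            # Step 3: m_f(e,a)
m_e = np.einsum('cde,e,ea->acd', Pe_cd, m_g, m_f)  # Step 4: m_e(a,c,d)
m_d = np.einsum('ad,acd->ac', Pd_a, m_e)           # Step 5: m_d(a,c)
m_c = np.einsum('bc,ac->ab', Pc_b, m_d)            # Step 6: m_c(a,b)
m_b = np.einsum('b,ab->a', Pb, m_c)                # Step 7: m_b(a)
p_ah_ve = Pa * m_b                                 # Step 8: p(a, h~) = p(a) m_b(a)
print('elimination  P(A | h~):', p_ah_ve / p_ah_ve.sum())

Both computations give the same conditional, but the elimination never builds a table larger than m_e(a,c,d), while the brute-force joint has 2^8 entries.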

Page 21: Lecture11 xing

Complexity of Variable Elimination

Suppose in one elimination step we compute

m_x(y_1, …, y_k) = Σ_x m'_x(x, y_1, …, y_k),   where   m'_x(x, y_1, …, y_k) = Π_{i=1}^{k} m_i(x, y_{C_i})

This requires

k · |Val(X)| · Π_i |Val(Y_{C_i})| multiplications
─ For each value of x, y_1, …, y_k, we do k multiplications

|Val(X)| · Π_i |Val(Y_{C_i})| additions
─ For each value of y_1, …, y_k, we do |Val(X)| additions

Complexity is exponential in the number of variables in the intermediate factor
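As a concrete instance of these counts (an illustrative example, not from the slides): suppose X and y_1, y_2, y_3 are all binary and each incoming factor m_i touches x and one distinct y_i, so k = 3. The intermediate factor m'_x then has 2 · 2^3 = 16 entries, and the step costs

k · |Val(X)| · Π_i |Val(Y_{C_i})| = 3 · 2 · 2^3 = 48 multiplications
|Val(X)| · Π_i |Val(Y_{C_i})| = 2 · 2^3 = 16 additions

so the work is governed by the size of the intermediate factor, not by the size of the whole network.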

Page 22: Lecture11 xing

Elimination Cliques

[Figure: the graph after each elimination step; each step induces an elimination clique and produces an intermediate message: m_h(e,f), m_g(e), m_f(e,a), m_e(a,c,d), m_d(a,c), m_c(a,b), m_b(a)]

Page 23: Lecture11 xing

Understanding Variable Elimination

A graph elimination algorithm

[Figure: moralization of the directed graph, followed by graph elimination; each elimination step removes a node and connects its remaining neighbors]

Intermediate terms correspond to the cliques resulting from elimination
"good" elimination orderings lead to small cliques and hence reduce complexity
(what will happen if we eliminate "e" first in the above graph?)

finding the optimum ordering is NP-hard, but for many graphs an optimum or near-optimum ordering can often be found heuristically

Applies to undirected GMs

Page 24: Lecture11 xing

From Elimination to Belief Propagation

Recall that the induced dependency during marginalization is captured in elimination cliques
Summation <-> elimination
Intermediate term <-> elimination clique

[Figure: elimination cliques from the running example]

Can this lead to a generic inference algorithm?

Page 25: Lecture11 xing

Tree GMs

Undirected tree: a unique path between any pair of nodes

Directed tree: all nodes except the root have exactly one parent

Poly tree: can have multiple parents

Page 26: Lecture11 xing

Equivalence of directed and undirected trees

Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it

A directed tree and the corresponding undirected tree make the same conditional independence assertions

Parameterizations are essentially the same.

Undirected tree:
Directed tree:
Equivalence:
Evidence: ?

Page 27: Lecture11 xing

From Elimination to Message Passing

Recall ELIMINATION algorithm:
Choose an ordering Z in which the query node f is the final node
Place all potentials on an active list
Eliminate node i by removing all potentials containing i, taking the sum/product over x_i
Place the resultant factor back on the list

For a TREE graph:
Choose query node f as the root of the tree
View the tree as a directed tree with edges pointing towards f
Elimination ordering based on depth-first traversal
Elimination of each node can be considered as message-passing (or Belief Propagation) directly along tree branches, rather than on some transformed graphs
thus, we can use the tree itself as a data structure to do general inference!

Page 28: Lecture11 xing

Message Passing for Trees

Let m_ij(x_i) denote the factor resulting from eliminating variables from below up to i, which is a function of x_i:

This is reminiscent of a message sent from j to i.

[Figure: a tree with root f above i, node j below i, and children k, l below j]

m_ij(x_i) represents a "belief" of x_i from x_j!
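The sketch below (my own illustration, using the standard sum-product recursion rather than code from the lecture) computes such messages on a small undirected tree with singleton potentials psi1 and pairwise potentials psi2; the node names f, i, j, k, l mirror the figure, and the potential values are made up.

import numpy as np

edges = [('f', 'i'), ('i', 'j'), ('j', 'k'), ('j', 'l')]          # a hypothetical 5-node tree
nodes = ['f', 'i', 'j', 'k', 'l']
neighbors = {n: [m for e in edges for m in e if n in e and m != n] for n in nodes}
psi1 = {n: np.array([1.0, 0.6]) for n in nodes}                   # singleton potentials
psi2 = {e: np.array([[1.2, 0.5], [0.5, 1.2]]) for e in edges}     # pairwise potentials

def pair_pot(a, b):
    # Pairwise potential as a matrix indexed (x_a, x_b).
    return psi2[(a, b)] if (a, b) in psi2 else psi2[(b, a)].T

def message(j, i):
    # m_ij(x_i) = sum over x_j of psi1(x_j) psi(x_i, x_j) times the messages into j from below.
    incoming = np.ones(2)
    for k in neighbors[j]:
        if k != i:
            incoming = incoming * message(k, j)                   # recurse towards the leaves
    return pair_pot(i, j) @ (psi1[j] * incoming)

def marginal(i):
    # Unnormalized belief at i: local potential times all incoming messages, then normalize.
    b = psi1[i].copy()
    for j in neighbors[i]:
        b = b * message(j, i)
    return b / b.sum()

print(marginal('i'))

Recomputing messages recursively for every query is wasteful; the two-pass protocol on the following slides computes each of the 2|E| directed messages exactly once.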

Page 29: Lecture11 xing

Elimination on trees is equivalent to message passing along tree branches!

[Figure: a tree rooted at the query node f, with messages passed upward from leaves k and l through j and i toward f]

Page 30: Lecture11 xing

The Message Passing Protocol:

A two-pass algorithm:

[Figure: a tree with X_1 at the root, X_2 below it, and X_3, X_4 as children of X_2. The collect pass sends m_32(X_2), m_42(X_2), then m_21(X_1); the distribute pass sends m_12(X_2), then m_23(X_3) and m_24(X_4)]

Page 31: Lecture11 xing

Belief Propagation (SP-algorithm): Sequential Implementation

Page 32: Lecture11 xing

Belief Propagation (SP-algorithm): Parallel Synchronous Implementation

For a node of degree d, whenever messages have arrived on any subset of d-1 nodes, compute the message for the remaining edge and send!

A pair of messages has been computed for each edge, one for each direction
All incoming messages are eventually computed for each node

Page 33: Lecture11 xing

Correctness of BP on Trees

Thm: The message passing protocol guarantees obtaining all marginals in the tree

Corollary: the synchronous implementation is "non-blocking"

What about non-trees?

Page 34: Lecture11 xing

Inference on General GMs

Now, what if the GM is not a tree-like graph?

Can we still directly run the message-passing protocol along its edges?
For non-trees, we do not have the guarantee that message-passing will be consistent!

Then what?
Construct a graph data-structure from P that has a tree structure, and run message-passing on it!

Junction tree algorithm

Page 35: Lecture11 xing

Elimination Cliques

Recall that the induced dependency during marginalization is captured in elimination cliques
Summation <-> elimination
Intermediate term <-> elimination clique

[Figure: elimination cliques from the running example]

Can this lead to a generic inference algorithm?

Page 36: Lecture11 xing

A Clique Tree

[Figure: a clique tree for the food-web example; each clique passes a message m_b, m_c, m_d, m_e, m_f, m_g, or m_h to its neighbor]

m_e(a,c,d) = Σ_e p(e|c,d) m_g(e) m_f(e,a)

Page 37: Lecture11 xing

From Elimination to Message Passing

Elimination <-> message passing on a clique tree

[Figure: the graph-elimination sequence shown alongside the corresponding clique tree and its messages]

m_e(a,c,d) = Σ_e p(e|c,d) m_g(e) m_f(e,a)

Messages can be reused

Page 38: Lecture11 xing

From Elimination to Message Passing

Elimination <-> message passing on a clique tree
Another query ...

[Figure: the same clique tree used for a different query node]

Messages m_f and m_h are reused, others need to be recomputed

Page 39: Lecture11 xing

The Shafer-Shenoy Algorithm

Message from clique i to clique j:

μ_{i→j}(S_ij) = Σ_{C_i \ S_ij} ψ_{C_i} Π_{k ≠ j} μ_{k→i}(S_ki)

Clique marginal:

p(C_i) ∝ ψ_{C_i} Π_k μ_{k→i}(S_ki)

Page 40: Lecture11 xing

A Sketch of the Junction Tree Algorithm

The algorithm:

Construction of junction trees --- a special clique tree

Propagation of probabilities --- a message-passing protocol

Results in marginal probabilities of all cliques --- solves all queries in a single run

A generic exact inference algorithm for any GM

Complexity: exponential in the size of the maximal clique --- a good elimination order often leads to a small maximal clique, and hence a good (i.e., thin) JT

Many well-known algorithms are special cases of JT:

Forward-backward, Kalman filter, Peeling, Sum-Product ...

Page 41: Lecture11 xing

The Junction Tree Algorithm for HMMs

A junction tree for the HMM

[Figure: junction tree for the HMM with chain cliques ψ(y_1,y_2), ψ(y_2,y_3), …, ψ(y_{T-1},y_T), emission cliques ψ(y_1,x_1), ψ(y_2,x_2), …, ψ(y_T,x_T), and separators φ(y_1), φ(y_2), …, φ(y_T)]

Rightward pass:

μ_{t→t+1}(y_{t+1}) = Σ_{y_t} ψ(y_t, y_{t+1}) ψ(y_{t+1}, x_{t+1}) μ_{t-1→t}(y_t)
= Σ_{y_t} p(y_{t+1} | y_t) p(x_{t+1} | y_{t+1}) μ_{t-1→t}(y_t)
= p(x_{t+1} | y_{t+1}) Σ_{y_t} a_{y_t, y_{t+1}} μ_{t-1→t}(y_t)

This is exactly the forward algorithm!

Leftward pass:

μ_{t+1→t}(y_t) = Σ_{y_{t+1}} ψ(y_t, y_{t+1}) ψ(y_{t+1}, x_{t+1}) μ_{t+2→t+1}(y_{t+1})
= Σ_{y_{t+1}} p(y_{t+1} | y_t) p(x_{t+1} | y_{t+1}) μ_{t+2→t+1}(y_{t+1})

This is exactly the backward algorithm!
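Concretely, the rightward and leftward messages are the familiar forward and backward recursions. A minimal sketch with hypothetical toy parameters (same conventions as the earlier HMM sketch; not code from the lecture):

import numpy as np

pi = np.array([0.6, 0.4])                    # p(y_1)
A  = np.array([[0.7, 0.3], [0.2, 0.8]])      # A[i, j] = p(y_{t+1} = j | y_t = i)
B  = np.array([[0.9, 0.1], [0.3, 0.7]])      # B[i, k] = p(x_t = k | y_t = i)
x  = [0, 1, 1, 0]                            # an observation sequence
T  = len(x)

# Rightward (forward) pass: alpha[t, i] = p(x_1..x_t, y_t = i)
alpha = np.zeros((T, 2))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)

# Leftward (backward) pass: beta[t, i] = p(x_{t+1}..x_T | y_t = i)
beta = np.ones((T, 2))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

# Node marginals p(y_t | x_1..x_T): product of the two messages, normalized per time step.
gamma = alpha * beta
gamma = gamma / gamma.sum(axis=1, keepdims=True)
print(gamma)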

Page 42: Lecture11 xing

Summary

The simple Eliminate algorithm captures the key algorithmic operation underlying probabilistic inference: that of taking a sum over a product of potential functions

The computational complexity of the Eliminate algorithm can be reduced to purely graph-theoretic considerations.

This graph interpretation will also provide hints about how to design improved inference algorithms

What can we say about the overall computational complexity of the algorithm? In particular, how can we control the "size" of the summands that appear in the sequence of summation operations?