
Graphical Models - Learning -


Page 1: Graphical Models - Learning -

Graphical Models - Learning -

Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany

[Figure: the ALARM patient-monitoring Bayesian network, with nodes MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINOVL, FIO2, PVSAT, VENTLUNG, VENITUBE, PRESS, VENTALV, ARTCO2, EXPCO2, SAO2, INSUFFANESTH, TPR, CATECHOL, HR, HRBP, HREKG, HRSAT, ERRCAUTER, HISTORY, ERRBLOWOUTPUT, STROEVOLUME, LVEDVOLUME, LVFAILURE, HYPOVOLEMIA, CVP, PCWP, CO, BP, DISCONNECT]

Advanced AI WS 06/07

Based on: J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan and T. Krishnan, "The EM Algorithm and Extensions", John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; N. Friedman and D. Koller's NIPS'99 tutorial.

Parameter Estimation

Page 2: Graphical Models - Learning -

Outline

• Introduction
• Reminder: Probability theory
• Basics of Bayesian networks
• Modeling Bayesian networks
• Inference (VE, junction tree)
• [Excursus: Markov networks]
• Learning Bayesian networks
• Relational models

Page 3: Graphical Models - Learning -

What is Learning?

• Agents are said to learn if they improve their performance over time based on experience.

"The problem of understanding intelligence is said to be the greatest problem in science today and 'the' problem for this century - as deciphering the genetic code was for the second half of the last one ... the problem of learning represents a gateway to understanding intelligence in man and machines."
-- Tomaso Poggio and Steve Smale, 2003

Page 4: Graphical Models - Learning -

Why Bother with Learning?

• Bottleneck of knowledge acquisition
  – Expensive and difficult
  – Often no expert is available
• Data is cheap!
  – Huge amounts of data are available, e.g.
    • clinical tests
    • web mining, e.g. log files
    • ...

Page 5: Graphical Models - Learning -

Why Learning Bayesian Networks?

• Conditional independencies and the graphical language capture the structure of many real-world distributions
• The graph structure provides much insight into the domain
  – Allows "knowledge discovery"
• A learned model can be used for many tasks
• Supports all the features of probabilistic learning
  – Model selection criteria
  – Dealing with missing data and hidden variables

Page 6: Graphical Models - Learning -

Learning With Bayesian Networks

[Figure: data + prior information on the left; on the right the network E -> A <- B together with its CPT P(A | E, B), one row per configuration of (E, B), with entries .9/.1, .7/.3, .99/.01, .8/.2]

Page 7: Graphical Models - Learning -

Learning With Bayesian Networks

[Figure: as on the previous slide, now with a "learning algorithm" box mapping data + prior information to the network E -> A <- B with CPT P(A | E, B)]

Page 8: Graphical Models - Learning -

What does the data look like?

A complete data set (attributes/variables A1, ..., A6 in the columns; data cases X1, ..., XM in the rows):

A1    A2    A3    A4    A5    A6
true  true  false true  false false
false true  true  true  false false
...   ...   ...   ...   ...   ...
true  false false false true  true

Page 9: Graphical Models - Learning -

What does the data look like?

An incomplete data set:

A1    A2    A3    A4    A5    A6
true  true  ?     true  false false
?     true  ?     ?     false false
...   ...   ...   ...   ...   ...
true  false ?     false true  ?

• Real-world data: the states of some random variables are missing
  – E.g. medical diagnosis: not all patients are subjected to all tests
  – Parameter reduction, e.g. clustering, ...

Page 10: Graphical Models - Learning -

What does the data look like?

(Same incomplete data set as on the previous slide; each "?" marks a missing value.)

• Real-world data: the states of some random variables are missing
  – E.g. medical diagnosis: not all patients are subjected to all tests
  – Parameter reduction, e.g. clustering, ...

Page 11: Graphical Models - Learning -

What does the data look like?

(Same incomplete data set; a variable such as A3 whose value is missing in every data case is called hidden or latent.)

• Real-world data: the states of some random variables are missing
  – E.g. medical diagnosis: not all patients are subjected to all tests
  – Parameter reduction, e.g. clustering, ...

Page 12: Graphical Models - Learning -

Hidden Variable - Examples

1. Parameter reduction:

[Figure: two networks over binary variables X1, X2, X3 and Y1, Y2, Y3. With a hidden variable H (X1, X2, X3 -> H -> Y1, Y2, Y3) the CPTs have 2+2+2 entries for the Xi, 16 for H, and 4+4+4 for the Yi: 34 in total. Without H, the Yi need CPTs with 16, 32, and 64 entries: 118 in total.]

Page 13: Graphical Models - Learning -

Hidden Variable - Examples

2. Clustering:

[Figure: naive Bayes clustering model - a hidden Cluster node with the attributes X1, X2, ..., Xn as children; the cluster assignment is hidden, the attributes are observed]

• Hidden variables also appear in clustering
• AutoClass model:
  – A hidden variable assigns the class labels
  – Observed attributes are independent given the class

Page 14: Graphical Models - Learning -

[Figure: k-means clustering example. Slides due to Eamonn Keogh]

Page 15: Graphical Models - Learning -

Iteration 1: the cluster means are randomly assigned.

[Slides due to Eamonn Keogh]

Page 16: Graphical Models - Learning -

Iteration 2

[Slides due to Eamonn Keogh]

Page 17: Graphical Models - Learning -

Iteration 5

[Slides due to Eamonn Keogh]

Page 18: Graphical Models - Learning -

Iteration 25

[Slides due to Eamonn Keogh]

Page 19: Graphical Models - Learning -

What is a natural grouping among these objects? [Slides due to Eamonn Keogh]

Page 20: Graphical Models - Learning -

[Figure: the same objects grouped four ways - as school employees, the Simpsons family, males, and females]

Clustering is subjective.

What is a natural grouping among these objects? [Slides due to Eamonn Keogh]

Page 21: Graphical Models - Learning -

Learning With Bayesian Networks

• Fully observed, fixed structure: the easiest problem - parameter estimation reduces to counting
• Fully observed, fixed variables (structure to be learned): selection of arcs; new domain with no domain expert; data mining
• Partially observed, fixed structure: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks
• Partially observed, fixed variables: encompasses the two difficult subproblems above; "only" structural EM is known
• Partially observed, hidden variables: scientific discovery

(The original slide shows this as a 2 x 3 table: rows "fully observed" / "partially observed", columns "fixed structure" / "fixed variables" / "hidden variables".)

Page 22: Graphical Models - Learning -

Parameter Estimation

• Let D = {x_1, ..., x_N} be a set of data cases over m random variables X
• Each x_i ∈ D is called a data case
• i.i.d. assumption: all data cases are independently sampled from identical distributions

Find: the parameters Θ of the CPDs which match the data best

Page 23: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?

Page 24: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?

Find the parameters which most likely have produced the data.

Page 25: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?
  1. MAP parameters

Page 26: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?
  1. MAP parameters
  2. The data is equally likely for all parameters

Page 27: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?
  1. MAP parameters
  2. The data is equally likely for all parameters
  3. All parameters are a priori equally likely
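Reading 1.-3. together: under assumptions 2. and 3., the MAP parameters coincide with the ML parameters. The formula itself is an image in the original; a standard reconstruction:

$$ \Theta^{\text{MAP}} = \arg\max_{\Theta} P(\Theta \mid D) = \arg\max_{\Theta} \frac{P(D \mid \Theta)\, P(\Theta)}{P(D)} \overset{(2),(3)}{=} \arg\max_{\Theta} P(D \mid \Theta) = \Theta^{\text{ML}} $$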

Page 28: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?

Find: ML parameters

Page 29: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?

Find: ML parameters - the likelihood of the parameters given the data

Page 30: Graphical Models - Learning -

Maximum Likelihood - Parameter Estimation

• What does "best matching" mean?

Find: ML parameters - the likelihood, or equivalently the log-likelihood, of the parameters given the data
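The likelihood and log-likelihood appear as formulas in the original slides; in the usual notation (a reconstruction, assuming the i.i.d. setting introduced earlier):

$$ L(\Theta : D) = P(D \mid \Theta) = \prod_{i=1}^{N} P(\mathbf{x}_i \mid \Theta), \qquad LL(\Theta : D) = \sum_{i=1}^{N} \log P(\mathbf{x}_i \mid \Theta), \qquad \Theta^{\text{ML}} = \arg\max_{\Theta}\, L(\Theta : D) $$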

Page 31: Graphical Models - Learning -

Maximum Likelihood

• One of the most commonly used estimators in statistics
• Intuitively appealing
• Consistent: the estimate converges to the best possible value as the number of examples grows
• Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set

Page 32: Graphical Models - Learning -

Learning With Bayesian Networks

(The overview table again; the "?" marks the case considered next: fixed structure, fully observed data.)

Page 33: Graphical Models - Learning -

Known Structure, Complete Data

[Figure: a learning algorithm maps the data E, B, A = <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y> and the network E -> A <- B with unknown CPT entries ("?") to the same network with estimated CPT P(A | E, B)]

• Network structure is specified
  – The learner needs to estimate the parameters
• Data does not contain missing values

Page 34: Graphical Models - Learning -

ML Parameter Estimation

A1    A2    A3    A4    A5    A6
true  true  false true  false false
false true  true  true  false false
...   ...   ...   ...   ...   ...
true  false false false true  true

Page 35: Graphical Models - Learning -

ML Parameter Estimation

(Same data; the likelihood is factored over the data cases using the i.i.d. assumption.)

Page 36: Graphical Models - Learning -

ML Parameter Estimation

(Derivation continued: i.i.d. factorization over the data cases.)

Page 37: Graphical Models - Learning -

ML Parameter Estimation

(Derivation continued: each data case factorizes over the families of the network - i.i.d. assumption and BN semantics.)

Page 38: Graphical Models - Learning -

ML Parameter Estimation

(Derivation continued - i.i.d. assumption and BN semantics.)

Page 39: Graphical Models - Learning -

ML Parameter Estimation

(Derivation continued - i.i.d. assumption and BN semantics.)

Page 40: Graphical Models - Learning -

ML Parameter Estimation

(Derivation continued: only the local parameters of the family of Aj are involved in each factor - i.i.d. assumption and BN semantics.)

Page 41: Graphical Models - Learning -

ML Parameter Estimation

(Derivation continued: each factor can be maximized individually! - i.i.d. assumption, BN semantics, only the local parameters of the family of Aj involved.)

Page 42: Graphical Models - Learning -

ML Parameter Estimation

(Result: decomposability of the likelihood - each factor involves only the local parameters of one family and can be maximized individually.)

Page 43: Graphical Models - Learning -

Decomposability of Likelihood

If the data set is complete (no question marks):
• we can maximize each local likelihood function independently, and
• then combine the solutions to get an MLE solution.

This decomposes the global problem into independent, local sub-problems, which allows efficient solutions to the MLE problem.
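The derivation shown as formulas on the preceding "ML Parameter Estimation" slides is the standard one; a reconstruction, writing $Pa_j$ for the parents of $A_j$ and $\Theta_j$ for the parameters of its CPD:

$$ L(\Theta : D) = \prod_{i=1}^{N} P(\mathbf{x}_i \mid \Theta) = \prod_{i=1}^{N} \prod_{j=1}^{m} P\big(a_j^{(i)} \mid pa_j^{(i)}, \Theta_j\big) = \prod_{j=1}^{m} \Big[\, \prod_{i=1}^{N} P\big(a_j^{(i)} \mid pa_j^{(i)}, \Theta_j\big) \Big] $$

The first step is the i.i.d. assumption, the second the BN semantics; the bracketed inner product is the local likelihood of family $j$.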

Page 44: Graphical Models - Learning -

Likelihood for Multinomials

• Random variable V with values 1, ..., K and parameters θ_k = P(V = k), subject to Σ_k θ_k = 1
• The likelihood is a product over the states, where N_k is the count of state k in the data
• The constraint implies that the choice of θ_i influences the choice of θ_j (i ≠ j)
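The likelihood formula itself did not survive the export; for a multinomial it is the standard

$$ L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}, \qquad \text{subject to} \quad \sum_{k=1}^{K} \theta_k = 1 . $$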

Page 45: Graphical Models - Learning -

Likelihood for Binomials (2 states only)

• Constraint: θ_1 + θ_2 = 1
• Compute the partial derivative of the (log-)likelihood
• Set the partial derivative to zero
• => this yields the MLE

Page 46: Graphical Models - Learning -

Likelihood for Binomials (2 states only)

• Constraint: θ_1 + θ_2 = 1
• Compute the partial derivative of the (log-)likelihood
• Set the partial derivative to zero
• => this yields the MLE

In general, for multinomials (> 2 states), the MLE is the relative frequency of each state.
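The omitted derivation, reconstructed (a standard result): with $\theta_2 = 1 - \theta_1$,

$$ \frac{\partial}{\partial \theta_1} \Big( N_1 \log \theta_1 + N_2 \log (1 - \theta_1) \Big) = \frac{N_1}{\theta_1} - \frac{N_2}{1 - \theta_1} = 0 \;\Rightarrow\; \hat{\theta}_1 = \frac{N_1}{N_1 + N_2} \, , $$

and for multinomials with more than two states, $\hat{\theta}_k = N_k / \sum_{l=1}^{K} N_l$, i.e. the relative frequencies.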

Page 47: Graphical Models - Learning -

Likelihood for Conditional Multinomials

• One multinomial for each joint state pa of the parents of V
• The MLE is obtained as before, separately for each parent state
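Again the formulas are images in the original; the standard reconstruction is

$$ \theta_{k \mid \mathbf{pa}} = P(V = k \mid Pa(V) = \mathbf{pa}), \qquad \hat{\theta}_{k \mid \mathbf{pa}} = \frac{N_{k, \mathbf{pa}}}{\sum_{l=1}^{K} N_{l, \mathbf{pa}}} \, , $$

where $N_{k,\mathbf{pa}}$ counts the data cases with $V = k$ and parent state $\mathbf{pa}$: ordinary counting, once per parent configuration.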

Page 48: Graphical Models - Learning -

Learning With Bayesian Networks

(The overview table again; the "?" marks the case considered next: fixed structure, partially observed data.)

Page 49: Graphical Models - Learning -

Known Structure, Incomplete Data

[Figure: a learning algorithm maps the data E, B, A = <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y> and the network E -> A <- B with unknown CPT entries to the same network with estimated CPT P(A | E, B)]

• Network structure is specified
• Data contains missing values
  – Need to consider assignments to the missing values

Page 50: Graphical Models - Learning -

EM Idea

• In the case of complete data, ML parameter estimation is easy: simply counting (one iteration)
• Incomplete data?
  1. Complete the data (imputation): most probable?, average?, ... value
  2. Count
  3. Iterate

Page 51: Graphical Models - Learning -

EM Idea: Complete the Data

Incomplete data (network A -> B):

A      B
true   true
true   ?
false  true
true   false
false  ?

Page 52: Graphical Models - Learning -

EM Idea: Complete the Data

Incomplete data (as above) -> complete: expected counts

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5

Page 53: Graphical Models - Learning -

EM Idea: Complete the Data

Complete: each case with a missing value is split over both completions:

A      B        N
true   true     1.0
true   ?=true   0.5
true   ?=false  0.5
false  true     1.0
true   false    1.0
false  ?=true   0.5
false  ?=false  0.5

Expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5

Page 54: Graphical Models - Learning -

EM Idea: Complete the Data

(As before; then maximize: set the parameters to the ML estimates computed from the expected counts.)

Page 55: Graphical Models - Learning -

EM Idea: Complete the Data

(Complete -> maximize -> iterate: re-complete the data under the new parameters.)

Page 56: Graphical Models - Learning -

EM Idea: Complete the Data

(The incomplete data again; next iteration.)

Page 57: Graphical Models - Learning -

EM Idea: Complete the Data

Expected counts after re-completing under the updated parameters:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.75
false  false  0.25

(complete -> maximize -> iterate)

Page 58: Graphical Models - Learning -

EM Idea: Complete the Data

(The incomplete data again; next iteration.)

Page 59: Graphical Models - Learning -

EM Idea: Complete the Data

Expected counts after the next iteration:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.875
false  false  0.125

(complete -> maximize -> iterate)
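The sequence of expected counts on the preceding slides (0.5 -> 0.25 -> 0.125 for the (false, false) completion) can be reproduced in a few lines of code. A minimal sketch, assuming the network A -> B with parameters P(B | A) and uniform initialization - both assumptions, the slides do not state them:

    # EM on the incomplete (A, B) data above; None marks a missing value of B.
    data = [(True, True), (True, None), (False, True), (True, False), (False, None)]

    # Initial parameters: P(B = true | A = a), chosen uniform.
    p_b_given_a = {True: 0.5, False: 0.5}

    for iteration in range(1, 4):
        # E-step: expected counts over the completed data.
        counts = {(a, b): 0.0 for a in (True, False) for b in (True, False)}
        for a, b in data:
            if b is None:
                # Distribute the case over both completions of B.
                counts[(a, True)] += p_b_given_a[a]
                counts[(a, False)] += 1.0 - p_b_given_a[a]
            else:
                counts[(a, b)] += 1.0
        # M-step: ML estimates from the expected counts.
        for a in (True, False):
            p_b_given_a[a] = counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
        print(iteration, counts, p_b_given_a)

Run as-is, this prints the expected counts 1.5/0.5, 1.75/0.25, 1.875/0.125 for the (false, true)/(false, false) completions, matching the slides; P(B = true | A = false) converges to 1, a (local) maximum.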

Page 60: Graphical Models - Learning -

Complete-Data Likelihood

An incomplete data set:

A1    A2    A3    A4    A5    A6
true  true  ?     true  false false
?     true  ?     ?     false false
...   ...   ...   ...   ...   ...
true  false ?     false true  ?

• The observed data defines the incomplete-data likelihood
• Assume complete data exists; it defines the complete-data likelihood
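The two likelihoods are given as formulas on the slide; a standard reconstruction, writing $\mathbf{o}_i$ for the observed and $\mathbf{h}_i$ for the missing part of data case $i$:

$$ L(\Theta : D) = \prod_{i=1}^{N} P(\mathbf{o}_i \mid \Theta) = \prod_{i=1}^{N} \sum_{\mathbf{h}_i} P(\mathbf{o}_i, \mathbf{h}_i \mid \Theta) \quad \text{(incomplete-data likelihood)} $$

$$ L_c(\Theta : D_c) = \prod_{i=1}^{N} P(\mathbf{o}_i, \mathbf{h}_i \mid \Theta) \quad \text{(complete-data likelihood)} $$

The sum over the missing values is what destroys the decomposability of the incomplete-data likelihood.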

Page 61: Graphical Models - Learning -

EM Algorithm - Abstract

• Expectation step
• Maximization step
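The two steps are formulas in the original; in the standard form (a reconstruction, following the Bilmes tutorial cited on the title slide):

$$ \text{E-step:} \quad Q(\Theta \mid \Theta^k) = E\big[\log P(\mathbf{o}, \mathbf{h} \mid \Theta) \;\big|\; \mathbf{o}, \Theta^k\big] \qquad\qquad \text{M-step:} \quad \Theta^{k+1} = \arg\max_{\Theta}\, Q(\Theta \mid \Theta^k) $$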

Page 62: Graphical Models - Learning -

EM Algorithm - Principle

[Figure: the likelihood P(Y | Θ) as a function of the parameters]

Expectation Maximization (EM): construct a new function based on the "current point" (which "behaves well").
Property: the maximum of the new function has a better score than the current point.

Page 63: Graphical Models - Learning -

EM Algorithm - Principle

[Figure: as above, now with the current point Θ_k marked]

Page 64: Graphical Models - Learning -

EM Algorithm - Principle

[Figure: as above, with the current point Θ_k and the new function Q_k]

Page 65: Graphical Models - Learning -

EM Algorithm - Principle

[Figure: as above, with the current point Θ_k, the new function Q_k, and its maximum Θ_{k+1}]

Page 66: Graphical Models - Learning -

EM for Multinomials

• Random variable V with values 1, ..., K
• E-step: compute the expected counts EN_k of state k in the data
• M-step ("MLE"): re-estimate the parameters from the expected counts
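The formulas are images in the original; the standard reconstruction replaces the counts of the complete-data case by expected counts,

$$ EN_k = \sum_{i=1}^{N} P(V = k \mid \mathbf{o}_i, \Theta), \qquad \hat{\theta}_k = \frac{EN_k}{\sum_{l=1}^{K} EN_l} \, . $$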

Page 67: Graphical Models - Learning -

EM for Conditional Multinomials

• One multinomial for each joint state pa of the parents of V
• "MLE" from the expected counts, separately for each parent state
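Correspondingly, per parent configuration (again a reconstruction):

$$ EN_{k,\mathbf{pa}} = \sum_{i=1}^{N} P\big(V = k,\, Pa(V) = \mathbf{pa} \mid \mathbf{o}_i, \Theta\big), \qquad \hat{\theta}_{k \mid \mathbf{pa}} = \frac{EN_{k,\mathbf{pa}}}{\sum_{l=1}^{K} EN_{l,\mathbf{pa}}} \, . $$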

Page 68: Graphical Models - Learning -

Learning Parameters: Incomplete Data

• Non-decomposable likelihood (missing values, hidden nodes)
• Initial parameters define the current model

Data (over S, X, D, C, B):
<?, 0, 1, 0, 1>
<1, 1, ?, 0, 1>
<0, 0, 0, ?, ?>
<?, ?, 0, ?, 1>
...

Page 69: Graphical Models - Learning -

Learning Parameters: Incomplete Data

(As before: initial parameters, current model, incomplete data.)

Expectation step: complete the data by inference in the current model, e.g. P(S | X=0, D=1, C=0, B=1), giving expected counts:

S X D C B
1 0 1 0 1
1 1 1 0 1
0 0 0 0 0
1 0 0 0 1
...

Page 70: Graphical Models - Learning -

Learning Parameters: Incomplete Data

(As before.)

• Expectation: complete the data by inference, e.g. P(S | X=0, D=1, C=0, B=1) -> expected counts
• Maximization: update the parameters (ML, MAP)

Page 71: Graphical Models - Learning -

Learning Parameters: Incomplete Data

EM algorithm: iterate the expectation and maximization steps until convergence.

Page 72: Graphical Models - Learning -

Learning Parameters: Incomplete Data

1. Initialize the parameters
2. Compute pseudo counts for each variable (inference, e.g. with the junction tree algorithm)
3. Set the parameters to the (completed-data) ML estimates
4. If not converged, go to 2

Page 73: Graphical Models - Learning -

Monotonicity

• (Dempster, Laird, and Rubin 1977): the incomplete-data likelihood function is not decreased after an EM iteration
• (Discrete) Bayesian networks: for any initial, non-uniform value, the EM algorithm converges to a (local or global) maximum

Page 74: Graphical Models - Learning -

LL on the Training Set (ALARM)

[Figure: log-likelihood on the training set over EM iterations; experiment by Bauer, Koller, and Singer [UAI97]]

Page 75: Graphical Models - Learning -

Parameter Values (ALARM)

[Figure: parameter values over EM iterations; experiment by Bauer, Koller, and Singer [UAI97]]

Page 76: Graphical Models - Learning -

EM in Practice

Initial parameters:
• Random parameter setting
• "Best" guess from another source

Stopping criteria:
• Small change in the likelihood of the data
• Small change in the parameter values

Avoiding bad local maxima:
• Multiple restarts
• Early "pruning" of unpromising ones

Speed-up:
• Various methods to speed up convergence

Page 77: Graphical Models - Learning -

Gradient Ascent

• Main result: a closed-form expression for the gradient of the log-likelihood
• Requires the same BN inference computations as EM
• Pros:
  – Flexible
  – Closely related to methods in neural network training
• Cons:
  – Need to project the gradient onto the space of legal parameters
  – To get reasonable convergence we need to combine it with "smart" optimization techniques
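The "main result" is a formula image in the original; the standard gradient for table CPDs (a reconstruction, not verbatim from the deck) is

$$ \frac{\partial\, LL(\Theta : D)}{\partial\, \theta_{k \mid \mathbf{pa}}} = \sum_{i=1}^{N} \frac{P\big(V = k,\, Pa(V) = \mathbf{pa} \mid \mathbf{o}_i, \Theta\big)}{\theta_{k \mid \mathbf{pa}}} \, , $$

which involves exactly the posterior family marginals that the E-step of EM also computes.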

Page 78: Graphical Models - Learning -

Parameter Estimation: Summary

• Parameter estimation is a basic task for learning with Bayesian networks
• Missing values turn it into a non-linear optimization problem - EM, gradient ascent, ...
• EM for multinomial random variables:
  – Fully observed data: counting
  – Partially observed data: pseudo counts (expected counts)
• The junction tree algorithm answers the many inference queries efficiently