
Page 1:

Bayesian Networks I: Static Models & Multinomial Distributions

By Peter Woolf ([email protected]), University of Michigan

Michigan Chemical Process Dynamics and Controls Open Textbook

Version 1.0

Creative Commons

Page 2:

[Workflow diagram linking the following elements:]

• Existing plant measurements
• Physics, chemistry, and chemical engineering knowledge & intuition
• Bayesian network models to establish connections
• Patterns of likely causes & influences
• Efficient experimental design to test combinations of causes
• ANOVA & probabilistic models to eliminate irrelevant or uninteresting relationships
• Process optimization (e.g., controllers, architecture, unit optimization, sequencing, and utilization)
• Dynamical process modeling

Page 3:

More scenarios where Bayesian Networks can help

• Inferential sensing: how do you sense the state of something you don’t see?

• Sensor redundancy: if multiple sensors disagree, what can you say about the state of the system?

• Noisy systems: if your system is highly variable, how can you model it?

Page 4:

Stages of knowing a model, from least to most realistic:

1. Topology and parameters are known. e.g., solve a given ODE: dh/dt = 10 − 5h, h[0] = 10
2. Topology is known and we have data to learn parameters. e.g., fit parameters to an ODE using optimization: dh/dt = 10 − k1·h, h[0] = 10
3. Only data are known; must learn topology and parameters: dh/dt = ??
4. Only partial data are known; must learn topology and parameters: dh/dt = ??
5. Model is unknown and nonstationary: more research...

Bayesian networks are aimed at the later, more realistic stages, where structure and parameters must be learned from data.
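As a concrete illustration of stages 1 and 2, here is a minimal Python sketch (assuming NumPy and SciPy are available; the measurement times and noisy observations are synthetic, made up purely for illustration) that first solves the known ODE and then recovers k1 by least-squares fitting:

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

# Stage 1: topology and parameters known -- just solve dh/dt = 10 - 5h, h[0] = 10
t = np.linspace(0, 2, 50)
h_known = odeint(lambda h, t: 10 - 5 * h, 10.0, t).ravel()

# Stage 2: topology known, parameter k1 unknown -- fit dh/dt = 10 - k1*h to data
def h_model(t, k1):
    return odeint(lambda h, t: 10 - k1 * h, 10.0, t).ravel()

t_obs = np.linspace(0, 2, 20)                                         # hypothetical measurement times
h_obs = h_model(t_obs, 5.0) + np.random.normal(0, 0.05, t_obs.size)   # synthetic noisy data
k1_fit, _ = curve_fit(h_model, t_obs, h_obs, p0=[1.0])
print("fitted k1 =", k1_fit[0])                                       # should land near 5
```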

Page 5:

Probability Tables

Network: A → B

P(A):
  P(A=high)    P(A=medium)   P(A=low)
  θ01 = 0.21   θ02 = 0.45    θ03 = 0.34

P(B|A):
  A        P(B=on|A)    P(B=off|A)
  high     θ11 = 0.30   θ12 = 0.70
  medium   θ21 = 0.99   θ22 = 0.01
  low      θ31 = 0.46   θ32 = 0.54

Note: rows sum to 1, but columns don't.

Page 6:

Bayesian Networks

P(C):
  P(C-)   P(C+)
  0.5     0.5

P(S|C):
  C   P(S-)   P(S+)
  -   0.5     0.5
  +   0.9     0.1

P(R|C):
  C   P(R-)   P(R+)
  -   0.8     0.2
  +   0.2     0.8

P(W|S,R):
  S   R   P(W-)   P(W+)
  -   -   1.0     0.0
  +   -   0.1     0.9
  -   +   0.1     0.9
  +   +   0.01    0.99

• Graphical form of Bayes' rule
• Conditional independence
• Decomposition of joint probability: P(C+, S-, R+, W+) = P(C+)P(S-|C+)P(R+|C+)P(W+|S-,R+)
• Causal networks
• Inference on a network vs. inference of a network
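To make the decomposition concrete, here is a short Python sketch (CPT values transcribed from the tables above; the dictionary layout is my own) that evaluates P(C+, S-, R+, W+):

```python
# CPTs from the slide, indexed by parent state(s), '-' or '+'
P_C = {'-': 0.5, '+': 0.5}
P_S = {'-': {'-': 0.5, '+': 0.5}, '+': {'-': 0.9, '+': 0.1}}    # P(S|C): P_S[c][s]
P_R = {'-': {'-': 0.8, '+': 0.2}, '+': {'-': 0.2, '+': 0.8}}    # P(R|C): P_R[c][r]
P_W = {('-', '-'): {'-': 1.0,  '+': 0.0},
       ('+', '-'): {'-': 0.1,  '+': 0.9},
       ('-', '+'): {'-': 0.1,  '+': 0.9},
       ('+', '+'): {'-': 0.01, '+': 0.99}}                       # P(W|S,R): P_W[(s,r)][w]

# P(C+, S-, R+, W+) = P(C+) P(S-|C+) P(R+|C+) P(W+|S-,R+)
p = P_C['+'] * P_S['+']['-'] * P_R['+']['+'] * P_W[('-', '+')]['+']
print(p)   # 0.5 * 0.9 * 0.8 * 0.9 = 0.324
```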

Page 7:

Inference on a network

Network: A → B

P(A):
  P(A=high)    P(A=medium)   P(A=low)
  θ01 = 0.21   θ02 = 0.45    θ03 = 0.34

P(B|A):
  A        P(B=on|A)    P(B=off|A)
  high     θ11 = 0.30   θ12 = 0.70
  medium   θ21 = 0.99   θ22 = 0.01
  low      θ31 = 0.46   θ32 = 0.54

Exact vs. approximate calculation:
• In some cases you can exactly calculate probabilities on a BN given some data, either directly or with quite complex algorithms that reduce execution time.
• For large networks, exact inference is impractical.

Page 8:

Inference on a network

Network: A → B

P(A):
  P(A=high)    P(A=medium)   P(A=low)
  θ01 = 0.21   θ02 = 0.45    θ03 = 0.34

P(B|A):
  A        P(B=on|A)    P(B=off|A)
  high     θ11 = 0.30   θ12 = 0.70
  medium   θ21 = 0.99   θ22 = 0.01
  low      θ31 = 0.46   θ32 = 0.54

Given a value of A, say A=high, what is B?

P(B | A=high, θ): P(B=on) = 0.3, P(B=off) = 0.7

The answer is a probability!

Page 9:

Inference on a network

Network: A → B

P(A):
  P(A=high)    P(A=medium)   P(A=low)
  θ01 = 0.21   θ02 = 0.45    θ03 = 0.34

P(B|A):
  A        P(B=on|A)    P(B=off|A)
  high     θ11 = 0.30   θ12 = 0.70
  medium   θ21 = 0.99   θ22 = 0.01
  low      θ31 = 0.46   θ32 = 0.54

Given a value of B, say B=on, what is A?

P(A | B, θ) = P(B | A, θ) P(A | θ) / P(B | θ)
            = P(B | A, θ) P(A | θ) / Σ_{i=1..n} P(B | A_i, θ) P(A_i | θ)

P(A = high | B = on, θ) = (0.3)(0.21) / [(0.3)(0.21) + (0.99)(0.45) + (0.46)(0.34)] = 0.09
P(A = med | B = on, θ) = 0.67
P(A = low | B = on, θ) = 0.24

This is what GeNIe is doing in the wiki examples.
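A minimal Python check of this posterior calculation (θ values transcribed from the tables above):

```python
# Prior P(A) and likelihood P(B=on|A) from the tables
prior  = {'high': 0.21, 'medium': 0.45, 'low': 0.34}
p_b_on = {'high': 0.30, 'medium': 0.99, 'low': 0.46}

# Bayes' rule: P(A=a | B=on) = P(B=on|a) P(a) / sum_i P(B=on|A_i) P(A_i)
evidence = sum(p_b_on[a] * prior[a] for a in prior)
posterior = {a: p_b_on[a] * prior[a] / evidence for a in prior}
print(posterior)   # ~{'high': 0.09, 'medium': 0.67, 'low': 0.24}
```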

Page 10:

Inference on a network

Network: A → B

P(A):
  P(A=high)    P(A=medium)   P(A=low)
  θ01 = 0.21   θ02 = 0.45    θ03 = 0.34

P(B|A):
  A        P(B=on|A)    P(B=off|A)
  high     θ11 = 0.30   θ12 = 0.70
  medium   θ21 = 0.99   θ22 = 0.01
  low      θ31 = 0.46   θ32 = 0.54

Approximate inference via Markov chain Monte Carlo sampling:
• Given partial data, use your conditional probabilities to sample a value around the observed values and head nodes.
• Repeat sampling outward until you fill the network.
• Start over and gather averages.

Page 11:

Inference on a network

Approximate inference via Markov chain Monte Carlo sampling:
• Given partial data, use your conditional probabilities to sample a value around the observed values and head nodes.
• Repeat sampling outward until you fill the network.
• Start over and gather averages.

[Figure: a network part-way through round 1 of sampling. * = observed data; e1 = sample estimates in round 1.]

Page 12:

Inference on a network

Approximate inference via Markov chain Monte Carlo sampling:
• Given partial data, use your conditional probabilities to sample a value around the observed values and head nodes.
• Repeat sampling outward until you fill the network.
• Start over and gather averages.

The method always works in the limit of infinite samples...

[Figure: the network after further sampling. * = observed data; e1, e2 = sample estimates in rounds 1 and 2.]
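The slides describe MCMC; as a simpler stand-in that conveys the same idea of answering queries by sampling, here is a rejection-sampling sketch (my own construction, not from the slides) on the two-node A → B network: draw complete samples from the network and keep only those that match the evidence B=on.

```python
import random

prior  = [('high', 0.21), ('medium', 0.45), ('low', 0.34)]
p_b_on = {'high': 0.30, 'medium': 0.99, 'low': 0.46}

def sample_A():
    """Draw a state of A from its prior."""
    r, acc = random.random(), 0.0
    for state, p in prior:
        acc += p
        if r < acc:
            return state
    return prior[-1][0]

# Estimate P(A | B=on): sample the whole network, reject samples that
# disagree with the evidence, and average what remains.
counts = {'high': 0, 'medium': 0, 'low': 0}
kept = 0
while kept < 100_000:
    a = sample_A()
    if random.random() < p_b_on[a]:   # keep only samples consistent with B=on
        counts[a] += 1
        kept += 1
print({a: c / kept for a, c in counts.items()})   # approaches the exact 0.09 / 0.67 / 0.24
```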

Page 13:

Example scenario

You have been hired to analyze the statistical stability of a reactor system run by XYZ Chemical. Current thinking is that the temperature of the reactor and the catalyst type are primarily responsible for the product yield, although many other factors apparently play a role. You draw this relationship out as a graph like the following:

[Graph: temp → yield ← catalyst]

This can be interpreted as a Bayesian network! The network is the same as saying:

p(temp, cat, yield) = p(temp) p(cat) p(yield | cat, temp)

Page 14:

Aside: Equivalence classes

Imagine you are given the following statement of conditional probability between three variables A, B, and C:

P(A,B,C) = P(A) P(B|A) P(C|B)

For this statement, draw out three Bayesian networks that are consistent with this model. Hint: one can be read directly from the model, but the other two must be derived from this model using Bayes' rule.

Page 15:

Aside: Equivalence classes

Imagine you are given the following statement of conditional probability between three variables A, B, and C:

P(A,B,C) = P(A) P(B|A) P(C|B)

For this statement, draw out three Bayesian networks that are consistent with this model. Hint: one can be read directly from the model, but the other two must be derived from this model using Bayes' rule.

Recall Bayes' rule: P(B|A) = P(A|B) P(B) / P(A)

Model 1 (A → B → C) can be read directly from the model.

Model 2 (A ← B → C) can be derived from model 1 using Bayes' rule:

P(A,B,C) = P(A) [P(A|B) P(B) / P(A)] P(C|B) = P(A|B) P(B) P(C|B)

Model 3 (A ← B ← C) can be derived from model 2 using Bayes' rule again:

P(A,B,C) = P(A|B) [P(B|C) P(C) / P(B)] P(B) = P(A|B) P(B|C) P(C)

Note that these three networks form an equivalence class, and equivalence classes are a fundamental property of observed data: causality can be determined from observational data only to a limited extent! The network A → B ← C is fundamentally different (prove it to yourself with Bayes' rule) and can be distinguished using observational data.

Page 16:

FUNDAMENTAL PROPERTY! The three models above are equivalent if we just observe A, B, and C.

If we intervene and change A, B, or C we can distinguish between them, OR we can use our domain knowledge to choose the direction.

The network A → B ← C, with P(A,B,C) = P(A) P(C) P(B | C,A), is different: no arrangement of this last model will produce the upper three models.
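A quick numeric sanity check (my own sketch, with arbitrary made-up CPTs for binary variables): the chain A → B → C and the factorization derived for A ← B → C assign identical joint probabilities to every configuration.

```python
import itertools

# Arbitrary made-up CPTs for binary A, B, C (states 0/1)
P_A = [0.3, 0.7]
P_B_given_A = [[0.8, 0.2], [0.4, 0.6]]    # P_B_given_A[a][b]
P_C_given_B = [[0.9, 0.1], [0.25, 0.75]]  # P_C_given_B[b][c]

# Model 1 (A -> B -> C): P(a,b,c) = P(a) P(b|a) P(c|b)
def joint1(a, b, c):
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

# Derive P(B) and P(A|B) with Bayes' rule, then Model 2 (A <- B -> C)
P_B = [sum(P_A[a] * P_B_given_A[a][b] for a in (0, 1)) for b in (0, 1)]
P_A_given_B = [[P_A[a] * P_B_given_A[a][b] / P_B[b] for a in (0, 1)]
               for b in (0, 1)]            # P_A_given_B[b][a] = P(A=a | B=b)

def joint2(a, b, c):
    return P_A_given_B[b][a] * P_B[b] * P_C_given_B[b][c]

for a, b, c in itertools.product((0, 1), repeat=3):
    assert abs(joint1(a, b, c) - joint2(a, b, c)) < 1e-12
print("Model 1 and Model 2 define the same joint distribution")
```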

Page 17:

Example scenario

You have been hired to analyze the statistical stability of a reactor system run by XYZ Chemical. Current thinking is that the temperature of the reactor and the catalyst type are primarily responsible for the product yield, although many other factors apparently play a role. You draw this relationship out as a graph:

[Graph: temp → yield ← catalyst]

p(temp, cat, yield) = p(temp) p(cat) p(yield | cat, temp)

You go through the last year's worth of operating data and create the following conditional probability tables to describe how the system behaves:

Temperature   P(Temperature)
High (H)      0.35
Medium (M)    0.40
Low (L)       0.25

Catalyst   P(Catalyst)
A          0.40
B          0.60

Temperature   Catalyst   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             A          0.51         0.08         0.41
H             B          0.30         0.20         0.50
M             A          0.71         0.09         0.20
M             B          0.92         0.05         0.03
L             A          0.21         0.40         0.39
L             B          0.12         0.57         0.31

Page 18:

Temperature   P(Temperature)
High (H)      0.35
Medium (M)    0.40
Low (L)       0.25

(1) Given these data, what is the probability of observing a set of 9 temperature readings of which 4 are high, 2 are medium, and 3 are low? Note that these are independent readings and we don't care about the ordering of the readings, just the probability of observing a set of 9 readings with this property.

Here we can use the multinomial distribution and the probabilities in the table above:

p(n1, n2, ..., nk) = (Σ_{i=1..k} n_i)! / (Π_{i=1..k} n_i!) × Π_{i=1..k} p_i^{n_i}
                   = N! / (n1! n2! ... nk!) × (p1^{n1} p2^{n2} ... pk^{nk})

Compare to the binomial distribution we discussed previously (k = 2):

p(n1, n2) = (n1 + n2)! / (n1! n2!) × p1^{n1} (1 − p1)^{n2}

Page 19:

Temperature   P(Temperature)
High (H)      0.35
Medium (M)    0.40
Low (L)       0.25

(1) Given these data, what is the probability of observing a set of 9 temperature readings of which 4 are high, 2 are medium, and 3 are low? Note that these are independent readings and we don't care about the ordering of the readings, just the probability of observing a set of 9 readings with this property.

Here we can use the multinomial distribution and the probabilities in the table above:

p(n1, n2, ..., nk) = N! / (n1! n2! ... nk!) × (p1^{n1} p2^{n2} ... pk^{nk}),  where N = Σ_{i=1..k} n_i

For this problem we find:

p(4H, 2M, 3L) = 9! / (4! 2! 3!) × (0.35^4 × 0.40^2 × 0.25^3) = 0.047
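The same number falls out of a few lines of Python using only the standard library (a generic multinomial pmf helper; the function name is my own):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """p(n1,...,nk) = N!/(n1!...nk!) * p1^n1 * ... * pk^nk"""
    coef = factorial(sum(counts))
    for ni in counts:
        coef //= factorial(ni)      # divide out each n_i! (always exact)
    p = 1.0
    for ni, pi in zip(counts, probs):
        p *= pi ** ni
    return coef * p

print(multinomial_pmf([4, 2, 3], [0.35, 0.40, 0.25]))   # ~0.047
```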

Page 20:

(2) After gathering these 9 temperature readings, what is the most likely next temperature reading you will see? Why?

Temperature   P(Temperature)
High (H)      0.35
Medium (M)    0.40
Low (L)       0.25

The next most likely temperature reading is medium, because it has the highest probability, 0.4. The previous sequence of temperature readings does not matter, because, as noted above, the readings are independent.

Page 21:

Catalyst   P(Catalyst)
A          0.40
B          0.60

(3) What is the probability of sampling a set of 9 observations with 7 of them catalyst A and 2 of them catalyst B? Here again, order does not matter.

Here we can use the two-state case of the multinomial distribution (the binomial distribution):

p(7A, 2B) = 9! / (7! 2!) × (0.4^7 × 0.6^2) = 0.0212

Page 22:

(4) What is the probability of observing the following yield values? Note that here we have the temperature and catalyst values, so we can use the conditional probability values. As before, the order of observations does not matter, but the association between temperature and catalyst and the yield does matter. For this part, just write down the expression you would use; you don't need to do the full calculation.

Number of times observed   Temperature   Catalyst   Yield
4x                         H             A          H
2x                         M             B          L
3x                         L             A          H

Temperature   Catalyst   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             A          0.51         0.08         0.41
H             B          0.30         0.20         0.50
M             A          0.71         0.09         0.20
M             B          0.92         0.05         0.03
L             A          0.21         0.40         0.39
L             B          0.12         0.57         0.31

[Graph: temp → yield ← catalyst]

Page 23:

Temperature   Catalyst   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             A          0.51         0.08         0.41
H             B          0.30         0.20         0.50
M             A          0.71         0.09         0.20
M             B          0.92         0.05         0.03
L             A          0.21         0.40         0.39
L             B          0.12         0.57         0.31

Number of times observed   Temperature   Catalyst   Yield
4x                         H             A          H
2x                         M             B          L
3x                         L             A          H

Calculation method 1: First we calculate the probability of this set of observations for one particular ordering:

p(ordered yield observations | model, parameters) = 0.51^4 × 0.03^2 × 0.21^3 = 5.6e-7

The number of orderings of identical items is the factorial term in the multinomial:

combinations = 9! / (4! 2! 3!) = 1260

Thus the total probability is 1260 × 5.6e-7 = 0.00071048.

Page 24:

Calculation method 2:

The probabilities can be interpreted here as multinomial terms themselves. For example, for the first observation group we could ask: what is the probability of observing 4 high, 0 medium, and 0 low yields for a system with a high temperature and catalyst A? Using the multinomial distribution we would find:

p(4H, 0M, 0L | T = high, Cat = A) = 4! / (4! 0! 0!) × (0.51^4 × 0.08^0 × 0.41^0) = 0.51^4 ≈ 0.0677

Note that this matches the corresponding factor in calculation method 1 exactly. We can repeat this for the second group to find p(0H, 0M, 2L | T = med, Cat = B) = 0.03^2 = 0.0009, which again matches. In general,

p(yield observations | model, parameters) = (combinations) × Π probabilities

The combination term is the same, 1260. Taking the product of the combinations and the probabilities, we find the same total probability, 0.00071048.

Temperature   Catalyst   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             A          0.51         0.08         0.41
H             B          0.30         0.20         0.50
M             A          0.71         0.09         0.20
M             B          0.92         0.05         0.03
L             A          0.21         0.40         0.39
L             B          0.12         0.57         0.31

Number of times observed   Temperature   Catalyst   Yield
4x                         H             A          H
2x                         M             B          L
3x                         L             A          H
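Both calculation methods can be reproduced in a few lines of Python (a sketch; counts and CPT values transcribed from the tables above):

```python
from math import factorial

# Observations: 4x (T=H, Cat=A) -> yield H (p=0.51),
#               2x (T=M, Cat=B) -> yield L (p=0.03),
#               3x (T=L, Cat=A) -> yield H (p=0.21)
combinations = factorial(9) // (factorial(4) * factorial(2) * factorial(3))  # 1260

# Method 1: probability of one particular ordering times the number of orderings
p_ordered = 0.51**4 * 0.03**2 * 0.21**3    # ~5.6e-7
print(combinations * p_ordered)            # ~0.00071048

# Method 2: per-group multinomial terms (each group's coefficient is 1,
# e.g. 4!/(4!0!0!) = 1) times the same combination count -- identical result
p_groups = (0.51**4) * (0.03**2) * (0.21**3)
print(combinations * p_groups)
```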

Page 25:

[Graph: temp → yield ← catalyst]

This term is the probability of the data given a model and parameters: P(data | model, parameters). The absolute value of this probability is not very informative by itself, but it can be if it is compared to something else.

Note that the joint probability model here is p(temperature, catalyst, yield) = p(temperature) × p(catalyst) × p(yield | temperature, catalyst) = 0.047 × 0.0212 × 0.00071 = 7.07e-7.

(Note: p(temp) and p(cat) were calculated earlier in the lecture.)

Page 26:

As an example, let's say that you try another model, in which yield depends only on temperature. This model is shown graphically below:

[Graph: temp → yield, with catalyst now disconnected from yield]

What is the conditional probability model?

P(temperature, cat, yield) = p(temp) p(cat) p(yield | temp)

(call this model 2)

Page 27:

P(temperature, cat, yield) = p(temp) p(cat) p(yield | temp)   (model 2)

Temperature   Catalyst   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             A          0.51         0.08         0.41
H             B          0.30         0.20         0.50
M             A          0.71         0.09         0.20
M             B          0.92         0.05         0.03
L             A          0.21         0.40         0.39
L             B          0.12         0.57         0.31

How do we change this table to get p(yield | temp)? The conditional probability table changes as the catalyst is marginalized out:

p(yield | temp) = Σ_{i=A,B} p(yield | temp, cat_i) p(cat_i)

Page 28:

Merging the following two tables:

Catalyst   P(Catalyst)
A          0.40
B          0.60

Temperature   Catalyst   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             A          0.51         0.08         0.41
H             B          0.30         0.20         0.50
M             A          0.71         0.09         0.20
M             B          0.92         0.05         0.03
L             A          0.21         0.40         0.39
L             B          0.12         0.57         0.31

yields:

Temperature   P(Yield=H)                    P(Yield=M)                    P(Yield=L)
H             0.51×0.4 + 0.30×0.6 = 0.384   0.08×0.4 + 0.20×0.6 = 0.152   0.41×0.4 + 0.50×0.6 = 0.464
M             0.71×0.4 + 0.92×0.6 = 0.836   0.09×0.4 + 0.05×0.6 = 0.066   0.20×0.4 + 0.03×0.6 = 0.098
L             0.21×0.4 + 0.12×0.6 = 0.156   0.40×0.4 + 0.57×0.6 = 0.502   0.39×0.4 + 0.31×0.6 = 0.342
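The marginalization can be done mechanically in Python (a sketch; tables transcribed from above, state labels my own):

```python
p_cat = {'A': 0.40, 'B': 0.60}
# P(yield | temp, cat) rows from the table: (temp, cat) -> (P(H), P(M), P(L))
p_yield = {('H', 'A'): (0.51, 0.08, 0.41), ('H', 'B'): (0.30, 0.20, 0.50),
           ('M', 'A'): (0.71, 0.09, 0.20), ('M', 'B'): (0.92, 0.05, 0.03),
           ('L', 'A'): (0.21, 0.40, 0.39), ('L', 'B'): (0.12, 0.57, 0.31)}

# p(yield | temp) = sum_i p(yield | temp, cat_i) p(cat_i)
for temp in ('H', 'M', 'L'):
    row = [sum(p_yield[(temp, c)][j] * p_cat[c] for c in ('A', 'B'))
           for j in range(3)]
    print(temp, [round(x, 3) for x in row])
# H [0.384, 0.152, 0.464]; M [0.836, 0.066, 0.098]; L [0.156, 0.502, 0.342]
```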

Now what?

Page 29:

Now we can calculate the probability of our data as before. p(temp) and p(cat) remain unchanged, but we need to update the yield term using the reduced dataset:

Number of times observed   Temperature   Yield
4x                         H             H
2x                         M             L
3x                         L             H

and the marginalized table:

Temperature   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             0.384        0.152        0.464
M             0.836        0.066        0.098
L             0.156        0.502        0.342

As before, we have 9 observations with 3 distinct outcomes:

p(yield data | model 2) = 9! / (4! 2! 3!) × (0.384^4 × 0.098^2 × 0.156^3) = 0.0009989

So which model is better?

Page 30:

Given these data, we compare the two models, with and without the catalyst influence, in what is called a Bayes factor:

Bayes factor = p(model 1 | data) / p(model 2 | data)
             = [p(data | model 1) p(model 1) / p(data)] / [p(data | model 2) p(model 2) / p(data)]
             = p(data | model 1) / p(data | model 2)

(assuming that both models are a priori equally likely; the term p(data) cancels out).

A Bayes factor (BF) is something like a p-value, expressed in probabilistic (Bayesian) terms:
• BF near 1: both models explain the data about equally well.
• BF far from 1: one model is clearly favored over the other.

Page 31:

Our model is the joint probability, so we can write the ratio out as:

Bayes factor = p(data | model 1) / p(data | model 2)
             = [p(temp) p(cat) p(yield | temp, cat)] / [p(temp) p(cat) p(yield | temp)]
             = p(yield | temp, cat) / p(yield | temp)
             = 0.00071048 / 0.0009989 = 0.711

Thus, given our model parameters and these data, the model where the catalyst type influences the yield is favored over the simpler model by a factor of only 0.711:1; said another way, the simpler model fits the data 1.40 times better.

Limitations:
• The analysis is based on only 9 data points.
• Even so, this kind of comparison is useful for identifying unusual behavior; for example, here we might conclude that catalysts A and B still have distinct properties even after, say, being recycled many times.
• We don't always have parameters like the truth table to start with.
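Numerically, the whole comparison fits in a few lines (a sketch reusing the counts and probabilities computed on the previous pages):

```python
from math import factorial

coef = factorial(9) // (factorial(4) * factorial(2) * factorial(3))   # 1260
p_data_m1 = coef * 0.51**4  * 0.03**2  * 0.21**3    # yield | temp, cat (model 1)
p_data_m2 = coef * 0.384**4 * 0.098**2 * 0.156**3   # yield | temp, cat marginalized (model 2)
print(p_data_m1 / p_data_m2)                        # Bayes factor ~0.711
```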

Page 32:

Another use of this kind of model is to evaluate confidence intervals. For example, if I draw 100 samples from different lots of product made using the high-temperature process with catalyst A, what range of high-yield sample counts would I expect to draw 90% of the time?

Said another way: what is the most likely number, what value covers 45% of the probability density below it, and what value covers 45% of the probability density above it?

Temperature   Catalyst   P(Yield=H)   P(Yield=M)   P(Yield=L)
H             A          0.51         0.08         0.41
H             B          0.30         0.20         0.50
M             A          0.71         0.09         0.20
M             B          0.92         0.05         0.03
L             A          0.21         0.40         0.39
L             B          0.12         0.57         0.31

Solution: the most likely configuration from 100 samples is 51 H, 8 M, and 41 L.

Page 33:

[Figure: the multinomial density plotted over M and L counts, peaked at 51H, 8M, 41L.]

Constraints:
• There are a total of 100 samples drawn, thus 100 = H + M + L.
• For the maximum likelihood case, H = 51, so the relationship between M and L is 100 = 51 + M + L → M = 49 − L.
• At some lower value of H we get the expression M = (100 − H) − L.

Integrate by summing!
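Since the question concerns only the high-yield count, its marginal distribution is Binomial(100, 0.51), and one way to "integrate by summing" is to widen an interval outward from the mode until it holds 90% of the probability mass. A sketch assuming SciPy is available (the endpoints printed are approximate):

```python
from scipy.stats import binom

n, p = 100, 0.51   # 100 draws, P(Yield=H | T=H, Cat=A) = 0.51
lo = hi = 51       # most likely high-yield count (the mode)

# "Integrate by summing": grow the interval around the mode, always adding
# the more probable neighboring count, until it covers at least 90%.
mass = binom.pmf(51, n, p)
while mass < 0.90:
    left = binom.pmf(lo - 1, n, p) if lo > 0 else 0.0
    right = binom.pmf(hi + 1, n, p) if hi < n else 0.0
    if left >= right:
        lo -= 1; mass += left
    else:
        hi += 1; mass += right
print(lo, hi, mass)   # roughly 43..59 high-yield samples out of 100
```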

Page 34:

Take Home Messages
• Using a Bayesian network, you can describe complex relationships between variables.
• Multinomial distributions allow you to handle variables with more than 2 states.
• Using the rules of probability (Bayes' rule, marginalization, and independence), you can infer states on a Bayesian network.

Page 35:

After weeks of discussion, you and your colleagues have created the following influence diagram to describe the causal relationships in your process:

[Influence diagram: Ts → Ps; Tf, Ts, Ccat → Crxn; Crxn → Yield; Yield → P]

1) You decide to interpret this influence diagram as a Bayesian network. To help you do analysis on this network, you need to cast the network as a statement of conditional probability. Write out the statement of conditional probability that is consistent with the network:

P(Tf, Ts, Ps, Crxn, Ccat, Yield, P) = P(Tf) P(Ccat) P(Ts) P(Ps|Ts) P(Crxn|Tf, Ts, Ccat) P(Yield|Crxn) P(P|Yield)

Page 36:

2) After creating the influence diagram above, what else do you need before you can use the model to make predictions in a Bayesian network inference engine such as GeNIe?

Probabilities! We need the probability that each event will take place given each configuration of its parents. These could come from data or from estimates, but they are required.

3) When making predictions on a Bayesian network, what does the solution look like? For example, imagine that we want to infer the profitability of the process (P) given the temperature of the feed (Tf). If profitability can take on the values {very high, high, med, low, very low}, give an example solution of what inference on a Bayesian network would look like.

Bayesian networks produce a probability distribution, not a point answer. So for the profit example above, we would get a table like: P(P=very high) = 0.1, P(P=high) = 0.15, P(P=med) = 0.25, P(P=low) = 0.45, P(P=very low) = 0.05. (Note that these probabilities must sum to 1.)