
Artificial Intelligence: Bayesian Networks

Stephan Dreiseitl

FH Hagenberg, Software Engineering & Interactive Media

Stephan Dreiseitl (Hagenberg/SE/IM) Lecture 11: Bayesian Networks Artificial Intelligence SS2010 1 / 43

Overview

Representation of uncertain knowledge

Constructing Bayesian networks

Using Bayesian networks for inference

Algorithmic aspects of inference



A simple Bayesian network example

[Network: Rain → Worms, Rain → Umbrellas]

P(Rain, Worms, Umbrellas) = P(Worms | Rain) P(Umbrellas | Rain) P(Rain)

With conditional independence, the right-hand side alone suffices to represent the joint distribution
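The factorization can be checked numerically. A minimal Python sketch (not from the slides; the slide gives no probability tables for this network, so the numbers below are made up for illustration):

```python
# Joint probability from the factored form.  The three tables below are
# made-up illustration values, not from the lecture.
p_rain = 0.3
p_worms = {True: 0.8, False: 0.1}       # P(worms | Rain)
p_umbrellas = {True: 0.9, False: 0.05}  # P(umbrellas | Rain)

def joint(rain, worms, umbrellas):
    """P(Rain, Worms, Umbrellas) = P(Worms | Rain) P(Umbrellas | Rain) P(Rain)."""
    pr = p_rain if rain else 1 - p_rain
    pw = p_worms[rain] if worms else 1 - p_worms[rain]
    pu = p_umbrellas[rain] if umbrellas else 1 - p_umbrellas[rain]
    return pw * pu * pr

# The eight joint entries must sum to 1.
total = sum(joint(r, w, u) for r in (True, False)
            for w in (True, False) for u in (True, False))
```

Three numbers per conditional table (instead of seven for an arbitrary joint over three Booleans) determine the full distribution.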


A simple Bayesian net example (cont.)

[Network: Rain → Worms, Rain → Umbrellas]

Intuitively: graphical representation of influence

Mathematically: graphical representation of conditional independence assertions



A more complicated example

[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls]

P(b) = 0.001    P(e) = 0.002

B E | P(a | B, E)
T T | 0.95
T F | 0.94
F T | 0.24
F F | 0.001

A | P(j | A)        A | P(m | A)
T | 0.9             T | 0.7
F | 0.05            F | 0.01


Definition of Bayesian networks

A Bayesian network is a directed acyclic graph with

random variables as nodes,

links that specify “directly influences” relationships,

probability distributions P(Xi | parents(Xi)) for each node Xi

Graph structure asserts conditional independencies:

P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)



Bayesian networks as joint probabilities

P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | X1, . . . , Xi−1)

                 = ∏_{i=1}^{n} P(Xi | parents(Xi))

for parents(Xi) ⊆ {X1, . . . , Xi−1}

Burglary example:

P(b, ¬e, a, ¬m, j) = P(b) P(¬e) P(a | b, ¬e) P(¬m | a) P(j | a)
                   = 0.001 × 0.998 × 0.94 × 0.3 × 0.9
                   ≈ 0.00025
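The same product can be evaluated directly; a one-line sketch (not part of the slides), using the table values above:

```python
# Evaluate P(b, ¬e, a, ¬m, j) as the product of the network's factors,
# with the values from the alarm network's probability tables.
p = 0.001 * 0.998 * 0.94 * 0.3 * 0.9   # P(b) P(¬e) P(a|b,¬e) P(¬m|a) P(j|a)
```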


Conditional independencies in networks

Use graphical structure to visualize conditional dependencies and independencies

Nodes are dependent if there is information flow between them (along one possible path)

Nodes are independent if information flow is blocked (along all possible paths)

Distinguish situations with and without evidence (instantiated variables)



Conditional independencies in networks

No evidence: Information flow along a path is blocked iff there is a “head-to-head” node (blocker) on the path

No blockers between A and B (information flows):

A → C → B        A ← C → B

Blocker C (head-to-head node) between A and B:

A → C ← B


Conditional independencies in networks

Evidence blocks information flow, except at blockers (or their descendants), where it opens information flow

Information flow between A and B blocked by evidence at C:

A → [C] → B        A ← [C] → B



Conditional independencies in networks

Information flow between A and B unblocked by evidence at the head-to-head node C (or at one of its descendants):

A → [C] ← B


Conditional independencies in networks

A node is conditionally independent of its non-descendants, given its parents

[Diagram: node X with parents P1 and P2, children C1 and C2, and further nodes A, B, D elsewhere in the network]

P(X | P1, P2, A, B, D) = P(X | P1, P2)


Cond. independencies in networks (cont.)

A node is conditionally independent of all other nodes in the network, given its Markov blanket: its parents, children, and children’s parents

[Same diagram: node X with parents P1, P2 and children C1, C2; A and B are the children’s other parents, D lies outside the Markov blanket]

P(X | P1, P2, C1, C2, A, B, D) = P(X | P1, P2, C1, C2, A, B)

Noisy OR

For a Boolean node X with n Boolean parents, the conditional probability table has 2^n entries

The noisy-OR assumption reduces this number to n: assume each parent may be inhibited independently

[Network: Flu → Fever, Malaria → Fever, Cold → Fever]



Noisy OR (cont.)

Need only specify first three entries of table:

Flu Malaria Cold | P(¬fever)
T    F      F    | 0.2
F    T      F    | 0.1
F    F      T    | 0.6
F    F      F    | 1.0
F    T      T    | 0.1 × 0.6 = 0.06
T    F      T    | 0.2 × 0.6 = 0.12
T    T      F    | 0.2 × 0.1 = 0.02
T    T      T    | 0.2 × 0.1 × 0.6 = 0.012


Building an example network

“When I go home at night, I want to know if my family is home before I try the doors (perhaps the most convenient door to enter is double locked when nobody is home). Now, often when my wife leaves the house she turns on an outdoor light. However, she sometimes turns on this light if she is expecting a guest. Also, we have a dog. When nobody is home, the dog is put in the back yard. The same is true if the dog has bowel trouble. Finally, if the dog is in the back yard, I will probably hear her barking, but sometimes I can be confused by other dogs barking.”

F. Jensen, An Introduction to Bayesian Networks, UCL Press, 1996.



Building an example network (cont.)

Relevant entities: Boolean random variables FamilyOut, LightsOn, HearDogBark

Causal structure: FamilyOut has direct influence on both LightsOn and HearDogBark, so LightsOn and HearDogBark are conditionally independent given FamilyOut

[Network: FamilyOut → LightsOn, FamilyOut → HearDogBark]


Building an example network (cont.)

Numbers in conditional probability tables are derived from previous experience, or subjective belief

P(familyout) = 0.2
P(lightson | familyout) = 0.99
P(lightson | ¬familyout) = 0.1

Run into a problem with P(heardogbark | familyout): the dog may be out because of bowel problems, and the barking may be other dogs

Network structure needs to be updated to reflect this



Building an example network (cont.)

Introduce mediating variable DogOut to model the uncertainty from bowel problems and from hearing other dogs bark

[Network: FamilyOut → LightsOn, FamilyOut → DogOut, BowelProblems → DogOut, DogOut → HearDogBark]

Need: P(DogOut | FamilyOut, BowelProblems) and P(HearDogBark | DogOut)


Building an example network (cont.)

Obtain the following additional probability tables:

[Network: FamilyOut → LightsOn, FamilyOut → DogOut, BowelProblems → DogOut, DogOut → HearDogBark]

P(f) = 0.2    P(b) = 0.05

F | P(l | F)        F B | P(d | F, B)        D | P(h | D)
T | 0.99            T T | 0.99               T | 0.6
F | 0.1             T F | 0.88               F | 0.25
                    F T | 0.96
                    F F | 0.2


Inference in Bayesian networks

Given events (instantiated variables) e and no information on hidden variables H, calculate the distribution of the query variable Q

Algorithmically, calculate P(Q | e) by marginalizing over H

P(Q | e) = α P(Q, e) = α ∑_h P(Q, e, h)

with h ranging over all possible value combinations of H
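The normalization step can be written out directly; a tiny sketch (not from the slides), using as unnormalized values the 0.111 and 0.0267 that the FamilyOut query worked later in the lecture produces:

```python
# Normalization: P(Q | e) = α Σ_h P(Q, e, h).  The unnormalized values are
# those of the FamilyOut query P(F | l, h) from the enumeration example.
unnormalized = {"f": 0.111, "not_f": 0.0267}
alpha = 1.0 / sum(unnormalized.values())
posterior = {value: alpha * p for value, p in unnormalized.items()}
```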

Distinguish between causal, diagnostic, and intercausal reasoning


Types of inference

Causal reasoning: query variable is downstream of events

P(heardogbark | familyout) = 0.56

Diagnostic reasoning: query variable is upstream of events

P(familyout | heardogbark) = 0.296

Explaining away (intercausal reasoning): knowing the effect and one possible cause reduces the probability of other possible causes

P(familyout | bowelproblems, heardogbark) = 0.203
P(bowelproblems | heardogbark) = 0.078
P(bowelproblems | familyout, heardogbark) = 0.053



Algorithmic aspects of inference

Calculating the full joint distribution is computationally expensive

Several alternatives for inference in Bayesian networks:

Exact inference
  by enumeration
  by variable elimination

Stochastic inference (Monte Carlo methods)
  by sampling from the joint distribution
  by rejection sampling
  by likelihood weighting
  by Markov chain Monte Carlo methods


Inference by enumeration

FamilyOut example (d′ ∈ {d, ¬d}, b′ ∈ {b, ¬b}):

P(F | l, h) = α P(F, l, h) = α ∑_{d′} ∑_{b′} P(F, l, h, d′, b′)

P(f | l, h) = α ∑_{d′} ∑_{b′} P(f) P(b′) P(l | f) P(d′ | f, b′) P(h | d′)

            = α P(f) P(l | f) ∑_{d′} P(h | d′) ∑_{b′} P(b′) P(d′ | f, b′)

            = α · 0.2 · 0.99 · (0.6 · 0.8857 + 0.25 · 0.1143)

            = α · 0.111



Inference by enumeration (cont.)

Similarly, P(¬f | l, h) = α 0.0267

From P(f | l, h) + P(¬f | l, h) = 1, normalization yields

P(F | l, h) = α (0.111, 0.0267) = (0.806, 0.194)
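The enumeration can be reproduced in a few lines; a Python sketch (not from the slides), with the FamilyOut probability tables given earlier:

```python
from itertools import product

# Inference by enumeration for P(F | l, h) in the FamilyOut network.
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}                   # P(l | F)
P_d = {(True, True): 0.99, (True, False): 0.88,  # P(d | F, B)
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}                   # P(h | D)

def pr(p, val):
    """P(X = val) for a Boolean variable with P(X = true) = p."""
    return p if val else 1 - p

# Sum the joint over the hidden variables D and B, for both values of F.
unnorm = {}
for f in (True, False):
    unnorm[f] = sum(pr(P_f, f) * pr(P_b, b) * P_l[f] *
                    pr(P_d[(f, b)], d) * P_h[d]
                    for d, b in product((True, False), repeat=2))
p_f_given_lh = unnorm[True] / (unnorm[True] + unnorm[False])
```

The unnormalized value for f matches the slide’s α · 0.111, and the normalized result matches 0.806.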

Burglary example:

P(B | j, m) = α P(B) ∑_{e′} P(e′) ∑_{a′} P(a′ | B, e′) P(j | a′) P(m | a′)

The last two factors P(j | a′) P(m | a′) do not depend on e′, but have to be evaluated twice (for e and ¬e)


Variable elimination

Eliminate repetitive calculations by summing inside out, storing intermediate results (cf. dynamic programming)

Burglary example, different query:

P(J | b) = α P(b) ∑_{e′} P(e′) ∑_{a′} P(a′ | b, e′) P(J | a′) ∑_{m′} P(m′ | a′)

where the last sum ∑_{m′} P(m′ | a′) = 1

Fact: Any variable that is not an ancestor of the query or evidence variables is irrelevant and can be dropped



Sampling from the joint distribution

Straightforward if there is no evidence in the network: sample each variable in topological order

For nodes without parents, sample from their distribution; for nodes with parents, sample from the conditional distribution

With N_S(x1, . . . , xn) being the number of times the specific realization (x1, . . . , xn) is generated in N sampling experiments, obtain

lim_{N→∞} N_S(x1, . . . , xn) / N = P(x1, . . . , xn)
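Forward sampling in topological order can be sketched as follows (not from the slides; FamilyOut tables as given earlier):

```python
import random

# Forward (prior) sampling of the FamilyOut network: roots first, then
# children given their sampled parent values.
random.seed(0)
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}                   # P(l | F)
P_d = {(True, True): 0.99, (True, False): 0.88,  # P(d | F, B)
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}                   # P(h | D)

def sample_once():
    f = random.random() < P_f
    b = random.random() < P_b
    l = random.random() < P_l[f]
    d = random.random() < P_d[(f, b)]
    h = random.random() < P_h[d]
    return f, b, l, d, h

N = 100_000
count = sum(sample_once() == (False, False, False, False, True)
            for _ in range(N))
estimate = count / N   # estimates P(¬f, ¬b, ¬l, ¬d, h) = 0.1368
```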


Example: joint distribution sampling

[Figure: the FamilyOut network and its probability tables, as given above]

What is the probability that the family is at home, the dog has no bowel problems and isn’t out, the light is off, and a dog’s barking can be heard?



Example: joint distribution sampling

FamilyOut example: Generate 100000 samples from the network by

first sampling from the FamilyOut and BowelProblems variables,

then sampling from all other variables in turn, given sampled parent values

Obtain N_S(¬f, ¬b, ¬l, ¬d, h) = 13740

Compare with P(¬f, ¬b, ¬l, ¬d, h) = 0.8 × 0.95 × 0.9 × 0.8 × 0.25 = 0.1368


Example: joint distribution sampling

Advantage of sampling: easy to generate estimates for other probabilities

Standard error of estimates drops as 1/√N; for N = 100000 this is 0.00316

N_S(¬d)/100000 = 0.63393   (P(¬d) = 0.63246)

N_S(f, ¬h) / N_S(¬h) = 0.1408   (P(f | ¬h) = 0.1416)

The last example is a form of rejection sampling



Rejection sampling in Bayesian networks

Method to approximate conditional probabilities P(X | e) of variables X, given evidence e:

P(X | e) ≈ N_S(X, e) / N_S(e)

Rejection sampling: take into account only those samples that are consistent with the evidence

Problem with rejection sampling: the number of samples consistent with the evidence drops exponentially with the number of evidence variables, making it unusable for real-life networks
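A rejection-sampling sketch for the query P(f | ¬h) from the previous slide (not part of the slides; LightsOn is irrelevant to this query and is not sampled):

```python
import random

# Rejection sampling for P(f | ¬h): generate joint samples, keep only
# those consistent with the evidence ¬h.
random.seed(1)
P_f, P_b = 0.2, 0.05
P_d = {(True, True): 0.99, (True, False): 0.88,
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}

kept = hits = 0
for _ in range(100_000):
    f = random.random() < P_f
    b = random.random() < P_b
    d = random.random() < P_d[(f, b)]
    h = random.random() < P_h[d]
    if not h:              # reject samples inconsistent with the evidence
        kept += 1
        hits += f
estimate = hits / kept     # ≈ P(f | ¬h) = 0.1416
```

Only about 62% of the samples survive here; with many evidence variables the surviving fraction collapses, which is exactly the problem stated above.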


Likelihood weighting

Fix evidence and sample all other variables

This overcomes the rejection-sampling shortcoming by generating only samples consistent with the evidence

Problem: Consider a situation with P(E = e | X = x) = 0.001 and P(X = x) = 0.9. Then 90% of samples will have X = x (and fixed E = e), but this combination is very unlikely, since P(E = e | X = x) = 0.001

Solution: Weight each sample by the product of the conditional probabilities of the evidence variables, given their parents



Example: Likelihood weighting

FamilyOut example: Calculate P(F | l, d)

Iterate the following:

sample all non-evidence variables, given the evidence variables, obtaining, e.g., (¬f, ¬b, h)

calculate the weighting factor, e.g. P(l | ¬f) × P(d | ¬f, ¬b) = 0.1 × 0.2 = 0.02

Finally, sum and normalize the weighting factors for samples (f, l, d) and (¬f, l, d)


Example: Likelihood weighting (cont.)

For N = 100000, obtain

N_S(¬f, l, d) = 20164,   ∑ w_(¬f,l,d) = 1907.18

N_S(f, l, d) = 79836,   ∑ w_(f,l,d) = 17676.4

P(f | l, d) ≈ 17676.4 / (17676.4 + 1907.18) = 0.90261

Correct: P(f | l, d) = 0.90206
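The procedure can be sketched in Python (not part of the slides; tables as given earlier):

```python
import random

# Likelihood weighting for P(F | l, d): the evidence L = l and D = d is
# fixed, non-evidence variables are sampled, and each sample is weighted
# by P(l | f') * P(d | f', b').
random.seed(2)
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}
P_d = {(True, True): 0.99, (True, False): 0.88,
       (False, True): 0.96, (False, False): 0.2}

w = {True: 0.0, False: 0.0}
for _ in range(100_000):
    f = random.random() < P_f
    b = random.random() < P_b
    # H would be sampled as well, but it does not affect the weight.
    w[f] += P_l[f] * P_d[(f, b)]
estimate = w[True] / (w[True] + w[False])   # ≈ P(f | l, d) = 0.902
```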

Disadvantage of likelihood weighting: With many evidence variables, most samples will have very small weights, and few samples with larger weights dominate



Markov chains

A sequence of discrete random variables X0, X1, . . . is called a Markov chain with state space S iff

P(Xn = xn | X0 = x0, . . . , Xn−1 = xn−1) = P(Xn = xn | Xn−1 = xn−1)

for all x0, . . . , xn ∈ S.

Thus, Xn is conditionally independent of all variables before it, given Xn−1

Specify the state transition matrix P with P_ij = P(Xn = xj | Xn−1 = xi)


Markov chain Monte Carlo methods

Want to obtain samples from a given distribution Pd(X) (hard to sample from with other methods)

Idea: Construct a Markov chain that, for an arbitrary initial state x0, converges towards a stationary (equilibrium) distribution Pd(X)

Then, successive realizations xn, xn+1, . . . are sampled according to Pd (but are not independent!)

Often not clear when convergence of the chain has taken place

Therefore, discard the initial portion of the chain (burn-in phase)


Markov chain example

Let S = {1, 2} with state transition matrix

P = ( 1/2  1/2 )
    ( 1/4  3/4 )

Simulate the chain for 1000 steps, show N_S(1)/N and N_S(2)/N for N = 1, . . . , 1000 with starting state 1 (left) and 2 (right)

[Two plots: relative state frequencies over 1000 steps (y-axis 0.2–0.8), for starting state 1 and starting state 2; both runs settle at the same limiting values]

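The two-state chain is easy to simulate directly; a sketch (not from the slides). For this transition matrix the stationary distribution works out to (1/3, 2/3):

```python
import random

# Simulate the two-state Markov chain with transition matrix
# P = [[1/2, 1/2], [1/4, 3/4]] and record relative state frequencies.
random.seed(3)
P1 = {1: 0.5, 2: 0.25}   # probability of moving to state 1, per current state

def run(start, steps):
    """Return relative frequencies of states 1 and 2 after `steps` steps."""
    state, ones = start, 0
    for _ in range(steps):
        state = 1 if random.random() < P1[state] else 2
        ones += (state == 1)
    return ones / steps, 1 - ones / steps
```

Running `run(1, n)` and `run(2, n)` for large n gives frequencies near (1/3, 2/3) from either starting state, matching the plots.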

MCMC for Bayesian networks

Given evidence e and non-evidence variables X, use Markov chains to sample from the distribution P(X | e)

Obtain a sequence of states x0, x1, . . . , discard the initial portion

After convergence, samples xk, xk+1, . . . have the desired distribution P(X | e)

Many variants of Markov chain Monte Carlo algorithms

Consider only Gibbs sampling



Gibbs sampling

Fix evidence variables to e, assign arbitrary values to non-evidence variables X

Recall: the Markov blanket of a variable is its parents, children, and children’s parents

Iterate the following:

pick arbitrary variable Xi from X

sample from P(Xi |MarkovBlanket(Xi))

new state = old state, with new value of Xi


Gibbs sampling (cont.)

Calculating P(Xi | MarkovBlanket(Xi)):

P(xi | MarkovBlanket(Xi)) = α P(xi | parents(Xi)) × ∏_{Yj ∈ children(Xi)} P(yj | parents(Yj))

With this, calculate P(xi | MarkovBlanket(Xi)) and P(¬xi | MarkovBlanket(Xi)), normalize to obtain P(Xi | MarkovBlanket(Xi))

Sample from this for the next value of Xi, and thus the next state of the Markov chain



Bayesian network MCMC example

FamilyOut example: Calculate P(F | l, d)

Start with arbitrary non-evidence settings (f, b, h)

Pick F, sample from P(F | l, d, b), obtain ¬f

Pick B, sample from P(B | ¬f, d), obtain b

Pick H, sample from P(H | d), obtain h

Iterate the last three steps 50000 times, keep the last 10000 states

Obtain P(f | l, d) ≈ 0.9016 (correct: 0.90206)
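The loop above can be sketched in Python (not part of the slides; the Markov-blanket conditionals below are derived from the network’s tables, and for simplicity each non-evidence variable is resampled in turn rather than picked at random):

```python
import random

# Gibbs sampling for P(F | l, d): repeatedly resample each non-evidence
# variable from its distribution given its Markov blanket.
random.seed(4)
P_f, P_b = 0.2, 0.05
P_l = {True: 0.99, False: 0.1}
P_d = {(True, True): 0.99, (True, False): 0.88,
       (False, True): 0.96, (False, False): 0.2}
P_h = {True: 0.6, False: 0.25}

def bernoulli(p_true, p_false):
    """Sample True with probability p_true / (p_true + p_false)."""
    return random.random() < p_true / (p_true + p_false)

f, b = True, True           # arbitrary initial non-evidence state
count = total = 0
for step in range(50_000):
    # P(F | l, d, b) ∝ P(F) P(l | F) P(d | F, b)
    f = bernoulli(P_f * P_l[True] * P_d[(True, b)],
                  (1 - P_f) * P_l[False] * P_d[(False, b)])
    # P(B | f, d) ∝ P(B) P(d | f, B)
    b = bernoulli(P_b * P_d[(f, True)], (1 - P_b) * P_d[(f, False)])
    # P(H | d) = P(h | d); H has no children, so its blanket is just D
    h = random.random() < P_h[True]
    if step >= 40_000:      # discard burn-in, keep the last 10000 states
        total += 1
        count += f
estimate = count / total    # ≈ P(f | l, d) = 0.902
```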


Comparison of inference algorithms

Inference by enumeration is computationally prohibitive

Variable elimination removes all irrelevant variables

Direct sampling from the joint distribution: easy when no evidence is present

Use rejection sampling and likelihood weighting for more efficient calculations

Markov chain Monte Carlo methods are most efficient for large networks, calculating new states based on old states, but lose independence of samples



Summary

Bayesian networks are graphical representations of causal influence among random variables

Network structure graphically specifies conditional independence assumptions

Need conditional distributions of nodes, given their parents

Use noisy OR to reduce number of parameters in tables

Reasoning types in Bayesian networks: causal, diagnostic,and explaining away

There are exact and approximate inference algorithms
