CPSC 7373: Artificial Intelligence
Lecture 5: Probabilistic Inference
Jiang Bian, Fall 2012, University of Arkansas at Little Rock
Overview and Example
[Alarm network: B → A ← E; A → J, A → M]
The alarm (A) might go off because of either a burglary (B) and/or an earthquake (E). When the alarm (A) goes off, either John (J) and/or Mary (M) will call to report.
Possible questions:
• Given the evidence of either B or E, what is the probability that J or M will call?
Answer to this type of question:
• Posterior distribution: P(Q1, Q2, … | E1=e1, E2=e2)
• It is the probability distribution of one or more query variables given the values of the evidence variables.
Overview and Example
[Same alarm network as above]
Possible questions:
• Out of all the possible values for all the query variables, which combination of values has the highest probability?
Answer to these questions:
• argmax_q P(Q1=q1, Q2=q2, … | E1=e1, …)
• Which Q values are most probable given the evidence values?
Overview and Example
[Same alarm network]
Imagine the situation where Mary has called to report that the alarm is going off, and we want to know whether or not there has been a burglary. For each of the nodes, is it an evidence node, a hidden node, or a query node?
Evidence: M; Query: B; Hidden: E, A, J
Inference through enumeration
[Same alarm network]
P(+b | +j, +m) = ?
Imagine the situation where both John and Mary have called to report that the alarm is going off, and we want to know the probability of a burglary.
Definition (conditional probability):
P(Q|E) = P(Q, E) / P(E)
P(+b | +j, +m) = P(+b, +j, +m) / P(+j, +m)

First, compute the numerator P(+b, +j, +m).
Inference through enumeration
B    P(B)
+b   0.001
¬b   0.999
E    P(E)
+e   0.002
¬e   0.998
A J P(J|A)
+a +j 0.9
+a ¬j 0.1
¬a +j 0.05
¬a ¬j 0.95
A M P(M|A)
+a +m 0.7
+a ¬m 0.3
¬a +m 0.01
¬a ¬m 0.99
B   E   A   P(A|B,E)
+b  +e  +a  0.95
+b  +e  ¬a  0.05
+b  ¬e  +a  0.94
+b  ¬e  ¬a  0.06
¬b  +e  +a  0.29
¬b  +e  ¬a  0.71
¬b  ¬e  +a  0.001
¬b  ¬e  ¬a  0.999
Quiz: given +e and +a, what is the value of the term P(+b) P(+e) P(+a|+b,+e) P(+j|+a) P(+m|+a)?
Inference through enumeration
e, a     P(+b)  P(e)   P(a|+b,e)  P(+j|a)  P(+m|a)  product
+e, +a   0.001  0.002  0.95       0.9      0.7      0.000001197
+e, ¬a   0.001  0.002  0.05       0.05     0.01     5e-11
¬e, +a   0.001  0.998  0.94       0.9      0.7      0.0005910156
¬e, ¬a   0.001  0.998  0.06       0.05     0.01     2.994e-8
sum: 0.0005922426
P(+b, +j, +m) ≈ 0.00059224
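To make the table concrete, here is a minimal Python sketch of this enumeration, with the CPTs above encoded as plain dicts (the dict-based encoding is just one convenient choice, not part of the lecture):

```python
# CPT values taken from the tables above.
P_b = {True: 0.001, False: 0.999}
P_e = {True: 0.002, False: 0.998}
P_a = {  # P(a | b, e), keyed by (b, e, a)
    (True, True, True): 0.95, (True, True, False): 0.05,
    (True, False, True): 0.94, (True, False, False): 0.06,
    (False, True, True): 0.29, (False, True, False): 0.71,
    (False, False, True): 0.001, (False, False, False): 0.999,
}
P_j = {True: 0.9, False: 0.05}   # P(+j | a)
P_m = {True: 0.7, False: 0.01}   # P(+m | a)

# P(+b, +j, +m): sum over the hidden variables E and A.
total = 0.0
for e in (True, False):
    for a in (True, False):
        total += P_b[True] * P_e[e] * P_a[(True, e, a)] * P_j[a] * P_m[a]
print(total)  # ~0.00059224
```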
Inference through enumeration
e, a, b     P(b)   P(e)   P(a|b,e)  P(+j|a)  P(+m|a)  product
+e, +a, +b  0.001  0.002  0.95      0.9      0.7      0.000001197
+e, ¬a, +b  0.001  0.002  0.05      0.05     0.01     5e-11
¬e, +a, +b  0.001  0.998  0.94      0.9      0.7      0.0005910156
¬e, ¬a, +b  0.001  0.998  0.06      0.05     0.01     2.994e-8
+e, +a, ¬b  0.999  0.002  0.29      0.9      0.7      0.0003650346
+e, ¬a, ¬b  0.999  0.002  0.71      0.05     0.01     7.0929e-7
¬e, +a, ¬b  0.999  0.998  0.001     0.9      0.7      0.0006281113
¬e, ¬a, ¬b  0.999  0.998  0.999     0.05     0.01     0.0004980025
sum: 0.0020841
P(+j, +m) ≈ 0.0020841
Inference through enumeration
P(+b | +j, +m) = P(+b, +j, +m) / P(+j, +m)
= 0.00059224 / 0.0020841
≈ 0.284
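Continuing the sketch above (it reuses the P_b, P_e, P_a, P_j, P_m dicts), the full posterior also enumerates b for the denominator and then normalizes:

```python
def joint_jm(b):
    """P(b, +j, +m): sum over the hidden variables E and A."""
    return sum(
        P_b[b] * P_e[e] * P_a[(b, e, a)] * P_j[a] * P_m[a]
        for e in (True, False)
        for a in (True, False)
    )

p_bjm = joint_jm(True)              # ~0.00059224
p_jm = p_bjm + joint_jm(False)      # ~0.0020841
print(p_bjm / p_jm)                 # ~0.284
```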
Enumeration
• We assumed binary events/Boolean variables.
• With only 5 variables, the full joint table already has 2^5 = 32 rows.
• Practically, what if we have a large network?
Example: car diagnosis. Initial evidence: the engine won't start.
Testable variables (thin ovals), diagnosis variables (thick ovals).
Hidden variables (shaded) ensure sparse structure and reduce parameters.
Example: car insurance. Predict claim costs (medical, liability, property) given the data on the application form (the other unshaded nodes).
If all variables were Boolean: 2^27 rows in the joint table. NOT Boolean in reality.
Speed Up Enumeration

P(+b, +j, +m) = Σ_e Σ_a P(+b) P(e) P(a|+b,e) P(+j|a) P(+m|a)

Pulling out terms that do not depend on the summation variables:

P(+b, +j, +m) = P(+b) Σ_e P(e) Σ_a P(a|+b,e) P(+j|a) P(+m|a)
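In the dict-based sketch from earlier, pulling out terms corresponds to hoisting the constant factors out of the inner loops:

```python
# Same sum as before, but P(+b) is multiplied once and P(e) once per
# outer iteration, instead of inside every inner iteration.
total = P_b[True] * sum(
    P_e[e] * sum(P_a[(True, e, a)] * P_j[a] * P_m[a] for a in (True, False))
    for e in (True, False)
)
print(total)  # same value, ~0.00059224, with fewer multiplications
```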
Speed up enumeration
• Maximize independence
– The structure of the Bayes network determines how efficiently we can calculate the probability values.
A linear chain X1 → X2 → … → Xn needs only O(n) parameters, whereas an ordering that makes each node depend on all previous nodes needs O(2^n).
Bayesian networks: definition
• A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.
• Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link = "directly influences")
– a conditional distribution for each node given its parents: P(Xi | Parents(Xi))
• In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.
Constructing Bayesian Networks
• Dependent or Independent?– P(J|M) = P(J)?
Suppose we choose the ordering M, J, A, B, E
[Diagram: M and J added, then A]
P(A|J,M) = P(A|J)?P(A|J,M) = P(A)?
[Diagram: B added]
P(B|A, J, M) = P(B|A)?P(B|A, J, M) = P(B)?
[Diagram: E added]
P(E|B, A, J, M) = P(E|A)?P(E|B, A, J, M) = P(E|A, B)?
[Resulting network: M → J; M, J → A; A → B; A, B → E]
• Deciding conditional independence is hard in non-causal directions.
• (Causal models and conditional independence seem hardwired for humans!)
• Assessing conditional probabilities is hard in non-causal directions.
• The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed, versus 10 for the causal ordering.
Variable Elimination
• Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid re-computation
P(B | j, m)
= α P(B) Σ_e P(e) Σ_a P(a|B,e) P(j|a) P(m|a)
= α P(B) Σ_e P(e) Σ_a P(a|B,e) f_J(a) f_M(a)
= α P(B) Σ_e P(e) f_AJM(B, e)    (sum out A)
= α P(B) f_EAJM(B)    (sum out E)
Variable Elimination
• Variable elimination:
– Summing out a variable from a product of factors:
• move any constant factors outside the summation
• add up submatrices in the pointwise product of the remaining factors
– Exact inference remains NP-hard in general, but variable elimination is usually much faster than enumeration.
Pointwise product of factors f1 and f2
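A sketch of these two factor operations in Python, using a deliberately simple representation: a factor is a dict from assignment tuples to numbers, plus a list of variable names. Both the representation and the function names are illustrative choices, not a standard API:

```python
def pointwise_product(f1, vars1, f2, vars2):
    """Join two factors on their shared variables."""
    shared = [v for v in vars1 if v in vars2]
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for a1, p1 in f1.items():
        for a2, p2 in f2.items():
            d1, d2 = dict(zip(vars1, a1)), dict(zip(vars2, a2))
            if all(d1[v] == d2[v] for v in shared):  # rows must agree on shared vars
                d1.update(d2)
                out[tuple(d1[v] for v in out_vars)] = p1 * p2
    return out, out_vars

def sum_out(var, f, vars):
    """Marginalize one variable out of a factor."""
    i = vars.index(var)
    out_vars = vars[:i] + vars[i + 1:]
    out = {}
    for a, p in f.items():
        key = a[:i] + a[i + 1:]
        out[key] = out.get(key, 0.0) + p  # add up rows that collapse together
    return out, out_vars
```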
Variable Elimination: R → T → L

P(R):
+r 0.1
¬r 0.9

P(T|R):
+r +t 0.8
+r ¬t 0.2
¬r +t 0.1
¬r ¬t 0.9

P(L|T):
+t +l 0.3
+t ¬l 0.7
¬t +l 0.1
¬t ¬l 0.9

1) Joining factors: P(R, T) = P(R) · P(T|R)
+r +t 0.08
+r ¬t 0.02
¬r +t 0.09
¬r ¬t 0.81
2) Marginalize over R, to give a table of just the variable T: P(R, T) → P(T)
+t 0.17
¬t 0.83
3) Join P(T) with P(L|T) to get the joint probability P(T, L):
+t +l 0.051
+t ¬l 0.119
¬t +l 0.083
¬t ¬l 0.747
4) Sum out T: P(T, L) → P(L)
+l 0.134
¬l 0.866
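Using the two factor operations from the sketch above, the whole R → T → L elimination is four calls (factor contents are the tables from this example):

```python
P_R = ({(True,): 0.1, (False,): 0.9}, ['R'])
P_T_R = ({(True, True): 0.8, (True, False): 0.2,
          (False, True): 0.1, (False, False): 0.9}, ['R', 'T'])
P_L_T = ({(True, True): 0.3, (True, False): 0.7,
          (False, True): 0.1, (False, False): 0.9}, ['T', 'L'])

f, v = pointwise_product(*P_R, *P_T_R)   # 1) join: P(R, T)
f, v = sum_out('R', f, v)                # 2) marginalize: P(T) = <0.17, 0.83>
f, v = pointwise_product(f, v, *P_L_T)   # 3) join: P(T, L)
f, v = sum_out('T', f, v)                # 4) marginalize: P(L)
print(f)  # {(True,): 0.134, (False,): 0.866}
```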
Choice of ordering is important!
Approximate Inference: Sampling
• Example: estimate the joint probability of heads/tails for a 1-cent coin and a 5-cent coin by sampling.
• Advantages:
– Computationally easier.
– Works even without the CPTs.
1 cent 5 cent
H H
H T
T H
T T
Sampling Example

[Network: C → S, C → R; S, R → W]

Cloudy: P(C)
+c 0.5
¬c 0.5

Sprinkler: P(S|C)
+c +s 0.1
+c ¬s 0.9
¬c +s 0.5
¬c ¬s 0.5

Rain: P(R|C)
+c +r 0.8
+c ¬r 0.2
¬c +r 0.2
¬c ¬r 0.8

WetGrass: P(W|S,R)
+s +r +w 0.99
+s +r ¬w 0.01
+s ¬r +w 0.90
+s ¬r ¬w 0.10
¬s +r +w 0.90
¬s +r ¬w 0.10
¬s ¬r +w 0.01
¬s ¬r ¬w 0.99
Samples: +c, ¬s, +r
• Sampling is consistent if we want to compute the full joint probability of the network, or the probabilities of individual variables.
• What about conditional probabilities, e.g. P(w | ¬c)?
• Rejection sampling: we need to reject the samples that do not match the evidence we are interested in.
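A sketch of prior sampling plus rejection for estimating P(+w | ¬c) in this network; the function names are illustrative, and the CPT numbers are the ones in the tables above:

```python
import random

def prior_sample():
    """Sample (c, s, r, w) in topological order from the CPTs above."""
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)
    r = random.random() < (0.8 if c else 0.2)
    w = random.random() < {(True, True): 0.99, (True, False): 0.90,
                           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    return c, s, r, w

def rejection_estimate(n=100_000):
    """Estimate P(+w | ¬c): reject every sample where C = +c."""
    kept = hits = 0
    for _ in range(n):
        c, s, r, w = prior_sample()
        if c:            # sample does not match the evidence ¬c: reject it
            continue
        kept += 1
        hits += w
    return hits / kept

print(rejection_estimate())  # note: roughly half the samples are rejected here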
Rejection sampling
• Too many rejected samples make it inefficient.
– Likelihood weighting: fix the evidence variables instead of sampling them; the unweighted estimate is inconsistent, but weighting each sample restores consistency.
Likelihood weighting

[Same network and CPTs as the Sampling Example above]
P(R | +s, +w)
Weighting a sample: sample C = +c; fix the evidence S = +s (weight 0.1); sample R = +r; fix the evidence W = +w (weight 0.99).
Weight: 0.1 × 0.99 for the sample ⟨+c, +s, +r, +w⟩
P(C|+s, +r) ??
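A sketch of likelihood weighting for the query P(R | +s, +w) worked above; again the function names are illustrative:

```python
import random

def weighted_sample():
    """Sample the non-evidence variables; weight by the evidence CPT rows."""
    w = 1.0
    c = random.random() < 0.5
    w *= 0.1 if c else 0.5                 # evidence S = +s: weight by P(+s | c)
    r = random.random() < (0.8 if c else 0.2)
    w *= 0.99 if r else 0.90               # evidence W = +w: weight by P(+w | +s, r)
    return r, w

def lw_estimate(n=100_000):
    num = den = 0.0
    for _ in range(n):
        r, w = weighted_sample()
        den += w
        num += w * r
    return num / den   # estimate of P(+r | +s, +w)

print(lw_estimate())
```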
Gibbs Sampling
• Markov Chain Monte Carlo (MCMC)
– Sample one variable at a time, conditioning on the current values of all the others.
+s, +c, ¬r, ¬w → ¬s, +c, ¬r, ¬w → ¬s, +c, +r, ¬w
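A sketch of a single Gibbs step for this network: resampling R conditioned on the current values of the other variables. Only R's Markov blanket matters (its parent C, its child W, and W's other parent S), so P(r | c, s, w) ∝ P(r | c) · P(w | s, r):

```python
import random

def P_w(s, r):
    """P(+w | s, r) from the WetGrass CPT above."""
    return {(True, True): 0.99, (True, False): 0.90,
            (False, True): 0.90, (False, False): 0.01}[(s, r)]

def resample_r(c, s, w):
    """Draw R from P(r | c, s, w), proportional to P(r | c) * P(w | s, r)."""
    scores = {}
    for r in (True, False):
        p_r = (0.8 if c else 0.2) if r else (0.2 if c else 0.8)  # P(r | c)
        p_w = P_w(s, r) if w else 1 - P_w(s, r)                  # P(w | s, r)
        scores[r] = p_r * p_w
    z = scores[True] + scores[False]
    return random.random() < scores[True] / z
```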
Monty Hall Problem• Suppose you're on a game show, and you're given the choice of three doors:
Behind one door is a car; behind the others, goats. You pick a door, say No. 2 [but the door is not opened], and the host, who knows what's behind the doors, opens another door, say No. 1, which has a goat. He then says to you, "Do you want to pick door No. 3?" Is it to your advantage to switch your choice?
P(C=3 | S=2) = 1/3
P(C=3 | H=1, S=2) = 2/3
Why?
Monty Hall Problem
• P(C=3 | H=1, S=2)
= P(H=1 | C=3, S=2) P(C=3 | S=2) / Σ_i P(H=1 | C=i, S=2) P(C=i | S=2)
= (1 × 1/3) / (0 × 1/3 + 1/2 × 1/3 + 1 × 1/3) = 2/3
• P(C=1 | S=2) = P(C=2 | S=2) = P(C=3 | S=2) = 1/3
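As a quick sanity check, a small Python sketch of this Bayes-rule computation. The host model p_host encodes the standard Monty Hall assumptions (the host never opens the selected door or the car door, and chooses uniformly when both remaining doors hide goats):

```python
def p_host(h, c, s=2):
    """P(H = h | C = c, S = s): host opens a goat door, never ours."""
    if h == s or h == c:
        return 0.0
    return 0.5 if c == s else 1.0   # two goat doors to choose from if we picked the car

prior = {c: 1 / 3 for c in (1, 2, 3)}                   # P(C = c | S = 2)
post = {c: p_host(1, c) * prior[c] for c in (1, 2, 3)}  # unnormalized posterior
z = sum(post.values())
print({c: p / z for c, p in post.items()})  # {1: 0.0, 2: 1/3, 3: 2/3} -> switch!
```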