Bayesian Networks Without Tears


Eugene Charniak

I give an introduction to Bayesian networks for AI researchers with a limited grounding in probability theory. Over the last few years, this method of reasoning using probabilities has become popular within the AI probability and uncertainty community. Indeed, it is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community. Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated.

0738-4602/91/$4.00 © 1991 AAAI

Over the last few years, a method of reasoning using probabilities, variously called belief networks, Bayesian networks, knowledge maps, probabilistic causal networks, and so on, has become popular within the AI probability and uncertainty community. This method is best summarized in Judea Pearl's (1988) book, but the ideas are a product of many hands. I adopted Pearl's name, Bayesian networks, on the grounds that the name is completely neutral about the status of the networks (do they really represent beliefs, causality, or what?). Bayesian networks have been applied to problems in medical diagnosis (Heckerman 1990; Spiegelhalter, Franklin, and Bull 1989), map learning (Dean 1990), language understanding (Charniak and Goldman 1989a, 1989b; Goldman 1990), vision (Levitt, Mullin, and Binford 1989), heuristic search (Hansson and Mayer 1989), and so on. It is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community.

Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated. That is, this article tries to make the basic ideas and intuitions accessible to someone with a limited grounding in probability theory (equivalent to what is presented in Charniak and McDermott [1985]).

    An Example Bayesian Network

The best way to understand Bayesian networks is to imagine trying to model a situation in which causality plays a role but where our understanding of what is actually going on is incomplete, so we need to describe things probabilistically. Suppose when I go home at night, I want to know if my family is home before I try the doors. (Perhaps the most convenient door to enter is double locked when nobody is home.) Now, often when my wife leaves the house, she turns on an outdoor light. However, she sometimes turns on this light if she is expecting a guest. Also, we have a dog. When nobody is home, the dog is put in the back yard. The same is true if the dog has bowel troubles. Finally, if the dog is in the backyard, I will probably hear her barking (or what I think is her barking), but sometimes I can be confused by other dogs barking. This example, partially inspired by Pearl's (1988) earthquake example, is illustrated in figure 1. There we find a graph not unlike many we see in AI. We might want to use such diagrams to predict what will happen (if my family goes out, the dog goes out) or to infer causes from observed effects (if the light is on and the dog is out, then my family is probably out).

The important thing to note about this example is that the causal connections are not absolute. Often, my family will have left without putting out the dog or turning on a light. Sometimes we can use these diagrams anyway, but in such cases, it is hard to know what to infer when not all the evidence points the same way. Should I assume the family is out if the light is on, but I do not hear the dog? What if I hear the dog, but the light is out? Naturally, if we knew the relevant probabilities, such as P(family-out | light-on ∧ ¬hear-bark), then we would be all set. However, typically, such numbers are not available for all possible combinations of circumstances. Bayesian networks allow us to calculate them from a small set of probabilities, relating only neighboring nodes.

Bayesian networks are directed acyclic graphs (DAGs) (like figure 1), where the nodes are random variables, and certain independence assumptions hold, the nature of which I discuss later. (I assume without loss of generality that the DAG is connected.) Often, as in figure 1, the random variables can be thought of as states of affairs, and the variables have two possible values, true and false. However, this need not be the case. We could, say, have a node denoting the intensity of an earthquake with values no-quake, trembler, rattler, major, and catastrophe. Indeed, the variable values do not even need to be discrete. For example, the value of the variable earthquake might be a Richter scale number. (However, the algorithms I discuss only work for discrete values, so I stick to this case.)

In what follows, I use a sans serif font for the names of random variables, as in earthquake. I use the name of the variable in italics to denote the proposition that the variable takes on some particular value (but where we are not concerned with which one), for example, earthquake. For the special case of Boolean variables (with values true and false), I use the variable name in a sans serif font to denote the proposition that the variable has the value true (for example, family-out). I also show the arrows pointing downward so that "above" and "below" can be understood to indicate arrow direction.

The arcs in a Bayesian network specify the independence assumptions that must hold between the random variables. These independence assumptions determine what probability information is required to specify the probability distribution among the random variables in the network. The reader should note that in informally talking about the DAG, I said that the arcs denote causality, whereas in the Bayesian network, I am saying that they specify things about the probabilities. The next section resolves this conflict.

To specify the probability distribution of a Bayesian network, one must give the prior probabilities of all root nodes (nodes with no predecessors) and the conditional probabilities of all nonroot nodes given all possible combinations of their direct predecessors.

Figure 1. A Causal Graph. The nodes denote states of affairs, and the arcs can be interpreted as causal connections. (The nodes are family-out, bowel-problem, light-on, dog-out, and hear-bark.)

Thus, figure 2 shows a fully specified Bayesian network corresponding to figure 1. For example, it states that if family members leave our house, they will turn on the outside light 60 percent of the time, but the light will be turned on even when they do not leave 5 percent of the time (say, because someone is expected).

Figure 2. A Bayesian Network for the family-out Problem. I added the prior probabilities for root nodes and the conditional probabilities for nonroots given all possible values of their parents. (Abbreviations: fo = family-out, lo = light-on, bp = bowel-problem, do = dog-out, hb = hear-bark.)

P(fo) = .15  P(bp) = .01
P(lo | fo) = .6  P(lo | ¬fo) = .05
P(hb | do) = .7  P(hb | ¬do) = .01
P(do | fo ∧ bp) = .99  P(do | fo ∧ ¬bp) = .90
P(do | ¬fo ∧ bp) = .97  P(do | ¬fo ∧ ¬bp) = .3

Bayesian networks allow one to calculate the conditional probabilities of the nodes in the network given that the values of some of the nodes have been observed. To take the earlier example, if I observe that the light is on (light-on = true) but do not hear my dog (hear-bark = false), I can calculate the conditional probability of family-out given these pieces of evidence. (For this case, it is .5.) I talk of this calculation as evaluating the Bayesian network (given the evidence). In more realistic cases, the networks would consist of hundreds or thousands of nodes, and they might be evaluated many times as new information comes in. As evidence comes in, it is tempting to think of the probabilities of the nodes changing, but, of course, what is changing is the conditional probability of the nodes given the changing evidence. Sometimes people talk about the belief of the node changing. This way of talking is probably harmless provided that one keeps in mind that here, belief is simply the conditional probability given the evidence.
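To make the arithmetic concrete, here is a minimal Python sketch (mine, not the article's; the dictionary encoding of figure 2 is just one convenient representation) that enumerates the joint distribution defined by the figure 2 numbers and computes P(family-out | light-on ∧ ¬hear-bark). It reproduces the .5 quoted above.

from itertools import product

# Figure 2, encoded by hand. Each entry is P(node = true | parent assignment).
priors = {"fo": 0.15, "bp": 0.01}
cpts = {
    "lo": {(True,): 0.60, (False,): 0.05},            # parent: fo
    "hb": {(True,): 0.70, (False,): 0.01},            # parent: do
    "do": {(True, True): 0.99, (True, False): 0.90,   # parents: fo, bp
           (False, True): 0.97, (False, False): 0.30},
}

def joint(fo, bp, lo, do, hb):
    """P(fo ^ bp ^ lo ^ do ^ hb) = P(fo)P(bp)P(lo | fo)P(do | fo ^ bp)P(hb | do)."""
    def p(prob_true, value):
        return prob_true if value else 1.0 - prob_true
    return (p(priors["fo"], fo) * p(priors["bp"], bp) *
            p(cpts["lo"][(fo,)], lo) *
            p(cpts["do"][(fo, bp)], do) *
            p(cpts["hb"][(do,)], hb))

# P(fo | lo = true, hb = false): sum the joint over the unobserved nodes bp and do.
num = sum(joint(True, bp, True, do, False)
          for bp, do in product([True, False], repeat=2))
den = sum(joint(fo, bp, True, do, False)
          for fo, bp, do in product([True, False], repeat=3))
print(round(num / den, 3))   # 0.501, that is, roughly the .5 given in the text

Enumerating all 2^n assignments like this obviously does not scale; making the same computation efficient is exactly what the evaluation algorithms discussed later are for.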

In the remainder of this article, I first describe the independence assumptions implicit in Bayesian networks and show how they relate to the causal interpretation of arcs (Independence Assumptions). I then show that given these independence assumptions, the numbers I specified are, in fact, all that are needed (Consistent Probabilities). Evaluating Networks describes how Bayesian networks are evaluated, and the next section describes some of their applications.

    Independence Assumptions

One objection to the use of probability theory is that the complete specification of a probability distribution requires absurdly many numbers. For example, if there are n binary random variables, the complete distribution is specified by 2^n - 1 joint probabilities. (If you do not know where this 2^n - 1 comes from, wait until the next section, where I define joint probabilities.) Thus, the complete distribution for figure 2 would require 31 values, yet we only specified 10. This savings might not seem great, but if we doubled the size of the network by grafting on a copy, as shown in figure 3, 2^n - 1 would be 1023, but we would only need to give 21. Where does this savings come from?

Figure 3. A Network with 10 Nodes. This illustration is two copies of the graph from figure 1 attached to each other. Nonsense names (frog-out, family-lout, night-on, towel-problem, hear-quark) were given to the nodes in the second copy.





The answer is that Bayesian networks have built-in independence assumptions. To take a simple case, consider the random variables family-out and hear-bark. Are these variables independent? Intuitively not, because if my family leaves home, then the dog is more likely to be out, and thus, I am more likely to hear it. However, what if I happen to know that the dog is definitely in (or out of) the house? Is hear-bark independent of family-out then? That is, is P(hear-bark | family-out ∧ dog-out) = P(hear-bark | dog-out)? The answer now is yes. After all, my hearing her bark was dependent on her being in or out. Once I know whether she is in or out, then where the family is is of no concern.
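This independence can also be checked directly from the numbers. Using the factorization established later (in Consistent Probabilities), the check runs as follows (the derivation is mine, added for concreteness): P(hb | fo ∧ do) = P(hb ∧ fo ∧ do) / P(fo ∧ do), and both the numerator and the denominator are sums of products P(fo)P(bp)P(lo | fo)P(do | fo ∧ bp)P(hb | do) over the unobserved variables bp and lo. The only factor that mentions hb is P(hb | do), which does not depend on bp or lo, so it can be pulled out of the numerator's sum; what remains is exactly the denominator, and we are left with P(hb | fo ∧ do) = P(hb | do).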

We are beginning to tie the interpretation of the arcs as direct causality to their probabilistic interpretation. The causal interpretation of the arcs says that the family being out has a direct causal connection to the dog being out, which, in turn, is directly connected to my hearing her. In the probabilistic interpretation, we adopt the independence assumptions that the causal interpretation suggests. Note that if I had wanted to say that the location of the family was directly relevant to my hearing the dog, then I would have to put another arc directly between the two. Direct relevance would occur, say, if the dog is more likely to bark when the family is away than when it is at home. This is not the case for my dog.

In the rest of this section, I define the independence assumptions in Bayesian networks and then show how they correspond to what one would expect given the interpretation of the arcs as causal. In the next section, I formally show that once one makes these independence assumptions, the probabilities needed are reduced to the ones that I specified (for roots, the priors; for nonroots, the conditionals given immediate predecessors).

First, I give the rule specifying dependence and independence in Bayesian networks:

    In a Bayesian network, a variable a is




dependent on a variable b given evidence E = {e1, …, en} if there is a d-connecting path from a to b given E. (I call E the evidence nodes. E can be empty. It cannot include a or b.) If a is not dependent on b given E, a is independent of b given E.

Note that for any random variable f, it is possible for two variables to be independent of each other given E but dependent given E ∪ {f}, and vice versa (they may be dependent given E but independent given E ∪ {f}). In particular, if we say that two variables a and b are independent of each other, we simply mean that P(a | b) = P(a). It might still be the case that they are not independent given, say, e (that is, P(a | b ∧ e) ≠ P(a | e)).

To connect this definition to the claim that family-out is independent of hear-bark given dog-out, we will see when I explain d-connecting that there is no d-connecting path from family-out to hear-bark given dog-out because dog-out, in effect, blocks the path between the two.

To understand d-connecting paths, we need to keep in mind the three kinds of connections between a random variable b and its two immediate neighbors in the path, a and c. The three possibilities are shown in figure 4 and correspond to the possible combinations of arrow directions from b to a and c. In the first case, one node is above b and the other below; in the second case, both are above; and in the third, both are below. (Remember, we assume that arrows in the diagram go from high to low, so going in the direction of the arrow is going down.) We can say that a node b in a path P is linear, converging, or diverging in P depending on which situation it finds itself in according to figure 4.

Figure 4. The Three Connection Types. In each case, node b is between a and c in the undirected path between the two. (The three cases are labeled linear, converging, and diverging.)

Now I give the definition of a d-connecting path:

A path from q to r is d-connecting with respect to the evidence nodes E if every interior node n in the path has the property that either
1. it is linear or diverging and not a member of E or
2. it is converging, and either n or one of its descendants is in E.

In the literature, the term d-separation is more common. Two nodes are d-separated if there is no d-connecting path between them. I find the explanation in terms of d-connecting slightly easier to understand. I go through this definition slowly in a moment, but roughly speaking, two nodes are d-connected if either there is a causal path between them (part 1 of the definition) or there is evidence that renders the two nodes correlated with each other (part 2).

To understand this definition, let's start by pretending that part 2 is not there. Then we would be saying that a d-connecting path must not be blocked by evidence, and there can be no converging interior nodes. We already saw why we want the evidence-blocking restriction. This restriction is what says that once we know about a middle node, we do not need to know about anything further away.

What about the restriction on converging nodes? Again, consider figure 2. In this diagram, I am saying that both bowel-problem and family-out can cause dog-out. However, does the probability of bowel-problem depend on that of family-out? No, not really. (We could imagine a case where they were dependent, but this case would be another ball game and another Bayesian network.) Note that the only path between the two is by way of a converging node for this path, namely, dog-out. To put it another way, if two things can cause the same state of affairs and have no other connection, then the two things are independent. Thus, any time we have two potential causes for a state of affairs, we have a converging node. Because one major use of Bayesian networks is deciding which potential cause is the most likely, converging nodes are common.

Now let us consider part 2 in the definition of a d-connecting path. Suppose we know that the dog is out (that is, dog-out is a member of E). Now, are family-out and bowel-problem independent? No, even though they were independent of each other when there was no evidence, as I just showed. For example, knowing that the family is at home should raise (slightly) the probability that the dog has bowel problems. Because we eliminated


the most likely explanation for the dog being out, less likely explanations become more likely. This situation is covered by part 2. Here, the d-connecting path is from family-out to bowel-problem. It goes through a converging node (dog-out), but dog-out is itself a conditioning node. We would have a similar situation if we did not know that the dog was out but merely heard the barking. In this case, we would not be sure the dog was out, but we do have relevant evidence (which raises the probability), so hear-bark, in effect, connects the two nodes above the converging node. Intuitively, part 2 means that a path can only go through a converging node if we are conditioning on an (indirect) effect of the converging node.
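This definition is mechanical enough to check directly by machine. The following Python sketch (my own illustration; the graph is the one from figure 2) enumerates the simple undirected paths between two nodes and tests each interior node against the two conditions. It mirrors the definition rather than being an efficient d-separation algorithm.

# Figure 2 structure: map each node to its parents.
parents = {"fo": [], "bp": [], "lo": ["fo"], "do": ["fo", "bp"], "hb": ["do"]}
children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}

def descendants(node):
    """All nodes reachable from node by following arrows downward."""
    found = set()
    stack = list(children[node])
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack.extend(children[n])
    return found

def paths(a, b, seen=()):
    """All simple undirected paths from a to b."""
    if a == b:
        yield (a,)
        return
    for nxt in parents[a] + children[a]:
        if nxt not in seen:
            for rest in paths(nxt, b, seen + (a,)):
                yield (a,) + rest

def d_connected(a, b, evidence):
    """True if some path from a to b is d-connecting given the evidence nodes."""
    for path in paths(a, b):
        ok = True
        for prev, n, nxt in zip(path, path[1:], path[2:]):
            converging = prev in parents[n] and nxt in parents[n]
            if converging:
                # Part 2: n or one of its descendants must be in the evidence.
                if n not in evidence and not (descendants(n) & set(evidence)):
                    ok = False
                    break
            else:
                # Part 1: a linear or diverging node must not be an evidence node.
                if n in evidence:
                    ok = False
                    break
        if ok:
            return True
    return False

print(d_connected("fo", "hb", set()))     # True: the path fo -> do -> hb
print(d_connected("fo", "hb", {"do"}))    # False: dog-out blocks the path
print(d_connected("fo", "bp", set()))     # False: converging node do is not evidence
print(d_connected("fo", "bp", {"hb"}))    # True: hb is a descendant of do

The four test calls reproduce the cases discussed above: family-out and hear-bark are dependent with no evidence but independent given dog-out, whereas family-out and bowel-problem are independent with no evidence but become dependent once hear-bark is observed.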

    Consistent Probabilities

One problem that can plague a naive probabilistic scheme is inconsistent probabilities. For example, consider a system in which we have P(a | b) = .7, P(b | a) = .3, and P(b) = .5. Just eyeballing these equations, nothing looks amiss, but a quick application of Bayes's law shows that these probabilities are not consistent because they require P(a) > 1. By Bayes's law,

P(a | b) = P(a)P(b | a) / P(b) ;

so,

P(a) = P(a | b)P(b) / P(b | a) = .7 * .5 / .3 = .35 / .3 ≈ 1.17 .

Needless to say, in a system with a lot of such numbers, making sure they are consistent can be a problem, and one system (PROSPECTOR) had to implement special-purpose techniques to handle such inconsistencies (Duda, Hart, and Nilsson 1976). Therefore, it is a nice property of Bayesian networks that if you specify the required numbers (the probability of every node given all possible combinations of its parents), then (1) the numbers will be consistent and (2) the network will uniquely define a distribution. Furthermore, it is not too hard to see that this claim is true. To see it, we must first introduce the notion of joint distribution.

A joint distribution of a set of random variables v1, …, vn is defined as P(v1 ∧ … ∧ vn) for all values of v1, …, vn. That is, for the set of Boolean variables (a, b), we need the probabilities P(a ∧ b), P(a ∧ ¬b), P(¬a ∧ b), and P(¬a ∧ ¬b). A joint distribution for a set of random variables gives all the information there is about the distribution. For example, suppose we had the just-mentioned joint distribution for (a, b), and we wanted to compute, say, P(a | b):

P(a | b) = P(a ∧ b) / P(b) = P(a ∧ b) / (P(a ∧ b) + P(¬a ∧ b)) .

Note that for n Boolean variables, the joint distribution contains 2^n values. However, the sum of all the joint probabilities must be 1 because the probability of all possible outcomes must be 1. Thus, to specify the joint distribution, one needs to specify 2^n - 1 numbers, thus the 2^n - 1 in the last section.

I now show that the joint distribution for a Bayesian network is uniquely defined by the product of the individual distributions for each random variable. That is, for the network in figure 2 and for any combination of values fo, bp, lo, do, hb (for example, t, f, f, t, t), the joint probability is

P(fo ∧ bp ∧ lo ∧ do ∧ hb) = P(fo)P(bp)P(lo | fo)P(do | fo ∧ bp)P(hb | do) .
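As a quick check of what this product means (the arithmetic is mine, using the figure 2 numbers), take the assignment fo = t, bp = f, lo = t, do = t, hb = f:

P(fo ∧ ¬bp ∧ lo ∧ do ∧ ¬hb) = P(fo)P(¬bp)P(lo | fo)P(do | fo ∧ ¬bp)P(¬hb | do) = .15 * .99 * .6 * .90 * .3 ≈ .024 .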

Consider a network N consisting of variables v1, …, vn. Now, an easily proven law of probability is that

P(v1 ∧ … ∧ vn) = P(v1)P(v2 | v1) … P(vn | v1 ∧ … ∧ vn-1) .

This equation is true for any set of random variables. We use the equation to factor our joint distribution into the component parts specified on the right-hand side of the equation. Exactly how a particular joint distribution is factored according to this equation depends on how we order the random variables, that is, which variable we make v1, v2, and so on. For the proof, I use what is called a topological sort on the random variables. This sort is an ordering of the variables such that every variable comes before all its descendants in the graph. Let us assume that v1, …, vn is such an ordering. In figure 5, I show one such ordering for figure 1.
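Such an ordering is easy to compute. The following Python sketch (mine, using the figure 1 graph) repeatedly emits any node all of whose parents have already been emitted; many orderings are possible, and any of them works for the proof.

# Figure 1, as parent lists.
parents = {"family-out": [], "bowel-problem": [], "light-on": ["family-out"],
           "dog-out": ["family-out", "bowel-problem"], "hear-bark": ["dog-out"]}

def topological_sort(parents):
    """Repeatedly output a node all of whose parents have already been output."""
    remaining = dict(parents)
    order = []
    while remaining:
        ready = [n for n, ps in remaining.items() if all(p in order for p in ps)]
        order.extend(sorted(ready))   # sort only to make the output deterministic
        for n in ready:
            del remaining[n]
    return order

print(topological_sort(parents))
# ['bowel-problem', 'family-out', 'dog-out', 'light-on', 'hear-bark']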

Figure 5. A Topological Ordering. In this case, I made it a simple top-down numbering.

Let us consider one of the terms in this product, P(vj | v1 ∧ … ∧ vj-1). An illustration of what nodes v1, …, vj might look like is given in figure 6. In this graph, I show the nodes immediately above vj and otherwise ignore everything except vc, which we are concentrating on and which connects with vj in two different ways that we call the left and right paths, respectively. We can see from figure 6 that none of the conditioning nodes (the nodes being conditioned on in the conditional probability) in P(vj | v1 ∧ … ∧ vj-1) is below vj (in particular, vm is not a conditioning node). This condition holds because of the way in which we did the numbering.

Next, we want to show that all and only the parents of vj need be in the conditioning portion of this term in the factorization. To see that this is true, suppose vc is not immediately above vj but comes before vj in the numbering. Then any path between vc and vj must either be blocked by the nodes just above vj (as is the right path from vc in figure 6) or go through a node lower than vj (as is the left path in figure 6). In this latter case, the path is not d-connecting because it goes through a converging node vm where neither it nor any of its descendants is part of the conditioning nodes (because of the way we numbered). Thus, no path from vc to vj can be d-connecting, and we can eliminate vc from the conditioning section because, by the independence assumptions in Bayesian networks, vj is independent of vc given the other conditioning nodes. In this fashion, we can remove all the nodes from the conditioning case for P(vj | v1 ∧ … ∧ vj-1) except those immediately above vj. In figure 6, this reduction would leave us with P(vj | vj-1 ∧ vj-2). We can do this for all the nodes in the product. Thus, for figure 2, we get

P(fo ∧ bp ∧ lo ∧ do ∧ hb) = P(fo)P(bp)P(lo | fo)P(do | fo ∧ bp)P(hb | do) .

Figure 6. Node vj in a Network. I show that when conditioning vj only on its predecessors, its value is dependent only on its immediate predecessors, vj-1 and vj-2. (The figure shows vc connected to vj both by a right path through vj's parents and by a left path through a converging node vm below vj.)

We have shown that the numbers specified by the Bayesian network formalism in fact define a single joint distribution, thus uniqueness. Furthermore, if the numbers for each local distribution are consistent, then the global distribution is consistent. (Local consistency is just a matter of having the right numbers sum to 1.)

Evaluating Networks

As I already noted, the basic computation on belief networks is the computation of every node's belief (its conditional probability) given the evidence that has been observed so far. Probably the most important constraint on the use of Bayesian networks is the fact that, in general, this computation is NP-hard (Cooper 1987). Furthermore, the exponential time limitation can and does show up on realistic networks that people actually want solved. Depending on the particulars of the network, the algorithm used, and the care taken in the implementation, networks as small as tens of nodes can take too long, or networks in the thousands of nodes can be done in acceptable time.

The first issue is whether one wants an






exact solution (which is NP-hard) or if one can make do with an approximate answer (that is, the answer one gets is not exact but with high probability is within some small distance of the correct answer). I start with algorithms for finding exact solutions.

    Exact Solutions

Although evaluating Bayesian networks is, in general, NP-hard, there is a restricted class of networks that can efficiently be solved in time linear in the number of nodes. The class is that of singly connected networks. A singly connected network (also called a polytree) is one in which the underlying undirected graph has no more than one path between any two nodes. (The underlying undirected graph is the graph one gets if one simply ignores the directions on the edges.) Thus, for example, the Bayesian network in figure 5 is singly connected, but the network in figure 6 is not. Note that the direction of the arrows does not matter. The left path from vc to vj requires one to go against the direction of the arrow from vm to vj. Nevertheless, it counts as a path from vc to vj.

The algorithm for solving singly connected

Bayesian networks is complicated, so I do not give it here. However, it is not hard to see why the singly connected case is so much easier. Suppose we have the case sketchily illustrated in figure 7 in which we want to know the probability of e given particular values for a, b, c, and d. We specify that a and b are above e in the sense that the last step in going from them to e takes us along an arrow pointing down into e. Similarly, we assume c and d are below e in the same sense. Nothing in what we say depends on exactly how a and b are above e or how d and c are below. A little examination of what follows shows that we could have any two sets of evidence (possibly empty) being above and below e rather than the sets {a, b} and {c, d}. We have just been particular to save a bit on notation.

Figure 7. Nodes in a Singly Connected Network. Because of the singly connected property, any two nodes connected to node e have only one path between them: the path that goes through e.

What does matter is that there is only one way to get from any of these nodes to e and that the only way to get from any of the nodes a, b, c, d to any of the others (for example, from b to d) is through e. This claim follows from the fact that the network is singly connected. Given the singly connected condition, we show that it is possible to break up the problem of determining P(e | a ∧ b ∧ c ∧ d) into two simpler problems involving the network from e up and the network from e down. First, from Bayes's rule,

P(e | a ∧ b ∧ c ∧ d) = P(e)P(a ∧ b ∧ c ∧ d | e) / P(a ∧ b ∧ c ∧ d) .

Taking the second term in the numerator, we can break it up using conditioning:

P(e | a ∧ b ∧ c ∧ d) = P(e)P(a ∧ b | e)P(c ∧ d | a ∧ b ∧ e) / P(a ∧ b ∧ c ∧ d) .

Next, note that P(c ∧ d | a ∧ b ∧ e) = P(c ∧ d | e) because e separates a and b from c and d (by the singly connected condition). Substituting this term for the last term in the numerator and conditioning the denominator on a, b, we get

P(e | a ∧ b ∧ c ∧ d) = P(e)P(a ∧ b | e)P(c ∧ d | e) / (P(a ∧ b)P(c ∧ d | a ∧ b)) .

Next, we rearrange the terms to get

P(e | a ∧ b ∧ c ∧ d) = (P(e)P(a ∧ b | e) / P(a ∧ b)) (P(c ∧ d | e) / P(c ∧ d | a ∧ b)) .

Apply Bayes's rule in reverse to the first collection of terms, and we get

P(e | a ∧ b ∧ c ∧ d) = (P(e | a ∧ b)P(c ∧ d | e)) (1 / P(c ∧ d | a ∧ b)) .

We have now done what we set out to do. The first term only involves the material from e up and the second from e down. The last term involves both, but it need not be calculated. Rather, we solve this equation for all values of e (just true and false if e is Boolean). The last term remains the same, so we can calculate it by making sure that the probabilities for all the values of e sum to 1. Naturally, to make this sketch into a real algorithm for finding conditional probabilities for polytree Bayesian networks, we need to show how to calculate P(e | a ∧ b) and P(c ∧ d | e), but the ease with which we divided the problem into two distinct parts should serve to indicate that these calculations can efficiently be done. For a complete description of the algorithm, see Pearl (1988) or Neapolitan (1990).

Now, at several points in the previous discussion, we made use of the fact that the network was singly connected, so the same argument does not work for the general case.


However, exactly what is it that makes multiply connected networks hard? At first glance, it might seem that any belief network ought to be easy to evaluate. We get some evidence. Assume it is the value of a particular node. (If it is the values of several nodes, we just take one at a time, reevaluating the network as we consider each extra fact in turn.) It seems that we have located at every node all the information we need to decide on its probability. That is, once we know the probability of its neighbors, we can determine its probability. (In fact, all we really need is the probability of its parents.)

These claims are correct but misleading. In singly connected networks, a change in one neighbor of e cannot change another neighbor of e except by going through e itself. This is because of the single-connection condition. Once we allow multiple connections between nodes, calculations are not as easy. Consider figure 8. Suppose we learn that node d has the value true, and we want to know the conditional probabilities at node c. In this network, the change at d will affect c in more than one way. Not only does c have to account for the direct change in d but also the change in a that will be caused by d through b. Unlike before, these changes do not separate cleanly.

Figure 8. A Multiply Connected Network. There are two paths between node a and node d. (Node a is above b and c, and both b and c are above d.)

To evaluate multiply connected networks exactly, one has to turn the network into an equivalent singly connected one. There are a few ways to perform this task. The most common ways are variations on a technique called clustering. In clustering, one combines nodes until the resulting graph is singly connected. Thus, to turn figure 8 into a singly connected network, one can combine nodes b and c. The resulting graph is shown in figure 9. Note now that the node {b c} has as its values the cross-product of the values of b and c singly. There are well-understood techniques for producing the necessary local probabilities for the clustered network. Then one evaluates the network using the singly connected algorithm. The values for the variables from the original network can then be read off those of the clustered network. (For example, the values of b and c can easily be calculated from the values for {b c}.) At the moment, a variant of this technique proposed by Lauritzen and Spiegelhalter (1988) and improved by Jensen (1989) is the fastest exact algorithm for most applications. The problem, of course, is that the nodes one creates might have large numbers of values. A node that was the combination of 10 Boolean-valued nodes would have 1024 values. For dense networks, this explosion of values and worse can happen. Thus, one often considers settling for approximations of the exact value. We turn to this area next.

Figure 9. A Clustered, Multiply Connected Network. By clustering nodes b and c, we turned the graph of figure 8 into a singly connected network.
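To make the clustering step concrete, here is a small Python sketch (mine; the conditional probabilities are invented purely for illustration) that builds the local probabilities for the clustered node {b c} of figure 9 from tables for the figure 8 network (a above b and c, both above d). Because the only paths between b and c go through a or through the converging node d, b and c are independent given a, so the clustered table is just a product, and the table for d simply reads off the pair of values.

from itertools import product

# Invented CPTs for figure 8 (a -> b, a -> c, b -> d, c -> d), P(node = true | parents).
p_b = {True: 0.7, False: 0.2}                      # P(b | a)
p_c = {True: 0.4, False: 0.9}                      # P(c | a)
p_d = {(True, True): 0.95, (True, False): 0.6,     # P(d | b, c)
       (False, True): 0.5, (False, False): 0.05}

def p(prob_true, value):
    return prob_true if value else 1.0 - prob_true

# CPT for the clustered node {b c}: its values are pairs (vb, vc), and
# P({b c} = (vb, vc) | a) = P(b = vb | a) * P(c = vc | a).
p_bc = {(a, (vb, vc)): p(p_b[a], vb) * p(p_c[a], vc)
        for a, vb, vc in product([True, False], repeat=3)}

# CPT for d in the clustered network: P(d | {b c} = (vb, vc)) = P(d | b = vb, c = vc).
p_d_clustered = {(vb, vc): p_d[(vb, vc)] for vb, vc in product([True, False], repeat=2)}

# Each conditional distribution over the four values of {b c} still sums to 1.
for a in [True, False]:
    total = sum(p_bc[(a, pair)] for pair in product([True, False], repeat=2))
    assert abs(total - 1.0) < 1e-9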

    Approximate Solutions

There are a lot of ways to find approximations of the conditional probabilities in a Bayesian network. Which way is the best depends on the exact nature of the network.





However, many of the algorithms have a lot in common. Essentially, they randomly posit values for some of the nodes and then use them to pick values for the other nodes. One then keeps statistics on the values that the nodes take, and these statistics give the answer. To take a particularly clear case, the technique called logic sampling (Henrion 1988) guesses the values of the root nodes in accordance with their prior probabilities. Thus, if v is a root node, and P(v) = .2, one randomly chooses a value for this node but in such a way that it is true about 20 percent of the time. One then works one's way down the network, guessing the value of the next lower node on the basis of the values of the higher nodes. Thus, if, say, the nodes a and b, which are above c, have been assigned true and false, respectively, and P(c | a ∧ ¬b) = .8, then we pick a random number between 0 and 1, and if it is less than .8, we assign c to true; otherwise, false. We do this procedure all the way down and track how often each of our nodes is assigned to each of its values. Note that, as described, this procedure does not take evidence nodes into account. This problem can be fixed, and there are variations that improve it for such cases (Shacter and Peot 1989; Shwe and Cooper 1990). There are also different approximation techniques (see Horvitz, Suermondt, and Cooper [1989]). At the moment, however, there does not seem to be a single technique, either approximate or exact, that works well for all kinds of networks. (It is interesting that for the exact algorithms, the feature of the network that determines performance is the topology, but for the approximation algorithms, it is the quantities.) Given the NP-hard result, it is unlikely that we will ever get an exact algorithm that works well for all kinds of Bayesian networks. It might be possible to find an approximation scheme that works well for everything, but it might be that in the end, we will simply have a library of algorithms, and researchers will have to choose the one that best suits their problem.
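Here is a minimal Python sketch of logic sampling on the figure 2 network (the code is mine; handling evidence by simply throwing away samples that disagree with the observations is the most naive fix, not one of the improved variants cited above).

import random

# Figure 2 again: P(node = true | parent assignment), roots keyed by the empty tuple.
parents = {"fo": (), "bp": (), "lo": ("fo",), "do": ("fo", "bp"), "hb": ("do",)}
prob_true = {
    "fo": {(): 0.15}, "bp": {(): 0.01},
    "lo": {(True,): 0.60, (False,): 0.05},
    "hb": {(True,): 0.70, (False,): 0.01},
    "do": {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.97, (False, False): 0.30},
}
order = ["fo", "bp", "lo", "do", "hb"]   # a topological ordering: parents come first

def sample_once(rng):
    """Work down the network, guessing each node given its already-guessed parents."""
    values = {}
    for node in order:
        key = tuple(values[p] for p in parents[node])
        values[node] = rng.random() < prob_true[node][key]
    return values

def estimate(query, evidence, n=200_000, seed=0):
    """Estimate P(query = true | evidence), rejecting samples that contradict evidence."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        v = sample_once(rng)
        if all(v[name] == val for name, val in evidence.items()):
            kept += 1
            hits += v[query]
    return hits / kept if kept else None

# Should come out near the exact .5 for P(fo | lo, not hb) computed earlier.
print(estimate("fo", {"lo": True, "hb": False}))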

Finally, I should mention that for those who have Bayesian networks to evaluate but do not care to implement the algorithms themselves, at least two software packages are around that implement some of the algorithms I mentioned: IDEAL (Srinivas and Breese 1989, 1990) and HUGIN (Andersen 1989).

    Applications

As I stated in the introduction, Bayesian networks are now being used in a variety of applications. As one would expect, the most common is diagnosis problems, particularly medical diagnosis. A current example of the use of Bayesian networks in this area is PATHFINDER (Heckerman 1990), a program to diagnose diseases of the lymph node. A patient suspected of having a lymph node disease has a lymph node removed and examined by a pathologist. The pathologist examines it under a microscope, and the information gained thereby, possibly together with other tests on the node, leads to a diagnosis. PATHFINDER allows a physician to enter the information and get the conditional probabilities of the diseases given the evidence so far.

PATHFINDER also uses decision theory. Decision theory is a close cousin of probability theory in which one also specifies the desirability of various outcomes (their utility) and the costs of various actions that might be performed to affect the outcomes. The idea is to find the action (or plan) that maximizes the expected utility minus costs. Bayesian networks have been extended to handle decision theory. A Bayesian network that incorporates decision nodes (nodes indicating actions that can be performed) and value nodes (nodes indicating the values of various outcomes) is




called an influence diagram, a concept invented by Howard (Howard and Matheson 1981). In PATHFINDER, decision theory is used to choose the next test to be performed when the current tests are not sufficient to make a diagnosis. PATHFINDER has the ability to make treatment decisions as well but is not used for this purpose because the decisions seem to be sensitive to details of the utilities. (For example, how much treatment pain would you tolerate to decrease the risk of death by a certain percentage?)

PATHFINDER's model of lymph node diseases includes 60 diseases and over 130 features that can be observed to make the diagnosis. Many of the features have more than 2 possible outcomes (that is, they are not binary valued). (Nonbinary values are common for laboratory tests with real-number results. One could conceivably have the possible values of the random variable be the real numbers, but typically, to keep the number of values finite, one breaks the values into significant regions. I gave an example of this early on with earthquake, where we divided the Richter scale for earthquake intensities into 5 regions.) Various versions of the program have been implemented (the current one is PATHFINDER-4), and the use of Bayesian networks and decision theory has proven better than (1) MYCIN-style certainty factors (Shortliffe 1976), (2) Dempster-Shafer theory of belief (Shafer 1976), and (3) simpler Bayesian models (ones with less realistic independence assumptions). Indeed, the program has achieved expert-level performance and has been implemented commercially.

Bayesian networks are being used in less obvious applications as well. At Brown University, there are two such applications: map learning (the work of Ken Basye and Tom Dean) and story understanding (Robert Goldman and myself). To see how Bayesian networks can be used for map learning, imagine that a robot has gone down a particular corridor for the first time, heading, say, west. At some point, its sensors pick up some features that most likely indicate a corridor heading off to the north (figure 10). Because of its current task, the robot keeps heading west. Nevertheless, because of this information, the robot should increase the probability that a known east-west corridor, slightly to the north of the current one, will also intersect with this north-south corridor. In this domain, rather than having diseases that cause certain abnormalities, which, in turn, are reflected as test results, particular corridor layouts cause certain kinds of junctions between corridors, which, in turn, cause certain sensor readings. Just as in diagnosis, the problem is to reason backward from the tests to the diseases; in map learning, the problem is to reason backward from the sensor readings to the corridor layout (that is, the map). Here, too, the intent is to combine this diagnostic problem with decision theory, so the robot could weigh the alternative of deviating from its planned course to explore portions of the building for which it has no map.

Figure 10. Map Learning. Finding the north-south corridor makes it more likely that there is an intersection north of the robot's current location.

My own work on story understanding (Charniak and Goldman 1989a, 1991; Goldman 1990) depends on a similar analogy.




Imagine, to keep things simple, that the story we are reading was created when the writer observed some sequence of events and wrote them down so that the reader would know what happened. For example, suppose Sally is engaged in shopping at the supermarket. Our observer sees Sally get on a bus, get off at the supermarket, and buy some bread. He/she writes this story down as a string of English words. Now the disease is the high-level hypothesis about Sally's task (shopping). The intermediate levels would include things such as what the writer actually saw (which was things such as traveling to the supermarket; note that shopping is not immediately observable but, rather, has to be put together from simpler observations). The bottom layer in the network, the evidence, is now the English words that the author put down on paper.

In this framework, problems such as, say, word-sense ambiguity become intermediate random variables in the network. For example, figure 11 shows a simplified version of the network after the story "Sue ordered a milkshake. She picked up the straw." At the top, we see a hypothesis that Sue is eating out. Below this hypothesis is one that she will drink a milkshake (in a particular way called, there, straw-drinking). Because this action requires a drinking straw, we get a connection to this word sense. At the bottom of the network, we see the word straw, which could have been used if the author intended us to understand the word as describing either a drinking straw or some animal straw (that is, the kind animals sleep on). As one would expect for this network, the probability of drink-straw will be much higher given the evidence from the words because the evidence suggests a drinking event, which, in turn, makes a drinking straw more probable. Note that the program has a knowledge base that tells it how, in general, eating out relates to drinking (and, thus, to straw drinking), how straw drinking relates to straws, and so on. This knowledge base is then used to construct, on the fly, a Bayesian network (like the one in figure 11) that represents a particular story.

Figure 11. Bayesian Network for a Simple Story. Connecting straw to the earlier context makes the drink-straw reading more likely. (The nodes include eat-out, order, milk-shake, straw-drinking, drink-straw, and animal-straw, with the words "order," "milk-shake," and "straw" at the bottom as evidence.)

But Where Do the Numbers Come From?

One of the points I made in this article is the beneficial reduction in the number of parameters required by Bayesian networks. Indeed, if anything, I overstated how many numbers are typically required in a Bayesian network. For example, a common situation is to have several causes for the same result. This situation occurs when a symptom is caused by several diseases, or a person's action could be the

result of several plans. This situation is shown in figure 12. Assuming all Boolean nodes, the node fever would require 8 conditional probabilities. However, doctors would be unlikely to know such numbers. Rather, they might know that the probability of a fever is .8 given a cold; .98 given pneumonia; and, say, .4 given chicken pox. They would probably also say that the probability of fever given 2 of them is slightly higher than either alone. Pearl suggested that in such cases, we should specify the probabilities given individual causes but use stereotypical combination rules for combining them when more than 1 cause is present. The current case would be handled by Pearl's noisy-Or random variable. Thus, rather than specifying 8 numbers, we only need to specify 3. We thus require still fewer numbers.

Figure 12. Three Causes for a Fever. Viewing the fever node as a noisy-Or node makes it easier to construct the posterior distribution for it. (The causes shown are cold, pneumonia, and chicken-pox.)
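A minimal sketch of the noisy-Or combination (the code is mine; it assumes, as the noisy-Or model does, that each present cause has an independent chance of producing the fever and that the fever never appears without at least one of the causes, that is, no leak term):

# Probability that fever appears given each individual cause, from the text.
p_given_cause = {"cold": 0.8, "pneumonia": 0.98, "chicken-pox": 0.4}

def p_fever(present_causes):
    """Noisy-Or: fever is absent only if every present cause independently fails to produce it."""
    p_no_fever = 1.0
    for cause in present_causes:
        p_no_fever *= 1.0 - p_given_cause[cause]
    return 1.0 - p_no_fever

print(p_fever({"cold"}))                # about .8, as specified for a cold alone
print(p_fever({"cold", "pneumonia"}))   # about .996, slightly higher than either alone
print(p_fever(set()))                   # 0.0 under the no-leak assumption

With this rule, the three numbers above determine the entire 8-entry conditional table for fever.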

However, fewer numbers is not no numbers at all, and the skeptic might still wonder how the numbers that are still required are, in fact, obtained. In all the examples described previously, they are made up. Naturally, nobody actually makes this statement. What one really says is that they are elicited from an expert who subjectively assesses them. This statement sounds a lot better, but there is really nothing wrong with making up numbers. For one thing, experts are fairly good at it. In one study (Spiegelhalter, Franklin, and Bull 1989), doctors' assessments of the numbers required for a Bayesian network were compared to the numbers that were subsequently collected and found to be pretty close (except the doctors were typically too quick in saying that things had zero probability). I also suspect that some of the prejudice against making up numbers (but not, for example, against making up rules) is that one fears that any set of examples can be explained away by merely producing the appropriate numbers. However, with the reduced number set required by Bayesian networks, this fear is no longer justified; any reasonably extensive group of test examples overconstrains the numbers required.

In a few cases, of course, it might actually be possible to collect data and produce the required numbers in this way. When this is possible, we have the ideal case. Indeed, there is another way of using probabilities, where one constrains one's theories to fit one's data-collection abilities. Mostly, however, Bayesian network practitioners subjectively assess the probabilities they need.




Conclusions

Bayesian networks offer the AI researcher a convenient way to attack a multitude of problems in which one wants to come to conclusions that are not warranted logically but, rather, probabilistically. Furthermore, they allow you to attack these problems without the traditional hurdles of specifying a set of numbers that grows exponentially with the complexity of the model. Probably the major drawback to their use is the time of evaluation (exponential time for the general case). However, because a large number of people are now using Bayesian networks, there is a great deal of research on efficient exact solution methods as well as a variety of approximation schemes. It is my belief that Bayesian networks or their descendants are the wave of the future.

Acknowledgments

Thanks to Robert Goldman, Solomon Shimony, Charles Moylan, Dzung Hoang, Dilip Barman, and Cindy Grimm for comments on an earlier draft of this article and to Geoffrey Hinton for a better title, which, unfortunately, I could not use. This work was supported by the National Science Foundation under grant IRI-8911122 and the Office of Naval Research under contract N00014-88-K-0589.

    References

Andersen, S. 1989. HUGIN: A Shell for Building Bayesian Belief Universes for Expert Systems. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 1080-1085. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Charniak, E., and Goldman, R. 1991. A Probabilistic Model of Plan Recognition. In Proceedings of the Ninth National Conference on Artificial Intelligence, 160-165. Menlo Park, Calif.: American Association for Artificial Intelligence.

Charniak, E., and Goldman, R. 1989a. A Semantics for Probabilistic Quantifier-Free First-Order Languages with Particular Application to Story Understanding. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 1074-1079. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Charniak, E., and Goldman, R. 1989b. Plan Recognition in Stories and in Life. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 54-60. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Charniak, E., and McDermott, D. 1985. Introduction to Artificial Intelligence. Reading, Mass.: Addison-Wesley.

Cooper, G. F. 1987. Probabilistic Inference Using Belief Networks Is NP-Hard, Technical Report, KSL-87-27, Medical Computer Science Group, Stanford Univ.

Dean, T. 1990. Coping with Uncertainty in a Control System for Navigation and Exploration. In Proceedings of the Ninth National Conference on Artificial Intelligence, 1010-1015. Menlo Park, Calif.: American Association for Artificial Intelligence.

Duda, R.; Hart, P.; and Nilsson, N. 1976. Subjective Bayesian Methods for Rule-Based Inference Systems. In Proceedings of the American Federation of Information Processing Societies National Computer Conference, 1075-1082. Washington, D.C.: American Federation of Information Processing Societies.

Goldman, R. 1990. A Probabilistic Approach to Language Understanding, Technical Report, CS-90-34, Dept. of Computer Science, Brown Univ.



Hansson, O., and Mayer, A. 1989. Heuristic Search as Evidential Reasoning. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 152-161. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Heckerman, D. 1990. Probabilistic Similarity Networks, Technical Report, STAN-CS-1316, Depts. of Computer Science and Medicine, Stanford Univ.

Henrion, M. 1988. Propagating Uncertainty in Bayesian Networks by Logic Sampling. In Uncertainty in Artificial Intelligence 2, eds. J. Lemmer and L. Kanal, 149-163. Amsterdam: North Holland.

Horvitz, E.; Suermondt, H.; and Cooper, G. 1989. Bounded Conditioning: Flexible Inference for Decisions under Scarce Resources. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 182-193. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Howard, R. A., and Matheson, J. E. 1981. Influence Diagrams. In Applications of Decision Analysis, volume 2, eds. R. A. Howard and J. E. Matheson, 721-762. Menlo Park, Calif.: Strategic Decisions Group.

Jensen, F. 1989. Bayesian Updating in Recursive Graphical Models by Local Computations, Technical Report, R-89-15, Dept. of Mathematics and Computer Science, University of Aalborg.

Lauritzen, S., and Spiegelhalter, D. 1988. Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society 50:157-224.

Levitt, T.; Mullin, J.; and Binford, T. 1989. Model-Based Influence Diagrams for Machine Vision. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 233-244. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Neapolitan, E. 1990. Probabilistic Reasoning in Expert Systems. New York: Wiley.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, Calif.: Morgan Kaufmann.

Shacter, R., and Peot, M. 1989. Simulation Approaches to General Probabilistic Inference on Belief Networks. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 311-318. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton, N.J.: Princeton University Press.

Shortliffe, E. 1976. Computer-Based Medical Consultations: MYCIN. New York: American Elsevier.

Shwe, M., and Cooper, G. 1990. An Empirical Analysis of Likelihood-Weighting Simulation on a Large, Multiply Connected Belief Network. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, 498-508. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Spiegelhalter, D.; Franklin, R.; and Bull, K. 1989. Assessment, Criticism, and Improvement of Imprecise Subjective Probabilities for a Medical Expert System. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence, 335-342. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Srinivas, S., and Breese, J. 1990. IDEAL: A Software Package for Analysis of Influence Diagrams. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, 212-219. Mountain View, Calif.: Association for Uncertainty in Artificial Intelligence.

Srinivas, S., and Breese, J. 1989. IDEAL: Influence Diagram Evaluation and Analysis in Lisp, Technical Report, Rockwell International Science Center, Palo Alto, California.

Eugene Charniak is a professor of computer science and cognitive science at Brown University and the chairman of the Department of Computer Science. He received his B.A. degree in physics from the University of Chicago and his Ph.D. in computer science from the Massachusetts Institute of Technology. He has published three books: Computational Semantics, with Yorick Wilks (North Holland, 1976); Artificial Intelligence Programming (now in a second edition), with Chris Riesbeck, Drew McDermott, and James Meehan (Lawrence Erlbaum, 1980, 1987); and Introduction to Artificial Intelligence, with Drew McDermott (Addison-Wesley, 1985). He is a fellow of the American Association for Artificial Intelligence and was previously a councilor of the organization. He is on the editorial boards of the journals Cognitive Science (of which he was a founding editor) and Computational Linguistics. His research has been in the area of language understanding (particularly in the resolution of ambiguity in noun-phrase reference, syntactic analysis, case determination, and word-sense selection); plan recognition; and, more generally, abduction. In the last few years, his work has concentrated on the use of probability theory to elucidate these problems, particularly on the use of Bayesian nets (or belief nets) therein.
