Bayes_network.ppt



    BAYESIAN NETWORK


    References

    [1] Jiawei Han: Data Mining: Concepts and Techniques, ISBN 1-53860-489-8, Morgan Kaufmann Publishers.
    [2] Stuart Russell, Peter Norvig: Artificial Intelligence: A Modern Approach, Pearson Education.
    [3] Kandasamy, Thilagavati, Gunavati: Probability, Statistics and Queueing Theory, Sultan Chand Publishers.
    [4] D. Heckerman: A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, ed. M. I. Jordan, The MIT Press, 1998.
    [5] http://en.wikipedia.org/wiki/Bayesian_probability
    [6] http://www.construction.ualberta.ca/civ606/myFiles/Intro%20to%20Belief%20Network.pdf
    [7] http://www.murrayc.com/learning/AI/bbn.shtml
    [8] http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
    [9] http://en.wikipedia.org/wiki/Bayesian_belief_network


    CONTENTS

    HISTORY
    CONDITIONAL PROBABILITY
    BAYES THEOREM
    NAÏVE BAYES CLASSIFIER
    BELIEF NETWORK
    APPLICATION OF BAYESIAN NETWORK
    PAPER ON CYBER CRIME DETECTION


    HISTORY

    Bayesian probability was named after Reverend Thomas Bayes (1702-1761). He proved a special case of what is currently known as Bayes' Theorem. The term "Bayesian" came into use around the 1950s. Pierre-Simon, Marquis de Laplace (1749-1827) independently proved a generalized version of Bayes' Theorem.

    http://en.wikipedia.org/wiki/Bayesian_probability


    HISTORY (Cont.)

    1950s: New knowledge in Artificial Intelligence
    1958: Genetic Algorithms by Friedberg (Holland and Goldberg ~1985)
    1965: Fuzzy Logic by Zadeh at UC Berkeley
    1970: Bayesian Belief Networks at Stanford University (Judea Pearl 1988)

    The ideas proposed above were not fully developed until later. BBNs became popular in the 1990s.

    http://www.construction.ualberta.ca/civ606/myFiles/Intro%20to%20Belief%20Network.pdf


    HISTORY (Cont.)

    Current uses of Bayesian networks:
    Microsoft's printer troubleshooter
    Diagnosing diseases (Mycin)
    Predicting oil and stock prices
    Controlling the space shuttle
    Risk analysis: schedule and cost overruns


    CONDITIONAL PROBABILITY

    Probability: How likely is it that an event will happen?
    Sample space S
    An element of S is an elementary event.
    An event A is a subset of S.
    P(A) is the probability of event A; P(S) = 1.

    For events A and B, P(A|B) is the probability that event A occurs given that event B has already occurred.

    Example: There are 2 baskets. B1 has 2 red balls and 5 blue balls. B2 has 4 red balls and 3 blue balls. Find the probability of picking a red ball from basket 1.


    CONDITIONAL PROBABILITY

    The question above asks for P(red ball | basket 1), i.e. the probability of a red ball within the sample space of basket 1 only. So the answer is 2/7.

    The equations used to solve it:
    P(A|B) = P(A,B) / P(B)        [Product Rule]
    P(A,B) = P(A) x P(B)          [if A and B are independent]

    How do you solve P(basket 2 | red ball)?
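    A quick worked answer, as a minimal Python sketch. It assumes the basket is chosen uniformly at random (a prior of 1/2 each), which the slide does not state explicitly:

    ```python
    # Bayes' rule for P(basket 2 | red ball), assuming each basket is
    # equally likely to be chosen (prior 1/2 each - an assumption).
    p_basket = {"B1": 0.5, "B2": 0.5}            # prior P(basket)
    p_red_given = {"B1": 2 / 7, "B2": 4 / 7}     # likelihood P(red | basket)

    p_red = sum(p_basket[b] * p_red_given[b] for b in p_basket)   # total probability of red
    p_b2_given_red = p_red_given["B2"] * p_basket["B2"] / p_red   # posterior
    print(p_b2_given_red)   # 0.666..., i.e. 2/3
    ```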


    BAYESIAN THEOREM

    A special case of Bayes' Theorem:

    P(A,B) = P(B) x P(A|B)
    P(B,A) = P(A) x P(B|A)

    Since P(A,B) = P(B,A):
    P(B) x P(A|B) = P(A) x P(B|A)

    => P(A|B) = [P(A) x P(B|A)] / P(B)
              = [P(A) x P(B|A)] / [P(A) x P(B|A) + P(-A) x P(B|-A)]


    BAYESIAN THEOREM

    Example 2: A medical cancer diagnosis problem.

    There are 2 possible outcomes of a diagnosis: +ve, -ve. We know 0.8% of the world population has cancer. The test gives a correct +ve result 98% of the time and a correct -ve result 97% of the time.

    If a patient's test returns +ve, should we diagnose the patient as having cancer?


    BAYESIAN THEOREM

    P(cancer) = 0.008            P(-cancer) = 0.992
    P(+ve|cancer) = 0.98        P(-ve|cancer) = 0.02
    P(+ve|-cancer) = 0.03       P(-ve|-cancer) = 0.97

    Using Bayes' formula:
    P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve) = 0.98 x 0.008 / P(+ve) = 0.0078 / P(+ve)
    P(-cancer|+ve) = P(+ve|-cancer) x P(-cancer) / P(+ve) = 0.03 x 0.992 / P(+ve) = 0.0298 / P(+ve)

    Since 0.0298 > 0.0078, the patient most likely does not have cancer.
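    A minimal sketch of the same calculation, normalizing by P(+ve) so the two posteriors sum to 1 (the numbers are those on the slide):

    ```python
    # Cancer-diagnosis example: normalize the two products by P(+ve).
    p_cancer, p_no_cancer = 0.008, 0.992
    p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

    joint_cancer = p_pos_given_cancer * p_cancer            # 0.00784
    joint_no_cancer = p_pos_given_no_cancer * p_no_cancer   # 0.02976
    p_pos = joint_cancer + joint_no_cancer                  # P(+ve) = 0.0376

    print(joint_cancer / p_pos)      # P(cancer | +ve)  ~ 0.21
    print(joint_no_cancer / p_pos)   # P(-cancer | +ve) ~ 0.79
    ```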


    BAYESIAN THEOREM

    General Bayes' Theorem:

    Given E1, E2, ..., En are mutually disjoint events with P(Ei) ≠ 0 (i = 1, 2, ..., n),

    P(Ei|A) = [P(Ei) x P(A|Ei)] / Σi [P(Ei) x P(A|Ei)],   i = 1, 2, ..., n


    BAYESIAN THEOREM

    Example: There are 3 boxes. B1 has 2 white, 3 black and 4 red balls. B2 has 3 white, 2 black and 2 red balls. B3 has 4 white, 1 black and 3 red balls. A box is chosen at random and 2 balls are drawn; one is white and the other is red. What is the probability that they came from the first box?


    BAYESIAN THEOREM

    Let E1, E2, E3 denote the events of choosing B1, B2, B3 respectively. Let A be the event that the 2 balls selected are white and red.

    P(E1) = P(E2) = P(E3) = 1/3
    P(A|E1) = [2C1 x 4C1] / 9C2 = 2/9
    P(A|E2) = [3C1 x 2C1] / 7C2 = 2/7
    P(A|E3) = [4C1 x 3C1] / 8C2 = 3/7


    BAYESIAN THEOREM

    P(E1|A) = [P(E1) x P(A|E1)] / Σi [P(Ei) x P(A|Ei)] = 0.23727

    P(E2|A) = 0.30509

    P(E3|A) = 1 - (0.23727 + 0.30509) = 0.45764
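    The three posteriors can be checked with a short sketch; math.comb computes the nCr terms used above:

    ```python
    # Verifying the box example with the general Bayes' theorem.
    from math import comb

    priors = [1 / 3, 1 / 3, 1 / 3]                     # P(E1), P(E2), P(E3)
    likelihoods = [                                    # P(A | Ei): one white and one red
        comb(2, 1) * comb(4, 1) / comb(9, 2),          # box 1: 2 white, 4 red out of 9
        comb(3, 1) * comb(2, 1) / comb(7, 2),          # box 2: 3 white, 2 red out of 7
        comb(4, 1) * comb(3, 1) / comb(8, 2),          # box 3: 4 white, 3 red out of 8
    ]

    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]
    print(posteriors)   # ~[0.2373, 0.3051, 0.4576]
    ```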


    BAYESIAN CLASSIFICATION

    Why use Bayesian classification?

    Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.

    Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.


    BAYESIAN CLASSIFICATION

    Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.

    Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.


    NAÏVE BAYES CLASSIFIER

    A simplifying assumption: attributes are conditionally independent given the class.

    This greatly reduces the computation cost; only the class distribution has to be counted.


    NAÏVE BAYES CLASSIFIER

    The probabilistic model of the NBC is to find the probability of a certain class given multiple disjoint (assumed) events.

    The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1, a2, ..., an>. The learner is asked to predict the target value, or classification, for this new instance.


    NAÏVE BAYES CLASSIFIER

    Abstractly, the probability model for a classifier is a conditional model P(C|F1, F2, ..., Fn) over a dependent class variable C with a small number of outcomes, or classes, conditional on several feature variables F1, ..., Fn.

    Naïve Bayes formula:

    P(C|F1, F2, ..., Fn) = [P(C) x P(F1|C) x P(F2|C) x ... x P(Fn|C)] / P(F1, F2, ..., Fn)

    and the predicted class is the value c that maximizes the numerator.

    Since P(F1, F2, ..., Fn) is common to all classes, we do not need to evaluate the denominator for comparisons.


    NAÏVE BAYES CLASSIFIER

    Tennis-Example


    NAÏVE BAYES CLASSIFIER

    Problem:
    Use the training data from above to classify the following instances:
    a)
    b)


    NAÏVE BAYES CLASSIFIER

    Answer to (a):
    P(PlayTennis=yes) = 9/14 = 0.64
    P(PlayTennis=no) = 5/14 = 0.36
    P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
    P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
    P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
    P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
    P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
    P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
    P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
    P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
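    The final comparison for instance (a) did not survive extraction; a minimal sketch of that last step, assuming the instance is (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong), which is what the listed conditional probabilities suggest:

    ```python
    # Finishing answer (a): multiply the class prior by the four conditionals per class.
    p_yes = (9 / 14) * (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)   # ~0.0053
    p_no = (5 / 14) * (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)    # ~0.0206
    print("PlayTennis =", "yes" if p_yes > p_no else "no")      # -> no
    ```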


    NAÏVE BAYES CLASSIFIER

    Answer to (b):
    P(PlayTennis=yes) = 9/14 = 0.64
    P(PlayTennis=no) = 5/14 = 0.36
    P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
    P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
    P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
    P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
    P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
    P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
    P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
    P(Wind=strong|PlayTennis=no) = 3/5 = 0.60


    NAÏVE BAYES CLASSIFIER

    Estimating probabilities:

    In the previous example, P(overcast|no) = 0, which makes the product
    P(no) x P(overcast|no) x P(cool|no) x P(high|no) x P(strong|no) = 0.0.

    This causes problems in the comparison because the other probabilities are not considered. We can avoid this difficulty by using the m-estimate.


    NAÏVE BAYES CLASSIFIER

    M-estimate formula:

    (c + k) / (n + m), where c/n is the original probability estimate used before, k = 1, and m is the equivalent sample size (here, the number of possible values of the attribute).

    Using this method, our new probability values are given below.
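    A small sketch of how the smoothed values listed below arise, assuming k = 1 and m equal to the number of possible values of the attribute being estimated:

    ```python
    # m-estimate: (c + k) / (n + m), with k = 1 and m = number of attribute values.
    def m_estimate(c, n, num_values, k=1):
        return (c + k) / (n + num_values)

    print(m_estimate(0, 5, 3))   # P(Outlook=overcast | no)  -> 0.125 (~0.13)
    print(m_estimate(4, 9, 3))   # P(Outlook=overcast | yes) -> 0.417 (~0.42)
    print(m_estimate(3, 5, 2))   # P(Wind=strong | no)       -> 0.571 (~0.57)
    ```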


    NAÏVE BAYES CLASSIFIER

    New answer to (b):
    P(PlayTennis=yes) = 10/16 = 0.63
    P(PlayTennis=no) = 6/16 = 0.37
    P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42
    P(Outlook=overcast|PlayTennis=no) = 1/8 = 0.13
    P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33
    P(Temperature=cool|PlayTennis=no) = 2/8 = 0.25
    P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36
    P(Humidity=high|PlayTennis=no) = 5/7 = 0.71
    P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36
    P(Wind=strong|PlayTennis=no) = 4/7 = 0.57


    NAÏVE BAYES CLASSIFIER

    P(yes) x P(overcast|yes) x P(cool|yes) x P(high|yes) x P(strong|yes) = 0.011
    P(no) x P(overcast|no) x P(cool|no) x P(high|no) x P(strong|no) = 0.00486

    So the class of this instance is yes.


    NAÏVE BAYES CLASSIFIER

    The conditional probability values of all the attributes with respect to the class are pre-computed and stored on disk. This prevents the classifier from computing the conditional probabilities every time it runs; the stored data can be reused to reduce the computation on subsequent runs.


    BAYESIAN BELIEF NETWORK

    In the naïve Bayes classifier we make the assumption of class conditional independence, that is, given the class label of a sample, the values of the attributes are conditionally independent of one another. However, there can be dependences between values of attributes. To handle this we use a Bayesian belief network, which provides a joint conditional probability distribution.

    A Bayesian network is a form of probabilistic graphical model. Specifically, a Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables.


    BAYESIAN BELIEF NETWORK

    A Bayesian network is a representation of the joint distribution over all the variables represented by nodes in the graph. Let the variables be X(1), ..., X(n), and let Parents(A) be the parents of node A. Then the joint distribution for X(1) through X(n) is represented as the product of the probability distributions P(Xi | Parents(Xi)) for i = 1 to n. If X has no parents, its probability distribution is said to be unconditional, otherwise it is conditional.


    BAYESIAN BELIEF NETWORK

    By the chain rule of probability, the joint probability of all the nodes in the graph above is:

    P(C, S, R, W) = P(C) x P(S|C) x P(R|C) x P(W|S,R)

    where W = Wet Grass, C = Cloudy, R = Rain, S = Sprinkler.

    Example: P(W, -R, S, C) = P(W|S,-R) x P(-R|C) x P(S|C) x P(C) = 0.9 x 0.2 x 0.1 x 0.5 = 0.009
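    A sketch of this chain-rule factorization in Python. The CPT values below are assumptions (the figure with the tables is not in the text); they are the values commonly used for this sprinkler example and they reproduce the slide's 0.009:

    ```python
    # Chain-rule factorization for the Cloudy/Sprinkler/Rain/WetGrass network.
    # CPT values are assumed (the slide's figure is missing).
    P_C = 0.5                                     # P(Cloudy = true)
    P_S = {True: 0.1, False: 0.5}                 # P(Sprinkler = true | Cloudy)
    P_R = {True: 0.8, False: 0.2}                 # P(Rain = true | Cloudy)
    P_W = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.00}  # P(WetGrass = true | Sprinkler, Rain)

    def joint(c, s, r, w):
        """P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)."""
        pc = P_C if c else 1 - P_C
        ps = P_S[c] if s else 1 - P_S[c]
        pr = P_R[c] if r else 1 - P_R[c]
        pw = P_W[(s, r)] if w else 1 - P_W[(s, r)]
        return pc * ps * pr * pw

    # The slide's example: P(W, -R, S, C) = 0.9 * 0.2 * 0.1 * 0.5 = 0.009
    print(joint(c=True, s=True, r=False, w=True))
    ```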


    BAYESIAN BELIEF NETWORK

    What is the probability of wet grass on a given day, P(W)?

    P(W) = P(W|S,R) x P(S) x P(R)
         + P(W|S,-R) x P(S) x P(-R)
         + P(W|-S,R) x P(-S) x P(R)
         + P(W|-S,-R) x P(-S) x P(-R)

    where P(S) = P(S|C) x P(C) + P(S|-C) x P(-C)
          P(R) = P(R|C) x P(C) + P(R|-C) x P(-C)

    P(W) = 0.5985
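    A sketch of the slide's marginalization, using the same assumed CPTs: with P(S|C) = 0.1, P(S|-C) = 0.5, P(R|C) = 0.8 and P(R|-C) = 0.2, the marginals come out as P(S) = 0.3 and P(R) = 0.5, and the sum reproduces 0.5985:

    ```python
    # The slide's marginalization for P(WetGrass), with assumed CPT values.
    P_C = 0.5
    P_S_given = {True: 0.1, False: 0.5}   # P(S | C), P(S | -C)
    P_R_given = {True: 0.8, False: 0.2}   # P(R | C), P(R | -C)
    P_W_given = {(True, True): 0.99, (True, False): 0.90,
                 (False, True): 0.90, (False, False): 0.00}

    P_S = P_S_given[True] * P_C + P_S_given[False] * (1 - P_C)   # 0.3
    P_R = P_R_given[True] * P_C + P_R_given[False] * (1 - P_C)   # 0.5

    P_W = sum(P_W_given[(s, r)]
              * (P_S if s else 1 - P_S)
              * (P_R if r else 1 - P_R)
              for s in (True, False) for r in (True, False))
    print(P_W)   # 0.5985
    ```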


    Problem???

    A real-world Bayesian network application: learning to classify text.

    Instances are text documents. We might wish to learn the target concept "electronic news articles that I find interesting" or "pages on the World Wide Web that discuss data mining topics". In both cases, if a computer could learn the target concept accurately, it could automatically filter the large volume of online text documents to present only the most relevant documents to the user.


    TECHNIQUE

    Learning how to classify text, based on the naive Bayes classifier. This is a probabilistic approach and is among the most effective algorithms currently known for learning to classify text documents. The instance space X consists of all possible text documents. Given training examples of some unknown target function f(x), which can take on any value from some finite set V, we will consider the target function of classifying documents as interesting or uninteresting to a particular person, using the target values "like" and "dislike" to indicate these two classes.


    Design issues

    how to represent an arbitrary text document in terms of attribute values

    decide how to estimate the probabilities required by the naive Bayes classifier


    ASSUMPTIONS

    Assume we are given a set of 700 training documents that a friend has classified as "dislike" and another 300 she has classified as "like". We are now given a new document and asked to classify it. Let us assume the new text document is the preceding paragraph.


    We know P(like) = 0.3 and P(dislike) = 0.7 in the current example.

    P(ai = wk | vj): here we introduce wk to indicate the k-th word in the English vocabulary.

    Estimating the class conditional probabilities (e.g., P(ai = "our" | dislike)) is more problematic because we must estimate one such probability term for each combination of text position, English word, and target value. There are approximately 50,000 distinct words in the English vocabulary, 2 possible target values, and 111 text positions in the current example, so we must estimate 2 x 111 x 50,000 ≈ 10 million such terms from the training data.


    Final Algorithm

    LEARN_NAIVE_BAYES_TEXT(Examples, V)
    Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk|vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).

    1. Collect all words, punctuation, and other tokens that occur in Examples:
       Vocabulary <- the set of all distinct words and tokens occurring in any text document from Examples

    2. Calculate the required P(vj) and P(wk|vj) probability terms. For each target value vj in V do:
       docs_j <- the subset of documents from Examples for which the target value is vj
       P(vj) <- |docs_j| / |Examples|
       Text_j <- a single document created by concatenating all members of docs_j
       n <- total number of distinct word positions in Text_j
       For each word wk in Vocabulary:
         n_k <- number of times word wk occurs in Text_j
         P(wk|vj) <- (n_k + 1) / (n + |Vocabulary|)

    CLASSIFY_NAIVE_BAYES_TEXT(Doc)
    Return the estimated target value for the document Doc; ai denotes the word found in the i-th position within Doc.
       positions <- all word positions in Doc that contain tokens found in Vocabulary
       Return v_NB, where v_NB = argmax over vj in V of P(vj) x Π (over i in positions) P(ai|vj)
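    A compact Python sketch of the two procedures described above; the function and variable names are illustrative, not from the slides:

    ```python
    # Minimal sketch of LEARN/CLASSIFY_NAIVE_BAYES_TEXT as described above.
    from collections import Counter
    from math import log

    def learn(examples):
        """examples: list of (list_of_words, label). Returns vocabulary, priors, word probs."""
        vocabulary = {w for words, _ in examples for w in words}
        labels = {label for _, label in examples}
        priors, word_probs = {}, {}
        for label in labels:
            docs = [words for words, l in examples if l == label]
            priors[label] = len(docs) / len(examples)
            text = [w for words in docs for w in words]          # concatenate the class's documents
            counts, n = Counter(text), len(text)
            word_probs[label] = {w: (counts[w] + 1) / (n + len(vocabulary))
                                 for w in vocabulary}            # (n_k + 1) / (n + |Vocabulary|)
        return vocabulary, priors, word_probs

    def classify(doc_words, vocabulary, priors, word_probs):
        """Return argmax_v P(v) * prod_i P(a_i | v), computed with log-probabilities."""
        def score(label):
            s = log(priors[label])
            for w in doc_words:
                if w in vocabulary:                              # unknown words are ignored
                    s += log(word_probs[label][w])
            return s
        return max(priors, key=score)

    docs = [("fun exciting great".split(), "like"),
            ("boring dull boring".split(), "dislike")]
    model = learn(docs)
    print(classify("great fun".split(), *model))                 # -> "like"
    ```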


    During learning, the procedure LEARN_NAIVE_BAYES_TEXT examines all training documents to extract the vocabulary of all words and tokens that appear in the text, then counts their frequencies among the different target classes to obtain the necessary probability estimates. Later, given a new document to be classified, the procedure CLASSIFY_NAIVE_BAYES_TEXT uses these probability estimates to calculate v_NB according to the equation above. Note that any words appearing in the new document that were not observed in the training set are simply ignored by CLASSIFY_NAIVE_BAYES_TEXT.


    Effectiveness of the Algorithm

    Problem: classifying Usenet news articles. The target classification for an article is the name of the Usenet newsgroup in which the article appeared. In the experiment described by Joachims (1996), 20 electronic newsgroups were considered. 1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents. The naive Bayes algorithm was then applied using two-thirds of these 20,000 documents as training examples, and performance was measured over the remaining third. The 100 most frequent words were removed (these include words such as "the" and "of"), and any word occurring fewer than three times was also removed. The resulting vocabulary contained approximately 38,500 words. The accuracy achieved by the program was 89%.

    The 20 newsgroups:
    comp.graphics             misc.forsale        soc.religion.christian   alt.atheism
    comp.os.ms-windows.misc   rec.autos           talk.politics.guns       sci.space
    comp.sys.ibm.pc.hardware  rec.sport.baseball  talk.politics.mideast    sci.crypt
    comp.windows.x            rec.motorcycles     talk.politics.misc       sci.electronics
    comp.sys.mac.hardware     rec.sport.hockey    talk.religion.misc       sci.med


    APPLICATIONS

    A newsgroup posting service that learns to assign documents to the appropriate newsgroup.

    The NEWSWEEDER system: a program for reading netnews that allows the user to rate articles as he or she reads them. NEWSWEEDER then uses these rated articles (i.e., its learned profile of user interests) to suggest the most highly rated new articles each day.

    Naive Bayes spam filtering using word-position-based attributes.


    Thank you


    Bayesian Learning Networks Approach to Cybercrime Detection

    N S ABOUZAKHAR, A GANI and G MANSON
    The Centre for Mobile Communications Research (C4MCR),
    University of Sheffield, Sheffield
    Regent Court, 211 Portobello Street,
    Sheffield S1 4DP, UK
    [email protected]
    [email protected]
    [email protected]

    M ABUITBEL and D KING
    The Manchester School of Engineering,
    University of Manchester
    IT Building, Room IT 109,
    Oxford Road,
    Manchester M13 9PL, UK
    [email protected]
    [email protected]


    REFERENCES

    1. David J. Marchette, Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint, 2001, Springer-Verlag, New York, Inc, USA.
    2. Heckerman, D. (1995), A Tutorial on Learning with Bayesian Networks, Technical Report MSR-TR-95-06, Microsoft Corporation.
    3. Michael Berthold and David J. Hand, Intelligent Data Analysis: An Introduction, 1999, Springer, Italy.
    4. http://www.ll.mit.edu/IST/ideval/data/data_index.html, accessed on 01/12/2002.
    5. http://kdd.ics.uci.edu/, accessed on 01/12/2002.
    6. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2000, Morgan Kaufmann, USA.
    7. http://www.bayesia.com, accessed on 20/12/2002.



    Motivation behind the paper

    Growing dependence of modern society on telecommunication and information networks. The increase in the number of networks interconnected to the Internet has led to an increase in security threats and cyber crimes.


    Structure of the paper

    In order to detect distributed network attacks as early as possible, a probabilistic approach based on Bayesian networks, currently under research and development, has been proposed.


    Where can this model be utilized?

    Learning agents which deploy the Bayesian network approach are considered to be a promising and useful tool in determining suspicious early events of Internet threats.


    Bayesian Networks

    Before we look at the details given in the paper, let us understand what Bayesian networks are and how they are constructed.


    Bayesian Networks

    A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions.

    Syntax:
    a set of nodes, one per variable
    a directed, acyclic graph (a link means "directly influences")
    a conditional distribution for each node given its parents: P(Xi | Parents(Xi))

    In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.


    Some conventions:

    Variables are depicted as nodes.
    Arcs represent probabilistic dependence between variables.
    Conditional probabilities encode the strength of the dependencies.
    Missing arcs imply conditional independence.


    Semantics

    The full joint distribution is defined as the product of the local conditional distributions:

    P(X1, ..., Xn) = Π (i = 1 to n) P(Xi | Parents(Xi))

    e.g., P(j, m, a, ¬b, ¬e) = P(j|a) x P(m|a) x P(a|¬b,¬e) x P(¬b) x P(¬e)
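    These variables are the burglary-alarm example from Russell & Norvig [2] (j = JohnCalls, m = MaryCalls, a = Alarm, b = Burglary, e = Earthquake). A small sketch with the textbook's usual CPT values, which are assumptions here since the slide does not list them:

    ```python
    # Evaluating P(j, m, a, not-b, not-e) for the burglary-alarm network.
    # CPT values are the usual textbook numbers, assumed here (not in the slides).
    P_b = 0.001                               # P(Burglary)
    P_e = 0.002                               # P(Earthquake)
    P_a_given = {(True, True): 0.95, (True, False): 0.94,
                 (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
    P_j_given_a = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
    P_m_given_a = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

    p = (P_j_given_a[True] * P_m_given_a[True]
         * P_a_given[(False, False)] * (1 - P_b) * (1 - P_e))
    print(p)   # ~0.000628
    ```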


    Example of Construction of a BN


    Back to the discussion of the paper.


    Description

    This paper shows how a Bayesian network probabilistically detects communication network attacks, allowing for generalization of Network Intrusion Detection Systems (NIDSs).


    Goal

    How well does our model detect or classify attacks and respond to them later on?

    The system requires the estimation of two quantities:
    The probability of detection (PD)
    The probability of false alarm (PFA)

    It is not possible to simultaneously achieve a PD of 1 and a PFA of 0.


    Construction of the network

    The following figure shows the Bayesian network that has been automatically constructed by the learning algorithms of BayesiaLab. The target variable, activity_type, is directly connected to the variables that heavily contribute to its knowledge, such as service and protocol type.

    http://www.bayesia.com/GB/produits/bLab/BLabApprentissage.php

    Data Gathering

    MIT Lincoln Labs set up an environment to acquire several weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. The generated raw dataset contains about a few million connection records.

    Mapping the simple Bayesian Network that we saw to the one used in the paper


    Observation 1:

    As shown in the next figure, the most probable activity corresponds to a smurf attack (52.90%), an ecr_i (ECHO_REPLY) service (52.96%) and an icmp protocol (53.21%).


    Observation 2:

    What would happen if the probability of receiving ICMP protocol packets is increased? Would the probability of having a smurf attack increase? Setting the protocol to its ICMP value increases the probability of having a smurf attack from 52.90% to 99.37%.


    Observation 3:

    Let's look at the problem from the opposite direction. If we set the probability of a portsweep attack to 100%, then the values of some associated variables would inevitably vary. We note from Figure 4 that the probabilities of the TCP protocol and private service have been increased from 38.10% to 97.49% and from 24.71% to 71.45% respectively. Also, we can notice an increase in the REJ and RSTR flags.



    Benefits of the Bayesian Model

    The benefit of using Bayesian IDSs is the ability to adjust our IDS's sensitivity. This allows us to trade off between accuracy and sensitivity. Furthermore, the automatic detection of network anomalies by learning allows distinguishing the normal activities from the abnormal ones. It also allows network security analysts to see the amount of information being contributed by each variable in the detection model to the knowledge of the target node.

    http://www.bayesia.com/GB/produits/bLab/BLabAnalyse.php

    Performance evaluation


    Thank you

    QUESTIONS OR QUERIES