
K236: Basis of Data Analytics
Lecture 9: Classification and prediction

Bayesian Classification

Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai


Schedule of K236

1. Introduction to data science (1) 6/9
2. Introduction to data science (2) 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and Examination (the date is not fixed) 7/27


Outline

1. About Bayesian classification
2. Naïve Bayesian classification
3. Bayesian belief networks

Bayesians in machine learning


David Heckerman, Judea Pearl, Michael Jordan

Example

A patient takes a lab test for a cancer disease and the result comes back positive. It is known that:
• Among people who actually have this cancer, 98% get a positive test result (+)
• Among people who actually don't have this cancer, 97% get a negative result (−)
• Furthermore, 0.008 of the entire population have this cancer

Does the patient have cancer or not?

Example

Does the patient have cancer or not?

The patient takes a lab test for a cancer disease and the result comes back positive (+). It is known that:
• Among people who actually have this cancer, 98% get a positive test result (+)
• Among people who actually don't have this cancer, 97% get a negative result (−)
• Furthermore, 0.008 of the entire population have this cancer

P(cancer | +) = ?
P(+ | cancer) = 0.98
P(− | ¬cancer) = 0.97
P(cancer) = 0.008

Question: Can we compute P(cancer | +) from the other three probabilities?

Example

• What is the probability of a patient having a positive test result, P(+) = ?

• A patient may have a positive test result either when he has this cancer or when he does not have this cancer:

P(+) = P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer)

• Consider:
  - P(cancer) is the probability of a hypothesis h; denote it P(h)
  - The positive test result is the evidence; denote its probability P(+), or P(D)
  - The probability of h given the evidence is P(h | D)
  - The probability of the evidence given h is P(D | h)

Reverend Thomas Bayes (1702-1761)

He set down his findings on probability in "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), published posthumously in the Philosophical Transactions of the Royal Society of London.

Bayes’ Theorem

P(h | D) = P(D | h) P(h) / P(D)

P(A | B) = P(B | A) P(A) / P(B)

Bayes' theorem describes the probability of a hypothesis, based on conditions that might be related to the hypothesis.

Bayes' theorem

• The essence of Bayes' theorem is that it tells us how to update our initial probability P(h) when we see evidence D, in order to find P(h | D):

P(h | D) = P(D | h) P(h) / P(D)

P(h | D) = P(D | h) · P(h) / P(D) = P(D | h) · P(h) / [ P(D | h) · P(h) + P(D | ¬h) · P(¬h) ]

• P(h): the prior probability
• P(D | h): the conditional probability (likelihood) ← coming from the data
• P(h | D): the posterior probability
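As a minimal sketch (Python; not part of the original slides, and the function name is illustrative), the expanded rule above can be coded directly:

```python
def posterior(p_h, p_d_given_h, p_d_given_not_h):
    """Return P(h|D) for a binary hypothesis h, via the expanded Bayes rule."""
    p_not_h = 1.0 - p_h                                       # P(~h)
    p_d = p_d_given_h * p_h + p_d_given_not_h * p_not_h       # P(D), total probability
    return p_d_given_h * p_h / p_d                            # Bayes' theorem

# Cancer example: P(cancer) = 0.008, P(+|cancer) = 0.98, P(+|~cancer) = 0.03
print(posterior(0.008, 0.98, 0.03))   # ~0.21 (the slides round intermediates, giving 0.20745)
```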

Example

P(h | D) = P(D | h) P(h) / P(D)

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)

P(+) = P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer)
P(+) = 0.98 × 0.008 + 0.03 × 0.992 = 0.0078 + 0.0298

P(cancer | +) = 0.0078 / (0.0078 + 0.0298) = 0.20745

P(cancer) = 0.008      P(+ | cancer) = 0.98      P(− | cancer) = 0.02
P(¬cancer) = 0.992     P(+ | ¬cancer) = 0.03     P(− | ¬cancer) = 0.97

P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078

To say whether the patient has cancer or not, we also need to know P(¬cancer | +).

Example

• Comparing the two probabilities, we say that the patient does not have this cancer:

P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+) = 0.0298 / (0.0078 + 0.0298) = 0.79255

P(cancer | +) = P(+ | cancer) P(cancer) / P(+) = 0.0078 / (0.0078 + 0.0298) = 0.20745

• Assume that several hypotheses h1, h2, … (belonging to a hypothesis space H) relate to the evidence D. We want the most probable hypothesis given the evidence D.

• The rule for choosing a hypothesis is the maximum a posteriori hypothesis h_MAP:

h_MAP = arg max_{h ∈ H} P(h | D)

arg max_x f(x) is the argument y such that f(x) ≤ f(y) for all x, i.e., the argument that gives the maximum value.
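To make the MAP decision concrete, here is a small sketch (Python; not from the slides) that compares the two unnormalized posteriors P(+|h) P(h); the common denominator P(+) cancels out of the arg max:

```python
# Unnormalized posteriors P(+|h) * P(h) for the two hypotheses
scores = {
    "cancer":     0.98 * 0.008,   # P(+|cancer)  * P(cancer)   = 0.00784
    "not_cancer": 0.03 * 0.992,   # P(+|~cancer) * P(~cancer)  = 0.02976
}
h_map = max(scores, key=scores.get)   # arg max over the hypothesis space
print(h_map)                          # 'not_cancer' -> the patient is predicted not to have cancer
```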

Choosing hypotheses

• Maximum a posteriori (MAP) hypothesis:

h_MAP = arg max_{h ∈ H} P(h | D) = arg max_{h ∈ H} P(D | h) P(h) / P(D)

h_MAP = arg max_{h ∈ H} P(D | h) P(h)

• Maximum likelihood (ML) hypothesis (MLE: maximum likelihood estimation): if we assume P(h_i) = P(h_j) for all h_i, h_j ∈ H, then we can further simplify and choose

h_ML = arg max_{h_i ∈ H} P(D | h_i)
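A small illustration (Python; the numbers are assumed, not from the slides) of how h_MAP and h_ML can differ over a finite hypothesis space when the priors are uneven:

```python
priors      = {"h1": 0.7, "h2": 0.2, "h3": 0.1}   # P(h)
likelihoods = {"h1": 0.2, "h2": 0.4, "h3": 0.9}   # P(D|h)

h_ml  = max(likelihoods, key=likelihoods.get)                   # ignores the prior -> 'h3'
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])   # weights by the prior -> 'h1'
print(h_ml, h_map)
```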

Probability vs. Likelihood

• Likelihood captures the idea that something is likely to happen or to have happened. Informally, "likelihood" is often used as a synonym for "probability".

• Probability is used before data are available, to describe possible future outcomes given a fixed value of the parameter (or parameter vector).

• Likelihood is used after data are available, to describe a function of a parameter (or parameter vector) for a given outcome.

• The likelihood of a parameter θ given data X equals the probability of the observed data given that parameter value: L(θ | X) = P(X | θ)

• Example:
  - If a variable X (e.g., a TOEFL score) follows N(500, 50), we can compute P[450 < X < 550].
  - Given data of TOEFL scores, which parameters μ and σ correspond to the data? (See the sketch below.)
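A sketch of the TOEFL example (Python with numpy/scipy; the sample scores are hypothetical) contrasting the two uses:

```python
import numpy as np
from scipy.stats import norm

# Probability: parameters fixed (mu = 500, sigma = 50), outcome uncertain
model = norm(loc=500, scale=50)
print(model.cdf(550) - model.cdf(450))               # P[450 < X < 550] ~ 0.683

# Likelihood: data fixed, parameters unknown -> maximum likelihood estimates
scores = np.array([470, 520, 540, 480, 510, 495])    # hypothetical observed scores
mu_hat, sigma_hat = scores.mean(), scores.std()      # MLEs for a normal model
print(mu_hat, sigma_hat)
```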


• Bayesian classification is classification based on Bayes theorem.

• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes.

• Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes.

What is Bayesian classification?


• Let X be an object whose class label is unknown.

• Let h be some hypothesis, e.g., that X belongs to class C.

• For classification, we want to determine the posterior probability P(h|X) of h conditioned on X.

• Example: the data objects are fruits, described by color and shape.
  - Suppose X is red and round, and h is the hypothesis that X is an apple.
  - P(h|X) is the probability that X is an apple given that we have seen that X is red and round.

P(apple | red and round) = ?

Bayes theorem


• In contrast, P(h) is the prior probability of h. In our example, P(h) is the probability that any given data object is an apple, regardless of how the data sample looks (independent of X).

• P(X|h) is the likelihood of X given h, that is, the probability that X is red and round given that we know that X is an apple.

• P(X), P(h), and P(X|h) may be estimated from the given data. Bayes theorem allows us to calculate P(h|X):

P(h|X) = P(X|h) P(h) / P(X)

P(apple | red ∧ round) = P(red ∧ round | apple) P(apple) / P(red ∧ round)

Here P(h) is the prior probability of h, i.e., the probability that the fruit is an apple.


• Suppose X = (x1, x2, …, xn), with attributes A1, A2, …, An

• There are m classes C1, C2, …, Cm

• P(Ci|X) denotes the probability that X is classified into class Ci.

• Example:
  P(class = N | outlook = sunny, temperature = hot, humidity = high, wind = strong)

• Idea: assign to object X the class label Ci that achieves the maximum posterior probability (h_MAP), i.e., the Ci for which P(Ci|X) is maximal:

P(Ci|X) > P(Cj|X), ∀j, j ≠ i

Naïve Bayesian classification


• Bayes theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• P(X) is constant, so we only need to maximize P(X|Ci) P(Ci)

• The Ci such that P(Ci|X) is maximum is the Ci such that P(X|Ci) · P(Ci) is maximum

• If the prior probabilities are unknown, it is commonly assumed that

P(C1) = P(C2) = … = P(Cm),

and we then maximize P(X|Ci)

• Otherwise, P(Ci) = relative frequency of class Ci = n_i / n, where n_i is the number of training objects in class Ci and n is the total number of training objects

• Problem: computing P(X|Ci) directly is infeasible!

Estimating a posteriori probabilities


• Naïve assumption: we have P(X|Ci) = P(x1, …, xn|Ci); if the attributes are independent, then P(X|Ci) = P(x1|Ci) × … × P(xn|Ci)

• If Ak is categorical, let n_ik be the number of training objects of class Ci having the value xk for Ak, and n_i the number of training objects belonging to Ci; then

P(xk|Ci) = n_ik / n_i

If Ak is continuous, then P(xk|Ci) is estimated through a Gaussian density.

• To classify an unknown object X, P(X|Ci) P(Ci) is evaluated for each class Ci. X is then assigned to the class Ci if and only if

P(X|Ci) P(Ci) > P(X|Cj) P(Cj), for 1 ≤ j ≤ m, j ≠ i.

Naïve Bayesian classification
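A compact sketch (Python; the names are illustrative, not from the slides) of this procedure for categorical attributes, estimating P(xk|Ci) = n_ik / n_i by counting; the Gaussian case for continuous attributes would be handled analogously.

```python
from collections import Counter, defaultdict

def train_naive_bayes(objects, labels):
    """objects: list of attribute-value tuples; labels: list of class labels."""
    n = len(labels)
    class_counts = Counter(labels)                    # n_i for each class Ci
    value_counts = defaultdict(Counter)               # n_ik for each (attribute, value) per class
    for x, c in zip(objects, labels):
        for k, v in enumerate(x):
            value_counts[c][(k, v)] += 1
    priors = {c: cnt / n for c, cnt in class_counts.items()}   # P(Ci)

    def classify(x):
        def score(c):                                 # P(Ci) * prod_k P(x_k|Ci)
            s = priors[c]
            for k, v in enumerate(x):
                s *= value_counts[c][(k, v)] / class_counts[c]
            return s
        return max(priors, key=score)
    return classify
```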


outlook

P(sunny|Y) = 2/9 P(sunny|N) = 3/5

P(overcast|Y) = 4/9 P(overcast|N) = 0

P(rain|Y) = 3/9 P(rain|N) = 2/5

temperature

P(hot|Y) = 2/9 P(hot|N) = 2/5

P(mild|Y) = 4/9 P(mild|N) = 2/5

P(cool|Y) = 3/9 P(cool|N) = 1/5

humidity

P(high|Y) = 3/9 P(high|N) = 4/5

P(normal|Y) = 6/9 P(normal|N) = 1/5

windy

P(strong|Y) = 3/9 P(strong|N) = 3/5

P(weak|Y) = 6/9 P(weak|N) = 2/5

P(Y) = 9/14

P(N) = 5/14

Play-tennis example: estimating P(xk|Ci)


" An unseen object Y( =(< &#G$, ℎJL, ℎGoℎ, p%#m >

" !(Y|q)!!(q)(= !(&#G$|q)!!(ℎJL|q)!!(ℎGoℎ|q)!!(p%#m|q)!!(q)((= 3/9 ! 2/9 ! 3/9 ! 6/9 ! 9/14 = 0.010582

" !(Y|^)!!(^)(= !(&#G$|^)(!(!(ℎJL|^)(! !(ℎGoℎ|^)(!!(p%#m|^)(! P(N) = 2/5 ! 2/5 ! 4/5 ! 2/5 ! 5/14 = 0.018286

" Object(Y(is classified in class ^(don’t play)


Play-tennis example: classifying X
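The same computation in a few lines of Python (a sketch; the conditional probabilities are taken from the table above):

```python
# P(X|Y)P(Y) and P(X|N)P(N) for X = <rain, hot, high, weak>
p_x_and_Y = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # rain, hot, high, weak | play
p_x_and_N = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # rain, hot, high, weak | don't play
print(round(p_x_and_Y, 6), round(p_x_and_N, 6))      # 0.010582 0.018286 -> predict N
```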


• It makes computation possible

• It yields optimal classifiers when the independence assumption is satisfied

• But it is seldom satisfied in practice, as attributes (variables) are often correlated

• Attempts to overcome this limitation include, among others:

  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes

  - Decision trees, which reason on one attribute at a time, considering the most important attributes first

The independence hypothesis


• Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.

• First component (network structure): directed acyclic graph where each node represents a random variable, each arc represents a probabilistic dependence.

• Second component (network parameter): one conditional probability table (CPT) for each variable.

Bayesian networks (belief networks, probabilistic networks) provide a model of causal relationships and can be learned from data (part of graphical models, K619)

Bayesian belief networks
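As a toy illustration of the two components (Python; the two-node structure Cancer → Test and its numbers reuse the earlier cancer example and are hypothetical, not a network from the slides):

```python
# Network structure: Cancer -> Test (a directed acyclic graph with one arc)
# Network parameters: one CPT per variable
p_cancer = {True: 0.008, False: 0.992}            # P(Cancer)
p_pos_given_cancer = {True: 0.98, False: 0.03}    # CPT: P(Test=+ | Cancer)

def joint(cancer, positive):
    """P(Cancer, Test) factorizes along the arcs of the network."""
    p_pos = p_pos_given_cancer[cancer]
    return p_cancer[cancer] * (p_pos if positive else 1.0 - p_pos)

print(joint(True, True))   # P(Cancer=yes, Test=+) = 0.00784
```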


Bayesian belief networks

• A Bayesian belief network allows a subset of the variables to be conditionally independent

• It is a graphical model of causal relationships

• Several cases of learning Bayesian belief networks:

  - Given both the network structure and all the variables: easy

  - Given the network structure but only some of the variables (→ parameter learning)

  - When the network structure is not known in advance (→ structure learning)

Probabilistic graphical models: instances of graphical models


[Figure: taxonomy of graphical models, split into directed models (Bayes nets) and undirected models (MRFs); the instances shown include DBNs, the Hidden Markov Model (HMM), the naïve Bayes classifier, mixture models, the Kalman filter model, conditional random fields, MaxEnt, and LDA. Source: Murphy, ML for life sciences; covered in K619.]

Homework

Use different options of the 'Bayes' and 'Trees' classifiers in WEKA to analyze the 'labor' dataset:

1. Run with the original 'labor' data (with missing values).

2. Run with your 'labor' data after filling in the missing values.

3. Run first with 10-fold cross-validation, then with your own number of folds.

Write down your remarks comparing the results of the two methods.