K236: Basis of Data Analytics
Lecture 9: Classification and Prediction
Bayesian Classification

Lecturers: Tu Bao Ho and Hieu Chi Dam
TAs: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai
Schedule of K236

1. Introduction to data science (1) 6/9
2. Introduction to data science (2) 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and examination (the date is not fixed) 7/27
Outline

1. About Bayesian classification
2. Naïve Bayesian classification
3. Bayesian belief networks
Bayesians in machine learning

David Heckerman, Judea Pearl, Michael Jordan
Example

A patient takes a lab test for a cancer disease and the result comes back positive. It is known that:
• Among people who actually have this cancer, 98% get a positive test result (+)
• Among people who actually do not have this cancer, 97% get a negative result (−)
• Furthermore, only 0.008 (0.8%) of the entire population has this cancer
Does the patient have cancer or not?
Example

Does the patient have cancer or not? In probability notation, the known facts are:

    P(cancer | +) = ?
    P(+ | cancer) = 0.98
    P(− | ¬cancer) = 0.97
    P(cancer) = 0.008

Question: Can we compute P(cancer | +) from the other three probabilities?
Example

• What is the probability of a patient having a positive test result, P(+) = ?
• A patient may have a positive test result if he has this cancer and also when he does not have this cancer:

    P(+) = P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer)

• Consider:
  ◦ P(cancer) is a hypothesis h; denote its probability by P(h)
  ◦ A positive test result is an evidence: P(+), or P(D)
  ◦ The probability of h given the evidence is P(h | D)
  ◦ The probability of the evidence given h is P(D | h)
Reverend Thomas Bayes (1702-1761)

He set down his findings on probability in "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), published posthumously in the Philosophical Transactions of the Royal Society of London.
Bayes' Theorem

    P(h | D) = P(D | h) P(h) / P(D)

    P(A | B) = P(B | A) P(A) / P(B)

Bayes' theorem describes the probability of a hypothesis, based on conditions that might be related to the hypothesis.
Bayes' theorem

• The essence of Bayes' theorem is that it tells us how to update our initial probability P(h) when we see evidence D, in order to find out P(h | D):

    P(h | D) = P(D | h) · P(h) / P(D)
             = P(D | h) · P(h) / [P(D | h) · P(h) + P(D | ¬h) · P(¬h)]

• P(h): prior probability
• P(D | h): conditional probability (likelihood) ← coming from the data
• P(h | D): posterior probability
Example

    P(cancer | +) = P(+ | cancer) P(cancer) / P(+)

    P(+) = P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer)
         = 0.98 × 0.008 + 0.03 × 0.992 = 0.0078 + 0.0298

    P(cancer | +) = 0.0078 / (0.0078 + 0.0298) = 0.20745

Given:
    P(cancer) = 0.008       P(+ | cancer) = 0.98       P(− | cancer) = 0.02
    P(¬cancer) = 0.992      P(+ | ¬cancer) = 0.03      P(− | ¬cancer) = 0.97

    P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
    P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078

To say whether the patient has cancer or not, we also need to know P(¬cancer | +).
Example

    P(cancer | +) = P(+ | cancer) P(cancer) / P(+) = 0.0078 / (0.0078 + 0.0298) = 0.20745

    P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+) = 0.0298 / (0.0078 + 0.0298) = 0.79255

• Comparing the two probabilities, we say that the patient does not have this cancer.
• Assume that several hypotheses h1, h2, … (belonging to a hypothesis space H) relate to the evidence D. We want the most probable hypothesis given the evidence D.
• The rule for choosing a hypothesis is the maximum a posteriori hypothesis h_MAP:

    h_MAP = argmax_{h ∈ H} P(h | D)

    (argmax_x f(x) = y means ∀x: f(x) ≤ f(y), i.e., the argument that gives the maximum value)
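To make the arithmetic concrete, here is a minimal sketch of the cancer example in Python (plain arithmetic, no external libraries); the variable names are ours, not the slides':

    # Known quantities from the problem statement
    p_cancer = 0.008                   # prior P(cancer)
    p_pos_given_cancer = 0.98          # likelihood P(+ | cancer)
    p_pos_given_no_cancer = 0.03       # likelihood P(+ | ¬cancer)

    # Evidence P(+) via the law of total probability
    p_pos = (p_pos_given_cancer * p_cancer
             + p_pos_given_no_cancer * (1 - p_cancer))

    # Posteriors via Bayes' theorem
    p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
    p_no_cancer_given_pos = p_pos_given_no_cancer * (1 - p_cancer) / p_pos

    print(p_cancer_given_pos)      # ≈ 0.2085 (the slides round the products
    print(p_no_cancer_given_pos)   #   to 0.0078 and 0.0298, giving 0.20745)
    # MAP decision: pick the hypothesis with the larger posterior
    print("cancer" if p_cancer_given_pos > p_no_cancer_given_pos else "no cancer")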
Choosing hypotheses

• Maximum a posteriori (MAP) hypothesis:

    h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) / P(D)
          = argmax_{h ∈ H} P(D | h) P(h)

  (P(D) can be dropped because it is the same for every hypothesis.)

• Maximum likelihood (ML) hypothesis (MLE: maximum likelihood estimation): if we assume P(hi) = P(hj) for all hi, hj ∈ H, then we can further simplify and choose

    h_ML = argmax_{hi ∈ H} P(D | hi)
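A small sketch contrasting the two rules; the hypothesis space, priors, and likelihoods below are made up for illustration:

    priors = {"h1": 0.8, "h2": 0.2}        # P(h), unequal on purpose
    likelihoods = {"h1": 0.3, "h2": 0.9}   # P(D | h)

    # ML ignores the priors; MAP weighs them in
    h_ml = max(likelihoods, key=likelihoods.get)
    h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

    print(h_ml)    # h2: largest likelihood (0.9)
    print(h_map)   # h1: 0.3 * 0.8 = 0.24 beats 0.9 * 0.2 = 0.18

With equal priors the two rules would agree, which is exactly the simplification behind h_ML above.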
Probability vs. Likelihood

• Likelihood captures the idea that something is likely to happen or to have happened. Informally, "likelihood" is often used as a synonym for "probability".
• Probability is used before data are available, to describe possible future outcomes given a fixed value for the parameter (or parameter vector).
• Likelihood is used after data are available, to describe a function of a parameter (or parameter vector) for a given outcome.
• The likelihood of parameter θ given data X is equal to the probability of the observed data given those parameter values: ℒ(θ | X) = P(X | θ)
• Example:
  ◦ If variable X (e.g., a TOEFL score) follows N(500, 50), we can compute P[450 < X < 550]
  ◦ Given the data of TOEFL scores, which parameters μ and σ correspond with the data?
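A short illustration of the two directions; using scipy here is our choice, not something the course prescribes, and the three observed scores are invented:

    from scipy.stats import norm

    # Probability: parameters fixed (mu=500, sigma=50), ask about outcomes
    p = norm.cdf(550, loc=500, scale=50) - norm.cdf(450, loc=500, scale=50)
    print(p)   # P[450 < X < 550] ≈ 0.683, i.e., within one standard deviation

    # Likelihood: data fixed, ask how well candidate parameters explain it
    data = [480, 510, 530]                     # hypothetical observed scores
    for mu in (450, 500, 550):
        lik = 1.0
        for x in data:
            lik *= norm.pdf(x, loc=mu, scale=50)   # L(mu | data) = prod P(x | mu)
        print(mu, lik)    # mu = 500 gives the largest likelihood here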
What is Bayesian classification?

• Bayesian classification is classification based on Bayes' theorem.
• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes.
• Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes.
Bayes theorem

• Let X be an object whose class label is unknown.
• Let h be some hypothesis, e.g., that X belongs to class C.
• For classification, we want to determine the posterior probability P(h | X) of h conditioned on X.
• Example: data objects are fruits, described by color and shape.
  ◦ Suppose X is red and round, and h is the hypothesis that X is an apple.
  ◦ P(h | X) is the probability that X is an apple given that we have seen that X is red and round.

    P(apple | red and round) = ?
Bayes theorem

• In contrast, P(h) is the prior probability of h. In our example, P(h) is the probability that any given data object is an apple, regardless of how the data sample looks (independent of X).
• P(X | h) is the likelihood of X given h, that is, the probability that X is red and round given that we know it is true that X is an apple.
• P(X), P(h), and P(X | h) may be estimated from the given data. Bayes theorem allows us to calculate P(h | X):

    P(h | X) = P(X | h) P(h) / P(X)

    P(apple | red and round) = P(red and round | apple) P(apple) / P(red and round)

  where P(h) is the prior probability of h, i.e., the probability that the fruit is an apple.
Naïve Bayesian classification

• Suppose X = (x1, x2, …, xn), with attributes A1, A2, …, An.
• There are m classes C1, C2, …, Cm.
• P(Ci | X) denotes the probability that X is classified to class Ci.
• Example:

    P(class = N | outlook = sunny, temperature = hot, humidity = high, wind = strong)

• Idea: assign to object X the class label Ci that achieves the maximum posterior hypothesis (h_MAP), i.e., the Ci for which P(Ci | X) is maximal:

    P(Ci | X) > P(Cj | X), ∀j, j ≠ i
Estimating a posteriori probabilities

• Bayes theorem: P(Ci | X) = P(X | Ci) P(Ci) / P(X)
• P(X) is constant, so we only need to maximize P(X | Ci) P(Ci): the Ci such that P(Ci | X) is maximum is the Ci such that P(X | Ci) · P(Ci) is maximum.
• If the prior probabilities are unknown, it is commonly assumed that

    P(C1) = P(C2) = … = P(Cm),

  and we would maximize P(X | Ci).
• Otherwise, P(Ci) = relative frequency of class Ci = si / s, where si is the number of training objects in class Ci and s is the total number of training objects.
• Problem: computing P(X | Ci) directly is infeasible!
Naïve Bayesian classification

• Naïve assumption: we have P(X | Ci) = P(x1, …, xn | Ci); if the attributes are independent, then P(X | Ci) = P(x1 | Ci) × … × P(xn | Ci).
• If Ak is categorical, sik is the number of training objects of class Ci having the value xk for Ak, and si is the number of training objects belonging to Ci, then

    P(xk | Ci) = sik / si

  If Ak is continuous, then P(xk | Ci) is estimated through a Gaussian density.
• To classify an unknown object X, P(X | Ci) P(Ci) is evaluated for each class Ci. X is then assigned to the class Ci if and only if

    P(X | Ci) P(Ci) > P(X | Cj) P(Cj), for 1 ≤ j ≤ m, j ≠ i.
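A minimal sketch of the counting step P(xk | Ci) = sik / si, on a hypothetical two-attribute toy dataset (not the play-tennis data that follows):

    from collections import Counter, defaultdict

    # (outlook, windy) -> class; four made-up training objects
    data = [("sunny", "strong", "N"), ("sunny", "weak", "Y"),
            ("rain", "weak", "Y"), ("rain", "strong", "N")]

    class_counts = Counter(label for *_, label in data)   # si per class
    value_counts = defaultdict(Counter)                   # sik per (attribute, value)
    for outlook, windy, label in data:
        value_counts[("outlook", outlook)][label] += 1
        value_counts[("windy", windy)][label] += 1

    def p_value_given_class(attr, value, label):
        # P(xk | Ci) = sik / si
        return value_counts[(attr, value)][label] / class_counts[label]

    print(p_value_given_class("outlook", "sunny", "N"))   # 1/2 = 0.5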
Play-tennis example: estimating P(xk | Ci)

outlook:
    P(sunny|Y) = 2/9      P(sunny|N) = 3/5
    P(overcast|Y) = 4/9   P(overcast|N) = 0
    P(rain|Y) = 3/9       P(rain|N) = 2/5
temperature:
    P(hot|Y) = 2/9        P(hot|N) = 2/5
    P(mild|Y) = 4/9       P(mild|N) = 2/5
    P(cool|Y) = 3/9       P(cool|N) = 1/5
humidity:
    P(high|Y) = 3/9       P(high|N) = 4/5
    P(normal|Y) = 6/9     P(normal|N) = 1/5
windy:
    P(strong|Y) = 3/9     P(strong|N) = 3/5
    P(weak|Y) = 6/9       P(weak|N) = 2/5

Class priors: P(Y) = 9/14, P(N) = 5/14
Play-tennis example: classifying X

Using the probability tables estimated above:

• An unseen object X = <rain, hot, high, weak>
• P(X|Y) · P(Y) = P(rain|Y) · P(hot|Y) · P(high|Y) · P(weak|Y) · P(Y)
               = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
• P(X|N) · P(N) = P(rain|N) · P(hot|N) · P(high|N) · P(weak|N) · P(N)
               = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
• Object X is therefore classified in class N (don't play).
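The slide's numbers can be reproduced with plain fractions, e.g. in Python:

    from fractions import Fraction as F

    # Conditional probabilities read off the table, for X = <rain, hot, high, weak>
    score_Y = F(3, 9) * F(2, 9) * F(3, 9) * F(6, 9) * F(9, 14)   # P(X|Y) P(Y)
    score_N = F(2, 5) * F(2, 5) * F(4, 5) * F(2, 5) * F(5, 14)   # P(X|N) P(N)

    print(float(score_Y))   # ≈ 0.010582
    print(float(score_N))   # ≈ 0.018286
    print("N" if score_N > score_Y else "Y")   # N: don't play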
The independence hypothesis

• It makes computation possible.
• It yields optimal classifiers when it is satisfied.
• But it is seldom satisfied in practice, as attributes (variables) are often correlated.
• Attempts to overcome this limitation include, among others:
  ◦ Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  ◦ Decision trees, which reason on one attribute at a time, considering the most important attributes first
Bayesian belief networks

• Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.
• First component (network structure): a directed acyclic graph where each node represents a random variable and each arc represents a probabilistic dependence.
• Second component (network parameters): one conditional probability table (CPT) for each variable.
• Bayesian networks (belief networks, probabilistic networks) provide a model of causal relationships and can be learned (part of graphical models, K619).
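A sketch of the two components on a hypothetical two-node network Rain → WetGrass; the structure and numbers are invented for illustration:

    # Network structure: parents of each node (a directed acyclic graph)
    structure = {"Rain": [], "WetGrass": ["Rain"]}

    # Network parameters: one CPT per variable, P(node | parents)
    cpt_rain = {True: 0.2, False: 0.8}               # P(Rain)
    cpt_wet = {True: {True: 0.9, False: 0.1},        # P(WetGrass | Rain), keyed
               False: {True: 0.25, False: 0.75}}     # by the value of Rain

    # The joint probability factorizes along the structure:
    # P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
    p = cpt_rain[True] * cpt_wet[True][True]
    print(p)   # P(Rain=T, WetGrass=T) = 0.2 * 0.9 = 0.18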
Bayesian belief networks

• A Bayesian belief network allows a subset of the variables to be conditionally independent.
• It is a graphical model of causal relationships.
• Several cases of learning Bayesian belief networks:
  ◦ Given both the network structure and all the variables: easy
  ◦ Given the network structure but only some variables (→ parameter learning)
  ◦ When the network structure is not known in advance (→ structure learning)
Probabilistic graphical models: instances of graphical models

[Figure: taxonomy of graphical models, after Murphy, ML for life sciences]
• Directed graphical models: Bayes nets, DBNs, Hidden Markov Models (HMM), Naïve Bayes classifier, Mixture models, Kalman filter model, LDA
• Undirected graphical models: MRFs, Conditional random fields, MaxEnt
(Covered further in K619.)
Homework

Use different options of the 'Bayes' and 'Trees' classifiers of WEKA to analyze the 'labor' dataset:
1. Run with the original 'labor' data (with missing values).
2. Run with your 'labor' data after filling in the missing values.
3. Run first with 10-fold cross-validation, then with your own number of folds.
Draw your remarks on comparing the results of the two methods.