Bayesian methods & Naïve Bayes Lecture 18
David Sontag New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate
Beta prior distribution – P(θ)
• The posterior distribution:
P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)

With a uniform prior, P(\theta) \propto 1:
  P(\theta \mid D) \propto P(D \mid \theta)

With a Beta(\beta_H, \beta_T) prior:
  P(\theta \mid D) \propto \theta^{\alpha_H}(1-\theta)^{\alpha_T}\,\theta^{\beta_H-1}(1-\theta)^{\beta_T-1}
                 = \theta^{\alpha_H+\beta_H-1}(1-\theta)^{\alpha_T+\beta_T-1}
                 = \mathrm{Beta}(\alpha_H+\beta_H,\; \alpha_T+\beta_T)

MLE for the thumbtack (Bernoulli) parameter, with \alpha_H heads and \alpha_T tails:
  \hat{\theta} = \arg\max_\theta \ln P(D \mid \theta)
  \frac{d}{d\theta}\ln P(D \mid \theta)
    = \frac{d}{d\theta}\big[\ln \theta^{\alpha_H}(1-\theta)^{\alpha_T}\big]
    = \frac{d}{d\theta}\big[\alpha_H \ln\theta + \alpha_T \ln(1-\theta)\big]
    = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0

Sample-complexity (Hoeffding) bound:
  P(\text{mistake}) \le 2e^{-2N\epsilon^2} \le \delta
  \ln\delta \ge \ln 2 - 2N\epsilon^2
  N \ge \frac{\ln(2/\delta)}{2\epsilon^2},
  \quad\text{e.g. } N \ge \frac{\ln(2/0.05)}{2 \times 0.1^2} \approx \frac{3.8}{0.02} = 190
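A minimal Python sketch of this posterior update; the counts and hyperparameter values below are made up for illustration:

```python
# Sketch: Beta-Bernoulli posterior update for the thumbtack example.
# Assumes alpha_H heads and alpha_T tails observed, with a Beta(beta_H, beta_T) prior.

def beta_posterior(alpha_H, alpha_T, beta_H=1.0, beta_T=1.0):
    """Return the parameters of the posterior Beta distribution over theta."""
    return alpha_H + beta_H, alpha_T + beta_T

def posterior_mean(post_H, post_T):
    """Mean of Beta(post_H, post_T), i.e. E[theta | D]."""
    return post_H / (post_H + post_T)

if __name__ == "__main__":
    post = beta_posterior(alpha_H=3, alpha_T=2, beta_H=2.0, beta_T=2.0)
    print("posterior:", post, "mean:", posterior_mean(*post))  # Beta(5, 4), mean ~ 0.556
```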
Using Bayesian inference for prediction
• We now have a distribution over parameters
• For any specific f, a function of interest, compute the expected value of f:
  E[f] = \int f(\theta)\, P(\theta \mid D)\, d\theta
• The integral is often hard to compute
• As more data is observed, the posterior becomes more concentrated
• MAP (maximum a posteriori approximation): use the most likely parameter to approximate the expectation

MAP for Beta distribution
• MAP: use the most likely parameter:
\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D)
  = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}
• Beta prior equivalent to extra thumbtack flips
• As N → ∞, the prior is "forgotten"
• But for small sample sizes, the prior is important!
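A small Python sketch of these two points, using the MAP formula above; the counts and prior strength are made up for illustration:

```python
# Sketch: MAP estimate under a Beta(beta_H, beta_T) prior vs. the MLE,
# showing the prior acts like extra flips and is "forgotten" as N grows.

def map_estimate(alpha_H, alpha_T, beta_H, beta_T):
    return (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)

def mle_estimate(alpha_H, alpha_T):
    return alpha_H / (alpha_H + alpha_T)

beta_H, beta_T = 10.0, 10.0               # a fairly strong prior (pseudo-flips)
for N in [10, 100, 10000]:
    alpha_H, alpha_T = 0.7 * N, 0.3 * N   # pretend 70% of the N flips came up heads
    print(N, round(mle_estimate(alpha_H, alpha_T), 3),
          round(map_estimate(alpha_H, alpha_T, beta_H, beta_T), 3))
# MLE stays at 0.7; MAP starts pulled toward 0.5 and approaches 0.7 as N grows.
```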
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – If X ~ N(µ, σ²) and Y = aX + b, then Y ~ N(aµ + b, a²σ²)
• The sum of independent Gaussians is Gaussian
  – If X ~ N(µ_X, σ²_X), Y ~ N(µ_Y, σ²_Y), and Z = X + Y, then Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)
• Easy to differentiate, as we will see soon!
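A quick sampling check of both properties (numpy only; the parameter values are arbitrary):

```python
import numpy as np

# Sketch: verify the affine-transformation and sum properties of Gaussians by sampling.
rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
a, b = 0.5, 1.0

X = rng.normal(mu, sigma, size=1_000_000)
Y = a * X + b                        # should be ~ N(a*mu + b, a^2 * sigma^2)
print(Y.mean(), Y.var())             # approximately 2.0 and 2.25

muX, sX = 1.0, 2.0
muY, sY = -1.0, 0.5
Z = rng.normal(muX, sX, 1_000_000) + rng.normal(muY, sY, 1_000_000)
print(Z.mean(), Z.var())             # approximately muX + muY = 0.0 and sX^2 + sY^2 = 4.25
```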
Learning a Gaussian
• Collect a bunch of data – Hopefully, i.i.d. samples
– e.g., exam scores
• Learn parameters – Mean: µ – Variance: σ²
  i    x_i (Exam Score)
  0    85
  1    95
  2    100
  3    12
  …    …
  99   89
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x_1, …, x_N}:
  P(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)
• Log-likelihood of data:
  \ln P(D \mid \mu, \sigma) = -N \ln\!\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^{N} \frac{(x_i-\mu)^2}{2\sigma^2}
• MLE:
  \hat{\mu}_{MLE}, \hat{\sigma}_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)
Your second learning algorithm: MLE for mean of a Gaussian
• What’s MLE for mean?
\hat{\mu}_{MLE}, \hat{\sigma}_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)

\frac{d}{d\mu}\ln P(D \mid \mu, \sigma)
  = \sum_{i=1}^{N} \frac{x_i - \mu}{\sigma^2} = 0
  \;\Rightarrow\; \sum_{i=1}^{N} x_i - N\mu = 0
  \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i
MLE for variance
• Again, set the derivative to zero:
\frac{d}{d\sigma}\ln P(D \mid \mu, \sigma)
  = -\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{\sigma^3} = 0
  \;\Rightarrow\; \hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2
Learning Gaussian parameters
• MLE:
  \hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
  \hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2
• MLE for the variance of a Gaussian is biased
  – The expected result of the estimation is not the true parameter!
  – Unbiased variance estimator:
    \hat{\sigma}^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \hat{\mu}_{MLE})^2
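A small numpy sketch of the two variance estimators, using only the exam-score rows listed in the table above:

```python
import numpy as np

# Sketch: MLE vs. unbiased estimates of a Gaussian's parameters.
x = np.array([85., 95., 100., 12., 89.])   # the listed exam scores only

mu_mle = x.mean()                           # (1/N) * sum(x_i)
var_mle = ((x - mu_mle) ** 2).mean()        # divides by N   -> biased
var_unbiased = x.var(ddof=1)                # divides by N-1 -> unbiased
print(mu_mle, var_mle, var_unbiased)
```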
Bayesian learning of Gaussian parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for mean:
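The slides only name the conjugate priors; as a hedged illustration (not from the slides), here is the standard closed-form posterior over the mean when the variance σ² is treated as known and the prior is µ ~ N(η, τ²), with η and τ² being hypothetical prior hyperparameters:

```python
import numpy as np

# Sketch: posterior over the Gaussian mean with known variance sigma^2
# and a conjugate Gaussian prior mu ~ N(eta, tau^2).
def posterior_mean_params(x, sigma2, eta, tau2):
    N = len(x)
    precision = 1.0 / tau2 + N / sigma2          # posterior precision
    post_var = 1.0 / precision
    post_mean = post_var * (eta / tau2 + x.sum() / sigma2)
    return post_mean, post_var

x = np.array([85., 95., 100., 12., 89.])
print(posterior_mean_params(x, sigma2=100.0, eta=75.0, tau2=400.0))
```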
Naïve Bayes
Slides adapted from Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld
Bayesian Classification
• Problem statement:
  – Given features X1, X2, …, Xn
  – Predict a label Y
Example Application
• Digit Recognition
• X1, …, Xn ∈ {0,1} (black vs. white pixels)
• Y ∈ {0,1,2,3,4,5,6,7,8,9}
[Figure: a digit image is fed to the classifier, which outputs "5"]
The Bayes Classifier
• If we had the joint distribution on X1, …, Xn and Y, we could predict using P(Y | X1, …, Xn)
  – (for example: what is the probability that the image represents a 5 given its pixels?)
• So… how do we compute that?
The Bayes Classifier
• Use Bayes' rule:
  P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, \ldots, X_n)}
  (numerator: likelihood × prior; denominator: normalization constant)
• Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label
The Bayes Classifier
• Let's expand this for our digit recognition task:
• To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest
Model Parameters
• How many parameters are required to specify the likelihood, P(X1, …, Xn | Y)?
  – (Supposing that each image is 30x30 pixels, so n = 900 binary features, the full conditional joint needs 2^900 − 1 parameters per class)
• The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
  – We'll run out of space
  – We'll run out of time
  – And we'll need tons of training data (which is usually not available)
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:
    P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)
  – More generally:
    P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)
• How many parameters now?
• Suppose X is composed of n binary features
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have the likelihood P(Xi | Y)
• Decision rule:
  y^* = h_{NB}(x) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)
If the conditional-independence assumption holds, NB is the optimal classifier! (It typically doesn't.)
[Figure: Naïve Bayes graphical model, class node Y with arrows to features X1, X2, …, Xn]
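A minimal sketch of this decision rule, computed in log space to avoid underflow when multiplying many small probabilities; the parameter dictionaries are hypothetical placeholders, not part of the slides:

```python
import math

# Sketch: Naive Bayes decision rule in log space.
# prior[y] = P(Y=y); likelihood[y][i][x] = P(X_i = x | Y = y).
def nb_predict(x, prior, likelihood):
    best_y, best_score = None, -math.inf
    for y in prior:
        score = math.log(prior[y])
        for i, xi in enumerate(x):
            score += math.log(likelihood[y][i][xi])
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```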
A Digit Recognizer
• Input: pixel grids
• Output: a digit 0-9
Naïve Bayes for Digits (Binary Inputs)
• Simple version: – One feature Fij for each grid position <i,j>
– Possible feature values are on / off, based on whether intensity is more or less than 0.5 in underlying image
– Each input maps to a feature vector, e.g.
– Here: lots of features, each is binary valued
• Naïve Bayes model:
• Are the features independent given class?
• What do we need to learn?
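A tiny sketch of the feature mapping described above, thresholding each pixel intensity at 0.5; the image array here is a made-up stand-in for a real digit:

```python
import numpy as np

# Sketch: map a grayscale image to binary on/off pixel features F_ij.
image = np.random.rand(30, 30)        # stand-in for a real digit image with values in [0, 1]
features = (image > 0.5).astype(int)  # F_ij = 1 ("on") iff intensity exceeds 0.5
x = features.flatten()                # feature vector of length 900
```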
What has to be learned?
• The prior over digit classes, P(Y):
  y:        1    2    3    4    5    6    7    8    9    0
  P(Y=y):   0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1
• One conditional table P(F_ij = on | Y) per pixel; for two example pixels (call them F and F'):
  y:              1     2     3     4     5     6     7     8     9     0
  P(F=on|Y=y):    0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80
  P(F'=on|Y=y):   0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80
MLE for the parameters of NB
• Given dataset:
  – Count(A=a, B=b) = number of examples where A=a and B=b
• MLE for discrete NB, simply:
  – Prior:
  – Observation distribution:
P(Y = y) = \frac{\mathrm{Count}(Y = y)}{\sum_{y'} \mathrm{Count}(Y = y')}

P(X_i = x \mid Y = y) = \frac{\mathrm{Count}(X_i = x, Y = y)}{\sum_{x'} \mathrm{Count}(X_i = x', Y = y)}
MLE for the parameters of NB
• Training amounts to, for each of the classes, averaging all of the examples together:
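In code, that averaging can be sketched as follows; X and y are hypothetical arrays of binarized images and digit labels, and it assumes every class appears at least once:

```python
import numpy as np

# Sketch: MLE for P(F_ij = on | Y = y) is just the per-class average of each pixel.
def nb_pixel_means(X, y, num_classes=10):
    # X: (num_examples, num_pixels) binary array; y: (num_examples,) integer labels
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])
```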
MAP estimation for NB
• Given dataset:
  – Count(A=a, B=b) = number of examples where A=a and B=b
• MAP estimation for discrete NB, simply:
  – Prior:
  – Observation distribution:
• Called "smoothing". Corresponds to a Dirichlet prior!
P(Y = y) = \frac{\mathrm{Count}(Y = y)}{\sum_{y'} \mathrm{Count}(Y = y')}

P(X_i = x \mid Y = y) = \frac{\mathrm{Count}(X_i = x, Y = y) + a}{\sum_{x'} \mathrm{Count}(X_i = x', Y = y) + |X_i| \cdot a}
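Putting the smoothed counts together, a hedged sketch of training and prediction for binary features with add-a (Laplace-style) smoothing; the variable names are mine, not the slides':

```python
import numpy as np

# Sketch: discrete Naive Bayes with add-a smoothing for binary features.
def train_nb(X, y, num_classes=10, a=1.0):
    N, n = X.shape
    prior = np.array([(y == c).sum() / N for c in range(num_classes)])  # unsmoothed prior
    # P(X_i = 1 | Y = c): +a in the numerator, +|X_i|*a = +2a in the denominator (binary features)
    theta = np.stack([(X[y == c].sum(axis=0) + a) / ((y == c).sum() + 2 * a)
                      for c in range(num_classes)])
    return prior, theta

def predict_nb(x, prior, theta):
    # Log-space decision rule: argmax_y log P(y) + sum_i log P(x_i | y)
    log_post = (np.log(prior)
                + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1))
    return int(np.argmax(log_post))
```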