CS188: Artificial Intelligence
Naïve Bayes
Instructors: Brijen Thananjeyan and Aditya Baradwaj, University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, Sergey Levine, with some materials from A. Farhadi. All CS188 materials are at http://ai.berkeley.edu.]
Machine Learning
§ Up until now: how to use a model to make optimal decisions
§ Machine learning: how to acquire a model from data / experience
  § Learning parameters (e.g. probabilities)
  § Learning structure (e.g. BN graphs)
  § Learning hidden concepts (e.g. clustering)
§ Today: model-based classification with Naive Bayes
Classification
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
  § Get a large collection of example emails, each labeled “spam” or “ham”
  § Note: someone has to hand label all this data!
  § Want to learn to predict labels of new, future emails
§ Features: the attributes used to make the ham/spam decision
  § Words: FREE!
  § Text patterns: $dd, CAPS
  § Non-text: SenderInContacts
  § …
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
  § Get a large collection of example images, each labeled with a digit
  § Note: someone has to hand label all this data!
  § Want to learn to predict labels of new, future digit images
§ Features: the attributes used to make the digit decision
  § Pixels: (6,8) = ON
  § Shape patterns: NumComponents, AspectRatio, NumLoops
  § …
[Example images: handwritten digits labeled 0, 1, 2, 1, and one unknown (??)]
Other Classification Tasks
§ Classification: given inputs x, predict labels (classes) y
§ Examples:
  § Spam detection (input: document, classes: spam/ham)
  § OCR (input: images, classes: characters)
  § Medical diagnosis (input: symptoms, classes: diseases)
  § Automatic essay grading (input: document, classes: grades)
  § Fraud detection (input: account activity, classes: fraud / no fraud)
  § Customer service email routing
  § … many more
§ Classification is an important commercial technology!
Model-Based Classification
§ Model-based approach:
  § Build a model (e.g. a Bayes' net) where both the label and the features are random variables
  § Instantiate any observed features
  § Query for the distribution of the label conditioned on the features
§ Challenges:
  § What structure should the BN have?
  § How should we learn its parameters?
Naïve Bayes for Digits
§ Naïve Bayes: assume all features are independent effects of the label
§ Simple digit recognition version:
  § One feature (variable) Fij for each grid position <i,j>
  § Feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
  § Each input maps to a feature vector of 0s and 1s, one per grid position (see the extraction sketch below)
§ Here: lots of features, each is binary valued
§ Naïve Bayes model:
  P(Y, F1 … Fn) = P(Y) ∏i P(Fi | Y)
§ What do we need to learn?
[Bayes' net: label Y with children F1, F2, …, Fn]
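The feature setup above is easy to write out directly. Below is a minimal sketch of mapping a grayscale pixel grid to the binary Fij features; the function name, input format, and threshold parameter are illustrative assumptions, not from the slides.

```python
def extract_features(image, threshold=0.5):
    """Map a 2D grid of pixel intensities in [0, 1] to binary features.

    Returns a dict from grid position (i, j) to True (on) or False (off),
    mirroring the F_ij on/off features described above.
    (Names and input format are illustrative, not a prescribed API.)
    """
    return {(i, j): intensity > threshold
            for i, row in enumerate(image)
            for j, intensity in enumerate(row)}
```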
General Naïve Bayes
§ A general Naive Bayes model:
  P(Y, F1 … Fn) = P(Y) ∏i P(Fi | Y)
§ We only have to specify how each feature depends on the class
§ Total number of parameters is linear in n
§ Model is very simplistic, but often works anyway
[Bayes' net: label Y with children F1, F2, …, Fn]
P(Y): |Y| parameters
P(Fi | Y) tables: n × |F| × |Y| parameters
(versus |Y| × |F|^n values for the full joint distribution)
Inference for Naïve Bayes
§ Goal: compute the posterior distribution over the label variable Y
§ Step 1: get the joint probability of the label and the evidence, for each label:
  P(y, f1 … fn) = P(y) ∏i P(fi | y)
§ Step 2: sum to get the probability of the evidence:
  P(f1 … fn) = Σy' P(y', f1 … fn)
§ Step 3: normalize by dividing Step 1 by Step 2:
  P(y | f1 … fn) = P(y, f1 … fn) / P(f1 … fn)
(A code sketch of these three steps follows.)
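As a concrete illustration of the three steps, here is a small sketch; the data structures are assumptions chosen for readability, not a prescribed API.

```python
def naive_bayes_posterior(prior, cond_tables, observed):
    """Compute P(Y | f1..fn) for a Naive Bayes model.

    prior:       {y: P(Y=y)}
    cond_tables: one dict per feature, {(f_value, y): P(F_i=f_value | Y=y)}
    observed:    observed feature values, aligned with cond_tables
    (Data structures are illustrative assumptions.)
    """
    # Step 1: joint probability of label and evidence, for each label
    joint = dict(prior)
    for table, f in zip(cond_tables, observed):
        for y in joint:
            joint[y] *= table[(f, y)]
    # Step 2: sum to get the probability of the evidence
    evidence = sum(joint.values())
    # Step 3: normalize
    return {y: p / evidence for y, p in joint.items()}
```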
General Naïve Bayes
§ What do we need in order to use Naïve Bayes?
§ Inference method (we just saw this part):
  § Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
  § Use standard inference to compute P(Y | F1 … Fn)
  § Nothing new here
§ Estimates of local conditional probability tables:
  § P(Y), the prior over labels
  § P(Fi|Y) for each feature (evidence variable)
  § These probabilities are collectively called the parameters of the model and denoted by θ
§ Up until now, we assumed these appeared by magic, but…
§ …they typically come from training data counts: we'll look at this soon
Example: Conditional Probabilities
(A uniform prior over digits, and the on-probability of two example pixel features, here labeled FA and FB, given the digit:)

Y    P(Y)    P(FA = on | Y)    P(FB = on | Y)
1    0.1     0.01              0.05
2    0.1     0.05              0.01
3    0.1     0.05              0.90
4    0.1     0.30              0.80
5    0.1     0.80              0.90
6    0.1     0.90              0.90
7    0.1     0.05              0.25
8    0.1     0.60              0.85
9    0.1     0.50              0.60
0    0.1     0.80              0.80
A Spam Filter
§ Naïve Bayes spam filter
§ Data:
  § Collection of emails, labeled spam or ham
  § Note: someone has to hand label all this data!
  § Split into training, held-out, and test sets
§ Classifiers:
  § Learn on the training set
  § (Tune it on a held-out set)
  § Test it on new emails
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Naïve Bayes for Text
§ Bag-of-words Naïve Bayes:
  § Features: Wi is the word at position i
  § As before: predict label conditioned on feature variables (spam vs. ham)
  § As before: assume features are conditionally independent given the label
  § New: each Wi is identically distributed
§ Generative model:
  P(Y, W1 … Wn) = P(Y) ∏i P(Wi | Y)
§ “Tied” distributions and bag-of-words:
  § Usually, each variable gets its own conditional probability distribution P(F|Y)
  § In a bag-of-words model:
    § Each position is identically distributed
    § All positions share the same conditional probs P(W|Y)
    § Why make this assumption?
§ Called “bag-of-words” because the model is insensitive to word order or reordering (see the sketch below)
(Wi is the word at position i, not the ith word in the dictionary!)
When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
in is lecture lecture next over person remember room sitting the the the to to up wake when you
How many variables are there? How many values?
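A bag-of-words representation is easy to compute. This sketch (the tokenization rule is an assumption) reproduces the word reordering shown above for the lecture sentence:

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and keep alphanumeric tokens; word ORDER is discarded,
    # only counts survive -- hence a "bag" of words.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

print(sorted(bag_of_words(
    "When the lecture is over, remember to wake up the "
    "person sitting next to you in the lecture room.").elements()))
```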
Example: Spam Filtering
§ Model:
  P(Y, W1 … Wn) = P(Y) ∏i P(Wi | Y)
§ What are the parameters?
§ Where do these tables come from?

P(W | ham):
  the : 0.0156
  to  : 0.0153
  and : 0.0115
  of  : 0.0095
  you : 0.0093
  a   : 0.0086
  with: 0.0080
  from: 0.0075
  ...

P(W | spam):
  the : 0.0210
  to  : 0.0133
  of  : 0.0119
  2002: 0.0110
  with: 0.0108
  from: 0.0107
  and : 0.0105
  a   : 0.0100
  ...

P(Y):
  ham : 0.66
  spam: 0.33
Spam Example
Word      P(w|spam)  P(w|ham)  Tot Spam  Tot Ham
(prior)   0.33333    0.66666   -1.1      -0.4
Gary      0.00002    0.00021   -11.8     -8.9
would     0.00069    0.00084   -19.1     -16.0
you       0.00881    0.00304   -23.8     -21.8
like      0.00086    0.00083   -30.9     -28.9
to        0.01517    0.01339   -35.1     -33.2
lose      0.00008    0.00002   -44.5     -44.0
weight    0.00016    0.00002   -53.3     -55.0
while     0.00027    0.00027   -61.5     -63.2
you       0.00881    0.00304   -66.2     -69.0
sleep     0.00006    0.00001   -76.0     -80.5
(The “Tot” columns are running log probabilities.)

P(spam | w) = 98.9%
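The running totals in the table are log probabilities: multiplying many small word probabilities underflows, so real filters add logs instead. A minimal sketch (function names are illustrative):

```python
import math

def log_score(words, word_probs, prior):
    """Running total of log P(class) + sum_i log P(w_i | class)."""
    total = math.log(prior)
    for w in words:
        total += math.log(word_probs[w])
    return total

def posterior_from_scores(log_spam, log_ham):
    """Normalize two log scores into P(spam | words), numerically stably."""
    m = max(log_spam, log_ham)
    e_spam = math.exp(log_spam - m)
    e_ham = math.exp(log_ham - m)
    return e_spam / (e_spam + e_ham)

print(posterior_from_scores(-76.0, -80.5))  # ~0.989, matching the table
```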
Training and Testing
Important Concepts
§ Data: labeled instances, e.g. emails marked spam/ham
  § Training set
  § Held-out set
  § Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle:
  § Learn parameters (e.g. model probabilities) on the training set
  § (Tune hyperparameters on the held-out set)
  § Compute accuracy on the test set
  § Very important: never “peek” at the test set!
§ Evaluation:
  § Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization:
  § Want a classifier which does well on test data
  § Overfitting: fitting the training data very closely, but not generalizing well
  § Underfitting: fits the training set poorly
[Diagram: data split into Training Data, Held-Out Data, and Test Data]
Underfitting and Overfitting
[Plot: a degree-15 polynomial fit to the training data, illustrating overfitting]
Overfitting
Example: Overfitting
2 wins!!
Example: Overfitting
§ Posteriors determined by relative probabilities (odds ratios):

P(W|ham) / P(W|spam):
  south-west : inf
  nation     : inf
  morally    : inf
  nicely     : inf
  extent     : inf
  seriously  : inf
  ...

P(W|spam) / P(W|ham):
  screens    : inf
  minute     : inf
  guaranteed : inf
  $205.00    : inf
  delivery   : inf
  signature  : inf
  ...

What went wrong here?
Generalization and Overfitting
§ Relative frequency parameters will overfit the training data!
  § Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
  § Unlikely that every occurrence of “minute” is 100% spam
  § Unlikely that every occurrence of “seriously” is 100% ham
  § What about all the words that don't occur in the training set at all?
  § In general, we can't go around giving unseen events zero probability
§ As an extreme case, imagine using the entire email as the only feature:
  § Would get the training data perfect (if deterministic labeling)
  § Wouldn't generalize at all
  § Just making the bag-of-words assumption gives us some generalization, but isn't enough
§ To generalize better: we need to smooth or regularize the estimates
Parameter Estimation
§ Estimating the distribution of a random variable
§ Elicitation: ask a human (why is this hard?)
§ Empirically: use training data (learning!)
  § E.g.: for each outcome x, look at the empirical rate of that value:
    PML(x) = count(x) / total samples
§ This is the estimate that maximizes the likelihood of the data (see the sketch below)
[Example: a sequence of red (r) and blue (b) draws, e.g. r, r, b gives PML(r) = 2/3]
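A relative-frequency (maximum likelihood) estimate is just counting; a minimal sketch:

```python
from collections import Counter

def ml_estimate(samples):
    """Empirical rate of each outcome: count(x) / total samples."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

print(ml_estimate(["r", "r", "b"]))  # {'r': 0.666..., 'b': 0.333...}
```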
Your First Consulting Job
§ A billionaire tech entrepreneur asks you a question:
  § He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
  § You say: Please flip it a few times:
  § You say: The probability is:
    P(H) = 3/5
  § He says: Why???
  § You say: Because…
Your First Consulting Job
§ P(Heads) = θ, P(Tails) = 1 - θ
§ Flips are i.i.d.:
  § Independent events
  § Identically distributed according to an unknown distribution
§ Sequence D of αH heads and αT tails
§ D = {xi | i = 1…n},  P(D | θ) = ∏i P(xi | θ)
Maximum Likelihood Estimation
§ Data: observed set D of αH heads and αT tails
§ Hypothesis space: Bernoulli distributions
§ Learning: finding θ is an optimization problem
  § What's the objective function?
§ MLE: choose θ to maximize the probability of D
Maximum Likelihood Estimation
§ Set derivative to zero, and solve!
θ̂ = argmaxθ ln P(D | θ)

ln P(D | θ) = ln [ θ^αH (1 - θ)^αT ] = αH ln θ + αT ln(1 - θ)

d/dθ ln P(D | θ) = αH d/dθ ln θ + αT d/dθ ln(1 - θ) = αH/θ - αT/(1 - θ) = 0

Solving for θ:  θ̂ = αH / (αH + αT)
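The closed form can be sanity-checked numerically; a quick sketch with illustrative counts (αH = 3, αT = 2, matching the thumbtack example):

```python
import math

alpha_h, alpha_t = 3, 2  # heads and tails counts (illustrative)

def log_likelihood(theta):
    return alpha_h * math.log(theta) + alpha_t * math.log(1 - theta)

# A grid search over theta confirms the maximizer is alpha_h / (alpha_h + alpha_t)
best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best, alpha_h / (alpha_h + alpha_t))  # both ~0.6
```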
Smoothing
Maximum Likelihood?
§ Relative frequencies are the maximum likelihood estimates:
  θML = argmaxθ P(D | θ), which gives PML(x) = count(x) / total samples
§ Another option is to consider the most likely parameter value given the data:
  θMAP = argmaxθ P(θ | D)
Unseen Events
Laplace Smoothing
§ Laplace's estimate:
  § Pretend you saw every outcome once more than you actually did:
    PLAP(x) = (count(x) + 1) / (N + |X|),  with N samples and |X| possible outcomes
  § Can derive this estimate with Dirichlet priors (see cs281a)
§ Example: after draws r, r, b:  PML(r) = 2/3, but PLAP(r) = (2 + 1)/(3 + 2) = 3/5
Laplace Smoothing
§ Laplace's estimate (extended):
  § Pretend you saw every outcome k extra times:
    PLAP,k(x) = (count(x) + k) / (N + k|X|)
  § What's Laplace with k = 0?
  § k is the strength of the prior
§ Laplace for conditionals:
  § Smooth each condition independently:
    PLAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)
(A code sketch follows.)
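In code, add-k smoothing is a one-line change to the counting estimate; a minimal sketch:

```python
from collections import Counter

def laplace_estimate(samples, domain, k=1):
    """Add-k smoothing: pretend every outcome in `domain` was seen k extra times.

    With k = 0 this reduces to the maximum likelihood estimate.
    """
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

print(laplace_estimate(["r", "r", "b"], domain={"r", "b"}, k=1))
# {'r': 0.6, 'b': 0.4}  -- i.e. 3/5 and 2/5
```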
Real NB: Smoothing
§ For real classification problems, smoothing is critical
§ New odds ratios:

P(W|ham) / P(W|spam):
  helvetica : 11.4
  seems     : 10.8
  group     : 10.2
  ago       : 8.4
  areas     : 8.3
  ...

P(W|spam) / P(W|ham):
  verdana : 28.8
  Credit  : 28.4
  ORDER   : 27.2
  <FONT>  : 26.9
  money   : 26.5
  ...

Do these make more sense?
Tuning
Tuning on Held-Out Data
§ Now we've got two kinds of unknowns:
  § Parameters: the probabilities P(X|Y), P(Y)
  § Hyperparameters: e.g. the amount/type of smoothing to do, k, α
§ What should we learn where?
  § Learn parameters from training data
  § Tune hyperparameters on different data
  § Why?
  § For each value of the hyperparameters, train and test on the held-out data
  § Choose the best value and do a final test on the test data (see the sketch below)
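The tune-on-held-out loop looks like this in outline; a hedged sketch where `train_model` and `accuracy` are caller-supplied placeholders, not a specific API:

```python
def tune(train_model, accuracy, train_data, held_out_data, test_data, k_values):
    """train_model(data, k) -> model; accuracy(model, data) -> float in [0, 1].

    Both callables are supplied by the caller; only the loop structure is shown.
    """
    # Learn parameters on training data; score each hyperparameter on held-out data
    best_k = max(k_values,
                 key=lambda k: accuracy(train_model(train_data, k), held_out_data))
    # Retrain with the winning k; touch the test set exactly once, at the end
    final_model = train_model(train_data, best_k)
    return final_model, accuracy(final_model, test_data)
```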
Summary
§ Bayes rule lets us do diagnostic queries with causal probabilities
§ The naïve Bayes assumption takes all features to be independent given the class label
§ We can build classifiers out of a naïve Bayes model using training data
§ Smoothing estimates is important in real systems