CS188: Artificial Intelligence
Naïve Bayes
Instructors: Brijen Thananjeyan and Aditya Baradwaj, University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, Sergey Levine, with some materials from A. Farhadi. All CS188 materials are at http://ai.berkeley.edu.]
Machine Learning
§ Up until now: how to use a model to make optimal decisions
§ Machine learning: how to acquire a model from data / experience
  § Learning parameters (e.g. probabilities)
  § Learning structure (e.g. BN graphs)
  § Learning hidden concepts (e.g. clustering)
§ Today: model-based classification with Naive Bayes
Classification
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
  § Get a large collection of example emails, each labeled “spam” or “ham”
  § Note: someone has to hand label all this data!
  § Want to learn to predict labels of new, future emails
§ Features: the attributes used to make the ham/spam decision
  § Words: FREE!
  § Text patterns: $dd, CAPS
  § Non-text: SenderInContacts
  § …
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
  § Get a large collection of example images, each labeled with a digit
  § Note: someone has to hand label all this data!
  § Want to learn to predict labels of new, future digit images
§ Features: the attributes used to make the digit decision
  § Pixels: (6,8) = ON
  § Shape patterns: NumComponents, AspectRatio, NumLoops
  § …
[Example images: handwritten digits labeled 0, 1, 2, 1, and one unknown (??)]
Other Classification Tasks
§ Classification: given inputs x, predict labels (classes) y
§ Examples:
  § Spam detection (input: document, classes: spam/ham)
  § OCR (input: images, classes: characters)
  § Medical diagnosis (input: symptoms, classes: diseases)
  § Automatic essay grading (input: document, classes: grades)
  § Fraud detection (input: account activity, classes: fraud / no fraud)
  § Customer service email routing
  § … many more
§ Classification is an important commercial technology!
Model-Based Classification
§ Model-based approach:
  § Build a model (e.g. a Bayes' net) where both the label and the features are random variables
  § Instantiate any observed features
  § Query for the distribution of the label conditioned on the features
§ Challenges:
  § What structure should the BN have?
  § How should we learn its parameters?
Naïve Bayes for Digits
§ Naïve Bayes: assume all features are independent effects of the label
§ Simple digit recognition version:
  § One feature (variable) Fij for each grid position <i,j>
  § Feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
  § Each input maps to a feature vector of 0s and 1s, one per grid position (see the extraction sketch below)
§ Here: lots of features, each is binary valued
§ Naïve Bayes model:
  P(Y, F1 … Fn) = P(Y) ∏i P(Fi | Y)
§ What do we need to learn?
[Bayes' net: label Y with children F1, F2, …, Fn]
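The feature setup above is easy to write out directly. Below is a minimal sketch of mapping a grayscale pixel grid to the binary Fij features; the function name, input format, and threshold parameter are illustrative assumptions, not from the slides.

```python
def extract_features(image, threshold=0.5):
    """Map a 2D grid of pixel intensities in [0, 1] to binary features.

    Returns a dict from grid position (i, j) to True (on) or False (off),
    mirroring the F_ij on/off features described above.
    (Names and input format are illustrative, not a prescribed API.)
    """
    return {(i, j): intensity > threshold
            for i, row in enumerate(image)
            for j, intensity in enumerate(row)}
```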
General Naïve Bayes
§ A general Naive Bayes model:
  P(Y, F1 … Fn) = P(Y) ∏i P(Fi | Y)
§ We only have to specify how each feature depends on the class
§ Total number of parameters is linear in n
§ Model is very simplistic, but often works anyway
[Bayes' net: label Y with children F1, F2, …, Fn]
P(Y): |Y| parameters
P(Fi | Y) tables: n × |F| × |Y| parameters
(versus |Y| × |F|^n values for the full joint distribution)
Inference for Naïve Bayes
§ Goal: compute the posterior distribution over the label variable Y
§ Step 1: get the joint probability of the label and the evidence, for each label:
  P(y, f1 … fn) = P(y) ∏i P(fi | y)
§ Step 2: sum to get the probability of the evidence:
  P(f1 … fn) = Σy' P(y', f1 … fn)
§ Step 3: normalize by dividing Step 1 by Step 2:
  P(y | f1 … fn) = P(y, f1 … fn) / P(f1 … fn)
(A code sketch of these three steps follows.)
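As a concrete illustration of the three steps, here is a small sketch; the data structures are assumptions chosen for readability, not a prescribed API.

```python
def naive_bayes_posterior(prior, cond_tables, observed):
    """Compute P(Y | f1..fn) for a Naive Bayes model.

    prior:       {y: P(Y=y)}
    cond_tables: one dict per feature, {(f_value, y): P(F_i=f_value | Y=y)}
    observed:    observed feature values, aligned with cond_tables
    (Data structures are illustrative assumptions.)
    """
    # Step 1: joint probability of label and evidence, for each label
    joint = dict(prior)
    for table, f in zip(cond_tables, observed):
        for y in joint:
            joint[y] *= table[(f, y)]
    # Step 2: sum to get the probability of the evidence
    evidence = sum(joint.values())
    # Step 3: normalize
    return {y: p / evidence for y, p in joint.items()}
```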
General Naïve Bayes
§ What do we need in order to use Naïve Bayes?
§ Inference method (we just saw this part):
  § Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
  § Use standard inference to compute P(Y | F1 … Fn)
  § Nothing new here
§ Estimates of local conditional probability tables:
  § P(Y), the prior over labels
  § P(Fi|Y) for each feature (evidence variable)
  § These probabilities are collectively called the parameters of the model and denoted by θ
§ Up until now, we assumed these appeared by magic, but…
§ …they typically come from training data counts: we'll look at this soon
Example: Conditional Probabilities
(A uniform prior over digits, and the on-probability of two example pixel features, here labeled FA and FB, given the digit:)

Y    P(Y)    P(FA = on | Y)    P(FB = on | Y)
1    0.1     0.01              0.05
2    0.1     0.05              0.01
3    0.1     0.05              0.90
4    0.1     0.30              0.80
5    0.1     0.80              0.90
6    0.1     0.90              0.90
7    0.1     0.05              0.25
8    0.1     0.60              0.85
9    0.1     0.50              0.60
0    0.1     0.80              0.80
A Spam Filter
§ Naïve Bayes spam filter
§ Data:
  § Collection of emails, labeled spam or ham
  § Note: someone has to hand label all this data!
  § Split into training, held-out, and test sets
§ Classifiers:
  § Learn on the training set
  § (Tune it on a held-out set)
  § Test it on new emails
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Naïve Bayes for Text
§ Bag-of-words Naïve Bayes:
  § Features: Wi is the word at position i
  § As before: predict label conditioned on feature variables (spam vs. ham)
  § As before: assume features are conditionally independent given the label
  § New: each Wi is identically distributed
§ Generative model:
  P(Y, W1 … Wn) = P(Y) ∏i P(Wi | Y)
§ “Tied” distributions and bag-of-words:
  § Usually, each variable gets its own conditional probability distribution P(F|Y)
  § In a bag-of-words model:
    § Each position is identically distributed
    § All positions share the same conditional probs P(W|Y)
    § Why make this assumption?
§ Called “bag-of-words” because the model is insensitive to word order or reordering (see the sketch below)
(Wi is the word at position i, not the ith word in the dictionary!)
When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
in is lecture lecture next over person remember room sitting the the the to to up wake when you
How many variables are there? How many values?
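A bag-of-words representation is easy to compute. This sketch (the tokenization rule is an assumption) reproduces the word reordering shown above for the lecture sentence:

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and keep alphanumeric tokens; word ORDER is discarded,
    # only counts survive -- hence a "bag" of words.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

print(sorted(bag_of_words(
    "When the lecture is over, remember to wake up the "
    "person sitting next to you in the lecture room.").elements()))
```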
Example: Spam Filtering
§ Model:
  P(Y, W1 … Wn) = P(Y) ∏i P(Wi | Y)
§ What are the parameters?
§ Where do these tables come from?

P(W | ham):
  the : 0.0156
  to  : 0.0153
  and : 0.0115
  of  : 0.0095
  you : 0.0093
  a   : 0.0086
  with: 0.0080
  from: 0.0075
  ...

P(W | spam):
  the : 0.0210
  to  : 0.0133
  of  : 0.0119
  2002: 0.0110
  with: 0.0108
  from: 0.0107
  and : 0.0105
  a   : 0.0100
  ...

P(Y):
  ham : 0.66
  spam: 0.33
Spam Example
Word      P(w|spam)  P(w|ham)  Tot Spam  Tot Ham
(prior)   0.33333    0.66666   -1.1      -0.4
Gary      0.00002    0.00021   -11.8     -8.9
would     0.00069    0.00084   -19.1     -16.0
you       0.00881    0.00304   -23.8     -21.8
like      0.00086    0.00083   -30.9     -28.9
to        0.01517    0.01339   -35.1     -33.2
lose      0.00008    0.00002   -44.5     -44.0
weight    0.00016    0.00002   -53.3     -55.0
while     0.00027    0.00027   -61.5     -63.2
you       0.00881    0.00304   -66.2     -69.0
sleep     0.00006    0.00001   -76.0     -80.5
(The “Tot” columns are running log probabilities.)

P(spam | w) = 98.9%
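The running totals in the table are log probabilities: multiplying many small word probabilities underflows, so real filters add logs instead. A minimal sketch (function names are illustrative):

```python
import math

def log_score(words, word_probs, prior):
    """Running total of log P(class) + sum_i log P(w_i | class)."""
    total = math.log(prior)
    for w in words:
        total += math.log(word_probs[w])
    return total

def posterior_from_scores(log_spam, log_ham):
    """Normalize two log scores into P(spam | words), numerically stably."""
    m = max(log_spam, log_ham)
    e_spam = math.exp(log_spam - m)
    e_ham = math.exp(log_ham - m)
    return e_spam / (e_spam + e_ham)

print(posterior_from_scores(-76.0, -80.5))  # ~0.989, matching the table
```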
Training and Testing
Important Concepts
§ Data: labeled instances, e.g. emails marked spam/ham
  § Training set
  § Held-out set
  § Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle:
  § Learn parameters (e.g. model probabilities) on the training set
  § (Tune hyperparameters on the held-out set)
  § Compute accuracy on the test set
  § Very important: never “peek” at the test set!
§ Evaluation:
  § Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization:
  § Want a classifier which does well on test data
  § Overfitting: fitting the training data very closely, but not generalizing well
  § Underfitting: fits the training set poorly
[Diagram: data split into Training Data, Held-Out Data, and Test Data]
Underfitting and Overfitting
[Plot: a degree-15 polynomial fit to the training data, illustrating overfitting]
Overfitting
Example: Overfitting
2 wins!!
Example: Overfitting
§ Posteriors determined by relative probabilities (odds ratios):

P(W|ham) / P(W|spam):
  south-west : inf
  nation     : inf
  morally    : inf
  nicely     : inf
  extent     : inf
  seriously  : inf
  ...

P(W|spam) / P(W|ham):
  screens    : inf
  minute     : inf
  guaranteed : inf
  $205.00    : inf
  delivery   : inf
  signature  : inf
  ...

What went wrong here?
Generalization and Overfitting
§ Relative frequency parameters will overfit the training data!
  § Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time
  § Unlikely that every occurrence of “minute” is 100% spam
  § Unlikely that every occurrence of “seriously” is 100% ham
  § What about all the words that don't occur in the training set at all?
  § In general, we can't go around giving unseen events zero probability
§ As an extreme case, imagine using the entire email as the only feature:
  § Would get the training data perfect (if deterministic labeling)
  § Wouldn't generalize at all
  § Just making the bag-of-words assumption gives us some generalization, but isn't enough
§ To generalize better: we need to smooth or regularize the estimates
Parameter Estimation
§ Estimating the distribution of a random variable
§ Elicitation: ask a human (why is this hard?)
§ Empirically: use training data (learning!)
  § E.g.: for each outcome x, look at the empirical rate of that value:
    PML(x) = count(x) / total samples
§ This is the estimate that maximizes the likelihood of the data (see the sketch below)
[Example: a sequence of red (r) and blue (b) draws, e.g. r, r, b gives PML(r) = 2/3]
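A relative-frequency (maximum likelihood) estimate is just counting; a minimal sketch:

```python
from collections import Counter

def ml_estimate(samples):
    """Empirical rate of each outcome: count(x) / total samples."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

print(ml_estimate(["r", "r", "b"]))  # {'r': 0.666..., 'b': 0.333...}
```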
Your First Consulting Job
§ A billionaire tech entrepreneur asks you a question:
  § He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
  § You say: Please flip it a few times:
  § You say: The probability is:
    P(H) = 3/5
  § He says: Why???
  § You say: Because…
Your First Consulting Job
§ P(Heads) = θ, P(Tails) = 1 - θ
§ Flips are i.i.d.:
  § Independent events
  § Identically distributed according to an unknown distribution
§ Sequence D of αH heads and αT tails
§ D = {xi | i = 1…n},  P(D | θ) = ∏i P(xi | θ)
Maximum Likelihood Estimation
§ Data: observed set D of αH heads and αT tails
§ Hypothesis space: Bernoulli distributions
§ Learning: finding θ is an optimization problem
  § What's the objective function?
§ MLE: choose θ to maximize the probability of D
Maximum Likelihood Estimation
§ Set derivative to zero, and solve!
θ̂ = argmaxθ ln P(D | θ)

ln P(D | θ) = ln [ θ^αH (1 - θ)^αT ] = αH ln θ + αT ln(1 - θ)

d/dθ ln P(D | θ) = αH d/dθ ln θ + αT d/dθ ln(1 - θ) = αH/θ - αT/(1 - θ) = 0

Solving for θ:  θ̂ = αH / (αH + αT)
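The closed form can be sanity-checked numerically; a quick sketch with illustrative counts (αH = 3, αT = 2, matching the thumbtack example):

```python
import math

alpha_h, alpha_t = 3, 2  # heads and tails counts (illustrative)

def log_likelihood(theta):
    return alpha_h * math.log(theta) + alpha_t * math.log(1 - theta)

# A grid search over theta confirms the maximizer is alpha_h / (alpha_h + alpha_t)
best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best, alpha_h / (alpha_h + alpha_t))  # both ~0.6
```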
Smoothing
Maximum Likelihood?
§ Relative frequencies are the maximum likelihood estimates:
  θML = argmaxθ P(D | θ), which gives PML(x) = count(x) / total samples
§ Another option is to consider the most likely parameter value given the data:
  θMAP = argmaxθ P(θ | D)
Unseen Events
Laplace Smoothing
§ Laplace's estimate:
  § Pretend you saw every outcome once more than you actually did:
    PLAP(x) = (count(x) + 1) / (N + |X|),  with N samples and |X| possible outcomes
  § Can derive this estimate with Dirichlet priors (see cs281a)
§ Example: after draws r, r, b:  PML(r) = 2/3, but PLAP(r) = (2 + 1)/(3 + 2) = 3/5
Laplace Smoothing
§ Laplace's estimate (extended):
  § Pretend you saw every outcome k extra times:
    PLAP,k(x) = (count(x) + k) / (N + k|X|)
  § What's Laplace with k = 0?
  § k is the strength of the prior
§ Laplace for conditionals:
  § Smooth each condition independently:
    PLAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)
(A code sketch follows.)
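In code, add-k smoothing is a one-line change to the counting estimate; a minimal sketch:

```python
from collections import Counter

def laplace_estimate(samples, domain, k=1):
    """Add-k smoothing: pretend every outcome in `domain` was seen k extra times.

    With k = 0 this reduces to the maximum likelihood estimate.
    """
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

print(laplace_estimate(["r", "r", "b"], domain={"r", "b"}, k=1))
# {'r': 0.6, 'b': 0.4}  -- i.e. 3/5 and 2/5
```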
Real NB: Smoothing
§ For real classification problems, smoothing is critical
§ New odds ratios:

P(W|ham) / P(W|spam):
  helvetica : 11.4
  seems     : 10.8
  group     : 10.2
  ago       : 8.4
  areas     : 8.3
  ...

P(W|spam) / P(W|ham):
  verdana : 28.8
  Credit  : 28.4
  ORDER   : 27.2
  <FONT>  : 26.9
  money   : 26.5
  ...

Do these make more sense?
Tuning
Tuning on Held-Out Data
§ Now we've got two kinds of unknowns:
  § Parameters: the probabilities P(X|Y), P(Y)
  § Hyperparameters: e.g. the amount/type of smoothing to do, k, α
§ What should we learn where?
  § Learn parameters from training data
  § Tune hyperparameters on different data
  § Why?
  § For each value of the hyperparameters, train and test on the held-out data
  § Choose the best value and do a final test on the test data (see the sketch below)
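The tune-on-held-out loop looks like this in outline; a hedged sketch where `train_model` and `accuracy` are caller-supplied placeholders, not a specific API:

```python
def tune(train_model, accuracy, train_data, held_out_data, test_data, k_values):
    """train_model(data, k) -> model; accuracy(model, data) -> float in [0, 1].

    Both callables are supplied by the caller; only the loop structure is shown.
    """
    # Learn parameters on training data; score each hyperparameter on held-out data
    best_k = max(k_values,
                 key=lambda k: accuracy(train_model(train_data, k), held_out_data))
    # Retrain with the winning k; touch the test set exactly once, at the end
    final_model = train_model(train_data, best_k)
    return final_model, accuracy(final_model, test_data)
```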
Summary
§ Bayes rule lets us do diagnostic queries with causal probabilities
§ The naïve Bayes assumption takes all features to be independent given the class label
§ We can build classifiers out of a naïve Bayes model using training data
§ Smoothing estimates is important in real systems