CPSC 340: Machine Learning and Data Mining
Probabilistic Classification (Fall 2020)
Admin
• Waiting list people: everyone should be in!
• Course webpage:
  – https://www.cs.ubc.ca/~fwood/CS340/
• Homework 1 due tonight.
Last Time: Training, Testing, and Validation
• Training step:
• Prediction step:
• What we are interested in is the test error:
  – Error made by the prediction step on new data.
Last Time: Fundamental Trade-Off
• We decomposed test error to get a fundamental trade-off:
  – Where E_approx = (E_test – E_train).
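Rearranging that definition gives the decomposition the trade-off is based on:

\[
E_{\text{test}} = \underbrace{(E_{\text{test}} - E_{\text{train}})}_{E_{\text{approx}}} + E_{\text{train}}
\]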
• E_train goes down as the model gets more complicated:
  – Training error goes down as a decision tree gets deeper.
• But E_approx goes up as the model gets more complicated:
  – Training error becomes a worse approximation of test error.
Last Time: Validation Error
• Golden rule: we can't look at test data during training.
• But we can approximate E_test with a validation error:
  – Error on a set of training examples we "hid" during training.
  – Find the decision tree based on the "train" rows.
  – Validation error is the error of the decision tree on the "validation" rows.
• We typically choose "hyper-parameters" like depth to minimize the validation error.
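A minimal sketch of choosing depth by validation error with scikit-learn decision trees; the split size, depth grid, and function name are illustrative assumptions, not the course's own code:

```python
# Choose decision-tree depth by minimizing validation error (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def choose_depth_by_validation(X, y, depths=range(1, 11)):
    # Hide 20% of the training data as a validation set (golden rule: no test data here).
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
    best_depth, best_err = None, np.inf
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
        err = np.mean(model.predict(X_valid) != y_valid)   # validation error for this depth
        if err < best_err:
            best_depth, best_err = depth, err
    return best_depth, best_err
```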
Overfitting to the Validation Set?
• Validation error usually has lower optimization bias than training error.
  – Might optimize over 20 values of "depth", instead of millions+ of possible trees.
• But we can still overfit to the validation error (common in practice):
  – Validation error is only an unbiased approximation if you use it once.
  – Once you start optimizing it, you start to overfit to the validation set.
• This is most important when the validation set is "small":
  – The optimization bias decreases as the number of validation examples increases.
• Remember, our goal is still to do well on the test set (new data), not the validation set (where we already know the labels).
Should you trust them?
• Scenario 1:
  – "I built a model based on the data you gave me."
  – "It classified your data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably not:
  – They are reporting training error.
  – This might have nothing to do with test error.
  – E.g., they could have fit a very deep decision tree.
• Why 'probably'?
  – If they only tried a few very simple models, the 98% might be reliable.
  – E.g., they only considered decision stumps with simple 1-variable rules.
Should you trust them?
• Scenario 2:
  – "I built a model based on half of the data you gave me."
  – "It classified the other half of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably:
  – They computed the validation error once.
  – This is an unbiased approximation of the test error.
  – Trust them if you believe they didn't violate the golden rule.
Should you trust them?
• Scenario 3:
  – "I built 10 models based on half of the data you gave me."
  – "One of them classified the other half of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably:
  – They computed the validation error a small number of times.
  – Maximizing over these errors is a biased approximation of test error.
  – But they only maximized it over 10 models, so the bias is probably small.
  – They probably know about the golden rule.
Should you trust them?
• Scenario 4:
  – "I built 1 billion models based on half of the data you gave me."
  – "One of them classified the other half of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably not:
  – They computed the validation error a huge number of times.
  – They tried so many models, one of them is likely to work by chance.
• Why 'probably'?
  – If the 1 billion models were all extremely simple, 98% might be reliable.
Should you trust them?
• Scenario 5:
  – "I built 1 billion models based on the first third of the data you gave me."
  – "One of them classified the second third of the data with 98% accuracy."
  – "It also classified the last third of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably:
  – They computed the first validation error a huge number of times.
  – But they had a second validation set that they only looked at once.
  – The second validation set gives an unbiased test error approximation.
  – This is ideal, as long as they didn't violate the golden rule on the last third.
  – And assuming you are using IID data in the first place.
Validation Error and Optimization Bias
• Optimization bias is small if you only compare a few models:
  – Best decision tree on the training set among depths 1, 2, 3, …, 10.
  – Risk of overfitting to the validation set is low if we try 10 things.
• Optimization bias is large if you compare a lot of models:
  – All possible decision trees of depth 10 or less.
  – Here we're using the validation set to pick between a billion+ models:
    • Risk of overfitting to the validation set is high: could have low validation error by chance.
  – If you did this, you might want a second validation set to detect overfitting.
• And optimization bias shrinks as you grow the size of the validation set.
Aside: Optimization Bias leads to Publication Bias
• Suppose that 20 researchers perform the exact same experiment:
• They each test whether their effect is "significant" (p < 0.05).
  – 19/20 find that it is not significant.
  – But the 1 group finding it's significant publishes a paper about the effect.
• This is again optimization bias, contributing to publication bias.
  – A contributing factor to many reported effects being wrong.
Cross-Validation (CV)
• Isn't it wasteful to only use part of your data?
• 5-fold cross-validation:
  – Train on 80% of the data, validate on the other 20%.
  – Repeat this 5 times with different splits, and average the score.
Cross-Validation (CV)
[Figure: 5-fold cross-validation. The examples are split into 5 folds; in each of the 5 rounds, one fold is used as VALIDATION and the remaining 4 folds are used as TRAIN.]
Cross-Validation Pseudo-Code
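A minimal from-scratch sketch of k-fold cross-validation for a single hyper-parameter (decision-tree depth); the function name, model choice, and error computation here are illustrative assumptions, not the course's own pseudo-code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cross_validation_error(X, y, depth, k=5):
    """Estimate validation error of a depth-'depth' tree with k-fold CV."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    perm = np.random.permutation(n)          # randomize first (in case data is ordered)
    folds = np.array_split(perm, k)          # then split into k fixed folds
    errors = []
    for i in range(k):
        valid_idx = folds[i]                              # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = DecisionTreeClassifier(max_depth=depth)
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[valid_idx]) != y[valid_idx]))
    return np.mean(errors)                   # average error across the k folds
```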
Cross-Validation (CV)
• You can take this idea further ("k-fold cross-validation"):
  – 10-fold cross-validation: train on 90% of the data and validate on the other 10%.
    • Repeat 10 times and average (test on fold 1, then fold 2, …, then fold 10).
  – Leave-one-out cross-validation: train on all but one training example.
    • Repeat n times and average.
• Gets more accurate but more expensive with more folds.
  – To choose depth we compute the cross-validation score for each depth.
• As before, if data is ordered then folds should be random splits.
  – Randomize first, then split into fixed folds.
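In practice the same depth search can be done with scikit-learn's built-in helpers; a small sketch with a synthetic dataset (the data and depth grid are placeholders, not from the slides):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic dataset, just to make the snippet runnable.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0] ^ X[:, 1]   # label depends on the first two features

# 10-fold CV accuracy for each candidate depth; keep the depth with the best score.
scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d), X, y, cv=10).mean()
          for d in range(1, 11)}
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```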
(pause)
The "Best" Machine Learning Model
• Decision trees are not always the most accurate on test error.
• What is the "best" machine learning model?
• An alternative measure of performance is the generalization error:
  – Average error over all x_i vectors that are not seen in the training set.
  – "How well we expect to do for a completely unseen feature vector."
• No free lunch theorem (proof in bonus slides):
  – There is no "best" model achieving the best generalization error for every problem.
  – If model A generalizes better to new data than model B on one dataset, there is another dataset where model B works better.
• This question is like asking which is "best" among "rock", "paper", and "scissors".
The "Best" Machine Learning Model
• Implications of the lack of a "best" model:
  – We need to learn about and try out multiple models.
• So which ones to study in CPSC 340?
  – We'll usually motivate each method by a specific application.
  – But we're focusing on models that have been effective in many applications.
• Caveat of the no free lunch (NFL) theorem:
  – The world is very structured.
  – Some datasets are more likely than others.
  – Model A really could be better than model B on every real dataset in practice.
• Machine learning research:
  – Large focus on models that are useful across many applications.
Application: E-mail Spam Filtering
• Want to build a system that detects spam e-mails.
  – Context: spam used to be a big problem.
• Can we formulate this as supervised learning?
Spam Filtering as Supervised Learning
• Collect a large number of e-mails, get users to label them.
• We can use (y_i = 1) if e-mail 'i' is spam, (y_i = 0) if e-mail 'i' is not spam.
• Extract features of each e-mail (like bag of words).
  – (x_ij = 1) if word/phrase 'j' is in e-mail 'i', (x_ij = 0) if it is not.
$   Hi   CPSC   340   Vicodin   Offer   …   | Spam?
1   1    0      0     1         0       …   | 1
0   0    0      0     1         1       …   | 1
0   1    1      1     0         0       …   | 0
…   …    …      …     …         …       …   | …
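A minimal sketch of building this kind of binary bag-of-words matrix from raw text; the toy e-mails and vocabulary below are invented for illustration:

```python
# Build a binary bag-of-words matrix: X[i][j] = 1 if word j appears in e-mail i.
emails = ["$ hi offer vicodin", "vicodin offer", "hi cpsc 340"]   # toy messages
labels = [1, 1, 0]                                                # 1 = spam, 0 = not spam

vocab = sorted({word for email in emails for word in email.split()})
word_to_index = {word: j for j, word in enumerate(vocab)}

X = [[0] * len(vocab) for _ in emails]
for i, email in enumerate(emails):
    for word in email.split():
        X[i][word_to_index[word]] = 1   # binary feature: word present or not
```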
Feature Representation for Spam
• Are there better features than bag of words?
  – We add bigrams (sets of two words):
    • "CPSC 340", "wait list", "special deal".
  – Or trigrams (sets of three words):
    • "Limited time offer", "course registration deadline", "you're a winner".
  – We might include the sender domain:
    • <sender domain == "mail.com">.
  – We might include regular expressions:
    • <your first and last name>.
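In practice, unigram-plus-bigram features like these can be extracted with scikit-learn's CountVectorizer; a small sketch (the example messages are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ["limited time offer", "CPSC 340 course registration deadline"]

# ngram_range=(1, 2) adds bigrams to the single-word features; binary=True gives 0/1 values.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(emails)          # sparse matrix of 0/1 features
print(vectorizer.get_feature_names_out())     # unigrams and bigrams
```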
Review of Supervised Learning Notation
• We have been using the notation 'X' and 'y' for supervised learning:
• X is the matrix of all features, y is the vector of all labels.
  – We use y_i for the label of example 'i' (element 'i' of 'y').
  – We use x_ij for feature 'j' of example 'i'.
  – We use x_i as the list of features of example 'i' (row 'i' of 'X').
• So in the table below, x_3 = [0 1 1 1 0 0 …].
• In practice, only store the list of non-zero features for each x_i (small memory requirement).
$   Hi   CPSC   340   Vicodin   Offer   …   | Spam?
1   1    0      0     1         0       …   | 1
0   0    0      0     1         1       …   | 1
0   1    1      1     0         0       …   | 0
…   …    …      …     …         …       …   | …
Probabilistic Classifiers
• For years, the best spam filtering methods used naïve Bayes.
  – A probabilistic classifier based on Bayes rule.
  – It tends to work well with bag of words.
  – Recently shown to improve on the state of the art for CRISPR "gene editing" (link).
• Probabilistic classifiers model the conditional probability, p(y_i | x_i).
  – "If a message has words x_i, what is the probability that the message is spam?"
• Classify it as spam if the probability of spam is higher than not spam:
  – If p(y_i = "spam" | x_i) > p(y_i = "not spam" | x_i)
    • return "spam".
  – Else
    • return "not spam".
Spam Filtering with Bayes Rule
• To model the conditional probability, naïve Bayes uses Bayes rule:
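The slide's equation is Bayes rule applied to the spam label, reconstructed here:

\[
p(y_i = \text{spam} \mid x_i) = \frac{p(x_i \mid y_i = \text{spam})\, p(y_i = \text{spam})}{p(x_i)}
\]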
• So we need to figure out three types of terms:
  – Marginal probability p(y_i) that an e-mail is spam.
  – Marginal probability p(x_i) that an e-mail has the set of words x_i.
  – Conditional probability p(x_i | y_i) that a spam e-mail has the words x_i.
    • And the same for non-spam e-mails.
Spam Filtering with Bayes Rule
• What do these terms mean?
[Figure: the space of ALL E-MAILS (including duplicates).]
Spam Filtering with Bayes Rule
• p(y_i = "spam") is the probability that a random e-mail is spam.
  – This is easy to approximate from data: use the proportion in your data.
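Concretely, the counting estimate is just the fraction of spam in the data:

\[
\hat{p}(y_i = \text{spam}) = \frac{\#\,\text{spam e-mails in the data}}{n}
\]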
[Figure: ALL E-MAILS (including duplicates), divided into SPAM and NOT SPAM regions.]
This is an "estimate" of the true probability. In particular, this formula is a "maximum likelihood estimate" (MLE). We will cover likelihoods and MLEs later in the course.
Spam Filtering with Bayes Rule
• p(x_i) is the probability that a random e-mail has features x_i:
  – Hard to approximate: with 'd' words we need to collect 2^d "coupons",
    and that's just to see each word combination once.
Spam Filtering with Bayes Rule
• p(x_i) is the probability that a random e-mail has features x_i:
  – Hard to approximate: with 'd' words we need to collect 2^d "coupons", but it turns out we can ignore it:
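The reason we can ignore it: p(x_i) appears in the denominator for both classes, so it cancels when we compare the two posteriors:

\[
p(\text{spam} \mid x_i) > p(\text{not spam} \mid x_i)
\;\iff\;
p(x_i \mid \text{spam})\, p(\text{spam}) > p(x_i \mid \text{not spam})\, p(\text{not spam}),
\]

since both sides were divided by the same p(x_i).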
Spam Filtering with Bayes Rule
• p(x_i | y_i = "spam") is the probability that a spam e-mail has features x_i.
• Also hard to approximate.
• And we need it.
Naïve Bayes
• Naïve Bayes makes a big assumption to make things easier:
• We assume all features x_i are conditionally independent given the label y_i.
  – Once you know it's spam, the probability of "vicodin" doesn't depend on "340".
  – Definitely not true, but sometimes a good approximation.
• And now we only need easy quantities like p("vicodin" = 0 | y_i = "spam").
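Written out, the conditional independence assumption factorizes the hard term into per-word probabilities:

\[
p(x_i \mid y_i) = \prod_{j=1}^{d} p(x_{ij} \mid y_i)
\]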
Naïve Bayes
• p("vicodin" = 1 | "spam" = 1) is the probability of seeing "vicodin" in a spam e-mail.
[Figure: ALL POSSIBLE E-MAILS (including duplicates), divided into SPAM and NOT SPAM, with the e-mails containing "Vicodin" marked.]
• Easy to estimate:
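Reconstructing the formula described on the slide, it is a simple counting estimate:

\[
\hat{p}(\text{vicodin} = 1 \mid \text{spam} = 1) = \frac{\#\,\text{spam e-mails containing "vicodin"}}{\#\,\text{spam e-mails}}
\]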
Again, this is a "maximum likelihood estimate" (MLE). We will cover how to derive this later.
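Putting the pieces together, a minimal from-scratch naïve Bayes for binary bag-of-words features could look like the sketch below; the variable names and toy data are assumptions, and no Laplace smoothing is applied:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate p(spam) and p(word_j = 1 | class) by counting (MLE, no smoothing)."""
    X, y = np.asarray(X), np.asarray(y)
    p_spam = y.mean()                                  # p(y = 1)
    p_word_given_spam = X[y == 1].mean(axis=0)         # p(x_j = 1 | y = 1) for each word j
    p_word_given_ham = X[y == 0].mean(axis=0)          # p(x_j = 1 | y = 0)
    return p_spam, p_word_given_spam, p_word_given_ham

def predict_spam(x, p_spam, p_word_given_spam, p_word_given_ham):
    """Compare p(x|spam)p(spam) with p(x|not spam)p(not spam) under independence."""
    x = np.asarray(x)
    like_spam = np.prod(np.where(x == 1, p_word_given_spam, 1 - p_word_given_spam))
    like_ham = np.prod(np.where(x == 1, p_word_given_ham, 1 - p_word_given_ham))
    return int(like_spam * p_spam > like_ham * (1 - p_spam))

# Toy usage with the 0/1 feature matrix format from the slides.
X = [[1, 1, 0, 0, 1, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 1, 1, 1, 0, 0]]
y = [1, 1, 0]
params = fit_naive_bayes(X, y)
print(predict_spam([0, 0, 0, 0, 1, 1], *params))   # classify a new e-mail
```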
Summary
• Optimization bias: using a validation set too much overfits.
• Cross-validation: allows better use of data to estimate test error.
• No free lunch theorem: there is no "best" ML model.
• Probabilistic classifiers: try to estimate p(y_i | x_i).
• Naïve Bayes: simple probabilistic classifier based on counting.
  – Uses conditional independence assumptions to make training practical.
• Next time:
  – A "best" machine learning model as 'n' goes to ∞.
Back to Decision Trees
• Instead of a validation set, you can use CV to select tree depth.
• But you can also use these to decide whether to split:
  – Don't split if validation/CV error doesn't improve.
  – Different parts of the tree will have different depths.
• Or fit a deep decision tree and use [cross-]validation to prune:
  – Remove leaf nodes that don't improve CV error.
• Popular implementations have these tricks and others.
Random Subsamples
• Instead of splitting into k folds, consider the "random subsample" method:
  – At each "round", choose a random set of size 'm'.
    • Train on all examples except these 'm' examples.
    • Compute validation error on these 'm' examples.
• Advantages:
  – Still an unbiased estimator of error.
  – Number of "rounds" does not need to be related to 'n'.
• Disadvantage:
  – Examples that are sampled more often get more "weight".
Cross-Validation Theory
• Does CV give an unbiased estimate of test error?
  – Yes!
    • Since each data point is only used once in validation, the expected validation error on each data point is the test error.
  – But again, if you use CV to select among models then it is no longer unbiased.
• What about the variance of CV?
  – Hard to characterize.
  – CV variance on 'n' data points is worse than with a validation set of size 'n'.
    • But we believe it is close.
• Does cross-validation remove optimization bias?
  – No, but the bias might be smaller since you have more "test" points.
Handling Data Sparsity
• Do we need to store the full bag-of-words 0/1 variables?
  – No: only need the list of non-zero features for each e-mail.
  – Math/model doesn't change, but more efficient storage.
$   Hi   CPSC   340   Vicodin   Offer   …   | Non-Zeroes
1   1    0      0     1         0       …   | {1, 2, 5, …}
0   0    0      0     1         1       …   | {5, 6, …}
0   1    1      1     0         0       …   | {2, 3, 4, …}
1   1    0      0     0         1       …   | {1, 2, 6, …}
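A small sketch of storing only the non-zero feature indices, using plain Python sets (real implementations typically use sparse matrix formats such as scipy.sparse):

```python
# Store each e-mail as the set of indices of its non-zero features (0-indexed here).
dense_rows = [[1, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 0, 0, 0, 1]]

sparse_rows = [{j for j, value in enumerate(row) if value == 1} for row in dense_rows]

# Checking whether word j appears in e-mail i is still fast (average O(1) set lookup).
print(4 in sparse_rows[0])   # True: e-mail 0 contains feature 4 (the "Vicodin" column)
```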
Proof of No Free Lunch Theorem
• Let's show the "no free lunch" theorem in a simple setting:
  – The x_i and y_i are binary, and y_i is a deterministic function of x_i.
• With 'd' features, each "learning problem" is a map from each of the 2^d feature combinations to 0 or 1: {0,1}^d -> {0,1}.
• Let's pick one of these maps ("learning problems") and:
  – Generate a training set of 'n' IID samples.
  – Fit model A (convolutional neural network) and model B (naïve Bayes).
Feature 1   Feature 2   Feature 3
0           0           0
0           0           1
0           1           0
…           …           …

Map 1   Map 2   Map 3   …
0       1       0       …
0       0       1       …
0       0       0       …
…       …       …       …
Proof of No Free Lunch Theorem
• Define the "unseen" examples as the (2^d – n) not seen in training.
  – Assuming no repetitions of x_i values, and n < 2^d.
  – Generalization error is the average error on these "unseen" examples.
• Suppose that model A got 1% error and model B got 60% error.
  – We want to show model B beats model A on another "learning problem".
• Among our set of "learning problems", find the one where:
  – The labels y_i agree on all training examples.
  – The labels y_i disagree on all "unseen" examples.
• On this other "learning problem":
  – Model A gets 99% error and model B gets 40% error.
Proof of No Free Lunch Theorem
• Further, across all "learning problems" with these 'n' examples:
  – Average generalization error of every model is 50% on unseen examples.
    • It's right on each unseen example in exactly half the learning problems.
  – With 'k' classes, the average error is (k-1)/k (random guessing).
• This is kind of depressing:
  – For general problems, no "machine learning" is better than "predict 0".
• But the proof also reveals the problem with the NFL theorem:
  – It assumes every "learning problem" is equally likely.
  – The world encourages patterns like "similar features implies similar labels".