CPSC 340: Machine Learning and Data Mining
Probabilistic Classification (Fall 2020)
Admin
• Waiting list people: everyone should be in!
• Course webpage:
  – https://www.cs.ubc.ca/~fwood/CS340/
• Homework 1 due tonight.
Last Time: Training, Testing, and Validation
• Training step:
• Prediction step:
• What we are interested in is the test error:
  – Error made by the prediction step on new data.
Last Time: Fundamental Trade-Off
• We decomposed test error to get a fundamental trade-off:
  – Where E_approx = (E_test – E_train).
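Rearranging that definition gives the decomposition the trade-off is based on:

\[
E_{\text{test}} = \underbrace{(E_{\text{test}} - E_{\text{train}})}_{E_{\text{approx}}} + E_{\text{train}}
\]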
• E_train goes down as the model gets more complicated:
  – Training error goes down as a decision tree gets deeper.
• But E_approx goes up as the model gets more complicated:
  – Training error becomes a worse approximation of test error.
Last Time: Validation Error
• Golden rule: we can't look at test data during training.
• But we can approximate E_test with a validation error:
  – Error on a set of training examples we "hid" during training.
  – Find the decision tree based on the "train" rows.
  – Validation error is the error of the decision tree on the "validation" rows.
• We typically choose "hyper-parameters" like depth to minimize the validation error.
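A minimal sketch of choosing depth by validation error with scikit-learn decision trees; the split size, depth grid, and function name are illustrative assumptions, not the course's own code:

```python
# Choose decision-tree depth by minimizing validation error (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def choose_depth_by_validation(X, y, depths=range(1, 11)):
    # Hide 20% of the training data as a validation set (golden rule: no test data here).
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
    best_depth, best_err = None, np.inf
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
        err = np.mean(model.predict(X_valid) != y_valid)   # validation error for this depth
        if err < best_err:
            best_depth, best_err = depth, err
    return best_depth, best_err
```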
Overfitting to the Validation Set?
• Validation error usually has lower optimization bias than training error.
  – Might optimize over 20 values of "depth", instead of millions+ of possible trees.
• But we can still overfit to the validation error (common in practice):
  – Validation error is only an unbiased approximation if you use it once.
  – Once you start optimizing it, you start to overfit to the validation set.
• This is most important when the validation set is "small":
  – The optimization bias decreases as the number of validation examples increases.
• Remember, our goal is still to do well on the test set (new data), not the validation set (where we already know the labels).
Should you trust them?
• Scenario 1:
  – "I built a model based on the data you gave me."
  – "It classified your data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably not:
  – They are reporting training error.
  – This might have nothing to do with test error.
  – E.g., they could have fit a very deep decision tree.
• Why 'probably'?
  – If they only tried a few very simple models, the 98% might be reliable.
  – E.g., they only considered decision stumps with simple 1-variable rules.
Should you trust them?
• Scenario 2:
  – "I built a model based on half of the data you gave me."
  – "It classified the other half of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably:
  – They computed the validation error once.
  – This is an unbiased approximation of the test error.
  – Trust them if you believe they didn't violate the golden rule.
Should you trust them?
• Scenario 3:
  – "I built 10 models based on half of the data you gave me."
  – "One of them classified the other half of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably:
  – They computed the validation error a small number of times.
  – Maximizing over these errors is a biased approximation of test error.
  – But they only maximized it over 10 models, so the bias is probably small.
  – They probably know about the golden rule.
Should you trust them?
• Scenario 4:
  – "I built 1 billion models based on half of the data you gave me."
  – "One of them classified the other half of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably not:
  – They computed the validation error a huge number of times.
  – They tried so many models, one of them is likely to work by chance.
• Why 'probably'?
  – If the 1 billion models were all extremely simple, 98% might be reliable.
Should you trust them?
• Scenario 5:
  – "I built 1 billion models based on the first third of the data you gave me."
  – "One of them classified the second third of the data with 98% accuracy."
  – "It also classified the last third of the data with 98% accuracy."
  – "It should get 98% accuracy on the rest of your data."
• Probably:
  – They computed the first validation error a huge number of times.
  – But they had a second validation set that they only looked at once.
  – The second validation set gives an unbiased test error approximation.
  – This is ideal, as long as they didn't violate the golden rule on the last third.
  – And assuming you are using IID data in the first place.
Validation Error and Optimization Bias
• Optimization bias is small if you only compare a few models:
  – Best decision tree on the training set among depths 1, 2, 3, …, 10.
  – Risk of overfitting to the validation set is low if we try 10 things.
• Optimization bias is large if you compare a lot of models:
  – All possible decision trees of depth 10 or less.
  – Here we're using the validation set to pick between a billion+ models:
    • Risk of overfitting to the validation set is high: could have low validation error by chance.
  – If you did this, you might want a second validation set to detect overfitting.
• And optimization bias shrinks as you grow the size of the validation set.
Aside: Optimization Bias leads to Publication Bias
• Suppose that 20 researchers perform the exact same experiment:
• They each test whether their effect is "significant" (p < 0.05).
  – 19/20 find that it is not significant.
  – But the 1 group finding it's significant publishes a paper about the effect.
• This is again optimization bias, contributing to publication bias.
  – A contributing factor to many reported effects being wrong.
Cross-Validation (CV)
• Isn't it wasteful to only use part of your data?
• 5-fold cross-validation:
  – Train on 80% of the data, validate on the other 20%.
  – Repeat this 5 times with different splits, and average the score.
Cross-Validation (CV)
[Figure: 5-fold cross-validation. The examples are split into 5 folds; in each of the 5 rounds, one fold is used as VALIDATION and the remaining 4 folds are used as TRAIN.]
Cross-Validation Pseudo-Code
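A minimal from-scratch sketch of k-fold cross-validation for a single hyper-parameter (decision-tree depth); the function name, model choice, and error computation here are illustrative assumptions, not the course's own pseudo-code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cross_validation_error(X, y, depth, k=5):
    """Estimate validation error of a depth-'depth' tree with k-fold CV."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    perm = np.random.permutation(n)          # randomize first (in case data is ordered)
    folds = np.array_split(perm, k)          # then split into k fixed folds
    errors = []
    for i in range(k):
        valid_idx = folds[i]                              # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = DecisionTreeClassifier(max_depth=depth)
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[valid_idx]) != y[valid_idx]))
    return np.mean(errors)                   # average error across the k folds
```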
Cross-Validation (CV)
• You can take this idea further ("k-fold cross-validation"):
  – 10-fold cross-validation: train on 90% of the data and validate on the other 10%.
    • Repeat 10 times and average (test on fold 1, then fold 2, …, then fold 10).
  – Leave-one-out cross-validation: train on all but one training example.
    • Repeat n times and average.
• Gets more accurate but more expensive with more folds.
  – To choose depth we compute the cross-validation score for each depth.
• As before, if data is ordered then folds should be random splits.
  – Randomize first, then split into fixed folds.
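In practice the same depth search can be done with scikit-learn's built-in helpers; a small sketch with a synthetic dataset (the data and depth grid are placeholders, not from the slides):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic dataset, just to make the snippet runnable.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0] ^ X[:, 1]   # label depends on the first two features

# 10-fold CV accuracy for each candidate depth; keep the depth with the best score.
scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d), X, y, cv=10).mean()
          for d in range(1, 11)}
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```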
(pause)
The "Best" Machine Learning Model
• Decision trees are not always the most accurate on test error.
• What is the "best" machine learning model?
• An alternative measure of performance is the generalization error:
  – Average error over all x_i vectors that are not seen in the training set.
  – "How well we expect to do for a completely unseen feature vector."
• No free lunch theorem (proof in bonus slides):
  – There is no "best" model achieving the best generalization error for every problem.
  – If model A generalizes better to new data than model B on one dataset, there is another dataset where model B works better.
• This question is like asking which is "best" among "rock", "paper", and "scissors".
The "Best" Machine Learning Model
• Implications of the lack of a "best" model:
  – We need to learn about and try out multiple models.
• So which ones to study in CPSC 340?
  – We'll usually motivate each method by a specific application.
  – But we're focusing on models that have been effective in many applications.
• Caveat of the no free lunch (NFL) theorem:
  – The world is very structured.
  – Some datasets are more likely than others.
  – Model A really could be better than model B on every real dataset in practice.
• Machine learning research:
  – Large focus on models that are useful across many applications.
Application: E-mail Spam Filtering
• Want to build a system that detects spam e-mails.
  – Context: spam used to be a big problem.
• Can we formulate this as supervised learning?
Spam Filtering as Supervised Learning
• Collect a large number of e-mails, get users to label them.
• We can use (y_i = 1) if e-mail 'i' is spam, (y_i = 0) if e-mail 'i' is not spam.
• Extract features of each e-mail (like bag of words).
  – (x_ij = 1) if word/phrase 'j' is in e-mail 'i', (x_ij = 0) if it is not.
$   Hi   CPSC   340   Vicodin   Offer   …   | Spam?
1   1    0      0     1         0       …   | 1
0   0    0      0     1         1       …   | 1
0   1    1      1     0         0       …   | 0
…   …    …      …     …         …       …   | …
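A minimal sketch of building this kind of binary bag-of-words matrix from raw text; the toy e-mails and vocabulary below are invented for illustration:

```python
# Build a binary bag-of-words matrix: X[i][j] = 1 if word j appears in e-mail i.
emails = ["$ hi offer vicodin", "vicodin offer", "hi cpsc 340"]   # toy messages
labels = [1, 1, 0]                                                # 1 = spam, 0 = not spam

vocab = sorted({word for email in emails for word in email.split()})
word_to_index = {word: j for j, word in enumerate(vocab)}

X = [[0] * len(vocab) for _ in emails]
for i, email in enumerate(emails):
    for word in email.split():
        X[i][word_to_index[word]] = 1   # binary feature: word present or not
```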
Feature Representation for Spam
• Are there better features than bag of words?
  – We add bigrams (sets of two words):
    • "CPSC 340", "wait list", "special deal".
  – Or trigrams (sets of three words):
    • "Limited time offer", "course registration deadline", "you're a winner".
  – We might include the sender domain:
    • <sender domain == "mail.com">.
  – We might include regular expressions:
    • <your first and last name>.
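In practice, unigram-plus-bigram features like these can be extracted with scikit-learn's CountVectorizer; a small sketch (the example messages are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = ["limited time offer", "CPSC 340 course registration deadline"]

# ngram_range=(1, 2) adds bigrams to the single-word features; binary=True gives 0/1 values.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(emails)          # sparse matrix of 0/1 features
print(vectorizer.get_feature_names_out())     # unigrams and bigrams
```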
Review of Supervised Learning Notation
• We have been using the notation 'X' and 'y' for supervised learning:
• X is the matrix of all features, y is the vector of all labels.
  – We use y_i for the label of example 'i' (element 'i' of 'y').
  – We use x_ij for feature 'j' of example 'i'.
  – We use x_i as the list of features of example 'i' (row 'i' of 'X').
• So in the table below, x_3 = [0 1 1 1 0 0 …].
• In practice, only store the list of non-zero features for each x_i (small memory requirement).
$   Hi   CPSC   340   Vicodin   Offer   …   | Spam?
1   1    0      0     1         0       …   | 1
0   0    0      0     1         1       …   | 1
0   1    1      1     0         0       …   | 0
…   …    …      …     …         …       …   | …
Probabilistic Classifiers
• For years, the best spam filtering methods used naïve Bayes.
  – A probabilistic classifier based on Bayes rule.
  – It tends to work well with bag of words.
  – Recently shown to improve on the state of the art for CRISPR "gene editing" (link).
• Probabilistic classifiers model the conditional probability, p(y_i | x_i).
  – "If a message has words x_i, what is the probability that the message is spam?"
• Classify it as spam if the probability of spam is higher than not spam:
  – If p(y_i = "spam" | x_i) > p(y_i = "not spam" | x_i)
    • return "spam".
  – Else
    • return "not spam".
Spam Filtering with Bayes Rule
• To model the conditional probability, naïve Bayes uses Bayes rule:
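The slide's equation is Bayes rule applied to the spam label, reconstructed here:

\[
p(y_i = \text{spam} \mid x_i) = \frac{p(x_i \mid y_i = \text{spam})\, p(y_i = \text{spam})}{p(x_i)}
\]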
• So we need to figure out three types of terms:
  – Marginal probability p(y_i) that an e-mail is spam.
  – Marginal probability p(x_i) that an e-mail has the set of words x_i.
  – Conditional probability p(x_i | y_i) that a spam e-mail has the words x_i.
    • And the same for non-spam e-mails.
Spam Filtering with Bayes Rule
• What do these terms mean?
[Figure: the space of ALL E-MAILS (including duplicates).]
Spam Filtering with Bayes Rule
• p(y_i = "spam") is the probability that a random e-mail is spam.
  – This is easy to approximate from data: use the proportion in your data.
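Concretely, the counting estimate is just the fraction of spam in the data:

\[
\hat{p}(y_i = \text{spam}) = \frac{\#\,\text{spam e-mails in the data}}{n}
\]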
[Figure: ALL E-MAILS (including duplicates), divided into SPAM and NOT SPAM regions.]
This is an "estimate" of the true probability. In particular, this formula is a "maximum likelihood estimate" (MLE). We will cover likelihoods and MLEs later in the course.
Spam Filtering with Bayes Rule
• p(x_i) is the probability that a random e-mail has features x_i:
  – Hard to approximate: with 'd' words we need to collect 2^d "coupons",
    and that's just to see each word combination once.
Spam Filtering with Bayes Rule
• p(x_i) is the probability that a random e-mail has features x_i:
  – Hard to approximate: with 'd' words we need to collect 2^d "coupons", but it turns out we can ignore it:
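The reason we can ignore it: p(x_i) appears in the denominator for both classes, so it cancels when we compare the two posteriors:

\[
p(\text{spam} \mid x_i) > p(\text{not spam} \mid x_i)
\;\iff\;
p(x_i \mid \text{spam})\, p(\text{spam}) > p(x_i \mid \text{not spam})\, p(\text{not spam}),
\]

since both sides were divided by the same p(x_i).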
Spam Filtering with Bayes Rule
• p(x_i | y_i = "spam") is the probability that a spam e-mail has features x_i.
• Also hard to approximate.
• And we need it.
Naïve Bayes
• Naïve Bayes makes a big assumption to make things easier:
• We assume all features x_i are conditionally independent given the label y_i.
  – Once you know it's spam, the probability of "vicodin" doesn't depend on "340".
  – Definitely not true, but sometimes a good approximation.
• And now we only need easy quantities like p("vicodin" = 0 | y_i = "spam").
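Written out, the conditional independence assumption factorizes the hard term into per-word probabilities:

\[
p(x_i \mid y_i) = \prod_{j=1}^{d} p(x_{ij} \mid y_i)
\]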
Naïve Bayes
• p("vicodin" = 1 | "spam" = 1) is the probability of seeing "vicodin" in a spam e-mail.
[Figure: ALL POSSIBLE E-MAILS (including duplicates), divided into SPAM and NOT SPAM, with the e-mails containing "Vicodin" marked.]
• Easy to estimate:
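Reconstructing the formula described on the slide, it is a simple counting estimate:

\[
\hat{p}(\text{vicodin} = 1 \mid \text{spam} = 1) = \frac{\#\,\text{spam e-mails containing "vicodin"}}{\#\,\text{spam e-mails}}
\]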
Again, this is a "maximum likelihood estimate" (MLE). We will cover how to derive this later.
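Putting the pieces together, a minimal from-scratch naïve Bayes for binary bag-of-words features could look like the sketch below; the variable names and toy data are assumptions, and no Laplace smoothing is applied:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate p(spam) and p(word_j = 1 | class) by counting (MLE, no smoothing)."""
    X, y = np.asarray(X), np.asarray(y)
    p_spam = y.mean()                                  # p(y = 1)
    p_word_given_spam = X[y == 1].mean(axis=0)         # p(x_j = 1 | y = 1) for each word j
    p_word_given_ham = X[y == 0].mean(axis=0)          # p(x_j = 1 | y = 0)
    return p_spam, p_word_given_spam, p_word_given_ham

def predict_spam(x, p_spam, p_word_given_spam, p_word_given_ham):
    """Compare p(x|spam)p(spam) with p(x|not spam)p(not spam) under independence."""
    x = np.asarray(x)
    like_spam = np.prod(np.where(x == 1, p_word_given_spam, 1 - p_word_given_spam))
    like_ham = np.prod(np.where(x == 1, p_word_given_ham, 1 - p_word_given_ham))
    return int(like_spam * p_spam > like_ham * (1 - p_spam))

# Toy usage with the 0/1 feature matrix format from the slides.
X = [[1, 1, 0, 0, 1, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 1, 1, 1, 0, 0]]
y = [1, 1, 0]
params = fit_naive_bayes(X, y)
print(predict_spam([0, 0, 0, 0, 1, 1], *params))   # classify a new e-mail
```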
Summary
• Optimization bias: using a validation set too much overfits.
• Cross-validation: allows better use of data to estimate test error.
• No free lunch theorem: there is no "best" ML model.
• Probabilistic classifiers: try to estimate p(y_i | x_i).
• Naïve Bayes: simple probabilistic classifier based on counting.
  – Uses conditional independence assumptions to make training practical.
• Next time:
  – A "best" machine learning model as 'n' goes to ∞.
Back to Decision Trees
• Instead of a validation set, you can use CV to select tree depth.
• But you can also use these to decide whether to split:
  – Don't split if validation/CV error doesn't improve.
  – Different parts of the tree will have different depths.
• Or fit a deep decision tree and use [cross-]validation to prune:
  – Remove leaf nodes that don't improve CV error.
• Popular implementations have these tricks and others.
Random Subsamples
• Instead of splitting into k folds, consider the "random subsample" method:
  – At each "round", choose a random set of size 'm'.
    • Train on all examples except these 'm' examples.
    • Compute validation error on these 'm' examples.
• Advantages:
  – Still an unbiased estimator of error.
  – Number of "rounds" does not need to be related to 'n'.
• Disadvantage:
  – Examples that are sampled more often get more "weight".
Cross-Validation Theory
• Does CV give an unbiased estimate of test error?
  – Yes!
    • Since each data point is only used once in validation, the expected validation error on each data point is the test error.
  – But again, if you use CV to select among models then it is no longer unbiased.
• What about the variance of CV?
  – Hard to characterize.
  – CV variance on 'n' data points is worse than with a validation set of size 'n'.
    • But we believe it is close.
• Does cross-validation remove optimization bias?
  – No, but the bias might be smaller since you have more "test" points.
Handling Data Sparsity
• Do we need to store the full bag-of-words 0/1 variables?
  – No: only need the list of non-zero features for each e-mail.
  – Math/model doesn't change, but more efficient storage.
$   Hi   CPSC   340   Vicodin   Offer   …   | Non-Zeroes
1   1    0      0     1         0       …   | {1, 2, 5, …}
0   0    0      0     1         1       …   | {5, 6, …}
0   1    1      1     0         0       …   | {2, 3, 4, …}
1   1    0      0     0         1       …   | {1, 2, 6, …}
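A small sketch of storing only the non-zero feature indices, using plain Python sets (real implementations typically use sparse matrix formats such as scipy.sparse):

```python
# Store each e-mail as the set of indices of its non-zero features (0-indexed here).
dense_rows = [[1, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 0, 0, 0, 1]]

sparse_rows = [{j for j, value in enumerate(row) if value == 1} for row in dense_rows]

# Checking whether word j appears in e-mail i is still fast (average O(1) set lookup).
print(4 in sparse_rows[0])   # True: e-mail 0 contains feature 4 (the "Vicodin" column)
```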
Proof of No Free Lunch Theorem
• Let's show the "no free lunch" theorem in a simple setting:
  – The x_i and y_i are binary, and y_i is a deterministic function of x_i.
• With 'd' features, each "learning problem" is a map from each of the 2^d feature combinations to 0 or 1: {0,1}^d -> {0,1}.
• Let's pick one of these maps ("learning problems") and:
  – Generate a training set of 'n' IID samples.
  – Fit model A (convolutional neural network) and model B (naïve Bayes).
Feature 1   Feature 2   Feature 3
0           0           0
0           0           1
0           1           0
…           …           …

Map 1   Map 2   Map 3   …
0       1       0       …
0       0       1       …
0       0       0       …
…       …       …       …
Proof of No Free Lunch Theorem
• Define the "unseen" examples as the (2^d – n) not seen in training.
  – Assuming no repetitions of x_i values, and n < 2^d.
  – Generalization error is the average error on these "unseen" examples.
• Suppose that model A got 1% error and model B got 60% error.
  – We want to show model B beats model A on another "learning problem".
• Among our set of "learning problems", find the one where:
  – The labels y_i agree on all training examples.
  – The labels y_i disagree on all "unseen" examples.
• On this other "learning problem":
  – Model A gets 99% error and model B gets 40% error.
Proof of No Free Lunch Theorem
• Further, across all "learning problems" with these 'n' examples:
  – Average generalization error of every model is 50% on unseen examples.
    • It's right on each unseen example in exactly half the learning problems.
  – With 'k' classes, the average error is (k-1)/k (random guessing).
• This is kind of depressing:
  – For general problems, no "machine learning" is better than "predict 0".
• But the proof also reveals the problem with the NFL theorem:
  – It assumes every "learning problem" is equally likely.
  – The world encourages patterns like "similar features implies similar labels".