Center for Big Data Analytics and Discovery Informatics Artificial … · 2018. 9. 9. · Center...

Center for Big Data Analytics and Discovery Informatics Artificial Intelligence Research Laboratory

Fall 2018 Vasant G Honavar

Evaluating Classifier Performance

Vasant Honavar
Artificial Intelligence Research Laboratory
Informatics Graduate Program
Computer Science and Engineering Graduate Program
Bioinformatics and Genomics Graduate Program
Neuroscience Graduate Program
Center for Big Data Analytics and Discovery Informatics
Huck Institutes of the Life Sciences
Institute for Cyberscience
Clinical and Translational Sciences Institute
Northeast Big Data Hub
Pennsylvania State University

vhonavar@ist.psu.edu
http://faculty.ist.psu.edu/vhonavar
http://ailab.ist.psu.edu

Why Evaluate Classifiers?

•  To know how well a classifier can be expected to perform when it is put to use
•  To choose the best model from among a set of alternatives

Evaluating a Classifier

•  How can we measure the performance of classifiers?
•  How well can a classifier be expected to perform on novel data, i.e., data not seen during training?
•  We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using an evaluation data set (not used for training)
•  How close is the estimated performance to the true performance?

Classification Error

•  Error = classifying a record as belonging to one class when it belongs to another class
•  Error rate = percentage of misclassified samples out of the total samples in the validation data

Naïve Baseline

Naïve baseline: classify all samples as belonging to the most prevalent class

•  We hope to do better than the naïve baseline
•  When the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve baseline in terms of accuracy

Estimating Classifier Performance

N: total number of instances in the data set
TP_j: number of true positives for class j
FP_j: number of false positives for class j
TN_j: number of true negatives for class j
FN_j: number of false negatives for class j

$$\mathrm{Accuracy}_j = \frac{TP_j + TN_j}{N} = P\big((\mathrm{class}=c_j) \Leftrightarrow (\mathrm{label}=c_j)\big)$$

Perfect classifier ⇔ Accuracy = 1
Popular measure
Biased in favor of the majority class!
Should be used with caution!

Classifier Learning: Measuring Performance

                Class C1     Class ¬C1
Label C1        TP = 55      FP = 5
Label ¬C1       FN = 10      TN = 30

$$N = TP + FN + TN + FP = 55 + 10 + 30 + 5 = 100$$
$$\mathrm{sensitivity}_1 = \frac{TP}{TP + FN} = \frac{55}{55 + 10} = \frac{55}{65}$$
$$\mathrm{specificity}_1 = \frac{TP}{TP + FP} = \frac{55}{55 + 5} = \frac{55}{60}$$
$$\mathrm{accuracy}_1 = \frac{TP + TN}{N} = \frac{55 + 30}{100} = \frac{85}{100}$$
$$\mathrm{falsealarm}_1 = \frac{FP}{TN + FP} = \frac{5}{30 + 5} = \frac{5}{35}$$
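The arithmetic for this confusion matrix can be checked with a few lines of Python. This is only a sketch; the variable names are chosen here for illustration, and "specificity" follows the usage on these slides, TP/(TP+FP), which most texts call precision.

```python
# Counts taken from the confusion matrix on this slide.
TP, FP, FN, TN = 55, 5, 10, 30
N = TP + FP + FN + TN

sensitivity = TP / (TP + FN)   # also called recall or hit rate
specificity = TP / (TP + FP)   # slide usage; usually called precision
accuracy = (TP + TN) / N
false_alarm = FP / (TN + FP)

print(f"N={N} sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"accuracy={accuracy:.3f} false_alarm={false_alarm:.3f}")
```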

When One Class Is More Important than Another

In many cases it is more important to identify members of a specific target class:

–  Tax fraud
–  Credit default
–  Response to promotional offer
–  Detecting electronic network intrusion
–  Predicting delayed flights
–  Diagnosing cancer
–  Predicting nuclear reactor meltdown

In such cases, we may tolerate greater overall error in return for better predictions of the more important class

Measuring Classifier Performance: Sensitivity

$$\mathrm{Sensitivity}_j = \frac{TP_j}{TP_j + FN_j} = \frac{\mathrm{Count}(\mathrm{label}=c_j \wedge \mathrm{class}=c_j)}{\mathrm{Count}(\mathrm{class}=c_j)} = P(\mathrm{label}=c_j \mid \mathrm{class}=c_j)$$

Perfect classifier → Sensitivity = 1
Probability of correctly labeling members of the target class
Also called recall or hit rate

Measuring Classifier Performance: Specificity

$$\mathrm{Specificity}_j = \frac{TP_j}{TP_j + FP_j} = \frac{\mathrm{Count}(\mathrm{label}=c_j \wedge \mathrm{class}=c_j)}{\mathrm{Count}(\mathrm{label}=c_j)} = P(\mathrm{class}=c_j \mid \mathrm{label}=c_j)$$

Perfect classifier → Specificity = 1
Also called precision
Probability that a positive prediction is correct

Measuring Performance: Precision, Recall, and False Alarm Rate

$$\mathrm{Precision}_j = \mathrm{Specificity}_j = \frac{TP_j}{TP_j + FP_j}$$
$$\mathrm{Recall}_j = \mathrm{Sensitivity}_j = \frac{TP_j}{TP_j + FN_j}$$
$$\mathrm{FalseAlarm}_j = \frac{FP_j}{TN_j + FP_j} = \frac{\mathrm{Count}(\mathrm{label}=c_j \wedge \mathrm{class}=\neg c_j)}{\mathrm{Count}(\mathrm{class}=\neg c_j)} = P(\mathrm{label}=c_j \mid \mathrm{class}=\neg c_j)$$

Perfect classifier → Precision = 1
Perfect classifier → Recall = 1
Perfect classifier → False Alarm Rate = 0

Measuring Performance: Correlation Coefficient

$$CC_j = \frac{(TP_j \times TN_j) - (FP_j \times FN_j)}{\sqrt{(TP_j + FN_j)(TP_j + FP_j)(TN_j + FP_j)(TN_j + FN_j)}}, \qquad -1 \le CC_j \le 1$$

$$CC_j = \frac{1}{|D|} \sum_{d_i \in D} \frac{(jlabel_i - \overline{jlabel})(jclass_i - \overline{jclass})}{\sigma_{JLABEL}\,\sigma_{JCLASS}}$$

where $jlabel_i = 1$ iff the classifier assigns $d_i$ to class $c_j$, and $jclass_i = 1$ iff the true class of $d_i$ is class $c_j$
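As an illustration, the count-based form of the correlation coefficient can be evaluated for the confusion matrix used earlier in the deck (TP = 55, FP = 5, FN = 10, TN = 30). The function name and the choice to return 0.0 when the denominator vanishes are assumptions made for this sketch, not something the slides specify.

```python
import math

def correlation_coefficient(tp, fp, fn, tn):
    """Correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention chosen for this sketch when CC is undefined
    return (tp * tn - fp * fn) / denom

cc = correlation_coefficient(tp=55, fp=5, fn=10, tn=30)
print(f"CC = {cc:.3f}")
```

A perfect classifier (FP = FN = 0) gives CC = 1, and a classifier that gets every example wrong (TP = TN = 0) gives CC = -1.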

Beware of terminological confusion in the literature!

•  Some bioinformatics authors use "accuracy" incorrectly to refer to recall (i.e., sensitivity) or precision (i.e., specificity)
•  In medical statistics, specificity sometimes refers to sensitivity for the negative class, i.e., $\frac{TN_j}{TN_j + FP_j}$
•  Some authors use false alarm rate to refer to the probability that a positive prediction is incorrect, i.e., $\frac{FP_j}{FP_j + TP_j} = 1 - \mathrm{Precision}_j$

When you write
•  provide the formula in terms of TP, TN, FP, FN
When you read
•  check the formula in terms of TP, TN, FP, FN

Measuring Classifier Performance

•  TP, FP, TN, FN provide the relevant information
•  No single measure tells the whole story
•  A classifier with 98% accuracy can be useless if 98% of the population does not have cancer and the 2% that do are misclassified by the classifier
•  Use of multiple measures recommended
•  Beware of terminological confusion!
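The cancer example can be made concrete with a small sketch. The population size of 1000 and the variable names are assumptions chosen for illustration; the 98% / 2% proportions come from the slide.

```python
# Hypothetical population of 1000 people, 2% of whom have cancer.
population = 1000
with_cancer = 20
without_cancer = population - with_cancer

# A classifier that labels everyone "no cancer".
TP, FN = 0, with_cancer        # every cancer case is missed
FP, TN = 0, without_cancer     # every healthy person is labeled correctly

accuracy = (TP + TN) / population
sensitivity = TP / (TP + FN)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f}")
```

Accuracy alone looks excellent, while sensitivity shows that the classifier never finds the class we care about.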

Micro-averaged performance measures

Performance on a random sample

$$\mathrm{MicroAverage\ CC} = \frac{\left(\sum_j TP_j\right)\left(\sum_j TN_j\right) - \left(\sum_j FP_j\right)\left(\sum_j FN_j\right)}{\sqrt{\left(\sum_j TP_j + \sum_j FN_j\right)\left(\sum_j TP_j + \sum_j FP_j\right)\left(\sum_j TN_j + \sum_j FP_j\right)\left(\sum_j TN_j + \sum_j FN_j\right)}}$$

$$\mathrm{MicroAverage\ Precision} = \frac{\sum_j TP_j}{\sum_j TP_j + \sum_j FP_j}$$

$$\mathrm{MicroAverage\ Recall} = \frac{\sum_j TP_j}{\sum_j TP_j + \sum_j FN_j}$$

$$\mathrm{MicroAverage\ FalseAlarm} = 1 - \mathrm{MicroAverage\ Precision}$$

(false alarm here in the sense of the probability that a positive prediction is incorrect)

$$\mathrm{MicroAverage\ Accuracy} = \frac{\sum_j TP_j}{N}$$

Etc.

•  Micro-averaging gives equal importance to each sample
•  Classes with a large number of instances dominate

Macro-averaged performance measures

$$\mathrm{MacroAverage\ CorrelationCoeff} = \frac{1}{M} \sum_j \mathrm{CorrelationCoeff}_j$$

$$\mathrm{MacroAverage\ Specificity} = \frac{1}{M} \sum_j \mathrm{Specificity}_j$$

$$\mathrm{MacroAverage\ Sensitivity} = \frac{1}{M} \sum_j \mathrm{Sensitivity}_j$$

Macro-averaging gives equal importance to each of the M classes
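The difference between the two averaging schemes shows up as soon as class sizes are unequal. The sketch below uses made-up per-class counts for three classes (one large, two small); the counts, class names, and helper names are all hypothetical.

```python
# Hypothetical per-class counts for a 3-class problem; class "a" is much larger.
counts = {
    "a": {"tp": 90, "fp": 10, "fn": 5},
    "b": {"tp": 8, "fp": 4, "fn": 6},
    "c": {"tp": 2, "fp": 1, "fn": 4},
}

def precision(c):
    return c["tp"] / (c["tp"] + c["fp"])

def recall(c):
    return c["tp"] / (c["tp"] + c["fn"])

# Micro-average: pool the counts over classes, then compute the measure once.
total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
micro_precision = precision(total)
micro_recall = recall(total)

# Macro-average: compute the measure per class, then take the unweighted mean.
M = len(counts)
macro_precision = sum(precision(c) for c in counts.values()) / M
macro_recall = sum(recall(c) for c in counts.values()) / M

print(f"micro: precision={micro_precision:.3f} recall={micro_recall:.3f}")
print(f"macro: precision={macro_precision:.3f} recall={macro_recall:.3f}")
```

With these counts the micro-averaged values are pulled toward the large class, while the macro-averaged values are pulled down by the two small classes.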

Cutoff for classification

Most machine learning algorithms classify via a 2-step process. For each sample:

1.  Compute the probability of belonging to class "1"
2.  Compare to the cutoff value, and classify accordingly

•  Default cutoff value is 0.50
   If >= 0.50, classify as "1"
   If < 0.50, classify as "0"
•  Can use different cutoff values for trading off one measure against another (more on this later)
•  Question: How would this work in the case of K nearest neighbor?

Cutoff Table

•  If cutoff is 0.50: 13 samples are classified as "1"
•  If cutoff is 0.80: seven samples are classified as "1"

Actual class   Prob. of "1"
1              0.996
1              0.988
1              0.984
1              0.980
1              0.948
1              0.889
1              0.848
0              0.762
1              0.707
1              0.681
1              0.656
0              0.622
1              0.506
0              0.471
0              0.337
1              0.218
0              0.199
0              0.149
0              0.048
0              0.038
0              0.025
0              0.022
0              0.016
0              0.004
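Applying a cutoff to the table is a one-line filter. The sketch below transcribes the 24 rows of the table and counts how many records are classified as "1" at each cutoff, along with the resulting confusion-matrix counts; the function and variable names are chosen for illustration. Counting the rows of the table gives 13 records at or above 0.50 and 7 at or above 0.80.

```python
# (actual class, estimated probability of class "1") for the 24 records in the table.
records = [
    (1, 0.996), (1, 0.988), (1, 0.984), (1, 0.980), (1, 0.948), (1, 0.889),
    (1, 0.848), (0, 0.762), (1, 0.707), (1, 0.681), (1, 0.656), (0, 0.622),
    (1, 0.506), (0, 0.471), (0, 0.337), (1, 0.218), (0, 0.199), (0, 0.149),
    (0, 0.048), (0, 0.038), (0, 0.025), (0, 0.022), (0, 0.016), (0, 0.004),
]

def confusion_counts(records, cutoff):
    """Return (TP, FP, FN, TN) when records with prob >= cutoff are labeled "1"."""
    tp = sum(1 for actual, prob in records if prob >= cutoff and actual == 1)
    fp = sum(1 for actual, prob in records if prob >= cutoff and actual == 0)
    fn = sum(1 for actual, prob in records if prob < cutoff and actual == 1)
    tn = sum(1 for actual, prob in records if prob < cutoff and actual == 0)
    return tp, fp, fn, tn

for cutoff in (0.50, 0.80):
    tp, fp, fn, tn = confusion_counts(records, cutoff)
    print(f"cutoff={cutoff:.2f}: classified as 1 = {tp + fp}, "
          f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

Raising the cutoff removes both false positives here, at the price of more false negatives.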

Receiver Operating Characteristic (ROC) Curve

•  The confusion matrix, and hence the previous measures of classifier performance, are threshold dependent
•  We can often trade off recall versus precision, e.g., by adjusting the classification threshold θ
•  Is there a threshold-independent measure of classifier performance?
   –  The ROC curve is a plot of Sensitivity against False Alarm Rate, which is the same as (1 - Specificity) when specificity is taken in the medical-statistics sense TN/(TN+FP); it characterizes this tradeoff for a given classifier
   –  The ROC curve is obtained by plotting sensitivity against (1 - specificity) while varying the classification threshold

Measuring Performance of Classifiers: ROC Curves

•  ROC curves offer a more complete picture of the performance of the classifier as a function of the classification threshold
•  A classifier h is better than another classifier g if ROC(h) dominates ROC(g)
•  ROC(h) dominates ROC(g) → Area ROC(h) > Area ROC(g)
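A minimal sketch of how ROC points and the area under the curve could be computed by sweeping the threshold over the scores from the cutoff table shown earlier. The function names are hypothetical, and the sweep assumes there are no tied scores (true for this table); tied scores would need to be grouped before a point is emitted.

```python
# (actual class, estimated probability of class "1") from the cutoff table.
records = [
    (1, 0.996), (1, 0.988), (1, 0.984), (1, 0.980), (1, 0.948), (1, 0.889),
    (1, 0.848), (0, 0.762), (1, 0.707), (1, 0.681), (1, 0.656), (0, 0.622),
    (1, 0.506), (0, 0.471), (0, 0.337), (1, 0.218), (0, 0.199), (0, 0.149),
    (0, 0.048), (0, 0.038), (0, 0.025), (0, 0.022), (0, 0.016), (0, 0.004),
]

def roc_points(records):
    """Lower the threshold one record at a time; return (false alarm, sensitivity) points."""
    positives = sum(1 for actual, _ in records if actual == 1)
    negatives = len(records) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for actual, _ in sorted(records, key=lambda r: r[1], reverse=True):
        if actual == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

points = roc_points(records)
auc = area_under_curve(points)
print(f"{len(points)} ROC points, area under curve = {auc:.4f}")
```

The area summarizes the whole curve in one threshold-independent number: 1.0 for a perfect ranking and 0.5 for a random one.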

[Figure: ROC curve, sensitivity plotted against false alarm rate; both axes run from 0 to 1]

Misclassification Costs May Differ

•  The cost of making a misclassification error may be higher for one class than the other(s)
•  Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)

Example: Response to Promotional Offer

•  Suppose we send an offer to 1000 people, with 1% average response rate
•  "1" = response, "0" = nonresponse
•  "Naïve rule" (classify everyone as "0") has error rate of 1% (seems good)
•  Using machine learning, suppose we can correctly classify eight 1's as 1's
•  But at the cost of misclassifying twenty 0's as 1's and two 1's as 0's

Confusion Matrix

              Predict as 1    Predict as 0
Actual 1      8               2
Actual 0      20              970

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)

Introducing Costs & Benefits

Suppose:
•  Profit from a "1" is $10
•  Cost of sending offer is $1

Then:
•  Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
•  Under DM predictions, 28 offers are sent
   8 respond with profit of $10 each
   20 fail to respond, cost $1 each
   972 receive nothing (no cost, no profit)

Profit Matrix

              Predict as 1    Predict as 0
Actual 1      $80             0
Actual 0      ($20)           0
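The error rate and the profit for this example can be recomputed from the confusion matrix. This sketch follows the slide's accounting, in which each responder yields a $10 profit and each offer sent to a non-responder costs $1; the variable names are chosen for illustration.

```python
# Confusion matrix from the promotional-offer example: (actual, predicted) -> count.
confusion = {(1, 1): 8, (1, 0): 2, (0, 1): 20, (0, 0): 970}
total = sum(confusion.values())

PROFIT_PER_RESPONSE = 10   # dollars, from the slide
COST_PER_WASTED_OFFER = 1  # dollars, from the slide

error_rate = (confusion[(1, 0)] + confusion[(0, 1)]) / total
naive_error_rate = (confusion[(1, 1)] + confusion[(1, 0)]) / total  # everyone predicted "0"

# Profit matrix entries: only records predicted as "1" receive an offer.
profit_from_responders = confusion[(1, 1)] * PROFIT_PER_RESPONSE
cost_of_wasted_offers = confusion[(0, 1)] * COST_PER_WASTED_OFFER
net_profit = profit_from_responders - cost_of_wasted_offers

print(f"error rate {error_rate:.1%} vs naive {naive_error_rate:.1%}, net profit ${net_profit}")
```

The model has a higher error rate than the naïve rule but a positive net profit, whereas the naïve rule earns nothing.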

Evaluating a Classifier

•  What we have done so far is to estimate the classifier's performance on some available data
•  How well can a classifier be expected to perform on novel data?
•  Performance estimated on training data is often optimistic relative to performance on novel data
•  We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using evaluation data (not used for training)
•  How close is the estimated performance to the true performance?

Evaluation of a classifier with limited data

•  Holdout method: use part of the data for training, and the rest for testing
•  We may be lucky or unlucky: training data or test data may not be representative
•  Solution: run multiple experiments with disjoint training and test data sets in which each class is represented in roughly the same proportion as in the entire data set

Classifier evaluation

[Figure sequence: labeled data (examples with 0/1 labels) is split into training data and testing data. A classifier (model) is trained on the training data. We then pretend we don't know the labels of the testing data, let the model classify those examples, and compare the predicted labels to the actual labels.]

Comparing algorithms

[Figure: model 1 and model 2 each classify the same test examples; each model's predicted labels are compared with the true labels in an evaluation step, giving score 1 and score 2.]

•  Is model 2 better than model 1?
•  model 2 is better if score 2 > score 1
•  When would we want to do this type of comparison?

Is model 2 better?

Model 1: 85% accuracy      Model 2: 80% accuracy
Model 1: 85.5% accuracy    Model 2: 85.0% accuracy
Model 1: 0% accuracy       Model 2: 100% accuracy

Comparing scores: significance

•  Just comparing scores on one data set isn't enough!
•  We don't just want to know which system is better on one particular data set; we want to know if model 1 is better than model 2 in general
•  Put another way, we want to be confident that the difference is real and not just due to random chance

Variance of performance

How do we know how variable a model's accuracy is? Variance

•  We need multiple accuracy scores!
•  How can we get them?

Repeated experimentation

Instead of one evaluation with a particular split of training and test data, run multiple evaluations, with different splits of training and test data

[Figure: the same labeled data shown several times, each time with a different subset marked for evaluation and the remainder marked for training.]

K-fold cross validation

•  Break the training data into K equal-sized parts (splits)
•  Repeat for all parts/splits: train on K-1 parts, evaluate on the other
•  Each repetition yields one score (score 1, score 2, score 3, ...)

[Figure: training data divided into split 1, split 2, split 3, ...; each split is used once for evaluation and produces one score.]

K-fold cross validation

•  Better utilization of labeled data
•  More robust: don't just rely on one evaluation set to evaluate the approach (or for optimizing parameters)
•  Multiplies the computational overhead by K (have to train K models instead of just one)
•  10 is the most common choice of K

Estimating the performance of a classifier: K-fold cross-validation

Partition the data (multi)set S into K equal parts S_1 .. S_K with roughly the same class distribution as S.

Errorc = 0
For i = 1 to K do
{
    S_Test ← S_i;
    S_Train ← S − S_i;
    α ← Learn(S_Train);
    Errorc ← Errorc + Error(α, S_Test)
}
Error ← Errorc / K;
Output(Error)
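The procedure on this slide can be sketched in plain Python. Everything named here (`stratified_folds`, `cross_validated_error`, the majority-class toy learner, and the toy data) is an assumption made for illustration; `learn` and `error` play the roles of Learn and Error in the pseudocode.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign example indices to k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for idx in indices:
            folds[position % k].append(idx)  # deal examples round-robin
            position += 1
    return folds

def cross_validated_error(xs, ys, k, learn, error):
    """Average the test error over k train/test splits, as in the pseudocode."""
    total = 0.0
    for test_idx in stratified_folds(ys, k):
        held_out = set(test_idx)
        train = [(xs[i], ys[i]) for i in range(len(xs)) if i not in held_out]
        test = [(xs[i], ys[i]) for i in test_idx]
        model = learn(train)
        total += error(model, test)
    return total / k

# Toy learner: always predict the majority class seen in training.
def learn_majority(train):
    return Counter(y for _, y in train).most_common(1)[0][0]

def error_rate(model, test):
    return sum(1 for _, y in test if y != model) / len(test)

ys = [1] * 30 + [0] * 70          # toy labels: 30% class 1, 70% class 0
xs = list(range(len(ys)))         # placeholder features
cv_error = cross_validated_error(xs, ys, k=10, learn=learn_majority, error=error_rate)
print(f"10-fold cross-validated error: {cv_error:.2f}")
```

Because the folds are stratified, every fold contains the same 30/70 class mix, so the majority-class learner has the same error on each fold.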

Estimating classifier performance

Recommended procedure
•  Use K-fold cross-validation (K = 5 or 10) to obtain performance estimates (accuracy, precision, recall, points on ROC curve, etc.) and 95% confidence intervals around the mean
•  Compute mean values of performance estimates and standard deviations of performance estimates
•  Report mean values of performance estimates and their standard deviations or 95% confidence intervals around the mean
•  Be skeptical: repeat experiments several times with different random splits of data into K folds!

Leave-one-out cross validation

•  K-fold cross validation where K = number of samples
•  aka "jackknifing"
•  pros/cons?
•  when would we use this?

Leave-one-out cross-validation

•  K-fold cross validation with K = n, where n is the total number of samples available
•  n experiments, each using n-1 samples for training and the remaining sample for testing
•  Leave-one-out cross-validation does not guarantee the same class distribution in training and test data!

Extreme case: 50% class 1, 50% class 2
Predict majority class label in the training data
True error: 50%; leave-one-out error estimate: 100%!
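The extreme case is easy to reproduce. In the sketch below (data set size and names chosen for illustration), removing one example always makes the other class the majority in the training data, so the majority-class predictor is wrong on every held-out example.

```python
from collections import Counter

labels = [1] * 10 + [2] * 10   # 50% class 1, 50% class 2

errors = 0
for i, held_out in enumerate(labels):
    train = labels[:i] + labels[i + 1:]              # leave example i out
    majority = Counter(train).most_common(1)[0][0]   # 10 vs 9, so never a tie
    if majority != held_out:
        errors += 1

loo_error = errors / len(labels)
print(f"leave-one-out error estimate: {loo_error:.0%}")
```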

Leave-one-out cross validation

•  Can be very expensive if training is slow and/or if there are a large number of examples
•  Useful in domains with limited training data: maximizes the data we can use for training
•  Some classifiers permit the estimation of the leave-one-out performance measure without actually having to train K models

Comparing systems: sample 1

split      model 1   model 2
1          87        88
2          85        84
3          83        84
4          80        79
5          88        89
6          85        85
7          83        81
8          87        86
9          88        89
10         84        85
average    85        85

Is model 2 better than model 1?

Comparing systems: sample 2

split      model 1   model 2
1          87        87
2          92        88
3          74        79
4          75        86
5          82        84
6          79        87
7          83        81
8          83        92
9          88        81
10         77        85
average    82        85

Is model 2 better than model 1?

Comparing systems: sample 3

split      model 1   model 2
1          84        87
2          83        86
3          78        82
4          80        86
5          82        84
6          79        87
7          83        84
8          83        86
9          85        83
10         83        85
average    82        85

Is model 2 better than model 1?

Comparing systems

Sample 3 and sample 2 have the same averages (model 1: 82, model 2: 85).

What's the difference?

Comparing systems

Per-split scores are those of sample 3 and sample 2.

             sample 3                sample 2
             model 1   model 2       model 1   model 2
average      82        85            82        85
stddev       2.3       1.7           5.9       3.9
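The averages and standard deviations can be recomputed from the per-split scores of samples 2 and 3 with Python's `statistics` module. The sketch assumes the slides report the sample standard deviation (dividing by n-1), which is what `statistics.stdev` computes and what reproduces the values shown; the dictionary layout is chosen for illustration.

```python
from statistics import mean, stdev

samples = {
    "sample 3": {
        "model 1": [84, 83, 78, 80, 82, 79, 83, 83, 85, 83],
        "model 2": [87, 86, 82, 86, 84, 87, 84, 86, 83, 85],
    },
    "sample 2": {
        "model 1": [87, 92, 74, 75, 82, 79, 83, 83, 88, 77],
        "model 2": [87, 88, 79, 86, 84, 87, 81, 92, 81, 85],
    },
}

for sample_name, models in samples.items():
    for model_name, scores in models.items():
        print(f"{sample_name} {model_name}: "
              f"average={mean(scores)} stddev={stdev(scores):.1f}")
```

The two samples have identical averages; only the spread of the per-split scores distinguishes them.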

Comparing systems: sample 4

split      model 1   model 2   model 2 - model 1
1          80        82        2
2          84        87        3
3          89        90        1
4          78        82        4
5          90        91        1
6          81        83        2
7          80        80        0
8          88        89        1
9          76        77        1
10         86        88        2
average    83        85
stddev     4.9       4.7

Is model 2 better than model 1?

Model 2 is never worse than model 1, and it is better on 9 of the 10 splits

How do we decide if model 2 is better than model 1?

Statistical tests

Setup:
–  Assume some default hypothesis about the data that you'd like to disprove, called the null hypothesis
–  e.g., model 1 and model 2 are not statistically different in performance

Test:
–  Calculate a test statistic from the data (often assuming something about the data)
–  Based on this statistic, with some probability we can reject the null hypothesis, that is, show that it does not hold

t-test

Determines whether two samples come from the same underlying distribution or not

Null hypothesis: model 1 and model 2 accuracies are no different, i.e., come from the same distribution

Result: a p-value, the probability of seeing a difference in accuracies at least this large by random chance alone if the null hypothesis were true (low values are better)

Calculating t-test

For our setup, we'll do what's called a "paired t-test"
–  The values can be thought of as pairs, where they were calculated under the same conditions
–  In our case, the same train/test split
–  Gives more power than the unpaired t-test (we have more information)

For almost all experiments, we'll do a "two-tailed" version of the t-test
http://en.wikipedia.org/wiki/Student's_t-test
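A minimal sketch of the paired t statistic, using the per-split scores of sample 3 from earlier in the deck. The function name is hypothetical. To stay within the standard library the sketch compares the statistic with the two-tailed 5% critical value for 9 degrees of freedom (2.262) instead of computing a p-value; a statistics package would normally be used to obtain the p-value itself.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-split differences b - a (degrees of freedom = splits - 1)."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error

# Sample 3: both models evaluated on the same 10 train/test splits.
model1 = [84, 83, 78, 80, 82, 79, 83, 83, 85, 83]
model2 = [87, 86, 82, 86, 84, 87, 84, 86, 83, 85]

T_CRITICAL = 2.262  # two-tailed, alpha = 0.05, 9 degrees of freedom

t = paired_t_statistic(model1, model2)
print(f"t = {t:.2f}; reject null hypothesis at 0.05: {abs(t) > T_CRITICAL}")
```

Pairing matters: the test works on the ten per-split differences, so variation that affects both models equally on a split cancels out.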

p-value

•  The result of a statistical test is often a p-value
•  p-value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis holds. If we reject the null hypothesis only when p falls below a chosen threshold, that threshold bounds the probability of rejecting the null hypothesis incorrectly (i.e., the probability we'd be wrong) when we re-run the experiment multiple times (say on different data)
•  Common thresholds for calling a result "significant": 0.05 (95% confident), 0.01 (99% confident) and 0.001 (99.9% confident)

Comparing systems: p-values

Is model 2 better than model 1? Paired t-test on the per-split scores of each sample, under the null hypothesis that the two models are the same:

sample     average model 1   average model 2   p
1          85                85                1
2          82                85                0.15
3          82                85                0.007
4          83                85                0.001

Statistical tests on test data

[Figure: labeled data (data with labels) is split into all training data and test data; all training data is further split into training data and development data, where we use cross-validation with a t-test.]

Can we do that here (on the test data)?

Bootstrap resampling

test set t with n samples
do m times:
-  sample n examples with replacement from the test set to create a new test set t'
-  evaluate model(s) on t'

calculate t-test (or other statistical test) on the collection of m results
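The resampling loop can be sketched with `random.choices`, which draws with replacement. The test set below is made up (pairs of true and predicted labels for one model), and `bootstrap_scores` and `accuracy` are hypothetical helper names.

```python
import random

def bootstrap_scores(test_set, evaluate, m, seed=0):
    """Evaluate on m resampled test sets, each drawn with replacement from test_set."""
    rng = random.Random(seed)
    n = len(test_set)
    scores = []
    for _ in range(m):
        resample = rng.choices(test_set, k=n)   # n draws with replacement
        scores.append(evaluate(resample))
    return scores

def accuracy(pairs):
    return sum(1 for y, y_hat in pairs if y == y_hat) / len(pairs)

# Hypothetical test set: (true label, predicted label) pairs, 85% of them correct.
test_set = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 0)] * 45 + [(0, 1)] * 5

scores = bootstrap_scores(test_set, accuracy, m=200)
print(f"{len(scores)} bootstrap accuracies, "
      f"min={min(scores):.2f} max={max(scores):.2f}")
```

To compare two models with a paired test, both models should be scored on the same resampled test sets, for example by storing both models' predictions in each test-set entry.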

Bootstrap resampling

[Figure sequence: the test data is sampled with replacement to create Test'1, Test'2, ..., Test'm. Model A is evaluated on each resampled test set, giving A score 1, A score 2, ..., A score m; model B is evaluated on the same sets, giving B score 1, B score 2, ..., B score m. The two collections of scores are compared with a paired t-test (or other analysis).]

Experimentation good practices

Never look at your test data!

During development
–  Compare different models/hyperparameters on development data
–  Use cross-validation to get more consistent results
–  If you want to be confident with results, use a t-test and look for p ≤ 0.05 (or even lower)

For final evaluation, use bootstrap resampling combined with a t-test to compare models

Estimating the performance of a classifier

The true error of a hypothesis h with respect to a target function f and an instance distribution D is

$$Error_D(h) \equiv \Pr_{x \in D}\big[f(x) \neq h(x)\big]$$

The sample error of a binary classifier h with respect to a target function f and a sample S of instances drawn according to D is

$$Error_S(h) \equiv \frac{1}{|S|} \sum_{x \in S} \delta\big(f(x), h(x)\big)$$

where $\delta(a, b) = 1$ iff $a \neq b$, and $\delta(a, b) = 0$ otherwise

Estimating classifier performance

Domain X = {a, b, c, d}

x        a      b      c      d
D(x)     1/8    1/2    1/8    1/4
h(x)     0      1      1      0
f(x)     1      1      0      0

$$error_D(h) = \Pr_D\big[h(x) \neq f(x)\big] = D(a) + D(c) = \frac{1}{8} + \frac{1}{8} = \frac{1}{4}$$

Evaluating the performance of a classifier

•  Sample error estimated from training data is an optimistic estimate

$$Bias = E\big[Error_S(h)\big] - Error_D(h)$$

•  For an unbiased estimate, h must be evaluated on an independent sample S (which is not the case if S is the training set!)
•  Even when the estimate is unbiased, it can vary across samples!
•  If h misclassifies 8 out of 100 samples

$$Error_S(h) = \frac{8}{100} = 0.08$$

How close is the sample error to the true error?

How close is the estimated error to the true error?

•  Choose a sample S of size n according to distribution D
•  Measure $Error_S(h)$

$Error_S(h)$ is a random variable (outcome of a random experiment)

Given $Error_S(h)$, what can we conclude about $Error_D(h)$?

More generally, given the estimated performance of a hypothesis, what can we say about its actual performance?

Evaluating performance when we can afford to test on a large independent test set

The true error of a hypothesis h with respect to a target function f and an instance distribution D is

$$Error_D(h) \equiv \Pr_{x \in D}\big[f(x) \neq h(x)\big]$$

The sample error of a classifier h with respect to a target function f and a sample S of instances drawn according to D is

$$Error_S(h) \equiv \frac{1}{|S|} \sum_{x \in S} \delta\big(f(x), h(x)\big)$$

where $\delta(a, b) = 1$ iff $a \neq b$, and $\delta(a, b) = 0$ otherwise

How close is estimated accuracy to its true value?

Question: How close is p (the true probability) to $\hat{p}$ (the observed proportion)?

This problem is an instance of a well-studied problem in statistics
•  The problem of estimating the proportion of a population that exhibits some property, given the observed proportion over a random sample of the population
•  In our case, the property of interest is that h correctly (or incorrectly) classifies a sample
•  Testing h on a single random sample x drawn according to D amounts to performing a random experiment which succeeds if h correctly classifies x and fails otherwise

How close is estimated accuracy to its true value?

The output of a classifier whose true accuracy is p can be viewed as a binary random variable which corresponds to the outcome of a Bernoulli trial with a success rate p (the probability of correct prediction)

The number of successes r observed in n trials is a random variable Y which follows the Binomial distribution

$$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$$

Error_S(h) is a Random Variable

Probability of observing r misclassified examples in a sample of size n:

$$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$$

Recall basic statistics

Consider a random experiment with discrete valued outcomes $y_1, y_2, \ldots, y_M$

The expected value of the corresponding random variable Y is

$$E(Y) \equiv \sum_{i=1}^{M} y_i \Pr(Y = y_i)$$

The variance of Y is

$$Var(Y) \equiv E\big[(Y - E[Y])^2\big]$$

The standard deviation of Y is

$$\sigma_Y \equiv \sqrt{Var(Y)}$$

How close is estimated accuracy to its true value?

The mean of a Bernoulli trial with success rate p is p
Its variance is p(1-p)

If N trials are taken from the same Bernoulli process, the observed success rate $\hat{p}$ has the same mean p and variance $\frac{p(1-p)}{N}$

For large N, the distribution of $\hat{p}$ follows a Gaussian distribution

Binomial Probability Distribution

$$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$$

Probability P(r) of r heads in n coin flips, if p = Pr(heads)

•  Expected, or mean value of X, E[X], is

$$E[X] \equiv \sum_{i=0}^{n} i\,P(i) = np$$

•  Variance of X is

$$Var(X) \equiv E\big[(X - E[X])^2\big] = np(1-p)$$

•  Standard deviation of X, $\sigma_X$, is

$$\sigma_X \equiv \sqrt{E\big[(X - E[X])^2\big]} = \sqrt{np(1-p)}$$
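The closed forms for the mean and variance can be checked numerically by summing over the distribution. The values n = 100 and p = 0.08 are chosen here only for illustration, and `binomial_pmf` is a hypothetical helper name.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials with success rate p."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 100, 0.08   # illustrative values
probs = [binomial_pmf(r, n, p) for r in range(n + 1)]

total = sum(probs)
expected = sum(r * pr for r, pr in enumerate(probs))
variance = sum((r - expected) ** 2 * pr for r, pr in enumerate(probs))

print(f"sum={total:.6f} mean={expected:.4f} (np={n * p:.4f}) "
      f"variance={variance:.4f} (np(1-p)={n * p * (1 - p):.4f})")
```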

Estimators, Bias, Variance, Confidence Interval

$$Error_S(h) = \frac{r}{n}, \qquad Error_D(h) = p$$

$$\sigma_{Error_S(h)} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{Error_D(h)\,(1 - Error_D(h))}{n}} \approx \sqrt{\frac{Error_S(h)\,(1 - Error_S(h))}{n}}$$

An N% confidence interval for some parameter p is the interval which is expected with probability N% to contain p

Normal distribution approximates binomial

Error_S(h) follows a Binomial distribution, with

•  mean $\mu_{Error_S(h)} = Error_D(h)$
•  standard deviation $\sigma_{Error_S(h)} = \sqrt{\frac{Error_D(h)\,(1 - Error_D(h))}{n}}$

We can approximate this by a Normal distribution with the same mean and variance when np(1-p) ≥ 5

Normal distribution

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Expected, or mean value of X is given by E[X] = μ
Variance of X is given by Var(X) = σ²
Standard deviation of X is given by σ_X = σ

The probability that X will fall in the interval (a, b) is given by $\int_a^b p(x)\,dx$

How close is the estimated accuracy to its true value?

Let c be the probability that a Gaussian random variable X, with zero mean and unit variance, takes a value between -z and z: Pr[-z ≤ X ≤ z] = c

Pr[X ≥ z]    z
0.001        3.09
0.005        2.58
0.01         2.33
0.05         1.65
0.10         1.28

How close is the estimated accuracy to its true value?

But $\hat{p}$ does not have zero mean and unit variance, so we normalize to get

$$\Pr\left[-z < \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < z\right] = c$$

How close is the estimated accuracy to its true value?

To find confidence limits: given a particular confidence figure c, use the table to find the z corresponding to the probability ½(1-c). Use linear interpolation for values not in the table.

$$p = \frac{\hat{p} + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{\hat{p}}{n} - \dfrac{\hat{p}^2}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$
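The confidence limits can be evaluated directly from this formula. The sketch below uses an observed proportion of 0.08 on n = 100 samples (the 8-out-of-100 figure from earlier in the deck) and z = 1.65, which the table pairs with Pr[X ≥ z] = 0.05, i.e., a confidence figure of c = 0.90. The function name is hypothetical.

```python
import math

def confidence_limits(p_hat, n, z):
    """Lower and upper limits for the true proportion p, per the formula on this slide."""
    center = p_hat + z * z / (2 * n)
    half_width = z * math.sqrt(p_hat / n - p_hat**2 / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (center - half_width) / denominator, (center + half_width) / denominator

# c = 0.90 -> (1 - c) / 2 = 0.05 -> z = 1.65 from the table.
low, high = confidence_limits(p_hat=0.08, n=100, z=1.65)
print(f"90% confidence limits: [{low:.3f}, {high:.3f}]")

# With more data the interval tightens around the observed proportion.
low_big, high_big = confidence_limits(p_hat=0.08, n=10_000, z=1.65)
print(f"with n=10000: [{low_big:.3f}, {high_big:.3f}]")
```

The interval is not symmetric around the observed proportion, which is expected for proportions near 0 or 1.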
