Page 1

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 2

Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: training data are labeled, indicating classes
  - New instances are classified based on the training set
- Unsupervised learning (clustering)
  - Class labels are unknown
  - Given a set of objects, establish the existence of classes or clusters in the data

Page 3

Prediction: Classification vs. Numeric Prediction

- Classification
  - predicts categorical class labels
- Numeric prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis
  - Fraud detection
  - Web page categorization

Page 4

Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as indicated by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown instances
  - Estimate the accuracy of the model
    - Use a testing set independent of the training set; compare predicted class labels with true class labels
    - Compute accuracy (percentage of correctly classified instances)
  - If the accuracy is acceptable, use the model to classify new data

Page 5

Process 1: Model Construction

Training Data:

  NAME | RANK           | YEARS | TENURED
  Mike | Assistant Prof | 3     | no
  Mary | Assistant Prof | 7     | yes
  Bill | Professor      | 2     | yes
  Jim  | Associate Prof | 7     | yes
  Dave | Assistant Prof | 6     | no
  Anne | Associate Prof | 3     | no

Classification algorithms produce the Classifier (Model), e.g.:

  IF rank = 'professor' OR years > 6
  THEN tenured = 'yes'

Page 6

Process 2: Using the Model in Prediction

Testing Data (fed to the Classifier):

  NAME    | RANK           | YEARS | TENURED
  Tom     | Assistant Prof | 2     | no
  Merlisa | Associate Prof | 7     | no
  George  | Professor      | 5     | yes
  Joseph  | Assistant Prof | 7     | yes

Unseen Data: (Jeff, Professor, 4) -> Tenured?

Page 7

Issues: Data Preparation

- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data

Page 8

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 9

Decision Tree Induction: An Example

- Training dataset: Buys_computer

  age   | income | student | credit_rating | buys_computer
  <=30  | high   | no      | fair          | no
  <=30  | high   | no      | excellent     | no
  31…40 | high   | no      | fair          | yes
  >40   | medium | no      | fair          | yes
  >40   | low    | yes     | fair          | yes
  >40   | low    | yes     | excellent     | no
  31…40 | low    | yes     | excellent     | yes
  <=30  | medium | no      | fair          | no
  <=30  | low    | yes     | fair          | yes
  >40   | medium | yes     | fair          | yes
  <=30  | medium | yes     | excellent     | yes
  31…40 | medium | no      | excellent     | yes
  31…40 | high   | yes     | fair          | yes
  >40   | medium | no      | excellent     | no

- Resulting tree:

  age?
    <=30:   student?
              no:  buys_computer = no
              yes: buys_computer = yes
    31..40: buys_computer = yes
    >40:    credit rating?
              excellent: buys_computer = no
              fair:      buys_computer = yes

Page 10

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - Tree is constructed in a top-down (from general to specific) recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, discretize in advance)
  - Examples are partitioned recursively based on selected attributes
  - Attributes are selected based on a heuristic or statistical measure (e.g., information gain)
- When to stop
  - All examples for a given node belong to the same class (pure), or
  - No remaining attributes to select from, or
    - majority voting determines the class label for the node
  - No examples left

Page 11

Random Tree Induction

Let a be the number of attributes, and let v be the maximum number of values any attribute can take.

- Upper bound on the number of trees?
- Lower bound on the number of trees?
- Random tree induction
  - Randomly choose an attribute for the split
  - Same stopping criteria
- The design of decision trees has been largely influenced by the preference for simplicity.

Page 12

Occam's Razor

- Occam's Razor: rule of parsimony, principle of economy
  - Plurality should not be assumed without necessity
  - Meaning: one should not increase, beyond what is necessary, the number of entities required to explain anything
- Argument: the simplicity of nature and the rarity of simple theories can be used to justify Occam's Razor.
  - First, nature exhibits regularity, and natural phenomena are more often simple than complex. At least, the phenomena humans choose to study tend to have simple explanations.
  - Second, there are far fewer simple hypotheses than complex ones, so there is only a small chance that any simple hypothesis that is wildly incorrect will be consistent with all observations.
- Occam's two razors: the sharp and the blunt (KDD'98), Pedro Domingos

(Portrait: William of Ockham, 1288-1348)

Page 13

Attribute Selection Measure: Information Gain (ID3/C4.5)

- How to obtain the smallest (shortest) tree?
  - Careful design of attribute selection
  - Quinlan pioneered the use of entropy in his ID3 algorithm
- Entropy: in information theory, also called expected information, a measure of uncertainty
  - Intuition: chaos, molecular disorder, temperature, thermodynamic systems, the universe
  - High entropy = high disorder

Page 14

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
- Expected information (entropy) needed to classify a tuple in D:

  $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

  - entropy: a measure of uncertainty; larger entropy -> larger uncertainty
- Information needed to classify D (aggregated entropy after using A to split D into v partitions):

  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

- Information gained (entropy dropped) by branching on attribute A:

  $Gain(A) = Info(D) - Info_A(D)$
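As a concrete reference, here is a minimal Python sketch of these three quantities (my own illustration; the function names are not from the slides):

  import math

  def info(counts):
      """Entropy Info(D) of a node, given per-class counts, e.g. [9, 5]."""
      total = sum(counts)
      return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

  def info_a(partitions):
      """Info_A(D): weighted entropy after splitting D into partitions,
      each partition given as its own per-class count list."""
      total = sum(sum(p) for p in partitions)
      return sum((sum(p) / total) * info(p) for p in partitions)

  def gain(counts, partitions):
      """Gain(A) = Info(D) - Info_A(D)."""
      return info(counts) - info_a(partitions)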

Page 15

Attribute Selection: Information Gain

- Class P: buys_computer = "yes"; Class N: buys_computer = "no"

  $Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

- Consider the attribute age:

  age    | p_i | n_i | I(p_i, n_i)
  <=30   |  2  |  3  | 0.971
  31…40  |  4  |  0  | 0
  >40    |  3  |  2  | 0.971

  $Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

  Here $\frac{5}{14} I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

  $Gain(age) = Info(D) - Info_{age}(D) = 0.246$

  Similarly,

  $Gain(income) = 0.029$
  $Gain(student) = 0.151$
  $Gain(credit\_rating) = 0.048$

(Training data: the buys_computer table on Page 9.)

Page 16

Computing Information-Gain for Continuous-Valued Attributes

- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - $(a_i + a_{i+1})/2$ is the midpoint between the values $a_i$ and $a_{i+1}$
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split:
  - D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
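A short sketch of the midpoint enumeration (my own illustration, not from the slides). Each candidate point would then be scored with Info_A as defined earlier, using the resulting D1/D2 class counts:

  def candidate_split_points(values):
      """Midpoints between adjacent sorted values of a continuous attribute A."""
      v = sorted(set(values))
      return [(a + b) / 2 for a, b in zip(v, v[1:])]

  # e.g. ages 21, 25, 30 yield candidate split points 23.0 and 27.5
  print(candidate_split_points([25, 21, 30, 25]))   # [23.0, 27.5]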

Page 17

Gain Ratio for Attribute Selection (C4.5)

- Information gain is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):

  $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$

- GainRatio(A) = Gain(A) / SplitInfo_A(D)
  - e.g., gain_ratio(income) = 0.029 / 1.557 = 0.019
- The attribute with the largest gain ratio is selected
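A sketch of SplitInfo and GainRatio on the income example (my own code; income splits D into partitions of sizes 4, 6, and 4):

  import math

  def split_info(sizes):
      """SplitInfo_A(D) from the partition sizes |D_j|."""
      total = sum(sizes)
      return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

  si = split_info([4, 6, 4])     # income: high 4, medium 6, low 4 tuples
  print(round(si, 3))            # 1.557
  print(round(0.029 / si, 3))    # 0.019 = gain_ratio(income), with Gain(income) = 0.029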

Page 18

Gini Index (CART, IBM IntelligentMiner)

- If a data set D contains examples from n classes, the gini index, gini(D), is defined as

  $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$

  where $p_j$ is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index $gini_A(D)$ is defined as

  $gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$

- Reduction in impurity:

  $\Delta gini(A) = gini(D) - gini_A(D)$

- The attribute that provides the smallest $gini_A(D)$ (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Page 19

Computation of Gini Index

- Ex.: D has 9 tuples in buys_computer = "yes" and 5 in "no":

  $gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$

- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

  $gini_{income \in \{low,medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$

  Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since it has the lowest gini index
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
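A minimal gini sketch reproducing these numbers (my own code; the D1 counts of 7 yes / 3 no are read off the training table):

  def gini(counts):
      """gini(D) = 1 - sum(p_j^2), from per-class counts."""
      total = sum(counts)
      return 1 - sum((c / total) ** 2 for c in counts)

  def gini_split(partitions):
      """gini_A(D): weighted gini over the partitions (per-class count lists)."""
      total = sum(sum(p) for p in partitions)
      return sum((sum(p) / total) * gini(p) for p in partitions)

  print(round(gini([9, 5]), 3))                    # 0.459
  # D1 = {low, medium}: 7 yes, 3 no; D2 = {high}: 2 yes, 2 no
  print(round(gini_split([[7, 3], [2, 2]]), 3))    # 0.443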

Page 20

Comparing Attribute Selection Measures

- The three measures, in general, return good results, but:
  - Information gain:
    - biased towards multivalued attributes
  - Gain ratio:
    - tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index:
    - biased towards multivalued attributes
    - has difficulty when the number of classes is large
    - tends to favor tests that result in equal-sized partitions and purity in both partitions

Page 21

Other Attribute Selection Measures

- CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
- C-SEP: performs better than information gain and gini index in certain cases
- G-statistic: has a close approximation to the χ2 distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  - The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
- Multivariate splits (partition based on multiple variable combinations)
  - CART: finds multivariate splits based on a linear combination of attributes
- Which attribute selection measure is the best?
  - Most give good results; none is significantly superior to the others

Page 22

Overfitting and Tree Pruning

- Overfitting: an induced tree may overfit the training data
  - Too many branches; some may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- (Figure: error vs. tree size; blue curve = training error, red curve = generalization error)
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data (validation set) different from the training data to decide which is the "best pruned tree"

Page 23

Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Page 24

Classification in Large Databases

- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why is decision tree induction popular?
  - relatively faster learning speed (than other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods
- RainForest (VLDB'98: Gehrke, Ramakrishnan & Ganti)
  - Builds an AVC-list (attribute, value, class label)

Page 25

Scalability Framework for RainForest

- Separates the scalability aspects from the criteria that determine the quality of the tree
- Builds an AVC-list: AVC (Attribute, Value, Class_label)
- AVC-set (of an attribute X)
  - Projection of the training dataset onto the attribute X and the class label, where counts of the individual class labels are aggregated
- AVC-group (of a node n)
  - Set of AVC-sets of all predictor attributes at the node n

Page 26

Rainforest: Training Set and Its AVC Sets

Training examples: the buys_computer table (Page 9).

AVC-set on Age:

  Age    | Buy_Computer = yes | no
  <=30   | 2 | 3
  31..40 | 4 | 0
  >40    | 3 | 2

AVC-set on income:

  income | Buy_Computer = yes | no
  high   | 2 | 2
  medium | 4 | 2
  low    | 3 | 1

AVC-set on Student:

  student | Buy_Computer = yes | no
  yes     | 6 | 1
  no      | 3 | 4

AVC-set on credit_rating:

  Credit rating | Buy_Computer = yes | no
  fair          | 6 | 2
  excellent     | 3 | 3

Page 27

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

- Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
- Each subset is used to create a tree, resulting in several trees
- These trees are examined and used to construct a new tree T'
- It turns out that T' is very close to the tree that would be generated using the whole data set together
- Adv: requires only two scans of the DB; an incremental algorithm

Page 28

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 29

Model Evaluation Metrics: Confusion Matrix

- Given m classes, an entry CM_{i,j} in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
- May have extra rows/columns to provide totals

Confusion Matrix:

  Actual class \ Predicted class | C1                   | ¬C1
  C1                             | True Positives (TP)  | False Negatives (FN)
  ¬C1                            | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix:

  Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
  buy_computer = yes             | 6954               | 46                | 7000
  buy_computer = no              | 412                | 2588              | 3000
  Total                          | 7366               | 2634              | 10000

Page 30

Model Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

- Accuracy, or recognition rate: percentage of test-set tuples that are correctly classified
  Accuracy = (TP + TN) / All
- Error rate: 1 - accuracy, or Error rate = (FP + FN) / All
- Class Imbalance Problem:
  - One class may be rare, e.g., fraud or HIV-positive
  - Significant majority of the negative class and minority of the positive class
- Sensitivity: True Positive recognition rate (recall for +)
  Sensitivity = TP / P
- Specificity: True Negative recognition rate (recall for -)
  Specificity = TN / N

  A \ P | C  | ¬C |
  C     | TP | FN | P
  ¬C    | FP | TN | N
  Total | P' | N' | All
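These four quantities in code (a sketch, using the buy_computer confusion matrix from the previous page):

  TP, FN, FP, TN = 6954, 46, 412, 2588
  ALL, P, N = TP + FN + FP + TN, TP + FN, FP + TN

  accuracy    = (TP + TN) / ALL   # 0.9542
  error_rate  = (FP + FN) / ALL   # 0.0458
  sensitivity = TP / P            # 0.9934 (recall for +)
  specificity = TN / N            # 0.8627 (recall for -)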

Page 31

Model Evaluation Metrics: Precision and Recall, and F-measures

- Precision (exactness): what % of tuples that the classifier (model) labeled as positive are actually positive?

  Precision = TP / (TP + FP)

- Recall (completeness): what % of positive tuples did the classifier (model) label as positive?

  Recall = TP / (TP + FN)

- Perfect score is 1.0
- Inverse relationship between precision and recall
- F measure (F1 or F-score): harmonic mean of precision and recall,

  $F_1 = \frac{2 \times precision \times recall}{precision + recall}$

- $F_\beta$: weighted measure of precision and recall,

  $F_\beta = \frac{(1+\beta^2) \times precision \times recall}{\beta^2 \times precision + recall}$

  - assigns β times as much weight to recall as to precision

Page 32

Model Evaluation Metrics: Example

  Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
  cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
  cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
  Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)

- Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%
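Reproducing these numbers, plus the F1 score defined on the previous page (a sketch):

  TP, FN, FP, TN = 90, 210, 140, 9560

  precision = TP / (TP + FP)                           # 0.3913
  recall    = TP / (TP + FN)                           # 0.3000
  f1 = 2 * precision * recall / (precision + recall)   # 0.3396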

Page 33

Evaluating Learning Algorithms: Holdout & Cross-Validation Methods

- Holdout method
  - Given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Testing set (e.g., 1/3) for accuracy (or another metric) estimation
  - Random sampling: a variation of holdout
    - Repeat holdout k times; accuracy = avg. of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most common)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the testing set and the others as the training set
  - Leave-one-out: k folds where k = # of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
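A plain-Python sketch of k-fold cross-validation (illustrative only; `train` and `evaluate` are placeholder functions you would supply):

  import random

  def k_fold_cv(data, k, train, evaluate, seed=0):
      """Randomly partition data into k folds; at iteration i, fold i is the
      testing set and the remaining folds form the training set."""
      idx = list(range(len(data)))
      random.Random(seed).shuffle(idx)
      folds = [idx[i::k] for i in range(k)]   # k mutually exclusive subsets
      scores = []
      for i in range(k):
          test_set  = [data[j] for j in folds[i]]
          train_set = [data[j] for f in folds[:i] + folds[i+1:] for j in f]
          model = train(train_set)
          scores.append(evaluate(model, test_set))
      return sum(scores) / k                  # average of the k scores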

Page 34

Evaluating Classifier Accuracy: Bootstrap

- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement
    - i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
- There are several bootstrap methods; a common one is the .632 bootstrap
  - A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368)
  - Repeat the sampling procedure k times; the overall accuracy of the model is

    $Acc(M) = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set}\right)$
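A sketch of one .632 bootstrap round (illustrative; `train` and `accuracy` are placeholder functions):

  import random

  def bootstrap_632_round(data, train, accuracy, seed=0):
      """Sample d tuples with replacement; the unsampled tuples form the test set."""
      rng = random.Random(seed)
      d = len(data)
      picked = [rng.randrange(d) for _ in range(d)]
      in_bag = set(picked)
      train_set = [data[i] for i in picked]                      # ~63.2% distinct tuples
      test_set  = [data[i] for i in range(d) if i not in in_bag]
      model = train(train_set)
      # one term of the overall-accuracy average
      return 0.632 * accuracy(model, test_set) + 0.368 * accuracy(model, train_set)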

Page 35

Model Selection: ROC Curves

- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
- Originated from signal detection theory
- Shows the trade-off between the true positive rate and the false positive rate
  - Vertical axis represents the true positive rate
  - Horizontal axis represents the false positive rate
  - The plot also shows a diagonal line
- The area under the ROC curve is a measure of the accuracy of the model
  - A model with perfect accuracy will have an area of 1.0
  - The closer the curve to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
- Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
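A sketch of how the ROC points come from the ranked test tuples (my own code; `scored` pairs each tuple's positive-class score with its true label):

  def roc_points(scored):
      """scored: list of (score, true_label) with true_label in {0, 1}.
      Rank tuples by decreasing score and sweep the decision threshold."""
      P = sum(y for _, y in scored)
      N = len(scored) - P
      tp = fp = 0
      points = [(0.0, 0.0)]
      for _, y in sorted(scored, key=lambda t: -t[0]):
          if y == 1:
              tp += 1
          else:
              fp += 1
          points.append((fp / N, tp / P))   # (false positive rate, true positive rate)
      return points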

Page 36

Model Selection Issues

- Accuracy
  - classifier accuracy: predicting the class label
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Model (e.g., decision tree) size or compactness

Page 37

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 38

Using IF-THEN Rules for Classification

- Represent knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  - Rule antecedent/precondition vs. rule consequent
- Assessment of a rule: coverage and accuracy
  - n_covers = # of tuples covered by R
  - n_correct = # of tuples correctly classified by R
  coverage(R) = n_covers / |D|
  accuracy(R) = n_correct / n_covers
- If more than one rule is triggered, need conflict resolution
  - Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  - Class-based ordering: decreasing order of prevalence or misclassification cost per class
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts

Page 39

Rule Extraction from Decision Tree

(Decision tree for buys_computer, as on Page 9:)

  age?
    <=30:   student?
              no:  buys_computer = no
              yes: buys_computer = yes
    31..40: buys_computer = yes
    >40:    credit rating?
              excellent: buys_computer = no
              fair:      buys_computer = yes

- A root-to-leaf path corresponds to a rule
- Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
- Rules are exhaustive and mutually exclusive
- Example: rule extraction from our buys_computer decision tree
  IF age = young AND student = no                THEN buys_computer = no
  IF age = young AND student = yes               THEN buys_computer = yes
  IF age = mid-age                               THEN buys_computer = yes
  IF age = old AND credit_rating = excellent     THEN buys_computer = no
  IF age = old AND credit_rating = fair          THEN buys_computer = yes

Page 40

Rule Induction: Sequential Covering Method

- Sequential covering: extracts rules directly from training data
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
- Rules are learned sequentially; each rule for a given class C_i will cover many tuples of C_i but none (or few) of the tuples of other classes
- Steps:
  - Rules are learned one at a time
  - Each time a rule is learned, the covered positive tuples are removed
  - Repeat until a termination condition is met, e.g., there are no more training examples, or the quality of a generated rule is below a user-specified threshold
- Unlike decision trees, which learn a set of rules simultaneously

Page 41

Sequential Covering Algorithm

  while (enough target tuples left)
      generate a rule
      remove positive target tuples satisfying this rule

(Figure: the set of positive examples, with regions showing the examples covered by Rule 1, Rule 2, and Rule 3)

Page 42

How to Learn One Rule?

- Start with the most general rule possible: condition = empty
- Add new attributes by adopting a greedy depth-first strategy
  - Pick the one that most improves the rule quality
- Rule-quality measures: consider both coverage and accuracy
  - FOIL gain (in FOIL & RIPPER): assesses the information gained by extending the condition:

    $FOIL\_Gain = pos' \times \left(\log_2\frac{pos'}{pos' + neg'} - \log_2\frac{pos}{pos + neg}\right)$

  - favors rules that have high accuracy and cover many positive tuples
- Rule pruning based on an independent set of test tuples:

    $FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$

  pos/neg are the # of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.
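The two FOIL measures in code (a direct transcription of the formulas above; argument names are mine):

  import math

  def foil_gain(pos, neg, pos2, neg2):
      """Gain from extending a rule: (pos, neg) are the cover counts before,
      (pos', neg') = (pos2, neg2) after adding the new test."""
      return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

  def foil_prune(pos, neg):
      """FOIL_Prune(R) = (pos - neg) / (pos + neg), on an independent prune set."""
      return (pos - neg) / (pos + neg)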

Page 43

Learn One Rule

- To generate a rule:

    while (true)
        find the best predicate p
        if foil-gain(p) > threshold
            then add p to the current rule
        else break

(Figure: positive and negative examples; the rule is grown one predicate at a time: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5)

Page 44

Trees and Rules

- Most tree learners: divide and conquer
- Most rule learners: separate and conquer, i.e., sequential covering (AQ, CN2, RIPPER, …)
- Some do conquering-without-separating (RISE, from Domingos, biased towards complex models): rules are learned simultaneously, instance-based
- Decision space, decision boundary
- Both are interpretable classifiers
- Other usage of rule learning: rule extraction, e.g., from ANNs

Page 45

Separate and Conquer vs. Set Cover

- Set covering problem (minimum set cover): one of the most studied combinatorial optimization problems
  - Given a finite ground set X and S1, S2, …, Sm as subsets of X, find I ⊆ {1, …, m} with ∪_{i ∈ I} S_i = X such that |I| is minimized.
  - Select as few subsets as possible from a given family such that each element in any subset of the family is covered
  - NP-hard
  - Greedy algorithm: iteratively pick the subset that covers the maximum number of uncovered elements
    - Achieves a 1 + ln n approximation ratio, which is optimal
- Greedy set cover vs. sequential covering
  - Select one subset (learn one rule) at a time
  - Consider uncovered elements (remove covered examples)
  - Iterate until all elements (examples) are covered
- Other related problems: graph coloring, minimum clique partition

Page 46

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 47

Bayesian Classification: Why?

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' Theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Page 48

Probability Model for Classifiers

- Let X = (x1, x2, …, xn) be a data sample ("evidence"): the class label is unknown
- The probability model for a classifier is to determine P(C|X), the probability that X belongs to class C given the observed data sample X
  - predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes

Page 49

Bayes’ Theorem

- P(C|X): posterior
- P(C): prior, the initial probability
  - E.g., one will buy a computer, regardless of age, income, …
- P(X): probability that the sample X is observed
- P(X|C): likelihood, the probability of observing the sample X given that the hypothesis holds
  - E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
- Bayes' Theorem:

  $P(C|\mathbf{X}) = \frac{P(\mathbf{X}|C)\,P(C)}{P(\mathbf{X})}$

- Informally, this can be written as: posterior = prior × likelihood / evidence

Page 50

Maximizing joint probability

- In practice, we are only interested in the numerator of

  $P(C|\mathbf{X}) = \frac{P(\mathbf{X}|C)\,P(C)}{P(\mathbf{X})}$

  since the denominator does not depend on C, and the same value is shared by all classes.
- The numerator is the joint probability:

  $P(\mathbf{X}|C)\,P(C) = P(C, \mathbf{X}) = P(C, X_1, X_2, \ldots, X_n)$

Page 51

Maximizing joint probability

Repeatedly apply the definition of conditional probability:

  $P(C, X_1, \ldots, X_n) = P(C)\,P(X_1, \ldots, X_n \mid C)$
  $\qquad = P(C)\,P(X_1 \mid C)\,P(X_2, \ldots, X_n \mid C, X_1)$
  $\qquad = P(C)\,P(X_1 \mid C)\,P(X_2 \mid C, X_1)\,P(X_3, \ldots, X_n \mid C, X_1, X_2)$
  $\qquad = P(C)\,P(X_1 \mid C)\,P(X_2 \mid C, X_1) \cdots P(X_n \mid C, X_1, \ldots, X_{n-1})$

Page 52

Naïve Bayes Classifier: Assuming Conditional Independence

Simplifying assumption: features are conditionally independent of each other given the class:

  $P(X_i \mid C, X_j) = P(X_i \mid C)$

Then:

  $P(C, X_1, \ldots, X_n) = P(C)\,P(X_1 \mid C)\,P(X_2 \mid C, X_1) \cdots P(X_n \mid C, X_1, \ldots, X_{n-1})$
  $\qquad = P(C)\,P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_n \mid C)$

This greatly reduces the computation cost: only the class distributions need to be counted.

Page 53

Naïve Bayes Classifier

- This greatly reduces the computation cost: only count the class distribution
- If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the # of tuples of C_i in D)
- If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

  $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

  and P(x_k|C_i) is

  $P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$

Page 54

Naïve Bayes Classifier: Training Dataset

Class:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data to be classified:
  X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)

Training data: the buys_computer table (Page 9).

Page 55

Naïve Bayes Classifier: Example

- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(C):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no")  = 5/14 = 0.357
- Compute P(X|C) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- P(X|C):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no")  = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- P(C, X) = P(X|C) × P(C):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no")  × P(buys_computer = "no")  = 0.007
- Therefore, X belongs to class "buys_computer = yes"

(Training data: the buys_computer table on Page 9.)
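A quick check of this computation (a sketch; the conditional probabilities are typed in from the slide):

  p_yes, p_no = 9/14, 5/14
  likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)   # ~0.044
  likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # ~0.019
  print(round(likelihood_yes * p_yes, 3))          # 0.028 -> predict "yes"
  print(round(likelihood_no * p_no, 3))            # 0.007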

Page 56

Avoiding the Zero-Probability Problem

- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  $P(\mathbf{X} \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$

- Suppose a training set has 1000 tuples for class buys_computer = yes, with income = low: 0, income = medium: 990, and income = high: 10
- Use the Laplacian correction (or Laplacian estimator)
  - Add 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  - The "corrected" probability estimates are close to their "uncorrected" counterparts
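The Laplacian correction as a one-liner (a sketch on the slide's 1000-tuple example):

  def laplace(count, total, n_values):
      """Add-1 estimate of P(value | class); n_values = # of distinct values."""
      return (count + 1) / (total + n_values)

  for c in (0, 990, 10):            # income = low / medium / high
      print(laplace(c, 1000, 3))    # 1/1003, 991/1003, 11/1003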

Page 57

Naïve Bayes Classifier: Comments

- Advantages
  - Easy to implement
  - Good results obtained in most of the cases
- Disadvantages
  - Assumption: class conditional independence, therefore loss of accuracy
  - Practically, dependencies exist among variables
    - E.g., hospital patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
    - Dependencies among these cannot be modeled by the Naïve Bayes Classifier
- How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)

Page 58

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 59

Bayesian Belief Networks

- A Bayesian belief network relaxes the conditional independence assumption of naïve Bayes
- A graphical model of causal relationships
  - Represents dependency among the variables
  - Gives a specification of the joint probability distribution

(Figure: a network with nodes X, Y, Z, P and edges X -> Z, Y -> Z, Y -> P)

- Nodes: random variables
- Links: dependency
- X and Y are the parents of Z, and Y is the parent of P
- No dependency between Z and P
- Has no loops or cycles

Page 60

Bayesian Belief Network: An Example

(Figure: network with Family History and Smoker as parents of LungCancer and Emphysema; LungCancer is the parent of PositiveXRay and Dyspnea)

The conditional probability table (CPT) for the variable LungCancer:

        | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
  LC    | 0.8     | 0.5      | 0.7      | 0.1
  ~LC   | 0.2     | 0.5      | 0.3      | 0.9

The CPT shows the conditional probability for each possible combination of its parents.

Derivation of the probability of a particular combination of values of X, from the CPT:

  $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))$

Page 61

Training Bayesian Networks

- Several scenarios:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  - Unknown structure, all hidden variables: no good algorithms known for this purpose
- Ref.: D. Heckerman, Bayesian networks for data mining

Page 62

Example

- Two events could cause grass to be wet: either the sprinkler is on or it's raining
- The rain has a direct effect on the use of the sprinkler
  - when it rains, the sprinkler is usually not turned on
- The situation can then be modeled with a Bayesian network. All three variables have two possible values, T and F.
- The joint probability function is:

  P(G, S, R) = P(G | S, R) P(S | R) P(R)

  where G = Grass wet, S = Sprinkler, and R = Rain

Page 63

Example

- The model can answer questions like "What is the probability that it is raining, given the grass is wet?"
- The joint probability function is:

  P(G, S, R) = P(G | S, R) P(S | R) P(R)

  where G = Grass wet, S = Sprinkler, and R = Rain
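An enumeration sketch for this query, P(Rain = T | Grass wet = T). The CPT numbers below are illustrative values commonly used with this example; they are assumed, not given on the slide:

  # Assumed, illustrative CPTs (not from the slides)
  P_R = {True: 0.2, False: 0.8}                            # P(R)
  P_S_given_R = {True: 0.01, False: 0.4}                   # P(S=T | R)
  P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                  (False, True): 0.8, (False, False): 0.0} # P(G=T | S, R)

  def joint(g, s, r):
      """P(G,S,R) = P(G|S,R) P(S|R) P(R)"""
      pg = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
      ps = P_S_given_R[r] if s else 1 - P_S_given_R[r]
      return pg * ps * P_R[r]

  num = sum(joint(True, s, True) for s in (True, False))
  den = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
  print(num / den)   # ~0.36: probably not raining, even though the grass is wet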

Page 64

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 65

Ensemble Methods: Increasing the Accuracy

- Ensemble methods
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
- Popular ensemble methods
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
  - Ensemble: combining a set of heterogeneous classifiers

Page 66

Bagging: Bootstrap Aggregation

- Analogy: diagnosis based on multiple doctors' majority vote
- Training
  - Given a set D of d tuples, at each iteration i a training set D_i of d tuples is sampled with replacement from D (i.e., bootstrap)
  - A classifier model M_i is learned for each training set D_i
- Classification: classify an unknown sample X
  - Each classifier M_i returns its class prediction
  - The bagged classifier M* counts the votes and assigns the class with the most votes to X
- Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
- Accuracy
  - Often significantly better than a single classifier derived from D
  - For noisy data: not considerably worse, more robust
  - Proven improved accuracy in prediction

Page 67

Boosting

- Analogy: consult several doctors, based on a combination of weighted diagnoses; weights are assigned based on previous diagnosis accuracy
- How boosting works:
  - Weights are assigned to each training tuple
  - A series of k classifiers is iteratively learned
  - After a classifier M_i is learned, the weights are updated to allow the subsequent classifier, M_{i+1}, to pay more attention to the training tuples that were misclassified by M_i
  - The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
- The boosting algorithm can be extended for numeric prediction
- Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data

Page 68

Adaboost (Freund and Schapire, 1997)

- Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
- Initially, all the weights of the tuples are set the same (1/d)
- Generate k classifiers in k rounds. At round i:
  - Tuples from D are sampled (with replacement) to form a training set D_i of the same size
  - Each tuple's chance of being selected is based on its weight
  - A classification model M_i is derived from D_i
  - Its error rate is calculated using D_i as a test set
  - If a tuple is misclassified, its weight is increased; otherwise it is decreased
- Error rate: err(X_j) is the misclassification error of tuple X_j. The error rate of classifier M_i is the sum of the weights of the misclassified tuples:

  $error(M_i) = \sum_{j=1}^{d} w_j \times err(\mathbf{X}_j)$

- The weight of classifier M_i's vote is

  $\log\frac{1 - error(M_i)}{error(M_i)}$
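The two AdaBoost formulas in code (a sketch; `predictions` and `labels` are placeholder sequences for M_i's outputs and the true classes):

  import math

  def classifier_error(weights, predictions, labels):
      """error(M_i) = sum of the weights of the misclassified tuples."""
      return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)

  def vote_weight(error):
      """Weight of classifier M_i's vote: log((1 - error) / error)."""
      return math.log((1 - error) / error)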

Page 69

Random Forest (Breiman 2001)

- Random Forest:
  - Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split
  - During classification, each tree votes and the most popular class is returned
- Two methods to construct a Random Forest:
  - Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size
  - Forest-RC (random linear combinations): creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)
- Comparable in accuracy to Adaboost, but more robust to errors and outliers
- Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting

Page 70

Classification of Class-Imbalanced Data Sets

- Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil spill, fault, etc.
- Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
- Typical methods for imbalanced data in 2-class classification:
  - Oversampling: re-sampling of data from the positive class
  - Under-sampling: randomly eliminate tuples from the negative class
  - Threshold-moving: move the decision threshold, t, so that the rare class tuples are easier to classify, and hence there is less chance of costly false negative errors
  - Ensemble techniques: ensemble multiple classifiers, as introduced above
- Still difficult for the class imbalance problem on multiclass tasks

Page 71

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 72

Lazy vs. Eager Learning

- Lazy vs. eager learning
  - Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  - Eager learning (the methods discussed above): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy: less time in training but more time in predicting
- Accuracy
  - Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  - Eager: must commit to a single hypothesis that covers the entire instance space

Page 73

Lazy Learner: Instance-Based Methods

- Instance-based learning:
  - Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach
    - Instances represented as points in a Euclidean space
  - Locally weighted regression
    - Constructs a local approximation
  - Case-based reasoning
    - Uses symbolic representations and knowledge-based inference

Page 74

The k-Nearest Neighbor Algorithm

- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- The target function could be discrete- or real-valued
- For discrete-valued, k-NN returns the most common value among the k training examples nearest to x_q
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

(Figure: positive and negative training points scattered around a query point x_q)

Page 75

Discussion on the k-NN Algorithm

- k-NN for real-valued prediction for a given unknown tuple
  - Returns the mean values of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to its distance to the query x_q
  - Give greater weight to closer neighbors:

    $w \equiv \frac{1}{d(x_q, x_i)^2}$

- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
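A distance-weighted k-NN sketch (my own code, weighting votes by 1/d² as above):

  import math
  from collections import Counter

  def knn_predict(train, x_q, k):
      """train: list of (point, label); classify query x_q by distance-weighted vote."""
      def dist(a, b):
          return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
      nearest = sorted(train, key=lambda t: dist(t[0], x_q))[:k]
      votes = Counter()
      for point, label in nearest:
          d = dist(point, x_q)
          votes[label] += 1 / (d * d) if d > 0 else float('inf')
      return votes.most_common(1)[0][0]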

Page 76

Chapters 8-9. Classification

- Classification: Basic Concepts
- Decision Tree Induction
- Model Evaluation / Learning Algorithm Evaluation
- Rule-Based Classification
- Bayes Classification Methods
- Bayesian Belief Networks (ch9)
- Techniques to Improve Classification
- Lazy Learners (ch9)
- Other known methods: SVM, ANN (ch9)

Page 77

SVM—Support Vector Machines

- A new classification method for both linear and nonlinear data
- It uses a nonlinear mapping to transform the original training data into a higher dimension
- Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., a "decision boundary")
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
- SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

Page 78

History and Applications

- Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
- Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
- Used both for classification and prediction
- Applications:
  - handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

Page 79

General Philosophy

(Figure: two separating hyperplanes with their support vectors; one gives a small margin, the other a large margin)

Page 80

When Data Is Linearly Separable

(Figure: linearly separable data points from two classes, with margin m)

- Let the data D be (X1, y1), …, (X|D|, y|D|), where X_i is the set of training tuples with associated class labels y_i
- There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
- SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)

Page 81

Kernel Functions

- Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(X_i, X_j) to the original data, i.e., K(X_i, X_j) = Φ(X_i)·Φ(X_j)
- Typical kernel functions (the slide's list was lost in extraction; the standard examples are the polynomial kernel $K(X_i, X_j) = (X_i \cdot X_j + 1)^h$, the Gaussian radial basis function kernel $K(X_i, X_j) = e^{-\|X_i - X_j\|^2 / 2\sigma^2}$, and the sigmoid kernel $K(X_i, X_j) = \tanh(\kappa X_i \cdot X_j - \delta)$)
- SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
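A sketch of one such kernel, the Gaussian RBF, computed directly on the original tuples with no explicit mapping Φ (my own illustration):

  import math

  def rbf_kernel(x, y, sigma=1.0):
      """K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2 sigma^2))."""
      sq = sum((a - b) ** 2 for a, b in zip(x, y))
      return math.exp(-sq / (2 * sigma ** 2))

  print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))   # e^{-1} ~ 0.3679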

Page 82

Why Is SVM Effective on High Dimensional Data?

- The complexity of the trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
- The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
- If all other training examples were removed and the training repeated, the same separating hyperplane would be found
- The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
- Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

Page 83

SVM—Introduction Literature

- "Statistical Learning Theory" by Vapnik: extremely hard to understand, and contains many errors
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
  - Better than Vapnik's book, but still written at too hard a level for an introduction, and the examples are not intuitive
- The book "An Introduction to Support Vector Machines" by N. Cristianini and J. Shawe-Taylor
  - Also hard as an introduction, but the explanation of Mercer's theorem is better than in the above literature
- The neural network book by Haykin
  - Contains one nice chapter of SVM introduction

Page 84

SVM Related Links

- SVM website
  - http://www.kernel-machines.org/
- Representative implementations
  - LIBSVM: an efficient implementation of SVM, with multi-class classification, nu-SVM, and one-class SVM, including various interfaces for Java, Python, etc.
  - SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  - SVM-torch: another implementation, also written in C

Page 85

ANN—Artificial Neural Network (Classification by Backpropagation)

- An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one neuron to the input of another.
- Deep learning: deep neural networks

Page 86

SVM vs. ANN

- SVM
  - Relatively new concept
  - Deterministic
  - Nice generalization properties
  - Hard to learn: learned in batch mode using quadratic programming techniques
  - Using kernels, can learn very complex functions
- ANN
  - Relatively old (but …)
  - Nondeterministic
  - Generalizes well, but doesn't have a strong mathematical foundation
  - Can easily be learned in an incremental fashion
  - To learn complex functions, use a multilayer perceptron (not that trivial)
