CS639: Data Management for Data Science

Lecture 16: Intro to ML and Decision Trees

Theodoros Rekatsinas (lecture by Ankur Goswami, many slides from David Sontag)

Today’s Lecture

1. Intro to Machine Learning

2. Types of Machine Learning

3. Decision Trees

1. Intro to Machine Learning

What is Machine Learning?

• “Learning is any process by which a system improves performance from experience” – Herbert Simon

• Definition by Tom Mitchell (1998): Machine Learning is the study of algorithms that
  • improve their performance P
  • at some task T
  • with experience E
  A well-defined learning task is given by <P, T, E>.

Experience: data-driven task, thus statistics, probability
Example: use height and weight to predict gender

When do we use machine learning?

ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

A task that requires machine learning

What makes a hand drawing a 2?

Modern machine learning: Autonomous cars

Modern machine learning: Scene Labeling

Modern machine learning: Speech Recognition

2. Types of Machine Learning

Types of Learning

• Supervised (inductive) learning
  • Given: training data + desired outputs (labels)

• Unsupervised learning
  • Given: training data (without desired outputs)

• Semi-supervised learning
  • Given: training data + a few desired outputs

• Reinforcement learning
  • Rewards from sequence of actions

Supervised Learning: Regression

• Given training examples (x, y): inputs x with desired outputs y
• Learn a function f(x) to predict y given x
• y is real-valued == regression
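A minimal sketch of the regression setting, assuming a one-dimensional x and an ordinary least-squares line as the learned f(x). The slide does not name a fitting method, so least squares, the helper names fit_line and f, and the toy data are all illustrative assumptions.

```python
def fit_line(xs, ys):
    """Fit y ~ slope * x + intercept by ordinary least squares (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)


def f(x):
    """The learned function: predicts a real-valued y for a new x."""
    return slope * x + intercept


print(slope, intercept, f(5.0))
```

The output y is a real number, which is what makes this regression rather than classification.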

Supervised Learning: Classification

• Given training examples (x, y): inputs x with desired outputs y
• Learn a function f(x) to predict y given x
• y is categorical == classification

Supervised Learning

• Value x can be multi-dimensional.
  • Each dimension corresponds to an attribute

Types of Learning

• Supervised (inductive) learning
  • Given: training data + desired outputs (labels)

• Unsupervised learning
  • Given: training data (without desired outputs)

• Semi-supervised learning
  • Given: training data + a few desired outputs

• Reinforcement learning
  • Rewards from sequence of actions

We will cover later in the class

3. Decision Trees

A learning problem: predict fuel efficiency

Hypotheses: decision trees f: X → Y

Informal: A hypothesis is a certain function that we believe (or hope) is similar to the true function, the target function that we want to model.

What functions can Decision Trees represent?

Space of possible decision trees

• How will we choose the best one?
  • Let's first look at how to split nodes, then consider how to find the best tree

What is the simplest tree?

• Always predict mpg = bad
  • We just take the majority class

• Is this a good tree?
  • We need to evaluate its performance

• Performance: We are correct on 22 examples and incorrect on 18 examples
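The majority-class tree can be sketched as follows. The label list is a hypothetical stand-in that only matches the counts on the slide (22 bad, 18 good), and the helper name majority_class is illustrative.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label: the prediction of a one-leaf tree."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels matching the slide's counts: 22 "bad", 18 "good".
labels = ["bad"] * 22 + ["good"] * 18

prediction = majority_class(labels)
correct = sum(1 for y in labels if y == prediction)
incorrect = len(labels) - correct
accuracy = correct / len(labels)

print(prediction, correct, incorrect, accuracy)  # bad 22 18 0.55
```

An accuracy of 22/40 = 0.55 is the baseline that any deeper tree has to beat.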

A decision stump

Recursive step

Second level of tree

Are all decision trees equal?

• Many trees can represent the same concept
• But, not all trees will have the same size!
  • e.g., φ = (A ∧ B) ∨ (¬A ∧ C), i.e., (A and B) or (not A and C)

• Which tree do we prefer?
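A small check of the claim above, as a sketch: two hand-built trees for φ, one splitting on A first and one splitting on B first. The nested-tuple tree encoding and the helper names are illustrative assumptions.

```python
from itertools import product

# A tree is a bool leaf or a tuple (attribute, subtree_if_false, subtree_if_true).
# Splitting on A first needs only one more test on each side.
small_tree = ("A", ("C", False, True), ("B", False, True))

# Splitting on B first forces A to be tested again in both branches.
large_tree = ("B",
              ("A", ("C", False, True), False),
              ("A", ("C", False, True), True))


def predict(tree, x):
    """Walk from the root to a leaf, following the value of each tested attribute."""
    while not isinstance(tree, bool):
        attribute, if_false, if_true = tree
        tree = if_true if x[attribute] else if_false
    return tree


def count_internal(tree):
    """Number of internal (test) nodes in the tree."""
    if isinstance(tree, bool):
        return 0
    _, if_false, if_true = tree
    return 1 + count_internal(if_false) + count_internal(if_true)


def phi(x):
    return (x["A"] and x["B"]) or ((not x["A"]) and x["C"])


assignments = [dict(zip("ABC", values)) for values in product([False, True], repeat=3)]
agree = all(predict(small_tree, x) == predict(large_tree, x) == phi(x)
            for x in assignments)

print(agree, count_internal(small_tree), count_internal(large_tree))  # True 3 5
```

Both trees compute φ on all eight assignments, but one uses 3 tests and the other 5.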

Learning decision trees is hard

• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest ’76]
• Resort to a greedy heuristic:
  • Start from empty decision tree
  • Split on next best attribute (feature)
  • Recurse

Splitting: choosing a good attribute

Measuring uncertainty

• Good split if we are more certain about classification after split
  • Deterministic good (all true or all false)
  • Uniform distribution bad
  • What about distributions in between?

Entropy
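For reference, the standard (Shannon) definition of entropy for a class variable Y with values y_1, ..., y_k, measured in bits. The symbols Y, y_i and k are notation assumed here.

```latex
H(Y) = -\sum_{i=1}^{k} P(Y = y_i)\,\log_2 P(Y = y_i)
```

Terms with P(Y = y_i) = 0 are taken to contribute 0.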

High, Low Entropy

Entropy Example
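A minimal sketch of an entropy computation from class counts, assuming the standard Shannon definition in bits. The function name entropy and the three count vectors are illustrative; the last one reuses the 22/18 split from the mpg example.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a distribution given as class counts."""
    total = sum(counts)
    # Zero counts are skipped: they contribute nothing to the sum.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


deterministic = entropy([40, 0])   # all examples in one class
uniform = entropy([20, 20])        # evenly split between two classes
mpg_split = entropy([22, 18])      # 22 bad, 18 good

print(deterministic, uniform, mpg_split)
```

The deterministic case gives 0 bits, the uniform two-class case gives 1 bit, and the 22/18 split lands just under 1 bit.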

Conditional Entropy
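For reference, the standard definition of the conditional entropy of the class Y given an attribute X with values x_j: the entropy of Y within each branch, weighted by the probability of that branch. The symbols are notation assumed here.

```latex
H(Y \mid X) = \sum_{j} P(X = x_j)\, H(Y \mid X = x_j)
            = -\sum_{j} P(X = x_j) \sum_{i=1}^{k} P(Y = y_i \mid X = x_j)\,\log_2 P(Y = y_i \mid X = x_j)
```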

Information gain
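Information gain is the drop in entropy after splitting on an attribute X: IG(X) = H(Y) - H(Y | X). A minimal sketch follows, assuming records stored as dictionaries. The attribute names, the four toy records and the helper names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """H(Y) minus the weighted entropy of Y inside each branch of the split."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy records: cylinders determines the label, maker does not.
rows = [
    {"cylinders": 4, "maker": "asia"},
    {"cylinders": 4, "maker": "america"},
    {"cylinders": 8, "maker": "asia"},
    {"cylinders": 8, "maker": "america"},
]
labels = ["good", "good", "bad", "bad"]

gain_cylinders = information_gain(rows, labels, "cylinders")
gain_maker = information_gain(rows, labels, "maker")

print(gain_cylinders, gain_maker)  # 1.0 0.0
```

The greedy heuristic would split on cylinders here, since it removes all uncertainty about the label while maker removes none.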

Learning decision trees

A decision stump

Base Cases: An Idea

• Base Case One: If all records in current data subset have the same output, then do not recurse
• Base Case Two: If all records have exactly the same set of input attributes, then do not recurse
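The greedy procedure with these two base cases can be sketched as below. This is one possible reading, not the lecture's own code: the tree encoding, helper names and toy records are hypothetical, attributes are assumed categorical, and ties in information gain go to the first attribute listed.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())


def split(rows, labels, attribute):
    """Group (rows, labels) by the value each row has for the attribute."""
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        groups[row[attribute]][0].append(row)
        groups[row[attribute]][1].append(label)
    return groups


def information_gain(rows, labels, attribute):
    parts = split(rows, labels, attribute).values()
    return entropy(labels) - sum(len(sl) / len(labels) * entropy(sl) for _, sl in parts)


def build_tree(rows, labels, attributes):
    majority = Counter(labels).most_common(1)[0][0]
    # Base Case One: every record has the same output.
    if len(set(labels)) == 1:
        return majority
    # Base Case Two: every record has the same inputs (or none are left to test).
    if all(row[a] == rows[0][a] for row in rows for a in attributes):
        return majority
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    children = {value: build_tree(sub_rows, sub_labels, remaining)
                for value, (sub_rows, sub_labels) in split(rows, labels, best).items()}
    return (best, children, majority)


def predict(tree, row):
    while isinstance(tree, tuple):
        attribute, children, majority = tree
        if row[attribute] not in children:
            return majority  # unseen attribute value: fall back to the majority class
        tree = children[row[attribute]]
    return tree


# Hypothetical toy records.
rows = [
    {"cylinders": 4, "maker": "asia"}, {"cylinders": 4, "maker": "america"},
    {"cylinders": 8, "maker": "asia"}, {"cylinders": 8, "maker": "america"},
    {"cylinders": 6, "maker": "asia"}, {"cylinders": 6, "maker": "america"},
]
labels = ["good", "good", "bad", "bad", "good", "bad"]

tree = build_tree(rows, labels, ["cylinders", "maker"])
predictions = [predict(tree, row) for row in rows]
print(tree[0], predictions == labels)  # cylinders True
```

On this toy data the root splits on cylinders, and only the 6-cylinder branch needs a second split on maker.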

The problem with Base Case 3

If we omit Base Case 3

Summary: Building Decision Trees

From categorical to real-valued attributes
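One common way to handle a real-valued attribute, assumed here since the slide gives no detail, is a binary test "x < t" with the threshold t chosen by information gain among midpoints between consecutive observed values. The function names and the toy weights are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split 'value < threshold'."""
    distinct = sorted(set(values))
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v < t]
        right = [y for v, y in zip(values, labels) if v >= t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Hypothetical car weights and mpg labels.
weights = [2000, 2200, 2500, 3400, 3600, 4000]
labels = ["good", "good", "good", "bad", "bad", "bad"]

threshold, gain = best_threshold(weights, labels)
print(threshold, gain)  # 2950.0 1.0
```

Unlike a categorical attribute, the same real-valued attribute can usefully be tested again further down the tree with a different threshold.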

What you need to know about decision trees

• Decision trees are one of the most popular ML tools
  • Easy to understand, implement, and use
  • Computationally cheap (to solve heuristically)

• Information gain to select attributes
• Presented for classification but can be used for regression and density estimation too
• Decision trees will overfit!!!
  • We will see the definition of overfitting and related concepts later in class.