Learning with Decision Trees Artificial Intelligence CMSC 25000 February 20, 2003

Agenda

• Learning from examples
  – Machine learning review
  – Identification Trees:
    • Basic characteristics
    • Sunburn example
    • From trees to rules
    • Learning by minimizing heterogeneity
    • Analysis: Pros & Cons

Machine Learning: Review

• Learning:
  – Automatically acquire a function from inputs to output values, based on previously seen inputs and output values.
  – Input: Vector of feature values
  – Output: Value
• Examples: Word pronunciation, robot motion, speech recognition
• Key contrasts:
  – Supervised versus Unsupervised
    • With or without labeled examples (known outputs)
  – Classification versus Regression
    • Output values: Discrete versus continuous-valued
  – Types of functions learned
    • aka “Inductive Bias”
    • Learning algorithm restricts things that can be learned
• Key issues:
  – Feature selection:
    • What features should be used?
    • How do they relate to each other?
    • How sensitive is the technique to feature selection?
      – Irrelevant, noisy, absent features; feature types
  – Complexity & Generalization
    • Tension between
      – Matching training data
      – Performing well on NEW UNSEEN inputs

Machine Learning Features

• Inputs:
  – E.g. words, acoustic measurements, financial data
  – Vectors of features:
    • E.g. word: letters
      – ‘cat’: L1 = c; L2 = a; L3 = t
    • Financial data:
      – F1 = # late payments/yr: Integer
      – F2 = Ratio of income to expense: Real

Machine Learning Features

• Question:
  – Which features should be used?
  – How should they relate to each other?
• Issue 1: How do we define relation in feature space if features have different scales?
  – Solution: Scaling/normalization
• Issue 2: Which ones are important?
  – If examples differ only in an irrelevant feature, the difference should be ignored
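One common way to realize the scaling step in Issue 1 is min-max normalization, which maps every feature onto the range 0 to 1 so that no feature dominates purely because of its units. The sketch below is illustrative only: the function name `min_max_scale` and the sample values for F1 and F2 are assumptions, not taken from the lecture.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map every value to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical financial features on very different scales.
late_payments = [0, 2, 5, 10]          # F1: integer count per year
income_ratio = [0.8, 1.2, 1.5, 3.0]    # F2: real-valued ratio

scaled_late = min_max_scale(late_payments)
scaled_ratio = min_max_scale(income_ratio)
print(scaled_late)
print(scaled_ratio)
```

After scaling, both features span the same interval, so distances in feature space are not dominated by whichever feature had the larger raw values.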

Complexity & Generalization

• Goal: Predict values accurately on new inputs
• Problem:
  – Train on sample data
  – Can make arbitrarily complex model to fit
  – BUT, will probably perform badly on NEW data
• Strategy:
  – Limit complexity of model (e.g. degree of equation)
  – Split training and validation sets
    • Hold out data to check for overfitting
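The hold-out strategy can be sketched in a few lines. This is a minimal illustration, assuming a simple random split; the function name `holdout_split`, the 25% validation fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random


def holdout_split(examples, validation_fraction=0.25, seed=0):
    """Shuffle the examples and set aside a fraction of them for validation."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    # The first n_val shuffled items are held out; the rest are used for training.
    return shuffled[n_val:], shuffled[:n_val]


train_set, validation_set = holdout_split(list(range(8)))
print(len(train_set), len(validation_set))
```

A model that scores much better on `train_set` than on `validation_set` is likely overfitting.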

Learning: Identification Trees

• (aka Decision Trees)
• Supervised learning
• Primarily classification
• Rectangular decision boundaries
  – More restrictive than nearest neighbor
• Robust to irrelevant attributes, noise
• Fast prediction

Sunburn Example

Name   Hair    Height   Weight   Lotion  Result
Sarah  Blonde  Average  Light    No      Burn
Dana   Blonde  Tall     Average  Yes     None
Alex   Brown   Short    Average  Yes     None
Annie  Blonde  Short    Average  No      Burn
Emily  Red     Average  Heavy    No      Burn
Pete   Brown   Tall     Heavy    No      None
John   Brown   Average  Heavy    No      None
Katie  Blonde  Short    Light    Yes     None

Learning about Sunburn

• Goal:
  – Train on labeled examples
  – Predict Burn/None for new instances
• Solution??
  – Exact match: same features, same output
    • Problem: 2*3^3 feature combinations
      – Could be much worse
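The 2*3^3 count can be checked by enumerating every combination of feature values that appears in the sunburn table. The sketch below assumes the value sets are exactly those seen in the table; the name `FEATURE_VALUES` is introduced here for illustration.

```python
from itertools import product

# Value sets as they appear in the sunburn table (an assumption: no other values exist).
FEATURE_VALUES = {
    "Hair": ["Blonde", "Brown", "Red"],
    "Height": ["Short", "Average", "Tall"],
    "Weight": ["Light", "Average", "Heavy"],
    "Lotion": ["No", "Yes"],
}

# Every possible instance an exact-match learner would need to have seen.
all_combinations = list(product(*FEATURE_VALUES.values()))
print(len(all_combinations))  # 3 * 3 * 3 * 2 = 54
```

Only 8 of these 54 combinations occur in the training table, so exact matching would leave most new instances unclassified.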

Learning about Sunburn

• Better Solution:
  – Identification tree:
  – Training:
    • Divide examples into subsets based on feature tests
    • Sets of samples at leaves define classification
  – Prediction:
    • Route NEW instance through tree to leaf based on feature tests
    • Assign same value as samples at leaf

Sunburn Identification Tree

Hair Color
  Blonde -> Lotion Used
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None
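Prediction with this tree amounts to following one branch per feature test until a leaf is reached. The following is a minimal sketch, assuming the tree is encoded as nested `(feature, branches)` tuples with class labels as leaves; that encoding and the names `SUNBURN_TREE` and `classify` are choices made for this example, not part of the lecture.

```python
# Internal node: (feature name, {feature value: subtree}); leaf: class label string.
SUNBURN_TREE = ("Hair", {
    "Blonde": ("Lotion", {"No": "Burn", "Yes": "None"}),
    "Red": "Burn",
    "Brown": "None",
})


def classify(tree, instance):
    """Route an instance (dict of feature -> value) from the root to a leaf label."""
    while not isinstance(tree, str):
        feature, branches = tree
        tree = branches[instance[feature]]   # follow the branch matching this instance
    return tree


# A new instance only needs the features the tree actually tests.
new_person = {"Hair": "Blonde", "Height": "Tall", "Weight": "Light", "Lotion": "No"}
print(classify(SUNBURN_TREE, new_person))  # Burn
```

Note that Height and Weight are never consulted, which is how the tree ignores irrelevant features.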

Simplicity

• Occam’s Razor:
  – Simplest explanation that covers the data is best
• Occam’s Razor for ID trees:
  – Smallest tree consistent with samples will be best predictor for new data
• Problem:
  – Finding all trees & finding smallest: Expensive!
• Solution:
  – Greedily build a small tree

Building ID Trees

• Goal: Build a small tree such that all samples at leaves have same class
• Greedy solution:
  – At each node, pick test such that branches are closest to having same class
    • Split into subsets with least “disorder”
      – (Disorder ~ Entropy)
  – Find test that minimizes disorder

Minimizing Disorder

Candidate tests on all eight samples (B = Burn, N = None):

Hair Color
  Blonde:  Sarah: B, Dana: N, Annie: B, Katie: N
  Red:     Emily: B
  Brown:   Alex: N, Pete: N, John: N
Height
  Short:   Alex: N, Annie: B, Katie: N
  Average: Sarah: B, Emily: B, John: N
  Tall:    Dana: N, Pete: N
Weight
  Light:   Sarah: B, Katie: N
  Average: Dana: N, Alex: N, Annie: B
  Heavy:   Emily: B, Pete: N, John: N
Lotion
  No:      Sarah: B, Annie: B, Emily: B, Pete: N, John: N
  Yes:     Dana: N, Alex: N, Katie: N

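Minimizing Disorder

Candidate tests on the Blonde subset (B = Burn, N = None):

Height
  Short:   Annie: B, Katie: N
  Average: Sarah: B
  Tall:    Dana: N
Weight
  Light:   Sarah: B, Katie: N
  Average: Dana: N, Annie: B
  Heavy:   (no samples)
Lotion
  No:      Sarah: B, Annie: B
  Yes:     Dana: N, Katie: N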

Measuring Disorder

• Problem:
  – In general, tests on large databases don’t yield homogeneous subsets
• Solution:
  – General information theoretic measure of disorder
  – Desired features:
    • Homogeneous set: least disorder = 0
    • Even split: most disorder = 1

Measuring Entropy

• If split m objects into 2 bins of size m1 & m2, what is the entropy?

$$\text{Entropy} = -\frac{m_1}{m}\log_2\frac{m_1}{m} - \frac{m_2}{m}\log_2\frac{m_2}{m}$$

[Figure: plot with axis labeled Disorder]

Measuring Disorder

Entropy (disorder) of a split:

$$\text{Entropy} = -\sum_i p_i \log_2 p_i$$

where $p_i = m_i / m$ is the probability of being in bin i. Assume $0 \log_2 0 = 0$.

p1   p2   Entropy
½    ½    -½ log2 ½ - ½ log2 ½ = ½ + ½ = 1
¼    ¾    -¼ log2 ¼ - ¾ log2 ¾ = 0.5 + 0.311 = 0.811
1    0    -1 log2 1 - 0 log2 0 = 0 - 0 = 0
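The three rows of the entropy table can be reproduced directly from the definition. The sketch below is one straightforward way to code it; the function name `entropy` is an assumption, and the `0 log2 0 = 0` convention is handled by skipping zero probabilities.

```python
from math import log2


def entropy(probabilities):
    """Entropy in bits of a discrete distribution, with 0 * log2(0) taken as 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


even_split = entropy([0.5, 0.5])      # most disorder for two bins
uneven_split = entropy([0.25, 0.75])
homogeneous = entropy([1.0, 0.0])     # least disorder

print(even_split, round(uneven_split, 3), homogeneous)
```

The values match the table: 1 for the even split, about 0.811 for the ¼ / ¾ split, and 0 for the homogeneous set.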

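Computing Disorder

$$\text{AvgDisorder} = \sum_{i=1}^{n} \frac{N_i}{N} \sum_{c \in \text{class}} -\frac{N_{i,c}}{N_i} \log_2 \frac{N_{i,c}}{N_i}$$

• $N_i / N$: Fraction of samples down branch i
• Inner sum over classes: Disorder of class distribution on branch i

Example: N instances split into Branch 1 (N1a of class a, N1b of class b) and Branch 2 (N2a of class a, N2b of class b).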
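The average-disorder formula translates almost term by term into code when each branch is described by its per-class counts. This is a sketch under that assumption; the names `disorder` and `average_disorder` and the list-of-counts representation are choices made for the example.

```python
from math import log2


def disorder(class_counts):
    """Entropy of the class distribution on one branch, given counts per class."""
    total = sum(class_counts)
    return sum(-(c / total) * log2(c / total) for c in class_counts if c > 0)


def average_disorder(branches):
    """Weight each branch's disorder by the fraction of samples sent down it."""
    n = sum(sum(counts) for counts in branches)
    return sum(sum(counts) / n * disorder(counts)
               for counts in branches if sum(counts) > 0)


# Hair color test on the sunburn data, counts given as [Burn, None] per branch:
# Blonde = 2 Burn / 2 None, Red = 1 Burn / 0 None, Brown = 0 Burn / 3 None.
hair_score = average_disorder([[2, 2], [1, 0], [0, 3]])
print(hair_score)  # 0.5
```

Empty branches are skipped, so a feature value with no samples contributes nothing to the average.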

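Entropy in Sunburn Example

$$\text{AvgDisorder} = \sum_{i=1}^{n} \frac{N_i}{N} \sum_{c \in \text{class}} -\frac{N_{i,c}}{N_i} \log_2 \frac{N_{i,c}}{N_i}$$

Hair color = 4/8(-2/4 log2 2/4 - 2/4 log2 2/4) + 1/8*0 + 3/8*0 = 0.5
Height = 0.69
Weight = 0.94
Lotion = 0.61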
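The four scores can be recomputed from the sunburn table itself. The sketch below assumes the table is encoded as plain tuples in the column order Name, Hair, Height, Weight, Lotion, Result; the helper names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

FEATURES = ("Hair", "Height", "Weight", "Lotion")
SAMPLES = [
    ("Sarah", "Blonde", "Average", "Light", "No", "Burn"),
    ("Dana", "Blonde", "Tall", "Average", "Yes", "None"),
    ("Alex", "Brown", "Short", "Average", "Yes", "None"),
    ("Annie", "Blonde", "Short", "Average", "No", "Burn"),
    ("Emily", "Red", "Average", "Heavy", "No", "Burn"),
    ("Pete", "Brown", "Tall", "Heavy", "No", "None"),
    ("John", "Brown", "Average", "Heavy", "No", "None"),
    ("Katie", "Blonde", "Short", "Light", "Yes", "None"),
]


def disorder(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def average_disorder(samples, column):
    """Average disorder of the split induced by the feature in the given column."""
    branches = defaultdict(list)
    for row in samples:
        branches[row[column]].append(row[-1])   # group results by feature value
    n = len(samples)
    return sum(len(lbls) / n * disorder(lbls) for lbls in branches.values())


# Column 0 is the name, so feature i lives in column i + 1.
scores = {name: average_disorder(SAMPLES, i + 1) for i, name in enumerate(FEATURES)}
for name, score in scores.items():
    print(name, round(score, 2))
```

Hair color has the lowest average disorder, so it is chosen as the root test.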

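Entropy in Sunburn Example

$$\text{AvgDisorder} = \sum_{i=1}^{n} \frac{N_i}{N} \sum_{c \in \text{class}} -\frac{N_{i,c}}{N_i} \log_2 \frac{N_{i,c}}{N_i}$$

Second-level tests on the four Blonde samples:

Height = 2/4(-1/2 log2 1/2 - 1/2 log2 1/2) + 1/4*0 + 1/4*0 = 0.5
Weight = 2/4(-1/2 log2 1/2 - 1/2 log2 1/2) + 2/4(-1/2 log2 1/2 - 1/2 log2 1/2) = 1
Lotion = 0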

Building ID Trees with Disorder

• Until each leaf is as homogeneous as possible
  – Select an inhomogeneous leaf node
  – Replace that leaf node by a test node creating subsets with least average disorder
• Effectively creates set of rectangular regions
  – Repeatedly draws lines in different axes
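The greedy procedure can be written as a short recursive function. This is a minimal sketch, not the lecture's own code: it assumes discrete features, uses each feature at most once per path, and falls back to the majority label when no features remain. All names (`build_tree`, `average_disorder`, `TABLE`) are introduced for the example.

```python
from collections import Counter
from math import log2

COLUMNS = ("Name", "Hair", "Height", "Weight", "Lotion", "Result")
TABLE = """
Sarah Blonde Average Light No Burn
Dana Blonde Tall Average Yes None
Alex Brown Short Average Yes None
Annie Blonde Short Average No Burn
Emily Red Average Heavy No Burn
Pete Brown Tall Heavy No None
John Brown Average Heavy No None
Katie Blonde Short Light Yes None
"""
SAMPLES = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]


def disorder(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def average_disorder(samples, feature):
    """Average disorder of the subsets produced by testing one feature."""
    total = 0.0
    for value in {row[feature] for row in samples}:
        labels = [row["Result"] for row in samples if row[feature] == value]
        total += len(labels) / len(samples) * disorder(labels)
    return total


def build_tree(samples, features):
    """Greedily grow a tree: leaves are labels, nodes are (feature, branches)."""
    labels = [row["Result"] for row in samples]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # homogeneous, or nothing left to test
    best = min(features, key=lambda f: average_disorder(samples, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted({row[best] for row in samples}):
        subset = [row for row in samples if row[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches)


tree = build_tree(SAMPLES, ["Hair", "Height", "Weight", "Lotion"])
print(tree)
```

On the sunburn data this picks Hair at the root and Lotion under the Blonde branch, matching the tree shown earlier in the slides.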

Features in ID Trees: Pros

• Feature selection:
  – Tests features that yield low disorder
    • I.e. selects features that are important!
  – Ignores irrelevant features
• Feature type handling:
  – Discrete type: 1 branch per value
  – Continuous type: Branch on >= value
    • Need to search to find best breakpoint
• Absent features: Distribute uniformly
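For a continuous feature, the breakpoint search can be sketched as trying a threshold between each pair of adjacent observed values and keeping the one with the least average disorder. Using midpoints as candidates is a common convention assumed here, not something the slides specify; the function name `best_threshold` and the numeric heights are hypothetical.

```python
from collections import Counter
from math import log2


def disorder(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, average disorder) for the best test 'value >= threshold'.

    Returns None when all values are identical, since no split is possible.
    """
    distinct = sorted(set(values))
    best = None
    n = len(labels)
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2                       # candidate midpoint
        below = [l for v, l in zip(values, labels) if v < threshold]
        at_or_above = [l for v, l in zip(values, labels) if v >= threshold]
        score = (len(below) / n * disorder(below)
                 + len(at_or_above) / n * disorder(at_or_above))
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical heights in cm with their sunburn results.
heights = [150, 160, 165, 180, 185]
results = ["Burn", "Burn", "None", "None", "None"]
print(best_threshold(heights, results))
```

Here the split at 162.5 separates the two classes perfectly, so its average disorder is 0.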

Features in ID Trees: Cons

• Features
  – Assumed independent
  – If want group effect, must model explicitly
    • E.g. make new feature AorB
• Feature tests conjunctive

From Trees to Rules

• Tree:
  – Branches from root to leaves = Tests => classifications
  – Tests = if antecedents; Leaf labels = consequent
  – All ID trees can be converted to rules; not all rule sets can be expressed as trees

From ID Trees to Rules

Hair Color
  Blonde -> Lotion Used
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None

(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))
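Reading rules off a tree is a depth-first walk that collects the tests along each root-to-leaf path. The sketch below assumes the same nested `(feature, branches)` encoding used for illustration elsewhere; `tree_to_rules` and the rule representation (a list of conditions plus a conclusion) are hypothetical choices.

```python
SUNBURN_TREE = ("Hair", {
    "Blonde": ("Lotion", {"Yes": "None", "No": "Burn"}),
    "Red": "Burn",
    "Brown": "None",
})


def tree_to_rules(tree, conditions=()):
    """Return one (antecedents, consequent) pair per root-to-leaf path."""
    if isinstance(tree, str):
        return [(list(conditions), tree)]             # leaf label is the consequent
    feature, branches = tree
    rules = []
    for value, subtree in branches.items():
        # Each branch adds one more test to the rule's antecedents.
        rules.extend(tree_to_rules(subtree, conditions + ((feature, value),)))
    return rules


rules = tree_to_rules(SUNBURN_TREE)
for antecedents, consequent in rules:
    tests = " and ".join(f"{f} = {v}" for f, v in antecedents)
    print(f"if {tests} then {consequent}")
```

The four paths yield the same four rules listed above, one per leaf.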

Identification Trees

• Train:
  – Build tree by forming subsets of least disorder
• Predict:
  – Traverse tree based on feature tests
  – Assign leaf node sample label
• Pros: Robust to irrelevant features and some noise; fast prediction; readable as rules
• Cons: Poor at modeling feature combinations and dependencies; building the optimal tree is intractable
