CC282 Decision trees
Lecture 2 slides for CC282 Machine Learning, R. Palaniappan, 2008 1
Lecture 2 - Outline
• More ML principles:
  – Concept learning
  – Hypothesis space
  – Generalisation and overfitting
  – Model (hypothesis) evaluation
• Inductive learning
  – Inductive bias
  – Decision trees
  – ID3 algorithm (entropy, information gain)
Concept learning
• A concept, c, is the problem to be learned
  – Example: a classification problem faced by an optician
    • Concept: whether or not to fit contact lenses, based on the user's budget, eye condition, environment, etc.
    • Inputs, x: user's budget, user's eye condition, user's environment
    • Output, y: to fit or not to fit
• A learning model is needed to learn a concept
• The learning model should ideally:
  – Capture the training data, <x, y> -> descriptive ability
  – Generalise to unseen test data, <xnew, ?> -> predictive ability
  – Provide a plausible explanation of the learned concept, c -> explanatory ability
  – But descriptive and predictive abilities are generally considered sufficient
[Diagram: input x -> concept c -> output y]
Learning a concept
• Concept learning: given many examples <input, output> of what c does, find a function h that approximates c
• The number of examples is usually a small subset of all possible <input, output> pairs
• h is known as a hypothesis (i.e. the learning model)
• There may be a number of candidate hypotheses - we select h from a hypothesis space H
• If the hypothesis matches the behaviour of the target concept for all training data, it is a consistent hypothesis
• Occam's razor
  – The simpler hypothesis that fits c is preferred
  – A simpler h means a shorter, smaller h
  – A simpler h is unlikely to fit merely by coincidence
• Learning == searching H for an appropriate h
  – Realisable task: H contains an h that fits the concept
  – Unrealisable task: H does not contain an h that fits the concept
More terms - Generalisation, overfitting, induction, deduction
• Generalisation
  – The ability of the trained model to perform well on test data
• Overfitting
  – When the model learns the training data well but performs poorly on the test data
• Inductive learning (induction)
  – Learning a hypothesis by example: the system tries to induce a general rule/model from a set of observed instances/samples
• Inductive bias
  – Since many choices of h exist in H, any preference for one hypothesis over another without prior knowledge is called bias
  – A hypothesis consistent with the training examples will hopefully generalise to unseen examples - the trick is to find the right bias
• An unbiased learner
  – Can never generalise, so it is not practically useful
• Deduction
  – ML gives an output (prediction, classification, etc.) based on the previously acquired learning
Generalisation and overfitting example
• Assume we have inputs x and corresponding outputs y, and we wish to find a concept c that maps x to y
• Examples of hypotheses (plotted as curves fitted to the data on the original slide):
  – h1 will give good generalisation
  – h2 is overfitted
Model (hypothesis) evaluation
• We need a performance measure to estimate how well the model h approximates c, i.e. how good is h?
• Possible evaluation methods
  – Explanatory: gives a qualitative evaluation
  – Descriptive: gives a quantitative (numerical) evaluation
• Explanatory evaluation
  – Does the model provide a plausible description of the learned concept?
  – Classification: does it base its classification on plausible rules?
  – Association: does it discover plausible relationships in the data?
  – Clustering: does it come up with plausible clusters?
  – The meaning of "plausible" is to be defined by a human expert
  – Hence, not popular in ML
Descriptive evaluation
• Example: bowel cancer classification problem
  – True positives (TP): diseased patients identified as having cancer
  – True negatives (TN): healthy subjects identified as healthy
  – False negatives (FN): the test identifies a cancer patient as healthy
  – False positives (FP): the test identifies a healthy subject as having cancer
• Precision
• Sensitivity (Recall)
• F measure (balanced F score)
• Simple classification accuracy
                          Patients with bowel cancer
                          True          False
Blood test   Positive     TP = 2        FP = 18
             Negative     FN = 1        TN = 182
Source: Wikipedia
precision = TP/(TP + FP) = 2/(2 + 18) = 10%

recall = TP/(TP + FN) = 2/(2 + 1) = 66.67%

F = 2 × precision × recall / (precision + recall) = 17.39%

Accuracy = (TP + TN)/total_cases = 184/203 = 90.64%
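The four figures above follow directly from the confusion-matrix counts; a minimal Python sketch (variable names are ours, not from the slides):

```python
# Confusion-matrix counts from the bowel-cancer example
TP, FP, FN, TN = 2, 18, 1, 182

precision = TP / (TP + FP)                                  # 2/20
recall    = TP / (TP + FN)                                  # 2/3
f_score   = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy  = (TP + TN) / (TP + FP + FN + TN)                 # 184/203

print(round(precision, 4), round(recall, 4),
      round(f_score, 4), round(accuracy, 4))
```

Note that accuracy alone is misleading here: a test that always answered "healthy" would score 201/203 while detecting no cancer at all, which is why precision and recall are reported separately.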
Descriptive evaluation (contd)
• For prediction problems, the mean square error (MSE) is used, where:
  – di is the desired output in the data set
  – ai is the actual output from the model
  – n is the number of instances in the data set
• Example: if n=2, d1=1.0, a1=0.5, d2=0, a2=1.0, then MSE = ((0.5)² + (1.0)²)/2 = 0.625
• Sometimes, the root mean square error is used instead: RMSE = sqrt(MSE)
MSE = (1/n) Σ (i=1 to n) (di - ai)²
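The MSE and RMSE definitions can be sketched in a few lines of Python (note the 1/n factor in the formula above, which the worked numbers must include):

```python
import math

def mse(desired, actual):
    """Mean square error: (1/n) * sum((d_i - a_i)^2)."""
    n = len(desired)
    return sum((d - a) ** 2 for d, a in zip(desired, actual)) / n

# Slide example: n=2, d1=1.0, a1=0.5, d2=0, a2=1.0
d = [1.0, 0.0]
a = [0.5, 1.0]
print(mse(d, a))             # (0.25 + 1.0) / 2 = 0.625
print(math.sqrt(mse(d, a)))  # RMSE
```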
Decision trees (DT)
• A simple yet successful form of inductive learning
• Consider the example of deciding whether to play tennis
• Attributes (features)
  – Outlook, Temp, Humidity, Wind
• Values
  – Descriptions of the features
  – E.g. Outlook values: sunny, cloudy, rainy
• Target
  – Play; represents the output of the model
• Instances
  – Examples D1 to D14 of the dataset
• Concept
  – Learn to decide whether to play tennis, i.e. find h from the given data set
Day Outlook Temp Humidity Wind Play
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Cloudy Hot High Weak Yes
D4 Rainy Mild High Weak Yes
D5 Rainy Cool Normal Weak Yes
D6 Rainy Cool Normal Strong No
D7 Cloudy Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rainy Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Cloudy Mild High Strong Yes
D13 Cloudy Hot Normal Weak Yes
D14 Rainy Mild High Strong No
Adapted from Mitchell, 1997
Decision trees (DT)
• A decision tree takes a set of properties as input and provides a decision as output
  – Each row of the table corresponds to a path in the tree
  – A decision tree may form a more compact representation, especially if many attributes are irrelevant
• DT is a suitable learning method when:
  – Instances are describable by attribute-value pairs
  – The target function is discrete valued (e.g. YES, NO)
  – The training data is possibly noisy
• It is not suitable (without further adaptation):
  – When attribute values and/or the target are numerical
    • E.g. attribute values: Temp = 22°C, Wind = 25 mph; target = 70%, 30%
  – When the function requires an exponentially large decision tree, e.g. the parity function
Forming rules from DT
• Example of concept: 'Should I play tennis today?'
  – Takes inputs (a set of attributes)
  – Outputs a decision (say YES/NO)
• Each non-leaf node is an attribute
  – The first non-leaf node is the root node
• Each leaf node is either Yes or No
• Each link (branch) is labelled with the possible values of the associated attribute
• Rule formation
  – A decision tree can be expressed as a disjunction of conjunctions:
    PLAY tennis IF (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy ∧ Wind = Weak)
  – ∨ is the disjunction operator (OR)
  – ∧ is the conjunction operator (AND)
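The disjunction of conjunctions above reads directly as nested conditionals; a minimal sketch (the function name is ours):

```python
def play_tennis(outlook, humidity, wind):
    """Return 'Yes'/'No' following the tennis decision tree."""
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Cloudy":
        return "Yes"
    if outlook == "Rainy":
        return "Yes" if wind == "Weak" else "No"

print(play_tennis("Sunny", "High", "Weak"))    # D1 -> No
print(play_tennis("Cloudy", "High", "Weak"))   # D3 -> Yes
```

Each `if` branch corresponds to one conjunction in the rule, and the function returning "Yes" on any of them implements the disjunction.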
[Tree diagram: Outlook at the root; Sunny branch -> Humidity (High: No, Normal: Yes); Cloudy branch -> Yes; Rainy branch -> Wind (Strong: No, Weak: Yes)]
Another DT example
• Another example (from Lecture 1)
• Reading the tree on the right:
  – If parents visiting = yes, then go to the cinema
  – or, if parents visiting = no and weather = sunny, then play tennis
  – or, if parents visiting = no and weather = windy and money = rich, then go shopping
  – or, if parents visiting = no and weather = windy and money = poor, then go to the cinema
  – or, if parents visiting = no and weather = rainy, then stay in
Source: http://wwwhomes.doc.ic.ac.uk/~sgc/teaching/v231/lecture10.html
Obtaining DT through top-down induction
• How can we obtain a DT?
• Perform a top-down search through the space of possible decision trees:
  – Determine the attribute that best classifies the training data
  – Use this attribute as the root of the tree
  – Repeat this process for each branch, from left to right
  – Proceed to the next level and determine the next best feature
  – Repeat until a leaf is reached
• How do we choose the best attribute?
  – Choose the attribute that yields the most information (i.e. the attribute with the highest information gain)
Information gain
• Information gain -> a reduction in entropy, E
• But what is entropy?
  – In thermodynamics: the amount of energy that cannot be used to do work
  – In information theory: a measure of disorder in a system (high entropy = high disorder), measured in bits
• In the entropy formula:
  – S is the training data set
  – c is the number of target classes
  – pi is the proportion of examples in S belonging to target class i
• Note: if your calculator doesn't do log2, use log2(x) = 1.443 ln(x) or 3.322 log10(x); for better accuracy, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2)
E(S) = - Σ (i=1 to c) pi log2(pi)
Entropy example
• A coin is flipped
  – If the coin is fair -> 50% chance of heads
  – Now, let us rig the coin -> so that heads comes up 99% of the time
• Let's look at this in terms of entropy:
  – Two outcomes, heads and tails, with probabilities phead and ptail
  – E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
  – E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) = 0.08 bit
• If the probability of heads = 1, then entropy = 0
  – E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit
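The coin values above can be checked with a small helper implementing the entropy definition (the 0·log2(0) term is taken as 0 by convention, hence the `p > 0` filter):

```python
import math

def entropy(probs):
    """Entropy in bits of a list of class proportions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([0.5, 0.5]), 2))    # fair coin   -> 1.0 bit
print(round(entropy([0.01, 0.99]), 2))  # rigged coin -> 0.08 bit
print(round(entropy([0.0, 1.0]), 2))    # certain outcome -> 0.0 bit
```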
Information Gain
• Information gain, G, is defined as shown below, where:
  – Values(A) is the set of all possible values of attribute A
  – Sv is the subset of S for which A has value v
  – |S| is the size of S and |Sv| is the size of Sv
• The information gain is the expected reduction in entropy caused by knowing the value of attribute A
Gain(S, A) = E(S) - Σ (v ∈ Values(A)) (|Sv|/|S|) E(Sv)
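The gain formula can be sketched for a dataset stored as a list of dicts; this is our own illustration, not code from the lecture, shown here on just the Wind/Play columns of the play-tennis table:

```python
import math
from collections import Counter

def entropy_of(rows, target):
    """Entropy of the target-class distribution in rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(rows, attr, target):
    """Gain(S, A) = E(S) - sum over values v of (|Sv|/|S|) * E(Sv)."""
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        sub = [r for r in rows if r[attr] == v]
        remainder += len(sub) / n * entropy_of(sub, target)
    return entropy_of(rows, target) - remainder

# Wind and Play columns for D1..D14
data = [
    {"Wind": w, "Play": p}
    for w, p in [
        ("Weak", "No"), ("Strong", "No"), ("Weak", "Yes"), ("Weak", "Yes"),
        ("Weak", "Yes"), ("Strong", "No"), ("Strong", "Yes"), ("Weak", "No"),
        ("Weak", "Yes"), ("Weak", "Yes"), ("Strong", "Yes"), ("Strong", "Yes"),
        ("Weak", "Yes"), ("Strong", "No"),
    ]
]
print(round(gain(data, "Wind", "Play"), 3))  # -> 0.048
```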
Example – entropy calculation
• Compute the entropy of the play-tennis example:
  – We have two classes, YES and NO (i.e. number of classes c = 2)
  – We have 14 instances, with 9 classified as YES and 5 as NO
• EYES = -(9/14) log2(9/14) = 0.41
• ENO = -(5/14) log2(5/14) = 0.53
• E(S) = EYES + ENO = 0.94
Example – information gain calculation
• Compute the information gain for the attribute Wind in the play-tennis data set:
  – |S| = 14
  – Attribute Wind has two values: Weak and Strong
  – |Sweak| = 8
  – |Sstrong| = 6
Example – information gain calculation
• Now, let us determine E(Sweak)
  – Instances = 8, with YES = 6 and NO = 2, i.e. [6+, 2-]
  – E(Sweak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81
Example – information gain calculation
• Now, let us determine E(Sstrong)
  – Instances = 6, with YES = 3 and NO = 3, i.e. [3+, 3-]
  – E(Sstrong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
• Note: do not waste time computing when pYES = pNO - the entropy is always 1
Example – information gain calculation
• Going back to the information gain computation for the attribute Wind:

  Gain(S, Wind) = E(S) - (|Sweak|/|S|) E(Sweak) - (|Sstrong|/|S|) E(Sstrong)
                = 0.94 - (8/14)(0.81) - (6/14)(1.00)
                = 0.048
Example – information gain calculation
• Now, compute the information gain for the attribute Humidity in the play-tennis data set:
  – |S| = 14
  – Attribute Humidity has two values: High and Normal
  – |Shigh| = 7, |Snormal| = 7
  – For value High -> [3+, 4-]
  – For value Normal -> [6+, 1-]
Example – information gain calculation
• Completing the information gain computation for the attribute Humidity:

  Gain(S, Humidity) = E(S) - (|Shigh|/|S|) E(Shigh) - (|Snormal|/|S|) E(Snormal)
                    = 0.94 - (7/14)(0.98) - (7/14)(0.59)
                    = 0.15
E(Shigh)=-(3/7)log2(3/7)-(4/7)log2(4/7)=0.98
E(Snormal)=-(6/7)log2(6/7)-(1/7)log2(1/7)=0.59
So, humidity provides GREATER information gain than wind
Example – information gain calculation
• Now, compute the information gain for the attributes Outlook and Temperature in the play-tennis data set (same procedure as before):
  – Gain(S, Outlook) = 0.25
  – Gain(S, Temp) = 0.03
• Collecting all four gains:
  – Gain(S, Outlook) = 0.25, Gain(S, Temp) = 0.03, Gain(S, Humidity) = 0.15, Gain(S, Wind) = 0.048
• The attribute with the highest information gain is OUTLOOK, so use Outlook as the root node
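All four gains can be reproduced from the full play-tennis table; a sketch under our own data layout (list of dicts, column names as in the table):

```python
import math
from collections import Counter

COLS = ["Outlook", "Temp", "Humidity", "Wind", "Play"]
ROWS = [  # D1..D14
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Cloudy", "Hot", "High", "Weak", "Yes"),   ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"), ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Cloudy", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Cloudy", "Mild", "High", "Strong", "Yes"),
    ("Cloudy", "Hot", "Normal", "Weak", "Yes"), ("Rainy", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(COLS, r)) for r in ROWS]

def entropy_of(rows):
    counts = Counter(r["Play"] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    n = len(rows)
    rem = 0.0
    for v in set(r[attr] for r in rows):
        sub = [r for r in rows if r[attr] == v]
        rem += len(sub) / n * entropy_of(sub)
    return entropy_of(rows) - rem

for a in ["Outlook", "Temp", "Humidity", "Wind"]:
    print(a, round(gain(DATA, a), 3))
# The printed gains match the slide's 0.25 / 0.03 / 0.15 / 0.048 to rounding,
# and Outlook has the highest gain, so it becomes the root node.
```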
DT – next level
• After determining OUTLOOK as the root node, we need to expand the tree
• Consider the Sunny branch: 5 instances (D1, D2, D8, D9, D11), i.e. [2+, 3-]
• E(Ssunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
DT – next level
• Gain(Ssunny, Humidity) = 0.97 - (3/5)(0.0) - (2/5)(0.0) = 0.97
• Gain(Ssunny, Wind) = 0.97 - (3/5)(0.918) - (2/5)(1.0) = 0.019
• Gain(Ssunny, Temperature) = 0.97 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57
• Humidity has the highest information gain, so use this attribute next
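The Sunny-branch gains can be checked numerically on the five Sunny instances (D1, D2, D8, D9, D11); this is our own verification sketch, and the values match the slide's 0.97 / 0.019 / 0.57 to rounding:

```python
import math
from collections import Counter

SUNNY = [  # Temp, Humidity, Wind, Play for D1, D2, D8, D9, D11
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "Play": "No"},
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "Play": "No"},
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "Play": "No"},
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},
    {"Temp": "Mild", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
]

def entropy_of(rows):
    counts = Counter(r["Play"] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    n = len(rows)
    rem = 0.0
    for v in set(r[attr] for r in rows):
        sub = [r for r in rows if r[attr] == v]
        rem += len(sub) / n * entropy_of(sub)
    return entropy_of(rows) - rem

for a in ["Humidity", "Wind", "Temp"]:
    print(a, round(gain(SUNNY, a), 3))
```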
Continue ….. and Final DT
• Continue until all the examples are classified
  – Compute Gain(Srainy, Wind), Gain(Srainy, Humidity), Gain(Srainy, Temp)
  – Gain(Srainy, Wind) is the highest
• All leaf nodes are associated with training examples from the same class (entropy = 0)
• The attribute Temperature is not used
ID3 algorithm –pseudocode
• Sufficient for exam
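The pseudocode on the original slide is an image and is not reproduced here; the following is a minimal Python sketch of the ID3 procedure as described in this lecture (function and variable names are ours, and there is no handling of unseen attribute values):

```python
import math
from collections import Counter

def entropy_of(rows, target):
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def id3(rows, attrs, target):
    """Return a nested-dict decision tree: {attr: {value: subtree-or-leaf}}."""
    labels = [r[target] for r in rows]
    # Base case 1: all examples share a class -> leaf node
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left -> majority-class leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]

    # Choose the attribute with the highest information gain
    def gain(attr):
        n = len(rows)
        rem = 0.0
        for v in set(r[attr] for r in rows):
            sub = [r for r in rows if r[attr] == v]
            rem += len(sub) / n * entropy_of(sub, target)
        return entropy_of(rows, target) - rem

    best = max(attrs, key=gain)
    tree = {best: {}}
    # Grow one branch per value of the chosen attribute
    for v in set(r[best] for r in rows):
        sub = [r for r in rows if r[best] == v]
        rest = [a for a in attrs if a != best]
        tree[best][v] = id3(sub, rest, target)
    return tree
```

Run on the play-tennis table, this sketch picks Outlook at the root, Humidity under Sunny, and Wind under Rainy, with Temperature never used, matching the tree built by hand above.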
ID3 algorithm –pseudocode (Mitchell)
• From Mitchell (1997) – not important for exam
Search strategy in ID3
• Complete hypothesis space: any finite discrete-valued function can be expressed
• Incomplete search: searches incompletely through the hypothesis space, stopping once the tree is consistent with the data
• Single hypothesis: only one current hypothesis (the simplest one) is maintained
• No backtracking: once an attribute is selected, this cannot be changed. Problem: the result might not be the (globally) optimal solution
• Full training set: attributes are selected by computing information gain on the full training set. Advantage: robustness to errors. Problem: non-incremental
Lecture 2 summary
• From this lecture, you should be able to:
  – Define concept, learning model, hypothesis, hypothesis space, consistent hypothesis, inductive learning & bias, realisable & unrealisable tasks, and Occam's razor in view of ML
  – Differentiate between generalisation and overfitting
  – Define entropy & information gain and know how to calculate them for a given data set
  – Explain the ID3 algorithm, how it works, and describe it in pseudocode
  – Apply the ID3 algorithm to a given data set