Decision trees, entropy, information gain, ID3
Relevant Readings: Sections 3.1 through 3.6 in Mitchell
CS495 - Machine Learning, Fall 2009
Decision trees
- Decision trees are commonly used in learning algorithms
- For simplicity, we will restrict to discrete-valued attributes and boolean target functions in this discussion
- In a decision tree, every internal node v considers some attribute a of the given test instance
- Each child of v corresponds to some value that a can take on; v just passes the instance to the appropriate child
- Every leaf node has a label "yes" or "no", which is used to classify test instances that end up there
- Do an example using the lakes data
- If someone gave you a decision tree, could you give a boolean function in Disjunctive Normal Form (DNF) that represents it?
- If someone gave you a boolean function in DNF, could you give a decision tree that represents it?
- A decision tree can be used to represent any boolean function on discrete attributes (see the sketches below)
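As a concrete illustration (not from Mitchell; the attributes "sky" and "wind" and their values are made up), here is a minimal Python sketch of a decision tree as nested dictionaries, classifying an instance by walking from the root to a leaf:

```python
# A hand-built tree over hypothetical attributes "sky" and "wind".
# Internal nodes test an attribute; leaves are the labels "yes"/"no".
tree = {
    "attribute": "sky",
    "children": {
        "sunny": "yes",
        "cloudy": {
            "attribute": "wind",
            "children": {"weak": "yes", "strong": "no"},
        },
        "rainy": "no",
    },
}

def classify(node, instance):
    """Walk from the root: at each internal node, follow the child
    matching the instance's value for that node's attribute."""
    while isinstance(node, dict):          # internal node
        node = node["children"][instance[node["attribute"]]]
    return node                            # leaf label: "yes" or "no"

print(classify(tree, {"sky": "cloudy", "wind": "weak"}))  # -> yes
```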
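To answer the first DNF question above: each root-to-leaf path that ends in a "yes" leaf is one conjunction, and the tree represents the disjunction of those conjunctions. A sketch, reusing `tree` from the previous example:

```python
def dnf(node, path=()):
    """Collect one conjunction (a tuple of attribute=value tests) per
    root-to-leaf "yes" path; their disjunction represents the tree."""
    if not isinstance(node, dict):                 # leaf
        return [path] if node == "yes" else []
    terms = []
    for value, child in node["children"].items():
        terms += dnf(child, path + ((node["attribute"], value),))
    return terms

# For the tree above this prints
# [(('sky', 'sunny'),), (('sky', 'cloudy'), ('wind', 'weak'))],
# i.e. (sky=sunny) OR (sky=cloudy AND wind=weak)
print(dnf(tree))
```

For the converse direction, one can always build a (possibly large) tree that tests enough attributes to decide each conjunction of the formula.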
How do we build the tree?
- In the simplest variant, we start with a root node and recursively add child nodes until we fit the training data perfectly
- The only question is which attribute to use first
- Q: We want to use the attribute that does the "best job" of splitting up the training data, but how can this be measured?
- A: We will use entropy and, in particular, information gain
Entropy
- Entropy is a measure of disorder or impurity
- In our case, we will be interested in the entropy of the output values of a set of training instances
- Entropy can be understood as the average number of bits needed to encode an output value
- If the output values are split 50%-50%, then the set is disorderly (impure)
  - The entropy is 1 (an average of 1 bit of information to encode an output value)
- If the output values are all positive (or all negative), the set is quite orderly (pure)
  - The entropy is 0 (an average of 0 bits of information to encode an output value)
- If the output values are split 25%-75%, then the entropy turns out to be about 0.811
- Let S be a set of training examples, let p be the fraction of positive examples, and let q be the fraction of negative examples
- Definition: Entropy(S) = −p log2(p) − q log2(q), with the convention that 0 log2(0) = 0 (a sketch follows below)
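A minimal sketch of the boolean entropy formula above, checking the three splits just mentioned (the function and variable names are my own):

```python
import math

def entropy(p):
    """Entropy(S) = -p log2(p) - q log2(q), where p is the fraction of
    positive examples and q = 1 - p; 0 log2(0) is taken to be 0."""
    return sum(-x * math.log2(x) for x in (p, 1.0 - p) if x > 0)

print(entropy(0.5))   # 1.0    (50%-50%: maximally impure)
print(entropy(1.0))   # 0.0    (all positive: pure)
print(entropy(0.25))  # ~0.811 (25%-75% split)
```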
Entropy
- Entropy can be generalized from boolean to discrete-valued target functions
- The idea is the same: entropy is the average number of bits needed to encode an output value
- Let S be a set of instances, and let p_i be the fraction of instances in S with output value i
- Definition: Entropy(S) = −Σ_i p_i log2(p_i) (see the sketch below)
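The same idea in code, now over arbitrary discrete output values (a sketch; the boolean case falls out as a special case):

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2(p_i), where p_i is the fraction of
    `labels` equal to output value i."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))  # 1.0    (boolean 50%-50%)
print(entropy(["yes", "no", "no", "no"]))   # ~0.811 (boolean 25%-75%)
print(entropy(["a", "b", "c", "d"]))        # 2.0    (four equally likely values)
```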
Information gain
- The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets
- Information gain is a measure of this change in entropy
- Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
  (recall that |S| denotes the size of the set S)
- If you just count the number of bits of information everywhere, you end up with something equivalent:
  Total gain in bits = |S| · Gain(S, A) = |S| · Entropy(S) − Σ_{v ∈ Values(A)} |Sv| · Entropy(Sv)
  (see the sketch below)
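A sketch of Gain(S, A), building on the multi-class `entropy` above; instances are represented as dictionaries mapping attribute names to values, and the tiny data set here is invented for illustration:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv),
    where Sv holds the labels of the instances with A = v."""
    subsets = defaultdict(list)
    for instance, label in zip(instances, labels):
        subsets[instance[attribute]].append(label)   # partition S by A's value
    n = len(labels)
    remainder = sum(len(sv) / n * entropy(sv) for sv in subsets.values())
    return entropy(labels) - remainder

instances = [{"wind": "weak"}, {"wind": "weak"},
             {"wind": "strong"}, {"wind": "strong"}]
labels = ["yes", "yes", "yes", "no"]
print(information_gain(instances, labels, "wind"))   # ~0.311
```

(Check against the definition: Entropy(S) ≈ 0.811, the "weak" subset is pure, the "strong" subset has entropy 1, so the remainder is 0.5 and the gain is about 0.311.)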
ID3
- The essentials:
  - Start with all training instances associated with the root node
  - Use information gain to choose which attribute to label each node with
  - No root-to-leaf path should contain the same discrete attribute twice
  - Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree
- The border cases:
  - If all positive or all negative training instances remain, label that node "yes" or "no" accordingly
  - If no attributes remain, label with a majority vote of the training instances left at that node
  - If no instances remain, label with a majority vote of the parent's training instances
- The pseudocode: Table 3.1 on page 56 in Mitchell (a sketch in code follows below)
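A recursive sketch of the essentials and border cases above, reusing `information_gain` from the previous example and the dict-based tree representation from the first sketch (this is my own rendering, not Mitchell's Table 3.1):

```python
from collections import Counter

def id3(instances, labels, attributes):
    """Grow a tree on `instances` (dicts) with parallel `labels`,
    choosing among the as-yet-unused `attributes` on this path."""
    # Border case: all remaining labels agree -> leaf with that label
    if len(set(labels)) == 1:
        return labels[0]
    # Border case: no attributes remain -> majority-vote leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(instances, labels, a))
    remaining = [a for a in attributes if a != best]   # never reuse on this path
    node = {"attribute": best, "children": {},
            # Majority label at this node; used at classification time for
            # attribute values with no remaining instances (third border case)
            "default": Counter(labels).most_common(1)[0][0]}
    for value in {inst[best] for inst in instances}:
        pairs = [(i, l) for i, l in zip(instances, labels) if i[best] == value]
        sub_instances, sub_labels = (list(x) for x in zip(*pairs))
        node["children"][value] = id3(sub_instances, sub_labels, remaining)
    return node
```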
Notes on ID3
- Inductive bias
  - Shorter trees are preferred over longer trees (but size is not necessarily minimized)
  - Note that this is a form of Ockham's Razor
  - High information gain attributes are placed nearer to the root of the tree
- Positives
  - Robust to errors in the training data
  - Training is reasonably fast
  - Classifying new examples is very fast
- Negatives
  - Difficult to extend to real-valued target functions
  - Need to adapt the algorithm to continuous attributes
  - The version of ID3 discussed above can overfit the data
Avoiding overfitting
- Some possibilities:
  - Use a tuning set to decide when to stop growing the tree
  - Use a tuning set to prune the tree after it's completely grown
    - Keep removing nodes as long as it doesn't hurt performance on the tuning set (a sketch follows below)
  - Use the Minimum Description Length principle
    - Track the number of bits used to describe the tree itself when calculating information gain
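A sketch of the second possibility (post-pruning against a tuning set), using the dict trees grown by the `id3` sketch above; the greedy strategy and helper names here are my own, not a prescribed algorithm:

```python
def classify(node, instance):
    """Like the earlier classifier, but fall back to the node's stored
    majority label when the instance has an unseen attribute value."""
    while isinstance(node, dict):
        node = node["children"].get(instance[node["attribute"]], node["default"])
    return node

def accuracy(tree, instances, labels):
    hits = sum(classify(tree, i) == l for i, l in zip(instances, labels))
    return hits / len(labels)

def internal_nodes(tree):
    """Yield (children_dict, key) handles for internal nodes below the
    root, so each subtree can be swapped for a leaf in place."""
    stack = [tree]
    while stack:
        node = stack.pop()
        for key, child in node["children"].items():
            if isinstance(child, dict):
                yield node["children"], key
                stack.append(child)

def prune(tree, tuning_instances, tuning_labels):
    """Keep replacing subtrees with their majority-label leaves as long
    as tuning-set accuracy does not get worse."""
    if not isinstance(tree, dict):             # tree is already a single leaf
        return tree
    pruned = True
    while pruned:
        pruned = False
        base = accuracy(tree, tuning_instances, tuning_labels)
        for children, key in list(internal_nodes(tree)):
            subtree = children[key]
            children[key] = subtree["default"]       # tentatively prune
            if accuracy(tree, tuning_instances, tuning_labels) >= base:
                pruned = True                        # keep the prune, rescan
                break
            children[key] = subtree                  # hurt accuracy: revert
    return tree
```

Each accepted prune removes at least one internal node, so the loop terminates.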