Decision trees, entropy, information gain, ID3
Relevant Readings: Sections 3.1 through 3.6 in Mitchell
CS495 - Machine Learning, Fall 2009
Decision trees
- Decision trees are commonly used in learning algorithms
- For simplicity, we will restrict to discrete-valued attributes and boolean target functions in this discussion
- In a decision tree, every internal node v considers some attribute a of the given test instance
- Each child of v corresponds to some value that a can take on; v just passes the instance to the appropriate child
- Every leaf node has a label "yes" or "no", which is used to classify test instances that end up there
- Do an example using the lakes data
- If someone gave you a decision tree, could you give a boolean function in Disjunctive Normal Form (DNF) that represents it?
- If someone gave you a boolean function in DNF, could you give a decision tree that represents it?
- A decision tree can be used to represent any boolean function on discrete attributes (see the sketches below)
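As a concrete illustration (not from Mitchell; the attributes "sky" and "wind" and their values are made up), here is a minimal Python sketch of a decision tree as nested dictionaries, classifying an instance by walking from the root to a leaf:

```python
# A hand-built tree over hypothetical attributes "sky" and "wind".
# Internal nodes test an attribute; leaves are the labels "yes"/"no".
tree = {
    "attribute": "sky",
    "children": {
        "sunny": "yes",
        "cloudy": {
            "attribute": "wind",
            "children": {"weak": "yes", "strong": "no"},
        },
        "rainy": "no",
    },
}

def classify(node, instance):
    """Walk from the root: at each internal node, follow the child
    matching the instance's value for that node's attribute."""
    while isinstance(node, dict):          # internal node
        node = node["children"][instance[node["attribute"]]]
    return node                            # leaf label: "yes" or "no"

print(classify(tree, {"sky": "cloudy", "wind": "weak"}))  # -> yes
```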
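To answer the first DNF question above: each root-to-leaf path that ends in a "yes" leaf is one conjunction, and the tree represents the disjunction of those conjunctions. A sketch, reusing `tree` from the previous example:

```python
def dnf(node, path=()):
    """Collect one conjunction (a tuple of attribute=value tests) per
    root-to-leaf "yes" path; their disjunction represents the tree."""
    if not isinstance(node, dict):                 # leaf
        return [path] if node == "yes" else []
    terms = []
    for value, child in node["children"].items():
        terms += dnf(child, path + ((node["attribute"], value),))
    return terms

# For the tree above this prints
# [(('sky', 'sunny'),), (('sky', 'cloudy'), ('wind', 'weak'))],
# i.e. (sky=sunny) OR (sky=cloudy AND wind=weak)
print(dnf(tree))
```

For the converse direction, one can always build a (possibly large) tree that tests enough attributes to decide each conjunction of the formula.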
How do we build the tree?
- In the simplest variant, we start with a root node and recursively add child nodes until we fit the training data perfectly
- The only question is which attribute to use first
- Q: We want to use the attribute that does the "best job" of splitting up the training data, but how can this be measured?
- A: We will use entropy and, in particular, information gain
Entropy
- Entropy is a measure of disorder or impurity
- In our case, we will be interested in the entropy of the output values of a set of training instances
- Entropy can be understood as the average number of bits needed to encode an output value
- If the output values are split 50%-50%, then the set is disorderly (impure)
  - The entropy is 1 (an average of 1 bit of information to encode an output value)
- If the output values are all positive (or all negative), the set is quite orderly (pure)
  - The entropy is 0 (an average of 0 bits of information to encode an output value)
- If the output values are split 25%-75%, then the entropy turns out to be about 0.811
- Let S be a set of training examples, let p be the fraction of positive examples, and let q be the fraction of negative examples
- Definition: Entropy(S) = −p log2(p) − q log2(q), with the convention that 0 log2(0) = 0 (a sketch follows below)
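A minimal sketch of the boolean entropy formula above, checking the three splits just mentioned (the function and variable names are my own):

```python
import math

def entropy(p):
    """Entropy(S) = -p log2(p) - q log2(q), where p is the fraction of
    positive examples and q = 1 - p; 0 log2(0) is taken to be 0."""
    return sum(-x * math.log2(x) for x in (p, 1.0 - p) if x > 0)

print(entropy(0.5))   # 1.0    (50%-50%: maximally impure)
print(entropy(1.0))   # 0.0    (all positive: pure)
print(entropy(0.25))  # ~0.811 (25%-75% split)
```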
Entropy
- Entropy can be generalized from boolean to discrete-valued target functions
- The idea is the same: entropy is the average number of bits needed to encode an output value
- Let S be a set of instances, and let p_i be the fraction of instances in S with output value i
- Definition: Entropy(S) = −Σ_i p_i log2(p_i) (see the sketch below)
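The same idea in code, now over arbitrary discrete output values (a sketch; the boolean case falls out as a special case):

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2(p_i), where p_i is the fraction of
    `labels` equal to output value i."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))  # 1.0    (boolean 50%-50%)
print(entropy(["yes", "no", "no", "no"]))   # ~0.811 (boolean 25%-75%)
print(entropy(["a", "b", "c", "d"]))        # 2.0    (four equally likely values)
```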
Information gain
- The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets
- Information gain is a measure of this change in entropy
- Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
  (recall that |S| denotes the size of the set S)
- If you just count the number of bits of information everywhere, you end up with something equivalent:
  Total gain in bits = |S| · Gain(S, A) = |S| · Entropy(S) − Σ_{v ∈ Values(A)} |Sv| · Entropy(Sv)
  (see the sketch below)
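A sketch of Gain(S, A), building on the multi-class `entropy` above; instances are represented as dictionaries mapping attribute names to values, and the tiny data set here is invented for illustration:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv),
    where Sv holds the labels of the instances with A = v."""
    subsets = defaultdict(list)
    for instance, label in zip(instances, labels):
        subsets[instance[attribute]].append(label)   # partition S by A's value
    n = len(labels)
    remainder = sum(len(sv) / n * entropy(sv) for sv in subsets.values())
    return entropy(labels) - remainder

instances = [{"wind": "weak"}, {"wind": "weak"},
             {"wind": "strong"}, {"wind": "strong"}]
labels = ["yes", "yes", "yes", "no"]
print(information_gain(instances, labels, "wind"))   # ~0.311
```

(Check against the definition: Entropy(S) ≈ 0.811, the "weak" subset is pure, the "strong" subset has entropy 1, so the remainder is 0.5 and the gain is about 0.311.)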
ID3
- The essentials:
  - Start with all training instances associated with the root node
  - Use information gain to choose which attribute to label each node with
  - No root-to-leaf path should contain the same discrete attribute twice
  - Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree
- The border cases:
  - If all positive or all negative training instances remain, label that node "yes" or "no" accordingly
  - If no attributes remain, label with a majority vote of the training instances left at that node
  - If no instances remain, label with a majority vote of the parent's training instances
- The pseudocode: Table 3.1 on page 56 in Mitchell (a sketch in code follows below)
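A recursive sketch of the essentials and border cases above, reusing `information_gain` from the previous example and the dict-based tree representation from the first sketch (this is my own rendering, not Mitchell's Table 3.1):

```python
from collections import Counter

def id3(instances, labels, attributes):
    """Grow a tree on `instances` (dicts) with parallel `labels`,
    choosing among the as-yet-unused `attributes` on this path."""
    # Border case: all remaining labels agree -> leaf with that label
    if len(set(labels)) == 1:
        return labels[0]
    # Border case: no attributes remain -> majority-vote leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(instances, labels, a))
    remaining = [a for a in attributes if a != best]   # never reuse on this path
    node = {"attribute": best, "children": {},
            # Majority label at this node; used at classification time for
            # attribute values with no remaining instances (third border case)
            "default": Counter(labels).most_common(1)[0][0]}
    for value in {inst[best] for inst in instances}:
        pairs = [(i, l) for i, l in zip(instances, labels) if i[best] == value]
        sub_instances, sub_labels = (list(x) for x in zip(*pairs))
        node["children"][value] = id3(sub_instances, sub_labels, remaining)
    return node
```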
Notes on ID3
- Inductive bias
  - Shorter trees are preferred over longer trees (but size is not necessarily minimized)
  - Note that this is a form of Ockham's Razor
  - High information gain attributes are placed nearer to the root of the tree
- Positives
  - Robust to errors in the training data
  - Training is reasonably fast
  - Classifying new examples is very fast
- Negatives
  - Difficult to extend to real-valued target functions
  - Need to adapt the algorithm to continuous attributes
  - The version of ID3 discussed above can overfit the data
Avoiding overfitting
- Some possibilities:
  - Use a tuning set to decide when to stop growing the tree
  - Use a tuning set to prune the tree after it's completely grown
    - Keep removing nodes as long as it doesn't hurt performance on the tuning set (a sketch follows below)
  - Use the Minimum Description Length principle
    - Track the number of bits used to describe the tree itself when calculating information gain
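A sketch of the second possibility (post-pruning against a tuning set), using the dict trees grown by the `id3` sketch above; the greedy strategy and helper names here are my own, not a prescribed algorithm:

```python
def classify(node, instance):
    """Like the earlier classifier, but fall back to the node's stored
    majority label when the instance has an unseen attribute value."""
    while isinstance(node, dict):
        node = node["children"].get(instance[node["attribute"]], node["default"])
    return node

def accuracy(tree, instances, labels):
    hits = sum(classify(tree, i) == l for i, l in zip(instances, labels))
    return hits / len(labels)

def internal_nodes(tree):
    """Yield (children_dict, key) handles for internal nodes below the
    root, so each subtree can be swapped for a leaf in place."""
    stack = [tree]
    while stack:
        node = stack.pop()
        for key, child in node["children"].items():
            if isinstance(child, dict):
                yield node["children"], key
                stack.append(child)

def prune(tree, tuning_instances, tuning_labels):
    """Keep replacing subtrees with their majority-label leaves as long
    as tuning-set accuracy does not get worse."""
    if not isinstance(tree, dict):             # tree is already a single leaf
        return tree
    pruned = True
    while pruned:
        pruned = False
        base = accuracy(tree, tuning_instances, tuning_labels)
        for children, key in list(internal_nodes(tree)):
            subtree = children[key]
            children[key] = subtree["default"]       # tentatively prune
            if accuracy(tree, tuning_instances, tuning_labels) >= base:
                pruned = True                        # keep the prune, rescan
                break
            children[key] = subtree                  # hurt accuracy: revert
    return tree
```

Each accepted prune removes at least one internal node, so the loop terminates.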