Decision trees, entropy, information gain, ID3

Relevant Readings: Sections 3.1 through 3.6 in Mitchell

CS495 - Machine Learning, Fall 2009

Decision trees

- Decision trees are commonly used in learning algorithms
- For simplicity, we will restrict to discrete-valued attributes and boolean target functions in this discussion
- In a decision tree, every internal node v considers some attribute a of the given test instance
- Each child of v corresponds to some value that a can take on; v just passes the instance to the appropriate child
- Every leaf node has a label "yes" or "no", which is used to classify test instances that end up there
- Do an example using the lakes data
- If someone gave you a decision tree, could you give a boolean function in Disjunctive Normal Form (DNF) that represents it?
- If someone gave you a boolean function in DNF, could you give a decision tree that represents it?
- A decision tree can be used to represent any boolean function on discrete attributes (a small code sketch of such a tree follows this list)
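
As a concrete illustration (not part of the slides), here is one way such a tree could be written down and evaluated in Python. The attribute names "depth" and "clarity" and the instance values are hypothetical, chosen only to echo the lakes example.

```python
# A decision tree over discrete attributes, stored as nested dictionaries.
# Internal nodes: {"attribute": name, "children": {value: subtree, ...}}
# Leaves: the labels "yes" or "no".
tree = {
    "attribute": "depth",
    "children": {
        "shallow": "no",
        "deep": {
            "attribute": "clarity",
            "children": {"clear": "yes", "murky": "no"},
        },
    },
}

def classify(tree, instance):
    """Walk down the tree, following the child for each attribute value,
    until a leaf label is reached."""
    while isinstance(tree, dict):
        value = instance[tree["attribute"]]
        tree = tree["children"][value]
    return tree  # "yes" or "no"

print(classify(tree, {"depth": "deep", "clarity": "clear"}))  # -> yes
```

Note that the "yes" paths of this tree read off a DNF formula directly: here the single yes-leaf corresponds to (depth = deep) AND (clarity = clear).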

How do we build the tree?

- In the simplest variant, we will start with a root node and recursively add child nodes until we fit the training data perfectly
- The only question is which attribute to use first
- Q: We want to use the attribute that does the "best job" splitting up the training data, but how can this be measured?
- A: We will use entropy and, in particular, information gain

Entropy

- Entropy is a measure of disorder or impurity
- In our case, we will be interested in the entropy of the output values of a set of training instances
- Entropy can be understood as the average number of bits needed to encode an output value
- If the output values are split 50%-50%, then the set is disorderly (impure)
  - The entropy is 1 (an average of 1 bit of information to encode an output value)
- If the output values are all positive (or all negative), the set is quite orderly (pure)
  - The entropy is 0 (an average of 0 bits of information to encode an output value)
- If the output values are split 25%-75%, then the entropy turns out to be about 0.811
- Let S be a set, let p be the fraction of positive training examples, and let q be the fraction of negative training examples
- Definition: Entropy(S) = -p log2(p) - q log2(q)  (a short code check of these values follows this list)
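
To make the numbers above concrete, here is a small Python check (a sketch, not from the slides) of the boolean-case entropy formula:

```python
from math import log2

def boolean_entropy(p):
    """Entropy of a boolean-labeled set, where p is the fraction of positive
    examples and q = 1 - p the fraction of negative examples.
    By convention, a class with probability 0 contributes 0 bits."""
    return -sum(x * log2(x) for x in (p, 1.0 - p) if x > 0)

print(boolean_entropy(0.5))   # 1.0    -- 50%-50% split: maximally impure
print(boolean_entropy(1.0))   # 0.0    -- all positive: pure
print(boolean_entropy(0.25))  # ~0.811 -- 25%-75% split
```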

Entropy

- Entropy can be generalized from boolean to discrete-valued target functions
- The idea is the same: entropy is the average number of bits needed to encode an output value
- Let S be a set of instances, and let p_i be the fraction of instances in S with output value i
- Definition: Entropy(S) = -Σ_i p_i log2(p_i)  (a code version for any number of classes follows this list)
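
A sketch of the general definition in code, using only the standard library; it replaces the boolean-only helper above by taking the list of output values directly:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of output values, for any number of classes:
    Entropy(S) = -sum_i p_i * log2(p_i), where p_i is the fraction of
    instances with output value i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))  # 1.0  (two classes, 50%-50%)
print(entropy(["a", "a", "b", "c"]))        # 1.5  (three classes, 50%-25%-25%)
```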

Information gain

- The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets
- Information gain is a measure of this change in entropy
- Definition: Suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
  (recall that |S| denotes the size of set S)
- If you just count the number of bits of information everywhere, you end up with something equivalent:
  Total gain in bits = |S| · Gain(S, A) = |S| · Entropy(S) - Σ_{v ∈ Values(A)} |S_v| · Entropy(S_v)
  (a code sketch of Gain follows this list)
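
A sketch of Gain(S, A) in code, under the assumption that each instance is a dict mapping attribute names to discrete values, and that the `entropy` helper from the previous sketch is in scope:

```python
from collections import defaultdict

def information_gain(instances, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v).
    Uses `entropy` from the earlier sketch."""
    subsets = defaultdict(list)
    for inst, label in zip(instances, labels):
        subsets[inst[attribute]].append(label)   # group the labels by A = v
    n = len(labels)
    remainder = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder

# A tiny check: a perfectly informative attribute gains the full 1 bit.
S = [{"windy": "yes"}, {"windy": "yes"}, {"windy": "no"}, {"windy": "no"}]
y = ["no", "no", "yes", "yes"]
print(information_gain(S, y, "windy"))  # 1.0
```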

ID3

- The essentials:
  - Start with all training instances associated with the root node
  - Use information gain to choose which attribute to label each node with
  - No root-to-leaf path should contain the same discrete attribute twice
  - Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree
- The border cases:
  - If all positive or all negative training instances remain, label that node "yes" or "no" accordingly
  - If no attributes remain, label with a majority vote of the training instances left at that node
  - If no instances remain, label with a majority vote of the parent's training instances
- The pseudocode:
  - Table 3.1 on page 56 in Mitchell (a compact sketch in code also follows this list)
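
For orientation only, here is a compact Python sketch of the procedure. It follows the essentials and border cases listed above but is not Mitchell's Table 3.1; it reuses the dict-based tree representation and assumes `information_gain` from the earlier sketch is in scope.

```python
from collections import Counter

def majority(labels):
    """Most common label among the given training labels."""
    return Counter(labels).most_common(1)[0][0]

def id3(instances, labels, attributes):
    """A compact sketch of ID3.  `attributes` maps each remaining attribute
    name to its list of possible values; `instances` are dicts of
    attribute -> value and `labels` are "yes"/"no"."""
    if len(set(labels)) == 1:          # all positive or all negative: pure leaf
        return labels[0]
    if not attributes:                 # no attributes remain: majority vote
        return majority(labels)
    # Choose the attribute with the highest information gain for this node.
    best = max(attributes, key=lambda a: information_gain(instances, labels, a))
    remaining = {a: vals for a, vals in attributes.items() if a != best}
    node = {"attribute": best, "children": {}}
    for value in attributes[best]:
        subset = [(x, y) for x, y in zip(instances, labels) if x[best] == value]
        if not subset:                 # no instances remain: parent's majority vote
            node["children"][value] = majority(labels)
        else:
            xs, ys = zip(*subset)
            node["children"][value] = id3(list(xs), list(ys), remaining)
    return node
```

Because `best` is dropped from `remaining` before recursing, no root-to-leaf path tests the same discrete attribute twice.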

Notes on ID3

- Inductive bias
  - Shorter trees are preferred over longer trees (but size is not necessarily minimized)
  - Note that this is a form of Ockham's Razor
  - High information gain attributes are placed nearer to the root of the tree
- Positives
  - Robust to errors in the training data
  - Training is reasonably fast
  - Classifying new examples is very fast
- Negatives
  - Difficult to extend to real-valued target functions
  - Need to adapt the algorithm to continuous attributes (a threshold-based adaptation is sketched after this list)
  - The version of ID3 discussed above can overfit the data
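
The continuous-attribute point can be illustrated with one common adaptation, sketched below as an assumption (the slides do not specify a procedure): sort the examples by the attribute's value, try each midpoint between adjacent values as a candidate threshold for a binary <=/> split, and keep the threshold with the highest information gain. It relies on the `entropy` helper from the earlier sketch.

```python
def best_threshold(values, labels):
    """Pick the binary-split threshold on a continuous attribute that
    maximizes information gain.  Returns (threshold, gain)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total = entropy([y for _, y in pairs])
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                   # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        gain = (total
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

print(best_threshold([48, 60, 72, 80, 90], ["no", "no", "yes", "yes", "yes"]))
# -> (66.0, ~0.971): splitting at 66 separates the labels perfectly
```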

Avoiding overfitting

- Some possibilities
  - Use a tuning set to decide when to stop growing the tree
  - Use a tuning set to prune the tree after it's completely grown
    - Keep removing nodes as long as it doesn't hurt performance on the tuning set (a pruning sketch follows this list)
  - Use the Minimum Description Length principle
    - Track the number of bits used to describe the tree itself when calculating information gain
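
Here is a sketch of the tuning-set pruning idea. The details (bottom-up order, labeling a collapsed node with the majority label of the tuning examples that reach it) are simplifying assumptions rather than the slides' exact procedure; it reuses the dict-based tree and `classify` from the first sketch and assumes every attribute value has a child in the tree.

```python
from collections import Counter

def prune(tree, tune_X, tune_y):
    """Collapse subtrees to single leaves as long as accuracy on the tuning
    examples that reach them does not drop.  Returns the pruned tree."""
    if not isinstance(tree, dict) or not tune_y:
        return tree                    # leaf, or no tuning data reaches here
    attr = tree["attribute"]
    # Prune each child using only the tuning examples routed to it.
    for value, child in tree["children"].items():
        reach = [(x, y) for x, y in zip(tune_X, tune_y) if x[attr] == value]
        sub_X = [x for x, _ in reach]
        sub_y = [y for _, y in reach]
        tree["children"][value] = prune(child, sub_X, sub_y)
    # Compare the (already pruned) subtree against a single majority-label leaf.
    subtree_correct = sum(classify(tree, x) == y for x, y in zip(tune_X, tune_y))
    leaf = Counter(tune_y).most_common(1)[0][0]
    leaf_correct = sum(y == leaf for y in tune_y)
    return leaf if leaf_correct >= subtree_correct else tree
```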
