Decision trees, entropy, information gain, ID3

Relevant Readings: Sections 3.1 through 3.6 in Mitchell

CS495 - Machine Learning, Fall 2009

Decision trees

- Decision trees are commonly used in learning algorithms
- For simplicity, we will restrict to discrete-valued attributes and boolean target functions in this discussion
- In a decision tree, every internal node v considers some attribute a of the given test instance
- Each child of v corresponds to some value that a can take on; v just passes the instance to the appropriate child
- Every leaf node has a label “yes” or “no”, which is used to classify test instances that end up there
- Do an example using the lakes data
- If someone gave you a decision tree, could you give a boolean function in Disjunctive Normal Form (DNF) that represents it?
- If someone gave you a boolean function in DNF, could you give a decision tree that represents it?
- A decision tree can be used to represent any boolean function on discrete attributes (a short Python sketch follows this list)
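To make the root-to-leaf classification walk and the tree-to-DNF question concrete, here is a minimal Python sketch. The tree and its attribute names (frozen, stocked) are hypothetical, chosen only for illustration; they are not the lakes data from class.

# An internal node is a dict {"attribute": name, "children": {value: subtree}};
# a leaf is the string "yes" or "no".
tree = {
    "attribute": "frozen",
    "children": {
        "yes": "no",
        "no": {"attribute": "stocked",
               "children": {"yes": "yes", "no": "no"}},
    },
}

def classify(node, instance):
    # Walk from the root to a leaf, always following the child that matches
    # the instance's value for the current node's attribute.
    while isinstance(node, dict):
        node = node["children"][instance[node["attribute"]]]
    return node  # "yes" or "no"

def dnf_terms(node, path=()):
    # One conjunction of (attribute, value) pairs per root-to-"yes"-leaf path;
    # the disjunction of these conjunctions is a DNF formula for the tree.
    if isinstance(node, str):
        return [path] if node == "yes" else []
    terms = []
    for value, child in node["children"].items():
        terms += dnf_terms(child, path + ((node["attribute"], value),))
    return terms

print(classify(tree, {"frozen": "no", "stocked": "yes"}))  # yes
print(dnf_terms(tree))  # [(('frozen', 'no'), ('stocked', 'yes'))]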


How do we build the tree?

- In the simplest variant, we will start with a root node and recursively add child nodes until we fit the training data perfectly
- The only question is which attribute to use first
- Q: We want to use the attribute that does the “best job” splitting up the training data, but how can this be measured?
- A: We will use entropy and, in particular, information gain


Entropy

- Entropy is a measure of disorder or impurity
- In our case, we will be interested in the entropy of the output values of a set of training instances
- Entropy can be understood as the average number of bits needed to encode an output value
- If the output values are split 50%-50%, then the set is disorderly (impure)
  - The entropy is 1 (an average of 1 bit of information to encode an output value)
- If the output values are all positive (or all negative), then the set is quite orderly (pure)
  - The entropy is 0 (an average of 0 bits of information to encode an output value)
- If the output values are split 25%-75%, then the entropy turns out to be about 0.811 (checked in the sketch below)
- Let S be a set, let p be the fraction of positive training examples, and let q be the fraction of negative training examples
- Definition: Entropy(S) = −p log2(p) − q log2(q), with the convention 0 · log2(0) = 0
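As a quick sanity check on these numbers, here is a small Python sketch of the binary entropy (an illustration, not part of the original slides):

from math import log2

def entropy(p):
    # Binary entropy of a set whose fraction of positive examples is p
    # (so the negative fraction is q = 1 - p), using 0 * log2(0) = 0.
    return -sum(f * log2(f) for f in (p, 1.0 - p) if f > 0)

print(entropy(0.5))   # 1.0   (50%-50% split)
print(entropy(1.0))   # 0.0   (all positive)
print(entropy(0.25))  # about 0.811 (25%-75% split)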


Entropy

- Entropy can be generalized from boolean to discrete-valued target functions
- The idea is the same: entropy is the average number of bits needed to encode an output value
- Let S be a set of instances, and let p_i be the fraction of instances in S with output value i
- Definition: Entropy(S) = −∑_i p_i log2(p_i) (a code sketch follows below)
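The generalized definition translates directly into code. A small sketch, assuming the output values are given as a plain Python list:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of discrete output values; c / n is the fraction p_i
    # of instances with output value i.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "no", "yes", "no"]))   # 1.0
print(entropy(["a", "a", "b", "c"]))         # 1.5
print(entropy(["yes", "yes", "yes", "no"]))  # about 0.811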


Information gain

- The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets
- Information gain is a measure of this change in entropy
- Definition: Suppose S is a set of instances, A is an attribute, S_v is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
  Gain(S, A) = Entropy(S) − ∑_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
  (recall that |S| denotes the size of the set S; a code sketch follows below)
- If you just count the number of bits of information everywhere, you end up with something equivalent:
  Total gain in bits = |S| · Gain(S, A) = |S| · Entropy(S) − ∑_{v ∈ Values(A)} |S_v| · Entropy(S_v)
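Here is a small sketch of Gain(S, A), assuming each training instance is a dict whose keys include the attributes and an output key (the attribute and key names below are made up for illustration); the entropy helper from the previous sketch is repeated so the sketch stands alone:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Gain(S, A): entropy of the whole set minus the size-weighted entropy
    # of each subset S_v obtained by splitting on attribute A.
    n = len(examples)
    subsets = {}
    for x in examples:
        subsets.setdefault(x[attribute], []).append(x)
    before = entropy([x[target] for x in examples])
    after = sum(len(s) / n * entropy([x[target] for x in s])
                for s in subsets.values())
    return before - after

data = [
    {"wind": "weak",   "play": "yes"},
    {"wind": "weak",   "play": "yes"},
    {"wind": "strong", "play": "no"},
    {"wind": "strong", "play": "yes"},
]
print(information_gain(data, "wind", "play"))  # about 0.311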


ID3

- The essentials:
  - Start with all training instances associated with the root node
  - Use information gain to choose which attribute to label each node with
  - No root-to-leaf path should contain the same discrete attribute twice
  - Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree
- The border cases:
  - If all positive or all negative training instances remain, label that node “yes” or “no” accordingly
  - If no attributes remain, label with a majority vote of the training instances left at that node
  - If no instances remain, label with a majority vote of the parent’s training instances
- The pseudocode:
  - Table 3.1 on page 56 in Mitchell
  - (An illustrative Python sketch follows below)
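The essentials and border cases above fit in a short recursive function. This is only an illustrative sketch (not Mitchell's Table 3.1): training instances are assumed to be dicts, `target` is the output key, and `values` maps each attribute to its full set of possible values so the "no instances remain" case can be handled:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute, target):
    n = len(examples)
    subsets = {}
    for x in examples:
        subsets.setdefault(x[attribute], []).append(x)
    return (entropy([x[target] for x in examples])
            - sum(len(s) / n * entropy([x[target] for x in s])
                  for s in subsets.values()))

def id3(examples, attributes, target, values):
    # Grow the tree until the training data is fit perfectly (no pruning).
    labels = [x[target] for x in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:      # all positive or all negative remain
        return labels[0]           # leaf labeled "yes" or "no" accordingly
    if not attributes:             # no attributes remain: majority vote
        return majority
    best = max(attributes, key=lambda a: gain(examples, a, target))
    rest = [a for a in attributes if a != best]   # never reuse an attribute on a path
    children = {}
    for v in values[best]:
        subset = [x for x in examples if x[best] == v]
        # no instances with this value: majority vote of this node's instances
        children[v] = id3(subset, rest, target, values) if subset else majority
    return {"attribute": best, "children": children}

# Hypothetical call (attribute names and values are made up):
# tree = id3(data, ["outlook", "wind"], "play",
#            {"outlook": {"sunny", "rain"}, "wind": {"weak", "strong"}})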


Notes on ID3

- Inductive bias
  - Shorter trees are preferred over longer trees (but size is not necessarily minimized)
  - Note that this is a form of Ockham’s Razor
  - High-information-gain attributes are placed nearer to the root of the tree
- Positives
  - Robust to errors in the training data
  - Training is reasonably fast
  - Classifying new examples is very fast
- Negatives
  - Difficult to extend to real-valued target functions
  - Need to adapt the algorithm to handle continuous attributes
  - The version of ID3 discussed above can overfit the data


Avoiding overfitting

- Some possibilities:
  - Use a tuning set to decide when to stop growing the tree
  - Use a tuning set to prune the tree after it is completely grown
    - Keep removing nodes as long as doing so doesn’t hurt performance on the tuning set (a small sketch follows below)
  - Use the Minimum Description Length principle
    - Track the number of bits used to describe the tree itself when calculating information gain
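As one concrete illustration of the tuning-set idea, here is a small bottom-up pruning sketch in the style of reduced-error pruning, operating on the dict-shaped trees from the earlier sketches. It is a simplified variant, not the method from the slides: each subtree is collapsed to a leaf whenever that leaf does at least as well on the tuning examples reaching the subtree.

from collections import Counter

def classify(node, instance):
    # Same representation as the earlier sketches: dicts are internal nodes,
    # strings are leaf labels.
    while isinstance(node, dict):
        node = node["children"][instance[node["attribute"]]]
    return node

def prune(node, tuning, target):
    # `tuning` holds the tuning examples that reach `node`.
    if isinstance(node, str) or not tuning:
        return node
    # First prune the children, splitting the tuning examples among them.
    for value, child in node["children"].items():
        reaching = [x for x in tuning if x[node["attribute"]] == value]
        node["children"][value] = prune(child, reaching, target)
    # Then consider replacing this whole subtree with a majority-label leaf.
    majority = Counter(x[target] for x in tuning).most_common(1)[0][0]
    subtree_correct = sum(classify(node, x) == x[target] for x in tuning)
    leaf_correct = sum(x[target] == majority for x in tuning)
    return majority if leaf_correct >= subtree_correct else node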
