Decision Tree (ID3)
Xueping Peng
Outline
- What is a decision tree
- How to use a decision tree
- How to generate a decision tree
- Sum up and some drawbacks
What is a decision tree (1/3)

A decision tree is a hierarchical tree structure used to assign class labels by asking a series of questions (or applying rules) about the attributes of the data.
The attributes can be variables of any type: binary, nominal, ordinal, or quantitative.
The classes must be qualitative (binary, categorical, or ordinal).
In short, given data whose attribute vectors are labeled with classes, a decision tree produces a sequence of rules (a series of questions) that can be used to recognize the class.
What is a decision tree (2/3)

Training data. The first four columns are attributes; the last column is the class:

Gender   Car Ownership   Travel Cost ($)/km   Income Level   Transportation Mode
Male     0               Cheap                Low            Bus
Male     1               Cheap                Medium         Bus
Female   1               Cheap                Medium         Train
Female   0               Cheap                Low            Bus
Male     1               Cheap                Medium         Bus
Male     0               Standard             Medium         Train
Female   1               Standard             Medium         Train
Female   1               Expensive            High           Car
Male     2               Expensive            Medium         Car
Female   2               Expensive            High           Car
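The worked examples later in this deck are easier to follow with the table in machine-readable form. A minimal Python encoding (the tuple layout is an illustrative choice, not from the slides):

```python
# Each record: (gender, car_ownership, travel_cost, income_level, transport_mode)
# The first four fields are attributes; the last field is the class label.
DATA = [
    ("Male",   0, "Cheap",     "Low",    "Bus"),
    ("Male",   1, "Cheap",     "Medium", "Bus"),
    ("Female", 1, "Cheap",     "Medium", "Train"),
    ("Female", 0, "Cheap",     "Low",    "Bus"),
    ("Male",   1, "Cheap",     "Medium", "Bus"),
    ("Male",   0, "Standard",  "Medium", "Train"),
    ("Female", 1, "Standard",  "Medium", "Train"),
    ("Female", 1, "Expensive", "High",   "Car"),
    ("Male",   2, "Expensive", "Medium", "Car"),
    ("Female", 2, "Expensive", "High",   "Car"),
]
```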
What is a decision tree (3/3)
How to Use Decision Tree
Test data:

Person Name   Gender   Car Ownership   Travel Cost ($)/km   Income Level   Transportation Mode
Alex          Male     1               Standard             High           ?
Buddy         Male     0               Cheap                Medium         ?
Cherry        Female   1               Cheap                High           ?
What transportation mode would Alex, Buddy and Cherry use?
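A decision tree answers this kind of question by asking one attribute question per level, from the root down to a leaf. The sketch below encodes the tree that the next section derives (the nested-tuple representation and the classify helper are illustrative choices, not from the slides) and applies it to the three test records:

```python
# An internal node is (attribute, {value: subtree}); a leaf is a class label.
# This particular tree is the one built step by step in the next section.
TREE = ("Travel Cost", {
    "Standard":  "Train",
    "Expensive": "Car",
    "Cheap": ("Gender", {
        "Male": "Bus",
        "Female": ("Car Ownership", {0: "Bus", 1: "Train"}),
    }),
})

def classify(tree, record):
    """Answer one attribute question per level until a leaf names the class."""
    while isinstance(tree, tuple):          # internal node: ask another question
        attribute, branches = tree
        tree = branches[record[attribute]]  # follow the branch for this record's value
    return tree                             # leaf: the predicted class

for name, record in [
    ("Alex",   {"Gender": "Male",   "Car Ownership": 1, "Travel Cost": "Standard", "Income Level": "High"}),
    ("Buddy",  {"Gender": "Male",   "Car Ownership": 0, "Travel Cost": "Cheap",    "Income Level": "Medium"}),
    ("Cherry", {"Gender": "Female", "Car Ownership": 1, "Travel Cost": "Cheap",    "Income Level": "High"}),
]:
    print(name, "->", classify(TREE, record))
# Alex -> Train, Buddy -> Bus, Cherry -> Train
```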
How to Generate a Decision Tree (1/13)

Description of ID3: ID3 builds the tree top-down and greedily. Starting from the full training set, it selects the attribute whose information gain is highest, splits the data on that attribute's values, and repeats the process on each subset until every branch contains a single class.
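Condensed into code, that description is only a few lines. A minimal Python sketch of ID3, assuming records are tuples whose last field is the class label (all function names here are mine, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_j p_j * log2(p_j), where p_j is the share of class j in S."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr):
    """IG(S|A) = H(S) - sum_v P(A=v) * H(S restricted to A=v)."""
    labels = [r[-1] for r in records]
    gain = entropy(labels)
    for v in set(r[attr] for r in records):
        subset = [r[-1] for r in records if r[attr] == v]
        gain -= len(subset) / len(records) * entropy(subset)
    return gain

def id3(records, attrs):
    """Greedy top-down induction: returns (attribute, {value: subtree}) or a leaf label."""
    labels = [r[-1] for r in records]
    if len(set(labels)) == 1:                 # pure subset -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(records, a))
    return (best, {
        v: id3([r for r in records if r[best] == v], [a for a in attrs if a != best])
        for v in set(r[best] for r in records)
    })

# e.g. id3(DATA, [0, 1, 2, 3]) with the DATA list shown earlier reproduces the
# tree built step by step in the slides that follow.
```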
How to Generate a Decision Tree (2/13)

Which attribute is the best choice? Suppose we have 29 positive examples and 35 negative ones. Should we use attribute A1 or attribute A2 to split this node?
How to Generate a Decision Tree (3/13)

Use entropy to measure the degree of impurity:

Entropy(S) = - Σj pj log2 pj

where pj is the probability (relative frequency) of class j in the set S.
How to Generate a Decision Tree (4/13)

What does entropy mean?
- Entropy is the minimum number of bits needed to encode the classification of a randomly drawn member of S.
  - If p+ = 1, the receiver already knows the class, no message need be sent, and Entropy = 0.
  - If p+ = 0.5, one full bit is needed per example.
- An optimal-length code assigns -log2 p bits to a message that occurs with probability p: shorter codes go to the more probable messages and longer codes to the less likely ones.
- Thus the expected number of bits needed to encode the + or - label of a random member of S is:

H(S) = p+ (-log2 p+) + p- (-log2 p-)
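A quick numeric check of these boundary cases (the helper name is illustrative):

```python
import math

def binary_entropy(p_pos):
    """H(S) = p+ * (-log2 p+) + p- * (-log2 p-), with 0 * log2(0) taken as 0."""
    bits = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            bits += p * -math.log2(p)
    return bits

print(binary_entropy(1.0))  # 0.0   -> class is certain, no message needed
print(binary_entropy(0.5))  # 1.0   -> one full bit per example
print(binary_entropy(0.8))  # 0.722 -> skewed classes compress below one bit
```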
How to Generate a Decision Tree (5/13)

Information gain measures the expected reduction in entropy caused by partitioning the examples according to a given attribute.
- IG(S|A) is the number of bits saved when encoding the target value of an arbitrary member of S, given the value of attribute A.
- Expected reduction in entropy from knowing the value of A:

IG(S|A) = H(S) - Σj Prob(A=vj) H(S|A=vj)
How to Generate a Decision Tree (6/13)

Back to the question: with 29 positive and 35 negative examples, should we use attribute A1 or attribute A2 to split this node?

IG(A1) = 0.993 - 26/64 * 0.706 - 38/64 * 0.742 = 0.266
IG(A2) = 0.993 - 51/64 * 0.937 - 13/64 * 0.619 = 0.121

A1 yields the larger gain, so it is the better split.
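These figures can be reproduced in a few lines. The split sizes 26/64, 38/64, 51/64 and 13/64 come from the formulas above; the positive/negative counts inside each split (21+/5-, 8+/30-, 18+/33-, 11+/2-) are inferred from the quoted entropies and should be read as an assumption of this sketch:

```python
import math

def H(pos, neg):
    """Binary entropy of a (positive, negative) count pair."""
    bits = 0.0
    for c in (pos, neg):
        p = c / (pos + neg)
        if p > 0:
            bits += p * -math.log2(p)
    return bits

def gain(parent, splits):
    """IG = H(parent) - weighted average of the children's entropies."""
    n = sum(p + q for p, q in splits)
    return H(*parent) - sum((p + q) / n * H(p, q) for p, q in splits)

print(gain((29, 35), [(21, 5), (8, 30)]))   # IG(A1) ~ 0.266
print(gain((29, 35), [(18, 33), (11, 2)]))  # IG(A2) ~ 0.121
```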
How to Generate a Decision Tree (7/13)

Specific conditional entropy H(Y|X=v): Y is the class, X is an attribute, and v is a value of X. H(Y|X=v) is the entropy of Y among only those records in which X has value v.

H(Class|Travel Cost=Cheap) = -0.8 log2 0.8 - 0.2 log2 0.2 = 0.722
H(Class|Travel Cost=Expensive) = -1 log2 1 = 0
H(Class|Travel Cost=Standard) = -1 log2 1 = 0
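The same values fall straight out of the training table; a self-contained sketch using only the Travel Cost and class columns:

```python
import math
from collections import Counter

# (travel_cost, transportation_mode) pairs from the training table
ROWS = [("Cheap", "Bus"), ("Cheap", "Bus"), ("Cheap", "Train"), ("Cheap", "Bus"),
        ("Cheap", "Bus"), ("Standard", "Train"), ("Standard", "Train"),
        ("Expensive", "Car"), ("Expensive", "Car"), ("Expensive", "Car")]

def specific_cond_entropy(rows, value):
    """H(Class | Travel Cost = value): entropy among rows where the attribute equals value."""
    labels = [cls for cost, cls in rows if cost == value]
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

for v in ("Cheap", "Expensive", "Standard"):
    print(v, round(specific_cond_entropy(ROWS, v), 3))  # 0.722, 0.0, 0.0
```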
How to Generate a Decision Tree (8/13)

Conditional entropy H(Y|X) is the average specific conditional entropy of Y:

H(Y|X) = Σj Prob(X=vj) H(Y|X=vj)

e.g. H(Class|Travel Cost)
= prob(Travel Cost=Cheap) * H(Class|Travel Cost=Cheap)
+ prob(Travel Cost=Expensive) * H(Class|Travel Cost=Expensive)
+ prob(Travel Cost=Standard) * H(Class|Travel Cost=Standard)
= 0.5 * 0.722 + 0.3 * 0 + 0.2 * 0 = 0.361
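And the weighted average itself, as a small sketch (the two columns are repeated so the snippet runs on its own):

```python
import math
from collections import Counter

COSTS  = ["Cheap"] * 5 + ["Standard"] * 2 + ["Expensive"] * 3
LABELS = ["Bus", "Bus", "Train", "Bus", "Bus", "Train", "Train", "Car", "Car", "Car"]

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def cond_entropy(xs, ys):
    """H(Y|X) = sum_v P(X=v) * H(Y | X=v)."""
    n = len(xs)
    total = 0.0
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        total += len(sub) / n * H(sub)
    return total

print(round(cond_entropy(COSTS, LABELS), 3))  # 0.361
```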
How to Generate a Decision Tree (9/13)

Information gain IG(Y|X) = H(Y) - H(Y|X).

e.g. H(Class) = -0.4 log2 0.4 - 0.3 log2 0.3 - 0.3 log2 0.3 = 1.571
IG(Class|Travel Cost) = H(Class) - H(Class|Travel Cost) = 1.571 - 0.361 = 1.210

Results of the first iteration:

Attribute   Gender   Car Ownership   Travel Cost ($)/km   Income Level
IG          0.125    0.534           1.210                0.695
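The whole table can be verified with the information-gain function sketched earlier, repeated here in self-contained form:

```python
import math
from collections import Counter

# Training records: (Gender, Car Ownership, Travel Cost, Income Level, Mode)
DATA = [("Male", 0, "Cheap", "Low", "Bus"), ("Male", 1, "Cheap", "Medium", "Bus"),
        ("Female", 1, "Cheap", "Medium", "Train"), ("Female", 0, "Cheap", "Low", "Bus"),
        ("Male", 1, "Cheap", "Medium", "Bus"), ("Male", 0, "Standard", "Medium", "Train"),
        ("Female", 1, "Standard", "Medium", "Train"), ("Female", 1, "Expensive", "High", "Car"),
        ("Male", 2, "Expensive", "Medium", "Car"), ("Female", 2, "Expensive", "High", "Car")]

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [r[-1] for r in rows]
    gain = H(labels)
    for v in set(r[attr] for r in rows):
        sub = [r[-1] for r in rows if r[attr] == v]
        gain -= len(sub) / len(rows) * H(sub)
    return gain

for i, name in enumerate(["Gender", "Car Ownership", "Travel Cost", "Income Level"]):
    print(name, round(info_gain(DATA, i), 3))
# Gender 0.125, Car Ownership 0.534, Travel Cost 1.21, Income Level 0.695
```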
How to Generate a Decision Tree (10/13)

Travel Cost ($)/km has the largest information gain, so it becomes the root node. Splitting on its values, the Standard branch contains only Train and the Expensive branch only Car, so both become leaves; the Cheap branch still mixes Bus and Train and must be split again.
How to Generate a Decision Tree (11/13)

Second iteration: the same procedure is applied to the five records with Travel Cost = Cheap, using the remaining attributes Gender, Car Ownership and Income Level.
How to Generate a Decision Tree (12/13)

Results of the second iteration (Travel Cost = Cheap subset):

Attribute   Gender   Car Ownership   Income Level
IG          0.322    0.171           0.171

Gender has the highest gain, so the Cheap branch is split on Gender and the decision tree is updated: the Male sub-branch contains only Bus and becomes a leaf, while the Female sub-branch still mixes Bus and Train.
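The second-iteration figures check out the same way, restricted to the five Cheap records:

```python
import math
from collections import Counter

# The five Travel Cost = Cheap records: (Gender, Car Ownership, Income Level, Mode)
CHEAP = [("Male", 0, "Low", "Bus"), ("Male", 1, "Medium", "Bus"),
         ("Female", 1, "Medium", "Train"), ("Female", 0, "Low", "Bus"),
         ("Male", 1, "Medium", "Bus")]

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [r[-1] for r in rows]
    g = H(labels)
    for v in set(r[attr] for r in rows):
        sub = [r[-1] for r in rows if r[attr] == v]
        g -= len(sub) / len(rows) * H(sub)
    return g

for i, name in enumerate(["Gender", "Car Ownership", "Income Level"]):
    print(name, round(info_gain(CHEAP, i), 3))
# Gender 0.322, Car Ownership 0.171, Income Level 0.171
```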
How to Generate a Decision Tree (13/13)

Third iteration: only two records remain, the Female ones in the Cheap branch. Either remaining attribute separates them perfectly; splitting on Car Ownership gives 0 -> Bus and 1 -> Train, and updating the decision tree with this split completes it.
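A last check on the tie in this step: either remaining attribute separates the final two records perfectly, gaining a full bit (the two-record subset follows directly from the training table):

```python
import math
from collections import Counter

# The two records left in the (Travel Cost=Cheap, Gender=Female) branch:
# (Car Ownership, Income Level, Mode)
REMAINING = [(1, "Medium", "Train"), (0, "Low", "Bus")]

def H(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [r[-1] for r in rows]
    g = H(labels)
    for v in set(r[attr] for r in rows):
        sub = [r[-1] for r in rows if r[attr] == v]
        g -= len(sub) / len(rows) * H(sub)
    return g

for i, name in enumerate(["Car Ownership", "Income Level"]):
    print(name, info_gain(REMAINING, i))
# Both gains are 1.0 bit: either attribute splits the last two records perfectly.
```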
To Sum Up

ID3 is a powerful system that:
- Uses hill-climbing search, guided by the information gain measure, to search through the space of decision trees
- Outputs a single hypothesis
- Never backtracks: it converges to locally optimal solutions
- Uses all training examples at each step, unlike methods that make decisions incrementally
- Uses statistical properties of all examples, so the search is less sensitive to errors in individual training examples
Some Drawbacks
- It can only deal with nominal attributes
- It is not robust in the presence of noise, so it handles noisy data sets poorly
References
- Tutorial on Decision Tree, http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html
- Information Gain, http://www.autonlab.org/tutorials/infogain11.pdf
- Lecture 5: C4.5, http://www.slideshare.net/aorriols/lecture5-c45