By: Phuong H. Nguyen
Professor: Lee, Sin-Min
Course: CS 157B
Section: 2
Date: 05/08/07
Spring 2007
Overview
- Introduction
- Entropy
- Information Gain
- Detailed Example Walkthrough
- Conclusion
- References
Introduction
The ID3 algorithm is a greedy algorithm for decision tree construction, developed by Ross Quinlan (published in 1986).
The ID3 algorithm uses information gain to select the best attribute for the root node and for each decision node: the Max-Gain approach (split on the attribute with the highest information gain).
Entropy
Entropy measures the impurity or randomness of a collection of examples.
It is a quantitative measurement of the homogeneity of a set of examples.
Basically, it tells us how mixed the given examples are with respect to the target classification class.
Entropy (cont.)
Entropy(S) = -P(positive) log2 P(positive) - P(negative) log2 P(negative)
where:
- P(positive) = proportion of positive examples in S
- P(negative) = proportion of negative examples in S
Example: if S is a collection of 14 examples with 9 YES and 5 NO, then:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy (cont.)
More than two classification classes:
Entropy(S) = ∑ -p(i) log2 p(i)   (summed over every class i, where p(i) is the proportion of class i)
- For two classes the result is between 0 and 1 (in general, between 0 and log2 of the number of classes).
- Two special cases:
If Entropy(S) = 1 (max value): members are split equally between the two classes (min uniformity, max randomness). Example:

Age      Income  Buys Computer
< 15     Low     No
>= 25    High    Yes

If Entropy(S) = 0: all members in S belong to strictly one class (max uniformity, min randomness). Example:

Age      Income  Buys Computer
<= 20    Low     Yes
21...40  High    Yes
> 40     Medium  Yes
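As an illustration (not part of the original slides), the entropy formula generalized to any number of classes can be written as a small Python function; the function name and the list-of-counts interface are my own choices:

```python
import math

def entropy(counts):
    """Entropy of a collection, given the count of examples in each class."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # by convention, 0 * log2(0) = 0
            p = c / total
            result -= p * math.log2(p)
    return result

# The 9 YES / 5 NO example from the slides:
print(round(entropy([9, 5]), 3))   # 0.94
# The two special cases:
print(entropy([7, 7]))             # 1.0  (even split: max randomness)
print(entropy([14, 0]))            # 0.0  (one class: max uniformity)
```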
Information Gain
A statistical property that measures how well a given attribute separates a collection of examples into the target classes.
The ID3 algorithm uses the Max-Gain approach (highest information gain) to select the best attribute for the root node and for each decision node.

Information Gain (cont.)
Gain(S, A) = Entropy(S) - ∑ (|Sv| / |S|) * Entropy(Sv)   (summed over each value v of attribute A)
where:
- A is an attribute of collection S
- Sv = subset of S for which attribute A has value v
- |Sv| = number of elements in Sv
- |S| = number of elements in S
Information Gain (cont.)
Example: collection S = 14 examples (9 YES, 5 NO).
Wind speed is one attribute of S, with values {Weak, Strong}:
- Weak = 8 occurrences (6 YES, 2 NO)
- Strong = 6 occurrences (3 YES, 3 NO)

Calculation:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy(Sweak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(Sstrong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.000

Gain(S, Wind) = Entropy(S) - (8/14) * Entropy(Sweak) - (6/14) * Entropy(Sstrong)
= 0.940 - (8/14) * 0.811 - (6/14) * 1.000
= 0.048
- Then, for each remaining attribute in S, the information gain is calculated in the same way.
- The attribute with the highest gain is used at the root node or decision node.
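The gain formula and the wind example can be checked with a short Python sketch (the function names and the dictionary-based encoding of the examples are illustrative assumptions, not the slides' code):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(examples, labels, attribute):
    """Information gain of splitting (examples, labels) on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Wind example: 14 examples (9 YES / 5 NO); Weak = 6 YES / 2 NO, Strong = 3 YES / 3 NO
examples = [{"Wind": "Weak"}] * 8 + [{"Wind": "Strong"}] * 6
labels = ["YES"] * 6 + ["NO"] * 2 + ["YES"] * 3 + ["NO"] * 3
print(round(gain(examples, labels, "Wind"), 3))   # 0.048
```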
Example Walkthrough
An example: a company sends out promotions to various houses, recording a few facts about each house and whether or not the residents responded:
District House Type Income Previous Customer Outcome
Suburban Detached High No Nothing
Suburban Detached High Responded Nothing
Rural Detached High No Responded
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Responded Nothing
Rural Semi-detached Low Responded Responded
Suburban Terrace High No Nothing
Suburban Semi-detached Low No Responded
Urban Terrace Low No Responded
Suburban Terrace Low Responded Responded
Rural Terrace High Responded Responded
Rural Detached Low No Responded
Urban Terrace High Responded Nothing
Example Walkthrough (cont.)
The target classification is “Outcome”, which can be “Responded” or “Nothing”.
The candidate attributes in the collection are “District”, “House Type”, “Income”, and “Previous Customer” (“Outcome” is the target class, not a splitting attribute).
They have the following values:
- District = {Suburban, Rural, Urban}
- House Type = {Detached, Semi-detached, Terrace}
- Income = {High, Low}
- Previous Customer = {No, Responded}
- Outcome = {Nothing, Responded}
Example Walkthrough (cont.)
Detailed Calculation for Gain(S, District):
Entropy(S = [9/14 responses, 5/14 no responses])
= -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.40978 + 0.53051 = 0.9403

Entropy(S(District=Suburban) = [2/5 responses, 3/5 no responses])
= -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.5288 + 0.4422 = 0.9709

Entropy(S(District=Rural) = [4/4 responses, 0/4 no responses])
= -(4/4) log2(4/4) = 0

Entropy(S(District=Urban) = [3/5 responses, 2/5 no responses])
= -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.4422 + 0.5288 = 0.9709

Gain(S, District) = Entropy(S) - ((5/14) * Entropy(S(District=Suburban)) + (4/14) * Entropy(S(District=Rural)) + (5/14) * Entropy(S(District=Urban)))
= 0.9403 - (5/14)*0.9709 - (4/14)*0 - (5/14)*0.9709
= 0.9403 - 0.3468 - 0 - 0.3468
= 0.2468
Example Walkthrough (cont.)
So we now have: Gain(S, District) = 0.2468.
Applying the same process to the remaining 3 attributes of S, we get:
- Gain(S, House Type) = 0.049
- Gain(S, Income) = 0.151
- Gain(S, Previous Customer) = 0.048
Comparing the information gain of the four attributes, we see that “District” has the highest value, so “District” becomes the root node of the decision tree.
So far the decision tree looks like the following:

District
├── Suburban → ???
├── Rural → ???
└── Urban → ???
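As a check (an illustrative sketch, not the slides' code), the four gains can be recomputed from the table; the tuple encoding of the rows and the helper names are my own, and the printed values match the slides up to rounding/truncation:

```python
import math
from collections import Counter

# The promotion dataset: (District, House Type, Income, Previous Customer) -> Outcome
rows = [
    ("Suburban", "Detached",      "High", "No",        "Nothing"),
    ("Suburban", "Detached",      "High", "Responded", "Nothing"),
    ("Rural",    "Detached",      "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "Responded", "Nothing"),
    ("Rural",    "Semi-detached", "Low",  "Responded", "Responded"),
    ("Suburban", "Terrace",       "High", "No",        "Nothing"),
    ("Suburban", "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "Low",  "No",        "Responded"),
    ("Suburban", "Terrace",       "Low",  "Responded", "Responded"),
    ("Rural",    "Terrace",       "High", "Responded", "Responded"),
    ("Rural",    "Detached",      "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "High", "Responded", "Nothing"),
]
ATTRS = ["District", "House Type", "Income", "Previous Customer"]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

for i, name in enumerate(ATTRS):
    print(f"Gain(S, {name}) = {gain(rows, i):.3f}")
# "District" has the highest gain (~0.247), so it becomes the root.
```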
Example Walkthrough (cont.)
Applying the same process to the left branch of the root node (Suburban), we get:
- Entropy(S(Suburban)) = 0.970
- Gain(S(Suburban), House Type) = 0.570
- Gain(S(Suburban), Income) = 0.970
- Gain(S(Suburban), Previous Customer) = 0.019
The information gain of “Income” is highest, so “Income” becomes this decision node.
The decision tree then looks like the following:

District
├── Suburban → Income
├── Rural → ???
└── Urban → ???
District  House Type  Income  Previous Customer  Outcome
Suburban Detached High No Nothing
Suburban Detached High Responded Nothing
Suburban Terrace High No Nothing
Suburban Semi-detached Low No Responded
Suburban Terrace Low Responded Responded
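As a quick check (an illustrative sketch, not the slides' code), the Suburban-branch gains can be recomputed from the five Suburban rows above; note that “Income” splits this subset perfectly (High → Nothing, Low → Responded), so its gain equals the subset's entropy:

```python
import math
from collections import Counter

# The five Suburban rows: (House Type, Income, Previous Customer) -> Outcome
rows = [
    ("Detached",      "High", "No",        "Nothing"),
    ("Detached",      "High", "Responded", "Nothing"),
    ("Terrace",       "High", "No",        "Nothing"),
    ("Semi-detached", "Low",  "No",        "Responded"),
    ("Terrace",       "Low",  "Responded", "Responded"),
]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

for i, name in enumerate(["House Type", "Income", "Previous Customer"]):
    print(f"Gain(S_suburban, {name}) = {gain(rows, i):.3f}")
# "Income" achieves the maximum possible gain here (~0.971), so it wins the split.
```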
Example Walkthrough (cont.)
The center branch of the root node (Rural) is a special case because Entropy(S(Rural)) = 0: all members of S(Rural) belong to strictly one target classification class, “Responded”.
Thus, we skip the calculation and add the corresponding target classification value directly to the tree.
The decision tree then looks like the following:

District
├── Suburban → Income
├── Rural → Responded
└── Urban → ???
District  House Type  Income  Previous Customer  Outcome
Rural Detached High No Responded
Rural Semi-detached Low Responded Responded
Rural Terrace High Responded Responded
Rural Detached Low No Responded
Example Walkthrough (cont.)
Applying the same process to the right branch of the root node (Urban), we get:
- Entropy(S(Urban)) = 0.970
- Gain(S(Urban), House Type) = 0.019
- Gain(S(Urban), Income) = 0.019
- Gain(S(Urban), Previous Customer) = 0.970
The information gain of “Previous Customer” is highest, so “Previous Customer” becomes this decision node.
The decision tree then looks like the following:

District
├── Suburban → Income
├── Rural → Responded
└── Urban → Previous Customer
District  House Type  Income  Previous Customer  Outcome
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Responded Nothing
Urban Terrace Low No Responded
Urban Terrace High Responded Nothing
For the “Income” node (Suburban branch), we have: High → Nothing (3/3), Entropy = 0; and Low → Responded (2/2), Entropy = 0.
For the “Previous Customer” node (Urban branch), we have: No → Responded (3/3), Entropy = 0; and Responded → Nothing (2/2), Entropy = 0.
There is no longer any need to split the tree; therefore, the final decision tree looks like the following:
District
├── Suburban → Income
│   ├── High → Nothing
│   └── Low → Responded
├── Rural → Responded
└── Urban → Previous Customer
    ├── No → Responded
    └── Responded → Nothing
District Income Outcome
Suburban High Nothing
Suburban High Nothing
Suburban High Nothing
Suburban Low Responded
Suburban Low Responded
District  Previous Customer  Outcome
Urban No Responded
Urban No Responded
Urban Responded Nothing
Urban No Responded
Urban Responded Nothing
From the above decision tree, some rules can be extracted. Examples:
1) (District = Suburban) AND (Income = Low) → (Outcome = Responded)
2) (District = Rural) → (Outcome = Responded)
3) (District = Urban) AND (Previous Customer = Responded) → (Outcome = Nothing)
4) and so on...
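The entire walkthrough, including the rule extraction above, can be reproduced with a compact recursive ID3 sketch; everything here (the function names `id3` and `rules`, the tuple encoding of the table) is an illustrative assumption rather than the original slides' code:

```python
import math
from collections import Counter

# Full promotion dataset: (District, House Type, Income, Previous Customer, Outcome)
ROWS = [
    ("Suburban", "Detached",      "High", "No",        "Nothing"),
    ("Suburban", "Detached",      "High", "Responded", "Nothing"),
    ("Rural",    "Detached",      "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "Responded", "Nothing"),
    ("Rural",    "Semi-detached", "Low",  "Responded", "Responded"),
    ("Suburban", "Terrace",       "High", "No",        "Nothing"),
    ("Suburban", "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "Low",  "No",        "Responded"),
    ("Suburban", "Terrace",       "Low",  "Responded", "Responded"),
    ("Rural",    "Terrace",       "High", "Responded", "Responded"),
    ("Rural",    "Detached",      "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "High", "Responded", "Nothing"),
]
ATTRS = ["District", "House Type", "Income", "Previous Customer"]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, i):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, attr_indices):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                 # pure node: stop splitting
        return labels[0]
    if not attr_indices:                      # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: gain(rows, i))   # Max-Gain approach
    branches = {}
    for value in sorted(set(r[best] for r in rows)):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, [i for i in attr_indices if i != best])
    return (ATTRS[best], branches)

def rules(node, conditions=()):
    """Flatten the finished tree into IF-THEN rules like those above."""
    if isinstance(node, str):
        yield " AND ".join(f"({a} = {v})" for a, v in conditions) + \
              f" -> (Outcome = {node})"
    else:
        attr, branches = node
        for value, child in branches.items():
            yield from rules(child, conditions + ((attr, value),))

tree = id3(ROWS, list(range(len(ATTRS))))
for rule in rules(tree):
    print(rule)
```

Running the sketch reproduces the tree from the walkthrough: “District” at the root, the pure “Rural” branch, and the “Income” and “Previous Customer” decision nodes under “Suburban” and “Urban”.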
Conclusion
The ID3 algorithm is straightforward to implement once you understand how it works.
The ID3 algorithm is one of the most important techniques in data mining.
Industry experience has shown the ID3 algorithm to be effective for data mining.
References
- Dr. Lee's slides, San Jose State University, Spring 2007, http://www.cs.sjsu.edu/%7Elee/cs157b/cs157b.html
- Andrew Colin, "Building Decision Trees with the ID3 Algorithm", Dr. Dobb's Journal, June 1996
- Paul E. Utgoff, "Incremental Induction of Decision Trees", Kluwer Academic Publishers, 1989
- http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
- http://decisiontrees.net/node/27