By: Phuong H. Nguyen
Professor: Lee, Sin-Min
Course: CS 157B
Section: 2
Date: 05/08/07
Spring 2007
Overview
- Introduction
- Entropy
- Information Gain
- Detailed Example Walkthrough
- Conclusion
- References
Introduction
The ID3 algorithm is a greedy algorithm for decision tree construction, developed by Ross Quinlan (published in 1986).
The ID3 algorithm uses information gain to select the best attribute for the root node and for each decision node: the Max-Gain approach (split on the attribute with the highest information gain).
Entropy
Entropy measures the impurity or randomness of a collection of examples.
It is a quantitative measurement of the homogeneity of a set of examples.
Basically, it tells us how mixed the given examples are with respect to the target classification class.
Entropy (cont.)
Entropy(S) = -P(positive) log2 P(positive) - P(negative) log2 P(negative)
where:
- P(positive) = proportion of positive examples in S
- P(negative) = proportion of negative examples in S
Example: if S is a collection of 14 examples with 9 YES and 5 NO, then:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy (cont.)
More than two classification classes:
Entropy(S) = ∑ -p(i) log2 p(i)   (summed over every class i, where p(i) is the proportion of class i)
- For two classes the result is between 0 and 1 (in general, between 0 and log2 of the number of classes).
- Two special cases:
If Entropy(S) = 1 (max value): members are split equally between the two classes (min uniformity, max randomness). Example:

Age      Income  Buys Computer
< 15     Low     No
>= 25    High    Yes

If Entropy(S) = 0: all members in S belong to strictly one class (max uniformity, min randomness). Example:

Age      Income  Buys Computer
<= 20    Low     Yes
21...40  High    Yes
> 40     Medium  Yes
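As an illustration (not part of the original slides), the entropy formula generalized to any number of classes can be written as a small Python function; the function name and the list-of-counts interface are my own choices:

```python
import math

def entropy(counts):
    """Entropy of a collection, given the count of examples in each class."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # by convention, 0 * log2(0) = 0
            p = c / total
            result -= p * math.log2(p)
    return result

# The 9 YES / 5 NO example from the slides:
print(round(entropy([9, 5]), 3))   # 0.94
# The two special cases:
print(entropy([7, 7]))             # 1.0  (even split: max randomness)
print(entropy([14, 0]))            # 0.0  (one class: max uniformity)
```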
Information Gain
A statistical property that measures how well a given attribute separates a collection of examples into the target classes.
The ID3 algorithm uses the Max-Gain approach (highest information gain) to select the best attribute for the root node and for each decision node.

Information Gain (cont.)
Gain(S, A) = Entropy(S) - ∑ (|Sv| / |S|) * Entropy(Sv)   (summed over each value v of attribute A)
where:
- A is an attribute of collection S
- Sv = subset of S for which attribute A has value v
- |Sv| = number of elements in Sv
- |S| = number of elements in S
Information Gain (cont.)
Example: collection S = 14 examples (9 YES, 5 NO).
Wind speed is one attribute of S, with values {Weak, Strong}:
- Weak = 8 occurrences (6 YES, 2 NO)
- Strong = 6 occurrences (3 YES, 3 NO)

Calculation:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy(Sweak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(Sstrong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.000

Gain(S, Wind) = Entropy(S) - (8/14) * Entropy(Sweak) - (6/14) * Entropy(Sstrong)
= 0.940 - (8/14) * 0.811 - (6/14) * 1.000
= 0.048
- Then, for each remaining attribute in S, the information gain is calculated in the same way.
- The attribute with the highest gain is used at the root node or decision node.
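The gain formula and the wind example can be checked with a short Python sketch (the function names and the dictionary-based encoding of the examples are illustrative assumptions, not the slides' code):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(examples, labels, attribute):
    """Information gain of splitting (examples, labels) on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Wind example: 14 examples (9 YES / 5 NO); Weak = 6 YES / 2 NO, Strong = 3 YES / 3 NO
examples = [{"Wind": "Weak"}] * 8 + [{"Wind": "Strong"}] * 6
labels = ["YES"] * 6 + ["NO"] * 2 + ["YES"] * 3 + ["NO"] * 3
print(round(gain(examples, labels, "Wind"), 3))   # 0.048
```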
Example Walkthrough
An example: a company sends out promotions to various houses, recording a few facts about each house and whether or not the residents responded:
District House Type Income Previous Customer Outcome
Suburban Detached High No Nothing
Suburban Detached High Responded Nothing
Rural Detached High No Responded
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Responded Nothing
Rural Semi-detached Low Responded Responded
Suburban Terrace High No Nothing
Suburban Semi-detached Low No Responded
Urban Terrace Low No Responded
Suburban Terrace Low Responded Responded
Rural Terrace High Responded Responded
Rural Detached Low No Responded
Urban Terrace High Responded Nothing
Example Walkthrough (cont.)
The target classification is “Outcome”, which can be “Responded” or “Nothing”.
The candidate attributes in the collection are “District”, “House Type”, “Income”, and “Previous Customer” (“Outcome” is the target class, not a splitting attribute).
They have the following values:
- District = {Suburban, Rural, Urban}
- House Type = {Detached, Semi-detached, Terrace}
- Income = {High, Low}
- Previous Customer = {No, Responded}
- Outcome = {Nothing, Responded}
Example Walkthrough (cont.)
Detailed Calculation for Gain(S, District):
Entropy(S = [9/14 responses, 5/14 no responses])
= -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.40978 + 0.53051 = 0.9403

Entropy(S(District=Suburban) = [2/5 responses, 3/5 no responses])
= -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.5288 + 0.4422 = 0.9709

Entropy(S(District=Rural) = [4/4 responses, 0/4 no responses])
= -(4/4) log2(4/4) = 0

Entropy(S(District=Urban) = [3/5 responses, 2/5 no responses])
= -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.4422 + 0.5288 = 0.9709

Gain(S, District) = Entropy(S) - ((5/14) * Entropy(S(District=Suburban)) + (4/14) * Entropy(S(District=Rural)) + (5/14) * Entropy(S(District=Urban)))
= 0.9403 - (5/14)*0.9709 - (4/14)*0 - (5/14)*0.9709
= 0.9403 - 0.3468 - 0 - 0.3468
= 0.2468
Example Walkthrough (cont.)
So we now have: Gain(S, District) = 0.2468.
Applying the same process to the remaining 3 attributes of S, we get:
- Gain(S, House Type) = 0.049
- Gain(S, Income) = 0.151
- Gain(S, Previous Customer) = 0.048
Comparing the information gain of the four attributes, we see that “District” has the highest value, so “District” becomes the root node of the decision tree.
So far the decision tree looks like the following:

District
├── Suburban → ???
├── Rural → ???
└── Urban → ???
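As a check (an illustrative sketch, not the slides' code), the four gains can be recomputed from the table; the tuple encoding of the rows and the helper names are my own, and the printed values match the slides up to rounding/truncation:

```python
import math
from collections import Counter

# The promotion dataset: (District, House Type, Income, Previous Customer) -> Outcome
rows = [
    ("Suburban", "Detached",      "High", "No",        "Nothing"),
    ("Suburban", "Detached",      "High", "Responded", "Nothing"),
    ("Rural",    "Detached",      "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "Responded", "Nothing"),
    ("Rural",    "Semi-detached", "Low",  "Responded", "Responded"),
    ("Suburban", "Terrace",       "High", "No",        "Nothing"),
    ("Suburban", "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "Low",  "No",        "Responded"),
    ("Suburban", "Terrace",       "Low",  "Responded", "Responded"),
    ("Rural",    "Terrace",       "High", "Responded", "Responded"),
    ("Rural",    "Detached",      "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "High", "Responded", "Nothing"),
]
ATTRS = ["District", "House Type", "Income", "Previous Customer"]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

for i, name in enumerate(ATTRS):
    print(f"Gain(S, {name}) = {gain(rows, i):.3f}")
# "District" has the highest gain (~0.247), so it becomes the root.
```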
Example Walkthrough (cont.)
Applying the same process to the left branch of the root node (Suburban), we get:
- Entropy(S(Suburban)) = 0.970
- Gain(S(Suburban), House Type) = 0.570
- Gain(S(Suburban), Income) = 0.970
- Gain(S(Suburban), Previous Customer) = 0.019
The information gain of “Income” is highest, so “Income” becomes this decision node.
The decision tree then looks like the following:

District
├── Suburban → Income
├── Rural → ???
└── Urban → ???
District  House Type  Income  Previous Customer  Outcome
Suburban Detached High No Nothing
Suburban Detached High Responded Nothing
Suburban Terrace High No Nothing
Suburban Semi-detached Low No Responded
Suburban Terrace Low Responded Responded
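As a quick check (an illustrative sketch, not the slides' code), the Suburban-branch gains can be recomputed from the five Suburban rows above; note that “Income” splits this subset perfectly (High → Nothing, Low → Responded), so its gain equals the subset's entropy:

```python
import math
from collections import Counter

# The five Suburban rows: (House Type, Income, Previous Customer) -> Outcome
rows = [
    ("Detached",      "High", "No",        "Nothing"),
    ("Detached",      "High", "Responded", "Nothing"),
    ("Terrace",       "High", "No",        "Nothing"),
    ("Semi-detached", "Low",  "No",        "Responded"),
    ("Terrace",       "Low",  "Responded", "Responded"),
]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

for i, name in enumerate(["House Type", "Income", "Previous Customer"]):
    print(f"Gain(S_suburban, {name}) = {gain(rows, i):.3f}")
# "Income" achieves the maximum possible gain here (~0.971), so it wins the split.
```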
Example Walkthrough (cont.)
The center branch of the root node (Rural) is a special case because Entropy(S(Rural)) = 0: all members of S(Rural) belong to strictly one target classification class, “Responded”.
Thus, we skip the calculation and add the corresponding target classification value directly to the tree.
The decision tree then looks like the following:

District
├── Suburban → Income
├── Rural → Responded
└── Urban → ???
District  House Type  Income  Previous Customer  Outcome
Rural Detached High No Responded
Rural Semi-detached Low Responded Responded
Rural Terrace High Responded Responded
Rural Detached Low No Responded
Example Walkthrough (cont.)
Applying the same process to the right branch of the root node (Urban), we get:
- Entropy(S(Urban)) = 0.970
- Gain(S(Urban), House Type) = 0.019
- Gain(S(Urban), Income) = 0.019
- Gain(S(Urban), Previous Customer) = 0.970
The information gain of “Previous Customer” is highest, so “Previous Customer” becomes this decision node.
The decision tree then looks like the following:

District
├── Suburban → Income
├── Rural → Responded
└── Urban → Previous Customer
District  House Type  Income  Previous Customer  Outcome
Urban Semi-detached High No Responded
Urban Semi-detached Low No Responded
Urban Semi-detached Low Responded Nothing
Urban Terrace Low No Responded
Urban Terrace High Responded Nothing
For the “Income” node (Suburban branch), we have: High → Nothing (3/3), Entropy = 0; and Low → Responded (2/2), Entropy = 0.
For the “Previous Customer” node (Urban branch), we have: No → Responded (3/3), Entropy = 0; and Responded → Nothing (2/2), Entropy = 0.
There is no longer any need to split the tree; therefore, the final decision tree looks like the following:
District
├── Suburban → Income
│   ├── High → Nothing
│   └── Low → Responded
├── Rural → Responded
└── Urban → Previous Customer
    ├── No → Responded
    └── Responded → Nothing
District Income Outcome
Suburban High Nothing
Suburban High Nothing
Suburban High Nothing
Suburban Low Responded
Suburban Low Responded
District  Previous Customer  Outcome
Urban No Responded
Urban No Responded
Urban Responded Nothing
Urban No Responded
Urban Responded Nothing
From the above decision tree, some rules can be extracted. Examples:
1) (District = Suburban) AND (Income = Low) → (Outcome = Responded)
2) (District = Rural) → (Outcome = Responded)
3) (District = Urban) AND (Previous Customer = Responded) → (Outcome = Nothing)
4) and so on...
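The entire walkthrough, including the rule extraction above, can be reproduced with a compact recursive ID3 sketch; everything here (the function names `id3` and `rules`, the tuple encoding of the table) is an illustrative assumption rather than the original slides' code:

```python
import math
from collections import Counter

# Full promotion dataset: (District, House Type, Income, Previous Customer, Outcome)
ROWS = [
    ("Suburban", "Detached",      "High", "No",        "Nothing"),
    ("Suburban", "Detached",      "High", "Responded", "Nothing"),
    ("Rural",    "Detached",      "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "High", "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Semi-detached", "Low",  "Responded", "Nothing"),
    ("Rural",    "Semi-detached", "Low",  "Responded", "Responded"),
    ("Suburban", "Terrace",       "High", "No",        "Nothing"),
    ("Suburban", "Semi-detached", "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "Low",  "No",        "Responded"),
    ("Suburban", "Terrace",       "Low",  "Responded", "Responded"),
    ("Rural",    "Terrace",       "High", "Responded", "Responded"),
    ("Rural",    "Detached",      "Low",  "No",        "Responded"),
    ("Urban",    "Terrace",       "High", "Responded", "Nothing"),
]
ATTRS = ["District", "House Type", "Income", "Previous Customer"]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, i):
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, attr_indices):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                 # pure node: stop splitting
        return labels[0]
    if not attr_indices:                      # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: gain(rows, i))   # Max-Gain approach
    branches = {}
    for value in sorted(set(r[best] for r in rows)):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, [i for i in attr_indices if i != best])
    return (ATTRS[best], branches)

def rules(node, conditions=()):
    """Flatten the finished tree into IF-THEN rules like those above."""
    if isinstance(node, str):
        yield " AND ".join(f"({a} = {v})" for a, v in conditions) + \
              f" -> (Outcome = {node})"
    else:
        attr, branches = node
        for value, child in branches.items():
            yield from rules(child, conditions + ((attr, value),))

tree = id3(ROWS, list(range(len(ATTRS))))
for rule in rules(tree):
    print(rule)
```

Running the sketch reproduces the tree from the walkthrough: “District” at the root, the pure “Rural” branch, and the “Income” and “Previous Customer” decision nodes under “Suburban” and “Urban”.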
Conclusion
The ID3 algorithm is straightforward to implement once you understand how it works.
The ID3 algorithm is one of the most important techniques in data mining.
Industry experience has shown the ID3 algorithm to be effective for data mining.
References
- Dr. Lee's slides, San Jose State University, Spring 2007, http://www.cs.sjsu.edu/%7Elee/cs157b/cs157b.html
- Andrew Colin, "Building Decision Trees with the ID3 Algorithm", Dr. Dobb's Journal, June 1996
- Paul E. Utgoff, "Incremental Induction of Decision Trees", Kluwer Academic Publishers, 1989
- http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm
- http://decisiontrees.net/node/27