ACM Student Chapter, Heritage Institute of Technology
3rd February, 2012
SIGKDD Presentation by Satarupa Guha, Sudipto Banerjee and Ashish Baheti
Slide 2
Machine Learning
A computer program is said to learn from experience E with respect to a class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
Slide 3
An Example: Checkers Learning Problem
Task T: playing checkers.
Performance P: percent of games won against opponents.
Experience E: gained by playing against itself.
Slide 4
Concept Learning Concept learning can be formulated as a
problem of searching through a predefined space of potential
hypotheses for the hypothesis that best fits the training examples.
Much of learning involves acquiring general concepts from specific
training examples.
Slide 5
Representing Hypotheses
Let H be a hypothesis space; each hypothesis h belonging to H is a conjunction of literals. Let X be the set of possible instances, each described by a set of attributes. The target function is c: X -> {0, 1}. The training examples D are positive and negative examples of the target function, given as pairs <x, c(x)>.
Slide 6
Types of Training Examples
Positive examples: those training examples that satisfy the target function, i.e., for which c(x) = 1 (TRUE).
Negative examples: those training examples that do not satisfy the target function, i.e., for which c(x) = 0 (FALSE).
Slide 8
Inductive Learning Hypothesis
Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other, unobserved examples. A hypothesis h is said to be consistent with a set of training examples D of target concept c iff h(x) = c(x) for each training example <x, c(x)> in D.
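To make the definition of consistency concrete, here is a minimal Python sketch (illustrative, not from the slides): a hypothesis is modelled as a boolean predicate over instances, and D as a list of (instance, label) pairs.

    # Minimal sketch: h is a predicate over instances, D a list of (x, c(x)) pairs.
    def is_consistent(h, D):
        """h is consistent with D iff h(x) == c(x) for every example <x, c(x)>."""
        return all(h(x) == label for x, label in D)

    # Hypothetical hypothesis: "plays tennis whenever humidity is normal".
    h = lambda x: x["humidity"] == "normal"
    D = [({"humidity": "normal"}, True), ({"humidity": "high"}, False)]
    print(is_consistent(h, D))  # True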
Slide 9
Classification Techniques
Decision Tree based methods
Rule-based methods
Memory-based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Slide 10
Decision Tree
The goal is to create a model that predicts the value of a target variable based on several input variables.
Slide 11
Decision tree representation Each internal node tests an
attribute. Each branch corresponds to an attribute value. Each leaf
node assigns a classification.
Slide 12
A quick recap CNF = Conjunctive Normal Form DNF = Disjunctive
Normal Form
Slide 13
Disjunctive Normal Form
In Boolean algebra, a formula is in DNF if it is a disjunction of clauses, where a clause is a conjunction of literals. Also known as Sum of Products. Example: (A ∧ B ∧ C) ∨ (B ∧ C)
Slide 14
Conjunctive Normal Form
In Boolean algebra, a formula is in CNF if it is a conjunction of clauses, where a clause is a disjunction of literals. Also known as Product of Sums. Example: (A ∨ B ∨ C) ∧ (B ∨ C)
Slide 15
Decision Tree: contd.
Decision trees represent a disjunction (OR) of conjunctions (AND) of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. Hence, a decision tree represents a DNF formula.
Slide 16
Attribute Splitting
2-way split
Multi-way split
Slide 17
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as there are distinct values, e.g. CarType into {Family}, {Sports}, {Luxury}.
Binary split: divide the values into two subsets, e.g. {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}; the optimal partitioning needs to be found.
Slide 18
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as there are distinct values, e.g. Size into {Small}, {Medium}, {Large}.
Binary split: divide the values into two subsets that respect the ordering, e.g. {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}; the optimal partitioning needs to be found.
Slide 19
Splitting Based on Continuous Attributes
Slide 20
Example of a Decision Tree
[Figure: a training-data table with two categorical attributes (Refund, MarSt), one continuous attribute (TaxInc) and a class label, alongside the decision tree model induced from it. Refund and MarSt are the splitting attributes, with TaxInc split at 80K (< 80K / >= 80K) leading to YES/NO leaves.]
Slide 21
DT Classification Task
Slide 22
Measures of Node Impurity
Entropy
GINI Index
Misclassification Error
Slide 23
Entropy
It characterizes the impurity of an arbitrary collection of examples. It is a measure of randomness.
Entropy(S) = -p+ log2 p+ - p- log2 p-
where S is a collection containing positive and negative examples of some target concept, p+ is the proportion of positive examples in S, and p- is the proportion of negative examples in S.
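As a quick illustration (not from the slides), the formula translates into the Python sketch below; it reproduces the [9+, 5-] example worked out on the next slide.

    import math

    def entropy(pos, neg):
        """Entropy of a collection with pos positive and neg negative examples."""
        total = pos + neg
        e = 0.0
        for count in (pos, neg):
            if count:  # 0 * log2(0) is taken as 0
                p = count / total
                e -= p * math.log2(p)
        return e

    print(round(entropy(9, 5), 2))  # 0.94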
Slide 24
An Example of Entropy
Let S be a collection of 14 examples, including 9 positive and 5 negative examples, denoted [9+, 5-]. Then Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Slide 25
More on Entropy
More generally: Entropy = 0 if all members belong to the same class; Entropy = 1 if the collection contains equal numbers of positive and negative examples; Entropy lies between 0 and 1 if there are unequal numbers of positive and negative examples.
Slide 26
GINI Index
GINI index for a given node t:
GINI(t) = 1 - sum_j [p(j | t)]^2
where p(j | t) is the relative frequency of class j at node t.
Maximum value (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information. Minimum value (0.0) when all records belong to one class, implying the most interesting information.
Splitting Based on GINI
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = sum_{i=1..k} (n_i / n) * GINI(i)
where n_i = number of records at child i, and n = number of records at node p.
Slide 29
Binary Attributes: Computing GINI Index
A binary attribute splits the records into two partitions. Effect of weighting partitions: larger and purer partitions are sought.
Example: attribute B splits 12 records into node N1 with class counts (5, 2) and node N2 with class counts (1, 4):
GINI(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
GINI(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
GINI(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
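Both GINI formulas can be sketched in Python (illustrative) and checked against the numbers above.

    def gini(counts):
        """GINI(t) = 1 - sum_j p(j|t)^2 for a node with the given class counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(children):
        """Weighted GINI of a split, given one class-count list per child."""
        n = sum(sum(child) for child in children)
        return sum(sum(child) / n * gini(child) for child in children)

    n1, n2 = [5, 2], [1, 4]
    print(round(gini(n1), 3))              # 0.408
    print(round(gini(n2), 3))              # 0.32
    print(round(gini_split([n1, n2]), 3))  # 0.371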
Slide 30
Categorical Attributes: Computing GINI Index
For each distinct value, gather counts for each class in the dataset. Use the count matrix to make decisions: either a multi-way split, or a two-way split (finding the best partition of values).
Slide 31
A Set of Training Examples
Day  Outlook   Humidity  Wind    Play tennis
D1   sunny     high      weak    no
D2   sunny     high      strong  no
D3   overcast  high      weak    yes
D4   rain      high      weak    yes
D5   rain      normal    weak    yes
D6   rain      normal    strong  no
D7   overcast  normal    strong  yes
D8   sunny     high      weak    no
D9   sunny     normal    weak    yes
D10  rain      normal    weak    yes
Slide 32
Decision Tree Learning Algorithms
Variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees. Examples are Hunt's Algorithm, CART, ID3, C4.5, SLIQ, SPRINT and MARS.
Slide 33
Algorithm ID3
Slide 34
Algorithm ID3
A greedy algorithm that grows the tree top-down. It begins with the question "Which attribute should be tested at the root of the tree?" A statistical property called information gain is used to answer it.
Slide 35
Information Gain
The expected reduction in entropy caused by partitioning the examples according to a particular attribute. The gain of an attribute A relative to a collection of examples S is defined as
Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)
where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.
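Using the entropy sketch from earlier as a base, Gain(S, A) can be written as follows (illustrative; the list-of-dicts dataset representation is an assumption, not from the slides).

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(examples, attribute, target):
        """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
        examples is a list of dicts; attribute and target are keys."""
        n = len(examples)
        g = entropy([e[target] for e in examples])
        for v in {e[attribute] for e in examples}:
            sv = [e[target] for e in examples if e[attribute] == v]
            g -= len(sv) / n * entropy(sv)
        return g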
Slide 36
Information Gain: contd.
Gain(S, A) is the information provided about the target function value, given the value of some other attribute A.
Example: S is a collection described by attributes including Wind, which can have the values Weak or Strong. Assume S has 14 examples. Then S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-].
Slide 38
Play Tennis Example: Revisited
Day  Outlook   Humidity  Wind    Play tennis
D1   sunny     high      weak    no
D2   sunny     high      strong  no
D3   overcast  high      weak    yes
D4   rain      high      weak    yes
D5   rain      normal    weak    yes
D6   rain      normal    strong  no
D7   overcast  normal    strong  yes
D8   sunny     high      weak    no
D9   sunny     normal    weak    yes
D10  rain      normal    weak    yes
Slide 39
Application of ID3 on Play Tennis
There are 3 attributes: Outlook, Humidity and Wind. We need to choose one of them as the root of the tree. We make this choice based on the information gain (IG) of each of the attributes; the one with the highest IG becomes the root. The calculations are shown in the following slides.
Slide 40
Quick Recap of Formulae
Entropy: Entropy(S) = p+ log2(1/p+) + p- log2(1/p-)
Information gain: Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)
where S is the collection and A is a particular attribute; Values(A) is the set of all possible values of A, and S_v is the subset of S for which attribute A has value v.
Slide 41
Calculations: For Outlook
The training set has 6 positive and 4 negative examples. Hence Entropy(S) = (4/10) lg(10/4) + (6/10) lg(10/6) = 0.971.
Outlook can have 3 values: sunny [1+, 3-], rain [3+, 1-] and overcast [2+, 0-].
Entropy(sunny) = (1/4) lg 4 + (3/4) lg(4/3) = 0.811
Entropy(rain) = (3/4) lg(4/3) + (1/4) lg 4 = 0.811
Entropy(overcast) = (2/2) lg(2/2) = 0
Slide 42
Calculations:
|S_v| / |S| for each of them is as follows: sunny = 4/10 (meaning 4 out of 10 examples have sunny as their outlook), rain = 4/10, overcast = 2/10.
Hence, information gain of Outlook = 0.971 - (4/10 * 0.811 + 4/10 * 0.811 + 2/10 * 0) = 0.322
Slide 43
Calculations: For Humidity
The training set has 6 positive and 4 negative examples; hence Entropy(S) = 0.971. Humidity can take 2 values: high [2+, 3-] and normal [4+, 1-].
Entropy(high) = (2/5) lg(5/2) + (3/5) lg(5/3) = 0.971
Entropy(normal) = (1/5) lg 5 + (4/5) lg(5/4) = 0.722
Slide 44
Calculations:
|S_v| / |S| for high = 5/10, for normal = 5/10. Hence IG(Humidity) = 0.971 - (5/10 * 0.971 + 5/10 * 0.722) = 0.125. Similarly for Wind, the IG is 0.091. Hence:
IG(Outlook) = 0.322
IG(Humidity) = 0.125
IG(Wind) = 0.091
Comparing the IGs of the 3 attributes, we find that Outlook has the highest IG (0.322).
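These values can be reproduced with a small standalone script (illustrative; the data is the training table from slide 31).

    import math
    from collections import Counter

    data = [  # (Outlook, Humidity, Wind, PlayTennis)
        ("sunny", "high", "weak", "no"),         ("sunny", "high", "strong", "no"),
        ("overcast", "high", "weak", "yes"),     ("rain", "high", "weak", "yes"),
        ("rain", "normal", "weak", "yes"),       ("rain", "normal", "strong", "no"),
        ("overcast", "normal", "strong", "yes"), ("sunny", "high", "weak", "no"),
        ("sunny", "normal", "weak", "yes"),      ("rain", "normal", "weak", "yes"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, col):
        g = entropy([r[-1] for r in rows])
        for v in {r[col] for r in rows}:
            sv = [r[-1] for r in rows if r[col] == v]
            g -= len(sv) / len(rows) * entropy(sv)
        return g

    for name, col in [("Outlook", 0), ("Humidity", 1), ("Wind", 2)]:
        print(name, round(gain(data, col), 3))  # 0.322, 0.125, 0.091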
Slide 45
Partially Formed Tree
Hence Outlook is chosen as the root of the decision tree. The partially formed decision tree is as follows:
Outlook at the root: Sunny [1+, 3-] and Rain [3+, 1-] remain to be expanded; Overcast [2+] is a pure leaf labelled yes.
Slide 46
Further Calculations
Since sunny and rain have both positive and negative examples, they have fair degrees of randomness and hence need to be classified further. For sunny: as computed earlier, Entropy(sunny) = 0.811. Now we need to find the corresponding Humidity and Wind values for those training examples that have Outlook = sunny.
Slide 47
Further Calculations
Day  Outlook  Humidity  Wind    Play tennis
D1   sunny    high      weak    no
D2   sunny    high      strong  no
D8   sunny    high      weak    no
D9   sunny    normal    weak    yes
For Humidity: (|S_v|/|S|) * Entropy(high) = (3/4) * 0 = 0, and (|S_v|/|S|) * Entropy(normal) = (1/4) * 0 = 0.
Slide 48
Calculations:
Zero, because there is no randomness: all the examples with Humidity = high have Play tennis = no, and those with Humidity = normal have Play tennis = yes.
IG(S_sunny, Humidity) = 0.811 - 0 = 0.811
For Wind: (|S_v|/|S|) * Entropy(weak) = (3/4) * ((2/3) lg(3/2) + (1/3) lg 3) = 0.689, and (|S_v|/|S|) * Entropy(strong) = (1/4) * 0 = 0.
IG(S_sunny, Wind) = 0.811 - 0.689 = 0.122
Clearly, Humidity has the higher IG.
Slide 49
The tree so far: Outlook at the root; Overcast [2+] -> yes; Sunny -> Humidity, with High [3-] -> no and Normal [1+] -> yes; Rain [3+, 1-] remains to be expanded.
Slide 50
Further Calculations
Now for Rain [3+, 1-]:
Day  Outlook  Humidity  Wind    Play tennis
D4   rain     high      weak    yes
D5   rain     normal    weak    yes
D6   rain     normal    strong  no
D10  rain     normal    weak    yes
Slide 51
Further Calculations
Entropy(rain) = (3/4) lg(4/3) + (1/4) lg 4 = 0.811
Checking the IG of Wind: Entropy(weak) = (3/3) lg(3/3) = 0 and Entropy(strong) = lg 1 = 0. Hence IG(S_rain, Wind) = 0.811 - 0 - 0 = 0.811.
Checking the IG of Humidity: Entropy(high) = 1 * lg 1 = 0 and Entropy(normal) = (1/3) lg 3 + (2/3) lg(3/2) = 0.918. Hence IG(S_rain, Humidity) = 0.811 - 0 - (3/4) * 0.918 = 0.122.
So Wind is chosen to split the Rain branch.
Slide 52
Final Decision Tree
Outlook at the root: Sunny -> Humidity (High -> NO, Normal -> YES); Overcast -> YES; Rain -> Wind (Strong -> NO, Weak -> YES).
Slide 53
Play Tennis: contd.
This decision tree corresponds to the following expression:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
As we can see, this is indeed a disjunction of conjunctions (DNF).
Slide 54
Features of ID3
Maintains only a single current hypothesis as it searches the space of decision trees. No backtracking at any step. Uses all training examples at each step of the search.
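Putting these pieces together, the search can be sketched as a short recursive procedure (illustrative; a simplified ID3 that returns a nested dict of the form {attribute: {value: subtree}} or a class label).

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def id3(rows, attributes, target):
        """Grow the tree top-down, greedily taking the highest-gain attribute."""
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1:            # pure node: make a leaf
            return labels[0]
        if not attributes:                   # no tests left: majority class
            return Counter(labels).most_common(1)[0][0]

        def gain(a):
            g = entropy(labels)
            for v in {r[a] for r in rows}:
                sv = [r[target] for r in rows if r[a] == v]
                g -= len(sv) / len(rows) * entropy(sv)
            return g

        best = max(attributes, key=gain)     # single hypothesis, no backtracking
        rest = [a for a in attributes if a != best]
        return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                       for v in {r[best] for r in rows}}}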
Slide 55
Inductive Bias in Decision Tree Learning
Inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
Slide 56
Inductive Bias in Decision Tree Learning
Roughly, the ID3 search strategy: it selects in favor of shorter trees over longer ones, and it selects trees that place attributes with the highest information gain closest to the root. ID3 employs a preference bias.
Slide 57
Occam's Razor
Prefer the simplest hypothesis that fits the data. Justification: there are fewer short (hence simple) hypotheses than long ones, so it is less likely that one will find a short hypothesis that coincidentally fits the training data. So we might believe a 5-node tree is less likely to be a statistical coincidence, and prefer this hypothesis over a 500-node hypothesis.
Slide 58
Slide 59
A Few Terms
ACCURACY
ERROR RATE
PRECISION
RECALL
Slide 60
An Example: suppose there are 2 classes, CLASS 1 and CLASS 0.
Slide 61
Confusion matrix (rows: ACTUAL CLASS, columns: PREDICTED CLASS):
                  PREDICTED CLASS 1   PREDICTED CLASS 0
ACTUAL CLASS 1    f11                 f10
ACTUAL CLASS 0    f01                 f00
Slide 62
Meanings of the Terms
Let us assume that class 1 is +ve and class 0 is -ve. Then:
f11 means a +ve output predicted as +ve
f10 means a +ve output predicted as -ve
f01 means a -ve output predicted as +ve
f00 means a -ve output predicted as -ve
Slide 63
Hence f11 and f00 are the two cases that have been accurately predicted, and f10 and f01 are the two cases that are predicted with an error.
ACCURACY = (f11 + f00) / (f11 + f10 + f01 + f00)
ERROR RATE = (f01 + f10) / (f11 + f10 + f01 + f00)
Clearly, accuracy + error rate = 1.
Slide 64
PRECISION = f11 / (f11 + f01)
RECALL = f11 / (f11 + f10)
Example: let there be 8 batsmen present. A prediction is made that there are 7 batsmen. It is found that, of these 7, 5 are batsmen and 2 are bowlers. So precision is 5/7 and recall is 5/8.
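All four measures translate directly into Python (illustrative). In the batsmen example, f11 = 5, f10 = 3 (the batsmen that were missed) and f01 = 2; f00 (bowlers correctly identified) is not given, but precision and recall do not depend on it.

    def metrics(f11, f10, f01, f00):
        """Accuracy, error rate, precision and recall from confusion-matrix counts."""
        total = f11 + f10 + f01 + f00
        return {
            "accuracy": (f11 + f00) / total,
            "error_rate": (f01 + f10) / total,
            "precision": f11 / (f11 + f01),
            "recall": f11 / (f11 + f10),
        }

    m = metrics(f11=5, f10=3, f01=2, f00=0)  # f00 unknown; set to 0 here
    print(m["precision"], m["recall"])       # 0.714... (5/7), 0.625 (5/8)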
Slide 65
Over-fitting
A hypothesis over-fits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances. Given a hypothesis space H, a hypothesis h ∈ H is said to over-fit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
Slide 66
Over-fitting
Solid line: accuracy over the training data. Broken line: accuracy over an independent set of test examples, not included in the training examples.
Slide 67
Causes of Over-fitting
When the training examples contain random errors or noise. When the training data consists of a small number of examples.
Slide 68
Two approaches to avoid over-fitting:
1. Stop growing the tree early, before it reaches the point where it perfectly classifies the training data.
2. First allow the tree to over-fit the data, then post-prune the tree.
Slide 69
How to Determine the Correct Final Size of the Tree
Use a separate set of examples, called a validation set, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree. Alternatively, use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.
Slide 70
Pruning the decision tree There are many methods by which a
decision tree can be pruned: 1. Reduced Error Pruning 2. Rule Post
Pruning
Slide 71
Reduced Error Pruning
Consider each of the decision nodes in the tree to be a candidate for pruning. Pruning a node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
Slide 72
Reduced Error Pruning (contd.)
Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree's accuracy over the validation set. Pruning continues until further pruning would reduce the accuracy of the decision tree over the validation set.
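A high-level sketch of this loop (illustrative; the Node class and the accuracy callable, which scores the tree on the validation set, are assumptions, since the slides do not fix a representation).

    class Node:
        def __init__(self, attribute=None, children=None, majority_label=None):
            self.attribute = attribute            # attribute tested at this node
            self.children = children or {}        # value -> subtree; {} for a leaf
            self.majority_label = majority_label  # most common class at this node

    def internal_nodes(node):
        """Yield every decision (non-leaf) node in the tree."""
        if node.children:
            yield node
            for child in node.children.values():
                yield from internal_nodes(child)

    def reduced_error_prune(tree, accuracy):
        while True:
            base = accuracy(tree)
            best, best_acc = None, base
            for node in internal_nodes(tree):
                saved, node.children = node.children, {}  # temporarily make a leaf
                acc = accuracy(tree)
                node.children = saved                     # restore the sub-tree
                if acc >= best_acc:
                    best, best_acc = node, acc
            if best is None:
                return tree      # any further pruning would reduce accuracy
            best.children = {}   # permanently prune the best candidate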
Slide 73
Reduced Error Pruning Performance
Slide 74
Rule Post Pruning
The steps of rule post pruning are: infer the decision tree from the training set, allowing over-fitting to occur; then convert the learned tree into an equivalent set of rules by creating one rule for each path.
Slide 75
Rule Post Pruning
Prune each rule by removing any preconditions whose removal improves its estimated accuracy. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
Slide 76
Post Pruning (contd.)
In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition), and the classification at the leaf node becomes the rule consequent (postcondition). Let us consider an example:
Slide 77
Step 1: Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.
Slide 78
Step 2: Convert the learned tree into an equivalent set of
rules by creating one rule for each path from the root node to a
leaf node.
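Applied to the final play-tennis tree from slide 52, for instance, this step looks as follows (illustrative sketch; the nested-dict tree form matches the ID3 sketch shown earlier).

    def tree_to_rules(tree, preconditions=()):
        """Yield one (preconditions, classification) pair per root-to-leaf path."""
        if not isinstance(tree, dict):    # leaf: emit the accumulated rule
            yield list(preconditions), tree
            return
        (attribute, branches), = tree.items()
        for value, subtree in branches.items():
            yield from tree_to_rules(subtree, preconditions + ((attribute, value),))

    play_tennis_tree = {"Outlook": {
        "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"Wind": {"weak": "yes", "strong": "no"}},
    }}
    for pre, label in tree_to_rules(play_tennis_tree):
        print("IF", " AND ".join(f"{a} = {v}" for a, v in pre), "THEN", label)
    # e.g. IF Outlook = sunny AND Humidity = high THEN no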
Slide 79
Step 3: Prune (generalize) each rule by removing any
preconditions that result in improving its estimated accuracy.
Slide 80
Step 3: contd.
Slide 81
Step 4. Sort the pruned rules by their estimated accuracy, and
consider them in this sequence when classifying subsequent
instances.
Slide 82
Post Pruning for the Binary Case
For any node S that is not a leaf node, with children S1, S2, ..., Sm, we can calculate
BackUpError(S) = sum_i P_i * Error(S_i)
where P_i = (number of examples in S_i) / (number of examples in S)
Error(S) = MIN{ E(S), BackUpError(S) }
For leaf nodes S_i, Error(S_i) = E(S_i).
Decision: prune at S if BackUpError(S) >= Error(S).
Slide 83
Example of Post Pruning
Before pruning ([x, y] means x YES cases and y NO cases):
a [6, 4] has children b [4, 2] and c [2, 2].
b [4, 2] has leaf children [3, 2] (error 0.429) and [1, 0] (error 0.333); E(b) = 0.375, BackUpError(b) = 5/6 * 0.429 + 1/6 * 0.333 = 0.413.
c [2, 2] has children d [1, 2] and the leaf [1, 0] (error 0.333); E(c) = 0.5, BackUpError(c) = 3/4 * 0.4 + 1/4 * 0.333 = 0.383.
d [1, 2] has leaf children [1, 1] (error 0.5) and [0, 1] (error 0.333); E(d) = 0.4, BackUpError(d) = 2/3 * 0.5 + 1/3 * 0.333 = 0.444.
E(a) = 0.417, BackUpError(a) = 6/10 * 0.375 + 4/10 * 0.383 = 0.378.
PRUNE at b (0.413 >= 0.375) and at d (0.444 >= 0.4); PRUNE means cut the sub-tree below this point.
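The slides do not define E(S), but the numbers above are consistent with the two-class Laplace error estimate E(S) = (N - n + 1) / (N + 2), where N is the number of examples at the node and n the majority-class count. Under that assumption, the whole example can be reproduced (illustrative sketch).

    # Assumption: E(S) is the Laplace error estimate (N - n + 1) / (N + 2).
    def laplace_error(yes, no):
        n_total, n_major = yes + no, max(yes, no)
        return (n_total - n_major + 1) / (n_total + 2)

    def error(node):
        """Error(S) = E(S) for leaves, MIN{E(S), BackUpError(S)} otherwise."""
        yes, no, children = node
        e = laplace_error(yes, no)
        if not children:
            return e
        total = yes + no
        backup = sum((ch[0] + ch[1]) / total * error(ch) for ch in children)
        if backup >= e:
            print(f"prune at [{yes}, {no}]  (BackUpError {backup:.3f} >= E {e:.3f})")
        return min(e, backup)

    # The tree from the slide, as (yes, no, children) triples:
    d = (1, 2, [(1, 1, []), (0, 1, [])])
    c = (2, 2, [d, (1, 0, [])])
    b = (4, 2, [(3, 2, []), (1, 0, [])])
    error((6, 4, [b, c]))  # prints: prune at [4, 2] ... and prune at [1, 2] ...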
Slide 84
Result of Pruning
After pruning: a [6, 4] has children the leaf [4, 2] (formerly b) and c [2, 2], whose children are the leaf [1, 2] (formerly d) and the leaf [1, 0].
Slide 85
Advantages of DT
Simple to understand and interpret. Requires little data preparation. Able to handle both numerical and categorical data. Possible to validate a model using statistical tests. Robust. Performs well on large data in a short time.
Slide 86
Limitations of DT
The problem of learning an optimal decision tree is known to be NP-complete. Decision trees are prone to over-fitting. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity and multiplexer problems. For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of the attributes with more levels.