ACM Student Chapter, Heritage Institute of Technology
3rd February, 2012
SIGKDD Presentation by Satarupa Guha, Sudipto Banerjee and Ashish Baheti
Slide 2
Machine Learning
A computer program is said to learn from experience E with respect to a class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
Slide 3
An Example: Checkers Learning Problem
Task T: playing checkers.
Performance P: percent of games won against opponents.
Experience E: gained by playing against itself.
Slide 4
Concept Learning Concept learning can be formulated as a
problem of searching through a predefined space of potential
hypotheses for the hypothesis that best fits the training examples.
Much of learning involves acquiring general concepts from specific
training examples.
Slide 5
Representing Hypotheses
Let H be a hypothesis space; each hypothesis h belonging to H is a conjunction of literals. Let X be the set of possible instances, each described by a set of attributes. The target function is c: X -> {0, 1}. The training examples D are positive and negative examples of the target function, given as pairs <x, c(x)>.
Slide 6
Types of Training Examples
Positive examples: those training examples that satisfy the target function, i.e., for which c(x) = 1 (TRUE).
Negative examples: those training examples that do not satisfy the target function, i.e., for which c(x) = 0 (FALSE).
Slide 8
Inductive Learning Hypothesis
Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other, unobserved examples. A hypothesis h is said to be consistent with a set of training examples D of target concept c iff h(x) = c(x) for each training example <x, c(x)> in D.
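To make the definition of consistency concrete, here is a minimal Python sketch (illustrative, not from the slides): a hypothesis is modelled as a boolean predicate over instances, and D as a list of (instance, label) pairs.

    # Minimal sketch: h is a predicate over instances, D a list of (x, c(x)) pairs.
    def is_consistent(h, D):
        """h is consistent with D iff h(x) == c(x) for every example <x, c(x)>."""
        return all(h(x) == label for x, label in D)

    # Hypothetical hypothesis: "plays tennis whenever humidity is normal".
    h = lambda x: x["humidity"] == "normal"
    D = [({"humidity": "normal"}, True), ({"humidity": "high"}, False)]
    print(is_consistent(h, D))  # True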
Slide 9
Classification Techniques
Decision Tree based methods
Rule-based methods
Memory-based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Slide 10
Decision Tree
The goal is to create a model that predicts the value of a target variable based on several input variables.
Slide 11
Decision tree representation Each internal node tests an
attribute. Each branch corresponds to an attribute value. Each leaf
node assigns a classification.
Slide 12
A quick recap CNF = Conjunctive Normal Form DNF = Disjunctive
Normal Form
Slide 13
Disjunctive Normal Form
In Boolean algebra, a formula is in DNF if it is a disjunction of clauses, where a clause is a conjunction of literals. Also known as Sum of Products. Example: (A ∧ B ∧ C) ∨ (B ∧ C)
Slide 14
Conjunctive Normal Form
In Boolean algebra, a formula is in CNF if it is a conjunction of clauses, where a clause is a disjunction of literals. Also known as Product of Sums. Example: (A ∨ B ∨ C) ∧ (B ∨ C)
Slide 15
Decision Tree: contd.
Decision trees represent a disjunction (OR) of conjunctions (AND) of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. Hence, a decision tree represents a DNF formula.
Slide 16
Attribute Splitting
2-way split
Multi-way split
Slide 17
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as there are distinct values, e.g. CarType into {Family}, {Sports}, {Luxury}.
Binary split: divide the values into two subsets, e.g. {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}; the optimal partitioning needs to be found.
Slide 18
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as there are distinct values, e.g. Size into {Small}, {Medium}, {Large}.
Binary split: divide the values into two subsets that respect the ordering, e.g. {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}; the optimal partitioning needs to be found.
Slide 19
Splitting Based on Continuous Attributes
Slide 20
Example of a Decision Tree
[Figure: a training-data table with two categorical attributes (Refund, MarSt), one continuous attribute (TaxInc) and a class label, alongside the decision tree model induced from it. Refund and MarSt are the splitting attributes, with TaxInc split at 80K (< 80K / >= 80K) leading to YES/NO leaves.]
Slide 21
DT Classification Task
Slide 22
Measures of Node Impurity
Entropy
GINI Index
Misclassification Error
Slide 23
Entropy
It characterizes the impurity of an arbitrary collection of examples. It is a measure of randomness.
Entropy(S) = -p+ log2 p+ - p- log2 p-
where S is a collection containing positive and negative examples of some target concept, p+ is the proportion of positive examples in S, and p- is the proportion of negative examples in S.
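As a quick illustration (not from the slides), the formula translates into the Python sketch below; it reproduces the [9+, 5-] example worked out on the next slide.

    import math

    def entropy(pos, neg):
        """Entropy of a collection with pos positive and neg negative examples."""
        total = pos + neg
        e = 0.0
        for count in (pos, neg):
            if count:  # 0 * log2(0) is taken as 0
                p = count / total
                e -= p * math.log2(p)
        return e

    print(round(entropy(9, 5), 2))  # 0.94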
Slide 24
An Example of Entropy
Let S be a collection of 14 examples, including 9 positive and 5 negative examples, denoted [9+, 5-]. Then Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Slide 25
More on Entropy
More generally: Entropy = 0 if all members belong to the same class; Entropy = 1 if the collection contains equal numbers of positive and negative examples; Entropy lies between 0 and 1 if there are unequal numbers of positive and negative examples.
Slide 26
GINI Index
GINI index for a given node t:
GINI(t) = 1 - sum_j [p(j | t)]^2
where p(j | t) is the relative frequency of class j at node t.
Maximum value (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information. Minimum value (0.0) when all records belong to one class, implying the most interesting information.
Splitting Based on GINI
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = sum_{i=1..k} (n_i / n) * GINI(i)
where n_i = number of records at child i, and n = number of records at node p.
Slide 29
Binary Attributes: Computing GINI Index
A binary attribute splits the records into two partitions. Effect of weighting partitions: larger and purer partitions are sought.
Example: attribute B splits 12 records into node N1 with class counts (5, 2) and node N2 with class counts (1, 4):
GINI(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
GINI(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
GINI(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
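Both GINI formulas can be sketched in Python (illustrative) and checked against the numbers above.

    def gini(counts):
        """GINI(t) = 1 - sum_j p(j|t)^2 for a node with the given class counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(children):
        """Weighted GINI of a split, given one class-count list per child."""
        n = sum(sum(child) for child in children)
        return sum(sum(child) / n * gini(child) for child in children)

    n1, n2 = [5, 2], [1, 4]
    print(round(gini(n1), 3))              # 0.408
    print(round(gini(n2), 3))              # 0.32
    print(round(gini_split([n1, n2]), 3))  # 0.371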
Slide 30
Categorical Attributes: Computing GINI Index
For each distinct value, gather counts for each class in the dataset. Use the count matrix to make decisions: either a multi-way split, or a two-way split (finding the best partition of values).
Slide 31
A Set of Training Examples
Day  Outlook   Humidity  Wind    Play tennis
D1   sunny     high      weak    no
D2   sunny     high      strong  no
D3   overcast  high      weak    yes
D4   rain      high      weak    yes
D5   rain      normal    weak    yes
D6   rain      normal    strong  no
D7   overcast  normal    strong  yes
D8   sunny     high      weak    no
D9   sunny     normal    weak    yes
D10  rain      normal    weak    yes
Slide 32
Decision Tree Learning Algorithms
Variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees. Examples are Hunt's Algorithm, CART, ID3, C4.5, SLIQ, SPRINT and MARS.
Slide 33
Algorithm ID3
Slide 34
Algorithm ID3
A greedy algorithm that grows the tree top-down. It begins with the question "Which attribute should be tested at the root of the tree?" A statistical property called information gain is used to answer it.
Slide 35
Information Gain
The expected reduction in entropy caused by partitioning the examples according to a particular attribute. The gain of an attribute A relative to a collection of examples S is defined as
Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)
where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.
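Using the entropy sketch from earlier as a base, Gain(S, A) can be written as follows (illustrative; the list-of-dicts dataset representation is an assumption, not from the slides).

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(examples, attribute, target):
        """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
        examples is a list of dicts; attribute and target are keys."""
        n = len(examples)
        g = entropy([e[target] for e in examples])
        for v in {e[attribute] for e in examples}:
            sv = [e[target] for e in examples if e[attribute] == v]
            g -= len(sv) / n * entropy(sv)
        return g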
Slide 36
Information Gain: contd.
Gain(S, A) is the information provided about the target function value, given the value of some other attribute A.
Example: S is a collection described by attributes including Wind, which can have the values Weak or Strong. Assume S has 14 examples. Then S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-].
Slide 38
Play Tennis Example: Revisited
Day  Outlook   Humidity  Wind    Play tennis
D1   sunny     high      weak    no
D2   sunny     high      strong  no
D3   overcast  high      weak    yes
D4   rain      high      weak    yes
D5   rain      normal    weak    yes
D6   rain      normal    strong  no
D7   overcast  normal    strong  yes
D8   sunny     high      weak    no
D9   sunny     normal    weak    yes
D10  rain      normal    weak    yes
Slide 39
Application of ID3 on Play Tennis
There are 3 attributes: Outlook, Humidity and Wind. We need to choose one of them as the root of the tree. We make this choice based on the information gain (IG) of each of the attributes; the one with the highest IG becomes the root. The calculations are shown in the following slides.
Slide 40
Quick Recap of Formulae
Entropy: Entropy(S) = p+ log2(1/p+) + p- log2(1/p-)
Information gain: Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)
where S is the collection and A is a particular attribute; Values(A) is the set of all possible values of A, and S_v is the subset of S for which attribute A has value v.
Slide 41
Calculations: For Outlook
The training set has 6 positive and 4 negative examples. Hence Entropy(S) = (4/10) lg(10/4) + (6/10) lg(10/6) = 0.971.
Outlook can have 3 values: sunny [1+, 3-], rain [3+, 1-] and overcast [2+, 0-].
Entropy(sunny) = (1/4) lg 4 + (3/4) lg(4/3) = 0.811
Entropy(rain) = (3/4) lg(4/3) + (1/4) lg 4 = 0.811
Entropy(overcast) = (2/2) lg(2/2) = 0
Slide 42
Calculations:
|S_v| / |S| for each of them is as follows: sunny = 4/10 (meaning 4 out of 10 examples have sunny as their outlook), rain = 4/10, overcast = 2/10.
Hence, information gain of Outlook = 0.971 - (4/10 * 0.811 + 4/10 * 0.811 + 2/10 * 0) = 0.322
Slide 43
Calculations: For Humidity
The training set has 6 positive and 4 negative examples; hence Entropy(S) = 0.971. Humidity can take 2 values: high [2+, 3-] and normal [4+, 1-].
Entropy(high) = (2/5) lg(5/2) + (3/5) lg(5/3) = 0.971
Entropy(normal) = (1/5) lg 5 + (4/5) lg(5/4) = 0.722
Slide 44
Calculations:
|S_v| / |S| for high = 5/10, for normal = 5/10. Hence IG(Humidity) = 0.971 - (5/10 * 0.971 + 5/10 * 0.722) = 0.125. Similarly for Wind, the IG is 0.091. Hence:
IG(Outlook) = 0.322
IG(Humidity) = 0.125
IG(Wind) = 0.091
Comparing the IGs of the 3 attributes, we find that Outlook has the highest IG (0.322).
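These values can be reproduced with a small standalone script (illustrative; the data is the training table from slide 31).

    import math
    from collections import Counter

    data = [  # (Outlook, Humidity, Wind, PlayTennis)
        ("sunny", "high", "weak", "no"),         ("sunny", "high", "strong", "no"),
        ("overcast", "high", "weak", "yes"),     ("rain", "high", "weak", "yes"),
        ("rain", "normal", "weak", "yes"),       ("rain", "normal", "strong", "no"),
        ("overcast", "normal", "strong", "yes"), ("sunny", "high", "weak", "no"),
        ("sunny", "normal", "weak", "yes"),      ("rain", "normal", "weak", "yes"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, col):
        g = entropy([r[-1] for r in rows])
        for v in {r[col] for r in rows}:
            sv = [r[-1] for r in rows if r[col] == v]
            g -= len(sv) / len(rows) * entropy(sv)
        return g

    for name, col in [("Outlook", 0), ("Humidity", 1), ("Wind", 2)]:
        print(name, round(gain(data, col), 3))  # 0.322, 0.125, 0.091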
Slide 45
Partially Formed Tree
Hence Outlook is chosen as the root of the decision tree. The partially formed decision tree is as follows:
Outlook at the root: Sunny [1+, 3-] and Rain [3+, 1-] remain to be expanded; Overcast [2+] is a pure leaf labelled yes.
Slide 46
Further Calculations
Since sunny and rain have both positive and negative examples, they have fair degrees of randomness and hence need to be classified further. For sunny: as computed earlier, Entropy(sunny) = 0.811. Now we need to find the corresponding Humidity and Wind values for those training examples that have Outlook = sunny.
Slide 47
Further Calculations
Day  Outlook  Humidity  Wind    Play tennis
D1   sunny    high      weak    no
D2   sunny    high      strong  no
D8   sunny    high      weak    no
D9   sunny    normal    weak    yes
For Humidity: (|S_v|/|S|) * Entropy(high) = (3/4) * 0 = 0, and (|S_v|/|S|) * Entropy(normal) = (1/4) * 0 = 0.
Slide 48
Calculations:
Zero, because there is no randomness: all the examples with Humidity = high have Play tennis = no, and those with Humidity = normal have Play tennis = yes.
IG(S_sunny, Humidity) = 0.811 - 0 = 0.811
For Wind: (|S_v|/|S|) * Entropy(weak) = (3/4) * ((2/3) lg(3/2) + (1/3) lg 3) = 0.689, and (|S_v|/|S|) * Entropy(strong) = (1/4) * 0 = 0.
IG(S_sunny, Wind) = 0.811 - 0.689 = 0.122
Clearly, Humidity has the higher IG.
Slide 49
The tree so far: Outlook at the root; Overcast [2+] -> yes; Sunny -> Humidity, with High [3-] -> no and Normal [1+] -> yes; Rain [3+, 1-] remains to be expanded.
Slide 50
Further Calculations
Now for Rain [3+, 1-]:
Day  Outlook  Humidity  Wind    Play tennis
D4   rain     high      weak    yes
D5   rain     normal    weak    yes
D6   rain     normal    strong  no
D10  rain     normal    weak    yes
Slide 51
Further Calculations
Entropy(rain) = (3/4) lg(4/3) + (1/4) lg 4 = 0.811
Checking the IG of Wind: Entropy(weak) = (3/3) lg(3/3) = 0 and Entropy(strong) = lg 1 = 0. Hence IG(S_rain, Wind) = 0.811 - 0 - 0 = 0.811.
Checking the IG of Humidity: Entropy(high) = 1 * lg 1 = 0 and Entropy(normal) = (1/3) lg 3 + (2/3) lg(3/2) = 0.918. Hence IG(S_rain, Humidity) = 0.811 - 0 - (3/4) * 0.918 = 0.122.
So Wind is chosen to split the Rain branch.
Slide 52
Final Decision Tree
Outlook at the root: Sunny -> Humidity (High -> NO, Normal -> YES); Overcast -> YES; Rain -> Wind (Strong -> NO, Weak -> YES).
Slide 53
Play Tennis: contd.
This decision tree corresponds to the following expression:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
As we can see, this is indeed a disjunction of conjunctions (DNF).
Slide 54
Features of ID3
Maintains only a single current hypothesis as it searches the space of decision trees. No backtracking at any step. Uses all training examples at each step of the search.
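Putting these pieces together, the search can be sketched as a short recursive procedure (illustrative; a simplified ID3 that returns a nested dict of the form {attribute: {value: subtree}} or a class label).

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def id3(rows, attributes, target):
        """Grow the tree top-down, greedily taking the highest-gain attribute."""
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1:            # pure node: make a leaf
            return labels[0]
        if not attributes:                   # no tests left: majority class
            return Counter(labels).most_common(1)[0][0]

        def gain(a):
            g = entropy(labels)
            for v in {r[a] for r in rows}:
                sv = [r[target] for r in rows if r[a] == v]
                g -= len(sv) / len(rows) * entropy(sv)
            return g

        best = max(attributes, key=gain)     # single hypothesis, no backtracking
        rest = [a for a in attributes if a != best]
        return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                       for v in {r[best] for r in rows}}}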
Slide 55
Inductive Bias in Decision Tree Learning
Inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
Slide 56
Inductive Bias in Decision Tree Learning
Roughly, the ID3 search strategy: it selects in favor of shorter trees over longer ones, and it selects trees that place attributes with the highest information gain closest to the root. ID3 employs a preference bias.
Slide 57
Occam's Razor
Prefer the simplest hypothesis that fits the data. Justification: there are fewer short (hence simple) hypotheses than long ones, so it is less likely that one will find a short hypothesis that coincidentally fits the training data. So we might believe a 5-node tree is less likely to be a statistical coincidence, and prefer this hypothesis over a 500-node hypothesis.
Slide 58
Slide 59
A Few Terms
ACCURACY
ERROR RATE
PRECISION
RECALL
Slide 60
An Example: suppose there are 2 classes, CLASS 1 and CLASS 0.
Slide 61
Confusion matrix (rows: ACTUAL CLASS, columns: PREDICTED CLASS):
                  PREDICTED CLASS 1   PREDICTED CLASS 0
ACTUAL CLASS 1    f11                 f10
ACTUAL CLASS 0    f01                 f00
Slide 62
Meanings of the Terms
Let us assume that class 1 is +ve and class 0 is -ve. Then:
f11 means a +ve output predicted as +ve
f10 means a +ve output predicted as -ve
f01 means a -ve output predicted as +ve
f00 means a -ve output predicted as -ve
Slide 63
Hence f11 and f00 are the two cases that have been accurately predicted, and f10 and f01 are the two cases that are predicted with an error.
ACCURACY = (f11 + f00) / (f11 + f10 + f01 + f00)
ERROR RATE = (f01 + f10) / (f11 + f10 + f01 + f00)
Clearly, accuracy + error rate = 1.
Slide 64
PRECISION = f11 / (f11 + f01)
RECALL = f11 / (f11 + f10)
Example: let there be 8 batsmen present. A prediction is made that there are 7 batsmen. It is found that, of these 7, 5 are batsmen and 2 are bowlers. So precision is 5/7 and recall is 5/8.
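All four measures translate directly into Python (illustrative). In the batsmen example, f11 = 5, f10 = 3 (the batsmen that were missed) and f01 = 2; f00 (bowlers correctly identified) is not given, but precision and recall do not depend on it.

    def metrics(f11, f10, f01, f00):
        """Accuracy, error rate, precision and recall from confusion-matrix counts."""
        total = f11 + f10 + f01 + f00
        return {
            "accuracy": (f11 + f00) / total,
            "error_rate": (f01 + f10) / total,
            "precision": f11 / (f11 + f01),
            "recall": f11 / (f11 + f10),
        }

    m = metrics(f11=5, f10=3, f01=2, f00=0)  # f00 unknown; set to 0 here
    print(m["precision"], m["recall"])       # 0.714... (5/7), 0.625 (5/8)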
Slide 65
Over-fitting
A hypothesis over-fits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances. Given a hypothesis space H, a hypothesis h ∈ H is said to over-fit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
Slide 66
Over-fitting
Solid line: accuracy over the training data. Broken line: accuracy over an independent set of test examples, not included in the training examples.
Slide 67
Causes of Over-fitting
When the training examples contain random errors or noise. When the training data consists of a small number of examples.
Slide 68
Two approaches to avoid over-fitting:
1. Stop growing the tree early, before it reaches the point where it perfectly classifies the training data.
2. First allow the tree to over-fit the data, then post-prune the tree.
Slide 69
How to Determine the Correct Final Size of the Tree
Use a separate set of examples, called a validation set, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree. Alternatively, use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.
Slide 70
Pruning the decision tree There are many methods by which a
decision tree can be pruned: 1. Reduced Error Pruning 2. Rule Post
Pruning
Slide 71
Reduced Error Pruning
Consider each of the decision nodes in the tree to be a candidate for pruning. Pruning a node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
Slide 72
Reduced Error Pruning (contd.)
Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree's accuracy over the validation set. Pruning continues until further pruning would reduce the accuracy of the decision tree over the validation set.
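A high-level sketch of this loop (illustrative; the Node class and the accuracy callable, which scores the tree on the validation set, are assumptions, since the slides do not fix a representation).

    class Node:
        def __init__(self, attribute=None, children=None, majority_label=None):
            self.attribute = attribute            # attribute tested at this node
            self.children = children or {}        # value -> subtree; {} for a leaf
            self.majority_label = majority_label  # most common class at this node

    def internal_nodes(node):
        """Yield every decision (non-leaf) node in the tree."""
        if node.children:
            yield node
            for child in node.children.values():
                yield from internal_nodes(child)

    def reduced_error_prune(tree, accuracy):
        while True:
            base = accuracy(tree)
            best, best_acc = None, base
            for node in internal_nodes(tree):
                saved, node.children = node.children, {}  # temporarily make a leaf
                acc = accuracy(tree)
                node.children = saved                     # restore the sub-tree
                if acc >= best_acc:
                    best, best_acc = node, acc
            if best is None:
                return tree      # any further pruning would reduce accuracy
            best.children = {}   # permanently prune the best candidate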
Slide 73
Reduced Error Pruning Performance
Slide 74
Rule Post Pruning
The steps of rule post pruning are: infer the decision tree from the training set, allowing over-fitting to occur; then convert the learned tree into an equivalent set of rules by creating one rule for each path.
Slide 75
Rule Post Pruning
Prune each rule by removing any preconditions whose removal improves its estimated accuracy. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
Slide 76
Post Pruning (contd.)
In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition), and the classification at the leaf node becomes the rule consequent (postcondition). Let us consider an example:
Slide 77
Step 1: Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.
Slide 78
Step 2: Convert the learned tree into an equivalent set of
rules by creating one rule for each path from the root node to a
leaf node.
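Applied to the final play-tennis tree from slide 52, for instance, this step looks as follows (illustrative sketch; the nested-dict tree form matches the ID3 sketch shown earlier).

    def tree_to_rules(tree, preconditions=()):
        """Yield one (preconditions, classification) pair per root-to-leaf path."""
        if not isinstance(tree, dict):    # leaf: emit the accumulated rule
            yield list(preconditions), tree
            return
        (attribute, branches), = tree.items()
        for value, subtree in branches.items():
            yield from tree_to_rules(subtree, preconditions + ((attribute, value),))

    play_tennis_tree = {"Outlook": {
        "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"Wind": {"weak": "yes", "strong": "no"}},
    }}
    for pre, label in tree_to_rules(play_tennis_tree):
        print("IF", " AND ".join(f"{a} = {v}" for a, v in pre), "THEN", label)
    # e.g. IF Outlook = sunny AND Humidity = high THEN no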
Slide 79
Step 3: Prune (generalize) each rule by removing any
preconditions that result in improving its estimated accuracy.
Slide 80
Step 3: contd.
Slide 81
Step 4. Sort the pruned rules by their estimated accuracy, and
consider them in this sequence when classifying subsequent
instances.
Slide 82
Post Pruning for the Binary Case
For any node S that is not a leaf node, with children S1, S2, ..., Sm, we can calculate
BackUpError(S) = sum_i P_i * Error(S_i)
where P_i = (number of examples in S_i) / (number of examples in S)
Error(S) = MIN{ E(S), BackUpError(S) }
For leaf nodes S_i, Error(S_i) = E(S_i).
Decision: prune at S if BackUpError(S) >= Error(S).
Slide 83
Example of Post Pruning
Before pruning ([x, y] means x YES cases and y NO cases):
a [6, 4] has children b [4, 2] and c [2, 2].
b [4, 2] has leaf children [3, 2] (error 0.429) and [1, 0] (error 0.333); E(b) = 0.375, BackUpError(b) = 5/6 * 0.429 + 1/6 * 0.333 = 0.413.
c [2, 2] has children d [1, 2] and the leaf [1, 0] (error 0.333); E(c) = 0.5, BackUpError(c) = 3/4 * 0.4 + 1/4 * 0.333 = 0.383.
d [1, 2] has leaf children [1, 1] (error 0.5) and [0, 1] (error 0.333); E(d) = 0.4, BackUpError(d) = 2/3 * 0.5 + 1/3 * 0.333 = 0.444.
E(a) = 0.417, BackUpError(a) = 6/10 * 0.375 + 4/10 * 0.383 = 0.378.
PRUNE at b (0.413 >= 0.375) and at d (0.444 >= 0.4); PRUNE means cut the sub-tree below this point.
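The slides do not define E(S), but the numbers above are consistent with the two-class Laplace error estimate E(S) = (N - n + 1) / (N + 2), where N is the number of examples at the node and n the majority-class count. Under that assumption, the whole example can be reproduced (illustrative sketch).

    # Assumption: E(S) is the Laplace error estimate (N - n + 1) / (N + 2).
    def laplace_error(yes, no):
        n_total, n_major = yes + no, max(yes, no)
        return (n_total - n_major + 1) / (n_total + 2)

    def error(node):
        """Error(S) = E(S) for leaves, MIN{E(S), BackUpError(S)} otherwise."""
        yes, no, children = node
        e = laplace_error(yes, no)
        if not children:
            return e
        total = yes + no
        backup = sum((ch[0] + ch[1]) / total * error(ch) for ch in children)
        if backup >= e:
            print(f"prune at [{yes}, {no}]  (BackUpError {backup:.3f} >= E {e:.3f})")
        return min(e, backup)

    # The tree from the slide, as (yes, no, children) triples:
    d = (1, 2, [(1, 1, []), (0, 1, [])])
    c = (2, 2, [d, (1, 0, [])])
    b = (4, 2, [(3, 2, []), (1, 0, [])])
    error((6, 4, [b, c]))  # prints: prune at [4, 2] ... and prune at [1, 2] ...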
Slide 84
Result of Pruning
After pruning: a [6, 4] has children the leaf [4, 2] (formerly b) and c [2, 2], whose children are the leaf [1, 2] (formerly d) and the leaf [1, 0].
Slide 85
Advantages of DT
Simple to understand and interpret. Requires little data preparation. Able to handle both numerical and categorical data. Possible to validate a model using statistical tests. Robust. Performs well on large data in a short time.
Slide 86
Limitations of DT
The problem of learning an optimal decision tree is known to be NP-complete. Decision trees are prone to over-fitting. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity and multiplexer problems. For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of the attributes with more levels.