By : Mohd. Noor Abdul Hamid, Ph.D (Universiti Utara Malaysia)
Introduction to Classification

Classification is the task of assigning objects to one of several predefined categories or classes.

Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, the goal is to find a model that expresses the class attribute as a function of the values of the other attributes, so that previously unseen records are assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it. A minimal sketch of such a hold-out split is shown below.
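As an illustration only, here is one way to produce such a split, assuming scikit-learn is available; the records and labels are made-up stand-ins for a real data set.

```python
# Hold-out split: build the model on the training set, validate on the test set.
# Assumes scikit-learn is installed; 'records' and 'labels' are made-up data.
from sklearn.model_selection import train_test_split

records = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60]]
labels = ["No", "No", "No", "No", "Yes", "No"]

# Hold out 30% of the records as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    records, labels, test_size=0.3, random_state=42)
```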
[Figure: the general approach for building a classification model. A learning algorithm learns a model from the training set (induction); the model is then applied to the test set to assign class labels (deduction).]

Training set:

 Tid | Attrib1 | Attrib2 | Attrib3 | Class
  1  | Yes     | Large   | 125K    | No
  2  | No      | Medium  | 100K    | No
  3  | No      | Small   |  70K    | No
  4  | Yes     | Medium  | 120K    | No
  5  | No      | Large   |  95K    | Yes
  6  | No      | Medium  |  60K    | No
  7  | Yes     | Large   | 220K    | No
  8  | No      | Small   |  85K    | Yes
  9  | No      | Medium  |  75K    | No
 10  | No      | Small   |  90K    | Yes

Test set:

 Tid | Attrib1 | Attrib2 | Attrib3 | Class
 11  | No      | Small   |  55K    | ?
 12  | Yes     | Medium  |  80K    | ?
 13  | Yes     | Large   | 110K    | ?
 14  | No      | Small   |  95K    | ?
 15  | No      | Large   |  67K    | ?
Introduction to Classification

Examples of classification techniques:
- Decision Tree
- Neural Network
- Rule-Based
- Naïve Bayes classifier, etc.

Classification techniques are best suited for predicting data sets with binary or nominal categories. They are less effective for ordinal categories, since they do not consider the implicit order among the categories.
Introduction to Classification

The performance of a classification model is evaluated based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a confusion matrix.

Figure: confusion matrix for a 2-class problem

                   Predicted Class = 1 | Predicted Class = 0
 Actual Class = 1 |        f11         |        f10
 Actual Class = 0 |        f01         |        f00

Based on the entries of the confusion matrix, the total number of correct predictions made by the model is (f11 + f00), and the total number of incorrect predictions is (f01 + f10).
Introduction to Classification

We can therefore evaluate the performance of a classification model by looking at the accuracy of its predictions. Equivalently, the performance of a model can be expressed in terms of its error rate:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
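A minimal sketch of these two formulas, with made-up confusion-matrix counts:

```python
# Accuracy and error rate from the 2-class confusion-matrix counts above.
# The counts themselves are made up for illustration.
f11, f10, f01, f00 = 50, 10, 5, 35

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total      # correct predictions / all predictions
error_rate = (f10 + f01) / total    # wrong predictions / all predictions

print(accuracy, error_rate)         # 0.85 0.15
```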
Examples of Classification Tasks

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
What is a Decision Tree?

A decision tree is a structure that can be used to divide a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules. With each successive division, the members of the resulting sets become more and more similar to one another.

A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) groups with respect to a particular target. Hence, the algorithm used to construct a decision tree is referred to as recursive partitioning.
What is a Decision Tree?

The target variable is usually categorical, and the decision tree is used either to:

- calculate the probability that a given record belongs to each of the categories, or
- classify the record by assigning it to the most likely class (or category).

Note: a decision tree can also be used to estimate the value of a continuous target variable. However, regression models and neural networks are generally more appropriate for estimation.
What is a Decision Tree?

A decision tree has three types of nodes:

- Root node: the top (or left-most) node, with no incoming edges and zero or more outgoing edges.
- Child or internal node: a descendant node with exactly one incoming edge and two or more outgoing edges.
- Leaf node: a terminal node with exactly one incoming edge and no outgoing edges.

In a decision tree, each leaf node is assigned a class label. The rules, or branches, are the unique paths (edges) with a set of conditions (attribute values) that divide the observations into smaller subsets.
Decision Tree Diagram

[Figure: an example tree. The root node splits on Gender (Female / Male); each branch leads to an internal node that splits on Height. On the Female branch: Height < 1.3m → Short, 1.3m–1.8m → Medium, > 1.8m → Tall. On the Male branch: Height < 1.5m → Short, 1.5m–2.0m → Medium, > 2.0m → Tall. The root node, internal nodes, branches, and leaf nodes are labelled.]
Types of Decision Trees

[Figure: examples of a balanced tree, a bushy tree, and a deep tree.]
How to Build a Decision Tree?

Generally, building a decision tree involves two steps:

1. Tree construction: recursively split the tree according to selected attributes (conditions).
2. Tree pruning: identify and remove irrelevant branches (which may reflect outliers) to increase classification accuracy.
How to Build a Decision Tree?

In principle, there are exponentially many decision trees that can be constructed from a given set of attributes, so finding the optimal tree is computationally infeasible because of the exponential size of the search space. Efficient algorithms have been developed to induce reasonably accurate, albeit suboptimal, decision trees in a reasonable amount of time. These algorithms usually employ a greedy strategy, making a series of locally optimal decisions about which attribute to use for partitioning the data. One such algorithm is Hunt's Algorithm, which is the basis of many existing decision tree algorithms, including ID3, C4.5, and CART.
Hunt's Algorithm

Let Dt be the set of training records associated with node t, and let y = {y1, y2, ..., yc} be the target variable with c classes. The following is a recursive definition of Hunt's algorithm:

Step 1: If all the records in Dt belong to the same class yt, then node t is a leaf node labelled as yt.

Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records in Dt are distributed to the children based on those outcomes. The algorithm is then recursively applied to each child node. A minimal sketch is given below.
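The following sketch illustrates this recursion for categorical attributes. Records are dicts, and `choose_test_attribute` is a hypothetical placeholder for the attribute-selection step discussed later (e.g. an impurity-based measure); when attributes are exhausted, the sketch falls back to a majority-class leaf.

```python
from collections import Counter

def hunt(records, labels, attributes, choose_test_attribute):
    # Step 1: all records in the same class -> leaf node labelled with it.
    # (If no attributes remain, fall back to the majority class.)
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    # Step 2: select an attribute test condition and partition the records.
    attr = choose_test_attribute(records, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    node = {"test": attr, "children": {}}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        # One child node per outcome; recurse on the records routed to it.
        node["children"][value] = hunt([r for r, _ in subset],
                                       [y for _, y in subset],
                                       remaining, choose_test_attribute)
    return node
```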
Example of a Decision Tree

Training data (attribute types: Refund — categorical, Marital Status — categorical, Taxable Income — continuous, Cheat — class):

 Tid | Refund | Marital Status | Taxable Income | Cheat
  1  | Yes    | Single         | 125K           | No
  2  | No     | Married        | 100K           | No
  3  | No     | Single         |  70K           | No
  4  | Yes    | Married        | 120K           | No
  5  | No     | Divorced       |  95K           | Yes
  6  | No     | Married        |  60K           | No
  7  | Yes    | Divorced       | 220K           | No
  8  | No     | Single         |  85K           | Yes
  9  | No     | Married        |  75K           | No
 10  | No     | Single         |  90K           | Yes

Model (a decision tree with Refund, MarSt, and TaxInc as the splitting attributes):

- Refund = Yes → NO
- Refund = No → MarSt
  - MarSt = Married → NO
  - MarSt = Single or Divorced → TaxInc
    - TaxInc < 80K → NO
    - TaxInc > 80K → YES
Another Example of a Decision Tree

From the same training data, a different tree can be induced, with MarSt at the root:

- MarSt = Married → NO
- MarSt = Single or Divorced → Refund
  - Refund = Yes → NO
  - Refund = No → TaxInc
    - TaxInc < 80K → NO
    - TaxInc > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test data:

 Refund | Marital Status | Taxable Income | Cheat
 No     | Married        | 80K            | ?

Start from the root of the tree and follow the branches that match the record's attribute values:

1. Refund = No → take the "No" branch to the MarSt node.
2. Marital Status = Married → take the "Married" branch.
3. The "Married" branch ends at a leaf labelled NO.

Assign Cheat to "No".
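A minimal sketch of this deduction step, encoding the first example tree as nested dicts (the encoding itself is an assumption, not part of the slides):

```python
# The example tree, encoded as nested dicts.
tree = {"test": "Refund", "children": {
            "Yes": {"leaf": "No"},
            "No": {"test": "MarSt", "children": {
                       "Married": {"leaf": "No"},
                       "Single/Divorced": {"test": "TaxInc<80K", "children": {
                           True: {"leaf": "No"}, False: {"leaf": "Yes"}}}}}}}

def predict(node, record):
    while "leaf" not in node:                        # start from the root...
        node = node["children"][record[node["test"]]]
    return node["leaf"]                              # ...stop at a leaf node

record = {"Refund": "No", "MarSt": "Married"}        # Taxable Income = 80K
print(predict(tree, record))                         # -> No
```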
Design Issues of Decision Tree Induction

How should the training records be split?
- Which attribute test condition works best to classify the records?
- What is the objective measure for evaluating the goodness of each test condition?

How should the splitting procedure stop?
- What is the condition for no longer splitting the records?
- One strategy is to continue expanding a node until either all the records belong to the same class or all the records have identical attribute values.
- Other criteria can also be imposed to allow the tree-growing procedure to terminate earlier.
Methods for Expressing Attribute Test Conditions

a) Binary attributes: the test generates two possible outcomes (binary split), e.g. GENDER → Male / Female.
Methods for Expressing Attribute Test Conditions

b) Nominal attributes: multiway split, e.g. Marital Status → Single / Divorced / Married.
Methods for Expressing Attribute Test Conditions

b) Nominal attributes: binary split (e.g. in CART), by grouping the attribute values. For Marital Status:

- {Single} vs. {Married, Divorced}, or
- {Married} vs. {Single, Divorced}, or
- {Divorced} vs. {Married, Single}
Methods for Expressing Attribute Test Conditions

c) Ordinal attributes: multiway split, e.g. Shirt Size → Small / Medium / Large / Extra Large.
Methods for Expressing Attribute Test Conditions

c) Ordinal attributes: binary split, as long as the grouping does not violate the order property of the attribute values. For Shirt Size:

- {S, M} vs. {L, XL}, or
- {S} vs. {M, L, XL}

A grouping such as {S, L} vs. {M, XL} violates the order property and should not be used.
Methods for Expressing Attribute Test Conditions

d) Continuous attributes: binary split, e.g. Annual Income > 80K → Yes / No.
Methods for Expressing Attribute Test Conditions

d) Continuous attributes: multiway split, e.g. Annual Income → < 10K / [10K, 25K) / [25K, 50K) / [50K, 80K) / > 80K.
Measures for Selecting the Best Split

Let p(i|t) denote the fraction of records belonging to class i at a given node t. In a two-class problem, the class distribution at any node can be written as (p0, p1), where p1 = 1 − p0.

The measures developed for selecting the best split are often based on the degree of impurity of the child nodes: a node with class distribution (0, 1) has zero impurity, whereas a node with uniform class distribution (0.5, 0.5) has the highest impurity.

Examples of impurity measures include:

- For a categorical target: entropy, Gini, classification error, information gain ratio, chi-square test.
- For an interval target: variance reduction, F-test.
Measures of Impurity (I)

$$\text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i|t)\,\log_2 p(i|t)$$

$$\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \left[p(i|t)\right]^2$$

$$\text{Classification error}(t) = 1 - \max_i \left[p(i|t)\right]$$

where c is the number of classes, and 0 log₂ 0 = 0 in the entropy calculation.
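A minimal sketch of the three measures, computed from the class counts at a node (the counts used below anticipate the worked example that follows):

```python
from math import log2

def impurities(counts):
    n = sum(counts)
    probs = [c / n for c in counts]   # p(i|t) for each class i at node t
    # Terms with p = 0 or p = 1 contribute nothing to the entropy sum.
    entropy = sum(-p * log2(p) for p in probs if 0 < p < 1)
    gini = 1 - sum(p * p for p in probs)
    error = 1 - max(probs)
    return entropy, gini, error

print(impurities([0, 6]))   # (0, 0.0, 0.0)              <- node N1 below
print(impurities([1, 5]))   # (~0.650, ~0.278, ~0.167)   <- node N2 below
print(impurities([3, 3]))   # (1.0, 0.5, 0.5)            <- node N3 below
```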
Example: Measures of Impurity

Class counts at the parent node and at three candidate child nodes:

 Node   | Class = 0 | Class = 1
 Parent |     4     |    14
 N1     |     0     |     6
 N2     |     1     |     5
 N3     |     3     |     3

Node N1:

$$\text{Gini}(N_1) = 1 - (0/6)^2 - (6/6)^2 = 0$$
$$\text{Entropy}(N_1) = -(0/6)\log_2(0/6) - (6/6)\log_2(6/6) = 0$$
$$\text{Error}(N_1) = 1 - \max[(0/6), (6/6)] = 0$$
Node N2:

$$\text{Gini}(N_2) = 1 - (1/6)^2 - (5/6)^2 = 0.278$$
$$\text{Entropy}(N_2) = -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.650$$
$$\text{Error}(N_2) = 1 - \max[(1/6), (5/6)] = 0.167$$
Node N3:

$$\text{Gini}(N_3) = 1 - (3/6)^2 - (3/6)^2 = 0.5$$
$$\text{Entropy}(N_3) = -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1$$
$$\text{Error}(N_3) = 1 - \max[(3/6), (3/6)] = 0.5$$
Summary:

 Node | Gini  | Entropy | Error
 N1   | 0     | 0       | 0
 N2   | 0.278 | 0.650   | 0.167
 N3   | 0.5   | 1       | 0.5

N1 has the lowest impurity value, followed by N2 and N3.
Measures of Impurity

To determine how well a test condition performs, we compare the degree of impurity of the parent node (before splitting) with that of the child nodes (after splitting): the larger the difference, the better the test condition. The gain, Δ, is a criterion that can be used to determine the goodness of a split:

$$\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j)$$

where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values (outcomes of the split), and N(v_j) is the number of records associated with the child node v_j. The summation term is the weighted average impurity of the child nodes. A minimal sketch of this computation follows.
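A minimal sketch of the gain criterion, using Gini as the impurity measure I(·):

```python
def gini(counts):
    # Gini impurity from the class counts at one node.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, child_counts_list):
    n = sum(parent_counts)                          # N: records at the parent
    weighted = sum(sum(child) / n * gini(child)     # sum of N(v_j)/N * I(v_j)
                   for child in child_counts_list)
    return gini(parent_counts) - weighted           # I(parent) - weighted avg
```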
Measures of Impurity: Information Gain

Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes. When entropy is used as the impurity measure, the difference in entropy is known as the information gain.

Because this criterion favors attributes with many outcomes, decision trees built using entropy tend to be quite bushy. Bushy trees with many multiway splits are undesirable, as such splits leave small numbers of records in each node; the gain ratio is a variant that corrects for this.
Splitting Binary Attributes (using Gini)

Example: suppose there are two candidate ways (A and B) to split the same data into smaller subsets.

Parent node: C0 = 6, C1 = 6, so Gini(parent) = 1 − (6/12)² − (6/12)² = 0.5.

 Split A:  N1: C0 = 4, C1 = 3    Gini(N1) = 0.4898
           N2: C0 = 2, C1 = 3    Gini(N2) = 0.480

 Split B:  N1: C0 = 1, C1 = 4    Gini(N1) = 0.320
           N2: C0 = 5, C1 = 2    Gini(N2) = 0.4082

Which one is the better split? Compute the weighted average of the Gini index of the child nodes for each attribute.
Splitting Binary Attributes (using Gini)

Split A:

$$\text{Weighted Gini} = (7/12)(0.4898) + (5/12)(0.480) = 0.486$$
$$\Delta_A = 0.5 - 0.486 = 0.014$$

Split B:

$$\text{Weighted Gini} = (5/12)(0.320) + (7/12)(0.4082) = 0.3715$$
$$\Delta_B = 0.5 - 0.3715 = 0.1285$$

Since B yields the larger gain, B is preferred.
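The same comparison, reproduced with the `gain()` sketch from earlier (shown at full precision, where the slide rounds):

```python
# Reusing the gini() and gain() sketches from above:
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain(parent, children):
    n = sum(parent)
    return gini(parent) - sum(sum(c) / n * gini(c) for c in children)

print(round(gain([6, 6], [[4, 3], [2, 3]]), 4))  # split A -> 0.0143 (~0.014)
print(round(gain([6, 6], [[1, 4], [5, 2]]), 4))  # split B -> 0.1286 (~0.1285)
```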
Splitting Nominal Attributes (using Gini)

Example: which split of Car Type (Family, Sports, Luxury) is better, binary or multiway?

Binary split 1:
 {Sports, Luxury}: C0 = 9, C1 = 7    {Family}: C0 = 1, C1 = 3
 Weighted average Gini = 0.468

Binary split 2:
 {Sports}: C0 = 8, C1 = 0    {Family, Luxury}: C0 = 2, C1 = 10
 Weighted average Gini = 0.167

Multiway split:
 Family: C0 = 1, C1 = 3    Sports: C0 = 8, C1 = 0    Luxury: C0 = 1, C1 = 7
 Weighted average Gini = 0.163

The multiway split gives the lowest weighted average Gini, so it is the best of the three.
Splitting Continuous Attributes (using Gini)

A brute-force method is used to find the best split position v for a continuous attribute (e.g. Annual Income). To reduce complexity, the training records are first sorted by annual income:

 Class                  | No | No | No | Yes | Yes | Yes | No  | No  | No  | No
 Annual Income (sorted) | 60 | 70 | 75 | 85  | 90  | 95  | 100 | 120 | 125 | 220

Candidate split positions v are identified by taking the midpoints between adjacent sorted values. We then compute the Gini index for each candidate and choose the one that gives the lowest value:

 Split position v | 55    | 65    | 72    | 80    | 87    | 92    | 97    | 110   | 122   | 172   | 230
 No  (≤ v, > v)   | 0, 7  | 1, 6  | 2, 5  | 3, 4  | 3, 4  | 3, 4  | 3, 4  | 4, 3  | 5, 2  | 6, 1  | 7, 0
 Yes (≤ v, > v)   | 0, 3  | 0, 3  | 0, 3  | 0, 3  | 1, 2  | 2, 1  | 3, 0  | 3, 0  | 3, 0  | 3, 0  | 3, 0
 Gini             | 0.420 | 0.400 | 0.375 | 0.343 | 0.417 | 0.400 | 0.300 | 0.343 | 0.375 | 0.400 | 0.420

The lowest Gini index (0.300) is obtained at v = 97, so that is the best split position. A minimal sketch of the search is given below.
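A minimal sketch of this brute-force search. It scans interior midpoints only (the slide also lists boundary positions 55 and 230, and truncates the midpoint 97.5 to 97):

```python
def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1 - sum((c / n) ** 2 for c in counts)

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
classes = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
pairs = sorted(zip(incomes, classes))

best = None
for i in range(len(pairs) - 1):
    v = (pairs[i][0] + pairs[i + 1][0]) / 2      # candidate midpoint
    left = [y for x, y in pairs if x <= v]       # classes with income <= v
    right = [y for x, y in pairs if x > v]       # classes with income > v
    # Weighted average Gini of the two partitions.
    w = (len(left) * gini([left.count("Yes"), left.count("No")])
         + len(right) * gini([right.count("Yes"), right.count("No")])) / len(pairs)
    if best is None or w < best[1]:
        best = (v, w)

print(best)   # (97.5, 0.3) -- the table's minimum, Gini = 0.300 at v = 97
```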