Chapter 6 Classification and Prediction Dr. Bernard Chen Ph.D. University of Central Arkansas


Outline

- Classification Introduction
- Decision Tree
- Classifier Accuracy Measures

Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.

For example:
- Are bank loan applicants "safe" or "risky"?
- Will a customer buy a new computer?
- Analyze cancer data to predict which of three specific treatments should be applied.

Classification

Classification is a two-step process:
- Learning step: construct a model from the training set, based on the values (class labels) of a classifying attribute.
- Prediction step: use the model to predict categorical class labels (discrete or nominal) for new data.

Learning step: Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Learning step

Model construction: describing a set of predetermined classes.
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.

Prediction step: Using the Model in Prediction

The classifier is applied to the testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?
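The learned rule can be sketched in a few lines of code. This is purely illustrative: the `classify` function below hard-codes the rule from the model-construction slide and applies it to the testing data and the unseen tuple.

```python
def classify(rank, years):
    """The rule learned above: rank = 'professor' OR years > 6 -> tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [  # (name, rank, years, known tenured label)
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# compare each prediction with the known label
correct = sum(classify(rank, years) == label
              for _, rank, years, label in testing_data)
print(f"test accuracy: {correct}/{len(testing_data)}")  # 3/4

print("Jeff (Professor, 4):", classify("Professor", 4))  # yes
```

Note that Merlisa (Associate Prof, 7 years, not tenured) is misclassified by this rule, which is exactly what the accuracy estimate in the next step measures.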

Prediction step

Estimate the accuracy of the model:
- The known label of each test sample is compared with the classified result from the model.
- Accuracy rate is the percentage of test set samples that are correctly classified by the model.
- The test set must be independent of the training set; otherwise over-fitting will occur.

N-fold Cross-validation

To address the over-fitting problem, n-fold cross-validation is usually used. For example, 7-fold cross-validation:
1. Divide the whole training dataset into 7 equal parts.
2. Take the first part away and train the model on the remaining 6 portions.
3. After the model is trained, feed the held-out first part in as the testing dataset and obtain the accuracy.
4. Repeat steps two and three, taking the second part away, and so on.
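The steps above can be sketched in plain Python. This is a minimal sketch: `evaluate` is a hypothetical placeholder for a function that trains a model on the training folds and returns its accuracy on the held-out fold.

```python
def n_fold_splits(dataset, n):
    """Yield (train, test) pairs; each of the n parts is held out once."""
    fold = len(dataset) // n
    for i in range(n):
        start = i * fold
        end = start + fold if i < n - 1 else len(dataset)  # last fold takes the remainder
        yield dataset[:start] + dataset[end:], dataset[start:end]

def cross_validate(dataset, n, evaluate):
    """Average the n per-fold accuracies returned by evaluate(train, test)."""
    return sum(evaluate(tr, te) for tr, te in n_fold_splits(dataset, n)) / n

# toy check: 14 tuples, 7 folds -> each test fold holds 2 tuples
data = list(range(14))
fold_sizes = [len(te) for _, te in n_fold_splits(data, 7)]
print(fold_sizes)  # [2, 2, 2, 2, 2, 2, 2]
```

In practice the dataset should be shuffled (and often stratified by class) before splitting, so each fold is representative.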

Supervised Learning vs. Unsupervised Learning

Because the class label of each training tuple is provided, this step is also known as supervised learning.

It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is unknown.

Issues: Data Preparation

- Data cleaning: preprocess data in order to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.

Issues: Evaluating Classification Methods

- Accuracy
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability

Outline

- Classification Introduction
- Decision Tree
- Classifier Accuracy Measures

Decision Tree

Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure, where:
- each internal node denotes a test on an attribute,
- each branch represents an outcome of the test,
- each leaf node holds a class label.

Decision Tree Example

Decision Tree Algorithm

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down recursive divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
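For reference, a decision tree for this table (the classic result when it is grown with information gain, with age at the root) can be written as nested dicts: an internal node maps an attribute name to its branches, each branch key is a test outcome, and a plain string is a leaf label. A sketch, assuming that tree shape:

```python
# Internal node: {attribute: {outcome: subtree}}; leaf: a class label string.
tree = {"age": {
    "<=30":    {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40":     {"credit_rating": {"fair": "yes", "excellent": "no"}},
}}

def predict(tree, tup):
    """Walk from the root: test the node's attribute, follow the matching branch."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[tup[attr]]
    return tree

print(predict(tree, {"age": "<=30", "student": "yes"}))              # yes
print(predict(tree, {"age": ">40", "credit_rating": "excellent"}))   # no
```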

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

The expected information (entropy) needed to classify a tuple in D:

Info(D) = -Σ_{i=1..m} p_i log2(p_i)
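The entropy formula translates directly into code (a small sketch using only the standard library):

```python
import math

def info(*counts):
    """Info(D) = -sum over classes of p_i * log2(p_i), from class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# the buys_computer table above has 9 "yes" and 5 "no" tuples:
print(f"I(9,5) = {info(9, 5):.3f}")  # I(9,5) = 0.940
```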

Attribute Selection Measure: Information Gain (ID3/C4.5)

Information needed (after using attribute A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
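Putting the two formulas together, the gains for the buys_computer table can be computed with a short sketch (the printed values agree with the slides to within one unit in the last rounded digit):

```python
import math
from collections import Counter, defaultdict

def info_counts(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# the buys_computer training tuples:
# (age, income, student, credit_rating, buys_computer)
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D), splitting on the given column."""
    info_d = info_counts(Counter(r[-1] for r in rows).values())
    parts = defaultdict(list)
    for r in rows:
        parts[r[attr_index]].append(r[-1])  # partition class labels by attribute value
    info_a = sum(len(p) / len(rows) * info_counts(Counter(p).values())
                 for p in parts.values())
    return info_d - info_a

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(f"Gain({name}) = {gain(rows, i):.3f}")
```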

Decision Tree

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940


Decision Tree

Info_age(D) = 5/14 I(2,3) + 4/14 I(4,0) + 5/14 I(3,2) = 0.694

5/14 I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes's and 3 no's.

I(2,3) = -2/5 * log2(2/5) - 3/5 * log2(3/5)

Tools

- http://www.ehow.com/how_5144933_calculate-log.html
- http://web2.0calc.com/

Decision Tree

Info_age(D) = 5/14 I(2,3) + 4/14 I(4,0) + 5/14 I(3,2)
            = 5/14 * (0.971) + 4/14 * (0) + 5/14 * (0.971) = 0.694

To compute I(2,3), type in http://web2.0calc.com/: -2/5*log2(2/5)-3/5*log2(3/5)

Decision Tree

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246

Similarly, we can compute:
- Gain(income) = 0.029
- Gain(student) = 0.151
- Gain(credit_rating) = 0.048

Since "age" obtains the highest information gain, we partition the tree on age first.

Decision Tree

Info_income(D) = 4/14 I(3,1) + 6/14 I(4,2) + 4/14 I(2,2)
              = 4/14 * (0.811) + 6/14 * (0.918) + 4/14 * (1)
              = 0.232 + 0.393 + 0.286 = 0.911

Gain(income) = 0.940 - 0.911 = 0.029

Decision Tree

(tree diagrams: the data partitioned by age, and the resulting decision tree)

Another Decision Tree Example

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Decision Tree Example

Info(Tenured) = I(3,3) = -3/6 log2(3/6) - 3/6 log2(3/6) = 1

(To compute a base-2 log on a base-10 calculator, use log2(x) = log(x)/log(2); for example, log2(12) = 1.07918/0.30103 = 3.58496.)

Decision Tree Example

Info_RANK(Tenured) = 3/6 I(1,2) + 2/6 I(1,1) + 1/6 I(1,0)
                   = 3/6 * (0.918) + 2/6 * (1) + 1/6 * (0) = 0.79

- 3/6 I(1,2) means "Assistant Prof" has 3 out of 6 samples, with 1 yes and 2 no's.
- 2/6 I(1,1) means "Associate Prof" has 2 out of 6 samples, with 1 yes and 1 no.
- 1/6 I(1,0) means "Professor" has 1 out of 6 samples, with 1 yes and 0 no's.

Decision Tree Example

Info_YEARS(Tenured) = 1/6 I(1,0) + 2/6 I(0,2) + 1/6 I(0,1) + 2/6 I(2,0) = 0

- 1/6 I(1,0) means "years = 2" has 1 out of 6 samples, with 1 yes and 0 no's.
- 2/6 I(0,2) means "years = 3" has 2 out of 6 samples, with 0 yes's and 2 no's.
- 1/6 I(0,1) means "years = 6" has 1 out of 6 samples, with 0 yes's and 1 no.
- 2/6 I(2,0) means "years = 7" has 2 out of 6 samples, with 2 yes's and 0 no's.
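These hand computations can be checked with a few lines of code (a sketch; I(...) is the same entropy measure as before, written with the slides' notation):

```python
import math

def I(*counts):
    """Entropy of a class-count split, as in the slides' I(p, n) notation."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Info_RANK  = 3/6 I(1,2) + 2/6 I(1,1) + 1/6 I(1,0)
info_rank = 3/6 * I(1, 2) + 2/6 * I(1, 1) + 1/6 * I(1, 0)
# Info_YEARS = 1/6 I(1,0) + 2/6 I(0,2) + 1/6 I(0,1) + 2/6 I(2,0)
info_years = 1/6 * I(1, 0) + 2/6 * I(0, 2) + 1/6 * I(0, 1) + 2/6 * I(2, 0)

print(f"Info(Tenured) = {I(3, 3):.2f}")
print(f"Info_RANK  = {info_rank:.2f}, Gain(RANK)  = {I(3, 3) - info_rank:.2f}")
print(f"Info_YEARS = {info_years:.2f}, Gain(YEARS) = {I(3, 3) - info_years:.2f}")
```

YEARS yields the maximum possible gain here (every YEARS value is pure), so it would be chosen as the split attribute for this toy example.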

Group Practice Example

SEX  RANK            YEARS  TENURED
M    Assistant Prof  3~5    no
F    Assistant Prof  >=6    yes
M    Professor       <3     yes
M    Associate Prof  >=6    yes
M    Assistant Prof  >=6    no
F    Associate Prof  3~5    no

Outline

- Classification Introduction
- Decision Tree
- Classifier Accuracy Measures

Classifier Accuracy Measures

classes                (Real) buys = yes  (Real) buys = no  total
(Predict) buys = yes   6954               412               7366
(Predict) buys = no    46                 2588              2634
total                  7000               3000              10000

Classifier Accuracy Measures

Alternative accuracy measures (e.g., for cancer diagnosis):
- sensitivity = t-pos/pos = 6954/7000 = 0.993
- specificity = t-neg/neg = 2588/3000 = 0.863
- precision = t-pos/(t-pos + f-pos) = 6954/7366 = 0.944
- accuracy = (t-pos + t-neg)/(pos + neg) = (6954 + 2588)/10000 = 0.954
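The same measures can be computed from the confusion matrix above in a short sketch:

```python
t_pos, f_pos = 6954, 412   # predicted yes: really yes / really no
f_neg, t_neg = 46, 2588    # predicted no:  really yes / really no
pos, neg = t_pos + f_neg, f_pos + t_neg   # 7000 real yes, 3000 real no

sensitivity = t_pos / pos                # true positive recognition rate
specificity = t_neg / neg                # true negative recognition rate
precision   = t_pos / (t_pos + f_pos)    # how many predicted-yes are really yes
accuracy    = (t_pos + t_neg) / (pos + neg)

print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f} "
      f"precision={precision:.4f} accuracy={accuracy:.4f}")
```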
