Classification and Prediction Final



Classification and Prediction


    Agenda

Introduction
Decision Tree Induction
Statistical based algorithms
Distance based algorithms
Rule based algorithms



Classification vs. Prediction

Classification
  predicts categorical class labels
  classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data

Prediction
  models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
  Loan disbursement: risky or safe
  Medical diagnosis: predicting the treatment from among Treatment A, Treatment B, Treatment C


Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set is independent of the training set
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (a toy example of both steps follows below)
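Below is a minimal sketch (not from the slides) of the two-step process: a hand-written rule plays the role of the constructed model, and its accuracy rate is estimated on a made-up, held-out test set.

    # Step 1 (model construction): a toy, hand-written rule-based model.
    def classify(sample):
        age, income = sample
        return "risky" if age == "young" else "safe"

    # Step 2 (model usage): estimate the accuracy rate on an independent test set.
    test_set = [                        # (attributes, known label)
        (("young", "low"), "risky"),
        (("senior", "medium"), "safe"),
        (("middle_aged", "high"), "safe"),
    ]
    correct = sum(classify(x) == label for x, label in test_set)
    print(f"accuracy rate = {correct / len(test_set):.0%}")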


Process (1): Model Construction

Training Data:

  NAME  age          income  loan_decision
  Mike  young        low     risky
  Mary  young        low     risky
  Bill  middle_aged  high    safe
  Jim   middle_aged  low     risky
  Dave  senior       low     safe
  Anne  senior       medium  safe

Classification algorithms learn a classifier (model) from the training data, e.g.:

  IF age = youth THEN loan_decision = risky ...


Process (2): Using the Model in Prediction

The classifier is applied to Testing Data:

  NAME   age          income  loan_decision
  Tom    senior       low     safe
  Crest  middle_aged  low     risky
  yee    middle_aged  high    safe

and then to Unseen Data: (Henry, middle_aged, low)

  Loan_decision? -> risky


Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data are accompanied by labels indicating the class of the observations, e.g. risky, safe
  New data is classified based on the training set

Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


    Issues: Data Preparation

Data cleaning
  Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
  Remove irrelevant or redundant attributes
Data transformation
  Generalize and/or normalize data


    Issues: Evaluating Classification Methods

Accuracy
  classifier accuracy: predicting the class label
  predictor accuracy: guessing the value of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency on disk-resident databases
Interpretability: understanding and insight provided by the model


    Classification by Decision Tree Induction

A decision tree is a flowchart-like tree structure
  Each non-leaf node denotes a test on an attribute
  Each branch represents an outcome of the test
  Each leaf node holds a class label
  The topmost node is the root



    Decision Tree Induction: Training Dataset

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31..40  high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31..40  low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31..40  medium  no       excellent      yes
  31..40  high    yes      fair           yes
  >40     medium  no       excellent      no


    Output: A Decision Tree for buys_computer

  age?
  ├─ youth (<=30)         -> student?
  │    ├─ no              -> no
  │    └─ yes             -> yes
  ├─ middle_aged (31..40) -> yes
  └─ senior (>40)         -> credit_rating?
       ├─ excellent       -> no
       └─ fair            -> yes


    Algorithm for Decision Tree Induction

Basic algorithm
  The tree is constructed in a top-down, recursive, divide-and-conquer manner
  At the start, all the training examples are at the root
  Examples are partitioned based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain), as in the sketch below
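A minimal Python sketch of this basic algorithm, assuming categorical attributes and information gain as the selection measure (the dict-based tree format is an illustration, not the textbook's notation):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # expected reduction in entropy from partitioning on attr
        gain, n = entropy(labels), len(rows)
        for value in {r[attr] for r in rows}:
            part = [lab for r, lab in zip(rows, labels) if r[attr] == value]
            gain -= len(part) / n * entropy(part)
        return gain

    def build_tree(rows, labels, attrs):
        if len(set(labels)) == 1:              # pure partition -> leaf
            return labels[0]
        if not attrs:                          # no attributes left -> majority leaf
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        tree = {}
        for value in {r[best] for r in rows}:  # one branch per outcome of the test
            idx = [i for i, r in enumerate(rows) if r[best] == value]
            tree[(best, value)] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx],
                [a for a in attrs if a != best])
        return tree

Here `rows` is a list of attribute dicts, `labels` the matching class labels, and `attrs` the attribute names still available for splitting.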


    DT Algorithm



Issues faced by DT algorithms

  Choosing splitting attributes: which attributes are used for splitting impacts performance, e.g. age or credit_rating are useful, student-name is not.
  Ordering of splitting attributes: the order in which attributes are chosen is important. In the example, age is first, then student and credit_rating are chosen.
  Splits: the number of splits to take place.
  Tree structure: a balanced tree with the fewest levels is desirable.
  Training data: the structure of the DT depends on the training data. If the training data is very small, the generated tree might not be specific enough to work properly; if it is too large, the created tree may overfit (and might not work for future states).
  Pruning: once a tree is constructed, some modifications to the tree might be needed to improve its performance. The pruning phase removes redundant comparisons or whole subtrees to achieve better performance.


    Presentation of Classification Results


    Visualization of a Decision Tree in SGI/MineSet 3.0


Interactive Visual Mining by Perception-Based Classification (PBC)


    Statistical based algorithms


    Regression

    Bayesian classification


    Regression


Regression deals with the estimation of an output value based on input values.
In classification, the input values are values from the database D and the output values represent the classes.
If we know the input parameters x1, x2, ..., xn, then the relation between the output parameter y and the input parameters can be written as

  y = c0 + c1*x1 + c2*x2 + ... + cn*xn

where c0, c1, ..., cn are the regression coefficients.
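A small illustration (with made-up data) of estimating the regression coefficients by ordinary least squares using NumPy:

    import numpy as np

    # Made-up observations of y = c0 + c1*x1 + c2*x2 plus noise.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([5.1, 4.9, 11.2, 10.8])

    A = np.column_stack([np.ones(len(X)), X])   # prepend 1s so c0 is estimated too
    (c0, c1, c2), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(f"y = {c0:.2f} + {c1:.2f}*x1 + {c2:.2f}*x2")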


    Linear Regression Poor Fit


    Classification Using Regression

Division: use the regression function to divide the area into regions
  The data is plotted in an n-dimensional space without any explicit class
  Through regression, the space is divided into regions, one per class

Prediction: formulas are generated to predict the output class value (see the sketch below)
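A toy sketch of the prediction variant (all names and data made up): the classes are coded 0 and 1, a least-squares line is fitted to the codes, and a new item is assigned the class whose code is closer to the predicted value.

    import numpy as np

    x   = np.array([0.8, 1.0, 1.2, 3.8, 4.0, 4.2])   # input attribute
    cls = np.array([0, 0, 0, 1, 1, 1])               # class codes

    c1, c0 = np.polyfit(x, cls, 1)                   # fit y = c1*x + c0
    predict = lambda v: int(c1 * v + c0 > 0.5)       # nearer class code wins
    print(predict(1.1), predict(3.9))                # -> 0 1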


    Division


    Prediction


    Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

Foundation: based on Bayes' theorem

Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers

Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data


    Bayesian Theorem

Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

  posteriori = likelihood x prior / evidence

Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.
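For instance, with made-up numbers P(H) = 0.3, P(X|H) = 0.8 and P(X) = 0.5:

    # posteriori = likelihood x prior / evidence
    p_h, p_x_given_h, p_x = 0.3, 0.8, 0.5
    print(p_x_given_h * p_h / p_x)    # P(H|X) = 0.48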


Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = yes
  C2: buys_computer = no

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)


Naïve Bayesian Classifier: An Example

P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no)  = 5/14 = 0.357

Compute P(X|Ci) for each class:
  P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
  P(age <= 30 | buys_computer = no)  = 3/5 = 0.600
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no)  = 2/5 = 0.400
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no)  = 1/5 = 0.200
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no)  = 2/5 = 0.400

  P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = no)  = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

  P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 x 0.643 = 0.028
  P(X | buys_computer = no)  P(buys_computer = no)  = 0.019 x 0.357 = 0.007

Therefore, X belongs to class buys_computer = yes.
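The same counting scheme is easy to sketch in Python. The snippet below is an illustration, not the textbook's code; the tuples restate the training dataset from the earlier slide, and zero counts are not smoothed.

    from collections import Counter

    data = [  # (age, income, student, credit_rating) -> buys_computer
        (("<=30", "high", "no", "fair"), "no"),
        (("<=30", "high", "no", "excellent"), "no"),
        (("31..40", "high", "no", "fair"), "yes"),
        ((">40", "medium", "no", "fair"), "yes"),
        ((">40", "low", "yes", "fair"), "yes"),
        ((">40", "low", "yes", "excellent"), "no"),
        (("31..40", "low", "yes", "excellent"), "yes"),
        (("<=30", "medium", "no", "fair"), "no"),
        (("<=30", "low", "yes", "fair"), "yes"),
        ((">40", "medium", "yes", "fair"), "yes"),
        (("<=30", "medium", "yes", "excellent"), "yes"),
        (("31..40", "medium", "no", "excellent"), "yes"),
        (("31..40", "high", "yes", "fair"), "yes"),
        ((">40", "medium", "no", "excellent"), "no"),
    ]

    def naive_bayes(x):
        scores = {}
        for c, n_c in Counter(lab for _, lab in data).items():
            p = n_c / len(data)                   # prior P(Ci)
            rows = [r for r, lab in data if lab == c]
            for j, value in enumerate(x):         # independence assumption:
                p *= sum(r[j] == value for r in rows) / n_c   # product of P(xj|Ci)
            scores[c] = p
        return max(scores, key=scores.get)

    print(naive_bayes(("<=30", "medium", "yes", "fair")))  # -> yes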


Naïve Bayesian Classifier: Comments

Advantages
  Easy to implement
  Good results obtained in most of the cases

Disadvantages
  Assumption of class conditional independence, which causes loss of accuracy
  Practically, dependencies exist among variables


    Rule based Classification



    Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules:
  r = <a, c>, where a is the antecedent and c is the consequent

Example:
  R: IF age = youth AND student = yes THEN buys_computer = yes

Rules are generated by many techniques, such as decision trees and neural networks.
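One possible representation of r = <a, c> in code (names are illustrative): the antecedent as a dict of attribute tests, the consequent as an attribute/value pair.

    # R: IF age = youth AND student = yes THEN buys_computer = yes
    rule = ({"age": "youth", "student": "yes"}, ("buys_computer", "yes"))

    def fires(rule, tup):
        antecedent, _ = rule            # rule fires if every test in a holds
        return all(tup.get(attr) == val for attr, val in antecedent.items())

    print(fires(rule, {"age": "youth", "student": "yes", "income": "low"}))  # True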


    Generating Rules from DTs

The algorithm generates one rule for each leaf node in the DT (see the sketch below).
All rules with the same consequent can be combined by OR-ing the antecedents of the simpler rules.
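A sketch of this idea, reusing the nested-dict tree format from the earlier decision tree sketch (the tree below is hand-written to match the buys_computer example): every root-to-leaf path becomes one IF-THEN rule.

    tree = {("age", "youth"): {("student", "no"): "no", ("student", "yes"): "yes"},
            ("age", "middle_aged"): "yes",
            ("age", "senior"): {("credit_rating", "fair"): "yes",
                                ("credit_rating", "excellent"): "no"}}

    def rules(node, conds=()):
        if not isinstance(node, dict):              # leaf -> emit one rule
            tests = " AND ".join(f"{a} = {v}" for a, v in conds)
            return [f"IF {tests} THEN buys_computer = {node}"]
        out = []
        for (attr, value), child in node.items():   # one path per branch
            out.extend(rules(child, conds + ((attr, value),)))
        return out

    for r in rules(tree):
        print(r)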


    Algorithm for Generating Rules from DTs



    Generating Rules Example


    Distance based algorithms



    Classification Using Distance

Items that are mapped to the same class are more similar to the other items in that class than to the items in other classes.
So distance (or similarity) measures may be used to identify the alikeness of different items in the database.
Place items in the class to which they are closest.
This requires determining the distance between an item and a class.


Classification using a simple distance based algorithm

Distance Based
  Classes are represented by a centroid: a central value / representative vector, e.g. Class A, Class B and Class C are each represented by their own centroid vector.


    Simple distance based Algo

Input:  c1, ..., cm   // centers for each class
        t             // input tuple to classify
Output: c             // class to which t is assigned

dist = ∞;
for i = 1 to m do
    if dis(ci, t) < dist then
        c = i;
        dist = dis(ci, t);

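A runnable Python version of the pseudocode (the centroid coordinates are made up), assuming Euclidean distance:

    import math

    centers = {"A": (1.0, 1.0), "B": (5.0, 1.0), "C": (3.0, 6.0)}

    def classify(t):
        c, dist = None, math.inf              # dist starts at infinity
        for i, center in centers.items():
            if math.dist(center, t) < dist:   # dis(ci, t) < dist
                c, dist = i, math.dist(center, t)
        return c

    print(classify((4.5, 1.5)))  # -> B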


K Nearest Neighbors (KNN)

  A common classification scheme
  A lazy learner: no model is built in advance; work is deferred until a new item must be classified
  The training set includes the data along with their classes
  Examine the K closest items near the item to be classified
  The new item is placed in the class where most of the K closest items are


    KNN

K = 3: the three closest items in the training set are shown; T will be placed in the class where most of these are.


    KNN Algorithm
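The KNN algorithm itself was shown as a figure on this slide; below is a minimal sketch of the same idea (training items and coordinates are made up), assuming Euclidean distance and a majority vote among the K nearest items.

    import math
    from collections import Counter

    train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
             ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]

    def knn(t, k=3):
        # examine the K items closest to t ...
        nearest = sorted(train, key=lambda item: math.dist(item[0], t))[:k]
        # ... and place t in the class most of them belong to
        return Counter(cls for _, cls in nearest).most_common(1)[0][0]

    print(knn((2, 2)))  # -> A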