Classification and Prediction
Agenda
Introduction
Decision Tree Induction
Statistical based algorithms
Distance based algorithms
Rule based algorithms
Classification vs. Prediction

Classification
  predicts categorical class labels
  classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
  Loan disbursement: risky or safe
  Medical diagnosis: predict the treatment from, e.g., Treatment A, Treatment B, Treatment C
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    The accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set is independent of the training set
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (a sketch of both steps follows)
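To make the two steps concrete, here is a minimal sketch in Python; the data layout, the simple lookup-style model, and the fallback rule are our own illustrative assumptions, not part of the original slides. It builds a model from the loan training set and then estimates the accuracy rate on an independent test set.

    from collections import Counter, defaultdict

    # Step 1: model construction on the training set (toy loan data from the slides).
    train = [("Mike", "young", "low", "risky"),
             ("Mary", "young", "low", "risky"),
             ("Bill", "middle_aged", "high", "safe"),
             ("Jim", "middle_aged", "low", "risky"),
             ("Dave", "senior", "low", "safe"),
             ("Anne", "senior", "medium", "safe")]

    def build_model(rows):
        # Learn the majority loan_decision per (age, income) pair, plus an age-only fallback.
        by_pair, by_age = defaultdict(list), defaultdict(list)
        for _, age, income, label in rows:
            by_pair[(age, income)].append(label)
            by_age[age].append(label)
        majority = lambda labels: Counter(labels).most_common(1)[0][0]
        return ({k: majority(v) for k, v in by_pair.items()},
                {k: majority(v) for k, v in by_age.items()})

    def classify(pairs, ages, age, income):
        return pairs.get((age, income), ages.get(age, "risky"))

    # Step 2: model usage -- estimate the accuracy rate on an independent test set.
    test = [("Tom", "senior", "low", "safe"),
            ("Crest", "middle_aged", "low", "risky"),
            ("yee", "middle_aged", "high", "safe")]
    pairs, ages = build_model(train)
    correct = sum(classify(pairs, ages, a, i) == lbl for _, a, i, lbl in test)
    print("accuracy rate:", correct / len(test))        # 1.0 on this toy test set
    print(classify(pairs, ages, "middle_aged", "low"))  # unseen tuple (Henry) -> 'risky'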
Process (1): Model Construction
Training Data:

NAME  age          income  loan_decision
Mike  young        low     risky
Mary  young        low     risky
Bill  middle_aged  high    safe
Jim   middle_aged  low     risky
Dave  senior       low     safe
Anne  senior       medium  safe

A Classification Algorithm is applied to the training data and produces the Classifier (Model), e.g.:

IF age = young THEN loan_decision = risky
Process (2): Using the Model in Prediction
The trained Classifier is applied to the Testing Data:

NAME   age          income  loan_decision
Tom    senior       low     safe
Crest  middle_aged  low     risky
yee    middle_aged  high    safe

Unseen Data:
(Henry, middle_aged, low)
Loan_decision? risky
Supervised vs. Unsupervised Learning
Supervised learning (classification)
  Supervision: the training data are accompanied by labels indicating the class of the observations, e.g., risky, safe
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues: Data Preparation
Data cleaning
  Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
  Remove irrelevant or redundant attributes
Data transformation
  Generalize and/or normalize data
Issues: Evaluating Classification Methods
Accuracy
  Classifier accuracy: predicting the class label
  Predictor accuracy: guessing the value of predicted attributes
Speed
  Time to construct the model (training time)
  Time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency for disk-resident databases
Interpretability: understanding and insight provided by the model
Classification by Decision Tree Induction
A decision tree is a flowchart-like tree structure
  Each non-leaf node denotes a test on an attribute
  Each branch represents an outcome of the test
  Each leaf node holds a class label
  The topmost node is the root
Decision Tree Induction: Training Dataset
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for buys_computer
age?
  youth (<=30): student?
    no:  buys_computer = no
    yes: buys_computer = yes
  middle_aged (31..40): buys_computer = yes
  senior (>40): credit_rating?
    fair:      buys_computer = yes
    excellent: buys_computer = no
Algorithm for Decision Tree Induction
Basic algorithm: the tree is constructed in a top-down, recursive, divide-and-conquer manner
  At the start, all the training examples are at the root
  Examples are partitioned based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain), as in the sketch below
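A compact sketch of this basic algorithm in Python, using information gain as the attribute selection measure; the dict-based tuple encoding and the stopping rules are our own assumptions for illustration.

    import math
    from collections import Counter

    def entropy(labels):
        # Expected information (in bits) needed to classify a tuple.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, attr, target):
        # Reduction in entropy obtained by partitioning rows on attr.
        before = entropy([r[target] for r in rows])
        n = len(rows)
        after = 0.0
        for value in {r[attr] for r in rows}:
            part = [r[target] for r in rows if r[attr] == value]
            after += len(part) / n * entropy(part)
        return before - after

    def build_tree(rows, attrs, target):
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1 or not attrs:           # pure node, or no attributes left
            return Counter(labels).most_common(1)[0][0]  # leaf: majority class label
        best = max(attrs, key=lambda a: info_gain(rows, a, target))
        branches = {value: build_tree([r for r in rows if r[best] == value],
                                      [a for a in attrs if a != best], target)
                    for value in {r[best] for r in rows}}
        return (best, branches)                          # internal node: test on 'best'

Called on tuples such as {"age": "<=30", "student": "no", "buys_computer": "no"} with attrs = ["age", "student", ...] and target = "buys_computer", this produces the kind of tree shown on the previous slide.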
DT Algorithm
Issues faced by DT algo
Choosing splitting attributes: which attributes are used for splitting impacts performance, e.g., age or credit_rating are useful, while student-name is not.
Ordering of splitting attributes: the order in which attributes are chosen is important. In the example, age is chosen first, then student and credit_rating.
Splits: the number of splits to take place.
Tree structure: a balanced tree with the fewest levels is desirable.
Training data: the structure of the DT depends on the training data. If the training data is very small, the generated tree might not be robust enough to work properly; if it is too large, the created tree may overfit (it might not work for future states).
Pruning: once a tree is constructed, some modifications might be needed to improve its performance. The pruning phase removes redundant comparisons or subtrees to achieve better performance.
Presentation of Classification Results
Visualization of a Decision Tree in SGI/MineSet 3.0
Interactive Visual Mining by Perception-Based Classification (PBC)
Statistical based algorithms
Regression
Bayesian classification
Regression
Regression deals with the estimation of an output value based on input values. In classification, the input values are values from the database D and the output values represent the classes. If we know the input parameters x1, x2, ..., xn, then the relation between the output parameter y and the input parameters can be written as

y = c0 + c1*x1 + c2*x2 + ... + cn*xn

where c0, c1, ..., cn are the regression coefficients.
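For a single input parameter, the coefficients can be estimated by least squares. A minimal sketch in Python (the data points are hypothetical, chosen only to illustrate the fit):

    def fit_line(xs, ys):
        # Least-squares estimates of c0, c1 in y = c0 + c1*x.
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        c1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))
        c0 = mean_y - c1 * mean_x
        return c0, c1

    xs = [1.0, 2.0, 3.0, 4.0]            # hypothetical input values
    ys = [2.1, 3.9, 6.2, 8.1]            # hypothetical outputs, roughly y = 2x
    c0, c1 = fit_line(xs, ys)
    print(f"y = {c0:.2f} + {c1:.2f}*x")  # -> y = 0.00 + 2.03*x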
Linear Regression: Poor Fit
Classification Using Regression
Division: use a regression function to divide the space into regions. The data is plotted in an n-dimensional space without any explicit class; through regression, the space is divided into regions, one per class.
Prediction: formulas are generated to predict the output class value (a sketch follows below).
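A sketch of the prediction variant under our own simplifying assumptions: one regression formula is fitted per class against 0/1 membership targets, and a tuple is assigned to the class whose formula scores highest. All data below is hypothetical; fit_line is the least-squares helper from the regression sketch above, repeated so the example stands alone.

    def fit_line(xs, ys):
        # Least-squares fit of y = c0 + c1*x (same as the earlier regression sketch).
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        c1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
        return my - c1 * mx, c1

    xs     = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]   # hypothetical one-dimensional feature
    labels = ["A", "A", "A", "B", "B", "B"]

    # One formula per class: the target is 1 for members of the class, 0 otherwise.
    formulas = {}
    for cls in set(labels):
        formulas[cls] = fit_line(xs, [1.0 if l == cls else 0.0 for l in labels])

    def classify(x):
        # Assign x to the class whose formula predicts the highest membership value.
        return max(formulas, key=lambda cls: formulas[cls][0] + formulas[cls][1] * x)

    print(classify(1.2), classify(6.8))   # -> A B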
Division
Prediction
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) * P(H) / P(X)

Informally, this can be written as: posterior = likelihood x prior / evidence

Practical difficulty: requires initial knowledge of many probabilities, and incurs significant computational cost
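A quick worked example with hypothetical numbers: suppose the hypothesis H is "customer buys a computer" with prior P(H) = 0.3, the evidence X is observed with probability P(X) = 0.4 overall, and buyers exhibit X with likelihood P(X|H) = 0.8. Then

P(H|X) = P(X|H) * P(H) / P(X) = 0.8 * 0.3 / 0.4 = 0.6

so observing X raises the probability of H from 0.3 to 0.6.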
Naïve Bayesian Classifier: Training Dataset
Class:
  C1: buys_computer = yes
  C2: buys_computer = no
Data sample to classify (drawn against the training table shown earlier):
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayesian Classifier: An Example
P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no)  = 5/14 = 0.357
Compute P(X|Ci) for each class:
  P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
  P(age <= 30 | buys_computer = no) = 3/5 = 0.600
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no) = 2/5 = 0.400
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no) = 1/5 = 0.200
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X|Ci) * P(Ci):
  P(X | buys_computer = yes) * P(buys_computer = yes) = 0.044 x 0.643 = 0.028
  P(X | buys_computer = no) * P(buys_computer = no) = 0.019 x 0.357 = 0.007
Therefore, X belongs to class buys_computer = yes
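The same calculation as a minimal Python sketch; the list-of-dicts layout is our assumption, and, like the slide, it uses raw relative frequencies with no smoothing.

    from collections import Counter

    def naive_bayes_scores(rows, target, x):
        # Returns P(Ci) * product over attributes of P(attr = value | Ci).
        n = len(rows)
        scores = {}
        for cls, n_cls in Counter(r[target] for r in rows).items():
            members = [r for r in rows if r[target] == cls]
            score = n_cls / n                                   # prior P(Ci)
            for attr, value in x.items():
                score *= sum(1 for r in members if r[attr] == value) / n_cls
            scores[cls] = score
        return scores

    # With rows holding the 14 buys_computer tuples from the training slide, e.g.
    #   {"age": "<=30", "income": "high", "student": "no",
    #    "credit_rating": "fair", "buys_computer": "no"},
    # naive_bayes_scores(rows, "buys_computer",
    #                    {"age": "<=30", "income": "medium",
    #                     "student": "yes", "credit_rating": "fair"})
    # returns roughly {"yes": 0.028, "no": 0.007}, i.e. buys_computer = yes.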
Naïve Bayesian Classifier: Comments
Advantages
  Easy to implement
  Good results obtained in most of the cases
Disadvantages
  Assumption of class conditional independence, therefore loss of accuracy
  Practically, dependencies exist among variables
Rule based Classification
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules: r = <a, c>, where a is the antecedent and c is the consequent, e.g.:
  R: IF age = youth AND student = yes THEN buys_computer = yes
Rules are generated by many techniques, such as decision trees and neural networks.
Generating Rules from DTs
The algorithm generates one rule for each leaf node in the DT.
All rules with the same consequent can be combined by OR-ing the antecedents of the simpler rules, as in the sketch below.
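A sketch of this extraction, assuming the nested (attribute, {value: subtree}) tree representation used in the induction sketch earlier; each root-to-leaf path becomes the antecedent of one rule.

    def extract_rules(tree, conditions=()):
        # Leaf: the accumulated path conditions form the antecedent of one rule.
        if not isinstance(tree, tuple):
            antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
            return [f"IF {antecedent} THEN class = {tree}"]
        attr, branches = tree
        rules = []
        for value, subtree in branches.items():
            rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
        return rules

    # The buys_computer tree shown earlier:
    tree = ("age", {"<=30":   ("student", {"no": "no", "yes": "yes"}),
                    "31..40": "yes",
                    ">40":    ("credit_rating", {"fair": "yes", "excellent": "no"})})
    for rule in extract_rules(tree):
        print(rule)   # e.g. IF age = <=30 AND student = yes THEN class = yes

Rules sharing a consequent (here the three "yes" leaves) could then be merged by OR-ing their antecedents.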
Algorithm for Generating Rules from DTs
Generating Rules Example
Distance based algorithms
Classification Using Distance
Items mapped to the same class are more similar to the other items in that class than to the items in other classes.
So distance (or similarity) measures may be used to identify the alikeness of different items in the database.
Place items in the class to which they are closest.
Must determine the distance between an item and a class.
Classification using a simple distance based algorithm
Distance based:
  Classes are represented by a centroid, i.e., a central value or representative vector; e.g., Class A, Class B, and Class C are each represented by their own centroid.
Simple distance based Algo
Input:
  c1, ..., cm   // centroids for each class
  t             // input tuple to classify
Output:
  c             // class to which t is assigned

dist = infinity;
for i = 1 to m do
  if distance(ci, t) < dist then
    c = i;
    dist = distance(ci, t);
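The same algorithm in runnable Python, assuming numeric feature vectors, Euclidean distance, and hypothetical class centroids:

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def classify(centroids, t):
        # Assign t to the class whose centroid is nearest (the loop above).
        best_class, best_dist = None, float("inf")       # dist = infinity
        for cls, center in centroids.items():
            d = euclidean(center, t)
            if d < best_dist:
                best_class, best_dist = cls, d
        return best_class

    centroids = {"A": (1.0, 1.0), "B": (5.0, 1.0), "C": (3.0, 6.0)}  # hypothetical centers
    print(classify(centroids, (4.5, 1.5)))               # -> B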
K Nearest Neighbor (KNN):
A common classification scheme.
A lazy learner: it builds no model in advance, simply storing the training set and classifying by looking at neighbors.
The training set includes the data along with their classes.
Examine the K items closest to the item to be classified.
The new item is placed in the class to which most of the K closest items belong (a sketch follows below).
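A minimal KNN sketch under the same assumptions as before (numeric features, Euclidean distance, hypothetical training points):

    import math
    from collections import Counter

    def knn_classify(training, t, k=3):
        # Vote among the k training items closest to t.
        dist = lambda p: math.sqrt(sum((x - y) ** 2 for x, y in zip(p, t)))
        nearest = sorted(training, key=lambda item: dist(item[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
                ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]    # hypothetical labelled points
    print(knn_classify(training, (1.1, 1.0), k=3))       # -> A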
KNN
K = 3. The three closest items in the training set are shown; T will be placed in the class where most of these are.
KNN Algorithm