Classification Technique KNN in Data Mining
--- on the "Iris" dataset
COMP722 Data Mining
Kaiwen Qi, UNC
Spring 2012
Outline
• Dataset introduction
• Data processing
• Data analysis
• KNN & implementation
• Testing
Dataset
Raw dataset: Iris (http://archive.ics.uci.edu/ml/datasets/Iris)
150 total records:
• 50 records Iris Setosa
• 50 records Iris Versicolour
• 50 records Iris Virginica
5 attributes:
• Sepal length in cm (continuous number)
• Sepal width in cm (continuous number)
• Petal length in cm (continuous number)
• Petal width in cm (continuous number)
• Class (nominal data: Iris Setosa, Iris Versicolour, Iris Virginica)
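The record layout above can be illustrated with a short sketch. The sample lines and the `parse_record` helper below are illustrative, assuming the comma-separated format used by the UCI `iris.data` file (four continuous attributes followed by the nominal class label):

```python
# Minimal sketch: parsing records in the UCI iris.data CSV format.
# The three sample lines below are taken from the published dataset format.
sample = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica"""

def parse_record(line):
    fields = line.split(",")
    # sepal length, sepal width, petal length, petal width (cm), then class
    return [float(v) for v in fields[:4]], fields[4]

records = [parse_record(line) for line in sample.splitlines()]
print(records[0])  # ([5.1, 3.5, 1.4, 0.2], 'Iris-setosa')
```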
(Figure: (a) raw data; (b), (c) data organization)
Classification Goal
Task
Data Processing
Original data
Data Processing
• Balanced distribution
Data Analysis
Statistics
Data Analysis
Histogram
KNN
KNN algorithm
The unknown data point, the green circle, is classified as a square when K is 5. The distance between two points is calculated with the Euclidean distance d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2). In this example, squares are the majority among the 5 nearest neighbors.
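The two ingredients of this step, the Euclidean distance and the majority vote among the K nearest neighbors, can be sketched as follows (a minimal illustration, not the slide author's code):

```python
import math
from collections import Counter

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def majority_vote(labels):
    # the class appearing most often among the K nearest neighbors wins
    return Counter(labels).most_common(1)[0][0]

print(euclidean((0, 0), (3, 4)))                        # 5.0
print(majority_vote(["square", "triangle", "square"]))  # square
```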
KNN
Advantages:
• Simplicity of implementation. It is good at dealing with numeric attributes.
• Does not build a model up front; it just imports the dataset, with very low computational overhead.
• Does not need to compute a useful attribute subset. Compared with naïve Bayesian, we do not need to worry about a lack of available probability data.
Implementation of KNN Algorithm
Algorithm: KNN. Assign a classification label from training data to an unlabeled tuple.
Input: K, the number of neighbors; a dataset that includes the training data.
Output: A string that indicates the unknown tuple's classification.
Method:
(1) Create a distance array whose size is K.
(2) Initialize the array with the distances between the unlabeled tuple and the first K records in the dataset.
(3) Let i = K+1.
(4) Calculate the distance between the unlabeled tuple and the i-th record in the dataset; if the distance is smaller than the biggest distance in the array, replace the old max distance with the new one; i = i+1.
(5) Repeat step (4) until i is greater than the dataset size (150).
(6) Count the class occurrences in the array; the class with the biggest count is the mining result.
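The steps above can be sketched in Python (a minimal illustration of the described procedure, with a toy dataset in place of the Iris records; names are my own, not the slide author's):

```python
import math
from collections import Counter

def knn_classify(k, dataset, unlabeled):
    """Classify `unlabeled` by majority vote among its k nearest records.

    `dataset` is a list of (features, label) pairs. Follows the slide's
    steps: seed a k-slot array with the first k distances, then scan the
    rest, replacing the current maximum whenever a closer record appears.
    """
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # steps (1)-(2): initialize the k-slot array with the first k records
    nearest = [(dist(unlabeled, f), lbl) for f, lbl in dataset[:k]]
    # steps (3)-(5): scan remaining records, replacing the farthest kept one
    for features, label in dataset[k:]:
        d = dist(unlabeled, features)
        worst = max(range(k), key=lambda i: nearest[i][0])
        if d < nearest[worst][0]:
            nearest[worst] = (d, label)
    # step (6): the majority class among the k kept neighbors is the result
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

toy = [((1.0, 1.0), "A"), ((1.1, 0.9), "A"), ((5.0, 5.0), "B"),
       ((5.2, 4.8), "B"), ((0.9, 1.2), "A")]
print(knn_classify(3, toy, (1.0, 1.1)))  # A
```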
Implementation of KNN
(UML class diagram)
Testing
Testing (K=7, total 150 tuples)
Testing (K=7, 60% of the data as training data)
Testing
Input: a randomly shuffled dataset
Accuracy test: accuracy = (number of correctly classified tuples) / n
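The accuracy test can be sketched as follows. The 60/40 split over shuffled records mirrors the setup described above; the two-class toy records and the helper names are illustrative assumptions, not the slide author's code:

```python
import random

def accuracy(predicted, actual):
    # accuracy = (number of correctly classified tuples) / n
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# hypothetical split: shuffle 150 records, 60% training / 40% test
records = [(i, "A" if i % 2 else "B") for i in range(150)]
random.seed(0)
random.shuffle(records)
cut = int(0.6 * len(records))
train, test = records[:cut], records[cut:]
print(len(train), len(test))  # 90 60

print(accuracy(["A", "B", "A"], ["A", "B", "B"]))  # 2/3
```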
Performance
Comparison: Decision tree
Advantages:
• comprehensibility
• constructs a decision tree without any domain knowledge
• handles high-dimensional data
• by eliminating unrelated attributes and pruning the tree, it simplifies the classification calculation
Disadvantages:
• requires good-quality training data
• usually runs in memory
• not good at handling continuous number features
Naïve Bayesian
Advantages:
• relatively simple
• trains by simply calculating attribute frequencies from the training data, without any other operations (e.g. sort, search)
Disadvantages:
• the independence assumption does not always hold
• there may be no available probability data to calculate a probability
Conclusion
KNN is a simple algorithm with high classification accuracy for datasets with continuous attributes.
It shows high performance when training data with a balanced class distribution is given as input.
Thanks! Questions?