Data Mining: Implementation of Data Mining Techniques using RapidMiner software


Prepared by Mohammed Kharma

Definitions review

• Cluster: a collection of data objects that are
  – similar (or related) to one another within the same group
  – dissimilar (or unrelated) to the objects in other groups
• Cluster analysis
  – Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters

Clustering Methods

• Partitioning:
  – Unsupervised learning algorithms that construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Typical methods: k-means, k-medoids
• Hierarchical:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

Illustration and comparison of two clustering techniques using the RapidMiner tool and a Java application

• K-means algorithm: we performed two tests
  1. Using a Java program, with program parameters K = 2; Data: 22 2123 2024 2225 33 2


K-means Clustering

• Input: the number of clusters K and the collection of n instances
• Output: a set of K clusters that minimizes the squared error criterion
• Method:
  – Arbitrarily choose K instances as the initial cluster centers
  – Repeat
    • (Re)assign each instance to the cluster to which the instance is most similar, based on the mean value of the instances in the cluster
    • Update the cluster means (compute the mean value of the instances in each cluster)
  – Until no change in the assignment

• Squared Error Criterion
  – $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
  – where $m_i$ is the mean of cluster $C_i$ and $p$ ranges over the points in that cluster
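The K-means method and the squared error criterion above can be sketched in Java roughly as follows. This is a minimal illustration, not the Java program used for the test in this presentation: the class name `KMeansSketch`, its method names, the use of squared Euclidean distance as the similarity measure, and the choice of the first K instances as the "arbitrary" initial centers are all assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of K-means; names and design choices are assumptions.
final class KMeansSketch {

    // Squared Euclidean distance |a - b|^2 between two points.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Mean of each cluster; an empty cluster keeps its previous center.
    static double[][] means(double[][] points, int[] assignment, double[][] previous) {
        int k = previous.length;
        int dim = points[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[assignment[i]]++;
            for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) { sums[c] = previous[c]; continue; }
            for (int d = 0; d < dim; d++) sums[c][d] /= counts[c];
        }
        return sums;
    }

    // Returns the cluster index of every point (requires points.length >= k).
    static int[] cluster(double[][] points, int k) {
        // Assumption: the first k instances serve as the initial centers.
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[c].clone();

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        boolean changed = true;
        while (changed) {                       // repeat until no reassignment
            changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;                   // (re)assign to the nearest center
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centers[c])
                            < squaredDistance(points[i], centers[best])) best = c;
                }
                if (best != assignment[i]) { assignment[i] = best; changed = true; }
            }
            centers = means(points, assignment, centers);  // update cluster means
        }
        return assignment;
    }

    // Squared error criterion E: sum over clusters of |p - m_i|^2.
    static double squaredError(double[][] points, int[] assignment, int k) {
        double[][] m = means(points, assignment, new double[k][points[0].length]);
        double e = 0.0;
        for (int i = 0; i < points.length; i++) {
            e += squaredDistance(points[i], m[assignment[i]]);
        }
        return e;
    }
}
```

As a usage illustration with made-up data (not the data set from the test above), calling `KMeansSketch.cluster` with K = 2 on the points (1,1), (1,2), (2,1), (8,8), (9,8), (8,9) places the first three points in one cluster and the last three in the other. `KMeansSketch.squaredError` then evaluates E for that assignment.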

The result of K-means (Java program)

The result of K-means (RapidMiner)


K-medoids

• Input: the number of clusters K and the collection of n instances
• Output: a set of K clusters that minimizes the sum of the dissimilarities of all the instances to their nearest medoids
• Method:
  – Arbitrarily choose K instances as the initial medoids
  – Repeat
    • (Re)assign each remaining instance to the cluster with the nearest medoid
    • Randomly select a non-medoid instance, Or
    • Compute the total cost, S, of swapping medoid Oj with Or
    • If S < 0, then swap Oj with Or to form the new set of K medoids
  – Until no change
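The swap-based K-medoids method above can be sketched in Java as follows. This is a hedged illustration only: the class name `KMedoidsSketch`, its method names, and the Euclidean dissimilarity are assumptions. It also replaces the random selection of Or with an exhaustive pass over every non-medoid candidate, which keeps the example deterministic, and starts from the first K instances as the "arbitrary" initial medoids.

```java
// Hypothetical sketch of swap-based K-medoids; names and choices are assumptions.
final class KMedoidsSketch {

    // Euclidean distance, used here as the dissimilarity measure (assumption).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Sum of the dissimilarities of all instances to their nearest medoid.
    static double totalCost(double[][] points, int[] medoids) {
        double cost = 0.0;
        for (double[] p : points) {
            double nearest = Double.POSITIVE_INFINITY;
            for (int m : medoids) nearest = Math.min(nearest, distance(p, points[m]));
            cost += nearest;
        }
        return cost;
    }

    static boolean isMedoid(int[] medoids, int index) {
        for (int m : medoids) if (m == index) return true;
        return false;
    }

    // Returns the indices of the k medoids (requires points.length >= k).
    static int[] findMedoids(double[][] points, int k) {
        int[] medoids = new int[k];
        for (int c = 0; c < k; c++) medoids[c] = c;  // assumption: first k instances
        double best = totalCost(points, medoids);

        boolean improved = true;
        while (improved) {                            // repeat until no change
            improved = false;
            for (int j = 0; j < k; j++) {
                // Assumption: try every non-medoid Or instead of a random one.
                for (int r = 0; r < points.length; r++) {
                    if (isMedoid(medoids, r)) continue;
                    int old = medoids[j];
                    medoids[j] = r;                   // tentatively swap Oj with Or
                    double s = totalCost(points, medoids) - best;
                    if (s < -1e-12) {                 // S < 0: keep the swap
                        best += s;
                        improved = true;
                    } else {
                        medoids[j] = old;             // otherwise undo it
                    }
                }
            }
        }
        return medoids;
    }
}
```

On the made-up points (1,1), (1,2), (2,1), (8,8), (9,8), (8,9) with K = 2, this sketch ends with the points (1,1) and (8,8) as medoids. Each instance then belongs to the cluster of its nearest medoid, as in the assignment step of the method.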

The result of K-medoids (RapidMiner)

Java Live Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Comparison

• Both algorithms produced the same results on the test data
• Both require K to be specified as input
• K-medoids is less influenced by outliers in the data
• Both methods assign each instance to exactly one cluster
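The point about outliers can be illustrated with a small one-dimensional sketch. The class `OutlierSketch` and the sample values are hypothetical and are not taken from the experiments above. The code compares the mean of a cluster (the K-means representative) with its medoid (the instance with the smallest total absolute distance to all others).

```java
// Hypothetical illustration of how one outlier affects a mean versus a medoid.
final class OutlierSketch {

    // Arithmetic mean, the cluster representative used by K-means.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Medoid: the actual instance minimizing the summed distance to all others.
    static double medoid(double[] values) {
        double best = values[0];
        double bestCost = Double.POSITIVE_INFINITY;
        for (double candidate : values) {
            double cost = 0.0;
            for (double v : values) cost += Math.abs(v - candidate);
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }
}
```

For the made-up values 1, 2, 3, 4, 100, the mean is 22, which is pulled far toward the outlier, while the medoid is 3, which stays among the bulk of the instances.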

Thank you
