
An Efficient Grid Algorithm for Faster Clustering Using K Medoids Approach

Golam Moktader Daiyan Department of Computer Science & Information Technology, Southern University, Bangladesh.

Faisal Bin Al Abid Department of Computer Science & Engineering, Islamic University of Technology, Bangladesh.

Md. Ataur Rahman Khan Department of Electronic and Communication Engineering, Southern University, Bangladesh.

Ahsan Habib Tareq Department of Information & Communication Engineering, Mawlana Bhashani Science & Technology University, Tangail, Bangladesh.

Abstract— Clustering is the methodology of separating similar objects of a data set into one cluster and dissimilar objects into another. K-means and K-medoids are the most widely used clustering algorithms for grouping the objects of a data set. K-means clustering has lower time complexity than the K-medoids method, but it suffers from extreme values. So, we have focused on the K-medoids clustering method. The conventional K-medoids clustering algorithm suffers from several limitations; we have analyzed these limitations, such as the problem of finding natural clusters and the dependency of the output on the order of the input data. In this paper we propose a new algorithm named Grid Multidimensional K-medoids (GMK), which is designed to overcome the above limitations and provide faster clustering than K-medoids.

Keywords— medoid; dataset; partitioning; grid; outlier; time complexity.

I. INTRODUCTION
Data mining is the process of non-trivial extraction of implicit, previously unknown, and potentially useful information from data [1]. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, i.e., they find the human-interpretable patterns that describe the data. Predictive mining tasks perform inference on the current data in order to make predictions. Clustering is one of the major descriptive data mining tasks. Clustering is a division of data into groups of similar objects. Representing the dataset by fewer clusters loses certain fine details, but achieves simplification [1]; the data are represented by their clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis, where clustering can be viewed as a density estimation problem. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others.

Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. Data mining adds to clustering the problem of very large datasets with very many attributes of different types, which imposes unique computational requirements on the relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems. Clustering in data mining was brought to life by intense developments in information retrieval and text mining, spatial database applications such as GIS or astronomical data [2], sequence and heterogeneous data analysis [3], Web applications [4], DNA analysis in computational biology [5], and many others.

II. TAXONOMY OF CLUSTERING
There are several well-known clustering algorithms, and different clustering algorithms may produce different clusters. The most well-known families are hierarchical clustering, density-based clustering, grid-based clustering, model-based clustering, and partition-based clustering.

Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data at different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down) methods [6,7]. An agglomerative clustering starts with one-point (singleton) clusters and recursively merges the most appropriate clusters; a divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster.
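The agglomerative idea is easy to state in code. Below is a minimal sketch (ours, not from the paper) of bottom-up merging with single linkage; the function name and the choice of linkage are illustrative assumptions.

    import numpy as np

    def agglomerative_single_linkage(points, k):
        """Bottom-up clustering sketch: start with singleton clusters and
        repeatedly merge the two closest clusters until k remain."""
        points = np.asarray(points, dtype=float)
        clusters = [[i] for i in range(len(points))]   # singleton clusters
        while len(clusters) > k:
            best, best_d = (0, 1), float("inf")
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # single linkage: distance between the closest member pair
                    d = min(np.linalg.norm(points[i] - points[j])
                            for i in clusters[a] for j in clusters[b])
                    if d < best_d:
                        best_d, best = d, (a, b)
            a, b = best
            clusters[a].extend(clusters[b])            # merge the closest pair
            del clusters[b]
        return clusters

The quadratic pair scan is for exposition only; practical implementations cache a distance matrix and update it incrementally after each merge.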


An open set in Euclidean space can be divided into a set of its connected components. The implementation of this idea for partitioning a finite set of points requires the concepts of density, connectivity, and boundary, which are closely related to a point's nearest neighbors. A cluster, defined as a connected dense component, grows in any direction that density leads. Density-based algorithms are therefore capable of discovering clusters of arbitrary shape, which also provides natural protection against outliers and noise. Since density-based algorithms require a metric space, the natural setting for them is spatial data clustering [8]. To make computations feasible, an index of the data is constructed (such as an R*-tree); this is a topic of active research, and classic indices are effective only with reasonably low-dimensional data. The algorithm DENCLUE, which is in fact a blend of density-based clustering and grid-based preprocessing, is less affected by data dimensionality.

Grid-based clustering methods are used for multi-resolution data structures. They quantize the object space into a finite number of cells that form a grid structure, on which all operations are then performed. The main advantage of grid clustering is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. A typical example of grid-based clustering is STING (Statistical Information Grid), which represents and explores grid information stored in grid cells and processes faster than other conventional clustering methods [9].
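As a concrete illustration of the quantization step that grid methods share, here is a minimal sketch (our own, not STING) that assigns 2-D points to the cells of a uniform grid; all names are illustrative.

    import numpy as np
    from collections import defaultdict

    def assign_to_grid(points, cells_per_axis):
        """Quantize 2-D points into a cells_per_axis x cells_per_axis grid;
        later operations then work on cells instead of individual points."""
        points = np.asarray(points, dtype=float)
        lo, hi = points.min(axis=0), points.max(axis=0)
        width = (hi - lo) / cells_per_axis        # cell width per axis
        width[width == 0] = 1.0                   # guard against degenerate axes
        grid = defaultdict(list)
        for idx, p in enumerate(points):
            cell = np.minimum(((p - lo) // width).astype(int),
                              cells_per_axis - 1) # clamp points on the max edge
            grid[tuple(cell)].append(idx)
        return grid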

Data partitioning algorithms divide the data set into several subsets. Because checking all possible subset systems is computationally infeasible, certain greedy heuristics are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation algorithms gradually improve clusters; with an appropriate dataset, this results in high-quality clusters. One approach to data partitioning takes a conceptual point of view that identifies each cluster with a certain model whose unknown parameters have to be found. Another approach starts with the definition of an objective function that depends on the partition; computation of the objective function then becomes linear in N (with the number of clusters k << N). Depending on how representatives are constructed, iterative optimization partitioning algorithms are subdivided into k-medoids and k-means methods. A k-medoid is the most appropriate data point within a cluster to represent it.

Representation by k-medoids has two advantages: first, it presents no limitations on attribute types; second, the choice of medoids is dictated by the location of the predominant (major) fraction of points inside a cluster, and it is therefore less sensitive to the presence of outliers and noise. One of the most popular versions of k-medoids is PAM (Partitioning Around Medoids). PAM is an iterative optimization that combines relocation of points between prospective clusters with re-nominating points as potential medoids. The guiding principle of the process is its effect on the objective function, which, obviously, makes it a costly strategy.
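To make PAM's costly strategy concrete, the following sketch (an assumption-laden simplification, not the full PAM procedure) evaluates every (medoid, non-medoid) swap against the objective function and keeps the best one.

    import numpy as np

    def total_cost(points, medoid_idx):
        """Objective: sum of each point's distance to its nearest medoid."""
        d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :],
                           axis=2)                 # (n, k) distance matrix
        return d.min(axis=1).sum()

    def pam_swap_step(points, medoid_idx):
        """One PAM iteration: try every (medoid, non-medoid) swap and keep
        the swap that lowers the total cost the most, if any."""
        points = np.asarray(points, dtype=float)
        best_cost = total_cost(points, medoid_idx)
        best = list(medoid_idx)
        for m in range(len(medoid_idx)):
            for h in range(len(points)):
                if h in medoid_idx:
                    continue
                trial = list(medoid_idx)
                trial[m] = h                       # tentative swap
                c = total_cost(points, trial)
                if c < best_cost:
                    best_cost, best = c, trial
        return best, best_cost

Every candidate swap recomputes the full objective, which is exactly why PAM scales poorly and why sampling variants such as CLARA were introduced.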

CLARA uses several samples, each with 40 + 2k points, which are each subjected to PAM. The whole dataset is then assigned to the resulting medoids, the objective function is computed, and the best system of medoids is retained; CLARA is thus used to deal with very large data sets. Further progress is associated with Ng & Han [10], who introduced the algorithm CLARANS (Clustering Large Applications based upon Randomized Search) in the context of clustering in spatial databases. It uses a sample with some randomness at each step of the search. Conceptually, the clustering process can be viewed as a search through a graph where each node is a potential solution (a set of k medoids). Two nodes are neighbors (that is, connected by an arc in the graph) if their sets differ by only one object, and each node can be assigned a cost. PAM searches and examines all of the neighbors of the current node in its search for a minimum cost. CLARA has time complexity O(ks² + k(n − k)); CLARANS has time complexity O(N²). To eliminate these shortcomings of the k-medoids method, we propose GMK (Grid Multidimensional K-medoids).

III. PROPOSED ALGORITHM
1) Build the grid. Divide the range of the x-axis into σ equal segments (each of width range/σ), where

σ = √(N/M),

N is the number of data points and M is the average number of data points in each grid cell (taken to be 3 to 5 for 150 data points or objects). Do the same for the y-axis [11]. For example, with N = 150 and M = 4, σ = √37.5 ≈ 6, so the space is quantized into a 6 × 6 grid.

2) Detect outliers. If size(grid[i][j]) < threshold, mark grid[i][j] as a noise grid.

3) Select grid medoids. For each point of grid[i][j], compute the sum of Euclidean distances from that point to all other points inside grid[i][j]; choose the point with the minimum total cost as the center (medoid) of the grid. A sketch of the whole procedure is given below.
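Putting the three steps together, a minimal sketch of the GMK procedure might look as follows; the default values of M and the noise threshold are assumptions chosen to match the text, not prescribed by it.

    import numpy as np
    from collections import defaultdict

    def gmk(points, M=4, noise_threshold=2):
        """Grid Multidimensional K-medoids sketch, following steps 1-3:
        (1) build a sigma x sigma grid with sigma = sqrt(N / M),
        (2) mark cells below the threshold as noise grids (outliers),
        (3) pick the minimum-cost point of each remaining cell as its medoid."""
        points = np.asarray(points, dtype=float)
        N = len(points)
        sigma = max(1, int(round(np.sqrt(N / M))))   # segments per axis
        lo, hi = points.min(axis=0), points.max(axis=0)
        width = np.where(hi > lo, (hi - lo) / sigma, 1.0)

        cells = defaultdict(list)                    # step 1: grid assignment
        for idx, p in enumerate(points):
            c = np.minimum(((p - lo) // width).astype(int), sigma - 1)
            cells[tuple(c)].append(idx)

        medoids, noise = {}, []
        for cell, members in cells.items():
            if len(members) < noise_threshold:       # step 2: noise grid
                noise.extend(members)
                continue
            sub = points[members]                    # step 3: cell medoid
            dists = np.linalg.norm(sub[:, None] - sub[None, :], axis=2)
            medoids[cell] = members[int(dists.sum(axis=1).argmin())]
        return medoids, noise

Because medoids are computed only within each cell, the pairwise distance work is bounded by the cell population (around M points) rather than by N, which is where the speed-up over plain k-medoids comes from.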

Flow Chart 1: GMK Algorithm


IV. EXPERIMENTAL RESULT
This experiment is carried out using a standard data set [12]. It shows that the grid approach (GMK) takes less time than k-medoids. A snapshot of the standard Adult dataset is given below in Table 1.
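For reference, a timing comparison of this kind can be scripted as below; this is a hypothetical harness around the gmk sketch above, not the authors' experimental setup.

    import time
    import numpy as np
    # assumes the gmk(points, M, noise_threshold) sketch defined earlier

    def time_it(fn, *args, repeats=5):
        """Report the best wall-clock time over a few repeats."""
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(*args)
            best = min(best, time.perf_counter() - t0)
        return best

    rng = np.random.default_rng(0)
    data = rng.normal(size=(150, 2))   # 150 points, as in the text
    print("GMK:", time_it(gmk, data), "s")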


Fig. 1: Comparison graph of k-medoids and GMK

Table 1: Snapshot of the standard Adult dataset

Capital Gain    Education Number
2174            13
0               13
0               9
0               7
0               13
0               14
0               16
0               9
0               14
14084           10
5174            13
3132            10
2178            13
2345            12

V. CONCLUSION

We can merge the neighboring grids to provide good clustering. Although it will take more time, it will provide better and more accurate clustering; a flood-fill algorithm could be used for this merging. Accuracy can then be assessed by dividing the number of wrongly assigned points by the total number of points.
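A minimal sketch of the flood-fill merge suggested above, operating on the occupied cell keys produced by the gmk sketch; treating edge-adjacent cells as connected is our assumption.

    from collections import deque

    def merge_neighboring_cells(cells):
        """Flood fill over occupied grid cells: cells that share an edge
        are merged into one cluster of cells."""
        occupied = set(cells)
        clusters, seen = [], set()
        for start in occupied:
            if start in seen:
                continue
            group, queue = [], deque([start])
            seen.add(start)
            while queue:                              # breadth-first flood fill
                i, j = queue.popleft()
                group.append((i, j))
                for ni, nj in ((i+1, j), (i-1, j), (i, j+1), (i, j-1)):
                    if (ni, nj) in occupied and (ni, nj) not in seen:
                        seen.add((ni, nj))
                        queue.append((ni, nj))
            clusters.append(group)
        return clusters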

REFERENCES
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Second Edition, Elsevier Inc., 2006.
[2] M. Ester, A. Frommelt, H.-P. Kriegel, and J. Sander, "Spatial data mining: database primitives, algorithms and efficient DBMS support," Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 4(2-3), 193-216, 2000.
[3] I. Cadez, P. Smyth, and H. Mannila, "Probabilistic modeling of transactional data with applications to profiling, visualization, and prediction," in Proceedings of the 7th ACM SIGKDD, 37-46, San Francisco, CA, 2001.
[4] R. Cooley, B. Mobasher, and J. Srivastava, "Data preparation for mining world wide web browsing patterns," Journal of Knowledge and Information Systems, 1(1), 5-32, 1999.
[5] A. Ben-Dor and Z. Yakhini, "Clustering gene expression patterns," in Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB 99), 11-14, Lyon, France, 1999.
[6] A. Jain and R. Dubes, "Algorithms for Clustering Data," Prentice-Hall, Englewood Cliffs, NJ, 1988.
[7] L. Kaufman and P. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," John Wiley and Sons, New York, NY, 1990.
[8] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey," 2001.
[9] W. Wang, J. Yang, and R. Muntz, "STING: a statistical information grid approach to spatial data mining," in Proceedings of the 23rd VLDB Conference, Athens, Greece, pp. 186-195, 1997.
[10] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proceedings of the 20th Conference on VLDB, 144-155, Santiago, Chile, 1994.
[11] Xiaoyun Chen, Youli Su, Yi Chen, and Guohua Liu, "GK-means: An Efficient K-means Clustering Algorithm Based On Grid," 2009.
[12] Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
