Applied Soft Computing 19 (2014) 134–146

Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for connection oriented telecommunication data

T. Velmurugan
PG and Research Department of Computer Science, D.G. Vaishnav College, Chennai - 600106, India

Article history: Received 27 April 2012; Received in revised form 11 January 2014; Accepted 9 February 2014; Available online 19 February 2014

Keywords: k-Means algorithm; Fuzzy C-Means algorithm; Data clustering; Data analysis; Telecommunication data

Abstract

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data, using pattern recognition technologies as well as statistical and mathematical techniques. Cluster analysis is one of the major data analysis techniques and is widely applied in many practical applications in emerging areas of data mining. Two of the most widely used partition based clustering algorithms, k-Means and Fuzzy C-Means, are analyzed in this research work. The algorithms are implemented in a practical setting and their performance is analyzed on the basis of computational time. Connection oriented broadband telecommunication data is the source data for this analysis. The distances (Euclidean distances) between the server locations and their connections are rearranged after processing the data. The computational complexity (execution time) of each algorithm is measured and the results are compared with one another. The comparison shows that the results obtained are accurate and easy to understand and, above all, that the time taken to process the data is substantially higher for the Fuzzy C-Means algorithm than for k-Means.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Data mining (DM) is the extraction of useful and non-trivial information from the large amounts of data that it is possible to collect in many diverse fields of science, business and engineering. DM is part of a bigger framework, referred to as knowledge discovery in databases (KDD), that covers a complex process from data preparation to knowledge modeling. Within this process, DM techniques and algorithms are the actual tools that analysts have at their disposal to find unknown patterns and correlations in the data. Typical DM tasks are classification (assign each record of a database to one of a predefined set of classes), clustering (find groups of records that are close according to some user defined metrics) and association rules (determine implication rules for a subset of record attributes). A considerable number of algorithms have been developed to perform these and other tasks in many fields of science, from machine learning to statistics, through various computing technologies such as neural and fuzzy computing. What was a hand tailored set of case specific recipes about ten years ago is now recognized as a proper science. It is sufficient to consider the remarkably wide spectrum of applications where DM techniques are currently being applied to understand the ever growing interest of the research community in this domain.
Data mining can be viewed as an essential step in the process of knowledge discovery. Data are normally preprocessed through data cleaning, data integration, data selection, and data transformation, and thus prepared for the mining task. Having started as little more than a dry extension of earlier analysis techniques, DM is now bringing important contributions to crucial fields of investigation and to traditional sciences like astronomy, high energy physics, biology and medicine [8], which have always provided a rich source of applications for data miners. An important field of application for data mining techniques is also the World Wide Web. The Web provides access to one of the largest data repositories, which in most cases still remains to be analyzed and understood. Recently, data mining techniques have also been applied to the social sciences, homeland security and counter terrorism. A DM system is therefore composed of a software environment that provides all the functionalities needed to compose DM applications, and a hardware back-end on which the DM applications are executed.

This paper discusses one of the application areas of the partition based clustering algorithms k-Means and Fuzzy C-Means by




means of an experimental approach, choosing real time telecommunication data. Many applications have been proposed using different algorithms, so it is useful to discuss some applications from related areas; this helps in understanding the related breakthroughs in computation and engineering applications. Ling-Zhong Lin and Tsuen-Ho Hsu present a study on designing an FANP model for brand image decision-making in their paper [21]. They observe that both theoretical and practical efforts on brand images often neglect the interactions and mutual influence among attributes or criteria, even across the stages of different brand life cycles. Their study aims to create a hierarchical framework for brand image management. The analytical network process and fuzzy set theory are applied both to mindshare in brand images and to the inherent interactions and interdependencies among diverse information resources. A real empirical application is demonstrated in a department store. Both the theoretical and practical background of the paper show that the fuzzy analytical network process can capture expert knowledge that exists in the form of incomplete and vague information for the mutual inspiration of attributes and criteria in brand image management.

Chien-wen Shen et al. present a fuzzy AHP-based fault diagnosis for the semiconductor lithography process [22]. Their study proposes the fuzzy analytic hierarchy process (FAHP) for ambiguous fault evaluations of the lithography process. The application of FAHP has several advantages over conventional approaches because it is able to enumerate the managerial causes of lithography faults and to homogenize the differences among the subjective judgments of on-site engineers. Together with fuzzy set theory, the study provides a systematic mechanism to construct a hierarchy of the FAHP model and a FAHP diagnosis map for the lithography process. A paper titled "Application of quality function deployment to improve the quality of Internet shopping website interface design" is presented by Hui-Ming Kuo and Cheng-Wu Che [23]. They observe that the popularization and rapid development of the Internet has fostered the growth of online shopping, leading it to become an important new channel for consumers to make purchases. The Internet users' rate of satisfaction has declined since online shopping became an important consumer option. In order to improve customer satisfaction and to enhance the shopping experience, it is very important to understand the customer quality needs particular to an Internet shopping website, and then to meet these needs through suitable website interface design. A B2C shopping website is used as an example in their study. Quality function deployment (QFD) is utilized to obtain an understanding of customer quality needs, quality elements, and the relationship between them. Suggestions for improving the quality of website design are proposed based on the case study and the major performance indices discussed. The conclusions can be used as a reference for online shopping website operators wishing to enhance the keenness of their websites in the highly competitive online shopping market.

"Fuzzy control of interconnected structural systems using the fuzzy Lyapunov method" is a study carried out by C.W. Chen [24]. In this study, a closed-form, easy-to-use fuzzy control method for interconnected structural systems is developed. First, the structural systems are reviewed. Interconnected schemes are employed to represent the structural systems and their subsystems; the representation of the interconnected systems consists of J interconnected subsystems. Stability is ensured by criteria derived from fuzzy Lyapunov functions, which are defined as the fuzzy blending of quadratic Lyapunov functions. Common solutions can be obtained by solving a set of linear matrix inequalities that are numerically feasible. To verify the effectiveness of the interconnected structural system, explanatory examples are presented for a practical tuned mass damper mounted on a building structure. Finally, the proposed control method is demonstrated using an example in which the interconnected technique is utilized to represent large-scale structural systems.

A paper presented by Cheng-Wu Chen is titled "Stability conditions of fuzzy systems and its application to structural and mechanical systems" [25]. The paper provides the stability conditions and controller design for a class of structural and mechanical systems represented by Takagi–Sugeno (T–S) fuzzy models. In this controller design procedure, the parallel distributed compensation (PDC) scheme is utilized to construct a global fuzzy logic controller by blending all local state feedback controllers. A stability analysis is carried out not only for the fuzzy model but also for a real mechanical system. Furthermore, this control problem can be reduced to linear matrix inequality (LMI) problems via the Schur complement, and efficient interior-point algorithms are available in the Matlab toolbox to solve them. A simulation example is given to show the feasibility of the proposed fuzzy controller design method.

Data mining is performed on various types of databases and information repositories, and the kinds of patterns to be found are specified by various data mining functionalities such as class/concept description, association, correlation analysis, classification, prediction and cluster analysis. Among these, cluster analysis is one of the major data analysis methods, widely used in many practical applications in emerging areas [8,11]. The quality of a clustering result depends on the similarity measure used by the method, on its implementation, and on its ability to discover some or all of the hidden patterns. A number of clustering techniques have been proposed over the years [3], and different clustering approaches may yield different results. Partitioning based algorithms are frequently used by many researchers for various applications in different domains. This research work deals with two of these partitioning based clustering techniques, as stated above. With this discussion of data mining and clustering methods, the next section presents some of the application areas of the two algorithms.

The rest of the paper is structured as follows. Section 2 presents some of the application areas of the chosen algorithms and related approaches. In Section 3, the basics of the k-Means and fuzzy clustering methods are described. Section 4 describes the experimental setup of the proposed method. Section 5 covers the experimental studies, results and a brief discussion. Finally, Section 6 presents the conclusions of the research work.

2. Applications of cluster analysis

Clustering means creating groups of objects based on their features in such a way that the objects belonging to the same group are similar and those belonging to different groups are dissimilar. Clustering is one of the standard workhorse techniques in the field of data mining. Its intention is to organize a dataset into a set of groups, or clusters, which contain similar data items, as measured by some distance function. The major applications of clustering include document categorization, scientific data analysis, customer/market segmentation and the World Wide Web. Other areas include pattern recognition, artificial intelligence, information technology, image processing, biology, psychology and marketing. Some of the areas in which clustering concepts are specifically used nowadays are:

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
• Land use: Identification of areas of similar land use in an earth observation database.


• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
• City-planning: Identifying groups of houses according to their house type, value and geographical location.
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults.

Data clustering has attracted the attention of many researchers in different disciplines. It is an important and useful technique in data analysis, and a large number of clustering algorithms have been put forward and investigated. Clustering is an unsupervised learning technique. Unlike classification, in which objects are assigned to predefined classes, clustering does not have any predefined classes. The main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or no background knowledge. The cluster results are subjective and implementation dependent. The quality of a clustering method depends on the similarity measure used by the method and its implementation, on its ability to discover some or all of the hidden patterns, and on the definition and representation of a cluster chosen by the user [12].

A variety of data clustering algorithms have been developed and applied to many application domains in the field of data mining [11]. Clustering techniques have been applied to a wide variety of research problems. Hartigan provides an excellent summary of many published studies reporting the results of cluster analysis [9]. For example, in the field of medicine, clustering diseases, treatment curves for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.

A review of the most common partition algorithms in cluster analysis, presented as a comparative study, is discussed in [14]. The main objective of that study is to compare several partition methods in the context of cluster analysis, also called nonhierarchical methods. In that work, a simulation study is performed to compare the results obtained from the implementation of the algorithms k-Means, k-Medians, PAM and CLARA when continuous multivariate information is available. Additionally, a simulation study is presented to compare partition algorithms for qualitative information, comparing the efficiency of the PAM and k-modes algorithms. The efficiency of the algorithms is compared using the adjusted Rand index and the correct classification rate. Finally, the algorithms are applied to real databases with predefined classes.

An enhanced k-Means algorithm that improves efficiency using normal distribution data points is discussed by Napoleon and Ganga Lakshmi. Their paper [15] proposes a method for making the k-Means algorithm more effective and efficient, so as to obtain better clustering with reduced complexity. In that research, the representative k-Means algorithm and the enhanced k-Means were examined and analyzed based on their basic approach, and the better algorithm was determined based on performance using normal distribution data points. The accuracy of the algorithm was investigated over different executions of the program on the input data points. The elapsed time taken by the proposed enhanced k-Means is less than that of the standard k-Means algorithm. Benderskaya et al. [7] describe self-organized clustering and classification as a unified approach via distributed chaotic computing. The paper describes a unified approach to solving clustering and classification problems by means of oscillatory neural networks with chaotic dynamics. It is shown that self-synchronized clusters, once formed, can be applied to classify objects. The advantages of distributed cluster formation in comparison to estimating cluster centers are demonstrated, and a new approach to clustering on-the-fly is proposed.

A novel approach to medical image segmentation is presented by Shanmugam et al. in their paper [13]. In that research, a novel approach is used to segment 2D echo images of various views. A modified k-Means clustering algorithm, called "Fast SQL k-Means", is proposed using the power of SQL in a DBMS environment. In k-Means, Euclidean distance computation is the most time consuming process; here, however, it is computed with a single database table and no joins. The method takes less than 10 s to cluster an image of size 400 × 250 (100 K pixels), whereas the running time of direct k-Means is around 900 s. Since the entire processing is done within the database, the additional overhead of importing and exporting data is not required. The 2D echo images were acquired from a local cardiology hospital for the experiments. A Fast Fuzzy C-Means algorithm (FFCM) is proposed by Al-zoubi et al. [2], based on experimentation, for improving fuzzy clustering. The algorithm is based on decreasing the number of distance calculations by checking the membership value of each point and eliminating those points whose membership value is smaller than a threshold. They applied FFCM to several data sets, and the experiments demonstrate the efficiency of the proposed algorithm.

An application of fuzzy models to the monitoring of ecologically sensitive ecosystems in a dynamic semi-arid landscape from satellite imagery is discussed by Meng-Lung Lin and Cheng-Wu Chen [26]. The purpose of that paper is a better understanding of landscape dynamics in arid and semi-arid environments. Land degradation has recently become an important issue for land management in western China. The study shows that it is possible to derive important parameters linked to landscape sensitivity from moderate resolution imaging spectroradiometer (MODIS) data and derived imagery, such as normalized difference vegetation index (NDVI) time-series data. The results of the landscape sensitivity analysis prove the effectiveness of the method in assessing landscape sensitivity over the years 2001–2005. As a practical implication, the novel strategy used in that investigation is based on the T–S fuzzy model, which is in turn based on fuzzy theory and fuzzy operations. As for originality and value, simulation results based on fuzzy models will help to improve the monitoring techniques used to evaluate land degradation and to estimate the newest tendency in landscape green cover dynamics in the Ejin Oasis.

Tsung-Hao Chen and Cheng-Wu Chen present a study on an application of data mining to the spatial heterogeneity of foreclosed mortgages [27]. The loss given default (LGD) is a key component when calculating the credit risk associated with an asset portfolio. However, the issue of default probability has not often been addressed in past mortgage loan data mining studies, and the LGD has rarely been used to assess the comprehensive credit risk of a portfolio of mortgage loans. The location of a mortgaged property is strongly correlated with the price of that property, as well as providing social, demographic and economic information which inherently characterizes the mortgage loan population. To make an accurate assessment of the credit risk associated with the loan portfolio, one requires a specific data mining technique capable of determining the heterogeneity of the portfolio across regions. The sample used in that study consists of data on two thousand foreclosed mortgages in Kaohsiung City. The authors first test the homogeneity between the different city districts; second, they estimate the magnitude of the heterogeneity, including the spatial heterogeneity; third, a prior distribution for the heterogeneity is formulated using data mining methods; finally, the overall LGD, showing the credit risk for a given default probability, is calculated.

Feng-Hsiag Hsiao et al. give a detailed study on "Robust Stabilization of Nonlinear Multiple Time-Delay Large-scale Systems via Decentralized Fuzzy Control" [28]. To overcome the effect of modeling errors between nonlinear multiple time-delay subsystems and Takagi–Sugeno (T–S) fuzzy models with multiple time delays, a robustness design of fuzzy control is proposed in that work. In terms of Lyapunov's direct method, a delay-dependent stability criterion is derived to guarantee the asymptotic stability of nonlinear multiple time-delay large-scale systems. Based on this criterion and the decentralized control scheme, a set of model-based fuzzy controllers is then synthesized via the technique of parallel distributed compensation (PDC) to stabilize the nonlinear multiple time-delay large-scale system. Finally, a numerical example with simulations is given to demonstrate the concepts discussed throughout that work.

3. Materials and methods

Cluster analysis is sensitive both to the distance metric selected and to the criterion for determining the order of clustering [6]. Different approaches may yield different results. Consequently, the distance metric and clustering criterion should be chosen carefully. The results should also be compared to analyses based on different metrics and clustering criteria, or to an ordination, to determine the robustness of the results. Among the partitioning methods, the centroid based techniques are the k-Means method and Fuzzy C-Means (FCM), which is a special kind of k-Means method. Many researchers suggest that these algorithms perform well on certain kinds of inputs. Both algorithms work based on the given distance measure.

ure, which will determine how the similarity of two elements isalculated. This will influence the shape of the clusters, as some ele-ents may be close to one another according to one distance and

arther away according to another. For example, in a 2-dimensionalpace, the distance between the point (x = 1, y = 0) and the originx = 0, y = 0) is always 1 according to the usual norms, but the dis-ance between the point (x = 1, y = 1) and the origin can be 2,

√2

r 1 if you take respectively the 1-norm, 2-norm or infinity-normistance. The variety of common distance functions is as follows:

• The Euclidean distance, also called the "as the crow flies" or 2-norm distance.
• The Manhattan distance (also known as the taxicab norm or 1-norm).
• The maximum norm (also known as the infinity norm).
• The Mahalanobis distance, which corrects data for different scales and correlations in the variables.
• The angle between two vectors, which can be used as a distance measure when clustering high dimensional data.
• The Hamming distance, which measures the minimum number of substitutions required to change one member into another.

Another important distinction is whether the clustering uses symmetric or asymmetric distances. Many of the distance functions listed above have the property that distances are symmetric (the distance from object A to B is the same as the distance from B to A); in other applications this is not the case. A true metric gives symmetric measures of distance. The symmetric 2-norm distance measure is used in this research work. In the Euclidean space R^n, the distance between two points is usually given by the Euclidean distance (2-norm distance). The formula for the 2-norm distance is

\text{2-norm distance} = \left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}    (1)

The 2-norm distance is the Euclidean distance, a generalization of the Pythagorean theorem to more than two coordinates. It is what would be obtained if the distance between two points were measured with a ruler: the "intuitive" idea of distance. Based on this idea of finding the distance, the clustering qualities of the proposed algorithms are analyzed here. As stated in the previous sections, this research compares the two partitioning based clustering algorithms, k-Means and FCM, which are discussed below.
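For illustration, the following is a minimal Python sketch of the distance computations discussed above, reproducing the (1, 1) example; the function names are ours for illustration, and the paper's own experiments were carried out in MATLAB 7.5.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """p-norm distance between two points; p=1 is Manhattan, p=2 is Euclidean (Eq. (1))."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

def chebyshev_distance(x, y):
    """Infinity-norm (maximum) distance."""
    return float(np.max(np.abs(np.asarray(x) - np.asarray(y))))

# The example from the text: the point (1, 1) and the origin (0, 0)
print(minkowski_distance((1, 1), (0, 0), 1))   # 1-norm        -> 2.0
print(minkowski_distance((1, 1), (0, 0), 2))   # 2-norm        -> 1.414... (sqrt(2))
print(chebyshev_distance((1, 1), (0, 0)))      # infinity-norm -> 1.0
```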

3.1. The k-Means algorithm

The k-Means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori [1,10]. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point it is necessary to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After obtaining these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has thus been generated, and as a result of this loop one may notice that the k centroids change their location step by step until no more changes are made; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2,    (2)

where \left\| x_i^{(j)} - c_j \right\|^2 is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j, and is an indicator of the distance of the n data points from their respective cluster centers. The algorithm is composed of the following steps:

1. Place k points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Although it can be proved that the procedure will always terminate, the k-Means algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial, randomly selected cluster centers; it can be run multiple times to reduce this effect. k-Means is a simple algorithm that has been adapted to many problem domains, and it is a good candidate for randomly generated data points. One of the most popular heuristics for solving the k-Means problem is based on a simple iterative scheme for finding a locally minimal solution [16,18]; this algorithm is often called the k-Means algorithm. There are some difficulties in using k-Means for clustering data; this has been shown several times in the current as well as past research, and an oft-recurring problem has to do with the initialization of the algorithm. A sketch of the procedure is given below.
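As an illustration of the four steps above, the following is a minimal Python sketch of Lloyd-style k-Means. It is not the author's MATLAB implementation; the function and variable names are illustrative only.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-Means: returns the final centroids and the cluster label of each point."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: pick k initial centroids (here: k distinct random points from the data set)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (2-norm distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the barycenter of its cluster
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on a small 2-D data set
data = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1], [8.9, 1.2]])
centers, labels = kmeans(data, k=3)
```

Because of the sensitivity to initialization noted above, such a routine is normally run several times with different seeds and the solution with the lowest objective value (Eq. (2)) is kept.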

3.2. The Fuzzy C-Means algorithm

Clustering approaches based on fuzzy logic, such as FCM [4,5] and its variants [19,20], have proved to be competitive with conventional clustering algorithms, especially for real-world applications.


Table 1. Total number of connections (monthly basis).

Month   Connection numbers (From – To)   Total connections
Jan     1 – 25,094                        25,094
Feb     25,095 – 45,416                   20,322
Mar     45,417 – 66,977                   21,561
Apr     66,978 – 94,928                   27,951
May     94,929 – 122,271                  27,343
Jun     122,272 – 143,345                 21,074
Jul     143,346 – 168,791                 25,446
Aug     168,792 – 189,130                 20,339
Sep     189,131 – 209,432                 20,302
Oct     209,433 – 231,583                 22,151
Nov     231,584 – 261,107                 29,524
Dec     261,108 – 285,520                 24,413
Grand total                               285,520


The comparative advantage of these approaches is that they do not impose sharp boundaries between the clusters, thus allowing each feature vector to belong to different clusters to a certain degree (so-called soft clustering, in contrast to the hard clustering produced by conventional methods). The degree of membership of a feature vector in a cluster is usually considered a function of its distance from the cluster centroid or from other representative vectors of the cluster. The fuzzy variant of the k-Means algorithm is sometimes referred to as the Fuzzy C-Means algorithm.

Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function; the output of such an algorithm is a clustering, but not always a partition. Fuzzy clustering is an extensively applied method for obtaining fuzzy models from data. It has been applied successfully in various fields including geographical surveying, finance and marketing. The most widely used clustering algorithm implementing the fuzzy philosophy is FCM, initially developed by Dunn and later generalized by Bezdek [4], who proposed a generalization by means of a family of objective functions. Although this algorithm has proved to be less accurate than others, its fuzzy nature and its ease of implementation have made it very attractive to many researchers, who have proposed various improvements and applications. Usually FCM is applied to unsupervised clustering problems. The basic structure of the FCM algorithm is discussed below. FCM is a method of clustering which allows one piece of data to belong to two or more clusters; the method is frequently used in pattern recognition [2,18,29]. It is based on minimization of the following objective function:

J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \quad 1 \le m < \infty    (3)

where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the ith d-dimensional measured data point, c_j is the d-dimensional center of the cluster, and ||·|| is any norm expressing the similarity between measured data and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the update of the memberships u_{ij} and the cluster centers c_j given by:

u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \left\| x_i - c_j \right\| / \left\| x_i - c_k \right\| \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}    (4)

This iteration stops when

\max_{ij} \left\{ \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| \right\} < \varepsilon,    (5)

where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m. The algorithm is composed of the following steps:

Step 1: Initialize the membership matrix U = [u_{ij}], U^(0).
Step 2: At step k, calculate the center vectors C^(k) = [c_j] with U^(k).
Step 3: Update U^(k) to U^(k+1).
Step 4: If ||U^(k+1) − U^(k)|| < ε then STOP; otherwise return to Step 2.

In this algorithm, data are bound to each cluster by means of a membership function, which represents the fuzzy behavior of the algorithm. To do this, the algorithm has to build an appropriate matrix U whose elements are numbers between 0 and 1 and represent the degree of membership between data points and cluster centers. In general, introducing fuzzy logic into the k-Means clustering algorithm yields the FCM algorithm. FCM clustering techniques are based on fuzzy behavior and provide a natural technique for producing a clustering in which membership weights have a natural (but not probabilistic) interpretation. The algorithm is similar in structure to the k-Means algorithm and also behaves in a similar way. A sketch of the update loop is given below.
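The following is a minimal Python sketch of the FCM update loop defined by Eqs. (3)–(5). It is an illustrative implementation under stated assumptions (m = 2, random initial memberships), not the author's MATLAB code, and the names are ours.

```python
import numpy as np

def fuzzy_c_means(points, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal Fuzzy C-Means: returns the cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)          # shape (N, d)
    N = len(X)
    # Step 1: initialize U with rows summing to 1
    U = rng.random((N, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: centers as weighted means of the data points (right part of Eq. (4))
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: update memberships from the distances to the centers (left part of Eq. (4))
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)              # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        new_U = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop when memberships change by less than eps (Eq. (5))
        if np.max(np.abs(new_U - U)) < eps:
            U = new_U
            break
        U = new_U
    return centers, U

# Example usage: 2 fuzzy clusters on a small 2-D data set
data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
centers, U = fuzzy_c_means(data, c=2)
```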

4. Experimental setup

Data mining concepts are used in different applications according to the need, demand, nature of the problem and domain. Concepts such as association, clustering, classification and indexing are used as the process requires [17]. In this research, the clustering process is based on a distance method and is aimed at minimizing the expenditure of the business application and increasing its benefits. The algorithms are applied to real time, connection oriented telecommunication data and the results are discussed. In this process, the structure of the data communication connections is evaluated and reconstructed using clustering techniques for effective data distribution. The data distribution process is affected by the connected server, the distance, and the number of connections available at the specific server. The distance factor also has an impact on the creation of the infrastructure: the cable, the cost of the cable, manpower, maintenance, and the data distribution based on the bandwidth. Therefore the data access points are treated as data points and the network is optimized using clustering concepts. In this process, the two proposed algorithms, k-Means and Fuzzy C-Means, are adopted.

4.1. Properties of data set

The data set was collected from a broadband service provider in Chennai city. The connection oriented data set contains 285,520 data connection points with 27 server locations. The 27 servers are treated as 27 clusters in this work and are called data centers; the user points are called data access points. There are 12 data sets, one for each month. Each data set contains information about the distance, the type of connection (single user, multi user), the data transfer capacity (256, 512, 1024, 2048), the area code and the server to which the data points are connected. This representation is based on the connections established from January to December. The collected data consist of the connection establishment month, the area, the connected data center, the type of data service and the volume of data used in the year. The data access points are connected according to their geographical location; each connection is made based on the demand of the customer and is provided by the service provider.
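For concreteness, one record of such a monthly data set can be pictured as follows. This is only an illustrative sketch of the fields described above; the field names and the example values are ours, not the provider's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AccessPointRecord:
    """One connection-oriented record from a monthly data set (illustrative field names)."""
    month: int             # month in which the connection was established (1-12)
    server_id: int         # data center (server) the access point is connected to (1-27)
    distance_m: float      # distance from the access point to its data center, in meters
    connection_type: str   # "single user" or "multi user"
    capacity_kbps: int     # data transfer capacity: 256, 512, 1024 or 2048
    area_code: str         # area code of the access point

# Hypothetical example record
example = AccessPointRecord(month=1, server_id=1, distance_m=350.0,
                            connection_type="single user", capacity_kbps=512,
                            area_code="A17")
```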

Table 1 shows the total number of connected data points for each month over the 27 servers.


Table 2. Number of connections in the servers (servers 1–15), by month. The two truncated entries for server 7 are left as in the source.

Month/servers  1     2    3    4     5     6     7      8      9      10     11     12     13     14     15
1              1268  793  853  975   1158  1196  1121   1150   1146   1205   1145   1156   1143   1163   1117
2              1008  719  701  743   889   897   916    948    923    998    927    904    936    936    860
3              1106  764  686  815   927   1012  998    1045   997    905    987    964    1019   995    979
4              1356  903  909  1046  1239  1295  1263   1282   1263   1311   1298   1306   1256   1260   1282
5              1394  875  958  1091  1238  1193  1324   1234   1259   1274   1194   1264   1182   1234   1214
6              1068  747  652  791   915   964   949    942    985    945    998    954    946    948    973
7              1258  889  795  995   1104  1204  1173   1143   1130   1137   1156   1167   1183   1194   1165
8              970   711  670  762   888   918   936    992    961    881    868    921    980    941    925
9              992   676  695  758   925   911   937    912    905    945    916    883    971    923    895
10             1027  766  762  884   1032  992   1040   977    974    1003   952    998    986    1034   1033
11             1472  960  975  1163  1317  1249  1328   1352   1301   1346   1368   1294   1410   1290   1375
12             1182  804  837  1001  1030  1133  110…   1121   1089   1069   1113   1112   1191   1122   1086
Total          14,101 9,607 9,493 11,024 12,662 12,964 13,08… 13,098 12,933 13,019 12,922 12,923 13,203 13,040 12,904

The distance between the data centers and their data access points, in meters, is given in the data set. The servers (data centers) are treated as center points in the clustering process; naturally, the distance between a server and itself is zero. The number of data access points in each data center before the clustering process is given in Tables 2 and 3. The data access points are currently distributed unevenly, based on the requests of the users in each month, and connections are given to the users based on the availability of the nearest data center. This causes issues in the traffic and in the distribution of data, and it affects the quality of service both for the user and for the service provider. In the existing connections, data center 27 has 279 connections, the minimum number of connections for any data center, while the first data center has 14,101 data access points, the maximum over all data centers. Fig. 1 shows the distribution of data access points over the data centers before the clustering process.

5. Implementation of algorithms

The k-Means and FCM algorithms discussed above are implemented in MATLAB 7.5 and the results are presented in this section. In the implementation process, the data set is processed based on the distances, as stated. Initially, the first data center (selected as the center point) and the first month (January) are selected. For the selected data center and month, the distances are reconstructed and stored in the process matrix; the reconstruction of the distances is done using the Pythagorean theorem. After the reconstruction of the distances, the data access points are clustered using one of the two proposed algorithms; in this process, the distances between the data centers and the data access points are clustered by the chosen algorithm. After processing, each data center has a new number of data access points.

Table 3. Number of connections in the servers (contd.; servers 16–27), by month.

Month/servers  16    17    18    19    20    21   22   23   24   25   26   27
1              1125  1104  1089  1143  998   848  686  593  444  270  179  26
2              981   957   920   855   774   740  598  438  357  247  128  22
3              932   950   964   1004  849   769  595  503  395  251  130  20
4              1288  1310  1257  1273  1133  972  773  635  496  310  202  33
5              1255  1253  1237  1231  1131  881  786  689  469  300  163  20
6              982   959   932   956   874   738  562  498  385  244  145  22
7              1131  1182  1134  1140  1042  867  734  581  447  313  164  18
8              882   964   917   918   843   656  613  477  366  244  117  18
9              948   944   925   867   807   698  629  473  403  216  129  19
10             1007  1009  968   1021  928   798  634  521  374  280  128  23
11             1372  1357  1348  1300  1231  984  805  747  576  374  194  36
12             1097  1131  1142  1137  924   827  729  542  418  314  140  22
Total          13,000 13,120 12,833 12,845 11,534 9,778 8,144 6,697 5,130 3,363 1,819 279

Next, keeping the same first data center, the data access points of the second month (February) are chosen and clustered using the algorithm. The process is repeated up to the last month (December). After the 12th month of data has been processed with the first server, the second server is chosen and the process is repeated. The number of connections in each data center is considered the application impact, and the processing time is considered the computation impact. The algorithmic steps involved in the clustering process are summarized below.

(a) Selection of the algorithm, k-Means or FCM.
(b) Selection of the data center.
(c) Calculation of the distances between the data access points and the selected data center.
(d) Selection of the monthly data.
(e) Implementation of the selected algorithm and clustering of the distances.
(f) According to the processed cluster, reassignment of the data points to the data centers.
(g) Observation of the cluster process start time and completion time.
(h) Summary of the number of connections in each data center.
(i) Repetition of steps (b) to (h) for all the different data sets.
(j) Representation of the data connections according to the newly assigned data centers.

Using the above procedure, the computational time (in seconds) of the two algorithms over all 12 data sets is calculated and stored. The number of connections assigned to each server by the algorithm is also stored in the result matrix. This work mainly discusses the time complexity of the two algorithms; the experimental results of both algorithms are discussed in the subsequent sections. A sketch of this experimental loop is given below.
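As an outline of steps (a)–(j), the following Python sketch runs a chosen clustering routine on each monthly data set for each data center and records its execution time. The data-loading and clustering functions are hypothetical placeholders, since the original experiments were carried out in MATLAB 7.5.

```python
import time
import numpy as np

def load_monthly_distances(data_center, month):
    """Return the distances of the access points from the chosen data center (hypothetical loader)."""
    rng = np.random.default_rng(month)
    return rng.uniform(0, 5000, size=1000)          # distances in meters (dummy data)

def cluster(distances, algorithm, k=27):
    """Placeholder for the chosen clustering routine (k-Means or FCM)."""
    # A real run would call the k-Means or FCM implementation here.
    ...

def run_experiment(algorithm, n_centers=27, n_months=12):
    """Steps (a)-(j): cluster every monthly data set for every data center and time it."""
    times = []
    for center in range(1, n_centers + 1):           # step (b)
        for month in range(1, n_months + 1):         # step (d)
            distances = load_monthly_distances(center, month)   # step (c)
            start = time.perf_counter()               # step (g): start time
            cluster(distances, algorithm)             # steps (e)-(f)
            times.append(time.perf_counter() - start) # step (g): elapsed time
    return sum(times)                                 # total processing time in seconds
```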


Fig. 1. Distribution of data points before clustering.

Table 4. k-Means processing time (all runs on 2012-03-11).

Month  Process start time  Process finish time  Total process time (s)
1      15:47:29.018        15:47:45.671         16.653
2      15:48:29.363        15:48:38.942         9.579
3      15:49:27.144        15:50:04.152         37.008
4      15:50:43.083        15:51:01.773         18.690
5      15:51:41.498        15:52:02.019         20.521
6      15:52:42.292        15:52:58.140         15.848
7      15:53:37.016        15:53:53.722         16.706
8      15:54:33.336        15:54:42.165         8.829
9      15:55:21.446        15:55:29.496         8.050
10     15:56:08.745        15:56:24.782         16.037
11     15:57:03.969        15:57:35.325         31.356
12     15:58:16.260        15:58:32.078         15.818
Average process time                             17.925

Table 5. Results of the k-Means algorithm: data points assigned to servers 1–14 for each month (1–12), with the first server as center point. Shown here are the first monthly row and the average row over the 12 months.

Server                  1    2    3    4    5    6    7     8    9    10   11   12   13   14
Month 1                 556  738  902  724  483  627  1429  1177 561  1080 1205 1267 670  732
Average (months 1–12)   839  818  932  809  727  748  1074  816  685  924  945  918  944  1024

5.1. The k-Means clustering results

The k-Means clustering algorithm is implemented as discussed above. Table 4 gives the processing time of the k-Means algorithm: the total elapsed time to cluster each month's data over all 27 data centers is given in the last column, and the average processing time over the 12 months, 17.925 s, is given in the last row. Tables 5 and 6 show the results of the k-Means algorithm with the first server chosen as the center point, for all 12 monthly data sets.

Table 6. Results of the k-Means algorithm (contd.): data points assigned to servers 15–27 for each month, with the monthly totals. Shown here are the first monthly row and the average row over the 12 months.

Server                  15   16   17   18   19   20   21   22   23   24   25   26   27   Total
Month 1                 1099 1066 973  973  707  838  1255 808  1415 577  1409 1188 635  25,094
Average (months 1–12)   830  875  883  987  912  898  844  739  827  1076 834  913  975  285,520

Table 7. Average results of k-Means over 12 runs (runs 1–12, servers S1–S15): average number of data points assigned to each server in each run.

Table 8. Average results of k-Means (contd.; runs 1–12, servers S16–S27).


Table 9. k-Means processing time per run (s).

Run  Time
1    17.925
2    17.142
3    16.822
4    19.664
5    19.796
6    16.129
7    15.919
8    17.224
9    18.700
10   18.984
11   19.474
12   16.243

After the clustering process, the first server contains 556 data points; before clustering, the first server had 1268 data points. The last row of these tables gives the average number of data points over the 12 months. The data points in each data center are clustered (distributed) by the k-Means algorithm based on the neighborhood distance, and the total of 285,520 data points is distributed. The sum of the per-server averages is 23,793; multiplying this value by 12 gives back, to within rounding, the total number of data access points, 285,520. In the same way as for the first server, each remaining server is chosen as the center point and the clustering process is repeated. The k-Means algorithm is executed 12 times and the average results are listed in Tables 7 and 8; these tables contain only the average data points. The average number of data points in the first server is found to be 839, in the second server 818, and so on.

The average results of the k-Means algorithm given in Tables 7 and 8 are represented graphically in Fig. 2. In this clustering process, data center 23 is assigned a minimum of 9853 data access points and data center 2 a maximum of 11,185. It is clear from Fig. 2 that the distributions of data points over the data centers are close to one another; in the distribution process, the data access points are divided almost equally among all the data centers. The average processing time of the 12 runs of the k-Means clustering is summarized in Table 9. The total of 285,520 data points, with different data center distances, is processed in each run.

Fig. 2. Distribution of k-Means clustering.

Fig. 3. k-Means results.

The algorithm takes a minimum of 15.919 s in the 7th run and a maximum of 19.796 s in the 5th run. The graphical representation of the average processing times of the 12 runs listed in Table 9 is given in Fig. 3.

5.2. The Fuzzy C-Means clustering results

The implementation of the Fuzzy C-Means (FCM) method is discussed for the same data set. As described in the previous sections, the FCM method is executed and the results are given in Table 10. Initially, the new distance matrix is formed. Based on the new distances between the data access points and the data centers, the clustering process is carried out by the FCM algorithm. Choosing the first data center as the center point, all 12 monthly data sets are clustered and the processing times are given in Table 10; the average processing time over the 12 monthly data sets is found to be 101.007 s.

Tables 11 and 12 give the results of the FCM algorithm with the first data center as the center point. This is the result of a single execution over all 12 monthly data sets. In these tables, the FCM algorithm assigns 518 data access points to the first server, 960 points to the second server, and so on; the averages over the 12 monthly data sets are 461 for the first server, 911 for the second server, and so on. In the same way, the algorithm is executed 12 times (by trial and error) and the average data access points are tabulated in Tables 13 and 14.

In the FCM clustering process, the minimum and maximum numbers of data points assigned by the algorithm are 4348 and 15,763 respectively; the minimum is assigned to the first data center and the maximum to the 5th data center. Fig. 4 represents the distribution of data access points produced by the FCM clustering method. The number of connections is high for some clusters and very low for others. The processing times of the 12 runs of the algorithm are listed in Table 15. The maximum time, 101.007 s, is taken by the first run and the minimum time, 69.920 s, by the 10th run. The processing time of FCM is comparatively high with respect to the k-Means method. Fig. 5 shows the graphical representation of the results of the FCM algorithm.

5.3. Results and discussion

The total of 285,520 data access points is clustered into 27 data centers by the two proposed algorithms. The performance of the clustering process is analyzed based on the distance between the data access points and the data centers. The k-Means algorithm assigns a minimum of 9853 and a maximum of 11,185 data points, while the minimum and maximum numbers of data points assigned by the FCM method are 4348 and 15,763 respectively. Fig. 6 shows the minimum and maximum data access points for the two proposed algorithms; both the overall minimum and the overall maximum are assigned by the FCM method. The difference between the minimum and the maximum is very small for the k-Means method, but for the FCM method it is very large for some executions.

Table 16 gives the number of data points distributed by the two algorithms over all 27 data centers. Fig. 7 is the graphical representation of the distribution of data points for the same two algorithms after the clustering process. From this figure, it is easy to


Table 10. FCM processing time (all runs on 2012-02-22).

Data set  Process start time  Process finish time  Total process time (s)
1         19:55:29.0          19:58:01.39          152.400
2         20:01:15.0          20:03:19.3           124.293
3         20:05:53.6          20:08:21.2           147.520
4         20:11:29.4          20:13:38.3           128.941
5         20:16:28.3          20:18:35.6           127.345
6         20:21:33.8          20:23:13.1           99.297
7         20:24:48.5          20:25:58.4           69.967
8         20:27:33.7          20:28:44.3           70.613
9         20:30:20.2          20:31:29.6           69.359
10        20:33:06.1          20:34:15.2           69.125
11        20:35:50.8          20:37:12.6           81.800
12        20:38:51.3          20:40:02.76          71.424
Average processing time                             101.007

Table 11
Results of FCM algorithm.

Month\server    1     2     3     4     5     6     7     8     9    10    11    12    13    14
1             518   960   904  1127   951   980  1080   925   913   870   793   920  1022   887
2             423   832   770   915   851   872   831   683   664   721   782   855   751   595
3             433   860   664   781   764   794   980   864   687   609   637   595   676   764
4             546  1075  1007  1188  1071  1061  1130  1035  1008   952   877   999  1124   985
5             565  1067  1032  1319  1233  1263  1271   939   785   859   903  1034   991   771
6             435   859   777   896   842   872   844   776   705   789   790   884   783   668
7             495   978   790   943   911   992  1223  1004   825   738   801   935  1033   915
8             264   576   554   689   883   883   909   866   707   654   697   723   836   800
9             401   802   735   919   727   731   869   747   661   573   597   687   791   639
10            413   847   734   899   820   843  1061   894   786   694   650   674   781   672
11            539  1095   877  1044  1121  1120  1269  1146   929   813   871   873  1095  1181
12            499   981   925  1189  1138  1185  1088  1028   938   825   929   890   720   729

Average       461   911   814   992   943   966  1046   909   801   758   777   839   884   801

Table 12
Results of FCM algorithm (contd.).

Month\server   15    16    17    18    19    20    21    22    23    24    25    26    27    Total
1             861   865   855   692   768  1004   943   951  1128  1237  1010   909  1021   25,094
2             602   624   640   518   669   916   769   711   852   973   912   728   863   20,322
3             722   707   720   757   657   719   933   866   965  1389  1156   806  1056   21,561
4             910   899   933   771   807  1049  1054   884  1217  1713  1423  1026  1207   27,951
5             783   792   828   729   749   998  1079   919  1150  1614  1322  1017  1331   27,343
6             613   674   664   525   701   972   794   862  1137   993   715   723   781   21,074
7             804   855   939   796   797  1013  1054   887  1156  1369  1181   930  1082   25,446
8             766   749   731   593   642   797   746   630   873  1089   989   771   922   20,339
9             568   597   645   668   547   701   941   774   909  1236  1065   809   963   20,302
10            628   641   701   721   601   759  1019   861   986  1414  1154   851  1047   22,151
11           1068  1078  1104   980  1011  1194  1206  1033  1323  1662  1437  1130  1325   29,524
12            680   686   667   514   756   981   896   827  1064  1231  1133   922   992   24,413

Average       750   764   786   689   725   925   953   850  1063  1327  1125   885  1049  285,520

Table 13
Average results of FCM algorithm.

Run    S1    S2    S3    S4    S5    S6    S7    S8    S9   S10   S11   S12   S13   S14   S15
1     461   911   814   992   943   966  1046   909   801   758   777   839   884   801   750
2     313   842  1244  1266  1238  1118   946   884   764   649   638   672   660   637   667
3     336   740  1105  1591  1431  1065   885   860   854   806   708   618   606   680   757
4     406   948  1286  1612  1473  1177  1049   755   565   638   772   721   604   551   644
5     435   882  1016  1442  1461  1313  1172  1022   867   715   681   661   628   574   577
6     382   744  1042  1537  1570  1338  1049   860   855   882   864   797   744   674   589
7     422   804   862  1143  1338  1314  1154  1001   969  1104  1098   913   866   702   520
8     470   870  1116  1485  1505  1135   981  1040  1081  1022   916   842   766   674   675
9     301   611   697  1093  1438  1464  1395  1243  1025   933   830   797   814   812   774
10    277   637   810  1074  1207  1304  1338  1220   918   748   806   770   819   859   829
11    286   640   763   923  1093  1175  1151  1009   901   919   920   880   833   786   757
12    259   585   644   889  1066  1160  1216  1191  1076   900   783   740   659   641   690


Table 14
Average results of FCM algorithm (contd.).

Run   S16   S17   S18   S19   S20   S21   S22   S23   S24   S25   S26   S27
1     764   786   689   725   925   953   850  1063  1327  1125   885  1049
2     783   901   844   665   819  1123  1120  1043  1070   971   960   957
3     846   814   754   693   776   930   954   927  1050  1014   967  1028
4     773   838   844   819   779   787   949  1060   977   877   828  1060
5     659   810   881   871   852   860   894   935   923   995   787   882
6     512   508   654   830   852   801   935  1125  1233   947   708   760
7     526   667   862   928   907   622   731  1073   993   788   668   820
8     693   712   720   724   740   792   888   916   870   745   678   739
9     726   812   921  1003  1055   794   641   744   790   672   682   728
10    848   821   923   953   985  1078  1034   990   790   577   550   630
11    882   982   918   898   853   860  1120  1273  1105   723   550   595
12    803   871   963   958   908   925  1073  1239  1292  1041   714   509

Fig. 4. Distribution of data access points by FCM clustering.

In the distribution of data access points by the k-Means and FCM algorithms, the data points are evenly assigned only by the k-Means algorithm; in the FCM algorithm, there is a considerable number of deviations in the assignment of data points. There is not much difference between the sizes of the clusters in the k-Means result. As a result of executing these algorithms many times with the MATLAB software, the results are analyzed. According to the efficiency of these algorithms, the performance of the k-Means method is better than that of the FCM algorithm.

Fig. 5. FCM results: process time (in seconds) for each of the 12 runs of the FCM clustering and their average.


Fig. 6. Minimum and maximum data access points (k-Means: 9853 and 11,185; FCM: 4348 and 15,763).


Fig. 7. Distribution of data points after clustering (data access points per data center for k-Means and FCM).

Table 15
FCM processing time.

Run    Time (s)
1      101.007
2      71.461
3      72.434
4      76.207
5      74.705
6      75.057
7      72.038
8      70.336
9      82.673
10     69.920
11     75.170
12     79.791

Table 16
Total number of data points after clustering.

Data center   k-Means    FCM
1             10,543     4348
2             11,185     9213
3             10,816     11,399
4             10,702     15,048
5             10,249     15,763
6             10,773     14,528
7             10,935     13,382
8             10,372     11,993
9             10,586     10,677
10            10,311     10,074
11            10,506     9794
12            10,754     9250
13            10,519     8883
14            10,833     8391
15            10,276     8228
16            10,840     8814
17            10,567     9521
18            10,559     9972
19            10,663     10,066
20            10,744     10,451
21            10,089     10,525
22            10,391     11,189
23            9853       12,388
24            10,763     12,420
25            10,487     10,474
26            10,472     8976
27            10,731     9756

Total         285,520    285,520


Table 17
Average execution time (in seconds).

Run    k-Means    FCM
1      17.92      101.01
2      17.14      71.46
3      16.82      72.43
4      19.66      76.21
5      19.80      74.71
6      16.13      75.06
7      15.92      72.04
8      17.22      70.34
9      18.70      82.67
10     18.98      69.92
11     19.47      75.17
12     16.24      79.79

Average time (s)   17.84   76.73


Fig. 8. Results comparison.

The average execution time for both the algorithms is given in Table 17, and the graphical representation of the average time is given in Fig. 8. From this figure, it is evident that the FCM algorithm takes more time than the k-Means algorithm. Thus, for the telecommunication data set, the performance of the k-Means algorithm is better than that of FCM.
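The averaging behind Table 17 could be reproduced along the following lines; this is only a hedged sketch (synthetic data, scikit-learn's KMeans, and the fuzzy_c_means function defined in the earlier sketch), not the MATLAB procedure used in the study.

import time
import numpy as np
from sklearn.cluster import KMeans

points = np.random.default_rng(3).random((25_000, 2))   # hypothetical data

def average_time(fit_once, runs=12):
    """Average wall-clock time of fit_once(seed) over the given number of runs."""
    times = []
    for seed in range(runs):
        start = time.perf_counter()
        fit_once(seed)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

kmeans_avg = average_time(lambda s: KMeans(n_clusters=27, n_init=1,
                                           random_state=s).fit(points))
# fuzzy_c_means is the illustrative function from the earlier sketch.
fcm_avg = average_time(lambda s: fuzzy_c_means(points, n_clusters=27, seed=s))
print(f"k-Means average: {kmeans_avg:.2f} s,  FCM average: {fcm_avg:.2f} s")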

6. Conclusions

Cluster analysis plays an indispensable role in understanding various phenomena. Cluster analysis, a primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. It is necessary here to summarize some important issues and research trends for clustering algorithms. There is no clustering algorithm that can be universally used to solve all problems. Usually, algorithms are designed with certain assumptions and favor some types of biases. In this sense, it is not accurate to speak of a "best" clustering algorithm, although some comparisons are possible. These comparisons are mostly based on specific applications, under certain conditions, and the results may become quite different if the conditions change. In summary, clustering is an interesting, useful, and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval. However, it is possible to exploit this potential only after making several design choices carefully. The k-Means and Fuzzy C-Means algorithms are very well known partition based clustering algorithms and have been used in many application areas. It is well known that each of the two algorithms has pros and cons. Because of its simplicity, k-Means runs faster but is vulnerable to noise; Fuzzy C-Means is a little more complex and hence runs slower, but is more robust to noise. In order to examine these qualities of the two algorithms, one application area is chosen in this work to test this notion.

From the experimental approach, by several executions of the programs for the two algorithms in this research work, the following results were obtained. Usually, the time complexity varies from one processor to another, depending on the speed and type of the system. The partitioning based algorithms work well for finding spherical-shaped clusters in small to medium-sized data sets. The advantage of the k-Means algorithm is its favorable execution time. Its drawback is that the user has to know in advance how many clusters are searched for. From the experimental analysis, the computational time of the k-Means algorithm is less than that of the FCM algorithm; the k-Means algorithm stamps its superiority in terms of its lower execution time. Also, the distribution of data points by the k-Means algorithm is even across all the data centers, whereas the FCM algorithm shows some variations in the distribution.




References

[1] A. Rakhlin, A. Caponnetto, Stability of k-Means clustering, in: Advances in Neural Information Processing Systems, vol. 12, MIT Press, Cambridge, MA, 2007, pp. 216–222, http://www.cbcl.mit.edu/projects/cbcl/publications/ps/rakhlin-stability-clustering.pdf
[2] M.B. Al-Zoubi, A. Hudaib, B. Al-Shboul, A fast fuzzy clustering algorithm, in: Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, 2007, pp. 28–32.
[3] P. Berkhin, Survey of Clustering Data Mining Techniques, Technical Report, Accrue Software, Inc., 2002, www.ee.ucr.edu/ barth/EE242/clustering survey.pdf
[4] J.C. Bezdek, R. Ehrlich, W. Full, FCM: the Fuzzy C-Means clustering algorithm, Computers and Geosciences 10 (1984) 191–203.
[5] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981, ISBN 0306406713.
[6] P.S. Bradley, U.M. Fayyad, C.A. Reina, Scaling clustering algorithms to large databases, in: Proceedings of the 4th International Conference on Knowledge Discovery & Data Mining (KDD98), AAAI Press, Menlo Park, CA, 1998, pp. 9–15.
[7] E.N. Benderskaya, S.V. Zhukova, Self-organized clustering and classification: a unified approach via distributed chaotic computing, International Symposium on Distributed Computing and Artificial Intelligence, Advances in Intelligent and Soft Computing 91 (2011) 423–431, http://dx.doi.org/10.1007/978-3-642-19934-9 54.
[8] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, New Delhi, 2006, ISBN 978-81-312-0535-8.
[9] J.A. Hartigan, Clustering Algorithms, Wiley Publishers, January 1975, ISBN 047135645X.
[10] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall Inc., Englewood Cliffs, NJ, 1988, ISBN 0-13-022278-X.
[11] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (September (3)) (1999), DOI:10.1.1.18.2720&rep=rep1&type=pdf.


[12] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344, John Wiley & Sons, 2009.
[13] N. Shanmugam, A.B. Suryanarayana, S.D. Chandrashekar, C.N. Manjunath, A novel approach to medical image segmentation, Journal of Computer Science 7 (5) (2011) 657–663, ISSN: 1549-3636.
[14] S.A. Leiva-Valdebenito, F.J. Torres-Aviles, A review of the most common partition algorithms in cluster analysis: a comparative study, Colombian Journal of Statistics 33 (December (2)) (2010) 321–339, ISSN: 0120-1751.
[15] D. Napoleon, P. Ganga Lakshmi, An enhanced k-means algorithm to improve the efficiency using normal distribution data points, International Journal on Computer Science and Engineering 2 (7) (2010) 2409–2413, ISSN: 0975-3397.
[16] H.S. Park, J.S. Lee, C.H. Jun, A k-Means-Like Algorithm for k-Medoids Clustering and Its Performance, Department of Industrial and Management Engineering, POSTECH, South Korea, 2009.
[17] M. Vijayakumar, R.M.S. Parvathi, Concept mining of high volume data streams in network traffic using hierarchical clustering, European Journal of Scientific Research 39 (2) (2010) 234–242, ISSN: 1450-216X.
[18] T. Velmurugan, T. Santhanam, Computational complexity between K-means and K-medoids clustering algorithms for normal and uniform distributions of data points, Journal of Computer Science 6 (3) (2010) 363–368, ISSN: 1549-3636, www.thescipub.org/fulltext/jcs/jcs63363-368.pdf
[19] T. Velmurugan, T. Santhanam, Performance evaluation of k-means and fuzzy C-means clustering algorithms for statistical distributions of input data points, European Journal of Scientific Research 46 (3) (2010) 320–330, ISSN: 1450-216X, http://www.eurojournals.com/ejsr.htm
[20] Y. Yong, Z. Chongxun, L. Pan, A novel fuzzy c-means clustering algorithm for image thresholding, Measurement Science Review 4 (91) (2004).
[21] L.-Z. Lin, T.-H. Hsu, Designing a model of FANP in brand image decision making, Applied Soft Computing 11 (1) (2011) 561–573.
[22] C.-W. Shen, M.J. Cheng, C.-W. Chen, F.-M. Tsai, Y.-C. Cheng, A fuzzy AHP-based fault diagnosis for semiconductor lithography process, International Journal of Innovative Computing, Information and Control 7 (2) (2011) 805–816.
[23] H.-M. Kuo, C.-W. Chen, Application of quality function deployment to improve the quality of Internet shopping website interface design, International Journal of Innovative Computing, Information and Control 7 (1) (2011) 253–268.
[24] C.W. Chen, Fuzzy control of interconnected structural systems using the fuzzy Lyapunov method, Journal of Vibration and Control 17 (11) (2011) 1693–1702.
[25] C.-W. Chen, Stability conditions of fuzzy systems and its application to structural and mechanical systems, Advances in Engineering Software 37 (9) (2006) 624–629.
[26] M.-L. Lin, C.-W. Chen, Application of fuzzy models for the monitoring of ecologically sensitive ecosystems in a dynamic semi-arid landscape from satellite imagery, Engineering Computations 27 (1) (2010) 5–19.
[27] T.H. Chen, Application of data mining to the spatial heterogeneity of foreclosed mortgages, Expert Systems with Applications 37 (2) (2010) 993–997.
[28] F.-H. Hsiao, J.-D. Hwang, C.-W. Chen, Z.-R. Tsai, Robust stabilization of nonlinear multiple time-delay large-scale systems via decentralized fuzzy control, IEEE Transactions on Fuzzy Systems 13 (2005) 152–163.
[29] T. Velmurugan, T. Santhanam, A comparative analysis between k-medoids and fuzzy c-means clustering algorithms for statistically distributed data points, Journal of Theoretical and Applied Information Technology 27 (1) (2011) 19–30.