
MapReduce based Rough K-Means Clustering

Varad Meru1 and Pawan Lingras2

1 Orzota India Development Center, Chennai, Tamil Nadu, India [email protected]

2 Department of Mathematics & Computing Science, Saint Mary’s University, Halifax, Nova Scotia, B3H 3C3, Canada

[email protected]

Abstract. Clustering has been one of the most widely used data mining methodologies. With the advent of sensor technologies, the Internet of Things, and rising social data, the era of big data is upon us, and scaling the current set of clustering algorithms is an actively researched problem. In this paper, we propose a parallel rough K-means algorithm based on MapReduce, building on the rough K-means algorithm proposed by Lingras et al., and present preliminary experiments with synthetic data.

Keywords. Rough sets, Clustering, Rough K-means, MapReduce, distributed computing, Hadoop.

1 Introduction

In recent years, the tremendous rise in data capturing technologies has given rise to the need for new approaches to the storage and processing of humongous data, in both academic and industrial settings. Google devised its own systems for its large-scale data storage [1] and processing [2] needs. Their open source counterpart, Hadoop [3], is a spin-off of the open source search engine project Nutch [4], and has gained widespread acceptance as the de facto standard for large-scale data analysis. To aid in the formal analysis of MapReduce algorithms, Karloff et al. [10] introduced a model of computation for MapReduce, which has been used to reason about many MapReduce based algorithms.

Traditionally, clustering algorithms have relied on the processing power of a single processing unit and assumed the availability of ample memory for processing. However, scaling the current algorithmic approaches in this form is not feasible. New approaches for implementing the algorithms have been devised to scale for use with data sources such as sensor networks, the Internet of Things, and web-logs. Many of the new approaches are based on the MapReduce paradigm and have been successfully used in industrial settings [7]. Some of these approaches have been investigated and presented in [5, 6, 11, 12]. Rough K-means, proposed by Lingras [8], describes the uncertainty of objects by assigning the data objects in the boundary region to more than one cluster based on a threshold factor, decided by the subject matter expert performing the clustering. It is an adaptation of the rough set theory introduced by Pawlak [9, 14].

This paper introduces a method to implement a parallel version of rough K-means using the MapReduce paradigm. The organization of the rest of the paper is as follows. Rough K-means is introduced in Section 2. We give an introduction to the MapReduce paradigm in Section 3. Section 4 contains our methodology for implementing a MapReduce version of rough K-means. The implementation and the experimental results are presented in Section 5. The conclusion as well as future research directions appear in the last section.

2 Rough set variant of K-means clustering

Due to space limitations, familiarity with rough sets is assumed [13]. Let $U$ be the set of data objects. Rough sets were originally proposed using equivalence relations on $U$. However, it is possible to define a pair of lower and upper bounds $(\underline{A}(C), \overline{A}(C))$, or a rough set, for every set $C \subseteq U$ as long as the properties specified by Pawlak [9, 13, 14] are satisfied. Yao et al. [15] described various generalizations of rough sets by relaxing the assumption of an underlying equivalence relation. Such a trend towards generalization is also evident in rough mereology proposed by Polkowski and Skowron [16] and the use of information granules in a distributed environment by Skowron and Stepaniuk [17]. The present study uses such a generalized view of rough sets. If one adopts a more restrictive view of rough set theory, the rough sets developed in this paper may be looked upon as interval sets.

Let us consider a hypothetical clustering scheme

$U/P = \{C_1, C_2, \ldots, C_k\}$     (1)

that partitions the set $U$ based on an equivalence relation $P$. Let us assume that due to insufficient knowledge it is not possible to precisely describe the sets $C_i$, $1 \le i \le k$, in the partition. However, it is possible to define each set $C_i \in U/P$ using its lower bound $\underline{A}(C_i)$ and upper bound $\overline{A}(C_i)$ based on the available information. We will use vector representations $\mathbf{u}$, $\mathbf{v}$ for objects and $\mathbf{x}_i$ for the centroid of cluster $C_i$.

We are considering the upper and lower bounds of only a few subsets of $U$. Therefore, it is not possible to verify all the properties of rough sets [9, 14]. However, the family of upper and lower bounds of the sets $C_i \in U/P$ is required to follow some of the basic rough set properties, such as:

(P1) An object $\mathbf{v}$ can be part of at most one lower bound.

(P2) $\mathbf{v} \in \underline{A}(C_i) \Rightarrow \mathbf{v} \in \overline{A}(C_i)$.

(P3) An object $\mathbf{v}$ is not part of any lower bound $\Longleftrightarrow$ $\mathbf{v}$ belongs to two or more upper bounds.

Page 3: Rough K-Means based on Mapreduce

Property (P1) emphasizes the fact that a lower bound is included in a set. If two sets are mutually exclusive, their lower bounds should not overlap. Property (P2) confirms the fact that the lower bound is contained in the upper bound. Property (P3) is applicable to the objects in the boundary regions, which are defined as the differences between upper and lower bounds. The exact membership of objects in the boundary region is ambiguous. Therefore, property (P3) states that an object cannot belong to only one upper bound. Note that (P1)-(P3) are not necessarily independent or complete. However, enumerating them will be helpful in understanding the rough set adaptation of evolutionary, neural, and statistical clustering methods.

Incorporating rough sets into K-means clustering requires the addition of the concepts of upper and lower bounds into the calculation of the centroids. The modified centroid calculation for rough sets is then given by:

If $\underline{A}(\mathbf{x}) \ne \emptyset$ and $\overline{A}(\mathbf{x}) - \underline{A}(\mathbf{x}) = \emptyset$

$x_j = \frac{\sum_{\mathbf{v} \in \underline{A}(\mathbf{x})} v_j}{|\underline{A}(\mathbf{x})|}$     (2)

else, if $\underline{A}(\mathbf{x}) = \emptyset$ and $\overline{A}(\mathbf{x}) - \underline{A}(\mathbf{x}) \ne \emptyset$

$x_j = \frac{\sum_{\mathbf{v} \in (\overline{A}(\mathbf{x}) - \underline{A}(\mathbf{x}))} v_j}{|\overline{A}(\mathbf{x}) - \underline{A}(\mathbf{x})|}$     (3)

else

$x_j = w_{lower} \times \frac{\sum_{\mathbf{v} \in \underline{A}(\mathbf{x})} v_j}{|\underline{A}(\mathbf{x})|} + w_{upper} \times \frac{\sum_{\mathbf{v} \in (\overline{A}(\mathbf{x}) - \underline{A}(\mathbf{x}))} v_j}{|\overline{A}(\mathbf{x}) - \underline{A}(\mathbf{x})|}$     (4)

where $1 \le j \le m$ and $m$ is the number of dimensions. The parameters $w_{lower}$ and $w_{upper}$ correspond to the relative importance of the lower and upper bounds, and $w_{lower} + w_{upper} = 1$. The criterion for determining whether a data object vector belongs to the lower bound or the upper bound of a cluster is as follows. For each object vector $\mathbf{v}$, let $d(\mathbf{v}, \mathbf{x}_j)$ be the distance between $\mathbf{v}$ and the centroid $\mathbf{x}_j$ of cluster $C_j$. Let $d(\mathbf{v}, \mathbf{x}_i) = \min_{1 \le j \le k} d(\mathbf{v}, \mathbf{x}_j)$. The ratios $d(\mathbf{v}, \mathbf{x}_j)/d(\mathbf{v}, \mathbf{x}_i)$, $1 \le j \le k$, are used to determine the membership of $\mathbf{v}$. Let $T = \{ j : d(\mathbf{v}, \mathbf{x}_j)/d(\mathbf{v}, \mathbf{x}_i) \le \mathit{threshold} \ \mathrm{and}\ i \ne j \}$.

1. If $T \ne \emptyset$, then $\mathbf{v} \in \overline{A}(\mathbf{x}_i)$ and $\mathbf{v} \in \overline{A}(\mathbf{x}_j)$, $\forall j \in T$. Furthermore, $\mathbf{v}$ is not part of any lower bound. The above criterion guarantees that property (P3) is satisfied.

2. Otherwise, if $T = \emptyset$, $\mathbf{v} \in \underline{A}(\mathbf{x}_i)$. In addition, by property (P2), $\mathbf{v} \in \overline{A}(\mathbf{x}_i)$.

The upper and lower bounds are constructed based on the criteria described above.
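To make the assignment rule concrete, the following is a minimal sketch in Java of the criterion described above. The class and method names (RoughAssignment, assign) and the use of Euclidean distance are our own illustrative assumptions; the paper does not prescribe them.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the rough K-means assignment criterion (properties P1-P3).
    // Class/method names and the Euclidean distance are illustrative assumptions.
    public class RoughAssignment {

        // Euclidean distance between an object vector and a centroid.
        static double distance(double[] v, double[] centroid) {
            double sum = 0.0;
            for (int d = 0; d < v.length; d++) {
                double diff = v[d] - centroid[d];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        // Returns the indices of the clusters whose upper bound contains v.
        // If the returned list has exactly one element (T is empty), v also belongs
        // to the lower bound of that cluster; otherwise (T is non-empty) v belongs
        // only to the upper bounds of the returned clusters.
        static List<Integer> assign(double[] v, double[][] centroids, double threshold) {
            int nearest = 0;                      // index i of the nearest centroid x_i
            double dMin = Double.MAX_VALUE;
            for (int j = 0; j < centroids.length; j++) {
                double d = distance(v, centroids[j]);
                if (d < dMin) {
                    dMin = d;
                    nearest = j;
                }
            }
            List<Integer> members = new ArrayList<Integer>();
            members.add(nearest);
            // T = { j : d(v, x_j) / d(v, x_i) <= threshold and j != i }
            for (int j = 0; j < centroids.length; j++) {
                if (j != nearest && dMin > 0 && distance(v, centroids[j]) / dMin <= threshold) {
                    members.add(j);
                }
            }
            return members;
        }
    }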

3 MapReduce Paradigm

MapReduce [2] has become a popular paradigm for data-intensive parallel processing on shared-nothing clusters. It is inspired by functional programming and


has spawned a number of proprietary and open-source implementations [3, 18, 19, 20, 21] based on Google’s system.

The data is split and stored across the machines of the cluster as shards and is usually represented as (key, value) tuples. Data is often replicated to distribute the tasks efficiently, parallelize them, and offer tolerance against machine failures. Each computation task is assigned a block of data to process. If a machine fails, the replication factor of the cluster is maintained by copying the blocks of the failed machine to other nodes. In Hadoop, HDFS (the Hadoop Distributed File System [22]), which implements the design of the Google File System, manages the storage and maintains the replication factor and the shards. It is designed to scale to mass storage across multiple machines, and provides read/write access, backup, and fault tolerance. The tasks contain the computation code embedded in two functions, namely the map and reduce functions.

Fig. 1. Illustration of MapReduce processing

3.1 The map function

The map function works on individual shards of data, and the output tuples are grouped (partitioned and sorted) by the intermediate key and then sent to the reducer machines in the shuffle phase. The function $f$ takes a tuple as input and generates output tuples based on the map logic.

$f(k_1, v_1) \rightarrow \mathit{list}(k_2, v_2)$     (5)

As illustrated in Fig. 1, the input shards are processed by multiple map tasks, which pass the produced output to the shuffle and sort unit.

3.2 The reduce function

The reduce function first merges the output tuple values based on the intermediate key. Once the merge is done and the list is generated, the logic written in the reduce function $g$ is applied to the list.

$g(k_2, \mathit{list}(v_2)) \rightarrow \mathit{list}(k_3, v_3)$     (6)


The $\mathit{list}(v_2)$ is generated by the shuffle and sort phase, which combines all the values associated with a key $k_2$. As illustrated in Fig. 1, the output of the shuffle phase is the input of the reduce functions, and the final output is written back to the DFS.
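As an illustration of the signatures in (5) and (6), a minimal Hadoop (Java) skeleton of a map and reduce pair is given below. The class names and the pass-through logic are placeholders of our own and do not correspond to any particular algorithm.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Skeleton mapper: f(k1, v1) -> list(k2, v2), as in equation (5).
    class SkeletonMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit zero or more intermediate (k2, v2) tuples for each input tuple.
            context.write(new Text("intermediate-key"), value);
        }
    }

    // Skeleton reducer: g(k2, list(v2)) -> list(k3, v3), as in equation (6).
    class SkeletonReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // All values sharing the same intermediate key arrive as one iterable list.
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }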

4 Rough K-means based on MapReduce paradigm

This section proposes a parallel approach to rough K-means clustering based on the MapReduce paradigm. The objective is to scale traditional rough K-means clustering to very large datasets. We demonstrate one approach to performing rough K-means using MapReduce; it is not the only way to adapt rough K-means to MapReduce, and it can be implemented in other ways that improve performance and lower network bandwidth usage, factors that are important in industrial applications.

Fig. 2. MapReduce based clustering. The mappers are denoted with the Mn notation, and the reducers are denoted with the Rk notation.

4.1 Initialization

The initialization phase prepares the initial centroid file by randomly selecting data points from the dataset. Various ways to find or calculate the initial set of centroids can be used to obtain different cluster shapes and better cluster quality. We have used a random selection of data points as the initial centroids.

The centroids are shared with the help of the distributed cache [23], a feature provided by Hadoop on top of HDFS. The centroid file contains the dimensions of the centroids in a tab-separated format and is read by each individual mapper. Parameters such as the threshold, $w_{lower}$, and $w_{upper}$ are passed to the mappers and reducers through the configuration [24] object submitted with the job to the cluster.
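A minimal sketch of how this shared state could be read in a mapper's setup method is shown below, using the DistributedCache and Configuration APIs referenced above. The property names (such as rkm.threshold), the assumed centroid file layout, and the parsing code are our assumptions and not the paper's actual implementation.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of reading the shared centroids and parameters in a mapper's setup()
    // (Hadoop 1.x API, as used by HDP 1.1). Property names and file layout are assumed.
    class SetupSketchMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final List<double[]> centroids = new ArrayList<double[]>();
        private double threshold;
        private double wLower;
        private double wUpper;

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();

            // Hypothetical property names; the actual keys used by the paper are not given.
            threshold = conf.getFloat("rkm.threshold", 0.7f);
            wLower = conf.getFloat("rkm.w.lower", 0.7f);
            wUpper = conf.getFloat("rkm.w.upper", 0.3f);

            // The centroid file is assumed to be the first file in the distributed cache,
            // one centroid per line, with dimensions separated by tabs.
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");
                    double[] centroid = new double[parts.length];
                    for (int d = 0; d < parts.length; d++) {
                        centroid[d] = Double.parseDouble(parts[d]);
                    }
                    centroids.add(centroid);
                }
            } finally {
                reader.close();
            }
            // The fields above are then used by the map() function.
        }
    }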


4.2 map Phase

The map phase of the approach distributes the task of finding the nearest centroid and the other centroids that stay within the threshold ratio. Mapper Algorithm:

    Convert the text record to a numeric array
    Calculate the distances from all the centroids
    Find the nearest centroid; add it to the centroids list
    Find the centroids within the threshold ratio; add them to the centroids list
    If ( centroids.size() == 1 )
        spill the tuple - <centroids.get(0), L |delim| data-object>
    else
        for ( Integer centroid-num : centroids )
            spill the tuple - <centroid-num, U |delim| data-object>

Fig. 3. Abstract view of the mapper side algorithm.

An abstract view of the mapper-side algorithm is presented in Fig. 3. Mapper tasks process the input data-object and spill tuples in the format <cluster-number, L/U |delim| data-object>. The letter L denotes a lower bound object, and U denotes an upper bound object. The notation |delim| denotes the custom delimiter used to identify the two parts of the intermediate value being written. For an object that is in the upper bound of multiple clusters, the tuple is written multiple times to the intermediate output. Let $f$ be the mapper function; then

f(offset, data-point dimensions) → (cluster-num, L or U |delim| data-point)

where the notation |delim| denotes a delimiter. The shuffler then processes this spilled output and merges all the values with the same key (cluster-num) into an iterable list.
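A sketch in Java of the mapper logic of Fig. 3 is given below, assuming the centroids and the threshold have already been loaded in setup() as sketched in Section 4.1. The key/value types, the tab-separated record format, and the delimiter character are our assumptions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of the rough K-means map phase (Fig. 3).
    class RoughKMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

        private static final String DELIM = "|";   // assumed custom delimiter

        // Assumed to be populated in setup(), as sketched in Section 4.1.
        private List<double[]> centroids = new ArrayList<double[]>();
        private double threshold;

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the text record to a numeric array.
            String[] parts = value.toString().split("\t");
            double[] point = new double[parts.length];
            for (int d = 0; d < parts.length; d++) {
                point[d] = Double.parseDouble(parts[d]);
            }

            // Calculate the distances from all the centroids and find the nearest one.
            double[] dist = new double[centroids.size()];
            int nearest = 0;
            for (int j = 0; j < centroids.size(); j++) {
                dist[j] = euclidean(point, centroids.get(j));
                if (dist[j] < dist[nearest]) {
                    nearest = j;
                }
            }

            // Find the other centroids within the threshold ratio.
            List<Integer> members = new ArrayList<Integer>();
            members.add(nearest);
            for (int j = 0; j < centroids.size(); j++) {
                if (j != nearest && dist[nearest] > 0 && dist[j] / dist[nearest] <= threshold) {
                    members.add(j);
                }
            }

            // Spill <cluster-num, L|data-object> for a lower bound object,
            // or <cluster-num, U|data-object> for each upper bound membership.
            if (members.size() == 1) {
                context.write(new IntWritable(nearest), new Text("L" + DELIM + value.toString()));
            } else {
                for (Integer clusterNum : members) {
                    context.write(new IntWritable(clusterNum), new Text("U" + DELIM + value.toString()));
                }
            }
        }

        private static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int d = 0; d < a.length; d++) {
                sum += (a[d] - b[d]) * (a[d] - b[d]);
            }
            return Math.sqrt(sum);
        }
    }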

4.3 reduce Phase

The reduce phase of the approach calculates the updated centroid after getting the data-points and their respective associations with the clusters. The reducer gets all the data points for a particular cluster (the key being the cluster-num) as a list of values. This list of values is then aggregated into two different numerical arrays, one for the lower bound and one for the upper bound. The cardinality of these sets is also calculated while aggregating the data-points; it is required to compute the updated centroid as specified in equations (2)-(4). Reducer Algorithm:


    Aggregate the upper bound and lower bound data-points
    Calculate the updated centroid
    Write the output tuple <cluster-num, centroid-dimensions>

Fig. 4. Abstract view of the reducer side algorithm.
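A corresponding Java sketch of the reducer logic of Fig. 4, which computes the updated centroid following equations (2)-(4), is shown below. The delimiter and the hard-coded weights are the same assumptions as in the mapper sketch; in practice they would be read from the configuration object.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch of the rough K-means reduce phase (Fig. 4), following equations (2)-(4).
    class RoughKMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

        private static final String DELIM = "\\|";   // assumed delimiter (regex-escaped)
        private double wLower = 0.7;                  // assumed; normally read from the Configuration
        private double wUpper = 0.3;

        @Override
        protected void reduce(IntWritable clusterNum, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double[] lowerSum = null;   // sum over the lower bound
            double[] upperSum = null;   // sum over the boundary region (upper minus lower)
            long lowerCount = 0;
            long upperCount = 0;

            // Aggregate lower bound and boundary region data-points and their cardinalities.
            for (Text value : values) {
                String[] tagged = value.toString().split(DELIM, 2);
                String[] dims = tagged[1].split("\t");
                double[] point = new double[dims.length];
                for (int d = 0; d < dims.length; d++) {
                    point[d] = Double.parseDouble(dims[d]);
                }
                if (lowerSum == null) {
                    lowerSum = new double[point.length];
                    upperSum = new double[point.length];
                }
                if ("L".equals(tagged[0])) {
                    for (int d = 0; d < point.length; d++) lowerSum[d] += point[d];
                    lowerCount++;
                } else {
                    for (int d = 0; d < point.length; d++) upperSum[d] += point[d];
                    upperCount++;
                }
            }

            // Compute the updated centroid as in equations (2), (3), and (4).
            double[] centroid = new double[lowerSum.length];
            for (int d = 0; d < centroid.length; d++) {
                if (lowerCount > 0 && upperCount == 0) {
                    centroid[d] = lowerSum[d] / lowerCount;                      // (2)
                } else if (lowerCount == 0 && upperCount > 0) {
                    centroid[d] = upperSum[d] / upperCount;                      // (3)
                } else {
                    centroid[d] = wLower * lowerSum[d] / lowerCount
                                + wUpper * upperSum[d] / upperCount;             // (4)
                }
            }

            // Write the output tuple <cluster-num, centroid-dimensions>.
            StringBuilder out = new StringBuilder();
            for (int d = 0; d < centroid.length; d++) {
                if (d > 0) out.append('\t');
                out.append(centroid[d]);
            }
            context.write(clusterNum, new Text(out.toString()));
        }
    }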

5 Experimental Results

We implemented our algorithm in the Java language on the Hadoop MapReduce [3] platform. The mappers and reducers are separate Java processes running on different machines of the distributed system. The implementation pseudo-code was explained in the previous section.

We then generated various datasets to test the scalability of the algorithm with larger dataset sizes. The data was generated by a generic data-generator, which takes the range, the number of data-points to be generated, the number of dimensions, and other parameters, and produces a data file and a centroids file. Future implementations would replace this data generator with specific use-cases and tuned initial centroid generators. The data used by us consisted of three-dimensional floating-point numbers in the range 1 to 10000. In the experiments, we set the threshold to 0.7, $w_{lower}$ to 0.7, and $w_{upper}$ to 0.3. We ran two sets of experiments: multiple dataset runs and an iterative test run.
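For illustration, a minimal generator along the lines described above could look as follows; the hard-coded parameters, output file names, formats, and the choice of the first k points as initial centroids are our assumptions, since the paper does not list the generator's code.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Random;

    // Minimal synthetic data generator: uniform points in [low, high] per dimension.
    public class DataGenerator {
        public static void main(String[] args) throws IOException {
            long numPoints = 1000000;     // number of data-points
            int dimensions = 3;           // number of dimensions
            double low = 1, high = 10000; // value range
            int k = 3;                    // number of initial centroids

            Random random = new Random(42);
            try (PrintWriter data = new PrintWriter(new FileWriter("data.tsv"));
                 PrintWriter centroids = new PrintWriter(new FileWriter("centroids.tsv"))) {
                for (long i = 0; i < numPoints; i++) {
                    StringBuilder line = new StringBuilder();
                    for (int d = 0; d < dimensions; d++) {
                        if (d > 0) line.append('\t');
                        line.append(low + random.nextDouble() * (high - low));
                    }
                    data.println(line);
                    if (i < k) {
                        centroids.println(line);   // first k points used as initial centroids
                    }
                }
            }
        }
    }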

The Hadoop cluster was provisioned on the Rackspace public cloud [28] and had HDP 1.1 [27] from Hortonworks deployed on it. The cluster was a 6-node cluster with the architecture specified in Table 1.

6-Node Hadoop Cluster

Architecture:             Namenode: 1 node     Datanode: 5 nodes

Namenode configuration:   RAM: 4096 MB    Disk: 128 GB    CPU: 1 VCPU

Datanode configuration:   RAM: 512 MB     Disk: 256 GB    CPU: 1 VCPU

Table. 1. Configuration of the Hadoop cluster.

5.1 Multiple Dataset Runs

Each data point is a record stored as a line of text in the files. The record reader of Hadoop parses the line and gives it to the mapper for further processing. If a record is split across shards due to the hard split boundaries introduced by sharding the data, the record reader will fetch all the record parts first and only then give the record to the mapper for processing. For this experiment, the numbers of records were 100; 1,000; 10,000; 100,000; 1,000,000; 10,000,000 (10 million); and 50,000,000 (50 million).

Run    No. of Data Points    k    Time (in s)
1      100                   3    57.072
2      1,000                 3    59.146
3      10,000                3    67.137
4      100,000               3    68.059
5      1,000,000             3    123.237
6      10,000,000            3    2194.545
7      50,000,000            3    9260.431

Table. 2. Time (in s) taken for one run. A run does 2 iterations of the algorithm. The centroids generated in the first iteration are given as the input centroids to the second iteration.

Fig. 5. Running time for each Run.

As can be observed in Fig. 5 and inferred from Table 2, the times recorded for runs on smaller datasets (<100,000 points) do not differ much. In these cases, the major overhead is the initialization and staging of the Hadoop job. For intermediate-sized and larger datasets, the time taken to complete the run depends on the computation and the network transfer of intermediate results.

5.2 Iterative Test Run

For this experiment, the number of experimental records was set to 1000000 and the number of iterations to 10. The dataset was roughly 200 MB on disk, but was replicated 3 times by the distributed file system to help in the case of node failures and speculative executions.

By observing the running times given in Table 3 and Fig. 6, we observed that the time taken for the initial iteration reflects the random centroids generated for the experimentation. After the first iteration, the centroids became better suited for further iterations. Subsequent iterations did not show large fluctuations in the time taken per iteration.


Iteration, i    Time taken (in seconds) per iteration
1               913.780
2               1757.879
3               1646.543
4               1504.225
5               1555.398
6               1548.411
7               1497.287
8               1548.626
9               1503.246
10              1515.255

Table. 3. Time (in s) taken per Iteration. i = 10 and n = 1,000,000.

Fig. 6. Running time for each iteration.

6 Conclusion

This paper combines the efficiency of rough K-means with the inherent features provided by parallel algorithms: scalability, faster performance, and the ability to handle large data. The paper demonstrates the use of the MapReduce paradigm in the context of rough K-means. The mapper focuses on the cluster (centroid) selection for each object, and the reducer focuses on the updated centroid calculation. We plan to test the MapReduce rough K-means with different cluster quality measures and compare its efficiency with classical rough K-means on a high-performance computer. We also plan to extend this work to a real-world setting. Results of our experiments will appear in future publications.


Acknowledgement

The authors would like to thank Orzota, Inc. for the time required to complete the experimental analysis and the Hadoop cluster provisioned for the experiment.

References

1. Ghemawat S., Gobioff H., Leung S.: The Google file system. In Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP '03). ACM, New York, NY, USA, 29-43 (2003)

2. Dean J., Ghemawat S.: MapReduce: simplified data processing on large clusters. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI'04), Vol. 6. USENIX Association, Berkeley, CA, USA, 10-10 (2004)

3. http://hadoop.apache.org/ - Hadoop home page in the Apache Software Foundation.

4. http://nutch.apache.org/ - Nutch home page in the Apache Software Foundation.

5. Cordeiro R., Traina C. Junior, Traina A., López J., Kang U., Faloutsos C.: Clustering very large multi-dimensional datasets with MapReduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '11). ACM, New York, NY, USA, 690-698 (2011)

6. Ene A., Im S., Moseley B.: Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '11). ACM, New York, NY, USA, 681-689 (2011)

7. https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms - Apache Mahout algorithms page in Apache software foundation.

8. Lingras P.: Evolutionary Rough K-Means Clustering. In Proceedings of the 4th International Conference on Rough Sets and Knowledge Technology (RSKT '09), Springer-Verlag, Berlin, Heidelberg, 68-75 (2009)

9. Pawlak, Z.: Rough sets. International Journal of Computing and Information Sciences 11, 341-356 (1982)

10. Karloff H., Suri S., Vassilvitskii S.: A model of computation for MapReduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '10). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 938-948 (2010)

11. Meru V.: Data clustering using MapReduce: A Look at Various Clustering Algorithms Implemented with MapReduce Paradigm. In Software Developer's Journal Vol. 2 No. 2, Issue 2/2013, Software Media Sp. z o.o. Sp. Komandytowa ul. Bokserska 1, 02-682 Warsaw, Poland, 40-47 (2013)

12. Zhang J., Wu G., Li H., Hu X., Wu X.: A 2-Tier Clustering Algorithm with Map-Reduce. In Proceedings of the ChinaGrid Conference (ChinaGrid), Fifth Annual, Guangzhou, China, 160-166 (2010)

13. Pawlak, Z.: Rough Sets. CSC '95, Proceedings of the 1995 ACM 23rd Annual Conference on Computer Science, Nashville, TN, USA. 262-264 (1995)

14. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, USA (1992)

15. Yao Y.: Constructive and algebraic methods of the theory of rough sets. Information Sciences 109, 1-4, 21-47 (1998)


16. Polkowski L., Skowron A.: Rough Mereology: A New Paradigm for Approximate Reasoning. International Journal of Approximate Reasoning, 15(4), pp. 333-365 (1997)

17. Skowron A., Stepaniuk J.: Information Granules in Distributed Environment. In Proceedings of the 7th International Workshop on New Directions in Rough Sets, Data Mining, and Granular-Soft Computing (RSFDGrC '99), Ning Zhong, Andrzej Skowron, and Setsuo Ohsuga (Eds.). Springer-Verlag, London, UK, 357-365 (1999)

18. http://skynet.rubyforge.org - A Ruby MapReduce framework

19. http://discoproject.org - Distributed computing framework based on MapReduce

20. http://mapreduce.stanford.edu - The Phoenix and Phoenix++ system for MapReduce programming

21. http://mfisk.github.io/filemap - File based MapReduce

22. Shvachko K., Kuang H., Radia S., Chansler R.: The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10). IEEE Computer Society, Washington, DC, USA, 1-10 (2010)

23. http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/filecache/DistributedCache.html - The Documentation page for Distributed Cache.

24. http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html - The Documentation page for Configuration Class.

25. Lingras P., Chen M., Miao D.: Rough multi-category decision theoretic framework. In Proceedings of the 3rd international conference on Rough sets and knowledge technology (RSKT'08), Guoyin Wang, Tianrui Li, Jerzy W. Grzymala-Busse, Duoqian Miao, Andrzej Skowron, and Yiyu Yao (Eds.). Springer-Verlag, Berlin, Heidelberg, 676-683 (2008)

26. Peters G.: Some refinements of rough k-means clustering. Pattern Recogn. 39, 8, 1481-1491 (2006)

27. http://hortonworks.com/blog/welcome-hortonworks-data-platform-1-1/ - Hortonworks Data Platform 1.1 Introductory blog.

28. http://www.rackspace.com/ - Rackspace, US Inc.