A Novel Fuzzy Clustering Method for Outlier Detection in Data Mining


    INFORMATION PAPER

    International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009

    161

    A Novel Fuzzy Clustering Method for Outlier

    Detection in Data Mining

Binu Thomas¹ and Raju G.²

¹ Research Scholar, Mahatma Gandhi University, Kerala, [email protected]

² SCMS School of Technology & Management, Cochin, Kerala, [email protected]

Abstract: In data mining, conventional clustering algorithms have difficulties in handling the challenges posed by collections of natural data, which are often vague and uncertain. Fuzzy clustering methods have the potential to manage such situations efficiently. This paper introduces the limitations of conventional clustering methods through k-means and fuzzy c-means clustering, and demonstrates the drawbacks of these algorithms in handling outlier points. We then propose a new fuzzy clustering method which is more efficient in handling outlier points than the conventional fuzzy c-means algorithm. The new method excludes outlier points by giving them extremely small membership values in the existing clusters, while the fuzzy c-means algorithm tends to give them outsized membership values. The new algorithm also incorporates the positive aspects of the k-means algorithm by calculating the new cluster centers more efficiently than the c-means method.

Index Terms: fuzzy clustering, outlier points, knowledge discovery, c-means algorithm

    I. INTRODUCTION

The process of finding useful patterns and information in raw data is often known as knowledge discovery in databases, or KDD. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns (models) from data [5]. Cluster analysis is a technique for breaking data down into related components in such a way that patterns and order become visible. It aims at sifting through large volumes of data in order to reveal useful information in the form of new relationships, patterns, or clusters, for decision-making by a user. Clusters are natural groupings of data items based on similarity metrics or probability density models. A clustering algorithm maps a new data item into one of several known clusters. In fact, cluster analysis has the virtue of strengthening the exposure of patterns and behavior as more and more data becomes available [7]. A cluster has a center of gravity, which is basically the weighted average of the cluster. Membership of a data item in a cluster can be determined by measuring the distance from each cluster center to the data point [6]; the data item is added to the cluster for which this distance is a minimum.

This paper provides an overview of the crisp clustering technique, the advantages and limitations of fuzzy c-means clustering, and a new fuzzy clustering method which is simple and superior to c-means clustering in handling outlier points. Section 2 describes the basic notions of clustering and introduces the k-means clustering algorithm. In Section 3 we explain the concepts of vagueness and uncertainty in natural data. Section 4 introduces the fuzzy clustering method and describes how it can handle vagueness and uncertainty through overlapping clusters with partial membership functions; the same section also introduces the most common fuzzy clustering algorithm, the c-means algorithm, and ends with its limitations. Section 5 proposes the new fuzzy clustering method. Section 6 demonstrates the concepts presented in the paper. Finally, Section 7 concludes the paper.

    II. CRISP CLUSTERING TECHNIQUES

Traditional clustering techniques attempt to segment data by grouping related attributes into uniquely defined clusters. Each data point in the sample space is assigned to only one cluster. The k-means algorithm and its different variations are the most well-known and commonly used partitioning methods. The value k stands for the number of cluster seeds initially provided to the algorithm. The algorithm takes the input parameter k and partitions a set of m objects into k clusters [7]. The technique works by computing the distance between a data point and the cluster centers and adding the item to one of the clusters, so that intra-cluster similarity is high but inter-cluster similarity is low. A common way to find the distance is to calculate the sum of the squared differences; this is known as the Euclidean distance [10] (exp. 1):

d_k = Σ_{j=1}^{n} (X_{kj} − C_j)²    (1)

where
d_k : the distance of the k-th data point from the cluster center C_j, summed over the n attributes
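The squared-difference distance of exp. (1) is a one-liner in most numeric languages. The following Python/NumPy sketch (the function name is ours, not from the paper) computes the distance of one data point from one cluster center:

```python
import numpy as np

def squared_distance(x, c):
    """Sum of squared differences between a data point x and a
    cluster center c, as in exp. (1)."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(np.sum((x - c) ** 2))
```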

With the distance of a data point from the cluster centers defined, the k-means algorithm is fairly simple. The cluster centers are randomly initialized, and each data point x_i is assigned to the cluster to which it has minimum distance. When all the data points have been assigned to clusters, new cluster centers are calculated by finding the weighted average of all data points in each cluster. The cluster center calculation moves the previous centroid location towards the center of the cluster set. This is continued until there is no change in the cluster centers.
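This assign-then-recompute loop can be sketched as follows. This is a minimal illustration in Python/NumPy; the function name, the random initialization, and the stopping test are our assumptions, not the paper's code:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest center,
    then recompute each center as the mean of its members."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # randomly pick k distinct points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # d[i, j]: squared Euclidean distance of point i from center j
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)          # crisp assignment: one cluster each
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # no change in centers -> stop
            break
        centers = new_centers
    return centers, labels
```

Because every point is forced into exactly one cluster, an extreme outlier passed to this sketch would still be absorbed by its nearest cluster, which is precisely the limitation discussed next.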

© 2009 ACADEMY PUBLISHER


    A. Limitations of k-means algorithm

The main limitation of the algorithm comes from its crisp nature in assigning cluster membership to data points. Depending on the minimum distance, a data point always becomes a member of exactly one of the clusters. This works well with highly structured data, but real-world data is almost never arranged in clear-cut groups. Instead, clusters have ill-defined boundaries that smear into the data space, often overlapping the perimeters of surrounding clusters [4]. In most cases, real-world data also contains apparent extraneous data points clearly not belonging to any of the clusters; these are called outlier points. The k-means algorithm is not capable of dealing with overlapping clusters and outlier points, since it has to include every data point in one of the existing clusters. Because of this, even extreme outlier points will be included in a cluster based on the minimum distance.

    III. FUZZY LOGIC

The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy logic is capable of supporting, to a reasonable extent, human-type reasoning in natural form by allowing partial membership for data items in fuzzy subsets [2]. Integration of fuzzy logic with data mining techniques has become one of the key constituents of soft computing in handling the challenges posed by massive collections of natural data [1]. Fuzzy logic is the logic of fuzzy sets. A fuzzy set has, potentially, an infinite range of truth values between one and zero [3]. Propositions in fuzzy logic have a degree of truth, and membership in fuzzy sets can be fully inclusive, fully exclusive, or some degree in between [13]. A fuzzy set is distinct from a crisp set in that it allows its elements to have a degree of membership. The core of a fuzzy set is its membership function: a function which defines the relationship between a value in the set's domain and its degree of membership in the fuzzy set (exp. 2). The relationship is functional because it returns a single degree of membership for any value in the domain [11]:

μ = f(s, x)    (2)

where
μ : the fuzzy membership value for the element
s : the fuzzy set
x : the value from the underlying domain
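For instance, a membership function for a fuzzy set such as "tall" can be written directly from this definition. The sketch below is an illustration of ours, not an example from the paper; it returns a degree of membership between zero and one:

```python
def mu_tall(height_cm):
    """Illustrative membership function for a fuzzy set 'tall':
    zero membership below 160 cm, full membership above 180 cm,
    and a linear degree of membership in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 180:
        return 1.0
    return (height_cm - 160) / 20.0
```

A height of 170 cm is thus "somewhat tall", with partial membership in the set rather than a crisp in/out decision.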

Fuzzy sets provide a means of defining a series of overlapping concepts for a model variable, since they represent degrees of membership. A value from the complete universe of discourse for a variable can have membership in more than one fuzzy set.

IV. FUZZY CLUSTERING METHODS

The central idea in fuzzy clustering is the non-unique partitioning of the data into a collection of clusters. Each data point is assigned a membership value for each of the clusters, and the fuzzy clustering algorithms allow the clusters to grow into their natural shapes [15]. In some cases the membership value may be zero, indicating that the data point is not a member of the cluster under consideration. Many crisp clustering techniques have difficulties in handling extreme outliers, but fuzzy clustering algorithms tend to give them very small membership degrees in the surrounding clusters [14]. The non-zero membership values, with a maximum of one, show the degree to which a data point represents a cluster. Thus fuzzy clustering provides a flexible and robust method for handling natural data with vagueness and uncertainty. In fuzzy clustering, each data point has an associated degree of membership for each cluster; the membership value is in the range zero to one and indicates the strength of the point's association with that cluster.

A. C-means fuzzy clustering algorithm [10]

Fuzzy c-means clustering involves two processes: the calculation of cluster centers and the assignment of points to these centers using a form of Euclidean distance. This process is repeated until the cluster centers stabilize. The algorithm is similar to k-means clustering in many ways, but it assigns each data item a membership value for each cluster in the range 0 to 1. It thus incorporates the fuzzy-set concept of partial membership, and forms overlapping clusters to support it. The algorithm needs a fuzzification parameter m in the range [1, n] which determines the degree of fuzziness in the clusters. When m reaches the value of 1 the algorithm works like a crisp partitioning algorithm; for larger values of m the overlapping of clusters tends to increase. The algorithm calculates the membership value with the formula:

μ_j(x_i) = (1/d_ji)^(1/(m−1)) / Σ_{k=1}^{p} (1/d_ki)^(1/(m−1))    (3)

where
μ_j(x_i) : the membership of x_i in the j-th cluster
d_ji : the distance of x_i from the center of cluster c_j
m : the fuzzification parameter
p : the number of specified clusters
d_ki : the distance of x_i from the center of cluster c_k

The new cluster centers are calculated with these membership values using exp. (4):

c_j = Σ_i [μ_j(x_i)]^m · x_i / Σ_i [μ_j(x_i)]^m    (4)

where
c_j : the center of the j-th cluster
x_i : the i-th data point


μ_j : the function which returns the membership of x_i in cluster j
m : the fuzzification parameter

This is a special form of weighted average: we raise x_i's current membership to the power m and multiply the result by x_i, then divide the sum of these products by the sum of the fuzzified memberships. The first loop of the algorithm calculates membership values for the data points in the clusters, and the second loop recalculates the cluster centers using these membership values. When the cluster centers stabilize (when there is no change), the algorithm ends.

    B. Limitations of the algorithm

The fuzzy c-means approach to clustering suffers from several constraints that affect its performance [10]. The main drawback comes from the restriction that the sum of the membership values of a data point x_i over all the clusters must be one, as in exp. (5); this tends to give high membership values to outlier points, so the algorithm has difficulty in handling them. Secondly, the membership of a data point in a cluster depends directly on its membership values with respect to the other cluster centers, and this sometimes produces undesirable results.

Σ_{j=1}^{p} μ_j(x_i) = 1    (5)

In the fuzzy c-means method a point has partial membership in all the clusters, and exp. (4) calculates the new cluster centers as a special form of weighted average of all the data points. The third limitation of the algorithm is that, due to the influence (partial membership) of all the data members, the cluster centers tend to move towards the center of all the data points [10]. The fourth constraint of the algorithm is its inability to calculate the membership value when the distance of a data point is zero (exp. 3).

TABLE 1. FUZZY C-MEANS ALGORITHM

initialize p = number of clusters
initialize m = fuzzification parameter
initialize C_j (cluster centers)
Repeat
    For i = 1 to n : update μ_j(x_i) applying (3)
    For j = 1 to p : update C_j with (4), using the current μ_j(x_i)
Until the C_j estimates stabilize
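The two loops of Table 1 can be sketched compactly in Python/NumPy. The snippet below is our reading of exps. (3) and (4); the small eps guarding against division by zero is our addition, working around the zero-distance constraint discussed above:

```python
import numpy as np

def fuzzy_c_means(points, centers, m=2.0, max_iter=100, eps=1e-9):
    """Fuzzy c-means per Table 1: alternate the membership update of
    exp. (3) with the weighted-average center update of exp. (4)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # d[i, j]: distance of point i from center j (eps avoids 1/0)
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) + eps
        inv = (1.0 / d) ** (1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)   # exp. (3): each row sums to one
        um = u ** m
        new_centers = (um.T @ points) / um.sum(axis=0)[:, None]   # exp. (4)
        if np.allclose(new_centers, centers):      # centers stabilized -> stop
            break
        centers = new_centers
    return centers, u
```

Note how the row normalization enforces exp. (5): even a far-away outlier receives memberships that sum to one, which is the behavior the proposed method is designed to avoid.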

V. THE NEW FUZZY CLUSTERING METHOD

The new fuzzy clustering method we propose removes the restriction imposed by exp. (5). Due to this constraint, the c-means algorithm tends to give outlier points large membership values. In the c-means algorithm, the membership of a point in a cluster is calculated based on its memberships in the other clusters, and many limitations of the algorithm arise from this. In the new method, the membership of a point in a cluster depends only on its distance within that cluster. For calculating the membership values we use a new, simple expression, given in exp. (6):

μ_j(x_i) = (Max(d_j) − d_ji) / Max(d_j)    (6)

where
μ_j(x_i) : the membership of x_i in the j-th cluster
d_ji : the distance of x_i from the center of cluster c_j
Max(d_j) : the maximum distance in cluster c_j

Since Max(d_j)/Max(d_j) = 1, the membership function (exp. 6) generates values close to one for smaller distances d_ji and a membership value of zero for the maximum distance. If the distance of a data point is zero, the function returns a membership value of one, and

thus it overcomes the fourth constraint of the c-means algorithm. The membership values are calculated based only on the distance of a data member within the cluster, so the method does not suffer from the first and second constraints of the c-means algorithm. To overcome the third limitation of c-means in calculating new cluster centers, the new method inherits a feature of the k-means algorithm: a data point is assigned completely to the cluster where it has the maximum membership, and the point is used only in the calculation of the new center of that cluster. This way the influence of a data point on the calculation of all the cluster centers is avoided. The point is used in the calculations only if its membership value falls above a threshold value α. This ensures that outlier points are not considered in the new cluster center calculations. Unlike crisp clustering algorithms such as k-means, most fuzzy clustering algorithms are sensitive to the selection of the initial centroids; the effectiveness of the algorithms depends on the initial cluster seeds [17]. For better results we suggest using the k-means algorithm to obtain the initial cluster seeds, which helps the algorithm converge to the best possible results. In the new method, since the membership values vary strictly from 0 to 1, we find that initializing the threshold value α to 0.5 produces better results in the cluster center calculation.


TABLE 2. THE NEW FUZZY CLUSTERING ALGORITHM

initialize p = number of clusters
initialize C_j (cluster centers)
initialize α (threshold value)
Repeat
    For i = 1 to n : update μ_j(x_i) applying (6)
    For k = 1 to p :
        Sum = 0
        Count = 0
        For i = 1 to n :
            If μ_k(x_i) is maximum over the clusters then
                If μ_k(x_i) >= α then
                    Sum = Sum + x_i
                    Count = Count + 1
        C_k = Sum / Count
Until the C_k estimates stabilize

The fuzzy membership values found with exp. (6) are used in the new fuzzy clustering algorithm given in Table 2. The algorithm stops when the cluster centers stabilize. The algorithm is more efficient in handling data with outlier points and in overcoming the other constraints imposed by the c-means algorithm. The new method we propose is also far superior in the calculation of new cluster centers, which we demonstrate in the next section with two numerical examples.
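A self-contained sketch of Table 2 follows (the naming and the toy data are ours; α = 0.5 as suggested above). Each point's membership comes from exp. (6) alone, and only points whose maximum membership lies in a cluster and exceeds α enter that cluster's k-means-style center update:

```python
import numpy as np

def new_fuzzy_clustering(points, centers, alpha=0.5, max_iter=100):
    """The proposed method (Table 2): memberships from exp. (6);
    each center is recomputed from the points whose maximum
    membership falls in that cluster and exceeds alpha."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # d[i, j]: distance of point i from center j
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dmax = np.maximum(d.max(axis=0), 1e-12)    # Max(d_j) per cluster
        u = (dmax[None, :] - d) / dmax[None, :]    # exp. (6): 1 at d=0, 0 at Max(d_j)
        best = u.argmax(axis=1)                    # cluster of maximum membership
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[(best == j) & (u[:, j] >= alpha)]
            if len(members):                       # points below alpha (outliers) excluded
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):      # centers stabilized -> stop
            break
        centers = new_centers
    return centers, u
```

On a toy set with two tight groups and one distant outlier, the outlier is the farthest point from every center, so exp. (6) gives it membership zero in every cluster and it never moves a center.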

VI. ILLUSTRATIONS

To evaluate the effectiveness of the new algorithm, we first applied it to a small synthetic data set to demonstrate its functioning in detail. The algorithm was also tested with real data collected for Bhutan's Gross National Happiness (GNH) program.

    A. With synthetic data

Consider the sales of an item from different shops at different periods of the year, as shown in figure 1. There is an outlier point at (12, 400). If we start data exploration with two random cluster centers at C1(0.5, 150) and C2(8, 150), the algorithm ends with the cluster centers at C1(1.52, 166.20) and C2(5.49, 172.11). We also applied the c-means algorithm to the same data set and found that it converges to the cluster centers C1(2.79, 47.65) and C2(3.29, 204.4).

TABLE 3. C-MEANS AND NEW METHOD (OUTLIER POINT: (12, 400))

Method       Final Cluster Centers      Outlier membership
C-means      C1 (2.79, 47.65)           0.37
             C2 (3.29, 204.4)           0.63
New method   C1 (1.52, 166.20)          0
             C2 (5.49, 172.11)          0

    Figure 1. The data points and final overlapping clusters

We also found that the algorithm is far superior to c-means in handling outlier points (see Table 3): the new algorithm is capable of giving very low membership values to the outlier points. If we start with C1 at (0.5, 150), the algorithm takes four steps to converge the first cluster center to (1.52, 166.23). Similarly, the second cluster center C2 converges from (8, 150) to (5.49, 172.11) in four steps. The algorithm finally ends with the cluster centers at C1(1.5, 166.2) and C2(5.4, 172.11), and with two overlapping clusters as shown in figure 1. The outlier point is given zero membership in both clusters. From Table 3 it is clear that the new fuzzy clustering method we propose is better than the conventional c-means algorithm both in handling outlier points and in the calculation of new cluster centers. Due to the constraint given in expression (5), the c-means algorithm gives membership values 0.37 and 0.63 to the outlier point (12, 400), so that the sum of the memberships of this point is one. The new algorithm, in contrast, gives this point zero membership.

B. With natural data

Happiness and satisfaction are directly related to a community's ability to meet its basic needs, and these are important factors in safeguarding physical health. The unique concept of Bhutan's Gross National Happiness (GNH) depends on nine factors, such as health, ecosystem, and emotional well-being [16]. The GNH regional chapter at Sherubtse College conducted a survey among 1311 villagers, and the responses were converted into numeric values. For the analysis of the new method we took the attributes income and health index, as shown in figure 2.

    Figure 2. The data points


TABLE 4. THE FINAL CLUSTER CENTERS

Centers    C-means (X, Y)        New Method (X, Y)
C1         24243.11, 6.53        24376.32, 8.135
C2         69749.6, 5.08         68907, 5.62
C3         115979.3, 2.83        113264.6, 1.815

As we can see from the data set, in Bhutan the low-income group maintains better health than the high-income group, since they are self-sufficient in many ways. Like any other natural data, this data set contains many outlier points which do not belong to any of the groups. If we apply the c-means algorithm, these points tend to get high membership values due to exp. (5). To start the data analysis, we first applied the k-means algorithm to find the initial three cluster centers. The algorithm ended with the three cluster centers at C1(24243, 6.7), C2(69794, 5.1) and C3(11979.29, 2.72). We applied these initial values in both the c-means algorithm and the new method, and the algorithms ended with the centroids given in Table 4. From figure 2 and Table 4 it can be seen that the final centroids of the c-means method do not represent the actual centers of the clusters; this is due to the influence of the outlier points, since the memberships of all the points are considered in the calculation of the cluster centers, which therefore tend to move towards the center of all the points. The new method identifies the cluster centers in a better way. The efficiency of the new algorithm lies in its ability to give very low membership values to outlier points and to consider each point in the calculation of only one cluster center.

VII. CONCLUSION

A good clustering algorithm produces high-quality clusters, with low inter-cluster similarity and high intra-cluster similarity. Many conventional clustering algorithms, such as k-means and fuzzy c-means, achieve this on crisp and highly structured data, but they have difficulties in handling unstructured natural data, which often contains outlier points. The proposed fuzzy clustering algorithm combines the positive aspects of both crisp and fuzzy clustering algorithms. It is more efficient in handling natural data with outlier points than both the k-means and fuzzy c-means algorithms, and it achieves this by assigning very low membership values to the outlier points. The algorithm has limitations, however, in exploring highly structured crisp data that is free from outlier points, and its efficiency has yet to be tested on comparatively larger data sets.

    REFERENCES

[1] Sankar K. Pal, P. Mitra, "Data Mining in Soft Computing Framework: A Survey", IEEE Transactions on Neural Networks, vol. 13, no. 1, January 2002.
[2] R. Kruse, C. Borgelt, "Fuzzy Data Analysis: Challenges and Perspectives", http://citeseer.ist.psu.edu/kruse99fuzzy.html
[3] G. Raju, Th. Shanta Kumar, Binu Thomas, "Integration of Fuzzy Logic in Data Mining: A Comparative Case Study", Proc. of the International Conf. on Mathematics and Computer Science, Loyola College, Chennai, pp. 128-136, 2008.
[4] Maria Halkidi, "Quality Assessment and Uncertainty Handling in Data Mining Process", http://citeseer.ist.psu.edu/halkidi00quality.html
[5] W. H. Inmon, "The data warehouse and data mining", Commn. ACM, vol. 39, pp. 49-50, 1996.
[6] U. Fayyad and R. Uthurusamy, "Data mining and knowledge discovery in databases", Commn. ACM, vol. 39, pp. 24-27, 1996.
[7] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", http://citeseer.ist.psu.edu/berkhin02survey.html
[8] M. Chau, R. Cheng, B. Kao, "Uncertain Data Mining: A New Research Direction", www.business.hku.hk/~mchau/papers/UncertainDataMining_WSA.pdf
[9] Keith C. C. Chan, Wai-Ho Au, B. Choi, "Mining Fuzzy Rules in a Donor Database for Direct Marketing by a Charitable Organization", Proc. of the First IEEE International Conference on Cognitive Informatics, pp. 239-246, 2002.
[10] E. Cox, Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration, Elsevier, 2005.
[11] G. J. Klir, T. A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice Hall, 1988.
[12] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2003.
[13] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, N.Y., 1973.
[14] Carl G. Looney, "A Fuzzy Clustering and Fuzzy Merging Algorithm", http://citeseer.ist.psu.edu/399498.html
[15] Frank Klawonn, Annette Keller, "Fuzzy Clustering Based on Modified Distance Measures", http://citeseer.ist.psu.edu/klawonn99fuzzy.html
[16] Sullen Donnelly, "How Bhutan Can Develop and Measure GNH", www.bhutanstudies.org.bt/seminar/0402-gnh/GNH-papers-1st_18-20.pdf
[17] Lei Jiang and Wenhui Yang, "A Modified Fuzzy C-Means Algorithm for Segmentation of Magnetic Resonance Images", Proc. VIIth Digital Image Computing: Techniques and Applications, pp. 225-231, 10-12 Dec. 2003, Sydney.
