INFORMATION PAPER
International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009
A Novel Fuzzy Clustering Method for Outlier
Detection in Data Mining
Binu Thomas(1) and Raju G.(2)
(1) Research Scholar, Mahatma Gandhi University, Kerala, [email protected]
(2) SCMS School of Technology & Management, Cochin, Kerala, [email protected]
Abstract: In data mining, conventional clustering algorithms have difficulty handling the challenges posed by collections of natural data, which are often vague and uncertain. Fuzzy clustering methods have the potential to manage such situations efficiently. This paper introduces the limitations of conventional clustering methods through the k-means and fuzzy c-means algorithms and demonstrates their drawbacks in handling outlier points. We propose a new fuzzy clustering method that handles outlier points more efficiently than the conventional fuzzy c-means algorithm: it excludes outlier points by giving them extremely small membership values in the existing clusters, whereas the fuzzy c-means algorithm tends to give them outsized membership values. The new algorithm also incorporates the positive aspects of the k-means algorithm, calculating the new cluster centers more efficiently than the c-means method.

Index Terms: fuzzy clustering, outlier points, knowledge discovery, c-means algorithm
I. INTRODUCTION
The process of finding useful patterns and information in raw data is often known as knowledge discovery in databases, or KDD. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns (models) from data [5]. Cluster analysis is a technique for breaking data down into related components in such a way that patterns and order become visible. It aims at sifting through large volumes of data in order to reveal useful information in the form of new relationships, patterns, or clusters, for decision-making by a user. Clusters are natural groupings of data items based on similarity metrics or probability density models. Clustering algorithms map a new data item into one of several known clusters. In fact, cluster analysis has the virtue of strengthening the exposure of patterns and behavior as more and more data becomes available [7]. A cluster has a center of gravity, which is basically the weighted average of the cluster. Membership of a data item in a cluster can be determined by measuring the distance from each cluster center to the data point [6]; the data item is added to the cluster for which this distance is a minimum.

This paper provides an overview of the crisp clustering technique, the advantages and limitations of fuzzy c-means clustering, and a new fuzzy clustering method which is simple and superior to c-means clustering in handling outlier points. Section 2 describes the basic notions of clustering and introduces the k-means clustering algorithm. In Section 3 we explain the concept of vagueness and uncertainty in natural data. Section 4 introduces the fuzzy clustering method and describes how it can handle vagueness and uncertainty through the concept of overlapping clusters with partial membership functions. The same section also introduces the most common fuzzy clustering algorithm, the c-means algorithm, and ends with its limitations. Section 5 proposes the new fuzzy clustering method. Section 6 demonstrates the concepts presented in the paper. Finally, Section 7 concludes the paper.
II. CRISP CLUSTERING TECHNIQUES
Traditional clustering techniques attempt to segment data by grouping related attributes into uniquely defined clusters. Each data point in the sample space is assigned to only one cluster. The k-means algorithm and its different variations are the most well-known and commonly used partitioning methods. The value k stands for the number of cluster seeds initially provided to the algorithm, which takes the input parameter k and partitions a set of m objects into k clusters [7]. The technique works by computing the distance between a data point and the cluster centers and adding the item to the cluster for which this distance is a minimum, so that intra-cluster similarity is high but inter-cluster similarity is low. A common method of finding the distance is to calculate the sum of the squared differences, known as the Euclidean distance [10] (exp. 1):
d_k = Σ_{j=1}^{n} (X_{kj} − C_j)²    (1)

where
d_k : the distance of the k-th data point from the cluster center C_j
With this definition of the distance of a data point from the cluster centers, the k-means algorithm is fairly simple. The cluster centers are randomly initialized, and each data point x_i is assigned to the cluster to which it has minimum distance. When all the data points have been assigned to clusters, new cluster centers are calculated by finding the weighted average of all data points in each cluster. The cluster center calculation moves the previous centroid location towards the center of the cluster set. This continues until there is no change in the cluster centers.
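The loop just described can be sketched as follows (a minimal illustrative implementation, not the authors' code; the sample points, the two initial centers, and the iteration cap are invented for the example):

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign every point to its nearest center,
    then recompute each center as the mean of its members."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # squared Euclidean distance to each center (exp. 1)
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d2.index(min(d2))].append(p)
        new_centers = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:   # no change in centers -> stop
            break
        centers = new_centers
    return centers

# two well-separated groups of three points each
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, [(0, 0), (10, 10)]))
```

Each final center is simply the mean of the points crisply assigned to it, which is why (as discussed below) every point, including an outlier, must end up in exactly one cluster.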
2009 ACADEMY PUBLISHER
A. Limitations of k-means algorithm
The main limitation of the algorithm comes from its crisp nature in assigning cluster membership to data points. Depending on the minimum distance, a data point always becomes a member of exactly one of the clusters. This works well with highly structured data, but real-world data is almost never arranged in clear-cut groups. Instead, clusters have ill-defined boundaries that smear into the data space, often overlapping the perimeters of surrounding clusters [4]. In most cases real-world data also contains apparent extraneous data points that clearly do not belong to any of the clusters; these are called outlier points. The k-means algorithm is not capable of dealing with overlapping clusters and outlier points, since it has to include every data point in one of the existing clusters. Because of this, even extreme outlier points will be included in a cluster based on the minimum distance.
III. FUZZY LOGIC
The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy logic is capable of supporting, to a reasonable extent, human-type reasoning in natural form by allowing partial membership for data items in fuzzy subsets [2]. Integration of fuzzy logic with data mining techniques has become one of the key constituents of soft computing in handling the challenges posed by the massive collection of natural data [1].

Fuzzy logic is the logic of fuzzy sets. A fuzzy set has, potentially, an infinite range of truth values between one and zero [3]. Propositions in fuzzy logic have a degree of truth, and membership in fuzzy sets can be fully inclusive, fully exclusive, or some degree in between [13]. A fuzzy set is distinct from a crisp set in that it allows its elements to have a degree of membership. The core of a fuzzy set is its membership function: a function which defines the relationship between a value in the set's domain and its degree of membership in the fuzzy set (exp. 2). The relationship is functional because it returns a single degree of membership for any value in the domain [11].
μ = f(s, x)    (2)

where
μ : the fuzzy membership value for the element
s : the fuzzy set
x : the value from the underlying domain

Fuzzy sets provide a means of defining a series of overlapping concepts for a model variable, since they represent degrees of membership. The values from the complete universe of discourse for a variable can have memberships in more than one fuzzy set.
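For illustration, exp. (2) can be realized as a simple triangular membership function; the fuzzy set "warm" and its break-points below are hypothetical, chosen only to show partial membership:

```python
def triangular(a, b, c):
    """Membership function mu = f(s, x) for a triangular fuzzy set s,
    rising from a to b and falling from b to c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0              # fully exclusive
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# hypothetical fuzzy set "warm" over a temperature domain
warm = triangular(15.0, 22.0, 30.0)
print(warm(22.0))   # fully inclusive: 1.0
print(warm(18.0))   # partial membership
print(warm(35.0))   # fully exclusive: 0.0
```

A second overlapping set (say, "hot" starting below 30) would give the same domain value non-zero membership in both sets, which is the overlapping-concepts idea described above.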
IV. FUZZY CLUSTERING METHODS
The central idea in fuzzy clustering is the non-unique partitioning of the data into a collection of clusters. The data points are assigned membership values for each of the clusters, and the fuzzy clustering algorithms allow the clusters to grow into their natural shapes [15]. In some cases the membership value may be zero, indicating that the data point is not a member of the cluster under consideration. Many crisp clustering techniques have difficulties in handling extreme outliers, but fuzzy clustering algorithms tend to give them very small membership degrees in the surrounding clusters [14].

The non-zero membership values, with a maximum of one, show the degree to which the data point represents a cluster. Thus fuzzy clustering provides a flexible and robust method for handling natural data with vagueness and uncertainty. In fuzzy clustering, each data point has an associated degree of membership for each cluster, in the range zero to one, indicating the strength of its association with that cluster.
A. C-means fuzzy clustering algorithm [10]
Fuzzy c-means clustering involves two processes: the calculation of cluster centers and the assignment of points to these centers using a form of Euclidean distance. This process is repeated until the cluster centers stabilize. The algorithm is similar to k-means clustering in many ways, but it assigns each data item a membership value for each cluster in the range 0 to 1. It thus incorporates the fuzzy-set concept of partial membership and forms overlapping clusters to support it. The algorithm needs a fuzzification parameter m in the range [1, n] which determines the degree of fuzziness in the clusters. When m reaches the value of 1 the algorithm works like a crisp partitioning algorithm; for larger values of m the overlapping of the clusters tends to increase. The algorithm calculates the membership values with the formula
μ_j(x_i) = (1/d_ji)^{1/(m−1)} / Σ_{k=1}^{p} (1/d_ki)^{1/(m−1)}    (3)

where
μ_j(x_i) : the membership of x_i in the j-th cluster
d_ji : the distance of x_i from cluster center c_j
m : the fuzzification parameter
p : the number of specified clusters
d_ki : the distance of x_i from cluster center c_k
The new cluster centers are calculated from these membership values using exp. (4):
c_j = ( Σ_i [μ_j(x_i)]^m · x_i ) / ( Σ_i [μ_j(x_i)]^m )    (4)

where
c_j : the center of the j-th cluster
x_i : the i-th data point
μ_j : the function which returns the membership of x_i in the j-th cluster
m : the fuzzification parameter
This is a special form of weighted average: we raise each point's current membership to the power m, multiply it by x_i, and divide the sum of these products by the sum of the fuzzified memberships.

The first loop of the algorithm calculates membership values for the data points in the clusters, and the second loop recalculates the cluster centers using these membership values. When the cluster centers stabilize (when there is no change) the algorithm ends.
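The two loops can be sketched as follows (an illustrative implementation under assumed choices: Euclidean distance, m = 2, and a small tolerance as the stopping test; it is not the authors' code):

```python
import math

def dist(x, c):
    """Euclidean distance between a point and a center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def fcm_memberships(points, centers, m=2.0):
    """Exp. (3): membership of every point in every cluster, from
    inverse distances raised to 1/(m-1); rows sum to one (exp. 5)."""
    eps = 1e-12   # guard: exp. (3) is undefined when a distance is zero
    mems = []
    for x in points:
        inv = [(1.0 / max(dist(x, c), eps)) ** (1.0 / (m - 1.0))
               for c in centers]
        s = sum(inv)
        mems.append([v / s for v in inv])
    return mems

def fcm_centers(points, mems, m=2.0):
    """Exp. (4): each center is the fuzzified-membership weighted
    average of ALL data points."""
    dim = len(points[0])
    centers = []
    for j in range(len(mems[0])):
        w = [mu[j] ** m for mu in mems]
        centers.append(tuple(
            sum(wi * p[d] for wi, p in zip(w, points)) / sum(w)
            for d in range(dim)))
    return centers

def fcm(points, centers, m=2.0, tol=1e-6, max_iter=100):
    """Alternate the two loops until the centers stabilize."""
    for _ in range(max_iter):
        mems = fcm_memberships(points, centers, m)
        new_centers = fcm_centers(points, mems, m)
        if all(dist(c, n) < tol for c, n in zip(centers, new_centers)):
            break
        centers = new_centers
    return centers, mems

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, mems = fcm(pts, [(0, 0), (10, 10)])
```

Note that because exp. (4) weights all points into every center, each center is pulled slightly toward the opposite cluster, which is the third limitation discussed below.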
B. Limitations of the algorithm
The fuzzy c-means approach to clustering suffers from several constraints that affect its performance [10]. The main drawback comes from the restriction that the sum of the membership values of a data point x_i over all the clusters must be one, as in exp. (5); this tends to give high membership values to outlier points, so the algorithm has difficulty handling them. Secondly, the membership of a data point in a cluster depends directly on its membership values for the other cluster centers, and this sometimes produces undesirable results.
Σ_{j=1}^{p} μ_j(x_i) = 1    (5)
In the fuzzy c-means method a point has partial membership in all the clusters, and exp. (4) calculates the new cluster centers as a special form of weighted average of all the data points. The third limitation of the algorithm is that, due to this influence (partial membership) of all the data members, the cluster centers tend to move towards the center of all the data points [10]. The fourth constraint of the algorithm is its inability to calculate the membership value when the distance of a data point is zero (exp. 3).
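The first constraint can be seen numerically with a self-contained sketch of exp. (3) (the two centers and the outlier location below are invented for the example; m = 2):

```python
def memberships(x, centers, m=2.0):
    """Exp. (3) memberships of a single point; by construction
    they satisfy exp. (5): they always sum to one."""
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5 for c in centers]
    inv = [(1.0 / di) ** (1.0 / (m - 1.0)) for di in d]
    return [v / sum(inv) for v in inv]

# an extreme outlier far from both centers still receives a
# substantial share of membership in each cluster, because the
# shares are forced to sum to one
mu = memberships((12, 400), [(2.79, 47.65), (3.29, 204.4)])
print([round(v, 2) for v in mu])
```

However far the point is moved away, the two memberships always total one, so the outlier can never be given uniformly small memberships under this scheme.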
TABLE 1.
FUZZY C-MEANS ALGORITHM
initialize p=number of clusters
initialize m=fuzzification parameter
initialize Cj (cluster centers)
Repeat
For i=1 to n : Update μj(xi) applying (3)
For j=1 to p : Update Cj with (4) using current μj(xi)
Until Cj estimates stabilize
V. THE NEW FUZZY CLUSTERING METHOD
The new fuzzy clustering method we propose removes the restriction imposed by exp. (5). Due to that constraint, the c-means algorithm tends to give larger membership values to outlier points. In the c-means algorithm, the membership of a point in a cluster is calculated based on its memberships in the other clusters, and many limitations of the algorithm arise from this. In the new method, the membership of a point in a cluster depends only on its distance in that cluster. For calculating the membership values we use a new, simple expression, given in exp. (6):
μ_j(x_i) = (Max(d_j) − d_ji) / Max(d_j)    (6)
where
μ_j(x_i) : the membership of x_i in the j-th cluster
d_ji : the distance of x_i from cluster center c_j
Max(d_j) : the maximum distance in the cluster c_j

Since Max(d_j)/Max(d_j) = 1, the above membership function (exp. 6) generates values close to one for smaller distances d_ji and a membership value of zero for the maximum distance. If the distance of a data point is zero, the function returns a membership value of one, and thus it overcomes the fourth constraint of the c-means algorithm. The membership values are calculated based only on the distance of a data member in the cluster, so the method does not suffer from the first and second constraints of the c-means algorithm. To overcome the third limitation of c-means in calculating new cluster centers, the new method inherits a feature of the k-means algorithm: a data point is completely assigned to the cluster in which it has maximum membership, and the point is used only for the calculation of the new cluster center of that cluster. This way the influence of a data point on the calculation of all the cluster centers is avoided. The point is used in the calculations only if its membership value falls above a threshold value α. This way we can ensure that outlier
points are not considered for the new cluster center calculations.

Unlike crisp clustering algorithms such as k-means, most fuzzy clustering algorithms are sensitive to the selection of the initial centroids: the effectiveness of the algorithms depends on the initial cluster seeds [17]. For better results we suggest using the k-means algorithm to obtain the initial cluster seeds; this way we can ensure that the algorithm converges to the best possible results. In the new method, since the membership values vary strictly from 0 to 1, we find that initializing the threshold value α to 0.5 produces good results in the cluster center calculation.
TABLE 2. THE NEW FUZZY CLUSTERING ALGORITHM
initialize p=number of clusters
initialize Cj (cluster centers)
initialize α (threshold value)
Repeat
For i=1 to n : Update μj(xi) applying (6)
For k=1 to p :
Sum=0
Count=0
For i=1 to n :
If μk(xi) is maximum among all clusters then
If μk(xi) >= α then
Sum=Sum+xi
count=count + 1
Ck=Sum/count
Until Ck estimates stabilize
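The pseudocode in Table 2 can be sketched in Python (an illustrative implementation, not the authors' code; Euclidean distance is assumed, the threshold α = 0.5 follows the text, and the sample data with an extreme outlier is invented):

```python
import math

def dist(x, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def new_memberships(points, centers):
    """Exp. (6): membership depends only on the point's distance
    in that cluster, scaled by the cluster's maximum distance."""
    dists = [[dist(x, c) for x in points] for c in centers]
    return [[(max(dj) - dj[i]) / max(dj) for dj in dists]
            for i in range(len(points))]

def new_fuzzy_clustering(points, centers, alpha=0.5, max_iter=100):
    """Table 2: a point contributes to a new center only where its
    membership is maximal AND at least the threshold alpha."""
    for _ in range(max_iter):
        mems = new_memberships(points, centers)
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, mu in zip(points, mems)
                       if mu[k] == max(mu) and mu[k] >= alpha]
            if members:
                new_centers.append(tuple(sum(v) / len(members)
                                         for v in zip(*members)))
            else:                       # empty cluster keeps its center
                new_centers.append(centers[k])
        if new_centers == centers:      # centers stabilized
            break
        centers = new_centers
    return centers, new_memberships(points, centers)

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (50, 50)]
cs, mm = new_fuzzy_clustering(pts, [(0, 0), (10, 10)])
print(mm[-1])   # memberships of the extreme outlier (50, 50)
```

Because the outlier is the farthest point from both centers, exp. (6) assigns it zero membership in both clusters, and the threshold test excludes it from every center calculation.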
The fuzzy membership values found with exp. (6) can be used in the new fuzzy clustering algorithm given in Table 2. The algorithm stops when the cluster centers stabilize. The algorithm is more efficient in handling data with outlier points and in overcoming the other constraints imposed by the c-means algorithm. The new method we propose is also far superior in the calculation of new cluster centers, which we demonstrate in the next section with two numerical examples.
VI. ILLUSTRATIONS
In order to evaluate the effectiveness of the new algorithm, we first applied it to a small synthetic data set to demonstrate its functioning in detail. The algorithm was also tested with real data collected for Bhutan's Gross National Happiness (GNH) program.
A. With synthetic data
Consider the sales of an item from different shops at different periods of the year, shown in figure 1. There is an outlier point at (12,400). If we start data exploration with two random cluster centers at C1(.5,150) and C2(8,150), the algorithm ends with the cluster centers at C1(1.52,166.20) and C2(5.49,172.11). We also applied the c-means algorithm to the same data set and found that it converges to the cluster centers C1(2.79,47.65) and C2(3.29,204.4).
TABLE 3. C-MEANS AND THE NEW METHOD

Method       Final Cluster Centers     Membership of outlier (12,400)
C-means      C1 (2.79, 47.65)          .37
             C2 (3.29, 204.4)          .63
New method   C1 (1.52, 166.20)         0
             C2 (5.49, 172.11)         0
Figure 1. The data points and final overlapping clusters
We also found that the algorithm is far superior to c-means in handling outlier points (see Table 3): the new algorithm is capable of giving very low membership values to the outlier points. If we start with C1 at (.5,150), the algorithm takes four steps to converge the first cluster center to (1.52,166.23). Similarly, the second cluster center C2 converges to (5.49,172.11) from (8,150) in four steps. The algorithm finally ends with the final cluster centers at C1(1.5,166.2) and C2(5.4,172.11) and with two overlapping clusters as shown in figure 1. The outlier point is given zero membership in both clusters.

From Table 3 it is clear that the new fuzzy clustering method we propose is better than the conventional c-means algorithm in handling outlier points and in the calculation of new cluster centers. Due to the constraint given in exp. (5), the c-means algorithm gives membership values .37 and .63 to the outlier point (12,400) so that the sum of the memberships of this point is one, whereas the new algorithm gives this point zero membership.
B. With natural data
Happiness and satisfaction are directly related to a community's ability to meet its basic needs, and these are important factors in safeguarding its physical health. The unique concept of Bhutan's Gross National Happiness (GNH) depends on nine factors such as health, ecosystem, and emotional well-being [16]. The GNH regional chapter at Sherubtse College conducted a survey among 1311 villagers, and the responses were converted into numeric values. For the analysis of the new method we took the attributes income and health index, as shown in figure 2.
Figure 2. The data points
TABLE 4. THE FINAL CLUSTER CENTERS

             C-means               New Method
Centers      X          Y          X          Y
C1           24243.11   6.53       24376.32   8.135
C2           69749.6    5.08       68907      5.62
C3           115979.3   2.83       113264.6   1.815
As we can see from the data set, in Bhutan the low-income group maintains better health than the high-income group, since they are self-sufficient in many ways. Like any other natural data, this data set contains many outlier points which do not belong to any of the groups. If we apply the c-means algorithm, these points tend to get larger membership values due to exp. (4).

To start the data analysis, we first applied the k-means algorithm to find the initial three cluster centers. The algorithm ended with three cluster centers at C1(24243,6.7), C2(69794,5.1) and C3(11979.29,2.72). We applied these initial values in both the c-means algorithm and the new method to analyze the data, and the algorithms ended with the centroids given in Table 4.

From figure 2 and Table 4 it can be seen that the final centroids of the c-means method do not represent the actual centers of the clusters; this is due to the influence of the outlier points. The memberships of all the points are considered in the calculation of the cluster centers, so the cluster centers tend to move towards the center of all the points. The new method identifies the cluster centers in a better way. The efficiency of the new algorithm lies in its ability to give very low membership values to outlier points and to consider a point for the calculation of only one cluster center.
VII. CONCLUSION
A good clustering algorithm produces high-quality clusters with low inter-cluster similarity and high intra-cluster similarity. Many conventional clustering algorithms, such as the k-means and fuzzy c-means algorithms, achieve this on crisp and highly structured data, but they have difficulties in handling unstructured natural data, which often contains outlier points. The proposed fuzzy clustering algorithm combines the positive aspects of both crisp and fuzzy clustering algorithms. It is more efficient than both k-means and fuzzy c-means in handling natural data with outlier points, which it achieves by assigning very low membership values to the outlier points. The algorithm has limitations in exploring highly structured crisp data which is free from outlier points, and its efficiency has to be further tested on comparatively larger data sets.
REFERENCES
[1] Sankar K. Pal, P. Mitra, "Data Mining in Soft Computing Framework: A Survey", IEEE Transactions on Neural Networks, vol. 13, no. 1, January 2002.
[2] R. Kruse, C. Borgelt, "Fuzzy Data Analysis: Challenges and Perspectives", http://citeseer.ist.psu.edu/kruse99fuzzy.html
[3] G. Raju, Th. Shanta Kumar, Binu Thomas, "Integration of Fuzzy Logic in Data Mining: A Comparative Case Study", Proc. of International Conf. on Mathematics and Computer Science, Loyola College, Chennai, pp. 128-136, 2008.
[4] Maria Halkidi, "Quality Assessment and Uncertainty Handling in Data Mining Process", http://citeseer.ist.psu.edu/halkidi00quality.html
[5] W. H. Inmon, "The data warehouse and data mining", Commn. ACM, vol. 39, pp. 49-50, 1996.
[6] U. Fayyad and R. Uthurusamy, "Data mining and knowledge discovery in databases", Commn. ACM, vol. 39, pp. 24-27, 1996.
[7] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", http://citeseer.ist.psu.edu/berkhin02survey.html
[8] Chau, M., Cheng, R., and Kao, B., "Uncertain Data Mining: A New Research Direction", www.business.hku.hk/~mchau/papers/UncertainDataMining_WSA.pdf
[9] Keith C. C. Chan, Wai-Ho Au, B. Choi, "Mining Fuzzy Rules in a Donor Database for Direct Marketing by a Charitable Organization", Proc. of First IEEE International Conference on Cognitive Informatics, pp. 239-246, 2002.
[10] E. Cox, Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration, Elsevier, 2005.
[11] G. J. Klir, T. A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice Hall, 1988.
[12] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2003.
[13] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, N.Y., 1973.
[14] Carl G. Looney, "A Fuzzy Clustering and Fuzzy Merging Algorithm", http://citeseer.ist.psu.edu/399498.html
[15] Frank Klawonn, Annette Keller, "Fuzzy Clustering Based on Modified Distance Measures", http://citeseer.ist.psu.edu/klawonn99fuzzy.html
[16] Sullen Donnelly, "How Bhutan Can Develop and Measure GNH", www.bhutanstudies.org.bt/seminar/0402-gnh/GNH-papers-1st_18-20.pdf
[17] Lei Jiang and Wenhui Yang, "A Modified Fuzzy C-Means Algorithm for Segmentation of Magnetic Resonance Images", Proc. VIIth Digital Image Computing: Techniques and Applications, pp. 225-231, 10-12 Dec. 2003, Sydney.