A Novel Fuzzy Clustering Method for Outlier Detection in Data Mining


    INFORMATION PAPER

    International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009

    161

    A Novel Fuzzy Clustering Method for Outlier

    Detection in Data Mining

Binu Thomas¹ and Raju G.²

¹ Research Scholar, Mahatma Gandhi University, Kerala, [email protected]

² SCMS School of Technology & Management, Cochin, Kerala, [email protected]

Abstract: In data mining, conventional clustering algorithms have difficulties in handling the challenges posed by collections of natural data, which are often vague and uncertain. Fuzzy clustering methods have the potential to manage such situations efficiently. This paper introduces the limitations of conventional clustering methods through k-means and fuzzy c-means clustering, and demonstrates the drawbacks of these algorithms in handling outlier points. We then propose a new fuzzy clustering method which is more efficient in handling outlier points than the conventional fuzzy c-means algorithm. The new method excludes outlier points by giving them extremely small membership values in the existing clusters, while the fuzzy c-means algorithm tends to give them outsized membership values. The new algorithm also incorporates the positive aspects of the k-means algorithm by calculating the new cluster centers more efficiently than the c-means method.

Index Terms: fuzzy clustering, outlier points, knowledge discovery, c-means algorithm

    I. INTRODUCTION

The process of finding useful patterns and information in raw data is often known as knowledge discovery in databases, or KDD. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns (models) from data [5]. Cluster analysis is a technique for breaking data down into related components in such a way that patterns and order become visible. It aims at sifting through large volumes of data in order to reveal useful information in the form of new relationships, patterns, or clusters, for decision-making by a user. Clusters are natural groupings of data items based on similarity metrics or probability density models. A clustering algorithm maps a new data item into one of several known clusters. In fact, cluster analysis has the virtue of strengthening the exposure of patterns and behavior as more and more data becomes available [7]. A cluster has a center of gravity, which is basically the weighted average of the cluster. Membership of a data item in a cluster can be determined by measuring the distance from each cluster center to the data point [6]; the data item is added to the cluster for which this distance is a minimum.

This paper provides an overview of the crisp clustering technique, the advantages and limitations of fuzzy c-means clustering, and a new fuzzy clustering method which is simple and superior to c-means clustering in handling outlier points. Section 2 describes the basic notions of clustering and introduces the k-means clustering algorithm. In Section 3 we explain the concepts of vagueness and uncertainty in natural data. Section 4 introduces the fuzzy clustering method and describes how it can handle vagueness and uncertainty through overlapping clusters with partial membership functions; the same section also introduces the most common fuzzy clustering algorithm, the c-means algorithm, and ends with its limitations. Section 5 proposes the new fuzzy clustering method. Section 6 demonstrates the concepts presented in the paper. Finally, Section 7 concludes the paper.

    II. CRISP CLUSTERING TECHNIQUES

Traditional clustering techniques attempt to segment data by grouping related attributes into uniquely defined clusters. Each data point in the sample space is assigned to only one cluster. The k-means algorithm and its different variations are the most well-known and commonly used partitioning methods. The value k stands for the number of cluster seeds initially provided to the algorithm. The algorithm takes the input parameter k and partitions a set of m objects into k clusters [7]. The technique works by computing the distance between a data point and the cluster centers and adding the item to one of the clusters, so that intra-cluster similarity is high but inter-cluster similarity is low. A common way to find the distance is to calculate the sum of the squared differences; this is known as the Euclidean distance [10] (exp. 1):

d_k = Σ_{j=1}^{n} (X_{kj} − C_j)²    (1)

where
d_k : the distance of the k-th data point from the cluster center C_j, summed over the n attributes
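The squared-difference distance of exp. (1) is a one-liner in most numeric languages. The following Python/NumPy sketch (the function name is ours, not from the paper) computes the distance of one data point from one cluster center:

```python
import numpy as np

def squared_distance(x, c):
    """Sum of squared differences between a data point x and a
    cluster center c, as in exp. (1)."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(np.sum((x - c) ** 2))
```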

With the distance of a data point from the cluster centers defined, the k-means algorithm is fairly simple. The cluster centers are randomly initialized, and each data point x_i is assigned to the cluster to which it has minimum distance. When all the data points have been assigned to clusters, new cluster centers are calculated by finding the weighted average of all data points in each cluster. The cluster center calculation moves the previous centroid location towards the center of the cluster set. This is continued until there is no change in the cluster centers.
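This assign-then-recompute loop can be sketched as follows. This is a minimal illustration in Python/NumPy; the function name, the random initialization, and the stopping test are our assumptions, not the paper's code:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest center,
    then recompute each center as the mean of its members."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # randomly pick k distinct points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # d[i, j]: squared Euclidean distance of point i from center j
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)          # crisp assignment: one cluster each
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # no change in centers -> stop
            break
        centers = new_centers
    return centers, labels
```

Because every point is forced into exactly one cluster, an extreme outlier passed to this sketch would still be absorbed by its nearest cluster, which is precisely the limitation discussed next.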

© 2009 ACADEMY PUBLISHER


    A. Limitations of k-means algorithm

The main limitation of the algorithm comes from its crisp nature in assigning cluster membership to data points. Depending on the minimum distance, a data point always becomes a member of exactly one of the clusters. This works well with highly structured data, but real-world data is almost never arranged in clear-cut groups. Instead, clusters have ill-defined boundaries that smear into the data space, often overlapping the perimeters of surrounding clusters [4]. In most cases, real-world data also contains apparent extraneous data points clearly not belonging to any of the clusters; these are called outlier points. The k-means algorithm is not capable of dealing with overlapping clusters and outlier points, since it has to include every data point in one of the existing clusters. Because of this, even extreme outlier points will be included in a cluster based on the minimum distance.

    III. FUZZY LOGIC

The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy logic is capable of supporting, to a reasonable extent, human-type reasoning in natural form by allowing partial membership for data items in fuzzy subsets [2]. Integration of fuzzy logic with data mining techniques has become one of the key constituents of soft computing in handling the challenges posed by massive collections of natural data [1]. Fuzzy logic is the logic of fuzzy sets. A fuzzy set has, potentially, an infinite range of truth values between one and zero [3]. Propositions in fuzzy logic have a degree of truth, and membership in fuzzy sets can be fully inclusive, fully exclusive, or some degree in between [13]. A fuzzy set is distinct from a crisp set in that it allows its elements to have a degree of membership. The core of a fuzzy set is its membership function: a function which defines the relationship between a value in the set's domain and its degree of membership in the fuzzy set (exp. 2). The relationship is functional because it returns a single degree of membership for any value in the domain [11]:

μ = f(s, x)    (2)

where
μ : the fuzzy membership value for the element
s : the fuzzy set
x : the value from the underlying domain
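For instance, a membership function for a fuzzy set such as "tall" can be written directly from this definition. The sketch below is an illustration of ours, not an example from the paper; it returns a degree of membership between zero and one:

```python
def mu_tall(height_cm):
    """Illustrative membership function for a fuzzy set 'tall':
    zero membership below 160 cm, full membership above 180 cm,
    and a linear degree of membership in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 180:
        return 1.0
    return (height_cm - 160) / 20.0
```

A height of 170 cm is thus "somewhat tall", with partial membership in the set rather than a crisp in/out decision.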

Fuzzy sets provide a means of defining a series of overlapping concepts for a model variable, since they represent degrees of membership. A value from the complete universe of discourse for a variable can have membership in more than one fuzzy set.

IV. FUZZY CLUSTERING METHODS

The central idea in fuzzy clustering is the non-unique partitioning of the data into a collection of clusters. Each data point is assigned a membership value for each of the clusters, and the fuzzy clustering algorithms allow the clusters to grow into their natural shapes [15]. In some cases the membership value may be zero, indicating that the data point is not a member of the cluster under consideration. Many crisp clustering techniques have difficulties in handling extreme outliers, but fuzzy clustering algorithms tend to give them very small membership degrees in the surrounding clusters [14]. The non-zero membership values, with a maximum of one, show the degree to which a data point represents a cluster. Thus fuzzy clustering provides a flexible and robust method for handling natural data with vagueness and uncertainty. In fuzzy clustering, each data point has an associated degree of membership for each cluster; the membership value is in the range zero to one and indicates the strength of the point's association with that cluster.

A. C-means fuzzy clustering algorithm [10]

Fuzzy c-means clustering involves two processes: the calculation of cluster centers and the assignment of points to these centers using a form of Euclidean distance. This process is repeated until the cluster centers stabilize. The algorithm is similar to k-means clustering in many ways, but it assigns each data item a membership value for each cluster in the range 0 to 1. It thus incorporates the fuzzy-set concept of partial membership, and forms overlapping clusters to support it. The algorithm needs a fuzzification parameter m in the range [1, n] which determines the degree of fuzziness in the clusters. When m reaches the value of 1 the algorithm works like a crisp partitioning algorithm; for larger values of m the overlapping of clusters tends to increase. The algorithm calculates the membership value with the formula:

μ_j(x_i) = (1/d_ji)^(1/(m−1)) / Σ_{k=1}^{p} (1/d_ki)^(1/(m−1))    (3)

where
μ_j(x_i) : the membership of x_i in the j-th cluster
d_ji : the distance of x_i from the center of cluster c_j
m : the fuzzification parameter
p : the number of specified clusters
d_ki : the distance of x_i from the center of cluster c_k

The new cluster centers are calculated with these membership values using exp. (4):

c_j = Σ_i [μ_j(x_i)]^m · x_i / Σ_i [μ_j(x_i)]^m    (4)

where
c_j : the center of the j-th cluster
x_i : the i-th data point


μ_j : the function which returns the membership of x_i in cluster j
m : the fuzzification parameter

This is a special form of weighted average: we raise x_i's current membership to the power m and multiply the result by x_i, then divide the sum of these products by the sum of the fuzzified memberships. The first loop of the algorithm calculates membership values for the data points in the clusters, and the second loop recalculates the cluster centers using these membership values. When the cluster centers stabilize (when there is no change), the algorithm ends.

    B. Limitations of the algorithm

The fuzzy c-means approach to clustering suffers from several constraints that affect its performance [10]. The main drawback comes from the restriction that the sum of the membership values of a data point x_i over all the clusters must be one, as in exp. (5); this tends to give high membership values to outlier points, so the algorithm has difficulty in handling them. Secondly, the membership of a data point in a cluster depends directly on its membership values with respect to the other cluster centers, and this sometimes produces undesirable results.

Σ_{j=1}^{p} μ_j(x_i) = 1    (5)

In the fuzzy c-means method a point has partial membership in all the clusters, and exp. (4) calculates the new cluster centers as a special form of weighted average of all the data points. The third limitation of the algorithm is that, due to the influence (partial membership) of all the data members, the cluster centers tend to move towards the center of all the data points [10]. The fourth constraint of the algorithm is its inability to calculate the membership value when the distance of a data point is zero (exp. 3).

TABLE 1. FUZZY C-MEANS ALGORITHM

initialize p = number of clusters
initialize m = fuzzification parameter
initialize C_j (cluster centers)
Repeat
    For i = 1 to n : update μ_j(x_i) applying (3)
    For j = 1 to p : update C_j with (4), using the current μ_j(x_i)
Until the C_j estimates stabilize
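The two loops of Table 1 can be sketched compactly in Python/NumPy. The snippet below is our reading of exps. (3) and (4); the small eps guarding against division by zero is our addition, working around the zero-distance constraint discussed above:

```python
import numpy as np

def fuzzy_c_means(points, centers, m=2.0, max_iter=100, eps=1e-9):
    """Fuzzy c-means per Table 1: alternate the membership update of
    exp. (3) with the weighted-average center update of exp. (4)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # d[i, j]: distance of point i from center j (eps avoids 1/0)
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) + eps
        inv = (1.0 / d) ** (1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)   # exp. (3): each row sums to one
        um = u ** m
        new_centers = (um.T @ points) / um.sum(axis=0)[:, None]   # exp. (4)
        if np.allclose(new_centers, centers):      # centers stabilized -> stop
            break
        centers = new_centers
    return centers, u
```

Note how the row normalization enforces exp. (5): even a far-away outlier receives memberships that sum to one, which is the behavior the proposed method is designed to avoid.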

V. THE NEW FUZZY CLUSTERING METHOD

The new fuzzy clustering method we propose removes the restriction imposed by exp. (5). Due to this constraint, the c-means algorithm tends to give outlier points large membership values. In the c-means algorithm, the membership of a point in a cluster is calculated based on its memberships in the other clusters, and many limitations of the algorithm arise from this. In the new method, the membership of a point in a cluster depends only on its distance within that cluster. For calculating the membership values we use a new, simple expression, given in exp. (6):

μ_j(x_i) = (Max(d_j) − d_ji) / Max(d_j)    (6)

where
μ_j(x_i) : the membership of x_i in the j-th cluster
d_ji : the distance of x_i from the center of cluster c_j
Max(d_j) : the maximum distance in cluster c_j

Since Max(d_j)/Max(d_j) = 1, the membership function (exp. 6) generates values close to one for smaller distances d_ji and a membership value of zero for the maximum distance. If the distance of a data point is zero, the function returns a membership value of one, and

thus it overcomes the fourth constraint of the c-means algorithm. The membership values are calculated based only on the distance of a data member within the cluster, so the method does not suffer from the first and second constraints of the c-means algorithm. To overcome the third limitation of c-means in calculating new cluster centers, the new method inherits a feature of the k-means algorithm: a data point is assigned completely to the cluster where it has the maximum membership, and the point is used only in the calculation of the new center of that cluster. This way the influence of a data point on the calculation of all the cluster centers is avoided. The point is used in the calculations only if its membership value falls above a threshold value α. This ensures that outlier points are not considered in the new cluster center calculations. Unlike crisp clustering algorithms such as k-means, most fuzzy clustering algorithms are sensitive to the selection of the initial centroids; the effectiveness of the algorithms depends on the initial cluster seeds [17]. For better results we suggest using the k-means algorithm to obtain the initial cluster seeds, which helps the algorithm converge to the best possible results. In the new method, since the membership values vary strictly from 0 to 1, we find that initializing the threshold value α to 0.5 produces better results in the cluster center calculation.


TABLE 2. THE NEW FUZZY CLUSTERING ALGORITHM

initialize p = number of clusters
initialize C_j (cluster centers)
initialize α (threshold value)
Repeat
    For i = 1 to n : update μ_j(x_i) applying (6)
    For k = 1 to p :
        Sum = 0
        Count = 0
        For i = 1 to n :
            If μ_k(x_i) is maximum over the clusters then
                If μ_k(x_i) >= α then
                    Sum = Sum + x_i
                    Count = Count + 1
        C_k = Sum / Count
Until the C_k estimates stabilize

The fuzzy membership values found with exp. (6) are used in the new fuzzy clustering algorithm given in Table 2. The algorithm stops when the cluster centers stabilize. The algorithm is more efficient in handling data with outlier points and in overcoming the other constraints imposed by the c-means algorithm. The new method we propose is also far superior in the calculation of new cluster centers, which we demonstrate in the next section with two numerical examples.
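A self-contained sketch of Table 2 follows (the naming and the toy data are ours; α = 0.5 as suggested above). Each point's membership comes from exp. (6) alone, and only points whose maximum membership lies in a cluster and exceeds α enter that cluster's k-means-style center update:

```python
import numpy as np

def new_fuzzy_clustering(points, centers, alpha=0.5, max_iter=100):
    """The proposed method (Table 2): memberships from exp. (6);
    each center is recomputed from the points whose maximum
    membership falls in that cluster and exceeds alpha."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # d[i, j]: distance of point i from center j
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dmax = np.maximum(d.max(axis=0), 1e-12)    # Max(d_j) per cluster
        u = (dmax[None, :] - d) / dmax[None, :]    # exp. (6): 1 at d=0, 0 at Max(d_j)
        best = u.argmax(axis=1)                    # cluster of maximum membership
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[(best == j) & (u[:, j] >= alpha)]
            if len(members):                       # points below alpha (outliers) excluded
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):      # centers stabilized -> stop
            break
        centers = new_centers
    return centers, u
```

On a toy set with two tight groups and one distant outlier, the outlier is the farthest point from every center, so exp. (6) gives it membership zero in every cluster and it never moves a center.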

VI. ILLUSTRATIONS

To evaluate the effectiveness of the new algorithm, we first applied it to a small synthetic data set to demonstrate its functioning in detail. The algorithm was also tested with real data collected for Bhutan's Gross National Happiness (GNH) program.

    A. With synthetic data

Consider the sales of an item from different shops at different periods of the year, as shown in figure 1. There is an outlier point at (12, 400). If we start data exploration with two random cluster centers at C1(0.5, 150) and C2(8, 150), the algorithm ends with the cluster centers at C1(1.52, 166.20) and C2(5.49, 172.11). We also applied the c-means algorithm to the same data set and found that it converges to the cluster centers C1(2.79, 47.65) and C2(3.29, 204.4).

TABLE 3. C-MEANS AND NEW METHOD (OUTLIER POINT: (12, 400))

Method       Final Cluster Centers      Outlier membership
C-means      C1 (2.79, 47.65)           0.37
             C2 (3.29, 204.4)           0.63
New method   C1 (1.52, 166.20)          0
             C2 (5.49, 172.11)          0

    Figure 1. The data points and final overlapping clusters

We also found that the algorithm is far superior to c-means in handling outlier points (see Table 3): the new algorithm is capable of giving very low membership values to the outlier points. If we start with C1 at (0.5, 150), the algorithm takes four steps to converge the first cluster center to (1.52, 166.23). Similarly, the second cluster center C2 converges from (8, 150) to (5.49, 172.11) in four steps. The algorithm finally ends with the cluster centers at C1(1.5, 166.2) and C2(5.4, 172.11), and with two overlapping clusters as shown in figure 1. The outlier point is given zero membership in both clusters. From Table 3 it is clear that the new fuzzy clustering method we propose is better than the conventional c-means algorithm both in handling outlier points and in the calculation of new cluster centers. Due to the constraint given in expression (5), the c-means algorithm gives membership values 0.37 and 0.63 to the outlier point (12, 400), so that the sum of the memberships of this point is one. The new algorithm, in contrast, gives this point zero membership.

B. With natural data

Happiness and satisfaction are directly related to a community's ability to meet its basic needs, and these are important factors in safeguarding physical health. The unique concept of Bhutan's Gross National Happiness (GNH) depends on nine factors, such as health, ecosystem, and emotional well-being [16]. The GNH regional chapter at Sherubtse College conducted a survey among 1311 villagers, and the responses were converted into numeric values. For the analysis of the new method we took the attributes income and health index, as shown in figure 2.

    Figure 2. The data points


TABLE 4. THE FINAL CLUSTER CENTERS

Centers    C-means (X, Y)        New Method (X, Y)
C1         24243.11, 6.53        24376.32, 8.135
C2         69749.6, 5.08         68907, 5.62
C3         115979.3, 2.83        113264.6, 1.815

As we can see from the data set, in Bhutan the low-income group maintains better health than the high-income group, since they are self-sufficient in many ways. Like any other natural data, this data set contains many outlier points which do not belong to any of the groups. If we apply the c-means algorithm, these points tend to get high membership values due to exp. (5). To start the data analysis, we first applied the k-means algorithm to find the initial three cluster centers. The algorithm ended with the three cluster centers at C1(24243, 6.7), C2(69794, 5.1) and C3(11979.29, 2.72). We applied these initial values in both the c-means algorithm and the new method, and the algorithms ended with the centroids given in Table 4. From figure 2 and Table 4 it can be seen that the final centroids of the c-means method do not represent the actual centers of the clusters; this is due to the influence of the outlier points, since the memberships of all the points are considered in the calculation of the cluster centers, which therefore tend to move towards the center of all the points. The new method identifies the cluster centers in a better way. The efficiency of the new algorithm lies in its ability to give very low membership values to outlier points and to consider each point in the calculation of only one cluster center.

VII. CONCLUSION

A good clustering algorithm produces high-quality clusters, with low inter-cluster similarity and high intra-cluster similarity. Many conventional clustering algorithms, such as k-means and fuzzy c-means, achieve this on crisp and highly structured data, but they have difficulties in handling unstructured natural data, which often contains outlier points. The proposed fuzzy clustering algorithm combines the positive aspects of both crisp and fuzzy clustering algorithms. It is more efficient in handling natural data with outlier points than both the k-means and fuzzy c-means algorithms, and it achieves this by assigning very low membership values to the outlier points. The algorithm has limitations, however, in exploring highly structured crisp data that is free from outlier points, and its efficiency has yet to be tested on comparatively larger data sets.

    REFERENCES

[1] Sankar K. Pal, P. Mitra, "Data Mining in Soft Computing Framework: A Survey", IEEE Transactions on Neural Networks, vol. 13, no. 1, January 2002.
[2] R. Kruse, C. Borgelt, "Fuzzy Data Analysis: Challenges and Perspectives", http://citeseer.ist.psu.edu/kruse99fuzzy.html
[3] G. Raju, Th. Shanta Kumar, Binu Thomas, "Integration of Fuzzy Logic in Data Mining: A Comparative Case Study", Proc. of the International Conf. on Mathematics and Computer Science, Loyola College, Chennai, pp. 128-136, 2008.
[4] Maria Halkidi, "Quality Assessment and Uncertainty Handling in Data Mining Process", http://citeseer.ist.psu.edu/halkidi00quality.html
[5] W. H. Inmon, "The data warehouse and data mining", Commn. ACM, vol. 39, pp. 49-50, 1996.
[6] U. Fayyad and R. Uthurusamy, "Data mining and knowledge discovery in databases", Commn. ACM, vol. 39, pp. 24-27, 1996.
[7] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", http://citeseer.ist.psu.edu/berkhin02survey.html
[8] M. Chau, R. Cheng, B. Kao, "Uncertain Data Mining: A New Research Direction", www.business.hku.hk/~mchau/papers/UncertainDataMining_WSA.pdf
[9] Keith C. C. Chan, Wai-Ho Au, B. Choi, "Mining Fuzzy Rules in a Donor Database for Direct Marketing by a Charitable Organization", Proc. of the First IEEE International Conference on Cognitive Informatics, pp. 239-246, 2002.
[10] E. Cox, Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration, Elsevier, 2005.
[11] G. J. Klir, T. A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice Hall, 1988.
[12] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2003.
[13] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, N.Y., 1973.
[14] Carl G. Looney, "A Fuzzy Clustering and Fuzzy Merging Algorithm", http://citeseer.ist.psu.edu/399498.html
[15] Frank Klawonn, Annette Keller, "Fuzzy Clustering Based on Modified Distance Measures", http://citeseer.ist.psu.edu/klawonn99fuzzy.html
[16] Sullen Donnelly, "How Bhutan Can Develop and Measure GNH", www.bhutanstudies.org.bt/seminar/0402-gnh/GNH-papers-1st_18-20.pdf
[17] Lei Jiang and Wenhui Yang, "A Modified Fuzzy C-Means Algorithm for Segmentation of Magnetic Resonance Images", Proc. VIIth Digital Image Computing: Techniques and Applications, pp. 225-231, 10-12 Dec. 2003, Sydney.
