
Fully Unsupervised Possibilistic Entropy Clustering

Lei Wang, Hongbing Ji, Xinbo Gao



Abstract—In this paper, we address the problem of entropy-based clustering in the framework of possibility theory. First, we introduce possibilistic entropy with a brief discussion. Then we develop possibilistic entropy theory for clustering analysis and investigate the general Possibilistic Entropy Clustering (PEC) problem, based on which a Fully Unsupervised Possibilistic Entropy Clustering (FUPEC) algorithm is elaborated in detail. The algorithm has the following advantages: (1) it has a clear physical meaning and well-defined mathematical features; (2) it automatically determines the number of clusters; (3) it automatically controls the resolution parameter during the clustering process; (4) it overcomes sensitivity to initialization and to noise and outliers. Finally, we illustrate the effectiveness of this novel algorithm with various examples.

I. INTRODUCTION

Clustering analysis has been one of the fundamental research tools in computer vision, pattern classification, and data mining. It attempts to discover an inherent structure in a set of data points. Recently, many entropy-based algorithms have been proposed for clustering analysis, which consider an objective function containing an entropy term [1] [11] [17] [18]. The memberships in these algorithms have a normalized Gaussian function form, which ensures that noise and outliers are assigned little weight, thus making these algorithms robust. These methods are summarized in [19] as a combination of an entropy term and constraint terms, with a parameter or Lagrange multiplier balancing the roles of the fuzzy entropy and the distortion measure between the prototypes and the feature vectors in the objective function. Compared with the parameter m in FCM-type algorithms [2], these parameters have a clearer physical meaning. However, these complicated parameters are also usually specified or empirically determined by the user. On the other hand, these fuzzy entropy terms are introduced directly using an entropy form analogous to Shannon's formulation of information theory [9] [12], where the probability entropy represents the average uncertainty of a random process and deals with probability distributions. When it is applied to combinatorial optimization problems such as clustering analysis, some of its properties and merits remain to be studied.

L. Wang is with Lab. 202 of School of Electronic Engineering, Xidian University, 710071 Xi'an, China (Corresponding author, phone: 86-29-88207447, email: [email protected])

HB Ji is with Lab. 202 of School of Electronic Engineering, Xidian University, 710071 Xi'an, China (email: [email protected])

XB Gao is with Lab. 202 of School of Electronic Engineering, Xidian University, 710071 Xi'an, China (email: [email protected]).

Our study is inspired by the work of Nasraoui and Krishnapuram [20] and the work of Frigui and Krishnapuram [5]. The former used a delicate iterative scheme for the scale parameter during the clustering process, and the latter performed a kind of progressive clustering scheme with an overspecified number of clusters. Although the membership in [20] has a similar form to the membership derived from our algorithm, it is introduced by definition. It will be shown later that the method proposed here has a more general interpretation of membership than many available methods and can derive the representation of the membership naturally from a combinatorial objective function.

II. POSSIBILISTIC ENTROPY

Most concepts of entropy are closely related to Shannon's formulation of information theory [12], in which the information or negative entropy is given by the formula

$$I = -\sum_{i=1}^{k} p_i \ln p_i \qquad (1)$$

where there are k different components in a system and p_i is the probability of occurrence of the i-th component. These components are often symbols and are used to describe or transmit information. It should be noted that the concept of entropy discussed above is related to probability entropy in nature.

Fuzzy entropy is introduced in many clustering algorithms to measure the uncertainty of a fuzzy set. It is often integrated into the objective function as an entropy term. Similar to the entropy of a probability distribution defined by Shannon [21], the fuzzy entropy term in these algorithms [11] [16] [17] often has the following form

$$-\sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \ln u_{ij} \qquad (2)$$

where the membership u_ij denotes the grade of the feature point or vector x_j belonging to the i-th fuzzy subset, and N is the total number of feature vectors. The entropy of the data set is a combination of the entropies of C independent clusters, each of which is written as

$$-\sum_{j=1}^{N} u_{ij} \ln u_{ij} \qquad (3)$$

Similar to the entropy in Shannon’s information theory, it is required that memberships should satisfy probability constraints in addition to the range requirement [0.0,1.0], that is,

$$\sum_{i=1}^{C} u_{ij} = 1 \qquad (4)$$


The physical meaning of this probability entropy lies in that the entropy is lower if the data set has orderly configurations or is separable, and higher if the data set has disorderly configurations or is difficult to separate [23]. If we try to visualize the complete data set from individual points, then an orderly configuration means that for most individual data points there are some data points close to them (i.e., they probably belong to the same cluster) and others away from them. By the same reasoning, a disorderly configuration means that most of the data points are scattered randomly [4].

Things are somewhat different when probability entropy is introduced into the clustering space to optimize a combinatorial programming problem. On the one hand, one may wish the entropy of a clustered configuration to be as low as possible, which means the partition of the data set is efficient. However, minimization of the entropy may lead to the situation in which all feature points are in the same cluster and all other clusters are empty. On the other hand, when we compare the probability entropy in clustering analysis with the one in information theory, a difference may be found: the probability constraint on the entropy of a cluster is associated with the number of clusters to be found, while Shannon's entropy is constrained over all components. Although these two entropies are similar in form, they are indeed different in their constraint conditions.

Here we introduce a new type of entropy to measure uncertainty, which is different from the probability entropy in Shannon's information theory and lies within the framework of possibility theory. It is given by the formula

$$E = -\sum_{i=1}^{k} p_i \ln p_i \qquad (5)$$

where k represents the number of different components in a system and p_i is the possibility of occurrence of the i-th component. To distinguish it from the entropy of a probability distribution in Shannon's information theory, we name it Possibilistic Entropy.
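To make the distinction concrete, the following small sketch (an illustration added here, not part of the original paper) evaluates the common form of Eqs. (1) and (5) on a probability vector and on a possibility vector; the only structural difference is that the possibilities are not required to sum to one.

```python
import numpy as np

def entropy_term(p):
    """Evaluate -sum(p * ln(p)), the shared form of Eqs. (1) and (5)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # by convention 0 * ln(0) = 0
    return -np.sum(p[nz] * np.log(p[nz]))

probabilities = [0.7, 0.2, 0.1]     # must sum to 1 (Shannon, Eq. (1))
possibilities = [1.0, 0.8, 0.1]     # need not sum to 1 (possibilistic, Eq. (5))

print(entropy_term(probabilities))  # probability entropy
print(entropy_term(possibilities))  # possibilistic entropy
```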

When we apply the concept of possibilistic entropy to the combinatorial optimization problem in the clustering space, the possibilistic entropy of a data set may be written as

$$-\sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \ln u_{ij} \qquad (6)$$

where u_ij is the possibilistic membership. It is obvious that possibilistic entropy has a similar meaning to fuzzy or probability entropy and also represents the degree of disorder of the fuzzy sets (i.e., a lower entropy indicates that the data set has orderly configurations or is separable). The possibilistic entropy of the data set is a combination of the possibilistic entropies of C independent clusters, as follows,

$$-\sum_{j=1}^{N} u_{ij} \ln u_{ij} \qquad (7)$$

It is obvious that Eq. (7) is in accord with Eq. (5), as neither carries probability constraints. We will see later that the extreme situation in which all the feature points are clustered into one cluster by minimization of the entropy term can be restrained by adding a penalty term to the objective function.

III. POSSIBILISTIC ENTROPY CLUSTERING

We develop our clustering algorithm in the framework of possibilistic entropy theory. For a data set with clustering tendency, since we wish to subdivide it into a set of subsets or clusters with an orderly configuration or minimum uncertainty, one may minimize the entropy for a global effect. Mathematically, the possibilistic entropy clustering problem is written as

minimize: $$-\sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \ln u_{ij} \qquad (8)$$

We name Eq. (8) the entropy term. In order to avoid the singular solution, or the extreme situation in which all the feature points are clustered into one cluster by minimization of the entropy term, we may add a penalty term to the objective function for each cluster with regard to local effect. It is desired that every cluster contain as many close points as possible (i.e., points that belong to one cluster). For each cluster, this gives the following objective function,

maximize: $$\sum_{j=1}^{N} u_{ij} \qquad (9)$$

As a whole, the objective function is converted into the following form to be minimized,

$$J = -\sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \ln u_{ij} - \sum_{i=1}^{C} \alpha_i \sum_{j=1}^{N} u_{ij} \qquad (10)$$

where α_i is the adjustment factor that balances the entropy term and the penalty term in the clustering effect. It can be determined experimentally a priori. However, unlike the parameters in previous methods, it is independent of the data set and can be set to a general constant. Further analysis of α_i can be found in Section IV.

It is the added penalty term that adjusts the tendency of the minimization of the objective function and inclines the algorithm to distinguish the possible distributions and to represent the "typicality" of feature points with respect to the distribution sources. Take two-point clustering in a domain with cardinality 2 as an example (see the numerical sketch below): the best case (each cluster contains one point related to one distribution) and the worst case (only one cluster is identified, containing both points related to just one distribution, while the other cluster is kept empty) may achieve the same minimal possibilistic entropy; however, they can be distinguished with the help of the penalty term discussed above, in terms of combinatorial possibility distributions.
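The following numerical sketch (our illustration, not from the original paper) evaluates the objective of Eq. (10) for this two-point example, using the exponential membership form introduced in Eq. (14) below with a common α; both configurations drive the entropy term to essentially zero, but the penalty term separates them.

```python
import numpy as np

def entropy_term(U):
    """Entropy term of Eq. (10): -sum u*ln(u), with 0*ln(0) = 0."""
    U = np.asarray(U, dtype=float)
    nz = U > 0
    return -np.sum(U[nz] * np.log(U[nz]))

def objective(U, alpha):
    """Combinatorial objective of Eq. (10) with a common alpha_i = alpha."""
    return entropy_term(U) - alpha * np.sum(U)

x = np.array([0.0, 1.0])           # two points from two distinct sources
beta = 50.0                        # high resolution, so memberships are nearly crisp
alpha = 0.5

def memberships(prototypes):
    """u_ij = exp(-beta * d_ij^2), the form of Eq. (14); one row per cluster."""
    return np.exp(-beta * np.subtract.outer(prototypes, x) ** 2)

U_best  = memberships(np.array([0.0, 1.0]))    # each cluster covers one source
U_worst = memberships(np.array([0.0, 1e6]))    # one real cluster, the other left empty

print(entropy_term(U_best), entropy_term(U_worst))          # both ~0: entropy alone cannot decide
print(objective(U_best, alpha), objective(U_worst, alpha))  # penalty favors the best case (~-1.0 vs ~-0.5)
```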

Here we define a loss function (the within-group sum of squared-error (WGSS)) as follows,

$$L = \sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} d_{ij}^{2} \qquad (11)$$

where d_ij^2 is the distance of the feature point or vector x_j to the prototype of cluster C_i. Different distance measures lead to different algorithms tailored to detect desired shapes, such as the Euclidean distance for spherical clusters or the Gustafson-Kessel distance [8] (i.e., the scaled Mahalanobis distance) for ellipsoidal clusters. The possibilistic membership u_ij should satisfy the following conditions,

$$0 \le u_{ij} \le 1, \quad \forall i, j \qquad (12a)$$

$$0 < \sum_{j=1}^{N} u_{ij} < N, \quad \forall i \qquad (12b)$$

Therefore, the Possibilistic Entropy Clustering (PEC) problem is to find a set of prototypes which minimize Eq. (10) and satisfy constraints (11) and (12). Using the method of Lagrange multipliers, we obtain the following solution,

$$u_{ij} = \lambda_0 e^{-\beta_i d_{ij}^{2}} \qquad (13)$$

where λ_0 = e^{-(1+α_i)} ∈ (0, e^{-1}), and the parameter β_i is a Lagrange multiplier determined by Eq. (11). We name it the resolution parameter.

It is obvious that 0 ≤ u_ij ≤ e^{-(1+α_i)} < 1. On the one hand, λ_0 is a constant, which does not affect the prototype value of cluster C_i. On the other hand, to agree with intuition, u_ij should span the range [0.0, 1.0]. Hence we revise the membership function by eliminating the constant factor λ_0 and obtain the following formula,

$$u_{ij} = e^{-\beta_i d_{ij}^{2}} \qquad (14)$$

Obviously, the revised membership also satisfies constraint (12).

Substituting Eq. (14) for u_ij in Eq. (9), we obtain,

maximize: $$\sum_{j=1}^{N} \exp\{-\beta_i d_{ij}^{2}\} \qquad (15)$$

We thus find that the maximization of Eq. (9) is in accord with the potential function method [24], which requires that a reasonable cluster cover a "dense" region in feature space (i.e., possess a high potential). We therefore conclude that the membership function of a feature point or vector is consistent with its potential function. For this reason, we name Eq. (9) the potential term.
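As an illustration added here (not from the original paper), the following sketch evaluates the potential term of Eq. (15) for two candidate prototypes over a toy data set; the prototype placed in the dense region receives the larger potential, in line with the potential function interpretation.

```python
import numpy as np

def potential(prototype, data, beta):
    """Potential term of Eq. (15): sum_j exp(-beta * ||x_j - prototype||^2)."""
    d2 = np.sum((data - prototype) ** 2, axis=1)
    return np.sum(np.exp(-beta * d2))

rng = np.random.default_rng(1)
dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))   # a dense cluster
sparse = rng.uniform(low=4.0, high=8.0, size=(10, 2))          # a few scattered points
data = np.vstack([dense, sparse])
beta = 1.0

print(potential(np.array([0.0, 0.0]), data, beta))   # high: centered on the dense region
print(potential(np.array([6.0, 6.0]), data, beta))   # low: inside the sparse region
```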

On the other hand, substituting the resulting u_ij for the second u_ij in Eq. (8), we obtain,

minimize: $$\sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \beta_i d_{ij}^{2} \qquad (16)$$

As the resolution parameter β_i is an independent constant, minimization of the above formula is in accord with the following requirement for each cluster,

minimize: $$\sum_{j=1}^{N} u_{ij} d_{ij}^{2} \qquad (17)$$

where u_ij = exp(-β_i d_ij^2). We can see that the membership values are high for feature points that are near a prototype, while far points, including noise and outliers, are assigned relatively low possibilistic membership values. Hence u_ij also plays the role of a weight function, which makes our algorithm robust. From the above development it may be concluded that the possibilistic entropy based clustering algorithm has a powerful physical meaning and well-defined mathematical features. Possibilistic entropy clustering establishes a connection among fuzzy set theory, possibility theory, and robust statistics.

IV. FULLY UNSUPERVISED POSSIBILISTIC ENTROPY CLUSTERING ALGORITHM

A. One iterative scheme

From the previous analysis, we know that possibilistic entropy clustering is suitably characterized by the combinatorial objective function (10), which is made up of the entropy term (8) and the potential term (9). In most cases, α_i in Eq. (10) may be taken as an equal constant α for all clusters, as it plays a uniform role in detecting all clusters. Substituting the deduced expression (16) for the entropy term in the combinatorial objective function (10), we convert the possibilistic entropy based clustering problem into the following requirement, which is convenient for deriving an iterative scheme for the prototype C_i and the resolution parameter β_i,

minimize: $$\left\{ \sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \beta_i d_{ij}^{2} - \alpha \sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \right\} \qquad (18)$$

By setting the derivative of the above objective function to zero with respect to the prototype C_i and the resolution parameter β_i separately, with regard to Eq. (14), it can easily be shown that the update equations for C_i and β_i are

$$C_i = \frac{\sum_{j=1}^{N} u_{ij} x_j}{\sum_{j=1}^{N} u_{ij}} \qquad (19)$$

$$\frac{1}{\beta_i} = \frac{1}{1+\alpha} \cdot \frac{\sum_{j=1}^{N} u_{ij}\, d_{ij}^{4}}{\sum_{j=1}^{N} u_{ij}\, d_{ij}^{2}} \qquad (20)$$

With regard to Eq. (14), equations (19) and (20) can be used in an alternating fashion in an iterative algorithm to estimate the cluster center C_i and to adjust the resolution parameter β_i. Compared with traditional probability entropy based clustering algorithms, whose complicated parameters are usually assumed known or empirically determined by the user as in many other parameter-type algorithms, this delicate iterative scheme can automatically control the resolution parameters during the clustering procedure. At the beginning of the iterative scheme, α may be set to a small value to obtain a low initial resolution parameter β_i0, which means that the membership values of feature points around a cluster descend slowly toward zero with regard to Eq. (14). In other words, in the first steps as many feature points as possible will be contained in a cluster. As the clustering progresses, the parameter β_i increases gradually and automatically, and the hyperspheres centered at the prototypes gradually shrink to improve the resolution of clustering.

It is noted that β_i also depends on the chosen constant α, which balances the effect of the entropy term and the potential term in the objective function. To minimize Eq. (18), one may conjecture that the constant α should represent the overall β_i d_ij^2. Experiments show that the ideal α should be near the constant 0.5, which corresponds to the 3 dB point in a cluster with regard to Eq. (14), if we force β_i d_ij^2 to equal α.
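The following sketch (our illustration; the function name, array shapes, and Euclidean distance are our assumptions) implements one pass of the alternating updates of Eqs. (14), (19), and (20) for a single cluster.

```python
import numpy as np

def update_cluster(x, c, beta, alpha=0.5):
    """One alternating update for one cluster:
    Eq. (14) memberships, Eq. (19) prototype, Eq. (20) resolution parameter.
    x: (N, D) data, c: (D,) prototype, beta: current resolution parameter."""
    d2 = np.sum((x - c) ** 2, axis=1)                                 # squared distances d_ij^2
    u = np.exp(-beta * d2)                                            # Eq. (14)
    c_new = np.sum(u[:, None] * x, axis=0) / np.sum(u)                # Eq. (19)
    beta_new = (1.0 + alpha) * np.sum(u * d2) / np.sum(u * d2 ** 2)   # Eq. (20)
    return u, c_new, beta_new

# usage: start with a low resolution (large cluster size) and iterate
rng = np.random.default_rng(3)
x = rng.normal([2.0, 2.0], 0.5, size=(200, 2))
c, beta = np.array([0.0, 0.0]), 0.005
for _ in range(20):
    u, c, beta = update_cluster(x, c, beta)
print(c, beta)   # the prototype moves to ~[2, 2]; beta grows as the cluster contour shrinks
```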

B. Fully unsupervised method

Lorette et al. stated that if the operator needs to choose training sets in order to estimate some parameters, then the method is said to be supervised. For unsupervised methods the operator sometimes interferes to choose the number of clusters. If this number is automatically determined, the method is said to be fully unsupervised (see [18] for a more precise definition).

The classic solution to determining the unknown number of clusters is to choose the best partition by evaluating the validity of the resulting partitions over a range of values of the cluster number [2] [7], which is very time-consuming and sensitive to the validity criterion, especially in a noisy environment [3] [6]. Another technique, related to the idea of cluster removal, is to seek one cluster at a time and then remove from the data set the points belonging to a found cluster if it passes a validity test, as in the GMVE [10]. This procedure is repeated until no more "good" clusters can be found. However, such a technique still depends on the validity measure and requires parameters or thresholds to be set in advance, which may vary greatly from one data set to another. Moreover, when the clusters overlap, the idea of extracting them in a serial fashion will not work, because removing one cluster may partially destroy the structure of other clusters, or we might get "bridging fits" [6] [22].

In this paper, we propose a simple fully unsupervised approach to automatically identify the number of clusters in a noisy data set. It may be viewed as a kind of progressive clustering scheme, which consists of starting the clustering process with an overspecified number of clusters and then merging similar clusters and eliminating spurious clusters until the appropriate number of clusters is left, as in Compatible Cluster Merging [5] [13]. Our algorithm finds the "optimum" number of clusters by repeatedly merging similar clusters, which requires a reliable measure of cluster compatibility. Here we use the compatibility of fuzzy subsets (clusters), which takes into account both the overlapping and the non-overlapping parts of each pair of clusters, as follows,

$$S_{ij} = S(C_i, C_j) = \frac{M(C_i \cap C_j)}{M(C_i \cup C_j)} \qquad (21)$$

where

$$M(C_i) = \sum_{j=1}^{N} u_{ij} \qquad (22)$$

denotes the cardinality of cluster i. It can easily be shown that 0 ≤ S_ij ≤ 1, where S_ij = 1 when clusters i and j are identical, and S_ij = 0 when clusters i and j are disjoint, i.e., C_i ∩ C_j = ∅. If S_ij ≥ (1 − ε), C_i and C_j are merged. This procedure is repeated until no merging takes place with regard to the similarity of each pair of clusters. It should be noted that this measure depends only on the membership values u_ij and is independent of the distance measure, so it can be used for clusters of various shapes and sizes.
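A sketch added for illustration: one common way to realize Eqs. (21)-(22) is to take the fuzzy intersection and union pointwise as the min and max of the memberships; the paper does not spell this out, so treat the min/max choice below as our assumption.

```python
import numpy as np

def cluster_similarity(u_i, u_j):
    """S_ij of Eq. (21) with M(.) as in Eq. (22); fuzzy intersection/union
    taken as pointwise min/max (our assumption)."""
    inter = np.sum(np.minimum(u_i, u_j))   # M(C_i ∩ C_j)
    union = np.sum(np.maximum(u_i, u_j))   # M(C_i ∪ C_j)
    return inter / union

u_a = np.array([0.90, 0.80, 0.10, 0.00])
u_b = np.array([0.88, 0.82, 0.12, 0.00])   # nearly the same cluster
u_c = np.array([0.00, 0.10, 0.90, 0.80])   # a different cluster

print(cluster_similarity(u_a, u_b))    # ~0.97 -> merge if >= 1 - eps
print(cluster_similarity(u_a, u_c))    # ~0.06 -> keep separate
```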

C. FUPEC algorithm description

We summarize our Fully Unsupervised Possibilistic Entropy Clustering (FUPEC) algorithm as follows:

Set the overspecified number of clusters C and the initial β_i0;
Perform the FCM algorithm for several iterations;
Repeat
    Compute the membership matrix using Eq. (14);
    Update the prototype C_i using Eq. (19);
    Update the resolution parameter β_i using Eq. (20);
Until the prototypes C_i stabilize.
Repeat
    For each pair of clusters i and j Do
        Calculate S_ij using Eq. (21);
        If S_ij ≥ (1 − ε) Then merge clusters i and j;
    End For;
    Update the number of clusters C;
Until no merging takes place.
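To make the box above concrete, here is a compact, self-contained Python sketch of the whole procedure. It is our reconstruction under stated assumptions: Euclidean distance, a k-means-style warm start standing in for the FCM initialization, min/max fuzzy intersection/union in the merge test, and an assumed ε; all function and variable names are ours.

```python
import numpy as np

def fupec(x, c_over=5, beta0=0.005, alpha=0.5, eps=0.05, n_init=10, n_iter=100):
    """Sketch of the FUPEC loop: alternate Eqs. (14), (19), (20), then merge via Eq. (21)."""
    rng = np.random.default_rng(0)
    # crude k-means-style stand-in for the few FCM iterations used for initialization
    protos = x[rng.choice(len(x), c_over, replace=False)].astype(float)
    for _ in range(n_init):
        d2 = ((x[None, :, :] - protos[:, None, :]) ** 2).sum(-1)
        lab = d2.argmin(0)
        for i in range(c_over):
            if np.any(lab == i):
                protos[i] = x[lab == i].mean(0)

    betas = np.full(c_over, beta0)
    for _ in range(n_iter):  # alternate updates until the prototypes stabilize
        d2 = ((x[None, :, :] - protos[:, None, :]) ** 2).sum(-1)
        u = np.exp(-betas[:, None] * d2)                                           # Eq. (14)
        protos = (u[:, :, None] * x).sum(1) / (u.sum(1)[:, None] + 1e-12)          # Eq. (19)
        betas = (1.0 + alpha) * (u * d2).sum(1) / ((u * d2 ** 2).sum(1) + 1e-12)   # Eq. (20)

    # merge compatible clusters: Eq. (21) with min/max fuzzy set operations (our assumption)
    keep = list(range(c_over))
    merged = True
    while merged:
        merged = False
        for a in range(len(keep)):
            for b in range(a + 1, len(keep)):
                i, j = keep[a], keep[b]
                s = np.minimum(u[i], u[j]).sum() / (np.maximum(u[i], u[j]).sum() + 1e-12)
                if s >= 1.0 - eps:
                    keep.pop(b)
                    merged = True
                    break
            if merged:
                break
    return protos[keep], betas[keep]

# usage on two noisy Gaussian blobs plus uniform background noise
rng = np.random.default_rng(4)
data = np.vstack([rng.normal([0, 0], 0.4, (150, 2)),
                  rng.normal([4, 4], 0.6, (150, 2)),
                  rng.uniform(-2, 6, (40, 2))])
centers, resolutions = fupec(data)
print(len(centers))   # ideally 2, with centers near (0, 0) and (4, 4)
print(centers)
```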

We elaborate the clustering process through a running example shown in Fig. 1. Fig. 1(a) is the original noisy data set, which contains two Gaussian clusters contaminated by noise. First, we assume a large number of clusters to be found; this number should be larger than the maximum number of clusters the data set is expected to contain. Here we specify it to be 5. Since we principally expect to obtain the initial positions of the clusters, we perform 10 iterations of the FCM algorithm to get a preliminary partition of the data set with the overspecified number of clusters. Setting β_i0 to 0.005 in order to cover as many data points as possible with a low resolution, we use Eq. (14) to calculate the initial membership u_ij. Then, calculating the cluster centers C_i and the resolution parameters β_i alternately with Eqs. (19) and (20) and updating the membership matrix with Eq. (14), we perform our algorithm with an iterative optimization scheme. It is noted that any reasonable initial value of β_i0 that represents a low resolution or a big cluster size is acceptable. We here use a value less than 0.1 for generality, although a value much greater than 0.1 is efficient in some cases. Fig. 1(b) shows the clustering result after 2 iterations of the proposed algorithm. Dense regions are shared by one or more clusters because possibilistic clustering tends to find the most typical clusters. As the clustering progresses, the centers of the clusters move to their convergence points and become identical. Fig. 1(c) shows the result after 15 iterations, after which convergence is achieved. Fig. 1(d) shows the final clusters after merging using the similarity measure in Section IV.B.


Fig. 1. Clustering procedures of FUPEC algorithm. (a) Original noisy data set with two Gaussian clusters; (b) Result after 2 iterations; (c) Result after 15 iterations (convergence); (d) Final clusters after merging.

It is recommended to overspecify the number of clusters. This makes the algorithm less susceptible to initialization and more likely to detect all available clusters (especially tiny clusters).

V. EXPERIMENTAL RESULTS

In the following experiments, we focus on Gaussian density data sets, which are widely used in mixture modeling. Note that the proposed approach can also be extended to any type of mixture model. For generality, the initial parameters are fixed in all the experiments. The initial centers of the clusters are obtained by performing 10 iterations of the FCM algorithm. Then our approach is applied using the distance metric selected to detect the particular shape structure. The initial parameters β_i0 are all set to 0.005 in the first step of the FUPEC algorithm, and the constant α is set to 0.6. In all resulting clustering pictures, the centers obtained are marked by crosses and the boundaries of the ellipses in these figures enclose points whose membership values are greater than 0.003.
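As a side note we add (not from the paper): with the Euclidean distance, the plotted membership-0.003 boundary of Eq. (14) is simply a circle whose radius follows from exp(-β r^2) = 0.003; a minimal sketch:

```python
import numpy as np

def boundary_radius(beta, u_threshold=0.003):
    """Radius at which the membership of Eq. (14) drops to u_threshold
    (Euclidean distance, so the contour is a circle around the prototype)."""
    return np.sqrt(np.log(1.0 / u_threshold) / beta)

print(boundary_radius(beta=2.0))   # ~1.70: larger beta (higher resolution) -> tighter contour
print(boundary_radius(beta=0.5))   # ~3.41
```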

A. Experiment I

In this experiment, a general data set is used which contains three synthetic multivariate Gaussian clusters with various sizes and orientations but without noise points, as shown in Fig. 2(a). As two obvious elliptical shapes are present, the G-K distance metric [8] is used. The resulting partition is shown in Fig. 2(b), where the three ellipses are well identified. When uniformly distributed noise is added to the data set so that the noise constitutes about 35% of the total points, the results of the FUPEC algorithm remain almost identical, as shown in Fig. 2(c). Again this shows that the FUPEC algorithm is robust when noise is added. Even when the noise proportion reaches 60% of the total points, the resulting partition is almost unchanged, as shown in Fig. 2(d). It is interesting that as the proportion of noise points increases, the resulting elliptical contours tend to expand slightly. This is because the added noise points lead β_i to a smaller value in Eq. (20), thus making the membership values descend more slowly in Eq. (14).


Fig. 2. Results on a data set that contains three Gaussian clusters. (a) Original data set; (b) Results of FUPEC algorithm; (c) Results of FUPEC algorithm when uniformly distributed noise is added; (d) Results when data set is contaminated with more noise.


B. Experiment II

The effectiveness of the new algorithm is confirmed by a more general example, where a complicated structure is present in the data set. Fig. 3(a) shows a data set that is composed of four components with various sizes and orientations, with random noise added. Performing the FUPEC algorithm for 3 iterations, starting from the cluster centers obtained by 10 iterations of the FCM algorithm with the number of clusters set to 12, yields the intermediate result shown in Fig. 3(b). It can be seen that each ideal cluster in a dense region splits into multiple small clusters and that some spurious clusters form around noise points. This is due to the overspecified number of clusters, which leads the algorithm to find possible distribution sources. As the clustering progresses, the spurious clusters move to one of the dense regions, and the adjacent sub-clusters of a single split cluster expand to cover the entire dense region and become identical. Fig. 3(c) shows the final result of the FUPEC algorithm, where the four desired clusters are well detected and the contours of the two right clusters, which have some overlapping parts, are also well delimited.

C. Experiment III

The final test is devoted to a special case where two clusters overlap and the two centers are near each other. It may be viewed as a noisy crossing data set, as shown in Fig. 4(a). It is known that many traditional probability-constrained clustering algorithms fail to distinguish them well. From this experiment we can see that the FUPEC algorithm performs excellently, finding the centers accurately and delimiting the boundaries of the clusters properly. Furthermore, with the resolution parameter β_i being automatically controlled, the FUPEC clustering algorithm can distinguish the two overlapping clusters clearly and need not specify the parameter by trial and error as in many other parameter-type methods.


Fig. 4. Result on a noisy data set with two overlapping clusters. (a) Original data set; (b) Final results of FUPEC algorithm.

VI. A THEORETICAL COMPARISON WITH PREVIOUS CLUSTERING ALGORITHMS

It is not appropriate to make a direct experimental comparison between the previous algorithms and the proposed FUPEC algorithm, since the latter involves a fully unsupervised progressive clustering scheme, with which it overcomes the sensitivity to initialization and can detect all available clusters (especially tiny clusters) with higher probability. This is due to the fact that a probabilistic method is primarily a partitioning algorithm, whereas the possibilistic approach, including PCM [14], is primarily a mode-seeking algorithm, whose power lies in finding meaningful clusters as defined by dense regions [15]. Besides, the FUPEC algorithm provides efficient and robust estimation of the prototype parameters even when the clusters vary significantly in size and shape, while the "least-biased fuzzy clustering (LBFC) method" proposed by Beni and Liu can find only clusters of a particular size [1]. The proposed FUPEC algorithm can separate overlapping clusters accurately, which is still a big problem for many probabilistic algorithms, including FCM and many fuzzy entropy based algorithms.


Fig. 3. Results on a complicated data set that contains four components and random noise. (a) Original data set; (b) Result after 3 iterations; (c) Final result of FUPEC algorithm.


The most distinguished merit of the proposed algorithm lies in its capability of automatically controlling the resolution parameter during the clustering process. This unique iterative scheme for the prototype and resolution parameter profits from the combinatorial objective function of the general Possibilistic Entropy Clustering (PEC) problem with partial substitution of the possibilistic membership, thus avoiding the specification or empirical determination of a particular parameter as in fuzzy entropy based algorithms, such as the resolution parameter β in [1], the fuzzification parameter α in [11], and the admissible error radius σ in [17]. A similar weakness can be found in PCM [14], which requires a reliable scale estimate of the parameter η_i to function effectively.

Although the proposed method is somewhat like a variant of the Expectation-Maximization (EM) algorithm [25] in a possibilistic framework, it overcomes several drawbacks of the basic EM method, namely that 1) the number of clusters needs to be known a priori and 2) the solution depends strongly on the initial conditions. We may even name this method a possibilistic EM algorithm in some sense. In a probabilistic framework, probabilistic EM based clustering will split a noisy data set into multiple adjacent clusters without merging when the number of clusters is overspecified, while the possibility theory guided clustering tends to find the most typical clusters. So, as the clustering progresses, the preliminary over-partitioned clusters will expand to cover the entire dense region to get the most reasonable partition, as shown in Fig. 2(b). At the same time, the initialization sensitivity of the EM algorithm is also overcome.

A simple simulation comparison between the basic possibilistic entropy clustering algorithm (i.e., without the above fully unsupervised iterative scheme) and the FCM algorithm in terms of clustering noisy data can be found in our earlier paper [27].

VII. CONCLUSION

In this paper, we address the problem of entropy-based clustering within the framework of possibility theory. Before illustrating our specific clustering algorithm, we introduce a new concept of possibilistic entropy, which has a form similar to Shannon's information entropy but a different meaning in theory. When extended to a combinatorial programming problem such as clustering analysis, possibilistic entropy is more in accord with its original definition than probability entropy.

We develop the possibilistic entropy theory for clustering analysis and investigate the general clustering problem. A general objective function is derived, which can be viewed as a combination of an entropy term and a potential term and takes into account both the global and the local effects of entropy clustering. Based on this idea, various versions of possibilistic entropy clustering may be developed.

Using a delicate iterative scheme for the prototype and resolution parameter, we propose a Fully Unsupervised Possibilistic Entropy Clustering (FUPEC) algorithm, which inherits the merits of possibility theory and overcomes the drawbacks of the available methods. The FUPEC algorithm is characterized by the following features:
1) a clear physical meaning and well-defined mathematical features;
2) automatic determination of the number of clusters;
3) automatic control of the resolution parameter during the clustering process;
4) insensitivity to initialization and to noise and outliers.
Moreover, the FUPEC algorithm provides efficient estimation of the prototype parameters even when the clusters vary significantly in size and shape and the data set is contaminated by heavy noise. It can also separate overlapping clusters accurately, which is a big problem for many other available algorithms.

It would be interesting to continue the study of the properties of possibilistic entropy and its applications in possibility theory [26], optimization, and communications. Besides, the employment of the penalty term and the choice of the adjustment factor need further research.

APPENDIX

1) Appendix A: Derivation of Eq. (13)

The solution of Eq. (10) involves the use of Lagrange multipliers β_i. For Eq. (10), let the Lagrangian be

$$J = -\sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \ln u_{ij} - \sum_{i=1}^{C} \alpha_i \sum_{j=1}^{N} u_{ij} - \sum_{i=1}^{C} \beta_i \left( \sum_{j=1}^{N} u_{ij} d_{ij}^{2} - L \right) \qquad \text{(A1)}$$

Setting the derivative of (A1) equal to zero with respect to u_ij yields

$$\frac{\partial J}{\partial u_{ij}} = -\left(\ln u_{ij} + u_{ij} \times \frac{1}{u_{ij}}\right) - \alpha_i - \beta_i d_{ij}^{2} = -\ln u_{ij} - (1 + \alpha_i) - \beta_i d_{ij}^{2} = 0 \qquad \text{(A2)}$$

Rearranging, we get

$$u_{ij} = \exp\{-(1 + \alpha_i) - \beta_i d_{ij}^{2}\} = e^{-(1 + \alpha_i)} \cdot \exp\{-\beta_i d_{ij}^{2}\} \qquad (13)$$
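As a quick check added here (not part of the original derivation), the per-point stationarity condition can be verified symbolically; the script below is a sketch assuming SymPy is available.

```python
import sympy as sp

u, alpha, beta, d2 = sp.symbols('u alpha beta d2', positive=True)

# per-point contribution to the Lagrangian (A1): -u*ln(u) - alpha*u - beta*d2*u
J = -u * sp.log(u) - alpha * u - beta * d2 * u

print(sp.solve(sp.diff(J, u), u))   # [exp(-alpha - beta*d2 - 1)], i.e. Eq. (13)
```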

2) Appendix B: Derivation of Eq. (20)

The objective function Eq. (18) is as follows

$$J = \sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \beta_i d_{ij}^{2} - \alpha \sum_{i=1}^{C}\sum_{j=1}^{N} u_{ij} \qquad \text{(B1)}$$

Differentiating (B1) with respect to the resolution parameter β_i and setting the result to zero gives

$$\sum_{j=1}^{N}\left[\left(-u_{ij} d_{ij}^{2}\right)\beta_i d_{ij}^{2} + u_{ij} d_{ij}^{2}\right] - \alpha \sum_{j=1}^{N}\left(-u_{ij} d_{ij}^{2}\right) = 0 \qquad \text{(B3)}$$

Rearranging, we get


$$\frac{1}{\beta_i} = \frac{1}{1+\alpha} \cdot \frac{\sum_{j=1}^{N} u_{ij}\, d_{ij}^{4}}{\sum_{j=1}^{N} u_{ij}\, d_{ij}^{2}} \qquad (20)$$

where u_ij = exp{-β_i d_ij^2}.
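A small numeric sanity check we add (not part of the paper): for arbitrary memberships and distances, the β_i computed from Eq. (20) makes the left-hand side of (B3) vanish, confirming that (20) is the rearrangement of (B3).

```python
import numpy as np

rng = np.random.default_rng(0)
d2 = rng.uniform(0.1, 4.0, size=20)    # squared distances d_ij^2 within one cluster
u = rng.uniform(0.01, 1.0, size=20)    # memberships, held fixed as in (B3)
alpha = 0.6

beta = (1.0 + alpha) * np.sum(u * d2) / np.sum(u * d2 ** 2)   # Eq. (20) rearranged for beta

lhs = np.sum((-u * d2) * (beta * d2) + u * d2) - alpha * np.sum(-u * d2)  # left side of (B3)
print(abs(lhs))   # ~1e-15
```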

ACKNOWLEDGMENT

The authors would like to sincerely thank the two anonymous reviewers for their valuable comments and insightful suggestions.

REFERENCES

[1] G. Beni and X. Liu, "A least biased fuzzy clustering method," IEEE Trans. Pattern Anal. and Machine Intell., vol. 16, no. 9, pp. 954-960, Sep. 1994.

[2] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.

[3] RN Davé, R. Krishnapuram, “Robust clustering methods: a unified view”, IEEE Trans. Fuzzy Systems 1997, 5 (2), pp. 270-293.

[4] J.D. Fast, “Entropy : the significance of the concept of entropy and its applications in science and technology,” in: The Statistical Significance of the Entropy Concept, Eindhoven, Philips Technical Library, 1962.

[5] H. Frigui and R. Krishnapuram, “A robust algorithm for automatic extraction of an unknown number of clusters from noisy data”, Pattern Recognition Letters, 1996, 17(12), pp. 1223-1232.

[6] H. Frigui and R. Krishnapuram, “A robust competitive clustering algorithm with applications in computer vision”, IEEE Trans. Pattern And Machine Intell., 1999, 21(5), pp. 450-465.

[7] I. Gath, AB Geva, “Unsupervised Optimal Fuzzy Clustering”, IEEE Trans. Pattern Anal. and Machine Intell. 1989, 11(7), pp. 773-781.

[8] E.E. Gustafson, and W.C. Kessel, “Fuzzy Clustering With a Fuzzy Matrix,” in IEEE CDC, 1979, pp. 761 – 766.

[9] E. T. Jaynes, "Information theory and statistical mechanics", Physical Review, vol. 106, pp. 620-630,1957.

[10] JM Jolion, P. Meer, S. Bataouche, "Robust clustering with applications in computer vision”. IEEE Trans. Pattern And Machine Intell. 1991, 13(8), pp. 791-802.

[11] N. Karayannis, "MECA: Maximum Entropy Clustering Algorithm", in Proc. of the 3rd IEEE Conf. on Fuzzy Systems, pp. 630-635, 1994.

[12] AI. Khinchin, Mathematical foundations of information theory, Dover Publications, New York, 1957.

[13] R. Krishnapuram and CP Freg, "Fitting an unknown number of lines and planes to image data through compatible cluster merging", Pattern Recognition, 1992, 25(4), pp. 385-400.

[14] R. Krishnapuram, and J. Keller, “A Possibilistic Approach to Clustering,” IEEE Trans. Fuzzy Systems, vol. 1, no. 2, pp. 98-110, May 1993.

[15] R. Krishnapuram and JM Keller, "The possibilistic c-means algorithm: Insights and recommendations", IEEE Trans. on Fuzzy Systems, 1996, 4(3), pp. 385-393.

[16] R. Li, and M. Mukaidono, “Gaussian clustering method based on maximum-fuzzy-entropy interpretation,” Fuzzy Sets and systems, vol.102, no.2, pp. 253-259, 1999.

[17] R. Li, and M. Mukaidono, “A maximum entropy approach to fuzzy clustering,” in Conference on Fuzzy Systems (Fuzz’IEEE), Yokohama, Japan, 1995.

[18] A. Lorette, X. Descombes, and J. Zerubia, “Fully Unsupervised Fuzzy Clustering with Entropy Criterion,” in ICPR'00, Barcelone, Sept. 2000, pp. 998 - 1001.

[19] M. Ménard, V. Courboulay, and PA Dardignac, “Possibilistic and probabilistic fuzzy clustering: Unification within the framework of the non-extensive thermostatistics,” Pattern Recognition vol.36, no.6, pp.1325-1342, 2003.

[20] O. Nasraoui, and R. Krishnapuram, “A Robust Estimator Based on Density and Scale Optimization, and its Application to Clustering,” in Proc. of IEEE International Conf. on Fuzzy System, 1996, pp.1031-1035.

[21] C.E. Shannon, "A mathematical theory of communication", Bell System Technical Journal, vol. 27, pp. 379-423, 1948.

[22] C.V. Stewart, "MINPRAN: A New Robust Estimator for Computer Vision", IEEE Trans. Pattern Anal. and Machine Intell., 1995, 17(10), pp. 925-938.

[23] JB. Swartz, "An entropy-based algorithm for detecting clusters of cases and controls and its comparison with a method using nearest neighbours", Health and Place, 1998, 4(1), pp. 67-77.

[24] J.T. Tou, and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, MA, 1974.

[25] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 39(1), pp. 1-38, 1977.

[26] G. Klir and T. Folger, Fuzzy sets, uncertainty and information, Prentice Hall, 1988.

[27] L. Wang, H.B. Ji, and X.B. Gao, "Clustering Based on Possibilistic Entropy", in Proceedings of the Seventh International Conference on Signal Processing (ICSP'04), Beijing, Aug. 2004, pp. 1457-1470.
