


Journal of Classification 5:237-247 (1988)

James C. Bezdek

Boeing Electronics


Recent Convergence Results for the Fuzzy c-Means Clustering Algorithms

Richard J. Hathaway

Georgia Southern College

Abstract: One of the main techniques embodied in many pattern recognition systems is cluster analysis - the identification of substructure in unlabeled data sets. The fuzzy c-means algorithms (FCM) have often been used to solve certain types of clustering problems. During the last two years several new local results concerning both numerical and stochastic convergence of FCM have been found. Numerical results describe how the algorithms behave when evaluated as optimization algorithms for finding minima of the corresponding family of fuzzy c-means functionals. Stochastic properties refer to the accuracy of minima of FCM functionals as approximations to parameters of statistical populations which are sometimes assumed to be associated with the data. The purpose of this paper is to collect the main global and local, numerical and stochastic, convergence results for FCM in a brief and unified way.

Keywords: Cluster analysis; Convergence; Fuzzy c-means algorithm; Optimization; Partitioning algorithms; Pattern recognition.

1. Introduction

The purpose of this paper is to gather and discuss some recent convergence results regarding the Fuzzy c-Means (FCM) clustering algorithms.

Authors' Addresses: Richard J. Hathaway, Mathematics and Computer Science Department, Georgia Southern College, Statesboro, Georgia 30460, USA and James C. Bezdek, Information Processing Lab., Boeing Electronics, PO Box 24969, Seattle, Washington 98124-6269, USA.



238 R. J. Hathaway and J. C. Bezdek

These algorithms are quite useful in solving certain kinds of clustering problems where a population X of n objects, each represented by some vector of s numerical features or measurements x ∈ R^s, is to be decomposed into subpopulations (or clusters) of similar objects. The FCM algorithms use the set of feature vectors, along with some initial guess about the cluster substructure, to obtain a partitioning of the objects into fuzzy clusters, and as a by-product of the partitioning procedure, produce a prototypical feature vector representing each subpopulation. FCM is known to produce reasonable partitionings of the original data in many cases. (See Bezdek 1981, for many illustrative examples.) Furthermore, the algorithm is known to produce partitionings very quickly compared to some other approaches. For example, Bezdek, Hathaway, Davenport and Glynn (1985) have shown that FCM is, on average, perhaps an order of magnitude faster than the maximum likelihood approach to estimation of the parameters of a mixture of two univariate normal distributions. These facts justify the study of convergence properties of the fuzzy c-means algorithms, which have become much better understood during the last two years.

In this note we wish to survey convergence theory for the FCM algorithms on two different levels. First of all, the algorithms are iterative optimization schemes tailored to find minima of a corresponding family of fuzzy c-means functionals. Our first look at convergence results will concern numerical convergence properties of the algorithms (or equivalently, of the sequence of iterates produced by the algorithms): how well do they attain the minima they were designed to find? This type of theory is referred to herein as numerical convergence theory, and is concerned with questions like how fast the iterates converge to a minimum of the appropriate functional, or whether they converge at all. These properties are discussed in Section 3, where both theoretical and empirical studies are cited.

The second type of convergence theory examined is referred to herein as stochastic convergence theory. It concerns a completely different kind of question: how accurate are the minima (that the fuzzy c-means algorithms try to find) in representing the actual cluster substructure of a sample? The pertinent theoretical result cited in Section 4 regards the statistical concept of consistency. Some additional light can be shed on stochastic convergence properties by considering the results of empirical tests, also contained in Section 4.

It is clear that both types of convergence results are useful in interpreting final partitionings produced by the FCM algorithms. The algorithms are introduced in the next section; the final section contains a discussion and topics for further research.


Recent Convergence Results for Fuzzy c-Means 239

2. The FCM Algorithms

Let c ≥ 2 be an integer; let X = {x_1, ..., x_n} ⊂ R^s be a finite data set containing at least c < n distinct points; and let R^{cn} denote the set of all real c × n matrices. A nondegenerate fuzzy c-partition of X is conveniently represented by a matrix U = [u_ik] ∈ R^{cn}, the entries of which satisfy

    u_ik ∈ [0,1],  1 ≤ i ≤ c, 1 ≤ k ≤ n,   (1a)

    Σ_{i=1}^{c} u_ik = 1,  1 ≤ k ≤ n,   (1b)

    Σ_{k=1}^{n} u_ik > 0,  1 ≤ i ≤ c.   (1c)

The set of all matrices in R^{cn} satisfying (1) is denoted by M_{fcn}. A matrix U ∈ M_{fcn} can be used to describe the cluster structure of X by interpreting u_ik as the grade of membership of x_k in cluster i: u_ik = .95 represents a strong association of x_k to cluster i, while u_ik = .01 represents a very weak one. Note that M_{cn}, the subset of M_{fcn} which contains only matrices with all u_ik's in {0,1}, is exactly the set of non-degenerate crisp (or conventional) c-partitions of X. Other useful information about cluster substructure can be conveyed by identifying prototypes (or cluster centers) v = (v_1, ..., v_c)^T ∈ R^{cs}, where v_i is the prototype for class i, 1 ≤ i ≤ c, v_i ∈ R^s. "Good" partitions U of X and representatives (v_i for class i) may be defined by considering minimization of one of the family of c-means objective functionals J_m : (M_{fcn} × R^{cs}) → R defined by

    J_m(U,v) = Σ_{k=1}^{n} Σ_{i=1}^{c} (u_ik)^m ||x_k − v_i||² ,   (2)

where 1 < m < ∞ and ||·|| is any inner product induced norm on R^s. This approach was first given for m = 2 in Dunn (1973) and then generalized to the above range of values of m in Bezdek (1973). For m > 1, Bezdek (1973) gave the following necessary conditions for a minimum (U*, v*) of J_m(U,v) over M_{fcn} × R^{cs}:

    v_i* = ( Σ_{k=1}^{n} (u_ik*)^m x_k ) / ( Σ_{k=1}^{n} (u_ik*)^m )   for all i ;   (3a)



and for each k such that d_ik = ||x_k − v_i*||² > 0 for all i,

    u_ik* = ( Σ_{j=1}^{c} a_ijk )^{−1}   for all i ,   (3b1)

where

    a_ijk = (d_ik / d_jk)^{1/(m−1)} ;

but if k is such that d_ik = ||x_k − v_i*||² = 0 for some i, then the u_ik*, 1 ≤ i ≤ c, are any non-negative numbers satisfying

    Σ_{i=1}^{c} u_ik* = 1  and  u_ik* = 0 if d_ik ≠ 0 .   (3b2)

The FCM algorithms consist of iterations alternating between equations (3a) and (3b). The process is started either with an initial guess for the partitioning U or an initial guess for the prototype vectors v, and is continued until successive iterates of the partitioning matrix barely differ; that is, iteration stops with the first U^{r+1} such that ||U^{r+1} − U^r|| < ε, where ε is a small positive number. The numerical convergence results which follow concern the behavior of the sequences {U^r} and {v^r}, while the stochastic theory refers to how well minima of (2) actually represent the cluster substructure of a population under certain statistical assumptions.
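The alternation between (3a) and (3b) described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the random fuzzy initialization, and the use of the max-norm in the stopping test are our own choices; the zero-distance case is resolved by the convention permitted in (3b2).

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Alternate updates (3a) and (3b) until successive partition
    matrices barely differ: stop at the first U with max|U_new - U| < eps."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                  # columns sum to 1, as required by (1b)
    V = None
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)               # (3a): weighted means
        D = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # d_ik = ||x_k - v_i||^2
        U_new = np.empty_like(U)
        for k in range(n):
            d = D[:, k]
            if (d == 0.0).any():        # (3b2): x_k coincides with a prototype
                U_new[:, k] = (d == 0.0) / (d == 0.0).sum()
            else:                       # (3b1): u_ik = 1 / sum_j (d_ik/d_jk)^(1/(m-1))
                U_new[:, k] = 1.0 / ((d[:, None] / d[None, :])
                                     ** (1.0 / (m - 1.0))).sum(axis=1)
        done = np.abs(U_new - U).max() < eps
        U = U_new
        if done:
            break
    return U, V
```

On well-separated data this loop typically terminates in a handful of iterations, consistent with the empirical observations reported in Section 3.2.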

3. Numerical Convergence Properties

3.1 Theory

The properties of sequences of iterates produced by optimization algorithms can be divided into two different kinds of results. Global results refer to properties which hold for every iteration sequence produced by the algorithm, regardless of what the initial iterate was; whereas local convergence properties refer to the behavior of sequences of iterates produced by the algorithm when the initial iterate is "sufficiently close" to an actual solution (in this case a local minimum of J_m in (2)). As a simple example, local convergence results for Newton's method show it to be quadratically convergent in most cases, after it has gotten sufficiently close to the solution, while the global convergence theory for Newton's method without modification is weak; the algorithm readily fails to converge (at any rate) when started from a sufficiently poor initial guess.




Early FCM convergence results were of the global type. In Bezdek (1980), it was claimed that iterates produced by FCM always converged, at least along a subsequence, to a minimum of (2). The proof utilized the convergence theory in Zangwill (1969), but incorrectly identified the set of possible limit points as consisting only of minima. This original theorem was clearly identified as being incorrect by particular counterexamples found by Tucker (1987). The corrected global result given below is taken from Hathaway, Bezdek and Tucker (1987):

A. Global Convergence Theorem for FCM. Let the sample X contain at least c < n distinct points, and let (U^0, v^0) be any starting point in M_{fcn} × R^{cs} for the FCM iteration sequence {(U^r, v^r)}, r = 1, 2, .... If (U*, v*) is any limit point of the sequence, then it is either a minimum or saddle point of (2).

Note: it is worth re-emphasizing that Theorem A is called a global convergence theorem because convergence to a minimum or saddle point occurs from any initialization; when convergence is to a minimum, it may be either a local or (the) global minimum of J_m.

Local results for FCM are very recent. The following result, taken from Hathaway and Bezdek (1986a), was the first local convergence property derived for FCM.

B. Local Convergence Theorem for FCM. Let the sample X contain at least c < n distinct points, and (U*, v*) be any minimum of (2) such that d_ik > 0 for all i, k, and at which the Hessian of J_m is positive definite relative to all feasible directions. Then FCM is locally convergent to (U*, v*).

This theorem guarantees that when FCM iteration is started close enough to a minimum of J_m, the ensuing sequence is guaranteed to converge to that particular minimum. The last theorem, from Bezdek, Hathaway, Howard and Wilson (1987), regards the rate of local convergence.

C. Local Convergence Rate Theorem for FCM. Let the sample X contain at least c < n distinct points, and (U*, v*) be any minimum of (2) such that d_ik > 0 for all i, k, and at which the Hessian of J_m is positive definite relative to all feasible directions. If {(U^r, v^r)} is an FCM sequence converging to (U*, v*), then the sequence converges linearly to the solution; that is, there is a number 0 < λ < 1 and norm ||·|| such that for all sufficiently large r,

    ||(U^{r+1}, v^{r+1}) − (U*, v*)|| ≤ λ ||(U^r, v^r) − (U*, v*)|| .




Note: The number λ in Theorem C is equal to the spectral radius of a matrix obtained from the Hessian of J_m which is exhibited in Bezdek, Hathaway, Howard and Wilson (1987).

To summarize the numerical convergence theory for FCM, the algorithm is globally convergent, at least along subsequences, to minima, or at worst saddle points, of the FCM functionals in (2). Additionally, the algorithm is locally, linearly convergent to (local) minima of J_m. The following result concerning tests for optimality is taken from Kim, Bezdek and Hathaway (1987). In the statement of the result, H(U) is the cn × cn Hessian matrix of the function f(U) = min {J_m(U,v) | v ∈ R^{cs}} and P = I − (1/c)K, where I is the cn × cn identity matrix and K is the cn × cn block-diagonal matrix with n (c × c) diagonal blocks of all 1's.

D. Optimality Tests Theorem for FCM. At termination of the FCM algorithm, if (U*, v*) is a local minimum of the objective function J_m(U,v), then PH(U*)P is positive semidefinite.

Note that Theorem D gives a necessary but not sufficient condition for a local minimum. The importance of Theorem D is due to the fact that efficient algorithms exist for checking whether PH(U*)P is positive semidefinite. (See Kim, Bezdek and Hathaway 1987, for implementing an optimality test based on Theorem D.) Other recent work concerning numerical convergence and the testing of points for optimality is given by Ismail and Selim (1986), and Selim and Ismail (1986).
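The test in Theorem D is straightforward to carry out numerically once P is assembled. The sketch below is our own illustration, not the method of Kim, Bezdek and Hathaway (1987): it assumes the entries of U are vectorized column-by-column, so that K is the Kronecker product of an n × n identity with a c × c all-ones block, and it uses a hypothetical stand-in for the Hessian H(U*) rather than one derived from J_m.

```python
import numpy as np

def projection_matrix(c, n):
    """P = I - (1/c) K, where K is cn x cn block-diagonal with
    n diagonal blocks of c x c all-1's (column-by-column ordering of U)."""
    K = np.kron(np.eye(n), np.ones((c, c)))
    return np.eye(c * n) - K / c

def passes_optimality_test(H, c, n, tol=1e-8):
    """Necessary condition of Theorem D: P H(U*) P positive semidefinite."""
    P = projection_matrix(c, n)
    eigvals = np.linalg.eigvalsh(P @ H @ P)   # PHP is symmetric when H is
    return bool(eigvals.min() >= -tol)

# Hypothetical stand-in Hessian: any matrix of the form A^T A is positive
# semidefinite, so the necessary condition must hold for it.
c, n = 2, 3
A = np.arange(36, dtype=float).reshape(6, 6) + np.eye(6)
H = A.T @ A
print(passes_optimality_test(H, c, n))
```

Since P is a projector (K² = cK implies P² = P), checking the eigenvalues of PHP against a small negative tolerance is a reasonable numerical version of the semidefiniteness test.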

3.2 Empirical Observations

While theoretical results are important in understanding the FCM clustering algorithms, they do not by themselves indicate exactly how effective the algorithms are in finding minima of (2). For example, knowing only that an algorithm is linearly convergent locally does not guarantee convergence will occur quickly enough to be useful; many linearly convergent algorithms are of little practical utility exactly because they converge too slowly. Much numerical testing has been done in order to learn more about the effectiveness of FCM in finding minima of J_m (Hathaway, Huggins and Bezdek 1984; Bezdek, Hathaway and Huggins 1985; Bezdek, Davenport, Hathaway and Glynn 1985; Davenport, Bezdek and Hathaway 1988). Some of the general results found in those empirical tests are discussed below.

The FCM algorithms converge very quickly to optimal points for (2). The simulations done in the four papers mentioned above all involved generation of data from a known mixture of normal distributions, so that each subpopulation was in fact normally distributed. The FCM approach (with m = 2) was used to decompose the unlabeled total population into its normally distributed component subpopulations. This approach was compared with parametric estimation using Wolfe's EM algorithm based on the maximum likelihood principle to find extrema of the likelihood function corresponding to a mixture of normal distributions (Wolfe 1970). The FCM approach almost always converged within 10 to 20 iterations, while the widely used EM algorithm took hundreds, and in some cases over a thousand, iterations to converge. This empirical result is even more significant when we note that each iteration of FCM is relatively inexpensive computationally compared to approaches such as EM.

In addition to being fast, these numerical simulations indicate that the FCM approach is relatively independent of initialization. It is not the case, however, that termination, which is relatively independent of the initial guess, usually occurs at the global minimum of J_m. Rather, in this instance a local minimum often dominates convergence (presumably because it identifies truly distinctive substructure). Although no comprehensive study has been done regarding whether terminal points of FCM are usually minima or saddle points of (2), in our experience convergence to a saddle point for other than contrived data happens very rarely, if ever, in practice. No Monte Carlo type simulation studies have been conducted to date concerning the percentages of runs that terminate at each type of extremum (local minimum, saddle point, global minimum). Indeed, it is not clear how one determines the global minimum needed to conduct such studies except for trivially small artificial data sets. Further, Bezdek (1973) exhibits an example in which the global minimum is less attractive (visually) than local minima of J_m for m > 1. Nonetheless, this would constitute an interesting and useful numerical experiment for a future study. Next, the question of (statistical) accuracy under the mixture assumption is discussed.

4. Stochastic Convergence Properties

4.1 Theory

In order to construct a statistical theory concerning the accuracy of partitionings and cluster prototypes produced by FCM, it is necessary to impose a statistical framework by which estimator accuracy can be measured. Otherwise, there are only specific examples from which general conclusions about partitioning quality cannot be easily drawn. There are certainly 2-dimensional examples where FCM has done a good job (visually) of representing the cluster substructure of a population, and other cases where FCM has done a poor job. Indeed, this is the case for all clustering algorithms. The sole theoretical result in this context, taken from Hathaway and Bezdek (1986b), is somewhat negative.

E. Consistency Theorem. Let p(y; α1, α2) = α1 p1(y) + α2 p2(y), where p1 and p2 are symmetric densities with respective centers (means) of 0 and 1, and suppose that the expected values of |Y|² taken with respect to the component distributions are finite. Then there exist subpopulation proportions α1 and α2 such that the FCM cluster centers v1 and v2 for m = 2 are not consistent for the true component distribution centers 0 and 1 of p(y; α1, α2).

The statistical concept of consistency refers to the accuracy of the procedure as it is given increasing information (in this case through more and more members of the population of objects to be clustered). The theorem states that even when it is possible to observe an infinite number of members of the population to be clustered, the FCM approach has limited accuracy in being able to determine the true centers (means) of the component populations. This result is not particularly surprising, however, because the FCM functionals in (2) are not based on statistical assumptions; that is, minimizing J_m does not optimize any principle of statistical inference, such as maximizing a likelihood functional. Note that Theorem E implies a limitation on the accuracy obtainable for any type of symmetric component distributions, and this limitation in accuracy would actually be observable given sequences of larger and larger samples and sufficiently accurate calculation of the FCM prototypes. It is reasonable to conjecture that the asymptotic accuracy depends on such things as the amount of component separation and types of components; but no theoretical work has been done on this. (Empirical findings regarding the accuracy are given in the next section.)
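The kind of bias Theorem E describes can be glimpsed numerically: draw a large sample from a two-component normal mixture with unequal proportions, compute the m = 2 FCM prototypes for c = 2, and compare them with the true component centers 0 and 1. The 1-D sketch below is our own illustration, not an experiment from the paper; the mixture parameters, initialization, and iteration count are arbitrary choices, and the heavy component overlap chosen here exaggerates the effect.

```python
import numpy as np

def fcm_prototypes_1d(x, c=2, m=2.0, iters=100):
    """1-D fuzzy c-means prototypes via the updates (3a)/(3b1)."""
    v = np.percentile(x, np.linspace(10, 90, c))     # crude spread-out start
    for _ in range(iters):
        d = (x[None, :] - v[:, None]) ** 2 + 1e-12   # guard against d_ik = 0
        w = d ** (-1.0 / (m - 1.0))
        u = w / w.sum(axis=0)                        # (3b1)
        um = u ** m
        v = (um * x).sum(axis=1) / um.sum(axis=1)    # (3a)
    return np.sort(v)

rng = np.random.default_rng(0)
n = 100_000
from_first = rng.random(n) < 0.9                     # proportions 0.9 / 0.1
x = np.where(from_first, rng.normal(0.0, 1.0, n),
                         rng.normal(1.0, 1.0, n))
v = fcm_prototypes_1d(x)
print(v)   # compare with the true component centers 0 and 1
```

Even with this very large sample, the fitted prototypes land well away from the component means, since minimizing J_m estimates cluster centers of the pooled sample rather than parameters of the mixture.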

This section is ended by noting that result E resolves a longstanding question about whether FCM does in fact provide consistent estimators for normal mixtures (cf. Bezdek 1981). As with numerical convergence, the above theory only provides partial understanding about FCM partitioning. As in Section 3, we supplement this by the following results of numerical experiments.

4.2 Empirical Observations

To assess empirically the accuracy of FCM in one particular case (a mixture of c = 2 univariate, normally distributed subpopulations), Monte Carlo simulations, which are discussed in the four references given in Section 3.2, were conducted to compare the FCM approach to Wolfe's (1970) method of maximum likelihood specific to the true family of distributions used to generate the feature vector data.



As in the case of numerical convergence properties, the observed behavior of the approach was as good or better than that indicated by the theory. Not only did FCM produce cluster substructure estimates faster than the maximum-likelihood method, but in most cases the estimates were at least as accurate. Only when the component centers got very close did the maximum likelihood approach become clearly superior to that based on FCM. Roughly speaking, if there is enough separation of component distributions to create multimodality of the corresponding mixture density, then FCM has a "reasonable" chance of producing estimates which are at least as accurate as those obtained by maximum likelihood. It must be kept in mind that FCM is nonparametric in that it does not assume any particular form for the underlying distributions, while the maximum likelihood method relies heavily on the (correct) assumption that each component population is normally distributed. The motivation for this study is simple: FCM is less computationally demanding than maximum likelihood in terms of both time and space complexity. It should be noted that several comparisons of FCM to Hard c-Means (HCM) or Basic ISODATA are discussed in Bezdek (1981): FCM substantially extends the utility of HCM through the expedient of allowing overlapping clusters.

5. Discussion

The FCM approach has proven to be very effective for solving many cluster analysis problems; and the behavior of the FCM approach in practice is well documented. There are, of course, many substantial unanswered questions about FCM; for example, the choice of m in (2). This parameter in some sense controls the extent to which U is fuzzy. As m approaches one from above, optimal U's for J_m approach M_{cn}; conversely, as m → ∞, every u_ik at (3b) approaches (1/c). Moreover, interpretation of the numbers {u_ik} is itself controversial; to what extent do these numbers really assess a "degree of belongingness" to different subpopulations? Aspects of FCM such as these are further discussed in Bezdek (1981). It is clear that much more can be learned about the stochastic convergence theory of FCM, but it is probably true that numerical aspects of these algorithms are currently well understood.
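The limiting behavior of m just described is visible directly in the update (3b1): holding the distances from one point to the prototypes fixed, memberships harden toward {0,1} as m approaches 1 from above and flatten toward 1/c as m grows. A tiny sketch (the squared distances are arbitrary illustrative values of our own choosing):

```python
import numpy as np

d = np.array([1.0, 4.0, 9.0])   # squared distances from one point to c = 3 prototypes

def memberships(d, m):
    """Memberships of a single point via (3b1)."""
    w = d ** (-1.0 / (m - 1.0))
    return w / w.sum()

for m in (1.05, 2.0, 10.0):
    print(m, np.round(memberships(d, m), 3))
```

For m near 1 almost all membership concentrates on the nearest prototype; for large m the three memberships approach 1/3 each, illustrating why the choice of m matters in practice.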

Interesting questions remain concerning the saddle points of (2). First, how often will FCM converge to saddle points rather than minima? Is this a problem in practice, or just a necessary theoretical consideration? Are there ever cases when saddle points of (2) do a good job of representing the structure of the population? Another line of research involves extension of the results collected above to the more general fuzzy c-varieties functionals discussed in Bezdek (1981): which, if any, of the results above carry over to the more general setting? We hope to make these questions subjects of future reports. Readers interested in obtaining listings and/or computer programs for FCM in BASIC, PASCAL, FORTRAN or C may contact either author at their listed addresses.

References

BEZDEK, J. (1973), "Fuzzy Mathematics in Pattern Classification," Ph.D. dissertation, Cornell University, Ithaca, New York.

BEZDEK, J. (1980), "A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms," Institute of Electrical and Electronic Engineers Transactions on Pattern Analysis and Machine Intelligence, 2, 1-8.

BEZDEK, J. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press.

BEZDEK, J., DAVENPORT, J., HATHAWAY, R., and GLYNN, T. (1985), "A Comparison of the Fuzzy c-Means and EM Algorithms on Mixture Distributions with Different Levels of Component Overlapping," in The Proceedings of the 1985 IEEE Workshop on Languages for Automation: Cognitive Aspects in Information Processing, ed. S. K. Chang, Silver Spring, Maryland: Institute of Electrical and Electronic Engineers Computer Society Press, 98-102.

BEZDEK, J., HATHAWAY, R., HOWARD, R., WILSON, C., and WINDHAM, M. (1987), "Local Convergence Analysis of a Grouped Variable Version of Coordinate Descent," Journal of Optimization Theory and Applications, 54, 471-477.

BEZDEK, J., HATHAWAY, R., and HUGGINS, V. (1985), "Parametric Estimation for Normal Mixtures," Pattern Recognition Letters, 3, 79-84.

BEZDEK, J., HATHAWAY, R., SABIN, M., and TUCKER, W. (1987), "Convergence Theory for Fuzzy c-Means: Counterexamples and Repairs," Institute of Electrical and Electronic Engineers Transactions on Systems, Man and Cybernetics, 17, 873-877.

DAVENPORT, J., BEZDEK, J., and HATHAWAY, R. (1988), "Parameter Estimation for a Mixture of Distributions Using Fuzzy c-Means and Constrained Wolfe Algorithms," Journal of Computers and Mathematics with Applications, 15, 819-828.

DUNN, J. (1973), "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact, Well-Separated Clusters," Journal of Cybernetics, 3, 32-57.

HATHAWAY, R., and BEZDEK, J. (1986a), "Local Convergence of the Fuzzy c-Means Algorithms," Pattern Recognition, 19, 477-480.

HATHAWAY, R., and BEZDEK, J. (1986b), "On the Asymptotic Properties of Fuzzy c-Means Cluster Prototypes as Estimators of Mixture Subpopulation Centers," Communications in Statistics: Theory and Methods, 15, 505-513.

HATHAWAY, R., BEZDEK, J., and TUCKER, W. (1987), "An Improved Convergence Theory for the Fuzzy ISODATA Clustering Algorithms," in Analysis of Fuzzy Information, ed. J. C. Bezdek, Volume 3, Boca Raton: CRC Press, 123-132.

HATHAWAY, R., HUGGINS, V., and BEZDEK, J. (1984), "A Comparison of Methods for Computing Parameter Estimates for a Mixture of Normal Distributions," in Proceedings of the Fifteenth Annual Pittsburgh Conference on Modeling and Simulations, ed. E. Casetti, Research Triangle Park, NC: ISA, 1853-1860.

ISMAIL, M., and SELIM, S. (1986), "Fuzzy c-Means: Optimality of Solutions and Effective Termination of the Algorithm," Pattern Recognition, 19, 481-485.

KIM, T., BEZDEK, J., and HATHAWAY, R. (1987), "Optimality Test for Fixed Points of the FCM Algorithms," Pattern Recognition (in press).



SELIM, S., and ISMAIL, M. (1986), "On the Local Optimality of the Fuzzy ISODATA Clustering Algorithm," Institute of Electrical and Electronic Engineers Transactions on Pattern Analysis and Machine Intelligence, 8, 284-288.

TUCKER, W. (1987), "Counterexamples to the Convergence Theorem for Fuzzy ISODATA Clustering Algorithms," in Analysis of Fuzzy Information, ed. J. Bezdek, Volume 3, Boca Raton: CRC Press, lW-122.

WOLFE, J. H. (1970), "Pattern Clustering by Multivariate Mixture Analysis," Multivariate Behavioral Research, 5, 329-350.

ZANGWILL, W. (1969), Nonlinear Programming: A Unified Approach, Englewood Cliffs, NJ: Prentice Hall.