
A Clustering Method based on the Estimation of the Probability Density

Function and on the Skeleton by Influence Zones.

Application to Image Processing

M. Herbin, N. Bonnet *, and P. Vautrot

* Université de Reims, 21, rue Clément Ader, F 51100 Reims, France.

Abstract : This paper investigates a new approach for data clustering. The probability density

function (p.d.f.) is estimated by using the Parzen window technique. The p.d.f. thresholding

permits the segmentation of the data space by influence zones (SKIZ algorithm). A bottom-up

thresholding procedure is iterated to refine the segmentation. As a result, a complete partition

of the data space is obtained in parallel to the clustering of the data samples. In addition, an

estimation of the intrinsic dimensionality of the data set is provided. This clustering approach is tested with simulated data and applied to color image data.

Keywords : Clustering, Probability Density Function, Skeleton by Influence Zones.

I Introduction

The aim of clustering is to partition a data set into subsets of "similar" data in an unsupervised

manner. The data to be classified are represented by points in an n-dimensional space. The coordinates of these points correspond to the n observed features. Many clustering approaches have been suggested (Duda and Hart, 1973; Fukunaga, 1972). Some of them use an

estimation of the probability density function (p.d.f.) and others do not. Among the latter

methods are the hierarchical clustering approaches : a similarity criterion between data is

defined and aggregation algorithms are used to produce a hierarchy of classes (Sokal and

Sneath, 1963). By cutting this hierarchy at a certain level, classes of objects can be obtained.

Among the methods based on the estimation of the p.d.f., some are


parametric approaches, which assume that the shapes of the individual p.d.f.s are known and that only the related parameters have to be estimated. This assumption is often implicit: it is the case of the K-means or ISODATA procedures, which assume that the clusters' shape

is hyperspherical (when a Euclidean distance is used) or hyperelliptical (when the

Mahalanobis distance is used). The fuzzy C-means procedure (Dunn, 1973) belongs to the same category. Unfortunately, real-world clusters may have shapes that are far from hyperelliptical, and the classical clustering methods listed above then often fail. Therefore, there is

a need for clustering algorithms which do not make any (implicit or explicit) hypothesis

concerning the shape of the p.d.f. Such algorithms have been suggested, among which : the

density gradient-based approach (Fukunaga and Hostetler, 1975; Devijver and Kittler, 1982),

the maximum entropy clustering approach (Rose, Gurewitz, and Fox, 1990), the blurring (or

mean-shift) approach (Cheng and Wan, 1992; Cheng, 1995), the mode detection and convexity

analysis (Touzani and Postaire, 1988; Postaire and Olejnik, 1994), the dynamic procedure of

splitting (Garcia et al, 1995). A detailed comparison of these methods in different conditions

of application is still to be made.

The purpose of this paper is to suggest and experiment with another method belonging to this

category. Like the valley-seeking approach (Koontz and Fukunaga, 1972), our method is based

on an estimation of the p.d.f. through the Parzen window technique. But since gradient

analysis can be very sensitive to local irregularities of the estimated p.d.f. (Kittler, 1976), we

do not follow this line of analysis. Instead, we have found it more valuable to iteratively threshold the estimated p.d.f., starting from low threshold values and proceeding

upwards to high threshold values (i.e. from one class to the largest number of classes

corresponding to the modes of the estimated p.d.f.). At each level of thresholding, the

connected regions are detected in the n-dimensional space. The boundaries between these

connected regions (i.e. clusters) are estimated by using a classical image processing

transformation : the SKIZ (Skeleton by influence zones) (Serra, 1982). Once the highest level

thresholding is performed, the data space is partitioned and the boundaries which define the clusters are marked. Not only are the data classified, but in addition, the intrinsic dimensionality

of the data set can be estimated during the course of the clustering process.


The outline of the paper is the following : in the next section, we describe the discretization of

the data space and the p.d.f. estimation. Then the iterative algorithm is described by defining

the clustering procedure. Section III is devoted to testing this procedure with simulated data, as

well as proposing applications in the field of image processing. A discussion and conclusion

are developed in the last section.

II Clustering Method

The method we suggest is composed of three steps : the discretization of the data space, the

estimation of the p.d.f., and the segmentation of the discretized data space to determine the

clusters.

A Discretization of the Data Space

Let Q be the set of n-dimensional data to be clustered, $Q = \{x_1, \ldots, x_j, \ldots, x_q\}$, with q the number of data. Each datum $x_j$ ($1 \le j \le q$) is an n-dimensional point $x_j = (x_{j,1}, \ldots, x_{j,i}, \ldots, x_{j,n})$ with $1 \le i \le n$.

The clustering domain D is a hypercube of the data space. We assume that all data $x_j$ ($1 \le j \le q$) belong to this domain D, with

$$D = \prod_{1 \le i \le n} [a_i, b_i] \quad \text{with} \quad a_i \le \min_{1 \le j \le q}(x_{j,i}) \quad \text{and} \quad b_i > \max_{1 \le j \le q}(x_{j,i}). \tag{1}$$

In practical applications, the outliers could be excluded from D.

Each interval $[a_i, b_i]$ is partitioned into intervals of length $\lambda_i$. Let $p_i$ be the number of these intervals. The partition of each $[a_i, b_i]$ is defined as:

$$[a_i, b_i] = \bigcup_{0 \le k < p_i} \left[\, a_i + k\lambda_i,\; a_i + (k+1)\lambda_i \right) \quad \text{with} \quad \lambda_i = \frac{b_i - a_i}{p_i}. \tag{2}$$

These partitions define $p = \prod_{1 \le i \le n} p_i$ tiles, which cover the domain D. Note that rectangular tiles are used in dimension 2 (i.e. n = 2).


Each tile $P_r$ ($1 \le r \le p$) is defined as:

$$P_r = \prod_{1 \le i \le n} \left[\, \alpha_{r,i},\; \alpha_{r,i} + \lambda_i \right) \quad \text{with} \quad \alpha_{r,i} = a_i + p_{r,i}\,\lambda_i \quad \text{and} \quad 0 \le p_{r,i} < p_i. \tag{3}$$

The center of each tile $P_r$ is an n-dimensional point $x_r$ with:

$$x_r = (x_{r,1}, \ldots, x_{r,i}, \ldots, x_{r,n}) \quad \text{and} \quad x_{r,i} = \alpha_{r,i} + \frac{\lambda_i}{2} \quad \text{with} \quad 1 \le i \le n. \tag{4}$$

Note that the tile centers are the nodes of a rectangular grid in dimension two (n = 2). These p points $x_r$ ($1 \le r \le p$) define the set S as a discrete sampling of the data space. Each initial data point $x_j$ ($1 \le j \le q$) belongs to one tile of the domain D and is associated with the center point of this tile. Thus each datum is associated with one point of S.
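As an illustration of this discretization step, here is a minimal sketch in Python/NumPy; the helper name `discretize` and its interface are ours, not the paper's. It bins the q data points onto a grid of p0 tiles per dimension and returns the per-tile counts used below for the density estimate.

```python
# Minimal sketch of the discretization (Section II.A); assumes NumPy.
import numpy as np

def discretize(Q, p0):
    """Bin the (q, n) data array Q onto a regular grid of p0 tiles per axis.

    Returns the per-tile counts H and the bin edges, so that each datum
    can be associated with the center of its tile.
    """
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[1]
    a = Q.min(axis=0)                 # a_i <= min of the data (eq. 1)
    b = Q.max(axis=0) + 1e-9          # b_i strictly above the max (eq. 1)
    edges = [np.linspace(a[i], b[i], p0 + 1) for i in range(n)]
    H, _ = np.histogramdd(Q, bins=edges)
    return H, edges
```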

B Estimation of probability density function

The probability density function (p.d.f.) is estimated at each point $x_r$ ($1 \le r \le p$) of the set S (the nodes of a rectangular grid if n = 2). S is an n-dimensional set of points and each point has 2n neighbours (except the border points). Using this neighbourhood relation, S is considered as an n-dimensional network with $p_i$ elements per dimension ($1 \le i \le n$).

Each point $x_r$ of this network is the center of the tile $P_r$ and each tile has the same volume $v = \prod_{1 \le i \le n} \lambda_i$. Thus, the density of data in each tile is $d_r = \#P_r / v$, where $\#P_r$ is the percentage of the initial data lying in the tile.

Using the classical Parzen windows approach (Duda and Hart, 1973), the p.d.f. is estimated at each point $x_r$ by interpolation:

$$\text{p.d.f.}(x_r) = \sum_{1 \le s \le p} d_s \, \gamma(x_r - x_s) \tag{5}$$

where $\gamma(x_r - x_s)$ is an interpolation kernel.

We choose for $\gamma$ an n-dimensional Gaussian kernel. Note that the interpolation is a classical discrete convolution. As the Gaussian filter is separable, the n-dimensional convolution is obtained by n successive one-dimensional convolutions, one along each dimension. Thus, only n one-dimensional Gaussian kernels have to be computed to estimate the p.d.f. values. Fig.1 shows a set of 900 data in a two-dimensional domain, obtained from three Gaussian distributions of 300 data each. The domain is discretized and represented by the image A of 512*512 pixels (i.e. a partition of D with 512*512 rectangular tiles): the higher the p.d.f.,


the higher the grey level values of image pixels. Three p.d.f. estimations are given using three

standard deviation values in the Parzen window interpolation. The standard deviation values

are defined as σ = 18, σ = 26, and σ = 51 in pixel units for images B, C, and D respectively.

Let M be the maximum value of the p.d.f. estimation on the set S. L levels are selected between 0 and M. Fig.1 shows the p.d.f. using 32 levels from 0 to 31 (L = 32). The level at each point $x_r$ is the integer value defined as:

$$\text{level}(x_r) = \text{INT}\left(L \times \frac{\text{p.d.f.}(x_r)}{M}\right) \quad \text{(with } 1 \le r \le p\text{)}. \tag{6}$$

In the following, these levels are used in place of the p.d.f. values. Note that these levels correspond to thresholds of the estimated p.d.f.
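A hedged sketch of equations (5) and (6), assuming SciPy is available: `gaussian_filter` performs the separable n-dimensional Gaussian convolution described above (one one-dimensional kernel per axis), and the result is then quantized into L integer levels. Only the ratio p.d.f./M matters here, so constant normalization factors of the kernel can be ignored.

```python
# Sketch of the Parzen estimate (eq. 5) and level quantization (eq. 6).
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_levels(H, sigma, L=8):
    """Smooth the tile counts H with a Gaussian kernel and quantize
    the result into L integer levels 0 .. L-1."""
    pdf = gaussian_filter(H.astype(float), sigma=sigma)  # eq. (5), up to a constant
    M = pdf.max()
    levels = (L * pdf / M).astype(int)                   # eq. (6)
    return np.clip(levels, 0, L - 1)                     # the maximum maps to L-1
```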

C Clustering Algorithm

We have L levels for the p.d.f., from 0 to L-1. At each level e ($0 \le e < L$), let $S_e$ be the set of points with a density level of at least e; thus $S_e \subseteq S$. We use this set to define temporary

clusters of S at each level of density. First, coarse clusters are defined. Then the clustering is

refined by increasing the level. This refinement is obtained by splitting the previous clusters

using the SKIZ procedure described below.

Using the n-dimensional network defined with neighbourhood relations between the points of

S, the set Se is composed of connected regions (i.e. sets of connected points). c(Se) is defined

as the number of connected regions. The clustering algorithm is described using an example

(Fig.2) which shows the clustering process with four levels (L=4) in a two-dimensional data

set (n=2). The set S has 14*12 points. Each point is the center of one tile of the domain

partition; the grey grid shows the tile boundaries. The black grid shows the neighbourhood

connections between points (i.e. the network). For initialization (at level e = 0), we have $S_0 = S$. Thus only one connected region is detected at this initial level ($c(S_0) = 1$) and there is only one cluster, as shown in grey in Fig.2A. Let us describe the iteration from level e to level e+1. Note that $S_{e+1}$ is included in $S_e$ ($S_{e+1} \subseteq S_e \subseteq S$); thus a connected region at level e could become disconnected at level e+1. In Fig.2B, the grey part shows $S_1$; we have $S_1 \subseteq S_0 \subseteq S$ and $S_1$ has two connected regions. In a case of disconnection, there is a split. A cluster C of level e is split into new clusters only if C contains more than one connected region of level e+1 (i.e. $c(C \cap S_{e+1}) > 1$). The number of new clusters is equal to


$c(C \cap S_{e+1})$ (i.e. the number of connected regions inside C). These connected regions are used

as seeds (i.e. markers) for the splitting procedure. A city-block distance is used in the network

S. As in the image processing field, the use of distance and markers permits us to define

influence zones, and the boundary between influence zones is known as SKIZ (skeleton by

influence zone) (Serra, 1982). The SKIZ principle is easily described : each point of the old

cluster C is associated with the nearest connected region (nearest in terms of the city-block

distance). For each connected region, a distance function is computed in C. The influence zone

of a connected region contains all the points in C for which the associated distance function is

smaller than the distance function of the other regions. Fig.2C shows a split of the initial

cluster into two new clusters, the black line in bold shows the border between the two new

clusters. At each iteration of the algorithm, all the clusters are checked to decide a possible

split. If a cluster C contains either only one connected region of level e+1 ($c(C \cap S_{e+1}) = 1$) or no connected region of level e+1 ($c(C \cap S_{e+1}) = 0$), then the cluster C is kept without splitting.

At level 2 of the example (Fig.2D), one of the initial clusters contains only one connected

region, so this cluster is kept. The second initial cluster contains two connected regions;

therefore, this cluster is split into two parts. Thus three clusters are detected at level 2 (Fig.2E).

At level 3, these three clusters are checked; since they contain one or zero connected region,

they are kept without splitting. At the end of this example three clusters are obtained.

After the L iterations, we obtain a partition of S. This partition also induces a partition of the data space D: each part of D is composed of the tiles whose centers lie in the same part of the partition of S. Thus each initial datum of Q belongs to one and only one part of D, and the clustering of Q is thereby performed.
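The whole bottom-up procedure can be sketched as follows, reusing the hypothetical `estimate_levels` output above. The sketch takes $S_e$ = {points with level ≥ e}, detects connected regions with `scipy.ndimage.label` (cross-shaped, 2n-neighbour connectivity), and approximates the SKIZ with an ambient city-block distance transform; the paper computes the distance functions inside each cluster C, so this is a simplification.

```python
# Sketch of the iterative SKIZ clustering (Section II.C); assumes SciPy.
import numpy as np
from scipy.ndimage import label, distance_transform_cdt

def skiz_clustering(levels, L=8):
    clusters = np.ones(levels.shape, dtype=int)   # level 0: a single cluster
    for e in range(1, L):
        Se = levels >= e                          # S_e
        for c in np.unique(clusters):
            inside = clusters == c
            seeds, n_seeds = label(Se & inside)   # connected regions of S_e in C
            if n_seeds <= 1:
                continue                          # 0 or 1 region: keep the cluster
            # Influence zones: attach every point to its nearest seed under
            # the city-block ('taxicab') distance.
            _, idx = distance_transform_cdt(seeds == 0, metric='taxicab',
                                            return_indices=True)
            zones = seeds[tuple(idx)]             # label of the nearest seed
            fresh = clusters.max() + 1 + np.arange(n_seeds)
            clusters[inside] = fresh[zones[inside] - 1]
    return clusters
```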

D Implementation

The result of clustering depends on three parameters : the data space discretization step, the

Parzen window width, and the number of levels L.

First, we assume that the discretization is as fine as the device capacity allows, as explained below. The domain D is defined as the smallest hypercube containing the data set Q. The n-dimensional domain D is partitioned into p hypercube tiles with $p = p_0^n$. Therefore the discretization step $\lambda_i$ along each dimension i ($1 \le i \le n$) is the ratio of the width of D to the


value $p_0$. $p_0$ is the discretization parameter for clustering; it defines the number of discretization steps in each dimension. When the dimension n is equal to 2, we select a large value for $p_0$ ($p_0 = 512$ for instance). But when the dimension n is equal to 4, the use of a $512 \times 512 \times 512 \times 512$ discrete set requires a lot of memory and the algorithm needs a lot of CPU time; thus we select a small value for $p_0$ ($p_0 = 32$ for instance). Two strategies are then developed according to this discretization parameter, and the estimation of the p.d.f. depends on the selection of $p_0$.

For each dimension i ($1 \le i \le n$), we have to select the standard deviation of the Gaussian filter used to estimate the p.d.f. As the initial distribution of the data is unknown, these standard deviations $s_i$ are selected by an empirical approach. The discretization steps are used as units in each dimension i. The standard deviations have the same constant value s for every dimension: $s_i = s$ ($1 \le i \le n$). Thus the Parzen window width is defined through this s value. The result of the clustering depends to a large extent on the ratio $\rho = s / p_0$.

Concerning the strategy for selecting ρ in the case of a large $p_0$ value ($p_0 = 512$): if ρ is too

large, the estimated p.d.f. is too "smoothed" and tends toward an unimodal function. On the

other hand, if ρ is too small, the p.d.f. estimation is "noisy" describing "the superposition of

sharp pulses" (Duda and Hart, 1973) as we observe in Fig.1. Therefore the number of detected

clusters depends on this value ρ. The smaller ρ is, the higher the number of detected clusters

will be. We suggest taking advantage of this relationship to select the final ρ value. Fig. 3

shows the number of clusters as a function of ρ. The curve is obtained with the same data as in

Fig.1 by using the algorithm with 8 iterations (i.e. 8 p.d.f. thresholds), and ρ lies between 1%

and 10%. The number of clusters remains constant and equal to 3 in a large plateau showing

that the intrinsic dimensionality of the problem is equal to 3. Thus we propose selecting the ρ

value in the middle of this largest range (i.e. ρ = 7% ). Note that a similar empirical procedure

has already been used in cluster analysis to select this kind of tuning parameter (Zhang and

Postaire, 1994).
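This plateau search can be sketched as a simple scan over ρ, reusing the hypothetical helpers of the previous sketches; since the discretization steps are the distance units, sigma = ρ·p0:

```python
# Sketch of the empirical selection of rho (rho = s / p0).
import numpy as np

def scan_rho(Q, p0=512, L=8, rhos=np.arange(0.01, 0.105, 0.005)):
    """Count the detected clusters for a range of rho values."""
    H, _ = discretize(Q, p0)
    counts = []
    for rho in rhos:
        levels = estimate_levels(H, sigma=rho * p0, L=L)
        counts.append(len(np.unique(skiz_clustering(levels, L=L))))
    return rhos, np.array(counts)   # look for the widest constant plateau
```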

If $p_0$ has a small value ($p_0 = 32$), then the discretization steps are large with respect to the

domain size, and each tile of the domain is large. Each tile can integrate many samples of the

initial data set and the estimated p.d.f. is already "smoothed" without any interpolation


procedure. Thus it could be difficult to observe a large range with a constant number of

detected clusters. Table 1 shows the number of detected clusters as a function of ρ values

using the data shown in Fig.1 and the algorithm with 16 iterations (i.e. 16 p.d.f. thresholds).

Note that the higher p0 is, the larger the plateau is, and the percentage of labelling errors

decreases within this plateau. If such a plateau is not detected, a small value of ρ is required

and we propose ρ ≤ 1%.

The clustering also depends on the number L of algorithm iterations. The number L of p.d.f. threshold levels should be as high as a reasonable computing time allows.

Eight iterations (L=8) seem to be a good compromise; the clustering improvement obtained by

increasing L (using L=16) does not seem significant in the following examples.

III Applications

Our strategy is to first test this clustering procedure with simulated data and then to propose applications in image processing.

A Example of overlap between clusters of unequal weights

The goal of the study is to compare the results of this clustering procedure with those of the

theoretical Bayes decision. Toward this end, three Gaussian distributions of two-dimensional

data are simulated. Three clusters of 900, 600, and 300 data are generated. The mean values of

the Gaussian distributions are equal to (0, 0), (40, 0), and (20, 30), and the standard deviations

have the same value of 10. Fig.4A shows these simulated data distributions with unequal

weights and an overlap between the clusters. The data space is discretized in 512*512

elements. The p.d.f. is estimated using a ρ value equal to 5 %. Fig.4B shows a representation

of this p.d.f. estimation. The clustering procedure is defined with 8 iterations (i.e. 8 p.d.f.

thresholds). For ρ values between 3 % and 7 %, the procedure yields three influence zones.

For these values, the data space is partitioned into three parts. As described above, the middle

of the largest plateau is taken into consideration (ρ = 5%). Fig.4A shows the three detected

parts with three grey levels. Through this clustering procedure, the error rate is equal to 5.4 %

(98 labelling errors). Using Bayes theoretical decision, the error rate should be equal to 3.7 %

(66 errors induced by the overlap). Note that the result of our clustering approach is not so far


from that of the Bayes decision, but our approach does not use any a priori knowledge about

the number of clusters or the probability distribution of the cluster data. Using a 16-iteration

algorithm, a slight improvement is observed; the error rate is reduced to 4.1 % (73 labelling

errors).
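For reference, the setup of this example can be reproduced along the following lines with the hypothetical helpers sketched in Section II (the random seed, and hence the exact error rate, will vary):

```python
# Usage sketch reproducing the simulation of Section III.A.
import numpy as np

rng = np.random.default_rng(0)
means, sizes = [(0, 0), (40, 0), (20, 30)], [900, 600, 300]
Q = np.vstack([rng.normal(m, 10.0, size=(s, 2)) for m, s in zip(means, sizes)])

H, _ = discretize(Q, p0=512)
levels = estimate_levels(H, sigma=0.05 * 512, L=8)   # rho = 5 %
clusters = skiz_clustering(levels, L=8)              # three zones expected
```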

B Example of non-linearly separable clusters

The clustering procedure is particularly well suited to non-linearly separable clusters where

classical procedures fail. This example shows the results in such a case. Two non-linearly

separable clusters are simulated in dimension two with 300 and 400 data (Fig.5A). The data

space is discretized in 512*512 elements. Fig.5B shows the p.d.f. estimation with ρ = 6%. The

clustering procedure is used with 8 iterations. Fig.5A shows the result of the clustering. Note

that the procedure gives two clusters for ρ values between 2% and 10%. The error rate is

found to be equal to 4.7 % (33 labelling errors). Using our procedure with 16 iterations, the

error rate is equal to 4.3 % (30 labelling errors). Such clustering cannot be obtained with

classical procedures such as K-means using either the Euclidean or the Mahalanobis distance, because the decision boundary between clusters is then either a straight line or a quadric (Duda

and Hart, 1973).

C An example in four dimensions

Imaging instruments are nowadays able to record multiple images (multispectral images for

instance, or multiple maps of chemical species in X-ray microanalysis (Bonnet and Quintana,

1994), or Secondary Ion Mass Spectroscopy (Van Espen et al., 1992)). The classification of

pixels according to their multiple components can sometimes be performed by supervised

procedures, as in aerial imagery for instance, but this is not practicable for studies at a

microscopic level. There is thus a need to develop unsupervised classification procedures for

the analysis of these multidimensional images (Bonnet, 1995). As a preliminary investigation

of this problem, we simulated 128*128 images in four dimensions, supposed to represent four

elemental maps corrupted by Poisson noise. Figs.6A, 6B, 6C, and 6D show these four maps.

The data space is discretized in 32*32*32*32 elements. The p.d.f. is estimated in this 4-

dimensional space with a ρ value equal to 1 %. Neither this data space nor the estimated p.d.f.

can be easily displayed, which prevents any interactive partitioning method from being applied.


Using 8 p.d.f. thresholds, the procedure detects 10 clusters (Fig.6E), as expected. On this multidimensional problem, the proposed procedure thus gives the expected results with non-Gaussian distributions and unsupervised clustering. Note that one expected cluster contains

more than 50 % of the data (background in Fig.6E) and each of the 9 other clusters contains

about 5 % of the data. Despite the unequal weights of the clusters, the clustering procedure

does not fail.

D Application in Color Image Processing

Each pixel of a color image has three values: red (R), green (G) and blue (B). Each value is an integer between 0 and 255 (i.e. an 8-bit code). A classical problem in color image

processing is to search how many "colors" are present in an image. This problem cannot have

a simple answer because of the difficulties met when defining the term color in the image processing field. The discrimination between two colors is not a trivial procedure (Herbin et al., 1990). In this paper, a datum is an (R, G, B) point in the color space. A

color image is considered as a set of 3-dimensional data to which we can apply the clustering

analysis. Note that we do not use the spatial distribution of the data inside the image; thus this

clustering approach is only considered as a pre-processing of color images. The goal is to

study the ability of this clustering algorithm to discriminate color phenomena. We propose

distinguishing 18 greyish color patches on a neutral grey background (Fig.7). White lines

separate color patches. The classical transformation of the RGB space into IHS space

(Intensity, Hue, Saturation) is not used in such a case because the hue is very noisy or cannot

be computed for grey, white, and greyish colors (Ohta, Kanade, and Sakai, 1980). Note

that the background represents more than half of the data. Thus many transformations, such as the Karhunen-Loeve transformation, are not well suited to this data processing

because they increase the importance of small differences between background grey values

and decrease the importance of small differences between patch color values. Fig.7 shows the

three components of the color images: images A, B and C are respectively the Red, Green, and

Blue components. The 512*512 data of the color image are discretized in 64*64*64 elements.

The p.d.f. is estimated using 8 levels. The ρ value is taken as 0.5 % and 19 clusters are

detected with the clustering procedure. The cluster results are shown with 19 grey levels in


Fig.7D. Each color patch and background has a different label. These results can be compared

to those of the K-means procedure with Euclidean distance (Fig.7E). Using many random

initializations, the K-means procedure searching 19 clusters always fails because of the large

number of background data compared to the small number of color patch data.
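As a sketch of this color pre-processing, assuming the image is a (512, 512, 3) NumPy array and using the hypothetical helpers of Section II: each pixel becomes a 3-dimensional (R, G, B) datum, the clustering runs on a 64*64*64 grid, and every pixel is then mapped back to the cluster of its tile.

```python
# Sketch of the color-image clustering of Section III.D.
import numpy as np

def cluster_colors(img, p0=64, L=8, rho=0.005):
    """Cluster the (R, G, B) values of an image and label every pixel."""
    Q = img.reshape(-1, 3).astype(float)        # one 3-D datum per pixel
    H, edges = discretize(Q, p0)
    levels = estimate_levels(H, sigma=rho * p0, L=L)
    clusters = skiz_clustering(levels, L=L)
    # Map each pixel back to the cluster of the tile it falls into.
    idx = tuple(np.clip(np.searchsorted(e, Q[:, i], side='right') - 1, 0, p0 - 1)
                for i, e in enumerate(edges))
    return clusters[idx].reshape(img.shape[:2])
```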

IV Discussion and Conclusion

The aim of this paper was to investigate a new approach for data clustering, with a specific

emphasis on applications in the field of image analysis. The new approach is intended to avoid the prior knowledge assumed by many classical clustering procedures concerning the shape and/or the number of clusters present in the data set.

Concerning the shape of clusters : our approach is based on the estimate of the probability

density functions (p.d.f.) according to the Parzen window technique. Thus, the clusters can

have any shape, including irregular ones which are difficult to describe analytically.

Concerning the clustering process : the samples are gathered in different clusters according to

a bottom-up iterative thresholding of the p.d.f. The successively detected parts define a hierarchy of regions, of which the last one is used for clustering. Since neither the concept

of center of cluster nor the concept of class variance is used, the process remains independent

of any assumption about the shape of class distributions.

Concerning the number of clusters : the whole process is in fact repeated for different values

of one parameter involved in the estimate of the p.d.f.; this parameter is the standard deviation

in the case of a Gaussian interpolating function. Plotting the number of clusters obtained for

different values of this parameter reveals (as a plateau) the intrinsic dimensionality of the data

set. Of course, in some situations, several plateaus are observed, revealing the hierarchical

nature of clustering (main classes can be subdivided in sub-classes). In this case, the user has

to choose between several clustering alternatives.


A detailed comparison between this approach and previously described algorithms is beyond

the scope of this paper and will be the subject of a forthcoming article. However, some

qualitative indications can be given here. We concentrate the discussion on methods which make no hypothesis concerning the shape of clusters, because classical clustering algorithms such

as K-means, fuzzy C-means or ISODATA are disqualified in the general case of arbitrary

shaped clusters.

One suggested method which is relatively close to ours is the "dynamic procedure of splitting"

of Garcia et al (Garcia et al, 1995). The idea of this method is that two data belong to the same

cluster if one can go from the first one to the second one without having to jump a large step.

This idea can be restated as : the two data points belong to the same region of the p.d.f.

without any valley between the two points. There is thus a common background shared by the

two methods. However, the explicit use of the p.d.f. in our method allows us to handle

situations with overlapping distributions, which cannot be handled by the "dynamic procedure

of splitting".

Another method which does not make any assumptions relative to the shape of clusters is the

"mean shift" clustering (Cheng, 1995). It has been shown that this approach is equivalent to a

hill-climbing of the p.d.f. Thus, there is also a common basis with our approach. Detailed

comparisons, including numerical figures of merit in various situations, have to be performed

before a more definitive conclusion can be drawn. We have the impression that the

computation time is higher in the case of mean shift than with our algorithm.

Another method which deserves attention is the method of Postaire and his co-workers

(Touzani and Postaire, 1988; Zhang and Postaire, 1994). The idea is also to start with an

estimation of the p.d.f. The main difference concerns the second step: clustering according to

the p.d.f. estimation. They proceed by searching the modes of the p.d.f. as convex parts of the

p.d.f.; the points belonging to the concave parts are put into the different regions after a data

space equi-partition through the influence zones of the detected modes. Although once again

additional work is needed for comparison, we can make two comments about this approach.


First, in order to maintain the connectivity of the regions, the data space has to be discretized

into a small number of points. In these conditions, a precise partition of the data space cannot

be obtained. One important advantage of our bottom-up approach is that the connectivity of

regions is always preserved. Second, we can notice that a bias is introduced during the partition of the data space when the modes approach is used in the case of classes with rather

different variances. This is illustrated in Fig.8. In Fig.8A, two modes of the p.d.f. are first

detected and then the remaining data space is split into two influence zones. One can notice that too much space on the right of the boundary is allocated to the smaller class. In Fig.8B, which

corresponds to our approach, the p.d.f. is first thresholded. The iterative thresholding ensures

that the correct threshold will be found. Then the remaining space is divided into two parts.

The misclassified samples should be less numerous in this case.

Although some work remains to be done to evaluate our method more deeply, we consider that these preliminary results are encouraging and that the method could be of great interest for

solving many clustering problems arising in image processing applications, where the clusters

may have arbitrary shapes and where the classical clustering methods (K-means, fuzzy C-means, ISODATA, ...) often fail.

In this preliminary version, the iterative algorithm uses a fixed number of predetermined

equally-spaced thresholds of the p.d.f. Future work is intended to develop a strategy for

automatically selecting non-equally spaced thresholds adapted to specific clustering problems.

Acknowledgements :

The authors thank the anonymous referees for their helpful suggestions and corrections of the

first version of the manuscript.


Figure Captions

Fig. 1 : Estimation of probability density function.

Example of p.d.f. estimation with Parzen window function and Gaussian interpolation.

Image A : Distributions of 1200 data in a two-dimensional space (512*512 square units)

Images B, C, and D: Estimates of p.d.f. with σ equal to 18, 26, and 51 units, respectively.

Fig. 2 : Iterative SKIZ

Iterative partitioning through influence zones: example of a four-level process in a two-

dimensional space. The grey grid limits the rectangular tiles in the search domain. The black

grid shows the network connecting the tiles through neighbourhood relations.

Image A : At level 0, only one connected region appears. The grey part overlaps the whole

domain.

Image B : At level 1, two grey parts appear; there are two connected regions.

Image C : The black line in bold separates the two influence zones associated with connected

regions. It was computed as the skeleton by influence zones (SKIZ) of the connected regions.

Image D : At level 2, three grey parts show three connected regions.

Image E : The part containing two connected regions is split. A new black line in bold appears as the SKIZ between the newly created regions. The second part remains the same.

Image F : At level 3, the three previous parts are maintained because they contain only zero

or one connected region.

Fig. 3 : Number of clusters

The number of detected clusters as a function of the tuning parameter ρ. ρ is the standard deviation of the Gaussian Parzen window used for the p.d.f. estimate, expressed as a percentage of the number p0 of discretization steps in the data space.


Fig. 4 : Clustering of three Gaussian distributions.

Image A : Three simulated Gaussian distributions of 900, 600, and 300 data. The grey

backgrounds show the zones of influence attributed to the three clusters.

Image B : The p.d.f. estimation used for this clustering.

Fig. 5 : Clustering of non linearly separable distributions.

Image A : Two distributions of 400 and 300 data. The grey backgrounds show the zones

of influence attributed to the two clusters (Note that the shape of the boundary is neither linear

nor quadric).

Image B : The p.d.f. estimation used for this clustering.

Fig. 6 : Clustering in a four-dimensional space.

Images A, B, C, and D : The four components of 128*128 images.

Image E : The 10 grey values show the 10 detected clusters. Some labelling errors could be reduced by relaxation procedures.

Fig. 7 : Clustering of color image data.

Images A, B, and C : The three color components (Red, Green, and Blue respectively) of the

image (512*512 pixels).

Image D : The 19 grey values show the 19 detected clusters.

Image E : The 19 clusters detected using a K-means procedure with random initialization.


Fig. 8 : Partition of the space through p.d.f. analysis

Comparison of the modes analysis approach and our thresholding approach : Example of p.d.f.

with two modes in a one-dimensional space.

A : Detection of the p.d.f. modes by convexity analysis and partition of the space into two

clusters defined by influence zones. Misclassifications occur for the points on the right side of

the estimated boundary.

B : Thresholding of the p.d.f. and partition of the space into two clusters defined by influence zones. Misclassifications are much less numerous than in A.

Table 1 : Number of clusters and error rate.

Number of detected clusters as a function of the two parameters p0 and ρ (using the 16-iteration procedure and the data shown in Fig.1): p0 is the discretization parameter, defined as the number of discretization steps in each dimension of the data space; ρ is the tuning parameter for the p.d.f. estimation, defined as the standard deviation of the Parzen window function expressed as a percentage of the data space size. If the number of detected clusters is equal to 3, then the percentage of labelling errors is computed.


References

Bonnet, N. and C. Quintana (1994), Multivariate Statistical Analysis applied to X-ray Spectra

and X-ray Mapping of Liver Cell Nuclei. Scanning Microscopy 8 (3), 563-586.

Bonnet, N. (1995), Preliminary Investigation of two Methods for the Automatic Handling of

Multivariate Maps in Microanalysis. Ultramicroscopy 57, 17-27.

Cheng, Y. and S. Wan (1992), Analysis of the Blurring Process. In: T. Petsche et al., Eds.,

Computational Learning Theory and Natural Learning Systems 3, MIT Press, London,

257-276.

Cheng, Y. (1995), Mean Shift, Mode Seeking, and Clustering. IEEE Trans. Pattern Anal.

Machine Intell. 17 (8), 790-799.

Devijver, P.A. and J. Kittler (1982), Pattern Recognition: a Statistical Approach. Englewood

Cliffs, NJ: Prentice-Hall.

Duda, R.O. and P.E. Hart (1973), Pattern Classification and Scene Analysis. Wiley, New

York.

Dunn, J.C. (1973), A Fuzzy Relative of the ISODATA Process and Its Use in Detecting

Compact Well-Separated Clusters. J. Cybernetics 3 (3), 32-57.

Fukunaga, K. (1972), Introduction to Statistical Pattern Recognition. Academic, New York.

Fukunaga, K. and L.D. Hostetler (1975), The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Trans. Inf. Theory 21 (1), 32-40.

Garcia, J.A., J. Fdez-Valdivia, F.J. Cortijo, and R. Molina (1995), A Dynamic Approach for

Clustering Data. Signal Processing 44, 181-196.

Herbin, M., A. Venot, J.Y. Devaux, and C. Piette (1990), Color Quantitation Through Image

Processing in Dermatology. IEEE Trans. Medical Imaging 9 (3), 262-269.

Kittler, J. (1976), A Locally Sensitive Method for Cluster Analysis. Pattern Recognition 8, 23-

33.


Koontz, W.L. and K. Fukunaga (1972), A Nonparametric Valley-Seeking Technique for

Cluster Analysis. IEEE Trans. Comput. 21 (2), 171-178.

Ohta, Y., T. Kanade, and T. Sakai (1980), Color Information for Region Segmentation. Comput. Vision Graphics Image Process. 13, 222-241.

Postaire, J.G. and S. Olejnik (1994), A Relaxation Scheme for Improving a Convexity Based

Clustering Method. Pattern Recognition Letters 15, 1211-1221.

Rose, K., E. Gurewitz, and G.C. Fox (1990), Statistical Mechanics and Phase Transitions in

Clustering. Physical Review Letters 65, 945-948.

Serra, J. (1982), Image Analysis and Mathematical Morphology. Academic Press, New York.

Sokal, R.R. and P.H. Sneath (1963), Principles of Numerical Taxonomy. W. H. Freeman, San

Francisco.

Touzani, A. and J.-G. Postaire (1988), Mode Detection by Relaxation. IEEE Trans. Pattern

Anal. Machine Intell. 10, 970-978.

Van Espen, P., G. Janssens, W. Vanhoolst, and P. Geladi (1992), Analusis 20, 81.

Zhang, R.D. and J.-G. Postaire (1994), Convexity Dependent Morphological Transformations for Mode Detection in Cluster Analysis. Pattern Recognition 27 (1), 135-148.


FIG. 1 : Estimation of Probability Density Function (panels A to D)


FIG. 2 : Iterative SKIZ (panels A to F)


FIG. 3 : Number of clusters (vertical axis: number of clusters, from 0 to 10; horizontal axis: parameter ρ for p.d.f. estimation, from 0% to 14%)


FIG. 4 : Clustering of three Gaussian Distributions (panels A, B)


FIG. 5 : Clustering of non-linearly separable distributions (panels A, B)


FIG. 6 : Clustering in a four-dimensional space (panels A to E)


FIG. 7 : Clustering of color image data (panels A to E)


FIG. 8 : Partition of the space through p.d.f. analysis (panel A: analyzing the p.d.f. through convexity, showing the two convexity regions and the boundary between the two clusters; panel B: analyzing the p.d.f. through thresholding, showing the threshold of the p.d.f., the two thresholded zones, and the boundary between the two clusters)