Chapter 4
Rough Fuzzy c-Means Subspace Clustering
In this chapter, we propose a novel adaptation of the rough fuzzy c-means algorithm for high dimensional data by modifying its objective function. The proposed algorithm automatically detects the relevant cluster dimensions of the high dimensional data set. Since the assignment of weights to attributes is specific to each cluster, an efficient subspace clustering scheme is generated. We also discuss the convergence of the proposed algorithm. The remainder of this chapter is organised as follows: section 4.1 introduces rough set theory; section 4.2 reviews related work, describing how classical clustering methods have been adapted to the requirements of high dimensional data; section 4.3 extends the rough fuzzy c-means algorithm for subspace clustering in the form of the Rough Fuzzy c-Means Subspace (RFCMS) algorithm; section 4.4 discusses the convergence of the proposed algorithm; section 4.5 presents the results of applying the RFCMS algorithm to several UCI data sets; and finally section 4.6 summarizes the chapter.
4.1 Introduction
Pawlak introduced rough set theory as a new framework for dealing with
imperfect knowledge [Pawlak, 1991]. Rough set theory provides a methodology for addressing the problem of relevant feature selection: it selects a set of information-rich features from a data set that retains the semantics of the original data and, unlike statistical approaches, requires no human input [Jensen, 1999]. It is often possible to arrive at a minimal feature set (called a reduct in rough set theory) that can be used for data analysis tasks such as classification and clustering [Lingras and West, 2004], [Mitra et al., 2006]. When feature selection approaches based on rough sets are combined with an intelligent classification system, like those based on fuzzy systems or neural networks, they retain the descriptive power of the overall classifier and result in a simplified system structure, which enhances the understandability of the resultant system [Shen, 2007].
Following Rutkowski, we describe the notion of rough sets used to model uncertainty in information systems [Rutkowski, 2008]. Formally, an information system is a pair $(U, A)$, where $U$ is a non-empty finite set of objects and $A$ is a non-empty finite set of attributes such that each attribute $a$ has an associated value set $V_a$, i.e. $a : U \to V_a$ for every $a \in A$. A Decision System $DS$ is defined as a pair $(U, A \cup \{d\})$, where $d \notin A$ is called the decision attribute and the elements of $A$ are called condition attributes. For an attribute set $B \subseteq A$, the set of objects in the information system indiscernible w.r.t. $B$ is described by the indiscernibility relation $INDIS(B)$ defined as: $INDIS(B) = \{(x_1, x_2) \in U^2 \mid a(x_1) = a(x_2)\ \forall a \in B\}$. The objects $x_1$ and $x_2$ are indiscernible from each other by attributes from $B$ if $(x_1, x_2) \in INDIS(B)$. The equivalence classes of the $B$-indiscernibility relation are denoted by $[x]_B$. If $X \subseteq U$ then $X$ can be approximated using $B$ by constructing three approximations, namely, the $B$ lower approximation $\underline{B}X = \{x \mid [x]_B \subseteq X\}$, the $B$ upper approximation $\overline{B}X = \{x \mid [x]_B \cap X \neq \emptyset\}$, and the $B$ boundary region $\overline{B}X - \underline{B}X$ of $X$. Evidently, the boundary region consists of all objects in the upper approximation but not in the lower approximation of $X$.
proximation of X. Bazan et al. discuss various techniques for rough set
reduct generation and argue that the classical reducts being static may not
be stable in randomly chosen samples of a given decision table [Bazan et al.,
2000]. To deal with such situations they focus on reducts that are stable over
different subsets of samples chosen from a given decision table. Such reducts
are called dynamic reducts. They compute reducts using an order based
genetic algorithm and subsequently extract dynamic reducts which are used
to generate classification rules. Each rule set is associated with a measure
called the rule strength which is used later to resolve conflicts when several
rules are applicable. Slezak generalized the concept of reduct by introduc-
ing the notion of association reducts corresponding to both association rules
and rough set reducts [Slezak, 2005]. He defined association reduct as a pair
(A, B) of disjoint subsets of attributes such that all data supported patterns
involving A approximately determine those involving B. He developed an
information theory based algorithm to compute association reducts. As the
algorithm needs to examine all association reducts, it has exponential time
requirements. In order to alleviate this hardship, Slezak targeted significantly
smaller ensembles of dependencies providing reasonably rich knowledge, and
developed an order based genetic algorithm to achieve this [Slezak, 2009].
Shen and Jensen proposed the concept of retainer as an approximation of a
reduct [Richard and Qiang, 2001]. The authors suggest a heuristic to com-
pute the retainer and demonstrate its usefulness for the classification task.
For clustering a textual database consisting of $N$ documents with a vocabulary of size $V$, Li et al. developed an algorithm based on approximate reducts that runs in time $O(VN)$ [Li et al., 2006].
4.2 Related Work
Rough sets have been widely used for classification and clustering [Lingras
and West, 2004], [Mitra et al., 2006], [Pawlak, 1991]. The classical k-means
algorithm has been extended to rough k-means algorithm by Lingras et al.
[Lingras and West, 2004]. In the rough k-means algorithm, the lower approximation of a cluster, called the core cluster, is surrounded by a buffer or boundary set containing objects with unclear membership status [Lingras and West, 2004]. A data point in the lower approximation surely belongs to the cluster, whereas the membership of objects in the upper approximation is uncertain. The signature of each cluster is represented by its center and its lower and upper approximations. If the lower and upper approximations are equal, then the buffer set is empty and the data objects are crisply assigned to the cluster. The rough k-means
algorithm follows an iterative process, wherein cluster centers are updated
until the convergence criterion is met. Asharaf et al. have extended the rough k-means algorithm in such a way that it does not require prior specification of the number of clusters [Asharaf and Murty, 2004]. They have proposed a two-phase algorithm: it identifies a set of leaders which act as prototypes in the first phase; subsequently, a set of supporting leaders is identified, which can act as leaders provided they yield a better partitioning. The evolutionary
rough k-medoids algorithm [Peters et al., 2008] is based on the family of
rough clustering algorithms and the classical k-medoids algorithm [Kaufman
and Rousseeuw, 1990]. Malyszko and Stepaniuk have extended rough k-means clustering to rough entropy clustering [Malyszko and Stepaniuk, 2009]. It is an iterative process: first, a predefined number of weight pairs is selected; for each weight pair a new offspring clustering is determined and its rough entropy is computed; and the partition which gives the highest rough entropy is selected.
Liu et al. have proposed a feature selection method ISODATA-RFE for
high dimensional gene expression datasets [Liu et al., 2012]. Bhattacharya
distance is used to rank the features of the training set. Features with low Bhattacharya distance are removed from the feature set. For separating different classes, the fuzzy ISODATA algorithm is used to calculate a sensitivity index for each feature. A recursive feature elimination method is then applied to the feature set to remove unimportant features, generating multiple nested candidate feature subsets. Finally, the feature subset with the least error is selected for use in classification and clustering algorithms. Own and Abraham have
for use in classification and clustering algorithms. Own and Abraham have
proposed a new weighted rough set framework based classification for neo-
natal jaundice [Own and Abraham, 2012]. The weighted information table
is built by applying class equal sample weighting. While samples in ma-
jority class have smaller weight, the samples in minority class have larger
weight. A weighted reduction algorithm MLEM2 exploits the significance
of the attributes to extract a set of diagnosis rules from decision system
of NeoNatal Jaundice database. Deng et al. have proposed an enhanced
entropy weighting subspace clustering algorithm for high dimensional gene
expression data [Deng et al., 2011]. Its objective function integrates the
fuzzy within cluster compactness and between cluster information simulta-
neously. Cordeiro de Amorim and Mirkin [Cordeiro de Amorim and Mirkin, 2012] have extended the weighted k-means algorithm proposed by Huang et al., replacing the Euclidean distance metric by the Minkowski metric for measuring distances, since the Euclidean distance cannot capture the relationship between the scales of the feature values and the feature weights. Bai et al. have proposed a novel weighting algorithm
for categorical data [Bai et al., 2011]. The algorithm computes two weights
for each dimension in each cluster. These weight values are used to identify
the subsets of attributes which can categorize different clusters.
Rough set theory has been applied in conjunction with fuzzy set theory in
several domains such as fuzzy rule extraction, reasoning with uncertainty,
fuzzy modelling, and feature selection [Maji and Pal, 2010]. The classical
fuzzy c-means algorithm has been used in conjunction with rough sets to
develop rough fuzzy c-means (RFCM) algorithm [Mitra and Banka, 2007].
The concept of membership in FCM enables efficient handling of overlapping partitions, while rough sets are aimed at modelling uncertainty in data.
Such hybrid techniques provide a strong paradigm for uncertainty handling in
various application domains such as pattern recognition, image processing,
mining stock prices, vocabulary for information retrieval, fuzzy clustering,
dimensionality reduction, data mining and knowledge discovery [Maji and
Paul, 2011], [Maji and Pal, 2010]. Maji and Pal proposed an algorithm
RFCMdd for selecting the most informative bio-basis (medoids), where each
partition is represented by a medoid computed as a weighted average of the crisp lower approximation and the fuzzy boundary [Maji and Pal, 2007b]. Maji introduced a quantitative measure of similarity among genes based on fuzzy rough sets to develop the fuzzy-rough supervised attribute clustering (FRSAC) algorithm [Maji, 2011].
4.3 Rough Fuzzy c-Means Subspace Clustering
In this section, we propose an algorithm based on rough fuzzy c-means algo-
rithm for subspace clustering.
4.3.1 Rough c-Means
The rough c-means algorithm [Lingras and West, 2004] extends the concept of c-means by considering each cluster as an interval or rough set, where the lower and upper approximations $\underline{B}X$ and $\overline{B}X$ are characteristics of the rough set $X$. A rough set has the following properties:
(i) An object $x_j$ can belong to at most one lower approximation.
(ii) If $x_j \in \underline{B}X$ of cluster $X$, then $x_j \in \overline{B}X$ also.
(iii) If $x_j$ does not belong to any lower approximation, then it belongs to two or more upper approximations, i.e. overlap between clusters is possible.
The iterative steps of the rough c-means algorithm are as follows:

Algorithm 2 Rough c-Means Algorithm
1. Choose initial means $z_i$, $1 \le i \le k$, for the $k$ clusters.
2. Assign each data point $x_j$, $1 \le j \le n$, to the lower approximation $\underline{B}U_i$ or to the upper approximations $\overline{B}U_i$, $\overline{B}U_{i'}$ of the cluster pair $U_i$, $U_{i'}$ by computing the difference of its distances $d_{ij} - d_{i'j}$, where $d_{ij}$ is the distance of the $j$th data point $x_j$ from the $i$th centroid $z_i$ of cluster $U_i$.
3. Let $d_{ij}$ be minimum and $d_{i'j}$ be the next to minimum. If $d_{i'j} - d_{ij}$ is less than some threshold, then $x_j \in \overline{B}U_i$ and $x_j \in \overline{B}U_{i'}$ and $x_j$ cannot be a member of any lower approximation; else $x_j \in \underline{B}U_i$ such that the distance $d_{ij}$ is minimum over the $k$ clusters.
4. Compute the new mean $z_i$ for each cluster as

$$z_i = \begin{cases} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i = \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex] w_{low} \dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} + w_{up} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} x_j}{|\overline{B}U_i - \underline{B}U_i|} & \text{if } \underline{B}U_i \neq \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex] \dfrac{\sum_{x_j \in \underline{B}U_i} x_j}{|\underline{B}U_i|} & \text{otherwise.} \end{cases}$$

where the parameters $w_{low}$ and $w_{up}$ represent the relative importance of the lower and upper approximations respectively. Thus, RCM generates three types of clusters, with objects (i) in both the lower and upper approximations, (ii) only in the lower approximation, and (iii) only in the upper approximation.
5. Repeat steps 2-4 until convergence, i.e., there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: $w_{up} = 1 - w_{low}$, $0.5 < w_{low} < 1$, and $0 < threshold < 0.5$.
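For illustration, the following is a minimal sketch of one RCM iteration in Python/NumPy, assuming Euclidean distances and at least two clusters; the function and parameter names (`rcm_iteration`, `threshold`, `w_low`) are ours, not from the original papers, and the degenerate case of a cluster with no assigned objects is not handled:

```python
import numpy as np

def rcm_iteration(X, Z, threshold=0.3, w_low=0.7):
    """One rough c-means iteration (rough assignment + mean update).
    X: (n, d) data, Z: (k, d) current means; assumes k >= 2."""
    n, k = X.shape[0], Z.shape[0]
    w_up = 1.0 - w_low
    D = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)  # d_ij, shape (n, k)
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for j in range(n):                          # steps 2-3: rough assignment
        order = np.argsort(D[j])
        i, i2 = order[0], order[1]              # nearest and second-nearest cluster
        if D[j, i2] - D[j, i] < threshold:      # ambiguous: boundary object
            upper[i].add(j)
            upper[i2].add(j)
        else:                                   # certain: lower (hence also upper)
            lower[i].add(j)
            upper[i].add(j)
    Z_new = np.empty_like(Z)
    for i in range(k):                          # step 4: mean update
        low = list(lower[i])
        bnd = list(upper[i] - lower[i])
        if low and bnd:
            Z_new[i] = w_low * X[low].mean(axis=0) + w_up * X[bnd].mean(axis=0)
        elif bnd:                               # empty lower approximation
            Z_new[i] = X[bnd].mean(axis=0)
        else:                                   # empty boundary region
            Z_new[i] = X[low].mean(axis=0)
    return Z_new
```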
4.3.2 Rough-Fuzzy c-Means
The rough-fuzzy c-means algorithm [Mitra et al., 2006] incorporates a weighted distance in terms of the fuzzy membership value $u_{ij}$ of a data point $x_j$ to a cluster mean $z_i$, instead of the absolute individual distance $d_{ij}$ of the $j$th data point from the $i$th cluster center.

The iterative steps of the algorithm are as follows:

Algorithm 3 Rough Fuzzy c-Means Algorithm
1. Choose initial means $z_i$, $1 \le i \le k$, for the $k$ clusters.
2. Compute $u_{ij}$ by eq. 3.9 for $k$ clusters and $n$ data objects.
3. Assign each data point $x_j$ to the lower approximation $\underline{B}U_i$ or to the upper approximations $\overline{B}U_i$, $\overline{B}U_{i'}$ of the cluster pair $U_i$, $U_{i'}$ by computing the difference of its memberships $u_{ij} - u_{i'j}$.
4. Let $u_{ij}$ be maximum and $u_{i'j}$ be the next to maximum. If $u_{ij} - u_{i'j}$ is less than some threshold, then $x_j \in \overline{B}U_i$ and $x_j \in \overline{B}U_{i'}$ and $x_j$ cannot be a member of any lower approximation; else $x_j \in \underline{B}U_i$ such that the membership $u_{ij}$ is maximum over the $k$ clusters.
5. Compute the new mean $z_i$ for each cluster as

$$z_i = \begin{cases} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}\, x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}} & \text{if } \underline{B}U_i = \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex] w_{low} \dfrac{\sum_{x_j \in \underline{B}U_i} u_{ij}\, x_j}{\sum_{x_j \in \underline{B}U_i} u_{ij}} + w_{up} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}\, x_j}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} u_{ij}} & \text{if } \underline{B}U_i \neq \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex] \dfrac{\sum_{x_j \in \underline{B}U_i} u_{ij}\, x_j}{\sum_{x_j \in \underline{B}U_i} u_{ij}} & \text{otherwise.} \end{cases}$$

6. Repeat steps 2-5 until convergence, i.e., there are no more new assignments, or the upper limit on the number of iterations is reached.

Note: $w_{up} = 1 - w_{low}$, $0.5 < w_{low} < 1$, and $0 < threshold < 0.5$.
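Since RFCM differs from RCM only in driving the rough assignment by membership differences rather than distance differences, steps 2-4 can be sketched as below (assuming eq. 3.9 is the standard FCM membership formula and strictly positive distances; the helper names are illustrative, not the authors'):

```python
import numpy as np

def fcm_memberships(D, m=2.0):
    """FCM memberships u_ij from an (n, k) distance matrix with entries d_ij:
    u_ij = 1 / sum_l (d_ij / d_lj)^(2/(m-1)); assumes all d_ij > 0."""
    ratio = (D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

def rough_assign_by_membership(U, threshold=0.1):
    """Rough assignment from a membership matrix U (n, k); assumes k >= 2."""
    n, k = U.shape
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for j in range(n):
        order = np.argsort(U[j])[::-1]          # clusters by decreasing membership
        i, i2 = order[0], order[1]
        if U[j, i] - U[j, i2] < threshold:      # ambiguous: two upper approximations
            upper[i].add(j)
            upper[i2].add(j)
        else:                                   # certain: lower (hence also upper)
            lower[i].add(j)
            upper[i].add(j)
    return lower, upper
```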
4.3.3 Rough Fuzzy c-Means Subspace Clustering Algorithm
The proposed algorithm, called Rough Fuzzy c-Means Subspace (RFCMS), has been developed by hybridizing the concept of fuzzy membership for objects (in clusters) and for dimensions (where fuzzy membership serves as the weight of a dimension) with rough set based approximations of clusters.

Objective Function Let $\underline{B}U_i$, $\overline{B}U_i$ and $\overline{B}U_i - \underline{B}U_i$ denote the lower approximation, upper approximation, and boundary region of the $i$th cluster $U_i$ respectively. In [Lingras and West, 2004] the classical objective function of the fuzzy c-means algorithm has been modified in the rough framework by incorporating the lower and upper approximations of the clusters. We have extended the objective function of the rough fuzzy c-means algorithm [Lingras and West, 2004] by incorporating the weights of dimensions as relevant to different clusters. We associate with the $i$th cluster the weight vector $\omega_i$, which represents the relative relevance of the different attributes for the $i$th cluster. Thus, in the matrix $W = [\omega_{ir}]_{k \times d}$, $\omega_{ir}$ denotes the contribution of the $r$th dimension to the $i$th cluster. The sum of contributions from all dimensions adds to 1 for each cluster:

$$\sum_{r=1}^{d} \omega_{ir} = 1, \quad 1 \le i \le k, \tag{4.1}$$

$$\omega_{ir} \in [0, 1], \quad 1 \le i \le k,\ 1 \le r \le d \tag{4.2}$$

The proposed RFCMS algorithm minimizes the following objective function $J_{RFCMS}$ to partition the data set into $k$ clusters:

$$J_{RFCMS} = \begin{cases} aA + bB & \text{if } \underline{B}U \neq \emptyset \wedge \overline{B}U - \underline{B}U \neq \emptyset \\ A & \text{if } \underline{B}U \neq \emptyset \wedge \overline{B}U - \underline{B}U = \emptyset \\ B & \text{otherwise.} \end{cases}$$
where

$$A = \sum_{i=1}^{k} \sum_{x_j \in \underline{B}U_i} \sum_{r=1}^{d} \mu_{ij}^{\alpha}\, \omega_{ir}^{\beta}\, d_{ijr}^{2}, \qquad B = \sum_{i=1}^{k} \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \sum_{r=1}^{d} \mu_{ij}^{\alpha}\, \omega_{ir}^{\beta}\, d_{ijr}^{2} \tag{4.3}$$

In the above formulation, A and B correspond to the lower and upper approximations. The parameters $a$ and $b$ control the contribution of the lower and upper approximation of a cluster.

$$d_{ijr}^{2} = (x_{jr} - z_{ir})^{2} \tag{4.4}$$

is the distance between the $i$th cluster center and the $j$th data object along the $r$th dimension. Parameters $\alpha \in (1, \infty)$ and $\beta \in (1, \infty)$ are weighting exponents; they control the fuzzification of $\mu_{ij}$ and $\omega_{ir}$ respectively.
Solving eq. 4.3 w.r.t. $\mu_{ij}$ and $\omega_{ir}$ we get:

$$\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[\frac{\sum_{r=1}^{d} \omega_{ir}^{\beta}\, d_{ijr}^{2}}{\sum_{r=1}^{d} \omega_{lr}^{\beta}\, d_{ljr}^{2}}\right]^{1/(\alpha-1)}} \tag{4.5}$$

$$\omega_{ir} = \frac{1}{\sum_{l=1}^{d} \left[\frac{\sum_{j=1}^{n} \mu_{ij}^{\alpha}\, d_{ijr}^{2}}{\sum_{j=1}^{n} \mu_{ij}^{\alpha}\, d_{ijl}^{2}}\right]^{1/(\beta-1)}} \tag{4.6}$$

The weights of dimensions are computed using eq. 4.6 as in [Kumar and Puri, 2009].
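Eqs. 4.5 and 4.6 transcribe directly into code. A sketch follows, assuming `D2` is an (n, k, d) array holding the per-dimension squared distances $d^2_{ijr}$ of eq. 4.4; note that, as in the equations, both updates sum over all $n$ objects:

```python
import numpy as np

def update_memberships(D2, W, alpha, beta):
    """Eq. 4.5. D2: (n, k, d) squared distances, W: (k, d) dimension weights."""
    s = np.einsum('ir,jir->ji', W ** beta, D2)        # s_ji = sum_r w_ir^beta d2_ijr
    ratio = (s[:, :, None] / s[:, None, :]) ** (1.0 / (alpha - 1.0))
    return 1.0 / ratio.sum(axis=2)                    # mu, shape (n, k)

def update_weights(D2, U, alpha, beta):
    """Eq. 4.6. U: (n, k) fuzzy memberships."""
    t = np.einsum('ji,jir->ir', U ** alpha, D2)       # t_ir = sum_j mu_ij^alpha d2_ijr
    ratio = (t[:, :, None] / t[:, None, :]) ** (1.0 / (beta - 1.0))
    return 1.0 / ratio.sum(axis=2)                    # omega, shape (k, d)
```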
Cluster Center The cluster centers are computed as:

$$z_{ir} = \begin{cases} \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} & \text{if } \underline{B}U_i = \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex] a\, \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}\, x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}} + b\, \dfrac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} & \text{if } \underline{B}U_i \neq \emptyset \wedge \overline{B}U_i - \underline{B}U_i \neq \emptyset \\[2ex] \dfrac{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}\, x_{jr}}{\sum_{x_j \in \underline{B}U_i} \mu_{ij}^{\alpha}} & \text{otherwise.} \end{cases} \tag{4.7}$$
As the objects lying in the lower approximation definitely belong to the cluster, they are assigned higher weights compared to the objects lying in the boundary region. For the case $a = 1$, the cluster center may get stuck in a local optimum because clusters cannot see the objects lying in the boundary region and therefore may not be able to move towards the best cluster center. In order to maintain a greater degree of freedom to move, the values of the parameters $a$ and $b$ are set as $0 < b < a < 1$ such that $a + b = 1$ [Maji and Pal, 2007a]. Like FCM [Bezdek et al., 1987] and Yan's fuzzy curve tracing algorithm [Yan, 2004], the proposed RFCMS algorithm converges, at least along a subsequence, to a local optimum solution.
The iterative steps of the algorithm are as follows:

Algorithm 4 Rough Fuzzy c-Means Subspace Clustering Algorithm
1. Choose initial cluster centers $z_i$, $1 \le i \le k$, for the $k$ clusters.
2. Compute $\mu_{ij}$ by eq. 4.5 for $k$ clusters and $n$ data objects.
3. Let $\mu_{ij}$ be maximum and $\mu_{i'j}$ be the next to maximum for an object $x_j$. If $\mu_{ij} - \mu_{i'j}$ is less than some threshold, then $x_j \in \overline{B}U_i$ and $x_j \in \overline{B}U_{i'}$ and $x_j$ cannot be a member of any lower approximation; else $x_j \in \underline{B}U_i$ such that the membership $\mu_{ij}$ is maximum over the $k$ clusters.
4. Compute $\omega_{ir}$ by eq. 4.6 for $k$ clusters and $d$ dimensions.
5. Compute the new cluster centers $z_i$ for each cluster, as in eq. 4.7.
6. Repeat steps 2-5 until convergence, i.e., there are no more new assignments, or the limit on the maximum number of iterations is reached.

Note: $a = 1 - b$, $0.5 < a < 1$, and $0 < threshold < 0.5$.
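Putting the pieces together, the following is a condensed sketch of the RFCMS main loop under the conventions above; the initialization strategy and parameter defaults are placeholders, the degenerate case of a completely empty cluster is not handled, and the helpers `update_memberships`, `update_weights` and `rough_assign_by_membership` are the sketches given earlier:

```python
import numpy as np

def rfcms(X, k, alpha=2.0, beta=2.0, a=0.85, threshold=0.1,
          eps=1e-3, max_iter=100, seed=0):
    """Sketch of the RFCMS iteration (Algorithm 4); not the authors' MATLAB code."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    b = 1.0 - a
    Z = X[rng.choice(n, size=k, replace=False)].copy()  # step 1: initial centers
    W = np.full((k, d), 1.0 / d)                        # start from uniform weights
    for _ in range(max_iter):
        D2 = (X[:, None, :] - Z[None, :, :]) ** 2       # d2_ijr, shape (n, k, d)
        U = update_memberships(D2, W, alpha, beta)      # step 2, eq. 4.5
        lower, upper = rough_assign_by_membership(U, threshold)  # step 3
        W = update_weights(D2, U, alpha, beta)          # step 4, eq. 4.6
        Um = U ** alpha
        Z_new = np.empty_like(Z)                        # step 5, eq. 4.7
        for i in range(k):
            low = np.fromiter(lower[i], dtype=int)
            bnd = np.fromiter(upper[i] - lower[i], dtype=int)
            def wmean(idx):
                u = Um[idx, i][:, None]
                return (u * X[idx]).sum(axis=0) / u.sum()
            if low.size and bnd.size:
                Z_new[i] = a * wmean(low) + b * wmean(bnd)
            elif bnd.size:                              # empty lower approximation
                Z_new[i] = wmean(bnd)
            else:                                       # empty boundary region
                Z_new[i] = wmean(low)
        if np.linalg.norm(Z_new - Z) < eps:             # step 6: convergence test
            return Z_new, U, W
        Z = Z_new
    return Z, U, W
```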
4.4 Convergence
In this section, we discuss the convergence criteria of the proposed algorithm along with its proof. Along the same lines as the global convergence property of the FCM algorithm, the global convergence property of RFCMS states that, for any data set and initialization parameters, an iteration sequence of the RFCMS algorithm either (i) converges to a local minimum, or (ii) contains a subsequence that converges to a stationary point. Theorems 4.1, 4.2 and 4.3 below show that the necessary and sufficient conditions hold for $U$, $W$, and $Z$ respectively.
Theorem 4.1 Let $\phi : M_{fkn} \to \ldots$

Assuming that $S_{ij} \neq 0$, $1 \le j \le n$, $1 \le i \le k$, we get:

$$\alpha \sum_{r=1}^{d} S_{ij}^{2\alpha-2} P_{ir}^{2\beta} d_{ijr}^{2} + \lambda_j = 0$$

or

$$\lambda_j = -\alpha \sum_{r=1}^{d} S_{ij}^{2\alpha-2} P_{ir}^{2\beta} d_{ijr}^{2}$$

or

$$S_{ij}^{2\alpha-2} = \frac{-\lambda_j}{\alpha \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^{2}}$$

or

$$S_{ij}^{2} = \left[\frac{-\lambda_j}{\alpha \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^{2}}\right]^{\frac{1}{\alpha-1}}$$

$$\mu_{ij} = \left[\frac{-\lambda_j}{\alpha \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^{2}}\right]^{\frac{1}{\alpha-1}} \tag{4.10}$$

Using constraint eq. 2.11 in eq. 4.10, we get:

$$\sum_{i=1}^{k} \mu_{ij} = \sum_{i=1}^{k} \left[\frac{-\lambda_j}{\alpha \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^{2}}\right]^{\frac{1}{\alpha-1}} = 1$$

Substituting the value of $\lambda_j$ in eq. 4.10, we obtain:

$$\mu_{ij} = \frac{1}{\sum_{l=1}^{k} \left[\frac{\sum_{r=1}^{d} \omega_{ir}^{\beta} d_{ijr}^{2}}{\sum_{r=1}^{d} \omega_{lr}^{\beta} d_{ljr}^{2}}\right]^{1/(\alpha-1)}} \tag{4.11}$$

Now, to prove the sufficiency condition, we compute the second order partial derivative:

$$\frac{\partial^{2} J_{RFCMS}}{\partial S_{ij}\, \partial S_{i'j'}} = \begin{cases} 2\alpha(2\alpha-1) \sum_{r=1}^{d} S_{ij}^{2\alpha-2} P_{ir}^{2\beta} d_{ijr}^{2} + 2\lambda_j & \text{if } i = i' \text{ and } j = j', \\ 0 & \text{otherwise.} \end{cases}$$

$$= 2\alpha(2\alpha-1) \sum_{r=1}^{d} \mu_{ij}^{\alpha-1} P_{ir}^{2\beta} d_{ijr}^{2} + 2\lambda_j \tag{4.12}$$

$$= 2\alpha(2\alpha-1)\, \mu_{ij}^{\alpha-1}\, \tilde{d}_{ij}^{2} + 2\lambda_j \tag{4.13}$$

where

$$\tilde{d}_{ij}^{2} = \sum_{r=1}^{d} P_{ir}^{2\beta} d_{ijr}^{2}$$

Substituting the values of $\mu_{ij}$ and $\lambda_j$ in 4.13, we get:

$$2\alpha(2\alpha-1)\, \tilde{d}_{ij}^{2} \left[\frac{1}{\sum_{l=1}^{k} \left(\tilde{d}_{ij}^{2} / \tilde{d}_{lj}^{2}\right)^{1/(\alpha-1)}}\right]^{\alpha-1} - 2\alpha \left[\frac{1}{\sum_{l=1}^{k} \left(1 / \tilde{d}_{lj}^{2}\right)^{1/(\alpha-1)}}\right]^{\alpha-1}$$

$$= \left(2\alpha(2\alpha-1) - 2\alpha\right) \left[\frac{1}{\sum_{l=1}^{k} \left(1 / \tilde{d}_{lj}^{2}\right)^{1/(\alpha-1)}}\right]^{\alpha-1} \tag{4.14}$$

$$= 4\alpha(\alpha-1) \left[\sum_{l=1}^{k} \left(\tilde{d}_{lj}^{2}\right)^{-1/(\alpha-1)}\right]^{-(\alpha-1)} \tag{4.15}$$

Letting

$$a_j = \left[\sum_{l=1}^{k} \left(\tilde{d}_{lj}^{2}\right)^{-1/(\alpha-1)}\right]^{-(\alpha-1)}, \quad 1 \le j \le n,$$

$$\frac{\partial^{2} J_{RFCMS}}{\partial S_{ij}\, \partial S_{ij}} = \gamma_j \quad \text{where } \gamma_j = 4\alpha(\alpha-1)\, a_j,\ 1 \le j \le n. \tag{4.16}$$

Hence the Hessian matrix of $U$, which is a diagonal matrix, has $n$ distinct eigenvalues, each of multiplicity $k$. With the assumptions $\alpha > 1$, $\beta > 1$ and $\tilde{d}_{lj}^{2} > 0$ for all $l, j$, it follows that $\gamma_j > 0$ for all $j$. Thus the Hessian matrix of $U$ is positive definite and hence the sufficiency condition is proved.
Theorem 4.2 Let $\phi : M_{fkd} \to \ldots$

Since $\omega_{ir} = P_{ir}^{2}$, we get:

$$\omega_{ir} = \left[\frac{-\lambda_i}{\beta \sum_{j=1}^{n} S_{ij}^{2\alpha} d_{ijr}^{2}}\right]^{\frac{1}{\beta-1}} \tag{4.18}$$

Using constraint eq. 3.4 we get:

$$\sum_{r=1}^{d} \omega_{ir} = \sum_{r=1}^{d} \left[\frac{-\lambda_i}{\beta \sum_{j=1}^{n} S_{ij}^{2\alpha} d_{ijr}^{2}}\right]^{\frac{1}{\beta-1}} = 1$$

Substituting the value of $\lambda_i$ in eq. 4.18, we obtain:

$$\omega_{ir} = \frac{1}{\sum_{l=1}^{d} \left[\frac{\sum_{j=1}^{n} \mu_{ij}^{\alpha} d_{ijr}^{2}}{\sum_{j=1}^{n} \mu_{ij}^{\alpha} d_{ijl}^{2}}\right]^{1/(\beta-1)}} \tag{4.19}$$

Now, to prove the sufficiency condition, we compute the second order partial derivative:

$$\frac{\partial^{2} J_{RFCMS}}{\partial P_{ir}\, \partial P_{i'r'}} = \begin{cases} 2\beta(2\beta-1) \sum_{j=1}^{n} P_{ir}^{2\beta-2} S_{ij}^{2\alpha} d_{ijr}^{2} + 2\lambda_i & \text{if } i = i' \text{ and } r = r', \\ 0 & \text{otherwise.} \end{cases}$$

$$= 2\beta(2\beta-1) \sum_{j=1}^{n} \omega_{ir}^{\beta-1} S_{ij}^{2\alpha} d_{ijr}^{2} + 2\lambda_i \tag{4.20}$$

$$= 2\beta(2\beta-1)\, \omega_{ir}^{\beta-1}\, \tilde{d}_{ir}^{2} + 2\lambda_i \tag{4.22}$$

where

$$\tilde{d}_{ir}^{2} = \sum_{j=1}^{n} S_{ij}^{2\alpha} d_{ijr}^{2} \tag{4.23}$$

Substituting the values of $\omega_{ir}$ and $\lambda_i$ in 4.22, we get:

$$2\beta(2\beta-1)\, \tilde{d}_{ir}^{2} \left[\frac{1}{\sum_{l=1}^{d} \left(\tilde{d}_{ir}^{2} / \tilde{d}_{il}^{2}\right)^{1/(\beta-1)}}\right]^{\beta-1} - 2\beta \left[\frac{1}{\sum_{l=1}^{d} \left(1 / \tilde{d}_{il}^{2}\right)^{1/(\beta-1)}}\right]^{\beta-1}$$

$$= \left(2\beta(2\beta-1) - 2\beta\right) \left[\frac{1}{\sum_{l=1}^{d} \left(1 / \tilde{d}_{il}^{2}\right)^{1/(\beta-1)}}\right]^{\beta-1}$$

$$= 4\beta(\beta-1) \left[\sum_{l=1}^{d} \left(\tilde{d}_{il}^{2}\right)^{-1/(\beta-1)}\right]^{-(\beta-1)}$$

Letting

$$b_i = \left[\sum_{l=1}^{d} \left(\tilde{d}_{il}^{2}\right)^{-1/(\beta-1)}\right]^{-(\beta-1)}, \quad 1 \le i \le k,$$

$$\frac{\partial^{2} J_{RFCMS}}{\partial P_{ir}\, \partial P_{ir}} = \eta_i \quad \text{where } \eta_i = 4\beta(\beta-1)\, b_i,\ 1 \le i \le k. \tag{4.24}$$

Hence the Hessian matrix of $W$, which is a diagonal matrix, has $k$ distinct eigenvalues, each of multiplicity $d$. With the assumptions $\alpha > 1$, $\beta > 1$ and $\tilde{d}_{il}^{2} > 0$ for all $i, l$, it follows that $\eta_i > 0$ for all $i$. Thus the Hessian matrix of $W$ is positive definite and hence the sufficiency condition is proved.
Theorem 4.3 Let $\phi : \ldots$

$$z_{ir}^{\text{upper}} = \frac{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}\, x_{jr}}{\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}} \tag{4.27}$$

As an object cannot lie in both the lower approximation and the boundary region, the convergence of the cluster center depends on both the lower and the upper approximation components of the cluster center. Eqs. 4.26 and 4.27 can be written as:

$$|\underline{B}U_i|\; z_{ir}^{\text{lower}} = \sum_{x_j \in \underline{B}U_i} x_{jr} \tag{4.28}$$

$$\left(\sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}\right) z_{ir}^{\text{upper}} = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}\, x_{jr} \tag{4.29}$$

Eqs. 4.28 and 4.29 represent a linear set of equations. In order to prove convergence, we treat eqs. 4.26 and 4.27 as Gauss-Seidel iterations for solving this set of equations, with $\mu_{ij}$ considered fixed. The sufficient condition of the Gauss-Seidel algorithm for assuring convergence is that the matrix representing each iteration be diagonally dominant. The matrices corresponding to eqs. 4.26 and 4.27 are:

$$A = \begin{bmatrix} |\underline{B}U_1| & 0 & \cdots & 0 \\ 0 & |\underline{B}U_2| & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & |\underline{B}U_k| \end{bmatrix}, \qquad B = \begin{bmatrix} \Delta_1 & 0 & \cdots & 0 \\ 0 & \Delta_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \Delta_k \end{bmatrix}$$

where

$$\Delta_i = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha}$$

The sufficient condition for the matrices A and B to be diagonally dominant is $|\underline{B}U_i| > 0$ and $\Delta_i > 0$ respectively.

Also, following the convergence theorem proposed for FCM by [Bezdek et al., 1987], its use in [Maji and Pal, 2007a], and the convergence analysis of Yan's fuzzy curve tracing algorithm [Yan, 2004], the matrices A and B are the Hessians of the objective terms A and B w.r.t. $z_{ir}^{\text{lower}}$ and $z_{ir}^{\text{upper}}$ respectively, with all positive eigenvalues, which establishes that these matrices are diagonally dominant. Thus, by Theorems 4.1, 4.2 and 4.3, the proposed RFCMS algorithm converges, at least along a subsequence, to a local optimum solution.
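To make the last step explicit: since A and B are diagonal, strict diagonal dominance reduces to the positivity of the diagonal entries, as the off-diagonal row sums vanish:

```latex
% For a diagonal matrix M = diag(m_1, ..., m_k), strict diagonal dominance
% |M_{ii}| > \sum_{l \neq i} |M_{il}| = 0 holds iff every m_i > 0.
\[
  |\underline{B}U_i| > 0 \ \forall i \;\Rightarrow\; A \text{ is diagonally dominant},
  \qquad
  \Delta_i = \sum_{x_j \in (\overline{B}U_i - \underline{B}U_i)} \mu_{ij}^{\alpha} > 0 \ \forall i
  \;\Rightarrow\; B \text{ is diagonally dominant}.
\]
```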
4.5 Experiments
In this section, we present the comparative performance of the proposed subspace clustering algorithm RFCMS with FCM, RCM, RFCM, DOC, and PROCLUS, using UCI data sets [uci, ]. While FCM, RCM, and RFCM are full dimensional clustering algorithms, PROCLUS and DOC are subspace clustering algorithms tailored for high-dimensional applications. We used the MATLAB version of FCM, the OpenSubspace Weka [osw, ] implementation for DOC and PROCLUS, and implemented the RCM, RFCM, and RFCMS algorithms in MATLAB. In all the experiments with the FCM, RCM, RFCM and RFCMS algorithms, the stopping criterion parameter $\epsilon$ was set to $10^{-3}$ and the maximum number of iterations was restricted to 100. However, in all the experiments we conducted, the algorithms always converged before the limit on the number of iterations was reached. The normed difference between successive iterations of the matrix $Z$ is compared with the threshold parameter $\epsilon$, set to define the convergence criterion. Based on experimentation, we set the values of the parameters $a = 0.85$ and $b = 0.25$ for the RCM, RFCM and RFCMS algorithms. The parameters for the DOC algorithm were used as mentioned in [Procopiuc et al., 2002]. The number of clusters $k$ was set equal to the number of classes given in each data set, as indicated in Table 4.1. We have evaluated the effect of the fuzzification parameters $\alpha$ and $\beta$ of the RFCMS algorithm and of the fuzzification parameter $m$ of the FCM and RFCM algorithms. We evaluated the performance of all the algorithms w.r.t. quality and validity measures. The sets of relevant dimensions computed by each of the subspace clustering algorithms RFCMS, DOC and PROCLUS are shown for all the data sets.

Data Sets | Instances | Attributes | Classes
Alzheimer | 45 | 8 | 3
Breast Cancer | 569 | 30 | 2
Spambase | 4601 | 57 | 2
Wine | 178 | 13 | 3
Diabetes | 768 | 8 | 2
Magic | 19020 | 10 | 2

Table 4.1: Data Sets
4.5.1 Data Sets
We experimented with the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets from the UCI data repository [uci, ]. These data sets are heterogeneous in terms of size, number of clusters, and distribution of classes, and have no missing values. General characteristics of the data sets are summarized in Table 4.1.
4.5.2 Effect of Fuzzification Parameters
For the RFCMS algorithm, the best combination of fuzzification parameters $\alpha$ and $\beta$ was determined by varying the values of $\alpha$ and $\beta$ in the range 2-10 independently of each other. This was done for each data set. Similarly, the best value of the fuzzification parameter $m$ for the FCM and RFCM algorithms was determined by varying the values of $m$. Table 4.2 shows the complete list of fuzzification parameters we found for the different data sets as a result of fine-tuning.
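The tuning procedure amounts to a grid search; a hedged sketch follows, with the `score` callback standing in for whichever validity measure is being optimized and `rfcms` being the loop sketched in section 4.3.3:

```python
import itertools

def tune_fuzzifiers(X, k, score, grid=range(2, 11)):
    """Grid search over (alpha, beta) in 2..10, as described in the text.
    `score` maps a fuzzy membership matrix U to a quality value."""
    best = None
    for alpha, beta in itertools.product(grid, grid):
        Z, U, W = rfcms(X, k, alpha=float(alpha), beta=float(beta))
        s = score(U)
        if best is None or s > best[0]:
            best = (s, alpha, beta)
    return best  # (best score, best alpha, best beta)
```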
Data Sets | RFCMS $\alpha$ | RFCMS $\beta$ | FCM $m$ | RFCM $m$
Alzheimer | 2 | 2 | 6 | 4
Breast Cancer | 4 | 10 | 6 | 6
Spambase | 3 | 10 | 10 | 6
Wine | 3 | 9 | 9 | 2
Diabetes | 2 | 2 | 2 | 2
Magic | 2 | 2 | 2 | 2

Table 4.2: Fuzzifier Values: RFCMS, FCM, and RFCM
4.5.3 Cluster Validity

Table 4.3 shows accuracy results for all the algorithms and data sets. The RFCMS algorithm has the highest accuracy for the Breast Cancer, Spambase and Wine data sets. The FCM algorithm achieves the highest accuracy for the Alzheimer data set, the RFCM algorithm achieves the highest accuracy for the Magic data set, and the DOC algorithm achieves the highest accuracy for the Diabetes data set.

Data Sets | RFCMS | FCM | RCM | RFCM | PROCLUS | DOC
Alzheimer | 0.7556 | 0.8000 | 0.6889 | 0.7333 | 0.0750 | 0.2813
Breast Cancer | 0.9192 | 0.8282 | 0.8541 | 0.8682 | 0.8336 | 0.0887
Spambase | 0.7457 | 0.6568 | 0.6433 | 0.6568 | 0.5885 | 0.7062
Wine | 0.9101 | 0.7079 | 0.6854 | 0.6966 | 0.5427 | 0.2743
Diabetes | 0.6510 | 0.6589 | 0.6589 | 0.6589 | 0.5248 | 0.6910
Magic | 0.6931 | 0.6961 | 0.6961 | 0.7294 | 0.2813 | 0.4817

Table 4.3: Accuracy: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

In Tables 4.4, 4.5, 4.6, and 4.7, we present the results of applying recall, specificity, precision and F1-measure to the outcomes of the clustering schemes produced by the different algorithms. The RFCMS algorithm achieves the highest recall and specificity for the Breast Cancer, Spambase and Wine data sets; the FCM algorithm achieves the highest recall and specificity for the Alzheimer data set; the RFCM algorithm achieves the highest recall and specificity for the Magic data set; and the DOC algorithm achieves the highest recall and specificity for the Diabetes data set. The RFCMS algorithm has the highest precision for the Breast Cancer, Spambase, Diabetes, Magic and Wine data sets, while the FCM algorithm achieves the highest precision for the Alzheimer data set. The RFCMS algorithm achieves the highest F1-measure for the Breast Cancer, Spambase and Wine data sets; the FCM algorithm achieves the highest F1-measure for the Alzheimer data set; the RFCM algorithm achieves the highest F1-measure for the Magic data set; and the FCM, RCM and RFCM algorithms achieve the highest F1-measure for the Diabetes data set. In summary, it can be seen that no algorithm is a clear winner w.r.t. all measures across all the data sets.

Data Sets | RFCMS | FCM | RCM | RFCM | PROCLUS | DOC
Alzheimer | 0.7470 | 0.7976 | 0.6921 | 0.7367 | 0.0953 | 0.4193
Breast Cancer | 0.8944 | 0.8241 | 0.8052 | 0.8241 | 0.8906 | 0.1527
Spambase | 0.7740 | 0.5798 | 0.5550 | 0.5767 | 0.4485 | 0.6543
Wine | 0.9249 | 0.7030 | 0.6765 | 0.6904 | 0.5488 | 0.2702
Diabetes | 0.5000 | 0.5943 | 0.5943 | 0.5943 | 0.4902 | 0.6488
Magic | 0.6236 | 0.5722 | 0.5722 | 0.7982 | 0.4913 | 0.3787

Table 4.4: Recall: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

Data Sets | RFCMS | FCM | RCM | RFCM | PROCLUS | DOC
Alzheimer | 0.8769 | 0.9003 | 0.8465 | 0.8684 | 0.5155 | 0.4193
Breast Cancer | 0.8949 | 0.8241 | 0.8052 | 0.8241 | 0.8906 | 0.1527
Spambase | 0.7740 | 0.5798 | 0.5550 | 0.5767 | 0.4485 | 0.6543
Wine | 0.9559 | 0.8565 | 0.8446 | 0.8508 | 0.8380 | 0.6327
Diabetes | 0.5000 | 0.5943 | 0.5943 | 0.5943 | 0.4902 | 0.6488
Magic | 0.6236 | 0.5722 | 0.5722 | 0.7982 | 0.4913 | 0.3787

Table 4.5: Specificity: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

Data Sets | RFCMS | FCM | RCM | RFCM | PROCLUS | DOC
Alzheimer | 0.7407 | 0.9716 | 0.7008 | 0.7463 | 0.0769 | 0.1806
Breast Cancer | 0.9371 | 0.9104 | 0.9026 | 0.9104 | 0.7825 | 0.1332
Spambase | 0.7677 | 0.6810 | 0.6982 | 0.6938 | 0.4994 | 0.5981
Wine | 0.9202 | 0.7301 | 0.7084 | 0.7211 | 0.5104 | 0.1778
Diabetes | 0.6510 | 0.6120 | 0.6120 | 0.6120 | 0.4897 | 0.6028
Magic | 0.7982 | 0.7870 | 0.7870 | 0.7958 | 0.1806 | 0.4054

Table 4.6: Precision: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC

Data Sets | RFCMS | FCM | RCM | RFCM | PROCLUS | DOC
Alzheimer | 0.7439 | 0.7946 | 0.6964 | 0.7415 | 0.0851 | 0.2525
Breast Cancer | 0.9153 | 0.8651 | 0.8511 | 0.8651 | 0.8330 | 0.1423
Spambase | 0.7708 | 0.6263 | 0.6184 | 0.6299 | 0.4726 | 0.6250
Wine | 0.9225 | 0.7163 | 0.6921 | 0.7054 | 0.5289 | 0.2145
Diabetes | 0.5656 | 0.7062 | 0.7062 | 0.7062 | 0.4899 | 0.6249
Magic | 0.7002 | 0.6626 | 0.6626 | 0.7970 | 0.2641 | 0.3916

Table 4.7: F1-measure: RFCMS, FCM, RCM, RFCM, PROCLUS, and DOC
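For reference, the external validity measures used above can be computed as sketched below. The convention of mapping each cluster to its majority class and macro-averaging one-vs-rest counts is our assumption, since the chapter does not spell the computation out; a crisp clustering can be obtained from the memberships as `U.argmax(axis=1)`:

```python
import numpy as np

def external_measures(labels, clusters):
    """Accuracy, recall, specificity, precision, F1 for a crisp clustering,
    mapping each cluster to its majority class (an assumed convention)."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    mapped = np.empty_like(labels)
    for c in np.unique(clusters):
        members = clusters == c
        vals, counts = np.unique(labels[members], return_counts=True)
        mapped[members] = vals[np.argmax(counts)]   # majority class of cluster c
    acc = np.mean(mapped == labels)
    rec, spec, prec = [], [], []
    for c in np.unique(labels):                     # one-vs-rest per class
        tp = np.sum((mapped == c) & (labels == c))
        fn = np.sum((mapped != c) & (labels == c))
        fp = np.sum((mapped == c) & (labels != c))
        tn = np.sum((mapped != c) & (labels != c))
        rec.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
        prec.append(tp / (tp + fp) if tp + fp else 0.0)
    rec, spec, prec = map(np.mean, (rec, spec, prec))
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, rec, spec, prec, f1
```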
4.5.4 Subspaces Generated
The proposed algorithm RFCMS is an objective function based subspace clustering algorithm. For such algorithms, the fewer the number of dimensions, the smaller the error or scatter among the objects of a cluster. We have compared the RFCMS, DOC and PROCLUS algorithms in terms of the number of dimensions found. Tables 4.8, 4.9, 4.10, 4.11, 4.12 and 4.13 show the sets of dimensions found for the Alzheimer, Breast Cancer, Spambase, Wine, Diabetes and Magic data sets by the RFCMS, PROCLUS and DOC algorithms. For all the data sets mentioned above, the RFCMS algorithm finds subspaces with fewer dimensions.
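How a crisp set of relevant dimensions is read off the fuzzy weight matrix is not spelled out in the text; one plausible convention (an assumption on our part) is to keep, per cluster, the dimensions whose weight exceeds the uniform level $1/d$:

```python
import numpy as np

def relevant_dimensions(W):
    """Per-cluster relevant dimensions from the (k, d) weight matrix W,
    using the uniform weight 1/d as an assumed cut-off."""
    k, d = W.shape
    return [list(np.flatnonzero(W[i] > 1.0 / d) + 1) for i in range(k)]  # 1-based
```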
Cluster No. | RFCMS | PROCLUS | DOC
1 | 4 | 4, 6, 7 | 1, 2, 3, 4, 5, 6, 7
2 | 4, 5, 7 | 4, 5, 6 | 1, 2, 3, 4, 5, 6, 7
3 | 4, 5, 6 | 4, 5, 6 | 1, 2, 3, 4, 5, 6, 7

Table 4.8: Dimensions: RFCMS, PROCLUS and DOC for Alzheimer

Cluster No. | RFCMS | PROCLUS | DOC
1 | 10, 15, 20 | 1-3, 5-13, 15-24, 26-30 | 1-3, 5-13, 15-23, 25-30
2 | 10, 15, 20 | 1, 2 | 1-3, 5-13, 15-23, 25-30

Table 4.9: Dimensions: RFCMS, PROCLUS and DOC for Breast Cancer

Cluster No. | RFCMS | PROCLUS | DOC
1 | 28, 29, 32, 34, 38, 44, 47 | 1-54 | 1-56
2 | 45, 46, 47, 51, 52 | 40, 49 | 1-56

Table 4.10: Dimensions: RFCMS, PROCLUS and DOC for Spambase

Cluster No. | RFCMS | PROCLUS | DOC
1 | 3, 8, 11 | 1, 2, 3, 6, 7, 8, 9, 11, 12 | 1-12
2 | 3, 8, 11 | 1, 3, 6, 7, 8, 9, 11, 12 | 1-12
3 | 3, 7, 8, 9, 11 | 1, 2 | 1-12

Table 4.11: Dimensions: RFCMS, PROCLUS and DOC for Wine
Cluster No. | RFCMS | PROCLUS | DOC
1 | 1, 6, 7 | 1, 6-8 | 1, 6-8
2 | 1, 7 | 1, 4, 5, 7 | 1, 6-8

Table 4.12: Dimensions: RFCMS, PROCLUS and DOC for Diabetes

Cluster No. | RFCMS | PROCLUS | DOC
1 | 4, 5 | 3, 4, 5, 8, 9 | 2-6, 8, 9
2 | 4, 5 | 1, 2, 3, 4, 5 | 1-5, 8

Table 4.13: Dimensions: RFCMS, PROCLUS and DOC for Magic
4.5.5 Experiments on Biological Datasets
In this section, we present the comparative performance of the proposed subspace clustering algorithm RFCMS with the EWKM, FWKM and LAC algorithms on biological data sets. RFCMS, EWKM, FWKM and LAC are subspace clustering algorithms tailored for high-dimensional applications. We used the Weka implementation of EWKM, FWKM and LAC [Peng and Zhang, 2011]. The parameters for the EWKM, FWKM and LAC algorithms were used as mentioned in [Jing et al., 2007], [Jing et al., 2005] and [Domeniconi et al., 2007]. We have evaluated the effect of the fuzzification parameters $\alpha$ and $\beta$ of the RFCMS algorithm, and we evaluated the performance of all the algorithms w.r.t. validity measures. The sets of relevant dimensions computed by the RFCMS subspace clustering algorithm are shown for all the data sets.
4.5.5.1 Data Sets
We experimented with the Colon, Embryonal Tumours, Prostate and Leukemia data sets [bio, ]. These data sets are heterogeneous in terms of size and have no missing values. We have chosen data sets which are pre-classified, as this helps in evaluating the results of applying clustering algorithms. General characteristics of the data sets are summarized in Table 4.14.
Data Sets | Instances | Attributes | Classes
Colon Cancer | 62 | 2001 | 2
Embryonal Tumours | 60 | 7130 | 2
Leukemia | 38 | 7130 | 2
Prostate | 21 | 12601 | 2

Table 4.14: Data Sets

4.5.5.2 Effect of Fuzzification Parameters

For the RFCMS algorithm, the best combination of fuzzification parameters $\alpha$ and $\beta$ was determined by varying the values of $\alpha$ and $\beta$ in the range 2-5 independently of each other. This was done for each data set. Table 4.15 shows the complete list of fuzzification parameters we found for the different data sets as a result of fine-tuning.
Data Sets | $\alpha$ | $\beta$
Colon Cancer | 2 | 4
Embryonal Tumours | 3 | 5
Leukemia | 3 | 4
Prostate | 2 | 2

Table 4.15: Fuzzifier Values
4.5.5.3 Cluster Validity
Table 4.16 shows accuracy results for all the algorithms and data sets. The RFCMS algorithm achieves the highest accuracy for the Colon and Leukemia data sets. The EWKM, FWKM and LAC algorithms achieve the highest accuracy for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest accuracy for the Prostate data set. However, the accuracy of RFCMS was comparable with that of the FWKM algorithm for both the Embryonal Tumours and Prostate data sets.

Data Sets | RFCMS | EWKM | FWKM | LAC
Colon Cancer | 0.58065 | 0.5322 | 0.5322 | 0.5438
Embryonal Tumours | 0.5833 | 0.6666 | 0.6666 | 0.6666
Leukemia | 0.8421 | 0.5526 | 0.5526 | 0.5263
Prostate | 0.6195 | 0.6190 | 0.6666 | 0.6190

Table 4.16: Accuracy: RFCMS, EWKM, FWKM and LAC

Data Sets | RFCMS | EWKM | FWKM | LAC
Colon Cancer | 0.5318 | 0.53636 | 0.53636 | 0.51364
Embryonal Tumours | 0.63553 | 0.47619 | 0.47619 | 0.47619
Leukemia | 0.83502 | 0.4697 | 0.4697 | 0.45118
Prostate | 0.5240 | 0.47596 | 0.41346 | 0.47596

Table 4.17: Specificity: RFCMS, EWKM, FWKM and LAC
In Tables 4.17, 4.18, 4.19, and 4.20, we present the results of applying recall, specificity, precision and F1-measure to the outcomes of the clustering schemes produced by the different algorithms.

The RFCMS algorithm achieves the highest recall for the Leukemia data set. The EWKM, FWKM and LAC algorithms achieve the highest recall for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest recall for the Prostate data set. The RFCMS algorithm achieves the highest specificity for the Embryonal Tumours, Prostate and Leukemia data sets.

The RFCMS algorithm has the highest precision for the Colon and Leukemia data sets. The EWKM, FWKM and LAC algorithms achieve the highest precision for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest precision for the Prostate data set. The RFCMS algorithm achieves the highest F1-measure for the Leukemia data set, while the LAC algorithm achieves the highest F1-measure for the Colon data set, the EWKM, FWKM and LAC algorithms achieve the highest F1-measure for the Embryonal Tumours data set, and the FWKM algorithm achieves the highest F1-measure for the Prostate data set.
Data Sets | RFCMS | EWKM | FWKM | LAC
Colon Cancer | 0.53182 | 0.5322 | 0.5322 | 0.5483
Embryonal Tumours | 0.63553 | 0.6666 | 0.6666 | 0.6666
Leukemia | 0.83502 | 0.5526 | 0.5526 | 0.5263
Prostate | 0.52404 | 0.6190 | 0.6666 | 0.6190

Table 4.18: Recall: RFCMS, EWKM, FWKM and LAC

Data Sets | RFCMS | EWKM | FWKM | LAC
Colon Cancer | 0.5333 | 0.5057 | 0.5057 | 0.5288
Embryonal Tumours | 0.63415 | 0.7796 | 0.7796 | 0.7796
Leukemia | 0.8061 | 0.5642 | 0.5642 | 0.5263
Prostate | 0.5657 | 0.5814 | 0.6666 | 0.5814

Table 4.19: Precision: RFCMS, EWKM, FWKM and LAC

Data Sets | RFCMS | EWKM | FWKM | LAC
Colon Cancer | 0.5325 | 0.5322 | 0.5322 | 0.5483
Embryonal Tumours | 0.63415 | 0.6666 | 0.6666 | 0.6666
Leukemia | 0.83502 | 0.5526 | 0.5526 | 0.5263
Prostate | 0.5240 | 0.6190 | 0.6666 | 0.6190

Table 4.20: F1-measure: RFCMS, EWKM, FWKM and LAC

4.5.5.4 Subspaces Generated

Figures 4.1 to 4.12 show the sets of dimensions found for the Colon, Embryonal Tumours, Prostate and Leukemia data sets by the RFCMS, EWKM and LAC algorithms. The RFCMS algorithm finds fewer dimensions as compared to the EWKM and LAC algorithms. For the Embryonal Tumours data set, the EWKM and LAC algorithms fail to distinguish the relevance of dimensions for cluster 2, whereas the RFCMS algorithm distinguishes the relevant and non-relevant dimensions for cluster 2. For the Prostate data set, the RFCMS algorithm finds fewer dimensions as compared to the EWKM and LAC algorithms. For the Leukemia data set, the results of the RFCMS, EWKM, and LAC algorithms are comparable.
Figure 4.1: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.2: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.3: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Colon Dataset

Figure 4.4: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.5: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.6: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Embryonal Tumours Dataset

Figure 4.7: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.8: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.9: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Prostate Dataset

Figure 4.10: RFCMS: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

Figure 4.11: EWKM: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset

Figure 4.12: LAC: Memberships of dimensions in cluster 1 and cluster 2 for Leukemia Dataset
4.6 Summary
In this chapter, we have proposed a novel subspace clustering algorithm which employs a combination of rough set and fuzzy set theory. The Rough Fuzzy c-Means Subspace (RFCMS) algorithm is an extension of the rough fuzzy c-means algorithm which incorporates fuzzy memberships of both data points and dimensions in each cluster. In each iteration, cluster centers are updated and each data point is assigned to the lower or the upper approximation of a cluster. This process is repeated until the convergence criterion is met. We have also discussed the convergence of the proposed algorithm. The results of applying the proposed approach to UCI data sets show that the proposed algorithm scores over its competitors in terms of several validity measures. The proposed algorithm can be used in conjunction with density based algorithms to automatically detect the number of clusters.