JSS Journal of Statistical Software
MMMMMM YYYY, Volume VV, Issue II. http://www.jstatsoft.org/

EXCLUVIS: A MATLAB GUI Software for Comparative Study of Clustering and Visualization of Gene Expression Data

Anirban Mukhopadhyay
University of Kalyani
Kalyani-741235, India

Sudip Poddar
Indian Statistical Institute
Kolkata-700108, India

Abstract

The result of one clustering algorithm may differ from that of another for the same input dataset, because the input parameters of an algorithm can substantially affect its behavior and execution. Cluster validity measures can be used to find the partitioning that best fits the underlying data. In most realistic applications, this analysis can be visualized using a simple computer-aided design package in which various constraints can be specified, for example a MATLAB GUI. In gene clustering, grouping related genes in the same cluster based on their expression patterns, or clustering different samples based on the expression values of genes, is the foundation of genomic studies that aim at analyzing the function of genes. Microarray technology has made it possible to measure gene expression levels for thousands of genes simultaneously. Gene clustering methods help in grouping similarly expressed genes together. EXCLUVIS is an application developed in the MATLAB GUI environment that provides an interface between the user and the results of various clustering algorithms. In this application package, users select a number of parameters, such as internal validity indices, external validity indices and the number of clusters, from the active windows in order to evaluate the performance of the clustering algorithms. EXCLUVIS compares the performance of K-means, fuzzy C-means, hierarchical clustering and multiobjective clustering with support vector machine. A heatmap is also included for visualizing the results of the cluster analysis. This application package, developed in MATLAB R2009b, allows users to easily assess the goodness of the clustering solutions and immediately see the differences between the algorithmic solutions graphically.

Keywords: gene expression, clustering, validity indices, GUI, MATLAB.




1. Introduction

Clustering (Jain, Murty, and Flynn 1999) is an important unsupervised data mining task that partitions the input space into different homogeneous clusters such that the objects within the same cluster are as similar as possible, while the objects belonging to different clusters are as dissimilar as possible. The similarities among the objects are measured in terms of some distance metric. A variety of clustering algorithms are available in the literature, such as partitional, hierarchical, density-based and evolutionary algorithm-based methods (Mukhopadhyay, Maulik, and Bandyopadhyay 2015).

Microarray gene expression datasets are useful for studying the expression levels of thousands of genes simultaneously (Quackenbush 2001; Shannon, Culverhouse, and Duncan 2003). Clustering gene expression data helps in grouping the genes based on their expression patterns or grouping samples based on gene expression values, which further facilitates the prediction of gene functions and genetic markers. Since a number of clustering algorithms are available in the literature, software for comparing these algorithms on a particular expression dataset will be very helpful to biologists and bioinformaticians.

In view of this, we have developed a MATLAB GUI package called EXCLUVIS (gene EXpression data CLUstering and VISualization) for comparative study of clustering and visualization of gene expression data. The package, developed in MATLAB R2009b, presents a user-friendly graphical user interface for comparing different clustering algorithms visually and numerically. In this initial version we have implemented some popular clustering algorithms, namely K-means, fuzzy C-means and hierarchical clustering, together with a recent multiobjective clustering algorithm (Jain et al. 1999; Maulik, Mukhopadhyay, and Bandyopadhyay 2009). Other clustering algorithms can be incorporated in the future, as the package is open-source. The performance of the algorithms can be compared in terms of several cluster validity indices and also by visualization of the clustering results.

In the subsequent sections we describe the data preprocessing techniques, the visualization tools, the clustering algorithms used in this software, the cluster validity indices incorporated and the system requirements, followed by a demonstration of the software. Finally, we discuss the availability of the software package and conclude the article.

2. Data preprocessing

Preprocessing the input dataset is an important step in microarray analysis because of the presence of noise. Normalization, an important preprocessing step, refers to the process of finding and eliminating systematic effects and rescaling the data from different microarrays onto a common scale. For example, a gene whose expression level changes from 5000 to 10000 needs to be scaled so that it looks the same as a gene whose expression level changes from 500 to 1000. This is because, in gene expression data analysis, the pattern of change in the expression levels of genes is more important than the absolute values. One possibility is to scale all genes to a mean of 0 and a standard deviation of 1. In EXCLUVIS, the variances of the genes are first calculated and sorted in decreasing order. To normalize the genes, scaling is then performed on the dataset.
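The two preprocessing steps described above can be sketched as follows. This is only an illustrative sketch in Python (EXCLUVIS itself is a MATLAB package), and the function names `top_variance_genes` and `zscore_rows` as well as the toy matrix are hypothetical. It assumes that rows are genes and that no row is constant.

```python
import statistics

def top_variance_genes(data, n_top):
    """Return the indices of the n_top rows (genes) with the largest variance."""
    variances = [statistics.pvariance(row) for row in data]
    order = sorted(range(len(data)), key=lambda g: variances[g], reverse=True)
    return order[:n_top]

def zscore_rows(data):
    """Scale every row to mean 0 and standard deviation 1 (rows must not be constant)."""
    scaled = []
    for row in data:
        mu = statistics.fmean(row)
        sd = statistics.pstdev(row)
        scaled.append([(v - mu) / sd for v in row])
    return scaled

# Toy expression matrix: rows are genes, columns are samples.
expr = [
    [5000.0, 7500.0, 10000.0],  # large absolute values
    [500.0, 750.0, 1000.0],     # same pattern, ten times smaller
    [20.0, 21.0, 20.0],         # nearly flat gene
]
selected = top_variance_genes(expr, 2)
normalized = zscore_rows([expr[g] for g in selected])
```

After scaling, the first two genes have identical profiles, which is the behavior motivated by the 5000 to 10000 versus 500 to 1000 example.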


3. Visualization

Groups of functionally related genes in microarray data can be identified by applying the clustering algorithms available in the data mining literature. However, it is very difficult to find the most appropriate algorithm to apply, owing to the lack of a gold standard for verifying any clustering algorithm. This analytical process can be supported by data visualization tools such as the heatmap and the profile plot. For this reason, these two features are also included in this application package for visualizing the gene dataset.

3.1. Heatmap

The heatmap is a widely popular data visualization technique that plots genomic data in a two-dimensional grid. It is generally used to visualize gene expression data, with rows corresponding to genes and columns corresponding to features. The magnitude of each matrix entry is represented using a color scale. In summary, a heatmap provides a highly general overview of the data that is not readily available from other charting and graphing techniques.

3.2. Cluster profile plot

In gene expression analysis, the expression profiles of genes are studied under different experimental conditions. Under multiple phenotype conditions, perturbation in the expression pattern is detected by visualizing the expression profiles. Another, complementary approach to visualizing the dynamics of altered expression patterns is to measure gene expression at different time intervals under the same phenotype condition. Finding genes with similar expression patterns is one of the main interests of biologists, as these genes provide a means of understanding the co-regulation pattern in a gene network. Several methods have been adopted from machine learning and statistics to find co-expressed/co-regulated genes. The cluster profile plot is used to visualize those groups of co-regulated genes.

4. Clustering algorithms

In this application package several clustering algorithms are used for identifying groups of functionally related genes in microarray data. The clustering solutions are validated using validity indices. The package also enables users to compare clustering solutions visually using the heatmap and the profile plot. The clustering algorithms that have been implemented in EXCLUVIS are described in the following subsections.

4.1. K-Means

In statistics and data mining, K-means is a widely used clustering technique developed by MacQueen in 1967 (MacQueen 1967). It is one of the simplest and most effective techniques; it aims to partition n observations into K clusters in d-dimensional space. The partitioning is performed by assigning each observation to the nearest mean. It minimizes a squared error objective function defined as follows:

$$ J = \sum_{j=1}^{K} \sum_{X_i \in C_j} \lVert X_i - c_j \rVert^2, \tag{1} $$


where $\lVert X_i - c_j \rVert$ is a chosen distance measure between a data point $X_i$ and the cluster center $c_j$. K-means minimizes the global cluster variance $J$ to maximize the compactness of the clusters. The solution returned by K-means is not guaranteed to be optimal. For fixed $K$ and $d$, the problem can be solved exactly in $O(n^{dK+1} \log n)$ time, where $n$ is the number of entities to be clustered.
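The objective in Equation 1 and the usual alternating heuristic for reducing it can be illustrated with a small sketch. It uses the standard Lloyd-style iteration (assign each point to the nearest center, then recompute the centers as means) with squared Euclidean distance. The function names and the toy data are hypothetical, and this is not the EXCLUVIS implementation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Recompute each center as the mean of its members."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            new_centers.append(old)  # keep the center of an empty cluster
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers

def objective(points, labels, centers):
    """J of Equation 1: sum of squared distances to the assigned centers."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

def kmeans(points, centers, max_iter=100):
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
    return labels, centers

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centers = kmeans(points, [[0.0, 0.0], [5.0, 5.0]])
J = objective(points, labels, centers)
```

The result depends on the initial centers, which is one reason why the returned partition may be only locally optimal.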

4.2. Fuzzy C-Means

In fuzzy C-means (FCM) clustering (Bezdek 1981; Dunn 1973) each observation belongs to a cluster with a certain degree of membership. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is widely used in statistics and pattern recognition. It is based on minimization of the following objective function:

$$ J_m = \sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij}^{m} \lVert X_j - c_i \rVert^2, \tag{2} $$

where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $X_j$ in cluster $i$, $X_j$ is the $j$th $d$-dimensional data point, and $c_i$ is the $d$-dimensional center of cluster $i$. FCM generally produces better clustering results than K-means for overlapping clusters and noisy data. However, both FCM and K-means are sensitive to outliers.
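As an illustration of Equation 2, the sketch below evaluates $J_m$ for fixed centers. The membership update used here is the standard FCM rule, $u_{ij} = 1 / \sum_{l} (d_{ij}/d_{lj})^{2/(m-1)}$ with $d_{ij}$ the Euclidean distance from point $j$ to center $i$. That rule is not stated in the text above and is included as an assumption, as are the function names and toy data.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def memberships(points, centers, m=2.0):
    """u[i][j] is the membership of point j in cluster i; each column sums to 1."""
    K = len(centers)
    u = [[0.0] * len(points) for _ in range(K)]
    for j, x in enumerate(points):
        d2 = [sq_dist(x, c) for c in centers]
        zeros = [i for i, v in enumerate(d2) if v == 0.0]
        if zeros:
            # The point coincides with one or more centers: share full membership.
            for i in zeros:
                u[i][j] = 1.0 / len(zeros)
            continue
        for i in range(K):
            # (d_ij / d_lj)^(2/(m-1)) written in terms of squared distances.
            u[i][j] = 1.0 / sum((d2[i] / d2[l]) ** (1.0 / (m - 1.0))
                                for l in range(K))
    return u

def fcm_objective(points, centers, u, m=2.0):
    """J_m of Equation 2."""
    return sum(u[i][j] ** m * sq_dist(x, c)
               for i, c in enumerate(centers)
               for j, x in enumerate(points))

points = [[0.0], [1.0], [4.0]]
centers = [[0.0], [4.0]]
u = memberships(points, centers)
Jm = fcm_objective(points, centers, u)
```

The middle point at 1.0 receives membership 0.9 in the first cluster and 0.1 in the second, so it contributes to both terms of the sum rather than to only one, as it would in K-means.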

4.3. Hierarchical clustering

In hierarchical clustering (Johnson 1967; D'andrade 1978) a sequence of clusters is generated in a hierarchy. Each level of the hierarchy provides a particular clustering of the data. Hierarchical clustering may be either agglomerative or divisive. In agglomerative clustering, each data point is at first regarded as a singleton cluster. At each iteration the two nearest clusters are merged into a single cluster. The merging is performed until a single cluster remains. In contrast, divisive clustering starts with a single cluster containing all the data points. At each step, clusters are successively split into smaller clusters according to some dissimilarity measure.

The main shortcoming of hierarchical clustering is that the interpretation of the hierarchy is complex and often confusing. The deterministic nature of the method prevents the re-evaluation of the clusters after the nodes have been grouped, so a merge or split made earlier can never be undone. Also, the time complexity is at least $O(n^2)$, where $n$ is the total number of objects. In EXCLUVIS, some agglomerative hierarchical clustering algorithms are implemented.
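The agglomerative procedure can be sketched as below. Single linkage with Euclidean distance is chosen here purely for illustration. The function names are hypothetical, and the naive search over all cluster pairs is written for clarity rather than efficiency.

```python
def euclid(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2, points):
    """Single linkage: smallest distance between members of two clusters."""
    return min(euclid(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, n_clusters):
    """Merge the two nearest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start from singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # a merge is never undone
        del clusters[b]
    return clusters

points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
result = agglomerate(points, 2)
```

Stopping at `n_clusters` corresponds to cutting the hierarchy at one level. Running the loop down to a single cluster would produce the full hierarchy.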

4.4. Multiobjective clustering with support vector machine (MocSvm)

EXCLUVIS also includes a multiobjective evolutionary algorithm-based clustering algorithm (Maulik et al. 2009; Mukhopadhyay and Maulik 2009). In this approach two cluster validity indices, namely the $J_m$ index (Bezdek 1981) and the Xie-Beni index (Xie and Beni 1991), are optimized simultaneously to yield robust clustering solutions. The algorithm is based on the non-dominated sorting genetic algorithm II (NSGA-II) (Deb, Pratap, Agrawal, and Meyarivan 2002; Mukhopadhyay et al. 2015) and generates a near-Pareto-optimal set of clustering solutions (Maulik, Bandyopadhyay, and Mukhopadhyay 2011). These solutions are then integrated by fuzzy majority voting to obtain a single final solution.
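The notion of a non-dominated (Pareto-optimal) set for two objectives that are both minimized can be illustrated with a small sketch. It shows only the dominance test and does not implement NSGA-II or MocSvm. The candidate $(J_m, XB)$ pairs are made-up values.

```python
def dominates(p, q):
    """p dominates q if it is no worse on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def non_dominated(solutions):
    """Keep the solutions that are not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]

# Hypothetical (Jm, XB) values of five candidate clusterings.
candidates = [(10.0, 0.9), (12.0, 0.5), (11.0, 1.0), (15.0, 0.4), (13.0, 0.6)]
front = non_dominated(candidates)
```

None of the remaining solutions is better than another on both indices, which is why a further step, fuzzy majority voting in the text above, is needed to arrive at a single solution.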


5. Cluster validity indices

The main objective of clustering is to find groups of similar objects in a dataset. Each clustering algorithm searches for clusters whose members are close to each other and show a high degree of similarity. The main difficulty for clustering algorithms is to find the optimal number of clusters that best suits the dataset.

Visual verification of the validity of clustering results is possible for 2D datasets. For multidimensional data, however, it is difficult to validate the clustering results visually. Moreover, a clustering run may produce a non-optimal number of clusters if improper parameter values are used. The problem of finding the optimal number of clusters and of visualizing clustering results has been the subject of several research efforts. In general, there are two approaches to investigating cluster validity.

  • External indices: used to measure the extent to which cluster labels match externally supplied class labels, e.g., the Minkowski index.

  • Internal indices: used to measure the goodness of a clustering structure based on the intrinsic information of the data alone, e.g., the sum of squared error (SSE).

5.1. External validity indices

External validity measures are used to compare the resultant clustering solution with the true clustering of the data, if available. These indices are included in EXCLUVIS, as they are very useful for comparing the performance of different clustering techniques when the true clustering is known. Suppose $T$ is the true clustering of a dataset and $C$ is a clustering result given by some clustering algorithm. Let $a$, $b$, $c$ and $d$ respectively denote the number of pairs of points belonging to the same cluster in both $T$ and $C$, the number of pairs belonging to the same cluster in $T$ but to different clusters in $C$, the number of pairs belonging to different clusters in $T$ but to the same cluster in $C$, and the number of pairs belonging to different clusters in both $T$ and $C$. Then the external validity indices are defined as follows.

Minkowski index

The Minkowski index (Ben-Hur and Isabelle 2003) $M(T, C)$ is defined as

$$ M(T, C) = \sqrt{\frac{b + c}{a + b}}. \tag{3} $$

Lower values of the Minkowski index indicate better matching between $T$ and $C$, where the minimum value is $M(T, T) = 0$.

Adjusted Rand index

The adjusted Rand index (Yeung and Ruzzo 2001) $ARI(T, C)$ is defined as

$$ ARI(T, C) = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}. \tag{4} $$

The value of $ARI(T, C)$ is at most 1 and is close to 0 (possibly slightly negative) when the agreement between $T$ and $C$ is no better than chance; a higher value indicates that $C$ is more similar to $T$. Also, $ARI(T, T) = 1$.


Percentage of correctly classified pairs

This index (Bandyopadhyay, Maulik, and Mukhopadhyay 2007) is computed as

$$ P(T, C) = \frac{(a + d) \times 100}{a + b + c + d}. \tag{5} $$

The value of $P(T, C)$ lies between 0 and 100; a higher value indicates that $C$ is more similar to $T$. Also, $P(T, T) = 100$.
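The pair counts $a$, $b$, $c$, $d$ and the three external indices of Equations 3 to 5 can be computed directly from two label vectors, as in the following sketch. The function names and the example labelings are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pair_counts(true_labels, pred_labels):
    """Count point pairs by whether they share a cluster in T and in C."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_t = true_labels[i] == true_labels[j]
        same_c = pred_labels[i] == pred_labels[j]
        if same_t and same_c:
            a += 1  # together in both T and C
        elif same_t:
            b += 1  # together in T, apart in C
        elif same_c:
            c += 1  # apart in T, together in C
        else:
            d += 1  # apart in both
    return a, b, c, d

def minkowski(a, b, c, d):
    """Equation 3; lower is better."""
    return sqrt((b + c) / (a + b))

def adjusted_rand(a, b, c, d):
    """Equation 4; higher is better."""
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

def pct_correct_pairs(a, b, c, d):
    """Equation 5; higher is better."""
    return 100 * (a + d) / (a + b + c + d)

T = [0, 0, 0, 1, 1, 1]  # true clustering
C = [0, 0, 1, 1, 1, 1]  # clustering result with one misplaced point
counts = pair_counts(T, C)
scores = (minkowski(*counts), adjusted_rand(*counts), pct_correct_pairs(*counts))
```

Enumerating all pairs costs $O(n^2)$ comparisons, which is acceptable for a few thousand genes but would usually be replaced by a contingency-table computation for larger datasets.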

5.2. Internal validity indices

Internal validity indices, sometimes termed criteria rather than indices, are used to validate a clustering solution. The result of one clustering algorithm can be very different from that of another for the same input dataset, as the other input parameters of an algorithm can substantially affect its behavior and execution. Internal validity indices are used to evaluate the quality of a clustering solution using geometrical properties of the clusters, such as compactness, separation and connectedness. These indices may also serve as an objective function for determining the optimal cluster structure in a dataset. To validate a clustering solution, the following internal validity indices are implemented in EXCLUVIS.

J index

The J index (Bezdek 1981) is minimized by fuzzy C-means clustering. It is defined as follows:

$$ J = \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{m} D^2(Z_k, X_i), \tag{6} $$

where $u_{ki}$ is an entry of the fuzzy membership matrix (partition matrix) and $m$ denotes the fuzzy exponent. $D(Z_k, X_i)$ denotes the distance between the $k$th cluster center $Z_k$ and the $i$th data point $X_i$. $J$ can be considered as the global fuzzy cluster variance. A lower value of the J index indicates more compact clusters. However, the value of $J$ is not independent of the number of clusters $K$: as $K$ increases, $J$ gradually decreases, and it takes the minimum value 0 when $K = n$. A crisp version of $J$ is obtained when the partition matrix $u$ has only binary values.

Davies-Bouldin index

The Davies-Bouldin (DB) index (Davies and Bouldin 1979) is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. The scatter within the $i$th cluster, $S_i$, is computed as

$$ S_i = \frac{1}{|C_i|} \sum_{x \in C_i} D^2(Z_i, x). \tag{7} $$

Here $|C_i|$ denotes the number of data points belonging to cluster $C_i$. The distance between two clusters $C_i$ and $C_j$, $d_{ij}$, is defined as the distance between their centers:

$$ d_{ij} = D^2(Z_i, Z_j). \tag{8} $$

The DB index is then defined as

$$ DB = \frac{1}{K} \sum_{i=1}^{K} R_i, \tag{9} $$

where

$$ R_i = \max_{j,\, j \neq i} \left\{ \frac{S_i + S_j}{d_{ij}} \right\}. \tag{10} $$

The value of the DB index is to be minimized in order to achieve proper clustering.
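Equations 7 to 10 translate into a short computation once crisp labels are available. The sketch below follows the definitions above, with $D^2$ taken as squared Euclidean distance and cluster centers taken as centroids. Both choices, and the function names, are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance, used as D^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(members) for col in zip(*members)]

def davies_bouldin(points, labels):
    ks = sorted(set(labels))
    groups = {k: [p for p, l in zip(points, labels) if l == k] for k in ks}
    centers = {k: centroid(groups[k]) for k in ks}
    # Equation 7: within-cluster scatter S_i.
    scatter = {k: sum(sq_dist(centers[k], x) for x in groups[k]) / len(groups[k])
               for k in ks}
    total = 0.0
    for i in ks:
        # Equation 10: worst-case ratio R_i over all other clusters.
        total += max((scatter[i] + scatter[j]) / sq_dist(centers[i], centers[j])
                     for j in ks if j != i)
    return total / len(ks)  # Equation 9

points = [[0.0], [2.0], [10.0], [12.0]]
tight = davies_bouldin(points, [0, 0, 1, 1])  # natural grouping
loose = davies_bouldin(points, [0, 1, 0, 1])  # interleaved grouping
```

The natural grouping gives a much smaller value than the interleaved one, in line with the rule that the DB index is to be minimized.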

Dunn index

Suppose $\delta(C_i, C_j)$ denotes the distance between two clusters $C_i$ and $C_j$, and $\Delta(C_i)$ denotes the diameter of cluster $C_i$; then any index of the following form falls under the Dunn family of indices (Dunn 1974):

$$ DN = \min_{1 \le i \le K} \left\{ \min_{1 \le j \le K,\, j \neq i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le K} \{\Delta(C_k)\}} \right\} \right\}. \tag{11} $$

Originally Dunn used the following forms of $\delta$ and $\Delta$:

$$ \delta(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} \{D(x, y)\}, \tag{12} $$

and

$$ \Delta(C_i) = \max_{x, y \in C_i} \{D(x, y)\}. \tag{13} $$

Here $D(x, y)$ denotes the distance between the data points $x$ and $y$. A larger value of the Dunn index implies compact and well-separated clusters. Hence the objective is to maximize the Dunn index.
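A direct evaluation of Equations 11 to 13 is sketched below with Euclidean distance as $D$. Because the denominator in Equation 11 does not depend on $i$ or $j$, the index reduces to the smallest between-cluster distance divided by the largest diameter. The function names are hypothetical, and at least one cluster is assumed to contain two distinct points.

```python
def euclid(p, q):
    """Euclidean distance, used as D."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def dunn(points, labels):
    ks = sorted(set(labels))
    groups = [[p for p, l in zip(points, labels) if l == k] for k in ks]
    # Equation 13: largest cluster diameter.
    max_diam = max(euclid(x, y) for g in groups for x in g for y in g)
    # Equation 12: smallest distance between points of different clusters.
    min_sep = min(
        euclid(x, y)
        for a, ga in enumerate(groups)
        for gb in groups[a + 1:]
        for x in ga
        for y in gb
    )
    return min_sep / max_diam  # Equation 11

points = [[0.0], [2.0], [10.0], [12.0]]
tight = dunn(points, [0, 0, 1, 1])
loose = dunn(points, [0, 1, 0, 1])
```

Here the natural grouping scores higher than the interleaved one, consistent with maximizing the index.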

Xie-Beni index

The Xie-Beni (XB) index (Xie and Beni 1991) is defined as a function of the ratio of the total fuzzy cluster variance $\sigma$ to the minimum separation $sep$ of the clusters. Here $\sigma$ and $sep$ can be written as

$$ \sigma = \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{2} D^2(Z_k, X_i), \tag{14} $$

and

$$ sep = \min_{k \neq l} \{D^2(Z_k, Z_l)\}. \tag{15} $$

The XB index is then written as

$$ XB = \frac{\sigma}{n \times sep}. \tag{16} $$

A lower value of $\sigma$ and a higher value of $sep$ indicate that the partitioning is good and compact. Hence, the objective here is to minimize the XB index for achieving proper clustering.
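Equations 14 to 16 can be evaluated as in the sketch below, given the centers and a membership matrix. A crisp (0/1) membership matrix is used in the example for simplicity, and $D^2$ is taken as squared Euclidean distance. The names and data are illustrative assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance, used as D^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def xie_beni(points, centers, u):
    """u[k][i] is the membership of point i in cluster k."""
    n = len(points)
    # Equation 14: total fuzzy cluster variance.
    sigma = sum(u[k][i] ** 2 * sq_dist(z, x)
                for k, z in enumerate(centers)
                for i, x in enumerate(points))
    # Equation 15: minimum squared separation between two centers.
    sep = min(sq_dist(centers[k], centers[l])
              for k in range(len(centers))
              for l in range(k + 1, len(centers)))
    return sigma / (n * sep)  # Equation 16

points = [[0.0], [2.0], [10.0], [12.0]]
centers = [[1.0], [11.0]]
u = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
xb = xie_beni(points, centers, u)
```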


I index

I index (Maulik and Bandyopadhyay 2002) is defined as follows.

$$ I = \left( \frac{1}{K} \times \frac{E_1}{E_K} \times D_K \right)^{p}, \tag{17} $$

where

$$ E_K = \sum_{k=1}^{K} \sum_{j=1}^{n} u_{kj} D(Z_k, X_j), \tag{18} $$

and

$$ D_K = \max_{i,j=1}^{K} \{D(Z_i, Z_j)\}. \tag{19} $$

The symbols are as discussed earlier. The I index has three factors, namely $\frac{1}{K}$, $\frac{E_1}{E_K}$ and $D_K$. The first factor tries to reduce the value of $I$ as $K$ increases. The second factor is the ratio of $E_1$ to $E_K$, where $E_1$ is constant for a given dataset and $E_K$ decreases as $K$ increases. Hence, because of this term, $I$ increases as $E_K$ decreases, which encourages the formation of more clusters that are compact in nature. Finally, the third factor $D_K$, which measures the maximum separation between two clusters over all possible pairs of clusters, increases with $K$. Note, however, that this value is bounded above by the maximum separation between two points in the dataset. Thus, the three factors compete with and balance each other. The contrast between the different cluster configurations is controlled by the power $p$. A higher value of the I index indicates a better clustering.
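The interplay of the three factors in Equation 17 can be seen in a small sketch for crisp clusterings. It assumes Euclidean distance for $D$, centroids as cluster centers, $p = 2$, and $E_1$ computed as $E_K$ with $K = 1$ (the sum of distances of all points to the overall centroid). These are assumptions made for illustration and are not details given in the text.

```python
def euclid(p, q):
    """Euclidean distance, used as D."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(members) for col in zip(*members)]

def i_index(points, labels, p=2.0):
    ks = sorted(set(labels))
    groups = [[x for x, l in zip(points, labels) if l == k] for k in ks]
    centers = [centroid(g) for g in groups]
    # E_1: Equation 18 with a single cluster containing all points (assumption).
    overall = centroid(points)
    e_1 = sum(euclid(overall, x) for x in points)
    # E_K: Equation 18 with crisp memberships.
    e_k = sum(euclid(z, x) for z, g in zip(centers, groups) for x in g)
    # D_K: Equation 19, largest distance between two centers.
    d_k = max(euclid(zi, zj) for zi in centers for zj in centers)
    return ((1.0 / len(ks)) * (e_1 / e_k) * d_k) ** p

points = [[0.0], [2.0], [10.0], [12.0]]
good = i_index(points, [0, 0, 1, 1])
bad = i_index(points, [0, 1, 0, 1])
```

For the natural grouping, $E_1 = 20$, $E_K = 4$ and $D_K = 10$, so the index is $(0.5 \times 5 \times 10)^2 = 625$, whereas the interleaved grouping gives only 1.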

Silhouette index

Suppose $a_i$ represents the average distance of a point $x_i$ from the other points of the cluster to which it is assigned, and $b_i$ represents the minimum of the average distances of $x_i$ from the points of each of the other clusters. Then the silhouette width $S_i$ of the point is defined as follows:

$$ S_i = \frac{b_i - a_i}{\max(a_i, b_i)}. \tag{20} $$

The Silhouette index (Rousseeuw 1987) $S$ is the average silhouette width of all the data points:

$$ S = \frac{1}{n} \sum_{i=1}^{n} S_i. \tag{21} $$

Note that the value of the Silhouette index varies from -1 to 1, where a higher value indicates a better clustering result.
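Equations 20 and 21 can be computed as in the following sketch, which uses Euclidean distance and assigns a width of 0 to points in singleton clusters (a common convention, assumed here). The function names are hypothetical, and at least two clusters are required.

```python
def euclid(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def mean(values):
    return sum(values) / len(values)

def silhouette(points, labels):
    clusters = sorted(set(labels))
    widths = []
    for i, x in enumerate(points):
        # Distances to the other members of the point's own cluster.
        own = [euclid(x, y) for j, y in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: width set to 0 by convention
            continue
        a_i = mean(own)
        # Smallest average distance to the points of any other cluster.
        b_i = min(
            mean([euclid(x, y) for y, l in zip(points, labels) if l == k])
            for k in clusters if k != labels[i]
        )
        widths.append((b_i - a_i) / max(a_i, b_i))  # Equation 20
    return mean(widths)  # Equation 21

points = [[0.0], [2.0], [10.0], [12.0]]
s = silhouette(points, [0, 0, 1, 1])
```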

6. System requirement

The EXCLUVIS package has been developed in MATLAB R2009b. Hence, a machine with MATLAB R2009b or higher is required for running the software. Note that the MATLAB Bioinformatics and Statistics toolboxes are also necessary in order to run EXCLUVIS, as the package uses some tools from these toolboxes. In terms of memory requirements, the machine must have at least 1 GB of RAM; however, 2 GB of RAM is recommended for smooth performance.

7. Demonstration of EXCLUVIS package

The MATLAB GUI environment is used to develop this application package in order to give the user the flexibility to perform a comparative study of clustering algorithms and to visualize gene expression data without implementing the corresponding algorithms and methods. The graphical components used in this application package include Pushbutton, Radio Button, Edit Box, Static Text Box, Pop-Up Menu, Toggle Button, Table, Axes, Panel, Button Group, Labels and Menus. The implemented MATLAB GUI is available at http://kucse.in/excluvis for academic and research purposes.

7.1. Initial window

Figure 1: Homepage of EXCLUVIS.

The initial window of this application package is shown in Figure 1. A browse button is included in the "Homepage window" for selecting the dataset. This window can be invoked by typing "Homepage" at the MATLAB prompt and pressing the return key. The dataset needs to be preprocessed before the analysis, and the expected format of the dataset is displayed by clicking the "Help for Dataset" button. The dataset should contain only real values, and no missing values are allowed. If a true clustering exists (prior knowledge of the dataset), then the column number of the class attribute needs to be specified in the edit box. The initial label vector is also saved in a directory for further analysis, such as the computation of external validity indices. If the user wants to take the top N genes, then the number of genes also needs to be specified in the given field. To select the top genes, the variance of each gene is first calculated and the genes are sorted in descending order of variance. Thereafter, the top N genes with the highest variance are selected. Normalization is done in such a way that each row is mapped to the default mean of 0 and standard deviation of 1. It is also assumed that the dataset has only finite real values, and that the elements of each row are not all equal. The selected dataset and the top genes are saved in different text files. For clustering, the window of any of the clustering algorithms (K-means, Fuzzy C-Means, Hierarchical clustering, MocSvm clustering) can be opened by selecting the algorithm. If the user wants to run all the algorithms at once, the all clustering window option available in the initial window can be chosen. For demonstration purposes, we choose the K-means window here for analysis of the selected dataset. The dataset used for the analysis is the braintumor (Pomeroy et al.) dataset, which can be found at http://kucse.in/excluvis. The dataset contains the true clustering, and the column number of the class attribute is 7130. The top 100 genes are selected. Tooltips describing what each field does are also included for all the fields, in order to help the user provide input during an experiment. A tooltip is also displayed in Figure 1.

7.2. K-means window

Figure 2: K-means clustering window.

After the K-means window shown in Figure 2 is opened, the dataset selected initially is automatically displayed in the edit box. In order to cluster the dataset, the user can give the probable range of the number of clusters and the number of iterations for determining the proper number of clusters in the corresponding fields. An internal validity index needs to be selected for finding the optimal number of clusters. A larger value indicates a better result for the Dunn index, Silhouette index and I index, whereas a smaller value indicates a better result for the J index, Davies-Bouldin index and Xie-Beni index. Default values are given in a few fields, e.g., 2 for the number of iterations and KmClus for the directory in which the label vector is saved. The number of clusters generated and the corresponding value of the selected internal index are shown in the edit box after clicking the 'Generate' button. A graph is also plotted in the plot field, where the X-axis denotes the range of the number of clusters and the Y-axis denotes the validity index value generated for the corresponding number of clusters. For each number of clusters, the corresponding index value is also marked with a marker on the graph. A legend in the plot identifies the algorithm and the internal validity index used. A heatmap is also included to represent the level of expression of many genes across a number of comparable samples. In addition, a profile plot is provided to show the normalized gene expression values (light green) of the genes of each cluster with respect to the time points. The graph can be cleared by clicking the refresh button. To give an indication of the computational cost, the CPU running time is also displayed in the execution time field.

The procedure used for determining the value of the internal validity index (maximum or minimum, depending on the internal validity index used) and the number of clusters generated is as follows.

  • Suppose the range of the number of clusters is 2 to 7 and the number of iterations is 2.

  • The Dunn index, Silhouette index, I index, J index, Davies-Bouldin index and Xie-Beni index are available as internal validity indices. Suppose the selected index is the Silhouette index.

  • In iteration number 1, an index value is generated for each number of clusters, the maximum or minimum value is selected according to the validity index used, and the corresponding label vector is saved in a directory specified by the user. In our example, 6 different values are generated for the Silhouette index; the maximum among them is chosen and the corresponding label vector is saved.

Similarly, in the next iteration another maximum value is selected and the corresponding label vector is also saved. Finally, the maximum value over these 2 iterations is selected and the corresponding label vector is saved in the assigned directory (KmClus) for further analysis.
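The selection procedure above amounts to taking the best index value over the range of cluster numbers within each iteration, and then the best over the iterations. A minimal sketch follows. The `run` callback stands in for "cluster with K clusters and evaluate the chosen index", and the Silhouette values in `toy_values` are made up for illustration.

```python
def select_best(run, k_range, n_iterations, maximize=True):
    """Return the best (index_value, k, labels) triple over all iterations.

    run(k, iteration) must return (index_value, k, labels).
    maximize=True suits Dunn, Silhouette and I; False suits J, DB and XB.
    """
    pick = max if maximize else min
    per_iteration = []
    for iteration in range(n_iterations):
        results = [run(k, iteration) for k in k_range]
        per_iteration.append(pick(results, key=lambda r: r[0]))
    return pick(per_iteration, key=lambda r: r[0])

# Hypothetical Silhouette values for K = 2..7 in two iterations.
toy_values = {
    0: {2: 0.41, 3: 0.55, 4: 0.52, 5: 0.47, 6: 0.40, 7: 0.33},
    1: {2: 0.43, 3: 0.58, 4: 0.50, 5: 0.45, 6: 0.38, 7: 0.35},
}

def toy_run(k, iteration):
    # A real run would cluster the data and return the label vector as well.
    return toy_values[iteration][k], k, None

best_value, best_k, _ = select_best(toy_run, range(2, 8), 2, maximize=True)
```

Passing `maximize=False` gives the corresponding rule for the indices that are to be minimized.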

In order to compare the results of all the clustering solutions, a Report table window is also integrated in this package; it stores the number of clusters generated, the final index value and the execution time for each clustering algorithm used. As the external validity index, the Adjusted Rand Index is selected here, and its value and the execution time are displayed both in the window and in the Report table. The Report table can be viewed at any time by clicking the 'Report' button. Moreover, the Heatmap and the Profile plot are incorporated in this application package for cluster analysis. In the Heatmap, the X-axis denotes time points/conditions and the Y-axis denotes the names of the genes, and the clusters are separated from each other by a separator. Figure 3 shows the Heatmap of the K-means clustering solution, whereas the profile plot of the clustering solution is shown in Figure 4.

Figure 3: Heatmap of K-means clustering.

The package is designed in such a way that, in the profile plot, the profile of each cluster is drawn in a single window with different colors. The cluster number is also shown in each figure with the help of a legend.

7.3. Fuzzy C-means window

The MATLAB GUI code is written in such a way that all the values given by the user in the K-Means window are automatically populated in the corresponding fields of the Fuzzy C-Means window, except the directory in which the label vector is saved (for which a default value is given). The user can change the value of any field if desired. Users can find the values of the internal as well as the external validity indices by clicking the appropriate buttons. The corresponding values and the execution time are also saved in the Report table window. The graph is plotted in the plot field with different colors and markers, and a legend is appended to the plot to identify the algorithm and validity indices used. Figure 5 shows the Fuzzy C-Means window, and Figure 6 and Figure 7 show the corresponding Heatmap and Profile plot, respectively.

Figure 4: Profile plot of K-means clustering.

Figure 5: Fuzzy C-means clustering window.

Next, the Hierarchical clustering window is selected for further analysis.

7.4. Hierarchical clustering

Figure 6: Heatmap of fuzzy C-means clustering.

Figure 7: Profile plot of fuzzy C-means clustering.

After this window is opened, the selected data points, the range of the number of clusters, the number of iterations and the internal and external validity index fields that were already set by the user in previous windows are automatically populated in the corresponding fields of this window, as mentioned previously. To run this clustering algorithm, two extra fields are added: ‘Distance’ and ‘Method’. One distance measure must be selected among Euclidean, Seuclidean, Cityblock, Mahalanobis, Minkowski, Cosine, Correlation, Spearman, Hamming, Jaccard and Chebychev. One linkage method must also be selected; it determines how the distance between groups is computed when groups are formed. The available linkage methods are Single, Complete, Average, Weighted, Centroid, Median and Ward. Figure 8 shows the Hierarchical clustering window after the values of the internal and external validity indices have been determined. These values are also saved in the Report table, and the graph is drawn in the plot with different colors and markers. The selected internal validity index and the algorithm used are also added to the graph as a legend. The Heatmap and Profile plot of this clustering are shown in Figure 9 and Figure 10, respectively.
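The roles of the ‘Distance’ and ‘Method’ fields can be illustrated with a small agglomerative clustering sketch in Python. It supports only three of the distance measures and three of the linkage methods listed above, uses a simple quadratic search for the closest pair of groups, and is an illustration under these assumptions rather than the routine called by EXCLUVIS.

```python
from itertools import combinations
from math import dist

# Point-to-point distance measures (a subset of those listed in the text).
DISTANCES = {
    "euclidean": dist,
    "cityblock": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "chebychev": lambda a, b: max(abs(x - y) for x, y in zip(a, b)),
}

# Linkage methods: how pairwise distances between two groups are combined.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerative(points, n_clusters, distance="euclidean", method="average"):
    """Merge the two closest groups until n_clusters groups remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    point_distance = DISTANCES[distance]
    link = LINKAGES[method]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ds = [point_distance(points[i], points[j])
                  for i in clusters[a] for j in clusters[b]]
            d = link(ds)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Convert the groups into a label vector.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Changing `distance` alters how far apart two genes are considered to be, whereas changing `method` alters how far apart two groups of genes are considered to be; the two choices are independent.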

7.5. MocSvm clustering

Figure 8: Hierarchical clustering window.

Figure 9: Heatmap of hierarchical clustering.

In this clustering technique, the additional parameters are the population size, the probability of crossover (Pcrossover), the probability of mutation (Pmutation), the membership threshold of a point (α), the threshold of fuzzy majority voting (β) and Weight. Generally, the population size is chosen by the user, with no fixed rule, to search for possible solutions. Crossover is the exchange of genetic information between randomly selected parent chromosomes. Mutation is the random alteration of the genetic structure to introduce genetic diversity into the population. Both are probabilistic operations; generally, the crossover probability is kept high and the mutation probability is kept low. In this clustering, the size of the training set depends on α and β: it decreases as α and β increase, and it increases as α and β decrease. Generally, both parameters are set to 0.5 to obtain a good solution. As before, the internal validity index value and the corresponding label vector are stored in the Report table after generation. These values are also populated in the appropriate boxes of this window. The graph is drawn with different markers, and legends are added in different colors. In the same way, the external validity index can be determined and populated in the appropriate places of this window as well as in the Report table. This graphical window is shown in Figure 11.

The Heatmap and Profile plot of this clustering solution are shown in Figure 12 and Figure 13, respectively.
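The effect of the parameters described in this subsection can be sketched as follows. The crossover and mutation operators are generic textbook versions (single-point crossover and a small random perturbation), and the training-point rule is one assumed reading of the α and β thresholds: a point is kept when its highest membership is at least α in at least a fraction β of the candidate solutions. None of this is taken from the MocSvm source code, and all function names are hypothetical. The sketch is in Python.

```python
import random


def crossover(parent_a, parent_b, p_crossover, rng):
    """Single-point crossover applied with probability p_crossover."""
    if len(parent_a) < 2 or rng.random() >= p_crossover:
        return list(parent_a), list(parent_b)
    point = rng.randrange(1, len(parent_a))
    return (list(parent_a[:point]) + list(parent_b[point:]),
            list(parent_b[:point]) + list(parent_a[point:]))


def mutate(chromosome, p_mutation, rng, scale=0.1):
    """Perturb each gene independently with probability p_mutation."""
    return [g + rng.uniform(-scale, scale) if rng.random() < p_mutation else g
            for g in chromosome]


def select_training_points(memberships, alpha, beta):
    """Indices of confidently clustered points (assumed rule, see lead-in).

    memberships[s][i] is the highest cluster membership of point i in
    candidate solution s.
    """
    n_solutions = len(memberships)
    selected = []
    for i in range(len(memberships[0])):
        votes = sum(1 for s in range(n_solutions)
                    if memberships[s][i] >= alpha)
        if votes / n_solutions >= beta:
            selected.append(i)
    return selected
```

Under this rule, raising either threshold can only remove points from the selection, which matches the behavior described in the text: larger α and β give a smaller training set.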

7.6. Report table window

Figure 10: Profile plot of hierarchical clustering.

Figure 11: MocSvm clustering window.

This window contains two tables, one for internal validity indices and one for external validity indices. Each index field of the internal validity indices table has three columns: the value generated (maximum or minimum, depending on the validity index used), the number of generated clusters and the execution time. The external validity indices table has two columns: the value generated and the execution time. If the dataset does not contain a true clustering column, the external table does not appear in the Report table. During simulation, the values generated in each window are populated in the corresponding columns of the chosen validity index field in this window. After the values are populated, the Report table looks like Figure 14. The Report table is hidden by clicking on the close button.
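The structure of the two tables can be sketched as a small data structure. The example below, in Python, keeps three values per internal index and two per external index, and omits the external table when no true clustering is available. The function and field names are assumptions made for illustration and do not correspond to identifiers in the EXCLUVIS code.

```python
def make_report(has_true_clustering):
    """Create an empty report; the external table exists only if needed."""
    report = {"internal": {}}
    if has_true_clustering:
        report["external"] = {}
    return report


def record_internal(report, algorithm, index_name, values_by_k, seconds,
                    larger_is_better):
    """Store the best index value over all tested numbers of clusters."""
    pick = max if larger_is_better else min
    best_k = pick(values_by_k, key=values_by_k.get)
    report["internal"].setdefault(algorithm, {})[index_name] = {
        "value": values_by_k[best_k],
        "clusters": best_k,
        "time": seconds,
    }


def record_external(report, algorithm, index_name, value, seconds):
    """Store an external index value; ignored without a true clustering."""
    if "external" not in report:
        return
    report["external"].setdefault(algorithm, {})[index_name] = {
        "value": value,
        "time": seconds,
    }
```

The `larger_is_better` flag captures the "maximum or minimum" distinction: it would be true for the Silhouette index and false for the Xie-Beni index.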

7.7. All clustering running window

In the previous subsections, the clustering windows were demonstrated one by one, which requires users to run the clustering algorithms one after another. However, to give the user the flexibility to run all the algorithms at once with a single click, an “All clustering window” is also provided with this application package. To choose the “All clustering window”, the user needs to select this option in the “Homepage window”. This graphical window is shown in Figure 15.

Figure 12: Heatmap of MocSvm clustering.

Figure 13: Profile plot of MocSvm clustering.

In this window, the user needs to give the possible range of the number of clusters, the number of iterations, and the desired internal validity indices and external validity indices (if they exist) only once for all clustering algorithms. These parameters are placed in a common panel of this window for all algorithms. If true clustering information does not exist, the external validity field button becomes inactive. K-Means, Fuzzy C-Means, Hierarchical Clustering and MocSvm have separate panels in this window for inputs specific to each algorithm. Default values are given for some fields, and users can find out the function of each field through tooltips by placing the mouse over the desired field. The background code is written so that, after the ‘Done’ button is clicked, it first opens the K-Means window in invisible mode and sets all fields of the K-Means window to the values given in the K-Means panel of the “all clustering running window”. It then automatically triggers the generate buttons of the internal and external validity index fields one after another and generates the values, which are automatically populated in the Report table. The other algorithms are run in the same way, and the generated values are stored in the Report table.

Figure 14: Report table window.

Figure 15: All cluster running window.

The dataset used here for demonstration is the prostate data (Singh et al.), available at http://kucse.in/excluvis. The selected internal validity index is the Xie-Beni index (a lower value indicates a better result). To visualize the results, the user can click on Heatmap and Profile plot for each clustering algorithm. This dataset does not contain true clustering information; therefore, all fields of the external validity indices become inactive in all the windows. The detailed results can also be seen in the Report table. For demonstration purposes, the Heatmap and Profile plot of MocSvm clustering are shown in Figure 16 and Figure 17, respectively. For detailed results, the Report table is shown in Figure 18.
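The one-click procedure described in this subsection amounts to a loop over the algorithms that shares the common inputs, times each run and collects the best index value. The following Python sketch shows that control flow only; `run_all` and its arguments are hypothetical names, the clustering and index functions are supplied by the caller, and the sketch does not reproduce the way EXCLUVIS drives its GUI windows.

```python
import time


def run_all(algorithms, data, k_range, index_fn, larger_is_better=False):
    """Run every algorithm over the same range of cluster numbers.

    algorithms maps a name to a function (data, k) -> label vector;
    index_fn maps (data, labels) to an internal validity index value.
    """
    pick = max if larger_is_better else min
    report = {}
    for name, cluster_fn in algorithms.items():
        start = time.perf_counter()
        scores = {}
        for k in k_range:
            labels = cluster_fn(data, k)
            scores[k] = index_fn(data, labels)
        elapsed = time.perf_counter() - start
        best_k = pick(scores, key=scores.get)
        report[name] = {
            "value": scores[best_k],
            "clusters": best_k,
            "time": elapsed,
        }
    return report
```

With an index for which smaller values are better, such as the Xie-Beni index used in the demonstration, the default `larger_is_better=False` selects the minimum over the tested numbers of clusters.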

7.8. Comparison of clustering algorithms

The EXCLUVIS package is also suitable for comparing different clustering algorithms on a particular dataset. The comparison can be done both visually and numerically. For visual comparison, the user can produce the validity index plots with respect to the number of clusters for the different algorithms, and these plots can then be viewed together for comparison, as shown in Figure 11. It is evident from the figure that the clustering algorithms attain their best (maximum) value of the Silhouette index at different numbers of clusters. K-means, Fuzzy C-means, hierarchical clustering and MocSvm attain their maximum Silhouette index values when the number of clusters is 2, 2, 2 and 4, respectively. Among the four algorithms, MocSvm provides the highest value of the Silhouette index.
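For reference, the Silhouette index used in this comparison can be computed as in the following Python sketch, which uses the Euclidean distance and assigns a silhouette width of 0 to points in singleton clusters. The function name is an assumption for this example, and the sketch is not the EXCLUVIS MATLAB routine.

```python
from math import dist


def silhouette_index(points, labels):
    """Mean silhouette width over all points (higher is better)."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the Silhouette index needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster: width taken as 0
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Evaluating this function for each tested number of clusters and keeping the maximum gives the kind of per-algorithm best value that is compared in the text.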

Figure 16: Heatmap of MocSvm clustering from “all clustering running window”.

Figure 17: Profile plot of MocSvm clustering from “all clustering running window”.

Moreover, the different algorithms can be compared based on the best values obtained for different cluster validity indices using the Report table window, as shown in Figure 14. This window shows the values of the validity indices for the different algorithms, which allows a direct comparison of the algorithms. It is evident from the figure that MocSvm provides the maximum value of the Silhouette index (0.5758). However, MocSvm also takes the longest time (10.1172 seconds) among all the algorithms.

Furthermore, to facilitate one-click comparison of different algorithms, EXCLUVIS also has an “all clustering running window” (Figure 15), which takes the input for each clustering algorithm, runs all the algorithms with one click and produces the results as desired (see Figure 18). In this way, EXCLUVIS not only helps in running individual clustering algorithms, but also facilitates the comparison of different algorithms.

8. Version and Availability

The first version of this application software package is freely available at http://kucse.in/excluvis, with detailed documentation including demo files, code, datasets and instructions for performing the analysis of the examples given in this article.


Figure 18: Report table window for running all clustering algorithms.

9. Conclusion

In this article, we have presented an effective and user-friendly application tool for gene clustering. We have developed this analytical design tool and software using MATLAB toolboxes in such a way that users can include new algorithms in it if required. Researchers can also include their own validity indices (internal as well as external) in this package. EXCLUVIS has several features that make it a potentially useful tool for a community of researchers and developers. Researchers can also visualize the results using features such as the Heatmap and cluster Profile plot available in EXCLUVIS. A detailed user manual is also provided at http://kucse.in/excluvis.

We plan to upgrade the software package in the future in several respects. For example, we plan to include more clustering algorithms in the package. Some other validity indices will also be incorporated. As the software is mainly focused on analyzing gene expression data, we also plan to include biological validation options with the help of gene ontology.

References

Bandyopadhyay S, Maulik U, Mukhopadhyay A (2007). “Multiobjective Genetic Clusteringfor Pixel Classification in Remote Sensing Imagery.” IEEE Transactions on Geoscience andRemote Sensing, 45(5), 1506–1511.

Ben-Hur A, Isabelle G (2003). “Detecting Stable Clusters using Principal Component Anal-ysis.” In Methods in Molecular Biology, 224, 159–182.

Bezdek JC (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. PlenumPress, New York.

D’andrade RG (1978). “U-Statistic Hierarchical Clustering.” Psychometrika, 43(1), 59–67.

Davies DL, Bouldin D (1979). “A Cluster Separation Measure.” IEEE Transactions on PatternAnalysis and Machine Intelligence, 1(2), 224–227.

Page 20: EXCLUVIS: A MATLAB GUI Software for Comparative … JournalofStatisticalSoftware MMMMMM YYYY, Volume VV, Issue II. EXCLUVIS: A MATLAB GUI Software for Comparative Study of Clustering

20 EXCLUVIS: A MATLAB GUI Software

Deb K, Pratap A, Agrawal S, Meyarivan T (2002). “A fast and elitist multiobjective geneticalgorithm: NSGA-II.” IEEE Transactions on Evolutionary Computation, 6(2), 182–197.

Dunn JC (1973). “A Fuzzy Relative of the ISODATA Process and Its Use in DetectingCompact Well-Separated Clusters.” The Journal of Cybernetics, 3(3), 32–57.

Dunn JC (1974). “Well Separated Clusters and Optimal Fuzzy Partitions.” The Journal ofCybernetics, 4(1), 95–104.

Jain AK, Murty MN, Flynn PJ (1999). “Data clustering: A review.” ACM Computing Surveys,31(3).

Johnson SC (1967). “Hierarchical Clustering Schemes.” Psychometrika, 32(3), 241–254.

MacQueen JB (1967). “Some Methods for Classification and Analysis of MultiVariate Ob-servations.” In LML Cam, J Neyman (eds.), Proc. of the fifth Berkeley Symposium onMathematical Statistics and Probability, volume 1, pp. 281–297. University of CaliforniaPress.

Maulik U, Bandyopadhyay S (2002). “Performance Evaluation of Some Clustering Algorithmsand Validity Indices.” IEEE Transactions on Pattern Analysis and Machine Intelligence,24(12), 1650–1654.

Maulik U, Bandyopadhyay S, Mukhopadhyay A (2011). Multiobjective Genetic Algorithms forClustering: Applications in Data Mining and Bioinformatics. Springer-Verlag, Heidelberg.

Maulik U, Mukhopadhyay A, Bandyopadhyay S (2009). “Combining Pareto-Optimal Clus-ters using Supervised Learning for Identifying Co-expressed Genes.” BMC Bioinformatics,10(27).

Mukhopadhyay A, Maulik U (2009). “Unsupervised Pixel Classification in Satellite Imageryusing Multiobjective Fuzzy Clustering Combined with SVM Classifier.” IEEE Transactionson Geoscience and Remote Sensing, 47(4), 1132–1138.

Mukhopadhyay A, Maulik U, Bandyopadhyay S (2015). “A Survey of Multiobjective Evolu-tionary Clustering.” ACM Computing Surveys, 47(4), 61:1–61:46.

Quackenbush J (2001). “Computational analysis of microarray data.” Nature Reviews. Ge-netics, 2(6), 418–427.

Rousseeuw P (1987). “Silhouettes: A Graphical Aid to the Interpretation and Validation ofCluster Analysis.” The Journal of Computational and Applied Mathematics, 20(1), 53–65.

Shannon W, Culverhouse R, Duncan J (2003). “Analyzing Microarray Data using ClusterAnalysis.” Pharmacogenomics, 4(1), 41–51.

Xie XL, Beni G (1991). “A Validity Measure for Fuzzy Clustering.” IEEE Transactions onPattern Analysis and Machine Intelligence, 13(8), 841–847.

Yeung KY, Ruzzo WL (2001). “An Empirical Study on Principal Component Analysis forClustering Gene Expression Data.” Bioinformatics, 17(9), 763–774.

Page 21: EXCLUVIS: A MATLAB GUI Software for Comparative … JournalofStatisticalSoftware MMMMMM YYYY, Volume VV, Issue II. EXCLUVIS: A MATLAB GUI Software for Comparative Study of Clustering

Journal of Statistical Software 21

Affiliation:

Anirban Mukhopadhyay
Department of Computer Science and Engineering
University of Kalyani
Kalyani - 741235, West Bengal, India
E-mail: [email protected]
URL: www.anirbanm.in

Sudip Poddar
Advanced Computing and Microelectronics Unit
Indian Statistical Institute
Kolkata - 700108, West Bengal, India
E-mail: [email protected]

Journal of Statistical Software http://www.jstatsoft.org/

published by the American Statistical Association http://www.amstat.org/

Volume VV, Issue II Submitted: yyyy-mm-dd
MMMMMM YYYY Accepted: yyyy-mm-dd