[IEEE 2013 Third International Conference on Recent Trends in Information Technology (ICRTIT) - Chennai, India (2013.7.25-2013.7.27)] 2013 International Conference on Recent Trends

2013 International Conference on Recent Trends in Information Technology (ICRTIT)

ISBN:978-1-4799-1024-3/13/$31.00 ©2013 IEEE

Clustering of Lung Cancer Data Using Foggy K-Means

Akhilesh Kumar Yadav#1, Divya Tomar#2, Sonali Agarwal#3

#Indian Institute of Information Technology Allahabad, India


Abstract-In the medical field, a huge amount of data is available, which creates the need for a powerful data analysis tool to extract useful information. Several studies have been carried out in the data mining field to improve the capability of data analysis on huge datasets. Cancer is one of the most fatal diseases in the world. Lung cancer, with its high rate of occurrence, is one of the most serious problems and one of the biggest killer diseases in India. Predicting the occurrence of lung cancer is very difficult because it depends on multiple attributes which cannot be analyzed easily. In this paper a real-time lung cancer dataset is taken from SGPGI (Sanjay Gandhi Post Graduate Institute of Medical Sciences), Lucknow. A real-time dataset always comes with obvious challenges such as missing values, high dimensionality, noise, and outliers, which make it unsuitable for efficient classification. A clustering approach is an alternative solution for analyzing the data in an unsupervised manner. The main focus of this research work is to develop a novel approach, called Foggy K-means clustering, that creates accurate clusters of the desired real-time datasets. The results of the experiment indicate that the Foggy K-means clustering algorithm gives better results on real datasets than the simple K-means clustering algorithm and provides a better solution to the real-world problem.

Keywords: Clustering, Foggy k-means clustering, Lung Cancer.

I. INTRODUCTION

Data mining in health care is an emerging application for finding useful knowledge and interesting patterns related to various diseases. An efficient data mining method could be adopted as a diagnostic tool for effective decision making. According to the National Cancer Institute, cancer is the cause of most deaths across India [1]. Lung cancer ranks second among all types of cancer in deadliness [2]. The survival rate five years after diagnosis is only 15% [3]. In India, cancer is growing at a rate of 11 percent annually; 2.5 million people are affected and there are more than 4 lakh deaths in a year. 20% of men in India die between the ages of 30 and 69 due to tobacco-related cancers, and lung cancer is one of them.

In data mining, clustering of datasets is a major problem with a wide range of applications. There are two types of clustering algorithm: (1) descriptive and (2) predictive. Descriptive mining forms clusters of the available data. In clustering, a finite set of points in multidimensional space is partitioned into labeled classes (clusters). Similar points belong to the same class and dissimilar points belong to different classes.

This research work focuses on a lung cancer dataset provided by the SGPGI statistics department, which includes the age, sex, family history, tumor size, etc. of each patient. There are two types of attribute: (1) demographic attributes (gender, age, location) and (2) diagnosis attributes (smoking, BMI, tumor size). The dataset contains records of 177 patients. We apply the Foggy K-means clustering approach to the lung cancer data and divide the dataset into two clusters: one cluster of patients who are suffering from cancer and another of those who do not have cancer. Previous clustering work focuses only on the number of clusters and their centers, using different techniques, and does not handle outliers very well. The main goal of this research work is therefore to propose a new model which handles the problem of outliers.

The paper is organized in the following way: Section II highlights the related work done in this field, Section III gives an overview of clustering and describes the proposed methodology, the Foggy K-means algorithm. Section IV deals with the results of the experiment and Section V discusses the conclusion and future work of this research.

II. RELATED WORK

A lot of work has been done in the medical field using basic clustering algorithms. In recent years, many versions of the K-means algorithm have appeared. Bradley and Fayyad randomly break the data into 10 subsets, apply K-means clustering to each of them, and use the resulting k centers as the initial centers for the whole dataset [4]. Khan and Ahmad proposed a method that initializes the cluster centers based on two observations about similar patterns falling in the same cluster, instead of choosing the initial cluster centers directly [5]. Redmond and Heneghan estimate the density of the data at various locations using a kd-tree and sequentially select the initial centers on the basis of density and distance information [6]. An improved version of the GMK algorithm (MGMK) was proposed by Bagirov in 2008. The author introduced the idea of minimizing an auxiliary cluster function. It requires more computational time than GMK. Park and Jun present the idea of K-medoids clustering [7].



A popular heuristic for K-means clustering is Lloyd's algorithm, which makes the basics of the K-means clustering algorithm clear [8]. The research paper [9] suggests how to handle missing values in a lung cancer dataset using BN and SVM. K-means clustering has one drawback: it requires the number of clusters to be fixed in advance. To address this, the incremental global K-means algorithm was introduced, which adds one cluster center at a time [10] [11]. A lot of work has been done on gene data related to lung cancer. For example, Benny Y. M. Fung gave a model with an impact factor to increase the performance of a classifier that works on cancer gene data [12]. Ankit Agrawal in 2011 gave a new greedy approach with association rules to identify hotspots [13]. Escudero et al. used K-means clustering to classify Alzheimer's disease (AD) data features into pathologic and non-pathologic groups. They used the concept of Bioprofile and K-means clustering for early detection of AD [14]. The research work [15] detects the recurrence of breast cancer with the help of a clustering approach. The main focus of that paper is to evaluate the performance of different clustering approaches for breast cancer recurrence. Shi et al. proposed an enhancement of K-means clustering. They analyze the limitations of K-means clustering and propose an improved K-means which has better accuracy and better speed and also reduces the complexity of K-means clustering [16]. This data mining technique first maximizes the branching factor and then minimizes the segment size. In addition, the paper presents a method that does not affect performance and reduces the impact of outliers and the computational load.

III. K-MEANS CLUSTERING AND FOGGY K-MEANS CLUSTERING

K-means clustering:

K-means (MacQueen, 1967) is the basic clustering method [17]. The first step of K-means clustering is to define the number of clusters and their centroids. The centroids must be placed carefully because different placements cause different results. The next step is to associate each data point with its nearest centroid. When all the points have been assigned, the algorithm calculates k new centroids and again assigns the points to their new nearest centroid. As a result the centroids change their location. These steps are repeated until the positions of the centroids become stationary. The main aim is to minimize the objective function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ = distance between the cluster center $c_j$ and the data point $x_i^{(j)}$, and J = distance of the 'n' data points from their own centroids.

The steps of K-means clustering are given below:

1) First take k (the number of clusters) and plot the k centroids on the plane.

2) Assign each point to its nearest centroid.

3) When the assignment is over, measure the new positions of the centroids.

Steps 2 and 3 are repeated until the centroids stop moving. This produces the separation of the points into clusters.

An example: There are 'n' sample vectors $x_1, x_2, \ldots, x_n$, all belonging to the same class initially. Divide them into the desired number of clusters, say k, with k < n always. Let $m_i$ be the mean of the vectors in cluster i. If the clusters are well separated, a minimum-distance classifier can be used to separate them: 'x' is in cluster i if $\| x - m_i \|$ is the minimum of all the k distances. The steps are given below:

• $m_1, m_2, \ldots, m_k$ are the initial guesses for the means
• Until the means become stable (stop changing)
    o Classify the samples into clusters using the estimated means.
    o For i = 1 to k
        Replace $m_i$ with the mean of all the samples of the ith cluster
    o End for loop
• End until loop

How the means $m_1$ and $m_2$ move from one place to another and become centroids is shown in figure 1.

Fig.1 K-means with two Clusters
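The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function names `kmeans` and `objective`, the random choice of initial means, and the rule of keeping the old mean when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)              # initial guesses m_1..m_k (assumed: random samples)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Classify every sample by its nearest mean (minimum-distance rule).
        labels = [min(range(k), key=lambda j: math.dist(p, means[j])) for p in points]
        new_means = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Replace m_j with the mean of the samples assigned to cluster j.
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_means.append(means[j])     # assumed: an empty cluster keeps its old mean
        if new_means == means:                 # means are stable, so stop
            break
        means = new_means
    return means, labels

def objective(points, means, labels):
    """Sum of squared distances of the points from their own centroids (the J of equation (1))."""
    return sum(math.dist(p, means[lab]) ** 2 for p, lab in zip(points, labels))
```

Calling `kmeans(points, 2)` on two well-separated groups returns the two group means and one label per point; `objective` then gives the value of J for that partition.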

The K-means clustering algorithm has several drawbacks, which are given below:

• There are no specific criteria for selecting k and initializing the means.

• The result depends on the initial values, so it is difficult to get the correct result.

• Sometimes 'mi' does not have any points, so it cannot be updated.

• Sometimes the result is affected by the normalization of a variable by its standard deviation.

• The value of 'k' changes the result.

The above-mentioned drawbacks are illustrated in figure 2. If the value of k is wrong, the clustering changes: the algorithm selects the group of outliers and makes an extra cluster of them. In the second case, if there is a group of outliers, the old method can pick a centroid among them. The number of iterations also increases with the previous method.



The biggest question with the K-means algorithm is how to find the optimal number of clusters for the given dataset.

Fig .2 K-means with two correct clusters and one fake cluster

Foggy K-means Clustering:

In medical data mining, it is beneficial to obtain domain knowledge from subject experts. For Foggy K-means clustering, the lung cancer dataset was discussed with domain experts, and certain attributes with a prominent impact factor were identified. The number of clusters is decided on the basis of the values of these attributes. For example, in the lung cancer dataset, if the tumor size is greater than 3 then there is a possibility that the patient is suffering from lung cancer. The data points are plotted according to this attribute, two clusters are formed, and two points are taken as centroids. The effect of each new attribute on the clusters is as follows: suppose the next attribute is smoking; then the cluster moves either left or right by a particular distance according to the impact of that attribute on the cluster.

Fig. 3 Cluster on the basis of Tumor size

In figure 3, the data is plotted according to tumor size and the two cluster centroids are found. The line shows the minimum distance between them and is divided into 7 parts according to the number of attributes. The changing positions of the clusters after applying a new attribute are shown in figure 4.

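$$\text{Distance} = dis(\text{Centroid}_1, \text{Centroid}_2) \qquad (2)$$

$$\text{Temp} = \frac{\text{Distance}}{n} \qquad (3)$$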

Fig.4 Changing positions of cluster on applying new attribute

Algorithm: Foggy k-means (attributes(n), priority(no. of clusters))
// priority reflects the expert knowledge for all of the n attributes
{
    Sort(attribute.Priority);                // priority in descending order
    Plotinplane(attribute priority 1, value);
    For (i = 1 to k)
    {
        Centroid[i] = findmean(Ci);          // Ci = ith cluster
    }
    For (j = 1 to k(k-1)/2)
    {
        Dis[j] = dis(Centroid[1 to k]);      // find the distance between each centroid
        Temp[j] = Dis[j]/n;
    }
    Until (--n != 0)
    {
        Apply n to plotted points;
        Shift + or - Temp towards other Centroid;
        For (i = 1 to k)
        {
            Centroid[i] = findmean(Ci);      // Ci = ith cluster
        }
        For (j = 1 to k(k-1)/2)
        {
            Dis[j] = dis(Centroid[1 to k]);  // find the distance between each centroid
            Temp[j] = Dis[j]/n;
        }
    } // end of until
} // end of foggy
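The pseudocode leaves several details open, in particular how the direction of a shift is chosen for an individual point. The Python sketch below is therefore only one possible reading for the two-cluster case. It makes the following assumptions, none of which are stated in the source: every attribute is coded 0/1, points live on a single axis defined by the highest-priority attribute, a point is shifted by Temp towards the other centroid only when its value for the new attribute disagrees with its initial cluster, and the final cluster is the one with the nearest centroid. The function name `foggy_two_clusters` is hypothetical.

```python
def foggy_two_clusters(rows, priority):
    """One possible reading of the Foggy k-means pseudocode for k = 2 (illustrative only).

    rows     : list of lists of 0/1 attribute values (assumed coding)
    priority : attribute indices, most important first (expert knowledge)
    """
    top = priority[0]
    labels = [int(r[top]) for r in rows]       # two initial clusters from the top attribute
    pos = [float(r[top]) for r in rows]        # position of each point on the plotted axis
    n = len(priority)

    def centroids():
        # Mean position of each initial cluster; an empty cluster is placed at 0 (assumption).
        return [sum(p for p, lab in zip(pos, labels) if lab == c) /
                max(1, sum(1 for lab in labels if lab == c)) for c in (0, 1)]

    for attr in priority[1:]:
        c0, c1 = centroids()
        temp = abs(c1 - c0) / n                # Temp = Dis / n
        n -= 1                                 # corresponds to "--n" in the pseudocode
        for i, r in enumerate(rows):
            if int(r[attr]) != labels[i]:      # assumed rule: disagreement triggers a shift
                target = c1 if labels[i] == 0 else c0
                pos[i] += temp if target > pos[i] else -temp

    c0, c1 = centroids()
    final = [0 if abs(p - c0) <= abs(p - c1) else 1 for p in pos]
    return final, pos
```

Under these assumptions a patient whose top attribute suggests cancer but whose remaining attributes all disagree drifts step by step towards the other centroid, while the number of clusters stays fixed at two throughout.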

IV. EXPERIMENT AND RESULT

A lung cancer dataset provided by SGPGI Lucknow is considered for the study. The original data were highly dimensional, but only 9 attributes were finally considered on the advice of medical experts. They are shown in Table I.

TABLE I. DATASET DETAILS

S.No.  Attribute Name             Attribute Description
1      Age                        Patient's age
2      Sex                        Either male or female
3      BMI                        Body Mass Index
4      Family History             Previously anyone affected in the family
5      Tuberculosis               Affected by tuberculosis
6      Smoking                    Smoker or not
7      Lymph Node Involvement     L.N. involved or not
8      Tumor Size                 <3 or >3
9      Radiation/Radon/Asbestos   Affected by these things or not

In the preprocessing step, the values of the data are changed into 0 or 1 (male 0, female 1). Missing values and duplicates are also removed from the database. The preprocessed data are shown in figure 5.

Fig.5 Preprocessed Data
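The preprocessing described above can be sketched as follows. Only the male 0 / female 1 coding comes from the text; the field names (`sex`, `smoking`, `tumor_size`), the yes/no coding and the coding of tumor size as 1 when greater than 3 are assumptions made for illustration, and only three of the nine attributes are shown.

```python
def preprocess(records):
    """Drop incomplete and duplicate records, then code the remaining values as 0/1."""
    sex_code = {"male": 0, "female": 1}        # coding stated in the text
    yes_no = {"no": 0, "yes": 1}               # assumed coding for yes/no attributes
    seen = set()
    rows = []
    for rec in records:
        # Remove records with missing values.
        if any(value is None or value == "" for value in rec.values()):
            continue
        # Remove exact duplicates of a raw record already seen.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        rows.append([
            sex_code[rec["sex"].lower()],
            yes_no[rec["smoking"].lower()],
            1 if rec["tumor_size"] > 3 else 0,  # assumed: 1 means tumor size > 3
        ])
    return rows
```

Duplicates are detected on the raw records rather than on the coded rows, so that two different patients who happen to share the same 0/1 pattern are both kept.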

There are 177 patients in this dataset, and Foggy K-means clustering gives three files as output. Figure 6 shows the working model of this research work.

Fig. 6 Expert model for clustering

Normal K-means can select the outlier points as a cluster. Through the Foggy K-means algorithm, this drawback is removed.

TABLE II. COMPARISON OF K-MEANS AND FOGGY K-MEANS CLUSTERING

From Table II it is clear that when the number of seeds increases, K-means groups the outliers together, and a user may take this group for a cluster at the time of observation. This third cluster, which is actually a collection of outliers, may lead to an incorrect interpretation of the result, whereas Foggy K-means in a similar situation always generates two clusters as indicated by the expert. There is no possibility of a fake cluster of the kind identified in the previous phase.

There are several validation techniques to validate the clusters [18]. Here three of them are used:

Connectivity: Let $nn_{i(j)}$ be the jth nearest neighbor of observation i, and let $x_{i,nn_{i(j)}} = 0$ if $nn_{i(j)}$ and 'i' belong to the same cluster and $1/j$ otherwise. Suppose the clustering gives a partition $P = \{C_1, C_2, C_3, \ldots, C_K\}$ of the N observations into K separate clusters; then the connectivity is:

$$Conn(P) = \sum_{i=1}^{N} \sum_{j=1}^{L} x_{i,nn_{i(j)}} \qquad (4)$$

where L is the number of neighbors that contribute to the connectivity measure. Its value lies between 0 and ∞ and should be minimized.

Silhouette width: The Silhouette width measures the degree of confidence in the cluster assignment of a particular observation. Well-clustered observations have values near 1 and poorly clustered observations have values near -1, so the value should be maximized. For observation 'i', it is:

$$S(i) = \frac{b_i - a_i}{\max(b_i, a_i)} \qquad (5)$$

where $a_i$ = the average intra-cluster distance between i and the other points within its cluster, and $b_i$ = the average inter-cluster distance between i and the points of the closest neighboring cluster.

Dunn index: It considers the smallest inter-cluster distance and the largest intra-cluster distance as per the formula:

$$D(P) = \frac{\min_{C_k, C_l \in P,\; C_k \neq C_l} \left( \min_{i \in C_k,\; j \in C_l} dist(i, j) \right)}{\max_{C_m \in P} diam(C_m)} \qquad (6)$$

where $diam(C_m)$ is the maximum distance between all points in cluster $C_m$. Its value lies between 0 and ∞ and should be maximized.
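The three measures can be computed directly from their definitions. The sketch below is a plain-Python illustration under the assumption of Euclidean distance; the function names and the default of L = 2 neighbors are choices made for the example, not values taken from the experiments.

```python
import math

def connectivity(points, labels, L=2):
    """Conn(P): add 1/j whenever the j-th nearest neighbor of a point lies in another cluster."""
    total = 0.0
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))   # nearest neighbor first
        for rank, j in enumerate(others[:L], start=1):
            if labels[j] != labels[i]:
                total += 1.0 / rank
    return total

def silhouette_width(points, labels, i):
    """S(i) = (b_i - a_i) / max(b_i, a_i) for a single observation i."""
    def mean_dist(cluster):
        d = [math.dist(points[i], points[j])
             for j in range(len(points)) if labels[j] == cluster and j != i]
        return sum(d) / len(d) if d else 0.0
    a = mean_dist(labels[i])                                  # average intra-cluster distance
    b = min(mean_dist(c) for c in set(labels) if c != labels[i])  # closest other cluster
    return (b - a) / max(a, b)

def dunn_index(points, labels):
    """D(P): smallest inter-cluster distance divided by the largest cluster diameter.

    Needs at least two clusters and one cluster with two or more points.
    """
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    between = [math.dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j]]
    within = [math.dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    return min(between) / max(within)
```

Low connectivity together with a high Silhouette width and a high Dunn index indicates compact, well-separated clusters.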

Validation measures, namely the Dunn index, Connectivity and Silhouette width, have been calculated for K-means and Foggy K-means and are shown in Table III. For better cluster quality, low connectivity, a high Dunn index and a Silhouette value close to 1 are preferred, and this has been achieved with Foggy K-means. Figure 7 shows the comparison of the validity parameters for K-means and Foggy K-means clustering. The Y-axis of the figure indicates the value of the validity parameters and the X-axis indicates the number of clusters for K-means and Foggy K-means. We check the value of each validity parameter for K-means and Foggy K-means clustering by varying the number of clusters. We also divide the connectivity values by 10 so that they lie within the 0-2 range. Figure 7 clearly indicates that Foggy K-means clustering performs better than K-means clustering. The Dunn index of Foggy K-means clustering for 2 clusters is 0.1616, which is greater than the Dunn index of K-means (0.0464) for the same number of clusters. Foggy K-means also shows better results for all the other validity parameters.

TABLE III. VALIDATION MEASURES FOR K-MEANS AND FOGGY K-MEANS

Validity Parameters   2 cluster                   3 cluster
                      K-means   Foggy K-means     K-means   Foggy K-means
Dunn Index            0.0464    0.1616            0.0659    0.1027
Silhouette            0.4571    0.5136            0.4182    0.4531
Connectivity          13.528    9.842             17.893    12.701
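As a small worked check of the comparison, the Table III values can be entered as data and tested against the stated preference for each measure (higher Dunn index and Silhouette, lower Connectivity). The dictionary layout and the names `table3` and `foggy_is_better` are illustrative; the numbers are those of Table III, and connectivity is divided by 10 as is done for the figure.

```python
# (K-means, Foggy K-means) pairs copied from Table III.
table3 = {
    "Dunn Index":   {"2 cluster": (0.0464, 0.1616), "3 cluster": (0.0659, 0.1027)},
    "Silhouette":   {"2 cluster": (0.4571, 0.5136), "3 cluster": (0.4182, 0.4531)},
    "Connectivity": {"2 cluster": (13.528, 9.842),  "3 cluster": (17.893, 12.701)},
}
higher_is_better = {"Dunn Index": True, "Silhouette": True, "Connectivity": False}

def foggy_is_better(table):
    """Return, per measure and cluster count, whether Foggy K-means has the preferred value."""
    result = {}
    for measure, columns in table.items():
        for setting, (kmeans_value, foggy_value) in columns.items():
            if higher_is_better[measure]:
                result[(measure, setting)] = foggy_value > kmeans_value
            else:
                result[(measure, setting)] = foggy_value < kmeans_value
    return result

# Connectivity divided by 10 so that it fits the 0-2 range of the plot.
scaled_connectivity = {
    setting: tuple(value / 10 for value in pair)
    for setting, pair in table3["Connectivity"].items()
}
```

Every entry of `foggy_is_better(table3)` is true for these values, which matches the reading of the table given in the text.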

Fig. 7 Comparison of Validity Parameters between K-means and Foggy K-means Clustering

In K-means clustering:

For a general dimension d, the problem is NP-hard even for two clusters [19] [20]. For a general number of clusters k, K-means clustering is NP-hard even in the plane [21].

In the Foggy K-means algorithm:

For a general dimension d, it is NP-hard even for two clusters, but for a general number of clusters k in the plane it has O(k(k-1)) time complexity.
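The k(k-1) term presumably reflects the loop in the Foggy k-means pseudocode that runs over j = 1 to k(k-1)/2, that is, once for every unordered pair of centroids. A minimal sketch of that pairwise step, with the hypothetical helper name `centroid_distances` and Euclidean distance assumed:

```python
import math
from itertools import combinations

def centroid_distances(centroids):
    """Distance between every unordered pair of centroids: k(k-1)/2 entries in total."""
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(range(len(centroids)), 2)
    }
```

For k = 2, the case used in this study, there is a single pair and therefore a single distance to divide by the number of attributes.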

V. CONCLUSION

Lung cancer is one of the most critical diseases in the world. A large amount of data is available on lung cancer patients, so there is a need to construct a model that effectively utilizes the available information and finds the factors which cause lung cancer. In this paper a novel clustering algorithm has been presented with powerful diagnostic features for the prevention of lung cancer. The proposed Foggy K-means approach has been compared with the traditional K-means approach, and it is found that the cluster validity parameters are more satisfactory for Foggy K-means. The more accurate clusters it produces could be utilized by domain experts for their strategic planning. In a similar fashion we can correlate the ill effects of smoking, tuberculosis and radiation produced by different industries or radioactive substances with their consequences in terms of various diseases. In future we will utilize the results of Foggy K-means clustering for the classification of lung cancer patients and extract the factors that have a great impact on lung cancer.

ACKNOWLEDGEMENT

The authors are highly thankful to Dr. M. D. Tiwari, Honorable Director of IIIT Allahabad, for providing us with excellent infrastructure and environment to complete this research work, and are also thankful to Sanjay Gandhi Post Graduate Institute of Medical Sciences for providing the relevant data required for the study. Sincere thanks to Prof. G. N. Pandey for his expert opinion.

REFERENCES

[1] “Introduction to lung cancer,” National Cancer Institute, SEER training modules, URL: http://training.seer.cancer.gov/lung/intro/, accessed: Aug 2, 2011.

[2] “Lung cancer statistics,” Centers for Disease Control and Prevention, URL: http://www.cdc.gov/cancer/lung/statistics/, accessed: Aug 2, 2011.

[3] L. A. G. Ries and M. P. Eisner, “Cancer of the lung”, National Cancer Institute, SEER Program, 2007, ch. 9.

[4] P. S. Bradley and U. M. Fayyad, “Refining initial points for k-means clustering,” Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998, pp. 91–99.

[5] S. S. Khan and A. Ahmad, “Cluster center initialization algorithm for K-means clustering,” Pattern Recognition Letters, vol. 25, pp. 1293–1302, 2004.

[6] S. J. Redmond and C. Heneghan, “A method for initializing the K-means clustering algorithm using kd-trees,” Pattern Recognition Letters, vol. 28, pp. 965–973, 2007.

[7] H. S. Park and C. H. Jun, “A simple and fast algorithm for K-medoids clustering,” Expert Systems with Applications, vol. 36, pp. 3336–3341, 2009.

[8] E.W. Forgy, "Cluster analysis of multivariate data: efficiency versus interpretability of classifications". Biometrics 21: 768–769,1965.

[9] A. Dekker, A. Hope, K. Komati, G. Fung and Shipeng Yu. “Survival Prediction in Lung Cancer Treated with Radiotherapy Bayesian Networks vs. Support Vector Machines in Handling Missing Data”. 978-0-7695-3926-3/09 , IEEE, 2009 .

[10] A. Likas, N. Vlassis, and J. Verbeek, “The global k-means clustering algorithm,” Pattern Recognition, vol. 36, pp. 451–461, 2003.

[11] A. M. Bagirov, “Modified global k-means algorithm for minimum sum-of-squares clustering problems,” Pattern Recognition, vol. 41, pp. 3192–3199, 2008.

[12] Benny Y. M. Fung, Vincent T. Y. Ng, “Improving Classification Performance for Heterogeneous Cancer Gene Expression Data,” Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’04), 0-7695-2108-8/04, IEEE, 2004.

[13] A. Agrawal and A. Choudhary, “Identifying HotSpots in Lung Cancer Data Using Association Rule Mining,” 978-0-7695-4409-0/11, IEEE, DOI 10.1109/ICDMW, 2011.

[14] T. Balasubramanian, R. Umarani, “An Analysis on the Impact of Fluoride in Human Health (Dental) using Clustering Data mining Technique”, Proceedings of the International Conference on Pattern Recognition, Informatics and Medical Engineering, March 21-23, 2012.

[15] S. Belciug, F. Gorunescu, A.B. Salem and M. Gorunescu, “Clustering-based approach for detecting breast cancer recurrence”, 10th International Conference on Intelligent Systems Design and Applications, 2010.

[16] S. Na, L. Xumin and G. Yong, “Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm”, Third International Symposium on Intelligent Information Technology and Security Informatics, 2010.

[17] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.

[18] J. Handl, J. Knowles, D.B. Kell, "Computational Cluster Validation in Post-Genomic Data Analysis", Bioinformatics, Volume 21, Issue 15, August 2005, Pages: 3201-3212. doi:10.1093/bioinformatics/bti517.

[19] D. Aloise, A. Deshpande, P. Hansen, P. Popat, "NP-hardness of Euclidean sum-of-squares clustering". Machine Learning 75:245–249. doi:10.1007/s10994-009-5103-0.

[20] S. Dasgupta and Y. Freund, "Random Projection Trees for Vector Quantization". IEEE Transactions on Information Theory 55:3229–3242. arXiv:0805.1390. doi:10.1109/TIT.2009.2021326, 2009.

[21] M. Mahajan, P. Nimbhorkar, K. Varadarajan, "The Planar k-Means Problem is NP-Hard". Lecture Notes in Computer Science 5431: 274–285. doi:10.1007/978-3-642-00202-1_24, 2009.
