Whale Optimization Algorithm based Feature
Selection with Improved Relevance Vector
Machine Classifier for Gastric Cancer
Classification
1L. Thara and 2R. Gunasundari
1Department of Computer Science,
Karpagam Academy of Higher Education.
PSG College of Arts & Science,
Tamil Nadu, India.
[email protected]
2Department of Information Technology,
Karpagam Academy of Higher Education,
Tamil Nadu, India.
Abstract Data mining now finds application in almost every domain.
Medical and clinical applications are among those that have a direct
positive impact on human beings. The prevalence of cancer is a major
threat to individuals, and changing food habits contribute to gastric
cancer. Hence there is vast scope for research on gastric cancer
prediction from the available data. This research work aims at the
design and development of a whale optimization algorithm (WOA) for
feature selection from the dataset. Two binary variants of the WOA are
used to obtain feature subsets. The obtained subset is then given as the
input to the Improved Relevance Vector Machine (IRVM) classifier. The
performance of the proposed work is evaluated using metrics such as
accuracy, true positive rate, true negative rate, precision, false positive
rate and false negative rate. Implementations are carried out using
MATLAB, and the results show that the proposed classification strategy
based on WOA feature selection outperforms a WOA-based SVM classifier
in accuracy, hit rate and elapsed time.
International Journal of Pure and Applied Mathematics, Volume 119 No. 10 2018, 337-348. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.
Key Words: Data mining, gastric cancer data classification, whale
optimization algorithm, feature selection, subset selection.
1. Introduction
Data mining is gaining scope in healthcare analytics and healthcare
management, and it also extends to disease prediction. Medical data on the
symptoms of patients with a variety of diseases, and on ways to help
diagnose those diseases, have recently become very extensive. Analyzing and
accounting for all the factors involved is typically not easy for one
person. As a result, there is a clear need for an automated system that
helps to discover rules and patterns and to predict future events. Data
mining, as the enabler of such an automated system, facilitates many
medical advances, especially in diagnosing a variety of diseases and in
finding useful relations among the features available in the data. One of
the leading causes of gastric cancer is the Helicobacter pylori (H. pylori)
bacterium. H. pylori is a spiral-shaped bacterium that lives in the mucous
lining of the stomach. H. pylori has coexisted with humans for many
centuries, and infection with this bacterium is very common [1]. The
Centers for Disease Control and Prevention (CDC) affirms that around two
thirds of the world population is infected with this bacterium and that the
probability of infection in newly developed countries, including Latvia, is
higher than in developed countries. In this research, data mining methods
and algorithms are used to analyse the impact of harmful factors on the
development of H. pylori, which raises the risk of developing gastric
cancer, and an optimization technique is employed in order to obtain
better results.
In our previous work [1], a genetic fuzzy ARTMAP classifier (GFAM) model
for gastric cancer data classification was proposed. In a later work [2],
an improved rule based classifier based on decision trees (IRBC-DT) for
gastric cancer data classification was proposed. A comparative study of
data mining approaches for the diagnosis and prevention of Helicobacter
pylori infection and associated stomach diseases was also carried out [3].
The present research work uses WOA to perform feature selection and
employs IRVM to perform the classification.
2. Related Works
Kirshners et al. [5] used a dataset containing 819 samples (24 positive
and 795 negative samples) with 31 features, and three algorithms (CN2 Rules,
C4.5 and Naive Bayes) to diagnose stomach cancer. Their results indicated that
sex and the protein HER-1 are important factors in the diagnosis and
classification of gastric cancer. Experimental results showed an average
sensitivity above 50 % and 86–100 % at most, with classification accuracy and
specificity close to 65–70 %. Silvera et al. [6] employed classification tree
analysis to examine data from a population-based case–control study (1095
cases, 687 controls) carried out in Connecticut, New Jersey, and Western
Washington State. The frequency of reported gastroesophageal reflux disease
symptoms was the most important risk stratification factor for esophageal
adenocarcinoma, gastric cardia, and noncardia gastric cancer, with dietary
factors (esophageal adenocarcinoma, noncardia gastric), smoking (esophageal
adenocarcinoma, gastric cardia), wine intake (gastric cardia, noncardia gastric),
age (noncardia gastric), and income (noncardia gastric) appearing to modify the
risk at these cancer sites. Wang et al. [7] applied hierarchical clustering
to 14 available clinical factors from three categories, i.e., the
clinical background, immunohistochemistry data, and the cancer's stage
information. Their results showed that two clinical factors, Her-1 and
gender, can clearly characterize and differentiate the three resulting groups.
Both classification and clustering methods of this kind derive patterns from
the dataset. The classification algorithms used to form an ensemble result for
the diagnosis of stomach cancer are CART (Classification and Regression Tree)
and TSVM (Transductive Support Vector Machine).
Chun et al. [8] explored how data mining and knowledge discovery can be
applied to medical informatics using human gene information. The authors
applied case-based reasoning to a cancer detection problem using human gene
information, noting that case-based reasoning has been applied in
medicine relatively less often than other data mining techniques. They
proposed a modified case-based reasoning method that is appropriate for
associated categorical variables, for use in detecting gastric cancer.
Monges et al. [9] used the real-world data study HERABLE in
gastric (GC) and gastroesophageal junction adenocarcinoma (GEJC) to
assess the quality of HER2 testing, which is part of the routine assessment for
deciding on targeted treatment initiation.
In addition, exploratory analysis was performed to identify variables
influencing discordance between the local and central evaluation of HER2
overexpression. Their work gave better results than the earlier methods they
chose for comparison.
3. Background of Whale Optimization Algorithm
WOA is gaining interest in many application domains and belongs to the
class of stochastic population-based algorithms [45]. WOA imitates the
bubble-net feeding behavior observed in the foraging of whales [4].
The whale closes in on the victim (its prey) by trapping it in a net of
bubbles.
The bubble net is created while the whale swims along a '9'-shaped path. WOA
involves two stages, namely the exploitation stage (encircling a victim and the
spiral bubble-net attacking method) and the exploration stage (searching
randomly for a victim).
3.1. Exploitation Stage (Encircling victim/Bubble-Net Offending Manner)
To update a solution, Eqs. (1) and (2) are used to move a whale around a
victim.
$D = |C \cdot X^{*}(t) - X(t)|$ … (1)
$X(t+1) = X^{*}(t) - A \cdot D$ … (2)
In the above equations, $t$ represents the current iteration, $X^{*}$ represents
the best solution obtained so far, $X$ is the current solution, $|\cdot|$ is the
absolute value, and $\cdot$ is an element-by-element multiplication. $A$ and $C$
are coefficient vectors that are calculated as in Eqs. (3) and (4):
$A = 2a \cdot r - a$ … (3)
$C = 2 \cdot r$ … (4)
where $a$ decreases linearly from 2 to 0 over the iterations, as given in
Eq. (5), and $r$ is a random vector in [0,1]. The values of the $A$ and $C$
vectors control the region in the neighborhood of the best solution where a
new solution can be located. The whales move towards the victim both by
shrinking encirclement and along a spiral-shaped path.
$a = 2 - t \dfrac{2}{MaxIter}$ … (5)
where $t$ is the iteration number and $MaxIter$ is the maximum number of
allowed iterations. The spiral-shaped path is obtained by first calculating the
distance between the solution $X$ and the leading solution $X^{*}$. A spiral
equation is then formed between the current solution and the best
solution, as in Eq. (6).
$X(t+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(t)$ … (6)
where $D' = |X^{*}(t) - X(t)|$ is the distance between a whale $X$ and a victim,
$b$ defines the shape of the spiral, and $l$ is a random number in [−1,1].
A probability of 50% is assumed for choosing between the shrinking encircling
mechanism and the spiral-shaped path during the optimization, as in Eq. (7).
$X(t+1) = \begin{cases} \text{Shrinking encircling, Eq. (2)} & \text{if } p < 0.5 \\ \text{Spiral-shaped path, Eq. (6)} & \text{if } p \geq 0.5 \end{cases}$ … (7)
where $p$ is a random number in [0,1].
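The exploitation updates of Eqs. (1), (2), (5), (6) and (7) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB implementation: the function names are hypothetical, `A` and `C` are treated as one scalar per whale instead of full vectors, and `b = 1.0` is an assumed spiral constant.

```python
import math

def a_schedule(t, max_iter):
    """Eq. (5): a decreases linearly from 2 (t = 0) to 0 (t = max_iter)."""
    return 2.0 - t * (2.0 / max_iter)

def encircle(x, best, A, C):
    """Eqs. (1)-(2): D = |C * X* - X|, then X(t+1) = X* - A * D (element-wise)."""
    return [xb - A * abs(C * xb - xi) for xi, xb in zip(x, best)]

def spiral(x, best, l, b=1.0):
    """Eq. (6): X(t+1) = D' * exp(b*l) * cos(2*pi*l) + X*, with D' = |X* - X|."""
    scale = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [abs(xb - xi) * scale + xb for xi, xb in zip(x, best)]

def exploit(x, best, A, C, l, p, b=1.0):
    """Eq. (7): shrinking encircling if p < 0.5, otherwise the spiral path."""
    return encircle(x, best, A, C) if p < 0.5 else spiral(x, best, l, b)

x, best = [1.0, 2.0], [0.0, 0.0]
print(encircle(x, best, A=0.5, C=1.0))   # [-0.5, -1.0]
print(spiral(x, best, l=0.0))            # [1.0, 2.0]
```

With |A| < 1 the encircling step always lands closer to the best solution than the distance D, which is why this branch acts as local refinement.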
3.2. Exploration Stage (Search for Victim)
To improve exploration in WOA, instead of requiring the solutions to search
based on the position of the best solution found so far, a randomly chosen
solution is used to update the position. Consequently, a vector $A$ with
random values greater than 1 or less than −1 is used to force a solution to
move far away from the best known solution, as given in Eqs. (8) and (9).
$D = |C \cdot X_{rand} - X|$ … (8)
$X(t+1) = X_{rand} - A \cdot D$ … (9)
where $X_{rand}$ is a random whale chosen from the current population.
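A short sketch of the exploration step of Eqs. (8) and (9) follows, together with a helper that shows when exploration can occur at all. The helper relies on an assumption not stated in the text: `A` is a single scalar per whale drawn as 2·a·r − a with r uniform in [0,1], so A is uniform in [−a, a] and |A| ≥ 1 has probability 1 − 1/a when a > 1 and 0 otherwise. Function names are hypothetical.

```python
def explore(x, x_rand, A, C):
    """Eqs. (8)-(9): D = |C * X_rand - X|, then X(t+1) = X_rand - A * D."""
    return [xr - A * abs(C * xr - xi) for xi, xr in zip(x, x_rand)]

def exploration_probability(t, max_iter):
    """P(|A| >= 1) for A = 2*a*r - a, r ~ U[0, 1], with a taken from Eq. (5)."""
    a = 2.0 - t * (2.0 / max_iter)
    return 0.0 if a <= 1.0 else 1.0 - 1.0 / a

print(explore([1.0, 2.0], [3.0, 0.0], A=1.5, C=1.0))  # [0.0, -3.0]
print(exploration_probability(0, 100))                 # 0.5
```

Under this assumption exploration is only possible during the first half of the run (a > 1), and its likelihood falls from 50% at the first iteration to zero at the midpoint.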
4. Improved RVM Classifier
Let $X$ denote a set of $N$ vectors $x_i \in R^{D}$ and $c$ the corresponding
class labels $c_i \in \{1, \ldots, C\}$, which are used to train a single layer
feed forward network (SLFN) using the IRVM algorithm. The network consists of
$D$ input (equal to the dimensionality), $L$ hidden and $C$ output (equal to the
number of classes involved in the classification problem) neurons. The number of
hidden layer neurons is usually selected to be much greater than the number of
classes, i.e., $L \gg C$. The elements of the network target vectors
$t_i = [t_{i1}, \ldots, t_{iC}]^{T}$, each corresponding to a training vector
$x_i$, are set to $t_{ik} = 1$ for vectors belonging to class $k$, i.e., when
$c_i = k$, and to $t_{ik} = -1$ when $c_i \neq k$. In IRVM-based approaches, the
network input weights $W_{in} \in R^{D \times L}$ and the hidden layer bias
values $b \in R^{L}$ are randomly assigned, while the network output weights
$W_{out} \in R^{L \times C}$ are analytically calculated. Let us denote by
$q_j$, $w_k$ and $w_{kj}$ the $j$-th column of $W_{in}$, the $k$-th row of
$W_{out}$ and the $j$-th element of $w_k$, respectively.
Given an activation function $\Phi(\cdot)$ for the network hidden layer and
using a linear activation function for the network output layer, the response
$o_i = [o_{i1}, \ldots, o_{iC}]^{T}$ of the network corresponding to $x_i$ is
calculated by:
$o_{ik} = \sum_{j=1}^{L} w_{kj} \, \Phi(q_j, b_j, x_i), \quad k = 1, \ldots, C.$ … (10)
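The network hidden layer outputs $\phi_i \in R^{L}$ corresponding to all the
training vectors $x_i, \ i = 1, \ldots, N$, are stored in a matrix
$\Phi = [\phi_1, \ldots, \phi_N]$, or:
$\Phi = \begin{bmatrix} \Phi(q_1, b_1, x_1) & \cdots & \Phi(q_1, b_1, x_N) \\ \vdots & \ddots & \vdots \\ \Phi(q_L, b_L, x_1) & \cdots & \Phi(q_L, b_L, x_N) \end{bmatrix}$ … (11)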
IRVM presumes zero training error by assuming that $o_i = t_i, \ i = 1, \ldots, N$,
or in matrix notation $O = T$, where $O = [o_1, \ldots, o_N]$ and
$T = [t_1, \ldots, t_N]$ is a matrix containing the network target vectors. The
network output weights $W_{out}$ can be analytically calculated by:
$W_{out} = \Phi^{\dagger} T^{T}$ … (12)
where $\Phi^{\dagger} = (\Phi \Phi^{T})^{-1} \Phi$ is the generalized
pseudo-inverse of $\Phi^{T}$. After the calculation of the network output
weights $W_{out}$, the network response for a vector $x_l \in R^{D}$ is given by:
$o_l = W_{out}^{T} \phi_l$ … (13)
where $\phi_l$ is the network hidden layer output for $x_l$.
The calculation of the network output weights $W_{out}$ through Eq. (12) is
sometimes inaccurate, since the matrix $\Phi \Phi^{T}$ may be singular. A
regularized version of the IRVM algorithm allows small training errors and
tries to minimize the norm of the network output weights $W_{out}$.
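The training procedure described in this section (random input weights and biases, closed-form output weights) can be sketched in plain Python. Several choices below are assumptions made only for illustration and are not taken from the paper: a sigmoid hidden activation, a small ridge term `reg` added to $\Phi \Phi^{T}$ as one way of handling the singular case, 0-based class labels, and all function names.

```python
import math
import random

def hidden_outputs(x, W_in, b):
    """phi_j = sigmoid(q_j . x + b_j), where q_j is the j-th column of W_in."""
    return [1.0 / (1.0 + math.exp(-(sum(W_in[d][j] * x[d] for d in range(len(x))) + b[j])))
            for j in range(len(b))]

def solve(G, R):
    """Solve G Z = R by Gauss-Jordan elimination with partial pivoting."""
    n = len(G)
    M = [G[i][:] + R[i][:] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def train(X, labels, n_classes, L=8, reg=1e-3, seed=0):
    rng = random.Random(seed)
    D, N = len(X[0]), len(X)
    W_in = [[rng.uniform(-1, 1) for _ in range(L)] for _ in range(D)]  # D x L, random
    b = [rng.uniform(-1, 1) for _ in range(L)]                         # random biases
    Phi = [hidden_outputs(x, W_in, b) for x in X]                      # row i holds phi_i
    T = [[1.0 if c == k else -1.0 for k in range(n_classes)] for c in labels]
    # (Phi Phi^T + reg * I) W_out = Phi T^T in the paper's notation (cf. Eq. (12))
    G = [[sum(Phi[i][p] * Phi[i][q] for i in range(N)) + (reg if p == q else 0.0)
          for q in range(L)] for p in range(L)]
    R = [[sum(Phi[i][p] * T[i][k] for i in range(N)) for k in range(n_classes)]
         for p in range(L)]
    return W_in, b, solve(G, R)                                        # W_out is L x C

def predict(x, model):
    """Eq. (13): o = W_out^T phi; the predicted class is the largest output."""
    W_in, b, W_out = model
    phi = hidden_outputs(x, W_in, b)
    o = [sum(W_out[j][k] * phi[j] for j in range(len(phi))) for k in range(len(W_out[0]))]
    return max(range(len(o)), key=o.__getitem__)

X = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]   # toy data, two separated clusters
y = [0, 0, 1, 1]
model = train(X, y, n_classes=2)
print([predict(x, model) for x in X])
```

Setting `reg` to zero recovers the unregularized solution of Eq. (12) whenever $\Phi \Phi^{T}$ is invertible; with more hidden neurons than training vectors it is not, which is the situation the regularized variant addresses.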
5. WOA for Feature Selection
WOA is employed for feature selection by treating each feature subset as the
position of a whale. Each subset may contain up to N features, where N is the
number of features in the original set. The fewer the features in the solution
and the higher the classification accuracy, the better the solution. WOA begins
with a set of randomly generated solutions (subsets) called the population.
Each solution is then assessed by the fitness function. The fittest solution in
the population is marked as $X^{*}$ (victim). In every iteration, the solutions
update their positions with respect to each other in order to imitate the
bubble-net attacking and search-for-victim strategies. To imitate the
bubble-net attack, a probability of 50% is assumed for choosing between the
shrinking encircling mechanism and the spiral model when updating the position
of a solution. During shrinking encircling, a balance between exploration and
exploitation is required. A random vector $A$, whose values have magnitude
greater than or less than 1, is used for this purpose. If $|A| > 1$, exploration
is performed by searching in the neighborhood of a randomly selected solution.
When $|A| < 1$, the region around the best solution found so far is exploited.
This procedure is repeated until the maximum number of iterations is reached.
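The text does not say how a continuous whale position is turned into a feature subset, nor how the two binary variants differ. One common option, shown here purely as an assumption, is to pass each coordinate through a sigmoid transfer function and keep the feature when the result reaches 0.5. The names `to_mask` and `select_columns` are hypothetical.

```python
import math

def to_mask(position):
    """Map a continuous whale position to a 0/1 feature mask (1 = feature kept)."""
    return [1 if 1.0 / (1.0 + math.exp(-v)) >= 0.5 else 0 for v in position]

def select_columns(rows, mask):
    """Keep only the columns of each data row whose mask bit is 1."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]

mask = to_mask([-2.0, 0.0, 3.0, -0.1])
print(mask)                                       # [0, 1, 1, 0]
print(select_columns([[10, 20, 30, 40]], mask))   # [[20, 30]]
```

The reduced rows returned by `select_columns` are what a classifier would be trained on when the fitness of this particular whale is evaluated.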
5.1. Fitness Function
The fitness function employed in this research work balances the number of
selected features in each solution (to be minimized) against the
classification accuracy (to be maximized) obtained by using these selected
features, as given in Eq. (14).
$Fitness = \alpha \, \gamma_R(D) + \beta \, \dfrac{|R|}{|C|}$ … (14)
where $\gamma_R(D)$ represents the classification error rate of a given
classifier (the IRVM classifier is used here), $|R|$ is the cardinality of the
chosen subset, $|C|$ is the total number of features in the dataset, and
$\alpha$ and $\beta$ are two parameters corresponding to the importance of
classification quality and subset length, respectively. Lower fitness values
indicate better solutions.
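As a worked instance of Eq. (14), the sketch below evaluates the fitness for a hypothetical subset. The weights `alpha = 0.99` and `beta = 0.01` are assumed values chosen only for illustration, since the paper does not report the ones it used; the error rate and subset size in the example are likewise made up.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99, beta=0.01):
    """Eq. (14): alpha * gamma_R(D) + beta * |R| / |C| (lower is better)."""
    return alpha * error_rate + beta * (n_selected / n_total)

# 5 % classification error using 7 of the 28 available features
print(round(fitness(0.05, 7, 28), 4))                  # 0.052
# With the same error, the smaller subset scores better (lower)
print(fitness(0.05, 7, 28) < fitness(0.05, 14, 28))    # True
```

Because the error term dominates under these weights, subset size mainly acts as a tie-breaker between solutions of similar accuracy.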
Algorithm 1 (Pseudo-Code of the WOA Algorithm)
Generate initial population X_i (i = 1, 2, ..., n)
Calculate the objective value of each solution
X* = the best solution
while (t < Max_Iteration)
    for each solution
        Update a, A, C, l, and p
        if1 (p < 0.5)
            if2 (|A| < 1)
                Use Eq. (2) to update the position of the current solution
            else if2 (|A| ≥ 1)
                Select a random solution X_rand
                Update the position of the current solution by Eq. (9)
            end if2
        else if1 (p ≥ 0.5)
            Update the position of the current solution by Eq. (6)
        end if1
    end for
    Check if any solution goes beyond the search space and amend it
    Calculate the fitness of each solution
    If there is a better solution, update X*
    t = t + 1
end while
return X*
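Algorithm 1 can be turned into a small runnable sketch for a continuous minimization problem. The following is an illustration under stated assumptions, not the authors' MATLAB code: `A` and `C` are drawn as one scalar per whale, out-of-range coordinates are clamped to the bounds, `b = 1.0` is an assumed spiral constant, and the sphere function stands in for the feature-selection fitness.

```python
import math
import random

def woa_minimize(objective, dim, n_whales=10, max_iter=50,
                 lb=-5.0, ub=5.0, b=1.0, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_whales)]
    fits = [objective(x) for x in pop]
    best_i = min(range(n_whales), key=fits.__getitem__)
    best_x, best_f = pop[best_i][:], fits[best_i]
    history = [best_f]                                  # best fitness per iteration
    for t in range(max_iter):
        a = 2.0 - t * (2.0 / max_iter)                  # Eq. (5)
        for i in range(n_whales):
            A = 2.0 * a * rng.random() - a              # Eq. (3), scalar per whale
            C = 2.0 * rng.random()                      # Eq. (4)
            l = rng.uniform(-1.0, 1.0)
            p = rng.random()
            x = pop[i]
            if p < 0.5:
                # |A| < 1: move around X* (Eq. 2); otherwise around a random whale (Eq. 9)
                ref = best_x if abs(A) < 1 else pop[rng.randrange(n_whales)]
                new_x = [ref[d] - A * abs(C * ref[d] - x[d]) for d in range(dim)]
            else:
                # Spiral-shaped path towards X* (Eq. 6)
                s = math.exp(b * l) * math.cos(2.0 * math.pi * l)
                new_x = [abs(best_x[d] - x[d]) * s + best_x[d] for d in range(dim)]
            pop[i] = [min(max(v, lb), ub) for v in new_x]   # amend out-of-range values
        for i in range(n_whales):
            f = objective(pop[i])
            if f < best_f:                              # update X* if a better one exists
                best_x, best_f = pop[i][:], f
        history.append(best_f)
    return best_x, best_f, history

def sphere(x):
    return sum(v * v for v in x)

best_x, best_f, history = woa_minimize(sphere, dim=3, seed=1)
print(best_f, history[0])
```

For feature selection, `objective` would wrap the fitness of Eq. (14), evaluated on the subset encoded by each whale.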
6. Datasets Used and Results
Two datasets are used to test the performance of the proposed research work.
Dataset – 1:
The dataset was collected from 25 healthcare centers, such as hospitals and
clinics, in and around Coimbatore district. Records of around 1127 patients
were obtained, with 28 features.
Of the 1127 patient records, 898 patients are affected by gastric cancer and
229 patients do not have gastric cancer.
For Dataset – 1, the results show that the proposed work (WOA-RVM)
outperforms WOA-SVM in terms of accuracy, hit rate and elapsed time.
Dataset – 2:
The dataset was obtained from 90 healthcare centers in and around Tamil Nadu.
Around 10038 records were obtained, with 28 features.
Of the 10038 records, 8999 patients are affected by gastric cancer and 1039
patients do not have gastric cancer. For Dataset – 2, the results show that
the proposed work (WOA-RVM) outperforms WOA-SVM in terms of accuracy, hit
rate and elapsed time.
The results of the performance analysis are depicted in Figures 1, 2
and 3, respectively.
Results for Dataset – 1:
Method     TP     TN     FP    FN    Accuracy   Hit Rate   Elapsed Time
WOA-RVM    893    223    5     6     99.02 %    98.9 %     64 seconds
WOA-SVM    861    185    41    40    92.81 %    91.8 %     142 seconds

Results for Dataset – 2:
Method     TP     TN     FP    FN    Accuracy   Hit Rate   Elapsed Time
WOA-RVM    8625   1274   78    61    98.62 %    99.1 %     197 seconds
WOA-SVM    7012   2316   253   457   92.93 %    91.3 %     468 seconds
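The reported accuracy values can be recomputed from the TP, TN, FP and FN counts in the tables. The sketch below uses the standard confusion-matrix definitions of the metrics named in the abstract; the function name `metrics` and the row labels are hypothetical. The Hit Rate column is not recomputed, because the paper does not define how it is calculated.

```python
def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, returned as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

rows = {
    "Dataset 1, WOA-RVM": (893, 223, 5, 6),
    "Dataset 1, WOA-SVM": (861, 185, 41, 40),
    "Dataset 2, WOA-RVM": (8625, 1274, 78, 61),
    "Dataset 2, WOA-SVM": (7012, 2316, 253, 457),
}
for name, counts in rows.items():
    print(name, round(100 * metrics(*counts)["accuracy"], 2))
# Dataset 1, WOA-RVM 99.02
# Dataset 1, WOA-SVM 92.81
# Dataset 2, WOA-RVM 98.62
# Dataset 2, WOA-SVM 92.93
```

The four recomputed accuracies agree with the Accuracy column of the tables.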
Fig. 1: Performance Analysis – Accuracy
Fig. 2: Performance Analysis – Hit Rate
Fig. 3: Performance Analysis – Elapsed Time
7. Conclusions
In this research work, the whale optimization algorithm (WOA) is employed for
performing the task of feature selection. Two binary variants of the WOA are
used to obtain feature subsets. The obtained subset is given as the input to the
Improved Relevance Vector Machine (IRVM) classifier. The classification error
rate is fed back into the WOA fitness function to guide the search.
The performance of WOA-IRVM is evaluated using metrics such as accuracy,
true positive rate, true negative rate, precision, false positive rate and
false negative rate. Implementations are carried out using MATLAB, and the
results show that the proposed classification strategy based on WOA feature
selection obtains better accuracy, hit rate and elapsed time than WOA-SVM.
References
[1] Thara Lakshmipathy, Gunasundari Ranganathan, Model of Genetic Fuzzy ARTMAP Classifier (GFAM) for Gastric Cancer Data Classification, ARPN Journal of Engineering and Applied Sciences 12(11) (2017), 3509 – 3517.
[2] Thara Lakshmipathy, Gunasundari Ranganathan, Improved Rule Based Classifier Based on Decision Trees (IRBC-DT) for Gastric Cancer Data Classification, Indian Journal of Science and Technology 10(20) (2017), 1 – 7.
[3] Gunasundari R., Thara L., Helicobacter pylori infection and associated stomach diseases: Comparative data mining approaches for diagnosis and prevention measures, IEEE International Conference on Advances in Computer Applications (ICACA) (2016), 9 – 13.
[4] Mirjalili S., Lewis A., The whale optimization algorithm, Advances in Engineering Software 95 (2016), 51–67.
[5] Kirshners A., Parshutin S., Leja M., Research on application of data mining methods to diagnosing gastric cancer, Advances in Data Mining, Lecture Notes in Computer Science 7377 (2012), 24–37.
[6] Silvera S.A.N., Mayne S.T., Marilie D., Gammon D., Diet and lifestyle factors and risk of subtypes of esophageal and gastric cancers: classification tree analysis, Ann Epidemiol 24(1) (2014), 50–57.
[7] Wang X., Duren Z., Zhang C., Clinical data analysis reveals three subtypes of gastric cancer, IEEE 6th International Conference on Systems Biology (2012), 315–320.
[8] Chul Chun, Jin Kim, Ki-Baik Hahm, Yoon-Joo Park, Se-Hak Chun, Data mining technique for medical informatics: detecting gastric cancer using case-based reasoning and single nucleotide polymorphisms, Expert Systems 25(2) (2008), 163 – 172.
[9] Monges G., Doucet L., Terris B., Chenard M., Bibeau F., Penault-Llorca F., Martin J., Rabut J., Leroux D., PMD6 - Data Mining Used To Characterize Discordance In Gastric Cancers HER2 Status Determination To Help For A Better Treatment Decision, Value in Health 19(7) (2016).