
Whale Optimization Algorithm based Feature Selection with Improved Relevance Vector Machine Classifier for Gastric Cancer Classification

1 L. Thara and 2 R. Gunasundari

1 Department of Computer Science, Karpagam Academy of Higher Education. PSG College of Arts & Science, Tamil Nadu, India. [email protected]

2 Department of Information Technology, Karpagam Academy of Higher Education, Tamil Nadu, India. [email protected]

Abstract: Data mining now finds application in almost every domain. Medical and clinical applications are among those that directly benefit human beings. The prevalence of cancer is a serious threat to individuals, and changing food habits contribute to gastric cancer. Hence there is vast scope for research on gastric cancer prediction from available data. This research work designs and develops a whale optimization algorithm (WOA) for feature selection from the dataset. Two binary variants of the WOA are used to obtain feature subsets. The obtained subset is then given as input to the Improved Relevance Vector Machine (IRVM) classifier. Performance is evaluated using accuracy, true positive rate, true negative rate, precision, false positive rate and false negative rate. Implementations are carried out in MATLAB, and the results show that the proposed classification strategy based on WOA feature selection obtains better results.

International Journal of Pure and Applied Mathematics, Volume 119 No. 10 2018, 337-348. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue.


Key Words: Data mining, gastric cancer data classification, whale optimization algorithm, feature selection, subset selection.


1. Introduction

Data mining is gaining scope in healthcare analytics and healthcare management, and it now extends to disease prediction. Medical data on the symptoms of patients with a variety of diseases, and on ways to help diagnose those diseases, has become very widespread. Analyzing and accounting for all the factors involved is typically not easy for one person. As a result, there is a clear need for an automated system that helps discover rules and patterns and predict future events. Data mining provides such a system: it facilitates many medical advances, especially in diagnosing a variety of diseases, and uncovers useful relations among the features available in the data. One of the leading causes of gastric cancer is the bacterium Helicobacter pylori (H. pylori). H. pylori is a spiral-shaped bacterium that lives in the mucous lining of the stomach. It has coexisted with humans for millennia, and infection with it is very common [1]. The Centers for Disease Control and Prevention (CDC) states that around two thirds of the world population is infected with this bacterium, and that the probability of infection in newly developed countries, including Latvia, is higher than in developed countries. This research uses data mining methods and algorithms to analyse the impact of harmful factors on the development of H. pylori infection, which raises the risk of developing gastric cancer, and employs an optimization technique to obtain better results.

In our previous work [1], a genetic fuzzy ARTMAP classifier (GFAM) model was proposed for gastric cancer data classification. In later work [2], an improved rule based classifier based on decision trees (IRBC-DT) was proposed for the same task. A comparative study of data mining approaches for the diagnosis and prevention of Helicobacter pylori infection and associated stomach diseases was also carried out [3]. The present work uses WOA for feature selection and IRVM for classification.

2. Related Works

Kirshners et al. [5] used a dataset containing 819 samples (24 positive and 795 negative) with 31 features, and applied three algorithms, CN2 Rules, C4.5 and Naive Bayes, to diagnose stomach cancer. Their results indicated that sex and the protein HER-1 are important factors in the diagnosis and classification of gastric cancer. Experimental results showed an average sensitivity above 50 % and of 86–100 % at most, with classification accuracy and specificity close to 65–70 %. Silvera et al. [6] employed classification tree analysis to examine data from a population-based case–control study (1095 cases, 687 controls) carried out in Connecticut, New Jersey, and western Washington State. Frequency of reported gastroesophageal reflux disease symptoms was the most important risk stratification factor for esophageal adenocarcinoma, gastric cardia, and noncardia gastric cancers, with dietary factors (esophageal adenocarcinoma, noncardia gastric), smoking (esophageal adenocarcinoma, gastric cardia), wine intake (gastric cardia, noncardia gastric), age (noncardia gastric), and income (noncardia gastric) appearing to modify the risk at these cancer sites. Wang et al. [7] applied hierarchical clustering to 14 available clinical factors from three categories: clinical background, immunohistochemistry data, and the cancer's stage information. Their results showed that two clinical factors, Her-1 and gender, can clearly characterize and differentiate the three groups found. Classification and clustering methods alike derive patterns from the dataset. The classification algorithms used for ensemble diagnosis of stomach cancer are CART (Classification and Regression Tree) and TSVM (Transductive Support Vector Machine).

Chun et al. [8] explored how data mining and knowledge discovery can be applied to medical informatics using human gene information. The authors applied case-based reasoning to a cancer detection problem using human gene information, motivated by the fact that case-based reasoning has been applied in medicine less often than other data mining techniques. They proposed a modified case-based reasoning method that is appropriate for associated categorical variables for use in detecting gastric cancer.

Monges et al. [9] used the real-world data study HERABLE in gastric cancer (GC) and gastroesophageal junction adenocarcinoma (GEJC) to assess the quality of HER2 testing, which is part of the routine assessment used to decide whether to initiate targeted treatment.

In addition, an exploratory analysis was performed to identify variables influencing discordance between local and central evaluation of HER2 overexpression. The authors report better results than the earlier methods they compared against.

3. Background of Whale Optimization Algorithm

WOA is gaining interest in many application domains and belongs to the family of stochastic population-based algorithms [45]. WOA imitates the bubble-net feeding behavior observed when whales forage [4]. The whale closes in on a region while entrapping the victim in a net of bubbles.

These bubbles are created while the whale swims along a ‘6’-shaped path. WOA involves two stages: an exploitation stage (encircling a victim and the spiral bubble-net attack) and an exploration stage (searching randomly for the victim).



3.1. Exploitation Stage (Encircling victim/Bubble-Net Offending Manner)

To update a solution, Eqs. (1) and (2) are employed to move a whale around a victim.

$$D = \left| C \cdot X^{*}(t) - X(t) \right| \qquad (1)$$

$$X(t+1) = X^{*}(t) - A \cdot D \qquad (2)$$

In the above equations, $t$ represents the current iteration, $X^{*}$ represents the best solution obtained so far, $X$ is the current solution, $|\cdot|$ is the absolute value, and $\cdot$ is an element-by-element multiplication. $A$ and $C$ are coefficient vectors that are calculated as in Eqs. (3) and (4):

$$A = 2a \cdot r - a \qquad (3)$$

$$C = 2 \cdot r \qquad (4)$$

where $a$ decreases linearly from 2 to 0 over the iterations, as in Eq. (5), and $r$ is a random vector in [0,1]. The values of the $A$ and $C$ vectors control the regions in the neighborhood of the best solution where a solution can be located. The whales move towards the victim both by shrinking encirclement and along a spiral-shaped path.
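The shrinking-encircling move can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. (1) to (5), not the authors' MATLAB implementation; the function names `linear_a` and `encircle` are hypothetical, and drawing separate random vectors for $A$ and $C$ is an assumption (the text uses a single symbol $r$).

```python
import numpy as np

def linear_a(t, max_iter):
    """Eq. (5): a decays linearly from 2 (t = 0) to 0 (t = max_iter)."""
    return 2.0 - t * (2.0 / max_iter)

def encircle(x, x_best, a, rng):
    """Shrinking-encircling update of one whale, Eqs. (1)-(4)."""
    r1 = rng.random(x.shape)      # random vector in [0, 1) used for A (assumed separate draw)
    r2 = rng.random(x.shape)      # random vector in [0, 1) used for C (assumed separate draw)
    A = 2.0 * a * r1 - a          # Eq. (3): every entry lies in [-a, a)
    C = 2.0 * r2                  # Eq. (4)
    D = np.abs(C * x_best - x)    # Eq. (1)
    return x_best - A * D         # Eq. (2)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])        # current solution (illustrative values)
x_best = np.array([0.0, 0.0, 1.0])    # best solution so far (illustrative values)
x_new = encircle(x, x_best, linear_a(t=10, max_iter=100), rng)
```

As $a$ shrinks towards 0, the entries of $A$ shrink with it, so the updated position is pulled ever closer to the best solution.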

$$a = 2 - t \, \frac{2}{MaxIter} \qquad (5)$$

where $t$ is the iteration number and $MaxIter$ is the maximum number of allowed iterations. The spiral-shaped path is obtained by calculating the distance between the solution $X$ and the leading solution $X^{*}$. A spiral equation is then created between the current solution and the best solution, as in Eq. (6).

$$X(t+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(t) \qquad (6)$$

where $D' = \left| X^{*}(t) - X(t) \right|$ is the distance between a whale $X$ and the victim, $b$ defines the shape of the spiral, and $l$ is a random number in [−1,1].

A probability of 50% is assumed for choosing between the shrinking encircling mechanism and the spiral-shaped path during the optimization, as in Eq. (7).



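$$X(t+1) = \begin{cases} \text{Shrinking encircling (Eq. (2))} & \text{if } p < 0.5 \\ \text{Spiral-shaped path (Eq. (6))} & \text{if } p \geq 0.5 \end{cases} \qquad (7)$$

where $p$ is a random number in [0,1].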
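The spiral move of Eq. (6) and the 50% switch of Eq. (7) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the default $b = 1$ is an assumption, since the text does not state a value for the spiral constant.

```python
import numpy as np

def spiral(x, x_best, b, l):
    """Eq. (6): logarithmic spiral around the best solution."""
    d_prime = np.abs(x_best - x)                      # D', distance to the victim
    return d_prime * np.exp(b * l) * np.cos(2.0 * np.pi * l) + x_best

def exploit(x, x_best, a, rng, b=1.0):
    """Eq. (7): choose shrinking encircling or the spiral path with equal probability."""
    p = rng.random()                                  # random number in [0, 1)
    if p < 0.5:
        A = 2.0 * a * rng.random(x.shape) - a         # Eq. (3)
        C = 2.0 * rng.random(x.shape)                 # Eq. (4)
        return x_best - A * np.abs(C * x_best - x)    # Eqs. (1)-(2)
    l = rng.uniform(-1.0, 1.0)                        # random number in [-1, 1)
    return spiral(x, x_best, b, l)

rng = np.random.default_rng(1)
x_next = exploit(np.array([1.0, 3.0]), np.array([0.0, 1.0]), a=1.5, rng=rng)
```

For $l = 0$ the spiral term reduces to $D' + X^{*}$, and for $l = \pm 0.25$ the cosine vanishes, so the whale lands on the best solution itself.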

3.2. Exploration Stage (Search for Victim)

To improve exploration in WOA, instead of updating the solutions based on the position of the best solution found so far, a randomly chosen solution is used to update the position. Accordingly, a vector $A$ with random values greater than 1 or less than −1 is used to force a solution to move far away from the best known search agent, as given in Eqs. (8) and (9).

$$D = \left| C \cdot X_{rand} - X \right| \qquad (8)$$

$$X(t+1) = X_{rand} - A \cdot D \qquad (9)$$

where $X_{rand}$ is a random whale chosen from the current population.
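A sketch of the exploration move, under the same assumptions as the earlier equations: the only change relative to shrinking encircling is that a randomly chosen whale replaces the best solution. The function name `explore` and the sample population are hypothetical.

```python
import numpy as np

def explore(x, population, a, rng):
    """Eqs. (8)-(9): move relative to a randomly chosen whale."""
    x_rand = population[rng.integers(len(population))]  # random whale from the population
    A = 2.0 * a * rng.random(x.shape) - a               # Eq. (3)
    C = 2.0 * rng.random(x.shape)                       # Eq. (4)
    D = np.abs(C * x_rand - x)                          # Eq. (8)
    return x_rand - A * D                               # Eq. (9)

rng = np.random.default_rng(2)
population = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # illustrative whales
x_next = explore(np.array([5.0, 5.0]), population, a=2.0, rng=rng)
```

Because the entries of $A$ are bounded by $a$, values with magnitude above 1 can only occur while $a > 1$, that is, during the first half of the run under Eq. (5).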

4. Improved RVM Classifier

Let $X$ denote a set of $N$ vectors $x_i \in \mathbb{R}^{D}$ and $c$ the corresponding class labels $c_i \in \{1, \ldots, C\}$, which are used to train a single layer feedforward network (SLFN) using the IRVM algorithm. The network consists of $D$ input neurons (equal to the dimensionality of $x_i$), $L$ hidden neurons and $C$ output neurons (equal to the number of classes involved in the classification problem). The number of hidden layer neurons is usually selected to be much greater than the number of classes, i.e., $L \gg C$. The elements of the network target vectors $t_i = [t_{i1}, \ldots, t_{iC}]^{T}$, each corresponding to a training vector $x_i$, are set to $t_{ik} = 1$ for vectors belonging to class $k$, i.e., when $c_i = k$, and to $t_{ik} = -1$ when $c_i \neq k$. In IRVM-based approaches, the network input weights $W_{in} \in \mathbb{R}^{D \times L}$ and the hidden layer bias values $b \in \mathbb{R}^{L}$ are randomly assigned, while the network output weights $W_{out} \in \mathbb{R}^{L \times C}$ are analytically calculated. Let us denote by $q_j$, $w_k$, $w_{kj}$ the $j$-th column of $W_{in}$, the $k$-th row of $W_{out}$ and the $j$-th element of $w_k$, respectively. Given an activation function $\Phi(\cdot)$ for the network hidden layer and using a linear activation function for the network output layer, the response $o_i = [o_{i1}, \ldots, o_{iC}]^{T}$ of the network corresponding to $x_i$ is calculated by:



$$o_{ik} = \sum_{j=1}^{L} w_{kj} \, \Phi(q_j, b_j, x_i), \quad k = 1, \ldots, C. \qquad (10)$$

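By storing the network hidden layer outputs $\phi_i \in \mathbb{R}^{L}$ corresponding to all the training vectors $x_i$, $i = 1, \ldots, N$, in a matrix $\Phi = [\phi_1, \ldots, \phi_N]$, or:

$$\Phi = \begin{bmatrix} \Phi(q_1, b_1, x_1) & \cdots & \Phi(q_1, b_1, x_N) \\ \vdots & \ddots & \vdots \\ \Phi(q_L, b_L, x_1) & \cdots & \Phi(q_L, b_L, x_N) \end{bmatrix} \qquad (11)$$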

IRVM presumes zero training error by assuming that $o_i = t_i$, $i = 1, \ldots, N$, or in matrix notation:

$$O = T, \qquad (12)$$

where $T = [t_1, \ldots, t_N]$ is a matrix containing the network target vectors. The network output weights $W_{out}$ can be analytically calculated by:

$$W_{out} = \Phi^{\dagger} T^{T} \qquad (12)$$

where $\Phi^{\dagger} = (\Phi \Phi^{T})^{-1} \Phi$ is the generalized pseudo-inverse of $\Phi^{T}$. After the calculation of the network output weights $W_{out}$, the network response for a vector $x_l \in \mathbb{R}^{D}$ is given by:

$$o_l = W_{out}^{T} \phi_l \qquad (13)$$

where $\phi_l$ is the network hidden layer output for $x_l$.
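The training procedure described above (random hidden layer, analytically solved output weights) can be sketched with NumPy. This is an illustration under stated assumptions, not the authors' IRVM code: the sigmoid activation, the Gaussian random weights, the 0-based class labels, and the names `hidden_outputs`, `train` and `predict` are all hypothetical choices, and the toy data is invented for the example. `np.linalg.pinv` is used as the pseudo-inverse.

```python
import numpy as np

def hidden_outputs(X, W_in, b):
    """Return Phi of shape (L, N); column i is the hidden output phi_i of sample x_i."""
    H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))   # (N, L); sigmoid is an assumed activation
    return H.T

def train(X, labels, n_classes, n_hidden, rng):
    n_samples, n_inputs = X.shape
    W_in = rng.standard_normal((n_inputs, n_hidden))   # random input weights (D x L)
    b = rng.standard_normal(n_hidden)                  # random hidden biases (L)
    Phi = hidden_outputs(X, W_in, b)                   # (L, N)
    T = -np.ones((n_classes, n_samples))               # targets are -1 ...
    T[labels, np.arange(n_samples)] = 1.0              # ... except +1 for the true class
    W_out = np.linalg.pinv(Phi.T) @ T.T                # pseudo-inverse solution (L x C)
    return W_in, b, W_out

def predict(X, W_in, b, W_out):
    O = W_out.T @ hidden_outputs(X, W_in, b)           # Eq. (13) applied to every sample
    return O.argmax(axis=0)                            # class with the largest response

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])     # toy data, two clusters
y = np.array([0, 0, 0, 1, 1, 1])                       # 0-based class labels
W_in, b, W_out = train(X, y, n_classes=2, n_hidden=20, rng=rng)
y_hat = predict(X, W_in, b, W_out)
```

Only the output weights are fitted; the hidden layer stays at its random initial values, which is what makes the solution a single linear solve.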

The calculation of the network output weights $W_{out}$ through the pseudo-inverse is sometimes inaccurate, since the matrix $\Phi \Phi^{T}$ may be singular. A regularized version of the IRVM algorithm allows small training errors and tries to minimize the norm of the network output weights $W_{out}$.
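The text does not give the regularized formulation. One common way to realize "allow small training errors and minimize the norm of the output weights" is a ridge-style solve, sketched below. The formula $(\Phi \Phi^{T} + I/c)\, W_{out} = \Phi T^{T}$, the parameter `c`, and the function name are assumptions for illustration and may differ from what the authors used.

```python
import numpy as np

def regularized_output_weights(Phi, T, c=1.0):
    """Assumed ridge-style solve: (Phi Phi^T + I / c) W_out = Phi T^T."""
    n_hidden = Phi.shape[0]
    return np.linalg.solve(Phi @ Phi.T + np.eye(n_hidden) / c, Phi @ T.T)

# Two identical hidden units make Phi Phi^T singular, yet the regularized system is solvable.
Phi = np.array([[1.0, 2.0, 3.0],
                [1.0, 2.0, 3.0]])       # (L = 2, N = 3), illustrative values
T = np.array([[1.0, -1.0, 1.0]])        # (C = 1, N = 3), illustrative targets
W_out = regularized_output_weights(Phi, T, c=1.0)
```

A larger `c` weakens the regularization and moves the solution towards the plain pseudo-inverse; a smaller `c` shrinks the weights more strongly.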

5. WOA for Feature Selection

WOA is employed for feature selection by treating each feature subset as the position of a whale. Every subset may contain up to N features, where N is the number of features in the original set. The fewer the features in the solution and the higher the classification accuracy, the better the solution. WOA begins with a set of randomly generated solutions (subsets) called the population. Each solution is then assessed by the fitness function. The fittest solution in the population is marked as $X^{*}$ (the victim). In every iteration, the solutions update their positions relative to one another. This imitates the bubble-net attack and the search for the victim. To imitate the bubble-net attack, a probability of 50% is assumed for choosing between the shrinking encircling mechanism and the spiral model when updating the position of the solutions. In the shrinking encircling mechanism, a balance between exploration and exploitation is required. A random vector $A$ whose values may have magnitude greater or less than 1 is used for this purpose. If $|A| > 1$, exploration is performed by searching in the neighborhood of a randomly selected solution. When $|A| < 1$, the region around the best solution found so far is exploited. This procedure is repeated until the maximum number of iterations is reached.
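The position updates of Section 3 produce real-valued vectors, whereas a feature subset is a 0/1 mask over the features. The paper mentions two binary variants of WOA but does not describe how positions are binarized, so the sketch below uses a sigmoid transfer function with stochastic thresholding purely as an assumed example; the function name and the at-least-one-feature rule are also hypothetical.

```python
import numpy as np

def to_feature_mask(position, rng):
    """Map a continuous whale position to a boolean feature mask (assumed sigmoid transfer)."""
    prob = 1.0 / (1.0 + np.exp(-position))      # per-feature selection probability
    mask = rng.random(position.shape) < prob    # stochastic thresholding
    if not mask.any():                          # keep at least one feature selected
        mask[np.argmax(prob)] = True
    return mask

rng = np.random.default_rng(0)
mask = to_feature_mask(np.array([50.0, -50.0, 50.0]), rng)
selected = np.flatnonzero(mask)                 # indices of the selected features
```

The mask can then be used to slice the columns of the data matrix before training the classifier.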

5.1. Fitness Function

The fitness function employed in this research work balances the number of selected features in each solution (to be minimized) against the classification accuracy (to be maximized) obtained by using these selected features, as given in Eq. (14).

$$Fitness = \alpha \, \gamma_R(D) + \beta \, \frac{|R|}{|C|} \qquad (14)$$

where $\gamma_R(D)$ represents the classification error rate of a given classifier (the IRVM classifier is used here), $|R|$ is the cardinality of the chosen subset and $|C|$ is the total number of features in the dataset. $\alpha$ and $\beta$ are two parameters corresponding to the importance of classification quality and subset length.
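Eq. (14) is a weighted sum, which a short function makes concrete. The default weights `alpha=0.99` and `beta=0.01` are assumptions for illustration; the paper does not report the values it used.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99, beta=0.01):
    """Eq. (14): alpha * error rate + beta * |R| / |C|; lower is better."""
    return alpha * error_rate + beta * (n_selected / n_total)

# A subset of 14 of the 28 features with a 10 % error rate:
# 0.99 * 0.10 + 0.01 * (14 / 28) = 0.099 + 0.005 = 0.104
score_half = fitness(error_rate=0.10, n_selected=14, n_total=28)

# The same error rate with only 7 features gives a lower (better) fitness.
score_quarter = fitness(error_rate=0.10, n_selected=7, n_total=28)
```

With weights like these, the error rate dominates and the subset size mainly breaks ties between subsets of similar accuracy.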

Algorithm 1 (Pseudo-Code of the WOA Algorithm)

Generate Initial Population X_i (i = 1, 2, ..., n)

Calculate the objective value of each solution

X* = the best solution

while (t < Max_Iteration)

    for each solution

        Update a, A, C, l, and p

        if1 (p < 0.5)

            if2 (|A| < 1)

                Use Eq. (2) to update the position of the current solution

            else if2 (|A| ≥ 1)

                Select a random solution X_rand

                Update the position of the current solution by Eq. (9)

            end if2

        else if1 (p ≥ 0.5)

            Update the position of the current search by Eq. (6)

        end if1

    end for

    Check if any solution goes beyond the search space and amend it

    Calculate the fitness of each solution

    If there is a better solution, update X*

    t = t + 1

end while

return X*
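Algorithm 1 can be turned into a compact runnable sketch. The version below works on a continuous search space and minimizes a generic objective; it is an illustration, not the authors' MATLAB code. The population size, bounds, spiral constant `b = 1`, the use of a scalar $A$ and $C$ per whale, and the sphere test function are all assumptions made for the example.

```python
import numpy as np

def woa(objective, dim, n_whales=20, max_iter=200, lower=-5.0, upper=5.0, b=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lower, upper, (n_whales, dim))     # generate initial population
    scores = np.array([objective(x) for x in X])       # objective value of each solution
    best = X[scores.argmin()].copy()                   # X*, the best solution
    best_score = scores.min()
    for t in range(max_iter):
        a = 2.0 - t * (2.0 / max_iter)                 # Eq. (5)
        for i in range(n_whales):
            A = 2.0 * a * rng.random() - a             # Eq. (3), scalar per whale (assumption)
            C = 2.0 * rng.random()                     # Eq. (4), scalar per whale (assumption)
            p = rng.random()
            l = rng.uniform(-1.0, 1.0)
            if p < 0.5:
                if abs(A) < 1.0:                       # exploit around the best solution
                    X[i] = best - A * np.abs(C * best - X[i])          # Eqs. (1)-(2)
                else:                                  # explore around a random whale
                    x_rand = X[rng.integers(n_whales)].copy()
                    X[i] = x_rand - A * np.abs(C * x_rand - X[i])      # Eqs. (8)-(9)
            else:                                      # spiral-shaped path, Eq. (6)
                X[i] = np.abs(best - X[i]) * np.exp(b * l) * np.cos(2.0 * np.pi * l) + best
        X = np.clip(X, lower, upper)                   # amend solutions outside the search space
        scores = np.array([objective(x) for x in X])   # fitness of each solution
        if scores.min() < best_score:                  # update X* if a better solution exists
            best_score = scores.min()
            best = X[scores.argmin()].copy()
    return best, best_score

# Usage on a simple test function (sum of squares, minimum at the origin).
best, best_score = woa(lambda x: float(np.sum(x ** 2)), dim=5)
```

For feature selection, the objective would map a position to a feature mask, train the classifier on the selected columns, and return the fitness of Eq. (14).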



6. Datasets Used and Results

Two datasets are used to test the performance of the proposed research work.

Dataset – 1:

The dataset was collected from 25 healthcare centers, such as hospitals and clinics, in and around Coimbatore district. Records of 1127 patients were obtained, each with 28 features.

Of the 1127 patient records, 898 patients are affected by gastric cancer and 229 patients do not have gastric cancer.

For Dataset – 1, the results show that the proposed work outperforms WOA-SVM in terms of accuracy, hit rate and elapsed time.

Dataset – 2:

The dataset was obtained from 90 healthcare centers in and around Tamil Nadu. A total of 10038 records were obtained, each with 28 features.

Of the 10038 records, 8999 patients are affected by gastric cancer and 1039 patients do not have gastric cancer. For Dataset – 2, the results again show that the proposed work outperforms WOA-SVM in terms of accuracy, hit rate and elapsed time.

The results of the performance analysis are depicted in Figures 1, 2 and 3.

Dataset – 1

Method     TP     TN     FP    FN    Accuracy   Hit Rate   Elapsed Time
WOA-RVM    893    223    5     6     99.02 %    98.9 %     64 seconds
WOA-SVM    861    185    41    40    92.81 %    91.8 %     142 seconds

Dataset – 2

Method     TP     TN     FP    FN    Accuracy   Hit Rate   Elapsed Time
WOA-RVM    8625   1274   78    61    98.62 %    99.1 %     197 seconds
WOA-SVM    7012   2316   253   457   92.93 %    91.3 %     468 seconds
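The accuracy column can be checked directly from the confusion counts. The sketch below uses the standard definitions of the metrics named in the abstract. Reading the paper's hit rate as the true positive rate TP / (TP + FN) is an assumption, since the paper does not define it, so the recomputed rate need not match the reported column. The function and key names are illustrative.

```python
def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, returned as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),          # true positive rate (assumed meaning of hit rate)
        "tnr": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),
        "fpr": fp / (fp + tn),          # false positive rate
        "fnr": fn / (fn + tp),          # false negative rate
    }

m = metrics(tp=893, tn=223, fp=5, fn=6)   # WOA-RVM row, Dataset 1
print(f"{m['accuracy']:.2%}")             # 99.02%
```

The same call on the other three rows reproduces the reported accuracies of 92.81 %, 98.62 % and 92.93 %.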



Fig.1: Performance Analysis - Accuracy

Fig. 2: Performance Analysis – Hit Rate

Fig. 3: Performance Analysis – Elapsed Time

7. Conclusions

In this research work, the whale optimization algorithm (WOA) is employed for feature selection. Two binary variants of the WOA are used to obtain feature subsets. The obtained subset is given as input to the Improved Relevance Vector Machine (IRVM) classifier. The classifier's error rate is fed back into the WOA fitness function so that the search favors subsets with fewer errors. The performance of WOA–IRVM is evaluated using accuracy, true positive rate, true negative rate, precision, false positive rate and false negative rate. Implementations are carried out in MATLAB, and the results show that the proposed classification strategy based on WOA feature selection obtains better results than WOA-SVM on both datasets.

References

[1] Thara Lakshmipathy, Gunasundari Ranganathan, Model of Genetic Fuzzy ARTMAP Classifier (GFAM) for Gastric Cancer Data Classification, ARPN Journal of Engineering and Applied Sciences 12(11) (2017), 3509 – 3517.

[2] Thara Lakshmipathy, Gunasundari Ranganathan, Improved Rule Based Classifier Based on Decision Trees (IRBC-DT) for Gastric Cancer Data Classification, Indian Journal of Science and Technology 10(20) (2017), 1 – 7.

[3] Gunasundari R., Thara L., Helicobacter pylori infection and associated stomach diseases: Comparative data mining approaches for diagnosis and prevention measures, IEEE International Conference on Advances in Computer Applications (ICACA) (2016), 9 – 13.

[4] Mirjalili S., Lewis A., The whale optimization algorithm, Advances in Engineering Software 95 (2016), 51–67.

[5] Kirshners A., Parshutin S., Leja M., Research on application of data mining methods to diagnosing gastric cancer, Advances in Data Mining, Lecture Notes in Computer Science 7377 (2012), 24–37.

[6] Silvera S.A.N., Mayne S.T., Marilie D., Gammon D., Diet and lifestyle factors and risk of subtypes of esophageal and gastric cancers: classification tree analysis, Ann Epidemiol 24(1) (2014), 50–57.

[7] Wang X., Duren Z., Zhang C., Clinical data analysis reveals three subtypes of gastric cancer, IEEE 6th International Conference on Systems Biology (2012), 315–320.

[8] Chul Chun, Jin Kim, Ki-Baik Hahm, Yoon-Joo Park, Se-Hak Chun, Data mining technique for medical informatics: detecting gastric cancer using case-based reasoning and single nucleotide polymorphisms, Expert Systems 25(2) (2008), 163 – 172.

[9] Monges G., Doucet L., Terris B., Chenard M., Bibeau F., Penault-Llorca F., Martin J., Rabut J., Leroux D., PMD6 - Data Mining Used To Characterize Discordance In Gastric Cancers HER2 Status Determination To Help For A Better Treatment Decision, Value in Health 19(7) (2016).

