

Hybrid Ant Bee Algorithm for Fuzzy Expert System Based Sample Classification

Pugalendhi GaneshKumar, Chellasamy Rani, Durairaj Devaraj, and T. Aruldoss Albert Victoire

Abstract—Accuracy maximization and complexity minimization are the two main goals of fuzzy expert system based microarray data classification. Our previous Genetic Swarm Algorithm (GSA) approach improved the classification accuracy of the fuzzy expert system at the cost of its interpretability. The if-then rules produced by the GSA are lengthy and complex, which makes them difficult for the physician to understand. To address this interpretability-accuracy tradeoff, the rule set is represented using integer numbers and the task of rule generation is treated as a combinatorial optimization task. Ant colony optimization (ACO) with local and global pheromone updates is applied to find the fuzzy partition based on the gene expression values, in order to generate a simpler rule set. To address the formless and continuous expression values of a gene, this paper employs the artificial bee colony (ABC) algorithm to evolve the points of the membership functions. Mutual information is used for identification of informative genes. The performance of the proposed hybrid Ant Bee Algorithm (ABA) is evaluated using six gene expression data sets. From the simulation study, it is found that the proposed approach generated an accurate fuzzy system with highly interpretable and compact rules for all the data sets when compared with other approaches.

Index Terms—Microarray data, fuzzy expert system, ant colony optimization, artificial bee colony, mutual information


1 INTRODUCTION

DNA microarrays [1] are an important technology for studying gene expression. With a single hybridization, the level of expression of thousands of genes, or even an entire genome, can be estimated for a sample of cells. Generally, the microarray data are images, which have to be transformed into gene expression matrices in which rows represent genes, columns represent various samples such as tissues or experimental conditions, and the number in each cell characterizes the expression level of a particular gene in a particular sample. Recently, gene expression profiles have become a more preferred basis for disease classification [2] than morphology. A microarray based disease classification system takes labeled gene expression data samples and generates a classifier model that classifies new data samples into different predefined diseases.

In the literature, statistical approaches like the weighted voting scheme [3], nearest neighbor classification [4], discrimination methods [5], least squares and logistic regression [6], and the naive Bayes approach [7] were used to generate the classifier model for gene expression data. These statistical approaches usually result in an inflexible classification system that is unable to classify a sample if the expressions of genes differ slightly from the predefined profile. Moreover, the classifier model produced by them is memory resistant and is not scalable. Machine learning approaches like artificial neural networks [8] and support vector machines [9] have been successfully applied to classify microarray data. Even though these approaches produce good classification accuracy, their results are hard to interpret. They are popularly called "black box" methods, as they focus only on classification performance and do not provide any deeper understanding of the fundamental questions in biology and medicine.

A decision tree [10] can be used to construct a rule based classifier model. Even though the rules it produces contain biologically meaningful terms, it is a sensitive type of classifier: small disturbances in the training sample lead to large differences in the tree structure. A symbolic machine learning approach [11] has been proposed to extract human understandable rules from decision trees. This approach manipulates symbols on the assumption that such a behavior can be stored in symbolically structured knowledge bases. In practice, the symbolic manipulations [12] limit the situations to which conventional AI theories can be applied, because knowledge acquisition and representation are by no means easy; they are arduous tasks. Although the rule based classifier systems reported in [13], [14] produce simple and interpretable rules, they cannot completely bring out the hidden information in the data. Moreover, they lack robustness with respect to noisy and missing data.

Microarray data classification is a complex classification problem that involves a decision-making process. This decision-making process has a lot of uncertainty, since the information related to gene expressions is vague in nature and very hard to predict, as the boundaries between them cannot be well defined. Because fuzzy logic [15], [16] deals with uncertain situations and vagueness, it seems to be an appropriate approach for classification of microarray gene expression data. There are two main categories of fuzzy logic based classifiers: pure fuzzy classifiers, and fuzzy rule based systems, or simply fuzzy expert systems. Pure fuzzy classification methods based on fuzzy pattern matching [17], fuzzy clustering [18], and fuzzy integral [19] are poorly suited for classification problems because of their lack of normalization, and have unacceptable performance. The key to the success of the fuzzy rule based system [20] is its ability to incorporate human expert knowledge in decision making.

• P. GaneshKumar is with the Department of Information Technology, Anna University Regional Centre, Coimbatore, Tamil Nadu, India. E-mail: [email protected].

• C. Rani is with the Department of Computer Science and Engineering, Government College of Engineering, Salem, India. E-mail: [email protected].

• D. Devaraj is with the Department of Electrical and Electronics Engineering, Kalasalingam University, Krishnankoil, India. E-mail: [email protected].

• T.A.A. Victoire is with the Department of Electrical and Electronics Engineering, Anna University Regional Centre, Coimbatore, India. E-mail: [email protected].

Manuscript received 6 Sept. 2013; revised 23 Dec. 2013; accepted 21 Jan. 2014. Date of publication 19 Feb. 2014; date of current version 15 May 2014. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TCBB.2014.2307325

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 11, NO. 2, MARCH/APRIL 2014

1545-5963 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

An important issue in the design of a fuzzy expert system is the formation of the fuzzy if-then rules and the membership functions, i.e., knowledge acquisition. In general, the rules and membership functions are formed from the experience of human experts. But for the microarray data classification problem with a large number of input genes, the possible number of rules increases exponentially, which makes it difficult for experts to define a complete rule set. In [21], a model for automatic generation and adjustment of membership functions and rules directly from microarray data is proposed. This data-driven approach is very weak in self learning and in determining the required number of fuzzy if-then rules.

Knowledge acquisition for a fuzzy expert system can be formulated as a search problem in a high dimensional space where each point represents a rule set, the membership functions and the corresponding system behavior. Given some performance criteria, the performance of the system forms a hypersurface in the space. Developing the optimal fuzzy system is equivalent to finding the optimal location on this hypersurface. This makes evolutionary algorithms a better candidate for knowledge acquisition. In [22], a hybrid fuzzy (HF) method for extracting a compact rule base is proposed. This approach fails to model the fuzzy system completely, as it represents only the rule set in the genetic population. Since the membership functions and the rule set of a fuzzy system are co-dependent, they should be designed or evolved at the same time.

In [23], a hybrid Genetic Swarm Algorithm is proposed by combining the strengths of the Genetic Algorithm (GA) [24] and Particle Swarm Optimization (PSO) [25]. The GSA represents the rule set using binary strings and the values of the membership functions using floating point numbers. Both solution variables are encoded as a single individual in the genetic population and are evolved simultaneously, such that GA is used to find the near optimal rules and PSO is used to tune the membership functions. A fuzzy expert system usually comes with two contradictory requirements on the obtained model, namely accuracy and interpretability [26]. It is often found that a fuzzy system with the highest classification accuracy shows poorer interpretability and vice versa. The GSA approach produces an accurate fuzzy expert system with knowledge expressed in the form of if-then rules.

Even though GSA has better classification accuracy, the if-then rules produced by GSA involve more input genes and each gene takes more linguistic values. Thus an if-then rule produced by the GSA is very long and complex, which makes it difficult for the physician to understand. In view of this interpretability-accuracy tradeoff, more effort is still required to increase the efficiency of the learning of the fuzzy expert system.

Recently, researchers have begun to study the nature and behavior of specific natural systems or species in order to develop intelligent optimization algorithms. Ant colony optimization [27] is a meta-heuristic algorithm inspired by the foraging behavior of ants for solving combinatorial optimization problems. Artificial bee colony [28] is another popular algorithm, inspired by the foraging behavior of honey bees, that finds solutions effectively for complex optimization problems. In this paper, the power of ant colony optimization and artificial bee colony are combined, and a novel hybrid Ant Bee Algorithm is proposed to design an accurate and interpretable fuzzy expert system. The performance of the proposed approach is evaluated using six gene expression data sets, viz., Type 2 Diabetes (T2D) [29], Colon cancer [30], Leukemia [3], Lymphoma [31], Rheumatoid Arthritis versus Osteoarthritis (RAOA) [32] and Rheumatoid Arthritis versus Controls (RAC) [33].

The number of genes in the above gene expression profiles (usually in the range of 2,000-30,000) is larger than the number of samples (usually in the range of 30-100). A fuzzy expert system designed using the large set of gene expression features will have a higher computational cost, a slower learning process and poor classification accuracy due to the phenomenon known as the curse of dimensionality. Recent studies [34], [35], [36], [37], [38] have shown that a small number of genes is sufficient for accurate diagnosis of most diseases, even though the number of genes varies greatly between different diseases. The use of a small subset of genes helps not only to achieve better diagnostic accuracy, but also gives an opportunity to further analyze the nature of the disease and the genetic mechanisms responsible for it. Thus gene selection plays a major role in the proposed system. This paper uses mutual information [39] for selecting informative genes because of its nonlinearity, robustness, scalability and good empirical success.

The rest of this paper is structured as follows. Section 2 briefly introduces the various components of the fuzzy expert system and its suitability for microarray data classification. In Section 3, the issue of the interpretability-accuracy tradeoff is discussed. In Section 4, details of the proposed Ant Bee Algorithm are presented. Details of the simulations conducted using six gene expression data sets and the results are reported in Section 5. Concluding remarks are given in Section 6.

2 MICROARRAY DATA CLASSIFICATION USING FUZZY EXPERT SYSTEM

Microarray data classification is a supervised learning task that predicts the diagnostic category of a sample from its expression array phenotype. This problem can be solved using a fuzzy expert system, which implements a nonlinear mapping from its input space to its output space and produces a discrete class label indicating only the predicted class of the instance. Fig. 1 illustrates the various components of the fuzzy expert system for microarray data classification.

Gene selection has drawn special attention in the design of a fuzzy expert system for microarray data classification because of the high dimensional nature of the data. The mutual information technique is used for selecting informative genes. A fuzzy expert system is an expert system that uses a collection of if-then rules and membership functions, instead of Boolean logic, to reason about data. The general form of an if-then rule in the proposed fuzzy expert system is as below:

$R_j$: if $x_{p1}$ is $A_{j1}$ and ... and $x_{pn}$ is $A_{jn}$ then class $C_j$,

where $A_{j1}, \ldots, A_{jn}$ are the antecedent fuzzy sets of the input genes $x_{p1}, \ldots, x_{pn}$ and $C_j$ is one of the output class labels. A collection of such rules forms the rule base for the fuzzy expert system, upon which qualitative reasoning is performed to infer the results. The relation between input and output is expressed using a fuzzy relation constructed on the basis of fuzzy if-then rules. A fuzzy relation is a fuzzy set defined on universal sets, which are Cartesian products. Mathematically, a fuzzy set $A$ in the universe of discourse $X$ is defined to be a set of ordered pairs,

$$A = \{(x, \mu_A(x)) \mid x \in X\}, \tag{1}$$

where $\mu_A(x)$ is called the membership function of $x$ in $A$. Triangular and trapezoidal membership functions are the two most commonly used membership functions.
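The two membership function shapes named above can be sketched in a few lines of Python. This is only an illustration: the function names, the corner-point parameters `a`, `b`, `c`, `d` and the sample values are assumptions made here, not notation from the paper.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], flat on [b, c], falls on [c, d].

    Setting a == b (or c == d) gives a shoulder that stays at 1 up to the
    edge of the range; those branches never divide by zero.
    """
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical expression value checked against two hypothetical fuzzy sets.
print(triangular(5.5, 5.0, 6.0, 7.0))        # degree of "medium"
print(trapezoidal(6.5, 5.0, 5.0, 6.0, 7.0))  # degree of a left-shoulder "low"
```

Each call returns the degree $\mu_A(x)$ in [0, 1] with which the value `x` belongs to the fuzzy set described by the corner points.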

The Mamdani inference system [12] with product t-norm and max t-conorm is used in this paper. Here, the set of input genes is matched against the if part of each if-then rule, and the response of each rule is obtained through the fuzzy implication operation. The response of each rule is weighted according to the extent to which the rule fires. The responses of all the fuzzy rules for a particular output class are combined to obtain the confidence with which the input is classified to the corresponding output class.
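As a rough sketch of this inference step, the snippet below multiplies the antecedent membership degrees of each rule (product t-norm) and keeps the strongest rule per class (max t-conorm). The data layout, the function name `classify` and the toy membership functions are assumptions for illustration; the paper's actual implementation may weight and combine rule responses differently.

```python
from math import prod


def classify(sample, rules, memberships):
    """Return (predicted_class, confidence_per_class).

    sample         -- list of gene expression values
    rules          -- list of (antecedent, class_label); antecedent[i] is a
                      linguistic label for gene i, or None if gene i is unused
    memberships[i] -- dict mapping a label to a membership function
    """
    confidence = {}
    for antecedent, label in rules:
        # Product t-norm: firing strength is the product of the degrees.
        strength = prod(
            memberships[i][term](sample[i])
            for i, term in enumerate(antecedent)
            if term is not None
        )
        # Max t-conorm: combine the rules that vote for the same class.
        confidence[label] = max(confidence.get(label, 0.0), strength)
    best = max(confidence, key=confidence.get)
    return best, confidence


# Toy membership functions on [0, 1] (hypothetical, for illustration only).
memberships = [{"low": lambda x: 1.0 - x, "high": lambda x: x}] * 2
rules = [(("high", None), "disease"), (("low", "low"), "normal")]
label, conf = classify([0.8, 0.5], rules, memberships)
print(label, conf)
```

The predicted class is simply the one with the largest combined confidence.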

3 INTERPRETABILITY—ACCURACY TRADEOFF

There are two main goals [26], [40] in the design of a fuzzy expert system: one is accuracy maximization and the other is complexity minimization. In the past, emphasis was placed on accuracy maximization, and several approaches have been proposed to improve the accuracy of fuzzy systems at the cost of their interpretability. The complexity of fuzzy systems usually increases as a result of accuracy maximization.

Recently, researchers have tried to perform accuracy maximization and complexity minimization simultaneously when designing fuzzy systems. It is, however, impossible to optimize these two objectives simultaneously. Thus, the existence of the accuracy-complexity tradeoff, or interpretability-accuracy tradeoff, in the design of fuzzy systems has been recognized. These two requirements are contradictory and conflict with one another, as depicted in Fig. 2.

The interpretability of a fuzzy system depends on the fuzzy rules that are formed using the input genes of the microarray data. It is observed from our previous GSA approach that the antecedent part of a rule takes only combinations of the discrete integer numbers assigned to the linguistic values of the fuzzy sets. As the antecedent part decides the length and complexity of the rule, selection of linguistic values has to be conducted in the antecedent part to generate small, simple and readable rules. In this respect, forming an effective combination along with the selection of linguistic values is viewed as a combinatorial optimization in the discrete space, and problem specific ant colony optimization is found to be suitable for solving this NP-complete problem.

The accuracy of the fuzzy expert system can be improved by tuning the membership functions. Due to the amorphous nature of gene expression values, tuning needs to be done with a multimodal and multidimensional optimization technique that searches the promising regions of the solution space in detail within a short span of time. In this respect, the artificial bee colony algorithm is found to be suitable, as it can improve the accuracy of the fuzzy expert system without losing interpretability.

Owing to the problem specific nature of the ACO and ABC algorithms, the authors developed the fuzzy expert system by combining the power of the two, and the proposed ABA algorithm addresses the accuracy-interpretability tradeoff reasonably well when compared with the previous GSA approach.

4 ANT BEE ALGORITHM

Ant colony optimization maintains a colony of ants and a set of permissible ranges (PRs) associated with all possible discrete values of the design variables. Each ant is allowed to choose a permissible range that represents its path. Once all ants in the colony have chosen their paths, the discrete value associated with each path is taken as the candidate value for that ant. The candidate values of all the ants are then formed into a combination to evaluate the objective function.

Fig. 1. Fuzzy expert system for microarray data classification.

Fig. 2. Interpretability-accuracy tradeoff.

The artificial bee colony algorithm initializes the positions of the food sources as possible solutions for the optimization task, and the objective function is evaluated in three phases, namely the employed bee phase, the onlooker bee phase and the scout bee phase. Bees assigned to a food source are called employed bees, bees waiting in the dance area to decide on a food source are called onlooker bees, and a bee carrying out a random search is called a scout bee. The proposed Ant Bee Algorithm combines the strengths of ACO and ABC. Fig. 3 shows the various steps in applying the proposed ABA algorithm for designing a fuzzy expert system based microarray data classifier model. The following subsections present the details of the proposed ABA algorithm.

4.1 Modified Representation

ABA uses a modified form of representation for encoding the solution variables of the fuzzy expert system. The rule set encodes the linguistic values (low, medium and high) of the expression value of genes and the sample's class label (normal and disease). The range of each input gene is partitioned into areas for identifying the linguistic value of an input gene. In general, three to seven fuzzy partitions are appropriate to cover the range of expression values of a gene. Since the duty of a rule is to find the specific fuzzy partition based on the expression values of a gene, it is represented using integer numbers. An integer representation of the rule set avoids the Hamming cliff problem of the previous GSA approach, which uses binary strings. Further, the task of rule generation is viewed as a combinatorial optimization, and ACO is applied to find the optimal rule set.

Microarray gene expression data sets are more formless than other benchmark data sets, and the expression values of every gene are highly complex because they are taken from different tissue samples and under different experimental conditions. Hence the points of the membership functions that encode the expression values of genes are represented using continuous numbers. ABC is applied for fine tuning of the points of the membership functions, which distributes the points uniformly within the range of gene expressions with a halfway overlap between them.

With this idea, the modified representation for the rule set and membership function is shown in Fig. 4.

Each rule in a rule set consists of three sections, namely rule selection ('R'), input variables ('$I_1, I_2, I_3, \ldots, I_n$') and output class label ('O'). 'R' can take either '0' to omit the rule or '1' to select the rule. Each of '$I_1, I_2, I_3, \ldots, I_n$' can take any one value among 0, 1, 2 and 3, representing 'none' using 0, 'low' using 1, 'medium' using 2 and 'high' using 3. 'O' can take a number assigned to a class label, which can be '1' for the first class, '2' for the second class and so on. The gene expression values of each input gene are partitioned into three linguistic values, namely low ('L'), medium ('M') and high ('H'), as shown in Fig. 5. A trapezoidal membership function is used to represent the low and high values of the input gene, and a triangular membership function is used to represent the medium value.
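To make the integer encoding concrete, the following sketch decodes one encoded rule `[R, I1, ..., In, O]` into a readable if-then statement. The gene names, class names and the helper `decode_rule` are hypothetical; treating a selected rule with no active input as omitted is also an assumption made here.

```python
# Integer codes for the linguistic values, as described in the text.
LINGUISTIC = {0: None, 1: "low", 2: "medium", 3: "high"}


def decode_rule(encoded, gene_names, class_names):
    """Turn [R, I1, ..., In, O] into text, or None if the rule is omitted."""
    selected, inputs, output = encoded[0], encoded[1:-1], encoded[-1]
    if selected == 0:
        return None  # R = 0 omits the rule
    terms = [
        f"{gene} is {LINGUISTIC[value]}"
        for gene, value in zip(gene_names, inputs)
        if value != 0  # 0 means the gene does not appear in the rule
    ]
    if not terms:
        return None  # assumption: a rule without antecedents is ignored
    return "if " + " and ".join(terms) + f" then class {class_names[output]}"


genes = ["g1", "g2", "g3"]                # hypothetical gene names
classes = {1: "normal", 2: "disease"}     # hypothetical class numbering
print(decode_rule([1, 3, 0, 1, 2], genes, classes))
```

A rule that uses 0 for most inputs decodes to a short antecedent, which is what makes the evolved rules compact.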

As shown in Fig. 5, three membership points are needed to represent each membership function, and hence a total of nine membership points $(P_1, P_2, P_3, P_4, P_5, P_6, P_7, P_8, P_9)$ are required to encode a single input gene. Of these nine points, the first and last points ($P_1$ and $P_9$) are fixed, as they represent the minimum and maximum value of an input gene. The remaining seven membership points are evolved within dynamic ranges such that $P_2$ has $[P_1, P_9]$, $P_3$ has $[P_2, P_9]$, $P_4$ has $[P_2, P_3]$, $P_5$ has $[P_4, P_9]$, $P_6$ has $[P_5, P_9]$, $P_7$ has $[P_5, P_6]$ and $P_8$ has $[P_7, P_9]$ as limits. The numbers 5.6, 6.1, 7.0 and so on are the values of the points $P_1$, $P_2$, $P_3$ and so on, respectively.
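The dynamic ranges listed above can be read as a sequential sampling scheme: each free point is drawn inside limits set by points already fixed. The sketch below initializes the nine points for one gene in that order. The function name, the use of uniform sampling and the example range are assumptions for illustration, not the paper's initialization procedure.

```python
import random


def sample_membership_points(p_min, p_max, rng=random):
    """Draw P1..P9 for one gene, respecting the dynamic ranges in the text."""
    p = [None] * 10            # index 0 unused so that p[1] is P1
    p[1], p[9] = p_min, p_max  # fixed minimum and maximum of the gene
    p[2] = rng.uniform(p[1], p[9])
    p[3] = rng.uniform(p[2], p[9])
    p[4] = rng.uniform(p[2], p[3])
    p[5] = rng.uniform(p[4], p[9])
    p[6] = rng.uniform(p[5], p[9])
    p[7] = rng.uniform(p[5], p[6])
    p[8] = rng.uniform(p[7], p[9])
    return p[1:]


# Hypothetical expression range for a single gene; seeded for repeatability.
points = sample_membership_points(5.6, 9.0, random.Random(0))
print([round(v, 2) for v in points])
```

Because $P_4$ is drawn inside $[P_2, P_3]$ and $P_7$ inside $[P_5, P_6]$, neighbouring membership functions always overlap.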

As an extension of the above method, if five partitions (very low (VL), L, M, H and very high (VH)) are used to represent each input gene, then a total of fifteen membership points $(P_1, P_2, \ldots, P_{15})$ are required. During the course of an ABA run, the solution variables given in Fig. 4 are split into two, such that ACO is used to find the optimal rules and the optimal points of the membership functions are found using ABC.

4.2 Fitness Function Formulation

The next important consideration following the representation is the formulation of the fitness function. In the classification problem under consideration, there are two objectives: one is to maximize the correctly classified data and the other is to minimize the number of rules. These two objectives conflict with each other. This is overcome by reformulating the first objective of maximizing the correctly classified data as minimizing the difference between the total number of samples and the correctly classified data.

Fig. 3. Flowchart of ant bee algorithm.

Fig. 4. Representation of rule set and membership function.

Fig. 5. Fuzzy space with three membership functions per input gene.

Given the total number of samples ('S') and the maximum number of rules ('MNR'), the task is to find the difference between 'S' and the correctly classified data ('$C_c$') for the selected number of rules ('SNR') of every ABA run. During the ABA run, the objective is to find the minimum of this value. This is mathematically represented as

$$\min f = (S - C_c) + (k \times SNR), \tag{2}$$

where 'k' is a constant introduced to amplify 'SNR', whose value is usually small. In this paper, the value of 'k' is taken as 3.

In general, ABA searches for a solution with the maximum fitness function value. Hence, the minimization objective function given by (2) is transformed into a fitness function to be maximized as

$$\text{Fitness} = \frac{K}{f}, \tag{3}$$

where 'K' is another constant used to amplify (1/f), whose value is usually small, so that the fitness value of the chromosome will be in a wider range. In this paper, the value of 'K' is taken as 10.
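Equations (2) and (3) translate directly into code. The sketch below uses the constants stated in the text (k = 3, K = 10); the function names and the sample counts in the example are hypothetical.

```python
def objective(total_samples, correctly_classified, selected_rules, k=3):
    """Eq. (2): misclassified samples plus a penalty on the number of rules."""
    return (total_samples - correctly_classified) + k * selected_rules


def fitness(total_samples, correctly_classified, selected_rules, k=3, K=10):
    """Eq. (3): fitness to be maximized, K / f.

    f is positive whenever at least one rule is selected, so the division
    is safe for any rule set that can actually classify samples.
    """
    return K / objective(total_samples, correctly_classified, selected_rules, k)


# Hypothetical run: 62 samples, 60 classified correctly, 2 selected rules.
print(objective(62, 60, 2))
print(fitness(62, 60, 2))
```

With these settings, one extra rule costs as much as three misclassified samples, which is how the objective favours compact rule sets.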

4.3 Operations of ACO

ACO starts with the construction of the path (permissible range) by calculating the probability $p_{ij}$ from the initial pheromone $\tau_{ij}$ using

$$p_{ij} = \frac{\tau_{ij}}{\sum_{j=1}^{P} \tau_{ij}}, \quad i = 1, 2, \ldots, D, \quad j = 1, 2, \ldots, P, \tag{4}$$

where $D$ and $P$ denote the number of decision variables and the number of permissible values, respectively. The constructed path is selected using the random value initialized for the ants, and is then explored to form different combinations of the decision variables, after which the objective function is evaluated. Two kinds of pheromone updates are carried out until the optimum solution is reached. First, the local pheromone update rule is applied using

$$\tau_{ij} = (1 - \rho)\,\tau_{ij} + \rho\,\tau_0. \tag{5}$$

After all ants have built their solutions, the global pheromone updating rule is applied on the best path using

$$\tau_{ij} = (1 - \rho)\,\tau_{ij} + \left(s + \frac{1}{f_{best} + 1}\right), \tag{6}$$

where $\rho$ is the pheromone decay factor, $\tau_0$ is the minimum pheromone value and '$f_{best}$' refers to the best path. The local pheromone updating rule is applied to decrease the value of the pheromone, which reduces the attraction for later ants. Thus the local pheromone updating rule helps to enhance the diversity of the decision variables. The importance of the global pheromone update is that it distinguishes the best path from the other paths by adding extra pheromone, which encourages future ants to select that path again.
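A compact way to see how (4), (5) and (6) fit together is the sketch below, where `tau[i][j]` holds the pheromone for decision variable `i` and permissible value `j`. Roulette-wheel selection is an assumption made here for the "random value" step, and the function names and parameter values are illustrative, not taken from the paper.

```python
import random


def path_probabilities(tau_row):
    """Eq. (4): normalize the pheromone over the permissible values."""
    total = sum(tau_row)
    return [t / total for t in tau_row]


def choose_path(tau_row, rng):
    """Pick a permissible value by roulette-wheel selection (assumed)."""
    r = rng.random()
    cumulative = 0.0
    for j, p in enumerate(path_probabilities(tau_row)):
        cumulative += p
        if r < cumulative:
            return j
    return len(tau_row) - 1  # guard against floating point round-off


def local_update(tau, i, j, rho, tau0):
    """Eq. (5): decay the pheromone on a path an ant has just used."""
    tau[i][j] = (1 - rho) * tau[i][j] + rho * tau0


def global_update(tau, best_path, f_best, rho, s):
    """Eq. (6): reinforce every (variable, value) pair on the best path."""
    for i, j in enumerate(best_path):
        tau[i][j] = (1 - rho) * tau[i][j] + (s + 1.0 / (f_best + 1.0))


# Two decision variables with four permissible values each (toy setting).
rng = random.Random(0)
tau = [[1.0] * 4 for _ in range(2)]
path = [choose_path(row, rng) for row in tau]
for i, j in enumerate(path):
    local_update(tau, i, j, rho=0.1, tau0=0.01)
global_update(tau, path, f_best=8.0, rho=0.1, s=0.0)
print(path, tau)
```

In a full run, one path is built per ant, the objective of (2) is evaluated for each, and only the best path of the iteration receives the global update.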

4.4 Operations of ABC

ABC starts with the employed bee phase, which performs a neighborhood search to update the food source positions (points of the membership function) using

$$x_{ij}(t+1) = \theta_{ij} + \phi\,\bigl(\theta_{ij}(t) - \theta_{kj}(t)\bigr), \tag{7}$$

where $x_{ij}$ denotes the position of the ith employed bee for the jth variable, $\theta_{kj}$ is the corresponding value of another food source $k \neq i$, t is the iteration number and $\phi$ is a real number randomly generated in the range [−1, 1]. A local selection process called greedy selection is carried out by the employed bees to memorize the best food source. After that, each onlooker bee selects a food source from the employed bees based on the probability value associated with the food source position. The probability value $P_i$ is calculated using

$$P_i = \frac{F(\theta_i)}{\sum_{i=1}^{s} F(\theta_i)}, \tag{8}$$

where $F(\theta_i)$ is the fitness of the ith employed bee. Then the onlooker bee produces a candidate food position using equation (7), from which the best food source position is selected using greedy selection. If a food position cannot be improved for a predetermined number of cycles, it is abandoned and replaced by a new, randomly generated food source found by a scout bee using

$$\theta_{ij} = \theta_{ij}^{\min} + r \cdot \bigl(\theta_{ij}^{\max} - \theta_{ij}^{\min}\bigr), \tag{9}$$

where r is a random number and r ∈ [0, 1].
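The three ABC phases can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the neighbor index k is passed in explicitly, and the food sources are small hand-picked lists.

```python
import random


def neighbor(theta, i, j, k, phi):
    """Equation (7): move variable j of food source i relative to source k."""
    candidate = list(theta[i])
    candidate[j] = theta[i][j] + phi * (theta[i][j] - theta[k][j])
    return candidate


def onlooker_probabilities(fitness):
    """Equation (8): fitness-proportional selection probabilities."""
    total = sum(fitness)
    return [f / total for f in fitness]


def scout(lower, upper, rng):
    """Equation (9): draw a fresh food source inside [lower, upper]."""
    return [lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]


rng = random.Random(0)
theta = [[0.2, 0.8], [0.6, 0.4]]      # two food sources, two variables each
cand = neighbor(theta, i=0, j=1, k=1, phi=0.5)
probs = onlooker_probabilities([1.0, 3.0])
fresh = scout([0.0, 0.0], [1.0, 2.0], rng)
```

A greedy selection step would then keep `cand` only if its fitness exceeds that of `theta[0]`; that comparison is omitted here because it depends on the objective function.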

5 SIMULATION RESULT

This section presents the details of the simulation carried out using six gene expression data sets. Simulations are conducted to examine the learning ability as well as the generalization ability of the proposed Ant Bee Algorithm. The proposed ABA approach is implemented in MATLAB 7.5 and executed on a PC with an Intel Core 2 Duo processor running at 2.60 GHz and 2 GB of RAM.

5.1 Gene Expression Data Sets

The data sets considered in the simulation are Type 2 Diabetes (T2D), Colon cancer (Col), Leukemia (Leu), Lymphoma (Lym), Rheumatoid Arthritis versus Osteoarthritis (RAO) and Rheumatoid Arthritis versus Controls (RAC). All these data sets are publicly available two-class gene expression profiles. Table 1 gives the details of the gene expression data sets used in the simulation.

5.2 Gene Selection

The Mutual Information (MI) technique is used to select informative genes from the original gene expression profile. Computing MI requires the probability distributions of the genes, which in practice are not known; the best we can do is to estimate them from the histogram of the data. The steps involved in computing the MI from the histogram of the training data are given below:

• The data set is arranged in ascending order based on the output.


• The output class label is divided into two groups and the initial entropy is calculated using

$$H(Y) = -\sum_{j=1}^{N_y} P(y_j)\,\log\bigl(P(y_j)\bigr). \tag{10}$$

• The input genes are divided into 10 levels and their conditional entropies are evaluated using

$$H(Y \mid X) = -\sum_{i=1}^{N_x} P(x_i) \left( \sum_{j=1}^{N_y} P(y_j \mid x_i)\,\log P(y_j \mid x_i) \right). \tag{11}$$

• Next, the mutual information of each gene with respect to the output is computed using

$$I(Y; X) = H(Y) - H(Y \mid X). \tag{12}$$

In equations (10)-(12), X corresponds to the set of input genes and Y corresponds to the class label. Fig. 6 shows the mutual information of 19,319 input genes of the T2D data set (excluding the genes with missing values) with respect to the output classes. From this figure, it is evident that only a few genes carry significant information about the disease and the remaining genes carry very little. In line with studies suggesting that only a few genes are sufficient for understanding the biological relationship with the target disease, the 10 genes with the highest MI values are selected as informative genes. Table 2 gives the details of the genes selected using MI for the T2D data set. Simulations are conducted to analyze the performance of the proposed ABA approach in terms of classification accuracy and interpretability of the rules.
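The histogram-based MI computation in (10)-(12) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes equal-width binning into 10 levels and the natural logarithm, neither of which is specified in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Equation (10): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(values, labels, levels=10):
    """Equations (11)-(12) using an equal-width histogram of one gene."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / levels or 1.0      # guard against a constant gene
    bins = [min(int((v - lo) / width), levels - 1) for v in values]
    n = len(values)
    h_cond = 0.0
    for b, count in Counter(bins).items():
        subset = [y for x, y in zip(bins, labels) if x == b]
        h_cond += (count / n) * entropy(subset)
    return entropy(labels) - h_cond


# A gene that separates the two classes, and a constant gene that does not.
labels = [1, 1, 1, 2, 2, 2]
mi_informative = mutual_information([0.1, 0.2, 0.3, 0.9, 1.0, 1.1], labels)
mi_constant = mutual_information([1.0] * 6, labels)
```

Ranking all genes by this score and keeping the top 10 would reproduce the selection step described above, under the stated binning assumption.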

5.3 Accuracy

Learning ability and generalization ability are the two measures used to study the accuracy of the proposed ABA approach.

5.3.1 Learning Ability

The learning ability is examined by using all the samples of a data set as training patterns. By following the representation strategy given in Fig. 4, the membership function and the rule set are represented by a mixed number. A maximum of 10 rules are included in the solution space. Each antecedent part is represented using an integer number, with ‘1’ representing “low,” ‘2’ representing “medium” and ‘3’ representing “high.” The output class is also represented using an integer number, with ‘1’ representing ‘class 1’ and ‘2’ representing ‘class 2.’

A number at the start of each rule is used for rule selection: ‘1’ selects the rule and ‘0’ omits it. When coded using integer numbers, each rule requires 12 design variables (1 for rule selection, 10 for input genes and 1 for output class) and hence a total of 120 (12 × 10) design variables are needed to represent the complete rule set in the solution space. For each variable in the rule set, a pheromone matrix is initialized and a probability matrix is constructed from the pheromone matrix using equation (4). For each design variable, a permissible range matrix is constructed from the probability matrix by using the following procedure:

$$PR_{i,j} = \begin{bmatrix}
0 & p_{i,j} \\
p_{i,j} & p_{i,j} + p_{i,j+1} \\
p_{i,j} + p_{i,j+1} & p_{i,j} + p_{i,j+1} + p_{i,j+2} \\
\vdots & \vdots \\
\vdots & \vdots \\
\vdots & p_{i,j} + p_{i,j+1} + \cdots + p_{i,j+P}
\end{bmatrix}, \qquad i = 1, \ldots, D; \; j = 1, \ldots, P. \tag{13}$$

Fig. 6. Mutual information for the input genes of T2D.

TABLE 2. Genes Selected through MI for T2D

TABLE 1. Details of Gene Expression Data Set


Ants for each design variable are randomly generated within the overall range of the permissible range matrix, i.e., from 0 to $p_{i,j} + p_{i,j+1} + \cdots + p_{i,j+P}$. Each ant value is mapped to the permissible range that contains it in order to identify the path. Once a path is identified, its corresponding index value is taken as the candidate value for the design variable. The candidate values of the design variables found during path identification are formed into different combinations, which are used for evaluating the objective function.
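The mapping from an ant value to a candidate index can be sketched as a cumulative-sum lookup. The following Python fragment is only an illustration of that idea: the function names are hypothetical, the probabilities are made up, and assigning a value that falls exactly on a boundary to the lower interval is an assumption made here.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def permissible_range(probs):
    """Upper bounds of the cumulative intervals built from one row of p."""
    return list(accumulate(probs))


def select_index(upper_bounds, ant_value):
    """Index of the interval that contains the ant value."""
    idx = bisect_left(upper_bounds, ant_value)
    return min(idx, len(upper_bounds) - 1)


rng = random.Random(1)
probs = [0.2, 0.5, 0.3]                # one decision variable with P = 3
bounds = permissible_range(probs)      # cumulative upper bounds of the row
ant = rng.uniform(0.0, bounds[-1])     # ant value inside the overall range
choice = select_index(bounds, ant)     # candidate value (0-based index)
```

Because the interval widths equal the probabilities, an index with more pheromone is selected more often, which is the roulette-wheel behavior implied by (13).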

After evaluating the objective function, the combination that gives the optimum objective value is selected, its associated permissible range is taken as the best path, and pheromone updation is carried out. The updated pheromone value is then used to update the probability value, which in turn updates the permissible range value. The initial pheromone matrix is updated on the selected best path to direct future ants more strongly toward better solutions.

As mentioned in Fig. 5, seven points are required to represent an input gene and hence a total of seventy (10 × 7 = 70) membership function points are needed. The range of each membership function point is computed dynamically as discussed in Section 4.1. The initial set of possible membership function points is randomly generated within the dynamic range and all the solutions are assigned to employed bees. The fitness of the employed bee is calculated using (14), where ‘$f_i$’ is the value of the objective function:

$$fit(i) = \begin{cases} \dfrac{1}{1 + f_i}, & f_i \geq 0, \\[2mm] 1 + \mathrm{abs}(f_i), & f_i < 0. \end{cases} \tag{14}$$
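As a small worked example of (14), the mapping from objective value to bee fitness can be written directly in Python. The function name is chosen here for illustration only.

```python
def fitness(f_i):
    """Equation (14): map an objective value to a bee fitness."""
    if f_i >= 0:
        return 1.0 / (1.0 + f_i)
    return 1.0 + abs(f_i)


# Smaller objective values give larger fitness in both branches.
examples = [fitness(2.0), fitness(0.0), fitness(-2.0)]
```

The mapping is monotonically decreasing in the objective value, so a minimization objective turns into a fitness that the onlooker selection in (8) can use directly.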

Then a neighborhood search is performed using (7) and a greedy selection process is carried out. After that, the onlooker phase is conducted using (8), followed again by a greedy selection, and then the scout phase is executed using (9). The whole process of ABC is repeated for membership function tuning along with the ACO for optimal rule generation. The complete ABA (ACO + ABC) algorithm is run with different values of the control parameters for 30 independent trials, and the optimal results are obtained with the following setting:

• Number of Ants: 30
• Pheromone decay factor (ρ): 0.5
• Pheromone error factor (σ): 1
• Min. Pheromone value (τ0): 0.01
• Colony size (Csize): 30
• Number of Employed Bee (NE): 15
• Number of Onlooker Bee (NO): 15
• Number of Scout Bee (NS): 1.

A fuzzy system with four rules and 98.5 percent classification accuracy was evolved. The four near optimal rules evolved by the ACO for the T2D data set are given below.

Rule 1 If NM_021133.1 is medium and NM_0223491.1 is low then it is NGT.

Rule 2 If AW291218 is low and AL523575 is high then it is NGT.

Rule 3 If NM_022349.1 is medium and BC000229.1 is low then it is DM2.

Rule 4 If NM_005260.2 is high and BG339560 is medium then it is DM2.

From the rule sets evolved by the proposed approach for the T2D data set, it is substantiated that the ACO simplifies the description of each rule by eliminating the irrelevant genes and involving only the few input genes that have reasonable significance in producing the output. The optimal membership function points obtained by the ABC for the Gene ID: AL523275 of T2D are given in Table 3.
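To make the integer coding of a rule concrete, the following Python sketch decodes one 12-variable rule into an if-then statement. It is an illustration only. In particular, the text does not state how an eliminated gene is coded, so the value 0 is assumed here to mean "gene not used"; the gene names are placeholders.

```python
LABELS = {1: "low", 2: "medium", 3: "high"}


def decode_rule(code, gene_ids, class_names):
    """Turn one 12-integer rule into text, or None if the rule is switched off."""
    selected, antecedents, output = code[0], code[1:-1], code[-1]
    if selected == 0:
        return None
    # Assumption: an antecedent value of 0 marks a gene left out of the rule.
    terms = [f"{g} is {LABELS[v]}" for g, v in zip(gene_ids, antecedents) if v != 0]
    return "If " + " and ".join(terms) + f" then it is {class_names[output]}"


genes = [f"gene{k}" for k in range(1, 11)]     # placeholder gene names
classes = {1: "class 1", 2: "class 2"}
rule = decode_rule([1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2], genes, classes)
omitted = decode_rule([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], genes, classes)
```

Under this assumed coding, a rule set of 10 rules is simply a list of 120 integers, which is the form in which the ACO path construction manipulates it.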

The points P1 and P9 are fixed points that are the minimum and maximum expression values of the gene AL523275. It is clear from Table 3 that the values obtained for all other points are reasonable and are not skewed toward the lower or higher expression values of the dynamic range produced during membership function tuning. Fig. 7 shows the optimal membership functions drawn using these points.

To compare the performance of the proposed ABA, four different approaches were developed. The first one is the Binary Coded Genetic Algorithm (BCGA), which uses binary strings for solution variables and basic genetic operators such as tournament selection, two point crossover and bitwise mutation. The second one is the Real Coded Genetic Algorithm (RCGA), which represents solution variables as floating point numbers and uses tournament selection, BLX-α crossover and non-uniform mutation for genetic operation. The third one evolves the rule set and membership function simultaneously using standard PSO, and the last one is our previous hybrid GSA approach.

It is seen from Fig. 7 that the fuzzy set achieved by ABA is justifiable, without any skewness, and has a halfway overlap between the linguistic labels low, medium and high when compared with our previous GSA approach. The fuzzy set formed using PSO in GSA is biased toward either low or high, since the tuning is not fair enough to make each membership function distinguishable from the others. This confirms that ABC is more effective than PSO in tuning the continuous values of the membership function points. With these distinct membership values obtained using ABC, an accurate fuzzy expert system is achieved, which corroborates the usefulness of the proposed ABA.

The convergence characteristics of the proposed ABA method in designing the fuzzy expert system during learning are shown in Fig. 8.

TABLE 3. Optimal Membership Function Points Obtained by ABC

Fig. 7. Optimal membership function formed by ABC.


It is found that the proposed ABA shows a sharp increase in the fitness value for the first 40 iterations. After that, its fitness value continues to improve drastically up to 70 iterations, and beyond 140 iterations it reaches the optimal value. Accordingly, the average and worst fitness values also show considerable improvement in each generation.

Table 4 presents a comparison between the proposed ABA and the other algorithms in terms of the fitness value obtained, the generations taken and the CPU time consumed. All algorithms perform convincingly well. PSO is the fastest of all the algorithms because of its simplified operations, but it fails to produce a better optimal value than GSA and ABA. Further, the proposed ABA algorithm and our previous GSA algorithm are competitive with one another in all the metrics. ABA consumes slightly more CPU time than GSA, except for the lymphoma data set, because of the combinatorial operation of ACO in forming simple rules. On the whole, the performance of the proposed ABA is reasonably good, and it shows a small improvement in the fitness value over GSA in all the cases.

5.3.2 Generalization Ability

A cross validation procedure uses the data more efficiently and can be used to assess the performance of a classifier model. Since the proposed method uses medical/biological data, the cross validation procedure should be concerned with the performance of the proposed method on future independent data sets. In this respect, Monte Carlo Cross Validation (MCCV) [41], [42] is used, as it is a suitable method for multiscale genomic data analysis. In MCCV, the training sets Tr(s) (s = 1, 2, 3, ..., S) are selected randomly and without replacement out of the ‘N’ samples of a data set ‘D’. The testing sets Te(s) consist of the remaining observations N\Tr(s), in the user-fixed common ratio NTr(s):NTe(s). Each testing set contains the samples that are not in the corresponding training set. The MCCV error rate is given as

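$$\hat{\varepsilon}^{M}_{\mathrm{MCCV}}\Bigl(D, \bigl(Tr(s)\bigr)_{s=1,2,3,\ldots,S}\Bigr) = \frac{1}{S} \sum_{s=1}^{S} \hat{\varepsilon}^{M}_{\mathrm{TEST}}\Bigl(D, \bigl(Tr(s), Te(s)\bigr)\Bigr). \tag{15}$$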

To estimate the error, the data set is repeatedly split into a training set and a testing set in the ratio of 4:1. In each MCCV iteration, MI is computed for informative gene selection. After that, the proposed ABA based fuzzy expert system is developed using the training set and its performance is evaluated using the testing set. Error rates are estimated as the proportion of misclassified independent test observations. The whole procedure is repeated 30 times and the error rates are averaged. Table 5 gives the error estimated using the proposed ABA based fuzzy expert system.
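The MCCV loop described above can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: `train_and_test` is a hypothetical callback that would perform MI-based gene selection and ABA training on the training indices and return the test error, and a trivial majority-class stand-in is used here so that the sketch runs on its own.

```python
import random


def mccv_error(samples, labels, train_and_test, splits=30, train_fraction=0.8, seed=0):
    """Average test error over random splits drawn without replacement."""
    rng = random.Random(seed)
    n = len(samples)
    n_train = int(round(train_fraction * n))
    errors = []
    for _ in range(splits):
        order = rng.sample(range(n), n)             # random permutation of indices
        train, test = order[:n_train], order[n_train:]
        errors.append(train_and_test(samples, labels, train, test))
    return sum(errors) / len(errors)


def majority_classifier(samples, labels, train, test):
    """Toy stand-in for the classifier: predict the most common training label."""
    train_labels = [labels[i] for i in train]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] != guess for i in test) / len(test)


samples = list(range(10))
error_easy = mccv_error(samples, [1] * 10, majority_classifier)
error_mixed = mccv_error(samples, [1, 2] * 5, majority_classifier)
```

The default `train_fraction=0.8` and `splits=30` correspond to the 4:1 ratio and the 30 repetitions reported above; performing gene selection inside the callback keeps the test samples out of the selection step.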

It is observed that the mean (μ) and standard deviation (σ) of the classification error are low for all the data sets. An important point to note is that if the number of genes is increased, the error rate also increases slightly. This shows that some of the genes are individually relevant but redundant as a group, since they reduce the overall classification accuracy of the proposed ABA based fuzzy expert system.

5.4 Interpretability

The interpretability of the rule set is assessed using the procedure of rule assessment and rule set analysis [43].

Fig. 8. Convergence of ABA for T2D data set.

TABLE 5. Mean Classification Error of MCCV

TABLE 4. Comparison of CPU Time of ABA with Other Algorithms


Coverage and accuracy are the two measures for assessing a rule ‘R’ in the rule set, and they are calculated as

$$\mathrm{Coverage}(R) = \frac{N_{covers}}{S}, \tag{16}$$

$$\mathrm{Accuracy}(R) = \frac{N_{correct}}{N_{covers}}, \tag{17}$$

where ‘Ncovers’ is the number of samples covered by ‘R’ out of the total number of samples ‘S’, and ‘Ncorrect’ is the number of samples correctly classified by ‘R’ out of ‘Ncovers’. The performance of the near optimal rule set obtained for every data set using the proposed ABA is evaluated by allowing one rule at a time to classify all the samples of a data set, and the readings are given in Table 6.
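A small worked example of (16) and (17) is sketched below in Python. How a fuzzy rule is judged to "cover" a sample is not detailed here, so the sketch simply takes a list of booleans as given (an assumption, for instance a firing strength above some threshold); the data are made up.

```python
def rule_coverage_accuracy(fires, predicted_class, labels):
    """Equations (16) and (17) for a single rule.

    fires[k] is True when the rule covers sample k.
    """
    n_covers = sum(fires)
    n_correct = sum(1 for f, y in zip(fires, labels) if f and y == predicted_class)
    coverage = n_covers / len(labels)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy


# Hypothetical rule that predicts class 1 and covers the first three samples.
fires = [True, True, True, False, False, False, False, False]
labels = [1, 1, 2, 2, 2, 1, 2, 1]
cov, acc = rule_coverage_accuracy(fires, predicted_class=1, labels=labels)
```

Here the rule covers 3 of 8 samples and classifies 2 of those 3 correctly, so the coverage is 3/8 and the accuracy is 2/3.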

It is found that most of the rules in a rule set produce good accuracy for the samples they cover. The coverage computed for each rule indicates that the rules are mutually exclusive and cover a sufficient number of samples. Further, the overlap among the rules in a rule set is minimal within the attributes, and the rule set is more comprehensible due to the use of the AND operation. In Table 7, the mean coverage of the rule set (μC), the number of variables (#V) and the mean number of membership functions (μMF) obtained for both the GSA and the ABA approaches are reported.

It is observed that the GSA and ABA approaches produce comparable coverage values, with ABA being slightly better than GSA. Further, it is clear that ABA obtains a simpler rule set, with far fewer genes and membership functions than GSA. Thus ABA shows improved interpretability over GSA. In Table 8, the ID of the selected gene (GID), the number of membership functions (#MF) and their labels are reported for all the data sets. During the run, the proposed ABA tunes the membership function of each gene, resulting in the selection of a minimum number of genes with fewer membership functions for each gene. Moreover, the linguistic label obtained for each variable is evident and complete.

5.5 Receiver Operating Characteristics (ROC) Analysis

The receiver operating characteristics (ROC) [44] curve is a graphic representation of the relationship between sensitivity and specificity, and it helps to visualize the performance of the classifier. The ROC curve is plotted using the true positive rate (TPR) against the false positive rate (FPR) for different cut-points of a diagnostic test, starting from coordinate (0, 0) and ending at coordinate (1, 1). FPR (1 − specificity) is represented by the x-axis and TPR (sensitivity) is represented by the y-axis. Fig. 9 shows the two dimensional ROC curve for all data sets considered in the simulation.

TABLE 6. Performance of Rule Sets of ABA in All the Data Sets

TABLE 7. Interpretability Comparison of ABA with GSA

TABLE 8. Details of Genes in the Rule Set Formed by ABA

Each point of the ROC curve shown in Fig. 9 represents a sensitivity/specificity pair corresponding to a particular threshold. The interpretation of the ROC curve is similar to that of a single point in the ROC space: the closer the points on the ROC curve are to the ideal coordinate, the more accurate the test is, and the closer they are to the diagonal, the less accurate the test is. From Fig. 9, it is observed that the ROC curves for all the data sets are reasonably close to the upper left corner, which confirms the high sensitivity/specificity rate and overall accuracy of the proposed approach. As this performance analysis confirms that the proposed approach is good at detecting positives with a low false positive rate, it is corroborated that the proposed ABA is suitable for diagnostic based disease classification.
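The construction of ROC points from classifier outputs can be sketched as follows. This Python fragment is an illustration with made-up scores and labels; it assumes that the classifier provides a continuous score per sample and that label 1 denotes the positive class, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

The curve starts at (0, 0) and ends at (1, 1) as described above, and the area under it summarizes in a single number how close the curve lies to the upper left corner.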

Even though the proposed model has high sensitivity and specificity, false positives and false negatives also occur. In general, of the two false results, a false positive is regarded as more serious in a screening test, because a person should not be told that they have a serious disease when they do not. For the colon cancer, lymphoma, leukemia and RAC data sets, the proposed ABA approach has zero false positives. For the other data sets, the number of false positives produced by the proposed ABA approach is very low. This substantiates that the discrimination power of the gene subset and linguistic labels selected by the proposed method is very high, since they are good at detecting disease and at correctly ruling out the normal samples.

The performance of the proposed ABA is compared with the other algorithms using the area under the ROC curve (AUC), which is reported in Table 9. From these comparisons, it is found that ABA provides better classification performance with a more compact rule set than the other algorithms in designing the fuzzy expert system for all the six data sets.
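The average ranks listed in Table 9 follow from ranking the algorithms on each data set and averaging the ranks. The Python sketch below recomputes them from the AUC values reported in that table; the helper name is hypothetical and ties are not handled, since none occur in the table.

```python
ALGORITHMS = ["BCGA", "RCGA", "PSO", "GSA", "ABA"]
AUC = {                                   # values as reported in Table 9
    "Col": [0.731, 0.745, 0.758, 0.766, 0.785],
    "Lym": [0.638, 0.692, 0.713, 0.752, 0.769],
    "Leu": [0.399, 0.419, 0.436, 0.459, 0.497],
    "RAC": [0.568, 0.623, 0.676, 0.653, 0.691],
    "RAO": [0.634, 0.712, 0.735, 0.768, 0.803],
    "T2D": [0.426, 0.462, 0.483, 0.525, 0.561],
}


def ranks_descending(values):
    """Rank 1 goes to the largest value; ties are not handled."""
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    ranks = [0] * len(values)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks


rank_rows = [ranks_descending(row) for row in AUC.values()]
avg_rank = {
    name: sum(row[k] for row in rank_rows) / len(rank_rows)
    for k, name in enumerate(ALGORITHMS)
}
```

With these values ABA obtains rank 1 on every data set, PSO averages 17/6 and GSA averages 13/6, which agrees with the tabulated 2.83 and 2.16 up to rounding.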

5.6 Performance Comparison

Table 10 shows the performance comparison of the results obtained by the proposed approach with some existing approaches in the literature. For each data set, it reports the number of informative genes (#IG) produced by each approach and the percentage of correctly classified samples (%CC). The first approach, HykGene (HG), is freely available software that combines gene ranking and clustering to find a very small subset of genes for phenotype classification of any microarray data. The second approach, Hybrid Fuzzy (HF), is a genetic algorithm tuned fuzzy classifier method, which is implemented with 20 as the number of rules, 4 as the number of replaced rules, 0.2 as the penalty term, and 0.5, 0.9 and 0.1 as the don’t care, crossover and mutation probabilities respectively. f-Information (f-I), fuzzy rough set based f-Information (FRf-I), Multistage Mutual Information (MSMI), and Genetic Swarm Algorithm are our previously developed approaches. Readers can refer to papers [46], [47] and [23] for the parameter settings.

It is corroborated from this comparison that HG, f-I, FRf-I, and MSMI use neural networks and support vector machines for sample classification and their results are not understandable, whereas HF, GSA and ABA use a fuzzy expert system that produces understandable classification results in the form of if-then rules. In HF, a single optimization algorithm (GA) is used, whose emphasis is on classification accuracy and not on producing interpretable rules. Similarly, even though GSA has two optimization algorithms (GA + PSO), they are mainly used to address the complexity of microarray data in order to improve the classification accuracy, and they fail to produce a compact rule set. It is observed that the ACO in the proposed ABA performs gene filtering in each rule through linguistic selection based on the expression values, which makes the rule set simpler, more compact and more interpretable than with GA. Further, the ABC algorithm in the proposed ABA tunes the membership function efficiently, which results in better classification accuracy with less complexity than GA and PSO.

Fig. 9. ROC analysis of all data sets.

TABLE 9. Statistical Comparison of ABA with Other Approaches

Data Sets   BCGA       RCGA       PSO        GSA        ABA
Col         0.731(5)   0.745(4)   0.758(3)   0.766(2)   0.785(1)
Lym         0.638(5)   0.692(4)   0.713(3)   0.752(2)   0.769(1)
Leu         0.399(5)   0.419(4)   0.436(3)   0.459(2)   0.497(1)
RAC         0.568(5)   0.623(4)   0.676(2)   0.653(3)   0.691(1)
RAO         0.634(5)   0.712(4)   0.735(3)   0.768(2)   0.803(1)
T2D         0.426(5)   0.462(4)   0.483(3)   0.525(2)   0.561(1)
Avg. Rank   5          4          2.83       2.16       1

TABLE 10. Performance Comparison with the Other Approaches


5.7 Gene Ontology (GO) Based Biological Semantics

The biological significance of the genes selected and samples classified by the proposed method is investigated using Gene Ontology based biological semantics [48]. GO terms cover the functions of genes in three orthogonal taxonomies: molecular function (MF), biological process (BP) and cellular component (CC). According to Lin’s measure [49], the biological similarity of genes in a sample ‘S’ is given by

$$r(S) = \frac{2}{|S|\,(|S| - 1)} \sum_{i,j \in S,\; i \neq j} \frac{1}{|T_i|\,|T_j|} \sum_{t_i \in T_i,\; t_j \in T_j} r(t_i, t_j), \tag{18}$$

where $T_i$ and $T_j$ are the sets of terms annotating the two genes i and j respectively. The GOSim [50] package developed in R is used to compute the above measure. Gene Entrez IDs are obtained from the National Center for Biotechnology Information (NCBI) database. Table 11 gives the GO-based similarities for the informative genes selected using MI for the colon cancer data set.
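The structure of (18) can be illustrated with a short Python sketch. It is not a substitute for the GOSim package: the term similarity is supplied as a function, and the exact-match similarity used in the example is a placeholder, not Lin's measure. The sketch also assumes that the sum in (18) runs over unordered gene pairs, which is what the normalization 2/(|S|(|S| − 1)) suggests.

```python
from itertools import combinations


def group_similarity(annotations, term_sim):
    """Equation (18): mean pairwise gene similarity within one gene set.

    annotations maps a gene to the list of GO terms annotating it;
    term_sim(t_i, t_j) returns a similarity score for two terms.
    """
    genes = list(annotations)
    total = 0.0
    for gi, gj in combinations(genes, 2):          # unordered pairs (assumption)
        ti, tj = annotations[gi], annotations[gj]
        pair_sum = sum(term_sim(a, b) for a in ti for b in tj)
        total += pair_sum / (len(ti) * len(tj))
    n = len(genes)
    return 2.0 * total / (n * (n - 1))


def exact_match(a, b):
    """Placeholder term similarity: 1 for identical terms, 0 otherwise."""
    return 1.0 if a == b else 0.0


# Made-up annotations with placeholder term names.
annotations = {"g1": ["A", "B"], "g2": ["A"], "g3": ["C"]}
similarity = group_similarity(annotations, exact_match)
```

In this toy case only the pair (g1, g2) shares a term, contributing 1/2, so the group similarity is 2 × 0.5 / (3 × 2) = 1/6.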

The GO terms of each gene that causes colon cancer and the processes involved are taken from the GOSim package, and then their similarity values are computed. All the processes mentioned in Table 11 are related to DNA metabolism, which is one of the major causes of colon cancer. This process involves only repair, positive regulation, reduction, cell size, development and assembly, and has no molecular or cellular reaction. Hence only the BP values are reported; MF and CC values are not available.

Figs. 10 and 11 show the directed acyclic graph (DAG) and the ancestor chart view of the genes selected by the proposed approach, obtained using the GOSim package. The DAG shows the network analysis and visualization of the genes involved in colon cancer. Here a node represents a GO term and an arrow indicates the hierarchical process. For better understanding of this process, the ancestor chart view is given in Fig. 11. It is observed that GO:0008150, which is involved in the metabolic process of BP and is present in the SMARCC1 gene, has an ‘is-a’ relationship with GO:0008152.

The metabolic process has the following children: GO:009987, GO:0044260, GO:0044237, GO:0006807 and GO:0044230. Each child has a distinct relationship with the others. The Cellular Macro Molecular Metabolic process, which occurs in the DNA metabolic process, is present in the DEPDC6 gene and has an ‘is a’ relationship with GO:0008152. In the same manner, the Cellular Metabolic Process (GO:0044237) from the APEX1 gene has an ‘is a’ relationship with GO:0008152 and is also involved in a DNA Metabolic Process.

TABLE 11. Gene Ontology Analysis

Fig. 10. DAG graph of GO analysis.

Fig. 11. Ancestor chart view of gene ontology analysis.


The Nitrogen Compound Metabolic process of the DNASEIL2 gene has an ‘is a’ relationship with GO:0008152 and is involved in the DNA metabolic process. The Primary Metabolic process of MFAP4 has an ‘is a’ relationship with GO:0008152, and the Cellular process of MFAP4 has an ‘is a’ relationship with GO:0008152. The process of Cellular Nitrogen Compound Metabolism present in SNRNP200 has a ‘part of’ relationship with GO:0044237 and GO:0006807. The process of Nucleobase containing Compound Metabolism present in EIF2A has a ‘part of’ relationship with GO:0034641 and GO:0006139. GO:0006139 has an ‘is a’ relationship with GO:0090304 and is present in AMELX.

The DNA metabolic process, which is involved in colon cancer, has a ‘part of’ relationship with GO:0044260 and GO:0090304, taken from the FGF19 gene. It is corroborated that the selected genes are involved in the DNA metabolic process of colon cancer, which explains the good classification accuracy they give. This kind of analysis is performed for the remaining data sets. However, as most of the genes measured by microarray technology do not have entries in the GO database, it is not viable at this time for all the data sets.

6 CONCLUSION

The bottleneck of a fuzzy expert system for microarray data classification is knowledge acquisition in the form of if-then rules and membership functions. In this paper, an Ant Bee Algorithm is proposed to address the accuracy-interpretability tradeoff in the design of a fuzzy expert system for sample classification. In the proposed ABA, the rule set is represented using integer numbers and evolved using ACO. The membership function values are represented using floating point numbers and are evolved using ABC simultaneously with the rule set. The effectiveness of the proposed approach has been demonstrated using six microarray data sets. From the simulation results, it is understood that the learning ability of ABA is comparable to that of the other approaches, and that its classification error, estimated for all the data sets using the MCCV procedure during generalization, is lower than that of the other approaches.

Further, through ROC analysis, it is observed that the proposed ABA approach has a low false positive rate and high discrimination power in improving the classification accuracy. With the help of GO analysis, it is confirmed that the linguistic labels identified for the genes using the proposed ABA approach are involved in metabolic processes and have biological significance in the classification of microarray samples. On the whole, for all the data sets, the proposed ABA approach generated a more compact (average of 5.4 genes in a rule), accurate (average of 98.5 percent overall classification accuracy) and interpretable (average of 2.3 linguistic labels for a gene in a rule) fuzzy expert system than GSA and the other approaches reported in the literature.

ACKNOWLEDGMENTS

The authors are very grateful for the reviewers’ highly valuable comments and suggestions, which improved the manuscript.

REFERENCES

[1] T.L. Bergemann and L.P. Zhao, “Signal Quality Measurements for cDNA Microarray Data,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 299-308, Mar./Apr. 2010.

[2] A. Benso, S.D. Carlo, and G. Politano, “A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnosis Based on Graph Theory,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 577-591, May/June 2011.

[3] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, vol. 286, pp. 531-537, 1999.

[4] L. Li, “Gene Selection for Sample Classification Based on Gene Expression Data: Study of Sensitivity to Choice of Parameters of the ga/knn Method,” Bioinformatics, vol. 17, pp. 1131-1142, 2001.

[5] S. Dudoit, J. Fridlyand, and T.P. Speed, “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data,” J. Am. Statistics Assoc., vol. 97, no. 457, pp. 77-87, 2000.

[6] G. Fort and S.L. Lacroix, “Classification Using Partial Least Squares with Penalized Logistic Regression,” Bioinformatics, vol. 21, no. 7, pp. 1104-1111, 2005.

[7] L. Fan, K.L. Poh, and P. Zhou, “A Sequential Feature Extraction Approach for Naïve Bayes Classification of Microarray Data,” Expert Systems with Applications, vol. 36, no. 6, pp. 9919-9923, 2009.

[8] J. Khan, J.S. Wei, M. Ringnér, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, and P.S. Meltzer, “Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks,” Nature Medicine, vol. 7, pp. 673-679, 2001.

[9] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler, “Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Expression Data,” Bioinformatics, vol. 16, pp. 906-914, 2000.

[10] A.C. Tan and D. Gilbert, “Ensemble Machine Learning on Gene Expression Data for Cancer Classification,” Applied Bioinformatics, vol. 2, pp. 75-83, 2003.

[11] D.E. Johnson, F.J. Oles, T. Zhang, and T. Goetz, “A Decision-Tree-Based Symbolic Rule Induction System for Text Categorization,” IBM Systems J., vol. 41, no. 3, pp. 1-10, 2002.

[12] J.S.R. Jang, C.T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing. Prentice Hall, 1997.

[13] A.C. Tan, D.Q. Naiman, L. Xu, R.L. Winslow, and D. Geman, “Simple Decision Rules for Classifying Human Cancers from Gene Expression Profiles,” Bioinformatics, vol. 21, pp. 3896-3904, 2005.

[14] Y. Yoon, S. Bien, and S. Park, “Microarray Data Classifier Consisting of k-Top-Scoring Rank-Comparison Decision Rules with a Variable Number of Genes,” IEEE Trans. Systems, Man, and Cybernetics-Part C: Applications and Rev., vol. 40, no. 2, pp. 216-226, Mar. 2010.

[15] P. Woolf and Y. Wang, “A Fuzzy Logic Approach to Analyzing Gene Expression Data,” Physiological Genomics, vol. 3, pp. 9-15, 2000.

[16] S. Vinterbo, “Small, Fuzzy and Interpretable Gene Expression Based Classifiers,” Bioinformatics, vol. 21, no. 9, pp. 1964-1970, 2005.

[17] G. Schaefer, “Thermography Based Breast Cancer Analysis Using Statistical Features and Fuzzy Classification,” Pattern Recognition, vol. 42, no. 6, pp. 1133-1137, 2009.

[18] X. Zong, Z. Yong, J. Li-min, and H. Wei-li, “Construct Interpretable Fuzzy Classification System Based on Fuzzy Clustering Initialization,” Int’l J. Information Technology, vol. 11, no. 6, pp. 91-107, 2005.

[19] Y. Hu, “Fuzzy Integral-Based Perceptron for Two-Class Pattern Classification Problems,” Information Sciences, vol. 177, no. 7, pp. 1673-1686, 2007.

[20] Z. Wang and V. Palade, “A Comprehensive Fuzzy-Based Framework for Cancer Microarray Data Gene Expression Analysis,” Proc. IEEE Int’l Conf. Bioinformatics and Bioeng., pp. 1003-1010, 2007.

[21] S.M. Chen and F.M. Tsai, “Generating Fuzzy Rules from Training Instances for Fuzzy Classification Systems,” Expert Systems with Applications, vol. 35, no. 3, pp. 611-621, 2008.


[22] G. Schaefer and T. Nakashima, “Data Mining of Gene ExpressionData by Fuzzy and Hybrid Fuzzy Methods,” IEEE Trans. Informa-tion Technology in Biomedicine, vol. 14, no. 1, pp. 23-29, Jan. 2010.

[23] P. GaneshKumar, T. Aruldoss Albert Victore, P. Renukadevi, andD. Devaraj, “Design of Fuzzy Expert System for Microarray DataClassification Using a Novel Genetic Swarm Algorithm,” ExpertSystems with Applications, vol. 39, no. 2, pp. 1811-1812, 2012.

[24] D. Devaraj and B. Yegnanarayana, “Genetic Algorithm-BasedOptimal Power Flow for Security Enhancement,” IEE Proc. Genera-tion, Transmission and Distribution, vol. 152, no. 6, pp. 899-905, Nov.2005.

[25] T. Aruldoss Albert Victoire and A.E. Jeyakumar, “Reserve Constrained Dynamic Dispatch of Units with Valve-Point Effects,” IEEE Trans. Power Systems, vol. 20, no. 3, pp. 1273-1282, Aug. 2005.

[26] P. Pulkkinen and H. Koivisto, “Identification of Interpretable and Accurate Fuzzy Classifiers and Function Estimators with Hybrid Methods,” Applied Soft Computing, vol. 7, pp. 520-533, 2007.

[27] M. Dorigo and T. Stützle, Ant Colony Optimization. MIT Press, 2004.

[28] D. Karaboga and B. Basturk, “A Powerful and Efficient Algorithm for Numerical Function Optimization: Artificial Bee Colony (ABC) Algorithm,” J. Global Optimization, vol. 39, pp. 459-471, 2007.

[29] V.K. Mootha, C.M. Lindgren, K.F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstråle, E. Laurila, N. Houstis, M.J. Daly, N. Patterson, J.P. Mesirov, T.R. Golub, P. Tamayo, B. Spiegelman, E.S. Lander, J.N. Hirschhorn, D. Altshuler, and L.C. Groop, “PGC-1α Responsive Genes Involved in Oxidative Phosphorylation are Coordinately Down Regulated in Human Diabetes,” Nature Genetics, vol. 34, no. 3, pp. 267-273, 2003.

[30] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, “Broad Patterns of Gene Expression Revealed by Clustering of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays,” Proc. Nat’l Academy of Sciences USA, vol. 96, no. 12, pp. 6745-6750, 1999.

[31] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and L.M. Staudt, “Distinct Types of Diffuse Large B-Cell Lymphoma Identified by Gene Expression Profiling,” Nature, vol. 403, no. 3, pp. 503-511, 2000.

[32] T.C. Kraan, F.A. Gaalen, P.V. Kasperkovitz, N.L. Verbeet, T.J. Smeets, M.C. Kraan, M. Fero, P.P. Tak, T.W. Huizinga, E. Pieterman, F.C. Breedveld, A.A. Alizadeh, and C.L. Verweij, “Rheumatoid Arthritis Is a Heterogeneous Disease: Evidence for Differences in the Activation of the STAT-1 Pathway between Rheumatoid Tissues,” Arthritis and Rheumatism, vol. 48, no. 8, pp. 2132-2145, 2003.

[33] V.H. Teixeira, R. Olaso, M.L.M. Magniette, S. Lasbleiz, L. Jacq, C.R. Oliveira, P. Hilliquin, I. Gut, F. Cornelis, and E.P. Teixeira, “Transcriptome Analysis Describing New Immunity and Defense Genes in Peripheral Blood Mononuclear Cells of Rheumatoid Arthritis Patients,” PLoS ONE, vol. 4, no. 8, p. e6803, 2009.

[34] P. Maji, “f-Information Measures for Efficient Selection of Discriminative Genes from Microarray Data,” IEEE Trans. Biomedicine Eng., vol. 56, no. 4, pp. 1063-1069, Apr. 2009.

[35] P. Maji and S.K. Pal, “Fuzzy-Rough Sets for Information Measures and Selection of Relevant Genes from Microarray Data,” IEEE Trans. Systems, Man and Cybernetics, vol. 40, no. 3, pp. 741-752, June 2010.

[36] Y. Leung and Y. Hung, “A Multiple-Filter-Multiple-Wrapper Approach to Gene Selection and Microarray Data Classification,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 108-117, Jan./Feb. 2010.

[37] C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. Schaetzen, R. Duque, H. Bersini, and A. Nowe, “A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis,” , vol. 9, no. 4, pp. 1106-1119, 2012.

[38] A. Sharma, S. Imoto, and S. Miyano, “A Top-r Feature Selection Algorithm for Microarray Gene Expression Data,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 9, no. 3, pp. 754-764, May/June 2012.

[39] D. Devaraj, J. Preetha Roselyn, and R. Uma Rani, “Artificial Neural Network Model for Voltage Security Based Contingency Ranking,” Int’l J. Applied Soft Computing, vol. 7, no. 3, pp. 722-727, 2007.

[40] H. Seker, M.O. Odetayo, D. Petrovic, and R.N.G. Naguib, “A Fuzzy Logic Based Method for Prognostic Decision Making in Breast and Prostate Cancers,” IEEE Trans. Information Technology in Biomedicine, vol. 7, no. 2, pp. 114-122, June 2003.

[41] A.L. Boulesteix, C. Porzelius, and M. Daumer, “Microarray-Based Classification and Clinical Predictors: On Combined Classifiers and Additional Predictive,” Bioinformatics, vol. 24, no. 15, pp. 1698-1706, 2008.

[42] A.L. Boulesteix, C. Strobl, T. Augustin, and M. Daumer, “Evaluating Microarray-Based Classifiers: An Overview,” Cancer Informatics, vol. 6, pp. 77-97, 2008.

[43] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.

[44] T. Fawcett, “An Introduction to ROC Analysis,” Pattern Recognition Letters, vol. 27, pp. 861-874, 2006.

[45] Y. Wang, S.F. Makedon, C.J. Ford, and J. Pearlman, “HykGene: A Hybrid Approach for Selecting Marker Genes for Phenotype Classification Using Microarray Gene Expression Data,” Bioinformatics, vol. 21, no. 8, pp. 1530-1537, 2005.

[46] P. GaneshKumar, C. Rani, D. Mahibha, and T. Aruldoss Albert Victoire, “Fuzzy Rough Neural Based f-Information for Gene Selection and Sample Classification,” Int’l J. Data Mining and Bioinformatics, in press.

[47] P. GaneshKumar and T. Aruldoss Albert Victoire, “Multistage Mutual Information for Informative Gene Selection,” J. Biological Systems, vol. 19, no. 4, pp. 725-746, 2011.

[48] P. Kumar, A. Mundra, and J.C. Rajapakse, “SVM-RFE with MRMR Filter for Gene Selection,” IEEE Trans. Nanobioscience, vol. 9, no. 1, pp. 31-37, Mar. 2010.

[49] D. Lin, “An Information-Theoretic Definition of Similarity,” Proc. 15th Int’l Conf. Machine Learning, pp. 296-304, 1998.

[50] H. Frohlich, N. Speer, A. Poustka, and T. Beibarth, “Gosim - An R Package for Computation of Information Theoretic GO Similarities between Terms and Gene Products,” BMC Bioinformatics, vol. 8, article 166, 2007.

Pugalendhi GaneshKumar received the BTech, MS (by research), and PhD degrees, all in information technology, in 2003, 2008, and 2012 from the University of Madras, Anna University, Chennai, and Anna University, Coimbatore, respectively. Currently, he is an assistant professor in the Department of Information Technology, Anna University, Regional Centre, Coimbatore. His research interests include the application of soft computing techniques in data mining and bioinformatics.

Chellasamy Rani received the BE degree in information technology from Madurai Kamaraj University in 2003, the ME degree in multimedia technology from Anna University, Chennai in 2008, and the PhD degree in information and communication engineering from Anna University, Chennai in 2012. She is currently an assistant professor in the Department of Computer Science and Engineering, Government College of Engineering, Salem. Her research interests include the development of intelligent optimization algorithms for data mining.



Durairaj Devaraj received the BE, ME, and PhD degrees, all in electrical and electronics engineering, in 1992, 1994, and 2002, respectively, from Thiyagaraja College of Engineering and IIT Madras, respectively. Currently, he is a professor in the Department of Electrical and Electronics Engineering, Kalasalingam University, Krishnankovil. He is the project director of Power System Automation Group, TIFAC-CORE in network engineering sponsored by DST, Government of India. His research interests include applications of computational intelligent techniques, power system optimization, and data mining.

T. Aruldoss Albert Victoire received the BTech, ME, and PhD degrees, all in electrical and electronics engineering, in 1998, 2000, and 2006 from Pondicherry Engineering College, Thiyagarajar College of Engineering, and Anna University, Chennai, India, respectively. Currently, he is an associate professor in the Department of Electrical and Electronics Engineering, Anna University, Regional Centre, Coimbatore. His research interests include the development of hybrid intelligent algorithms for power system optimization.
