




Pattern Recognition Letters 32 (2011) 280–292


Rough set based approaches to feature selection for Case-Based Reasoning classifiers

Maria Salamó, Maite López-Sánchez
Dept. de Matemàtica Aplicada i Anàlisi, Universitat de Barcelona, Gran Via de les Corts Catalanes, 585-08007 Barcelona, Spain


Article history: Received 30 July 2009. Available online 18 September 2010. Communicated by T.K. Ho.

Keywords: Feature selection; Dimensionality reduction; Classification techniques; Case-Based Reasoning; Rough Set Theory

doi:10.1016/j.patrec.2010.08.013. © 2010 Elsevier B.V. All rights reserved.

⇑ Corresponding author. Tel.: +34 934039372; fax: +34 934021601. E-mail addresses: [email protected] (M. Salamó), [email protected] (M. López-Sánchez).

1 CBR systems that are built for the classification problem are called CBR classifiers.

Abstract

This paper investigates feature selection based on rough sets for dimensionality reduction in Case-Based Reasoning classifiers. In order to be useful, Case-Based Reasoning systems should be able to manage imprecise, uncertain and redundant data to retrieve the most relevant information in a potentially overwhelming quantity of data. Rough Set Theory has been shown to be an effective tool for data mining and for uncertainty management. This paper has two central contributions: (1) it develops three strategies for feature selection, and (2) it proposes several measures for estimating attribute relevance based on Rough Set Theory. Although we concentrate on Case-Based Reasoning classifiers, the proposals are general enough to be applicable to a wide range of learning algorithms. We applied these proposals on twenty data sets from the UCI repository and examined the impact of feature selection over classification performance. Our evaluation shows that all three proposals benefit the basic Case-Based Reasoning system. They also present robustness in comparison to well-known feature selection strategies.


1. Introduction

Case-Based Reasoning (CBR) is the process of solving new problems (cases) by retrieving the most relevant ones from an existing knowledge-base (called the case-base) and adapting them to fit new situations (Riesbeck and Schank, 1989). The CBR cycle, defined by Aamodt and Plaza (1994), is described in four processes (i.e., retrieve, reuse, revise, and retain). Fig. 1 shows the process in which a new case is solved by retrieving one or more previously experienced cases, reusing the case in one way or another, revising the solution, and retaining the new experience by incorporating it into the existing case-base. CBR starts from a set of training cases. Along its cycle, CBR forms implicit (lazy) generalizations by identifying commonalities between retrieved cases and the target problems.

CBR systems have been used in a wide variety of fields and applications (Watson, 1997). One example is their use in prediction and classification1 (Althoff et al., 1994; Golobardes et al., 2002). CBR systems are well known for their ability to successfully tackle rich and complex domains. CBR is often used when generalized knowledge is lacking. Nevertheless, a CBR system is sensitive to noisy and unreliable data (Aha, 1992), which may contribute negatively to its classification accuracy. In fact, this problem may appear in CBR even if the domain contains few features and/or cases. Additionally, CBR classifiers suffer from the curse of dimensionality2 problem (Jain and Chandrasekaran, 1982; Jain et al., 2000). This is also the case for other learning approaches; nevertheless, Korn et al. (2001) showed that CBR is more dependent on the actual sample distribution than on the dimensions of the problem. In the CBR literature, these problems have been faced from two areas of research: feature selection and instance selection. The first one alleviates these problems by identifying as much of the irrelevant descriptive information (features) of a case as possible. On the other hand, instance selection – known as case-base maintenance in CBR – aims at reducing the number of unnecessary or redundant cases.

Our previous work (Salamó and Golobardes, 2002a,b, 2003, 2004) focuses on feature weighting and instance selection methods based on Rough Set Theory (RST). They have been proven to offer a good trade-off between reduction and problem solving efficiency. Nevertheless, some data sets may also contain irrelevant features (attributes in CBR) in addition to irrelevant cases. Thus, it becomes necessary to reduce the number of features (rather than just weighting them) that are considered when solving new cases. In fact, we have conducted some preliminary work (Salamó, 2004; Salamó and López-Sánchez, 2007) on feature selection, and it constitutes the focus of this paper.

Many algorithms within the Artificial Intelligence literature deal with feature selection. These algorithms can be placed in two main categories: wrappers and filters.

2 The expression curse of dimensionality is due to Bellman (1961). The curse of dimensionality occurs in very high-dimensional domains with tens of thousands of attributes and only a few hundred cases.



Fig. 1. Case-Based Reasoning cycle: the description of a new situation is matched against the case knowledge (Retrieve), the retrieved case is adapted into a suggested solution (Reuse), the solution is revised if insufficient (Revise), and the confirmed solution is stored as a learned case (Retain), supported by the vocabulary, similarity, adaptation and case knowledge containers.


Wrapper methods use the performance algorithm itself as an evaluation function to estimate the accuracy of attribute subsets (Kohavi and John, 1997). Thus, wrappers tend to be computationally expensive because the learning algorithm is called repeatedly. On the other hand, filter methods (Blum and Langley, 1997) filter out undesirable attributes before learning takes place. Filters have been proven to be much faster than wrappers and hence can be applied efficiently to large data sets containing many attributes.

This paper addresses feature selection from the filter perspective, presenting three different selection strategies and several measures for estimating attribute relevance based on Rough Set Theory. RST is an extension of set theory for the management of inexact, uncertain or insufficient information acquired from experience. Moreover, it may serve as a mathematical tool for soft computing, similarly to fuzzy set theory (Zadeh, 1965). Rough Set Theory has been successfully applied in machine learning (Jensen and Shen, 2004), knowledge discovery (Beaubouef et al., 2004), pattern recognition (Shen and Chouchoulas, 2002), and so on (Shiu and Pal, 2004). In particular, we show that the proposals of this paper achieve considerable performance when compared to well-known feature selection techniques while reducing the feature space.

The rest of the paper is structured as follows: Section 2 introduces the related work on filter methods; Section 3 gives a brief overview of the main concepts of RST; Section 4 details the proposed feature selection strategies; Section 5 exposes the evaluation methodology that we used and the results that ensued; and finally, Section 6 presents the conclusions.

2. Related work

Mostly, feature selection approaches in CBR use the same algorithms that have been widely applied in pattern recognition, machine learning and data mining. In this section, we review the most well-known algorithms for feature selection used in the CBR literature and, in particular, in classification tasks.

Many filter methods have been proposed for feature selection; a review of them can be found in (Blum and Langley, 1997; Guyon and Elisseeff, 2003). The simplest filtering scheme is to evaluate each attribute individually, measuring its correlation to the target function (e.g., using a mutual information measure), and then to select the K attributes with the highest value. The Relief algorithm (Kira and Rendell, 1992) follows this general paradigm. Relief randomly samples an instance and locates its nearest neighbour from the same and from the opposite class. It was originally defined for two-class problems. Kononenko (1994) proposed an extension, called ReliefF, which can handle noisy and multiclass problems.

On the other hand, unlike Relief, Correlation-based Feature Selection (CFS) (Hall, 2000) evaluates and ranks attribute subsets rather than individual attributes. The CFS algorithm is a subset evaluation heuristic that takes into account the usefulness of individual attributes for predicting the class, along with the level of intercorrelation among them.

Many rough set based approaches to feature selection can be found in the literature (Choybey et al., 1996). Zhong et al. (2001) propose a hybrid filter/wrapper algorithm which uses Rough Set Theory with greedy heuristics for feature selection. The selection is similar to the filter approach, but the evaluation criterion is related to the performance of induction. In the same way, a hybrid filter/wrapper approach based on feature weighting can be found in (Al-Radaideh et al., 2005); its selection is based on feature ranking and greedy forward selection. The Parameterized Average Support Heuristic (PASH) (Zhang and Yao, 2004) is another feature selection approach based on RST which is applied over classification rules. PASH is based on a parameterized lower approximation definition in Rough Sets, and its main advantage is that it considers the overall quality of the potential rules. Yun et al. (2004) use RST as a way to decide which attributes are used for constructing a decision tree with a minimum number of leaves. Another approach (Wang et al., 2007), based on RST and Particle Swarm Optimization (PSO), investigates how PSO can be applied to find optimal feature subsets or rough set reducts.

In the CBR literature, Li et al. (2006) combine a feature selection approach with different case selection strategies. Although the authors analyzed the feature and case selection proposals separately, the best results were obtained by the combination of strategies. In this paper, we place particular emphasis on the approach of Gupta et al. (2006), which proposes two rough set algorithms (i.e., JohnsonsReduct and Marginal Relative Dependency – MRD) for feature selection.


Table 1
An example of a decision table.

Universe  area  brightness  color  TV
x1        15    500         Steel  Samsung
x2        20    450         Black  Sony
x3        20    400         White  Samsung
x4        40    500         Black  Phillips
x5        20    500         Black  Sony
x6        20    500         Black  Samsung


Both algorithms improve task performance and reduce feature selection times when applied to textual CBR.3 However, the computational complexity of MRD is an order of magnitude higher than that of JohnsonsReduct. Therefore, we use JohnsonsReduct as a reference to compare with our feature selection method. As previously stated, our method is based on Rough Set Theory, and it constitutes the main contribution of this paper together with the relevance measures and selection strategies it invokes.

3. Building blocks of Rough Set Theory

Rough Set Theory, defined by Pawlak (1982, 1991), is one of the techniques for the identification and recognition of common patterns in data, especially in the case of uncertain and incomplete data. Briefly, a rough set is a formal approximation of a crisp set in terms of a pair of sets which give the lower and the upper approximation of the original set.

This section presents the basic concepts that are required to understand our proposed application of RST to feature selection for CBR classification problems. Notice that we use established CBR terminology to present these basic concepts. Thus, we refer to cases instead of instances or objects, and to attributes instead of features or relations.

In RST, the data is collected in a table, called a decision table. Decision tables are defined as follows:

Definition 1 (Decision table). A decision table is denoted by T = (U, A, C, D) where:

• U = {x1, x2, ..., xn} is a finite set of cases (the Universe),
• A = {a1, a2, ..., am} is a set of attributes, and
• C, D ⊆ A are two subsets of attributes that are called condition (C) and decision (D) attributes, respectively.

Thus, a decision table specifies the actual attribute values for the cases of the Universe. Table 1 shows an example of a simple decision table in which U = {x1, x2, ..., x6} (where xi represents a case) and A = {area, brightness, color, TV}, where area, brightness and color are the condition attributes C, and TV is the decision attribute D (i.e., the class to predict).

From here, we can define indiscernibility relations and equivalence classes. Let a ∈ A and P ⊆ A. Then, the binary indiscernibility relation, denoted by IND(P), is defined as:

IND(P) = {(x, y) ∈ U × U : ∀a ∈ P, a(x) = a(y)}

where (x, y) is a pair of cases, and a(x) denotes the value of attribute a for case x. Thus, if (x, y) ∈ IND(P), x and y are said to be indiscernible with respect to P. The family of all equivalence classes of IND(P) (the partition of U determined by P) is denoted by U/IND(P). For the sake of simplicity, U/P will be used instead of U/IND(P).

For example, given the TV decision table in Table 1, we can appreciate the following indiscernibility relations:

• U/IND(area) = U/area = {(x1), (x2,x3,x5,x6), (x4)}, since case x1 is the only one having an area value of 15, cases x2, x3, x5 and x6 share the same value for this attribute (i.e., area(x2) = area(x3) = area(x5) = area(x6) = 20), and case x4 is the only one with a value of 40. Fig. 2(a) illustrates this graphically.
• U/IND(brightness) = U/brightness = {(x1,x4,x5,x6), (x2), (x3)} (see Fig. 2(b)).
• U/IND(color) = U/color = {(x1), (x2,x4,x5,x6), (x3)} (see Fig. 2(c)).

3 Textual CBR (TCBR) analyzes texts of a given domain and typically builds semi-structured cases with which new text can be meaningfully compared. The main challenge in TCBR is the acquisition of the data.

Furthermore, an indiscernibility relation is an equivalence relation that partitions the set of cases into equivalence classes. Each equivalence class contains a set of indiscernible cases for the given set of condition attributes C. As Fig. 2(d) shows, in our TV decision table example:

• U/IND(C = area, brightness, color) = U/C = {(x1), (x2), (x3), (x4), (x5,x6)}.

As we said before, a rough set is a formal approximation of a crisp set (i.e., a conventional set) in terms of a pair of sets which give the lower and the upper approximation of the original set. The lower and upper approximation sets themselves are crisp sets in the standard version of Rough Set Theory (Pawlak, 1991), but in other variations the approximating sets may be fuzzy sets as well.

In other words, given a target set X, we can define two approximations from the available data in its cases: namely, the lower and upper approximations of X (see Fig. 3). The lower approximation P̲X is the set of all cases of U which can certainly be classified as cases of X having knowledge P (see Fig. 3(b)). The upper approximation P̄X (see Fig. 3(c)) is the set of cases of U which can possibly be classified as cases of X considering knowledge P. Definition 2 details these two concepts formally.

Definition 2 (Lower and upper approximations). Let P ⊆ C and X ⊆ U.

The P-lower approximation set of X, formally presented as P̲X = ∪{Y ∈ U/P : Y ⊆ X}, is the complete set of cases of U which can be positively (i.e., unambiguously) classified as belonging to the target set X using the knowledge P.

The P-upper approximation set of X, formally presented as P̄X = ∪{Y ∈ U/P : Y ∩ X ≠ ∅}, is the set of cases of U that are possibly in X.

The set Bnd_P(X) = P̄X − P̲X is called the P-boundary of X. If Bnd_P(X) ≠ ∅, then we say that X is a rough set (see Fig. 3(d)).

Thus, for example, in the TV decision table, consider X_{TV=Samsung} = {x1,x3,x6}; then the lower and upper approximations of X with respect to knowledge C are, respectively, C̲X = {x1,x3}4 and C̄X = {x1,x3,x5,x6}. Fig. 4 depicts these definitions graphically (notice that their computation considers the partition of the Universe with respect to the condition attributes C in Fig. 2(d)).
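Continuing the same sketch (it reuses CASES and partition from the previous block), the lower and upper approximations of Definition 2 follow directly:

```python
def lower_upper(cases, attrs, target):
    """P-lower and P-upper approximations of a target set of case names."""
    target, lower, upper = set(target), set(), set()
    for block in partition(cases, attrs):
        if block <= target:   # block wholly inside X: certainly in X
            lower |= block
        if block & target:    # block overlaps X: possibly in X
            upper |= block
    return lower, upper

low, up = lower_upper(CASES, ["area", "brightness", "color"],
                      {"x1", "x3", "x6"})      # X_{TV=Samsung}
print(sorted(low), sorted(up))
# ['x1', 'x3'] ['x1', 'x3', 'x5', 'x6']
```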

Another important concept of Rough Set Theory is the notion of the positive region. The C-positive region of D is the set of all cases from the Universe U which can be certainly classified into classes of U/D by employing attributes from C. Formally, it is described in Definition 3.

Definition 3 (Positive region). Let C and D be condition and decision equivalence attributes over U. The C-positive region of D, denoted by POS_C(D), corresponds to:

POS_C(D) = ∪_{X ∈ U/D} C̲X

4 Y = {x5,x6} ∈ U/C is not included in the lower approximation because it is not a subset of X: Y ⊄ X.


Fig. 2. Graphical illustration of the TV example. Partitions of the Universe U determined by the three condition attributes: (a) area (U/area); (b) brightness (U/brightness); (c) color (U/color); and (d) the combination of the previous partitions (U/C).


Fig. 3. Lower and upper approximations of a set X. (a) The set X (curve); (b) X and P̲X (small shaded rectangle); (c) X, P̄X (large shaded rectangle) and the negative region (surrounding white region); and (d) X and Bnd_P(X) (shaded region).

Fig. 4. Further illustration of the TV example: (a) U/C from Fig. 2(d); (b) the set X = X_{TV=Samsung}; (c) the lower approximation of X, C̲X; and (d) the upper approximation of X, C̄X.


Following our TV example, the positive region of D = {TV} with respect to C = {area, brightness, color} can be computed as:

• Consider U/C = {(x1), (x2), (x3), (x4), (x5,x6)} and U/D = {(x1,x3,x6), (x2,x5), (x4)}.
• We have POS_C(D) = C̲X_{TV=Samsung} ∪ C̲X_{TV=Sony} ∪ C̲X_{TV=Phillips}, where:
  – C̲X_{TV=Samsung} = ∪{Y ∈ U/C : Y ⊆ X_{TV=Samsung} = {x1,x3,x6}} = {x1,x3},
  – C̲X_{TV=Sony} = ∪{Y ∈ U/C : Y ⊆ X_{TV=Sony} = {x2,x5}} = {x2},
  – C̲X_{TV=Phillips} = ∪{Y ∈ U/C : Y ⊆ X_{TV=Phillips} = {x4}} = {x4}.
• So that, finally, POS_C(D) = {x1,x2,x3,x4}.
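The positive region then chains the two previous helpers (again an illustrative sketch, not the paper's code):

```python
def positive_region(cases, cond, dec):
    """POS_C(D): union of the C-lower approximations of the D-classes."""
    pos = set()
    for dec_class in partition(cases, dec):    # the blocks of U/D
        low, _ = lower_upper(cases, cond, dec_class)
        pos |= low
    return pos

print(sorted(positive_region(CASES, ["area", "brightness", "color"], ["TV"])))
# ['x1', 'x2', 'x3', 'x4']
```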

Moreover, an attribute c ∈ C is said to be dispensable if IND(C − {c}) = IND(C); otherwise, attribute c is indispensable in the decision table T. Therefore, if c is an indispensable attribute, deleting it from T will cause T to be inconsistent. Additionally, T is said to be independent if all c ∈ C are indispensable.

Definition 4 (Reduct). A set of attributes R ⊆ C is called a reduct of C if T′ = (U, A, R, D) is independent and IND(R) = IND(C). In other words, using R it is possible to approximate the same as using C. Notice that C may have many reducts.

We denote by RED(C) the set of all reducts of C. The set of all indispensable attributes in C will be called the core of C, and will be denoted by CORE(C) = ∩RED(C). The core can be interpreted as the most characteristic part of the knowledge, which cannot be eliminated when reducing the knowledge.

In order to illustrate these concepts, we can compute both the reducts and the core in our TV example:

• First of all, we compute the indiscernibility relations for the combinations of attributes:
  – U/(area, brightness) = {(x1), (x2), (x3), (x4), (x5,x6)}.
  – U/(brightness, color) = {(x1), (x2), (x3), (x4,x5,x6)}.
  – U/(area, color) = {(x1), (x2,x5,x6), (x3), (x4)}.
  – U/area = {(x1), (x2,x3,x5,x6), (x4)}.
  – U/brightness = {(x1,x4,x5,x6), (x2), (x3)}.
  – U/color = {(x1), (x2,x4,x5,x6), (x3)}.
• And then, we can search for the indispensable attributes:
  – Since U/(area, brightness) = U/C, the attribute color is dispensable.
  – Since U/(brightness, color) ≠ U/C, the attribute area is indispensable.
  – Since U/(area, color) ≠ U/C, the attribute brightness is indispensable.
• Thus, considering also that U/(area, brightness) ≠ U/area and U/(area, brightness) ≠ U/brightness, we obtain the only reduct: RED(C) = {(area, brightness)}.


• Finally, we compute the core CORE(C) = {area, brightness}, where area and brightness are indispensable attributes in our TV classifier.

Before specifying how our feature selection method uses these concepts, note that the computational complexity of extracting the indiscernibility relations is O(n² · m) (Bell and Guan, 1998) for a Universe U having n cases and m attributes. As we can observe, this complexity is a function of the square of the number of training cases. Therefore, one way to alleviate this complexity is to reduce the number of training cases that need to be considered at a time. This can be accomplished by using Randomized Training Partitions (RTP) (Gupta et al., 2005). The RTP procedure is defined as follows:

• Randomly create n′ equal-sized partitions of the training set.
• From each partition, select features using a feature selection algorithm (e.g., JohnsonsReduct, MS, HS or ST).
• Define the final feature set as the union of the features selected from each partition.

As demonstrated in (Gupta et al., 2005, 2006), the RTP approach can reduce the training time by a factor of n′ for the RST feature selection algorithms. In this paper, however, we have not considered large data sets and, therefore, the RTP approach has not been used.
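A minimal sketch of the RTP procedure under the same conventions (our reading of the steps above; the handling of leftover cases when the training set does not divide evenly is our own simplification):

```python
import random

def rtp_select(cases, n_parts, select_features):
    """RTP: union of the feature subsets selected on each random partition."""
    names = list(cases)
    random.shuffle(names)
    chunk = len(names) // n_parts          # leftover cases are dropped here
    selected = set()
    for i in range(n_parts):
        part = {x: cases[x] for x in names[i * chunk:(i + 1) * chunk]}
        selected |= set(select_features(part))   # e.g. a JohnsonsReduct,
    return selected                               # MS, HS or ST selector
```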

4. Feature selection using Rough Set Theory

Feature selection is a process to find the optimal subset of attributes that satisfies a given criterion. In this paper we present different approaches to feature selection based on RST. The next subsection details the common steps of all the proposed approaches. Subsequent subsections describe each of these proposals in detail.

4.1. Feature selection method

The process of feature selection we propose is divided into four basic steps:

1. Discretize the data.
2. Measure the relevance of each condition attribute using Rough Set Theory.
3. Rank the attributes by decreasing relevance.
4. Apply a feature selection strategy.

The first step of the process is necessary due to the fact that RST can only handle nominal values. For this reason, we apply the Fayyad and Irani (1992) algorithm, which is a well-known discretization algorithm, to all data sets containing numerical values. In the second step we look for patterns of features (i.e., reducts) using Rough Set Theory and, with the aid of these patterns, we estimate the relevance of each attribute. Once we have estimated the relevance, we rank the attributes and apply a feature selection strategy. This strategy determines the number of attributes that is sufficient for the given data.

In many domains (such as bioinformatics), attribute independence cannot be assumed. On the contrary, the concept of synergy5 becomes fundamental. Synergy has been analyzed using different methods such as Information Gain (Anastassiou, 2007), the Coefficient of Determination (Martins et al., 2008) or Rough Set Theory (Li and Zhang, 2006). In particular, Rough Sets implicitly analyze synergy as they provide a mechanism to represent approximations of concepts in terms of overlapping concepts. When defining a reduct, Rough Set Theory provides the set of attributes that contribute to predicting the decision attributes.6

5 Synergy is the additional contribution provided by the whole compared with the sum of the contributions of the parts.

Correlations among attributes, as well as correlations between decision and condition attributes, are also related to the nesting effect. Search algorithms for feature selection such as Sequential Forward Selection (SFS) or Sequential Backward Selection (SBS) suffer from the nesting effect, whereas Floating Search methods (Pudil et al., 1994; Somol et al., 1999) prevent it. Floating Search methods are an excellent tradeoff between the nesting effect and computational efficiency. In our proposals, reducts help to diminish the nesting effect. In fact, as detailed in Section 4.2, reducts are used for computing attribute relevance. In this manner, an attribute with a high relevance value is useful for solving the problem at hand in combination with some other attributes (i.e., those in the reducts). Thus, the aim of the proposed feature selection strategies (see Section 4.3) is to select a set of attributes that covers at least one reduct; more concretely, the proposed strategies may cover more than one reduct. Nevertheless, we cannot guarantee optimality, since the only algorithm that guarantees optimality is an exhaustive search. In (Li and Zhang, 2006), attribute interaction is measured in terms of attribute dependency, as in our DRS relevance measure. The main difference between both proposals is that we consider a reduct with the idea of avoiding the nesting effect, instead of analyzing all possible sets of attributes independently.

Regarding our proposed feature selection process, this paper focuses on the second and fourth steps: relevance measures and feature selection strategies. The next subsection describes all the relevance measures we propose. Afterwards, Section 4.3 presents three different criteria for feature selection, called Mean Selection, Half Selection, and Selection by Threshold.

4.2. Relevance measures based on Rough Set Theory

Many definitions have appeared in the machine learning literature for what it means for features to be relevant (Caruana and Freitag, 1994). Moreover, there is a large number of weighting methods to compute this relevance. In our case, we use four rough set relevance measures, called Dependence Rough Sets (DRS), Proportional Rough Sets (PRS), Mean Rough Sets (Mean), and Two Rough Sets (Two). They are subsequently detailed (see Definitions 5–8, respectively).

First of all, we consider Dependence Rough Sets (DRS), which is a well-known measure in Rough Sets. DRS measures the significance of an attribute a ∈ C with respect to D, as shown in Definition 5. Nevertheless, its definition requires some further notation.

For this reason, we first define the degree of dependency, which provides a measure of how important C is in mapping the data set examples into D. This degree of dependency, γ_C(D), is based on the cardinality of the C-positive region of D and is defined as follows:

γ_C(D) = |POS_C(D)| / |U|    (1)

A value γ_C(D) = 1.0 means that D depends totally on C, whereas if 0 < γ_C(D) < 1, we say that D depends partially on C. Moreover, if γ_C(D) = 0, we say that D is totally independent from C.
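Reusing the positive_region sketch from Section 3, Eq. (1) reduces to a single line (illustrative):

```python
def gamma(cases, cond, dec):
    """Degree of dependency gamma_C(D) = |POS_C(D)| / |U| (Eq. (1))."""
    return len(positive_region(cases, cond, dec)) / len(cases)

print(gamma(CASES, ["area", "brightness", "color"], ["TV"]))
# 0.666... (= 4/6, cf. POS_C(D) computed above)
```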

Considering the decision table T, a set of attributes D is dependent on a set C in T (which is denoted by C → D) iff IND(C) ⊆ IND(D).

6 Note that a reduct is a set of interacting attributes that is sufficient to describe the decision attributes.


Similarly, the dependency of the set D to degree k on the set C in T is denoted as follows:

C →_k D,   0 ≤ k ≤ 1,   where k = γ_C(D)    (2)

with γ_C(D) as described above.

As explained before, DRS measures the significance of an attribute a ∈ C with respect to D. In practice, DRS uses the degree of dependency (see Eq. (1)) as a measure of how the classifier will change when removing attributes. Thus, feature selection involves keeping those attributes that are significant (i.e., whose removal changes the dependency), whereas attribute reduction involves removing those attributes that have no significance to the classifier. DRS is defined as follows.

Definition 5 (Dependence Rough Sets relevance (DRS) – α(a)).

∀a ∈ C and R ⊆ C: α(a) = (|POS_R(D)| − |POS_{R−{a}}(D)|) / |U|    (3)

where |·| denotes the cardinality of a set, R corresponds to the smallest7 reduct found in the decision table T, C and D are respectively the sets of condition and decision attributes, POS_R(D) represents the positive region of all attributes present in R, and finally, POS_{R−{a}}(D) is the positive region of all attributes present in the reduct R when extracting feature a.

Note that Eq. (3) is equivalent to:

∀a ∈ C and R ⊆ C: α(a) = γ_R(D) − γ_{R−{a}}(D)    (4)

Following our TV example, there is a set with a single reduct, RED(C) = {(area, brightness)}. Thus, in this example, R = {area, brightness} and the Dependence Rough Sets relevance can be computed as:

• First of all, consider the following indiscernibility relations:
  – U/D = {(x1,x3,x6), (x2,x5), (x4)},
  – U/R = {(x1), (x2), (x3), (x4), (x5,x6)},
  – U/area = {(x1), (x2,x3,x5,x6), (x4)},
  – U/brightness = {(x1,x4,x5,x6), (x2), (x3)}.
• Next, compute the positive regions POS_R(D), POS_{R−{area}}(D), POS_{R−{brightness}}(D) and POS_{R−{color}}(D):
  – POS_R(D) = R̲X_{TV=Samsung} ∪ R̲X_{TV=Sony} ∪ R̲X_{TV=Phillips} = {x1,x3} ∪ {x2} ∪ {x4} = {x1,x2,x3,x4},
  – POS_{R−{area}}(D) = {x3} ∪ {x2} ∪ ∅ = {x2,x3},
  – POS_{R−{brightness}}(D) = {x1} ∪ ∅ ∪ {x4} = {x1,x4},
  – POS_{R−{color}}(D) = POS_R(D), since R does not contain the attribute color.
• And finally, we can measure DRS:
  – α(area) = (|POS_R(D)| − |POS_{R−{area}}(D)|) / |U| = (4 − 2)/6 = 1/3 ≈ 0.33,
  – α(brightness) = (|POS_R(D)| − |POS_{R−{brightness}}(D)|) / |U| = (4 − 2)/6 = 1/3 ≈ 0.33,
  – α(color) = (|POS_R(D)| − |POS_{R−{color}}(D)|) / |U| = (4 − 4)/6 = 0.

Next, we present our second relevance measure: Proportional Rough Sets (PRS). Basically, it assumes that the more frequently a condition attribute appears in the reducts, the more relevant the attribute is. PRS is defined as follows:

7 That is, the reduct with the fewest attributes. A decision table may have many reducts; by definition, using a reduct it is possible to approximate the same as using C. In case there is more than one reduct with the same number of attributes, the DRS algorithm selects the first one.

Definition 6 (Proportional Rough Sets relevance (PRS) – β(a)).

∀a ∈ C: β(a) = |RED_a(C)| / |RED(C)|    (5)

where RED_a(C) is the set of reducts containing attribute a ∈ C and RED(C) is the set of all reducts.

An attribute a not appearing in the reducts has RED_a(C) = ∅ and, consequently, relevance value β(a) = 0. On the other hand, an attribute appearing in the core – i.e., in all reducts – has feature relevance β(a) = 1. The remaining attributes have a relevance value that is proportional to their appearance in the reducts. For example, going back to our TV example, since RED(C) = {(area, brightness)} is a set with a single reduct, attributes area and brightness appear in all reducts and, therefore, β(area) = 1.0 and β(brightness) = 1.0, while β(color) = 0.0 because color never appears in the set of reducts.
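PRS is a simple frequency count over the set of reducts; a sketch under the same conventions (the enumeration of reducts is assumed to be given):

```python
def prs(reducts, cond):
    """beta(a): fraction of the reducts that contain attribute a (Eq. (5))."""
    return {a: sum(a in red for red in reducts) / len(reducts) for a in cond}

print(prs([{"area", "brightness"}], ["area", "brightness", "color"]))
# {'area': 1.0, 'brightness': 1.0, 'color': 0.0}
```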

These two relevance measures may generate rather different rankings. Therefore, we propose two additional relevance measures that combine both DRS and PRS. The first one is called Mean Rough Sets relevance and it is computed as follows.

Definition 7 (Mean Rough Sets relevance (Mean) – δ(a)).

∀a ∈ C: δ(a) = (α(a) + β(a)) / 2    (6)

where a is an attribute belonging to the condition attribute set C, and α(a) and β(a) correspond to its DRS and PRS relevances, respectively (see Definitions 5 and 6).

The definition is based on the arithmetic mean: the sum of the DRS and PRS values divided by two. This is the simplest definition, since the aim is to define a standard combination that can be used in the future as a reference. Additionally, this relevance measure will discern whether a hybrid measure is able to outperform the individual ones or if, on the contrary, their individual strengths are somehow balanced in the combination.

Finally, we present a second combination of DRS and PRS in Definition 8, called Two Rough Sets. The previous relevance measure (Mean) averages DRS and PRS without taking into account their relevance value distributions. Although both measures belong to the [0,1] interval, PRS yields a large number of attributes with a relevance value of 1, while in DRS, 1's hardly appear – in fact, sometimes no 1 appears at all. Taking this into consideration, the Mean relevance measure behaves closer to PRS because its relevance values are higher than those obtained with DRS. Thus, combinations may be biased for some data sets. The aim of Two Rough Sets is to provide an alternative that somehow compensates this effect, and it is defined as follows.

Definition 8 (Two Rough Sets relevance (Two) – ε(a)).

∀a ∈ C: ε(a) = (φ(a) + υ(a)) / 2    (7)

where φ(a) = α(a)/m_α, υ(a) = β(a)/m_β, and m_α = max_{a∈C} α(a), m_β = max_{a∈C} β(a), with a an attribute in the set C of condition attributes, and α(a) and β(a) the DRS and PRS relevance values, respectively (see Definitions 5 and 6).

Eq. (7) normalizes the relevance values of DRS and PRS by extending them to the interval [0,1]. Thus, the greatest relevance value is moved to 1.0 and the remaining relevance values are modified proportionally.
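Both hybrid measures are direct to implement from Eqs. (6) and (7); in the following sketch the guard against an all-zero score vector is our own addition, since Eq. (7) divides by the maxima:

```python
def mean_rs(alpha, beta):
    """delta(a) = (alpha(a) + beta(a)) / 2 (Eq. (6))."""
    return {a: (alpha[a] + beta[a]) / 2 for a in alpha}

def two_rs(alpha, beta):
    """epsilon(a): average of the max-normalised DRS and PRS values (Eq. (7))."""
    m_a = max(alpha.values()) or 1.0   # guard: avoid dividing by zero
    m_b = max(beta.values()) or 1.0
    return {a: (alpha[a] / m_a + beta[a] / m_b) / 2 for a in alpha}
```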


4.3. Feature selection strategies

A feature selection strategy describes the method used to decide which subset of attributes is the most relevant one to represent the domain. We address the overall feature selection process, as detailed in Section 4.1, by first ranking the attributes according to a relevance measure r(a) chosen from the measures in the previous subsection (i.e., r(a) ∈ {α(a), β(a), δ(a), ε(a)}), and then applying the selection criterion over the resulting rank vector.

This vector of attributes is ordered by decreasing relevance values. It contains, for each attribute, its name and its relevance value, so that the feature in the first position is the most relevant one and the feature in the last (i.e., |C|th) position is the least relevant. Thus, in our previous TV example, where C = {area, brightness, color} and the relevance using PRS is r(area) = 1.0, r(brightness) = 1.0 and r(color) = 0.0, our rank vector will be {(area, 1.0), (brightness, 1.0), (color, 0.0)}.8

This subsection describes three different criteria for feature selection. First of all, we define a simple strategy called the Mean Selection (MS) strategy.

Definition 9 (Mean Selection strategy – MS). An attribute a ∈ C is selected if:

r(a) ≥ Σ_{c∈C} r(c) / |C|

where r(a) and r(c) are relevance values for attributes a, c ∈ C.

Definition 9 establishes that an attribute is selected if its relevance is greater than or equal to the mean of the relevance values. Considering again the previous TV example:

Σ_{c∈C} r(c) / |C| = (r(area) + r(brightness) + r(color)) / 3 = 2/3

Consequently, the selected attributes in this case are area and brightness, whose relevance values exceed the mean value of 2/3.

Despite its simplicity, this strategy will be useful to test the suitability of our rough-set-based relevance measures and will act as a benchmark in our experiments.
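As a sketch (with relevance being a dict produced by any of the four measures above), MS reduces to a comparison against the mean score:

```python
def mean_selection(relevance):
    """MS: keep attributes whose relevance is at least the mean relevance."""
    threshold = sum(relevance.values()) / len(relevance)
    return [a for a, r in relevance.items() if r >= threshold]

print(mean_selection({"area": 1.0, "brightness": 1.0, "color": 0.0}))
# ['area', 'brightness']
```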

Next, we define the Half Selection (HS) strategy, which aims to reduce feature dimensionality in data sets considerably. For this reason, the focus of this proposal is to select approximately 50% of the features in the domain. HS is defined as follows:

Definition 10 (Half Selection strategy – HS). An attribute a ∈ C is selected if it satisfies at least one of these two conditions:

(i) p_a ≤ ⌊|C|/2⌋,
(ii) r(a) = 1.0,

where p_a is the position (i.e., index) of attribute a in the rank vector.

As can be seen from Definition 10, the first condition selects the features ranked in the top ⌊|C|/2⌋ positions, which approximates 50% of |C|. The second condition includes those features that show a relevance value r(a) = 1.0, regardless of their position in the rank vector. A relevance value of 1.0 means that the corresponding attribute is part of CORE(C) and, therefore, is an indispensable feature for the data set. Following the previous example, we have threshold = ⌊3/2⌋ = 1, so that we only select attribute area. Nevertheless, by using the second condition, we also select attribute brightness, thus guaranteeing that all the attributes that belong to the core are selected.

8 Notice, though, that it is just a coincidence that the attributes in this example's rank vector appear in the same order as they are presented in the decision table (see Table 1).

Definition 10 seems quite restrictive because, regardless of the number of attributes in C, it only accepts around 50% of them, with the only exception of those features that have maximal relevance. However, it is important to note two points: firstly, feature selection algorithms are usually applied to large volumes of data which need a reduction in feature dimensionality (though this is not the case for our simple TV example); and secondly, those features that have the maximum relevance value are maintained, by definition, in order to preserve the representation of the domain.

There are some points for and against this feature selection strategy. The main advantage is the great reduction it guarantees; its aim is to be close to 50%. On the other hand, the disadvantage is that it may select some attributes that are irrelevant but have passed the threshold or, more importantly, it may miss some relevant attributes that have been removed because they fall under the threshold. This drawback suggests a new feature selection strategy based on the relevance values themselves instead of a threshold derived from a predefined number of attributes to reduce.
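A sketch of HS under the same conventions; note that Definition 10 leaves tie-breaking in the ranking unspecified, so the sort below breaks ties arbitrarily:

```python
def half_selection(relevance):
    """HS: the top floor(|C|/2) ranked attributes, plus any with r = 1.0."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    keep = set(ranked[: len(ranked) // 2])                  # condition (i)
    keep |= {a for a, r in relevance.items() if r == 1.0}   # condition (ii)
    return keep

print(half_selection({"area": 1.0, "brightness": 1.0, "color": 0.0}))
# {'area', 'brightness'}
```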

The last feature selection strategy presented in this paper is called Selection by Threshold (ST). Its aim is, rather than reducing a great number of attributes, to guarantee that the most relevant attributes will be maintained. For this reason, it is based on the actual relevance values. ST is defined as follows:

Definition 11 (Selection by Threshold strategy – ST). An attribute a ∈ C is selected if:

r(a) ≥ min_r + (max_r − min_r) / 3

where max_r and min_r are the maximum and minimum relevance values in the ranking vector, respectively.

This measure is an attempt to approximate the selection of the relevance values that are larger than their median value.9 Nevertheless, we want to provide a measure that does not depend on the actual distribution of relevance values, and we want it to be as simple as possible in terms of computational time.

An initial candidate measure such as min_r + (max_r − min_r)/2 was discarded because it may be too restrictive for certain distributions. We also considered spread measures from descriptive statistics such as the semi-quartile range, calculated as half the difference between the 75th percentile (often called Q3) and the 25th percentile (Q1); more specifically, its formula is (Q3 − Q1)/2. Since half the values in a distribution lie between Q3 and Q1, the semi-quartile range is one-half the distance needed to cover half the values. It is hardly affected by extreme values, so it is a good measure of spread for skewed distributions, but it is rarely used for data sets that have normal distributions; in that case, the standard deviation is used instead. Since we want to keep our measure independent of distributions, we decided to redefine our initial candidate by relaxing it down to min_r + (max_r − min_r)/3. Its simplicity has the advantage of requiring short computational time, since it requires neither ordering (as the semi-quartile range does) nor computing the deviation of all relevance values from their arithmetic mean (as required for the standard deviation). Additionally, we note that we performed a series of experiments with different relevance value distributions which empirically confirmed that the proposed measure approximates the median.

9 The median is the number separating the higher half of a population from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest to highest value and picking the middle one.


Fig. 5. Steps of the process inside BASTIAN: the BASTIAN core discretizes the initial training case base using Fayyad and Irani's algorithm, extracts the attribute relevances (weights), and performs feature selection; the BASTIAN CBR cycle then uses the selected features with the nominal training case base and the test case base.


Continuing with the previous TV example, where max_r = 1.0 and min_r = 0.0, the selected attributes will be those having relevance values greater than 0.33 which, for this simple example, happen to be area and brightness. The result is reasonable if we consider that the set of reducts is RED(C) = {(area, brightness)}, which denotes that these two features are needed to represent the data set and that there is only one reduct.
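ST needs neither sorting nor any statistics beyond the two extremes, as the following sketch shows:

```python
def selection_by_threshold(relevance):
    """ST: keep attributes with r >= min_r + (max_r - min_r) / 3."""
    lo, hi = min(relevance.values()), max(relevance.values())
    return [a for a, r in relevance.items() if r >= lo + (hi - lo) / 3]

print(selection_by_threshold({"area": 1.0, "brightness": 1.0, "color": 0.0}))
# ['area', 'brightness']
```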

Finally, there are two important points to emphasize: (1) the MS and ST strategies do not need to order the relevance values and, therefore, we can actually skip the third step of the general selection method (see Section 4.1); and (2) a minimum reduction in the percentage of features can be expected (though it is not guaranteed). In short, the ST approach enables the feature selection strategy to discard less relevant features without any costly computation such as ordering or computing standard deviations.
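Putting the sketches together, the whole filter can be run end to end on the TV example (illustrative; the reduct set is hard-coded to the one computed in Section 3 rather than searched for):

```python
COND, DEC = ["area", "brightness", "color"], ["TV"]
REDUCTS = [{"area", "brightness"}]                       # RED(C), Section 3

alpha = drs(CASES, ["area", "brightness"], COND, DEC)    # DRS scores
beta = prs(REDUCTS, COND)                                # PRS scores
relevance = two_rs(alpha, beta)                          # Two relevance
for strategy in (mean_selection, half_selection, selection_by_threshold):
    print(strategy.__name__, sorted(strategy(relevance)))
# each strategy keeps exactly area and brightness on this toy table
```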

5. Evaluation

The previous section has presented four different relevance measures based on RST: Dependence Rough Sets relevance (DRS); Proportional Rough Sets relevance (PRS); Mean Rough Sets relevance (hereafter referred to simply as Mean); and Two Rough Sets relevance (henceforth, Two). DRS constitutes a rather standard measure in the rough set community, whilst PRS is proposed as an interesting alternative. On the other hand, Mean and Two represent two alternative combinations of both the DRS and PRS relevance measures.

These relevance measures are used by the feature selection strategies in the overall selection method. We have proposed three different feature selection strategies depending on the selection criterion: Mean Selection (MS), Half Selection (HS), and Selection by Threshold (ST). This section presents an analysis of the performance of all the resulting combinations (that is, 12 = 4 × 3 possible instances of the overall selection method). This analysis is done in terms of attribute reduction and subsequent classification accuracy (i.e., the actual accuracy when CBR is eventually applied for classification).

10 Skowron and Rauszer (1992) proposed to represent knowledge in the form of a discernibility matrix. This representation has many advantages; in particular, it enables simple computation of the core and reducts.

11 Except for the Tao-grid, which is a synthetic data set generated by discretizing a Tao symbol (a 2D representation of a yin-yang circle).

5.1. Methodology and benchmark

CBR is conducted by using BASTIAN (Salamó and Golobardes, 2000), configured to use the 1-Nearest Neighbour algorithm for case retrieval and to retain new cases when there are no similar cases in the case base.

Fig. 5 shows the feature selection process inside BASTIAN, as detailed in Section 4.1. The first step consists of discretizing the numeric attributes of the original case base. We apply the Fayyad and Irani (1992) MDL method. This method uses the class information entropy of candidate partitions to select boundaries for discretization. Initially, the algorithm considers one big interval containing all known values of a feature. Next, it recursively partitions this interval into smaller ones until reaching some stopping criterion, specifically the Minimum Description Length. BASTIAN includes the implementation of Fayyad and Irani's MDL method used in WEKA (Witten and Frank, 2000). By using this implementation, we only discretize the numerical features and keep the nominal ones and the class as in the original data set. Afterwards, in the second step, the system extracts the relevance of each attribute using RST. This step includes computing the reducts and, after that, extracting the relevance using one of the measures (i.e., DRS, PRS, Mean, Two) presented in this paper. The RST code that BASTIAN uses is based on the public Rough Sets Library (Gawrys and Sienkiewicz, 1993). Concretely, for the reduct computation, we use RSL to find all reducts in a discernibility matrix10 that are shorter than the first one found using heuristic search, instead of an exhaustive search, which is computationally unfeasible for large case bases. As with many reduct construction algorithms, optimality of the reducts cannot be guaranteed. In the rough sets literature, many reduct construction algorithms have been proposed, for example Yao et al. (2008) and Yao and Zhao (2009). Finally, in the last step, features are selected by using the MS, HS or ST strategies. In testing, the CBR cycle uses those features selected in the pre-processing step and the test case base. Note that our proposals are placed in the category of filters; for this reason, feature selection is done as a pre-processing step.
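For intuition, the following sketch shows the core of one step of the Fayyad and Irani method: choosing the boundary that minimizes the class-information entropy of the induced split. The recursion and the MDL stopping criterion are omitted, so this is only the boundary-selection idea, not the WEKA implementation the paper uses:

```python
import math
from collections import Counter

def entropy(labels):
    """Class-information entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Cut point whose two-interval split has the lowest weighted class
    entropy (one level of the Fayyad-Irani recursion, without the MDL test)."""
    pairs = sorted(zip(values, labels))
    best_e, best_cut = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        e = (len(left) * entropy(left)
             + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_e, best_cut = e, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut

print(best_boundary([15, 20, 20, 40, 20, 20],
                    ["Samsung", "Sony", "Samsung",
                     "Phillips", "Sony", "Samsung"]))
# 30.0: the best single cut for the 'area' column of Table 1
```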

The evaluation was performed using twenty well-known benchmark data sets from the UCI repository11 (Asuncion and Newman, 2007). The details of these data sets are shown in Table 2.


Table 2
Details of the data sets used in this article. The columns are: data set name, #Inst. = number of instances, #Att. = number of attributes, #Ord. = number of ordinal attributes, #Nom. = number of nominal attributes, #Cla. = number of classes, Dev.Cla. = deviation of the class distribution, Maj.Cla. = percentage of cases belonging to the majority class, Min.Cla. = percentage of cases belonging to the minority class, MV = percentage of missing values (#missing / (#cases × #attributes)).

Data set          #Inst.  #Att.  #Ord.  #Nom.  #Cla.  Dev.Cla. (%)  Maj.Cla. (%)  Min.Cla. (%)  MV (%)
Autos             205     25     15     10     6      10.25         32.68         1.46          1.15
Balance scale     625     4      4      –      3      18.03         46.08         7.84          –
Breast cancer W.  699     9      9      –      2      20.28         70.28         29.72         0.25
cmc               1473    9      2      7      3      8.26          42.70         22.61         –
Horse-Colic       368     22     7      15     2      13.04         63.04         36.96         23.80
Credit-A          690     15     6      9      2      5.51          55.51         44.49         0.65
Glass             214     9      9      –      2      12.69         35.51         4.21          –
TAO-grid          1888    2      2      –      2      0.00          50.00         50.00         –
Heart-C           303     13     6      7      5      4.46          54.46         45.54         0.17
Heart-H           294     13     6      7      5      13.95         63.95         36.05         20.46
Heart-Statlog     270     13     13     –      2      5.56          55.56         44.44         –
Hepatitis         155     19     6      13     2      29.35         79.35         20.65         6.01
Ionosphere        351     34     34     –      2      14.10         64.10         35.90         –
Iris              150     4      4      –      3      –             33.33         33.33         –
Labor             57      16     8      8      2      14.91         64.91         35.09         55.48
Primary-tumor     339     17     –      17     22     5.48          24.78         0.29          3.01
Vehicle           946     18     18     –      4      0.89          25.77         23.52         –
Vote              435     16     –      16     2      11.38         61.38         38.62         5.63
Vowel             990     13     10     3      11     0.00          9.09          9.09          –
Wine              178     13     13     –      3      5.28          39.89         26.97         –


As can be seen, the chosen data sets provide a wide variety in size, class complexity, attribute types, and missing value percentage.

Classification tests for each selection method and data set are conducted by means of stratified 10-fold cross-validation. Stratified cross-validation requires each fold to contain at least one case of each class. Note that when solving a new problem, CBR generalizes its cases to cover this new situation; in this article, we have considered the class of the most similar case (i.e., the nearest neighbour) as the solution of the new problem. An s-fold cross-validation divides a data set into s equal-size subsets. Each subset is used in turn as a test set, with the remaining (s − 1) subsets used for training. Feature selection is done over each training set of cases, and the results over the corresponding test sets are subsequently averaged. Finally, it is worth mentioning that we use a paired t-test on these runs to evaluate the statistical significance of the obtained results.

5.2. Rough set prediction efficiency and attribute reduction

As we have already mentioned, we use the 20-data-set benchmark to compare the classification performance of all combinations of the previous relevance measures (DRS, PRS, Mean, and Two) and selection strategies (MS, HS, ST). The chart in Fig. 6 compares statistically each selection method instance with basic CBR (that is, CBR without feature selection). Each column in this chart shows for how many data sets the proposed selection method performs as well as basic CBR, and for how many it yields a classification performance that is significantly better than basic CBR (using as significance test a paired t-test at the 0.05 significance level). In this manner, the first bar shows that when the MS selection strategy uses the DRS relevance measure, the selection method yields a subsequent classification for which 11 out of 20 data sets perform as well as basic CBR and, among those, 8 achieve a classification performance that is significantly better than basic CBR. As we can observe, the best combinations happen to be PRS-ST (i.e., the combination of PRS with ST), Mean-ST and Two-ST, which do as well as CBR for all twenty data sets and even improve significantly for 5, 5 and 6 of these data sets, respectively.

From Fig. 6 we can also observe that PRS clearly outperforms DRS for all the analyzed selection strategies. This is partially due to the fact that it favours the selection of a larger number of attributes (see Fig. 8 and its related discussion further in this subsection). Mean and Two combine both relevance measures in a compensatory manner and, therefore, generate results lying between DRS and PRS – both in terms of performance and number of selected attributes. The combination Two-ST gives the highest benefit in terms of performance: it keeps a performance equivalent to PRS-ST in relation to basic CBR and, moreover, significantly improves basic CBR in 6 data sets, thus denoting that DRS positively influences the Two relevance measure.

Fig. 6. Feature selection performance with respect to basic CBR in terms of the number of data sets.


Fig. 8. Number of selected attributes (as a percentage of the number of attributes in the data set) in our benchmark when combining the ST selection strategy with the different relevance measures (DRS, PRS, Mean, and Two). The reduction for data sets with the same number of attributes is displayed averaged.

Fig. 7. Feature selection performance in terms of classification percentage, for each relevance measure (DRS, PRS, Mean, Two) combined with each feature selection strategy (MS, HS, ST).

Fig. 9. Feature selection performance comparison of proposed and standard techniques with respect to basic CBR in terms of the number of data sets.


Additionally, note that ST provides the best results for all relevance measures but DRS. This may be due to the fact that DRS is a relevance measure that favours attribute over-reduction and that HS is the only strategy that compensates for this effect by always selecting approximately half the attributes. Overall, it seems that all selection methods actually select useful attributes, and performance is related to how restrictive they are about the number of attributes to use in classification.
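To make this restrictiveness argument concrete, the sketch below gives one plausible reading of the three strategies, assuming each attribute carries a relevance score in [0, 1]; the threshold value is illustrative, not the paper's actual setting (the strategies are defined precisely earlier in the paper).

import numpy as np

def mean_selection(relevance):
    # MS: keep attributes scoring at least the mean relevance.
    return np.flatnonzero(relevance >= relevance.mean())

def half_selection(relevance):
    # HS: keep (approximately) the top half of the attributes by relevance.
    k = max(1, len(relevance) // 2)
    return np.argsort(relevance)[::-1][:k]

def selection_by_threshold(relevance, threshold=0.5):
    # ST: keep attributes whose relevance exceeds a fixed threshold.
    return np.flatnonzero(relevance > threshold)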

If we look more closely at the classification performance, in addition to measuring how many data sets are classified without decreasing performance with respect to basic CBR, we can also compute the performance of each strategy by averaging performances over our twenty data sets. The chart in Fig. 7 shows these averages. They vary from 76.62% for DRS-ST up to 80.0% for PRS-MS, PRS-ST, Mean-ST and Two-ST. Note that PRS is the best for all feature selection strategies. These results are also consistent with our previous statement about performance and restrictiveness in selecting attributes since, as Fig. 11 in the next subsection details, DRS-HS in fact selects fewer attributes than PRS-ST.

In order to further study the advantage of the overall selection method, Fig. 8 shows the reduction over our twenty data sets when using the ST selection strategy (or, equivalently, the DRS-ST, PRS-ST, Mean-ST, and Two-ST selection methods). This graph plots how the percentage of selected attributes is reduced along the x axis. Thus, the axis of abscissas corresponds to the number of attributes in our benchmark, which includes data sets whose number of attributes varies from 2 to 34. As we can observe, all relevance measures yield a considerable attribute reduction for those data sets having more than 18 attributes.

Fig. 10. Comparison of proposed and standard feature selection strategies in terms of classification performance (CBR is used as the baseline for comparison).

Fig. 11. Average number of selected attributes (in percentage) for different feature selection strategies.

More concretely, the data set with the fewest attributes is Tao-grid, which has 2 attributes and gets no reduction. Similarly, those data sets with 4 attributes do not get much reduction (in fact, DRS is the only method that actually generates some reduction). At the other end of the spectrum, if we consider the largest data set (i.e., Ionosphere, which contains 34 attributes), DRS goes as far as selecting only 4 attributes. As the graph shows, this corresponds to a selection percentage of 11.76%. Nevertheless, this reduction may be too restrictive, since PRS requires the selection of almost 30% of the attributes (which corresponds to the selection of 10 attributes out of 34) in order to obtain a performance that is equivalent to that of basic CBR. For this same Ionosphere data set, the Mean and Two relevance measures yield the selection of 30% and 10% of the attributes (10 and 4 attributes, respectively). In this manner, the results of combining the DRS and PRS relevance measures lie between those from DRS and PRS individually.

5.3. Performance comparison with standard feature selection strategies

In order to get a better insight into the actual contribution of our work, the same analysis can be performed considering also several standard feature selection methods to compare with. More precisely, we have included CFS (Hall, 2000), Chi-Square (Chi) (Liu and Setiono, 1995), Information Gain (Gain) (Quinlan, 1986), ReliefF (Kononenko, 1994), and JohnsonsReduct (JR) (Gupta et al., 2006). The JohnsonsReduct algorithm has been introduced into the BASTIAN platform. In the case of CFS, Chi-Square, Information Gain and ReliefF, we have used their implementation in WEKA (Waikato Environment for Knowledge Analysis) (Witten and Frank, 2000), and their feature selection is applied over an IB1 algorithm, which is also included in WEKA. Obviously, we use the same training and test sets as in our own algorithms. Additionally, it is important to note that CFS automatically selects the best feature set, whereas Chi-Square, Information Gain and ReliefF require the number of attributes to be set manually. Therefore, in order to get a fair comparison between methods, the results along this paper are obtained by setting their number of attributes to 75%.12

12 This value corresponds to the average feature selection of PRS-ST, Mean-ST and Two-ST; see Fig. 11.
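As a rough analogue of this setup (the actual experiments use WEKA's implementations over IB1), the sketch below fixes the retained fraction to 75% using a chi-squared ranking from scikit-learn, with IB1 approximated by 1-NN.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier

def rank_and_classify(X_tr, y_tr, X_te, y_te, fraction=0.75):
    # Retain the top 75% of attributes by chi-squared score, then classify
    # with 1-NN (note: chi2 requires non-negative feature values).
    k = max(1, int(round(fraction * X_tr.shape[1])))
    sel = SelectKBest(chi2, k=k).fit(X_tr, y_tr)
    clf = KNeighborsClassifier(n_neighbors=1).fit(sel.transform(X_tr), y_tr)
    return clf.score(sel.transform(X_te), y_te)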

Figs. 9 and 10 compare them with a selection of our proposed methods: for each relevance measure we consider the strategy that performs best. Fig. 9 compares them considering both the number of data sets that are classified without decreasing performance with respect to basic CBR and the number of data sets for which each method actually improves basic CBR. As can be seen in Fig. 9, PRS-ST, Mean-ST, and Two-ST are better than the standard methods (i.e., CFS, Chi, Gain, ReliefF, and JR). Regarding the standard methods, JR presents the best performance, while CFS appears to perform the worst in terms of being equivalent to CBR (although it performs significantly better than basic CBR for 8 data sets). In between, Chi-Square, Information Gain, and ReliefF obtain similar results.

Additionally, Fig. 10 shows the comparison in terms of averaged performance values (in percentage), using basic CBR performance, which corresponds to 78.4%, as the baseline. Focusing on Fig. 10, it can be seen that PRS-ST, Mean-ST and Two-ST (all of them close to 80%) are clearly above the rest. Moreover, they outperform both basic CBR and the other standard methods (whose values range from 75.81% for CFS up to 79.12% for JR).

Similarly to Fig. 8 in the previous subsection, we can also compute attribute reduction. Fig. 11 compares the average – over all data sets – of the attribute selection percentage for the previous feature selection strategies. This graph includes Chi, Gain, and ReliefF although, as we have previously mentioned, we have set their number of attributes to 75% for the sake of a fair comparison.13

13 We have also conducted experiments fixing the quantity of attributes to 50% in Chi, Gain and ReliefF, but we do not show them here because they yield results that are poorer than those with 75%, and the comparison would not be fair.

From this figure we can observe that DRS-HS results in the smallest percentage (58.6%) of selected attributes among the proposals shown. Although they are not displayed in Fig. 11, these percentages are in fact smaller for DRS-MS and DRS-ST, whose values are 40% and 45.9%, respectively. At the other end, JR corresponds to the method selecting the most attributes (85.2%). Among our proposals, PRS-ST selection is the largest (75.9%), and again Mean-ST (75.3%) and Two-ST (72.7%) select attribute percentages that lie between PRS and DRS but are higher than CFS. Note that the reduction of CFS and DRS-HS is so large that it results in a decrease in performance. Despite using a similar number of attributes, Chi, Gain and ReliefF result in a performance that is not as good as that of our proposals. In fact, only the JR strategy is able to perform above standard CBR. However, JR selects an average of 85.2% of the attributes, while our proposals (i.e., PRS-ST, Mean-ST and Two-ST) stay below 75% with higher performance (see Figs. 10 and 11).

Taking into consideration the analysis presented so far,14 we can conclude that, though it is widely known that feature selection presents many advantages in terms of noise reduction and computational efficiency, an over-reduction has an unavoidable effect on the subsequent CBR classification accuracy. Therefore, it may be worth considering a method that balances both aspects. In the evaluation of our proposed methods, DRS-MS can yield a selection as low as 40.02% of the original attributes in the data set and produce an average classification accuracy of 77.97% (recall that basic CBR gets 78.4% by considering 100% of the attributes). At the other end, Two-ST selects 72.76% of the attributes to get an accuracy of 80.03%. In this manner, we can reason that, on average, 27.24% of the attributes in the benchmark can be safely discarded for classification. In fact, rather than decreasing classification performance, this reduction actually increases performance. Regarding the additional 32.74% of attributes that DRS-MS discards further on average (this figure is the difference between the Two-ST and DRS-MS percentages of attribute selection), we can see that their absence implies a decrease in accuracy. Therefore, a decision on which selection method to use must consider, up to some point, this balance between classification accuracy and percentage of selected attributes. The larger the number of attributes in the data set, the more relevant the feature selection process becomes. Overall, the general conclusion could be that, if accuracy is a major concern, then the PRS-ST, Mean-ST, and Two-ST selection methods have proven to be the best choice.

14 A preliminary analysis with its results can be found in (Salamó, 2004).

6. Conclusions

Feature selection aims at reducing the number of unnecessary, irrelevant or redundant features. It helps CBR systems to retrieve the most relevant information in large data sets. Although it is well known that feature selection presents many advantages in terms of noise reduction and computational efficiency, an excessive reduction has an unavoidable effect on classification accuracy. In this paper, we have devised three feature selection strategies – Mean Selection, Half Selection, and Selection by Threshold – for Case-Based Reasoning, and several measures for estimating attribute relevance based on Rough Set Theory – namely Dependence Rough Sets, Proportional Rough Sets, Mean Rough Sets, and Two Rough Sets. Our proposals aim to find a compromise between feature reduction and classification accuracy in a CBR classifier when data contain uncertain, redundant, and vague information.

We evaluated the proposals on a benchmark of twenty data sets from the UCI Machine Learning repository. These data sets were carefully chosen to cover various situations with different sizes, complexities, attribute types, and percentages of missing values. Our experiments indicate that our proposals have the potential to deliver worthwhile feature reduction benefits. All the proposals present an acceptable accuracy considering the reduction obtained when compared to standard Case-Based Reasoning. Of all the proposals, the combination that best balances reduction and accuracy happens to be the Selection by Threshold strategy using Proportional, Mean, or Two Rough Sets as the measure of relevance. These combinations do as well as standard Case-Based Reasoning for all twenty data sets, and even improve significantly for up to six of these data sets. Proportional Rough Sets, Mean Rough Sets, and Two Rough Sets are able to find an optimal set of features (i.e., safely discarding those features that are redundant and irrelevant), and they enjoy an increase in performance in all evaluated feature selection strategies. Mean Rough Sets and Two Rough Sets present different levels of reduction: Mean Rough Sets reduction is similar to that of Proportional Rough Sets, whilst Two Rough Sets reduction is greater due to the fact that it promotes Dependence Rough Sets relevance and Proportional Rough Sets relevance in equal conditions. Finally, it is worth noting that the proposed strategies are sufficiently general to be applicable across a wide range of learning algorithms.

Acknowledgements

This work has been supported in part by projects CONSOLIDER-INGENIO CSD 2007-00018 and CSD 2007-00022, TIN2009-14702-C02-02 and TIN2009-14404-CO2-02.

References

Aamodt, A., Plaza, E., 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. In: AI Communications, vol. 7, pp. 39–59.

Aha, D., 1992. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. Internat. J. Man–Machine Stud. 36 (2), 267–287.

Al-Radaideh, Q., Sulaiman, M., Selamat, M., Ibrahim, H., 2005. Feature selection by ordered rough set based feature weighting. In: Database and Expert Systems Applications. Springer-Verlag, pp. 105–112.

Althoff, K.-D. et al., 1994. Induction and case-based reasoning for classification tasks. In: Bock, H., Lenski, W., Richter, M. (Eds.), Information Systems and Data Analysis, Prospects–Foundations–Applications, Proc. 17th Annual Conference of the GfKl, University of Kaiserslautern. Springer-Verlag, Berlin–Heidelberg, pp. 3–16.

Anastassiou, D., 2007. Computational analysis of the synergy among multiple interacting genes. Mol. Systems Biol. 3 (83), 1.

Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. <http://www.ics.uci.edu/~mlearn/MLRepository.html>.

Beaubouef, T., Ladner, R., Petry, F., 2004. Rough set spatial data modeling for data mining. Internat. J. Intell. Systems 19 (7), 567–584.

Bell, D.A., Guan, J.W., 1998. Computational methods for rough classification and discovery. J. Amer. Soc. Inform. Sci. 49 (5), 403–414.

Bellman, R., 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press.

Blum, A., Langley, P., 1997. Selection of relevant features and examples in machine learning. In: Artificial Intelligence, vol. 97, pp. 245–271.

Caruana, R., Freitag, D., 1994. How useful is relevance? In: Proc. AAAI Fall Symposium on Relevance. AAAI Press, pp. 21–25.

Choybey, S., Deogun, J., Raghawan, V., Sever, H., 1996. A comparison of feature selection algorithms in the context of rough classifiers. In: Proc. 5th IEEE Internat. Conf. on Fuzzy Systems, New Orleans, vol. 2, pp. 1122–1128.

Fayyad, U., Irani, K., 1992. On the handling of continuous-valued attributes in decision tree generation. Machine Learn. 8, 87–102.

Gawrys, M., Sienkiewicz, J., 1993. Rough Set Library User's Manual. Tech. Rep. 00-665, Institute of Computer Science, Warsaw University of Technology.

Golobardes, E., Llorà, X., Salamó, M., Martí, J., 2002. Computer aided diagnosis with case-based reasoning and genetic algorithms. Knowl.-Based Systems 15 (1–2), 45–52.

Gupta, K., Moore, P., Aha, D., Pal, S., 2005. Rough set feature selection methods for case-based categorization of text documents. Pattern Recognition Machine Intell., 792–798.

Gupta, K.M., Aha, D.W., Moore, P., 2006. Rough sets feature selection algorithms for textual case-based classification. In: Proc. 8th European Conf. on Case-Based Reasoning. Springer, pp. 166–181.

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Machine Learn. Res. 3, 1157–1182.


Hall, M., 2000. Correlation-based feature selection of discrete and numeric class machine learning. In: Proc. Internat. Conf. on Machine Learning. Morgan Kaufmann, pp. 359–366.

Jain, A., Chandrasekaran, B., 1982. Dimensionality and sample size considerations in pattern recognition practice. In: Classification, Pattern Recognition and Reduction of Dimensionality. Handbook of Statistics, vol. 2. Elsevier, pp. 835–855.

Jain, A.K., Duin, R.P., Mao, J., 2000. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Machine Intell. 22 (1), 4–37.

Jensen, R., Shen, Q., 2004. Fuzzy rough attribute reduction with application to web categorization. Fuzzy Sets and Systems 141 (3), 469–485.

Kira, K., Rendell, L., 1992. A practical approach to feature selection. In: Proc. 9th Internat. Conf. on Machine Learning, pp. 249–256.

Kohavi, R., John, G., 1997. Wrappers for feature subset selection. In: Artificial Intelligence, vol. 97, pp. 273–324.

Kononenko, I., 1994. Estimating attributes: Analysis and extensions of RELIEF. In: Proc. 7th European Conf. on Machine Learning, pp. 171–182.

Korn, F., Pagel, B.-U., Faloutsos, C., 2001. On the 'dimensionality curse' and the 'self-similarity blessing'. IEEE Trans. Knowl. Data Eng. 13 (1), 96–111.

Li, D., Zhang, W., 2006. Gene selection using rough set theory. In: Rough Sets and Knowledge Technology, pp. 778–785.

Li, Y., Shiu, S.C.K., Pal, S.K., 2006. Combining feature reduction and case selection in building CBR classifiers. IEEE Trans. Knowl. Data Eng. 18 (3), 415–429.

Liu, H., Setiono, R., 1995. Chi2: Feature selection and discretization of numeric attributes. In: Proc. 7th IEEE Internat. Conf. on Tools with Artificial Intelligence, pp. 388–391.

Martins Jr., D.C., Braga-Neto, U.M., Hashimoto, R.F., Bittner, M.L., Dougherty, E.R., 2008. Intrinsically multivariate predictive genes. IEEE J. Sel. Top. Signal Process. 2 (3), 424–439.

Pawlak, Z., 1982. Rough sets. Internat. J. Inform. Comput. Sci. 11 (5), 341–356.

Pawlak, Z., 1991. Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers.

Pudil, P., Novovicová, J., Kittler, J., 1994. Floating search methods in feature selection. Pattern Recognition Lett. 15 (11), 1119–1125.

Quinlan, R., 1986. Induction of decision trees. Machine Learn. 1 (1), 81–106.

Riesbeck, C., Schank, R., 1989. Inside Case-Based Reasoning. Lawrence Erlbaum Associates, Hillsdale, NJ, US.

Salamó, M., 2004. Integration of Rough Sets Theory into Case-Based Reasoning to Favour the Retrieval Phase. Ph.D. Thesis, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull.

Salamó, M., Golobardes, E., 2000. BASTIAN: Incorporating the rough sets theory into a case-based classifier system. In: Butlletí de l'ACIA: III Congrés Català d'Intel·ligència Artificial (CCIA'00), Barcelona, Spain, pp. 284–293.

Salamó, M., Golobardes, E., 2002a. Analysing rough sets weighting methods for case-based reasoning systems. Iberoamer. J. Artif. Intell. 1 (15), 34–43.

Salamó, M., Golobardes, E., 2002b. Deleting and building sort out techniques for case base maintenance. In: Craw, S., Preece, A. (Eds.), Proc. 6th European Conf. on Case-Based Reasoning. Springer, pp. 365–379.

Salamó, M., Golobardes, E., 2003. Unifying weighting and case reduction methods based on rough sets to improve retrieval. In: Proc. 5th Internat. Conf. on Case-Based Reasoning. Springer, pp. 494–508.

Salamó, M., Golobardes, E., 2004. Global, local and mixed case base maintenance techniques. In: Proc. 6th Congrés Català d'Intel·ligència Artificial, vol. 13. IOS Press, pp. 127–134.

Salamó, M., López-Sánchez, M., 2007. Feature selection based on rough sets for case-based reasoning. In: Proc. Workshop on Uncertainty and Fuzziness in Case-Based Reasoning at the International Conference on Case-Based Reasoning, pp. 1–8.

Shen, Q., Chouchoulas, A., 2002. A rough-fuzzy approach for generating classification rules. Pattern Recognition 35 (11), 2425–2438.

Shiu, S., Pal, S.K., 2004. Foundations of Soft Case-Based Reasoning. John Wiley & Sons.

Skowron, A., Rauszer, C., 1992. The Discernibility Matrices and Functions in Information Systems. In: Intelligent Decision Support. Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers.

Somol, P., Pudil, P., Novovicová, J., 1999. Adaptive floating search methods in feature selection. Pattern Recognition Lett. 20 (11–13), 1157–1163.

Wang, X., Yang, J., Teng, X., Xia, W., Jensen, R., 2007. Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Lett. 28 (4), 459–471.

Watson, I., 1997. Applying Case-Based Reasoning: Techniques for Enterprise Systems. Morgan Kaufmann Publishers Inc.

Witten, I.H., Frank, E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers.

Yao, Y., Zhao, Y., 2009. Discernibility matrix simplification for constructing attribute reducts. Inform. Sci. 179 (7), 867–882.

Yao, Y., Zhao, Y., Wang, J., 2008. On reduct construction algorithms. Trans. Comput. Sci. II, 100–117.

Yun, J., Zhanhuai, L., Yang, Z., Qiang, Z., 2004. A new approach for selecting attributes based on rough set theory. In: Intelligent Data Engineering and Automated Learning. Springer, pp. 152–158.

Zadeh, L., 1965. Fuzzy sets. Inform. Control 8, 338–353.

Zhang, M., Yao, J., 2004. A rough sets based approach to feature selection. In: Proc. 23rd Internat. Conf. of NAFIPS, pp. 434–439.

Zhong, N., Dong, J., Ohsuga, S., 2001. Using rough sets with heuristics for feature selection. J. Intell. Inform. Systems 16 (3), 199–214.