
Class-modelling techniques that optimize the probabilities of false noncompliance and false compliance




Chemometrics and Intelligent Laboratory Systems 103 (2010) 25–42



Mª Sagrario Sánchez b, Mª de la Cruz Ortiz a,⁎, Luis A. Sarabia b, Verónica Busto a

a Department of Chemistry, Faculty of Sciences, University of Burgos, Pza. Misael Bañuelos s/n, 09001 Burgos, Spain
b Department of Mathematics and Computation, Faculty of Sciences, University of Burgos, Pza. Misael Bañuelos s/n, 09001 Burgos, Spain

⁎ Corresponding author. Fax: +34 947258831. E-mail address: [email protected] (M.ªC. Ortiz).

0169-7439/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2010.05.007

Article info

Article history: Received 21 August 2009; Received in revised form 22 April 2010; Accepted 17 May 2010; Available online 24 May 2010

Keywords: Class-modelling; Partial least squares; Pareto-optimal front; Colour wines; Genetic algorithm; Neural network; Sensitivity; Specificity

Abstract

The work presents two approaches for the construction of empirical class-models for a given category C. The attention is centred on the information provided by the sensitivity and specificity, the two usual parameters employed to qualify a class-model. In fact, not a single class-model is built for C but a set of class-models which differ in their sensitivity and specificity. Therefore the range of possible jointly available values is described, allowing the user to select the model that best adapts to specific situations or particular needs.

One of the approaches, PLS-CM (Partial Least Squares Class-Modelling), is based on the modelling of the distribution of the values obtained by a PLS model fitted with binary response (belong/do not belong to C). In that way, the corresponding hypothesis test permits the computation of the probabilities α and β of type I and type II errors when deciding whether a sample belongs to C. These probabilities, expressed as percentages, are 100 minus sensitivity and 100 minus specificity, respectively. The representation of β versus α is the risk curve that describes the PLS-CM capability of modelling category C.

The other approach comes from setting the problem as a multi-objective optimization problem, the one that corresponds to simultaneously maximizing sensitivity and specificity, which usually behave oppositely. The trade-off solutions (again, different class-models) are computed to be Pareto-optimal solutions, that is, the set of solutions that are optimal in at least one of the conflicting objectives, what is known as the Pareto-optimal front, POF.

Additionally, a procedure to cross-validate the risk curve and the Pareto-optimal front is proposed for the first time in order to evaluate the prediction ability of both methods.

Two case-studies are used to drive the discussion: 1) the characterization of wines that official wine-tasters regarded as compliant according to the quality characteristics stated by a Denomination of Origin, and 2) the characterization of breast tumours defined as benign (compliant class) from 9 cytological variables. Finally, the performance of the methods is tested using several data sets from the literature.


© 2010 Elsevier B.V. All rights reserved.

1. Introduction

An important aim of collecting data about sets of objects or samples is the classification of these samples into different categories, e.g., the classification of food products according to quality or the classification of patients into categories of diseases (in which case the classification is referred to as medical diagnosis). For this purpose, supervised pattern recognition techniques have been introduced. The final aim of these methods is the development of classification functions or decision rules.

Conceptually, the distinctive characteristic of the class-modelling techniques is the possibility of estimating sensitivity and specificity for each class-model. Sensitivity measures the capacity of the computed model to recognize its own objects whereas specificity measures the capacity of the model to reject foreign objects. Thus, both parameters jointly describe the performance of the computed models and allow evaluation of the possible confusion among categories or detection of outlier data. From the definitions it is clear that class-models with high sensitivity and high specificity are needed. However, in general, these parameters show opposite behaviour: when one increases, the other decreases.

From the mathematical point of view, the class-models may be analytical or statistical. Both approaches differ considerably in the way they extract the information contained in the training set. The analytical methods define a region of the space of the predictor variables linked with each category. The statistical methods, for their part, link a probability distribution with each category, either in the multidimensional space of the predictor variables or in the one-dimensional response, which is a function of the predictor variables.

Further, in spite of their importance, the usual modelling techniques do not allow the use of sensitivity and specificity to select the model, neither in fitting nor in prediction. That means that sensitivity and specificity are computed with the resulting fitted class-model. Even so, they can be used as comparison criteria to select among models, for example, among SIMCA [1] models for different combinations of metals contents in wines from the Canary Islands [2]. Going further, in the present work, two methodologies are described (one of them analytical, the other statistical) to look for, in fact, a family of optimal class-models for a given problem. In this context, optimal class-models mean class-models with the pairs of optimal values of attainable sensitivity and specificity. In this way, the user may decide among different class-models taking into account existing external requirements and can change the model if the requirements change.

The statistical approach is based on the probabilistic modelling of the distribution of the responses given by a PLS regression model fitted between the predictor variables and an indicator value of the class. The analytical one comes from the general framework of multi-objective optimization and consists of building a family of analytical class-models that are optimal in at least one of the two parameters, sensitivity or specificity.

In particular, the proposed statistical methodology is the PLS-CM (Partial Least Squares Class-Modelling) technique that, coupling a PLS model with binary response and a hypothesis test, constructs a set of different class-models that allow the evaluation of the simultaneous behaviour of the sensitivity and specificity. The graphical representation of the two parameters is the so-called risk curve.

The recently proposed PLS-CM [3,4] has been applied to evaluate screening methods in chemical analysis [5]. PLS-CM has many good properties because of the use of PLS: the possible existence of correlations and/or collinearities among predictor variables is handled, and information about the (original) variables that are more important for describing the response can also be obtained. In addition, the Q and T2 statistics allow the user to detect outliers. The main characteristic that makes PLS-CM different is that the class-models are built from the estimation of the probability distribution of each category, so that sensitivity and specificity can be computed. In that sense, it is different from the well-known Partial Least Squares-Discriminant Analysis, PLS-DA [6–8], although it shares some of its properties. Barker and Rayens [9] provided some insights and formal support to the PLS-DA method, suggested its use for dimension reduction aimed at discrimination, and showed the relationship with canonical correspondence analysis (CCA) and linear discriminant analysis (LDA). Also, González-Arjona et al. [10] established the relation between PLS-DA and procrustes discriminant analysis. However, in all these cases, the methods are applied only to discrimination tasks.

In the analytical procedure proposed in this work, the class-models are neural networks and, for each one, the sensitivity and specificity are computed. The simultaneous maximization of the two parameters is intrinsically a problem of multi-objective optimization and, as such, the Pareto-optimal front obtained by means of a genetic algorithm can be useful. Similarly to the risk curve, the Pareto-optimal front accounts for the possible (trade-off) class-models, that is, class-models that are optimal in at least one of the parameters. This class-modelling technique was proposed in [11], where more details about the algorithmic implementation can be found. In [12], a procedure is presented for optimizing both the meta-parameters that control the neural network that models the class and those of the evolutionary algorithm that allows estimation of the Pareto-optimal front. Other uses of the Pareto-optimal front in analytical applications can be consulted in [13] and [14].

Usually, the class-models are computed to be used in prediction of future samples. In that sense, some measure of the expected behaviour in prediction should be given. When there are not many samples available, the usual procedure consists of using cross-validation, i.e., systematically splitting the available data set into different subsets of size t which are used as an evaluation set while the remaining n − t samples become the present training set. The procedure ends when all the samples have been once in an evaluation set. However, the cross-validation procedures that are usual in Chemometrics are designed to evaluate the prediction capacity of a classification model. On the contrary, the presented techniques generate a set of class-models and thus a procedure for this task needs to be defined. In this work, for the first time, a segmented double cross-validation is presented for the cross-validation of the risk curve in the case of PLS-CM and also for the POF. Both class-modelling procedures with their corresponding validation have been applied to several sets of real data. Two of them are used as case-studies (one about authentication of foods and the other about medical diagnosis) and are explained in depth, whereas the other nine are used to test the performance of the methods and only the final results are summarized.

2. Theory

2.1. Class-modelling

In the framework of classification, there are differences between mere discrimination and modelling of categories. A discriminating technique computes decision rules that assign each sample, described by the vector of measurements (variables) x, necessarily into one, and only one, of the classes. In formal terms, a discriminating technique computes a partition of the space of the variables into as many subspaces as categories in the training set. A modelling technique, on the contrary, computes subsets in the space of the variables (also as many as categories being modelled) but, this time, the subsets are neither necessarily disjoint nor is their union the whole space.

Each of these subsets constitutes what is called the class-model. Therefore, a sample x can belong to one of the class-models, to several class-models, or to no class-model at all. This property makes modelling techniques the adequate ones for the problems at hand, because additional information about the confusion of the categories is obtained, as well as the possibility of detecting samples that do not belong to any category.

In that sense, there are two important indices associated with a modelling technique: sensitivity and specificity. Given a class and a model for it, the sensitivity of the model is its capacity to recognize its own objects, whereas the specificity refers to the capacity of the model to reject samples from other classes. Since the work by Derde and Massart in 1986 [15], when a data set is used to compute the class-models, the sensitivity can be estimated as the proportion of samples of the class that are correctly inside the class-model, and the specificity as the proportion of samples correctly rejected (i.e., samples that do not belong to the class and are outside the class-model). It is usual to express these two parameters as percentages.

The aim is to look for models with high sensitivity and specificity but, generally, these two indices have an opposite behaviour: 'bigger' class-models cause an increase in sensitivity but a diminution of their specificity, whereas smaller class-models are more specific but with less sensitivity.

In any case, there are some questions that are worth emphasizing. For the sake of clarity and to limit the discussion, it will be restricted to two classes. Let X be an n-by-v data matrix (training set) made up of n samples, with v variables measured in each sample x. The samples belong to two classes, n1 samples to category C1 and n2 samples to category C2. Finally, let (1, 0) and (0, 1) codify categories C1 and C2, respectively.

Independently of the method used to compute the models (it may be based on distances, such as UNEQ [15] or SIMCA [1], or not, such as GINN [16]), a mathematical expression is obtained that, depending on the values of the variables for a given sample x, yields an output that can be (1, 0), (0, 1), (0, 0) or (1, 1). The possibilities and their relation to the true category of each sample are written in Table 1: rows stand for the true class whereas columns contain the number of samples, of the corresponding class, that received the corresponding outputs. That means that a11 samples among the n1 of C1 are inside the model for C1 and only in this model (because their output was (1, 0)), a12 are only in the model for C2 (output (0, 1)), while a13 are in the intersection of the models (output (1, 1)) and there are still a14 which are outside both models (output (0, 0)). The analogous holds for the n2 samples in C2. The final count is ∑j=1…4 aij = ni for i = 1, 2, and n1 + n2 = n.

Table 1. Number of samples assigned to class-models according to the output of a modelling method. Two classes, C1 and C2.

True class  Code   Output of the modelling method                                              Total
                   (1,0)            (0,1)            (1,1)          (0,0)
                   samples in the   samples in the   samples in     samples outside
                   model for C1     model for C2     both models    both models
C1          (1,0)  a11              a12              a13            a14                         n1
C2          (0,1)  a21              a22              a23            a24                         n2
                                                                                                n

Therefore, for instance, there are a12 + a22 + a13 + a23 samples inside the model computed for C2 (a22 + a23 are in the correct model but a12 + a13 should not be in this model), and so on. In particular, this matrix allows estimation of sensitivity and specificity. For example, the sensitivity of the model for class C1 is the percentage of objects that, belonging to C1, are correctly inside the class-model for C1, i.e. [(a11 + a13)/n1] × 100%, whereas the specificity of the class-model for C1 would be [(a22 + a24)/n2] × 100% because it is the percentage of objects that do not belong to C1 and are – correctly – outside the class-model for C1. Table 2 summarizes the possibilities.

It is seen that the samples that belong only to their corresponding class-model (a11 and a22) count twice, in the sensitivity of the corresponding model and in the specificity of the other class-model, whereas the obvious 'errors' (samples that only belong to the wrong class-model, a21 and a12) do not appear explicitly in the computations (though they count through n1 and n2).

In any case, the estimates in each category C (it does not matter whether it is C1, C2 or any other category) are always frequencies computed with the existing data. If the probability distribution of the objects in category C is obtainable, and so is that of the objects outside C, then, as Ortiz et al. [17] already stated in 1989, from the formal point of view the decision to assign a sample x to class C can be posed as a hypothesis test for category C:

H0: x ∈ C
H1: x ∉ C    (1)

Therefore, the errors that can be made are measured in terms of probabilities of occurrence, and these probabilities are estimated with the corresponding probability density functions. The probability of type I error is the significance level of the test, that is, the probability of rejecting the null hypothesis given that it is actually true:

α = Pr{reject H0 | H0 is true} = Pr{conclude that x ∉ C when in fact x ∈ C}    (2)

Table 2. Estimation of sensitivity and specificity for the two class-models in Table 1.

Category  Sensitivity (%)             Specificity (%)
C1        [(a11 + a13)/n1] × 100      [(a22 + a24)/n2] × 100
C2        [(a22 + a23)/n2] × 100      [(a11 + a14)/n1] × 100

whereas the type II error refers to failing to reject the null hypothesis given that the alternative hypothesis is actually true, and is quantified as

β = Pr{fail to reject H0 | H0 is false} = Pr{conclude that x ∈ C when x ∉ C}    (3)

Written in positive terms, 1 − α measures the probability of correctly deciding to assign a sample to C, while the power of the test, 1 − β, measures the probability of correctly rejecting a sample that does not belong to C. In the context stated, it is important to note that these errors are related to the sensitivity and specificity associated with the quality of a class-model computed for C. In other words, the model computed for category C is the acceptance region of the corresponding hypothesis test in Eq. (1). Consequently, the sensitivity is 100 × (1 − α)% and the specificity is 100 × (1 − β)%.

2.2. Partial Least Squares Class-Modelling, PLS-CM

The PLS-CM method applies partial least squares regression to class-modelling problems, in which the dependent variable y is a binary response (0 or 1 for samples out of class C or in class C, respectively). The procedure consists of four different steps that are described in the following four subsections.

2.2.1. Fitting and validating a PLS regression model

The PLS model is calculated by regressing y on X using the usual steps: i) pretreatment; ii) obtain the number of latent variables that minimizes the root mean squared error in cross-validation, RMSECV; iii) remove those samples with standardized residual (in absolute value) greater than a threshold value and with Q and T2 statistics significantly greater than the critical value at a prefixed confidence level; iv) repeat steps ii) and iii) until there are no outliers. For a sample x, the value predicted by the PLS model is ŷ = xᵗb, where the b's are the regression coefficients of the PLS model with p (p < v) latent variables.

The latent structure found by the PLS model represents the directions of largest variability of the predictors that are, at the same time, the most correlated with the binary responses, that is, with the 'quality' of belonging to a class. In any case, the values estimated with the fitted PLS model would be neither 0 nor 1, but continuous values in an interval that contains [0, 1]; thus, it is necessary to establish a threshold value, V, between 0 and 1 to decide whether a sample belongs to one class or the other. That is to say, if the value estimated by PLS is greater than the decision threshold V, the sample is said to be compliant; on the contrary, if the value is less than V, the sample is considered to be non-compliant.
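As an illustration of steps i)–ii) and of the role of the threshold V, the following sketch shows how PLS-CM could be set up with scikit-learn; this is not the authors' code (which used the PLS Toolbox for MATLAB), and X, y and the range of latent variables explored are assumed placeholders.

```python
# Minimal PLS-CM fitting sketch (hypothetical names; the outlier removal of
# step iii is omitted). X: autoscaled n-by-v predictor matrix (numpy array),
# y: binary response, 1 for compliant and 0 for non-compliant samples.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def rmsecv(X, y, n_components):
    """Root mean squared error in cross-validation for a given number of LVs."""
    pls = PLSRegression(n_components=n_components, scale=False)
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    y_hat = cross_val_predict(pls, X, y, cv=cv).ravel()
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Step ii): choose the number of latent variables minimizing RMSECV.
errors = {p: rmsecv(X, y, p) for p in range(1, 11)}
p_best = min(errors, key=errors.get)

# Final model; the continuous responses y_hat are later thresholded at V.
pls = PLSRegression(n_components=p_best, scale=False).fit(X, y)
y_hat = pls.predict(X).ravel()
is_compliant = y_hat >= 0.5   # V = 0.5 as an arbitrary illustrative threshold
```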

2.2.2. Fitting a probability distribution to the values, ŷ, given by PLS

For each class it is checked whether the fitted values, ŷ, follow a normal distribution, in which case the mean, x̄, and standard deviation, s, are enough to completely describe it from a statistical point of view. If the normal distribution cannot be assumed, the frequencies of the values ŷ will be used directly, instead of estimating a probability density function (pdf). This is so because the estimation of the pdf would introduce more parameters whose selection can be difficult; for example, the usual kernel density procedure [18] depends on the degree of smoothing "which in turn depends on what the scientist expects the data to be" ([19], page 82 of vol. 1). Kernel density has been found useful in applications where there is a rational approach to the choice of the smoothing parameter, as in the determination of modes, but the kernel density is always more disperse than the original data fitted by PLS, so it introduces an overestimation of the probabilities α and β.
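A minimal sketch of this normality check, assuming scipy.stats in place of the Statgraphics tests actually used by the authors (the function name and the two-test battery are illustrative):

```python
# Decide, per class, whether the PLS responses y_hat_class look normal.
from scipy import stats

def normal_or_frequencies(y_hat_class, alpha=0.05):
    _, p_shapiro = stats.shapiro(y_hat_class)
    mean = y_hat_class.mean()
    std = y_hat_class.std(ddof=1)
    # Kolmogorov-Smirnov against the normal fitted to this class.
    _, p_ks = stats.kstest(y_hat_class, 'norm', args=(mean, std))
    looks_normal = min(p_shapiro, p_ks) > alpha
    return looks_normal, mean, std

# If looks_normal is False, the empirical frequencies of y_hat (or their
# bootstrap percentiles, Section 2.2.3) are used instead of N(mean, std).
```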

2.2.3. Building the risk curve

As explained in Section 2.2.1, each threshold V defines a different class-model. What is more, V is in fact the critical value of the corresponding hypothesis test in Eq. (1), related to a significance level α and to a value of β, as graphically represented in Fig. 1.

Fig. 1. Normal density functions fitted for the compliant samples (right) and the non-compliant samples (left) and representation of the probabilities of false noncompliance (α) and false compliance (β) for a given critical value V.

By varying the critical value V, a family of hypothesis tests is obtained, that is, a set of class-models that differ in their sensitivity and specificity. To estimate them (in fact, the probabilities β and α for a given V), the distribution N(x̄, s) will be used. Otherwise (if the normal distribution cannot be assumed), these probabilities are computed as the corresponding percentiles estimated by using a size-1000 bootstrap sample, performed with the function "bootstrp" of the Statistics Toolbox (Version 4.0) for Matlab® (Version 6.1). As is known [20,21], the bootstrap estimation is more precise than the direct determination.

The representation of β versus α constitutes the risk curve, RC, that describes the possible class-models achievable with the multivariate experimental data. Fig. 2 shows the usual form of a risk curve when both classes follow a normal distribution as in Fig. 1. Note that, when the frequencies of the values given by PLS are used instead of the fitted continuous distribution, the risk curve is piecewise-constant. In any case, the risk curve in Fig. 2 clearly shows the opposite variation of β and α as well as their expected simultaneous behaviour, so that, according to the requirements of the problem at hand, a suitable model (critical value) can be selected.

Fig. 2. PLS-CM risk curve. Each point (α, β) corresponds to a critical value.
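Under the normality assumption, the whole risk curve follows directly from the two fitted distributions. A short sketch (an illustration with scipy, not the authors' code):

```python
# Risk curve from the two normal distributions fitted to the PLS responses.
# (mu_c, s_c): compliant class; (mu_nc, s_nc): non-compliant class.
import numpy as np
from scipy.stats import norm

def risk_curve(mu_c, s_c, mu_nc, s_nc, v_grid=None):
    if v_grid is None:
        v_grid = np.arange(-1.0, 2.0, 0.01)              # thresholds V
    alpha = norm.cdf(v_grid, loc=mu_c, scale=s_c)        # false noncompliance
    beta = 1.0 - norm.cdf(v_grid, loc=mu_nc, scale=s_nc) # false compliance
    return v_grid, alpha, beta

# Each point (alpha[n], beta[n]) is the class-model with critical value
# v_grid[n]; plotting beta against alpha reproduces a curve like Fig. 2.
```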

2.2.4. Cross-validation of the risk curve

To evaluate the performance of the risk curve in prediction there are two possibilities: to use an external set or to establish a procedure based on cross-validation. In the second case, it is mandatory that the evaluation set has not taken part in the PLS regression. Therefore, the following procedure has been followed:

i) The training set is divided into K subsets, K ≥ 2, of training and evaluation pairs (XT1, XE1), …, (XTk, XEk), …, (XTK, XEK), in such a way that i) XTk ∪ XEk = X for each k = 1, …, K, ii) XEk ∩ XEk′ = ∅ for any two evaluation sets, and iii) ∪k=1…K XEk = X. Additionally, each training set XTk is autoscaled and its mean and standard deviation are used to accordingly scale the corresponding evaluation set XEk, i.e., XEk is scaled by subtracting the mean and dividing by the standard deviation of the corresponding training set XTk.

ii) The procedure described in Sections 2.2.1–2.2.3 is applied to each set XTk, obtaining model PLSk (with coefficients bk) and the corresponding risk curve RCk, made up of N class-models related to different critical values V1, …, VN. Then, the values ŷ = xᵗbk, ∀ x ∈ XEk, are computed and, for each n, 1 ≤ n ≤ N, bk,n and ak,n are the number of times that ŷ ≥ Vn (among the non-compliant samples) and ŷ < Vn (among the compliant samples), respectively. The frequencies β̂k,n = bk,n / card(non-compliant samples of XEk) and α̂k,n = ak,n / card(compliant samples of XEk), 1 ≤ n ≤ N, make up the risk curve in prediction for the kth evaluation set.

iii) If, for each n, 1 ≤ n ≤ N, β̂n = ∑k=1…K bk,n / card(non-compliant samples of X) and α̂n = ∑k=1…K ak,n / card(compliant samples of X) are computed, the representation of the points (α̂n, β̂n) makes up the cross-validated risk curve, CVRC (a sketch of this scheme is given below). It is a piecewise-constant curve whose minimum increment in the abscissa axis is 1/card(compliant samples of X) and in the ordinate axis 1/card(non-compliant samples of X).
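A schematic implementation of this scheme, under the same assumptions as the previous sketches (scikit-learn instead of the authors' MATLAB code, stratified folds, and a fixed number of latent variables per fold):

```python
# Cross-validated risk curve (CVRC). X, y as before; v_grid: critical values.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def cv_risk_curve(X, y, v_grid, n_components, K=3):
    a = np.zeros(len(v_grid))   # a_n: compliant samples with y_hat < V_n
    b = np.zeros(len(v_grid))   # b_n: non-compliant samples with y_hat >= V_n
    for train, test in StratifiedKFold(n_splits=K).split(X, y):
        scaler = StandardScaler().fit(X[train])      # autoscale on X_Tk ...
        pls = PLSRegression(n_components=n_components, scale=False)
        pls.fit(scaler.transform(X[train]), y[train])
        y_hat = pls.predict(scaler.transform(X[test])).ravel()  # ... predict X_Ek
        for n, v in enumerate(v_grid):
            a[n] += np.sum((y_hat < v) & (y[test] == 1))
            b[n] += np.sum((y_hat >= v) & (y[test] == 0))
    # Pooled frequencies: the points (alpha_n, beta_n) of the CVRC.
    return a / np.sum(y == 1), b / np.sum(y == 0)
```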

2.3. Pareto-optimal front in class-modelling

In the previous section, probability density functions were fitted as approximations to the empirical frequencies that are used to estimate α and β. In the present analytical approach, the different values of α and β are directly computed with the a priori imposed condition of being optimal. Therefore, the second strategy proposed in the present paper to tackle the problem of multi-objective optimization is to look for the non-dominated solutions (in fact the Pareto-optimal front), that is, class-models that, for a given α, have the best possible (the lowest) β and vice versa, i.e., the solutions that are minimal in at least one of the probabilities.

The key idea when using the proposed analytical approach is to consider that the class-model is a set C in the space of the predictor variables defined by an indicator function IC, which takes the value one for the points belonging to the set and zero for those outside the set. By using the samples in the training set, a function is computed that uniformly approximates IC, which is not even a continuous function. Thus, first of all, a family of functions is needed that is capable of uniformly approximating IC without imposing any smoothing property (neither continuity nor differentiability). Neural networks are one such family. However, neural networks need to be trained, and the training process can be viewed as an optimization process (that is, the minimization of residuals). Therefore, an optimization method is also needed that does not impose any condition on the function to be optimized either. This second condition is fulfilled by optimization methods based on evolutionary algorithms.

The following sections formally define the multi-objective framework and the approach to tackle it, as well as the way the techniques are coupled to compute the desired class-models, and the method used to evaluate their performance in prediction, estimated by cross-validation.

2.3.1. Dominance

Again in the theoretical context previously stated, class-models are needed with simultaneously small values for both α and β, which usually behave oppositely. In general terms, this is a problem of multi-objective optimization because, in this case, two conflicting objectives (the probabilities) are to be jointly minimized. To illustrate the situation, let M1, M2, M3 and M4 be four different class-models computed for a category C with the estimated values of α and β represented in Fig. 3 as pairs (α, β).

From Fig. 3 it is seen that M1 is just above M2, that is, M1 yields a worse estimate of β with the same estimate of α as M2, so that M2 would be preferable. A similar behaviour is observed between M2 and M3: they have the same β but M2 has a better α. Without a doubt, M2 is the model to be chosen when compared to M1 or M3. However, when comparing M2 and M4 the selection is not so clear: M4 has a larger α (it is to the right of M2 in Fig. 3; in fact, it is to the right of all the other three models) but its estimate of β is the best one among those shown in Fig. 3. This example shows the conflicting behaviour of α and β, and highlights the usual situation in multi-objective optimization where no unique 'best' solution can generally be expected. Fig. 3 also illustrates the idea behind the concept of dominance that is related to the Pareto order [22] in the two-dimensional space where α and β jointly vary.

Formally, a solution (class-model in the case at hand) M is said to dominate another solution M′ if the following conditions hold:

i) the solution M is not worse than M′ in all the objectives (the two probabilities), and

ii) the solution M is strictly better than M′ in at least one probability.

Therefore, in Fig. 3, M2 dominates both M1 and M3, which thus become dominated solutions, while M2 and M4 are said to be non-dominated. It seems clear that the non-dominated solutions are the sought solutions since they represent the best that can be expected for any individual objective, given that the other one is fixed.

The set of non-dominated solutions for a given problem constitutes the Pareto-optimal front, which thus accounts for the extent of the conflict between, in this case, the two probabilities. Consequently, for the set of the four class-models in Fig. 3, the Pareto-optimal front, made up of M2 and M4, shows that the best value of β given that α = 0.02 is 0.06, whereas if α is allowed to increase up to 0.07, then β would decrease to 0.03, and no better performance should be expected.

Fig. 3. Graphical representation of β versus α for four different hypothetical class-models.
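The dominance relation is easy to state in code. In the sketch below, M2 = (0.02, 0.06) and M4 = (0.07, 0.03) are taken from the discussion above, while the coordinates of the dominated models M1 and M3 are made-up values consistent with Fig. 3:

```python
# Dominance and Pareto-optimal front for (alpha, beta) pairs to be minimized.
import numpy as np

def dominates(m, m_prime):
    """True if class-model m is not worse in both objectives and strictly
    better in at least one (conditions i and ii above)."""
    m, m_prime = np.asarray(m), np.asarray(m_prime)
    return bool(np.all(m <= m_prime) and np.any(m < m_prime))

def pareto_front(models):
    """Keep only the non-dominated solutions."""
    return [m for m in models
            if not any(dominates(other, m) for other in models if other != m)]

models = {'M1': (0.02, 0.09), 'M2': (0.02, 0.06),   # M1, M3 are illustrative
          'M3': (0.05, 0.06), 'M4': (0.07, 0.03)}
print(pareto_front(list(models.values())))
# -> [(0.02, 0.06), (0.07, 0.03)], i.e. M2 and M4, as in Fig. 3
```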

2.3.2. Neural networks as class-models

The task of computing class-models, that is, of approximating IC, is carried out by using MLP (Multi-Layer Perceptron) neural networks with one hidden layer and discrete outputs. One of the reasons for this choice is that this topology is theoretically able to approximate any function to the desired degree provided that sufficiently many hidden units are available (see, for example, [23] for a summary of results about this property and the kind of transfer functions to be used). Also, no assumptions are needed about the distribution of the data.

For the kth output unit, the functional expression of an MLP neural network with one hidden layer can be written as

ok = gk( v0k + ∑j vjk σj( w0j + ∑i=1…v xi wij ) )    (4)

In Eq. (4), wij denotes the weight of the original variable xi linked to the jth hidden unit (w0j is the bias of the jth hidden unit), σj is the transfer or activation function of the jth hidden unit, vjk is the weight between the jth hidden unit and the kth output unit (v0k is its bias) and, finally, gk is the transfer function of the kth output unit.

When the indicator function IC is to be approximated by Eq. (4), there are 2 output units (k = 1, 2) and gk is always the Heaviside step function defined as

gk(t) = 1 if t ≥ v0k, 0 if t < v0k    (5)

Hence, the possible outputs of the neural network are the four possibilities in Table 1, which allow for the estimation of sensitivity and specificity as in Table 2. Also, note that the method does not impose any conditions either on the distribution of the data or on the resulting class-models computed.
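A direct numpy transcription of Eqs. (4)–(5) is short; the weight arrays below are assumed to come from the evolutionary training of Section 2.3.3 (names are illustrative):

```python
# Discrete-output MLP of Eqs. (4)-(5). W: v-by-h input weights, w0: hidden
# biases, V: h-by-2 output weights, v0: output biases/thresholds.
import numpy as np

def class_model_output(x, W, w0, V, v0):
    hidden = np.tanh(w0 + x @ W)        # sigma_j(w0j + sum_i x_i w_ij)
    net = v0 + hidden @ V               # argument of g_k in Eq. (4)
    return (net >= v0).astype(int)      # Heaviside g_k of Eq. (5)

# The result is one of (1,0), (0,1), (1,1) or (0,0), read as in Table 1,
# so sensitivity and specificity follow by counting as in Table 2.
```

Note that, with the threshold of Eq. (5) placed at v0k, the output reduces to checking whether the weighted sum of the hidden activations is non-negative.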

2.3.3. Evolutionary algorithms to estimate the Pareto-optimal front

However, in practice, a training algorithm is needed to find the desired approximation (i.e., to estimate the weights wij and vjk that yield adequate ok in Eq. (4)). In addition, as has been stated, for the problem at hand this training algorithm cannot depend on the regularity of the error function, which is not even continuous due to the discrete transfer function in the output layer, Eq. (5).

To solve both the problem of training the neural networks in class-modelling tasks and that of estimating the Pareto-optimal front in α and β (or in sensitivity and specificity) of the corresponding neural network class-models, an evolutionary algorithm is used as explained in [11]. In short, as in any other genetic algorithm, a population of potential solutions (different class-models defined in the form of a neural network) is maintained along the evolution based on selection, cross-over and mutation operators. The main difference is that the fitness function is a vector function, in the present case a two-dimensional error function that consists of considering the worst value of α among the classes being modelled (also two in this case) and the worst value of β (i.e., the largest) among those obtained by the different neural networks in the current population.

However, the fact that the fitness function is a vector function changes the population-updating step, which is thus done by using the dominance relation. To do so, once the new potential solutions have been generated by selection, cross-over and mutation, they are merged into the current population. This enlarged population is sorted according to levels of dominance: the first level is made up of the non-dominated solutions among the whole (enlarged) population; the second level corresponds to the non-dominated solutions once the first ones are removed; and so on. As a result, to maintain the population size in the updating step, this order is used to keep the solutions that are in the highest levels of dominance (as many as required). If in a given level there are more solutions than needed, the crowding distance [24,25] is used to select the most disperse ones inside the corresponding sub-front. In that way, the procedure maintains the diversity while looking for the Pareto-optimal solutions.
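A compact sketch of this updating step (an illustration under the same Python assumptions as the earlier sketches; population handling, selection and mutation are omitted):

```python
# Sort an enlarged population of (alpha, beta) pairs into dominance levels
# and compute the crowding distance inside a front.
import numpy as np

def _dominates(p, q):
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def dominance_levels(points):
    remaining = list(range(len(points)))
    levels = []
    while remaining:
        front = [i for i in remaining
                 if not any(_dominates(points[j], points[i])
                            for j in remaining if j != i)]
        levels.append(front)                       # next level of dominance
        remaining = [i for i in remaining if i not in front]
    return levels

def crowding_distance(front_points):
    """Larger distance = more isolated solution inside its sub-front."""
    pts = np.asarray(front_points, dtype=float)
    dist = np.zeros(len(pts))
    for k in range(pts.shape[1]):                  # for each objective
        order = np.argsort(pts[:, k])
        dist[order[0]] = dist[order[-1]] = np.inf  # always keep the extremes
        span = pts[order[-1], k] - pts[order[0], k]
        if len(pts) > 2 and span > 0:
            dist[order[1:-1]] += (pts[order[2:], k] - pts[order[:-2], k]) / span
    return dist

# Updating: fill the new population level by level; inside the last level
# that does not fit completely, keep the solutions with largest distance.
```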

2.3.4. Cross-validation of the Pareto-optimal front

The situation here is the same as in Section 2.2.4 i): the original training set is divided into subsets with samples of the two categories, so that there are K subsets, K ≥ 2, of training and evaluation pairs (XT1, XE1), …, (XTk, XEk), …, (XTK, XEK). Then, in each partition, the following procedure is followed:

i) For each k = 1, …, K, the samples in XTk are autoscaled and used to compute the class-models that give values of the two objectives (worst α, worst β) which are optimal in at least one of them, i.e., the kth Pareto-optimal front, POFk.

ii) The evaluation set XEk is scaled by using the mean and standard deviation of XTk. The Nk neural networks in POFk are used to predict the samples of the scaled evaluation set XEk. Note that, contrary to the case of the risk curve with PLS-CM, each POFk is now made up of a different number of neural networks, which are also different from one another for each k.

iii) With the outputs of these Nk neural networks, the values of α and β are computed for the two categories and the worst ones (wα̂k,n, wβ̂k,n), n = 1, …, Nk, are stored (these are already values in prediction). When finishing with the K partitions, the Pareto-optimal front of all the stored values (wα̂k,n, wβ̂k,n), n = 1, …, Nk, k = 1, …, K, is computed. This last front is the cross-validated Pareto-optimal front, CVPOF.

3. Datasets and software

There are eleven different data sets used. As has been said, the first two are used as case-studies to highlight the different properties of the two approaches being presented, and the rest for benchmarking.

3.1. Wines

The data set consists of 129 samples of young red wine from the Qualified Denomination of Origin (QDO) Rioja provided by the Oenological Station of Haro (Governing Body of Rioja, Spain), where all the experimental determinations were made according to the norms and official protocols of the DO. The sensory analysis was made by a committee of wine-tasters. This panel of tasters qualifies each of the samples of wine with a score that goes from 1 to 10, so that if the average value established by the tasters is equal to or greater than 5, it is considered that the colour of the wine complies with the quality required by the QDO Rioja. If, on the contrary, the score is less than 5, the wine will not be considered adequate to belong to this QDO.

From this last criterion two wine categories are defined:

− Non-compliant samples: made up of the 48 samples of wine that were considered non-compliant since they did not reach the required score.

− Compliant samples: made up of the 81 wines that surpass the score, so that they comply with the quality criteria relating to the colour of Rioja wines.

Each of the 129 samples of young red wine is characterized by 17 variables: eleven routinely determined variables (volatile acidity, density, grade, total acidity, free SO2, total SO2, reducing sugars, anthocyans, polyphenols, methanol and tannins) and the CIELab colour parameters a*, b*, L*, C* and H*, named red/green colour component, yellow/blue colour component, clarity, chroma and tone, respectively, as defined in the resolution of the OIV [26] as the method of determining chromatic characteristics. Finally, saturation S* has also been measured.

3.2. Tumours

The data set is made up of samples from 699 patients who had undergone breast cytology. Nine cell descriptors related to the clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses were obtained by microscopic examination of each sample obtained by fine-needle aspirate and were quantified on a discrete scale from 1 to 10. All cancers and some of the benign masses were histologically confirmed. The remaining benign masses were followed for a year and were biopsied if they changed in size or character. In this way, 458 tumours turned out to be benign and the remaining 241 were malignant tumours.

The data set came from PROBEN1 [27], a collection of benchmark problems and, according to the documentation, it was created based on the "breast cancer Wisconsin" problem data set (presently the 'Breast Cancer Wisconsin (Original)' data set [28]) from the UCI machine learning repository. This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg, who studied the samples and reported some results cited in [29].

Following the notation stated throughout the text, the categories are defined as

− Compliant samples: Those corresponding to 'benign' tumours (458 samples).

− Non-compliant samples: Those that correspond to 'malignant' tumours (241 samples).

Additionally, suggested partitions are given by Prechelt [27] for splitting the data set into training and test sets. According to the cancer1 permutation, a quarter of the data was separated to be considered as a test set to evaluate the prediction capability of the fitted models.

As a result, the training set is made up of 525 samples, 349 of which correspond to benign tumours and 176 to malignant ones; and there is an external test set made up of 174 samples, 109 from the compliant class ('benign') and 65 from the non-compliant class ('malignant').

3.3. Other datasets

To test the performance of the proposed procedures, both methodologies have been applied to several real-data sets available in public repositories. Six of them come from the UCI Machine Learning Repository [30], where detailed information about the problems and references related to them can be found. One of them is used in the book by Hastie et al. [31] and made available by the authors. Another one is taken from LIBSVM [32], and the set Thyroid is a classical set already referred to in [33].

These sets, with different structures, will be described and used in Section 4.3 to show the performance of the proposed methods in both fitting and prediction estimated by cross-validation.

3.4. Software

The PLS Toolbox 3.5 (Eigenvector Research, Inc.) for MATLAB™ has been used to fit and validate the PLS models. The MATLAB codes used for computing the RC, CVRC, POF and CVPOF procedures were developed by our group. The tests for normality have been computed by using Statgraphics® Plus 5.1 (Statistical Graphics Corporation).

4. Results and discussion

4.1. Wines

4.1.1. PLS-CM model

A PLS model was fitted with autoscaled predictors and mean-centred response. Leave-one-out cross-validation was used to estimate the number of latent variables. The root mean squared errors in both calibration (RMSEC) and cross-validation (RMSECV) as a function of the number of latent variables in the PLS model are in Table 3, where it can be seen that the minimum of RMSECV is achieved with 3 latent variables.

Table 3. Data set wines. Root mean squared error in cross-validation (RMSECV) and in calibration (RMSEC), and cumulative variances of the predictors (X) and response (y) explained by the PLS model when adding latent variables. The values that correspond to the selected model (3 latent variables) are marked with an asterisk.

# Latent variables  RMSECV  RMSEC  Explained variance in X (%)  Explained variance in y (%)

PLS with all 129 samples (48 non-compliant and 81 compliant)
1    0.304   0.298   43.25   61.97
2    0.289   0.268   51.88   69.28
3 *  0.288   0.261   62.72   70.75
4    0.293   0.257   –       –

PLS with the first training set XT1 with 86 samples (32 non-compliant and 54 compliant)
1    0.285   0.277   45.24   67.07
2    0.275   0.246   52.30   74.07
3 *  0.274   0.242   62.27   75.14
4    0.276   0.236   –       –

PLS with the second training set XT2 with 86 samples (32 non-compliant and 54 compliant)
1    0.316   0.306   42.56   59.91
2    0.306   0.271   51.58   68.65
3 *  0.302   0.261   61.27   70.73
4    0.316   0.257   –       –

PLS with the third training set XT3 with 86 samples (32 non-compliant and 54 compliant)
1    0.317   0.307   43.18   59.67
2    0.302   0.269   52.36   69.08
3 *  0.301   0.257   60.72   71.84
4    0.307   0.250   –       –

Table 3 also contains the cumulative variance explained in both predictors and response when adding latent variables. The selected model, with 3 latent variables, explains almost 71% of the variance of the response and nearly 63% of the variance of the predictor variables. This is not unusual with sensory data; it just means that a great part of the information contained in the measured variables is not directly related to the perception of the colour made by the tasters of the Denomination of Origin. In addition, no sample simultaneously surpasses the critical values (at the 99% confidence level) of the Q and T2 statistics, nor is the absolute value of any standardized residual greater than 3.

Fig. 4. Data set wines, Box and Whisker plots of the responses of the fitted PLS model. '0' is for non-compliant samples, '1' is for compliant samples.

Fig. 4 shows that the 129 values range from −0.43 to 1.20. The 81 values corresponding to the compliant samples (box on the right) are near 1 and distributed almost symmetrically around 0.9 (approximately the median), whereas the 48 values of the non-compliant samples appear to be more disperse around 0, with one value unusually small (with respect to the rest of the values) that is represented as a cross.

In fact, the predicted values of the responses for the 48 samples of the class of non-compliant wines vary between −0.43 and 0.82, with mean value 0.18 and standard deviation 0.27. Those of the 81 samples of wine that the tasters defined as compliant with respect to the colour of red Rioja wine vary between 0.49 and 1.20, with mean 0.89 and standard deviation 0.18.

Fig. 4 also points out that the values corresponding to the two classes are relatively close and not directly separable. Concretely, there is a group of values of the predicted response, going from 0.49 to 0.82, that may be conflicting since, from the range of variability of the values, it cannot be determined a priori to which category (0 or 1) a sample must be assigned.

According to the PLS-CM procedure outlined in Section 2.2, the next step is to study, for each of the classes, whether the distribution of the response fitted by PLS is compatible with the normal distribution. In particular, the goodness-of-fit χ2 statistic, the Shapiro–Wilk test, the statistics for skewness and kurtosis, the Kolmogorov–Smirnov test and the Anderson–Darling test have been used. Each one compares different aspects of the empirical distribution function to the fitted normal distribution.

Table 4 contains the p-values obtained for each one of the tests detailed, for both categories. The last two rows give the mean and standard deviation of the values fitted by the corresponding PLS model. The p-value is the cumulative probability in the supporting distribution from the computed statistic. Thus, because the p-values of the tests in the second column of Table 4 (non-compliant samples) are in all cases greater than or equal to 0.10, there is no evidence to reject the normality of the data at the usual significance level of 5%.

The situation shown in the third column, corresponding to the compliant samples, is somewhat different from the previous one since the p-values are generally smaller. In any case, at the 5% significance level, except for the Shapiro–Wilk test (which needs a 1% significance level), the results of the tests allow one to accept that the distribution is also normal. Consequently, the N(0.18, 0.27) will be used as the distribution of the responses of the non-compliant samples, and the N(0.89, 0.18) for those of the compliant samples.

To compute the risk curve, V is varied from −1 to 2 by taking 300 equally spaced values in steps of 0.01. Using the fitted normal distributions, the values of α and β have been determined for each threshold. The risk curve thus obtained is the continuous line in Fig. 5.

Since continuous probability distributions were fitted, the whole curve makes sense and allows for the study of the simultaneous conflicting behaviour of α and β, which is again observed in Fig. 5a. Also, note that the different points that make up the risk curve correspond to different critical values V.

Table 4. Data set wines. p-values of the hypothesis tests for the normality of the responses computed by PLS, in each category. The last two rows contain the corresponding mean and standard deviation. NC: non-compliant samples; C: compliant samples.

Hypothesis test        X, all 129 samples   XT1, 86 samples   XT2, 86 samples   XT3, 86 samples
                       NC       C           NC       C        NC       C        NC       C
χ2 (goodness-of-fit)   0.1920   0.1376      0.7603   0.1033   0.3047   0.6318   0.9710   0.4274
Shapiro–Wilk           0.7944   0.0100      0.4668   0.0173   0.5495   0.0807   0.9159   0.0890
z-score for skewness   0.7573   0.3269      0.5969   0.2143   0.4503   0.7598   0.4783   0.5495
z-score for kurtosis   0.9966   0.0838      0.7486   0.9769   0.9557   0.0226   0.5651   0.2549
Kolmogorov–Smirnov     ≥0.10    ≥0.10       ≥0.10    ≥0.10    ≥0.10    ≥0.10    ≥0.10    ≥0.10
Anderson–Darling       0.5151   0.0955      0.3817   0.1164   0.6081   0.2612   0.9755   0.1507
Mean                   0.1837   0.8912      0.1561   0.9075   0.1838   0.8911   0.1768   0.8952
Standard deviation     0.2746   0.1835      0.2552   0.1809   0.2501   0.2047   0.2845   0.1713

Fig. 5. Data set wines, results by means of PLS-CM. a) Risk curve in training, continuous line, and in cross-validation, rhombuses. b) Zoomed region around the most balanced models.

In view of the values of α and β in Fig. 5, the user can make the decision about the model to choose, valuing, in addition, the possible economic aspects related to the costs associated with each type of error, which can imply the acceptance and commercialization of a non-compliant sample or the withdrawal from the market of a compliant sample. For example, a zoomed view of the bottom left corner of Fig. 5a, shown in Fig. 5b, reveals that a model balanced with respect to the two probabilities can be obtained, but for not less than approximately α = β = 0.06, which corresponds to V = 0.6. However, if the model to choose should keep the probability of false compliance β below 0.01 (to protect against labelling as 'Rioja' a wine that does not fulfil the requirements of the colour), then a value of α near 0.35 has to be assumed, or, what is the same, 35 wine samples out of 100 would be rejected when they really fulfil the criteria of colour established by the QDO Rioja, expressed by means of the expert valuation of the tasters. If, on the contrary, it is the probability of false noncompliance, α, which should be controlled, say below 1%, then more than 15 out of each 100 wines would be labelled as Rioja when in fact they do not fulfil the colour characteristics, i.e., β > 0.15 is assumed.
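As a quick check, the balanced point can be recovered directly from the fitted normal distributions quoted above, N(0.89, 0.18) for the compliant class and N(0.18, 0.27) for the non-compliant one, at the threshold V = 0.6:

α = Pr(ŷ < 0.6 | compliant) = Φ((0.6 − 0.89)/0.18) = Φ(−1.61) ≈ 0.054
β = Pr(ŷ ≥ 0.6 | non-compliant) = 1 − Φ((0.6 − 0.18)/0.27) = 1 − Φ(1.56) ≈ 0.059

in agreement with the fitted values (0.054, 0.059) quoted below.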

For estimating the CVRC, K = 3 sets XT1, XT2, XT3 have been used, with 32 non-compliant and 54 compliant samples in each of them. The predictor variables in each XTk have been autoscaled and the corresponding response has been centred. Then "venetian blinds" with 10 data splits were used to obtain the RMSECV. The fitted values, as well as the RMSEC and the percentages of explained variance, are also in Table 3. In each of the three PLS models, three latent variables are needed and there is no sample with values of Q and T2 or with standardized residuals outside the established limits. The similarity of all the parameters with the ones obtained with the complete set X is remarkable. Table 4 contains the p-values of the tests of normality of the values fitted by PLS for both categories in the three training sets. As can be seen, normality can be assumed in the three models and for the two categories at the 5% significance level. The corresponding means and standard deviations that define the normal distribution fitted to each category (rows 7 and 8 of Table 4) are very similar to each other in the three sets and to those of the whole set.

Following the procedure described in Section 2.2.4, the evaluation sets XE1, XE2, XE3 were scaled in each case by subtracting the mean and dividing by the standard deviation of the corresponding training set. Then, using the three risk curves with the scaled evaluation sets, the risk curve in prediction is obtained. Its values are the rhombuses in Fig. 5. To assess the CVRC it should be remembered that the minimum 'step' in the abscissa axis (the nearest values of α) is 0.012 and in the ordinate axis (the nearest values of β) is 0.021. However, the difference between the RC and the CVRC is clear: in the range shown in Fig. 5b, which corresponds to the balanced solutions, it is only possible to achieve for α and β the values (0.086, 0.104) and (0.123, 0.083), against the (0.054, 0.059) obtained in fitting. The deviation between both curves can be easily visualized by comparing the red rhombuses in Fig. 5b with the values of the continuous curve: for values between 0.012 and 0.1605 in the abscissa axis, the differences in α go from 0.006 to 0.106, whereas the differences in β are more similar to each other (varying from 0.046 to 0.063). When one of the probabilities is very near zero, the difference (between RC and CVRC) in the other probability is larger, which would be expected because of the very structure of the curve. Since the PLSk models for the three sets XTk are similar, and thus so are the three RCk, the difference between the RC and CVRC is probably due to the effect of reducing the size (few samples in each category) for the cross-validation.

4.1.2. Pareto-optimal front

For the problem at hand, MLP neural networks with one hidden layer and sigmoid functions are used. The neural network has 17 inputs (the 17 autoscaled predictor variables), 2 hidden units with the hyperbolic tangent transfer function, and 2 outputs (indicator matrix: (1 0) for the compliant class and (0 1) for the non-compliant class).

Fig. 6. Data set wines, POF. The asterisks are the values obtained in training whereas the rhombuses represent the values of the cross-validated Pareto-optimal front.

On the other hand, the meta-parameters of the evolutionary algorithm used for training in the search for the Pareto-optimal front were: real codification of the individuals (weights of the neural networks); population size: 250; uniform selection; simple cross-over; probability of mutation: 0.2; and number of generations: 1000 (stopping criterion). The function to measure the goodness of the solutions was the worst α and the worst β of the ones obtained per category (two-objective optimization).
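Neither the network code nor the genetic operators need to be reproduced to see the bookkeeping involved. The sketch below, under the assumption (not stated in the text) that a sample is placed in a class-model when the corresponding output exceeds 0.5, turns a trained classifier into its (α, β) point and filters a population down to its non-dominated subset:

    import numpy as np

    def alpha_beta(outputs, is_compliant):
        """(alpha, beta) for the compliant class-model: outputs is an (n, 2)
        array of network outputs, is_compliant the true class of each sample."""
        in_model = outputs[:, 0] > 0.5             # assumed acceptance rule
        alpha = np.mean(~in_model[is_compliant])   # own samples rejected
        beta = np.mean(in_model[~is_compliant])    # foreign samples accepted
        return alpha, beta

    def non_dominated(points):
        """Indices of the Pareto-optimal (alpha, beta) pairs: a point is kept
        unless another point is no worse in both objectives and better in one."""
        pts = np.asarray(points)
        return [i for i, p in enumerate(pts)
                if not np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))]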

In the final population, seven neural networks are non-dominated and constitute the estimation of the Pareto-optimal front, represented as asterisks in Fig. 6.

Taking into account that the values represented in the Pareto-optimal front consist of the worst estimates of α and β, it is seen that for null type II error the best expected value of α is greater than 0.1 (sensitivities below and near 90%). At the other extreme, perfect sensitivities can be achieved provided that a β error probability of at least 0.1 is accepted. In the middle, several different class-models are computed with the expected opposite behaviour of α and β.

In order to compare the results with those obtained by using PLS-CM, only the corresponding values of the probability of false noncompliance and false compliance have been considered. In this case, the solutions in the Pareto-optimal front are in fact four different neural networks whose estimates of the aforementioned probabilities α and β are in Table 5. Also, Table 5 contains the number of objects assigned to each class-model by each of the four class-models (neural networks), obtained directly from the outputs of the corresponding neural network, according to Table 1.

Table 5
Data set wines. Number of samples assigned to the class-models according to the output of the neural network models in the Pareto-optimal front.

Neural network  True class     Code   Samples in the model  Samples in the model     Samples in both  Samples outside    Total
                                      for compliant (1,0)   for non-compliant (0,1)  models (1,1)     both models (0,0)
N1              Compliant      (1,0)  73                    0                        8                0                  81
                Non-compliant  (0,1)  0                     43                       5                0                  48
N2              Compliant      (1,0)  76                    1                        4                0                  81
                Non-compliant  (0,1)  1                     46                       1                0                  48
N3              Compliant      (1,0)  79                    2                        0                0                  81
                Non-compliant  (0,1)  1                     47                       0                0                  48
N4              Compliant      (1,0)  76                    1                        0                4                  81
                Non-compliant  (0,1)  0                     44                       0                4                  48

The neural network denoted as N1 in Table 5 (first rows) does not reject any sample as outside both models, nor is there any sample assigned only to the wrong model. For the model of the compliant samples under consideration, counting as in Table 2 and subtracting from 1, its probabilities of false noncompliance and false compliance are (α, β) = (0/81, 5/48) = (0, 0.104), which corresponds to the leftmost point in Fig. 6 (as has been said, one of the extremes of the Pareto-optimal front). The other extreme is N4 in Table 5, which improves specificity (β=0) by rejecting samples, in this case, as outside both models (compare the last two columns of Table 5). In return, five compliant samples are rejected and consequently α = 5/81 = 0.062. Note that, because of the different number of objects per category in the training set, the allocations also weigh differently. Between these two extremes, there is a neural network, N3 in Table 5, that computes a model for the class of compliant samples which is quite balanced, with an estimated probability of false noncompliance of α = 0.025 and an associated probability of false compliance of β = 0.021. For completeness, note that the corresponding probabilities in the class of compliant samples for N2 are (α, β) = (1/81, 2/48) = (0.012, 0.042).
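For instance, the figures for N4 follow directly from the counts in Table 5:

    # N4, compliant row of Table 5: 76 in the compliant model, 1 in the
    # non-compliant model only, 0 in both, 4 outside both; 81 in total.
    alpha = (1 + 4) / 81   # compliant samples outside their own model: 0.062
    # N4, non-compliant row: 0 with output (1,0) and 0 with output (1,1).
    beta = (0 + 0) / 48    # no foreign sample accepted: specificity 100%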

Again, it is important to remember that the neural network models that constitute the Pareto-optimal front represent the optimal models that can be computed. That means that, for the class of compliant samples, the above-mentioned extremes are values attainable for this data set: respectively, with the lowest α, values of β down to 0.10 are achievable, or, with the lowest β, α can reach 0.06. In terms of sensitivity and specificity, a sensitivity of 100% can be achieved with a specificity near 90%, or a specificity of 100% (β=0) with a sensitivity of almost 94%.

The models represented in the POF have better values of α and β than the models of the RC in Fig. 5. In general, this is always so because of the optimal character of the neural networks in the POF. Compare the above-mentioned pairs of sensitivity and specificity with the values for the PLS-CM models represented in Fig. 5, where, to achieve a sensitivity greater than 99%, the specificity decreases to 86.5%, whereas for a specificity higher than 99% the sensitivity decreases to 64.9%.

For the estimation of the prediction ability of the models, again the cross-validation procedure previously explained in Section 2.3.4 was followed. For the sake of comparison, the same training and evaluation sets have been used as for the case of PLS-CM (K=3, with XT1, XT2, XT3 containing 32 non-compliant and 54 compliant samples each). Also here, each training set was autoscaled and its mean and standard deviation were used to scale accordingly the corresponding test set. Then, the neural networks that make up the Pareto-optimal front with each training set are used to predict the corresponding (adequately scaled) evaluation sets XE1, XE2, XE3, which always have 27 compliant and 16 non-compliant samples. The successive trainings give POFk, k=1, 2, 3, with N1=3, N2=3 and N3=1 neural networks.



Table 6
Data set tumours. Root mean squared error in cross-validation (RMSECV) and in calibration (RMSEC), and cumulative variances of the predictors (X) and response (y) explained by the PLS model when adding latent variables. The values that correspond to the selected model (two latent variables) are marked with an asterisk.

# Latent variables   RMSECV   RMSEC    Explained variance in X (%)   Explained variance in y (%)

PLS with all 525 samples (176 non-compliant and 349 compliant)
1                    0.2118   0.2099   64.73                         80.22
2*                   0.2006   0.1960   72.69                         82.76
3                    0.2012   0.1941   78.00                         83.09

PLS with the first training set XT1 with 350 samples (117 non-compliant and 233 compliant)
1                    0.2182   0.2162   65.29                         78.99
2*                   0.2056   0.1998   73.12                         82.06
3                    0.2059   0.1974   79.19                         82.48

PLS with the second training set XT2 with 350 samples (118 non-compliant and 232 compliant)
1                    0.1976   0.1967   65.21                         82.69
2*                   0.1831   0.1777   73.45                         85.87
3                    0.1833   0.1739   77.98                         86.46

PLS with the third training set XT3 with 350 samples (117 non-compliant and 233 compliant)
1                    0.2178   0.2158   63.88                         79.08
2*                   0.2115   0.2042   71.38                         81.26
3                    0.2132   0.2032   77.20                         81.45

Fig. 7. Data set tumours. Box and whisker plots of the responses of the fitted PLS model for the non-compliant and compliant samples in training (NCTr and CTr respectively) and for the non-compliant and compliant samples of the test set (NCTs and CTs respectively).


The CVPOF contains three neural networks, with the errors also drawn in Fig. 6 as rhombuses, corresponding to (0.062, 0.125), (0.111, 0.111) and (0.187, 0.062). These values of α and β are similar to the ones of the three models in the central region of the CVRC shown in Fig. 5b, with (0.086, 0.104), (0.123, 0.083) and (0.160, 0.062).

4.2. Tumours

4.2.1. PLS-CM model

Like in the case of the wines, a PLS model was fitted with autoscaled predictors and mean-centred response. The cross-validation has been made by selecting 10 subsets according to the procedure known as venetian blinds. The root mean squared errors in both calibration (RMSEC) and cross-validation (RMSECV) as a function of the number of latent variables are in Table 6; the minimum of RMSECV is achieved with 2 latent variables. With this model, 72.69% of the variance of the predictor variables explains 82.76% of the variance of the binary response. An important aspect is that four samples have values of the Q and T² statistics greater than the threshold value at 99%. However, these samples have not been removed because all of them belong to the category of malignant tumours and have therefore been diagnosed by a biopsy, so there is no doubt about their assignment. In addition, they do not take extreme values in the predictor variables.

Table 7
Data set tumours. p-values of the hypothesis tests for the normality of the responses computed by PLS, in each category (NC: non-compliant samples; C: compliant samples). The last two rows contain the corresponding mean and standard deviation.

                        X with all 525 samples   XT1 with 350 samples    XT2 with 350 samples    XT3 with 350 samples
Hypothesis test         NC        C              NC        C             NC        C             NC        C
χ² (goodness-of-fit)    0.1471    <10⁻¹⁵         0.8167    <10⁻¹⁵        0.3390    3.4·10⁻¹⁵     0.3278    <10⁻¹⁵
Shapiro–Wilk            0.2938    <10⁻¹⁵         0.5340    <10⁻¹⁵        0.1834    <10⁻¹⁵        0.3258    <10⁻¹⁵
z-score for skewness    0.3742    <10⁻¹⁵         0.3602    1.8·10⁻¹²     0.8309    1.8·10⁻¹⁴     0.3743    6.4·10⁻¹³
z-score for kurtosis    0.6472    <10⁻¹⁵         0.5917    6.8·10⁻¹²     0.5625    4.4·10⁻¹⁵     0.7331    3.9·10⁻¹²
Kolmogorov–Smirnov      ≥0.10     <0.01          ≥0.10     <0.01         ≥0.10     <0.01         ≥0.10     <0.01
Anderson–Darling        0.3823    <10⁻⁵          0.3503    <10⁻⁵         0.4722    <10⁻⁵         0.3140    <10⁻⁵
Mean                    0.1146    0.9422         0.1195    0.9400        0.0937    0.9524        0.1247    0.9373
Standard deviation      0.2337    0.1432         0.2366    0.1455        0.2233    0.1258        0.2358    0.1527


Fig. 7 contains the box and whisker plots of the values of the response estimated by the PLS model for each of the 525 samples of breast tumours, codified as CTr (the 349 compliant, benign, tumours in the training set) and NCTr (the 176 malignant samples). It also contains the plots corresponding to the values estimated with the same PLS model for the 174 test samples, codified as CTs for the 109 benign samples and NCTs for the 65 malignant samples. In the test set, only one sample (from the class of malignant tumours) surpasses the critical values for the Q and T² statistics. Once the values of its predictor variables were analysed, the same conclusion as with the training set was reached and, consequently, it is not removed either.

The second box plot in Fig. 7 shows an asymmetry of the values for CTr, with some of them below 0.75; these are relatively small values (compared to the rest) and coincide with the values in the two upper quartiles of the malignant samples, NCTr. It makes no sense to declare these samples of CTr as outliers because, from the clinical point of view, there is no doubt about these tumours: they are benign, since either they have been biopsied or they have not shown any variation during a year. On the contrary, in the test set the overlap is not as great; only one sample in CTs is in the range of those in NCTs. Also, the distribution of the values is much more symmetrical than in the training set. In both categories the quartiles are very similar, for both the training and the test sets.

The first two columns in Table 7 show the p-values of the six tests of normality applied to the values computed with the PLS model in the training set, in both categories.



Fig. 8. Data set tumours, PLS-CM. The continuous line is the risk curve in training, the circles are the risk curve obtained with the external test set, and the rhombuses represent the cross-validated risk curve.

Fig. 9. Data set tumours, POF. The asterisks are the values obtained in training, the circles represent the Pareto-optimal front in prediction obtained with an external data set, and the rhombuses are the cross-validated POF.

Table 8
Data set tumours. Probabilities of false noncompliance and false compliance for the neural networks in the POF.

Neural network   Probability of false noncompliance (α)   Probability of false compliance (β)
N1               0.0000                                    0.3921
N2               0.0029                                    0.1875
N3               0.0057                                    0.1193
N4               0.0086                                    0.0909
N5               0.0115                                    0.0682
N6               0.0143                                    0.0625
N7               0.0172                                    0.0568
N8               0.0201                                    0.0455
N9               0.0229                                    0.0398
N10              0.0286                                    0.0284
N11              0.0344                                    0.0170
N12              0.0516                                    0.0057
N13              0.0860                                    0.0000


While for the non-compliant samples there is no evidence to reject normality, the contrary happens with the samples corresponding to benign tumours, for which the normal distribution cannot be assumed, with extremely small p-values in all the tests. Consequently, for the computation of the risk curve, the normal distribution fitted to the non-compliant samples will be used, whereas for the compliant ones the alternative procedure based on the estimation of the percentiles by bootstrap and the frequency histogram (Section 2.2.2) will be used.
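The exact recipe combining the bootstrap with the frequency histogram is given in Section 2.2.2 and is not repeated here; purely as an illustration of the bootstrap ingredient, percentiles of the compliant responses could be estimated along these lines (the number of replicates and the averaging over replicates are choices of this sketch, not of the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_percentiles(values, probs, n_boot=2000):
        """Average empirical percentiles over bootstrap resamples of 'values'."""
        probs = 100 * np.asarray(probs)
        est = np.empty((n_boot, len(probs)))
        for b in range(n_boot):
            est[b] = np.percentile(rng.choice(values, size=len(values)), probs)
        return est.mean(axis=0)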

The resulting risk curve is represented by a continuous line in Fig. 8. It is made up of 1000 points (α, β). Applying the corresponding 1000 critical values V to the 174 values calculated with the PLS model in the test set, the risk curve in prediction is obtained, represented as red circles in Fig. 8. This curve is discrete, with a minimum step in abscissas of 1/109 = 0.009 and in ordinates of 1/65 = 0.015, computed taking into account the size (number of samples) of each category in the test set. Except for the point (0.009, 0.015), the rest of the points of the risk curve in prediction lie on the axes, that is, with α or β equal to zero; in other words, the prediction has no error. This behaviour is due to the fact that the values computed with PLS in each category of the test set overlap at just one point, as was pointed out when analysing Fig. 7. Therefore, PLS-CM adequately recognizes the elements of the test set.
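The mechanics of the curve in prediction are just a sweep of the critical value over the test responses. A sketch, assuming the compliant class was coded towards 1 (which the class means in Table 7 suggest), so that a sample is accepted in the compliant class-model when its PLS response is at least V:

    import numpy as np

    def risk_curve_in_prediction(y_compliant, y_noncompliant, critical_values):
        """Empirical (alpha, beta) for each critical value V."""
        return [(np.mean(y_compliant < v),        # compliant rejected
                 np.mean(y_noncompliant >= v))    # non-compliant accepted
                for v in critical_values]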

To estimate the risk curve in cross-validation, CVRC, again three pairs of training and evaluation sets, XT1, XE1; XT2, XE2; and XT3, XE3, have been used, with 350 samples in each training set and 175 in each evaluation set. Table 6 shows the three PLS models for each XTk as well as the size of each class. In all the cases, two latent variables are needed. With these models, the percentage of variance explained in the predictor variables varies between 71.4% and 73.5%, whereas the variance explained in the response goes from 81.3% to 85.9%.

In each of them, 5, 3 and 2 samples had values of Q and T² greater than the critical values at 99%, but in the three cases these samples belong to the malignant tumours (non-compliant samples) and they were not removed. The analysis of the distribution of the values obtained when applying each PLSk model to the XTk (k=1, 2, 3) is shown in Table 7. In every case, normality can be assumed for the class of the malignant tumours but cannot be assumed for the benign ones. Consequently, the same procedure as with the whole set has been followed to obtain the three risk curves RCk, k=1, 2, 3, as described in Section 2.2.2. Finally, the values obtained in cross-validation are shown as blue rhombuses in Fig. 8. It can be stated that the risk curve computed with X and the cross-validated one are equal except for the discrete character of the latter. In the cross-validation process all the samples have intervened and, thus, the original asymmetry of the distribution of the values in the category of the compliant samples is maintained, unlike what happens in the test set.

4.2.2. Pareto-optimal front

For this case, the topology of the MLP neural networks consists of 9 input neurons (the 9 autoscaled predictor variables), one hidden layer with two neurons with the hyperbolic tangent transfer function, and two outputs: (1, 0) for benign tumours and (0, 1) for malignant tumours. The evolutionary algorithm, for its part, works with the same parameters as before. From the 250 neural networks in the final population, 22 are non-dominated solutions and make up the estimation of the Pareto-optimal front, drawn as asterisks in Fig. 9.

Considering only the compliant samples, there are 13 different class-models, whose estimates of α and β are in Table 8. As in the case of PLS-CM, a clear asymmetry is observed: in the first row of Table 8, α=0 (no benign tumour is declared as malignant) and the best expected β is almost 0.4 (i.e., almost 40% of the malignant tumours are not detected as such), whereas for β=0, last row in Table 8, values of



α=0.086 (more than 91% of the benign tumours are recognized as such) can be achieved, although nothing better than this. These are, of course, the two extreme solutions theoretically available. Note that, just going through Table 8 in descending order, for instance, N4 says that allowing fewer than 9 in a thousand benign tumours to be declared as malignant (α=0.0086) would reduce the probability of not detecting a malignant tumour, β, to 9.1%. Equal protection against the two errors is provided by N10, where α ≈ β, both near 0.028, and so on.

These results can be compared with the ones expected when predicting future samples. Therefore, to estimate the prediction ability of the class-models, a cross-validation is made with a partition into three sets, with 116 samples of benign tumours and 59 of malignant ones in the evaluation set, except for one of the cases, where there are 117 and 58 samples of compliant and non-compliant tumours respectively. In any case, they are the same sets XT1, XE1; XT2, XE2; and XT3, XE3 used for the estimation of the CVRC.

The first training, with 233 compliant samples and 117 non-compliant ones in XT1, gives N1=13 neural networks in the Pareto-optimal front, which are used to predict the corresponding evaluation set, XE1. Then, the training with XT2 (232 and 118 samples respectively) gives N2=7 neural networks in the Pareto-optimal front, and the last one, with XT3, has N3=17 neural networks. The Pareto-optimal front of the objective values obtained with these N1+N2+N3=37 neural network class-models is made up of the rhombuses also represented in Fig. 9. As can be seen, the estimation is similar to the one obtained in the training with all 525 samples.

In this case, additionally, there was an external test set, which is now predicted with the 22 neural networks in the Pareto-optimal front obtained with the whole data set. In this way, a comparison can be made with the estimate based on cross-validation. The resulting Pareto-optimal front is made up of five non-dominated solutions, depicted as circles in Fig. 9. The similarity among the three POFs shown in Fig. 9 is worth mentioning.

4.3. Testing the performance of the methods

The list of the nine data sets used is given in Table 9, which describes the source of each set and its basic quantitative characteristics: the size of the set (number of variables, number of objects and their distribution into categories) and the size of the subsets used in the partitioning for the cross-validation (CV) step.

Table 9
Description of the data sets used for benchmarking. More information about them can be found in the corresponding repository source given. Object counts are given as total (in cat. 1 / in cat. 2). The last three columns give the number of solutions in the estimated POF, the number (and percentage) of the 1000 points of the RC that are dominated by at least one point of the POF, and the number (and percentage) of points of the POF that are dominated by at least one point of the RC.

Set          Source     Variables  K for CV  Objects                 Training set (k=1…K−1)  Evaluation set (k=1…K−1)  Training set (last K)   Evaluation set (last K)  Solutions in POF  RC dominated by POF (%)  POF dominated by RC (%)
codRNA       LIBSVM     8          5         59,535 (39,690/19,845)  47,628 (31,752/15,876)  11,907 (7938/3969)        47,628 (31,752/15,876)  11,907 (7938/3969)       70                155 (15.5)               54 (77.1)
Wwines       UCI        11         5         4898 (4715/183)         3918 (3772/146)         980 (943/37)              3920 (3772/148)         978 (943/35)             139               414 (41.4)               42 (30.2)
Transfusion  UCI        4          3         748 (570/178)           499 (380/119)           249 (190/59)              498 (380/118)           250 (190/60)             157               537 (53.7)               0 (0.0)
SAheart      Ref. [31]  9          3         462 (302/160)           308 (201/107)           154 (101/53)              308 (202/106)           154 (100/54)             65                321 (32.1)               32 (49.2)
Thyroid      Ref. [33]  5          3         215 (150/65)            144 (100/44)            71 (50/21)                142 (100/42)            73 (50/23)               6                 228 (22.8)               0 (0.0)
GlassP       UCI        7          3         163 (87/76)             109 (58/51)             54 (29/25)                108 (58/50)             55 (29/26)               31                1000 (100.0)             0 (0.0)
GlassW       UCI        9          3         214 (163/51)            143 (109/34)            71 (54/17)                142 (108/34)            72 (55/17)               13                968 (96.8)               0 (0.0)
SpectF       UCI        44         3         267 (212/55)            179 (142/37)            88 (70/18)                176 (140/36)            91 (72/19)               41                154 (15.4)               23 (56.1)
Spect        UCI        22         3         267 (212/55)            179 (142/37)            88 (70/18)                176 (140/36)            91 (72/19)               51                423 (42.3)               4 (7.8)

Note that the training–evaluation pairs used for CV are the same for both methods, and also that the training set was autoscaled in all the cases, with its mean and standard deviation used to scale the corresponding evaluation set.

As can be seen in Table 9, the data sets cover different sizes, from small (GlassP, 163 samples) to big ones (codRNA, almost 60,000 samples), with different ratios of number of objects to number of variables (from around 6 for SpectF to around 7400 for codRNA), and with differences in class representation, measured by the number of objects belonging to each category, ranging from data sets with quite balanced classes (GlassP) to highly asymmetric ones (Wwines).

Although the detailed information and related references can be found in the corresponding source, there are some comments to be made about some of the sets.

Non-coding RNAs (ncRNAs) have a multitude of roles in the cell, many of which remain to be discovered. However, it is difficult to detect novel ncRNAs in biochemical screens. To advance biological knowledge, computational methods that can accurately detect ncRNAs in sequenced genomes are therefore desirable. In ref. [34] the classification as an ncRNA is made by a modified Support Vector Machine classifier that takes as input the total free energy change of an input sequence pair and the adenine, uracil, and cytosine frequencies of sequence 1 and sequence 2. The data set codRNA contains these data for sequence pairs of known ncRNAs in the Escherichia coli and Salmonella typhi genomes.

The second set in Table 9, called Wwines, refers to the UCI 'Wine Quality data set' and corresponds to the white wine samples, related to white variants of the Portuguese "Vinho Verde" wine. For more details, consult [35]. The goal is to model wine quality based on physicochemical tests. In that sense, the wines with scores of 5 or greater were qualified as 'good' and those with scores less than 5 as 'bad' in the modelling problem.

The set 'Transfusion' in Table 9 refers to the 'Blood Transfusion Service Center Data Set'; Prof. I-Cheng Yeh [36] retains the copyright notice.

The sets named GlassP and GlassW in Table 9 are made from the UCI 'Glass Identification data set', which describes several kinds of glass from physicochemical variables (refractive index, and weight percent in the corresponding oxide of Na, Mg, Al, Si, K, Ca, Ba, and Fe).



Fig. 10. codRNA data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.


The two problems are defined by considering the glasses that correspond to window or non-window glass, GlassW, and those which were float-processed or not, GlassP. In this last case, additionally, the variables Ba and Fe are almost zero, so they are removed and the corresponding set has 7 variables.

Finally, SpectF and Spect refer to the UCI counterparts SPECTF Heart and SPECT Heart data sets, which describe the diagnosing of cardiac Single Proton Emission Computed Tomography (SPECT) images taken on several patients. Each patient is classified into two categories: normal and abnormal. The database of 267 SPECT image sets (patients) was processed to extract features that summarize the original SPECT images. As a result, 44 continuous feature patterns were created for each patient, the SpectF data set. Classification rules 77.0% accurate (as compared with cardiologists' diagnoses) are reported. The patterns were further processed to obtain 22 binary feature patterns, the Spect data set. In this case, the rules reported improve the accuracy to 84.0%.

Figs. 10–18 show, for each data set in the order of the rows of Table 9, the risk curve, RC, obtained with PLS-CM and the Pareto-optimal front, POF, together with their corresponding values in cross-validation, CVRC and CVPOF. The subgraphs a) always correspond to the risk curves computed with PLS-CM and the subgraphs b) to the Pareto-optimal fronts.

The analysis of these curves points to the possibilities offered by each method as far as the achievable probabilities α and β are concerned. It has already been explained that the meaning of the two curves is different. In the RC and the corresponding CVRC, each point in the graph represents the pair of values (α, β) that corresponds to a critical value, whereas the POF and CVPOF contain the pairs of the worst α and the worst β among those obtained with the neural classifier according to Table 2. The solutions in the RC thus only depend on a critical value that is varied to obtain the corresponding α and β, without any additional criterion. On the contrary, the solutions in the POF do not depend on any additional parameter, but they must comply with the dominance criterion.

A global way of comparing both methods requires the selection of criteria that summarize each pair of sets of solutions, in the RC and in the POF, into a pair of indexes. This means choosing an auxiliary criterion; for example, the area between the curves would give an idea of their separation. The one shown in Table 9 is: the number of solutions in the RC that are dominated by at least one solution in the POF, and the number of solutions in the POF that are dominated by at least one solution in the RC. This criterion is insensitive to the size and position of the points: for instance, (0.95, 0.10) is dominated by (0.93, 0.01) and also by (0.80, 0.005), but the practical meaning of these two dominating solutions is very different.
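The index used in the last two columns of Table 9 is then a plain dominance count; a sketch, with the example from the text:

    def dominates(q, p):
        # q dominates p when it is no worse in both error probabilities
        # and strictly better in at least one (alpha and beta are minimized).
        return q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])

    def count_dominated(candidates, reference):
        return sum(any(dominates(q, p) for q in reference) for p in candidates)

    # Both reference points dominate (0.95, 0.10), so the count is 1.
    print(count_dominated([(0.95, 0.10)], [(0.93, 0.01), (0.80, 0.005)]))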

Table 9 also shows the number of solutions in the corresponding estimated POF, which is not necessarily the same for all the sets, whereas for the computation of the RC 1000 critical values have been used for all the data sets. The next-to-last column gives the number of solutions of the RC dominated by a solution of the POF, with the corresponding percentage in brackets. These numbers vary from 1000 and 968 (100% and 96.8%) for 'GlassP' and 'GlassW', respectively, down to 155 and 154 (15.5% and 15.4%) for 'codRNA' and 'SpectF', respectively. Conversely, the number of solutions in the POF dominated by solutions in the RC is in the last column, and varies from 0 for 'Transfusion', 'Thyroid', 'GlassP' and 'GlassW' to 54 and 23 (77.1% and 56.1%) for 'codRNA' and 'SpectF'.

However, given the practical aspects mentioned above and the interest in the whole family of achievable solutions, which is the main goal of the procedures presented, it is more informative to study the 'patterns' of the families of solutions given by each method. These families have been graphically summarized, in both training and cross-validation, in Figs. 10–18.

In general, the families of solutions obtained with PLS and with the POF are stable in prediction because the values in cross-validation are, in almost all the cases, very similar to the values in training. In other words, in general, both methods do not overfit the data. The greatest difference between fitting and cross-validation is observed in the POF of the 'GlassP' data, Fig. 15b, and in the risk curves obtained with PLS-CM for the 'Transfusion' and 'SpectF' data, Figs. 12a and 17a respectively. On the other hand, the similarity between the RC and the CVRC, and between the POF and the CVPOF, is much more notable for the bigger sets, 'codRNA' (Fig. 10) and 'Wwines' (Fig. 11).

The families of solutions obtained with the two methods, as represented in the RC and the POF, are very similar, even if the RC is somewhat better for some of the sets; see for example the risk curves for the data 'codRNA', 'Wwines', 'Transfusion', 'SAheart' and 'SpectF' (Figs. 10–13 and 17). Nevertheless, the values of the POF are better, in the sense that they are closer to the theoretical optimum value (0, 0), than those of the RC for the sets 'Thyroid', 'GlassP', 'GlassW' and 'Spect' (Figs. 14–16 and 18).


Fig. 11. Wwines data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.

Fig. 12. Transfusion data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.



Evidently, the structure of the data makes one method achieve better results than the other. For example, the 'Thyroid' set (Fig. 14) only has 5 variables; a principal component analysis of the autoscaled data (not shown here) reveals that there are only two significant components. The scores of the objects on these two components show that category '0' occupies a central position while category '1' stretches in different directions towards the exterior. A linear model, such as PLS, cannot model this latent structure, and the distributions of the responses given by the fitted PLS are highly overlapped, in such a way that the one of category 0 is almost inside the one of category 1, which causes the asymmetry of the RC, Fig. 14a; that is to say, it is only possible to have small values of β at the cost of large values of α. In fact, small values of α are practically not achievable. On the contrary, the arrangement of the data in the five-dimensional variable space is efficiently captured by the neural network. The good results shown in the POF, Fig. 14b, cannot be attributed to overtraining because they are confirmed in cross-validation.
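The kind of check described here for 'Thyroid' is easy to reproduce; a sketch, where the 90% threshold on the cumulative explained variance is an arbitrary choice of this illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    def n_significant_components(X, threshold=0.90):
        """Autoscale X and count the components needed to reach the
        given fraction of explained variance."""
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        cum = np.cumsum(PCA().fit(Xs).explained_variance_ratio_)
        return int(np.searchsorted(cum, threshold) + 1)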

Also, one modelling method may be more suitable than another depending on the pretreatment of the data. As has been pointed out, the sets SpectF and Spect refer to the same images, differently processed; thus, it is the same information, processed differently just to improve the percentage of correct classifications. According to the information in ref. [30], in this way the accuracy improves from 77% with the 44 continuous variables to 84% with the 22 binary variables (as compared with cardiologists' diagnoses). This improvement is also seen in the POF when comparing Figs. 17b and 18b, and it is less evident when the modelling of the categories is made with a linear transformation of the original variables as in PLS-CM, although the CVRC is nearer to the RC in the second case, Fig. 18a. In any case, in the POF and in the CVPOF there are neural classifiers that equal and even improve on the results reported by the authors of the original work [37].


Fig. 13. SAheart data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.

Fig. 14. Thyroid data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.



It is interesting to note that, despite the rather different approaches to the modelling of the categories, the POF and the RC show a great similarity as far as their curvature is concerned. Examples of this similarity can be seen in the cases 'Transfusion' and 'GlassP' (Figs. 12 and 15), which show similar behaviour in the pattern of the curvatures, although in the 'GlassP' data there is a notable shift between the POF and the RC.

5. Conclusions

Two class-modelling techniques have been proposed, together with the validation in prediction, using cross-validation, of their associated RC or POF. They employ completely different strategies to make use of the information contained in the multivariate data of the samples that define the categories. However, both methods coincide in the fact that they provide a set of class-models optimal in sensitivity and specificity, although with different criteria.

The analytical approach to estimating the POF has the advantage over PLS-CM of not making hypotheses about the class-model; it simply computes indicator functions of the category in the space of the predictor variables in such a way that sensitivity and specificity are the best possible ones. On the contrary, PLS-CM makes use of the latent structure correlated with the response, and then of the probability distribution of the fitted values, to compute a hypothesis test. In that sense, the best possible β is obtained for each value of α.


Fig. 15. GlassP data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.

Fig. 16. GlassW data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.



The computation of the POF is advisable since it establishes the limit of the achievable sensitivity and specificity, while PLS-CM allows a more detailed analysis of the problem and of the structure of the classes.

The two methods also differ in the steps needed for their computation. For computing the POF, once the meta-parameters of the neural network and of the evolutionary algorithm are defined, the user does not need to intervene. For PLS-CM, after deciding the meta-parameters of the PLS model, it is necessary to establish the probability distribution of the fitted values. However, the former is more computationally demanding than the latter.

Finally, because of the analytical character of the procedure to compute the POF, the only information obtained is whether or not a sample belongs to one or both categories, but there is no information about the contribution of each variable to building the models. On the contrary, with a model based on PLS-CM, the analysis of the loadings of the predictor variables on the latent variables gives information about the importance of each of these predictor variables.

In the present work, besides showing the usefulness of the aspects previously mentioned in the cases studied, better results in training are obtained with the POF than with PLS-CM, and this difference is larger when the number of samples per class is smaller. However, in cross-validation the difference between both procedures disappears and, additionally, with more samples per category the difference between results in training and in cross-validation also disappears.


Fig. 17. SpectF data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.

Fig. 18. Spect data set. a) Risk curve, continuous line, and cross-validated risk curve, filled rhombuses; b) Pareto-optimal front, asterisks, and cross-validated Pareto-optimal front, rhombuses.


It is perceived that the structure and/or the pretreatment of the data affect the results in different ways; the approach to compute the POF is less sensitive to them. However, a different structure in the distribution of the values fitted by PLS, which is correlated with the latent structure of the predictors, is detected in prediction when using PLS-CM.

Acknowledgements

The Spanish Ministerio de Educación y Ciencia and the Consejería de Educación de la Junta de Castilla y León are acknowledged for financial support under projects CTQ2008-02264/BQU and BU024A07 respectively, both using European FEDER funds. Also, the University of Burgos and Caja de Burgos (Obra Social) are acknowledged for the grants to develop research projects in 2009.

References

[1] S. Wold, Pattern Recognit. 8 (1976) 127–139.
[2] M. Barbaste, B. Medina, L. Sarabia, M.C. Ortiz, J.P. Pérez-Trujillo, Anal. Chim. Acta 472 (2002) 161–174.
[3] M.C. Ortiz, L.A. Sarabia, R. García-Rey, M.D. Luque de Castro, Anal. Chim. Acta 558 (2006) 125–131.
[4] R. Díez, L.A. Sarabia, M.C. Ortiz, Anal. Chim. Acta 585 (2007) 350–360.
[5] N. Rodríguez, M.C. Ortiz, L.A. Sarabia, A. Herrero, Anal. Chim. Acta 657 (2010) 136–146.
[6] L. Stahle, S. Wold, J. Chemometr. 1 (1987) 185–196.
[7] H. Nocairi, E.M. Qannari, E. Vigneau, D. Bertrand, Comput. Stat. Data Anal. 48 (2005) 139–147.
[8] U.G. Indahl, H. Martens, T. Næs, J. Chemometr. 21 (2007) 529–536.
[9] M. Barker, W. Rayens, J. Chemometr. 17 (2003) 166–173.
[10] D. González-Arjona, G. López-Pérez, A.G. González, Talanta 49 (1999) 189–197.
[11] M.S. Sánchez, M.C. Ortiz, L.A. Sarabia, R. Lletí, Anal. Chim. Acta 544 (2005) 236–245.
[12] L.A. Sarabia, M.C. Ortiz, M.S. Sánchez, Chemometr. Intell. Lab. Syst. 95 (2009) 138–143.
[13] M.C. Ortiz, L. Sarabia, A. Herrero, M.S. Sánchez, Chemometr. Intell. Lab. Syst. 83 (2006) 157–168.
[14] C. Reguera, M.S. Sánchez, M.C. Ortiz, L.A. Sarabia, Anal. Chim. Acta 624 (2008) 210–222.
[15] M.P. Derde, D.L. Massart, Anal. Chim. Acta 184 (1986) 33–51.
[16] M.S. Sánchez, L.A. Sarabia, Anal. Chim. Acta 348 (1997) 533–542.
[17] M.C. Ortiz, J.A. Sáez, J. López, Analyst 118 (1993) 801–805.
[18] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.
[19] M. Thompson, in: S.D. Brown, R. Tauler, B. Walczak (Eds.), Comprehensive Chemometrics. Chemical and Biochemical Data Analysis, Elsevier, Amsterdam, 2009, pp. 77–96.
[20] B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall/CRC Press, Boca Raton, 1999.
[21] M.R. Chernick, Bootstrap Methods, John Wiley and Sons, Inc., New York, 1999.
[22] L.A. Sarabia, M.S. Sánchez, M.C. Ortiz, in: R. Todeschini, M. Pavan (Eds.), Scientific Data Ranking Methods: Theory and Applications, Elsevier, Amsterdam, 2008, pp. 1–50.
[23] M. Leshno, V.Ya. Lin, A. Pinkus, S. Schocken, Neural Netw. 6 (1993) 861–867.
[24] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms, Wiley, 2001.
[25] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, IEEE Trans. Evol. Comput. 6 (2002) 182–197.
[26] International Organization of Vine and Wine (OIV), Resolution OENO 1/2006. Available at http://news.reseau-concept.net/images/oiv_uk/Client/OENO_01–2006_EN.pdf. Last visit: 29/07/2009.
[27] L. Prechelt, Proben1 — A Set of Neural Network Benchmark Problems and Benchmarking Rules. Available at http://digbib.ubka.uni-karlsruhe.de/volltexte/39794. Last visit: 30/11/2009.
[28] http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29. Last visit: 30/11/2009.
[29] http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names. Last visit: 30/11/2009.
[30] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html, University of California, School of Information and Computer Science, Irvine, CA, 2007. Last visit: 16/04/2010.
[31] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. Data sets available at http://www-stat.stanford.edu/tibs/ElemStatLearn/datasets/. Last visit: 19/04/2010.
[32] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm. Last visit: 19/04/2010.
[33] P.K. Hopke, D.L. Massart, Reference data sets for chemometrical methods testing, Chemometr. Intell. Lab. Syst. 19 (1993) 35–41.
[34] A.V. Uzilov, J.M. Keegan, D.H. Mathews, BMC Bioinform. 7 (2006) 173–203.
[35] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis, Modeling wine preferences by data mining from physicochemical properties, Decis. Support Syst. 47 (4) (2009) 547–553.
[36] I-C. Yeh, K.-J. Yang, T.-M. Ting, Knowledge discovery on RFM model using Bernoulli sequence, Expert Syst. Appl. (2008).
[37] L.A. Kurgan, K.J. Cios, R. Tadeusiewicz, M. Ogiela, L. Goodenday, Artif. Intell. Med. 23 (2001) 149–169.