2172 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 47, NO. 7, JULY 2009
Automatic Ground-Truth Validation With Genetic Algorithms for Multispectral Image Classification
Noureddine Ghoggali, Student Member, IEEE, and Farid Melgani, Senior Member, IEEE
Abstract—In this paper, we propose a novel method that aims at assisting the ground-truth expert through an automatic detection of potentially mislabeled learning samples. The method views the mislabeled-sample detection issue as an optimization problem in which the best subset of learning samples, in terms of statistical separability between classes, is sought. This problem is formulated within a genetic optimization framework, where each chromosome represents a candidate solution for validating/invalidating the learning samples collected by the ground-truth expert. The genetic optimization process is guided by the joint optimization of two different criteria: the maximization of a between-class statistical distance and the minimization of the number of invalidated samples. Experiments conducted on both simulated and real data sets show that the proposed ground-truth validation method succeeds: 1) in detecting the mislabeled samples with high accuracy, even when up to 30% of the learning samples are mislabeled, and 2) in strongly limiting the negative impact of the mislabeling issue on the accuracy of the classification process.
Index Terms—Genetic algorithms (GAs), ground-truth validation, Jeffries–Matusita (JM) distance measure, mislabeling issue, multiobjective optimization.
I. INTRODUCTION
THE TYPICAL goal of an inductive learning algorithm is to build discriminant functions from part of the available ground-truth samples (training set) so that the generalization capability of the resulting classifier on previously unseen samples is as high as possible. The quantification of the generalization capability is usually performed on another part of the ground-truth samples, termed the test set. Most works on automatic classification have focused on improving the accuracy (generalization capability) of the classification process by acting mainly on the following three levels: 1) data representation; 2) discriminant function model; and 3) the criterion on the basis of which the discriminant functions are optimized [1]. These works are, however, based on an essential assumption, namely, that the ground-truth samples are of unquestionable quality. In this paper, we question this assumption and show that the accuracy of a classification process (whatever the kind of classifier used) critically depends on the quality of the adopted ground truth.
Manuscript received June 7, 2008; revised October 18, 2008 and January 2, 2009. First published March 27, 2009; current version published June 19, 2009.
The authors are with the Department of Information Engineering and Computer Science, University of Trento, 38050 Trento, Italy (e-mail: melgani@disi.unitn.it; ghoggali@disi.unitn.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TGRS.2009.2013693
The two well-known ground-truth collection approaches are as follows: 1) the in situ observation approach and 2) the photointerpretation approach [2]. Each of them has its own advantages and drawbacks, but both are subject to errors in the labeling process. In the first approach, errors may occur because of georeferencing problems, while in the second one, spectral mismatching errors made by the human analyst are the main source of problems. Since the presence of mislabeling problems (noise) in a learning (training and test) set has a direct negative impact on the classification process, the development of automatic techniques for validating the collected learning samples is, in our opinion, crucial.
To the best of our knowledge, very scarce attention has been paid to this issue in the literature, where it is mainly faced through two different strategies. The first one, which accepts the presence of noise (mislabeling problems) in the data, consists in designing a sophisticated classifier that is less likely to be influenced by this presence [3]. The second strategy is based on the removal of suspect samples from the learning set. An early work derived from this strategy for k-nearest neighbor (kNN) classification suggested first applying a 3NN classification over the whole learning set and then removing the misclassified samples in order to produce a new learning set, on the basis of which a 1NN classifier is formed for the classification phase [4]. In [5], in order to avoid overfitting on noisy samples, the author proposed to perform the removal (filtering) process through the C4.5 decision tree classifier. In [6], the suspect samples are identified and removed from the learning set by means of an ensemble of three classifiers (i.e., C4.5, kNN, and linear classifiers). In particular, a sample is expected to be mislabeled if it is misclassified by the ensemble of classifiers.
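The editing-then-classification strategy of [4] can be sketched as follows. This is an illustrative reconstruction under our own naming and toy data, not code from the cited work: each sample is kept only if a 3NN vote over the remaining samples agrees with its label, and a 1NN classifier is then formed on the edited set.

```python
import numpy as np

def edit_then_1nn(X, y, k=3):
    """Sketch of the editing strategy in [4]: remove samples misclassified
    by a k-NN vote over the remaining samples, then classify with 1-NN on
    the edited learning set. Names and defaults are our own assumptions."""
    n = len(X)
    # Pairwise squared Euclidean distances between learning samples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude each sample from its own vote
    keep = np.zeros(n, dtype=bool)
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]      # k nearest neighbors of sample i
        votes = np.bincount(y[nbrs])
        keep[i] = votes.argmax() == y[i]  # keep only if the vote matches the label
    Xe, ye = X[keep], y[keep]

    def classify(q):                      # 1-NN on the edited learning set
        return ye[((Xe - q) ** 2).sum(-1).argmin()]
    return classify, keep
```

On a toy set of two well-separated clusters, a sample carrying the wrong class label ends up surrounded by neighbors of the other class and is therefore removed by the editing step.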
In this paper, we propose an alternative method that aims at interacting with the ground-truth expert by providing him/her with binary information of the kind validated/invalidated for each learning sample. For each invalidated sample, the expert may confirm or reject the invalidation and thus correct or maintain the adopted labeling before creating the final learning set that will be exploited in the classification process. Our ground-truth validation method is based on viewing the mislabeled-sample detection issue as an optimization problem in which the best subset of learning samples, in terms of statistical separability between classes, is sought. This problem is formulated within a genetic optimization framework for its capability to solve complex pattern recognition issues [7], [8]. In particular, each chromosome is configured as a binary string, which represents a candidate solution for validating/invalidating the available learning samples. The genetic optimization process is guided by the joint optimization of two
0196-2892/$25.00 © 2009 IEEE
Fig. 1. Sketch illustrating the proposed ground-truth validation process.
different criteria, which are the maximization of a between-class statistical distance and the minimization of the number of invalidated samples. The former is expressed in terms of the Jeffries–Matusita (JM) distance measure [1], [2]. The latter allows one to obtain, at convergence, a Pareto front from which the ground-truth expert can select the best solution according to his/her prior confidence in the reliability of the collected ground truth.
Experiments were conducted on both simulated data sets and real remote sensing images. The obtained results reveal that the proposed automatic validation method succeeds in detecting the mislabeled samples with high accuracy, even when up to 30% of the learning samples are mislabeled. Moreover, we show how the removal of the detected mislabeled samples has a very positive impact on the accuracy of different classifiers, namely, the support vector machine (SVM), the kNN, and the radial basis function (RBF) neural network [1], [9]–[15]. This paper complements and integrates partial results presented in [16].
The remaining part of this paper is organized as follows. In Section II, we recall the basic idea of the multiobjective nondominated sorting genetic algorithm (NSGA-II) and describe the proposed automatic ground-truth validation method. Experimental results obtained on simulated and real data sets are reported in Sections III and IV, respectively. Finally, conclusions are drawn in Section V.
II. PROPOSED METHOD
A. Problem Formulation
Let us consider a learning set L composed of n samples labeled by the ground-truth expert such that L = {(x_i, y_i), i = 1, 2, . . . , n}, where each x_i ∈ ℝ^d represents a vector of d remote observations and/or processed features and y_i ∈ Ω = {ω_1 = 1, ω_2 = 2, . . . , ω_T = T} is the corresponding class label. Our objective is to detect in an automatic way which of these n learning samples are potentially mislabeled and to
provide the ground-truth expert with binary information of the kind validated/invalidated for each learning sample. Note that we do not aim at correcting the labels of mislabeled samples. The label correction work shall be carried out by the ground-truth expert (Fig. 1).
A naive approach to this problem would consist in trying all possible combinations of validated/invalidated learning samples and then choosing the best one according to some predefined criterion. This is, however, computationally prohibitive, and thus an impractical solution, even for small values of n, since the total number of possible combinations is equal to 2^n. Therefore, the only solution at hand is to adopt a numerical optimizer to look for the hopefully best solution in the binary solution space. In this paper, we propose to carry out this task by means of a multiobjective genetic optimization method. In the following sections, we first recall the basics of genetic algorithms (GAs). Then, after describing the two main components of the proposed genetic solution (i.e., the chromosome and the fitness function), we explain its different phases.
B. General Concepts on GAs
GAs are general-purpose randomized optimization techniques that exploit principles inspired by biological systems [17], [18]. A genetic optimization algorithm performs a search by evolving a population of candidate solutions (individuals) modeled with chromosomes. From one generation to the next, the population is improved by mechanisms derived from genetics, i.e., through the use of both deterministic and nondeterministic genetic operators. The most common form of GAs involves the following steps. First, an initial population of chromosomes is randomly generated. Then, the goodness of each chromosome is evaluated according to a predefined fitness function representing the considered objective function. This fitness evaluation step allows one to keep the best chromosomes and reject the worst ones by using an appropriate selection rule
Fig. 2. Illustration of the chromosome structure and its effect on the learning sample distribution.
based on the principle that the better the fitness, the higher the chance of being selected. Once the selection process is completed, the next step is devoted to reproducing the population. This is done by genetic operators such as the crossover and mutation operators. The entire process is iterated until a user-defined convergence criterion is reached.
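The steps just described (random initialization, fitness evaluation, selection, crossover, mutation, iteration to convergence) can be sketched as a minimal single-objective GA over binary chromosomes. All names and parameter values below are our own illustrative assumptions, not those used in the paper:

```python
import random

def genetic_search(fitness, n_genes, pop_size=30, generations=50,
                   p_cross=0.9, p_mut=0.02, seed=0):
    """Minimal single-objective GA over binary chromosomes: binary
    tournament selection, one-point crossover, bit-flip mutation.
    Illustrative sketch only; parameters are arbitrary assumptions."""
    rng = random.Random(seed)
    # 1) Random initial population of binary chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        def tournament():
            # 2)-3) Fitness evaluation and selection: better of two random picks.
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < p_cross:          # 4) one-point crossover
                cut = rng.randrange(1, n_genes)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for c in (c1, c2):                  # 5) bit-flip mutation
                for i in range(n_genes):
                    if rng.random() < p_mut:
                        c[i] ^= 1
                nxt.append(c)
        pop = nxt[:pop_size]                    # 6) next generation
    return max(pop, key=fitness)
```

For example, with `fitness=sum` (the "one-max" toy problem), the search drives the population toward the all-ones chromosome.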
Several multiobjective GA-based approaches have been proposed in the literature [19]. In this paper, we adopt the NSGA-II for its low computational requirements and its ability to distribute the solutions uniformly along the Pareto front [8], [20]. It is based on the concept of Pareto dominance. A solution s1 is said to dominate another solution s2 if s1 is not worse than s2 in all objectives and better than s2 in at least one objective. A solution is said to be nondominated if it is not dominated by any other solution. The algorithm starts by generating a random parent population. Individuals (chromosomes) selected through a crowded tournament selection undergo crossover and mutation operations to form an offspring population. Both offspring and parent populations are then combined and sorted into fronts of decreasing dominance (rank). After the sorting process, the new population is filled with solutions of different fronts starting from the best one. If a front can only partially fill the next generation, crowded tournament selection is used again to ensure diversity. Once the next-generation population has been filled, the algorithm loops back to create a new offspring population, and the process continues up to convergence.
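The two concepts at the core of NSGA-II, Pareto dominance and sorting into fronts of decreasing dominance, can be sketched as follows (a simple reference formulation, not the fast sorting procedure of [20]; maximization is assumed for all objectives, and the function names are ours):

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (all objectives
    maximized): no worse in every objective, strictly better in one."""
    return (all(a >= b for a, b in zip(f1, f2))
            and any(a > b for a, b in zip(f1, f2)))

def nondominated_sort(objs):
    """Sort objective vectors into fronts of decreasing dominance:
    front 0 holds the nondominated solutions, front 1 those dominated
    only by front 0, and so on. Returns a list of index lists."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts
```

For instance, among the objective vectors (1, 1), (2, 2), (0, 3), and (1, 0), the vectors (2, 2) and (0, 3) are mutually nondominated and form the first front, since neither is better in both objectives.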
C. GA Setup
The success of a genetic optimization process depends mainly on two ingredients, i.e., the chromosome structure and the fitness functions, which translate the considered optimization problem and guide the search toward the best solution, respectively.
Concerning the first ingredient, since we desire either validating or invalidating each of the available n learning samples, we consider a population of N chromosomes C_m (m = 1, 2, . . . , N), where each chromosome C_m ∈ {0, 1}^n is a binary vector of length n encoding a candidate combination of validations and invalidations of the learning samples. As shown in Fig. 2, a gene taking the value 1 or 0 means the invalidation or validation of the corresponding sample, respectively.
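The effect of one such chromosome on a learning set can be illustrated with a toy example (the data below are our own invention, following the gene convention of Fig. 2):

```python
import numpy as np

# Toy learning set: six one-dimensional samples from two classes,
# following the notation L = {(x_i, y_i)} above. Values are invented.
X = np.array([[0.10], [0.20], [0.90], [1.00], [0.15], [0.95]])
y = np.array([1, 1, 2, 2, 1, 2])

# One chromosome: gene 1 invalidates the sample, gene 0 validates it
# (the convention of Fig. 2). Here the 3rd and 5th samples are invalidated.
chrom = np.array([0, 0, 1, 0, 1, 0])

# Decoding the chromosome keeps only the validated samples.
validated = chrom == 0
X_val, y_val = X[validated], y[validated]
```

The fitness functions are then evaluated on the validated subset (X_val, y_val) only, so each chromosome corresponds to one candidate cleaned learning set.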
The validation/invalidation procedure is based on the hypothesis that mislabeling a learning sample potentially leads to an increase of the intraclass variability and thus to a decrease of the between-class distance. Therefore, as a first fitness function, we make use of a between-class statistical distance based on the well-known JM distance measure [1], [2]. This measure is a function of the Bhattacharyya distance, which is derived from the Chernoff bound, i.e., an upper bound on the probability of error of the Bayes classifier. In the case of multivariate Gaussian distributions, the JM distance between two generic classes i and j is given by
JM_{ij} = \sqrt{2\left(1 - e^{-B_{ij}}\right)}   (1)

where B_{ij} is the Bhattacharyya distance defined as

B_{ij} = \frac{1}{8}(\mu_i - \mu_j)^T \left[\frac{\Sigma_i + \Sigma_j}{2}\right]^{-1} (\mu_i - \mu_j) + \frac{1}{2} \ln \frac{\left|\frac{\Sigma_i + \Sigma_j}{2}\right|}{\sqrt{|\Sigma_i|\,|\Sigma_j|}}   (2)

where \Sigma and \mu denote the class covariance matrix and mean vector, respectively, and |\cdot| stands for the determinant operator. The JM distance is bounded by the interval [0, \sqrt{2}]. When the two classes are identical (and, thus, completely overlapped), it assumes the value zero; in contrast, if they are totally separated, it takes the value \sqrt{2}.
The assumption that classes follow a Gaussian distribution is mainly motivated by the need to derive a tractable and easy-to-implement between-class distance measure. It is, however, noteworthy that the general nature of the proposed approach makes it possible to adopt any other type of distance measure.
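Under the Gaussian assumption, the JM distance of (1) and (2) can be computed directly from the per-class sample means and covariance matrices. The following sketch (function and variable names are our own) is one possible implementation:

```python
import numpy as np

def jm_distance(mu_i, cov_i, mu_j, cov_j):
    """Jeffries-Matusita distance between two Gaussian classes,
    computed through the Bhattacharyya distance B_ij of Eq. (2).
    A sketch under the Gaussian assumption; naming is our own."""
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    cov_avg = (np.asarray(cov_i, float) + np.asarray(cov_j, float)) / 2.0
    diff = mu_i - mu_j
    # Mean term + covariance term of the Bhattacharyya distance, Eq. (2).
    b = (diff @ np.linalg.solve(cov_avg, diff) / 8.0
         + 0.5 * np.log(np.linalg.det(cov_avg)
                        / np.sqrt(np.linalg.det(cov_i) * np.linalg.det(cov_j))))
    # JM distance, Eq. (1), bounded by [0, sqrt(2)].
    return np.sqrt(2.0 * (1.0 - np.exp(-b)))
```

As a sanity check against the stated bounds, two identical classes yield a JM distance of zero, while two widely separated classes yield a value approaching \sqrt{2}.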
At this point, in order to be suitably guided, the genetic optimization process needs a piece of information from the ground-truth expert, i.e., the expected amount of mislabeled learning samples. Without this information, the process would tend to invalidate all the learning samples but two (the most distant ones), i.e., one for each class. With this information, we could envision running a constrained genetic optimization process, which at convergence would provide the best subset of invalidated samples with a prespecified cardinality. The main drawback of this genetic implementation is that it requires an exact knowledge
of the amount of mislabeled learning samples. As a more practical alternative, we propose to run a multiobjective genetic optimization process based on the NSGA-II, where the second fitness function would simpl...