A Neural Adaptive Model for feature selection and hyperspectral data classification


I. Gallo a, E. Binaghi a, M. Boschetti b, P. A. Brivio b

a Dipartimento di Informatica e Comunicazione, Università degli Studi dell’Insubria, Varese, Italy - email: gallo@uninsubria.it

b CNR-IREA, Institute for Electromagnetic Sensing of the Environment, Milan, Italy

ABSTRACT

Hyperspectral imaging is becoming an important analytical tool for generating land-use maps. The high dimensionality of hyperspectral remote sensing data provides, on one hand, greater potential discrimination power for classification tasks. On the other hand, classification performance improves only up to a certain point as features are added, and then deteriorates because of the limited number of training samples. Proceeding from these considerations, the present work systematically evaluates the robustness of novel classification techniques for hyperspectral data under the twofold condition of high dimensionality and minimal training. We consider a neural adaptive model based on the Multi Layer Perceptron (MLP). Accuracy has been evaluated experimentally by classifying MIVIS hyperspectral data to identify different vegetation typologies in the Ticino Regional Park. A performance analysis has been conducted comparing the novel approach with Support Vector Machines and with conventional statistical and neural techniques. The adaptive model shows advantages especially when mixed data are presented to the classifiers in combination with minimal training conditions.

Keywords: Hyperspectral data, Supervised classification, Neural Network, Feature selection

1. INTRODUCTION

Hyperspectral imaging is becoming an important analytical tool for generating land-use maps. The high dimensionality of hyperspectral remote sensing data can in principle guarantee a detailed discrimination of the observed surfaces, overcoming the intrinsic limitations of data with lower spectral resolution. However, despite the sizable achievements obtained, the wide use of hyperspectral data is still limited by the lack of well-assessed and adequate classification methods that can deal with their intrinsic complexity. The complexity lies in the high dimensionality of the data and in the consequent ground truth demand for supervised classification1. In particular, this aspect, known as the Hughes phenomenon, implies that the number of labeled training samples required for supervised classification increases as a function of dimensionality. In remote sensing the number of available training samples is often limited, and this limitation becomes critical when the number of features is high. The problem can be addressed from two points of view:

- identifying a classification model less sensitive to the Hughes phenomenon, and/or
- reducing the dimensionality and redundancy of the data by applying feature selection strategies.

Both solutions present critical aspects. Feature selection procedures offer efficient solutions; however, most strategies are applied independently of the characteristics of the classification algorithm. The identification of robust classifiers in our context must necessarily be derived from the investigation of non-conventional techniques. According to Fukunaga2, the required number of training data is related to the square of the dimensionality for quadratic, second-order statistical classifiers and to the dimensionality itself for linear classifiers.

Based on these considerations, we developed an experimental study aimed at investigating the potential of a non-conventional classification model when dealing with high-dimensional data. The model integrates the feature selection and classification tasks by means of adaptive techniques3 built on top of a conventional multilayer perceptron4. The learning task can be formulated as the search for the most adequate subset of features that optimizes the classification accuracy. Performance has been evaluated within an experimental study aimed at identifying different vegetation typologies in the Ticino Regional Park by classifying MIVIS hyperspectral data. A detailed performance analysis has been conducted comparing our model with quadratic and linear statistical classifiers, SVM5 and the multilayer perceptron.

2. ADAPTIVE NEURAL MODEL FOR FEATURE SELECTION AND CLASSIFICATION

According to Jain et al.6 the problem of feature selection is defined as follows: given a set of d features, select a subset of size m that leads to the smallest classification error. Feature selection is in general a difficult problem: in the general case, only an exhaustive search can guarantee an optimal solution. The use of neural networks for feature selection seems promising, since the ability to solve a task with a smaller number of features is evolved during training by integrating the learning process with feature extraction (hidden neurons aggregate input features), feature selection and classification. There are few established procedures for selecting features with neural networks7, and in general they can be thought of as special cases of architecture pruning8. This work presents a supervised adaptive classification model built on top of the Multi Layer Perceptron, able to integrate the feature selection and classification stages in a unified framework. The feature selection task is embedded in the training process, and the evaluation of feature saliency is accomplished directly by the back-propagation learning algorithm, which adaptively modifies the shape and position of special functions acting on the input layer. Figure 1 presents the topology of the adaptive model, conceived as a composition of fully connected neural networks, each of them devoted to selecting the best set of features for discriminating one class from the others. The feature selection mechanism is embedded in the connections between the input and hidden layers. Special functions (Figure 2a, 2b) are defined to modify the connection weights: they act as penalty functions on the connection values, thereby weighting the importance of the features associated with the corresponding input neurons. The general formula of the neuron transfer function is:

$$o_j = f(net_j) \qquad (1)$$

where

$$f(net_j) = \frac{1}{1 + e^{-(net_j - \theta_j)}} \qquad (2)$$

with $\theta_j$ the threshold (bias) of unit $j$.

The $net_j$ function for adaptive neurons has been modified as described in the following formula:

$$net_j = \sum_{k=1}^{M} w_{kj} \cdot o_k \cdot h_{js}(k) \qquad (3)$$

with

- $M$ the maximum number of input connections for the $j$-th neuron
- $o_k$ the output of the $k$-th hidden neuron
- $$h_{js}(k) = L_l(k; p, c_{js}, a_{js}) - L_r(k; p, c_{js}, b_{js}) = \frac{1}{1 + e^{-p(k - (c_{js} - a_{js}))}} - \frac{1}{1 + e^{-p(k - (c_{js} + b_{js}))}} \qquad (4)$$
- $L_l$ and $L_r$ two sigmoid functions
- $p$ controlling the slope of the two sigmoid functions
- $a_{js}$ and $b_{js}$ controlling the width of the bell function $h_{js}$
- $c_{js}$ the center of $h_{js}$
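To make the gating mechanism of Eqs. (3)-(4) concrete, the following minimal NumPy sketch evaluates a bell function over the input indices and uses it to weight the connection products. The function and variable names are ours, and the single-bell setup is purely illustrative.

```python
import numpy as np

def bell(k, c, a, b, p=2.0):
    # Eq. (4): difference of two sigmoids, rising near c - a
    # and falling near c + b; p controls the slope of both edges.
    L_l = 1.0 / (1.0 + np.exp(-p * (k - (c - a))))
    L_r = 1.0 / (1.0 + np.exp(-p * (k - (c + b))))
    return L_l - L_r

def adaptive_net(o, w, bells, p=2.0):
    # Eq. (3): every contribution w_k * o_k is gated by the bell
    # value at its input index; with non-overlapping domains the
    # sum over bells approximates "the bell covering index k".
    k = np.arange(len(o))
    gate = sum(bell(k, c, a, b, p) for (c, a, b) in bells)
    return float(np.sum(w * o * gate))

# Illustrative use: 51 inputs, one bell centred on band 15
rng = np.random.default_rng(0)
o, w = rng.random(51), rng.normal(size=51)
print(adaptive_net(o, w, bells=[(15.0, 5.0, 5.0)]))
```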

[Figure 1 omitted: network diagram with input units $x_1 \ldots x_N$, hidden layers $H_1$, $H_2$ and output $Y_j$, with weights $w_{ik}$, $w_{kj}$ and activations $o_k$.]

Figure 1 - Topology of the adaptive neural model.

[Figure 2 omitted: (a) plot of the bell function against the input index, rising at $c_{ks} - a_{ks}$ and falling at $c_{ks} + b_{ks}$, with slope $p = 2$; (b) diagram of the bell parameters $(a_{ks}, b_{ks}, c_{ks})$ gating the weights $w_{ik}$.]

Figure 2 - (a) The bell function $h_{ks}$ in Eq. 4 and (b) the derived feature selection mechanism.

2.1. Network Configuration and Neural Learning

The neural learning procedure, aimed at identifying discriminant classification functions, includes a non-conventional sub-goal formulated as the search for the most adequate number of bell functions $h_{ks}$, varying adaptively in position and shape so as to lead to the smallest classification error. The goal is achieved within the back-propagation learning scheme by applying the delta rule training algorithm9 to the standard weights $w_{ik}$ and to the parameters $a_{ks}$, $b_{ks}$ of the bell functions.

$$w_{ik}^{new} = w_{ik}^{old} - \eta \frac{\partial E_j}{\partial w_{ik}} \quad \text{with} \quad \frac{\partial E_j}{\partial w_{ik}} = \delta_j \frac{\partial net_j}{\partial o_k} \frac{\partial o_k}{\partial net_k} \frac{\partial net_k}{\partial w_{ik}} \qquad (5)$$

$$a_k^{new} = a_k^{old} - \eta \frac{\partial E_j}{\partial a_k} \quad \text{with} \quad \frac{\partial E_j}{\partial a_k} = \delta_j \frac{\partial net_j}{\partial o_k} \frac{\partial o_k}{\partial net_k} \frac{\partial net_k}{\partial a_k} \qquad (6)$$

$$b_k^{new} = b_k^{old} - \eta \frac{\partial E_j}{\partial b_k} \quad \text{with} \quad \frac{\partial E_j}{\partial b_k} = \delta_j \frac{\partial net_j}{\partial o_k} \frac{\partial o_k}{\partial net_k} \frac{\partial net_k}{\partial b_k} \qquad (7)$$

where $\delta_j$ is the back-propagated error from the output neuron.
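As a hedged illustration of Eqs. (5)-(7), the sketch below performs one delta-rule step for a single bell function. The whole back-propagated chain up to $net_k$ is folded into a scalar `delta`; the analytic derivatives of $h$ with respect to $a$ and $b$ follow directly from the sigmoid form of Eq. (4). Names and the learning rate are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_step(x, w, c, a, b, delta, eta=0.01, p=2.0):
    # One update of Eqs. (5)-(7) for a single bell (c, a, b).
    # delta stands for the whole back-propagated factor
    # delta_j * dnet_j/do_k * do_k/dnet_k appearing in Eqs. (5)-(7).
    k = np.arange(len(x))
    L_l = sigmoid(p * (k - (c - a)))        # rising edge of Eq. (4)
    L_r = sigmoid(p * (k - (c + b)))        # falling edge of Eq. (4)
    h = L_l - L_r
    # Eq. (5): dnet/dw_i = x_i * h(i)
    w_new = w - eta * delta * x * h
    # Eqs. (6)-(7): dh/da = p*L_l*(1-L_l), dh/db = p*L_r*(1-L_r)
    dnet_da = np.sum(w * x * p * L_l * (1.0 - L_l))
    dnet_db = np.sum(w * x * p * L_r * (1.0 - L_r))
    return w_new, a - eta * delta * dnet_da, b - eta * delta * dnet_db
```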

At each learning step, the variation of the parameters $a_{ks}$, $b_{ks}$ results in a new positioning of the corresponding bell functions $h_{ks}$; the model then attempts to redefine the range of each bell function, minimizing the overlap among all the bell functions associated with each hidden neuron. The maximum overlap allowed is at the inflection points of two adjacent bell functions (Figure 3).

Figure 3 - Bell function domain assignment

2.1.1. Bell function removal and insertion

Bell functions are progressively reduced during training by acting on $a_{ks}$, $b_{ks}$, diminishing the distance between $c_{ks} - a_{ks}$ and $c_{ks} + b_{ks}$, leaving to back-propagation the task of compensating erroneous reductions. A bell function is removed when all the following conditions are satisfied:

1. $(c_{ks} + b_{ks}) - (c_{ks} - a_{ks}) < MIN\_WIDTH$, where $MIN\_WIDTH$ is a threshold value;
2. $h_{ks}(m) < \dfrac{0.1}{1 + w_{mk}^2}$, where $m$ indicates the neuron whose connection weight has the maximum value among those associated with the connections under the bell function $h_{ks}$.

Conversely, the variation of the distance between $c_{ks} - a_{ks}$ and $c_{ks} + b_{ks}$ during learning can lead to a progressive increase of the function area, which in general implies a decrease of connection significance. A bell function whose width exceeds the maximum allowed value and whose mean connection weight $w_{ik} \cdot h_{ks}(i)$ is under the threshold is split into two functions (see Figure 4).
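The following sketch applies the two removal conditions and the split test to one bell function. The threshold values and the halving strategy used for the split are assumptions, since the paper does not fix them.

```python
MIN_WIDTH = 2.0    # assumed value; the paper leaves the threshold open
MAX_WIDTH = 20.0   # assumed maximum width triggering a split

def prune_or_split(c, a, b, h_at_m, w_mk, mean_gated_w, w_threshold=0.1):
    # h_at_m: bell value at the input m whose connection weight w_mk
    # has maximum magnitude under this bell (condition 2).
    width = (c + b) - (c - a)
    if width < MIN_WIDTH and h_at_m < 0.1 / (1.0 + w_mk ** 2):
        return ("remove",)
    # split a bell that has grown too wide while its mean gated
    # connection weight w_ik * h(i) has become insignificant
    if width > MAX_WIDTH and mean_gated_w < w_threshold:
        half_a, half_b = a / 2.0, b / 2.0
        return ("split", (c - half_a, half_a, half_a),
                         (c + half_b, half_b, half_b))
    return ("keep",)
```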

[Figure 4 omitted: a wide bell function, annotated with its min/max bounds, split into two narrower bells over the two halves of its domain.]

Figure 4 - A bell function split into two sub-functions.

2.1.2. Hidden neuron removal

As a consequence of the bell function removal mechanism, a hidden neuron is pruned when all of its associated bell functions have been removed.

2.1.3. Initialization of the neural model

The initialization of the adaptive neural model involves the specification of the following topological aspects:

- the number of bell functions for each neuron;
- the number of neurons in each hidden layer.

The proposed model is designed to cope with high-dimensional data. Considering that the number of bell functions can grow during learning by means of the insertion mechanism, and that hidden neurons can be removed according to the criteria stated above, we may adopt a heuristic initialization criterion that sets the minimal initial number of bell functions to two for each hidden neuron and the number of hidden neurons equal to the number of input neurons.
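Under this heuristic, an initialization could look like the sketch below: one hidden neuron per input band, each with two bells covering the two halves of the band-index range. The data layout is ours, not the paper's.

```python
import numpy as np

def init_adaptive_model(n_inputs, seed=0):
    # Heuristic of Sec. 2.1.3: as many hidden neurons as inputs,
    # each starting with two bell functions (c, a, b) that together
    # span the input index range [0, n_inputs).
    rng = np.random.default_rng(seed)
    quarter = n_inputs / 4.0
    bells = [(quarter, quarter, quarter),          # left half of the range
             (3.0 * quarter, quarter, quarter)]    # right half of the range
    return [{"weights": rng.normal(0.0, 0.1, n_inputs),
             "bells": list(bells)} for _ in range(n_inputs)]

model = init_adaptive_model(51)                 # 51 selected MIVIS bands
print(len(model), len(model[0]["bells"]))       # -> 51 2
```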

3. EXPERIMENTS

3.1. Study area

The study area is a typical agro-ecosystem belonging to the Ticino River regional park, located south-west of Milan, Italy. Within a few kilometres the site encompasses a range of land cover types typical of Northern Italy, including intensive poplar plantations, natural broad-leaved forest, water courses, rice and corn fields, grassland and transitional woodland/shrub. A detailed land-cover map was obtained by field survey in the summer of 2003.

Figure 5 - The study area is an agricultural zone with typical crops (rice and corn); extensive poplar plantations characterize the landscape, while the residual natural forest surrounds the course of the Ticino river.

3.2. Hyperspectral data set

The hyperspectral data available for the experiment consist of an aerial MIVIS (Multispectral Infrared and Visible Imaging Spectrometer) image; the technical characteristics of the sensor, as well as the noise-free bands selected, are described in Table 1.

Table 1 - Spectral and geometric characteristics of the MIVIS hyperspectral sensor. The flight height was 2000 m, determining a pixel size of about 4 m. Scanner type: whisk broom; FOV: ±35°; IFOV: 2 mrad.

Spectrometer   Spectral range [nm]   Channels   Selected bands
1              430-830               1-20       1 to 20
2              1150-1550             21-28      21 to 24, 27, 28
3              2000-2500             29-92      43 to 58, 61, 62, 65 to 68, 71
4              8200-12700            93-102     Thermal range not selected
TOT            -                     102        51

The image was atmospherically corrected using the ATCOR4 software13, taking into account the angular dependence of the atmospheric radiance and transmittance. The original data were pre-processed to eliminate noisy bands, retaining the features that pass an empirical threshold based on the coefficient of variation over homogeneous areas. The total number of features selected corresponds to the 51 MIVIS bands shown in Table 1.
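A sketch of this pre-processing step, under the assumption that noisy bands are those whose coefficient of variation over the homogeneous areas exceeds the threshold (the paper does not report the threshold value):

```python
import numpy as np

def select_low_noise_bands(cube, homogeneous_mask, cv_max=0.05):
    # cube: (rows, cols, bands) radiance image; homogeneous_mask:
    # boolean (rows, cols) marking the known homogeneous areas.
    pixels = cube[homogeneous_mask]                 # (n_pixels, bands)
    cv = pixels.std(axis=0) / np.abs(pixels.mean(axis=0))
    return np.flatnonzero(cv < cv_max)              # retained band indices
```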

3.3. Test set

Owing to the highly precise information available on the land use of the test area, a stratified sampling technique12 was used to extract a random test set of 60 pixels per class for the subsequent accuracy analysis. The test set was extracted from the study area land-use map shown in Figure 5.
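A minimal sketch of the stratified extraction, assuming the land-use map is an integer raster with 0 for unlabelled pixels:

```python
import numpy as np

def stratified_test_pixels(label_map, n_per_class=60, seed=0):
    # Draw n_per_class random pixel coordinates from every class
    # of the land-use map (stratified random sampling).
    rng = np.random.default_rng(seed)
    test = {}
    for cls in np.unique(label_map):
        if cls == 0:
            continue                                # unlabelled
        rows, cols = np.nonzero(label_map == cls)
        idx = rng.choice(rows.size, size=n_per_class, replace=False)
        test[int(cls)] = list(zip(rows[idx], cols[idx]))
    return test
```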

3.4. Experiment

The experiment was aimed at evaluating the performance of the proposed adaptive model when trained under three different conditions of pattern cardinality per class. Performance has been compared with that obtained from four classification algorithms: a traditional, well-established statistical tool, the maximum likelihood14 classifier (ML); a specific hyperspectral classifier, the Spectral Angle Mapper15 (SAM), both implemented in the ENVI software16; the Multi Layer Perceptron (MLP); and the Support Vector Machine (SVM). The three training sets used to study the performance of the classifiers are:

T1 - 100 patterns per class: it can be considered the optimal training for the statistical classifier (ML), considering that we deal with 51 features (the selected MIVIS bands);
T2 - 52 patterns per class: it corresponds to the theoretical limit for the statistical classifier (ML);
T3 - 25 patterns per class: it represents a critical situation, corresponding to the minimal training condition of the experiment.

The five classifiers received as input the 51 noise-free MIVIS bands and were trained and tested on the same data, allowing an objective comparison of their performance.
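For the two general-purpose classifiers, the comparison protocol can be sketched with off-the-shelf scikit-learn implementations. The hyperparameters below are illustrative, not those used in the paper, and the ML, SAM and adaptive models are not reproduced here.

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def compare_classifiers(X_train, y_train, X_test, y_test):
    # X_*: (n_samples, 51) arrays of the selected MIVIS bands;
    # every classifier sees exactly the same training and test data.
    models = {
        "SVM": SVC(kernel="rbf", C=10.0),
        "MLP": MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000),
    }
    return {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}
```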

3.5. Feature selection results

The adaptive model performed a selection of the most relevant features during training. Figure 6 shows the results obtained when the adaptive classifier was trained with T3. Four different vegetation classes (rice, corn, forest and poplar) and one non-vegetation class (soil) were considered in the experiment. The strong spectral difference between soil and the other classes was recognized by the algorithm, which selected just a few features in the most distinct region, corresponding to the short-wave infrared. On the other hand, the two tree-like vegetation classes (poplar and forest) show very similar spectral responses; consequently, the procedure selected only the subset of features that really discriminates the two classes. As shown in Figure 6b, within the visible part of the spectrum the red-edge region is confirmed to be the most important for discriminating vegetation classes.

[Figure 6 omitted: two panels titled "Band selection vs spectral properties", plotting mean reflectance (left axis) and band importance (right axis) against band number for the Soil, Poplar and Forest classes under the 25-pattern training set; panel (a) covers all 51 bands, panel (b) bands 1-18.]

Figure 6 - Comparison of the automatic feature selection for soil and for two critical vegetation classes (Forest and Poplar), related to the mean spectral responses (a). Analyzing the visible part of the spectrum, the red-edge region, 600-800 nm (bands 11-20), is generally considered the most discriminating for vegetation, as confirmed by the adaptive feature selection (b).

4. CLASSIFICATION RESULTS

The agreement between the reference test data and the classification results has been analyzed by means of the confusion matrix, in order to compare the accuracy of the different automatic procedures. Figure 7 shows the results obtained in terms of overall accuracy (OA) for all the classifiers considered, when trained with the three data sets. The adaptive method presents a stable behavior under the three different training conditions, reaching a high level of accuracy in all cases (over 80%). Maximum likelihood is strongly influenced by the training conditions: performance is superior in the case of training with T1, reaching an accuracy value close to 90%, but collapses to 60% using training set T2; training with T3 is not applicable. The SAM algorithm shows a stable behavior, since its classification is based on the distance from an average spectrum per class, but the accuracy reached is in all cases inferior to that of the adaptive model. The conventional MLP and SVM show a stable behavior and performance comparable with that obtained by our model.
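For reference, the overall accuracy and the kappa coefficient quoted here and in Figure 9, as well as the per-class accuracies of Figure 8, can be derived from the confusion matrix as follows. These are the standard formulas, not code from the paper.

```python
import numpy as np

def oa_and_kappa(cm):
    # cm: confusion matrix, rows = reference classes, cols = predictions.
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                                    # overall accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2    # chance agreement
    return oa, (oa - pe) / (1.0 - pe)                        # Cohen's kappa

def per_class_accuracy(cm):
    # producer's accuracy: correct pixels / reference pixels per class
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)
```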

[Figure 7 omitted: bar chart of overall accuracy (%) for ML, SVM, MLP, adaptive MLP and SAM under the 100-, 52- and 25-pattern training sets.]

Figure 7 - Overall testing accuracy of the different methods under the different pattern cardinalities.

The confusion matrix also allows the accuracy of the single classes to be computed. In Figure 8 we compare the results for the two tree-like classes.

[Figure 8 omitted: two bar charts, "Natural Forest" and "Poplar plantation", of per-class accuracy (%) for ML, SVM, MLP, SAM and adaptive MLP under the 100-, 52- and 25-pattern training sets.]

Figure 8 - Critical classes in minimal training conditions.

Considering the ML classifier, the histograms of Figure 8 show that under minimal training conditions the accuracy of the poplar plantation class collapses compared to that of natural forest; however, if we consider the classification maps (Figure 9), it is easy to see that the high accuracy of the natural forest class (70%) is a coincidental effect of the general noise. The adaptive model again shows better results with respect to ML and SAM; its accuracy values are comparable to, and in the case of Natural Forest slightly superior to, those of SVM and MLP. Figure 9 shows the classification maps produced by the proposed adaptive MLP algorithm compared with those of the conventional ML.

5. CONCLUSIONS

The experimental work conducted demonstrated the potential of adaptive neural techniques in classifying hyperspectral data. The feature selection strategy integrated in our model allows the relevant features to be properly selected, drastically reducing the topological complexity of the model during training, and maintains the overall efficiency under minimal training conditions. Based on these results, future work will investigate the potential of the approach in classifying the original raw data, without the pre-processing aimed at eliminating noisy bands, in order to qualify the general applicability of the proposed solution in contexts in which a priori knowledge is not available or applicable.

REFERENCES

1. D. Landgrebe, "Information Extraction Principles and Methods for Multispectral and Hyperspectral Image Data", in Information Processing for Remote Sensing, C.H. Chen (ed.), pp. 3-37, World Scientific, Singapore, 1999.
2. K. Fukunaga and R.R. Hayes, "Effects of Sample Size in Classifier Design", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 8, pp. 873-885, Aug. 1989.
3. Y.H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison Wesley, MA, 1989.
4. C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
5. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, London, 2000.
6. A.K. Jain, R.P.W. Duin and J. Mao, "Statistical Pattern Recognition: a review", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, pp. 4-37, 2000.
7. R. Lotlikar and R. Kothari, "Bayes-optimality motivated linear and multilayered perceptron-based dimensionality reduction", IEEE Trans. Neural Networks, vol. 11, no. 2, pp. 452-463, 2000.
8. R. Reed, "Pruning Algorithms - a survey", IEEE Trans. Neural Networks, vol. 4, no. 5, pp. 740-747, 1993.
9. D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning Internal Representations by Error Propagation", in Parallel Distributed Processing, D.E. Rumelhart and J.L. McClelland (eds.), pp. 318-362, MIT Press, Cambridge, MA, 1986.
10. A. Gualtieri and R.F. Cromp, "Support Vector Machines for Hyperspectral Remote Sensing Classification", in Proc. of SPIE, vol. 3584, 1999.
11. F. Melgani and L. Bruzzone, "Support Vector Machines for Classification of Hyperspectral Remote Sensing Data", in Proc. of IGARSS 2002.
12. J.L. Van Genderen, B.F. Lock and P.A. Vass, "Remote Sensing: Statistical testing of thematic map accuracy", Remote Sensing of Environment, 7, 3-14, 1978.
13. R. Richter, ATCOR4 User Manual, DLR-IB 564-04/2000, 2000.
14. F.A. Kruse, A.B. Lefkoff, J.B. Boardman, K.B. Heidebrecht, A.T. Shapiro, P.J. Barloon and A.F.H. Goetz, "The Spectral Image Processing System (SIPS) - Interactive Visualization and Analysis of Imaging Spectrometer Data", Remote Sensing of Environment, 44, 145-163, 1993.
15. J.A. Richards, Remote Sensing Digital Image Analysis, Springer-Verlag, Berlin, 1999, p. 240.
16. ENVI, The Environment for Visualizing Images, Research Systems Inc., http://www.rsiinc.com/envi

[Figure 9 omitted: land-use maps (classes: Rice, Mais, Soil, Poplar, Natural Forest) produced by the adaptive MLP and by ML under the three training conditions, with the following accuracies.]

               25 patterns per class     52 patterns per class      100 patterns per class
Adaptive MLP   OA = 80.3%, Kappa = 0.76  OA = 85.67%, Kappa = 0.82  OA = 83.6%, Kappa = 0.8
ML             NO DATA                   OA = 62%, Kappa = 0.53     OA = 89.67%, Kappa = 0.87

Figure 9 - Land-use maps and relative accuracies produced under the different training conditions by the two algorithms.