


Environ Ecol Stat
DOI 10.1007/s10651-014-0287-2

Independent component analysis and clustering for pollution data

Asis Kumar Chattopadhyay · Saptarshi Mondal · Atanu Biswas

Received: 7 July 2013 / Revised: 5 April 2014
© Springer Science+Business Media New York 2014

Abstract Independent component analysis (ICA) is closely related to principal component analysis (PCA). Whereas ICA finds a set of source variables that are mutually independent, PCA finds a set of variables that are mutually uncorrelated. Here we consider an objective classification of different regions in central Iowa, USA, in order to study the pollution level. The study was part of the Soil Moisture Experiment 2002. Components responsible for significant variation have been obtained through both PCA and ICA, and the classification has been done by K-means clustering. The results show that the nature of the clustering is significantly improved by ICA.

Keywords Circular data · Distance · FastICA algorithm · Independent component analysis · K-means clustering · Negentropy · Non-Gaussianity · Principal component analysis

1 Introduction

1.1 Data set under consideration

We consider a data set containing measurements collected from flights conducted in June and July 2002 over the Walnut Creek watershed in central Iowa, USA. The study

Handling editor: Ashis SenGupta.

A. K. Chattopadhyay · S. Mondal
Calcutta University, Kolkata, India
e-mail: [email protected]

S. Mondal
e-mail: [email protected]

A. Biswas (B)
Indian Statistical Institute, Kolkata, India
e-mail: [email protected]


was part of the Soil Moisture Experiment 2002 (SMEX02) and the Soil Moisture–Atmosphere Coupling Experiment (SMACEX), run by Canada's National Research Council (NRC). See SMEX02 Soil Moisture Atmosphere Coupling Experiment, Iowa, http://nsidc.org/data/nsidc-0232.html, for a detailed description of the methodology and the data.

The aircraft carried numerous sensors and flew several flights per day along six tracks in the watershed area. These data were collected as part of a validation study for the Advanced Microwave Scanning Radiometer – Earth Observing System (AMSR-E). AMSR-E is a mission instrument launched aboard NASA's Aqua satellite on 4 May 2002. AMSR-E validation studies linked to SMEX are designed to evaluate the accuracy of AMSR-E soil moisture data. Specific validation objectives include assessing and refining soil moisture algorithm performance; verifying soil moisture estimation accuracy; investigating the effects of vegetation, surface temperature, topography, and soil texture on soil moisture accuracy; and determining the regions that are useful for AMSR-E soil moisture measurements.

Variables for this data set include air temperature (TEMP, in °C), dew point temperature (DEWPT, in °C), radiometric surface temperature (KT19, in °C), greenness index (GRN, ratio of 730/660 nm reflected radiation), net radiation from the wingtip sensor (NETRD, in W/m2), CO2 concentration (in ppm), wind direction (in ° true), wind speed (in m/s), sensible heat flux (H, in W/m2), latent heat flux (LE, in W/m2), CO2 flux (WC, in mg/m2/s), friction velocity computed from momentum flux (U∗, in m/s) and corrected ozone flux (in mg/m2/s). Here wind direction is a directional variable and the remaining variables are continuous.

1.2 Independent component analysis

One of the most powerful recent statistical techniques for analyzing large data sets is independent component analysis (ICA); see Comon (1994) for the original description of ICA. Such data sets are generally multivariate in nature. The common problem is to find a suitable representation of the multivariate data. For the sake of computational and conceptual simplicity such a representation is sought as a linear transformation of the original data. Principal component analysis, factor analysis and projection pursuit are some popular methods for linear transformation. But ICA is different from the other methods, because it looks for components in the representation that are both statistically independent and non-Gaussian. In essence, ICA separates statistically independent component data, which are the original source data, from an observed set of data mixtures. Not all the information in a multivariate data set is equally important, and we need to extract the most useful part. Independent component analysis extracts and reveals useful hidden factors from the whole data set. ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. See Hyvarinen et al. (2001), Comon and Jutten (2010) and Lee (1998) for book-length discussions of ICA. ICA has been applied in various fields such as neural networks (Fiori 2003), EEG data (Bartlett et al. 1995), speech processing (Kumaran et al. 2005), brain imaging (McKeown et al. 1997), stock prediction (Lu et al. 2009), signal separation (Adali et al. 2009), telecommunications



Fig. 1 Left Circular density plot of wind direction; Right Circular density plot of 2× wind direction (circular density estimates, bandwidth = 25)

(Hyvarinen et al. 2002), econometrics (Bonhomme and Robin 2009), etc. But, to the best of our knowledge, ICA has not yet been applied to environmental pollution data.

In the present paper we carry out K-means clustering on the basis of independent components as well as principal components, in order to identify the more appropriate method for the present data set.

1.3 Wind direction: directional data

Note that ICA has been developed for linear continuous data, but here one variable, viz. wind direction, is circular in nature. This is an important covariate, as wind often brings pollutants from neighbouring places to the place under consideration, especially if there are industrial regions nearby. But it is not immediately clear how to include this type of data in a classification through ICA. A density plot of this variable is shown in the left panel of Fig. 1. Clearly the wind direction has a bimodal distribution, with the two modes near 0° and 200°. Thus we are motivated to consider two main directions, east and west (approximately), which correspond to 0° and 180°. We consider the standard cosine angular distance of an angle θ from a fixed angle φ, defined by dφ = 1 − cos(θ − φ), which is on the linear scale, with dφ ∈ [0, 2]. Thus, for a wind direction θ, we may consider two distances d0 = 1 − cos(θ − 0°) and d180 = 1 − cos(θ − 180°), both of which are linear. So, instead of taking θ in our analysis, we consider the pair (dmax, dsign), where dmax = max(d0, d180), and dsign = +1 if dmax = d0 and dsign = −1 if dmax = d180. Alternatively, if we want to ignore the sign, we can work with θ∗ = 2θ, which is approximately unimodal with mode near 45° (see the right panel of Fig. 1). We may then work with d∗ = 1 − cos(θ∗ − 45°). In the present paper we work with d∗ only, ignoring the sign.
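The angular-to-linear transformation described above is easy to compute. The sketch below (NumPy, with a hypothetical function name; not the authors' code) maps a wind direction in degrees to the cosine distances, the pair (dmax, dsign), and the signless d∗:

```python
import numpy as np

def wind_features(theta_deg):
    """Map a circular wind direction theta (degrees) to linear features:
    cosine distances from 0 and 180 degrees, the pair (dmax, dsign),
    and the signless alternative d* = 1 - cos(2*theta - 45 deg)."""
    theta = np.deg2rad(np.asarray(theta_deg, dtype=float))
    d0 = 1.0 - np.cos(theta)                    # distance from 0 degrees
    d180 = 1.0 - np.cos(theta - np.pi)          # distance from 180 degrees
    d_max = np.maximum(d0, d180)
    d_sign = np.where(d0 >= d180, 1.0, -1.0)    # +1 if dmax = d0, else -1
    d_star = 1.0 - np.cos(2.0 * theta - np.deg2rad(45.0))
    return d_max, d_sign, d_star

# A wind from 180 degrees is at maximal cosine distance from 0 degrees.
print(wind_features(180.0))
```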

1.4 Test for normality

Before applying ICA we tested the Gaussianity of the data set, using the transformed linear version of the wind direction. Here the null hypothesis was that


the entire data set comes from a multivariate normal distribution. To test this we performed the multivariate Shapiro–Wilk test. We found that the p value of the test was 2.572 × 10−10, which is extremely small. We conclude that the data set does not come from a multivariate normal distribution.
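The paper uses a multivariate Shapiro–Wilk test (available, e.g., in R). As a rough, hedged analogue, one can screen each variable with the univariate Shapiro–Wilk test from SciPy; since the marginals of a multivariate normal are normal, small univariate p values already rule out joint normality. Toy data below, not the paper's measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.exponential(size=(200, 3))  # clearly non-Gaussian toy columns

# Univariate Shapiro-Wilk on each column as a cheap screen for
# multivariate normality.
for j in range(X.shape[1]):
    W, p = stats.shapiro(X[:, j])
    print(f"variable {j}: W = {W:.3f}, p = {p:.2e}")
```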

1.5 Outline of the paper

The rest of the paper is organised as follows. In Sect. 2 we provide an overview of independent component analysis from the existing literature, where we discuss non-Gaussianity, negentropy and the FastICA algorithm. In Sect. 3 we provide a comparative discussion of ICA and PCA. Data analysis is carried out in Sect. 4, and Sect. 5 concludes.

2 An overview of independent component analysis

2.1 Description

Independent component analysis (ICA) (see Comon 1994) is based on a model of the form:

X = AS, (2.1)

where X = [X1, . . . , Xm]′ is a random vector of observations whose rows are independent of each other, S = [S1, . . . , Sm]′ is a random vector of hidden sources whose components are mutually independent, and A is a nonsingular mixing matrix. So A−1 is the unmixing matrix. Given n independently and identically distributed (i.i.d.) samples of X, say {X(j) : 1 ≤ j ≤ n}, the main goal of ICA is to estimate the unmixing matrix A−1 and thus to recover the hidden sources using Sk = A−1_k X, where A−1_k is the kth row of A−1.

In the model, it is assumed that the data variables are linear or non-linear mixtures of some latent variables, and the mixing system is also unknown. The latent variables are assumed non-Gaussian and mutually independent, and they are called the independent components of the observed data.

Suppose n random variables X1, . . . , Xn are expressed as linear combinations of n random variables S1, . . . , Sn. Equation (2.1) can then be written as:

Xi = ai1S1 + ai2S2 + · · · + ain Sn, i = 1, 2, . . . , n (2.2)

The Si's in (2.2) are statistically mutually independent, and the aij's are the entries of the nonsingular matrix A. All we observe are the random variables Xi, and we have to estimate both the mixing coefficients aij and the independent components Si using only the Xi.

A first step in computer algorithms for performing ICA is to whiten (sphere) the data. This means that any correlations in the data are removed, i.e. the data are forced to be uncorrelated. Mathematically speaking, we need a linear transformation V such


that Z = VX, where E(ZZ′) = I. This can easily be accomplished by choosing V = C−1/2, where C = E(XX′).

After sphering, the separated data can be found by an orthogonal transformation of the whitened data Z:

Z = (VA)S, or S = (VA)−1Z = WZ (say).
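The whitening step follows directly from the eigendecomposition of the covariance: with C = EΛE′, take V = EΛ^(−1/2)E′. A minimal NumPy sketch (illustrative, not the authors' code), assuming centered data and a nonsingular covariance:

```python
import numpy as np

def whiten(X):
    """Sphere X (n samples x m variables): returns Z = (X - mean) V with
    V = C^{-1/2} from the eigendecomposition of the covariance C, so that
    the empirical covariance of Z is the identity."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    lam, E = np.linalg.eigh(C)
    V = E @ np.diag(lam ** -0.5) @ E.T  # symmetric inverse square root
    return Xc @ V

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 1.0, 0.0],
                                          [0.0, 1.0, 3.0],
                                          [1.0, 0.0, 1.0]])
Z = whiten(X)
print(np.round(np.cov(Z, rowvar=False), 6))  # approximately the identity
```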

2.2 ICA by maximization of non-Gaussianity

In ICA estimation, non-Gaussianity is very important. Although not all the sources need be non-Gaussian, some of the sources have to be non-Gaussian (see Hyvarinen et al. 2001). Without non-Gaussianity the estimation is not possible. The role of non-Gaussianity is motivated by the central limit theorem: under certain conditions, the distribution of a sum of independent random variables tends toward a Gaussian distribution. A sum of two independent non-Gaussian random variables usually has a distribution that is closer to Gaussian than either of the two original random variables.

We can measure non-Gaussianity by negentropy (Hyvarinen et al. 2001). The entropy of a discrete variable is defined as minus the sum, over the possible values, of the probability of each value times the log of that probability. For a continuous variable the analogue is the differential entropy, given by minus the integral of the density times the log of the density. Negentropy is the difference between the differential entropy of a Gaussian source with the same covariance as a source S and the differential entropy of S. It is denoted by J(S) and defined as follows:

J (S) = H(SGauss) − H(S),

where

H(S) = −∫ ps(η) log ps(η) dη,

where ps(η) is the density function of S. Negentropy is always non-negative, and it is zero if and only if S has a Gaussian distribution. Negentropy has the interesting property that it is invariant under invertible linear transformations. It is also a robust measure of non-Gaussianity. Here we estimate S by maximizing the distance of its entropy from the Gaussian entropy, since the noise is assumed to be Gaussian and the signals can be separated from the noise only if they are non-Gaussian. If the signals are Gaussian, then ICA will not work.

2.3 Approximation of negentropy

One drawback of negentropy is that it is very difficult to compute, which is why it needs to be approximated (Hyvarinen et al. 2001). The approximation is given by:

J (S) ∝ (E[G(S)] − E[G(SGauss)])2,

where SGauss is a standard Gaussian random variable and G is a non-quadratic function. In particular, G should be chosen so that it does not grow too fast. Two popular choices of G are:


G1(S) = (1/a) log cosh(aS),
G2(S) = −exp(−S²/2), (2.3)

where 1 ≤ a ≤ 2 is some suitable constant, which is often taken equal to 1.
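The approximation with G1 is straightforward to evaluate by Monte Carlo: estimate E[G(S)] from the sample and E[G(SGauss)] from simulated standard normal draws. A sketch (our illustration, assuming S is standardized to zero mean and unit variance):

```python
import numpy as np

def negentropy_approx(s, a=1.0, n_gauss=100_000, seed=0):
    """J(S) ~ (E[G(S)] - E[G(S_Gauss)])^2 with G(u) = (1/a) log cosh(a u);
    s is assumed to have zero mean and unit variance."""
    G = lambda u: np.log(np.cosh(a * u)) / a
    g = np.random.default_rng(seed).standard_normal(n_gauss)
    return (G(np.asarray(s)).mean() - G(g).mean()) ** 2

rng = np.random.default_rng(2)
gauss = rng.standard_normal(50_000)
laplace = rng.laplace(size=50_000) / np.sqrt(2.0)  # unit-variance Laplace
print(negentropy_approx(gauss))    # close to zero
print(negentropy_approx(laplace))  # visibly larger
```

As expected, the Gaussian sample scores near zero while the super-Gaussian Laplace sample scores higher.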

2.4 The FastICA algorithm

The FastICA algorithm is a widely used algorithm for ICA, including in industrial applications. It was developed by Hyvarinen and Oja (2000). In this method the independent components are estimated one by one. The algorithm converges very fast and is very reliable. The objective is to maximize J(S), which is equivalent to maximizing E[G(WZ)] under the constraint ‖W‖ = 1.
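In practice FastICA is rarely coded by hand. As a hedged illustration (scikit-learn is our choice of tooling, not necessarily the authors'), the following recovers two synthetic non-Gaussian sources from their mixtures:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
# Two independent non-Gaussian sources: Laplace and uniform.
S = np.c_[rng.laplace(size=2000), rng.uniform(-1.0, 1.0, size=2000)]
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])         # mixing matrix
X = S @ A.T                         # observed mixtures

ica = FastICA(n_components=2, fun="logcosh", random_state=0)
S_hat = ica.fit_transform(X)        # estimated independent components
print(S_hat.shape)
```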

3 Independent component analysis versus principal component analysis

Independent component analysis is related to another conventional method for analyzing large data sets, viz. principal component analysis (PCA). Whereas ICA finds a set of variables that are mutually independent, PCA finds a set of variables that are mutually uncorrelated. Comparative studies of ICA and PCA have been carried out by Jung et al. (1998), among others.

Independent component analysis (ICA) was originally developed for separating mixed audio signals into independent sources. A study on the Olivetti and Yale databases (Yuen and Lai 2000) showed that ICA outperforms PCA, while another report (Moghaddam 1999) claimed that there is no performance difference between ICA and PCA. In this paper we make the comparison on the basis of the present data set.

The purpose of PCA is to reduce the original data set of two or more sequentially observed variables by identifying a small number of meaningful components (Jackson 2003).

Principal component analysis (PCA), based on the linear correlation between data points, provides a way to extract variables which are linearly uncorrelated. Although linearly uncorrelated, these variables are not necessarily independent, because of higher-order correlations.

Independent component analysis (ICA) (Hyvarinen et al. 2001; Stones 2004) is based on the basic assumption that the source components are statistically independent, in the sense that the value of one gives no information about the values of the others. For non-Gaussian variables, the probability density functions (p.d.f.s) require all moments to be specified, and higher-order correlations must be taken into account to establish independence.

Actual observations are composed of mixtures of source components. The main aim of ICA is to extract the statistically independent components from the observed mixtures and thus to obtain information about the underlying physical processes. To achieve this, ICA makes use of the central limit theorem. ICA is a method to extract the combinations with the most non-Gaussian possible p.d.f.s from the more Gaussian


Fig. 2 Plot of distortion (divided by 10⁵) against the number of centers k


component mixtures. These are called the "independent components" (ICs) of the observations. For this, an unmixing vector is found that extracts the most non-Gaussian possible (maximally non-Gaussian) source component. This source component is then removed from the set of mixed data and the process is repeated. So, from the theoretical point of view also, ICA may give a better result than PCA for non-Gaussian data.

We must fix the number of independent components to be sought. PCA is performed to determine this number by finding a breakpoint in the eigenvalue spectrum. If there is no clear breakpoint in the spectrum, we can keep the number of leading components that carry some specified percentage of the variance (Hyvarinen et al. 2001).
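With scikit-learn, the percentage-of-variance rule reads directly off the cumulative explained-variance ratio. A sketch on toy data (the data and the 5-factor structure are our assumptions, not the paper's measurements):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Toy 13-variable data driven by 5 latent directions plus small noise.
latent = rng.normal(size=(300, 5)) * np.array([5.0, 4.0, 3.0, 2.0, 1.5])
X = latent @ rng.normal(size=(5, 13)) + 0.1 * rng.normal(size=(300, 13))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.90) + 1)   # smallest k reaching 90 %
print(k, np.round(cum[:k], 3))
```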

4 Data analysis

Here we have carried out cluster analysis (CA) by K-means clustering (see "Appendix") on the basis of principal components and of independent components. Four groups (viz. G1, G2, G3 and G4) were found as a result of CA with respect to the four principal and independent components respectively (found in the present analysis; see the distortion curve in Fig. 2), and their properties are listed in Tables 1 and 2 respectively. In particular, Tables 1 and 2 list the mean values and corresponding standard errors of all the variables for the four groups found by clustering with respect to the PC and IC values. Let the data set be modeled by a p-dimensional random variable X consisting of a mixture distribution of G components with common variance–covariance matrix Γ. If c1, . . . , cK is a set of K cluster centers, with cX the center closest to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is:

dK = (1/p) min_{c1,...,cK} E[(X − cX)T Γ−1 (X − cX)].

Here Γ = I is taken, and the distortion reduces to a function of the within-cluster sum of squares. We have applied the method of Sugar and James (2003) (see "Appendix") to find the optimum number of clusters in both cases. The four components are selected by looking at the variance proportions with a 90 % cut-off, both for PCA and for ICA. By investigating the eigenvectors (coefficients of the variables in the principal components) it is found that six variables, viz. air temperature, dew point temperature, wind direction, wind speed, sensible heat flux, and CO2 flux, are mostly responsible for the variation. In both clusterings, the first group (G1) contains the largest number of locations (168


Table 1 Mean values of the parameters in four groups found by CA with respect to PC values

G1 G2 G3 G4

No. 168 74 160 19

TEMP 27.32 ± 0.21 25.90 ± 0.27 26.68 ± 0.19 21.73 ± 0.10

DEWPT 16.48 ± 0.38 18.09 ± 0.46 17.84 ± 0.36 10.96 ± 0.11

KT19 37.39 ± 0.21 33.21 ± 0.27 35.02 ± 0.19 25.18 ± 0.26

GRN 2.00 ± 0.03 2.16 ± 0.07 2.10 ± 0.04 2.07 ± 0.04

NETRD 638.79 ± 2.04 470.74 ± 4.85 558.33 ± 1.88 183.42 ± 5.45

CO2 362.61 ± 0.89 364.39 ± 0.63 365.78 ± 0.76 362.74 ± 0.41

d∗ 0.40 ± 0.04 0.44 ± 0.05 0.40 ± 0.04 0.23 ± 0.06

WIND-spd 6.06 ± 0.25 5.30 ± 0.34 5.63 ± 0.23 6.07 ± 0.10

H 109.63 ± 2.95 78.54 ± 3.54 89.82 ± 2.38 17.05 ± 2.17

LE 307.19 ± 7.21 238.65 ± 7.72 272.32 ± 5.52 118.16 ± 5.26

WC −1.03 ± 0.03 −0.93 ± 0.04 −1.03 ± 0.03 −0.37 ± 0.03

U* 0.48 ± 0.01 0.40 ± 0.02 0.45 ± 0.01 0.42 ± 0.01

Ozone flux −0.42 ± 0.01 −0.40 ± 0.02 −0.43 ± 0.01 −0.28 ± 0.01

Table 2 Mean values of the parameters in four groups found by CA with respect to IC values

G1 G2 G3 G4

No. 149 93 147 32

TEMP 26.78 ± 0.17 25.16 ± 0.34 28.07 ± 0.14 22.83 ± 0.36

DEWPT 17.73 ± 0.24 14.78 ± 0.75 18.72 ± 0.25 12.58 ± 0.56

KT19 36.22 ± 0.21 34.54 ± 0.36 36.12 ± 0.22 28.19 ± 0.72

GRN 1.99 ± 0.04 2.02 ± 0.07 2.22 ± 0.03 1.94 ± 0.07

NETRD 584.84 ± 5.29 576.77 ± 4.80 581.87 ± 5.34 270.44 ± 19.41

CO2 359.68 ± 0.38 378.34 ± 0.91 359.74 ± 0.46 363.78 ± 0.70

d∗ 0.27 ± 0.03 0.72 ± 0.07 0.31 ± 0.03 0.48 ± 0.08

WIND-spd 6.20 ± 0.28 4.74 ± 0.26 5.78 ± 0.23 6.61 ± 0.42

H 97.66 ± 2.62 94.90 ± 4.69 94.95 ± 2.76 49.72 ± 7.98

LE 237.12 ± 4.30 245.45 ± 6.53 355.86 ± 5.60 144.19 ± 7.48

WC −0.92 ± 0.02 −0.84 ± 0.04 −1.26 ± 0.02 −0.47 ± 0.04

U* 0.47 ± 0.01 0.38 ± 0.01 0.48 ± 0.01 0.46 ± 0.02

Ozone flux −0.39 ± 0.01 −0.37 ± 0.01 −0.49 ± 0.01 −0.31 ± 0.01

and 149 locations respectively). In terms of group size (i.e. the number of observations in the group after clustering), group three (G3) is in second position (160 and 147 respectively). Group four (G4) corresponds to the locations with minimum temperature and dew point temperature. The wind direction values (converted d∗) are exceptionally low in group four (G4) for PC and in group one (G1) for IC. Although the groups are more or less comparable under both methods, from Table 3 it is clear, in terms of the within-cluster sum of squares, that ICA performs much better than PCA.


Table 3 Comparison of within-cluster sum of squares

     Within cluster sum    Within cluster sum
     of squares (IC)       of squares (PC)

G1   264.40                50,700.20
G2   284.25                43,783.28
G3   307.05                35,297.19
G4   66.06                 3,379.04

5 Conclusion

The following conclusions can be drawn from the present analysis:

1. The two methods PCA and ICA are carried out for the data set, and four significant components are found in each case.

2. K-means cluster analysis, together with the optimality criterion of Sugar and James (2003), is carried out for the present data set, and four groups G1, G2, G3 and G4 have been found in each case separately. The total numbers of locations in G1 and G3 are very large compared to the other groups, implying that the locations mainly comprise two groups.

3. For the particular data set under consideration, ICA is a better method than PCA, as is evident from Table 3; this is supported by the non-Gaussian nature of the data, which makes ICA applicable.

Acknowledgments The authors wish to thank the Editor, Professor Ashis SenGupta, and two reviewers for their careful reading and constructive suggestions, which led to improvements over two earlier versions of the paper.

Appendix: K-Means cluster analysis and optimum number of clusters

In our analysis, we have computed the value of K by K-means clustering (MacQueen 1967) together with a statistical technique developed by Sugar and James (2003). Using this algorithm we first determined the structure of the sub-populations (clusters) for a varying number of clusters, taking K = 1, 2, 3, 4, etc. For each such cluster formation we computed the value of a distance measure dK = (1/p) min E[(xK − cK)′(xK − cK)], defined as the distance of the vector xK (values of the variables) from the center cK (estimated as the mean value), where p is the order of the vector xK. The algorithm for determining the optimum number of clusters is then as follows (Sugar and James 2003). Let d′K denote the estimate of dK at the Kth point; then d′K is the minimum achievable distortion associated with fitting K centres to the data. A natural way of choosing the number of clusters is to plot d′K versus K and look at the resulting distortion curve. This curve is always monotonically decreasing. One would expect much smaller drops, i.e. a levelling off, for K greater than the true number of clusters, because past this point adding more centres simply partitions within groups rather than between groups. Such behaviour is evident from the distortion curve obtained by plotting K versus d′K.

According to Sugar and James (2003) for a large number of items the distortion curve


when transformed to an appropriate negative power (−p/2 in our case), will exhibit a sharp "jump". We then calculate the jumps in the transformed distortion as JK = (d′K)^(−p/2) − (d′K−1)^(−p/2). The optimum number of clusters is the value of K at which the distortion curve levels off, which is also the value associated with the largest jump.
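The appendix procedure can be sketched with scikit-learn's KMeans. This is an illustrative implementation of the Sugar–James jump statistic on synthetic well-separated clusters, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def jump_method(X, k_max=8, seed=0):
    """Pick the number of clusters by the largest jump in the distortion
    transformed with power -p/2 (Sugar and James 2003)."""
    n, p = X.shape
    d = np.array([
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_ / (n * p)
        for k in range(1, k_max + 1)
    ])                                    # average distortion per dimension
    t = d ** (-p / 2.0)                   # transformed distortion
    jumps = np.diff(np.concatenate(([0.0], t)))
    return int(np.argmax(jumps) + 1)      # K with the largest jump

rng = np.random.default_rng(5)
# Four well-separated Gaussian clusters in the plane.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])
print(jump_method(X))
```

For four clusters this separated, the transformed distortion jumps sharply at K = 4 and then levels off, as the appendix describes.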

References

Adali T, Jutten C, Romano JMT, Barros AK (2009) Independent component analysis and signal separation. In: Proceedings of the 8th international conference, ICA 2009, Paraty, Brazil, March 15–18, 2009. Springer, Berlin

Bartlett M, Makeig S, Bell AJ, Jung T-P, Sejnowski TJ (1995) Independent component analysis of EEG data. Soc Neurosci Abstr 21:437

Bonhomme S, Robin J-M (2009) Consistent noisy independent component analysis. J Econ 149:12–25

Comon P (1994) Independent component analysis, a new concept? Signal Process 36:287–314

Comon P, Jutten C (2010) Handbook of blind source separation: independent component analysis and applications. Academic Press, Oxford, UK

Fiori S (2003) Overview of independent component analysis technique with an application to synthetic aperture radar (SAR) imagery processing. Neural Networks 16(3–4):453–467

Herault J, Jutten C (1986) Space or time adaptive signal processing by neural network models. In: AIP conference proceedings, Snowbird, Utah, pp 206–211

Hyvärinen A, Karhunen J, Oja E (2001) Independent component analysis. Wiley, New York

Hyvärinen A, Karhunen J, Oja E (2002) Telecommunications. In: Independent component analysis. Wiley, New York. doi:10.1002/0471221317

Hyvärinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Networks 13(4–5):411–430

Jackson JE (2003) A user's guide to principal components. Wiley, Hoboken

Jung T-P, Humphries C, Lee T-W, Makeig S, McKeown MJ, Iragui V, Sejnowski TJ (1998) Removing electroencephalographic artifacts: comparison between ICA and PCA. In: Neural networks for signal processing VIII, pp 63–72

Kumaran RS, Narayanan K, Gowdy JN (2005) Myoelectric signals for multimodal speech recognition. In: INTERSPEECH 2005, pp 1189–1192

Lee T-W (1998) Independent component analysis: theory and applications. Kluwer, Boston, MA

Lu C-J, Chiu C-C, Yang J-L (2009) Integrating nonlinear independent component analysis and neural network in stock price prediction. In: Chien B-C, Hong T-P, Chen S-M, Ali M (eds) Next-generation applied intelligence. Proceedings of the 22nd international conference on industrial, engineering and other applications of applied intelligent systems, IEA/AIE 2009, Tainan, Taiwan, June 24–27, pp 614–623. Springer, Berlin

MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, pp 281–297

McKeown MJ, Makeig S, Jung T-P, Brown GG, Kindermann SS, Sejnowski TJ (1997) Analysis of fMRI data by decomposition into independent components. Am Acad Neurol Abstr 48:A417

Moghaddam B (1999) Principal manifolds and Bayesian subspaces for visual recognition. In: International conference on computer vision, Corfu, Greece, pp 1131–1136

SMEX02 Soil moisture atmosphere coupling experiment, Iowa. http://nsidc.org/data/nsidc-0232.html

Stones V (2004) Independent component analysis: a tutorial introduction (Bradford Books). The MIT Press, Cambridge

Sugar AS, James GM (2003) Finding the number of clusters in a data set: an information theoretic approach. J Am Stat Assoc 98:750–763

Yuen PC, Lai JH (2000) Independent component analysis of face images. In: IEEE workshop on biologically motivated computer vision. Springer, Seoul


Asis Kumar Chattopadhyay is a Professor in the Department of Statistics of Calcutta University, Kolkata. His current research interests include environmental pollution, clustering and astrostatistics.

Saptarshi Mondal is a research student at the Department of Statistics of Calcutta University, Kolkata. He is working in astrostatistics for his Ph.D. degree.

Atanu Biswas is a Professor in the Applied Statistics Unit of the Indian Statistical Institute, Kolkata. Hiscurrent research interests include environmental pollution, clinical trials and categorical data.
