Substituting missing data in compositional analysis

Contents lists available at ScienceDirect

Environmental Pollution 159 (2011) 2797–2800

Environmental Pollution

journal homepage: www.elsevier.com/locate/envpol

Substituting missing data in compositional analysis

Carlos Real a,*, J. Ángel Fernández b, Jesús R. Aboal b, Alejo Carballeira b

a Área de Ecología, Departamento de Biología Celular y Ecología, Escuela Politécnica Superior, Universidad de Santiago de Compostela, 27002 Lugo, Spain
b Área de Ecología, Departamento de Biología Celular y Ecología, Facultad de Biología, Universidad de Santiago de Compostela, 15782 Santiago de Compostela, Spain

ARTICLE INFO

Article history:
Received 16 July 2010
Received in revised form 6 May 2011
Accepted 12 May 2011

Keywords:
Compositional data analysis
Missing values
Multivariate exploratory analysis
Biomonitoring
Dioxins

* Corresponding author. E-mail address: [email protected] (C. Real).

0269-7491/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.envpol.2011.05.006

ABSTRACT

Multivariate analysis of environmental data sets requires the absence of missing values or their substitution by small values. However, if the data are transformed logarithmically prior to the analysis, this solution cannot be applied because the logarithm of a small value might become an outlier. Several methods for substituting the missing values can be found in the literature, although none of them guarantees that the structure of the data set is not distorted. We propose a method for assessing these distortions, which can be used to decide whether or not to retain the samples or variables containing missing values and to investigate the performance of different substitution techniques. The method analyzes the structure of the distances among samples using Mantel tests. We present an application of the method to PCDD/F data measured in samples of terrestrial moss as part of a biomonitoring study.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

In this paper we present a method to assess the effects of the substitution of missing values in multivariate data sets prior to the application of multivariate statistical techniques. More specifically, we deal with missing values produced by undetectable concentrations of the substances of interest, which is a common problem in many environmental data sets. We consider this problem in relation to the use of logarithmic transformations of the data, because in this case the problems become harder to solve. We are particularly interested in the logarithmic transformations employed in the analysis of compositional data (the example which we will present is of this type), but the method is of wider applicability.

The problem of missing values in multivariate data sets arises because the appropriate statistical analyses need complete data matrices. When missing values are present, either the observations or the variables containing them must be deleted prior to the analysis, but this entails a loss of valuable and usually expensive information. The alternative is to replace the missing data with suitable values (imputation) and proceed with the analysis in the hope that the new values do not bias the results.

At least two types of missing values can be distinguished, each with its own difficulties for imputation. Firstly, some missing values may be generated by the accidental loss of a sample, by problems in the analysis or by other causes. Their true values can be guessed with methods that employ the observed data distribution to produce an imputed value. Secondly, missing values appear because the sample may contain such a low concentration of the substance that it cannot be determined with the technique employed. In this case the guess should be easier, because we know that the quantification limit of the technique is an upper limit for the value. Since the range between zero and the quantification limit is small (in absolute terms), any value within this range would be appropriate for substitution because it will always be close to the true value. Even zero could be acceptable.

When working with raw data, the use of values lower than the quantification limit will not introduce much distortion in the data structure, but if the data must be transformed into logarithms prior to the analysis, the effects on the data structure can be significant. If the imputed values lie between 0 and 1, for example, the smaller the value, the larger the magnitude of the (negative) value obtained by the transformation. The smallest possible value, zero, is not usable because its logarithm is not defined. Such values commonly appear when the data are transformed into proportions prior to the analysis. Data adding up to a fixed constant, like proportions, are known as compositional data. They have special mathematical properties, and the methods designed for their analysis are based on logarithms of ratios among the variables. Compositional data are, therefore, particularly sensitive to the problem of missing values, and the imputation of these values is a delicate procedure. Note that the unusual values produced by the transformation can change the structure of the data set, introduce bias in the results of the statistical procedures and lead to erroneous conclusions.

In this paper we propose a method for evaluating the influence of the substitution of the missing values on the structure of the data set. This empirical method allows the researcher to judge whether the influence of the substitution is small enough to be accepted or whether the distortion of the data structure is large enough to justify other strategies. We do not present here a method for the selection of the substitution values. This issue has received attention in the literature, and we will simply select a convenient method among those available.

2. Material and methods

We present here the rationale and the operational procedure of the method, but we first discuss some background issues related to compositional analysis techniques and the imputation of values. Finally, we present a data set to which we apply the technique.

2.1. Transformations in compositional analysis

Compositional (= closed) data are vectors with elements adding up to a fixed value, c. Common values for c are 1 (proportions) or 100 (percentages). Compositional data have particular mathematical properties which render many standard statistical analyses inadequate. Aitchison (2003) developed a group of techniques, collectively known as compositional analysis, to solve these difficulties. In compositional analysis the data are first transformed into ratios between variables, and the ratios are then transformed logarithmically.

There are several types of transformations, but we only discuss here the clr (centered log-ratio) transformation. It has been recommended (Aitchison, 2003) as the most appropriate for the application of multivariate techniques like principal component analysis (PCA) to the transformed data. Consider a matrix of n closed vectors (samples) of dimension D (variables). We use the subscripts i and k to denote rows in that matrix and the subscripts j and l to denote columns. To identify the missing and the substituted data we replace the last subscript with s. Thus $x_{ij}$ is a measured value expressed as a proportion, $x_{is}$ a missing value and $d_{is}$ a substituted value. Let $x_i = (x_{i1}, \ldots, x_{iD})$ be a vector of compositional data. The expression for the clr transformation is $z_i = \mathrm{clr}(x_i) = [\log\{x_{i1}/g(x_i)\}, \ldots, \log\{x_{iD}/g(x_i)\}]$, where $g(x_i)$ is the geometric mean of $x_i$:

$$g(x_i) = \sqrt[D]{\prod_{j=1}^{D} x_{ij}}.$$

It is clear from the previous formula that clr-transformed data will have the same problems with missing values as were described in the introduction.
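The clr formula above can be illustrated with a minimal sketch. It is written in Python using only the standard library (the study itself used R), and the function name `clr` and the example compositions are hypothetical. The second composition shows how a very small substituted part produces a large negative clr coordinate.

```python
import math

def clr(x):
    """Centered log-ratio transform of one composition (all parts must be > 0)."""
    if any(v <= 0 for v in x):
        raise ValueError("clr is undefined for zero or negative parts")
    logs = [math.log(v) for v in x]
    log_gmean = sum(logs) / len(logs)  # logarithm of the geometric mean g(x)
    return [lv - log_gmean for lv in logs]

# Hypothetical compositions: one fully measured, one with a tiny substituted part.
z_measured = clr([0.5, 0.3, 0.2])
z_substituted = clr([0.6, 0.399999, 0.000001])
```

By construction the clr coordinates of any composition sum to zero, so one extreme coordinate also shifts all the others.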

2.2. Methods for the imputation of missing values

Several imputation methods have been developed (Martín-Fernández et al., 2003; Palarea-Albaladejo et al., 2007), which can be sorted into two groups. Firstly, parametric methods assume that the data come from a multivariate lognormal distribution. The parameters of the distribution are estimated from the set of non-missing data and then used to estimate the imputed values. These techniques have the disadvantage that the computations are complex (see Palarea-Albaladejo et al., 2007). Moreover, in many environmental studies the origin of the samples is heterogeneous and the assumption of a common distribution for the whole data set is not adequate.

Secondly, nonparametric methods do not make any assumption about the data distribution. The imputed values are selected by the researcher according to certain rules which define the substitution method. These techniques should be appropriate when the proportion of missing values is lower than 10% (Martín-Fernández et al., 2003). Various substitution methods have been proposed and were reviewed by Martín-Fernández et al. (2003). These authors showed that the technique with the most convenient properties is multiplicative substitution, because it does not distort the ratios among nonmissing variables as others do (see below). Multiplicative substitution proceeds by replacing a composition, $x_i$, having one or more missing values, with another composition, $r_i$, without them:

$$r_{ij} = \begin{cases} d_{is}, & x_{ij} \text{ is missing} \\ \left(1 - \dfrac{\sum_{s} d_{is}}{c}\right) x_{ij}, & x_{ij} \text{ is not missing} \end{cases}$$

The term in parentheses ensures that the ratios among the nonmissing variables remain unchanged (i.e. $x_{ij}/x_{il} = r_{ij}/r_{il}$). As compositional analysis makes extensive use of these ratios, this property is necessary. Note that the constant c ensures that the subtracted term is always a proportion. If the data are already proportions then c = 1 and it vanishes from the equation.
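A minimal sketch of multiplicative substitution as defined above, in standard-library Python. The function name, the use of `None` to mark missing parts and the numerical values are illustrative assumptions, not taken from the study.

```python
def multiplicative_substitution(x, d, c=1.0):
    """Replace missing parts (None) of a closed composition x by d[j] and
    shrink the observed parts so that the result still adds up to c."""
    total_imputed = sum(d[j] for j, v in enumerate(x) if v is None)
    shrink = 1.0 - total_imputed / c  # the term in parentheses in the formula
    return [d[j] if v is None else shrink * v for j, v in enumerate(x)]

# Hypothetical example: the third part is missing; observed parts add up to c = 1.
x = [0.6, 0.4, None]
d = [0.01, 0.01, 0.01]  # candidate substitution values, one per variable
r = multiplicative_substitution(x, d)
```

In this example the ratio between the two observed parts is 1.5 both before and after substitution, and the substituted composition still adds up to 1.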

We also employed another imputation method, called the simple replacement strategy by Martín-Fernández et al. (2003). The data were closed without substitution and afterwards the missing values were substituted by the lowest proportion found in each variable. This method lacks the advantages of multiplicative substitution, but we used it because a comparison between the performances of two different methods makes the analysis of the example more complete.

These imputation methods have been developed for closed data, but environmental data sets usually contain raw data which are afterwards transformed into proportions. This allows the substitution of the missing values before the closure of the data. In fact, this is a widespread procedure, and it is routinely done employing a fraction of the quantification limits of the measurements as d values. It is important to note that this procedure is equivalent to the multiplicative substitution explained before.
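The equivalence can be checked with a short derivation. The symbols $y_{ij}$ (raw measured values), $q_{is}$ (raw substitution values), $S_i$ and $Q_i$ are notation introduced here for illustration and do not appear in the original text.

```latex
% S_i: sum of the measured raw values of sample i; Q_i: sum of its raw substitution values.
S_i = \sum_{j\ \mathrm{measured}} y_{ij}, \qquad Q_i = \sum_{s} q_{is}

% Substitution in the raw data followed by closure to c:
r_{ij} = \frac{c\, y_{ij}}{S_i + Q_i}, \qquad r_{is} = \frac{c\, q_{is}}{S_i + Q_i}

% Multiplicative substitution with x_{ij} = c\, y_{ij} / S_i and d_{is} = c\, q_{is} / (S_i + Q_i):
\left(1 - \frac{\sum_s d_{is}}{c}\right) x_{ij}
  = \left(1 - \frac{Q_i}{S_i + Q_i}\right) \frac{c\, y_{ij}}{S_i}
  = \frac{S_i}{S_i + Q_i} \cdot \frac{c\, y_{ij}}{S_i}
  = \frac{c\, y_{ij}}{S_i + Q_i}
```

Both routes therefore give the same substituted composition, provided the d values used in the closed data are the closed counterparts of the raw substitution values.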

2.3. Sensitivity to the imputed values

As a preliminary step, we determined a set of optimal d values using the techniques proposed by Martín-Fernández et al. (2003). To do this, we defined a set of base values and multiplied them by a set of constants, pm (ranging from 0.1 to 10), to obtain a series of substituted data matrices with varying d sizes. Two sensitivity measures were calculated for each set of d values: the distance from the compositional mean of the data to the center of the compositional space, and the total variability in the compositional data set. Finally, the sensitivity values were plotted against pm to find the minimum.

We defined the base set of d values using two different methods. Firstly, we directly substituted the missing values with the quantification limit of each measurement, and then the data were transformed into proportions. Secondly, we employed the simple replacement strategy explained previously, using as d values the lowest proportions observed in each variable.
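The sensitivity sweep could be sketched as follows. This is an illustration under stated assumptions, not the routine used in the study: the distance to the center is taken as the Euclidean norm of the mean clr vector, the total variability as the sum of the sample variances of the clr coordinates, and the raw data, base values and function names are hypothetical.

```python
import math

def clr(x):
    """Centered log-ratio transform of one composition with positive parts."""
    logs = [math.log(v) for v in x]
    m = sum(logs) / len(logs)
    return [lv - m for lv in logs]

def sensitivity(raw, base_d, pm):
    """Substitute pm * base_d[j] for missing (None) raw values and return
    (distance of the compositional mean to the center, total variability)."""
    filled = [[pm * base_d[j] if v is None else v for j, v in enumerate(row)]
              for row in raw]
    z = [clr(row) for row in filled]  # clr does not depend on the closure constant
    n, D = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(D)]
    dist = math.sqrt(sum(m * m for m in means))
    totvar = sum(sum((row[j] - means[j]) ** 2 for row in z) / (n - 1)
                 for j in range(D))
    return dist, totvar

# Hypothetical raw data (None = below the quantification limit) and base values.
raw = [[5.0, 3.0, 1.0], [4.0, 2.5, None], [6.0, 3.5, 0.8], [5.5, 2.8, None]]
base_d = [0.1, 0.1, 0.5]
curve = {pm: sensitivity(raw, base_d, pm) for pm in (0.1, 0.5, 1.0, 2.0, 10.0)}
best_pm = min(curve, key=lambda pm: curve[pm][1])  # pm minimizing total variability
```

Plotting both measures in `curve` against pm corresponds to the graphical search for the minimum described above.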

2.4. Measuring the effect of the substitution on the structure of the data set

The next step is to investigate the influence of the substituted values on the structure of the data set. The rationale of the procedure (which is the original contribution of this study) is the following: in the substituted and clr-transformed data set each variable is deleted in turn, and the influence on the structure of the data matrix is assessed (see below). If the deletion of a variable containing substituted values does not noticeably change the structure of the data set, then this variable is not very influential and the substitution can be done without problems. If the change is significant, however, the influence of the variable might either be due to the distortion produced by the imputed values or be a true property of the variable. In the first case, the samples containing the imputed values should be those showing the greatest changes after deleting the variable.

The changes in the structure of the matrix can be assessed by analyzing the matrix of distances among samples. If the substituted values are influential, their presence must modify the distances between the sample to which they belong and the others. To test this, the matrices of Aitchison distances between samples were calculated for the complete data set (matrix $B$, with dimensions n by n) and for the data set without variable j (matrix $B_j^*$). To evaluate their differences, the Mantel (1967) statistic is calculated for the pair ($B$, $B_j^*$). The Mantel statistic is simply a measure of the correlation between the $b_{ik}$ and $b_{ik}^*$ values. If the deletion of a variable barely modifies the distances (i.e. $B_j^*$ preserves the structure of $B$), the Mantel statistic will be close to 1; in the opposite case it will approach 0. This statistic has the advantage over other measures that it is not affected by the number of variables used to calculate the distance. Note that when a variable is deleted, the space containing the samples shrinks because one dimension is lost. This reduces the magnitude of the distances among samples. For example, the (Euclidean) distance between the points (0,0,0) and (1,1,1) is $\sqrt{3}$, while the distance between the points (0,0) and (1,1) is $\sqrt{2}$. The shrinkage of the space produces lower distance values overall, but as long as the relative distances among samples do not change, the correlation coefficient will be close to 1 (i.e. the effect of the space shrinkage is absorbed by the 'slope' of the data cloud, not by its dispersion). The Mantel test can be done with Pearson's r as the statistic or, better, with Spearman's r, which is based on ranks, i.e. on the order of the distances among samples.
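The variable-deletion step can be sketched in standard-library Python as below. The study used the Mantel routine of the R package 'vegan'; this sketch only computes the correlation statistic, without the permutation test. It assumes that the Aitchison distance is the Euclidean distance between clr vectors and that the clr transformation is recomputed on the remaining variables after a deletion. All names and data are hypothetical.

```python
import math

def clr(x):
    logs = [math.log(v) for v in x]
    m = sum(logs) / len(logs)
    return [lv - m for lv in logs]

def aitchison_distances(data):
    """Distances between all pairs of samples (i < k), as a flat list."""
    z = [clr(row) for row in data]
    n = len(z)
    return [math.dist(z[i], z[k]) for i in range(n) for k in range(i + 1, n)]

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mantel_after_deletion(data, j, method="spearman"):
    """Correlation between the distances computed with and without variable j."""
    full = aitchison_distances(data)
    reduced = aitchison_distances([row[:j] + row[j + 1:] for row in data])
    if method == "spearman":
        full, reduced = ranks(full), ranks(reduced)
    return pearson(full, reduced)

# Hypothetical substituted data set: the last part of samples 2 and 4 was imputed.
data = [[0.50, 0.30, 0.150, 0.050],
        [0.45, 0.35, 0.199, 0.001],
        [0.55, 0.25, 0.140, 0.060],
        [0.48, 0.32, 0.199, 0.001],
        [0.52, 0.28, 0.160, 0.040]]
r_by_variable = [mantel_after_deletion(data, j) for j in range(4)]
```

Sorting `r_by_variable` in ascending order gives the kind of ranking of variables by influence that is reported for the example data set.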

The last step is to study which samples were the most affected by the deletion of the variable. If they were those containing substituted values, we should conclude that the substituted values strongly influenced the structure of the distances. In order to do this, we calculated the distance changes for all the pairs including sample i: $h_{ik} = |b_{ik} - b_{ik}^*|$, $i \neq k$. To test whether the changes were larger for the samples with substituted values we employed a randomization test (Manly, 1997). We calculated the mean change for the group of samples with missing values, $\bar{h}_{is}$, and for the group of complete samples, $\bar{h}_{ic}$, and, finally, their difference $h_{\mathrm{diff}} = \bar{h}_{is} - \bar{h}_{ic}$. Next, we randomly distributed the observed $h_{ik}$ values between both groups and repeated the procedure to obtain a large number of randomized values, $h'_{\mathrm{diff}}$ (4999 values). We compared $h_{\mathrm{diff}}$ with the distribution of randomized values to find the probability of observing a larger difference under the hypothesis of equal means for the two groups. We employed a randomization test because it assumes neither a particular distribution of the data nor random sampling (Manly, 1997), and therefore it can be applied to any data set.
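A minimal sketch of such a randomization test in standard-library Python. It is not the routine used in the study. It assumes that each group is summarized by a list of distance changes, that the test is one-sided, and that the observed value is counted among the randomized ones when computing p. The function name and the example values are hypothetical.

```python
import random

def randomization_test(h_substituted, h_complete, n_rand=4999, seed=0):
    """One-sided randomization test for the difference between the mean
    distance change of two groups of samples."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(h_substituted) - mean(h_complete)
    pooled = list(h_substituted) + list(h_complete)
    n_s = len(h_substituted)
    hits = 0
    for _ in range(n_rand):
        rng.shuffle(pooled)  # reassign the observed changes to the two groups at random
        diff = mean(pooled[:n_s]) - mean(pooled[n_s:])
        if diff >= observed - 1e-12:  # small tolerance for floating-point rounding
            hits += 1
    # Assumption: the observed difference counts as one of n_rand + 1 values.
    return observed, (hits + 1) / (n_rand + 1)

# Hypothetical distance changes for samples with and without substituted values.
h_sub = [0.9, 1.1, 1.3, 1.0]
h_com = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11]
h_diff, p_value = randomization_test(h_sub, h_com)
```

A small p value indicates that the samples with substituted values changed more than would be expected if the changes were unrelated to the substitution.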

All calculations were done within the R environment (R Development Core Team, 2007); the Mantel statistic was calculated with the routine included in the package 'vegan' (Oksanen et al., 2008) and the compositional calculations were done with the package 'compositions' (Boogaart et al., 2006).

2.5. The example data set

To illustrate the method and its results we employed a data set of concentrations of tetra- to octa-substituted 2,3,7,8-dioxin and furan congeners measured in the tissues of the moss Pseudoscleropodium purum (Hedw.) M. Fleisch., which is a very common species in temperate areas all over the world. These data were gathered for a study designed to assess this species as a biomonitor of atmospheric dioxin and furan deposition (Carballeira et al., 2006) and, at the same time, to obtain information about the dioxin levels in our region (Galicia, NW Spain). This moss is a good biomonitor for atmospheric pollutants because it obtains all its nutrients from the atmosphere (as most mosses do), retaining pollutants at the same time. Because the interest of this paper is in data handling and not in the concentrations themselves, we only give here a brief account of the characteristics of the samples and the analytical methods employed. More detailed explanations can be found in Abad et al. (2003) and Carballeira et al. (2006).

Several types of samples can be considered: a) Samples from unpolluted sites. We collected 10 samples in gaps in woods far from known PCDD/F sources and well distributed over the entire study area. b) Seven samples collected from the vicinity of industries considered as potential point sources of PCDD/F. The target industries were: a cement kiln, a paper mill, a FeSi and an Al smelter, a human crematorium and the waste incinerator for organic residues at the Veterinary Hospital of our University. c) 20 samples collected from 8 sampling sites around a solid waste incinerator in March 2000, 2002 and 2003. The incinerator processes more than 550,000 t of urban solid waste per year and began operating in 2001. d) Three samples from a suburban area collected in the same years. e) 10 samples from 4 sites around a rubbish dump, at different distances from it, when it was burning freely (2000) and in successive years after its closure. f) Finally, we collected 9 samples to enable us to study the spatial gradient of pollution around another burning rubbish dump. The resulting data set contains samples with variable amounts of PCDD/Fs originating from different sources and, on the whole, it is not appropriate for parametric statistical techniques, as discussed above.

The determination of the 17 PCDD/F congeners was carried out in the Mass Spectrometry Laboratory, Dept. of Ecotechnologies, IIQAB-CSIC, Barcelona (Spain) (Abad et al., 2003).

3. Results and discussion

3.1. Characteristics of the data set

A summary of the data is presented in Table 1 (raw values). We calculated the maximum, minimum, mean, and median for each variable and also the number of missing data. The differences between means and medians, and their positions within the range of the data for each variable, indicated that the distributions of the variables were skewed. This was due to the data collected at the rubbish dumps, which had high concentrations of PCDD/Fs (groups e and f above).

Table 1. Statistical summary of the distributions of the variables in the data set (n = 59). Statistics in ng g⁻¹; Missing = number of missing values.

Variable         Max.    Min.    Mean    Median   Missing
2378 TCDF        16.2    0.06    1.09    0.34     1
12378 PeCDF      12.8    0.06    0.84    0.18     0
23478 PeCDF      20.9    0.09    1.62    0.30     0
123478 HxCDF     15.1    0.07    1.13    0.34     1
123678 HxCDF     14.5    0.09    1.05    0.26     1
123789 HxCDF     4.40    0.02    0.46    0.13     20
234678 HxCDF     22.7    0.08    1.49    0.33     1
1234678 HpCDF    44.5    0.32    3.58    1.22     0
1234789 HpCDF    3.75    0.05    0.38    0.14     12
OCDF             14.3    0.29    2.04    1.21     3
2378 TCDD        6.45    0.02    0.43    0.06     16
12378 PeCDD      12.0    0.07    0.76    0.17     8
123478 HxCDD     7.58    0.07    0.65    0.21     3
123678 HxCDD     10.8    0.17    1.04    0.42     2
123789 HxCDD     10.9    0.15    1.09    0.49     1
1234678 HpCDD    80.3    1.73    8.50    5.96     0
OCDD             145     1.32    25.3    21.0     0

The database contained 69 missing values out of a total of 1003 data (59 samples, 17 congeners each), i.e. 6.9% of the total data. They did not exceed 10%, which is considered an upper limit for the use of nonparametric techniques. Most of the missing values (56) are concentrated in four variables (123789 HxCDF, 1234789 HpCDF, 2378 TCDD and 12378 PeCDD).

3.2. Sensitivity to substitution values

The sensitivity analysis showed a minimum at around pm = 1.3 for multiplicative substitution, but we decided to use the limits of quantification without change (i.e. pm = 1.0) because the differences between pm = 1.0 and pm = 1.3 were small. We also observed that the values of distance to the center and total variability were lower for multiplicative substitution than for simple substitution, which is an indication of the superior performance of multiplicative substitution.

3.3. Effects of the substitution

Once we had determined the $d_{is}$ values, the next step was to determine the effects of the substitution on the structure of the data set. Although we had determined that multiplicative substitution was more adequate than simple substitution, we include the results of the two methods in Table 2 to illustrate their different performances.

The rows in Table 2 are sorted in ascending order according to the values of the Mantel statistic (column labelled rM, corresponding to multiplicative substitution) calculated for the pairs formed by the full matrix and the matrix without the variable indicated in the first column. The variables at the top of the table had the strongest influence on the structure of the data set. The results showed that four of the five most influential variables were those with the highest numbers of zeros. OCDD was the only one of the five that had no zeros and nevertheless had a noticeable influence on the structure of the data set. As can be seen in Table 1, OCDD was the most abundant PCDD/F congener in the data set, so its influence on the structure is not surprising.

Table 2. Influence of each variable on the distances among samples. The table contains the results of the Mantel test (Spearman's r) for the multiplicative (rM) and simple (rS) substitution procedures. The rows are ordered by rM, and the last column is the position each row would have if the table had been ordered by rS. The number of missing values in each variable is indicated in the second column (n = 59).

Isomer           Missing   rM       rS       Order
1234789 HpCDF    12        0.9398   0.9241   2
123789 HxCDF     20        0.9629   0.8589   1
12378 PeCDD      8         0.9660   0.9443   4
OCDD             0         0.9762   0.9824   5
2378 TCDD        16        0.9785   0.9390   3
OCDF             3         0.9904   0.9963   8
2378 TCDF        1         0.9915   0.9937   6
23478 PeCDF      0         0.9969   0.9998   16
123478 HxCDD     3         0.9971   0.9943   7
1234678 HpCDD    0         0.9973   0.9992   14
123478 HxCDF     1         0.9975   0.9973   9
12378 PeCDF      0         0.9977   0.9998   15
234678 HxCDF     1         0.9986   0.9984   12
123789 HxCDD     1         0.9989   0.9981   11
123678 HxCDD     2         0.9993   0.9979   10
1234678 HpCDF    0         0.9993   0.9999   17
123678 HxCDF     1         0.9996   0.9986   13

The results for simple substitution were similar (column labelled rS). The same five compounds were at the top (last column in Table 2) but their order changed, now being closer to the order of the number of zeros in each variable. This means that, in our example, simple substitution seems to be more influenced by the number of zeros than multiplicative substitution.

The randomization tests showed that the influence of substituted values is large in some cases but that this does not depend on the number of missing values. The test result was p = 0.138 for 123789 HxCDF (20 missing values), but p = 0.000 for 1234789 HpCDF (12), p = 0.003 for 2378 TCDD (16) and p = 0.016 for 2378 TCDF (1). This last variable is less problematic than the other two due to the reduced number of missing values.

The conclusion seems to be clear: the substitution of missing values in 1234789 HpCDF and 2378 TCDD modified the structure of the data despite our efforts to select an optimal set of substituting values. In view of these results there are three possible actions to take: a) to delete the variables containing missing values, b) to delete the samples and c) to use the substituted data set. The first option can change the structure of the data set, as the rM and rS values in Table 2 indicate, but the question is whether the structure is real or an artifact produced by the substitution. The second option can produce an unacceptable loss of information (there were 22 complete samples in the example data set), although this problem might be less important for data sets containing a low proportion of missing data. The third option will produce biased results, which must be carefully scrutinized in order to determine the exact influence of the substituted data. Probably, a wise option would be to compare the results of options a) and c) to find their differences. The proposed method cannot help with this comparison, but it gives clues as to where discrepancies must be searched for.

3.4. Final remarks

Substitution of missing values by quantification limits (or a fraction of them) in the raw data matrix is a more straightforward procedure than substitution in the matrix of closed data because it uses values obtained directly from the quality parameters of the analysis. Note also that for many analytical methods quantification limits are determined for each measurement, not for each variable (this being the case in our example), and in these cases substitution in the raw data matrix is easier than substitution in the matrix of closed data.

We consider that our example illustrates the need for a careful check on the influence of the substituted values prior to the analysis of the data, as well as the usefulness of the proposed method. By applying this method, variables and samples with the potential to alter results can be identified and excluded or, otherwise, checked when interpreting results. Note that the closure of the data is not a requisite for the application of the method. It is also useful for any data set transformed logarithmically and even for untransformed data, although in this last case the problems caused by the substitution should be less important.

Acknowledgements

This work received financial support from the Xunta de Galicia(project PGIDT02 PR425J 292/7-0).

References

Abad, E., Caixach, J., Rivera, J., Real, C., Aboal, J., Fernández, A., Carballeira, A., 2003. Study on the use of mosses as biomonitors to evaluate the environmental impact of PCDDs/PCDFs from combustion processes – preliminary results. Organohalogen Compounds 60, 283–285.

Aitchison, J., 2003. The Statistical Analysis of Compositional Data, second ed. The Blackburn Press, Caldwell, New Jersey.

Boogaart, K.G. van den, Tolosana, R., Bren, M., 2006. Compositions: compositional data analysis. R package version 0.91-6. URL: http://www.stat.boogaart.de/compositions.

Carballeira, A., Fernández, J.A., Aboal, J.R., Real, C., Couto, J.A., 2006. Moss: a powerful tool for dioxin monitoring. Atmospheric Environment 40, 5576–5786.

Manly, B.F.J., 1997. Randomization, Bootstrap and Monte Carlo Methods in Biology, second ed. Chapman & Hall/CRC, Boca Raton, Florida.

Mantel, N., 1967. The detection of disease clustering and a generalized regression approach. Cancer Research 27, 209–220.

Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V., 2003. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology 35, 253–278.

Oksanen, J., Kindt, R., Legendre, P., O'Hara, R., Simpson, G.L., Solymos, P., Stevens, M.H.H., Wagner, H., 2008. The vegan package (v. 1.15-1). URL: http://cran.r-project.org.

Palarea-Albaladejo, J., Martín-Fernández, J.A., Gómez-García, J., 2007. A parametric approach for dealing with compositional rounded zeros. Mathematical Geology 39, 625–645.

R Development Core Team, 2007. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. URL: http://www.R-project.org.
