
JVS 5795 De Caceres Accepted-Revised


Numerical reproduction of traditional classifications and automatic vegetation identification

Miquel de Cáceres1,2,*, Xavier Font1, Paloma Vicente1, Francesc Oliva2

1Department of Plant Biology, University of Barcelona, Avda. Diagonal 645, Barcelona, ES-08028, Spain; 2Department of Statistics, University of Barcelona, Avda. Diagonal 645, Barcelona, ES-08028, Spain; *Corresponding author; E-mail: [email protected]

Abstract

Questions: Is it possible to develop an expert system to provide reliable automatic identifications of plant communities at the precision level of phytosociological associations? How can unreliable expert-based knowledge be discarded before applying supervised classification methods?

Material: We used 3677 relevés from Catalonia (Spain), belonging to eight orders of terrestrial vegetation. These relevés were classified by experts into 222 low-level units (associations or subassociations).

Methods: We reproduced low-level expert-defined vegetation units as independent fuzzy clusters by using the Possibilistic C-means algorithm. Those relevés detected as transitional between vegetation types were excluded in order to maximize the number of units numerically reproduced. Cluster centroids were then considered static and used to perform supervised classifications of vegetation data. Finally, we evaluated the classifier’s ability to correctly identify the unit of both typical (i.e. training) and transitional relevés.

Results: Only 166 out of 222 (75%) of the original units could be numerically reproduced. Almost all the unrecognized units were subassociations. Among the original relevés, 61% were deemed transitional or untypical. Typical relevés were correctly identified 95% of the time, while the efficiency of the classifier on transitional data was only 64%. However, if the classifier’s second choice was also considered, the rate of correct classification for transitional relevés was 80%.

Conclusions: Our approach stresses the transitional nature of relevé data coming from vegetation databases. Relevé selection is justified in order to adequately represent the vegetation concepts associated with expert-defined units.

Keywords: Fuzzy sets; Expert systems; Possibilistic C-means; Phytosociological data; Syntaxonomy.

Introduction

During recent years, there has been a renewed interest in vegetation classification, even in parts of the world with little phytosociological tradition (e.g. Rodwell et al. 1995, Jennings 2003). Nature managers are in need of consistent systems of vegetation classification. Indeed, assigning a meaningful vegetation type to the plant community observed in a sampling site is the first step in applied ecological studies, such as landscape mapping, vegetation conservation or restoration planning. Such assignment (i.e. the determination of the community type) would be a simpler task if the identification of possible types was done through the use of remote expert systems of vegetation classification (Noble 1987). To date, there is no expert system specially designed for providing web-based vegetation classification services on the basis of species composition/abundance. Nevertheless, several local computer programs are already available for this purpose (van Tongeren 1986, Hill 1996, Pot 1997, Tichý 2002, van Tongeren et al. 2008), and Czech vegetation scientists distribute expert system configurations to be used locally within the JUICE program (Chytrý 2007).

Methodologically speaking, the act of identifying the predefined class or classes to which a given plant community may be assigned is usually called supervised classification. Standard statistical tools such as quadratic discriminant analysis (Ejrnæs et al. 2004) and especially artificial neural networks (Cerná & Chytrý 2005) have recently been advocated as efficient methodological approaches for the identification of plot data. Simpler but more easily interpretable approaches consist in calculating resemblance values between the target relevé and each of the predefined vegetation units. After that, the relevé is identified with the nearest unit(s). Relevé resemblance computation may be performed by combining information from species composition, abundance values, and/or the presence of diagnostic species (van Tongeren 1986, Hill 1989, Kocí et al. 2003, Tichý 2005, van Tongeren et al. 2008). Another approach consists in identifying potential units by progressing downward from higher to lower hierarchical levels (Pot 1997). In either case, a classifier is developed from a training data set of plot observations whose classification is previously known and is assumed to be valid. Such an assumption can be a source of problems in expert domains where it does not hold, or when there is no consensus on the classification of the training set. Traditional expert-based vegetation classifications usually suffer from several inconsistencies (i.e. different researchers used variable and sometimes not explicitly stated classification criteria) and/or contain loosely defined units (i.e. plant communities defined by the occurrence, dominance, or absence of particular species). Under this scenario, supervised classification methods may spread potentially wrong knowledge if traditional expert-defined classifications are not previously validated using a common classification criterion. Since contemporary vegetation scientists are increasingly using numerical clustering (i.e. unsupervised) methods to derive new vegetation units (Mucina & van der Maarel 1989, Mucina 1997), these methods should also be used to review traditional classifications. However, note that current conservation policies, like those of the Natura 2000 networking program, are based on habitat definitions (e.g. the CORINE biotopes manual, Devillers et al. 1991), which in turn rely on traditional phytosociological units. Therefore, drastic changes in regional/national vegetation classifications can be problematic and should be avoided. Even if traditional vegetation units are considered valid, we believe the classification criterion of supervised classifications should be congruent with the one used in the original classification of the training data set. Otherwise, the efficiency and/or interpretation of results may be affected. This explains why supervised approaches emulating traditional phytosociological concepts perform better when the expert classification of the training set is used instead of that resulting from numerical clustering analyses (e.g. van Tongeren et al. 2008).

The aim of this paper is to propose a methodological framework for translating low-level expert-defined vegetation units into an automatic vegetation identifier. It consists of two main steps. First, we use Possibilistic C-means, a fuzzy unsupervised classification method, to reproduce expert-defined vegetation units. Second, the cluster centroids resulting from the first step are used to identify new observations by means of a fuzzy classifier. We use the traditional phytosociological classification of the Catalan vegetation to build numerical clusters and to evaluate the classifier’s ability to provide satisfactory answers at the precision level of association.

Material and Methods

Data sets and data transformations

We took the traditional phytosociological classification of terrestrial vegetation in Catalonia, in the northeast of Spain. In order to span a broad range of vegetation types, we considered eight syntaxonomical orders (see Table 1), which include different types of grasslands, shrublands and forests. For each order we compiled all relevés from those phytosociological associations containing at least 3 representatives. Relevés were drawn from the Biodiversity Data Bank of Catalonia (Font 2008). The original authors of the relevés had assigned them to associations or subassociations that were fitted into the syntaxonomical classification made by Bolòs & Vigo (1984). Only Brometalia erecti grasslands had undergone a numerical revision, based on correspondence analyses, of the original expert-based classification (Font 1993). We performed neither a stratified re-sampling (Knollová et al. 2005) nor an elimination of those relevés with unusually small or large plot sizes (Otýpková & Chytrý 2006). Relevé compilation resulted in eight distinct datasets, one corresponding to each order. Taken together, we considered 3677 relevés, which belong to 222 distinct low-level (i.e. association or subassociation) vegetation units. These vegetation types were the expert knowledge to be validated and emulated by means of numerical methods.

Species nomenclature was homogenized using a regional flora (Bolòs et al. 1990). Unsure plant determinations, determinations not reaching the species level and taxon names not appearing in the flora were eliminated. Although they are not consistently reported, we kept cryptogam records because they are diagnostic for some vegetation units. Braun-Blanquet cover-abundance values were first transformed to the nine-degree ordinal scale (van der Maarel 1979). We then applied the Hellinger transformation (Legendre & Gallagher 2001). The Hellinger distance (Rao 1995, Legendre & Legendre 1998) is equal to the chord distance (Orlóci 1967) computed after taking the square root of the abundance values. The multivariate space provided by the Hellinger distance was used to define numerical clusters reproducing expert-defined vegetation units.
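The equivalence stated above can be checked numerically. The following is a minimal sketch, assuming the ordinal cover values are already available as a relevés-by-species matrix; the function names and the example values are hypothetical and are not part of the original analysis.

```python
import numpy as np

def hellinger_transform(abund):
    """Return the square root of relative abundances, one row per relevé."""
    abund = np.asarray(abund, dtype=float)
    totals = abund.sum(axis=1, keepdims=True)
    # Rows with a zero total stay all-zero instead of triggering a division by zero.
    rel = np.divide(abund, totals, out=np.zeros_like(abund), where=totals > 0)
    return np.sqrt(rel)

def chord_distance(a, b):
    """Euclidean distance between two vectors after scaling each to unit length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a / np.linalg.norm(a) - b / np.linalg.norm(b)))

# Hypothetical ordinal cover values (rows: relevés, columns: species).
ordinal = np.array([[7, 3, 0, 0],
                    [5, 5, 2, 0],
                    [0, 0, 8, 2]])

h = hellinger_transform(ordinal)
# Hellinger distance: Euclidean distance between the transformed rows.
d_hellinger = float(np.linalg.norm(h[0] - h[1]))
# Same value obtained as the chord distance on square-rooted abundances.
d_chord = chord_distance(np.sqrt(ordinal[0]), np.sqrt(ordinal[1]))
```

In this sketch every transformed relevé has unit length, so two relevés sharing no species (the first and third rows) lie at the maximum possible distance, the square root of 2.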

Cluster model

In the opinion of many vegetation scientists, vegetation types are not crisp classes but types that are conceptually vague and fuzzy (e.g. Dale 1988, Moraczewski 1993, Willner 2006). Therefore, any numerical classification of vegetation should allow some degree of overlap and even allow leaving some relevés unclassified. Setting a hierarchical tree or a partition (either fuzzy or crisp) as classification model seemed excessively constraining to us. In addition, we wanted a cluster model where new clusters could be defined without changing all those clusters previously defined. For these two reasons, we turned our attention to the Possibilistic C-Means algorithm (PCM, Krishnapuram & Keller 1993, 1996), which implements a clustering model where clusters are both fuzzy and independent. The PCM algorithm originated from Fuzzy C-means (FCM, Bezdek 1981), an unsupervised partitive clustering procedure well-known among vegetation scientists (Marsili-Libelli 1989, Mucina 1997). Table 2 summarizes the mathematical differences between PCM and FCM models. In the possibilistic approach, fuzzy membership values are not relative (i.e. probabilistic) as in FCM, but are interpreted as absolute cluster typicalities. Cluster independence is obtained because the partition constraint of FCM is eliminated. That is, for any object the sum of its possibilistic membership values does not have to be one. As a result of this fundamental difference, PCM is a mode-seeking algorithm. That is, in PCM each vegetation cluster corresponds to a dense region in the multivariate space of relations between plots. A single PCM run can be regarded as c independent runs of an algorithm looking for a single cluster (Davé & Krishnapuram 1997). The PCM model solves the FCM problem raised by Dale (1995), consisting of the possible data contamination resulting from types that are not well represented and whose centre lies outside the available data. Fig. 1 further illustrates the differences between the two models, by showing their corresponding results on relevé data from three xerophytic grassland associations.
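The contrast between relative and absolute memberships can be illustrated with a short sketch. Table 2 is not reproduced in this text, so the two functions below use the standard membership forms from the FCM and PCM literature cited above, which we assume correspond to those in the table; the distances and the cluster size parameter are hypothetical values.

```python
import numpy as np

def fcm_membership(dist, m=1.2):
    """Relative (probabilistic) memberships; dist has shape (clusters, objects)."""
    d = np.maximum(np.asarray(dist, dtype=float), 1e-12)  # guard against zero distances
    # Distances are scaled by the nearest cluster first, which keeps the powers finite.
    w = (d / d.min(axis=0, keepdims=True)) ** (-2.0 / (m - 1.0))
    return w / w.sum(axis=0, keepdims=True)

def pcm_typicality(dist, eta, m=1.2):
    """Absolute typicalities; eta holds one size parameter per cluster."""
    d2 = np.asarray(dist, dtype=float) ** 2
    eta = np.asarray(eta, dtype=float).reshape(-1, 1)
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

# Hypothetical distances from two cluster centroids (rows) to three relevés (columns):
# a typical member of cluster 0, a relevé halfway between both, and a distant outlier.
dist = np.array([[0.2, 0.5, 1.3],
                 [1.0, 0.5, 1.3]])

fcm = fcm_membership(dist)                # columns sum to one
pcm = pcm_typicality(dist, [0.25, 0.25])  # columns need not sum to one
```

Under these assumptions FCM gives the distant outlier the same memberships (0.5 and 0.5) as the genuinely intermediate relevé, whereas the PCM typicalities of the outlier are close to zero for both clusters.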

Reproduction of traditional units into numerical clusters

Whenever possible, we created one possibilistic fuzzy cluster for each traditional low-level vegetation unit (syntaxonomical association or subassociation). One additional advantage of PCM over FCM is that it avoids the need to specify the number of clusters to be sought. Instead, distinct clusters are permitted as long as they represent distinguishable dense regions of the multivariate space. In our case, we considered two clusters as distinguishable when their amount of overlap was less than 10% (see below). We used this criterion to detect poorly defined vegetation units. Moreover, relevé databases may be plagued with noisy and transitional plot data. Indiscriminately including all the available relevés would prevent the PCM algorithm from distinguishing many expert-defined units. Therefore some relevés were discarded during the reproduction process.

The following steps were performed for each of the eight datasets. We started by taking those relevés belonging to the first low-level vegetation unit. The one-cluster PCM algorithm was then run on this initial training relevé set, using the three closest relevés as starting cluster members. The fuzziness exponent was set to m = 1.03, which is a rather crisp value but allowed higher sensitivity of the algorithm. The PCM cluster size parameter (ηi in Table 2) was then progressively augmented in order to make the cluster grow. This was done by using a method described in De Cáceres et al. (2006), which allows finding appropriate PCM cluster sizes. Once the cluster had grown, the relevés showing very low membership values (i.e. with uij < 0.0001) were excluded from the training data set and stored in a set of transitional (non-typical) relevés. The final cluster configuration was also stored. After reproducing this first unit, the relevés belonging to the next vegetation unit were included in the training relevé set, and we let the previously defined PCM cluster(s) “react” to the newly added relevés by running the PCM algorithm from its last configuration, also allowing for changes in the cluster size parameter (De Cáceres et al. 2006). It could happen that some of the newly added relevés became members (i.e. with a possibilistic fuzzy membership uij > 0.1) of any of the previous cluster(s). In this case, those relevés were deemed transitional, and they were also excluded from the training set and stored in the transitional set. We then reloaded the stored cluster configuration(s) and the “reacting” process was rerun without the noisy transitional relevés. This was repeated until all previously defined PCM cluster(s) were stable to the new relevés. Note that this process of discarding relevés could leave a given vegetation unit without enough relevés for being translated into a numerical cluster. If enough relevés were left, we used the three closest relevés as starting cluster members for a new PCM cluster, which was grown as described above.

Any given PCM cluster reproducing a low-level expert unit was only accepted when it fulfilled the following three conditions: (a) the sum of membership values for the fuzzy cluster set (i.e. its cardinality) was equal to or greater than 3; (b) all relevés with a membership value for the fuzzy cluster above 0.1 had been classified by experts into the same vegetation type (this condition ensured that the PCM cluster represented the proper vegetation concept); and (c) the proportion of overlap between the fuzzy cluster and any of the remaining PCM clusters was lower than 10%. Cluster overlap between any pair of clusters was calculated as the cardinality of the fuzzy intersection set divided by the cardinality of the fuzzy union set. Whenever a cluster failed to be accepted, a distinct set of three relevés was used as starting cluster configuration. The steps described above were iteratively repeated until all the traditional low-level vegetation units had been considered. Subassociations were given priority over associations as units to be reproduced.

This algorithm yielded three sets: (1) a final training set, made of typical relevés only (this is hereafter also referred to as the typical relevé set); (2) a transitional set, containing those relevés that were outliers or similar to more than one numerical cluster; and (3) a set of PCM fuzzy clusters corresponding to reproduced expert-defined vegetation units.
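The three acceptance conditions can be sketched as follows. This is an illustration only: it assumes the usual minimum and maximum operators for the fuzzy intersection and union, which the text does not specify, and the function names and membership values are hypothetical.

```python
import numpy as np

def fuzzy_overlap(u_a, u_b):
    """Cardinality of the fuzzy intersection divided by that of the fuzzy union.
    The min/max operators are an assumption; the text does not name them."""
    u_a = np.asarray(u_a, dtype=float)
    u_b = np.asarray(u_b, dtype=float)
    union = np.maximum(u_a, u_b).sum()
    return float(np.minimum(u_a, u_b).sum() / union) if union > 0 else 0.0

def accept_cluster(u_new, expert_units, unit, other_clusters,
                   min_card=3.0, member_cut=0.1, max_overlap=0.10):
    """Check conditions (a), (b) and (c) for one candidate PCM cluster."""
    u_new = np.asarray(u_new, dtype=float)
    expert_units = np.asarray(expert_units)
    # (a) cardinality (sum of memberships) of at least 3
    cond_a = bool(u_new.sum() >= min_card)
    # (b) every relevé with membership above 0.1 carries the same expert unit
    cond_b = bool(np.all(expert_units[u_new > member_cut] == unit))
    # (c) overlap with each of the remaining clusters below 10%
    cond_c = all(fuzzy_overlap(u_new, u) < max_overlap for u in other_clusters)
    return cond_a and cond_b and cond_c

# Hypothetical memberships of six relevés in a candidate cluster for unit "A"
# and in one previously accepted cluster (reproducing unit "B").
expert_units = ["A", "A", "A", "A", "B", "B"]
u_candidate = [0.90, 0.80, 0.95, 0.70, 0.02, 0.00]
u_previous = [0.00, 0.05, 0.00, 0.00, 0.90, 0.85]

accepted = accept_cluster(u_candidate, expert_units, "A", [u_previous])

# The same candidate is rejected once a relevé of unit "B" becomes a member.
u_mixed = [0.90, 0.80, 0.95, 0.70, 0.30, 0.00]
rejected = not accept_cluster(u_mixed, expert_units, "A", [u_previous])
```

The growth of the cluster size parameter and the “reacting” step follow De Cáceres et al. (2006) and are not reproduced in this sketch.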

Supervised classification of relevés

We used the probabilistic approach of FCM to perform supervised classifications. In order to use this unsupervised method in a supervised mode, the centroid coordinates for each of the fuzzy clusters must be considered static (but see the leave-one-out procedure below). Supervised FCM classification of any relevé j was performed in two simple steps: (1) compute eij, the distance between the relevé j and each fixed cluster centroid i; and (2) compute uij, the relevé fuzzy membership to each cluster i, by using the FCM membership function (eq. 1 in Table 2). We set the fuzziness exponent to m = 1.2 in this case, as recommended by several authors (e.g. Marsili-Libelli 1989, Podani 1990, Escudero & Pajarón 1994).
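The two steps above can be sketched in a few lines. The sketch assumes that relevés and centroids are expressed in Hellinger-transformed coordinates, so that the Euclidean distance between them is the Hellinger distance, and that eq. 1 of Table 2 is the standard FCM membership function; the centroids, unit names and target relevé are hypothetical.

```python
import numpy as np

def classify_releve(x, centroids, unit_names, m=1.2):
    """Rank vegetation units for one relevé using fixed (static) centroids."""
    x = np.asarray(x, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Step 1: distance e_ij between the relevé and each fixed cluster centroid.
    e = np.maximum(np.linalg.norm(centroids - x, axis=1), 1e-12)
    # Step 2: FCM membership u_ij; distances are scaled by the smallest one
    # so that the negative powers stay within floating-point range.
    w = (e / e.min()) ** (-2.0 / (m - 1.0))
    u = w / w.sum()
    order = np.argsort(-u)  # highest membership first
    return [(unit_names[i], float(u[i])) for i in order]

# Hypothetical centroids of three units over four species (Hellinger coordinates).
centroids = np.sqrt(np.array([[0.7, 0.3, 0.0, 0.0],
                              [0.1, 0.2, 0.7, 0.0],
                              [0.0, 0.0, 0.4, 0.6]]))
unit_names = ["Unit A", "Unit B", "Unit C"]

# Hypothetical target relevé, already Hellinger-transformed.
target = np.sqrt(np.array([0.6, 0.3, 0.1, 0.0]))

ranking = classify_releve(target, centroids, unit_names)
first_choice, second_choice = ranking[0][0], ranking[1][0]
```

Returning the full ranking, rather than the best unit only, makes it straightforward to consider the classifier’s second choice as well.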

Evaluation of the classifier

Our objective was to assess the performance of the classifier by measuring its rate of correct identification at the precision level of association. If a given association (and its possible subassociations) had not been reproduced, it was not represented in the set of fuzzy clusters. Hence, its relevés could not be used to evaluate the classifier’s performance. However, if some subassociations of an association or the association itself had been reproduced, then all its subassociations were considered to be represented because in this case the classifier was capable of returning a correct answer at the level of association.

Both typical and transitional relevé sets were used for the evaluation of the classifier. Since relevés of the transitional set had been discarded in the definition of PCM clusters, they could be used as a test set. However, relevés of the typical (training) set exerted an attraction on the centroids, and thus their re-classification was biased. Aiming to remove this bias, we used a leave-one-out cross-validation procedure. For each training relevé to be classified, we temporarily removed it from the training set and the PCM clusters were allowed to “react” as explained above. After this step, identification could be done without the influence of the target relevé on cluster centroids.

The classifier responses were homogenized at the level of association. For each represented association within each order we estimated the sensitivity and positive predictive power of the classifier (see Cerná & Chytrý 2005 for details). We also calculated rates of correct association identification for each of the eight datasets, and for all datasets taken together. In order to gain more detailed information on the classifier’s performance, we repeated this efficiency assessment also taking into account the second choice as an additional source of correct identification.
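These measures can be computed as in the following sketch. It assumes the usual definitions, which we take to match those of Cerná & Chytrý (2005): sensitivity is the proportion of relevés of an association that the classifier assigns to it, and positive predictive power is the proportion of relevés assigned to an association that truly belong to it. The association labels and classifier outputs are hypothetical.

```python
def sensitivity_and_ppv(true_assoc, first_choice, association):
    """Sensitivity and positive predictive power for one association."""
    pairs = list(zip(true_assoc, first_choice))
    tp = sum(1 for t, p in pairs if t == association and p == association)
    fn = sum(1 for t, p in pairs if t == association and p != association)
    fp = sum(1 for t, p in pairs if t != association and p == association)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sens, ppv

def correct_rate(true_assoc, ranked_choices, n_choices=1):
    """Share of relevés whose true association is among the first n choices."""
    hits = sum(1 for t, ranked in zip(true_assoc, ranked_choices)
               if t in ranked[:n_choices])
    return hits / len(true_assoc)

# Hypothetical expert assignments and ranked classifier answers for six relevés.
true_assoc = ["A", "A", "A", "B", "B", "C"]
ranked = [["A", "B"], ["A", "C"], ["B", "A"],
          ["B", "A"], ["C", "B"], ["C", "A"]]
first = [r[0] for r in ranked]

sens_b, ppv_b = sensitivity_and_ppv(true_assoc, first, "B")
rate_first = correct_rate(true_assoc, ranked, n_choices=1)
rate_first_or_second = correct_rate(true_assoc, ranked, n_choices=2)
```

In this toy example four of the six relevés are identified on first choice, and all six once the second choice is also accepted.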

Results

Reproduction of traditional units

Among the 222 original low-level units, 166 (75%) could be numerically reproduced using the strategy described (see Table 1). Only two of the 56 non-reproduced units were associations. The remaining 54 non-reproduced units were subassociations, which means that in all these cases other subassociations of the same association could be reproduced. Approximately 39% of the original relevés were finally kept in the training set, but this percentage varied from 31% (for Fagetalia beech forests) to 57% (for Galio-Alliarietalia megaforb communities). Hence, nearly 25% of the expert-defined vegetation units and 61% of the relevés can be considered of transitional nature following our cluster building criteria.

Performance of the vegetation classifier

The two non-reproduced associations accounted for 27 relevés. The remaining 3650 relevés belonged to associations represented in the classifier, so they were used to assess its performance. We report detailed result tables on the sensitivity and positive predictive power for each association in App. 1. We show in Table 3 the rates of correct identification computed for the eight datasets independently and altogether. The overall rate of correct association identification for the typical relevés was very high: 95% of relevés were classified into the correct association in the first choice, and 99% taking into account the first and second choices of the classifier (see Table 3). This high rate of success is not surprising, since the relevés of this set were those which, by definition, were closest to cluster centroids. In contrast, the classifier identified the correct association for 64% of the relevés of the transitional set. Nevertheless, if we take into account the transitional nature of these relevés, the percentage of correct identification using the first and second choices may be a more realistic measure of performance. Over all phytosociological orders, this latter percentage was 79.5%. Identification of beech forests (Fagetalia sylvaticae) was the least successful (66%) and that of Quercus ilex forests and related communities (Quercetalia ilicis) the most successful (89.3%). When considering both typical and transitional relevés, the estimated overall efficiency of the classifier was 76.3% of correct identification on first choice, and 86.9% considering also the classifier’s second choice.

Discussion

Reproduction of traditional classifications

Previous attempts to reproduce traditional vegetation classifications usually forced the reproduction of all expert-defined units into the classifier (e.g. van Tongeren 1986, Hill 1989, van Tongeren et al. 2008). In the case of Kocí et al. (2003), the use of the Cocktail algorithm (Bruelheide 2000) allowed excluding poorly differentiated units, but their approach was still essentially expert-based (Chytrý 2007). Going a step further, we stressed here the necessity of validating traditional vegetation units through the use of an unsupervised clustering method. Although we tried to maximize the number of vegetation types that could be numerically reproduced, 25% of the original low-level units could not be sustained as distinct clusters. Subassociations turned out to be more difficult to reproduce because many of them are traditionally defined as a subclass of an association that shows a tendency towards an ecologically neighbouring association (in other words, they are transitional).

Moreover, in previous approaches relevé identification was usually performed using assignment rules that were different from the rules originally used in the classification of training data (e.g. Kocí et al. 2003, Tichý 2005, van Tongeren et al. 2008). We preferred to use the resemblance in species abundance values only, as a simple common criterion for both unsupervised and supervised classification. Using overall species composition instead of Cocktail’s species groups has the advantage that it allows reproducing units lacking differential species (i.e. ‘basal’ or ‘central’ communities). However, the classifier is not expected to provide accurate results with such units due to their high variability and large number of transitional relevés.

Performance of the vegetation classifier

Whereas inconsistency in the original classification methods can be avoided by applying numerical clustering, it reappears when attempting to evaluate the efficiency of the classifier because the reference classification is expert-based. That is, the precision in the original assignments may be affecting the percentages of successful identification. In addition, relevés belonging to transitional subassociations were more difficult to classify correctly than relevés belonging to reproduced vegetation units (even if both were represented at the level of association). This occurred because the classifier lacked centroids to represent these units and hence their relevés were assigned to one of the neighbouring units. The high number of unrecognized subassociations in Fagetalia beech forests (see Table 1) may account for the low classifier results on this data set (Table 3). There are other possible sources of low supervised classification efficiency, derived from inconsistencies in the sampling methods that different authors use. Otýpková & Chytrý (2006) showed that smaller plots tend to produce less stable ordinations in data sets of low beta diversity. The reading of their findings in terms of classification is that relevés from small plots may be easily misclassified because of their higher degree of variability both in species presence and abundance. The same reasoning may be applied to the inconsistent recording of cryptogams.

Sampling and the appropriate representation of vegetation types

We carefully selected the relevés included in the training set, which certainly is a critical point in our approach and must be justified. Statistically speaking, such relevé selection is still a subjective decision that completely biases sampling and precludes any inference on the validity of groups. Hence, one cannot expect to accurately reflect the real patterns of vegetation. Moreover, Cerná & Chytrý (2005) found that selecting plots with diagnostic species as training set resulted in lower efficiency of neural network classifiers compared to using a randomly selected training set. Nevertheless, vegetation scientists nowadays generally agree that vegetation is mainly of a continuous nature. Therefore, as long as an optimal vegetation sampling theory is lacking, statistical inference on clustering results will remain a delicate subject (e.g. Rolecek et al. 2007). Meanwhile, vegetation classification should not aim at discovering true vegetation types, but should provide a knowledge basis for performing applied ecological studies. With this in mind, we considered it more important to keep the vegetation concept to be reproduced very clear. We set a specific point in the multivariate space (i.e. the cluster centroid) as the representative of the expert-defined unit. Not including transitional relevés in the centroid definition helped in keeping it as an ideal type. Ensuring that the nomenclatural type relevé (if available) shows a high membership to the unit would be a way to allow using the syntaxon name for the fuzzy cluster.

Limitations of the numerical cluster model

Note that our numerical cluster model assumes roughly spherical clusters, both when building PCM clusters and when executing the FCM classifier. One of Dale’s (1995) criticisms of FCM was its inability to cope with non-spherical cluster shapes. It is possible, however, to allow hyperellipsoidal clusters in FCM and PCM algorithms (Krishnapuram & Keller 1993) by taking into account the cluster variance-covariance matrix. Another limitation of our approach is that the FCM membership function works better with clusters of similar size. The PCM typicality function may be used instead, but at the expense of obtaining values which cannot be interpreted as probabilities.

Final remarks and future work

In our opinion, vegetation scientists should decide whether they would prefer: (1) a vegetation classifier designed as an interface to communicate expert vegetation knowledge to non-experts; or (2) a computer program like the former, but which could also promote the revision of the expert knowledge itself. In the first case the program would simply run supervised classification methods from a knowledge that would be assumed to be true. In contrast, in the second case the system would allow doubting expert knowledge, and even changing the expert’s point of view. We believe this second model is more flexible and promising. We implemented our proposals in a set of related computer programs called Araucaria (see App. 2 and http://biodiver.bio.ub.es/vegana/araucaria). One of them allows experts to feed the classifier with new plot data, and see how the current set of PCM clusters “reacts” to this new information.

Regarding future developments, we strongly believe that a comparison of vegetation classification methodologies is necessary, not only in terms of efficiency but also aiming at a unification of traditional and numerical approaches. Since vegetation classifications are regionally restricted, studying solutions for biogeographical issues (e.g. vicariant units) would be another interesting research topic. Nevertheless, large-scale vegetation expert systems (say, valid for all of Europe) will certainly be difficult to develop.

Acknowledgements

We would like to thank Lubomir Tichý and an anonymous reviewer for their very useful comments on a previous version of this manuscript. This study was supported by a Ph.D. grant awarded by the “Comissionat per a Universitats i Recerca” (1999SGR00059), of the “Departament d’Universitats, Recerca i Societat de la Informació de la Generalitat de Catalunya” (2001 FI 00269), and by a research project from the Spanish “Ministerio de Educación y Ciencia” (CGL2006-13421-C04-01/BOS).


References

Bezdek, J. C. 1981. Pattern recognition with fuzzy objective functions. Plenum Press, New York.

Bolòs, O. de & Vigo, J. 1984. Flora dels Països Catalans. Vol. 1. Ed. Barcino, Barcelona.

Bolòs, O. de, Vigo, J., Masalles, R. M. & Ninot, J. M. 1990. Flora Manual dels Països Catalans. 2nd ed. Pòrtic, Barcelona.

Braun-Blanquet, J. 1964. Pflanzensoziologie: Grundzüge der Vegetationskunde. Springer.

Bruelheide, H. 2000. A new measure of fidelity and its application to defining species groups. Journal of Vegetation Science 11: 167-178.

Cerná, L. & Chytrý, M. 2005. Supervised classification of plant communities with artificial neural networks. Journal of Vegetation Science 16: 407-414.

Chytrý, M. (ed.) 2007. Vegetation of the Czech Republic. 1. Grassland and Heathland Vegetation. Academia, Praha, 525 pp. http://www.sci.muni.cz/botany/vegsci/expertni_system.php?lang=en

Dale, M. B. 1988. Some fuzzy approaches to phytosociology. Ideals and instances. Folia geobotanica et phytotaxonomica 23: 239-274.

Dale, M. B. 1995. Evaluating classification strategies. Journal of Vegetation Science 6: 437-440.

Davé, R. N. & Krishnapuram, R. 1997. Robust clustering methods: a unified view. IEEE transactions on fuzzy systems 5: 270-293.

De Cáceres, M., Oliva, F. & Font, X. 2006. On relational possibilistic clustering. Pattern recognition 39: 2010-2024.

Devillers, P., Devillers-Terschuren, J. & Ledant, J.-P. 1991. CORINE biotopes manual. Habitats of the European Community. A method to identify and describe consistently sites of major importance for nature conservation. Data specifications - Part 2. Office for Official Publications of the European Communities, Luxembourg.


Ejrnæs, R., Bruun, H. H., Aude, E. & Buchwald, E. 2004. Developing a classifier for the Habitats Directive grassland types in Denmark using species lists for prediction. Applied Vegetation Science 7: 71-80.

Escudero, A. & Pajarón, S. 1994. Numerical syntaxonomy of the Asplenietalia petrarchae in the Iberian Peninsula. Journal of Vegetation Science 5: 205-214.

Font, X. 1993. Estudis geobotànics sobre els prats xeròfils de l’estatge montà dels pirineus. Institut d’Estudis Catalans, Barcelona, ES.

Font, X. 2008. Mòdul Flora i Vegetació. Banc de Dades de Biodiversitat de Catalunya. Generalitat de Catalunya i Universitat de Barcelona. http://biodiver.bio.ub.es/biocat/homepage.html

Hill, M. O. 1989. Computerized matching of relevés and association tables, with an application to the British National Vegetation Classification. Vegetatio 83: 187-194.

Hill, M. O. 1996. TABLEFIT version 1.0, for identification of vegetation types. Institute of Terrestrial Ecology, Huntingdon, UK.

Jennings, M. 2003. Guidelines for Describing Associations and Alliances of the US National Vegetation Classification. Ecological Society of America.

Knollová, I., Chytrý, M., Tichý, L. & Hajek, O. 2005. Stratified resampling of phytosociological databases: some strategies for obtaining more representative data sets for classification studies. Journal of Vegetation Science 16: 479-486.

Kocí, M., Chytrý, M. & Tichý, L. 2003. Formalized reproduction of an expert-based phytosociological classification: A case study of subalpine tall-forb vegetation. Journal of Vegetation Science 14: 601-610.

Krishnapuram, R. & Keller, J. M. 1993. A possibilistic approach to clustering. IEEE transactions on fuzzy systems 1: 98-110.


Krishnapuram, R. & Keller, J. M. 1996. The possibilistic c-means algorithm: Insights and recommendations. IEEE transactions on fuzzy systems 4: 385-393.

Legendre, P. & Gallagher, E. D. 2001. Ecologically meaningful transformations for ordination of species data. Oecologia 129: 271-280.

Legendre, P. & Legendre, L. 1998. Numerical Ecology. 2nd English ed. Elsevier.

Marsili-Libelli, S. 1989. Fuzzy clustering of ecological data. Coenoses 4: 95-106.

Moraczewski, I. R. 1993. Fuzzy logic for phytosociology: 1. Syntaxa as vague concepts. Vegetatio 106: 1-11.

Mucina, L. 1997. Classification of vegetation: Past, present and future. Journal of Vegetation Science 8: 751-760.

Mucina, L. & van der Maarel, E. 1989. Twenty years of numerical syntaxonomy. Vegetatio 81: 1-15.

Noble, I. R. 1987. The role of expert systems in vegetation science. Vegetatio 69: 115-121.

Orlóci, L. 1967. An agglomerative method for classification of plant communities. Journal of Ecology 55: 193-206.

Otýpková, Z. & Chytrý, M. 2006. Effects of plot size on the ordination of vegetation samples. Journal of Vegetation Science 17: 465-472.

Podani, J. 1990. Comparison of fuzzy classifications. Coenoses 5: 17-21.

Pot, R. 1997. SYNDIAT, SYNtaxonomical DIAgnostics Tool, a computer program based on the deductive method of community identification. Acta Botanica Neerlandica 46: 230.

Rao, C. R. 1995. A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió (Quaderns d'Estadística i Investigació Operativa) 19: 23-63.

Rodwell, J. S., Pignatti, S., Mucina, L. & Schaminée, J. H. J. 1995. European Vegetation Survey: update on progress. Journal of Vegetation Science 6: 759-762.


Rolecek, J., Chytrý, M., Hájek, M., Lvoncik, S. & Tichý, L. 2007. Sampling in large-scale vegetation studies: Do not sacrifice ecological thinking to statistical puritanism. Folia Geobotanica 42: 199-208.

Tichý, L. 2002. JUICE, software for vegetation classification. Journal of Vegetation Science 13: 451-453.

Tichý, L. 2005. New similarity indices for the assignment of relevés to the vegetation units of an existing phytosociological classification. Plant Ecology 179: 67-72.

van der Maarel, E. 1979. Transformation of cover-abundance values in phytosociology and its effects on community similarity. Vegetatio 39: 97-114.

van Tongeren, O. 1986. FLEXCLUS, an interactive program for classification and tabulation of ecological data. Acta Botanica Neerlandica 35: 137-142.

van Tongeren, O., Gremmen, N. & Hennekens, S. M. 2008. Assignment of relevés to predefined classes by supervised clustering of plant communities using a new composite index. Journal of Vegetation Science 19: 525-536.

Willner, W. 2006. The association concept revisited. Phytocoenologia 36: 67-76.


Table 1. The eight phytosociological orders studied and results of the numerical reproduction of their low-level classification.

| Phytosociological order | Short description | Original units | Original relevés | Reproduced units | Non-reproduced units | Training (typical) rel. |
|---|---|---|---|---|---|---|
| Brometalia erecti | mesophytic or slightly xerophytic pastures | 30 | 531 | 26 | 4 | 231 |
| Origanetalia vulgaris | herb communities growing on forest fringes | 12 | 133 | 10 | 2 | 67 |
| Galio-Alliarietalia | megaforb sciophilous communities | 13 | 124 | 12 | 1 | 71 |
| Prunetalia spinosae | shrub communities growing on deciduous forest fringes | 18 | 353 | 16 | 2 | 161 |
| Populetalia albae | riverine meso-macroforests growing on wet fluvisols with high water-table | 17 | 199 | 10 | 7 | 107 |
| Quercetalia ilicis | mediterranean woodlands, scrublands and maquis | 31 | 753 | 25 | 6 | 254 |
| Quercetalia pubescentis | submediterranean deciduous oak woodlands | 41 | 651 | 30 | 11 | 243 |
| Fagetalia sylvaticae | beech forests | 60 | 933 | 37 | 23 | 286 |
| Total | | 222 | 3677 | 166 | 56 | 1420 |

Table 2. Main mathematical characteristics of the Fuzzy C-means (FCM) and Possibilistic C-means (PCM) clustering algorithms.

| | FCM | PCM |
|---|---|---|
| Fuzzy membership definition | $\sum_{i=1}^{c} u_{ij} = 1$ for all objects $j = 1, \dots, n$ | $\sum_{i=1}^{c} u_{ij} > 0$ for all objects $j = 1, \dots, n$ |
| Optimisation function | $J_{FCM} = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^{m} e_{ij}^{2}$ | $J_{PCM} = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^{m} e_{ij}^{2} + \sum_{i=1}^{c} \eta_{i} \sum_{j=1}^{n} (1 - u_{ij})^{m}$ |
| Membership function | $u_{ij} = 1 \big/ \sum_{l=1}^{c} (e_{ij}/e_{lj})^{2/(m-1)}$ (1) | $u_{ij} = 1 \big/ \left(1 + (e_{ij}^{2}/\eta_{i})^{1/(m-1)}\right)$ (2) |
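To make the contrast between the two membership functions in Table 2 concrete, the following is a minimal Python sketch of Eqs. (1) and (2) for a single object. It assumes that e_ij denotes the distance between object j and the centroid of cluster i, and that the PCM parameter (written eta here) is the per-cluster reference distance on the squared-distance scale. The function names and the example values are hypothetical and do not come from the Araucaria programs.

```python
def fcm_memberships(dists, m=1.2):
    """FCM memberships of one object, following Eq. (1).

    dists[i] is the distance between the object and the centroid of cluster i.
    The returned memberships always sum to one.
    """
    if m <= 1.0:
        raise ValueError("fuzziness exponent m must be greater than 1")
    zeros = [e == 0.0 for e in dists]
    if any(zeros):
        # The object coincides with a centroid: membership is shared
        # among the coinciding clusters and is zero elsewhere.
        k = sum(zeros)
        return [1.0 / k if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((e_i / e_l) ** exponent for e_l in dists) for e_i in dists]


def pcm_typicalities(dists, etas, m=1.09):
    """PCM typicalities of one object, following Eq. (2).

    etas[i] is the (positive) reference distance parameter of cluster i.
    Each value depends only on the distance to its own cluster, so the
    values need not sum to one.
    """
    if m <= 1.0:
        raise ValueError("fuzziness exponent m must be greater than 1")
    exponent = 1.0 / (m - 1.0)
    return [1.0 / (1.0 + (e * e / eta) ** exponent) for e, eta in zip(dists, etas)]


# Hypothetical object lying far from all three centroids.
far_object = [3.0, 3.1, 3.2]
u_fcm = fcm_memberships(far_object, m=1.2)
u_pcm = pcm_typicalities(far_object, etas=[1.0, 1.0, 1.0], m=1.09)
```

In this sketch the FCM memberships of the distant object still sum to one, whereas all of its PCM typicalities are close to zero, which is the property that lets PCM flag relevés as untypical of every unit.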


Table 3. Classification efficiency of the numerical classifier at the association level. Column blocks list the efficiency on the typical and transitional relevé sets, as well as the overall efficiency for the represented associations. Ass.: Number of represented associations. %: Percentage of relevés correctly classified; L/U: Lower/upper 95% confidence limits following the binomial distribution.

| Phytosociological order | Ass. | Typical Rel. | Typical 1st choice % | L | U | Typical 1st/2nd choice % | L | U | Transitional Rel. | Transitional 1st choice % | L | U | Transitional 1st/2nd choice % | L | U | Represented Rel. | Represented 1st choice % | L | U | Represented 1st/2nd choice % | L | U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brometalia erecti | 20 | 231 | 97.4 | 94.4 | 99.0 | 99.1 | 96.9 | 99.9 | 285 | 68.8 | 63.5 | 74.6 | 85.6 | 81.1 | 89.6 | 516 | 81.6 | 78.4 | 85.2 | 91.7 | 89.1 | 94.0 |
| Origanetalia vulgaris | 10 | 67 | 92.5 | 83.4 | 97.5 | 100.0 | 94.6 | 100.0 | 66 | 39.4 | 27.6 | 52.2 | 78.8 | 67.0 | 87.9 | 133 | 66.2 | 57.5 | 74.1 | 89.5 | 83.0 | 94.1 |
| Galio-Alliarietalia | 11 | 71 | 94.4 | 86.2 | 98.4 | 97.2 | 90.2 | 99.7 | 53 | 56.6 | 42.3 | 70.2 | 73.6 | 59.7 | 84.7 | 124 | 78.2 | 69.9 | 85.1 | 87.1 | 79.9 | 92.4 |
| Prunetalia spinosae | 9 | 161 | 96.3 | 92.1 | 98.6 | 98.8 | 95.6 | 99.8 | 192 | 72.9 | 66.0 | 79.1 | 85.9 | 80.2 | 90.5 | 353 | 83.6 | 79.3 | 87.3 | 91.8 | 88.4 | 94.4 |
| Populetalia albae | 7 | 107 | 92.5 | 85.8 | 96.7 | 94.4 | 88.2 | 97.9 | 92 | 64.1 | 53.5 | 73.9 | 82.6 | 73.3 | 89.7 | 199 | 79.4 | 73.1 | 84.8 | 88.9 | 83.7 | 92.9 |
| Quercetalia ilicis | 13 | 254 | 99.2 | 97.2 | 99.9 | 99.2 | 97.2 | 99.9 | 487 | 80.9 | 79.4 | 86.7 | 89.3 | 88.2 | 93.8 | 741 | 87.2 | 86.7 | 91.5 | 92.7 | 92.2 | 95.9 |
| Quercetalia pubescentis | 10 | 243 | 90.5 | 86.1 | 93.9 | 98.8 | 96.4 | 99.7 | 408 | 65.7 | 60.9 | 70.3 | 82.1 | 78.0 | 85.7 | 651 | 75.0 | 71.4 | 78.2 | 88.3 | 85.6 | 90.7 |
| Fagetalia sylvaticae | 22 | 286 | 96.2 | 93.2 | 98.1 | 99.0 | 97.0 | 99.8 | 647 | 49.0 | 45.1 | 52.9 | 66.0 | 62.2 | 69.6 | 933 | 63.5 | 60.3 | 66.5 | 76.1 | 73.2 | 78.8 |
| Total | 102 | 1420 | 95.4 | 94.2 | 96.4 | 98.6 | 97.8 | 99.1 | 2230 | 64.1 | 62.1 | 66.2 | 79.5 | 77.9 | 81.3 | 3650 | 76.3 | 75.1 | 77.9 | 86.9 | 86.0 | 88.2 |
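The confidence limits in Table 3 are described only as "following the binomial distribution". As a hedged sketch, assuming this refers to exact (Clopper-Pearson) limits, the following Python code computes such limits by bisection on the binomial tail probabilities. The function names are hypothetical, and the code is not the procedure used to produce the table.

```python
from math import exp, lgamma, log


def binom_pmf(i, n, p):
    """Binomial probability of i successes in n trials, computed in log space."""
    log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coef + i * log(p) + (n - i) * log(1.0 - p))


def binom_cdf(k, n, p):
    """Probability of at most k successes in n trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def bisect_root(f, lo=1e-12, hi=1.0 - 1e-12, iters=100):
    """Locate the root of f on (lo, hi), assuming f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_binomial_ci(x, n, alpha=0.05):
    """Exact two-sided confidence limits for a proportion x / n.

    Lower limit: the p at which P(X >= x) equals alpha / 2.
    Upper limit: the p at which P(X <= x) equals alpha / 2.
    """
    if x == 0:
        lower = 0.0
    else:
        lower = bisect_root(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2)
    if x == n:
        upper = 1.0
    else:
        upper = bisect_root(lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper
```

For example, `exact_binomial_ci(67, 67)` gives a lower limit of 0.025^(1/67), about 0.946, which is consistent with the 94.6% lower limit listed in Table 3 for the 67 typical relevés of Origanetalia vulgaris (100.0% correct under 1st/2nd choice).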


Fig. 1. Example of clustering results of FCM and PCM on relevés belonging to three grassland associations of Brometalia erecti. (a) Classical multidimensional scaling coordinates from Bray-Curtis distances, with the original vegetation units labelled using different symbols (filled circles: Koelerio-Avenuletum ibericae; squares: Adonido-Brometum erecti; diamonds: Lino viscosi-Brometum erecti; empty circles: intermediate artificial relevés created by averaging randomly selected relevés from the three groups). (b) FCM (m=1.2) solution with three groups. (c) PCM (m=1.09) solution with three groups, after setting appropriate reference distance parameters as described in De Cáceres et al. (2006). Symbol size and colour intensity are a function of the object’s largest membership value.
