[IEEE 2004 IEEE International Conference on Fuzzy Systems - Budapest, Hungary (25-29 July 2004)] 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542) - Clustering-based

25-29 July, 2004 Budapest, Hungary

Clustering-based Identification of TS-models: Comparison on a Groundwater Model Case Study

Hilde Vernieuwe, Bernard De Baets
Department of Applied Mathematics, Biometrics and Process Control
Ghent University, Coupure links 653, 9000 Gent, Belgium
E-mail: {Hilde.Vernieuwe,Bernard.DeBaets}@UGent.be

Niko E.C. Verhoest
Laboratory of Hydrology and Water Management
Ghent University, Coupure links 653, 9000 Gent, Belgium
E-mail: [email protected]

Abstract- In this paper, we apply different clustering algorithms for the identification of Takagi-Sugeno models. The fuzzy c-means, Gustafson-Kessel, simplified Gustafson-Kessel, Gath and Geva, simplified Gath and Geva, and modified Gath and Geva clustering algorithms all try to minimize the same objective function. First, an algorithm for determining the optimal number of clusters is presented. The Takagi-Sugeno models with the optimal number of clusters are then incorporated into a groundwater model, and compared with measurements of the EMSL experiment and the results of a numerical groundwater model.

I. INTRODUCTION

In hydrology, water movement in the soil has mainly been modeled through a numerical solution of the Richards equation, which is the most common mathematical expression for unsaturated flow. Bárdossy et al. [1] developed a fuzzy rule-based model as an alternative solution to the Richards equation through a subjective identification of the fuzzy rule base and obtained very promising results. Because Takagi-Sugeno models are able to model a large class of nonlinear systems [2], [3], [4], [5], we identified such models for this problem in [6].

Since the goal of a fuzzy rule base is to induce a fuzzy partition of the product space of the input-output variables, and fuzzy clustering algorithms are very well suited to detect such a fuzzy partition, it is not surprising that these algorithms are commonly used to identify Takagi-Sugeno models. The fuzzy c-means algorithm [7] is the classical fuzzy clustering algorithm. The Gustafson-Kessel algorithm [8] is commonly used to identify Takagi-Sugeno models and was also applied in [6]. Several other suitable fuzzy clustering algorithms, which try to minimize the same objective function, exist.

In this paper, we investigate which of these algorithms performs best with respect to a groundwater model case study. Therefore, a comparison between Takagi-Sugeno models identified using these algorithms is performed. On the one hand, the commonly used fuzzy clustering algorithms such as the fuzzy c-means and the Gustafson-Kessel algorithm are applied. On the other hand, less frequently used fuzzy clustering algorithms such as the Gath and Geva algorithm [9], the simplified Gustafson-Kessel algorithm [10], the simplified Gath and Geva algorithm [10] and the modified Gath and Geva algorithm [3] are used. An algorithm for determining the number of clusters, based on a validity measure indicating the accuracy of the Takagi-Sugeno model, is presented.

II. THE TAKAGI-SUGENO MODELS

A. Unsaturated Subsurface Flow

The water movement in a one-dimensional, isotropic soil matrix is described by the one-dimensional Richards equation [11], [12]:

with $\theta$ the soil moisture content [-], $t$ the time [T], $z$ the gravity head, which expresses the elevation [L] of a point with respect to the soil surface (for which $z = 0$) and is defined positive upward, $K$ the hydraulic conductivity of the soil [L/T] and $\psi$ the matric head [L], which varies with the soil moisture content. The one-dimensional Richards equation (1) is a combination of the law of Darcy:

$$q = K \frac{\partial h}{\partial z} \qquad (2)$$

and the continuity equation:

$$\frac{\partial \theta}{\partial t} + \frac{\partial q}{\partial z} = 0 \qquad (3)$$

where $q$ is a volumetric flux or the Darcy flux [L/T], defined positive in the downward direction and expressing the amount of water flowing downward through a unit area per unit of time, and $h = z + \psi$ defines the total head [L].

The matric head $\psi$ and the soil moisture content $\theta$ are linked through the van Genuchten equation [13]:

$$\theta(\psi) = \theta_r + (\theta_s - \theta_r) \left( \frac{1}{1 + (\alpha |\psi|)^n} \right)^m \qquad (4)$$

where $\alpha$, $n$, $m$ are soil-related parameters (with $m = 1 - 1/n$), $\theta_s$ is the saturated moisture content and $\theta_r$ is the residual moisture content of the soil.
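As a quick numerical illustration of the retention curve (4), the sketch below evaluates the moisture content for a few matric heads with the parameter values of Table I. The function name, the use of the magnitude of the head (in cm) and the sample heads are assumptions made for this example only; they are not taken from the original study.

```python
import numpy as np

# Van Genuchten parameters of the sandy loam soil (Table I).
THETA_S = 0.55       # saturated moisture content [-]
THETA_R = 0.01       # residual moisture content [-]
ALPHA = 0.048        # [1/cm]
N = 1.5632           # [-]
M = 1.0 - 1.0 / N    # m = 1 - 1/n


def moisture_content(psi_cm):
    """Moisture content theta(psi) for a matric head psi given in cm.

    Assumption: only the magnitude |psi| enters the formula, so a suction
    of 100 cm can be passed either as -100.0 or as 100.0.
    """
    psi = np.abs(np.asarray(psi_cm, dtype=float))
    return THETA_R + (THETA_S - THETA_R) * (1.0 + (ALPHA * psi) ** N) ** (-M)


if __name__ == "__main__":
    for head in (0.0, -10.0, -100.0, -1000.0):
        print(f"psi = {head:8.1f} cm -> theta = {float(moisture_content(head)):.4f}")
```

At zero matric head the function returns the saturated moisture content, and for increasing suction it decreases monotonically towards the residual moisture content.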

0-7803-8353-2/04/$20.00 © 2004 IEEE

FUZZ-IEEE 2004

TABLE I
VAN GENUCHTEN PARAMETERS FOR THE SANDY LOAM SOIL

Parameter     Value     Unit
$\theta_s$    0.5500    -
$\theta_r$    0.0100    -
$K_s$         5.0000    cm/h
$\alpha$      0.0480    cm^-1
$n$           1.5632    -

B. Rule Structure

The fuzzy rules we propose for approximating the solution of the one-dimensional Richards equation (1) are of the form:

IF $\left((\theta/\theta_s)_u, (\theta/\theta_s)_l\right)$ is $A$ THEN $q_{u,l} = a\,(\theta/\theta_s)_u + b\,(\theta/\theta_s)_l + c$

with $A$ a 2-dimensional fuzzy set, and $a$, $b$ and $c$ the consequent parameters. The subscripts $u$ and $l$ refer to the upper and lower soil cell. The rules describe the flow $q_{j,j+1}$ between two adjacent soil cells $j$ and $j+1$ at a specific time step [1], [14]. The Takagi-Sugeno model is applied to calculate the flux between the adjacent soil cells along a vertical profile. Using the continuity equation (3), an updated moisture content of every cell is obtained. The boundary conditions considered at the upper soil cell are rainfall and evaporation, while two possible boundary conditions were examined at the bottom of the profile: a constant moisture content and a zero flux.
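The way such rules are evaluated and combined with the continuity equation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian form of the two-dimensional membership functions, the function names and all parameter values are assumptions, and the explicit update simply states that a cell gains the downward flux entering from above and loses the downward flux leaving below.

```python
import numpy as np


def ts_flux(x, centers, widths, consequents):
    """Flux between an upper and a lower cell from a first-order TS model.

    x           : (2,) relative moisture contents (upper cell, lower cell)
    centers     : (R, 2) rule centres (hypothetical Gaussian memberships)
    widths      : (R, 2) rule widths  (hypothetical)
    consequents : (R, 3) rows (a, b, c) of q = a*x_u + b*x_l + c
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    widths = np.asarray(widths, dtype=float)
    consequents = np.asarray(consequents, dtype=float)
    # Degree of fulfilment of every rule (2-D Gaussian, an assumption).
    beta = np.exp(-0.5 * np.sum(((x - centers) / widths) ** 2, axis=1))
    # Local linear models evaluated at x.
    q_rules = consequents[:, 0] * x[0] + consequents[:, 1] * x[1] + consequents[:, 2]
    # Model output: weighted average of the rule outputs.
    return float(np.sum(beta * q_rules) / np.sum(beta))


def update_profile(theta, flux_between, q_top, q_bottom, dt, dz):
    """One explicit continuity step for a vertical column of cells.

    theta        : (n,) moisture contents, index 0 is the top cell
    flux_between : (n-1,) downward fluxes q_{j,j+1} between adjacent cells
    q_top, q_bottom : downward fluxes at the surface and at the bottom
    """
    theta = np.asarray(theta, dtype=float)
    q = np.concatenate(([q_top], np.asarray(flux_between, dtype=float), [q_bottom]))
    # Each cell gains what enters from above and loses what leaves below.
    return theta + dt / dz * (q[:-1] - q[1:])


if __name__ == "__main__":
    centers = [[0.3, 0.3], [0.8, 0.8]]
    widths = [[0.2, 0.2], [0.2, 0.2]]
    consequents = [[1.0, -1.0, 0.0], [4.0, -2.0, 0.5]]
    print(ts_flux([0.5, 0.4], centers, widths, consequents))
```

With a zero flux at both boundaries, the update leaves the total amount of water in the column unchanged, which is a convenient sanity check for any implementation of this step.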

C. Generation of the Training Data Set

The training data set was generated synthetically using a numerical discretization of the Darcy equation (2) for a sandy loam soil with the van Genuchten parameters listed in Table I.

The relative soil moisture contents of the upper and lower cells were varied between $\theta_r/\theta_s$ and $\theta_s/\theta_s$ with a discretization step of 0.0025. The spatial discretization was 2.5 cm. This resulted in a data set containing 154449 data points. As this set was too large to perform the clustering, we randomly selected N = 10000 data points. The surface of the training data is given in Figure 1.
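The reported size of the data set can be checked with a few lines of Python. The sketch below shows one reading of the grid that reproduces the count of 154449 points (393 levels per cell); the exact placement of the levels, the random seed and the variable names are assumptions, and the flux values themselves are not computed here.

```python
import numpy as np

THETA_S, THETA_R = 0.55, 0.01   # Table I
STEP = 0.0025                   # step of the relative moisture content

# Levels start at theta_r/theta_s and advance by STEP while staying <= 1
# (assumed grid layout; the source only gives the range and the step).
lowest = THETA_R / THETA_S
n_levels = int(np.ceil((1.0 - lowest) / STEP))
levels = lowest + STEP * np.arange(n_levels)

# All combinations of upper-cell and lower-cell moisture contents.
upper_cell, lower_cell = np.meshgrid(levels, levels, indexing="ij")
pairs = np.column_stack([upper_cell.ravel(), lower_cell.ravel()])

# Random selection of N = 10000 training inputs (seed chosen arbitrarily).
rng = np.random.default_rng(seed=0)
subset = pairs[rng.choice(len(pairs), size=10_000, replace=False)]

if __name__ == "__main__":
    print(n_levels, len(pairs), subset.shape)
```

Under this reading the highest level is slightly below full saturation, which is a consequence of the assumed grid layout rather than a statement about the original data.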

D. Identification of the Models

As stated in the introduction, several Takagi-Sugeno models were identified by applying different fuzzy clustering algorithms that try to minimize the following objective function:

$$J = \sum_{x=1}^{N} \sum_{k=1}^{c} f_k(x)^m \, d(x,k)^2 \qquad (5)$$

with $d(x,k)$ a measure for the distance between data point $x$ and cluster $k$, $N$ the number of data points, $c$ the number of clusters, $f_k$ the membership function of cluster $k$ and $m \in \mathbb{R}_{>1}$ the so-called fuzzifier. In this paper, a value of $m = 2$ was used. The algorithms used are described in [3] and [10]. In order to identify the Takagi-Sugeno models, a fuzzy clustering algorithm was performed in the input-output space. The clusters can then be seen as multi-dimensional membership functions. The consequent parameters were calculated using the global least squares method [2] and the ordinary least squares method for the models identified using the modified Gath and Geva clustering algorithm [3].

Fig. 1. Training surface (axes: relative moisture content upper cell (%), relative moisture content lower cell (%))
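To make the role of the objective function (5) concrete, the following sketch implements the plain fuzzy c-means iteration with Euclidean distances and records the value of J in every iteration. It is an illustration under stated assumptions: the function name, the random initialization and the fixed number of iterations are choices made for this example, and the other algorithms compared in this paper use different distance measures that are not shown here.

```python
import numpy as np


def fcm(data, c, m=2.0, n_iter=50, seed=0):
    """Plain fuzzy c-means with Euclidean distances.

    Returns the cluster centres, the (N, c) membership matrix and the
    value of the objective J recorded in every iteration.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Random initial memberships, each row normalised to sum to one.
    u = rng.random((len(data), c))
    u /= u.sum(axis=1, keepdims=True)
    history = []
    for _ in range(n_iter):
        w = u ** m
        # Centres: membership-weighted means of the data points.
        centers = (w.T @ data) / w.sum(axis=0)[:, None]
        # Squared distances d(x, k)^2, floored to avoid division by zero.
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # Objective J for the current memberships and centres.
        history.append(float((w * d2).sum()))
        # Membership update: proportional to d(x, k)^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u, history


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    points = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(3.0, 0.1, (50, 2))])
    centers, u, history = fcm(points, c=2)
    print(centers)
    print(history[0], history[-1])
```

Because both the centre update and the membership update minimize J with the other quantity held fixed, the recorded values of J do not increase from one iteration to the next.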

III. AN ALGORITHM FOR DETERMINING THE NUMBER OF CLUSTERS

The performance of the Takagi-Sugeno model on the training data set was determined using the Nash and Sutcliffe index (NS) [15]:

$$NS = 1 - \frac{\sum_{k=1}^{N} \left(q_{num}(k) - q_{fuzzy}(k)\right)^2}{\sum_{k=1}^{N} \left(q_{num}(k) - \bar{q}_{num}\right)^2} \qquad (6)$$

where $q_{num}(k)$ and $q_{fuzzy}(k)$ are respectively the numerically calculated flux and the flux calculated with the Takagi-Sugeno model, and $\bar{q}_{num}$ denotes the mean of the numerically calculated flux. The optimal value of NS is 1, meaning a perfect match. The NS index is commonly used in hydrology, and is an index similar to the Variance Accounted For (VAF).
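A minimal Python sketch of the NS index is given below. It uses the standard definition of the Nash and Sutcliffe index (one minus the ratio of the residual sum of squares to the sum of squared deviations from the mean of the reference values); the function and argument names are chosen for this example.

```python
import numpy as np


def nash_sutcliffe(q_num, q_fuzzy):
    """Nash and Sutcliffe index of q_fuzzy with q_num as the reference."""
    q_num = np.asarray(q_num, dtype=float)
    q_fuzzy = np.asarray(q_fuzzy, dtype=float)
    # Squared error of the model with respect to the reference.
    residual = np.sum((q_num - q_fuzzy) ** 2)
    # Squared deviation of the reference from its own mean.
    spread = np.sum((q_num - q_num.mean()) ** 2)
    return float(1.0 - residual / spread)


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    print(nash_sutcliffe(reference, reference))        # perfect match
    print(nash_sutcliffe(reference, [2.0, 2.0, 2.0]))  # model equal to the mean
```

A perfect match gives NS = 1, while a model that always returns the mean of the reference values gives NS = 0.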

The number of clusters was first varied between 2 and 64. For each number of clusters, a fuzzy c-means clustering was performed and a Takagi-Sugeno model was identified. This procedure was repeated 30 times for each number of clusters, and the value of the NS index was calculated each time. The maximum value of the NS index for each number of clusters is plotted in Figure 2. This figure shows that for a low number of clusters, the value of the NS index improves strongly when an extra cluster is added, whereas for a higher number of clusters the improvement per added cluster becomes smaller. To determine the number of clusters, a compromise has to be sought between an increasing number of clusters, i.e. a more complex model, and the accuracy of the model. Since all fuzzy clustering algorithms considered try to minimize the same objective function and have the same algorithmic structure, the same behaviour of the NS index with respect to the number of clusters is expected.

A heuristic algorithm to search for the optimal number of clusters, applicable to all of these fuzzy clustering algorithms, was developed. The ClusterFinder starts, in this particular case, with 64 clusters (denoted as nclmax). First, a margin α has to be determined by which the accuracy of the model with 64 clusters may decrease while still being acceptable. We found that a margin of 0.02 was reasonable according to the change of the maxima of the NS values with the number of clusters (see Figure 2). This margin can be seen as the price we are ready to pay (in terms of the NS index) for not doubling the complexity of the model. The main algorithm identifies Takagi-Sugeno models for nclmax and ncl = nclmax/2 clusters. As long as the Takagi-Sugeno models with ncl clusters yield a higher maximum NS value than those with ncl × 2 clusters, the upper boundary of the total interval is changed to ncl and ncl becomes ncl/2. In this way, the main algorithm determines the number of clusters, as a power of two, for the upper boundary of the total interval for which the Takagi-Sugeno model yields the maximum NS value. Therefore, it is important that the initially chosen nclmax is a power of two. The number of clusters for the lower boundary of the total interval is initially set to 0. The subalgorithm Zoom then searches recursively for the lowest number of clusters that yields a maximum value of the NS index fitting within the margin α. Initially, the search interval containing the optimal number of clusters equals the total interval. Zoom narrows the search interval by changing its lower and upper boundaries in order to find the optimal number of clusters.

Fig. 2. Maximum values of the NS index for each number of clusters for the fuzzy c-means algorithm (horizontal axis: number of clusters)

TABLE II
NUMBER OF CLUSTERS AND CORRESPONDING VALUES OF THE PERFORMANCE INDICES FOR THE DIFFERENT CLUSTERING ALGORITHMS

clustering algorithm    # clusters    NS [-]    NS (64 cl.) [-]
FCM                     19            0.9458    0.9642
GK                      16            0.9794    0.9989
SGK                     17            0.9735    0.9921
GG                      22            0.9809    0.9987
SGG                     33            0.9791    0.9960
MGG                     22            0.9770    0.9946

Algorithm 1: Main algorithm: ClusterFinder

Data:
    data set Z
    initial number of clusters nclmax
    number of repetitions nrep
    margin α
Result:
    fuzzy rule base
    number of clusters (ncl)
    matrix containing the values of the performance indices (NS)

ncl ← nclmax/2
for i ← 1 to nrep do
    Build the fuzzy rule base for nclmax
    Simulate the data set Z
    Calculate NS(i, nclmax)
    Build the fuzzy rule base for ncl
    Simulate the data set Z
    Calculate NS(i, nclmax/2)
end
diff ← max(NS(·, nclmax)) − max(NS(·, nclmax/2))
while diff < 0 do
    ncl ← ncl/2
    for i ← 1 to nrep do
        Build the fuzzy rule base for ncl
        Simulate the data set Z
        Calculate NS(i, ncl)
    end
    diff ← max(NS(·, ncl * 2)) − max(NS(·, ncl))
end
Zoom(ncl * 2, ncl * 2, ncl, 0, diff, NS)

The ClusterFinder algorithm with nrep = 30, nclmax = 64 and α = 0.02 was used to determine the number of clusters for the different fuzzy clustering algorithms. Table II shows the results, together with the values of the NS index both for the number of clusters found and for 64 clusters. This table shows that, apart from the model based on the fuzzy c-means clustering algorithm, all models give comparable results, albeit for different numbers of clusters. Figure 3 shows the absolute errors of the model based on the Gath and Geva clustering algorithm compared to the training data. From this figure, one can see that the highest errors appear at the border of the surface, more specifically, when one of the two soil cells is close to or at saturation. One reason for these larger errors might be that the slope of the training surface at those borders is very high. Because of this steep slope, and because the training data are randomly chosen, the training data are not sampled densely enough in that region relative to the steepness of the slope. A second reason might be that the consequent parts of the Takagi-Sugeno models are linear equations, for which the parameters are calculated using a linear least squares method. In the border region for instance, a least squares method fits the “best plane” for all data points in that region. In this way, an underestimation of the flux at the borders might be obtained.

Algorithm 2: Subalgorithm: Zoom
(only the two interval-update branches are legible in the source; the subscripts l, c and u for the lower boundary, the current number of clusters and the upper boundary are a reconstruction)

if ... then
    ncl_l ← ncl_c
    ncl_c ← ncl_c + (ncl_u − ncl_c)/2
else
    ncl_u ← ncl_c
    ncl_c ← ncl_c − (ncl_c − ncl_l)/2
end
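The search strategy of the ClusterFinder can be sketched in Python as follows. This is a simplified reading, not the authors' code: `best_ns` is a hypothetical callable that stands for "identify the model nrep times for a given number of clusters and return the maximum NS value", the Zoom step is assumed to be a bisection between the lower and upper boundary, and the acceptance threshold is taken as the best NS value at the upper boundary minus the margin. The bisection also assumes that the best NS value does not decrease when clusters are added inside the search interval.

```python
from typing import Callable, Dict


def cluster_finder(best_ns: Callable[[int], float],
                   ncl_max: int = 64,
                   margin: float = 0.02) -> int:
    """Return the lowest number of clusters whose best NS value stays
    within `margin` of the best NS value at the upper boundary.

    best_ns(ncl) is a hypothetical callable returning the maximum NS
    value over all repetitions for `ncl` clusters.
    """
    cache: Dict[int, float] = {}

    def ns(ncl: int) -> float:
        # Every model size is evaluated only once.
        if ncl not in cache:
            cache[ncl] = best_ns(ncl)
        return cache[ncl]

    # Main algorithm: halve the upper boundary while the smaller model
    # scores strictly better than the model twice its size.
    upper, ncl = ncl_max, ncl_max // 2
    while ncl >= 1 and ns(upper) - ns(ncl) < 0:
        upper, ncl = ncl, ncl // 2

    # Zoom (assumed here to be a bisection): the lower boundary 0 is never
    # evaluated, the upper boundary is acceptable by construction.
    threshold = ns(upper) - margin
    low, high = 0, upper
    while high - low > 1:
        mid = (low + high) // 2
        if ns(mid) >= threshold:
            high = mid   # acceptable: try fewer clusters
        else:
            low = mid    # too inaccurate: try more clusters
    return high


if __name__ == "__main__":
    # Synthetic NS curve that saturates with the number of clusters.
    print(cluster_finder(lambda ncl: 1.0 - 0.5 / ncl))
```

For the synthetic curve in the usage example, the search returns 18 clusters: the smallest model whose NS value lies within 0.02 of the value obtained with 64 clusters.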

IV. SIMULATION RESULTS

A. The EMSL Experiment

The sandy loam soil for which the parameters are given in Table I was used in a laboratory experiment set up by the nonvegetated terrain (NVT) workgroup of the European Microwave Signature Laboratory (EMSL) [16]. In this experiment, a cylindrical container having a diameter of 2 m and a height of 0.4 m was filled with this sandy loam soil. The moisture conditions of the soil were monitored by means of time domain reflectometry (TDR). The TDR probes were placed along three vertical transects at different distances from the center of the cylinder. The probes were positioned at the following depths: 2.5 cm, 5 cm, 10 cm, 15 cm and 25 cm. The cylinder was subjected to wetting and drying phases, during which the soil moisture profiles were monitored every 10 min. Two time series for one of the TDR probes were selected to validate the groundwater models. The first time series consists of one irrigation event on a dry soil followed by a drying period. The second time series consists of two irrigation events on a moderately wet soil, each followed by a drying period.

Fig. 3. Error surface for the Gath and Geva clustering algorithm (axes: relative moisture content upper cell (%), relative moisture content lower cell (%))

B. Validation of the Models

The Takagi-Sugeno models, with their number of clusters given in Table II, were used in the groundwater model. The groundwater models were then validated using the time series described above. The simulations were performed four times using combinations of different initial and boundary conditions. The boundary conditions examined at the bottom of the profile were a constant flux (CF) of zero and a constant moisture content (CM) of 0.155. On the one hand, the initial conditions used consist of an interpolation between the measured moisture contents for depths where no measurements are available. On the other hand, the soil profile was separated into homogeneous blocks consisting of the measured moisture contents. The averaged absolute difference between measurements and simulations was calculated to validate the simulation results. Tables III and IV show these results. A comparison between the simulation results of the fuzzy groundwater models and the numerical model developed by Hoeben and Troch [17] was performed, as well as a comparison between the results of the numerical model and the measurements. Since the fuzzy groundwater models try to approximate the solution of the Richards equation (1), which is used in the numerical model, the comparison between the numerical and the fuzzy groundwater model is more relevant.
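The validation measure used in Tables III and IV can be illustrated with a short Python sketch. The function name and the sample values are made up for this example, and the text does not state whether the average is taken over time steps, depths or both, so the sketch simply averages over all entries passed in.

```python
import numpy as np


def mean_absolute_difference(reference, simulated):
    """Average of |reference - simulated| over all entries."""
    reference = np.asarray(reference, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    if reference.shape != simulated.shape:
        raise ValueError("reference and simulated must have the same shape")
    return float(np.mean(np.abs(reference - simulated)))


if __name__ == "__main__":
    measured = [0.20, 0.30, 0.40]    # illustrative moisture contents
    modelled = [0.25, 0.30, 0.30]
    print(mean_absolute_difference(measured, modelled))
```

The same function applies to both comparisons: fuzzy model against measurements, and fuzzy model against the numerical model.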

Tables III and IV show that the fuzzy rule-based models identified using the Gath and Geva, the simplified Gath and Geva and the modified Gath and Geva clustering algorithms perform best, both compared to the measurements and to the numerical model. Of these three types of models, the model based on the simplified Gath and Geva clustering algorithm shows the best results compared to the numerical model and, together with the model identified using the modified Gath and Geva clustering algorithm, the best results compared to the measurements. This is in contrast to the results of Table II, where all models, except the model based on the fuzzy c-means algorithm, yield comparable results. The behaviour of the models based on the Gustafson-Kessel and certainly the simplified Gustafson-Kessel clustering algorithm is worse than what one would expect from the results listed in Table II. From the comparison with the measurements, one can also see that the numerical model performs slightly better than these fuzzy models.

TABLE III
COMPARISON BETWEEN THE DIFFERENT MODELS, THE MEASUREMENTS AND THE NUMERICAL MODEL FOR THE FIRST TIME SERIES

                        Interpolation         Homogeneous blocks
clustering algorithm    CF        CM          CF        CM
Comparison with measurements
FCM                     0.0908    0.0980      0.0859    0.1004
GK                      0.0343    0.0334      0.0494    0.0485
SGK                     0.0667    0.0587      0.0727    0.0595
GG                      0.0242    0.0244      0.0286    0.0280
SGG                     0.0223    0.0223      0.0238    0.0238
MGG                     0.0274    0.0274      0.0277    0.0277
numerical model         (values illegible in the source)
Comparison with numerical model
FCM                     0.1022    0.1077      0.0946    0.1097
GK                      0.0320    0.0525      0.0924    0.0916
SGK                     0.0625    0.0479      0.0660    0.0484
GG                      0.0152    0.0154      0.0131    0.0124
SGG                     0.0087    0.0087      0.0081    0.0081
MGG                     0.0110    0.0110      0.0106    0.0106

TABLE IV
COMPARISON BETWEEN THE DIFFERENT MODELS, THE MEASUREMENTS AND THE NUMERICAL MODEL FOR THE SECOND TIME SERIES

                        Interpolation         Homogeneous blocks
clustering algorithm    CF        CM          CF        CM
Comparison with measurements
FCM                     0.0438    0.0853      0.0401    0.0818
GK                      0.0271    0.0270      0.0272    0.0272
SGK                     0.0626    0.0899      0.0602    0.0857
GG                      0.0298    0.0367      0.0268    0.0341
SGG                     0.0248    0.0248      0.0238    0.0238
MGG                     0.0266    0.0266      0.0229    0.0229
numerical model         (values illegible in the source)
Comparison with numerical model
FCM                     0.0365    0.0780      0.0346    0.0762
GK                      0.0214    0.0234      0.0203    0.0226
SGK                     0.0554    0.0789      0.0547    0.0779
GG                      0.0149    0.0208      0.0115    0.0178
SGG                     0.0100    0.0101      0.0099    0.0099
MGG                     0.0132    0.0131      0.0132    0.0132

Fig. 4. Measurements (solid line) and simulation results for the fuzzy model based on the SGG clustering algorithm (dashed line) and the numerical model (dotted line) for the first time series (panels labeled by depth; horizontal axis: time [10 min])

The simulation results for the two time series for the fuzzy groundwater model based on the simplified Gath and Geva clustering algorithm are plotted in Figures 4 and 5. The boundary condition used was a constant moisture content, and the soil profile was divided into homogeneous blocks.

Fig. 5. Measurements (solid line) and simulation results for the fuzzy model based on the SGG clustering algorithm (dashed line) and the numerical model (dotted line) for the second time series (panels labeled by depth; horizontal axis: time [10 min])

V. CONCLUSIONS AND FURTHER RESEARCH

In this paper, different fuzzy clustering algorithms were used to identify Takagi-Sugeno models. In order to identify the optimal number of clusters, as a compromise between accuracy and complexity of the models, a heuristic algorithm was presented. The Takagi-Sugeno models, with the optimal number of clusters, were then incorporated into a groundwater model that simulates the moisture content in the unsaturated zone of the soil. A comparison of the different models under different initial and boundary conditions showed that the Takagi-Sugeno models identified using the Gath and Geva clustering algorithm and its variants yielded the best results. The Takagi-Sugeno model identified using the simplified Gath and Geva clustering algorithm yields the best results compared to the results of the numerical groundwater model of Hoeben and Troch [17].

In this paper, only multidimensional membership functions were used. Further research will be carried out to make a comparison with Takagi-Sugeno models whose multidimensional membership functions are projected onto the axes of the input variables. Secondly, more attention will be given to the selection of the training data. The region of higher moisture contents will be sampled more intensively.

REFERENCES

[1] A. Bárdossy, A. Bronstert, and B. Merz, “1-, 2- and 3-dimensional modeling of water movement in the unsaturated soil matrix using a fuzzy approach,” Adv. Water Resour., vol. 18, no. 4, pp. 237-251, 1995.

[2] R. Babuška, Fuzzy Modeling for Control, ser. International Series in Intelligent Technologies. Boston, USA: Kluwer Academic Publishers, 1998.

[3] J. Abonyi, R. Babuška, and F. Szeifert, “Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models,” IEEE Trans. on Systems, Man and Cybernetics, Part B, vol. 32, no. 5, pp. 612-621, 2002.

[4] A. Gomez-Skarmeta, M. Delgado, and M. Vila, “About the use of fuzzy clustering techniques for fuzzy model identification,” Fuzzy Sets and Systems, vol. 106, pp. 179-188, 1999.

[5] J. Sousa and U. Kaymak, Fuzzy Decision Making in Modeling and Control, ser. World Scientific Series in Robotics and Intelligent Systems. New Jersey: World-Scientific, 2002.

[6] H. Vernieuwe, B. De Baets, and N. Verhoest, “Simulation of one-dimensional water movement in the unsaturated zone by means of a first order Takagi-Sugeno model,” in Proc. FUZZ-IEEE, Honolulu, Hawaii, USA, 2002, pp. 1069-1074.

[7] J. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact, well separated clusters,” Journ. Cybern., vol. 3, pp. 95-104, 1974.

[8] D. Gustafson and W. Kessel, “Fuzzy clustering with a fuzzy covariance matrix,” in Proc. IEEE CDC, San Diego, CA, USA, 1979, pp. 761-766.

[9] I. Gath and A. Geva, “Unsupervised optimal fuzzy clustering,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 773-781, 1989.

[10] F. Hoeppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis. New York: John Wiley and Sons, 1999.

[11] L. Richards, “Capillary conduction of liquids through porous media,” Physics, vol. 1, pp. 318-333, 1931.

[12] M. Celia, E. Bouloutas, and R. Zarba, “A general mass-conservative numerical solution for the unsaturated flow equation,” Water Resour. Res., vol. 26, no. 7, pp. 1483-1496, 1990.

[13] M. van Genuchten, “A closed-form equation for predicting the hydraulic conductivity of unsaturated soils,” Soil Sci. Soc. Am. J., vol. 44, pp. 892-898, 1980.

[14] A. Bárdossy and L. Duckstein, Fuzzy Rule-Based Modeling with Applications to Geophysical, Biological and Engineering Systems. New York, USA: CRC Press, 1995.

[15] J. Nash and J. Sutcliffe, “River flow forecasting through conceptual models part I - a discussion of principles,” J. Hydrol., vol. 10, pp. 282-290, 1970.

[16] M. Mancini, R. Hoeben, and P. Troch, “Multifrequency radar observations of bare surface soil moisture content: A laboratory experiment,” Water Resour. Res., vol. 35, no. 6, pp. 1827-1838, 1999.

[17] R. Hoeben and P. Troch, “Assimilation of active microwave observation data for soil moisture profile estimation,” Water Resour. Res., vol. 36, no. 10, pp. 2805-2819, 2000.
