
Chapter 15

CLASSIFICATION USING MIXTURES OF DISTRIBUTIONS

Francis Galton (1822-1911). British scientist and explorer, inventor of the concept of regression. Galton traveled throughout Africa, where he made numerous geographical and climatological discoveries. A half-cousin of Darwin, he dedicated the later part of his life to testing the theory of evolution. He was a pioneer in the study of mixtures of normal distributions and is considered to be one of the outstanding scientists of the late 19th century.


15.1 BACKGROUND

In this chapter we return to the problem of cluster analysis in order to analyze the homogeneity of a sample and find groups if they exist, a problem which we studied from a descriptive point of view in Chapter 8. Here we assume that the data have been generated by a mixture of G unknown distributions, and we present three methods for dividing a heterogeneous sample into more homogeneous groups.

The first and oldest of the three methods is the k-means method (or G-means in our notation), which was presented in Chapter 8 as a heuristic algorithm that uses the distances between points to maximize a measure of homogeneity. In this chapter we will see that the criterion studied in Chapter 8 is optimal for a particular configuration of the data.

The second procedure is to estimate the parameters of the components of the mixture and then classify the observations into the groups according to the probabilities of their belonging to the different populations. This approach is essentially Bayesian and has been used in some computer programs under the assumption that the data come from a mixture of G normal populations. In this chapter we will look in detail at its application using the EM algorithm, as well as a fully Bayesian implementation using Gibbs sampling.

The third uses a projection pursuit approach and looks for directions onto which to project the points so as to separate the groups as much as possible. The classification of the observations into groups is then done by analyzing these univariate projections. We present a version of this approach which seems to work well in practice.

The procedures we set out in this chapter are related to the discriminant analysis which we saw in Chapter 13. In both cases we assume mixtures, and we wish to find criteria which allow us to assign new observations to the different populations. Nevertheless, in discriminant analysis we assume that the populations are known, or that we have a sample from each population (sometimes called a training sample) in which the observations are classified without error, so that the parameters of each distribution can be estimated. In cluster analysis we do not know the number of populations, nor do we have previously classified data, and all the information about the number of groups and their structure must be obtained from the available sample.

Modeling with mixtures of distributions originated in the work of Galton on mixtures of normal distributions and in that of Pearson, who was the first to use the method of moments to estimate them. Tukey introduced mixtures for the study of robustness and made important contributions to the study of heterogeneity.

15.2 THE K-MEANS METHOD FOR MIXTURES

In order to obtain criteria that we can apply to the case of cluster analysis, let us return to the problem of discriminating among G multivariate normal populations $N(\mu_g, V_g)$, when a training sample is available in which the source of the observations is known. Let $n_g$ be the number of elements of the sample which come from population g, where $g = 1, \ldots, G$, and $\sum n_g = n$. Applying the results from section 10.3, the likelihood of the sample will be, adding up the supports:

$$\log f(x_1, \ldots, x_n) = -\sum_{g=1}^{G}\frac{n_g}{2}\log|V_g| - \sum_{g=1}^{G}\frac{n_g}{2}\operatorname{tr}\left(V_g^{-1}S(\mu_g)\right)$$

where $S(\mu_g) = \frac{1}{n_g}\sum_{i=1}^{n_g}(x_i-\mu_g)(x_i-\mu_g)'$. According to this equation, the estimate of each vector of means, $\mu_g$, will be $\overline{x}_g$, the sample mean, and the support function concentrated in those parameters will be:

$$\log f(x_1, \ldots, x_n) = -\sum_{g=1}^{G}\frac{n_g}{2}\log|V_g| - \sum_{g=1}^{G}\frac{n_g}{2}\operatorname{tr}\left(V_g^{-1}S_g\right)$$

where

$$S_g = \frac{1}{n_g}\sum_{i=1}^{n_g}(x_i-\overline{x}_g)(x_i-\overline{x}_g)'.$$

Let us assume that we admit the hypothesis $V_g = \sigma^2 I$, that is, the variables are uncorrelated and have the same variance in all of the groups. Then the support function can be reduced to:

$$\log f(x_1, \ldots, x_n) = -\frac{np}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\operatorname{tr}\left(\sum_{g=1}^{G}n_g S_g\right)$$

and letting

$$W = \sum_{g=1}^{G}n_g S_g,$$

maximizing the likelihood is equivalent to:

$$\min\ \operatorname{tr}(W)$$

which is the trace criterion: minimizing the weighted sum of the estimated variances in each group. This criterion was obtained by another method in Chapter 8, and is the one used in the k-means algorithm. It has the advantage of being simple and easy to calculate, but it is not invariant to linear transformations and does not take correlations into account.

If we admit the hypothesis that all the covariance matrices are equal, $V_g = V$, the likelihood is equivalent to that of the classical discrimination problem studied in Chapter 13, which is given by:

$$\log f(x_1, \ldots, x_n) = -\frac{n}{2}\log|V| - \frac{1}{2}\operatorname{tr}\left(V^{-1}\sum_{g=1}^{G}n_g S_g\right)$$

and the ML estimate of $V$ is then

$$\widehat{V} = \frac{1}{n}\sum_{g=1}^{G}n_g S_g = \frac{1}{n}W,$$

and inserting this estimate into the likelihood function, maximizing the likelihood is equivalent to:

$$\min\ |W|$$

which is the determinant criterion, proposed by Friedman and Rubin (1967). This criterion is invariant to linear transformations and, as we will see, tends to identify elliptical groups.

In the more general case in which the populations have a different covariance matrix, the ML estimate of $V_g$ is $S_g$ and the maximum of the likelihood function is

$$\log f(x_1, \ldots, x_n) = -\frac{1}{2}\sum_{g=1}^{G} n_g\log|S_g| - \frac{np}{2}, \qquad (15.1)$$

and maximizing this likelihood is equivalent to:

$$\min\ \sum_{g=1}^{G} n_g\log|S_g| \qquad (15.2)$$

In other words, each group must have a minimum "volume". We assume that each group has $n_g > p+1$, so that $S_g$ is not singular, which requires that $n > G(p+1)$.

An additional criterion proposed by Friedman and Rubin (1967) is to start with the decomposition of the multivariate analysis of variance and maximize the size of the generalized Mahalanobis distance between groups given by $W^{-1}B$. Again, the size of this matrix can be measured by the trace or the determinant, but this last criterion has not provided good results in cluster analysis (see Seber, 1984).

Any of these criteria can be maximized with an algorithm similar to the k-means algorithm presented in Chapter 8. The determinant criterion is easy to use and, as proved in Appendix 15.1, it tends to produce elliptical groups, whereas the trace criterion produces spherical groups. Criterion (15.2) has the disadvantage that strong restrictions must be imposed on the number of observations in each group so that the matrices are not singular, because if the number of groups is large the number of parameters to be estimated can be very high. In practice, it seems better to allow some common features in the covariance matrices, and this is not easy to impose with this criterion. Moreover, if a group has few observations and $|S_g|$ is nearly singular, the weight of this group in the criterion will be very high and the algorithm tends to get trapped in this type of solution. For this reason, this criterion, while of theoretical interest, is not often used in practice.
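To make these criteria concrete, here is a minimal sketch (ours, not from the text) of how the three quantities could be evaluated for a given partition; it assumes a data matrix X with n rows and p columns and an integer vector labels assigning each observation to a group:

```python
import numpy as np

def within_group_scatter(X, labels):
    """W = sum_g n_g S_g for the partition defined by labels."""
    p = X.shape[1]
    W = np.zeros((p, p))
    for g in np.unique(labels):
        Xg = X[labels == g]
        Xc = Xg - Xg.mean(axis=0)
        W += Xc.T @ Xc                       # contributes n_g * S_g
    return W

def partition_criteria(X, labels):
    """Trace criterion, determinant criterion and criterion (15.2)."""
    W = within_group_scatter(X, labels)
    crit_trace = np.trace(W)                 # min tr(W)
    crit_det = np.linalg.det(W)              # min |W|
    crit_15_2 = 0.0                          # min sum_g n_g log|S_g|
    for g in np.unique(labels):
        Xg = X[labels == g]
        Sg = np.cov(Xg, rowvar=False, bias=True)   # ML estimate of S_g
        crit_15_2 += len(Xg) * np.log(np.linalg.det(Sg))
    return crit_trace, crit_det, crit_15_2
```

A k-means-type algorithm would move observations between groups whenever the move reduces the chosen criterion.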

15.2.1 Number of groups

Generally, the number of groups, G, is unknown and is estimated from the data by applying the algorithm for different values G = 1, 2, ... and choosing the best result. Comparing the solutions obtained is not simple, because any of the criteria will decrease if we increase the number of groups. According to the multivariate analysis of variance, the total variability can be decomposed as:

$$T = W + B \qquad (15.3)$$

Intuitively, the objective of the division into groups is to make B, the variability between groups, as large as possible, and to make W, the variability within each group, as small as possible. Taking any given division into groups, we can choose one of the groups and apply this decomposition to it again to reduce the variability further. Therefore, we cannot use a criterion based on the size of W to compare solutions with different numbers of groups, since we can always make W smaller by creating more groups.

            G = 2    G = 3    G = 4    G = 5    G = 6
  lem         30       20       14       15       14
  lew         35       22       13       16       12
  imr        509      230      129       76       83
  mr          15       11        9        9        9
  br          64       58       37       35       26
  Total      653      341      202      151      144
  H         77.4     61.5     30.4      6.2
  CH       265.9    296.3    356.6    359.7      302

Table 15.1: The average variance within the groups for each variable with a different number of groups with the k-means algorithm.

As we saw in Chapter 8, we can carry out an approximate F-test by calculating the proportional reduction in variability that is obtained by adding an additional group. The test statistic is:

$$H = \frac{\operatorname{tr}(W_G) - \operatorname{tr}(W_{G+1})}{\operatorname{tr}(W_{G+1})/(n-G-1)} \qquad (15.4)$$

and, under the hypothesis that G groups are sufficient, the value of H can be compared with an F distribution with p and p(n−G−1) degrees of freedom. The stopping rule proposed by Hartigan (1975), used in some computer programs, is to continue dividing the data set as long as this quotient, H, is greater than 10.

An additional criterion, which tends to work better than the above (see Milligan and Cooper, 1985), is the one proposed by Calinski and Harabasz (1974). This criterion starts with the decomposition (15.3) and chooses the value of G maximizing

$$CH = \max_G \frac{\operatorname{tr}(B)/(G-1)}{\operatorname{tr}(W)/(n-G)} \qquad (15.5)$$
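As an illustration, the following sketch (ours; it assumes that partitions of the same data matrix X into G and G + 1 groups have already been obtained, for example with a k-means routine) computes the two statistics from the within- and between-group variability:

```python
import numpy as np

def tr_W(X, labels):
    """Trace of the within-group scatter matrix W for a given partition."""
    total = 0.0
    for g in np.unique(labels):
        Xg = X[labels == g]
        total += ((Xg - Xg.mean(axis=0)) ** 2).sum()
    return total

def hartigan_H(X, labels_G, labels_G1):
    """Statistic (15.4) comparing the partitions with G and G + 1 groups."""
    n = X.shape[0]
    G = len(np.unique(labels_G))
    return (tr_W(X, labels_G) - tr_W(X, labels_G1)) / (tr_W(X, labels_G1) / (n - G - 1))

def calinski_harabasz(X, labels):
    """Statistic (15.5) for a partition with G groups, using T = W + B."""
    n, G = X.shape[0], len(np.unique(labels))
    tr_T = ((X - X.mean(axis=0)) ** 2).sum()
    tr_w = tr_W(X, labels)
    tr_B = tr_T - tr_w                        # decomposition (15.3)
    return (tr_B / (G - 1)) / (tr_w / (n - G))
```

Hartigan's rule keeps adding groups while H is greater than 10; the Calinski-Harabasz rule chooses the G with the largest CH.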

In addition to these two criteria, an alternative is to select the number of clusters by a model selection criterion, such as the AIC or BIC. Tibshirani et al. (2001) proposed additional criteria.

Example: We are going to compare the two criteria, H and CH, for selecting the number of groups with the k-means algorithm for the country data. We will use only the 5 demographic variables from MUNDODES. We start with the results from the SPSS program. To decide the number of groups, this program gives us a table with the variances of each variable within the groups. Table 15.1 summarizes this information.

The rows of this table show the average variance of each variable within the groups. These variances are obtained by calculating, for each variable, the sum of squares of the deviations from the corresponding group mean over all of the groups, divided by the number of degrees of freedom of the sum, which is n−G. The columns show how this sum varies when the number of groups is increased. As long as this sum diminishes, it is advisable to continue with the subdivision, especially if the reduction in variance is large. If we add by columns we obtain the sum of these variances, which is the trace of W divided by n−G. Next we have the H and CH statistics, defined in the previous section, to determine the number of groups. In this example both criteria lead to five groups.

To illustrate the standard output of a computer program, the following tables show the results of the SPSS program for five groups. The program provides, in the first table, the centers of the groups defined by the coordinates of each variable. The second table contains a decomposition of the variability of each variable. The column Cluster MS is the sum of squares between the groups for each variable divided by its degrees of freedom, G−1 = 4. In the Analysis of Variance table we also have the variances of each variable within the groups, in the column Error MS. A rounded copy of this column is found in Table 15.1, which contains the within-group variances of the variables.

Centers:      LEM       LEW       IMR        MR        BR
  1         64.475    68.843    37.575     7.762    29.868
  2         43.166    46.033   143.400    20.022    46.577
  3         70.122    76.640    11.288     8.920    15.017
  4         57.342    60.900    74.578    10.194    39.057
  5         51.816    54.458   110.558    13.875    43.008

Analysis of Variance
  Variable   Cluster MS    DF    Error MS    DF        F        Prob
  LEM         1805.3459     4     15.059    86.0   119.8792     .000
  LEW         2443.0903     4     16.022    86.0   152.4812     .000
  IMR        46595.0380     4     76.410    86.0   609.7986     .000
  MR           289.2008     4      9.507    86.0    30.4191     .000
  BR          3473.4156     4     34.840    86.0    99.6950     .000

Table 15.1 shows that the variable imr has much more variance than the rest (see the third row) and, therefore, it will have a decisive weight in the formation of the groups, which will be built principally from the values of this variable. The table suggests that the number of groups is five, since the Total row decreases sharply each time a new group is introduced, until a total of five groups is reached. When the number is increased to six, the decrease in the variance is very small.

To illustrate the calculation of the H and CH statistics shown in Table 15.1 for determining the number of groups, we let MS(G) denote the row of totals in this table, which is equal to $\operatorname{tr}(W_G)/(n-G)$. The H statistic is calculated as

$$H = \frac{\operatorname{tr}(W_G) - \operatorname{tr}(W_{G+1})}{\operatorname{tr}(W_{G+1})/(n-G-1)} = \frac{(n-G)\,MS(G) - (n-G-1)\,MS(G+1)}{MS(G+1)}$$

where n = 91 and G is the number of groups indicated by the columns. In this way we obtain the row of H values and, according to Hartigan's criterion, we choose five groups. To calculate the CH statistic given by (15.5), its numerator is the sum of the Cluster MS terms over all the variables and its denominator is the sum of the Error MS column. For G = 5 the CH criterion is

$$CH = \frac{1805.34 + \cdots + 3473.42}{15.06 + \cdots + 34.84} = \frac{54606}{151.8} = 359.7$$

and the application of this criterion leads to five groups as well.

Figure 15.1: Histogram of the infant mortality rate variable, indicating the presence of between four and five country groupings.

Comparing the vectors of means of the solution with five groups, we see that the group with the lowest infant mortality rate is three, which includes the countries of Europe, except for Albania, and that with the highest mortality rate is two, which contains the poorest African countries. Figure 15.1 presents a histogram of the infant mortality rate variable, imr. We can see that this variable, which as we have seen has a dominant weight in the formation of the groups, clearly indicates the heterogeneity of the sample.

To illustrate how the different computer programs work, the following tables show the results of the MINITAB program for five groups, with standardized and unstandardized variables.

A. MUNDODES results, unstandardized variables (MINITAB)

            Number of      Within cluster     Average distance    Maximum distance
            observations   sum of squares     from centroid       from centroid
Cluster1        21           10060.590            19.308              57.601
Cluster2        14             797.147             7.200              10.897
Cluster3        28             829.039             5.005              10.008
Cluster4         9             826.444             8.724              15.306
Cluster5        19            2713.755            11.338              19.143


Cluster Centroids

Variable   Cluster1   Cluster2   Cluster3   Cluster4   Cluster5
br           4.4762     2.7857     2.7143     3.3333     4.3684
mr          44.5381    22.8571    13.4429    34.1222    39.0579
imr         16.5095     6.4714     9.4250     9.1000    10.1947
lem        124.6333    23.5500     9.1143    45.7111    74.5789
lew         48.1095    67.3643    70.7464    62.4333    57.3421

B. MUNDODES results, standardized variables (MINITAB)

            Number of      Within cluster     Average distance    Maximum distance
            observations   sum of squares     from centroid       from centroid
Cluster1        20            14.440               0.817               1.275
Cluster2        10             9.932               0.736               2.703
Cluster3        29            20.771               0.792               1.535
Cluster4        22            32.443               1.134               2.132
Cluster5        10             6.679               0.727               1.621

Cluster Centroids

Variable   Cluster1   Cluster2   Cluster3   Cluster4   Cluster5
br           0.6955    -1.7841    -0.1214     0.9070    -1.2501
mr           0.6007    -0.9921    -0.9665     1.2233    -0.0978
imr         -0.3585     0.3087    -0.5867     1.3417    -0.8421
lem          0.3300    -0.7676    -0.9455     1.3758    -0.1771
lew         -0.2078     0.5478     0.9537    -1.4424     0.2754

The MINITAB program provides the sum of squares within the groups by clusters instead of by variables. The results for unstandardized variables are similar but not identical, as can be seen by comparing the means of the variables in the groups. Standardizing the variables changes the results substantially, since variables that previously had little influence now weigh as much as the rest. The groups are now more homogeneous by continent and, within Europe, occidental countries are separated from oriental ones.

This example shows the need to try different solutions and, if possible, different computer programs, since the implementation of the algorithms is not the same in different programs.

15.3 ESTIMATION OF MIXTURES OF NORMAL DISTRIBUTIONS

A natural approach to carrying out the subdivision of the sample into groups or clusters is to assume that the data were generated by a mixture of multivariate normal distributions and to estimate jointly the parameters of the distributions which form the mixture and the posterior probabilities that each observation belongs to each of the components of the mixture. We now look at this approach.

15.3.1 The equations of maximum likelihood for the mixture

We assume that the data come from a mixture of distributions:

$$f(x) = \sum_{g=1}^{G}\pi_g f_g(x),$$

the likelihood function will be

$$\ell(\theta|X) = \prod_{i=1}^{n}\left(\sum_{g=1}^{G}\pi_g f_g(x_i)\right)$$

and can be stated as the sum of $G^n$ terms corresponding to all of the possible classifications of the n observations among the G groups. The support function of the sample is

$$L(\theta|X) = \sum_{i=1}^{n}\log f(x_i) = \sum_{i=1}^{n}\log\sum_{g=1}^{G}\pi_g f_g(x_i) \qquad (15.6)$$

Let us assume that each $f_g(x)$ is a p-dimensional normal density with vector of means $\mu_g$ and covariance matrix $V_g$, so that $\theta = (\pi_1, \ldots, \pi_G, \mu_1, \ldots, \mu_G, V_1, \ldots, V_G)$. Substituting these densities in this equation, the support function becomes

$$L(\theta|X) = \sum_{i=1}^{n}\log\left(\sum_{g=1}^{G}\pi_g |V_g|^{-1/2}(2\pi)^{-p/2}\exp\left(-\frac{1}{2}(x_i-\mu_g)'V_g^{-1}(x_i-\mu_g)\right)\right). \qquad (15.7)$$

We see that if, in this function, we make $\widehat{\mu}_g = x_i$, the estimate of $V_g$ is zero and, if $\pi_g \neq 0$, the quotient $\pi_g|V_g|^{-1/2}$ tends toward infinity, as does the support function. Therefore this function has many maxima, linked to solutions where each density is determined precisely by one observation. In order to avoid these singularities we will assume that there are at least p observations from each distribution and look for a local maximum of this function which provides a consistent estimator of the parameters.

An additional problem with this likelihood function is that the normal distributions are not identified, since the ordering 1, ..., G is arbitrary. To solve this problem we can assume that the distributions 1, ..., G correspond to $\pi_1 \geq \pi_2 \geq \ldots \geq \pi_G$, or define the order of the distributions by way of a measure of the size of the mean or of the covariance matrix.

To maximize this function with respect to the probabilities $\pi_g$, it must be taken into account that $\sum_{g=1}^{G}\pi_g = 1$. Introducing this restriction with a Lagrange multiplier in (15.6), the function to be maximized is

$$L(\theta|X) = \sum_{i=1}^{n}\log\sum_{g=1}^{G}\pi_g f_g(x_i) - \lambda\left(\sum_{g=1}^{G}\pi_g - 1\right). \qquad (15.8)$$


Taking the derivative of this function with respect to the probabilities:

$$\frac{\partial L(\theta|X)}{\partial \pi_g} = \sum_{i=1}^{n}\frac{f_g(x_i)}{\sum_{g=1}^{G}\pi_g f_g(x_i)} - \lambda = 0$$

and multiplying by $\pi_g$ (assuming that $\pi_g \neq 0$, as otherwise the model g is redundant), we write

$$\lambda\pi_g = \sum_{i=1}^{n}\pi_{ig} \qquad (15.9)$$

where we let $\pi_{ig}$ be:

$$\pi_{ig} = \frac{\pi_g f_g(x_i)}{\sum_{g=1}^{G}\pi_g f_g(x_i)} \qquad (15.10)$$

The quotient $\pi_{ig}$ denotes the probability that, once observed, the data point $x_i$ has been generated by the normal $f_g(x)$. These probabilities are called posterior probabilities and are calculated by Bayes' theorem. Their interpretation is as follows. Before observing $x_i$, the probability that any observation, and specifically $x_i$, comes from class g is $\pi_g$. Nevertheless, after observing $x_i$, this probability is modified according to how compatible the observed value is with model g. This compatibility is measured by $f_g(x_i)$: if this value is relatively high, it increases the probability that the observation comes from model g. Naturally, for each data point, $\sum_{g=1}^{G}\pi_{ig} = 1$.
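As a small numerical sketch (ours), the posterior probabilities (15.10) can be computed for given parameters of a normal mixture as follows; weights, means and covs are assumed to hold the $\pi_g$, $\mu_g$ and $V_g$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_probabilities(X, weights, means, covs):
    """Return the n x G matrix of posterior probabilities pi_ig of eq. (15.10)."""
    n, G = X.shape[0], len(weights)
    dens = np.zeros((n, G))
    for g in range(G):
        # pi_g * f_g(x_i) for every observation
        dens[:, g] = weights[g] * multivariate_normal.pdf(X, mean=means[g], cov=covs[g])
    return dens / dens.sum(axis=1, keepdims=True)   # normalize over the G groups
```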

In order to determine the value of λ, summing (15.9) over all the groups gives

$$\lambda = \sum_{i=1}^{n}\sum_{g=1}^{G}\pi_{ig} = n$$

and, substituting into (15.9), the equations for estimating the prior probabilities are

$$\widehat{\pi}_g = \frac{1}{n}\sum_{i=1}^{n}\pi_{ig}, \qquad (15.11)$$

which gives the prior probabilities as an average of the posterior probabilities.

Now we are going to calculate the parameters of the normal distributions. Differentiating the support function with respect to the means:

$$\frac{\partial L(\theta|X)}{\partial \mu_g} = \sum_{i=1}^{n}\frac{\pi_g f_g(x_i)\,V_g^{-1}(x_i-\mu_g)}{\sum_{g=1}^{G}\pi_g f_g(x_i)} = 0, \qquad g = 1, \ldots, G,$$

which can be written as

$$\widehat{\mu}_g = \sum_{i=1}^{n}\frac{\pi_{ig}}{\sum_{i=1}^{n}\pi_{ig}}\,x_i. \qquad (15.12)$$

The mean of each distribution is estimated as a weighted mean of all the observations with weights $\omega_{ig} = \pi_{ig}/\sum_{i=1}^{n}\pi_{ig}$, where $\omega_{ig} \geq 0$ and $\sum_{i=1}^{n}\omega_{ig} = 1$. The weights $\omega_{ig}$ denote the relative probability that observation i belongs to population g. Analogously, differentiating with respect to $V_g$ and using the results from section 10.2, we find that

$$\widehat{V}_g = \sum_{i=1}^{n}\frac{\pi_{ig}}{\sum_{i=1}^{n}\pi_{ig}}\,(x_i-\widehat{\mu}_g)(x_i-\widehat{\mu}_g)' \qquad (15.13)$$


which has a similar interpretation, as an average of the deviations of the data with respect to their means, with weights proportional to those of the posterior probabilities.

In order to solve equations (15.11), (15.12) and (15.13) and obtain the estimators we need the probabilities $\pi_{ig}$, and, to calculate these probabilities with (15.10), we need the parameters of the model. Intuitively we could iterate between both steps, which is the solution provided by the EM algorithm.

15.3.2 Solution using the EM algorithm

In order to apply the EM algorithm we transform the problem by introducing a set of unobserved vector variables $(z_1, \ldots, z_n)$ whose function is to indicate which model each observation comes from. With this objective, $z_i$ is a $G\times 1$ binary vector variable which has only one component equal to one, the one corresponding to the group that $x_i$ comes from, and all the rest equal to zero. For example, $x_i$ comes from population 1 if $z_{i1} = 1$ and $z_{i2} = z_{i3} = \ldots = z_{iG} = 0$. We can verify that $\sum_{g=1}^{G} z_{ig} = 1$ and $\sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig} = n$.

With these new variables, the density function of xi conditional on zi can be written

$$f(x_i|z_i) = \prod_{g=1}^{G} f_g(x_i)^{z_{ig}}. \qquad (15.14)$$

Note that in $z_i$ only one component $z_{ig}$ is different from zero, and this component defines what the density function of the observation is. Analogously, the probability function of the variable $z_i$ will be

$$p(z_i) = \prod_{g=1}^{G}\pi_g^{z_{ig}}. \qquad (15.15)$$

On the other hand, the joint density function is

$$f(x_i, z_i) = f(x_i|z_i)\,p(z_i),$$

which, by (15.14) and (15.15), we can write as

$$f(x_i, z_i) = \prod_{g=1}^{G}\left(\pi_g f_g(x_i)\right)^{z_{ig}}$$

The joint likelihood function is

$$L_C(\theta|X,Z) = \sum_{i=1}^{n}\log f(x_i, z_i) = \sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig}\log\pi_g + \sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig}\log f_g(x_i) \qquad (15.16)$$

If the variables $z_{ig}$ that define which population each observation comes from were known, the estimation of the parameters would follow immediately, as was discussed in the problem of discriminant analysis. The mean of each component is estimated as an average of the observations generated by that component, which can be written as

$$\widehat{\mu}_g = \frac{\sum_{i=1}^{n} z_{ig}\,x_i}{\sum_{i=1}^{n} z_{ig}},$$


and the covariance matrix of each group is calculated taking into account only the observations of this group, by means of

$$\widehat{V}_g = \frac{\sum_{i=1}^{n} z_{ig}\,(x_i-\overline{x}_g)(x_i-\overline{x}_g)'}{\sum_{i=1}^{n} z_{ig}}.$$

Nevertheless, the problem now is that the classification variables are not known. The solution provided by the EM algorithm is to estimate the variables $z_{ig}$ by means of the posterior probabilities, and then use these formulas.

The EM algorithm begins with an initial estimate $\widehat{\theta}^{(0)}$ of the vector of parameters $\theta = (\mu_1, \ldots, \mu_G, V_1, \ldots, V_G, \pi_1, \ldots, \pi_G)$. In the E step we calculate the expected value of the missing values in the complete likelihood (15.16), conditional on the initial parameters and on the observed data. Since the likelihood is linear in $z_{ig}$, this is equivalent to substituting the missing variables by their expectations. The missing variables $z_{ig}$ are binary with values 0 and 1, and

$$E(z_{ig}|X,\widehat{\theta}^{(0)}) = p(z_{ig}=1|X,\widehat{\theta}^{(0)}) = p(z_{ig}=1|x_i,\widehat{\theta}^{(0)}) = \widehat{\pi}_{ig}^{(0)}$$

where $\widehat{\pi}_{ig}^{(0)}$ is the probability that observation $x_i$ comes from model g once $x_i$ has been observed and the parameters of the models are those given by $\widehat{\theta}^{(0)}$. These are the posterior probabilities calculated with (15.10), using the values of the parameters specified in $\widehat{\theta}^{(0)}$. By substituting the missing variables by their expectations we obtain

$$L_C^{*}(\theta|X) = \sum_{i=1}^{n}\sum_{g=1}^{G}\widehat{\pi}_{ig}^{(0)}\log\pi_g + \sum_{i=1}^{n}\sum_{g=1}^{G}\widehat{\pi}_{ig}^{(0)}\log f_g(x_i)$$

In the M step this function is maximized with respect to the parameters $\theta$. We see that the parameters $\pi_g$ appear only in the first term and those of the normal distributions only in the second, so they can be obtained independently. Starting with the $\pi_g$, these parameters are subject to the restriction that their sum must be one, so the function to be maximized is

$$\sum_{i=1}^{n}\sum_{g=1}^{G}\widehat{\pi}_{ig}^{(0)}\log\pi_g - \lambda\left(\sum_{g=1}^{G}\pi_g - 1\right)$$

which leads to (15.11) with the values $\pi_{ig}$ now fixed at $\widehat{\pi}_{ig}^{(0)}$. In order to obtain the estimators of the parameters of the normal distributions, differentiating the second term we obtain equations (15.12) and (15.13), where the probabilities $\pi_{ig}$ are now equal to $\widehat{\pi}_{ig}^{(0)}$. The solution of these equations gives a new vector of parameters, $\widehat{\theta}^{(1)}$, and the algorithm is iterated until convergence is reached. To summarize, the algorithm is:

1. Start from a value $\widehat{\theta}^{(0)}$ and calculate $\widehat{\pi}_{ig}^{(0)}$ with (15.10).

2. Solve (15.11), (15.12) and (15.13) to obtain $\widehat{\theta} = (\widehat{\pi}, \widehat{\mu}, \widehat{V})$.

3. With this new value, go back to steps 1 and 2 and iterate until convergence is reached, as in the sketch below.
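A compact sketch of this iteration is given below. It is ours, not the book's implementation: it uses a fixed number of iterations instead of a formal convergence test, a crude random initialization instead of the hierarchical one discussed in the next section, and a small ridge term to avoid the singularities mentioned above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_normal_mixture(X, G, n_iter=200, ridge=1e-6, seed=0):
    """EM for a G-component multivariate normal mixture.
    Returns the weights, means, covariance matrices and posterior probabilities."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(G, 1.0 / G)                       # initial prior probabilities
    mu = X[rng.choice(n, size=G, replace=False)]   # G observations as initial means
    V = np.array([np.cov(X, rowvar=False) for _ in range(G)])
    for _ in range(n_iter):
        # E step: posterior probabilities, eq. (15.10)
        dens = np.column_stack([
            pi[g] * multivariate_normal.pdf(X, mean=mu[g], cov=V[g]) for g in range(G)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: equations (15.11), (15.12) and (15.13)
        ng = resp.sum(axis=0)
        pi = ng / n
        mu = (resp.T @ X) / ng[:, None]
        for g in range(G):
            Xc = X - mu[g]
            V[g] = (resp[:, g][:, None] * Xc).T @ Xc / ng[g] + ridge * np.eye(p)
    return pi, mu, V, resp
```

Each observation can then be assigned to the group with the largest posterior probability in resp.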


15.3.3 Application to cluster analysis

Different implementations of mixtures of normal distributions have been proposed for solving clustering problems. Banfield and Raftery (1993) and Dasgupta and Raftery (1998) have designed a method based on mixtures of normal distributions and an algorithm, MCLUST, which works well in practice. The basis of the procedure is to begin the EM algorithm with an initial estimation obtained using a hierarchical analysis and to reparameterize the covariance matrices so that they have common parts and specific parts. A summary of the procedure is:

1. Choose a value M for the maximum number of groups.

2. Estimate the parameters of the mixture with the EM algorithm for G = 1, ..., M. The initial conditions of the algorithm are established using one of the hierarchical methods studied in Chapter 9; the estimation is carried out for all the possible conditions on the covariance matrix, which will be explained in the next section. If r conditions are considered, a total of M × r estimations are carried out with the EM algorithm.

3. Select the most convenient number of groups and conditions for the covariance matrices by looking for the solution which minimizes the BIC criterion.

We are now going to analyze each of these steps.

Restrictions on the covariance matrix

The strongest hypothesis is that the covariance matrices in all of the groups are identical and of the form $\sigma^2 I$. This adds only one parameter to the estimation. A less restrictive hypothesis is to assume that the covariance matrices are equal in all of the groups, which results in $p(p+1)/2$ parameters. A third possibility is to allow all of the matrices to have a part in common and a part which is distinct. Finally, the most general condition is that all of the matrices are different, without establishing any restriction on them. In this case the estimation of the matrices requires $Gp(p+1)/2$ parameters, so this condition can only be considered if the sample size is large in comparison with $Gp^2$. Otherwise, assuming that all of the matrices are different can imply a huge number of parameters, which can make the estimation with the EM algorithm extremely slow and even impracticable.

A middle ground was proposed by Banfield and Raftery (1993), and involves parameterizing the covariance matrices by their spectral decomposition

$$V_g = \lambda_g C_g A_g C_g'$$

where $C_g$ is an orthogonal matrix with the eigenvectors of $V_g$ and $\lambda_g A_g$ is the matrix of eigenvalues, with the scalar $\lambda_g$ being the largest eigenvalue of the matrix. Remember that the eigenvectors of the matrix indicate the directions, whereas the eigenvalues indicate size, in the sense of the volume that the group occupies in space. In this way we can let the directions of some groups, or the sizes of others, be different. For example, we can assume that the size is the same but that the directions are different; then $V_g = \lambda C_g A C_g'$.
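A small sketch (ours) of how a single covariance matrix is decomposed into the size, shape and orientation components of this parameterization:

```python
import numpy as np

def spectral_parameterization(Vg):
    """Decompose Vg as lambda_g * Cg @ Ag @ Cg.T, where lambda_g is the largest
    eigenvalue and Ag is diagonal with its largest entry equal to one."""
    eigvals, eigvecs = np.linalg.eigh(Vg)   # eigenvalues in ascending order
    lam = eigvals[-1]                       # size (volume) of the group
    A = np.diag(eigvals / lam)              # shape: normalized eigenvalues
    C = eigvecs                             # orientation: the eigenvectors
    return lam, C, A                        # Vg is recovered as lam * C @ A @ C.T
```

Restricting some of these components to be common across groups (same size, same shape, same orientation, or any combination) gives the intermediate models between $\sigma^2 I$ and completely free covariance matrices.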

There is a relationship between the number of groups we consider and the complexity of the covariance matrices required. If we allow many groups, we can obtain good results by imposing the restriction that the covariance matrices be equal, or even that their form be $\sigma^2 I$. On the other hand, with few groups we usually have to give the covariance matrices enough freedom to allow a good fit of the model to the data. For this reason, the conditions on the covariance matrices are decided together with the number of groups, using the BIC criterion as we will see next.

Number of Groups

The criterion for selecting the number of groups is to minimize the BIC. We saw in section 11.6 that the Schwarz criterion approximates the posterior probabilities of each model. Substituting the mixture of normal distributions into the likelihood equation and eliminating the constants, the BIC criterion in this case is equivalent to:

$$BIC = \min\ \sum_{g=1}^{G} n_g\log|S_g| + n(p,G)\ln n$$

where $n(p,G)$ is the number of parameters in the model. It should be pointed out that although this criterion appears to work well in practice for choosing the number of groups and the conditions of the models, the regularity hypotheses required to obtain the BIC as an approximation of the posterior probability are not verified in the case of mixtures. Thus, this criterion should only be used as a guide, not as an automatic rule. The AIC criterion is

$$AIC = \min\ \sum_{g=1}^{G} n_g\log|S_g| + 2\,n(p,G)$$
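A sketch (ours) of both criteria as written above, for a given classification of the observations; the number of parameters n(p, G), which depends on the covariance restriction chosen, must be supplied by the user, and the AIC penalty is taken here as 2n(p, G):

```python
import numpy as np

def bic_aic_partition(X, labels, n_params):
    """Compute sum_g n_g log|S_g| plus the BIC and AIC penalties."""
    n = X.shape[0]
    fit_term = 0.0
    for g in np.unique(labels):
        Xg = X[labels == g]
        Sg = np.cov(Xg, rowvar=False, bias=True)     # ML covariance of group g
        fit_term += len(Xg) * np.log(np.linalg.det(Sg))
    bic = fit_term + n_params * np.log(n)
    aic = fit_term + 2 * n_params
    return bic, aic
```

The combination of number of groups and covariance restriction with the smallest BIC would then be selected.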

In practice, the restrictions on the covariance matrices and the number of groups are selected at the same time, as seen in the following example.

We are going to use the Ruspini data (Ruspini.dat file) to illustrate how the MCLUST program, which is incorporated into S-plus, works. The Ruspini data were generated to illustrate clustering techniques and have the advantage of being bidimensional, which lets us demonstrate graphically how the algorithms work. We assume that the number of groups is between 1 and 7.

The operation of the MCLUST program is summarized in Figure 15.2. The y-axis shows the -BIC, so that the optimum corresponds to the highest value of the function, and the x-axis displays the number of groups. Each line represents a different restriction on the estimated covariance matrices: line 1 corresponds to the EI criterion, which assumes that all the covariance matrices of the groups are $\sigma^2 I$, and line 4 corresponds to the VVV criterion, which does not impose any restriction on the matrices. The rest of the lines correspond to other restrictions intermediate between these two extreme cases. The graph indicates that the maximum value of the -BIC is obtained with five groups and method one, that is, identical variance for each variable in all the groups and all covariances null.

The best solution obtained is:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

which indicates the observations consecutively and the label of the group to which they belong. For example, observations 1 to 20 are in group 1 and the last fifteen are in group 5.

Figure 15.2: Illustration of the selection of the model using the MCLUST algorithm.

As a measure of the homogeneity of the classification, for each observation we take the probability that it belongs to the group where it has been classified, subtract this value from one to obtain its error probability of classification, and study the distribution of these error probabilities. The result is:

uncertainty (quantiles):

      0%              25%              50%              75%            100%
       0     2.442491e-015    1.510791e-012    1.392159e-006      0.1172978

The conclusion is that the maximum error probability is 0.11 and that, for 75% of the observations, this probability is less than or equal to 0.000001. The values of the -BIC criterion are shown below, where EI denotes the criterion (equivalent to criterion 1, matrices $\sigma^2 I$) and 5 the number of groups.

best BIC values:
      EI,5        EI,4       EEE,5
 -1344.212   -1351.412    -1348.81

best model: uniform spherical

The best solution is made up of 5 groups and assumes that the covariance matrices are spherical and equal. To see how MCLUST obtains this solution, we present the two steps of the algorithm for the best solution. These same two steps have been applied to obtain the 42 solutions shown in the figure, one for each criterion on the covariance matrices (six criteria) and number of groups (from one to seven).


First step: To begin the EM algorithm we start with the solution of a hierarchical method, which leads to the classification indicated in Figure 15.3 for 5 groups. This is the initial solution for the EM algorithm.

Figure 15.3: Solution of the first step of the algorithm.

Second step: With this initial classification, we apply the EM algorithm until convergence of the parameters is reached. The final solution is shown in Figure 15.4. We can see that with these data the first step carries out most of the work, as in the second graph only two observations have been reclassified as a result of applying the EM algorithm.

Figure 15.4: Final solution of the MCLUST algorithm.

We are going to present the solution provided by MCLUST for the Diday data, found in the diday.dat file. The solution is shown in Figure 15.5. We can see that the best solution is again obtained with identity matrices, but now with three groups. Figure 15.6 indicates the solution obtained using a hierarchical cluster method when three groups are assumed, and Figure 15.7 shows the final solution given by the EM algorithm starting from the previous initial values. Again, the final solution is similar to the initial one, although the EM algorithm reclassifies several observations and provides a better solution.

Figure 15.5: MCLUST solution for the Diday data.

Figure 15.6: Solution of the first part for the Diday data.

Figure 15.7: Final solution for the Diday data.

15.4 BAYESIAN METHODS

15.4.1 Bayesian Estimation of Mixtures of Normal Distributions

The Bayesian approach can be applied to the estimation of mixtures of distributions. We saw in section 15.3 that the likelihood of a mixture contains $G^n$ terms, corresponding to the possible assignments of the n observations of the sample to the G possible populations. Multiplying by the prior, $p(\theta)$, the posterior will also have $G^n$ terms and, except for very small n, it is unmanageable.

The introduction of missing variables allowed us to solve the estimation using the EM algorithm. The same approach provides a quick solution to the problem using Gibbs sampling. Introducing the unobserved variables $z_i$ we have that, given $z_i$, the density of $x_i$ is multivariate normal, with parameters determined by the component of $z_i$ which is equal to one. We write

$$f(x_i|z_i,\theta) \sim N_p\left(\prod_{g=1}^{G}\mu_g^{z_{ig}},\ \prod_{g=1}^{G}V_g^{z_{ig}}\right)$$

On the other hand, the variable $z_i$ has a multinomial distribution with parameters

$$f(z_i|\theta) \sim M_G(1;\pi_{i1}, \ldots, \pi_{iG}).$$




These two functions determine the likelihood of the sample

$$\ell(\theta|X,Z) = \prod_{i=1}^{n} f(x_i|z_i,\theta)\,f(z_i|\theta)$$

which will be the product of 2n terms: n normal densities with parameters determined by the components of $z_i$ different from zero, and n coefficients $\pi_{ig}$, also determined by the $z_i$.

The problems of singularity indicated when we studied the likelihood of mixtures of distributions are accentuated if we take improper prior distributions. Therefore, in the estimation of mixtures it is advisable to take proper prior distributions. Choosing conjugate distributions, we take a Dirichlet distribution for the mixture proportions, a normal distribution for each mean given the variance, and a Wishart distribution for each precision matrix. This means that, a priori,

$$p(\pi) \sim D(\alpha)$$
$$p(\mu_i|V_i^{-1}) \sim N_p(\mu_{i0},\ V_i/n_{i0})$$
$$p(V_i^{-1}) \sim W_p(m_{i0},\ M_{i0}/m_{i0})$$

The posterior of the parameters and the missing observations given the data will be

$$p(\theta,Z|X) \propto f(X|\theta,Z)\,f(Z|\theta)\,p(\theta) = \ell(\theta|X,Z)\,p(\theta)$$

We can apply Gibbs sampling to obtain samples from this distribution. The idea is to sample by iterating between the two conditional distributions $p(\theta|X,Z)$ and $p(Z|X,\theta)$. In the first we assume an assignment of the observations to the groups and obtain the estimation of the parameters. This distribution is easy to obtain, as we will see, since each observation is identified as belonging to one group. In the second we assume a value for the parameters and calculate the probability of each observation coming from each group. Again, this distribution is easily obtained. At the end we have a set of Monte Carlo samples from these distributions, $(\theta^{(1)},Z^{(1)}), \ldots, (\theta^{(N)},Z^{(N)})$. The values $(\theta^{(1)}, \ldots, \theta^{(N)})$ allow us to estimate the posterior distribution of the parameters given the data, whereas the sequences $(Z^{(1)}, \ldots, Z^{(N)})$ provide the posterior probability of each observation belonging to each group, given the data.

We begin with the sampling of $p(\theta|X,Z)$. Setting $Z = Z^{(h)}$, for each group we can obtain a sample of values of the parameters as follows:

(1) For each mean, with

$$p(\mu_g|V_g^{-1},X,Z^{(h)}) \sim N_p(\mu_{gp},V_{gp})$$

where the posterior mean is calculated in the usual way, but using only those observations which have been classified in $Z^{(h)}$ as belonging to group g:

$$\mu_{gp} = \frac{n_{g0}\mu_{g0} + n_g(Z^{(h)})\,\overline{x}_g(Z^{(h)})}{n_{g0} + n_g(Z^{(h)})}$$

where $n_g(Z^{(h)})$ is the number of observations in group g, given by

$$n_g(Z^{(h)}) = \sum_{i=1}^{n} z_{ig}$$


and x̄g(Z(h)) is the mean of these observations, given by

$$\overline{x}_g(Z^{(h)}) = \frac{\sum_{i=1}^{n} z_{ig}\,x_i}{\sum_{i=1}^{n} z_{ig}}.$$

Analogously, the variance of the posterior will be

$$V_{gp} = \frac{V_g}{n_{g0} + n_g(Z^{(h)})}.$$

(2) For the precision matrices with

$$p(V_g^{-1}|X,Z^{(h)}) \sim W_p(m_{gp},M_{gp})$$

where

$$m_{gp} = n_{g0} + m_{g0}$$

and

$$M_{gp}^{-1} = m_{g0}M^{-1} + n_g(Z^{(h)})S_g + \frac{n_g(Z^{(h)})\,n_{g0}}{n_g(Z^{(h)}) + n_{g0}}\,(\overline{x}_g - \mu_{g0})(\overline{x}_g - \mu_{g0})',$$

and the sample variance in each group is estimated with the observations of that group, with

$$n_g(Z^{(h)})\,S_g = \sum_{i=1}^{n} z_{ig}(x_i - \overline{x}_g)(x_i - \overline{x}_g)'.$$

(3) For the probabilities, with

$$p(\pi|X,Z^{(h)}) \sim D(\alpha_1 + n_1(Z^{(h)}), \ldots, \alpha_G + n_G(Z^{(h)})).$$

Once we have the vector of parameters $\theta^{(h)}$ we obtain a new value of $Z = Z^{(h+1)}$ using:

(4) Simulate $Z = (z_1, \ldots, z_n)$ from

$$f(z_i|X,\theta^{(h)}) \sim M_G(1;\pi_{i1}(\theta^{(h)}), \ldots, \pi_{iG}(\theta^{(h)})),$$

where the posterior probabilities of the observations are given by:

$$\pi_{ig}(\theta^{(h)}) = \frac{\pi_g f_g(x_i|\theta^{(h)})}{\sum_{g=1}^{G}\pi_g f_g(x_i|\theta^{(h)})}.$$
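A much simplified sketch (ours) of one sweep of this sampler is given below. To keep it short it treats the covariance matrices Vg as fixed within the sweep (so the Wishart step (2) is omitted), and the prior hyperparameters alpha, mu0 and n0 are supplied by the user:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gibbs_sweep(X, Z, mu0, n0, alpha, V, rng):
    """One Gibbs sweep: draw (pi, mu) given Z, then draw Z given (pi, mu).
    Z is an n x G matrix of 0/1 indicators; V is a list of fixed covariance matrices."""
    n, G = Z.shape
    ng = Z.sum(axis=0)
    # step (3): mixture proportions from their Dirichlet posterior
    pi = rng.dirichlet(alpha + ng)
    # step (1): group means from their conditional normal posteriors
    mu = []
    for g in range(G):
        xbar_g = (Z[:, g] @ X) / max(ng[g], 1)      # mean of the observations in group g
        mean_gp = (n0[g] * mu0[g] + ng[g] * xbar_g) / (n0[g] + ng[g])
        cov_gp = V[g] / (n0[g] + ng[g])
        mu.append(rng.multivariate_normal(mean_gp, cov_gp))
    # step (4): new indicators from the posterior probabilities pi_ig
    dens = np.column_stack([
        pi[g] * multivariate_normal.pdf(X, mean=mu[g], cov=V[g]) for g in range(G)
    ])
    probs = dens / dens.sum(axis=1, keepdims=True)
    groups = np.array([rng.choice(G, p=probs[i] / probs[i].sum()) for i in range(n)])
    Z_new = np.eye(G, dtype=int)[groups]
    return pi, np.array(mu), Z_new
```

Repeating the sweep N times and discarding a burn-in period yields the Monte Carlo samples (θ(1), Z(1)), ..., (θ(N), Z(N)) described above.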

An additional problem in the estimation using Gibbs sampling is the lack of identification of the components of the mixture, which we pointed out in section 15.3.3. One possibility is to introduce an order between the distributions, but this solution may not always be suitable. See Stephens (2000) and Celeux et al. (2000) for recent discussions of this problem.


15.5 PROJECTION METHODS

An alternative to the above procedures is to look for directions of projection of the data in which the different groups can be seen, and then to look for groups along these univariate directions. Alternatively, we can project the data onto two directions and search for groups in the plane. The advantage of this approach is that it is not necessary to specify the number of groups a priori, nor to compare solutions with very different numbers of groups.

Let us suppose that we have a sample where each observation can come from one of two normal populations which have the same covariance matrix, which is unknown, and different means, also unknown. The origin of each observation is also unknown. We wish to find a direction for the projection of the observations which provides us, if possible, with the maximum separation between the two populations. It can be shown that, although we do not know the origin of each observation, it is possible to classify with a linear discriminant function, which we know is ideal for classification in this problem. If we assume that the probability of each observation coming from each one of the populations is the same, Peña and Prieto (2000) have shown that the direction which minimizes the kurtosis of the projection is Fisher's linear discriminant function (see Appendix 15.3). This result can be extended to several populations: Fisher's optimal rules of classification are obtained by minimizing the kurtosis of the projections.

This result suggests looking for the directions which minimize the kurtosis and projecting the data onto these directions. As it is possible that, in addition to the groups, there are isolated outliers, or very small groups far from the rest, and we saw in Chapter 4 that these outliers manifest themselves in directions of maximum kurtosis, we can design an algorithm which also includes these directions in its search. This leads to the following method proposed by Peña and Prieto (2001b):

(1) Start by projecting the data onto the directions which maximize the kurtosis coefficient of the projected data. Next, project the data onto the space orthogonal to the direction found, and choose the direction within this subspace where the kurtosis coefficient is maximized. Repeat the process of projecting orthogonally to the directions already found and selecting the new direction which maximizes the kurtosis. In this way p orthogonal directions of projection are obtained.

(2) Repeat the calculation of directions in (1), but this time searching for directions which minimize the kurtosis coefficient.

(3) Explore each of these 2p directions in order to find groups and outliers in the following way. We obtain the order statistics of the projected data and consider the gaps between them. If the projected data come from a unimodal distribution, these gaps must follow a known pattern, with large gaps at the extremes and small ones in the center of the distribution.

In order to decide when a gap along a direction occurs, we use properties of order statistics. It can be shown that if we have a sample of normal data and we transform the observations according to their distribution function, the order statistics of the transformed data follow a uniform distribution, and then the distribution of the spacings or gaps can easily be studied. For this reason, the procedure standardizes the projected data with the inverse distribution function of the univariate normal before beginning to look for spacings or gaps. The algorithm for searching along the directions is:


1. For each direction $d_k$, $k = 1, \ldots, 2p$, the projections of the data onto it are calculated as $u_{ki} = x_i'd_k$.

2. The observations are standardized, $z_{ki} = (u_{ki} - m_k)/s_k$, where $m_k = \sum_i u_{ki}/n$ is the mean of the projections and $s_k^2 = \sum_i (u_{ki} - m_k)^2/(n-1)$ is their variance.

3. The projections $z_{ki}$ are ordered for each k, and the order statistics $z_{k(i)}$ are obtained. Next these statistics are transformed with the inverse distribution function of the standard normal, $\bar{z}_{ki} = \Phi^{-1}(z_{k(i)})$.

4. The spacings are calculated, which are the differences between consecutive values of the transformed order statistics, $\gamma_{ki} = \bar{z}_{k,i+1} - \bar{z}_{ki}$.

5. We look for high values of the spacings $\gamma_{ki}$, which correspond to gaps in the distribution of the data. For example, if the statistic of order thirteen is 10 and that of order fourteen is 20, the spacing is 20 − 10 = 10, and if this value is much greater than the other spacings it indicates that there is a gap without data between these values, which may correspond to the separation between two groups of data. A high value of the spacing therefore indicates the presence of more than one group of data. In order to determine the high values of $\gamma_{ki}$ we introduce a constant, $\kappa$, which is calculated from the distribution of the spacings (see Peña and Prieto, 2001b, for details), and we decide that a new group of data begins when $\gamma_{kj} > \kappa$ (a simplified sketch of this gap search is given after this list).

6. We mark all of the observations l which verify $\bar{z}_{kl} \leq \bar{z}_{kr}$ as belonging to different groups from those which verify $\bar{z}_{kl} > \bar{z}_{kr}$, and return to (5) to repeat the search for gaps in the data.
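The following simplified sketch (ours) illustrates the gap search along a single projection direction. It works directly with the spacings of the standardized order statistics, omitting the normal distribution-function transformation of step 3, and it takes the threshold kappa as a user-supplied constant, whereas the actual procedure derives it from the distribution of the spacings (Peña and Prieto, 2001b):

```python
import numpy as np

def split_by_gaps(u, kappa):
    """Label the points of one univariate projection u, starting a new group
    whenever the spacing between consecutive order statistics exceeds kappa."""
    z = (u - u.mean()) / u.std(ddof=1)       # step 2: standardize the projections
    order = np.argsort(z)
    spacings = np.diff(z[order])             # step 4: gaps between order statistics
    labels = np.zeros(len(u), dtype=int)
    group = 0
    for pos, gap in enumerate(spacings):     # steps 5 and 6: split at large gaps
        if gap > kappa:
            group += 1
        labels[order[pos + 1]] = group
    return labels
```

In the full algorithm this search is carried out along each of the 2p directions of maximum and minimum kurtosis.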

After this analysis, the final step is to assign the observations to the identified groups as follows.

1. Letting G be the number of identified groups, the groups are ordered by number of observations, so that group 1 is the largest and G the smallest. Let us assume that the observations have been renumbered so that now the observations $i_{g-1}+1$ to $i_g$ belong to group g (with $i_0 = 0$ and $i_G = n$).

2. For each group $g = 1, \ldots, G$:

(a) Obtain the mean and covariance matrix of the observations of the group if there are at least $p+1$ data points.

(b) Calculate the Mahalanobis distances for all of the observations that are not in group g,

$$D_j^2 = (x_j - m_g)'S_g^{-1}(x_j - m_g), \qquad j \leq i_{g-1},\ j > i_g.$$

(c) Assign to group g those observations which satisfy $D_j^2 \leq \chi^2_{p,0.99}$ (a sketch of this reassignment is given after this list).

(d) If no observation is reclassified, go to group $g+1$. Otherwise, renumber the observations as in 1 and repeat the process for the same group g.
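A sketch (ours) of the reassignment rule in step 2, for a single group g which is assumed to contain at least p + 1 observations:

```python
import numpy as np
from scipy.stats import chi2

def reassign_to_group(X, labels, g, level=0.99):
    """Assign to group g every observation outside g whose squared Mahalanobis
    distance to the center of g is below the chi-square cutoff."""
    Xg = X[labels == g]
    m_g = Xg.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Xg, rowvar=False))
    cutoff = chi2.ppf(level, df=X.shape[1])        # chi^2_{p, 0.99}
    new_labels = labels.copy()
    for j in np.where(labels != g)[0]:
        d = X[j] - m_g
        if d @ S_inv @ d <= cutoff:
            new_labels[j] = g
    return new_labels
```

The function is applied to the groups in decreasing order of size and repeated until no observation changes group.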


A Matlab algorithm for applying this procedure can be found at http://halweb.uc3m.es/fjp/downl

We are going to illustrate the operation of the kurtosis algorithm proposed by Peña and Prieto (2001) using the Ruspini data. The figure below indicates the directions of projection that maximize (···) or minimize (- - -) the kurtosis coefficient of the projected data. We see that in order to maximize the heterogeneity the data are projected in such a way that a large group appears in the center and smaller groups appear at both extremes; this is the case of a central group with outliers. The directions of minimum kurtosis correspond to dividing the projected data into two well separated groups. The figure shows the four directions of projection and the final classification obtained. The four groups are perfectly identified and two outliers appear, marked as e.

Directions that maximize and minimize the kurtosis coefficient of the projected points for the Ruspini data.

Figure 15.8 shows the results for the Maronna data (maronna.dat file). As before, the directions of maximum kurtosis (···) search for a large central group and smaller groups to the sides, whereas those of minimum kurtosis try to separate the data into two well separated groups of minimum size. The final classification recovers the origin of the points fairly well.

Figure 15.8: Directions of projection and final solution using the Maronna data for the projection method.

Example: The solution found for the Diday data is presented in Figure 15.9. Basically three groups are detected, but some observations are considered outliers with respect to the groups found.

Figure 15.9: Kurtosis algorithm solution for the Diday data.

15.6 Conclusions

Peña and Prieto (2001) present a comparison, using simulation, of k-means, MCLUST and the kurtosis algorithm, and show that overall the kurtosis algorithm produces fewer classification errors. In general, both MCLUST and the kurtosis algorithm are far superior to k-means. When the dimension of the data is large, the kurtosis algorithm offers significant advantages, since it works with projections, while MCLUST has difficulties: it is slow and fails often. For small dimensions, MCLUST works well and, being based on an explicit mixture model, it has the advantage of providing the probabilities of belonging to the groups, which makes inference easier. This result can also be obtained with the kurtosis algorithm by refining the estimation with the EM algorithm, starting from the best solution found from the projections.

15.7 Additional Reading

The literature covering the methods presented here is extensive. Anderberg (1973), Everitt (1993), Gordon (1981), Hartigan (1975), Kaufman and Rousseeuw (1990), Mirkin (1996), Spath and Bull (1980) and Spath (1985) are dedicated to grouping methods and present the classical methods. The basics of the MCLUST algorithm are described in the articles by Banfield and Raftery (1993), Dasgupta and Raftery (1998) and Fraley and Raftery (1999).

For projection methods see Friedman (1987), Jones and Sibson (1987), Posse (1995), Nason (1995) and Peña and Prieto (2001). Modifications have been proposed to the k-means algorithm to improve its operation; see Cuesta-Albertos et al. (1997) and Garcia-Escudero and Gordaliza (1999), and for choosing the number of clusters see Tibshirani et al. (2001).

The reader interested in other promising methods for finding groups can consult Peña and Tiao (2002), who present a very general procedure for looking for homogeneity in any statistical method, the SAR method, and Peña, Rodriguez and Tiao (2002a, 2002b), who apply it to constructing groups and compare its operation with the algorithms described in this chapter.

15.7.1 Exercises:

Exercise: Prove that Hartigan's criterion for the k-means algorithm is equivalent to adding groups until $\operatorname{tr}(W_G) < \operatorname{tr}(W_{G+1})(n-G+9)/(n-G-1)$. (Suggestion: use $\operatorname{tr}(W) = SCDG$ and impose the condition that the value of F be greater than 10.)

Exercise: Prove that for large n the Calinski and Harabasz criterion for the k-means algorithm is approximately equivalent to selecting G groups if $\operatorname{tr}(W_G) < \operatorname{tr}(W_{G+1})\,G/(G-1)$.

Exercise: Prove that in the estimation of normal mixtures, if the distributions have $V_g = \sigma_g^2 I$, the ML estimate of $\sigma_g^2$ is obtained from

$$\widehat{\sigma}_g^2 = \sum_{i=1}^{n}\frac{\pi_{ig}}{\sum_{i=1}^{n}\pi_{ig}}\,(x_i-\widehat{\mu}_g)'(x_i-\widehat{\mu}_g).$$

Exercise: Prove that if we have a mixture of g normal populations $N_i(\mu_i, V)$ with different means but the same covariance matrix and with probabilities $\pi_i$, where $\sum_i\pi_i = 1$, and we project the observations onto a direction $u$, letting $z = u'x$ be the projected data, then its mean is $E(z) = \bar{m} = u'\mu$, where $\mu = \sum_i\pi_i\mu_i$, and its variance is $\operatorname{var}(z) = u'(V+B)u$, where $B = \sum_i\pi_i(\mu_i-\mu)(\mu_i-\mu)'$.

Exercise: Prove that in the above exercise the fourth moment of the projected data is $3\left(u'(V+B)u\right)^2 - 3\left(u'Bu\right)^2 + \sum_i\pi_i(u'B_iu)^2$.

Exercise: Construct groups of countries with the INVES data using the kurtosis algorithm.

APPENDIX 15.1: COMPARISON OF THE TRACE AND DETERMINANT CRITERIA


Suppose, for the sake of simplicity, that we have two groups. We first study how the variability within the first group increases when we include in it a new element. Suppose that the first group has n points $(x_1, \ldots, x_n)$ with center $m$ and sum of squares matrix $W$, and we add a new point, $x^*$. The new mean, $m^*$, when this element is included will be, letting $d = (x^* - m)$,

$$m^* = m + \frac{1}{n+1}d$$

and the new sum of squares matrix, $W^*$, is

$$W^* = \sum\left(x_i - m - \frac{1}{n+1}d\right)\left(x_i - m - \frac{1}{n+1}d\right)'$$

decomposing this sum into that of the first n original elements of the group plus the last one, we have

$$W^* = W + \frac{n}{(n+1)^2}dd' + (x^* - m^*)(x^* - m^*)'$$

and as

$$x^* - m^* = \frac{n}{n+1}d$$

we finally have:

$$W^* = W + \frac{n}{n+1}dd' \qquad (15.17)$$

With the trace criterion, the change in the matrix $W$ is:

$$\operatorname{tr}(W^* - W) = \frac{n}{n+1}\operatorname{tr}(dd') = \frac{n}{n+1}d'd$$

and the change is minimum if we include the point with minimum Euclidean distance to the centroid of the group. By analogous reasoning, we can analyze the symmetric problem of the decrease in variability when an element of the group is eliminated, and conclude that the trace criterion minimizes the Euclidean distances between the points and the means of their groups.

Now let us analyze the determinant criterion. Supposing n is large, so that in (15.17), in order to simplify the reasoning, we can take $n/(n+1)$ as unity, we have:

$$|W^*| - |W| = |W + dd'| - |W|, \qquad (15.18)$$

and as:

$$|W + dd'| = |W(I + W^{-1}dd')| = |W|\,|I + W^{-1}dd'|, \qquad (15.19)$$

and, on the other hand, since $W^{-1}dd'$ has rank one,

$$|I + W^{-1}dd'| = \prod(1 + \lambda_i) = 1 + \lambda_1 = 1 + \operatorname{tr}(W^{-1}dd') = 1 + \operatorname{tr}(d'W^{-1}d) = 1 + d'W^{-1}d,$$

thus we conclude with the equation


$$|W + dd'| = |W| + |W|\,d'W^{-1}d.$$

Replacing this value in (15.18), the result is that we minimize the effect of adding a point on the determinant if we minimize:

$$|W|\,d'W^{-1}d,$$

which, since $|W|$ is fixed, is equivalent to minimizing the Mahalanobis distance between the point and the center of the group. We can conclude that, in general, while the trace criterion minimizes Euclidean distances, the determinant criterion minimizes Mahalanobis distances.
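The two identities used in this appendix are easy to check numerically; the following short sketch (ours) verifies the update (15.17) and the determinant of the rank-one modification on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # a group with n = 50 points in p = 3
n, m = len(X), X.mean(axis=0)
W = (X - m).T @ (X - m)                        # sum of squares matrix of the group

x_star = rng.normal(size=3)                    # new point added to the group
d = x_star - m
X2 = np.vstack([X, x_star])
W_star = (X2 - X2.mean(axis=0)).T @ (X2 - X2.mean(axis=0))

# identity (15.17): W* = W + n/(n+1) d d'
print(np.allclose(W_star, W + n / (n + 1) * np.outer(d, d)))
# determinant of the rank-one modification: |W + d d'| = |W| (1 + d' W^{-1} d)
lhs = np.linalg.det(W + np.outer(d, d))
rhs = np.linalg.det(W) * (1 + d @ np.linalg.inv(W) @ d)
print(np.allclose(lhs, rhs))
```

Both comparisons print True.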