

Reduction of Complexity

Helmut Strasser

January 26, 2000

Contents

1 Introduction

2 The classical approach
  2.1 Partitions
  2.2 Quantization
  2.3 The equivalence theorem

3 An extension of the classical approach
  3.1 Varying measures of distance
  3.2 Similarity measures
  3.3 Linear similarity measures
  3.4 The general form of linear similarity measures

4 Data compression by similarity measures
  4.1 Optimal quantization by similarity measures
  4.2 Optimal partitions with similarity measures
  4.3 The equivalence theorem

5 Selecting the similarity measure
  5.1 Reduction of decision theoretic information
  5.2 A discriminance problem
  5.3 The solution with full information
  5.4 How to select a likelihood ratio
  5.5 The solution with reduced information
  5.6 Robust solutions

6 Algorithms
  6.1 Gradient methods
  6.2 Fixpoint algorithms

7 Appendix


1 Introduction

For the statistical analysis of multivariate data the modelling approach is popular and successful. However, there are practical limits for complete modelling if the size of a data set is small compared to its dimension. The sample size needed for modelling a high-dimensional data set soon exceeds any realistic bound. In such a case alternative ways of data analysis have to be taken into consideration.

One of these alternatives is to reduce the complexity of the data set before entering the actual data analysis. Any reduction of complexity can be viewed as a data transformation with the goal to simplify the data in such a way that important information in the data is preserved but unimportant information is removed.

The answer to the question what information contained in the data is important depends on the goals of the data analysis. If we want to analyze the differences between independent samples then other patterns are of interest than for detecting correlations among variables. This implies that for reduction of complexity we need a hypothetical model of data generation in order to distinguish between important and unimportant information. But we refrain from estimating all parameters of the model. The final statistical analysis which is based on the reduced information after reduction of complexity is satisfied with answers to particular questions concerning the model parameters. Such questions are typically not concerned with parameter estimation but with decisions between alternative assertions.

A well-elaborated methodology for deciding between alternative assertions on the model has been part of statistics for many decades. This is statistical testing. The problem of reduction of complexity, however, has not received the necessary attention.

Classical statistical theory is satisfied if the reduction of complexity is such that there is no loss of information at all. The central part is played by the concept of a sufficient data transformation. Conventional statistical tests often submit the data to a sufficient transformation without loss of information and make their decisions on the basis of the transformed data.

Some statistical models with few unknown parameters actually admit sufficient transformations which lead to a considerable simplification of the data. One may think of location models with normally distributed data. But if the data are not normally distributed, then, even for univariate location models, there is no sufficient data transformation yielding a simplification which is worth mentioning. In such cases a reduction of complexity is only possible if we accept some loss of information.

Clearly, there is a large number of data transformations which reduce the complexity of a data set. Therefore, it is imperative to find such transformations which spoil the information in the data only marginally. It is a remarkable fact that this kind of problem has been ignored almost completely by statistical theory.

However, we have to admit that the theoretical concepts for dealing with statistical information are ready prepared. The incorporation of sufficiency into Wald's statistical decision theory was done in the fifties by Blackwell, 1951 and 1953. The extension of the sufficiency concept to cover any loss of information is due to LeCam, 1964. Unfortunately, the modern mathematical concepts used by LeCam were too deterrent to the statistical community for LeCam's theory to get the attention of a broad group of theoretical statisticians, not to speak of applied statisticians. There is some mathematical literature on LeCam's theory, e.g. the monographs by LeCam, 1986, Strasser, 1985, and Torgersen, 1991.

But also in applied statistics there are many proposals of how a reduction of data complexity could be achieved. There are methods which try to explain many variables by few factors, like principal components analysis, factor analysis and their nonlinear extensions. There are other methods which try to simplify the scale of the observed data, like cluster analysis and vector quantization.

If we consider cluster analysis in connection with the reduction of complexity and data compression then it is important to clarify a misunderstanding. Originally, the methods of cluster analysis have been invented in order to detect density clusters in empirical data. If there are any density clusters then it is natural to divide the data set into subsets such that each density cluster corresponds to a subset of the partition. Therefore, cluster analysis has to do with partitioning a data set. But the methods of cluster analysis always lead to a partition, and they do not care whether there are any density clusters or not.

On the other hand, reduction of complexity is not concerned with density clusters, but data partitioning is an essential topic. Admittedly, if accidentally there are some density clusters in the data then it is plausible that any data compression should consider this macro structure. In mathematical terms: the partition should be finer than the partition generated by the density clusters. But many empirical data sets are far from being well-structured in terms of density clusters. Nevertheless, we might be interested in obtaining a suitable partition. This is the reason why methods of cluster analysis play an important role for the reduction of complexity. However, the evaluation of those methods for the aims of complexity reduction need not consider their ability to detect density clusters. Rather, the evaluation has to consider completely different aspects, first and foremost those of statistical decision theory.

This is the subject of the present paper.

We are going to consider methods of data compression which result in a finite data structure. This can be a partition of the data into finitely many subsets or a vector quantization with finitely many values. The restriction to finite data structures is motivated by our interest in a computational realization of the methods.

The questions we are considering are as follows: How can we measure the loss of information caused by a data compression, and how can we elaborate methods which minimize the loss of information under the side condition of a given complexity? For convenience the degree of complexity is defined to be the number of objects in the finite data structure.

It will be shown that these questions can be answered by classical concepts of statistical decision theory. This paper is devoted to a presentation of the pertaining results on a low level of mathematical sophistication.

Let us give some hints on the organization of the paper.

In Section 2 we present those two classical approaches of cluster analysis and vector quantization which are the starting points of our further considerations. As has been known for some time these approaches are equivalent and nothing else than two views of the same idea.

Sections 3 and 4 are concerned with the generalization of the classical ideas. This extension offers a much greater class of partitioning methods. Our approach is based on similarity measures, which serve as an intuitive background of the methods. Let us indicate as a first motivation that our generalization allows to treat classical methods and recent algorithms of AI from one viewpoint (see Example (6.26)). A more substantial foundation will be given in Strasser, 2000.

Section 5 is the main part of the paper. We explain in a simplified way how decision theoretic arguments lead to criteria for the choice of partitioning methods. In this Section it will turn out how statistical decision theory may lead us to the class of methods defined in Section 3. Moreover, it will become clear which role statistical models have to play for our approach of data analysis. After all, claiming robustness we arrive at a class of new methods which are positioned between classical cluster analysis and modern AI.


In this paper we do not support our results by empirical examples. For this we refer to experimental studies by Rahnenführer, 1999, and by Steiner, 1999.

Section 6 contains some brief remarks on algorithms for the computational realization of our methods. We confine ourselves to the basics and for further information we refer to Pötzelberger and Strasser, 1999, and to Steiner, 1999.

2 The classical approach

Let us start with a data set $(x_1, x_2, \ldots, x_n)$ with values in $E \subseteq \mathbb{R}^d$. There are many possibilities of reducing the complexity of such a data set. In this paper we only consider partitions and vector quantizations.

2.1 Partitions

A simple way of reducing the complexity of a data set is to divide the sample space $E$ into subsets, and to consider instead of the data $x_i$ only their membership in a subset of the partition. A partition of the data into subsets plays a role in many fields of statistics and is called clustering or classification. As a basic reference for the statistical theory of classification we mention Bock, 1974.

A popular goal for determining partitions is the homogeneity of the subsets. The most natural formal definition of this goal leads to so-called minimum variance partitions (cf. Bock, 1974, Problem A, p. 164).

Measuring the interior dispersion of a subset $C \subseteq E$ around its centroid $m$ by
\[
SS(C) := \sum_{x \in C} |x - m|^2 ,
\]
we obtain as interior dispersion of a partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ the number
\[
SS(\mathcal{C}) := \sum_{j=1}^{k} SS(C_j).
\]
A minimum variance partition of size $k$ is then a partition such that the interior dispersion is minimal among all partitions consisting of at most $k$ subsets.
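To make the objective concrete, here is a minimal sketch (Python with NumPy; not part of the original paper, and the function name is illustrative only) of how the interior dispersion of a given partition could be computed.

```python
import numpy as np

def interior_dispersion(subsets):
    """Sum of squared distances of each subset's points to the subset centroid.

    `subsets` is a list of arrays, each of shape (n_j, d), one array per set C_j.
    """
    total = 0.0
    for C in subsets:
        m = C.mean(axis=0)             # centroid of the subset
        total += np.sum((C - m) ** 2)  # SS(C_j)
    return total

# Example: two subsets of a two-dimensional data set
C1 = np.array([[0.0, 0.0], [1.0, 0.0]])
C2 = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
print(interior_dispersion([C1, C2]))
```

A minimum variance partition of size $k$ would minimize this quantity over all partitions into at most $k$ subsets.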

2.2 Quantization

Another way for a reduction of complexity is vector quantization. Constructing a quantization we replace the original data set $(x_1, x_2, \ldots, x_n)$ by a weighted system of representative values, so-called prototypes:

Prototypes   Weights
a_1          w_1
a_2          w_2
...          ...
a_k          w_k

The prototypes $a_j$ are elements of the sample space $E$. The weights $w_j$ must be nonnegative and add to one.


Most methods for quantization try to find prototypes each of which is a good representative for a large number of the original data points. There are many possibilities to achieve this goal in a mathematical way.

We are now going to describe the so-called principal point approach (cf. Bock, 1974, Problem B, p. 164). Here the prototypes $a_j$ are chosen in such a way that the squared distance of any data point $x_i$ from the best fitting prototype, i.e.
\[
\min_s |x_i - a_s|^2 ,
\]
is minimal on the average. The fitness of a prototype system $a_1, a_2, \ldots, a_k$ is then defined by the sum of these squared distances
\[
A^2(a_1, a_2, \ldots, a_k) := \sum_{i} \min_s |x_i - a_s|^2 .
\]
In a principal point quantization the prototype system $a_1, a_2, \ldots, a_k$ minimizes $A^2$. In this sense a principal point prototype system is fitted to the data set in an optimal way.

Any quantization consists of prototypes and weights. Given any prototype system the weights $w_j$ are defined as follows. For each data point $x_i$ there is a best fitting prototype $a_j$, which means
\[
|x_i - a_j|^2 = \min_s |x_i - a_s|^2 . \tag{1}
\]
This prototype $a_j$ being optimal for the data point $x_i$ is called the winner of $x_i$. (If there is more than one prototype with this property let us take as winner the prototype which comes first in the sequence $a_1, a_2, \ldots, a_k$.) Let
\[
C_j := \{ x_i : a_j \text{ is the winner of } x_i \}
\]
be the set of all data points whose winner is $a_j$. The system of these sets is a partition of the data set. In this way any prototype system $(a_1, a_2, \ldots, a_k)$ defines a partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$. This partition is called the Voronoi partition of the prototype system $(a_1, a_2, \ldots, a_k)$. In order to complete the definition of a quantization one defines as weights $w_j$ simply the relative frequencies (portions)
\[
w_j := \frac{\#C_j}{n}
\]
of the data in the sets of the Voronoi partition $\mathcal{C}$.
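As an illustration, a minimal sketch (Python/NumPy, not from the paper; names are illustrative) of how the winners, the Voronoi partition and the weights could be computed for a given prototype system:

```python
import numpy as np

def voronoi_quantization(X, prototypes):
    """Assign each data point to its winner prototype and compute the weights.

    X: array of shape (n, d); prototypes: array of shape (k, d).
    Returns the winner index of every point and the relative frequencies w_j.
    """
    # squared distances |x_i - a_s|^2 for all pairs (i, s)
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    winners = d2.argmin(axis=1)          # ties resolved by the first prototype
    n, k = X.shape[0], prototypes.shape[0]
    weights = np.bincount(winners, minlength=k) / n   # w_j = #C_j / n
    return winners, weights

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
A = np.array([[0.5, 0.0], [5.5, 5.0]])
print(voronoi_quantization(X, A))
```

The principal point problem then asks for prototypes minimizing the average of the winning squared distances; algorithms for this are discussed in Section 6.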

2.3 The equivalence theorem

There is a surprising and important connection between minimum variance partitions and principal point quantizations.

(2.1) THEOREM The minimum variance problem and the principal point problem are equivalent in the following sense:

1. The Voronoi partition of a principal point quantization is a minimum variance partition.

2. The centroids of the sets of a minimum variance partition are the prototype system of a principal point quantization.

This equivalence theorem has been well known for a long time. For univariate data it has been proved by Nilsson, 1967. The general case can be found in Bock, 1969 and 1974 (Satz 15.1). We present the simple proof in the appendix of this paper.


This equivalence theorem shows that minimum variance partitions and principal point quantizations are two sides of the same data compression approach. Moreover, it follows that we only need to solve the principal point problem if we want to find a minimum variance partition. Methods of solution are discussed in Section 6.

3 An extension of the classical approach

3.1 Varying measures of distance

Both methods described in the preceding section are based on a particular measure of distance between data and prototypes. This is the squared Euclidean distance
\[
\delta(x, a) = |x - a|^2 = \sum_{i=1}^{d} (\xi_i - \alpha_i)^2 ,
\]
where $x = (\xi_1, \xi_2, \ldots, \xi_d)$ and $a = (\alpha_1, \alpha_2, \ldots, \alpha_d)$.

It is an obvious idea to vary those approaches by using other measures of distance. E.g. we may use distance measures of the form
\[
\delta(x, a) = |x - a|^p = \Big( \sum_{i=1}^{d} (\xi_i - \alpha_i)^2 \Big)^{p/2} ,
\]
with an exponent $p \neq 2$. One could think of even much more general compositions
\[
\delta(x, a) = \ell(|x - a|)
\]
admitting arbitrary increasing functions $\ell : \mathbb{R}_+ \to \mathbb{R}_+$.

If the methods of Sections 2.1 and 2.2 are varied in this manner then there arise two independent optimization problems which are no longer connected in a way analogous to Theorem (2.1). A first consequence is that the generalization of the minimum variance concept is computationally intractable. The second optimization problem, i.e. the generalization of the principal point problem, however, is still tractable by gradient or stochastic gradient methods.

In the literature on vector quantization the extension of the principal point problem to general distance measures is a big issue. As an incomplete survey we mention Pollard, 1981 and 1982, Flury, 1990 and 1993, Flury, Tarpey and Li, 1995, Pärna, 1986 and 1990, and Kipper and Pärna, 1992.

The mathematical problems related to these generalizations are difficult and thus an attractive challenge to mathematicians. However, the practical importance of results in this direction is not always completely clear. One reason is that for generalized principal point solutions the optimal prototypes need not be the centroids of their winner sets.

If for some reasons we impose the condition that optimal prototypes must always be the centroids of their winner sets, then we have to take another path in generalizing the classical approach.
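As an illustration of the generalized principal point problem just described, here is a minimal sketch (Python/NumPy, not from the paper; the names are illustrative) of the objective for an arbitrary increasing function $\ell$:

```python
import numpy as np

def generalized_fitness(X, prototypes, ell):
    """Average of ell(|x_i - a_s|) at the best fitting prototype of each point.

    X: (n, d) data, prototypes: (k, d), ell: increasing function on [0, inf).
    """
    dist = np.sqrt(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2))
    return ell(dist).min(axis=1).mean()

X = np.random.default_rng(0).normal(size=(200, 2))
A = np.array([[-1.0, 0.0], [1.0, 0.0]])
# p = 1 instead of the classical p = 2
print(generalized_fitness(X, A, lambda r: r))
print(generalized_fitness(X, A, lambda r: r ** 2))
```

Minimizing such an objective over the prototypes remains amenable to (stochastic) gradient methods, but, as noted above, the optimal prototypes need in general not be the centroids of their winner sets.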


3.2 Similarity measures

There is another way of generalizing minimum variance partitions and principal point quantizations which does not share the drawbacks of using generalized distance measures. This is to use a particular kind of similarity measure instead of distance measures.

At first sight the transition from distance measures to similarity measures seems to be only a formal switch of the viewpoint. This is actually true as long as we are considering the classical methods which rely on the squared Euclidean distance. However, as soon as we vary the similarity measures we obtain completely new types of optimization problems which are no longer related to distance measures. These new types of optimization problems have been considered for the univariate case by Bock, 1992 and 1994, and for the general case by Pötzelberger and Strasser, 1999. A survey on recent applications in statistical inference is given in Strasser, 1999. Numerical experiments are contained in the thesis by Steiner, 1999. In the paper by Rahnenführer, 1999, it is shown by systematic Monte Carlo experiments that the new types of optimization problems lead to considerable improvements for statistical inference.

In the following we will describe the transition to similarity measures in a detailed way. Let us anticipate the main results which are achieved by this approach:

• The assertion of the equivalence theorem (2.1) can be maintained, which is important for solution algorithms.

• The sets of any optimal partition are convex and the optimal prototypes of the sets coincide with their centroids. This may be an interesting feature for the practical interpretation of optimal partitions and quantizations.

• A well-known quantization algorithm by Kohonen, which cannot be interpreted as a distance based optimization algorithm, turns out to be the solution algorithm of an optimization problem based on similarity measures (see Example (6.26)).

• Our approach can be embedded into statistical decision theory, in particular into the theory of comparison of statistical experiments. In this way it becomes possible to select the similarity measure used for data compression by decision theoretic criteria.

Those aspects of statistical decision theory which lead to our approach with similarity measures are presented in Section 5.

3.3 Linear similarity measures

In this section we define the type of similarity measures which we will take as a basis of our further considerations.

The squared Euclidean distance $\delta(x, a) = |x - a|^2$ can be divided into two parts
\[
|x - a|^2 = |x|^2 - 2\Big( \langle a, x \rangle - \frac{|a|^2}{2} \Big),
\]
where we denote by $\langle a, x \rangle$ the inner product of the vectors $a$ and $x$. The second part of this decomposition contains the term
\[
\sigma(x, a) := \langle a, x \rangle - \frac{|a|^2}{2}, \tag{2}
\]
which is a similarity measure between $x$ and $a$.

This similarity measure $\sigma(x, a)$ is not symmetric in its arguments (in general we have $\sigma(u, v) \neq \sigma(v, u)$), and therefore we have to define which of the vectors $x$ and $a$ should take the role of the data point and which the role of the prototype. Let us agree that $x$, i.e. the first variable of the function $\sigma$, plays the role of the data point, and that $a$, i.e. the second variable, plays the role of the prototype.

Interpreting $\sigma(x, a)$ as a similarity measure can be motivated by the fact that the maximum similarity to a fixed data point $x$, i.e. the maximum of $\sigma(x, a)$, is achieved if and only if $a$ coincides with $x$, since we have
\[
\sigma(x, a) = \frac{|x|^2 - |x - a|^2}{2}.
\]
Thus, a natural condition is fulfilled, namely the condition that a data point $x$ is the best prototype for the one-point set consisting of the single data point $x$.

Another argument in favour of the interpretation of $\sigma$ as a similarity measure is the fact that the function $\sigma(x, a)$ is a decreasing function of the distance between a fixed data point $x$ and a varying prototype $a$.

Moreover, the similarity measure $\sigma(x, a)$ has a further remarkable property. As follows from its definition (2) it is a linear function of the data point $x$. This linearity has an interesting consequence. It implies that the optimal prototype of any data set must be its centroid (see Lemma (7.28)).

Therefore, if we want to design a general concept of similarity measures in such a way that the optimal prototypes are always the centroids of the respective data sets, then linearity of the similarity measure in the variable $x$ is a sufficient condition to achieve this goal. This is the reason why we generalize the similarity measure (2) in such a way that the linearity property is kept.

(3.2) DEFINITION A linear similarity measure is a function of the form
\[
\sigma(x, a) = \langle s(a), x \rangle - U(a),
\]
where the functions $s(a)$ and $U(a)$ are such that the optimal prototype $a$ of a single data point $x$ is equal to $x$.
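The reason why linearity in $x$ forces optimal prototypes to be centroids can be sketched in one line; this is essentially the content of Lemma (7.28), restated here informally with the notation of Definition (3.2). For a data set $C$ with centroid $m$,
\[
\frac{1}{\#C} \sum_{x \in C} \sigma(x, a)
  = \Big\langle s(a), \frac{1}{\#C} \sum_{x \in C} x \Big\rangle - U(a)
  = \sigma(m, a) \le \sigma(m, m),
\]
where the inequality holds because, by Definition (3.2), the best prototype for the single point $m$ is $m$ itself. Hence the average similarity is maximized by the centroid $a = m$.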

Let us consider the most familiar examples of linear similarity measures.

(3.3) EXAMPLE

1. In case of the classical similarity measure
\[
\sigma(x, a) := \langle a, x \rangle - \frac{|a|^2}{2}
\]
we have $s(a) = a$ and $U(a) = \frac{|a|^2}{2}$. As we have explained before, this similarity measure arises from the squared Euclidean distance $\delta(x, a) = |x - a|^2$. It will turn out in Section 4.3 that compression methods which rely on this similarity measure are equivalent to those methods which lead to minimum variance partitions and principal point quantizations.

2. A completely different similarity measure is
\[
\sigma(x, a) = \Big\langle \frac{a}{|a|}, x \Big\rangle .
\]
This is actually a similarity measure in the sense of Definition (3.2), since
\[
\sigma(x, x) = \Big\langle \frac{x}{|x|}, x \Big\rangle = |x| \ge \frac{\langle a, x \rangle}{|a|} = \sigma(x, a)
\]
for all prototype vectors $a$ (Cauchy-Schwarz inequality).

For this similarity measure we have $s(a) = \frac{a}{|a|}$ and $U(a) \equiv 0$. It will turn out in Example (6.26) that compression algorithms which are based on this similarity measure are equivalent to a quantization algorithm by Kohonen.

□

Further examples of linear similarity measures are discussed in the next section.

3.4 The general form of linear similarity measures

The definition of linear similarity measures requires that the optimal prototype of a single data point must be the data point itself. This condition implies that we must not use completely arbitrary functions $s(a)$ and $U(a)$ for the definition of linear similarity measures. The functions $s(a)$ and $U(a)$ rather have to be connected with each other in a very special way.

The following theorem shows how we may construct all possible linear similarity measures, i.e. which combinations of functions $s(a)$ and $U(a)$ are admitted for the construction of linear similarity measures. Let us begin with the statement of the theorem and give the explanation afterwards.

(3.4) THEOREM Let $\sigma(x, a)$ be a linear similarity measure and let $f(x) := \sigma(x, x)$. Then the function $f$ is convex and differentiable, and we have
\[
\sigma(x, a) = \langle f'(a), x \rangle - \big( \langle f'(a), a \rangle - f(a) \big) = \langle f'(a), x - a \rangle + f(a)
\]
for all $x \in E$ and all $a \in E$.

The proof of this theorem is given in the appendix (Section 7).

The assertion of Theorem (3.4) is very interesting: for every linear similarity measure
\[
\sigma(x, a) = \langle s(a), x \rangle - U(a),
\]
there is a convex function $f$ which determines the components $s(a)$ and $U(a)$ in a unique way, namely by the formulas
\[
s(a) := f'(a), \qquad U(a) := \langle f'(a), a \rangle - f(a). \tag{3}
\]

Moreover, the convex function $f$ has an intuitive interpretation for the underlying similarity measure. The value $f(x)$ satisfies
\[
f(x) = \sigma(x, x) = \max_a \sigma(x, a),
\]
and, thus, is nothing else than the maximum similarity which can be achieved for the data point $x$ with a suitable prototype $a$.

Now it is easy to generate new examples of similarity measures by starting with an arbitrary convex function $f(x)$ and defining the similarity measure by equation (3).

(3.5) EXAMPLE

1. For the similarity measure
\[
\sigma(x, a) = \langle a, x \rangle - \frac{|a|^2}{2}
\]
we have as defining convex function
\[
f(x) = \sigma(x, x) = \langle x, x \rangle - \frac{|x|^2}{2} = \frac{|x|^2}{2}.
\]

2. For the similarity measure
\[
\sigma(x, a) = \Big\langle \frac{a}{|a|}, x \Big\rangle
\]
we have as defining convex function
\[
f(x) = \sigma(x, x) = \Big\langle \frac{x}{|x|}, x \Big\rangle = |x|.
\]

3. The preceding examples can be connected by a continuous family of further similarity measures.

Let us start with the convex power function
\[
f(x) = \frac{|x|^p}{p},
\]
where $p \ge 1$ is an arbitrary number. The case $p = 2$ corresponds to the first example and the case $p = 1$ corresponds to the second example. For any $p \ge 1$ we have
\[
s(a) = f'(a) = a\,|a|^{p-2}
\]
and
\[
U(a) = \langle f'(a), a \rangle - f(a) = \langle a\,|a|^{p-2}, a \rangle - \frac{|a|^p}{p} = \frac{p-1}{p}\,|a|^p ,
\]
which gives the similarity measure
\[
\sigma(x, a) := \langle a\,|a|^{p-2}, x \rangle - \frac{p-1}{p}\,|a|^p .
\]
In fact we have
\[
f(x) = \sigma(x, x) = \langle x\,|x|^{p-2}, x \rangle - \frac{p-1}{p}\,|x|^p = |x|^p - \frac{p-1}{p}\,|x|^p = \frac{|x|^p}{p}.
\]

4. Another family of similarity measures, which will turn out to be important in Section 5.6, is defined by the convex functions
\[
f_c(x) := \begin{cases} c\,|x| - c^2/2 & \text{if } |x| > c, \\ |x|^2/2 & \text{if } |x| \le c. \end{cases}
\]
It is easy to see that in this case we have
\[
s(a) = \begin{cases} a & \text{if } |a| \le c, \\ c\,a/|a| & \text{if } |a| > c, \end{cases}
\]
which is the prototype $a$ censored to maximum norm $c$. The linear similarity measure is given by
\[
\sigma(x, a) = \langle s(a), x \rangle - \frac{|s(a)|^2}{2}.
\]
Also this family contains examples 1 and 2 as extreme cases, since
\[
\lim_{c \to \infty} f_c(x) = \frac{|x|^2}{2} \quad \text{and} \quad \lim_{c \to 0} \frac{f_c(x)}{c} = |x|
\]
(the positive factor $c$ does not affect which prototypes or partitions are optimal).

□
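To make the construction (3) concrete, here is a minimal sketch (Python/NumPy, not part of the paper; the function name is illustrative) of the linear similarity measure generated by the power family $f(x) = |x|^p/p$ of item 3:

```python
import numpy as np

def sigma_power(x, a, p=2.0):
    """Linear similarity measure generated by f(x) = |x|^p / p.

    s(a) = a |a|^(p-2),  U(a) = (p-1)/p * |a|^p,  sigma(x, a) = <s(a), x> - U(a).
    """
    norm_a = np.linalg.norm(a)
    s = a * norm_a ** (p - 2)
    U = (p - 1) / p * norm_a ** p
    return float(np.dot(s, x)) - U

x = np.array([1.0, 2.0])
# the maximum similarity for x is attained at a = x and equals f(x) = |x|^p / p
for p in (1.0, 1.5, 2.0):
    print(p, sigma_power(x, x, p), np.linalg.norm(x) ** p / p)
```

For $p = 2$ this reduces to the classical measure of item 1, and for $p = 1$ to the Kohonen-type measure of item 2.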


4 Data compression by similarity measures

We have determined which type of similarity measure will be used in the following and we have discussed some reasons for this particular choice. A foundation which is much more substantial from the decision theoretic point of view will be given in Section 5.

At this point we are going to discuss the next step of the data compression. We have to define what we mean by optimal data compression. As before, basically there are two possibilities, namely optimal partitions and optimal quantizations.

4.1 Optimal quantization by similarity measures

Let us generalize principal point quantizations to an analogous concept which is based on similarity measures instead of distance measures.

Let $\sigma(x, a) = \langle s(a), x \rangle - U(a)$ be a linear similarity measure and let $f(x)$ be the generating convex function. Consider a prototype system $(a_1, a_2, \ldots, a_k)$ and an arbitrary data point $x$. If we want to know how well this data point is represented by the system of prototypes then we compute the maximum similarity $\max_s \sigma(x, a_s)$ which can be obtained with any of the given prototypes.

The average fitness of the prototype system $(a_1, a_2, \ldots, a_k)$ relative to the whole data set is then
\[
K_f(a_1, a_2, \ldots, a_k) := \frac{1}{n} \sum_{i=1}^{n} \max_s \sigma(x_i, a_s).
\]

Here we denote by $n$ the size of the data set. The convex function $f$ is used as an index to indicate which similarity measure is applied to compute the fitness.

(4.6) DEFINITION The number $K_f(a_1, a_2, \ldots, a_k)$ is called the $f$-fitness of the prototype system $a_1, a_2, \ldots, a_k$.

After these preparations we are in a position to extend principal point quantizations to similarity measures. Let a prototype system $a_1, a_2, \ldots, a_k$ be called $f$-optimal if it maximizes the $f$-fitness among all prototype systems which consist of at most $k$ prototypes.

The weights $w_j$ which we need to define a quantization in the sense of Section 2.2 are defined as follows.

For each data point $x_i$ there is a best fitting prototype $a_j$, that means
\[
\sigma(x_i, a_j) = \max_s \sigma(x_i, a_s). \tag{4}
\]
This optimal prototype $a_j$ for $x_i$ is again called the winner of $x_i$. (If there is more than one prototype of this sort then we take as winner the first one in the sequence $a_1, a_2, \ldots, a_k$.) Let
\[
C_j := \{ x_i : a_j \text{ is the winner of } x_i \}.
\]
In this way any prototype system $(a_1, a_2, \ldots, a_k)$ defines a partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ of the data set.
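A minimal sketch (Python/NumPy, not from the paper; names illustrative) of the $f$-fitness and the induced winner partition, here with the classical linear similarity measure (2) written out inline:

```python
import numpy as np

def f_fitness_and_winners(X, prototypes, sigma):
    """Compute the f-fitness K_f (mean maximal similarity) and the winner of each point."""
    S = np.array([[sigma(x, a) for a in prototypes] for x in X])  # (n, k) similarity matrix
    winners = S.argmax(axis=1)          # ties resolved by the first prototype
    return S.max(axis=1).mean(), winners

# classical linear similarity measure sigma(x, a) = <a, x> - |a|^2 / 2
classical = lambda x, a: float(np.dot(a, x)) - 0.5 * float(np.dot(a, a))

X = np.random.default_rng(1).normal(size=(100, 2))
A = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K, w = f_fitness_and_winners(X, A, classical)
print(K, np.bincount(w, minlength=len(A)) / len(X))   # f-fitness and the weights w_j
```

An $f$-optimal prototype system maximizes $K_f$; the winner sets $C_j$ form the MSP-partition defined below.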

The type of partition just defined is obtained by a process which is very similar to that of a Voronoi partition. We will use a geometrically motivated name for this kind of partition. The name will be in the spirit of Bock, 1992, who used the same idea in the univariate case. For a better understanding of the term let us explain the geometric picture in a more detailed way.

(4.7) REMARK For any prototype $a$ the function
\[
p_a : x \mapsto \sigma(x, a) = \langle s(a), x \rangle - U(a)
\]
is a linear function satisfying
\[
p_a(x) \le f(x) \text{ for all } x, \qquad p_a(x) = f(x) \text{ if } x = a.
\]
Such a linear function is called a support function of the convex function $f$ at the point $a$. The function graph of the linear function $p_a$ is a plane surface whose position is below the function graph of the convex function $f$ but has the point $(a, f(a))$ in common. This means that the function graph of $p_a$ is a so-called support plane of the function graph of the convex function $f$.

If $a_1, a_2, \ldots, a_k$ is any prototype system then for each prototype $a_j$ there is a support function $p_{a_j}$ whose graph is a support plane of the convex function $f$. The set $C_j$ of winners of a particular prototype $a_j$ consists of those data points $x$ for which the maximality condition
\[
\sigma(x, a_j) = \max_s \sigma(x, a_s) \iff p_{a_j}(x) = \max_s p_{a_s}(x)
\]
is satisfied, i.e. such that the support plane of $a_j$ is positioned above all support planes of the remaining prototypes.

This geometric interpretation is the reason for taking as name for the partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ of winner sets the term "maximum support plane partition" (MSP-partition). A similar term, namely "maximum support line partition", has been used by Bock, 1992, in the univariate case. □

(4.8) DEFINITION Let $f$ be a convex function and let $\sigma(x, a)$ be the corresponding similarity measure. If $a_1, a_2, \ldots, a_k$ is a prototype system then the partition defined by
\[
C_j := \{ x_i : a_j \text{ is the winner of } x_i \}
\]
is called a maximum support plane partition (MSP-partition).

After this definition we have got all components needed for the notion of $f$-optimal quantizations.

(4.9) DEFINITION Let $f$ be a convex function. A weighted system of prototypes

Prototypes   Weights
a_1          w_1
a_2          w_2
...          ...
a_k          w_k

is an $f$-optimal quantization if both the prototype system is $f$-optimal and the weights are identical to the relative frequencies of the corresponding MSP-partition.

If the underlying similarity measure is equivalent to the classical similarity measure (2) then any $f$-optimal quantization is a principal point quantization. This is the assertion of the following theorem.

(4.10) THEOREM Let $f(x) = |x|^2/2$ and let $\sigma(x, a)$ be the similarity measure defined by (2). Then a quantization is $f$-optimal if and only if it is a principal point quantization.

If we consider other convex functions $f$ then the corresponding $f$-optimal quantizations do not coincide with principal point quantizations.

For some special cases of convex functions $f$ the concept of $f$-optimal quantizations has already been considered in the literature:


• In connection with discriminance problems Bock, 1992 and 1994, considered $f$-optimal quantizations for the convex function $f(x) = x \log x$, $x \ge 0$, and even for arbitrary convex functions $f : \mathbb{R}_+ \to \mathbb{R}_+$.

• As indicated before, the convex function $f(x) = |x|$ leads to $f$-optimal quantizations which happen to be the attractors of an algorithm defined by Kohonen, 1984, in connection with associative memory. This case will be considered in more detail in Section 6.

4.2 Optimal partitions with similarity measures

There is a concept of constructing partitions which generalizes minimum variance partitions in a similar way.

Let $\sigma(x, a) = \langle s(a), x \rangle - U(a)$ be a linear similarity measure and let $f(x) = \sigma(x, x)$ be the generating convex function of this similarity measure. We consider a set $C \subseteq E$ consisting of $n$ data points with centroid $m$. Based on $\sigma$ we define the number
\[
S_\sigma(C) := \sum_{x \in C} \sigma(x, m),
\]
which is analogous to the interior dispersion when we are dealing with distance measures. This number can be simplified to
\[
S_\sigma(C) = \sum_{x \in C} \sigma(x, m) = \Big\langle s(m), \sum_{x \in C} x \Big\rangle - n U(m) = n \big( \langle s(m), m \rangle - U(m) \big) = n f(m).
\]

The formula says that for computing $S_\sigma(C)$ we only need to know the centroid $m$ of the set $C$, the size $n$ and the convex function $f$ which generates the similarity measure $\sigma$. For lack of any better name let us call $S_\sigma(C) = n f(m)$ the eccentricity of the set $C$.

Now let $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ be any partition. The number of points in $C_j$ is denoted by $n_j$ and the centroid of $C_j$ is denoted by $m_j$. The total eccentricity of the partition $\mathcal{C}$ is then the sum
\[
S_\sigma(\mathcal{C}) := \sum_{j=1}^{k} S_\sigma(C_j) = \sum_{j=1}^{k} n_j f(m_j)
\]
of the eccentricities of all sets.

After these preparations we are going to extend the notion of minimum variance partitions to similarity measures. A partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ is said to be optimal with respect to the similarity measure $\sigma(x, a)$ if this partition maximizes the number $S_\sigma(\mathcal{C})$ among all partitions consisting of at most $k$ subsets.

At this point we cannot give a plausible interpretation of the number $S_\sigma(\mathcal{C})$. The next Section 4.3 is devoted to the relation between the present concept of optimal partitions and the concept of $f$-optimal quantizations, which has an easier intuitive interpretation. It will turn out (by a formal equivalence theorem) that both concepts are equivalent to each other. A really convincing interpretation of the number $S_\sigma(\mathcal{C})$ will be given in Section 5 where we discuss the problem from the viewpoint of statistical decision theory. Until then we will consider the number $S_\sigma(\mathcal{C})$ as a formal definition without intuitive interpretation.

It will be convenient to divide the number $S_\sigma(\mathcal{C})$ by the size $n$ of the whole data set, resulting in
\[
I_f(\mathcal{C}) := \frac{1}{n} S_\sigma(\mathcal{C}) = \sum_{j=1}^{k} p_j f(m_j) \quad \text{where} \quad p_j := \frac{n_j}{n}.
\]
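A minimal sketch (Python/NumPy, not from the paper; names illustrative) of the empirical $f$-information of a partition, shown with two of the convex functions discussed above:

```python
import numpy as np

def f_information(subsets, f):
    """I_f(C) = sum_j p_j f(m_j) with p_j = n_j / n and m_j the centroid of C_j.

    `subsets` is a list of arrays of shape (n_j, d); `f` maps a centroid to a number.
    """
    n = sum(len(C) for C in subsets)
    return sum(len(C) / n * f(C.mean(axis=0)) for C in subsets)

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])
C2 = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
print(f_information([C1, C2], f=np.linalg.norm))         # f(x) = |x|
print(f_information([C1, C2], f=lambda m: 0.5 * m @ m))  # f(x) = |x|^2 / 2
```

An $f$-informative partition in the sense of Definition (4.12) below maximizes this quantity over all partitions into at most $k$ subsets.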


(4.11) DEFINITION The number $I_f(\mathcal{C})$ is called the $f$-information of the partition $\mathcal{C}$.

We are not yet ready to explain the justification of the term information at this point. However, in Section 5 we will show that this is actually an information measure in the sense of statistical decision theory.

Since for the computation of the number $S_\sigma(\mathcal{C})$ we only need to know the partition $\mathcal{C}$ and the convex function $f$, it is convenient to distinguish different similarity measures by referring to the convex function $f$.

(4.12) DEFINITION A partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ is an $f$-informative partition if it maximizes $I_f(\mathcal{C})$ among all partitions consisting of at most $k$ subsets.

For the similarity measure (2) any $f$-informative partition is a minimum variance partition. Because of its importance we isolate this fact as a theorem.

(4.13) THEOREM Let $f(x) = |x|^2/2$ and let $\sigma(x, a)$ be the similarity measure defined by (2). A partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ is $f$-informative if and only if it is a minimum variance partition.

Any variation of the convex function $f$ leads to $f$-informative partitions which are considerably different from minimum variance partitions.

4.3 The equivalence theorem

As we mentioned before, the concepts of $f$-informative partitions and of $f$-optimal quantizations are, in the case of the squared norm $f(x) = |x|^2/2$, identical to the corresponding classical concepts of minimum variance partitions and principal point quantizations. In this special case the assertion of the equivalence theorem (2.1) is valid. It is then a natural question whether this equivalence remains true if the squared norm is replaced by another convex function.

We have remarked that the corresponding question in connection with distance measures has no positive answer. It is therefore noteworthy that for the approach with linear similarity measures the assertion of the equivalence theorem remains valid.

(4.14) THEOREM Let $f$ be an arbitrary convex function. The problem of finding an $f$-informative partition and the problem of finding an $f$-optimal quantization are equivalent in the following sense:

1. The MSP-partition of an $f$-optimal prototype system is an $f$-informative partition.

2. The centroids of the sets of an $f$-informative partition are the prototype system of an $f$-optimal quantization.

Proofs of the theorems stated in Section 4 are to be found in Pötzelberger and Strasser, 1999.
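The equivalence theorem suggests a simple alternating improvement scheme in the spirit of the classical k-means iteration: assign points to winners, then replace each prototype by the centroid of its winner set. The following sketch (Python/NumPy, not from the paper; initialization and stopping rule are our own choices) only illustrates the idea and is not meant to reproduce the fixpoint algorithms of Section 6.

```python
import numpy as np

def alternate(X, prototypes, sigma, n_iter=20):
    """Alternate between winner assignment and centroid updates.

    Each full round cannot decrease the average assigned similarity, by the
    standard Lloyd-type argument: the centroid is the optimal prototype of a
    fixed set (linearity), and reassignment only increases the similarity.
    """
    A = prototypes.copy()
    for _ in range(n_iter):
        S = np.array([[sigma(x, a) for a in A] for x in X])
        winners = S.argmax(axis=1)
        for j in range(len(A)):
            members = X[winners == j]
            if len(members) > 0:
                A[j] = members.mean(axis=0)   # optimal prototype of C_j is its centroid
    return A

classical = lambda x, a: float(np.dot(a, x)) - 0.5 * float(np.dot(a, a))
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [3, 3])])
print(alternate(X, X[[0, 50]].copy(), classical))
```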

5 Selecting the similarity measure

In the preceding section we discussed a general type of optimization problem which may be used for a reduction of complexity in data sets. For selecting a particular method from the general class we have to fix a convex function $f$ in order to define a similarity measure.

On the one hand, the freedom in choosing a particular similarity measure increases the flexibility and can be rated as a positive aspect. On the other hand, we need criteria for this choice. Such normative rules for the choice of a particular method must be part of a prior theory which attempts to realize special goals by applying such methods.


For data compression several theories could be taken into consideration for answering the question of choosing a method. E.g. there is the classical theory of information which is concerned with signal transmission. Another theory where data compression plays an important role is optical pattern recognition.

We have to face the fact that answers to the problem of a good data compression are very different depending on the normative theory we are starting from. A method which works well for electronic data transmission may be ill-suited for optical pattern recognition.

In the present paper we consider data compression neither from the viewpoint of classical information theory nor from that of optical pattern recognition. We are rather interested in statistical decisions which are based on empirical data from social and economic areas. For these kinds of applications there is also a normative theory for data analysis. It is called statistical decision theory.

In the following we will explain some ideas of how statistical decision theory can be used to choose a convex function and thus a similarity measure for our approach of data compression.

Basically it would be possible to develop our approach as a necessary and logical consequence of the main ideas of statistical decision theory. This will be done in Strasser, 2000. For reasons of space, and since such a discussion would go far beyond the mathematical sophistication level of this paper, we confine ourselves to a simplified and mathematically lucid explanation of some leading ideas.

5.1 Reduction of decision theoretic information

So far we have always been considering an empirical data set. Taking the viewpoint of statistical decision theory we have to start with a probabilistic model for our data.

Statistical decision theory is concerned with making decisions between alternative stochastic models on the basis of statistical data. Decisions are made by decision functions (test statistics or estimators). The fundamental problem of statistical decision theory is to define good decision functions for concrete decision problems.

We are now going to explain how data compression can be regarded as a problem of statistical decision theory. In decision theoretic terminology we are not dealing with data compression but with reduction of statistical information. Data compression and information reduction pursue the same goals, in the first case for empirical data, in the second case for theoretical models.

The central concept is the information set of a decision problem.

The information set of a decision problem defines which decision functions may be used for the distinction between alternative stochastic models. After a reduction of information we may only apply those decision functions which use nothing else than the reduced information. If the reduction of information was carried out by constructing a partition, then for decision making we may not use the original data but only their membership in the subsets of the partition. Thus, the partition defines the information set which is available for the decision problem.

In terms of statistical decision theory the problem of optimal data compression corresponds to the problem of reducing the information set in such a way that the increase of risk is as small as possible.

Let us identify the information set of a decision problem with the partition $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ which is used for decision making after the reduction of information. Information is reduced in order to decrease the complexity of the problem. This may boil down to limiting the number $k$ of subsets in the partition by an upper bound. For any information set $\mathcal{C}$ we may find some optimal decision function whose risk may be used as a measure of the information contained in the partition: the larger the risk, the less is the information in $\mathcal{C}$.

In the sections below we will show that the information content of a partition can be measured by a number which corresponds to the $f$-information defined in (4.11).

The $f$-information of Definition (4.11) is an empirical measure since it is based on empirical data. There is a corresponding measure for stochastic models. If $P$ is a probability model for our data and if $\mathcal{C} = (C_1, C_2, \ldots, C_k)$ is a partition of the set $E$ then
\[
I_f(\mathcal{C}, P) := \sum_{j=1}^{k} P(C_j)\, f\big( E_P(X \mid C_j) \big) \tag{5}
\]
denotes the $f$-information of the partition $\mathcal{C}$ for the model $P$. Here the symbol $X$ denotes a random data point $x \in E$ distributed according to $P$, and $E_P(X \mid C_j)$ denotes the conditional expectation of the random data point $X$ in the set $C_j$, i.e. its stochastic average if it is restricted to the set $C_j$.

If the empirical data set $(x_1, x_2, \ldots, x_n)$ is a large sample from the model $P$, then by the law of large numbers the empirical $f$-information $I_f(\mathcal{C})$ from Definition (4.11) is approximately equal to the theoretical $f$-information $I_f(\mathcal{C}, P)$. Thus, any interpretation of $I_f(\mathcal{C}, P)$ as a measure of information in a model may be used for the interpretation of the corresponding empirical measure $I_f(\mathcal{C})$.

We will show that the $f$-information $I_f(\mathcal{C}, P)$ is actually a measure of information in the sense of statistical decision theory. Moreover, we will show how to derive decision theoretic criteria for the choice of the convex function $f$, and thus for the linear similarity measure, too.

5.2 A discriminance problem

To begin with, let us explain the decision theoretic framework by means of a simple discriminance problem. In Section 5.4 we will turn to more general decision problems.

Let $P$ and $Q$ be two stochastic models which shall be distinguished by empirical data. The distinction is to be based on a sample $(x_1, x_2, \ldots, x_n)$ through a test statistic
\[
T(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} t(x_i),
\]
where $t(x)$ is a so-called score function or influence function. This score function $t$ may be considered as a particular data transformation. It defines how the data $x$ are to be transformed before the mean is computed as the test statistic $T$.

We define $P$ to be the null hypothesis and $Q$ to be the alternative hypothesis. The score function $t$ is supposed to be centered, i.e. we assume $E_P(t) = 0$, such that also the expectation $E_P(T)$ of the test statistic $T$ is zero. The decision between $P$ and $Q$ is based on the value $T(x_1, x_2, \ldots, x_n)$ of the test statistic for a particular sample $(x_1, x_2, \ldots, x_n)$.

The intuitive idea of the test statistic is as follows: if $T(x_1, x_2, \ldots, x_n)$ is not too far away from $0$ then the decision is made in favor of $P$; but if $T(x_1, x_2, \ldots, x_n)$ is too far away from $0$, then the decision is in favor of $Q$. Hence a good test statistic $T$ should be such that its expectations $E_P(T) = E_P(t)$ and $E_Q(T) = E_Q(t)$ under the models $P$ and $Q$ are considerably different from each other. This implies that the difference $E_Q(t) - E_P(t)$ should be as large as possible. We want to optimize the score function $t$ on the condition that this difference is a maximum.

However, in maximizing the difference $E_Q(t) - E_P(t)$ we have to take into account the risks of the decision. In the most simple case we measure the risk by the variance $V_P(t) = n V_P(T)$ of the score function. The efficient score functions are then obtained by solving the optimization problem
\[
E_Q(t) - E_P(t) - \gamma V_P(t) = \text{Max}\,! \tag{6}
\]
where the arbitrary constant $\gamma > 0$ serves as a measure of risk aversion.

The maximum value
\[
\max_t \big( E_Q(t) - E_P(t) - \gamma V_P(t) \big) \tag{7}
\]
is a measure of the separability of the models $P$ and $Q$. The larger the information set of the decision problem, the better the models can be distinguished and thus the larger is the maximum value (7).

It is therefore natural to take the maximum value (7) as a measure of the amount of information in the information set of the decision problem.

In the following sections we will compute this maximum value (7) under several assumptions. It will turn out that for problems with reduced information this maximum value is closely related to the $f$-information (5). This will serve as justification for calling (5) an information measure.

Let us have a closer look at the optimization problem (6). If $p(x)$ and $q(x)$ are the probability densities of the models $P$ and $Q$, then we denote by $h(x)$ the relative difference of those densities, i.e.
\[
h(x) = \frac{q(x) - p(x)}{p(x)} = \frac{q(x)}{p(x)} - 1. \tag{8}
\]

The function $h$ is called the centered likelihood ratio. The objective function of the optimization problem (6) can be written as
\[
E_Q(t) - E_P(t) - \gamma V_P(t) = E_P\Big( t \frac{q}{p} \Big) - E_P(t) - \gamma E_P(t^2) = E_P(th - \gamma t^2). \tag{9}
\]
This expression (9) will be the starting point for the further treatment of the optimization problem (6).

5.3 The solution with full information

Let us consider the case without reduction of information. The maximization of (7) for this case is an old and well-understood problem of statistical decision theory. We are going to discuss the solution in order to relate our approach to more familiar treatments in statistics.

In order to solve problem (6) we have to compute an efficient score function $t_{\mathrm{opt}}$. The solution depends on the class of score functions which are admitted for competition. As long as there is no reduction of information any score function $t(x)$ is admitted.

The objective function (9) can be maximized by maximizing
\[
t(x)\,h(x) - \gamma\, t(x)^2
\]
for every single $x$ by a suitable choice of $t(x)$. For each $x$ this is an elementary optimization problem with solution
\[
t_{\mathrm{opt}}(x) = \frac{1}{2\gamma}\, h(x).
\]

An efficient score function is thus proportional to the centered likelihood ratio $h$. The optimal value of the objective function is
\[
\max_t \big( E_Q(t) - E_P(t) - \gamma V_P(t) \big)
= E_Q(t_{\mathrm{opt}}) - E_P(t_{\mathrm{opt}}) - \gamma V_P(t_{\mathrm{opt}})
= E_P\Big( \frac{1}{2\gamma} h \cdot h - \gamma \frac{1}{4\gamma^2} h^2 \Big)
= \frac{1}{4\gamma} E_P(h^2). \tag{10}
\]

Obviously, the risk aversion $\gamma$ plays the role of a constant factor and, therefore, we may view the number $E_P(h^2)$ as a measure of separability of the models $P$ and $Q$, i.e. as a measure of the amount of information in the (unreduced) decision problem.

We arrive at the following result:

• The optimal score function $t(x)$ for distinguishing between the models $P$ and $Q$ is proportional to the centered likelihood ratio $h(x)$ of the models $P$ and $Q$.

• The amount of information in the problem of discriminating between $P$ and $Q$ is equal to $E_P(h^2)$, i.e. the variance $V_P(h)$ of the centered likelihood ratio (recall that $E_P(h) = 0$).

It may be that this result does not sound as familiar as it should in view of its fundamental importance for statistical decision theory. Thus we postpone the solution of the problem (6) with reduced information to the next but one section. In the immediately following section we take an excursion to translate the solution obtained into more familiar pictures.

5.4 How to select a likelihood ratio

Problems of applied data analysis usually lead to more complex situations than simple discriminance problems. Therefore it is not as simple as in the preceding section to obtain rules for the selection of a centered likelihood ratio $h$.

We will illustrate this by discussing two typical statistical situations. These situations comprise parametric problems and semiparametric problems.

(5.15) DISCUSSION (Univariate parametric problems)
Let $(P_\theta)$ be a family of stochastic models which are parameterized by a one-dimensional parameter $\theta$. Suppose as null hypothesis a particular model $P = P_\theta$ of this family. We are asking how to distinguish the model $P_\theta$ from other similar models $Q = P_{\theta + \Delta\theta}$.

If the models $P_\theta$ and $P_{\theta+\Delta\theta}$ are assumed to be similar then this means that the difference $\Delta\theta$ is small. In such a case the centered likelihood ratio may be approximated by the formula
\[
h(x) = \frac{p_{\theta+\Delta\theta}(x) - p_\theta(x)}{p_\theta(x)} \approx \Delta\theta\; \frac{\frac{d}{d\theta} p_\theta(x)}{p_\theta(x)}.
\]

Thus, in this case the centered likelihood ratio is approximately proportional to the so-called loglikelihood derivative
\[
\frac{d}{d\theta} \log p_\theta(x) = \frac{\frac{d}{d\theta} p_\theta(x)}{p_\theta(x)}.
\]

Now our previous result of basing an optimal test statistic on the centered likelihood ratio $h(x)$ is in accord with a familiar principle of classical statistics: any (in the sense of Neyman and Pearson locally) optimal test statistic must be defined with the loglikelihood derivative of the model as a score function.

Moreover, the number $E_P(h^2)$, which we called an information measure, is closely related to a well-known object of classical statistics. In fact, the same approximation argument as used before gives
\[
E_P(h^2) \approx (\Delta\theta)^2\, E_{P_\theta}\Big( \Big( \frac{d}{d\theta} \log p_\theta(x) \Big)^2 \Big),
\]
and the expectation on the right hand side is the so-called Fisher information of the parametric model. □

(5.16) EXAMPLE Let $(P_\theta)$ be the location family of a normal distribution with variance $\sigma^2$, i.e.
\[
p_\theta(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{1}{2\sigma^2} (x - \theta)^2 \Big).
\]

In this case the loglikelihood derivative is given by
\[
\frac{d}{d\theta} \log p_\theta(x) = \frac{x - \theta}{\sigma^2}.
\]

For practical applications this means that test statistics being optimal for distinguishing $P_\theta$ and $P_{\theta+\Delta\theta}$ have to be of the form
\[
T(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \theta}{\sigma^2} = \frac{\bar{x} - \theta}{\sigma^2}.
\]

The Fisher information is
\[
E_{P_\theta}\Big( \Big( \frac{x - \theta}{\sigma^2} \Big)^2 \Big) = \frac{1}{\sigma^2}.
\]
This is compatible with the intuitive idea that the location information of a normal distribution is large if the variance is small. □

For more general parametric problems we may use a similar way of reasoning. However, in multivariate parametric cases the loglikelihood derivative may have a much more complex structure than in the preceding example. A systematic discussion of such examples goes beyond the scope of this paper.

In any case we obtain the following result for parametric problems:

• In the position of the centered likelihood ratio $h$ in the objective function (9) one has to choose the loglikelihood derivative of the parametric model.

• If there is no reduction of information then the optimal score function is proportional to the loglikelihood derivative of the parametric model.

If we are dealing with statistical problems arising from applications in the social or economic sciences then it is rather difficult to define a valid parametric model for data generation. On the contrary, the modelling distribution $P$ remains unspecified and the focus is on a particular property of the model, e.g. on the expectation, the covariance structure or on properties of symmetry. In such a situation we do not want to distinguish the models $P$ and $Q$ per se. We only want to know whether they differ with respect to the properties of interest.

In many cases the property of interest is defined as a so-called functional $\varphi(P)$, and then we are dealing with a semiparametric problem. For a semiparametric problem there is no loglikelihood derivative. We have to search for other criteria for selecting the relevant function $h$.


The most simple special case of a semiparametric problem arises if an unbiased estimator of the functional exists. This means that there is a function $g$ such that
\[
\varphi(P) = E_P(g).
\]

Fortunately, this case is not only simple but also of considerable importance.

(5.17) DISCUSSION (Most simple semiparametric problems)
Let $\varphi(P) = E_P(g)$ be a statistical functional.

If we dealt with the discriminance between fixed models $P$ and $Q$ then the separability could be measured by $E_P(h^2)$ where $h$ is the centered likelihood ratio of $P$ and $Q$. But now we are not dealing with two fixed models but with discriminating between such models where the values of the functional $\varphi$ are different.

If we want to recognize differences of the values of $\varphi$ of magnitude $\varepsilon > 0$ then we have to consider all pairs $(P, Q)$ of models where $\varphi(Q) - \varphi(P) = \varepsilon$. For some pairs the separability $E_P(h^2)$ will be better, for others it will be worse. We have to reckon with the most unfavorable value of separability, i.e.
\[
\min_{Q : \varphi(Q) - \varphi(P) = \varepsilon} E_P(h^2),
\]

and, therefore, we base the analysis on the discriminance problem where discrimination is worst. Thus, we have to consider the optimization problem
\[
E_P(h^2) = \text{Min}\,! \quad \text{where } h(x) = \frac{q(x)}{p(x)} - 1 \text{ and } \varphi(Q) - \varphi(P) = \varepsilon. \tag{11}
\]

Let us solve this problem for the hypothetical case where the probability model $P$ is assumed to be known and fixed. In this case the distributions $Q$ such that $\varphi(Q) - \varphi(P) = \varepsilon$ are given by the densities
\[
q(x) = p(x)\,(1 + h(x)), \quad \text{where } E_P(h) = 0,\; E_P(gh) = \varepsilon,
\]
or equivalently by
\[
q(x) = p(x)\,(1 + \varepsilon h(x)), \quad \text{where } E_P(h) = 0,\; E_P(gh) = 1.
\]

Then our optimization problem reads as
\[
E_P(h^2) = \text{Min}\,! \quad \text{where } h(x) = \frac{1}{\varepsilon} \Big( \frac{q(x)}{p(x)} - 1 \Big) \text{ and } E_P(h) = 0,\; E_P(hg) = 1.
\]

That means: among all functions $h$ such that $E_P(h) = 0$ and $E_P(gh) = 1$ we have to find the function with least variance.

The solution of this problem is
\[
h_{\mathrm{opt}} = \frac{g - E_P(g)}{V_P(g)}.
\]

This is easy to verify since $h_{\mathrm{opt}}$ satisfies the side conditions $E_P(h) = 0$, $E_P(hg) = 1$, and from the Cauchy-Schwarz inequality we know for any other $h$ with the same properties that
\[
1 = \big( E_P(gh) \big)^2 \le V_P(g)\, V_P(h),
\]
and thus
\[
V_P(h_{\mathrm{opt}}) = \frac{1}{V_P(g)} \le V_P(h).
\]
□

This result can be summarized as follows:

If a statistical functional is defined by an unbiased estimator $g$, and if all stochastic models $P$ have to be taken into consideration, then the function $h$ in the objective function (9) has to be defined as
\[
h = \frac{g - E_P(g)}{V_P(g)}.
\]

As we know, the number $E_P(h^2)$ can be viewed as a measure of information in the decision problem. In the present case we have
\[
E_P(h^2) = \frac{1}{V_P(g)},
\]
and this is in complete accordance with our intuitive reasoning that the functional is well estimable (the amount of information is large) if the variance of the estimator $g$ is small.

(5.18) EXAMPLE Let $\varphi$ be the expectation functional, i.e.
\[
\varphi(P) = E_P(X),
\]
where $X$ denotes a random data point as before. In this case the unbiased estimator is $g(x) = x$, and from our preceding argument it follows that
\[
h_{\mathrm{opt}}(x) = \frac{x - E_P(X)}{V_P(X)}
\]
is the optimal choice for a centered likelihood ratio and thus for the optimal score function, too.

This is the same result as we obtained for the location family of a normal distribution. But now it was achieved without any distributional assumptions. The only assumptions imposed were concerned with the statistical functional under consideration. □

There are more general situations where either an unbiased estimator is not available or where we do not take all possible stochastic models into consideration. In such a situation a similar reasoning is possible. The function $h$ which is then obtained as a solution is the so-called canonical gradient of the functional $\varphi$. The canonical gradient is a fundamental concept of semiparametric statistical theory. Let us refer to Pfanzagl and Wefelmeyer, 1982, and to Bickel, Klaassen, Ritov and Wellner, 1993.

Determining a canonical gradient can be a hard problem which may require some advanced mathematics. The idea of the concept itself, however, is easy to explain.

In order to understand what a canonical gradient means, let us combine all those models $P$ where the functional $\varphi$ has a fixed value $c$ into a surface
\[
N_c := \{ P : \varphi(P) = c \}.
\]

This set $N_c$ contains all models which are equivalent with respect to the property of interest. Now we take a position on this surface and look for the direction in which the functional $\varphi$ would increase most quickly if we left the surface in that direction. The likelihood ratio which points into that direction of maximum increase is the canonical gradient.

Let us summarize:

• For a semiparametric model we have to select as centered likelihood ratio h in the objective function (9) the canonical gradient of the functional φ.


• If the functional φ is defined by an unbiased estimator g and if all stochastic models have to be taken into consideration, then the canonical gradient is proportional to g − E_P(g).

• If in the latter case the functional is the expectation functional, then the canonical gradient is proportional to x − E_P(X).

5.5 The solution with reduced information

Now we turn to the question of how to solve the optimization problem (6) if the information set is reduced to a partition C = (C_1, C_2, ..., C_k).

If the discrimination between P and Q is to be made on the basis of reduced information, then only score functions t(x) with a particularly simple structure are admitted. Any score function may only use the information contained in the partition C, which implies that it must be a step function of the form

t(x) = b_1  if x ∈ C_1,
t(x) = b_2  if x ∈ C_2,
...
t(x) = b_k  if x ∈ C_k.    (12)

Thus, with reduced information the value t(x) of a score function depends only on which set of the partition C the data point x belongs to.

As before let us denote by h the likelihood ratio of P and Q which is used in (8). In the case of a parametric model let h be the loglikelihood derivative, and in the case of a semiparametric model let h be the canonical gradient of the functional. By

h_j := E_P(h | C_j)

we denote the centroid (i.e. the conditional expectation) of the function h on the set C_j. With this notation our objective function (9) reads as

E_P(th − γt²) = ∑_{j=1}^k P(C_j)(b_j h_j − γ b_j²).

This objective function is to be maximized by a suitable choice of the numbers b_j. The maximization can be carried out for each index j separately, and thus we obtain as optimal values b_j of the score function t(x)

b_j = h_j / (2γ) = (1/(2γ)) E_P(h | C_j).

The maximum of the objective function is then

max_t ( E_Q(t) − E_P(t) − γ V_P(t) ) = (1/(2γ)) ∑_{j=1}^k P(C_j) (E_P(h | C_j))².

The risk aversion γ enters only as a constant factor, and therefore we may define

∑_{j=1}^k P(C_j) (E_P(h | C_j))²    (13)

to be a measure of the separability of the models P and Q by the information contained in the partition C.


As we see immediately, the measure (13) is almost identical to the f-information I_f(C, P) defined in (5) for the convex function f(x) = x². The only difference is that X in (5) is replaced by h(X) in (13).
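
For concreteness, here is a minimal sketch of how the separability measure (13) can be evaluated from a sample; the empirical approximation of P by relative frequencies and cell means, and all function names, are illustrative assumptions and not part of the original text. With the default f(y) = y² the routine computes (13); plugging in another convex f yields the robust measure (16) of Section 5.6.

```python
import numpy as np

def partition_separability(h_vals, labels, f=lambda y: y**2):
    """Empirical version of (13): sum_j P(C_j) * f(E_P(h | C_j)).

    h_vals : values h(x_i) of the centered likelihood ratio at the data points
    labels : index j of the partition cell C_j containing x_i
    f      : convex function; f(y) = y**2 recovers the measure (13)
    """
    h_vals = np.asarray(h_vals, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        cell = h_vals[labels == j]
        total += (len(cell) / len(h_vals)) * f(cell.mean())  # P(C_j) * f(E_P(h|C_j))
    return total
```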

(5.19) EXAMPLE Let h be proportional to x − E_P(X), where X is a random data point. As shown previously this is the case if we are dealing either with the location family of a normal distribution or with a semiparametric problem for the expectation functional.

In this case we have E_P(h | C_j) = E_P(X | C_j) − E_P(X). Thus we obtain

∑_{j=1}^k P(C_j) (E_P(h | C_j))²
  = ∑_{j=1}^k P(C_j) (E_P(X | C_j) − E_P(X))²
  = ∑_{j=1}^k P(C_j) (E_P(X | C_j))² − E_P(X)².

But, up to the additive constant E_P(X)², this is nothing else than the f-information I_f(C, P) for the convex function f(x) = x². □

From the preceding example we learn the following: If we are dealing with a decision problem where the centered likelihood ratio is h(x) = x − E_P(X), then the f-information I_f(C, P) is actually identical to the information measure defined in (7). In other words, for a class of problems the decision theoretic information contained in a partition C is given by the f-information with f(x) = x². The problem of finding a partition which is optimal in this respect is thus, by Theorem 4.13, equivalent to finding a minimum variance partition.

If the function h(x) is not proportional to x − E_P(X), then for an optimal data compression we have to take into account the shape of the function h(x). Considering h(x) amounts to applying the data transformation y = h(x) before starting with the data compression. The f-information must be computed for the transformed data.

In formulas this looks as follows. Let y = h(x). By

C̃_j := {y = h(x) : x ∈ C_j}

we denote the sets of (transformed) y-data which correspond to the sets C_j of x-data. Then it is clear that

E_P(Y | C̃_j) = E_P(h | C_j).

Moreover, we have P(C_j) = P(X ∈ C_j) = P(Y ∈ C̃_j). For the measure (13) this gives the formula

∑_{j=1}^k P(C_j) (E_P(h | C_j))² = ∑_{j=1}^k P(Y ∈ C̃_j) (E_P(Y | C̃_j))²,

and this is equivalent to the f-information of the y-data for the model P and for f(y) = y².

Let us summarize.

(5.20) THEOREM Let h be either
(a) the centered likelihood ratio of a discriminance problem (P, Q), or
(b) the loglikelihood derivative of a parametric model, or
(c) the canonical gradient of a statistical functional.

A partition C of the data which is optimal for the underlying decision problem is obtained by transforming the data by y = h(x) and maximizing the f-information I_f(C, P) with f(y) = y², i.e. by selecting a minimum variance partition of the y-data.
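
The recipe of Theorem 5.20 can be sketched in a few lines. The sketch below assumes a one-dimensional data set and uses scikit-learn's KMeans as a stand-in for the minimum variance partition; the function names and this particular clustering routine are illustrative assumptions, not part of the original development.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for a minimum variance partition

def decision_optimal_partition(x, h, k):
    """Theorem 5.20 as a recipe: transform the data by y = h(x),
    then compute a minimum variance partition of the y-data."""
    y = h(np.asarray(x, dtype=float)).reshape(-1, 1)   # step 1: y = h(x)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(y)
    return labels                                      # cell index of each data point

# Illustration for the expectation functional (Example 5.18), h(x) = x - E_P(X):
x = np.random.default_rng(1).normal(size=500)
labels = decision_optimal_partition(x, lambda v: v - v.mean(), k=4)
```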


This theorem gives a complete decision theoretic justification of minimum variance partitions. However, it gives no reason why it should be useful to measure the information in a partition or in a quantization by a convex function other than the squared norm. This was a question posed at the beginning of Section 5. At this point it seems that there is no reason to do so.

In the next and final subsection of the decision theoretic part of this chapter we will consider this question.

5.6 Robust solutions

Let us start with the discriminance problem between two stochastic models P and Q. Let h be the centered likelihood ratio of P and Q as defined in (8). When dealing with a parametric model let h be the loglikelihood derivative, and when dealing with a semiparametric model let h be the canonical gradient of the functional.

As explained in Section 5.2 we want to find a score function t(x) such that E_Q(t) − E_P(t) is as large as possible. For this optimization we have to take the risk into account, and in Section 5.2 we used the variance V_P(t) as risk term.

Basically, there are many different ways to measure the risk involved in a test statistic. For example, if in the formula for the variance V_P(t) = E_P(t²) we replace the quadratic function by another function ψ, then our problem leads to the objective function

E_Q(t) − E_P(t) − E_P(ψ(t)) = E_P(th − ψ(t)),    (14)

where for notational convenience we put the risk aversion γ into the term ψ(t).

Why should the risk be measured in a different way than by the variance? An answer results from tackling our problem from the viewpoint of robust statistics.

In reality, empirical data do not exactly follow the rules of a known and a priori fixed stochastic model. Rather, they are contaminated, which implies that some data are outliers, i.e. untypical and usually extreme data. If we want our decision functions not to be affected by outliers, then we have to ensure that an outlier x cannot spoil the value t(x) of a score function too much.

Let us look at a famous and classical example from robust statistics.

(5.21) EXAMPLE Assume that the stochastic model is a location family of normal distributions with fixed variance σ², which implies h(x) = (x − θ)/σ². By the theory the best test statistic is

T(x_1, x_2, ..., x_n) = (1/(nσ)) ∑_{i=1}^n (x_i − θ)/σ = (x̄ − θ)/σ².

However, as is well known, the mean x̄ is not robust. A robust alternative to the mean would be the mean of the censored data, i.e. the mean which is obtained with the score function

t_cens(x) = { (x − θ)/σ²                      if |(x − θ)/σ| ≤ c,
            { ±c/σ  (with the sign of x − θ)  if |(x − θ)/σ| > c.    (15)

If in the test statistic T we replace the mean of the data by the mean of the censored data, then we arrive at a so-called robust test statistic. Here, the influence of extreme data is bounded by the number c which determines the amount of censoring.

This particular robust statistic can also be obtained as a solution of our optimization problem (6), if we measure the risk in another way than by the variance.


We have to select a special function ψ(t) in the objective function (14). Let us define

ψ(x) = { x²/2   if |x| ≤ cσ,
       { ∞      if |x| > cσ.

Using this function ψ(t) we obtain the objective function (14)

E_Q(t) − E_P(t) − E_P(ψ(t)) = E_P(th − ψ(t)) = E_P( t(X)·(X − θ)/σ² − ψ(t(X)) ).

As is easily seen, for every x the optimal value t_opt(x) is the number t_cens(x) defined in (15).

This result can be interpreted by considering ψ(t(x)) as a penalty term for large values of t(x). The faster the increase of the penalty term, the larger is the amount of censoring.

It is tempting to investigate how the measure of information is affected by the penalty term. Recall that the information measure is the maximum of the objective function (14), and it is obtained by plugging the optimal score function into the objective function. Denote

f(x) := max_t (tx − ψ(t)) = { cσ|x| − c²σ²/2   if |x| > cσ,
                            { x²/2              if |x| ≤ cσ.

This is a convex function. Employing this notation we may write the information measure as

max_t ( E_Q(t) − E_P(t) − E_P(ψ(t)) )
  = E_P( t_cens(X)·(X − θ)/σ² − ψ(t_cens(X)) )
  = E_P( f((X − θ)/σ) )
  = E_P( f(h(X)) ).

Obviously, the convex function f takes the place of the quadratic function x² in formula (10).

We encountered this particular convex function before in the formula for the information measure, namely in Example 3.5, where it was denoted f_cσ. It is instructive to study the connections to the underlying linear similarity measure.

Let s(x) be the point where t ↦ tx − ψ(t) attains its maximum. Then the convex function may be written as

f(x) = max_t (tx − ψ(t)) = s(x)x − ψ(s(x)) = max_a ( s(a)x − ψ(s(a)) ),

and the linear similarity measure reads as

σ(a, x) = s(a)x − U(a),  where U(a) = ψ(s(a)).

It is easy to compute the values s(x). We have

s(x) = { x          if |x| ≤ cσ,
       { cσ·x/|x|   if |x| > cσ.

It follows that the function s(x) defines a censoring. Moreover, we have

s(h(x)) = s((x − θ)/σ²) = t_cens(x) = t_opt(x),


and this means that the optimal score function can be expressed by the linear similarity measure σ(a, x). Thus we obtain:

The penalty term ψ(t) which is used for robustification of the optimization problem defines in a unique way a linear similarity measure σ(a, x) = s(a)x − U(a) such that f(x) = σ(x, x). The solution of the optimization problem can be expressed by the components of the linear similarity measure:

max_t E_P(th − ψ(t)) = E_P(f(h(X))),    t_opt(x) = s(h(x)).

The convex function f has a characteristic property: it increases considerably more slowly than the quadratic function. As long as the inequality |x| ≤ cσ is satisfied, the convex function f(x) and the quadratic function are equal (up to the constant factor 1/2). As soon as the opposite inequality |x| > cσ holds, the increase of the convex function is only linear. This illustrates the general rule that a quick increase of the penalty term ψ(t) corresponds to a slow increase of the convex function f. □
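
A minimal numerical sketch of the three objects appearing in this example, implemented directly from the formulas above; the function names, the threshold value, and the grid check are assumptions made only for illustration.

```python
import numpy as np

c_sigma = 1.5  # the censoring threshold cσ

def psi(t):
    """Penalty term: quadratic inside [-cσ, cσ], infinite outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= c_sigma, t**2 / 2.0, np.inf)

def s(x):
    """Maximizer of t -> t*x - psi(t): censoring (clipping) at ±cσ."""
    return np.clip(x, -c_sigma, c_sigma)

def f(x):
    """f(x) = max_t (t*x - psi(t)): quadratic near 0, linear growth beyond cσ."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c_sigma,
                    x**2 / 2.0,
                    c_sigma * np.abs(x) - c_sigma**2 / 2.0)

# Check the identity f(x) = s(x)*x - psi(s(x)) on a grid.
grid = np.linspace(-5.0, 5.0, 101)
assert np.allclose(f(grid), s(grid) * grid - psi(s(grid)))
```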

Let us summarize what we have learned from the preceding example.

In order to get a robust solution of a discriminance problem (P, Q) we have to use a penalty term ψ(t) in the optimization problem (14) which increases more rapidly than the quadratic function. The solution of the optimization problem is given by the following theorem.

(5.22) THEOREM Let f(y) := max_t (ty − ψ(t)) and for each y let s(y) be a point such that f(y) = s(y)y − ψ(s(y)).

1. Then f is a convex function and the corresponding linear similarity measure is given by

σ(a, x) = s(a)x − U(a),  where U(a) = ψ(s(a)).

2. The solution of the optimization problem is

max_t E_P(th − ψ(t)) = E_P(f(h(X))),    t_opt(x) = s(h(x)).

Now we know how to adjust the decision theoretic optimization problem in such a way that its solutions are statistically robust. Moreover, we know the solution if there is no reduction of information.

Therefore we may solve the optimization problem (6) in cases where the information set is reduced to a partition C = (C_1, C_2, ..., C_k). In this case only step functions of the form (12) are admitted.

Let h_j := E_P(h | C_j) be the mean (i.e. the conditional expectation) of the function h on the set C_j. Then the objective function is given by

E_P(th − ψ(t)) = ∑_{j=1}^k P(C_j)(b_j h_j − ψ(b_j)).

This objective function is to be maximized by a suitable choice of the numbers b_j. Using the notation defined in Theorem 5.22, we obtain as optimal solution of the score function t(x) the values b_j = s(h_j), and the maximum of the objective function is

max_t E_P(th − ψ(t)) = ∑_{j=1}^k P(C_j)( s(h_j)h_j − ψ(s(h_j)) )
  = ∑_{j=1}^k P(C_j) f(h_j)
  = ∑_{j=1}^k P(C_j) f(E_P(h | C_j)).

Therefore we may consider the number

∑_{j=1}^k P(C_j) f(E_P(h | C_j))    (16)

as a measure of robust separability of the models P and Q by the partition C.
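
To make this concrete, the following minimal sketch computes the optimal step score values b_j = s(h_j) on a given partition for the hard-censoring penalty of Example 5.21 and checks numerically that the attained maximum coincides with the robust measure (16); the heavy-tailed sample, the particular partition, and all names are assumptions made only for illustration.

```python
import numpy as np

c_sigma = 1.5                                        # censoring threshold cσ
s   = lambda y: np.clip(y, -c_sigma, c_sigma)        # maximizer of t -> t*y - psi(t)
psi = lambda t: t**2 / 2.0                           # penalty (finite part, |t| <= cσ)
f   = lambda y: np.where(np.abs(y) <= c_sigma, y**2 / 2.0,
                         c_sigma * np.abs(y) - c_sigma**2 / 2.0)

rng = np.random.default_rng(2)
h_vals = rng.standard_t(df=3, size=10_000)           # heavy-tailed values h(x_i)
labels = np.digitize(h_vals, bins=[-1.0, 0.0, 1.0])  # a fixed partition into 4 cells

attained = 0.0
for j in np.unique(labels):
    cell = h_vals[labels == j]
    h_j = cell.mean()                                # h_j = E_P(h | C_j)
    b_j = s(h_j)                                     # optimal step score value on C_j
    attained += len(cell) / len(h_vals) * (b_j * h_j - psi(b_j))

# The attained maximum equals the robust measure (16).
robust_f_info = sum(np.mean(labels == j) * f(h_vals[labels == j].mean())
                    for j in np.unique(labels))
assert np.isclose(attained, robust_f_info)
```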

Obviously, the number (16) again is almost identical to the f-information I_f(C, P) defined in (5), but now for a general convex function f. Again the only difference is that the function h(X) in (16) replaces the data point X in (5). Considering the function h(x) amounts to transforming the data by y = h(x) before proceeding to the data compression.

In view of these results we may interpret the number (16) as an f-information in the sense of (5). With the notation of Section 5.5 we obtain

∑_{j=1}^k P(C_j) f(E_P(h | C_j)) = ∑_{j=1}^k P(Y ∈ C̃_j) f(E_P(Y | C̃_j)),

and this is the same as the f-information of the y-data for the model P and for the convex function f defined in Theorem 5.22.

To sum up:

(5.23) THEOREM Let h be either
(a) the centered likelihood ratio of a discriminance problem (P, Q), or
(b) the loglikelihood derivative of a parametric model, or
(c) the canonical gradient of a statistical functional.

1. A decision theoretic optimal partition C of the data is obtained by first transforming the data by y = h(x) and then computing an f-informative partition with a convex function of the form f(y) := max_t (ty − ψ(t)).

2. If the penalty term ψ(t) is the quadratic function, then f is a quadratic function too, and any f-informative partition is equivalent to a minimum variance partition of the y-data.

3. If the penalty term ψ(t) increases more quickly than a quadratic function, then the convex function f increases more slowly than a quadratic function, and the f-information is a measure of separability which can be obtained by robust decision functions.

This is the theoretical justification for using convex functions f other than quadratic functions if partitions are constructed by maximizing the f-information. The expectations raised by this theory have been convincingly confirmed by the computational experiments of Rahnenführer, 1999.


6 Algorithms

Various algorithms may be applied to solve the optimization problems of Section 4 numerically. Since the problem of optimal partitions in Section 4.2 is a combinatorial optimization problem and, therefore, is not directly accessible, we begin by studying the quantization problem of Section 4.1. By the equivalence theorem the solution of the quantization problem is a solution of the partitioning problem as well.

In this paper we confine ourselves to explaining the methods and their relations to each other. Further theoretical results are contained in Pötzelberger and Strasser, 1999. Extensive computer experiments and improvements of the algorithms can be found in Steiner, 1999.

6.1 Gradient methods

Let σ(x, a) be a linear similarity measure. The quantization problem is the following: given a data set (x_1, x_2, ..., x_n), find prototypes (a_1, a_2, ..., a_m) such that

∑_{i=1}^n max_j σ(x_i, a_j) = Max!

Let

g(x; a_1, a_2, ..., a_m) := max_j σ(x, a_j).

Then the objective function of the optimization problem is

G(a_1, a_2, ..., a_m) = ∑_{i=1}^n g(x_i; a_1, a_2, ..., a_m).

If the similarity measure σ is differentiable with respect to the variable a, then the function g(x; a_1, a_2, ..., a_m) is partially differentiable with respect to a_j at all data points x which have a unique winner in (a_1, a_2, ..., a_m). By Pötzelberger and Strasser, 1999, Theorem 4.18, this applies to practically all cases. Therefore, we assume in the following that all data points have uniquely defined winners.

Now let C = (C_1, C_2, ..., C_m) be the MSP-partition of the prototype system (a_1, a_2, ..., a_m). Since

g(x; a_1, a_2, ..., a_m) = σ(x, a_j)   if x ∈ C_j,

we obtain

∂g/∂a_j = { ∂σ/∂a (x, a_j)   if x ∈ C_j,
          { 0                 elsewhere.

There are two kinds of gradient methods. An ordinary gradient method tries, starting from a preliminary solution (a_1, a_2, ..., a_m), to improve the objective function G(a_1, a_2, ..., a_m) by making a small step in the direction of maximum increase. A stochastic gradient method first randomly selects some data point x_i and then tries to improve only one single term of the objective function, namely g(x_i; a_1, a_2, ..., a_m), by making a small step in the direction of maximum increase of this particular term.

Both types of gradient methods are iterative algorithms which, in general, require a large number of iterations to get into the vicinity of a good solution. For concave objective functions a nonstochastic gradient method is usually faster than a stochastic one. However, our optimization problems do not have concave objective functions.


The objective function G typically has many local maxima of very different global quality. Each local maximum has a uniquely defined set of starting values from which it can be reached by a nonstochastic gradient method, the so-called attractive region of the local maximum. Any nonstochastic gradient method is bound to the attractive region where it starts. Therefore the global quality of the solution is determined by the position of the starting values. By contrast, a stochastic gradient method may escape the attractive region where it starts and may enter the attractive region of another, hopefully better, local maximum.

Let us have a closer look at a nonstochastic gradient method. If (a_1^old, a_2^old, ..., a_m^old) is a preliminary solution, then we obtain an improved solution by

a_j^new = a_j^old + ε ∂/∂a_j G(a_1^old, a_2^old, ..., a_m^old),   j = 1, 2, ..., m.

We have

∂/∂a_j G(a_1^old, a_2^old, ..., a_m^old) = ∑_{i=1}^n ∂/∂a_j g(x_i; a_1^old, a_2^old, ..., a_m^old)
  = ∑_{x ∈ C_j^old} ∂σ/∂a (x, a_j^old).

Thus, one step of the gradient method is defined by

a_j^new = a_j^old + ε ∑_{x ∈ C_j^old} ∂σ/∂a (x, a_j^old),   j = 1, 2, ..., m.

Let us consider the classical special case.

(6.24) EXAMPLE Assume that the linear similarity measure is

σ(x, a) = 〈x, a〉 − |a|²/2,

which implies ∂σ/∂a = x − a.

Then a step of the gradient method is

a_j^new = a_j^old + ε ∑_{x ∈ C_j^old} (x − a_j^old)
  = a_j^old + ε |C_j^old| (m_j^old − a_j^old),   j = 1, 2, ..., m.

A particularly interesting version of a gradient method is defined by the rule that the increment ε is chosen for each j separately according to

ε_j = 1/|C_j^old|.

Then a step of this special gradient method gives

a_j^new = a_j^old + (m_j^old − a_j^old) = m_j^old,   j = 1, 2, ..., m.

Obviously the new prototypes are simply the centroids of the current MSP-partition. This kind of iteration is called k-means clustering, and it is the most popular method of determining principal point quantizations and minimum variance partitions. □
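
A minimal sketch of one such gradient step for the Euclidean similarity measure of Example 6.24; choosing the default increments ε_j = 1/|C_j^old| reproduces the k-means (centroid) update. The function name and the handling of empty cells are illustrative assumptions.

```python
import numpy as np

def gradient_step(x, a, eps=None):
    """One nonstochastic gradient step for σ(x, a) = <x, a> - |a|²/2 (so ∂σ/∂a = x - a).

    x   : data set, array of shape (n, d)
    a   : current prototypes, array of shape (m, d)
    eps : step size; None selects ε_j = 1/|C_j|, i.e. the k-means update
    """
    d2 = ((x[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                       # MSP-partition = Voronoi partition here
    a_new = np.array(a, dtype=float)
    for j in range(len(a)):
        cell = x[labels == j]
        if len(cell) == 0:                           # empty cells: prototype left unchanged
            continue
        step = (cell - a[j]).sum(axis=0)             # Σ_{x in C_j} ∂σ/∂a(x, a_j)
        e = 1.0 / len(cell) if eps is None else eps  # ε_j = 1/|C_j| by default
        a_new[j] = a[j] + e * step                   # = centroid of C_j for the default ε_j
    return a_new, labels
```

Iterating this step with the default increments is precisely the alternation of Voronoi partitions and centroids discussed in Section 6.2.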


Let us turn to stochastic gradient methods. If (a_1^old, a_2^old, ..., a_m^old) is a preliminary solution, then the result of an iteration is obtained by choosing a data point x randomly from the data set and then computing

a_j^new = a_j^old + ε ∂/∂a_j g(x; a_1^old, a_2^old, ..., a_m^old),   j = 1, 2, ..., m.

If a_{j(x)}^old is the unique winner of x, then we have

∂/∂a_j g(x; a_1^old, a_2^old, ..., a_m^old) = { ∂σ/∂a (x, a_{j(x)}^old)   if j = j(x),
                                              { 0                          else.

Therefore one step of a stochastic gradient method is given by

a_j^new = a_j^old + { ε ∂σ/∂a (x, a_{j(x)}^old)   if j = j(x),
                    { 0                            else.

Let us have a look at the classical special case.

(6.25) EXAMPLE Assume that the linear similarity measure is

σ(x, a) = 〈x, a〉 − |a|²/2.

One step of the stochastic gradient method is given by

a_j^new = a_j^old + { ε(x − a_{j(x)}^old)   if j = j(x),
                    { 0                      else.

This special stochastic gradient algorithm has been known for a long time and is sometimes called LVQ (Learning Vector Quantization). □
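
A minimal sketch of this LVQ-type update; sweeping once through the data in random order, the fixed learning rate, and the function name are illustrative assumptions.

```python
import numpy as np

def lvq_pass(x, a, eps=0.05, rng=None):
    """One pass of the stochastic gradient method for σ(x, a) = <x, a> - |a|²/2:
    visit the data points in random order and move only the winner a_{j(x)}
    a little towards the current point: a_{j(x)} += ε (x - a_{j(x)})."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(a, dtype=float)
    for i in rng.permutation(len(x)):
        xi = x[i]
        j = ((xi - a) ** 2).sum(axis=1).argmin()   # winner of x_i
        a[j] += eps * (xi - a[j])                  # stochastic gradient step
    return a
```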

A completely different stochastic gradient method has been proposed by Kohonen, 1984.

(6.26) EXAMPLE Consider the linear similarity measure

σ(x, a) = 〈a/|a|, x〉.

The algorithm suggested by Kohonen runs as follows:

If (a_1^old, a_2^old, ..., a_m^old) is a preliminary solution, then the result of the next iteration step is obtained by choosing a data point x at random and then computing

a_j^new = a_j^old + { ε(x − a_{j(x)}^old)   if j = j(x),
                    { 0                      else.

Here a_{j(x)} denotes the unique winner of x with respect to the linear similarity measure σ.

At first sight it is not evident that this algorithm is a stochastic gradient method for the linear similarity measure σ. If we can prove this assertion, then it follows that the algorithm by Kohonen solves a quantization problem of Section 4.1.

Let p_a(x) be the orthogonal projection of x onto the direction spanned by a. Then we have

∂σ/∂a = (x − p_a(x)) / |a|.


Strictly speaking, an iteration step of a stochastic gradient method for σ should be defined as

a_j^new = a_j^old + { ε (x − p_{a_{j(x)}^old}(x)) / |a_{j(x)}^old|   if j = j(x),
                    { 0                                               else.

By a suitable choice of the increments ε we may achieve that the prototypes of both versions of the algorithm coincide up to normalizing factors. Then it is apparent that the corresponding MSP-partitions are also identical.

Pursuing this line of argument we can see that the algorithm by Kohonen is equivalent to a stochastic gradient method for the linear similarity measure σ. □

For further examples and for the explicit construction of gradient methods with other linear similarity measures we refer to Pötzelberger and Strasser, 1999, and to Steiner, 1999.

6.2 Fixpoint algorithms

In the preceding subsection we discussed the idea of gradient methods. In the case of the classical linear similarity measure, which corresponds to the squared Euclidean distance, we presented some details of the algorithms. It turned out that with a suitable choice of the increments we arrive at the method of k-means clustering.

The method of k-means clustering deserves more attention. First, this method can also be explained in a way different from gradient method ideas; secondly, it has the favorable property of stopping after finitely many iteration steps at a so-called fixpoint.

Both aspects are closely related. Thus let us begin with the alternative explanation.

The method which is called k-means clustering gives special emphasis to a necessary property of the optimal solution. It attempts to realize this property by iteratively improving preliminary solutions. In order to understand the algorithm in this way one has to know what this property is.

The property mentioned is an immediate consequence of the equivalence theorem 2.1. From this theorem it follows that the prototypes of a principal point quantization must coincide with the centroids of their Voronoi partition.

(6.27) DEFINITION A system of prototypes a_1, a_2, ..., a_k is a fixpoint of the principal point problem if the prototypes are the centroids of their Voronoi partition.

This is a remarkable property. It should be obvious that in general arbitrary prototype systems are far from being fixpoints. If, starting from a prototype system, we compute the Voronoi partition, then the centroids of this Voronoi partition need not coincide with the prototypes we started from. However, a principal point quantization does have this extraordinary property: the prototype system of a principal point quantization is always a fixpoint.

The method of k-means clustering attempts to find a prototype system which is a fixpoint. For this we start with a preliminary prototype system and alternately compute Voronoi partitions and centroids. It can be proved that under this procedure the objective functions SS(C) and J_2 either decrease or remain unchanged. After finitely many steps we arrive at a prototype system which is a fixpoint. At this point the algorithm stops, since from then on the procedure would leave everything unchanged.


Thus, the method of k-means clustering certainly arrives at a fixpoint after finitely many steps. But this does not imply that the fixpoint is a principal point solution, i.e. a globally optimal point of the objective function J_2. On the contrary, all local minima of J_2 are fixpoints as well. It can even be shown that the set of all fixpoints equals the set of all local minima.

Therefore, all we know is that the method of k-means clustering stops at some local minimum of the objective function J_2. The quality of this local minimum depends on the starting configuration of the algorithm. As we have seen earlier, the method of k-means clustering can be understood as a gradient method with a particular rule for choosing the increments. This implies that for arriving at a particular local minimum (or at the global minimum) the initial configuration of the algorithm must be an element of the corresponding attractive region.

It is easy to extend the basic idea of k-means clustering to the optimization problems which are defined in Section 4 for linear similarity measures. The concept of a fixpoint has to be defined according to the general equivalence theorem 4.14. If we then design an algorithm by alternately computing centroids and MSP-partitions, we arrive at a version of a gradient method which certainly stops at a fixpoint after finitely many steps.
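
A minimal sketch of this generalized fixpoint iteration for an arbitrary linear similarity measure σ(x, a); it alternates the MSP-partition (each point is assigned to its winner) with the centroid step, which Lemma 7.28 justifies as the best prototype of each cell. The function names, the stopping rule, and the concrete σ in the usage line are illustrative assumptions.

```python
import numpy as np

def msp_fixpoint(x, prototypes, sigma, max_iter=100):
    """Alternate MSP-partitions and centroids until a fixpoint is reached.

    x          : data set, array of shape (n, d)
    prototypes : initial prototype system, array of shape (m, d)
    sigma      : linear similarity measure, called as sigma(data_point, prototype)
    """
    a = np.array(prototypes, dtype=float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(max_iter):
        # MSP-partition: assign every data point to its winner (maximal similarity).
        labels = np.array([max(range(len(a)), key=lambda j: sigma(xi, a[j])) for xi in x])
        # Centroid step: by Lemma 7.28 the centroid is a best prototype of each cell.
        a_new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j) else a[j]
                          for j in range(len(a))])
        if np.allclose(a_new, a):   # fixpoint: prototypes are the centroids of their cells
            break
        a = a_new
    return a, labels

# Example: the similarity measure of Example 6.26, σ(x, a) = <a/|a|, x>.
kohonen_sigma = lambda xi, aj: np.dot(aj, xi) / np.linalg.norm(aj)
```

With σ(x, a) = 〈x, a〉 − |a|²/2 the same routine reduces to ordinary k-means clustering.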

We conclude with these remarks and refer to Pötzelberger and Strasser, 1999, and to Steiner, 1999, for further information.

7 Appendix

To begin with we prove the classical equivalence theorem.

PROOF: (of Theorem 2.1)

Let (a_1, a_2, ..., a_k) be an arbitrary prototype system and let C = (C_1, C_2, ..., C_k) be the corresponding Voronoi partition. Then it follows from equation (1) that

J_2(a_1, a_2, ..., a_k) = ∑_{x_i} min_s |x_i − a_s|² = ∑_j ∑_{x_i ∈ C_j} |x_i − a_j|².

If m_j is the centroid of C_j, then it follows from the principle of least squares that

∑_{x_i ∈ C_j} |x_i − a_j|² ≥ ∑_{x_i ∈ C_j} |x_i − m_j|²,

and this implies

J_2(a_1, a_2, ..., a_k) ≥ ∑_j ∑_{x_i ∈ C_j} |x_i − m_j|² = SS(C).

This means: every prototype system may be replaced by the corresponding Voronoi partition without deteriorating the objective function of the optimization problem.

On the other hand, let C = (C_1, C_2, ..., C_k) be any partition with centroids (m_1, m_2, ..., m_k). Then we have

SS(C) = ∑_j ∑_{x_i ∈ C_j} |x_i − m_j|² ≥ ∑_{x_i} min_s |x_i − m_s|² = J_2(m_1, m_2, ..., m_k).

If we compute the Voronoi partition of the centroids, the new partition is not worse than the original one.


These facts imply that the objective functions of the principal point problem and of the minimum variance problem have the same global minima. Therefore any optimal solution of one problem is also optimal for the other problem. □

Next we show that for linear similarity measures the best prototype of a data subset must be the centroid of the data subset.

(7.28) LEMMA Let σ(x, a) be a linear similarity measure and let C ⊆ E be a data subset with mean m. Then we have

∑_{x ∈ C} σ(x, m) = max_a ∑_{x ∈ C} σ(x, a).

PROOF: Let σ(x, a) = 〈s(a), x〉 − U(a). We denote by n the number of elements in the set C. Then we have

∑_{x ∈ C} σ(x, a) = ∑_{x ∈ C} (〈s(a), x〉 − U(a)) = 〈s(a), ∑_{x ∈ C} x〉 − nU(a)
  = 〈s(a), nm〉 − nU(a) = n(〈s(a), m〉 − U(a)) = nσ(m, a).

The value of σ(m, a) is maximal for a = m, since σ(m, a) ≤ σ(m, m). □

PROOF: (of the representation theorem 3.4). From

f(x) = σ(x, x) = max_a σ(x, a),

it follows that the function f is the upper envelope of a family of linear functions and hence is a convex function. Since

f(x) = σ(x, x) = 〈s(x), x〉 − U(x),

it follows that

U(a) = 〈s(a), a〉 − f(a)

and therefore

σ(x, a) = 〈s(a), x − a〉 + f(a).

From

σ(x, a) ≤ σ(x, x) = f(x)

we obtain

〈s(a), x − a〉 ≤ f(x) − f(a)

for all x, a ∈ E. Therefore s(a) is a subgradient of f at the point a, and in case of differentiability we have s(a) = f′(a). □

References

Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J.A. Efficient and adaptive estimation for semiparametric models. Johns Hopkins Univ. Press, 1993.

Blackwell, D. Comparison of experiments. In LeCam, L. and Neyman, J., editors, Proc. 2nd Berkeley Symp. Math. Statistics Prob., pages 93–102, 1951.

Blackwell, D. Equivalent comparisons of experiments. Ann. Math. Statistics, 24:265–272, 1953.


Bock, H. H. The equivalence of two extremal problems and its application to the iterative classification of multivariate data. Oberwolfach, 1969.

Bock, H. H. Automatische Klassifikation. Vandenhoeck und Ruprecht, 1974.

Bock, H. H. A clustering technique for maximizing φ-divergence, noncentrality and discriminating power. In M. Schader, editor, Analyzing and Modeling Data and Knowledge, pages 19–36, Berlin Heidelberg New York, 1992. Springer Verlag.

Bock, H. H. Information and entropy in cluster analysis. Proc. US/Japan Conf., 1994.

Flury, B., Tarpey, T. and Li, L. Principal points and self-consistent points of elliptical distributions. Annals of Statistics, 23:102–112, 1995.

Flury, B. A. Principal points. Biometrika, 77:33–41, 1990.

Flury, B. A. Estimation of principal points. Appl. Statist., 42:139–151, 1993.

Kipper, S. and Pärna, K. Optimal k-centres for a two-dimensional normal distribution. Acta et Commentationes Universitatis Tartuensis, 942:21–27, 1992.

Kohonen, T. Self-organization and associative memory. Springer, 1984.

LeCam, L. Sufficiency and approximate sufficiency. Ann. Math. Statist., 35:1419–1455, 1964.

LeCam, L. Asymptotic Methods in Statistical Decision Theory. Springer, 1986.

Lloyd, S. P. Least squares quantization in PCM. IEEE Trans. on Information Theory, 28:129–137, 1982.

Nilsson, G. Optimal stratification according to the method of least squares. Skandinavisk Aktuarietidskrift, 50:128–136, 1967.

Pärna, K. Strong consistency of k-means clustering in metric spaces. Tartu Riikl. Ülik. Toimetised, 733:86–96, 1986.

Pärna, K. On the existence and weak convergence of k-centres in Banach spaces. Acta et Commentationes Universitatis Tartuensis, 893:17–28, 1990.

Pfanzagl, J. and Wefelmeyer, W. Contributions to a general asymptotic statistical theory. Lecture Notes in Statistics 13, Springer, 1982.

Pollard, D. Strong consistency of k-means clustering. Annals of Statistics, 9:135–140, 1981.

Pollard, D. A central limit theorem for k-means clustering. Annals of Probability, 10:919–926, 1982.

Pötzelberger, K. The quantization dimension of distributions. Submitted for publication, 1998.

Pötzelberger, K. The general quantization problem for distributions with regular support. Submitted for publication, 1998.

Pötzelberger, K. The consistency of the empirical quantization error. Submitted for publication, 1998.

Pötzelberger, K. and Strasser, H. Clustering and quantization by MSP-partitions. To appear in: Statistics and Decisions, 1999.

Rahnenführer, J. Multivariate permutation tests for clustered data. Technical report, 1999.


Steiner, G. Quantization and clustering with maximum information: Algorithms and numerical experiments. PhD thesis, Vienna University of Economics and Business Administration, 1999.

Strasser, H. Mathematical theory of statistics: Statistical experiments and asymptotic decision theory, volume 7 of De Gruyter Studies in Mathematics. de Gruyter, 1985.

Strasser, H. Data compression and statistical inference. Submitted for the Proceedings of the 6th Tartu Conference on Multivariate Statistics (Satellite meeting of ISI 52nd session), 1999.

Strasser, H. Towards a statistical theory of optimal quantization. Technical report, Department of Statistics, Vienna University of Economics and Business Administration, 2000.

Torgersen, E. N. Comparison of statistical experiments. Cambridge Univ. Press, 1991.