




Handbook of Statistics, Vol. 27
ISSN: 0169-7161
© 2008 Published by Elsevier B.V.
DOI: 10.1016/S0169-7161(07)27011-7

11

Cluster Analysis

William D. Shannon

Abstract

This chapter introduces cluster analysis algorithms for finding subgroups of objects (e.g., patients, genes) in data such that objects within a subgroup are more similar to each other than to objects in other subgroups. The workhorses of cluster analysis are the proximity measures used to indicate how similar or dissimilar objects are to each other. Formulae for calculating proximities (distances or similarities) are presented, along with issues related to scaling and normalizing variables. Three classes of clustering are presented next – hierarchical clustering, partitioning, and ordination or scaling. Finally, some recent examples from a broad range of epidemiology and medicine are briefly described.

1. Introduction

In medicine and epidemiology, the concept of patient subgroups is well established and used in practice. In cancer tumor staging the goal is to determine treatment strategy and prognosis based on the patient subgroup. The National Heart, Lung and Blood Institute at the NIH classifies (during the writing of this chapter) blood pressure levels as normal (<120/80), pre-hypertension (120/80–139/89), Stage 1 hypertension (140/90–159/99), and Stage 2 hypertension (>159/99). In spatial epidemiology, disease clusters are found for planning healthcare delivery or for identifying causes of the disease.

To understand and motivate this work, it is valuable to have a basic overview of some modern statistical clustering algorithms. These tools can be applied to biomedical data to identify patients within subgroups who are likely to have a similar natural history of their disease, similar treatment responses, and similar prognoses. This chapter addresses the problem of cluster analysis or unsupervised learning, where the goal is to find subgroups or clusters within data when group membership is not known a priori. These might be clusters of patients, genes, disease groups, species, or any other set of objects that we wish to put into homogeneous subsets. The assumption of any cluster analysis is that the objects within a cluster are in some sense more similar to each other than to objects in other subgroups.




In contrast to cluster analysis is the classification or supervised learning problem. In classification the object's subgroup membership is known from the data, such as cases versus controls in an epidemiological study or responder versus non-responder in a clinical trial. The goal of the classification model is to use covariates or features of the objects with known class membership to develop a mathematical model to predict class membership in future objects where their true classification is not known. There are a large number of statistical and computational approaches for classification, ranging from classical statistical linear discriminant analysis (Fisher, 1936) to modern machine-learning approaches such as support vector machines (Cristianini and Shawe-Taylor, 2000) and artificial neural networks (Bishop, 1996). Classification models as described here are distinct from cluster analysis and will not be discussed further in this chapter. However, cluster analysis or unsupervised learning is often referred to as classification, leading to confusion, though the context of the problem should make it clear which is being considered: if the data contain a variable with a class membership label, then classification as described in this paragraph is meant; when no class membership variable is present in the data, cluster analysis is meant. The remainder of this chapter will focus on cluster analysis.

The concept of cluster analysis is most easily understood through a visual representation. In fact, cluster analysis should be thought of as an exploratory data analysis tool for data reduction where multivariate data are displayed to uncover patterns (Tukey, 1977). In Fig. 1, we show visual clustering of 2-dimensional data (x, y) with four distinct clusters labeled A, B, C, and D. It is clear that the objects within each cluster are more similar to each other in terms of their X, Y values than they are to objects in the other clusters. Each of the three clustering methods shown in Fig. 1 is discussed in more detail later in this chapter.

In multivariate data with more than 2 or 3 variables, identifying clusters through direct visualization is impossible, and a cluster analysis algorithm is required. There are three major classes of cluster analysis – hierarchical, partitioning, and ordination or scaling – displayed in Fig. 1. Hierarchical cluster analysis clusters objects by proximity, in this case a distance measure, and displays them in a tree or dendrogram (e.g., Everitt et al., 2001; Gordon, 1999). Objects labeled at the tips of the tree are connected to each other by the branches of the dendrogram. Objects connected early, or at a lower height, are more similar, as is seen with the four subgroups A–D. Objects connected at a higher level are further apart, such as the objects between the four subgroups. Cluster analysis by partitioning produces boundaries between clusters so that points on one side of a boundary belong to one cluster while points on the other side of the boundary belong to the other cluster (Hartigan, 1975; Hartigan and Wong, 1979). In this example the boundaries are precise, though boundaries can be fuzzy or defined by probability vectors. Cluster analysis by ordination or scaling uses a projection of the data from many dimensions onto a few dimensions that can be displayed visually (Cox and Cox, 2001). In this example we projected the 2-dimensional X, Y data onto the X-axis, though in practice ordination often projects multi-dimensional data onto linear combinations of the dimensions or new arbitrary dimensions.


Fig. 1. Display of the three classes of cluster analysis discussed in this chapter (panels: visual clustering, hierarchical clustering, partitioning, and ordination/scaling).


For a broad overview of the multivariate statistics used in cluster analysis, the reader is referred to Timm (2002). For a broad overview of both unsupervised and supervised learning methods from both the statistics and machine-learning literature, the reader is referred to Hastie et al. (2001). For a broad overview of the application of these methods to biological data, the reader is referred to Legendre and Legendre (1998). Each of these references covers hierarchical and other clustering methods in more mathematical detail than presented here and shows their application to data for illustration.

2. Proximity measures

2.1. Some common distance measures

Fundamental to cluster analysis is the concept of proximity of two objects to each other, measured in terms of a numerical value indicating the distance or similarity between the pair. For example, let two objects x and y be represented by points in Euclidean n-dimensional space, x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).


Table 1
Three commonly used distance measures on continuous variables

Distance | Formula | Common name
1-norm | $\sum_{i=1}^{n} |x_i - y_i|$ | Manhattan distance
2-norm | $\left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}$ | Euclidean distance
Infinity norm | $\lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max\left\{ |x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n| \right\}$ | Chebyshev distance


The commonly used Minkowski distances of order p, or p-norm distances, are defined in Table 1, where p ≥ 1.

Each of the example Minkowski distances (e.g., Manhattan, Euclidean, Chebyshev) has an intuitive sense of proximity. The Manhattan distance is how many blocks one would travel walking through downtown Manhattan, if blocks were laid out as a grid (i.e., go three blocks east and turn north for two blocks). The Euclidean distance is our normal sense of distance as measured by a ruler. The Chebyshev distance represents the distance along the largest dimension. These are illustrated for a distance between points X and Y, denoted by d(x, y), in Fig. 2.
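These three distances are easy to verify numerically. The short sketch below is not from the chapter; the coordinates are assumed so that X and Y differ by one unit on one axis and two on the other, as in Fig. 2, and SciPy's built-in distance functions are used.

```python
# Minkowski distances of Table 1 for two points differing by (1, 2).
import numpy as np
from scipy.spatial import distance

x = np.array([0.0, 0.0])  # point X (hypothetical coordinates)
y = np.array([1.0, 2.0])  # point Y, one block east and two blocks north

print(distance.cityblock(x, y))  # Manhattan (1-norm):        3.0
print(distance.euclidean(x, y))  # Euclidean (2-norm):        2.236...
print(distance.chebyshev(x, y))  # Chebyshev (infinity norm): 2.0
```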

Distances can also be calculated using categorical variables (Table 2). Let our objects x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) be points in an n-dimensional space where each dimension is represented by a categorical variable. If we let wj = 1 if neither xj nor yj is missing, and wj = 0 otherwise, then we can calculate 'matching' distances between objects X and Y. In the simplest example, think of the object vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) as strings of 0's and 1's so that

$$d_{ixy} = \begin{cases} 0 & \text{if } x_i = y_i = 0 \text{ or } x_i = y_i = 1, \\ 1 & \text{if } x_i = 0, y_i = 1 \text{ or } x_i = 1, y_i = 0. \end{cases}$$

The Hamming and matching distances are then the number of non-matching variables between x and y, either normalized by the number of non-missing cases (matching metric) or not (Hamming distance). Numerous other categorical distance measures are available, often based on contingency table counts (e.g., the Jaccard distance).

In many applied problems there is a mixture of continuous and categorical variables.


Fig. 2. Geometric display of three common distance measures: Manhattan d(x,y) = 1 + 2 = 3, Euclidean d(x,y) = 2.24, Chebyshev d(x,y) = 2.

Table 2
Two commonly used distance measures on categorical variables

Distance | Formula
Hamming | $\sum_{i=1}^{n} w_i d_{ixy}$, where $d_{ixy} = 0$ if $x_i = y_i$ and $d_{ixy} = 1$ if $x_i \ne y_i$
Matching | $\sum_{i=1}^{n} w_i d_{ixy} \Big/ \sum_{i=1}^{n} w_i$, where $d_{ixy} = 0$ if $x_i = y_i$ and $d_{ixy} = 1$ if $x_i \ne y_i$


Distances can still be calculated between pairs of objects in this case using the Gower distance metric, which combines distances obtained using a standard p-norm metric on the continuous variables (e.g., Euclidean) with distances obtained using a matching-type distance measure on the categorical variables (e.g., Hamming).
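A minimal sketch of such a mixed-type distance follows. It only illustrates the idea of combining a range-scaled continuous contribution with a 0/1 matching contribution; the function name, the example patients, and the variable ranges are assumptions for illustration, not the full Gower formulation with missing-value weights.

```python
# Combine range-scaled continuous differences with 0/1 categorical mismatches.
import numpy as np

def mixed_distance(a_cont, b_cont, a_cat, b_cat, ranges):
    # range-scaled absolute differences for the continuous variables
    cont = np.abs(np.asarray(a_cont, float) - np.asarray(b_cont, float)) / np.asarray(ranges, float)
    # 0/1 mismatch indicator for the categorical variables
    cat = np.array([ai != bi for ai, bi in zip(a_cat, b_cat)], dtype=float)
    # average contribution over all variables
    return float(np.concatenate([cont, cat]).mean())

# two hypothetical patients: (age, systolic BP) plus (sex, smoker)
d = mixed_distance([62, 141], [55, 128], ["F", "yes"], ["M", "yes"], ranges=[40.0, 60.0])
print(round(d, 3))
```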

2.2. Definition of distance measures

The above list of distance measures (also called metrics) was not meant to be comprehensive in any sense, but rather an introduction to the commonly used distances and to the idea of distances measured on categorical and mixed data types.


Table 3
Criteria for distance measures

Rule | Definition
d(x,y) ≥ 0 | The distance between two objects X and Y is non-negative, and equal to 0 only when the two objects are the same, i.e., X = Y
d(x,y) = d(y,x) | The distance between two objects is symmetric: going from X to Y is the same distance as going from Y to X
d(x,y) ≤ d(x,z) + d(y,z) | The distance between two objects X and Y will always be less than or equal to the sum of the distances between X and Z and between Y and Z (triangle inequality)


Implicit in each of the distance measures, however, is the necessity to meet certain formal criteria, which are presented here.

Let objects x = (x1, x2, ..., xn), y = (y1, y2, ..., yn), and z = (z1, z2, ..., zn) be points in an n-dimensional space. Denote the distance between any pair of them by d(a, b). Then d(a, b) is a distance measure if each of the criteria in Table 3 holds.

A fourth criterion, d(x,y) ≤ max{d(x,z), d(y,z)}, known as the strong triangle or ultrametric inequality, makes the space ultrametric. This states that every triangle in the ultrametric space connecting any three objects is isosceles (i.e., at least two of the sides have equal length: d(x,y) = d(y,z) or d(x,z) = d(y,z) or d(x,y) = d(z,x)). Ultrametric spaces have nice mathematical properties that make them amenable to certain types of problems (e.g., phylogenetic tree construction in evolution). They are not routinely used in medicine and epidemiology, though they could be a valuable addition to biostatistical data analysis. An example of an ultrametric cluster analysis in medicine would be one in which all patients within a disease subgroup are equally distant from all patients in a different disease subgroup. This may have the potential to refine disease prognosis into more homogeneous subgroups but, as far as we know, it has not been formally explored.

2.3. Scaling

Although clustering methods can be applied to the raw data, it is often more useful to precede the analysis by standardizing the data. Standardization in statistics is a commonly used tool to transform data into a format needed for meaningful statistical analysis (Steele and Torrie, 1980). For example, variance stabilization is needed to fit a regression model to data where the variance for some values of the outcome Y may be large, say for those values of Y corresponding to large values of the predictor variable X, while the variance of Y is small for those values corresponding to small values of X (i.e., heteroscedasticity). Another use of standardization is to normalize the data so that a simple statistical test (e.g., t-test) can be used.

Scaling or transformation of data for cluster analysis has a different purpose than meeting assumptions of statistical tests as described in the previous paragraph. Cluster analysis depends on a distance measure that is most likely sensitive to differences in the absolute values of the data (scale).



Consider a hypothetical situation where multiple lab tests have been measured on a patient and each test has a continuous value. Suppose further that the values for all but one test are normally distributed with mean 100 and variance 1, and the values for the remaining test are normally distributed with mean 100 and variance 10. Using the raw data, the distance becomes dependent nearly exclusively on this one test with high variance, as illustrated in Fig. 3, where each patient's lab values are shown connected by a line. On visual inspection, we see that the distances between patients on the first 5 lab tests will be small compared to the distances between patients for lab test 6. If distances were calculated only on lab tests 1–5, the average Euclidean distance would be 3.24, while including lab test 6 the average Euclidean distance is 10.64, resulting in the cluster analysis being driven primarily by this last lab test.

To avoid this complication the analyst can weight the variables in the distance calculation (all distance measures described above have a corresponding variable weighting formulation that can be found in any standard cluster analysis reference, e.g., Everitt et al., 2001) or find an appropriate data transformation to have the variables scaled equivalently. In the case where the variables are all normally distributed, as in this example, a z-score transformation would be appropriate.
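The dominance of a single high-variance variable, and its removal by a z-score transformation, can be sketched as follows. This is a minimal illustration with simulated values; the sample size and random seed are arbitrary assumptions.

```python
# Six lab tests: five with variance 1 and one with variance 10.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
labs = rng.normal(100, 1, size=(20, 6))
labs[:, 5] = rng.normal(100, 10 ** 0.5, size=20)  # mean 100, variance 10

print(pdist(labs[:, :5]).mean())  # average distance using tests 1-5 only
print(pdist(labs).mean())         # test 6 inflates the distances

z = (labs - labs.mean(axis=0)) / labs.std(axis=0)  # z-score each column
print(pdist(z).mean())            # all tests now contribute comparably
```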

A second scaling issue is easily shown in time series data, though it applies to any type of data. Suppose a lab test measured on a continuous scale is taken at multiple times in patients.

Fig. 3. The effect of high variability in different variables on between-subject distances (patient lab test profiles across lab tests 1–6).


Fig. 4. The effect of standardization on outcome profiles (time series profiles of lab test values for patients A, B, and C, shown raw and centered).


The goal of such a study might be to find clusters of patients with the same absolute values of the lab test or to find patients with the same time series pattern of the lab test. In Fig. 4 we display three patients, two of whom (A and C) show a decrease in their lab values at the sixth time point, and one (B) who shows no change. In the top graph, showing the raw values, patients A and B are more similar. In the bottom graph, where we have shifted the profiles to be centered at 0, we see that patients A and C are more similar. The result is that clustering on the un-shifted data will find clusters of patients with similar raw lab values, while clustering on the shifted lab values will find clusters of patients with similar changes in pattern over time.

This section introduced the concept of scaling and shifting variables in cluster analysis. The important point to remember is that the cluster analysis results will be drastically affected by the choice of scale of the data. In the first example distances are dominated by the variance in a single variable, and in the second example distances are dominated by the value locations. No single rule for transforming the data exists, but it is important for the analyst to think through these issues and understand that the choices made will significantly impact the results obtained. By stating clearly before the analysis what the goal is (e.g., find patients with similar lab values or find patients with similar changes in patterns), the appropriate transformations can likely be found.

2.4. Proximity measures

Implicit in any cluster analysis is the concept of proximity, whether defined in terms of distance or similarity.



Several cluster analysis methods, such as hierarchical clustering and some ordination methods, require, in addition to a way to calculate proximities between objects, the calculation of a proximity (say, distance) matrix giving pairwise distances between all pairs of objects. The clustering algorithm uses this matrix as the input to find the clusters.

If $O_i$, $i = 1, \ldots, N$, denote the objects to be clustered (e.g., patients), $X_j$, $j = 1, \ldots, P$, denote the variables measured on the objects, and $x_{i,j}$ denotes the value of variable $X_j$ in object $O_i$, then the Euclidean distance, say, between two objects $i$ and $i'$ is

$$d_{i,i'} = \sqrt{(x_{i,1} - x_{i',1})^2 + (x_{i,2} - x_{i',2})^2 + \cdots + (x_{i,P} - x_{i',P})^2}.$$

By repeating this calculation for all pairs, the $N \times P$ raw data table is transformed into the matrix $D$ of pairwise distances between objects:

$$\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,P} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,P} \\
x_{3,1} & x_{3,2} & \cdots & x_{3,P} \\
x_{4,1} & x_{4,2} & \cdots & x_{4,P} \\
x_{5,1} & x_{5,2} & \cdots & x_{5,P} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N,1} & x_{N,2} & \cdots & x_{N,P}
\end{bmatrix}
\;\Rightarrow\;
D = \begin{bmatrix}
0 & d_{1,2} & d_{1,3} & d_{1,4} & d_{1,5} & \cdots & d_{1,N} \\
 & 0 & d_{2,3} & d_{2,4} & d_{2,5} & \cdots & d_{2,N} \\
 & & 0 & d_{3,4} & d_{3,5} & \cdots & d_{3,N} \\
 & & & 0 & d_{4,5} & \cdots & d_{4,N} \\
 & & & & 0 & \cdots & d_{5,N} \\
 & & & & & \ddots & \vdots \\
 & & & & & & 0
\end{bmatrix},$$

where d1,2 is the distance between objects 1 and 2, d1,3 the distance between objects 1 and 3, etc. Only the upper triangle of the distance matrix is shown because of symmetry, where d1,2 = d2,1, d1,3 = d3,1, etc. The diagonal of a distance matrix is 0 since the distance from an object to itself is 0. A similarity matrix often scales similarities to lie between 0 and 1, making the diagonal elements all 1.

A proximity matrix measured on N objects will have N(N−1)/2 entries in the upper triangle. The size of the distance matrix becomes a problem when many objects are to be clustered. The increase in the number of distances limits hierarchical clustering and some ordination methods to a small number of objects. To illustrate, for N = 10 there are 10·9/2 = 45 pairwise distances, for N = 100 there are 4,950 pairwise distances, for N = 1,000 there are 499,500 pairwise distances, and for N = 10,000 there are 49,995,000 pairwise distances. When the number of pairwise proximities becomes too large to calculate and process efficiently, partitioning methods that do not require pairwise distance matrices (e.g., k-means clustering) should be used for the cluster analysis. What this size is will depend on the computer resources available to the data analyst.
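A small sketch of the raw-data-to-distance-matrix step is shown below, assuming a toy table of N = 5 objects and P = 3 continuous variables; SciPy returns the N(N−1)/2 upper-triangle entries in 'condensed' form, which can be expanded to the full symmetric matrix.

```python
# From an N x P data table to the pairwise Euclidean distance matrix D.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                 # N = 5 objects, P = 3 variables

d_condensed = pdist(X, metric="euclidean")  # the N(N-1)/2 pairwise distances
D = squareform(d_condensed)                 # full symmetric matrix, zero diagonal

print(d_condensed.shape)                    # (10,) for N = 5
print(np.round(D, 2))
```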

3. Hierarchical clustering

3.1. Overview

One of the major cluster analysis tools used is hierarchical clustering (Everitt and Rabe-Hesketh, 1997), where objects are either joined sequentially into clusters (agglomerative algorithms) or split from each other into subgroups (divisive clustering).



In most applications agglomerative clustering predominates, and it will be the focus of this section. Agglomerative clustering begins with each object separate and finds the two objects that are nearest to each other. These two objects are joined (agglomerated) to form a cluster, which is now treated as a new object. The distances between the new object and the other objects are calculated and the process repeated by joining the two nearest objects. This algorithm repeats until every object is joined. Several examples of hierarchical clustering algorithms are presented in Table 4.

An example of hierarchical clustering would be to identify prognosis subgroups where development of additional symptoms results in different diagnoses. In this case the presence of severe symptoms could be viewed hierarchically as being a subset of the patients with less severe symptoms.

We illustrate this iterative process in Fig. 5 using a centroid cluster analysis algorithm (this and other cluster algorithms are defined below). In the first step objects A and B, which are nearest each other, are joined (as indicated by the line) and a new object at the midpoint of this line is used to represent this cluster. In the second step objects D and E are joined. In the third step the new cluster AB is joined to object C. In the fourth and last step the cluster ABC is joined with cluster DE.

For this 2-dimensional problem it is easy to visualize the clustering in a scatter plot. For higher dimensional data (as well as 1- and 2-dimensional data like Fig. 5), the dendrogram is able to represent the clustering process. In Fig. 6 the average linkage cluster analysis performed on the objects A–E in Fig. 5 is shown. Here, we see the same pattern in the iterative process where cluster AB is formed first at the lowest height, followed by DE, then ABC, and finally ABCDE. The advantage of the dendrogram over the scatter plot representation is that the dendrogram includes a measure of the distance at which objects are merged on the vertical axis.

In each step of the clustering, as objects are merged, the proximity matrix is modified to reflect the new number of objects and the new measures of proximity. In the first step we merged A and B, whose distance was the smallest at 0.44.

$$\begin{array}{c|ccccc}
 & A & B & C & D & E \\ \hline
A & 0 & 0.44 & 1.12 & 3.33 & 3.49 \\
B & & 0 & 1.43 & 3.47 & 3.50 \\
C & & & 0 & 2.33 & 2.73 \\
D & & & & 0 & 1.06 \\
E & & & & & 0
\end{array}$$

The AB cluster was formed and the distance matrix updated to show the distance between AB and the other objects. From this matrix the objects D and E are merged, whose distance is smallest at 1.06.

$$\begin{array}{c|cccc}
 & AB & C & D & E \\ \hline
AB & 0 & 1.26 & 3.39 & 3.49 \\
C & & 0 & 2.33 & 2.72 \\
D & & & 0 & 1.06 \\
E & & & & 0
\end{array}$$


Table 4
Five commonly used hierarchical clustering algorithms

Average linkage: $D_{KL} = \dfrac{1}{N_K N_L} \sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j)$. The distance between two clusters is the average of all the pairwise distances between the members of one cluster and the members of the other cluster. These tend to be small clusters with equal variance.

Centroid method: $D_{KL} = \| \bar{x}_K - \bar{x}_L \|^2$. The distance between two clusters is the distance between the cluster centroids or mean vectors; the method is resistant to outliers.

Complete linkage: $D_{KL} = \max_{i \in C_K} \max_{j \in C_L} d(x_i, x_j)$. The distance between two clusters equals the maximum distance between all the members of one cluster and all the members of the other cluster. These tend to be clusters with equal diameters across the space of the objects but are subject to distortion by outliers.

Single linkage: $D_{KL} = \min_{i \in C_K} \min_{j \in C_L} d(x_i, x_j)$. The distance between two clusters equals the minimum distance between all the members of one cluster and all the members of the other cluster. These tend to be 'stringy' elongated clusters and have difficulty finding small compact clusters.

Ward's minimum-variance method: $D_{KL} = \dfrac{\| \bar{x}_K - \bar{x}_L \|^2}{(1/N_K) + (1/N_L)}$. This method combines clusters with similar variances to produce homogeneous clusters. It assumes that the variables are distributed as multivariate normal, and clusters tend to have similar size and distinct separation.



Fig. 5. Example of clustering order of five objects.

Fig. 6. Dendrogram representation of the clustering of the five objects in Fig. 5.




In the next step objects AB and C are merged, whose distance is smallest at 1.26.

$$\begin{array}{c|ccc}
 & AB & C & DE \\ \hline
AB & 0 & 1.26 & 3.40 \\
C & & 0 & 2.48 \\
DE & & & 0
\end{array}$$

In the final step clusters ABC and DE were merged, completing the algorithm.

$$\begin{array}{c|cc}
 & ABC & DE \\ \hline
ABC & 0 & 2.90 \\
DE & & 0
\end{array}
\;\Rightarrow\;
\begin{array}{c|c}
 & ABCDE \\ \hline
ABCDE & 0
\end{array}$$

Once a dendrogram is fit to the data, the decision as to where to cut it to produce distinct clusters is made. In Fig. 7, we fit a dendrogram to the data from Fig. 1 that were visually clustered into four distinct subgroups labeled A, B, C, and D. Within each cluster were six objects labeled A1–A6, B1–B6, etc. The horizontal dashed lines in Fig. 7 show how the dendrogram can be cut to produce from 1 to 4 clusters. In fact, we could choose anything between not cutting the dendrogram at all, leaving all the objects merged into a single cluster, and cutting low enough to keep each object in its own cluster. Criteria for deciding how to cut dendrograms are discussed below.

Fig. 7. How hierarchical clustering is split into different cluster numbers (stopping rule).
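A minimal sketch of this fit-then-cut workflow is shown below, assuming four well-separated bivariate normal clusters similar to Fig. 1; average linkage and the choice of four clusters are illustrative assumptions, not the chapter's own analysis.

```python
# Fit an average-linkage dendrogram and cut it into a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(6, 2)) for c in centers])

Z = linkage(X, method="average", metric="euclidean")  # sequence of agglomerative merges
labels = fcluster(Z, t=4, criterion="maxclust")       # cut the tree to give 4 clusters
print(labels)
```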



4. Partitioning

Hierarchical clustering algorithms proceed by sequentially merging objects or clusters into larger clusters based on some distance criterion. These algorithms, however, tend to be computationally intensive and begin breaking down for larger datasets (the size depending on the computer resources available). In addition, these algorithms tend to be less dependent on data distributions, with the exception of a few such as Ward's method, and so do not take advantage of probability models. In this section, we introduce two types of partitioning clustering – k-means and model-based clustering – that can be used for very large datasets or when a probability model is assumed.

Partitioning attempts to split the space directly into regions where objects falling in the same region belong to the same cluster. The boundaries defining these regions can be defined differently and may be hard thresholds or based on probability of membership. In Fig. 8, data are generated from one of four bivariate normal distributions. Two decision boundaries are overlaid on the data. The solid straight lines represent the type of boundary obtained from a k-means clustering, where objects are clustered according to the side of these lines they fall on. The dashed contour lines represent probability distributions and are the type of decision boundaries obtained from model-based clustering. Each object has a probability of belonging to each of the four groups and is assigned to the group to which it has the highest probability of belonging.

Fig. 8. Display of both a k-means partition (solid line boundary) and model-based clustering (dashed lines indicating density estimates).



4.1. k-Means clustering

The k-means algorithm directly partitions the space into k non-overlapping regions, where k is specified by the analyst. This algorithm is useful for very large datasets where a hierarchical relationship is not being sought. This might include problems of clustering patients by disease category where the development of one category is not dependent on passing through a different category. In contrast, hierarchical clustering assumes that lower branches or clusters on the dendrogram possess the same symptoms as clusters above them on the dendrogram.

The k-means algorithm is simple to implement. Assume each object is represented by a vector x = (x1, x2, ..., xn) and the analyst wants to divide the objects into K clusters. The algorithm starts with either K random or user-specified vectors from the space to represent the starting cluster centers, which we denote by $\bar{x}_k$ for the kth cluster, k = 1, ..., K. The distance from each object to each of these initial centers, $d(x_i, \bar{x}_k)$, is calculated, with each object being assigned to the center that it is closest to. If we define a cluster of objects as $C_k$, a subset of all objects {1, 2, ..., N}, then the k-means algorithm assigns each object $x_i$ to the nearest cluster mean, i.e., to the cluster $C_k$ achieving $\min_{k=1,\ldots,K} d(x_i, \bar{x}_k)$. All objects assigned to the same center form a distinct cluster. The algorithm recalculates the cluster centers $\bar{x}_k$ by averaging the individual vectors $x_i \in C_k$. The algorithm repeats by calculating the distance of each object to the new cluster centers, reassigning each object to its new nearest cluster center, and iterating this process until none of the objects change clusters.
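The iteration just described can be written in a few lines. The sketch below is a minimal NumPy version assuming Euclidean distance and user-supplied starting centers; production work would normally use an existing implementation (e.g., scikit-learn's KMeans), which adds better initialization and empty-cluster handling.

```python
# Minimal k-means loop: assign to nearest center, recompute centers, repeat.
import numpy as np

def kmeans(X, centers, max_iter=100):
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # distance of every object to every current center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # assign to the nearest center
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):     # no assignments change: stop
            break
        centers = new_centers
    return labels, centers
```

For example, kmeans(X, [(-1, 1), (1, 1), (1, -1), (-1, -1)]) starts from the four quadrant centers used to initialize the k-means run shown in Fig. 9.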

In Fig. 9 we see three iterations of the k-means algorithm in the first column for k = 4. We initialized this algorithm with four centers located at (−1, 1), (1, 1), (1, −1), and (−1, −1), defining the four clusters by the quadrants (i.e., all objects in the upper right quadrant belong to the (1, 1) cluster). After the first iteration the cluster centers (dark dots) have moved part of the way towards the 'true' cluster centers located at (−2.5, 2.5), (1.5, 1.5), (−1, −2), and (4, −3). Overlaying this plot is the decision boundary which assigns objects to one of the four cluster centers. In the second iteration the cluster centers have converged on the true centers and the decision boundary is finalized.

Any appropriate distance measure can be used in k-means clustering. However, there is often an assumption of multivariate normality in the data, and the algorithm is implemented using the Mahalanobis distance measure. Let x = (x1, x2, ..., xn) be the object and $\bar{x}_k$ be the cluster means as above, and let $S_k^{-1}$ be the inverse of the estimated covariance matrix for the kth cluster. Then the Mahalanobis distance, used routinely in multivariate normal distribution theory, is defined as

$$D_{ik}^2 = (x_i - \bar{x}_k)^T S_k^{-1} (x_i - \bar{x}_k).$$
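As a small illustration, the squared Mahalanobis distance above can be computed directly; the cluster mean and covariance matrix used below are made-up values, not taken from the chapter.

```python
# Squared Mahalanobis distance of an object from a cluster mean.
import numpy as np

def mahalanobis_sq(x, xbar, S):
    diff = np.asarray(x, float) - np.asarray(xbar, float)
    return float(diff @ np.linalg.inv(S) @ diff)

S_k = np.array([[2.0, 0.3],
                [0.3, 1.0]])                      # hypothetical cluster covariance
print(mahalanobis_sq([1.0, 2.0], [0.0, 0.0], S_k))
```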

4.2. Model-based clustering

If we assume the data from cluster k were generated by a probability model $f_k(x; \theta)$ with parameters $\theta$, model-based clustering allows a maximum likelihood estimation approach to determine cluster membership. For our objects x = (x1, x2, ..., xn), we can define a vector of cluster assignments $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n)^T$, where $\gamma_i = k$ if object $x_i$ belongs to cluster k.


Fig. 9. Iterative process of k-means and model-based partitioning (columns: k-means and model-based; rows: successive iterations).


The parameters $\theta$ and the cluster membership vector $\gamma$ can be estimated by maximizing the likelihood

$$L(x; \theta, \gamma) = \prod_{k=1}^{K} f_{\gamma_k}(x; \theta),$$

where $f_{\gamma_k}$ is the distribution for the objects in cluster k. If we assume the distribution for each of the K clusters is multivariate normal, then the likelihood function is

$$L(x; \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K, \gamma) = \prod_{k=1}^{K} \prod_{i \in C_k} (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right),$$

where $C_k$ is the subset of objects in cluster k. This imposes significant structural assumptions on the data. However, accurate algorithms exist (e.g., EM) for maximum likelihood estimation of the parameters, including the class membership vector $\gamma$. In most applications the user will specify the desired covariance structure, which defines other criteria to be optimized.

Whichever criterion is optimized, the model-based search is an iterative process like the k-means algorithm.



The model-based column of plots in Fig. 9 shows how the iterations for the same data used in the k-means example might appear in model-based clustering. In this example the three iterations of k-means cluster centers were used as $\hat{\bar{x}}_1, \ldots, \hat{\bar{x}}_K$ and we assumed that $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$. In the first plot the probability densities appear as one, and as we move to the second and third iterations we see a clear separation of the probability masses into four distinct clusters.

An excellent general reference for model-based clustering is McLachlan and Peel (2000).
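A minimal sketch of model-based clustering under the multivariate normal assumption is shown below; scikit-learn's EM-based GaussianMixture stands in for the likelihood maximization described above (it is not the software used in the chapter), and the simulated cluster centers reuse the 'true' centers quoted in the k-means example.

```python
# Gaussian mixture (model-based) clustering with hard and soft assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
centers = np.array([[-2.5, 2.5], [1.5, 1.5], [-1.0, -2.0], [4.0, -3.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

gm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X)
labels = gm.predict(X)        # hard assignment: highest-probability cluster
probs = gm.predict_proba(X)   # soft assignments: probability of membership
print(labels[:10], probs[0].round(3))
```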

5. Ordination (scaling)

Ordination or scaling methods project data from many dimensions to one, two, or a few dimensions while maintaining relevant distances between objects. Two objects that are far apart (close together) in the high dimensional space will be far apart (close together) in the lower dimensional space. In this lower dimensional space visual clustering can be done. Perhaps the best-known ordination method in multivariate statistics is principal components analysis (PCA), where variables are linearly transformed into new coordinates in which, hopefully, the first two or three contain the majority of the information present in all the variables.

5.1. Multi-dimensional scaling

Multi-dimensional scaling (MDS) takes a proximity matrix measured on a set of objects and displays the objects in a low dimensional space such that the relationships of the proximities in this low dimensional space match the relationships of the distances in the original proximity matrix. A classical example of MDS is the visualization of cities determined by the flying distances between them. In the distance matrix we show the distances between 10 US cities, where Atlanta is 587 miles from Chicago, 1,212 miles from Denver, etc. (Table 5).

This distance matrix defines a set of pairwise relationships for this set of cities but offers no clue as to their location in the US – their latitude and longitude. However, MDS can display these cities in a 2-dimensional projection to see if the physical locations can be estimated. In Fig. 10, we show the result of this projection and observe that it does in fact approximately reproduce their locations relative to each other.

MDS models are fit by finding a data matrix in fewer dimensions, say 2 dimensions, that produces a proximity matrix similar to the one observed, whether that was generated from an existing data matrix or given directly, as in many psychological experiments where a subject is asked to state the similarity of objects. Let di,j be the distance between objects i and j, obtained either by calculating distances between the object vectors x = (x1, x2, ..., xn) or directly through a judgment experiment. MDS searches for a data representation for each object, say y = (y1, y2), so that the distances between the objects are similar to the di,j's, and so the objects can be displayed in a 2-dimensional scatter plot.


Table 5
Flying mileage between 10 American cities

City | Atlanta | Chicago | Denver | Houston | Los Angeles | Miami | New York | San Francisco | Seattle | Washington, DC
Atlanta | 0 | 587 | 1212 | 701 | 1936 | 604 | 748 | 2139 | 2182 | 543
Chicago | | 0 | 920 | 940 | 1745 | 1188 | 713 | 1858 | 1737 | 597
Denver | | | 0 | 879 | 831 | 1726 | 1631 | 949 | 1021 | 1494
Houston | | | | 0 | 1374 | 968 | 1420 | 1645 | 1891 | 1220
Los Angeles | | | | | 0 | 2339 | 2451 | 347 | 959 | 2300
Miami | | | | | | 0 | 1092 | 2594 | 2734 | 923
New York | | | | | | | 0 | 2571 | 2408 | 205
San Francisco | | | | | | | | 0 | 678 | 2442
Seattle | | | | | | | | | 0 | 2329
Washington, DC | | | | | | | | | | 0



Fig. 10. Example of multi-dimensional scaling assigning relative positions of the US cities.


If we let $d_{i,j}$ be the original distances (or proximities) we are working with, and $d_{i,j}(y)$ be the distances calculated on our new data points y = (y1, y2), classical MDS attempts to find the data vectors y = (y1, y2) such that the following is minimized:

$$E_M = \sum_{i \ne j} \left( d_{i,j} - d_{i,j}(y) \right)^2.$$

This represents the square-error cost associated with the projection. Note that the scale and orientation of the y = (y1, y2) are arbitrary, and the map of the US cities in the above example may just as easily have been flipped from top to bottom or left to right. The goal of the MDS is not to obtain the exact values of the possibly unknown data vectors x = (x1, x2, ..., xn), but rather to obtain their pairwise spatial relationships.
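A minimal sketch of metric MDS on a precomputed distance matrix, in the spirit of the city-mileage example, is shown below; only a four-city subset of Table 5 is used to keep the matrix short, and scikit-learn's MDS is an assumed stand-in for whichever MDS routine the analyst prefers.

```python
# Metric MDS from a precomputed distance matrix to 2-D coordinates.
import numpy as np
from sklearn.manifold import MDS

cities = ["Atlanta", "Chicago", "Denver", "Houston"]
D = np.array([[0.0,    587.0, 1212.0,  701.0],
              [587.0,    0.0,  920.0,  940.0],
              [1212.0,  920.0,   0.0,  879.0],
              [701.0,   940.0,  879.0,   0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)     # 2-D coordinates; rotation and flips are arbitrary
for name, (u, v) in zip(cities, coords):
    print(f"{name:10s} {u:8.1f} {v:8.1f}")
```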

Other criteria for MDS exist that address specific data requirements. For example, Kruskal showed that if the data are ordinal the projected distances $d_{i,j}(y)$ should only match the observed distances $d_{i,j}$ on a rank ordering. By imposing a monotonically increasing function on the observed distances, denoted by $f(d_{i,j})$, that preserves their rank order, the criterion for non-metric MDS is

$$E_N = \frac{1}{\sum_{i \ne j} d_{i,j}(y)} \sum_{i \ne j} \left( f(d_{i,j}) - d_{i,j}(y) \right)^2.$$

Another commonly used MDS-like algorithm is known as Sammon's mapping or normalization, where the normalization allows small distances to be preserved and not overwhelmed by minimizing the squared-error costs associated with large distances. Sammon mapping minimizes

$$E_S = \sum_{i \ne j} \left( \frac{d_{i,j} - d_{i,j}(y)}{d_{i,j}} \right)^2.$$


Table 6
People in Caithness, Scotland, cross-classified by eye and hair color

Eye color \ Hair color | Fair | Red | Medium | Dark | Black
Blue | 326 | 38 | 241 | 110 | 3
Light | 688 | 116 | 584 | 188 | 4
Medium | 343 | 84 | 909 | 412 | 26
Dark | 98 | 48 | 403 | 681 | 85


Finding the points y = (y1, y2) requires a search. If the original data x = (x1, x2, ..., xn) are available, the search might begin with the first two principal components. If they are not available, the starting points y = (y1, y2) may be randomly generated. The search proceeds by iteration, where the new set of points y = (y1, y2) is generated from the previous set using one of several search algorithms until the change in the goodness-of-fit criterion falls below a user-defined threshold.

5.2. Correspondence analysis

Another important ordination procedure for categorical data, analogous to PCA and MDS, is correspondence analysis (CA). Table 6 shows a cross-classification of people in Caithness, Scotland, by eye and hair color (Fisher, 1940). This region of the UK is particularly interesting as there is a mixture of people of Nordic, Celtic, and Anglo-Saxon origin. In this table we find 326 people with blue eyes and fair hair, 38 with blue eyes and red hair, etc.

Ignoring the computational details, the projection of these variables produces the scatter plot in Fig. 11. From this display we find that people with blue or light eye color tend to have fair hair, people with dark eyes tend to have black hair, etc. The distances between these variables in this 2-dimensional projection give the relative strength of the relationships. For example, blue eyes and fair hair are strongly related, but blue eyes and dark hair are weakly related. Medium eye and hair color are strongly related to each other and moderately related to the other colors, as indicated by their appearance somewhat near the middle of the scatter plot.

Like MDS, this projection places the points on arbitrary scales. Also, in this example two categorical variables are used for illustration, but multiple variables can be projected onto a lower dimensional space.
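The coordinates behind a display like Fig. 11 can be computed from the singular value decomposition of the standardized residuals of the contingency table. The sketch below is a minimal simple correspondence analysis of the Table 6 counts; it is one common formulation, offered as an illustration rather than the chapter's own computation.

```python
# Simple correspondence analysis of the eye color x hair color table.
import numpy as np

N = np.array([[326,  38, 241, 110,  3],
              [688, 116, 584, 188,  4],
              [343,  84, 909, 412, 26],
              [ 98,  48, 403, 681, 85]], dtype=float)

P = N / N.sum()                                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)                   # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

rows = (U * sv) / np.sqrt(r)[:, None]      # eye color coordinates
cols = (Vt.T * sv) / np.sqrt(c)[:, None]   # hair color coordinates
print(np.round(rows[:, :2], 2))            # first two CA dimensions
print(np.round(cols[:, :2], 2))
```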

6. How many clusters?

The author of this chapter believes that cluster analysis is an exploratory data analysis tool only, and that methods to date that impose formal statistical inference to determine the correct number of clusters have not been fully developed and framed in such a way that they can be generally applied.


Fig. 11. Correspondence analysis display relating hair and eye color.


This includes work by the author that attempts to use a graph-valued probability model to decide the number of clusters by maximum likelihood (Shannon and Banks, 1999). However, many people make use of heuristic strategies for deciding the number of clusters, and these are discussed in this section.

6.1. Stopping rule

In hierarchical clustering the 'stopping rule' indicates where to split the tree. The dendrogram in Fig. 7 could be cut to form 1, 2, 3, or 4 clusters (indicated by the dashed horizontal lines). (In fact it could produce more clusters by cutting lower on the vertical axis.) Several methods have been suggested for deciding among these choices; they are defined and tested in the work by Milligan (1981). These are generally modifications of squared error or variance terms.



Using definitions given earlier in this chapter, three stopping rule criteria for deciding how many clusters are in the data are defined for each cut of the dendrogram:

$$R^2 = 1 - \frac{\sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2}{\sum_{i} \| x_i - \bar{x} \|^2},$$

$$\text{pseudo-}F = \frac{\left( \sum_{i} \| x_i - \bar{x} \|^2 - \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2 \right) \Big/ (K - 1)}{\left( \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2 \right) \Big/ (n - K)},$$

and

$$\text{pseudo-}t^2 = \frac{\sum_{i \in C_K \cup C_L} \| x_i - \bar{x}_{C_K \cup C_L} \|^2 - \sum_{i \in C_K} \| x_i - \bar{x}_{C_K} \|^2 - \sum_{i \in C_L} \| x_i - \bar{x}_{C_L} \|^2}{\left( \sum_{i \in C_K} \| x_i - \bar{x}_{C_K} \|^2 + \sum_{i \in C_L} \| x_i - \bar{x}_{C_L} \|^2 \right) \Big/ (n_K + n_L - 2)}.$$

These might be useful heuristics, and the number of clusters that maximizes them might be a reasonable number to use in the analysis. However, these statistics are not distributed according to any known distribution (e.g., F, t), and it is important not to attach probabilities to them. See Milligan and Cooper (1985) and Cooper and Milligan (1988) for a detailed examination of these and other statistics regarding their performance in estimating the number of clusters.
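A small sketch of these heuristics in use is given below; it computes R² and the pseudo-F statistic for several candidate numbers of clusters, using k-means partitions of simulated data as the candidate clusterings (the chapter applies the same quantities to dendrogram cuts), with the pseudo-F written in its usual (K − 1, n − K) form.

```python
# R-squared and pseudo-F for candidate numbers of clusters K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([np.array(c) + rng.normal(scale=0.6, size=(20, 2))
               for c in [(0, 0), (5, 0), (0, 5), (5, 5)]])
n = len(X)
total_ss = ((X - X.mean(axis=0)) ** 2).sum()

for K in range(2, 7):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    within_ss = km.inertia_           # sum of within-cluster squared distances
    r2 = 1 - within_ss / total_ss
    pseudo_f = ((total_ss - within_ss) / (K - 1)) / (within_ss / (n - K))
    print(K, round(r2, 3), round(pseudo_f, 1))
```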

6.2. Bayesian information criterion

When model-based clustering is used, more formal likelihood criteria are available. The Schwarz Information Criterion (also called the Bayesian Information Criterion) is one of those often used. Recall that if we are modeling K multivariate normal clusters, then the likelihood function is

$$L(x; \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K, \gamma) = \prod_{k=1}^{K} \prod_{i \in C_k} (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right).$$

Let log(L) denote the log-likelihood, m = 2k + 1 be the number of estimated parameters, and n be the number of objects. Then the Schwarz Information Criterion is

$$-2 \log(L) + m \log(n).$$

We select the number of clusters K that minimizes this criterion.
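A minimal sketch of BIC-based selection of K is shown below, assuming Gaussian mixture models fitted by EM; scikit-learn's bic() returns −2 log(L) + m log(n), so the number of components with the smallest value is preferred.

```python
# Choose the number of mixture components by BIC (smaller is better).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([np.array(c) + rng.normal(size=(60, 2))
               for c in [(-2.5, 2.5), (1.5, 1.5), (-1.0, -2.0), (4.0, -3.0)]])

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 8)}
best_k = min(bics, key=bics.get)
print(best_k, {k: round(v, 1) for k, v in bics.items()})
```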



7. Applications in medicine

A search of the NIH PubMed publication database using the MeSH term 'cluster analysis' resulted in 14,405 citations covering a wide range of areas of medicine. Here we provide a brief snapshot of some of them for reference.

Bierma-Zeinstra et al. (2001) used cluster analysis to analyze data on 224 consecutive patients being seen for hip pain. Ward's method for cluster analysis using medical history and physical exam results uncovered 10 distinct subgroups of patients. Subsequent examination of variables derived from X-rays and sonograms of their hip and knee regions showed significant correlation with these clusters. The medical history and physical exam results can then be used to classify patients into a likely diagnostic group without waiting for expensive imaging data.

Lei et al. (2006) used hierarchical and k-means clustering to determine what sizes of lumbar disc replacement prosthesis appliances can be used in patients. Analyzing radiological data on 67 patients, they were able to identify seven distinct device sizes that are widely used. If validated, this will reduce the number of disc replacement sizes that need to be manufactured and stocked, resulting in a possible improvement in healthcare delivery services.

Kaldjian et al. (2006) identified a list of factors that facilitate and impede voluntary physician disclosure of medical errors for patient safety, patient care, and medical education. Using a literature search they identified 316 articles reporting physician errors and extracted 91 factors from them, plus an additional 27 factors from a focus group, thought to be related to error reporting. Several hierarchical clustering algorithms were used, but what is unique about this paper (versus the others reported here) is the distance measure used. In this study, 20 physicians grouped the factors into from 5 to 10 groups based on factor similarity, in essence a 'conceptual' proximity. The results of this research identified responsibility to the patient, to themselves, to the profession, and to the community as factors that facilitated error reporting. Attitude, helplessness, anxiety and fear, and uncertainty were identified as factors impeding error reporting.

Other applications include medication adherence (Russell et al., 2006), prediction of post-traumatic stress disorder (Jackson et al., 2006), microarray data analysis (Shannon et al., 2002; Shannon et al., 2003), and clarification of the obsessive compulsive disorders spectrum (Lochner et al., 2005). This small sample is presented to show the range of applications and to introduce the reader to additional literature in the medical field showing how cluster analysis is applied.

8. Conclusion

This chapter has provided a brief overview of cluster analysis focusing on hierarchical, partitioning, and ordination methods. An overview of distance measures and the construction of pairwise distance matrices was presented, since these are fundamental tools within the field of cluster analysis.



Also, a very brief exposure to stopping rules for determining how many clusters to use, and to applications in medicine, was presented to allow the reader entry into the literature for these areas. Anyone new to cluster analysis and planning on using these tools will be able to find many introductory textbooks in the field, and most statistical software packages have clustering algorithm procedures in them. Those readers wanting to become more involved with cluster analysis are encouraged to visit the Classification Society of North America's web page at http://www.classification-society.org/csna/csna.html.

References

Bierma-Zeinstra, S., Bohnen, A., Bernsen, R., Ridderikhoff, J., Verhaar, J., Prins, A. (2001). Hip problems in older adults: Classification by cluster analysis. Journal of Clinical Epidemiology 54, 1139–1145.

Bishop, C. (1996). Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Cooper, M.C., Milligan, G.W. (1988). The effect of error on determining the number of clusters. Proceedings of the International Workshop on Data Analysis, Decision Support and Expert Knowledge Representation in Marketing and Related Areas of Research, pp. 319–328.

Cox, M.F., Cox, M.A.A. (2001). Multidimensional Scaling. Chapman & Hall, New York City.

Cristianini, N., Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, England.

Everitt, B., Rabe-Hesketh, S. (1997). The Analysis of Proximity Data. Wiley, New York City.

Everitt, B., Landau, S., Leese, M. (2001). Cluster Analysis, 4th ed. Edward Arnold Publishers Ltd, London.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188.

Fisher, R.A. (1940). The precision of discriminant functions. Annals of Eugenics (London) 10, 422–429.

Gordon, A.D. (1999). Classification, 2nd ed. Chapman & Hall/CRC Press, London.

Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.

Hartigan, J., Wong, M. (1979). A k-means clustering algorithm. Applied Statistics 28, 100–108.

Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York City.

Jackson, C., Allen, G., Essock, S., Foster, M., Lanzara, C., Felton, C., Donahue, S. (2006). Clusters of event reactions among recipients of Project Liberty mental health counseling. Psychiatric Services 57(9).

Kaldjian, L., Jones, E., Rosenthal, G., Tripp-Reimer, T., Hillis, S. (2006). An empirically derived taxonomy of factors affecting physicians' willingness to disclose medical errors. Journal of General Internal Medicine: Official Journal of the Society for Research and Education in Primary Care Internal Medicine 21, 942–948.

Legendre, P., Legendre, L. (1998). Numerical Ecology. Elsevier, New York City.

Lei, D., Holder, R., Smith, F., Wardlaw, D., Hukins, D. (2006). Cluster analysis as a method for determining size ranges for spinal implants: Disc lumbar replacement prosthesis dimensions from magnetic resonance images. Spine 31(25), 2979–2983.

Lochner, C., Hemmings, S., Kinnear, C., Niehaus, D., Nel, D., Corfield, V., Moolman-Smook, J., Seedat, S., Stein, D. (2005). Cluster analysis of obsessive compulsive spectrum disorders in patients with obsessive-compulsive disorder: Clinical and genetic correlates. Comprehensive Psychiatry 46, 14–19.

McLachlan, G., Peel, D. (2000). Finite Mixture Models. Wiley and Sons, New York City.

Milligan, G. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46(2), 187–199.

Milligan, G.W., Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179.

Russell, C., Conn, V., Ashbaugh, C., Madsen, R., Hayes, K., Ross, G. (2006). Medication adherence patterns in adult renal transplant recipients. Research in Nursing and Health 29, 521–532.


Shannon, W., Banks, D. (1999). Combining classification trees using maximum likelihood estimation. Statistics in Medicine 18(6), 727–740.

Shannon, W., Culverhouse, R., Duncan, J. (2003). Analyzing microarray data using cluster analysis. Pharmacogenomics 4(1), 41–51.

Shannon, W., Watson, M., Perry, A., Rich, K. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genetic Epidemiology 23(1), 87–96.

Steele and Torrie (1980). Principles and Procedures of Statistics: A Biometrical Approach. McGraw-Hill, New York City.

Timm, N. (2002). Applied Multivariate Analysis. Springer, New York City.

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Boston, MA.