
Chapter 2
Accelerating Lloyd’s Algorithm for k-Means Clustering

Greg Hamerly and Jonathan Drake

Abstract The k-means clustering algorithm, a staple of data mining and unsupervised learning, is popular because it is simple to implement, fast, easily parallelized, and offers intuitive results. Lloyd’s algorithm is the standard batch, hill-climbing approach for minimizing the k-means optimization criterion. It spends a vast majority of its time computing distances between each of the k cluster centers and the n data points. It turns out that much of this work is unnecessary, because points usually stay in the same clusters after the first few iterations. In the last decade researchers have developed a number of optimizations to speed up Lloyd’s algorithm for both low- and high-dimensional data.

In this chapter we survey some of these optimizations and present new ones. In particular we focus on those which avoid distance calculations by the triangle inequality. By caching known distances and updating them efficiently with the triangle inequality, these algorithms can provably avoid many unnecessary distance calculations. All the optimizations examined produce the same results as Lloyd’s algorithm given the same input and initialization, so they are suitable as drop-in replacements. These new algorithms can run many times faster and compute far fewer distances than the standard unoptimized implementation. In our experiments, it is common to see speedups of over 30–50x compared to Lloyd’s algorithm. We examine the trade-offs of using these methods with respect to the number of examples n, dimensions d, clusters k, and the structure of the data.

Keywords k-Means • Triangle inequality • Caching • Accelerate • Lloyd’s algorithm • Clustering • Unsupervised learning

G. Hamerly (✉)
Baylor University, 105 Baylor Ave., Waco, TX 76798, USA
e-mail: [email protected]

J. Drake
Hewlett-Packard Company, 14231 Tandem Blvd, Austin, TX 78728, USA
e-mail: [email protected]

© Springer International Publishing Switzerland 2015
M.E. Celebi (ed.), Partitional Clustering Algorithms,
DOI 10.1007/978-3-319-09259-1_2


2.1 Introduction

The k-means clustering algorithm is a very popular tool for data analysis and learning. At its heart is an easily-understood optimization problem: given a set of data points (in some vector space), try to position k other points (called ‘centers’) at locations that minimize the (squared) distance between each point and its closest center. While it is popular and easy to implement, the naive implementation for solving the problem is inefficient, wasting a lot of processing time on unnecessary and redundant computations.

This chapter discusses simple geometric methods, based on the triangle inequality and keeping cached bounds on computed distances, to reduce wasted computation and build much more efficient algorithms that give exactly the same output. We point out that all the accelerated algorithms we investigate in this chapter give exactly the same answer as Lloyd’s standard batch algorithm, given the same initialization. In our experiments, it is common to see speedups of over 30–50x compared to Lloyd’s algorithm.

2.1.1 Popularity of the k-Means Algorithm

The k-means algorithm is very widely used. In [40], it is chosen as one of the top ten data mining algorithms. It is implemented in many commercial and open-source statistical data analysis software packages, including MATLAB, SAS, Stata, SPSS, R, and Weka, to name a few. A simple search for ‘k means clustering’ on Google Scholar yields over 2.1 million results, which is greater than the number of results for ‘neural network’, ‘support vector machine’, ‘nearest neighbor’, or ‘logistic regression’.

Many applications benefit from using k-means. Just a few are:

• clustering the pixels of an image for image color quantization [7, 19],
• post-processing to decide the memberships in spectral clustering [29],
• selecting the codewords for vector quantization, enabling lossy compression of audio or image data [23],
• image segmentation [14, 19],
• unsupervised feature learning in single-layer neural networks [9],
• identifying self-similar behaviors in dynamic program execution for structured sampling of the behaviors [36],
• finding a good initialization for a more costly learning method [6], and
• finding good locations for basis functions in a radial basis function network [39].

Such a widely-used algorithm deserves to be well-studied and efficiently implemented.


2.1.2 The Standard k-Means Algorithm Does a Lot of Unnecessary Work

The k-means algorithm is popular due to its clarity, simplicity, and intuitive optimization function. As a result, it has been implemented, albeit inefficiently, many times over. We investigated the source code of the k-means implementations for the software packages ELKI, graphlab, Mahout, MATLAB, MLPACK, Octave, OpenCV, R, SciPy, Weka, and Yael, and found that none of them use the triangle inequality bound acceleration techniques that we discuss in this chapter, though they would benefit from doing so. Of those, the ones which provide scalable implementations primarily depend on some form of parallel processing, which is compatible with the acceleration methods presented here.

The standard methods for solving the k-means optimization problem are Lloyd’s algorithm [24] (a batch algorithm, also known as Lloyd–Forgy [13]), as well as MacQueen’s algorithm [26]. Each algorithm spends the vast majority of its time computing distances between the clustered points and the current cluster centers. However, much of the time these distance calculations are unnecessary, wasted computation. In this study we focus on improving Lloyd’s algorithm, which is widely used.

The basic reason why the standard batch methods for optimizing k-means are inefficient is that in each iteration they must identify the closest center for each clustered point. To do this, the methods naively compute all nk distances between each of the n clustered points and each of the k centers. After each iteration, the centers move and these distances may all change, requiring recomputation. But typically the centers don’t move much, especially after the first few iterations. Most of the time the closest center in the previous iteration remains the closest center. Thus, keeping track of the closest center for each clustered point is made much more efficient with some caching. When the closest center doesn’t change, ideally we shouldn’t need to compute the distance between that point and any cluster center. And even when the closest center for a point changes, it might be possible to avoid computing the distance from that point to all centers, instead looking only at a few centers that are guaranteed to be closer to that point than all other centers.

2.1.3 Previous Work on k-Means Acceleration

There is a healthy line of research on accelerating learning algorithms, which goes hand in hand with accelerating data retrieval algorithms such as k-nearest neighbor search. The primary methods of acceleration are: algorithmic improvements, parallelization (including threading, multiprocessing, and distributed computation), and approximation. In this chapter, we focus on algorithmic improvements which give exact answers.


2.1.3.1 Algorithmic Improvements

Pelleg and Moore [31] incorporated the popular k-d tree data structure in the task of accelerating the k-means method (note that the k in k-d tree is a naming clash with the k in k-means). Kanungo et al. [19] developed a similar algorithm. Both algorithms use the k-d tree to structure the data to be clustered. By restructuring the search for each point’s closest center, many distance calculations can be avoided. While these approaches are excellent in low dimension, k-d trees perform poorly in dimensions much greater than 8.

Moore [28] developed the anchors hierarchy, a new type of spatial data structure based on metric trees. Applying the triangle inequality, this structure organizes the data by enclosing it in hierarchically organized anchors with associated radii. The anchors hierarchy allows the k-means algorithm to avoid many provably unnecessary distance calculations, even in high dimension. We discuss this structure more in Sect. 2.3.2.

Elkan [12] started a line of research which pairs the triangle inequality directly with cached distance bounds to avoid unnecessary point-center distance calculations. Avoiding complicated hierarchical data structures and preprocessing, his algorithm simply caches O(nk) distance bounds to prune distance calculations. It uses the triangle inequality to efficiently update the bounds each time the centers move. Hamerly [15] reduced this overhead to O(n) bounds, which makes it much faster in practice for low- and medium-dimension datasets. Drake [11] bridged the gap between these two approaches by using an adaptive number of distance bounds, O(nb) where b < k and b can be learned online.

Other minor algorithmic improvements have also helped accelerate the standard k-means algorithm. Because k-means repeatedly seeks the minimum distance between a point and all k centers, using partial distortion search (PDS, also called partial distance search) [5] and some loop unrolling permits some distance calculations to be cut short without looking at all dimensions. Mean-distance-ordered partial search (MPS) [33] draws a connection between the squared distances and the squared difference of vector sums to help eliminate candidate k-means centers. Pan et al. [30] eliminate unlikely centers using the first and second moments of a vector and portions of the vector.

DHSS (dynamic hyperplanes shrinking search) [37] eliminates unlikely candidate centers by transforming the input space (e.g. using principal components analysis) and then using the new canonical dimensions to bound the closest centers to a point. It is unclear if this method will work well in higher-dimension spaces.

Several researchers, beginning with Kaukoranta et al., identified that some clusters found by Lloyd’s algorithm are ‘static’ over some iterations. In other words, no points join or leave the cluster from one iteration to the next. These static clusters are easily identified, as their centers do not move. Clusters which do change are called active. This information can be used to reduce the number of candidate centers for some points [20–22]. It’s worth noting that the triangle inequality methods which are the focus of this chapter implicitly exploit the same information and enjoy similar benefits.


2.1.3.2 Parallelization

Several machine learning packages have been constructed with the intent of improving the scalability and speed of the learning algorithms, making them applicable to large real-world problems. Packages like graphlab [25], Yael [41], and Mahout [2] provide scalable implementations of many machine learning algorithms, including k-means. They use different approaches: Yael focuses on low-level multithreaded optimized implementations, Mahout provides machine learning on a map-reduce infrastructure, and graphlab focuses on graph-structured algorithms and multi-core processing.

2.1.3.3 Alternative Heuristic Methods

Lloyd’s algorithm is a gradient descent heuristic algorithm for minimizing the k-means distortion criterion. While it is popular and is the focus of this chapter, there are alternative methods which also aim to reduce the k-means distortion using different heuristics.

Agarwal et al. [1] use subsamples of the whole dataset, called core-sets, and optimize a solution based on the sample. This algorithm can be much faster in practice on low-dimensional data, but can be slow and find poor solutions in higher dimensions.

Hartigan and Wong [17] suggested an algorithm for optimizing the k-means distortion which considers the clustered points one by one. Each time a point changes membership, the algorithm updates the affected center locations.

Sculley [35] developed a method that is a hybrid of stochastic gradient descent (as proposed by Bottou and Bengio [6]) and batch k-means. It operates on small samples rather than individual examples. The resulting algorithm is less susceptible to noise caused by individual examples, yet still quite fast.

Each of these alternative algorithms can be quite fast, much faster than Lloyd’s batch algorithm. The tradeoff is that they tend to produce different results. We view these as suitable alternatives to Lloyd’s algorithm, but focus on accelerating Lloyd’s algorithm due to its popularity. Some of these algorithms, especially those which repeatedly make nearest-center queries for the same points, may be compatible with the acceleration methods we discuss here.

2.2 Cluster Distortion and Lloyd’s Algorithm

When clustering with k-means, we are trying to minimize the distortion, or sum of squared errors, between the points and their assigned centers. We can improve the distortion in two ways: by changing the points’ cluster assignments and by moving the cluster centers. Specifically, given a fixed set of points X, we are attempting to minimize the distortion function


Table 2.1 Terms that are used frequently in this chapter

Name and type                            Description
d ∈ ℕ                                    Dimension of the points to cluster and the cluster centers.
n ∈ ℕ                                    Number of points to cluster.
k ∈ ℕ                                    Number of cluster centers.
X ⊂ ℝ^(n×d)                              Set of points to cluster, indexed as x(i) for 1 ≤ i ≤ n.
C ⊂ ℝ^(k×d)                              Set of centers, indexed as c(j) for 1 ≤ j ≤ k.
n(j) ∈ ℕ                                 The number of points that are currently assigned to cluster j.
N = {i ∈ ℕ | 1 ≤ i ≤ n}                  Indexes of the points in X.
K = {j ∈ ℕ | 1 ≤ j ≤ k}                  Indexes of the centers in C.
a : N → K                                Index of the assigned (closest) center for each point.
u : N → ℝ                                Upper bound on the distance between each point and its assigned center.
ℓ : N × K → ℝ                            Lower bound on the distance between each point and each center.
s : K → ℝ                                Half the distance between a center and its current closest other center.

    J(X, C) = Σ_{i∈N} ‖x(i) − c(a(i))‖²    (2.1)

by choosing the best set of centers C (Table 2.1).

Lloyd’s batch algorithm for minimizing the distortion has three basic steps, which are stated here and in slightly more detail in Algorithm 1.

1. Initialize the centers.
2. Until the algorithm converges:
   a. Assign each point to its currently closest cluster center.
   b. Move each center to the mean of its currently-assigned points.

Step 1 occurs only once, while steps 2(a) and 2(b) alternate until the algorithm converges. Convergence is guaranteed due to the fact that steps 2(a) and 2(b) both reduce J(X, C), and there is a finite number of ways to partition the n points among k clusters [6].

Much has been written on initializing the centers for k-means [8]. Researchers have used the first several examples [6], chosen k points from X at random [16], and used the furthest-first method [18]. The most effective current method, theoretically and in practice, is the k-means++ initialization [3], which randomly selects a good initialization with high probability, using something akin to furthest-first. In all our experiments, we use k-means++ for initialization.

The remainder of this chapter is primarily concerned with optimizing step 2(a) of the algorithm (finding the closest center for each point). The naive implementation of Lloyd’s algorithm spends the majority of its time here, and much of the computation done here is unnecessary; the information needed for this step can be derived using some caching and geometry.


Algorithm 1 Lloyd’s k-means algorithm—the standard algorithm for minimizing J(X, C). Like other algorithms presented in this chapter, this algorithm’s pseudocode is presented simply, without details on efficiency optimizations used in real implementations

procedure LLOYD(X, C)
  while not converged do
    for all i ∈ N do                               {Find the closest center to each x(i).}
      a(i) ← 1
      for all j ∈ K do
        if ‖x(i) − c(j)‖ < ‖x(i) − c(a(i))‖ then
          a(i) ← j
    for all j ∈ K do                               {Move the centers.}
      move c(j) to the mean of {x(i) | a(i) = j}
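For concreteness, the following is a minimal sketch of Lloyd’s algorithm in Python/NumPy. It follows the pseudocode above directly; the function and variable names are our own illustrative choices and do not come from any of the packages surveyed earlier.

    import numpy as np

    def lloyd(X, C, max_iters=100):
        """Plain Lloyd's algorithm. X is an (n, d) array of points, C a (k, d) array of initial centers."""
        for _ in range(max_iters):
            # Step 2(a): assign each point to its closest center (computes all n*k distances).
            dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)   # shape (n, k)
            assign = dists.argmin(axis=1)
            # Step 2(b): move each center to the mean of its assigned points.
            new_C = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else C[j]
                              for j in range(C.shape[0])])
            if np.allclose(new_C, C):   # no center moved: converged
                return new_C, assign
            C = new_C
        return C, assign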

Step 2(b) can be optimized easily by caching sufficient statistics for each cluster: the vector sum of the points assigned to the cluster, and the number of points assigned to the cluster. Keeping these is inexpensive and avoids a sum over all points for each iteration. Each time a point changes cluster membership, the relevant sufficient statistics are updated. After the first few iterations, most points remain in the same cluster for many iterations. Thus these sufficient-statistics updates become much cheaper than a sum over all points. All the algorithms we implement for this study use this technique. For clarity, it is not explicitly included in the algorithmic pseudocode.
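As an illustration of this caching, a small sketch (again our own code, not drawn from any particular implementation) keeps each cluster’s vector sum and count so that a membership change costs only O(d):

    import numpy as np

    class ClusterStats:
        """Per-cluster sufficient statistics: vector sums and point counts."""
        def __init__(self, X, assign, k):
            self.sums = np.zeros((k, X.shape[1]))
            self.counts = np.zeros(k, dtype=int)
            for i, j in enumerate(assign):
                self.sums[j] += X[i]
                self.counts[j] += 1

        def move_point(self, x, old, new):
            # Called only when point x actually changes cluster membership.
            self.sums[old] -= x; self.counts[old] -= 1
            self.sums[new] += x; self.counts[new] += 1

        def centers(self):
            # Cluster means; empty clusters keep a zero vector here (real code would reseed them).
            return self.sums / np.maximum(self.counts, 1)[:, None]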

2.2.1 Analysis of Lloyd’s Algorithm

The running time of Lloyd’s algorithm for k-means is O(wnkd) for w iterations, k centers, and n points in d dimensions. For a fixed dataset and k, the number of iterations w will vary depending on the initialization. In fact, w may be superlinear with respect to n, and even exponential in the worst case [38]. However, when considering less extreme cases, Lloyd’s algorithm has polynomial smoothed complexity [4]. That is, w is polynomial in n and 1/σ (where σ is the amount of perturbation allowed on a weakened adversary’s challenge dataset).

When viewed as a gradient descent algorithm, Bottou and Bengio [6] showed that (from a given initialization) the distortion converges in Lloyd’s batch algorithm at superlinear speed. This is because the variations of the second derivative of the cost function are bounded. Thus, the algorithm is minimizing the distortion in a way that is equivalent to Newton’s method.

In this chapter we look at ways to accelerate Lloyd’s algorithm. Given that Lloyd’s algorithm is so widely used, the goal is to accelerate the exact algorithm (without any approximation), so that the resulting accelerated algorithm can be used anywhere Lloyd’s algorithm is used. The accelerations we look at primarily work by avoiding many of the nk interactions between the n clustered points and the k cluster centers.

2.2.2 MacQueen’s Algorithm

MacQueen [26] described a method similar to Lloyd’s algorithm, but which updates the location of each affected cluster center whenever a point changes cluster membership. While Lloyd’s algorithm moves the centers once per pass over the entire dataset, MacQueen’s algorithm will move the centers (by smaller amounts) far more often. Thus, Lloyd’s algorithm could be considered a ‘batch’ algorithm whereas MacQueen’s is more ‘online’. As Lloyd’s algorithm is more popular in practice, and easier to accelerate since the centers move less frequently, we focus primarily on it in this chapter.

2.3 Tree Structured Approaches

Tree structures are effective methods for indexing spatial data in retrieval and learning algorithms. Two approaches in particular, k-d trees and the anchors hierarchy, have shown success in accelerating the k-means algorithm. We review these methods here.

2.3.1 Blacklisting and Filtering Centers with k-d Trees

Pelleg and Moore [31] and Kanungo et al. [19] proposed similar methods for accelerating k-means by constructing and using a k-d tree.¹ A k-d tree is a binary tree that recursively partitions the space of the data it’s constructed on using separating hyperplanes (typically axis-aligned). In this discussion, it is applied to the data to be clustered. After construction, the tree’s structure plus sufficient statistics kept at each tree node can be used to eliminate point-center calculations. Pelleg and Moore call this approach ‘blacklisting’ the centers, while Kanungo et al. call it ‘filtering’ the centers.

Pelleg and Moore’s blacklisting algorithm proceeds as shown in Algorithm 2. Initially it constructs a k-d tree on the data that is to be clustered.

¹Note that the k in k-d trees and the k in k-means are two different (clashing) variable names. In k-means, the k refers to the number of centers/clusters sought; in k-d trees, k refers to the dimension of the data the structure is built on.


Algorithm 2 Pelleg and Moore’s blacklisting k-means algorithm

procedure BLACKLISTING(X, C)
  construct k-d tree T on X
  while not converged do
    Update(T.root, C)
    move centers to the mean of their assigned points

procedure UPDATE(h, C)
  if h is a leaf then
    for each data point x in h do
      find the closest center to x and update that center’s counters
  else
    compute the distance between h and each center in C
    remove from C any dominated centers
    if only one center c′ remains in C then
      update the counters for c′ using the data in h
    else
      call Update(h.left, C)
      call Update(h.right, C)

For each iteration of k-means, the algorithm performs a traversal of the tree, searching for regions of the tree that are ‘owned’ by a single center.

The traversal starts with all k centers at the root of the k-d tree. Then it recursively descends the tree, and at each node attempts to prove that there is one center that ‘dominates’ (is closer than) one or more of the other centers, with respect to the hyperrectangle enclosing the k-d tree node. If so, it eliminates the dominated center(s). If only the dominating center remains, then the recursion stops and all points below that node in the k-d tree are assigned to that center. If multiple centers remain, it recursively continues its search on both child nodes. If it reaches a leaf, it performs a search of the data at the leaf to assign them to the remaining centers.

By using sufficient statistics kept at each internal node of the tree, identifying a dominating center early in the recursion (close to the root) allows the algorithm to skip not only many distance calculations but also many vector additions. For k-means, the sufficient statistics for a node are the number of points at that node or below, and the vector sum of those points.
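The geometric core of both methods is a test that decides, for the hyperrectangle bounding a k-d tree node, whether one center makes another center irrelevant for every point in that box. A minimal sketch of one standard way to perform this check (the function name and exact formulation are our own, following the filtering idea):

    import numpy as np

    def is_dominated(a, b, lo, hi):
        """True if every point of the axis-aligned box [lo, hi] is at least as close to
        center a as to center b, so b can be blacklisted/filtered for this node."""
        # The box vertex most favorable to b is the one extreme in the direction b - a.
        v = np.where(b >= a, hi, lo)
        return np.linalg.norm(v - b) >= np.linalg.norm(v - a)

During the traversal, each remaining candidate is tested against the center that appears closest to the node, and any candidate found to be dominated is dropped before recursing into the children.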

While k-d trees can work very well in practice, the following limitations make their use less desirable, especially in high dimension:

• They are efficient only in low dimension—the running time has an exponential dependence on the dimension of the data [27]. If the number of points in the structure is significantly (exponentially) larger than the dimension of the data, k-d trees tend to be efficient relative to linear lookups, but for even moderate dimensions in practical applications they become too slow. Empirically they become too slow (compared to simple linear search) somewhere between 8 and 10 dimensions [28].

• They require extra memory for the tree structure and sufficient statistics. The extra memory required is on the order of the original dataset.


• When clustering, we must construct the k-d tree in advance, which means that we cannot start clustering until this is done.

• A k-d tree is not designed for efficient updates (such as adding new points or removing old points). If many updates are to be done, the tree should be reconstructed.

2.3.2 Anchors Hierarchy

Moore proposed the anchors hierarchy as a tree-like geometric structure for spatial datasets [28]. The tree is built ‘middle-out’ by choosing √n initial points known as anchors, and then merging them (to form the top of the tree) and subdividing them (to form the leaves). Each anchor maintains a list of the points for which it is the closest anchor, sorted by their distances from the anchor. Using the triangle inequality, adding additional anchors can be done efficiently. Moore showed that the anchors hierarchy is effective at eliminating many distance computations in k-means. However, Elkan [12] showed that it is less effective at eliminating distance computations than his method, even for moderate values of k.

While tree-structured acceleration methods of the k-means algorithm are interesting and often useful, they are often slower and less effective in practice than the methods we turn to now. One reason pointed out by Elkan is that tree-structured methods must build a static structure before clustering begins, without knowledge of the number of clusters. Because the number of clusters is not known when the structure is built, and the centers’ positions change during clustering, static trees are less able to reduce the number of k-means distance calculations.

2.4 Triangle Inequality Approaches

The triangle inequality is a simple but very powerful tool from geometry. If a, b, c ∈ ℝ^d, then the triangle inequality states that

    ‖a − c‖ ≤ ‖a − b‖ + ‖b − c‖    (2.2)

for the Euclidean vector norm ‖a‖ = √(aᵀa). Intuitively, this means that the length of line segment (a, c) is at most the sum of the lengths of line segments (a, b) and (b, c). In other words, the shortest path between two points a and c is a straight line; taking a path that goes through an intermediate point b cannot reduce the distance.

The triangle inequality is applicable in the k-means algorithm in multiple ways. Generally, we desire to use it to prove that some center must be closer to a point than all other centers. Ideally, we wish to do this with as little computation as possible. For a point x and two centers c and c′, here are some of the different ways the triangle inequality can be used in k-means:

Page 11: Chapter 2 Accelerating Lloyd’s Algorithm for k-Means ...predrag/classes/2018springb565/papers/... · 2 Accelerating Lloyd’s Algorithm for k-Means Clustering 43 2.1.2 The Standard

2 Accelerating Lloyd’s Algorithm for k-Means Clustering 51

1. To prove that c′ is closer to x than c, given only the distances ‖c′ − x‖ and ‖c′ − c‖.
2. To prove that c′ is closer to x than c, given only the norms ‖x‖ and ‖c‖ and the distance ‖x − c′‖.
3. To maintain an upper bound on ‖x − c‖ when c is moving.
4. To maintain a lower bound on ‖x − c‖ when c is moving.

Table 2.2 shows the ways the triangle inequality is used in many of the algorithms described in this chapter.

Table 2.2 This table shows different acceleration methods that use distance bounds, sorting, and the triangle inequality in various k-means algorithms

                              Accelerations used
Algorithm                     1    2    3    4        5    6
Lloyd                         –    –    –    –        –    –
Compare-means [32]            –    X    –    –        –    –
Sort-means [32]               –    X    –    –        X    –
Elkan [12]                    X    X    X    k        –    –
Hamerly [15]                  X    –    X    1        –    –
Drake [11]                    X    –    X    b < k    –    –
Annular (this chapter)        X    –    X    1        –    X
Heap (this chapter)           X    –    X    0        –    –

The column numbers correspond to the numbered list below, which gives contexts where the triangle inequality can help k-means avoid point-center distance calculations. Please see the algorithm descriptions for more details. Column 4 lists the number of lower bounds used per point.

1. Distance from a center to its closest other center
2. Distance from a center to all other centers
3. Upper bound on point-center distances
4. Lower bound on point-center distances
5. For each center, sort all centers by distance from it
6. Sort all centers by their vector norm

2.4.1 Using the Triangle Inequality for Center-Center and Center-Point Distances

Phillips [32] demonstrated two ways to use the triangle inequality to accelerate k-means. Both of them use the triangle inequality to prove that centers that are far from a point’s assigned center are also far from the point, and therefore can be excluded from distance computations with that point. He called his two algorithms compare-means and sort-means.

Compare-means uses the triangle inequality to prove that if center c′ is close to point x, and some other center c is far away from c′, then c′ must be closer than c to x. Given the already-computed distances ‖x − c′‖ and ‖c − c′‖, and applying the triangle inequality, we can show:

    ‖c − c′‖ ≤ ‖x − c‖ + ‖x − c′‖    (by the triangle inequality)
    ‖c − c′‖ − ‖x − c′‖ ≤ ‖x − c‖


Thus if we also know that 2‖x − c′‖ ≤ ‖c − c′‖ (which is trivial to check given that these distances are already known), we can show

    2‖x − c′‖ − ‖x − c′‖ ≤ ‖x − c‖
    ‖x − c′‖ ≤ ‖x − c‖,

which proves that c is not closer than c′ to x, without measuring the distance ‖x − c‖. Phillips’ compare-means algorithm uses this inequality in the innermost loop of k-means to prove that some point-center distances need not be computed. The algorithm computes and caches the center-center distances each time the centers move (once per iteration).

Sort-means computes a k × k matrix of center-center distances each time the centers move, and then sorts each row of the matrix by distance. This gives each center a ranking of the other centers by distance. Whenever the algorithm wants to find the closest center for some point x, it searches the centers in order of increasing distance from the point’s currently-assigned center c. Then, if it can ever prove that the distance ‖c − c′‖ to some other center c′ is greater than twice the distance ‖x − c‖, it can stop searching. Thus sort-means uses the same inequality as compare-means, but it searches the centers in a different order. Because of the search order it may avoid examining some far-away centers. Of course, sort-means has the extra overhead of sorting the center-center distance matrix each time the centers move.
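A small sketch of the sort-means assignment pass follows (our own code; compare-means applies the same test but visits the centers in arbitrary order). The pruning radius is the point’s distance to its currently-assigned center, matching the inequality derived above.

    import numpy as np

    def assign_sort_means(X, C, assign):
        """One assignment pass of sort-means; `assign` holds the current closest-center indexes."""
        cc = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)   # k x k center-center distances
        order = np.argsort(cc, axis=1)     # row j: all centers sorted by distance from center j
        for i, x in enumerate(X):
            c = assign[i]                  # the sorting anchor and pruning reference
            d_to_c = np.linalg.norm(x - C[c])
            best_j, best_d = c, d_to_c
            for cprime in order[c]:
                if cc[c, cprime] > 2.0 * d_to_c:
                    break                  # this and all later centers are provably no closer than c
                if cprime == c:
                    continue
                d = np.linalg.norm(x - C[cprime])
                if d < best_d:
                    best_j, best_d = cprime, d
            assign[i] = best_j
        return assign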

2.4.2 Maintaining Distance Bounds with the Triangle Inequality

We can use the triangle inequality to cheaply maintain an upper bound on the distance between points, after one point has moved. Suppose x, c, c′ ∈ ℝ^d, where x is a point to cluster, c is a cluster center, and c′ is its new position after an iteration of k-means. If we know ‖x − c‖ (from a previous iteration of k-means) and ‖c − c′‖ (calculated when we move the cluster centers), we can provide an upper bound on ‖x − c′‖ without explicitly calculating its exact value:

    ‖x − c′‖ ≤ ‖x − c‖ + ‖c − c′‖.    (2.3)

Intuitively, this upper bound assumes that in the worst case, c moved directly away from x a distance of ‖c − c′‖, along the vector x − c.


Fig. 2.1 Using the triangle inequality to bound the distance between x and c′, the new location of the center c. Assume the distance ‖x − c‖ has been calculated. It is illustrated by the solid circle centered on x and going through c. After c moves to c′, we measure ‖c − c′‖ (illustrated by the dashed circle centered on c). The upper and lower bounds on ‖x − c′‖ are then given by ‖x − c‖ − ‖c − c′‖ ≤ ‖x − c′‖ ≤ ‖x − c‖ + ‖c − c′‖, which are illustrated by the two dashed circles centered on x. Thus, c′ must be inside the region bounded by these two dashed circles

Another way to apply the triangle inequality is to form a lower bound on the distance between two points. Again considering x, c, and c′ as a point to cluster and the old and new positions of a center, we can form a lower bound on ‖x − c′‖:

    ‖x − c‖ ≤ ‖x − c′‖ + ‖c − c′‖    (2.4)
    ‖x − c‖ − ‖c − c′‖ ≤ ‖x − c′‖.   (2.5)

Similar to the upper bound, this bound assumes the worst case—that c moved directly toward x a distance of ‖c − c′‖, along the vector x − c.

Further, the upper (lower) bound can be updated correctly and efficiently in subsequent k-means iterations by adding (subtracting) the distance moved by a center each time it moves. This allows us to maintain both upper and lower distance bounds between a point and a moving center without explicitly calculating distances (Fig. 2.1).
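In code, this bookkeeping is a pair of vectorized updates, sketched below under our own naming (Elkan-style bounds, one lower bound per point-center pair); no point-center distance is computed here.

    import numpy as np

    def update_bounds(upper, lower, assign, center_shift):
        """Adjust cached distance bounds after the centers move.

        upper[i]     : upper bound on dist(x(i), c(a(i)))
        lower[i, j]  : lower bound on dist(x(i), c(j))
        center_shift : center_shift[j] is the distance center j just moved
        """
        upper += center_shift[assign]     # Eq. (2.3): the assigned center may have moved away
        lower -= center_shift[None, :]    # Eq. (2.5): every center may have moved toward the point
        return upper, lower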

Page 14: Chapter 2 Accelerating Lloyd’s Algorithm for k-Means ...predrag/classes/2018springb565/papers/... · 2 Accelerating Lloyd’s Algorithm for k-Means Clustering 43 2.1.2 The Standard

54 G. Hamerly and J. Drake

2.4.3 Elkan’s Algorithm: k Lower Bounds, k² Center-Center Distances

Elkan [12] introduced an algorithm which uses the triangle inequality in multiple ways to avoid distance calculations in the k-means algorithm (see Algorithm 3 for pseudocode). For each clustered point x(i), the algorithm employs one upper bound and k lower bounds. The upper bound is on the distance between x(i) and its closest center c(a(i)); that is, u(i) ≥ ‖x(i) − c(a(i))‖. Each lower bound ℓ(i, j) ≤ ‖x(i) − c(j)‖ bounds the distance between x(i) and center c(j). The upper (lower) bounds may be efficiently updated by adding (subtracting) the distance moved by each center after each k-means iteration.

Each time the centers move, Elkan’s algorithm calculates and caches the distance between each pair of centers, as well as half the distance between each center c(j) and its closest other center, as s(j). When the test u(i) ≤ s(a(i)) is true, Elkan’s algorithm can avoid the innermost loop for x(i), because no other center could possibly be closer to x(i) than its currently-assigned center.

When Elkan’s algorithm reaches the innermost loop, it may want to determine whether c(j) is closer to x(i) than the currently assigned center. However, if either u(i) ≤ ℓ(i, j) or u(i) ≤ ‖c(a(i)) − c(j)‖/2, then this calculation is unnecessary, because it is not possible for c(j) to be the closest center. The latter test uses the cached center-center distances.
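The following sketch condenses the per-point logic (a simplification of Algorithm 3 below, in our own notation); the bounds are assumed to have already been adjusted for the most recent center movement.

    import numpy as np

    def elkan_assign_point(i, x, C, a, u, lower, s, cc):
        """One point's assignment step in Elkan-style k-means.

        u[i] upper-bounds dist(x, C[a[i]]); lower[i, j] lower-bounds dist(x, C[j]);
        s[j] is half the distance from center j to its closest other center;
        cc is the k x k matrix of center-center distances.
        """
        if u[i] <= s[a[i]]:
            return                                     # no other center can be closer
        tight = False
        for j in range(C.shape[0]):
            if j == a[i] or u[i] <= lower[i, j] or u[i] <= cc[a[i], j] / 2.0:
                continue                               # c(j) provably cannot be the closest
            if not tight:
                u[i] = np.linalg.norm(x - C[a[i]])     # tighten the upper bound once
                tight = True
                if u[i] <= lower[i, j] or u[i] <= cc[a[i], j] / 2.0:
                    continue
            d = np.linalg.norm(x - C[j])               # unavoidable distance calculation
            lower[i, j] = d
            if d < u[i]:
                a[i], u[i] = j, d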

2.4.4 Hamerly’s Algorithm: 1 Lower Bound

Hamerly [15] altered Elkan’s algorithm by reducing the number of bounds used (see Algorithm 4). Hamerly’s algorithm uses the same upper bound u(i) for each point x(i)—on the distance between that point and its closest center c(a(i)). But instead of k lower bounds, it uses only one lower bound per point, ℓ(i). This lower bound does not bound the distance from x(i) to any particular cluster center. Instead, it represents the minimum distance that any center—except for the closest—can be from that point.

Consider the case where u(i) ≤ ℓ(i). If this is true, it is not possible for any center to be closer to x(i) than its assigned center. Thus, determining the assignment for x(i) does not require knowing any exact distances, and the algorithm can skip the innermost loop that computes the distances between x(i) and the k centers.

However, if ℓ(i) < u(i) then it might be that the closest center for x(i) has changed. In this case, Hamerly’s algorithm first tightens the upper bound by computing the exact distance u(i) ← ‖x(i) − c(a(i))‖. If this reduces u(i) significantly, then possibly u(i) ≤ ℓ(i) and the algorithm can skip the innermost loop. If not, then it must compute the distances between x(i) and all k cluster centers. Keeping track of the closest and second-closest centers allows the algorithm to find the correct assignment, compute a tight ℓ(i) (from the second-closest center), and possibly tighten u(i) (if the assignment happens to change).


Algorithm 3 Elkan’s algorithm—using k lower bounds per point and k² center-center distances

procedure ELKAN(X, C)
  a(i) ← 1, u(i) ← ∞ for all i ∈ N                 {Initialize invalid bounds, all points in one cluster.}
  ℓ(i, j) ← 0 for all i ∈ N, j ∈ K
  while not converged do
    compute ‖c(j) − c(j′)‖ for all j, j′ ∈ K
    compute s(j) ← min_{j′≠j} ‖c(j) − c(j′)‖/2 for all j ∈ K
    for all i ∈ N do
      if u(i) ≤ s(a(i)) then continue with next i
      r ← True
      for all j ∈ K do
        z ← max(ℓ(i, j), ‖c(a(i)) − c(j)‖/2)
        if j = a(i) or u(i) ≤ z then continue with next j
        if r then
          u(i) ← ‖x(i) − c(a(i))‖
          r ← False
          if u(i) ≤ z then continue with next j
        ℓ(i, j) ← ‖x(i) − c(j)‖
        if ℓ(i, j) < u(i) then a(i) ← j
    for all j ∈ K do                               {Move the centers and track their movement.}
      move c(j) to its new location
      let δ(j) be the distance moved by c(j)
    for all i ∈ N do                               {Update the upper and lower distance bounds.}
      u(i) ← u(i) + δ(a(i))
      for all j ∈ K do
        ℓ(i, j) ← ℓ(i, j) − δ(j)


Since u(i) is the same as in Elkan’s algorithm, it is updated the same way whenever the centers move. As the lower bound ℓ(i) is different, it is updated differently. When the centers move, the algorithm also tracks the maximum distance δ moved by any center. Then, by the triangle inequality, the lower bound for each point can be updated as ℓ(i) ← ℓ(i) − δ. This is correct since ℓ(i) represents the closest distance between any (non-assigned) center and x(i), and no center could have moved closer toward x(i) than the distance δ. In fact, a small optimization is possible: for those points assigned to the furthest-moving center, their lower bounds may be reduced by the distance moved by the second-furthest-moving center.

Why does Hamerly’s algorithm not track the identity of the second-closest center? Knowing its identity would allow the algorithm to more efficiently tighten ℓ(i) (avoiding the innermost loop over all k centers). The reason is that the second-closest center’s identity can change as the centers move, and without looking at all centers it can’t be proved that the second-closest center remains the same over multiple iterations.

Hamerly’s algorithm has several efficiency tradeoffs compared with Elkan’s algorithm. With fewer lower bounds, Hamerly’s algorithm uses less memory.


Algorithm 4 Hamerly’s algorithm—using 1 lower bound per point

procedure HAMERLY(X, C)
  a(i) ← 1, u(i) ← ∞, ℓ(i) ← 0 for all i ∈ N       {Initialize invalid bounds, all points in one cluster.}
  while not converged do
    compute s(j) ← min_{j′≠j} ‖c(j) − c(j′)‖/2 for all j ∈ K
    for all i ∈ N do
      z ← max(ℓ(i), s(a(i)))
      if u(i) ≤ z then continue with next i
      u(i) ← ‖x(i) − c(a(i))‖                      {Tighten the upper bound.}
      if u(i) ≤ z then continue with next i
      find c(j) and c(j′), the two closest centers to x(i), as well as the distances to each
      if j ≠ a(i) then
        a(i) ← j
        u(i) ← ‖x(i) − c(a(i))‖
      ℓ(i) ← ‖x(i) − c(j′)‖
    for all j ∈ K do                               {Move the centers and track their movement.}
      move c(j) to its new location
      let δ(j) be the distance moved by c(j)
    δ′ ← max_{j∈K} δ(j)
    for all i ∈ N do                               {Update the upper and lower distance bounds.}
      u(i) ← u(i) + δ(a(i))
      ℓ(i) ← ℓ(i) − δ′

It spends less time checking bounds (in the innermost loop) and updating bounds (when centers move). Having the single lower bound allows it to avoid entering the innermost loop more often than Elkan’s algorithm. On the other hand, Elkan’s algorithm computes fewer distances than Hamerly’s, since Elkan’s has more bounds with which to prune the required distance calculations. Also, Hamerly’s algorithm works better in low dimension than in high dimension. Its single lower bound is reduced by the maximum distance moved by any center, and in high dimension all centers tend to move a lot due to the curse of dimensionality.
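For comparison with the Elkan sketch given earlier, the per-point test of Hamerly’s algorithm is short (again our own code, mirroring Algorithm 4):

    import numpy as np

    def hamerly_assign_point(i, x, C, a, u, lower, s):
        """One point's assignment step with a single lower bound per point."""
        z = max(lower[i], s[a[i]])
        if u[i] <= z:
            return                                  # both pruning tests passed
        u[i] = np.linalg.norm(x - C[a[i]])          # tighten the upper bound
        if u[i] <= z:
            return
        d = np.linalg.norm(x - C, axis=1)           # fall back to all k distances
        first, second = np.argsort(d)[:2]
        a[i], u[i], lower[i] = first, d[first], d[second]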

2.4.5 Drake’s Algorithm: 1 < b < k Lower Bounds

Elkan’s and Hamerly’s algorithms keep, respectively, k bounds and one lower bound per clustered point. Drake and Hamerly [11] bridged the gap between these two extreme values by using 1 < b < k lower bounds on the b closest centers to each point. The value of b can be selected in advance or adaptively learned while the algorithm runs. Drake’s algorithm uses one upper bound per clustered point. Thus, Drake’s algorithm uses (b + 1)n total distance bounds.

For a given point, the first b − 1 lower bounds represent the minimal distances from the point to its b − 1 closest centers, excluding the currently assigned center.


The last (bth) lower bound is treated specially, and we will discuss it later. Since k-means is only concerned with the closest center, we can avoid distance calculations to far-away centers if we can use the bounds to prove that the closest is one of the b closest (the current assigned center plus the b − 1 next closest centers).

There are two minor complications that arise from keeping b bounds per point:

• For each of the b lower bounds we must keep the identity of the associated center. In Hamerly’s algorithm, the identity of the lower-bound center is not kept. Elkan’s algorithm keeps the lower bounds in the same order as the center indexes, implicitly giving the bound-center association. Keeping a center label for each bound increases the algorithm’s memory footprint.

• In order to make the algorithm efficient, the lower bounds should be kept in sorted order by distance from the point. Sorting incurs overhead each time the centers move.

Nevertheless, Drake and Hamerly show that this algorithm works well in practice and can be faster than both Elkan’s and Hamerly’s algorithms under certain conditions.

The first b − 1 lower bounds for a point are the lower bounds on the distances to the associated centers that are ranked 2 through b in increasing distance from the point. The last lower bound (number b, the furthest from the point) represents something a bit different. Instead of being associated with one particular center, it represents a lower bound on the distance to all of the furthest k − b centers. This is much like Hamerly’s one lower bound, but only for the outermost centers.

When searching for the closest center to a given point, we only need to search the centers whose corresponding lower bounds are less than the upper bound for that point. Moving from the center with the smallest lower bound outward, we can hopefully stop searching after just a few comparisons. For each lower-upper bound comparison that fails, we tighten the lower (and upper) bounds as necessary. If we reach the last bound and still have not proven that any of the b tracked centers is the closest, then we must search all remaining k − b centers.

At the end of each k-means iteration, we must adjust the lower bounds. Each is reduced by the amount its associated center moved. The outermost bound should be reduced by the maximum amount moved by any of the k − b outermost centers. However, in practice it is far more efficient, though not as tight, to reduce the last bound by the largest distance moved by any center, not just the furthest k − b centers. When doing this, some bound might ‘collapse’ on (i.e. become smaller than) a lower bound for a supposedly closer center. When this happens, we reduce each bound for closer centers so that it is at most the value that is collapsing.
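A sketch of this end-of-iteration bound update follows (our own code; lb holds each point’s b sorted lower bounds and lb_center the center index associated with each of the first b − 1 of them):

    import numpy as np

    def update_drake_bounds(upper, lb, lb_center, assign, center_shift):
        """Adjust Drake-style bounds after the centers move."""
        n, b = lb.shape
        upper += center_shift[assign]
        lb[:, b - 1] -= center_shift.max()                    # last bound covers all far centers
        lb[:, :b - 1] -= center_shift[lb_center[:, :b - 1]]   # per-center reductions
        # If an outer bound collapsed below an inner one, pull the inner bounds down
        # so the sequence stays non-decreasing.
        for j in range(b - 2, -1, -1):
            np.minimum(lb[:, j], lb[:, j + 1], out=lb[:, j])
        return upper, lb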

The number of lower bounds b used by Drake’s algorithm represents a tradeoff. Increasing b incurs more computational overhead to update the bound values and sort each point’s closest centers by their bounds, but it is also more likely that one of the bounds will prevent searching over all k centers. Drake’s algorithm uses a simple way to determine a ‘good’ value for b adaptively.


Algorithm 5 Drake’s algorithm—using b lower bounds per point

procedure DRAKE(X, C, b)
  a(i) ← 1, u(i) ← ∞ for all i ∈ N                 {Initialize invalid bounds, all points in one cluster.}
  ℓ(i, j) ← 0 for all i ∈ N, j ∈ {1, …, b}
  while not converged do
    m ← b
    for all i ∈ N do
      j ← the largest j′ ∈ {1, …, b} such that u(i) ≥ ℓ(i, j′)
      if j < b then                                {The bounds pruned the outer centers.}
        compute distances and reorder the j centers closest to x(i)
      else if j = b or ℓ(i, b) < u(i) then         {Bounds were ineffective.}
        compute distances from x(i) to all centers and sort the b closest
      m ← max(m, j)
    b ← max(k/8, m)                                {Reduce b if possible.}
    for all j ∈ K do                               {Move the centers and track their movement.}
      move c(j) to its new location
      let δ(j) be the distance moved by c(j)
    δ′ ← max_{j∈K} δ(j)
    for all i ∈ N do                               {Update the upper and lower distance bounds.}
      u(i) ← u(i) + δ(a(i))
      ℓ(i, b) ← ℓ(i, b) − δ′
      for j = b − 1 down to 1 do
        let c(z) be the center that is the jth closest to x(i)
        ℓ(i, j) ← min(ℓ(i, j) − δ(z), ℓ(i, j + 1))

Starting with a large value for b, it may choose to reduce b each iteration based on which bounds are being used. If the algorithm is able to stop searching after only b′ < b lower bounds (over the entire dataset), then it reduces b down to b′. Experimentally, Drake determined that for k > 8, k/8 is a good floor for b.

2.4.6 Annular Algorithm: Sorting the Centers by Norm

Hamerly’s algorithm, employing one lower bound, is effective at avoiding many distance computations in low-dimensional spaces. But whenever the lower and upper bounds for a point cross (i.e. ℓ(i) < u(i)), the algorithm must compute the distance between the point and all k centers. This search determines the two closest centers and tightens the upper and lower bounds. Similarly, when the last lower bound in Drake’s algorithm fails to prune the search, it must search over all centers (it has already searched over b centers, and it must continue searching over the remaining k − b centers represented by the last lower bound).

In this subsection we describe an efficient method that can prune such searches between a point x and all k centers. By sorting the centers by their vector norms, we can eliminate from consideration many centers whose norms are too large or too small to be closest to x. We can do this with only the knowledge of the norm of x, the norms of all k centers, and (an upper bound on) the distance between x and a (hopefully close-by) center.



Consider ordering the k cluster centers by their vector norms ‖·‖. This order may change at most once per iteration of k-means, and is inexpensive to compute and maintain. This ordering affords a novel way to structure the search for a point’s closest center and potentially avoid examining all centers.

For a point x having norm ‖x‖, we can use the norm-ordering of the centers and the triangle inequality to prune the search over all centers. Assume that we know the exact distance ‖x − c′‖ between x and a reasonably close center c′ (such as its currently assigned center). Then consider a center c that is actually closer to x. Starting with two different statements of the triangle inequality, we have

    ‖c‖ ≤ ‖x − c‖ + ‖x‖,
    ‖x‖ ≤ ‖x − c‖ + ‖c‖.    (2.6)

Then we can combine these to get

    ‖c‖ − ‖x‖ ≤ ‖x − c‖,
    ‖x‖ − ‖c‖ ≤ ‖x − c‖;    (2.7)

    |‖x‖ − ‖c‖| ≤ ‖x − c‖    (combining the two results in (2.7)).    (2.8)

Because c is closer than c′ to x, we can deduce

    |‖x‖ − ‖c‖| ≤ ‖x − c‖ ≤ ‖x − c′‖.    (2.9)

Thus, any center c that is closer to x than c′ must satisfy Inequality (2.9). Viewed the other way around, any center c that violates this inequality—i.e. |‖x‖ − ‖c‖| > ‖x − c′‖—must be farther from x than c′, and can be eliminated from consideration because c′ must be closer to x. Note that while the above discussion relied on knowing an exact distance between x and some close center c′, the same derivation can be done with an upper bound on this distance.

We can use this knowledge to prune the search for the closest center. Given (an upper bound on) the distance ‖x − c′‖ between x and some center c′ (e.g. the currently assigned center), we can eliminate any center c where

    |‖x‖ − ‖c‖| > ‖x − c′‖.

And since we have ordered the centers by their norm, we only need to compute the distances between x and those centers c whose norm falls in the range

    ‖x‖ − ‖x − c′‖ ≤ ‖c‖ ≤ ‖x‖ + ‖x − c′‖.


Fig. 2.2 The annular region (the white ring centered at the origin) bounds where the closest center for x might be. Centers c(j) are numbered by their distance from the origin. Point x has c(4) as its previously-closest center, so the width of the annulus is 2‖x − c(4)‖ (the dashed circle centered at x)

These bounds form an annular region, centered on the origin, in which potentially closer centers to x may lie. We can use binary search on the centers ordered by their norms to locate the smallest-norm center that fits this inequality, and then compute the distance between x and each center within the annulus. Figure 2.2 gives a graphical representation of this search space.
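With the centers’ norms kept in sorted order, the annulus can be located with two binary searches, as in the following sketch (our own code); here r is (an upper bound on) the distance from x to a nearby center, for example its second-closest center when used as described next.

    import numpy as np

    def annular_candidates(x, norms_sorted, order, r):
        """Indexes of centers whose norms lie within r of ||x||; only these can be closer than r."""
        xn = np.linalg.norm(x)
        lo = np.searchsorted(norms_sorted, xn - r, side='left')
        hi = np.searchsorted(norms_sorted, xn + r, side='right')
        return order[lo:hi]   # candidate centers, already in increasing order of norm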

We have implemented this annular search in conjunction with Hamerly’s algorithm, with one additional change. Each time Hamerly’s algorithm searches over all centers, it needs to discover not just the closest center, but also the second-closest center (to tighten the lower bound). Thus, when constructing the annulus for x, we use twice the distance between x and its second-closest center as the annulus width. But since Hamerly’s algorithm does not explicitly track the second-closest center, the augmented annular search algorithm attempts to do so. As mentioned above, though, the second-closest center can change. However, the annular search does not need to know which is the actual second-closest center—it only needs the index of a center which is likely to be close to x (and farther from x than the assigned center) to form the search annulus. Thus when it does find the actual second-closest center, it caches its identity for constructing the annulus later.

2.4.7 Kernelized k-Means with Distance Bounds

As with many distance-based algorithms, k-means can be ‘kernelized’ by applying the kernel trick [10, 34]. Starting from the definition of the squared Euclidean distance, which can be written using inner products,

    ‖x − y‖² = ⟨x, x⟩ − 2⟨x, y⟩ + ⟨y, y⟩,


we can kernelize the distance by replacing each inner product with a call to a kernel function K(x, y) = ⟨φ(x), φ(y)⟩, which represents the inner product of x and y after they have been transformed to a new space, within the range of φ. Here φ typically represents a function that yields a higher-dimensional vector. Thus,

    ‖φ(x) − φ(y)‖² = ⟨φ(x), φ(x)⟩ − 2⟨φ(x), φ(y)⟩ + ⟨φ(y), φ(y)⟩
                   = K(x, x) − 2K(x, y) + K(y, y),

and if K is easy to compute (without explicitly using φ), then using a kernel is a relatively efficient way to perform k-means implicitly in the space of φ.

We can apply the triangle inequality algorithms to kernelized k-means. However, we must be careful to avoid inefficiencies that come from kernelizing the algorithm. In particular, since the centers live implicitly in the high-dimensional range of φ, and we don’t represent high-dimensional feature vectors explicitly, any time we want to use a center we instead use the kernel applied to all of the points that are in its cluster. That is, in kernel k-means we define center c(j) by its member points:

    c(j) = (1/n(j)) Σ_{i | a(i)=j} φ(x(i)),

so that when we want to know ‖φ(x) − c(j)‖², we compute

    ‖φ(x) − c(j)‖² = K(x, x) − 2⟨φ(x), c(j)⟩ + ⟨c(j), c(j)⟩
                   = K(x, x) − 2⟨φ(x), (1/n(j)) Σ_{i | a(i)=j} φ(x(i))⟩
                              + ⟨(1/n(j)) Σ_{i | a(i)=j} φ(x(i)), (1/n(j)) Σ_{i′ | a(i′)=j} φ(x(i′))⟩
                   = K(x, x) − (2/n(j)) Σ_{i | a(i)=j} K(x, x(i))
                              + (1/n(j)²) Σ_{i | a(i)=j} Σ_{i′ | a(i′)=j} K(x(i), x(i′)).

Note that, naively, this single distance computation, which is at the core of k-means, has a runtime of O(n(j)²) (ignoring the cost of computing the kernel). Under the reasonable assumption that clusters are roughly equal in size, this is O(n²/k²). Thus naive point-center distance computations are quite costly in kernelized k-means, especially when compared with the runtime of the non-kernelized version of k-means. However, simply caching the inner product ⟨c(j), c(j)⟩ for each cluster center at the beginning of each k-means iteration allows us to bring the cost down to a much better, though still very costly, O(n(j)) (or O(n/k) under the assumption that all clusters are of equal size). Either way, in kernelized k-means it is even more appealing to accelerate the algorithm by a method which avoids distance calculations altogether. We can directly apply any of the triangle inequality algorithms to kernelized k-means.
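The caching idea can be sketched as follows (our own code, assuming the full Gram matrix G with G[i, i′] = K(x(i), x(i′)) is available): each cluster’s self inner product is computed once per iteration, after which a point-center distance needs only one pass over that cluster’s members.

    import numpy as np

    def cluster_self_products(G, assign, k):
        """cc[j] = <c(j), c(j)>, computed once per iteration from the Gram matrix."""
        cc = np.zeros(k)
        for j in range(k):
            m = np.flatnonzero(assign == j)
            cc[j] = G[np.ix_(m, m)].sum() / (len(m) ** 2)
        return cc

    def kernel_point_center_dist2(i, j, G, assign, cc):
        """Squared distance from phi(x(i)) to center c(j), using only kernel evaluations."""
        m = np.flatnonzero(assign == j)
        cross = G[i, m].sum() / len(m)        # <phi(x(i)), c(j)>
        return G[i, i] - 2.0 * cross + cc[j]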

When applying an acceleration method such as Elkan’s algorithm to kernelized k-means, we must additionally compute the center movement at each iteration. This is no more costly than computing the inner product of each center with itself, which is already performed each iteration as discussed previously.

2.5 Heap-Ordered k-Means: Inverting the Innermost Loops

Next we turn to a way of restructuring Hamerly's algorithm. We motivate the next algorithm in three ways:

1. it is desirable to reduce the memory use of the accelerated algorithm, in other words the number of bounds kept per point;

2. since k < n, the memory-efficient way of searching all point-center distances is to have a nested loop over all n on the outside and all k on the inside, but we may wish to invert this loop structure so that the outer loop is over k; and

3. thus far, the algorithms that are accelerated by the triangle inequality examine all n points every iteration, but we would like an algorithm which only investigates those points whose triangle inequality bounds have been violated.

For all these reasons, we consider an algorithm that orders all the points by their likelihood of needing cluster reassignment, in other words those for which ℓ(i) − u(i) is smallest (perhaps even negative).

2.5.1 Reducing the Number of Bounds Kept

This new algorithm, heap-ordered k-means, replaces each pair of bounds kept for each point (u(i), ℓ(i)) by Hamerly's algorithm with a single value representing their difference, ℓu(i) = ℓ(i) − u(i). The reasoning is that the bounds avoid distance calculations whenever

\[
u(i) \le \ell(i)
\;\Longleftrightarrow\;
0 \le \ell(i) - u(i)
\;\Longleftrightarrow\;
0 \le \ell u(i).
\]


Thus, we can reduce by half the number of distance bounds used by Hamerly's algorithm simply by replacing the upper and lower bounds by their difference. Whenever ℓu(i) < 0, the distance bound for x(i) has been violated and we need to tighten its bound (possibly reassigning the point to another cluster in the process).

Each update to ℓu(i) is done as follows. If in Hamerly's algorithm we would increment u(i) by a and decrement ℓ(i) by b, then ℓu(i) is decremented by a + b. Thus, updates to this single bound are simple.
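A minimal sketch of this update (the function name is ours):

```python
def update_combined_bound(lu_i, a, b):
    """Hamerly's update would set u(i) += a (movement of the assigned center)
    and l(i) -= b (a bound on how far the other centers moved); the combined
    bound lu(i) = l(i) - u(i) therefore simply decreases by a + b."""
    return lu_i - (a + b)
```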

2.5.2 Cost of Combining Distance Bounds

There is a cost of replacing two bounds with one. In the innermost loop of Hamerly's algorithm, the bounds fail whenever u(i) > ℓ(i), and it may have to examine all k point-center distances involving x. However, before that happens it tightens the upper bound as u(i) = ‖x(i) − c(a(i))‖. If tightening u(i) causes the bounds to become ordered again (u(i) ≤ ℓ(i)), then it has avoided k − 1 distance calculations. However, when the new bound ℓu(i) < 0, we cannot tighten it by looking at just the two centers defining the bound (the first- and second-closest centers), because we do not know the identity of the second-closest center. So when ℓu(i) < 0, we must examine all k point-center distances for x to determine whether a(i) is still its closest center. We might reduce the search over all k by using an annular search approach (see Sect. 2.4.6).

2.5.3 Inverting the Loops Over n and k

Usually, the innermost loops in Lloyd's algorithm loop over all n points, and for each point over all k cluster centers to find its closest. We could invert these two loops, searching over each cluster and within that each point. But naively doing so requires maintaining an extra distance for each of the n points indicating its distance to the closest of the centers examined so far.

Whichever way we structure the nesting of these two loops, the naive strategy examines all n points, and typically all nk point-center pairs. It would be an advantage to examine only those points which could possibly change cluster membership, avoiding completely those points whose assignments are provably 'safe' (via distance bounds). It turns out that we can achieve this goal by using a single (combined) lower-upper distance bound, a heap for each cluster, and making the outer loop be over the k clusters.

2.5.4 Heap-Structured Bounds

For each cluster we construct a min-heap of the points assigned to that cluster, ordered by an estimate of their distance from the cluster center.


Algorithm 6 Heap-ordered algorithm (inefficient version)
procedure HEAPKMEANS-INEFFICIENT(X, C)
    construct k min-heaps: h(j) for each j ∈ K
    insert (−1, i) into h(1) for each i ∈ N   {put all in the first cluster, with violated bounds}
    while not converged do
        for all j ∈ K do
            while h(j) is not empty and (ℓu(i), i) at the top of h(j) has ℓu(i) < 0 do
                remove (ℓu(i), i) from h(j)
                find c(j′) and c(j″), the two closest centers for x(i)
                compute ℓu(i) = ‖x(i) − c(j″)‖ − ‖x(i) − c(j′)‖   {tighten the bound}
                put (ℓu(i), i) into h(j′)
        move each center to the mean of its assigned points
        update ℓu(i) for each i ∈ N, restructuring each heap as necessary

For now assume each heap entry for x(i) is a pair (ℓu(i), i). (This is not exactly the case; shortly we will adjust this definition.) But this suffices to show that the point at the top of the heap is the one whose bound is closest to failing, or has already failed (if ℓu(i) < 0).

This basic approach leads to Algorithm 6 for examining only those points whose bounds have failed. While this is a reasonable algorithm, it is hampered by the fact that it must visit every point to update each ℓu(i), restructuring the heap as it goes. This violates a primary goal we have for this algorithm: to avoid the O(n) factor of considering every point in each iteration of k-means.

We can improve this algorithm by removing the per-iteration update for ℓu(i). This is possible by changing the key for the heap from ℓu(i) to a related, but static, value, and keeping the updates for ℓu(i) external to the heap. First we will need some new notation.

Let ℓu(i, t) be the value of ℓu(i) at k-means iteration t. Instead of updating ℓu(i) at each iteration, we keep track of the cumulative updates for ℓu(i), which turn out to be the same for all points assigned to the same center. Suppose x(i) is assigned to center c(j). Let c(j)_t be its location at k-means iteration t. At iteration t, we compute the current value

\[
\ell u(i, t) \leftarrow \ell u(i, t-1) - \|c(j)_t - c(j)_{t-1}\| - m(t), \quad \text{where} \qquad (2.10)
\]
\[
m(t) \leftarrow \max_{j' \in K} \|c(j')_t - c(j')_{t-1}\| \qquad (2.11)
\]

is the distance moved by the furthest-moving center for that iteration. Then for each center c(j) we maintain the following structure

\[
z(j, t) = \sum_{p=1}^{t} \Big( \|c(j)_p - c(j)_{p-1}\| + m(p) \Big), \qquad (2.12)
\]

where c(j)_0 is the center's initial position. Then z(j, t) is the distance center c(j) has traveled since the beginning of the algorithm, plus the accumulated per-iteration maximum center movement, up through iteration t. This can be computed efficiently each iteration, taking O(kd) time for all centers.

Assume that at iteration t, point x(i) becomes newly assigned to center c(j), and has second-closest center c(j′). Then we can compute the (tight) value of ℓu(i, t) = ‖x(i) − c(j′)_t‖ − ‖x(i) − c(j)_t‖. Into heap h(j) we place the pair

\[
\big( \ell u(i, t) + z(j, t),\; i \big). \qquad (2.13)
\]

In other words, we order the heap by ℓu(i, t) offset by the current value of z(j, t). Consider now the value of ℓu(i, t′) at some later iteration (i.e. t′ > t):

\[
\begin{aligned}
\ell u(i, t') &= \ell u(i, t) - \|c(j)_{t+1} - c(j)_t\| - m(t+1) - \dots - \|c(j)_{t'} - c(j)_{t'-1}\| - m(t') \\
&= \ell u(i, t) - \sum_{p=t+1}^{t'} \Big( \|c(j)_p - c(j)_{p-1}\| + m(p) \Big) \\
&= \ell u(i, t) + \sum_{p=1}^{t} \Big( \|c(j)_p - c(j)_{p-1}\| + m(p) \Big) - \sum_{p=1}^{t'} \Big( \|c(j)_p - c(j)_{p-1}\| + m(p) \Big) \\
&= \ell u(i, t) + z(j, t) - z(j, t') \qquad (2.14)
\end{aligned}
\]

In Algorithm 6 at iteration t′ > t we would check if the top of the heap has ℓu(i, t′) < 0. Starting with this and using Eq. (2.14) and the new heap structure, we find the equivalent test

\[
\begin{aligned}
\ell u(i, t') &< 0 \\
\ell u(i, t) + z(j, t) - z(j, t') &< 0 \\
\ell u(i, t) + z(j, t) &< z(j, t'), \qquad (2.15)
\end{aligned}
\]

which is exactly what we used as the distance key on the heap (i.e. ℓu(i, t) + z(j, t)) and have been maintaining in the intervening iterations (i.e. z(j, t′)).

Thus, we can keep a heap structure that does not require updating as ℓu(i) changes, instead accumulating updates for each center external to the heap structure, and achieve equivalent tests for ℓu(i) < 0. Note that we do not need to keep the history of z(j, t) for all iterations; we only ever need the value for the current iteration. This leads us to the more efficient Algorithm 7.

2.5.5 Analysis of Heap-Structured k-Means

To analyze Algorithm 7, we assume a simple binary-heap implementation that takes O(log(n)) time to insert and remove, and introduce two new terms.


Algorithm 7 Heap-ordered algorithm (using 1 bound per point, and one per cluster)
procedure HEAPKMEANS(X, C)
    construct min-heap h(j) for each j ∈ K
    let z(j) ← 0 for each j ∈ K
    insert (−1, i) into h(1) for each i ∈ N   {put all in the first cluster, with violated bounds}
    while not converged do
        for all j ∈ K do
            while h(j) is not empty and (y, i) at the top of h(j) has y < z(j) do
                remove (y, i) from h(j)
                compute the distance from x(i) to each center
                let c(j′) and c(j″) be its closest and second-closest centers
                insert (‖x(i) − c(j″)‖ − ‖x(i) − c(j′)‖ + z(j′), i) into h(j′)
        for all j ∈ K do
            move c(j) to the average of its assigned points
            calculate δ(j) as the distance c(j) moved
        compute δ′ = max_{j∈K} δ(j)
        for all j ∈ K do
            update z(j) ← z(j) + δ(j) + δ′
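The following compact Python sketch (ours; it mirrors the structure of Algorithm 7 but is not the chapter's C++/Pthreads implementation, and names such as heap_kmeans are assumptions) shows the heap keys and the z(j) bookkeeping in runnable form:

```python
import heapq
import numpy as np

def heap_kmeans(X, C, max_iters=100):
    """Heap-ordered k-means in the spirit of Algorithm 7.

    X: (n, d) data; C: (k, d) initial centers (copied, not modified in place).
    Heap keys are lu(i, t) + z(j, t) at insertion time; a point is re-examined
    only while its key is below the current z(j) of the heap it sits in.
    """
    n, _ = X.shape
    C = C.copy()
    k = C.shape[0]
    z = np.zeros(k)                          # z(j): accumulated delta(j) + delta'
    heaps = [[] for _ in range(k)]
    assign = np.zeros(n, dtype=int)
    for i in range(n):                       # all points start with violated bounds
        heapq.heappush(heaps[0], (-np.inf, i))

    for _ in range(max_iters):
        for j in range(k):
            while heaps[j] and heaps[j][0][0] < z[j]:
                _, i = heapq.heappop(heaps[j])
                dists = np.linalg.norm(X[i] - C, axis=1)   # all k point-center distances
                j1 = int(np.argmin(dists))                 # closest center
                d1 = dists[j1]
                dists[j1] = np.inf
                d2 = dists.min()                           # second-closest distance
                assign[i] = j1
                heapq.heappush(heaps[j1], (d2 - d1 + z[j1], i))
        # move centers, then update z(j) <- z(j) + delta(j) + max delta
        newC = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else C[j]
                         for j in range(k)])
        delta = np.linalg.norm(newC - C, axis=1)
        C = newC
        if delta.max() == 0.0:
            break
        z += delta + delta.max()
    return assign, C
```

Because the heap keys are conservative lower bounds on ℓu, this sketch reaches the same fixed point as an unaccelerated implementation given the same initialization, up to tie-breaking among equidistant centers.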

The number of iterations performed by k-means is w, and the number of bound violations per iteration is v. In other words, v is the number of points that must be removed from any heap in one iteration of k-means. Then the running time is

\[
O\big(n + w\,v\,(\log(n) + kd)\big). \qquad (2.16)
\]

The most important thing to notice about this analysis is the lack of a wn term, which does occur in other algorithms based on the triangle inequality. While there is a term wv, and v depends on n, in general v < n, and highly clustered data will have v ≪ n.

2.6 Parallelization

There are multiple ways to parallelize the k-means algorithm. While the purpose of our study is to improve the core k-means algorithm, we also want to show that such improvements are suitable for parallelization. In particular, we consider the simplest case of parallelizing the algorithm over a shared-memory, multicore machine.

In a shared-memory context with p processors, the most straightforward way to parallelize the batch k-means algorithm is to partition the n data points to be clustered into p subsets, each of size n/p. The cluster centers are replicated across (or shared by) all processors.

During each iteration, each processor assigns each point in its partition to the center nearest that point. After the assignment step, each processor computes for its partition the (partial) sufficient statistics required to compute the new center locations. In particular, for each center each processor must compute the vector sum of the points assigned to that center, as well as the number of points assigned to it.


Using these partial results from all processors at the end of the iteration, the new cluster centers can be computed and shared with all processors. Thus the algorithm is embarrassingly parallel within each iteration, but requires synchronization of all processors between iterations. Note that any k-means optimization algorithm which is not batch but stochastic in nature will not be able to directly apply this type of parallelization and maintain identical output.
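As a sketch of this partition-and-combine structure (ours; ThreadPoolExecutor and the function names are assumptions for illustration, and a pure-Python version will not show the scaling of the C++/Pthreads implementation used in the experiments):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def assign_and_partial_stats(X_part, C):
    """One worker: assign its points to the nearest center and return the
    partial sufficient statistics (per-center vector sums and counts)."""
    d2 = ((X_part[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    a = d2.argmin(axis=1)
    k, d = C.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=int)
    for j in range(k):
        members = X_part[a == j]
        counts[j] = members.shape[0]
        if counts[j] > 0:
            sums[j] = members.sum(axis=0)
    return sums, counts

def parallel_lloyd_iteration(X, C, p=4):
    """One synchronized iteration of batch k-means over p equal partitions."""
    parts = np.array_split(X, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = list(pool.map(lambda part: assign_and_partial_stats(part, C), parts))
    sums = sum(r[0] for r in results)
    counts = sum(r[1] for r in results)
    # combine the partial statistics; keep the old center if a cluster is empty
    new_C = np.where(counts[:, None] > 0,
                     sums / np.maximum(counts, 1)[:, None], C)
    return new_C
```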

A few details are worth noting. For the heap-based k-means presented in this chapter, the algorithm can be parallelized this way by constructing k heaps for each processor (resulting in pk total heaps). For Drake's algorithm, which may reduce the number of lower bounds used as the algorithm proceeds, each thread can adaptively reduce the number of bounds that it uses without affecting other threads, allowing the potential for optimization within smaller regions of the dataset.

We have implemented multithreaded versions of the algorithms we use in experiments, and perform experiments to see how well they scale with an increasing number of available processors. Please see Sect. 2.7 for the experimental results.

2.7 Experiments and Discussion

In this section we describe the experimental results for the algorithms discussed in this chapter on several real-world and synthetic datasets. Table 2.3 describes the datasets. Table 2.4 describes the algorithms tested.

Table 2.3 A list of the datasets used in the experiments

Name              Description                                          Number of points n   Dimension d
Uniform-2/8/32    Synthetic, uniform distribution                      1,000,000            2/8/32
Clustered-2/8/32  Synthetic, 50 separated spherical Gaussian clusters  1,000,000            2/8/32
BIRCH             10 × 10 grid of Gaussian clusters                    100,000              2
MNIST-50          Random projection from MNIST-784                     60,000               50
Covertype         Soil cover measurements                              581,012              54
KDD Cup 1998      Response rates for fundraising campaign              95,412               56
MNIST-784         Raster images of handwritten digits                  60,000               784


Table 2.4 Algorithms tested. Given the same initialization, all algorithms produce the same result

Algorithm                        Type of acceleration                                    Unique features
Lloyd                            None                                                    Baseline algorithm for batch k-means.
Compare-means [32]               Triangle inequality                                     Avoid innermost loop when closest other center is far away.
Sort-means [32]                  Triangle inequality, sorting centers                    Search over centers in increasing distance from closest center.
Elkan [12]                       Triangle inequality, distance bounds                    1 upper bound, k lower bounds per point.
Hamerly [15]                     Triangle inequality, distance bounds                    1 upper bound, 1 lower bound per point.
Drake (adaptive version) [11]    Triangle inequality, distance bounds, sorting bounds    1 upper bound, b lower bounds per point; b is chosen adaptively.
Annular (this chapter)           Triangle inequality, distance bounds, sorting centers   Like Hamerly, but with norm-ordered centers.
Heap (this chapter)              Triangle inequality, distance bounds                    Uses k heaps of assigned points, ordered by bounded distance from center; upper and lower bounds are combined into one value.
Kernelized Lloyd                 None                                                    Lloyd's algorithm with kernels.
Kernelized Elkan (this chapter)  Triangle inequality, bounds                             Applying Elkan's algorithm to kernelized k-means.

2.7.1 Testing Platforms

We ran our tests on two Linux 64-bit Intel platforms. One is a 128-node parallel machine with 8 processors and 16 GB of RAM per node. We used up to 8 simultaneous threads on this machine. The other is a more recent 12-core computer with 16 GB of RAM, which we used for testing up to 12 simultaneous threads. We implemented all of the algorithms tested in C++ and Pthreads. For algorithms that used similar structures, we tried to use common code wherever possible to minimize differences due to implementation.

2.7.2 Speedup Relative to the Naive Algorithm

Figures 2.3 and 2.4 show the speedup of the accelerated algorithms relative to the naive algorithm. Speedup is defined as the time for the naive algorithm divided by the time for the accelerated algorithm. For each dataset, we run it with multiple k values. Speedups of up to 50x are observed, with the largest accelerations being for low-dimensional, naturally-clustered data. This is an important case for user-facing applications.

The results show a general improvement in speedup for most accelerated algorithms as the number of clusters increases. This makes sense because as the number of total possible distance calculations rises with k, so does the number of 'far away' centers that can be pruned using the acceleration techniques in this chapter. The speedup curves are not monotonic because the number of iterations varies (depending on k and the initialization), and when k-means performs very few iterations, all the algorithms take roughly the same amount of time. One reason is that the accelerations using distance bounds provide the most benefit when the centers are moving very little.

We observe generally that not all of the accelerated algorithms always outperform the naive algorithm. For example, Elkan's algorithm shows very little if any improvement when the dimension is 2. On the other end of the dimension spectrum, Sort-means and Compare-means perform about the same as the naive algorithm when the data is unstructured (uniform random) and of dimension 8 or 32.

In the highest dimension dataset, MNIST-784, Elkan's algorithm is the clear winner. It benefits in this high-dimension space by being the best at avoiding distance calculations, where distance calculations are very expensive. Drake's algorithm is second-best, since it uses fewer bounds than Elkan's and is unable to avoid as many distance calculations. Generally, as dimension increases the algorithm gains more benefit from caching additional (lower) distance bounds.

2.7.3 Parallelism

We implemented and tested multithreaded versions of each algorithm we investigate. Here we look at how well each is able to use additional computation resources, in terms of speedup and efficiency. While we try to parallelize all parts of each algorithm, the different steps of each algorithm require different amounts of thread synchronization in each iteration, and some parts are not easily parallelized (e.g. a single sort that occurs each iteration whose input depends on data from all threads, and whose output must be shared to all threads, as happens in the Annular algorithm).

Figure 2.5 shows the speedup of each algorithm with respect to the number of threads. Each algorithm's speedup is computed relative to using only one thread with that algorithm. We measure wall-clock time for using one thread and for using t threads, and divide the former by the latter to obtain the speedup. All versions of an algorithm (e.g. single-threaded versus multithreaded), given the same initialization, produce the same sequence of k-means iterations and final clustering.

Figure 2.6 shows the efficiency of each algorithm with respect to the number of threads. We define efficiency as speedup divided by the number of threads. So a perfect efficiency would be a line fixed at 1.0, and a program which cannot use multiple threads would have an efficiency curve of 1/t, where t is the number of threads.

Fig. 2.3 Speedup relative to the naive algorithm for synthetic datasets (clustered and uniform). Speedup is defined as time(naive)/time(accelerated)

It is clear that not all algorithms use the additional available computation equally well. The algorithm that benefits the most from additional threads is the non-accelerated (naive) Lloyd's algorithm, which obtains nearly linear speedups. This is likely for two reasons: it has the least synchronization between threads, and its per-thread behavior is the most predictable (each thread will do approximately equal work). Accelerated algorithms require more synchronization since there is more information kept and shared between threads.

Fig. 2.4 Speedup relative to the naive algorithm for the other datasets (BIRCH, Covertype, MNIST-50, 1998 KDD Cup, and MNIST-784). Speedup is defined as time(naive)/time(accelerated)

For per-thread behavior, it is possible that one thread will have more work to do than another due to, e.g., distance bounds being more effective for the data assigned to the thread.

2.7.4 Number of Distance Calculations

The number of distance calculations performed by k-means for several datasets is shown in Figs. 2.7 and 2.8 (for clustered and uniform synthetic datasets).

Fig. 2.5 Speedup for each algorithm as a function of the number of threads. Speedup for t threads is defined as time(single-thread)/time(t threads)

Fig. 2.6 Parallel efficiency for each algorithm as a function of the number of threads. Efficiency for t threads is defined as speedup(t threads)/t. Perfect efficiency is 1.0, and higher is better

While the datasets for the two figures are comparable in terms of the number of points, dimensions, and cluster centers used, they differ in structure (clustered versus uniform). For clustered data, all accelerated algorithms appear to compute dramatically fewer distances than the naive algorithm. However, there is a stark difference in the uniform datasets. Those accelerated algorithms that use some kind of distance bounds (Elkan, Hamerly, Annular, Drake, and Heap) all do much better than those algorithms which do not (Compare-means and Sort-means), when the dimension is 8 or higher. Thus, the distance bounds seem to be a key part of reducing distance computations in k-means.

As point-center distance calculations are especially expensive in kernelized k-means algorithms, we tested the effectiveness of Elkan's algorithm in this setting. Table 2.5 shows the number of distances calculated by both naive kernel k-means and Elkan's version on a small dataset. It is clear that Elkan's algorithm saves a dramatic number of distance calculations even in kernel spaces.


Fig. 2.7 Number of distance calculations performed for synthetic clustered data in 2 and 32 dimensions, over k = 2, 8, 32, and 128


Fig. 2.8 Number of distance calculations performed for synthetic uniform data in 2 and 32 dimensions, over k = 2, 8, 32, and 128


Table 2.5 The number of distances computed by unaccelerated kernel k-means and Elkan's kernel k-means

                     Number of distance calculations
k     Iterations     Naive kernel k-means     Elkan kernel k-means
2     24             96,467                   26,766
8     39             624,750                  155,130
32    43             2,752,810                708,491
128   14             3,584,506                628,437

We used a small synthetic dataset: a uniform random distribution with 2,000 points and 8 dimensions. We use a Gaussian kernel with bandwidth σ = 10,000. The number of iterations ranged from 14 to 43.

Fig. 2.9 The amount of memory used (in megabytes) for the BIRCH dataset, over k = 2, 8, 32, and 128

2.7.5 Memory Use

Figures 2.9 and 2.10 show the amount of memory used by k-means for the small BIRCH dataset (n = 100,000 and d = 2) and a large synthetic uniform dataset (n = 1,000,000 and d = 32). As opposed to the amount of time used, the amount of memory used is a function of just the number of points, dimension, and number of clusters.


Fig. 2.10 The amount of memory used (in megabytes) for synthetic uniform data in 32 dimensions, over k = 2, 8, 32, and 128

It is clear that when k is small, the algorithms all use about the same amount of memory. When k is large, however, the large number of bounds used by Elkan's and Drake's algorithms begins to set them apart as using a lot more memory. It is nice that for a small relative increase in memory footprint, many of these algorithms afford significant speedups.

2.8 Conclusion

This chapter presents a number of alternatives to Lloyd's very popular and widely used batch k-means algorithm. All those presented aim to provide exactly the same answer as Lloyd's (given the same initialization), but faster. Some algorithms are from the literature of the last decade (Compare-means, Sort-means, and Elkan's, Hamerly's, and Drake's algorithms), and some are new (Annular, Heap).

The algorithms studied here rely on the geometric triangle inequality to avoid unnecessary and costly distance calculations. This proves to be simple to implement and can provide dramatic speedups of up to 40x in our tests. There are multiple ways to apply the triangle inequality to speed up k-means. Practically, using the triangle inequality to inexpensively maintain a set of distance bounds between points and centers is the idea with the greatest benefit.


References

1. Agarwal PK, Har-Peled S, Varadarajan KR (2005) Geometric approximation via coresets. Comb Comput Geom 52:1–30
2. Apache Mahout http://mahout.apache.org/. Version 0.8, Accessed 24 Jan 2014
3. Arthur D, Vassilvitskii S (2007) kmeans++: the advantages of careful seeding. In: ACM-SIAM symposium on discrete algorithms, pp 1027–1035
4. Arthur D, Manthey B, Röglin H (2011) Smoothed analysis of the k-means method. J ACM 58(5):19
5. Bei C-D, Gray RM (1985) An improvement of the minimum distortion encoding algorithm for vector quantization. IEEE Trans Commun 33(10):1121–1133
6. Bottou L, Bengio Y (1995) Convergence properties of the k-means algorithms. In: Advances in neural information processing systems, vol 7. MIT Press, Cambridge, pp 585–592
7. Celebi ME (2011) Improving the performance of k-means for color quantization. Image Vis Comput 29(4):260–271
8. Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210
9. Coates A, Ng AY, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. In: International conference on artificial intelligence and statistics, pp 215–223
10. Dhillon I, Guan Y, Kulis B (2005) A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report TR-04-25, University of Texas at Austin
11. Drake J, Hamerly G (2012) Accelerated k-means with adaptive distance bounds. In: 5th NIPS workshop on optimization for machine learning
12. Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the twentieth international conference on machine learning (ICML), pp 147–153
13. Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. In: Biometric society meeting, Riverside
14. Fu K-S, Mui JK (1981) A survey on image segmentation. Pattern Recognit 13(1):3–16
15. Hamerly G (2010) Making k-means even faster. In: SIAM international conference on data mining
16. Hamerly G, Elkan C (2002) Alternatives to the k-means algorithm that find better clusterings. In: Proceedings of the eleventh international conference on information and knowledge management. ACM, New York, pp 600–607
17. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C Appl Stat 28(1):100–108
18. Hochbaum DS, Shmoys DB (1985) A best possible heuristic for the k-center problem. Math Oper Res 10(2):180–184
19. Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24:881–892
20. Kaukoranta T, Franti P, Nevalainen O (2000) A fast exact GLA based on code vector activity detection. IEEE Trans Image Process 9(8):1337–1342
21. Lai JZC, Liaw Y-C (2008) Improvement of the k-means clustering filtering algorithm. Pattern Recognit 41(12):3677–3681
22. Lai JZC, Liaw Y-C, Liu J (2008) A fast VQ codebook generation algorithm using codeword displacement. Pattern Recognit 41(1):315–319
23. Linde Y, Buzo A, Gray R (1980) An algorithm for vector quantizer design. IEEE Trans Commun 28(1):84–95
24. Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28:129–137
25. Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2010) GraphLab: a new parallel framework for machine learning. In: Conference on uncertainty in artificial intelligence (UAI)
26. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: 5th Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, Berkeley, pp 281–297
27. Moore AW (1991) An introductory tutorial on kd-trees. Technical Report 209, Carnegie Mellon University
28. Moore AW (2000) The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Twelfth conference on uncertainty in artificial intelligence. AAAI Press, Stanford, CA, pp 397–405
29. Ng AY, Jordan MI, Weiss Y et al (2002) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 2:849–856
30. Pan J-S, Lu Z-M, Sun S-H (2003) An efficient encoding algorithm for vector quantization based on subvector technique. IEEE Trans Image Process 12(3):265–270
31. Pelleg D, Moore A (1999) Accelerating exact k-means algorithms with geometric reasoning. In: ACM SIGKDD fifth international conference on knowledge discovery and data mining, pp 277–281
32. Phillips SJ (2002) Acceleration of k-means and related clustering algorithms. In: Mount D, Stein C (eds) Algorithm engineering and experiments. Lecture notes in computer science, vol 2409. Springer, Berlin, Heidelberg, pp 61–62
33. Ra S-W, Kim JK (1993) A fast mean-distance-ordered partial codebook search algorithm for image vector quantization. IEEE Trans Circuits Syst II 40(9):576–579
34. Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
35. Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World Wide Web. ACM, New York, pp 1177–1178
36. Sherwood T, Perelman E, Hamerly G, Calder B (2002) Automatically characterizing large scale program behavior. SIGOPS Oper Syst Rev 36(5):45–57
37. Tai S-C, Lai CC, Lin Y-C (1996) Two fast nearest neighbor searching algorithms for image vector quantization. IEEE Trans Commun 44(12):1623–1628
38. Vattani A (2011) k-means requires exponentially many iterations even in the plane. Discrete Comput Geom 45(4):596–616
39. Wettschereck D, Dietterich T (1991) Improving the performance of radial basis function networks by learning center locations. Adv Neural Inf Process Syst 4:1133–1140
40. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng AFM, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
41. Yael https://gforge.inria.fr/projects/yael/. Version v1845, Accessed 24 Jan 2014
