A New Real-Time FPGA-Based Implementation of K-Means ...€¦ · an FPGA), or by using a faster algorithm. In this paper, we propose a new algorithm for implementing K-means clustering

A New Real-Time FPGA-Based Implementation of K-MeansClustering for Images

Deng, T., Crookes, D., Siddiqui, F., & Woods, R. (2018). A New Real-Time FPGA-Based Implementation of K-Means Clustering for Images. In Z. Yang, D. Yang, K. Li, M. Fei, & D. Du (Eds.), Intelligent Computing andInternet of Things: First International Conference on Intelligent Manufacturing and Internet of Things and 5thInternational Conference on Computing for Sustainable Energy and Environment, Chongqing, China, September21-23, 2018 (Vol. 924, pp. 468-477). (Communications in Computer and Information Science; Vol. 924).Springer-Verlag. https://doi.org/10.1007/978-981-13-2384-3_44Published in:Intelligent Computing and Internet of Things

Document Version:Peer reviewed version

Queen's University Belfast - Research Portal:Link to publication record in Queen's University Belfast Research Portal

Publisher rightsCopyright 2018 Springer. This work is made available online in accordance with the publisher’s policies. Please refer to any applicable termsof use of the publisher.

General rightsCopyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and / or othercopyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associatedwith these rights.

Take down policyThe Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made toensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in theResearch Portal that you believe breaches copyright or violates any law, please contact [email protected].

Download date:14. May. 2020

https://doi.org/10.1007/978-981-13-2384-3_44

https://pure.qub.ac.uk/en/publications/a-new-realtime-fpgabased-implementation-of-kmeans-clustering-for-images(c0a9b9d3-70a8-43ba-8bd0-a72b4e9d9e88).html

A New Real-time FPGA-based

Implementation of K-means

Clustering for Images Tiantai Deng1 , Danny Crookes1, Fahad Siddiqui1 and Roger Woods1,

1 Queen’s University Belfast,

University Road, Belfast, UK

{tdeng01, d.crookes, f.siddiqui,r.woods }@qub.ac.uk

Abstract. As an unsupervised machine learning algorithm, K-means clustering for

images has been widely used in image segmentation. The standard Lloyd’s algorithm

iteratively allocates all image pixels to clusters until convergence. The processing

requirement can be a problem for high-resolution images and/or real-time systems. In this

paper, we present a new histogram-based algorithm for K-means clustering, and its

FPGA implementation. Once the histogram has been constructed, the algorithm is O(GL)

for each iteration, where GL is the number of grey levels.

On a Xilinx ZedBoard, our algorithm achieves 140 FPS (640×480 images, running at

150MHz, 4 clusters, 25 iterations), including final image reconstruction. At 100MHz it

achieves 95 FPS. It is 7.6 times faster than the standard Lloyd’s algorithm, but uses only

approximately half of the resources, while giving the same results. The more iterations,

the bigger the speed-up. For 50 iterations, our algorithm is 10.2 times faster than the

Lloyd’s approach. Thus for all cases our algorithm achieves real time performance

whereas Lloyd’s struggles to do so. The number of clusters (up to a user-defined limit)

and the initialization method (one of three) can be selected at runtime.

Keywords: Unsupervised Machine Learning, Data Processing, K-means

Clustering, FPGA Acceleration

1 Introduction

Clustering problems arise in many applications [1]. K-means clustering is an

important pre-processing step in applications such as medical image segmentation [2],

pattern recognition, machine learning, and bio-informatics [3-4]. K-means clustering

is the process of grouping elements of a data set into several subsets, based on

minimizing the total Euclidean distance between cluster elements and the cluster

1 Please note that the LNCS Editorial assumes that all authors have used the western

naming convention, with given names preceding surnames. This determines the

structure of the names in the running heads and the author index.

means. It is an unsupervised computational clustering algorithm to distinguish

different objects or regions in the image. The standard algorithm for K-means

clustering is Lloyd’s algorithm, which iteratively adjusts the clusters until

convergence (see section III). The common implementation platform is a CPU.

Lloyd’s algorithm can be efficiently coded.

However, when real-time processing is required, for videos or for images with very

high resolution, achieving real time performance of the standard Lloyd’s algorithm

becomes a problem, because of the iterative processing of the complete image [5].

Thus, an alternative way of implementing K-Means clustering is needed for real-time

processing. The standard Lloyd’s algorithm is naturally parallel [6]. We can improve

the performance either by using different hardware (e.g. multiple CPUs, or a GPU, or

an FPGA), or by using a faster algorithm. In this paper, we propose a new algorithm

for implementing K-means clustering which achieves significant speed-up both by

algorithm innovation and by hardware implementation on an FPGA. The FPGA

architecture can achieve real time performance for video processing or for large

image resolutions. The algorithm is considerably faster than the standard Lloyd’s

algorithm. Indeed, the algorithm gives significant speed-up even on a CPU. The

main contributions of the paper are as follows:

1) A new histogram-based implementation of K-means clustering. Our method

gives the same result as the standard Lloyd’s algorithm for image data, but is much

more efficient.

2) An FPGA implementation on the Xilinx ZedBoard. Our method is 7.6 times

faster but uses 50% of the resources compared to the Lloyd’s FPGA implementation.

3) A comparison of two different FPGA implementations: our new Histogram-

based K-means method, and the standard Lloyd’s K-means method, plus a

performance comparison with three FPGA implementations from the literature.

2 Related Work

A. Standard Lloyd’s Algorithm for K-means Clustering

The K-means problem is to find K clusters such that the means μc of each cluster

Clc satisfy equation (1).

2

1

minargx cc

K

x c

c p ClCl

p

(0 ≤ c < K) (1)

The standard Lloyd’s algorithm is an iterative algorithm to solve the K-

means problem. The data set needs to be processed several times until the minimum

distance over all clusters is achieved. There are two steps in Lloyd’s algorithm: The

first is to assign the elements of the data set to the cluster with the closest mean. The

second is to update the cluster means. These steps are defined by equations (2) and

(3), for iteration t. The typical pseudocode is shown in Figure 1.

The Assignment Step:

2 2

: ,0t t t

c x x c x jCl p p p j j K (2)

The Update Step:

1 1c

ti c

t

c Cltp Clc

pCl

(3)

Since the algorithm needs to access the data set to calculate the means and

distances at each iteration, it is necessary to keep a copy of the original data set in

memory throughout the FPGA implementation.

Figure 1. Pseudocode for the standard Lloyd’s Algorithm

B. FPG Implementation of K-means Clustering

A number of researchers have already studied the FPGA implementation of K-means

clustering. In [7], Lavenier feeds the image into the distance calculation units, and

indexes every pixel in another array. Due to limited resources on the FPGA, the

image is stored in off-chip memory, and must then be streamed multiple times. This

entails extra communication overheads between the FPGA and off-chip memory.

Michael Estlick et al. implemented K-Means in hardware using software/hardware co-

design. The distance calculation is computed in hardware and the rest of the

algorithm is completed on a CPU to avoid using a huge amount of hardware resources

[8]. And in [4], Yuk-Ming and Choi etc. use three FPGAs which work together to

deal with the large amount of data because of the limited resources on each FPGA. In

[9], Benkrid implemented a highly parameterized parallel system, and his

implementation is not dedicated to image clustering but shows the potential of using

FPGAs to accelerate the clustering process. Takashi, Tsutomu et al. implemented K-

means on a Xilinx FPGA, using KD-Tree, but they are limited by the resources on the

FPGA, and they have to store the image in an off-chip SRAM. They report a

performance of 30FPS on an RGB 24-bit image [10-11]. In [11], the approach

1: Function Lloyd (inputImage, K)

2: for all c 0..K-1 do mean[c] = Initial value 3: do 4: totalDistance = 0

5: for all x,y domain(inputImage) do 6: p = inputImage[x,y]

7: find c 0..K-1 such that abs(p-mean[c]) is minimum 8: totalDistance += (p-mean[c])2 9: sum[c] += p 10: size[c] += 1

11: for all c 0..K-1 do 12: if (size[c] ≠ 0) mean[c] = sum[c] / size[c] 13: while (totalDistance has not converged) 14: return result, means 15: end Figure 1. Pseudocode for the standard Lloyd’s algorithm.

replaces the Euclidean distance into the multiply-less Manhattan distance and reduces

the word length to save resources. The approaches above focus on achieving greater

efficiency by adjusting the word-length, simplifying the distance calculation or using

a heterogeneous platform.

Others try to reduce the amount of data. In [1], a filter algorithm is applied first

before K-means to reduce the amount of data. Tsutomu reduces the scanning data

size by doing partial K-means, whereby the algorithm is applied on 1/8, 1/4, and 1/2

sampled data. This simplifies the K-means algorithm to get better performance but

also introduces inaccuracy into the result [12].

Another way to speed up Lloyd’s algorithm is to reduce the number of iterations.

The performance of Lloyd’s is very dependent on the starting point. In [13], Khan and

Ahmad used a small amount of data to execute K-means to get a better estimate of the

initial mean for each cluster. Pena, Lozano, and Larranaga compared four classical

initialization methods for the K-means algorithm: Random, Forgy, MacQueen, and

Kaufman. The author found that the Random and Kaufman initialization make the

algorithm more effective and independent of the initial value which means it is more

robust and effective [14].

The Soft-processor approach is another way of using FPGAs. This retains the high

level programming approach, and hides much of the complexity of hardware from

developers. In [15], a multi-processor architecture is proposed for K-means based on

a network on chip, and using a shared memory to support communication between

processors.

To summarize the above approaches, there are three existing ways to speed up the

process of K-means clustering using an FPGA. One is to focus on the system

architecture, parallelizing the system and finding more effective ways to allocate

memory, such as in [7] [9] and [10]. The second way is to reduce the amount of data,

such as in [1] and [12]. The third way is to reduce the number of iterations by finding

better initial means for each cluster, as in [13] and [14].

We now present our novel approach to implementing K-means clustering, and then

present its implementation on a single chip of FPGA including camera and display

control.

3 Histogram-based K-means Clustering

We first note that the K-means problem for image processing is more constrained

than K-means for other kinds of data. The range of the input data is discrete, e.g. for

an 8-bit grayscale image, the grey levels are from 0 to 255.

Secondly, the clustering of pixels does not need to take the location of pixels into

account. Therefore we can represent the input image by the image histogram (for the

purposes of clustering). The image histogram (which does not change throughout the

iterative clustering process) is effectively a compressed version of the image. For any

grey level value g, all pixels with that value can be clustered at the same time, with a

single calculation.

Thirdly, we can represent each cluster c merely as a sub-range of the histogram,

from column Bc to column B(c+1)-1 inclusive, where Bi is the lower bound of cluster i.

Fourthly, given the means of each cluster c, the new bounds, which define the new

clusters, will be simply half way between the respective means. Figure 2 shows a

typical histogram and the representation of four clusters.

For an Nx×Ny image, and doing L iterations, the complexity of the standard Lloyd’s

algorithm is Nx×Ny×L. The complexity of our new histogram-based algorithm will be

Nx×Ny+256×L (assuming 256 grey levels), where the Nx×Ny component is for

calculating the histogram. For example, for a 512×512 image with L = 20, the

standard method will require around 5.2M steps, whereas our histogram-based

method will require around 0.267M steps – an improvement, in theory, of the order of

20x. In practice, there will be the additional overhead of image input and perhaps

reconstruction of the output image and the number of iterations will be image-

dependent. These factors will typically have the effect of reducing the apparent

speed-up.

Cluster1 Cluster2 Cluster3 Cluster4 Figure 2. An image histogram and representation of four clusters

For an 8-bit image, the histogram is a one-dimensional vector with 256 elements,

where each element in the histogram represents the number of pixels having that grey

level. After generating the histogram, bounds are needed to divide the histogram into

different clusters. Each cluster is defined by a pair of bounds, which define the part of

the histogram corresponding to the cluster. If we have K clusters, there will be K+1

bounds, where B0 = 0, and BK = 256. Cluster c (0≤c<K) is the part of the histogram

from column Bc to B(c+1)-1 inclusive. The inner bounds (0 < c < K) can be initialized

in different ways (see later). Then, the means for the current clusters are calculated as

defined by equation (4).

1

1

1

1

( )

(0 )

( )

C

C

C

C

B

i

i B

c B

i

i B

H i

C K

H

(4)

Once the updated means are calculated, the next iteration first assigns the clusters

by merely updating the bounds. To assign a pixel to the correct new cluster, we have

to see which mean it is closest to. Consider a pixel with value g (say, 150) in an

image with a histogram such as in Figure 2, and the means are [0.0, 27.5, 90.5, 165.0,

220.0, 256.0]. Column g will be part of the cluster whose mean is closest to g. We

can achieve this simply by moving the bounds so that each bound is half way between

the two respective means. The updating of the clusters is defined by equation (5).

1( )(0 )

2

C CCB C K

(5)

The iteration can then be completed by updating the means using equation (4).

We use integers to hold the bounds, so that they correspond to actual pixel grey

levels. In our FPGA implementation we use simple fixed point to represent the

means.

Termination of the iterative algorithm could be brought about under several

conditions. The total distance can be readily calculated as in the standard Lloyd’s

algorithm. However, in our preferred approach, the termination condition is that none

of the bounds are changed by the updating (equation (5)). This has the advantage that

the total (squared) distance does not need to be calculated. A third option is to set a

fixed number of iterations (say, 50), which can be sufficient for many applications in

practice. This number may be application-specific, and can be set at runtime.

There are several ways to initialize the K-means algorithm such as Random, or

MacQueen and Kaufman, as mentioned in [14]. Actually, in our histogram method,

we initialize the cluster bounds rather than the means. We provide two particular

ways to do this. The first is simply to divide the grey level range into K equal parts.

The second way is to divide all the pixels into K equally sized clusters. The latter can

be achieved by using the cumulative histogram and setting the bounds to the columns

which have the appropriate height.

Dividing the histogram range into K equal parts takes the least resources in a

hardware implementation. However, in an image with an unevenly distributed

histogram, some clusters may be empty (for instance, if the image is overexposed or

too dark). The danger condition of calculating the mean of a zero-element cluster can

be overcome with a suitable condition; however, if we have two adjacent zero-

element clusters, the bounds would not move and those clusters would not grow.

Thus, the safest way to initialize the clusters is to set each cluster initially to have the

same number of pixels (approximately), using the cumulative histogram. We can

build the cumulative histogram from the histogram during initialization. A refinement

which is easy to implement is for the user to supply an initial, application-specific

estimate of the relative sizes of each cluster as a parameter.

4 System Implementation

We implemented the histogram-based K-means Clustering algorithm on a Xilinx

FPGA, using Vivado HLS on a ZedBoard and running at either 150 MHz or 100

MHz. And the architecture is shown in the figure 3 below.

The architecture of the FPGA implementation is simple. The system is made up of

two parts, the I/O part and the processing part. The whole system is controlled by the

AXI-Lite bus. The camera repeatedly captures a frame and stores it in the frame

buffer. Then the data is streamed to the input FIFO and fed into the K-means

hardware. After the image is processed, the bounds and means are sent to the image

reconstruction hardware and the output frame buffer. We use a physical VGA

interface for display. The architecture is shown in figure 4. The K-means hardware

and the image reconstruction hardware are programmed in HLS. The input and output

images are transferred using the AXI-Stream interface. The means and bounds of each

cluster are held in a BRAM.

For comparison purposes, we run both the Lloyd’s and our K-means at both

100MHz and 150MHz. The ARM output 100/150MHz clock is divided into 50MHz

and 25MHz. 50MHz is used in the camera and display, and 25MHz is used in the data

transfer through the FIFOs and frame buffers. The FIFOs are used to help synchronise

data across the clock domain between the frame buffer and our K-means hardware.

Figure 3. System Architecture

5 Results and Comparison

We compare the performance of two designs: the standard Lloyd’s algorithm and

our new histogram-based algorithm. For the hardware versions, we consider several

factors: hardware usage, the clock cycles required to process an image (4 clusters),

and the actual number of frames per second (FPS).

Table 1. Hardware Utilization (Including Reconstruction of Output Image)

100MHz LUTs FFs DSPs BRAMs

Histogram 1194 749 1 40.5

Lloyd's 2641 1535 3 38.5

150MHz LUTs FFs DSPs BRAMs

Histogram 1807 2770 4 40.5

Lloyd's 3793 4350 3 38.5

The utilization of the different designs is shown in Table I. It can be seen that the

histogram-based algorithm takes about half the LUT (Look up table) and FF (Flip-

Flop) resources.

Since even at 100MHz our histogram-based architecture is able to achieve real time

performance, the clustering itself does not demand using the fastest clock speed; and

since the 100MHz version uses rather less hardware resources, its use would therefore

normally be preferred.

Table 2. Comparison with other literatures

We also compare the performance of our algorithm to the implementations in [13-

15]. The work in [14] takes a 24-bit RGB image of size 640×480 and has a

performance of 30FPS; the pixels in RGB channel are clustered in parallel. And the

work in [15] and [13] is using 640×480 grayscale image. [15] uses the Manhattan

distance to replace Euclidean distance, and [13] reduces the data set by sampling the

image to get better performance. Those two approaches are based on the standard

Lloyd’s algorithm but cannot guarantee full accuracy, and have less performance than

our new algorithm. The superiority of our new method can be seen.

Hence, the new histogram-based algorithm is the best choice when multi-clusters is

needed. It has more advantages over the existing algorithms, it takes fewer resources,

better performance than full K-means, and it is easy to adapt the stream

implementation.

5 Conclusion and Future Work

This paper has presented a new K-means clustering algorithm, based on iteratively

processing the image histogram rather than the image. The main contributions are as

follows:

1. A new histogram-based K-means clustering algorithm which transforms the 2-

D image into a 1-D histogram. The histogram retains all the details which are

needed to perform K-means clustering. The method is equivalent to Lloyd’s,

and gives the same results. Even implementation on a CPU can achieve

19FPS performance, a speed-up of 9.1 over Lloyd’s.

2. An FPGA implementation of the new histogram-based K-means clustering

algorithm. On an FPGA, the histogram-based takes approximately half the

resources compared to Lloyd’s, and a speed-up of approximately 7.6. Even at

100MHz and with 50 iterations, we achieve 82 FPS for 640x480 images.

100MHz, L=25 FPS. Iterations

Histogram(HW) 95 25

Lloyd’s(HW) 12.5 25

[13] 30 --

[14] 38.6 --

[12] 45.8 --

Future work will look at providing greater flexibility in the FPGA architecture,

including a wider range of cluster initialization strategies, and some possible further

optimizations.

6 Acknowledgement

This work was supported by the Chinese Scholarship Council.

References

[1] Kanungo T, Mount D M, Netanyahu N S, et al. An efficient k-means clustering algorithm: Analysis and implementation. IEEE transactions on pattern analysis and machine intelligence,

2002, 24(7): 881-892.

[2] Ng H P, Ong S H, Foong K W C, et al. Medical image segmentation using k-means clustering and improved watershed algorithm. Proc. Image Analysis and Interpretation, 2006 IEEE Southwest

Symposium on. IEEE, 2006: 61-65. [3] Duda R O, Hart P E, Stork D G. Pattern classification and scene analysis 2nd ed. ed: Wiley

Interscience, 1995

[4] Choi Y M, So H K H. Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster. Proc. Application-specific Systems, Architectures and Processors (ASAP), 2014

IEEE 25th International Conference on. IEEE, 2014: 9-16. [5] Neshatpour K, Koohi A, Farahmand F, et al. Big biomedical image processing hardware

acceleration: A case study for K-means and image filtering[C]//Circuits and Systems (ISCAS), 2016

IEEE International Symposium on. IEEE, 2016: 1134-1137.K Farivar R, Rebolledo D, Chan E, et al. A Parallel Implementation of K-Means Clustering on GPUs[C]//Pdpta. 2008, 13(2): 212-312.

[6] 10Lavenier D. FPGA implementation of the k-means clustering algorithm for hyperspectral images. Proc. Los Alamos National Laboratory LAUR. 2000.

[7] Estlick M, Leeser M, Theiler J, et al. Algorithmic transformations in the implementation of K-means

clustering on reconfigurable hardware. Proc. Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays. ACM, 2001: 103-110.

[8] Hussain H M, Benkrid K, Erdogan A T, et al. Highly parameterized K-means clustering on FPGAs: Comparative results with GPPs and GPUs. Proc. Reconfigurable Computing and FPGAs

(ReConFig), 2011 International Conference on. IEEE, 2011: 475-480.

[9] Saegusa T, Maruyama T, Yamaguchi Y. How fast is an FPGA in image processing? Proc. Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on. IEEE, 2008:

77-82. [10] Saegusa T, Maruyama T. An FPGA implementation of k-means clustering for color images based on

Kd-tree. Proc. Field Programmable Logic and Applications, 2006. FPL'06. International Conference

on. IEEE, 2006: 1-6. [11] Maruyama T. Real-time k-means clustering for color images on reconfigurable hardware. Proc.

Pattern Recognition, 2006. ICPR 2006. 18th International Conference on. IEEE, 2006, 2: 816-819.

[12] Khan S S, Ahmad A. Cluster center initialization algorithm for K-means clustering. Pattern

recognition letters, 2004, 25(11): 1293-1302.

[13] Pena J M, Lozano J A, Larranaga P. An empirical comparison of four initialization methods for the k-means algorithm. Pattern recognition letters, 1999, 20(10): 1027-1040.

[14] Khawaja S G, Akram M U, Khan S A, et al. A novel multiprocessor architecture for k-means clustering algorithm based on network-on-chip. Proc. Multi-Topic Conference (INMIC), 2016 19th

International. IEEE, 2016: 1-5.

Documents

A New Real-Time FPGA-Based Implementation of K-Means ...€¦ · an FPGA), or by using a faster algorithm. In this paper, we propose a new algorithm for implementing K-means clustering